Apache NiFi Update

As 2015 comes to a close, I thought it would make sense to write a closing post for this blog.

Although I envisioned it as a place where I would post technical documentation on various topics, I only ever wrote about Apache NiFi. I’ve since moved on to other types of work and no longer work enough with NiFi to provide up-to-date information.

When I wrote the blog posts here, NiFi was still in the Apache Software Foundation’s Incubator program.

Now, several months later, Apache NiFi is a full-fledged project of the Apache Software Foundation. In addition, it has progressed to version 0.4.0, and many things have changed since I first wrote about it and created my video tutorials.

As such, I’d like to point readers to nifi.apache.org for all the best information on NiFi. There, you can not only find the latest release, but also lots of up-to-date documentation, blog posts, and a wiki site with sample flow templates, FAQs, an Upgrade Guide, and many other helpful resources.

Happy NiFi-ing!


Apache NiFi: Video Tutorials

The best way to learn how to use Apache NiFi (incubating) is to see it in action. I’ve created the following video tutorials to highlight some key concepts in the NiFi User Interface:

You can also watch all these videos in order here:

If you have any feedback or want to see a video on a particular topic in NiFi, please mention it in the comments below.

NiFi FlowFile Attributes – An Analogy

One of the most important things to understand in Apache NiFi (incubating) is the concept of FlowFile attributes. You may already have a general understanding of what attributes are, or know them by the term “metadata”, that is, data about data. There is also a good description in the Wikipedia article on metadata. However, since this blog is all about keeping things simple… I like to use the simple analogy of a letter in an envelope when describing NiFi FlowFile attributes.

Envelope

Let’s say that all the data objects being processed in your NiFi dataflow are like letters you have received. In this analogy, these objects, or FlowFiles, are each made up of the content (the letter inside the envelope) and the attributes (the details written on the outside of the envelope). So, for example, you have the sender’s name, the sender’s address, the recipient’s name, and the recipient’s address. You could think of these as attributes. If the letter has been mailed, you might even have details like the date it was mailed, the post office that processed it, etc. All of these things are written on the outside of the envelope.

So, without even making the effort of opening the letter, NiFi knows a lot about it and can make decisions about it. For example, we could tell NiFi that if it’s from John Smith, route it so that it gets filed in a directory called “Love Letters”; whereas, if it’s from Smitty John, delete it.

(For more on routing FlowFiles based on attributes, see the usage documentation for the RouteOnAttribute processor.)

The fact that NiFi can just inspect the attributes (keeping only the attributes in memory) and perform actions without even looking at the content means that NiFi dataflows can be very fast and efficient.

Within the dataflow, the user can also add or change the attributes on a FlowFile to make it possible to perform other actions. For example, let’s say all of John Smith’s letters include both a letter and a newspaper clipping, whereas Smitty John’s letters do not. The user could add an attribute to “flag” each of John Smith’s letters, indicating that somewhere later in the dataflow, the newspaper clipping should be separated from the letter content. In this case, NiFi would then delve into the contents inside the envelope, but only for letters that require it. (Doing something with the content of the FlowFile is often more resource intensive than just inspecting the attributes. So, by flagging only those letters that need such an action, the dataflow will be more efficient overall.)

(For more on manipulating attributes, see the NiFi Expression Language Guide and the usage documentation for the UpdateAttribute processor.)

Each attribute is made up of a key-value pair. The key is the name or type of attribute. And the value is the unique information assigned to that key. So, for the letter shown in the image above, we might have the following keys and values:

KEY               VALUE
senderName        John Smith
senderAddress     456 Street Cir., Elsewhere, ST 00000
recipientName     Jane Smith
recipientAddress  123 Road Ave., Anytown, ST 11111

If we wanted to add an attribute to flag certain letters so that NiFi would know to separate the newspaper clipping from the letter, we might have an attribute like the following:

KEY               VALUE
separateClipping  true
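To make the analogy concrete, here is a minimal Python sketch of the idea (plain Python, not the NiFi API; the dict-based FlowFile and the route_letter function are purely illustrative), modeling a FlowFile as content plus an attribute map and making a routing decision from the attributes alone:

```python
# Illustrative model of a FlowFile: content plus an attribute map.
# This is a conceptual sketch, not NiFi code.

def route_letter(flowfile):
    """Decide what to do with a letter by inspecting attributes only."""
    sender = flowfile["attributes"].get("senderName", "")
    if sender == "John Smith":
        return "Love Letters"   # file it in this directory
    if sender == "Smitty John":
        return "delete"
    return "unmatched"

letter = {
    "content": b"Dear Jane, ...",   # the letter inside the envelope
    "attributes": {
        "senderName": "John Smith",
        "senderAddress": "456 Street Cir., Elsewhere, ST 00000",
        "recipientName": "Jane Smith",
        "recipientAddress": "123 Road Ave., Anytown, ST 11111",
    },
}

print(route_letter(letter))  # -> Love Letters
```

Note that the routing decision never touches the content, which is exactly why attribute-based routing in NiFi is so cheap.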

Real Attributes

So, let’s leave our letter analogy now and look at how attributes actually appear in NiFi. Below is a bulletin from a LogAttribute processor. This processor is mainly used for testing and it logs the list of attributes of any FlowFiles it processes. This bulletin pertains to a FlowFile that was produced by the GenerateFlowFile processor. (It’s important to note which processor the FlowFile has come from, because different types of processors can add/change different attributes.)

LogAttribute Bulletin

In the image above, we can see each key and value listed. For example, the first one is Key: ‘entryDate’ and Value: ‘Tue Jan 20 10:21:48 EST 2015’. You can think of all the key-value pairs that are currently on a FlowFile as the FlowFile’s “Attribute Map”. The LogAttribute processor can help you determine whether the Attribute Map is currently what you expect, so you know how to build and adjust your dataflow from that point forward.

Having this understanding about FlowFile attributes and how they can be used is an important part of building dynamic, efficient, and powerful dataflows in NiFi.

Simple Tasks in NiFi – File Objects by Date

When you copy files to a local directory in Apache NiFi (incubating), you can auto-generate directories according to the current date. This is a super simple but handy thing you can do by using the NiFi Expression Language.

Note: If you are not familiar with the NiFi EL, check out the Expression Language Guide by clicking “help” in the upper-right corner of the User Interface and looking for it in the left-hand column.

Why?

First of all, why would you want to do this? Well, let’s say that you process a lot of data on any given day, and you want to keep a copy of it. But you’d also like to be able to go back and find your data objects according to the date on which you processed them.

One use case I know of, for example, is when system administrators want to save off copies of the NiFi logs for later reference. The idea is to use a GetFile processor to pick up a copy of the log files and then use a PutFile processor to copy them to another location according to their date. This way, your logs directory is not constantly filling up, and you still have an easy-to-reference copy of all your logs. But there are probably many other cases where this super simple method of filing objects can be used.

How?

It’s as simple as this: configure the Directory property in your PutFile processor with an Expression that uses the date format you want. The following example files objects with an auto-generated path of year (yyyy), month (MMM), day (dd), using a NiFi Expression of:

${now():format('yyyy/MMM/dd')}

Here is how it looks in the PutFile processor’s configuration of the Directory property; note that the full directory path includes the expression shown above:

PutFile Configuration

This example copies files into the Output directory and creates new sub-directories according to the current time, as converted to a date format of yyyy/MMM/dd. So, on the date that this post is being written, it creates a directory structure of /Output/2015/Jan/18. You can use a different date format if you like, but it needs to conform to Java SimpleDateFormat.
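For readers more familiar with Python than Java date patterns, here is a rough equivalent of what the Expression produces (an illustrative sketch, not NiFi code; SimpleDateFormat’s yyyy/MMM/dd maps to strftime’s %Y/%b/%d, assuming an English locale for the abbreviated month name):

```python
# Rough Python equivalent of the NiFi Expression ${now():format('yyyy/MMM/dd')}.
# The %b month abbreviation matches MMM only in an English locale.
from datetime import datetime

def dated_subdirectory(base="/Output", now=None):
    """Build a /base/yyyy/MMM/dd-style path from the given (or current) time."""
    now = now or datetime.now()
    return f"{base}/{now.strftime('%Y/%b/%d')}"

print(dated_subdirectory(now=datetime(2015, 1, 18)))  # -> /Output/2015/Jan/18
```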

It goes without saying that the user NiFi is running as needs to have the proper permissions to create and write to the directory path configured. Have fun trying this out!

 

Simple Tasks in NiFi – Creating a Super Simple Cluster

Creating a Super Simple Cluster

My posts about Apache NiFi (incubating) to this point have focused on simple tasks that do not require a lot of technical knowledge about how NiFi works. And this post is really no different. However, I have to thank NiFi developer Matt for helping me set this up myself when I was working on another task. So, I say it’s simple, but I did need a little help myself. And that’s why I thought it would be good to document it for someone else…

First of all, it’s important to note that this simple cluster might only be useful for the purposes of (a) just figuring out how to create a NiFi cluster in general and (b) testing an easy cluster before you build a “real” one. A real production cluster would be set up securely (SSL) and would involve much more than what I discuss here.

This post makes some assumptions:

  • That you know what a NiFi cluster is
  • That you know how a NiFi cluster works

If you don’t already know those two things, you should check out the NiFi Clustering Guide when it is available.

Cluster Components

This super simple cluster will consist of three instances of NiFi:

  • The NiFi Cluster Manager (NCM)
  • Node 1
  • Node 2

For the purposes of this simple cluster, we will assume that you have installed the NCM and Node 1 on the same server. This is a typical setup, because the NCM is very lightweight and, therefore, can run on the same server as one of the nodes. (In fact, if you are building this cluster just for fun, you can even run all three instances on the same machine; however, please understand that doing so defeats the purpose of clustering, which is to harness the power of multiple servers so that you can process more data. Using one machine only makes sense if you are just learning and want to try it out for fun.)

NiFi Properties File

Once you have three instances of NiFi installed, the next step is to edit the NiFi properties file for each instance. In the main installation directory, there is a directory called conf and in that directory, there is a nifi.properties file. For each instance, we’ll open this file in a text editor and edit it in a particular way for that instance. Let’s start with the NCM:

In the nifi.properties file of the NCM, you’ll need to check out and/or change the following properties:

  • Under web properties, take note of the setting for nifi.web.http.port
    • Since this is going to be a non-secured cluster, we’ll be running NiFi on an HTTP port and not an HTTPS port. You can leave the NCM’s HTTP port set to 8080 (which is the default). But you will want to make sure that Node 1 has a different HTTP port, since both instances are running on the same machine and can’t share ports.
  •  Under the cluster manager properties, set the following:
    • nifi.cluster.is.manager=
      • This must be changed to true to distinguish this instance as an NCM.
    • nifi.cluster.manager.protocol.port=
      • Select an open port that is higher than 1024 (anything lower requires root). Also, take note of what you set this port to, because you will need to reference the cluster manager’s protocol port when you set up the nodes.
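Putting those NCM settings together, the relevant excerpt of the NCM’s nifi.properties might look like the following (the protocol port 9001 is just an example; pick any open port above 1024):

```properties
# NCM nifi.properties (excerpt) -- example values
nifi.web.http.port=8080
nifi.cluster.is.manager=true
nifi.cluster.manager.protocol.port=9001
```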

That’s it. Now, your NCM is set up. Let’s edit the nifi.properties file on Node 1:

  • Under web properties, take note of the setting for nifi.web.http.port
    • Make sure that Node 1 has a different HTTP port from the NCM, since they are running on the same machine and can’t share ports. For example, you could make it 8181. You may need to set other web properties, depending on the types of servers you are using. Be sure to review all properties in this section for your particular situation. There is a description of all the properties in the NiFi Admin Guide, which you can get to by clicking the “help” link in the upper-right corner of the NiFi window.
  •  Under the cluster node properties, set the following:
    • nifi.cluster.is.node=
      • This must be changed to true to distinguish this instance as a node.
    • nifi.cluster.node.address=
      • As noted in the comments below, if you leave this field blank, it will default to “localhost”. If that is not appropriate for your situation, such as if you are using a virtual machine (VM), then you should set this to the fully qualified hostname of the node’s machine.
    • nifi.cluster.node.protocol.port=
      • For Node 1, make sure this is a different port than the NCM’s protocol port. Select an open port that is higher than 1024.
    • nifi.cluster.node.unicast.manager.address=
      • This should be the fully qualified hostname of the NCM.
    • nifi.cluster.node.unicast.manager.protocol.port=
      • This should be the same port that was configured as the NCM’s nifi.cluster.manager.protocol.port.
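Putting the node settings together, Node 1’s nifi.properties might contain something like the following (the hostname ncm.example.com and the ports 9001/9002 are placeholders; substitute your own values):

```properties
# Node 1 nifi.properties (excerpt) -- example values, same machine as the NCM
nifi.web.http.port=8181
nifi.cluster.is.node=true
nifi.cluster.node.address=
nifi.cluster.node.protocol.port=9002
nifi.cluster.node.unicast.manager.address=ncm.example.com
nifi.cluster.node.unicast.manager.protocol.port=9001
```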

Now Node 1 is set up. Let’s edit the nifi.properties file on Node 2:

  • Under web properties, take note of the setting for nifi.web.http.port
    • Assuming that Node 2 is on a different machine from the NCM and Node 1, then this setting can be left as the default 8080. However, if you are building this cluster just for fun and have all three instances running on the same machine, make sure this is a different HTTP port than the one you have configured on the NCM and Node 1. You may need to set other web properties, depending on the types of servers you are using. Be sure to review all properties in this section for your particular situation. There is a description of all the properties in the NiFi Admin Guide, which you can get to by clicking the “help” link in the upper-right corner of the NiFi window.
  •  Under the cluster node properties, set the following:
    • nifi.cluster.is.node=
      • This must be changed to true to distinguish this instance as a node.
    • nifi.cluster.node.address=
      • As noted in the comments below, if you leave this field blank, it will default to “localhost”. If that is not appropriate for your situation, such as if you are using a virtual machine (VM), then you should set this to the fully qualified hostname of the node’s machine.
    • nifi.cluster.node.protocol.port=
      • Assuming that Node 2 is on a different machine, simply make sure this is a different port than the NCM’s protocol port. Select an open port that is higher than 1024. However, if you are building this cluster just for fun and you have all three instances on the same machine, you should also make sure this is a different port than the one used for Node 1.
    • nifi.cluster.node.unicast.manager.address=
      • This should be the fully qualified hostname of the NCM.
    • nifi.cluster.node.unicast.manager.protocol.port=
      • This should be the same port that was configured as the NCM’s nifi.cluster.manager.protocol.port.

That’s it! Now your cluster is set up and you can start each instance of NiFi. I recommend that you start your NCM first, followed by Node 1 and then Node 2. This will give you a cluster where Node 1 is the Primary Node. Navigate to the URL for your NCM and your cluster should look like this:

NiFi Cluster

Simple Tasks in NiFi – Creating a Limited Failure Loop

Creating a Limited Failure Loop in NiFi

In my previous posts, I provided an introduction to Apache NiFi (incubating), and I offered tips on how to do some simple things in the User Interface.

In this post, I focus on one of the frequently asked questions that NiFi users have had in the past. In a NiFi dataflow, it’s a common best practice to draw failure relationships so that they loop back to the feeding processor rather than terminating them. This way, you don’t lose data when it fails to process. And when it does fail, you can see that it has gone down the failure relationship and can troubleshoot the issue. However, this can also result in an infinite loop, and some people have asked whether it’s possible to configure your flow so that a failure happens only so many times and then the flow does something else with the failing FlowFile, such as sending it to an error directory. The following dataflow demonstrates one way to do this…

failure-loop2

This is a simple GetFile->CompressContent->PutFile flow. An issue that can occur in this flow is that the PutFile might fail if a file of the same name already exists in the target directory, if the disk is full, or if the NiFi application does not have the proper permissions to write files to the target directory. We’ve added some processors to this flow that will allow the PutFile to try to write the file three times; after the third failure, the file will be written to an error directory. Let’s look at how this flow works in more detail…

So, in this flow, if a failure occurs, rather than looping the failure back to the PutFile processor, we send it to an UpdateAttribute processor (named “Check # of Failures”). This processor is configured with rules to determine whether the file has failed up to 3 times. If it has not failed 3 times, the RouteOnAttribute processor to the left (named “Route on Third Failure”) sends it back (via the unmatched relationship) to the PutFile processor to try it again. Once it has failed 3 times, the “Route on Third Failure” processor sends the file (via the “failure3” relationship) to an UpdateAttribute processor to ensure it has a unique filename, and then to another PutFile processor to put it in an error directory. The processor configurations are discussed further below…

The “Check # of Failures” processor is an UpdateAttribute processor with three rules configured. The rules were added using the “Advanced” tab in the processor. The first rule is called “Check for failure”. It checks to see if there is an attribute called “failure”. If there isn’t, it adds one and gives it a value of “1”.

failure-1

The second rule in the “Check # of Failures” processor (called “Second failure”) checks to see if the “failure” attribute is equal to “1”. If so, it changes the value to “2”. This indicates that the file has failed two times.

failure-2

The third rule (called “Third failure”) checks to see if the failure attribute is equal to “2”. If so, it changes the value to “3”. This will trigger the “Route on Third Failure” processor to send the FlowFile to be renamed and saved in an error directory.

failure-3

The image below shows the configuration of the “Route on Third Failure” processor. Anything that does not match the “failure3” value is sent down the “unmatched” relationship. When a FlowFile does match, it is sent down the “failure3” relationship.

Router-Config

The last UpdateAttribute processor is configured to change the filename so that the FlowFile’s UUID is appended to the original filename. This ensures that every file has a unique filename when it goes to the final PutFile processor.

update-filename

This is a simple example of how you can add a few processors to your flow so that you can limit the number of times a given file fails and then send it down a new path in the flow. While this example uses a simple flow with a PutFile processor as the failure point, this concept could be duplicated in any flow where you want to set up a limited failure loop.
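The counting-and-routing logic described above can be sketched in plain Python (a conceptual simulation, not NiFi code; the “failure” attribute name and the three-strikes limit mirror the rules configured in the “Check # of Failures” and “Route on Third Failure” processors):

```python
# Conceptual simulation of the limited failure loop:
# each failed PutFile attempt increments a 'failure' attribute,
# and the router sends the file to the error path once it reaches 3.

def check_failures(attributes):
    """Mimics the 'Check # of Failures' UpdateAttribute rules."""
    count = int(attributes.get("failure", "0"))
    attributes["failure"] = str(count + 1)
    return attributes

def route_on_third_failure(attributes):
    """Mimics the 'Route on Third Failure' RouteOnAttribute processor."""
    return "failure3" if attributes.get("failure") == "3" else "unmatched"

attrs = {"filename": "data.txt"}
for attempt in range(3):            # simulate three failed PutFile attempts
    attrs = check_failures(attrs)
    relationship = route_on_third_failure(attrs)

print(relationship)  # -> failure3 after the third failure
```

On the first two passes the file is routed down “unmatched” (back to PutFile in the real flow); only on the third failure does it take the “failure3” path to the error directory.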

Simple Tasks in NiFi – Turning Flows into Groups

Turning Flows into Groups

As noted in the User Guide for Apache NiFi (incubating), it is possible to drag a Process Group from the Components portion of the toolbar and then build a flow inside it. But what if you already have a flow or part of a flow that you think would logically make a good Process Group?

You can select multiple items on the graph and add them to a new group. To do this:

  1. Hold down the Shift key and drag a selection box around the components that you’d like to add to the group. Note that connections will only be included if the components on both sides of the connections are also included.
  2. Click the “Group” button in the Actions section of the toolbar and then name the group.

Group Button

Process Group

It is also possible to select a group of components and drag them into an existing Process Group or copy/paste components into an existing Process Group. You may also drag or copy whole Process Groups into other Process Groups. These features make it extremely easy to organize your dataflows in the best way possible and even to re-organize them as your dataflows evolve over time.