Log Data Ingestion with Flume

Some of the data that ends up in HDFS lands there through database load operations or other batch processes, but what if we want to capture data flowing in high-throughput streams, such as application log data? Apache Flume is a widely used tool for doing that easily, efficiently, and reliably.

Apache Flume, a top-level project of the Apache Software Foundation, is a distributed system for aggregating and moving large amounts of streaming data from various sources to a centralized data store. In other words, Flume is designed for the continuous ingestion** of data into HDFS. The data can be of any kind, but Flume is particularly well suited to handling log data, such as the logs from web servers. The single unit of data that Flume processes is called an event; a log record is a good example of an event.

To understand how Flume works within a Hadoop cluster, we need to know that Flume runs as one or more agents, and that every agent consists of three pluggable components: sources, channels, and sinks:

·        Sources: They retrieve data and hand it over to channels.

·        Channels: They hold queues of data and serve as conduits between sources and sinks, which is very useful when the incoming flow rate exceeds the outgoing flow rate.

·        Sinks: They take data from channels and deliver it to a destination, such as HDFS.

An agent must have at least one of each component to run, and every agent runs within its own instance of the JVM (Java Virtual Machine). An agent can contain many sources, channels, and sinks, and although a source can write to several channels, a sink can take data from only one channel.
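The roles of the three components can be illustrated with a small toy model. The sketch below is plain Python, not Flume code: the `Channel` class and the `source` and `sink` functions are hypothetical stand-ins, and a list plays the part of HDFS. It only shows how a channel absorbs a burst when the source produces events faster than the sink drains them.

```python
import queue


class Channel:
    """Bounded buffer between a source and a sink (toy stand-in for a Flume channel)."""

    def __init__(self, capacity):
        self._q = queue.Queue(maxsize=capacity)

    def put(self, event):
        # Raises queue.Full if the buffer has no room left.
        self._q.put_nowait(event)

    def take(self):
        # Returns the oldest event, or None when the channel is empty.
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

    def size(self):
        return self._q.qsize()


def source(lines, channels):
    """Wrap each log line in an event and hand it to every attached channel."""
    for line in lines:
        event = {"headers": {}, "body": line.encode("utf-8")}
        for ch in channels:
            ch.put(event)


def sink(channel, destination, batch_size):
    """Move up to batch_size events from ONE channel to the destination."""
    delivered = 0
    while delivered < batch_size:
        event = channel.take()
        if event is None:
            break
        destination.append(event["body"].decode("utf-8"))
        delivered += 1
    return delivered


log_lines = ["GET /index.html 200", "GET /missing 404", "POST /login 302"]
channel = Channel(capacity=10)
hdfs_stand_in = []  # a plain list standing in for HDFS

source(log_lines, [channel])                         # burst of 3 events
first = sink(channel, hdfs_stand_in, batch_size=2)   # slower sink takes 2
backlog = channel.size()                             # 1 event waits in the channel
second = sink(channel, hdfs_stand_in, batch_size=2)  # next pass delivers the rest
print(first, backlog, second, hdfs_stand_in)
```

Note that `source` accepts a list of channels while `sink` accepts exactly one, which mirrors the rule that a source can write to several channels but a sink reads from only one.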

An agent is just a JVM that’s running Flume. In a typical layout, the sinks of the agents on the individual nodes send their data to collector nodes, which aggregate the data from many agents before writing it to HDFS, where it can be analysed by other Hadoop tools.

Agents can be chained together so that the sink of one agent sends data to the source of another agent. Avro, Apache’s remote-procedure-call and serialization framework, is the usual way of sending data across a network with Flume, because it serializes data efficiently into a compact binary format. In the context of Flume, compatibility is key: an event sent over Avro needs an Avro source to receive it, for instance, and a sink should deliver events in a form that suits its destination.
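As a rough illustration of such a chain, the sketch below holds a two-agent configuration in a Python string and checks that the Avro sink of the first agent points at the port the second agent's Avro source listens on. The agent names (`web`, `collector`), the host `collector.example.com`, the port, and the paths are all made up for the example; the property keys follow the style of the Flume User Guide, but treat them as an assumption and check them against the version you run. The helper functions are hypothetical and are not part of Flume.

```python
CHAIN_CONFIG = """
# Agent "web": tails a log file and forwards events over Avro.
web.sources = tail
web.channels = mem
web.sinks = fwd
web.sources.tail.type = exec
web.sources.tail.command = tail -F /var/log/app/access.log
web.sources.tail.channels = mem
web.channels.mem.type = memory
web.sinks.fwd.type = avro
web.sinks.fwd.hostname = collector.example.com
web.sinks.fwd.port = 4545
web.sinks.fwd.channel = mem

# Agent "collector": receives Avro events and writes them to HDFS.
collector.sources = in
collector.channels = mem
collector.sinks = out
collector.sources.in.type = avro
collector.sources.in.bind = 0.0.0.0
collector.sources.in.port = 4545
collector.sources.in.channels = mem
collector.channels.mem.type = memory
collector.sinks.out.type = hdfs
collector.sinks.out.hdfs.path = /flume/events
collector.sinks.out.channel = mem
"""


def parse_properties(text):
    """Parse simple key = value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def avro_hops(props):
    """Pair each Avro sink with every Avro source configured on the same port."""
    sinks, sources = {}, {}
    for key, value in props.items():
        parts = key.split(".")
        # Only keys shaped like <agent>.<kind>.<name>.type are of interest.
        if len(parts) == 4 and parts[3] == "type" and value == "avro":
            agent, kind, name = parts[:3]
            port = props.get(f"{agent}.{kind}.{name}.port")
            if kind == "sinks":
                sinks[f"{agent}.{name}"] = port
            elif kind == "sources":
                sources[f"{agent}.{name}"] = port
    return [(s, r) for s, sp in sinks.items()
            for r, rp in sources.items() if sp == rp]


hops = avro_hops(parse_properties(CHAIN_CONFIG))
print(hops)
```

The check only compares port numbers, which is enough to show the idea of matching a sending sink to a compatible receiving source; it does not resolve host names or validate anything else.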

What makes this chain of sources, channels, and sinks work is the Flume agent configuration, which is stored in a local text file structured like a Java properties file. We can configure multiple agents in the same file.
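To make the shape of that file concrete, here is a minimal sketch of a single-agent configuration, held in a Python string so that a few lines of code can read back how the components are wired. The agent name `a1`, the component names, the port, and the HDFS path are placeholders, and the property keys are written from the conventions in the Flume User Guide, so verify them against your Flume version. `parse_properties` and `agent_wiring` are hypothetical helpers, not Flume APIs.

```python
CONFIG = """
# Name the components of agent a1.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens for lines of text on a local port.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: buffers events in memory.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: writes events to HDFS.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel = c1
"""


def parse_properties(text):
    """Parse simple key = value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def agent_wiring(props, agent):
    """Return which channels each source writes to and each sink reads from."""
    declared = set(props[f"{agent}.channels"].split())
    # A source lists one or more channels (plural key, space-separated).
    sources = {name: props[f"{agent}.sources.{name}.channels"].split()
               for name in props[f"{agent}.sources"].split()}
    # A sink names exactly one channel (singular key).
    sinks = {name: props[f"{agent}.sinks.{name}.channel"]
             for name in props[f"{agent}.sinks"].split()}
    used = {ch for chans in sources.values() for ch in chans} | set(sinks.values())
    undeclared = used - declared
    if undeclared:
        raise ValueError(f"undeclared channels: {sorted(undeclared)}")
    return {"sources": sources, "sinks": sinks}


wiring = agent_wiring(parse_properties(CONFIG), "a1")
print(wiring)
```

Every key starts with the agent name, which is what allows several agents to share one file. Assuming the properties were saved as `example.conf` (a hypothetical file name), the agent would typically be started with something like `flume-ng agent --conf conf --conf-file example.conf --name a1`.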

**NOTE: We use the word “ingestion” here; ingesting data simply means accepting data from an outside source and storing it in Hadoop.
