Log Data Ingestion with Flume
A good amount of the data that ends up in HDFS might land there through database
load operations or other types of batch processes, but what if we want to
capture data that's flowing in high-throughput streams, such as
application log data? Apache Flume is the widely adopted standard way to do
that easily, efficiently, and safely.
Flume, a top-level project of the Apache Software Foundation, is a
distributed system for aggregating and moving massive amounts of streaming data
from various sources to a centralized data store. In other words, Flume is
designed and engineered for the continuous ingestion** of data into HDFS. The data can be of any kind, but Flume is
particularly well suited to handling log data, such as the log data from web servers.
The single unit of data that Flume processes is called an event; a good example
of an event is a log record.
To understand how Flume works within a Hadoop cluster, you need to know that
Flume runs as one or more agents, and that every agent consists of three
pluggable components: sources, channels, and sinks:
- Sources retrieve data and hand it over to channels.
- Channels hold queued data and serve as conduits between sources and
  sinks, which is useful when the incoming flow rate exceeds the outgoing one.
- Sinks process data taken from channels and deliver it to a destination,
  such as HDFS.
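To make these three roles concrete, here is a minimal single-agent configuration sketch in the style of the Flume user guide; the agent name `a1` and the choice of a netcat source and logger sink are purely illustrative:

```properties
# Name the components on agent a1 (the names are arbitrary).
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: accepts lines of text arriving on a TCP port.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffers events in memory between the source and the sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: logs events (in production this would typically be an HDFS sink).
a1.sinks.k1.type = logger

# Wire the pieces together: a source writes to channels, a sink reads one channel.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```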
An agent must have at least one of each component in order to run, and every agent is
contained within its own instance of the Java Virtual Machine (JVM). An agent
can have many sources, channels, and sinks, and although a source can
write to several channels, a sink can take data from only one channel.
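As a sketch of that fan-out rule, a single source can be wired to two channels (again assuming the illustrative `a1` naming), with each channel drained by its own sink:

```properties
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

# A replicating selector copies every event to both channels.
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating

# Each sink reads from exactly one channel.
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
```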
An agent is just a JVM that's running Flume. The sinks on each agent node in
the Hadoop cluster send data to collector nodes, which aggregate the data from
many agents before writing it to HDFS, where it can be analyzed by other Hadoop
applications. Agents can also be chained together so that the sink of one agent sends data
to the source of another agent.
Avro, Apache's remote call-and-serialization framework, is the typical way of
sending data across a network with Flume, since it provides
efficient serialization of data into a compact
binary format. In the context of Flume, compatibility is key: an Avro event
needs an Avro source, for instance, and a sink should deliver events in a form
appropriate to the destination.
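Chaining two agents over Avro then looks roughly like this; the agent names and the host `collector.example.com` are hypothetical:

```properties
# On the first agent: an Avro sink sends events across the network.
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector.example.com
agent1.sinks.k1.port = 4141

# On the second agent: an Avro source listens on the matching port.
agent2.sources.r1.type = avro
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 4141
```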
The main thing that makes this chain of sources, channels, and sinks work is
the Flume agent configuration, which lives in a local text file
structured like a Java properties file; you can configure multiple
agents in the same file.
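For example, a single properties file could define both a web-server agent and a collector agent; all of the names, hosts, and paths below are illustrative:

```properties
# Agent "web": tails a log file and forwards events over Avro.
web.sources = s1
web.channels = c1
web.sinks = k1
web.sources.s1.type = exec
web.sources.s1.command = tail -F /var/log/httpd/access_log
web.sources.s1.channels = c1
web.channels.c1.type = memory
web.sinks.k1.type = avro
web.sinks.k1.hostname = 127.0.0.1
web.sinks.k1.port = 4141
web.sinks.k1.channel = c1

# Agent "collector": receives those events and writes them to HDFS.
collector.sources = s1
collector.channels = c1
collector.sinks = k1
collector.sources.s1.type = avro
collector.sources.s1.bind = 0.0.0.0
collector.sources.s1.port = 4141
collector.sources.s1.channels = c1
collector.channels.c1.type = memory
collector.sinks.k1.type = hdfs
collector.sinks.k1.hdfs.path = /flume/events
collector.sinks.k1.channel = c1
```

Each agent is then started separately by telling Flume which agent in the file to run, for example `flume-ng agent --conf-file flume.conf --name collector`.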
**NOTE: We use the word
"ingestion" here; ingesting data simply means accepting data from an outside source and
storing it in Hadoop.