Log Data Analysis with Hadoop
Log analysis is a common practice, and one that Hadoop handles well.
Indeed, the early applications of Hadoop were for the large-scale analysis of
clickstream logs: logs that record which web pages users browse and the order
in which they visit them. We often refer to all the kinds of log data generated
by our IT infrastructure as data exhaust. A log is a by-product of a
functioning server, much like the smoke coming from a working engine’s exhaust
pipe. Data exhaust carries the connotation of pollution or waste, and many
enterprises undoubtedly approach this kind of data with that thinking in mind.
Log data usually grows quickly, and the resulting volumes can be difficult to
analyse. The potential value of log data is also often unclear. So the
temptation in IT departments is to keep log data for as little time as
reasonably possible. But Hadoop changes the equation: storing data becomes
comparatively inexpensive, and Hadoop was initially developed specifically for
the large-scale batch processing and analysis of log data.
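To make the idea of clickstream analysis concrete, here is a minimal sketch in plain Python. It does not use any Hadoop API; it only mimics the map-then-reduce shape of a batch job on a handful of lines. The log format ("timestamp user_id page"), the sample records, and the function names map_line and analyze are all assumptions made up for this illustration, not a real server's log layout.

```python
from collections import Counter, defaultdict

# Hypothetical clickstream format (an assumption for illustration):
# "<ISO timestamp> <user_id> <page_path>", one event per line.
SAMPLE_LOG = """\
2015-03-01T10:00:05 u1 /home
2015-03-01T10:00:09 u2 /home
2015-03-01T10:00:20 u1 /products
2015-03-01T10:01:02 u1 /cart
2015-03-01T10:01:30 u2 /products
"""


def map_line(line):
    """Map step: parse one log line into (user_id, (timestamp, page))."""
    timestamp, user_id, page = line.split()
    return user_id, (timestamp, page)


def analyze(log_text):
    """Reduce step: count page views and rebuild each user's visit order."""
    page_views = Counter()
    visits = defaultdict(list)
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        user_id, (timestamp, page) = map_line(line)
        page_views[page] += 1
        visits[user_id].append((timestamp, page))
    # ISO timestamps sort chronologically when compared as plain strings.
    paths = {
        user: [page for _, page in sorted(events)]
        for user, events in visits.items()
    }
    return page_views, paths


page_views, paths = analyze(SAMPLE_LOG)
print(page_views.most_common())
print(paths)
```

On a real cluster the same two questions (which pages are viewed, and in which order) would be answered by distributing the map and reduce steps across many machines; the sketch only shows the logic on a single one.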
The log data analysis workflow is a perfect place to start our Hadoop journey
because there is a good chance that the data we would work with is currently
being deleted, or “dropped on the floor.” Many software development companies
routinely record a terabyte (TB) or more of client web activity per week, only
to discard the data without analysing it (which makes you wonder why they
bothered to collect it).
When industry analysts discuss the rapidly increasing volumes of data that exist
(4.1 exabytes as of 2015, more than 4 million 1 TB hard drives), log data
accounts for much of this expansion. And no wonder: almost every activity of
life now leads to the creation of data. A mobile phone can generate thousands
of log entries per day for an active user, tracking not only voice, text, and
data transfer but also geolocation data. Many homes now have smart meters that
log their electricity use. Newer cars have thousands of smart sensors that
record every aspect of their condition and use. Every click and mouse movement
we make while browsing the Internet causes a cascade of log entries to be
generated.
Each time we buy anything (even without using a credit card or debit card),
the activity is recorded in the system’s databases, and in logs. Some of the
most common sources of log data are IT servers, web clickstreams, sensors, and