Log analysis is a common use case that Hadoop handles well. Indeed, some of the earliest applications of Hadoop were for the large-scale analysis of clickstream logs: logs that record which web pages users browse and the order in which they visit them. We often refer to all the log data generated by our IT infrastructure as data exhaust. A log is a by-product of a functioning server, much like the smoke coming from a working engine’s exhaust pipe. Data exhaust carries the connotation of pollution or waste, and many enterprises undoubtedly treat this kind of data with that thinking in mind. Log data usually grows quickly, and the resulting volumes can be difficult to analyse. In addition, the potential value of log data is often unclear. So the temptation in IT departments is to keep log data for as little time as reasonably possible. But Hadoop changes the equation: storing data becomes comparatively inexpensive, and Hadoop was initially developed specifically for the large-scale batch processing and analysis of log data.
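To make the idea of a clickstream log concrete, here is a minimal single-machine sketch in Python that rebuilds the order in which each user visited pages. The record layout (timestamp, user ID, page), the sample entries, and the function name `page_paths` are all assumptions made for illustration; real clickstream formats vary, and this is not how Hadoop itself processes the data.

```python
from collections import defaultdict

# Hypothetical clickstream records: "<ISO timestamp> <user_id> <page>".
# This layout and these entries are invented for illustration only.
SAMPLE_LOG = [
    "2015-03-01T10:00:05 u1 /home",
    "2015-03-01T10:00:09 u2 /home",
    "2015-03-01T10:01:02 u1 /cart",
    "2015-03-01T10:00:17 u1 /products",
    "2015-03-01T10:01:30 u2 /about",
]


def page_paths(lines):
    """Return {user_id: [pages in visit order]} from raw log lines."""
    events = []
    for line in lines:
        timestamp, user_id, page = line.split()
        events.append((timestamp, user_id, page))

    # ISO 8601 timestamps in the same format sort chronologically as strings.
    events.sort()

    paths = defaultdict(list)
    for _, user_id, page in events:
        paths[user_id].append(page)
    return dict(paths)


if __name__ == "__main__":
    for user, pages in page_paths(SAMPLE_LOG).items():
        print(user, " -> ".join(pages))
```

On a real cluster the same grouping would typically be distributed, with the user ID acting as the key that brings one user's events together, but the per-user logic would look much like this.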
The log data analysis workflow is a perfect place to start our Hadoop journey because there is a good chance that the log data we work with is currently being deleted, or “dropped to the floor.” Many software development companies consistently record a terabyte (TB) or more of client web activity per week, only to discard the data without analysing it (which makes you wonder why they bothered to collect it).
When industry analysts discuss the rapidly increasing volumes of data that exist (4.1 exabytes as of 2015, more than 4 million 1TB hard drives), log data accounts for much of this growth. And no wonder: almost every activity of life now leads to the creation of data. A mobile phone can generate thousands of log entries per day for an active user, tracking not only voice, text, and data transfer but also geolocation data. Most homes now have smart meters that log their electricity use. Newer cars have thousands of smart sensors that record every aspect of their condition and use. Every click and mouse movement we make while browsing the Internet causes a cascade of log entries to be generated. Each time we buy something, even without using a credit or debit card, the activity is recorded in the system’s databases and in logs. Some of the most common sources of log data are IT servers, web clickstreams, sensors, and transaction systems.