Big data is all about applying analytics to more data, for more people. To carry out this task, big data practitioners use new tools — such as Hadoop — to explore and understand data in ways that previously might not have been possible (challenges that were “too complex,” “too expensive,” or “too slow”). Some of the “bigger analytics” that we often hear mentioned when Hadoop comes up in a conversation revolve around concepts such as machine learning, data mining, and predictive analytics. Now, what’s the common thread that runs through all these methods? That’s right: they all use good old-fashioned statistical analysis.
Faced with expensive hardware infrastructure and a heavy commitment of time and RAM, people tried to make the analytics workload more manageable by analysing only a sample of the data. The idea was to keep the chunks and chunks of data safely stashed in data warehouses and move only a statistically significant sample from the repositories to a statistical engine.
While sampling is a nice idea in theory, in practice it is often considered an unreliable tactic. Finding a statistically significant sample can be challenging for complex data such as sparse and/or skewed data sets, which are very common. This leads to poorly judged samples, which are likely to misrepresent outliers and anomalous data points and can, in turn, bias the results of our analysis.
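To make the risk concrete, here is a minimal sketch using only Python's standard library. The synthetic data set (many small values plus a handful of very large ones), the sample size, the number of trials, and the seeds are all assumptions chosen for illustration, not figures from any real workload.

```python
import random
import statistics


def make_skewed_population(n=100_000, n_large=10, seed=0):
    """Build synthetic skewed data: mostly small values plus a few huge ones."""
    rng = random.Random(seed)
    population = [rng.uniform(0, 10) for _ in range(n - n_large)]
    population += [1_000_000.0] * n_large
    rng.shuffle(population)
    return population


def sample_means(population, sample_size=1_000, trials=20, seed=1):
    """Estimate the mean from several independent random samples."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(trials)
    ]


population = make_skewed_population()
true_mean = statistics.fmean(population)
means = sample_means(population)

print(f"true mean of the full data set: {true_mean:.1f}")
print(f"sample estimates range from {min(means):.1f} to {max(means):.1f}")
```

Because the rare large values are either absent from a sample or heavily over-weighted in it, every estimate in this sketch lands far from the mean of the full data set.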
The reason analysts sample their data before running statistical analysis is that this kind of analysis often requires significant computing resources. This isn’t just about massive data volumes. Five key factors influence the scale of statistical analysis:
·The volume of data on which we perform the analysis.
·The number of transformations required on the data set before applying statistical models.
·The number of pairwise correlations we need to calculate.
·The degree of complexity of the statistical computations to be applied.
·The number of statistical models that need to be applied to the data set.
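The pairwise-correlation factor grows faster than it might seem: a data set with p variables has p(p-1)/2 distinct pairs. The short sketch below illustrates that growth. The column names are hypothetical and exist only for the example.

```python
from itertools import combinations
from math import comb


def pairwise_correlation_count(num_columns):
    """Number of distinct column pairs, i.e. p * (p - 1) / 2."""
    return comb(num_columns, 2)


# Hypothetical column names, used only to show what a "pair" is.
columns = ["age", "income", "tenure", "spend"]
pairs = list(combinations(columns, 2))
print(f"{len(columns)} columns give {len(pairs)} pairs: {pairs}")

# The count grows quadratically with the number of columns.
for p in (10, 100, 1_000, 10_000):
    print(f"{p:>6} columns -> {pairwise_correlation_count(p):>12,} correlations")
```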
Hadoop provides a way out of this dilemma by offering a platform for massively parallel processing of the data it stores. This makes it possible to flip the analytic data flow: rather than moving the data from its repository to the analytics server, Hadoop delivers the analytics straight to the data. More specifically, HDFS allows us to store our massive volumes of data and then bring the computation (in the form of MapReduce jobs) to the slave nodes that hold it.
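The idea of sending the computation to the data can be sketched in plain Python. This is a single-process simulation, not the Hadoop MapReduce API: the function names `map_partition` and `combine` and the tiny in-memory partitions are assumptions made for illustration. Each partition stands in for a block stored on a different node, and only a small summary leaves that node.

```python
from functools import reduce


def map_partition(values):
    """Map step: runs where the partition lives and returns a tiny summary."""
    return (len(values), sum(values))


def combine(a, b):
    """Reduce step: merge two (count, total) summaries into one."""
    return (a[0] + b[0], a[1] + b[1])


# Each inner list stands in for a block of data held by a different node.
partitions = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0],
    [6.0, 7.0, 8.0, 9.0],
]

# Only the summaries travel over the network, never the raw values.
partials = [map_partition(p) for p in partitions]
count, total = reduce(combine, partials)
mean = total / count

print(f"partial summaries: {partials}")
print(f"global mean over {count} values: {mean}")
```

The design point is that each summary has a fixed size regardless of how large its partition is, so the cost of combining results does not grow with the data volume.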
The typical challenge posed by moving from classical symmetric multiprocessing (SMP) statistical systems to the Hadoop architecture is the locality of the data. On classical SMP platforms, multiple processors share access to a single main memory resource. In Hadoop, HDFS splits the data into partitions and replicates them (keeping multiple copies of the same data) across multiple nodes and machines. As a result, statistical algorithms that were designed for processing data in memory must now adapt to data sets that span multiple nodes and racks and cannot hope to fit in a single block of memory.
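As one example of such an adaptation, consider variance. The textbook formula needs the global mean before it can look at any single value, which suits in-memory processing. A common reformulation (the pairwise merge usually attributed to Chan et al.) lets each partition produce its own summary, which can then be merged. The sketch below is a plain-Python illustration under the assumption that every partition is non-empty. The helper names are hypothetical and it is not code from Hadoop, Mahout, or R.

```python
import statistics
from functools import reduce


def summarize(values):
    """Per-partition summary: (count, mean, sum of squared deviations)."""
    n = len(values)  # assumes a non-empty partition
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values)
    return (n, mean, m2)


def merge(a, b):
    """Merge two summaries without revisiting the underlying values."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return (n, mean, m2)


# Each inner list stands in for a partition stored on a different node.
partitions = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]

n, mean, m2 = reduce(merge, (summarize(p) for p in partitions))
distributed_variance = m2 / (n - 1)

# Reference value computed the in-memory way, with all data in one list.
all_values = [x for p in partitions for x in p]
in_memory_variance = statistics.variance(all_values)

print(f"merged from partitions: {distributed_variance:.6f}")
print(f"computed in memory:     {in_memory_variance:.6f}")
```

Both approaches yield the same sample variance, but only the first works when no single machine can hold `all_values`.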
The Hadoop ecosystem offers several options for implementing the algorithms required for statistical modelling and analysis, including two especially helpful tools: Mahout and the R language.