Unlike the supervised learning method described earlier for Mahout’s recommendation engine feature, clustering is a form of unsupervised learning: the labels for the data points are not known ahead of time and must be inferred from the data itself, without the human-provided labels that make a method supervised.
Ideally, elements within a cluster should be similar to one another, and objects from distinct clusters should be different. Several decisions need to be made ahead of time: the number of clusters to generate, the rules for measuring “similarity,” and how objects are represented, since the representation affects the labelling that clustering algorithms produce.
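To make the point about similarity rules concrete, here is a minimal sketch in plain Python (not Mahout code) of two common distance measures. The word-count vectors and the three terms they count are hypothetical, chosen only to show that the choice of measure can change which objects look alike.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity: compares direction and ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical word counts for the terms ("election", "runway", "compiler").
short_politics = [2, 0, 0]   # a brief political story
long_politics = [20, 1, 0]   # a long political story
short_fashion = [0, 2, 0]    # a brief fashion story

for name, measure in [("euclidean", euclidean_distance), ("cosine", cosine_distance)]:
    print(name,
          "politics/politics:", round(measure(short_politics, long_politics), 3),
          "politics/fashion:", round(measure(short_politics, short_fashion), 3))
```

Under Euclidean distance the two short stories look closest because their raw counts are small, whereas cosine distance groups the two political stories regardless of length. Which behaviour is appropriate depends on how the objects are represented.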
For instance, a clustering engine that is given a list of news stories should be able to define clusters of articles within that collection that represent similar topics. Suppose a set of articles about India, Germany, England, fashion, software development, and energy were to be clustered. If the number of clusters were set to 2, the algorithm might create categories such as “regions” and “industries.” Adjusting the number of clusters produces different categorizations; for instance, selecting 3 clusters may produce pairwise groupings of nation-industry categories.
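The effect of the cluster count can be illustrated with a small k-means sketch in plain Python. This is not Mahout's implementation: the two-dimensional feature values assigned to each article are invented for illustration, and the farthest-first seeding is an assumption made only to keep the run deterministic.

```python
import math

def farthest_first_seeds(points, k):
    """Pick k starting centroids: the first point, then repeatedly the point
    farthest from every centroid chosen so far (a deterministic heuristic)."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return [list(s) for s in seeds]

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    centroids = farthest_first_seeds(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments

# Hypothetical (region-ness, industry-ness) scores for each article.
names = ["India", "Germany", "England", "fashion", "software", "energy"]
points = [(0.9, 0.1), (1.0, 0.2), (0.8, 0.0), (0.1, 0.9), (0.3, 1.0), (0.0, 0.7)]

for k in (2, 3):
    print(k, dict(zip(names, kmeans(points, k))))
```

With these toy features, two clusters separate the country articles from the industry articles, and three clusters subdivide one of those groups. The sketch does not reproduce the nation-industry pairings mentioned above, which would depend on richer features than the two invented here.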
Classification algorithms, by contrast, rely on human-labelled training data sets: the categorization of all future input is governed by these known labels. Such classifiers are an example of what the machine learning field calls supervised learning. Classification rules, usually derived from training data that analysts and domain experts have labelled ahead of time, are then applied to raw, unprocessed data to identify the most appropriate label for each item.
These techniques are commonly used by e-mail services, which try to classify spam e-mail well before it ever reaches our inboxes. In particular, given an e-mail that contains a set of phrases known to occur together in a specific class of spam, and that was delivered from an address belonging to a known botnet, a classification algorithm can reliably flag the e-mail as malicious or virus-prone.
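A spam filter of this kind can be sketched as a small naive Bayes classifier in plain Python. This is an illustrative stand-in rather than Mahout's classifier: the four training messages and the "spam"/"ham" labels are made up, and only the message words are used, not the sender address.

```python
import math
from collections import Counter, defaultdict

def train(labelled_messages):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labelled_messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability (Laplace smoothing)."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_messages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        # Start from the prior: how common this label is in the training set.
        score = math.log(label_counts[label] / total_messages)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the probability.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set labelled ahead of time by a human.
training_data = [
    ("win a free prize now", "spam"),
    ("claim your free money", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project status report attached", "ham"),
]
model = train(training_data)
print(classify("free prize money", *model))
print(classify("monday project meeting", *model))
```

The labelled examples play the role of the analyst-prepared training data: the classifier never sees a rule for spam, only word frequencies per label, and applies them to new, unlabelled messages.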
In addition to the robust statistical algorithms that Mahout provides natively, a supporting User Defined Algorithms (UDA) module is also available. Through the UDA module, users can override existing algorithms and implement their own. This customization allows for performance tuning of native Mahout algorithms and flexibility in handling unique statistical analysis problems. If we consider Mahout a statistical analytics extension to Hadoop, then UDA can be seen as an extension to Mahout’s statistical capabilities.
Classical statistical analysis applications (such as SAS, SPSS, and R) come with powerful techniques for generating workflows. These applications provide intuitive graphical user interfaces that enable better data visualization. Mahout scripts follow a pattern similar to these other tools for generating statistical analysis workflows. During the final data exploration and visualization phase, users can export results to human-readable formats (JSON, CSV) or take advantage of visualization tools like Tableau Desktop.
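As a rough illustration of the export step, the sketch below serializes a small result set to JSON and CSV using only the Python standard library. The record layout (an article name and a cluster number) is hypothetical and is not a format defined by Mahout.

```python
import csv
import io
import json

# Hypothetical clustering output: one record per article.
results = [
    {"article": "India", "cluster": 0},
    {"article": "fashion", "cluster": 1},
]

def to_json(records):
    """Serialize the records as an indented JSON array."""
    return json.dumps(records, indent=2)

def to_csv(records):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["article", "cluster"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

print(to_json(results))
print(to_csv(results))
```

Either output can be written to a file and opened in a spreadsheet or loaded into a visualization tool as a text data source.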