Classification with Mahout
Unlike the supervised learning method described earlier for Mahout’s recommendation
engine feature, clustering is a form of unsupervised learning, in which the
labels for data points are not known ahead of time and must be inferred from the
data without human input (human input being the supervised part).
Elements within a cluster should be similar; objects from distinct clusters
should be different. Decisions must be made ahead of time about the number
of clusters to generate, the criteria for measuring “similarity,” and how the
representation of objects will affect the labelling generated by clustering algorithms.
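
To make the idea of a similarity measure concrete, here is a minimal sketch in
plain Python (not Mahout code). It assumes a bag-of-words representation and
cosine similarity, which are only one possible choice of representation and
measure; the function names and sample sentences are invented for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a document as lowercase word counts (an assumed representation)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents: two about energy, one about fashion.
doc_energy_1 = bag_of_words("solar energy prices fall as energy demand grows")
doc_energy_2 = bag_of_words("wind and solar energy demand rises")
doc_fashion = bag_of_words("spring fashion shows open in paris")

sim_same_topic = cosine_similarity(doc_energy_1, doc_energy_2)
sim_diff_topic = cosine_similarity(doc_energy_1, doc_fashion)
```

Under this representation the two energy documents share words and score above
zero, while the fashion document shares none and scores zero. A different
representation (for example, one that ignores common words) could change these
scores, and with them the clusters that result.
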
For instance, a clustering engine that is given a list of news stories should be
able to define clusters of articles within that collection that discuss
similar topics. Suppose a set of articles about India, Germany, England,
fashion, software development, and energy were to be clustered. If the maximum
number of clusters allowed were set to 2, our algorithm might produce
categories such as “regions” and “industries.” Adjusting the number of
clusters will produce different categorizations; for instance, selecting 3
clusters may result in pairwise groupings of nation-industry categories.
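
The effect of the cluster count can be sketched with a toy k-means
implementation in plain Python. This is an illustration only and not Mahout's
API: the 2-D points stand in for documents, the farthest-first seeding is an
assumption made so that the result is reproducible, and all names are invented.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def pick_initial_centroids(points, k):
    """Farthest-first seeding (an assumption): deterministic, unlike random seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(farthest)
    return centroids


def k_means(points, k, max_iterations=100):
    """Return one cluster label per point, iterating until labels stop changing."""
    centroids = pick_initial_centroids(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


# Hypothetical data: three tight groups, two of which lie close together.
points = [
    (0, 0), (0, 1), (1, 0),        # group A
    (5, 5), (5, 6), (6, 5),        # group B (near A)
    (20, 20), (20, 21), (21, 20),  # group C (far away)
]

labels_k2 = k_means(points, 2)
labels_k3 = k_means(points, 3)
```

With 2 clusters the nearby groups A and B are merged into one cluster; with 3
clusters each group receives its own label. The same data therefore yields
different categorizations depending only on the number of clusters requested.
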
Classification algorithms make use of human-labelled training data sets, where the
categorization and classification of all future input is governed by these
known labels. These classifiers implement what is known as supervised learning in
the machine learning field. Classification rules, usually derived from the
training data, which has been labelled ahead of time by analysts and domain
experts, are then applied against raw, unprocessed data to best determine
its appropriate labelling.
Classification techniques are often used by e-mail services that attempt to
classify spam e-mail before it ever reaches our inboxes. Specifically, given an
e-mail containing a set of phrases known to commonly occur together in a certain
class of spam mail, and delivered from an address belonging to a known botnet,
our classification algorithm can reliably identify the e-mail as malicious or
virus-prone.
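
The spam example can be sketched as a toy naive Bayes classifier in plain
Python. This is a minimal illustration under stated assumptions and not
Mahout's implementation: the training messages and labels are invented, only
the words of a message are used as features (the sender address mentioned above
is ignored), and add-one smoothing is an assumed design choice.

```python
import math
from collections import Counter


def train(labelled_messages):
    """Count words per label from (text, label) pairs labelled ahead of time."""
    word_counts = {}
    label_counts = Counter()
    for text, label in labelled_messages:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability score for the text."""
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_messages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Start from the log prior: how common this label was in training.
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical human-labelled training set.
training_data = [
    ("win free prize now", "spam"),
    ("free money claim prize", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project status report attached", "ham"),
    ("lunch on monday", "ham"),
]

word_counts, label_counts = train(training_data)
spam_guess = classify("claim free prize", word_counts, label_counts)
ham_guess = classify("monday project meeting", word_counts, label_counts)
```

The known labels in the training set are what make this supervised: the
classifier never invents categories, it only assigns new messages to the labels
it was shown.
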
In addition to the robust statistical algorithms that Mahout provides natively,
a supporting User Defined Algorithms (UDA) module is also available. Users can
override existing algorithms or implement their own through the UDA
module. This powerful customization allows for performance tuning of native
Mahout algorithms and flexibility in handling unique statistical analysis
problems. If we consider Mahout a statistical analytics extension to Hadoop,
then UDA should be seen as an extension to Mahout’s statistical capabilities.
Many statistical analysis applications (such as SAS, SPSS, and R) come with powerful
tools for generating workflows. These applications provide intuitive graphical
user interfaces that allow for better data visualization. Mahout scripts
follow a similar pattern to these other tools for generating statistical analysis
workflows. During the final data exploration and visualization phase, users can
export to human-readable formats (JSON, CSV) or take advantage of visualization
tools like Tableau Desktop.
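
As a rough illustration of the export step, the following plain-Python sketch
writes a small result set to both JSON and CSV text. It does not use Mahout;
the `results` rows, field names, and helper functions are hypothetical
stand-ins for whatever an analysis actually produces.

```python
import csv
import io
import json

# Hypothetical analysis output: document identifiers and assigned cluster labels.
results = [
    {"document": "article-1", "cluster": 0},
    {"document": "article-2", "cluster": 1},
]


def to_json(rows):
    """Serialize the rows as indented, human-readable JSON."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["document", "cluster"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


json_text = to_json(results)
csv_text = to_csv(results)
```

Either text form can be saved to a file and opened directly, or loaded into a
separate visualization tool.
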