It is no exaggeration to say that we are now living in an advanced stage of the information age. Data is being generated and stored electronically by networked sensors in enormous volumes, at an accelerating pace, and in mind-boggling variety. Devices such as smartphones, digital cameras, cars, televisions, and equipment in industry and health care all contribute to the exploding volume of data sets. This data can be browsed, captured, and shared, but its greatest potential remains largely untapped. Its value lies in its capacity to provide insights that can address vexing business challenges, open new domains, reduce costs, and improve the overall health of our societies.
In the mid-2000s, software companies such as Google and Yahoo! needed a way to analyse the huge volumes of data that their search engines were collecting. Hadoop is a by-product of that effort, representing an elegant and cost-effective way of breaking big analytical problems down into small, manageable tasks.
Structured data has a very high degree of organization and is usually the kind of data we see in relational databases (RDBMSs) or spreadsheets. Because of its well-defined structure, it can easily be mapped to one of the standard data types (or to user-defined, custom data types based on those standard types). It can readily be searched using standard search algorithms and processed in well-defined ways.
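As a minimal sketch of how structured data maps to standard types and supports standard queries, consider a small relational table built with Python's built-in sqlite3 module. The table name, columns, and rows here are purely illustrative, not taken from the text.

```python
import sqlite3

# In-memory database; the "customers" table and its contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO customers (id, name, balance) VALUES (?, ?, ?)",
    [(1, "Alice", 120.50), (2, "Bob", 75.00)],
)

# Because the schema declares each column's type, a standard SQL query
# returns already-typed results with no extra interpretation needed.
rows = conn.execute(
    "SELECT name, balance FROM customers WHERE balance > 100"
).fetchall()
print(rows)  # [('Alice', 120.5)]
conn.close()
```

The point is that the well-defined schema does the interpretive work: the database knows `balance` is a number, so comparisons and searches work out of the box.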
Semi-structured data (such as what you might see in log files) is somewhat harder to work with than structured data. It typically appears as text files with some degree of order: for example, tab-delimited files in which columns are separated by a tab character. But instead of being able to issue a database query against a known schema and knowing what the results will look like, we need to explicitly assign data types to the data elements extracted from a semi-structured data set.
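To make the contrast concrete, here is a sketch of parsing one line of a hypothetical tab-delimited log file. The field names and their types are assumptions made by whoever reads the file; nothing in the file itself declares them.

```python
# One hypothetical tab-delimited log line: date, path, status code, latency.
line = "2014-03-01\t/index.html\t200\t0.042"

fields = line.rstrip("\n").split("\t")

# Unlike a database query result, every element arrives as a string,
# so we must assign a type to each one ourselves.
record = {
    "date": fields[0],              # kept as a string here
    "path": fields[1],
    "status": int(fields[2]),       # explicit cast: status code is an integer
    "latency_s": float(fields[3]),  # explicit cast: latency is a float
}
print(record["status"] + 100)  # arithmetic only works after the cast: 300
```

Had we skipped the casts, `record["status"]` would be the string `"200"`, and any numeric comparison or aggregation over the log would silently misbehave or fail.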
Unstructured data has none of the benefits that a structured, schema-bound data set provides. Analysing it with more classical methods is complex and costly at best, and logistically impossible at worst. Imagine having many years' worth of notes typed by call centre operators describing customer interactions. Without a powerful and smart set of text analytics tools, it would be extremely difficult to discover any relevant behaviour patterns. Moreover, the sheer volume of data in many cases poses virtually insurmountable challenges to traditional data mining techniques, which struggle even under favourable conditions.
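A first, naive step in text analytics over such notes might be simple term counting, sketched below on two invented call centre notes. This only hints at the problem; discovering genuine behaviour patterns across years of free text requires far more sophisticated tools than this.

```python
import re
from collections import Counter

# Two hypothetical call-centre notes: free text with no schema at all.
notes = [
    "Customer called about late delivery, very upset, promised refund.",
    "Follow-up call: refund processed, customer satisfied with delivery update.",
]

# Tokenize crudely (lowercase alphabetic runs) and count term frequencies.
words = re.findall(r"[a-z]+", " ".join(notes).lower())
common = Counter(words).most_common(3)
print(common)  # the three most frequent terms and their counts
```

Even this toy example shows why unstructured data resists classical querying: "delivery" and "refund" surface as frequent terms, but nothing in the data itself says which note records a complaint and which records a resolution.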