Data Warehousing with Hadoop
Data warehouses are under strain, trying to cope with growing demands on their
finite resources. The sudden growth in the volumes of data sets generated in
the world has also affected data warehouses because the volume of data they
handle is expanding, partly because more structured data is being created but
also because regulatory requirements often demand queryable access to
historical data. Also, the processing power in data warehouses is often used
to transform relational (RDBMS) data as it either enters the warehouse itself
or is loaded into a child data mart (a separate subset of the data warehouse)
for a separate analytics application. In
addition, the demand is rising for analysts to design new queries against the
structured data stored in warehouses, and these kinds of ad hoc queries might use
significant data processing resources. Sometimes a one-time report will
suffice, and sometimes an exploratory analysis is needed to find questions
that haven’t been asked yet but may yield significant business results. The
bottom line is that data warehouses are often being used for purposes beyond
their original design.
Hadoop can provide significant relief in this situation. From a high-level
architectural standpoint, Hadoop can live alongside data warehouses and
fulfill some of the purposes that they aren’t designed for.
Hadoop can modernize a data warehousing ecosystem by providing a landing zone
for all data, persisting the data to provide a queryable archive of cold data,
and leveraging Hadoop’s large-scale batch processing efficiencies to
pre-process and transform data for the warehouse. It also enables an
environment for ad hoc data exploration.
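The pre-processing role described above can be illustrated with a small sketch. It is not taken from the original text: it assumes a hypothetical landing zone holding raw comma-separated sales lines (date, store ID, amount) and simulates a map, shuffle, and reduce pass in memory, cleaning the records and aggregating them into summary rows that a warehouse could load. The record layout and every name in it (`RAW_LINES`, `map_record`, `reduce_group`, `run_batch`) are illustrative assumptions. On a real cluster, the same two steps would run as distributed tasks rather than as a local loop.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical raw landing-zone records ("date,store_id,amount"),
# including malformed lines as they might arrive before any cleansing.
RAW_LINES = [
    "2014-01-05,store_1,19.99",
    "2014-01-05,store_1,5.01",
    "2014-01-05,store_2,7.50",
    "bad line without enough fields",
    "2014-01-06,store_1,not_a_number",
    "2014-01-06,store_2,2.50",
]

Key = Tuple[str, str]


def map_record(line: str) -> Iterator[Tuple[Key, float]]:
    """Map step: parse one raw line and emit ((date, store), amount).

    Malformed lines are silently dropped (an assumption of this sketch).
    """
    parts = line.strip().split(",")
    if len(parts) != 3:
        return
    date, store_id, amount_text = parts
    try:
        amount = float(amount_text)
    except ValueError:
        return
    yield (date, store_id), amount


def reduce_group(key: Key, amounts: Iterable[float]) -> Dict[str, object]:
    """Reduce step: collapse all amounts for one key into a summary row."""
    values = list(amounts)
    return {
        "date": key[0],
        "store_id": key[1],
        "sales_count": len(values),
        "sales_total": round(sum(values), 2),
    }


def run_batch(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Simulate map -> shuffle (group by key) -> reduce in memory."""
    groups: Dict[Key, List[float]] = defaultdict(list)
    for line in lines:
        for key, amount in map_record(line):
            groups[key].append(amount)
    # Sort keys so the output order is deterministic.
    return [reduce_group(key, groups[key]) for key in sorted(groups)]


if __name__ == "__main__":
    for row in run_batch(RAW_LINES):
        print(row)
```

The point of the split is that the expensive parsing, filtering, and aggregation happen outside the warehouse, so only the compact summary rows need to be loaded into it.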
On one hand, the Hadoop hype machine is in full gear and bent on world domination.
This camp sees Hadoop replacing the relational database products that now power
the world’s data warehouses. The argument here is compelling: Hadoop is cheap
and scalable, and it has queryable interfaces that are becoming increasingly
fast and more closely compliant with ANSI SQL, the standard query language
for relational database systems.
On the other hand, many relational warehouse vendors have gone out of their way to
resist the appeal of all the Hadoop hype. Understandably, they won’t roll over
and make way for Hadoop to replace their relational database offerings. They’ve
adopted what we consider to be a protectionist stance, drawing a line between
structured data, which they consider to be the exclusive domain of relational
databases, and unstructured data, which is where they feel Hadoop can operate.
In this model, they’re positioning Hadoop as solely a tool to transform unstructured
data into a structured form for relational databases to store.