Strategies: Classic data processing systems (Scale-Up and Scale-Out)
The fundamental reason that big
data mining systems were rare and expensive is that scaling a system to process
large data sets is very difficult; traditionally, processing has been limited
to the power that could be built into a single computer.
There are, however, two broad
approaches to scaling a system as the size of the data increases, generally
referred to as scale-up and scale-out.
Scale-up: In most enterprises, data processing has typically been performed on
impressively large computers with impressively large price tags. As the size
of the data grows, the approach is to move to a bigger server or storage array.
Even today, with an effective architecture, the cost of such hardware can
easily be measured in hundreds of thousands or even millions of dollars. The
advantage of simple scale-up is that the architecture does not significantly
change as the system grows. Though larger components are used, the basic
relationship (for example, database server and storage array) stays the same.
For applications such as commercial database engines, the software handles the
complexities of utilizing the available hardware, but in theory, increased
scale is achieved by migrating the same software onto larger and larger
servers. Note, though, that moving software onto more and more processors is
never trivial; in addition, there are practical limits on just how big a
single host can be, so at some point scale-up cannot be extended any further.
In software development, the
promise of a single architecture at any scale is also unrealistic. A scale-up
design intended to handle data sets of 1 terabyte, 100 terabytes, and
1 petabyte may conceptually apply larger versions of the same components, but
the complexity of their connectivity can range from cheap commodity
interconnects to expensive custom hardware as the scale increases.
Scale-out: Instead of growing a system onto
larger and larger hardware, the scale-out approach spreads the
processing onto more and more machines. If the data set doubles, simply use
two servers instead of a single double-sized one; if it doubles again, move
to four, and so on.
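The doubling idea can be sketched as a simple hash partition. This is a minimal illustration, not a production scheme: the `partition` helper, the record names, and the use of a shard index to stand in for a server are all hypothetical.

```python
from collections import defaultdict

def partition(records, num_servers):
    """Assign each (key, value) record to a server by hashing its key.

    Doubling the fleet only changes num_servers; the splitting logic
    itself stays the same. (Python's built-in hash() is randomized
    across runs for strings, so shard contents vary between runs,
    but the split is always complete and disjoint.)
    """
    shards = defaultdict(list)
    for key, value in records:
        shards[hash(key) % num_servers].append((key, value))
    return shards

# Hypothetical records; each shard index stands in for one server.
records = [("user%d" % i, i) for i in range(8)]
two = partition(records, 2)    # data set handled by two servers
four = partition(records, 4)   # data doubled: move to four servers
```

A real system would also have to handle rebalancing when servers are added, which is one reason scale-out tooling is harder than this sketch suggests.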
The obvious benefit of this
approach is that purchase costs remain much lower than for scale-up. Server
hardware costs tend to increase sharply when one seeks to purchase larger
machines, and though a single host may cost $5,000, one with ten times the
processing power may cost a hundred times as much. The downside is that we need
to develop strategies for splitting our data processing across a fleet of
servers, and the tools historically used for this purpose have proven to be
complex and difficult to use.
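The pricing claim above (a $5,000 host, with ten times the processing power costing roughly a hundred times as much) implies scale-up costs growing roughly with the square of processing power. A small sketch under that assumption; the cost functions are illustrative, not vendor pricing:

```python
BASE_COST = 5_000  # price of one commodity server (figure from the text)

def scale_up_cost(power):
    # Assumed model: 10x the power costs 100x the price,
    # i.e. cost grows as the square of processing power.
    return BASE_COST * power ** 2

def scale_out_cost(power):
    # Buy `power` commodity servers instead; cost grows linearly.
    return BASE_COST * power

# At 10x the processing power: $500,000 scale-up vs $50,000 scale-out.
```

The gap widens as scale increases, which is the economic argument for scale-out; the sketch ignores the operational cost of running a larger fleet.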