Introduction to Oozie
Storing data and running different kinds of applications in Hadoop is great stuff, but
it's only half the battle. For Hadoop's efficiencies to truly start paying off
for us, we need to start thinking about how we can tie together a number of these actions
to form a cohesive workflow. This idea is appealing, especially after we and our
colleagues have built a number of Hadoop applications and we need to mix and
match them for different purposes. At the same time, we inevitably need to
prepare or move data as we progress through our workflows and make decisions
based on the output of our jobs or other factors. Of course, we can always
write our own logic or hack an existing workflow tool to do this in a Hadoop
setting, but that's a lot of work. Our best bet is to use Apache Oozie, a
workflow engine and scheduling facility designed specifically for Hadoop.
As a workflow engine, Oozie enables us to run a set of Hadoop applications in a
specified sequence known as a workflow. We define this sequence in
the form of a directed acyclic graph (DAG) of actions. In this graph, the nodes
represent the actions and decision points (where the
control flow will go in one direction or another), while the connecting lines
show the sequence of these actions and the direction of the control flow. Oozie
graphs are acyclic (no cycles, in other words), which means we can't use loops
in our workflows. In terms of the actions we can schedule, Oozie supports a
wide range of job types, including Pig, Hive, and MapReduce, as well as jobs
written as Java programs and shell scripts.
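To make this more concrete, here is a minimal sketch of what a workflow definition could look like, with one Pig action and one decision node. The action names, script name, and property names (cleanse-data, cleanse.pig, outputDir, and so on) are hypothetical, and the exact schema version and elements depend on your Oozie release:

```xml
<!-- Hypothetical workflow.xml: a Pig action followed by a decision node -->
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="cleanse-data"/>

    <!-- An action node: run a Pig script, then follow the ok or error arrow -->
    <action name="cleanse-data">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>cleanse.pig</script>
        </pig>
        <ok to="check-output"/>
        <error to="fail"/>
    </action>

    <!-- A decision node: control flow goes one way or the other -->
    <decision name="check-output">
        <switch>
            <case to="end">${fs:exists(outputDir)}</case>
            <default to="fail"/>
        </switch>
    </decision>

    <kill name="fail">
        <message>Workflow failed at [${wf:errorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Note how the `to` attributes are the "connecting lines" of the DAG: every node names its successor, and because no node points back to an earlier one, the graph stays acyclic.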
In addition to its workflow engine, Oozie also provides a handy scheduling facility. An Oozie coordinator job, for example,
enables us to schedule any workflows we have already created. We can schedule
them to run based on specific time intervals, or even based on data availability.
At an even higher level, we can create an Oozie bundle job to manage our
coordinator jobs. Using a bundle job, we can easily apply policies against a
set of coordinator jobs.
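As an illustration, a time-based coordinator definition might look like the following sketch. The app name, dates, and path property are placeholders; a data-availability trigger would add dataset and input-event sections on top of this:

```xml
<!-- Hypothetical coordinator.xml: run an existing workflow once a day -->
<coordinator-app name="daily-coord" frequency="${coord:days(1)}"
                 start="2014-01-01T00:00Z" end="2014-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- HDFS path to the workflow application being scheduled -->
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
```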
For all three kinds of Oozie jobs (workflow, coordinator, and bundle), we start out
by defining them using individual .xml files, and then we configure them using
a combination of properties files and command-line options.
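Pulling those pieces together, a workflow job might be configured with a small properties file like this one; the host names, port numbers, and paths are placeholders for your own cluster:

```
# job.properties (placeholder values)
nameNode=hdfs://namenode-host:8020
jobTracker=jobtracker-host:8032
oozie.wf.application.path=${nameNode}/user/hadoop/apps/sample-wf
```

The job can then be submitted and started with the Oozie command-line tool, pointing it at the Oozie server and the properties file:

```
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run
```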