Pig Data Types in Hadoop
We have already seen the Pig architecture and the Pig Latin application flow. We also
learned the Pig design principles in the previous post. Now it is time to look at
the different data types used by Pig.
Pig's data types form the basis of the data model, that is, how Pig perceives
the structure of the data it is processing. With Pig, the data model only gets
defined when the data is loaded. Any data we load into Pig from disk is
going to have a specific schema and structure. Pig needs to
understand that structure, so when we load the data it automatically goes
through a mapping onto Pig's data types.
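The load-time mapping described above can be pictured with a small sketch. This is only a Python analogy, not Pig code: the `SCHEMA` list, the `apply_schema` helper, and the comma delimiter are all hypothetical names and choices made for illustration.

```python
# Hypothetical schema: (field name, Python type standing in for a Pig type).
SCHEMA = [("name", str), ("course", str), ("score", int)]


def apply_schema(line, schema=SCHEMA, delimiter=","):
    """Split one raw text line and cast each field to its declared type."""
    raw_fields = line.rstrip("\n").split(delimiter)
    return tuple(cast(raw) for (_, cast), raw in zip(schema, raw_fields))


# The raw text only becomes a typed record once the schema is applied.
record = apply_schema("MindStick,Hadoop,6")
print(record)  # ('MindStick', 'Hadoop', 6)
```

Until the schema is applied, the line is just text; the structure exists only after loading, which is the point made in the paragraph above.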
Fortunately for us, the Pig data model is rich enough to handle almost anything thrown its
way, including table-like structures and nested hierarchical
data structures. In general terms, though, Pig data types can be categorized
into two groups: scalar types and complex types.
Scalar types consist of a single value, whereas complex types are composed of other
types, such as the Tuple, Bag, and Map types described below.
Pig Latin has these four important types in its data model:
An atom represents any single value, such as a string or a number
(‘MindStick’, for example). Pig’s atomic values are scalar types that appear in almost
every programming language, for example int, long, float, double, chararray, and bytearray.
A tuple represents a record that is composed of a sequence of fields. Each
field can be of any type: ‘MindStick’, ‘Hadoop’, or 6, for instance. A tuple can be thought of as a row in a table.
A bag represents a collection of tuples that need not be unique. The schema of the bag is
quite flexible: every tuple in the collection can contain an arbitrary number
of fields, and each field can be of any type.
A map is a collection of key-value pairs (very similar to the key-value
pairs in MapReduce programs). Each key must be unique and must be a chararray,
while the value can be of any type.
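As a rough analogy, the four types can be mirrored with built-in Python values. This is a sketch rather than Pig code: a Python list stands in for a bag even though a list is ordered, and the `is_valid_map` helper is a hypothetical name introduced here.

```python
# Atom: a single scalar value.
atom = "MindStick"

# Tuple: an ordered sequence of fields; fields may have different types.
row = ("MindStick", "Hadoop", 6)

# Bag: a collection of tuples. Duplicates are allowed, and the tuples
# do not need to have the same number of fields.
bag = [("MindStick", "Hadoop", 6), ("MindStick", "Hadoop", 6), ("Pig",)]

# Map: string (chararray) keys mapped to values of any type.
info = {"name": "MindStick", "topics": bag, "posts": 6}


def is_valid_map(candidate):
    """Check the one key constraint stated above: every key is a string."""
    return all(isinstance(key, str) for key in candidate)


print(is_valid_map(info))        # True
print(is_valid_map({1: "one"}))  # False
```

Note how the map value for `topics` is itself a bag of tuples, which is what makes the nested, hierarchical structures mentioned earlier possible.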
A value of any of these types can also be null.
The semantics of null are the same as those used in SQL: null in Pig
means that the value is unknown. Nulls can show up in the data wherever values are unreadable or unrecognizable,
for example if we were to use the wrong data type in the LOAD statement. Null
can also be used as a placeholder until data is added, or as the value of a field
that is optional.
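The SQL-like behaviour of null can be sketched in Python, with `None` standing in for null. The helper names `to_int_or_null` and `greater_than` and the sample values are hypothetical; the sketch assumes the SQL convention that a comparison involving null is itself null, and that a filter keeps a row only when its condition is true.

```python
def to_int_or_null(raw):
    """Cast raw text to int; an unreadable value becomes None (null)."""
    try:
        return int(raw)
    except ValueError:
        return None


def greater_than(value, threshold):
    """SQL-style comparison: if either operand is null, the result is null."""
    if value is None or threshold is None:
        return None
    return value > threshold


# "n/a" cannot be read as an int, so it turns into a null.
raw_scores = ["6", "n/a", "9"]
scores = [to_int_or_null(raw) for raw in raw_scores]
print(scores)  # [6, None, 9]

# The null row is dropped because its condition is null, not true.
kept = [s for s in scores if greater_than(s, 5) is True]
print(kept)  # [6, 9]
```

The unknown value is neither greater than 5 nor not greater than 5, so it drops out of the filtered result instead of being treated as zero or as an error.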