Pig Data Types in Hadoop

Ailsa Singh, 06 May 2016 (updated 14 Mar 2018)

We have already seen the Pig architecture and the Pig Latin application flow, and we covered Pig’s design principles in the previous post. Now it’s time to look at the different data types Pig uses.

Pig’s data types form the basis of its data model: they determine how Pig perceives the structure of the data it is processing. With Pig, the data model only gets defined when the data is loaded. Any data we load into Pig from disk has a specific schema and structure, and Pig needs to understand that structure. So as soon as loading finishes, the data is automatically mapped onto the schema.
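To make the idea of mapping data onto a schema at load time more concrete, here is a minimal sketch in plain Python, not Pig Latin. The schema, the field names, and the load_line helper are all hypothetical and only stand in for what Pig does internally when it loads a delimited file.

```python
# Hypothetical schema: (field name, Python type standing in for a Pig type).
# str plays the role of chararray and int the role of Pig's int.
SCHEMA = [("name", str), ("course", str), ("rating", int)]


def load_line(line, schema=SCHEMA, delimiter="\t"):
    """Split one raw text line and cast each field to its declared type."""
    raw_fields = line.rstrip("\n").split(delimiter)
    return tuple(cast(value) for (_, cast), value in zip(schema, raw_fields))


# The raw line is just text until the schema is applied during loading.
record = load_line("MindStick\tHadoop\t6")
print(record)
```

Before load_line runs, the rating is only the character "6"; afterwards it is a typed integer, which is the kind of mapping described above.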

Luckily for us, the Pig data model is rich enough to handle almost anything thrown its way, including table-like structures and nested hierarchical data structures. In general terms, though, Pig data types fall into two categories:

1. Scalar types

2. Complex types

Scalar types consist of a single value, whereas complex types are composed of other types. The complex types are the Tuple, Bag, and Map types described below.

Pig Latin has these four important types in its data model:

1. Atom: An atom represents any single value, such as a string or a number (‘MindStick’, for example). Pig’s atomic values are the scalar types that appear in almost every programming language: int, long, float, double, chararray, and bytearray.

2. Tuple: A tuple represents a record composed of a sequence of fields. Each field can be of any type (‘MindStick’, ‘Hadoop’, or 6, for instance). A tuple can be thought of as a row in a table.

3. Bag: A bag represents a collection of tuples, which do not have to be unique. The schema of the bag is quite flexible: every tuple in the collection can contain an arbitrary number of fields, and each field can be of any type.

4. Map: A map is a collection of key-value pairs (as stated before, much like the key-value pairs in MapReduce programs). Each key must be unique and must be a chararray; the value can be of any type.
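The four types above can be pictured with ordinary Python values. This is only an analogy, not how Pig stores data, and all the sample values and variable names are made up for illustration. One difference to keep in mind is that a Python list is ordered, while a Pig bag makes no promise about order.

```python
# Atoms: single scalar values.
name = "MindStick"   # plays the role of a chararray
rating = 6           # plays the role of an int

# Tuple: a record made of a sequence of fields, like a row in a table.
row = ("MindStick", "Hadoop", 6)

# Bag: a collection of tuples. Duplicates are allowed, and the tuples
# do not all need the same number of fields.
bag = [
    ("MindStick", "Hadoop", 6),
    ("MindStick", "Hadoop", 6),   # the same tuple may appear again
    ("Pig",),                     # a tuple with a single field
]

# Map: unique string keys, each mapped to a value of any type.
info = {
    "site": name,      # an atom as the value
    "latest": row,     # a tuple as the value
    "history": bag,    # a bag as the value
}

print(len(bag), bag.count(row), sorted(info))
```

Because a map value can itself be a tuple or a bag, these types nest, which is how the nested hierarchical structures mentioned earlier are built.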

The value in all these types can also be null. The semantics for null are the same as those used in SQL: null in Pig means that the value is unknown. Nulls can show up in the data whenever values are unreadable or unrecognizable, for example if we were to use the wrong data type in the LOAD statement. Null can also be used as a placeholder until data is added, or as the value of an optional field.
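The sketch below imitates these null semantics in plain Python, with None standing in for Pig’s null. The helper names to_int and greater_than and the sample ratings are hypothetical. It assumes the SQL-style rule that a comparison involving null yields null, and that a filter keeps a row only when its condition is definitely true.

```python
def to_int(raw):
    """Return raw as an int, or None (standing in for null) if it is unreadable."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        return None


def greater_than(value, threshold):
    """SQL-style comparison: if either side is null, the result is null."""
    if value is None or threshold is None:
        return None
    return value > threshold


raw_ratings = ["6", "n/a", "2"]               # "n/a" cannot be read as an int
ratings = [to_int(r) for r in raw_ratings]    # the unreadable value becomes None

# Keep only the rows whose condition is definitely true; null results are dropped.
kept = [r for r in ratings if greater_than(r, 3) is True]
print(ratings, kept)
```

Note that the unreadable rating is not kept by the filter, and it would not be kept by the opposite condition either, because an unknown value is neither greater than 3 nor not greater than 3.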
