Hadoop is the most significant concrete technology behind the so-called “Big Data” revolution. Hadoop combines an economical model for storing massive quantities of data – the Hadoop Distributed File System – with a flexible model for programming massively scalable programs – MapReduce. However, as powerful and flexible as MapReduce might be, it is hardly a productive programming model. Programming in MapReduce is reminiscent of programming in assembly language – the simplest operations require substantial code.
It was recognized early on that the full potential of Hadoop could not be unlocked if commonplace operations required highly skilled Java programmers. At Facebook, the development of Hive was an important step toward a solution to this problem. The Hive system compiles SQL-like statements into Java MapReduce code, allowing analysts or programmers with SQL skills to query data within Hadoop without any Java programming.
Around the same time as Facebook was developing Hive, researchers at Yahoo! were facing similar problems with the productivity of MapReduce. But the Yahoo! team felt that the SQL paradigm could not address a sufficiently broad category of MapReduce programming tasks. Therefore, Yahoo! set out to create a language that maximized productivity but still allowed for complex procedural data flows. The result was Pig. Pig caught on rapidly at Yahoo!, and within a few years Pig programs accounted for about half of Yahoo!'s Hadoop workload. Since then, Pig has become an official Apache project, and drives an unknown but clearly substantial proportion of Hadoop workloads worldwide.
Pig superficially resembles scripting languages such as Perl or Python in terms of flexible syntax and dynamically typed variables. But Pig actually implements a distinctive programming paradigm. Pig statements typically represent data operations roughly analogous to individual operators in SQL – load, sort, join, group, aggregate, and so on. Typically, each Pig statement accepts one or more datasets as inputs and returns a single dataset as an output. For instance, a Pig statement might accept two datasets as inputs and return the joined set as an output.
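The one-dataset-in, one-dataset-out pattern can be sketched in Pig Latin as follows; the file names and field names here are hypothetical, chosen only for illustration:

```pig
-- Hypothetical inputs: each LOAD yields a dataset (a "relation" in Pig terms)
users  = LOAD 'users.tsv'  AS (user_id:int, name:chararray);
orders = LOAD 'orders.tsv' AS (order_id:int, user_id:int, amount:double);

-- A single statement takes two datasets in and returns one joined dataset out
joined = JOIN users BY user_id, orders BY user_id;

DUMP joined;
```

Each name on the left-hand side (`users`, `orders`, `joined`) refers to a dataset produced by the statement, which later statements can consume in turn.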
For those familiar with SQL programming, Pig turns the programming model upside down. SQL is a non-procedural language: you specify the data you want, rather than outlining the sequence of operations to be executed. In contrast, Pig is explicitly procedural – the exact sequence of data operations is specified within your Pig code. For SQL gurus, a Pig program more closely resembles the execution plan of a SQL statement than the SQL statement itself.
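The contrast can be made concrete with a simple aggregation. The SQL version states only the desired result; the Pig version (again using hypothetical file and field names) spells out each step of the data flow in order:

```pig
-- Declarative SQL equivalent:
--   SELECT user_id, SUM(amount) AS total
--   FROM orders GROUP BY user_id ORDER BY total DESC;

orders  = LOAD 'orders.tsv' AS (order_id:int, user_id:int, amount:double);
grouped = GROUP orders BY user_id;                     -- step 1: group
totals  = FOREACH grouped GENERATE group AS user_id,   -- step 2: aggregate
                  SUM(orders.amount) AS total;
ranked  = ORDER totals BY total DESC;                  -- step 3: sort
STORE ranked INTO 'totals_by_user';
```

Reordering these statements changes the computation, which is exactly the procedural control an execution plan gives you and a SQL statement hides.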
SQL compilers and Hive's HiveQL compiler include optimizers that attempt to determine the most efficient way to resolve a SQL request. Pig is not heavily reliant on such an optimizer, since the execution plan is explicit. As the Pig gurus are fond of saying, “Pig uses the optimizer between your ears.”
Pig is extensible through user-defined functions (UDFs) written in Java. These can be used to implement complex business logic, bridge to other systems such as Mahout or R, and read from or write to external data sources.
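From the Pig side, a Java UDF is registered and then invoked like a built-in function. A minimal sketch, assuming a hypothetical jar `myudfs.jar` containing a user-written class `com.example.NormalizePhone`:

```pig
-- Register the jar containing the hypothetical Java UDF and give it a short alias
REGISTER 'myudfs.jar';
DEFINE NormalizePhone com.example.NormalizePhone();

-- Apply the UDF to each record, just like a built-in function
contacts = LOAD 'contacts.tsv' AS (name:chararray, phone:chararray);
cleaned  = FOREACH contacts GENERATE name, NormalizePhone(phone);
```

The Java class itself would extend Pig's `EvalFunc` base class; the Pig script needs only the jar and the class name.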
Pig’s processing philosophy is compatible with several other NoSQL systems. Since HBase is layered on top of Hadoop, it can be the target of Pig programs. Other systems – such as Cassandra – that integrate Hadoop MapReduce may also take advantage of Pig. Projects also exist to integrate Pig with other NoSQL systems, such as MongoDB.
Pig bridges the gap between Hive and Java MapReduce programming on Hadoop. It is nowhere near as simple or familiar as Hive’s SQL dialect, but it can be used to create complex, multi-step data flows that would otherwise be within reach only of very experienced Java programmers.