For enterprises grappling with the onslaught of big data, a new platform has emerged from the open source world that promises a cost-effective way to store and process many petabytes of information. Hadoop, an Apache project, is already being eagerly embraced by data managers and technologists as a way to manage and analyze mountains of data streaming in from websites and devices. Running data such as weblogs through traditional platforms such as data warehouses or standard analytical toolsets often cannot be cost-justified, because those solutions carry high overhead costs. Yet organizations are beginning to recognize that such information ultimately can be of tremendous value to the business. Hadoop packages up such data and makes it digestible.
For this reason, Hadoop "has become a sought-after commodity in the enterprise space over the past year," Anjul Bhambhri, vice president of Big Data Products at IBM, tells DBTA. "It is cost-effective, and it allows businesses to conduct analysis of larger data sets and on more information types than ever before, unlocking key information and insight." Hadoop couldn't have arrived on the scene at a better time, since it is estimated that 2.5 quintillion bytes of data are now being created every day, she adds.
The advantage that Hadoop provides is that it enables enterprises to store and analyze large data sets with virtually no size limits. "We often talk about users needing to throw data on the floor because they cannot store it," says Alan Gates, co-founder of Hortonworks and an original member of the engineering team that took the Pig subproject (a data analysis tool that runs on Hadoop) from a Yahoo! Labs research project to an Apache open source project. "Hadoop addresses this, and does more," he tells DBTA. "By supporting the storage and processing of unstructured and semi-structured data, it allows users to derive value from data that they would not otherwise be able to. This fundamentally changes the data platform market. What was waste to be thrown out before is now a resource to be mined."
Hadoop is an open source software framework originally created by Doug Cutting, an engineer with Yahoo! at the time, and named after his son's toy elephant. The Hadoop framework includes the Hadoop Distributed File System (HDFS), which stores files across clustered storage nodes and is designed to scale to tens of petabytes of storage. Prominent Hadoop users include web and social media giants such as Facebook, Twitter, LinkedIn, Yahoo!, and Amazon. Facebook's Hadoop cluster is estimated to be the largest in the industry, with reportedly more than 30 petabytes of storage.
But it's no longer just the social media giants that are interested in Hadoop, as their leaders point out. Data managers at mainstream companies wrestling with big data are growing just as enthusiastic about the potential Hadoop offers to bring that data under control, says Peter Skomoroch, principal data scientist at LinkedIn. Hadoop "is a disruptive force, hitting the mainstream and being adopted by the big players in the Fortune 100," he tells DBTA. "A year ago, Apache Hadoop needed to mature as a platform, particularly in security, and further define enterprise adoption outside of the consumer web space. Today, Hadoop has hit the milestone of a 1.0 release and the community has put a significant amount of thought and effort into security, with government organizations and large established companies making Hadoop a key part of their data strategies."
Ecosystem
Hadoop also includes a robust tools ecosystem. At its center is the MapReduce engine, an implementation of the programming model originally designed by Google, whose JobTracker and TaskTracker services schedule work on the same nodes that hold the data, or as close to them as possible, thereby reducing latency. Additional tools include ZooKeeper, a configuration and coordination service; Sqoop (SQL-to-Hadoop), a tool for moving data between relational databases and Hadoop; Hive, a Hadoop-centric data warehouse infrastructure; Pig, a high-level platform for writing analysis programs that run as parallel jobs on Hadoop; Oozie, a workflow service; Hue, a graphical user interface for Hadoop; and Chukwa, a monitoring and data collection system for large Hadoop clusters.
Applications
While there is a lot of hype and excitement around Hadoop, David Gorbet, vice president of product strategy at MarkLogic Corp., urges companies to step back and evaluate their big data needs. "At its core, Hadoop was born out of a need for a parallel compute framework for large-scale batch processing," he tells DBTA. "Hadoop is exciting because it presents a new option for solving computationally expensive problems on commodity hardware. By breaking the problem up and farming pieces of it out to be executed in parallel by 'map' functions, and then rolling up the result in a 'reduce' function, it allows faster processing of data to enable complex tasks like finding, formatting or enriching unstructured data."
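To make the map and reduce steps that Gorbet describes more concrete, the following is a minimal sketch of a Hadoop Streaming job written in Python that counts page requests per URL in a weblog. The file names, field positions, and log layout are assumptions for illustration, not a prescription from any of the vendors quoted here.

```python
#!/usr/bin/env python
# mapper.py -- emits one (url, 1) pair per weblog line.
# Assumes a space-delimited log with the requested URL in the 7th field,
# as in a typical Apache access-log layout; adjust the index for other formats.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 6:
        print("%s\t%d" % (fields[6], 1))
```

```python
#!/usr/bin/env python
# reducer.py -- sums the counts for each URL.
# Hadoop Streaming sorts mapper output by key, so identical URLs arrive together.
import sys

current_url, current_count = None, 0
for line in sys.stdin:
    if not line.strip():
        continue
    url, count = line.rstrip("\n").split("\t")
    if url == current_url:
        current_count += int(count)
    else:
        if current_url is not None:
            print("%s\t%d" % (current_url, current_count))
        current_url, current_count = url, int(count)
if current_url is not None:
    print("%s\t%d" % (current_url, current_count))
```

A job along these lines would typically be submitted through the Hadoop Streaming jar (roughly, `hadoop jar hadoop-streaming.jar -input weblogs -output hits -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py`) and can be dry-run locally with `cat access.log | ./mapper.py | sort | ./reducer.py`, where the local sort stands in for Hadoop's shuffle.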
One of the most compelling value propositions for Hadoop, in combination with MapReduce, is the ability to apply analytics against large sets of data. From this perspective, "the current primary value of Hadoop is low cost storage for large volumes of data, along with support for parallel processing," Mark Troester, data management and IT strategist with SAS, tells DBTA.
"For my company, Hadoop means we can analyze extremely large sets of data at a very localized level to help us buy the best impressions for our customers' digital advertising campaigns," says Kurt Carlson, CTO of MaxPoint Interactive.
A popular application for Hadoop has been the ability to turn around large data processing jobs in a very short time. "We have seen broad adoption of Hadoop by users looking to shrink the time to run a batch job from weeks to hours," says Max Schireson, president of 10gen. "By giving engineers a flexible batch processing framework on which to program their batch processing jobs, Hadoop has gained enormous traction very quickly."
Industry observers also point out that we are only at the beginning stages of the innovations Hadoop will bring to data management operations. "We identified predictive analytics, visualization, and packages atop Hadoop Core to address a larger scope of problems," Murali Raghavan, senior vice president and head of Horizontal Services at iGATE Patni, tells DBTA. "The next phase around data cleanup will be the real-time analysis and decision-making, and we're looking at using those technologies to help our clients make more informed decisions based off their data."
The next stage for Hadoop will be "for processing analytic use cases directly in Hadoop," says Troester. "However, before analysis applications can fully leverage Hadoop, those solutions need to be able to identify the relevant data. Ideally, this would be done using a visual exploratory capability that is aware of the organizational context. The organizational context consideration is especially important, since data streams in. If an organization can identify information that is relevant and timely based on organizational knowledge built into email, wikis, product categories and other sources, it can better determine what data to analyze and process. The data that is not relevant at that point in time can be placed on cheap, commodity-based storage for later use."
Enterprise Infancy
While the technology holds great promise for enterprises, Hadoop is not without its challenges, especially for early adopters of the technology. First and foremost is, simply, the fact that the solution is still relatively new, especially when brought into the realm of enterprise applications. "Hadoop is in its enterprise infancy," Ramon Chen, vice president of product management for RainStor, points out. "Lack of security and a single point of failure are just a few of its shortcomings, which are actively being addressed by the open source community. The great thing about Hadoop is that you can simply add nodes as you need to scale. While this seems manageable due to the low purchase costs, the hidden economic downside is the substantial ongoing administration and operating costs of a significant number of nodes."
This calls for knowledge and skills that many enterprises don't have under their roofs yet. "Hadoop's complexity and the current lack of skills to support it are two of the problems," IBM's Bhambhri cautions. The good news, she adds, is that "these challenges can be addressed with training options that are currently available." Leading vendors now offer Hadoop training to IT professionals. For its part, IBM now offers a Hadoop Fundamentals course, and "BigDataUniversity.com offers hundreds of tutorials, videos and exercises on the basics of Hadoop and other topics such as stream computing, open source software development, and database management techniques," Bhambhri adds.
"Hadoop is a highly powerful big data framework but it is not user-friendly," agrees Yves de Montcheuil, vice president of marketing at Talend. "Developers need to develop highly specialized skills to write MapReduce programs. And it is certainly not usable by business analysts."
Along with a dearth of skills, there are additional limitations, as pointed out by Troester. "These include the fact that building and maintaining a Hadoop cluster is complicated and expensive," he tells DBTA. "Better tools are needed; the Hadoop core structure is fluid; file systems vary; and Hadoop was originally designed for off-line batch processing. Most of these limitations need to be addressed by the open source and vendor community focused on basic Hadoop distribution."
Hadoop "is meant to complement existing data technology, not replace it," L.N. Balaji, president of ITC Infotech (USA), points out. "It works best for read-intensive processing as opposed to data that requires frequent updates. But it is essentially batch-oriented. Once it starts processing a data set, it can't easily be updated. Thus, it is not appropriate for real-time uses."
However, Gates points out that Hadoop's batch-oriented processing "is not a flaw; it is a design choice. Hadoop was designed to handle very large volumes of data and processing that can be broken into many parallel units. This means it performs very well for algorithms that need to scan vast amounts of data, such as fraud detection or web user session analysis, and for situations where data can be stored in a large distributed hash, such as Facebook using it to store the messages for its mail. These same design choices make Hadoop a poor choice for algorithms that require transactions or locks across different data sources, for example, a shopping cart application where the user's shopping cart and the store's inventory need to be updated in one transaction. These OLTP problems are much better handled by RDBMSs."
Gates agrees, however, that "Hadoop falls short on usability," noting that the framework needs to integrate well with existing monitoring and management tools so that data center administrators can use the tools they already have. In addition, data movement is not yet adequately addressed, he adds. "Getting data into and out of Hadoop is often still a manual process or depends on very young, immature tools. Vendors are starting to address this, but their products are early in their lifecycle."
Some industry observers also question how efficiently Hadoop can manage big data. "Due to the discrete nature of this big data, Hadoop is exceptional at simple structuring, aggregation and transformation," Robert Greene, vice president of technology at Versant, points out. "However, Hadoop is not good at the other side of the story, creating the links that lead to big complex data, though many are trying to force-fit Hadoop into this role," he tells DBTA. "This is intuitively obvious as linking removes the property of being discrete. Data that is no longer discrete is not easily processed as a parallel operation by using the same architectural brute force for transformation and aggregation."
At the same time, enterprises need to adopt data management capabilities that are Hadoop-aware. These include "basic data integration, data quality and governance," says Troester. "These capabilities should provide native capability for leveraging the power of Hadoop, pushing down the appropriate processing into Hadoop and leveraging parallelization. Hadoop also needs the best possible analytics. SAS analytics, storage, and parallel processing capabilities perfectly complement Hadoop. In practice, this means allowing SAS functionality to be executed as MapReduce code. In addition, support for Hadoop Pig, Hive, and MapReduce is needed directly within the data management tools."
Sample No More
Most early enterprise implementations attempt to fit Hadoop into existing, constrained analytical tools that can only handle samples of the data, says Gates. Ultimately, the power of Hadoop is to enable analytics across an entire data set. "We see many users doing large data crunching in Hadoop and then extracting slices of data into an RDBMS or OLAP system for doing reporting or ad hoc queries," he explains. "Hadoop is not yet full-featured enough to support these functions in a way users are accustomed to. But this limits the power of Hadoop. Now instead of reporting on all of the data, users are reporting on aggregated subsets, or on samples. In time, reporting and analytics tools need to evolve to use Hadoop's parallelism so they can access all of the data, instead of a slice."
However, Versant's Greene offers that this capability opens up new challenges, as well. "The relational database does successive queries and applies the set operators to find data matches, re-establishing the links, pipelining a set operation for each subsequent query into the results of the current query," he explains. "In Hadoop, the same basic processes occur, except it is not something that gets automated by the engine. Instead, the subsequent level of query in Hadoop, a query being a MapReduce operation resulting in a map, must get 'joined' with the current MapReduce operation's resulting map, by the developer writing some custom code to do the set operation."
While this workaround provides querying capabilities in Hadoop, "it is highly inefficient and unmanageable in both development and production within a complex system," Greene continues. "Ultimately, trying to force Hadoop into this paradigm to solve the linking problem carries the same computational costs found in a classic relational database solution. Even worse, there are no indexes to speed things up. It just gets us right back where we were at the start with scalability and performance problems when dealing with big complex data." NoSQL databases need to be employed in conjunction with Hadoop to help resolve these issues, he adds.
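As a rough illustration of the hand-coded "join" work Greene describes, the sketch below shows a reduce-side join written for Hadoop Streaming in Python. The file names and record layouts (a customers file and an orders file sharing a customer_id key) are hypothetical; the point is simply that the developer, not the engine, has to tag each record with its source and reassemble the matches in the reducer.

```python
#!/usr/bin/env python
# join_mapper.py -- tags each record with its source so the reducer can
# tell the two (hypothetical) data sets apart:
#   customers.csv: customer_id,name            -> 2 fields, tag "C"
#   orders.csv:    customer_id,order_id,amount -> 3 fields, tag "O"
# A production job would more likely tag by input file name; distinguishing
# by field count keeps this sketch self-contained.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if not fields[0]:
        continue
    tag = "C" if len(fields) == 2 else "O"
    # Emit: join_key <TAB> tag <TAB> remaining fields
    print("%s\t%s\t%s" % (fields[0], tag, ",".join(fields[1:])))
```

```python
#!/usr/bin/env python
# join_reducer.py -- records with the same customer_id arrive together
# (Hadoop Streaming groups map output by its first tab-separated field),
# so the reducer buffers one key's records and emits the joined rows.
import sys

def flush(key, customers, orders):
    # One joined line per (customer, order) pair for this key.
    for c in customers:
        for o in orders:
            print("%s\t%s\t%s" % (key, c, o))

current_key, customers, orders = None, [], []
for line in sys.stdin:
    if not line.strip():
        continue
    key, tag, rest = line.rstrip("\n").split("\t", 2)
    if key != current_key:
        if current_key is not None:
            flush(current_key, customers, orders)
        current_key, customers, orders = key, [], []
    (customers if tag == "C" else orders).append(rest)
if current_key is not None:
    flush(current_key, customers, orders)
```

Run outside a cluster, `cat customers.csv orders.csv | ./join_mapper.py | sort | ./join_reducer.py` approximates the shuffle-and-sort step. Each further level of linking means another hand-written stage chained onto this one, which is the development and operational burden Greene points to, and the gap he sees NoSQL databases filling alongside Hadoop.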
Proponents are quick to point out that Hadoop will only get stronger and more robust as enterprises adopt the technology to its fullest potential, and the skills base expands. Ultimately, "widespread adoption and a stable platform means developers can extend existing libraries, instead of everyone reinventing the wheel," Skomoroch points out. "The future will be in the programming languages and applications built on top of Hadoop. We'll see a number of specialized applications emerge within verticals like finance, biotech, marketing, and retail that take advantage of this new platform. In the 1980s the PC changed the way we live and work. The current workplace has been revolutionized by the growth of the internet and mobile devices. The next major shift will be around big data, and Hadoop is at the foundation of that revolution. Storage gets cheaper every day, and if you have more data you can make better decisions than competitors. In 5 years, everyone will be an analyst."