Big data has quickly become a transformative force, and it has even emerged as a consumer buzzword. Behind the scenes, much of the work is done by Hadoop. Created and maintained by Apache, Hadoop is software that splits data across many servers, simplifying the management of large volumes. Hadoop also assists with the processing of the data using a module called MapReduce.
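As a rough illustration of the processing model described above, the following minimal Python sketch counts words across several data splits. It runs in a single process and does not use Hadoop or its API. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample data are assumptions made for this example, and each inner list merely stands in for the portion of data held on one server.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(splits):
    """Run map over every split, then shuffle and reduce the results."""
    pairs = []
    for split in splits:  # each split stands in for data on one server
        for record in split:
            pairs.extend(map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Made-up sample data: two splits, one record each.
    splits = [["big data big"], ["data hadoop"]]
    print(word_count(splits))
```

In a real cluster the map and reduce steps would run in parallel on the machines that hold the data. The sketch only shows how the work divides into the two phases.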
Is a Focus on Hadoop Security Overdue?
The popularity of Hadoop has risen as fast as that of big data, with heavyweights such as eBay, Facebook, IBM, and Netflix using it for critical aspects of their businesses.1 But Hadoop has vaulted to prominence faster than its security has matured.
“It is a well-known fact that security was not a factor when Hadoop was initially developed,” says Kevin T. Smith of big data company Novetta. “As the initial use cases of Hadoop revolved around managing large amounts of public web data, confidentiality was not an issue. For Hadoop's initial purposes, it was always assumed that clusters would consist of cooperating, trusted machines used by trusted users in a trusted environment.”2
The need for better Hadoop security is widely acknowledged, and a variety of organizations from different corners of the industry are working to fix the problem. As Smith notes, there is a range of new commercial products: "Cloudera Sentry, IBM InfoSphere Optim Data Masking, Intel's secure Hadoop distribution, DataStax Enterprise, DataGuise for Hadoop, Protegrity Big Data Protector for Hadoop, Revelytix Loom, Zettaset Secure Data Warehouse, and the list could go on," he says. And there are open source efforts, including Apache's Accumulo and Knox Gateway.
One of the most significant third-party efforts is Intel's Project Rhino, which aims to help Hadoop better protect confidential data. According to Intel, the project's key goals are the following:
- "Framework support for encryption and key management
- "Common authorization engine for the Hadoop framework
- "Token-based authentication and single sign-on
- "Granular access control in HBase
- "Enhanced Auditing"3
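To make the idea of granular access control concrete, here is a minimal Python sketch in which every cell of a table carries a sensitivity label and a reader sees only the cells whose labels appear in that reader's authorizations. This is a conceptual illustration only, not the Project Rhino or HBase API. The Cell class, the readable_cells function, the label names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    label: str  # hypothetical sensitivity label attached to this cell


def readable_cells(cells, authorizations):
    """Return only the cells whose label the reader is authorized to see."""
    return [cell for cell in cells if cell.label in authorizations]


if __name__ == "__main__":
    # Made-up sample table with one public and one confidential cell.
    table = [
        Cell("user1", "name", "Alice", "public"),
        Cell("user1", "ssn", "000-00-0000", "confidential"),
    ]
    analyst = {"public"}  # hypothetical set of labels granted to one reader
    for cell in readable_cells(table, analyst):
        print(cell.row, cell.column, cell.value)
```

Filtering at the level of individual cells, rather than whole tables or files, is what makes the control granular: the same row can expose a name to one reader while withholding a confidential field.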
Hadoop Presents a Complex Security Challenge
Hadoop is not a turnkey product for launching a big data program. It is made up of several modules, and assembling the pieces requires specialized knowledge and customized programming work. These modules, as defined by Apache itself, are:
- "Hadoop Common: The common utilities that support the other Hadoop modules.
- "Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- "Hadoop YARN: A framework for job scheduling and cluster resource management.
- "Hadoop MapReduce: A YARN-based system for parallel processing of large data sets."4
But these modules are only part of Hadoop's "ecosystem," to borrow a term from Novetta’s Smith. Apache itself offers other tools that supplement Hadoop by providing needed capabilities. These tools include colorfully named software like Hive, Pig, and ZooKeeper. And third parties such as IBM, Microsoft, Oracle, and SAP are further expanding this ecosystem, notes Tony Baer, a big data analyst at Ovum.
The complexity and constantly changing nature of Hadoop's ecosystem make it difficult to secure.
Approach Big Data and Hadoop Cautiously
Even as big data and Hadoop are delivering significant benefits to major companies, they are still creating confusion. And this confusion is not just about how they work, but also about what their benefits are. In an article titled "Security Is the Least of Hadoop's Concerns," Matt Asay summarizes the result of a Gartner webinar this way: "Hadoop's biggest roadblock may well be that people can't figure out what they're supposed to do with it."5
The point is that until an organization has made a business case for using big data and Hadoop, technical questions about security can wait. Organizations can also consider other ways to perform the functions for which Hadoop is used.
Big Data Initiatives Need Strategic Governance
Big data primarily aims to provide business intelligence. It is not a tactical approach used by IT. Therefore, it is important for executives to govern big data programs, providing core strategic guidance. "You can't do big data without [information integration and governance]," says Michele Goetz, a senior analyst at Forrester Research.6 "When we looked at organizations that are embarking on big data or full force into it, their governance levels are so much more mature."
Hadoop and other big data technologies are extremely complex and specialized, and making them work securely entails stitching together many pieces. An organization that wants to set up a big data program would be wise to get help, not just for setting up Hadoop but for every aspect of ensuring big data security.
Follow Developments in Filling Hadoop Security Gaps
The great potential of big data is spurring the industry to quickly fill Hadoop's security gaps. To keep pace with these developments, organizations must keep a close watch on the new tools and practices being deployed. In particular, as big data uses an increasing amount of users' personal information, people will expect companies to adhere to the latest and best available standards for protecting confidentiality.
References
1 Van Manen, T. "Expert Talk: Anjul Bhambhri (IBM) on Big Data Paradigm Shifts, Hadoop and Transforming Data." Sogeti. June 2012.
2 Smith, K. T. "Big Data Security: The Evolution of Hadoop's Security Model." InfoQ. August 2013.
3 Intel. "Support and Contributions to the Apache Hadoop Community." Available online from: http://hadoop.intel.com/community.
4 Apache. "Welcome to Apache Hadoop." March 2014.
5 Asay, M. "Security Is the Least of Hadoop's Concerns." readwrite. January 2014.
6 Goetz, M. "You Can't Do Big Data Without Information Integration and Governance" (video). IBM.
About the Author
Geoff Keston is the author of more than 250 articles that help organizations find opportunities in business trends and technology. He also works directly with clients to develop communications strategies that improve processes and customer relationships. Keston has worked as a project manager for a major technology consulting and services company and is a Microsoft Certified Systems Engineer and a Certified Novell Administrator.
This article is based on a comprehensive report published by Faulkner Information Services, a division of Information Today, Inc., that provides a wide range of reports in the IT, telecommunications, and security fields. For more information, visit www.faulkner.com.
Copyright 2014, Faulkner Information Services. All rights reserved.