What is really going on in the big data space is the rise of inexpensive open source solutions that are facilitating the modernization of data centers and data warehouses, and at the center of this universe is Hadoop. In the evolution of the big data market, open source is playing a seminal role as the “disruptive technology” challenging the status quo. Organizations large and small are leveraging these solutions, often on inexpensive hardware and memory platforms, in the cloud or on premises.
At the same time, information is data and, from a legal perspective, big data is very important to general counsel, not only as part of the e-discovery process but also in terms of the overall management of big data within and outside the enterprise. This is significant in the context of compliance with industry, national, and global standards. Big data is already playing an important role in business litigation, and that role will undoubtedly only increase in the future. These are among the clear messages stressed by industry experts during DBTA’s 2-day Big Data Boot Camp in New York City.
Here are 10 key takeaways from the Big Data Boot Camp conference:
- Persistence, context, and access are what you need to think about when it comes to big data. Use Hadoop as a staging area for big data, develop a three-tiered architecture for your modern data management platform, and use Hadoop for more than archiving because it has analytical capabilities, according to John O’Brien of Radiant Advisors.
- Businesses will reap $3 trillion in business value from big data in the next several years, predicts McKinsey. “The only way to get at this value is to make big data more friendly,” says O’Brien.
- Data platform modernization doesn’t mean rip and replace. This was one of the most important takeaways from the modernization panel on day 1 of the conference. Panel experts recommended that attendees leverage the strengths of their existing databases while also modernizing by deploying Hadoop, MapReduce, and other open source technologies to manage big data and deliver its value to as many users as possible.
- Big data is big in many ways—big tables, big pictures, big text and big metadata, big velocity, big volume, big variety, and big veracity; and 80% of enterprise data is unstructured, according to George Everett of Applied Relevance and Brian Clark of Objectivity. Veracity equals the truth of the data.
- Lawyers are excited about big data, but it is important to differentiate relevant from irrelevant data and employ discretionary deletion. All Big Data Boot Camp legal presenters emphasized the importance of big data management, since that data in many ways automatically becomes a record. In the SAP-Oracle-TomorrowNow lawsuit, more than 3 terabytes of data were used in the discovery process that resulted in one of the largest awards in the technology industry.
- There is a major disconnect between IT and LOB. In what has become the continuing saga of the odd couple (think of Felix as the IT guy, and Oscar as the messy line-of-business professional), information technology professionals for the most part are still not working closely with line-of-business managers. In an informal poll taken during a conference presentation, only 4% of the audience indicated that it was working with LOB on business-related big data initiatives.
- RDBMSs make no sense for a number of workloads, according to Alex Gorbachev of Pythian: storing and processing images, video, and other large files; PDFs and Excel files; processing large blocks of natural language text; blog posts, job ads, XML, and log files; ad hoc, exploratory analytics; integrating data from many volatile external sources; data cleanup tasks (data wrangling); very advanced analytics (machine learning); and cases where business domain knowledge is not well defined.
- The key benefits of Hadoop are that it provides a reliable solution based on unreliable hardware; is designed for large files; enables a “load data first, structure later” approach; is designed to maximize the throughput of large scans, leverage parallelism, and scale; and provides a flexible development platform and a solution ecosystem.
- Key use cases for Hadoop include analysis of customer behavior, optimization of ad placements, customized promotions, and recommendation systems such as those of Netflix, Pandora, and Amazon, in addition to providing inexpensive archive storage with an ETL layer and a transformation and data cleansing engine. Pythian’s Gorbachev says one of the key differentiators for Hadoop is its ability to support 100 to 1,000 Hadoop cluster nodes, unlike traditional RDBMSs, which support maybe dozens of nodes. Start with 50 Hadoop clusters, not 10, he advises.
- The data being used in many big data projects is already sitting on machines or in storage systems within enterprises, as opposed to exotic or hard-to-capture data. The main types of data currently being dealt with in big data initiatives are existing production or transactional data, followed by real-time data feeds, as well as ERP and CRM data, according to the 2013 Big Data Opportunities survey sponsored by SAP. David Jonker of SAP presented the results of the survey, which was conducted by Unisphere Research just prior to the Big Data Boot Camp. The findings suggest that, in their current big data initiatives, many organizations are trying to deal more effectively with traditional data.
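The “load data first, structure later” approach listed among the Hadoop benefits above can be illustrated with a small sketch. This is plain Python rather than Hadoop code, and the sample records and the `read_with_schema` helper are hypothetical, invented here for illustration only: raw lines are stored untouched, and a structure is applied only when the data is read.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no schema enforced on load).
RAW_LINES = [
    '{"user": "alice", "action": "view", "item": "A1"}',
    '{"user": "bob", "action": "buy", "item": "B7", "price": 19.99}',
    'not valid json',  # malformed input is still loaded; it is dealt with at read time
]


def read_with_schema(lines, fields):
    """Apply a structure at read time by keeping only the requested fields.

    Lines that cannot be parsed as JSON objects are skipped;
    fields missing from a record become None.
    """
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        rows.append({name: record.get(name) for name in fields})
    return rows


# Two different structures imposed on the same stored data.
purchases = read_with_schema(RAW_LINES, ["user", "price"])
activity = read_with_schema(RAW_LINES, ["user", "action"])
```

The same stored lines serve both views, which is the point of deferring structure: a new question about the data needs only a new set of fields at read time, not a reload.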
Memorable Big Data Boot Camp Conference Sound Bites
• Data is the key asset of the corporation. —Fred Gallagher, Actian
• Google white papers drove the evolution of Hadoop. —Amir Halfon, MarkLogic
• Execute transformations of data warehouses in SQL. —Robert Hodges, Continuent
• Marketo manages 225 terabytes of SQL data. —Hodges, Continuent
• Differentiate relevant from irrelevant data and don’t retain it. —James Dawson, KPMG
• Largest internet data records theft: 77 million. —Scott Zoldi, FICO
• Big data architecture is everything. —Brian Clark, Objectivity
• HBase makes Hadoop a real-time system. —Alex Gorbachev, Pythian
• Most fun conference quote: Those that fail to learn from history are doomed to repeat it. —Winston Churchill, thanks Mouli Venkatesan, MEICS
• Most referenced book: "The Fourth Paradigm: Data-Intensive Scientific Discovery"
DBTA’s Big Data Boot Camp was indeed an intense journey through many of the important aspects of big data management. More importantly, it provided seminal insights into how to manage and leverage big data, both for competitive advantage and for compliance with established standards.