A.M. Turing Award Laureate and database technology pioneer Michael Stonebraker delivered the welcome keynote at Data Summit 2019, titled “Big Data, Technological Disruption, and the 800-Pound Gorilla in the Corner.”
Presently, he serves as an advisor to VoltDB, CTO of Paradigm4 and Tamr, Inc., and adjunct professor of computer science at MIT, where he is co-director of the Intel Science and Technology Center focused on big data.
Speaking to the Data Summit audience, Stonebraker touched on many of the considerations and decisions facing organizations as they navigate the big data landscape, including concerns about the volume, velocity, and variety of data.
Increasingly, some people are worried about complex analytics, while others say that too much data is the concern, or that variety, with data coming from so many places, is the problem, Stonebraker noted.
SQL analytics on big data has been well addressed by data warehouses, but “the fly in the ointment” for the data warehouse crowd is the cloud, because cloud vendors all play by different rules. Customers have to choose bundles of storage, computing, and networking, so it behooves people to get smart about any vendor’s offering because the pricing can vary dramatically.
“The bigger issue is that warehouses are yesterday’s problems,” said Stonebraker. Data science will supersede business intelligence as soon as enterprises can hire enough competent data scientists, and it is very different stuff: it does not look like SQL. Technically, it means non-SQL data analytics or machine learning.
“Data science is in your future, and if you are not worried about it, you should start.”
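To make that contrast concrete, here is a minimal sketch (not an example from the talk) of a BI-style question answered with a SQL aggregate next to a data-science-style question answered with a fitted model; the sales table, columns, and figures are made up for illustration.

```python
import sqlite3
import numpy as np

# Toy in-memory table standing in for warehouse data; all values are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month INTEGER, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 100.0), (2, 110.0), (3, 125.0), (4, 138.0)])

# Business intelligence: a SQL aggregate over history.
total = conn.execute("SELECT SUM(revenue) FROM sales").fetchone()[0]
print("Total revenue to date:", total)

# Data science: fit a simple model and predict forward; nothing here looks like SQL.
months, revenue = zip(*conn.execute("SELECT month, revenue FROM sales"))
slope, intercept = np.polyfit(months, revenue, deg=1)
print("Forecast for month 5:", slope * 5 + intercept)
```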
Machine learning requires training data, and deep learning requires a lot of training data, he added, noting that deep learning is a black box and not explainable, although maybe one day it will be. However, if you need to explain results to your customers, don’t even think about deep learning.
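As one illustration of the explainability point (again, not an example from the talk), a small decision tree trained on a standard toy dataset produces rules that can be shown to a customer, the kind of artifact a deep network cannot hand over directly:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# A tiny, interpretable model on a standard toy dataset.
data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# Human-readable decision rules that can be explained to a customer.
print(export_text(tree, feature_names=list(data.feature_names)))
```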
Get familiar with higher-level packages and expect to use a lot of custom hardware to make it work, advised Stonebraker. “All of this has no data management and no persistence so data management and persistence is going to be a big problem not currently addressed.”
Looking at Hadoop, Stonebraker observed that the term now commonly means a complete stack, with HDFS at the bottom and other technologies on top, while marketing teams have moved to selling HDFS as a platform for data lakes.
All kinds of applications are seeing velocity go through the roof, said Stonebraker, citing the sensor tagging of everything, smartphones, and the need to record the state of multiplayer internet games as a few of the contributors to the demand for greater velocity.
Comparing SQL, NoSQL, and NewSQL data management systems, Stonebraker pointed to NewSQL entrants to the market, with main memory DBMSs, high availability, and concurrency control, as offering the potential to address many issues. “Put a lot of hardware behind a main memory system and you can do a lot of transactions,” said Stonebraker, adding, “I expect NewSQL systems to keep up with whatever problems you have.”
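The transaction pattern he describes can be sketched in a few lines. The example below uses SQLite’s in-memory mode purely as a stand-in for a main memory engine (it is not a NewSQL system), and the table and row count are made up; the point is only to show many small transactions committed against an in-memory store.

```python
import sqlite3
import time

# SQLite in-memory mode as a stand-in for a main memory DBMS (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

n = 100_000
start = time.perf_counter()
for i in range(n):
    with conn:  # each loop iteration commits as its own small transaction
        conn.execute("INSERT INTO orders (id, amount) VALUES (?, ?)", (i, 9.99))
elapsed = time.perf_counter() - start

print(f"{n / elapsed:,.0f} single-row transactions per second (one thread, in memory)")
```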
So, where is the 800-pound gorilla, if it is not in volume or velocity, asked Stonebraker.
The problem, said Stonebraker, is that to do their work, data scientists must locate huge quantities of relevant data and then integrate and cleanse it, because you can’t analyze dirty data. He cited the example of a data scientist at iRobot who said, “I spend 90% of my time finding and cleaning data and then 90% of the other 10% checking on the cleaning.”
For each local data source, Stonebraker said, people have to do the following (sketched in code after the list):
- ingest the source
- perform transformation (for example, dollars to euros)
- perform data cleansing (a rule of thumb is that at least 10% of the data is wrong or missing)
- perform schema integration
- perform deduplication
- find golden values in clusters of duplicates
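Here is a minimal sketch of a few of these per-source steps (transformation, cleansing, deduplication, and golden values) using pandas; the column names, exchange rate, and toy records are assumptions made for illustration, and ingestion and schema integration are left out.

```python
import pandas as pd

USD_TO_EUR = 0.92  # hypothetical fixed rate, for illustration only

# Toy records from one hypothetical local source.
raw = pd.DataFrame({
    "customer": ["Acme Corp", "Acme Corp", "Globex", None],
    "amount_usd": [1000.0, 1000.0, 250.0, 75.0],
})

# Transformation: for example, dollars to euros.
raw["amount_eur"] = raw["amount_usd"] * USD_TO_EUR

# Cleansing: drop records missing required fields (per the rule of thumb,
# at least 10% of the data will be wrong or missing, so this is never a no-op).
clean = raw.dropna(subset=["customer"])

# Deduplication: collapse exact duplicate rows.
deduped = clean.drop_duplicates()

# Golden values: within each cluster of duplicates (keyed here by customer),
# keep a single representative value, e.g. the maximum reported amount.
golden = deduped.groupby("customer", as_index=False)["amount_eur"].max()
print(golden)
```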
“Think about 38 hours a week doing other stuff and 2 hours a week doing data science,” said Stonebraker, adding that the key thing is that you can’t analyze dirty data, because garbage in will produce garbage.
This is the big enterprise problem, said Stonebraker. There are solutions such as ETL packages plus MDM tools, but they require too much manual effort and won’t scale, and the easy-to-use data preparation solutions work only for simple problems. Machine learning and statistics are what is needed to address the limitations of ETL. Yet, first, the data has to be integrated and cleansed.
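To suggest what machine learning applied to this problem can look like, here is a minimal sketch of ML-assisted duplicate detection; the similarity features, toy records, and labels are assumptions made for illustration, not any vendor’s actual method.

```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def similarity_features(a, b):
    """Pairwise string-similarity features over two fields of a record."""
    return [
        SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio(),
        SequenceMatcher(None, a["city"].lower(), b["city"].lower()).ratio(),
    ]

# Hypothetical labeled pairs: 1 = same real-world entity, 0 = different.
labeled_pairs = [
    ({"name": "Acme Corp", "city": "Boston"}, {"name": "ACME Corporation", "city": "Boston"}, 1),
    ({"name": "Acme Corp", "city": "Boston"}, {"name": "Apex Industries", "city": "Denver"}, 0),
    ({"name": "Globex", "city": "Springfield"}, {"name": "Globex Inc.", "city": "Springfield"}, 1),
    ({"name": "Globex", "city": "Springfield"}, {"name": "Initech", "city": "Austin"}, 0),
]
X = [similarity_features(a, b) for a, b, _ in labeled_pairs]
y = [label for _, _, label in labeled_pairs]

# The model learns how to weight the similarity signals instead of relying on
# hand-tuned rules, which is what lets the approach scale past manual ETL/MDM effort.
model = LogisticRegression().fit(X, y)

candidate = ({"name": "ACME Corp.", "city": "Boston"},
             {"name": "Acme Corporation", "city": "Boston"})
print("P(duplicate):", model.predict_proba([similarity_features(*candidate)])[0][1])
```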
“Even on simple analytics, data quality matters,” said Stonebraker. That makes the wide variety of data sources, and the need to integrate that data and ensure it is of high quality, a critical issue facing enterprises today.
Many presenters are making their slide decks available on the Data Summit 2019 website at www.dbta.com/DataSummit/2019/Presentations.aspx.