I have a growing sense that big data is approaching its “trough of disillusionment.” For those who haven’t encountered the term, the “trough of disillusionment” is a standard phase within the Gartner hype cycle: new technologies are expected to pass from a “peak of inflated expectations” through the trough of disillusionment before eventually reaching the “plateau of productivity.”
Most new technologies are expected to go through this trough, so it’s hardly surprising to find big data entering this phase. And, of course, the bigger they are, the harder they fall—expectations for big data have been so high that the potential for disappointment is all the greater. But, in the case of big data, it has been obvious for a long time that many early adopters were being set up for disappointment. At their core, big data projects have two critical success factors: first, establishing the mechanisms for acquiring, storing, and processing massive amounts of data, and second, developing the algorithms that effectively leverage that data for competitive or other advantage.
The first of these is not trivial, but technologies such as Hadoop and Spark at least provide a fairly accessible recipe for success. However, the second factor—which is directly tied to the overall payback of a big data project—is much harder to achieve.
Developing the sort of algorithmic breakthroughs that big data projects promise requires at least three distinct ingredients: strong statistical and data mining expertise, the ability to create software that implements these algorithms, and the business savvy to identify the problems the algorithms should solve. Early successful big data companies such as Google and Amazon were able to hire rock star data scientists who combined all three ingredients. However, there simply aren’t enough of these rock star data scientists to go around. In addition, while universities are graduating an increasing number of appropriately trained professionals, a truly successful data scientist needs years of experience to develop the judgment and imagination required to create innovative big data solutions.
In my opinion, a big part of the problem is that data science is still bogged down in the minutiae of specific mathematical algorithms. Open any book on data science, and you are likely to begin with a long discussion of some seriously complex mathematical techniques. You need to learn the differences among dozens of algorithms such as K-Nearest Neighbors, Support Vector Machines, logistic regression, and K-Means before you can aspire to develop machine learning solutions.
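To make that burden concrete, here is a minimal sketch of the algorithm-level decisions a practitioner faces today. The choice of scikit-learn and the toy dataset are my own illustrative assumptions; the point is simply that the same classification task can be attacked with several very different algorithms, each with its own parameters to understand and tune.

```python
# Illustrative sketch only (library choice and toy data are my assumptions):
# today the practitioner must pick the algorithm up front, and each algorithm
# brings its own tuning parameters.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three different algorithms for the same classification problem.
candidates = {
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "support vector machine": SVC(kernel="rbf", C=1.0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```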
However, the exact algorithm employed in a data science project is not necessarily decisive. Generally, data science problems involve a few high-level techniques—extrapolation, clustering, and classification. We need software packages that work at a higher level of abstraction, hiding the details of algorithms while allowing data scientists to work at the solution level.
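As a rough illustration of that idea, each high-level task can be served by several interchangeable low-level algorithms whose details could be hidden from the user. The mapping below is my own sketch, not the API of KeystoneML or any other framework.

```python
# My own illustrative mapping: a handful of high-level tasks, each backed by
# several interchangeable algorithms whose details could be hidden from the user.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, SVR

TASK_TO_ALGORITHMS = {
    "extrapolation": [LinearRegression, SVR],
    "clustering": [KMeans, DBSCAN],
    "classification": [LogisticRegression, SVC, KNeighborsClassifier],
}
```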
KeystoneML—one of the latest offerings from the AMPLab group, which gave us Spark—attempts to provide such a high-level tool. KeystoneML allows data science problems to be specified in general terms, with the selection of algorithms left to the discretion of the framework. KeystoneML is analogous to the database query optimizer that is ubiquitous within relational database systems. In an SQL statement, you specify the results you want without having to specify the exact access path you expect the database to use. The optimizer determines the most effective path using sophisticated algorithms. In the same way, KeystoneML does not require you to specify the exact algorithms to be used to solve your machine learning problem; the KeystoneML engine will determine the best algorithms to achieve that goal.
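KeystoneML itself is a Scala library built on Spark, and I won’t attempt to reproduce its API here. But the following hypothetical Python sketch conveys the spirit of working at the solution level: the caller names the problem, and the framework tries the candidate algorithms and picks the best, much as a query optimizer picks an access path. The function name solve_classification, its candidate list, and the cross-validation scoring rule are all my own inventions for illustration.

```python
# Hypothetical "solution level" interface, in the spirit of KeystoneML and of a
# query optimizer: the caller states the goal, the framework picks the algorithm.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def solve_classification(X, y):
    """Fit a classifier without making the caller choose the algorithm."""
    candidates = [
        LogisticRegression(max_iter=1000),
        SVC(),
        KNeighborsClassifier(),
    ]
    best_model, best_score = None, float("-inf")
    for model in candidates:
        # The framework, not the data scientist, evaluates each algorithm.
        score = cross_val_score(model, X, y, cv=5).mean()
        if score > best_score:
            best_model, best_score = model, score
    return best_model.fit(X, y)

# Usage: the caller never mentions K-Nearest Neighbors, SVM, or logistic regression.
# model = solve_classification(X_train, y_train)
```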
Increasing the productivity of data scientists is going to be essential if big data projects are to pay off. Initiatives such as KeystoneML are a step in the right direction.