“As companies seek to incorporate machine learning and AI models, they quickly realize they must completely rethink their approach to data center infrastructure,” said Surya Varanasi, founder and CTO of Vexata. “Typical systems do not have the right architecture to support the significantly larger throughput required for data ingest and high-volume I/O processing. The problem is that these older systems use legacy architectures that introduce bottlenecks that can throttle system performance, backlogging transactions, delaying analytics outcomes, and increasing response time, which results in poor customer experience.”
For data managers, it’s also a matter of broadening the foundations of their data assets beyond traditional relational databases. “Much of the raw data enterprises want to use for machine learning is historically stored in relational databases,” said Joe Pasqua, EVP of products at MarkLogic. “The information about a single entity, like a customer, is split across many tables, and that is repeated across many systems. This is a lousy form to feed to a machine learning system.” Instead, Pasqua suggests the ideal data should be denormalized with “all the information about a particular customer gathered together in one place. Document databases do this naturally. It’s one JSON document versus various columns spread across 20 different tables. That also makes it much easier to deal with the growing masses of unstructured data that are critical for many AI tasks.”
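To make Pasqua’s point concrete, here is a minimal Python sketch of that denormalization step. The table rows and field names are invented for illustration, not drawn from MarkLogic’s actual API:

```python
import json

# Rows as they might come back from three relational tables, all keyed on
# the same customer_id. Table and field names are invented for illustration.
customers = [{"customer_id": 42, "name": "Acme Corp", "tier": "gold"}]
orders = [
    {"customer_id": 42, "order_id": 1001, "total": 250.00},
    {"customer_id": 42, "order_id": 1002, "total": 75.50},
]
tickets = [{"customer_id": 42, "ticket_id": "T-9", "status": "closed"}]

def denormalize(customer_id):
    """Gather everything about one customer into a single document."""
    doc = dict(next(c for c in customers if c["customer_id"] == customer_id))
    doc["orders"] = [o for o in orders if o["customer_id"] == customer_id]
    doc["tickets"] = [t for t in tickets if t["customer_id"] == customer_id]
    return doc

# One self-contained JSON document instead of rows scattered across tables.
print(json.dumps(denormalize(42), indent=2))
```

The result is the single customer document Pasqua describes, ready to feed to a machine learning pipeline without joins at training time.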
Basic, commonsense data administration challenges come into play when moving to cognitive approaches. For example, said Matt Watts, director of technology and strategy at NetApp, “If we consider an IoT workflow, then you’ve got five stages to address: Where is data going to be collected and what will collect it? How will you transport it from the edge to the core? How will you cost-effectively store it? Do you have the performance necessary to analyze it? And, how will you archive data as it ages and its value decreases?” It is important to get the data fabric right from the beginning, Watts emphasized. “You also need to ensure that what you’re collecting is diverse enough to give you real-world variety; if you limit the scope of your data too much then you’ll limit the scope of the answers.”
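As an illustration of the last of Watts’ five stages, the sketch below shows one way an age-based archiving policy might be expressed. The tier names and thresholds are invented for this example:

```python
from datetime import datetime, timedelta

# A hypothetical age-based tiering policy for the archiving stage: as data
# ages and its value decreases, it moves to progressively cheaper storage.
TIERS = [
    (timedelta(days=30), "hot"),    # recent data, full analytics performance
    (timedelta(days=365), "warm"),  # older data, cheaper and slower storage
]

def tier_for(collected_at, now=None):
    """Return the storage tier a record belongs in, given its age."""
    age = (now or datetime.now()) - collected_at
    for max_age, tier in TIERS:
        if age <= max_age:
            return tier
    return "cold"  # everything older gets archived

print(tier_for(datetime.now() - timedelta(days=10)))   # -> hot
print(tier_for(datetime.now() - timedelta(days=400)))  # -> cold
```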
Many of the same challenges that apply to analytics applications “apply to AI and machine learning as well,” Pasqua agreed. “One challenge that many organizations don’t face up to—until it is too late—is data governance. Where did this data come from? Does it contain private information? Under what agreement was it captured? Who is allowed to see it and for what purpose? Businesses can amass and prepare all the data they want, but unless they can answer questions like these, they can’t be comfortable using the data for AI.”
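One way to picture the bookkeeping Pasqua describes is a provenance record that travels with each dataset. The sketch below is a hypothetical illustration, with invented fields and values, of how those governance questions might be made checkable:

```python
from dataclasses import dataclass, field

# A hypothetical provenance record answering Pasqua's governance questions
# before a dataset is cleared for model training. All fields are invented.
@dataclass
class DatasetProvenance:
    source: str                    # where did this data come from?
    contains_pii: bool             # does it contain private information?
    consent_basis: str             # under what agreement was it captured?
    allowed_roles: list = field(default_factory=list)     # who may see it?
    allowed_purposes: list = field(default_factory=list)  # for what purpose?

    def usable_for(self, role: str, purpose: str) -> bool:
        return role in self.allowed_roles and purpose in self.allowed_purposes

crm_export = DatasetProvenance(
    source="crm_export_q3",
    contains_pii=True,
    consent_basis="customer terms of service v4",
    allowed_roles=["data_science"],
    allowed_purposes=["churn_model"],
)
print(crm_export.usable_for("data_science", "churn_model"))  # True
print(crm_export.usable_for("marketing", "ad_targeting"))    # False
```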
Legacy Issues
To date, “most of the stuff out there right now is incredibly data-hungry and relies exclusively on correlations and not heuristics,” said Medina, who also noted that because ML is correlation-based, it needs enormous amounts of clean, labeled data to be trained. “So the problem is not only the quantity of the data but also the quality of the data. Most data collection systems like CRM are not built for this—the data is incomplete or nonsensical. For instance, if you changed how you sell or track your sales performance from last year to this year, all of a sudden you’ve lost the ability to compare year-to-year trends because last year’s data is stored in a completely different way than this year’s, and the effort to map it to get a clean, labeled source is not worth it. This exorbitant data requirement is what is holding machine learning back.”
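The schema drift Medina describes can be sketched in a few lines. The old and new record shapes below are invented, but they show why last year’s data needs an explicit mapping step before year-to-year comparison is possible:

```python
# Invented example of schema drift: last year's sales records used a
# different shape, so an explicit mapping is required before the two
# years can be compared at all.
def normalize_last_year(record):
    # old schema: rep and region packed into one string, revenue in thousands
    rep, region = record["rep_region"].split("/")
    return {"rep": rep, "region": region,
            "revenue": record["revenue_k"] * 1000}

def normalize_this_year(record):
    # new schema already matches the target shape
    return {"rep": record["rep"], "region": record["region"],
            "revenue": record["revenue"]}

print(normalize_last_year({"rep_region": "jsmith/west", "revenue_k": 120}))
print(normalize_this_year({"rep": "jsmith", "region": "west",
                           "revenue": 95000}))
```

Writing and maintaining mappings like these for every historical source is exactly the effort Medina argues is often judged not worth it.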
The challenge of preparing back-end data resources to support cognitive computing initiatives may be the single greatest factor that hinders progress, some observers warn. “Data scientists spend up to 70% of their time on data collection and preparation, rather than building and deploying predictive models,” said Jeff Healey, senior director of product marketing, Vertica, Micro Focus. “While the need for data preparation for model-building would probably remain, having analytic platforms in place that support this critical step of data cleaning and preparation helps to shorten the predictive analytics lifecycle. Additionally, many data science and business analyst teams are being challenged to prove the value of their work in order to secure the time and resources needed to develop and operationalize ML-driven applications. Multiple data sources and different open source systems that require considerable maintenance and resources further complicate the time-to-value equation.”
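A minimal sketch of the cleaning work that consumes that 70%, assuming invented field names and plain Python rather than any particular analytics platform:

```python
# A toy cleaning pass over raw CRM-style rows: coerce types and discard
# records that cannot be repaired. Field names and values are invented.
raw = [
    {"customer_id": "42", "age": "34", "churned": "yes"},
    {"customer_id": "43", "age": "", "churned": "no"},       # incomplete
    {"customer_id": "44", "age": "fifty", "churned": "no"},  # nonsensical
]

def clean(rows):
    for row in rows:
        try:
            yield {"customer_id": int(row["customer_id"]),
                   "age": int(row["age"]),
                   "churned": row["churned"] == "yes"}
        except ValueError:
            continue  # drop rows whose fields cannot be coerced

print(list(clean(raw)))  # only the first record survives
```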
The ability to apply the same processes and business questions against all forms of data—structured and unstructured—will move cognitive computing approaches more thoroughly into the mainstream. Standardized data that takes into account all variables associated with the observation will result in the best dataset, said Geoff Tudor, VP and GM of cloud data services at Panzura. “For instance, if you are trying to build a model based on customer satisfaction, you should have a standard way of determining what ‘satisfied’ or ‘highly satisfied’ means across your sample set.”
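Tudor’s standardization step might look something like the following in practice; the source systems and their encodings are made up for illustration:

```python
# A hedged sketch of label standardization: two invented survey systems
# encode satisfaction differently, so a shared mapping is applied before
# their labels can feed a single model.
LABEL_MAP = {
    "survey_a": {1: "dissatisfied", 2: "satisfied", 3: "highly_satisfied"},
    "survey_b": {"D": "dissatisfied", "S": "satisfied",
                 "HS": "highly_satisfied"},
}

def standardize(source, raw_label):
    """Translate a source-specific label into the shared vocabulary."""
    return LABEL_MAP[source][raw_label]

print(standardize("survey_a", 3))    # -> highly_satisfied
print(standardize("survey_b", "S"))  # -> satisfied
```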