The move to real time also poses challenges for data quality, Hugg continued. “Immediacy and correctness are often difficult to manage at the same time. Many systems do a poor job of processing late and out-of-order data, preferring to collect mini-batches and process them whole. When latency is paramount, you see a return to the simpler request/response model of traditional services.” This requires technologies that can step up and compensate for these new sources of latency, such as in-memory caches, grids, and databases that enable real-time processing, Hugg explained.
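For readers who want a concrete picture of the late-arrival problem Hugg describes, the following minimal sketch (hypothetical, and not tied to any vendor mentioned here) counts events in one-minute event-time windows and accepts out-of-order records up to a configurable grace period, rather than holding everything back for a mini-batch:

```python
from collections import defaultdict

WINDOW_SECONDS = 60            # size of each event-time window
ALLOWED_LATENESS_SECONDS = 30  # grace period for out-of-order events

class WindowedCounter:
    """Counts events per event-time window, tolerating late arrivals."""

    def __init__(self):
        self.windows = defaultdict(int)  # window start -> event count
        self.watermark = 0               # highest event time seen so far

    def add(self, event_time, value=1):
        # Advance the watermark from the event stream itself.
        self.watermark = max(self.watermark, event_time)

        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        window_close = window_start + WINDOW_SECONDS + ALLOWED_LATENESS_SECONDS

        if self.watermark >= window_close:
            # Too late even for the grace period; flag it rather than
            # silently dropping the event.
            print(f"dropping late event at t={event_time}")
            return
        self.windows[window_start] += value

    def finalized(self):
        # Windows whose grace period has passed are safe to emit downstream.
        return {
            start: count for start, count in self.windows.items()
            if self.watermark >= start + WINDOW_SECONDS + ALLOWED_LATENESS_SECONDS
        }

counter = WindowedCounter()
for t in [5, 70, 65, 10, 200, 50]:  # note the out-of-order timestamps
    counter.add(t)
print(counter.finalized())          # the first two windows are complete
```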
Often, even the cloud, with all its seemingly unlimited power, may not be the right fit for organizations seeking to get closer to real-time movement of information. For example, while the “movement to in-memory databases is fundamental to the applications, the truth is that deploying the infrastructure to support these uses outside of a data center or exclusively in a cloud is hard for real-world situations,” said Jason Andersen, VP of business line management at Stratus Technologies. “There is a need for a better infrastructure to protect real-time data at these edge locations.”
Nowhere is the need for real-time processing and reduced latency felt more strongly than in efforts to leverage the Internet of Things (IoT). Capturing IoT data in real time can be effective only with systems capable of cost-effectively handling large data volumes at very low latency, said Joshi of Redis Labs. “Being able to implement adaptive applications powered by machine learning in real time is a critical aspiration for most enterprises, but real-time databases that can power such applications with built-in capabilities are most likely to make these aspirations a reality.” Joshi added that another critical force in making the data-driven enterprise a reality is the shift in hardware technology, which puts more cost-effective memory such as flash within reach of applications. “Datasets that can deliver the real-time performance of memory but with the cost-effectiveness of flash are likely to create a competitive edge for enterprise,” she said.
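As a rough illustration of the pattern Joshi describes, the sketch below keeps the latest IoT-derived features in an in-memory store and reads them back at scoring time. It uses the open source redis-py client; the host, key names, features, and weights are illustrative assumptions, not part of any Redis Labs product.

```python
import redis  # open source redis-py client

# Connection parameters are illustrative; adjust for your environment.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def store_device_features(device_id, features):
    """Write the latest IoT-derived features for a device into memory."""
    r.hset(f"device:{device_id}:features", mapping=features)

def score_device(device_id, weights):
    """Read features with low latency and apply a simple linear model."""
    features = r.hgetall(f"device:{device_id}:features")
    return sum(weights.get(name, 0.0) * float(value)
               for name, value in features.items())

store_device_features("sensor-42", {"temp": 71.3, "vibration": 0.8})
print(score_device("sensor-42", {"temp": 0.02, "vibration": 1.5}))
```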
Metadata and Data Catalogs
Amid all the hype and excitement about data-driven, AI-savvy enterprises, there is a more fundamental component of data management that managers are only beginning to embrace: keeping track of data assets and making them discoverable to decision makers. With data streaming in from a wide variety of internal and external sources, there needs to be a way to intelligently track, archive, and identify what is available to decision makers and applications. Metadata repositories and data catalogs are how this can be achieved. “People tend to focus on things like in-memory and other speeds-and-feeds sorts of metrics when they think about real-time technologies,” said Joe Pasqua, executive VP at MarkLogic. “But that assumes you’ve got all the relevant data in one place and you’re just trying to serve it up quickly. That’s the easy part. The real enabler is making the data available in the first place. This is made possible by a strong metadata solution to describe what and where the data is, and a multi-model approach that allows access to the varied shapes, sizes, and formats of the data across your organization, including graphs, documents, rows, geospatial, and so on.”
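To make that distinction between serving data quickly and making it discoverable more concrete, here is a minimal, hypothetical sketch of a metadata catalog: it records where each dataset lives, what data model it follows, and what format it takes, so applications can find the data before worrying about how fast to serve it. The classes and fields are invented for illustration and are not MarkLogic's API.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    location: str   # where the data physically lives
    model: str      # "document", "row", "graph", "geospatial", ...
    format: str     # "json", "parquet", "csv", ...
    tags: set = field(default_factory=set)

class DataCatalog:
    """A toy metadata repository: describe datasets so they can be discovered."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def find(self, model=None, tag=None):
        return [
            e for e in self._entries.values()
            if (model is None or e.model == model)
            and (tag is None or tag in e.tags)
        ]

catalog = DataCatalog()
catalog.register(CatalogEntry("orders", "s3://warehouse/orders/", "row", "parquet", {"sales"}))
catalog.register(CatalogEntry("contracts", "docs://legal/contracts", "document", "json", {"legal"}))

for entry in catalog.find(model="document"):
    print(entry.name, "->", entry.location)
```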
Metadata is key to the success of AI as well. “When AI can be leveraged to automatically and accurately append metadata attributes to information, the whole game changes,” said Greg Milliken, senior VP of marketing for M-Files. “AI can automatically evaluate the file contents for specific terms like a customer name, project, or case as well as the type or class of document—a contract, invoice, project plan, financial report—and then apply those metadata attributes to the file. This automatically initiates workflow processes and establishes access permissions, so only authorized people can access the information—such as the project team, the HR department, or those managing a specific case.” The result, Milliken continued, “is a more intelligent and streamlined information environment that not only ensures consistency in how content is organized, but also that information is intelligently linked to other relevant data, content, and processes to deliver a 360-degree view of structured data and unstructured content residing in different business systems.”
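A rough, hypothetical sketch of the auto-classification flow Milliken describes appears below: inspect a document's text, infer its class and customer, attach those attributes as metadata, and derive access permissions from them. The keyword rules and role names are invented for illustration; a production system would rely on a trained model rather than keyword matching.

```python
import re

# Keyword rules standing in for a trained classifier (illustrative only).
DOC_CLASS_RULES = {
    "invoice": ["invoice", "amount due"],
    "contract": ["agreement", "hereinafter"],
    "financial report": ["balance sheet", "quarterly results"],
}

# Which roles may see each document class (illustrative only).
ACCESS_RULES = {
    "invoice": {"finance"},
    "contract": {"legal", "project-team"},
    "financial report": {"finance", "executives"},
}

def classify(text):
    lowered = text.lower()
    for doc_class, keywords in DOC_CLASS_RULES.items():
        if any(keyword in lowered for keyword in keywords):
            return doc_class
    return "unclassified"

def extract_customer(text):
    # A crude stand-in for entity extraction.
    match = re.search(r"Customer:\s*(.+)", text)
    return match.group(1).strip() if match else None

def tag_document(text):
    doc_class = classify(text)
    return {
        "class": doc_class,
        "customer": extract_customer(text),
        "allowed_roles": sorted(ACCESS_RULES.get(doc_class, {"records-management"})),
    }

sample = "AGREEMENT between the parties, hereinafter the Supplier.\nCustomer: Acme Corp"
print(tag_document(sample))
```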
Open Source Prevails
Open source technologies have emerged that support the emerging real-time data center, said Marc Concannon, CTO of Clavis Insight. These include Kafka for capturing and distributing incoming streaming data; NiFi for data routing; Ignite for faster in-memory processing of the incoming data; Hadoop 2.0 for data access and storage; and Kubernetes for scaling a streaming infrastructure that is susceptible to bursts. “All of these technologies are relatively new to our stack,” Concannon pointed out. “But these technologies at their core are all about working with more and more data and extracting the relevant insights from this data quicker and hence making it available to our customers quicker.”
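As a small illustration of how such a stack hangs together, this sketch consumes a stream from Kafka using the open source kafka-python client. The broker address, topic, and consumer group are placeholders; routing, in-memory processing, and long-term storage would fall to the other components Concannon names.

```python
import json
from kafka import KafkaConsumer  # open source kafka-python client

# Broker address, topic, and group id are placeholders for illustration.
consumer = KafkaConsumer(
    "store-events",
    bootstrap_servers=["localhost:9092"],
    group_id="insight-pipeline",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    # In a fuller pipeline this is where records would be routed (NiFi),
    # aggregated in memory (Ignite), and persisted for batch analysis (Hadoop).
    print(f"partition={message.partition} offset={message.offset} event={event}")
```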
Hadoop “does enable a lot of tools which are focused on streaming and it also enables quicker access to the core insights on large datasets which is not really possible or would mean a long wait on more traditional technologies,” said Concannon. “This, to me, is all about making more data available at decision time.”