Getting data back to the operator efficiently is necessary but not trivial; the task is typically left to the upper data layers, which translate queries into map-reduce jobs for storing and then querying data. True big data analytics with real-time performance requires forethought and design at this step, and it is an absolute necessity for moving toward greater automation of operations.
Layering data technologies such as HBASE and HIVE on top of Hadoop is important for structuring data, querying it and hiding the complexity of map-reduce. However, Hadoop is inherently a batch-oriented system and has limitations, particularly as needs increasingly mandate real-time operations. Batch jobs scan the data that matches a query, return distributed results and allow a process to feed those results back to the calling system, such as HBASE. The underlying batch process can provide scalability (in incremental steps, each with its own process) but does not necessarily deliver the real-time results and guaranteed performance that operations require.
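As a rough illustration of this batch, scan-based retrieval model, the sketch below uses the happybase Python client for HBASE; the connection host, table name, column family and threshold are illustrative assumptions rather than part of any specific deployment.

```python
# Minimal sketch of batch-style retrieval from HBASE using the happybase
# client. Table and column names ("sensor_readings", "d:value") are
# illustrative assumptions.
import happybase

connection = happybase.Connection('hbase-thrift-host')  # assumed Thrift gateway
table = connection.table('sensor_readings')

# Retrieving "data that matches the query" means scanning a key range;
# the work is spread across region servers and streamed back in batches.
matches = []
for row_key, columns in table.scan(row_start=b'feeder-17#2023-01-01',
                                   row_stop=b'feeder-17#2023-01-02',
                                   batch_size=1000):
    value = float(columns[b'd:value'].decode())
    if value > 480.0:                      # filter applied client-side here
        matches.append((row_key, value))

connection.close()
```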
For real-time data processing and analytics, Apache Storm provides streaming capability for ingesting data into the Hadoop system. However, Storm is a framework: it exposes hooks in the data stream for developers to access and process data in near real-time. It therefore requires custom development for data handling and processing, and that development takes time and resources to build, test and deploy for each and every process. This additional provisioning becomes particularly burdensome when attempting to scale and automate future processes.
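To illustrate the per-process development burden, the sketch below uses the streamparse Python bindings for Storm; the bolt, its tuple layout and the operating limit are hypothetical, and a comparable component would have to be written, tested and deployed for every new processing step.

```python
# Sketch of a Storm processing component written with streamparse.
# The tuple layout is defined by an upstream spout and is assumed here.
from streamparse import Bolt


class ThresholdBolt(Bolt):
    """Flags readings above a limit as they stream through the topology."""

    def process(self, tup):
        device_id, value = tup.values          # layout set by the upstream spout
        if float(value) > 480.0:               # hypothetical operating limit
            self.emit([device_id, value, 'OVER_LIMIT'])
```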
It is also important to note that real-time processing in the big data sense is not necessarily processing data in accordance with operational requirements in the industrial environment. For example, real-time processing of events and steering of controls within a utility’s distribution or transmission environment must be based on sub-second analysis of the events, data and conditions. That level of real-time performance rules out consuming large volumes of data into a data lake and then performing scan analysis over it. Eliminating the consume-then-query model also more effectively addresses real-world constraints: skills and personnel shortages and already overtaxed experts. Scenarios such as this require an approach closer to complex event processing and streaming analysis, areas where our recommended architecture is designed to leverage high-performance adaptive stream computing and machine learning.
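The following minimal sketch, in plain Python, illustrates the streaming and complex-event-processing style being contrasted with store-then-scan: each event is evaluated inside a short sliding window as it arrives, so a decision is available in sub-second time without first landing the data in a lake. The window length, threshold and alarm action are illustrative assumptions.

```python
# Sliding-window evaluation of events as they arrive, with no data lake in the path.
from collections import deque
import time

WINDOW_SECONDS = 1.0
VOLTAGE_LIMIT = 1.05          # per-unit limit, assumed for illustration
window = deque()              # (timestamp, value) pairs inside the window

def on_event(timestamp, voltage_pu):
    """Called for every incoming measurement; returns an action or None."""
    window.append((timestamp, voltage_pu))
    # Drop readings that have fallen out of the one-second window.
    while window and timestamp - window[0][0] > WINDOW_SECONDS:
        window.popleft()
    # Simple condition over the window: sustained over-voltage.
    if len(window) >= 5 and all(v > VOLTAGE_LIMIT for _, v in window):
        return "RAISE_TAP_ALARM"      # hypothetical control/alarm action
    return None

# Example: a burst of readings as they would arrive from field devices.
for v in (1.06, 1.07, 1.06, 1.08, 1.07):
    action = on_event(time.time(), v)
print(action)
```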
Know Your Users to Give Them the Right Tools
When designing a better approach to big data, we must understand who will be using the data and which systems will access it. This is critical to the overall design of the architecture and will impact scalability and performance. In industrial environments, primary users include not only data scientists but operators and applications as well. Not all interfaces to the data will be the same, and the solution must support programmatic interfaces (e.g. web services) as well as human-machine interfaces (HMIs). When designed properly, data integration and management architectures can serve as a bridge between information technology (IT) and operational technology (OT) functions by providing accurate, reliable data across the enterprise, so that operators and engineers have the information and situational awareness they need to do their jobs effectively and make operational improvements.
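As an example of a programmatic interface living alongside the HMI, the sketch below assumes a small Flask web service sitting in front of the data services layer; the route, parameters and backing function are hypothetical.

```python
# Sketch of a programmatic (web-service) interface for applications,
# serving the same data an operator would see through the HMI.
from flask import Flask, jsonify, request

app = Flask(__name__)

def query_data_services(device_id, start, end):
    # Placeholder for a call into the architecture's data services layer.
    return [{"device": device_id, "t": start, "value": 0.98}]

@app.route("/api/v1/measurements/<device_id>")
def measurements(device_id):
    start = request.args.get("start")
    end = request.args.get("end")
    # Applications consume the data through a machine-readable JSON contract.
    return jsonify(query_data_services(device_id, start, end))

if __name__ == "__main__":
    app.run(port=8080)
```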
On the HMI side, operators require a different type of access than data scientists. Operators perform forensic analysis of information based on the issues they face on a daily basis. Data scientists, in contrast, seek patterns in data and formulate models that can be used for simulations, predictions and other complex analyses. Forensics and data science follow different methods and typically require different tools for examining the data. Because of this, it is important to design a big data architecture with flexibility, giving different users access to the data they need to derive the intelligence and insights required for their distinct use cases, whether IT functions or OT functions.
A Purpose-Built Data Architecture for Industrial Environments
Industrial organizations should implement a layered architecture that is purpose-built and incorporates data stores, data indexing and data analytics. It should be flexible enough to allow for new use cases and future requirements including functional and performance scalability.
This diagram illustrates the recommended architecture, which includes a software platform leveraging a semantic data model layered on top of a Hadoop-based data lake. With this layered approach, the architecture provides key framework elements for integration, analytics, knowledge and visualization, as well as a critically important information-indexing layer that uses a powerful semantic model. The base of the architecture is a Hadoop infrastructure that provides flexibility and scalability for raw and processed data storage. This layer includes an HDFS platform for scaling data storage, as well as data layers such as HBASE and HIVE for structuring and organizing information.
Key Architecture Points
The Hadoop infrastructure is the base solution for the data lake, providing scalability with raw storage of data, flexible layers of structure, and data typing. HBASE is recommended for its wide-column, massive-scale data storage and data analysis capabilities, but care is needed with the HBASE data structure and the approach to data retrieval. For industrial environments, how data is imported into the system is critically important. It is best to use a data ingestion method with semantic modeling (outlined in greater detail below), along with powerful indexing methods for fast data retrieval. HIVE can be layered in separately, as needed, for more common SQL-like access to data for relational purposes.
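The sketch below illustrates the kind of row-key care referred to above, again using the happybase client; the composite key layout (device id plus reversed timestamp) is one common HBASE pattern and is an illustrative assumption, not a prescribed schema.

```python
# Sketch of deliberate HBASE row-key design so retrieval is a narrow scan
# rather than a full-table scan. Names and key layout are assumptions.
import happybase

connection = happybase.Connection('hbase-thrift-host')   # assumed gateway
table = connection.table('sensor_readings')

def row_key(device_id, epoch_seconds):
    # Reverse the timestamp so the newest reading sorts first for a device.
    return f"{device_id}#{2**32 - epoch_seconds:010d}".encode()

table.put(row_key('feeder-17', 1_700_000_000), {b'd:value': b'0.97'})

# Retrieval by design: the newest rows for one device are a bounded prefix scan.
latest = list(table.scan(row_prefix=b'feeder-17#', limit=10))
connection.close()
```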
The Federated and Standardized Information Indexing layer is provided by the semantic data model layer and utilizes a highly distributed and scalable indexing approach based on Elasticsearch and Lucene. The indexing normalizes information for analysis and retrieval, allows rapid access to data within the data lake, and is mandatory for fast access during analytics and operations. Without an indexing method, the data lake must rely on limited primary keys or full data scans for retrieval.
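The sketch below shows what such an indexing layer might look like, assuming a recent elasticsearch-py client; the index name, document fields and the idea of indexing a lightweight pointer document that references the lake record are illustrative assumptions.

```python
# Sketch of the information-indexing layer using elasticsearch-py (8.x-style API).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://search-node:9200")   # assumed cluster endpoint

# Index a normalized "pointer" document for a record that lives in the lake,
# so analytics can locate it without a full scan of HDFS/HBASE.
es.index(index="asset-measurements",
         id="feeder-17#1700000000",
         document={
             "device": "feeder-17",
             "class": "Feeder",            # semantic class from the model
             "timestamp": "2023-11-14T22:13:20Z",
             "value": 0.97,
             "lake_row_key": "feeder-17#1700000000",
         })

# Rapid retrieval: find recent over-voltage readings across all feeders.
hits = es.search(index="asset-measurements",
                 query={"range": {"value": {"gt": 1.05}}})
print(hits["hits"]["total"])
```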
The addition of the software platform leveraging the semantic data model enables industrial organizations to efficiently structure the indexes and take a dynamic, adaptive approach to information indexing. Indices will vary and change as necessary, and this is easily accommodated because indexing is a layer independent of the data storage. Re-indexing then becomes a process driven by business requirements and layered in effectively, rather than brute-forced after the fact. Data mapping, modeling and ingestion methods are based on common semantic models.
Common data services are also provided by the Federated and Standardized Information and Correlation Engine and are important for consistent data access. Data services provide access to data through correlation and aggregation, and also allow for unstructured “fishing” within the data lake. They are an essential part of the overall architecture and design because they dramatically impact performance and scalability. Data services are also responsible for putting processed data back into the lake: they can process or pre-process data into correlations, aggregations and calculations that are stored and retrieved as necessary. Note that data services can also provide access to ‘raw’ data and allow for ‘raw’ queries.
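A minimal sketch of such a data service follows; the read, write and aggregation calls are placeholders standing in for the lake and the correlation engine, not a specific product API.

```python
# Sketch of a data service that reads raw data, computes an aggregate and
# writes the processed result back into the lake for later retrieval.
from statistics import mean

def read_raw(device_id, start, end):
    # Placeholder for a 'raw' query against the lake (e.g. an HBASE scan).
    return [0.97, 0.98, 1.02, 1.01]

def write_processed(key, record):
    # Placeholder for persisting a correlation/aggregation back into the lake.
    print("stored", key, record)

def hourly_aggregate_service(device_id, hour_start, hour_end):
    values = read_raw(device_id, hour_start, hour_end)
    aggregate = {
        "device": device_id,
        "window": (hour_start, hour_end),
        "avg": mean(values),
        "max": max(values),
        "count": len(values),
    }
    # Pre-processed result is stored so later queries retrieve it directly.
    write_processed(f"agg#{device_id}#{hour_start}", aggregate)
    return aggregate

hourly_aggregate_service("feeder-17", "2023-11-14T22:00Z", "2023-11-14T23:00Z")
```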
Query interfaces should differ to accommodate different data types, different types of correlation and analysis, and different types of models. They must also differ in order to leverage the scalability and performance characteristics of the data infrastructure. SQL was not designed for NoSQL, and translating from one to the other is not straightforward; new techniques must be learned for NoSQL analytics. SQL concepts such as normalization, joins and aggregations differ significantly in NoSQL. In some cases it is more beneficial to de-normalize a data set than to normalize it. Management of de-normalized data therefore becomes critical, and in some cases models will need to be rewritten. SQL-to-NoSQL translators such as HIVE will have issues but may serve well in a limited capacity.
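The sketch below illustrates the de-normalization point in plain Python: rather than joining separate device and reading records at query time, the reading is stored with its device context embedded, trading storage and update cost for read speed. The field names are illustrative assumptions.

```python
# Normalized, SQL-style: two records joined on device_id at query time.
device = {"device_id": "feeder-17", "substation": "North", "rating_kva": 500}
reading = {"device_id": "feeder-17", "ts": "2023-11-14T22:13:20Z", "value": 0.97}

# De-normalized, NoSQL-style: one self-contained document, no join needed.
denormalized_reading = {
    "device_id": "feeder-17",
    "substation": "North",          # copied in; must be kept consistent
    "rating_kva": 500,              # if device attributes ever change
    "ts": "2023-11-14T22:13:20Z",
    "value": 0.97,
}
```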
Integrating the Right Data
For industrial environments, we recommend a process for ingesting data into the Hadoop infrastructure that leverages semantic modeling capabilities for fast data integration. This not only ensures that the data lake is populated intelligently, but also allows for changes and adaptations later. It gives industrial organizations the ability to process business rules at speed for data quality checks, data transformation and other data processing requirements. It also allows semantic modeling and mapping of information from source to target, with normalization and de-normalization handled automatically.
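The following sketch illustrates ingestion driven by a semantic mapping, with quality rules applied at ingest time before a record is landed in the lake; the mapping, rules and landing call are illustrative assumptions.

```python
# Minimal sketch of semantically mapped ingestion with at-speed quality checks.

SEMANTIC_MAP = {            # source field -> common-model field
    "DEV": "device_id",
    "TS": "timestamp",
    "VAL_PU": "voltage_pu",
}

def quality_ok(record):
    # Example business rule applied during ingestion.
    return record.get("voltage_pu") is not None and 0.0 < record["voltage_pu"] < 2.0

def land_in_lake(record):
    print("landed", record)   # placeholder for the HDFS/HBASE write

def ingest(source_row):
    # Map source fields onto the common semantic model, then transform and check.
    record = {SEMANTIC_MAP[k]: v for k, v in source_row.items() if k in SEMANTIC_MAP}
    record["voltage_pu"] = float(record["voltage_pu"])      # transformation
    if quality_ok(record):
        land_in_lake(record)
    else:
        print("rejected", source_row)

ingest({"DEV": "feeder-17", "TS": "2023-11-14T22:13:20Z", "VAL_PU": "0.97"})
```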
Without a semantic data model there is little a machine can use to baseline data, and the process becomes reliant on human interpretation. The human element can be inconsistent and time-consuming, a hindrance when the ultimate goal for these processes should be increased automation resulting in more efficient operations; in other words, software-defined operations through a semantic data model.
A Semantic Data Model
In basic terms, Hadoop is a good data infrastructure component for handling the I/O requirements of massive data volumes and the demands of large data scans. It falls short, however, of a data architecture that can make complex decisions without relying on a centralized data lake and can make real-time decisions from processes running at the very edge of the network. Such an architecture reduces the reliance of operational technology on information technology and puts intelligent, actionable decisions directly into the hands of the operator.
By leveraging a software platform with semantic data modeling alongside their Hadoop architecture, industrial organizations can more quickly integrate extremely large volumes of operational data from a variety of disparate sources, correlate that data into a common data model, and apply predictive analytics and machine learning at the edge to derive actionable intelligence in real time. Gaining actionable insights from large volumes of unstructured data enables utilities to improve power distribution, lower operational costs, proactively identify and address risks, and accommodate new distributed energy resources.
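As a closing illustration of analytics at the edge, the sketch below evaluates a lightweight scoring function next to the device and turns a raw reading into an actionable decision locally; the coefficients and threshold are illustrative stand-ins for an exported predictive model, not a trained one.

```python
# Sketch of edge analytics: score a reading locally and act on it immediately,
# rather than shipping everything to a central lake before any decision is made.
COEFFICIENTS = {"voltage_pu": 4.0, "temp_c": 0.05, "bias": -4.5}
RISK_THRESHOLD = 0.8

def risk_score(voltage_pu, temp_c):
    # Stand-in for a predictive model evaluated on the edge node.
    z = (COEFFICIENTS["voltage_pu"] * voltage_pu
         + COEFFICIENTS["temp_c"] * temp_c
         + COEFFICIENTS["bias"])
    return 1.0 / (1.0 + 2.718281828 ** -z)     # logistic score in [0, 1]

def on_reading(voltage_pu, temp_c):
    score = risk_score(voltage_pu, temp_c)
    if score > RISK_THRESHOLD:
        return "NOTIFY_OPERATOR"     # actionable output available immediately
    return "LOG_ONLY"                # full record can still flow to the lake

print(on_reading(1.07, 41.0))
```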