Data integration can seem like a never-ending quest as organizations try to combine and access data from disparate applications and sources. But as we move beyond relational as the only DBMS type that matters and embrace NoSQL and Hadoop data platforms, data integration becomes more challenging, requiring new tools and approaches to succeed.
What Is Data Integration?
Before we look at how things are changing, let’s first define what we mean by data integration. At a high level, data integration is the combination of business and technical processes used to unite data from multiple sources into meaningful, useful information.
There are three different types of popular data integration products:
• Extract, Transform, and Load (ETL) — used to modify and move large amounts of data
• Enterprise Application Integration (EAI) — used to move smaller quantities of data with different frequency patterns
• Enterprise Data Replication (EDR) — used to synchronize datasets when the underlying data changes
Traditionally, each of these types of data integration has focused primarily on data in relational databases: massaging the data to make it useful for analysis, moving it from one place to another to offload the overhead of reporting and analysis (BI) from production processing, and keeping the BI data in sync with production. But new requirements for modern development and data analytics are changing the dynamics of data integration.
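To make the ETL pattern concrete, the sketch below walks one batch of data through the three steps. It is a minimal illustration, not a production job: the database files, the orders table, and the daily_sales summary table are all hypothetical, and the transform is a simple aggregation.

```python
import sqlite3

# Hypothetical databases standing in for a production DBMS and a data
# warehouse; real ETL tools connect to the actual systems involved.
src = sqlite3.connect("production.db")
tgt = sqlite3.connect("warehouse.db")

# Stand-in source data so the sketch runs end to end.
src.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, order_date TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 25.0, "2015-06-01"), (2, 40.0, "2015-06-01"), (3, 15.0, "2015-06-02")])

# Extract: pull the raw rows out of the production system.
rows = src.execute("SELECT amount, order_date FROM orders").fetchall()

# Transform: massage the data into an analysis-friendly shape,
# here by aggregating order amounts by date.
totals = {}
for amount, order_date in rows:
    totals[order_date] = totals.get(order_date, 0.0) + amount

# Load: write the result to the warehouse, keeping reporting and
# analysis workloads off the production database.
tgt.execute("CREATE TABLE IF NOT EXISTS daily_sales (order_date TEXT, total REAL)")
tgt.executemany("INSERT INTO daily_sales (order_date, total) VALUES (?, ?)", totals.items())
tgt.commit()
```

EAI and EDR products follow the same extract-and-deliver idea but differ in granularity and timing: EAI moves smaller messages between applications as events occur, while EDR propagates individual changes to keep copies synchronized.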
Modern Data Integration Challenges
As we look at the state of data integration technology and requirements today, we see two significant challenges: dealing with new types of data and dealing with new types of analysis. Let’s examine both issues, starting with the data.
The Changing World of Data Management
For decades, most organizations primarily used relational DBMSs to support all of their production applications. Although there may have been pockets of older technology (mainframes with IMS or non-database applications), data integration technologies could focus on one type of DBMS—relational/SQL.
But that has changed. Relational remains the bedrock of most production systems, but organizations are also using other types of DBMSs and data management platforms, such as Hadoop and NoSQL databases. These changes have been driven by big data requirements, the embrace of unstructured data, and the needs and efficiencies of mobile applications, among other trends.
Hadoop is an open source framework that enables distributed processing of large datasets across clusters of computers. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. It is not a DBMS but can be used to store vast amounts of data.
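Hadoop’s distributed processing model is easiest to see in the canonical word-count example. The sketch below simulates the map and reduce phases locally in a single process; on a real cluster, mechanisms such as Hadoop Streaming run many mapper and reducer instances in parallel on the nodes that hold the data. The single-script, local simulation is an assumption made for illustration.

```python
import sys

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word; Hadoop runs many
    # mapper instances in parallel across the cluster's data blocks.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce phase: Hadoop groups pairs by key, so each reducer sums
    # the counts it receives for a given word.
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

if __name__ == "__main__":
    # Local, single-process simulation of the MapReduce flow.
    for word, total in sorted(reducer(mapper(sys.stdin)).items()):
        print(f"{word}\t{total}")
```

Piping a text file into the script (cat notes.txt | python wordcount.py) prints each word with its total count, mirroring what the cluster would produce at far larger scale.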
Many organizations have adopted Hadoop to house their data lake. A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. Any type of data can be stored in a data lake: structured, semi-structured, and unstructured. For example, you might use a data lake to capture all customer information from multiple sources for future analysis and aggregation. The data might comprise numbers, characters, dates, and times, but also complex documents, text, multimedia, and more. Data in a data lake is ingested without transformation so that data scientists and business analysts can run models and BI queries against the data.
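To illustrate what ingesting data “in its native format” can look like, here is a minimal sketch in which a local directory stands in for the lake (in practice, HDFS or cloud object storage). The source names, directory layout, and records are hypothetical; the key point is that records land verbatim, with just enough metadata (source and arrival date) to locate them later.

```python
import json
import pathlib
from datetime import datetime, timezone

# Hypothetical lake root; real data lakes typically sit on HDFS or
# cloud object storage rather than a local directory.
LAKE_ROOT = pathlib.Path("datalake")

def ingest(source, records):
    """Land raw records in the lake as-is: no cleansing, no schema
    enforcement, just a partitioned path recording where and when
    the data arrived."""
    now = datetime.now(timezone.utc)
    path = LAKE_ROOT / source / now.strftime("%Y/%m/%d")
    path.mkdir(parents=True, exist_ok=True)
    out = path / f"batch-{now.strftime('%H%M%S%f')}.json"
    out.write_text(json.dumps(records))  # stored untransformed
    return out

# Hypothetical customer data arriving from two different sources.
ingest("crm", [{"customer": 42, "note": "called support"}])
ingest("web_clicks", [{"customer": 42, "url": "/pricing"}])
```

Because nothing is transformed on the way in, any structure the analysis needs is imposed later, when data scientists and business analysts read the raw files back out.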