Data Warehousing Evolves with Data Vault 2.0 Implementations
One day, corporations awoke to the fact that having data was not the same thing as having believable data. They awoke to discover the meaning of “data integrity.” That was the day the enterprise data warehouse (EDW) was born. With an EDW, corporations had the bedrock data on which to make important and trustworthy decisions. Prior to the EDW, corporations had plenty of data, but the data was not believable.
With the data warehouse came what was called the “single version of the truth,” or the “system of record.” With the single version of the truth, the organization now had a foundation of data to which it could turn with confidence. The volumes of data continued to explode with the advent of the data warehouse. Prior to the data warehouse there was no convenient place to store historical data. But with the data warehouse, for the first time, there was a convenient and natural place for historical data.
As data warehouses grew, it became clear that the design of the data warehouse needed greater flexibility and that the integrity of its data needed improvement. Thus was born the “Data Vault.” With the Data Vault, the data warehouse now enjoyed the ultimate in design and integrity.
A Data Vault Model Concept
From a conceptual level, the Data Vault model is a hub-and-spoke–based model, designed to focus its integration patterns around business keys. These business keys are the keys to the information stored across multiple systems (hopefully the master keys), used to locate and uniquely identify records or data. From a conceptual level, these business keys stand alone, meaning they don’t rely on other information in order to exist. The concepts are derived from business context (or business ontologies), which are elements that make sense to the business from a master data perspective, such as customer, product, service, and so on. These concepts are business drivers at the lowest level of grain. The Data Vault model does not subscribe to the notion of supertype and subtype unless that is the way the source systems are providing the data.
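To make the hub-and-spoke idea concrete, the following sketch models the conventional Data Vault structures (hubs for standalone business keys, links for relationships between keys, and satellites for descriptive context) as plain Python dataclasses. The class and field names are illustrative assumptions, not definitions prescribed by the text.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hub: one row per business key. The key stands alone and does not rely on
# any other information in order to exist.
@dataclass(frozen=True)
class Hub:
    business_key: str        # e.g., a customer number known across systems
    record_source: str
    load_date: datetime

# Link: the "spokes", relationships between two or more business keys.
@dataclass(frozen=True)
class Link:
    business_keys: tuple     # e.g., (customer_number, product_number)
    record_source: str
    load_date: datetime

# Satellite: descriptive context attached to a hub or link, versioned over time.
@dataclass
class Satellite:
    parent_business_key: str
    attributes: dict
    record_source: str
    load_date: datetime

# The business key alone identifies the concept, regardless of source system.
customer = Hub(business_key="CUST-00042",
               record_source="CRM",
               load_date=datetime.now(timezone.utc))
```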
Data Vault 1.0
Data Vault 1.0 is highly focused on the Data Vault Modeling components. A Data Vault 1.0 model uses surrogate sequence keys as the primary key for each of the entity types. Unfortunately, surrogate sequences exhibit the following problems:
• Introduce dependencies on the ETL/ELT loading paradigm
• Contain an upper bound (upper limit) that can cause issues when it is reached
• Are meaningless numbers (mean absolutely nothing to the business)
• Cause performance problems (due to dependencies) when loading big data sets
• Reduce parallelism (again due to dependencies) of loading processes
• Cannot be used as MPP (massively parallel processing) partition keys for data placement; doing so would potentially cause hot spots in the MPP platform
• Cannot be reliably rebuilt or reassigned (reattached to their old values) during recovery loads
Data Vault 1.0 does not meet the needs of big data, unstructured data, semistructured data, or very large relational data sets.
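To see concretely why sequence keys constrain loading, the following Python sketch (purely illustrative) contrasts a sequence-based key, which must be looked up against previously assigned values and therefore serializes the load, with a key derived deterministically from the business key itself. Hashing the business key is one common way to remove that dependency; the function names and the use of MD5 here are assumptions for illustration, not a prescription from the text.

```python
import hashlib
from itertools import count

# Sequence-based surrogate keys: a single shared counter plus a lookup table.
# Every loader must consult (and update) this shared state, which serializes
# loads and ties the keys to the ETL/ELT process that assigned them.
_sequence = count(1)
_assigned = {}  # business key -> previously assigned sequence number

def sequence_key(business_key):
    if business_key not in _assigned:   # dependency: lookup against prior loads
        _assigned[business_key] = next(_sequence)
    return _assigned[business_key]

# Deterministic keys: derived from the business key alone. Any process, on any
# node, computes the same value with no lookup, no shared counter, and no
# reassignment problem during recovery loads.
def deterministic_key(business_key):
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Two independent loaders agree on the key without coordinating.
assert deterministic_key("cust-00042 ") == deterministic_key("CUST-00042")
```

Because such a key is a pure function of the business key, it also distributes evenly when used for data placement, avoiding the hot spots that monotonically increasing sequences create.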
Introduction to Data Vault Architecture
Data Vault 2.0 Architecture
Data Vault 2.0 Architecture is based on a three-tier data warehouse architecture. The tiers are commonly identified as the staging or landing zone, the data warehouse, and the information delivery layer (or data marts). Figure 1 shows an overview of Data Vault 2.0 Architecture. The multiple tiers allow implementers and designers to decouple the enterprise data warehouse from both the sourcing and acquisition functions and the information delivery and data provisioning functions. In turn, the team becomes more nimble, and the architecture is more resilient to failure and more flexible in responding to changes. The sections are: staging, the enterprise data warehouse (EDW), and the information marts or information delivery layer.
Regardless of the platforms and technology used for implementation, these layers will continue to exist. However, as the system nears full real-time enablement, the need for and dependency on the staging area will decline. True real-time data will feed directly into the EDW layer. In addition to the three tiers, the Data Vault 2.0 architecture dictates several different components:
- Hadoop or NoSQL handles big data.
- Real-time information flows both in and out of the business intelligence (BI) ecosystem; in turn, over time this also evolves the EDW into an operational data warehouse.
- Managed self-service business intelligence (SSBI) is provided through write-back and master data capabilities, enabling total quality management (TQM).
- Hard and soft business rules are split, making the enterprise data warehouse a system of record for raw facts that are loaded over time (see the sketch below).
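As a rough illustration of the hard/soft rule split, the following sketch applies only hard rules (type alignment and formatting that do not change meaning) on the way into the raw EDW, and defers soft rules (business interpretation) to the information-delivery side. The field names, rule functions, and threshold are hypothetical.

```python
from datetime import datetime, timezone

# Hard rules: applied while loading the EDW. They align data types and formats
# but never reinterpret the data, so the warehouse remains a system of record
# for raw facts.
def apply_hard_rules(raw_row):
    return {
        "customer_key": raw_row["cust_id"].strip().upper(),
        "order_amount": float(raw_row["amount"]),          # type alignment only
        "record_source": raw_row.get("source", "UNKNOWN"),
        "load_date": datetime.now(timezone.utc),
    }

# Soft rules: applied downstream, on the way to the information marts, where
# business meaning (and therefore change) belongs. Changing this rule never
# requires reloading source systems; it can be replayed over the raw facts.
def apply_soft_rules(edw_row):
    row = dict(edw_row)
    row["customer_tier"] = "GOLD" if row["order_amount"] > 10_000 else "STANDARD"  # hypothetical threshold
    return row

mart_row = apply_soft_rules(apply_hard_rules({"cust_id": " c-17 ", "amount": "12500.00"}))
```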
How NoSQL Fits into the Architecture
NoSQL platform implementations will vary. Some will contain SQL-like interfaces; some will contain relational database management system (RDBMS) technology integrated with nonrelational warehouse technology. The line between the two (RDBMS and NoSQL) will continue to be blurred. Eventually, the result will be a “data management system” capable of housing both relational and nonrelational data simply by design. The NoSQL platform today, in most cases, is based on Hadoop at its core, which comprises the Hadoop Distributed File System (HDFS) and its metadata management for files in the different directories. Various implementations of SQL access layers and in-memory technology will sit on top of HDFS. Once ACID (atomicity, consistency, isolation, and durability) compliance is achieved, which is available today with some NoSQL vendors, the differentiation between RDBMS and NoSQL will fade.
Excerpted from A PRIMER FOR THE DATA SCIENTIST: Big Data, Data Warehouse and Data Vault; W.H. Inmon; Daniel Linstedt; Copyright © 2015 Elsevier Inc. Adapted with permission of Elsevier Inc.