Note that not all Hadoop or NoSQL platforms offer ACID compliance today, and not all NoSQL platforms support updating records in place, which makes it impossible for them to completely supplant RDBMS technology. This is changing quickly; even as this section is written, the technology continues to advance. Eventually the technology will be seamless, and what is purchased from the vendors in this space will be hybrid based.
Hadoop is currently positioned as an ingestion and staging area for any and all data that might proceed to the warehouse. This includes structured data sets (delimited files, fixed-width columnar files), multistructured data sets such as XML and JSON files, and unstructured data such as Microsoft Word documents, Excel spreadsheets, video, audio, and images. This is because it is quite simple to ingest a file into Hadoop: copy the file into a directory that is managed by Hadoop. From that point, Hadoop splits the file across the multiple nodes or machines that it has registered as part of its cluster.
The second purpose for Hadoop (or best practice today) is to leverage it as a place to perform data mining, using SAS, R, or textual mining. The results of the mining efforts often are structured data sets that can and should be copied into relational database engines, making them available for ad hoc querying.
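As a minimal sketch of that last step, the snippet below loads a small set of mining results into a relational table and runs an ad hoc query against it. Everything specific here is an assumption made for illustration: the `sentiment_scores` table, its columns, and the sample rows are hypothetical, and an in-memory SQLite database stands in for whatever relational engine is actually in use.

```python
import sqlite3

# Hypothetical output of a text-mining job run on Hadoop:
# (customer_id, sentiment label, confidence score).
mining_results = [
    ("CUST-1001", "positive", 0.91),
    ("CUST-1002", "negative", 0.78),
    ("CUST-1003", "positive", 0.66),
]

# An in-memory SQLite database stands in for the relational engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentiment_scores ("
    "customer_id TEXT, sentiment TEXT, score REAL)"
)

# Copy the structured results into the relational table.
conn.executemany(
    "INSERT INTO sentiment_scores VALUES (?, ?, ?)", mining_results
)
conn.commit()

# Once loaded, the results are available for ad hoc querying.
rows = conn.execute(
    "SELECT sentiment, COUNT(*) FROM sentiment_scores "
    "GROUP BY sentiment ORDER BY sentiment"
).fetchall()
print(rows)
```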
Data Vault 2.0 Architecture Objectives
There are several objectives of the architecture of Data Vault 2.0:
- To seamlessly connect existing RDBMSs with new NoSQL platforms
- To engage business users and provide space for managed self-service BI (SSBI), with write back and direct control over data in the data warehouse
- To provide for real-time arrival direct to the data warehouse environment without forcing a landing in the staging tables
- To enable agile development by decoupling the always changing business rules from the static data alignment rules
The architecture plays a key role in separation of responsibilities, isolating data acquisition from data provisioning. By separating responsibilities and pushing ever-changing business rules closer to the business user, the architecture enables the implementation teams to be agile.
Data Vault 2.0 Modeling Objective
The objective of Data Vault 2.0 Modeling is to provide seamless platform integration, or at least to make it available and possible via design. The design that is leveraged includes several basic elements. The first is the use of hash keys, which replace surrogate keys as primary keys. The hash keys allow the implementation of parallel, decoupled loading practices across heterogeneous platforms. The hash keys and loading process are introduced and discussed in the Implementation and Modeling sections of this chapter. The hash keys also provide the connection between the two environments, allowing cross-system joins to occur where possible. Performance of a cross-system join will vary depending on the NoSQL platform chosen and the hardware infrastructure underneath. Figure 2 shows an example data model that provides a logical foreign key between an RDBMS and a Hadoop-stored satellite. In other words, the idea is to allow the business to augment its current infrastructure by adding a NoSQL platform to the mix, while retaining the value and use of its existing RDBMS engines, not to mention all the historical data they already contain.
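The following sketch illustrates why a hash key can be computed independently on each platform and still serve as a logical foreign key. It is not the book's implementation: the `hash_key` helper, the normalization rule (trim and upper-case), the `|` delimiter, the choice of SHA-256, and the sample rows are all assumptions made for illustration.

```python
import hashlib


def hash_key(*business_key_parts: str, delimiter: str = "|") -> str:
    """Derive a deterministic hash key from one or more business key parts.

    Assumed normalization: trim whitespace and upper-case each part, so
    that every platform derives the same key from the same business key.
    """
    normalized = delimiter.join(
        part.strip().upper() for part in business_key_parts
    )
    # SHA-256 is an assumption; any deterministic hash function that all
    # platforms agree on would serve the same purpose.
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# A row loaded into the RDBMS and a satellite row loaded on Hadoop, each
# computing the key on its own, with no shared sequence generator or
# key lookup between the two loads.
hub_row = {"customer_hk": hash_key("cust-1001"), "customer_id": "CUST-1001"}
sat_row = {"customer_hk": hash_key(" CUST-1001 "), "clickstream_events": 42}

# The matching hash keys act as the logical foreign key for a
# cross-system join.
joined = (
    {**hub_row, **sat_row}
    if hub_row["customer_hk"] == sat_row["customer_hk"]
    else None
)
print(joined)
```

Because neither load depends on a lookup against the other system, the two loads can run in parallel. A sequence-generated surrogate key would force one load to wait for the other to assign the key.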
Hard and Soft Business Rules
Business rules are the requirements translated into code. The code manipulates the data and in some cases turns data into information. Part of the purpose of the Data Vault 2.0 BI system is to enable agility. Agility is enabled by first splitting the business rules into two distinct groups: hard rules and soft rules.
The idea is to separate data interpretation from data storage and alignment rules. By decoupling these rules, the team can be increasingly agile. The business users can also be empowered, and the BI solution can be moved toward managed SSBI. Beyond that, a Data Vault 2.0–based data warehouse carries raw data in a nonconformed state. That data is aligned with the business constructs known as “business keys.”
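A minimal sketch of the split, under stated assumptions: the two function names, the record layout, and the "high value" threshold are hypothetical and chosen only to show the separation. The hard rule aligns data types without changing what the data means. The soft rule interprets the data and is expected to change as the business changes.

```python
from datetime import datetime


def apply_hard_rules(raw: dict) -> dict:
    """Hard rules: align storage formats and data types only.

    The content of the record is not interpreted or altered.
    """
    return {
        "customer_id": raw["customer_id"],
        "order_date": datetime.strptime(raw["order_date"], "%Y-%m-%d").date(),
        "amount": float(raw["amount"]),
    }


def apply_soft_rules(row: dict, high_value_threshold: float) -> dict:
    """Soft rules: interpret the data on the way out of the warehouse.

    The threshold is a business decision, so it is passed in rather than
    baked into the load.
    """
    return {**row, "is_high_value": row["amount"] >= high_value_threshold}


# Hypothetical source record, as delivered by a source system.
source_record = {
    "customer_id": "CUST-1001",
    "order_date": "2015-03-14",
    "amount": "1250.00",
}

warehoused = apply_hard_rules(source_record)  # what the warehouse stores
presented = apply_soft_rules(warehoused, high_value_threshold=1000.0)
print(presented)
```

If the business later redefines "high value," only the soft rule and its threshold change. The stored rows and the load that produced them are untouched.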
The raw data, integrated by business keys, serve as a foundation for passing audits, especially since the data set is not in a conformed format. The idea of the Data Vault 2.0 model is to provide for data warehousing–based storage of raw data, so that if necessary (due to an audit or other needs), the team can reconstruct or reassemble the source system data.
This in turn makes the Data Vault 2.0–based data warehouse a system of record, mostly because after the data from the source systems has been warehoused, those systems are either shut down or replaced by newer sources. In other words, the Data Vault 2.0 data warehouse becomes the only place where one can find the raw history integrated by business key.
Excerpted from A PRIMER FOR THE DATA SCIENTIST: Big Data, Data Warehouse and Data Vault; W.H. Inmon; Daniel Linstedt; Copyright © 2015 Elsevier Inc. Adapted with permission of Elsevier Inc.