It has been widely documented: data is growing at astronomical rates. The amount of data your organization has is less important than how that data is being used. Is data growth hindering your business, or is it providing a better understanding of the direction you need to go?
Data warehousing and business intelligence (BI) are evolving to become more operational in nature, rather than BI's traditional classification as a purely strategic function. That means more data marts, more moving parts and a wider reach throughout organizations. Keeping pace with this evolution requires creating more documentation of the data and the data models. Data models are now as much about lineage and data "whereabouts" as they are about engineering quality databases, putting them squarely in the data warehousing/BI realm.
This article discusses some best practices and techniques for leveraging your modeling tool as a metadata solution in a data warehousing and BI environment.
Know Your Audiences, Know Their Preferences
Ask yourself this question: Who are our customers? The days of "just" designing a schema that is tossed over the wall to DBAs to implement are gone. Data architects are getting involved with governance, security and compliance, and service-oriented architecture, and all of these tie into data warehousing and BI projects. Now, more than ever, data architects hold a unique role in an organization. They have the ear of many different audiences and have to "talk data" with each of them at their level of understanding. How you communicate with technical users like developers and DBAs is going to be drastically different from how you communicate with executives, business analysts and line-of-business managers. The real art is being able to use models to communicate effectively with all of them.
The main principles to focus on are: 1) finding out what people want, 2) giving people what they want, and, most importantly, 3) making sure it is relevant to them. Finding out what people want involves interviewing people from the respective groups throughout your organization and understanding what they need from you and, more specifically, the format in which they want to receive the information. Although it sounds simple, delivering the information in a format that is unfavorable to the recipient can break the entire system. When you meet with each group, agree upon a format that they find useful and that your modeling tool can produce. If they want a Microsoft Excel report, give it to them. If they want a PDF they can print out and put up on a wall, give that to them. If they want direct access to navigate, make sure they have it. If they want HTML, so be it. Providing levels of abstraction will keep it relevant to them. The last thing you want to do is bring a printout of a physical database schema (Figure 1) to a meeting to discuss current data requirements with business users or executives.
Figure 1: Physical Database Schema
Their eyes will glaze over, and you will lose them within 30 seconds of discussing the latest denormalization strategy for the various type codes in the employee dimension and why a non-identifying relationship was used instead of an identifying relationship between a dimension and a fact table. Instead, if you can abstract the requirements of a new data mart into a conceptual model (Figure 2), you will be speaking their language and it will be much better received.
Figure 2: Conceptual Model of a Data Mart
A diagram like this makes it very easy to see the key dimensions for the data mart. For each dimension, it is clear how each of the measures in the fact table is being sliced and diced. Communicating in your audience's terms will make meetings much more productive because you will be equipped to focus on the business requirements rather than the nuances of design and notation that may not be relevant to them.
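To make this concrete, here is a minimal Python sketch of driving audience-specific deliverables from a single export of model metadata. The dictionary structure and file names are hypothetical stand-ins for whatever your modeling tool actually produces:

```python
import csv

# Hypothetical export of model metadata; in practice this would come
# from your modeling tool's API, repository or file format.
entities = [
    {"name": "Customer", "definition": "A person or organization that buys products."},
    {"name": "Product", "definition": "An item offered for sale."},
]

# Analysts asked for a spreadsheet: write an Excel-friendly CSV.
with open("entities.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "definition"])
    writer.writeheader()
    writer.writerows(entities)

# Executives asked for something browsable: write a simple HTML page.
rows = "\n".join(
    f"<tr><td>{e['name']}</td><td>{e['definition']}</td></tr>" for e in entities
)
with open("entities.html", "w") as f:
    f.write(f"<table><tr><th>Entity</th><th>Definition</th></tr>\n{rows}\n</table>")
```

The design point is that one source feeds every format, so no audience's copy can drift from the model.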
The Model of a Data Model
Understanding what new types of metadata are necessary in your models will go a long way toward increasing the visibility and usage of the models across your organization. Five years ago, data security and compliance mappings were almost unheard of in a data model. Now, they are nearly a necessity. Ask yourself these two questions: What makes your data model complete? And what new types of metadata will further serve your organization?
A data model can be considered the gold standard if it includes these four components:
1. Well-formed definitions
Good definitions can do wonders for increasing a model's effectiveness. If you can answer what, why and how in a definition, it is much more beneficial than a simple sentence stating the obvious wrapped around the entity name. I have seen countless instances of people using "This is the customer identifier" as the definition of the customer_id column, which is almost the same as no description at all. It is much more useful, and paints a more complete picture for the consumers of the model, if you use the definition to capture points like: What is this piece of data? Why do we need it? How is it used?
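One lightweight way to enforce that discipline is to capture the what, why and how as separate fields and assemble the published definition from them. The sketch below is hypothetical Python, not a feature of any particular modeling tool:

```python
from dataclasses import dataclass

@dataclass
class ColumnDefinition:
    column: str
    what: str   # What is this piece of data?
    why: str    # Why do we need it?
    how: str    # How is it used?

    def render(self) -> str:
        # Assemble the three answers into the definition text
        # that gets published into the model.
        return f"{self.what} {self.why} {self.how}"

customer_id = ColumnDefinition(
    column="customer_id",
    what="A surrogate key uniquely identifying a customer.",
    why="Needed to join customer attributes to fact records.",
    how="Used by ETL to conform customers across source systems.",
)
print(customer_id.render())
```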
2. Business names in the logical model
Business names in the logical model will dramatically increase its readability and will help trace it back to the business. In data warehousing, the logical model is often an afterthought or merely a carbon copy of the physical model minus the platform-specific properties. That limits the visibility and use of the model to a narrow audience.
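Even something as simple as a lookup from physical names to business names, as in this hypothetical Python sketch, illustrates how much readability the business side gains:

```python
# Hypothetical mapping of physical column names to business names.
# Reports published from the logical side read in business terms.
business_names = {
    "cust_id": "Customer Identifier",
    "cust_nm": "Customer Name",
    "ord_dt":  "Order Date",
}

for physical, logical in business_names.items():
    print(f"{logical:<25} (physical: {physical})")
```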
3. Reference values
Reference values are typically stored in database tables so that they can be used across various applications, and they should be. What often happens, though, is that this valuable documentation only finds its way into the model as a physical constraint and not in a business-friendly format that contains both the physical and logical term for each reference value. That wastes a very good opportunity to leverage the publishing abilities of a model to communicate crucial information, especially if ETL processes depend on those reference values.
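As an illustration, a reference list published with both terms might be generated from a structure like this; the codes and wording are invented for the example:

```python
# Hypothetical reference values for an order_status column, capturing both
# the physical code stored in the database and the business-friendly term.
order_status = [
    ("N", "New"),
    ("S", "Shipped"),
    ("C", "Cancelled"),
]

# Publish them in a business-readable format instead of leaving them
# buried in a check constraint.
print("Code | Meaning")
for code, meaning in order_status:
    print(f"{code:<4} | {meaning}")
```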
4. Domains assigned to each attribute/column
Domains are a valuable way to enforce consistent definitions of common pieces of data between transactional and dimensional systems. They are also grossly underused. If you can define one domain and reuse it across a model or multiple models, it will help promote the reuse of common data elements across different environments and enhance data quality along the way.
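A hypothetical Python sketch of the idea: define the domain once and attach it everywhere the concept appears, so every occurrence inherits the same datatype and definition:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Domain:
    name: str
    datatype: str
    definition: str

# Define the domain once...
money = Domain("Money", "DECIMAL(12,2)", "A monetary amount in US dollars.")

# ...and reuse it wherever a monetary attribute appears, so the
# transactional and dimensional models stay consistent.
columns = {
    "order_fact.sales_amount": money,
    "order_fact.discount_amount": money,
    "customer_dim.credit_limit": money,
}
for column, domain in columns.items():
    print(f"{column}: {domain.datatype}  -- {domain.definition}")
```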
A Case of When More Is Not Necessarily Better
Understanding what new types of metadata you can put into your models will greatly increase the visibility of the models and the value they provide to your organization. Helpful metadata to incorporate in a data warehouse environment includes:
- Source/target mappings
- Data movement rules
- Data security classifications
- Compliance mappings
Source/target mappings
If I had a dollar for every customer I have heard of documenting source and target mappings outside of the data model, I could put in for an early retirement. They invariably end up using Excel, Access or some home-grown repository for the mappings, which creates a problem: when the model(s) change, it requires manual work to keep the two in sync. Excel may be a great medium for communicating mapping information because everyone has it, but it is important to avoid making the Excel version the master copy. If you put the mappings and lineage into the model, and then drive the Excel and report generation from the model, you will save yourself quite a bit of time and effort.
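Here is a minimal sketch of that workflow in Python, with invented mapping entries standing in for metadata that would really live on the model. The spreadsheet deliverable is generated from the mappings, never edited by hand:

```python
import csv

# Hypothetical source-to-target mappings, as they might be captured on
# the model itself rather than in a standalone spreadsheet.
mappings = [
    {"source": "crm.cust.cust_no", "target": "dw.customer_dim.customer_id",
     "rule": "Direct move"},
    {"source": "crm.cust.full_nm", "target": "dw.customer_dim.customer_name",
     "rule": "Trim and title-case"},
]

# The Excel-friendly deliverable is generated FROM the model metadata,
# so the two can never drift out of sync.
with open("source_to_target.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["source", "target", "rule"])
    writer.writeheader()
    writer.writerows(mappings)
```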
Data movement rules
Data movement rules are also a valuable piece of information for keeping everyone on the same page. They are great for documenting pre- and post-transformation tasks, archiving rules, data loading rules for different types of dimensions and so on. Without this information in the models, developers are left to wonder how long to keep data in a specific dimension, whether they should update existing records and insert new ones, or whether to truncate the table completely before loading.
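For example, load strategies and retention rules could be captured per table in a structure like this hypothetical Python sketch, so the answers travel with the model instead of living in developers' heads:

```python
from enum import Enum

class LoadStrategy(Enum):
    TRUNCATE_AND_LOAD = "Truncate target, then insert all rows"
    SCD_TYPE_1 = "Update existing records in place"
    SCD_TYPE_2 = "Expire current record, insert new version"

# Hypothetical data movement rules attached to each table, so developers
# never have to guess how a dimension should be loaded or retained.
movement_rules = {
    "customer_dim": {"strategy": LoadStrategy.SCD_TYPE_2, "retention_years": 7},
    "product_dim": {"strategy": LoadStrategy.SCD_TYPE_1, "retention_years": 5},
    "daily_sales_fact": {"strategy": LoadStrategy.TRUNCATE_AND_LOAD,
                         "retention_years": 3},
}
for table, rule in movement_rules.items():
    print(f"{table}: {rule['strategy'].value} (keep {rule['retention_years']} years)")
```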
Data security classifications
Data security classifications are a mechanism for communicating the sensitivity of the data to your audiences. To better handle data that requires extra care, consider creating a classification scheme that documents privacy and security impact levels and is mapped to subject areas, views, tables and columns in a data model. Use security flags to help users report on, search and isolate specific data elements like social security numbers, credit card numbers and contact information, which may require special obfuscation or masking controls as they move from a source system to the data warehouse or data mart.
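A hypothetical sketch of such a scheme in Python; the impact levels and masking flags are illustrative, not an industry standard:

```python
# Hypothetical security classification scheme mapped to model columns.
classifications = {
    "customer_dim.ssn":            {"level": "Restricted",   "mask": True},
    "customer_dim.credit_card_no": {"level": "Restricted",   "mask": True},
    "customer_dim.email":          {"level": "Confidential", "mask": True},
    "customer_dim.customer_name":  {"level": "Internal",     "mask": False},
}

# Isolate every element that needs obfuscation on its way to the warehouse.
needs_masking = [col for col, c in classifications.items() if c["mask"]]
print("Columns requiring masking:", needs_masking)
```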
Compliance mappings
Much like data security information, compliance mappings help users understand which data is critical to specific regulations. If it can be flagged at the subject area, view, table and/or column level, it will be much easier to search, report on and isolate.
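Once the tags exist, isolating everything a given regulation touches becomes a one-line query, as in this hypothetical Python sketch (the regulation names and columns are invented):

```python
# Hypothetical compliance tags at the column level.
compliance_tags = {
    "customer_dim.ssn":            ["SOX", "GLBA"],
    "customer_dim.credit_card_no": ["PCI-DSS"],
    "order_fact.sales_amount":     ["SOX"],
}

# Flagged metadata makes it trivial to report on everything a given
# regulation touches.
def columns_for(regulation: str) -> list[str]:
    return [col for col, regs in compliance_tags.items() if regulation in regs]

print("SOX-relevant columns:", columns_for("SOX"))
```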
These new forms of metadata are typically documented outside the data model, which forces users to hunt in different locations for the information they need. The more places they have to go, the less likely they are to find that information and, obviously, the more time they are going to spend trying to find it. Deciding what metadata to capture is also important. Albert Einstein once said, "Make things as simple as possible, but not simpler." Follow this principle when determining what metadata to include in your data model. If a piece of information will help one of your consumers (i.e., your audiences) understand the data better, put it in there. If not, don't waste your time or theirs. I see many modelers struggling to document "everything" when it is not necessary. "Everything" is dictated by your audiences. The modelers who are successful have defined what "everything" is based on their consumers and have reached an agreement to deliver that and nothing more.
Automate the Delivery
Automating the delivery of the metadata is the final best practice worth addressing. I see many customers using a manual, hand-rolled process to deliver metadata to their audiences. It typically involves generating different reports and document types that incorporate all of the changes to every data model, then distributing them at the end of the week via email or placing them on a variety of portals or network drives. Sometimes the effort bleeds into the weekend or the following week. Once users notice something is outdated, their trust is lost immediately, and it will be nearly impossible to convince them to return.
Communicating information about your metadata is as important as creating it, if not more so. This is one area where it is worth letting technology do the heavy lifting in the form of a self-service, self-maintaining methodology and reporting solution. The ideal is a web-based interface that provides a window into your models and their metadata, and that enables users to provide feedback, search, build their own reports and browse images of the models. If your modeling solution provides these collaboration features, great. If not, look at using other portal technologies like wikis and SharePoint. If you go the latter route, keep in mind that it will be necessary to integrate the two. Wiki technologies and their markup languages can make pages dynamic and self-maintaining, but you also need to know how to query the underlying structure of the modeling tool's files or repository for the information you need. If you plan to leverage static reporting from your modeling tool, make sure there is a way to schedule it on a regular basis and a central place to put the reports that keeps them secure and easily accessible.
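As a closing illustration, here is a minimal Python sketch of a publish script meant to run from cron or any scheduler. The metadata source and report format are hypothetical, but the timestamp is the important part: visibly current metadata is what keeps users' trust.

```python
from datetime import datetime, timezone

# Hypothetical model metadata; in practice, query your modeling tool's
# repository or parse its files to pull this at publish time.
entities = [
    {"name": "Customer", "definition": "A person or organization that buys products."},
    {"name": "Product", "definition": "An item offered for sale."},
]

def publish_report(path: str = "model_report.html") -> None:
    # Stamp the report so consumers can see at a glance that it is current;
    # stale-looking metadata is what destroys trust.
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    rows = "\n".join(
        f"<tr><td>{e['name']}</td><td>{e['definition']}</td></tr>" for e in entities
    )
    with open(path, "w") as f:
        f.write(f"<p>Generated {stamp}</p>\n<table>{rows}</table>")

if __name__ == "__main__":
    # Run this from cron (or any scheduler) so delivery never depends on
    # someone remembering to do it on a Friday afternoon.
    publish_report()
```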