As more organizations embrace big data and analytics to gain insight from extremely large datasets, the tools and systems used to manage data have grown, changed, and multiplied. Instead of just relational database systems, we now use NoSQL databases and Hadoop file systems to store increasingly large amounts of corporate data.
Given the towering importance of data in the modern organization, you would think that management and IT professionals would view data modeling as extremely important. It is somewhat ironic, then, that the age of big data has coincided with a long-term slide in data administration and modeling in many organizations. This is not a situation that should continue to be tolerated.
What is Data Modeling?
Data modeling is the process of analyzing the “things” of interest to your organization and how these things relate to each other. The data modeling process results in the discovery and documentation of the data resources of your business. As you create your conceptual and logical data models, you are developing the lexicon of your organization’s business.
A data model is built using components that act as abstractions of real-world things. The simplest data model consists of entities and relationships. As work on the data model progresses, additional detail and complexity are added, including attributes, domains, constraints, keys, cardinality, requirements, and, importantly, definitions of everything in the data model. If we want to understand the data we have, and how to use it, a foundational model is required.
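To make these components concrete, here is a minimal sketch of how entities, attributes, keys, cardinality, and definitions might be recorded in code. The class names, the Customer and Order entities, and the `undefined_items` helper are all hypothetical and chosen only for illustration; a real modeling tool would capture far more detail.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribute:
    name: str
    domain: str          # allowed type or value set, e.g. "integer" or "text"
    definition: str      # business definition of the attribute
    is_key: bool = False


@dataclass
class Entity:
    name: str
    definition: str
    attributes: List[Attribute] = field(default_factory=list)

    def key(self) -> List[str]:
        """Names of the attributes that identify an instance of this entity."""
        return [a.name for a in self.attributes if a.is_key]


@dataclass
class Relationship:
    name: str
    parent: str          # name of the entity on the "one" side
    child: str           # name of the entity on the "many" side
    cardinality: str     # e.g. "1:N"
    definition: str


def undefined_items(entities, relationships):
    """Return the names of model components that lack a definition."""
    missing = []
    for e in entities:
        if not e.definition.strip():
            missing.append(e.name)
        for a in e.attributes:
            if not a.definition.strip():
                missing.append(f"{e.name}.{a.name}")
    for r in relationships:
        if not r.definition.strip():
            missing.append(r.name)
    return missing


# Hypothetical example model: a customer places orders.
customer = Entity("Customer", "A party that buys from us.", [
    Attribute("customer_id", "integer", "Unique customer identifier.", is_key=True),
    Attribute("name", "text", "Legal or preferred name of the customer."),
])
order = Entity("Order", "A request by a customer to purchase goods.", [
    Attribute("order_id", "integer", "Unique order identifier.", is_key=True),
    Attribute("placed_on", "date", ""),  # definition deliberately left blank
])
places = Relationship("places", "Customer", "Order", "1:N",
                      "Each order is placed by exactly one customer.")

gaps = undefined_items([customer, order], [places])
```

Here `gaps` lists the one attribute that still needs a definition, which is the kind of check that keeps "definitions of everything" from being skipped.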
Issues with Big Data
Big data and analytics are an important part of modern IT. Analysts estimate that the amount of data we use and manage doubles annually, and performing analytics on that data can uncover previously unknown insights that lead to competitive advantage. Furthermore, the big data used to power analytics is being adapted for use by AI and machine learning software, which can further improve the return on our computing investment by automating processes and tasks, thereby increasing productivity and operational efficiency.
But issues can arise when flexible schema technologies such as NoSQL and Hadoop are used. Such flexibility is often a requirement when large amounts of data are being discovered, ingested, and moved into an organization. When one row (or record) of data can have a different schema than the next, you cannot apply a fixed model to the data.
Nevertheless, the programmer has to know what the data looks like. You cannot just throw a big lump of data at somebody and say, “Here is the data, now write me a program.” Well, you can say that, but then the programmer (or somebody) has to analyze and document the structure of the data.
Hmmm, that sounds a lot like a data model, doesn’t it? Well, it should, because it is. Instead of being done upfront, before any code is written, as is common in the relational world, big data modeling is sometimes performed based on application queries in program code or tools. What we want to avoid is having all of the knowledge of the data embedded in application programs, as was common before relational databases became popular in the 1980s. And we should try to avoid requiring developers to re-model the same data every time it is used. Data modeling creates a system of record for enterprise data that is accessible by all, and not just those who understand the programming language du jour.
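Analyzing and documenting the structure of flexible-schema data can be partly automated. The following is a minimal sketch, assuming the records arrive as Python dictionaries (for example, parsed JSON documents); the `profile_records` function and the sample records are hypothetical. It reports which fields occur, which value types each field takes, and whether the field appears in every record.

```python
from collections import defaultdict


def profile_records(records):
    """Summarize field names, observed value types, and presence across records."""
    total = len(records)
    seen = defaultdict(lambda: {"count": 0, "types": set()})
    for record in records:
        for field_name, value in record.items():
            seen[field_name]["count"] += 1
            seen[field_name]["types"].add(type(value).__name__)
    return {
        name: {
            "types": sorted(info["types"]),
            # A field is "required" only if every record carries it.
            "required": info["count"] == total,
        }
        for name, info in seen.items()
    }


# Hypothetical records in which each one has a slightly different shape.
records = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Grace"},
    {"id": "3", "name": "Edsger", "phone": "555-0100"},
]

profile = profile_records(records)
for field_name, info in profile.items():
    print(field_name, info)
```

A profile like this is only a starting point: it shows that `id` is sometimes an integer and sometimes a string, but deciding which is correct, and what the field means, is still modeling work that belongs in a shared document rather than in one program.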
Why Data Modeling is Still Needed
If I haven’t convinced you that data modeling is still important, then consider the impact of regulatory compliance: the processes and procedures that your organization follows to ensure that it adheres to governmental laws and applicable industry regulations.
This includes regulations such as HIPAA, PCI-DSS, and GDPR. These, and many other regulations, specify the types and instances of data that must be protected or controlled in various ways. Without a data model that identifies and defines what data you have (including where it came from and who uses it), how can you ever hope to be in compliance with all of the regulations that apply to your business?
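One way to picture how a data model supports compliance is a catalog in which each attribute records where it came from, who uses it, and which regulations are believed to apply. The sketch below is illustrative only: the `CatalogEntry` class, the entity and attribute names, and the regulation tags assigned to them are assumptions made for the example, not a statement of what any regulation actually covers.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CatalogEntry:
    entity: str
    attribute: str
    source: str                   # where the data came from
    used_by: Tuple[str, ...]      # who uses it
    regulations: Tuple[str, ...]  # tags assumed to apply (illustrative only)


# Hypothetical catalog entries derived from a data model.
catalog = [
    CatalogEntry("Patient", "diagnosis_code", "clinic_intake", ("billing",), ("HIPAA",)),
    CatalogEntry("Payment", "card_number", "web_checkout", ("billing",), ("PCI-DSS",)),
    CatalogEntry("Customer", "email", "web_signup", ("marketing", "support"), ("GDPR",)),
    CatalogEntry("Product", "sku", "inventory", ("sales",), ()),
]


def attributes_under(regulation, entries):
    """List the attributes tagged with the given regulation."""
    return [f"{e.entity}.{e.attribute}" for e in entries if regulation in e.regulations]


def users_of_regulated_data(entries):
    """Return every consumer that touches at least one regulated attribute."""
    return sorted({user for e in entries if e.regulations for user in e.used_by})


gdpr_attributes = attributes_under("GDPR", catalog)
regulated_users = users_of_regulated_data(catalog)
```

With the model in this form, a question such as "which data falls under GDPR, and who touches it?" becomes a simple lookup instead of an investigation across application code.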
Start Today
If you do not model and define your organization’s data today, you can always start doing so on a project-by-project basis. Incorporate the documentation of data as a component of every new project you start. Educate your development teams on the importance of data modeling and proper documentation. And be sure that a data model is required for every project moving forward. Using this approach, over time, you can build data modeling into the fabric of your organization.