Data models allow for the expression of a great deal of clarity and precision, when the data modeler chooses to provide it. Many designers seem to work in “sloppy” or “imprecise” mode, defining entities full of attributes that allow nulls. Certainly, if the source data is so dirty that every defined attribute “should” apply but values are randomly missing, then the data model is accurate. However, if the object in question has many very similar sub-objects, and various combinations of attributes must be populated depending on which sub-type is being instantiated, then the data model is not expressing those rules very well.
In an alternative circumstance, the model still lacks precision when the object defined, while not having many sub-types, has differing data needs based on its current state within its lifecycle. Under both of these circumstances, the data model obscures rather than illuminates the business rules. A viewer of the data model sees a long attribute list, but nothing within the model expresses which configurations of populated values are expected, or what circumstances drive their population.
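As a concrete illustration of the problem, here is a minimal sketch, assuming a hypothetical order-taking system, of the kind of imprecise structure just described (expressed as a SQLite table built from Python). Every state-dependent column simply allows nulls, and nothing in the structure says which combinations of values are expected together or when.

    import sqlite3

    # A sketch of the imprecise pattern: one wide order table in which almost
    # every column allows nulls. The column names are hypothetical; the
    # structure itself never says which columns apply in which circumstance.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE "order" (
            order_id        INTEGER PRIMARY KEY,
            customer_id     INTEGER NOT NULL,
            placed_date     TEXT,   -- null allowed: when should this be filled in?
            quoted_price    REAL,   -- null allowed
            shipped_date    TEXT,   -- null allowed
            carrier_name    TEXT,   -- null allowed
            tracking_number TEXT,   -- null allowed
            cancelled_date  TEXT,   -- null allowed
            cancel_reason   TEXT    -- null allowed
        )
    """)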
The rules defining when each column should be populated can be expressed within one’s data model. Defining well-named sub-types, in a super type/sub-type configuration, allows those reviewing the data model to understand when columns are to be populated. With a super type/sub-type arrangement, the super type and each identified sub-type have separate data structures. For example, if one set of columns applies when an order is first placed, and more columns apply once an order is shipped, this is easily and graphically expressible. Using the super type/sub-type construct, a Placed_Order sub-type contains the columns unique to a placed order, and a Shipped_Order sub-type contains the columns that only apply once an order is shipped, while a general Order super type holds the columns that must always be populated regardless of the order’s current state.
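The sketch below is a non-authoritative illustration, reusing the same hypothetical columns, of one way the Order example might be rendered as super type and sub-type tables. The Order super type carries only what applies in every state; a Placed_Order or Shipped_Order row exists only when its state applies, so its columns can all be declared not null.

    import sqlite3

    # Super type/sub-type sketch of the Order example. Column names beyond
    # Order, Placed_Order, and Shipped_Order are assumptions for illustration.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Super type: columns that must be populated regardless of state.
        CREATE TABLE "order" (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL,
            order_state TEXT    NOT NULL  -- discriminator, e.g. 'PLACED' or 'SHIPPED'
        );

        -- Sub-type: columns that apply only once the order has been placed.
        CREATE TABLE placed_order (
            order_id     INTEGER PRIMARY KEY REFERENCES "order"(order_id),
            placed_date  TEXT NOT NULL,
            quoted_price REAL NOT NULL
        );

        -- Sub-type: columns that apply only once the order has shipped.
        CREATE TABLE shipped_order (
            order_id        INTEGER PRIMARY KEY REFERENCES "order"(order_id),
            shipped_date    TEXT NOT NULL,
            carrier_name    TEXT NOT NULL,
            tracking_number TEXT NOT NULL
        );
    """)

A reviewer reading this structure can see which values are expected in which state without consulting anything outside the model.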
This typed data model configuration allows everyone to immediately grasp the semantics of the placed versus shipped situation. In this fashion, data models have fewer nulls. Any nulls the end users see will come from query results when performing outer joins, rather than from the data itself. But avoiding nulls in the database is not really the issue. It would be just as inaccurate to force every column to be not null, and then allow developers to freely use pre-defined “non-value values” like “-1” or “unknown.”
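Continuing with the same hypothetical tables, the short sketch below shows where the nulls an end user encounters actually come from: no stored row contains a null, and the nulls appear only in the result of an outer join against a sub-type that does not apply.

    import sqlite3

    # Stored rows contain no nulls; the nulls below are produced by the outer
    # join, not by the data. Tables and values are illustrative assumptions.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE "order" (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL
        );
        CREATE TABLE shipped_order (
            order_id        INTEGER PRIMARY KEY REFERENCES "order"(order_id),
            shipped_date    TEXT NOT NULL,
            tracking_number TEXT NOT NULL
        );
        INSERT INTO "order" VALUES (1, 100), (2, 200);
        INSERT INTO shipped_order VALUES (1, '2024-03-01', 'TRK-1');
    """)

    # Order 2 has not shipped, so it has no shipped_order row.
    for row in conn.execute("""
        SELECT o.order_id, s.shipped_date, s.tracking_number
          FROM "order" o
          LEFT OUTER JOIN shipped_order s ON s.order_id = o.order_id
    """):
        print(row)
    # (1, '2024-03-01', 'TRK-1')
    # (2, None, None)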
Sometimes imprecision in data models results from a lack of available business knowledge, a lack of patience to gather that knowledge, a fear that everything might change, or a simple lack of understanding on the part of the data modeler. Timelines often become more important than future rework. The result of this modeling imprecision can be data models that are fragile and easily broken as the previously ignored business knowledge rears its head, or, more benignly, data models that are not very useful to an organization and are rarely referred to because they do not tell the correct story. Even when developers and project management believe they need every column to allow nulls or non-value values, a data modeler still has the option of building a proper logical data model that expresses the precise rules, and then allowing the physical data model to warp the true business logic. Every data modeler must have the courage to speak up for gathering the proper business knowledge when building a data model. Not every match can be won, but a discussion never engaged will never end in success for a vibrant, accurate, and clear data model.