Articles by Todd Schraml
A surrogate is, "One appointed to act in place of another." A surrogate key is appointed to act in place of a natural key. Why? Well, one could argue that surrogate keys existed since the beginning of the computer. As the computer uses internal addresses for finding everything, and in the early days that usage had to be explicitly managed by those programming. It wasn't until relational theory came along that the concept of using data values as identifiers arose and were functionally implemented.
Posted January 09, 2025
The beauty and joy of a relational database is the concept of relational closure—everything is a table. Beyond the eponymous table, query results are also "tables." Any query serves as a table to be queried by any other query, which is why queries may be nested almost infinitely within queries.
Posted December 12, 2024
Far too often information engineers and others take a dismissive stance about managing data structures. This indifferent attitude is largely because whatever tool the engineers use provides fast performance in resolving queries. They believe that quick performance means all is well, so nothing else matters, right?
Posted November 14, 2024
In multidimensional data modeling practice, there is the concept of a bridge dimension. A bridge dimension, being the dimensionalized equivalent of a normalized data design's associative entity, allows for the resolution of a many-to-many relationship. By its very nature, the bridge is saying, "Yes, let's normalize our dimensional designs." The bridge dimension exists to resolve a many-to-many relationship between two other dimensions.
Posted October 10, 2024
The lines between entities within an Entity-Relationship Diagram [ERD] represent an interdependency between the involved entities. In a normalized designed, this interdependency is both semantic and functional, as in, "a HOUSE has one-or-more DOORs," "an ORDER has one-or-more ORDER LINEs," or "a STUDENT enrolls in zero-to-many CLASSes." These object pairings would have a line drawn between them in an ERD representation. There also would be markings from whatever notation one is employing, designating the "one" side of the relationship and the "many" side of the relationship.
Posted September 12, 2024
Everyone likes a diagram. Diagrams often help make an issue clear—except when they don't. There are plenty of diagrams that really do not convey information to the viewer, and that is a sad state for any diagram to find itself in. Maybe, just maybe, not everyone has the proper temperament to create a useful diagram? Before composing a diagram, the diagram-builder needs to be clear about what viewers of the created diagram should gain from viewing it.
Posted August 08, 2024
Eventually the day comes when an existing data mart or warehouse needs to be re-examined. Maybe new workloads are desired, new business models arise, a new platform is considered, or new supporting tools in the environment are wanted. Especially when a new platform is part of the equation, many in charge think first of "lift and shift" as the approach. Taking as much existing data and processes as possible and keeping them virtually unchanged as things are moved onto new foundations is almost always the cost-saving approach—for the immediate short term.
Posted July 11, 2024
Data modeling has always had a push-me/pull-you relationship in the IT world. The idea of complete logical and physical data models, insightful definitions, coherent data flow diagrams, and other related informational caches is always desirable. However, far too often, the expectation is that these activities take virtually no time to create. When the expenditure of resources goes beyond a certain level, the activities are jettisoned or otherwise given the short shrift. Actually, getting code in place is viewed as the only evidence of true progress.
Posted June 13, 2024
A first and important step in being successful at almost any job or task is understanding oneself. This is true for engineers, modelers, or even architects. Do you know how to perform your tasks? Are you faster or slower than others doing the same tasks? Do you have more or less errors in your output than others? It seems that a rising number of people are so uncomfortable with competing that they may even avoid asking these kinds of questions of themselves; but this is not about competition, it is about understanding oneself, it is about being responsible, it is about holding oneself accountable.
Posted May 09, 2024
There is an old quote attributed to various origins that says, "When I hear the word ‘culture,' I reach for my gun." The saying expresses frustration at a term, in this case "culture" being used, abused, and politicized beyond all forbearance. Even when it was first popularized, it was a line meant to draw laughter from the audience of a play, rather than suggesting anyone carry an actual firearm. And no one is suggesting anyone do so now. "Relational" is a term that, over the last few decades, has undergone its own level of use, abuse, and politicization.
Posted April 11, 2024
As disheartening as it may sound to new graduates, even when you leave school never to return, you still must continue to learn. There are always new tasks to comprehend, and new tools to drive. And when one serves as a data modeler, one is always going to be exposed to new data. Fortunately, if one has a good foundation, acquainting with new data can be easily managed. Every new conceptual or logical data model is an opportunity to develop a deeper understanding about another subject area within an organization.
Posted March 14, 2024
In far too many organizations, processes become automatic. Requests are made, and requests are fulfilled. People wish to please; they like to show how fast they can respond or how agile they are. The same is true when addressing database design. This delivery-focused desire comes from a good place. However, regardless of how well-intended, blindly fulfilling requests is dangerous and harmful to one's overall results. Optimization of processes, especially processes involving database design, should never remove the step where questions are asked and rationale is provided.
Posted February 08, 2024
When I was in high school, a little German lady would show up as a substitute teacher for almost any subject. Apparently, she had a very broad skill set. As I mentioned this was high school, eventually we students would become disruptive, unruly, loud—generally annoying—and our substitute teacher would give us "the speech," spending what I recall as far too much time explaining to us that "knowledge is power" and how we would be better served by changing our attitudes. At the time I did not believe her exhortations had any impact on 15-or-so year old me, but over the years it turns out she did.
Posted January 11, 2024
When one hears the term "degenerate dimension," thoughts of teenagers, leather jackets, motorcycles, and petty crime may come to mind. After all, the word "degenerate" is associated with something that is "below normal" or "corrupt." Science tries to rehabilitate degenerate's meaning by using it for a manner of simplification.
Posted December 14, 2023
The original intention of surrogate keys across multidimensional database designs was to help optimize joins by keeping all keys, used for joining between facts and dimensions, both numeric and single-columned. Often such surrogates were generated as simple sequential numbers, 1, 2, 3…. Early versions of many relational database products had sub-optimal performance when joining data together via character strings—especially large quantities of data.
Posted November 09, 2023
Many people seem to become filled with anxiety over the word "normalization." Mentioning the word causes folks to slowly back away toward the exits. Why? What might have caused this data modeling phobia? Do people have images flashing through their minds of data modelers running around wearing a hockey mask and carrying a chainsaw screaming, "Give me your primary keys"? I hope not.
Posted October 12, 2023
Data models allow for the expression of a great deal of clarity and precision—when the data modeler chooses to allow for it. Many designers seem to work in "sloppy" or "imprecise" mode. Entities are defined containing many nulls allowed attributes. Certainly, if in the existing situation the source data is so dirty that every defined attribute "should" apply, but randomly things are not passed on, then yes, the data model is accurate. However, if the condition is such that the object has many very similar sub-objects, and various combinations of attributes must be populated based on which sub-type is being instantiated, then the data model is not expressing those rules very well.
Posted September 14, 2023
When Apple first released the iPhone, the company did something that was a bit wild, a bit innovative, especially for a new piece of technology. The iPhone did not come with a 500-page instruction manual explaining how everything worked. There was very little documentation for users. Apple worked hard designing the interface to allow users to apply their own intuitions and navigate around. While some grumbled, it was successful enough, and people adapted.
Posted August 10, 2023
Occasionally one may hear that a data model is "over-normalized," but just what does that mean? Normalization is intended to analyze the functional dependencies across a set of data. The goal is to understand which data elements relate to what other data elements. The context of a normalization exercise is the semantically constructed reality within a chosen organization.
Posted July 13, 2023
In dealing with databases, there are times when one must do some level of programming. With the rise of the variety of ETL, data pipeline, and data platform tools, programming elements have become mushy, confused, and scattered. This confusion has arisen because of what happens, and where that "what" happens has devolved into everything, everywhere, all at once.
Posted June 08, 2023
In 1961, a clever science fiction author named Robert Heinlein coined a new term in one of his novels. The term was "grok." Grok is a wonderful word that means "to empathize or communicate sympathetically; also, to experience enjoyment," and is a term similar to the later sixties phrase "dig it." Grok meanings also include "to drink," "to love," and "to be one with." Grok can apply when one experiences those moments of insight as a new realization coalesces in one's mind, or as that joyful experience continues to provide a thrill. Perhaps grok can be a mental state one achieves as one attains the peacefulness advertised in another 1960s term—"be here now." Grokking can be vital in working through the creation of a new logical data model. If one does not understand the data, then one cannot model the data. Alternately, if one groks the data, the data modeling efforts will flow freely.
Posted May 11, 2023
Sooner or later, in business, even in IT, almost everyone must write something more than simple emails. Procedures, standards, business cases, proposals, plans: All kinds of things may need to be authored by everyone and contain descriptive narrative about the subject at hand. Much documentation can be found on the internet, but your organization's internal rules, goals, and desires cannot be found via a Google search. Even database designers and data architects must contribute to corporate internal literature.
Posted April 13, 2023
Across database structures, dates are ubiquitous: sales date; order date; shipment date; receipt date; created date; last updated date. You might think that with such an abundant presence, dates would be well understood. Sadly, dates hide in the shadows—and many are confused over how dates work. People run a query, get a result, and seem to expect that what one sees, is what one gets. If they see a two-digit year, they believe that is exactly how the data is stored.
Posted March 09, 2023
Designing a database is not a magic trick—the design process does not involve trap doors, mirrors, or unseen wires holding things up. Sometimes database designers and data architects fail to think ahead as their designs are composed. Rather than checking on each item, the designer believes that whoever provided them the requirement "knew what they were doing."
Posted February 09, 2023
A good data architect is a gift to their organization because they achieve results that advance an organization's data maturity. What kinds of characteristics lead to a data architect being considered good to great? One of the first traits needed is attention-to-detail. Many tasks that fall under the data architecture umbrella may have very repetitive and tedious aspects.
Posted January 12, 2023
Often data is categorized into very high-level groupings of structured or unstructured. Generally, structured data is considered data that conforms to an easily identifiable pattern and as part of this conforming, that data may be easily loaded into a relational database table "as is." Examples of this might be fixed-format files, or comma-separated files having an agreed upon pattern to each record within it. Unstructured data supposedly cannot be loaded "as is" into a relational table. Unstructured data is, by name, lacking an identifiable structure to make sense of the data, right? Not exactly.
Posted December 08, 2022
In the simplest terms, a logical data model is a visual representation of the business rules and requirements covering the universe-of-discourse for a given solution or enterprise, along with some textual support. Keeping this in mind, the logical data model is a metaphor describing the piece of the organization under analysis.
Posted November 10, 2022
In a Continuous Deployment/Continuous Integration (CD/CI) world, most commentary dances around references to database changes as being a bottleneck, hard, awkward, or even painful. The reasoning supporting this perspective of suffering seems to arise from a desire for all changes to be fairly isolated matters. After all, a new function within an API, driving a new behavior on a screen, can just slip in, start executing, and life continues.
Posted October 06, 2022
In building a logical data model, some entities are considered strong, other entities are considered weak. Strong entities are the most foundational elements within a nascent Entity-Relationship Diagram (ERD) and comprise the list of objects that would likely come to mind first. Strong entities are independent in that they exist all on their own and can be created without having to meet any pre-conditions.
Posted September 08, 2022
In science fiction movies, the word "entity" pops up when discussing an as-yet-unseen but at least suspected alien presence, presumed evil (or sometimes evil presence, presumed alien). This entity is not named until half the cast has vanished, then we might hear, "It's Cthulhu's daughter and she's angry!" But not all entities are evil or angry.
Posted August 11, 2022
Data modeling has always been a task that seems positioned in the middle of a white-water rapids with a paddle but no canoe. On one side of the data modeling rapids are the raging agilists who are demanding working software and decrying virtually all documentation. To this agilists' group, data modeling is often seen as too simple to matter. But at the same time, their implementations will miss standardization in naming or data model patterns. And results may be so far off course that major rework is unavoidable. Sadly, far too many agile practices have been set up to place things under the technical debt umbrella, when in reality those practices never allow the re-factoring closet door to be opened. Poor data models are "overcome" by creating ever more complex logic around the data in order to get to a more proper result, as developers learn what really needs to be accomplished along the way, maybe. The results may work but can be a nightmare to maintain.
Posted July 07, 2022
Data modeling is the process of defining datapoints and structures at a detailed or abstract level to communicate information about the data shape, content, and relationships to target audiences. Data models can be focused on a very specific universe of discourse or an entire enterprise's informational concerns. The final product for a data modeling exercise varies from a list of critical subject areas, an entity-relationship diagram (ERD) with or without details about attributes, or even a data definition language (DDL) script containing all the SQL commands to build a set of physical structures within some chosen database management system (DBMS).
Posted June 02, 2022
The value of normalization is in understanding the data well enough to create the normalized design. Pulling out the business rules, business terms, and relationships from the mass of jumbled together raw content is critical. The business rules that result from performing the normalization exercise establish the requirements that need to be satisfied by solutions, whether they are either built or purchased. When an organization creates and maintains a normalized design for the data within the important areas of their business, they reduce work on all future systems.
Posted May 04, 2022
Data architects live in a world caged by bars of process, standards, and documented procedures—things many would consider a high ceremony lifestyle. As an industry, information technology has been migrating more and more into agile frameworks for some time now. High ceremony is often seen as an earmark of "waterfall" approaches, which constitute the evil empire that agile frameworks are fighting to replace. The result of this opposition is that formal data architecture groups often do not fold easily into agile approaches.
Posted April 07, 2022
Often one reads a book or hears a presenter making a pun about relational theory being called "relational" because of entities being "related." Such references are nothing but misplaced puns. Relational theory derives the relational in the name from the idea that a "relation" is a mathematical term synonymous for a "set" and each entity represents a set of some sort. However, relationships between entities are still a very important concept albeit not an eponymous one.
Posted March 11, 2022
Folks relate to physical tables; even the most non-relational-minded person can picture a fixed structure file and equate that to a table and its columns. The spreadsheet image is ubiquitous. DBMS-defined views are logically similar to tablesand in usage are certainly interchangeable with tables.
Posted February 08, 2022
Dealing with data warehouses, data marts, and even data lakes, can be awkward in an agile environment. While adding a single metric onto a dashboard can be very natural, no one builds a dimension table a few columns at a time. This awkwardness has caused many variants in how an agile methodology might be applied to one's analytics databases.
Posted January 03, 2022
Using surrogate keys within a database is often considered a technique to improve performance. The assumption is that using anything other than a numeric data value to join tables provides "bad" performance. Therefore, whatever the natural key may be—one column, multiple columns, alphanumeric, etc.—the surrogate key can be a 100% numeric single value, standing in for that natural key value set. Some DBMSs have key generators that are numeric, others may be more wide-ranging in values. Some organizations may choose to use surrogate keys generated from hashed natural key values. Will surrogate keys improve everyone's query performance? As with the stock market, specific circumstances differ everywhere, so the individual results may vary.
Posted December 08, 2021
When normalizing data structures, attributes congregate around the business keys that identify the grain at which those attributes derive their values. Attributes directly related to a person, deriving their value from that individual, will appear on an Employee entity, or a Customer entity, and so forth. However, in the process of normalizing, the data modeler must identify the objects that are meaningful and useful to the organization
Posted November 01, 2021
Marvel should have an evil villain named "Null." Nulls have always been trouble in the relational world. Certainly, nulls are used all over the place by virtually everyone. Still, that does not mean that nulls are harmless.
Posted October 05, 2021
It is entertaining to be in a room filled with people all claiming that they love data. What becomes more entertaining is discovering how each one views data uniquely and enjoys doing different things with or to that data. The data community has become one that is far more diverse than many realize.
Posted September 15, 2021
The meaning and interrelationships of and between data are important. If you are the designer of a database and being lobbied to allow the creation of a table without a primary key, make sure you understand how the table is to be used, and that people will not be writing queries against the structure that will potentially be multiplying data unexpectedly
Posted August 02, 2021
Within IT, testing has matured as an industry. Many tools exist, and many IT shops have testing groups. But, often those testing groups are unable to assist on data-related projects. The heart of the problem is that the focus of the testing practice has been perverted. The testing industry is concerned primarily with validating code, specifically the kinds of code that interact with people. With data-related projects, the need is to test the accuracy of the data at rest within the structures.
Posted July 15, 2021
Inside a relational database management system, the principal persisted data structure is considered a logical relation. Operations performed against that data within the RDBMS result in a logical relation too. In other words, everything is a table. To step away a little from the term "table," let's use the word "grid."
Posted June 10, 2021
Anything worth doing, is worth doing again and again. Right? When building out and standardizing new subject areas or new sources for one's data warehouse, hub, or other analytics area, a task often initially overlooked is the logic bringing in the data. Obviously, everybody knows that the new data must be processed. What many ignore is the idea that establishing the process to bring the data in often must be done twice, or more.
Posted April 29, 2021
All data architects should consider themselves change agents for the organizations in which they work. But at the same time, business also wants to keep much the same. Such discoveries can be confusing when business is saying that change is desired, but their actions seem focused on preventing change. It can suggest the Albert Einstein quote (or the Baba Ram Dass quote, based on which version of history one subscribes to) that "We cannot solve our problems with the same thinking we used when we created them."
Posted April 06, 2021
Dashboard users doing unsophisticated, largely repetitive, operational reports and analyses ask their IT support personnel to provide data in a simple way; they demand fast performance from their queries; and they demand new functionality be provided quickly. While often not stated explicitly, the "simple" data presentation implies several characteristics.
Posted March 01, 2021
The data lakehouse is a merging of the organization's data lake and data warehouse into one platform, eliminating data redundancy and loss of time moving data around from one place to another. Is this newest data savior really better than all the data saviors of the past?
Posted February 10, 2021
An architecture derives its strength from a level of consistency in how things are implemented. However, that is not to say that a mindless devotion to absolute consistency is a good thing. Times will arise when exceptions to almost any rule are necessary. The skill, the art, the balance in applying decisions that result in a good data architecture across an organization are based on a prudent use of when to conform and when an exception is needed. If there are too many exceptions, it can rightfully be declared by observers that there are no rules and that chaos reigns.
Posted January 07, 2021
A recent study indicated that IT professionals were three times more likely to disagree with their leadership than professionals in other industries. Identifying a three-times-more-likely difference of opinion seems a significant variance. I don't know what detailed insights the study arrived at to account for this level of disagreeableness. Maybe the study considered IT folks as being more educated, more logical, and therefore more difficult?
Posted December 10, 2020
Data modeling has an intimate relationship with abbreviations. Since the creation of the very first data model, there were circumstances where fully worded names for tables or columns were simply too long to implement within one tool or another. Occasionally one runs into an individual who cannot conceive that anyone on the planet would abbreviate something in a different fashion than they do; but more often data modelers tend to enjoy consistency, and when possible, employ rules to support consistent outcomes. Abbreviations are no exception.
Posted November 04, 2020
Plenty of analytics environments have landing areas, plenty have staging areas, and some have both. So, are landing and staging just synonyms for the same thing? A survey of usage would show there is much overlap in implementations, and even some confusion.
Posted October 08, 2020
Why set a trap to fail at some point in the future? When you arrive at a design wherein an entity needs to have a relationship to almost every other entity, stop and think about what is happening, and review your reasoning before proceeding.
Posted September 09, 2020
IT management can often succumb to repeated patterns of behavior. One of those repeated patterns is accusing developers of being "perfectionists." The developer or developers in question will be told that they should not strive for perfection, because perfection is not needed for the circumstance, and, more importantly, attempts at perfection take too long. Is the developer a closet perfectionist? Maybe, maybe not.
Posted August 11, 2020
An enterprise conceptual data model is often seen as a high mountain to be climbed, a journey that will last a lifetime. People have visions of 10 feet or more of wall in the corporate offices wallpapered with an entity relationship diagram [ERD] that has zillions of teeny, tiny boxes and more relationship lines than the combined lines of queuing patrons in all Disney Resorts, when full. In this context, an enterprise conceptual data model is a daunting task not to be taken lightly. But in today's world, that enterprise conceptual data model can simply be a list of subject areas.
Posted July 01, 2020
As the number of types of slowly changing dimensions (SCDs) increased, things have ended with Types 0 through 7, making essentially eight of them. But it is unclear whether there full consensus exists among current practitioners on what actually differentiates each of these eight SCD types. Some confusion may result from the fact that when the first three SCD types were defined, each could be equated to a result for a dimensional attribute. Type 1 had facts associated to dimension values as they are currently, or always current, Type 2 had facts associated with dimension values as were current when the facts were processed, Type 3 had facts associated with both current values and values current when processed.
Posted June 10, 2020
One creates the potential for some interesting anomalies when building a star schema wherein the fact table contains future-dated metrics and any of the dimensions are Type 2. A Type 2 dimension tracks changes to the data items contained within it. Effectively, each dimension contains a surrogate key, a natural key with a start and stop date, and additional descriptor columns. If any of the descriptor column values change, the existing dimension row has the stop date populated while a new row is inserted with the same natural key, new start date, and new descriptor values.
Posted May 13, 2020
As we have moved forward with APIs and microservices, every organization has even more data stores to manage and more sources of data to consider. Sorting through data structures for operational solutions can become mind-numbing due to the variety, or even frustrating due to a lack of detail from many vendors. Source systems are no longer the monoliths they once were.
Posted April 08, 2020
From CEO to presidents, VPs, directors, to any number of mid-to-low level managers, the concept of hierarchies is pervasive in organizations. But, if one is dealing with a relational DBMS, a hierarchy remains an awkward concept.
Posted March 05, 2020
A good data modeler must know when there is something the users need to analyze, even when what that is is not necessarily obvious. Subtle differences abound between things in a state, versus actions taken to create or change states, versus the duration of a business object within or across states, versus business objects and the workflows containing them. Each of these subtleties drives differing metrics with distinct uses.
Posted February 10, 2020
When applied implementation efforts are not efficient, more often than not, the inefficiencies are due to the interference of an imp known as "churn," i.e., implementation wheels spinning away and not actually making progress. Churn is bad. Churn is one of the most destructive circumstances for any IT project. Churn may raise its ugly head at any point where a project requirement or need is left unclear.
Posted January 02, 2020
What exactly is a data architecture? As the Zachman Framework exposed long ago, different people look for different kinds of details and documentation to answer fundamental questions about an enterprise's architecture. Someone involved with infrastructure will need to understand the tools used and the methods employed to move data and to be clear on concepts about how security will be enforced. But these aspects are only initial parts of the overall architecture, and as such, a simple diagram of tools used is incomplete and insufficient for a comprehensive view of data architecture.
Posted December 01, 2019
Every organization needs a data warehouse. A data warehouse has never been a one-size-fits-all kind of solution. Variations exist and should be accepted.
Posted October 31, 2019
Certain kinds of issues in data modeling need to be addressed in specific ways. Many options may exist, but it is very rare for all the possible choices to be equally appropriate. The reasons for using a less-than-satisfactory path may be many. They may include a misguided concern for speed-to-completion or be a matter of the areas of control for those "in the loop" on the existence of the issue. Largely, this involves communication or, more precisely, the lack of communication.
Posted October 01, 2019
In order to protect your organization, it is critical to watch over the elements that have been built, keep processes running, and be on top of change. Spend the time and resources necessary to properly maintain the solutions for which you are responsible. The amount spent in such endeavors will be less time than that spent trying to play catch up on too many changes after bad things have resulted.
Posted September 03, 2019
An effective approach to processing and transforming large datasets is likely comprised of multiple steps. The large data will likely be split apart into several smaller sets, maybe even in a couple of differing fashions with a common and understandable theme. But there should not be too many split-apart variants; rather, as with the three bears, it should be just the right number of smaller datasets. And then, similar to solving a Rubik's Cube, a twist or two at the very end brings all the new and old datapoints together in a complete and organized fashion.
Posted August 07, 2019
Data virtualization enables the ability to have one or more data stores that break the bank processing-wise, because they can physically exist once but logically exist in multiple transformed structures. Occasionally, IT managers get the idea that data virtualization is a more generic answer, presuming that if it works for the big data, it can work for all data.
Posted July 18, 2019
At times, there is a need to have security within the database be a bit more sophisticated than what is available. On specific tables, there may be a need to limit access to a subset of rows, or a subset of columns to specific users. Yes indeed, views have always existed, and yes indeed, views can be established limiting rows or columns displayed. However, views only can go so far.
Posted June 10, 2019
CDC can greatly minimize the amount of data processed; but the cost is that the processes themselves become more complicated and overall storage may be higher. Costs are moved around, the final level of processing becomes focused on the minimal changes, and this minimization is the efficiency to be gained. Moving forward, using the data becomes standardized and ultimately straightforward.
Posted May 01, 2019
When working on a multidimensional design, every fact table within scope should be handled with care. In an ideal world, each low-level fact table represents the metrics related to a business event. The meaning of a fact table, ideally, should be evident based on the table name and the composition of the fact table's primary key. Deciding on a primary key for a fact table is an important choice.
Posted April 09, 2019
Clarity of vision is absolutely the most important part of database design. The data architect must understand the shape and patterns of the data being modeled. This lucidity arises when the designer understands the subject area, the goals of the target database, the nature of the data sources involved, and the internal lifecycle of the database objects in scope.
Posted March 04, 2019
In dimensional modeling, business events are typically designated as facts while descriptive information elements are dimensions. However, events (or information about them) occasionally serve as dimensions as well as facts. A good data architect must watch their p's and q's and be certain when it is appropriate for a fact to also serve as a dimension—or when the dual function is not appropriate.
Posted February 08, 2019
Data mart builders must understand what they are working to accomplish. The DBMS is not going to magically guide them to a solution. The builder is responsible for knowing how dimensional techniques work, why they work, and what options may exist within the dimensional framework.
Posted December 04, 2018
While relational database management systems are still the workaday workhorse, we are now adding into the mix document, columnar, and graph datastores, and their variants. Each datastore has something at which it excels, and other things it may not. Similarly, the rules followed in composing data structures, based on the platforms selected, also vary greatly.
Posted November 01, 2018
There has always been a need to tightly control some data items, such as passwords and Social Security numbers. Today, with the rise of concerns over personally identifiable information (PII), the General Data Protection Regulation (GDPR), and other legal mandates, a much larger group of data elements must be controlled. These legal data governance issues may need to guide our hands as we establish database designs.
Posted October 10, 2018
In establishing a staging or landing area for a data lake, a data hub, or a quaint data warehouse environment, structures need to be established that will mimic source data in support of two very basic queries. The first is: "What does the current source dataset look like?" And the second: "What change activity has occurred against the source since the last time it was interrogated?"
Posted September 04, 2018
In the big data world of today, issues abound. People discuss structured data versus unstructured data; graph versus JSON versus columnar data stores; even batch processing versus streaming. Differences between each of these kinds of things are important. How they are used can help direct how best to store content for use. Therefore, deep understanding of usage is critical in determining the flavor of data persistence employed.
Posted August 08, 2018
When we hear the term "think outside the box," how often do we really examine what that phrase truly means? First, one needs a box. And it is on this issue where most folks fail. Before one can consider what is "outside the box," one must clearly understand what exactly is meant by "inside the box." People often consider random approaches the same as being "outside the box." However, just different is not enough.
Posted July 02, 2018
Under usual circumstances, the one-to-many or many-to-many relationship, alone, drives the pattern used within the database model. Certainly, the logical database model should represent the proper business semantics of the situation. But on the physical side, there may exist extenuating circumstances that would cause a data modeler to consider including an associative table construct for a one-to-many relationship.
Posted June 01, 2018
For data architects, it is not unusual to use a data modeling tool to reverse-engineer existing solutions' databases. The reverse-engineering could occur for a functional reason, such as gathering information to evaluate a replacement option, or to comprehend a solution, seeking to work out what data should be extracted for a downstream business intelligence need.
Posted April 12, 2018
Many organizations talk about data governance, but rather than establishing an ongoing governance process that is involved with every project, the governance is viewed as a one-time task to be slogged through, over with and done. Such myopic approaches will only lead to failure. Data governance is like life, it is the journey, not a destination.Proper data governance brings commonality to an organization; it leads the journey to a single version of the truth. A single version of the truth does not mean everyone must kowtow to a single metric, but it does mean distinct calculations unique to dissimilar sub-groups have different specific names at the corporate level, even if those naming differences are subtle.
Posted March 07, 2018
Don't let the term "unstructured data" confuse you. Structure exists, somehow and somewhere, within unstructured source data. It is that subtle, possibly even encrypted, structure that contains the gems of knowledge that an enterprise seeks. Sometimes those gems only shine when extracted and combined with other little gems from other data sources.
Posted January 02, 2018
No, we weren't born in a crossfire hurricane, nor schooled with a strap right across our backs; we have other crises, mixed priorities, and resourcing deficiencies to cope with. But it's all right now. In fact, it's a gas. There is little choice for those who wish to survive in the IT trade. Either one copes with an ever-changing landscape, or one moves on to another industry. Technology shops across the nation, if not across the world, are in themidst of a crisis. The only problem is that this crisis has gone on for a couple of decades or more.
Posted December 01, 2017
Operational systems are where data is born. These systems either force people to enter their details or acquire the same from trusted sources. Names, addresses, merchandise selections, and credit card numbers are consumed. The operational solutions interact with users and compatriot applications to give birth to their raison d'etre, be it purchase orders, payroll checks, or any of the thousands of other documents and transactions.
Posted October 18, 2017
Every few years we hear of one new idea or product that will surely bring death to the relational database, or death to the data warehouse, or death to something. It appears that many prefer to see death or at least they greatly enjoy planning for it. Often these finalities never seem to arrive.
Posted September 07, 2017
There is an old joke that a tourist in New York City is lost while sightseeing and notices a musician leaving a taxi; he walks up to the musician and asks, "How do you get to Carnegie Hall?" The musician responds, "Practice, practice, practice." Obviously, such jokes are doomed to obscurity as people now will simply use the navigation app on their phone. However, the advice itself, in its own way, will always be relevant—practice, practice, practice.
Posted August 09, 2017
For everyone normalizing their designs, just remember: There ain't no such thing as a partial foreign key. When a project pretends that partial foreign keys are a valid concept, the end result is a database model that either lies to viewers, or does not give them the necessary information to readily join tables together.
Posted July 05, 2017
Updating fact rows inside a star schema set of tables is never a best practice. Even so, some organizations travel down this path.
Posted June 01, 2017
Choices are pervasive when designing a database. The data modeler must progress through a series of issues: What ideas are important? Which objects stand out? Which concepts can take a back seat? Adding to all those decisions comes understanding the target structures one is shooting at. A normalized design may lean one way, while a dimensional design may lean another.
Posted May 05, 2017
It often seems that working around things is a full-time task in every area of information technology. When workarounds are conceived and deployed, people are not always in agreement.
Posted April 07, 2017
Many businesses seem to believe that dimension tables and reference tables are indistinguishable objects. Apparently, the only difference worthy of note seems to be altering the object's name from "something reference," or "something xref," to "something dimension." As these organizations build multidimensional data marts, they often place a view on top of their reference table and feel good about how quickly and efficiently they complete their data mart.
Posted March 02, 2017
There was a time when what you saw was what you got. Building up the components of a business intelligence area was very straight-forward. A staging area was a staging area; an operational data store was an operational data store. But like buying a pitcher of beer for $2, or gas for less than a dollar per gallon, those days are gone. The dynamics have changed, things are more federated, and IT must accept more than one standard tool.
Posted January 03, 2017
Terms such as "active," "inactive," and "canceled" may seem mundane and inconsequential, and when folks hear the term,"valid values," their eyes glaze and expectations of interest diminish. But exciting or not, reference values and an understanding of them are important to every organization.
Posted December 01, 2016
Why is there such pervasive abuse and misuse of database views today? Views are a helpful tool in building a business intelligence environment, yet many organizations establish practices that not only rob views of their full usefulness but present patterns that actually confuse issues instead.
Posted November 02, 2016
One symptom of an organization in the middle of a knowledge vacuum is evidenced by SQL that often includes what appears to be extravagant usage of the GROUP BY clause. Writing GROUP BYs here, there, and everywhere becomes a little SQL development dance step, a jitterbug to bypass the issue—moving but not really getting anywhere. Why do these kinds of circumstances exist? Well, maybe the only expert on the involved system has retired and no one else has picked up the torch, so no one is willing to touch the code.
Posted October 07, 2016
Programming is a literal sport. Code does exactly what it is configured to do, no compromises. When the definition of a task is fuzzy, it is up to the developer to do what they believe is correct. Does the code reflect what is desired? That answer is left open to interpretation. Sadly, developers may not have a clear understanding, and even the users requesting the solution may not be sure. The results can be very painful for an organization. Expectations may not align with the delivered solutions. Users will blame IT; IT will blame users.
Posted September 02, 2016
As one works through the normal forms, be it a journey to the placid shores of third normal, or the more arid climes of Boyce-Codd normal form, or even the cloudy peaks of fourth normal and beyond—and before one starts thinking about normalizing the design—the database designer has covered a lot of ground work already. Before thinking of normalizing, one needs to have conceptualized the relations that might be within the solution's scope.
Posted August 04, 2016
In the data warehousing arena, development databases often get short shrift. Frequently, this situation arises because development databases are considered too much work to be done properly. So, instead of embracing the problem and following through on the necessary work, it is ignored or done poorly. One could almost say that ignoring the development database has become the standard practice.
Posted July 12, 2016
Many of the NoSQL tools out there, such as MongoDB, Couchbase, Hadoop, and others, purport to be leading a revolution and breaking the bonds of servitude to the restrictive, inflexible, established, relational market. They claim users need more, users need better … and they are there to help. Of course, when speaking about those relational flaws, the comments always focus on problematic aspects of a DBMS' physical implementation.
Posted June 09, 2016
Not too long ago, IT embraced the pattern language concepts of Christopher Alexander. Being an architect, of the more traditional variety, his ideas were based on creating spaces in which people felt good, even if they didn't comprehend exactly why. Architected spaces need to express multiple qualities that include being alive, whole, comforting, free, exact, egoless, and eternal. The more those qualities were embodied, the better people responded to the desirability of the space.
Posted May 04, 2016
The traditional information engineering approach advocates the placement of as much business logic as possible inside the database management system (DBMS). But, more recently, under the umbrella of service-oriented architecture (SOA), folks are arguing for placement of that business logic in a layer of code outside the DBMS. Occasionally, those who favor locating business logic outside the DBMS have even gone so far as to say that this logic "naturally" belongs in a non-DBMS-supported layer.
Posted March 31, 2016
When it comes to operational, third normal form approaches, too many database models today have very low levels of business veracity. This is because current IT trends work against incorporating too much truth inside the entity-relationship diagram. Instead, development teams center on the finished product and excessively worry about describing a solution rather than describing the actual organization.
Posted March 03, 2016
Many people claim to understand the basics of good data and database hygiene. Often, these same people claim it is all very simple and very obvious. However, when peering into existing code and databases, it doesn't seem that good practices are as obvious as people say. "A2" as the name of a column may sound ridiculous, but it has happened.
Posted February 10, 2016
Data modelers must look at the big picture of an organization's data ecosystem to ensure additions and changes fit in properly. Simultaneously, each data modeler must be focused on the minute details, adhering to naming standards, domain rules, data type practices, still remaining ever vigilant for instilling consistency across everything they do. And while focused on all of the above, their efforts must culminate in a practical model that serves the individual project's requirements while also being implementable, maintainable, and extensible.
Posted January 07, 2016
Our data models, in reflecting a specific business, must accurately portray the essence of each business. The unique reality within each organization drives the shape of every data model. The logical meaning of each data element originates with what is actually done and how it is accomplished within that particular organization.
Posted December 02, 2015
Many neophyte data modelers have trouble distinguishing between logical and physical data models. These novices likely cannot explain why each model exists, or the differences expressed between them. Sadly, such confusion also exists in the realm of the experienced data modeler. Not to say an experienced data modeler can't express the difference between a logical and a physical data model; but across a group of experienced data modelers one would not get a consistent answer.
Posted November 09, 2015
The Agile methodology is great for getting turgid development teams to start working faster and more coherently. With Agile, which focuses on more rapid, incremental deliverables and cross-departmental collaboration, the bureaucratic plaque is flushed from the information technology groups' arteries. But there is a dark side to Agile approaches.
Posted October 07, 2015
Data modelers face a choice when encountering multiple variations of a data item. Designers must focus on the longer term appropriateness of their decisions when choosing how their designs will play out; and going vertical or horizontal does have an impact over time.
Posted September 09, 2015
There are fundamental differences in priorities and approaches in building operational solutions and in constructing analytical supporting solutions. However, these differences can produce a gap in understanding that occasionally results in minor battles and skirmishes between these groups along the path of getting work done. A critical area of conflict is access to production data. When building an operationally focused solution such as a website, the application is supreme.
Posted August 10, 2015
Data governance is a touchy subject. Many organizations have tried to build a practice…and failed. Other organizations feel too daunted to even try. IT in particular has a very conflicted relationship with data governance.
Posted July 08, 2015
Petabytes, and more, of data are meaningless without guides to what it all is. Future data users need to know where it came from, how it was filtered, and how it was transformed. Without some functional level of governance we are headed toward a "datapocalypse"—having lots of content but little of value.
Posted June 09, 2015
The objects and relationships identified within any database design need to match with the very subjective realities considered true within a given organization. An optimal data model must incorporate the semantics of the business and provide for a structural arrangement supporting the necessary corporate instrumentality. And that model should do so, even if in the details things within the model are contrary to what other organizations may consider to be the nature of their reality.
Posted May 14, 2015
How does an organization acknowledge that data is important? An organization does so by enabling and supporting efforts for gathering and persisting information about the organization's data resources.
Posted April 06, 2015
In information services, providing a solution that removes a user's pain can be a great boost to one's psyche. Being the hero, wins trust, and may even enhance job security. However, organizationally, too many sideline heroics create chaos. The day after the crisis has passed, things are left in a less than wonderful state.
Posted March 12, 2015
Shunryu Suzuki, in his classic book, Zen Mind, Beginner's Mind, tells us that in the beginner's mind there are many possibilities, but in the expert's mind there are few. Both a beginner's and an expert's perspective have value, but bias toward either an expert or a beginner approach seems to consume an organization far too often.
Posted December 03, 2014
Big data tool vendors try to downplay the notion that data warehouses and data marts still need to exist, even in a big data world. Relational DBMSs are painted as "old-fashioned," "yesterday," and "inadequate." They beckon potential customers to take a dip in the refreshing data lake. The fact that big data, in all of its glory, is only part of a larger business intelligence solution is getting lost in the dialog.
Posted November 12, 2014
It is fair to say that relational theory is the only solid framework for establishing a rational expression of data that falls anywhere inside the boundaries of formal logic. As people continue to laud the "death of relational" by coming up with one or other "new" physical implementations of coding or data engines, from object-oriented, XML, columnar, or anything else one might name, the primary short-coming is that these are physical implementations that avoid having any formalized logic underpinning them.
Posted October 08, 2014
"On occasion, the inmates actually do take over the asylum. Similarly, business personnel can be defeated by their IT peers," writes Todd Schraml. Although some might argue that such an unexpected role reversal cannot happen, it may indeed occur when data architects skilled in their craft can identify the options that help achieve the business goals, versus those alternatives that stifle the business goals. "A good data architect must also be a very good business analyst," he notes.
Posted September 10, 2014
A healthy data warehouse environment can self-correct to accomplish the right things. And that self-correction is crucial, for no process is likely to be perfect from the start.
Posted August 05, 2014
Commitment to the creation of a business intelligence environment means embracing a well-measured and thoughtful allocation of resources with the right skills, the right tools, the right scope, and the right timelines
Posted July 03, 2014
How does one tame the data beast to start making tangible progress toward an organization's business intelligence goals? For any progress to occur on the business intelligence-side, tasks, projects and goals need to be fashioned into small steps along a much larger path.
Posted June 11, 2014
Finding the means to harmonize conflicting processing requirements is where a design comes alive. When done well, providing symmetry to a suite of processes greatly empowers those who develop and maintain those processes; changes and enhancement arise seemingly in the blink of an eye. In ETL, there are three key principles to driving exceptional design.
Posted May 08, 2014
Denormalization of a database design is a legitimate tool, when used appropriately. However, it is simply sad and sometimes even silly to see designs denormalized as a matter of course.
Posted April 04, 2014
Historical content within data structures increases their complexity. Making complexity "easy" for the initial developers and "harder" for the later users is a short-sighted approach.
Posted March 12, 2014
In many organizations, users find it hard to trust their own internal information technology (IT) group, leading them to try any possible option to solve problems own their own. The resulting stealth IT projects can lead to confusion or even complete chaos.
Posted February 10, 2014
Changes and enhancement to solutions are hard, even under the best of circumstances. It is not usual that, as operational changes roll out into production, the business intelligence area is left uninformed, suggesting that data warehouses and business intelligence be categorized according to the view of the old comedian Rodney Dangerfield because they both "get no respect."
Posted January 07, 2014
In the world of database products, trends seem to pop up quickly, like prairie dogs testing the air. However, regardless of the marketing jargon, or platform, vendor-product purchased, or open source utility downloaded, the one thing that remains unaltered is that in order to extract value from data, the data must be understood. The individuals within any organization who actually comprehend the data, the data structures, and all the exceptions to the usual rules are individuals who are considered critical resources.
Posted December 04, 2013
Changes to database structures should be performed in a coordinated fashion as the application processes that support the new functionality are rolled out into production. While the "work" involved in adding a column or a table to a relational database is actually minimal, often there are circumstances where developers and DBAs create additional columns and additional tables in anticipation of future needs. Sadly, this "proactive" effort results in databases littered with half-formed ideas, fits-and-starts, and scattered-about columns and tables that provide no meaningful content.
Posted November 13, 2013
How does one avoid the semantically wishy-washy use of NULL-surrogates and instead, actually design structures wherein NULLs are not necessary?
Posted October 09, 2013
Database management systems support numerous unique date and time functions - and while the date-related functions are many, they do not go far enough. One date-driven circumstance often encountered has to do with objects having a type of date range that needs to be associated with it. While there are some exceptions, this date range need generally ends up implemented via two distinct date columns—one signaling the "start" and the other designating the "end." Maybe, should the creative juices of DBMS builders' flow, such things as numeric-range-datatypes could be created in addition to a date-range data-type. Who knows where things could end up?
Posted September 11, 2013
Data models attempt to express the business rules of an organization. A good data model reflects the semantics used within an organization to such an extent that business people within that organization can relate to and easily agree with what is being expressed. In this regard the data modeler's goal is to properly mirror back the organization's concepts onto those people within the organization. The goal is not to force an organization into a "standard" data model, nor is the goal to abstract everything in the creation of a master model that will never need to change even if the business rules were drastically shifted.
Posted August 07, 2013
One of the principles within relational theory is that each entity's row or tuple be uniquely identifiable. This means the defined structure includes some combination of attributes whose populated values serve to identify an individual row within the table/relation. This, or these, attribute(s) are the candidate key(s) for the structure. The candidate key is also known as the primary key, or if a structure has multiple candidate keys, then one of them is designated as the primary key. When building up a logical design, primary keys should be identified by the actual data points in play.
Posted July 09, 2013
The grain of a fact table is derived by the dimensions with which the fact is associated. For example, should a fact have associations with a Day dimension, a Location dimension, a Customer dimension, and a Product dimension, then the usual assumption would be for the fact to be described as being at a "by Day," "by Location," "by Customer," "by Product" metrics level. Evidence of this specific level of granularity for the fact table is seen by the primary key of the fact being the composite of the Day dimension key, Location dimension key, Customer dimension key, and Product dimension key. However, this granularity and these relationships are easily disrupted.
Posted June 13, 2013
It seems that juggling is the most useful of all skills when embarking on a data warehousing project. During the discovery and analysis phase, the workload grows insanely large, like some mutant science fiction monster. Pressures to deliver can encourage rampant corner-cutting to move quickly, while the need to provide value urges caution in order not to throw out the proverbial baby with the bath water as the project speeds along. Change data capture is one area that is a glaring example of the necessary juggling and balancing.
Posted May 09, 2013
Dimensions are the workhorses of a multidimensional design. They are used to manage the numeric content being analyzed. It is through the use of dimensions that the metrics can be sliced, diced, drilled-down, filtered and sorted. Many people relate to dimensions by thinking of them as reference tables. Such thoughts aren't exactly accurate. A dimension groups together the textual/descriptor columns within a rationalized business category. Therefore, much of the content coming from relational tables may be sourced from reference tables, but the relationship between each source reference table and the targeted dimension is unlikely to be one-for-one. These grouped-format dimensions often contain one or more hierarchies of related data items used within the OLAP queries supported by the structures.
Posted April 10, 2013
Do not allow well-meaning but confused proponents to obscure concepts related to normalization and dimensional design. Under a normalized approach one usually would not expect for numeric data items and textual data items to fall into different logical relations when connected to the same entity object. Yet within a multidimensional approach that is exactly what happens. Multidimensional design and normal design are not the same, and one should not expect to claim that both approaches were used and that they resulted in the same data model.
Posted March 14, 2013
Establishing a data warehousing or business intelligence environment initiates a process that works its way through the operational applications and data sources across an enterprise. This process focuses not only on identifying the important data elements the business lives and breathes, but the process also tries very hard to provide rationality in explaining these elements to business intelligence users.
Posted February 13, 2013
Multi-dimensional design involves dividing the world into dimensions and facts. However, like many aspects of language, the term "fact" is used in multiple ways. Initially, the term referred to the table structure housing the numeric values for the metrics to be analyzed. But "fact" also is used to refer to the metric values themselves. Therefore, when the unique circumstances arise wherein a fact table is defined that does not contain specific numeric measures, such a structure is referred to by the superficially oxymoronic characterization of a "factless fact."
Posted January 03, 2013
Within the information technology sector, the term architect gets thrown around quite a lot. There are software architects, infrastructure architects, application architects, business intelligence architects, data architects, information architects, and more. It seems as if any area may include someone with an "architect"status. Certainly when laying out plans for a physical building, an architect has a specific meaning and role. But within IT "architect" is used in a much fuzzier manner.
Posted December 06, 2012
In writing a definition for an entity, an attribute, or any other element within a database design, the desired end is a descriptive text that is clear, factual and concise. Semantics are an ambiguous and often painful tool to employ. Balancing the need for clarity against the desire to avoid redundancy can be a juggling act that is hard to accomplish. One might not easily recognize what is complete versus what is lacking, versus what has gone too far. But even so, within a definition if one finds oneself listing valid values and decoding the value's meaning, then one has likely already moved beyond what is "concise." Lists of values easily add bulk and quantity of verbiage into a definition, yet such lists do not usually increase the quality of a definition.
Posted November 13, 2012
The beauty of a truly wonderful database design is its ability to serve many masters. And good database designers are able to empathize with those who will use their designs. In business intelligence settings, three perspectives deserve consideration when composing designs.
Posted October 10, 2012
It seems easy to fall into a state where projects and activities assume such soft-focus that routine takes control, where one simply does necessary tasks automatically, no questions are raised regarding what is moving through the work-life production line and everyone is essentially asleep at the switch. Certainly, we may have one eye open ensuring that within a broad set of parameters all is well, but as long as events are basically coloring inside the borders we continue allowing things to just move along. In this semi-somnambulant state we can easily add columns to tables, or even add new entities and tables, or triggers and procedures to our databases, then eventually at some point down the road have someone turn to us and ask, "Why this?" or, "What does this really mean?" And at that point, we surprise ourselves with the discovery that the only answer we have is that someone else told us it was what we needed, but we do not really understand why it was needed.
Posted September 11, 2012
A database design may occasionally show evidence that it lacks proper prioritization. Data models should express truths about the business, or about the universe of discourse. But in expressing business truth this does not mean a data model should express absolutely every truth that anyone might conceive. Some relationships are significant while other relationships are not. And as a general rule, database design is not an exercise in trivial pursuit. Insignificant truths only clutter up a design, increasing complexity, causing users' eyes to glaze over more quickly, and adding no real value towards the endeavors of the enterprise.
Posted August 09, 2012
The whole world can be divided into two groups, these being splitters and lumpers. Design battles are waged across conference rooms as debates rage over whether to split or to lump. Splitters take a group of items divide them up into sub-groups and sub-sub-groups occasionally going so far as to end with each lowest level becoming a group of one. On the other side of the design fence, lumpers combine items until everything is abstracted into group objects covering very broad territory, such as a "Party" construct, or ultimately an "Object" object. Within data modeling, arguments arise, such as whether to sub-type an entity. Or perhaps lumping is discussed as the grain of a multidimensional fact is proposed. This debate underlies much of the decision-making involved in determining what domains to create within a data model. The split-versus-lump issue is ubiquitous and universal. The question to split or lump arises across many kinds of choices, in addition to the entity definition, table grain, or the domain grain mentioned in the previous examples; this issue is at the heart of deliberations regarding establishing functions, overriding methods, or composing an organizational structure.
Posted July 11, 2012
In the dim, dark past of data warehousing, there was a time when the argument was put forward that "history does not change." It was posited that once a piece of data was received by the data warehouse, it was sacrosanct and nonvolatile. A fact record, once processed, was to remain unchanged forever. Dimensions, due to their descriptive nature, could be changed following the prescribed Type 1, 2, or 3 update strategies, but that was all. It was the expectation that due to their very nature, fact tables would become huge and in being huge would give poor update performance; performance so poor that updates would be virtually impossible to enact.
Posted June 13, 2012
It seems only reasonable that what one person can do, others can learn. On the other hand, taking people through training does not usually result in the creation of great new database administrators (DBAs). It often appears as if those who are exceptional at the craft operate at higher levels as they dive into a problem. Can training alone provide folks with the attention to detail, the urge to keep digging, or the ability to recall minutiae that allow them to rise from simply holding the DBA title to becoming someone who is a great DBA? Or must the genetic potential exist first, and then one might fall into the DBA occupation and astound those around them. It is very hard to say with any degree of certainty whether great DBAs are made or born; yet again the battle between nature and nurture arises.
Posted May 09, 2012
Organizations are populated with solutions entitled DW, EDW, BIC, ODS, EIW, IW, CIF, or BW. Why must every organization have a data warehousing or analytics solution identified by monikers from a very limited pool of choices? Why must every deployment of a database that is expected to function as an operational data store be called ODS? For internal solutions it seems that plain and dreary naming approaches are de rigueur. Two and three letter acronyms have long been a part of corporate-speak; but when it comes to IT systems, these TLAs have become exceedingly narrow and soul-less.
Posted April 11, 2012
Solution development work is usually accomplished via projects, or a combination of programs and projects. This project perspective often leads to thoughts of documentation as project-owned. And while many documents are project-specific, such as timelines, resource plans, and such, not everything is project-specific. Unless projects are established in a fashion whereby each is very limited in scope to the creation or enhancement of a single application or system, specification and design documents belong to the final solution and not to the project.
Posted March 07, 2012
Lectures related to master data bring forth all sorts of taxonomies intended to help clarify master data and its place within an organization. Sliding scales may be presented: at the top, not master data; at the bottom, very much master data; in the middle, increasing degrees of "master data-ness." For the longest of times everyone thought metadata was confusing enough ... oops, we've done it again. And, we have accomplished the establishment of this master data semantic monster in quite a grand fashion.
Posted February 09, 2012
Retaining the particulars of change over time is a fairly intricate configuration. Audit log or shadow tables are sometimes employed, but on occasion there is a need for the "old" and "new" rows to exist in a single operation table for application use. Far too often, the implementation of temporal data structures is shoddy, loose, and imprecise; rather than the fairly complex dance move such temporal arrangements must perform in actuality. The sub-optimal result is much like one's performance of the Funky Chicken at a friend's wedding; the desired moves are mimicked, after a fashion, but it is unlikely to earn high marks on "So You Think You Can Dance." The usual temporal implementation simply slaps on start and stop dates, debates a little over default date values versus NULLs, then moves on to the next subject.
Posted January 11, 2012
The cost for new development can often be easily justified. If a new function is needed, staffing a team to create such functionality and supporting data structures can be quantified and voted up or down by those controlling resources. Money can be found to build those things that move the organization forward; often, the expense may be covered by savings or increased revenue derived from providing the new services.
Posted December 01, 2011
It is not magic. Building a successful IT solution takes time. And that time is used in various ways: obtaining an understanding of the goal; mapping out what components are necessary and how those components interact; testing components and their interaction; and finally migrating those components into the production environment - otherwise known as analysis, design, development, testing, and deployment. Regardless of the methodology employed, these functions must always be addressed. Different approaches focus on differing needs and aspects. But any complete methodology must fill in all the blanks for accomplishing each of these tasks.
Posted November 10, 2011
Every project is, or should be, driven by user requirements. Requirements are the organization's way to articulate what needs to happen in order to provide value. Yet time and again requirements are looked at as something overly technical, mysterious, and too confusing to easily handle. Repeatedly, organizations use a template to ensure that requirements are defined early in the solution process. Sadly, the intended purposes are habitually defeated as these templates are filled with a lack of understanding for what information belongs in a given section, resulting in people creating documents that the authors themselves do not understand. Across many organizations requirements documents are created, reviewed, and even agreed to, that far too often are incoherent monstrosities saying nothing of actual value.
Posted October 15, 2011
When assembling a database design, one of the keys for success is consistency. There should be more than just similarity in the way things are named, the manner in which tables or groups of tables are constructed; the manifestation of these elements should follow standards and practices that are documented and understood. If one tries to rely on the idea that individual developers will simply look at existing tables and glean standards via osmosis as they add on or create new tables, then one does not actually have any standards at all.
Posted September 14, 2011
The Broadway tune goes, "The sun will come out tomorrow ... it's only a day away." The words from this optimistic jingle are often heard on IT projects that are overburdened with features and functions. On any project of significant size the list of desired things often becomes larger than the budgeted resources or time. Faced with limiting circumstances, the only option becomes aligning the work effort with the constraints and only doing what fits within those constraints. The items are "timeboxed," and the amount of planned work is exactly the amount of allowed work. Some things remain in, while other features are left out. Alternatively, responsibility for controlling a project may be ignored and the end date arrives with some things simply not completed. Essentially, a project is timeboxed by default.
Posted August 11, 2011
Naïve approaches to business intelligence will occasionally trap designers as they juggle operational data stores and data warehouses. The trap results from an honest endeavor to simplify designs and increase consistency throughout the solution. Under the umbrella of consistency a designer may plan for a reference table used for operational look-ups to perform a second service as a star schema dimensional table. Some or all reference tables then are declared by fiat to also be dimensions. While on a superficial level there are similarities between dimension tables and more normalized look up or reference tables, fundamentally these two concepts are separate things.
Posted July 07, 2011
Occasionally, one sees a data structure abomination. This atrocity involves an object of almost any type, in almost any database wherein the object has a start date but no end date. It is not that the finish date currently has no value and is null; it is that the end date does not even exist on the table structure. The stop date was never intended to exist. The object in question starts, but it doesn't ever end.
Posted June 08, 2011
Tables within a database management system (DBMS) need primary keys and defined indexes in order for the DBMS to have the opportunity to provide good query performance. Without indexing, the worst possible query performance is guaranteed. The content of these non-keyed tables remains unknown to the DBMS, a black box, where the only possible approach for query execution is to read every row one-by-one. Under such scenarios, the DBMS is little more than a file server that operates more slowly than usual. Without the primary key and indexing, the DBMS may expend even more processor cycles in moving data in and out of the DBMS proprietary storage areas than a file server uses in opening files.
Posted May 12, 2011
Dates are important. Without dates how can anything be planned? However, due dates have been know to increase in importance in the delivery of software solutions. Sometimes the due date becomes such an overwhelming creature of importance that the date is more important than following best practices, more important than verifying that what is built is correct, more important than the solution team gaining a proper understanding of the work they are attempting to perform.
Posted April 05, 2011
The understanding of object states and their transitions obviously is of great importance to the solution developers because as processes are built they will need to support each state and every possible transition. Additionally, knowledge of object states and transitions is of vital importance to the data modeler because the data must be persisted across each of those states, and often the state of an object needs to be easily identifiable. A data model minimally requires a reference table, along with the varying entities that reference that table (the foreign keys tracking an individual object's status). Specific states drive variations of required attributes or combinations of those attributes that apply to one state and not another. The logical definition of the database can identify these variations through the use of supertype/subtype constructs.
Posted March 09, 2011
Referential integrity helps manage data by enforcing validation between related entities. This enforcement follows logical semantics behind the database design -- i.e., an employee can only work for an already defined department; a prescription can only be written by a health care practitioner with the proper authority. A Foreign Key on an Employee table rejects data when any attempt is made to insert or update a row with a department value that does not already exist as a department identifier within a Department table.
Posted February 02, 2011
Back in the 1970s, the ANSI SPARC three-tiered model arose, foreshadowing a smooth intertwining of data and architectural design. The three tiers concept isolated the physical storage needs of data structures independent of business' perception of these structures. The three levels were comprised of schemas labeled external, conceptual, and internal, with each level describing the data in focus from varying perspectives.
Posted January 07, 2011
How does one know what one doesn't know? When evaluating what one knows, it is hard to know where to begin. The wise men say, "The more you know, the more you know you don't know." If one believes such commentary, what is known constitutes the tip of the proverbial iceberg. Databases have an easier time with such missing circumstances. If the rows of a database table are lacking referents, an outer join query filtering for NULLs might detail for you all the missing items. In developing and delivering projects, such a reference list for our minds to link to does not exist, for an outer join or anything else. Often, we do not know everything that needs to be done, particularly as a project starts. The difference between success and failure is not so much what one knows, but in how one handles the gaps between what is known now and what needs to be known before one finishes.
Posted November 30, 2010
Psychiatrists report play is important for keeping our minds and bodies healthy and happy. It has been uncovered that without at least some play, we mentally and physically deteriorate. Lack of play can lead to depression. Our brains are so hardwired in their need for playful activities that without such expressions we loose focus. Play enhances the creativity and imagination within us. Likewise, play has a proper role within our work. And not just for the psychological benefit to the individual, but for the overall health of the organization. For an IT development group, play is essential for finding better ways to approach problems. Within an IT context, play involves evaluating new ways to handle old problems. Prototyping, at times, constitutes play. Prototypes can be used for more than just the practicalities of the user interface; sometimes a prototype of specific functions is necessary for comparisons in order to judge how non-obvious techniques may perform. Such prototype endeavors can involve not only new algorithms, but potentially variant data structures. Ideally, such prototypes will improve team understanding of the data structures and offer insights into potential new and innovative approaches.
Posted November 09, 2010
As bizarre as it seems, no one sabotages our efforts more easily than ourselves. We often leave ourselves vulnerable to failure whenever we respond to requests for project estimates. Like "A Tale of Two Cities" estimating can bring our best and our worst. In offering estimates, individuals may undercut their own efforts. For example, a developer may only consider time coding, and fail to include enough headroom for testing, rework, or documentation. While some aspects of these estimating attempts might suggest a passive-aggressive approach to avoid tasks, much of this estimate-short-sheeting often results from a desire to please and provide a number to make management "happy." There is always a desire from management to have more things done, and to do things quickly on every project. Even when not stated, this unspoken desire colors the responses of the individuals doing the estimating.
Posted October 12, 2010
As databases are established, particularly databases intended to support analytics initiatives, responsibilities for the design must include articulating the planned approaches for enhancing and scaling the database over time. If a database is created to express a multidimensional data warehouse bus architecture, or a corporate information factory, or anything else, the explanation of this connection should exist somewhere. Such documentation should also expand on why things were decided as they were and the expected stylings to be associated with proposed enhancements. Descriptions involving anticipated processing patterns extend naturally from such architecture artifacts. Database and application design personnel should work together in the creation of such credentials to ensure these documents thoroughly cover the needs of the personnel involved in building and maintaining the solution.
Posted September 07, 2010
Design should be an intention, preferably a planned intention. In that intention, design requires more groundwork than a simple thought-train such as the following, "I planned to write a module that functions; since the module functions, my designs are working." Some situations do exist where true design really is less important than successful functionality. Determining the appropriate level of design and preparation offers an interesting question to every architect. At the lowest level, standards and practices present suitable patterns that can serve as a design skeleton for those low-level or isolated items not requiring a heavy-handed blueprint.
Posted August 10, 2010
When integrating data, evaluating objects from multiple sources aids in determining their equivalence. Each source may identify customers, but determining which customer from each system represents the same customer can prove daunting. Sometimes matching things is straight-forward; for example, if all sources should have an accurate social security number or taxpayer ID, success involves simply linking the matching numbers.
Posted July 12, 2010
Quality can be a hard thing to define. What is good and what is bad may not be easily identified and quantified. When a data mart accurately reflects data exactly as found in the source, should that be considered a quality result? If the source data is bad, is the data mart of high quality or not? If the data mart differs from the source, when is the difference an improvement of quality and when is said difference evidence of diminished quality? While it may seem self-evident that correcting the source of load data would be the "right" thing to do, in practice that direction is not necessarily self-evident. The reasons supporting this nonintuitive approach are varied. Sometimes changes to the source impact other processes that must not change, or the changes will expose problems that may provoke undesired political fallout, or it may simply be that making the proper adjustments to the source application would prove too costly to the organization. For all these reasons and more, in the world of business intelligence, the dependent data often is expected to be of higher quality than the source data. In order for that improvement to occur, data placed within the dependent mart or data warehouse must be altered from the source. Sometimes these alterations become codified within the process migrating data from the source. Other times changes are made via one-time ad hoc updates. Either way, this alteration leads to a situation in which the dependent data will no longer equate one-for-one to the source data. Superficial comparisons of this altered content will highlight the disparity that what exists for analytics is not the same as what exists for the operational system.
Posted June 07, 2010
It is rare to find an implemented database devoid of reference tables. Reference tables provide valid values for drop-down lists and in a slightly obtuse way, also allow for expression of a domain or user-defined data type within the database design. Nominally a reference list is but a list of values to be referenced. In practice, the implemented structure of a reference table is driven by small nuances regarding how each list is used. The signifiers may consist of a code or number that creates a short-hand way of expressing a state or value. Yet in addition to such simple codes, a "medium length" name column or even a column to contain a lengthy text description should exist in support of these codes. The names may be used in drop-down lists so that people need not memorize many code values, and the descriptions can assist help-functions by providing users with more understanding of a value's meaning or intended use. For operational applications it may be worthwhile to have an attribute expressing the order in which items should be displayed, unless alphabetical display is a preference. If codes are maintained via online screens, an indicator may be helpful in flagging values that are part of the system and not to be removed or deactivated via those online screens.
Posted May 10, 2010
Far too often we all talk past one another. This cross talking, while not always drastic, remains perceived as an understood fuzziness. Much of the time we ignore these minor miscommunications because precision and clarity are not necessarily critical in all situations. If the general gist is effectively understood between those communicating, that generality may be all that is necessary. Those involved in the communication may feel comfortable that assumptions made to "fill in the gaps" will fall within an acceptable range. Although the lack of clarity in the message communicated may be acceptable, in other circumstances it may not be acceptable.
Posted April 07, 2010
Primary keys come from candidate keys. Each candidate key consists of the attribute or attributes used to label a distinct row in a table. Every candidate key should contain the fewest number of attributes possible to identify rows individually and uniquely. Every entity within a design requires at least one candidate key.
Posted March 04, 2010
Designing a data model that supports the reporting and analytical functions is no different, initially, than any other modeling effort. Understanding the data is crucial. The data architect or modeler needs to feel comfortable with dimensional modeling techniques and needs to obtain a working knowledge of the universe of discourse for the subject-at-hand. A good start in gathering this knowledge begins by reviewing the operational data structures containing the identified source elements. The challenge in designing analytical solutions is found in applying best practices for analytics simply and effectively.
Posted February 09, 2010
The process for designing a database that supports the operational functions of an organization begins with simple understanding. The data architect or modeler needs to obtain a working knowledge of the language comprising the universe of discourse for the solution. This awareness is gathered through many activities, such as talking with the people currently doing the work, sitting with them and watching how they do their tasks, reading over existing training manuals or standard operation procedures. The designer is best served when figuratively walking a mile in the shoes of the future application users. The more that the designer knows about the user needs and goals, the better able the designer is to definitively craft a data model supporting user tasks.
Posted January 11, 2010
In composing a data model, the structures are put together thoughtfully and with intention. Data structures emerge from the application of semantics germane to the universe-of-discourse and filtered through the rules of normalization. Each table expresses meaning, with columns that are self-evident. The best models reduce the data items within the covered subject area into obvious arrangements. However, this simplicity often confuses observers, persuading many that modeling itself must therefore be a simple task. DBMS tools often incorporate wizards that facilitate the quick definition of tables which are then immediately created within the chosen database. These tools enable the developer to build tables in the blink of an eye. At times some prototypes are approached in this fashion, and while this provides for placeholders, such slapped-up table structures are insufficient for an industrial strength solution. Under these instant-table setups, developers often have no problem reworking large sections of code for minor changes, or misusing data elements to mean many things while making the determination of meanings at a specific point-in-time less clear. Unaware of these less-than-stellar consequences, users become confused; they often wonder why modeling tasks should ever need to be done because the proof of concept worked, didn't it?
Posted December 14, 2009
The pervasive nature of data continues unabated, leaving organizations more awash in data than ever before. Technology has enabled the access and leveraging of information to heights undreamed of a generation ago. Between corporate dashboards and internet Googling, vast quantities of information are truly at one's finger tips. Data-driven, domain-driven, model-driven … the data itself is a force to be met and managed. When managed well, users never explicitly think about the databases that persist all that data.
Posted November 11, 2009
The art of building software solutions is comprised of many moving parts. There are project managers coordinating tasks; business analysts gathering requirements; architects establishing standards and designs; developers assembling processes; DBAs building database structures; quality assurance staff testing components for compliance to requirements; and an array of supporting roles providing for functional environments, infrastructure, security, etc. A common task that everyone must perform is estimating the effort necessary to deliver results. Certainly for simple and/or repetitive tasks there is no need for recurrent estimating, since applicable values are based on past known metrics. For example, creating the third or fourth version of the same database within the same environment should allow a DBA to incorporate costs experienced previously as a guide. And unless something unusual occurs, such estimates should be on target. However for creative tasks, such as designing new structures, or building new kinds of processes, there will be no previous documented events to refer to. Faced with these circumstances, individuals usually are not allowed to shrug and say, "it will take as long as it takes," and be left alone.
Posted October 13, 2009
Composite keys are an implementation of business rules within the database. As an example, a table named INVOICE has a composite primary key consisting of Account_Number and Invoice_Date. In this example several possible rules are being expressed. The Account_Number which partially identifies an INVOICE instance implies that an account must exist before an invoice can exist. In addition to this INVOICE table, one expects to see a table named ACCOUNT with only the Account_Number as its primary key. Likewise, a referential integrity constraint would be defined between the INVOICE and the ACCOUNT tables based on that Account_Number value. In this manner, the DBMS would prevent Account_Number values from insertion into the INVOICE table unless they already existed in the ACCOUNT table.
Posted September 14, 2009
New data modelers often see things as black and white. But rather than being concrete flooring beneath our feet, knowledge is more like a gossamer web that builds up layer upon layer to provide the effect of a solid foundation.
Posted September 02, 2009
New data modelers often see things as black and white. But rather than being concrete flooring beneath our feet, knowledge is more like a gossamer web that builds up layer upon layer to provide the effect of a solid foundation. We may think we know facts, such as one plus one equals two, or Columbus discovered America in 1492. What we have come to know about our world comprises our own internal knowledge base. We gain much of this knowledge because it has been passed onto us by others, people with experience, parents, educators, clever friends, verbally or in print. But how do we know such items are real, actual, absolute bona fide beyond-a-shadow-of-a-doubt truths?
Posted August 14, 2009
The Escalating Necessity for Good Database Design
Posted August 04, 2009
While good database design is always necessary, the value of good design is reinforced by such endeavors as service-oriented architecture. The semantics-laden engineering of service-oriented tactics used in creating solutions melds seamlessly into the semantics-laden world of data modeling. Although one aspect does not necessarily build on top of the other, each works as part of a team in braiding solutions that grow in usefulness as they emerge. The processes surface as a method to further define the meaning of an object, and the object serves as a harbinger of the processes that must exist. Business rules much more advanced than today's simple constraints may one day reside within the database itself as a built-in function of a some future DBMS. Therefore, data architects should never try to wall themselves behind a veneer of thinking only in terms of data without a thought regarding the processes that touch the data. As a database designer, working within projects that use service-oriented architecture approaches can be exhilarating.
Posted July 13, 2009
There comes a time at the start of a new engagement when the data architect must acquaint himself with the system for the first time. When first learning about a new application, the relevant data, and its foundational concepts, many questions are reviewed.
Posted June 15, 2009
Database Elaborations: Digging Deeper into Many-To-Many Relationships
Posted May 26, 2009
An inherent awkwardness exists in every many-to-many relationship between entities. Ambiguity causes this persistent awkwardness, primarily because a many-to-many relationship is such a fuzzy thing. In data discovery, encountering many-to-many relationships may actually expose a level of disregard about details by the subject matter experts.
Posted May 15, 2009
In a relational database approach, regardless of whether one is considering persisted data or transitory query results, almost everything is viewed as a logical table. Associations between these envisioned tables are based on data values, versus the alternative to the pre-relational idea of linking multiple data structures via "hidden" pointers maintained by the system. Relationships among objects are ultimately derived from the semantics of a situation.
Posted April 15, 2009
In formulating the tenets of relational theory, issues anent to order were explicitly addressed. These relational theory tenets included defining a relational database so that it need have no concern with the order of columns in a row, or with the order of rows in a table. And yet, such a stance seems counter-intuitive since the database brings structure and organization to content. Chaos is the primordial soup from which all things originated. Thus it seems only reasonable that a relational database, being the best and brightest of its kind, should abhor such chaos and bring ever more order instead, right?
Posted March 15, 2009
Increased governmental oversight has amplified organizational attention concerning the tracking of content changes made to application database tables. Database log reading tools do not always provide the most useful means of reviewing specific changes for all organizational purposes. Fortunately, other reviewing options do exist.
Posted February 15, 2009
Occasionally the database under design becomes more than a one-off project. For example, the data mart or data warehouse might act as a profit center for the organization, and as a solution sold to many customers. The possibility exists to capture a basic truth for a given industry in a simple universal design exactly suiting everyone. But unless the situation really qualifies as a valid "one-size-fits-all" circumstance, one needs a level of customization within the architecture.
Posted January 15, 2009
Tight budgets and limited talent pools do not make a winning combination for staffing a database management group. Despite the importance of saving money, problems can arise when the lines blur regarding what gets wasted versus what gets saved.
Posted December 15, 2008
Posted September 15, 2008