The processing needed to populate a data warehouse is generically referred to as “ETL.” ETL originally stood as an acronym for “Extract, Transform, and Load.” Those three kinds of actions were considered the crucial steps compulsory to move data from the operational source [Extract], clean it and enhance it [Transform], and place it into the targeted data warehouse [Load].
Over time, folks have attempted to come up with clever name variations, like ELT or LTE, but regardless of the detailed steps used or their order, the term ETL remains and continues to provide the desired semantic and descriptive coverage. And while initially the process was literally intended, today it is called ETL even if the steps are different.
Three points need to drive ETL design
In establishing wonderful ETL processes, as opposed to mundane ones, three points need to drive the design. The first point is that every process should have a specific purpose. If data is to be extracted from a source, focus on extracting that data; do not attempt to bring in data from several other sources and mash up the results at the same time. The more tasks to be performed, the higher the level of complexity within the process.
By increasing the tasks to be performed simultaneously, what could have been a simple, easily-maintained process might end up as a complicated process that is hard to debug and correct. Therefore, to maintain simplicity, processes should remain focused on their main task. Second point, minimize touch points, means that each source’s data should be handled in as few steps as is practical. For example, if a process is inserting data into a table and then must come back and update those newly inserted rows, the designers must ask themselves if there is a way to avoid the “second touch” update; can the process be established to only insert a row and be done with it? This suggestion does not mean to contradict the first point, and “few” is not necessarily the same as “only once.” In order to successfully integrate these first two seemingly contradictory points, consider this third point: symmetry, identify and reuse patterns.
Design involves bringing conflicting goals together
Roads orthogonal to each other cross eventually. Design involves finding an elegance, a balance, a means of bringing conflicting goals together. Reusing code saves development time, debugging time, training time, etc. Keep processes that are the “same” the same, but recognize those processes that must be unique as unique. Finding the means to harmonize the often conflicting processing needs is where a design comes alive. When done well, providing symmetry to a suite of processes greatly empowers those who develop and maintain those processes; changes and enhancement arise seemingly in the blink of an eye.
Development teams should never simply dive directly into creating code. Incorporating these principles into a design requires that time be taken upfront to map out the ETL steps. Time used to articulate the design here, augments the speed with which the later coding can be accomplished by getting everyone in sync on the steps and tasks to be done, and by having already evaluated and worked through potential problems.
Well-designed ETL processes will do the heavy lifting
Data flow diagrams can serve as a useful tool to plan out a design. Working with data flow diagrams as they are sketched out layer by layer can help center the designer’s thought patterns. Ideally the various balance points and patterns will emerge. Then mapping documents can be created that point out the details for the developers. Stable and well-designed ETL processes will do the heavy lifting to effectively move data from the various source data stores into the target components of the data warehouse. If those ETL processes follow the design principles offered above, they will also make it easier for staff to learn how to maintain them, to build upon them, and they should prove to be error-free from the start.