It seems that juggling is the most useful of all skills when embarking on a data warehousing project. During the discovery and analysis phase, the workload grows insanely large, like some mutant science fiction monster. Pressure to deliver encourages rampant corner-cutting to move quickly, while the need to provide value urges caution, so as not to throw out the proverbial baby with the bath water as the project speeds along. Change data capture is a glaring example of this necessary juggling and balancing.
Change Data Capture Refers to Obtaining "Delta Data"
Change data capture refers to the processing necessary to obtain “delta data,” a data store containing only information about what has changed. Delta data is essentially the opposite of a “snapshot” or “universe,” which contains a complete set of all data for an entity type. Delta data is very much like gold: when one is dealing with large quantities of data and knows absolutely and exactly what has changed, processes can focus simply on implementing the change. When deltas do not exist as something to be simply extracted and used, processing cost and cycle time increase. Every record of the current data must be evaluated against every record of either the previous universe or perhaps the current target to look for variances to be applied. While it is the machine rather than a human performing these comparisons, it is still a brute-force approach to determining change.
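The difference in effort is easy to see in a small sketch. The record shapes and table below are invented for illustration, not drawn from any particular system; the point is only that when deltas already exist, the work is proportional to the number of changes, not to the size of the universe.

```python
# Hypothetical sketch: applying ready-made delta records to a target table.
# A delta row names the operation, the business key, and the new values,
# so the loop touches only what changed -- no full-table scan required.

def apply_deltas(target, deltas):
    """Apply (operation, key, row) delta records to a keyed target table."""
    for op, key, row in deltas:
        if op == "delete":
            target.pop(key, None)   # remove the business object entirely
        else:                       # "insert" or "update"
            target[key] = row       # upsert the new version of the row
    return target

target = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
deltas = [("update", 1, {"name": "Anne"}), ("delete", 2, None)]
apply_deltas(target, deltas)
# target now holds only key 1, with the updated name -- two operations
# of work for two changes, regardless of how large the universe is.
```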
Typically, the internal systems that provide the source for the data warehouse store change activity in a haphazard fashion. It is not unusual for a pastiche of approaches to be scattered about. Some systems, or objects within a system, may have absolutely no change tracking, so that everything is current and current is all there is. Other systems may have shadow tables that audit change activity. These shadow tables may track all change activity, or only changes to a subset of specific columns. Delete activity may or may not be properly tracked. And in some cases the “current” data tables themselves may have history embedded within them, each business object having multiple rows, each “active” for a particular slice of time. Or there may be a combination of approaches, all surrounding a single business object. One could have circumstances where the table of record for a data item has no history at all, but a secondary table is updated via a trigger from activity against the original source, and that secondary table in turn has a shadow table holding the historical activity.
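A shadow table is easier to reason about with a concrete toy model. This is a minimal sketch, assuming a source system that mirrors every write into an audit log the way a database trigger would; the table and column names here are invented for illustration.

```python
# Minimal sketch of a shadow (audit) table. Each write to the "current"
# table is mirrored as an audit row: (timestamp, operation, key, new_value).
# In a real system a database trigger would do this; here a wrapper
# function stands in for the trigger.
from datetime import datetime, timezone

shadow = []  # the shadow table: full change history, in order

def write(table, key, value):
    """Apply a change to the current table and mirror it to the shadow."""
    op = "update" if key in table else "insert"
    table[key] = value
    shadow.append((datetime.now(timezone.utc), op, key, value))

customers = {}
write(customers, 101, {"status": "active"})
write(customers, 101, {"status": "closed"})
# 'customers' holds only the latest version of key 101, while 'shadow'
# retains both versions -- exactly the delta data an ETL process wants.
```

Note that this sketch tracks only inserts and updates; as the paragraph above observes, delete activity is precisely the kind of change that such mechanisms often fail to capture.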
Change Data Capture is a Balancing Act
This all leads to one of those balancing acts. Digging into each source table and tracking down a proper means of change tracking takes analysis time and generates a few process hoops to be jumped through. Alternately, one may save analysis time by selecting only current values from the source and generating all deltas within the data warehouse ETL processes via comparison logic. This can greatly decrease the analysis effort. However, in choosing the brute-force method, there may be circumstances where the data volumes are such that processing time becomes excessive each and every cycle.
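The brute-force comparison just described can be sketched in a few lines. This is an illustrative example with invented keys and rows: the previous universe and the current extract are compared key by key, which is why the cost grows with the size of the full data set rather than with the number of changes.

```python
# Sketch of brute-force delta derivation: compare the current extract
# against the previous universe to classify every change. Every key in
# both data sets must be visited, which is where the cycle time goes.

def derive_deltas(previous, current):
    """Return (operation, key, row) records describing how to get
    from the previous universe to the current one."""
    deltas = []
    for key, row in current.items():
        if key not in previous:
            deltas.append(("insert", key, row))      # brand-new record
        elif previous[key] != row:
            deltas.append(("update", key, row))      # changed record
    for key in previous:
        if key not in current:
            deltas.append(("delete", key, None))     # vanished record
    return deltas

prev = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
curr = {1: {"name": "Anne"}, 3: {"name": "Cal"}}
derive_deltas(prev, curr)
# yields an update for key 1, an insert for key 3, and a delete for key 2
```

One design note: this approach requires retaining the previous universe (or reading the current target) every cycle, so it trades analysis effort for both processing time and storage, which is the crux of the balancing act.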
In the course of navigating a project through to completion, juggling everything is a natural result. And just where is the optimal sweet spot to balance all these issues for successful project delivery? It does not seem that there is a definitive and correct answer; success can be obtained in many different ways. A good project leader finds a way to deliver somewhere inside that zone of success. Less important corners can be cut, but the important stuff must remain in order to provide something fit for use and of value for the user community.