Large enterprises are facing a debt crisis. Not financial debt, but “data debt.” It’s a form of technical debt, and it can hamstring an organization’s capacity to tackle new challenges and stifle its ability to innovate. The problem is pervasive.
A recent article in Harvard Business Review reported that a mere 3% of companies' data meets basic quality standards. Software development teams have long understood, and reckoned with, the future work created by short-term tradeoffs made to ship code faster. IT organizations are now coming to a similar realization: decades of deprioritizing data management have left them with a massive remediation backlog.
For most large enterprises, the root of this problem lies in years of treating the data generated by their operational systems as a form of exhaust rather than as a fuel to deliver great services, build better products, and create competitive advantage. Every new enterprise application deployed is essentially a data creation engine. Unless companies have a method of easily integrating each new data source to capture and leverage the new data, the debt will grow daily—and exponentially—with the number of data sources in a company. For many companies, this problem is compounded by a history of M&A, reorganizations, “data hoarding,” politics, and rogue shadow IT activity.
The consequence of accumulated enterprise data debt is that many companies struggle to answer the most basic questions about their business, such as: How many customers do we have? What and how much do our customers buy? How many suppliers do we have? How much and on what do we spend with each one?
Accurate, complete, up-to-date answers to these questions require unifying data that is often spread across dozens or even hundreds of enterprise silos. Attempts to break down the barriers between silos typically run into a host of seemingly insurmountable technical, operational, and behavioral challenges. Organizations have tried to tackle this problem for decades, often at great cost and without much success. Approaches have included application rationalization, in which IT organizations work to reduce many dozens or even hundreds of instances of an application down to a handful; data standardization, in which IT teams pursue the tantalizing goal of a top-down, comprehensive data model that meets the needs of all users but proves hard to enforce and quickly becomes outdated; and master data management, a software-based approach that relies on developers to code deterministic rules to integrate a few source systems and then requires extensive manual curation of the data to create accurate master records.
These approaches have had a mixed track record historically, and with the increasing volume, velocity, and variety of data that enterprises must now manage, new options are needed. Fortunately, chief data officers (CDOs) charged with solving the data debt crisis can take a page from the playbook of their CFO colleagues.
What CDOs Can Learn From CFOs
Similar to cash, debt is a tool that managers can use to fuel business growth. But debt is also a liability—a future obligation—that eventually has to be reckoned with. If it grows out of control, the consequences are dire and sometimes life-threatening. Ask CFOs how much money their companies have and where it is, where it came from, and where it’s going, and you’ll get a precise, fast answer. They will have systems in place to control who has access to the company’s money and what they can do with it. They will be able to show what return that money is generating, and they will be able to move that money around to get the best return for the company. In this regard, CFOs are a role model for CDOs, particularly when it comes to managing and reducing data debt.
Historically, CDOs haven’t had the tools to measure, manage, and optimize their data in the way that CFOs can with cash. But that is changing. Companies such as Facebook, Google, and Apple, which treated their data as a strategic asset from their inception, have emerged to disrupt whole industries. They’ve built their data management infrastructure as a core capability, investing deliberately and consistently in systems to capture, store, curate, and share information. Broad (but carefully managed) access to high-quality information has fueled their rapid growth.
CDOs at large, long-standing enterprises wishing to emulate that kind of success face a more complex challenge because they must first address their accumulated data debt. But by thinking like a CFO and implementing systems and processes that manage data from creation to consumption, they can begin to create an asset that is more valuable than cash.
DataOps: A Strategy for Debt Relief
While the data debt crisis may have been decades in the making, the remedy doesn’t need to take quite that long to implement. Recent advances in data management technologies, combined with approaches that borrow liberally from the success of the DevOps revolution, now offer CDOs a new data engineering model for tackling the problem: DataOps. Just as the goal of the DevOps movement was to increase feature velocity in software, the DataOps approach seeks to radically increase analytic velocity.
As with DevOps, focus is essential to make DataOps work. CDOs should start by identifying a use case where clean, unified data is essential to achieving a high-impact business outcome. Next, they should assemble a team, often called a DataOps Center of Excellence (CoE), that includes a solutions architect, line-of-business subject matter experts, and an executive sponsor.
Invariably, the group's first task is to build an inventory of the data sources relevant to the identified business use case. The inventory becomes a catalog of the physical data attributes and their locations. The effort should start with well-documented, well-understood sources (many organizations have an MDM tool, for example) and work up the data chain to progressively less well-governed applications and sources such as CRMs, ERPs, and data lakes. This process typically begins to reveal the vast scale and variety of enterprise data sources, and the business-driven use case is essential for maintaining focus and avoiding an attempt to boil the ocean.
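To make this step concrete, the sketch below shows what a first pass at such an inventory might look like, assuming the in-scope sources are relational databases reachable through SQLAlchemy; the connection URL and the "crm" source label are hypothetical placeholders.

```python
# Minimal sketch of a first-pass source inventory, assuming relational
# sources reachable via SQLAlchemy. Connection details are placeholders.
import json
from sqlalchemy import create_engine, inspect

SOURCES = {
    "crm": "postgresql://user:pass@crm-db.example.com/crm",  # hypothetical
}

def inventory_source(name, url):
    """Return a list of {source, table, attribute, type} records for one source."""
    inspector = inspect(create_engine(url))
    records = []
    for table in inspector.get_table_names():
        for col in inspector.get_columns(table):
            records.append({
                "source": name,
                "table": table,
                "attribute": col["name"],
                "type": str(col["type"]),
            })
    return records

if __name__ == "__main__":
    catalog = []
    for name, url in SOURCES.items():
        catalog.extend(inventory_source(name, url))
    # Persist the physical-attribute catalog for the CoE to review.
    with open("data_inventory.json", "w") as f:
        json.dump(catalog, f, indent=2)
```

In practice the catalog would also record things like ownership and refresh cadence, but even a bare physical-attribute listing gives the CoE something tangible to review against the business use case.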
The next step is to architect a blueprint for creating a data management environment. While many organizations immediately procure software at this step, they should first catalog the tools they already have before buying anything new. Most large organizations have too much software sitting undeployed on the shelf, and the most expensive mistake a fledgling DataOps CoE can make is not doing enough due diligence at this stage. Understanding both the data inventory and the software inventory will help the CoE identify the critical capabilities it is missing. It can also illuminate why data unification projects have failed previously and enable teams to avoid repeating past mistakes.
Now, with a data inventory, a software inventory, and an understanding of key capability gaps, the CoE can architect a solution. This is the point at which it makes sense to procure new tools to address those gaps. While every organization is different, there are common data curation tasks that must be performed to create a pipeline that combines several sources to deliver high-quality, useful data. These include transformations, attribute mapping, record matching and deduplication, and classification.
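As an illustration of two of these tasks, the sketch below maps source-specific attribute names onto a unified schema and then removes duplicate customer records with simple fuzzy matching. The field names, sample values, and threshold are hypothetical, and a hand-coded approach like this is exactly what stops scaling as the source count grows, which is where the machine learning discussed next comes in.

```python
# Illustrative sketch of attribute mapping and record deduplication.
# Column names, sample records, and the similarity threshold are hypothetical.
from difflib import SequenceMatcher

# Attribute mapping: each source's physical column names onto a unified schema.
ATTRIBUTE_MAP = {
    "crm": {"cust_nm": "name", "cust_city": "city"},
    "erp": {"CUSTOMER_NAME": "name", "CITY": "city"},
}

def to_unified(source, record):
    """Transform one source record into the unified schema."""
    return {target: record[src] for src, target in ATTRIBUTE_MAP[source].items()}

def is_similar(a, b, threshold=0.85):
    """Crude name similarity using a standard-library string matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def deduplicate(records):
    """Greedy pairwise matching on name and city; workable for a small sample,
    not for millions of records spread across hundreds of sources."""
    kept = []
    for rec in records:
        if not any(is_similar(rec["name"], k["name"]) and rec["city"] == k["city"]
                   for k in kept):
            kept.append(rec)
    return kept

rows = [
    to_unified("crm", {"cust_nm": "Acme Corp.", "cust_city": "Boston"}),
    to_unified("erp", {"CUSTOMER_NAME": "Acme Corp", "CITY": "Boston"}),
]
print(deduplicate(rows))  # the two source records collapse into one entry
```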
It's likely that the data sources to be unified will number in the tens or hundreds, and, given the natural variety across that many stores, the above operations can scale to the challenge only through automation that minimizes human intervention. DataOps tools should therefore incorporate technologies such as machine learning to achieve the necessary scale and level of automation. While machine learning for predictive analytics is unlikely to yield much value if the underlying data is of poor quality, using it to automate data preparation can yield important benefits: integration models built with machine learning algorithms function as a transparent team of data curators working at machine scale, and they can reduce costly manual intervention by orders of magnitude.
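As a rough illustration of how such automation can work, the sketch below trains a simple classifier on a handful of curator-labeled record pairs and then scores unreviewed candidate pairs, so that only low-confidence matches need human attention. The features, labels, and model choice are illustrative assumptions, not a description of any particular vendor's product.

```python
# Sketch of machine-learning-assisted record matching: a classifier learns
# from a few human-labeled pairs, then scores the rest automatically.
# Features and labels below are made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each candidate pair is reduced to similarity features, e.g.
# [name_similarity, address_similarity, tax_id_exact_match].
labeled_pairs = np.array([
    [0.95, 0.90, 1.0],   # curator labeled: match
    [0.97, 0.40, 1.0],   # match
    [0.30, 0.20, 0.0],   # non-match
    [0.55, 0.10, 0.0],   # non-match
])
labels = np.array([1, 1, 0, 0])

model = LogisticRegression().fit(labeled_pairs, labels)

# Score unreviewed candidate pairs; only ambiguous ones go back to human
# curators, which is where the orders-of-magnitude savings come from.
candidates = np.array([[0.92, 0.85, 1.0], [0.40, 0.35, 0.0]])
print(model.predict_proba(candidates)[:, 1])  # probability each pair is a match
```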
Once harmonized, data will need to be stored, and data stores should integrate easily with downstream consumption tools such as visualization and analytics products. In a modern DataOps stack, there are typically multiple best-of-breed technologies, each performing a few tasks very well, so interoperability and the use of RESTful APIs are essential.
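By way of example, the sketch below shows how a downstream analytics or visualization tool might pull mastered customer records from the unification layer over a RESTful API; the endpoint URL, resource name, and pagination scheme are hypothetical.

```python
# Sketch of a downstream tool consuming unified records over a REST API.
# The base URL, resource path, and JSON shape are hypothetical.
import requests

BASE_URL = "https://dataops.example.com/api/v1"  # hypothetical endpoint

def fetch_unified_customers(page_size=100):
    """Page through the unified customer entities exposed by the pipeline."""
    page = 0
    while True:
        resp = requests.get(
            f"{BASE_URL}/customers",
            params={"limit": page_size, "offset": page * page_size},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        yield from batch
        page += 1

# A BI or visualization product could load this feed directly,
# e.g. list(fetch_unified_customers()) into a dataframe.
```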
The ability to find data sources and unify them quickly, accurately, and at scale is the core competency of a CoE. Proficiency will allow organizations to deliver rapid, repeatable, and trusted data and analytics to end users to drive faster, better business decisions. Developing a DataOps capability can become the fastest way to pay down data debt and increase the velocity of analytics to allow organizations to build competitive advantage. While not as sexy as self-driving cars or virtual reality, in-the-trenches data engineering of this type is essential for companies that aspire to make “competing on analytics” more than a slogan.
Paying down enterprise data debt is a big challenge, and it won’t happen overnight. But it’s not impossible either. It’s a challenge that won’t be solved by any single vendor, but rather by innovative and empowered CDOs who understand that their core mission is to help their organizations manage their data as an asset and who can create the data management infrastructure to enable that change.