“The notion that ‘data is the new oil’ has many believing that every piece of data needs to be captured and stored in a big data technology of some sort,” said McGrattan. “Few are questioning the value of storing, managing, and securing it. We often hear people say that they’re storing data in case they may need it at some point in the future. As a result, this becomes a bit like the junk drawer in the kitchen—filled with stuff that we believe may be needed but that likely will get tossed during the next remodel.”
A better solution, McGrattan suggested, “is to leave the data assets where they naturally live and bring in just the data elements that are required to answer a particular question at the time the question is asked. Rather than pulling all of one’s Salesforce contact data, which is of dubious quality, into a data warehouse to drive a marketing campaign, it makes more sense to pull the specific contacts required for that campaign in real time. Integration vendors that provide that ability will definitely help solve the problem. Similarly, data catalogs will provide an inventory of data assets and help identify unused assets that could be cleaned up.” McGrattan urged organizations to establish a “value index” for data assets and to “be ruthless in imposing rules around only storing data assets that meet a minimum threshold.”
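In practice, the on-demand pattern McGrattan describes can be as simple as a filtered API query issued at send time. The sketch below assumes a Salesforce REST endpoint; the instance URL, token, API version, and campaign ID are all placeholders. It pulls only the contacts attached to one campaign rather than syncing the whole contact object into a warehouse:

```python
import requests

# All values below are placeholders; real instances, API versions, and
# auth flows vary by org. OAuth token acquisition is elided.
INSTANCE = "https://example.my.salesforce.com"
ACCESS_TOKEN = "00D...TOKEN"
CAMPAIGN_ID = "701XXXXXXXXXXXXXXX"

def fetch_campaign_contacts(campaign_id: str) -> list[dict]:
    """Pull only the contacts tied to one campaign, at the moment they're needed."""
    soql = (
        "SELECT Contact.Id, Contact.Email, Contact.FirstName, Contact.LastName "
        "FROM CampaignMember "
        f"WHERE CampaignId = '{campaign_id}' AND Contact.Email != null"
    )
    resp = requests.get(
        f"{INSTANCE}/services/data/v57.0/query/",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        params={"q": soql},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["records"]

contacts = fetch_campaign_contacts(CAMPAIGN_ID)
```

Because the query runs at campaign time, stale or irrelevant contacts never land in the warehouse in the first place.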
QUALITY THROUGH AUTOMATION
Without quality data, AI and machine learning cannot deliver on their promise—it’s as simple as that. The most pressing opportunity for organizations in the year ahead, then, “lies in the ability to discover and consume data that is clean and trustworthy,” said Manoj Karanth, vice president and global head of data science and engineering for Mindtree. “While big data paradigms have converged around lakehouse architectures, the ability to self-serve the data with provenance continues to be a challenge. In some sense, this is akin to the age-old data governance challenge. However, with the increasing speed and volume of data, this has emerged as a big impediment.”
Data quality will be “the most pressing issue shaping big data management in 2022,” agreed John Nash, chief strategy and marketing officer at Redpoint Global. “As we collect more and more data to utilize in business settings, the challenges around data quality multiply. While the concept of data quality is not new, its role in shaping customer experience has increased exponentially. Data hygiene, data governance, and data privacy have become issues beyond IT and are under the scrutiny of many enterprise stakeholders.”
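What “data hygiene” means in practice can be made concrete with a few lines of profiling code. The sketch below, assuming a hypothetical contacts table with an email column, counts basic defects (missing, malformed, and duplicated values) before the data feeds a campaign:

```python
import pandas as pd

EMAIL_RE = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"  # deliberately simple validity check

def hygiene_report(df: pd.DataFrame) -> dict:
    """Count basic defects in a contacts table before it feeds a campaign."""
    emails = df["email"]
    return {
        "rows": len(df),
        "null_emails": int(emails.isna().sum()),
        "malformed_emails": int((~emails.dropna().str.match(EMAIL_RE)).sum()),
        "duplicate_contacts": int(df.duplicated(subset=["email"]).sum()),
    }

df = pd.DataFrame({
    "email": ["a@example.com", "a@example.com", "not-an-email", None],
    "name": ["Ann", "Ann", "Bob", "Cy"],
})
print(hygiene_report(df))
# {'rows': 4, 'null_emails': 1, 'malformed_emails': 1, 'duplicate_contacts': 1}
```

Checks like these are the entry point; the harder organizational work Nash points to is deciding who owns the numbers the report produces.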
The rise of “cloud-native unified data governance solutions will allow automated data discovery and data classification with end-to-end lineage,” resulting in sustainable ways to address the challenge, said Karanth. “Pairing this with business leaders who can own these data products will address the issues around trust and data usage.” To leverage this opportunity, “data and IT managers will need to look at data as an enterprise concern and put in governance mechanisms that help measure the state of the ‘data economy’ within the enterprise,” he added. With insight into an organization’s maturity in making data-driven decisions, steps can be taken to prioritize improvements, Karanth noted.
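Full governance suites are vendor territory, but the core loop of automated discovery is straightforward to sketch: sample incoming columns, classify them against known patterns, and record where each classified asset came from. The detectors, tag names, and record shape below are illustrative assumptions, not any particular product’s API:

```python
import re
from datetime import datetime, timezone

# Illustrative detectors; real catalogs ship far richer classifiers.
CLASSIFIERS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def classify_column(sample: list[str]) -> str | None:
    """Tag a column when most sampled values fit a known sensitive pattern."""
    for tag, pattern in CLASSIFIERS.items():
        if sample and sum(bool(pattern.match(v)) for v in sample) / len(sample) >= 0.8:
            return tag
    return None

def catalog_entry(source: str, column: str, sample: list[str]) -> dict:
    """Record the classification alongside where the data came from (lineage)."""
    return {
        "source": source,
        "column": column,
        "tag": classify_column(sample),
        "scanned_at": datetime.now(timezone.utc).isoformat(),
    }

print(catalog_entry("crm.contacts", "email", ["a@x.com", "b@y.org", "c@z.net"]))
# {'source': 'crm.contacts', 'column': 'email', 'tag': 'email', 'scanned_at': ...}
```

The catalog entry is what makes Karanth’s “data product ownership” workable: a business owner can be attached to a source, not to thousands of individual columns.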
Privacy and security are other areas that will increasingly be turned over to automation in the coming year. Data volumes are growing, but data is becoming harder to use and extract value from due to heightened security and privacy requirements, according to Steven Touw, CTO at Immuta. The problem is exacerbated as organizations migrate to the cloud because the homegrown controls built up over the years in their on-premises systems are lost, he noted. Touw sees a rise in automated data governance technology designed to enable organizations “to implement data access controls, ensuring sensitive data is only accessible by those authorized, meaning each user who queries data sees only the data they’re supposed to see for their approved purpose or role.” Such technology helps detect “sensitive consumer data such as first and last names, Social Security numbers, and address information, allowing companies to classify and tag the data.” It also helps organizations build data policies that comply with complex privacy regulations without managing them manually, one table or column at a time, and orchestrates enforcement in a way that aligns with their existing DataOps workflows, Touw said.
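The enforcement mechanics Touw describes can be sketched as tag-based policy: each column carries the classification tags produced during discovery, and query results are masked according to the requester’s role and approved purpose. The roles, tags, and masking rule here are assumptions for illustration, not Immuta’s implementation:

```python
# Tags per column would come from automated discovery (see the catalog sketch above).
COLUMN_TAGS = {"name": {"pii"}, "ssn": {"pii", "restricted"}, "region": set()}

# Which tags each (role, purpose) pair may see unmasked -- an assumed policy.
POLICY = {
    ("analyst", "marketing"): set(),
    ("auditor", "compliance"): {"pii", "restricted"},
}

def enforce(row: dict, role: str, purpose: str) -> dict:
    """Mask any column whose tags exceed what the role/purpose pair allows."""
    allowed = POLICY.get((role, purpose), set())
    return {
        col: val if COLUMN_TAGS.get(col, set()) <= allowed else "***MASKED***"
        for col, val in row.items()
    }

row = {"name": "Ada Lovelace", "ssn": "123-45-6789", "region": "EMEA"}
print(enforce(row, "analyst", "marketing"))
# {'name': '***MASKED***', 'ssn': '***MASKED***', 'region': 'EMEA'}
print(enforce(row, "auditor", "compliance"))
# {'name': 'Ada Lovelace', 'ssn': '123-45-6789', 'region': 'EMEA'}
```

Centralizing the policy table is the point: a new privacy rule becomes one entry rather than a per-table rewrite, which is exactly the manual burden Touw says automation removes.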