Video produced by Steve Nathans-Kelly
At Data Summit Connect Fall 2020, Pariveda's Ryan Gross surveyed the current landscape of technology and tools for data quality, discovery, availability, security, and compliance.
"Our guideposts around data governance, quality, discovery, compliance, availability, and security have modern tools that enable you to move more quickly," said Gross.
Data Quality
"In the quality area, you have data wrangling tools that allow you to get that visual view of data quality, build quality rules, and understand the current state of data more quickly. Those are good for understanding the data quickly. The next step is codifying the output. So there's pipeline testing frameworks like Soda, Great Expectations, and Toro that enable your organization to very quickly have code-based data tests alongside the code that transforms that data. So you may learn from Trifacta and Paxata—Paxata is now part of DataRobot—what your current shape of the data looks like, and then encode it in Great Expectations in order to say, here's what my data must look like in order to tie that in such that data is immediately discoverable."
Governance
There is a whole category of ML-based data catalogs that learn about the data not only from manual intervention to define documentation, but also from understanding how people are using that data or the profile of the data, said Gross. "Alation has really good capabilities around query log mining, to understand exactly how people are using your data, such that you can continue to find things that may need to go back into your pipeline testing." Waterline Data can take those black-box data sources that you haven't seen before and understand them in a deep way, said Gross, and it has unique column-fingerprinting capabilities that enable you to understand how your data might join together and what particular issues might exist, while making that available and discoverable to a known analyst user base.
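Waterline's fingerprinting is proprietary, but a MinHash-style sketch illustrates the general idea of comparing column signatures to surface likely join keys; the data here is toy data, not the vendor's algorithm:

```python
import hashlib

def column_fingerprint(values, num_hashes=64):
    """MinHash-style signature over a column's distinct values."""
    signature = [float("inf")] * num_hashes
    for v in set(values):
        for i in range(num_hashes):
            h = int(hashlib.md5(f"{i}:{v}".encode()).hexdigest(), 16)
            if h < signature[i]:
                signature[i] = h
    return signature

def estimated_overlap(sig_a, sig_b):
    """Fraction of matching minhashes approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Toy columns from two previously unseen tables
customer_ids = ["c1", "c2", "c3", "c4", "c5"]
order_customer_ids = ["c2", "c3", "c4", "c5", "c9"]

score = estimated_overlap(column_fingerprint(customer_ids),
                          column_fingerprint(order_customer_ids))
print(f"join-candidate score: {score:.2f}")
```

A high score between two columns' signatures suggests they draw on the same domain of values and are candidates for a join, without anyone having documented the relationship.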
Rapid Exploration
The next step is rapid exploration. "Even before you bring something into a data platform, there are now tools that enable you to get in there and take it to the next step beyond what you can get with the visual data wrangling tools. They enable that discovery to go to the extra depth, where you understand the logic and can get through that experimentation phase to understand what changes are necessary in order to get from your source code to your compiled output."
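As a rough illustration of that pre-platform exploration, a few lines of pandas (with hypothetical file and column names) expose types, distributions, and null rates, and let you prototype the transformations the eventual pipeline will need:

```python
import pandas as pd

# A raw extract explored before it ever lands in the platform (hypothetical file)
df = pd.read_csv("unknown_source.csv")

print(df.dtypes)                                      # inferred column types
print(df.describe(include="all"))                     # distributions and top values
print(df.isna().mean().sort_values(ascending=False))  # null rate per column

# Prototype a candidate transformation to learn what the pipeline will need
cleaned = df.dropna(subset=["id"]).assign(
    amount=lambda d: pd.to_numeric(d["amount"], errors="coerce")
)
```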
Compliance
You also have the ability to bake in compliance, said Gross. "This is an area where I think things are maturing. I separated out the ML discoverability-focused data catalogs from the ML compliance-focused data catalogs. I think this is really starting to blend as we speak, such that in the next year or two, you'll be able to get both the compliance side and the discoverability side within a data cataloging product. And part of that is because some of these players are starting to blend in data privacy tools that surface data privacy concerns along the way and, again, enable you to codify the management expectations around this, such that as people are developing data pipelines, compliance is not something they have to come back at the end and seek approval for, confirming they met privacy needs. It's something that can be verified via test automation as it happens, and you can scan for areas where it has not happened and correct those moving forward."
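One way to picture "verified via test automation" is a pipeline test that scans published tables for raw PII and fails if masking was skipped; the patterns, path, and sampling below are illustrative assumptions, not any vendor's implementation:

```python
import re
import pandas as pd

# Hypothetical policy: these patterns must never appear in published tables
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}

def scan_for_pii(df: pd.DataFrame) -> list:
    """Return (column, pattern) pairs where raw PII leaked through."""
    findings = []
    for col in df.select_dtypes(include="object"):
        sample = df[col].dropna().astype(str).head(1000)
        for name, pattern in PII_PATTERNS.items():
            if sample.str.contains(pattern).any():
                findings.append((col, name))
    return findings

def test_no_pii_in_published_table():
    df = pd.read_parquet("published/customers.parquet")  # hypothetical path
    assert scan_for_pii(df) == [], "PII found; mask before publishing"
```

Because the scan runs as an ordinary test, privacy violations surface during development rather than in an end-of-project approval review.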
Availability
Availability is all about ensuring that this runs on an ongoing basis and that you have the ability to see what the data looked like at any point in time. "That's really where tools like Databand and StreamSets come in for data pipeline building, and then Prometheus for building observation into your data pipelines, to see what's actually happening as they run in production, taking that step I was talking about before of lean manufacturing principles, applying your tests to the data, as opposed to just in the development process. And seeing the data at any point in time is now readily available to anybody, primarily in the world of Spark, Hive, and big data platforms, via the Apache Hudi project coming out of Uber or the Delta Lake project from Databricks."
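The point-in-time view Gross describes is exposed directly in Delta Lake's reader options; the table path, version, and timestamp below are hypothetical, and a Spark session configured with the Delta Lake package is assumed:

```python
from pyspark.sql import SparkSession

# Assumes Spark is already configured with the Delta Lake package
spark = SparkSession.builder.appName("time-travel").getOrCreate()

# The table as it exists now (path is hypothetical)
current = spark.read.format("delta").load("/data/events")

# The same table as of an earlier version, or an earlier point in time
as_of_v12 = (spark.read.format("delta")
             .option("versionAsOf", 12)
             .load("/data/events"))
as_of_oct = (spark.read.format("delta")
             .option("timestampAsOf", "2020-10-01")
             .load("/data/events"))
```

Apache Hudi offers analogous point-in-time queries against its commit timeline.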
In addition, access management is now manageable as code using tools such as Privacera, which has built upon the foundations of Apache Ranger to provide codified data access rules. One of its key competitors, Immuta, has taken that a step further, such that you can look at attributes of your users and attributes of your data and use them to make access management decisions at query time. And in more widespread use, AWS Lake Formation and the new functionality being built into Azure Synapse Analytics are baking that into the platforms that everybody uses on a regular basis.
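The attribute-based pattern credited to Immuta here can be pictured with a toy sketch; this shows the general decision shape at query time, not any vendor's actual API:

```python
from dataclasses import dataclass

@dataclass
class User:
    department: str
    clearance: str

@dataclass
class Column:
    name: str
    tags: frozenset  # e.g., frozenset({"pii"})

def allowed_columns(user: User, columns: list) -> list:
    """Toy attribute-based check: user attributes vs. data attributes."""
    visible = []
    for col in columns:
        if "pii" in col.tags and user.clearance != "high":
            continue  # hide (or mask) tagged columns from low-clearance users
        visible.append(col.name)
    return visible

cols = [Column("email", frozenset({"pii"})), Column("order_total", frozenset())]
print(allowed_columns(User("marketing", "low"), cols))    # ['order_total']
print(allowed_columns(User("compliance", "high"), cols))  # ['email', 'order_total']
```

Because the decision compares user attributes to data attributes on each query, tagging a new column or onboarding a new user requires no new rules.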
Videos of full presentations from Data Summit Connect Fall 2020, a 3-day series of data management and analytics webinars presented by DBTA and Big Data Quarterly, are also now available for on-demand viewing on the DBTA YouTube channel.