DataOps is a modern engineering practice that can improve the speed and accuracy of analytics.
In a pre-conference workshop titled "DataOps 101" at Data Summit 2019, Mark Marinelli, head of product at Tamr, identified the processes, technologies, and roles involved in DataOps to improve the speed and availability of data for analytics—as well as the common pitfalls to avoid.
According to Marinelli, “DataOps is an automated process-oriented methodology used by analytics and data teams to improve the quality and reduce the cycle times of data analytics.”
Considerations for Getting Started with DataOps:
Technology:
Identify a path to a modern, modular service architecture:
- Create a blueprint for a next-generation data platform
- Revisit a cloud migration strategy.
Inventory the current toolset:
- Assess total cost of ownership (TCO), skill requirements, and other constraints
- Determine which tools should be replaced and when replacement is viable
Decouple monolithic processes:
- Wrap components in APIs and expose them as services, as sketched below
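The workshop presented this as guidance rather than code, but the pattern is easy to sketch. In the hypothetical Python example below, `clean_records` stands in for a step currently buried inside a monolithic pipeline; wrapping it in a small Flask endpoint exposes it as a service that other tools can call independently. All names are illustrative assumptions, not part of Marinelli's material.

```python
# A minimal sketch of wrapping a legacy pipeline step as a service.
# `clean_records` is a stand-in for any component locked inside a
# monolith; the Flask app puts an HTTP API in front of it so other
# pipeline stages can call it without linking to the monolith.
from flask import Flask, jsonify, request

app = Flask(__name__)

def clean_records(records):
    # Hypothetical legacy logic: trim whitespace and drop empty rows.
    cleaned = []
    for rec in records:
        rec = {k: (v.strip() if isinstance(v, str) else v) for k, v in rec.items()}
        if any(rec.values()):
            cleaned.append(rec)
    return cleaned

@app.route("/v1/clean", methods=["POST"])
def clean_endpoint():
    # Accept a JSON array of records; return the cleaned array.
    records = request.get_json(force=True)
    return jsonify(clean_records(records))

if __name__ == "__main__":
    app.run(port=8080)
```

Once each step sits behind an API like this, individual components can be swapped out in a proof of concept without rewriting the whole pipeline.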
Start building with new technologies:
- Choose a subset of tools for proof of concept to replace old technology
Organization:
Roles: division of labor across mixed-skill teams:
- Identify existing key roles: data engineers and their consumers
- Find the best candidates for new roles: data curators and data stewards
Create cross-functional teams whose composition will depend on the project:
- Data engineers
- Data curators
- Data stewards
Structure: working model for projects across technical and business teams:
- Choose your operating model, start with shared services for the first project, and ensure executive alignment
Processes:
Agile: Incremental delivery model
- Agile is key: if the team is not already agile, choose a model that works (Scrum, SAFe)
- Inventory the set of available projects and consider the availability of data versus the value of solving a problem
- Define a high-value data-rich project that will demand a complex solution
What Not to Do:
- Avoid “boil the ocean” waterfall projects that measure success in years and quarters
- Single platforms: Don’t overestimate what a single piece of software can do; instead, focus on a thoughtfully designed ecosystem of best-of-breed tools.
- Single vendor: Don’t overestimate what a single vendor can do—align vendors with APIs and expectations that they must work together.
- Don’t underestimate the effort that it takes to make FOSS (free and open source software) work
- Don’t underestimate the human/behavioral challenges with data
Key takeaways for successful DataOps projects:
- Use a data catalog as a layer of abstraction to help people access raw data sources (see the sketch after this list)
- Publish: People constantly reinvent the wheel; if there is a way to access the best available data, they will save time, and publishing also creates a feedback loop on the value and quality of the data. A good publishing system saves the organization time while retaining control through governance.
- Machine learning is useful for saving people time, but it is not a substitute for people; humans are still needed in the loop.
- Go cloud-first if possible.
- It is simultaneously great and terrible that there are so many options for technology choices.
- The data curator is an emerging role to ensure that consumers have the data they need in the form that they need it
- The data steward is another new role, responsible for creating the policies that enable data governance
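Marinelli presented these takeaways as principles rather than code, but a toy sketch can show how the catalog, publishing, and feedback-loop ideas fit together. Everything here, from the `DataCatalog` class to the `rate` method, is a hypothetical illustration, not a real product API.

```python
# A toy sketch of the catalog-and-publish pattern: producers register
# curated datasets with metadata, consumers discover them by tag, and
# a simple rating hook provides the feedback loop on data quality.
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    name: str                 # logical name consumers search for
    location: str             # physical source, e.g., a warehouse table
    owner: str                # the data steward accountable for it
    tags: set = field(default_factory=set)
    ratings: list = field(default_factory=list)

class DataCatalog:
    """Abstraction layer between consumers and raw data sources."""

    def __init__(self):
        self._entries = {}

    def publish(self, entry: DatasetEntry):
        # Publishing makes curated data discoverable, so teams stop
        # reinventing the wheel against raw sources.
        self._entries[entry.name] = entry

    def find(self, tag: str):
        # Consumers query the catalog, not the underlying systems.
        return [e for e in self._entries.values() if tag in e.tags]

    def rate(self, name: str, score: int):
        # Feedback loop: consumers signal the value/quality of the data.
        self._entries[name].ratings.append(score)

catalog = DataCatalog()
catalog.publish(DatasetEntry("customers_mastered",
                             "warehouse.analytics.customers",
                             owner="jane.steward@example.com",
                             tags={"customer", "golden-record"}))
print([e.location for e in catalog.find("customer")])  # discovery
catalog.rate("customers_mastered", 5)                  # feedback
```

The catalog is the abstraction layer from the first takeaway; `publish` and `rate` mirror the publishing system and its feedback loop, with the `owner` field pointing at the accountable data steward.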
By leveraging automation, data democratization, and greater collaboration among data scientists, engineers, and other technologists, DataOps can help organizations improve the time-to-value of their data.
Marinelli and many other presenters are making their slide decks available on the Data Summit 2019 website at www.dbta.com/DataSummit/2019/Presentations.aspx.