DataOps is a modern engineering practice that can improve the speed and accuracy of analytics.
In a pre-conference workshop titled "DataOps 101" at Data Summit 2019, Mark Marinelli, head of product at Tamr, identified the processes, technologies, and roles involved in DataOps to improve the speed and availability of data for analytics—as well as the common pitfalls to avoid.
According to Marinelli, “DataOps is an automated process-oriented methodology used by analytics and data teams to improve the quality and reduce the cycle times of data analytics.”
Considerations for Getting Started with DataOps:
Technology:
Identify a path to a modern, modular service architecture:
- Create a blueprint for a next-generation data platform
- Revisit a cloud migration strategy.
Inventory the current toolset:
- Assess total cost of ownership (TCO), skill requirements, and other constraints
- Determine which tools should be replaced and when replacement is viable
Decouple monolithic processes:
- Wrap components in APIs and expose them as services, as sketched below
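The workshop presented this as guidance rather than code, but the pattern is easy to sketch. In the hypothetical Python example below, `clean_records` stands in for a step currently buried inside a monolithic pipeline; wrapping it in a small Flask endpoint exposes it as a service that other tools can call independently. All names are illustrative assumptions, not part of Marinelli's material.

```python
# A minimal sketch of wrapping a legacy pipeline step as a service.
# `clean_records` is a stand-in for any component locked inside a
# monolith; the Flask app puts an HTTP API in front of it so other
# pipeline stages can call it without linking to the monolith.
from flask import Flask, jsonify, request

app = Flask(__name__)

def clean_records(records):
    # Hypothetical legacy logic: trim whitespace and drop empty rows.
    cleaned = []
    for rec in records:
        rec = {k: (v.strip() if isinstance(v, str) else v) for k, v in rec.items()}
        if any(rec.values()):
            cleaned.append(rec)
    return cleaned

@app.route("/v1/clean", methods=["POST"])
def clean_endpoint():
    # Accept a JSON array of records; return the cleaned array.
    records = request.get_json(force=True)
    return jsonify(clean_records(records))

if __name__ == "__main__":
    app.run(port=8080)
```

Once each step sits behind an API like this, individual components can be swapped out in a proof of concept without rewriting the whole pipeline.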
Start building with new technologies:
- Choose a subset of tools for proof of concept to replace old technology
Organization:
Roles: division of labor across mixed-skill teams:
- Identify existing key roles: data engineers and their consumers
- Find the best candidates for new roles: data curators and data stewards
Create cross-functional teams whose composition will depend on the project:
- Data engineers
- Data curators
- Data stewards
Structure: working model for projects across technical and business teams:
- Choose your operating model, start with shared services for the first project, and ensure executive alignment
Processes:
Agile: Incremental delivery model
- Agile is key: if the team is not already agile, choose a model that works (Scrum, SAFe)
- Inventory the set of available projects and consider the availability of data versus the value of solving a problem
- Define a high-value data-rich project that will demand a complex solution
What Not to Do:
- Avoid “boil the ocean” waterfall projects that measure success in years and quarters
- Single platforms: Don’t overestimate what a single piece of software can do; instead, focus on a thoughtfully designed ecosystem of best-of-breed tools.
- Single vendor: Don’t overestimate what a single vendor can do—align vendors with APIs and expectations that they must work together.
- Don’t underestimate the effort that it takes to make FOSS (free and open source software) work
- Don’t underestimate the human/behavioral challenges with data
Key takeaways for successful DataOps projects:
- Use a data catalog as a layer of abstraction to help people access raw data sources (see the sketch after this list)
- Publish: People constantly reinvent the wheel; if there is a way to access the best available data, they will save time, and publishing also creates a feedback loop on the value and quality of the data. A good publishing system saves the organization time while retaining control through governance.
- Machine learning is useful for saving people time, but it is not a substitute for people; humans are still needed in the loop.
- Go cloud-first if possible.
- It is simultaneously great and terrible that there are so many options for technology choices.
- The data curator is an emerging role to ensure that consumers have the data they need in the form that they need it
- The data steward is another new role, responsible for creating the policies that enable data governance
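Marinelli presented these takeaways as principles rather than code, but a toy sketch can show how the catalog, publishing, and feedback-loop ideas fit together. Everything here, from the `DataCatalog` class to the `rate` method, is a hypothetical illustration, not a real product API.

```python
# A toy sketch of the catalog-and-publish pattern: producers register
# curated datasets with metadata, consumers discover them by tag, and
# a simple rating hook provides the feedback loop on data quality.
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    name: str                 # logical name consumers search for
    location: str             # physical source, e.g., a warehouse table
    owner: str                # the data steward accountable for it
    tags: set = field(default_factory=set)
    ratings: list = field(default_factory=list)

class DataCatalog:
    """Abstraction layer between consumers and raw data sources."""

    def __init__(self):
        self._entries = {}

    def publish(self, entry: DatasetEntry):
        # Publishing makes curated data discoverable, so teams stop
        # reinventing the wheel against raw sources.
        self._entries[entry.name] = entry

    def find(self, tag: str):
        # Consumers query the catalog, not the underlying systems.
        return [e for e in self._entries.values() if tag in e.tags]

    def rate(self, name: str, score: int):
        # Feedback loop: consumers signal the value/quality of the data.
        self._entries[name].ratings.append(score)

catalog = DataCatalog()
catalog.publish(DatasetEntry("customers_mastered",
                             "warehouse.analytics.customers",
                             owner="jane.steward@example.com",
                             tags={"customer", "golden-record"}))
print([e.location for e in catalog.find("customer")])  # discovery
catalog.rate("customers_mastered", 5)                  # feedback
```

The catalog is the abstraction layer from the first takeaway; `publish` and `rate` mirror the publishing system and its feedback loop, with the `owner` field pointing at the accountable data steward.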
By leveraging automation, data democratization, and greater collaboration among data scientists, engineers, and other technologists, DataOps can help organizations improve the time-to-value of their data.
Marinelli and many other presenters are making their slide decks available on the Data Summit 2019 website at www.dbta.com/DataSummit/2019/Presentations.aspx.