Migrating a large-scale Hadoop cluster to the cloud is challenging, especially when the cluster is very active and downtime during the migration is not an option.
Tony Velcich, senior director of product marketing at WANdisco, and Ken Seier, chief architect for data and AI at Insight, discussed the challenges involved in such a migration during their Data Summit Connect 2021 presentation, “Considerations for Large Scale Hadoop Data Migration to the Cloud.”
The annual Data Summit event is being held virtually again this year—May 10–May 12—due to the ongoing COVID-19 pandemic.
Cloud data migrations are being driven by digital transformation, cost optimization and IT agility, analytics and AI/machine learning, and external factors, Velcich said.
“We’ve seen much more innovation happening in the cloud,” Velcich said. “Last year organizations really found they needed to support a global workforce that became remote.”
Hadoop cloud migrations can be complex, but the cloud offers lower costs, scalability, high availability, collaboration, and security, he noted.
There are three key data migration considerations, he said. The first is the scale of the data migration: various options exist for small data volumes, but few work well at scale, and migrating large volumes of data takes time.
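To make the scale point concrete, a back-of-the-envelope calculation shows why large volumes take time to move; the data volume, link speed, and utilization figures below are illustrative assumptions, not numbers from the presentation.

```python
# Rough, illustrative estimate of bulk-transfer time for a large Hadoop dataset.
# The volume, bandwidth, and utilization values are assumptions for illustration only.

def transfer_days(data_tb: float, link_gbps: float, utilization: float = 0.7) -> float:
    """Estimate days to copy `data_tb` terabytes over a `link_gbps` link
    that is only partially available for the migration."""
    data_bits = data_tb * 8 * 10**12            # terabytes -> bits (decimal TB)
    effective_bps = link_gbps * 10**9 * utilization
    return data_bits / effective_bps / 86_400   # seconds -> days

# Example: 1 PB (1,000 TB) over a 10 Gbps link at 70% utilization.
print(f"{transfer_days(1000, 10):.1f} days")    # roughly 13 days of continuous copying
```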
The next consideration is what data changes occur in the Hadoop environment while the migration is underway. Typical Hadoop production environments are very active with data ingests and updates, and that ongoing activity adds to migration time and complexity. Business disruption is a top concern for planned Hadoop migration projects, Velcich said.
The last consideration is the migration approach: some approaches require manual or custom development efforts, while automated migration tools offer a better option, he explained.
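A minimal sketch of what a hand-rolled approach might look like, using Hadoop's DistCp tool in a naive catch-up loop, illustrates the kind of custom scripting the speakers suggest automated tools replace. The paths, bucket name, and hourly schedule are hypothetical.

```python
# Sketch of a hand-rolled incremental copy using Hadoop's DistCp tool.
# Paths, bucket name, and the hourly schedule are hypothetical.
import subprocess
import time

SOURCE = "hdfs://prod-cluster/warehouse"     # assumed on-premises HDFS path
TARGET = "s3a://example-bucket/warehouse"    # assumed cloud object-store target

def incremental_copy() -> None:
    # -update copies only files that differ from the target, so repeated
    # runs pick up data that changed since the last pass.
    subprocess.run(
        ["hadoop", "distcp", "-update", SOURCE, TARGET],
        check=True,
    )

if __name__ == "__main__":
    while True:             # naive loop; no handling of deletes,
        incremental_copy()  # in-flight writes, or failures mid-copy
        time.sleep(3600)
```

Keeping a script like this correct against an active cluster is exactly the ongoing-change problem described above, which is the speakers' argument for automated migration tooling.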
According to Seier, there are also a few key workload migration considerations. The first is asking whether your environment is cloud-ready.
“Part of the challenge comes in the complexity of workloads,” Seier said.
The next consideration is the time to refactor, which leads to full cloud modernization. He suggested refactoring workload by workload, which in turn raises the question of what refactoring entails: compare workload outputs side by side and implement new functionality immediately using cloud patterns, he explained.
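A side-by-side output comparison for a refactored workload can be as simple as diffing the legacy and cloud results. The sketch below assumes both versions write their output as Parquet and uses PySpark; the paths and dataset name are illustrative assumptions.

```python
# Sketch of a side-by-side output comparison for a refactored workload,
# assuming both the legacy and cloud versions write results as Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("output-diff").getOrCreate()

legacy = spark.read.parquet("hdfs://prod-cluster/output/daily_report")
cloud = spark.read.parquet("s3a://example-bucket/output/daily_report")

# Row counts should match, and each side minus the other should be empty.
print("legacy rows:", legacy.count(), "cloud rows:", cloud.count())
only_in_legacy = legacy.exceptAll(cloud)   # rows the cloud version dropped or changed
only_in_cloud = cloud.exceptAll(legacy)    # rows the cloud version added or changed
print("mismatches:", only_in_legacy.count() + only_in_cloud.count())
```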
Post-migration, Velcich explained, companies need to consider whether they will need a solution that supports changes at both source and target, or across multiple environments.
“The data migration is the lynchpin of success,” Seier said.
Register now for Data Summit Connect 2021, which continues through Wednesday, May 12.