The IT industry is going through a major shift from centralized data centers to dispersed deployments across a variety of cloud and on-premise platforms. At the same time, availability is becoming more critical. Recently, Dave Bermingham, technical evangelist of SIOS Technology, shared his views on the current state of high availability in the cloud, and what organizations need to do to ensure continuity of service. "When moving to the cloud, the first thing you will discover is that the traditional SAN-based failover cluster for HA is no longer an option," he noted.
In your view, what is the reality of high availability (HA) in the cloud now?
Dave Bermingham: Compared with most traditional on-premises datacenters, the cloud has much more to offer in terms of infrastructure to support high availability. Local resiliency alone allows cloud providers to offer a 99.9% availability SLA for a single-instance VM. Beyond that, providers can offer up to three well-connected datacenters within a single region, plus dozens of regions spread across the globe, so the availability options in the cloud are endless. When you leverage the cloud appropriately, the availability SLA can be 99.99% or higher.
However, it is incumbent upon users to understand what is required on their end to ensure the high availability of their particular service. In many cases there will be a learning curve, since the availability options in the cloud are numerous and ever-changing.
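To put the 99.9% and 99.99% figures in perspective, the following is a minimal Python sketch that converts an availability percentage into a yearly downtime budget. The function names are illustrative, and the second function rests on a simplifying assumption that is not from the interview: redundant copies fail independently and failover is instantaneous. Real deployments will do worse than that idealized figure.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Maximum downtime per year permitted by a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def combined_availability_pct(single_pct: float, copies: int) -> float:
    """Idealized availability of redundant copies.

    Assumption (hypothetical model): failures are independent and
    failover takes no time, so the service is down only when every
    copy is down at once.
    """
    unavailability = 1 - single_pct / 100
    return 100 * (1 - unavailability ** copies)


if __name__ == "__main__":
    for pct in (99.9, 99.99):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Under this arithmetic, 99.9% allows roughly 8.8 hours of downtime a year, while 99.99% allows a little under an hour, which is why the gap between the two SLAs matters more than the extra "9" suggests.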
With more choices than ever, has it become easier or more difficult for organizations to craft an approach to availability and recovery that is customized for their specific needs?
DB: The challenge organizations face is mostly the learning curve. Technologies that enable HA/DR in traditional datacenters are pretty mature and well known to the seasoned IT pro. HA typically revolved around SAN-based failover cluster solutions, while DR consisted of a combination of backup, snapshot and replication technologies that normally required maintaining an active DR location.
When moving to the cloud, the first thing you will discover is that the traditional SAN-based failover cluster for HA is no longer an option. Instead, SANless HA solutions that span Availability Zones must be leveraged. New options for DR include DRaaS offerings from cloud providers, geo-redundant storage replication, application-based replication and third-party replication solutions. Figuring out which solutions will help you achieve your RTO/RPO goals while also keeping overall costs for HA/DR in line can be a challenge.
The good news is that once you settle on a solution, the cloud makes it easy to implement, since whatever you need is available instantly and on demand. This also allows you to become more agile, since there is no long-term commitment to any one hardware or software solution.
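One way to make the RTO/RPO-versus-cost trade-off concrete is to write the targets down and filter candidate solutions against them. The sketch below is only an illustration: the `DrOption` type, the option names and every number in it are hypothetical placeholders, not measurements of any real product or service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrOption:
    """A candidate HA/DR solution (all values here are hypothetical)."""
    name: str
    rto_minutes: float   # expected time to restore service
    rpo_minutes: float   # expected worst-case window of lost data
    monthly_cost: float  # estimated cost in your billing currency


def viable_options(options, rto_target, rpo_target, budget):
    """Return options that meet both targets within budget, cheapest first."""
    fits = [
        o for o in options
        if o.rto_minutes <= rto_target
        and o.rpo_minutes <= rpo_target
        and o.monthly_cost <= budget
    ]
    return sorted(fits, key=lambda o: o.monthly_cost)


# Placeholder candidates; substitute figures from your own testing and quotes.
candidates = [
    DrOption("option-a", rto_minutes=240, rpo_minutes=60, monthly_cost=300),
    DrOption("option-b", rto_minutes=15, rpo_minutes=1, monthly_cost=1200),
    DrOption("option-c", rto_minutes=30, rpo_minutes=5, monthly_cost=800),
]

shortlist = viable_options(candidates, rto_target=60, rpo_target=15, budget=1000)
print([o.name for o in shortlist])
```

With these placeholder values, only option-c survives: option-a misses the recovery targets and option-b exceeds the budget. The value of the exercise is in forcing the targets and cost ceiling to be stated explicitly before comparing solutions.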
What HA options and solutions are available in the cloud?
DB: Before I answer, let’s agree that HA means ensuring an application is available 99.99% of the time. Cloud providers typically offer high availability options with their PaaS and SaaS offerings, sometimes at an additional cost. For their IaaS offerings, it would appear that they offer a 99.99% SLA as well, but if you look a little deeper you will discover that this only covers remote connectivity to the server. It does not guarantee that the services running on the server are highly available, nor does it guarantee that transactional data stored on the server is highly available.
To address high availability of the application and data, you have to take the additional step of configuring a combination of replication and failover clustering. Some applications have those features baked in, like SQL Server Always On Availability Groups. For others, you will have to investigate SANless cluster solutions that support a wide range of applications across both Windows and Linux.
What are some of the new wrinkles that hybrid cloud scenarios introduce?
DB: Hybrid cloud allows companies to extend their existing on-premises datacenter to the cloud without having to do a lift-and-shift of their entire datacenter. This allows them to keep their most sensitive applications and data on-premises, while shifting some of their less sensitive or non-core applications to the cloud. Having the cloud as an option allows them to be more agile and adjust to the ebbs and flows of their business, spinning up resources when needed and turning them off at the end of the busy season or after a project has run its course.
The wrinkles are mostly about getting comfortable enough with the cloud to start using it. Once the initial projects move to the cloud and are successful, it will be hard to justify the next purchase of hardware for the datacenter. The cloud will simply be the default place for any new applications, and as older hardware reaches its end of life the most logical step will be to migrate its workloads to the cloud.
Why are some applications better suited to the cloud, and why should others remain on-prem?
DB: Let’s just say that all applications should run in the cloud unless there is a legitimate reason to run them on-premises. One legitimate reason for staying on-premises might involve legacy applications that can’t be moved to the cloud. If you have applications with components that just can’t run in the cloud, then it’s typically best to keep all of the tiers of that application running together in the same datacenter.
Another reason for keeping applications on-premises typically involves cloud costs. In particular, if an application produces a lot of egress data, that is, data that must travel out of the virtual network, the charge for that data can sometimes cause sticker shock after you get your first cloud bill.
The most common excuse I hear for staying on-premises is data security. There are certain people who just have a fear of the cloud, thinking their data will not be secure. I should clarify that the IT people I speak with are typically all for moving to the cloud; the fear generally comes from upper management outside of IT, who are less trusting of the cloud and less familiar with its security options.
What are the primary challenges of creating a reliable HA/DR strategy from cloud components and home-grown designs?
DB: When I hear “home-grown” and HA/DR I immediately think open source, especially in the Linux space. Home-grown designs are great... until they’re not. You can go through the trouble of piecing together open source HA, replication, quorum and application monitoring and recovery scripts, and if you are lucky enough to figure it all out, you might have something that works well for you. But have you tested every edge case? Do you have hundreds, or even thousands, of other people running your exact configuration to validate that it works as expected? The challenge of home-grown is just that: it is home-grown and may not be as robust as you hope it is.
When considering solutions to ensure HA/DR is actually attainable, what are the key issues?
DB: Test, test and test again! Any HA/DR solution is only good if it has been proven in a real-life fire drill. Far too often I see people who want to do “non-disruptive” DR drills that don’t impact production. Or they are afraid to pull power cords on production servers to test HA failover.
Trust me, in the event of an actual disaster, it WILL be disruptive, and your server will crash HARD. Unless you test these scenarios, you never know for sure whether your DR and HA provisions will work when you really need them.
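A fire drill is most useful when it produces a number that can be compared against the RTO. The following is a minimal sketch of timing a recovery after a failure has been deliberately induced. The `is_healthy` probe is a hypothetical callable you would supply yourself (for example, one that issues a test query against the application); nothing here is specific to any vendor's tooling.

```python
import time
from typing import Callable, Optional


def measure_recovery_seconds(
    is_healthy: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll a health probe after a failure has been induced.

    Returns the seconds elapsed until the first healthy probe, or None
    if the service has not recovered within timeout_s. The clock and
    sleep parameters exist so the function can be exercised without
    real waiting.
    """
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_interval_s)


def drill_passed(recovery_s: Optional[float], rto_s: float) -> bool:
    """A drill passes only if the service recovered within the RTO."""
    return recovery_s is not None and recovery_s <= rto_s
```

In a real drill you would start the timer at the moment the failure is triggered, and record the result each time so that regressions in failover time show up before an actual outage does.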
Which approaches help to enable HA/DR versus those which add complications?
DB: You have to look at HA/DR from a very high level. If you are the DBA, you may only be concerned about keeping the database online. However, there is so much more to consider: application servers, web servers, end users, client redirection and much more. In order for any HA/DR implementation to be successful, you must ensure that someone has the big picture in mind when making decisions.
The other thing that is often overlooked is the line of succession. Let’s face it, in times of true disaster you may be left with just your most junior IT admin to put things back together. A clear set of instructions and as much automation as possible need to be available and set aside in multiple secure locations known to the key stakeholders in the company.
What certifications and capabilities should organizations look for with reference to availability and recovery when configuring a technology stack for the cloud?
DB: When looking at HA/DR solutions for the cloud, you should first talk to your cloud provider. Who are the vendors that have the most robust and mature offerings in the cloud vendor’s marketplace? Inclusion in the marketplace generally means that the vendor is certified by the OS, application and cloud providers. That should be enough in terms of certification. Secondly, you should ask for cloud references: who has done this, and what was their experience?
Where possible, try to settle on just a few key technologies for HA/DR. If you have a different HA/DR solution for each application, you will quickly find it hard to maintain the expertise required to manage all of them.