One configures SQL Server for high availability (HA) in anticipation of some event that will cause a critical application to go offline unexpectedly—a software glitch that causes an app to freeze or a fault in storage that brings down a disk, for example. A well-designed HA solution will cause a secondary instance of your application and database infrastructure—running in a nearby but geographically distinct data center—to assume your critical loads and keep your operations running smoothly, with no loss of data.
When the incident that takes your critical applications offline is larger, though—think hurricanes, bomb cyclones, and atmospheric rivers that inundate whole regions—there’s a very real risk that both the data centers (also known as availability zones, or AZs) supporting your HA configuration may be knocked offline.
If your critical applications and databases need to be available and active when that disaster strikes, you need a disaster recovery (DR) solution that will enable you to run those applications and databases from a region that is unaffected by the disaster striking yours.
Building out such a DR solution involves different considerations and trade-offs than a cloud-based HA solution. Given the physical distance between the AZs in which your normal production environment is running and the remote AZ in which your DR infrastructure resides, you are unlikely to be able to rely on an HA technology such as synchronous data replication to ensure that the instance of your SQL database on your DR site is a perfect replica of your production database. You’ll need to choose a different approach to replicating the SQL Server database to your DR site, and your options for doing so are constrained by a combination of factors:
- What constitutes an acceptable database recovery point in a DR scenario? All but the last 5 minutes’ worth of SQL Server transactions? All but the last 5 hours of transactions? Whatever period you choose constitutes your recovery point objective (RPO), and different approaches to database replication will support different RPOs.
- How quickly must your DR solution be online and operational? This constitutes your recovery time objective (RTO) and may range from seconds to days, depending on the criticality of the systems whose availability you are trying to ensure. How quickly you expect your DR site to be operational in an emergency will influence your decision about approaches to database replication.
- What is your budget? As you might imagine, the closer your RPO objectives get to zero data loss and the closer your RTO objectives get to seconds rather than minutes or hours, the more costly your DR solution is likely to be. If your budget is constrained, you may have to deploy a DR solution that has a more forgiving RPO, RTO, or both.
Balancing RPO and RTO
Because your DR infrastructure is likely to be located hundreds, if not thousands, of miles away from your production infrastructure, you are unlikely to be able to rely on synchronous data replication techniques to ensure that the SQL database on your DR site is a mirror image of the SQL database in your production environment. What are your options?
If your RPO is to have the SQL database on your DR site be as close to a mirror image of your production database as possible, you could use asynchronous replication, which is built into the Availability Groups (AGs) feature of SQL Server as well as third-party SANless clustering technologies. Depending on the volume of transactions you are replicating and the speed at which they can be written to your DR site, most, if not all, of your database updates will likely be replicated to your remote site within seconds of being written to your production database. The risk of data loss arises, however, in a scenario in which your production environment goes down before the most recent transactions have been replicated to the DR site. You could bring your applications online quickly from the DR site, but your database may be missing some of the updates that had already been written to the production database.
If your RTO and RPO are very forgiving, you could rely on database backups stored on the DR site. In little more than the time it takes to restore the latest backup—which could take anywhere from minutes to hours, depending on your database—you could bring your applications online from your DR site and continue to support your critical business processes. The risk with this approach, though, is that your backups may be hours or days out-of-date compared to the production database you had been using.
A middle ground might involve log shipping. You could spin up your DR site and build your SQL database from the log files shipped from the instance of SQL Server on the production site. On the one hand, this may provide you with a more up-to-date database than you could restore from an hours- or days-old backup file. On the other hand, your rebuilt SQL database might be less up-to-date than it would be if you had been using asynchronous replication. The real risk associated with this approach lies in the time it takes to rebuild the database from log files. That could take hours—which may be well outside the bounds of your RTO.
A final consideration for replication might be a cloud service such as Azure Site Recovery or AWS Disaster Recovery Services. These services replicate your entire virtual machine (VM) from one region to another. You can restore the most recent instance of your replicated VM on your DR site and be up and running quickly. The database on this replicated VM may be more current than the database you are able to reconstruct through log files or from an older backup, but it will not be as current as the database built via asynchronous replication. You may be able to bring your DR site online quickly by restoring this replicated VM, but perhaps not as quickly as you could if you were using an asynchronous replication approach.