Surviving the Unexpected: Cloud Disaster Recovery Techniques

Leveraging the Cloud for Business Resilience

Mar 25, 2024

Introduction

In the digital enterprise, data is the lifeblood of every organization. However, natural or man-made disasters can disrupt business operations and cause data loss. This is where cloud disaster recovery strategies come into play: they support business continuity by minimizing downtime and limiting data loss.

Disaster recovery (DR) is an organization’s ability to restore access and functionality to IT infrastructure after a disaster event, whether natural or caused by human action (or error). — [4]

Before we dive into the cloud disaster recovery strategies, it’s important to understand a few key terms:

Recovery Time Objective (RTO): This is the targeted duration of time within which a business process must be restored after a disaster in order to avoid unacceptable consequences. In simpler terms, it’s the maximum acceptable length of time that your application can be offline. This varies from business to business. For example, for an e-commerce site, an RTO of a few minutes might be necessary, while a non-critical reporting tool might have an RTO of a few hours.

Recovery Point Objective (RPO): This is the maximum period of time for which data might be lost due to a major incident. In other words, it defines how much data loss is acceptable. For instance, if you have an RPO of one hour, you should back up your data at least every hour. In the event of a disaster, you would then lose at most an hour’s worth of data.

Both RTO and RPO are crucial in disaster recovery planning and largely influence the choice of strategy. They help businesses understand the tolerance towards downtime (RTO) and data loss (RPO).
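To make the two objectives concrete, here is a minimal Python sketch that measures an incident against them. The `Incident` class, the function names, and the timestamps are all hypothetical and only illustrate the definitions above: the data loss window runs from the last good backup to the failure, and downtime runs from the failure to the restore.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    last_backup: datetime  # most recent good backup taken before the failure
    failure: datetime      # moment the primary system went down
    restored: datetime     # moment service was available again


def data_loss_window(incident: Incident) -> timedelta:
    """Data written after the last backup is lost; compare this to the RPO."""
    return incident.failure - incident.last_backup


def downtime(incident: Incident) -> timedelta:
    """Time the service was unavailable; compare this to the RTO."""
    return incident.restored - incident.failure


def meets_objectives(incident: Incident, rto: timedelta, rpo: timedelta) -> dict:
    return {
        "rto_met": downtime(incident) <= rto,
        "rpo_met": data_loss_window(incident) <= rpo,
    }


# Hypothetical incident: hourly backups, failure 45 minutes after the last one.
incident = Incident(
    last_backup=datetime(2024, 3, 25, 9, 0),
    failure=datetime(2024, 3, 25, 9, 45),
    restored=datetime(2024, 3, 25, 12, 15),
)
result = meets_objectives(incident, rto=timedelta(hours=2), rpo=timedelta(hours=1))
print(result)  # {'rto_met': False, 'rpo_met': True}
```

In this made-up case the hourly backups keep data loss within the one-hour RPO, but the restore takes two and a half hours and misses the two-hour RTO. That is the kind of gap that pushes a team toward one of the faster strategies below.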

Backup and Restore

Backup and restore is the most basic strategy. It involves making copies of data and storing them in the cloud. In case of a disaster, the data is restored from the backup. This strategy is simple and cost-effective, but restoring large amounts of data can be time-consuming, so it typically has the longest recovery time of the strategies covered here.

Figure: GCP Backup and DR Service (image source: Google)

For example, a small e-commerce company regularly backs up its data to a cloud service. When a local server failure occurs, they are able to restore their operations quickly from the cloud backup.
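How quickly a restore completes depends heavily on how much data has to come back. As a rough, back-of-the-envelope sketch (the function name and the figures are assumptions for illustration, and real restores add overhead this ignores), transfer time can be estimated from backup size and network throughput:

```python
def estimated_restore_hours(backup_size_gb: float, throughput_mbit_s: float) -> float:
    """Estimate transfer time for a restore, ignoring per-request overhead.

    Uses decimal units: 1 GB = 8,000 megabits.
    """
    if throughput_mbit_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = backup_size_gb * 8_000 / throughput_mbit_s
    return seconds / 3_600


# Hypothetical sizes over the same 500 Mbit/s link.
small = estimated_restore_hours(backup_size_gb=50, throughput_mbit_s=500)
large = estimated_restore_hours(backup_size_gb=20_000, throughput_mbit_s=500)
print(f"50 GB: {small:.2f} h, 20 TB: {large:.1f} h")
```

A small shop with tens of gigabytes can be back in minutes, while tens of terabytes over the same link would take days. Comparing this estimate to the RTO is a quick way to see whether backup and restore alone is enough.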

Pilot Light

The pilot light strategy involves replicating critical systems in the cloud. A minimal version of the system is always running. In case of a disaster, this system can be quickly scaled up.

The term pilot light refers to a small flame that is always lit in devices such as gas-powered heaters, and can be used to start the devices quickly when required. In the context of DR, a pilot-light environment contains the core components of a given workload, with the latest configuration and critical data, running at a minimal scale at a location that’s remote from the primary site. In the event of a disaster at the primary site, you can use the pilot-light components at the remote location to restore a production-scale environment quickly. -[5]

For example, a financial institution uses the pilot light strategy. They have a minimal version of their environment always running in the cloud. In case of a disaster, they can quickly scale up this environment.
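The difference between the pilot light and full production can be written down as a scale-up plan. The sketch below is a simplified model, not any provider's API: the `Component` class, the component names, and the instance counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    pilot_instances: int       # kept running in the DR site at all times
    production_instances: int  # needed to carry full production load


def scale_up_plan(components: list[Component]) -> dict[str, int]:
    """Return how many extra instances each component needs at failover."""
    return {
        c.name: c.production_instances - c.pilot_instances
        for c in components
        if c.production_instances > c.pilot_instances
    }


# Hypothetical workload: only the database replica runs in the pilot light.
workload = [
    Component("database", pilot_instances=1, production_instances=1),
    Component("app-server", pilot_instances=0, production_instances=6),
    Component("web-frontend", pilot_instances=0, production_instances=3),
]
plan = scale_up_plan(workload)
print(plan)  # {'app-server': 6, 'web-frontend': 3}
```

Here the core data layer is already running, so recovery time is dominated by how long it takes to launch the components listed in the plan.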

Warm Standby

Warm standby involves keeping a scaled-down version of a fully functional environment always operational in the cloud. It provides a faster recovery time than the pilot light strategy but is more expensive. This environment is regularly updated using full and incremental backups of the primary system.

In the event of a disaster, this system can quickly scale up to handle the production load. This strategy trades off lower recovery time objectives (RTO) and recovery point objectives (RPO) against the costs of implementation and operation. It’s an active/passive strategy: the standby environment does not serve traffic until a failover event is triggered.

For example, an online streaming service uses the warm standby strategy. They have a scaled-down version of their fully functional environment always operational in the cloud. When their primary site goes down, they switch over to the cloud environment with minimal disruption.
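What counts as a "failover event" is a design decision. One common approach is to fail over only after several consecutive failed health checks, so that a single blip does not trigger a switch. The sketch below assumes that approach; the `FailoverMonitor` class and the threshold of three are hypothetical, and a real setup would rely on the health checks of its DNS or load-balancing service.

```python
class FailoverMonitor:
    """Switches traffic to the standby after repeated failed health checks."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.active_site = "primary"

    def record_check(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the site serving traffic."""
        if self.active_site == "standby":
            return self.active_site  # failback is left as a manual decision
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active_site = "standby"
        return self.active_site


monitor = FailoverMonitor(failure_threshold=3)
checks = [True, False, False, True, False, False, False, True]
history = [monitor.record_check(ok) for ok in checks]
print(history)
```

In this run the first two failures are forgiven because a healthy check resets the counter; the switch to the standby happens only on the third failure in a row, and the monitor then stays on the standby even when the primary recovers.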

Hot Standby

The Hot Standby recovery strategy is a high-availability disaster recovery approach where a fully operational duplicate of the primary system is always on standby. This secondary system is kept updated in real-time and is ready to take over the primary system’s operations immediately in the event of a failure. This strategy ensures continuous operation and minimizes downtime.

However, it can be expensive to configure and maintain, because it requires real-time data replication and a duplicate environment running at full production capacity. As with warm standby, the standby environment does not serve traffic until a failover event is triggered.

For example, a global social media platform uses the hot standby strategy. They maintain a full-scale replica of their application in the cloud. When a disaster strikes the primary site, they immediately switch over to the cloud replica with little or no service disruption.
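The key property of a hot standby is that the replica already holds every acknowledged write when the primary fails. The toy model below illustrates that with synchronous replication inside a single process; the `HotStandbyStore` class is hypothetical, and production systems would use their database's own replication features.

```python
class HotStandbyStore:
    """Toy key-value store that copies each write to a standby before returning."""

    def __init__(self) -> None:
        self.primary: dict[str, str] = {}
        self.standby: dict[str, str] = {}
        self.primary_alive = True

    def write(self, key: str, value: str) -> None:
        # Synchronous replication: the standby is updated as part of the write,
        # so an acknowledged write is never missing from the standby.
        if self.primary_alive:
            self.primary[key] = value
        self.standby[key] = value

    def read(self, key: str) -> str:
        source = self.primary if self.primary_alive else self.standby
        return source[key]

    def fail_over(self) -> None:
        """Simulate losing the primary and serving from the standby."""
        self.primary_alive = False
        self.primary.clear()


store = HotStandbyStore()
store.write("order-1", "paid")
store.write("order-2", "shipped")
store.fail_over()
print(store.read("order-2"))  # shipped
```

Because the standby is written in the same step as the primary, nothing acknowledged is lost at failover, which is what keeps the RPO close to zero. The price is that every write waits on both copies.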

Multi-Site Solution

A multi-site solution involves running applications in multiple cloud regions simultaneously. It provides the highest level of availability and can withstand regional disasters. Data is kept synchronized across sites in real time, so the remaining sites can immediately take over a failed site’s traffic.

Figure: Multi-site active/active DR strategy (image source: Amazon)

This strategy ensures continuous operation, minimizes downtime, and provides protection against regional disasters. However, it can be expensive and complex, because it requires real-time data replication and full production capacity at multiple sites. It’s an active/active strategy: all sites actively serve traffic, and the others take over if one site fails.

For example, a multinational corporation runs its applications in multiple cloud regions simultaneously. When one region experiences an outage, the traffic is automatically rerouted to the other regions.
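The rerouting step can be pictured as recomputing each region's share of traffic whenever health changes. The sketch below assumes an even split across healthy regions, which is only one possible policy; the function name and region names are hypothetical.

```python
def traffic_shares(region_health: dict[str, bool]) -> dict[str, float]:
    """Split traffic evenly across healthy regions; unhealthy regions get none."""
    healthy = [region for region, ok in region_health.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy region available")
    share = 1 / len(healthy)
    return {region: (share if ok else 0.0) for region, ok in region_health.items()}


# Hypothetical regions: all healthy, then one region suffers an outage.
normal = traffic_shares({"us-east": True, "eu-west": True, "ap-south": True})
outage = traffic_shares({"us-east": True, "eu-west": False, "ap-south": True})
print(normal)
print(outage)
```

Note that when one of three regions drops out, each survivor's share grows from a third to a half. Every site therefore needs headroom beyond its normal load, which is part of why this strategy costs more.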

Conclusion

Choosing the right cloud disaster recovery strategy depends on the specific needs and budget of your organization. While backup and restore might be sufficient for some, others might require the high availability provided by a multi-site solution. By understanding these strategies, you can make an informed decision and ensure the continuity of your business operations in the face of disaster.
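One way to frame that decision is to pick the cheapest strategy whose expected recovery figures still meet your RTO and RPO. The sketch below does exactly that. Every number in it is invented for illustration and is not taken from any provider; substitute your own estimates and costs.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Option:
    name: str
    expected_rto: timedelta
    expected_rpo: timedelta
    monthly_cost: float


def cheapest_option(
    options: list[Option], rto: timedelta, rpo: timedelta
) -> Optional[Option]:
    """Return the lowest-cost option that meets both objectives, if any."""
    viable = [o for o in options if o.expected_rto <= rto and o.expected_rpo <= rpo]
    return min(viable, key=lambda o: o.monthly_cost, default=None)


# Illustrative, made-up estimates for one hypothetical workload.
options = [
    Option("backup and restore", timedelta(hours=24), timedelta(hours=24), 100),
    Option("pilot light", timedelta(hours=1), timedelta(minutes=15), 600),
    Option("warm standby", timedelta(minutes=10), timedelta(minutes=1), 2_000),
    Option("hot standby", timedelta(minutes=1), timedelta(seconds=1), 6_000),
    Option("multi-site", timedelta(seconds=30), timedelta(seconds=1), 9_000),
]

choice = cheapest_option(options, rto=timedelta(minutes=30), rpo=timedelta(minutes=5))
print(choice.name if choice else "no option meets the objectives")  # warm standby
```

With these sample figures, a 30-minute RTO and 5-minute RPO rule out backup and restore and pilot light, and warm standby is the cheapest option left. If nothing meets the objectives, the function returns `None`, which is a signal to revisit either the objectives or the budget.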

References

[1] Disaster recovery (DR) objectives - reliability pillar. (n.d.-b). https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/disaster-recovery-dr-objectives.html
[2] Disaster recovery options in the cloud - disaster recovery of … (n.d.-c). https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html
[3] Google. (n.d.). Backup and DR Service overview | Google Cloud. Google. https://cloud.google.com/backup-disaster-recovery/docs/concepts/backup-dr
[4] Google. (n.d.-b). What is disaster recovery and why is it important? | Google Cloud. Google. https://cloud.google.com/learn/what-is-disaster-recovery#:~:text=Disaster%20recovery%20(DR)%20is%20an,human%20action%20(or%20error).
[5] Design a pilot-light disaster recovery (DR) topology. Oracle Help Center. (2021, July 12). https://docs.oracle.com/en/solutions/oci-pilot-light-dr/index.html#GUID-3C1F7B6B-0195-4166-A38C-8B7AD53F0B79