RTO & RPO achievable through Azure Services

RPO stands for recovery POINT objective, i.e., how much data one is prepared and willing to lose in the worst case. This tolerance for data loss (seconds, hours, days) determines the frequency of backups and the replication approach. RPO is measured in units of time, not volume: "30 minutes of data", "four hours of data", and so on.
RPO <-- Data lost: how far back? --( Disaster strikes )-- Downtime: how long to recover? --> RTO

RTO stands for recovery TIME objective, i.e., if and when the ‘bad thing’ happens, how long it takes to be back up and running again. For example, if the acceptable downtime in the event of a disaster is eight hours, then your RTO is eight hours.

An RTO of 30 minutes means the service must be running again within half an hour and an RPO of 5 minutes means no more than 5 minutes’ worth of data can be lost.
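The two objectives can be checked against an incident timeline. The following is a minimal sketch, not a prescribed method: the function names and the timestamps are hypothetical and chosen only to illustrate the 30-minute RTO and 5-minute RPO above.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_recovery_point, disaster_at, service_restored_at):
    """Return (data_loss_window, downtime) as timedeltas."""
    # Data written after the last recovery point is lost.
    data_loss = disaster_at - last_recovery_point
    # Downtime runs from the disaster until service is restored.
    downtime = service_restored_at - disaster_at
    return data_loss, downtime


def meets_objectives(data_loss, downtime, rpo, rto):
    """True if the incident stayed within both objectives."""
    return data_loss <= rpo and downtime <= rto


# Hypothetical incident timeline.
last_point = datetime(2024, 1, 1, 11, 57)  # last replicated/backed-up state
disaster = datetime(2024, 1, 1, 12, 0)
restored = datetime(2024, 1, 1, 12, 25)

data_loss, downtime = recovery_metrics(last_point, disaster, restored)
ok = meets_objectives(
    data_loss, downtime, rpo=timedelta(minutes=5), rto=timedelta(minutes=30)
)
print(data_loss, downtime, ok)
```

Here 3 minutes of data are lost and the service is down for 25 minutes, so both objectives are met.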

Image: Taos.com

Many clients' first reaction is to ask for an RTO and RPO of zero (i.e., no data loss and no downtime).

While this is technically possible, RPOs of zero require synchronous replication. Synchronous replication by design requires every write, update, or delete to be applied in multiple locations before an ACK is returned to the application. These additional transactions to multiple locations may degrade performance to an unacceptable degree, typically because of network distance and the associated latency.
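To make the latency cost concrete, here is a deliberately simplified model. It assumes replicas are written in parallel and that each replica's round-trip time already includes its own write, so the slowest replica dominates; the function names and the numbers are illustrative assumptions, not measurements.

```python
def sync_write_latency_ms(local_write_ms, replica_rtts_ms):
    """Commit latency when the primary waits for every replica's ACK.

    Simplifying assumption: replicas are contacted in parallel, so the
    slowest round trip is added on top of the local write.
    """
    return local_write_ms + max(replica_rtts_ms, default=0.0)


def async_write_latency_ms(local_write_ms):
    """Commit latency when replication happens in the background."""
    return local_write_ms


# Hypothetical figures: a 2 ms local write, one replica in a nearby
# zone (1.5 ms round trip) and one in a distant region (70 ms round trip).
sync_ms = sync_write_latency_ms(2.0, [1.5, 70.0])
async_ms = async_write_latency_ms(2.0)
print(sync_ms, async_ms)
```

Under these assumptions every synchronous commit takes 72 ms instead of 2 ms, which is the kind of penalty that makes a zero RPO across distant regions hard to accept.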

It is difficult to guarantee RPO/RTO targets as there are many dependencies not necessarily in your control. If someone initiates a restore, the time it takes to be back up and running is dependent on variables like the size of the restore, available network bandwidth, speed of the disk drives/VMs, etc.
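A rough lower bound on restore time can be sketched from the variables just listed. This is only an estimate under stated assumptions: the transfer is limited by the slower of network and disk, throughput is constant, and decimal units are used (1 GB = 1000 MB). The function name and inputs are hypothetical.

```python
def estimate_restore_seconds(restore_size_gb, network_mbps, disk_mb_per_s):
    """Estimate restore time from size and the slowest throughput.

    network_mbps is in megabits per second; disk_mb_per_s is in
    megabytes per second.
    """
    size_mb = restore_size_gb * 1000
    network_mb_per_s = network_mbps / 8  # bits to bytes
    bottleneck_mb_per_s = min(network_mb_per_s, disk_mb_per_s)
    return size_mb / bottleneck_mb_per_s


# Hypothetical restore: 500 GB over a 1 Gbps link onto disks that
# sustain 200 MB/s. The network (125 MB/s) is the bottleneck.
seconds = estimate_restore_seconds(500, 1000, 200)
print(f"{seconds / 60:.0f} minutes")
```

Real restores usually take longer than this bound because of provisioning, throttling, and verification steps that the model ignores.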

The RPO of a backup solution depends mostly on its backup policy. For example, if someone sets up a daily backup policy, then the RPO is close to a day.
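The relationship between backup schedule and RPO can be written down directly. In this sketch the worst case is modeled as the backup interval plus the time a backup takes to complete, on the assumption that a backup is only usable once it has finished; that model and the function name are illustrative assumptions.

```python
from datetime import timedelta


def worst_case_backup_rpo(backup_interval, backup_duration=timedelta(0)):
    """Worst-case data loss for a periodic backup policy.

    Assumption: a disaster just before the next backup completes leaves
    only the previous backup usable, so the loss window is the interval
    plus the backup's own run time.
    """
    return backup_interval + backup_duration


# Daily backups that finish instantly versus daily backups that run 2 hours.
daily = worst_case_backup_rpo(timedelta(days=1))
daily_slow = worst_case_backup_rpo(timedelta(days=1), timedelta(hours=2))
print(daily, daily_slow)
```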

In a more traditional DR failover scenario, whether on-prem to cloud or cloud to cloud, it is common to use a service like Azure Site Recovery (ASR). Since the data has already been replicated, the RTO in this case depends on factors such as how long it takes to provision the DR infrastructure on the ‘other side’, the speed of the disk drives/VMs, the time to run the recovery plan, and the time to propagate the DNS changes that point to the ‘other’ side. The RTO is generally in the range of minutes to many minutes.
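One way to reason about such an RTO is to add up the individual steps. The sketch below assumes the steps run one after another; the step names mirror the dependencies above, and the durations are hypothetical placeholders rather than Azure Site Recovery figures.

```python
from datetime import timedelta


def estimate_failover_rto(step_durations):
    """Sum sequential failover steps into a single RTO estimate."""
    return sum(step_durations.values(), timedelta(0))


# Hypothetical durations for each dependency of the failover.
steps = {
    "provision DR infrastructure": timedelta(minutes=10),
    "run recovery plan": timedelta(minutes=8),
    "propagate DNS changes": timedelta(minutes=5),
}

rto_estimate = estimate_failover_rto(steps)
print(rto_estimate)
```

Timing each step during a DR drill and feeding the measured values into a sum like this gives a more defensible RTO than a single guess.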

The RPO of a replication solution often depends most on the distance separating the two sites. For example, when someone configures ASR to replicate across two regions, the RPO is likely to be in the range of seconds to many seconds.

Azure Storage typically has an RPO of less than 15 minutes, although there's currently no SLA on how long geo-replication takes.

The RTO is a measure of how long it takes to perform the failover and get the storage account back online. The time to perform the failover includes the following actions:
- The time until the customer initiates the failover of the storage account from the primary to the secondary region.
- The time required by Azure to perform the failover by changing the primary DNS entries to point to the secondary location.

Creating a disaster recovery plan requires expert knowledge of the application's workflows, data, infrastructure, and dependencies.

The following table compares RPO and RTO of each recovery option for Azure SQL Database:
Recovery method                           RTO    RPO
Geo-restore from geo-replicated backups   12 h   1 h
Auto-failover groups                      1 h    5 s
Manual database failover                  30 s   5 s
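The table above can be used as a simple lookup when matching recovery options to business targets. The following sketch encodes the table's values as they appear here; the function name and the example targets are hypothetical.

```python
from datetime import timedelta

# (RTO, RPO) per recovery method, copied from the table above.
RECOVERY_OPTIONS = {
    "Geo-restore from geo-replicated backups": (
        timedelta(hours=12),
        timedelta(hours=1),
    ),
    "Auto-failover groups": (timedelta(hours=1), timedelta(seconds=5)),
    "Manual database failover": (timedelta(seconds=30), timedelta(seconds=5)),
}


def options_meeting(target_rto, target_rpo):
    """Return the recovery methods whose RTO and RPO fit the targets."""
    return [
        name
        for name, (rto, rpo) in RECOVERY_OPTIONS.items()
        if rto <= target_rto and rpo <= target_rpo
    ]


# Hypothetical targets: back online within 2 hours, at most 1 minute of data lost.
candidates = options_meeting(timedelta(hours=2), timedelta(minutes=1))
print(candidates)
```

With these targets, geo-restore is excluded on both counts, leaving the two failover-based options.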

Active-active or multi-master replication enables multiple replicas to be used simultaneously, enabling load balancing at the cost of complicating data consistency, while active-passive replication reserves replicas for live use only during failover.
