High Availability and Disaster Recovery Plan

Resiliency is usually discussed through two terms: High Availability (HA) and Disaster Recovery Plan (DRP). The two are often confused in IT, including on Flexible Engine.

Definitions on FE

High Availability

HA responds to a failure of a service by providing a second instance of that service, keeping the interruption to a minimum. In every case HA has to be automatic, and it is best provided by the software itself. In some cases HA can also be provided by hardware or a third-party solution, but consistency may then not be guaranteed. Regardless of where the second instance is located, HA is a generic term meaning that your service is fault tolerant to a specific failure case.

High Availability categories

(See the architecture best practices to learn more)

  • Intra Availability Zone HA: HA in the same AZ as the nominal instance (all services used in Flexible Engine are covered by HA within an AZ)
  • Inter Availability Zones HA: HA distributed to one or more AZs other than the nominal one, in the same region
  • Multi Region HA: HA distributed to one or more regions other than the nominal one
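
The three categories above differ only in where the second instance sits relative to the nominal one. A minimal sketch of that rule follows; the `Location` type, the `ha_category` function and the region and AZ names are illustrative assumptions, not Flexible Engine identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Where an instance runs (hypothetical placeholder names)."""
    region: str
    az: str


def ha_category(nominal: Location, secondary: Location) -> str:
    """Return the HA category implied by the two locations."""
    if nominal.region != secondary.region:
        return "Multi Region HA"
    if nominal.az != secondary.az:
        return "Inter Availability Zones HA"
    return "Intra Availability Zone HA"


nominal = Location(region="region-1", az="az-a")
print(ha_category(nominal, Location("region-1", "az-a")))
print(ha_category(nominal, Location("region-1", "az-b")))
print(ha_category(nominal, Location("region-2", "az-a")))
```

The region check comes first because two AZs that happen to share a name in different regions are still different locations.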

Disaster Recovery Plan

A DRP responds to a major failure of a service or provider, or to any failure that HA cannot cover. It should cover a wide range of applications and is not limited to the technical side. In most cases customers want to trigger it manually, because it has a large impact on their applications and the status of each application needs to be verified afterwards. If the customer wants a DRP with very low interruption (RTO/RPO near 0), it is called a BCP (Business Continuity Plan).

In the end, DRP and BCP are generic terms that also describe HA, only from a different angle: in common usage, most people say HA for local HA and DRP for regional HA. In most cases a DRP is based on third-party solutions, with or without failover, whether manual or automatic. As described above, HA permits very low interruption; a DRP is different and is defined by two factors, the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO):

  • A Recovery Time Objective is the maximum amount of time that a system can be down before it is recovered to an operational state. For some systems, this recovery time objective can be measured in hours or even days, but for more mission-critical systems the recovery time objectives are typically measured in seconds.
  • A Recovery Point Objective is the amount of data loss, measured in time, that is tolerable in a disaster. For some systems, losing a day’s worth of data might be acceptable, while for other systems this might be mere minutes or seconds.

The lengths of the RTO and RPO have profound implications for how a disaster recovery plan is implemented: it can be as simple as backup and restore in another region, or far more complex, with hardware or software replication or HA solutions.
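
To make the RTO and RPO definitions concrete, here is a minimal sketch that checks a simple backup-and-restore plan against both objectives. The function name, its parameters and the sample figures are assumptions chosen for illustration. It also assumes the worst case, where the failure happens just before the next backup, so the data lost equals one full backup interval.

```python
from datetime import timedelta


def meets_objectives(backup_interval: timedelta,
                     restore_duration: timedelta,
                     rpo: timedelta,
                     rto: timedelta) -> dict:
    """Compare a backup/restore plan with the RPO and RTO targets."""
    # Worst case: the disaster strikes right before the next backup runs.
    worst_case_data_loss = backup_interval
    return {
        "rpo_met": worst_case_data_loss <= rpo,
        "rto_met": restore_duration <= rto,
    }


# Daily backups restored in 4 hours, against RPO = 1 hour and RTO = 8 hours.
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    restore_duration=timedelta(hours=4),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=8),
)
print(result)  # {'rpo_met': False, 'rto_met': True}
```

In this example the restore is fast enough, but daily backups cannot satisfy a one-hour RPO, which is the kind of gap that pushes a plan from plain backup and restore towards replication.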

Disaster Recovery Plan categories

(See the architecture best practices to learn more):

  • Multi AZ DRP: DRP between the nominal AZ and one or more other AZs in the same region
  • Multi Region DRP: DRP between the nominal region and one or more other regions
  • Multi Cloud: DRP between the nominal cloud provider and one or more other cloud providers
  • On Premises to Cloud: DRP between an on-premises data center and Flexible Engine


High Availability is the technical part, while the DRP gives a high-level view that helps define which HA solution to choose for each scenario.

Resiliency of Flexible Engine Features

This section covers the resiliency offered by or built into each feature.



Intra AZ: Resiliency relies on multiple servers and an automatic HA mechanism that recovers the ECS on another hypervisor in the AZ if needed. Because this is a cold migration, a service interruption can be observed while the OS reboots.


You can create an anti-affinity group and add ECSs to it if you want them to run on different hypervisors.
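
The anti-affinity constraint can be pictured as "no two members of the group share a hypervisor". The sketch below is only a model of that rule, not the Flexible Engine scheduler or API; the function name, the ECS names and the hypervisor names are hypothetical.

```python
from collections import defaultdict


def anti_affinity_violations(placement: dict[str, str]) -> dict[str, list[str]]:
    """Return the hypervisors that host more than one group member.

    `placement` maps each ECS name to the hypervisor it runs on.
    An empty result means the anti-affinity rule holds.
    """
    by_host = defaultdict(list)
    for ecs, host in placement.items():
        by_host[host].append(ecs)
    return {host: sorted(members)
            for host, members in by_host.items()
            if len(members) > 1}


print(anti_affinity_violations({"ecs-1": "hv-a", "ecs-2": "hv-b"}))
print(anti_affinity_violations({"ecs-1": "hv-a", "ecs-2": "hv-a", "ecs-3": "hv-b"}))
```

With the rule satisfied, a single hypervisor failure affects at most one ECS of the group.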


There is no HA mechanism on BMS, as it is a dedicated bare metal server. In case of failure, a spare BMS will be provided. As a BMS can contain local storage, please make sure that you have a backup plan for it.




Inter AZ resiliency: HA in the backend. Data is stored on OBS.

Auto Scaling

Inter AZ resiliency: HA in the backend

Dedicated Cloud

Depends on the subscribed design

Dedicated Host

Intra AZ: HA in the backend, handled like the ECS intra AZ resiliency





Intra AZ: Data is distributed across 3 storage nodes

Dedicated Storage Service

Depends on the subscribed design


Inter AZ resiliency
The goal of this feature is to replicate EVS data from one AZ to another


Inter AZ resiliency


2 classes of object storage:

  • Mono AZ (intra AZ HA): data is distributed across 3 nodes in the same AZ
  • Multi AZ (inter AZ HA): data is distributed across 3 AZs

Capable of inter region resiliency through the cross region OBS replication feature


SFS Classic

Intra AZ resiliency: Data is distributed across 3 nodes in the same AZ

SFS Turbo

Intra AZ resiliency: Data is distributed across 3 nodes in the same AZ
Inter AZ resiliency is on the roadmap


Inter AZ resiliency for frontend
Data is stored on Mono AZ object storage (Multi AZ storage is on the roadmap). Capable of inter region resiliency through the cross region OBS replication feature



Inter AZ resiliency, with subnets extended across all AZs of the region


Inter AZ resiliency

Direct Connect

Direct Connect is sold with 2 ports in different AZs


NAT Gateway

Inter AZ resiliency

VPC Endpoint

Inter AZ resiliency


Anti DDoS


Inter AZ resiliency


Inter AZ resiliency

Management and Deployment


Inter AZ resiliency

Cloud Eye






SWR for Container

Service Stage


DMS for Kafka


API Gateway

Inter AZ resiliency



The service includes inter AZ resiliency
For database instances, two modes are available to customers:

  • Standalone: one instance in 1 AZ
  • Master/Standby: 2 instances in a master/standby architecture, which may or may not be distributed across 2 AZs
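
Whether a database deployment tolerates the loss of an AZ follows directly from the two modes above. A minimal sketch, assuming a hypothetical `DbDeployment` record with placeholder AZ names; it says nothing about how the service itself performs the failover.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DbDeployment:
    """Hypothetical description of a database instance layout."""
    master_az: str
    standby_az: Optional[str] = None  # None means Standalone mode


def survives_az_failure(deployment: DbDeployment) -> bool:
    """True only when a standby exists in an AZ other than the master's."""
    return (deployment.standby_az is not None
            and deployment.standby_az != deployment.master_az)


print(survives_az_failure(DbDeployment("az-a")))           # Standalone
print(survives_az_failure(DbDeployment("az-a", "az-a")))   # Master/Standby, same AZ
print(survives_az_failure(DbDeployment("az-a", "az-b")))   # Master/Standby, 2 AZs
```

Only the last layout gives inter AZ resiliency. A master/standby pair in a single AZ still covers an instance failure, but not the loss of that AZ.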





Data Analysis