Building a resilient workload

Technical additional documentation

This article describes resiliency design & features offered by Flexible Engine, and share good practises & example to respect for any workload design for High Availibility 

A Multi Region / Multi AZ design

Design

3 AZ per Region


Standard design of Orange Business Service Public Cloud is 3 Availability zones [ie. AZ] based to

  • Optimize your applications via a distributed architecture;
  • Replicate your data, and implement your disaster recovery / business continuity plan;
  • Reduce the impact of single point of failure.

Our commitments

  • One AZ = One (or more) DC (Data Center)
  • Dual connectivity from external
  • Private high performance backbone between AZ to enable LAN workload
  • Less than 10km between each DC to minimize latency (less than 1.2ms between AZ), but enough to face local disaster

Flexible Engine

Available Region

Flexible Engine regions and its Availability Zones

Current High level design

Based on an extended Openstack design at the region level, each Avalibilty Zone  [AZ] is matching specific DC to cover disaster in one place

  • External Connectivity (Internet by default, Private network as option) lands in AZ a & b to offer High Available access, any time
  • Extended Virtual network (Virtual Private Cloud & Subnets) without latency inside region to enable workload geo-distribution

 

This architecture propose native capacity for user to distribute their workload on that network across physical Datacenter, hiding the compexity behind the Virtual Private Cloud, enabling any High Level design for any customer workload on top of the Flexible Engine Region

How to evaluate a resilient approach for Cloud workload

On Flexible Engine, the level of resilience is to be defined on several levels according to any Public Cloud model

Resilience

Définition

Owner

Cloud Service

The Cloud service is designed to be resilient in multi AZ within a region

Orange

Cloud Ressource

Cloud resources deployed provide a level of resilience to be identified: AZ > Regional > Global

Orange

Application workload

The application must be designed to take advantage of the 2 previous plans, in accordance with good practice

Users

Cloud Services resiliency within Paris region

Resiliency per service

NB : inventory below is based on Paris region, Amsterdam region is being upgraded for availability

Apply to Flexible Engine service, we can consider 3 different categories relying on their reliency model :

  • Some Services are natively resilient. Their design implement a native High Availibilty, distributed between Availibility Zone in the region, without any action from the user providing the such Cloud ressource. As an example, the ELB building Load balancing between ECS backend is build on a HA configuration between AZa & AZb
  • Some Services propose resiliency as an Option. This mean user can choose between a standalone mono AZ configuration and a HA multi AZ one (setup by default). In case of HA configuration, user can place the Cloud ressource building the service in several AZ. As an example, the RDS service propose some Master/Slave configuration in one click; enabling to place Master in one AZ and Slave in another one.
  • Basic Compute (ECS) and Storage (EVS) implement AZ local HA (thanks to autorecovery feature for ECS or 3 copy concept for EVS). But user can leverage the extended Virtual Network to build its own workload , and distribute compute Nodes accros AZ in the region. These Services are considered as Enabler for workload resiliency
 

Provisionning

Item

Service

Resiliency

Enabler

Option

Native

1

Connection to the service

Management

IAM

 

2

Creation of the network base & associated services

Network

VPC

   

ELB

   

DC

   

DNS

   

NAT GW

   

3

Provisioning of computing resources and storage

Compute

ECS

√[1]

   

BMS

   

CCE

 

√[4]

 

Storage

EVS

√ [5]

   

4

Provisioning of shared storage resources

Storage

SFS

OBS

x[2]

   
 

√[3]

 

sDRS

 

√[1]

 

5

Provisioning of managed resources

Database

RDS

 

√[4]

 

DDS

 

√[4]

 

Application

DCS

 

√[4]

 

DMS

 

√[4]

 

6

Adding security services

Security

Anti-DDoS

   

KMS

   

WAF

   

 

[1] User should locate its ECS in one AZ of the region. Locally in the AZ, ECS availibility rely on autorecovery (ie. relocate & restart after Hypervisor failure, possible only for flavor without local disk, like D series) 

[2] SFS propose NAS feature with different class of storage :

  • SFS Capacity oriented : volume created are located in eu-west-Xa only, even if the volume can be mounted in enay ECS of the region, thanks to the VPC/Subnet cross AZ feature
  • SFS Turbo : volume cerated can be located in any AZ of the region, with a lolca resiliency design.

[3] OBS backend storage implement a 3 copy mechanisms allowing resiliency (check EVS description below, mechanism is identical). Different class of Storage are available

  • Online storage (Standard & Warm) propose a Multi-AZ resiliency (1 copy per AZ) and a Mono-AZ resiliency (3 copy in AZ eu-west-Xa)
  • Offline storage (Cold)  propose only Mono-AZ resiliency (3 copy in AZ eu-west-Xa)

[4] All those services propose resiliency as an option with cross AZ configuration activated by default. User can decide to unconfigure HA to save mony (for test purpose as example)

[5] EVS propose some virtual disk in the AZ where the user locate the ECS. As a consequence, EVS propose a Mono AZ resiliency with a 3 copy mechanism, check EVS Three-Copy Redundancy for more detail 

Focus on backup

Service

Backup & Restore object

Functions

Data localisation

Cross Region Replication

Support OBS 3AZ

VBS [*]

Volume EVS disk (system or data disk)

  • EVS disk specific backup
  • Backup data management
  • EVS disk restoration using data backups
  • EVS disk creation using backups

OBS Mono AZ in eu-west-Xa

Not planned[*]

Not planned[*]

CSBS [*]

  • ECS + all attached disks (system and data disks)
  • ECS based backup
  • Policy driven data backup
  • Data backup management
  • Image creation using backups
  • All EVS disks on an ECS have consistent data

OBS Mono AZ in eu-west-Xa

Not planned[*]

Not planned[*]

CBR

(CSBS/VBS evolution)

  • ECS + all attached disks (system and data disks)
  • BMS,
  • SFS Turbo

Cloud Server Backup:

  • Creating Backup policy
  • Backup management
  • Restoring data using a cloud Server Backup
  • Sharing a backup (with other projects)

Cloud Disk Backup:

  • Creating Backup policy
  • Backup management
  • Restoring Data Using a Cloud Disk Backup
  • Using a backup to create a Disk

SFS Turbo Backup:

  • Creating Backup policy
  • Backup management
  • Restoring Data Using a SFS Turbo Backup
  • Using a backup to create a SFS volume

OBS Mono AZ in eu-west-Xa for any Vault created before 2022Q4

OBS Multi AZ for any Vault created after 2022Q4

Live

Live for any Vault created after 2022Q4

FAB

  • Files
  • DB
  • Applications (Exchange, AD, Office 365, etc…)
  • CommVault based solution
  • Backup management
  • Backup for On Premise, or other CSP data
  • Granular restoration (File or application level)

OBS Multi AZ (eu-west-0 or eu-west-1)

Live

Live

 

[*] CBR is the next generation Service to federate all backup services : CSBS will be replaced by CBR Server Backup & VBS will be replaced by CBR Volume Backup. So good practis is to use CBR for any new workload

Tips & Good practice

[1] ECS/EVS/sDRS

  • ECS and their EVS volumes are provisioned in a given AZ, with local AZ resiliency (Auto-recovery ECS & distributed block storage, check EVS Three-Copy Redundancy for more detail )
  • Auto Scaling Group on multi AZ when possible for your workload hosted on ECS, with ELB integration to manage the flow
  • sDRS is only available on eu-west-0a & 0b! Depending the backups Service you are using (if backups are stored on OBS mono-AZ, in eu-west-0a, check above), the sDRS master has to be provisioned on eu-west-0b

[2] SFS/SFS Turbo

  • SFS Turbo is available on each AZ (SFS only available on eu-west-0a ) but only locally resilient even if visible from each AZ (via NFS) [Roadmap evolution & workaround provision in progress].

[3] OBS

  • OBS is available in Mono-AZ and multi-AZ for the Online classes (Standard & Warm) while the Offline class (Cold) is Mono-AZ [evolution Roadmap].
  • It is recommand to migrate from Mono-AZ OBS to a multi-AZ OBS, thanks to obstutil tool. Check the obsutil online help to download, setup and replicate a bucket 
  • CSBS & VBS Backup services (or any backup feature integrated with managed services, consider RDS backup as an example) still use OBS mono-AZ (on Xa). As they are soon deprecated, prefer to use CBR, but becarefull to (re)create the Backup Vault after the release date (check Roadmap). If you are using CSBS/VBS, you can refer to Migrating Resources from CSBS/VBS to migrate to CBR

[4] Managed services (CCE, RDS, DDS, DCS, etc…)

  • Enable HA functionality with respect to localization
  • Master/Slave cluster services (RDS type): master on eu-west-0b or 0c
  • Services with quorum (type CCE, DDS, DCS): node distribution on the 3 AZ
  • Auto Scaling Group on multi AZ when possible for your workload hosted on CCE

 

User workload (Paris region as example)

legacy to Cloud workload

Characteristics

  • IaaS
  • Non-resilient design by default

Solution

  • Backup
  • sDRS

Recommendation

  • Workload on AZ eu-west0-b or eu-west0-c
  • Backup with CBR or FAB (OBS Multi AZ)
  • sDRS with master on AZ eu-west0-b & slave on eu-west-0a
  • Enable Auto Scaling Group on multi AZ when possible for your workload hosted on ECS

Data in the Cloud

Characteristics

  • Massive data storage

Solution

  • OBS S3 object storage

Recommendation

  •  Use OBS Multi-AZ (exept for cold storage class, which is Mono AZ only)
  • Use Life Cycle rules to optimize Class storage
  • Implement bucket replication for cross-region replication to Amsterdam

Cloud Native workload

Characteristics

  • Massive use of Infra As Code via CI/CD [Terraform provider]
  • Data localisation

Solution

  • Network services (ELB, NATGW, etc.)
  • Managed services CCE, RDS, DDS, etc.

Recommendation

  • Respect the best practices of managed services in HA service deployment with Multi-AZ
  • Master/Slave cluster services (RDS type): master on eu-west-0b or 0c:
  • Services with quorum (CCE type): node distribution on the 3 AZ
  • Data backup on Multi-AZ OBS
  • Enable CCE autoscaler add-on