Building a resilient workload

Additional technical documentation

This article describes the resiliency design and features offered by Flexible Engine, and shares good practices and examples to follow when designing any workload for High Availability.

A Multi Region / Multi AZ design

Design

3 AZ per Region

The standard design of the Orange Business Service Public Cloud is based on 3 Availability Zones (AZs) per region, in order to:

Optimize your applications via a distributed architecture;
Replicate your data, and implement your disaster recovery / business continuity plan;
Reduce the impact of single point of failure.

Our commitments

One AZ = One (or more) DC (Data Center)
Dual external connectivity
Private high-performance backbone between AZs to enable LAN-like workloads
Less than 10 km between DCs to minimize latency (less than 1.2 ms between AZs), yet far enough apart to withstand a local disaster
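As a rough sanity check of the distance and latency figures above, the propagation delay over 10 km of fibre can be estimated. This is only a sketch: it assumes a signal speed in fibre of about 200,000 km/s (roughly two thirds of the speed of light in vacuum), which is a general rule of thumb and not a figure published for Flexible Engine, and the function name is illustrative.

```python
# Rule-of-thumb propagation speed in optical fibre (assumption of this
# sketch, not an Orange figure): about 2/3 of the speed of light in vacuum.
FIBRE_SPEED_KM_PER_S = 200_000.0


def round_trip_propagation_ms(distance_km: float) -> float:
    """Return the round-trip propagation delay in milliseconds."""
    one_way_s = distance_km / FIBRE_SPEED_KM_PER_S
    return 2 * one_way_s * 1000.0


if __name__ == "__main__":
    rtt = round_trip_propagation_ms(10.0)
    print(f"Round trip over 10 km of fibre: {rtt:.2f} ms")
```

Under that assumption, pure propagation accounts for about 0.1 ms, well inside the 1.2 ms inter-AZ figure, which leaves headroom for network equipment and for cable paths longer than the straight-line distance.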

Flexible Engine

Available Regions

Flexible Engine regions and their Availability Zones

Current High level design

Based on an extended OpenStack design at the region level, each Availability Zone (AZ) maps to a specific DC, so that a disaster at one site stays confined to that site.

External connectivity (Internet by default, private network as an option) lands in AZs a and b to offer highly available access at any time
Extended virtual network (Virtual Private Cloud and subnets) with low latency inside the region (less than 1.2 ms between AZs) to enable geo-distribution of workloads

This architecture gives users a native way to distribute their workloads across physical data centers over that network. The Virtual Private Cloud hides the underlying complexity, so any high-level design for a customer workload can be built on top of a Flexible Engine region.

How to evaluate a resilient approach for cloud workloads

On Flexible Engine, as in any public cloud model, resilience has to be defined at several levels:

Resilience	Definition	Owner
Cloud Service	The cloud service is designed to be resilient across multiple AZs within a region	Orange
Cloud Resource	Deployed cloud resources provide a level of resilience that must be identified: AZ > Regional > Global	Orange
Application workload	The application must be designed to take advantage of the two previous levels, in accordance with good practice	Users

Cloud Services resiliency within Paris region

Resiliency per service

NB: the inventory below is based on the Paris region; the Amsterdam region is being upgraded for availability.

Flexible Engine services fall into 3 categories, depending on their resiliency model:

Some services are natively resilient. Their design implements native High Availability, distributed between Availability Zones in the region, without any action from the user provisioning the cloud resource. For example, the ELB, which load-balances between ECS backends, is built on an HA configuration between AZ a and AZ b.
Some services offer resiliency as an option. The user can choose between a standalone mono-AZ configuration and an HA multi-AZ one (set up by default). In an HA configuration, the user can place the cloud resources that make up the service in several AZs. For example, the RDS service offers a Master/Slave configuration in one click, with the Master in one AZ and the Slave in another.
Basic Compute (ECS) and Storage (EVS) implement AZ-local HA (through the auto-recovery feature for ECS and the three-copy mechanism for EVS). Users can nevertheless leverage the extended virtual network to build their own workload and distribute compute nodes across AZs in the region. These services are considered enablers for workload resiliency.

	Provisioning	Item	Service	Resiliency: Enabler	Resiliency: Option	Resiliency: Native
1	Connection to the service	Management	IAM			√
2	Creation of the network base & associated services	Network	VPC			√
			ELB			√
			DC			√
			DNS			√
			NAT GW			√
3	Provisioning of computing resources and storage	Compute	ECS	√[1]
			BMS	√
			CCE		√[4]
		Storage	EVS	√[5]
4	Provisioning of shared storage resources	Storage	SFS	x[2]
			OBS		√[3]
			sDRS		√[1]
5	Provisioning of managed resources	Database	RDS		√[4]
		Database	DDS		√[4]
		Application	DCS		√[4]
		Application	DMS		√[4]
6	Adding security services	Security	Anti-DDoS			√
			KMS			√
			WAF			√

[1] Users must place each ECS in one AZ of the region. Within the AZ, ECS availability relies on auto-recovery (i.e. relocation and restart after a hypervisor failure, possible only for flavors without local disks, which excludes flavors such as the D series).

[2] SFS offers NAS features with different classes of storage:

SFS Capacity-Oriented: volumes are created in eu-west-Xa only, although a volume can be mounted on any ECS in the region thanks to the cross-AZ VPC/subnet feature
SFS Turbo: volumes can be created in any AZ of the region, with a local resiliency design.

[3] The OBS backend storage implements a three-copy mechanism that provides resiliency (see the EVS description below; the mechanism is identical). Different classes of storage are available:

Online storage (Standard & Warm) offers Multi-AZ resiliency (1 copy per AZ) and Mono-AZ resiliency (3 copies in AZ eu-west-Xa)
Offline storage (Cold) offers only Mono-AZ resiliency (3 copies in AZ eu-west-Xa)

[4] All these services offer resiliency as an option, with the cross-AZ configuration activated by default. Users can disable HA to save money (for test purposes, for example).

[5] EVS provides virtual disks in the AZ where the user places the ECS. As a consequence, EVS offers Mono-AZ resiliency with a three-copy mechanism; see EVS Three-Copy Redundancy for more detail.
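The inventory above can be captured as a small lookup, which makes it easy to list the services in a design whose cross-AZ resilience depends on the user. The sketch below is one possible transcription, not an official data set: the names `Resiliency`, `INVENTORY` and `needs_user_design` are illustrative, OBS is read as "Option" on the basis of footnote [3], and SFS is left out because its table entry is marked "x" rather than with a category.

```python
from enum import Enum


class Resiliency(Enum):
    ENABLER = "enabler"  # AZ-local HA; the user distributes across AZs
    OPTION = "option"    # multi-AZ HA selectable by the user
    NATIVE = "native"    # multi-AZ HA built in, no user action needed


# Transcribed from the Paris-region inventory table (assumption: OBS is
# read as "Option"; SFS is omitted because the table marks it with "x").
INVENTORY = {
    "IAM": Resiliency.NATIVE, "VPC": Resiliency.NATIVE,
    "ELB": Resiliency.NATIVE, "DC": Resiliency.NATIVE,
    "DNS": Resiliency.NATIVE, "NAT GW": Resiliency.NATIVE,
    "ECS": Resiliency.ENABLER, "BMS": Resiliency.ENABLER,
    "EVS": Resiliency.ENABLER,
    "CCE": Resiliency.OPTION, "OBS": Resiliency.OPTION,
    "sDRS": Resiliency.OPTION, "RDS": Resiliency.OPTION,
    "DDS": Resiliency.OPTION, "DCS": Resiliency.OPTION,
    "DMS": Resiliency.OPTION,
    "Anti-DDoS": Resiliency.NATIVE, "KMS": Resiliency.NATIVE,
    "WAF": Resiliency.NATIVE,
}


def needs_user_design(services):
    """Return the services whose cross-AZ resilience depends on the user."""
    return sorted(s for s in services if INVENTORY[s] is not Resiliency.NATIVE)


if __name__ == "__main__":
    print(needs_user_design(["ELB", "ECS", "RDS", "KMS"]))
```

For a design built on ELB, ECS, RDS and KMS, only ECS and RDS are returned: the other two are natively resilient and need no placement decision.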

Focus on backup

Service	Backup & Restore object	Functions	Data localisation	Cross Region Replication	Support OBS 3AZ
VBS [*]	EVS disk volume (system or data disk)	EVS disk-specific backup; backup data management; EVS disk restoration using data backups; EVS disk creation using backups	OBS Mono AZ in eu-west-Xa	Not planned [*]	Not planned [*]
CSBS [*]	ECS + all attached disks (system and data disks)	ECS-based backup; policy-driven data backup; data backup management; image creation using backups; all EVS disks on an ECS have consistent data	OBS Mono AZ in eu-west-Xa	Not planned [*]	Not planned [*]
CBR (CSBS/VBS evolution)	ECS + all attached disks (system and data disks), BMS, SFS Turbo	Cloud Server Backup: creating a backup policy, backup management, restoring data using a cloud server backup, sharing a backup (with other projects). Cloud Disk Backup: creating a backup policy, backup management, restoring data using a cloud disk backup, using a backup to create a disk. SFS Turbo Backup: creating a backup policy, backup management, restoring data using an SFS Turbo backup, using a backup to create an SFS volume	OBS Mono AZ in eu-west-Xa for any vault created before 2022Q4; OBS Multi AZ for any vault created after 2022Q4	Live	Live for any vault created after 2022Q4
FAB	Files, DB, applications (Exchange, AD, Office 365, etc.)	CommVault-based solution; backup management; backup of on-premises or other CSP data; granular restoration (file or application level)	OBS Multi AZ (eu-west-0 or eu-west-1)	Live	Live

[*] CBR is the next-generation service that federates all backup services: CSBS will be replaced by CBR Server Backup and VBS by CBR Volume Backup. Good practice is therefore to use CBR for any new workload.
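The CBR rule in the table (mono-AZ OBS for vaults created before 2022Q4, multi-AZ OBS for vaults created after) can be expressed as a small check when auditing existing vaults. This is a hedged sketch: the function name is illustrative, and because the exact release date is not given here, vaults created during the quarter itself are reported as undetermined rather than guessed.

```python
from datetime import date

# Calendar boundaries of 2022Q4. The exact release date is not stated in
# this article, so the quarter itself is treated as undetermined
# (assumption of this sketch).
Q4_2022_START = date(2022, 10, 1)
Q4_2022_END = date(2022, 12, 31)


def cbr_vault_storage(created: date) -> str:
    """Return the OBS backend expected for a CBR vault created on `created`."""
    if created < Q4_2022_START:
        return "OBS mono-AZ"
    if created > Q4_2022_END:
        return "OBS multi-AZ"
    return "undetermined, check the roadmap"


if __name__ == "__main__":
    for day in (date(2021, 6, 1), date(2022, 11, 15), date(2023, 3, 1)):
        print(day.isoformat(), "->", cbr_vault_storage(day))
```

A vault reported as mono-AZ is a candidate for re-creation if multi-AZ storage of the backups is required.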

Tips & Good practice

[1] ECS/EVS/sDRS

ECS instances and their EVS volumes are provisioned in a given AZ, with AZ-local resiliency (ECS auto-recovery and distributed block storage; see EVS Three-Copy Redundancy for more detail)
Use a multi-AZ Auto Scaling Group whenever possible for workloads hosted on ECS, with ELB integration to manage traffic
sDRS is only available on eu-west-0a and eu-west-0b. Depending on the backup service you use (if backups are stored on mono-AZ OBS in eu-west-0a, see above), the sDRS master must be provisioned on eu-west-0b
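The sDRS placement rule above can be sketched as a small helper. The function and constant names are illustrative, and the comment on the rationale (keeping the production side out of the AZ that holds the backups) is a reading of the rule, not a statement taken from the service documentation.

```python
SDRS_AZS = ("eu-west-0a", "eu-west-0b")  # AZs where sDRS is available
MONO_AZ_BACKUP_AZ = "eu-west-0a"         # AZ holding mono-AZ OBS backups


def sdrs_master_candidates(backups_on_mono_az_obs: bool) -> list[str]:
    """Return the AZs in which the sDRS master may be provisioned."""
    if backups_on_mono_az_obs:
        # Presumed rationale: keep the master out of the AZ that also
        # stores the backups, so one AZ failure does not affect both.
        return [az for az in SDRS_AZS if az != MONO_AZ_BACKUP_AZ]
    return list(SDRS_AZS)


if __name__ == "__main__":
    print(sdrs_master_candidates(backups_on_mono_az_obs=True))
    print(sdrs_master_candidates(backups_on_mono_az_obs=False))
```

With mono-AZ backups the only candidate is eu-west-0b, which leaves eu-west-0a for the slave.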

[2] SFS/SFS Turbo

SFS Turbo is available in every AZ (SFS is only available on eu-west-0a), but it is only locally resilient even though it is visible from every AZ (via NFS) [roadmap evolution & workaround provisioning in progress].

[3] OBS

OBS is available in Mono-AZ and Multi-AZ for the online classes (Standard & Warm), while the offline class (Cold) is Mono-AZ only [roadmap evolution].

It is recommended to migrate from Mono-AZ OBS to Multi-AZ OBS using the obsutil tool. See the obsutil online help to download and set up the tool and to replicate a bucket.

The CSBS & VBS backup services (and any backup feature integrated with managed services, RDS backup for example) still use mono-AZ OBS (on Xa). As they will soon be deprecated, prefer CBR, but be careful to (re)create the backup vault after the release date (check the roadmap). If you are using CSBS/VBS, refer to Migrating Resources from CSBS/VBS to migrate to CBR.

[4] Managed services (CCE, RDS, DDS, DCS, etc…)

Enable the HA functionality, paying attention to placement:

Master/Slave cluster services (RDS type): master on eu-west-0b or 0c

Services with quorum (CCE, DDS, DCS type): node distribution across the 3 AZs
Use a multi-AZ Auto Scaling Group whenever possible for workloads hosted on CCE
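To illustrate why quorum-based services are spread over the 3 AZs, the sketch below distributes nodes round-robin and checks whether a majority survives the loss of the most populated AZ. It assumes that "quorum" means a strict majority of nodes, which is the usual definition but is not spelled out here; the function names are illustrative.

```python
from collections import Counter
from itertools import cycle, islice

AZS = ("eu-west-0a", "eu-west-0b", "eu-west-0c")


def spread_nodes(node_count: int, azs=AZS) -> Counter:
    """Assign nodes to AZs round-robin and return the node count per AZ."""
    return Counter(islice(cycle(azs), node_count))


def survives_one_az_loss(node_count: int, azs=AZS) -> bool:
    """True if a strict majority remains after losing the fullest AZ."""
    placement = spread_nodes(node_count, azs)
    remaining = node_count - max(placement.values(), default=0)
    return remaining > node_count / 2


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, dict(spread_nodes(n)), survives_one_az_loss(n))
```

With three or five nodes a majority survives the loss of any single AZ, whereas four nodes split 2/1/1 do not, which is one reason odd node counts are usually preferred for such clusters.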

User workload (Paris region as example)

Legacy-to-cloud workload

Characteristics

IaaS
Non-resilient design by default

Solution

Backup
sDRS

Recommendation

Workload on AZ eu-west-0b or eu-west-0c
Backup with CBR or FAB (OBS Multi AZ)
sDRS with master on AZ eu-west-0b & slave on eu-west-0a
Enable a multi-AZ Auto Scaling Group whenever possible for workloads hosted on ECS

Data in the Cloud

Characteristics

Massive data storage

Solution

OBS S3 object storage

Recommendation

Use OBS Multi-AZ (except for the Cold storage class, which is Mono-AZ only)
Use lifecycle rules to optimize the storage class
Implement bucket replication for cross-region replication to Amsterdam
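The OBS recommendation can be checked with a small model of the storage classes and their resiliency modes, as described in this article (1 copy per AZ in Multi-AZ, 3 copies in AZ "a" in Mono-AZ). The sketch is illustrative only: the names are hypothetical and it does not call any OBS API.

```python
# Resiliency modes per OBS storage class, as listed in this article.
OBS_MODES = {
    "Standard": {"mono-AZ", "multi-AZ"},
    "Warm": {"mono-AZ", "multi-AZ"},
    "Cold": {"mono-AZ"},
}


def copy_placement(storage_class: str, mode: str,
                   region: str = "eu-west-0") -> dict[str, int]:
    """Return the number of copies held in each AZ for a bucket."""
    if mode not in OBS_MODES[storage_class]:
        raise ValueError(f"{storage_class} does not support {mode}")
    if mode == "multi-AZ":
        return {f"{region}{az}": 1 for az in "abc"}  # one copy per AZ
    return {f"{region}a": 3}  # three copies in AZ "a"


if __name__ == "__main__":
    print(copy_placement("Standard", "multi-AZ"))
    print(copy_placement("Cold", "mono-AZ"))
```

Requesting Multi-AZ for the Cold class raises an error, which mirrors the exception noted in the recommendation.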

Cloud Native workload

Characteristics

Massive use of Infrastructure as Code via CI/CD [Terraform provider]
Data localisation

Solution

Network services (ELB, NATGW, etc.)
Managed services CCE, RDS, DDS, etc.

Recommendation

Follow the managed services' best practices for Multi-AZ HA deployment
Master/Slave cluster services (RDS type): master on eu-west-0b or 0c
Services with quorum (CCE type): node distribution across the 3 AZs
Data backup on Multi-AZ OBS
Enable CCE autoscaler add-on
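In a CI/CD pipeline, the recommendations above could be turned into an automated review step that runs before deployment. The sketch below shows one way to do this; `Plan` is a hypothetical description of a deployment invented for this example, not a Flexible Engine or Terraform data structure.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Hypothetical deployment description (not a Flexible Engine API)."""
    rds_master_az: str
    cce_node_azs: list[str] = field(default_factory=list)
    backup_obs_multi_az: bool = False


def review(plan: Plan) -> list[str]:
    """Return one finding per recommendation the plan does not follow."""
    findings = []
    if plan.rds_master_az not in {"eu-west-0b", "eu-west-0c"}:
        findings.append("RDS master should be on eu-west-0b or eu-west-0c")
    if len(set(plan.cce_node_azs)) < 3:
        findings.append("CCE nodes should be spread over the 3 AZs")
    if not plan.backup_obs_multi_az:
        findings.append("Backups should target Multi-AZ OBS")
    return findings


if __name__ == "__main__":
    plan = Plan(rds_master_az="eu-west-0a",
                cce_node_azs=["eu-west-0a", "eu-west-0b"])
    for finding in review(plan):
        print("-", finding)
```

An empty list means the plan follows the three placement recommendations; the check does not cover the CCE autoscaler add-on.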