Managed Cloud Native Services on GCP

The Customer’s business applications deployed on GCP depend on GCP Cloud Native Services (IaaS, PaaS). Orange Business Services provides the managed services necessary to ensure service assurance and change management for those dependencies, as well as the configuration and deployment needed to build and recover them.

Description

Orange Business Services provides technical operation and monitoring of the Customer’s native GCP services, as well as optimization and upgrade activities, through a network interconnection between the Orange Business Services “service area” and the third-party cloud provider’s IaaS platform.

The cloud native services

Cloud native services typically fall into three categories:

The user plane services: if a business application depends on such a service, a defect in the service is likely to affect the application. The service holds no persistent data, so recovery does not require a data restore.
The data services: if a business application depends on a data service, a defect in the service is likely to affect the application. The service holds persistent data, so recovery may require a data restore. Data loss or data corruption may also affect the business application.
The other services: the business application does not depend on them. Most of these services are used for automation, observation or migration, and their loss is not likely to affect the business application. Some are used to manage the user plane and data services of the business application; others have specific uses, for which a scope of work shall be established should the Customer require OBS to leverage them as part of the managed service.
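The three categories can be summarized as a small lookup, sketched here in Python. The enum, the table and the function names are illustrative assumptions and not part of any OBS or GCP API. The sketch only encodes what the descriptions above state: whether the business application depends on the category, and whether recovery may involve a data restore.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    USER_PLANE = "user plane"
    DATA = "data"
    OTHER = "other"


@dataclass(frozen=True)
class Traits:
    app_depends_on_it: bool    # a defect is likely to affect the business application
    has_persistent_data: bool  # recovery may require a data restore


# Hypothetical table mirroring the three categories described above.
# Persistence of "other" services is not stated in the text; False is assumed here.
TRAITS = {
    Category.USER_PLANE: Traits(app_depends_on_it=True, has_persistent_data=False),
    Category.DATA: Traits(app_depends_on_it=True, has_persistent_data=True),
    Category.OTHER: Traits(app_depends_on_it=False, has_persistent_data=False),
}


def may_need_data_restore(category: Category) -> bool:
    """Tell whether recovering a service of this category may involve a data restore."""
    return TRAITS[category].has_persistent_data


def defect_likely_affects_app(category: Category) -> bool:
    """Tell whether a defect in this category is likely to affect the business application."""
    return TRAITS[category].app_depends_on_it
```

With this table, only the data category reports that a restore may be needed, and only the "other" category reports that the business application is unlikely to be affected.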

Tasks involved in Cloud Native service management

The tasks involved in managing a cloud native service depend on the service. They consist of:

Configuring and deploying the service: Infrastructure as Code is used to configure the service, its observability and its backup. Level 3 expertise on the service ensures proper implementation in line with the scope of work (refer to the detailed description of the build and SRE services).
Applying the security group and access control policy defined by the customer.
Service recovery through Infrastructure as Code: in case of failure, most services need to be recovered by redeployment. Re-configuring the service manually from scratch is not an efficient option: it takes time and is error-prone. This is why recovery by redeployment from Infrastructure as Code is preferred.
Supervision and remedial: this consists of watching for alarms raised on the service during the monitoring range (typically 8×5 or 24×7). When an alarm occurs, an incident ticket is raised, a priority is assigned and the Customer is notified. Remedial action is then taken following the procedures made available to Level 2 / Level 1 by Level 3. Remedial action on a cloud native service may be necessary to restore the service of the business application. Should the procedure not remedy the incident, the incident is escalated to Level 3. Should the root cause lie with the CSP itself, Level 3 raises the incident to the CSP.
Backup and restore: if the service has persistent data, that data must be backed up. The managed service consists of configuring the backup solution and monitoring that it runs properly. Note: the backup solution has to be subscribed to separately, e.g. GCBDR. Restoring the service after an incident may involve restoring the data from a backup.
OS patching and anti-virus: keeping the OS up to date and virus-free is a managed service for Managed Virtual Machine / Managed OS. Please refer to the detailed description.
Specifics: some cloud native services may have specific configuration or management tasks.
Business application specifics: by default, standard alerts are watched. The configuration of alerts and logs on a cloud native service that are specific to a business application is subject to a specific scope of work.
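The supervision and remedial flow described in the list above can be sketched as a short function. This is a minimal illustration in Python under stated assumptions: the Incident class, the handle_alarm function and the log wording are hypothetical and do not represent any OBS ticketing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    service: str
    priority: str
    log: list = field(default_factory=list)


def handle_alarm(service, priority, procedure_fixed_it, root_cause_is_csp):
    """Walk one alarm through the supervision and remedial steps."""
    # An alarm raises a ticket, a priority is assigned, the Customer is notified.
    incident = Incident(service, priority)
    incident.log.append("ticket raised")
    incident.log.append("customer notified")

    # Level 2 / Level 1 apply the procedure provided by Level 3.
    incident.log.append("procedure applied by Level 2 / Level 1")
    if procedure_fixed_it:
        incident.log.append("resolved")
        return incident

    # The procedure did not remedy the incident: escalate to Level 3.
    incident.log.append("escalated to Level 3")
    if root_cause_is_csp:
        # Level 3 raises the incident to the cloud service provider.
        incident.log.append("raised to CSP by Level 3")
    return incident
```

For example, calling handle_alarm with procedure_fixed_it=False and root_cause_is_csp=True produces a log that ends with the escalation to Level 3 followed by the incident being raised to the CSP.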

General pre-requisites to the run of managed services

The following pre-requisites are necessary to all managed services:

The Customer shall have defined a valid architecture. (OBS can optionally provide Professional Services for architecture definition).
The Customer shall have a valid subscription to GCP, including a GCP Support plan, and shall procure the GCP resources. OBS can optionally supply this subscription inclusive of GCP support (refer to the Multi-Cloud Ready offer for GCP); however, the subscription, the IaaS resources and the GCP support are not part of the Managed Services. The Managed Services will leverage this support contract to escalate incidents to GCP as the CSP.
The Customer’s GCP platform shall be organized according to the best practices of GCP’s landing zone, or shall offer comparable services.
OBS proposes a default RACI depending on the class of transition and the resource managed. As a pre-requisite to the project, OBS and the Customer shall agree on the RACI.
OBS and the Customer shall agree on the tooling used for Git, the CI / CD chain, and the Monitoring, Logging and Alerting solution.
Additional pre-requisites apply when the transition is not entirely the responsibility of OBS (not Full Build; refer to the Build chapter of the document).

In the case of a Fully Managed service, OBS uses its own Git, CI / CD chain, and Monitoring, Logging and Alerting solution.

In the case of a Co-managed service, OBS and the Customer agree on the Git, CI / CD chain, and Monitoring, Logging and Alerting solution to be used. By default, the tooling is:

Either based on GCP tools i.e. Google Cloud Deployment Manager, Google Cloud Monitoring
Or based on generic multi-cloud tooling proposed by OBS e.g. CaasCad (Prometheus, Grafana, …)

This tooling is not included in the Managed Applications work units and can be purchased separately, either as part of the GCP subscription or as a multi-cloud tooling proposal made by OBS.

Criteria for the run of a managed cloud native service component

The following criteria shall be met, and approved by Level 2, before a cloud native component becomes an active managed service (i.e. Run) operated by Level 2 / Level 1. The owner of the Build and of the Level 3 support is responsible for making sure that the criteria are met:

The architecture and deployment of the service shall be defined.
The service shall be deployed through Infrastructure as Code and tested prior to transitioning to the run team. Typically, this means successful testing in a pre-production environment identical to production. Note: IaC is necessary to recover the services in case of major failure.
The use of the service shall be explained to the operations team.
The security policies and access control shall have been configured.
Access to the service shall have been configured for the OBS Level 2 teams.
The service shall export the necessary metrics towards Google Cloud Monitoring.
The data backup shall be configured in GCBDR when backup is applicable.
The disaster recovery shall be configured when applicable.
The troubleshooting and service restoration procedures shall be provided to Level 2.
Where a procedure requires logs or dashboards, those shall have been developed and deployed prior to the transfer to the run phase.
A remedial procedure on incident shall not last more than 15 minutes. Beyond that time, the effort is charged on a time basis.
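The criteria above lend themselves to a simple checklist gate before the hand-over to Run. The following Python sketch is illustrative only: the criterion identifiers paraphrase the list above, and the function names are assumptions, not an OBS tool.

```python
# Criteria paraphrased from the list above; the identifiers are illustrative.
RUN_CRITERIA = (
    "architecture_and_deployment_defined",
    "deployed_with_iac_and_tested",
    "usage_explained_to_operations",
    "security_and_access_control_configured",
    "level2_access_configured",
    "metrics_exported_to_cloud_monitoring",
    "backup_configured_if_applicable",
    "disaster_recovery_configured_if_applicable",
    "procedures_provided_to_level2",
    "logs_and_dashboards_deployed_if_required",
    "remedial_procedure_within_15_minutes",
)


def missing_criteria(met):
    """Return the criteria not yet met, in checklist order."""
    return [c for c in RUN_CRITERIA if c not in met]


def ready_for_run(met):
    """Level 2 can approve the transition only when nothing is missing."""
    return not missing_criteria(met)
```

For instance, if every criterion except level2_access_configured is met, missing_criteria returns a list containing only that identifier and ready_for_run returns False.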