The Service Reliability Engineer (SRE) is a key actor in managed services on public clouds, especially in the DevOps and co-managed models.
The SRE is a named expert in both operations and software engineering who takes part in the run of the managed service within the OBS operations team while working closely with the customer’s development team.
The SRE works closely with the customer’s development team to identify and implement the observability indicators, operations automation and infrastructure as code needed to meet business needs. The SRE contributes this expertise to the development team so that it can deliver the enablers necessary for a reliable run. In the longer term, the SRE contributes to the continuous improvement of the reliability of the business application and its operations.
The SRE takes part (remotely) in regular meetings with application owners to align on continuous improvement.
Deliverables
The SRE contributes together with the development team to the following deliverables:
- Guidelines for DevOps automation (Infra as Code, Integration, Blue-Green deployment, etc.) adapted to the maturity of the customer’s team
- Infra as Code necessary to deploy / redeploy the resources in case of service loss or misconfiguration
- Identification and implementation of observability metrics necessary to monitor the business activity
- Definition and management of SLOs and SLIs
- Implementation of automated dashboards allowing analysis of metrics and trends, with advice on the tooling used to implement them
- Identification of alarms / thresholds on metrics and of the alarm collection mechanism
- Identification of necessary backup procedures and security measures for the application and data to meet the customer’s needs
- Write-up of the main procedures necessary to handle known incidents. These procedures will be handed off to the level 1 & 2 core operation teams.
- Simple procedures are typically integrated in the infrastructure as code to accelerate the remedial actions.
- Review / validation of technical procedures for the changes proposed by the Service Delivery Manager for inclusion in the change catalogue.
- Identification and implementation of log collection to detect anomalies and ease troubleshooting for the business application. Setup of automated correlations and alerts from log analysis.
- Cold analysis of dashboards and logs for preventive maintenance when requested.
- Configuration of security tooling and SIEM.
- Definition and write-up of recurring check procedures when necessary.
- Criteria for “go” to pre-production. RACI between the customer and OBS for the deployment to pre-production. Automation of deployment if requested.
- Criteria for “go” to production taking into consideration technical and business constraints (deployment time, particular events, etc.). RACI between the customer and OBS for the deployment to production. Automation of deployment if requested.
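As an illustration of the SLO / SLI and alarm threshold deliverables listed above, the following is a minimal sketch of how an SLI and the remaining error budget could be computed from request counts. The `Slo` class, the function names, the 99.9 % target and the 25 % alert threshold are all hypothetical and chosen for illustration only; actual indicators and thresholds are agreed with the customer’s development team.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """Hypothetical SLO: target ratio of good events over a time window."""
    name: str
    target: float  # e.g. 0.999 means 99.9 % of requests must succeed


def sli(good_events: int, total_events: int) -> float:
    """SLI as the ratio of good events to total events (1.0 if no traffic)."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo.target) * total_events
    actual_failures = total_events - good_events
    if allowed_failures == 0:
        # No traffic, or a 100 % target: any failure exhausts the budget.
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


def should_alert(slo: Slo, good_events: int, total_events: int,
                 min_budget: float = 0.25) -> bool:
    """Raise an alarm when less than `min_budget` of the budget remains."""
    return error_budget_remaining(slo, good_events, total_events) < min_budget


# Example usage with assumed figures for a 30-day window.
availability = Slo(name="checkout-availability", target=0.999)
print(sli(999_400, 1_000_000))
print(error_budget_remaining(availability, 999_400, 1_000_000))
print(should_alert(availability, 999_100, 1_000_000))
```

Expressing the alarm as a share of the remaining error budget, rather than as a raw error count, keeps the threshold meaningful when traffic volume changes.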
Limitations
- The architecture is not the responsibility of the SRE; it is the responsibility of the customer or of an architect, i.e. the Technical Design Authority (TDA). As such, the design of the disaster recovery is not in the mandate of the SRE but in that of a TDA.
- The build and design of the architecture, including disaster recovery, HLD and LLD, are the responsibility of the customer or of the TDA. Following their validation by OBS, the SRE maintains the architecture, HLD and LLD during the RUN phase and identifies the updates needed in terms of fault tolerance, auto-healing, resilience and reliability to meet new business needs.
Implications
The customer’s development team decides jointly with the SRE which improvements to make to the management and reliability of the application. Such decisions could lead to additional use of cloud services and to an additional scope of work for the operations team (additional business functions monitored, additional resources monitored, additional backup routines), which may lead to additional monthly recurring charges.