Site Reliability Engineer - Digital SaaS Platform, .NET, Cloud, CI/CD, Infrastructure, Scalability

TSP Group



Site Reliability Engineer - Digital SaaS Platform, .NET, Cloud, CI/CD, Infrastructure, Scalability - London/Hybrid - Perm - 65k plus benefits

My client, a global e-commerce company, is seeking to recruit an experienced SRE/Web Operations Engineer to join their team. This is an exciting time to join: the company is expanding, so there is real opportunity to progress.
In this role you will be responsible for the reliability, scalability and performance of the company's digital platform and infrastructure. You will lead a small team of engineers and assist in the management of the company's external Azure Managed Service Providers (MSPs).
Reporting to the Head of Engineering, you will be responsible for improving and optimising the Azure platform. The remit covers incident and problem management, disaster recovery (DR), release management, managing observability tools (Datadog), and improving the developer experience and tooling. The role would be ideal for a .NET developer who wants to get more involved in cloud/DevOps.

Duties include:

  • SRE Strategy and Vision: Define and implement the overall strategy for SRE to align with organisational goals, balancing reliability, scalability and development velocity.
  • Service Uptime: Ensure systems and services meet agreed service level agreements (SLAs) and service level objectives (SLOs) for uptime and performance.
  • Incident Management: Lead efforts to establish effective incident response protocols, including detection, triage, resolution, and post-incident reviews.
  • Disaster Recovery: Oversee the development and testing of disaster recovery plans and procedures.
  • Infrastructure as Code: Drive adoption and best practices for automation, ensuring repeatability and consistency in infrastructure provisioning.
  • CI/CD Pipeline Optimisation: Ensure seamless integration and delivery pipelines to support development and deployment at scale.
  • Observability: Ensure comprehensive monitoring, logging, and alerting systems are in place to track the health and performance of systems.
  • Incident Resolution: Lead and coordinate major incident responses, ensuring swift recovery while minimising impact.
  • Root Cause Analysis: Oversee post-mortem processes to identify root causes, document lessons learned, and implement preventive measures.

We are looking for candidates with experience in the following:

  • Ideally, a background in .NET development, as you will be resolving incidents and working with the developers to fix code problems
  • SRE/Web operations engineering experience
  • Experience of working with third-party infrastructure management (Azure MSPs)
  • Experience with .NET technology - ideally
  • Experience of working with large-scale codebases and platforms
  • SRE Practices and Principles
  • Automation and Tooling
  • Monitoring & Observability
  • Performance Optimisation
  • Incident & DR Management
  • Proven experience in scaling infrastructure
  • Excellent communication skills, both verbal and written
  • Strong organisational expertise and the ability to effectively multi-task
  • Strategic thinker
  • Data-driven decision maker
  • Security & Compliance - ideally
  • Cloud Native Architectures (e.g. Kubernetes, Docker) - ideally
  • Cloud Certifications - ideally

Excellent benefits, training and career progression.


Company description

Job offer details

05 Feb 2025

Job location

Job type

Full time

Job category

Information Technology, Telecommunications

Monthly salary