Site Reliability Engineer (SRE)

Job Description

Site Reliability Engineer (Observability)

London - Hybrid / 3 Days

Contract Inside IR35 - 6 Months initially


We’re looking for a Site Reliability Engineer (SRE) to join our client to build and maintain observability systems and to ensure their core services remain reliable, scalable, and high-performing.


Responsibilities:

  • Deploy and manage observability tools using a Prometheus-like metrics store and Grafana Enterprise.
  • Automate monitoring, alerting, and incident response.
  • Build Grafana dashboards for system insights.
  • Apply Infrastructure as Code (IaC) principles.
  • Develop tooling in Golang or Python.
  • Advocate for SRE principles like SLOs, SLIs, and error budgets.
  • Integrate monitoring with incident management workflows.


Requirements:

  • SRE principles and reliability engineering expertise.
  • Solid familiarity with Linux.
  • Strong experience building and deploying containers using Podman or Docker.
  • Golang or Python for automation and API integration.
  • Experience with Grafana, VictoriaMetrics, and PromQL.
  • Experience deploying and managing centralized logging solutions.
  • Strong Infrastructure as Code (IaC) knowledge.


Nice to Have:

  • OpenTelemetry experience.
  • Terraform, Ansible, or CI/CD knowledge.
  • Background in datacentre and compute hardware services.
  • AWS infrastructure configuration and deployment.
  • Familiarity with Kubernetes and cloud- systems.
  • Incident response automation expertise.

London, UK
Full time

Published on 04/04/2025
