Job Description

Site Reliability Engineer (Observability)

London - Hybrid / 3 Days

Contract Inside IR35 - 6 Months initially

We’re looking for a Site Reliability Engineer (SRE) to join our client to build and maintain observability systems and to ensure their core services remain reliable, scalable, and high-performing.

Responsibilities:

Deploy and manage observability tools using a Prometheus-like metrics store and Grafana Enterprise.
Automate monitoring, alerting, and incident response.
Build Grafana dashboards for system insights.
Apply Infrastructure as Code (IaC) principles.
Develop tooling in Golang or Python.
Advocate for SRE principles like SLOs, SLIs, and error budgets.
Integrate monitoring with incident management workflows.

Requirements:

SRE principles and reliability engineering expertise.
Solid familiarity with Linux.
Strong experience in building and deploying containers using Podman or Docker.
Golang or Python for automation and API integration.
Experience with Grafana, VictoriaMetrics, and PromQL.
Experience deploying and managing centralized logging solutions.
Strong Infrastructure as Code (IaC) knowledge.

Nice to Have:

OpenTelemetry experience.
Terraform, Ansible, or CI/CD knowledge.
Background in datacentre and compute hardware services.
AWS infrastructure configuration and deployment.
Familiarity with Kubernetes and cloud systems.
Incident response automation expertise.
