Site Reliability Engineer (SRE)
Job Description
Site Reliability Engineer (Observability)
London - Hybrid / 3 Days
Contract, Inside IR35 - 6 Months initially
We’re looking for a Site Reliability Engineer (SRE) to join our client, building and maintaining observability systems and ensuring that core services remain reliable, scalable, and high-performing.
Responsibilities:
- Deploy and manage observability tools using a Prometheus-like metrics store and Grafana Enterprise.
- Automate monitoring, alerting, and incident response.
- Build Grafana dashboards for system insights.
- Apply Infrastructure as Code (IaC) principles.
- Develop tooling in Golang or Python.
- Advocate for SRE principles like SLOs, SLIs, and error budgets.
- Integrate monitoring with incident management workflows.
Requirements:
- SRE principles and reliability engineering expertise.
- Solid familiarity with Linux.
- Strong experience in building and deploying containers using Podman or Docker.
- Golang or Python for automation and API integration.
- Experience with Grafana, VictoriaMetrics, and PromQL.
- Experience deploying and managing centralized logging solutions.
- Strong Infrastructure as Code (IaC) knowledge.
Nice to Have:
- OpenTelemetry experience.
- Terraform, Ansible, or CI/CD knowledge.
- Background in datacentre and compute hardware services.
- AWS infrastructure configuration and deployment.
- Familiarity with Kubernetes and cloud-native systems.
- Incident response automation expertise.