Job Description

improvement of our core reliability engineering practices. This role requires deep, practical experience in SRE, including incident response, monitoring, automation, service level management, and infrastructure reliability at scale.

As an SRE Lead, you will be accountable for the end-to-end reliability of production systems, leading a team of SREs, working cross-functionally with engineering and operations, and shaping a culture of proactive resilience, performance, and availability.

Key Responsibilities

Lead the architecture, implementation, and continuous improvement of reliable, scalable systems in cloud environments.
Own and define SRE best practices, including SLAs, SLOs, SLIs, error budgets, incident response, and root cause analysis.
Build and maintain automated monitoring, alerting, and observability systems using Prometheus, Grafana, Datadog, or similar.
Drive automation-first strategies using Infrastructure-as-Code tools (Terraform, Ansible) to eliminate manual operations.
Optimize and manage CI/CD pipelines to support fast, safe, and reliable deployments.
Lead production readiness reviews and ongoing health checks to identify and remediate risks before they impact users.
Champion incident management processes, including on-call rotation leadership, playbook development, and blameless postmortems.
Guide and mentor a team of SREs, fostering a culture of reliability, ownership, and engineering excellence.
Collaborate closely with development teams to ensure services are designed with reliability and operability in mind.

Required Skills & Qualifications

Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
8+ years of experience in Site Reliability Engineering with a strong hands-on track record.
Proven experience leading SRE teams in high-availability, production-critical environments.
Deep knowledge of cloud platforms (AWS, Azure, or GCP) and Kubernetes-based infrastructure.
Strong command of observability stacks: Prometheus, Grafana, OpenTelemetry, Datadog, New Relic, or Splunk.
Hands-on expertise with scripting and automation using Python, Go, or Bash.
Solid experience in Infrastructure-as-Code (IaC) and configuration management (Terraform, Ansible, CloudFormation).
Strong understanding of networking, distributed systems, system performance, and fault tolerance.
Experience managing on-call rotations and implementing scalable incident response and escalation processes.
Excellent communication skills with the ability to influence across engineering, operations, and executive teams.

Preferred Qualifications

Certifications such as AWS Solutions Architect, Google Professional Cloud Architect, or CKA (Certified Kubernetes Administrator).
Experience with chaos engineering, disaster recovery, and capacity planning.
Background in software engineering with a strong focus on building reliable, scalable services.
Demonstrated ability to scale SRE practices in fast-growing, production-intensive environments.

Company Description

Please visit our site nobletechies.com.

Site Reliability Engineer LEAD
