Site Reliability Engineer
Purpose: To help drive quality outcomes for clients and enable engineering teams by providing capabilities focused on the supportability and reliability of clients' software. The role ensures that engineering and operationally focused teams work seamlessly as one combined, end-to-end product engineering team. The role must demonstrate an ability to work with a large global product engineering team supporting a complex mix of software, clients and delivery outcomes. Must have experience with AWS services, Terraform, monitoring tools and configuration management tools.

Accountabilities & Deliverables
- Approaches operations as a software problem and develops solutions to eliminate toil.
- Improves the reliability, availability, quality and deployment frequency of client products.
- Designs and builds software and systems to manage platform infrastructure and applications.
- Runs, monitors and observes production applications and improves the overall lifecycle of reliability services.
- Coaches and mentors other SRE team members.
- Works closely with engineering and product teams to define and align SLIs/SLOs to business needs and to troubleshoot application and infrastructure issues.
- Optimises the on-call and escalation process as part of an incident management system.
- Finds better ways of doing things, with a view to the future and to technology engineering trends.
Core Skills, Knowledge and Attributes
Working knowledge of modern software and technology:
- CI/CD processes and tools such as Buildkite, Jenkins, Azure DevOps or any similar tool.
- Experience in at least one programming language such as Java, Node.js, Golang or C#.
- Strong scripting skills in Bash and/or Python.
- Strong knowledge of AWS services such as VPC, EC2, S3, RDS and ECS.
- Expertise in IaC patterns and tools such as Terraform or CloudFormation.
- Strong knowledge of configuration management tools such as Ansible, Chef or Puppet.
- Expertise in Docker container management and container orchestration platforms such as ECS, AKS, EKS, GKE or native Kubernetes.
- Knowledge of databases (e.g. Postgres, SQL Server, Oracle or NoSQL DBs).
- Experience in application monitoring and relevant monitoring tools such as Datadog, New Relic, Dynatrace or AppDynamics.
- Experience in defining SLIs that help the team meet availability and latency objectives.
- Engineering practices: availability, reliability, scalability and disaster recovery.

- Is a modern thinker who looks to the future while ensuring practical, commercial outcomes are achieved in the present, without settling for the status quo.
- Always communicates relevant, valuable information positively and confidently, both internally and externally.
- Has an ability to relate effectively and positively with all people at all levels.
- Creates loyalty, trust and a following.
- Has a combination of personality traits: smart, innovative, low ego, collaborative, honest, of high integrity, intensity and passion.
- Capable of contributing to broader business conversations beyond operational engineering and technology.
- Solicits the involvement of others to build a sense of ownership and engagement.
- Must have the confidence to act quickly and decisively.
- Can define a delivery plan and identify and propose any supporting budget.
- Can empathise with people and clients appropriately and use that empathy in effective decision-making.
- Can effectively lead and engage with remote teams.
- Proven ability to navigate ambiguity and collaborate with other functional leaders to provide great outcomes.