Site Reliability Engineer

Position Description:
As a Site Reliability Engineer (SRE) within our team, you will play a key role in ensuring the reliability, scalability, and performance of our systems and infrastructure. Reporting to the OCC, ITSM & ServiceNow Manager, you will collaborate closely with cross-functional teams to implement best practices, automate processes, and proactively monitor our systems to maintain optimal uptime and a satisfactory user experience.

Your future duties and responsibilities:

SITE SREs will improve operational standards and update documentation.
• Evaluate current operational practices and identify areas for improvement.
• Develop and implement standardized processes and procedures to enhance efficiency and effectiveness.
• Maintain up-to-date documentation in Confluence (KB, FEX, etc.).

SITE SREs will collaborate with DevOps teams to create a robust CI/CD pipeline for fully automated application and platform deployment.
• Design and architect a Continuous Integration/Continuous Deployment (CI/CD) pipeline to automate the build, test, and deployment processes.
• Implement tools and technologies such as Jenkins, GitLab CI/CD, or similar to streamline the pipeline.
• Integrate automated testing frameworks to ensure code quality and reliability throughout the deployment pipeline.
• Serve as the primary point of contact for code deployments.

SITE SREs will take ownership of, manage, and enhance the release process, focusing on scalability, efficiency, and quality.
• Lead the planning, coordination, and execution of up-to-date releases across multiple products and environments.
• Continuously monitor, improve, and validate release processes based on feedback and metrics.

SITE SREs will provide support for regular production updates and Job AppWorks corrections.
• Coordinate with development teams to prioritize and schedule production and maintenance updates.
• Execute deployment plans and verify successful updates while minimizing downtime and impact on users.
• Troubleshoot and resolve critical issues with job execution, including errors, failures, and unexpected behavior.
• Analyze job execution logs and metrics to identify errors, failures, or performance bottlenecks.
• Reduce the number of redundant, duplicate, or unused alerts and take part in alert optimization.

SITE SREs will provide on-call support and incident handling.
• Participate in an on-call rotation to provide 24/7 support for production systems, responding to alerts and incidents in a timely manner.
• Document incident response procedures and lessons learned for continuous improvement.
• Monitor system health and respond promptly to incidents, escalating as necessary for resolution.

SITE SREs will be responsible for validation and sanity checks.
• Perform post-PPR and production deployment sanity checks to ensure system stability and functionality.
• Use both manual and automated checks to validate the integrity and coherence of the deployed code and configurations.
• Document and report any issues discovered during the validation process for further investigation and resolution.

SITE SREs will be responsible for ServiceNow ticket handling.
• Monitor, prioritize, and manage ServiceNow tickets according to defined SLAs and operational priorities.
• Assign tickets to appropriate teams or individuals for resolution and ensure timely follow-up and closure.
• Maintain accurate records and documentation within the ServiceNow platform.

SITE SREs will be responsible for capacity planning and security alert prioritization.
• Perform capacity testing to validate the scalability of systems and infrastructure under various load conditions.
• Prioritize security alerts based on severity and potential impact on system integrity and data confidentiality.
• Coordinate with security teams to assess and respond to security alerts promptly, implementing appropriate mitigation measures.

SITE SREs will monitor DevOps platform products.
• Monitor the stability, performance, and availability of DevOps platform products such as JFrog, GitLab, Vault, Kong, ELK, Rancher, and Kubernetes (K8s).

SITE SREs will define monitoring objectives.
• Collaborate with stakeholders to determine the key objectives and metrics for monitoring latency, traffic, errors, and saturation.
• Identify critical service-level indicators (SLIs) and service-level objectives (SLOs) to ensure that monitoring aligns with business and user expectations.

Required qualifications to be successful in this role:
• Bachelor's or Master's degree in Software Engineering, Computer Science, or equivalent.
• 2+ years of experience with Kubernetes.
• 5+ years of expertise in Linux administration.
• 3+ years of strong coding skills in languages such as Java, React.js, etc.
• 2+ years of experience with infrastructure-related tools (Terraform, Ansible, VS Code, Postman, etc.).
• Experience monitoring infrastructure and applications (Splunk, ECK, Grafana, Prometheus…).
• A solid understanding of CI/CD concepts, version control systems, and testing (experience with Jenkins, AppWorks, Git, Docker, GitLab, etc.).
• Experience with collaboration tools (Jira/Confluence, ServiceNow).
• Deep understanding of task automation.
• Proficiency in DevOps principles to ensure effective collaboration between IT operations and developers.
• Expertise in incident management and application security.
• Ability to define Service Level Objectives (SLOs), Service Level Agreements (SLAs), and Service Level Indicators (SLIs).
• Excellent communication skills to collaborate with diverse teams.
• Analytical mindset to understand and solve complex problems.
• Autonomy and a sense of responsibility to manage various aspects of the role.
• Ability to work well under pressure and manage multiple priorities.
• Must be amenable to working onsite 2 days a week in Taguig.

Skills: Linux, Kubernetes

CGI
Taguig, Metro Manila
Full time

Published on 08/03/2024
