Principal Site Reliability Engineer

Responsibilities
Solve complex problems related to Linux infrastructure and Oracle Cloud Infrastructure.
Act as the partner escalation point for critical issues that may not have a detailed procedure, and provide Root Cause Analysis (RCA).
Understand the end-to-end configuration, technical dependencies, and characteristics of production infrastructure and services.
Quickly grasp and analyze new technologies that are sophisticated and constantly evolving, and integrate them into automation and infrastructure support.
Design and deliver mission-critical automation, with a focus on security, resiliency, scale, and performance.
Identify opportunities and drive the implementation of automation to improve service health, availability, and reliability.
Author functional and technical documentation and standard operating procedures (SOPs).
Collaborate with development teams in defining and implementing improvements in service architecture.
Articulate technical characteristics of services and technology areas, and guide multi-functional teams to engineer and add capabilities to internal tools.
Partner with DevOps, Oracle Cloud Infrastructure deployment, and development teams to identify and resolve issues.

Knowledge and Skills
Proven experience in Site Reliability Engineering and automation.
Experience in Linux administration with good knowledge of kernel-level debugging.
Experience in debugging operating system performance issues and in performance tuning.
Experience working with fault-tolerant, highly available, high-efficiency, distributed, and scalable systems.
Expertise in developing scripts, utilities, and tools to automate routine or manually intensive tasks.
Experience in application, compute, storage, and database troubleshooting to improve application reliability, scalability, and availability.
Experience in cloud infrastructure technologies.
Experience in operations and problem management.
Development experience using Python and building infrastructure using Terraform.
Experience in handling high-availability production applications.
Experience working with global teams across different time zones.
Strong logical-thinking skills, intellectual curiosity, and a strong drive for self-development.
Ability to be a good teammate and the desire to learn and implement new cloud technologies as needed.
Good understanding of Agile software development principles, including the use of common tools such as JIRA.
Good understanding of cloud security and compliance management, including patching.
Excellent interpersonal, verbal, and written communication skills.

Qualifications required
Proven experience working in an IT Operations/Infrastructure team.
A Bachelor's degree in Computer Science, Computer Engineering, Software Engineering, or a related area is helpful.