Principal Site Reliability Engineer

Oracle’s Cloud Infrastructure team is building out new File Storage Service deployments at scale, along with an Operations team, in a broadly distributed multi-tenant cloud environment. Our customers run their businesses on our cloud, and our mission is to provide them with best-in-class file storage capabilities in conjunction with other compute, storage, networking, database, and security offerings.

We’re looking for hands-on engineers with a passion for solving problems in distributed systems, virtualized infrastructure, and highly available services. Joining Oracle will give you the opportunity to learn and help build innovative new systems from the ground up and operate services at scale. Engineers at every level can have significant technical and business impact while delivering critical enterprise-level features during multiple parallel deployments.

As an Operations SRE, you will work as part of a highly collaborative team to support features and tools for the File Storage Service while operating and growing the current service offering. You should value simplicity and scale, work comfortably in a collaborative environment, be familiar with ITIL and Agile methodologies, produce good user documentation, and be excited to learn. AS or BS degree or equivalent experience relevant to functional area.

Basic Qualifications:
- 2+ years’ experience supporting commercial software in a distributed environment.
- Good knowledge of Terraform and Python is a must.
- Strong knowledge of operations, Linux, operating systems, and distributed systems fundamentals.
- Strong problem-solving, debugging, and ticket resolution skills.

Career Level - IC4

- Dealing with operational tickets that have well-defined runbooks. These include:
  - Auto-cut tickets that have known resolutions (for example, tearing down bad hosts, provisioning new capacity, vulnerability scan issues).
  - Auto-cut tickets that are associated with known problems and need to be cleared manually, pending service enhancements or bug fixes.
  - Human-cut tickets that are associated with common customer issues or require some initial triage before bringing them up.
- Participating in Large Scale Events involving service dependencies.
- Learning and adapting to new technologies in a fast-paced environment.
- Reacting to alarms and creating new alarms and definitions.
- Identifying opportunities for improvement and documenting operational processes.
- Identifying the highest sources of ticket noise and operational toil, and working with the service team to:
  - Create and track bug fixes.
  - Enhance the service.
  - Adjust alarm thresholds.
  - Identify missing metrics and alarms.
- Performing and fully owning software deployments across both pre-production and production routinely, using tools and expertise such as Terraform and Python.
- Proficiency in organizational change standard methodologies.

Duties and tasks are varied and require independent judgment.

Oracle
Bengaluru, Karnataka
Full time

Published on 10/29/2024