Production Operations Engineer
Job Description
Salary: $95,000 - $105,000 annually
As a Production Operations Engineer, you will play a critical role in operationalizing, maintaining, scaling, and optimizing our AI-driven applications and supporting infrastructure. With a blend of software development and infrastructure skills, you will work closely with cross-functional teams, including software engineers, data scientists, and platform engineers, to ensure the delivery and operation of highly available, low-latency, and optimally performing AI products.
Your expertise will be crucial in developing solutions, automating processes, monitoring system health, troubleshooting, and managing incidents to ensure our products deliver a seamless experience for our clients.
Key Responsibilities:
- Software Development and Operations:
- Collaborate with software engineers to design, implement, and maintain scalable, efficient, and secure systems using a React, Python, Docker, and Kubernetes stack.
- Optimize application performance by profiling and tuning frontend and backend services for speed, scalability, and resilience.
- System Monitoring & Maintenance:
- Monitor production systems and services, ensuring optimal uptime and performance.
- Implement monitoring tools and dashboards for proactive incident detection.
- Infrastructure Automation:
- Automate repetitive tasks, deployment processes, and infrastructure provisioning using tools such as Ansible, Terraform, or similar.
- Develop and maintain CI/CD pipelines to facilitate smooth deployments.
- Incident Management & Troubleshooting:
- Respond to system incidents, troubleshoot issues, and work towards timely resolutions.
- Conduct root cause analysis (RCA) of system failures and develop strategies to prevent future incidents.
- Performance Optimization:
- Optimize AI model deployment and data pipelines for speed, efficiency, and cost-effectiveness.
- Collaborate with data scientists and engineers to ensure AI systems are running efficiently in production environments.
- Scalability & Reliability:
- Design and implement scalable infrastructure solutions for AI applications.
- Ensure system reliability, fault tolerance, and high availability through effective architecture and best practices.
- Security & Compliance:
- Work with security teams to ensure all systems are compliant with company security protocols and industry standards.
- Implement security best practices across production environments.
Required Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
- 5+ years of experience in a combination of software development, production operations, DevOps, infrastructure engineering, or security roles.
- Strong application development experience in JavaScript and Python, particularly with RESTful API design and development using a service-oriented architecture.
- Strong experience with cloud platforms (AWS, GCP, or Azure).
- Proficiency in containerization and container orchestration technologies (e.g., Docker, Kubernetes).
- Solid understanding of CI/CD pipelines and automation tools (GitHub Actions, ArgoCD, Jenkins, GitLab CI, etc.).
- Experience with infrastructure as code (Terraform, Ansible, etc.).
- Hands-on experience with monitoring and logging tools (Datadog, Prometheus, Grafana, ELK stack, etc.).
- Strong experience with Bash and similar scripting languages.
- Solid understanding of frontend frameworks, particularly React, and their interaction with backend services.
- Strong problem-solving skills and attention to detail.
- Experience in handling large-scale distributed systems.
Preferred Qualifications:
- Proficiency in working with NoSQL databases (MongoDB) and understanding of document-based data models.
- Knowledge of object storage systems and experience with S3-compatible APIs (MinIO) for storing and managing large-scale unstructured data.
- Experience in supporting AI/ML pipelines and production systems.
- Knowledge of data engineering and distributed data systems (e.g., Kafka, Spark, Hadoop).
- Understanding of GPU-based infrastructure for AI workloads.
- Familiarity with security best practices in cloud and AI environments.
- Flexible work environment and culture that promotes work-life balance.
About Us:
Aicadium is a global technology company delivering AI-powered industrial computer vision products into the hands of enterprises. With offices in Singapore and San Diego, California, and an international team of data scientists, engineers, and business strategists, Aicadium is operationalizing AI within organizations where advanced machine learning innovations were previously out of reach.
Team
Join a growing team of data scientists, machine learning engineers, and software engineers in an agile development environment. Work together with some of the best in the field to tackle challenging projects and operationalize the solutions you develop across a variety of industries and use cases.
Culture
We work in a casual and collaborative startup environment. Every member of the team plays a key role in shaping the solutions we develop and creating positive business value for the companies we work with. We are building a hub of the best talent in San Diego, CA, but we are open to working with people all over the U.S.
Benefits
Aicadium offers a great benefits package alongside your salary. Benefits include PTO, health insurance, vision and dental insurance, life and AD&D insurance, 401(k) with matching, and more!