
MLOps Professional Services Engineer (Cloud & AI Infra)

About the Company

Our client is at the forefront of the AI revolution, providing cutting-edge infrastructure that's reshaping the landscape of artificial intelligence. They offer an AI-centric cloud platform that empowers Fortune 500 companies, top-tier innovative startups, and AI researchers to drive breakthroughs in AI. This publicly traded company is committed to building full-stack infrastructure to service the explosive growth of the global AI industry, including large-scale GPU clusters, cloud platforms, tools, and services for developers.

  • Company Type: Publicly traded

  • Product: AI-centric GPU cloud platform & infrastructure for training AI models

  • Candidate Location: Remote anywhere in the US

Their mission is to democratize access to world-class AI infrastructure, enabling organizations of all sizes to turn bold AI ambitions into reality. At the core of their success is a culture that celebrates creativity, embraces challenges, and thrives on collaboration.

The Opportunity

As an MLOps Professional Services Engineer (Remote), you’ll play a key role in designing, implementing, and maintaining large-scale machine learning (ML) training and inference workflows for clients. Working closely with a Solutions Architect and support teams, you’ll provide expert, hands-on guidance to help clients achieve optimal ML pipeline performance and efficiency. 

What You'll Do

  • Design and implement scalable ML training and inference workflows using Kubernetes and Slurm, focusing on containerization (e.g., Docker) and orchestration.

  • Collaborate with data scientists and engineers to optimize ML model training and inference performance

  • Develop and expand a library of ready-to-deploy, standardized solutions by designing, deploying, and managing Kubernetes and Slurm clusters for large-scale ML training

  • Integrate Kubernetes and Slurm with popular ML frameworks such as TensorFlow, PyTorch, and MXNet, ensuring seamless execution of distributed ML training workloads

  • Develop monitoring and logging tools to track distributed training performance, identify bottlenecks, and troubleshoot issues

  • Create automation scripts and tools to streamline ML training workflows, leveraging technologies like Ansible, Terraform, or Python

  • Participate in industry conferences, meetups, and online forums to stay up to date with the latest developments in MLOps, Kubernetes, Slurm, and ML


What You Bring

  • At least 3 years of experience in MLOps, DevOps, or a related field

  • Strong experience with Kubernetes and containerization (e.g., Docker)

  • Experience with cloud providers like AWS, GCP, or Azure

  • Familiarity with Slurm or other distributed computing frameworks

  • Proficiency in Python, with experience in ML frameworks such as TensorFlow, PyTorch, or MXNet

  • Knowledge of ML model serving and deployment

  • Familiarity with CI/CD pipelines and tools like Jenkins, GitLab CI/CD, or CircleCI

  • Experience with monitoring and logging tools like Prometheus, Grafana, or the ELK Stack

  • Solid understanding of distributed computing principles, parallel processing, and job scheduling

  • Experience with automation tools like Ansible and Terraform

Key Attributes for Success

  • Passion for AI and transformative technologies

  • A genuine interest in optimizing and scaling ML solutions for high-impact results

  • Results-driven mindset and problem-solver mentality

  • Adaptability and ability to thrive in a fast-paced startup environment

  • Comfortable working with an international team and diverse client base

  • Communication and collaboration skills, with experience working in cross-functional teams

Why Join?

  • Competitive compensation: $130,000-$175,000 (negotiable based on experience and skills)

  • Full medical and life insurance benefits: 100% coverage for health, vision, and dental insurance for employees and their families

  • 401(k) match program with up to a 4% company match

  • PTO and paid holidays 

  • Flexible remote work environment

  • Reimbursement of up to $85/month for mobile and internet

  • Work with state-of-the-art AI and cloud technologies, including the latest NVIDIA GPUs (H100, L40S, with H200 and Blackwell chips coming soon)

  • Be part of a team that operates one of the most powerful commercially available supercomputers

  • Contribute to sustainable AI infrastructure with energy-efficient data centers that recover waste heat to warm nearby residential buildings

Interviewing Process

  • Level 1: Virtual interview with the Talent Acquisition Lead (General fit, Q&A)

  • Level 2: Virtual interview with the Hiring Manager (Skills assessment)

  • Level 3: Interview with the C-level (Final round)

  • Reference and Background Checks: Conducted post-interviews

  • Offer: Extended to the selected candidate

We are proud to be an equal opportunity workplace and are committed to equal employment opportunity regardless of race, color, religion, sex, national origin, age, marital status, ancestry, physical or mental disability, genetic information, veteran status, gender identity or expression, sexual orientation, or any other characteristic protected by applicable federal, state, or local law.

San Francisco, CA
Full time

Published on 11/25/2024
