MLOps Professional Services Engineer (Cloud & AI Infra)
Job Description
About the Company
Our client is at the forefront of the AI revolution, providing cutting-edge infrastructure that's reshaping the landscape of artificial intelligence. They offer an AI-centric cloud platform that empowers Fortune 500 companies, top-tier innovative startups, and AI researchers to drive breakthroughs in AI. This publicly traded company is committed to building full-stack infrastructure to service the explosive growth of the global AI industry, including large-scale GPU clusters, cloud platforms, tools, and services for developers.
- Company Type: Publicly traded
- Product: AI-centric GPU cloud platform & infrastructure for training AI models
- Candidate Location: Remote anywhere in the US
Their mission is to democratize access to world-class AI infrastructure, enabling organizations of all sizes to turn bold AI ambitions into reality. At the core of their success is a culture that celebrates creativity, embraces challenges, and thrives on collaboration.
The Opportunity
As an MLOps Professional Services Engineer (Remote), you’ll play a key role in designing, implementing, and maintaining large-scale machine learning (ML) training and inference workflows for clients. Working closely with a Solutions Architect and support teams, you’ll provide expert, hands-on guidance to help clients achieve optimal ML pipeline performance and efficiency.
What You'll Do
- Design and implement scalable ML training and inference workflows using Kubernetes and Slurm, with a focus on containerization (e.g., Docker) and orchestration
- Optimize ML model training and inference performance in collaboration with data scientists and engineers
- Develop and expand a library of ready-to-deploy, standardized training and inference solutions by designing, deploying, and managing Kubernetes and Slurm clusters for large-scale ML training
- Integrate Kubernetes and Slurm with popular ML frameworks such as TensorFlow, PyTorch, or MXNet, ensuring seamless execution of distributed ML training workloads
- Develop monitoring and logging tools to track distributed training performance, identify bottlenecks, and troubleshoot issues
- Create automation scripts and tools to streamline ML training workflows, leveraging technologies such as Ansible, Terraform, or Python
- Participate in industry conferences, meetups, and online forums to stay current with the latest developments in MLOps, Kubernetes, Slurm, and ML
What You Bring
- At least 3 years of experience in MLOps, DevOps, or a related field
- Strong experience with Kubernetes and containerization (e.g., Docker)
- Experience with cloud providers such as AWS, GCP, or Azure
- Familiarity with Slurm or other distributed computing frameworks
- Proficiency in Python, with experience in ML frameworks such as TensorFlow, PyTorch, or MXNet
- Knowledge of ML model serving and deployment
- Familiarity with CI/CD pipelines and tools such as Jenkins, GitLab CI/CD, or CircleCI
- Experience with monitoring and logging tools such as Prometheus, Grafana, or the ELK Stack
- Solid understanding of distributed computing principles, parallel processing, and job scheduling
- Experience with automation tools such as Ansible and Terraform
Key Attributes for Success
- Passion for AI and transformative technologies
- A genuine interest in optimizing and scaling ML solutions for high-impact results
- Results-driven mindset and problem-solver mentality
- Adaptability and the ability to thrive in a fast-paced startup environment
- Comfort working with an international team and a diverse client base
- Strong communication and collaboration skills, with experience working in cross-functional teams
Why Join?
- Competitive compensation: $130,000-$175,000 (negotiable based on experience and skills)
- Full medical and life insurance: 100% coverage for health, vision, and dental insurance for employees and their families
- 401(k) match program with up to a 4% company match
- PTO and paid holidays
- Flexible remote work environment
- Reimbursement of up to $85/month for mobile and internet
- Work with state-of-the-art AI and cloud technologies, including the latest NVIDIA GPUs (H100, L40S, with H200 and Blackwell chips coming soon)
- Be part of a team that operates one of the most powerful commercially available supercomputers
- Contribute to sustainable AI infrastructure with energy-efficient data centers that recover waste heat to warm nearby residential buildings
Interviewing Process
- Level 1: Virtual interview with the Talent Acquisition Lead (general fit, Q&A)
- Level 2: Virtual interview with the Hiring Manager (skills assessment)
- Level 3: Interview with the C-level team (final round)
- Reference and background checks: conducted post-interviews
- Offer: extended to the selected candidate
We are proud to be an equal opportunity workplace and are committed to equal employment opportunity regardless of race, color, religion, sex, national origin, sexual orientation, age, marital status, ancestry, physical or mental disability, genetic information, veteran status, gender identity or expression, or any other characteristic protected by applicable federal, state, or local law.
Compensation Range: $130K - $175K