Senior Infrastructure Engineer – Distributed AI Systems
AetherGrid Systems Inc.
San Francisco, CA 94111
$70,000 - $90,000 a year
Contract
Job Description
We are seeking a highly experienced infrastructure engineer (*PhD required*) to design and maintain large-scale distributed systems that support machine learning workloads, high-throughput data pipelines, and real-time model serving.
This role involves building robust backend services, optimizing GPU resource orchestration, and implementing advanced system monitoring to ensure low-latency and highly reliable AI infrastructure.
The ideal candidate is comfortable working at the intersection of *distributed systems, ML infrastructure, and performance engineering*, and has strong experience with large-scale production environments.
Responsibilities
* Design and implement high-performance backend services supporting AI model training and inference pipelines
* Develop distributed systems capable of handling multi-petabyte datasets and high-throughput data processing
* Optimize GPU cluster scheduling, memory utilization, and workload orchestration
* Build internal tooling for model deployment, monitoring, and automated scaling
* Collaborate with ML researchers to translate research models into production-ready systems
* Maintain strict reliability and latency requirements for large-scale model serving infrastructure
Minimum Qualifications
Applicants must meet *all* of the following requirements:
* PhD in Computer Science, Electrical Engineering, or related technical field
* 7+ years of experience building distributed backend systems in production environments
* Deep experience with *both* of the following:
  * GPU cluster orchestration (Kubernetes + GPU scheduling frameworks)
  * Large-scale ML model serving infrastructure
* Strong experience with *at least three* of the following:
  * CUDA / GPU kernel optimization
  * Distributed training frameworks (DeepSpeed, FSDP, Megatron, or equivalent)
  * Low-latency inference systems (vLLM, TensorRT-LLM, or custom inference runtimes)
  * High-performance networking (RDMA / InfiniBand)
  * Large-scale data infrastructure (Spark, Ray, or similar)
* Demonstrated experience operating *clusters larger than 256 GPUs*
* Strong programming skills in *C++, Rust, and Python*
* Experience debugging performance issues at the *kernel, networking, and system level*
* Published research in top-tier systems or ML conferences *(OSDI, SOSP, MLSys, ICML, NeurIPS, or similar)*
Preferred Qualifications
* Experience building infrastructure for *large language models (70B+ parameters)*
* Experience designing *custom model inference runtimes*
* Familiarity with *compiler-level optimization for ML workloads*
* Contributions to major open-source infrastructure projects
Job Type: Contract
Pay: $70,000.00 - $90,000.00 per year
Work Location: In person