Senior Infrastructure Engineer – Distributed AI Systems
AetherGrid Systems Inc.
San Francisco, CA 94111
$70,000 - $90,000 a year
Contract
Job Description
We are seeking a highly experienced infrastructure engineer (*PhD required*) to design and maintain large-scale distributed systems that support machine learning workloads, high-throughput data pipelines, and real-time model serving.
This role involves building robust backend services, optimizing GPU resource orchestration, and implementing advanced system monitoring to ensure low-latency and highly reliable AI infrastructure.
The ideal candidate is comfortable working at the intersection of *distributed systems, ML infrastructure, and performance engineering*, and has strong experience with large-scale production environments.
Responsibilities
* Design and implement high-performance backend services supporting AI model training and inference pipelines
* Develop distributed systems capable of handling multi-petabyte datasets and high-throughput data processing
* Optimize GPU cluster scheduling, memory utilization, and workload orchestration
* Build internal tooling for model deployment, monitoring, and automated scaling
* Collaborate with ML researchers to translate research models into production-ready systems
* Maintain strict reliability and latency requirements for large-scale model serving infrastructure
Minimum Qualifications
Applicants must meet *all* of the following requirements:
* PhD in Computer Science, Electrical Engineering, or related technical field
* 7+ years of experience building distributed backend systems in production environments
* Deep experience with *both* of the following:
  * GPU cluster orchestration (Kubernetes + GPU scheduling frameworks)
  * Large-scale ML model serving infrastructure
* Strong experience with *at least three* of the following:
  * CUDA / GPU kernel optimization
  * Distributed training frameworks (DeepSpeed, FSDP, Megatron, or equivalent)
  * Low-latency inference systems (vLLM, TensorRT-LLM, or custom inference runtimes)
  * High-performance networking (RDMA / InfiniBand)
  * Large-scale data infrastructure (Spark, Ray, or similar)
* Demonstrated experience operating *clusters larger than 256 GPUs*
* Strong programming skills in *C++, Rust, and Python*
* Experience debugging performance issues at the *kernel, networking, and system level*
* Published research in top-tier systems or ML conferences *(OSDI, SOSP, MLSys, ICML, NeurIPS, or similar)*
Preferred Qualifications
* Experience building infrastructure for *large language models (70B+ parameters)*
* Experience designing *custom model inference runtimes*
* Familiarity with *compiler-level optimization for ML workloads*
* Contributions to major open-source infrastructure projects
Job Type: Contract
Pay: $70,000.00 - $90,000.00 per year
Work Location: In person