Member of Technical Staff — Inference

Palo Alto, CA

About the Role

RadixArk is seeking a Member of Technical Staff — Inference to push the limits of large-scale AI inference.

You will work on the core systems that serve frontier models at scale, optimizing performance, latency, throughput, and cost across thousands of GPUs. This role sits at the intersection of systems engineering, ML infrastructure, and performance optimization.

Your work will directly shape how state-of-the-art models are deployed and experienced by users worldwide.

This is a deeply technical, high-impact role for engineers who enjoy working close to the hardware–software boundary and solving performance-critical problems at scale.

Requirements

5+ years of experience in systems engineering, ML infrastructure, or performance-critical backend systems

Strong expertise in large-scale inference systems for LLMs or generative models

Deep understanding of GPU architecture and performance characteristics

Experience optimizing latency- and throughput-critical production systems

Strong knowledge of distributed systems and networking fundamentals

Proficiency in C++, Rust, Go, or Python for production systems

Experience profiling and optimizing compute-intensive workloads

Strong debugging skills across system layers (model, runtime, kernel, network)

Strong Plus

Experience with LLM serving stacks (vLLM, TensorRT-LLM, SGLang, etc.)

Familiarity with CUDA, Triton, or custom kernel optimization

Experience with batching, KV-cache management, and scheduling strategies

Experience running inference at scale (1000+ GPUs)

Background in HPC or high-performance systems

Open-source contributions in ML or systems infrastructure

Responsibilities

Design and build large-scale inference systems for frontier AI models

Optimize latency, throughput, and GPU utilization in production inference

Develop and improve model serving architectures and runtimes

Work on batching, scheduling, and memory management strategies

Collaborate with kernel, compiler, and systems teams on performance optimization

Debug performance bottlenecks across the stack

Drive reliability and scalability of inference infrastructure

Build tooling for observability, profiling, and performance analysis

Contribute to long-term inference architecture and strategy

About RadixArk

RadixArk is an infrastructure-first company built by engineers who've shipped production AI systems, created SGLang (20K+ GitHub stars, the fastest open LLM serving engine), and developed Miles (our large-scale RL framework).

We're on a mission to democratize frontier-level AI infrastructure by building world-class open systems for inference and training.

Our team has optimized kernels serving billions of tokens daily, designed distributed training systems coordinating 10,000+ GPUs, and contributed to infrastructure that powers leading AI companies and research labs.

We're backed by well-known infrastructure investors and partner with Nvidia, Google, AWS, and frontier AI labs.

Join us in building infrastructure that gives real leverage back to the AI community.

Compensation

We offer competitive compensation with meaningful equity, comprehensive benefits, and flexible work arrangements. Compensation depends on location, experience, and level.

Equal Opportunity

RadixArk is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

