Senior Backend Engineer, Inference Platform
Company: Together AI
Location: San Francisco
Posted on: April 1, 2026
Job Description:
About the Role

Together AI is building the Inference Platform that brings the most advanced generative AI models to the world. Our platform powers multi-tenant serverless workloads and dedicated endpoints, enabling developers, enterprises, and researchers to harness the latest LLMs, multimodal models, and image, audio, video, and speech models at scale. If you get a thrill from optimizing latency down to the last millisecond, this is your playground.

You’ll work hands-on with tens of thousands of GPUs (H100s, H200s, GB200s, and beyond), figuring out how to fully utilize every FLOP and every gigabyte of memory. You’ll collaborate directly with research teams to bring frontier models into production, making breakthroughs usable in the real world. Our team also works closely with the open source community, contributing to and leveraging projects like SGLang, vLLM, and NVIDIA Dynamo to push the boundaries of inference performance and efficiency.

In this role, you will:
- Shape the core inference backbone that powers Together AI’s frontier models.
- Solve performance-critical challenges in global request routing, load balancing, and large-scale resource allocation.
- Work with state-of-the-art accelerators (H100s, H200s, GB200s) at global scale.
- Partner with world-class researchers to bring new model architectures into production.
- Collaborate with and contribute to the open source community, shaping the tools that advance the industry.

What we offer:
- A culture of deep technical ownership and high impact, where your work makes models faster, cheaper, and more accessible.
- Competitive compensation, equity, and benefits.

Responsibilities
- Build and optimize global and local request routing, ensuring low-latency load balancing across data centers and model engine pods.
- Develop auto-scaling systems to dynamically allocate resources and meet strict SLOs across dozens of data centers.
- Design systems for multi-tenant traffic shaping, tuning both resource allocation and request handling (including smart rate limiting and regulation) to ensure fairness and a consistent experience across all users.
- Engineer trade-offs between latency and throughput to serve diverse workloads efficiently.
- Optimize prefix caching to reduce model compute and speed up responses.
- Collaborate with ML researchers to bring new model architectures into production at scale.
- Continuously profile and analyze system-level performance to identify bottlenecks and implement optimizations.

Requirements

- 5 years of demonstrated experience
building large-scale, fault-tolerant, distributed systems and API microservices.
- Strong background in designing, analyzing, and improving the efficiency, scalability, and stability of complex systems.
- Excellent understanding of low-level OS concepts: multi-threading, memory management, networking, and storage performance.
- Expert-level programming skills in one or more of: Rust, Go, Python, or TypeScript.
- Knowledge of modern LLMs and generative models, and of how they are served in production, is a plus.
- Experience working with the open source ecosystem around inference is highly valuable; familiarity with SGLang, vLLM, or NVIDIA Dynamo is especially helpful.
- Experience with Kubernetes or container orchestration is a strong plus.
- Familiarity with GPU software stacks (CUDA, Triton, NCCL) and HPC technologies (InfiniBand, NVLink, MPI) is a plus.
- Bachelor’s or Master’s degree in Computer Science, Computer Engineering, or a related field, or equivalent practical experience.

About Together AI

Together AI is a
research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers on our journey to build the next generation of AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other competitive benefits.
The US base salary range for this full-time position is $160,000 - $250,000, plus equity and benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is
proud to offer equal employment opportunity to everyone regardless
of race, color, ancestry, religion, sex, national origin, sexual
orientation, age, citizenship, marital status, disability, gender
identity, veteran status, and more. Please see our privacy policy
at https://www.together.ai/privacy