Senior Site Reliability Engineer (SRE) - AI Infrastructure Job at Confidential, San Francisco, CA

  • Confidential
  • San Francisco, CA

Job Description

Join a stealth-mode startup building out their AI and cloud platform, powered by thousands of H100s, H200s, and B200s and ready for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you’ll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access. You’ll also support their new products as they come to market.

This is a rare opportunity to work at the intersection of large-scale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment.

If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it. Get in touch and apply today! 

Responsibilities:

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have:

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Salary & Benefits:

  • $300,000 gross per year 
  • Equity

Job Tags

Permanent employment
