Senior Site Reliability Engineer, Production Engineering
Company: Anduril Industries
Location: Costa Mesa
Posted on: April 1, 2026
Job Description:
ABOUT THE TEAM

The Production Engineering team is a newly formed organization within
Anduril's Software Platform, dedicated to ensuring the reliability,
performance, and scalability of mission-critical systems that directly
support our warfighters in the field. We solve complex reliability
challenges at massive scale, ensuring that critical components of
Lattice, Anduril's autonomous command and control platform, operate
flawlessly in the most demanding operational environments.

This is a foundational role, and you will be among the first hires
building this team from the ground up. You'll have the unique
opportunity to shape the technical direction, establish best
practices, and define what production engineering excellence means at
Anduril. Our team operates at the intersection of software engineering
and systems reliability, building the infrastructure, tooling, and
processes that keep our systems operational 24/7/365.

ABOUT THE ROLE

We are seeking an experienced Senior Site Reliability Engineer who is
passionate about building resilient, highly available systems that
scale to meet the demands of the core systems powering Lattice. You
will work closely with platform engineering teams, product developers,
and field operations to proactively identify reliability risks,
implement defensive strategies, and continuously improve the
operational excellence of our software platform. If you thrive on
solving hard problems at scale and want your work to have direct
impact on national security, this is the role for you.

WHAT YOU’LL DO

- Design and implement comprehensive monitoring,
  observability, and alerting systems to ensure early detection of
  reliability issues across the Lattice platform
- Drive incident response and conduct blameless postmortems to
  identify systemic improvements and prevent recurrence of production
  issues
- Build and maintain infrastructure automation using tools like
  Terraform, Kubernetes operators, and custom tooling to manage
  large-scale distributed systems
- Establish and track Service Level Objectives (SLOs) and Error
  Budgets to balance feature velocity with system reliability
- Partner with software engineering teams to improve system
  architecture for reliability, implementing patterns like circuit
  breakers, graceful degradation, and chaos engineering
- Develop capacity planning models and performance testing frameworks
  to ensure systems can handle growth and peak operational demands
- Create runbooks, documentation, and training materials to enable
  teams to operate production systems effectively
- Lead cross-functional efforts to improve deployment safety through
  progressive rollouts, automated testing, and rollback capabilities
- Implement security best practices and compliance controls for
  production environments handling sensitive defense data
- Build tooling and automation to reduce toil and improve operational
  efficiency for the engineering organization
- Participate in on-call rotations and serve as an escalation point
  for critical production incidents

REQUIRED QUALIFICATIONS

- 7 years of engineering experience
  with at least 3 years focused on SRE, production operations, or
  infrastructure engineering
- Bachelor's degree in Computer Science, Engineering, or equivalent
  practical experience
- Deep expertise with Kubernetes in production environments, including
  operational challenges at scale (100 nodes)
- Strong programming skills in one or more languages such as Go,
  Python, Rust, or Java with ability to build production-grade tooling
- Proven experience designing and implementing observability stacks
  (metrics, logging, tracing) using tools like Prometheus, Grafana,
  ELK/EFK, or equivalent
- Hands-on experience with cloud platforms (AWS, Azure, or GCP) and
  infrastructure as code practices
- Demonstrated ability to debug complex distributed systems issues
  across multiple layers of the stack
- Track record of improving system reliability through architectural
  changes, not just operational band-aids
- Strong incident management and communication skills, with experience
  leading responses to critical outages
- Must be a U.S. Person due to required access to U.S. export
  controlled information or facilities
- Eligible to obtain and maintain an active U.S. Secret security
  clearance

PREFERRED QUALIFICATIONS

- Experience with defense,
  aerospace, or other mission-critical systems where downtime has
  severe consequences
- Expertise in performance optimization and capacity planning for
  high-throughput, low-latency systems
- Knowledge of chaos engineering principles and experience
  implementing resilience testing frameworks
- Experience with service mesh technologies (Istio, Linkerd) and
  advanced traffic management patterns
- Background in database operations and optimization (PostgreSQL,
  Cassandra, or similar at scale)
- Familiarity with CI/CD platforms and deployment automation (ArgoCD,
  FluxCD, Spinnaker, Jenkins)
- Understanding of networking fundamentals including load balancing,
  DNS, TLS/SSL, and network security
- Experience with configuration management and secrets management
  solutions (Vault, Sealed Secrets, SOPS)
- Strong written and verbal communication skills with ability to
  explain technical concepts to non-technical stakeholders
- Active Secret or higher security clearance

US Salary Range: $166,000 - $220,000 USD

The salary range for this role is an estimate based on a wide range of
compensation factors, inclusive of base salary only. Actual salary
offer may vary based on (but not limited to) work experience,
education and/or training, critical skills, and/or business
considerations. Highly competitive equity grants are included in the
majority of full-time offers and are considered part of Anduril's
total compensation package.
Additionally, Anduril offers top-tier benefits for full-time
employees, including:

Healthcare Benefits

- US Roles: Comprehensive medical, dental, and vision plans at little
  to no cost to you.
- UK & AUS Roles: We cover the full cost of medical insurance premiums
  for you and your dependents.
- IE Roles: We offer an annual contribution toward your private health
  insurance for you and your dependents.

Additional Benefits

- Income Protection: Anduril covers life and disability insurance for
  all employees.
- Generous time off: Highly competitive PTO plans with a holiday
  hiatus in December. Caregiver & Wellness Leave is available to care
  for family members, bond with a new baby, or address your own
  medical needs.
- Family Planning & Parenting Support: Coverage for fertility
  treatments (e.g., IVF, preservation), adoption, and gestational
  carriers, along with resources to support you and your partner from
  planning to parenting.
- Mental Health Resources: Access free mental health resources 24/7,
  including therapy and life coaching. Additional work-life services,
  such as legal and financial support, are also available.
- Professional Development: Annual reimbursement for professional
  development.
- Commuter Benefits: Company-funded commuter benefits based on your
  region.
- Relocation Assistance: Available depending on role eligibility.

Retirement Savings Plan

- US Roles: Traditional 401(k), Roth, and after-tax (mega backdoor
  Roth) options.
- UK & IE Roles: Pension plan with employer match.
- AUS Roles: Superannuation plan.

The recruiter assigned to this role can share more information about
the specific compensation and benefit details associated with this
role during the hiring process.
Protecting Yourself from Recruitment Scams

Anduril is committed to maintaining the integrity of our talent
acquisition process and the security of our candidates. We've observed
a rise in sophisticated phishing and fraudulent schemes where
individuals impersonate Anduril representatives, luring job seekers
with false interviews or job offers. These scammers often attempt to
extract payment or sensitive personal information. To ensure your
safety and help you navigate your job search with confidence, please
keep the following critical points in mind:

- No Financial Requests: Anduril will never solicit payment or demand
  personal financial details (such as banking information, credit card
  numbers, or social security numbers) at any stage of our hiring
  process. Our legitimate recruitment is entirely free for candidates.
- Please always verify communications:
  - Direct from Anduril: If you receive an email from one of our
    recruiters, it will only come from an @anduril.com address.
  - Via Agency Partner: If contacted by a recruiting agency for an
    Anduril role, their email will clearly identify their agency. If
    you notice any suspicious activity, please verify the agency's
    authenticity by reaching out to contact@anduril.com.
- Exercise Caution with Unsolicited Outreach: If you receive any
  communication that appears suspicious, contains grammatical errors,
  or makes unusual requests, do not engage. Always confirm the
  sender's email domain is @anduril.com before providing any personal
  information or clicking on links.
- What to Do If You Suspect Fraud: Should you encounter any
  questionable or fraudulent outreach claiming to be from Anduril,
  please report it immediately to contact@anduril.com.

Your proactive caution is invaluable in protecting your personal
information and upholding the security and trustworthiness of our
recruitment efforts.

Data Privacy

To view Anduril's candidate data privacy policy, please visit
https://anduril.com/applicant-privacy-notice/.
Keywords: Anduril Industries, Alhambra, Senior Site Reliability Engineer, Production Engineering, IT / Software / Systems, Costa Mesa, California