OpenEvidence

Site Reliability

San Francisco, CA | Remote | Full Time | Senior Level | DevOps Engineer

Job Description

As an engineer working on Site Reliability, you'll be a key architect and driver in building and hardening the mission-critical infrastructure powering our medical AI platform used by healthcare providers worldwide. This role combines exceptional technical scope with direct impact, focusing on the systemic health, performance, and efficiency of our entire production ecosystem.

You'll join our talented backend team in architecting and scaling our infrastructure, applying the SRE mindset to reduce toil, improve observability, and define robust Service Level Objectives (SLOs) across our services and data platforms. You will have significant autonomy to make architectural decisions and drive initiatives across performance optimization, infrastructure design, security, and data pipelines handling sensitive medical data at scale.

We're looking for a backend expert who thrives in a focused startup environment where technical excellence meets rapid iteration. You'll work directly with engineering leadership to translate business objectives into elegant technical solutions. The ideal candidate has a proven track record of building and scaling production systems, thinks deeply about system design, and is energized by the challenge of building healthcare infrastructure that must be both highly innovative and extremely reliable.

Responsibilities

Design and institute automated, low-toil operational practices for system health, performance, and scalability, embracing the SRE mindset.

Engage in the end-to-end design, development, and deployment of production software, ensuring built-in reliability and performance from the start.

Own, operate, and optimize key backend services and resources (databases, caches, load balancers), driving measurable improvements in system efficiency and speed.

Lead continuous risk mitigation and incident response, and conduct blameless postmortems to enhance system resilience.

Partner with engineering and product teams to translate platform requirements into robust technical execution and contribute to product strategy.

Participate in an on-call escalation rotation (approximately one week per month, US daylight hours).

Candidate Qualifications

B.S. or higher in computer science or related major

4+ years of software engineering experience

Firm grasp of the SRE philosophy and mindset, with practical experience working on or directly with SRE teams that have proactively engaged in system design and improvement.

Willingness to proactively engage with development teams to influence the course of software design and operational practices.

Capability to manage risk, make decisions, and exhibit sound judgment

High proficiency operating backend services at scale

Moderate proficiency with Google Cloud or high proficiency with any public cloud

Moderate proficiency with Postgres or high proficiency with another relational DB

Experience with Django, Django REST Framework, and Postgres

Motivation, drive, and ability to operate independently

Locations

Miami, FL (on-site)

San Francisco, CA (on-site)

Remote (US), with on-site 4-6 times per year