At Weights & Biases, our mission is to build the best developer tools for machine learning. Weights & Biases is a Series C company with $200 million in funding and a rapidly growing user base. Our platform is an essential part of the daily work of machine learning engineers, from academic research institutions like FAIR and UC Berkeley to massive enterprise teams including iRobot, OpenAI, Toyota Research Institute, Samsung, NVIDIA, Salesforce, Blue Cross Blue Shield, Lyft, and more.

As a Senior Site Reliability Engineer, you’ll own the monitoring and observability stack, working closely with the Infrastructure Team and other developers to scale wandb.ai in lockstep with our exponentially growing user base and a fleet of customer deployments. You’ll be instrumental in building the foundations of an SRE team at a fast-growing startup, establishing the patterns and practices necessary to operate highly reliable services at scale.

We encourage you to apply even if your experience doesn't perfectly align with the job description as we seek out diverse and creative perspectives. Team members who love to learn and collaborate in an inclusive environment will flourish with us. We are an equal-opportunity employer and do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. If you need additional accommodations to feel comfortable during your interview process, reach out at careers@wandb.com.

Scale a system trusted by leaders in the ML industry to ingest and query terabytes of data daily.
Build a monitoring and observability platform to pinpoint issues across a fleet of customer deployments.
Establish the foundations of an SRE team at a fast-growing startup.
Advise and educate development teams on how to build observable, reliable services.

In-depth knowledge of at least one cloud provider (AWS, GCP, Azure).
Strong grasp of at least one higher-level language and its ecosystem (Go, Python, TypeScript, etc.).
A willingness to dive into and debug issues at any layer of the tech stack, from the application layer to the network.
Deep experience managing, monitoring, and debugging distributed systems/databases (MySQL, Postgres, BigTable, etc.) in production.
A demonstrated ability to think critically under pressure.
Excellent communication skills and an ability to explain deeply technical concepts simply.

Empathetic and friendly: You'll love this role if you enjoy connecting with real users, helping them solve issues, and learning good patterns for using our tools. Day to day, you'll answer questions and requests in a kind, thoughtful tone that makes users feel appreciated and connected to our team.
Autonomous: If you work well in a self-directed environment, and proactively find ways to improve processes and collaborate with team members or engaged users, your initiative will really shine in this role.
Curious and driven: Explore machine learning and learn more about the engineering stack and common ML workflows. Solve problems in both fast-paced, short-term sprints and in larger, more long-term projects.
Organized: A core part of engineering support at Weights & Biases is organizing feedback from many channels into a single, orderly stream. Your organization skills and time management will be key to running this process well.

Top-tier machine learning teams rely on our tools for their daily work at companies including OpenAI, Toyota Research Institute, Lyft, Samsung, and Pandora.
You'll never stop learning. This role gives you first-hand experience talking with leading researchers in the field, understanding their problems, and directly shaping the product direction.
Our experienced founding team has successfully built and sold ML tools in the past at Figure Eight, and their deep knowledge of our industry, empathy for our users, and skillful management is driving W&B to success.
Customers genuinely benefit from our tool. Here's a quote from Wojciech Zaremba, Cofounder and Robotics Lead, OpenAI: "W&B allows to scale up insights from a single researcher to the entire team, and from a single machine to hundreds of them."

🏝️ Flexible time off
🩺 Medical, dental, and vision coverage for employees and their families
🏠 Remote first culture with in-office flexibility in San Francisco
💵 Home office budget with a new high-powered laptop
🥇 Truly competitive salary and equity
🚼 12 weeks of Parental leave (U.S. specific)
📈 401(k) (U.S. specific)
Supplemental benefits may be available depending on your location

Senior Site Reliability Engineer - (Remote)

Wandb in San Francisco, California