Senior Manager, Reliability Engineering & Observability Platforms

Atlanta, Georgia

We are seeking an experienced and dynamic Senior Manager, Reliability Engineering & Observability Platforms to lead our observability initiatives and reliability engineering efforts. This role is accountable for designing and managing platforms that ensure visibility, uptime, performance, and seamless operation of critical systems and services.

The ideal candidate will bring deep technical expertise in observability tools and reliability engineering practices, along with proven leadership experience. They will lead a team responsible for enabling high availability, incident response, performance monitoring, and operational resilience through automation, process improvement, and cross-functional collaboration.

This leader will partner closely with IT, DevOps, Infrastructure, and Business Units to deliver scalable and reliable services with a focus on proactive issue detection and resolution. This is an in-office position based in Atlanta (80% onsite).

RESPONSIBILITIES

Own and evolve observability platforms (monitoring, logging, tracing) to meet organizational needs for performance and availability.

Improve observability maturity by driving adoption of best practices and platform-wide instrumentation.

Lead the reliability engineering function focused on ensuring system uptime, operability, and resilience.

Define and track SLOs/SLIs/SLAs, partnering with product and infrastructure teams to uphold service quality standards.

Drive adoption of reliability best practices into application design, deployments, and operations.

Develop and mature incident management processes including alerting, triage, resolution, and post-incident reviews.

Oversee and continuously improve on-call strategies, ensuring the team is prepared for high-impact production events.

Champion automation of monitoring, diagnostics, deployment validation, and platform operations to reduce manual effort.

Integrate observability and reliability engineering practices into CI/CD pipelines and deployment workflows.

Mentor and lead a team of engineers with a focus on operational excellence, continuous learning, and accountability.

Build a high-performing team culture aligned to business outcomes and platform stability.

Collaborate with cross-functional teams including application developers, DevOps, cloud infrastructure, and security to ensure reliable and observable service delivery.

Partner with architecture and engineering teams to ensure new systems are designed with reliability in mind.

Provide regular reporting and insights on system health, incidents, and reliability trends to leadership.

Use telemetry data to identify system bottlenecks, recurring issues, and areas for proactive improvement.

Manage observability and reliability tool vendors, including evaluation, contracts, renewals, and integrations.

EDUCATION AND EXPERIENCE QUALIFICATIONS

Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field.

Master’s degree is a plus.

5–7 years of experience managing and leading engineering teams.

7+ years in IT operations, DevOps, or software/platform engineering.

3+ years in a leadership role focused on observability or reliability engineering.

KNOWLEDGE, SKILLS, AND ABILITIES

Technical Expertise

Hands-on experience with observability platforms (e.g., Splunk, Prometheus, Grafana, ELK, New Relic, Dynatrace).

Deep understanding of logging, monitoring, alerting, and tracing technologies.

Strong knowledge of public cloud (Azure, AWS, or GCP) and container platforms (e.g., Kubernetes).

Familiarity with infrastructure as code (e.g., Terraform, Ansible).

Reliability Engineering Competency

Experience implementing and supporting highly available, scalable systems.

Understanding of SLOs, SLIs, incident lifecycle, and post-incident analysis.

Ability to embed reliability practices into SDLC and CI/CD workflows.

Leadership & Communication

Demonstrated ability to build, grow, and lead high-performing teams.

Strong analytical, communication, and cross-functional collaboration skills.

Inspire is a multi-brand restaurant company whose portfolio includes more than 33,000 Arby’s, Baskin-Robbins, Buffalo Wild Wings, Dunkin’, Jimmy John’s, and SONIC restaurants worldwide.

We’re made up of some of the world’s most iconic restaurant brands, but we’re much more than just a restaurant company. We’re a team of hundreds of thousands who individually and collectively are changing the way people eat, drink, and gather around the table. We know that food is much more than a staple—it’s an experience. At Inspire, that’s our purpose: to ignite and nourish flavorful experiences.

Job Details

Job Category: Technology

Date Posted: 2025-09-05

Job ID: JR37104-Atlanta_Support
