Site Reliability Engineer Specialist
Job description
At Digibee, we aren’t just building technology; we are unlocking the innovation potential of the world’s largest companies by making the complex simple.
In an integration market valued at approximately $250 billion, our cloud-native, low-code iPaaS platform empowers every developer to build and monitor end-to-end workflows, eliminating technical debt and accelerating digital transformation.
Why join us?
Founded in Brazil in 2017, we are now a global team distributed across the Americas, driven by a culture of flexibility and autonomy.
Following a $60 million Series B funding round, we are in full global expansion, combining the agility of a startup with the stability of a company backed by major global players.
Here, you don’t just witness growth; you are the engine behind it.
If you seek real impact and want to redefine how the world connects, you belong here.
Let’s simplify the world, one integration at a time.
About the role:
We are looking for a Site Reliability Engineer Specialist to be the technical anchor for observability and incident response across the Digibee platform. This is a senior individual contributor role with significant cross-team influence over how we instrument, monitor, alert on, and recover from issues in a complex distributed system. That system is primarily Java services, with some Node.js services, running on GKE, fronted by Kong, and backed by RabbitMQ, PostgreSQL, MongoDB Atlas, Redis, MinIO, and an Elasticsearch/Logstash/Fluent Bit logging pipeline.
You will set the bar for reliability engineering at Digibee — defining our SLO culture, evolving our Dash0/OpenTelemetry-based observability framework, leading major incident response, and mentoring engineers across SRE, platform, and product teams.
Responsibilities and duties
On a typical day, you will…
- Own the technical direction of our observability stack (Dash0, OpenTelemetry, Elasticsearch/Logstash/Fluent Bit) — defining instrumentation standards for Java and Node.js services and driving adoption of tracing, metrics, and structured logging.
- Establish meaningful SLIs, SLOs, and error budgets, and partner with engineering and product teams to use them to drive real engineering decisions.
- Lead major incident response as a senior incident commander, and run blameless postmortems with technical depth and real follow-through.
- Evolve our on-call program so it is humane and sustainable — driving down toil and alert noise as a first-class engineering priority.
- Influence architecture decisions across the platform, going deep where it matters: GKE, Kong, RabbitMQ, PostgreSQL, MongoDB Atlas, Redis, and MinIO.
- Mentor SREs and platform engineers, raise the technical bar through design and incident reviews, and grow the SRE discipline at Digibee.
Requirements and qualifications
What you'll need to bring…
- 8+ years in SRE, infrastructure, or platform engineering, with meaningful time at Specialist or Principal level operating large-scale production systems — this is a mandatory requirement.
- Deep production experience with Kubernetes (preferably GKE), including real fluency debugging production issues under pressure.
- Strong observability background with OpenTelemetry, Prometheus, distributed tracing, and centralized logging (Elasticsearch, Logstash, Fluent Bit, or similar). Experience with Dash0 is a strong plus.
- Hands-on experience operating stateful services in production: at least two of PostgreSQL, MongoDB Atlas, Redis, RabbitMQ, or object storage (MinIO/S3).
- Production experience instrumenting and troubleshooting Java services (JVM tuning, GC, thread dumps); familiarity with Node.js runtime characteristics is a plus.
- Proven track record leading incident response and SLO programs that actually changed engineering behavior — not dashboards nobody looks at.
- Demonstrated ability to mentor senior engineers and influence technical direction across teams without formal authority.
- Strong communication skills in both English and Portuguese (written and verbal), with proven ability to collaborate across cross-functional, remote-first teams.
Additional information
It's a plus if you have…
- Experience operating an iPaaS or similarly multi-tenant runtime where customer workloads are first-class.
- Experience with Kong API Gateway and Apache Camel at scale.
- Experience with FluxCD, GitLab CI, and GitOps workflows.
- Background contributing to OpenTelemetry, Prometheus, or related open source projects.
- Familiarity with Chaos Engineering and Production Readiness Review programs.
- CNCF Kubernetes certifications (CKA, CKS) or GCP Professional Cloud Engineer / Architect certifications.
Hiring process stages
- Step 1: Registration
- Step 2: Screening
- Step 3: People Interview
- Step 4: Hiring Manager Interview
- Step 5: Cross-Functional Interview
- Step 6: Offer
- Step 7: Hiring
Together we can shape the future of integration!
We are a global integration software company with a Brazilian foundation. Our supportive work environment inspires our employees to give their best. Our people love what they do; we work hard and have fun doing it. We know our integration platform is special, and we’re excited to share it with our team, our customers, our industry, and the world.