Platform Observability @ Enexis
2022 – Present
TL;DR
- Problem: Fragmented monitoring across 12+ teams with no unified observability, causing slow incident response
- Action: Designed and owned a centralized observability platform (Grafana/Prometheus) with SLO-based alerting and self-service dashboards
- Outcome: 65% reduction in mean time to detection (MTTD), 80+ dashboards, 99.9% uptime SLO achieved across critical infrastructure
65% MTTD Reduction
12+ Teams Onboarded
80+ Dashboards Created
99.9% Uptime SLO
CONTEXT
Building a scalable observability platform for the Dutch energy grid, enabling real-time monitoring of critical infrastructure across 3M+ connections.
THE PROBLEM
Enexis lacked unified visibility into platform health. Teams operated in silos with fragmented monitoring, leading to slow incident response and blind spots in system reliability.
CONSTRAINTS
- Legacy infrastructure with heterogeneous tech stacks across teams
- Strict compliance and data governance requirements in the energy sector
- Needed to onboard 12+ teams without dedicated platform engineers per team
- Budget constraints required open-source-first tooling strategy
THE APPROACH
Designed and implemented a centralized observability stack using Grafana, Prometheus, and custom dashboards. Introduced SLO-based alerting, runbooks, and a platform-as-product mindset to shift from reactive firefighting to proactive reliability engineering.
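To make "SLO-based alerting" concrete, here is a minimal sketch of one common formulation, multi-window burn-rate alerting. It is not necessarily what was deployed at Enexis. The 99.9% target comes from this case study; the 14.4 paging threshold, the long/short window pairing, and every name in the code are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one look-back window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def error_budget(slo_target: float) -> float:
    """Fraction of requests (or time) allowed to fail, e.g. 0.001 for 99.9%."""
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, days: int = 30) -> float:
    """Downtime the SLO tolerates over a period of `days` days."""
    return days * 24 * 60 * error_budget(slo_target)


def burn_rate(window: Window, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    if window.total_requests == 0:
        return 0.0
    error_ratio = window.failed_requests / window.total_requests
    return error_ratio / error_budget(slo_target)


def should_page(long_window: Window, short_window: Window,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast: the long window shows the
    problem is significant, the short window shows it is still happening."""
    return (burn_rate(long_window, slo_target) >= threshold
            and burn_rate(short_window, slo_target) >= threshold)


if __name__ == "__main__":
    slo = 0.999  # 99.9% target from the case study
    print(f"Budget over 30 days: {allowed_downtime_minutes(slo):.1f} minutes")
    last_hour = Window(total_requests=100_000, failed_requests=2_000)
    last_5_min = Window(total_requests=8_000, failed_requests=160)
    print("Page on-call:", should_page(last_hour, last_5_min, slo))
```

Compared with a fixed threshold such as "error rate above 1%", this ties the alert to how quickly the reliability target is being put at risk, so short blips that barely touch the budget do not page anyone.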
THE OUTCOME
Mean time to detection (MTTD) dropped by 65%. Platform teams gained self-service dashboards, and incident postmortems became data-driven. The observability platform became a shared capability across 12+ engineering teams.
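For readers unfamiliar with the metric, the sketch below shows one straightforward way to compute mean time to detection and a percentage reduction from incident records. The record shape, function names, and all timestamps are made up for illustration; they are not Enexis incident data and are not meant to reproduce the 65% figure.

```python
from datetime import datetime, timedelta
from statistics import mean


def mttd_minutes(incidents):
    """Mean time to detection in minutes.

    `incidents` is a list of (started_at, detected_at) datetime pairs.
    """
    return mean(
        (detected - started).total_seconds() / 60
        for started, detected in incidents
    )


def percent_reduction(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (before - after) / before * 100


if __name__ == "__main__":
    t0 = datetime(2022, 1, 1, 12, 0)  # arbitrary illustrative start time
    # Hypothetical detection delays, in minutes, before and after a change.
    before = [(t0, t0 + timedelta(minutes=m)) for m in (40, 50, 60)]
    after = [(t0, t0 + timedelta(minutes=m)) for m in (15, 20, 25)]

    b, a = mttd_minutes(before), mttd_minutes(after)
    print(f"MTTD before: {b:.1f} min, after: {a:.1f} min")
    print(f"Reduction: {percent_reduction(b, a):.0f}%")
```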
MY ROLE & OWNERSHIP
As Product Owner, I owned the full observability platform roadmap. I defined the platform vision, prioritized the backlog based on team adoption metrics and incident data, and worked directly with SREs to design alerting strategies. I drove stakeholder alignment across engineering leadership to secure buy-in for the platform-as-product approach. Key ownership areas: roadmap, backlog prioritization, SLO definitions, team onboarding strategy, and vendor/tool evaluation.
LEARNINGS
- Platform adoption is a product challenge, not a technical one: onboarding UX and documentation matter more than features
- SLO-based alerting dramatically reduces alert fatigue vs threshold-based approaches
- Self-service dashboards scale better than centralized dashboard teams
- Incident postmortems become 10x more valuable when backed by observability data