Energy / Infrastructure · Product Owner, Platform Engineering

Platform Observability @ Enexis

2022 – Present

TL;DR

  • Problem: Fragmented monitoring across 12+ teams with no unified observability, causing slow incident response
  • Action: Designed and owned a centralized observability platform (Grafana/Prometheus) with SLO-based alerting and self-service dashboards
  • Outcome: 65% MTTD reduction, 80+ dashboards, 99.9% uptime SLO achieved across critical infrastructure

  • 65% MTTD Reduction
  • 12+ Teams Onboarded
  • 80+ Dashboards Created
  • 99.9% Uptime SLO

CONTEXT

Building a scalable observability platform for the Dutch energy grid, enabling real-time monitoring of critical infrastructure across 3M+ connections.

THE PROBLEM

Enexis lacked unified visibility into platform health. Teams operated in silos with fragmented monitoring, leading to slow incident response and blind spots in system reliability.

CONSTRAINTS

  • Legacy infrastructure with heterogeneous tech stacks across teams
  • Strict compliance and data governance requirements in the energy sector
  • Needed to onboard 12+ teams without dedicated platform engineers per team
  • Budget constraints required open-source-first tooling strategy

THE APPROACH

Designed and implemented a centralized observability stack using Grafana, Prometheus, and custom dashboards. Introduced SLO-based alerting, runbooks, and a platform-as-product mindset to shift from reactive firefighting to proactive reliability engineering.
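
To make "SLO-based alerting" concrete, here is a minimal sketch of the burn-rate idea that commonly sits behind it. It is not the Enexis implementation: the `Window` shape, the function names, and the 14.4 threshold (a widely used paging value for a one-hour window against a 30-day SLO) are assumptions for illustration. Only the 99.9% target comes from this case study.

```python
from dataclasses import dataclass

SLO_TARGET = 0.999               # the 99.9% uptime SLO quoted in this case study
ERROR_BUDGET = 1.0 - SLO_TARGET  # fraction of requests (or time) allowed to fail


@dataclass
class Window:
    """Request counts over one lookback window (hypothetical shape)."""
    total: int
    failed: int


def burn_rate(window: Window, error_budget: float = ERROR_BUDGET) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    if window.total == 0:
        return 0.0
    return (window.failed / window.total) / error_budget


def should_page(long_window: Window, short_window: Window,
                threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening. The threshold is an assumed value.
    """
    return (burn_rate(long_window) >= threshold
            and burn_rate(short_window) >= threshold)


def allowed_downtime_minutes(days: int = 30, slo: float = SLO_TARGET) -> float:
    """Total downtime the SLO tolerates over a period of `days` days."""
    return days * 24 * 60 * (1.0 - slo)
```

Unlike a fixed threshold such as "error rate above 1%", this check ties the alert to how quickly the error budget is being consumed, so a brief spike that has already recovered does not page anyone. For a 99.9% target, `allowed_downtime_minutes(30)` works out to roughly 43 minutes per 30 days.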

THE OUTCOME

Mean Time To Detection dropped by 65%. Platform teams gained self-service dashboards, and incident postmortems became data-driven. The observability platform became a shared capability across 12+ engineering teams.
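
For readers unfamiliar with the metric, the sketch below shows one way Mean Time To Detection and its percentage reduction could be computed from incident records. The function names and the `(started, detected)` tuple layout are hypothetical, and the timestamps are invented purely to show the arithmetic. They are not Enexis incident data and do not reproduce the 65% figure.

```python
from datetime import datetime, timedelta
from statistics import mean


def mttd_minutes(incidents):
    """Mean minutes between an incident starting and it being detected.

    `incidents` is a list of (started, detected) datetime pairs.
    """
    delays = [(detected - started).total_seconds() / 60
              for started, detected in incidents]
    return mean(delays)


def percent_reduction(before, after):
    """Relative drop from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Invented timestamps, for illustration only.
t0 = datetime(2022, 1, 1, 12, 0)
before = [(t0, t0 + timedelta(minutes=30)), (t0, t0 + timedelta(minutes=50))]
after = [(t0, t0 + timedelta(minutes=15)), (t0, t0 + timedelta(minutes=25))]

reduction = percent_reduction(mttd_minutes(before), mttd_minutes(after))
```

With these made-up numbers the mean delay falls from 40 to 20 minutes. In practice, the hard part is usually agreeing on what counts as the "started" timestamp for each incident.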

MY ROLE & OWNERSHIP

As Product Owner, I owned the full observability platform roadmap. I defined the platform vision, prioritized the backlog based on team adoption metrics and incident data, and worked directly with SREs to design alerting strategies. I drove stakeholder alignment across engineering leadership to secure buy-in for the platform-as-product approach. Key ownership areas: roadmap, backlog prioritization, SLO definitions, team onboarding strategy, and vendor/tool evaluation.

LEARNINGS

  • Platform adoption is a product challenge, not a technical one: onboarding UX and documentation matter more than features
  • SLO-based alerting dramatically reduces alert fatigue compared with threshold-based approaches
  • Self-service dashboards scale better than centralized dashboard teams
  • Incident postmortems become 10x more valuable when backed by observability data

Grafana · Prometheus · Loki · Kubernetes · Azure · Terraform
