Week in Review: AI, SRE & Observability — March 20–27, 2026

This was KubeCon week, and it showed. Amsterdam became the center of gravity for cloud-native infrastructure, with announcements ranging from NVIDIA donating its GPU DRA driver to the CNCF, to Kubernetes 1.35’s in-place pod resize graduating to stable. Meanwhile, the AI world kept shipping – Google dropped Gemini 3.1 Flash Live, MiniMax open-sourced a massive hybrid-attention reasoning model, and OpenTelemetry quietly cemented profiling as the fourth observability signal. It was one of those weeks where you could feel the industry shifting under your feet. ...

March 27, 2026 · Aditya Konarde

Week in Review: AI, SRE & Observability — March 14–20, 2026

GTC week is always loud, but this one hit different. NVIDIA unveiled a trillion-dollar infrastructure roadmap while OpenAI quietly absorbed one of Python’s most beloved open-source teams. Meanwhile, the SRE world went all-in on agentic AI, with PagerDuty, Komodor, and Microsoft all shipping agent-driven incident response features in the same week. And on the observability front, OpenTelemetry made a significant architectural decision that’ll ripple through every tracing backend for years. Buckle up. ...

March 20, 2026 · Aditya Konarde

Week in Review: AI, SRE & Observability — March 7–13, 2026

This was a week where “agentic” stopped being a buzzword and started showing up in architecture diagrams. NVIDIA dropped a model built specifically for multi-agent workflows, observability vendors raced to give AI agents direct access to production telemetry via MCP, and the cloud-native ecosystem quietly matured with a new CNCF graduation and a Kubernetes release preview that finally lets you scale to zero. If you build, run, or monitor software at scale, there’s something here for you. ...

March 13, 2026 · Aditya Konarde

Week in Review: AI, SRE & Observability — March 2–8, 2026

This was a week where the AI race got tangibly closer to your desktop, the Kubernetes ecosystem said goodbye to an old friend, and the observability world kept tightening its grip around OpenTelemetry as the universal standard. If you only have five minutes, the headlines are: GPT-5.4 can now operate your computer better than most humans, Ingress NGINX is officially done, and Google Cloud now speaks fluent OTLP.

AI & Machine Learning

OpenAI releases GPT-5.4 with native computer use — and it beats human performance. OpenAI’s latest frontier model isn’t just another benchmark bump. GPT-5.4 is the first general-purpose model to ship with production-ready computer use capabilities, scoring 75.0% on OSWorld-Verified desktop tasks — above the 72.4% human expert baseline. It supports up to 1M tokens of context, brings a new “reasoning plan preview” that lets users steer the model mid-thought, and introduces tool search for navigating large ecosystems of APIs and connectors. Available in ChatGPT (as GPT-5.4 Thinking), Codex, and the API. The agentic future just got a lot more concrete. Source: OpenAI ...

March 8, 2026 · Aditya Konarde

Provisioning Dashboards with Grafana

Background

Grafana has become the de facto visualization tool for Prometheus. While it is cool to run a central Grafana hooked up to an RDS database, I think it is even better if you can make Grafana completely configurable via Git and thus have stateless Grafana instances which you can scale horizontally. Based on this philosophy, I have been running a Grafana setup at Red Hat. Here are some key points:

- Grafana runs as pods on a Kubernetes (OpenShift) cluster
- Each dashboard is mounted into the pod via a ConfigMap
- Our GitOps pipeline takes care of adding the dashboard ConfigMaps to the namespace, so all dashboards and their changes ultimately must end up in Git

One of the best benefits of this approach is that you never have to worry about Grafana upgrades/downgrades. Because the pods are stateless, you can simply roll out a new version as long as the dashboard schema stays consistent. ...
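As a minimal sketch of the ConfigMap half of this setup (the names and labels here are illustrative, not taken from the Red Hat pipeline), a dashboard stored in Git can be rendered into a ConfigMap like this, which Grafana's file-based dashboard provisioning then reads from the mount path inside the pod:

```yaml
# Hypothetical example — ConfigMap carrying one dashboard JSON.
# A Grafana dashboard provider configured to watch the directory
# where this ConfigMap is mounted will load it at startup.
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-example   # illustrative name
  labels:
    app: grafana                    # illustrative label
data:
  example-dashboard.json: |
    {
      "title": "Example Dashboard",
      "schemaVersion": 16,
      "panels": []
    }
```

Because the dashboard lives entirely in this manifest, the Grafana pod itself holds no state: rolling out a new Grafana version or a dashboard change is just applying manifests from Git.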

March 6, 2020 · Aditya Konarde