SRE | Aditya Konarde's Blog

Week in Review: AI, SRE & Observability -- March 20-27, 2026

This was KubeCon week, and it showed. Amsterdam became the center of gravity for cloud-native infrastructure, with announcements ranging from NVIDIA donating its GPU DRA driver to the CNCF, to Kubernetes 1.35’s in-place pod resize graduating to stable. Meanwhile, the AI world kept shipping – Google dropped Gemini 3.1 Flash Live, MiniMax open-sourced a massive hybrid-attention reasoning model, and OpenTelemetry quietly cemented profiling as the fourth observability signal. It was one of those weeks where you could feel the industry shifting under your feet. ...

Week in Review: AI, SRE & Observability — March 14–20, 2026

GTC week is always loud, but this one hit different. NVIDIA unveiled a trillion-dollar infrastructure roadmap while OpenAI quietly absorbed one of Python’s most beloved open-source teams. Meanwhile, the SRE world went all-in on agentic AI, with PagerDuty, Komodor, and Microsoft all shipping agent-driven incident response features in the same week. And on the observability front, OpenTelemetry made a significant architectural decision that’ll ripple through every tracing backend for years. Buckle up. ...

Week in Review: AI, SRE & Observability — March 7–13, 2026

This was a week where “agentic” stopped being a buzzword and started showing up in architecture diagrams. NVIDIA dropped a model built specifically for multi-agent workflows, observability vendors raced to give AI agents direct access to production telemetry via MCP, and the cloud-native ecosystem quietly matured with a new CNCF graduation and a Kubernetes release preview that finally lets you scale to zero. If you build, run, or monitor software at scale, there’s something here for you. ...

Week in Review: AI, SRE & Observability — March 2–8, 2026

This was a week where the AI race got tangibly closer to your desktop, the Kubernetes ecosystem said goodbye to an old friend, and the observability world kept tightening its grip around OpenTelemetry as the universal standard. If you only have five minutes, the headlines are: GPT-5.4 can now operate your computer better than most humans, Ingress NGINX is officially done, and Google Cloud now speaks fluent OTLP.

AI & Machine Learning

OpenAI releases GPT-5.4 with native computer use, and it beats human performance

OpenAI’s latest frontier model isn’t just another benchmark bump. GPT-5.4 is the first general-purpose model to ship with production-ready computer use capabilities, scoring 75.0% on OSWorld-Verified desktop tasks, above the 72.4% human expert baseline. It supports up to 1M tokens of context, brings a new “reasoning plan preview” that lets users steer the model mid-thought, and introduces tool search for navigating large ecosystems of APIs and connectors. It is available in ChatGPT (as GPT-5.4 Thinking), Codex, and the API. The agentic future just got a lot more concrete.

Source: OpenAI ...

How much do SREs really code?

A quick recap on SRE

Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.[1] The main goals are to create scalable and highly reliable software systems. According to Ben Treynor, founder of Google’s Site Reliability Team, SRE is “what happens when a software engineer is tasked with what used to be called operations.”

Source: Wikipedia

How much time do you spend coding?

I get this question quite often: “How much time do you spend writing code?” ...