[{"content":"If you use the terminal on macOS, you\u0026rsquo;ve typed your password for sudo thousands of times. There\u0026rsquo;s a better way. Touch ID works for sudo, and Apple even ships a template config for it. Most people just don\u0026rsquo;t know it\u0026rsquo;s there.\nThe Setup macOS includes a PAM (Pluggable Authentication Modules) template at /etc/pam.d/sudo_local.template with the Touch ID line already written but commented out. All you need to do is copy it and uncomment:\nsudo cp /etc/pam.d/sudo_local.template /etc/pam.d/sudo_local sudo sed -i \u0026#39;\u0026#39; \u0026#39;s/^#auth/auth/\u0026#39; /etc/pam.d/sudo_local That\u0026rsquo;s it. (Matching on ^#auth rather than the full line means the command works no matter how the template spaces out its fields.) Your next sudo command will prompt for Touch ID instead of a password.\nWhy sudo_local Instead of Editing sudo Directly? Older guides tell you to edit /etc/pam.d/sudo directly. The problem: macOS system updates overwrite that file, and your change disappears. Apple introduced sudo_local specifically to solve this. It\u0026rsquo;s a local override that persists across updates.\nWhat\u0026rsquo;s Actually Happening PAM is the authentication framework that handles sudo. The pam_tid.so module bridges PAM to the Secure Enclave, the dedicated security coprocessor that stores your fingerprint data. When you run a sudo command:\nPAM checks sudo_local first (because it\u0026rsquo;s included from the main sudo config) The pam_tid.so module triggers the Touch ID prompt The Secure Enclave verifies your fingerprint locally on-device If it matches, sudo proceeds. No password needed The sufficient keyword means Touch ID is enough on its own. If it fails (say you hit Cancel), PAM falls through to the next method, your password, so you\u0026rsquo;re never locked out.\nA Small Thing That Adds Up This is the kind of micro-optimization that seems trivial in isolation but compounds over a day of terminal work. Every sudo that used to break your flow with a password prompt now takes a tap. 
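For reference, this is roughly what the finished /etc/pam.d/sudo_local looks like on a recent macOS release (the comment header text may differ between versions):

```
# sudo_local: local config file which survives system update and is included for sudo
auth       sufficient     pam_tid.so
```
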
Once you\u0026rsquo;ve used it for a week, you\u0026rsquo;ll wonder why you didn\u0026rsquo;t set it up sooner.\n","permalink":"https://www.adityakonarde.com/posts/touchid-sudo-macos/","summary":"\u003cp\u003eIf you use the terminal on macOS, you\u0026rsquo;ve typed your password for \u003ccode\u003esudo\u003c/code\u003e thousands of times. There\u0026rsquo;s a better way. Touch ID works for sudo, and Apple even ships a template config for it. Most people just don\u0026rsquo;t know it\u0026rsquo;s there.\u003c/p\u003e\n\u003ch2 id=\"the-setup\"\u003eThe Setup\u003c/h2\u003e\n\u003cp\u003emacOS includes a PAM (Pluggable Authentication Modules) template at \u003ccode\u003e/etc/pam.d/sudo_local.template\u003c/code\u003e with the Touch ID line already written but commented out. All you need to do is copy it and uncomment:\u003c/p\u003e","title":"TIL: Touch ID for sudo on macOS"},{"content":"I am a bit of a Xiaomi fan, so when I spotted the B3600 Pro I bought it without a strong reason. The UI was entirely in Chinese, and my Vodafone station was already good enough for daily use, so the router ended up sitting on a shelf collecting dust.\nMeanwhile, I have been running a fork of PicoClaw, a lightweight Go-based AI assistant, for a while now. I have been genuinely impressed with how capable it is, so one day I looked at the idle router and thought: can I get PicoClaw running on that thing? Armed with my set of AI tools, I set out on an adventure.\nIt turned out to be more practical than I expected.\nThe Starting Point I used the trick to get the SSH password from the Xiaomi dev panel. I won\u0026rsquo;t go into detail here, but a quick Google search should show you how.\nOnce I had SSH access, the device turned out to be a fairly interesting little box:\nLinux 5.4.213 Vendor-flavored OpenWrt 18.06-SNAPSHOT 4 ARMv7 cores Roughly 443 MB of RAM The hardware identified itself as Qualcomm Technologies, Inc. 
IPQ5332/AP-MI04.1-C2, which is a lot more useful than the marketing label when you are trying to figure out which binaries might run on it.\nThe good news was RAM. The bad news was storage.\nThe Real Constraint Was Storage The router\u0026rsquo;s root filesystem was mounted as read-only squashfs, and it was already full:\n/dev/mtdblock25 on / type squashfs (ro,noatime) The writable persistent area lived under /data, and there was only about 7.4 MB free there.\nThat immediately ruled out the obvious approach of treating the router like a normal Linux box and installing a bunch of packages into persistent flash. PicoClaw itself is lightweight at runtime, but the distribution artifacts are not tiny:\npicoclaw_Linux_armv7.tar.gz: about 19.6 MB picoclaw: about 26.6 MB picoclaw-launcher: about 19.5 MB picoclaw-launcher-tui: about 7.3 MB So the challenge was not \u0026ldquo;can this CPU run PicoClaw?\u0026rdquo; It clearly can. The challenge was \u0026ldquo;where do I put the binaries?\u0026rdquo;\nWhy PicoClaw Was a Good Fit Anyway The thing that made this workable was the way PicoClaw is packaged.\nAt the time of writing on April 7, 2026, the latest PicoClaw release was v0.2.5, and it included an official Linux_armv7 tarball. That meant I did not need to build Go on the router, cross-compile manually, or fight with a vendor toolchain.\nEven better, the binaries are self-contained enough that they run happily in a stripped-down BusyBox/OpenWrt environment.\nIn other words: this is exactly the sort of workload that benefits from a single Go binary.\nI do maintain my own fork of picoclaw, with some bells and whistles inspired by other agents (OpenClaw, Hermes) and hardened much more than regular picoclaw. If PicoClaw could run there, my assistant could too.\nThe Trick: Use RAM for Binaries, Flash for Config Instead of trying to persist everything, I split the setup into two parts:\nPut the PicoClaw binaries in /tmp, which is backed by RAM and has plenty of space. 
Keep only the configuration in /root/.picoclaw, which on this router is backed by persistent storage under /data. That gave me the best of both worlds:\nenough space to unpack and run the binaries persistence for config and workspace no need to remount or rewrite the firmware rootfs I briefly considered restoring a proper writable root overlay, but for this use case it was unnecessary risk. The vendor firmware was already using separate writable areas and bind mounts for persistent state. I did not need a \u0026ldquo;normal\u0026rdquo; Linux rootfs to make PicoClaw work.\nMinimal Userspace Improvements Since / is read-only, I created a tiny extra userspace under /data/opkg and installed a couple of things there:\ntmux bash That let me launch PicoClaw in a detached terminal session and keep it running independently of my SSH connection.\nI also added a small dopkg alias so packages can be managed against /data/opkg without pretending the router has a normal writable root.\nWhat Actually Worked The core setup ended up being surprisingly short.\nFirst, fetch the official ARMv7 release:\ncurl -fL -o /tmp/picoclaw_Linux_armv7.tar.gz \\ https://github.com/sipeed/picoclaw/releases/download/v0.2.5/picoclaw_Linux_armv7.tar.gz curl -fL -o /tmp/picoclaw_checksums.txt \\ https://github.com/sipeed/picoclaw/releases/download/v0.2.5/checksums.txt cd /tmp \u0026amp;\u0026amp; sha256sum -c picoclaw_checksums.txt --ignore-missing mkdir -p /tmp/picoclaw tar xzf /tmp/picoclaw_Linux_armv7.tar.gz -C /tmp/picoclaw Then initialize the persistent config:\n/tmp/picoclaw/picoclaw onboard That created:\n/root/.picoclaw/config.json /root/.picoclaw/.security.yml /root/.picoclaw/workspace/... 
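Before relying on that layout, it is worth sanity-checking which filesystem backs each half of the split: the binaries should sit on RAM-backed tmpfs, and the fresh config on persistent flash. A minimal check, assuming the mount layout described above (device and mount names are from my unit and will differ on other firmwares):

```shell
# /tmp should report as tmpfs (RAM-backed), while /root should resolve
# to the writable flash area under /data on this firmware.
df /tmp /root
```

If /root unexpectedly reports as a RAM filesystem, the config would not survive a reboot and belongs directly under /data instead.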
The nice part is that the initial config footprint was tiny, around 100 KB.\nFinally, start the web launcher in tmux:\ntmux new-session -d -s picoclaw \\ \u0026#39;/tmp/picoclaw/picoclaw-launcher -console -public -no-browser /root/.picoclaw/config.json\u0026#39; After that, the Web UI came up on port 18800 and was reachable on the router\u0026rsquo;s LAN IP.\nRuntime Footprint This is where the experiment got fun.\nThe launcher\u0026rsquo;s virtual memory size looked large in ps, which is common and not very useful on Linux. The number that mattered was resident memory:\nVmRSS: about 13.5 MB That is perfectly reasonable on a router with roughly 443 MB RAM.\nSo in practice, PicoClaw was not stressing the device at all. It really does what it says in the README. You can run it on some seriously low-powered devices.\nMaking It Reusable Since the binaries live in /tmp, they disappear on reboot. That is acceptable for testing, but annoying for daily use.\nTo make the setup repeatable, I added three tiny helper scripts under /data:\npicoclaw-start.sh picoclaw-stop.sh picoclaw-status.sh The start script does three things:\nDownloads the latest ARMv7 PicoClaw release into /tmp if the binaries are missing. Initializes config if /root/.picoclaw/config.json does not exist. Starts the launcher in a detached tmux session. That keeps the persistent storage requirements extremely small while still making the setup feel \u0026ldquo;installed\u0026rdquo;.\nThe next obvious step would be wiring picoclaw-start.sh into the router\u0026rsquo;s existing @reboot flow so the Web UI comes back automatically after a reboot.\nPostscript: Tailscale Fit the Same Pattern After getting PicoClaw working, I ended up applying the exact same idea to remote access.\nI wanted the router reachable over Tailscale without depending on its LAN IP or the vendor SSH setup. 
The shape of the problem was almost identical to PicoClaw:\nthe root filesystem was still read-only persistent flash was still tiny RAM-backed /tmp still had plenty of room The router did have opkg, and it even exposed tailscale and tailscaled packages from an OpenWrt feed. But that path turned out to be a bad fit:\nthe feed was serving an old 1.24.2 build tailscale wanted kmod-tun as a package dependency even though /dev/net/tun already existed on this kernel a normal package install would still have been awkward with only about 7 MB free in persistent storage So I used the official static ARM tarball instead and kept the same split:\nbinaries in /tmp state in /data/tailscale The core launch sequence looked like this:\nmkdir -p /data/tailscale /tmp/tailscale curl -fL -o /tmp/tailscale/tailscale.tgz \\ https://pkgs.tailscale.com/stable/tailscale_1.96.4_arm.tgz tar xzf /tmp/tailscale/tailscale.tgz -C /tmp/tailscale start-stop-daemon -S -b -m -p /tmp/tailscaled.pid \\ -x /tmp/tailscale/tailscale_1.96.4_arm/tailscaled -- \\ --state=/data/tailscale/tailscaled.state \\ --socket=/tmp/tailscaled.sock /tmp/tailscale/tailscale_1.96.4_arm/tailscale \\ --socket=/tmp/tailscaled.sock \\ up --accept-dns=false --hostname=xiaomi-picoclaw That gives the router a stable Tailscale identity while keeping the actual binaries ephemeral.\nTo make it survive reboot, I used a tiny /data/tailscale-start.sh bootstrap script. On this firmware, /etc is a RAM filesystem, so dropping something into rc.local is not actually a persistent solution. Instead, I added a line to the vendor\u0026rsquo;s persistent init script under /data that sources the bootstrap on boot.\nOnce the node was enrolled, I enabled Tailscale SSH:\n/tmp/tailscale/tailscale_1.96.4_arm/tailscale --socket=/tmp/tailscaled.sock set --ssh That was the nicer finish to the whole experiment. The router no longer needed password-based SSH over the LAN. 
With Tailscale SSH enabled, I could connect to it as:\ntailscale ssh root@xiaomi-picoclaw So the end result was not just \u0026ldquo;PicoClaw runs on a weird router.\u0026rdquo; It was \u0026ldquo;PicoClaw runs on a weird router that is also privately reachable from anywhere, without opening anything to the public internet.\u0026rdquo;\nWhat I Did Not Do A few things were intentionally left alone:\nI did not try to force the squashfs root filesystem to become writable (although, I am now curious how that\u0026rsquo;s done) I did not build PicoClaw from source on-device. I did not enable Docker, chroots, or anything container-like. Just not enough space anyway. All of those are possible rabbit holes, but they were unnecessary for now.\nLimitations This setup works, but it has some clear constraints:\nThe PicoClaw binaries are not persistent unless they are re-downloaded on boot. Internal flash is far too small for a larger userspace or heavy tooling. You still need to configure an LLM provider in PicoClaw before it becomes truly useful. This is much better as a tiny appliance than as a general-purpose server. Tailscale SSH still depends on having the right tailnet policy, because access is controlled by Tailscale identity rather than the router\u0026rsquo;s local password database. That said, those limitations are perfectly acceptable for a router-sized AI box.\nWhy I Like This Kind of Hack There is something satisfying about making inexpensive hardware do more than the manufacturer intended.\nRouters are especially interesting because they already have:\nreliable networking low idle power consumption always-on operation decent ARM CPUs enough RAM for lightweight services Once you stop thinking of them purely as \u0026ldquo;routers\u0026rdquo; and start thinking of them as \u0026ldquo;small Linux systems with very weird storage layouts\u0026rdquo;, a lot of possibilities open up.\nPicoClaw fits that mindset very well. 
It is lightweight enough to run comfortably, packaged well enough to avoid toolchain pain, and flexible enough to be useful even on constrained hardware.\nFinal Thoughts In the end, getting PicoClaw running on this Xiaomi router was less about raw compute and more about respecting the constraints of embedded Linux.\nFor a device that started life as a locked-down networking appliance, that is a pretty good outcome.\nIf I keep pushing this further, the next steps will probably be:\nauto-starting PicoClaw after reboot trimming more of the Xiaomi-specific services moving larger artifacts to external storage if I want a more permanent install advertising the LAN subnet through Tailscale so the router can double as a tiny subnet router For now, though, I have a router that can host an AI assistant. That is already a pretty fun place to stop.\nCredits to Codex and Amp. It would have taken me a long time banging my head against a Linux book without you. And I\u0026rsquo;d have never written this post.\n","permalink":"https://www.adityakonarde.com/posts/picoclaw-on-xiaomi-b3600-pro-router/","summary":"\u003cp\u003eI am a bit of a Xiaomi fan, so when I spotted the B3600 Pro I bought it without a strong reason. The UI was entirely in Chinese, and my Vodafone station was already good enough for daily use, so the router ended up sitting on a shelf collecting dust.\u003c/p\u003e\n\u003cp\u003eMeanwhile, I have been running a fork of \u003ca href=\"https://github.com/sipeed/picoclaw\"\u003ePicoClaw\u003c/a\u003e, a lightweight Go-based AI assistant, as my personal assistant for a while now. I have been genuinely impressed with how capable it is, so one day I looked at the idle router and thought: can I get PicoClaw running on that thing? Armed with my set of AI tools, I set out on an adventure.\u003c/p\u003e","title":"Getting PicoClaw Running on a Xiaomi B3600 Pro WiFi Router"},{"content":"This was KubeCon week, and it showed. 
Amsterdam became the center of gravity for cloud-native infrastructure, with announcements ranging from NVIDIA donating its GPU DRA driver to the CNCF, to Kubernetes 1.35\u0026rsquo;s in-place pod resize graduating to stable. Meanwhile, the AI world kept shipping \u0026ndash; Google dropped Gemini 3.1 Flash Live, MiniMax open-sourced a massive hybrid-attention reasoning model, and OpenTelemetry quietly cemented profiling as the fourth observability signal. It was one of those weeks where you could feel the industry shifting under your feet.\nAI and Machine Learning Google launches Gemini 3.1 Flash Live for real-time audio AI \u0026ndash; Google released Gemini 3.1 Flash Live, its highest-quality audio model designed for natural, low-latency voice interactions. The model improves precision in real-time dialogue and is available through the Gemini Live API in Google AI Studio. Developers can build voice agents that handle complex tasks more reliably, and all audio output is watermarked to combat misinformation. This is a clear signal that the voice-agent space is heating up fast. Source\nMiniMax open-sources M1, a hybrid-attention reasoning model with 1M token context \u0026ndash; MiniMax released M1, which they call the first open-source, large-scale, hybrid-attention reasoning model. The standout feature is a 1 million token context window \u0026ndash; matching Gemini 2.5 Pro and 8x larger than DeepSeek R1. The model uses a proprietary Lightning Attention mechanism that requires roughly 30% of the compute of DeepSeek R1 for deep reasoning tasks. For teams doing long-context work at scale, this changes the cost equation significantly. Source\nNVIDIA unveils Nemotron 3 agentic model stack at GTC 2026 \u0026ndash; NVIDIA introduced its Nemotron 3 family at GTC, a coordinated set of models designed to work together as a unified agentic stack. 
This includes Nemotron 3 Super (a hybrid MoE model activating 12B parameters per pass for long-context reasoning), along with specialized models for content safety, voice chat, and multimodal RAG. The approach of shipping purpose-built models that compose into an agentic system, rather than one monolithic model, is an interesting architectural bet. Source\nOpenAI\u0026rsquo;s GPT-5.3 Instant continues rolling out with measurable hallucination reduction \u0026ndash; While the initial launch was on March 3, GPT-5.3 Instant continued its wider rollout this month as the default ChatGPT model. The key numbers: 26.8% fewer hallucinations with web search, 19.7% fewer without. OpenAI was unusually candid about the previous model\u0026rsquo;s tone problems, acknowledging it could \u0026ldquo;feel cringe.\u0026rdquo; The model is now available on both ChatGPT and the API, with Thinking and Pro variants expected soon. Source\nSite Reliability Engineering NVIDIA donates GPU DRA driver to CNCF at KubeCon Europe \u0026ndash; The biggest KubeCon announcement: NVIDIA is donating its Dynamic Resource Allocation (DRA) Driver for GPUs to the CNCF, moving it from vendor-governed to full community ownership under the Kubernetes project. This is a watershed moment for AI infrastructure on Kubernetes. GPU scheduling has been the wild west of specialized tooling and vendor lock-in \u0026ndash; having a standardized, community-maintained driver changes the game for anyone running ML workloads on K8s. NVIDIA also introduced GPU support for Kata Containers for confidential computing. Source\nKubernetes 1.35 in-place pod resize graduates to stable \u0026ndash; After six years and four release cycles, in-place pod resource resize is finally stable in Kubernetes 1.35. The kubelet now applies CPU and memory changes directly through the cgroup layer while the container keeps running \u0026ndash; no restart required. 
This removes the \u0026ldquo;restart tax\u0026rdquo; that has made Vertical Pod Autoscaler impractical for stateful workloads like PostgreSQL, Redis, and Kafka. Platform teams that shelved VPA can now revisit it seriously. Source\nBroadcom donates Velero to CNCF Sandbox, ships VKS 3.6 \u0026ndash; Broadcom moved Velero, the widely-used Kubernetes backup and migration tool, into the CNCF Sandbox for vendor-neutral governance. Alongside this, they shipped vSphere Kubernetes Service 3.6 with Kubernetes 1.35 support, RHEL 9 compatibility, and declarative performance tuning via TuneD profiles. The Velero donation is particularly significant \u0026ndash; it\u0026rsquo;s one of those projects that many teams depend on but worried about single-vendor control. Source\nSUSE launches agentic AI platform for Kubernetes operations \u0026ndash; SUSE announced that Rancher Prime is evolving into what they call the industry\u0026rsquo;s first context-aware Agentic AI Ecosystem at KubeCon EU. Their AI Assistant \u0026ldquo;Liz\u0026rdquo; expands into a crew of specialized agents for Linux, observability, security, provisioning, and fleet management. While \u0026ldquo;AI for Kubernetes ops\u0026rdquo; is becoming a crowded space, SUSE\u0026rsquo;s approach of embedding agents directly into the platform management layer rather than bolting them on is worth watching. Source\nObservability OpenTelemetry Profiles signal enters alpha \u0026ndash; profiling becomes the fourth pillar \u0026ndash; The OpenTelemetry Profiles signal has officially reached public alpha, establishing profiling as the fourth observability signal alongside logs, metrics, and traces. Elastic donated its eBPF-based continuous profiling agent to OTel, which provides whole-system visibility across applications and runtimes with minimal overhead. For SREs and developers, this means you can now correlate performance bottlenecks with traces and metrics in a single, vendor-neutral pipeline. This is a big deal. 
Source\nGoogle Cloud goes all-in on OpenTelemetry for metrics ingestion \u0026ndash; Google Cloud now supports OTLP format for metrics alongside traces and logs in Cloud Monitoring, completing their observability stack\u0026rsquo;s OpenTelemetry integration. The update includes delta-type metrics, exponential histograms, and expanded naming conventions. Google has been methodically rebuilding its observability stack around OTel since September 2025 \u0026ndash; deprecating proprietary agents and pointing developers to OTel packages. When a hyperscaler makes this kind of commitment, the signal is clear: OTel is the standard. Source\nGrafana patches critical RCE vulnerability (CVE-2026-27876) \u0026ndash; Grafana released version 12.4.2 along with patches for versions 12.3, 12.2, 12.1, and 11.6 to fix a critical security vulnerability scored at CVSS 9.1. The SQL expressions feature permitted writing arbitrary files to the filesystem, enabling remote code execution with just Viewer-level permissions. If you run Grafana with SQL expressions enabled, patch immediately. Grafana Cloud and major managed providers (Amazon Managed Grafana, Azure Managed Grafana) have already been patched. Source\nOTTL context inference lands in the OpenTelemetry Filter Processor \u0026ndash; Starting with collector-contrib v0.146.0, the OpenTelemetry Filter Processor supports context inference through new top-level config fields: trace_conditions, metric_conditions, log_conditions, and profile_conditions. This removes the need to manually organize filtering rules into OTTL context blocks. It\u0026rsquo;s a quality-of-life improvement that makes Collector configurations significantly less error-prone, especially for teams managing complex filtering pipelines. Source\nQuick Links Grafana Cloud Attribution Alerts GA \u0026ndash; Set alerts scoped to specific teams, services, and environments using cost attribution labels. 
Grafana Labs CiliumCon at KubeCon EU \u0026ndash; Cilium 1.19 sessions covered flow aggregation, scaling Tetragon policies, and replacing legacy hardware load balancers. Cilium celebrates 10 years since first commit. CNCF Blog AWS pledges $3M in cloud credits to CNCF for 2026 \u0026ndash; Sustaining open source infrastructure that powers the Kubernetes community. AWS Containers Blog Microsoft at KubeCon EU \u0026ndash; Updates across multi-cluster operations, networking, observability, storage, and cluster lifecycle. Microsoft Open Source Blog OpAMP for managing OpenTelemetry at scale \u0026ndash; The Open Agent Management Protocol provides standardized remote management for OTel Collector fleets. Dotan Horovits on Medium Crossplane sessions at KubeCon EU \u0026ndash; End-user stories from major financial institutions and global tech companies on API-driven infrastructure and self-healing platforms. Crossplane Blog My Take The thread running through this week is standardization eating the world. NVIDIA donating the GPU DRA driver, Google rebuilding its observability stack on OpenTelemetry, Broadcom handing Velero to the CNCF, OTel Profiles reaching alpha \u0026ndash; these are all moves toward shared, community-owned interfaces replacing proprietary ones.\nWhat makes this moment different from previous \u0026ldquo;open standard\u0026rdquo; waves is that it\u0026rsquo;s happening simultaneously across AI infrastructure, reliability tooling, and observability. GPU scheduling is getting standardized just as the compute demands of AI workloads are exploding. Profiling is becoming a first-class OTel signal just as teams need deeper visibility into AI inference costs. 
Kubernetes in-place resize is going stable just as stateful AI workloads make restart-free resource adjustment critical.\nThe organizations that will move fastest aren\u0026rsquo;t the ones with the biggest budgets \u0026ndash; they\u0026rsquo;re the ones that bet early on these open standards and built their platforms around composable, vendor-neutral tooling. If you\u0026rsquo;re still running proprietary agents, locked into a single observability vendor, or manually managing GPU resources, this was the week that the gap between you and the leaders got measurably wider.\nThanks for reading this week\u0026rsquo;s roundup. If something here caught your eye or I missed a story you think deserves attention, I\u0026rsquo;d love to hear about it \u0026ndash; reach out on LinkedIn. See you next week.\n","permalink":"https://www.adityakonarde.com/posts/week-in-review-ai-sre-observability-mar-20-27-2026/","summary":"\u003cp\u003eThis was KubeCon week, and it showed. Amsterdam became the center of gravity for cloud-native infrastructure, with announcements ranging from NVIDIA donating its GPU DRA driver to the CNCF, to Kubernetes 1.35\u0026rsquo;s in-place pod resize graduating to stable. Meanwhile, the AI world kept shipping \u0026ndash; Google dropped Gemini 3.1 Flash Live, MiniMax open-sourced a massive hybrid-attention reasoning model, and OpenTelemetry quietly cemented profiling as the fourth observability signal. It was one of those weeks where you could feel the industry shifting under your feet.\u003c/p\u003e","title":"Week in Review: AI, SRE \u0026 Observability -- March 20-27, 2026"},{"content":"GTC week is always loud, but this one hit different. NVIDIA unveiled a trillion-dollar infrastructure roadmap while OpenAI quietly absorbed one of Python\u0026rsquo;s most beloved open-source teams. Meanwhile, the SRE world went all-in on agentic AI, with PagerDuty, Komodor, and Microsoft all shipping agent-driven incident response features in the same week. 
And on the observability front, OpenTelemetry made a significant architectural decision that\u0026rsquo;ll ripple through every tracing backend for years. Buckle up.\nAI \u0026amp; Machine Learning NVIDIA GTC 2026: Vera Rubin, $1T in demand, and the agentic AI stack Jensen Huang\u0026rsquo;s keynote was the main event. NVIDIA unveiled the Vera Rubin platform, a rack-scale supercomputer integrating seven co-designed chips built for the agentic inference era. The headline number: $1 trillion in projected demand for Blackwell and Vera Rubin platforms through 2027, double last year\u0026rsquo;s forecast. Alongside Vera Rubin, NVIDIA launched OpenClaw, an open-source agentic AI framework Huang compared to \u0026ldquo;Windows for AI agents,\u0026rdquo; and Dynamo 1.0 for inference orchestration. The message is clear: NVIDIA sees inference, not training, as the next mega-cycle. NVIDIA GTC 2026 coverage (Data Center Knowledge)\nOpenAI acquires Astral: uv, Ruff, and ty join the Codex team In a move that sent ripples through the Python ecosystem, OpenAI announced it will acquire Astral, the company behind uv (package manager), Ruff (linter/formatter), and ty (type checker). The Astral team joins OpenAI\u0026rsquo;s Codex group, which now has over 2 million weekly active users. Both sides promise the tools will remain open source, but the Hacker News thread hit 757 points and 475 comments in hours, with the community mood landing somewhere between cautious and anxious. The real question: is this about the talent (including BurntSushi of ripgrep fame) or the products? Likely both. Astral announcement | Simon Willison\u0026rsquo;s analysis\nClaude Opus 4.6 finds 22 Firefox vulnerabilities in two weeks Anthropic\u0026rsquo;s Claude Opus 4.6 discovered 22 security vulnerabilities in Firefox, 14 of which earned high-severity classifications. That\u0026rsquo;s nearly 20% of all high-severity Firefox bugs patched throughout 2025, found in just two weeks. 
Mozilla validated the findings and shipped fixes in Firefox 148. Unlike typical AI-generated bug reports (which Mozilla\u0026rsquo;s engineers described as \u0026ldquo;garbage\u0026rdquo;), these came with minimal test cases, detailed proofs of concept, and candidate patches. The implications cut both ways: AI-powered security auditing is becoming genuinely useful, but the same capabilities could accelerate exploit development. InfoQ coverage\nWhite House AI regulation framework imminent The White House is expected to release an AI regulation framework within days, including preemption of state laws. This will kick off congressional negotiations on bill text. For AI practitioners, this could mean new compliance requirements for model deployment and usage. The details are still behind closed doors, but the timing, during GTC week, is likely not coincidental. Punchbowl News\nSite Reliability Engineering Anthropic\u0026rsquo;s SRE team on using Claude for incident response (and why it\u0026rsquo;s not enough) At QCon London, Alex Palcuie from Anthropic\u0026rsquo;s AI reliability engineering team gave a remarkably honest talk about using Claude for incident response. Key takeaway: Claude is fantastic at the \u0026ldquo;observe\u0026rdquo; and \u0026ldquo;orient\u0026rdquo; phases, searching logs at I/O speed and summarizing vast amounts of telemetry data. But it consistently mistakes correlation for causation during the \u0026ldquo;decide\u0026rdquo; phase. Palcuie\u0026rsquo;s team exists specifically to keep Claude running, and they\u0026rsquo;re actively hiring, which he noted is proof enough that AI hasn\u0026rsquo;t replaced SREs. \u0026ldquo;It would be hypocritical to say that Claude fixes everything,\u0026rdquo; he said. A refreshingly grounded perspective. 
The Register\nPagerDuty ships agentic SRE with virtual responders and MCP integration PagerDuty\u0026rsquo;s Spring 2026 release evolves its SRE Agent into a virtual responder that can be embedded directly into on-call schedules and escalation policies. The agent handles detection, triage, and initial diagnostics before escalating to humans. Notable additions include deeper Slack-native incident workflows and a multi-agent ecosystem built on Model Context Protocol (MCP). This is one of the clearest signals yet that the industry is treating AI agents as first-class participants in operational workflows, not just assistants. Analysis (Efficiently Connected)\nKomodor launches multi-agent extensibility framework for Kubernetes SRE Komodor unveiled a new extensibility framework for its Klaudia AI platform, enabling organizations to combine their own tools and agents with Komodor\u0026rsquo;s 50+ specialized agents for Kubernetes troubleshooting. The system coordinates agents working in parallel across Kubernetes, GPUs, networking, and storage layers, mirroring how human SRE teams actually work during complex incidents. The pitch: multi-domain incidents investigated at machine speed. Komodor blog\nAzure SRE Agent hits general availability Microsoft\u0026rsquo;s Azure SRE Agent, which continuously observes telemetry, correlates incidents with recent changes, and assists with remediation, went GA this week after several months in public preview. Unlike traditional AIOps tools, it operates as a genuine agentic system integrated natively with incident management workflows. Elastic published a same-day integration guide showing how to pair it with Elasticsearch for higher-fidelity data foundations. Azure SRE Agent overview | Elastic integration guide\nObservability OpenTelemetry deprecates the Span Events API This is a big architectural decision. 
OpenTelemetry officially announced the deprecation of the Span Events API, moving toward a unified model where events are logs correlated with spans via context. The rationale: having two overlapping ways to emit events (span events and log-based events) created duplicate concepts, split guidance for instrumentation authors, and slowed evolution of the event model. Existing span event data will keep working, but new code should use the Logs API. If you maintain OTel instrumentation, start planning the migration now. OpenTelemetry blog\nGrafana Labs releases 2026 Observability Survey: open standards win, AI welcomed cautiously Grafana Labs published its fourth annual Observability Survey, the largest yet with 1,300+ respondents across 76 countries. The headline stats: 77% say open source and open standards are important to their observability strategy, 92% see value in AI for anomaly detection, but only 77% trust AI to take autonomous actions, and 15% don\u0026rsquo;t trust it at all. Half of organizations now use observability tools for business metrics, not just infrastructure. The survey confirms what many practitioners feel: OpenTelemetry is becoming the default, and the industry is shifting from vendor-locked to open and portable. Grafana Labs survey results | Press release\nKubernetes attributes reach release candidate in OTel Semantic Conventions The Kubernetes Semantic Conventions SIG promoted Kubernetes attributes to release candidate status in OpenTelemetry. This is the culmination of months of focused work aligning with the Collector SIG\u0026rsquo;s goal to stabilize the k8sattributes processor. Users can try the new schema via feature gates and provide feedback before the final stable release. For teams running OTel Collectors in Kubernetes environments, this is a meaningful step toward production-grade stability. 
OpenTelemetry blog\nSignal Studio: a dry-run mode for the OpenTelemetry Collector Canonical published a deep dive on Signal Studio, a new tool that adds a diagnostic \u0026ldquo;plan\u0026rdquo; mode to OpenTelemetry Collectors. Think terraform plan but for telemetry pipelines. It combines static configuration analysis with live metrics and an ephemeral OTLP tap to evaluate filter behavior against observed traffic. For anyone who has nervously edited Collector YAML in production and crossed their fingers, this addresses a real gap in the open-source observability toolchain. Canonical blog\nAlibaba Cloud and Datadog release OpenTelemetry Go auto-instrumentation tool Alibaba Cloud and Datadog jointly released an open-source OpenTelemetry Go automatic instrumentation tool that uses compile-time injection to enable zero-code tracing. Go\u0026rsquo;s static compilation has long made automatic instrumentation difficult compared to Java\u0026rsquo;s bytecode enhancement. The tool, donated to the OpenTelemetry community as opentelemetry-go-compile-instrumentation, intercepts the Go compiler via -toolexec to analyze and modify code before compilation. The first preview version (v0.1.0) is available now. Dev.to writeup\nQuick Links OpenTelemetry Collector v0.148.0 released with breaking changes including removal of the SAPM exporter and k8slog receiver. Release notes Google launches multi-cluster GKE Inference Gateway for model-aware load balancing across regions and clusters. Google Cloud blog Kubernetes 1.36 preview: expected April 22, with DRA improvements, Gateway API updates, and the ingress-nginx retirement. Cloud Native Now Kubernetes image promoter (kpromo) rewrite shipped silently: 20% less code, dramatically faster, zero user-visible changes. Kubernetes blog KubeCon EU 2026 co-located events announced including Platform Engineering Day with increased focus on AI within platform engineering. 
CNCF blog Red Hat\u0026rsquo;s OpenShift positioning as the enterprise hybrid AI platform at KubeCon EU. SiliconANGLE My Take The theme this week is convergence. AI agents are no longer experimental add-ons; they\u0026rsquo;re becoming first-class participants in operational workflows. PagerDuty puts them on on-call schedules. Komodor orchestrates them like parallel SRE teams. Azure gives them GA status with native incident management integration. And Anthropic\u0026rsquo;s own SRE team honestly admits they\u0026rsquo;re useful but flawed, a degree of self-awareness the rest of the industry should take to heart.\nMeanwhile, the observability world is making the kind of bold, breaking-change decisions that signal maturity. Deprecating span events in favor of unified log-based events is the right call architecturally, even if it\u0026rsquo;ll cause short-term pain for instrumentation maintainers. Combined with Kubernetes semantic conventions reaching RC and the Grafana survey confirming that 77% of practitioners now anchor their strategies around open standards, the trajectory is clear: OpenTelemetry is becoming the lingua franca of operational telemetry, and everyone is building around it.\nThe NVIDIA keynote and the Astral acquisition share a common thread too: the infrastructure layer for AI is consolidating fast. Whether it\u0026rsquo;s silicon (Vera Rubin), software (OpenClaw), or developer tooling (uv, Ruff joining Codex), the companies with the capital are assembling full-stack AI platforms. For practitioners, this means more powerful tools, but also more concentration of control. Worth watching closely.\nWhat caught your eye this week? I\u0026rsquo;d love to hear your thoughts: LinkedIn\nIf you found this useful, share it with your team and subscribe for next week\u0026rsquo;s roundup.\n","permalink":"https://www.adityakonarde.com/posts/week-in-review-2026-03-14-20/","summary":"\u003cp\u003eGTC week is always loud, but this one hit different. 
NVIDIA unveiled a trillion-dollar infrastructure roadmap while OpenAI quietly absorbed one of Python\u0026rsquo;s most beloved open-source teams. Meanwhile, the SRE world went all-in on agentic AI, with PagerDuty, Komodor, and Microsoft all shipping agent-driven incident response features in the same week. And on the observability front, OpenTelemetry made a significant architectural decision that\u0026rsquo;ll ripple through every tracing backend for years. Buckle up.\u003c/p\u003e","title":"Week in Review: AI, SRE \u0026 Observability — March 14–20, 2026"},{"content":"This was a week where \u0026ldquo;agentic\u0026rdquo; stopped being a buzzword and started showing up in architecture diagrams. NVIDIA dropped a model built specifically for multi-agent workflows, observability vendors raced to give AI agents direct access to production telemetry via MCP, and the cloud-native ecosystem quietly matured with a new CNCF graduation and a Kubernetes release preview that finally lets you scale to zero. If you build, run, or monitor software at scale, there\u0026rsquo;s something here for you.\n🤖 AI \u0026amp; Machine Learning NVIDIA Releases Nemotron 3 Super: A 120B-Parameter Open Model Built for Agentic AI NVIDIA launched Nemotron 3 Super, a 120B total / 12B active-parameter hybrid Mamba-Transformer MoE model with a native 1M-token context window. The architecture is purpose-built to address two pain points in multi-agent systems: the \u0026ldquo;context explosion\u0026rdquo; (agents generating up to 15x more tokens than standard chat) and the \u0026ldquo;thinking tax\u0026rdquo; (reasoning at every step being too expensive with large models). It delivers 5x higher throughput than the previous Nemotron Super and ships fully open with weights, datasets, and training recipes. 
Companies like CodeRabbit, Perplexity, Palantir, and Siemens are already integrating it.\nThinking Machines Lab Lands Massive NVIDIA Compute Partnership Mira Murati\u0026rsquo;s year-old startup Thinking Machines Lab secured a multi-year strategic partnership with NVIDIA, including a significant investment and a commitment to deploy at least one gigawatt of next-gen Vera Rubin systems. For context, a facility at that scale could cost around $50 billion. The startup, now valued at over $12B, is focused on customizable AI models rather than another chatbot play. This is one of the clearest signals yet that the \u0026ldquo;next wave\u0026rdquo; of AI labs is competing on infrastructure access as much as model quality.\nGoogle Ships TensorFlow 2.21 with LiteRT Going Production TensorFlow 2.21 officially graduates LiteRT from preview to production, replacing TFLite as the universal on-device inference framework. Key numbers: 1.4x faster GPU performance than TFLite, new NPU acceleration, and first-class PyTorch/JAX model conversion support. If you\u0026rsquo;re deploying models to edge devices, this is a meaningful upgrade path. The addition of lower-precision data types (int8, int16, int4) across multiple operators is also worth noting for efficiency-conscious teams.\nDatabricks Launches Genie Code: Agentic Engineering for Data Work Genie Code is Databricks\u0026rsquo; new autonomous AI agent that handles pipeline building, debugging, dashboard shipping, and production maintenance. On real-world data science tasks, Databricks claims it more than doubled the success rate of leading coding agents. Alongside this, Databricks acquired Quotient AI for continuous evaluation of AI agents. 
This is the \u0026ldquo;agentic\u0026rdquo; pattern applied to data engineering rather than software engineering, and it\u0026rsquo;s a space to watch.\nAnthropic\u0026rsquo;s Pentagon Lawsuit Draws Industry-Wide Support In a rare show of cross-industry solidarity, 30+ employees from OpenAI and Google DeepMind filed an amicus brief supporting Anthropic\u0026rsquo;s lawsuit against the Pentagon\u0026rsquo;s \u0026ldquo;supply-chain risk\u0026rdquo; designation. Google chief scientist Jeff Dean is among the signatories. The brief argues the government\u0026rsquo;s action is \u0026ldquo;an improper and arbitrary use of power.\u0026rdquo; Regardless of where you stand on AI and defense, this is a significant moment for how AI companies relate to government contracting.\n🔧 Site Reliability Engineering Google Publishes How It Applies SRE Principles to Cybersecurity Google\u0026rsquo;s Cloud team published a detailed blog on applying SRE to security operations. The core insight is unsurprising but well-articulated: SLOs, error budgets, toil elimination, and blameless retrospectives work just as well for security as they do for reliability. The practical takeaway is that security teams should treat detection-and-response as a software problem with measurable service levels, not a reactive firefighting exercise. Worth reading if your security and SRE teams still operate in silos.\nKubernetes v1.36 Preview: HPA Scale-to-Zero Finally Enabled by Default The Kubernetes v1.36 sneak peek (scheduled for April 22) confirms that HPAScaleToZero will be enabled by default after sitting in alpha since v1.16. This means the Horizontal Pod Autoscaler can scale workloads to zero replicas when there\u0026rsquo;s no traffic, then scale back up on demand. If you\u0026rsquo;re running staging, test, or intermittent workloads, this is real cost savings without Knative-style complexity. 
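Concretely, once the HPAScaleToZero gate is active, scale-to-zero is expressed as minReplicas: 0 in an ordinary autoscaling/v2 manifest. A minimal sketch (the workload name and the external metric are illustrative; per the feature design, scaling to zero requires an object or external metric rather than CPU utilization):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: staging-api            # illustrative workload name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: staging-api
  minReplicas: 0               # permitted once HPAScaleToZero is enabled
  maxReplicas: 5
  metrics:
    - type: External           # scale-to-zero needs an object/external metric
      external:
        metric:
          name: requests_per_second   # illustrative external metric
        target:
          type: AverageValue
          averageValue: 10
```

With no traffic the external metric drops to zero and the HPA can park the Deployment at zero replicas; the first request pushes the metric back up and the autoscaler scales out again.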
The release also includes improvements to ephemeral storage and topology-aware scheduling.\nCNCF Graduates Dragonfly for P2P Image and Model Distribution Dragonfly reached CNCF graduated status, the highest maturity level in the CNCF project lifecycle. Originally open-sourced by Alibaba, Dragonfly uses peer-to-peer acceleration for distributing container images, OCI artifacts, and AI models at scale. Production deployments report image pull times dropping from minutes to seconds and up to 90% savings in storage bandwidth. With GenAI models getting larger by the month, having efficient distribution infrastructure is increasingly a reliability concern, not just an optimization.\nCloudflare, Azure, and TikTok Outages Share a Common Root Cause An analysis piece on the early 2026 outages across Cloudflare, Azure, and TikTok highlights a pattern that every SRE should internalize: a single automation action with no blast radius limit caused total propagation before a human could intervene. Three different companies, three different stacks, same architectural flaw. The SREs on those teams reportedly saw the risk. If you have automated workflows that can propagate changes without a ceiling on scope, this is your Monday morning reading.\n🔭 Observability Grafana Labs Signs Five-Year Strategic Collaboration with AWS Grafana Labs and AWS signed a strategic collaboration agreement to accelerate open observability adoption at scale. The five-year deal deepens technical and go-to-market alignment for Grafana Cloud on AWS, with a focus on AI-driven insights, simplified operations, and marketplace access. For teams already running Grafana on AWS, expect smoother integrations. For the broader market, this is another signal that open observability standards are winning the enterprise.\nHoneycomb Ships GA Metrics and Expands MCP for the Agent Era Honeycomb announced GA for its time-series Metrics product alongside expanded MCP integrations. 
The pitch is compelling: traditional metrics platforms force you to choose between data completeness and cost by discarding high-cardinality dimensions. Honeycomb Metrics claims to eliminate that tradeoff. The MCP expansion is the more forward-looking move, giving AI coding agents direct access to structured observability data for autonomous debugging. \u0026ldquo;Observability was built for a world where humans wrote the code and humans read the dashboards. That world is changing fast,\u0026rdquo; said Graham Siener, SVP of Product.\nDatadog MCP Server Goes Generally Available Datadog\u0026rsquo;s MCP Server hit GA, providing AI agents and development tools direct access to live logs, metrics, and traces from the Datadog platform. Compatible with Claude Code, Cursor, Codex, and GitHub Copilot, the server feeds real-time telemetry into AI workflows under existing RBAC controls. The timing of both Honeycomb and Datadog shipping MCP integrations in the same week isn\u0026rsquo;t coincidental. The observability industry is clearly betting that AI agents will be primary consumers of production data, not just humans staring at dashboards.\nElastic Donates Its OpenTelemetry PHP Distribution to the OTel Project Elastic donated its EDOT PHP distribution to the OpenTelemetry project, and a first beta is about to ship. The donation addresses a real gap: many production PHP environments are locked down and can\u0026rsquo;t build native extensions during deployment. The OS-package-first approach (deb, rpm, apk) enables zero-code instrumentation without rebuilding production images. PHP still powers a significant portion of the web, and this makes OTel adoption dramatically easier for those teams.\n🔗 Quick Links Cisco LiveProtect: eBPF-powered network infrastructure security bringing kernel-level protection to network hardware, not just workloads. 
Kubernetes Ingress at a turning point: Platform teams need to rethink ingress architecture as the ecosystem shifts toward Kubernetes Gateway API. KubeCon India 2026 schedule announced: 55 sessions in Mumbai on June 18-19, with AI, observability, and platform engineering as headline tracks. Running Ray at scale on AKS: Microsoft and Anyscale share guidance on multi-cluster GPU orchestration for ML workloads. IBM builds PyTorch-native support for Spyre Accelerator: Extending torch.inductor for dataflow accelerators with tile-based tensor layouts and scratchpad optimization. Google Cloud Threat Horizons Report: Vulnerability exploitation now tops credential abuse as the primary cloud entry vector at 44.5% of intrusions. CRI-O registry mirror auth with K8s secrets: New credential provider enables private mirror authentication using namespace-scoped secrets instead of node-level credentials. 💬 My Take The theme of this week is unmistakable: AI agents are becoming first-class participants in the software lifecycle, and our infrastructure is adapting to accommodate them. NVIDIA builds a model architecture specifically for multi-agent token economics. Datadog and Honeycomb both ship MCP integrations so AI agents can query production telemetry directly. Databricks gives data engineering its own agentic workflow.\nBut here\u0026rsquo;s what I find most interesting: the reliability and observability layers are evolving in lockstep. The Kubernetes ecosystem is maturing in ways that reduce operational toil (HPA scale-to-zero, Dragonfly for efficient distribution), while observability platforms are pivoting from \u0026ldquo;dashboards for humans\u0026rdquo; to \u0026ldquo;structured data for agents.\u0026rdquo; Google\u0026rsquo;s piece on applying SRE principles to security is a reminder that these foundational ideas, SLOs, error budgets, and blameless culture, are portable across domains. The common thread? 
Treat operations as a software problem, and increasingly, let software handle the operations.\nThe early 2026 outage analysis is the cautionary counterpoint: automation without blast radius controls remains one of the fastest paths to a bad day. As we hand more operational agency to AI systems, the guardrails we build around automated actions matter more than ever.\nWhat caught your eye this week? I\u0026rsquo;d love to hear your thoughts: LinkedIn.\nIf you found this useful, consider subscribing for weekly roundups covering AI, SRE, and Observability.\n","permalink":"https://www.adityakonarde.com/posts/week-in-review-ai-sre-observability-2026-03-07-13/","summary":"\u003cp\u003eThis was a week where \u0026ldquo;agentic\u0026rdquo; stopped being a buzzword and started showing up in architecture diagrams. NVIDIA dropped a model built specifically for multi-agent workflows, observability vendors raced to give AI agents direct access to production telemetry via MCP, and the cloud-native ecosystem quietly matured with a new CNCF graduation and a Kubernetes release preview that finally lets you scale to zero. If you build, run, or monitor software at scale, there\u0026rsquo;s something here for you.\u003c/p\u003e","title":"Week in Review: AI, SRE \u0026 Observability — March 7–13, 2026"},{"content":"This was a week where the AI race got tangibly closer to your desktop, the Kubernetes ecosystem said goodbye to an old friend, and the observability world kept tightening its grip around OpenTelemetry as the universal standard. If you only have five minutes, the headlines are: GPT-5.4 can now operate your computer better than most humans, Ingress NGINX is officially done, and Google Cloud now speaks fluent OTLP.\nAI \u0026amp; Machine Learning OpenAI releases GPT-5.4 with native computer use — and it beats human performance. OpenAI\u0026rsquo;s latest frontier model isn\u0026rsquo;t just another benchmark bump. 
GPT-5.4 is the first general-purpose model to ship with production-ready computer use capabilities, scoring 75.0% on OSWorld-Verified desktop tasks — above the 72.4% human expert baseline. It supports up to 1M tokens of context, brings a new \u0026ldquo;reasoning plan preview\u0026rdquo; that lets users steer the model mid-thought, and introduces tool search for navigating large ecosystems of APIs and connectors. Available in ChatGPT (as GPT-5.4 Thinking), Codex, and the API. The agentic future just got a lot more concrete. Source: OpenAI\nGoogle launches Gemini 3.1 Flash-Lite — fastest and cheapest in the Gemini 3 family. At $0.25 per million input tokens and $1.50 per million output tokens, Flash-Lite undercuts most competitors while delivering a 2.5x faster time-to-first-token and 45% faster output speed compared to Gemini 2.5 Flash. It includes configurable \u0026ldquo;thinking levels\u0026rdquo; for developers to balance cost and reasoning depth. Early benchmarks show an Elo of 1432 on Arena.ai and 86.9% on GPQA Diamond. This is Google\u0026rsquo;s play for the high-volume, cost-sensitive inference tier — translation, content moderation, and real-time analytics at scale. Source: Google DeepMind\nMicrosoft open-sources Phi-4-reasoning-vision-15B — a small model that punches way above its weight. Microsoft Research released a 15B-parameter multimodal reasoning model trained on roughly 200B tokens — a fraction of what comparable models require. It handles image captioning, document reading, chart interpretation, and UI grounding, and excels at math and science reasoning. The key insight: a hybrid mix of reasoning and non-reasoning data with explicit mode tokens lets one model deliver fast direct answers for simple tasks and chain-of-thought for hard ones. Available on HuggingFace, GitHub, and Microsoft Foundry under a permissive license. Source: Microsoft Research\nOpenAI ships GPT-5.3 Instant — optimizing the model most people actually use. 
While GPT-5.4 grabbed the spotlight, OpenAI also quietly released GPT-5.3 Instant, refining the lightweight model that handles the majority of everyday ChatGPT traffic. The update improves response quality, conversational flow, and reliability for routine queries. It\u0026rsquo;s a signal that the industry is shifting from pure capability races to infrastructure optimization — making AI cheaper and more reliable at the scale of millions of daily interactions. Source: The Next Web\nSite Reliability Engineering Kubernetes Ingress NGINX reaches end of life — time to migrate. The clock has run out. As of March 2026, the community-maintained Ingress NGINX controller — the de facto standard for routing external traffic into Kubernetes clusters for nearly a decade — is no longer receiving security patches, bug fixes, or compatibility updates. The Kubernetes SIG Network and Security Response Committee announced this retirement back in November 2025, but now it\u0026rsquo;s real. The recommended path forward is the Gateway API, with alternatives like NGINX Gateway Fabric, Kong, Traefik, or HAProxy for teams that need a bridge. If you haven\u0026rsquo;t started your migration planning, Monday morning is the time. Source: Kubernetes Blog\nCNCF: 82% of container users now run Kubernetes in production, and AI workloads are driving convergence. A CNCF blog post this week laid out the case that Kubernetes has become the unified platform for data processing, model training, inference, and AI agent workloads. According to CNCF\u0026rsquo;s January 2026 survey, 66% of organizations hosting generative AI models use Kubernetes for some or all inference workloads. The piece traces three eras of Kubernetes — microservices, data and GenAI, and the current agentic era — and argues the platform is converging around all of them simultaneously. Source: CNCF Blog\nThoughtworks publishes \u0026ldquo;SRE is entering a paradigm shift\u0026rdquo; — a first-principles rethink. 
This thoughtful piece argues that as systems become cognitive and partially autonomous (think AI agents making runtime decisions), the traditional SRE control model — built on the assumption that systems are observable, modelable, predictable, and intervenable — starts to break down. When AI introduces stochastic decision-making into the system itself, observability requirements change fundamentally. Whether you agree with the framing or not, it\u0026rsquo;s the kind of strategic thinking SRE leaders should be engaging with. Source: Thoughtworks\nArgo CD 3.3 ships PreDelete hooks, OIDC background refresh, and shallow clones. The Argo CD 3.3 release candidate fills a long-standing gap: PreDelete hooks that let you run cleanup logic (data backups, external teardown) before application deletion, with deletion blocked if the cleanup fails. Other highlights include background OIDC token refresh (no more Keycloak session timeouts every five minutes), granular RBAC with resource name whitelisting for CRDs, and shallow clone support for large monorepos. If you\u0026rsquo;re running GitOps at scale, this is a meaningful quality-of-life upgrade. Source: Argo Project Blog\nObservability Google Cloud adds full OTLP support for Cloud Monitoring metrics. Google Cloud now accepts metrics in OpenTelemetry Protocol format alongside traces and logs, completing the OTLP trifecta in Cloud Observability. This means you can use OpenTelemetry SDKs and Collectors to send all three signal types through a vendor-agnostic pipeline. New features include delta-type metrics (reducing client-side memory), exponential histograms for dynamic bucket sizing, and expanded naming conventions aligned with OTel semantic conventions. Ingested OTLP metrics are stored like Prometheus data and queryable with existing Monitoring tools. Source: Google Cloud Blog\nAtlassian ships OpenTelemetry traces for Bitbucket Pipelines via webhooks. 
Bitbucket Pipelines now exposes pipeline execution as OTel traces delivered through webhook events. You get structured spans for each run, step, command, and container — including CPU and memory resource metrics — in a standard format ingestible by Grafana, Datadog, Honeycomb, and any OTLP-compatible backend. This is a significant step toward treating CI/CD pipelines as first-class observable infrastructure rather than opaque black boxes. The integration answers questions like \u0026ldquo;which part of this pipeline is slow?\u0026rdquo; and \u0026ldquo;are we hitting resource limits on build containers?\u0026rdquo; Source: Atlassian Blog\nOpenTelemetry Collector gets OTTL context inference in the Filter Processor. Starting with collector-contrib v0.146.0, the Filter Processor supports context inference through four new top-level config fields: trace_conditions, metric_conditions, log_conditions, and profile_conditions. Previously, writing filter rules required understanding the Collector\u0026rsquo;s internal telemetry hierarchy and splitting conditions across distinct context blocks. Now you write flat condition lists and let the Collector infer whether a condition applies to a resource, span, or log context. It\u0026rsquo;s a small but meaningful usability improvement that reduces configuration errors. Source: OpenTelemetry Blog\nThe \u0026ldquo;observability tax\u0026rdquo; conversation heats up — enterprises pivot to OpenTelemetry for cost control. Multiple pieces this week covered the growing frustration with observability costs, with 66% of enterprises reporting unexpected overages on observability tooling. The core argument: standardizing on OpenTelemetry eliminates the duplicated engineering effort of building custom pipelines, reduces vendor lock-in premiums, and lets teams route telemetry data more efficiently. 
Combined with eBPF-based collection (sub-1% CPU overhead) and tiered collector architectures, organizations are finding they can cut costs while improving coverage. Source: IPv6.net\nQuick Links Calico Winter 2026 release adds an AI-powered assistant for natural language cluster troubleshooting and a unified ingress gateway dashboard — Tigera Blog OpenAI releases Symphony, an open-source framework for orchestrating autonomous AI coding agents through structured implementation runs — MarkTechPost Cilium 1.19 released, celebrating 10 years of eBPF-based networking with security hardening, stricter encryption modes, and improved scalability for large clusters — InfoQ Sematext publishes architecture patterns for running OpenTelemetry at scale across hundreds of services, covering collector tiers, load balancing, and multi-cluster setups — Sematext Blog KubeCon EU 2026 is less than three weeks away (March 23-26 in Amsterdam) — if you\u0026rsquo;re heading there, the program is live and registration is still open — CNCF ObservabilityCON on the Road continues its 2026 world tour with Toronto (March 5) done and Sydney (March 10), Tokyo (March 17), and Amsterdam (March 31) coming up — Grafana Labs My Take The thread connecting this week\u0026rsquo;s biggest stories is convergence. GPT-5.4\u0026rsquo;s computer use capabilities and the CNCF\u0026rsquo;s survey data both point to the same reality: AI workloads are no longer a separate world from infrastructure operations. They\u0026rsquo;re running on the same Kubernetes clusters, instrumented with the same OpenTelemetry pipelines, and increasingly making decisions inside systems that SREs are responsible for keeping reliable.\nThoughtworks\u0026rsquo; paradigm shift piece crystallizes the tension. When your system includes AI agents that make stochastic runtime decisions, the traditional SRE contract — that systems should be observable, modelable, and predictable — needs serious rethinking. 
The observability stack has to evolve from \u0026ldquo;tell me what happened\u0026rdquo; to \u0026ldquo;tell me what the AI decided and why.\u0026rdquo; That\u0026rsquo;s a fundamentally harder problem, and it\u0026rsquo;s one reason the push toward standardized, vendor-neutral telemetry (OTLP everywhere, OTel at scale) matters so much. You can\u0026rsquo;t debug cognitive systems with fragmented, proprietary instrumentation.\nMeanwhile, the Ingress NGINX retirement is a reminder that even foundational infrastructure components have a lifecycle. The Kubernetes ecosystem moves fast, and the Gateway API represents a genuinely better abstraction. If you\u0026rsquo;re still planning your migration, the clock isn\u0026rsquo;t ticking anymore — it\u0026rsquo;s already stopped.\nWhat caught your eye this week? I\u0026rsquo;d love to hear your thoughts — find me on LinkedIn or X.\nSubscribe to stay updated on the latest across AI, SRE, and Observability.\n","permalink":"https://www.adityakonarde.com/posts/week-in-review-ai-sre-observability-mar-2-8-2026/","summary":"\u003cp\u003eThis was a week where the AI race got tangibly closer to your desktop, the Kubernetes ecosystem said goodbye to an old friend, and the observability world kept tightening its grip around OpenTelemetry as the universal standard. If you only have five minutes, the headlines are: GPT-5.4 can now operate your computer better than most humans, Ingress NGINX is officially done, and Google Cloud now speaks fluent OTLP.\u003c/p\u003e\n\u003ch2 id=\"ai--machine-learning\"\u003eAI \u0026amp; Machine Learning\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eOpenAI releases GPT-5.4 with native computer use — and it beats human performance.\u003c/strong\u003e OpenAI\u0026rsquo;s latest frontier model isn\u0026rsquo;t just another benchmark bump. 
GPT-5.4 is the first general-purpose model to ship with production-ready computer use capabilities, scoring 75.0% on OSWorld-Verified desktop tasks — above the 72.4% human expert baseline. It supports up to 1M tokens of context, brings a new \u0026ldquo;reasoning plan preview\u0026rdquo; that lets users steer the model mid-thought, and introduces tool search for navigating large ecosystems of APIs and connectors. Available in ChatGPT (as GPT-5.4 Thinking), Codex, and the API. The agentic future just got a lot more concrete.\n\u003ca href=\"https://openai.com/index/introducing-gpt-5-4/\"\u003eSource: OpenAI\u003c/a\u003e\u003c/p\u003e","title":"Week in Review: AI, SRE \u0026 Observability — March 2–8, 2026"},{"content":"The bottleneck in AI-assisted coding isn\u0026rsquo;t the models—it\u0026rsquo;s the interface. After months of experimenting with various AI-powered development tools, I\u0026rsquo;ve landed on a voice-first, multi-agent setup that removes the friction between thought and code.\nFor context, my primary role is an Engineering Manager. I don\u0026rsquo;t write production code at work, but I build apps on the side and experiment with new AI tooling. The challenge: typing instructions slows down iteration, especially when exploring ideas or debugging complex problems.\nIn my previous post on multi-LLM workflows, I described using different AI models for different tasks.\nIn this fast-moving world, my bottleneck was typing requests and instructions. Voice interaction removes that friction. Speaking instructions feels more natural than typing them, and it\u0026rsquo;s helped me iterate faster and overcome writer\u0026rsquo;s block when exploring new ideas.\nMy Current Open Code Setup The Stack My current employer is quite generous with AI tooling, but I try to keep my personal development tooling separate from work. 
Here\u0026rsquo;s what I\u0026rsquo;m running:\nTerminal: Warp Terminal serves as my primary interface with built-in AI agent capabilities Voice Input: System-wide voice dictation integrated with all tools via Wispr Flow Orchids: A tool for generating full-stack apps. Quite good at UI, at least according to UI-Bench OpenCode: A CLI/TUI-based IDE similar to Claude Code, but with support for multiple AI models through OpenRouter. The oh-my-opencode plugin extends it with additional model configurations and workflow automation The other apps are pretty much out-of-the-box, but what\u0026rsquo;s interesting is my multi-model setup for OpenCode.\nThe LLM Frankenstein The debate about which AI model is \u0026ldquo;best\u0026rdquo; misses the point. As professionals, we should use the right tool for each job. I\u0026rsquo;ve been mixing various LLMs, routing tasks to each model based on its strengths. Here\u0026rsquo;s my current setup:\nClaude Opus 4.5: My go-to for complex coding tasks, especially backend engineering and architectural decisions. The quality is unmatched, but cost is the constraint—I use it via OpenRouter API billing, which gets expensive quickly. I reserve it for tasks requiring deep reasoning. Gemini 3 Flash: Best for long-context exploration, documentation generation, and rapid iteration. Its high tokens-per-second and low latency make it ideal when I need to explore large codebases. The trade-off: weaker tool-calling and coding capabilities compared to Opus. GLM 4.7 (Z.AI): Provides practically unlimited tokens at low cost. While quality doesn\u0026rsquo;t match Gemini or Opus, it\u0026rsquo;s perfect for experimentation, simple refactoring, and tasks where I need to iterate without budget concerns. 
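The tiered routing described above boils down to a lookup from task type to model. Here is a minimal sketch of that idea in shell; the model slugs are assumptions based on the names in this post, not my actual OpenCode config:

```shell
# Hypothetical sketch of the routing logic above: map a task type to a model
# slug. The slugs below are placeholders, not exact OpenRouter identifiers.
pick_model() {
  case "$1" in
    architecture|backend) echo "anthropic/claude-opus-4.5" ;;  # deep reasoning
    explore|docs)         echo "google/gemini-3-flash" ;;      # long context, fast
    *)                    echo "z-ai/glm-4.7" ;;               # cheap default
  esac
}

pick_model backend   # -> anthropic/claude-opus-4.5
pick_model docs      # -> google/gemini-3-flash
pick_model refactor  # -> z-ai/glm-4.7
```

The point is not the function itself but the shape of the decision: expensive models are opt-in per task, and everything else falls through to the cheap default.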
If you\u0026rsquo;d like to get the same setup, I provide my OpenCode config in these GitHub gists:\nOpenCode Oh-my-OpenCode Challenges and Solutions Challenge 1: Context Loss in Long-Running Tasks Problem: When using oh-my-opencode, agents sometimes lose track of context during extended sessions. I\u0026rsquo;ve noticed this especially with long-running commands (which they don\u0026rsquo;t know to interrupt) or when the LLM\u0026rsquo;s context window gets closer to full. Solution: Breaking work into smaller, focused sessions helps. Planning mode is another useful approach: you can use Gemini for planning, then switch to Opus for the actual implementation. This pattern reduces context loss while managing costs.\nChallenge 2: Voice Dictation in Shared Spaces Problem: Dictating code instructions feels awkward when others are nearby, and ambient noise can interfere with accuracy. Solution: Using headphones with a directional microphone significantly improves accuracy and reduces self-consciousness. The awkwardness fades once you establish a flow, and most people adapt to speaking code naturally after a few sessions.\nChallenge 3: Cost Management Problem: Running multiple high-quality models simultaneously can get expensive, especially with Opus 4.5. Solution: I use a tiered approach: GLM 4.7 for simple tasks and experimentation (free/low-cost), Gemini Flash for exploration and documentation (moderate cost), and Opus 4.5 only for complex reasoning tasks (high cost). I also configure reasoning effort limits and hard dollar limits in OpenRouter to cap costs on expensive models. This keeps my monthly spend somewhat predictable while maintaining quality where it matters.\nWhat\u0026rsquo;s Next This setup is continuously evolving. Here\u0026rsquo;s what I\u0026rsquo;m exploring:\nQuality-focused agents: Agents that act as automated reviewers and SREs, catching issues before code reaches production. 
The Droid CLI from Factory demonstrates this pattern—I want to bring similar capabilities to open-source tooling.\nProactive agents: Background agents that suggest improvements unprompted, with automatic review gates before creating pull requests. A friend uses multiple AI agents to review PRs. I\u0026rsquo;d love to have some agents running 24/7.\nFaster iteration: Cerebras will soon enable GLM 4.7 in their API, which will provide high-throughput inference. I want to experiment with parallel model execution hitting 1000+ tokens-per-second for rapid iteration cycles.\nTry It Yourself Want to experiment with this workflow? Start simple:\nEnable voice dictation: On macOS, enable system-wide dictation in System Settings \u0026gt; Keyboard \u0026gt; Dictation. On other platforms, use built-in accessibility features or tools like Wispr Flow. Install OpenCode: Follow the installation instructions for OpenCode, then add the oh-my-opencode plugin for multi-model support. Configure your first model: Start with a single model (Gemini Flash is a good starting point for cost and quality balance) via OpenRouter. Try voice-first coding: Speak your next coding instruction instead of typing. Start with simple tasks to build familiarity. Expand gradually: Add additional models as you identify use cases where each excels. At some point, you can look into using Oh-my-OpenCode as a framework. This setup works because it combines the best of each tool without locking me into a single vendor\u0026rsquo;s ecosystem. Using OpenRouter means I can switch between models with a single configuration change, staying nimble as new models emerge.\nThe combination of voice input and multi-model routing removes the friction that used to slow down my side projects. I can iterate faster, explore ideas more freely, and maintain code quality without sacrificing flexibility.\nWhat\u0026rsquo;s your experience with voice coding or multi-agent setups? 
I\u0026rsquo;d love to hear your thoughts: LinkedIn\nThis post was written with AI assistance—voice-to-text conversion and proofreading. The thoughts and setup are my own.\n","permalink":"https://www.adityakonarde.com/posts/ai-frankenstein-development-workflow/","summary":"\u003cp\u003eThe bottleneck in AI-assisted coding isn\u0026rsquo;t the models—it\u0026rsquo;s the interface. After months of experimenting with various AI-powered development tools, I\u0026rsquo;ve landed on a voice-first, multi-agent setup that removes the friction between thought and code.\u003c/p\u003e\n\u003cp\u003eFor context, my primary role is an Engineering Manager. I don\u0026rsquo;t write production code at work, but I build apps on the side and experiment with new AI tooling. The challenge: typing instructions slows down iteration, especially when exploring ideas or debugging complex problems.\u003c/p\u003e","title":"Look ma, I made an AI Frankenstein"},{"content":"After I switched to engineering management, I realized that I missed coding. With my new job, I spend more time on GitHub, and innovation is very much encouraged at the company. As such, I\u0026rsquo;ve been fiddling around with LLMs to see what works (and what doesn\u0026rsquo;t).\nAs I explore various models and learn more, I\u0026rsquo;ve developed a workflow that leverages multiple LLMs, each chosen for its specific strengths.\nHere\u0026rsquo;s how I use various AI models to boost my development productivity.\nWork projects: GitHub Copilot with Claude Sonnet v3.5 For work things, GitHub Copilot is my primary assistant for several reasons:\nSecurity: I don\u0026rsquo;t want to use my personal LLM for work things, plus my company already has a subscription for Copilot. 
Team Consistency: The entire team using the same model ensures consistent suggestions I primarily use Copilot for:\nQuick code completions during active development Documentation generation Repetitive code pattern implementation I also sometimes use the OpenAI models, but I\u0026rsquo;m not a big fan of those.\nCursor with Multiple Models for Personal Projects For personal projects, I use Cursor as my primary IDE with two different models. The subscription is a tad bit expensive, so I\u0026rsquo;ll see how it goes.\nClaude Sonnet v3.5 Use Case: Best coding model so far. Using this to bootstrap my personal projects. Strengths: Good at understanding context Excellent for completing partial code Decent context window Agent mode, MCP, and other neat features. Weaknesses: Still slow for many use cases Expensive, limited usage either with Cursor or the direct API with OpenRouter Roo-cline with Deepseek v3 Deepseek v3 is my go-to model for fast and cheap iterations. It does sometimes get lost, but it\u0026rsquo;s quite nice alongside Claude.\nSpecific Strengths:\nGreat at understanding complex codebases Great at suggesting optimizations Feels faster than Claude Sonnet v3.5 Weaknesses:\nDecent context window size, which can become a limiting factor as the codebase gets large Sometimes gets lost in the codebase. 
Needs to be prompted to stay focused Local Phi-3 Setup with Continue I\u0026rsquo;m currently setting up a local environment using Continue with Phi-3.\nOffline Capabilities:\nWorks without internet connection No token limits or API costs Use Cases:\nQuick syntax checks Documentation / Markdown edits A better search and replace :) Weaknesses:\nYou can only get so much out of it compared to the other two models Performance limitations as I\u0026rsquo;m using an M2 Pro Mac mini, which neither has the best GPU for the job, nor the most RAM (16GB) Workflow Integration Here\u0026rsquo;s how I typically combine these tools in my daily workflow:\nInitial Development:\nGet started: Use Copilot at work / Cursor with Claude Sonnet v3.5 for personal projects Agent mode to move forward quickly. Personal projects are not production critical, so I don\u0026rsquo;t mind going YOLO Commit along the way for checkpoints where things work, and then go crazy with the generation. I think I can do better with some prompt engineering here. Code Review:\nRun Deepseek v3 through roo-cline for deep analysis Use local Phi-3 for quick syntax and style checks Documentation:\nDeepseek v3 for technical details Local Phi-3 for quick doc updates Benefits of Multi-LLM Approach Toolbox approach: Each model is used for what it does best Redundancy: Not dependent on a single service, also using OpenRouter to be able to switch between models Cost Efficiency: Use expensive models only when needed Conclusion While it might seem complex to juggle multiple LLMs, for me the benefits far outweigh the initial setup complexity. Each model brings its unique strengths (and weaknesses), and learning when to use which tool has significantly improved my development workflow.\nIn the coming weeks, I plan to explore a multi-agent setup.\nI\u0026rsquo;m curious to hear how others are using LLMs in their workflows. What tools do you use? What are your favorite models? 
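As a closing illustration of the OpenRouter point above: switching models really can be a one-variable change when you build the request yourself. A rough sketch, where the endpoint follows OpenRouter\u0026rsquo;s public chat-completions API and the model slug is just a placeholder:

```shell
# Sketch: an OpenRouter-style chat request where swapping models is a
# one-variable change. The model slug is a placeholder, not an endorsement.
MODEL="${MODEL:-deepseek/deepseek-chat}"

build_payload() {
  # $1 = prompt text; JSON escaping is omitted for brevity, keep prompts simple
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}' \
    "$MODEL" "$1"
}

build_payload "Review this function for bugs"
# The payload would then be POSTed with curl and an API key, e.g.:
#   curl -s https://openrouter.ai/api/v1/chat/completions \
#     -H "Authorization: Bearer $OPENROUTER_API_KEY" \
#     -H "Content-Type: application/json" -d "$(build_payload '...')"
```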
Let\u0026rsquo;s discuss on LinkedIn\n","permalink":"https://www.adityakonarde.com/posts/multiple-llm-development-workflow/","summary":"\u003cp\u003eAfter I switched to engineering management, I realized that I missed coding. With my new job, I spend more time on GitHub, and innovation is very much encouraged at the company. As such, I\u0026rsquo;ve been fiddling around with LLMs to see what works (and what doesn\u0026rsquo;t).\u003c/p\u003e\n\u003cp\u003eAs I explore various models and learn more, I\u0026rsquo;ve developed a workflow that leverages multiple LLMs, each chosen for its specific strengths.\u003c/p\u003e\n\u003cp\u003eHere\u0026rsquo;s how I use various AI models to boost my development productivity.\u003c/p\u003e","title":"My Multi-LLM Development Workflow: Leveraging Different AI Models"},{"content":"A quick recap on SRE Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.[1] The main goals are to create scalable and highly reliable software systems. According to Ben Treynor, founder of Google\u0026rsquo;s Site Reliability Team, SRE is \u0026ldquo;what happens when a software engineer is tasked with what used to be called operations.\u0026rdquo;\n^ Source: Wikipedia\nHow much time do you spend coding? I get this question quite often: \u0026ldquo;How much time do you spend writing code?\u0026rdquo;\nWith the rising popularity of the SRE mindset, companies each have their own take on what SRE means, which ends up confusing the average reader.\nIn an ideal world, SREs spend no more than 50% of their working time on operations work. 
SRE teams also strive to minimize their \u0026lsquo;toil\u0026rsquo;.\nFor example, Google places a 50% cap on the aggregate \u0026ldquo;ops\u0026rdquo; work for all SREs—tickets, on-call, manual tasks, etc.\nLet\u0026rsquo;s look at what an average SRE day looks like:\nAnatomy of an SRE day There is no such thing as a \u0026lsquo;normal\u0026rsquo; day in the SRE world. My answer to the earlier question is always: \u0026ldquo;It depends on what problem I am trying to solve\u0026rdquo;. Let\u0026rsquo;s see why:\nSREs are tasked with a lot of responsibilities. Let\u0026rsquo;s pick a few from the Google SRE book\u0026rsquo;s index:\nMonitoring and Alerting Eliminating Toil On-call and Incident Response Configuration Design and Best Practices Managing Load While many of these are software problems, one does not use the same tool to solve all of them. Remember: one should not use a hammer to chop a tree.\nSome problems can be solved by writing code, while others can be solved without it. Incident response is a critical part of an SRE\u0026rsquo;s job, but doesn\u0026rsquo;t involve code. The same goes for documentation, architecture reviews, consulting and many other aspects of the role.\nIt also depends on what \u0026lsquo;code\u0026rsquo; really means for you. Is configuration management considered \u0026lsquo;code\u0026rsquo;? What about something like jsonnet? Is writing Bash or Python scripts considered code? People have various opinions about this, which are outside the scope of this blog.\nThe realistic model for SRE time I currently work with Red Hat as an SRE. Here\u0026rsquo;s what my breakdown looks like:\n40% Development 30% Operations 20% Consulting, Documentation, Reviews, Meetings, Training, Research 10% Meetings :) While this may sound much less than ideal, I would like to note that I believe knowledge sharing and collaboration with others are among the harder software engineering problems, and should be treated as such. 
Consistency, documentation and \u0026lsquo;teaching how to fish\u0026rsquo; can have a surprisingly good long-term impact, so I love spending time there. This totally depends on your organization.\nThe breakdown is definitely not set in stone, and one must always be prepared to move things around as per the needs of the business. An SRE team is also not a single person, so a good manager will know to encourage individuals to work on their core areas while also adding a fine balance to this breakdown.\nIf you would like to know more, feel free to reach out to me on LinkedIn and I would be happy to write a follow-up to this blog :)\nFurther reading Google SRE book Eliminating Toil Chapter from the Google SRE Workbook Managing SRE load ","permalink":"https://www.adityakonarde.com/posts/sre-time-management/","summary":"\u003ch2 id=\"a-quick-recap-on-sre\"\u003eA quick recap on SRE\u003c/h2\u003e\n\u003cp\u003eSite reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.[1] The main goals are to create scalable and highly reliable software systems. According to Ben Treynor, founder of Google\u0026rsquo;s Site Reliability Team, SRE is \u0026ldquo;what happens when a software engineer is tasked with what used to be called operations.\u0026rdquo;\u003c/p\u003e\n\u003cp\u003e^ Source: Wikipedia\u003c/p\u003e\n\u003ch2 id=\"how-much-time-do-you-spend-coding\"\u003eHow much time do you spend coding?\u003c/h2\u003e\n\u003cp\u003eI get this question quite often: \u0026ldquo;How much time do you spend writing code?\u0026rdquo;\u003c/p\u003e","title":"How much do SRE's really Code?"},{"content":"I\u0026rsquo;ve been asked quite a few times about the hardware I use for work, so I decided to write a blog post about it. Here\u0026rsquo;s a list of the hardware that I use for my daily work and hobbies.\nFirst and foremost, I use an M1 MacBook Pro as my laptop. 
This computer is incredibly fast and reliable, which is essential for the work that I do. It also has a beautiful display, which makes it a good laptop for editing pictures. :)\nIn addition to my laptop, I also use a Dell P2518D monitor as a secondary display. This gives me extra screen real estate and saves my neck and shoulders.\nWhen it comes to my keyboard and mouse, I use a Keychron K1 keyboard and an MX Master 3 mouse for work, and a Logitech G Pro wired mouse for gaming. Both of these peripherals are comfortable to use, and they have all the features that I need for my work and hobbies. I particularly like the high DPI on the gaming mouse, and the \u0026lsquo;side button\u0026rsquo; on the MX Master mouse. I use these interchangeably.\nFor audio, I use a Bose ANC 700 headset for work and Apple AirPods Pro 2 for everyday use. Both of these headphones have excellent sound quality and noise cancellation, which is essential for me since I often have to make calls and attend meetings. I do have a set of Edifier speakers on the desk in case I want to enjoy some music without headphones.\nFor video, I use a Logitech C920 webcam for everyday use, and a Fujifilm X-T4 with an Elgato Cam Link 4K for higher quality video. Both of these cameras produce clear and crisp images, which is important for online meetings and video calls.\nFinally, I use a Tonor mic for audio recording and podcasting. This microphone has excellent sound quality, and it\u0026rsquo;s very easy to use.\nWhat hardware do you use? I\u0026rsquo;d love to know! :)\n","permalink":"https://www.adityakonarde.com/posts/tech-hardware/","summary":"\u003cp\u003eI\u0026rsquo;ve been asked quite a few times about the hardware I use for work, so I decided to write a blog post about it. Here\u0026rsquo;s a list of the hardware that I use for my daily work and hobbies.\u003c/p\u003e\n\u003cp\u003eFirst and foremost, I use an M1 MacBook Pro as my laptop. 
This computer is incredibly fast and reliable, which is essential for the work that I do. It also has a beautiful display, which makes it a good laptop for editing pictures. :)\u003c/p\u003e","title":"Tech Hardware"},{"content":"Background Grafana has become the de facto visualization tool for Prometheus. While it is cool to run a central Grafana hooked up to an RDS database, I think it is even better if you can make Grafana completely configurable via Git and thus have stateless Grafana instances which you can scale horizontally.\nBased on this philosophy, I have been running a Grafana setup at Red Hat, here are some key points:\nGrafana runs as pods on a Kubernetes (OpenShift) cluster Each dashboard is mounted into the pod via ConfigMap Our GitOps pipeline takes care of adding the dashboard configmaps into the namespace, so all dashboards and their changes ultimately must end up in Git One of the best benefits of this approach is that you never have to worry about Grafana upgrades/downgrades. 
Because the pods are stateless, you can simply roll out a new version as long as the dashboard schema stays consistent.\nThe how For this exercise, we use a feature in Grafana called Provisioning.\nProvisioning allows you to inject certain configuration such as dashboards, plugins and notifiers into Grafana via a config file, and Grafana will know to load them at startup (and in the case of dashboards, watch them for updates).\nProvisioning Challenges: Too many dashboards on the main page So once you discover the awesome technique of dashboard provisioning, you are likely to read the documentation and start with a configuration that looks like the following:\napiVersion: v1\ndata:\n  dashboards.yaml: |-\n    {\n      \u0026#34;apiVersion\u0026#34;: 1,\n      \u0026#34;providers\u0026#34;: [\n        {\n          \u0026#34;folder\u0026#34;: \u0026#34;\u0026#34;,\n          \u0026#34;name\u0026#34;: \u0026#34;0\u0026#34;,\n          \u0026#34;options\u0026#34;: {\n            \u0026#34;path\u0026#34;: \u0026#34;/grafana-dashboard-definitions/0\u0026#34;\n          },\n          \u0026#34;orgId\u0026#34;: 1,\n          \u0026#34;type\u0026#34;: \u0026#34;file\u0026#34;\n        }\n      ]\n    }\nkind: ConfigMap\nmetadata:\n  name: grafana-dashboards\nAnd the dashboards will be mounted as a volume in the Kubernetes deployment spec:\n- mountPath: /grafana-dashboard-definitions/0/grafana-dashboard-foo\n  name: grafana-dashboard-foo\n- configMap:\n    defaultMode: 420\n    name: grafana-dashboard-foo\n  name: grafana-dashboard-foo\nAnd as soon as you add more dashboards, you will have corresponding volumeMounts under the same paths. At some point, your /dashboards page has a few dozen dashboards and it is a challenge trying to quickly get to the relevant ones.\nProvisioning dashboards into their own folders In the spirit of keeping our workspace hygienic, I wanted to clean up the mess that the /dashboards page was. 
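As a side note, each grafana-dashboard-* ConfigMap in the setup above starts life as a dashboard JSON file in Git. Here is a minimal sketch of generating such a manifest from a local file; the file and ConfigMap names are examples:

```shell
# Sketch: wrap a dashboard JSON file into a ConfigMap manifest for the GitOps
# repo. Names below are examples; the indentation under the data key matters.
make_dashboard_configmap() {
  name="$1"; file="$2"
  printf 'apiVersion: v1\nkind: ConfigMap\nmetadata:\n  name: %s\ndata:\n  %s: |-\n' \
    "$name" "$(basename "$file")"
  sed 's/^/    /' "$file"   # indent the dashboard JSON under the data key
}

echo '{"title": "Foo"}' > dashboard-foo.json
make_dashboard_configmap grafana-dashboard-foo dashboard-foo.json
# kubectl can generate an equivalent manifest:
#   kubectl create configmap grafana-dashboard-foo \
#     --from-file=dashboard-foo.json --dry-run=client -o yaml
```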
I wasn\u0026rsquo;t very sure if the documentation around provisioning already provided a way to group dashboards into a folder, so I had given up on that.\nBut the good news is, you actually can, in two simple steps:\nAdd another folder to the providers in your Grafana dashboards config, like so:\napiVersion: v1\ndata:\n  dashboards.yaml: |-\n    {\n      \u0026#34;apiVersion\u0026#34;: 1,\n      \u0026#34;providers\u0026#34;: [\n        {\n          \u0026#34;folder\u0026#34;: \u0026#34;\u0026#34;,\n          \u0026#34;name\u0026#34;: \u0026#34;0\u0026#34;,\n          \u0026#34;options\u0026#34;: {\n            \u0026#34;path\u0026#34;: \u0026#34;/grafana-dashboard-definitions/0\u0026#34;\n          },\n          \u0026#34;orgId\u0026#34;: 1,\n          \u0026#34;type\u0026#34;: \u0026#34;file\u0026#34;\n        },\n        {\n          \u0026#34;folder\u0026#34;: \u0026#34;Bar\u0026#34;,\n          \u0026#34;name\u0026#34;: \u0026#34;0\u0026#34;,\n          \u0026#34;options\u0026#34;: {\n            \u0026#34;path\u0026#34;: \u0026#34;/grafana-dashboard-definitions/Bar\u0026#34;\n          },\n          \u0026#34;orgId\u0026#34;: 1,\n          \u0026#34;type\u0026#34;: \u0026#34;file\u0026#34;\n        }\n      ]\n    }\nkind: ConfigMap\nmetadata:\n  name: grafana-dashboards\nWhen mounting the configmaps, mount them under a path listed in providers:\n- mountPath: /grafana-dashboard-definitions/0/grafana-dashboard-foo\n  name: grafana-dashboard-foo\n- mountPath: /grafana-dashboard-definitions/Bar/grafana-dashboard-bar\n  name: grafana-dashboard-bar\n- configMap:\n    defaultMode: 420\n    name: grafana-dashboard-foo\n  name: grafana-dashboard-foo\n- configMap:\n    defaultMode: 420\n    name: grafana-dashboard-bar\n  name: grafana-dashboard-bar\nNote: Any dashboards which are not under any of the paths in providers will just disappear. Also, I would recommend you at least always have the /0/ path available for General dashboards.\nAnd that\u0026rsquo;s a win! 
Now your dashboards will be grouped by folders on the /dashboards page, making it super easy for teams to get to them in their time of need.\nGrafana on Kubernetes: Quick Start I was only able to discover this because Frederic mentioned that someone added this feature to his repo.\nOnly later did I find that this repo is a gold mine. Not only does it allow you to easily generate dashboards from jsonnet and create a ready-to-deploy configuration from it, but it also comes enabled with the folder-wise provisioning we talked about in this blog post.\nIf you\u0026rsquo;re not already running Grafana this way on Kubernetes, I would highly recommend giving this repo a try: https://github.com/brancz/kubernetes-grafana\nSome documentation for further reading:\nhttps://grafana.com/docs/grafana/latest/administration/provisioning/ https://github.com/brancz/kubernetes-grafana https://grafana.com/blog/2020/02/26/how-to-configure-grafana-as-code/ ","permalink":"https://www.adityakonarde.com/posts/provisioning-dashboards-grafana/","summary":"\u003ch2 id=\"background\"\u003eBackground\u003c/h2\u003e\n\u003cp\u003eGrafana has become the de facto visualization tool for Prometheus. 
While it is cool to run a central Grafana hooked up to an RDS database, I think it is even better if you can make Grafana completely configurable via Git and thus have stateless Grafana instances which you can scale horizontally.\u003c/p\u003e\n\u003cp\u003eBased on this philosophy, I have been running a Grafana setup at Red Hat, here are some key points:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eGrafana runs as pods on a Kubernetes (OpenShift) cluster\u003c/li\u003e\n\u003cli\u003eEach dashboard is mounted into the pod via ConfigMap\u003c/li\u003e\n\u003cli\u003eOur GitOps pipeline takes care of adding the dashboard configmaps into the namespace, so all dashboards and their changes ultimately must end up in Git\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eOne of the best benefits of this approach is that you never have to worry about Grafana upgrades/downgrades. Because the pods are stateless, you can simply roll out a new version as long as the dashboard schema stays consistent.\u003c/p\u003e","title":"Provisioning Dashboards with Grafana"},{"content":"This is a quick knowledge sharing post before it gets out of my head :)\nI\u0026rsquo;m sure many (if not most) of you use Alertmanager as the go-to alerting system with Prometheus.\nI really like the simplicity of Alertmanager\u0026rsquo;s configuration file and how nicely you can plug it into your configuration generation.\nThe deployment pattern, however, is always a source of confusion for new adopters. I am going to try to solve some of that confusion in this post.\nOne Alertmanager to rule them all You start with one Prometheus and a corresponding Alertmanager. The alerting flow looks like this:\nPrometheus -\u0026gt; Alertmanager -\u0026gt; Slack\nThis is where life is simple.\nNext, you add some HA to your Prometheus instance. 
No problems here either\nPrometheus 1 and 2 -\u0026gt; Alertmanager -\u0026gt; Slack\nOh, but you indeed also need HA for your alerting system:\nPrometheus 1 and 2 -\u0026gt; Alertmanager 1,2,3 -\u0026gt; Slack\nAt this stage, you have introduced a new alertmanager functionality, which is Gossip. Some of the interesting defaults for this protocol can be found here: https://github.com/prometheus/alertmanager/blob/master/cluster/cluster.go#L98-L106\nIt is important to note that for this trick to work, each Prometheus individually must fire its alerts to all Alertmanagers\nGossip Protocol Deep Dive The Gossip protocol used by Alertmanager is based on the SWIM protocol (Scalable Weakly-consistent Infection-style Process Group Membership Protocol). Key features include:\nFailure detection through periodic ping/ack messages State synchronization through gossip messages Configurable parameters for tuning performance vs. consistency Advanced Deployment Patterns Multi-Datacenter Setup For organizations with multiple datacenters, Alertmanager can be deployed in a way that:\nMaintains local alerting within each DC Forwards critical alerts to a global Alertmanager cluster Implements cross-DC redundancy Federation Setup In large organizations with multiple Prometheus instances, Alertmanager can be deployed in a federated manner:\nEach team maintains their own Alertmanager instance A central Alertmanager handles organization-wide alerts Implements hierarchical alert routing Best Practices Dead Man\u0026rsquo;s Snitch: Implement a dead man\u0026rsquo;s switch using a constantly firing alert to ensure your alerting system is working. 
If the alert stops firing, it indicates a problem with your monitoring system Monitoring Alertmanager: Always monitor your Alertmanager instances using Prometheus itself Configuration Management: Use version control and CI/CD for Alertmanager configurations Alert Deduplication: Configure proper grouping and inhibition rules to reduce alert noise Capacity Planning: Monitor alert volume and scale Alertmanager accordingly Disaster Recovery: Implement backup and restore procedures for Alertmanager state Troubleshooting Common Issues Silent Alerts: Check Alertmanager logs and Prometheus alert rules Duplicate Alerts: Verify Gossip protocol configuration and network connectivity Delayed Alerts: Monitor Alertmanager processing latency and scale as needed Configuration Errors: Use amtool to validate configuration files Conclusion Proper Alertmanager deployment is crucial for maintaining reliable alerting in your monitoring stack. By understanding these patterns and best practices, you can build a robust alerting system that scales with your organization\u0026rsquo;s needs.\nRemember that the right deployment pattern depends on your specific requirements and infrastructure. Start simple and evolve your architecture as your needs grow.\n","permalink":"https://www.adityakonarde.com/posts/alertmanager-deployment-patterns/","summary":"\u003cp\u003eThis is a quick knowledge sharing post before it gets out of my head :)\u003c/p\u003e\n\u003cp\u003eI\u0026rsquo;m sure many (if not most) of you use Alertmanager as the go-to alerting system with Prometheus\u003c/p\u003e\n\u003cp\u003eI really like the simplicity of Alertmanager\u0026rsquo;s configuration file and how nicely you can plug it into your configuration generation.\u003c/p\u003e\n\u003cp\u003eThe deployment pattern, however, is always a confusion for new adopters. 
I am going to try to solve some of that confusion in this post.\u003c/p\u003e","title":"Alertmanager Deployment Patterns"},{"content":"What is Prometheus If you\u0026rsquo;re an engineer working with Cloud technologies, chances are that you\u0026rsquo;ve already heard of Prometheus.\nPrometheus is an Open Source monitoring tool. Its development started at SoundCloud and it has now evolved into being a go-to choice for metrics collection. I often relate its rise in popularity to its simple, GitOps-friendly configuration management, simple setup and modularity.\nPrometheus does a few things and does it well. While doing this, it does have some nice modularity as you can mix and match it with other tooling such as Grafana and Alertmanager.\nI don\u0026rsquo;t want to make the first paragraph clickbait. While it was important to set the context, this is not a post that introduces Prometheus itself. Others in the community have done a very good job at doing this, and here are a few recommended talks and blogs about Prometheus if you have some catching up to do.\nPrerequisite reading: A Prometheus crash course Get an overview Understand the architecture And please get your hands dirty Again, note that this is a \u0026lsquo;deep dive\u0026rsquo; series. If you\u0026rsquo;re new to Prometheus, I would highly recommend making sure you\u0026rsquo;re familiar with the terminology first with the material above.\nPrometheus: Diving into the fire When I first started learning to use and set up Prometheus, I faced certain challenges that I don\u0026rsquo;t want other users to face. 
One of them is the lack of documentation around the details of how Prometheus really works under the hood.\nUnderstanding these core subsystems will help you become a more effective Prometheus operator:\nThe Prometheus Data Model: How metrics are structured and labeled Life cycle of a scrape: What happens when Prometheus collects metrics from targets TSDB: The time series database layer that powers storage and queries Query evaluation: How PromQL queries are parsed and executed Alerting: The flow from alert rules to notifications via Alertmanager Service Discovery: How Prometheus automatically finds scrape targets Self-monitoring: Using Prometheus to monitor Prometheus itself For deeper exploration of these topics, I recommend the Prometheus documentation and the various conference talks from the maintainers.\nA special shoutout to everyone who contributes to this project and has given talks or written content around it. The community\u0026rsquo;s knowledge sharing is what makes Prometheus so accessible.\nIf you\u0026rsquo;d like to discuss Prometheus or have questions, feel free to reach out on LinkedIn.\n","permalink":"https://www.adityakonarde.com/posts/prometheus-deep-dive-series-intro/","summary":"\u003ch2 id=\"what-is-prometheus\"\u003eWhat is Prometheus\u003c/h2\u003e\n\u003cp\u003eIf you\u0026rsquo;re an engineer working with Cloud technologies, chances are that you\u0026rsquo;ve already heard of \u003ca href=\"https://prometheus.io/\"\u003ePrometheus\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003ePrometheus is an Open Source monitoring tool. Its development started at \u003ca href=\"https://soundcloud.com/pages/contact\"\u003eSoundCloud\u003c/a\u003e and it has now evolved into being a go-to choice for metrics collection. I often relate its rise in popularity to its simple, gitops friendly configuration management, simple setup and modularity.\u003c/p\u003e\n\u003cp\u003ePrometheus does a few things and does it well. 
At the same time, it remains modular: you can mix and match it with other tooling such as \u003ca href=\"https://github.com/grafana/grafana\"\u003eGrafana\u003c/a\u003e and \u003ca href=\"https://github.com/prometheus/alertmanager\"\u003eAlertmanager\u003c/a\u003e.\u003c/p\u003e","title":"Prometheus Deep Dive: Understanding the Fundamentals"},{"content":"Hello World!\nAs we approach the end of the year, I wanted to take a moment to reflect on what I learnt in these 10 years leading to 2020.\nI must start off by saying it\u0026rsquo;s been such a roller coaster ride. Starting from the early 2010s, when I was in ahem School (!), all the way up to now, me sitting here in a cafe in my new home - Berlin.\nI\u0026rsquo;m not a writer, and I\u0026rsquo;m gonna keep this short. Here\u0026rsquo;s 10 years of mistakes and good decisions, summarized:\nDo what you can\u0026rsquo;t It\u0026rsquo;s never too late to start travelling Keep your mind open to possibilities Cherish the people who were around for you in your rough times Social Media is fake validation. Know that, use it to your advantage if you can but don\u0026rsquo;t let it consume you Get a hobby Make good friends outside of work and outside of what you do Don\u0026rsquo;t be afraid to try new things. Go to new places, talk to people, try that one thing you always wanted to You might end up regretting things you didn\u0026rsquo;t do while you had the chance You will also end up regretting things you did do :D Don\u0026rsquo;t have regrets. Everything is an experience. Your actions until now are what make you, you Don\u0026rsquo;t give free advice 10 years of experience cannot be summarized in a single blog post\nAnd at this point, I have realized that it\u0026rsquo;s futile to try to transfer over 10 years of experiences in a blog post. 
Life is so rich that every day can be a hundred of these.\nCheers to the 2010s, the decade that brought me love, learning, opportunities and experiences.\nLet\u0026rsquo;s all try to be better humans (whatever that means to you) this upcoming decade and beyond.\nCheers.\nPS: Welcome to my new website. :)\n","permalink":"https://www.adityakonarde.com/posts/heres-to-2010s/","summary":"\u003cp\u003eHello World!\u003c/p\u003e\n\u003cp\u003eAs we approach the end of the year, I wanted to take a moment to reflect on what I learnt in these 10 years leading to 2020.\u003c/p\u003e\n\u003cp\u003eI must start off by saying it\u0026rsquo;s been such a roller coaster ride. Starting from the early 2010s, when I was in \u003cem\u003eahem\u003c/em\u003e School (!), all the way up to now, me sitting here in a cafe in my new home - Berlin.\u003c/p\u003e","title":"Here's to the 2010s"},{"content":"About Me I am a Technologist (TM) based in Berlin, Germany, with a passion for building scalable and reliable systems involving computers.\nMy career spans software engineering, site reliability engineering, and engineering leadership. I currently lead engineering teams at Grafana Labs, where we\u0026rsquo;re building the next generation of observability tools.\nI\u0026rsquo;m an active contributor to the open source community and frequently speak at industry conferences about cloud-native technologies and engineering best practices.\nWhen I\u0026rsquo;m not working, you can find me:\nVibe-coding some side projects Trying to keep my plants alive Learning a new recipe 🌐 Work In my career I have primarily worked on these projects:\nKubernetes Prometheus OpenTelemetry 🌍 Community Involvement I\u0026rsquo;ve been involved in organizing and speaking at various cloud-native meetups:\nBangalore Kubernetes 🇮🇳 Cloud Native Berlin 🇩🇪 Kubernetes Berlin 🇩🇪 Grafana and Friends Berlin 🇩🇪 The CNCF ecosystem thrives on community contributions. 
Whether you\u0026rsquo;re interested in code, documentation, or community building, there\u0026rsquo;s a place for you. I hope to see you there!\nLet\u0026rsquo;s connect:\nGitHub: https://github.com/aditya-konarde LinkedIn: https://www.linkedin.com/in/adityakonarde/ ","permalink":"https://www.adityakonarde.com/about/","summary":"\u003ch1 id=\"about-me\"\u003eAbout Me\u003c/h1\u003e\n\u003cp\u003eI am a Technologist (TM) based in Berlin, Germany, with a passion for building scalable and reliable systems involving computers.\u003c/p\u003e\n\u003cp\u003eMy career spans software engineering, site reliability engineering, and engineering leadership. I currently lead engineering teams at Grafana Labs, where we\u0026rsquo;re building the next generation of observability tools.\u003c/p\u003e\n\u003cp\u003eI\u0026rsquo;m an active contributor to the open source community and frequently speak at industry conferences about cloud-native technologies and engineering best practices.\u003c/p\u003e","title":""}]