How LLM Inference Got 10x Cheaper
How the field slashed LLM inference costs through attention redesigns, IO-aware computation, KV cache paging, and speculative decoding, and what comes next.
Why naive MCP usage breaks at scale, how code execution fixes context bloat, and how skills turn agent workflows into reusable building blocks.
Cursor 2.0 brings real workflow improvements with worktrees and parallel agents, but several features still lag behind in practice.
A practical journey to deploy faster, cheaper LLM/transformer models using distillation, quantization, and ONNX Runtime—no PhD required.
How anyone can guide AI without being a techie.
How a simple attention-only architecture scaled from translation to language, vision, code, and diffusion.
My day-to-day shifted from prompt phrasing to context-first workflows—feeding the model code, docs, and tickets to drive reliable results.
Architecture notes on bridging A2A and MCP toolchains through a single streaming orchestrator.