Blog

How LLM Inference Got 10x Cheaper

How the field slashed LLM inference costs through attention redesigns, IO-aware computation, KV cache paging, and speculative decoding, and what comes next.

Feb 27, 2026

Transformers, LLM Inference, Attention Mechanisms, Production ML, AI Engineering


Latest writing

AI Agents, MCP Tools, and Skills: Toward Smarter Workflows

Why naive MCP usage breaks at scale, how code execution fixes context bloat, and how skills turn agent workflows into reusable building blocks.

Jan 31, 2026

Reality Check with Cursor 2.0

Cursor 2.0 brings real workflow improvements with worktrees and parallel agents, but several features still lag behind in practice.

Nov 16, 2025

Making LLMs Efficient in Production

A practical journey to deploy faster, cheaper LLM/transformer models using distillation, quantization, and ONNX Runtime—no PhD required.

Oct 18, 2025

From Perfect Prompts to Perfect Ingredients

How Anyone Can Guide AI Without Being a Techie.

Oct 14, 2025

Transformers: Simple Yet Wildly Versatile

How a simple attention-only architecture scaled from translation to language, vision, code, and diffusion.

Oct 4, 2025

The Days of Prompt Engineering Are Over

My day-to-day shifted from prompt phrasing to context-first workflows—feeding the model code, docs, and tickets to drive reliable results.

Sep 27, 2025

Building a Unified Orchestrator for Agentic Flow with A2A + MCP

Architecture notes on bridging A2A and MCP toolchains through a single streaming orchestrator.

Sep 20, 2025