# Making LLMs Efficient in Production

## The heavy model in your backpack

Ever tried to stuff your entire wardrobe into a single carry-on? That's what deploying a large language model can feel like. In [Transformers: Simple Yet Wildly Versatile](https://sidparmar.ca/blog/transformers-simple-yet-wildly-versatile) I showed how a simple attention-only architecture conquered translation, vision and even diffusion models. But there's a catch: the very models that write essays and fix code can also be unwieldy when you try to ship them. A vanilla BERT base checkpoint tips the scales at over **400 MB** and takes tens of milliseconds to answer a single query. When you're serving thousands of requests per second, those delays add up.

> **What I optimize in this post:** a **BERT-class transformer (encoder-only)** fine-tuned for intent classification. I’ll use **“LLM”** broadly when talking about production impact, and **“transformer/BERT”** when referring to the exact model and code.

**Why speed matters in practice (not theory):**

* **Throughput:** Faster models = **more requests/sec per instance**
* **Cost:** Same hardware handles **more traffic** or the **same traffic at lower cost**
* **Memory:** Smaller checkpoints run **more workers per machine**
* **UX:** Lower latency → **snappier responses** and more **headroom for spikes**

In this post I’ll take you on a practical journey to lighten that load. We’ll start with shipping a full BERT into production—and gradually unpack tricks like distillation, quantization, and ONNX Runtime.

---

## Baseline BERT

My journey began with a simple question: *How fast and accurate is a plain BERT?* I wrote a small helper to benchmark size, latency and accuracy.

**Environment:** CPU inference on a modern x86 instance; batch size 1; latency is the per-request average.

```python
import time

import numpy as np
from transformers import pipeline


class PerformanceBenchmark:
    def __init__(self, model_name_or_path, task='text-classification'):
        self.pipeline = pipeline(task=task, model=model_name_or_path)

    def benchmark(self, test_dataset):
        # Model size in MB, computed from the parameter tensors
        # (robust across library versions)
        model_size = sum(p.numel() * p.element_size()
                         for p in self.pipeline.model.parameters()) / (1024**2)

        latencies, correct = [], 0
        for item in test_dataset:
            start = time.time()
            pred = self.pipeline(item['text'])[0]['label']
            latencies.append((time.time() - start) * 1000)
            correct += (pred == item['intent'])

        return {
            'model_size': model_size,
            'avg_latency': float(np.mean(latencies)),
            'accuracy': correct / len(test_dataset)
        }
```

**Results (CLINC OOS intent classification):**

| Model | Size (MB) | Latency (ms) | Accuracy (%) |
| ----- | --------- | ------------ | ------------ |
| BERT  | ~418      | ~40          | 86.7         |

**Capacity & cost framing (40 ms):**

* **To serve 1,000 req/s:** ~**40** parallel workers (back-of-the-envelope math sketched below)
* **Memory:** 418 MB/worker limits packing density per machine
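
Those worker counts are simple Little's-law arithmetic: concurrency ≈ request rate × per-request latency. Here's a minimal sketch, assuming each worker handles one request at a time (the helper name is illustrative, not part of the benchmark code):

```python
import math


def workers_needed(target_rps: float, latency_ms: float) -> int:
    """Little's law: concurrency ~= arrival rate * time in system.
    Assumes each worker serves exactly one request at a time."""
    return math.ceil(target_rps * latency_ms / 1000)


print(workers_needed(1000, 40))  # ~40 workers for the 40 ms baseline
print(workers_needed(1000, 8))   # ~8 workers once latency drops to ~8 ms
```
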
---

## First stop: off-the-shelf distillation

**What it is:** Distillation is **model compression via teacher–student training**. A smaller **student** learns to imitate a larger **teacher**—not just the teacher’s final labels, but its *confidence profile* over classes.

**How it works (practically):** I load a strong teacher (BERT fine-tuned on CLINC) and a compact student (DistilBERT). During training, I optimize two things at once:

1. the usual cross-entropy on ground-truth labels, and
2. a KL-divergence term that pushes the student’s logits toward the teacher’s **softened** logits (temperature > 1).

The **temperature** spreads out the teacher’s probabilities so the student can learn *relative* preferences (“A is twice as likely as B”) rather than only hard labels. A mixing weight **alpha** controls how much to trust teacher vs. labels.
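
To make that two-part objective concrete, here is a minimal sketch of a blended loss of this kind in PyTorch. The function name, the T^2 scaling, and the convention of weighting the cross-entropy term by `alpha` are illustrative choices, not the exact internals of the trainer used below:

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      alpha: float = 0.5,
                      temperature: float = 2.0) -> torch.Tensor:
    """Blend hard-label cross-entropy with soft-target KL divergence."""
    # 1) Usual cross-entropy against the ground-truth intent labels
    ce_loss = F.cross_entropy(student_logits, labels)
    # 2) KL divergence between temperature-softened student and teacher
    #    distributions; the T^2 factor keeps gradients on a comparable scale
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction='batchmean',
    ) * temperature ** 2
    return alpha * ce_loss + (1.0 - alpha) * kd_loss
```

With `alpha=0.5` and `temperature=2.0`, roughly half of the training signal comes from the ground-truth labels and half from the teacher’s softened distribution.
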
**Why it helps:** Soft targets contain extra information that hard labels throw away. The student inherits some of the teacher’s generalization while using fewer parameters.

I tried the ready-made checkpoint `transformersbook/distilbert-base-uncased-finetuned-clinc`.

| Model           | Size (MB) | Latency (ms) | Accuracy (%) |
| --------------- | --------- | ------------ | ------------ |
| DistilBERT (HF) | ~250      | ~22          | 85.5         |

**Capacity & cost framing (22 ms):**

* **To serve 1,000 req/s:** ~**22** workers (≈45% fewer than BERT)
* **Memory:** 250 MB → **~1.6×** more workers per machine vs BERT

To visualise the progress, I plotted average latency versus accuracy, using bubble size to encode model size. The blue bubble (BERT) sits far to the right—big and slow. The orange bubble (DistilBERT) shifts left on latency while staying reasonably high on accuracy:

![Scatter plot showing accuracy vs average latency with bubble size representing model size. BERT baseline appears as a large blue solid bubble at 40ms latency and 86.7% accuracy. DistilBERT appears as a smaller orange hollow bubble at 22ms latency and 85.5% accuracy, demonstrating the trade-off between speed and accuracy.](/blog/making-llms-efficient-in-production/transformers_markdown_51_1.png)

A quick win—but could a student I trained do even better?

---

## Distillation: teaching my own student

**What I change vs. off-the-shelf:** I keep the teacher–student recipe but **tune the blend** (`alpha`) and **temperature** for *my* data, and optionally unfreeze more layers or train a bit longer. I also monitor which classes the student struggles with and adjust the loss weights if needed.

**Why it works better:** Off-the-shelf students are generic. A short, targeted run on your actual distribution lets the student specialize. Think of it like a chef showing the sous-chef *your* kitchen’s menu and rush-hour patterns—not just cooking in theory.

```python
from transformers import AutoModelForSequenceClassification

# DistillationConfig / DistillationTrainer are helper classes implementing the
# teacher-student loss (alpha + temperature) described above.
distil_config = DistillationConfig(
    teacher_model=teacher_model,
    student_model=AutoModelForSequenceClassification.from_pretrained(
        'distilbert-base-uncased', num_labels=num_labels),
    alpha=0.5,
    temperature=2.0,
    epochs=3,
    learning_rate=5e-5,
)

trainer = DistillationTrainer(distil_config)
trainer.train(train_dataset)
metrics = trainer.evaluate(test_dataset)
```

| Model            | Size (MB) | Latency (ms) | Accuracy (%) |
| ---------------- | --------- | ------------ | ------------ |
| Custom Distilled | ~250      | ~20          | **86.8**     |

**Capacity & cost framing (20 ms):**

* **To serve 1,000 req/s:** ~**20** workers (≈2× cheaper than baseline at same traffic)

![Scatter plot comparing three models: BERT baseline (blue solid bubble at 40ms, 86.7%), DistilBERT (orange solid bubble at 22ms, 85.5%), and custom distillation (green hollow bubble at 20ms, 86.8%). The green bubble shows custom distillation achieves nearly identical accuracy to BERT while being 2x faster.](/blog/making-llms-efficient-in-production/transformers_markdown_63_1.png)

---

## Deep compression: dynamic quantization

**What it is:** Quantization **reduces numeric precision** to make math cheaper. Instead of 32-bit floats everywhere, we represent weights (and sometimes activations) as **8-bit integers** and keep per-tensor (or per-channel) scales/zero-points to map back and forth.

**How it works (in practice):** With **dynamic quantization** in PyTorch, I wrap the model and tell it to quantize **Linear** layers at runtime. The weights are stored in INT8; activations are observed dynamically and quantized on the fly. The core attention math still behaves the same logically—just with lighter arithmetic and less memory bandwidth.

**Why it helps:** Transformers are dense-matmul heavy. Shrinking weights/activations cuts memory traffic and lets CPUs use faster integer kernels. As a side effect, the small quantization noise can act like **regularization**—the model sometimes gets a tiny accuracy bump.

```python
import torch
from torch import nn

# Quantize only the Linear layers: weights become INT8, activations are
# observed and quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    distilled_model, {nn.Linear}, dtype=torch.qint8
)
```

| Model     | Size (MB) | Latency (ms) | Accuracy (%) |
| --------- | --------- | ------------ | ------------ |
| Quantized | **~132**  | **~12**      | **87.6**     |

**Capacity & cost framing (12 ms):**

* **To serve 1,000 req/s:** ~**12** workers (~3.3× fewer than baseline)
* **Memory:** 132 MB → **~3×** more workers per machine vs BERT

**A small surprise:** When the quantized model **slightly outperformed** the original on accuracy, I thought I’d miscalculated. It wasn’t a bug—quantization can act as a mild **regularizer**.

![Scatter plot showing four optimization stages: BERT baseline (blue solid, 40ms, 86.7%), DistilBERT (orange solid, 22ms, 85.5%), custom distillation (green solid, 20ms, 86.8%), and distillation plus quantization (red hollow bubble, 12ms, 87.6%). The red bubble in the upper-left quadrant shows quantization achieved both faster inference and higher accuracy than the baseline.](/blog/making-llms-efficient-in-production/transformers_markdown_70_1.png)

---

## Going further: ONNX Runtime and hardware-accelerated inference

**What it is:** **ONNX** is a portable model format; **ONNX Runtime (ORT)** is a high-performance engine with optimized kernels for CPUs/GPUs/accelerators.

**How it works (end-to-end):**

1. **Export** the trained PyTorch model to ONNX with dynamic axes (so batch size can vary).
2. **Load** it in ORT and select providers (e.g., `CPUExecutionProvider`, CUDA, etc.).
3. Optionally run **ORT’s quantization** pass to get INT8 benefits on top of its kernels.
4. Deploy the single ONNX artifact broadly—handy when your fleet spans multiple hardware types.

**Why it helps:** Even without changing the model, a better executor wins. ORT fuses ops, picks faster kernels, and reduces framework overhead. On CPUs, pairing ORT with quantization compounds the gains.

```python
import torch
import onnxruntime as ort

# Export to ONNX (CPU path)
inputs = tokenizer('hello world', return_tensors='pt')
torch.onnx.export(
    distilled_model,
    (inputs['input_ids'], inputs['attention_mask']),
    'distilled.onnx',
    input_names=['input_ids', 'attention_mask'],
    output_names=['logits'],
    dynamic_axes={'input_ids': {0: 'batch'}, 'attention_mask': {0: 'batch'}},
    opset_version=16,
)

ort_session = ort.InferenceSession('distilled.onnx', providers=['CPUExecutionProvider'])
```
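
The “ORT (quantized)” row below comes from running ORT’s own dynamic quantization on the exported graph. Here is a minimal sketch of that step plus one end-to-end inference call, reusing the `tokenizer` from the export snippet above (the output file name and sample utterance are placeholders):

```python
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

# INT8-quantize the exported graph: weights stored as INT8,
# activations quantized dynamically at runtime
quantize_dynamic('distilled.onnx', 'distilled-int8.onnx', weight_type=QuantType.QInt8)

# Serve one request with the quantized artifact
session = ort.InferenceSession('distilled-int8.onnx', providers=['CPUExecutionProvider'])
encoded = tokenizer('transfer money to my savings account', return_tensors='np')
feed = {
    # cast to int64 to match the dtypes baked into the exported graph
    'input_ids': encoded['input_ids'].astype(np.int64),
    'attention_mask': encoded['attention_mask'].astype(np.int64),
}
logits = session.run(['logits'], feed)[0]
predicted_class = int(np.argmax(logits, axis=-1)[0])
```

In a real service you would map `predicted_class` back to an intent name via the model config’s `id2label`.
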
| Model           | Size (MB) | Latency (ms) | Accuracy (%) |
| --------------- | --------- | ------------ | ------------ |
| ORT (quantized) | **~132**  | **~8**       | **~87.7**    |

**Capacity & cost framing (8 ms):**

* **To serve 1,000 req/s:** ~**8** workers → **~5× fewer instances** vs 40 ms baseline
* **Memory:** small artifact + ORT quantization → dense packing and better autoscaling headroom

![Complete scatter plot showing all six optimization approaches: BERT baseline (blue solid, 40ms, 86.7%), DistilBERT (orange solid, 22ms, 85.5%), custom distillation (green solid, 20ms, 86.8%), distillation plus quantization (red solid, 12ms, 87.6%), distillation plus ONNX Runtime (purple solid, 21ms, 86.8%), and the final champion distillation plus ONNX Runtime quantized (brown hollow bubble, 8ms, 87.7%). The brown bubble demonstrates a 5x speedup over baseline BERT while maintaining comparable accuracy.](/blog/making-llms-efficient-in-production/transformers_markdown_94_1.png)

---

## Conclusion: the short playbook (no silver bullets)

Pick by constraint, then iterate:

* **Quick win / low effort:** start with an **off-the-shelf distilled model**.
* **Accuracy + speed:** run **DIY distillation** (teacher–student with temperature + alpha).
* **Edge/CPU or memory-bound:** apply **dynamic quantization** first.
* **Cross-platform & kernel speed:** export to **ONNX** and run with **ONNX Runtime** (+ ORT quantization).
* **Chasing the last mile:** **combine them** (distillation → quantization → ORT), then consider **structured pruning** and serving-stack tweaks.

### One-glance summary

| Model            | Size (MB) | Latency (ms) | Accuracy (%) | Speedup vs BERT |
| ---------------- | --------- | ------------ | ------------ | --------------- |
| BERT             | 418       | 40           | 86.7         | 1.0×            |
| DistilBERT (HF)  | 250       | 22           | 85.5         | 1.8×            |
| Custom Distilled | 250       | 20           | 86.8         | 2.0×            |
| Quantized        | 132       | 12           | 87.6         | 3.3×            |
| ORT Quantized    | 132       | 8            | 87.7         | 5.0×            |

## Shout-out & further reading

A huge shout-out to Lewis Tunstall, Leandro von Werra, and Thomas Wolf—much of the way I think about distillation, quantization, and practical deployment comes from their book, *Natural Language Processing with Transformers*. If you're hungry for a deeper, hands-on tour, start here:

**[Natural Language Processing with Transformers (Revised Edition)](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/)** — Read/buy on O'Reilly Media.