How Batching in LLM Decode Works

Maximizing speed: How continuous batching unlocks unprecedented LLM throughput

Think of continuous batching as the LLM world’s turbocharger — keeping GPUs busy nonstop and cranking out results up to 20x faster. I discussed how PagedAttention cracked the code on LLM memory chaos ...

Semiconductor Engineering

GPU Analysis Identifying Performance Bottlenecks That Cause Throughput Plateaus In Large-Batch Inference

A new technical paper titled “Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference” was published by researchers at Barcelona Supercomputing Center, Universitat Politecnica de ...

Semiconductor Engineering

Heterogeneous System With Specialized HW For Disaggregated LLM Inference (Princeton Univ., Univ. of Washington)

A new technical paper titled “SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference” was published by researchers at Princeton University and University of Washington. “Large ...

NextBigFuture

Defeating Nondeterminism in LLM Inference by Thinking Machines

A research article by Horace He and the Thinking Machines Lab (X-OpenAI CTO Mira Murati founded) addresses a long-standing issue in large language models (LLMs). Even with greedy decoding bu setting ...

CRN

Nvidia Says New Software Will Double LLM Inference Speed On H100 GPU

The AI chip giant says the open-source software library, TensorRT-LLM, will double the H100’s performance for running inference on leading large language models when it comes out next month. Nvidia ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results