Dwarkesh Podcast

Reiner Pope – The math behind how LLMs are trained and served

Dwarkesh Podcast·May 1, 2026

OVERVIEW

This podcast episode with Reiner Pope, CEO of MatX, covers the mathematical and architectural principles that govern how large language models (LLMs) are trained and served. It explores the interplay between hardware capabilities (memory bandwidth, compute throughput, interconnects) and model design choices (batch size, context length, sparsity) to explain the trade-offs among latency, cost, and model performance. The discussion aims to explain why AI models behave as they do and how these factors shape the pricing of LLM APIs.

KEY TOPICS

  • Analyzing LLM inference time: compute vs. memory fetches.
  • The economic importance of batch size in inference.
  • Components of compute time: active parameters and throughput.
  • Components of memory time: weight fetches and KV cache.
  • The relationship between latency, batch size, and hardware constraints.
  • The impact of context length and sparsity on memory usage and performance.
  • The derivation of optimal batch sizes based on hardware and model characteristics.
  • Cost analysis: latency per token vs. batch size.
  • Pipeline parallelism and its benefits for memory capacity in training, but not latency in inference.
  • Hardware architecture: GPU racks, interconnects (NVLink), and scale-up domains.
  • The "memory wall" challenge and its components (bandwidth, capacity, interconnect).
  • The role of pricing structures in LLM APIs (input vs. output tokens).
  • Memory hierarchies and the cost implications of cache hits/misses.
  • The concept of "overtraining" models relative to training compute for better user-facing inference.

MAIN TAKEAWAYS

  • Batch size is a critical factor in LLM inference economics: serving requests without batching can cost on the order of 1000x more per token. Larger batches amortize the fixed cost of fetching the model weights from memory across many users.
  • The time for one inference step is roughly the maximum of its compute time and its memory-fetch time. Knowing which of the two dominates is key to optimizing performance and cost.
  • The hardware's ratio of compute throughput to memory bandwidth (FLOPs per byte) and the model's sparsity together determine the batch size needed to run efficiently. For DeepSeek-like models, this optimal batch size is around 2000-3000 tokens.
  • Modern GPU architectures like Nvidia's Blackwell utilize highly optimized interconnects (NVLink) to create "scale-up domains" (e.g., a single rack), allowing for efficient all-to-all communication necessary for MoE layers. Scaling beyond these domains incurs significant latency penalties due to slower external networks.
  • API pricing models for LLMs, such as the differential cost for input vs. output tokens, are a direct reflection of the underlying hardware cost structures, particularly the cost of processing the KV cache during token generation (decode).
  • Pipeline parallelism reduces per-device memory capacity requirements in training by splitting the model's layers across pipeline stages, but offers no inherent latency or cost-per-token benefit for inference.
  • The "memory wall" is a multi-faceted constraint, encompassing not just HBM capacity but also memory bandwidth and the speed of interconnects within and between hardware racks. This is a primary factor limiting larger context lengths and model scaling.
  • The pricing strategy for caching (e.g., cache hits being 10x cheaper) indicates the use of memory hierarchies (HBM, DDR, Flash, Spinning Disk) to manage costs based on data access frequency.
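
The first two takeaways can be illustrated with a minimal sketch of a roofline-style estimate for one decode step. All hardware and model numbers below are hypothetical, chosen only for illustration, and the class and function names are not from the episode. The sketch also assumes every weight is read from memory on every step, which overstates memory traffic for a sparse model at very small batch sizes.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    # Hypothetical accelerator numbers, for illustration only.
    flops_per_s: float       # peak compute throughput
    mem_bytes_per_s: float   # memory bandwidth


@dataclass
class Model:
    # Hypothetical sparse model, not the specs of any real model.
    total_params: float      # all parameters (assumed fetched every step)
    active_params: float     # parameters used per token
    bytes_per_param: float   # weight precision in bytes
    kv_bytes_per_seq: float  # KV cache bytes read per sequence per step


def step_time(hw: Hardware, m: Model, batch: int) -> float:
    """Seconds for one decode step: the slower of compute and memory."""
    # About 2 FLOPs (multiply + add) per active parameter per token.
    compute_s = 2.0 * m.active_params * batch / hw.flops_per_s
    # Weights are fetched once per step; KV cache traffic grows with batch.
    weight_bytes = m.total_params * m.bytes_per_param
    kv_bytes = m.kv_bytes_per_seq * batch
    memory_s = (weight_bytes + kv_bytes) / hw.mem_bytes_per_s
    return max(compute_s, memory_s)


def time_per_token(hw: Hardware, m: Model, batch: int) -> float:
    """Accelerator-seconds spent per generated token, a proxy for cost."""
    return step_time(hw, m, batch) / batch


hw = Hardware(flops_per_s=1.2e15, mem_bytes_per_s=4.0e12)
model = Model(total_params=600e9, active_params=30e9,
              bytes_per_param=1.0, kv_bytes_per_seq=50e6)

for batch in (1, 32, 1024, 8192):
    print(batch, step_time(hw, model, batch), time_per_token(hw, model, batch))
```

Under these assumed numbers, the step time barely changes while the step is memory-bound, so the time spent per token falls almost in proportion to the batch size until compute or KV cache traffic starts to dominate.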

NOTABLE QUOTES

"The big effect is batch size."
"If you do not batch together many users, the cost and the economics you get can be like a thousand times worse than if you do batch many users together."
"For a given hardware configuration... there is a lower bound on latency, which is simply the time to read all of my total parameters from memory into the chips."
"The optimal batch size needs to be bigger than approximately 300 times sparsity."
"One rack is actually the bounds the size of an expert layer you can do."
"Pipeline parallelism... in inference actually the effect of pipeline on anything you care about like batch size or latency actually is neutral."
"The cost ratio [input vs output tokens] is really talking about the ratio between those two mechanisms for producing it [tokens]."

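The quotes on the latency lower bound and the batch-size threshold can be connected with a short worked derivation. This is a sketch under stated assumptions, not the episode's own notation: the symbols are introduced here for illustration ($F$ for compute throughput in FLOPs per second, $W$ for memory bandwidth in bytes per second, $b$ for bytes per parameter, $B$ for batch size, $N_{\text{total}}$ and $N_{\text{active}}$ for total and active parameter counts), and KV cache traffic is ignored.

```latex
\begin{align*}
t_{\text{compute}} &\approx \frac{2\, N_{\text{active}}\, B}{F},
&
t_{\text{memory}} &\approx \frac{b\, N_{\text{total}}}{W},
\\[4pt]
t_{\text{step}} &\approx \max\left(t_{\text{compute}},\, t_{\text{memory}}\right)
\;\ge\; \frac{b\, N_{\text{total}}}{W}
&& \text{(latency lower bound)}
\\[4pt]
t_{\text{compute}} \ge t_{\text{memory}}
&\iff
B \;\ge\; \frac{F\, b}{2\, W} \cdot \frac{N_{\text{total}}}{N_{\text{active}}}
&& \text{(batch-size threshold)}
\end{align*}
```

Read this way, the threshold is a hardware-dependent prefactor $F b / (2W)$ multiplied by the sparsity ratio $N_{\text{total}} / N_{\text{active}}$. The quoted figure of roughly 300 would play the role of that prefactor, although the exact hardware and precision assumptions behind it are not given in this summary.
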
Summarized with DriftNote — AI-powered podcast summaries
