Chip Log

Why speculative decoding wants two kinds of silicon: NVIDIA, Groq, d-Matrix, Gimlet Labs, NVLink Fusion

Disaggregated speculative decoding, why the new class of accelerators is a better fit than GPUs, and how 3D DRAM will put pressure on HBM-based GPUs

Subbu
Mar 14, 2026

Over the last few years, most of the conversation in AI hardware has understandably centered on training. Bigger clusters, higher bandwidth fabrics, more FLOPs, and ever-larger memory.

Inference is now forcing a different kind of conversation.

LLM inference has two phases, prefill and decode. Prefill processes the prompt and builds the initial KV cache. Decode is what happens next: the model generates one token, feeds that token back in, then generates the next token, and so on. That loop (one token at a time) is inherently serial, which is why decode often becomes the limiting step in production inference. It shows up directly as user-visible latency, because it determines how long it takes to produce a complete response. The technical term for this is autoregressive decoding.

Two phases of LLM inference - Prefill and Decode. Decode is serial by nature.
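The serial loop described above is easy to sketch in a few lines. The "model" here is just a toy next-token function standing in for a full forward pass; the point is the loop shape, where each step depends on the previous one:

```python
def autoregressive_decode(model_next, prompt, n_tokens):
    """The serial decode loop: each new token depends on all previous
    tokens, so the steps cannot run in parallel."""
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(model_next(seq))   # one full model pass per token
    return seq[len(prompt):]

# Toy "model": not a real LLM, just enough to exercise the loop.
print(autoregressive_decode(lambda s: (s[-1] + 1) % 10, [3], 5))  # [4, 5, 6, 7, 8]
```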

One of the most promising responses to the serial nature of the decode phase is speculative decoding. Instead of asking an expensive trillion-parameter model to do all the work, it splits the decode phase into two parts:

  • A smaller, faster “draft model” that proposes tokens quickly, and

  • A larger “target model” that verifies these proposals and decides which tokens to accept and commit.

Once you look at inference in this way, the hardware implications become clear. Drafting and verification have different memory, compute, and latency requirements. The target model (larger parent) benefits from the high-throughput characteristics of large GPU systems. The draft model (smaller child) often benefits from hardware that is optimized for low-latency, small batches, and fast token-by-token generation. As a result, the two models can benefit from very different silicon architectures, each tuned to its part of the pipeline.

That’s why I find the recent NVIDIA–Groq deal strategically interesting. It’s less about any single chip, and more about what it signals. The industry is treating draft-model-class inference seriously, and speculative decoding is one of the cleanest system-level mechanisms to turn that into lower latency and better economics.

This post systematically explores this topic and is organized as follows:

  • A simple explanation of speculative decoding, a metric called acceptance rate, and why it matters

  • Why disaggregating the draft and target models is a system-level win

  • Why running draft models on Blackwell or Rubin GPUs can be an awkward fit, and why specialized architectures like d-Matrix and Groq are a strong alternative

  • 3D-DRAM-based XPUs versus HBM-based GPUs

  • NVLink Fusion and how it can enable heterogeneous chip ecosystems

  • What speculative decoding means for this era of Agentic AI

  • Conclusion — SRAM vs 3D DRAM vs HBM

Disclosure: I work at d-Matrix. This post reflects my personal views and does not represent the views of the company.

Speculative decoding, simply explained

In traditional LLM serving, you deploy a single model, say Llama-70B, and it generates every token itself. During the decode phase, it repeats the same serial loop over and over. Generate a token, append it, run again, and continue until the answer is complete.

Speculative decoding, introduced by Google and DeepMind in 2023, builds on the observation that this large model does not need to generate every token by itself. Instead, end-to-end latency can be reduced, while keeping output quality high, by splitting decode into two phases.

  • Draft phase: A small model that closely approximates the large model (for example, Llama-7B) generates the next K tokens sequentially, but quickly.

  • Target phase: The larger, more accurate model (for example, Llama-70B) verifies those K draft tokens in a single pass, treating them as a batch, and decides which tokens to accept and commit. The rest are rejected.

Then the process repeats. Draft a few more tokens, verify the next chunk, commit what’s accepted. This draft → verify loop continues until the full response is generated.
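The draft → verify loop can be sketched in a few lines of Python. The "models" below are toy functions rather than real LLMs, and verification uses a simplified greedy match instead of the paper's full rejection-sampling rule, but the control flow is the same: draft K tokens, verify them, commit the accepted prefix plus the target's correction.

```python
import random

random.seed(0)

# Toy stand-ins for real models. Each maps a token sequence to the next
# token it would greedily pick; these are illustrative, not real LLM APIs.
def draft_next(seq):
    return (seq[-1] * 31 + 7) % 100

def target_next(seq):
    # Agrees with the draft most of the time, diverges occasionally,
    # mimicking an ~80% acceptance rate.
    t = (seq[-1] * 31 + 7) % 100
    return t if random.random() < 0.8 else (t + 1) % 100

def speculative_decode(prompt, n_tokens, k=4):
    """Greedy speculative decoding: draft k tokens, verify them against
    the target, commit the accepted prefix plus one correction."""
    seq = list(prompt)
    while len(seq) - len(prompt) < n_tokens:
        # Draft phase: the small model proposes k tokens serially.
        drafted = []
        for _ in range(k):
            drafted.append(draft_next(seq + drafted))
        # Target phase: verify the k proposals. In a real system this is
        # a single batched forward pass over all drafted positions.
        for i in range(k):
            t = target_next(seq + drafted[:i])
            if t == drafted[i]:
                seq.append(drafted[i])      # accepted
            else:
                seq.append(t)               # rejected: commit correction
                break
    return seq[len(prompt):len(prompt) + n_tokens]

out = speculative_decode([42], n_tokens=16)
print(out)
```

When the draft and target agree, several tokens are committed per expensive target pass; when they diverge, the loop still makes progress by committing the target's own choice.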

The figure below, from Google’s paper, shows how speculative decoding reduces wall-clock time compared to standard inference. This is quite promising because it can deliver meaningful gains in both latency and throughput without sacrificing output quality.

K, the number of draft tokens, cannot be increased indefinitely. There is a point of diminishing returns, as the rejection rate increases. [Source: Google. Annotations by Chiplog]

The promise of speculative decoding

A technical post by NVIDIA from Dec 2024 shows just how dramatic speculative decoding can be in practice.

  • Llama-70B on one Hopper H200 yields about 51 tokens/sec without speculative decoding. But when they pair it with a tiny Llama-1B draft model, still on the same GPU, the throughput jumps to about 146 tokens/sec.

  • For a much larger model, Llama-405B on 4× H200, they show about 33 tokens/sec without speculative decoding. With speculative decoding using a Llama-3B draft model, throughput jumps roughly 3.6× to about 120 tokens/sec.

A roughly 3× improvement in inference throughput from a single technique is quite remarkable.

Speculative decoding inference with Llama-70B as the target model and Llama-1B as the draft yields 146 tps, compared to 51 tps with vanilla Llama-70B, deployed on 1 × H200. [Source: NVIDIA]
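The speedup factors implied by NVIDIA's published numbers are straightforward to compute:

```python
# Throughput figures (tokens/sec) from the NVIDIA post cited above:
# (baseline, with speculative decoding)
runs = {
    "Llama-70B on 1x H200":  (51.0, 146.0),
    "Llama-405B on 4x H200": (33.0, 120.0),
}
for name, (base, spec) in runs.items():
    print(f"{name}: {spec / base:.2f}x speedup")
# Llama-70B on 1x H200: 2.86x speedup
# Llama-405B on 4x H200: 3.64x speedup
```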

Acceptance Rate

The original intuition with speculative decoding was that there are simple, predictable continuations where a draft model can easily guess what comes next, like completing:

“The quick brown fox …”

But in more esoteric cases (rare subject domains, weird proper nouns, highly constrained formatting, tricky reasoning steps) the draft may diverge quickly, and most of the tokens get rejected by the target model.

A key metric to notice here is the acceptance rate. The draft model still has to do a decent job. If the target model accepts something like 80–90% of the draft tokens, then speculative decoding starts to feel useful. It will reduce end-to-end latency and raise throughput. But if most of the draft tokens are rejected then the whole trick falls apart. You’re doing extra work to draft tokens that you end up throwing away.
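This sensitivity to acceptance rate can be made concrete. Under the simplifying assumption (used in the analysis of the original Google paper) that each draft token is accepted independently with probability alpha, the expected number of tokens committed per target verification pass is (1 − alpha^(K+1)) / (1 − alpha):

```python
def expected_committed_tokens(alpha, k):
    """Expected tokens committed per target verification pass, assuming
    each of the k draft tokens is accepted independently with
    probability alpha: (1 - alpha**(k + 1)) / (1 - alpha)."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.8, 0.9):
    print(alpha, round(expected_committed_tokens(alpha, k=4), 2))
# 0.5 -> 1.94, 0.8 -> 3.36, 0.9 -> 4.1 tokens per expensive target pass
```

At 90% acceptance you get over four tokens per target pass; at 50% you barely get two, and the drafting overhead starts to eat the gains.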

Medusa & EAGLE: Alternative mechanisms of speculative decoding

Since its introduction in 2023, speculative decoding has remained an active research area and has gotten better over time. The core idea is still the same: keep output quality high while reducing how often the expensive target model has to step through the serial decode loop.

A lot of follow-on work focuses on improvements such as higher acceptance rates, lower overhead, and simpler deployment.

An important line of work avoids the requirement to load two completely separate models. Medusa and EAGLE take this approach: they add a draft stage inside the target model itself, so you get speculative-style candidates without standing up a full separate draft model. This is equivalent to grafting a small draft head onto the LLM's pipeline.
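A minimal sketch of the Medusa idea: extra lightweight heads share the target model's final hidden state, and each predicts a token a few positions ahead, so one forward pass yields a chain of candidates. All sizes and names here are illustrative, and real Medusa heads are small MLPs trained against the base model, not random projections:

```python
import random

random.seed(0)
HIDDEN, VOCAB, N_HEADS = 8, 20, 3   # illustrative sizes

def argmax_dot(hidden, weights):
    """Pick the vocab entry whose weight row best matches the hidden state."""
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in weights]
    return scores.index(max(scores))

def rand_head():
    return [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(VOCAB)]

lm_head = rand_head()                                  # predicts token t+1
medusa_heads = [rand_head() for _ in range(N_HEADS)]   # t+2, t+3, t+4

def propose(hidden):
    """One forward pass yields 1 + N_HEADS candidate tokens."""
    return [argmax_dot(hidden, lm_head)] + [
        argmax_dot(hidden, w) for w in medusa_heads
    ]

hidden = [random.uniform(-1, 1) for _ in range(HIDDEN)]
print(propose(hidden))   # four candidate tokens from a single pass
```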

A useful reference here is the recent survey of speculative decoding techniques.

Source: NVIDIA

Why disaggregating draft and target models is a system-level win

In June 2025, the ByteDance team published a remarkable paper called SwiftSpec and argued that existing speculative decoding systems were leaving significant performance on the table. They traced it to two main issues.

First, the draft stage sits idle when the target stage is performing verification, i.e., the draft → verify loop still happens sequentially. The Nth draft stage cannot execute until the (N-1)th verify step has completed. So the end-to-end latency is not optimized.

Second, large target models usually require multiple GPUs with model parallelism. Earlier deployments often co-located draft and target, and then tried to scale them both the same way across GPUs. SwiftSpec points out why this is inefficient in small-batch serving. The draft and target have very different compute needs, and when you force both through the same tensor-parallel setup, you run into compute imbalance, KV cache consistency issues, and communication overhead that is hard to amortize at low batch sizes.

They solved these two problems by first disaggregating the draft and the target models and running them on different groups of NVIDIA Hopper H200 GPUs, as shown in the figure below.

SwiftSpec - Disaggregated speculative decoding. [Source: ByteDance, annotation by Chiplog]

This allowed each side to scale differently depending on the models deployed, whether it’s Llama, Qwen, or DeepSeek. The figure below shows how end-to-end performance changes as you vary how many H200 GPUs are allocated to the target versus the draft model.

Target TP = 2, 4, 6 indicates the number of H200 GPUs allocated to the Target Model. The remaining portion of the 8 x H200s in the DGX system were allocated to Draft. [Source: ByteDance]

They also developed a novel scheme to let the two models run asynchronously. The draft model generates a tree of tokens and can move ahead and generate candidate tokens for the next iteration while the target stage is still verifying the previous one. After each step, the two models synchronize state, and the draft “re-roots” by keeping only the portion of its token tree that the target has verified. This is represented in the figure below.

With this setup, they report an average 1.75X speedup over previous speculative decoding systems. The highlight of the paper is their claim that SwiftSpec serves Llama3-70B at 348 tokens/s on 8 NVIDIA Hopper GPUs, which they describe as the fastest known system for low-latency LLM serving at this scale.

Draft model re-rooting. [Source ByteDance, annotation by Chiplog]
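The re-rooting step can be sketched with a token tree stored as nested dicts. This is a simplified illustration of the SwiftSpec idea; a real system also tracks per-branch probabilities and KV-cache state:

```python
# A draft token tree as nested dicts: token -> subtree. After the target
# verifies a prefix, the draft "re-roots" by keeping only the subtree
# hanging off the verified path and discarding all other branches.
def re_root(tree, verified_path):
    node = tree
    for tok in verified_path:
        if tok not in node:
            return {}          # target diverged; discard the whole tree
        node = node[tok]
    return node                # surviving subtree becomes the new root

tree = {5: {7: {9: {}}, 8: {}}, 6: {}}
print(re_root(tree, [5, 7]))   # -> {9: {}}  (drafted work survives)
print(re_root(tree, [5, 3]))   # -> {}       (speculation wasted)
```

Because the draft keeps generating into the tree while the target verifies, accepted branches represent speculative work that is already done by the time the two models synchronize.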

Why Blackwell and Rubin are awkward draft engines and why heterogeneous XPUs can win

Decode is memory-bound, and draft decoding is the worst-case version of that

SwiftSpec has a section called “GPU Constraints for Low Batch Size”. The basic point is that GPUs are incredible when you can keep them busy with large batches and big, regular kernels.
