Surely you could just pre-generate rollouts with slightly stale weights and then...

pama · 2026-02-12T13:33:54 1770903234

Cheap verifying of speculative decoding only works for a few tokens at a time. Long sequence generations (thousands to tens of thousands of tokens in typical rollouts for thinking models) are dominated by distribution drift on stale weights (because slightly wrong probabilities multiply over long streams), and the off policy RL training methods dont work well (high variance) for such high dimensional problems.