Not OP. Personal opinion on why it is a somewhat hard problem.

The main problem is using the available compute correctly and productively while doing two very different kinds of work that were previously solved independently: generating responses with LLM inference engines and updating weights with training code. Each training step updates the weights, so the inference engines have to refresh theirs, but we're talking about 750B parameters spread across multiple inference servers. You can keep serving from slightly stale weights instead, but only briefly, and the data generated from them needs special off-policy corrections that themselves cost significant compute and memory.

Your inference engines had better be deterministic (for a given pseudoRNG seed, which clashes with parallelism), or you need a way to correct the probability streams. Ideally inference and training would agree bit for bit when they handle the same context, but we don't live in that world yet.

And of course, GPUs break, for no great reason other than the tiny scale of their features making them fragile. And because you're running at scale, you need to handle failures gracefully and efficiently.
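To give a flavor of the "special corrections": the usual move is a clipped importance-ratio (PPO-style) reweighting of the stale-policy rollouts. A minimal sketch, assuming PyTorch and with made-up names; note that getting new_logprobs already means an extra forward pass over every rollout with the fresh weights, which is part of the compute/memory cost I mentioned:

    import torch

    def corrected_pg_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
        # Rollouts were sampled under slightly stale weights (old_logprobs),
        # so reweight each token by the probability ratio under the current
        # weights (new_logprobs): a PPO-style clipped importance correction.
        ratio = torch.exp(new_logprobs - old_logprobs)  # pi_new / pi_old per token
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        # Clipping bounds how far off-policy data is allowed to pull the update.
        return -torch.min(unclipped, clipped).mean()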
Surely you could just pre-generate rollouts with slightly stale weights and then cheaply verify them once the up-to-date weights stream in, treating the stale generation as a speculative-decoding draft. Sounds quite trivial to me; perhaps I'm missing something.
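Roughly what I have in mind, as a sketch (numpy, made-up names, and assuming you keep the full per-position probability vectors around), using the standard speculative-sampling accept/resample rule:

    import numpy as np

    def verify_rollout(draft_tokens, p_draft, p_target, rng=np.random.default_rng()):
        # Keep each stale-weight token x with probability
        # min(1, p_target(x) / p_draft(x)); on the first rejection, resample
        # from the residual distribution and stop.
        # p_draft / p_target: [seq_len, vocab] probabilities per position.
        accepted = []
        for t, x in enumerate(draft_tokens):
            if rng.random() < min(1.0, p_target[t, x] / p_draft[t, x]):
                accepted.append(x)
            else:
                residual = np.maximum(p_target[t] - p_draft[t], 0.0)
                residual /= residual.sum()
                accepted.append(rng.choice(len(residual), p=residual))
                break  # everything after the first rejection must be regenerated
        return accepted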
Cheap verification in speculative decoding only works for a few tokens at a time. Long generations (thousands to tens of thousands of tokens in typical rollouts for thinking models) are dominated by distribution drift under stale weights, because slightly-off per-token probabilities multiply over long streams, and off-policy RL training methods don't work well (too much variance) in such high-dimensional problems.
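Back-of-the-envelope, with arbitrary numbers (numpy sketch, nothing here is measured):

    import numpy as np

    rng = np.random.default_rng(0)
    seq_len, n_rollouts = 10_000, 1_000

    # 1) Speculative verification: even at a 99% per-token acceptance rate,
    #    the expected accepted prefix is ~1/(1-0.99) = 100 tokens, so almost
    #    all of a 10k-token rollout gets thrown away and regenerated anyway.
    accept_p = 0.99
    print("expected accepted prefix:", 1 / (1 - accept_p))

    # 2) Off-policy correction: per-token log-ratios log(pi_new/pi_old) that
    #    are individually tiny (here ~N(0, 0.01^2)) compound into
    #    sequence-level importance weights whose std exceeds their mean,
    #    i.e. a handful of rollouts dominate the update.
    log_ratios = rng.normal(0.0, 0.01, size=(n_rollouts, seq_len))
    seq_w = np.exp(log_ratios.sum(axis=1))
    print("seq weight mean / std:", seq_w.mean(), seq_w.std())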