Six months ago I'd have said EPYC Turin. You could do a heck of a build with 12C...

mythz · 2026-02-12T01:03:53 1770858233

Not VRAM? What performance are people getting running GLM or Kimi on DDR5?

Gracana · 2026-02-12T18:06:49 1770919609

It's important to have enough VRAM to get the kv cache and shared trunk of the model on GPU, but beyond that it's really hard to make a dent in the pool of 100s of gigabytes of experts.

I wish I had better numbers to compare with the 2x M3 Ultra setup. My system is a few RTX A4000s on a Xeon with 190GB/s actual read bandwidth, and I get ~8 tok/s with experts quantized to INT4 (for large models with around 30B active parameters like Kimi K2.) Moving to 1x RTX Pro 6000 Blackwell and tripling my read bandwidth with EPYC Turin might make it competitive with the the macs, but I dunno!

There's also some interesting tech with ktransformers + sglang where the most frequently-used experts are loaded on GPU. Pretty neat stuff and it's all moving fast.

Gracana · 2026-02-14T19:41:09 1771098069

There's a reddit comment here https://www.reddit.com/r/LocalLLaMA/comments/1r4m4it/comment... that says:

my system is running GLM-5 MXFP4 at about 17 tok/s. That’s with a single RTX Pro 6000 on an EPYC 9455P with 12 channels of DDR5-6400. Only 16k context though, since it’s too slow to use for programming anyway and that’s the only application where I need big context.