
> it's very good for HN headlines to make a $X.cpp but it's not the right tool for products.

Funny how it's the opposite.

ONNX in this case, outside of the HN headline and the "we did it" factor, is almost useless.

LLMs are so heavy that you can't afford to run an under-optimized version. This FP16 ONNX export takes 4x as much memory and is probably 5-10x slower than something hand-optimized such as llama.cpp or exllama with 4-bit quants.
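The 4x figure follows directly from the per-weight storage cost (2 bytes at FP16 vs. half a byte at 4 bits). A back-of-the-envelope sketch, using a hypothetical 7B-parameter model and ignoring runtime overhead like activations and the KV cache:

```python
def model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in bytes for a given per-weight precision."""
    return n_params * bits_per_weight / 8

n = 7e9                        # illustrative 7B-parameter model
fp16 = model_bytes(n, 16)      # 14 GB of weights
int4 = model_bytes(n, 4)       # 3.5 GB of weights

print(f"FP16:  {fp16 / 1e9:.1f} GB")   # FP16:  14.0 GB
print(f"4-bit: {int4 / 1e9:.1f} GB")   # 4-bit: 3.5 GB
print(f"ratio: {fp16 / int4:.0f}x")    # ratio: 4x
```

Real 4-bit formats store per-block scales alongside the weights, so the practical ratio is slightly below 4x, but the order of magnitude holds.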



Right, and as my comment alludes to, you can quantize it to FP4; they have a nice CLI tool that took me about 15 seconds to run on a 150 MB model. Here you'd do that with each shard.


Straight 4-bit quantization of LLMs sacrifices some quality. The currently popular frameworks quantize the models in blocks to get better results, which also makes runtime performance less straightforward.

I think ONNX would need to natively support this "packed" format, or else quantize FP16 models from disk on the fly... which is a problem, as the FP16 models are huge.


"some" quality - naive quantization (that isn't just the usual full->half) absolutely mutilates the model, to the point that picking a smaller model is much better.

GPTQ is also more than just group-wise quantization and takes a decent while to run. Quantizing models on the fly without a major performance hit is not possible.



