
> it's very good for HN headlines to make a $X.cpp but it's not the right tool for products.

Funny how it's the opposite.

ONNX in this case, outside of the HN headline and the "we did it" factor, is almost useless.

LLMs are so heavy that you can't afford to run an under-optimized version. This FP16 ONNX export takes 4x as much memory and is probably 5-10x slower than something hand-optimized such as llama.cpp or exllama with 4-bit quants.
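The 4x figure follows directly from the per-weight storage cost (2 bytes at FP16 vs. half a byte at 4 bits). A back-of-the-envelope sketch, using a hypothetical 7B-parameter model and ignoring runtime overhead like activations and the KV cache:

```python
def model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in bytes for a given per-weight precision."""
    return n_params * bits_per_weight / 8

n = 7e9                        # illustrative 7B-parameter model
fp16 = model_bytes(n, 16)      # 14 GB of weights
int4 = model_bytes(n, 4)       # 3.5 GB of weights

print(f"FP16:  {fp16 / 1e9:.1f} GB")   # FP16:  14.0 GB
print(f"4-bit: {int4 / 1e9:.1f} GB")   # 4-bit: 3.5 GB
print(f"ratio: {fp16 / int4:.0f}x")    # ratio: 4x
```

Real 4-bit formats store per-block scales alongside the weights, so the practical ratio is slightly below 4x, but the order of magnitude holds.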



Right, and as my comment alludes to, you can quantize it to FP4; they have a nice CLI tool that took me about 15 seconds to run on a 150 MB model. Here you'd do that with each shard.


Straight 4-bit quantization of LLMs sacrifices some quality. The currently popular frameworks quantize the models in blocks to get better results, which also makes runtime performance less straightforward.

I think ONNX would need to natively support this "packed" format, or else quantize FP16 models from disk on the fly... which is a problem, as the FP16 models are huge.


"some" quality - naive quantization (that isn't just the usual full->half) absolutely mutilates the model, to the point that picking a smaller model is much better.

GPTQ is also more than just group-wise quantization and takes a decent while to run. Quantizing models on the fly without a major performance hit is not possible.



