
> Quick and dirty/hype solutions, not sure.

Curious what you mean by this



It's tough to hear and communicate, but TL;DR: it's very good for HN headlines to make a $X.cpp but it's not the right tool for products.

There's only going to be more of these models and ONNX starts from the right place, cross-platform and from base principles rather than coupling tightly to one model structure.

Most importantly, it is freakin' awesome. The comments thus far, 30 in, don't reflect what it's like to use or its technical realities.*

* the main threads of discussion are "not even wrong": float16 is big compared to float4 (it's trivial to quantize to your liking) and looking for an alternative that supports CoreML (ONNX is the magic that makes your model take advantage of CoreML / WebGPU / WebGL / whatever Android's marketing name for its API is, etc. etc. etc.)


> it's very good for HN headlines to make a $X.cpp but it's not the right tool for products.

Funny how it's the opposite.

ONNX in this case, outside of the HN headline and saying "we did it" is almost useless.

LLMs are so heavy that you can't afford to run an unoptimized version. This FP16 ONNX model takes 4x as much memory and is probably 5-10x slower than something hand-optimized such as llama.cpp or exllama with 4-bit quants.
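The memory gap described here is easy to sanity-check with back-of-envelope arithmetic (assuming a hypothetical 7B-parameter model and ignoring quantization scale/zero-point overhead, which adds a few percent):

```python
# Rough memory math for a hypothetical 7B-parameter model.
params = 7_000_000_000

fp16_bytes = params * 2    # FP16: 2 bytes per weight
int4_bytes = params // 2   # 4-bit: half a byte per weight (metadata overhead ignored)

print(fp16_bytes / 1e9)             # 14.0  -> ~14 GB
print(int4_bytes / 1e9)             # 3.5   -> ~3.5 GB
print(fp16_bytes / int4_bytes)      # 4.0   -> the "4x as much memory" above
```

In practice group scales push the 4-bit figure slightly higher, but the 4x ratio holds to first order.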


Right, and as my comment alludes to, quantize it to FP4; they have a nice CLI tool that took me about 15 seconds to run on a 150 MB model. Here you'd do that with each shard.


Straight 4 bit quantization with LLMs sacrifices some quality. The currently popular frameworks quantize the models in blocks to get better results, which also makes runtime performance less straightforward.

I think ONNX would need to natively support this "packed" format or otherwise quantize fp16 models from disk on the fly... Which is a problem, as the FP16 models are huge.
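The "quantize in blocks" idea above can be sketched in a few lines of NumPy: each block of weights gets its own scale, so one outlier only distorts its own block rather than the whole tensor. This is a simplified symmetric scheme for illustration; real frameworks (llama.cpp's k-quants, GPTQ) use more elaborate layouts.

```python
import numpy as np

def quantize_blockwise(w, block_size=32, bits=4):
    """Quantize a 1-D weight vector in independent blocks, one scale per block."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for symmetric 4-bit
    w = w.reshape(-1, block_size)                     # one row per block
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                         # avoid divide-by-zero on empty blocks
    q = np.round(w / scales).astype(np.int8)          # integer codes in [-qmax, qmax]
    return q, scales

def dequantize_blockwise(q, scales):
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, s = quantize_blockwise(w)
err = np.abs(dequantize_blockwise(q, s) - w).max()    # worst-case error is ~scale/2 per block
```

Note the runtime consequence the comment points at: inference now needs a per-block dequantize step (or fused int4 kernels), which is why "packed" formats complicate execution compared to plain FP16.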


"some" quality - naive quantization (that isn't just the usual full->half) absolutely mutilates the model, to the point that picking a smaller model is much better.

GPTQ is also more than just group-wise quantization and takes a decent while. Quantizing models without a major quality hit is not possible on the fly.


When hardware is so expensive and so difficult to obtain, performance starts to trump the cost of “single model implementation” pretty quickly.

It can be cheaper to deploy Llama.cpp and then foobar.cpp (6 months from now) than it is to have inference that is 2x slower.

Interestingly in the LLM space all these model servers seem to be converging to using the same API as OpenAI, making it easy to swap containers to get a different model+inference server with 0 code change.
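The convergence described here means the request shape is identical across servers; swapping llama.cpp's server for vLLM (or any other OpenAI-compatible backend) is just a base-URL change. A minimal sketch, with a hypothetical local endpoint and model name:

```python
import json

# OpenAI-style chat completions payload that llama.cpp's server, vLLM,
# and other OpenAI-compatible backends all accept.
base_url = "http://localhost:8080/v1"   # hypothetical local server; swap this to change backends

payload = {
    "model": "local-model",             # many local servers ignore or alias this field
    "messages": [
        {"role": "user", "content": "Hello"},
    ],
    "temperature": 0.7,
}

request_body = json.dumps(payload)
# POST {base_url}/chat/completions with this body; the response schema
# (choices[0].message.content) is the same across compatible servers.
```

No client code changes beyond the URL, which is exactly the zero-code-change container swap the comment describes.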


TVM is all of this. It even has compilation support for exotic devices like FPGAs, phone ASICs and such. MLIR frameworks are heading this direction too.

> its trivial to quantize to your liking

...Except it's not implemented in the demo. Also, quantization is far from simple.

All this sounds good, but I have seen cool-sounding ONNX demos for years, and (outside of some fairly quick one-off TensorRT demos) I haven't really seen the rubber hit the road.



