Llama 2 on ONNX runs locally

c_o_n_v_e_x · on Aug 11, 2023

For anyone running locally, could you please describe your hardware setup? CPU only? CPU+GPU(s)? How much memory? What type of CPU? Particularly interested in larger models (say >30b params).

For transparency, I work for an x86 motherboard manufacturer and the LLM-on-local-hw space is very interesting. If you're having trouble finding the right HW, would love to hear those pain points.

brucethemoose2 · on Aug 11, 2023

The most popular performant desktop llm runtimes are pure GPU (exLLAMA) or GPU + CPU (llama.cpp). People stuff the biggest models that will fit into the collective RAM + VRAM pool, up to ~48GB for llama 70B. Sometimes users will split models across two 24GB CUDA GPUs.

Inference for me is bottlenecked by my GPU and CPU RAM bandwidth. TBH the biggest frustration is that OEMs like y'all can't double up VRAM like you could in the old days, or sell platforms with a beefy IGP, and that quad channel+ CPUs are too expensive.

Vulkan is a popular target runtimes seem to be heading for. IGPs with access to lots of memory capacity + bandwidth will be very desirable, and I hear Intel/AMD are cooking up quad channel IGPs.

On the server side, everyone is running Nvidia boxes, I guess. But I had a dream about an affordable llama.cpp host: The cheapest Sapphire Rapids HBM SKUs, with no DIMM slots, on a tiny, dirt cheap motherboard you can pack into a rack like sardines. Llama.cpp is bottlenecked by bandwidth, and ~64GB is perfect.

mk_stjames · on Aug 11, 2023

If you're at a motherboard manufacturer I have some definite points for you to hear.

One is.. there are essentially zero motherboards that space out x16 pci-e slots so that you can appropriately use more than 2 triple-slot GPUs. 3090s and 4090s are all triple slot cards, but often motherboard are putting x16 slots spaced 2-apart, with x8 or less slots in between. There may be a few that allow you to fit 2x cards, but none that would support 3x I don't think, and definitely none that do 4x. Obviously that would result in a non-standard length motherboard (much taller). But in the ML world it would be appreciated because it would be possible to build quad card systems without watercooling cards or using A5000/A6000 or other dual slot, expensive datacenter cards.

And then, even for dual-slot cards like the A5000/A6000 etc again there are very few motherboards that you can get the x16 slots spaced appropriately. The Supermicro H12SSL-i is about the only one that gets 4 x16 slots double-slot spaced appropriately and in a way that you could run 4 blower or WC'd cards and not overlap something else. And then, even when you do, you have the problem of the pin headers on the bottom of the motherboard interfere with the last card. That location of the pin headers is archaic and annoying and just needs to die.

Remember those mining-rig specialty motherboards, with all the wide-spaced pci-e slots for like 8x GPU's at once? we need that, but with x16-bandwidth slots. Those mining cards were typically only x1 bandwidth slots (even if they were x16 length) because for mining, bandwidth between cards and CPU isn't a problem, but for ML it is.

Sure, these won't fit the typical ATX case standards. But if you build it, they will come.

brucethemoose2 · on Aug 11, 2023

This would have to be a server board, or at least a HEDT board.

And yeah, as said below, 4x 4090s would trip most circuit breakers and require some contortions to power with a regular PSU. And it would be so expensive that you mind as well buy 2x A6000s.

Really, the problem is no one will sell fast, sanely priced 32GB+ GPUs. I am hoping Intel will disrupt this awful status quo with Battlemage.

ohgodplsno · on Aug 11, 2023

The thought of power draw of 4 4090s going through a commercial along with insisting on staying with aircooling in a case that is now going to be horribly cramped and with no airflow probably keeps a firefighter awake a night sometimes.

There's no reasonable use of consumer motherboards having the space for that. Even SLI is more or less abandoned. Needless to say:

* Making a new motherboard form factor, incompatible with _everything in the market_

* Having to make therefore new cases, incompatible with _everything in the market_ (or, well, it'll be compatible. It'll just be extremely empty.

* Having to probably make your own PSUs because I wouldn't trust a constant 1200W draw from just GPUs on your average Seasonic PSU

If you build it, not only will noone come, but they also don't have the money to pay for what you'd charge to even just offset costs.

Tepix · on Aug 12, 2023

There are cases on the market today with more than 7 slots next to each other like the Fractal Design Define 7 XL and the Fractal Design Meshify 2 XL. These could fit three 3-slot cards with the right mainboard.

I think there is a market for it and i hope these products will arrive soon.

Tepix · on Aug 12, 2023

I totally agree!

There are a few more boards with 4 x16 slots at 2 slot spacing:

GIGABYTE MU92-TU0

ASRock Rack SPC621D8 (3 variants)

ASRock C621A WS

Supermicro X12SPA-TF

ImprobableTruth · on Aug 11, 2023

I use two 3090s to run the 70b model at a good speed. Takes 32 gigs of vram, more depending on context. I tried CPU+GPU (5900X + 3090) but with extended context it's slow enough that I wouldn't recommend it (~1 token/s). CPU only gets "let it run over night" slow. Works ok-ish for with a small context though (even if it's still "non-interactive" slow).

josephg · on Aug 11, 2023

What’s the difference in output quality between that and the 33b parameter model? That would fit entirely in vram, right?

brucethemoose2 · on Aug 11, 2023

The 33B model is llama V1. Facebook reportedly held back 34B llama v2 because it failed some safety metrics.

So... Generally the quality is worse, but the available set of finetunes is totally different. Some llama v1 33b finetunes are not available in 70B, and extremely good at their niche.

Also 70B should get more than 1 token/sec on a single 3090 offloaded to CPU. I dunno what framework op is using.

a20eac1d · on Aug 11, 2023

Any chance you could point me in the right direction on how to set something like this up?

Right now, I'm using pure CPU Llama but only the 17B version, based on I believe llama.cpp. How do I mix both CPU and GPU together for more performance?

brucethemoose2 · on Aug 11, 2023

The easy way: download koboldcpp. Otherwise you have to compile llama.cpp (or kobold.cpp) with opencl or cuda support. There are instructions for this on the git page.

Then offload as many layers as you can to the gpu with the gpu layers flag. You will have to play with this and observe your gpu's vram.

MuffinFlavored · on Aug 11, 2023

what example niche?

brucethemoose2 · on Aug 11, 2023

- roleplaying (chronos merged with airoboros)

- theraputic/friend style chat (Samantha)

- translation (various single language finetines)

- medical advice (can't remember this one)

This is non exhaustive. And Llama V2's extended native context does really help some niches (like storytelling) that a few 33B models are still pretty good at.

cosmojg · on Aug 11, 2023

> medical advice (can't remember this one)

You're probably thinking of Clinical Camel: https://huggingface.co/augtoma/qCammel-70-x

brucethemoose2 · on Aug 11, 2023

This is a 70B tune. And its new to me. Looks interesting!

Tepix · on Aug 12, 2023

I'm running two RTX 3090 at PCIe 4.0 x8 on a X570 board w/ 128GB DDR4 @ 3200.

Going beyond that is very expensive right now.

The AMD X670 chipset offers 28 PCIe 5.0 lanes, can't you make a mainboard with three x16 PCIe 4.0 slots out of that? Ideally two models: One with 2-slot spacing (for watercooling) and another (oversized) board with 3-slot spacing for cases like the Fractal Design Meshify 2 XL.

c_o_n_v_e_x · on Aug 13, 2023

Slot spacing on motherboards is a challenge due to high frequency signal attenuation. You have finite limits on how far your slots can be from the CPU. Your signal budget/allowable distances are decreasing as each successive PCIe generation runs at a higher frequency.

Yes, you can space out slots widely, however this means you have to use PCIe redrivers/retimers which adds cost to the board. You can also use different materials for the motherboard but again, this adds cost.

We'd love to provide better slot spacing configs, but there are technical and commercial tradeoffs to be made.

Tepix · on Aug 13, 2023

There are boards on the market that offer four PCIe 4.0 x16 slots at 2x spacing. Offering three PCIe 4.0 x16 slots at 3x spacing means just going one slot position further. I hope it's possible.

sandGorgon · on Aug 11, 2023

In EdgeChains, we run models using DeepJavaLibrary (DJL). Preferably only CPU - more focused on edge+embedding usecases

thevania · on Aug 11, 2023

I would love to see an AMD MI300A board for hobbyist :D

rrherr · on Aug 10, 2023

How does this compare to using https://github.com/ggerganov/llama.cpp with https://huggingface.co/models?search=thebloke/llama-2-ggml ?

brucethemoose2 · on Aug 10, 2023

Very unfavorably. Mostly because the ONNX models are FP32/FP16 (so ~3-4x the RAM use), but also because llama.cpp is well optimized with many features (like prompt caching, grammar, device splitting, context extending, cfg...)

MLC's Apache TVM implementation is also excellent. The autotuning in particular is like black magic.

Havoc · on Aug 10, 2023

Speaking of MLC - recently discovered they have a iphone app that can do lama7b locally on high end iphones at decent pace. Bit hard to find in the store given the ocean of API front end apps - called MLCchat.

skeletoncrew · on Aug 10, 2023

I tried quite a few of these and the ONNX one seems the most elegantly put together of all. I’m impressed.

Speed can be improved. Quick and dirty/hype solutions, not sure.

I really hope ONNX gets traction it deserves.

mrbungie · on Aug 10, 2023

Unless I'm working in an academic project I couldn't care less about elegancy.

Speed (and ergo costs) trumps elegancy in industrial settings imho, specially considering how expensive to run LLMs are. Such an OSS dependency can be improved or at least "wrapped" to isolate it from the rest of your codebase.

brucethemoose2 · on Aug 11, 2023

As do features.

Even if llama.cpp was ~30% slower and the same size as ONNX, something like a custom grammar implementation or an extended context would be a huge deal.

kanwisher · on Aug 11, 2023

ONNX doesn’t give you any benefits over something like llama.cpp and it’s significantly slower for zero advantage

version_five · on Aug 10, 2023

> Quick and dirty/hype solutions, not sure.

Curious what you mean by this

refulgentis · on Aug 10, 2023

It's tough to hear and communicate, but TL;DR: it's very good for HN headlines to make a $X.cpp but it's not the right tool for products.

There's only going to be more of these models and ONNX starts from the right place, cross-platform and from base principles rather than coupling tightly to one model structure.

Most importantly, it is freakin' awesome, the comments thus far, 30 in, don't reflect what its like to use or it's technical realities.*

* the main threads of discussion are "not even wrong": float16 is big compared to float4 (its trivial to quantize to your liking) and looking for an alternative that supports CoreML (ONNX is the magic that makes your model take advantage of CoreML / WebGPU / WebGL / whatever Android's marketing name for its API is etc. etc. etc.)

redox99 · on Aug 10, 2023

> it's very good for HN headlines to make a $X.cpp but it's not the right tool for products.

Funny how it's the opposite.

ONNX in this case, outside of the HN headline and saying "we did it" is almost useless.

LLMs are so heavy that you can't afford running a suboptimized version. This FP16 ONNX takes 4x as much memory and is probably 5-10x slower than something hand optimized such as llama.cpp or exllama with 4 bit quants.

refulgentis · on Aug 10, 2023

Right, and as my comment alludes to, quantize it to FP4, they got a nice CLI tool that took me about 15 seconds to do that to a 150 MB model. Here youd do that with each shard

brucethemoose2 · on Aug 11, 2023

Straight 4 bit quantization with LLMs sacrifices some quality. The currently popular frameworks quantize the models in blocks to get better results, which also makes runtime performance less straightforward.

I think ONNX would need to natively support this "packed" format or otherwise quantize fp16 models from disk on the fly... Which is a problem, as the FP16 models are huge.

ImprobableTruth · on Aug 11, 2023

"some" quality - naive quantization (that isn't just the usual full->half) absolutely mutilates the model, to the point that picking a smaller model is much better.

gpt-q is also more than just group wise quantization and takes a decent while. Quantizing models without a major performance hit is not possible on the fly.

kiratp · on Aug 10, 2023

When hardware is so expensive and so difficult to obtain, performance starts to trump the cost of “single model implementation” pretty quickly.

It can be cheaper to deploy Llama.cpp and then foobar.cpp (6 months from now) than it is to have inference that is 2x slower.

Interestingly in the LLM space all these model servers seem to be converging to using the same API as OpenAI, making it easy to swap containers to get a different model+inference server with 0 code change.

brucethemoose2 · on Aug 10, 2023

TVM is all of this. It even has compilation support for exotic devices like FPGAs, phone ASICs and such. MLIR frameworks are heading this direction too.

> its trivial to quantize to your liking

...Except its not implemented in the demo. Also, quantization is far from simple.

All this sounds good, but I have seen cool sounding ONNX demos for years, and (outside of some fairly quick one off TensorRT demos) I havent really seen the pavement hit the road.

brucethemoose2 · on Aug 10, 2023

> ONNX one seems the most elegantly put together of all.

What do you mean by this? The demo UI? Code quality?

version_five · on Aug 10, 2023

Ggml / llama.cpp has a lot of hardware optimizations built in now, CPU, GPU and specific instruction sets like for apple silicon (I'm not familiar with the names). I would want to know how many of those are also present in onnx and available to this model.

There are currently also more quantization options available as mentioned. Though those incur a performance loss (they make the model faster but worse) so it depends on what you're optimizing for.

brucethemoose2 · on Aug 10, 2023

ONNX is a format. There are different runtimes for different devices... But I can't speak for any of them.

> specific instruction sets like for apple silicon

You are thinking of the Accelerate framework support, which is basically Apple's ARM CPU SIMD library.

But Llama.cpp also has a Metal GPU backend, which is the defacto backend for Apple devices now.

moffkalast · on Aug 10, 2023

These are still FP16/32 models, almost certainly a few times slower and larger than the latest N bit quantized GGMLs.

abrookewood · on Aug 11, 2023

For anyone unsure what ONNX actually is: "ONNX is an open format built to represent machine learning models ... [which] defines a common set of operators ... a common file format ... [and should make] it easier to access hardware optimizations".

[0] https://onnx.ai/

hashtag-til · on Aug 10, 2023

This is very cool! I really hope the ONNX project gets much more adoption in the next months and years and help reduce the fragmentation in the ML ecosystem.

brucethemoose2 · on Aug 10, 2023

Eh... I have seen ONNX demos for years, and they tend to stay barebones and slow, kinda like this.

NCNN, MLIR and TVM based ports have been far more impressive.

mathisfun123 · on Aug 11, 2023

Lololol show me an "MLIR" port. Do you mean tensorflow port or jax port or torch port (that uses torch-mlir)? Or do you really mean llama implemented in linalg/tosa/tendor?

brucethemoose2 · on Aug 11, 2023

I wasn't talking about Llama specifically. I was thinking of the SHARK Stable Diffusion port (which uses MLIR/IREE), as it considerably outpaced the ONNX runtime.

But apparently the performance of llama on torch-mlir is progressing: https://github.com/nod-ai/SHARK/issues/1707

lostmsu · on Aug 10, 2023

Have you used TVM much? Is it only good for inference?

brucethemoose2 · on Aug 11, 2023

Inference only, as far as I know.

TBH the mlc projects were my first exposure, but they are all impressive, as are the bits of source I poked through. And the autotuning makes a huge difference in my quick testing.

claytonjy · on Aug 10, 2023

I'm not sure there's much chance of that happening. ONNX seems to be the broadest in coverage, but for basically any model ONNX supports, there's a faster alternative.

For the latest generative/transformer stuff (whisper, llama, etc) it's often specialized C(++) stuff, but torch 2.0 compilation keeps geting better, BetterTransformers, TensorRT, etc.

esperent · on Aug 11, 2023

How does Llama 2 compare to GPT-4? I see a lot of discussion about it but not much comparison. I don't have the hardware to run the 13b or 30b model locally so I'd be running it in the cloud anyway. In that case, should I stick with GPT-4?

ImprobableTruth · on Aug 11, 2023

GPT-4 trounces everything else available. I'd say that 70b is about 3.5 level, unless you're doing something where finetuning greatly benefits you.

13b is alright for toy applications, but the difference to gpt 3.5 (let alone gpt4) is huge.

chpatrick · on Aug 11, 2023

Vicuna 1.5 13b seems comparable to GPT 3.5 to me, and the fact that you can run it locally on last gen commodity hardware is incredible.

I bought a dodgy used mining 3090 last year and now my computer is writing poetry in the terminal in real time.

spacebanana7 · on Aug 11, 2023

Llama 2 has uncensored versions. For many applications that alone makes it superior.

danielbln · on Aug 11, 2023

I've been toying with llama-2-uncensendored via ollama[1] and it's hilarious and oddly liberating to just be able to throw anything at it, without getting the canned "Sorry, as an AI model bla bla bla" back.

[1] https://github.com/jmorganca/ollama

turnsout · on Aug 10, 2023

Does anyone know the feasibility of converting the ONNX model to CoreML for accelerated inference on Apple devices?

kiratp · on Aug 10, 2023

If you’re working with LLMs, just use this - https://github.com/ggerganov/llama.cpp

It has Metal support.

refulgentis · on Aug 10, 2023

That's sort of a non-sequitor, so does ONNX. Conversely, $X.cpp is great for local hobbyist stuff but not at all for deployment to iOS.

kiratp · on Aug 10, 2023

Nobody is deploying 3+GB models to iOS beyond some enthusiast “because you can” apps. Amazing tech but not feasible for any mainstream use yet.

Eg: https://apps.apple.com/app/id6444050820

hustwindmaple1 · on Aug 11, 2023

If you have played any large mobile games, then you would not be surprised to see apps downloading massive files during first open.

brucethemoose2 · on Aug 11, 2023

A small download + an in-app weights download (and a space requirement warning) is probably sane, right?

refulgentis · on Aug 12, 2023

I agree, we're too far down a chain of hypotheticals motivated by "but ONNX must be bad compared to $MODELX.cpp?"

Wouldn't make sense to deploy 4-bit quantization as a product either.

turnsout · on Aug 10, 2023

The size makes it tough for App Store deployment, but I could imagine using a local LLM on-device for an enterprise app.

mchiang · on Aug 10, 2023

They used to have this: https://github.com/onnx/onnx-coreml

refulgentis · on Aug 10, 2023

They still do. HN is way behind on ONNX and I'd go so far as to say it's the "Plastics."[1] of 2023.

[1] https://www.youtube.com/watch?v=PSxihhBzCjk

refulgentis · on Aug 10, 2023

"Not even wrong" question, ONNX is a runtime that can use/uses CoreML.

turnsout · on Aug 10, 2023

I didn't realize that! I wonder how performant a small Llama would be on iOS.

brucethemoose2 · on Aug 10, 2023

MLC's Apache TVM implementation can also compile to Metal.

Not sure if they made an autotuning profile for it yet.

glitchc · on Aug 10, 2023

How was this allowed? I was under the impression that companies the size of Microsoft needed to contact Meta to negotiate a license.

Excerpt from the license:

Additional Commercial Terms. If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee's affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.

thadk · on Aug 10, 2023

> Meta and Microsoft have been longtime partners on AI, starting with a collaboration to integrate ONNX Runtime with PyTorch to create a great developer experience for PyTorch on Azure, and Meta’s choice of Azure as a strategic cloud provider. (sic)

https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-me...

amelius · on Aug 10, 2023

> To get access permissions to the Llama 2 model, please fill out the Llama 2 access request form. If allowable, you will receive GitHub access in the next 48 hours, but usually much sooner.

I guess they send the form to Meta?

Anyway, I hope this is not what Open Source will be like from now on.

stu2b50 · on Aug 10, 2023

So they negotiated a license? Meta partnered with Azure for the Llama 2 launch, there’s no reason to think that they’re antagonistic towards each other.

hint23 · on Aug 11, 2023

For best performance on x86 CPU and Nvidia GPU, ts_server is interesting ( https://bellard.org/ts_server ).

FL33TW00D · on Aug 11, 2023

I think unfortunately ONNX is doomed. The spec is incredibly bloated by now, and the fact that all graphs are entirely static just isn't feasible for modern ML.

fouronnes3 · on Aug 11, 2023

Is there an alternative then?

contravariant · on Aug 11, 2023

An alternative solution to what problem?

If you want to be able to tell someone else how to do some arbitrary calculation then we've solved that problem ages ago, that's what code is.

FL33TW00D · on Aug 11, 2023

Start over?

Havoc · on Aug 10, 2023

Sounds similar to llama's chat edition. Found that to be pretty solid in itself so hopefully this is even better.