Now I'm curious... Is your last suggestion correct? Wouldn't the time to cool down between pause intervals be proportionally longer due to the higher thermal mass and cancel out any savings gained by the long pause? Maybe the overall energy draw is even higher because the heat losses are higher when you spend a longer time with a high dT.
The water bottles don't warm up as quickly as the air they displace, which otherwise flows out of the fridge when you open it. So they have two effects: first, they take up space that warm indoor air can't move into, and second, they help chill that incoming air slightly through their own thermal mass.
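A quick back-of-envelope comparison makes the thermal-mass point concrete. The numbers below are standard textbook values (illustrative only), comparing one litre of water against one litre of air:

```python
# Illustrative numbers: thermal mass of 1 L of water vs 1 L of air.
water_heat_capacity = 1.0 * 4186      # 1 kg * 4186 J/(kg*K)
air_heat_capacity = 0.0012 * 1005     # ~1.2 g * 1005 J/(kg*K)
ratio = water_heat_capacity / air_heat_capacity
# ratio is roughly 3500: the bottle needs ~3500x the energy for the
# same temperature swing, which is why it barely warms during a door
# opening while the displaced air warms almost instantly.
```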
Whether you multiply by 10 or 2, the same "counter" argument from the article stands. Only now you don't have a trailing zero after infinite nines, you have a trailing 8.
I don't understand how you can even have a trailing zero after an infinite number of nines. Surely any place that someone would want to put the zero can be refuted by correctly stating that a nine goes there (it's an infinite number of them, after all) and there is literally no "last" place.
I’ve seen videos of actual mathematicians complaining to each other about how the general public thinks like GP. There is no last digit. Every time you reach the horizon there’s another horizon.
Technically you don't have an '8'; the carry propagates forever. Think about it: at each finite truncation, adding another nine turns the previous trailing 8 into a 9 and appends a new 8 after it, so the 8 keeps receding. In the limit you get the repeating decimal 1.9_ with no trailing 8 at all.
There is no eight. This is something I've heard actual mathematicians complain about to other actual mathematicians: the non-math public misunderstands infinite series as "imagine a number so big you can't fathom it, then add one more to it." That's not how things work.
Going as far as you can imagine and a little farther is an infinitesimal of the real infinite.
Very rough (!) napkin math: for a q8 model (almost lossless) you have parameters = VRAM requirement. For q4 with some performance loss it's roughly half. Then you add a little bit for the context window and overhead. So a 32B model q4 should run comfortably on 20-24 GB.
Again, very rough numbers, there's calculators online.
One of the things I'm still struggling with when using LLMs over NLP is classification against a large corpus of data. If I get a new text and I want to find the most similar text out of a million others, semantically speaking, how would I do this with an LLM? Apart from choosing certain pre-defined categories (such as "friendly", "political", ...) and then letting the LLM rate each text on each category, I can't see a simple solution yet except using embeddings (which I think could just be done using BERT and does not count as LLM usage?).
I've used embeddings to define clusters, then passed sampled documents from each cluster to an LLM to create labels for each grouping. I had pretty impressive results from this approach when creating a category/subcategory labels for a collection of texts I worked on recently.
That's interesting, it sounds a bit like those cluster graph visualisation techniques. Unfortunately, my texts seem to fall into clusters that really don't match the ones that I had hoped to get out of these methods. I guess it's just a matter of fine-tuning now.
Feed one through an LLM, one word at a time, and keep track of words that experience greatly inflated probabilities of occurrence, compared to baseline English. "For" is probably going to maintain a level of likelihood close to baseline. "Engine" is not.
Wouldn't a simple comparison of the word frequency in my text against a list of usual word frequencies do the trick here without an LLM? Sort of a BM25?
It might; it's not going to do the same thing. The LLM will tell you words that would likely appear in a similar text. Word frequency will tell you words that have actually appeared in your text. I'm postulating that the first kind of list is much more likely to show strong overlap between two similar documents than the second kind of list.
Vocabulary style matters a lot to what words are actually used, but much less to what words are likely to be used. If I'm following a style guide that says to use "automobile" instead of "car", appearance probabilities for "automobile" will be greatly inflated. And appearance probabilities for "car" will also be greatly inflated, just to a lesser extent than for "automobile". Whereas actual usage of "car" will be pegged at zero.
Determining how similar two texts are is something that an LLM should be good at. It should be better than a simple comparison of word frequency. Whether it's better enough to justify the extra compute is a different question.
The "issue" with saying an LLM can't do this is that CFD simulations are not actually that niche. Many university courses ask their students to write these types of algorithms for their course project. All this knowledge is present freely on the internet (as is evident by the Youtube videos that the author mentioned), and as such can be learned by an LLM. The article is of course still very impressive.
Great point. Niche to me, but not to thee. I was unaware. This is actually one of the frustrating things about the LLMs - they don’t tell you when what you asked for is outside their training data!
I'm a bit surprised by the amount of comments comparing the cost to (often cheap) cloud solutions. Nvidia's value proposition is completely different in my opinion. Say I have a startup in the EU that handles personal data or some company secrets and wants to use an LLM to analyse it (like using RAG). Having that data never leave your basement sure can be worth more than $3000 if performance is not a bottleneck.
Heck, I'm willing to pay $3000 for one of these to get a good model that runs my requests locally. It's probably just my stupid ape brain trying to do finance, but I'm infinitely more likely to run dumb experiments with LLMs on hardware I own than I am while paying per token (to the point where I currently spend way more time with small local llamas than with Claude), and even though I don't do anything sensitive I'm still leery of shipping all my data to one of these companies.
This isn't competing with cloud, it's competing with Mac Minis and beefy GPUs. And $3000 is a very attractive price point in that market.
Yep! I don't spend much time there because I got pretty comfortable with llama before that subreddit really got started, but it's definitely turned up some helpful answers about parameter tuning from time to time!
Even for established companies this is great. A tech company can have a few of these locally hosted and users can poll the company LLM with sensitive data.
The price seems relatively competitive even compared to other local alternatives like "build your own PC". I'd definitely buy one of this (or even two if it works really well) for developing/training/using models that currently run on cobbled together hardware I got left after upgrading my desktop.
> Having that data never leave your basement sure can be worth more than $3000 if performance is not a bottleneck
I get what you're saying, but there are also regulations (and your own business interest) that expects data redundancy/protection which keeping everything on-site doesnt seem to cover
Hey Jeremy, very exciting release! I'm currently building my first product with RoBERTa as one central component, and I'm very excited to see how ModernBERT compares. Quick question: When do you think the first multilingual versions will show up? Any plans of you training your own?