Hacker News | fouc's comments

I'm a bit surprised the article makes no mention of Google's TurboQuant[0], introduced 26 days prior.

Given that TurboQuant results in a 6x reduction in memory usage for KV caches and an up to 8x boost in speed, this optimization is already showing up in llama.cpp, enabling significantly bigger contexts without having to run a smaller model to fit it all in memory.

Some people thought it might significantly improve the RAM situation, though I remain a bit skeptical - the demand is probably still larger than the reduction TurboQuant brings.

[0] https://news.ycombinator.com/item?id=47513475


TurboQuant is known across the industry to not be state of the art. There are superior schemes for KV quant at every bitrate. E.g., SpectralQuant: https://github.com/Dynamis-Labs/spectralquant among many, many papers.

> Given that TurboQuant results in a 6x reduction in memory usage for KV caches

It all depends on the baseline. The "6x" comes from comparing against a BF16 KV cache, not against a state-of-the-art 8- or 4-bit KV cache scheme.
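To make the baseline point concrete, here's a back-of-the-envelope sketch. The model dims below are made up (roughly a GQA 8B-class config), not any specific model:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes/elem
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical model: 32 layers, 8 KV heads (GQA), head_dim 128, 128k context
dims = dict(layers=32, kv_heads=8, head_dim=128, seq_len=128_000)

bf16 = kv_cache_bytes(**dims, bytes_per_elem=2)    # 16-bit baseline
int8 = kv_cache_bytes(**dims, bytes_per_elem=1)    # 8-bit cache
int4 = kv_cache_bytes(**dims, bytes_per_elem=0.5)  # 4-bit cache

for name, b in [("bf16", bf16), ("int8", int8), ("int4", int4)]:
    print(f"{name}: {b / 2**30:5.1f} GiB, {bf16 / b:.0f}x smaller than bf16")
```

Against the bf16 baseline, a plain 4-bit cache is already "4x", so a 6x headline claim only buys you about 1.5x over that baseline.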


BTW, a number of corrections. The TurboQuant paper was submitted to Arxiv back in April 2025: https://arxiv.org/abs/2504.19874

Current "TurboQuant" implementations achieve about 3.8x-4.9x compression (with the higher end taking significant hits to GSM8K performance) at about 80-100% of baseline speed (no improvement, or a regression): https://github.com/vllm-project/vllm/pull/38479

For those not paying attention, it's probably worth running this and the ongoing discussion for vLLM https://github.com/vllm-project/vllm/issues/38171 and llama.cpp through your summarizer of choice - TurboQuant is fine, but not a magic bullet. Personally, I've been experimenting with DMS, and I think it has a lot more promise and can be stacked with various quantization schemes.

The biggest KV-cache savings, though, come from improved model architecture. Gemma 4's SWA/global hybrid saves up to 10x on KV cache, MLA/DSA does as well (the latter also helps with global-attention compute), and using linear and SSM layers saves even more.
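A quick sketch of why the local/global hybrid helps, assuming 1 global layer per 6 and a 1024-token sliding window (real models vary in both ratio and window size):

```python
# KV footprint in cached tokens: global layers cache the full sequence,
# sliding-window (SWA) layers cap out at the window size.
def hybrid_kv_tokens(total_layers, global_every, window, seq_len):
    n_global = total_layers // global_every
    n_swa = total_layers - n_global
    return n_global * seq_len + n_swa * min(window, seq_len)

full = 48 * 128_000  # all-global baseline: every layer caches 128k tokens
hybrid = hybrid_kv_tokens(48, global_every=6, window=1024, seq_len=128_000)
print(f"saving: {full / hybrid:.1f}x")
```

At long contexts the saving approaches the global-layer ratio (here 6x), since the SWA layers' contribution becomes negligible; more aggressive ratios or smaller windows push it higher.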

None of these reduce memory demand (Jevons paradox, etc.), though. Looking at my coding tools, I'm using about 10-15B cached tokens/mo currently (up from 5-8B a couple of months ago), and while I think I'm probably above average on the curve, I don't consider myself to be doing anything especially crazy. This year, between mainstream developers and more and more agents, I don't think there's really any limit to the number of tokens that people will want to consume.


The work going into local models seems to be targeting lower RAM/VRAM, which will definitely help.

For example, Gemma 4 32B, which you can run on an off-the-shelf laptop, is at around the same or even a higher intelligence level than the SOTA models from 2 years ago (e.g. gpt-4o). Probably by the time memory prices come down, we will have something as smart as Opus 4.7 that can be run locally.

Bigger models of course have more embedded knowledge, but just knowing that they should make a tool call to do a web search can bypass a lot of that.


The net effect won’t be a memory use reduction to achieve the same thing. We’ll do more with the same amount of memory. Companies will increase the context windows of their offerings and people will use it.

That is the sad reality of the future of memory.


I am not convinced that more context will be useful; practical use of current models with a 1M-token context window shows they get less effective as the window grows. Given that model progress is slowing as well, perhaps we end up reaching a balance of context size and competency sooner than expected.

Stuff in more code. Stuff in more system prompt. Stuff in raw UTF-8 characters instead of tokens to fix strawberries. Stuff in WAY more reasoning steps.

Given the current tech, I also doubt there will be practical uses, and I hope we'll see the opposite of what I wrote. But given the current industry, I fully trust them to somehow fill their hardware.

Market history shows us that when the cost of something goes down, we do more with the same amount, not the same thing with less. But I deeply hope to be wrong here and that the memory market will relax.


You still need to hold the model in memory. If you have, for example, 16 GB of RAM, the gains aren't that much.

That's not what consumes the most memory at scale. The KV caches are per-user.
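Rough numbers on why the per-user caches dominate at scale - all figures below are illustrative assumptions, not measurements:

```python
# One copy of the weights is shared by every user on the box;
# each concurrent request carries its own KV cache.
weights_gib = 16     # assumed: 8B-class model in bf16
kv_per_user_gib = 4  # assumed: one user's cache at a long context

for users in (1, 8, 64):
    total = weights_gib + users * kv_per_user_gib
    kv_share = users * kv_per_user_gib / total
    print(f"{users:3d} users: {total:4d} GiB total, {kv_share:.0%} is KV cache")
```

With a single local user the weights dominate, which is the 16 GB laptop case above; with dozens of concurrent users the per-user caches are most of the footprint, which is why KV-cache compression matters so much for serving.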

You can still use as much memory, but fit more things into it, so I don’t think the current market hogs will let go easily.

That will only increase the demand for RAM, as models will now be usable in scenarios that weren't feasible prior, and the ceiling for model and context size is not even visible at this point.

I hate to mention Jevons paradox, as it has become cliché by now, but this is a textbook example of it.


Or Firefox would still be using Android's file system / upload process, which probably hands off the photos with geotags stripped already.

I'm pretty sure this is what happens on the iPhone at least, so I'd imagine it is the same on Android.


Lol, he's probably not technically allowed to set that as a requirement in most places, I would guess. Funny side quest to offer, though.

It would be interesting to talk with a Victorian-era chatbot, complete with Victorian-era ethics, and to see how much it diverges from modern-era ethics.


I don't understand the init.d script hate ;)


In your "demo" image the menu bar is completely missing. This seems like a very confusing choice. I can barely make out the menu bar icon against the background image.


"is a real architectural win, not just a convenience." AI use spotted


I agree, I think the problem is the seamless integration, where it renders only the application against the macOS environment. I'd much prefer something more like the cocoa-way example, where there's a window that has its own background, and the applications run inside that. Not sure if XQuartz supports running a compositor or window manager.


XQuartz used to support rooted mode. I played with an early version back in the PowerPC era, and ran a regular desktop with WindowMaker and everything, using software from MacPorts. It was kind of a "parallel universe", as XQuartz would take over the whole screen in rooted mode and you had to switch between it and the Mac desktop, but it looked and functioned like a typical Linux or Unix desktop of the early 2000s.


I tried that a few times back in the day, but I found it so jarring and ugly against the macOS GUI. The problem was that it rendered the application alone, for seamless integration. I don't remember if there was even an option to run a compositor or window manager such that you had a proper window with its own background and the Linux apps showing up inside it (like the cocoa-way example).


The problem is when the environment is already optimized for car use, when everything is massively spread out. Hard to get people to stop using cars when infra for walking is an afterthought.

