This was their play all along with their unethical data collection practices: let others use the APIs to discover the applications, then use the data against them to offer integrated solutions in every vertical of interest. Cursor, once Anthropic’s biggest customer, was one of the first they screwed.
They are also fighting for their lives because these insane valuations simply aren’t justified by being dumb pipes. Fortunately, open weights models are widely available and have crossed a threshold of usefulness that cements their place as good substitutes.
When you read technical papers on various models, you’ll find that they often did most of the pretraining, and even the supervised fine-tuning, on relatively short-context data; then they “extended” the context window by training on a small amount of long-context data. I think this is what is meant by not being trained uniformly.
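One common way labs do this extension is by rescaling the rotary position embeddings (RoPE) so that positions beyond the trained range are interpolated back into it. A minimal sketch of that idea, assuming a hypothetical model trained to 4k context and extended to 16k (the dimensions and scale factor here are illustrative, not from any specific model):

```python
import numpy as np

def rope_angles(position, dim=8, base=10000.0, scale=1.0):
    # Rotation angles for one token position in RoPE.
    # scale > 1 compresses positions back into the trained range
    # (the "position interpolation" trick for context extension).
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return (position / scale) * inv_freq

# Position 16384 with scale 4 produces the same angles the model
# already saw at position 4096 during short-context training.
angles_trained = rope_angles(4096)
angles_extended = rope_angles(16384, scale=4.0)
```

The point is that the extended model never has to represent genuinely novel rotation angles; a little long-context training then teaches it to actually use the stretched window.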
However, now that RL environments and long-horizon agentic performance have taken such a prominent role in model development, I wonder if that practice still holds. I know that the most recent Gemma and Qwen models are incomparably more reliable at long contexts than their predecessors, even though Qwen, for example, already had a 256k context window. It just didn’t work like it does now.
One can’t say that proposition is obvious to the population at large. Else, “we” (as in Earth in 2026) would have very different political dynamics. So maybe Banksy felt inclined to do a public service announcement.
The model outputs a probability distribution for the next token, given the sequence of all previous tokens in the context window. It’s just a list of floats in the same order as the list of tokens that the tokenizer uses.
After that, a piece of software that is NOT the LLM chooses the next token. This is called the sampler. There are different sampling parameters and strategies available, but if you want repeatable* outputs, just take the token with the highest probability (greedy decoding).
* Perfect determinism in this sense is difficult to achieve, because GPU floating-point reductions don’t run in a guaranteed order, so results can vary slightly between runs. But you can get very close.
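The two stages above can be sketched in a few lines. This is a toy example with a hypothetical five-token vocabulary and made-up logits, not any real model’s output:

```python
import numpy as np

def softmax(logits):
    # Convert raw model scores (logits) into a probability distribution.
    # Subtracting the max is a standard numerical-stability trick.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits the model emitted for a 5-token vocabulary;
# index i corresponds to token id i in the tokenizer's vocabulary.
logits = np.array([1.0, 3.5, 0.2, 3.4, -1.0])
probs = softmax(logits)

# Greedy decoding: the sampler simply picks the highest-probability id.
next_token_id = int(np.argmax(probs))
```

Other strategies (temperature, top-k, top-p) reshape or truncate `probs` before drawing a random sample from it, which is where the non-repeatability comes in.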
Believe it or not, in statistics and machine learning the hard-coded parts of a model that impact the results are considered part of the model. But I understand that nowadays we don't care about these things because AI goes brrr.
Hm, I don't think this looks like Anthropic's design style. Anthropic is doing a kind of Chobanicore + Corporate Memphis design system that I personally find a bit creepy. But the website here just feels fresh and pleasant.
Agreed; that's a beautiful site. The main design style apart from minimalism that I notice is glassmorphism. Well, that and a very well chosen Monet to set the tone.
Neither is “more important”; the comparison itself is illogical. I think recent strides in high-performance small LLMs have shown that the tasks LLMs are useful for may not require the level of representational capacity that trillion-parameter models offer.
However: the labs releasing these high-intelligence-density models are getting them by first training much larger models and then distilling down. So the most interesting question to me is: how can we accelerate learning in small networks so we don’t have to train huge teacher networks first?
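For context, the distillation step usually means training the student to match the teacher’s softened output distribution rather than just hard labels. A minimal sketch of that loss (Hinton-style soft targets); the logits and temperature here are made up for illustration:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature T > 1 softens the distribution, exposing the
    # teacher's "dark knowledge" about near-miss classes.
    z = z / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # KL(p || q): the per-example distillation loss term,
    # penalizing the student q for diverging from the teacher p.
    return float(np.sum(p * (np.log(p) - np.log(q))))

T = 2.0
teacher_logits = np.array([4.0, 1.0, 0.5])  # hypothetical big model
student_logits = np.array([3.0, 1.5, 0.2])  # hypothetical small model

loss = kl(softmax(teacher_logits, T), softmax(student_logits, T))
```

The open question in the comment above is whether a small network can reach the same distribution directly, without ever materializing the expensive teacher whose logits this loss depends on.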
This is just blind belief. The model discussed in this topic already outperforms “well made” frontier LLMs of 12-18 months ago. If what you wrote is true, that wouldn’t have been possible.
Absolutely. Plus, as these companies grow hungrier for revenue and more desperate to escape the commodity market they're in, they are only going to get more aggressive in their (ab)use of customer data.
I would recommend trying oMLX, which is much more performant and efficient than LM Studio. It has block-level KV context caching that makes long chats and agentic/tool calling scenarios MUCH faster.