The test doesn't prove you have AGI. It proves you don't have AGI. If your AI can't solve these problems that humans can solve, it can't be AGI.
Once the AIs solve this, there will be another ARC-AGI. And so on until we can't find any more problems that can be solved by humans and not AI. And that's when we'll know we have AGI.
AI X that can solve the tests contrasted with AI Y that cannot, with all else being equal, means X is closer to AGI than Y. There's no meaningful scale implicit to the tests, either.
Kinda crazy that Yudkowsky and all those rationalists and enthusiasts spent over a decade obsessing over this stuff, and we've had almost 80 years of elite academics pondering on it, and none of them could come up with a meaningful, operational theory of intelligence. The best we can do is "closer to AGI" as a measurement, and even then, it's not 100% certain, because a model might have some cheap tricks implicit to the architecture that don't actually map to a meaningful difference in capabilities.
Will there be a point in that series of ARC-AGI tests where AI can design the next test, or is designing the next test always going to be a problem that can be solved by humans and not AI?
I don't see why AI couldn't design tests. But they can only be validated by humans, as they are intended to be possible and ideally easy for humans to solve.
Yes, but I guess you see what I'm getting at. If designing the next ARC-AGI test is impossible for AI without a human in the loop, then AGI becomes unreachable by definition.
It doesn't prove anything of the sort. ARC-AGI has always been nothing special in that regard, but this one really takes the cake. A 'human baseline' that isn't really a baseline, and a scoring so convoluted that a model could beat every game in reasonable time and still score well below 100. Really, what are we doing here?
That Francois had to do all this nonsense should tell you the state of where we are right now.
It's a "let's find a task humans are decent at, but modern AIs are still very bad at" kind of adversarial benchmark.
The exact coverage of this one is: spatial reasoning across multiple turns, agentic explore/exploit with rule inference and preplanning. Directly targeted against the current generation of LLMs.
There isn't a strict definition of AGI, there's no way to find evidence for what equates to it, and besides, things like this are meant only as likely necessary conditions.
Anyway, from the article:
> As long as there is a gap between AI and human learning, we do not have AGI.
This seems like a reasonable requirement. Something I think about a lot with vibe coding is that unlike humans, individual models do not get better within a codebase over time, they get worse.
Is that within a codebase of relatively fixed size, where things get worse as time goes on? Or are you saying that as the codebase grows, the model can no longer hold all of it in context and so performs worse than it did when the codebase was smaller?
I think there's a few factors, codebase size is one, and the tendency for vibe coding to be mostly additive certainly doesn't help with that.
But vibe coding also tends to produce somewhat poor architecture, with lots of redundant and intermingled bits that should be refactored. I think the model does worse the worse the code it has to work with, which I presume is only in part because it's fundamentally harder to work with bad code, and also in part because its context is filled with bad code.
The evolution of the test has been partly due to the evolution of AI capabilities. To take the most skeptical view, the types of puzzles AI has trouble solving are in the domain of capabilities where AGI might be required in order to solve them.
By updating the tests specifically in areas AI has trouble with, it creates a progressive feedback loop against which AI development can be moved forward. There's no known threshold or well defined capability or particular skill that anyone can point to and say "that! That's AGI!". The best we can do right now is a direction. Solving an ARC-AGI test moves the capabilities of that AI some increment closer to the AGI threshold. There's no good indication as to whether solving a particular test means it's 15% closer to AGI or .000015%.
It's more or less a best effort empiricist approach, since we lack a theory of intelligence that provides useful direction (as opposed to a formalization like AIXI which is way too broad to be useful in the context of developing AGI.)
I think the idea is that if they cannot perform any cognitive task that is trivial for humans then we can state they haven’t reached ‘AGI’.
It used to be easy to build these tests. I suspect it’s getting harder and harder.
But if we run out of ideas for tests that are easy for humans but impossible for models, it doesn’t mean none exist. Perhaps that’s when we turn to models to design candidate tests, and have humans be the subjects to try them out ad nauseam until no more are ever uncovered? That sounds like a lovely future…
The reality is machines can brute force endlessly to an extent humans cannot, and make it seem like they are intelligent.
That's not intelligence though, even if it may appear to be. Does it matter? That's another question. But it certainly is not a representation of intelligence.
The evidence is that humans are able to win these games. AGI is usually defined as the ability to do any intellectual task about as well as a highly competent human could. The point of these ARC benchmarks is to find tasks that humans can do easily and AI cannot, thus driving a new reasoning competency as companies race each other to beat human performance on the benchmark.
> AGI is usually defined as the ability to do any intellectual task about as well as a highly competent human could
I think one major disconnect, is that for most people, AGI is when interacting with an AI is basically in every way like interacting with a human, including in failure modes. And likely, that this human would be the smartest most knowledgeable human you can imagine, like the top expert in all domains, with the utmost charisma and humor, etc.
This is why the "goal post" appears to be always moving, because the non-commoners who are involved with making AGI and whatnot never want to accept that definition, which to be fair seems too subjective, and instead like to approach AGI as something different: it can solve some problems humans can't, when it doesn't fail it behaves like an expert human, etc.
Even if an AI could do any intellectual task about as well as a highly competent human could, I believe most people would not consider it AGI, if it lacks the inherent opinion, personality, character, inquiries, failure patterns, of a human.
And I think that goes so far that a text-only model can never meet this bar, if it cannot react in equal time to subtle facial cues and sounds, if answering you and the flow of conversation is slower than it would be with a human, etc. All these are also required for what I consider the commoner accepting AGI as having been achieved.
By that definition, does a human at the other end of a high-latency video call not have AGI because they can't react any faster than the connection's latency would allow them to? From your POV what's the difference between that and an AI that's just slow?
> does a human at the other end of a high-latency video call not have AGI because they can't react any faster than the connection's latency would allow them to
Correct. A person who'd mentally operate that slowly would be considered to have some cognitive disability. For example, would likely not be allowed to drive a car.
You could be fooled into thinking it is a human behind a slow connection, but a layman would not consider it real AGI in my opinion. Since you have to handicap the human, it seems like lowering the bar just to pretend you reached AGI.
You might recognize it's pretty close to AGI, if it has all the other qualities, but it needs to also operate at a similar response time, uptime, and so on.
My point is, everyone that's not trying to build AGI defines it as, same as an idealized smartest human would be in every way. I truly think this is how most people imagine AGI in their head, and until you have that, they'll say it's not AGI, and industry folks will claim the goalpost keeps moving, when in reality they kept setting their own post.
That's interesting. I thought the point was that it needed to be in-kernel for performance reasons; if it works in userspace why did linux not do that?
Ideally it does need to be in-kernel for performance reasons. But that's not possible on macOS, so it's better to have it in userspace than not at all.
I mean, I know Mac has had some great games (e.g. I spent so much time on school Macs playing that Bolo tank game) ... but they have probably <1% of the number of games Windows has. I'd expect a similar percentage of devs to be interested in Mace (or whatever you call Mac Wine).
Where did you get that opinion? Germany is not doing great but OK in the group of Western countries, and its car industry is both very important and in trouble, so it's not an unreasonable opinion that things would be better without that trouble.
Germany has a great layer of "consultants" that fudge the books and make everything look profitable and rosy. It's the land of "Arbeitsgruppen" and "Berater" - folks that ensure things get buried and forgotten.
But there is no investment in the future, no investment in infrastructure and no investment in anything creative. In fact, that's where cuts are made: in the arts and culture.
Once a society can no longer afford the arts, you know there is something going wrong and Germany is going wrong. Perhaps "klagen auf hohem Niveau" (complaining from up on high) but the higher they are, the further they fall.
It's not. It's more like a cancer patient with an Überweisung for their first cancer screening but dragging their feet to go and do it. They know it is bad and will get worse but they're afraid of facing it.
imho there are multiple, starting with the pension and healthcare systems, which are not sustainable with the current demographic trend. That pushed them into going all in on immigration, which fractured whatever was left of German identity (which was arguably already wiped out after WW2 and the Cold War). Taxes are going up, retirement age is increasing, pensions are decreasing, public services are getting worse year after year, there is nothing young people can focus on, nothing they can expect to have better than their parents or grandparents, and most will never own their place.
The self-sabotage of the energy sector certainly didn't help. No long-term vision + no clear path to improvement + no sense of belonging = game over, and this is hitting most of the West at once. It's all about individualism and consumption, you can't build societies on these principles.
You wrote my thoughts. Add one more thing: Germany is a federation with an insanely complex administration, with many different (outdated) education systems and too many public healthcare insurers. There is too much regulation of everything, decreasing real efficiency to zero.
Latest example (I am an electrical engineer AND an electrician): from this year on, my buddy, a heating system specialist, can't help me with a photovoltaic system installation on the roof. Last year he was qualified, this year not anymore. He can, however, still install an air conditioning unit on the roof this year. But not the solar panels… Every year some shady lobby group writes some special law crippling the last pieces of a working system.
There should be some deregulation and centralization institution in Germany with a real short-term efficiency increase plan. Otherwise it will remain a country of Oktoberfest and Cologne Carnival.
> it's all about individualism and consumption, you can't build societies on these principles
Lots of real problems listed, but such a non-sequitur conclusion. The US is built on these principles, and China seems to be more individualistic and consumerist than Germany too. If anything, a big problem in Germany is low ambition as the societal norm. A bit of consumerism could actually help with that, as to consume you need to earn, and to earn, you need some ambition.
The tax system and IG Metall salary tables will kill ambition very quickly. The highest salary groups no longer guarantee a comfy lifestyle in the corresponding areas. Giving away half of your salary as mandatory insurance and paying 19% value added tax on the rest is just insulting. Don’t forget the rents in 2026: a new all-time high again. It does not pay off to work anymore.
Yeah, to me it seems that instead of fighting individualism, Germany needs to make sure that it pays off. Higher taxes for ownership, lower taxes for income from one's work for example.
And it's a complete clown show rewarding moral bankruptcy that ended up fabricating and promoting uneducated degenerates such as Trump, Hegseth, Miller, &co to the highest positions.
These are very different problems from what Germany has though. And it's a recent issue, while individualism is a core tenet of American culture since independence.
I have fond memories of porting Cube, Sauerbraten and AssaultCube to the Mac back in the day. Given what I've seen from Wouter back in the day, I am not surprised he is still on it full steam…
Great article, but the 44 tonne limit is not "physics", it is regulation. If an electric truck were allowed to weigh 5 tonnes more, all these calculations would be different.
The computing cost to mine more bitcoin is hailed as the underlying value by proponents of that notion. It depends on bitcoin holders refusing to sell at a price lower than the cost of mining, which isn't a given. It's also a notion that doesn't account for potential innovations such as quantum computing, which would significantly reduce crypto mining costs.
Hindsight is 20/20. That bitcoin is a store of value has been talked about for a very long time when other blockchains overtook it in terms of functionality. People’s memories are short so I am sure it will be touted as such again in a couple years.
> [The] stretch of track that was renovated last May and inspected on January 7.
The track had been inspected very recently. Maybe the inspection standards are inadequate?
The linked article also shows figures that are quite meaningless without context.
> [The] vast majority [of Spain's high-speed rail budget] went to new infrastructure with only some 16% earmarked for maintenance, renewal and upgrades. That compares with between 34% to 39% spent by France, Germany and Italy,
They simply can't compare those numbers as-is. Of course Spain will be spending less in maintenance as a percentage of the total budget if it's still mainly building new tracks. It's not a useful figure.
> The track had been inspected very recently. Maybe the inspection standards are inadequate?
Spanish officials are very good at deflecting blame and playing politics. Nobody wants to be held accountable for a catastrophe. Also see the 2024 floods in Valencia; a partially preventable tragedy, followed by a whole lot of mud slinging, but zero accountability.
So while inspection standards might be inadequate, I would take anything a senior official says with a pound of salt.
But he is correct. If you have a large enough budget for new construction it can make any maintenance expenditure look tiny. The right figures to compare are normalized by length and age of track, not percentages of the total budget.
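To make the normalization point concrete, here is a minimal sketch. Only the 1.5 billion euro total and the 16% share come from the quoted article; the second network, both track lengths and the `RailBudget` class are hypothetical placeholders chosen purely for illustration, not real data. It normalizes by track length only, since track age would need per-segment data.

```python
from dataclasses import dataclass


@dataclass
class RailBudget:
    name: str
    total_spend_eur: float    # yearly spend on the high-speed network
    maintenance_share: float  # fraction of total spend earmarked for maintenance
    track_km: float           # HYPOTHETICAL track length, not a real figure

    @property
    def maintenance_eur(self) -> float:
        # Absolute yearly maintenance spend.
        return self.total_spend_eur * self.maintenance_share

    @property
    def maintenance_per_km(self) -> float:
        # Maintenance spend normalized by track length.
        return self.maintenance_eur / self.track_km


# "builder" uses the article's 1.5bn total and 16% share; its length is made up.
builder = RailBudget("mostly building", 1_500e6, 0.16, 2_000)
# "maintainer" is entirely made up: smaller budget, 36% share, same length.
maintainer = RailBudget("mostly maintaining", 600e6, 0.36, 2_000)

for net in (builder, maintainer):
    print(f"{net.name}: share={net.maintenance_share:.0%}, "
          f"per km={net.maintenance_per_km:,.0f} EUR")
```

With these made-up inputs the network with the smaller maintenance share spends more per km, which is exactly why the percentage alone says little.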
English is unusual in that we have both Germanic "weld" and Latinate "solder" and they've acquired different meanings. Spanish (and other Romance languages) use the cognate of "solder" (soldar) for both.
As an aside: Chinese also uses the same term for both (焊接), and the standard English translation is "welding". This can lead to some confusion when Chinese manufacturers start talking about e.g. "surface-mount welding". :)
Interesting. In Dutch we use 'solderen' vs 'lassen', in German they use 'schweizen' and 'loten'.
English has a third term like that as well, 'brazing', and then there is silver solder (a high temperature version of soldering). In Dutch we'd call that 'hardsolderen', whereas what the English call brazing we call oxy-acetyleen lassen (which is more of a process name by virtue of naming the ingredients).
Soldadura autogeno and Soldadura en el arco (sp?) are, I think, the terms used in Spanish for brazing and (arc) welding.
Ah yes, you are right! I was going by ear, rather than by the written version, in fact I can't recall seeing it written. German is a language that I will happily use but don't ask me to write a letter in it, you'll probably need exponential notation to represent the number of errors.
> Spain spent an average of about 1.5 billion euros ($1.76 billion) a year from 2018 to 2022 on its high-speed network, more than any other country. However, the vast majority went to new infrastructure with only some 16% earmarked for maintenance, renewal and upgrades. That compares with between 34% to 39% spent by France, Germany and Italy, whose networks are far less extensive, according to the Commission data.
Conflating the maintenance budget with the money invested in new infrastructure in this way is not very useful IMHO. How much inspection/maintenance money was spent per km of (high-speed and overall) railway track would be much more informative...
We've gone so over the top on weather fearcasting. Just look out the window if you want to know what the weather is. Save the "the world is ending" messages for truly life-threatening, property-damaging weather (and no, temperature alone doesn't qualify; it's easy to know it's cold or hot by just stepping outside).
Timely. I’m about to turn off severe weather alerts from my local city because they insist on spamming - multiple times per day - cold weather alerts.
And they start at pretty ridiculous temperatures in the double digits. The only way those would be dangerous to you is if you were homeless and lacked any form of winter clothing, at which point you either already know or are too far mentally gone for a text alert to help you.
hahaha we could also track if you typed too fast! ... actually, this is an actual idea, if you use AI to generate the code ... hmmm; that would then be a fun project vs a cloud cost saving one