Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae8256...

btown · 2026-02-11T18:01:59 1770832919

Thank you for continuing to maintain the only benchmarking system that matters!

Context for the unaware: https://simonwillison.net/tags/pelican-riding-a-bicycle/

gabiruh · 2026-02-11T19:38:40 1770838720

It's interesting how some features, such as green grass, a blue sky, clouds, and the sun, are ubiquitous among all of these models' responses.

btown · 2026-02-11T21:44:37 1770846277

If you were a pelican, wouldn't you want to go cycling on a sunny day?

Do electric pelicans dream of touching electric grass?

Magniquick · 2026-02-12T03:37:38 1770867458

> Do electric pelicans dream of touching electric grass?

That would be shocking news to me.

davidwritesbugs · 2026-02-12T07:49:39 1770882579

Please leave the Internet :)

derefr · 2026-02-11T21:49:04 1770846544

It is odd, yeah.

I'm guessing both humans and LLMs would tend to get the "vibe" from the pelican task, that they're essentially being asked to create something like a child's crayon drawing. And that "vibe" then brings with it associations with all the types of things children might normally include in a drawing.

l_eo · 2026-02-11T20:55:29 1770843329

They will start to max this benchmark as well at some point.

ljm · 2026-02-11T22:25:50 1770848750

It's not a benchmark though, right? Because there's no control group or reference.

It's just an experiment on how different models interpret a vague prompt. "Generate an SVG of a pelican riding a bicycle" is loaded with ambiguity. It's practically designed to generate 'interesting' results because the prompt is not specific.

It also happens to be an example of the least practical way to engage with an LLM. It's no more capable of reading your mind than anyone or anything else.

I argue that, in the service of AI, there is a lot of flexibility being created around the scientific method.

tylervigen · 2026-02-11T22:32:56 1770849176

For 2026 SOTA models I think that is fair.

For the last generation of models, and for today's flash/mini models, I think there is still a not-unreasonable binary question ("is this a pelican on a bicycle?") that you can answer by just looking at the result: https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/

vidarh · 2026-02-12T07:58:53 1770883133

RLHF (reinforcement learning from human feedback) is to a large extent about resolving that ambiguity by simply polling people for their subjective judgement.

I've worked on an RLHF project for one of the larger model providers, and the instructions provided to the reviewers were very clear: if there was no objectively correct answer, they were still required to choose the best answer. While there were of course disagreements in the margins, groups of people do tend to converge on the big lines.

interstice · 2026-02-11T22:36:56 1770849416

So if it can generate exactly what you had in mind, presumably based on the most subtle of cues, like your personal quirks gleaned from a few sentences, that could be _terrifying_, right?

9dev · 2026-02-12T11:19:29 1770895169

Simon has written a page specifically for you: https://simonwillison.net/2025/nov/13/training-for-pelicans-...

segmondy · 2026-02-12T01:01:04 1770858064

This is actually a good benchmark. I used to roll my eyes at it. Then I decided to apply the same idea and ask the models to generate an SVG image of "something" (I'm not going to put it out there). There was a strong correlation between how good the models are and the image they generated. These were also no vision images, so I don't know if you are serious, but this is a decent benchmark.

hasperdi · 2026-02-12T09:48:54 1770889734

That's a bike that's ergonomically designed for pelicans.

It is unreasonable to expect pelicans to ride human bikes, they have different anatomy.

MrsPeaches · 2026-02-12T10:23:51 1770891831

The next frontier:

Draw a pelican on a bicycle ergonomically designed for pelicans.

ben_w · 2026-02-12T11:14:06 1770894846

It may be a joke, but I think this is correct.

For reasons, I have tried to get Stable Diffusion to put parrots into spacesuits. Always ended up with the beak coming out where the visor glass should've been, either no wings at all or wings outside the suit, legs and torso just human-shaped.

ChatGPT got the helmet right, but their wings and tail (and sometimes claws) were exposed to vacuum, still very much closer to a human in either a normal or scifi space suit that happens to also be wearing a parrot head inside the space suit, and has tacked some costume wings on the outside.

Essentially, it's got the same category of wrong as fantasy art's approach to what women's armour should look like: aesthetics are great, but it would be instantly lethal if done for real.

simonw · 2026-02-12T19:37:37 1770925057

My more advanced prompt, for when models do a good job on the original, is this one:

> Generate an SVG of a California brown pelican riding a bicycle. The bicycle must have spokes and a correctly shaped bicycle frame. The pelican must have its characteristic large pouch, and there should be a clear indication of feathers. The pelican must be clearly pedaling the bicycle. The image should show the full breeding plumage of the California brown pelican.

mitjam · 2026-02-12T11:08:32 1770894512

Thereafter: Design a bike that an actual pelican can learn to ride in real life.

_joel · 2026-02-11T18:09:46 1770833386

Now this is the test that matters, cheers Simon.

RC_ITR · 2026-02-11T21:23:55 1770845035

The bird not having wings, but all of us calling it a 'solid bird' is one of the most telling examples of the AI expectations gap yet. We even see its own reasoning say it needs 'webbed feet' which are nowhere to be found in the image.

This pattern of considering 90% accuracy (like the level we've seemingly stalled out on for the MMLU and AIME) to be 'solved' is really concerning for me.

AGI has to be 100% right 100% of the time to be AGI and we aren't being tough enough on these systems in our evaluations. We're moving on to new and impressive tasks toward some imagined AGI goal without even trying to find out if we can make true Artificial Niche Intelligence.

zarzavat · 2026-02-12T10:15:07 1770891307

This test is so far beyond AGI. Try to spit out the SVG for a pelican riding a bicycle. You are only allowed to use a simple text editor. No deleting or moving the text cursor. You have 1 minute.

RC_ITR · 2026-02-12T23:49:12 1770940152

Sorry, is your definition of AGI "doing things worse than humans can do, but way faster"? Because that's been true of computers for a long time.

pixl97 · 2026-02-13T14:14:22 1770992062

I mean for this particular benchmark, yes.

You'd have to put it in an agentic loop to perform corrections otherwise.

Rudybega · 2026-02-11T21:50:39 1770846639

MMLU performance caps out around 90% because there are tons of errors in the actual test set. There's a pretty solid post on it here: https://www.reddit.com/r/LocalLLaMA/comments/163x2wc/philip_...

As far as I can tell for AIME, pretty much every frontier model gets 100% https://llm-stats.com/benchmarks/aime-2025

RC_ITR · 2026-02-12T23:44:13 1770939853

Here's the score for new AIMEs, where we know the answers aren't in training.

https://matharena.ai/?view=problem&comp=aime--aime_2026

As for MMLU, is your assertion that these AI labs are not correcting for errors in these exams and then self-reporting scores less than 100%?

As implied by the video, wouldn't it then take 1 intern a week max to fix those errors and allow any AI lab to become the first to consistently 100% the MMLU? I can guarantee Moonshot, DeepSeek, or Alibaba would be all over the opportunity to do just that if it were a real problem.

kingstnap · 2026-02-12T15:50:36 1770911436

The benchmarks are harder than you might imagine and contain more wrong answers and terrible questions than you would expect.

You don't need to take my word for it, try playing MMLU yourself.

https://d.erenrich.net/are-you-smarter-than-an-llm/index.htm...

It's not MMLU-Pro btw, which is considerably harder.

RC_ITR · 2026-02-12T23:50:21 1770940221

Sure and AGI will 100% it 100% of the time, even if it is hard.

hieudesu · 2026-02-14T13:35:48 1771076148

Your definition of AGI must be absurd

simonw · 2026-02-12T00:22:52 1770855772

It has a wing. Look at the code comments in the SVG!

solarized · 2026-02-11T20:22:51 1770841371

This Pelican benchmark has become irrelevant. SVG is already ubiquitous.

We need a new, authentic scenario.

viraptor · 2026-02-11T20:52:41 1770843161

Like identifying names of skateboard tricks from the description? https://skatebench.t3.gg/

alargemoose · 2026-02-11T21:23:55 1770845035

I don’t care how practical it may or may not be, this is my new favorite LLM benchmark

stevage · 2026-02-11T21:38:38 1770845918

I couldn't find an about page or similar?

viraptor · 2026-02-11T22:00:18 1770847218

Here's the public sample https://github.com/T3-Content/skatebench/blob/main/bench/tes...

I don't think there's a good description anywhere. https://youtube.com/@t3dotgg talks about it from time to time.

hmottestad · 2026-02-11T21:25:18 1770845118

o3-pro is better than 5.2 pro! And GPT 5 high is best. Really quite interesting.

echelon · 2026-02-11T21:56:20 1770846980

  1. Take the top ten searches on Google Trends 
     (on day of new model release)
  2. Concatenate
  3. SHA-1 hash them
  4. Use this as a seed to perform random noun-verb 
     lookup in an agreed upon large sized dictionary. 
  5. Construct a sentence using an agreed upon stable 
     algorithm that generates reasonably coherent prompts
     from an immensely deep probability space.

That's the prompt. Every existing model is given that prompt and compared side-by-side.
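The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the word lists, the newline separator used for concatenation, the `pick` helper, and the sentence template are all hypothetical stand-ins for the "agreed upon" dictionary and algorithm, and the list of searches is passed in by hand rather than fetched from Google Trends.

```python
import hashlib

# Hypothetical stand-ins for the "agreed upon large sized dictionary".
# Nouns all start with a consonant so the fixed article "a" stays grammatical.
NOUNS = ["pelican", "walrus", "bicycle", "lighthouse", "teapot", "cactus"]
VERBS = ["riding", "painting", "juggling", "repairing", "chasing"]


def seed_from_trends(searches):
    """Steps 2-3: concatenate the searches and SHA-1 hash them.

    The newline separator is an assumption; the comment only says "concatenate".
    """
    joined = "\n".join(searches)
    return hashlib.sha1(joined.encode("utf-8")).digest()


def pick(words, seed, slot):
    """Step 4: deterministic lookup of one word for a given slot number.

    Re-hashing the seed together with the slot index avoids depending on any
    particular random number generator implementation.
    """
    digest = hashlib.sha1(seed + slot.to_bytes(4, "big")).digest()
    return words[int.from_bytes(digest, "big") % len(words)]


def make_prompt(searches, nouns=NOUNS, verbs=VERBS):
    """Step 5: build a sentence from a fixed (assumed) template."""
    seed = seed_from_trends(searches)
    subject = pick(nouns, seed, 0)
    verb = pick(verbs, seed, 1)
    obj = pick(nouns, seed, 2)
    return f"Generate an SVG of a {subject} {verb} a {obj}"


if __name__ == "__main__":
    # Placeholder inputs, not real trending searches.
    example_searches = [f"search term {i}" for i in range(10)]
    print(make_prompt(example_searches))
```

Because every step is a pure function of the input list, anyone with the same list of searches and the same word lists gets the same prompt, which is the property the proposal relies on. Varying the slot numbers would yield further sentences from the same seed.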

You can generate a few such sentences for more samples.

Alternatively, take the top ten F500 stock performers. Some easy signal that provides enough randomness but is easy to agree upon and doesn't provide enough time to game.

It's also something teams can pre-generate candidate problems for to attempt improvement across the board. But they won't have the exact questions on test day.

TZubiri · 2026-02-12T00:46:48 1770857208

The idea at the time was that it was obviously not part of the training set; now that it's a metric, it's worthless. Try an elephant smoking a cigar on the beach.

blurbleblurble · 2026-02-12T04:41:50 1770871310

Have you tried with qwen-coder-next yet?

pwython · 2026-02-11T18:25:07 1770834307

How many pelican riding bicycle SVGs were there before this test existed? What if the training data is being polluted with all these wonky results...

bwilliams18 · 2026-02-11T20:45:19 1770842719

I'd argue that a model's ability to ignore/manage/sift through the noise added to the training set from other LLMs increases in importance and value as time goes on.

nerdsniper · 2026-02-11T19:36:20 1770838580

You're correct. It's not as useful as it (ever?) was as a measure of performance...but it's fun and brings me joy.

brianjking · 2026-02-12T02:45:46 1770864346

Pretty damn great bird, tbh.