ericol's comments | Hacker News

I did some work yesterday with Opus and found it amazing.

Today we are almost on non-speaking terms. I'm asking it to do some simple stuff and he's making incredibly stupid mistakes:

    This is the third time that I have to ask you to remove the issue that was there for more than 20 hours. What is going on here?
and at the same time compaction is firing like crazy (which adds ~4-minute delays every 1 to 15 minutes).

  | # | Time     | Gap before | Session span | API calls |
  |---|----------|------------|--------------|-----------|
  | 1 | 15:51:13 | 8s         | <1m          | 1         |
  | 2 | 15:54:35 | 48s        | 37m          | 51        |
  | 3 | 16:33:33 | 2s         | 19m          | 42        |
  | 4 | 16:53:44 | 1s         | 9m           | 30        |
  | 5 | 17:04:37 | 1s         | 17m          | 30        |
  # — sequential compaction event number, ordered by time.

  Time — timestamp of the first API call in the resumed session, i.e. when the new context (carrying the compaction summary) was first sent to the
  model.

  Gap before — time between the last API call of the prior session and the first call of this one. Includes any compaction processing time plus user
   think time between the two sessions.

  Session span — how long this compaction-resumed session ran, from its first API call to its last before the next compaction (or end of session).

  API calls — total number of API requests made during this resumed session. Each tool use, each reply, each intermediate step = one request.
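For what it's worth, the "Gap before" and "Session span" columns can be recomputed mechanically from per-session API-call timestamps. A rough Python sketch (the data layout here is an assumption for illustration, not Claude's actual log format):

```python
from datetime import datetime, timedelta

def session_metrics(sessions):
    """sessions: list of compaction-delimited sessions, each a sorted
    list of datetime objects, one per API call."""
    rows, prev_last = [], None
    for n, calls in enumerate(sessions, start=1):
        first, last = calls[0], calls[-1]
        # Gap before = time since the last call of the prior session.
        gap = (first - prev_last) if prev_last else timedelta(0)
        rows.append({
            "#": n,
            "time": first.strftime("%H:%M:%S"),
            "gap_before_s": int(gap.total_seconds()),
            "span_s": int((last - first).total_seconds()),
            "api_calls": len(calls),
        })
        prev_last = last
    return rows
```

Feed it the raw call timestamps from a transcript and it reproduces the table above row by row.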

Bottom line: I will probably stay on Sonnet until they fix all these issues.

They won't. These are not "issues", it's them trying to push the models to burn less compute. It will only get worse.

> it's them trying to push the models to burn less compute

I'm curious, how does using more tokens save compute?


Productivity (tokens per second per hardware unit) increases at the cost of output quality, but the price remains the same.

Both Anthropic and OpenAI quantize their models a few weeks after release. They'd never admit it out loud, but it's more or less common knowledge now; no one has enough compute.


Pretty bold claim - do you have a source for that?

There is no evidence, to my knowledge, that the accuracy of the models changes due to release cycles or capacity issues; only latency does. Both Anthropic and OpenAI have stated they don't do any inference-compute shenanigans due to load or post-release optimization.

Tons of conspiracy theories and accusations.

I've never seen any compelling studies (or even raw data) to back any of it up.


Do you have a source for that claim?

My source is that people have been noticing this since the GPT-4 days.

https://arxiv.org/pdf/2307.09009

But of course, this isn't a written statement by a corporate spokesperson. I don't think breweries make such statements when they water down their beer either.


I think the idea is that each action uses more tokens, which means users hit their limits sooner and are consequently unable to burn more compute.

What?

I'm 99.9% sure Opus 4.7 is a smaller model than 4.6.

Too many signs: the sudden jump in TPS (the biggest smoking gun for me), the new tokenizer, commentary about Project Mythos from Ant employees, etc.

It looks like their new Sonnet was good enough to be labeled Opus and their new Opus was good enough to be labeled Mythos.

They'll probably continue post-training and release a more polished version as Opus 5


It could be the adaptive reasoning

If you've not seen the Black Mirror episode "Common People", I strongly recommend it.

The only misprediction it makes is that AI is creating the brain-dead user base...

You have to hook your customers before you reel them in!

https://www.netflix.com/gb/title/70264888?s=a&trkid=13747225...


I am having a shit experience lately. Opus 4.7, max effort.

> You're right, that was a shit explanation. Let me go look at what V1 MTBL actually is before I try again.

> Got it — I read the V1 code this time instead of guessing. Turns out my first take was wrong in an important way. Let me redo this in English.

:facepalm:


> I read the V1 code this time instead of guessing

Does the LLM even keep a (self-accessible) record of previous internal actions to make this assertion believable, or is this yet another confabulation?


No, they do not (to be clear, they have no internal state, just the transcript). It's entirely role-play. LLM apologies are meaningless because the models are mostly stateless: every new response is a "what would a helpful assistant with XYZ prior context continue to say?"
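That statelessness is easy to sketch (roles and messages here are invented for illustration): the client resends the flat transcript every turn, and an "apology" is just the likeliest continuation of that text, not a report from memory.

```python
def render_prompt(history):
    """A chat model never 'remembers'; the client resends the whole
    transcript and the model simply continues it from 'assistant:'."""
    lines = [f"{role}: {text}" for role, text in history]
    lines.append("assistant:")
    return "\n".join(lines)

history = [
    ("system", "You are a helpful assistant."),
    ("user", "Why did you skip reading the file?"),
    ("assistant", "You're right, I guessed instead of reading it."),
    ("user", "So did you actually read it before?"),
]
prompt = render_prompt(history)
# The model only sees this flat text; any claim about what it "did"
# last turn is inferred from the transcript, not from internal state.
```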

Yes, the LLM is able to see the entire prior chat history including tool use. This type of interaction occurs when the LLM fails to read the file, but acts as though it had.

This seems like the experience I've had with every model I've tried over the last several years. It seems like an inherent limitation of the technology, despite the hyperbolic claims of those financially invested in all of this paying off.

Opus 4.6 pre-nerf was incredible, almost magical. It changed my understanding of how good models could be. But that's the only model that ever made me feel that way.

That was better, but still not to the point that I just let it go on my repo.

Yes! I genuinely got a LOT of shit done with Opus 4.6 "pre nerf" with regular old out-of-the-box config, no crazy skills or hacks or memory tweaks or anything. The downfall is palpable. Textbook rugpull.

If it isn’t working for you, why don't you choose an older model, like 4.6?

Matches what I am experiencing. It makes incredibly stupid mistakes.

The weird thing is, yesterday I asked it to test and report back on a 30+ commit branch for a PR and it did that flawlessly.


The docs suggest not using max effort in most cases to avoid overthinking :shrug:

They've jumped the shark. I truly can't comprehend why all of these changes were necessary. They had a literal money printing machine that actually got real shit done, really well. Now it's a gamble every time and I am pulling back hard from Anthropic ecosystem.

It seems clear that it was a money spending machine, not a money printing machine.

> he’s making .. mistakes

Claude and other LLMs do not have a gender; they are not a “he”. Your LLM is a pile of weights, prompts, and a harness; anthropomorphising like this is getting in the way.

You’re experiencing what happens when you sample repeatedly from a distribution. Given enough samples the probability of an eventual bad session is 100%.

Just clear the context, roll back, and go again. This is part of the job.
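The "eventual bad session" point is just the complement rule: with a per-session failure probability p, the chance of at least one bad session in n tries is 1 - (1-p)^n, which goes to 1 as n grows. A tiny sketch:

```python
def p_at_least_one_bad(p_bad, n_sessions):
    """Complement rule: chance of at least one bad session in n
    independent draws, each bad with probability p_bad."""
    return 1 - (1 - p_bad) ** n_sessions

# Even a modest 5% per-session failure rate makes an eventual bad
# session near-certain over a few dozen sessions:
p = p_at_least_one_bad(0.05, 50)  # ≈ 0.92
```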


Why be so upset at someone using pronouns with a LLM?

You are being downvoted but I actually agree with your statement.

This looks dangerously close to cmux but with a narrower focus (just Claude Code).

BTW, the Claude app kind of supports this with the /remote-control command, and that was what made me move away from cmux (I still have to start the sessions there).


When my eldest daughter was in high school (~2010, Argentina) there was a provincial policy where, if every single student scored below a certain threshold on a test, the scores had to be reassessed against the maximum result.

The resulting situation here was that she was constantly bullied into underperforming. Both cases are actually similar in that each individual has a personal incentive to underperform - the difference is that in your friend's case the policy is granted at the company level so no single employee can defect and break it for the rest, while in my daughter's case one high scorer could invalidate the reassessment for everyone, which is exactly what made defection punishable and the bullying emerge naturally.


This is the natural result of "equity" which is the academic jargon term for "forced equality of outcome". High achievers are attacked. People who push us forward are demonized. The low achievers are never pushed to be better. And the average drops.


Can you link a source for it? That sounds too absurd to be true…


It’s not that absurd, and it happens all over the world in university systems. I had a Comp. Sci. professor who taught assembly and graded on a curve. As you might imagine, the one guy who was a wizard at assembly caught flak from the unwashed masses.

I had another professor who not only graded on a curve but also dropped statistical outliers to prevent this problem; he literally explained his system on Day 1 of the course. This was 15+ years ago and by no means a new idea.
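The second professor's scheme can be sketched roughly like this (the exact outlier rule is an assumption; here, scores more than k sample standard deviations above the mean are excluded before fixing the top of the curve):

```python
from statistics import mean, stdev

def curve_baseline(scores, k=1.5):
    """Set the curve's top mark from the best score that is NOT an
    outlier (more than k sample std devs above the mean)."""
    m, s = mean(scores), stdev(scores)
    kept = [x for x in scores if x <= m + k * s]
    return max(kept)

# One assembly wizard at 98 no longer drags the curve for everyone:
curve_baseline([55, 60, 62, 58, 65, 98])  # the curve tops out at 65
```

This removes the incentive to bully the high scorer, since defection by one student can no longer sink the whole class.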


The future is not evenly distributed.

I tried to search for it, but even the 2 documents that superseded the one from around the time my daughter was at school are not available.

I mean, the site doesn't even have a valid TLS certificate, so...

On the site below (in Spanish) you can search for 10/2019; a cursory translation of the document title will show that this is the proper document (for 2019 onwards; the replaced doc 04/2014 isn't available either).

https://koha.chubut.edu.ar/cgi-bin/koha/opac-search.pl?idx=k...



> human intuition driving the exploration

This, a thousand times this.

For me, what AI brings is augmented humans. Just as we don't calculate on paper anymore, what is the reason for doing things by hand when a machine is X times better?

Want to code by hand, as artisans of old? Suit yourself.

I, for one, love the smell of burning chrome.


If "AI" were doing anything more than repeating content from the web without attribution, I might agree with you.


It's not exactly that...


I regularly (say, once a month) do a comparison of results across Claude, Gemini, and ChatGPT. Just for reasons; it's not that I want to see if there's any benefit in changing.

It's not "fair" in that I pay for Claude [1] and not for the others, so model availability is only complete for Claude.

While I did like things at times in how they were presented, I came to really like Sonnet's "voice" a lot over the others.

Take into account that Opus doesn't have the same voice, and I don't like it as much.

[1] I pay for the lower tier of their Max offering.


Thanks for your perspective.


I've had more than a few instances of this over the past 2 years, and my reply is exactly the above.

"What you are doing is against Github's TOS"


> The long-term effect is less clear. If we generate more code, faster, does that reduce cost or just increase the surface area we need to maintain, test, secure, and reason about later?

My take is that the focus is mostly oriented towards code, but in my experience everything around code got cheaper too. In my particular case, I do coding, I do DevOps, I do second level support, I do data analysis. Every single task I have to do is now seriously augmented by AI.

In my last performance review, my manager was actually surprised when I told him that I am now more a manager of my own work than actually doing the work.

This also means my productivity is now probably around 2.5x what it was a couple of years ago.


> In my last performance review, my manager was actually surprised when I told him that I am now more a manager of my own work than actually doing the work.

I think this is very telling. Unless you have a good manager who is paying attention, a lot of them are clueless: they just see the hype of 10x-ing your developers and don't care about the nuance of (as they say) all the surrounding bits of writing code. And unfortunately, they just repeat this to the people above them, who also read the hype and just see the $$ of reducing headcount. (Sorry, venting a little.)


He definitely was paying attention.

He had to pause for a second there, arrested by the realization, and was one of the reasons I got an "Exceeds expectations" in one of my KRAs.


It is interesting though that he evidently didn't notice this 2.5X productivity increase until you pointed it out to him.


Surely the manager will now raise his salary by a huge amount! Maybe even 2.5x


His own, a bonus for managing managers


Surely


This has been my experience, too. In dealing with hardware, I'm particularly pleased with how vision models are shaping up; they're able to identify what I've photographed, put it in a simple text list, and link me to appropriate datasheets. Yesterday, one even figured out how I wanted to reverse engineer a remote display board for a just-released inverter and correctly identified which pin of which unfamiliar Chinese chip was spitting out the serial data I was interested in; all I actually asked for was chip IDs, with a quick vague note on what I was doing. It doesn't help me solder faster, but it gets me to soldering faster.

A bit OT, but I would love to see some different methods of calculating economic productivity. After looking into how BLS calculates software productivity, I quit giving weight to the number altogether and it left me feeling a bit blue; they apply a deflator in part by considering the value of features (which they claim to be able to estimate by comparing feature sets and prices in a select basket of items of a category, applying coefficients based on differences); it'll likely never actually capture what's going on in AI unless Adobe decides to add a hundred new buttons "because it's so quick and easy to do." Their methodology requires ignoring FOSS (except for certain corporate own-account cases), too; if everyone switched from Microsoft365 to LibreOffice, US productivity as measured by BLS would crash.

BLS lays methodology out in a FAQ page on "Hedonic Quality Adjustment"[1], which covers hardware instead of software, but software becomes more reliant on these "what does the consumer pay" guesses at value (what is the value of S-Video input on your TV? significantly more than supporting picture-in-picture, at least in 2020).

[1] https://www.bls.gov/cpi/quality-adjustment/questions-and-ans...
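For the curious, the core of a hedonic adjustment is just a regression of price on feature indicators; a toy sketch with invented numbers (nothing like BLS's actual basket or coefficients):

```python
import numpy as np

# Rows: four TV models; columns: intercept, has_s_video, has_pip.
# Prices are chosen so the implied feature values come out exact.
X = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
], dtype=float)
prices = np.array([300.0, 360.0, 315.0, 375.0])

coef, *_ = np.linalg.lstsq(X, prices, rcond=None)
# coef[1] -> implied value of S-Video input ($60 here),
# coef[2] -> implied value of picture-in-picture ($15 here).
# When a new model adds a feature, hedonic adjustment treats that
# implied value as a quality gain, i.e. an effective price cut.
```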


> Having the LLM write down a skill representing the lessons from the struggle you just had to get something done is more typical (I hope) and quite different from what they're referring to

Just last week I had Claude build me a skill for when I ask it to help me troubleshoot issues, and it came out quite good.

It did have some issues (Claude tends to over-specify based on anecdotal data), but it's a strong step in the right direction.

Also, "skills" are too broad in my opinion. I have one (that Claude wrote) with my personal data that I have available when I analyze my workouts.

I think there's ample room for self-generated skills when you use a rather long exchange on a domain you plan to revisit, _especially_ when it comes to telling Claude what not to do.
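For reference, a self-generated skill like that is just a folder with a SKILL.md (YAML frontmatter plus free-form instructions); a hypothetical sketch, with names and rules invented:

```markdown
---
name: workout-analysis
description: Personal context and constraints for analyzing my workouts.
---

# Workout analysis

- My baseline stats live in `profile.md` in this skill folder.
- Always report loads in kg, never lb.
- Do NOT infer trends from fewer than 4 sessions (a "what not to do"
  lesson distilled from an earlier long exchange).
```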


I recently had to create a MySQL shim for upgrading a large PHP codebase that is currently running on PHP 5.6 (don't ask).

The way I approached it (yes, I know there are already existing shims, but I felt more comfortable vibe-coding it than using something that might not cover all my use cases) was to:

1. Extract the already existing test suite [1] from the original PHP extension's repo (all .phpt files)

2. Get Claude to iterate over the results of the tests while building the code

3. Extract my complete list of functions called and fill the gaps

4. Profit?
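The "extract my complete list of functions called" step can be sketched in a few lines; a hypothetical helper, assuming the codebase is plain .php files on disk:

```python
import re
from pathlib import Path

# Matches legacy calls like mysql_query(, mysql_fetch_assoc(, etc.
MYSQL_CALL = re.compile(r"\b(mysql_[a-z_]+)\s*\(")

def mysql_functions_used(root):
    """List every mysql_* function the codebase calls, so the shim's
    coverage gaps are explicit rather than discovered at runtime."""
    used = set()
    for php_file in Path(root).rglob("*.php"):
        used |= set(MYSQL_CALL.findall(php_file.read_text(errors="ignore")))
    return sorted(used)
```

Diffing this list against the shim's defined functions shows exactly which gaps remain to fill.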

When I finally got to test the shim, the fact that it worked on the first run was rather emotional.

[1] My shim fails quite a lot of tests, but all of them are cosmetic (e.g., no deprecation warning) rather than functional.

