I had an equally painful experience with Open Code. I don't think the harness is the issue. It's the need for a large context window and slow inference.
Been using the model for a few hours now. I'm actually reall impressed with it. This is the first time i've found value in an image model for stuff I actually do. I've been using it to build powerpoint slides, and mockups. It's CRAZY good at that.
Yeah, it's funny. I would expect to see more enthusiasm versus just basic run-of-the-mill, "oh, there it is". Leave it to the HN crowd. This is incredible. I don't even like OpenAI.
LLMs make for great day 1 demos, but in a few weeks I promise you many people will be able to tell nearly all of the images generated by this are AI. It just takes time and exposure to figure out the new common flaws.
Frankly, I am not sure if they will ever actually be able to solve this problem or if it'll be a continuous game of whackamole, but regardless there's a large crowd of people out there where if they can tell something is AI generated they will not support the company behind it. Being able to tell anything is AI generate cheapens brands.
You're thinking is like everyone else's, and it's backwards. The world will learn to accept it as the standard way of doing things and people will appreciate one generation over another and look at manual image creation as a niche activity like blacksmithing vs assembly-line manufacturing and automation. With the latter, you appreciate the intent and the end result. Same thing here, people are just adjusting to it.
HN is engineer heavy so its a bunch of people who spend their days looking at code. If it's not a coding model they'll likely never use it.
To the average HN'er, images and design are superfluous aesthetic decoration for normies.
And for those on HN who do care about aesthetics, they're using Midjourney, which blows any GPT/Gemini model out of the water when it comes to taste even if it doesn't follow your prompt very well.
The examples given on this landing page are stock image-esque trash outside of the improvements in visual text generation.
My understanding is GPT 6 works via synaptic space reasoning... which I find terrifying. I hope if true, OpenAI does some safety testing on that, beyond what they normally do.
“My vibes don’t match a lot of the traditional A.I.-safety stuff,” Altman said. He insisted that he continued to prioritize these matters, but when pressed for specifics he was vague: “We still will run safety projects, or at least safety-adjacent projects.” When we asked to interview researchers at the company who were working on existential safety—the kinds of issues that could mean, as Altman once put it, “lights-out for all of us”—an OpenAI representative seemed confused. “What do you mean by ‘existential safety’?” he replied. “That’s not, like, a thing.”
The absolute gall of this guy to laugh off a question about x-risks. Meanwhile, also Sam Altman, in 2015: "Development of superhuman machine intelligence is probably the greatest threat to the continued existence of humanity. There are other threats that I think are more certain to happen (for example, an engineered virus with a long incubation period and a high mortality rate) but are unlikely to destroy every human in the universe in the way that SMI could. Also, most of these other big threats are already widely feared." [1]
> We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.
I agree that they called many things remarkably well! That doesn't change the fact that AI 2027 is not a thing which happened, so it isn't valid to point out "this killed us in AI 2027." There are many reasons to want to preserve CoT monitorability. Instead of AI 2027, I'd point to https://arxiv.org/html/2507.11473.
I gave the same prompt (a small rust project that's not easy, but not overly sophisticated) to both Gemma-4 26b and Qwen 3.5 27b via OpenCode. Qwen 3.5 ran for a bit over an hour before I killed it, Gemma 4 ran for about 20 minutes before it gave up. Lots of failed tool calls.
I asked codex to write a summary about both code bases.
"Dev 1" Qwen 3.5
"Dev 2" Gemma 4
Dev 1 is the stronger engineer overall. They showed better architectural judgment, stronger completeness, and better maintainability instincts. The weakness is execution rigor: they built more, but didn’t verify enough, so important parts don’t actually hold up cleanly.
Dev 2 looks more like an early-stage prototyper. The strength is speed to a rough first pass, but the implementation is much less complete, less polished, and less dependable. The main weakness is lack of finish and technical rigor.
If I were choosing between them as developers, I’d take Dev 1 without much hesitation.
What causes these? Given how simple the LLM interface is (just completion), why don't teams make a simple, standardized template available with their model release so the inference engine can just read it and work properly? Can someone explain the difficulty with that?
The model does have the format specified but there is no _one_ standard. For this model it’s defined in the [
tokenizer_config.json [0]. As for llama.cpp they seem to be using a more type safe approach to reading the arguments.
Hm, but surely there will be converters for such simple formats? I'm confused as to how there can be calling bugs when the model already includes the template.
If you don't care about how it's architectured, why you care about size? Compare it to Q3.5 397B-A17B.
Just like smaller size models are speed / cost optimization, so is MoE.
G4 26B-A4B goes 150 t/s on 4090/5090, 80 t/s on M5 Max. Q3.5 35B-A3B is comparably fast. They are flash-lite/nano class models.
G4 31B despite small increase in total parameter count is over 5 times slower. Q3.5 27B is comparably slow. They are approximating flash/mini class models (I believe sizes of proprietary models in this class are closer to Q3.5 122B-A10B or Llama 4 Scout 109B-A17B).
The implication is that there is (should be) a major speed difference - naively you'd expect the MoE to be 10x faster and cheaper, which can be pretty relevant on real world tasks.
Neurons that fire together, wire together. Your brain optimizes for your environment over time. As we get older, our brains are running in a more optimized way than when we're younger. That's why older hunters are more effective than younger hunters. They're finely tuned for their environment. It's an evolutionary advantage. But it also means that they're not firing in "novel" ways as much as the "kids". "kids" are more creative I think because their brains are still adopting, exploring novelty, neuron connections aren't as deeply tied together yet.
This is also maybe one of the biggest pitfalls as our society get's "older" with more old people, and less "kids". We need kids to force us to do things differently.
Does it matter who the sycophant was or just that there was a sycophant?
My partner does that as well as LLMs at this point; "Sure honey, I remember you've talked a lot about Rust and about Clojure in the past, and you seem excited about this Clojure-To-Rust transpiler you're building, it sounds like a great idea!", is that bad too?
There is no comment on whether LLMs/agents have been used. I feel like projects should explicitly say if they were _or_ were not used. There is no license file, and no copyright header either. This feels like "fauxpen-source": imagine getting LEX+YACC to generate a parser, and presenting the generated C code as "open-source".
This is just another way to throw binaries over the wire, but much worse. This has the _worst_ qualities of the GPL _and_ pseudo-free-software-licenses (i.e. the EULAs used by mongo and others). It has all the deceptive qualities of the latter (e.g. we are open but not really -- similar to Sun Microsystems [love this company btw, in spite of its blunders], trying to convince people that NeWS is "free" but that the cost of media [the CD-ROM] is $900), with the viral qualities of the former (e.g. the fruit of the poison tree problem -- if you use this in your code, then not only can you not copyright the code, but you might actually be liable for infringement of copyright and/or patents).
I would appreciate it if the contributor, mrconter11, would treat HN as an internet space filled with intelligent thinking people, and not a bunch of shallow and mindless rubes. (Please (1) explicitly disclose both the use and absence of use of LLMs -- people are more likely to use your software this way, and preserves the integrity of the open source ecosystem, and (2) share you prompts and session).
That is (slightly) reassuring (but the rest of his portfolio does not inspire confidence). Nevertheless, we should be required to disclose whether the code has been (legally) tainted or not. This will help people make informed decisions, and will also help people replace the code if legal consequences appear on the horizon, or if they are ready to move from prototype to production.
reply