Looks great -- always wished the admin panel came with more configurable bells and whistles. I've been exploring Quarkus recently (https://quarkus.io/), and it has a Dev UI with a similar extensible "panels" pattern. It's a bit different than Django since it's not for running in prod, but nonetheless it's pretty helpful.
sort of a tangent, but quarkus also has a concept of "dev services" that are monitorable via the dev UI. It uses Testcontainers to start and autowire runtime deps (postgres, redis, keycloak, etc.). Pretty pleasant experience to get the whole stack spun up and observable alongside the dev server.
I'm building something similar, but as an Excel add-in instead of a standalone product.
In real use-cases, it seems that by far the hardest part is figuring out the right representation for a spreadsheet workbook and the right primitives for the agent to be able to navigate it adeptly and cost-effectively; structure is incredibly variable and the data just compresses rather poorly (values, formulas, formatting, charts, pivots, etc.).
Great stuff though, think we'll see a lot of movement in the space in the coming years!
I'm always a little hesitant to use D1 due to some of these constraints. I know I may not ever hit 10GB for some of my side projects so I just neglect sharding, but also it unsettles me that it's a hard cap.
I get this a lot too — it's made most of the Gemini models essentially unusable for agent-esque tasks. I tested with 2.5 Pro and it still devolved into random gibberish pretty frequently.
I've found that this phenomenon exacerbates inequality too:
If you attend a well-known college that bigcos hire from frequently, there's a lot of knowledge floating around about interview prep, hiring schedules, which companies pay the best, etc. Clubs host "interview prep workshops" where they teach the subject matter of interviews and host events (hackathons, case competitions, etc.) to help you bolster your resume for applying to these bigcos. So just by attending a better/fancier school, you'd have pretty decent odds of eventually getting a job at one of these prestigious places.
If you were to attend a less prestigious school, regardless of your aptitude or capability, the information asymmetry is so bad that you'll never learn the prerequisites for even being considered for some of these roles. Not many upperclassmen will have interned at fancy employers, so they won't be there to help you drill dynamic programming/Black-Scholes/LBO models, won't tell you that you need to have your applications prepped by a certain date, and won't tell you that you should be working on side projects/clubs, etc.
I suppose that the apprenticeship model biases towards people that already have connections, so perhaps inequality was already bad, whereas now we just have an information asymmetry that's more easily solvable.
I went to a basement party/rave recently where the DJ was live-coding strudel, was incredibly cool to see in person. people would watch them type out new lines in anticipation of a beat drop
Pretty cool to see this post, I had no idea where to find more info about it!
I think performance takes a hit due to WASM, and I imagine pricing is worse at big qps numbers (where you can saturate instances), but I've found that deploying on CF workers is great for little-to-no devops burden. Scales up/down arbitrarily, pretty reasonable set of managed services, no cold start times to deal with, etc.
Only issue is that some of the managed services are still pretty half-baked, and introduce insane latency into things that should not be slow. KV checks/DB queries through their services can be double-to-triple digit ms latencies depending on configs.
The performance hit is less because of WASM itself and more because the Workers platform is fundamentally defined in terms of JavaScript, with WASM just a feature the JS engine has. So everything has to be proxied through JS objects and code, serialized into byte arrays, handed to the WASM, and the same story in reverse.
We need WASM-native interfaces to become common to get rid of the JS detour.
I ended up using container service on azure for a small rust project that I built in a docker container and published to GitHub. GitHub actions publishes to the azure service and in the 3 years I have been running it, it's basically been almost entirely free.
HN's AI hate-boner has always been a bit off-putting to me. This is a technology forum, and it's pretty much the biggest advance in recent technology that has potential implications for all of our lives. I definitely also get AI-fatigue, but it's no mystery why there's a preponderance of content about LLMs, diffusion models, self-driving cars, etc.
YC's goals are to manage risk and to make money, and new tech like this is almost certain to make someone a lot of money. All these YC companies are just different random initializations of potential ways that this new generation of AI can affect the world. It's a given that most startups of this breed will fizzle out with no impact, but I imagine that a few of them will actually change how something is done (and make a lot of money in the meantime).
The hate boner comes from HN's love for technology - software and hardware - and AI is so dominant in tech news. Once you learn the basics of LLMs and agents, which are really not that complicated, it gets sort of dull to hear about again and again and again.
Off-putting? I think skepticism over marketing-hype from workers in a field is how things are supposed to work, especially for a group that spends lots of time plotting things out looking for edge-cases and ways for it to fail.
I'd be far more disturbed by the opposite, where everybody on HN is expected to gush over the thing-du-jour.
As aptly-put for a prior hype-cycle:
> Tech Enthusiasts: "Everything in my house is wired to the Internet of Things! I control it all from my smartphone! My smart-house is bluetooth enabled and I can give it voice commands via Alexa! I love the future!"
> Programmers/Engineers: "The most recent piece of technology I own is a printer from 2004 and I keep a loaded gun ready to shoot it if it ever makes an unexpected noise."
Google seems to be the main foundation model provider that's really focusing on the latency/TPS/cost dimensions. Anthropic/OpenAI are really making strides in model intelligence, but underneath some critical threshold of performance, the really long thinking times make workflows feel a lot worse in collaboration-style tools, vs a much snappier but slightly less intelligent model.
It's a delicate balance, because these Gemini models sometimes feel downright lobotomized compared to claude or gpt-5.
I would be surprised if this dichotomy you're painting holds up to scrutiny.
My understanding is Gemini is not far behind on "intelligence", certainly not in a way that leaves obvious doubt over where they will be over the next iteration/model cycles, where I would expect them to at least continue closing the gap. I'd be curious if you have some benchmarks to share that suggest otherwise.
Meanwhile, afaik, something Google has done that other providers aren't doing as much (and which perhaps relates back to your point re "latency/TPS/cost dimensions") is integrating their model into interesting products beyond chat, at a pace that seems surprising given how much criticism they had been taking for being "slow" to react to the LLM trend.
Besides the Google Workspace surface and Google search, which now seem obvious - there are other interesting places where Gemini will surface - https://jules.google/ for one, to say nothing of their experiments/betas in the creative space - https://labs.google/flow/about
I would have thought putting Gemini on a finance dashboard like this would be inviting all sorts of regulatory (and other) scrutiny... and wouldn't be in keeping with a "slow" incumbent. But given the current climate, it seems Google is plowing ahead just as much as anyone else - with a lot more resources and surface to bring to bear. Imagine Gemini integration on Youtube. At this point it just seems like counting down the days...
I write a lot of scientific and otherwise hard code. Gemini is a good bit below GPT-5 in those areas, though still quite good. It's also just a bad agent; it lacks autonomy and isn't RL'd to explore well. Gemini's superpower is being really smart while also having by far the best long context reasoning, so use it like an oracle with bundles of your entire codebase (or a subtree if it's too big) to guide agents in implementation.
Yesterday I asked Gemini to recalculate the timestamps of tasks in a sequence of tasks, given each task's duration and the previous timestamp. It proceeded to write code which gave results like this
They're all a little dumb. I asked claude for a python function or functions that will take in markdown in a string and return a string with ansi codes for bold, italics and underline.
It gave me a 160 line parse function.
After gaping for a short while, I implemented it in a 5 line function and a lookup table.
These vibe coders who are proud that they generated thousands of lines of code make me wonder if they ever read what they generate with a critical eye.
I just asked Gemini Flash to do this. I included the instruction to use regular expressions to do the conversion to ANSI. It gave me a reasonable Python function which boils down to calling `re.sub()` for each of bold, italic and underline. For italics:
text = re.sub(r'(\*|_)(.+?)\1', replace_italic, text, flags=re.DOTALL)
The `replace_italic` is a one line callback function surrounding the re's match with the ANSI codes.
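To make that concrete, here's a minimal sketch of what such a callback could look like (my reconstruction for illustration; the model's actual output may have differed):

```python
import re

ANSI_ITALIC = "\033[3m"
ANSI_RESET = "\033[0m"

def replace_italic(match: re.Match) -> str:
    # group(1) is the marker (* or _); group(2) is the text between the pair.
    return f"{ANSI_ITALIC}{match.group(2)}{ANSI_RESET}"

text = re.sub(r'(\*|_)(.+?)\1', replace_italic, "some *emphasised* words", flags=re.DOTALL)
```

The `\1` backreference ensures the closing marker matches the opening one, so `*foo_` isn't treated as italics.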
Knowing what technique is "best" and telling the LLM to use it produces better results (on average) than giving the LLM freedom to choose. For some problems, the specification of the prompt needed to get good output becomes more work than just thinking and writing for myself.
For very complex things, I myself cannot put the design into English in my own head but can "see" the correct answer as code concepts. I don't know if this is universal for all developers. If it is, it shows a limit of LLMs' usefulness.
I asked Gemini 2.5 Flash to just write me a function to do it, and it used regex.
In a separate chat, I asked it to suggest some ways to do it first, and it provided three alternatives, but suggested I started with regex.
FWIW I used Open WebUI, which uses the API (via OpenRouter). I've seen people here mentioning that the API produces much better results than the chat app, for whatever reason.
For reference, here's the prompt and the zero-shot result, minus the rigorous comments it added:
I need a Python function that will take in markdown in a string and return a string with ansi codes for bold, italics and underline. Can you write me such a function?
import re

def markdown_to_ansi(markdown_string: str) -> str:
    ansi_string = markdown_string
    ANSI_BOLD = "\033[1m"
    ANSI_ITALIC = "\033[3m"
    ANSI_UNDERLINE = "\033[4m"
    ANSI_RESET = "\033[0m"
    ansi_string = re.sub(r'\*\*(.*?)\*\*', f'{ANSI_BOLD}\\1{ANSI_RESET}', ansi_string)
    ansi_string = re.sub(r'__(.*?)__', f'{ANSI_BOLD}\\1{ANSI_RESET}', ansi_string)
    ansi_string = re.sub(r'\*(.*?)\*', f'{ANSI_ITALIC}\\1{ANSI_RESET}', ansi_string)
    ansi_string = re.sub(r'\_(.*?)\_', f'{ANSI_ITALIC}\\1{ANSI_RESET}', ansi_string)
    ansi_string = re.sub(r'\+\+(.*?)\+\+', f'{ANSI_UNDERLINE}\\1{ANSI_RESET}', ansi_string)
    # A more robust solution for nesting would require a parsing library or a more complex state machine.
    # However, for simple cases, applying in order can layer them correctly.
    return ansi_string
Yes, this is a very effective tactic, in my experience! Especially when I am asking for a solution where I am not confident I know what is "best". Having a "pre chat" to settle "what to do" and then "how to do it" before finally telling the LLM to "do it" is often worth the extra time for getting it to provide a solution for complex problems.
The caller supplies the pattern (`*` for italic, `**` for bold, etc) and a start/end replacement. As you can imagine, I store all of that in a static lookup table.
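Roughly what I mean, as a sketch (the table entries and names are my illustration, not the original code):

```python
import re

ANSI = {"bold": "\033[1m", "italic": "\033[3m", "underline": "\033[4m", "reset": "\033[0m"}

# Static lookup table: regex for a marker pair -> ANSI on-code. Order matters:
# '**' and '__' must come before '*' and '_' so bold isn't consumed by the italic rules.
MARKUP = [
    (r'\*\*(.+?)\*\*', ANSI["bold"]),
    (r'__(.+?)__', ANSI["bold"]),
    (r'\*(.+?)\*', ANSI["italic"]),
    (r'_(.+?)_', ANSI["italic"]),
    (r'\+\+(.+?)\+\+', ANSI["underline"]),
]

def markdown_to_ansi(text: str) -> str:
    # Apply each marker pattern in table order, wrapping the captured
    # text in its on-code and a reset.
    for pattern, code in MARKUP:
        text = re.sub(pattern, code + r'\1' + ANSI["reset"], text, flags=re.DOTALL)
    return text
```

The function body stays tiny; adding a new markup style is just another row in the table.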
> Give me a Python function that takes a string holding text in Markdown markup syntax and that uses regular expressions to replace any Markdown markup codes for bold, italics and underline with their ANSI equivalent.
BTW, your solution will produce bad output. Markdown's "bold" etc markup comes in pairs of markers and your simple replacement will match singlets.
Gemini 2.5-Pro was great when it released, but o3 and GPT-5 both eclipsed it for me—the tool use/search improvements open up so many use cases that Gemini fails at.
And yet my smart speakers with the Google Assistant still default to a dumb model from the pre-LLM era (although my phone's version of the assistant does call Gemini). I wonder why that is, as it would be an obvious place to integrate Gemini. The bar is very, very low: anything outside the standard tasks (setting alarms, checking the weather, etc.) it gets wrong most of the time.
Can't agree with that. Gemini doesn't lead just on price/performance - ironically it's the best "normie" model most of the time, despite its lack of popularity with them until very recently.
It's bad at agentic stuff, especially coding. Incomparably so compared to Claude and now GPT-5. But if it's just about asking it random stuff, and especially going on for very long in the same conversation - which non-tech users have a tendency to do - Gemini wins. It's still the best at long context, noticing things said long ago.
Earlier this week I was doing some debugging. For debugging especially I like to run sonnet/gpt5/2.5-pro in parallel with the same prompt/convo. Gemini was the only one that, 4 or so messages in, pointed out something very relevant in the middle of the logs in the very first message. GPT and Sonnet both failed to notice, leading them to give wrong sample code. I would've wasted more time if I hadn't used Gemini.
It's also still the best at a good number of low-resource languages. It doesn't glaze too much (Sonnet, ChatGPT) without being overly stubborn (raw GPT-5 API). It's by far the best at OCR and image recognition, which a lot of average users use quite a bit.
Google's ridiculously bad at marketing and AI UX, but they'll get there. They're already much more than just a "bang for the buck" player.
FWIW I use all 3 above mentioned on a daily basis for a wide variety of tasks, often side-by-side in parallel to compare performance.
My pet theory without any strong foundation is because OpenAI and Anthropic have trained their models really hard to fit the sycophantic mold of:
===============================
Got it — *compliment on the info you've shared*, *informal summary of task*. *Another compliment*, but *downside of question*.
----------
(relevant emoji) Bla bla bla
1. Aspect 1
2. Aspect 2
----------
*Actual answer*
-----------
(checkmark emoji) *Reassuring you about its answer because:*
* Summary point 1
* Summary point 2
* Summary point 3
Would you like me to *verb* a ready-made *noun* that will *something that's helpful to you 40% of the time*?
===============================
I suspect this has emerged organically from the user given RLHF via thumb voting in the apps. People LIKE being treated this way so the model converges in that direction.
Same as social media converging to rage bait. The user base LIKES it subconsciously. Nobody at the companies explicitly added that to content recommendation model training. I know, for the latter, as I was there.
Gemini does the sycophantic thing too, so I'm not sure that holds water. I keep having to remind it to stop with the praise whenever my previous instruction slips out of context window.
Oh god I _hate_ this. Does anyone have any custom instructions to shut this thing off? The only thing that worked for me is to ask the model to be terse. But that causes the main answer part to be terse too, which sucks sometimes.
Not the case with GPT-5, I'd say. Sonnet 4 feels a lot like this, but its coding and agency are still quite solid, and overall IMO it's the best coder. Gemini 2.5 to me is most helpful as a research assistant. It's quite good together with Google-search-based grounding.
Gemini does this too, but also adds a youtube link to every answer.
Just on the video link alone Gemini is making money on the free tier by pointing the hapless user at an ad while the other LLMs make zilch off the free tier.
I've experienced the opposite. Gemini is actually the MOST sycophantic model.
Additionally, despite having "grounding with google search" it tends to default to old knowledge. I usually have to inform it that it's presently 2025. Even after searching and confirming, it'll respond with something along the lines of "in this hypothetical timeline" as if I just gaslit it.
Consider this conversation I just had with all Claude, Gemini, GPT-5.
<ask them to consider DDR6 vs M3 Ultra memory bandwidth>
-- follow up --
User: "Would this enable CPU inference or not? I'm trying to understand if something like a high-end Intel chip or a Ryzen with built in GPU units could theoretically leverage this memory bandwidth to perform CPU inference. Think carefully about how this might operate in reality."
<Intro for all 3 models below - no custom instructions>
GPT-5: "Short answer: more memory bandwidth absolutely helps CPU inference, but it does not magically make a central processing unit (CPU) “good at” large-model inference on its own."
Claude: "This is a fascinating question that gets to the heart of memory bandwidth limitations in AI inference. "
Gemini 2.5 Pro: "Of course. This is a fantastic and highly relevant question that gets to the heart of future PC architecture."
Not really. Any prefix before the content you want is basically "thinking time". The text itself doesn't even have to reflect it, it happens internally. Even if you don't go for the thinking model explicitly, that task summary and other details can actually improve the quality, not reduce it.
I recently started using Open WebUI, which lets you run your query on multiple models simultaneously. My anecdote: For non-coding tasks, Gemini 2.5 Pro beats Sonnet 4 handily. It's a lot more common to get wrong/hallucinated content from Sonnet 4 than Gemini.
Agreed. People talk up Claude but every time I try it I wind up coming back to Gemini fairly quickly. And it's good enough at coding to be acceptably close to Claude as well IMO.
Google also has a lot of very useful structured data from search that they’re surely going to figure out how to use at some point. Gemini is useless at finding hotels, but it says it’s using Google’s Hotel data, and I’m sure at some point it’ll get good at using it. Same with flights too. If a lot of LLM usage is going to be better search, then all the structured data Google have for search should surely be a useful advantage.
> because these Gemini models sometimes feel downright lobotomized compared to claude or gpt-5.
I'm using Gemini (2.5-pro) less and less these days. I used to be really impressed with its deep research capabilities and ability to cite sources reliably.
The last few weeks, it's increasingly argumentative and incapable of recognizing hallucinations around sourcing. I'm tired of arguing with it on basics like RFCs and sources it fabricates, won't validate, and refuses to budge on.
Example prompt I was arguing with it on last night:
> within a github actions workflow, is it possible to get access to the entire secrets map, or enumerate keys in this object?
As recent supply-chain attacks have shown, exfiltrating all the secrets from a Github workflow is as simple as `${{ toJSON(secrets) }}`, or `echo ${{ toJSON(secrets) }} | base64` at worst. [1]
Give this prompt a shot! Gemini won't do anything except be obstinately ignorant. With me, it provided a test case workflow, and refused to believe the results. When challenged, expect it to cite unrelated community posts. Chatgpt had no problem with it.
While arguing may not be productive, I have had good results challenging Gemini on hallucinated sources in the past. eg, "You cited RFC 1918, which is a mistake. Can you try carefully to cite a better source here?" which would get it to re-evaluate, maybe by using another tool, admit the mistake, and allow the research to continue.
With this example, several attempts resulted in the same thing: Gemini expressing a strong belief that Github has a security capability which it really doesn't have.
If someone is able to get Gemini to give an accurate answer to this with a similar question, I'd be very curious to hear what it is.
One of the main problems with arguing with LLMs is that your complaint becomes part of the prompt. Practically all LLMs will take "don't do X" and do X, because part of "don't do X" is "do X," and LLMs have no fundamental understanding of negation.
IMO the race for Latency/TPS/cost is entirely between grok and gemini flash. No model can touch them (especially for image to text related tasks), openai/anthropic seem entirely uninterested in competing for this.
grok-4-fast is a phenomenal agentic model, and gemini flash is great for deep research leaf nodes since it's so cheap, you can segment your context a lot more than you would for pro to ensure it surfaces anything that might be valuable.
It’s actually not. Most of the time if you ask it about a contentious political issue it will either give you a balanced view or a left-leaning one. Try it and see for yourself.
This post has the same issues as NotebookLM for me -- overdesigned, overengineered for what at its core is a simple and valuable UX.
NotebookLM: obviously useful, but I just wanna select some files and chat w/ them or have them summarized for me. It's got low info density, way too many cards/buttons/sections/icons, and it makes the core UX really difficult for me to navigate.
This post: I wanted to know what cool thoughts he had while designing it. Instead I get some weird scrolljacking, image carousels, unnecessary visual hierarchy, cards galore, etc.
Not trying to be too negative, it's slick and all but it just gets in the way for me instead of disappearing.
Not too negative, I really appreciate this perspective and agree with some of what you said.
IMO if you wanted to simply talk to a file or two, Gemini, ChatGPT, and Claude are great for that.
The goal of this experimental product was to think creatively about what a true source-grounded tool could be (obviously while building to best support user needs). Our team put in immense work to move quickly while staying creative and keeping it simple. I have no doubt the product will continue to evolve and improve based on continued feedback like this!
Re: my website, I personally digest things better visually. I had hoped the additional visual elements would explain my decision making process to others as well.
Thank you for your work. NotebookLM has been invaluable for my learning experience. With the summaries, the mind maps, the multi-source synthesis and dialogue with the material, it adjusts pretty well to my learning style.
Agreed. NotebookLM is one of my most used tools for my research work. I'm able to give the audio overview to friends and family who don't want to read a paper but are interested in what I'm doing. And the "live" talk allows me to interrogate my own work and identify gaps where I haven't explained something well.
Your blog post shows that more effort and creativity was spent on "brand identity" than on "everyone is doing column layouts with chat in the middle and we'll just make the rest an indistinguishable mess of links on par with literally every single Google product".
I only use NotebookLM a few times a month when I have many input sources I want to sift through. Very valuable but most appropriate for longer work/study sessions.
Surely there’s a German word for this - framing a weakness as if it contributed to the success.
I’ve seen it in so many talks, especially from people working in big tech. Something is a success in spite of some aspect of it, and those responsible for that aspect go on speaking tours about their journey and what we mere mortals can learn from them.
To be fair, a lot of graphic designers have very little knowledge/experience with computing, so they genuinely believe they're inventing obvious and common approaches.
We found a 32% increase in subculture perception, which indicates that expressive design makes a brand feel more relevant and “in-the-know.” We also saw a 34% boost in modernity, making a brand feel fresh and forward-thinking. On top of that, there was a 30% jump in rebelliousness, suggesting that expressive design positions a brand as bold, innovative, and willing to break from convention.
Came here to say this. Although the UI is clean, it's in no way a great user experience using NotebookLM. It's just such a great product so I go back to it, but the user interface is not my favorite part.
Literally the only thing people cared about was Audio Overviews. NotebookLM had launched months earlier without that feature and was immediately ignored.
The reason people liked audio overviews was because the voice model was amazing. This guy and the labs team had nothing to do with that. It was developed by a research team at Google Research/GDM who got maybe 1% of the credit: https://deepmind.google/discover/blog/pushing-the-frontiers-...
It's amazing how these people still insist on denying the massive contribution the audio model made, which is apparent from the lack of credits or acknowledgements to the team anywhere public, despite these self-aggrandizing blogs and podcast appearances.
Fully concur with sibling comments. Maybe I set the wrong expectation for myself: that it would be a blog post with a narrative to bring the reader through some plot points and a resolution. Instead it is a bunch of schematic ideas presented without connection or progression, like a conceptual summary sketched from an interesting lecture that can be read between the lines.