Funny coincidence, I'm working on a benchmark showcasing AI capabilities in binary analysis.
Actually, AI has huge potential for superhuman capability in reverse engineering. It's an extremely tedious, low-productivity job, currently reserved for cases where there is no other option (e.g., malware analysis). AI could make binary analysis go mainstream for proactive audits that secure against supply-chain attacks.
Great point! Not just binary analysis, but even self-analysis! (See skill-snitch analyze and snitch on itself below!)
MOOLLM's Anthropic skill scanning and monitoring skill, "skill-snitch", has superhuman capabilities for reviewing, reverse engineering, and monitoring the behavior of untrusted Anthropic and MOOLLM skills, and it's also great for debugging and optimizing skills.
It composes with the "cursor-mirror" skill, which gives you full reflective access to all of Cursor's internal chat state, behavior, tool calls, parameters, prompts, thinking, file reads and writes, etc.
That's but one example of how skills can compose, call each other, delegate from one to another, even recurse, iterate, and apply many (HUNDREDS) of skills in one LLM completion call.
I call this "speed of light" as opposed to "carrier pigeon". In my experiments I ran 33 game turns with 10 characters playing Fluxx — dialogue, game mechanics, emotional reactions — in a single context window and completion call. Try that with MCP and you're making hundreds of round-trips, each suffering from token quantization, noise, and cost. Skills can compose and iterate at the speed of light without any detokenization/tokenization cost and distortion, while MCP forces serialization and waiting for carrier pigeons.
Skills also compose. MOOLLM's cursor-mirror skill introspects Cursor's internals via a sister Python script that reads Cursor's chat history and SQLite databases -- tool calls, context assembly, thinking blocks, chat history. Everything, for all time, even after Cursor's chat has summarized and forgotten: it's still all there and searchable!
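To give a flavor of how that works, here's a toy sketch of the kind of thing the sister script can do (not the actual cursor-mirror code; the storage path shown is for macOS, and the ItemTable key/value schema is an assumption that varies by OS and Cursor version):

```python
import sqlite3
from pathlib import Path

# Assumed macOS location of Cursor's per-workspace state databases; the path
# and schema differ by platform and Cursor version, so treat this as a sketch.
STORAGE = Path.home() / "Library/Application Support/Cursor/User/workspaceStorage"

def search_chat_state(term: str):
    """Scan every workspace's state.vscdb for stored chat/tool-call entries mentioning `term`."""
    for db_path in STORAGE.glob("*/state.vscdb"):
        con = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        try:
            # ItemTable is the key/value store that VS Code forks keep workspace state in.
            for key, _value in con.execute(
                "SELECT key, value FROM ItemTable WHERE value LIKE ?", (f"%{term}%",)
            ):
                yield db_path.parent.name, key
        finally:
            con.close()

for workspace, key in search_chat_state("skill-snitch"):
    print(workspace, key)
```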
MOOLLM's skill-snitch skill composes with cursor-mirror for security monitoring of untrusted skills, also performance testing and optimization of trusted ones. Like Little Snitch watches your network, skill-snitch watches skill behavior — comparing declared tools and documentation against observed runtime behavior.
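Mechanically it boils down to diffing what a skill declares against what it was observed doing. A toy sketch of that comparison (the manifest and transcript formats here are made up purely for illustration):

```python
import json
from pathlib import Path

def declared_tools(skill_dir: Path) -> set[str]:
    # Hypothetical manifest: a tools.json listing what the skill says it needs.
    manifest = json.loads((skill_dir / "tools.json").read_text())
    return set(manifest.get("tools", []))

def observed_tools(transcript: Path) -> set[str]:
    # Hypothetical transcript: one JSON object per line with a "tool" field,
    # e.g. exported by a cursor-mirror-style introspection pass.
    return {
        json.loads(line)["tool"]
        for line in transcript.read_text().splitlines()
        if line.strip()
    }

def snitch(skill_dir: Path, transcript: Path) -> None:
    declared, observed = declared_tools(skill_dir), observed_tools(transcript)
    for tool in sorted(observed - declared):
        print(f"UNDECLARED: {tool} was called but never documented")
    for tool in sorted(declared - observed):
        print(f"UNUSED: {tool} is documented but never called")
```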
You can even use skill-snitch like a virus scanner to review and monitor untrusted skills. I have more than 100 skills and had skill-snitch review each one, including itself -- you can find the reports in the skill-snitch-report.md file of each skill in MOOLLM. Here is skill-snitch analyzing and reporting on itself, for example:
MCP is still valuable for connecting to external systems. But for reasoning, simulation, and skills calling skills? In-context beats tool-call round-trips by orders of magnitude.
More: Speed of Light -vs- Carrier Pigeon (an allegory for Skills -vs- MCP):
Haven't dived deep into it yet, but dabbled in similar areas last year (trying to get various bits to reliably "run" in-context).
My immediate thought was to want to apply it to the problem I've been having lately: could it be adapted to soothe the nightmare of bloated LLM code environments where the model functionally forgets how to code or follow project guidelines and just wants to complete everything with insecure, tutorial-style pattern matching?
Great idea. Currently, people have to rely on client-side spans in OpenTelemetry. However, it would be awesome if we could get spans for slow SQL queries, along with explanations.
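A rough client-side sketch of the idea, assuming the OpenTelemetry Python API is already configured and using SQLite's EXPLAIN QUERY PLAN as a stand-in for a real explanation:

```python
import sqlite3
import time

from opentelemetry import trace  # assumes an OTel SDK/exporter is configured elsewhere

tracer = trace.get_tracer("db.client")
SLOW_QUERY_SECONDS = 0.1  # illustrative threshold

def traced_query(con: sqlite3.Connection, sql: str, params=()):
    """Run a query inside a span; attach the query plan when it was slow."""
    with tracer.start_as_current_span("db.query") as span:
        span.set_attribute("db.statement", sql)
        start = time.perf_counter()
        rows = con.execute(sql, params).fetchall()
        elapsed = time.perf_counter() - start
        span.set_attribute("db.duration_s", elapsed)
        if elapsed > SLOW_QUERY_SECONDS:
            # SQLite's EXPLAIN QUERY PLAN stands in for a real EXPLAIN ANALYZE here.
            plan = con.execute(f"EXPLAIN QUERY PLAN {sql}", params).fetchall()
            span.set_attribute("db.query_plan", str(plan))
        return rows
```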
In this benchmark, micro-services are really small, ~300 lines, and sometimes just two of them. More realistic tasks (large codebases, more microservices) would have a lower success rate.
I'd expect it to actually do better in a large codebase. e.g. you'd already have an HTTP middleware stack, so it'd know that it can just add a layer to that for traces (and in fact there might already be off-the-shelf layers for whatever framework) vs. having to invent that on its own for the bare microservice.
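For instance, if a WSGI-style stack is already there, tracing is one more wrapper rather than new plumbing. A minimal sketch (naive: the span closes before a streamed response body is consumed):

```python
from opentelemetry import trace  # assumes an OTel SDK/exporter is configured elsewhere

tracer = trace.get_tracer("http.server")

class TracingMiddleware:
    """Wrap an existing WSGI app with a per-request span."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        name = f"{environ.get('REQUEST_METHOD')} {environ.get('PATH_INFO')}"
        with tracer.start_as_current_span(name):
            return self.app(environ, start_response)

# app = TracingMiddleware(app)  # slot it into the existing middleware stack
```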
> Ultimately, I want to see full session transcripts, but we don't have enough tool support for that broadly.
I have a side project, git-prompt-story, to attach Claude Code sessions to GitHub via git notes. Though it is not that simple to do automatically (e.g., I need to redact credentials).
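The shape of it is roughly this (not git-prompt-story's actual code; the redaction patterns are illustrative and nowhere near complete, which is exactly why it isn't simple to automate):

```python
import re
import subprocess

# Illustrative redaction patterns only; real credential scrubbing needs far more care.
REDACTIONS = [
    (re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+"), r"\1=<REDACTED>"),
    (re.compile(r"sk-[A-Za-z0-9]{20,}"), "<REDACTED_KEY>"),
]

def redact(text: str) -> str:
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

def attach_transcript(path: str) -> None:
    """Attach a redacted session transcript to HEAD as a git note on a dedicated ref."""
    with open(path, encoding="utf-8") as f:
        transcript = redact(f.read())
    subprocess.run(
        ["git", "notes", "--ref=prompt-story", "add", "-f", "-F", "-", "HEAD"],
        input=transcript, text=True, check=True,
    )
```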
Not sure how I feel about transcripts. Ultimately I do my best to make any contributions I make high quality, and that means taking time to polish things. Exposing the tangled mess of my thought process leading up to that either means I have to "polish" that too (whatever that ends up looking like), or put myself in a vulnerable position of showing my tangled process to get to the end result.
I've thought about saving my prompts along with project development and even done it by hand a few times, but eventually I realized I don't really get much value from doing so. Are there good reasons to do it?
For me it's increasingly the work. I spend more time in Claude Code going back and forth with the agent than I do in my text editor hacking on the code by hand. Those transcripts ARE the work I've been doing. I want to save them in the same way that I archive my notes and issues and other ephemera around my projects.
Right, I get that writing prompts is "the work", but if you run them again you don't get the same code. So what's the point of keeping them? They are not 'source code' in the same sense as a programming language.
That's why I want the transcript that shows the prompts AND the responses. The prompts alone have little value. The overall conversation shows me exactly what I did, what the agent did and the end result.
It's not for you. It's so others can see how you arrived at the code that was generated. They can learn better prompting for themselves from it, and also how you think. They can see which cases got considered, or not. All sorts of good stuff that would be helpful for reviewing giant PRs.
Sounds depressing. First you deal with massive PRs and now also these agent prompts. Soon enough there won't be any coding at all, it seems. Just doomscrolling through massive prompt files and diffs in hopes of understanding what is going on.
If the AI generated most of the code based on these prompts, it's definitely valuable to review the prompts before even looking at the code. Especially in the case where contributions come from a wide range of devs at different experience levels.
At a minimum it will help you be skeptical about specific parts of the diff so you can look at those more closely in your review. But it can also inform test scenarios, etc.
I wish Neal would do a behind-the-scenes on how he built this art. I wonder whether LLM assistants like Claude Code make such an interactive show more feasible.
He previously did a game "Infinite Craft" which leveraged Llama models. However, I was only able to find an outdated blog from 2019.
I think you'd notice a pretty big difference in an LLM clone of this site. The art, music, and other small details wouldn't be as consistent or hang together as nicely.
If I could download the LLM clone, and share it, I think I'd prefer it. This is just a website that could at any moment disappear, it isn't like a book.
Not sure if I get this: WASM lets you use any language in the browser, though it still works way better with languages without GC, such as Rust or a transpiling C engine. Java is unlikely to be the best choice.
In the era of LLM assistants like Claude Code, any engineer can write frontend code using popular stacks like React and TypeScript. This is the kind of use case where those tools shine.
Java running in the browser is unlikely, as TypeScript has largely tamed the mess of JavaScript. Java requires a JVM, and shipping an entire JVM so it runs atop another VM is kinda redundant, except if the JVM itself gets compiled and cached as a WASM bundle and Java compilers start accepting WASM-JVM as a target. That will just be a distraction tbh; Java has its strength in large-scale systems and it should just focus on those rather than get caught up in the frontend's messy world.
Takes a few seconds longer to load because it loads all of Java Spring, but it still performs just fine on my phone (though the lack of on-screen keyboard activation makes it rather unfortunate for use in modern web apps).
> That will just be a distraction tbh; Java has its strength in large-scale systems and it should just focus on those rather than get caught up in the frontend's messy world.
Multiple people can work on different things in the Java ecosystem.
Compiling Rust to WASM doesn't really distract anyone from compiling Rust to x86 or ARM, either.
LLVM IR is quite fun to play with from many programming languages. The Java example is rather educational, but there are several practical examples, such as in Go: