
What LLM text generation has shown is that you don't actually have to understand English to generate pretty decent English. You just have to have enough examples.

This is where the massive corpus of source code available on the Internet can help train an "LSM" (large software model), if you can expose the tokens as the lexer understands them in the training set.
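As a toy sketch of what "exposing the tokens as the lexer understands them" could look like, Python's stdlib `tokenize` module already produces exactly that view (Python chosen purely for illustration; the sample source line is made up):

```python
import io
import tokenize

source = "total = price * quantity  # compute cost\n"

# Emit (token_type, text) pairs the way the lexer sees them, rather than
# raw characters -- one plausible encoding for an LSM training set.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type not in (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER)
]

print(tokens)
# [('NAME', 'total'), ('OP', '='), ('NAME', 'price'), ('OP', '*'),
#  ('NAME', 'quantity'), ('COMMENT', '# compute cost')]
```

A model trained on sequences like this never has to rediscover where identifiers begin and end, the way a character-level model would.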

If your LSM sees a trillion examples of correct usage of lifetime and scope and types and so on, then in the same way that an LLM trained on English grammar will emit text with correct grammar as if it understands English, your LSM will generate software with correct syntax as if it understands the software. Whatever the definition of "understands" is in the context of an LLM.



But:

- natural language is flexible; computer languages are much less so.

- "pretty decent English" still includes hallucinations. I've seen companies whose product demo for generating marketing copy just makes up a plausible review. Hallucinating methods, variables, other packages/modules yields broken code.

- the human thought behind natural language is not feasible to directly provide to a model. An IR corresponding to the source of the program is feasible to provide. A trace of the program executing is feasible to provide. Grounding an LLM in the rich exterior world that humans talk about is hard; grounding an LSM in the rich internal representations accessible to an IDE or a debugger is achievable.
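To make the grounding point concrete: the hallucinated-identifier problem from the second bullet is mechanically detectable once you have the program's internal representation. A crude sketch using Python's stdlib `ast` module (the helper name and sample snippet are my own, for illustration):

```python
import ast
import builtins

def undefined_names(source: str) -> set[str]:
    """Crude check: flag names a generated snippet references but never
    defines or imports -- likely hallucinated identifiers."""
    tree = ast.parse(source)
    defined = set(dir(builtins))
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            defined.add(node.name)
            defined.update(a.arg for a in node.args.args)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
    return used - defined

snippet = "result = frobnicate(data)"  # neither name exists anywhere
print(sorted(undefined_names(snippet)))  # ['data', 'frobnicate']
```

An IDE's symbol table does this far better, of course; the point is only that this ground truth is cheap to compute, unlike ground truth about the exterior world.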


"pretty decent english" is a pretty fuzzy bar.

Indeed, GPT-4 and Copilot can generate "pretty decent code" that will look fine to the average human coder even when it's incorrect (making up methods, getting params wrong, slightly missing requirements, or similar).

The level of precision required for "pretty decent non-trivial code" is much higher than for prose that merely looks like it was written by an educated human. So I share the idea that augmenting it - even in really stupid ways, like having Copilot ask the IDE whether a suggestion would even compile before showing it to the user - would work much better, at much lower effort, than increasing its implicit understanding by orders of magnitude.
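The "ask if it would even compile" filter is a one-liner in spirit. A minimal sketch for Python suggestions, using the built-in `compile()` as a stand-in for a real IDE/compiler round-trip (the function name and sample suggestions are made up):

```python
def syntactically_valid(candidate: str) -> bool:
    """Cheap pre-filter: reject completions that would not even parse.
    A real integration would ask the IDE or compiler; compile() is the
    stdlib stand-in for Python source."""
    try:
        compile(candidate, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

suggestions = [
    "total = sum(xs)",
    "total = sum(xs",  # unbalanced paren: discard before the user sees it
]
valid = [s for s in suggestions if syntactically_valid(s)]
print(valid)  # ['total = sum(xs)']
```

This catches only syntax, not hallucinated APIs, but it's exactly the kind of cheap external check prose generation has no equivalent of.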


> you don't actually have to understand English to generate pretty decent English. You just have to have enough examples.

I would have thought babies have been showing this beyond a doubt since time immemorial.


No, because we can't look into their skulls to figure out whether they 'understand', whatever that means.


Right. We're already abstracting from English words and characters into tokens; piping code through half a compiler so the LSM is given the AST to train on doesn't seem all that far-fetched.
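For a sense of what "half a compiler" hands you, Python's stdlib `ast` module will parse source and serialize the tree, which could then be linearized into a training sequence (the sample line is made up):

```python
import ast

source = "x = a + b"
# Parse to an AST and serialize it -- a structured alternative to raw
# characters for an LSM training set.
tree = ast.parse(source)
print(ast.dump(tree))
```

The dump names every node (`Assign`, `BinOp`, `Add`, ...) so operator precedence, scoping structure, and statement boundaries come for free instead of having to be learned from whitespace and punctuation.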



