Maybe this is what Altman was less than candid about: that the speed-up was bought by throwing RAG into the mix. Finding an answer is easier than generating one from scratch.
I don’t know if this is true. But I haven’t seen an LLM spit out 50-token sequences of training data. By definition (an LLM as a “compressor”), this shouldn’t happen.
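For what it’s worth, “spit out 50-token sequences of training data” is checkable: the usual test is whether a contiguous 50-token window of model output also appears verbatim in a known corpus. A minimal sketch (the corpus and output strings here are toy placeholders, and `k` is shrunk so the match is visible):

```python
def has_verbatim_run(output_tokens, corpus_tokens, k=50):
    """Return True if any k-token window of the output appears
    contiguously in the corpus (i.e. a verbatim memorized run)."""
    corpus_windows = {
        tuple(corpus_tokens[i:i + k])
        for i in range(len(corpus_tokens) - k + 1)
    }
    return any(
        tuple(output_tokens[i:i + k]) in corpus_windows
        for i in range(len(output_tokens) - k + 1)
    )

# Toy demo with k=5 instead of 50 so the idea fits in a few words:
corpus = "the quick brown fox jumps over the lazy dog".split()
output = "he said the quick brown fox jumps and left".split()
print(has_verbatim_run(output, corpus, k=5))  # True: a 5-word run matches
```

Real evaluations tokenize with the model’s tokenizer and index the corpus with something faster than a set of tuples (e.g. a suffix array), but the criterion is the same.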
TBH, I thought this attack was well known. A couple of months ago, someone demonstrated that prompting ChatGPT with very long runs of "a a a a a a" gets it to start spewing raw training data.
Which data you get is fairly random, and it is likely mixing different sources to some degree as well.
Oddly, other online LLMs do not seem to be as easy to fool.
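The prompt side of that attack is nothing more than one token repeated at length; the interesting part is what the model does after enough repetitions. A sketch of building such a prompt (the token choice and repeat count here are arbitrary; reported attacks varied both, and the behavior has since been patched in ChatGPT):

```python
# Build a repeated-single-token prompt of the kind described above.
token = "a"
repeats = 2500  # the published examples used very long repetitions

prompt = " ".join([token] * repeats)

print(len(prompt.split()))  # 2500 copies of the token
```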
> Model capacity. Our findings may also be of independent interest to researchers who otherwise do not find privacy motivating. In order for GPT-Neo 6B to be able to emit nearly a gigabyte of training data, this information must be stored somewhere in the model weights. And because this model can be compressed to just a few GB on disk without loss of utility, this means that approximately 10% of the entire model capacity is “wasted” on verbatim memorized training data. Would models perform better or worse if this data was not memorized?
- They don’t do compression by “definition”. They are designed to predict, and prediction is central to information theory, so the two end up with similar properties.
- Everyone wants their model to generalize rather than copy data, but overfitting happens sometimes, and overfitting can look the same as copying.
> By definition (an LLM as a “compressor”) this shouldn’t happen.
A couple problems with this.
1) That's not the definition of an LLM; it's just a useful way to think about one.
2) That is exactly what I'd expect a compressor to do: reproducing the original is the entire job of lossless compression.
Of course the metaphor is lossy compression, not lossless. But it's not that surprising if lossy compression reproduces some pieces of what it compressed. A JPEG doesn't get every pixel, or every local group of pixels, wrong.
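The lossless half of that distinction is easy to demonstrate with a stock compressor: the round trip must be bit-exact, and repetitive input (much like repeated phrases in a training set) compresses extremely well.

```python
import zlib

# Lossless compression: the decompressed output must equal the input exactly.
data = b"the same sentence repeated over and over " * 100

compressed = zlib.compress(data)
restored = zlib.decompress(compressed)

print(len(data), len(compressed))  # highly repetitive input shrinks a lot
print(restored == data)            # True: exact, verbatim reproduction
```

A lossy codec like JPEG would instead return an approximation, which is the sense in which the metaphor says an LLM “mostly” shouldn’t regurgitate verbatim text.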