I've just read over this paper, which is one of the best I've read coming out of the field.
However, I'm inclined to dispute their conclusions.
They seem not to address the obvious (?) reply: that the board state is just a function of the game moves.
To say that the NN has built a representation of the board state from moves is trivial, if moves are just another phrasing of the board state.
That there's a correlation between trained weights and given board states is then expected, since to be trained on moves is to be trained on board states.
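To make that concrete, here is a minimal sketch of the point that the board state is a pure function of the move sequence. I'm assuming an Othello-style game, which is my reading of the paper's setup; the function name and the simplifications (ignoring passes and legality checks) are mine, not the paper's.

```python
DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def board_from_moves(moves):
    """Replay a list of (row, col) moves and return the resulting 8x8 board.

    0 = empty, 1 = black, -1 = white. Black moves first; players alternate.
    Simplified: assumes every move is legal and ignores forced passes.
    """
    board = [[0] * 8 for _ in range(8)]
    board[3][3], board[4][4] = -1, -1   # standard opening position
    board[3][4], board[4][3] = 1, 1
    player = 1
    for r, c in moves:
        board[r][c] = player
        for dr, dc in DIRS:
            # walk along each direction, collecting opponent discs to flip
            line, rr, cc = [], r + dr, c + dc
            while 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == -player:
                line.append((rr, cc))
                rr, cc = rr + dr, cc + dc
            if 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == player:
                for fr, fc in line:
                    board[fr][fc] = player
        player = -player
    return board

# The determinism is the whole point: identical move sequences can never yield
# different boards, so "trained on moves" just is "trained on board states".
assert board_from_moves([(2, 3)]) == board_from_moves([(2, 3)])
```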
Consider the difference between a network seeming to obtain a 3D model of a simple object from a handful of 2D photographs, vs. doing so from millions taken at every angle (in every lighting condition, etc.).
In the latter case, of course the weights of the trained network will correlate with the actual 3D model, because there's almost no information gap between the actual 3D object and the millions of 2D images provided. (There is still an exploitable gap, though, which could be used to show the network hadn't learnt the model.)
The paper seems to address a strawman claim that NNs are "just remembering their inputs" as literally tokenised. That isn't the claim. It's that they remember their inputs as phrased in a transformed space.
Here: if board moves are just an alternative way of specifying the board state -- the inputs "phrased in a transformed space" -- then the claim stands: the NN is just "remembering its inputs", with an extra step.