It's fascinating the extent to which all the models rely on text - it's like they have severe (but not total) aphantasia:
"We hypothesize that this phenomenon emerges predominantly from a misassumption about how these systems are trained. Modern multimodal models are developed on web-scale corpora and are commonly built on top of pretrained large language models, which makes them extraordinarily strong at language modeling, retrieval of statistical regularities, and reconstruction of likely contexts from sparse cues.[48, 25, 24] During the multimodal training, the models are presented with the image, a textual question, and are expected to reconstruct the correct answer. Lacking access to an entire text corpora, a human would intuitively answer the question based on the image in that setup; but we should not infer that this would be the default approach for an AI model. Incentivized to generate the correct next tokens, models might learn to easily ignore the visual information and rely only on their vast prior knowledge, taking the shortest route to the correct answer.[36, 5, 48]"
The crazy thing is that, based just on the text of the questions, a model was able to "guess" the answers:
"When fine-tuned on the public training set of this dataset with images removed (i.e., trained in mirage-mode), our 3-billion-parameter, text-only super-guesser outperformed all frontier multimodal models, including those exceeding hundreds of billions of parameters, on the held-out test benchmark (Figure 3c). It also surpassed human radiologists by more than 10% on average, relying entirely on hidden textual cues in the questions and the structural patterns of the benchmark. In addition, our super-guesser was able to create reasoning traces comparable to, and in some cases indistinguishable from, those of the ground-truth or those generated by frontier multi-modal AI models."
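To get an intuition for how a text-only model can exploit "hidden textual cues and structural patterns," here is a deliberately simplified sketch: instead of a fine-tuned 3B model, it uses a majority-class guesser that answers each question with the most common answer that question had in the training split, falling back to the overall majority answer for unseen questions. All names (`blind_baseline_accuracy`, the QA dict format) are made up for illustration; the paper's actual method is a fine-tuned language model, not this heuristic.

```python
from collections import Counter

def blind_baseline_accuracy(train, test):
    """Score a 'blind' (image-free) baseline on a VQA-style benchmark.

    Each item is a dict with 'question' and 'answer' keys; images are
    ignored entirely, mimicking the paper's mirage-mode setup.
    """
    # Group training answers by question text.
    by_question = {}
    for item in train:
        by_question.setdefault(item["question"], []).append(item["answer"])

    # Per-question majority answer, plus a global fallback.
    majority = {q: Counter(a).most_common(1)[0][0] for q, a in by_question.items()}
    overall = Counter(i["answer"] for i in train).most_common(1)[0][0]

    correct = sum(
        1 for i in test
        if majority.get(i["question"], overall) == i["answer"]
    )
    return correct / len(test)
```

If this trivial guesser scores well above chance on a "visual" benchmark, the benchmark leaks answers through its text alone, which is exactly the failure mode the quoted passage describes.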
A protest demonstrates a level of unhappiness with a group or policy. People may not believe what they see on the news, Facebook, or YouTube, but hopefully we have not reached a point where they refuse to believe what they see with their own eyes.
The point is to demonstrate "we are not alone in this feeling", that's it...
And how can we leave out the OG of tech writers: Donald Knuth. He got a bit distracted by developing TeX, but he earned a well-deserved Turing Award for the series.