The AI gold rush continues, but a new study from Cornell and Google throws a bucket of cold water on the idea that Large Language Models (LLMs) are poised to replace expert analysis anytime soon. While LLMs demonstrate surprising aptitude for digesting scientific text, they fall critically short when it comes to truly *understanding* complex research – particularly when visual data is involved. This isn’t just an academic exercise; it highlights a fundamental bottleneck in AI’s ability to accelerate scientific discovery and raises questions about relying on these tools for critical decision-making.
- Curated Data is King: LLMs perform significantly better when trained on a focused, expert-vetted dataset rather than scraping the open internet.
- Visual Reasoning Remains a Major Hurdle: Current LLMs are “totally incapable” of critically engaging with data visualizations – a core skill for scientists.
- AGI is Still Distant: The study reinforces that we are far from achieving Artificial General Intelligence, as LLMs struggle with synthesis, attribution, and nuanced understanding.
The research, published in the *Proceedings of the National Academy of Sciences*, focused on the notoriously complex field of high-temperature cuprate superconductors. Researchers tasked six LLMs – including ChatGPT, Claude, and Gemini – with answering questions designed to test their comprehension of decades of research. The core of the experiment was a meticulously curated database of 1,726 papers and a set of 67 probing questions developed by a panel of 12 human experts. The results were telling: systems leveraging curated information, specifically Google’s NotebookLM and a custom Retrieval-Augmented Generation (RAG) system capable of processing images, outperformed those relying on broader internet-sourced knowledge.
This finding isn’t surprising to those who’ve been closely following the LLM space. The “garbage in, garbage out” principle applies here with a vengeance. LLMs are exceptionally good at identifying patterns in data, but they lack the critical thinking skills to discern credible sources from misinformation, or to understand the subtle nuances of scientific methodology. The success of NotebookLM and the custom RAG system underscores the importance of providing LLMs with a reliable foundation of knowledge.
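The basic mechanism behind a RAG system is straightforward: rank a curated corpus against the user's question, then prepend the best-matching passages to the prompt so the model answers from vetted text rather than open-web recall. The sketch below illustrates that loop in miniature; the toy corpus, the bag-of-words cosine scoring, and the prompt format are all illustrative stand-ins, not the study's actual pipeline (which used full papers and image retrieval).

```python
# Minimal sketch of the retrieval step in a RAG pipeline.
# Corpus, scoring, and prompt template are illustrative only.
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase word tokens; punctuation is dropped.
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, doc):
    # Cosine similarity over raw term counts.
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    num = sum(q[t] * d[t] for t in set(q) & set(d))
    denom = (math.sqrt(sum(v * v for v in q.values()))
             * math.sqrt(sum(v * v for v in d.values())))
    return num / denom if denom else 0.0

def retrieve(query, corpus, k=2):
    # Rank the curated documents and keep the top-k passages.
    ranked = sorted(corpus, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:k]

def build_prompt(query, corpus):
    # Retrieved passages are prepended so the model grounds its
    # answer in the expert-vetted text.
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Cuprate superconductors show a pseudogap phase above Tc.",
    "Doping controls the superconducting dome in cuprates.",
    "Unrelated note about laboratory scheduling.",
]
print(build_prompt("What controls the superconducting dome?", corpus))
```

Production systems replace the word-count scoring with learned embeddings and a vector index, but the shape of the loop (retrieve, assemble context, generate) is the same, and it is why the quality of the underlying corpus dominates the quality of the answers.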
However, the most significant takeaway is the glaring weakness in visual reasoning. Scientists routinely rely on graphs, charts, and diagrams to interpret data and draw conclusions. The study found LLMs to be “totally incapable” of this crucial skill. This isn’t merely a technical limitation; it’s a fundamental flaw that prevents LLMs from replicating the cognitive processes of a human researcher. The custom RAG system, with its ability to retrieve and analyze images, offered a glimpse of a potential solution, but much work remains.
The Forward Look
The implications of this study extend far beyond the field of superconductivity. As AI becomes increasingly integrated into scientific workflows, the limitations identified here will become more pronounced. We can expect to see a surge in development focused on improving LLMs’ ability to process and interpret visual data. This will likely involve incorporating new architectures and training methods specifically designed for multimodal learning (combining text and images). Furthermore, the emphasis on curated datasets will intensify, leading to the creation of more specialized LLMs tailored to specific scientific domains.
Perhaps more importantly, this research shifts the conversation around AI in science. It’s not about replacing scientists with machines, but about augmenting their capabilities. As Eun-Ah Kim, the study’s corresponding author, points out, the real value lies in freeing researchers from the burden of rote memorization and allowing them to focus on creative problem-solving. The future isn’t about AI *doing* science, but about AI *enabling* scientists to do better science. This study, the first from the Cornell-led National Science Foundation AI-Materials Institute, signals a more realistic and nuanced approach to AI’s role in accelerating scientific progress.