AI Crumbles on Stroop Test, Revealing the Missing Human‑Like Attention Needed for AGI
AI fails a classic psychology test, exposing focus deficits that could delay progress toward human-level artificial intelligence.
AI needs to focus more like we do.
When researchers first introduced the transformer architecture in 2017, they highlighted a simple yet powerful idea: a model that can weigh the relevance of each word against every other word in a sentence. That insight gave rise to the self‑attention mechanism that now powers the most advanced conversational agents, from Claude and Gemini to OpenAI’s ChatGPT.
These large language models have quickly moved from novelty to utility, churning out recipes, code snippets, web layouts, and even full‑length articles. Yet a joint effort by scholars at City University of New York and partner institutions asks a deeper question: does the way these systems allocate attention resemble the way human brains do?
Understanding the gap matters because decades of neuroscience have informed AI design, while breakthroughs in artificial intelligence have, in turn, offered new tools for probing the brain. Bridging the two could unlock more reliable, goal‑directed machine reasoning.
Human‑like focus versus machine attention
Every day we juggle countless distractions—social feeds, work emails, household chores—yet our nervous system can zero in on the most pertinent stimulus. Cognitive scientists explain this ability through three interacting networks. The alerting system readies the mind for incoming information, the orienting circuit selects which sensory inputs deserve processing, and the executive control network arbitrates conflicts, steering behavior toward a chosen objective.
In practice, this triad lets a person instantly shift attention from a sizzling pan to a ringing phone, prioritizing the most urgent cue. AI, by contrast, does not possess an equivalent supervisory layer. Transformer‑based models split input into “tokens” and compute pairwise relevance scores, a process that, while mathematically elegant, lacks a dedicated mechanism for long‑term goal maintenance.
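To make the pairwise scoring concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification for illustration, not the implementation inside any of the models named here: the toy token vectors are invented, and the learned query, key, and value projections that real transformers use are left out, so each token vector plays all three roles.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Score every token against every other token, then mix them.

    Assumption: queries, keys, and values are the raw token vectors
    (no learned projections), which keeps the sketch short.
    """
    dim = len(vectors[0])
    all_weights, outputs = [], []
    for query in vectors:
        # Pairwise relevance: scaled dot product of this token with each token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        # Output is the weighted average of all token vectors.
        outputs.append([
            sum(weights[j] * vectors[j][c] for j in range(len(vectors)))
            for c in range(dim)
        ])
    return all_weights, outputs

# Three hypothetical 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(tokens)
```

Every row of `weights` is recomputed from the input alone, which is the point made above: nothing in this calculation stores or enforces a goal across steps.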
The basic idea comes with several refinements. Multi‑head attention, already part of the 2017 design, runs several relevance calculations in parallel, and individual heads often end up tracking different patterns such as syntax or semantics. Cross‑attention links separate streams of data, a technique vital for translation and summarization tasks. More recent work explores sparse attention, which restricts the number of tokens considered at any moment to curb the heavy computational load, and memory‑augmented methods that let models draw on past information to stay “on track.”
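One common way to restrict which tokens are considered is a sliding local window, sketched below. This is only one of several sparse-attention schemes, and the window size and the function names are illustrative assumptions rather than details from the study.

```python
def local_window_mask(n_tokens, window):
    """Return an n x n grid where True means token i may attend to token j.

    Assumption: a simple sliding window, so each token only sees
    neighbours at most `window` positions away.
    """
    return [
        [abs(i - j) <= window for j in range(n_tokens)]
        for i in range(n_tokens)
    ]

def count_scored_pairs(mask):
    """Count how many token pairs actually get a relevance score."""
    return sum(sum(row) for row in mask)

n_tokens = 40
dense_pairs = n_tokens * n_tokens  # full attention scores every pair
sparse_pairs = count_scored_pairs(local_window_mask(n_tokens, window=2))
```

With 40 tokens and a window of 2, the masked version scores 194 pairs instead of 1,600, which shows where the computational saving comes from.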
Stroop showdown for LLMs
To probe the limits of these mechanisms, the CUNY team adapted a classic psychological experiment first described by John Stroop in 1935. Participants view color words printed in either matching or mismatching ink and must report the ink hue while ignoring the word itself. The task reliably reveals how well an organism can suppress automatic reading in favor of controlled perception.
Researchers generated word sequences of varying length and difficulty, ranging from entirely congruent sets (e.g., the word “blue” in blue ink) to fully incongruent arrays (e.g., “blue” rendered in red), along with mixed lists that combined both kinds of item. They then asked two state‑of‑the‑art models, OpenAI’s GPT‑4o and Anthropic’s Claude 3.5 Sonnet, to name the ink colors.
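The three list types can be sketched in a few lines of Python. The article does not say which colors the team used, how the mixed lists were balanced, or how the stimuli were shown to the models, so the four-color palette, the 50/50 mix, and the (word, ink) pair representation below are all assumptions made for illustration.

```python
import random

# Hypothetical palette; the study's actual color set is not given here.
COLORS = ["red", "blue", "green", "yellow"]

def make_stroop_list(length, condition, rng):
    """Build a list of (word, ink) pairs for one Stroop condition.

    condition: "congruent", "incongruent", or "mixed".
    Assumption: "mixed" picks congruent or incongruent with equal odds.
    """
    if condition not in ("congruent", "incongruent", "mixed"):
        raise ValueError(f"unknown condition: {condition}")
    items = []
    for _ in range(length):
        ink = rng.choice(COLORS)
        if condition == "mixed":
            congruent = rng.random() < 0.5
        else:
            congruent = condition == "congruent"
        if congruent:
            word = ink
        else:
            # Pick any color word that differs from the ink.
            word = rng.choice([c for c in COLORS if c != ink])
        items.append((word, ink))
    return items

rng = random.Random(0)  # fixed seed so the lists are reproducible
short_congruent = make_stroop_list(5, "congruent", rng)
long_incongruent = make_stroop_list(40, "incongruent", rng)
long_mixed = make_stroop_list(40, "mixed", rng)
```

In every pair the correct answer is the second element, the ink, regardless of what the word says.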
Initial trials were impressive: on five‑item lists, GPT‑4o exceeded a 90 percent success rate across all conditions. However, as the lists grew, performance deteriorated sharply. When confronted with 40‑item incongruent lists, accuracy dropped to around 15 percent, and in mixed‑condition tests both models approached random guessing.
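Scoring a run comes down to comparing each reported color with the true ink. The sketch below assumes the model's answers have already been parsed into one color name per item and that random guessing means picking uniformly among the available colors; the study's own scoring procedure is not described here, and the sample data is invented.

```python
def accuracy(responses, inks):
    """Fraction of items where the reported color matches the ink."""
    if len(responses) != len(inks):
        raise ValueError("need exactly one response per item")
    correct = sum(
        resp.strip().lower() == ink
        for resp, ink in zip(responses, inks)
    )
    return correct / len(inks)

def chance_level(n_colors):
    """Expected accuracy of uniform random guessing (an assumption)."""
    return 1 / n_colors

# Invented example: four items, two answered correctly.
inks = ["red", "blue", "green", "red"]
responses = ["red", "green", "Green ", "blue"]
score = accuracy(responses, inks)
baseline = chance_level(4)
```

Under these assumptions a four-color task has a guessing baseline of 25 percent, so a score near that figure carries little evidence that the model is attending to the ink at all.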
“The steep decline in color‑naming accuracy as list length increases indicates that transformer‑based attention mechanisms struggle under scaling pressure,” the authors noted. Some models even reported awareness of the Stroop paradigm and could articulate its rules, yet that meta‑knowledge did not translate into better results.
The findings echo a broader pattern in machine cognition research: while LLMs can pass a variety of theory‑of‑mind, personality, and emotional‑intelligence benchmarks, they still lack a robust executive control component that would keep them anchored to a task despite competing demands.
According to the study’s authors, incorporating a module that mirrors the brain’s executive network—one that monitors progress, detects drift, and re‑engages focus when needed—could be essential for achieving true artificial general intelligence. Such a capability would be especially valuable in prolonged interactions, multi‑step problem solving, and high‑stakes domains like drug discovery.
“The ultimate aim of AI research is to build systems that match human versatility,” the researchers concluded. “Realizing that ambition may require first mastering the fundamental attention processes that underlie mature executive function.”
- Posted by Zubair Ali