Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry

Models & Benchmarks · Published: Jul 25, 2024 · Priya Sharma · ~7 min read

Author Info

Enterprise AI & Governance Editor

JD (technology policy focus); CIPP/US; former in-house counsel at a cloud provider

Priya writes about regulation, enterprise procurement, and responsible deployment. She separates legal fact from commentary, flags jurisdictional limits, and works with external counsel on high-risk governance topics. Her articles emphasize what changed, who is accountable, and what practitioners should verify locally.

#AI Regulation #Enterprise Adoption #Risk & Compliance #Policy Analysis

Full author profile →

After failing to distinguish whether 9.11 is larger than 9.9, large language models (LLMs) have once again exhibited “collective dementia.”

The inability to correctly count the number of letter “r”s in the word “Strawberry” has sparked widespread discussion.

GPT-4o not only answered incorrectly but did so with high confidence.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 2

The freshly released Llama-3.1 405B, however, was able to identify the error during verification and correct it.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 3

Perhaps most surprisingly, Claude 3.5 Sonnet made matters worse with each correction attempt.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 4

It is worth noting that this is not a newly discovered issue; it has simply gained attention recently due to the continuous release of new models.

As each new model claims improvements in mathematical reasoning, users have revisited this test case, only to be disappointed by the results.

Among the numerous related discussions, a comment from Elon Musk regarding this phenomenon was unearthed:

Well, maybe AGI is further away than I thought.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 5

Encountering a Dumb AI: A Hard-Fought Lesson

Some users found that even when using Few-Shot Chain-of-Thought (CoT)—essentially instructing the model to “think step by step” along with a human example—ChatGPT still failed to learn the correct answer.

When asked to mark positions of “r” as 1 and others as 0, reducing the difficulty, the model still struggled to count the number of “1”s accurately.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 6

To teach LLMs to count the letter “r,” netizens worldwide employed creative prompting techniques.

For instance, some suggested asking ChatGPT to use methods that might be employed by “L,” the high-IQ character from the manga Death Note.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 7

The method ChatGPT devised was surprisingly straightforward: write out each letter individually, count them one by one, and record their positions. Ultimately, it arrived at the correct answer.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 8

Another Claude user crafted a prompt consisting of 3,682 tokens, based on the DeepMind Self-Discover paper, effectively reproducing the research overnight.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 9

The method involves two major stages: first, the AI discovers its own reasoning steps for a specific task; second, it executes those steps.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 10

The approach to discovering reasoning steps can be summarized as combining abstract thinking with specific problem analysis.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 11

Under this method, Claude provided a very complex answer.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 12

The author noted that expending such effort to solve the “counting r problem” is not truly practical; it was merely an accidental test while attempting to reproduce the paper’s method, with the hope of finding a universal prompt capable of answering all questions.

Unfortunately, this user has not yet released the complete prompt.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 13

Others took it a step further, wondering how to calculate the number of times “strawberry” appears in a document.

Their solution involved asking the AI to imagine a memory counter starting at zero, incrementing each time the word was encountered.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 14

Some commented that this approach resembles programming in English.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 15

Some AIs Get It Right on the First Try

So, are there any LLMs that can answer correctly without additional prompting?

Actually, some users reported earlier that ChatGPT has a small probability of answering directly and correctly, though it is not common.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 16

Google’s Gemini has approximately a two-thirds probability of answering correctly. By enabling the “draft” feature, it was observed that the model typically answers each question three times: twice correctly and once incorrectly.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 17

As for domestic models, under standardized questioning with only one attempt per model, those that correctly judged numerical magnitude previously performed stably again.

ByteDance’s Doubao provided the correct answer and even speculated whether the user was asking to learn word spelling.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 18

Zhipu AI’s ChatGLM automatically triggered code mode and directly output the correct answer: “3”.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 19

Tencent’s Yuanbao set up equations like a math problem to derive the correct answer (though arguably unnecessary).

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 20

Baidu’s ERNIE Bot 4.0 (Paid Version) was even more detailed; after correctly understanding the intent, it counted out each “r” one by one.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 21

Interestingly, under the same method, the free version of ERNIE Bot 3.5 in the app still miscounted when trying to tally manually.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 22

iFlytek Spark also provided the correct answer by identifying the positions of the “r”s.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 23

The Culprit is Tokens

Although “counting r’s” and determining whether 9.11 or 9.9 is larger seem to be issues of letters versus numbers, for LLMs, both are tokenization problems.

Individual characters hold limited significance for models. Using the tokenizers from GPT or Llama series reveals that a 20-character string may be represented as only 10–13 tokens depending on the AI.

Notably, “strawberry” is consistently split into three parts: “st-”, “raw”, and “-berry”.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 24

By changing the approach to use special characters ⓢⓣⓡⓐⓦⓑⓔⓡⓡⓨ for questioning, each character corresponds to a separate token.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 25

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 26

The simplest solution to this problem is, as demonstrated by Zhipu AI, to invoke code execution.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 27

As seen here, ChatGPT can easily solve it using Python’s string count function.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 28

Karan Singh, the entrepreneur who recently founded a school for AI, believes that the key is enabling AI to recognize its own capabilities and limitations so it can proactively call tools.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 29

Methods for teaching LLMs to assess their own knowledge gaps are also discussed in Meta’s Llama 3.1 paper.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 30

Finally, as netizens suggested, it is hoped that OpenAI and other major LLM companies will resolve this issue in their next updates.

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry — figure 31

Another AI 'Stupidity' Test: Even Models Can't Count the 'R's in Strawberry

Author Info

Encountering a Dumb AI: A Hard-Fought Lesson

Some AIs Get It Right on the First Try

The Culprit is Tokens

Comments

Related News

Latest Headlines