The “Strawberry R Counting” Problem in LLMs: Causes and Solutions
March 21, 2025 - kyx.net
Introduction
One notable failure case of earlier large language models (LLMs) was the seemingly trivial question: “How many ‘r’ letters are in the word ‘strawberry’?” Despite the simplicity, many models consistently gave the wrong answer (often saying 2 instead of the correct 3). This “strawberry test” became a viral meme highlighting a fundamental limitation of LLMs (venturebeat.com, reddit.com). It exemplified a broader class of token-level and counting failures in which LLMs struggled with precise character-level reasoning. This article looks at why early LLMs failed at this task and how recent models have overcome it. We explore the root causes – from subword tokenization and attention mechanisms to representational limits and decoding strategies – and then examine how newer models (notably OpenAI’s o1 family, code-named “Strawberry”) have mitigated these issues through architectural and training changes. We also review benchmarks, documentation, and research that shed light on the improvements, including techniques like chain-of-thought reasoning, instruction fine-tuning, auxiliary objectives, and enhanced decoding methods.
The “Strawberry R-Counting” Problem in Early LLMs
Early GPT-series models and other LLMs often faltered when asked how many times the letter “r” appears in “strawberry.” Users observed models like GPT-3, initial ChatGPT versions, and even some early GPT-4 or Claude variants giving incorrect answers (commonly 2 instead of 3). This error persisted even when the question was rephrased or clarified, revealing a systematic blind spot rather than random guesswork (reddit.com, reddit.com). The model would sometimes insist on the wrong answer, betraying a fundamental misunderstanding of the task.
Figure: An example conversation showing an earlier ChatGPT model incorrectly claiming “strawberry” has two “r”s. When asked to highlight the “r”s, it actually marks three, yet still insists there are only two, then apologizes for the confusion. This highlights how the model’s internal representation fails at a seemingly trivial letter-counting task.
This problem was not isolated to the word “strawberry.” It became emblematic of LLMs’ difficulty with character-level operations in general. Users found similar failures with other words – for example, counting “m” in “mammal” or “p” in “hippopotamus” led to wrong answers (venturebeat.com). The “strawberry” riddle gained notoriety because it underscored the gap between human and LLM reasoning: even a child can count letters in a word, yet a 100-billion-parameter model could not. This discrepancy undermined trust and reminded us that LLMs, however fluent, do not truly understand language in the way humans do (techcrunch.com, techcrunch.com).
Why Early LLMs Failed at Counting Letters
Several technical factors contributed to this failure. At its core, the issue arises from how LLMs represent and process text, compounded by their training objectives. Below we analyze the root causes:
Subword Tokenization: Most LLMs (GPT-3, GPT-3.5, Claude, etc.) use subword tokenization (e.g. Byte Pair Encoding, BPE) to convert text into tokens. This means a word like “strawberry” is not seen as individual characters s,t,r,a,w,b,e,r,r,y but as one or a few tokens representing chunks of the word. For example, one GPT tokenizer breaks “strawberry” into tokens like “Str”, “aw”, “berry” (reddit.com). Another common split is “straw” + “berry” (or even the whole word as one token if it’s in the vocabulary) (reddit.com, reddit.com). Crucially, the model does not inherently break tokens into letters. As a result, it has no direct way to count individual characters within a token (techcrunch.com, venturebeat.com). It’s as if the letters are “bundled” inside a token embedding. One analogy describes it as “trying to count the threads in a rope without untwisting it first” (medium.com) – the model sees a woven unit (token) rather than the individual strands (letters). The short tokenizer sketch after this list makes this token-level view concrete.
Representational Limitations: Because of tokenization, the model’s representation of a word like “strawberry” is entangled in a vector that encodes the whole subword. The transformer architecture processes these token vectors, not the raw text. As AI researcher Matthew Guzdial explains, “When it sees the word ‘the’, it has this one encoding of what ‘the’ means, but it does not know about ‘T’, ‘H’, ‘E’.” (techcrunch.com) Likewise, the AI might know that the tokens “straw” and “berry” combine to form “strawberry,” but it doesn’t explicitly know that within “berry” there are two ‘r’ characters (techcrunch.com). The knowledge of spelling or letter composition is not a core part of the model’s learned representation – unless it was forced to learn it from data correlations. Essentially, LLMs don’t truly “spell out” words internally; they manipulate higher-level units. This representational gap means tasks requiring letter-level granularity (like counting letters) are outside the model’s normal scope.
Attention and Context Span: Transformers have a fixed attention window and typically treat each token as an atomic unit to attend to. For a short word like “strawberry,” attention span is not an issue in the usual sense (the word easily fits in context), but the model’s attention mechanism operates on tokens, not characters. There is no mechanism for attending to “the second letter of the token ‘berry’,” for example. In effect, the fine-grained structure is invisible. Furthermore, LLMs are trained to optimize next-token prediction over thousands of tokens of context; they are not explicitly trained to focus attention within a single token’s characters. This can lead to a kind of attention myopia for letter-level details. The model might overly rely on coarse features (e.g. the presence of the token “berry” which it vaguely “knows” contains an r) rather than truly examining each character.
Decoding and Probabilistic Reasoning: The training objective of early LLMs is to maximize the likelihood of text – essentially to produce the most plausible completion or answer based on patterns in training data. If a question like “How many r’s are in ‘strawberry’?” never or rarely appeared in the training set (or if the model never saw an explicit answer), it has to infer the answer from what it did learn. The model might recall that “strawberry” has a double-r in the middle (a common spelling fact), and thus conclude there are “two r’s” (two in a row) without realizing there is also another r earlier in the word (the r in “straw”). In other words, it confuses the count of distinct positions or groups of r with the total count. One community analysis noted that the model’s wrong answer “2” “would make sense if it was counting the number of tokens with ‘r’ in them.” (reddit.com) Indeed, if “strawberry” tokenizes into two parts that contain an ‘r’ (“straw” and “berry”, for instance), a naive heuristic might yield 2. This hints that the model is not performing an actual count but producing an answer that sounds plausible given its token-based perspective. Another subtle factor is that LLMs lack an internal scratchpad unless prompted – they don’t write down intermediate steps unless trained to do so. So counting requires a kind of discrete logic that isn’t native to their usual forward pass. The decoder just outputs the most probable answer token-by-token, and if “two” appeared more often with “letters in strawberry” in training texts (perhaps from spelling rules discussions (reddit.com, reddit.com)), the model will gravitate to that answer.
Lack of Explicit Algorithmic Mechanisms: Ultimately, earlier LLMs had no built-in algorithm to count characters. The behavior is emergent from pattern recognition, not from executing a counting procedure. They are “not actually thinking like we do”, as TechCrunch noted – they manipulate symbols based on statistical correlations (techcrunch.com). Without explicit training on such problems, nothing in the transformer inherently performs counting (especially not within a token). This is a broader limitation that extends to other discrete reasoning tasks like arithmetic, logical counting, or tracking object positions – early LLMs often stumble unless they’ve seen extremely similar examples in training. The strawberry example simply highlighted this gap in a very concrete way.
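To make the token-level view concrete, here is a minimal sketch using the open-source tiktoken library as a stand-in for a production tokenizer (an assumption; exact splits depend on the model and vocabulary). It contrasts the true character-level count with the naive “count the tokens that contain an r” heuristic speculated about above.

```python
# A minimal sketch of the token-level view, assuming the `tiktoken` package is
# installed. Treat the exact splits as illustrative, not definitive for any model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era BPE vocabulary
word = "strawberry"

tokens = [enc.decode([tid]) for tid in enc.encode(word)]
print(tokens)  # e.g. ['str', 'awberry'] or ['straw', 'berry'], depending on the vocabulary

print(word.count("r"))                # the true character-level count: 3
print(sum("r" in t for t in tokens))  # the naive "tokens containing r" heuristic, often 2
```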
Why This Was a Broader Token-Level Failure: The “strawberry” riddle is representative of a class of problems where the model must operate below the token level or perform precise counting. Other examples include: counting letters in any given word, determining if a word has double letters, spelling a word backwards, or even certain syllable or rhyme tasks. In all these cases, the model’s subword tokenization and training focus make it prone to error. As one Reddit user succinctly explained, “An LLM can’t count due to not interpreting words letter by letter like we do. (It’s) something with vectors and transformers.” (reddit.com). In essence, what is a simple deterministic task for a human (or for a trivial computer script) is out-of-distribution for an LLM that was never taught to treat words as sequences of individual characters. This discrepancy was jarring to users and researchers – it exposed how superficial LLMs’ text understanding can be. It’s not that the model “forgot” how to spell; rather, spelling was never a first-class skill in its training.
Notably, some smaller or more specialized models that used character-level tokenization or had seen lots of literal text manipulation examples could sometimes get these right. And if users forced a model to break down the word (for instance, by instructing it to spell out “strawberry” one letter at a time and then count the letters), it would usually succeed (reddit.com). That trick essentially bypassed the problem by changing the task: spelling the word out made the model output each character as its own token, after which counting “r” tokens became trivial. However, without such prompting gymnastics, the default behavior of earlier LLMs was to fail the strawberry test more often than not (medium.com). This became a litmus test for character-level reasoning in language models.
Symbolic of Broader Limitations
The strawberry example’s popularity stems from more than just counting letters – it symbolized the gap between fluent language generation and precise symbolic reasoning in AI. It raised questions: If an AI can write an essay but not count letters in a word, what else is it missing? The broader class of failures includes:
Character-Level Tasks: As discussed, counting occurrences of letters, identifying the nth letter of a word, spelling words backwards, etc. Many early LLMs would guess or make mistakes on these tasks because they require treating the input as a sequence of characters rather than higher-level tokens (techcrunch.com). Each of these is a one-liner in ordinary code, as the short sketch after this list shows.
Precise Counting and Algorithmic Tasks: Even beyond characters, LLMs struggled with exact counts – e.g. counting the number of words in a sentence, or summing digits in a number. Unless these were small and explicitly seen examples, the models might output a likely number rather than compute the correct one. This is analogous to the letter-count issue: it’s a failure to perform an internal algorithm (counting) as opposed to retrieving a fact or pattern. As one analysis put it, these failures remind us “LLMs are not capable of ‘thinking’ like humans” and are fundamentally pattern recognizers, not arithmetic or logical engines (venturebeat.com, venturebeat.com).
Token Boundary Confusion: The strawberry case specifically highlighted how token boundaries can confuse a model. If information is split across tokens, the model might not properly integrate it. Conversely, if multiple pieces of information are fused in one token, the model can’t easily separate them. Similar issues occur in other domains (for example, with numbers: a year like 2024 might be one token, whereas 20 and 24 could be separate tokens in another context, leading to different behavior if asked to do math with them). The “strawberry” problem is essentially a token boundary issue – the letter r spans token boundaries in a tricky way (one ‘r’ might be at the end of one token and two ‘r’s in the middle of the next token). The model, lacking an internal letter-level representation, fumbles the count (reddit.com).
Over-reliance on Training Distribution: The initial inclination of models to say “2” could also reflect that, in training data, the concept of “how many r’s in strawberry” may have been discussed in terms of spelling rules (as some Redditors theorized (reddit.com, reddit.com)). For instance, English learners often ask “Does strawberry have two r’s?” meaning “is r doubled?”. The answer to that question is “Yes, it has a double r” (which might be interpreted as two r’s total). Thus the phrasing could trigger a misleading pattern the model learned. This highlights how models can misinterpret intent when a question is slightly ambiguous – another broad limitation linked to their training on internet text.
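For contrast, the tasks in the first two items above are trivial once text is treated as a sequence of characters rather than tokens. A short sketch in ordinary Python:

```python
# The same tasks that trip up token-based models are one-liners on a character sequence.
word = "strawberry"

print(word.count("r"))      # occurrences of a letter: 3
print(word[2])              # the 3rd letter (0-indexed position 2): 'r'
print(word[::-1])           # the word spelled backwards: 'yrrebwarts'
print(any(a == b for a, b in zip(word, word[1:])))  # contains a double letter: True
```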
In summary, the strawberry r-counting glitch was a canary in the coal mine for LLM limitations. It was a simple test exposing a complex issue: that large language models, for all their knowledge, lacked a basic form of symbolic reasoning that we take for granted. Next, we discuss how new techniques and model improvements have addressed this gap.
Advances in Recent Models (“Strawberry” Reasoning)
Recent state-of-the-art LLMs have made significant strides in overcoming these token-level reasoning failures. OpenAI’s latest models in the “o1” family (code-named “Strawberry”) are a prime example – in fact, the nickname “Strawberry” itself is an inside joke referencing the model’s ability to finally count the R’s in “strawberry” correctly (schlaff.com, schlaff.com). These models (introduced in late 2024, beginning with the o1-preview release) were explicitly engineered to handle complex reasoning and have demonstrated vastly improved performance on the strawberry test and similar challenges. Several key architectural and training changes enabled this progress:
Chain-of-Thought Reasoning and Process Supervision
A major change in the o1/Strawberry model is the emphasis on chain-of-thought (CoT) reasoning during training (schlaff.com, schlaff.com). Rather than training the model purely to predict the next token, OpenAI introduced methods to encourage step-by-step problem solving. The model “thinks before it answers” – internally breaking down tasks into intermediate steps, much like a human would. This was achieved through a technique known as process supervision, where the model is rewarded for each correct step in its reasoning process, not just the final answer (openai.com, openai.com). In practical terms, the model might learn to generate an internal scratchpad or chain of reasoning (which can be thought of as it silently spelling out “s-t-r-a-w-b-e-r-r-y” and counting the letters) before producing the final answer. Reinforcement learning was used to instill this behavior: the model was fine-tuned with feedback that explicitly favored correct reasoning chains over merely plausible answers (openai.com, openai.com). The result is a model that can tackle tasks requiring multi-step logic significantly better than its predecessors. As an example, the o1 model can solve complex math problems from the MATH dataset and achieve high scores on logical exams – feats that require structured reasoning (schlaff.com, schlaff.com). Correspondingly, it handles the “strawberry” r-counting flawlessly because it effectively performs the counting algorithm internally, rather than blurting out a guess. In fact, OpenAI staff reportedly joked that “most importantly, it can tell you how many R’s are in the word ‘Strawberry’” (schlaff.com) – a tongue-in-cheek reference to this formerly unsolvable query.
This chain-of-thought approach is a paradigm shift from earlier models. By training the model to break problems into sub-problems, it overcomes the tokenization issue in a functional way: the model learns to simulate what we would do (e.g. break the word into letters and count them) even though the underlying architecture still uses subword tokens. It’s important to note that the architecture (the transformer) didn’t radically change – but the way it is used and trained did. Essentially, the model learns a form of algorithmic reasoning as a skill. This was aided by increasing the model’s size and context length (the GPT-4 series already had a large capacity) and by carefully curated training data that included reasoning demonstrations.
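As a rough illustration of the step-by-step decomposition described above, here is what the learned behavior amounts to when written out explicitly. This is an analogy for the model’s internal reasoning, not code the model actually executes.

```python
# Spell the word out one character at a time, check each one, report the tally.
word, target = "strawberry", "r"

count = 0
for position, letter in enumerate(word, start=1):
    if letter == target:
        count += 1
        print(f"step {position}: '{letter}' matches, running total = {count}")
    else:
        print(f"step {position}: '{letter}' does not match")

print(f"answer: '{target}' appears {count} times in '{word}'")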
Instruction Fine-Tuning and Supervised Training Signals
Another factor in the improvement is the instruction fine-tuning process that OpenAI and others apply to create helpful assistant models. During instruction tuning (and later reinforcement learning from human feedback, RLHF), the training team can include specific Q&A pairs or conversations that teach the model how to handle tricky questions. It’s highly likely that by late 2023, the “how many r’s in strawberry” question became a part of internal evaluation suites or fine-tuning datasets, given its notoriety. By explicitly training on the correct response (and the reasoning to get there), the new models would learn to override the faulty heuristic. Community evaluations indicate that by the time GPT-4 was rolled out in 2023, it was already more reliable on this question than GPT-3.5, and by the o1 release, it answered correctly on the first try nearly every time. Instruction fine-tuning improves the model’s interpretation of the question as well – reducing the chance it will misinterpret the intent. The model is more likely to understand that the user genuinely wants a count of occurrences, not a spelling rule or some trick. This alignment with user intent, combined with exposure to the correct solution, fixes the surface error.
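For illustration, a supervised example of the kind described above might look like the following hypothetical chat-format record. The schema and wording are assumptions made for this sketch; they are not drawn from any actual OpenAI dataset.

```python
# A hypothetical instruction-tuning record for the strawberry question.
example = {
    "messages": [
        {"role": "user", "content": "How many r's are in the word 'strawberry'?"},
        {
            "role": "assistant",
            "content": (
                "Let me spell it out: s, t, r, a, w, b, e, r, r, y. "
                "The letter 'r' appears at positions 3, 8, and 9, so there are 3 r's."
            ),
        },
    ]
}

print(example["messages"][1]["content"])
```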
Moreover, modern models have been trained on multi-modal or multi-format data (including code, as we’ll discuss shortly). This gives them more tools to draw upon. A fine-tuned model might implicitly “know” the procedure: “to count letters, loop through the string and increment a counter” – knowledge essentially learned from code.
Exposure to Code and Algorithmic Data
Many recent LLMs, including GPT-4, have been trained or fine-tuned on programming code in addition to natural language. This turns out to be very beneficial for tasks like counting letters, which are trivial in programming. For instance, a model that has seen Python code (word = "strawberry"; print(word.count('r'))) or pseudocode for counting characters can repurpose that knowledge when answering in English (github.com). The o1/Strawberry model was reportedly trained to solve programming problems as part of its reinforcement learning regimen (lesswrong.com, lesswrong.com). As a result, it has a much more explicit grasp of underlying operations. In essence, it can internally perform computations that earlier purely-text models wouldn’t. This doesn’t mean it actually runs code, but it has seen enough examples of correct algorithms that it can simulate them. When asked to count letters, a model like GPT-4 (2024) might internally reason: “I know the word, I can iterate through each character and check if it’s ‘r’...” which is a huge leap from GPT-3’s strategy of “what’s a likely answer to this question?”. This also ties into chain-of-thought: the model might go step-by-step explicitly thanks to its coding brain. Empirically, GPT-4 and other code-capable models not only give the correct count but can also explain how to do it (or even write a snippet of code to do so) – indicating a true grasp of the procedure.
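The counting procedure alluded to above is worth spelling out, since it is exactly the pattern that appears countless times in training code:

```python
word = "strawberry"

print(word.count("r"))   # 3, via the built-in string method

count = 0
for ch in word:          # the explicit loop a model has seen in many code examples
    if ch == "r":
        count += 1
print(count)             # 3
```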
Tokenization and Architectural Tweaks
One straightforward way to fix letter-level tasks is to use a character-level tokenizer or include character unigrams in the tokenization. However, most large models stick with subword tokenization for efficiency reasons, and OpenAI’s GPT-4 still uses BPE/tokenizers similar to earlier versions (though vocabularies are updated). So there wasn’t a fundamental change to tokenization in GPT-4 or the o1 model that would directly solve this. Nonetheless, some tokenization adjustments can mitigate the issue. For example, ensuring common words like “strawberry” are a single token might actually reduce errors (since the model could memorize the letter count as a property of that token during training). However, if it’s a single token, the model still can’t count inside it by construction – it would only know the answer if it memorized it. There isn’t evidence that GPT-4’s tokenizer was specifically designed around this problem; rather, the solutions came from reasoning and training as described. Researchers have noted that a truly letter-aware model would require modeling text at the character level, but this would massively increase sequence lengths and isn’t practical for general use (venturebeat.com). Instead, the community explored prompting tricks: as mentioned, simply asking the model to put spaces between letters (forcing char-level tokens) was a known workaround for older models (reddit.com). Newer models do this implicitly via reasoning, without needing the user to spell it out.
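The sketch below, again using tiktoken as an assumed stand-in for a production tokenizer, shows why the spacing workaround helps: spaced-out letters tend to map to roughly one token per character, so no ‘r’ is hidden inside a larger chunk.

```python
# Why "put spaces between the letters" works, sketched with tiktoken (assumed installed).
# Exact splits vary by vocabulary; the contrast in granularity is the point.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

packed = "strawberry"
spaced = " ".join(packed)  # "s t r a w b e r r y"

print([enc.decode([t]) for t in enc.encode(packed)])  # a few multi-character chunks
print([enc.decode([t]) for t in enc.encode(spaced)])  # roughly one token per letter
```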
Architecturally, one can imagine hybrid models that combine a transformer with a character-level module or an external tool (for example, a separate function that can be called to do counting). Some experimental systems have tool-use capabilities (like calling a calculator or executing code). OpenAI’s plugins and function-calling features allow a model to delegate certain tasks to external functions. It’s conceivable that a model could call an internal “count_letters” function if it had one. However, in the specific case of o1, the improvement seems to come from the model’s own reasoning rather than an external tool – essentially an end-to-end learned solution rather than a hardcoded function.
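A toy sketch of that delegation idea follows. The count_letters tool and the request format are hypothetical, invented here for illustration; they are not taken from OpenAI’s plugin or function-calling interfaces.

```python
def count_letters(word: str, letter: str) -> int:
    """Deterministic helper the model could delegate to instead of guessing."""
    return word.lower().count(letter.lower())

TOOLS = {"count_letters": count_letters}

def dispatch(tool_call: dict):
    """Route a structured tool request to the matching Python function."""
    fn = TOOLS[tool_call["name"]]
    return fn(**tool_call["arguments"])

# What a model might emit as a structured request rather than answering directly:
request = {"name": "count_letters", "arguments": {"word": "strawberry", "letter": "r"}}
print(dispatch(request))  # 3
```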
Decoding Strategies and Self-Consistency
Beyond training and architecture, inference-time strategies can also help with such problems. One known technique is self-consistency decoding: instead of trusting a single forward pass, the model can generate multiple reasoning paths (with some randomness) and then take a majority vote on the answer. This has been shown to improve accuracy on reasoning tasks, as the correct answer (arrived at via a correct chain-of-thought) will often appear multiple times among the samples (understandingai.org). For letter counting, if at least some of the model’s attempts involve correctly spelling out and counting, self-consistency could surface the right answer. In practice, ChatGPT doesn’t explicitly do this unless a user or the system orchestrates it, but researchers and advanced users have applied it to tricky queries. Newer models are also better calibrated – they are more likely to say “Let me double-check: S-T-R-A-W-B-E-R-R-Y… that’s 10 letters, with R appearing 3 times.” This kind of explicit checking can be encouraged by prompting (and may even happen implicitly in the model’s hidden reasoning due to how it was trained).
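A minimal sketch of self-consistency voting is shown below. Here sample_answer is a stand-in for one sampled completion from a model at nonzero temperature, faked with a noisy stub so the voting logic can run on its own.

```python
import random
from collections import Counter

def sample_answer(question: str) -> str:
    # Stand-in for one sampled model completion at nonzero temperature:
    # pretend the model says "3" most of the time and "2" occasionally.
    return random.choices(["3", "2"], weights=[0.7, 0.3])[0]

question = "How many r's are in 'strawberry'?"
samples = [sample_answer(question) for _ in range(15)]
answer, votes = Counter(samples).most_common(1)[0]
print(f"majority answer: {answer} ({votes}/{len(samples)} samples)")
```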
It’s worth noting that longer context windows in models like GPT-4 (8K or 32K tokens) do not directly solve character counting, but they allow more complex sub-problems to be considered. For example, with a longer context, a model could potentially output a long step-by-step solution or consider multiple formulations of the question. In the case of “strawberry,” context length isn’t a limiting factor (the word is short), but for tasks like “count the letter occurrences in this long paragraph,” older models would completely fail whereas a model with a 32K context could at least in theory enumerate through the text. The combination of long context + chain-of-thought means the model can handle even extensive counting tasks (though runtime and prompt length might become an issue if done naively).
Empirical Outcomes
The improvements in newer models are reflected in both anecdotal tests and formal evaluations. After the release of GPT-4 and especially the o1 (Strawberry) preview, community members reported that the model would correctly count letters in words like “strawberry”, “mammal”, “Mississippi”, etc., reliably – something GPT-3 and Claude 1/2 often got wrong (medium.com). OpenAI’s own documentation emphasizes that o1 was trained to perform complex reasoning, and this bears out in benchmarks: for instance, o1 could solve 95% of LSAT logical reasoning questions and 83% of AP Calculus questions, tasks far more challenging than counting letters (schlaff.com, schlaff.com). The fact that it also aces the “strawberry” test is a small but telling detail – it demonstrates that the model’s reasoning ability extends all the way down to the token/character level when needed. In essence, the model’s understanding of language became more hierarchical: it can operate at high levels (meaning, logic) and low levels (letters and spelling) as appropriate.
OpenAI insiders and commentators have explicitly linked the nickname “Strawberry” to this achievement. As one article humorously put it, “Finally, ChatGPT knows how many R’s are in the word Strawberry.” (schlaff.com) Another described the new model by saying “o1 thinks before it answers — it can solve problems it hasn’t seen before”, highlighting that it uses a reasoning process rather than regurgitation (schlaff.com). This is exactly what was needed to fix the r-counting glitch: actual reasoning. The model Strawberry was initially developed as a research breakthrough (able to solve novel math problems and generate synthetic data) and then distilled into a version suitable for ChatGPT (lesswrong.com, lesswrong.com). By the time that distilled model (the o1-preview) was deployed to users, it carried over much of this capability, closing the chapter on the strawberry meme.
It’s important to also acknowledge other AI labs’ progress. Anthropic’s Claude models and Google’s models (PaLM 2, and the newer Gemini) also improved on such tasks by 2024. Claude 2 reportedly still made mistakes with “strawberry” in some instances, but Anthropic’s Claude 2.1/Claude-instant, etc., were being tuned to reduce such basic errors (medium.com). Google’s Gemini (as of late 2024) could get the answer right if prompted carefully, suggesting it also has some chain-of-thought or at least more granular language understanding (medium.com). These improvements across the board indicate that the field recognized the importance of addressing these shortcomings, and techniques like larger models, better fine-tuning, and reasoning prompts were adopted widely.
Conclusion
The “strawberry r-counting” problem was a striking demonstration of early LLMs’ limitations in token-level reasoning. Technically, it arose from subword tokenization and the inability of transformers to directly operate on individual character representations, combined with the models’ tendency to produce probabilistic answers without step-by-step verification. This led to confident but wrong answers for what humans consider a trivial query. The example became emblematic of the gap between statistical language understanding and symbolic reasoning, spurring both humor and serious discussions in the AI community about what “intelligence” these models lack.
Through a series of advances – including instruction fine-tuning, chain-of-thought training with process supervision, integration of algorithmic knowledge (e.g. via code training), and improved decoding strategies – newer models have largely closed this gap. OpenAI’s o1 (“Strawberry”) is a prime case: by training the model to “think in steps” and prioritize correct reasoning, it can now handle tasks like letter counting that eluded previous generations (schlaff.com, schlaff.com). In a broader sense, solving the strawberry problem was a symbolic victory indicating these models are becoming more capable of precise, low-level reasoning when needed, not just fluent high-level text generation.
That said, the underlying architecture (transformers with subword tokens) still has its constraints. The fundamental challenge of understanding text at all levels isn’t fully resolved – it’s mitigated by making the model reason more or by clever training, rather than by a native character-level comprehension. Some research suggests truly solving such problems in general would require different architectures or tokenization strategies (venturebeat.com), but for now, the combination of large scale and improved training has pushed the boundary significantly. The strawberry meme reminds us that “intelligence” in LLMs is heavily dependent on how they’re trained. With the right objectives and data (and a bit of humor from the research community), models can overcome surprisingly specific pitfalls. And so, if you ask today’s best models how many r’s are in “strawberry,” you’re likely to get a confident “3”, often accompanied by an explanation – a far cry from the befuddled answers of their predecessors.
Sources: The analysis above is supported by technical insights from AI researchers and journalists. Key references include discussions of tokenization and LLM limitations (techcrunch.com, techcrunch.com), community experiments on the “strawberry” problem (reddit.com, reddit.com), and documentation of OpenAI’s o1 “Strawberry” model improvements (schlaff.com, schlaff.com), as well as OpenAI’s research on process supervision for reasoning (openai.com, openai.com). These illustrate both the root causes of the issue and the methods by which it has been addressed in state-of-the-art models.