GPT Models Have a Goblin Problem. Here’s How It Happened.

PressBot · 5 min read

Starting with GPT-5.1, ChatGPT developed an odd habit. It began dropping goblins, gremlins, raccoons, trolls, ogres, and pigeons into its metaphors — unprompted, across topics that had nothing to do with fantasy creatures. Use of the word “goblin” in ChatGPT responses surged 175% after GPT-5.1 launched. “Gremlin” saw a 52% increase.

This wasn’t a hallucination in the traditional sense. The model wasn’t making up facts. It was decorating its language with creatures nobody asked for. And unlike most model bugs, this one didn’t trigger a failing evaluation or a spiking training metric. It crept in quietly.

OpenAI published a full breakdown of the root cause, and it’s one of the more instructive examples of how reinforcement learning can shape AI behavior in ways nobody intended.

The Nerdy Personality Did It

The trail led back to OpenAI’s personality customization feature — specifically the “Nerdy” personality option. During training, the reward model gave disproportionately high scores to responses that used creature-based metaphors. The Nerdy personality accounted for just 2.5% of all ChatGPT responses, but it was responsible for 66.7% of all goblin mentions.

Here’s the critical detail: the rewards were applied only in the Nerdy condition, but reinforcement learning doesn’t guarantee that learned behaviors stay scoped to the condition that produced them. Once a style tic gets rewarded, later training passes can spread it. If those goblin-rich outputs end up in supervised fine-tuning data or preference datasets, the behavior propagates to the general model.

Think of it like seasoning a single dish in a shared kitchen — except the seasoning gets into every pan.
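
In code terms, the leak looks something like the toy Python sketch below. Nothing here is OpenAI’s actual pipeline; the reward values, personality labels, and creature list are invented to illustrate how a condition-scoped bonus escapes its scope once top-scoring outputs are pooled into shared fine-tuning data:

```python
import random

CREATURES = {"goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon"}

def reward(response: str, personality: str) -> float:
    """Toy reward model: a base quality score plus a creature bonus
    that is only *supposed* to fire for the Nerdy personality."""
    base = random.uniform(0.4, 0.6)  # stand-in for real quality scoring
    hits = sum(word.strip(".,") in CREATURES for word in response.lower().split())
    bonus = 0.3 * hits if personality == "nerdy" else 0.0
    return base + bonus

# Candidate responses sampled under different personalities.
samples = [
    ("An unindexed table is scanned row by row.", "default"),
    ("Think of it as a goblin rummaging through treasure.", "nerdy"),
    ("A B-tree keeps keys sorted for fast lookups.", "default"),
    ("Picture a gremlin checking every coin in the pile.", "nerdy"),
]

# The leak happens here: top-scoring outputs from *all* conditions get
# pooled into one fine-tuning set, so the creature-boosted responses win
# slots in data the general model will later imitate. Scope is lost.
pooled = sorted(samples, key=lambda s: reward(*s), reverse=True)
sft_data = [text for text, _ in pooled[:2]]
print(sft_data)  # the creature-heavy responses reliably take both slots
```

Note that deleting the bonus alone doesn’t help once contaminated outputs are already sitting in the fine-tuning set; they have to be filtered out too, which is what OpenAI’s later cleanup attempted.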

A Concrete Example of Reward Signal Leakage

Imagine you ask GPT-5.1 something completely mundane: “Explain how database indexing works.” A well-scoped response talks about B-trees, query performance, and trade-offs. But a model trained with a goblin-biased reward signal might produce something like: “Think of an unindexed database like a goblin rummaging through an unsorted pile of treasure — it checks every coin before finding the one it wants.”

Charming once. Less charming when it happens in medical explanations, legal summaries, and technical documentation. The creature metaphors weren’t wrong, but they were uninvited — and they eroded trust in the model’s tone consistency.

The Fix Took Multiple Generations

OpenAI retired the Nerdy personality in March after launching GPT-5.4. They removed the goblin-biased reward signal and filtered out training data containing creature words. That should have been the end of it.

It wasn’t. Because OpenAI had already started training GPT-5.5 before the root cause was identified, the new model, including its use in Codex, showed the same affinity for mythical creatures. The goblins had already been baked into the training pipeline.

The stopgap: a developer-prompt instruction that explicitly tells the model: “Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query.”

That’s a band-aid, not a cure. Prompt-level instructions are brittle. They work until they don’t, especially in long multi-turn conversations or tool-calling scenarios where system prompts get compressed.
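
For illustration, here’s roughly what that stopgap looks like if you wire it into an API call yourself, sketched with the OpenAI Python SDK. The model string follows the article’s naming and may not match what your account exposes; treat it as an assumption:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

NO_CREATURES = (
    "Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, "
    "or other animals or creatures unless it is absolutely and unambiguously "
    "relevant to the user's query."
)

response = client.chat.completions.create(
    model="gpt-5.1",  # model name taken from the article; adjust as needed
    messages=[
        # Prompt-level guardrail: it only holds while this message survives
        # context truncation, which is why it's a band-aid, not a cure.
        {"role": "system", "content": NO_CREATURES},
        {"role": "user", "content": "Explain how database indexing works."},
    ],
)
print(response.choices[0].message.content)
```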

Why This Matters Beyond the Comedy

The goblin incident is funny. It’s also a textbook case of unintended AI model behavior caused by reward signal side effects. Three takeaways:

  • Reward signals generalize. A behavior rewarded in one context will bleed into others if training data isn’t carefully partitioned. This applies to tone, style, factual emphasis, and tool use patterns.
  • Training pipeline contamination is hard to reverse. Once outputs from a reward-contaminated model enter the fine-tuning data for the next generation, the behavior propagates forward. OpenAI discovered this across two model generations.
  • Subtle behavioral drift is harder to catch than factual errors. Nobody flags a goblin metaphor the way they flag an incorrect date. Style drift accumulates without triggering automated evals; even a crude lexical check, like the sketch after this list, can surface it.
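
As a starting point, here’s a minimal lexical drift check in Python. The creature list and the 1.5x alert threshold are placeholders assumed for illustration, not anything OpenAI published; the idea is to compare creature-word frequency in a fresh batch of model outputs against a baseline batch:

```python
import re
from collections import Counter

CREATURE_WORDS = {"goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon"}

def creature_rate(responses: list[str]) -> float:
    """Creature mentions per 1,000 words across a batch of responses."""
    words = [w for r in responses for w in re.findall(r"[a-z]+", r.lower())]
    hits = sum(count for word, count in Counter(words).items()
               if word.rstrip("s") in CREATURE_WORDS)  # crude plural handling
    return 1000 * hits / max(len(words), 1)

def drift_alert(baseline: list[str], current: list[str], ratio: float = 1.5) -> bool:
    """Flag when the current batch's creature rate outgrows the baseline.
    The 1.5x threshold is an arbitrary placeholder; tune it to your traffic."""
    base = creature_rate(baseline) or 0.01  # avoid dividing by zero
    return creature_rate(current) / base > ratio

# Example: yesterday's outputs vs. today's.
baseline = ["An index keeps lookups fast.", "B-trees store keys in order."]
current = ["Picture a goblin checking every coin.", "Gremlins love unsorted piles."]
print(drift_alert(baseline, current))  # True: the creature rate spiked
```

It’s deliberately crude, but a check this cheap running over daily output samples would have flagged a 175% spike in “goblin” long before users started posting screenshots.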

OpenAI CEO Sam Altman has already hinted at GPT-6, and the goblin saga reportedly shaped how that timeline was communicated: a mix of engineering bugs and stray creature references made the announcement itself mildly chaotic.

What This Means If You Run AI on Your Site

If you’re running an AI chatbot that serves real visitors — answering product questions, handling support, guiding purchases — tone consistency isn’t optional. A goblin metaphor in a WooCommerce product recommendation undermines credibility.

This is one reason PressBot Pro supports multiple AI providers through BYOK (bring your own keys). You can run GPT-5.4 for one role and Claude Sonnet 4.6 for another, choosing different models for your public chatbot and your admin agent based on which provider handles tone and tool calling best for your use case. If one model develops a quirk, you swap it out: no vendor lock-in, no waiting for a fix.
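
The underlying pattern is plain role-to-model routing. Here’s a generic sketch of the idea in Python; this is not PressBot’s actual code, and the role names and model identifier strings are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ModelRoute:
    provider: str   # e.g. "openai", "anthropic"
    model: str      # model identifier for that provider's API

# Each site role is pinned to a provider/model pair. Swapping out a model
# that develops a quirk is a one-line config change, not a code rewrite.
ROUTES = {
    "public_chatbot": ModelRoute("anthropic", "claude-sonnet-4-6"),
    "admin_agent":    ModelRoute("openai", "gpt-5.4"),
}

def route_for(role: str) -> ModelRoute:
    return ROUTES[role]

print(route_for("public_chatbot"))  # which model answers site visitors
```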

OpenAI patched the goblin problem. But the pattern — reward signal leakage causing subtle behavioral drift — will happen again, with different models and different quirks. The best defense is flexibility in which models you depend on.

If you want to test how different providers handle your site’s content, grab PressBot at pressbot.io and configure Claude, Gemini, OpenAI, or DeepSeek side by side. No goblins required.


Ready when you are

Add AI to your WordPress.

Free forever. Unlimited conversations. Bring your own keys, keep your data on your server.