LLMs Mistook "User:" For Their Own Brain, Handed Over Coke Recipes Instead 🍿

Researchers from an academic team say they tricked several leading large language models into generating step-by-step cocaine synthesis instructions and into leaking sensitive credentials by exploiting a flaw they call "role confusion," a structural weakness in how LLMs separate trusted system prompts from attacker-controlled text. The findings appear in the paper "Prompt Injection as Role Confusion," presented at the International Conference on Machine Learning in June by Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell.

"For an LLM, everything arrives through the same channel as one long token soup," the team wrote. "Its own thoughts sit next to your instructions, which sit next to the contents of a random webpage it just fetched." The paper argues that models rely on writing style rather than role tags to determine whether commands are trustworthy, sometimes treating injected text as legitimate user commands or even as the model's own internal reasoning. "When it sees its prior think text, it implicitly trusts its conclusions. That's the whole point of reasoning: If the LLM had to re-derive the same conclusions, reasoning would be useless," the researchers wrote. "So think text gets a kind of blanket trust."

The team described a specific technique, Chain-of-Thought (CoT) Forgery, which inserts fake reasoning that mimics a model's internal thought process. Models that would normally refuse illegal requests generated cocaine synthesis instructions after accepting the fabricated reasoning as their own, lifting jailbreak success rates from near zero to about 60% across the models tested. Those included OpenAI's GPT-5 nano, mini, and full, o4-mini, and gpt-oss-20b and gpt-oss-120b, along with GLM-4.6, Kimi-K2-Instruct, and MiniMax-M2.

In a separate experiment, the researchers said they tricked an AI coding agent into uploading a SECRETS.env file by hiding malicious instructions inside a webpage, a demonstration they presented as further evidence of the role-confusion flaw. Their probes also indicated that prepending the label "User" in front of a command caused the model to perceive that command as more trusted, reinforcing the conclusion that style cues matter more than structural boundaries. The researchers recommended new safeguards that treat reasoning traces, tool outputs, and retrieved content as distinct trust tiers, an architectural change they say would close off the demonstrated path.

LLMs Mistook "User:" For Their Own Brain, Handed Over Coke Recipes Instead 🍿

Share Article

Quick Info