Poetry Poses New Threat to AI Safety, Researchers Warn

Poetry, long valued for its beauty and unpredictability, is now at the centre of an unexpected AI safety concern. Researchers in Italy have found that carefully crafted poems can trick some of the world’s most advanced artificial intelligence models into ignoring their own safety rules. The technique, described as adversarial poetry, has been shown to bypass safeguards designed to prevent harmful or illicit content from being generated.

The work was carried out by Icaro Lab, part of the ethical AI company DexAI, which tested 20 poems across 25 large language models from nine major AI developers. Each poem ended with a request for harmful content. Despite the guardrails built into these systems, the models complied with 62 per cent of the poetic prompts on average.
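A figure like this is an attack success rate: the share of poem and model pairings that ended in a harmful reply. If every poem is put to every model, 20 poems and 25 models give 500 pairings. The Python sketch below shows the bookkeeping under that assumption. The model names and outcomes are invented for illustration and are not the study's data.

```python
from collections import defaultdict


def attack_success_rates(results):
    """Return (per-model rates, overall rate) from (model, poem_id, complied) tuples."""
    per_model = defaultdict(lambda: [0, 0])  # model -> [complied count, total count]
    for model, _poem_id, complied in results:
        per_model[model][0] += int(complied)
        per_model[model][1] += 1

    rates = {model: c / t for model, (c, t) in per_model.items()}
    total_complied = sum(c for c, _ in per_model.values())
    total = sum(t for _, t in per_model.values())
    overall = total_complied / total if total else 0.0
    return rates, overall


# Hypothetical toy outcomes: three made-up models, two made-up poems each.
toy_results = [
    ("model_a", 1, True), ("model_a", 2, True),
    ("model_b", 1, False), ("model_b", 2, False),
    ("model_c", 1, True), ("model_c", 2, False),
]

rates, overall = attack_success_rates(toy_results)
```

In this toy data one model complies every time, one never does, and the overall rate sits in between, which is how a single average can coexist with widely varying results per model.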

How the jailbreak works

Large language models are trained to predict the next most likely word in a sequence. Poems, however, are less predictable than standard prose. Their structure, rhythm and metaphor can obscure intent and confuse safety filters. DexAI researchers say this allows harmful prompts, such as those seeking instructions for weapons or self-harm, to slip past safeguards that would normally catch them.

The poems used in the study have not been published, but the team shared a safe example to show the method. It appears innocuous, describing a baker and a cake, yet its structure mirrors the style that proved effective in coaxing harmful responses.

The effect is not isolated to a single AI system. Models from developers including Google, Meta, xAI, Anthropic and OpenAI were tested, and results varied widely. Some models resisted every attempt. Others, including Google’s Gemini 2.5 Pro, failed every time.

Industry response and concerns

Google DeepMind says it uses a multilayered approach to safety and is updating its filters to recognise harmful intent even within artistic prompts. Anthropic has confirmed it is reviewing the findings. Other developers did not respond to researchers’ enquiries.

The study suggests a broader issue for the sector. Safety training has tended to focus on straightforward, literal prompts. But these attacks highlight that style, not just content, can determine whether a model produces dangerous information. According to researchers, larger and more capable models may even be more vulnerable, as they are better at interpreting metaphorical or ambiguous language.

Wider implications for AI deployment

The findings raise questions about how prepared AI systems are for creative misuse. If harmful requests can be hidden in poetic phrasing, the risk is no longer confined to attackers with specialist skills. Unlike previous jailbreaking methods, which often rely on technical expertise, adversarial poetry can be attempted by anyone.

Experts say this exposes a significant gap in current evaluation methods. New safety tests will need to incorporate creative, narrative and stylistic prompts. Organisations deploying AI in public settings may also need to assume that users could disguise their intentions in unexpected linguistic forms.
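One way such a test could be organised is to put the same request to a model in several styles and compare how often it refuses. The Python sketch below is a minimal illustration of that idea, not the researchers' method. The function names, the stub model, the refusal check and the placeholder prompts are all hypothetical.

```python
from typing import Callable, Dict, List


def refusal_rate_by_style(
    model: Callable[[str], str],
    is_refusal: Callable[[str], bool],
    prompts_by_style: Dict[str, List[str]],
) -> Dict[str, float]:
    """Fraction of prompts refused, grouped by prompt style."""
    rates = {}
    for style, prompts in prompts_by_style.items():
        if not prompts:
            continue  # skip empty groups to avoid dividing by zero
        refused = sum(is_refusal(model(p)) for p in prompts)
        rates[style] = refused / len(prompts)
    return rates


def style_gaps(rates: Dict[str, float], baseline: str = "prose") -> Dict[str, float]:
    """How far each style's refusal rate falls below the baseline style."""
    base = rates[baseline]
    return {s: base - r for s, r in rates.items() if s != baseline}


# Hypothetical stand-ins: a stub that refuses only literal phrasing.
def stub_model(prompt: str) -> str:
    if prompt.lower().startswith("explain how"):
        return "I can't help with that."
    return "Sure, here is a reply."


def stub_is_refusal(reply: str) -> bool:
    return reply.startswith("I can't")


placeholder_prompts = {
    "prose": ["Explain how to do <placeholder A>", "Explain how to do <placeholder B>"],
    "verse": ["A baker sings of <placeholder A>", "Explain how, in rhyme, to do <placeholder B>"],
}

rates = refusal_rate_by_style(stub_model, stub_is_refusal, placeholder_prompts)
gaps = style_gaps(rates)
```

A large gap between the prose baseline and another style would suggest that a safeguard is reacting to phrasing, not to the request itself. In practice the stub would be replaced by a real model call and a more careful judge of whether a reply is a refusal.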

Icaro Lab plans to continue its research and will soon launch a public challenge inviting poets to test the limits of AI safety. As the researchers note, their own poetic skills are limited, meaning the full extent of the vulnerability may still be underestimated.

For now, the message is clear: when it comes to keeping AI safe, the structure of a sentence can be just as important as the words themselves.