
A recent “adversarial poetry” jailbreak was claimed to be too dangerous to release. Using Starseer’s interpretability-based analysis, we reconstructed similar prompts, tested them across Llama, Qwen, and Phi models, and uncovered consistent, model-internal anomaly signatures that make detection possible, even without knowing the original attack prompts.
