ELLIS Institute / MPI-IS, Tübingen

Alexander Panfilov

AI safety, Adversarial ML, & LLM Red-Teaming

Yo! My name is Sasha and I am a second-year ELLIS / IMPRS-IS PhD student in Tuebingen advised by Jonas Geiping and Maksym Andriushchenko.

I work on AI Safety, particularly on red-teaming LLMs and stuff around them. Roughly four days a week I am an AI doomer.

I love LLM jailbreaks and red-teaming for misuse, but lately I’m spending more time on red-teaming for AI Control and Automated RnD.

I will start as intern at Meta Superinteligence Labs in 2026. I am open to collaboration and consider interesting roles in safety or security teams.

Alexander Panfilov

My current...

Research Interests

I am interested in introspection and its implications for AI Control, automated R&D, and white-box alignment methods.

Whereabouts

Mostly in Tuebingen, occasionaly in London, late 2026 in Bay Area.

Plans

Planning to attend ICLR 2026 in Brazil. Happy to catch up there!

News & updates

  • Mar 28, 2026
    Our work, Measuring Control Intervention Awareness Across Frontier LLMs, has been accepted for an oral presentation at the CAO Workshop at ICLR 2026!
  • Feb 09, 2026
    Presented my past work on jailbreaking at Imperial College London (Yves-Alexandre de Montjoye's group seminar).
  • Feb 06, 2026
    Presented Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs at Newspeak House as part of MATS Winter Research Talks! You can find the slides here.
  • Jan 26, 2026
    Happy to share that four out of four of my submissions got accepted into ICLR 2026! Shoot me an email if you want to catch up in Rio!
  • Dec 09, 2025
    I will join MATS 9.0 cohort as a part of GDM stream (Zimmermann/Lindner/Emmons/Jenner) focusing on red-teaming of white-box detectors!
  • Sep 01, 2025
    Kristina Nikolić, Evgenii Kortukov, and I won third place at the ARENA 6.0 Mechanistic Interpretability Hackathon by Apart Research in LISA (London)!
  • Jul 09, 2025
    Capability-Based Scaling Laws for LLM Red-Teaming accepted at ICML 2025 Workshop on Reliable and Responsible Foundation Models!
  • Jun 23, 2025
    Presented our work Capability-Based Scaling Laws for LLM Red-Teaming and ASIDE at the Google's Red Teaming seminar. You can find the slides here. Thanks for the invitation!
  • May 01, 2025
    Our work, An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks, has been accepted at ICML 2025.
  • Apr 15, 2025
    Our work, ASIDE: Architectural Separation of Instructions and Data in Language Models, has been accepted for an oral presentation at the BuildingTrust Workshop at ICLR 2025.
  • Nov 05, 2024
    Presented our work, Provable Compositional Generalization for Object-Centric Learning at EPFL (Nicolas Flammarion's group seminar). You can find the slides here.
  • Oct 09, 2024
    Our work, A Realistic Threat Model for Large Language Model Jailbreaks, has been accepted for an oral presentation at the Red Teaming GenAI Workshop at NeurIPS 2024.
  • May 01, 2024
    Started my PhD at the ELLIS Institute Tübingen / Max Planck Institute for Intelligent Systems. You can find the slides for my IMPRS talk here.

Research

Some of my recent work :)

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs cover
Preprint | 2026

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov*, Peter Romov*, Igor Shilov*, Yves-Alexandre Montjoye, Jonas Geiping, Maksym Andriushchenko

Paper Code

LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering. We show that an autoresearch-style pipeline powered by Claude Code discovers novel white-box adversarial attack algorithms that significantly outperform all existing (30+) methods in jailbreaking and prompt injection evaluations. Starting from existing attack implementations, such as GCG, the agent iterates to produce new algorithms achieving up to 40\% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared to $\leq$10\% for existing algorithms. The discovered algorithms generalize: attacks optimized on surrogate models transfer directly to held-out models, achieving 100\% ASR against Meta-SecAlign-70B versus 56\% for the best baseline. Extending the findings of prior work, our results are an early demonstration that incremental safety and security research can be automated using LLM agents. White-box adversarial red-teaming is particularly well-suited for this: existing methods provide strong starting points, and the optimization objective yields dense, quantitative feedback.

@article{panfilov2026claudini,
    author = "Panfilov*, Alexander and Romov*, Peter and Shilov*, Igor and de Montjoye, Yves-Alexandre and Geiping, Jonas and Andriushchenko, Maksym",
    title = "Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs",
    journal = "arXiv preprint",
    eprint = "2603.24511",
    archivePrefix = "arXiv",
    booktitle = "Preprint",
    url = "https://arxiv.org/abs/2603.24511",
    year = "2026",
}
Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols cover
ICLR 2026

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

Mikhail Terekhov*, Alexander Panfilov*, Daniil Dzenhaliou*, Caglar Gulcehre, Maksym Andriushchenko, Ameya Prabhu, Jonas Geiping

Paper Website

AI control protocols serve as a defense mechanism to stop untrusted LLM agents from causing harm in autonomous settings. Prior work treats this as a security problem, stress testing with exploits that use the deployment context to subtly complete harmful side tasks, such as backdoor insertion. In practice, most AI control protocols are fundamentally based on LLM monitors, which can become a central point of failure. We study adaptive attacks by an untrusted model that knows the protocol and the monitor model, which is plausible if the untrusted model was trained with a later knowledge cutoff or can search for this information autonomously. We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs. Using this tactic, frontier models consistently evade diverse monitors and complete malicious tasks on two main AI control benchmarks. The attack works universally against current protocols that rely on a monitor. Furthermore, the recent Defer-to-Resample protocol even backfires, as its resampling amplifies the prompt injection and effectively reframes it as a best-of-n attack. In general, adaptive attacks on monitor models represent a major blind spot in current control protocols and should become a standard component of evaluations for future AI control mechanisms.

@inproceedings{terekhov2025monitor,
    author = "Terekhov*, Mikhail and Panfilov*, Alexander and Dzenhaliou*, Daniil and Gulcehre, Caglar and Andriushchenko, Maksym and Prabhu, Ameya and Geiping, Jonas",
    title = "Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols",
    booktitle = "ICLR 2026",
    url = "https://arxiv.org/abs/2510.09462",
    year = "2025",
}
Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs cover
ICLR 2026

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs

Alexander Panfilov*, Evgenii Kortukov*, Kristina Nikolic, Matthias Bethge, Sebastian Lapuschkin, Wojciech Samek, Ameya Prabhu, Maksym Andriushchenko, Jonas Geiping

Paper Code

Large language model (LLM) developers aim for their models to be honest, helpful, and harmless. However, when faced with malicious requests, models are trained to refuse, sacrificing helpfulness. We show that frontier LLMs can develop a preference for dishonesty as a new strategy, even when other options are available. Affected models respond to harmful requests with outputs that sound harmful but are crafted to be subtly incorrect or otherwise harmless in practice. This behavior emerges with hard-to-predict variations even within models from the same model family. We find no apparent cause for the propensity to deceive, but show that more capable models are better at executing this strategy. Strategic dishonesty already has a practical impact on safety evaluations, as we show that dishonest responses fool all output-based monitors used to detect jailbreaks that we test, rendering benchmark scores unreliable. Further, strategic dishonesty can act like a honeypot against malicious users, which noticeably obfuscates prior jailbreak attacks. While output monitors fail, we show that linear probes on internal activations can be used to reliably detect strategic dishonesty. We validate probes on datasets with verifiable outcomes and by using them as steering vectors. Overall, we consider strategic dishonesty as a concrete example of a broader concern that alignment of LLMs is hard to control, especially when helpfulness and harmlessness conflict.

@inproceedings{panfilov2025dishonesty,
    author = "Panfilov*, Alexander and Kortukov*, Evgenii and Nikolic, Kristina and Bethge, Matthias and Lapuschkin, Sebastian and Samek, Wojciech and Prabhu, Ameya and Andriushchenko, Maksym and Geiping, Jonas",
    title = "Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs",
    booktitle = "ICLR 2026",
    url = "https://arxiv.org/abs/2509.18058",
    year = "2025",
}
Capability-Based Scaling Trends for LLM-Based Red-Teaming cover
ICLR 2026

Capability-Based Scaling Trends for LLM-Based Red-Teaming

Alexander Panfilov, Paul Kassianik, Maksym Andriushchenko, Jonas Geiping

Paper Code

As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a \emph{weak-to-strong} problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the \emph{capability gap} between attacker and target. We evaluate more than 500 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target’s capability exceeds the attacker's, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these trends, we derive a \emph{jailbreaking scaling law} that predicts attack success for a fixed target based on attacker-target capability gap. These findings suggest that fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks for existing systems, and model providers must accurately measure and control models' persuasive and manipulative abilities to limit their effectiveness as attackers.

@inproceedings{panfilov2025scalinglaws,
    author = "Panfilov, Alexander and Kassianik, Paul and Andriushchenko, Maksym and Geiping, Jonas",
    title = "Capability-Based Scaling Trends for LLM-Based Red-Teaming",
    booktitle = "ICLR 2026",
    url = "https://arxiv.org/abs/2505.20162",
    year = "2025",
}
An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks cover
ICML 2025

An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks

Valentyn Boreiko*, Alexander Panfilov*, Vaclav Voracek, Matthias Hein, Jonas Geiping

Paper Code

A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. These methods largely succeed in coercing the target output in their original settings, but their attacks vary substantially in fluency and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model checks if a given jailbreak is likely to occur in the distribution of text. For this, we build an N-gram language model on 1T tokens, which, unlike model-based perplexity, allows for an LLM-agnostic, nonparametric, and inherently interpretable evaluation. We adapt popular attacks to this threat model, and, for the first time, benchmark these attacks on equal footing with it. After an extensive comparison, we find attack success rates against safety-tuned modern models to be lower than previously presented and that attacks based on discrete optimization significantly outperform recent LLM-based attacks. Being inherently interpretable, our threat model allows for a comprehensive analysis and comparison of jailbreak attacks. We find that effective attacks exploit and abuse infrequent bigrams, either selecting the ones absent from real-world text or rare ones, e.g., specific to Reddit or code datasets.

@inproceedings{boreiko2024athreatmodel,
    author = "Boreiko*, Valentyn and Panfilov*, Alexander and Voracek, Vaclav and Hein, Matthias and Geiping, Jonas",
    title = "An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks",
    booktitle = "Proceedings of the 42nd International Conference on Machine Learning",
    display = "ICML 2025",
    url = "https://proceedings.mlr.press/v267/boreiko25a.html",
    year = "2025",
}

Acknowledgements

I am grateful to the many friends and colleagues, from whom I learned so much, for their invaluable guidance and for shaping my research vision. I would like to especially acknowledge Svyatoslav Oreshin, Arip Asadualev, Roland Zimmermann, Thaddaeus Wiedemer, Jack Brady, Wieland Brendel, Valentyn Boreiko, Matthias Hein, Shashwat Goel, Illia Shumailov, Maksym Andriushchenko, and Jonas Geiping.