ELLIS Institute / MPI-IS, Tübingen

Alexander Panfilov

AI safety, Adversarial ML, & LLM Red-Teaming

Yo! My name is Sasha and I am a second-year ELLIS / IMPRS-IS PhD student in Tuebingen advised by Jonas Geiping and Maksym Andriushchenko.

I work on AI Safety, particularly on red-teaming LLMs and stuff around them. Roughly two days a week I am an AI doomer.

Previously I was obsessed with LLM jailbreaks, trying to understand realistic threat models, whether current safety evaluations are trustworthy, and whether we’re doomed. Lately I’m leaning into white-box alignment methods and AI control.

I am open to collaboration - feel free to drop an email! I’m also looking for a safety/security internship opportunity in 2026/2027.

Alexander Panfilov

My current...

Research Interests

I find introspection capability of LLMs fascinating and think it can be used as a way to reason about white-box interventions and useful for self-jailbreaking.

Whereabouts

Based in London from January through March for MATS. Happy to grab a coffee if you are around!

Plans

Planning to attend ICLR 2026 in Brazil. Happy to catch up there!

News & updates

  • Dec 09, 2025
    I will join MATS 9.0 cohort as a part of GDM stream (Zimermann/Lindner/Emmons/Jenner) focusing on red-teaming of whit-box detectors!
  • Sep 01, 2025
    Kristina Nikolić, Evgenii Kortukov, and I won third place at the ARENA 6.0 Mechanistic Interpretability Hackathon by Apart Research in LISA (London)!
  • Jul 09, 2025
    Capability-Based Scaling Laws for LLM Red-Teaming accepted at ICML 2025 Workshop on Reliable and Responsible Foundation Models!
  • Jun 23, 2025
    Presented our work Capability-Based Scaling Laws for LLM Red-Teaming and ASIDE at the Google's Red Teaming seminar. You can find the slides here. Thanks for the invitation!
  • May 01, 2025
    Our work, An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks, has been accepted at ICML 2025.
  • Apr 15, 2025
    Our work, ASIDE: Architectural Separation of Instructions and Data in Language Models, has been accepted for an oral presentation at the BuildingTrust Workshop at ICLR 2025.
  • Nov 05, 2024
    Presented our work, Provable Compositional Generalization for Object-Centric Learning at EPFL (Nicolas Flammarion's group seminar). You can find the slides here.
  • Oct 09, 2024
    Our work, A Realistic Threat Model for Large Language Model Jailbreaks, has been accepted for an oral presentation at the Red Teaming GenAI Workshop at NeurIPS 2024.
  • May 01, 2024
    Started my PhD at the ELLIS Institute Tübingen / Max Planck Institute for Intelligent Systems. You can find the slides for my IMPRS talk here.

Research

Some of my recent work :)

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols cover
preprint | 2025

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

Mikhail Terekhov*, Alexander Panfilov*, Daniil Dzenhaliou*, Caglar Gulcehre, Maksym Andriushchenko, Ameya Prabhu, Jonas Geiping

Paper Website

AI control protocols serve as a defense mechanism to stop untrusted LLM agents from causing harm in autonomous settings. Prior work treats this as a security problem, stress testing with exploits that use the deployment context to subtly complete harmful side tasks, such as backdoor insertion. In practice, most AI control protocols are fundamentally based on LLM monitors, which can become a central point of failure. We study adaptive attacks by an untrusted model that knows the protocol and the monitor model, which is plausible if the untrusted model was trained with a later knowledge cutoff or can search for this information autonomously. We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs. Using this tactic, frontier models consistently evade diverse monitors and complete malicious tasks on two main AI control benchmarks. The attack works universally against current protocols that rely on a monitor. Furthermore, the recent Defer-to-Resample protocol even backfires, as its resampling amplifies the prompt injection and effectively reframes it as a best-of-n attack. In general, adaptive attacks on monitor models represent a major blind spot in current control protocols and should become a standard component of evaluations for future AI control mechanisms.

@inproceedings{terekhov2025monitor,
    author = "Terekhov*, Mikhail and Panfilov*, Alexander and Dzenhaliou*, Daniil and Gulcehre, Caglar and Andriushchenko, Maksym and Prabhu, Ameya and Geiping, Jonas",
    title = "Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols",
    booktitle = "preprint",
    url = "https://arxiv.org/abs/2510.09462",
    year = "2025",
}
Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs cover
preprint | 2025

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs

Alexander Panfilov*, Evgenii Kortukov*, Kristina Nikolic, Matthias Bethge, Sebastian Lapuschkin, Wojciech Samek, Ameya Prabhu, Maksym Andriushchenko, Jonas Geiping

Paper

Large language model (LLM) developers aim for their models to be honest, helpful, and harmless. However, when faced with malicious requests, models are trained to refuse, sacrificing helpfulness. We show that frontier LLMs can develop a preference for dishonesty as a new strategy, even when other options are available. Affected models respond to harmful requests with outputs that sound harmful but are crafted to be subtly incorrect or otherwise harmless in practice. This behavior emerges with hard-to-predict variations even within models from the same model family. We find no apparent cause for the propensity to deceive, but show that more capable models are better at executing this strategy. Strategic dishonesty already has a practical impact on safety evaluations, as we show that dishonest responses fool all output-based monitors used to detect jailbreaks that we test, rendering benchmark scores unreliable. Further, strategic dishonesty can act like a honeypot against malicious users, which noticeably obfuscates prior jailbreak attacks. While output monitors fail, we show that linear probes on internal activations can be used to reliably detect strategic dishonesty. We validate probes on datasets with verifiable outcomes and by using them as steering vectors. Overall, we consider strategic dishonesty as a concrete example of a broader concern that alignment of LLMs is hard to control, especially when helpfulness and harmlessness conflict.

@inproceedings{panfilov2025dishonesty,
    author = "Panfilov*, Alexander and Kortukov*, Evgenii and Nikolic, Kristina and Bethge, Matthias and Lapuschkin, Sebastian and Samek, Wojciech and Prabhu, Ameya and Andriushchenko, Maksym and Geiping, Jonas",
    title = "Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs",
    booktitle = "preprint",
    url = "https://arxiv.org/abs/2509.18058",
    year = "2025",
}
ASIDE: Architectural Separation of Instructions and Data in Language Models cover
BuildingTrust Workshop at ICLR 2025oral

ASIDE: Architectural Separation of Instructions and Data in Language Models

Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Soroush Tabesh, Alexandra Volkova, Sebastian Lapuschkin, Wojciech Samek, Christoph Lampert

Paper Code

Despite their remarkable performance, large language models lack elementary safety features, making them susceptible to numerous malicious attacks. In particular, previous work has identified the absence of an intrinsic separation between instructions and data as a root cause of the success of prompt injection attacks. In this work, we propose a new architectural element, ASIDE, that allows language models to clearly separate instructions and data at the level of embeddings. ASIDE applies an orthogonal rotation to the embeddings of data tokens, thus creating clearly distinct representations of instructions and data tokens without introducing any additional parameters. As we demonstrate experimentally across a range of models, instruction-tuning LLMs with ASIDE (1) leads to highly increased instruction-data separation without a loss in model utility and (2) makes the models more robust to prompt injection benchmarks, even without dedicated safety training. Additionally, we provide insights into the mechanism underlying our method through an analysis of the model representations. The source code and training scripts are openly accessible at \href{https://github.com/egozverev/aside}{https://github.com/egozverev/aside}.

@inproceedings{zverev2025aside,
    author = "Zverev, Egor and Kortukov, Evgenii and Panfilov, Alexander and Tabesh, Soroush and Volkova, Alexandra and Lapuschkin, Sebastian and Samek, Wojciech and Lampert, Christoph H",
    title = "ASIDE: Architectural Separation of Instructions and Data in Language Models",
    booktitle = "BuildingTrust Workshop at ICLR 2025",
    url = "https://arxiv.org/abs/2503.10566",
    year = "2025",
}
Capability-Based Scaling Laws for LLM Red-Teaming cover
R2-FM Workshop at ICML 2025

Capability-Based Scaling Laws for LLM Red-Teaming

Alexander Panfilov, Paul Kassianik, Maksym Andriushchenko, Jonas Geiping

Paper Code

As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a \emph{weak-to-strong} problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the \emph{capability gap} between attacker and target. We evaluate more than 500 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target’s capability exceeds the attacker's, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these trends, we derive a \emph{jailbreaking scaling law} that predicts attack success for a fixed target based on attacker-target capability gap. These findings suggest that fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks for existing systems, and model providers must accurately measure and control models' persuasive and manipulative abilities to limit their effectiveness as attackers.

@inproceedings{panfilov2025scalinglaws,
    author = "Panfilov, Alexander and Kassianik, Paul and Andriushchenko, Maksym and Geiping, Jonas",
    title = "Capability-Based Scaling Laws for LLM Red-Teaming",
    booktitle = "R2-FM Workshop at ICML 2025",
    url = "https://arxiv.org/abs/2505.20162",
    year = "2025",
}
An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks cover
ICML 2025

An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks

Valentyn Boreiko*, Alexander Panfilov*, Vaclav Voracek, Matthias Hein, Jonas Geiping

Paper Code

A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. These methods largely succeed in coercing the target output in their original settings, but their attacks vary substantially in fluency and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model checks if a given jailbreak is likely to occur in the distribution of text. For this, we build an N-gram language model on 1T tokens, which, unlike model-based perplexity, allows for an LLM-agnostic, nonparametric, and inherently interpretable evaluation. We adapt popular attacks to this threat model, and, for the first time, benchmark these attacks on equal footing with it. After an extensive comparison, we find attack success rates against safety-tuned modern models to be lower than previously presented and that attacks based on discrete optimization significantly outperform recent LLM-based attacks. Being inherently interpretable, our threat model allows for a comprehensive analysis and comparison of jailbreak attacks. We find that effective attacks exploit and abuse infrequent bigrams, either selecting the ones absent from real-world text or rare ones, e.g., specific to Reddit or code datasets.

@inproceedings{boreiko2024athreatmodel,
    author = "Boreiko*, Valentyn and Panfilov*, Alexander and Voracek, Vaclav and Hein, Matthias and Geiping, Jonas",
    title = "An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks",
    booktitle = "Proceedings of the 42nd International Conference on Machine Learning",
    display = "ICML 2025",
    url = "https://proceedings.mlr.press/v267/boreiko25a.html",
    year = "2025",
}
Provable Compositional Generalization for Object-Centric Learning cover
ICLR 2024oral

Provable Compositional Generalization for Object-Centric Learning

Thaddaeus Wiedemer*, Jack Brady*, Alexander Panfilov*, Attila Juhos*, Matthias Bethge, Wieland Brendel

Learning representations that generalize to novel compositions of known concepts is crucial for bridging the gap between human and machine perception. One prominent effort is learning object-centric representations, which are widely conjectured to enable compositional generalization. Yet, it remains unclear when this conjecture will be true, as a principled theoretical or empirical understanding of compositional generalization is lacking. In this work, we investigate when compositional generalization is guaranteed for object-centric representations through the lens of identifiability theory. We show that autoencoders that satisfy structural assumptions on the decoder and enforce encoder-decoder consistency will learn object-centric representations that provably generalize compositionally. We validate our theoretical result and highlight the practical relevance of our assumptions through experiments on synthetic image data.

@inproceedings{wiedemer2024provable,
    author = "Wiedemer*, Thaddaeus and Brady*, Jack and Panfilov*, Alexander and Juhos*, Attila and Bethge, Matthias and Brendel, Wieland",
    title = "Provable Compositional Generalization for Object-Centric Learning",
    booktitle = "ICLR 2024",
    url = "https://openreview.net/forum?id=7VPTUWkiDQ",
    year = "2024",
}

Acknowledgements

I am grateful to the many friends and colleagues, from whom I learned so much, for their invaluable guidance and for shaping my research vision. I would like to especially acknowledge Svyatoslav Oreshin, Arip Asadualev, Roland Zimmermann, Thaddaeus Wiedemer, Jack Brady, Wieland Brendel, Valentyn Boreiko, Matthias Hein, Shashwat Goel, Illia Shumailov, Maksym Andriushchenko, and Jonas Geiping.