Alexander Panfilov


Yo! My name is Sasha and I am a second-year ELLIS / IMPRS-IS PhD student, based in Tübingen. I find myself very lucky to be advised by Jonas Geiping and Maksym Andriushchenko.

Broadly, I am interested in adversarial robustness, AI safety, and ML security. In practical terms, I enjoy finding various ways to break machine learning systems. Roughly three days a week I am an AI doomer.

Lately, I have been focusing on jailbreaking attacks on LLMs, contemplating: (1) What are the viable threat models for attacks on safety tuning? (2) Are safety jailbreaks truly effective, or are we victims of flawed (LLM-based) evaluations? (3) Are we doomed?

You can find my CV here. I am always open to collaboration — feel free to reach out via email!

News

  • September 01, 2025: Kristina Nikolić, Evgenii Kortukov, and I won third place at the ARENA 6.0 Mechanistic Interpretability Hackathon by Apart Research at LISA (London)!
  • July 09, 2025: Our work, Capability-Based Scaling Laws for LLM Red-Teaming, has been accepted at the ICML 2025 Workshop on Reliable and Responsible Foundation Models!
  • June 23, 2025: Presented our work, Capability-Based Scaling Laws for LLM Red-Teaming and ASIDE, at Google's Red Teaming seminar. You can find the slides here. Thanks for the invitation!
  • May 01, 2025: Our work, An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks, has been accepted at ICML 2025.
  • April 15, 2025: Our work, ASIDE: Architectural Separation of Instructions and Data in Language Models, has been accepted for an oral presentation at the BuildingTrust Workshop at ICLR 2025.
  • November 05, 2024: Presented our work, Provable Compositional Generalization for Object-Centric Learning at EPFL (Nicolas Flammarion's group seminar). You can find the slides here.
  • October 09, 2024: Our work, A Realistic Threat Model for Large Language Model Jailbreaks, has been accepted for an oral presentation at the Red Teaming GenAI Workshop at NeurIPS 2024.
  • May 01, 2024: Started my PhD at the ELLIS Institute Tübingen / Max Planck Institute for Intelligent Systems. You can find the slides for my IMPRS talk here.

Selected Publications

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols
Mikhail Terekhov*, Alexander Panfilov*, Daniil Dzenhaliou*, Caglar Gulcehre, Maksym Andriushchenko, Ameya Prabhu, Jonas Geiping
preprint
Paper / Project Page
Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLM
Alexander Panfilov*, Evgenii Kortukov*, Kristina Nikolić, Matthias Bethge, Sebastian Lapuschkin, Wojciech Samek, Ameya Prabhu, Maksym Andriushchenko, Jonas Geiping
preprint
Paper
Capability-Based Scaling Laws for LLM Red-Teaming
Alexander Panfilov, Paul Kassianik, Maksym Andriushchenko, Jonas Geiping
preprint
Paper / Code
An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks
Valentyn Boreiko*, Alexander Panfilov*, Václav Voráček, Matthias Hein, Jonas Geiping
ICML 2025
Paper / Code
ASIDE: Architectural Separation of Instructions and Data in Language Models (oral)
Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Soroush Tabesh, Alexandra Volkova, Sebastian Lapuschkin, Wojciech Samek, Christoph Lampert
BuildingTrust Workshop at ICLR 2025
Paper / Code
Provable Compositional Generalization for Object-Centric Learning (oral)
Thaddäus Wiedemer*, Jack Brady*, Alexander Panfilov*, Attila Juhos*, Matthias Bethge, Wieland Brendel
ICLR 2024
Paper / Code / Project Page

Acknowledgements

I am grateful to the many colleagues I have worked with over the years, from whom I learned so much; their contributions to my career have been invaluable. I would especially like to acknowledge the mentorship and guidance of Svyatoslav Oreshin, Arip Asadualev, Roland Zimmermann, Thaddäus Wiedemer, Jack Brady, Wieland Brendel, Valentyn Boreiko, and Matthias Hein.