Alexander Panfilov

Yo! My name is Sasha and I am a second-year ELLIS / IMPRS-IS PhD student, based in Tübingen. I find myself very lucky to be advised by Jonas Geiping and Maksym Andriushchenko.

Broadly, I am interested in adversarial robustness, AI safety, and ML security. In practical terms, I enjoy finding various ways to break machine learning systems. Roughly three days a week I am an AI doomer.

Lately, I have been focusing on jailbreaking attacks on LLMs, contemplating: (1) What are the viable threat models for attacks on safety tuning? (2) Are safety jailbreaks truly effective, or are we victims of flawed (LLM-based) evaluations? (3) Are we doomed?

You can find my CV here. I am always open to collaboration — feel free to reach out via email!

News

  • May 01, 2025: Our work, "An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks", has been accepted at ICML 2025.
  • April 15, 2025: Our work, "ASIDE: Architectural Separation of Instructions and Data in Language Models", has been accepted for an oral presentation at the BuildingTrust Workshop at ICLR 2025.
  • November 05, 2024: Presented our work, "Provable Compositional Generalization for Object-Centric Learning" at EPFL (Nicolas Flammarion's group seminar). You can find the slides here.
  • October 09, 2024: Our work, "A Realistic Threat Model for Large Language Model Jailbreaks", has been accepted for an oral presentation at the Red Teaming GenAI Workshop at NeurIPS 2024.
  • May 01, 2024: Started my PhD at the ELLIS Institute Tübingen / Max Planck Institute for Intelligent Systems. You can find the slides for my IMPRS talk here.

Selected Publications

Capability-Based Scaling Laws for LLM Red-Teaming
Alexander Panfilov, Paul Kassianik, Maksym Andriushchenko, Jonas Geiping
preprint
Paper / Code
An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks
Valentyn Boreiko*, Alexander Panfilov*, Václav Voráček, Matthias Hein, Jonas Geiping
ICML 2025
Paper / Code
ASIDE: Architectural Separation of Instructions and Data in Language Models (oral)
Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Soroush Tabesh, Alexandra Volkova, Sebastian Lapuschkin, Wojciech Samek, Christoph Lampert
BuildingTrust Workshop at ICLR 2025
Paper / Code
Provable Compositional Generalization for Object-Centric Learning (oral)
Thaddäus Wiedemer*, Jack Brady*, Alexander Panfilov*, Attila Juhos*, Matthias Bethge, Wieland Brendel
ICLR 2024
Paper / Code / Project Page
A Minimalist Approach for Domain Adaptation with Optimal Transport
Arip Asadulaev, Vitaly Shutov, Alexander Korotin, Alexander Panfilov, Vladislava Kontsevaya, Andrey Filchenkov
CoLLAs 2023
Paper

Acknowledgements

I am grateful to the many colleagues I have worked with in the past, from whom I learned so much, for their invaluable contributions to my career. I would like to especially acknowledge the mentorship and guidance of Svyatoslav Oreshin, Arip Asadulaev, Roland Zimmermann, Thaddäus Wiedemer, Jack Brady, Wieland Brendel, Valentyn Boreiko, and Matthias Hein.