Alexander Panfilov

Yo! My name is Sasha and I am a second-year ELLIS / IMPRS-IS PhD student, based in Tübingen. I find myself very lucky to be advised by Jonas Geiping and Maksym Andriushchenko.

Broadly, I am interested in adversarial robustness, AI safety, and ML security. In practical terms, I enjoy finding various ways to break machine learning systems. Roughly three days a week I am an AI doomer.

Lately, I have been focusing on jailbreaking attacks on LLMs, contemplating: (1) What are the viable threat models for attacks on safety tuning? (2) Are safety jailbreaks truly effective, or are we victims of flawed (LLM-based) evaluations? (3) Are we doomed?

You can find my CV here. I am always open to collaboration — feel free to reach out via email!

News

  • May 01, 2025: Our work, "An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks", has been accepted at ICML 2025.
  • April 15, 2025: Our work, "ASIDE: Architectural Separation of Instructions and Data in Language Models", has been accepted for an oral presentation at the BuildingTrust Workshop at ICLR 2025.
  • November 05, 2024: Presented our work, "Provable Compositional Generalization for Object-Centric Learning" at EPFL (Nicolas Flammarion's group seminar). You can find the slides here.
  • October 09, 2024: Our work, "A Realistic Threat Model for Large Language Model Jailbreaks", has been accepted for an oral presentation at the Red Teaming GenAI Workshop at NeurIPS 2024.
  • May 01, 2024: Started my PhD at the ELLIS Institute Tübingen / Max Planck Institute for Intelligent Systems. You can find the slides for my IMPRS talk here.

Selected Publications

Capability-Based Scaling Laws for LLM Red-Teaming
Alexander Panfilov, Paul Kassianik, Maksym Andriushchenko, Jonas Geiping
preprint
Paper / Code
An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks
Valentyn Boreiko*, Alexander Panfilov*, Václav Voráček, Matthias Hein, Jonas Geiping
ICML 2025
Paper / Code
ASIDE: Architectural Separation of Instructions and Data in Language Models (oral)
Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Soroush Tabesh, Alexandra Volkova, Sebastian Lapuschkin, Wojciech Samek, Christoph Lampert
BuildingTrust Workshop at ICLR 2025
Paper / Code
Provable Compositional Generalization for Object-Centric Learning (oral)
Thaddäus Wiedemer*, Jack Brady*, Alexander Panfilov*, Attila Juhos*, Matthias Bethge, Wieland Brendel
ICLR 2024
Paper / Code / Project Page
A Minimalist Approach for Domain Adaptation with Optimal Transport
Arip Asadulaev, Vitaly Shutov, Alexander Korotin, Alexander Panfilov, Vladislava Kontsevaya, Andrey Filchenkov
CoLLAs 2023
Paper

Acknowledgements

I am grateful to the many colleagues I have worked with in the past, from whom I learned so much, for their invaluable contributions to my career. I would like to especially acknowledge the mentorship and guidance of Svyatoslav Oreshin, Arip Asadulaev, Roland Zimmermann, Thaddäus Wiedemer, Jack Brady, Wieland Brendel, Valentyn Boreiko, and Matthias Hein.