AI Safety Researcher · Ex-Tech Lead · Machine Learning Engineer

Katherine Soto

Building red-teaming agents and using mechanistic interpretability to detect multi-turn jailbreak attacks on LLMs. Ex-Tech Lead. Peruvian, based in Spain.

01 — RESEARCH

Current thesis

MSc Thesis · La Salle — Universitat Ramon Llull
Interpretability in Multi-turn Jailbreak Attacks on LLMs
My research sits at the intersection of AI security and mechanistic interpretability. I'm working on both sides of the problem — building adversarial agents that attack, and using interpretability to understand and detect those attacks from inside the model.
⚔️
Red-teaming Agent
Building a multi-turn adversarial agent, fine-tuned with LoRA on the ScaleAI/MHJ dataset, that generates jailbreak attacks against LLMs.
🔬
Attack Detection via Interp
Using mechanistic interpretability to identify internal signals that distinguish an ongoing multi-turn attack from normal conversation.
Red-teaming · Multi-turn Jailbreaks · Mech Interp · Attack Detection · LLM Safety · Adversarial Robustness
03 — JOURNEY

Builder to researcher

Experience
2025 — PRESENT
MSc Researcher — AI Safety
La Salle — Universitat Ramon Llull
MSc in Data Science. Thesis on interpretability in multi-turn jailbreak attacks: building red-teaming agents and applying mechanistic interpretability to attack detection.
PREVIOUS
Tech Lead
US-based Startup
Led the technical team across full-stack development, DevOps infrastructure, and production ML systems at scale.
AI Safety Programs
🔵
BlueDot AI Safety
Technical Program · 2026
🟢
ML4Good
AI Safety Bootcamp · 2025
🎓
MSc Data Science
Universitat Ramon Llull · 2025–2026