Katherine Soto — AI & Machine Learning Engineer

What I work with

The stack, end to end

Six years building production systems, the last two focused on artificial intelligence, Generative AI, and AI safety. A strong software engineering foundation (C/C++, École 42) underpins everything — I'm comfortable across the whole pipeline, from low-level systems to model design.

00 / foundation

Software Engineering

C · C++ · C#
Python · Node.js
Docker · Git
Distributed systems
APIs · clean code

01 / ingest

Data Engineering

AWS Glue · Lambda
Spark · Polars
ETL / ELT
Medallion arch
S3 · RDS · DynamoDB

02 / train

ML & Deep Learning

PyTorch · TensorFlow
Hugging Face
Computer Vision
NLP · Transformers
LoRA / QLoRA

03 / evaluate

GenAI & Safety

LLM finetuning
LLM-as-Judge
Eval harnesses
Red-teaming
Interpretability

04 / ship

MLOps & Cloud

SageMaker
Docker · CI/CD
Model versioning
IaC (CloudFormation)
Monitoring

Selected work

Things I've built

A mix of production systems, research, and end-to-end ML projects. Code on GitHub.

01 / research · thesis

Multi-turn LLM Red-Teaming & Safety Eval

End-to-end GenAI safety pipeline: data generation → LoRA/QLoRA finetuning → automated evaluation across a harm taxonomy of 20+ categories. Raised Attack Success Rate from 38.8% to 65.9% on held-out categories. Multi-agent harness with an LLM-as-Judge, plus interpretability research to detect multi-turn jailbreak attacks.

PyTorch · HF PEFT · LoRA · Interpretability

View project →
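
For readers unfamiliar with the metric quoted in the red-teaming project, here is a minimal Python sketch of how an Attack Success Rate could be computed from judge verdicts. It assumes the usual definition (the fraction of attack attempts a judge marks as successful). The function names, category names, and records are hypothetical and are not taken from the project.

```python
from collections import defaultdict


def attack_success_rate(verdicts):
    """Fraction of attack attempts judged successful (truthy verdicts)."""
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(1 for v in verdicts if v) / len(verdicts)


def asr_by_category(records):
    """Map (category, success) pairs to a {category: ASR} dictionary."""
    grouped = defaultdict(list)
    for category, success in records:
        grouped[category].append(bool(success))
    return {cat: attack_success_rate(vs) for cat, vs in grouped.items()}


# Hypothetical judge output, for illustration only (not project data).
records = [
    ("fraud", True), ("fraud", False), ("fraud", True), ("fraud", True),
    ("malware", False), ("malware", True),
]
per_category = asr_by_category(records)
overall = attack_success_rate(success for _, success in records)
```

Under that definition, the reported move from 38.8% to 65.9% is a gain of 27.1 percentage points.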

02 / computer vision

Foundation Models for Low-Cost Medical Labeling

Benchmarked CLIP vs BiomedCLIP for pneumonia detection on chest X-rays. BiomedCLIP + LightGBM matched supervised baselines with only ~140 labeled images. Added an LLM-as-Judge quality gate to flag poorly-captured scans.

CLIP · BiomedCLIP · LightGBM · Grad-CAM

View project →

03 / generative ai

Petly — Generative Pet Creation

Two-phase pipeline: facial-attribute extraction → trait voting → diffusion generation. Benchmarked CNN, ResNet-18 and ViT-B/16 (0.883 accuracy); generated pets with Stable Diffusion + IP-Adapter, accelerated via LCM-LoRA. Served end-to-end with FastAPI.

Stable Diffusion · ViT · CLIP · FastAPI

View project →

04 / data engineering

Moltbook — End-to-End ML Platform

AWS-native data platform with medallion architecture (Bronze/Silver/Gold): scraping → Glue ELT → schema-validated S3/RDS → H2O AutoML on SageMaker → served via API Gateway + Lambda.

Spark · Polars · H2O AutoML · AWS Glue

View project →

05 / classical ml

Network Intrusion Anomaly Detection

Entropy-based unsupervised anomaly detection on network traffic (KDD Cup 1999), distinguishing normal patterns from attack types. Benchmarked against supervised baselines with full statistical evaluation.

R · Entropy · Classification

View project →
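
One common way to turn entropy into an unsupervised anomaly signal can be sketched as follows. This is an illustrative Python sketch, not the project's R code: the windowing scheme, the threshold `k`, the function names, and the toy `service` values are all assumptions. It scores each window of traffic by the Shannon entropy of a categorical field and flags windows that deviate strongly from a baseline.

```python
import math
from collections import Counter
from statistics import mean, stdev


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def flag_anomalous_windows(windows, baseline, k=3.0):
    """Indices of windows whose entropy is more than k baseline std devs from the baseline mean."""
    base = [shannon_entropy(w) for w in baseline]
    mu = mean(base)
    sigma = stdev(base) or 1e-9  # guard against a perfectly flat baseline
    return [
        i for i, w in enumerate(windows)
        if abs(shannon_entropy(w) - mu) > k * sigma
    ]


# Toy data (hypothetical): each window lists the service seen per connection.
baseline = [
    ["http", "http", "smtp", "ftp"],
    ["http", "smtp", "smtp", "ftp"],
    ["http", "http", "http", "ftp"],
]
windows = [
    ["http", "smtp", "ftp", "http"],  # ordinary mix of services
    ["private"] * 4,                  # every connection hits one service
]
flagged = flag_anomalous_windows(windows, baseline, k=2.0)
```

A window dominated by a single value has near-zero entropy, which is why the second toy window stands out against the mixed baseline.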

06 / deep learning

Emotion Recognition — Vision Transformers

Image-based emotion recognition using Vision Transformers combined with LightGBM, with Optuna for hyperparameter optimization.

ViT · LightGBM · Optuna

View project →

07 / systems · low-level

Simple Shell — Unix Command Interpreter in C

A basic command-line interpreter written from scratch in C: parsing, process creation (fork/exec), PATH resolution, built-in commands, and environment handling — a ground-up implementation of how a shell works.

C · Unix · Processes · Syscalls

View project →

08 / graphics · C

FdF — 3D Wireframe Renderer in C

École 42 project: reads a height-map file and renders it as a 3D wireframe using isometric projection. Built in C with MiniLibX — line drawing (Bresenham), coordinate transforms, and memory-managed parsing from scratch.

C · MiniLibX · Graphics · Linear Algebra

View project →
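
The two techniques named in the FdF description, isometric projection and Bresenham line drawing, can be sketched briefly. The project itself is written in C. This is a Python illustration only, and the 30-degree projection formula, the `scale` parameter, and the function names are assumptions rather than the project's code.

```python
import math


def iso_project(x, y, z, scale=1.0):
    """Project a 3D grid point to 2D with a common 30-degree isometric formula."""
    angle = math.radians(30)
    sx = (x - y) * math.cos(angle) * scale
    sy = (x + y) * math.sin(angle) * scale - z * scale
    return round(sx), round(sy)


def bresenham(x0, y0, x1, y1):
    """Integer pixels on the line from (x0, y0) to (x1, y1), endpoints included."""
    points = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    step_x = 1 if x0 < x1 else -1
    step_y = 1 if y0 < y1 else -1
    err = dx + dy  # running error term shared by both axes
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:  # step along x
            err += dy
            x0 += step_x
        if e2 <= dx:  # step along y
            err += dx
            y0 += step_y
    return points


# Draw one wireframe edge between two neighbouring height-map points.
start = iso_project(0, 0, 0, scale=10)
end = iso_project(1, 0, 0, scale=10)
edge_pixels = bresenham(*start, *end)
```

A renderer of this kind would repeat the last three lines for every pair of adjacent grid points, writing each pixel to the image buffer instead of collecting it in a list.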

09 / systems · C

ft_printf — Reimplementing printf in C

École 42 project: a from-scratch implementation of the C standard printf function, handling variadic arguments, format specifiers, flags, and edge cases — a deep dive into low-level string formatting and memory.

C · Variadic Functions · Memory

View project →

The path

From software to ML

A developer who grew into machine learning by shipping it

2026 — Present

MSc Data Science — AI Safety research thesis

La Salle · Universitat Ramon Llull

MSc in Data Science. Thesis on multi-turn LLM red-teaming and safety evaluation — finetuning attacker LLMs and building automated eval pipelines. Research on interpretability in LLMs to understand and detect multi-turn jailbreak attacks.

2022 — 2026

Tech Lead

Hera Solutions

Promoted to Tech Lead within 3 months. Led and mentored a team of 5, architected end-to-end ML pipelines and serverless infrastructure on AWS, built dev tooling, and owned the team's technical direction across the full stack.

2024 — 2025

AI Engineer — Vision & GenAI

Hera Solutions

Built a computer-vision damage-detection model from scratch to production, plus a GenAI advisor agent — progressing it from a third-party API (v1) to a self-hosted Hugging Face pipeline (v2).

2020 — 2022

Software Engineer

Hera Solutions · Freelance

Built and shipped complex production systems end-to-end: full-stack features, data pipelines, and a graph-database fraud-analysis system for a major bank. Strong CS foundation from École 42 and Holberton (C/C++, low-level systems, algorithms).

Education & programs

Highest Distinction

MSc Data Science

La Salle · URL · 2025–26

AI Safety

BlueDot Technical Program

2026

AI Safety · Scholarship

ML4Good Bootcamp

Colombia · 2025

Full Scholarship

Data Analytics Program

UC Berkeley · 2020

1st in Cohort 15

Full-Stack Dev

Holberton School · 2022

Computer Science

École 42

Silicon Valley · 2019–21

1st Place

NASA Space Apps

Hub Peru · 2024

Speaker

IEEE Women's Day

Peru · 2025