i am jonathan haas.

i'm interested in making AI systems more reliable through better evaluation. currently building behavioral testing infrastructure for language models—adversarial testing, robustness under distribution shift, and converting measurements into guardrails and feedback signals.

security engineering taught me how fragile production systems can be. i worked as an engineer at Snap, Carta, and DoorDash, then built ThreatKey to help companies catch issues before they explode. that experience drives my current work at EvalOps, my applied research lab: turning model evaluation into a first-class primitive so we can ship reliable, steerable AI.

recent writing

I Tested 5 Embedding Models on 10K Developer Questions

Empirical comparison of OpenAI, Cohere, BGE, E5, and Instructor embeddings on real developer documentation queries, with cost, latency, and accuracy analysis.

The Complete Guide to Developer Experience

A comprehensive synthesis of 21 posts on DX: patterns, principles, and practices for building exceptional developer tools and experiences.

The 10-Minute AI POC That Becomes a 10-Month Nightmare

It started with a Jupyter notebook. 'Look, I built a chatbot in 10 minutes!' Nine months later, three engineers had quit and the company almost folded.

see more

projects i'm proud of

cognitive dissonance detection

multi-agent system for detecting and resolving cognitive dissonance in LLMs (263 stars)

dspy 0-to-1 guide

comprehensive guide for building self-improving LLM applications (154 stars)

dspy micro agent

minimal agent runtime with CLI, FastAPI server, and eval harness (48 stars)

dspy advanced prompting

production-grade LLM techniques with 200+ test cases (37 stars)

founder email optimizer

dspy-powered email optimization for startup founders (30 stars)

diffscope

composable code review engine for automated diff analysis (10 stars)