i am jonathan haas.
i'm interested in making AI systems more reliable through better evaluation. currently building behavioral testing infrastructure for language models—adversarial testing, robustness under distribution shift, and converting measurements into guardrails and feedback signals.
security engineering taught me how fragile production systems can be. i worked as an engineer at Snap, Carta, and DoorDash, then built ThreatKey to help companies catch issues before they explode. that experience drives my current work at EvalOps, my applied research lab: turning model evaluation into a first-class primitive so we can ship reliable, steerable AI.
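to make "converting measurements into guardrails" a little more concrete, here's a toy sketch of the shape i mean — measure a behavior, then gate on it. the `generate` callable, the paraphrase set, and the threshold are all placeholders for illustration, not EvalOps code:

```python
# toy behavioral test: paraphrase robustness turned into a guardrail.
# `generate` stands in for any text-in/text-out model call; the
# paraphrase set and threshold are illustrative, not production values.
from typing import Callable, Sequence


def paraphrase_consistency(
    generate: Callable[[str], str],
    paraphrases: Sequence[str],
) -> float:
    """Fraction of paraphrases whose normalized answer matches the first one."""
    answers = [generate(p).strip().lower() for p in paraphrases]
    reference = answers[0]
    agree = sum(a == reference for a in answers)
    return agree / len(answers)


def guardrail(score: float, threshold: float = 0.8) -> None:
    """Turn the measurement into a hard gate: fail if behavior drifts under rephrasing."""
    if score < threshold:
        raise AssertionError(
            f"paraphrase consistency {score:.2f} below threshold {threshold}"
        )


if __name__ == "__main__":
    # stand-in model: returns a canned answer, so the check passes trivially
    fake_model = lambda prompt: "42"
    prompts = [
        "what is the answer to life, the universe, and everything?",
        "give me the answer to life, the universe and everything.",
        "the answer to everything is what number?",
    ]
    guardrail(paraphrase_consistency(fake_model, prompts))
```

the real infrastructure swaps in actual model calls, larger adversarial suites, and CI wiring, but the shape stays the same: measure, then gate.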
recent writing
Empirical comparison of OpenAI, Cohere, BGE, E5, and Instructor embeddings on real developer documentation queries, with cost, latency, and accuracy analysis.
A comprehensive synthesis of 21 posts on DX: patterns, principles, and practices for building exceptional developer tools and experiences.
It started with a Jupyter notebook. 'Look, I built a chatbot in 10 minutes!' Nine months later, three engineers had quit and the company almost folded.
projects i'm proud of
multi-agent system for detecting and resolving cognitive dissonance in LLMs (263 stars)
comprehensive guide for building self-improving LLM applications (154 stars)
minimal agent runtime with CLI, FastAPI server, and eval harness (48 stars)
production-grade LLM techniques with 200+ test cases (37 stars)
dspy-powered email optimization for startup founders (30 stars)
composable code review engine for automated diff analysis (10 stars)