EvalOps
EvalOps is my work on measuring and improving AI systems. The point is simple: know whether a change made the system better, worse, or merely different.
The main site is evalops.dev.
Problem
- AI systems change behavior in ways ordinary tests miss.
- Prompt and model updates can make a product worse while the dashboard stays green.
- Review feedback usually disappears instead of becoming better tests.
- Teams need to know whether the system improved, regressed, or merely changed.
Beliefs
- Evals belong in the workflow, not in a quarterly benchmark doc.
- Human review is useful only when it leaves behind durable signal.
- Quality should be versioned, reviewed, and debugged like code.
- Good measurement makes the next decision smaller.
Work
- Evals for agents, code review, and product behavior.
- Regression checks across prompts, models, tools, and workflows.
- Reviews that turn judgment into reusable tests.
- Clear records of what changed and why it mattered.
Contact
I want to talk to teams working on AI code review, support agents, workflow automation, or products where quality is hard to measure.
Email jonathan@haasholdings.com or visit evalops.dev.