EvalOps

EvalOps is my work on measuring and improving AI systems. The point is simple: know whether a change made the system better, worse, or merely different.

The main site is evalops.dev.

Problem

AI systems change behavior in ways ordinary tests miss.
Prompt and model updates can make a product worse while the dashboard stays green.
Review feedback usually disappears instead of becoming better tests.
Teams need to know whether the system improved, regressed, or merely changed.

Beliefs

Evals belong in the workflow, not in a quarterly benchmark doc.
Human review is useful only when it leaves behind durable signal.
Quality should be versioned, reviewed, and debugged like code.
Good measurement makes the next decision smaller.

Work

Evals for agents, code review, and product behavior.
Regression checks across prompts, models, tools, and workflows.
Reviews that turn judgment into reusable tests.
Clear records of what changed and why it mattered.
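
As a rough illustration of the regression-check idea above, here is a minimal Python sketch that compares two eval runs case by case and sorts each case into improved, regressed, changed, or unchanged. Everything in it is an assumption made for illustration: the `CaseResult` shape, the `compare_runs` name, and the pass/fail-plus-output comparison are hypothetical and do not describe how EvalOps is actually built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """One eval case outcome (hypothetical shape, for illustration only)."""
    case_id: str
    passed: bool
    output: str


def compare_runs(baseline, candidate):
    """Classify each case present in both runs.

    improved:  failed in baseline, passes in candidate
    regressed: passed in baseline, fails in candidate
    changed:   same pass/fail status, different output
    unchanged: same pass/fail status, same output
    """
    base_by_id = {r.case_id: r for r in baseline}
    report = {"improved": [], "regressed": [], "changed": [], "unchanged": []}
    for cand in candidate:
        base = base_by_id.get(cand.case_id)
        if base is None:
            continue  # case has no baseline; this sketch skips it
        if cand.passed and not base.passed:
            report["improved"].append(cand.case_id)
        elif base.passed and not cand.passed:
            report["regressed"].append(cand.case_id)
        elif cand.output != base.output:
            report["changed"].append(cand.case_id)
        else:
            report["unchanged"].append(cand.case_id)
    return report


# Made-up data: a baseline run and a run after a prompt or model change.
baseline = [
    CaseResult("c1", True, "refund approved"),
    CaseResult("c2", False, "no answer"),
    CaseResult("c3", True, "ticket escalated"),
    CaseResult("c4", True, "order shipped"),
]
candidate = [
    CaseResult("c1", True, "refund approved"),
    CaseResult("c2", True, "refund denied, policy cited"),
    CaseResult("c3", False, "ticket closed"),
    CaseResult("c4", True, "your order has shipped"),
]

report = compare_runs(baseline, candidate)
for status, case_ids in report.items():
    print(status, case_ids)
```

The "changed" bucket is the one a green dashboard tends to hide: the case still passes, but the output is different, so it is worth a human look before the change ships.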

Contact

I want to talk to teams working on AI code review, support agents, workflow automation, or products where quality is hard to measure.

Email jonathan@haasholdings.com or visit evalops.dev.