EvalOps

EvalOps is my work on measuring and improving AI systems. The point is simple: know whether a change made the system better, worse, or merely different.

The main site is evalops.dev.

Problem

  • AI systems change behavior in ways ordinary tests miss.
  • Prompt and model updates can make a product worse while the dashboard stays green.
  • Review feedback usually disappears instead of becoming better tests.
  • Teams need to know whether the system improved, regressed, or merely changed.

Beliefs

  • Evals belong in the workflow, not in a quarterly benchmark doc.
  • Human review is useful only when it leaves behind durable signal.
  • Quality should be versioned, reviewed, and debugged like code.
  • Good measurement makes the next decision smaller.

Work

  • Evals for agents, code review, and product behavior.
  • Regression checks across prompts, models, tools, and workflows.
  • Reviews that turn judgment into reusable tests.
  • Clear records of what changed and why it mattered.
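
As a rough illustration of a regression check, here is a minimal sketch. It assumes each run yields a pass/fail result per named eval case. The names compare_runs and Comparison and the case IDs are hypothetical, made up for this example, and are not an EvalOps API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    fixed: list[str]    # cases that failed before and pass now
    broken: list[str]   # cases that passed before and fail now
    verdict: str        # "improved", "regressed", "changed", or "unchanged"


def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> Comparison:
    """Compare per-case pass/fail results from two runs of the same eval set."""
    # Only compare cases present in both runs (hypothetical policy choice).
    shared = sorted(baseline.keys() & candidate.keys())
    fixed = [c for c in shared if not baseline[c] and candidate[c]]
    broken = [c for c in shared if baseline[c] and not candidate[c]]

    if fixed and not broken:
        verdict = "improved"
    elif broken and not fixed:
        verdict = "regressed"
    elif fixed and broken:
        verdict = "changed"   # wins and losses both; needs a human look
    else:
        verdict = "unchanged"
    return Comparison(fixed, broken, verdict)


# Hypothetical results: the pass rate is 2 of 3 in both runs.
baseline = {"refund_policy": True, "cancel_order": False, "tone_check": True}
candidate = {"refund_policy": False, "cancel_order": True, "tone_check": True}

result = compare_runs(baseline, candidate)
print(result.verdict, result.fixed, result.broken)
```

In this made-up data the aggregate pass rate is identical across runs, so a dashboard tracking only that number would stay green, while the per-case comparison shows one case fixed and one broken.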

Contact

I want to talk to teams working on AI code review, support agents, workflow automation, or products where quality is hard to measure.

Email jonathan@haasholdings.com or visit evalops.dev.