EvalOps

EvalOps is my work on AI evaluation, agent quality, and operational trust. The short version: if AI agents are going to do production work, teams need a way to prove they are getting better instead of just getting busier.

The main site is evalops.dev. This page is the thesis behind the work.

The Problem

  • Agents ship changes faster than teams can inspect them.
  • Prompt and model updates create regressions that look like product judgment failures.
  • Most eval suites are disconnected from production behavior, review feedback, and release decisions.
  • Teams need evidence that an agent is improving, not just a better demo.

What I Believe

  • Evals are an operating system for AI work, not a one-time test suite.
  • The best signal comes from the loop between automated checks, human review, production traces, and product outcomes.
  • Agent quality has to be versioned, reviewed, and debugged like software quality.
  • Useful evaluation systems make decisions easier: ship, block, roll back, retrain, reroute, or investigate.
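
To make the last point concrete, here is a minimal sketch of an eval result turning into a release decision. Everything in it is an assumption for illustration: the `EvalResult` type, the `decide` function, the suite name, and the thresholds are hypothetical, not part of any EvalOps product or API. It covers only three of the decisions listed above (ship, block, investigate).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Hypothetical summary of one eval suite run for one agent version."""
    suite: str    # illustrative suite name, e.g. "code_review"
    passed: int   # number of cases that passed
    total: int    # number of cases that ran

    @property
    def pass_rate(self) -> float:
        # An empty run counts as 0.0 so it can never look like an improvement.
        return self.passed / self.total if self.total else 0.0


def decide(baseline: EvalResult, candidate: EvalResult,
           block_drop: float = 0.05, noise: float = 0.01) -> str:
    """Map a baseline-vs-candidate comparison to a release decision.

    The thresholds are assumed values: a drop of `block_drop` or more blocks
    the release, a smaller drop beyond `noise` asks for a human look, and
    anything else ships.
    """
    delta = candidate.pass_rate - baseline.pass_rate
    if delta <= -block_drop:
        return "block"
    if delta < -noise:
        return "investigate"
    return "ship"


baseline = EvalResult("code_review", passed=90, total=100)
candidate = EvalResult("code_review", passed=88, total=100)
print(decide(baseline, candidate))
```

A real gate would likely compare many suites and account for sample size and variance. The point of the sketch is only that the output is a decision rather than a score.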

What I Am Building Toward

  • Evaluation harnesses for agents, code review, product behavior, and model routing.
  • Regression detection across prompts, models, tools, and workflow changes.
  • Review loops that capture human judgment without turning every release into a research project.
  • Operational visibility for teams that need to trust delegated AI work.

Who Should Reach Out

I want to talk to teams evaluating AI agents, AI code review, support agents, workflow automation, or model-driven product surfaces where quality needs to be measured over time.

Email jonathan@haasholdings.com or visit evalops.dev.