EvalOps
EvalOps is my work on AI evaluation, agent quality, and operational trust. The short version: if AI agents are going to do production work, teams need a way to prove they are getting better instead of just getting busier.
The main site is evalops.dev. This page is the thesis behind the work.
The Problem
- Agents ship changes faster than teams can inspect them.
- Prompt and model updates create regressions that look like product judgment failures.
- Most eval suites are disconnected from production behavior, review feedback, and release decisions.
- Teams need evidence that an agent is actually improving, not just a more impressive demo.
What I Believe
- Evals are an operating system for AI work, not a one-time test suite.
- The best signal comes from the loop between automated checks, human review, production traces, and product outcomes.
- Agent quality has to be versioned, reviewed, and debugged like software quality.
- Useful evaluation systems make decisions easier: ship, block, roll back, retrain, reroute, or investigate.
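
To make the last point concrete, here is a minimal sketch of a release gate that turns evidence into one of those decisions. Everything in it is an assumption made for illustration: the names `EvalSummary`, `Decision`, and `decide`, the fields, and the 0.02 tolerance are hypothetical and do not describe any EvalOps API.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    SHIP = "ship"
    BLOCK = "block"
    ROLL_BACK = "roll back"
    INVESTIGATE = "investigate"


@dataclass
class EvalSummary:
    """Hypothetical summary of the evidence for one candidate version."""
    pass_rate: float           # fraction of automated checks passed, 0.0 to 1.0
    baseline_pass_rate: float  # same metric for the version currently live
    reviewer_flags: int        # issues raised during human review
    deployed: bool             # True if the candidate is already in production


def decide(summary: EvalSummary, tolerance: float = 0.02) -> Decision:
    """Map evidence to a decision. The thresholds are illustrative assumptions."""
    regressed = summary.pass_rate < summary.baseline_pass_rate - tolerance
    if regressed:
        # A regression blocks an unreleased change and rolls back a released one.
        return Decision.ROLL_BACK if summary.deployed else Decision.BLOCK
    if summary.reviewer_flags > 0:
        # Automated checks look fine, but a human raised a concern.
        return Decision.INVESTIGATE
    return Decision.SHIP
```

A real gate would likely weigh more signals (production traces, product outcomes) and support more outcomes (retrain, reroute). The sketch is only meant to show that the output of an eval system can be a decision rather than a score.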
What I Am Building Toward
- Evaluation harnesses for agents, code review, product behavior, and model routing.
- Regression detection across prompts, models, tools, and workflow changes.
- Review loops that capture human judgment without turning every release into a research project.
- Operational visibility for teams that need to trust delegated AI work.
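
As a rough illustration of the regression detection item above, the simplest version compares per-case results between a baseline and a candidate (a new prompt, model, tool, or workflow) and lists the cases that used to pass and now fail. The function name, the case IDs, and the pass/fail data shape below are hypothetical, chosen only for this sketch.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return IDs of cases that passed on the baseline but do not pass on the candidate.

    A case missing from the candidate run is treated as a regression,
    which is an assumption of this sketch rather than a general rule.
    """
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


# Hypothetical results keyed by eval case ID.
baseline_run = {"refund-policy": True, "sql-injection-review": True, "tone-check": False}
candidate_run = {"refund-policy": True, "sql-injection-review": False, "tone-check": True}

regressions = find_regressions(baseline_run, candidate_run)
print(regressions)
```

Note that the candidate fixes one case and breaks another, so an aggregate pass rate would show no change. Comparing case by case is what surfaces the regression.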
Who Should Reach Out
I want to talk to teams evaluating AI agents, AI code review, support agents, workflow automation, or model-driven product surfaces where quality needs to be measured over time.
Email jonathan@haasholdings.com or visit evalops.dev.