research.
research happens in public. i build evaluation systems in the open so teams can see the exact scaffolding behind every benchmark, monitor, and safety check before they ship.
the repos below are active production tools from EvalOps, my evaluation research lab. they cover eval harnesses, observability, and workflow automation for accountable ai. every line exists because a real deployment needed it.
follow along on github or check the running commentary on twitter.
flagship systems
core infrastructure that anchors my evaluation stack. these projects get weekly updates and drive most client deployments.
Multi-agent LLM system for detecting and resolving cognitive dissonance in AI outputs.
Minimal agent runtime built with DSPy modules and a thin Python loop. CLI, FastAPI server, and eval harness.
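the "thin Python loop" is the whole point of that runtime, so here is a minimal sketch of the pattern, assuming DSPy 2.5+'s dspy.LM / dspy.configure interface; the signature fields, tool registry, and stopping rule are illustrative assumptions, not the runtime's actual API.

```python
# illustrative sketch only: signature fields, tool names, and the stopping rule
# are assumptions, not the runtime's real interface. assumes DSPy 2.5+.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any LM DSPy supports

class Step(dspy.Signature):
    """Pick the next action for the task, or finish with an answer."""
    task: str = dspy.InputField()
    history: str = dspy.InputField(desc="prior actions and observations")
    action: str = dspy.OutputField(desc="'<tool>: <arg>' or 'finish: <answer>'")

def run_agent(task: str, tools: dict, max_steps: int = 5) -> str:
    decide = dspy.Predict(Step)
    history: list[str] = []
    for _ in range(max_steps):
        step = decide(task=task, history="\n".join(history) or "none")
        name, _, arg = step.action.partition(":")
        name, arg = name.strip().lower(), arg.strip()
        if name == "finish":
            return arg
        observation = tools.get(name, lambda a: f"unknown tool: {name}")(arg)
        history.append(f"{step.action} -> {observation}")
    return "stopped: step budget exhausted"
```

the loop stays dumb on purpose: the DSPy module decides, plain Python executes, and the step budget bounds cost.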
more tools and experiments
supporting libraries that round out the stack—observability hooks, testing rigs, and research sandboxes.
DSPy-powered email optimization for startup founders: drop in your 3 best emails, get optimized outreach.
Brutally honest "high-orbit" startup advisor you can text or run from the CLI. Built with DSPy.
DSPy framework for detecting and preventing safety override cascades in LLM systems.
DSPy library for security-aware LLM development using Bandit static analysis.
Multi-armed mocks for LLM apps: drop-in replacement for OpenAI/Anthropic APIs for deterministic testing (sketched after this list).
Advanced LLM evaluation framework with multi-critic deliberation protocols and OWASP LLM Top 10 assessment.
Circuit Breaker for LLM output monitoring with budgets, verifiers, and Verdict/DSPy adapters (pattern sketched after this list).
Extractive RAG with line-anchored citations that fails closed when confidence is low. Deterministic, no API keys (sketched after this list).
Library to convert AI evaluation results to OpenTelemetry GenAI semantic conventions for observability (sketched below).
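a few of these are easier to grok in code. the mock library boils down to swapping the real client for a stub that returns the same completion for the same input; this is a hand-rolled sketch of that shape, not the library's actual API (class names and the hashing scheme are assumptions).

```python
# hand-rolled sketch of a deterministic OpenAI-style stub; class names and the
# hashing scheme are illustrative, not the actual library's API.
import hashlib
from dataclasses import dataclass

@dataclass
class _Message:
    role: str
    content: str

@dataclass
class _Choice:
    message: _Message

@dataclass
class _Response:
    choices: list

class DeterministicChat:
    """Mimics chat.completions.create with the reply picked by input hash."""

    def __init__(self, canned_replies: list[str]):
        self._replies = canned_replies

    def create(self, model: str, messages: list[dict], **kwargs) -> _Response:
        # same model + messages always hashes to the same canned reply
        key = hashlib.sha256(repr((model, messages)).encode()).hexdigest()
        reply = self._replies[int(key, 16) % len(self._replies)]
        return _Response(choices=[_Choice(message=_Message("assistant", reply))])

# usage: wherever the app calls client.chat.completions.create(...), inject this
# stub so the test sees the same output on every run.
chat = DeterministicChat(["PASS", "FAIL", "NEEDS_REVIEW"])
resp = chat.create(model="gpt-4o", messages=[{"role": "user", "content": "grade this"}])
assert resp.choices[0].message.content in {"PASS", "FAIL", "NEEDS_REVIEW"}
```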
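the circuit breaker pairs per-run spend budgets with verifier callbacks and trips open when either gives out. the thresholds, verifier signature, and trip rule below are assumptions, and the Verdict/DSPy adapters are omitted.

```python
# general circuit-breaker pattern for LLM outputs: run verifiers per output,
# trip open after too many failures or once the spend budget is gone.
# thresholds and the verifier signature are illustrative assumptions.
from typing import Callable

Verifier = Callable[[str], bool]  # returns True when the output passes

class OutputCircuitBreaker:
    def __init__(self, verifiers: list[Verifier], max_failures: int = 3,
                 usd_budget: float = 5.0):
        self.verifiers = verifiers
        self.max_failures = max_failures
        self.usd_budget = usd_budget
        self.failures = 0
        self.spend = 0.0
        self.open = False  # open = stop calling the model

    def check(self, output: str, cost_usd: float) -> bool:
        if self.open:
            raise RuntimeError("circuit open: stop sending traffic to the model")
        self.spend += cost_usd
        passed = all(verify(output) for verify in self.verifiers)
        if not passed:
            self.failures += 1
        if self.failures >= self.max_failures or self.spend >= self.usd_budget:
            self.open = True
        return passed

# usage: wrap every model response in breaker.check(text, cost) and route to a
# fallback (cached answer, smaller model, human review) once the breaker opens.
breaker = OutputCircuitBreaker(verifiers=[lambda s: len(s) < 2000,
                                          lambda s: "BEGIN PROMPT" not in s])
```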
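the fail-closed extractive RAG repo is mostly about refusing to answer: every answer is a verbatim line from the source with its line number attached, and anything below a confidence threshold becomes an explicit abstention. the sketch below uses plain lexical overlap as a stand-in for the repo's actual scoring; the threshold is an assumption.

```python
# sketch of fail-closed extraction with line-anchored citations: pure lexical
# overlap scoring, no API keys, and an explicit abstention when confidence is
# low. the threshold and the Jaccard scoring are assumptions, not the repo's method.
def answer_with_citation(question: str, document: str, threshold: float = 0.35) -> dict:
    q_tokens = set(question.lower().split())
    best_score, best_line, best_idx = 0.0, "", -1
    for idx, line in enumerate(document.splitlines(), start=1):
        tokens = set(line.lower().split())
        if not tokens:
            continue
        score = len(q_tokens & tokens) / len(q_tokens | tokens)  # Jaccard overlap
        if score > best_score:
            best_score, best_line, best_idx = score, line.strip(), idx
    if best_score < threshold:
        # fail closed: return an explicit refusal instead of a guess
        return {"answer": None, "citation": None, "reason": "low confidence"}
    return {"answer": best_line, "citation": f"L{best_idx}", "score": round(best_score, 2)}
```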
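and the OpenTelemetry converter is, at its core, a mapping from eval-result fields onto span attributes. the sketch below uses GenAI semantic-convention attribute names as they stand today (the convention is still experimental, so keys may shift); the eval-result dict shape and the eval.* attributes are assumptions, not the library's schema.

```python
# sketch of mapping one eval result onto an OpenTelemetry span. the gen_ai.*
# keys follow the current (experimental) GenAI semantic conventions; the
# eval.* keys and the result dict shape are illustrative assumptions.
from opentelemetry import trace

tracer = trace.get_tracer("evalops.exporter")

def record_eval(result: dict) -> None:
    with tracer.start_as_current_span("gen_ai.evaluation") as span:
        span.set_attribute("gen_ai.system", result.get("provider", "openai"))
        span.set_attribute("gen_ai.request.model", result["model"])
        span.set_attribute("gen_ai.usage.input_tokens", result.get("input_tokens", 0))
        span.set_attribute("gen_ai.usage.output_tokens", result.get("output_tokens", 0))
        # evaluation-specific fields carried as custom attributes
        span.set_attribute("eval.metric", result["metric"])
        span.set_attribute("eval.score", float(result["score"]))
        span.set_attribute("eval.passed", bool(result["passed"]))

record_eval({"model": "gpt-4o-mini", "metric": "faithfulness",
             "score": 0.82, "passed": True})
```

without an SDK configured this is a no-op, which keeps the converter safe to leave in library code; wiring an exporter is the caller's choice.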
research notes
walkthroughs that document methodology, benchmarks, and decisions behind each release.
Empirical comparison of OpenAI, Cohere, BGE, E5, and Instructor embeddings with cost and latency analysis.
Deep dive into RAG architectures: chunking strategies, retrieval methods, and production patterns.
Systematic experiments on sampling parameters with empirical data on creativity vs coherence trade-offs.
partnering with teams
i work directly with orgs shipping real workloads. if you need bespoke evals, rollout support, or a second set of eyes on safety reviews, reach out.