research.

research happens in public. i build evaluation systems in the open so teams can see the exact scaffolding behind every benchmark, monitor, and safety check before they ship it.

the repos below are active production tools from EvalOps, my evaluation research lab. they cover eval harnesses, observability, and workflow automation for accountable ai. every line exists because a real deployment needed it.

follow along on github or check the running commentary on twitter.

flagship systems

core infrastructure that anchors my evaluation stack. these projects get weekly updates and drive most client deployments.

Multi-agent LLM system for detecting and resolving cognitive dissonance in AI outputs.

DSPy · Multi-Agent · Evaluation
DSPy Micro Agent · Python · 59

Minimal agent runtime built with DSPy modules and a thin Python loop. Ships a CLI, a FastAPI server, and an eval harness.

DSPy · Agents · FastAPI
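
to make the thin-loop idea concrete, here is a minimal sketch of a single DSPy module behind a REPL-style CLI. the model string, the task -> answer signature, and the repl helper are illustrative assumptions, not DSPy Micro Agent's actual code.

```python
# minimal sketch, assuming DSPy >= 2.5 and an OpenAI-compatible model; not the repo's code
import dspy

# configure whichever provider you use; the model string here is just an example
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# one single-step module: take a task, reason, return an answer
agent = dspy.ChainOfThought("task -> answer")

def repl() -> None:
    """Thin CLI loop: read a task, run the module, print the prediction."""
    while True:
        task = input("task> ").strip()
        if not task or task in {"exit", "quit"}:
            return
        print(agent(task=task).answer)

if __name__ == "__main__":
    repl()
```

a FastAPI route or an eval harness can wrap the same module; only the loop around it changes.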

more tools and experiments

supporting libraries that round out the stack—observability hooks, testing rigs, and research sandboxes.

DSPy-powered email optimization for startup founders: drop in your 3 best emails, get optimized outreach.

DSPy · Optimization · Sales
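
a hedged sketch of how "three good emails in, optimized outreach out" can map onto DSPy's few-shot compilation. the signature, the example fields, and the length-based metric are placeholders i am assuming for illustration, not the project's API.

```python
# illustrative sketch: compile an email-writing module from a handful of strong examples
import dspy
from dspy.teleprompt import BootstrapFewShot

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # example model string

# hypothetical signature: turn a rough draft plus audience into polished outreach
write_email = dspy.ChainOfThought("draft, audience -> email")

# the founder's three best emails become the trainset
trainset = [
    dspy.Example(draft="intro to a fintech CTO", audience="CTO", email="<best email #1>").with_inputs("draft", "audience"),
    dspy.Example(draft="follow-up after a demo", audience="founder", email="<best email #2>").with_inputs("draft", "audience"),
    dspy.Example(draft="cold outreach to a design lead", audience="design lead", email="<best email #3>").with_inputs("draft", "audience"),
]

def metric(example, prediction, trace=None):
    # placeholder metric: reward short outputs; a real one would score tone and structure
    return len(prediction.email.split()) < 180

optimized = BootstrapFewShot(metric=metric).compile(write_email, trainset=trainset)
print(optimized(draft="cold outreach to a staff engineer", audience="staff engineer").email)
```
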
Orbit Agent · Python · 14

Brutally honest "high-orbit" startup advisor you can text or run from the CLI. Built with DSPy.

DSPy · Advisory · CLI

DSPy framework for detecting and preventing safety override cascades in LLM systems.

Safety · DSPy · Research
Bandit DSPy · Python · 6

DSPy library for security-aware LLM development using Bandit static analysis.

Security · DSPy · Static Analysis
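
the underlying check is small enough to sketch: write model-generated code to a file and run Bandit over it before anything ships. the gate_generated_code helper is illustrative, not the library's API; only the bandit -f json invocation is the real CLI.

```python
# sketch, assuming Bandit is installed; gates generated Python on Bandit's JSON findings
import json
import subprocess
import tempfile

def gate_generated_code(code: str) -> list[dict]:
    """Return Bandit findings for a string of generated Python; an empty list means clean."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
        handle.write(code)
        path = handle.name
    # `bandit -f json <file>` emits a machine-readable report on stdout
    proc = subprocess.run(["bandit", "-f", "json", path], capture_output=True, text=True)
    return json.loads(proc.stdout).get("results", [])

findings = gate_generated_code("import subprocess\nsubprocess.call('ls', shell=True)\n")
print(f"{len(findings)} potential issue(s)")
```
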
Mocktopus · Python · 3

Multi-armed mocks for LLM apps: a drop-in replacement for the OpenAI and Anthropic APIs that keeps tests deterministic.

Testing · Mocking · Evaluation
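
a sketch of the deterministic-testing idea without the real library: key canned replies off a hash of the message list so the same prompt always yields the same text. MockChatClient and its fingerprinting are hypothetical stand-ins, not Mocktopus's interface.

```python
# hypothetical deterministic mock for a chat-completions client; not Mocktopus's API
import hashlib
import json

class MockChatClient:
    """Stands in for a chat client in tests: same prompt in, same text out."""

    def __init__(self, canned: dict[str, str], default: str = "MOCKED"):
        self.canned = canned      # map from prompt fingerprint to reply
        self.default = default

    @staticmethod
    def fingerprint(messages: list[dict]) -> str:
        # hash the full message list so identical prompts always hit the same reply
        return hashlib.sha256(json.dumps(messages, sort_keys=True).encode()).hexdigest()[:12]

    def create(self, messages: list[dict], **_: object) -> str:
        return self.canned.get(self.fingerprint(messages), self.default)

# usage in a test: the assertion is stable across runs because nothing hits a live API
client = MockChatClient({MockChatClient.fingerprint([{"role": "user", "content": "ping"}]): "pong"})
assert client.create([{"role": "user", "content": "ping"}]) == "pong"
```
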
LLM Tribunal · Python · 3

Advanced LLM evaluation framework with multi-critic deliberation protocols and OWASP LLM Top 10 assessment.

Evaluation · Security · OWASP
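
at its simplest, multi-critic deliberation is verdict aggregation. a toy sketch with stubbed critics and a quorum threshold; the real protocols add deliberation rounds and the OWASP-specific checks.

```python
# toy sketch of multi-critic aggregation; the critics here are stubs standing in for LLM judges
from typing import Callable

Critic = Callable[[str], bool]  # returns True if the output passes that critic's check

def tribunal(output: str, critics: list[Critic], quorum: float = 0.67) -> bool:
    """Pass only if at least `quorum` of the critics approve the output."""
    votes = [critic(output) for critic in critics]
    return sum(votes) / len(votes) >= quorum

# stub critics: length cap, no system-prompt leakage, no obvious injection marker
critics: list[Critic] = [
    lambda out: len(out) < 2000,
    lambda out: "system prompt" not in out.lower(),
    lambda out: "ignore previous instructions" not in out.lower(),
]

print(tribunal("Here is a concise, on-topic answer.", critics))  # True
```
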

Circuit Breaker for LLM output monitoring with budgets, verifiers, and Verdict/DSPy adapters.

Monitoring · Safety · DSPy
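
a minimal sketch of the breaker pattern, assuming a caller-supplied verifier and a simple failure budget. the OutputBreaker name and its methods are illustrative; the Verdict/DSPy adapters are not shown.

```python
# minimal failure-budget breaker; names are illustrative, not the project's API
from typing import Callable

class OutputBreaker:
    """Refuses further generations once `budget` verifier failures have accumulated."""

    def __init__(self, verify: Callable[[str], bool], budget: int = 3):
        self.verify = verify
        self.budget = budget
        self.failures = 0
        self.tripped = False

    def guard(self, generate: Callable[[], str]) -> str:
        if self.tripped:
            raise RuntimeError("breaker open: failure budget exhausted")
        output = generate()
        if self.verify(output):
            return output
        self.failures += 1
        self.tripped = self.failures >= self.budget
        raise ValueError("output rejected by verifier")

# usage: wrap any generation callable and a cheap verifier
breaker = OutputBreaker(verify=lambda text: "TODO" not in text, budget=2)
print(breaker.guard(lambda: "final copy, reviewed"))
```
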
ProofCite · Python

Extractive RAG with line-anchored citations that fails closed when confidence is low. Deterministic, no API keys.

RAG · Citations · Evaluation
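
the fail-closed behavior is the part worth illustrating: if nothing clears the bar, return no answer rather than a guess. a deliberately crude sketch that uses token overlap as a stand-in for confidence; it is not ProofCite's actual scoring.

```python
# crude sketch of line-anchored, fail-closed extraction; the threshold is a placeholder
def extract_with_citations(question: str, document: str, min_overlap: int = 3):
    """Return (line_number, line_text) pairs whose token overlap with the question clears
    the threshold; return None (fail closed) if nothing qualifies."""
    q_tokens = set(question.lower().split())
    hits = []
    for lineno, line in enumerate(document.splitlines(), start=1):
        overlap = len(q_tokens & set(line.lower().split()))
        if overlap >= min_overlap:
            hits.append((lineno, line.strip()))
    return hits or None  # None signals "no confident answer" rather than a guess
```
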
Eval2OTel · TypeScript · 2

Library to convert AI evaluation results to OpenTelemetry GenAI semantic conventions for observability.

OpenTelemetry · Observability · Evaluation
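
the library itself is TypeScript, so here is a Python sketch of the same idea: one span per evaluation result, with attributes a tracing backend can index. the gen_ai.* key follows the experimental OpenTelemetry GenAI conventions; the eval.* keys are illustrative custom attributes, not the library's schema.

```python
# sketch: emit one span per eval result so evaluation data lands in a normal tracing backend
from opentelemetry import trace

tracer = trace.get_tracer("eval2otel-sketch")

def record_eval(model: str, metric: str, score: float, passed: bool) -> None:
    """Record a single evaluation result as span attributes."""
    with tracer.start_as_current_span("evaluation") as span:
        span.set_attribute("gen_ai.request.model", model)   # experimental GenAI convention
        span.set_attribute("eval.metric.name", metric)       # custom key, for illustration
        span.set_attribute("eval.metric.score", score)
        span.set_attribute("eval.metric.passed", passed)

record_eval("gpt-4o-mini", "faithfulness", 0.87, True)
```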

research notes

walkthroughs that document methodology, benchmarks, and decisions behind each release.

I Tested 5 Embedding Models on 10K Developer Questions

Empirical comparison of OpenAI, Cohere, BGE, E5, and Instructor embeddings with cost and latency analysis.

Research · Embeddings · Benchmarking
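
one arm of that comparison, sketched: time a batch of embedding calls and record per-item latency. the model name and sample queries are placeholder assumptions; the post's numbers come from the full 10K-question run.

```python
# sketch of a single latency measurement arm; model name and queries are placeholders
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
queries = ["how do I profile a slow SQL query?", "what does asyncio.gather return?"]

start = time.perf_counter()
response = client.embeddings.create(model="text-embedding-3-small", input=queries)
elapsed = time.perf_counter() - start

dims = len(response.data[0].embedding)
print(f"{len(queries)} embeddings, {dims} dims, {elapsed / len(queries) * 1000:.1f} ms per item")
```
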
How RAG Actually Works: Architecture Patterns That Scale

Deep dive into RAG architectures: chunking strategies, retrieval methods, and production patterns.

Research · RAG · Architecture
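
as a reference point for the chunking discussion, a minimal fixed-size chunker with overlap. the sizes are illustrative defaults, not the post's recommendation.

```python
# baseline fixed-size chunker with overlap; token counts are approximated by whitespace split
def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows of `size` tokens that overlap by `overlap` tokens."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

pieces = chunk("lorem ipsum " * 300)
print(len(pieces), "chunks,", len(pieces[0].split()), "words in the first")
```
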
Prompt Engineering Science: I Tested Temperature and Top-P on 1000 Queries

Systematic experiments on sampling parameters with empirical data on creativity vs coherence trade-offs.

Research · Prompts · Benchmarking
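
the shape of that experiment is a plain grid sweep over sampling parameters. a sketch assuming the OpenAI Python client, with an illustrative model and grid; the post's run covered 1,000 queries and scored the outputs separately.

```python
# sketch of a sampling-parameter sweep: same prompt, grid of settings, outputs kept for scoring
import itertools
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
PROMPT = [{"role": "user", "content": "Name three uses for a paperclip."}]

results = []
for temperature, top_p in itertools.product([0.0, 0.7, 1.2], [0.5, 1.0]):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=PROMPT,
        temperature=temperature,
        top_p=top_p,
    )
    results.append((temperature, top_p, response.choices[0].message.content))

for temperature, top_p, text in results:
    print(f"T={temperature} top_p={top_p}: {text[:80]}")
```
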
view all writing

partnering with teams

i work directly with orgs shipping real workloads. if you need bespoke evals, rollout support, or a second set of eyes on safety reviews, reach out.