research.
research happens in public. i build evaluation systems in the open so teams can see the exact scaffolding behind every benchmark, monitor, and safety check before they ship.
the repos below are active production tools from EvalOps, my evaluation research lab. they cover eval harnesses, observability, and workflow automation for accountable ai. every line exists because a real deployment needed it.
follow along on github or check the running commentary on twitter.
flagship systems
core infrastructure that anchors my evaluation stack. these projects get weekly updates and drive most client deployments.
Multi-agent LLM system for detecting and resolving cognitive dissonance in AI outputs.
Minimal agent runtime built with DSPy modules and a thin Python loop. CLI, FastAPI server, and eval harness.
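the "thin Python loop" is the whole point of that runtime, so here is a minimal sketch of the pattern, assuming DSPy 2.5+'s dspy.LM / dspy.configure interface; the signature fields, tool registry, and stopping rule are illustrative assumptions, not the runtime's actual API.

```python
# illustrative sketch only: signature fields, tool names, and the stopping rule
# are assumptions, not the runtime's real interface. assumes DSPy 2.5+.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any LM DSPy supports

class Step(dspy.Signature):
    """Pick the next action for the task, or finish with an answer."""
    task: str = dspy.InputField()
    history: str = dspy.InputField(desc="prior actions and observations")
    action: str = dspy.OutputField(desc="'<tool>: <arg>' or 'finish: <answer>'")

def run_agent(task: str, tools: dict, max_steps: int = 5) -> str:
    decide = dspy.Predict(Step)
    history: list[str] = []
    for _ in range(max_steps):
        step = decide(task=task, history="\n".join(history) or "none")
        name, _, arg = step.action.partition(":")
        name, arg = name.strip().lower(), arg.strip()
        if name == "finish":
            return arg
        observation = tools.get(name, lambda a: f"unknown tool: {name}")(arg)
        history.append(f"{step.action} -> {observation}")
    return "stopped: step budget exhausted"
```

the loop stays dumb on purpose: the DSPy module decides, plain Python executes, and the step budget bounds cost.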
more tools and experiments
supporting libraries that round out the stack—observability hooks, testing rigs, and research sandboxes.
DSPy-powered email optimization for startup founders: drop in your 3 best emails, get optimized outreach.
Brutally honest "high-orbit" startup advisor you can text or run from the CLI. Built with DSPy.
DSPy framework for detecting and preventing safety override cascades in LLM systems.
DSPy library for security-aware LLM development using Bandit static analysis.
Multi-armed mocks for LLM apps: drop-in replacement for OpenAI/Anthropic APIs for deterministic testing (sketched after this list).
Advanced LLM evaluation framework with multi-critic deliberation protocols and OWASP LLM Top 10 assessment.
Circuit Breaker for LLM output monitoring with budgets, verifiers, and Verdict/DSPy adapters (pattern sketched after this list).
Extractive RAG with line-anchored citations that fails closed when confidence is low. Deterministic, no API keys (sketched after this list).
Library to convert AI evaluation results to OpenTelemetry GenAI semantic conventions for observability (sketched below).
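a few of these are easier to grok in code. the mock library boils down to swapping the real client for a stub that returns the same completion for the same input; this is a hand-rolled sketch of that shape, not the library's actual API (class names and the hashing scheme are assumptions).

```python
# hand-rolled sketch of a deterministic OpenAI-style stub; class names and the
# hashing scheme are illustrative, not the actual library's API.
import hashlib
from dataclasses import dataclass

@dataclass
class _Message:
    role: str
    content: str

@dataclass
class _Choice:
    message: _Message

@dataclass
class _Response:
    choices: list

class DeterministicChat:
    """Mimics chat.completions.create with the reply picked by input hash."""

    def __init__(self, canned_replies: list[str]):
        self._replies = canned_replies

    def create(self, model: str, messages: list[dict], **kwargs) -> _Response:
        # same model + messages always hashes to the same canned reply
        key = hashlib.sha256(repr((model, messages)).encode()).hexdigest()
        reply = self._replies[int(key, 16) % len(self._replies)]
        return _Response(choices=[_Choice(message=_Message("assistant", reply))])

# usage: wherever the app calls client.chat.completions.create(...), inject this
# stub so the test sees the same output on every run.
chat = DeterministicChat(["PASS", "FAIL", "NEEDS_REVIEW"])
resp = chat.create(model="gpt-4o", messages=[{"role": "user", "content": "grade this"}])
assert resp.choices[0].message.content in {"PASS", "FAIL", "NEEDS_REVIEW"}
```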
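the circuit breaker pairs per-run spend budgets with verifier callbacks and trips open when either gives out. the thresholds, verifier signature, and trip rule below are assumptions, and the Verdict/DSPy adapters are omitted.

```python
# general circuit-breaker pattern for LLM outputs: run verifiers per output,
# trip open after too many failures or once the spend budget is gone.
# thresholds and the verifier signature are illustrative assumptions.
from typing import Callable

Verifier = Callable[[str], bool]  # returns True when the output passes

class OutputCircuitBreaker:
    def __init__(self, verifiers: list[Verifier], max_failures: int = 3,
                 usd_budget: float = 5.0):
        self.verifiers = verifiers
        self.max_failures = max_failures
        self.usd_budget = usd_budget
        self.failures = 0
        self.spend = 0.0
        self.open = False  # open = stop calling the model

    def check(self, output: str, cost_usd: float) -> bool:
        if self.open:
            raise RuntimeError("circuit open: stop sending traffic to the model")
        self.spend += cost_usd
        passed = all(verify(output) for verify in self.verifiers)
        if not passed:
            self.failures += 1
        if self.failures >= self.max_failures or self.spend >= self.usd_budget:
            self.open = True
        return passed

# usage: wrap every model response in breaker.check(text, cost) and route to a
# fallback (cached answer, smaller model, human review) once the breaker opens.
breaker = OutputCircuitBreaker(verifiers=[lambda s: len(s) < 2000,
                                          lambda s: "BEGIN PROMPT" not in s])
```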
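the fail-closed extractive RAG repo is mostly about refusing to answer: every answer is a verbatim line from the source with its line number attached, and anything below a confidence threshold becomes an explicit abstention. the sketch below uses plain lexical overlap as a stand-in for the repo's actual scoring; the threshold is an assumption.

```python
# sketch of fail-closed extraction with line-anchored citations: pure lexical
# overlap scoring, no API keys, and an explicit abstention when confidence is
# low. the threshold and the Jaccard scoring are assumptions, not the repo's method.
def answer_with_citation(question: str, document: str, threshold: float = 0.35) -> dict:
    q_tokens = set(question.lower().split())
    best_score, best_line, best_idx = 0.0, "", -1
    for idx, line in enumerate(document.splitlines(), start=1):
        tokens = set(line.lower().split())
        if not tokens:
            continue
        score = len(q_tokens & tokens) / len(q_tokens | tokens)  # Jaccard overlap
        if score > best_score:
            best_score, best_line, best_idx = score, line.strip(), idx
    if best_score < threshold:
        # fail closed: return an explicit refusal instead of a guess
        return {"answer": None, "citation": None, "reason": "low confidence"}
    return {"answer": best_line, "citation": f"L{best_idx}", "score": round(best_score, 2)}
```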
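and the OpenTelemetry converter is, at its core, a mapping from eval-result fields onto span attributes. the sketch below uses GenAI semantic-convention attribute names as they stand today (the convention is still experimental, so keys may shift); the eval-result dict shape and the eval.* attributes are assumptions, not the library's schema.

```python
# sketch of mapping one eval result onto an OpenTelemetry span. the gen_ai.*
# keys follow the current (experimental) GenAI semantic conventions; the
# eval.* keys and the result dict shape are illustrative assumptions.
from opentelemetry import trace

tracer = trace.get_tracer("evalops.exporter")

def record_eval(result: dict) -> None:
    with tracer.start_as_current_span("gen_ai.evaluation") as span:
        span.set_attribute("gen_ai.system", result.get("provider", "openai"))
        span.set_attribute("gen_ai.request.model", result["model"])
        span.set_attribute("gen_ai.usage.input_tokens", result.get("input_tokens", 0))
        span.set_attribute("gen_ai.usage.output_tokens", result.get("output_tokens", 0))
        # evaluation-specific fields carried as custom attributes
        span.set_attribute("eval.metric", result["metric"])
        span.set_attribute("eval.score", float(result["score"]))
        span.set_attribute("eval.passed", bool(result["passed"]))

record_eval({"model": "gpt-4o-mini", "metric": "faithfulness",
             "score": 0.82, "passed": True})
```

without an SDK configured this is a no-op, which keeps the converter safe to leave in library code; wiring an exporter is the caller's choice.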
research notes
walkthroughs that document methodology, benchmarks, and decisions behind each release.
Empirical comparison of OpenAI, Cohere, BGE, E5, and Instructor embeddings with cost and latency analysis.
Deep dive into RAG architectures: chunking strategies, retrieval methods, and production patterns.
Systematic experiments on sampling parameters with empirical data on creativity vs coherence trade-offs.
partnering with teams
i work directly with orgs shipping real workloads. if you need bespoke evals, rollout support, or a second set of eyes on safety reviews, reach out.