I publish evaluation systems and tooling in the open. The focus is on harnesses, observability, and workflow automation for model evaluation.
The repos below are open-source work from EvalOps, my research lab on agent reliability. Some are in active use; others are exploratory.
Follow along on GitHub or check the running commentary on Twitter.
Core infrastructure that anchors my evaluation stack.
Multi-agent LLM system for detecting and resolving cognitive dissonance in AI outputs.
DSPy · Multi-Agent · Evaluation
Minimal agent runtime built from DSPy modules and a thin Python loop, with a CLI, a FastAPI server, and an eval harness.
DSPy · Agents · FastAPI
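The "thin Python loop" shape can be sketched like this (an illustrative sketch, not the repo's actual code; in the real runtime the policy would be a DSPy module rather than a plain callable):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    # policy maps the running transcript to the next (action, argument) pair;
    # in the real project this role is played by a DSPy module.
    policy: Callable[[str], tuple[str, str]]
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)
    max_steps: int = 8

    def run(self, task: str) -> str:
        state = task
        for _ in range(self.max_steps):
            action, arg = self.policy(state)
            if action == "finish":            # policy signals completion
                return arg
            result = self.tools[action](arg)  # execute tool, fold result back in
            state = f"{state}\n{action}({arg}) -> {result}"
        return state                          # step budget exhausted: return transcript
```

The loop is deliberately dumb: all decision-making lives in the policy, so swapping models or prompting strategies never touches the runtime.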
Supporting libraries, utilities, and prototypes.
DSPy-powered email optimization for startup founders: drop in your three best emails and get optimized outreach back.
DSPy · Optimization · Sales
Brutally honest "high-orbit" startup advisor you can text or run from the CLI. Built with DSPy.
DSPy · Advisory · CLI
DSPy framework for detecting and preventing safety override cascades in LLM systems.
Safety · DSPy · Research
DSPy library for security-aware LLM development using Bandit static analysis.
Security · DSPy · Static Analysis
Multi-armed mocks for LLM apps: a drop-in replacement for the OpenAI/Anthropic APIs that makes tests deterministic.
Testing · Mocking · Evaluation
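The core trick behind deterministic multi-armed mocking can be sketched as follows (an illustrative sketch under assumed names, not the library's real API): each prompt is hashed to pick one of several canned response "arms", so tests see varied outputs that are nonetheless reproducible across runs.

```python
import hashlib

class MockChatClient:
    """Deterministic stand-in for a chat-completion client (hypothetical API)."""

    def __init__(self, arms: list[str]):
        self.arms = arms  # canned responses to rotate through

    def complete(self, prompt: str) -> str:
        # sha256 is stable across processes, unlike built-in hash(),
        # whose seed is randomized per interpreter run.
        digest = hashlib.sha256(prompt.encode()).digest()
        return self.arms[digest[0] % len(self.arms)]
```

Because the arm choice depends only on the prompt text, the same test always exercises the same code path, while different prompts still hit different arms.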
Advanced LLM evaluation framework with multi-critic deliberation protocols and OWASP LLM Top 10 assessment.
Evaluation · Security · OWASP
Circuit Breaker for LLM output monitoring with budgets, verifiers, and Verdict/DSPy adapters.
Monitoring · Safety · DSPy
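The pattern, stripped to its essentials, looks like this (a generic circuit-breaker sketch; the class and parameter names are assumptions, not the repo's API): every output passes through verifiers, failed checks consume a budget, and once the budget is spent the breaker opens and refuses further calls.

```python
class CircuitBreaker:
    def __init__(self, verifiers, failure_budget: int = 3):
        self.verifiers = verifiers      # callables: output -> bool
        self.budget = failure_budget    # failures tolerated before opening
        self.open = False

    def check(self, output: str) -> bool:
        if self.open:
            # Fail fast instead of letting bad outputs through.
            raise RuntimeError("circuit open: too many failed verifications")
        if all(v(output) for v in self.verifiers):
            return True
        self.budget -= 1
        if self.budget <= 0:
            self.open = True
        return False
```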
Extractive RAG with line-anchored citations that fails closed when confidence is low. Deterministic, no API keys.
RAG · Citations · Evaluation
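The fail-closed idea can be illustrated with a toy scorer (a sketch of the technique, not the repo's actual code; the overlap metric and threshold are placeholder assumptions): score each source line against the question, answer with a line-anchored citation, or refuse outright when confidence is too low.

```python
def answer(question: str, lines: list[str], min_score: float = 0.5):
    """Return the best-supported line with an L<n> citation, or None (fail closed)."""
    q = set(question.lower().split())
    best_i, best_score = -1, 0.0
    for i, line in enumerate(lines):
        toks = set(line.lower().split())
        score = len(q & toks) / len(q) if q else 0.0  # crude token overlap
        if score > best_score:
            best_i, best_score = i, score
    if best_score < min_score:
        return None  # fail closed: no answer beats a low-confidence answer
    return {"text": lines[best_i], "cite": f"L{best_i + 1}", "score": best_score}
```

Returning `None` rather than the best weak match is the whole point: downstream code must handle the refusal path explicitly, which is what makes the citations trustworthy.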
Library to convert AI evaluation results to OpenTelemetry GenAI semantic conventions for observability.
OpenTelemetry · Observability · Evaluation
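The mapping amounts to renaming fields onto standard attribute keys. A minimal sketch (the `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions; the input shape and the `evaluation.*` keys are assumptions for illustration, not a published standard):

```python
def eval_to_otel_attrs(result: dict) -> dict:
    """Flatten a hypothetical eval-result dict into OTel span attributes."""
    return {
        "gen_ai.system": result["provider"],              # e.g. "openai"
        "gen_ai.request.model": result["model"],
        "gen_ai.usage.input_tokens": result["input_tokens"],
        "gen_ai.usage.output_tokens": result["output_tokens"],
        # No standard attribute for eval scores yet, so use a custom namespace.
        "evaluation.metric": result["metric"],
        "evaluation.score": result["score"],
    }
```

Once results carry semconv attribute names, any OTel-aware backend can slice eval runs alongside regular traces with no custom dashboards.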
Notes on methodology, benchmarks, and design decisions.
Empirical comparison of OpenAI, Cohere, BGE, E5, and Instructor embeddings with cost and latency analysis.
Research · Embeddings · Benchmarking
Deep dive into RAG architectures: chunking strategies, retrieval methods, and production patterns.
Research · RAG · Architecture
Systematic experiments on sampling parameters, with empirical data on the creativity-versus-coherence trade-off.
Research · Prompts · Benchmarking
I occasionally consult on evaluations and safety reviews. Email is best for availability.