#benchmarking

3 posts filed under “benchmarking”

Oct 5, 2025

I Tested 5 Embedding Models on 10K Developer Questions

An empirical comparison of OpenAI, Cohere, BGE, E5, and Instructor embeddings on real developer documentation queries, with cost, latency, and accuracy analysis.

Jul 12, 2025

#ai-evaluation #benchmarking #llm

Building Better AI Evals: A Practical Guide to LLM Evaluation

How to create custom evaluations, model-graded assessments, and domain-specific benchmarks that actually predict real-world performance.

Jan 6, 2025

#ai #research #prompts

Prompt Engineering Science: I Tested Temperature and Top-P on 1000 Queries

Systematic experiments on the temperature and top-p sampling parameters across 1000 real queries, with empirical data on the trade-offs among creativity, coherence, and determinism.