#benchmarking
3 posts filed under “benchmarking”
I Tested 5 Embedding Models on 10K Developer Questions
Empirical comparison of OpenAI, Cohere, BGE, E5, and Instructor embeddings on real developer documentation queries with cost, latency, and accuracy analysis.
Building Better AI Evals: A Practical Guide to LLM Evaluation
How to create custom evaluations, model-graded assessments, and domain-specific benchmarks that actually predict real-world performance.
Prompt Engineering Science: I Tested Temperature and Top-P on 1000 Queries
Systematic experiments on temperature and top-p sampling parameters across 1000 real queries with empirical data on creativity, coherence, and determinism trade-offs.