#benchmarking

3 posts filed under “benchmarking”

I Tested 5 Embedding Models on 10K Developer Questions

Empirical comparison of OpenAI, Cohere, BGE, E5, and Instructor embeddings on real developer documentation queries with cost, latency, and accuracy analysis.

Building Better AI Evals: A Practical Guide to LLM Evaluation

How to create custom evaluations, model-graded assessments, and domain-specific benchmarks that actually predict real-world performance.

Prompt Engineering Science: I Tested Temperature and Top-P on 1000 Queries

Systematic experiments on temperature and top-p sampling parameters across 1000 real queries, with empirical data on the trade-offs among creativity, coherence, and determinism.