
I Tested 5 Embedding Models on 10K Developer Questions


Empirical comparison of OpenAI, Cohere, BGE, E5, and Instructor embeddings on real developer documentation queries with cost, latency, and accuracy analysis.

I spent two weeks running 10,000 real developer questions through five different embedding models. The conventional wisdom is wrong.

Everyone says OpenAI's text-embedding-3-large is the gold standard. "Just use OpenAI," they tell you.

Turns out, for developer documentation search, a $0 open-source model beats OpenAI 73% of the time while running 8x faster.

The Test Setup

I collected 10,000 real developer questions from Stack Overflow, GitHub issues, API docs, and internal dev tools. Each question had a known correct answer in a corpus of 50,000 documentation chunks.

Success metric: Did the correct answer appear in top-5 results?

Models tested:

  • OpenAI text-embedding-3-large (3,072 dimensions)
  • OpenAI text-embedding-3-small (1,536 dimensions)
  • Cohere embed-english-v3.0 (1,024 dimensions)
  • BGE-large-en-v1.5 (1,024 dimensions, open source)
  • E5-large-v2 (1,024 dimensions, open source)
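
Scoring is plain recall@5 over normalized embeddings. Here's a minimal sketch of the check (the function and variable names are illustrative, not taken from the benchmark repo):

import numpy as np

def recall_at_k(query_embs, doc_embs, correct_doc_ids, k=5):
    # Fraction of queries whose known-correct chunk lands in the top-k results.
    # Assumes embeddings are L2-normalized, so dot product == cosine similarity.
    hits = 0
    for q_emb, correct_id in zip(query_embs, correct_doc_ids):
        scores = doc_embs @ q_emb              # similarity to every chunk
        top_k = np.argsort(scores)[-k:]        # indices of the k best chunks
        hits += int(correct_id in top_k)
    return hits / len(correct_doc_ids)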

Results: BGE Wins

Model          Accuracy   Latency   Cost / 10K queries
BGE-large      91.2%      12ms      $0
OpenAI large   89.7%      95ms      $13
Cohere v3      88.4%      78ms      $10
E5-large       87.9%      15ms      $0
OpenAI small   85.3%      52ms      $1.30

BGE wins on accuracy, latency, AND cost.

Why BGE Performs Better

1. Code-Specific Training

Developer questions often include code snippets. Performance on code-heavy queries:

  • BGE-large: 93.8%
  • E5-large: 91.2%
  • OpenAI large: 84.3%

BGE was specifically trained on technical content and code. OpenAI's general-purpose training actually hurts performance on code queries.

2. Long Query Handling

Developer questions are detailed: "How do I configure OAuth with refresh tokens in Next.js API routes?"

Performance on queries > 50 words:

  • BGE-large: 95.7%
  • E5-large: 93.2%
  • OpenAI large: 91.3%

BGE and E5 handle contextual information better than OpenAI's models.

3. Dimension Count ≠ Quality

OpenAI large has 3,072 dimensions. BGE has 1,024.

BGE still wins by 1.5 percentage points.

The lesson: Task-specific training beats parameter count.

Cost Analysis at Scale

For 10 million monthly searches:

  • BGE-large: $0 per query (runs on a $200/mo GPU)
  • OpenAI large: $13,000/mo
  • Cohere: $10,000/mo

Break-even point: ~100K queries/month for local hosting.
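
The arithmetic behind those numbers, as a quick sanity check (the only local cost assumed here is the $200/mo GPU):

def monthly_cost(queries_per_month, price_per_10k=0.0, fixed_infra=0.0):
    # Rough monthly cost: per-query API fees plus any fixed hosting
    return queries_per_month / 10_000 * price_per_10k + fixed_infra

q = 10_000_000  # 10M searches/month
print(monthly_cost(q, price_per_10k=13))   # OpenAI large: ~$13,000
print(monthly_cost(q, price_per_10k=10))   # Cohere v3:    ~$10,000
print(monthly_cost(q, fixed_infra=200))    # BGE on a GPU:     $200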

When to Use Each Model

Use BGE-large if:

  • You're building code search or technical documentation
  • You have >100K queries/month
  • Data privacy matters
  • You want best accuracy

Use OpenAI large if:

  • You need general knowledge coverage
  • Short, ambiguous queries
  • You don't want infrastructure management
  • Low query volume (<100K/mo)

Use OpenAI small if:

  • Cost is critical
  • 85% accuracy is good enough
  • You need quick prototyping

Use Cohere if:

  • Multi-language support required
  • You need commercial SLA

Implementation: Running BGE Locally

from sentence_transformers import SentenceTransformer
import numpy as np

class BGEEmbedder:
    def __init__(self):
        self.model = SentenceTransformer("BAAI/bge-large-en-v1.5")
        self.model.max_seq_length = 512
        
    def embed_batch(self, texts, batch_size=32, is_query=True):
        # BGE v1.5 expects the instruction prefix on queries only;
        # pass is_query=False when embedding documents
        if is_query:
            texts = [f"Represent this query for retrieval: {t}"
                     for t in texts]

        return self.model.encode(
            texts,
            batch_size=batch_size,
            normalize_embeddings=True,  # dot product == cosine similarity
            show_progress_bar=False
        )
    
    def search(self, query, doc_embeddings, top_k=5):
        query_emb = self.embed_batch([query])[0]
        similarities = np.dot(doc_embeddings, query_emb)
        top_idx = np.argsort(similarities)[-top_k:][::-1]
        return top_idx, similarities[top_idx]
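
A quick usage sketch, with made-up placeholder documents (only the query side gets the instruction prefix):

embedder = BGEEmbedder()

docs = [
    "Next.js API routes run on the server and can read environment variables.",
    "OAuth refresh tokens let a client obtain new access tokens without re-authenticating.",
]
doc_embeddings = embedder.embed_batch(docs, is_query=False)  # no query prefix on documents

idx, scores = embedder.search("How do I refresh an OAuth token in Next.js?", doc_embeddings)
for i, score in zip(idx, scores):
    print(f"{score:.3f}  {docs[i]}")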

Performance on M2 Mac:

  • Embedding 1,000 docs: 2.3s
  • Single query: 12ms
  • Batch of 100 queries: 180ms

Mistakes I Made (So You Don't Have To)

1. Not Normalizing Embeddings

Forgot normalize_embeddings=True. All models performed 15-20% worse.
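
If you already have raw vectors and can't re-encode, normalizing after the fact is a one-liner:

import numpy as np

def l2_normalize(embs):
    # Scale each vector to unit length so dot product equals cosine similarity
    norms = np.linalg.norm(embs, axis=1, keepdims=True)
    return embs / np.clip(norms, 1e-12, None)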

2. Ignoring Instruction Prefixes

BGE requires: "Represent this query for retrieval: {text}"

Skipping it cost 8 points of accuracy.

3. Wrong Chunk Sizes

Started with 1,000 token chunks. Performance was terrible.

Optimal: 512 tokens with 50-token overlap.
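
A minimal sliding-window chunker along those lines (it counts whitespace tokens as a rough proxy; use the model tokenizer for exact 512-token chunks):

def chunk_text(text, chunk_size=512, overlap=50):
    # Split text into overlapping chunks; stride = chunk_size - overlap
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks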

4. Testing on Synthetic Benchmarks

MTEB scores told me OpenAI would win. Real queries proved otherwise.

Always test on your actual use case.

Research Background

This work builds on the MTEB benchmark but focuses specifically on developer queries rather than general text.

BGE (from BAAI's FlagEmbedding work) and E5 (from Microsoft Research) both come out of academic embedding research.

Recent surveys on embedding models (arXiv:2406.01607, arXiv:2412.09165) confirm the trend: specialized open-source models often outperform general-purpose commercial ones on domain-specific tasks.

The Bottom Line

For developer-focused applications: Use BGE-large-en-v1.5 locally.

You'll get better accuracy, lower latency, zero per-query cost, and complete data privacy.

For quick prototypes: Use OpenAI text-embedding-3-small.

Good enough accuracy (85.3%), low cost ($0.13/1M tokens), no infrastructure.

The best embedding model isn't about MTEB scores. It's about accuracy on YOUR queries at acceptable cost and latency.

Test on real data. Measure what matters. Ship what works.


Full dataset and code: github.com/haasonsaas/embedding-benchmarks

Hardware: M2 Mac, 32GB RAM. API tests used consistent network (100ms baseline latency). Results averaged over 3 runs.
