I spent last week building an AI that writes exactly like me. Not "kind of" like me—exactly like me. Down to the contractions, the contrarian takes, and my pathological inability to use hedge words.
The Jonathan Voice Engine analyzes 50+ blog posts and generates responses with 70-80% authenticity scores. It took me one weekend to build.
Here's the kicker: I built it with messy markdown files, basic regex patterns, and zero training data. While you're waiting for the perfect dataset, I'm shipping working code.
The Perfect Data Trap (And Why You're Stuck In It)
Many years ago at a tiny startup, we burned through $200K and six months because the CTO read too many ML papers. "We need clean data," he said. "We need balanced demographics."
We needed revenue. We got bankruptcy.
While we were jerking off to data quality metrics:
- Competitor A shipped with 100 crappy recordings
- Competitor B used their founders' podcast transcripts
- Competitor C literally used YouTube auto-captions
All three are still in business. We're not.
Here's the thing most people miss: Your users don't give a shit about your F1 scores. They care about whether your product works well enough to solve their problem. And "well enough" is way lower than you think.
What Actually Matters in Voice Profile Extraction
Everyone thinks voice profile extraction is about sophisticated NLP models and transformer architectures. That's academic thinking. Here's what actually matters:
1. Consistent Patterns Beat Perfect Accuracy
My voice profiler tracks simple patterns:
- Contraction frequency (I use them constantly)
- Sentence and paragraph length (short, punchy sentences; 2-4 sentence paragraphs)
- Rhetorical questions (transition device)
- Active voice ratio (>90%)
- Signature phrases ("Here's the thing most people miss...")
That's it. No BERT. No transformers. Just pattern matching that works.
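To make "pattern matching" concrete, here's a minimal sketch of what counting those patterns can look like. The helper name and regexes are illustrative assumptions, not the engine's actual internals; the point is that plain regex plus arithmetic gets you these numbers.

// Sketch: counting surface patterns with regex (illustrative, not the engine's real code)
function countPatterns(post: string) {
  const words = post.split(/\s+/).filter(Boolean)
  const sentences = post.split(/[.!?]+/).filter((s) => s.trim().length > 0)
  const contractions = (post.match(/\b\w+'(s|t|re|ve|ll|d|m)\b/gi) ?? []).length
  const questions = (post.match(/\?/g) ?? []).length

  // Assumes non-empty input; no guards, because it's a sketch
  return {
    contractionsPer100Words: (contractions / words.length) * 100,
    avgSentenceLength: words.length / sentences.length,
    rhetoricalQuestionRate: questions / sentences.length,
  }
}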
2. Domain-Specific Markers Trump Generic Features
Generic voice analysis looks for things like "formality level" and "sentiment." Useless.
My system looks for:
- Contrarian indicators ("conventional wisdom is wrong")
- Specific framework references (startup bargain, strategic quality)
- Industry context markers (security, startups, AI)
- Experience-based examples ("At one startup I advised...", "In my experience...")
These domain markers are 10x more valuable than generic linguistic features.
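A domain-marker pass is even simpler: phrase lists you scan for. The lists below are examples I'm assuming for illustration, not the engine's actual lists.

// Sketch: domain-specific markers are just phrase lists you scan for (lists are illustrative)
const CONTRARIAN_MARKERS = ["conventional wisdom is wrong", "everyone thinks", "that's academic thinking"]
const FRAMEWORK_MARKERS = ["startup bargain", "strategic quality"]
const EXPERIENCE_MARKERS = ["at one startup i advised", "in my experience"]

function countMarkers(post: string, markers: string[]): number {
  const text = post.toLowerCase()
  return markers.reduce((hits, marker) => hits + (text.includes(marker) ? 1 : 0), 0)
}

function domainMarkers(post: string) {
  return {
    contrarian: countMarkers(post, CONTRARIAN_MARKERS),
    frameworks: countMarkers(post, FRAMEWORK_MARKERS),
    experience: countMarkers(post, EXPERIENCE_MARKERS),
  }
}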
3. Fast Iteration Beats Slow Perfection
My development cycle:
- Monday: Basic regex extraction (2 hours)
- Tuesday: Statistical analysis layer (4 hours)
- Wednesday: Validation scoring system (3 hours)
- Thursday: Integration with Claude API (2 hours)
- Friday: Testing and refinement (all day)
Total: one week to a working system.
The Architecture Nobody Tells You About
Here's the actual code structure that powers my voice engine:
// Core extraction pipeline
class VoiceProfileExtractor {
  extract(posts: string[]): VoiceProfile {
    return {
      tone: this.extractToneMarkers(posts),
      style: this.extractStylePatterns(posts),
      perspectives: this.extractBeliefs(posts),
      frameworks: this.extractFrameworks(posts),
      phrases: this.extractSignaturePhrases(posts),
    }
  }

  // The magic: simple pattern matching
  extractToneMarkers(posts: string[]) {
    return {
      directness: this.measureDirectness(posts),   // No hedge words
      contrarian: this.measureContrarian(posts),   // Challenge patterns
      empathy: this.measureEmpathy(posts),         // "I understand" patterns
      pragmatism: this.measurePragmatism(posts),   // "What works" focus
    }
  }
}
Notice what's missing? Machine learning. Deep learning. Any learning at all.
It's just measuring what's already there.
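To show what "measuring what's already there" looks like in practice, here's one way a measure* method could work. A hedge-word counter is my assumed implementation for this sketch, not necessarily what the real measureDirectness does.

// Sketch: directness as the absence of hedge words (assumed implementation, not the engine's actual one)
const HEDGE_WORDS = ["maybe", "perhaps", "possibly", "somewhat", "arguably", "might", "could be"]

function measureDirectness(posts: string[]): number {
  const text = posts.join(" ").toLowerCase()
  const words = text.split(/\s+/).filter(Boolean)
  const hedges = HEDGE_WORDS.reduce(
    (count, hedge) => count + (text.match(new RegExp(`\\b${hedge}\\b`, "g")) ?? []).length,
    0,
  )
  // Fewer hedges per 100 words means higher directness, clamped to [0, 1]
  return Math.max(0, 1 - (hedges / words.length) * 100 * 0.1)
}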
Building Your Own: The Non-Obvious Steps
Want to build your own voice profiler? Here's what actually works:
Step 1: Start With Your Worst Data
Don't clean your data. Don't normalize it. Use it raw. Why? Because production data will be messy too. If your system can't handle your worst data, it's useless.
Step 2: Extract Observable Patterns First
Before you think about AI:
- Count things (words, sentences, paragraphs)
- Find patterns (phrases, structures, transitions)
- Measure ratios (active/passive, short/long, direct/hedged)
You'll be shocked how far basic counting gets you.
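Even the "ratios" bucket is a couple of regexes. The passive-voice heuristic below is a crude assumption that misses plenty of cases, but it's enough to track a trend.

// Sketch: crude ratios from regex alone (the passive-voice heuristic is an assumption, far from perfect)
function measureRatios(post: string) {
  const sentences = post.split(/[.!?]+/).map((s) => s.trim()).filter(Boolean)
  const shortSentences = sentences.filter((s) => s.split(/\s+/).length <= 12).length
  // Rough passive marker: a form of "to be" followed by a word ending in -ed/-en
  const passiveHits = (post.match(/\b(is|are|was|were|been|being|be)\s+\w+(ed|en)\b/gi) ?? []).length

  return {
    shortSentenceRatio: shortSentences / sentences.length,
    passivePer100Sentences: (passiveHits / sentences.length) * 100,
  }
}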
Step 3: Build Validation Before Accuracy
Most people build a model then try to validate it. Backwards.
Build your validation system first:
- Define what "sounds right" means quantitatively
- Create scoring rubrics for each dimension
- Test manually on 10-20 examples
- THEN build the extraction system to hit those targets
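In practice that can be as dumb as a hard-coded target spec you check your 10-20 manual examples against before writing any extraction code. The numbers below are placeholders I'm assuming for illustration, not production targets.

// Sketch: define "sounds right" as numeric targets first (placeholder values, not production targets)
interface VoiceTargets {
  contractionsPer100Words: [number, number]  // acceptable range
  avgSentenceLength: [number, number]        // acceptable range, in words
  activeVoiceRatio: number                   // minimum
}

const TARGETS: VoiceTargets = {
  contractionsPer100Words: [2, 6],
  avgSentenceLength: [8, 16],
  activeVoiceRatio: 0.9,
}

function withinTargets(sample: { contractionsPer100Words: number; avgSentenceLength: number; activeVoiceRatio: number }): boolean {
  const inRange = ([lo, hi]: [number, number], value: number) => value >= lo && value <= hi
  return (
    inRange(TARGETS.contractionsPer100Words, sample.contractionsPer100Words) &&
    inRange(TARGETS.avgSentenceLength, sample.avgSentenceLength) &&
    sample.activeVoiceRatio >= TARGETS.activeVoiceRatio
  )
}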
Step 4: Ship at 60% Accuracy
My voice engine shipped at 60% accuracy. Now it's at 80%.
Those 20 percentage points came from:
- Real usage data
- User feedback
- Iterative improvements
- Parameter tuning based on results
You can't get from 60% to 80% in development. You can only get there in production.
The Uncomfortable Truth About Voice AI
Here's what nobody wants to admit: Most voice profile extraction is solving the wrong problem.
You don't need to perfectly replicate someone's voice. You need to:
- Capture their key perspectives
- Maintain consistent tone
- Apply their frameworks
- Sound authentic enough to be useful
My AI doesn't write exactly like me. It writes like me on a good day, when I'm focused and articulate. That's actually more valuable than perfect replication.
Real Implementation Lessons
After building this system, here's what I learned:
1. Authenticity Scoring > Similarity Scoring
Don't measure how similar the output is to training data. Measure whether it feels authentic. My scoring system penalizes:
- Academic language (-10%)
- Hedge words (-15%)
- Missing contractions (-20%)
- Generic advice (-25%)
These penalties matter more than matching exact phrases.
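With those weights, the whole authenticity scorer fits in one short function. This sketch uses the penalties listed above; the detector regexes are assumptions, not the actual rules.

// Sketch: authenticity as a penalty-based score using the weights above (detector regexes are assumptions)
function scoreAuthenticity(text: string): number {
  let score = 1.0
  const lower = text.toLowerCase()

  const academic = /\b(furthermore|moreover|thus|it should be noted)\b/.test(lower)
  const hedged = /\b(maybe|perhaps|possibly|somewhat|arguably)\b/.test(lower)
  const hasContractions = /\b\w+'(s|t|re|ve|ll|d|m)\b/i.test(text)
  const generic = /\b(best practices|leverage synergies|it depends)\b/.test(lower)

  if (academic) score -= 0.10         // Academic language
  if (hedged) score -= 0.15           // Hedge words
  if (!hasContractions) score -= 0.20 // Missing contractions
  if (generic) score -= 0.25          // Generic advice
  return Math.max(0, score)
}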
2. Context Injection > Model Training
Instead of training models, inject context at generation time:
- Recent examples of target voice
- Specific frameworks to reference
- Domain-specific knowledge
- Signature phrases to use
This approach is 100x faster than model training and surprisingly effective.
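Context injection just means assembling the prompt at generation time. Here's a minimal sketch; the profile shape and prompt wording are assumptions, and the returned string is what gets passed as the system prompt on the generation call.

// Sketch: inject voice context into the prompt at generation time (profile shape and wording are assumptions)
interface VoiceContext {
  recentExamples: string[]   // recent posts in the target voice
  frameworks: string[]       // frameworks to reference
  domainNotes: string[]      // domain-specific knowledge
  signaturePhrases: string[] // phrases to work in
}

function buildSystemPrompt(ctx: VoiceContext, topic: string): string {
  return [
    `Write about "${topic}" in this author's voice.`,
    `Reference these frameworks where relevant: ${ctx.frameworks.join(", ")}.`,
    `Domain context: ${ctx.domainNotes.join("; ")}.`,
    `Work in signature phrases naturally: ${ctx.signaturePhrases.join(" | ")}.`,
    `Recent writing samples:\n${ctx.recentExamples.join("\n---\n")}`,
  ].join("\n\n")
}
// No training step anywhere: the voice lives in the context, not in the weights.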
3. Human Validation > Automated Metrics
My best accuracy improvements came from:
- Reading output and marking what felt wrong
- Adjusting weights based on intuition
- Testing edge cases manually
- Getting feedback from blog readers
Fancy metrics didn't help. Human judgment did.
Ship Your Shitty V1 (Before Someone Else Does)
You know what's worse than shipping bad AI? Not shipping at all.
I've watched dozens of teams die waiting for perfect voice data. Meanwhile, some kid with a laptop and ChatGPT is eating their lunch. Because here's the truth: The market rewards speed, not perfection.
My voice engine shipped with:
- 60% accuracy
- Obvious failure modes
- Zero edge case handling
- Embarrassing bugs
Now it powers this entire blog's AI content. Because I fixed it in production, based on real usage, with actual feedback.
Stop optimizing for your ego. Start optimizing for learning speed.
Ship your shitty v1. Fix it live. Beat the perfectionists to market.
Technical Note: Want to validate your own content for authenticity? The Jonathan Voice Engine can analyze any text:
echo "Your text here" | bun scripts/jonathan-voice.ts validate
It'll score your content across multiple dimensions and tell you exactly what's missing. Because here's the truth: measuring authenticity is more valuable than generating it.