Reading list
The books, papers, and essays shaping my approach to evaluation, safety, and shipping resilient software.
Core texts
- "Testing Power" by Saša Jurić — staying honest about what your tests prove.
- "Site Reliability Engineering" by Google — operational rigor as applied research.
- "The Alignment Problem" by Brian Christian — understanding failure modes.
Research papers
- "Measuring Massive Multitask Language Understanding" — benchmark design lessons.
- "Red Teaming Language Models with Language Models" — the adversarial mindset.
- "Evaluating Large Language Models Trained on Code" — grounding metrics in user tasks.
Longform essays
- Charity Majors on observability — instrument first, reason second.
- Dan McKinley's "Choose Boring Technology" — keep fundamentals boring so evals shine.
- Ben Kuhn's "The Elegance of the Last Mile" — focus on finish quality.