Session 01 · 5/5 · jake + tanner

Two-voice interview

turning timestamped transcripts into a single mp3 · elevenlabs v4 expressive

A home for the things Jake and Tanner build together during their 1:1 sessions: half learning, half real engineering. The first project is a way to turn timestamped two-speaker transcripts into a single MP3 that sounds like the conversation actually happened.

Different voices for each speaker. Natural pacing. Audio tags like [laughs] and [whispers] rendered properly. Built collaboratively in Claude Code, with the entire dev session, including all the wrong turns and tool switches, preserved in the repo for you to read through.
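To make the input side concrete, here is a minimal sketch of parsing a timestamped two-speaker transcript into turns. The line shape `[MM:SS] Speaker: text` is an assumption for illustration, not the format used in the repo, and the names `Turn` and `parse_transcript` are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed line shape: "[MM:SS] Speaker: text" (hypothetical, not taken from the repo).
LINE_RE = re.compile(r"^\[(\d{1,2}):(\d{2})\]\s+([^:]+):\s*(.+)$")
# Audio tags are treated as lowercase words in square brackets, e.g. [laughs].
TAG_RE = re.compile(r"\[([a-z]+)\]")


@dataclass
class Turn:
    start_seconds: int  # offset of the turn from the start of the recording
    speaker: str
    text: str           # spoken text, audio tags left in place
    tags: list          # audio tags found in the text, in order


def parse_transcript(raw: str) -> list:
    """Turn raw transcript text into a list of Turn records."""
    turns = []
    for line in raw.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip blank or malformed lines
        minutes, seconds, speaker, text = match.groups()
        turns.append(
            Turn(
                start_seconds=int(minutes) * 60 + int(seconds),
                speaker=speaker.strip(),
                text=text.strip(),
                tags=TAG_RE.findall(text),
            )
        )
    return turns


sample = "[00:05] Jake: So what broke? [laughs]\n\n[01:12] Tanner: [whispers] Everything."
turns = parse_transcript(sample)
```

The tags stay inside `text` so they can be passed through to the speech model unchanged. The separate `tags` list is only there for inspection.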

Below: four sample outputs from the same transcript, generated with different mode + model combinations. Listen and compare.

Listen for yourself

four · samples · same · transcript
A

Dialogue + v3

The whole conversation is generated as one performance, with real turn-taking and prosody matched across speakers. Useful when reactive moments matter: laugh-on-laugh, interruptions, tonal shifts.

Mode: dialogue
Model: eleven_v3
Duration: 2:32
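As a rough sketch of what dialogue mode means in practice, every turn goes into a single request body so the model sees the whole conversation at once. The field names `inputs`, `text`, `voice_id`, and `model_id` follow our understanding of the ElevenLabs text-to-dialogue request shape and should be checked against the current API reference. The voice IDs and the helper name `build_dialogue_payload` are placeholders, and nothing is sent over the network here.

```python
def build_dialogue_payload(turns, voice_ids, model_id="eleven_v3"):
    """Build one request body that covers every turn of the conversation.

    turns     -- list of (speaker, text) pairs in spoken order
    voice_ids -- mapping from speaker name to a voice ID (placeholders below)
    """
    return {
        "model_id": model_id,
        "inputs": [
            {"text": text, "voice_id": voice_ids[speaker]}
            for speaker, text in turns
        ],
    }


dialogue_turns = [
    ("Jake", "So what broke? [laughs]"),
    ("Tanner", "[whispers] Everything."),
]
# Hypothetical voice IDs, for illustration only.
dialogue_voices = {"Jake": "VOICE_ID_JAKE", "Tanner": "VOICE_ID_TANNER"}
dialogue_payload = build_dialogue_payload(dialogue_turns, dialogue_voices)
```

Because the request is a single unit, the order of `inputs` is the only thing that carries turn-taking.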
D

Dialogue + v4 — experimental

Dialogue endpoint running on v4 instead of v3. Same cross-speaker awareness as A, with v4's tighter pacing. Notably faster: 1:20 against A's 2:32, a little over half the duration. Lands somewhere between "confident" and "rushed" depending on your taste; included for completeness.

Mode: dialogue
Model: eleven_v4
Duration: 1:20
B

Segment + v3 — baseline

The original v3 baseline: each line is synthesized on its own and the clips are stitched together. Included as a comparison reference, since it is the version we started with before the v4 upgrade.

Mode: segment
Model: eleven_v3
Duration: 2:28
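Segment mode can be sketched as a loop: synthesize each line separately, then join the clips in order. The `synthesize` argument is a hypothetical stand-in for whatever text-to-speech call the pipeline makes. Joining raw MP3 bytes is a simplification that assumes every clip shares the same encoding settings. A real pipeline may well re-encode through a tool such as ffmpeg and insert pauses derived from the timestamps.

```python
def stitch_segments(turns, voice_ids, synthesize):
    """Synthesize each turn on its own and join the clips in spoken order.

    turns      -- list of (speaker, text) pairs in spoken order
    voice_ids  -- mapping from speaker name to a voice ID
    synthesize -- callable (text, voice_id) -> bytes; hypothetical stand-in
                  for the real text-to-speech request
    """
    clips = []
    for speaker, text in turns:
        clips.append(synthesize(text, voice_ids[speaker]))
    # Simplification: plain byte concatenation, no re-encoding or added silence.
    return b"".join(clips)


def fake_synthesize(text, voice_id):
    """Offline stub so the sketch runs without network access or API keys."""
    return f"<{voice_id}:{text}>".encode("utf-8")


segment_turns = [("Jake", "Hi."), ("Tanner", "Hey.")]
segment_voices = {"Jake": "J", "Tanner": "T"}
stitched = stitch_segments(segment_turns, segment_voices, fake_synthesize)
```

Each call sees only its own line, which is why this mode cannot match prosody across speakers the way the dialogue samples do.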

Read further

repo · receipts · references