Why burn captions in the browser
Most short-form social video lives or dies on the strength of its captions. Viewers scroll with the sound off; the words on screen are doing the work. The traditional path to animated, word-by-word captions is a desktop editor (Premiere, DaVinci, CapCut) or a paid SaaS that uploads your footage. For a 30-second TikTok or a 60-second Reel, neither is necessary - modern browsers can decode, composite, and re-encode video locally with a few hundred kilobytes of JavaScript.
This tool wraps that pipeline in a preset-driven UI: drop a video and a transcript, pick a look, render an MP4 you can post. The preview you see while scrubbing the player is exactly what gets burned into the file - same draw function, same fonts, same animation curves.
What's actually happening under the hood
When you drop a video, the browser hands the file to mediabunny, which demuxes the container. The live preview plays it through an ordinary <video> element and overlays a <canvas> driven by a requestAnimationFrame loop that reads video.currentTime, finds the active word, and draws the chosen typography + animation each frame.
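A minimal sketch of that loop, assuming a flat per-word transcript and a drawOverlay function (both names are illustrative here, not the tool's actual identifiers):

```ts
interface Word { text: string; start: number; end: number } // times in seconds

// Each animation frame: read the playhead, find the word whose window
// contains it, and draw that word's typography/animation on the canvas.
function startPreview(
  video: HTMLVideoElement,
  canvas: HTMLCanvasElement,
  words: Word[],
  drawOverlay: (ctx: CanvasRenderingContext2D, word: Word, t: number) => void,
) {
  const ctx = canvas.getContext('2d')!;
  function frame() {
    const t = video.currentTime;
    ctx.clearRect(0, 0, canvas.width, canvas.height);
    const active = words.find((w) => t >= w.start && t < w.end);
    if (active) drawOverlay(ctx, active, t);
    requestAnimationFrame(frame);
  }
  requestAnimationFrame(frame);
}
```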
When you click Render & download, mediabunny's Conversion API replays the video frame by frame through a process callback, decoding and re-encoding via WebCodecs. Each frame goes onto a working canvas, the same draw-overlay function lays the captions on top, and the result is fed back to the encoder. Audio passes through unchanged - no re-encoding - so the original audio quality is preserved.
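A sketch of that render path. The Input/Output/Conversion setup below follows mediabunny's documented Conversion API; the per-frame process option stands in for the callback described above, and its exact name and signature are assumptions, not confirmed API:

```ts
import {
  Input, Output, Conversion, BlobSource, BufferTarget,
  Mp4OutputFormat, ALL_FORMATS,
} from 'mediabunny';

async function renderWithCaptions(
  file: File,
  drawOverlay: (ctx: OffscreenCanvasRenderingContext2D, t: number) => void,
): Promise<Blob> {
  const canvas = new OffscreenCanvas(1080, 1920);
  const ctx = canvas.getContext('2d')!;

  const output = new Output({ format: new Mp4OutputFormat(), target: new BufferTarget() });
  const conversion = await Conversion.init({
    input: new Input({ source: new BlobSource(file), formats: ALL_FORMATS }),
    output,
    video: {
      // Assumed callback shape: receive each decoded frame + timestamp,
      // composite the captions, hand the canvas back to the encoder.
      process: (frame: VideoFrame, timestamp: number) => {
        ctx.drawImage(frame, 0, 0, canvas.width, canvas.height);
        drawOverlay(ctx, timestamp); // same function the preview uses
        return canvas;
      },
    },
    // audio options omitted: the track is copied through, not re-encoded
  });

  await conversion.execute();
  return new Blob([output.target.buffer!], { type: 'video/mp4' }); // ready to download
}
```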
The fonts are loaded once at module init via the FontFace API, fetched from the same Google Fonts CDN the rest of the web uses. Pacifico, Anton, Bebas Neue and friends are deliberately bog-standard choices, unlikely to surprise the eye.
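Loading a face up front means the canvas never draws with a fallback font mid-render. A sketch - the woff2 URL is illustrative; in practice Google's CSS endpoint serves versioned file URLs:

```ts
// Load once at module init, then register with the document so both the
// preview canvas and the render canvas can use it by family name.
const pacifico = new FontFace(
  'Pacifico',
  'url(https://fonts.gstatic.com/s/pacifico/v22/FwZY7-Qmy14u9lezJ96A4sijpFu_.woff2)', // illustrative URL
);
await pacifico.load();
document.fonts.add(pacifico); // now usable as ctx.font = '64px Pacifico'
```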
Pairing with the Subtitle Generator
The companion Subtitle generator tool produces exactly the JSON shape this tool expects - per-word start/end plus the text. If you transcribe a video there, then open this tool and drop the same video in, the per-word data loads from a local IndexedDB cache automatically. No second upload of the JSON needed.
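A hypothetical sketch of that lookup, reusing the Word shape from the preview sketch - keying by a content hash of the video bytes means the same file maps to the same cached transcript regardless of filename (the database and store names here are guesses):

```ts
// Hash the file's contents, then look the transcript up in IndexedDB.
async function lookupTranscript(file: File): Promise<Word[] | undefined> {
  const digest = await crypto.subtle.digest('SHA-256', await file.arrayBuffer());
  const key = [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');

  const db = await new Promise<IDBDatabase>((resolve, reject) => {
    const req = indexedDB.open('transcripts'); // name is a guess
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });

  return new Promise((resolve, reject) => {
    const req = db.transaction('words').objectStore('words').get(key);
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}
```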
If you already have a transcript from somewhere else (the OpenAI Whisper CLI, faster-whisper, WhisperX, AssemblyAI's word-level output), drop that JSON file on the page and it'll be normalised to the same shape on load. The accepted formats are listed in the prompt that appears once a video is in.
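Normalisation is mostly about flattening each format's nesting into that flat word list. A sketch for two common shapes - the field names are from memory of Whisper's verbose output and AssemblyAI's word-level output, so treat them as assumptions:

```ts
// Flatten two common transcript shapes into the per-word format above.
function normalise(json: any): Word[] {
  if (Array.isArray(json.segments)) {
    // Whisper-style: segments, each carrying a words array (seconds)
    return json.segments.flatMap((s: any) =>
      (s.words ?? []).map((w: any) => ({ text: w.word.trim(), start: w.start, end: w.end })),
    );
  }
  if (Array.isArray(json.words)) {
    // AssemblyAI-style: flat word list with millisecond timestamps
    return json.words.map((w: any) => ({ text: w.text, start: w.start / 1000, end: w.end / 1000 }));
  }
  throw new Error('Unrecognised transcript format');
}
```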
When to reach for something else
If you need diarisation across multiple speakers, scripted multilingual captions with manual timing tweaks, or captions for a 30-minute talk, this isn't the right tool - reach for a desktop editor or a server-side workflow. For short-form social where the captions are the point, browser-local rendering is fast enough, private by default, and free.