Upload a podcast audio file and get back a fully animated 2D video. The workflow transcribes the audio, figures out who's speaking, builds an edit timeline, assembles the shots, and adds lip sync. You don't touch the timeline.
What You'll Build
6 animation loops (3 per character): wide, closeup, and laughing
Auto-transcription with speaker diarization and AI-corrected labels
AI-generated cut list assembled into a continuous timeline
Lip sync on the final video so the mouths match the audio
Running the Workflow
Open the workflow from the public gallery and click Run Flow, or open it in the node editor and hit Run in the top bar. Either way it's the same: upload your audio file and let it run.
Start with 30 seconds
This workflow doesn't trim your audio. It costs around $0.05 per second (about $1.50 for a 30-second clip, but roughly $180 for an hour-long episode), so clip your podcast down to 30 seconds before your first run. Once everything looks right, go longer.
When it finishes you get a fully edited animated podcast: the right character on screen at the right time, with your original audio and lip sync applied.
How to Build It
Four stages: make the animation loops, transcribe and label the audio, generate the edit timeline, then add lip sync.
Step 1: Create the Animation Loops
Before anything else, you need a set of base animation loops for each character. These are generic clips that will repeat throughout the video wherever that character is on screen. They don't match the audio yet — the lips won't move correctly until the lip sync step at the end. The goal here is just to have a looping animation of each character looking alive: talking, reacting, laughing.
Generate a closeup and wide shot image for each character. From those two source images you create three animation loops per character:
Wide
The default talking shot. Used for most of the conversation.
Closeup
Tighter on the face. Save it for punchlines and key moments.
Laughing
Used when the transcript picks up laughter. Keep it distinct from the wide shot so cuts aren't jarring.
Wide and closeup source images for each character — the base frames for all 6 animation loops
Generate each loop as a 6-second clip with Seedance 1.5, using the same image as both the start and end frame. Seedance is a good fit for 2D animation here: it's affordable, and with identical start and end frames the clip loops without a visible seam.
You end up with 6 clips total: host_wide, host_closeup, host_laugh, guest_wide, guest_closeup, and guest_laugh. These are the shots the timeline will cut between.
The full workflow: audio input, transcription, 6 animation loops, timeline assembly, and lip sync
Step 2: Transcribe and Label the Audio
A Transcribe Audio node runs Whisper with diarization, which gives you timestamped segments with speaker labels. The timestamps are accurate. The speaker labels often aren't. Diarization struggles with similar-sounding voices and overlapping speech, so Speaker 1 and Speaker 2 regularly get swapped.
To fix this, a Generate Text node passes the plain (non-timestamped) transcript to Gemini. Reading the conversation without timestamp noise, Gemini can figure out who's actually speaking from context. It returns a clean, correctly labeled speaker list.
Why two passes?
Whisper nails the timestamps but struggles with speaker ID. Gemini nails speaker ID but doesn't have timestamps. Run both and combine the outputs.
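The merge itself is simple: pair Whisper's timestamped segments with Gemini's corrected labels in order. A minimal sketch (function and field names are illustrative; in the workflow this happens inside a node rather than in code you write):

```python
def merge_labels(whisper_segments, corrected_speakers):
    """Pair Whisper's timestamped segments with Gemini's
    corrected speaker labels, matched by segment order."""
    merged = []
    for seg, speaker in zip(whisper_segments, corrected_speakers):
        merged.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"],
            "speaker": speaker,  # corrected label, not Whisper's guess
        })
    return merged

# Toy example: two segments, labels corrected by the second pass.
segments = [
    {"start": 0.0, "end": 3.2, "text": "Welcome back to the show."},
    {"start": 3.2, "end": 5.8, "text": "Thanks for having me!"},
]
labels = ["host", "guest"]
print(merge_labels(segments, labels)[0]["speaker"])  # host
```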
The transcription group: Whisper diarization feeds into Gemini for accurate speaker labeling
Step 3: Generate the Edit Timeline
A second Generate Text node takes the timestamped diarization and the corrected speaker labels and merges them into a JSON shot list. This is where the actual editing decisions get made. Here's the system prompt:
Analyze the provided podcast transcript and diarization timestamps to generate a continuous JSON timeline shot list. Correct speaker misidentification errors using the non-timestamped transcript. The timeline must be completely continuous with no gaps. If there are periods of silence, snap timestamps to the midpoint or extend the current shot.
Use all available shot IDs (host_wide, host_closeup, host_laugh, guest_wide, guest_closeup, guest_laugh) at least once. Show the relevant talking person. Alternate between wide and closeup shots, reserving closeups for dramatic moments or punchlines. When a character laughs, use their laugh shot, but never cut directly between a wide and laugh shot — they share the same starting frames and will create a jarring jump cut.
If both characters are laughing or talking together, prioritize showing the one who appears less often.
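The continuity rule in the prompt above is checkable: each shot must start exactly where the previous one ends. Here's a hypothetical sample of the output shape (field names follow the article: shot_id, dialogue, start/end) with that check applied:

```python
import json

# Hypothetical sample of the Gemini edit pass output; the real
# shot list will be much longer.
timeline_json = """[
  {"shot_id": "host_wide", "dialogue": "Welcome back to the show.",
   "start": 0.0, "end": 3.2},
  {"shot_id": "guest_closeup", "dialogue": "Thanks for having me!",
   "start": 3.2, "end": 5.8}
]"""

shots = json.loads(timeline_json)

# Enforce the "completely continuous, no gaps" rule: every shot
# starts exactly where the previous one ends.
for prev, cur in zip(shots, shots[1:]):
    assert prev["end"] == cur["start"], "timeline has a gap"
```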
Each object in the output contains a shot_id, the dialogue, and start/end timestamps. Wire this JSON into the Timeline node along with all 6 animation clips. The Timeline node cuts between shots at the right timestamps, looping each clip as needed to fill its duration.
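Since every clip is a fixed 6 seconds, the looping the Timeline node does is just a ceiling division: repeat the clip enough times to cover the shot, trimming the last repeat at the cut point. A rough sketch of that arithmetic (assuming the node works this way; the exact behavior is internal to Sequencer):

```python
import math

CLIP_LENGTH = 6.0  # each Seedance loop is 6 seconds

def loops_needed(start: float, end: float) -> int:
    """How many repeats of a 6s loop cover a shot of this
    duration, with the final repeat trimmed at the cut."""
    return math.ceil((end - start) / CLIP_LENGTH)

print(loops_needed(3.2, 17.5))  # 3 repeats cover a 14.3s shot
```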
A Replace Audio node puts your original podcast audio back on the assembled video. At this point you have a complete animated edit. The lips just don't match yet.
Step 4: Add Lip Sync
Connect the assembled video to a Lip Sync node running PixVerse. It analyzes the audio and drives each character's mouth animation to match. That's the last step. Your animated podcast is done.
Workflow Summary
Character Art
Closeup + wide shot per character
Animation
6 loops via Seedance 1.5 (6s each)
Transcription
Whisper diarization + Gemini labels
Timeline
Gemini JSON edit → Timeline node
Lip Sync
PixVerse on final assembled video
The finished result: a fully animated and lip-synced podcast
Try It on Sequencer
Clone the workflow, drop in your characters, and run it with a short clip to start. Once it's working the way you want, scale up to the full episode.