AI video generators produce inconsistent voices. The same character can sound different in each clip, making multi-shot videos feel disjointed. This workflow solves that by replacing all dialogue with a single consistent voice while preserving the original background audio.
The Pipeline
The workflow uses seven nodes: video input, audio extraction, vocal separation, speech analysis, voice cloning, audio mixing, and audio replacement. The original background audio remains untouched. Only the voice is replaced.
Video → Extract → Separate → LLM → Clone → Mix → Replace

Workflow Preview
Step 1: Extract Audio
Add a Video node and load your source file. Connect it to an Extract Audio node to pull out the audio track.
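If you want to understand what the Extract Audio node is doing under the hood, the same step can be performed with ffmpeg. This is a sketch, not the node's actual implementation; the file names are placeholders, and it assumes ffmpeg is installed:

```python
import subprocess

def build_extract_cmd(video_path: str, audio_path: str) -> list:
    """Build an ffmpeg command that drops the video stream (-vn)
    and writes the audio track as uncompressed 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",          # -y: overwrite output without asking
        "-i", video_path,        # input video
        "-vn",                   # no video stream in the output
        "-acodec", "pcm_s16le",  # WAV is easiest for later processing
        audio_path,
    ]

cmd = build_extract_cmd("input.mp4", "audio.wav")
# subprocess.run(cmd, check=True)  # uncomment when ffmpeg is available
```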
Step 2: Separate Vocals
Add a Separate Audio node and connect it to your extracted audio. The node uses AI to split the audio into two streams: vocals and background. The vocals go to voice cloning while the background is preserved for the final mix.
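For intuition only: long before AI separation, a crude trick exploited the fact that vocals are usually center-panned in a stereo mix. The sketch below shows that mid/side split; the node's AI model is far more robust, and this is not what it actually runs:

```python
def mid_side(left, right):
    """Crude center-channel split: 'mid' = (L+R)/2 keeps center-panned
    content (often vocals); 'side' = (L-R)/2 keeps wide-panned content
    (often instruments and ambience)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

# Identical samples in both channels survive in 'mid' and cancel in 'side'.
mid, side = mid_side([0.5, 0.2], [0.5, -0.2])
```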
Step 3: Analyze Speech
Add a Text node and paste in this diarization prompt. Connect both the Text node and your original Video node to a Generate Text (LLM) node.
Diarization Prompt
You are a specialized Audio Diarization Engine for AI video processing.
Context: AI-generated videos often exhibit "voice drift," where a single character's voice changes pitch or tone between clips.
Your Goal: Construct a precise script JSON for voice normalization. You must identify unique characters and group their lines under consistent Speaker IDs (e.g., "Narrator", "Character_A"), ignoring inconsistent vocal artifacts.
Instructions:
1. Unify Characters: Group segments by Character Identity, not just acoustic similarity. If the context implies the same person is speaking despite a voice change, assign the same Speaker ID.
2. Segment Logic: Break dialogue into individual lines or logical phrases suitable for audio slicing.
3. Timestamp Inference: Precise start/end times are critical. If end times are missing, infer them based on the next segment's start.
4. Content Cleaning: Fix minor transcription typos but preserve the original meaning.
5. Strict Output: Return ONLY the JSON object based on the provided schema. No markdown, no conversation.
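The timestamp-inference rule above is the one clients most often need to replicate locally, since some model responses still omit end times. A minimal sketch of that fallback logic (segment dicts here are illustrative):

```python
def infer_end_times(segments, clip_duration):
    """Fill a missing end_time with the next segment's start_time;
    the final segment falls back to the clip duration."""
    for i, seg in enumerate(segments):
        if seg.get("end_time") is None:
            if i + 1 < len(segments):
                seg["end_time"] = segments[i + 1]["start_time"]
            else:
                seg["end_time"] = clip_duration
    return segments

segs = infer_end_times(
    [{"start_time": 0.0, "end_time": None},
     {"start_time": 2.5, "end_time": None}],
    clip_duration=5.0,
)
```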
Configure the LLM node with a structured output schema so it returns the correct format:
JSON Schema
{
  "type": "object",
  "properties": {
    "speakers": {
      "type": "array",
      "description": "A registry of all distinct characters identified in the video.",
      "items": {
        "type": "object",
        "properties": {
          "id": {
            "type": "string",
            "description": "Unique stable identifier (e.g., 'spk_1', 'spk_2')."
          },
          "name": {
            "type": "string",
            "description": "Display name based on context (e.g., 'Interviewer', 'Darth Vader')."
          },
          "voice_id": {
            "type": "string",
            "description": "Optional: Pre-assigned Voice ID if detected, otherwise null.",
            "nullable": true
          }
        },
        "required": ["id", "name"]
      }
    },
    "segments": {
      "type": "array",
      "description": "Chronological list of spoken lines.",
      "items": {
        "type": "object",
        "properties": {
          "speaker_id": {
            "type": "string",
            "description": "Must match one of the 'id' values defined in the speakers array."
          },
          "start_time": { "type": "number" },
          "end_time": { "type": "number" },
          "text": { "type": "string" }
        },
        "required": ["speaker_id", "start_time", "end_time", "text"]
      }
    }
  },
  "required": ["speakers", "segments"]
}
The LLM analyzes your video and outputs a JSON file containing speaker identities, dialogue text, and precise timestamps. This data tells the voice cloner exactly when to synthesize each line.
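LLM output can occasionally drift from the schema, so it is worth sanity-checking the JSON before it reaches the cloning step. A sketch of the checks that matter for this workflow (the sample object is illustrative, not real node output):

```python
def validate_script(script):
    """Sanity-check diarization JSON before voice cloning: every
    segment must reference a registered speaker, each segment must
    end after it starts, and segments must be in chronological order."""
    errors = []
    ids = {s["id"] for s in script["speakers"]}
    prev_start = float("-inf")
    for i, seg in enumerate(script["segments"]):
        if seg["speaker_id"] not in ids:
            errors.append(f"segment {i}: unknown speaker {seg['speaker_id']!r}")
        if seg["end_time"] <= seg["start_time"]:
            errors.append(f"segment {i}: end_time not after start_time")
        if seg["start_time"] < prev_start:
            errors.append(f"segment {i}: out of chronological order")
        prev_start = seg["start_time"]
    return errors

sample = {
    "speakers": [{"id": "spk_1", "name": "Narrator"}],
    "segments": [{"speaker_id": "spk_1", "start_time": 0.0,
                  "end_time": 2.1, "text": "Hello."}],
}
issues = validate_script(sample)
```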
Step 4: Clone the Voice
Add a Clone Voice node and connect two inputs: the speaker JSON from your LLM and the isolated vocals from Separate Audio. Then select your target voice model from the dropdown (Hume AI or ElevenLabs).
The node generates new speech using your chosen voice while matching the original timing exactly.
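"Matching the original timing" amounts to laying each synthesized clip onto a silent timeline at its segment's start_time. A sketch of that placement, assuming raw sample lists and a hypothetical 16 kHz sample rate:

```python
def place_segments(segments, total_samples, sr=16000):
    """Overlay synthesized clips onto a silent timeline so each line
    begins exactly at its start_time, sample-accurately."""
    timeline = [0.0] * total_samples
    for seg in segments:
        offset = int(seg["start_time"] * sr)
        for i, sample in enumerate(seg["samples"]):
            if offset + i < total_samples:  # clip anything past the end
                timeline[offset + i] = sample
    return timeline

# A two-sample clip placed 1 ms (16 samples at 16 kHz) into the timeline.
tl = place_segments(
    [{"start_time": 0.001, "samples": [0.5, 0.5]}],
    total_samples=32, sr=16000,
)
```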
Step 5: Mix Audio
Add a Mix Audio node. Connect the new AI vocals from Clone Voice to one input and the original background from Separate Audio to the other. Adjust the volume sliders to balance the tracks.
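Conceptually, the mix is a weighted sum of the two tracks with the result clipped to the valid sample range. A sketch of that operation (the gain values are illustrative defaults, not the node's):

```python
def mix(vocals, background, vocal_gain=1.0, bg_gain=0.8):
    """Weighted sum of two tracks, hard-clipped to [-1.0, 1.0] so the
    combined signal cannot exceed full scale."""
    return [
        max(-1.0, min(1.0, v * vocal_gain + b * bg_gain))
        for v, b in zip(vocals, background)
    ]

mixed = mix([0.5, -0.9], [0.5, -0.5])  # second sample clips at -1.0
```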
Step 6: Replace Audio
Add a Replace Audio node. Connect your original Video node and the mixed audio from the previous step. The output is your final video with the new consistent voice.
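As with extraction, the replacement step maps to a standard ffmpeg operation: mux the new audio against the untouched video stream. A sketch with placeholder file names, assuming ffmpeg is installed:

```python
import subprocess

def build_replace_cmd(video_path, audio_path, out_path):
    """Mux new audio over the original video without re-encoding
    the video stream (-c:v copy keeps it bit-identical)."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,   # input 0: original video
        "-i", audio_path,   # input 1: mixed audio
        "-map", "0:v",      # take video from input 0
        "-map", "1:a",      # take audio from input 1
        "-c:v", "copy",     # don't re-encode the video
        "-shortest",        # stop at the shorter of the two streams
        out_path,
    ]

cmd = build_replace_cmd("input.mp4", "mixed.wav", "final.mp4")
# subprocess.run(cmd, check=True)  # uncomment when ffmpeg is available
```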
Tips
Source audio quality matters. Videos with heavy background music or ambient noise will produce less clean vocal separation.
Be specific in your diarization prompt about the JSON format. The more structure you provide, the cleaner the output.
Save this workflow as a template once configured. You can reuse it for any clip by swapping the input video and selecting a different voice model.
Try It Now
Open the public workflow below and connect your video. Seven nodes give you consistent voices across all your AI-generated clips.