How to make your mascot talk: Lipsync tools compared
How to make your brand 15X more memorable by combining voice with characters.
Hey brand-builder!
Here’s a stat that should change how you think about branding:
Sound has 9X better odds of creating strong brand recall than your logo.
Characters have 6X better odds.
Yet only 5% of ads use distinctive brand sounds. Only 15% use characters. Meanwhile, 90%+ of ads show the logo.
Brands are massively over-investing in the weakest distinctive asset while ignoring the strongest ones.
A talking mascot combines two of the most powerful memory structures:
Visual character recognition (6X odds)
Sonic branding through consistent voice (9X odds)
Someone scrolling with sound on but not watching? Voice wins. Someone listening to a podcast ad? Voice wins. Someone half-watching while multitasking? Familiar character voice wins.
Most brands have mascots but treat them like fancy logos: static, voiceless, underutilized. The fix is simple: give them a voice. With generative AI, these characters can speak without expensive voice actors or VFX animation pipelines.
Design Your Mascot for Lipsync (Or Regret It Later)
Most lip-sync AI models are trained on human faces. If your mascot doesn’t have certain human-like features, lip-sync won’t work.
Design features you need:
Clear mouth/jaw area - The AI needs to see where the mouth is
Visible lips or mouth opening - A line for a mouth won’t cut it
Human-like facial proportions - Eyes, nose, mouth in roughly human positions
Defined facial structure - The AI needs reference points (cheekbones, jaw, teeth, forehead)
If you’re creating a mascot today, design it with lip-sync capability from day one. Adding it later is painful (or impossible).
The Models That Actually Work
There are three types of lip-sync models for mascots:
Image-to-Video (with custom audio) - Start with a still image, generate a talking video
Video-to-Video Motion Transfer - transfer motion to your mascot from existing footage (or film yourself)
Lip Syncing/Dubbing - Add voice to existing mascot videos
TYPE 1: IMAGE-TO-VIDEO MODELS (with Custom Audio)
VEED FABRIC - The Reliable Workhorse
What it is: Image + custom audio = talking mascot
Best for: Static shots where your mascot is talking without any big movements or complex scenes
How it works:
Upload your mascot image (neutral pose, mouth clearly visible)
Upload your custom audio (this can be AI-generated or recorded)
Get a talking mascot in ~60 seconds
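If you want to automate this, the whole workflow is one request. Here is a minimal Python sketch of the pattern; the endpoint, field names, and response shape are hypothetical placeholders, not VEED’s actual API, so check the current docs before relying on any of it:

```python
import requests

# Hypothetical endpoint, field names, and response shape -- illustration
# only, not VEED's actual API. Check the current docs before using.
API_URL = "https://api.example.com/v1/fabric/lipsync"

with open("mascot.png", "rb") as img, open("voice_line.mp3", "rb") as audio:
    response = requests.post(
        API_URL,
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"image": img, "audio": audio},  # neutral pose, mouth visible
        timeout=120,
    )

response.raise_for_status()
print(response.json()["video_url"])  # URL of the finished talking clip
```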
The good:
Most reliable for clean static shots
Natural mouth movement and gesturing driven by the audio sample
Fast generation
Simple workflow
The limitations:
No control over camera movement
Backgrounds usually stay static
Note on Hedra: Another image-to-video option exists (Hedra), but it’s significantly less reliable with non-human characters. Fabric wins for mascot work.
KLING AVATAR - For more complex movement
What it is: Image + custom audio + prompt guidance = controlled mascot movement
Best for: Any shot requiring specific movement, gestures or background.
How it works:
Upload your mascot image (mouth clearly visible)
Add your custom audio
Write prompts to control motion (“slight head tilt,” “dancing,” “walking toward camera,” “excited gestures”)
Get movement + lipsync together
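At the API level, the difference from Fabric boils down to one extra field: the motion prompt. Another hypothetical sketch; the endpoint and parameter names are placeholders, not Kling’s real API:

```python
import requests

# Hypothetical endpoint and parameters -- illustration only, not Kling's real API.
API_URL = "https://api.example.com/v1/avatar/generate"

with open("mascot.png", "rb") as img, open("voice_line.mp3", "rb") as audio:
    response = requests.post(
        API_URL,
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"image": img, "audio": audio},
        # The motion prompt is the extra control this model type gives you:
        data={"prompt": "slight head tilt, excited gestures, slow push-in"},
        timeout=300,
    )

response.raise_for_status()
print(response.json()["video_url"])  # assumed response shape
```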
The game-changer: This is the only image-to-video model covered here that lets you control motion and shot composition through prompts.
The good:
Full prompt control over movement (subtle or complex)
Can handle everything from gentle nods to full dancing
Dynamic camera work is possible
Natural integration of movement + lipsync
The limitations:
Requires character reference pack setup
More variables to control than Fabric
More complex workflow
TYPE 2: VIDEO-TO-VIDEO MOTION TRANSFER
MODELS: WAN ANIMATE / RUNWAY ACT 2 / KLING MOTION TRANSFER
What it is: Your filmed performance + mascot image = mascot doing your exact movements
Best for: Complex human gestures and detailed motion when you need specific movements
How it works:
Film yourself performing the exact motion/gesture you want (or use existing footage from movies), including the audio
Pose your mascot image to match the first frame of the reference video (so the model can map the motion onto the character) and upload it as the image reference (see the helper sketch after this list)
Model transfers your motion to your mascot + adds lipsync
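The fiddly part is step 2: matching first frames. A small helper using OpenCV pulls the first frame of your reference clip so you can pose the mascot image against it (filenames here are placeholders):

```python
import cv2  # pip install opencv-python

# Pull the first frame of your filmed reference so you can pose the
# mascot image against it (same framing, same starting position).
cap = cv2.VideoCapture("reference_performance.mp4")  # placeholder filename
ok, frame = cap.read()
cap.release()

if not ok:
    raise RuntimeError("Could not read the reference video")

cv2.imwrite("reference_first_frame.png", frame)
# Use this frame as the visual guide when creating the mascot image
# you upload as the reference.
```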
The good (when it works):
Most natural human-like movement possible
Perfect for specific complex gestures you can’t easily prompt
Captures nuance and timing from real performance
Great for detailed hand gestures, body language, and complex actions like dancing
The reality:
Very buggy across all providers
Hit or miss results
Can produce amazing outputs or complete failures
Inconsistent quality across generations
Requires multiple attempts to get usable results
Example: the “please just keep the tail” praying-hands gesture was easier to film than to prompt, but expect 5-10 generations before you get a keeper, plus some work to match the first frames
Input video
Wan Animate result (+ Kling O1 edit for background)
Models in this category:
Wan Animate
Runway Act 2
Kling Motion Transfer
Others emerging regularly (Luma Modify)
Marcel says: “Motion transfer is like playing the lottery. When you win, it’s gold. When you lose, it’s a three-legged nightmare. Budget extra time and patience.”
Overall, this category of models is very promising: human-driven acting still gives the most nuanced results. For now, though, it works better for transferring full-body movement (like dancing), as shown in this recent amazing example from Ink.
TYPE 3: LIP SYNCING / DUBBING MODELS
SYNC / HEYGEN - The Post-Production Fix
What it is: Existing video + new audio = re-synced lipsync layer
Best for: Adding voice to finished videos, translations, and audio replacements
How it works:
Take any existing mascot video (even without audio originally)
Upload new audio file
Tool layers lip-sync on top to match the new audio
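Most tools in this category run asynchronously: you submit the video plus the new audio, then poll a job until the re-synced clip is ready. A hypothetical sketch of that pattern; endpoints and field names are placeholders, not Sync’s or HeyGen’s actual API:

```python
import time
import requests

# Hypothetical endpoints and field names -- illustration only, not the
# actual Sync or HeyGen API.
BASE = "https://api.example.com/v1/dub"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Submit the existing mascot video together with the new audio track.
with open("mascot_clip.mp4", "rb") as video, open("new_line.mp3", "rb") as audio:
    job = requests.post(
        BASE, headers=HEADERS, files={"video": video, "audio": audio}, timeout=120
    ).json()

# Poll until the re-synced clip is ready.
while True:
    status = requests.get(f"{BASE}/{job['id']}", headers=HEADERS, timeout=30).json()
    if status["state"] == "done":
        print(status["video_url"])
        break
    if status["state"] == "failed":
        raise RuntimeError("Dubbing job failed")
    time.sleep(5)
```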
The good:
Fast audio addition to existing content
Works with any existing video
Great for translations and voice changes
Good for fixing audio issues in post
You can generate a more complex scene with a top-tier video model like VEO 3.1 or Kling 2.6, then overlay the talking on top
The limitation:
Lipsync can feel “layered over” the original
Less seamless integration than image-to-video or motion transfer
You can see the seams if you look closely
Doesn’t add motion to the scene, only talking movement (you can specify whether you want lip, face, or head movement)
Tools in this category:
Sync
HeyGen
Similar dubbing services
Marcel says: “Gets the job done when you need it. Not as clean as building voice in from the start, but sometimes you just need to fix it in post. That’s life.”
Why Most Video Models Don’t Work for Mascots
Models like VEO 3.1 (with VEO 4 coming soon), Kling 2.6, and now Seedance Pro 1.5 can generate impressive videos with audio.
For mascots? They don’t work (for now, at least).
These tools generate videos with random AI voices: a different voice every generation. Generic. Uncontrollable. Plus the typical compressed AI voice that screams ‘AI’ to everyone.
For mascots, this breaks everything:
Your mascot needs its voice. The same voice. Every single time.
That voice might be:
AI-generated character voice (ElevenLabs, etc.)
Voice actor recording
Your own voice
Doesn’t matter, but it must be consistent
Brand recognition = voice recognition.
If someone hears your mascot and doesn’t immediately know it’s your brand, the voice is wrong.
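In practice, “consistent” means pinning one voice ID and reusing it for every single line. A minimal sketch assuming the ElevenLabs Python SDK; the voice ID is a placeholder, and the SDK interface changes between versions, so check the current docs:

```python
from elevenlabs import save
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

# One voice ID, stored once in your brand system and reused for every
# line the mascot ever speaks -- that repetition is the whole trick.
MASCOT_VOICE_ID = "your-mascot-voice-id"  # placeholder

audio = client.text_to_speech.convert(
    voice_id=MASCOT_VOICE_ID,
    model_id="eleven_multilingual_v2",
    text="Hey! It's me again. Let's talk about your brand.",
)
save(audio, "mascot_line.mp3")
```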
The Three-Type Approach
This is why we use the three model types covered above. They all accept custom audio input:
Type 1 - Image-to-Video: Fabric, Kling Avatar (Image + your audio)
Type 2 - Motion Transfer: Wan Animate, Act 2, Kling Motion (Video + your audio)
Type 3 - Dubbing: Sync, HeyGen (Video + your new audio)
These tools prioritize custom audio because that’s what builds brand memory.
Generic AI voices = no distinctive asset. No brand equity. No memory structure.
Marcel says: “See those fancy video models everyone’s hyping? They give me a different voice every time. That’s not a mascot. That’s digital schizophrenia. Hard pass.”
What 2026 Looks Like
This year, we went from “can it move its mouth?” to “can it gesture naturally?”
Next year? I expect 2 things:
Custom audio training for top-tier video models like VEO 4 (perhaps via ElevenLabs)
Real-time avatars with custom voice
The tech is already emerging:
Live mascots in video calls
Interactive customer service avatars
Real-time streaming characters
Mascots responding to live audience input
The brands that have their character voice systems built now will dominate.
The brands still treating mascots as “just logo variations” will scramble to catch up while everyone else is already live and interactive.
Voice isn’t optional anymore. It’s infrastructure.
For a broader view on mascots in 2026, check out this article.
That’s it! Thanks for reading. And happy 2026!
Want to build your brand universe?
I help brands create character-driven engines powered by AI.
Book a 20-minute call | See examples
Subscribe to Marcel’s Lab - The only newsletter on building character-driven brands powered by AI. When content is infinite, characters are your moat.