Add lip sync
Connect Convai audio output to your character's facial blendshapes to synchronize mouth movement with speech.
The Convai SDK for Unity includes a real-time lip sync system that drives SkinnedMeshRenderer blendshapes in sync with the character's voice audio. It supports three industry-standard blendshape formats and handles playback buffering, smoothing, and fade-out automatically.
How it works
When Convai sends voice audio, it also streams a sequence of blendshape frames in the character's transport format (ARKit, MetaHuman, or CC4 Extended). The SDK buffers and interpolates these frames, applies optional smoothing, and writes the result to your character's SkinnedMeshRenderer every frame.
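The interpolation step can be illustrated with a short, language-neutral sketch. It is not the SDK's code: it assumes frames arrive with timestamps in seconds and that interpolation is linear, neither of which this page confirms. The function name `sample_frame` is hypothetical.

```python
from bisect import bisect_right

def sample_frame(timestamps, frames, t):
    """Linearly interpolate blendshape weights at playback time t (seconds).

    timestamps: sorted frame times; frames: one list of weights per timestamp.
    Assumption: linear interpolation; the SDK may use a different scheme.
    """
    if t <= timestamps[0]:
        return list(frames[0])
    if t >= timestamps[-1]:
        return list(frames[-1])
    hi = bisect_right(timestamps, t)   # first frame strictly after t
    lo = hi - 1                        # last frame at or before t
    alpha = (t - timestamps[lo]) / (timestamps[hi] - timestamps[lo])
    return [a + (b - a) * alpha for a, b in zip(frames[lo], frames[hi])]
```

Sampling halfway between two frames yields weights halfway between them, which is what keeps mouth movement continuous when the render frame rate differs from the rate of the streamed frames.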
Quick setup
Enter Play Mode
Leave Mapping empty — the SDK auto-selects a matching bundled map for the chosen profile. Enter Play Mode and speak to the character.
The character's mouth moves in sync with its voice output.
If the mouth does not move, confirm that your SkinnedMeshRenderer blendshape names match the expected naming convention for the chosen profile. ARKit uses camelCase names (e.g., jawOpen, mouthSmileLeft). MetaHuman uses the CTRL_expressions_ prefix. Use a custom map if your rig uses different names — see Profiles and mappings.
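A quick way to diagnose a naming mismatch is to compare the names on your mesh with the names the profile expects. The sketch below is a hypothetical helper, not part of the SDK. It assumes you have already collected the rig's blendshape names (in Unity, `Mesh.GetBlendShapeName` returns them by index) and a list of expected names for your profile.

```python
def diagnose_blendshape_names(rig_names, expected_names):
    """Return (missing, near_misses) for the expected blendshape names.

    near_misses maps an expected name to a rig name that differs only by
    letter case, which usually indicates a naming-convention mismatch.
    """
    rig = set(rig_names)
    by_lower = {name.lower(): name for name in rig_names}
    missing, near_misses = [], {}
    for name in expected_names:
        if name in rig:
            continue  # exact match, nothing to report
        candidate = by_lower.get(name.lower())
        if candidate is not None:
            near_misses[name] = candidate
        else:
            missing.append(name)
    return missing, near_misses
```

For a rig that exposes `JawOpen` where the ARKit profile expects `jawOpen`, the helper reports a near miss, which suggests a custom map rather than a re-rig.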
Bundled profiles
Choose the profile that matches the blendshape format your character was rigged with.
Profile       ID           Blendshapes  Best for
ARKit         arkit        61           Apple-rigged characters, some custom rigs
MetaHuman     metahuman    275+         Unreal MetaHuman exported to Unity
CC4 Extended  cc4extended  240+         Reallusion Character Creator 4 characters
If your character was rigged with non-standard blendshape names, create a custom map to route the SDK's output channels to your rig's actual names.
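Conceptually, a custom map is a lookup from each SDK output channel to a blendshape name on your rig. The sketch below illustrates that routing only. The rig names (`Jaw_Open`, `Smile_L`) and the function `apply_mapping` are hypothetical, and the real mapping asset format is described in Profiles and mappings.

```python
def apply_mapping(weights, mapping):
    """Route {channel: weight} to {rig_blendshape: weight}.

    mapping is {sdk_channel: rig_blendshape_name}. Channels without an entry
    are dropped in this sketch; if two channels target the same rig
    blendshape, the larger weight wins (an arbitrary choice made here).
    """
    routed = {}
    for channel, value in weights.items():
        target = mapping.get(channel)
        if target is None:
            continue
        routed[target] = max(routed.get(target, 0.0), value)
    return routed

# Hypothetical rig that uses its own naming scheme.
custom_map = {"jawOpen": "Jaw_Open", "mouthSmileLeft": "Smile_L"}
frame = {"jawOpen": 0.6, "mouthSmileLeft": 0.2, "browInnerUp": 0.1}
routed = apply_mapping(frame, custom_map)
```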
Profiles and mappings
Playback settings
Core setup:
Field             Default       Description
_lockedProfileId  arkit         Transport format the SDK streams (arkit, metahuman, cc4extended)
_mapping          (none)        Optional custom mapping asset (leave empty to use bundled auto-map)
_targetMeshes     (empty list)  SkinnedMeshRenderer components to write blendshapes to
Playback & behavior:
Field             Default  Range        Description
_smoothingFactor  0.5      0 to 0.9     Exponential smoothing per frame (higher = smoother but slower)
_fadeOutDuration  0.2      0.05 to 2.0  Seconds to fade all blendshapes to 0 after audio ends
_timeOffset       0.0      -0.5 to 0.5  Shift playback timing relative to audio (negative = earlier)
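To build intuition for the smoothing and fade-out settings, here is a minimal sketch of both operations. It assumes the common form of exponential smoothing (blend the previous value with the new target) and a linear fade. The SDK's exact formulas are not documented on this page, and the function names are hypothetical.

```python
def smooth(previous, target, smoothing_factor):
    """One per-frame step of exponential smoothing.

    smoothing_factor = 0 passes the target through unchanged; values near
    0.9 keep most of the previous weight, so motion is smoother but lags.
    """
    return previous * smoothing_factor + target * (1.0 - smoothing_factor)

def fade_out(weight_at_end, elapsed, fade_out_duration):
    """Fade a weight to 0 over fade_out_duration seconds after audio ends.

    Assumption: the fade is linear in time.
    """
    remaining = max(0.0, 1.0 - elapsed / fade_out_duration)
    return weight_at_end * remaining
```

With the default factor of 0.5, a weight that jumps from 0 to 1 reaches 0.5 after one frame and 0.75 after two, which shows why higher factors look smoother but respond more slowly.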
Streaming & latency:
Field                      Default   Range        Description
_latencyMode               Balanced  n/a          Preset that controls buffer depth vs. responsiveness
_maxBufferedSeconds        3.0       1 to 10      Ring buffer capacity in seconds
_minResumeHeadroomSeconds  0.12      0.05 to 0.3  Buffer refill threshold after starvation
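The relationship between buffer capacity and the resume threshold can be sketched as a small state machine. This is an illustration under assumptions, not the SDK's implementation: the class `PlaybackBuffer` is hypothetical, it tracks only buffered duration rather than actual frames, and it clamps at capacity where a real ring buffer would overwrite or drop data.

```python
class PlaybackBuffer:
    """Tracks buffered seconds and pauses playback while starved."""

    def __init__(self, max_buffered=3.0, min_resume_headroom=0.12):
        self.max_buffered = max_buffered
        self.min_resume_headroom = min_resume_headroom
        self.buffered = 0.0
        self.starved = True  # nothing to play until data arrives

    def push(self, seconds):
        """Add streamed data, clamped to the buffer capacity."""
        self.buffered = min(self.max_buffered, self.buffered + seconds)
        # Resume only once enough headroom has accumulated.
        if self.starved and self.buffered >= self.min_resume_headroom:
            self.starved = False

    def consume(self, dt):
        """Advance playback by dt seconds; return the seconds actually played."""
        if self.starved:
            return 0.0
        played = min(dt, self.buffered)
        self.buffered -= played
        if self.buffered <= 0.0:
            self.buffered = 0.0
            self.starved = True  # buffer ran dry; wait for a refill
        return played
```

A larger resume threshold means a longer pause after starvation but fewer repeated stalls on an unstable connection, which is the trade-off the latency presets are described as managing.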
Latency mode options:
Mode             Description
Balanced         Default. Recommended for most deployments
UltraLowLatency  Minimal delay; susceptible to starvation on unstable connections
NetworkSafe      High buffering; best for unreliable or high-latency networks
Custom           Unlocks manual control over the buffer fields above
Usage examples
Example 1: ARKit character
Scenario: A corporate training simulation uses a character rigged with Apple ARKit blendshapes.
Setup:
1. Add ConvaiLipSyncComponent to the NPC GameObject (the same GameObject as ConvaiCharacter).
2. Set _lockedProfileId to arkit.
3. In the Target Meshes list, add the SkinnedMeshRenderer from the avatar's head mesh.
4. Leave _mapping empty; the bundled ARKit auto-map covers standard camelCase ARKit blendshape names (jawOpen, mouthSmileLeft, etc.).
Expected outcome: The avatar's mouth, lips, and jaw animate in sync with the character's voice during conversation. Blendshapes return to neutral smoothly after each response ends (_fadeOutDuration = 0.2s default).
Example 2: MetaHuman character
Scenario: A high-fidelity medical simulation uses an Unreal MetaHuman character exported to Unity.
Setup:
1. Add ConvaiLipSyncComponent to the NPC GameObject.
2. Set _lockedProfileId to metahuman.
3. In the Target Meshes list, add all SkinnedMeshRenderer components on the MetaHuman head and teeth meshes; MetaHuman separates these into multiple renderers.
4. Leave _mapping empty; the bundled MetaHuman map targets CTRL_expressions_ prefixed blendshapes.
5. Increase _smoothingFactor to 0.7 for more fluid animation on high-poly rigs.
Expected outcome: All facial regions animate together — lips, jaw, cheeks, and tongue shapes — producing highly realistic mouth movement. Smoothing reduces per-frame jitter visible on high-resolution meshes.
Next steps
After lip sync is configured, validate your complete setup.
Validate your setup