I tinkered with the recently released ComfyUI version of the Sonic Audio Perception model, which can generate lip-synced videos of arbitrary length from just a single image and an audio file.
The results are kind of… mixed, but still pretty compelling, with moments of both creepiness and greatness. The arbitrary length in particular is a major selling point. This is definitely one to keep an eye on.
The lady was created using my own Flux-based character workflow, and the track is an a cappella version of “She Wolf” by David Guetta & Sia, which is very challenging lip-sync material.
The workflow is actually remarkably simple, but also super prone to desynchronization between the audio and visuals.
I’ve run a number of tests to figure out what’s responsible, but the results have been inconclusive so far.
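One cheap check worth doing before blaming the model is ruling out a plain duration mismatch between the source audio and the rendered clip. Below is a minimal sketch of that idea; the file paths and the tolerance are placeholders (not from my workflow), and soundfile plus OpenCV just stand in for whatever tooling you already have around.

```python
# Sanity check for audio/video desync: compare the source audio's duration
# with the rendered clip's duration (frame count / fps). A large gap points
# at frame-count or fps handling rather than the lip-sync model itself.
import soundfile as sf
import cv2

AUDIO_PATH = "input/acapella.wav"       # hypothetical path
VIDEO_PATH = "output/sonic_render.mp4"  # hypothetical path

audio_seconds = sf.info(AUDIO_PATH).duration

cap = cv2.VideoCapture(VIDEO_PATH)
frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
fps = cap.get(cv2.CAP_PROP_FPS)
cap.release()
video_seconds = frames / fps if fps else 0.0

drift = video_seconds - audio_seconds
print(f"audio: {audio_seconds:.3f}s  video: {video_seconds:.3f}s  drift: {drift:+.3f}s")
if abs(drift) > 0.1:  # arbitrary tolerance
    print("Durations diverge; check the fps setting on the video output node.")
```

If the durations match and the lips still drift, the problem is inside the generation itself rather than in how the frames get assembled, which at least narrows the search.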
The process introduces noticeable artifacting to the face, especially around the eyes and the skin, but it’s also sometimes surprisingly good at catching little performance nuances. Overall, very cool.
