What is humanimation? And why is Loose Lips so much better than other automatic Lip Sync products?

Humanimation is a patent-pending algorithm that makes realistic Lip Sync animation keyframe data directly from an audio file of dialog. There are other automatic Lip Sync products out there, but they make animation that is choppy and mechanical.

 

While the previous generation of automatic lip sync products differ from one another, they share one thing in common: they are all based on the idea of making a pre-defined mouth shape (or "viseme") for each audible sound in the speech. In our view, this approach is fundamentally misguided. To appreciate why, and to get a hint of how our technology works instead, let's take a deep dive into the problem.

The Three Broad Approaches to Lip Sync Animation


Historically, three broad types of lip sync methods have emerged:

 

  • "Artist Intuition" - aka "doing it by hand" - the talent and artistic eye of visual artists.

  • "Facial Tracking" - data-capturing the facial movements of a human actor.

  • "Phoneme Targeting" - triggering pre-defined mouth poses according to a phoneme transcription, then morphing between them.

 

All three methods have inherent problems. Artist intuition can produce great results, but it requires the time and expertise of talented animators and is therefore expensive. Facial tracking is a developing approach that is also time-consuming, and doing it right requires capturing the voice actor's audio performance and facial movements at the same time. Among other inconveniences, facial tracking has great difficulty with the tongue, and any edit to the audio necessitates a matching edit to the facial tracking data. Phoneme targeting, the basis of previous automatic Lip Sync products, is examined in detail below.

Definition of “Phoneme” and “1-to-1 Phoneme Targeting”
 

Phoneme - A "phoneme" is an audio "building block" of human speech. All speech can be expressed as a combination of phonemes. In English, 51 phonemes have been identified. A standard phonetic alphabet called the "Arpabet" has been derived, in which a unique symbol is assigned to represent each phoneme.
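As a small illustration of what a phoneme transcription looks like, the sketch below maps a few words to Arpabet symbols. The three-word lexicon and the transcribe function are hypothetical names made up for this example; the symbols follow the style of the CMU Pronouncing Dictionary with stress digits left out, and a real transcription would come from a full dictionary or a speech recognizer.

```python
# Hypothetical mini-lexicon: words mapped to Arpabet symbols
# (CMU-dictionary style, stress digits omitted for brevity).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "cat": ["K", "AE", "T"],
    "speech": ["S", "P", "IY", "CH"],
}


def transcribe(text):
    """Return the flat phoneme sequence for a string of known words."""
    phonemes = []
    for word in text.lower().split():
        # Raises KeyError for any word missing from the toy lexicon.
        phonemes.extend(LEXICON[word])
    return phonemes


if __name__ == "__main__":
    print(transcribe("Hello cat"))
```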

1-to-1 Phoneme Targeting - Used in the prior generation of 2D and 3D animation platforms, "1-to-1 Phoneme Targeting", or simply “Phoneme Targeting”, works as follows. First, a library of reusable mouth poses is created, each associated with one or more phonemes. Next, a phoneme transcription is acquired from the speech in the audio file and converted into KeyFrame Data. The KeyFrame Data is then placed at the appropriate points along the timeline, where it triggers a series of Mouth Poses, with the animation software "interpolating" or "morphing" between them.

Previous automatic Lip Sync systems use 1-to-1 Phoneme Targeting.
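The 1-to-1 Phoneme Targeting pipeline described above can be sketched in a few lines. This is a minimal illustration of the older approach, not of humanimation, and everything in it is an assumption made for the example: the pose library, the two pose parameters (jaw opening and lip rounding), the timings, and the function names. Real products use richer poses and their own interpolation curves.

```python
from bisect import bisect_right

# Hypothetical pose library: one fixed mouth pose per phoneme,
# expressed as (jaw_open, lip_round), each in the range 0..1.
POSE_LIBRARY = {
    "HH": (0.3, 0.0),
    "AH": (0.7, 0.0),
    "L": (0.4, 0.0),
    "OW": (0.5, 1.0),
}


def phonemes_to_keyframes(timed_phonemes):
    """Turn (time_in_seconds, phoneme) pairs into (time, pose) keyframes.

    Every phoneme gets its own keyframe: this is the 1-to-1 mapping.
    """
    return [(t, POSE_LIBRARY[p]) for t, p in timed_phonemes]


def pose_at(keyframes, t):
    """Linearly interpolate ("morph") the pose at time t.

    Assumes keyframe times are strictly increasing.
    """
    times = [k[0] for k in keyframes]
    if t <= times[0]:
        return keyframes[0][1]
    if t >= times[-1]:
        return keyframes[-1][1]
    i = bisect_right(times, t)  # index of the first keyframe after t
    (t0, p0), (t1, p1) = keyframes[i - 1], keyframes[i]
    w = (t - t0) / (t1 - t0)
    return tuple(a + w * (b - a) for a, b in zip(p0, p1))


if __name__ == "__main__":
    # Illustrative timings for the word "hello".
    transcription = [(0.00, "HH"), (0.10, "AH"), (0.20, "L"), (0.30, "OW")]
    keyframes = phonemes_to_keyframes(transcription)
    for frame in range(8):
        t = frame / 24  # sample at 24 frames per second
        print(round(t, 3), pose_at(keyframes, t))
```

Note that the sketch produces exactly one keyframe per phoneme, so the mouth is driven to a new target pose for every sound in the transcription.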

The “Mechanical Mouth” Problem


Targeting a mouth pose for every single phoneme detected in the audio file is what we call "1-to-1 Targeting". Experience has shown that 1-to-1 Targeting results in Lip Sync that appears "choppy", "robotic", or "mechanical": what we call the dreaded "Mechanical Mouth".


The 1-to-1 Targeting approach looks unnatural and aesthetically unpleasant because human speakers typically do not form an individual mouth pose for every single phoneme that is produced. Rather than any 1-to-1 mapping, the relationship between jaw, lips and tongue movements versus the phonemes produced is complex, subtle, and highly context-specific. 

Prior inventors in the field have described the problems with phoneme-targeted Lip Sync. For example, William H. Munns, in patent US7827034B1, “Text-derived speech animation tool” (2008) (“Munns”), states that:

"A phoneme-based process is simpler in that there are less phonemes than syllables in speech, but the result is unnatural because real speech is syllabic and all dramatic quality and character of human speech derives from modulation, emphasis, and pace of the syllables, not the phonemes."
Munns (2008)

 

Munns categorically rejects phoneme-based lip sync because:


"phoneme system was never intended as a method of reconstructing lifelike lip animations for physical robotic and digitally embodied characters." (Id.)

Our New and Original Solution

It is true that the concept of “phonemes” was derived simply as a description of the fundamental “building block” elements within the sound of spoken language, with no connection to the mouth movements used to produce those sounds.


However, our years of original research, delving into linguistics, physiology and aesthetics, have shown that such relationships do exist. Our novel discoveries of the complex, subtle and context-specific relationships between independent jaw, lip and tongue movements and the phonemes they produce form the basis of the technology we call...

 

humanimation. 

If you rep a major animation platform, game title, engine, or other developer in the animation field, and think that our humanimation technology would add value onboard your product, we are certainly interested. Please CONTACT us.

© 2020 by Humanimate  

a division of

ESPIRTU TECHNOLOGIES, LLC.