The Speech Graphics team shared the latest updates to their automatic speech-to-facial animation system, enhancing the Mandarin, Korean, and French language modules.
Intro
Lip sync is like any CGI: when it's good, it should not draw attention to itself. Instead, it should help tell a story in the most immersive way. Bad lip sync, however, breaks immersion and takes you out of the experience.
At Speech Graphics, we are on a mission to animate all sounds from audio, delivering lifelike, accurate facial animation with unparalleled precision. Our cutting-edge technology, based on a universal biological model of speech production, allows us to generate expressive and immersive animations in any language, even fictional ones!
However, to achieve the highest-quality lip sync, we develop language-specific modules tailored to each language's distinct phonetic requirements. We're excited to introduce updates to three SGX language modules: Mandarin 2.1, Korean 2.0, and French 1.3. These enhancements refine how SGX processes linguistic features unique to each language, ensuring even more realistic and seamless animations.
Language-specific vs Universal Models
Speech Graphics products make use of two types of models to achieve lip sync: universal and language-specific.
The universal model works for any spoken language, including fictional languages like Elvish! This model is trained on a diverse range of languages and reconstructs muscle movement from audio based on universal properties of the human vocal tract and the sounds it can produce. Simply input audio and the universal model will generate accurate lip sync.
Language-specific models can generate even higher-quality lip sync and facial animation by taking into account the speech sounds of one specific language or even a specific dialect.
Imagine you're trying to repeat after someone who is speaking a language you don't know. If you have a good ear, you might be able to replicate what you heard pretty well. This is how the universal model operates. But if you understand the language, you actually know the words and exactly how they're supposed to be pronounced, in which case you can probably repeat what they said perfectly! Our language modules provide this underlying knowledge, helping our technology understand how words are pronounced and resulting in lip sync tailored to that language.
Our team of expert linguists and native speakers plays a crucial role in developing our language modules. They bring a deep knowledge of phonetics, the study of how humans produce and perceive speech sounds, helping us analyze and model the way different languages are pronounced. Through research, iteration, and real-world validation, we ensure that our animations align with authentic speech patterns across languages.
Getting lip sync exactly right for each language is particularly important for localization projects. Diverse global audiences increasingly expect better quality entertainment in their own language, beyond traditional dubbing or subtitles. Our goal is to enable the creation of localized content that feels natural and authentic in any language.
Now, let's dive into the exciting updates for Mandarin, Korean, and French!
Mandarin 2.1: Enhanced Retroflex Sounds for Natural Lip Sync
One of the key challenges in Mandarin lip sync is accurately depicting retroflex consonants (翘舌音), which are spelled zh, ch, sh, and r in the Pinyin romanization. Retroflexes in Mandarin have several key characteristics: the tongue curls backward (hence the term retro "back" + flex "bend"); the lips flare out like a trumpet; and the teeth are very close together. These mouth shapes affect whole syllables zhi, chi, shi, and ri, prolonging the consonant features into the vowel as well. These sounds are strikingly common in Mandarin, appearing in many words such as 是 shì "to be", 十 shí "ten", and 日 rì "day", so getting them right has a big impact on the animation.
In line with these observations, Mandarin 2.1 provides much-improved pronunciation of retroflex consonants and syllables, as illustrated in the video above. With these updates, Mandarin speakers will notice a more natural and expressive representation of their language in animation.
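As a rough illustration of why transcripts help here: retroflex initials are easy to identify from Pinyin spelling, since only the retroflex consonants are written zh, ch, sh, and r. The sketch below is purely illustrative (the function name and transcript format are assumptions, not part of the SGX pipeline), but it shows how a transcript makes these syllables, and their distinctive mouth shapes, trivially detectable.

```python
# Illustrative sketch (not the SGX API): flag retroflex-initial
# syllables in a Pinyin transcript. In Pinyin, only the retroflex
# consonants are spelled zh, ch, sh, and r, so the initial spelling
# identifies them directly.
RETROFLEX_INITIALS = ("zh", "ch", "sh", "r")

def retroflex_syllables(pinyin_syllables):
    """Return the syllables that begin with a retroflex initial."""
    return [s for s in pinyin_syllables
            if s.lower().startswith(RETROFLEX_INITIALS)]

# 是 shi4, 日 ri4, 中 zhong1 all carry the curled-tongue,
# flared-lip, closed-teeth mouth shapes described above.
print(retroflex_syllables(["shi", "ri", "ma", "zhong", "lan"]))
# → ['shi', 'ri', 'zhong']
```

From audio alone, a model must infer these articulations acoustically; with a transcript, the retroflex targets are known in advance.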
Korean 2.0: Now Supporting Latin Characters for Seamless Mixing of Writing Systems
SGX language modules require a transcript to go along with the input audio. The transcript provides maximal information about the sounds that occur in the audio, as plain text in the native writing system of the given language. For Korean, that is the Hangul script. In modern practice, however, it is very common for Latin characters to appear intermixed with Hangul, especially in abbreviations, foreign words, and brand names. Take, for example, the sentence "Speech Graphics를 사용하면 순식간에 얼굴 애니메이션을 만들 수 있어!" meaning "with Speech Graphics, you can create facial animations in no time!"
To better reflect modern language usage, we've introduced Latin support in Korean 2.0, enabling users to process audio and transcripts that seamlessly mix Hangul and Latin characters. As the video above shows, Korean speakers do not pronounce Latin text as it sounds in the language it came from; rather, they naturalize it to Korean pronunciation. The language model accounts for this, so the pronunciation of these inserted words reflects how they are actually spoken.
This work matches existing capabilities in the SGX module for Japanese, another language that frequently mixes Latin characters into the native writing system.
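To make the mixed-script idea concrete, here is a minimal sketch of how a transcript can be split into runs of Hangul and Latin characters using standard Unicode ranges. The function names are hypothetical and this is not the actual SGX implementation; it only illustrates the kind of script detection such a pipeline needs before applying the right pronunciation rules to each run.

```python
# Illustrative sketch (not the SGX implementation): classify transcript
# characters by script using standard Unicode ranges.
def classify_char(ch):
    code = ord(ch)
    # Hangul Syllables block (U+AC00-U+D7A3) and Hangul Jamo (U+1100-U+11FF)
    if 0xAC00 <= code <= 0xD7A3 or 0x1100 <= code <= 0x11FF:
        return "hangul"
    if "a" <= ch.lower() <= "z":
        return "latin"
    return "other"

def script_runs(text):
    """Split text into maximal runs of a single script class."""
    runs = []
    for ch in text:
        cls = classify_char(ch)
        if runs and runs[-1][0] == cls:
            runs[-1] = (cls, runs[-1][1] + ch)  # extend the current run
        else:
            runs.append((cls, ch))              # start a new run
    return runs

print(script_runs("Speech Graphics를 사용하면"))
```

Each Latin run would then be naturalized to Korean pronunciation rather than read as English, matching the behavior described above.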
French 1.3: Improved Pronunciation of "Nasal" Vowels
French is famous for its variety of nasal vowels. These are vowels in which the air moves through the nose as well as the mouth, creating a nasal resonance: sounds like an, on, un, and in, as in un bon vin blanc, "a good white wine." Modern phonetic research has shown that, over time, French speakers have shifted how they pronounce these vowels. In the dialects of France, un and in have become indistinguishable, and all the nasal vowels have shifted higher in the mouth, diverging from traditional descriptions of the language. The French 1.3 update reflects these shifts, bringing the animation in line with how real-life speakers look when they produce the vowels an, on, un, and in. Because these vowels are so common, animating them correctly through the articulation of the lips, jaw, and tongue is essential for natural-looking French lip sync.
Advancing the Future of Facial Animation
These updates mark another step forward in our mission to accurately animate all sounds from audio. By continuously refining our models, we push the boundaries of what's possible in facial animation, delivering hyper-realistic lip sync that enhances storytelling across borders.
Learn more about Speech Graphics' SGX and other products here, join our 80 Level Talent platform and our new Discord server, and follow us on Instagram, Twitter, LinkedIn, Telegram, TikTok, and Threads, where we share breakdowns, the latest news, awesome artwork, and more.