Animating Speech & Mastering Facial Rigging: Guidelines for Game Developers

CTO and Co-Founder of Speech Graphics, Michael Berger, shares his insights on how to optimize facial rigs to get the very best results in terms of lip sync and emotional expression.

Introduction

Michael Berger co-founded Speech Graphics in 2010, and since then its multi-award-winning audio-driven facial animation has been used by studios across nine of the ten top AAA games publishers, in games spanning art styles as diverse as Hogwarts Legacy, The Callisto Protocol, High On Life, and The Last Of Us Part 2.

What is rigging?

Rigging is a crucial aspect of game development and a key component when it comes to delivering compelling and immersive experiences. A rig provides a set of parametric controls to deform a 3D character and enable animations. These controls are analogous to the strings of a puppet. They are the degrees of freedom of the character and thus the building blocks of movement. If done well, rigging enables realistic and expressive animation; but if done poorly, it imposes a limit on the animation quality that can be achieved.

In short, the rig determines what deformations the facial model is capable of and therefore what facial expressions are possible. Getting your facial rig right can be key to conveying emotions and contributing to overall gaming realism. 

UI controls (Epic MetaHuman)

Bone rig

3D scan data used to create blendshapes

Rig Types: Blends, Bones & UI controls 

Character rigging typically falls into three categories: bones, blendshapes, and UI controls. A bone (aka joint) is a simple 3D transformation (translation, rotation and/or scaling) that affects a local region of the character skin. Blendshapes deform the skin via 3D interpolation (morphing) between various shapes. And UI (user-interface) controls are typically a compact set of artist-friendly handles that can be manipulated to deform the character in a digital content creation tool. Many modern character rigs combine all three methods.

Muscles

Speech Graphics audio-to-animation processing is agnostic about which type of rig is being used. The system drives muscles, and the animation is converted from muscle activations to rig parameters as a final step. This conversion is done by triggering various poses of the rig that are stored in the character control file. These poses approximate the effects of the individual muscles on the character's skin. The quality of the muscle poses is crucial to achieving good overall animation, so the main factor in judging the quality of a rig is simply whether it can achieve good muscle poses. Of particular concern are muscles of the lips, jaw, tongue and eyelids, which require extra precision.
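The muscle-to-rig conversion described above can be pictured as a weighted blend of stored poses. The sketch below is a minimal illustration, not Speech Graphics code; all names (`MUSCLE_POSES`, `muscles_to_rig`, the control names) are hypothetical, and a real character control file would hold far richer pose data.

```python
# Hypothetical sketch: each stored muscle pose maps rig controls to the
# values that approximate that muscle's effect at full contraction.
MUSCLE_POSES = {
    "jaw_open":     {"jaw_rot_x": 25.0, "lower_lip_down": 0.6},
    "lip_pucker":   {"lip_purse": 1.0, "mouth_narrow": 0.8},
    "upper_lid_up": {"eyelid_upper_raise": 1.0},
}

def muscles_to_rig(activations):
    """Blend stored muscle poses, weighted by activation in [0, 1],
    into one set of rig parameter values for the current frame."""
    rig = {}
    for muscle, weight in activations.items():
        for control, value in MUSCLE_POSES[muscle].items():
            rig[control] = rig.get(control, 0.0) + weight * value
    return rig

frame = muscles_to_rig({"jaw_open": 0.5, "lip_pucker": 0.25})
```

Because each pose isolates one muscle, the blend stays predictable: halving the jaw-open activation simply halves the jaw rotation it contributes, without disturbing the lip controls.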

Rig Quality: Isolation, detail, and extremity

There are three key characteristics of a rig that promote high-quality muscle poses: isolation, detail, and extremity.

Isolation
A good rig should be capable of isolated movements, meaning that if you can move one part of your face without moving another, your rig should be able to do that too. As a rough guideline, to the extent that a human (or other creature) can move any of the following parts of the face independently from each other, your rig should be able to as well:

  • lips vs jaw vs tongue
  • lower face vs upper face
  • lower lip vs upper lip
  • lower eyelid vs upper eyelid
  • left side of face vs right side of face

The isolation principle is important because it allows each muscle pose to precisely capture the effect of one muscle contraction, without mixing in the effects of other muscles. Creating poses that mix muscles together collapses the degrees of freedom of the character, and yokes anatomical parts in ways that do not respect their individual dynamic properties.

Detail

While isolating muscle movements is important, it is also important to capture the entire effect of a muscle, which means maximizing detail in the deformation. For this reason, the rig should be capable of fairly detailed deformation patterns and control over the entire facial surface. The worst thing to do is to have "dead" zones on the face – i.e., regions that cannot be moved by any rig control. More generally, density of controls allows us to achieve more complex and realistic deformations. Also keep in mind that muscle movements can have secondary non-local effects. For example, when the upper lip moves down, the cheeks and nose area may get pulled along. These subtle secondary movements add perceptual cues to the viewer's experience so that the facial animation is immersive and compelling.

As a general rule, if a particular deformation pattern can be made on a human face, it should be possible to approximate that deformation pattern on your rig. Even if the character is an animal, alien or monster, it should be capable of achieving detailed deformations appropriate to its (probably anthropomorphized) physiology.

Extremity

Finally, in order to achieve highly expressive facial animation, the rig must be capable of extreme muscle poses. Each muscle pose represents the maximum extent to which the character can exercise that muscle. Therefore, if the rig cannot achieve strong poses due to built-in limitations (such as low-intensity blendshapes), then the character will not be able to achieve its full range of movement. Not only will highly expressive animation be difficult to achieve, but the character's behavior will also be relatively muted at every level of intensity, because all intermediate movements are scaled relative to these weakened maximum poses.

Example of an extreme muscle pose (left) and a weak muscle pose (right).

Bones/Joints

When creating a bone rig for the face, around 100 bones is roughly the minimum number for the face to have an optimal range of motion. (Fewer bones can be used in resource-constrained rendering environments such as browsers.) It is important to have the right number of bones for the mouth area so that even high-deformation poses, such as lip pursing, can be achieved. As a rule of thumb, it takes a minimum of 12 to 16 bones in the lips to be able to achieve the necessary mouth poses. For the rest of the face, the bones should be laid out so as to provide the most flexibility where the amount of deformation is highest, such as the upper cheeks, between the brows, and around the eyes.

The maximum skinning influence should be 4 to 8 bones per vertex. Getting all the bones to deform the skin cleanly during animation will require a degree of skinning adjustment and polish. If you have more than 12 bones in the lips, you will likely need a maximum skinning influence of 8.
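The bone-driven deformation described here is conventionally implemented as linear blend skinning: each vertex moves by a weighted sum of its influencing bones' transforms, with the influence count capped as suggested above. A minimal sketch (assumed standard technique, not any engine's actual implementation):

```python
import numpy as np

# Cap on influences per vertex, per the guideline above.
MAX_INFLUENCES = 8

def skin_vertex(rest_pos, influences):
    """Linear blend skinning for one vertex.

    rest_pos:   3-vector, vertex position in the rest pose.
    influences: list of (4x4 bone matrix, weight) pairs; weights
                should sum to 1. Truncated to MAX_INFLUENCES.
    """
    influences = influences[:MAX_INFLUENCES]
    p = np.append(rest_pos, 1.0)  # homogeneous coordinate
    out = np.zeros(4)
    for matrix, weight in influences:
        out += weight * (matrix @ p)  # blend transformed positions
    return out[:3]
```

With identity bone matrices the vertex stays at its rest position; the "adjustment and polish" mentioned above is the work of tuning the per-vertex weights so that blended transforms look natural in high-deformation regions like the lips.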

Blendshapes and UI controls

Blendshapes and UI controls are great tools for creating lifelike facial animations. A blendshape or morph is a method of interpolating between pre-defined shapes. It involves creating a set of target shapes representing various facial poses; by blending between them, animators can smoothly transition and combine these shapes to achieve realistic and nuanced facial movements. These movements can be very detailed because each point on the face moves independently. 
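The interpolation described above is typically additive: the final mesh is the neutral shape plus a weighted delta toward each active target, so every vertex moves independently. A minimal sketch of this standard formulation (illustrative only; real pipelines store sparse deltas and evaluate on the GPU):

```python
import numpy as np

def blend_shapes(neutral, targets, weights):
    """Evaluate blendshapes additively.

    neutral: (N, 3) array of rest-pose vertex positions.
    targets: dict mapping shape name -> (N, 3) target vertex array.
    weights: dict mapping shape name -> blend weight in [0, 1].
    """
    result = neutral.copy()
    for name, w in weights.items():
        # Add the weighted offset from neutral to this target shape.
        result += w * (targets[name] - neutral)
    return result
```

At weight 0 the mesh is the neutral shape; at weight 1 it reaches the target exactly; combining several weighted targets sums their deltas, which is why well-isolated target shapes combine cleanly.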

UI controls are user-friendly handles that help animators guide groups of blendshapes, bones, and potentially other animation parameters to create cohesive movements, sometimes with correctives depending on the combination. 

Both blendshapes and UI controls tend to be based on an analysis of the face into its basic movement patterns. One popular description of the basic muscle movements of the face is the Facial Action Coding System (FACS) developed by psychologists Paul Ekman and Wallace Friesen. Many practitioners use some adaptation of this analysis for blendshapes and UI controls. 

The Speech Graphics muscle set itself provides an analysis of the face into basic movement patterns. However, there is no need to design a single blendshape or UI control to represent a Speech Graphics muscle, as combinations of the rig parameters can be used to achieve the desired poses.

For precise adjustments, especially around the lips and eyes, additional blendshapes may be needed beyond the basic FACS set. The Epic MetaHumans rigs provide a good example of a set of poses and blendshapes necessary for a wide range of facial animations.

Conclusion

In conclusion, good facial rigging is the foundation of good facial animation. Whether bone-based, blendshape-driven, or UI-controlled, careful rigging is essential to creating expressive, lifelike characters that are immersive and believable in any kind of behavior. If you are interested in using Speech Graphics’ audio-driven animation or learning more about rigging, get in contact with our team. 

Michael Berger, CTO and Co-Founder of Speech Graphics
