Speech Graphics' Dimitri Palaz has told us about working with non-verbal sounds, detailed the process of collecting data for auto modes, and discussed the challenges of detecting non-verbal vocalizations.
Introduction
80.lv: Please introduce yourself to those who don't know you. How did you join the Speech Graphics team?
Dimitri Palaz: My name is Dr. Dimitri Palaz, I'm the Head of Machine Learning at Speech Graphics, where my team and I work on developing and building deep neural network models to improve our facial animation technology.
I joined the company in 2017, more than six years ago, when it was a small start-up (I was the 7th employee). I had just finished my Ph.D., during which I was part of the early wave of speech researchers to study and adopt the deep learning approach, the underlying technology behind today's generative AI and ChatGPT.
I was looking for a position in the industry, and I was eager to work on solving real-world problems and have a meaningful impact. Michael and Gregor (the co-founders) were looking for someone to take over and further develop the ML aspect of Speech Graphics' facial animation technology, which used mostly deep learning models. It was pretty much a perfect match.
Non-Verbal Sounds
80.lv: How did you start working with non-verbal sounds? What motivated you to launch the project? How did you plan the process?
Dimitri Palaz: The goal of Speech Graphics is to provide accurate and believable audio-driven facial animation. To achieve that, our technology produces very accurate lip sync, but that's not all. We also animate the whole face, including head motion, blinks, and most importantly facial expressions. These expressions are selected using speech signal analysis so that they best match the speaker's tone and mood. Our long-term goal is to correctly animate any sound that can come out of the human vocal tract, so we focused on sounds that are not speech, where a very specific animation is often expected.
These sounds are referred to as non-verbal vocalizations, for example, grunts, breaths, or laughs. They are known to play an important role in human communication as they carry information about the speaker's physiological state, emotional state, or even their intentions. Our customers, video game studios, have a lot of lines of dialogue containing such vocalizations and have expressed interest in being able to trigger a specific custom animation for some of these sounds. Hence we developed a full non-verbal vocalization solution, including automatic detection from audio.
There are a lot of different categories of non-verbal vocalizations, each of them having different characteristics and applications, and presenting their own unique challenges. So we decided to focus only on a handful of categories and build a detection system for one category at a time. As is commonly done in the field of speech processing, we used Deep Neural Networks (DNN), a class of algorithms in artificial intelligence (AI) that is currently state-of-the-art. In a few words, a DNN model can learn by itself how to perform a task by being shown a lot of data that has been manually annotated by one or more annotators. Building a detection system using a DNN involves the following procedure:
- Review the literature to find the latest state-of-the-art approaches and select one or more that are suited for the task
- Review the available datasets that contain the annotation needed
- Train the model to establish a baseline performance
- Iterate from there to improve the model for the target use cases (a minimal sketch of the baseline-training step follows this list).
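To make the "train a baseline" step concrete, here is a minimal, purely illustrative sketch in PyTorch. The feature dimension, the label names, and the tiny architecture are all assumptions chosen for illustration; they are not Speech Graphics' actual model or data.

```python
# A hypothetical baseline: a small classifier over pre-computed per-clip
# acoustic features. Feature size, label set, and architecture are assumed.
import torch
import torch.nn as nn

NUM_CLASSES = 4      # e.g. "laugh", "breath", "grunt", "other" (assumed labels)
FEATURE_DIM = 128    # assumed size of the per-clip acoustic feature vector

model = nn.Sequential(
    nn.Linear(FEATURE_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, NUM_CLASSES),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in for a real annotated dataset: random features and labels.
features = torch.randn(512, FEATURE_DIM)
labels = torch.randint(0, NUM_CLASSES, (512,))

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```

In practice, the baseline's accuracy on a held-out set is the reference point against which the later iterations are measured.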
We found very little relevant literature and no dataset available. So we had to start by gathering and annotating our own labeled non-verbal vocalization dataset.
Collecting the Data
80.lv: How do you collect data for auto modes? What is the labeling process?
Dimitri Palaz: There was no public dataset that we could use, so we had to build our own. Fortunately, as a speech company, we have gathered and cataloged a large collection of audio data over the years, containing as much variation as possible, so no recording was necessary. We then needed to select audio data from our collection that contained various non-verbal vocalizations and fit the target use cases, establish an annotation scheme (i.e. a label set), and manually annotate the audio data. This process is often the first step when building a DNN-based detection system.
The established approach for gathering and labeling data is sometimes referred to as Big Data. It consists of collecting as much data as possible, without looking much into its quality or adequacy, and labeling it using a very simple annotation scheme that can be handled by hundreds of annotators on crowdsourcing platforms. This approach was historically very successful, as such huge datasets enabled the deep learning approach to flourish and become state-of-the-art. It can be seen as model-centric, as the research effort is focused on finding the best models, assuming the data is perfect.
However, this approach has been increasingly questioned recently, especially around issues of generalization and bias. To mitigate these issues, a new paradigm was recently proposed: the data-centric approach, where the research effort is partially shifted from the model to the data. In this approach, the dataset collection is done very carefully, making sure that the data contains enough variation to mitigate overfitting and biases and to match the use cases. The labeling campaign is then done using a complex annotation scheme with expert annotators trained for the task. This approach produces far less volume than the Big Data approach, but of far higher quality.
This is the approach we took. We first spent time carefully selecting audio data from our large collection of audio files and ended up with a couple of hours of highly relevant material. We then focused on the labeling scheme itself: deciding which non-verbal vocalizations we wanted and organizing the annotation process. This was a highly challenging task, as there is no consensus on the definition of non-verbal vocalizations. We devoted a lot of research to identifying which vocalizations mattered most for animation, aiming to create a set as large and encompassing as possible. We then spent a lot of time manually annotating the audio data and refining the set. Once we were happy with our dataset, we moved on to the detection systems.
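As a purely illustrative aside, a segment-level annotation scheme of the kind described above could be represented like this. The label names and record fields are assumptions for the sketch, not the label set Speech Graphics actually settled on.

```python
# Hypothetical segment-level annotations: each record marks a time span in an
# audio file and assigns it one category from an agreed label set.
from dataclasses import dataclass

LABEL_SET = {"laugh", "breath", "grunt", "sigh", "other"}  # assumed categories

@dataclass
class Annotation:
    audio_file: str
    start_sec: float    # segment start within the file
    end_sec: float      # segment end within the file
    label: str          # one category from LABEL_SET

    def __post_init__(self):
        if self.label not in LABEL_SET:
            raise ValueError(f"unknown label: {self.label}")
        if self.end_sec <= self.start_sec:
            raise ValueError("segment must have positive duration")

# Example annotations for one (hypothetical) line of dialogue.
annotations = [
    Annotation("line_0042.wav", 0.20, 0.85, "breath"),
    Annotation("line_0042.wav", 1.10, 1.95, "laugh"),
]
```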
The Challenges
80.lv: What were the challenges of detecting non-verbal vocalizations?
Dimitri Palaz: The conventional approach when using Deep Neural Networks for detection is to first establish the label set, which is the list of categories that we want the model to detect. Usually, this list is exhaustive, meaning that every possible case is covered by the set. For example, when detecting silence in audio files, the label set is composed of two categories: “silence” and “non-silence”, as nothing else can possibly happen. When it’s not possible to obtain this exhaustive list, as is the case for us with non-verbal vocalizations, the conventional approach is to keep the categories we are interested in and merge all the other categories into a “garbage” category, which represents all the categories we don’t want.
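For illustration, the conventional "garbage class" set-up described above amounts to a simple label mapping like the following sketch; the category names are assumptions, not Speech Graphics' actual label set.

```python
# Categories of interest keep their own label; everything else is merged
# into a single "garbage" label. Category names are hypothetical.
CATEGORIES_OF_INTEREST = {"laugh", "breath"}

def to_training_label(raw_label: str) -> str:
    """Map a raw annotation label to the detector's training label set."""
    return raw_label if raw_label in CATEGORIES_OF_INTEREST else "garbage"

print(to_training_label("laugh"))   # -> "laugh"
print(to_training_label("grunt"))   # -> "garbage"
print(to_training_label("speech"))  # -> "garbage"
```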
One of the biggest challenges we faced was that applying the conventional approach as described above didn’t work for the handful of categories of non-verbal vocalization we selected. The performance of the system was not high enough. We hypothesized that this “garbage” class technique was the problem: the model could not learn this garbage class well, because it is made of a lot of different non-verbal vocalization sounds with very different acoustic characteristics. So we created a novel algorithm based on a technique called multi-class learning, which addressed this issue. This approach was very successful, allowing us to deploy this feature in our products. We also published a scientific paper about this approach.
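The interview doesn't spell out the published algorithm, so the sketch below is not a reproduction of it. It only illustrates one generic way (a one-vs-rest, multi-label set-up in PyTorch) to avoid a single heterogeneous garbage class: the model gets an independent binary output per category of interest, so sounds outside those categories simply activate none of them. The category names and feature size are assumptions.

```python
# Generic one-vs-rest alternative to a single softmax with a garbage class:
# one independent binary logit per category of interest.
import torch
import torch.nn as nn

CATEGORIES = ["laugh", "breath"]   # assumed categories of interest
FEATURE_DIM = 128                  # assumed acoustic feature size

model = nn.Sequential(
    nn.Linear(FEATURE_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, len(CATEGORIES)),   # one logit per category, no garbage output
)
loss_fn = nn.BCEWithLogitsLoss()       # independent binary decision per category

features = torch.randn(8, FEATURE_DIM)          # stand-in batch of clips
targets = torch.zeros(8, len(CATEGORIES))       # most clips activate no category...
targets[0, CATEGORIES.index("laugh")] = 1.0     # ...one clip contains a laugh

loss = loss_fn(model(features), targets)
probs = torch.sigmoid(model(features))          # per-category detection scores
print(loss.item(), probs.shape)
```

The design point is that the model never has to learn a single class covering "everything else", which is exactly the heterogeneous mixture described as problematic above.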
We are constantly working on improving our data and models in order to make more modes available to our users. Our latest mode, which will be available soon in SGX, is breath, where we detect breath sounds, either in isolation or during speech, even differentiating between inhale and exhale sounds! We use it internally to improve animation, where it is especially helpful for very breathy lines, which can sometimes be difficult to handle. It will also be available as metadata for our customers to use in their own pipeline.
Dimitri Palaz, Head of Machine Learning at Speech Graphics
Interview conducted by Arti Burton