DeepSpeech: Developing Automatic Speech Systems for Games

Joshua Johnson talked about DeepSpeech, a system that turns speech into text, and discussed how it could be implemented in voice-driven games and RPGs.

Introduction

My name is Joshua Johnson, CTO and Lead Developer at TREE Industries. Founded in August of 2017, TREE is a female-owned, socially responsible technology company that builds products and solutions around Artificial Intelligence, Machine Learning, IoT, AR & VR, gaming, and blockchain, with a huge focus on privacy, accessibility, accountability, and open source. Long story short, I have the amazing job of spending lots of time working on emerging tech and building proofs of concept. Our CEO and founder Jill does an amazing job turning our proofs of concept into MVPs (minimum viable products) and products. We have an amazing team of mentors, volunteers, and supporters from Saint Louis University, Lindenwood University, and Discord who have helped in TREE Industries' growth and success.

DeepSpeech

DeepSpeech is a system that turns speech into text. For anyone unfamiliar, this is a form of “machine learning” where neural networks are trained to recognize speech. Most systems that developers can currently use to perform such tasks require a connection to “the cloud” and are, for better or worse, controlled by a handful of large companies, meaning your speech and voice go through their servers. It also means that an internet connection is required to perform speech recognition.

DeepSpeech was created by Mozilla, the makers of Firefox, and as such is completely open source; it can also run totally offline, making it much more privacy-friendly. This made it an obvious choice for us to experiment with in Unreal Engine.

DeepSpeech and UE4

For the last two years, we have been exploring the concept of using game engines to power AI-driven virtual avatars and the variety of ways they could be utilized, including in our own AI education assistant, EZRA EA, which is powered by the open-source smart assistant Mycroft AI. We had some initial success using Unity-based avatars, but since our gaming branch develops mostly on Unreal Engine, it made sense to pivot our avatar work over to UE4.

In one of our “failed” experiments, we built a voice-driven trivia game called Top of the Hill Trivia for one of the Epic Mega Game Jams. We were able to get DeepSpeech working in our game by using the Unreal Engine Python plugin and a REST-based server, but we could not get it working right in the packaged game in time for the jam deadline. At that point, we knew we were on the right track, but one of our main goals was to have DeepSpeech running offline to allow for the most privacy possible.

Though we would eventually like to create a native Unreal Engine plugin for DeepSpeech using C++, the current experiment utilizes the great NodeJS for Unreal plugin developed by Getnamo. This saved us a ton of coding time since DeepSpeech already has libraries that can run on NodeJS.
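As a rough illustration of why the NodeJS route saved so much time, here is a minimal sketch of DeepSpeech inference using the deepspeech npm package's documented API (v0.9.x). The model, scorer, and audio file names are placeholders, and the audio is assumed to already be 16-bit, 16 kHz mono PCM:

```typescript
// Minimal DeepSpeech inference in Node (deepspeech npm package, v0.9.x).
// File names below are placeholders for the released model/scorer files.
import * as fs from 'fs';
import DeepSpeech = require('deepspeech');

// Load the acoustic model, then attach the external scorer (language model).
const model = new DeepSpeech.Model('deepspeech-0.9.3-models.pbmm');
model.enableExternalScorer('deepspeech-0.9.3-models.scorer');

// stt() expects raw 16-bit, 16 kHz mono PCM samples in a Buffer;
// capture.raw is assumed to have been recorded in that format.
const audio: Buffer = fs.readFileSync('capture.raw');
const transcript: string = model.stt(audio);
console.log(transcript);
```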

Then some custom Blueprints are used to detect mic activity with UE4's built-in Audio Capture component, which triggers the NodeJS DeepSpeech script to transcribe the recorded speech. The result is passed back as a simple text string.
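To give a feel for that round trip, here is a hypothetical sketch of the Node-side script. The `bridge` emitter and the event names (`transcribeAudio`, `transcriptReady`) are illustrative placeholders standing in for the nodejs-ue4 plugin's actual messaging API, not its real function names:

```typescript
// Hypothetical sketch: Unreal sends captured audio over the plugin's
// event bridge, DeepSpeech transcribes it, and the resulting text
// string is emitted back to Blueprints. 'bridge' and both event names
// are placeholders, not the nodejs-ue4 plugin's real API.
import { EventEmitter } from 'events';
import DeepSpeech = require('deepspeech');

const bridge = new EventEmitter(); // stand-in for the UE4 <-> Node bridge

const model = new DeepSpeech.Model('deepspeech-0.9.3-models.pbmm');
model.enableExternalScorer('deepspeech-0.9.3-models.scorer');

// The mic-activity Blueprint fires this with base64-encoded PCM audio.
bridge.on('transcribeAudio', (base64Pcm: string) => {
  const pcm = Buffer.from(base64Pcm, 'base64');   // 16-bit, 16 kHz mono
  bridge.emit('transcriptReady', model.stt(pcm)); // back to Blueprints
});
```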

One cool thing about this is that the NodeJS plugin already runs scripts off the main game thread without any extra code.

Working Without a Cloud

Let’s face it, “the cloud” is extremely convenient for performing AI / Machine Learning tasks and calculations, which is why most providers of such services use it. But keeping the cloud as the long-term status quo for services like speech-to-text will only compound the obvious privacy and data security issues. Yes, there are some companies that may not steal your data, but if they store your voice data in the cloud and get hacked, we are now in an age where your voice could be deepfaked from the stolen data!

There will most likely always be a place for cloud-based services, but we are very excited about the idea of “edge computing”. This is where the AI / Machine Learning tasks are performed right on the device an app is running on.

Interacting with DragonFly by Voice

As a small company doing a great deal of R&D, we post a lot of videos of ongoing concept work to help us make connections and possibly earn some business. A while back, we posted a video that demonstrated using voice / natural language to change out Unreal Engine scenes, dubbed the HoloSet. We were lucky enough to be contacted by Mr. Allan Luckow, who has over 20 years of experience in the film industry and is doing virtual production out of Copenhagen, Denmark. As part of our work to build a proof-of-concept AI-powered virtual production assistant for Allan, he wisely connected us with the great people at Glassbox Technologies, the maker of DragonFly. For those who may not be aware, DragonFly is an Unreal Engine virtual camera plugin that allows control of a virtual camera using an iPhone or iPad, with lots of enhanced features.

Since DragonFly is used quite a bit in virtual production, it made sense to see if we could integrate it into the concept we were working on.

One thing to be aware of is that in this experiment, instead of using DeepSpeech, we are talking directly to a separate open-source AI smart assistant that communicates with Unreal through the web control plugin. With this voice / AI ecosystem, we can create natural language interactions for just about anything that can be controlled through the web control plugin. While this is great for use cases such as virtual production, the requirement of a separate device running the AI does not make it a great fit for creating games, which is why getting DeepSpeech working was important to us for other, more game-specific applications.
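For context, here is a sketch of what a call over Unreal's web control (Remote Control) HTTP API can look like from an external assistant, assuming the default local server on port 30010; the object path and actor below are placeholder examples, not taken from our project:

```typescript
// Sketch of driving Unreal from an external process via the Remote
// Control web API (HTTP server on localhost:30010 by default).
// The object path and actor name below are placeholder examples.
async function callUnrealFunction(
  objectPath: string,
  functionName: string,
  parameters: Record<string, unknown>,
): Promise<unknown> {
  const res = await fetch('http://localhost:30010/remote/object/call', {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ objectPath, functionName, parameters }),
  });
  return res.json();
}

// e.g. a recognized phrase like "hide the crane" mapped to an actor call:
callUnrealFunction(
  '/Game/VP/Stage.Stage:PersistentLevel.Crane_2', // placeholder path
  'SetActorHiddenInGame',
  { bNewHidden: true },
);
```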

Hopefully, one day in the near future, indie studios will all be able to take advantage of a more polished version of our Unreal Voice Ecosystem (UVE) for all types of assets and plugins, usable in the editor or in a packaged app when using UE 4.26 or above.

Possible Applications

This is where things get interesting. While TREE has created some good initial experiments, we have barely scratched the surface of harnessing voice technology in games and other real-time applications.

For example, one idea TREE Industries had thought about goes back to the voice-driven trivia game mentioned earlier, where answers are spoken. This would allow a group of people to play together without controllers, which could be great for a family game night.
 
For a single-player action RPG example, the progress of voice recognition and natural language processing could change the way players interact with NPCs going forward. In our setup, DeepSpeech responds with a simple text string, meaning it’s easy to hook that into existing dialogue/mission systems for Unreal.

Imagine much more free-flowing dialogues that are not bound to A/B/C/D responses. Taking the DeepSpeech result and running it through a natural language understanding (NLU) model mapped to an in-game NPC would allow a player to interact more naturally, asking an NPC things like “Where is the closest inn?”. UE4's AI Perception system has a built-in hearing sense that can be used as a boundary for which NPCs are within hearing range of the player's voice. The NPC's mapped NLU model can then determine whether the question or phrase is something the NPC can answer, triggering the appropriate response. The possibilities are endless.
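Here is a toy sketch of that transcript-to-intent mapping. A real setup would use a trained NLU model; this keyword matcher, with made-up intents and NPC lines, only illustrates the flow from DeepSpeech's text string to a response:

```typescript
// Toy sketch of mapping a DeepSpeech transcript to an NPC intent.
// A trained NLU model would replace this keyword matcher; the intents
// and response lines below are made up for illustration.
interface Intent { name: string; keywords: string[]; response: string; }

const innkeeperIntents: Intent[] = [
  { name: 'find_inn', keywords: ['inn', 'rest', 'sleep'],
    response: 'The Prancing Pony is just up the road.' },
  { name: 'buy_ale',  keywords: ['ale', 'drink', 'beer'],
    response: 'That will be two coppers.' },
];

function matchIntent(transcript: string, intents: Intent[]): Intent | null {
  const words = transcript.toLowerCase();
  // Pick the first intent whose keywords appear in the transcript, if any.
  return intents.find(i => i.keywords.some(k => words.includes(k))) ?? null;
}

// "Where is the closest inn?" -> find_inn -> NPC line
const intent = matchIntent('where is the closest inn', innkeeperIntents);
console.log(intent?.response ?? "I don't follow, traveler.");
```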

Challenges and Limitations

Right now, we are mostly testing with the CPU version of DeepSpeech, which does limit how quickly an inference can be achieved. DeepSpeech has a GPU version that is much faster, but it requires CUDA to be installed. Using the GPU version would allow for near real-time inferencing on a decent card, but we are still investigating how that can be done in a packaged game.
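On the Node side, the GPU build ships as a separate deepspeech-gpu npm package with the same API, so one possible approach (a sketch, assuming a missing CUDA install simply makes the package fail to load) is to prefer it and fall back to the CPU package:

```typescript
// Prefer the CUDA-accelerated deepspeech-gpu package when it loads,
// falling back to the CPU deepspeech package otherwise. Both expose
// the same Model API; the model file name is a placeholder.
let DeepSpeech: any;
try {
  DeepSpeech = require('deepspeech-gpu'); // needs CUDA libraries present
} catch {
  DeepSpeech = require('deepspeech');     // CPU fallback
}

const model = new DeepSpeech.Model('deepspeech-0.9.3-models.pbmm');
console.log(`Expected sample rate: ${model.sampleRate()} Hz`);
```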

One of the other limitations (for now) is that we are using the NodeJS plugin to run DeepSpeech inside Unreal. This works well, but eventually we would like to build a native C++ plugin, since the necessary bindings already exist for DeepSpeech. For now, we are happy with the current setup because it allows us to do almost everything we could do with a native plugin.

We are excited to work on overcoming some of these limitations. Voice interaction with technology is already being used in all kinds of ways, from phones to smart speakers and devices, and it will be interesting to see how it can be used successfully in games in the near future.

Joshua Johnson, CTO at TREE Industries

Interview conducted by Arti Sergeev
