IBM places its bets on AI-powered audio interactivity as the future of XR.
As VR and AR continue to go mainstream, user expectations are rising with them. The last big breakthrough in ease-of-use interactivity was the launch of hand tracking and gesture recognition on enterprise and consumer XR devices.
IBM predicts that AI will unlock the next generation of interactivity for XR experiences, writing in the 2021 Unity Technology Trends Report that the maturity of AI will play a key role beyond hand tracking and into the world of voice.
This will include query-based voice interactions for a new level of digital agency, and even the ability to interact with and control digital environments through conversation.
Curious to learn more, I reached out to Joe Pavitt, Master Inventor and Emerging Technology Specialist at IBM Research Europe.
NATURAL LANGUAGE PROCESSING
Natural language processing (NLP) is a branch of machine learning that powers realistic conversation between humans and machines. IBM’s key technology in this area is Watson, its AI assistant.
Pavitt describes how Watson uses classifiers to recognize different components in speech. This makes it easier to interpret varying inputs or asks, and also easier for a developer to build the speech interfaces into an experience.
“When you program [speech] into a game, you may have 10 intents that you will need to handle, but the freedom of being able to use your own voice as the user makes it feel like you’ve got infinite things to ask. Even if you ask something completely obscure, you could still classify it in such a way that it’s integrated with the story and the flow of what you’re expecting,” says Pavitt.
He gives the example of Star Trek: Bridge Crew, a VR game made in collaboration with Ubisoft. You can play with ‘crewbots’ who, with the help of Watson-powered voice recognition, will listen to and carry out commands.
“You could be the captain of the Enterprise, and bark orders at anyone with your voice. You didn’t have to keep hitting menu buttons, you were just talking to the characters and telling them what to do,” says Pavitt.
In terms of voice recognition, he explains how natural language processing works in this context.
“You have ‘increase the engine power to 70%’. That’s classified as ‘engine power’. We know the intent of what they want to do. You could also say ‘increase the engine thrust to 70%’. You could say ‘increase thrust’ without saying the word ‘engine’ and it would still be classified ‘engine power’. So functionally, that’s how it works,” says Pavitt.
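Watson’s internal classifiers aren’t public, but the behavior Pavitt describes, where different phrasings map to the same intent and a parameter is pulled from the utterance, can be sketched in a few lines. The intent names, training phrases, and scoring method below are all illustrative assumptions, not Watson’s actual API:

```python
import re
from collections import Counter

# Hypothetical training phrases per intent (invented for illustration).
TRAINING = {
    "engine_power": ["increase the engine power", "increase thrust",
                     "set engine thrust", "more power to the engines"],
    "shields": ["raise the shields", "divert power to shields"],
}

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def score(utterance, examples):
    # Crude bag-of-words overlap between the utterance and an
    # intent's training phrases; real systems use learned models.
    bag = Counter(t for ex in examples for t in tokens(ex))
    return sum(bag[t] for t in tokens(utterance))

def classify(utterance):
    intent = max(TRAINING, key=lambda i: score(utterance, TRAINING[i]))
    pct = re.search(r"(\d+)\s*%", utterance)  # extract the percentage slot
    return intent, int(pct.group(1)) if pct else None

print(classify("increase the engine thrust to 70%"))  # -> ('engine_power', 70)
```

The point of the sketch is the developer’s side of the bargain Pavitt describes: the game only has to handle a handful of intents, while the player experiences open-ended speech.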
When I ask Pavitt how AI-based audio interactivity will make VR games more compelling, he has a few points to share.
“A lot of what we’re doing in artificial intelligence, with conversational interfaces, allows flexibility of language. You don’t have to select from one of three options,” he says.
He talks about Mass Effect, a role-playing game that was groundbreaking when it first came out in 2007. A core mechanic was the player’s conversational choices affecting the overall arc of the story. Although the conversations were menu-based at the time, the game’s developer BioWare made incredible progress through to Mass Effect 3, released in 2012.
With audio interactivity, Pavitt highlights how next-generation games will become far more personalized thanks to the introduction of conversation-based language.
In addition to enriched audio interactivity with characters and narratives, Pavitt explains how Watson will enhance interactivity with the environment.
“It’s not just in terms of conversations with characters, but being able to query your world and query the game as well. Having your questions addressed is only going to help the user experience going forward,” says Pavitt.
He gives the example of the International Space Station (ISS) VR demo he built that allows you to fly around the ISS and interact with the environment via speech-based queries.
“You can do instructional things like say, ‘open the hatch door’, or ‘open the pod bay doors’. And you can say, ‘take me to this piece’ or ‘teleport me outside’, and it will just teleport you outside,” he says.
“Because Watson has the ability to read and understand language, you can give it documentation about the International Space Station, and then ask it questions. What is this thing, what does that do? How do astronauts exercise in space? It will just read off information it has,” says Pavitt.
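The “give it documentation and ask it questions” behavior Pavitt describes is, at its simplest, a retrieval problem: match the question against the corpus and return the most relevant passage. Watson’s services do far more than this, but a toy sketch, with an invented corpus and simple word-overlap scoring, shows the basic shape:

```python
import re

# Toy documentation corpus (contents invented for illustration).
DOCS = [
    "Astronauts exercise in space using a treadmill with harnesses.",
    "The Cupola is an observation module with seven windows.",
    "Pod bay doors on the station are sealed by airtight hatches.",
]

def toks(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def answer(question):
    # Return the doc sentence sharing the most words with the question.
    q = toks(question)
    return max(DOCS, key=lambda d: len(q & toks(d)))

print(answer("How do astronauts exercise in space?"))
```

Production question-answering adds learned embeddings and answer extraction on top, but the retrieval step is why the system can “just read off information it has” rather than needing every answer scripted in advance.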
“So it’s really good from an educational standpoint, as well as instruction based interactions,” he says.
GETTING ALL EMO (EMOTIONAL)
Another feature of Watson that I’m really interested in is its ability to perform sentiment analysis, part of a broader capability called natural language understanding.
“So from a personality perspective, if you’re starting to train agents, they will be trying to understand not just what is being said, but how it’s being said,” says Pavitt.
On a very basic level, Watson will identify if something is being spoken about in a positive or negative light.
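At that basic level, polarity detection can be sketched with a sentiment lexicon: count positive and negative words and compare. The lexicon below is a tiny invented one, nothing like Watson’s actual natural language understanding models, but it shows the positive/negative/neutral split being described:

```python
import re

# Tiny illustrative sentiment lexicon (not Watson's NLU model).
POSITIVE = {"great", "love", "good", "amazing", "happy"}
NEGATIVE = {"bad", "hate", "terrible", "broken", "angry"}

def polarity(utterance):
    words = re.findall(r"[a-z]+", utterance.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("I love this ship, the view is amazing"))  # -> positive
```

A character could then react differently to the same statement depending on how its own “personality” weights that polarity, which is exactly the divergence between characters that Pavitt points to next.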
“That adds personality to characters. So how one character may perceive one particular state may be very different from how another character perceives it,” he says.
This approach to sentiment analysis also sheds light on whether photorealism or caricature makes for a more relatable being. It comes down to the ability to emote. Anime characters, for example, are notoriously emotional, which means there can be no mistake about how they ‘feel’.
Less emotive beings, even photorealistic ones, tend to fall flat as bots; sometimes even more so when the aim is realism. This is because we are introduced to the uncanny valley—the relationship between the degree of an object’s resemblance to a human being and our emotional response to it.
A MATTER OF TRUST
“The technology is definitely present and valuable enough to offer the frictionless experience that we’re discussing,” says Pavitt.
To give a bit of context, voice-based assistants have been around for over 60 years, but only recently did they gain widespread use with the introduction of smart home devices such as Google Home and Amazon Echo.
“It’s a very different way for people to interact with technology and data. And that’s definitely something that we’ve seen. We did a piece of work a few years ago with a football team in the UK. I remember presenting them the idea of conversation interfaces, and it was a very different shift for their mindset,” says Pavitt.
He describes how the footballers were surprised by how much they struggled with the conversation interface at first, continually turning to their colleagues to ask a question, rather than directly engaging with the interface.
“It was a trust exercise. That’s where we’re seeing a lot of very interesting things right now, more on the trust of the technology in terms of how you’re interacting with this digital agent, rather than the competency or the functionality of the technology,” he says.
“Those who are responsible for putting [audio-interactivity] into experiences for end-users, they’re still experiencing hesitancy around voice and the likes,” says Pavitt.
When I ask him what we’ll see in the next five to ten years, he points to hardware, especially microphone improvements that will enable better audio quality in the first place. He also reiterates that while the technology we discuss is already here, it’s going to take a while for it to become taken for granted as a part of our everyday lives.
IT’S NOT ALL TALK
There are already clear examples of how audio-based interactivity is advancing techniques for training and education, in enterprise applications, gaming, and narrative design, and also as assistants in our homes.
While trust is a huge factor, VR and AR are organically normalizing conversation in digital environments, whether in the real world or in fully computer-generated environments.
In Kent Bye’s podcast #968, he talks with the Director of the Vket Global Team LilBagel, who discusses how some users will only mime in worlds like VRChat, or use hand gestures to communicate, before feeling comfortable enough to communicate with voice in-world.
As such, socialization in virtual worlds is rapidly increasing our comfort with audio-based conversations in immersive environments, but this is still a niche experience.
Gaming, however—whether 2D, VR, or AR—presents a compelling use case for the adoption of audio interactivity, as gamers are already habituated to virtual beings that play a key role as companions or assistants.
Some of the most successful narrative-based games let you play a central role in the story. These include The Elder Scrolls, Warcraft, Fallout, The Witcher, and Half-Life, to name a few. Being able to communicate verbally with characters who react accordingly will drive significant gameplay evolution in these genres.
Similarly, in esports, communicating with your team via chat or voice is crucial. This type of gameplay will also change as audio plays a larger role in game controls.
Finally, enterprise scenarios where engineers need their hands free and cannot use gesture-based controls are another primary use case that Watson is helping to develop.
The open-sourced Watson Unity SDK can be found on GitHub.
I expect to see audio-interactive games and enterprise solutions become more popular as these types of interactions are normalized through VR and AR.
Fun Fact: Watson itself was only developed in 2004 as part of IBM’s DeepQA project to compete on Jeopardy!, and it eventually appeared on the show in 2011, defeating the champions and winning the $1M first prize.
Feature Image Credit: Ben Hider / Getty Images