On the iconic sci-fi TV show Star Trek, the crew of the Starship Enterprise could interact with the onboard computer just by saying the word “computer” and then asking it a question. Imagine being able to control your technology using voice commands in English and Hindi. It might seem like something out of science fiction, but thanks to advances in speech technology, it is now a reality. This article details the rise of voice user interfaces, how they work and the opportunities in this space.
Voice, after all, is the most natural way for humans to interact with each other. Humans have used voice and language to communicate with each other for over a hundred thousand years. It is no surprise that the Star Trek Computer was the inspiration for Alexa – a cloud-based service that lets you voice-control your world. Alexa is a voice-first interface that is the next major disruption in computing. We truly believe the future is Alexa Everywhere – where people will use their voice to communicate with technology not just in their homes, but also at work, at school and even on-the-go.
The reason for the rise of AI-powered voice interfaces is that they offer an ease of use and intuitiveness that the world has never seen before. Visual interfaces such as the desktop computer or the smartphone require some prior knowledge from users before they can operate them. Interactions such as swipe right and pinch-to-zoom are commonplace on smartphones now, but they didn’t exist before the age of touchscreens. We had to learn what these actions meant before using them within an app. I’m sure a majority of us have taught a parent or elderly person how to use a smartphone app, and it usually involves telling them a certain sequence of clicks. With voice interfaces, there is no new medium to be learned. Anyone can converse with Alexa and expect an appropriate response. This significantly lowers the barrier to technology and will soon give millions of people access to tech they otherwise wouldn’t have. Imagine the number of people who can hail a cab, ask for cricket scores or listen to devotional songs by speaking to Alexa in Hindi.
The main driver behind the rise of voice user interfaces has been advances in speech recognition technology. Speech recognition is essentially the conversion of a user’s speech to text. This gets complicated when you factor in the different ways to pronounce words and the variations in accents. Modern speech recognition algorithms are trained on large amounts of speech data and draw on linguistic and semantic analysis to find the most likely match. When a user talks to Alexa, the Automatic Speech Recognition (ASR) algorithms convert their speech to text before the text is processed further.
Once the user’s speech is converted to text, the machine needs to understand the meaning or intent behind what the user has said. This is where Natural Language Understanding (NLU) comes in. To simplify, NLU converts unstructured human conversation into something structured that a computer can understand, the structure being ‘intents’ and ‘slots’. NLU forms the basis of conversational interfaces everywhere, including voice interfaces and chatbots. These NLU algorithms are required because with conversational interfaces, there are a number of different ways a user can say or ask for something. For instance, one can ask for the weather by saying variations of “What’s the weather?”, “Is it hot outside?”, “आज का तापमान क्या है?”. The job of the NLU algorithm is to ensure that all these variations are matched to the correct intent. The intent in all the above examples is ‘get weather’. NLU algorithms also parse what the user says to pick out slots. A slot is essentially any variable that completes an intent. To continue with the above example, a user can ask for the weather in Bangalore, London or any other city. The name of the city is variable and is the slot in the user’s utterance. An utterance can have multiple slots too. Different NLU models exist, and the strength of a conversational interface usually depends on how advanced its natural language understanding algorithm is.
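To make intents and slots concrete, here is a toy sketch in Python. It is not how Alexa's NLU works (production systems use trained statistical models, not pattern matching); it only shows the input and output of the step. The intent names, the sample utterances and the `understand` function are all made up for illustration.

```python
import re
from typing import Optional

# Hypothetical sample utterances per intent; "{city}" marks a slot.
SAMPLES = {
    "GetWeather": [
        "what's the weather",
        "what's the weather in {city}",
        "is it hot outside",
        "is it hot in {city}",
        "आज का तापमान क्या है",
    ],
    "GetCricketScore": [
        "what's the cricket score",
        "who is winning the match",
    ],
}


def _to_pattern(sample: str) -> "re.Pattern":
    """Turn a sample utterance into a regex with one named group per slot."""
    # re.split with a capture group alternates literal text and slot names.
    parts = re.split(r"\{(\w+)\}", sample)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)      # literal words
        else:
            pattern += f"(?P<{part}>.+)"    # slot placeholder
    return re.compile(f"^{pattern}$", re.IGNORECASE)


PATTERNS = [
    (intent, _to_pattern(sample))
    for intent, samples in SAMPLES.items()
    for sample in samples
]


def understand(utterance: str) -> Optional[dict]:
    """Map free text to {'intent': ..., 'slots': {...}}, or None if nothing matches."""
    text = utterance.strip().rstrip("?.!").replace("’", "'")
    for intent, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return {"intent": intent, "slots": match.groupdict()}
    return None


print(understand("What's the weather in Bangalore?"))
print(understand("आज का तापमान क्या है?"))
```

The English and Hindi questions both resolve to the same `GetWeather` intent, and the city name is pulled out as a slot only when the user mentions one.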
The intent and the corresponding slot values are packaged into a neat data structure that is passed along to a service that can handle these various intents. For instance, if the intent is to ‘get weather’, the service interfaces with a weather provider to obtain the weather; if the intent is to ‘get cricket scores’, the service looks up a provider of cricket scores to obtain the relevant info. The retrieved information is still in text form and needs to be converted into speech, since this is a voice user interface. Having a machine plainly convert text to speech and read out words is not conversational. To ensure that the sentences being read out sound conversational, these services use Speech Synthesis Markup Language (SSML). This markup language lets you customize the pitch, tone and pronunciation, and add emphasis, whispers and even words or phrases indigenous to a certain region. Don’t be surprised when your smart speaker says “namaste” or “balle balle”.
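A rough Python sketch of this step follows: route the intent to a handler, then wrap the text reply in SSML. The handler names, the canned replies and the shape of the request dictionary are assumptions made for illustration; a real service would call an actual weather or cricket provider. The `<speak>`, `<emphasis>` and `<break>` elements are standard SSML.

```python
from xml.sax.saxutils import escape


def fetch_weather(city: str) -> str:
    """Stand-in for a real weather provider; returns canned placeholder text."""
    return f"It is 28 degrees in {city}."


def fetch_cricket_score(team: str) -> str:
    """Stand-in for a real cricket score provider; returns canned placeholder text."""
    return f"{team} are 250 for 4."


# Hypothetical routing table: intent name -> function of the slot values.
HANDLERS = {
    "GetWeather": lambda slots: fetch_weather(slots.get("city", "your city")),
    "GetCricketScore": lambda slots: fetch_cricket_score(slots.get("team", "India")),
}


def to_ssml(text: str, greeting: str = "") -> str:
    """Wrap plain text in SSML, optionally opening with an emphasized greeting."""
    body = escape(text)  # escape &, < and > so the markup stays well-formed
    if greeting:
        body = (
            f'<emphasis level="moderate">{escape(greeting)}</emphasis> '
            f'<break time="300ms"/> {body}'
        )
    return f"<speak>{body}</speak>"


def respond(request: dict) -> str:
    """Route a structured request to its handler and return SSML to be spoken."""
    handler = HANDLERS.get(request["intent"])
    if handler is None:
        return to_ssml("Sorry, I can't help with that yet.")
    return to_ssml(handler(request["slots"]), greeting="Namaste!")


print(respond({"intent": "GetWeather", "slots": {"city": "Bangalore"}}))
```

Keeping the routing table separate from the SSML wrapper means a new intent only needs a new entry in `HANDLERS`, while the way replies are voiced stays in one place.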
What is truly exciting to me is that anyone can build a conversational experience for Alexa. Similar to how smartphones have apps and an app store, Alexa has ‘skills’ and a skill store. Building a skill is simple, and you can publish your skill to the skill store so that other people can engage with it. The ASR and NLU algorithms described above are fairly complicated and might seem daunting to someone building a skill for the first time. Luckily, the Alexa cloud service abstracts these elements and does all the heavy lifting so that the skill-building process is easier. Anyone building a conversational experience can thus focus solely on their code and voice design to provide an engaging conversational experience for the customer. Building a skill gives you access to millions of Alexa endpoints across the world. Skills have different monetization opportunities, which include Alexa Developer Rewards, In-Skill Purchases and Subscriptions. Mobile apps alone have spawned a multi-billion-dollar business, and your voice app could be the next to make it big. Your Hindi skill could be the one to break through to the next billion users.
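To give a feel for what "focusing on your code" looks like, here is a minimal Python sketch of a skill's request handler. It assumes the JSON envelope Alexa sends to a custom skill (a `request` object with a `type` and, for intent requests, an `intent` with `name` and `slots`) and returns a response carrying `outputSpeech`. The intent name `GetWeatherIntent`, the slot name `city` and the reply text are hypothetical, and a real skill would more likely use the official Alexa Skills Kit SDK than hand-built dictionaries.

```python
import json


def handle_request(event: dict) -> dict:
    """Turn an Alexa-style request envelope into an Alexa-style response."""
    request = event["request"]

    if request["type"] == "LaunchRequest":
        # The user opened the skill without asking for anything specific.
        speech = "Welcome. Ask me for the weather in any city."
        end_session = False
    elif (
        request["type"] == "IntentRequest"
        and request["intent"]["name"] == "GetWeatherIntent"  # hypothetical intent
    ):
        # ASR and NLU have already run in the cloud; we only read the result.
        slots = request["intent"].get("slots", {})
        city = slots.get("city", {}).get("value", "your city")
        speech = f"Here is the weather for {city}."  # placeholder reply
        end_session = True
    else:
        speech = "Sorry, I did not understand that."
        end_session = True

    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": end_session,
        },
    }


sample_event = {
    "request": {
        "type": "IntentRequest",
        "intent": {
            "name": "GetWeatherIntent",
            "slots": {"city": {"name": "city", "value": "Bangalore"}},
        },
    }
}
print(json.dumps(handle_request(sample_event), indent=2))
```

Notice that the handler never touches audio: it receives an intent and slots, and hands back text for Alexa to speak.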
Imagine if, ten years ago, someone had told you that the keyboard and mouse would no longer be the primary user interface and would be replaced by touchscreens. The very thought would have seemed preposterous. We are now on the verge of yet another paradigm shift in technology, and this time it is towards voice interfaces. There is an incredible opportunity in this space for startups, service companies, marketers, agencies, brands and even academia. The possibilities are endless and the playing field is open. To paraphrase a line from Star Trek, it’s your chance to boldly go where no one has gone before.
The author is an Alexa Evangelist at Amazon India.