In July 2019, I finished writing a book, Voice Applications for Alexa and Google Assistant, all about building skills and actions (i.e., applications) for these smart assistants. I was eager for these platforms to take off, although there were already signs that they wouldn't. By 2021 I was pretty sure we could say that voice had failed as a platform, and I've been thinking for a long time about why that is. Below, I share my thoughts, as well as recent news from the world of natural language and conversational tech.
Why Voice Failed as a Platform
The end of 2022 (and, let's be honest, the beginning of 2023) was a bloodbath for many tech companies, but it was all the more so for the Amazon Alexa and Google Assistant teams. Amazon's 10,000-person layoffs had an outsized effect on the Alexa team, and Google reduced its investment in Assistant. Google Assistant had been withering for a while, but the news from Amazon was a surprise to many outside the company. Taking all of this news together, one conclusion is clear: voice has failed as a platform.
Let me be clear about what I mean by that: voice hasn't failed as an input, it has failed as a platform on which new businesses can be built and where existing businesses feel they need to be.
In this post we'll examine the difference between an input and a platform, the reasons why voice failed as a platform (and some oft-cited reasons that weren't actually to blame), and what this means for the recent resurgence in chat through tools like ChatGPT.
Input versus Platform
An input is how you interact with a device. Keyboards, mice, and touchscreens are all examples of inputs.
Voice can be an input, too. You can control your phone, smart speaker, or other devices through speaking.
A platform, meanwhile, is a system that provides "hooks" for developers and companies to build experiences on top of. The web is a platform, as are iOS and Android, where touch and voice are the inputs. Likewise, Alexa and Google Assistant are also potential platforms.
(By the way, so are or were other players like Bixby, Cortana, and Siri. Their failures, as platforms or otherwise, are so clear that I won’t mention them much.)
Not the Reasons that Voice Failed as a Platform
Adoption
First, voice did not fail as a platform due to a lack of adoption. In 2021, 87.7 million U.S. adults had a smart speaker, out of roughly 255 million adults in total. Germany saw 17.9 million out of 69 million, and the UK 19.7 million out of 52 million. Leaving the home, the same year there were 127.1 million smart assistant users in cars, and 2022 numbers showed 35% of Americans "reached" by smart speakers. While overall growth may now be slowing, these are very healthy adoption metrics, and the same studies show that people didn't buy a device and then drop it, but instead kept using it regularly.
Privacy Concerns
Privacy concerns did not lead to the failure of voice devices either. The adoption numbers already show it, but it's worth addressing directly, because critics coming from this angle tend to over-extrapolate their own views to the population at large.
The truth is this: most people make a calculation between privacy and received value. No, I don’t have data to back this up, but it seems pretty obvious. People are willing to carry microphones around all day via their smartphones because they feel that they are getting enough value in return for the risk. Same thing for security cameras connected to the cloud. Same thing, it would seem, for smart speakers.
Big Tech Investment
Finally, this wasn't an area that never took off due to underinvestment. Amazon's hardware group lost $3 billion in a single quarter in 2022, and it wasn't the Kindle losing all of that money. In 2019, Amazon had 10,000 employees working on Alexa. Google likewise put significant money toward Assistant, even if never at Amazon's level. Microsoft, Samsung, and Apple have also all invested heavily in voice assistants.
A lack of first-party investment didn't cause voice's failure as a platform. The reasons are more inherent to voice itself.
The Reasons that Voice Failed as a Platform
Usability and Discoverability
This is the biggest reason. Voice interfaces are too difficult to use for more than just a handful of purposes, and the root cause is a lack of discoverability of capabilities.
When you visit a webpage, a well-designed one is going to show you the most common things to do:
On this page, there is no doubt about how to find information about pricing or how to download the app.
This isn't true with voice, however. If you wanted to find out pricing information, do you say "tell me about pricing" or "what is the cost?" What you ask, and whether it takes you to the right place, depends on whether the voice app builder trained the NLU to handle your request. Anyone who has used a smart speaker for long enough knows the frustration of "sorry, I can't do that."
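To make that concrete, here's roughly what the builder's side of that bargain looks like. This is a hypothetical, simplified Alexa-style interaction model fragment shown as a Python dict; the intent name and utterances are made up for illustration. The NLU can only reliably match phrasings close to what the builder enumerated ahead of time.

```python
# Hypothetical, simplified interaction model fragment (Alexa-style, as a Python dict).
# The builder has to enumerate sample utterances up front; anything too far from
# them risks the dreaded "sorry, I can't do that."
pricing_intent = {
    "name": "PricingIntent",
    "samples": [
        "tell me about pricing",
        "how much does it cost",
        "what are the plans",
        # nobody added "what is the cost", so that phrasing may simply miss
    ],
}
```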
Of course, the interface of voice apps also obscures functionality you don't know exists. Take "Publish" in the screenshot above. What is it? You need to see it to know it's an option. There are ways to inform users of functionality with voice interfaces ("by the way, did you know you can…"), but they are clunky, delivering information when users aren't ready for it and don't want it.
Discovery of voice apps, or being reminded that they exist, is also bad. Sometime around 2018, whenever you went to a voice-first conference, the topic du jour was what the platforms were going to do to "improve discovery."
Again, we can compare to successful platforms. Find a favorite website and you can bookmark it (or it will show up in your history and as an autocomplete suggestion in the address bar).
Mobile is even more powerful, because you now have an icon on your screen, reminding you. It’s one of the reasons that companies push you to their mobile apps when a mobile website would suffice.
The lack of discoverability is less of a problem when there are only a handful of things to do, and that is how most people use smart assistants. They control their lights, they play music, and they do little else. It's easy enough to remember "play," "pause," and "turn off the lights," and you've already discovered what you can do from the paper insert that came with your device. That's why, even as voice fails as a broad platform, it remains useful for first-party experiences.
Speech to Text and NLU Errors
Usability problems are compounded when the underlying technology has problems itself. Speech to text (STT) and natural language understanding (NLU) mistakes made each interaction a game of chance. Is the assistant going to understand the request this time?
The problem is actually twofold: STT and NLU. STT is handled entirely at the platform level, but the NLU (or at least its configuration) is handled by individual developers, which is a giant risk. This approach required developers to know enough about NLU to configure it correctly, while also understanding their users well enough (before even launching) to predict what they would request.
Further, there are often two parts to a request: what a user wants to do and what they want to do it with. The intent and the entities. When there are few options and a lot of data about user request patterns, the experience will be better than when there are many options and little data. That, again, is why first-party experiences can be more enjoyable to use, and why they might still fail when you ask for something more obscure, like the name of your favorite indie artist.
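Continuing the hypothetical sketch from above, here is what the intent-plus-entities split looks like for something like a music request. The slot and type names are made up; the point is that "play {song} by {artist}" is only as good as the utterance patterns and the catalog sitting behind the {artist} slot.

```python
# Hypothetical intent with slots (entities), again as a Python dict.
# "play Bad Guy by Billie Eilish" -> intent: PlaySongIntent,
#                                    entities: song="Bad Guy", artist="Billie Eilish"
play_song_intent = {
    "name": "PlaySongIntent",
    "samples": [
        "play {song}",
        "play {song} by {artist}",
        "put on some {artist}",
    ],
    "slots": [
        {"name": "song", "type": "MusicTitle"},
        {"name": "artist", "type": "MusicArtist"},  # long-tail indie artists are where this breaks down
    ],
}
```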
Lack of a “Killer App” & Business Adoption
These usability issues kept smart assistants from having a killer third-party app, the app that everyone had to have and that really justified the platform. Businesses in general weren't inspired or inclined to build for them. Some dipped their toes in, but when the choice is between rolling out something that gives a subpar experience and putting those developers on your mobile app instead, it's an easy one. It also didn't help that Amazon and Google were late to introduce monetization tools.
Then, for users, their favorite brands aren't on the platform, and there isn't a "Flappy Bird," so they have little inclination to seek out third-party apps. The cycle was self-reinforcing, and so the assistants never took off as platforms for third-party development.
What’s Next?
The most obvious answer to “what’s next” is what is already happening: Google and Amazon pull back even more on trying to attract third-party developers, Apple doesn’t start, and Microsoft and Samsung pull back from smart assistants in general.
But something has happened since those layoffs at the end of 2022: ChatGPT has taken off like crazy. OpenAI has even opened it up to third parties, much as Amazon and Google did. ChatGPT isn't voice-driven, but it is natural-language-driven, which brings many of the same challenges.
Chat still has the discoverability problem, both for new experiences and for new interaction options within an experience you already know. I've linked to this post before, and it explains the problem well:
Good tools make it clear how they should be used… The only clue we receive [from a chat interface] is that we should type characters into the textbox. The interface looks the same as a Google search box, a login form, and a credit card field.
One giant advantage that ChatGPT (and Claude, etc.) have over Amazon Alexa and Google Assistant is on the NLU front. To say that these LLMs are better at understanding intent than the NLU that came before is, of course, an understatement. Understanding what the user wants is no longer an issue in almost all situations.
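To show the contrast with the enumerated utterances from earlier, here is a minimal sketch of leaning on an LLM for intent understanding instead. Nothing here is a real vendor API: `complete` stands in for whatever prompt-in, text-out LLM call you have available, and the intent names are the hypothetical ones used above.

```python
from typing import Callable

# Hypothetical intents carried over from the earlier interaction model sketch.
KNOWN_INTENTS = {"PricingIntent", "DownloadIntent", "PublishIntent"}

def classify_intent(message: str, complete: Callable[[str], str]) -> str:
    """Ask an LLM to map a free-form message to one of a few known intents.

    `complete` is a stand-in for any prompt-in, text-out LLM call; no specific
    vendor API is assumed.
    """
    prompt = (
        f"Map the user's message to exactly one of these intents: "
        f"{', '.join(sorted(KNOWN_INTENTS))}, or Unknown.\n"
        f"Message: {message!r}\n"
        f"Answer with only the intent name."
    )
    answer = complete(prompt).strip()
    return answer if answer in KNOWN_INTENTS else "Unknown"

# "what is the cost?", "how much do I pay?", and "is there a free tier?" all have
# a decent shot at landing on PricingIntent without anyone enumerating them up front.
```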
My prediction? Chat will fail as a platform just as much as voice did. There is no denying the power of these LLMs, but they will be used for first-party experiences, not as a general-purpose agent. The more things change, the more the inherent challenges remain…
Recent News
Bringing ChatGPT into the College Classroom
Simulating History with ChatGPT
This was a delightful read on how a college professor is embracing ChatGPT to engage students with history.
LLM Articles, No Hype
Yes, there's a line between being overly dismissive of a new technology and overhyping it. This set of articles walks that line well.