It’s been five years since the first Alexa smart speaker was released into the wild, and since then we’ve seen it joined by many entrants from other giants. From Google to Apple to Facebook, everyone is vying for a place on our kitchen counter, night table or bookshelf. With over half of all American households now owning at least one smart speaker, this product category has officially moved into mainstream adoption.
But the odd thing about this “new” voice revolution is that, well, it really isn’t new. Voice interfaces have been used in cars (a seriously great use case) and for telephone banking systems for decades. What’s different now? What has caused this explosive growth in adoption? It’s simple: these devices are now “always listening”, ambiently waiting for you in the background. Smart speakers now enable us to seamlessly and conveniently interact with our virtual assistants even when our hands are full, while we’re trying to multitask or when our phones are just out of reach. This has been the trigger that has led to their rapid proliferation. By being ambient, smart speakers give us the opportunity to stay in the moment and to live our lives with fewer visual disruptions (which is pretty cool, considering other digital products tend to achieve the opposite).
The wake word (“Hey Siri”, “Alexa” or “OK Google”) has been the true trigger in this ambient paradigm shift. The technology behind modern voice interfaces is natural language understanding, or NLU, and the ability to run these machine learning models on-device is what has enabled the innovation of the “wake word”. Smart speakers use special gateways built with an on-device NLU model, a highly focused model trained only to listen for those magical wake words (magical because everyone likes the sound of their own name, right?). With the wake word, deeper conversational interactions can be unlocked, and these happen through the cloud rather than on the device itself. Our voice recordings (post-wake word) are zapped between secure data centers, and then responses are returned to the assistants in our speakers. The on-device nature of the gateways is what enables these speakers to be secure, private (with some caveats) and efficient. With the gateway we can leverage the power of the cloud while knowing that our lives are not being recorded outside of these smart speaker interactions. The only time that information reaches the cloud is in the moments after the wake words are spoken.
“Hey Google” “Alexa” “Hey Siri”
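The gateway idea described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: real devices listen to audio with a trained model, whereas this sketch matches plain text, and `WakeWordGateway`, `fake_cloud` and `uploaded` are hypothetical names invented here, not part of any vendor's software. What it shows is the one property that matters: nothing leaves the device unless it follows a wake word.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Wake words taken from the article; matching on text is a stand-in
# for the on-device audio model (an assumption of this sketch).
WAKE_WORDS: Tuple[str, ...] = ("alexa", "hey siri", "ok google")


def fake_cloud(request: str) -> str:
    """Hypothetical stand-in for the cloud assistant."""
    return f"cloud handled: {request}"


@dataclass
class WakeWordGateway:
    send_to_cloud: Callable[[str], str]
    wake_words: Tuple[str, ...] = WAKE_WORDS
    # Record of everything that left the device, for inspection.
    uploaded: List[str] = field(default_factory=list)

    def _request_after_wake_word(self, utterance: str) -> Optional[str]:
        stripped = utterance.strip()
        lowered = stripped.lower()
        for word in self.wake_words:
            rest = stripped[len(word):]
            # Require a word boundary so "Alexander" does not match "Alexa".
            if lowered.startswith(word) and (not rest or not rest[0].isalnum()):
                return rest.lstrip(" ,.!?")
        return None

    def hear(self, utterance: str) -> Optional[str]:
        request = self._request_after_wake_word(utterance)
        if request is None:
            return None  # no wake word: the utterance stays on the device
        self.uploaded.append(request)
        return self.send_to_cloud(request)
```

Feeding it “Pass the salt, Gwen” returns nothing and uploads nothing, while “Alexa, can you turn off the kitchen light please.” forwards only the part after the wake word.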
Wake words have allowed for our smart speakers to remain in the background, and this has spurred the widespread adoption of conversational interfaces. This is truly a milestone in human-computer interactions. Different from tapping an app icon or sliding to unlock your smartphone, wake words allow us to interact in a way that is much more personal and intimate. Instead of following a mechanical or abstract gesture, we’re trying to get the attention of a friend as if they were just out of view:
“Alexa, can you turn off the kitchen light please.”
Unfortunately, things aren’t completely rosy in the wake world. Yes, wake words are what allow short interactions to feel quite natural, at first. But ask any smart speaker owner about the necessity of “wake words” and they’ll tell you that they can get old pretty quickly. It doesn’t feel natural or convenient to have to say “OK Google” every single time, especially if your smart speaker is right in front of you. It’s not how we would address a friend sitting across from us: “Gwen, tell me about your mom” or “Gwen, pass the salt”. Seriously, it gets awkward, fast (just try it).
If we think about how we try to hold the attention of our peers normally, we don’t use their “wake words” (or, ahem, names) — we use our bodies and our eyes to imply our desire to communicate. For instance, if it’s you and a peer alone in a room, you may just begin speaking with the assumption that your counterpart will understand your intention because, well, it’s just the two of you. However, throw some more people into the room and you might capture their attention using a light touch, or a slight gesture. You may even just gaze at them for a moment and wait for their returned non-verbal acknowledgment. Then, voila. No need to use “Gwen” to get their attention.
Wouldn’t it be oh-so-convenient and, well, natural if our human-computer interactions with smart speakers mirrored the types of cues we’re accustomed to in our social interactions? A kind of social human-computer interaction? Maybe we could stop exhausting ourselves, breathlessly calling out to Siri, and Google, and Alexa, for every weather update or random fact-check. It might seem far-fetched to imagine a world where our devices catch these subtle hints (sometimes we miss them ourselves from other people), but we can already see Google laying the groundwork today. From subtle gestures to facial recognition and more, Google is making our interactions with our smart speakers more seamless than ever before:
Google introduced Continued Conversation at the I/O conference in 2018 with the goal of facilitating more natural conversations with the Google Assistant. With Continued Conversation enabled, Assistant-enabled devices will continue to listen for several seconds after an initial interaction, allowing the user to follow up with additional queries or commands without the need to keep saying “OK Google” (you’ll only need to say it the first time). To accomplish this, the Assistant is now able to tell whether additional conversation is directed at it or at someone else.
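To make the “keep listening for several seconds” behaviour concrete, here is a minimal sketch of a follow-up window. It is not Google's implementation: `FollowUpListener`, the 8-second `window_seconds` value and the text-based matching are all assumptions made for illustration, and the sketch leaves out the harder part, which is deciding whether follow-up speech is actually aimed at the Assistant.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FollowUpListener:
    wake_word: str = "ok google"
    window_seconds: float = 8.0  # hypothetical length of the follow-up window
    _open_until: float = float("-inf")  # time until which no wake word is needed

    def hear(self, utterance: str, now: float) -> Optional[str]:
        """Return the request to act on, or None if the utterance is ignored.

        `now` is a timestamp in seconds, passed in so the logic is easy to test.
        """
        text = utterance.strip()
        if text.lower().startswith(self.wake_word):
            # Explicit wake word: always accepted.
            request = text[len(self.wake_word):].lstrip(" ,")
        elif now <= self._open_until:
            # Inside the follow-up window: accepted without a wake word.
            request = text
        else:
            return None
        # Every accepted request re-opens the window.
        self._open_until = now + self.window_seconds
        return request
```

With this sketch, “OK Google, what’s the weather” at second 1 opens the window, “and tomorrow?” at second 5 is accepted without a wake word, and “pass the salt” at second 30 is ignored.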
For devices that are within reach, Google has several approaches to enable the Assistant without relying on the wake word. My personal favourite is Pixel’s Active Edge technology. Give a Pixel phone a slight squeeze while holding it, and the Assistant comes to life. When I first got my Pixel 2 it seemed like a minor feature, maybe even a bit gimmicky. However, after using it, I quickly understood its value. Even when I’m at home I’ll often opt to use my Pixel’s Active Edge squeeze over interacting with a nearby smart speaker, simply because I don’t need to use wake words with it. I just squeeze and speak. The squeeze also feels surprisingly intimate and rewarding: a small haptic pulse indicates you’ve woken the Assistant.
While Continued Conversation and squeeze-to-talk are both great ways to avoid a wake word, the question is, how do we go from there to interpreting human body language and even understanding intention? The truth is that it’s not enough just to listen to us; smart speakers need to begin to see us. We are moving from the era of “Always Listening” to the new era of “Always Looking”.
Just as on-device machine learning enabled wake-word detection, on-device camera vision systems are now able to detect not only human gaze but also faces and gestures, all with secure, on-device processing. The on-device nature is what enables the new “Always Looking” gateway by avoiding the need to transmit data to the cloud. This is the beginning of computers that understand the implicit meanings behind our body language, allowing us to take our interactions with computers a step further towards intuitive, social and ultimately more human exchanges while keeping our homes private and secure.
Which gestures are selected to be recognized, and how designers and engineers will contend with the diversity of cultural expressions within the world of gestures, are truly unique challenges. Will “Always Looking” software need to be internationalized or personalized to our unique personalities and abilities? How do we account for accessibility and mobility challenges in these systems? These are questions that will need to be addressed. In the Nest Hub Max, Google has started with a fairly universal gesture: a raised hand that enables you to stop music from playing. You can see it demonstrated in the following advertisement for the new Google Nest Hub Max smart display.
Although the gesture is universal, considering diverse expressions is just one piece of the “Always Looking” puzzle; privacy is another huge issue that Google and other companies will need to contend with. Just by reading the comments below the video, we can see that the move from “Always Listening” to “Always Looking” will not be easy.
As the technology behind these new products becomes increasingly complex, it becomes harder to explain to consumers how they work. In an era of high-profile data leaks and security breaches, consumer confidence in big tech firms has rightfully been shaken. Convincing us to bring always-on cameras into our homes is no small task. Still, the steady march towards these sorts of technologies will continue, and it’s up to us whether we want to embrace them or not.