This project explores the user experience and product possibilities of adding computer vision to conversational interfaces through on-device machine learning. The concept came from an earlier exploration into the novel use cases enabled by on-device ML or edge computing. I worked on this over 4 weeks with a pair of software engineers, and in that time we went through several rounds of iterative prototyping, first on Android and later in WebVR.
As Google and Amazon have pushed towards reducing friction in smart speaker interactions, the need for a wake word remains a well-known pain point for users. Voice assistants triggered by body language or gaze detection have the potential to relieve this. By recognizing gesture and body language, voice interfaces might become more intuitive and empathetic.
However, the public is still distrustful of the idea of ‘always listening’ speakers, and the next evolution of these products, ‘always looking’ speakers, will challenge them even more. How might we mitigate concerns about data governance and control through privacy-focused designs and transparent interfaces?
On-device machine learning is already used in several consumer electronics products. A good example is Google's Smart Displays, which use computer vision to understand the context and intent of their users. For instance, if someone walks up to the display to glance at upcoming calendar events, it will recognize who they are and show only the events relevant to them. Google’s Smart Displays also rely on computer vision for touch-free gesture commands, so users can control media playback or dismiss an alarm as soon as it goes off.
These interactions are possible because on-device machine learning reduces latency and preserves privacy. Typically the device processes only the data needed to determine whether the user intends to invoke it, and nothing more; data captured before invocation (the wake word) is never saved locally or sent to a server. This kind of edge computing is very similar to how wake-word detection works, and consumers have adopted these new modalities because they offer significantly more ambient and intuitive modes of interaction.
Desktop (0-1 meter)
In this first pair of experiments we explore non-verbal approaches to invoking the assistant using only a user's eyes. Initially we tried simply detecting an unusually long stare with no blinking.
Open Eye & Staring Detection
The voice assistant does not trigger with normal intermittent blinking, but will trigger when the user holds their eyes open for an extended period. Because blink rates differ between people, we have to make this period longer than strictly necessary. It would be interesting to explore a more dynamic model that adapts to each user's average blink rate, since an individual's biology and hydration levels affect it.
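As a rough illustration of the staring logic described above, here is a minimal sketch in Python. It is not the prototype's actual code: it assumes some face-landmark model already reports, per frame, whether the eyes are open, and the names `StareDetector` and `adaptive_hold_seconds`, the 6-second default, and the multiplier are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class StareDetector:
    """Fires once when the eyes stay open for at least `hold_seconds`."""
    hold_seconds: float = 6.0          # assumed value, not taken from the project
    _open_since: Optional[float] = None
    _fired: bool = False

    def update(self, timestamp: float, eyes_open: bool) -> bool:
        """Feed one frame; return True only on the frame the stare is detected."""
        if not eyes_open:
            # Any blink resets the timer and re-arms the detector.
            self._open_since = None
            self._fired = False
            return False
        if self._open_since is None:
            self._open_since = timestamp
        held = timestamp - self._open_since
        if held >= self.hold_seconds and not self._fired:
            self._fired = True
            return True
        return False


def adaptive_hold_seconds(blink_intervals: List[float],
                          multiplier: float = 3.0,
                          fallback: float = 6.0) -> float:
    """Hypothetical adaptive threshold: a multiple of the user's median
    time between blinks, or a fixed fallback when nothing has been observed."""
    if not blink_intervals:
        return fallback
    return multiplier * median(blink_intervals)
```

A detector could then be created with `StareDetector(hold_seconds=adaptive_hold_seconds(observed_intervals))`, so that a user who blinks rarely needs a longer stare to invoke the assistant than one who blinks often.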
Open Eye & Gaze Direction Detection
We realized that the first method has a high chance of producing false positives, so we wanted to add more signals to our invocation detection. Here we add eye angle detection so that the assistant is invoked only when the user looks directly at the camera for a period of time.
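The combination of open eyes, gaze angle, and dwell time can be sketched as follows. This is an illustrative sketch under assumptions, not the prototype's implementation: it assumes a gaze-estimation model supplies yaw and pitch in degrees relative to the camera axis, and the 10-degree tolerance, 2-second dwell, and function names are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# (timestamp_s, eyes_open, yaw_deg, pitch_deg); this layout is an assumption
Frame = Tuple[float, bool, float, float]


def is_looking_at_camera(yaw_deg: float, pitch_deg: float,
                         tolerance_deg: float = 10.0) -> bool:
    """True when the gaze is within a small cone around the camera axis."""
    return abs(yaw_deg) <= tolerance_deg and abs(pitch_deg) <= tolerance_deg


def first_gaze_invocation(frames: Iterable[Frame],
                          dwell_seconds: float = 2.0,
                          tolerance_deg: float = 10.0) -> Optional[float]:
    """Return the timestamp at which the user has looked at the camera,
    eyes open, for `dwell_seconds` without interruption, or None."""
    start: Optional[float] = None
    for timestamp, eyes_open, yaw, pitch in frames:
        if eyes_open and is_looking_at_camera(yaw, pitch, tolerance_deg):
            if start is None:
                start = timestamp
            if timestamp - start >= dwell_seconds:
                return timestamp
        else:
            # Looking away or blinking restarts the dwell timer.
            start = None
    return None
```

Requiring both signals at once is what reduces false positives: a long unblinking stare at something other than the camera never starts the timer.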
Across Room (< 5 meters)
I wanted to find a fallback non-verbal invocation that could trigger Google Assistant when the user is too far from the camera for their gaze to be detected.
Raised hand gesture detection
When a user is too far from the camera for their gaze to be detected, a raised hand will invoke the assistant. A similar interaction is used on the Nest Hub Max to control media playback and dismiss alarms.
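One simple way to approximate a raised-hand check is a heuristic over body keypoints, sketched below. This is an assumption for illustration only: it is not how the prototype or the Nest Hub Max detects gestures. It assumes a pose model returns 2D keypoints in image pixels with y increasing downward, and the keypoint names and `margin_ratio` are hypothetical.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]  # (x, y) in image pixels, y grows downward


def hand_raised(keypoints: Dict[str, Point], margin_ratio: float = 0.25) -> bool:
    """Heuristic: a hand counts as raised when a wrist sits clearly above
    its shoulder. The margin scales with shoulder width so the check
    behaves similarly whether the user is near or far from the camera."""
    left_shoulder = keypoints["left_shoulder"]
    right_shoulder = keypoints["right_shoulder"]
    margin = margin_ratio * abs(left_shoulder[0] - right_shoulder[0])

    for side in ("left", "right"):
        wrist = keypoints.get(f"{side}_wrist")  # may be missing if occluded
        shoulder = keypoints[f"{side}_shoulder"]
        if wrist is not None and wrist[1] < shoulder[1] - margin:
            return True
    return False
```

In practice this would likely be combined with a short hold time, as with the gaze experiments, so that a passing arm movement does not invoke the assistant.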
Virtual Reality with an embodied agent
By giving our virtual agent a scene, we can add context to ground the conversation. Our agent is the shopkeeper of a virtual plant store. The chatbot was hosted in an AWS Sumerian scene and is invoked when three conditions are met: the user's gaze meets the virtual agent's, that gaze is held for 3 seconds, and the user is within a local radius of the agent.
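The three invocation conditions can be expressed as a single check. The sketch below is written in plain Python rather than against the Sumerian scripting API, and it is only an illustration: the 3-second hold comes from the description above, while the 15-degree gaze cone, the 3-unit radius, and all function names are assumptions. It also assumes the caller tracks how long the mutual gaze has been held.

```python
import math
from typing import Sequence

Vec3 = Sequence[float]


def is_facing(origin: Vec3, forward: Vec3, target: Vec3,
              max_angle_deg: float = 15.0) -> bool:
    """True when `forward` points at `target` within `max_angle_deg`."""
    to_target = [target[i] - origin[i] for i in range(3)]
    dist = math.sqrt(sum(c * c for c in to_target))
    fwd_len = math.sqrt(sum(c * c for c in forward))
    if dist == 0.0 or fwd_len == 0.0:
        return False  # direction is undefined, so treat as not facing
    cos_angle = sum(forward[i] * to_target[i] for i in range(3)) / (dist * fwd_len)
    return cos_angle >= math.cos(math.radians(max_angle_deg))


def should_invoke(user_pos: Vec3, user_forward: Vec3,
                  agent_pos: Vec3, agent_forward: Vec3,
                  gaze_held_seconds: float,
                  hold_seconds: float = 3.0,   # from the description above
                  radius: float = 3.0) -> bool:  # assumed value
    """All three conditions must hold: mutual gaze, hold time, proximity."""
    mutual_gaze = (is_facing(user_pos, user_forward, agent_pos)
                   and is_facing(agent_pos, agent_forward, user_pos))
    held_long_enough = gaze_held_seconds >= hold_seconds
    within_radius = math.dist(user_pos, agent_pos) <= radius
    return mutual_gaze and held_long_enough and within_radius
```

Checking proximity separately from gaze means a user who glances at the shopkeeper from across the store does not start a conversation by accident.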
Detecting customer interest
I wanted to explore whether we could use additional contextual cues by letting the agent be invoked when the user holds their attention on a particular plant in the store. In this scenario, the agent matches that attention to a customer-interest intent and starts an informative dialogue flow, with the plant type passed in as a variable that adds context to the dialogue.
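To make the flow concrete, here is a minimal sketch of turning sustained attention on a plant into an intent with the plant type attached. Everything named here is hypothetical: the intent name `CustomerInterest`, the `plant_type` slot, the object ids, the catalog contents, and the 3-second dwell are assumptions for illustration, not details from the prototype.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical scene metadata: object id -> plant type
PLANT_TYPES: Dict[str, str] = {
    "plant_01": "monstera",
    "plant_02": "fiddle-leaf fig",
}

# (timestamp_s, id of the object under the user's gaze, or None)
GazeSample = Tuple[float, Optional[str]]


def detect_interest(samples: Iterable[GazeSample],
                    dwell_seconds: float = 3.0,
                    plant_types: Dict[str, str] = PLANT_TYPES) -> Optional[dict]:
    """Return an intent payload once the gaze rests on one plant for
    `dwell_seconds`, or None if attention never settles on a plant."""
    current: Optional[str] = None
    start = 0.0
    for timestamp, target in samples:
        if target != current:
            # Gaze moved to a different object, so restart the dwell timer.
            current, start = target, timestamp
        if current in plant_types and timestamp - start >= dwell_seconds:
            return {
                "intent": "CustomerInterest",
                "slots": {"plant_type": plant_types[current]},
            }
    return None
```

The returned payload is the kind of structure a dialogue engine could use to open with information about the specific plant instead of a generic greeting.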