At Connected, when we aren’t working with clients to define new products, practitioners have the opportunity to explore emerging technologies and better understand the new interactions and experiences they might unlock. Recently, I worked with a small team in our Labs (R&D) program to explore how we could create more social and intuitive conversational interfaces. During this exploration we uncovered a blended approach to building conversational agents that are more social and flexible while still retaining functional utility and purpose. This blended approach is novel and exciting because, until now, agents have tended to do one or the other: be sociable or serve functional needs. An agent that does both brings us that much closer to conversational interfaces that are truly responsive to human needs.
To understand why this is so important, we need to examine conversational interfaces past and present. We’ve all experienced frustration when faced with a conversational interface that just isn’t understanding us (which is, to be honest, many of them). As users, we’re meant to follow along with predetermined dialogue flows, and when we stray from these paths and our ‘utterance’ isn’t matched to an ‘intent’, we’re typically given a canned response.
“Can you repeat that? I don’t understand.”
While conversational designers can strive to limit these kinds of frustrating moments and manage them more gracefully, they can’t eliminate them entirely. Humans are innately unpredictable and have a lot of ingrained expectations about how conversations should work. This is complicated by how good or ‘real’ these digital agents are starting to sound. As speech synthesis becomes more human-like, our expectations of what these interfaces can do rise with it, and that often leads to disappointment. Just listen to how realistic Google’s speech synthesis has gotten with this demonstration of Duplex:
While Duplex can book a reservation or schedule an appointment, the conversation will collapse as soon as you ask how its weekend went. A key limitation of modern conversational interfaces is that they are typically domain-specific. Alexa skills, Google Actions, and Facebook Messenger chatbots are all programmed with predetermined conversational flows that encapsulate a specific capability or knowledge base; this is called a domain. Should the user stray from that narrow domain, they’ll be met with a dreaded fallback response that exposes the edges of the interface and reminds them that they are talking to an inanimate object.
“I’m sorry, I don’t understand, but I’m learning all the time.”
Unfortunately, when the illusion is broken and the curtain is pulled back, it becomes harder for agents to truly engage with people and hold their attention. These edges might remind us of telephone banking IVRs (interactive voice response systems). You might be familiar with IVRs: the rage-inducing, button-smashing systems that force us through endless decision trees. Not the most well-designed interface, to say the least.
“Press 1 to access your account, Press 2 to report a lost card…”
Although modern conversational interfaces are more flexible than IVRs, users often still experience the same sense of limited and constrained conversational flows. These interfaces simply aren’t as flexible as humans, and that makes them frustrating to engage with for anything but simple tasks. And they definitely aren’t going to keep you company with thoughtful or reflective conversation.
An alternative to these closed-domain conversational agents is an emerging area of research known as open-domain chatbots. These are conversational agents that are able to carry on a conversation about anything a person wishes. For decades we’ve seen attempts to build open-domain chatbots with varying levels of success. Bots like ELIZA and Cleverbot were notable milestones, but still failed to be truly convincing. ELIZA was designed to emulate a therapist and engaged the user by detecting keywords in their statements and reflecting this information back at them with predetermined, open-ended prompts. These chatbots employed simple but clever conversational tricks: reflecting what they were told, deflecting what they hadn’t prepared for, and returning questions to the user to keep them engaged.
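The keyword-and-reflection trick described above can be sketched in a few lines. This is only an illustration of the general technique, not ELIZA's original script: the rules, the pronoun table, and the function names `reflect` and `respond` are all hypothetical and chosen for this example.

```python
import re

# Hypothetical pronoun swaps used to "reflect" the user's words back at them.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are"}

# Hypothetical keyword rules: a pattern to detect, and an open-ended prompt
# that reuses whatever the user said after the keyword.
RULES = [
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bmy (.+)", re.IGNORECASE), "Tell me more about your {0}."),
]

# Deflection used when no keyword is detected.
DEFAULT_PROMPT = "Please, go on."


def reflect(fragment: str) -> str:
    """Swap first-person words for second-person ones, word by word."""
    words = fragment.lower().split()
    return " ".join(REFLECTIONS.get(word, word) for word in words)


def respond(utterance: str) -> str:
    """Return the first matching canned prompt, or deflect."""
    cleaned = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(cleaned)
        if match:
            return template.format(reflect(match.group(1)))
    return DEFAULT_PROMPT
```

With these rules, `respond("I feel ignored by my boss.")` returns "Why do you feel ignored by your boss?", while anything without a known keyword gets the deflecting "Please, go on." There is no understanding involved, only pattern matching, which is why the illusion wears thin so quickly.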
Although novel, these early chatbots still followed predetermined, recursive paths and weren’t able to give the kinds of contextual and salient responses that we generally expect of our human conversational counterparts. They also continuously tried to steer the conversation and were unable to jump between domains in the way that a good conversationalist might. More recently, however, we’ve seen modern NLU (natural language understanding) combined with new conversational models trained using deep learning techniques such as recurrent and generative neural networks. These are language models that are trained on sample data, independent of explicit rule-based programming. The GPT-2 transformer model from OpenAI was the first example to really showcase this capability to the world: text generators that are able to “hallucinate” and generate their responses based on input text alone.
You can read more about GPT-2 here
These modern generative language models are now showing renewed promise for creating more social and humanistic conversational interfaces. While promising, these new open-domain chatbots still haven’t shown the level of human-likeness that would allow them to be useful; they sometimes offer vague responses or unexpected tangential commentary, and salience and contextual specificity are missing more often than not. But over the last few months, both Google and Facebook have published research that shows what’s possible when models are trained on far larger datasets of conversational text, and when new model architectures are employed that hold conversational context, increasing the salience and specificity of the agent's responses.
When Google researchers wanted to evaluate their chatbot Meena, they decided they needed a new metric. They developed the SSA, the Sensibleness and Specificity Average, as a way to score how sensible and how specific each dialogue turn is. To compare Meena, they had third-party human evaluators test several of the leading open-domain chatbots, as well as a human conversation partner that they were led to believe was a chatbot. As you can see, Meena is a significant step forward in closing the gap between humans and computers when it comes to conversational capabilities.
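To make the metric concrete, here is a minimal sketch of how an SSA score could be computed from human labels. It assumes, as I understand the Meena paper, that each response is labeled as sensible or not and specific or not, that a response judged not sensible is also counted as not specific, and that SSA is the simple average of the two rates. The function name `ssa` and the sample labels are made up for illustration.

```python
from typing import List, Tuple


def ssa(labels: List[Tuple[bool, bool]]) -> float:
    """Average of the sensibleness rate and the specificity rate.

    Each label is (is_sensible, is_specific) for one chatbot response.
    Assumption: a response only counts as specific if it is also sensible.
    """
    if not labels:
        raise ValueError("need at least one labeled response")
    total = len(labels)
    sensible_rate = sum(1 for sensible, _ in labels if sensible) / total
    specific_rate = sum(
        1 for sensible, specific in labels if sensible and specific
    ) / total
    return (sensible_rate + specific_rate) / 2


# Made-up labels for four responses: 3 of 4 sensible, 2 of 4 specific.
sample_labels = [(True, True), (True, False), (False, False), (True, True)]
sample_score = ssa(sample_labels)  # (0.75 + 0.5) / 2 = 0.625
```

A bot that always answers "I don't know" can score well on sensibleness but poorly on specificity, which is why the two are averaged rather than reported as one number.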
It didn’t take long before Meena was facing some competition. In late April, Facebook published their own open-domain chatbot research, Blenderbot, a generative open-domain chatbot that is engineered to be engaging and knowledgeable. Factual accuracy is something that Google’s researchers pointed to as an area that still needs to be incorporated into Meena, and Facebook has addressed it. Blenderbot leverages databases like Wikipedia to imbue specific knowledge across a wide range of topics, allowing it to delve deeper into specific domains.
While these developments are exciting, their practical application is still being explored. Models that learn independently from examples rather than being programmed explicitly with specific rules lack the precision to reliably execute functions that would make them practical and utilitarian. A social bot might be nice for casual chit-chat but not for much else. This is the main reason open-domain language models haven’t been adopted by mainstream digital “assistants” like Alexa or Google Assistant, whose primary purpose is to accomplish specific tasks rather than connect with users.
This brings me to the work we’ve been doing at Connected in our Labs program. While researching new types of implicit invocation inputs and social interaction patterns for conversational interfaces, we began to explore the practical applications of open-domain language models. Facebook had just published the research on Blenderbot, and we could access the codebase on GitHub.
At this point in our project, we had already developed a conversational agent, the “Shopkeeper”, who was embodied in an avatar and existed in a virtual reality simulation of a flower shop. While we had already established a predetermined dialogue flow, we wanted to see if we could incorporate an open-domain language model to allow for more flexible conversational interactions, while at the same time retaining the utility of our existing dialogue flow. Our approach was to leverage the ‘fallback intent’ of our dialogue manager (Dialogflow). The fallback intent is typically used for conversation repair: dialogue intended to steer people back to the ‘happy path’ of the conversational flow when they go astray.
“I’m sorry, I don’t understand what you mean. Did you want an indoor plant or an outdoor plant?”
Instead of using the fallback intent to manage errors, we’re using it in an unorthodox manner: we send the user’s ‘utterance’ (what they said) to our cloud-based Blenderbot instance and have the language model generate a unique response that is returned to our shopkeeper agent. This method allows our shopkeeper agent to converse about subjects completely outside of her predetermined domain, while still recognizing when the conversation steers back within the realm of the flower shop so that she can seamlessly switch back to her predetermined dialogue flows.
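The routing idea can be sketched as a small fulfillment handler. This is a simplified illustration under stated assumptions, not our production code: it assumes the dialogue manager calls a webhook with a Dialogflow ES-style JSON body (`queryResult.queryText`, `queryResult.intent.isFallback`) and expects a `fulfillmentText` field back. The names `handle_fulfillment` and `generate_reply` are hypothetical, and `generate_reply` stands in for whatever call reaches the hosted language model.

```python
from typing import Any, Callable, Dict

# Repair prompt used only if the open-domain model cannot be reached.
REPAIR_PROMPT = (
    "I'm sorry, I don't understand what you mean. "
    "Did you want an indoor plant or an outdoor plant?"
)


def handle_fulfillment(
    request: Dict[str, Any],
    generate_reply: Callable[[str], str],
) -> Dict[str, str]:
    """Route fallback utterances to an open-domain model.

    `generate_reply` is a hypothetical stand-in for the call to the hosted
    language model; it takes the user's utterance and returns generated text.
    """
    query_result = request.get("queryResult", {})
    utterance = query_result.get("queryText", "")
    intent = query_result.get("intent", {})

    if intent.get("isFallback", False):
        # No closed-domain intent matched: let the language model improvise.
        try:
            return {"fulfillmentText": generate_reply(utterance)}
        except Exception:
            # Model unreachable: fall back to ordinary conversation repair.
            return {"fulfillmentText": REPAIR_PROMPT}

    # A flower-shop intent matched: keep the predetermined response.
    return {"fulfillmentText": query_result.get("fulfillmentText", "")}
```

Passing the model call in as a function keeps the routing logic testable without a running model: in a test, `generate_reply` can be a lambda that returns a fixed string. The design choice worth noting is that the closed-domain flow always gets the first chance to match, so the open-domain model only speaks when the shopkeeper would otherwise have said "I don't understand."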
Here is a diagram of the architecture of our AWS and GCP web services:
Now to show you our shopkeeper in action. In the first half of the conversation she is improvising with Blenderbot; when the conversation shifts to plant shopping, the dialogue manager allows the predetermined flow to take over. There is a significant delay in our agent's responses because we’re not running the model locally, but the potential for hybrid closed/open-domain chatbots is still clear.
Open-domain language models are here, and getting really good, really fast. Conversation designers and product developers should anticipate a future where they are shaping the contours of their agents, not writing every single line of their story.