How to build a Holodeck and why large language models are key to unlocking the potential of the metaverse

I have fond childhood memories of watching Star Trek: The Next Generation with my dad. Gene Roddenberry’s vision of our collective techno-utopian future captivated my young imagination, and looking back, it clearly made a deep impression on me, one that has shaped my work, my values, and my ideals.

Over the last several decades, we’ve seen many of Star Trek’s fictional technological wonders seemingly jump right out of the television and into our homes. The iPad bears a striking resemblance to the tablet computers used by the Enterprise crew, and 3D printers can materialize objects, even a nicely marbled steak, in a way that’s reminiscent of Star Trek’s replicators. And just as I say “Hey Google” to ask my Google Assistant to check my calendar, the ship’s crew says “Computer” as a hands-free way to invoke their ship’s Assistant.


But the technology that really fascinated me as a boy, and still does to this day, is the Holodeck. At first sight, it might be mistaken for a gymnasium or a racquetball court: a large, open room with high ceilings. It’s completely devoid of furnishings, its floors, walls, and ceiling tiled in deep black and accented with bright orange lines that criss-cross in a grid.

Within this space, and seemingly out of thin air, anything can be materialized at will. Objects, people, and places are simulated with a high degree of realism using holographic projection. A panel outside the entrance lets the user set the parameters of the experience. Once inside, however, the Holodeck is operated and controlled by voice alone. The ship’s crew summon the assistant by saying ‘Computer’ and, like rubbing a genie’s lamp, they can request whatever their heart desires in natural language, and, just like magic, it appears before them.

Take the scene from season 1, episode 14, in which William Riker instructs the Assistant to design a jazz bar simulation. Using only natural language, Riker can request a new environment, storyline, characters, or experience. Anything that can be described with language can be designed and materialized instantaneously.

When I was a boy, this all seemed to belong to a distant future or a science fiction plot, not something I’d see within my lifetime; it probably still seems unimaginable to most. My mind changed, however, when I was eight and took my first steps into a virtual world. At a theme park, my dad and I got the chance to battle it out in ‘Dactyl Nightmare’, an early arcade-style virtual reality simulation released to arcades and theme parks in 1991.


The large helmet display weighed down on my head and shoulders, and I probably looked ridiculous. But I remember the excitement I felt as my eyes began to focus on the small 276 x 372 pixel displays. The low-poly graphics were crude, yet it felt like I was stepping through the screen and right onto the Holodeck. The fidelity pales by today’s standards, but it was cutting edge at the time: the headset incorporated head tracking and head-mounted displays, bringing an entirely new level of presence and immersion. I was utterly captivated, and the experience was the genesis of my lifelong interest in emergent technologies, virtual reality, and human-computer interfaces.

Almost 30 years later, my wife and I have just decided to remove a large table and convert the center of our home into our own personal Holodeck. What used to be our dining room is now a carpeted play space for VR, large enough to stretch out and navigate virtual space comfortably. In this space, we can be anywhere. The sense of presence can be so strong that I sometimes forget I’m wearing a headset and standing in my living room. In the pursuit of a Holodeck experience, we’re tantalizingly close: the processing power and immersive media technologies are there, the body and positional tracking are there, and the price point is right. But a few pieces of our Holodeck puzzle are still missing, pieces we’re only now beginning to see the outlines of.

For one, you’ll definitely need a competent voice assistant. Virtual reality users today are missing the ship’s Assistant. Riker is able to use the Holodeck without any specialized skills; he simply describes, in natural language, an experience he wants to have. Throughout the series, we see a myriad of complex scenarios, simulations, and recreational activities play out aboard the Holodeck. This human-computer interaction is a large part of the Holodeck’s appeal and something sorely missing from today’s virtual reality systems. The Assistant’s natural language understanding and generative design capabilities are what unlock the crew’s exploration and creative expression within the Holodeck.

Today, even among the top VR headsets, we only see very nascent voice controls. This is a shame, because in VR the typical touch and keyboard inputs are rather challenging to reproduce. Frankly, even the best-in-class natural language understanding systems used by Google and Amazon in their respective virtual assistants don’t allow for the type of open-domain, creative, conversational interaction demonstrated by the Holodeck Assistant. ‘Open-domain’ in the sense that the user can discuss or ask for anything, and the Assistant will never fall back on a response of confusion and a graceful apology. And ‘creative’ in the sense that the computer understands much more than just the intent of the user’s instruction; it can respond in turn with creative interpretations, which take shape as rich plot lines, compelling dialogue, and stunning three-dimensional environments and character designs.

So, where is this piece of the Holodeck puzzle? Where is the technology to unlock this type of creative human-machine interplay? How will we talk to our Oculus Assistants the way the Enterprise crew talks with the Holodeck Computer? Well, we might not have to wait too long. In recent years, incredible advances in language generation have been made with transformer-based models like OpenAI’s GPT-3 and Google’s LaMDA. They’re entirely changing what’s possible and fueling the next generation of digital product development and human-computer interaction. We’re beginning to see that these single massive models possess a generalized intelligence that lets them serve a wide range of applications: answering general questions, writing essays and marketing copy, summarizing long texts, translating languages, and taking memos.

When it comes to chatbots and voice assistants, transformer-based language models have enabled spellbinding exhibitions of flexible, unscripted, open-domain dialogue generation. They can even generate languages that aren’t conversational at all but symbolic and abstract, such as the notation of math, chemistry, or computer code. Their ability to generate code is driving a paradigm shift in how software and digital experiences are programmed, adding fuel to the ‘low code’ movement: the drive to democratize software development, create more equitable access to its immense opportunities, and potentially uncover unforeseen use cases and experiences.


This shift has manifested in recent product announcements from Microsoft and OpenAI, with GPT-3-powered solutions that generate code in various computer languages. We’ve seen headlines asking whether AI is going to replace programmers. And while the focus so far has been on popular languages like Python and JavaScript, we’re starting to see experiments that leverage GPT-3 with WebXR frameworks such as three.js. Below is a simple experiment by Bram Adams that demonstrates using GPT-3 and natural language input to generate code that renders a 3D scene.
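The core trick behind experiments like this is few-shot prompting: pair a handful of natural-language scene descriptions with matching three.js snippets, then let the model continue the pattern for a new description. The sketch below illustrates that idea only; the example snippet and the `build_prompt` helper are my own illustrative assumptions, not Bram Adams’s actual prompt or any particular API.

```python
import textwrap

# Hypothetical illustration of few-shot prompting for scene generation.
# The three.js example below is the "demonstration" the model imitates.
FEW_SHOT_EXAMPLE = textwrap.dedent("""\
    // Description: a spinning green cube
    const geometry = new THREE.BoxGeometry(1, 1, 1);
    const material = new THREE.MeshStandardMaterial({ color: 0x00ff00 });
    scene.add(new THREE.Mesh(geometry, material));
    """)

def build_prompt(description: str) -> str:
    """Append a new description so a completion model can continue it
    with three.js code in the same style as the example above."""
    return FEW_SHOT_EXAMPLE + f"\n// Description: {description}\n"

prompt = build_prompt("a red sphere hovering over a sandy plane")
```

The resulting prompt would then be sent to a text-completion endpoint, and the returned JavaScript evaluated inside a WebXR page to render the scene.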

So we now have an Assistant that can take requests in natural language and generate the code needed to build the requested three-dimensional scene. Unfortunately, the results are a bit lacklustre; they seem limited to poorly chosen textures and simple primitive shapes. All of our world’s richness of detail and nuance is missing, and it looks as though there’s a second missing piece in our Holodeck puzzle: the creative and artistic aspects of these interactions.

On the Enterprise, the Holodeck Assistant is fluent in the shared visual language of the ship's crew. For our own Assistants to interpret and design new worlds, we need to share a visual language with them — it’s that simple. The Holodeck Assistant understands what ‘jungle’ means, and when it renders one, we can all agree: yes, that’s a jungle.


In order to generate new vistas and help bring our complex imaginings to life, our Assistant will need a shared understanding of our world’s visual richness. For VR, this means, once again, we need the help of AI. We’re beginning to glimpse these capabilities in a new model from OpenAI called DALL·E: a transformer-based language model trained like GPT-3, but on a dataset of text-image pairs instead of text alone. OpenAI’s researchers discovered that just as a large transformer model pre-trained on language can generate coherent, salient text, the same type of GPT model trained on text-image sequences can generate convincing images from natural language input alone.

Image generation models in the past have been domain-specific, requiring complex architectures, highly specific training data, or additional inputs such as segmentation masks or labels. No previous model has shown the kind of generalizability or zero-shot natural language input demonstrated by DALL·E.


With the generalizability of GPT-3 and a visual sensibility and awareness of aesthetics, these multimodal assistants will possess a new contextual understanding of the world. DALL·E has proven that these language models can produce convincing, compelling images from natural language inputs alone, across a wide variety of domains. Ask DALL·E for a chair that looks like an avocado, and avocado-inspired chairs are exactly what you’ll get; in fact, you’ll get dozens of variations. And just as DALL·E was pretrained on a dataset of image-text pairs, it stands to reason that we’ll see similar efforts leveraging datasets that pair 3D models with text, along with other modalities.

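To make the text-and-3D idea concrete, here is a hypothetical sketch of how a caption and mesh geometry might be flattened into a single token stream for a DALL·E-style autoregressive model. The coordinate quantization, separator token, and vocabulary here are my own illustrative assumptions, not a published method.

```python
# Hypothetical sketch: serialize a (caption, mesh) pair as one token sequence,
# analogous to how DALL·E serializes text tokens followed by image tokens.
# Bin count, coordinate range, and the <sep> token are assumptions.

def quantize(value: float, low: float = -1.0, high: float = 1.0, bins: int = 256) -> int:
    """Map a continuous mesh coordinate onto one of `bins` discrete token ids,
    the same discretization trick that turns images into a token vocabulary."""
    clamped = min(max(value, low), high)
    return int((clamped - low) / (high - low) * (bins - 1))

def encode_pair(caption, vertices):
    """Flatten '<caption tokens> <sep> <vertex tokens>' into one sequence
    a GPT-style model could be trained on autoregressively."""
    tokens = caption.lower().split() + ["<sep>"]
    for x, y, z in vertices:
        tokens += [f"v{quantize(c)}" for c in (x, y, z)]
    return tokens

sequence = encode_pair("a small pyramid", [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
```

A model trained on enough such sequences could, in principle, be prompted with the caption tokens alone and asked to continue the sequence, generating geometry from a description the same way DALL·E generates pixels.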

The immense potential is immediately apparent. In a recent interview, Sam Altman, the CEO of OpenAI, confirmed that the upcoming GPT-4 will further develop multimodal capabilities in addition to the language and code generation available today. We’re also seeing similar multimodal models from Google and Facebook, as well as from open-source projects.

So, we’re at a point where VR hardware is achieving an increasingly high degree of quality and value. The Oculus Quest 2 is cheaper than most smartphones, and it nearly delivers on the promise of an all-in-one, complete VR experience. At the same time, a significant lack of diverse, quality content on the platform is holding back widespread adoption. A year ago, many people hadn’t even heard the word Metaverse; now it’s trending on every tech blog. The Metaverse: this new world where we’ll play and work. The question is, who gets to build it?

This might seem like we’re at an impasse, but we’re actually approaching an inflection point. Necessity breeds innovation, and in our pursuit of the metaverse, we’ll bring together the puzzle pieces we need to build our own Holodeck Assistants.

What seemed so distant to me as a young boy now seems tantalizingly close, but we’ll never realize the ultimate potential of VR without widely available and accessible creation tools. There’s a whole Metaverse that needs to be built, and this new world should be built by everyone.


Natural language is the most universal and accessible instruction language; it’s given to us as children by our families. With natural language programming, the Holodeck will provide equitable access to the tools needed to design and build new worlds. Our new assistants will possess the artistic sensibilities of a Venetian master and incredible polymathic coding prowess, and they’ll collaborate with us to tailor new worlds to our exact specifications and help us all realize our Metaverse dreams.