A new era brings new paradigms

Apple revealed the Apple Vision Pro at the 2023 WWDC. This [don't call it a] headset has been long expected by Apple enthusiasts, developers, and analysts. Post-reveal, and pre-launch, a lot of time is being spent trying to figure out whether this product will be a success. Only the future will tell. A more interesting question we can ask today is: based on the material released so far, what can we infer about Apple's vision for the spatial computing paradigm?

Apple has a track record of defining the product paradigm that goes on to become dominant in a category. AirPods and the Apple Watch are recent examples, but the most famous one is of course the iPhone. Apple introduced a device based on core ideas that a) stayed consistent and b) became dominant:

  1. Internet connected computer
  2. Capacitive multi-touch
  3. All glass front with no keyboard
  4. Touch-centric UI with direct manipulation

You might quibble with some of my choices, but most people would broadly agree that Apple's vision for a smartphone set the paradigm for the rest of the market. Apple spent a lot of time thinking about these ideas; they were there at the initial reveal, and Apple have never strayed from that vision. In this post I'm going to have a stab at identifying the ideas that are at the core of the Vision Pro and hence (I believe) will set the paradigm for spatial computing.

Why spatial computing?

Why isn't this Apple's paradigm for AR/VR/the metaverse? Apple has studiously avoided acknowledging the fact that the Vision Pro is a high-end VR headset. They didn't show any typical VR games or metaverse experiences. In fact, they didn't show any examples of people moving through a virtual world - that's bread and butter for VR, and it's the promise that Zuckerberg has spent $35B betting on.

My bet is that Apple don't believe in delivering VR experiences, ever. People will build them for Apple's hardware, but Apple aren't going to spend any time figuring out the right way to move through the metaverse.

Apple also spent years developing features in ARKit that would have allowed them to demonstrate use cases of virtual objects augmenting a field of view as you move around in the real world. They didn't show any examples of that either. Everything they did show was centered around a static user. My take is that Apple consider AR a secondary use case, and one that is not yet ready for prime time. I'm with them - I don't think the form factor is anywhere near the point where you'd see normal people walking down the street wearing one. But it is on their roadmap further down the road (no pun intended).

So what is spatial computing? Here's my definition1 for the purposes of this post:

Bringing virtual objects into a physical space around the user for them to interact with.

And the subset of this vision that we can expect to be delivered by the Vision Pro is:

  1. Virtual objects == mainly 2D planes
  2. Physical space == home or office room
  3. User == a person sitting or standing still
  4. Interact == indirect hand gestures

But these constraints are not the core ideas of this paradigm. Core ideas aren't defined by what we can do today; they are independent of that. So what are the ideas that I think will remain central to Apple's spatial computing efforts for the long term? Like the iPhone, I think there are four.

Your eyes can still be seen

Apple spent a significant part of their effort ensuring that you can see the user's eyes when they are wearing a Vision Pro.

Whether EyeSight (LOL) in its current form creeps you out or not is irrelevant. Apple have stated that if you are going to use their device, other humans must be able to look you in the eye. Not cartoon eyes... your real eyes (or at least a 1:1 virtual puppet of your real eyes *cough*). Eyes convey so much of the emotion, nuance, and sub-channel of human communication. Steve Jobs famously described Apple products as being at the intersection of Technology and Liberal Arts. I think Tim Cook's Apple is at the intersection of Technology and Humanness. To be unable to see a person's eyes when you are communicating with them is to rob them of some part of their humanness. It isolates them from you. I don't think we will ever see a spatial computing device from Apple that does this2.

You point with your eyes and gesture indirectly

A lot of the AR/VR experiences we have seen involve reaching out into the 3D space with your virtual hand and grabbing a virtual object to manipulate it directly. This makes sense when you a) cannot accurately track where the eye is looking and b) want the user to feel present in a virtual world they are moving through.

Apple instead have opted for a 2-phase interaction model. First, you look at the virtual object to 'select' it. Second, you perform a hand gesture (by your side) that manipulates the object. You don't ever reach out and try to touch the virtual object in 3D space.

The easiest way to understand this is via the example of resizing a floating screen. Step 1: you look at the grab handle that appears in the corner of the floating panel. Step 2: you pinch your fingers together and drag them in the direction that would make the panel bigger. Step 3: you see the panel getting bigger in real time as you move your fingers. This is basically what you do with a mouse when making a window larger on a 2D screen, except here it's in 3D and you aren't holding a bit of plastic. It's an incredibly hard feat to pull off well, and by all accounts Apple have achieved it. Again, they've spent a lot of time getting this specific part of the paradigm right, so I think we can assume it's gonna stick around for the long term. If you see a virtual hand ... they blew it.
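The two-phase model can be sketched as a small state machine: gaze selects a target, and an indirect pinch-drag manipulates whatever is selected. This is a minimal Python illustration of the interaction flow described above, not the visionOS API; all class and method names here are hypothetical.

```python
# Sketch of the gaze-then-gesture interaction model (hypothetical names,
# for illustration only -- not Apple's actual API).
from dataclasses import dataclass


@dataclass
class Panel:
    """A floating 2D panel with a resizable frame."""
    name: str
    width: float
    height: float


class GazeAndPinchSession:
    def __init__(self):
        self.selected = None   # set in phase 1 by gaze
        self.pinching = False  # set in phase 2 by the hand gesture

    def gaze_at(self, panel):
        # Phase 1: looking at an object selects it. No reaching required.
        # Gaze changes are ignored mid-gesture so the target stays stable.
        if not self.pinching:
            self.selected = panel

    def pinch_begin(self):
        self.pinching = True

    def pinch_drag(self, dx, dy):
        # Phase 2: the hand can rest at your side; only the relative drag
        # of the pinched fingers matters, like a mouse delta.
        if self.pinching and self.selected is not None:
            self.selected.width += dx
            self.selected.height += dy

    def pinch_end(self):
        self.pinching = False


# Usage: resize a floating panel by looking at it, then pinch-dragging.
session = GazeAndPinchSession()
panel = Panel("Safari", 100.0, 80.0)
session.gaze_at(panel)        # step 1: look at the grab handle
session.pinch_begin()         # step 2: pinch fingers together
session.pinch_drag(20.0, 10.0)  # step 3: panel grows as the fingers move
session.pinch_end()
print(panel.width, panel.height)  # 120.0 90.0
```

The key design point the sketch captures is the decoupling: gaze carries the "where" and the hand carries the "how much", so neither has to do the other's job.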

The real space looks real

Ironically many would have predicted that this part of the paradigm was going to be 'the virtual space looks real'. Instead, Apple spent most of their hardware engineering effort on making the real space around you look real. Not only in terms of resolution, but crucially also in terms of latency. A pair of ordinary spectacles already does this but to date, no consumer-level headset has been able to make the same claim. Once you set this bar, there is no going back. Apple will never release a device or form factor that doesn't achieve this. Of all the core ideas presented here, this more than anything is going to limit the form factors that Apple will pursue. With today's technology this means a goggles-based headset, no light leakage, an external battery pack, and a bunch of cutting edge silicon.

I'm just going to come straight out and say it: I think the Vision Pro looks goofy. For my money, you will have to get pretty close to ordinary spectacles before this feeling goes away and they become either a) cool and/or b) nondescript. I spent some time trying to find existing eyewear that could theoretically block out (most) light and that also looks cool. I came up with nothing. The closest we get is some of the eyewear worn in professional sports like basketball, wraparound ballistic shades, or large wraparound sunglasses. Never say never, but those aren't cool now, and they probably aren't going to be cool any time soon.

This brings me to 2 predictions on the end state of spatial computing for Apple. Either we get a light-projection-based system that doesn't require any headset, or the headset abandons the need to block out light. If we go down the latter route, it strikes me that we've already seen an approximation of this end state in movies.

You'll notice that Peter Parker's glasses (the EDITH glasses from Spider-Man: Far From Home) implement the core ideas presented so far. You can see his eyes. He points with his eyes and gestures indirectly, and the real space looks real.

Which brings me to the final, and most controversial of the core ideas I think the Vision Pro embodies.

You remain grounded in the real world

The problem with Peter Parker's glasses, indeed any glasses, is that the non-augmented world creeps in at the edges. The Vision Pro can (as far as I can tell3) dial the immersion knob all the way to 100% so that no part of the outside world is shown. But in the vast majority of experiences there are affordances that mean you won't get a sense of translocation. Instead, I think the effect Apple are reaching for with much of what they showed in the keynote was more like a supercharged version of the Stagecraft wraparound screen4. Yes it's deeply immersive, yes it is intellectually expansive and freeing, but no you aren't confused about where you are standing and there isn't a panic-inducing dislocation when you take the device off.

Remaining grounded in your space has 2 big downsides and 1 upside. The first downside is that you say goodbye to a lot of the promise of the metaverse. The ability to feel you are existing in a different space than the one you reside in, without constraints, without limitations. To feel a translocation to the point that your brain is 100% convinced that the virtual reality is the reality you are experiencing. Remaining grounded is essentially saying no to that experience.

The second downside is that you limit the ability to simulate movement in a space. This is one of the challenges to solve in VR gaming: how to convincingly move the character through the world without making the player feel sick or disorientated. Typically this is achieved with a controller and some amount of leaping about in your physical space. My take is that Apple's vision for spatial computing doesn't include these types of experiences. They always want you to feel grounded in your space. They want the real world to seamlessly bleed through, even when the immersion knob is fully cranked. I'm not quite sure how this balance will work for some activities I do believe Apple intend to introduce, e.g. rowing across a lake in Apple Sports. Nevertheless, I think there's a way for Apple to make this experience incredible while keeping you grounded in your space.

Which brings us to the upside. If you want the user to feel grounded in their space then a bit of that space bleeding in at the edges most or all of the time is actually a good thing. Knowing where your feet are placed. Knowing that you're sitting on a rowing machine. Accepting this allows you to introduce a glasses form factor when the technology catches up. You don't need to block out all the light or wrap the screens around the complete field of view so long as you still deliver a beautiful experience.

I know this is speculative, but I think this is the reason for the immersion dial. I think a future version of Apple's device for spatial computing won't be able to completely immerse you in a virtual world, and Apple will have been building an ecosystem that doesn't lean into that capability. Yes, I think it would be cool to be able to cut myself off from all the other passengers on the plane, but no, I don't see a world where most of the passengers are rocking a goofy pair of ski goggles on their faces to make that happen. I can imagine a plane half full of people wearing Peter Parker's glasses.

Summary of core ideas

We may be at the dawn of spatial computing. At the very least, we've seen a spatial computing product that represents the fully-baked ideas that Apple have been working on for many years. Their vision is both well executed and contrarian. If the Vision Pro succeeds, I believe a lower priced Apple Vision, or a future Apple Glasses can expect to also embody these ideas:

  1. Your eyes can still be seen
  2. You point with your eyes and gesture indirectly
  3. The real space looks real
  4. You remain grounded in the real world

Only time will tell if Apple have got it right, again.

  1. According to Wikipedia, Spatial computing was defined in 2003 by Simon Greenwold as "human interaction with a machine in which the machine retains and manipulates referents to real objects and spaces". This definition seems to put the computer's understanding of the physical space as the primary feature, rather than what the user interacts with. The reason I didn't use this definition is that I don't think that's the most important part of the story for Apple.
  2. Yes, I know it goes into swirly cloud mode when you are fully immersed. That's a signifier to indicate that you should not communicate with the user.
  3. I haven't had this confirmed. It would help my argument if Apple maxed it out at 95%!
  4. Used for filming The Mandalorian, among others.