Interacting in space


Interacting in space

Natural interactions using hand tracking and gestures


Doug Cook




Readers of our newsletter and visitors to our website know that we love to play at the intersection of design and technology. We’re particularly drawn to products and experiences that explore what we call “emerging interactions.”

These experiences are unique applications of technology that typically require new and less proven ways of interacting with computers, often challenging the status quo. One of the more recent examples of this is Apple’s Vision Pro. But while the Vision Pro may be a technological marvel from a hardware perspective, its most redeeming feature is the interaction paradigm it supports through spatial computing.

Recently, our team set out to explore Apple’s guiding principles for this new paradigm. While we’ve been exploring natural interactions and spatial computing with a few of our partners, much of this work remains under wraps. So the release of Apple’s Spatial Design Guidelines is a great opportunity to talk more about this work.

Designing for spatial input

One of the most compelling aspects of Apple’s visionOS is its support for hand gestures. Not only are these extremely intuitive, they are also simple and discreet.

Tracked using cameras on Apple’s visor, these include affordances for tapping, dragging, and zooming with finger precision.

floating components from the library

Hand tracking and gestures

Using machine learning and a number of readily available models, these interactions can be achieved with a simple web camera and browser.

One of the more popular libraries we’ve used in the past is OpenCV, but Google recently released an even more impressive solution called MediaPipe which is a cross-platform framework for building custom machine learning pipelines with built-in support for streaming media.

Each has its advantages, but both support detailed hand and finger tracking, making them ideal for prototyping these types of interactions.

With MediaPipe, users can detect hand landmarks by applying a machine learning model to a continuous stream, allowing them to plot live coordinates, determine handedness (left/right), and even detect multiple hands.

Trained on over 30,000 images, including both real and synthetic models, Google’s default model tracks 21 hand and knuckle coordinates. By looking at the location of each of these at a given time, it’s possible to infer gestures in real time.

floating components from the library

Pinching and dragging

To test this model, we prototyped a simple pinch and drag gesture. In the following example, pinching grabs the object, while pinching and holding allows you to drag. This gesture is super natural and even feels a little magical in this simple context.

Zooming with two hands

Using these models, we can also support macro hand gestures by detecting both hands and the presence of palms. The following example shows how these can be used to zoom in and out on an object.

You can quickly see how with more explorations we might extend this interaction to support rotation as well.

Stay tuned for more of our experiments around gesture-based interactions and interfaces. Until then, let us know what you’ve been exploring.

Have an idea or interested in learning more? Reach out to us on Instagram or Twitter!

Special thanks to Ky Le, Natalie Vanderveen, and Morgan Gerber

Doug Cook

Doug Cook


Doug is the founder of thirteen23. When he’s not providing strategic creative leadership on our client engagements, he can be found practicing the time-honored art of getting out of the way.

Around the studio