Speaking in gestures


Designing a gesture library using hand tracking


Doug Cook




In the right context, hand gestures can be an attractive alternative to on-screen interfaces, especially for XR and wearables. With advances in computer vision and machine learning, it is now easier than ever to train your own custom models to recognize these gestures.

Continuing our explorations into natural interactions, we recently set out to build a small library of gestures based on simple hand signals and finger tracking to aid our prototyping on partner projects.


Easy as 1, 2, 3

Google’s MediaPipe libraries provide a great starting point for prototyping these interactions. Specifically, MediaPipe’s Gesture Recognizer provides a quick and accessible set of models for categorizing hand gestures, identifying handedness, and tracking 21 different hand landmarks.

[Image: four hands held up with blue and white nodes outlining the digits]

To do this, Google’s Gesture Recognizer uses two different models: a hand landmark model and a gesture classification model.

The landmark model detects the presence of hands and their geometry to identify palm and finger coordinates, while the classification model uses a two-step neural network to detect and recognize gestures.

That may sound like a lot, but out of the box it just works. A number of gestures are not only easy to recognize—several are recognized by default using Google's own models, making it great for design prototyping.

Identifying fingers and positions

Readers of our last post will remember MediaPipe’s landmark model. That model provides an array of 21 points, each of which is a 3D coordinate in a normalized, screen-independent coordinate system.

Each landmark is composed of x, y, and z coordinates. x and y give the landmark’s normalized position in the image, while z estimates depth, with smaller values indicating the landmark is closer to the camera. Using these landmark coordinates, it’s possible to track and recognize a number of basic gestures.
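To make this concrete, here is a minimal sketch of working with those landmarks in Python. The `Landmark` class and `pinch_distance` helper are illustrative, not MediaPipe's own API; only the landmark index numbers (0 for the wrist, 4 for the thumb tip, 8 for the index fingertip) follow MediaPipe's hand model.

```python
from dataclasses import dataclass
import math

# A few of the 21 landmark indices used by MediaPipe's hand model.
WRIST, THUMB_TIP, INDEX_TIP = 0, 4, 8

@dataclass
class Landmark:
    x: float  # normalized horizontal position (0..1, left to right)
    y: float  # normalized vertical position (0..1, top to bottom)
    z: float  # relative depth; smaller values are closer to the camera

def pinch_distance(landmarks: list[Landmark]) -> float:
    """Normalized 2D distance between thumb tip and index fingertip,
    a simple building block for detecting a pinch (hypothetical helper)."""
    a, b = landmarks[THUMB_TIP], landmarks[INDEX_TIP]
    return math.hypot(a.x - b.x, a.y - b.y)
```

A pinch detector might then simply threshold this distance against a small value such as 0.05.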

[Image: four hands making symbols]

Recognizing signals and directions

Google’s Gesture Recognizer and its default models can detect a number of basic hand signals, including open palm, closed fist, pointing up, thumbs up, thumbs down, victory, and “I love you.”
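Running the recognizer on a still image takes only a few lines. The sketch below assumes `pip install mediapipe` and the pre-trained `gesture_recognizer.task` model file from Google; the `recognize_gesture` helper and its default paths are our own illustration, while the category labels are the ones the default model returns.

```python
# The canned gesture categories returned by Google's default model.
DEFAULT_GESTURES = [
    "None", "Closed_Fist", "Open_Palm", "Pointing_Up",
    "Thumb_Down", "Thumb_Up", "Victory", "ILoveYou",
]

def recognize_gesture(image_path: str,
                      model_path: str = "gesture_recognizer.task") -> str:
    """Return the top gesture label for the first detected hand.
    Requires mediapipe and the downloaded .task model file."""
    import mediapipe as mp
    from mediapipe.tasks import python as mp_tasks
    from mediapipe.tasks.python import vision

    options = vision.GestureRecognizerOptions(
        base_options=mp_tasks.BaseOptions(model_asset_path=model_path)
    )
    recognizer = vision.GestureRecognizer.create_from_options(options)
    result = recognizer.recognize(mp.Image.create_from_file(image_path))
    if not result.gestures:
        return "None"  # no hand detected
    return result.gestures[0][0].category_name  # highest-scoring category
```

The same options object can also be configured for video or live-stream modes when prototyping against a webcam.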

Learning our numbers

In creating our gesture library, we thought a good next step would be to add hand-signaled numbers. Although numbers aren’t recognized by default, the count of raised fingers is easy to derive from MediaPipe’s default landmarks.

The model determines how many fingers are raised by checking the tip of each finger. Because y increases downward in MediaPipe’s normalized image coordinates, a fingertip whose y coordinate is greater than that of the finger’s middle (PIP) joint sits lower in the frame, and the model counts that finger as closed.
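A sketch of that check, written as a hypothetical helper rather than the library's actual code. The tip/PIP landmark index pairs follow MediaPipe's hand model; the thumb is deliberately skipped here, since it flexes sideways and would need an x-axis comparison instead.

```python
# (tip, PIP) landmark index pairs for the index, middle, ring, and pinky
# fingers in MediaPipe's 21-point hand model.
FINGER_JOINTS = [(8, 6), (12, 10), (16, 14), (20, 18)]

def count_raised_fingers(landmarks) -> int:
    """Count raised fingers given 21 (x, y) landmark tuples.
    y grows downward, so a raised fingertip has a SMALLER y than its
    PIP joint. The thumb is ignored (it needs a sideways x-axis check)."""
    raised = 0
    for tip, pip in FINGER_JOINTS:
        if landmarks[tip][1] < landmarks[pip][1]:  # tip above its PIP joint
            raised += 1
    return raised
```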

Adding the full alphabet

To extend our library, we thought it only natural to try adding support for American Sign Language (ASL)—a natural language that serves as the primary sign language of deaf communities in the United States and Canada. ASL has a set of 26 signs, known as the American Manual Alphabet, that can be used to spell words from the English language.

[Image: first eight letters for fingerspelling]

Creating a model to recognize ASL is a bit more involved, but you can leverage other models or train your own custom model to use with MediaPipe.

Fortunately, ASL training data is readily available on the web, from custom image libraries to hand shape datasets to pre-built models. For our purposes, we decided to start with a subset of images from a larger training dataset of 87,000 images found on Kaggle.

Using this image set, it was possible to train a custom model for MediaPipe using TensorFlow.
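As a rough sketch of what that training flow can look like with MediaPipe Model Maker (the folder name, split ratios, and export directory below are illustrative, not our exact setup):

```python
import string

def expected_label_folders() -> list[str]:
    """Image folder layout Model Maker expects: one subfolder per label,
    plus a mandatory 'none' class of non-gesture background images."""
    return ["none"] + list(string.ascii_uppercase)

def train_asl_recognizer(data_dir: str = "asl_images"):
    """Train and export a custom gesture model.
    Untested sketch; requires `pip install mediapipe-model-maker`."""
    from mediapipe_model_maker import gesture_recognizer

    data = gesture_recognizer.Dataset.from_folder(
        dirname=data_dir,
        hparams=gesture_recognizer.HandDataPreprocessingParams(),
    )
    train_data, rest = data.split(0.8)
    validation_data, test_data = rest.split(0.5)

    model = gesture_recognizer.GestureRecognizer.create(
        train_data=train_data,
        validation_data=validation_data,
        options=gesture_recognizer.GestureRecognizerOptions(
            hparams=gesture_recognizer.HParams(export_dir="exported_model")
        ),
    )
    model.export_model()  # writes a .task file for the Gesture Recognizer
    return model
```

The exported `.task` file can then be dropped into the same Gesture Recognizer setup used for the default gestures.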

Another option would have been to use MediaPipe to capture images or landmark coordinates for each sign to create our own dataset. Either of these methods would have worked, and we explored both, but since larger datasets tend to yield better results, and the datasets on Kaggle were more than adequate, we ultimately decided not to reinvent the wheel.

We have a few more things in the works, but in the meantime, be sure to check out our first writeup on gesture-based interactions.

Have an idea or interested in learning more? Feel free to reach out to us on Instagram or Twitter!

Special thanks to Natalie Vanderveen, Jaden Flores, and Morgan Gerber

Doug Cook


Doug is the founder of thirteen23. When he’s not providing strategic creative leadership on our client engagements, he can be found practicing the time-honored art of getting out of the way.
