Speaking in gestures
Designing a gesture library using hand tracking
In the right context, hand gestures can be an attractive alternative to on-screen interfaces, especially for XR and wearables. With advances in computer vision and machine learning, it is now easier than ever to recognize gestures with off-the-shelf models, or to train custom models of your own.
Continuing our explorations into natural interactions, our team recently set out to build a small library of gestures based on simple hand signals and finger tracking to help with prototyping on partner projects.
Easy as 1, 2, 3
Google’s MediaPipe libraries provide a great starting point for prototyping these interactions. Specifically, MediaPipe’s Gesture Recognizer provides a quick and accessible set of models for categorizing hand gestures, identifying handedness, and tracking 21 distinct hand landmarks.
To do this, Google’s Gesture Recognizer uses two different models: a hand landmark model and a gesture classification model.
The landmark model detects the presence and geometry of hands, allowing easy identification of palm and finger coordinates, while the gesture classification model uses a two-step neural network pipeline that pairs a gesture embedding model with a gesture classifier.
This may sound like a lot, but out of the box it just kind of works. In fact, a number of gestures are not only easy to detect, several are recognized by default using Google’s own models, making the recognizer great for design prototyping.
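As a quick illustration of how little code is involved, here is a minimal sketch of running the recognizer over a single image with MediaPipe’s Python Tasks API. The image path and `num_hands` value are assumptions for the sketch; `model_asset_path` would point at the `gesture_recognizer.task` model bundle downloaded from Google.

```python
def recognize_image(image_path, model_path="gesture_recognizer.task"):
    """Run MediaPipe's canned Gesture Recognizer on a single image and
    return (gesture_name, score) for the top match, or None if no hand
    is detected."""
    # Imported inside the function so the sketch can be read and defined
    # without MediaPipe installed; `pip install mediapipe` to run it.
    import mediapipe as mp
    from mediapipe.tasks.python import BaseOptions
    from mediapipe.tasks.python import vision

    options = vision.GestureRecognizerOptions(
        base_options=BaseOptions(model_asset_path=model_path),
        num_hands=2,
    )
    recognizer = vision.GestureRecognizer.create_from_options(options)
    result = recognizer.recognize(mp.Image.create_from_file(image_path))
    if not result.gestures:
        return None
    top = result.gestures[0][0]  # best category for the first detected hand
    return top.category_name, top.score
```

Each entry in `result.gestures` holds the ranked gesture categories for one detected hand, so `result.gestures[0][0]` is the best guess for the first hand.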
Identifying fingers and positions
Readers of our last article will remember MediaPipe’s landmark model. That model provides an array of 21 points, each of which is a 3D coordinate in a normalized coordinate system (i.e., screen-independent).
Each landmark is composed of x, y, and z coordinates: x and y give the landmark’s position in the image, while z estimates its depth, with smaller values indicating the landmark is closer to the camera. Using these landmark coordinates, it’s possible to track and recognize a number of basic gestures.
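To make that 21-point array concrete, here is a small framework-free sketch of working with the landmarks directly. The indices follow MediaPipe’s published hand landmark numbering (0 is the wrist, 4 the thumb tip, 8 the index fingertip, and so on); the pinch threshold is an illustrative value, not a MediaPipe constant.

```python
import math

# A few of the most commonly used indices in MediaPipe's 21-point numbering.
WRIST, THUMB_TIP, INDEX_TIP, MIDDLE_TIP, RING_TIP, PINKY_TIP = 0, 4, 8, 12, 16, 20

def pinch_distance(landmarks):
    """Normalized distance between the thumb tip and index fingertip.

    `landmarks` is a list of 21 (x, y, z) tuples in normalized coordinates,
    as produced by MediaPipe's hand landmark model."""
    tx, ty, _ = landmarks[THUMB_TIP]
    ix, iy, _ = landmarks[INDEX_TIP]
    return math.hypot(tx - ix, ty - iy)

def is_pinching(landmarks, threshold=0.05):
    """True when thumb and index tips are close enough to count as a pinch.

    The 0.05 threshold is an illustrative choice, not a MediaPipe default."""
    return pinch_distance(landmarks) < threshold
```

A pinch like this is a handy first gesture to derive yourself, since it only needs two of the 21 points.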
Recognizing signals and directions
Google’s Gesture Recognizer and its default models can detect a number of basic hand signals, including open palm, closed fist, pointing up, thumbs up, thumbs down, victory, and I love you.
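Because the recognizer reports each match as a category name plus a confidence score, a small hypothetical helper like the one below can filter out low-confidence detections. The category names are the ones the default model emits; the cutoff value is our own illustrative choice.

```python
# Category names emitted by MediaPipe's default gesture classification model.
KNOWN_GESTURES = {
    "Open_Palm", "Closed_Fist", "Pointing_Up",
    "Thumb_Up", "Thumb_Down", "Victory", "ILoveYou",
}

def confident_gesture(category_name, score, min_score=0.6):
    """Return the gesture name if it is a known category and the model is
    confident enough, otherwise None.

    The 0.6 cutoff is an illustrative value, not a MediaPipe default."""
    if category_name in KNOWN_GESTURES and score >= min_score:
        return category_name
    return None
```

A gate like this keeps a prototype from flickering between gestures on borderline frames.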
Learning our numbers
When creating our gesture library, we thought a good next step would be to add hand-signaled numbers. Although not recognized by default, the number of raised fingers is easy to detect using MediaPipe’s default landmarks.
We determine how many fingers are up by checking the tip of each finger. In normalized image coordinates, y increases downward, so if a fingertip’s y coordinate falls below that of the finger’s PIP joint (its middle knuckle), the finger counts as closed.
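A minimal sketch of that counting heuristic, assuming landmarks arrive as a list of 21 normalized (x, y, z) tuples (the tip and PIP indices match MediaPipe’s hand landmark numbering):

```python
# Fingertip and PIP-joint indices in MediaPipe's 21-point hand numbering.
FINGERS = {              # finger: (tip index, PIP index)
    "index": (8, 6),
    "middle": (12, 10),
    "ring": (16, 14),
    "pinky": (20, 18),
}

def count_raised_fingers(landmarks):
    """Count raised fingers (thumb excluded) from 21 normalized landmarks.

    Image y grows downward, so a finger is 'up' when its tip sits above
    (has a smaller y than) its PIP joint."""
    raised = 0
    for tip, pip in FINGERS.values():
        if landmarks[tip][1] < landmarks[pip][1]:
            raised += 1
    return raised
```

The thumb is excluded here because it folds sideways rather than down; a fuller version would typically compare its tip and knuckle on the x axis instead.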
Adding the full alphabet
To extend our library, we thought it only natural to try adding support for American Sign Language (ASL)—a natural language that serves as the primary sign language of deaf communities in the United States and Canada. It has a set of 26 signs, known as the American Manual Alphabet, that can be used to spell words from the English language.
Creating a model to recognize ASL is a bit more involved, but you can leverage other models or train your own custom model to use with MediaPipe.
Fortunately, ASL training data is readily available on the web, from custom image libraries to hand shape datasets to pre-built models. For our purposes, we decided to start with a subset of images from a larger training dataset of 87,000 images found on Kaggle.
Using this image set, it was possible to train a custom TensorFlow model for use with MediaPipe.
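Training specifics vary with the dataset and tooling, but as a hedged sketch, one compact approach is to run MediaPipe over each training image to extract landmarks, then classify the flattened landmark vectors with a small Keras network rather than classifying raw pixels. The layer sizes below are illustrative, and the function assumes TensorFlow is installed when called.

```python
def build_asl_classifier(num_classes=26):
    """A small Keras model that classifies a flattened 21x3 landmark vector
    into one of `num_classes` ASL letters. Layer sizes are illustrative."""
    # Imported inside the function so the sketch can be defined without
    # TensorFlow installed; `pip install tensorflow` to run it.
    from tensorflow import keras

    return keras.Sequential([
        keras.layers.Input(shape=(21 * 3,)),   # x, y, z for each landmark
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
```

Working from landmarks instead of pixels keeps the model tiny and largely insensitive to lighting and background, at the cost of depending on the landmark model’s accuracy.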
Another option would have been to use MediaPipe to capture original images or landmark coordinates for each sign to create our own dataset. Either of these methods would have worked, and we explored both, but since larger datasets tend to yield better results, and the datasets on Kaggle were more than adequate, we ultimately decided not to reinvent the wheel.
We have a few more things in the works, but in the meantime, be sure to check out our first writeup on gesture-based interactions.