Author: Tom Colvin · Published: 24 January 2025 · Topics: AI, Android App Development, Compose, Computer Vision, Kotlin
Image credit: Steve Johnson
Last week we looked at the basics of the CameraX library. That laid the foundations for something really exciting … AI vision! Now we can use your Android device to interpret and understand the physical world around you.
AI vision has incredible potential, like recognising what’s in a photo, separating/masking areas in images, or recognising body poses, smiles and other gestures. And this can all run on your phone — no need for internet access, or to pass your camera data to a third party.
Building a hand gesture recogniser
In this article we’re going to demo this by building a hand gesture recogniser. Like this:
CameraX delivers a live stream to an AI, which recognises my gestures
In the example, my hand gestures are being turned into emojis. We’ll use two libraries for this:
- MediaPipe is a great cross-platform library for on-device AI tasks. It has a vast range of uses, including driving AI models for audio, video and text tasks. We’re going to use it with the gesture_recognizer model to recognise the hand gestures.
- CameraX is the Android Jetpack library which makes using camera features on Android much, much easier. (If you’ve ever tried to build a camera experience without it, you’ll know what I mean.)
We’ll start with the CameraX side of things and get a live stream of frames from the camera. Then we’ll use MediaPipe to asynchronously deliver those frames to the AI model for its analysis. And, of course, all processing will be done on the device.
A working app demonstrating all the code in this article is available here: https://github.com/tdcolvin/MediaPipeCameraXDemo
Reminder: CameraX’s use cases
CameraX is good for four specific tasks, which it calls use cases. In my last article we built a sample which used two of those use cases:
- The Preview use case allows us to display what the camera is pointing at (like a viewfinder)
- The Image Capture use case allows us to take photos
The other two use cases are Video Capture (for, um, capturing video) and Image Analysis (for receiving a live stream of video frames).
We’re going to use the Image Analysis use case here. The live stream of frames it provides will be delivered directly to MediaPipe, which will in turn use those frames to crank the handle on our AI model.
Step 1: Add CameraX library and set up the Preview use case
My last article showed you how to add the CameraX dependency to Gradle, how to add the PreviewView, how to link it to the Preview use case, and how to bind all that together using a CameraProvider.
If you’ve not used CameraX at all before then I’d recommend you start with that. There are some concepts to get your head around as a prerequisite to this article.
Here, we’ll start with a version of the CameraPreview composable that we built previously. It’s the same as before, but for simplicity I’ve taken out support for camera switching, zoom, and image capture:
View this code snippet on GitHub.
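If you just want the shape of it without opening the repo, here’s a minimal sketch of a preview-only CameraPreview composable. Treat it as my reconstruction rather than the repo’s exact code: the property names, the LaunchedEffect wiring and the hard-coded back camera are all assumptions, and permission handling is omitted.

```kotlin
import androidx.camera.core.CameraSelector
import androidx.camera.core.Preview
import androidx.camera.lifecycle.ProcessCameraProvider
import androidx.camera.view.PreviewView
import androidx.compose.runtime.*
import androidx.compose.ui.Modifier
import androidx.compose.ui.platform.LocalContext
import androidx.compose.ui.platform.LocalLifecycleOwner
import androidx.compose.ui.viewinterop.AndroidView
import androidx.core.content.ContextCompat

@Composable
fun CameraPreview(modifier: Modifier = Modifier) {
    val context = LocalContext.current
    val lifecycleOwner = LocalLifecycleOwner.current

    // The Preview use case: it just needs a surface to draw on,
    // which the PreviewView below will provide
    val previewUseCase = remember { Preview.Builder().build() }

    var cameraProvider by remember { mutableStateOf<ProcessCameraProvider?>(null) }

    fun rebindCameraProvider() {
        cameraProvider?.let { provider ->
            provider.unbindAll()
            provider.bindToLifecycle(
                lifecycleOwner,
                CameraSelector.DEFAULT_BACK_CAMERA,
                previewUseCase
            )
        }
    }

    LaunchedEffect(Unit) {
        // Fetch the camera provider asynchronously, then bind our use case(s)
        val providerFuture = ProcessCameraProvider.getInstance(context)
        providerFuture.addListener({
            cameraProvider = providerFuture.get()
            rebindCameraProvider()
        }, ContextCompat.getMainExecutor(context))
    }

    AndroidView(
        modifier = modifier,
        factory = { ctx ->
            PreviewView(ctx).also { previewView ->
                // Route the camera's output into the PreviewView
                previewUseCase.setSurfaceProvider(previewView.surfaceProvider)
            }
        }
    )
}
```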
Step 2: Create the ImageAnalysis use case
Next we need the ImageAnalysis use case to get that live video stream. Like other use cases in CameraX, it’s created using a builder pattern:
View this code snippet on GitHub.
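The repo has the real version; as a rough sketch, the builder call can look like the following. Wrapping it in a helper function, the STRATEGY_KEEP_ONLY_LATEST backpressure strategy and the main-thread executor are my choices, not necessarily what the linked snippet does.

```kotlin
import android.content.Context
import androidx.camera.core.ImageAnalysis
import androidx.camera.core.ImageProxy
import androidx.core.content.ContextCompat

fun buildImageAnalysisUseCase(
    context: Context,
    imageAnalyzer: (ImageProxy) -> Unit
): ImageAnalysis =
    ImageAnalysis.Builder()
        // If the analyzer can't keep up, drop stale frames rather than queueing them
        .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)
        .build()
        .also { analysis ->
            // CameraX will call this for every frame it delivers
            analysis.setAnalyzer(ContextCompat.getMainExecutor(context)) { imageProxy ->
                imageAnalyzer(imageProxy)
            }
        }
```

The KEEP_ONLY_LATEST strategy matters for this use case: the model will be slower than the camera’s frame rate, so we want the freshest frame each time rather than a growing backlog.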
That imageAnalyzer function is going to be called by CameraX whenever there’s an image ready for us to process. Later, we’ll use it to call the MediaPipe code. For now we’ll just add an empty implementation into our view model:
View this code snippet on GitHub.
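Something along these lines, where the class name is mine and only the close() call matters for now:

```kotlin
import androidx.camera.core.ImageProxy
import androidx.lifecycle.ViewModel

class CameraViewModel : ViewModel() {
    fun imageAnalyzer(imageProxy: ImageProxy) {
        // TODO: hand this frame to MediaPipe (step 7)

        // Closing the ImageProxy tells CameraX we're finished with this frame
        // and it's free to deliver the next one
        imageProxy.close()
    }
}
```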
We need to close() the image that we’re given so that CameraX knows we’re ready for the next one.
Step 3: Bind the ImageAnalysis use case
Now that we’ve created the ImageAnalysis use case, we need to get CameraX to actually use it. The CameraProvider.bindToLifecycle(…) function is the glue which binds together a physical camera with the use cases, against a particular lifecycle. In our demo app, that is called by the CameraPreview composable’s rebindCameraProvider function. And so we must pass our ImageAnalysis use case to that composable:
View this code snippet on GitHub.
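Continuing the step 1 sketch, the change amounts to a new parameter plus one extra argument in the bind call (parameter names are again my own):

```kotlin
@Composable
fun CameraPreview(
    modifier: Modifier = Modifier,
    imageAnalysisUseCase: ImageAnalysis? = null   // new: supplied by the caller
) {
    // ... previewUseCase, cameraProvider etc. as in the step 1 sketch ...

    fun rebindCameraProvider() {
        cameraProvider?.let { provider ->
            provider.unbindAll()
            provider.bindToLifecycle(
                lifecycleOwner,
                CameraSelector.DEFAULT_BACK_CAMERA,
                // Preview and ImageAnalysis bound together, against the same lifecycle
                *listOfNotNull(previewUseCase, imageAnalysisUseCase).toTypedArray()
            )
        }
    }

    // ... LaunchedEffect and AndroidView as before ...
}
```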
Great! Now, if we run the app we’ll see a camera preview.
…And more importantly, we can see that there’s a stream of frames being blasted at our imageAnalyzer:
LogCat entries showing that our imageAnalyzer is called
Woohoo! That’s the CameraX bit done. Now let’s get MediaPipe to detect hand gestures.
Step 4: Add the MediaPipe dependencies
The MediaPipe library we’re going to use is tasks-vision, which is added to Gradle like so:
View this code snippet on GitHub.
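For example (Kotlin DSL shown; check MediaPipe’s releases for the current version number):

```kotlin
// app/build.gradle.kts
dependencies {
    implementation("com.google.mediapipe:tasks-vision:0.10.20")
}
```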
Aside: MediaPipe dependency version problem
At the time of writing, the current MediaPipe version is 0.10.20. For some reason, a few years ago, a version of the MediaPipe libraries was released to Maven with a version number formed from the date: 0.20230731. This was probably a mistake — but whatever the reason, it means that Android Studio thinks that one is the latest version:
Because 20230731 > 10, right? Nope. Don’t be tempted to accept this change: it will break things. And you’ll have to manually check the MediaPipe releases page for new versions, because until v1.0.0 comes out, Android Studio is always going to get it wrong.
Step 5: Add the AI model
MediaPipe is just the chauffeur. It will need to be given a car to drive — that is, an AI model to input to and receive output from.
MediaPipe will accept almost any AI model built for LiteRT (formerly known as Tensorflow Lite), though obviously some might be too big to fit in a phone’s memory. HuggingFace is a good resource for downloadable models as you can search by library support.
For hand gesture recognition, I’m using gesture_recognizer.task. I found this in the MediaPipe samples, where it was provided without proper credit to the original author (although perhaps it was created specifically for that sample). If you know who it belongs to, let me know so I can credit!
MediaPipe expects to find this file in an asset directory / source set, so we’ll put it there.
Video vs image models
Our gesture_recognizer.task operates on single, static images. It’s also, therefore, fine for video, since a video is just a stream of single images. We will run the model separately on each frame, and the model itself won’t use or even remember data from previous frames.
Some models are explicitly designed to work on videos. Often these have quite a large memory footprint.
Step 6: Creating the gesture recogniser
We now have the MediaPipe library installed and the AI model in place. To open and use the model, we need a MediaPipe GestureRecognizer instance. In the next steps we’ll pass the image frames to it.
Like CameraX, MediaPipe makes heavy use of the builder pattern. So we build a GestureRecognizerOptions object using this pattern, making use of a BaseOptions object which tells us where the model file is. Then from that GestureRecognizerOptions we’ll create the GestureRecognizer.
The GestureRecognizerOptions methods we’ll use are:
- setRunningMode(RunningMode.LIVE_STREAM) which means that we’ll pass it frames from a live video feed and it’ll send us continuous results asynchronously. (The alternative choice would be RunningMode.IMAGE where we’d feed it a single image and it would give a single answer synchronously).
- setResultListener(…) which specifies a function to be called asynchronously when results are available from the model.
Once the GestureRecognizerOptions instance is built, we can use it to create our GestureRecognizer instance. That’s done using GestureRecognizer.createFromOptions(…):
View this code snippet on GitHub.
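A sketch of that setup is below. The helper function and its callback parameter are my own framing; in the demo app this logic lives in the view model, and the callback ends up being the handleGestureRecognizerResult method we write in step 8.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.core.BaseOptions
import com.google.mediapipe.tasks.vision.core.RunningMode
import com.google.mediapipe.tasks.vision.gesturerecognizer.GestureRecognizer
import com.google.mediapipe.tasks.vision.gesturerecognizer.GestureRecognizerResult

fun createGestureRecognizer(
    context: Context,
    handleGestureRecognizerResult: (GestureRecognizerResult) -> Unit
): GestureRecognizer {
    // BaseOptions points MediaPipe at the model file in our assets directory
    val baseOptions = BaseOptions.builder()
        .setModelAssetPath("gesture_recognizer.task")
        .build()

    val options = GestureRecognizer.GestureRecognizerOptions.builder()
        .setBaseOptions(baseOptions)
        // LIVE_STREAM: we push frames in, and results come back asynchronously
        .setRunningMode(RunningMode.LIVE_STREAM)
        // Called on a MediaPipe worker thread whenever the model produces a result
        .setResultListener { result, _ -> handleGestureRecognizerResult(result) }
        .build()

    return GestureRecognizer.createFromOptions(context, options)
}
```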
Step 7: Delivering the frames to the gesture recogniser
OK, so now we have CameraX delivering its frames to an imageAnalyzer function (from step 2), and a gesture recogniser ready to analyse an image.
Let’s join those two ends together, so the gesture recogniser gets its frames!
The images we get from CameraX are going to be in the camera’s natural orientation — which is not necessarily the way round that the phone/tablet is being held. So we need to rotate them to match the orientation of the device.
Also, modern cameras produce pictures which are way too big for most AI tasks. Generally, AI models — particularly LiteRT ones — run on very small images. There’s no need to feed our gesture recogniser anything bigger than, say, 500px. Even that is probably too big. Larger images just add latency as the model has to work harder.
So, we also need to resize the image.
We’ll add that rotation-and-resize step to our imageAnalyzer from step 2:
View this code snippet on GitHub.
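The repo does this properly; as an illustration, a helper like the one below would do the job. The extension function name, the 500px cap and the use of ImageProxy.toBitmap() (CameraX 1.3+) are my assumptions.

```kotlin
import android.graphics.Bitmap
import android.graphics.Matrix
import androidx.camera.core.ImageProxy

// Produce a small, correctly-rotated Bitmap from a CameraX frame.
// The 500px cap is arbitrary; see the discussion above.
fun ImageProxy.toScaledRotatedBitmap(maxDimension: Int = 500): Bitmap {
    val bitmap = toBitmap()   // ImageProxy.toBitmap() is available from CameraX 1.3

    // Shrink (never enlarge) so the longest side is at most maxDimension,
    // then rotate to match the device's current orientation
    val scale = (maxDimension.toFloat() / maxOf(bitmap.width, bitmap.height))
        .coerceAtMost(1f)
    val matrix = Matrix().apply {
        postScale(scale, scale)
        postRotate(imageInfo.rotationDegrees.toFloat())
    }

    return Bitmap.createBitmap(bitmap, 0, 0, bitmap.width, bitmap.height, matrix, true)
}
```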
Finally, we need to pass the processed image to our gesture recogniser:
View this code snippet on GitHub.
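Roughly like this, assuming gestureRecognizer holds the instance we created in step 6 and toScaledRotatedBitmap() is the helper sketched above:

```kotlin
import android.os.SystemClock
import androidx.camera.core.ImageProxy
import com.google.mediapipe.framework.image.BitmapImageBuilder
import com.google.mediapipe.tasks.vision.gesturerecognizer.GestureRecognizer

// Assumed to be set to the instance created in step 6
private var gestureRecognizer: GestureRecognizer? = null

fun imageAnalyzer(imageProxy: ImageProxy) {
    // Shrink and rotate the frame, then wrap it in the MPImage type MediaPipe expects
    val bitmap = imageProxy.toScaledRotatedBitmap()
    val mpImage = BitmapImageBuilder(bitmap).build()

    // In LIVE_STREAM mode each frame needs a monotonically increasing timestamp (ms)
    gestureRecognizer?.recognizeAsync(mpImage, SystemClock.uptimeMillis())

    // Tell CameraX we're done with this frame
    imageProxy.close()
}
```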
When results become available, they will be delivered to our handleGestureRecognizerResult method. So let’s fill that in now.
Step 8: Handling the gesture recogniser results
At this point, images are being delivered to the model, and the model is processing them and providing a response. That response tells us what gestures it’s recognised. So let’s parse those results, and display them on the screen as an emoji.
The response comes in the form of an instance of GestureRecognizerResult, which has a function gestures(). This gives a list of gestures it’s recognised, each of which has a list of possible options for what that gesture might be.
That’s a bit complex. Let’s give an example.
Say it saw this image:
Image credit: Mike Murray
There are two hand gestures there. If our model is doing well, the result will show that there were two gestures recognised.
Let’s say the first one is the gesture on the left. For that, it would hopefully be pretty sure it’s a thumbs up. But AI isn’t perfect, and that thumbs up might be recognised as something else. So those results include a list of gestures it could be, along with a score for each. The score is between 0 and 1, with 1 meaning it’s totally confident. An example here might be:
- Thumb_Up, score = 0.9
- Closed_Fist, score 0.2
- Pointing_Up, score 0.05
Here it’s pretty sure that the gesture is a thumbs up, but it might instead be a closed fist. It’s unlikely to be an index finger pointing up.
And there would be a similar list of possibilities with confidence scores for the second gesture.
Each possible gesture is called a category in AI terms. What we’ll do is pick the first gesture it recognises, and for that gesture we’ll pick the category with the highest confidence score:
View this code snippet on GitHub.
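A sketch of that logic, where the MutableStateFlow backing the UI is my assumption about how the result reaches the screen:

```kotlin
import com.google.mediapipe.tasks.vision.gesturerecognizer.GestureRecognizerResult
import kotlinx.coroutines.flow.MutableStateFlow

// Assumed UI state; in the demo app this logic sits in the view model
private val _gestureCategory = MutableStateFlow("None")

fun handleGestureRecognizerResult(result: GestureRecognizerResult) {
    // gestures() returns one inner list per recognised hand;
    // each inner list holds that hand's candidate categories with scores
    val topCategory = result.gestures()
        .firstOrNull()                  // the first hand, if any was detected
        ?.maxByOrNull { it.score() }    // its highest-confidence category

    _gestureCategory.value = topCategory?.categoryName() ?: "None"
}
```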
Finally we’ll convert that category into an emoji which we’ll display in the UI:
View this code snippet on GitHub.
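For instance, something like the mapping below. The category names are the ones the canned gesture_recognizer model reports; the choice of emoji (and the helper’s name) is of course mine.

```kotlin
// Category names reported by the gesture_recognizer model, mapped to emoji for display
fun categoryToEmoji(categoryName: String): String = when (categoryName) {
    "Thumb_Up" -> "👍"
    "Thumb_Down" -> "👎"
    "Victory" -> "✌️"
    "Pointing_Up" -> "☝️"
    "Open_Palm" -> "✋"
    "Closed_Fist" -> "✊"
    "ILoveYou" -> "🤟"
    else -> ""   // "None", or anything the model didn't recognise
}
```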
And that’s it! Finally our app will detect hand gestures.