Communicating Through Fingertips – Finger Gesture Recognition Using Depth Data
In Prof. Vishy’s ML class (CS 590 – top notch course, top notch professor), we don’t have a final; instead, we are supposed to apply ML to a problem we find interesting. Microsoft gave all of us interns a Kinect this summer, so I decided to put it to some use (I don’t have a TV, so the Xbox is just collecting dust).
My goal was to record finger gestures and then detect them when a user makes them. I had 2 constraints in mind: no OpenCV (i.e. I would use just depth data) and no wearing special stuff to guide anything.
So, let us see what I did. Basically, I used the CandescentNUI Hand Tracker to get a collection of fingertip locations and points and then applied two techniques to try and recognize the gestures we make.
First, I tried using the Passive-Aggressive algorithm by Crammer et al. This algorithm uses an online-learning approach to build a hyperplane. In 3 dimensions a hyperplane is a plane, in 2 dimensions it is a line, and so on. Basically, this is what you get when you try to define a “surface”-like structure for a space: take 2 non-parallel vectors in 3D space and together they span an entire 2D plane. The hyperplane is just that, a subspace with 1 dimension less than the space we are operating in.
The hyperplane is supposed to act like a brick wall (if we’re in 3D – no point visualizing a higher dimension). When we see a new data point come in, we want to inspect on which side of the wall it lies and then we can “detect” or label this point. This is the binary classifier.
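For the curious, here is a minimal sketch of what the basic passive-aggressive update looks like for a binary classifier. This is not the code from the project; the function names (`pa_train`, `pa_predict`), the number of epochs, and the trick of appending a constant 1 to each sample as a bias term are all assumptions made for illustration.

```python
def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def pa_train(samples, labels, epochs=5):
    """Learn a weight vector w with the basic passive-aggressive update.

    samples: list of feature lists (hypothetical layout; append a constant
             1.0 to each sample if you want a bias term)
    labels:  list of +1 / -1 class labels
    """
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            # Hinge loss: zero when the point is on the correct side
            # of the "wall" with a margin of at least 1.
            loss = max(0.0, 1.0 - y * dot(w, x))
            norm_sq = dot(x, x)
            if loss > 0.0 and norm_sq > 0.0:
                # Passive when loss is 0, aggressive otherwise: move w
                # just far enough to fix this one example.
                tau = loss / norm_sq
                w = [wi + tau * y * xi for wi, xi in zip(w, x)]
    return w


def pa_predict(w, x):
    """Label a point by which side of the hyperplane it falls on."""
    return 1 if dot(w, x) >= 0.0 else -1
```

Calling `pa_train` on a list of labelled points returns the weight vector `w`, and `pa_predict(w, x)` is the "which side of the wall" check described above.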
The dataset consists of raw point coordinates in the space of the human palm seen by the Kinect. Now it turns out that the online passive-aggressive algorithm fails to construct a decent hyperplane separating 2 classes (data points for 2 different gestures).
The obvious hack was to deploy a nearest neighbors classifier. The trick I used was to run k-means with a large number of clusters on the data and build myself a dataset consisting entirely of cluster centers. So I was able to reduce the number of neighbors to search tenfold and still get fantastic performance. A simple technique worked fabulously in this situation and I couldn’t be more pleased.
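Here is a rough sketch of the idea: compress the training data down to k-means cluster centers, then classify a new point by its nearest center. It is not the project's actual code. In particular, running k-means separately for each gesture so that every center inherits a label is an assumption, as are the names `kmeans`, `build_prototypes`, and `classify` and the default parameters.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a list of k cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iters):
        # Assign every point to its nearest center.
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            buckets[nearest].append(p)
        # Move each center to the mean of its bucket (keep it if empty).
        centers = [
            [sum(col) / len(b) for col in zip(*b)] if b else centers[i]
            for i, b in enumerate(buckets)
        ]
    return centers


def build_prototypes(data_by_label, k):
    """Assumed setup: cluster each gesture separately, keep only centers."""
    protos = []
    for label, pts in data_by_label.items():
        for center in kmeans(pts, min(k, len(pts))):
            protos.append((center, label))
    return protos


def classify(protos, x):
    """1-nearest-neighbor lookup over the cluster centers only."""
    return min(protos, key=lambda cl: sq_dist(x, cl[0]))[1]
```

The point of the design is that `classify` only has to scan the centers, not every recorded frame, so the lookup cost depends on how many clusters you keep rather than on how much data you recorded.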
Here is a video of the gesture-detector in action. The annotations should show you what to look at.
The source is up on GitHub. The code is very kludgy and I will fix it up after finals week. In case you’re in a hurry: http://github.com/shriphani/KinectSpell
Now, it is time to try and avoid failing in the finals x(.
Does this work with people who have 6 fingers?
I’m not sure. Basically this algorithm works by looking at the closest match to data it has already seen (some optimization is used but that is the general idea). I am not sure how the 6th finger will affect details.