Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks
Fig. 5. Depth image overlaid with 14 feature locations and the heat map
for one fingertip feature.
Fig. 6. Convolutional network architecture.
the pixel intensity represents the probability of that feature occur-
ring in that spatial location. The Gaussian UV mean is centered at
one of 14 feature points of the user’s hand. These features repre-
sent key joint locations in the 3D model (e.g., knuckles) and were
chosen such that the inverse kinematics (IK) algorithm described in
Section 6 can recover a full 3D pose.
We found that the intermediate heat-map representation not only
reduces required learning capacity, but also improves generaliza-
tion performance since failure modes are often recoverable. Cases
contributing to high test-set error (where the input pose is vastly dif-
ferent from anything in the training set) are usually heat maps that
contain multiple hotspots. For instance, the heat map for a finger-
tip feature might incorrectly contain multiple lobes corresponding
to the other finger locations as the network failed to discriminate
among fingers. When this situation occurs, it is possible to recover
a reasonable feature location by simple heuristics to decide which
of these lobes corresponds to the desired feature (for instance, if
another heat map shows higher probability in those same lobe re-
gions then we can eliminate these as spurious outliers). Similarly,
the intensity of the heat-map lobe gives a direct indication of the
system’s confidence for that feature. This is an extremely useful
measure for practical applications.
Our multiresolution ConvNet architecture is shown in Figure 6.
The segmented depth image is initially preprocessed, whereby the
image is cropped and scaled by a factor proportional to the mean
depth value of the hand pixels, so that the hand is in the center and
has size that is depth invariant. The depth values of each pixel are
then normalized between 0 and 1 (with background pixels set to 1).
The cropped and normalized image is shown in Figure 5.
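As a concrete illustration, the crop-and-normalize step might be sketched as follows in NumPy. The output size and the depth-to-crop-size constant `k` are illustrative choices, not values from the paper, and `preprocess_hand` is a hypothetical helper name.

```python
import numpy as np

def preprocess_hand(depth, hand_mask, out_size=96, k=25000.0):
    """Crop a segmented hand so its on-screen size is depth invariant,
    then normalize depths to [0, 1] with background pixels set to 1."""
    ys, xs = np.nonzero(hand_mask)
    cy, cx = int(ys.mean()), int(xs.mean())
    mean_d = float(depth[hand_mask].mean())
    half = max(1, int(k / mean_d))   # crop radius shrinks with distance
    crop = depth[max(0, cy - half):cy + half,
                 max(0, cx - half):cx + half].astype(np.float32)
    # Resample to a fixed size (nearest neighbour keeps it dependency-free).
    ridx = np.linspace(0, crop.shape[0] - 1, out_size).astype(int)
    cidx = np.linspace(0, crop.shape[1] - 1, out_size).astype(int)
    crop = crop[ridx][:, cidx]
    # Normalize hand depths to [0, 1]; zero-depth (background) pixels -> 1.
    lo, hi = crop[crop > 0].min(), crop[crop > 0].max()
    return np.where(crop > 0, (crop - lo) / max(hi - lo, 1e-6), 1.0)
```
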
The preprocessed image is then filtered using local contrast nor-
malization [Jarrett et al. 2009], which acts as a high-pass filter
to emphasize geometric discontinuities. The image is then down-
sampled twice (each time by a factor of 2) and the same filter is
applied to each image. This produces a multiresolution band-pass
Fig. 7. Neural network input: multiresolution image pyramid.
Fig. 8. High-resolution bank feature detector; each stage: (N features × height × width).
image pyramid with three banks (shown in Figure 7), whose to-
tal spectral density approximates the spectral density of the input
depth image. Since experimentally we have found that hand pose
extraction requires knowledge of both local and global features, a
single-resolution ConvNet would need to examine a large image
window and thus would require a large learning capacity; as such,
a multiresolution architecture is very useful for this application.
The pyramid images are propagated through a two-stage Con-
vNet architecture. The highest-resolution feature bank is shown in
Figure 8. Each bank is composed of two convolution modules, two
piecewise nonlinearity modules, and two max-pooling modules.
Each convolution module uses a stack of learned convolution ker-
nels with an additional learned output bias to create a set of output
feature maps (please refer to LeCun et al. [1998] for an in-depth
discussion). The convolution window sizes range from 4 × 4 to 6 × 6
pixels. Each max-pooling [Nagi et al. 2011] module subsamples its
input image by taking the maximum in a set of nonoverlapping rect-
angular windows. We use max pooling since it effectively reduces
computational complexity at the cost of spatial precision. The max-pooling windows range from 2 × 2 to 4 × 4 pixels. The nonlinearity is a Rectified Linear Unit (ReLU), which has been shown to improve training speed and discrimination performance in comparison to standard sigmoid units [Krizhevsky et al. 2012]. Each ReLU activation module computes the per-pixel nonlinear function f(x) = max(0, x).
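A single bank stage (convolution, ReLU, non-overlapping max pooling) can be written directly in NumPy. Kernel and pooling sizes are parameters here, and the loop-based convolution favors clarity over speed.

```python
import numpy as np

def conv2d(x, kernels, bias):
    """Valid 2D convolution: x (H, W), kernels (N, kh, kw), bias (N,)."""
    n, kh, kw = kernels.shape
    H, W = x.shape
    out = np.zeros((n, H - kh + 1, W - kw + 1))
    for f in range(n):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[f, i, j] = np.sum(x[i:i+kh, j:j+kw] * kernels[f]) + bias[f]
    return out

def relu(x):
    """Per-pixel nonlinearity f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def max_pool(x, p):
    """Non-overlapping p x p max pooling per feature map."""
    n, H, W = x.shape
    H2, W2 = H // p, W // p
    return x[:, :H2*p, :W2*p].reshape(n, H2, p, W2, p).max(axis=(2, 4))
```

One stage of a bank is then `max_pool(relu(conv2d(img, kernels, bias)), p)`, applied twice per bank.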
Lastly, the outputs of the ConvNet banks are fed into a two-stage
neural network shown in Figure 9. This network uses the high-level
convolution features to create the final 14 heat-map images; it does
so by learning a mapping from localized convolution feature activa-
tions to probability maps for each of the bone features. In practice,
these two large and fully connected linear networks account for
more than 80% of the total computational cost of the ConvNet.
Reducing the size of this network thus has a strong impact on
runtime performance, so it is important to find a
good trade-off between quality and speed. Another drawback of this
method is that the neural network must implicitly learn a likelihood
ACM Transactions on Graphics, Vol. 33, No. 5, Article 169, Publication date: August 2014.
Fig. 9. Two-stage neural network to create the 14 heat maps (with sizing
of each stage shown).
model for joint positions in order to infer anatomically correct out-
put joints. Since we do not explicitly model joint connectivity in the
network structure, the network requires a large amount of training
data to correctly perform this inference.
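The mapping itself is just two fully connected stages. A minimal sketch, with the hidden width and heat-map resolution (here 18 × 18) chosen arbitrarily for illustration:

```python
import numpy as np

def heatmap_head(feat, W1, b1, W2, b2, n_maps=14, hm_size=18):
    """Two fully connected stages mapping flattened conv features to
    14 heat maps.  Shapes: feat (D,), W1 (H, D), b1 (H,),
    W2 (n_maps*hm_size**2, H), b2 (n_maps*hm_size**2,)."""
    h = np.maximum(0.0, W1 @ feat + b1)      # hidden stage with ReLU
    out = W2 @ h + b2                        # linear output stage
    return out.reshape(n_maps, hm_size, hm_size)
```
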
ConvNet training was performed using the open-source machine
learning package Torch7 [Collobert et al. 2011] that provides ac-
cess to an efficient GPU implementation of the back-propagation
algorithm for training neural networks. During supervised training
we use stochastic gradient descent with a standard L2-norm error
function, batch size of 64, and the following learnable parameter
update rule
Δw_i = γ Δw_{i−1} − λ (η w_i + ∂L/∂w_i)
w_{i+1} = w_i + Δw_i,    (3)

where w_i is a bias or weight parameter for each of the network
modules for epoch i (with each epoch representing one pass over
the entire training set) and ∂L/∂w_i is the partial derivative of the
error function L with respect to the learnable parameter w_i, averaged
over the current batch. We use a constant learning rate of λ = 0.2
and a momentum term γ = 0.9 to improve the learning rate when
close to the local minimum. Lastly, an L2-regularization factor of
η = 0.0005 is used to help improve generalization.
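Eq. (3) translates directly into code. The sketch below uses the λ, γ, and η values quoted above; the function name is hypothetical.

```python
import numpy as np

def sgd_momentum_step(w, dw_prev, grad, lr=0.2, momentum=0.9, l2=0.0005):
    """One application of Eq. (3): momentum SGD with L2 weight decay.
    Returns the updated parameter and the update term for the next epoch."""
    dw = momentum * dw_prev - lr * (l2 * w + grad)   # delta-w_i
    return w + dw, dw                                # w_{i+1}, delta-w_i
```
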
During ConvNet training, the preprocessed database images were
randomly rotated, scaled, and translated to improve generalization
performance [Farabet et al. 2013]. Not only does this technique
effectively increase the size of the training set (which improves
test/validation-set error), it also helps improve performance for
other users whose hand size is not well represented in the origi-
nal training set. We perform this image manipulation in a back-
ground thread during batch training so the impact on training time
is minimal.
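A minimal version of this augmentation via nearest-neighbour inverse warping; the rotation, scale, and shift ranges are illustrative, not the paper's.

```python
import numpy as np

def augment(img, rng, max_rot_deg=15.0, max_scale=0.1, max_shift=4):
    """Randomly rotate, scale, and translate a preprocessed image."""
    H, W = img.shape
    ang = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    ty, tx = rng.integers(-max_shift, max_shift + 1, size=2)
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    ys, xs = np.mgrid[0:H, 0:W]
    # Inverse-map output coordinates back into the input image.
    yr = ( np.cos(ang) * (ys - cy - ty) + np.sin(ang) * (xs - cx - tx)) / s + cy
    xr = (-np.sin(ang) * (ys - cy - ty) + np.cos(ang) * (xs - cx - tx)) / s + cx
    yr = np.clip(np.rint(yr), 0, H - 1).astype(int)
    xr = np.clip(np.rint(xr), 0, W - 1).astype(int)
    return img[yr, xr]
```

In practice each training batch would draw fresh random parameters, which is what makes the effective training set larger.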
6. POSE RECOVERY
We formulate the problem of pose estimation from the heat-map
output as an optimization problem, similar to inverse kinematics
(IK). We extract 2D and 3D feature positions from the 14 heatmaps
and then minimize an appropriate objective function to align 3D
model features to each heat-map position.
To infer the 3D position corresponding to a heat-map image, we
need to determine the most likely UV position of the feature in the
heat-map. Although the ConvNet architecture is trained to output
heat-map images of 2D Gaussians with low variance, in general,
they output multimodal gray-scale heat maps that usually do not
sum to 1. In practice, it is easy to deduce a correct UV position
by finding the maximal peak in the heat map (corresponding to the
location of greatest confidence). Rather than using the most likely
heat-map location as the final location, we fit a Gaussian model to
the maximal lobe to obtain subpixel accuracy.
First we clamp heat-map pixels below a fixed threshold to get
rid of spurious outliers. We then normalize the resulting image so it
sums to 1, then fit the best 2D Gaussian using Levenberg-Marquardt,
and use the mean of the resulting Gaussian as the UV position. Once
the UV position is found for each of the 14 heat maps, we perform a
lookup into the captured depth frame to obtain the depth component
at the UV location. If this UV location lies on a depth shadow, where
no depth is given in the original image, we store only the 2D position
of this point in the original image space. Otherwise, we store its 3D
point.
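The thresholding and sub-pixel localization can be sketched as follows. For brevity this stand-in replaces the paper's Levenberg-Marquardt Gaussian fit with the closed-form weighted mean of the thresholded map, which yields a similar sub-pixel estimate when a single lobe dominates.

```python
import numpy as np

def heatmap_to_uv(hm, thresh=0.1):
    """Sub-pixel UV from a heat map: clamp low responses, normalize
    the map to sum to 1, and take the probability-weighted mean."""
    hm = np.where(hm < thresh * hm.max(), 0.0, hm)  # drop spurious outliers
    hm = hm / hm.sum()                              # normalize to sum to 1
    ys, xs = np.mgrid[0:hm.shape[0], 0:hm.shape[1]]
    return float((hm * xs).sum()), float((hm * ys).sum())  # (u, v)
```
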
We then perform unconstrained nonlinear optimization on the
following objective function
f(m) = Σ_{i=1}^{n} [Δ_i(m)] + Φ(C),    (4)

Δ_i(m) = ‖(u, v, d)^t_i − (u, v, d)^m_i‖₂   if d^t_i ≠ 0
         ‖(u, v)^t_i − (u, v)^m_i‖₂         otherwise,

where (u, v, d)^t_i is the target 3D heat-map position of feature i and
(u, v, d)^m_i is the model feature position for the current pose estimate.
Eq. (4) is an L2-error norm in 3D or 2D, depending on whether or
not the given feature has a valid depth component associated with it.
We then use a simple linear accumulation of these featurewise error
terms, as well as the same linear penalty constraint (Φ(C)) used
in Section 4. We use PrPSO to minimize Eq. (4). Since function
evaluations for each swarm particle can be parallelized, PrPSO is
able to run at interactive frame rates for this stage.
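The objective of Eq. (4) is straightforward to evaluate. In this sketch the Section 4 penalty Φ(C) is passed in as a precomputed scalar, and each feature is a (u, v, d) triple with d = 0 marking a depth shadow.

```python
import numpy as np

def ik_objective(targets, model_pts, penalty=0.0):
    """Eq. (4): sum of per-feature 3D or 2D L2 errors plus the
    linear penalty constraint from Section 4 (given as a scalar)."""
    err = 0.0
    for (u, v, d), (um, vm, dm) in zip(targets, model_pts):
        if d != 0:   # valid depth -> full 3D residual
            err += np.linalg.norm([u - um, v - vm, d - dm])
        else:        # depth shadow -> 2D image-space residual only
            err += np.linalg.norm([u - um, v - vm])
    return err + penalty
```
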
Furthermore, since a number of the 42 coefficients from Section 4
contribute only subtle behavior to the deformation of the LBS model
at real time, we found that removing coefficients describing finger
twist and coupling the last two knuckles of each finger into a single
angle coefficient significantly reduces the function evaluation time
of Eq. (4) without noticeable loss in pose accuracy. Therefore, we reduce
the complexity of the model to 23 DOF during this final stage. Fewer
than 50 PrPSO iterations are required for adequate convergence.
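For reference, a minimal serial particle swarm optimizer of the kind PrPSO parallelizes might look like this; all hyperparameters are illustrative, and the per-particle evaluations in the inner loop are exactly what the paper distributes across threads.

```python
import numpy as np

def pso_minimize(f, dim, n_particles=32, iters=50, lo=-1.0, hi=1.0, seed=0):
    """Plain PSO over [lo, hi]^dim; returns the best position and value."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, (n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_f = np.array([f(p) for p in x])        # parallelizable evaluations
    g = pbest[pbest_f.argmin()].copy()
    w, c1, c2 = 0.7, 1.5, 1.5                    # inertia, cognitive, social
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        fx = np.array([f(p) for p in x])         # parallelizable evaluations
        improved = fx < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], fx[improved]
        g = pbest[pbest_f.argmin()].copy()
    return g, float(pbest_f.min())
```
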
This IK approach has one important limitation: the UVD target
position may not be a good representation of the true feature posi-
tion. For instance, when a feature is directly occluded by another
feature, the two features will incorrectly share the same depth value
(even though one is in front of the other). However, we found that
for a broad range of gestures this limitation was not noticeable. In
future work we hope to augment the ConvNet output with a learned
depth offset to overcome this limitation.
7. RESULTS
For the results to follow, we test our system using the same experi-
mental setup that was used to capture the training data; the camera
is in front of the user (facing the user) and is at approximately eye-
level height. We have not extensively evaluated the performance of
our algorithm in other camera setups.
The RDF classifier described in Section 4 was trained using
6,500 images (with an additional 1,000 validation images held
aside for tuning of the RDF meta-parameters) of a user performing
typical one- and two-handed gestures (pinching, drawing, clapping,
grasping, etc.). Training was performed on a 24-core machine for