Fig. 4. Algorithm pipeline for dataset creation.
randomization is used to prevent premature swarm collapse on
early local minima), we then use a robust variant of the Nelder-
Mead optimization algorithm [Tseng 1995] after PSO has com-
pleted. The Nelder-Mead optimization algorithm is a simplex-based
direct-search optimization algorithm for nonlinear functions. We
have found that, for our optimization problem, it provides fast con-
vergence when sufficiently close to local optima.
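As an illustration of this two-stage strategy, the following Python sketch refines a swarm's best guess with SciPy's Nelder-Mead implementation. The objective fitness stands in for F(C) of Eq. (2), and c_pso is assumed to be the best coefficient vector returned by the PSO stage; this is a minimal sketch, not the authors' implementation (which uses a robust Nelder-Mead variant [Tseng 1995]).

import numpy as np
from scipy.optimize import minimize

def refine_pose(fitness, c_pso, max_iter=200):
    """Refine a PSO result with Nelder-Mead simplex search.

    fitness is the scalar objective F(C); c_pso is the best
    42-dimensional coefficient vector found by the swarm.
    (Hypothetical names; a sketch of the two-stage strategy,
    not the paper's actual implementation.)
    """
    result = minimize(fitness, c_pso, method="Nelder-Mead",
                      options={"maxiter": max_iter,
                               "xatol": 1e-4, "fatol": 1e-6})
    return result.x  # refined coefficient vector C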
Since this dataset creation stage is performed offline, we do not
require it to be fast enough for interactive frame rates. Therefore
we used a high-quality, linear-blend-skinning (LBS) model [Šarić 2011] (shown in Figure 3) as an alternative to the simple ball-
and-cylinder model of Oikonomidis et al. After reducing the LBS
model’s face count to increase render throughput, the model con-
tains 1,722 vertices and 3,381 triangle faces, whereas the high-
density source model contained 67,606 faces. While LBS fails to
accurately model effects such as muscle deformation and skin fold-
ing, it represents many geometric details that ball-and-stick models
cannot.
To mitigate the effects of self-occlusion, we used three sensors
(at viewpoints separated by approximately 45° surrounding the user
from the front) with attached vibration motors to reduce IR-pattern
interference [Butler et al. 2012] and whose relative positions and
orientations were calibrated using a variant of the Iterative Closest
Point (ICP) algorithm [Horn 1987]. While we use all three camera
views to fit the LBS model using the algorithm described earlier,
we only use depth images taken from the center camera to train the
ConvNet. The contributions from each camera were accumulated into an overall fitness function $F(C)$ that includes two a priori terms ($\Phi(C)$ and $P(C)$) to maintain anatomically correct joint angles, as well as a data-dependent term $\Delta(I_s, C)$ from each camera's contribution. The fitness function is
$$F(C) = \sum_{s=1}^{3} \Delta(I_s, C) + \Phi(C) + P(C), \tag{2}$$
where $I_s$ is the $s$th sensor's depth image and $C$ is a 42-dimensional coefficient vector that represents the 6-DOF position and orientation of the hand as well as 36 internal joint angles (shown in Figure 3).
$P(C)$ is an interpenetration term (for a given pose) used to invalidate anatomically incorrect hand poses; it is calculated by accumulating the interpenetration distances of a series of bounding spheres attached to the bones of the 3D model. We define interpenetration distance as simply the sum of overlap between all pairs of interpenetrating bounding spheres.
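As a sketch of how such an accumulation might be computed (the sphere centers and radii are hypothetical inputs that would be derived from the posed skeleton; this is not the paper's code):

import numpy as np

def interpenetration_term(centers, radii):
    """Sum of pairwise overlaps between bounding spheres.

    centers is an (m, 3) array of sphere centers attached to the
    posed bones and radii an (m,) array of their radii; both would
    be computed from the coefficient vector C via the skeleton
    (omitted here). Hypothetical helper, sketching P(C).
    """
    overlap = 0.0
    m = len(radii)
    for i in range(m):
        for j in range(i + 1, m):
            d = np.linalg.norm(centers[i] - centers[j])
            # Two spheres interpenetrate when the distance between
            # their centers is less than the sum of their radii.
            overlap += max(radii[i] + radii[j] - d, 0.0)
    return overlap

A production implementation would likely also skip sphere pairs attached to the same or adjacent bones, since those overlap by construction.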
$\Phi(C)$ enforces a soft constraint that coefficient values stay within a predetermined range ($C_{\min}$ and $C_{\max}$):
$$\Phi(C) = \sum_{k=1}^{n} w_k \left[ \max\left(C_k - C_k^{\max},\, 0\right) + \max\left(C_k^{\min} - C_k,\, 0\right) \right],$$
where $w_k$ is a per-coefficient weighting term that normalizes penalty contributions across different units (since the objective function mixes error terms for angles and distances). $C_{\min}$ and $C_{\max}$ were determined experimentally by fitting an unconstrained model to a discrete set of poses that represent the full range of motion for each joint.
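This penalty is a direct transcription of the equation above; a minimal NumPy sketch, assuming c, c_min, c_max, and w are length-n arrays:

import numpy as np

def range_penalty(c, c_min, c_max, w):
    """Soft box-constraint penalty Phi(C).

    c, c_min, c_max, and w are length-n arrays: the current
    coefficients, their per-coefficient bounds, and the weights
    that normalize angle and distance units. Hypothetical helper.
    """
    over = np.maximum(c - c_max, 0.0)   # amount above the upper bound
    under = np.maximum(c_min - c, 0.0)  # amount below the lower bound
    return np.sum(w * (over + under))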
Lastly, $\Delta(I_s, C)$ of Eq. (2) measures the similarity between the depth image $I_s$ and the synthetic pose rendered from the same viewpoint:
$$\Delta(I_s, C) = \sum_{u,v} \min\left( \left| I_s(u, v) - R_s(C, u, v) \right|,\; d_{\max} \right).$$
Here, $I_s(u, v)$ is the depth at pixel $(u, v)$ of sensor $s$, $R_s(C, u, v)$ is the synthetic depth given the pose coefficient $C$, and $d_{\max}$ is a maximum depth constant. The result of this function is a clamped
L1-norm pixelwise comparison. It should be noted that we do not
include energy terms that measure the silhouette similarity, as pro-
posed by Oikonomidis et al., since we found that when multiple
range sensors are used these terms are not necessary.
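A minimal NumPy sketch of Δ and of assembling F(C) from the three terms, assuming the renderer has already produced R_s(C, ·) for each sensor (rendering itself is omitted, and the helper names reuse the hypothetical sketches above):

import numpy as np

def depth_term(depth_img, rendered, d_max):
    """Clamped-L1 depth comparison Delta(I_s, C).

    depth_img is the sensor image I_s and rendered the synthetic
    depth R_s(C, .) rasterized from the same viewpoint; both are
    (H, W) arrays in the same depth units. Hypothetical helper.
    """
    return np.sum(np.minimum(np.abs(depth_img - rendered), d_max))

def fitness(depth_imgs, rendered_imgs, c, c_min, c_max, w,
            centers, radii, d_max):
    """Overall objective F(C) of Eq. (2): one data term per sensor
    plus the two a priori terms sketched earlier (sketch only)."""
    data = sum(depth_term(i, r, d_max)
               for i, r in zip(depth_imgs, rendered_imgs))
    return data + range_penalty(c, c_min, c_max, w) \
                + interpenetration_term(centers, radii)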
5. FEATURE DETECTION
While neural networks have been used for pose detection of a lim-
ited set of discrete hand gestures (for instance, in discriminating
between a closed fist and an open palm) [Nagi et al. 2011; Nowlan
and Platt 1995], to our knowledge this is the first work that has
attempted to use such networks to perform dense feature extraction
of human hands in order to infer continuous pose. To do this we
employ a multiresolution, deep ConvNet architecture inspired by
the work of Farabet et al. [2013] in order to perform feature ex-
traction of 14 salient hand points from a segmented hand image.
ConvNets are biologically inspired variants of multilayered percep-
trons, which exploit spatial correlation in natural images by extract-
ing features generated by localized convolution kernels. Since depth
images of hands tend to have many repeated local image features
(for instance, fingertips), ConvNets are well suited to perform fea-
ture extraction since multilayered feature banks can share common
features, thereby reducing the number of required free parameters.
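To make the flavor of such an architecture concrete, here is a deliberately simplified PyTorch sketch of a two-resolution ConvNet that maps a segmented depth image to a stack of per-feature heat maps. The layer widths, bank depth, and upsampling scheme are illustrative assumptions only, not the configuration described in this paper or in Farabet et al. [2013]:

import torch
import torch.nn as nn
import torch.nn.functional as F

class HeatmapConvNet(nn.Module):
    """Two-resolution ConvNet emitting one heat map per hand feature.

    Illustrative sketch only: the layer sizes and number of
    resolutions are assumptions, not the reported architecture.
    """
    def __init__(self, n_features=14):
        super().__init__()
        # One convolutional bank, shared across both resolutions.
        self.bank = nn.Sequential(
            nn.Conv2d(1, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        # 1x1 convolution stands in for the dense stage, producing
        # an (n_features, H/4, W/4) stack of heat maps.
        self.head = nn.Conv2d(64, n_features, 1)

    def forward(self, depth):                 # depth: (N, 1, H, W)
        lo = F.avg_pool2d(depth, 2)           # half-resolution copy
        hi_feats = self.bank(depth)           # (N, 32, H/4, W/4)
        lo_feats = self.bank(lo)              # (N, 32, H/8, W/8)
        lo_feats = F.interpolate(lo_feats, size=hi_feats.shape[-2:])
        return self.head(torch.cat([hi_feats, lo_feats], dim=1))

Sharing one bank across resolutions is one way to realize the parameter sharing discussed above: the same local filters respond to a fingertip whether it appears large (near) or small (far) in the image.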
We recast the full hand pose recognition problem as an interme-
diate collection of easier individual hand-feature recognition prob-
lems that can be more easily learned by ConvNets. In early experi-
ments we found inferring mappings between depth image space and
pose space directly (for instance, measuring depth image geometry
to extract a joint angle) yielded inferior results to learning with in-
termediate features. We hypothesize that one reason for this could
be that learning intermediate features allows ConvNets to concen-
trate the capacity of the network on learning local features and on
differentiating between them. Using this framework the ConvNet is
also better able to implicitly handle occlusions; by learning com-
pound, high-level image features, the ConvNet is able to infer the
approximate position of an occluded and otherwise unseen feature
(for instance, when making a fist, hidden fingertip locations can be
inferred by the knuckle locations).
We trained the ConvNet architecture to generate an output set of
heat-map feature images (Figure 5). Each feature heat map can be
viewed as a 2D Gaussian (truncated to have finite support), where