Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks

169:4 • J. Tompson et al.

Fig. 4. Algorithm pipeline for dataset creation.

randomization is used to prevent premature swarm collapse on early local minima), we then use a robust variant of the Nelder-Mead optimization algorithm [Tseng 1995] after PSO has completed. The Nelder-Mead optimization algorithm is a simplex-based direct-search optimization algorithm for nonlinear functions. We have found that, for our optimization problem, it provides fast convergence when sufficiently close to local optima.
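As a rough illustration of the simplex-based direct search described above, the following is a minimal sketch of the textbook Nelder-Mead method (reflection, expansion, inside contraction, shrink). It is not the robust variant of [Tseng 1995] used in this work; the function name `nelder_mead`, its parameters, and the coefficient values (1, 2, 0.5, 0.5) are assumptions chosen for the example.

```python
def nelder_mead(f, x0, step=0.1, max_iter=500, tol=1e-10):
    """Minimize f near x0 with the textbook Nelder-Mead simplex method.

    Illustrative sketch only: this is NOT the robust variant of Tseng [1995].
    """
    n = len(x0)
    # Initial simplex: x0 plus one vertex offset along each coordinate axis.
    simplex = [list(x0)]
    for i in range(n):
        vertex = list(x0)
        vertex[i] += step
        simplex.append(vertex)
    values = [f(v) for v in simplex]

    for _ in range(max_iter):
        # Sort vertices from best (lowest f) to worst.
        order = sorted(range(n + 1), key=lambda i: values[i])
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]
        if abs(values[-1] - values[0]) < tol:
            break

        # Centroid of every vertex except the worst one.
        centroid = [sum(v[j] for v in simplex[:-1]) / n for j in range(n)]
        worst = simplex[-1]

        def along(t):
            # Point centroid + t * (centroid - worst) on the search line.
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        f_r = f(reflected)
        if f_r < values[0]:
            # Reflection is the new best: try going further (expansion).
            expanded = along(2.0)
            f_e = f(expanded)
            if f_e < f_r:
                simplex[-1], values[-1] = expanded, f_e
            else:
                simplex[-1], values[-1] = reflected, f_r
        elif f_r < values[-2]:
            # Better than the second-worst vertex: accept the reflection.
            simplex[-1], values[-1] = reflected, f_r
        else:
            # Contract the worst vertex toward the centroid.
            contracted = along(-0.5)
            f_c = f(contracted)
            if f_c < values[-1]:
                simplex[-1], values[-1] = contracted, f_c
            else:
                # Shrink every vertex halfway toward the best vertex.
                best = simplex[0]
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, v)]
                    for v in simplex[1:]
                ]
                values = [values[0]] + [f(v) for v in simplex[1:]]

    best_index = min(range(n + 1), key=lambda i: values[i])
    return simplex[best_index], values[best_index]


# Usage: refine a coarse estimate (standing in for the PSO result) of a toy objective.
toy_objective = lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2
refined, refined_value = nelder_mead(toy_objective, [0.8, -1.7])
```

Starting the simplex from a point that a global search has already placed near an optimum mirrors the two-stage structure described in the text, where the direct search is used only for the final local refinement.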

Since this dataset creation stage is performed offline, we do not require it to be fast enough for interactive frame rates. Therefore, we used a high-quality, linear-blend-skinning (LBS) model [Šarić 2011] (shown in Figure 3) as an alternative to the simple ball-and-cylinder model of Oikonomidis et al. After reducing the LBS model's face count to increase render throughput, the model contains 1,722 vertices and 3,381 triangle faces, whereas the high-density source model contained 67,606 faces. While LBS fails to accurately model effects such as muscle deformation and skin folding, it represents many geometric details that ball-and-stick models cannot.
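For readers unfamiliar with LBS, the following minimal sketch shows the core computation: each skinned vertex is a weighted sum of its rest position transformed by every bone that influences it. The helper names (`transform`, `skin_vertex`, `rotation_z`), the 3x4 matrix layout, and the example weights are assumptions made for illustration; they are not taken from the hand model used here.

```python
import math


def transform(matrix, point):
    """Apply a 3x4 affine bone matrix [R | t] to a 3D point."""
    return [sum(row[j] * point[j] for j in range(3)) + row[3] for row in matrix]


def skin_vertex(rest_position, bone_matrices, weights):
    """Linear blend skinning: weighted sum of the vertex as moved by each bone."""
    skinned = [0.0, 0.0, 0.0]
    for matrix, weight in zip(bone_matrices, weights):
        moved = transform(matrix, rest_position)
        skinned = [s + weight * m for s, m in zip(skinned, moved)]
    return skinned


def rotation_z(angle_rad):
    """3x4 matrix rotating about the z axis with no translation."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0, 0.0], [s, c, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]


# A vertex near a joint, influenced equally by a fixed bone and a bone bent by 90 degrees.
fixed_bone = rotation_z(0.0)
bent_bone = rotation_z(math.pi / 2)
skinned = skin_vertex([1.0, 0.0, 0.0], [fixed_bone, bent_bone], [0.5, 0.5])
```

In this toy case the blended vertex lies closer to the joint than either transformed copy, which illustrates why a purely linear blend cannot reproduce effects such as muscle deformation or skin folding.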

To mitigate the effects of self-occlusion, we used three sensors (at viewpoints separated by approximately 45° surrounding the user from the front) with attached vibration motors to reduce IR-pattern interference [Butler et al. 2012]; their relative positions and orientations were calibrated using a variant of the Iterative Closest Point (ICP) algorithm [Horn 1987]. While we use all three camera views to fit the LBS model using the algorithm described earlier, we only use depth images taken from the center camera to train the ConvNet. The contributions from each camera were accumulated into an overall fitness function $F(C)$ that includes two a priori terms ($\Phi(C)$ and $P(C)$) to maintain anatomically correct joint angles as well as a data-dependent term $\Delta(I_s, C)$ from each camera's contribution. The fitness function is

$$F(C) = \sum_{s=1}^{3} \Delta(I_s, C) + \Phi(C) + P(C), \qquad (2)$$

where $I_s$ is the depth image from sensor $s$ and $C$ is a 42-dimensional coefficient vector that represents the 6DOF position and orientation of the hand as well as 36 internal joint angles (shown in Figure 3).

$P(C)$ is an interpenetration term (for a given pose) used to invalidate anatomically incorrect hand poses; it is calculated by accumulating the interpenetration distances of a series of bounding spheres attached to the bones of the 3D model. We define interpenetration distance as simply the sum of overlap between all pairs of interpenetrating bounding spheres.
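The interpenetration distance defined above can be sketched as follows. The representation of a sphere as a `(center, radius)` pair and the function name `interpenetration` are assumptions for illustration; how the spheres are attached to the bones and posed is not shown.

```python
import math
from itertools import combinations


def interpenetration(spheres):
    """Sum of overlap depths over all pairs of intersecting bounding spheres.

    Each sphere is a (center_xyz, radius) pair; this layout is an assumption.
    """
    total = 0.0
    for (center_a, radius_a), (center_b, radius_b) in combinations(spheres, 2):
        # Two spheres overlap when their centers are closer than the sum of radii.
        overlap = radius_a + radius_b - math.dist(center_a, center_b)
        if overlap > 0.0:
            total += overlap
    return total


# Two unit spheres 1.5 apart overlap by 0.5; the third sphere is far away.
example_spheres = [
    ((0.0, 0.0, 0.0), 1.0),
    ((1.5, 0.0, 0.0), 1.0),
    ((10.0, 0.0, 0.0), 1.0),
]
penalty = interpenetration(example_spheres)
```

Non-overlapping pairs contribute nothing, so the term is zero for any pose in which no spheres intersect.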

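$\Phi(C)$ enforces a soft constraint that coefficient values stay within a predetermined range ($C_{\min}$ and $C_{\max}$):

$$\Phi(C) = \sum_{k=1}^{n} w_k \left[ \max(C_k - C_{k,\max}, 0) + \max(C_{k,\min} - C_k, 0) \right],$$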

where $w_k$ is a per-coefficient weighting term to normalize penalty contributions across different units (since we are including error terms for angle and distance in the same objective function). $C_{\min}$ and $C_{\max}$ were determined experimentally by fitting an unconstrained model to a discrete set of poses that represent the full range of motion for each joint. Lastly, $\Delta(I_s, C)$ of Eq. (2) measures the similarity between the depth image $I_s$ and the synthetic pose rendered from the same viewpoint:

$$\Delta(I_s, C) = \sum_{u,v} \min\left( \left| I_s(u, v) - R_s(C, u, v) \right|, d_{\max} \right).$$

Here, $I_s(u, v)$ is the depth at pixel $(u, v)$ of sensor $s$, $R_s(C, u, v)$ is the synthetic depth given the pose coefficient $C$, and $d_{\max}$ is a maximum depth constant. The result of this function is a clamped L1-norm pixelwise comparison. It should be noted that we do not include energy terms that measure the silhouette similarity, as proposed by Oikonomidis et al., since we found that when multiple range sensors are used these terms are not necessary.
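The range penalty, the clamped L1 depth comparison, and their accumulation in Eq. (2) can be sketched as below. All names (`bounds_penalty`, `depth_term`, `fitness`), the list-of-rows image layout, and the `render` callback are hypothetical stand-ins; in particular, `render` represents the synthetic depth renderer, which is not implemented here, and the interpenetration term is passed in as a precomputed number.

```python
def bounds_penalty(coeffs, c_min, c_max, weights):
    """Weighted soft penalty for coefficients that leave [c_min, c_max]."""
    return sum(
        w * (max(c - hi, 0.0) + max(lo - c, 0.0))
        for c, lo, hi, w in zip(coeffs, c_min, c_max, weights)
    )


def depth_term(observed, rendered, d_max):
    """Clamped L1 difference between two depth images given as lists of rows."""
    return sum(
        min(abs(o - r), d_max)
        for row_o, row_r in zip(observed, rendered)
        for o, r in zip(row_o, row_r)
    )


def fitness(coeffs, sensor_images, render, c_min, c_max, weights, d_max,
            interpenetration_value):
    """Per-sensor data terms plus the two a priori terms, as in Eq. (2).

    `render(coeffs, s)` is a hypothetical callback returning the synthetic
    depth image for sensor index s.
    """
    data = sum(
        depth_term(image, render(coeffs, s), d_max)
        for s, image in enumerate(sensor_images)
    )
    return data + bounds_penalty(coeffs, c_min, c_max, weights) + interpenetration_value


# Toy usage with made-up numbers: one coefficient is inside its range, two are outside.
phi = bounds_penalty([0.5, 2.0, -1.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1.0, 2.0, 0.5])
# One pixel differs by 0.5; the other differs by 5.0 and is clamped to d_max = 2.0.
delta = depth_term([[1.0, 5.0]], [[1.5, 0.0]], d_max=2.0)
```

Clamping each pixel difference at the maximum depth constant bounds the influence of any single pixel, so a few large mismatches (for example, at depth discontinuities) cannot dominate the sum.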

FEATURE DETECTION

While neural networks have been used for pose detection of a limited set of discrete hand gestures (for instance, in discriminating between a closed fist and an open palm) [Nagi et al. 2011; Nowlan and Platt 1995], to our knowledge this is the first work that has attempted to use such networks to perform dense feature extraction of human hands in order to infer continuous pose. To do this we employ a multiresolution, deep ConvNet architecture inspired by the work of Farabet et al. [2013] in order to perform feature extraction of 14 salient hand points from a segmented hand image. ConvNets are biologically inspired variants of multilayered perceptrons, which exploit spatial correlation in natural images by extracting features generated by localized convolution kernels. Since depth images of hands tend to have many repeated local image features (for instance, fingertips), ConvNets are well suited to perform feature extraction since multilayered feature banks can share common features, thereby reducing the number of required free parameters.
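To make the reduction in free parameters concrete, the sketch below compares the parameter count of a convolutional layer, whose kernels are shared across all image locations, with that of a fully connected layer producing an output of the same size. The layer sizes are illustrative assumptions and do not describe the architecture used in this work.

```python
def conv_params(in_channels, out_channels, kernel_h, kernel_w):
    """Free parameters of a convolutional layer: shared kernels plus one bias per output map."""
    return out_channels * (in_channels * kernel_h * kernel_w + 1)


def dense_params(num_inputs, num_outputs):
    """Free parameters of a fully connected layer: one weight per input-output pair plus biases."""
    return num_outputs * (num_inputs + 1)


# Illustrative sizes (assumed): a 96x96 single-channel depth image mapped to
# 16 feature maps by 5x5 kernels without padding, giving 92x92 outputs per map.
in_h, in_w, in_channels = 96, 96, 1
out_channels, kernel = 16, 5
out_h, out_w = in_h - kernel + 1, in_w - kernel + 1

shared = conv_params(in_channels, out_channels, kernel, kernel)
unshared = dense_params(in_h * in_w * in_channels, out_h * out_w * out_channels)
```

Because the same kernels are applied at every pixel, the convolutional count is independent of the image size, whereas the fully connected count grows with the product of the input and output sizes.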

We recast the full hand pose recognition problem as an intermediate collection of easier individual hand-feature recognition problems that can be more easily learned by ConvNets. In early experiments we found that inferring mappings between depth image space and pose space directly (for instance, measuring depth image geometry to extract a joint angle) yielded inferior results to learning with intermediate features. We hypothesize that one reason for this could be that learning intermediate features allows ConvNets to concentrate the capacity of the network on learning local features and on differentiating between them. Using this framework the ConvNet is also better able to implicitly handle occlusions; by learning compound, high-level image features, the ConvNet is able to infer the approximate position of an occluded and otherwise unseen feature (for instance, when making a fist, hidden fingertip locations can be inferred from the knuckle locations).

We trained the ConvNet architecture to generate an output set of heat-map feature images (Figure 5). Each feature heat map can be viewed as a 2D Gaussian (truncated to have finite support), where

ACM Transactions on Graphics, Vol. 33, No. 5, Article 169, Publication date: August 2014.
