Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks
Fig. 5. Depth image overlaid with 14 feature locations and the heat map
for one fingertip feature.
Fig. 6. Convolutional network architecture.
the pixel intensity represents the probability of that feature occur-
ring in that spatial location. The Gaussian UV mean is centered at
one of 14 feature points of the user’s hand. These features repre-
sent key joint locations in the 3D model (e.g., knuckles) and were
chosen such that the inverse kinematics (IK) algorithm described in
Section 6 can recover a full 3D pose.
We found that the intermediate heat-map representation not only
reduces required learning capacity, but also improves generaliza-
tion performance since failure modes are often recoverable. Cases
contributing to high test-set error (where the input pose is vastly dif-
ferent from anything in the training set) are usually heat maps that
contain multiple hotspots. For instance, the heat map for a finger-
tip feature might incorrectly contain multiple lobes corresponding
to the other finger locations as the network failed to discriminate
among fingers. When this situation occurs, it is possible to recover
a reasonable feature location by simple heuristics to decide which
of these lobes corresponds to the desired feature (for instance, if
another heat map shows higher probability in those same lobe re-
gions then we can eliminate these as spurious outliers). Similarly,
the intensity of the heat-map lobe gives a direct indication of the
system’s confidence for that feature. This is an extremely useful
measure for practical applications.
Our multiresolution ConvNet architecture is shown in Figure 6.
The segmented depth image is initially preprocessed, whereby the
image is cropped and scaled by a factor proportional to the mean
depth value of the hand pixels, so that the hand is in the center and
has size that is depth invariant. The depth values of each pixel are
then normalized between 0 and 1 (with background pixels set to 1).
The cropped and normalized image is shown in Figure 5.
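As a concrete illustration, the crop-and-normalize step might be sketched as follows in NumPy. The output size and the depth-to-crop-size constant `k` are illustrative choices, not values from the paper, and `preprocess_hand` is a hypothetical helper name.

```python
import numpy as np

def preprocess_hand(depth, hand_mask, out_size=96, k=25000.0):
    """Crop a segmented hand so its on-screen size is depth invariant,
    then normalize depths to [0, 1] with background pixels set to 1."""
    ys, xs = np.nonzero(hand_mask)
    cy, cx = int(ys.mean()), int(xs.mean())
    mean_d = float(depth[hand_mask].mean())
    half = max(1, int(k / mean_d))   # crop radius shrinks with distance
    crop = depth[max(0, cy - half):cy + half,
                 max(0, cx - half):cx + half].astype(np.float32)
    # Resample to a fixed size (nearest neighbour keeps it dependency-free).
    ridx = np.linspace(0, crop.shape[0] - 1, out_size).astype(int)
    cidx = np.linspace(0, crop.shape[1] - 1, out_size).astype(int)
    crop = crop[ridx][:, cidx]
    # Normalize hand depths to [0, 1]; zero-depth (background) pixels -> 1.
    lo, hi = crop[crop > 0].min(), crop[crop > 0].max()
    return np.where(crop > 0, (crop - lo) / max(hi - lo, 1e-6), 1.0)
```
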
The preprocessed image is then filtered using local contrast nor-
malization [Jarrett et al. 2009], which acts as a high-pass filter
to emphasize geometric discontinuities. The image is then down-
sampled twice (each time by a factor of 2) and the same filter is
applied to each image. This produces a multiresolution band-pass
Fig. 7. Neural network input: multiresolution image pyramid.
Fig. 8. High-resolution bank feature detector; each stage: (N features × height × width).
image pyramid with three banks (shown in Figure 7), whose to-
tal spectral density approximates the spectral density of the input
depth image. Since experimentally we have found that hand pose
extraction requires knowledge of both local and global features, a
single-resolution ConvNet would need to examine a large image
window and thus would require a large learning capacity; as such,
a multiresolution architecture is very useful for this application.
The pyramid images are propagated through a two-stage Con-
vNet architecture. The highest-resolution feature bank is shown in
Figure 8. Each bank is composed of two convolution modules, two
piecewise nonlinearity modules, and two max-pooling modules.
Each convolution module uses a stack of learned convolution ker-
nels with an additional learned output bias to create a set of output
feature maps (please refer to LeCun et al. [1998] for an in-depth
discussion). The convolution window sizes range from 4 × 4 to 6 × 6
pixels. Each max-pooling [Nagi et al. 2011] module subsamples its
input image by taking the maximum in a set of nonoverlapping rect-
angular windows. We use max pooling since it effectively reduces
computational complexity at the cost of spatial precision. The max-pooling windows range from 2 × 2 to 4 × 4 pixels. The nonlinearity is a Rectified Linear Unit (ReLU), which has been shown to improve training speed and discrimination performance in comparison to standard sigmoid units [Krizhevsky et al. 2012]. Each ReLU activation module computes the per-pixel nonlinear function f(x) = max(0, x).
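A single bank stage (convolution, ReLU, non-overlapping max pooling) can be written directly in NumPy. Kernel and pooling sizes are parameters here, and the loop-based convolution favors clarity over speed.

```python
import numpy as np

def conv2d(x, kernels, bias):
    """Valid 2D convolution: x (H, W), kernels (N, kh, kw), bias (N,)."""
    n, kh, kw = kernels.shape
    H, W = x.shape
    out = np.zeros((n, H - kh + 1, W - kw + 1))
    for f in range(n):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[f, i, j] = np.sum(x[i:i+kh, j:j+kw] * kernels[f]) + bias[f]
    return out

def relu(x):
    """Per-pixel nonlinearity f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def max_pool(x, p):
    """Non-overlapping p x p max pooling per feature map."""
    n, H, W = x.shape
    H2, W2 = H // p, W // p
    return x[:, :H2*p, :W2*p].reshape(n, H2, p, W2, p).max(axis=(2, 4))
```

One stage of a bank is then `max_pool(relu(conv2d(img, kernels, bias)), p)`, applied twice per bank.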
Lastly, the outputs of the ConvNet banks are fed into a two-stage
neural network shown in Figure 9. This network uses the high-level
convolution features to create the final 14 heat-map images; it does
so by learning a mapping from localized convolution feature activa-
tions to probability maps for each of the bone features. In practice,
these two large and fully connected linear networks account for
more than 80% of the total computational cost of the ConvNet.
Reducing the size of this network thus has a strong impact on
runtime performance, so it is important to find a
good trade-off between quality and speed. Another drawback of this
method is that the neural network must implicitly learn a likelihood
ACM Transactions on Graphics, Vol. 33, No. 5, Article 169, Publication date: August 2014.
Fig. 9. Two-stage neural network to create the 14 heat maps (with sizing
of each stage shown).
model for joint positions in order to infer anatomically correct out-
put joints. Since we do not explicitly model joint connectivity in the
network structure, the network requires a large amount of training
data to correctly perform this inference.
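The mapping itself is just two fully connected stages. A minimal sketch, with the hidden width and heat-map resolution (here 18 × 18) chosen arbitrarily for illustration:

```python
import numpy as np

def heatmap_head(feat, W1, b1, W2, b2, n_maps=14, hm_size=18):
    """Two fully connected stages mapping flattened conv features to
    14 heat maps.  Shapes: feat (D,), W1 (H, D), b1 (H,),
    W2 (n_maps*hm_size**2, H), b2 (n_maps*hm_size**2,)."""
    h = np.maximum(0.0, W1 @ feat + b1)      # hidden stage with ReLU
    out = W2 @ h + b2                        # linear output stage
    return out.reshape(n_maps, hm_size, hm_size)
```
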
ConvNet training was performed using the open-source machine
learning package Torch7 [Collobert et al. 2011] that provides ac-
cess to an efficient GPU implementation of the back-propagation
algorithm for training neural networks. During supervised training
we use stochastic gradient descent with a standard L2-norm error
function, batch size of 64, and the following learnable parameter
update rule
Δw_i = γ Δw_{i−1} − λ (η w_i + ∂L/∂w_i)
w_{i+1} = w_i + Δw_i,    (3)

where w_i is a bias or weight parameter for each of the network
modules for epoch i (with each epoch representing one pass over
the entire training set) and ∂L/∂w_i is the partial derivative of the
error function L with respect to the learnable parameter w_i, averaged
over the current batch. We use a constant learning rate of λ = 0.2
and a momentum term γ = 0.9 to improve the learning rate when
close to the local minimum. Lastly, an L2-regularization factor of
η = 0.0005 is used to help improve generalization.
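Eq. (3) translates directly into code. The sketch below uses the λ, γ, and η values quoted above; the function name is hypothetical.

```python
import numpy as np

def sgd_momentum_step(w, dw_prev, grad, lr=0.2, momentum=0.9, l2=0.0005):
    """One application of Eq. (3): momentum SGD with L2 weight decay.
    Returns the updated parameter and the update term for the next epoch."""
    dw = momentum * dw_prev - lr * (l2 * w + grad)   # delta-w_i
    return w + dw, dw                                # w_{i+1}, delta-w_i
```
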
During ConvNet training, the preprocessed database images were
randomly rotated, scaled, and translated to improve generalization
performance [Farabet et al. 2013]. Not only does this technique
effectively increase the size of the training set (which improves
test/validation-set error), it also helps improve performance for
other users whose hand size is not well represented in the origi-
nal training set. We perform this image manipulation in a back-
ground thread during batch training so the impact on training time
is minimal.
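A minimal version of this augmentation via nearest-neighbour inverse warping; the rotation, scale, and shift ranges are illustrative, not the paper's.

```python
import numpy as np

def augment(img, rng, max_rot_deg=15.0, max_scale=0.1, max_shift=4):
    """Randomly rotate, scale, and translate a preprocessed image."""
    H, W = img.shape
    ang = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    ty, tx = rng.integers(-max_shift, max_shift + 1, size=2)
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    ys, xs = np.mgrid[0:H, 0:W]
    # Inverse-map output coordinates back into the input image.
    yr = ( np.cos(ang) * (ys - cy - ty) + np.sin(ang) * (xs - cx - tx)) / s + cy
    xr = (-np.sin(ang) * (ys - cy - ty) + np.cos(ang) * (xs - cx - tx)) / s + cx
    yr = np.clip(np.rint(yr), 0, H - 1).astype(int)
    xr = np.clip(np.rint(xr), 0, W - 1).astype(int)
    return img[yr, xr]
```

In practice each training batch would draw fresh random parameters, which is what makes the effective training set larger.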
6. POSE RECOVERY
We formulate the problem of pose estimation from the heat-map
output as an optimization problem, similar to inverse kinematics
(IK). We extract 2D and 3D feature positions from the 14 heatmaps
and then minimize an appropriate objective function to align 3D
model features to each heat-map position.
To infer the 3D position corresponding to a heat-map image, we
need to determine the most likely UV position of the feature in the
heat-map. Although the ConvNet architecture is trained to output
heat-map images of 2D Gaussians with low variance, in general,
they output multimodal gray-scale heat maps that usually do not
sum to 1. In practice, it is easy to deduce a correct UV position
by finding the maximal peak in the heat map (corresponding to the
location of greatest confidence). Rather than using the most likely
heat-map location as the final location, we fit a Gaussian model to
the maximal lobe to obtain subpixel accuracy.
First we clamp heat-map pixels below a fixed threshold to get
rid of spurious outliers. We then normalize the resulting image so it
sums to 1, then fit the best 2D Gaussian using Levenberg-Marquardt,
and use the mean of the resulting Gaussian as the UV position. Once
the UV position is found for each of the 14 heat maps, we perform a
lookup into the captured depth frame to obtain the depth component
at the UV location. If this UV location lies on a depth shadow, where
no depth is given in the original image, we store only the 2D position
of this point in the original image space. Otherwise, we store its 3D
point.
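The thresholding and sub-pixel localization can be sketched as follows. For brevity this stand-in replaces the paper's Levenberg-Marquardt Gaussian fit with the closed-form weighted mean of the thresholded map, which yields a similar sub-pixel estimate when a single lobe dominates.

```python
import numpy as np

def heatmap_to_uv(hm, thresh=0.1):
    """Sub-pixel UV from a heat map: clamp low responses, normalize
    the map to sum to 1, and take the probability-weighted mean."""
    hm = np.where(hm < thresh * hm.max(), 0.0, hm)  # drop spurious outliers
    hm = hm / hm.sum()                              # normalize to sum to 1
    ys, xs = np.mgrid[0:hm.shape[0], 0:hm.shape[1]]
    return float((hm * xs).sum()), float((hm * ys).sum())  # (u, v)
```
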
We then perform unconstrained nonlinear optimization on the
following objective function
f(m) = Σ_{i=1}^{n} [Δ_i(m)] + Φ(C),    (4)

Δ_i(m) = ‖(u, v, d)^t_i − (u, v, d)^m_i‖₂   if d^t_i ≠ 0
         ‖(u, v)^t_i − (u, v)^m_i‖₂         otherwise,

where (u, v, d)^t_i is the target 3D heat-map position of feature i and
(u, v, d)^m_i is the model feature position for the current pose estimate.
Eq. (4) is an L2-error norm in 3D or 2D, depending on whether or
not the given feature has a valid depth component associated with it.
We then use a simple linear accumulation of these featurewise error
terms, as well as the same linear penalty constraint (Φ(C)) used
in Section 4. We use PrPSO to minimize Eq. (4). Since function
evaluations for each swarm particle can be parallelized, PrPSO is
able to run at interactive frame rates for this stage.
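The objective of Eq. (4) is straightforward to evaluate. In this sketch the Section 4 penalty Φ(C) is passed in as a precomputed scalar, and each feature is a (u, v, d) triple with d = 0 marking a depth shadow.

```python
import numpy as np

def ik_objective(targets, model_pts, penalty=0.0):
    """Eq. (4): sum of per-feature 3D or 2D L2 errors plus the
    linear penalty constraint from Section 4 (given as a scalar)."""
    err = 0.0
    for (u, v, d), (um, vm, dm) in zip(targets, model_pts):
        if d != 0:   # valid depth -> full 3D residual
            err += np.linalg.norm([u - um, v - vm, d - dm])
        else:        # depth shadow -> 2D image-space residual only
            err += np.linalg.norm([u - um, v - vm])
    return err + penalty
```
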
Furthermore, since a number of the 42 coefficients from Section 4
contribute only subtle behavior to the deformation of the LBS model
at real time, we found that removing coefficients describing finger
twist and coupling the last two knuckles of each finger into a single
angle coefficient significantly reduces the function evaluation time
of Eq. (4) without noticeable loss in pose accuracy. Therefore, we reduce
the complexity of the model to 23 DOF during this final stage. Fewer
than 50 PrPSO iterations are required for adequate convergence.
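For reference, a minimal serial particle swarm optimizer of the kind PrPSO parallelizes might look like this; all hyperparameters are illustrative, and the per-particle evaluations in the inner loop are exactly what the paper distributes across threads.

```python
import numpy as np

def pso_minimize(f, dim, n_particles=32, iters=50, lo=-1.0, hi=1.0, seed=0):
    """Plain PSO over [lo, hi]^dim; returns the best position and value."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, (n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_f = np.array([f(p) for p in x])        # parallelizable evaluations
    g = pbest[pbest_f.argmin()].copy()
    w, c1, c2 = 0.7, 1.5, 1.5                    # inertia, cognitive, social
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        fx = np.array([f(p) for p in x])         # parallelizable evaluations
        improved = fx < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], fx[improved]
        g = pbest[pbest_f.argmin()].copy()
    return g, float(pbest_f.min())
```
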
This IK approach has one important limitation: the UVD target
position may not be a good representation of the true feature posi-
tion. For instance, when a feature is directly occluded by another
feature, the two features will incorrectly share the same depth value
(even though one is in front of the other). However, we found that
for a broad range of gestures this limitation was not noticeable. In
future work we hope to augment the ConvNet output with a learned
depth offset to overcome this limitation.
7. RESULTS
For the results to follow, we test our system using the same experi-
mental setup that was used to capture the training data; the camera
is in front of the user (facing the user) and is at approximately eye-
level height. We have not extensively evaluated the performance of
our algorithm in other camera setups.
The RDF classifier described in Section 4 was trained using
6,500 images (with an additional 1,000 validation images held
aside for tuning of the RDF meta-parameters) of a user performing
typical one- and two-handed gestures (pinching, drawing, clapping,
grasping, etc.). Training was performed on a 24-core machine for