Top 10 algorithms in data mining
k-means can be paired with another algorithm to describe non-convex clusters. One first clusters the data into a large number of groups using k-means. These groups are then agglomerated into larger clusters using single-link hierarchical clustering, which can detect complex shapes. This approach also makes the solution less sensitive to initialization, and since the hierarchical method provides results at multiple resolutions, one does not need to pre-specify k either.
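This two-stage idea can be sketched in a few lines. The helper names and the deterministic initialization below are illustrative assumptions of this sketch, not prescribed by the paper:

```python
import math

def kmeans(points, k, iters=50):
    # Plain k-means; for this sketch, centres are initialized
    # deterministically as evenly spaced data points.
    centers = points[:: max(1, len(points) // k)][:k]
    k = len(centers)
    groups = [points]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[j].append(p)
        centers = [tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return [g for g in groups if g]

def single_link_merge(groups, target):
    # Single-link agglomeration: repeatedly merge the two groups whose
    # closest members are nearest. Chains of such merges can trace out
    # non-convex cluster shapes that k-means alone cannot describe.
    groups = [list(g) for g in groups]
    while len(groups) > target:
        i, j = min(((a, b) for a in range(len(groups))
                    for b in range(a + 1, len(groups))),
                   key=lambda ab: min(math.dist(p, q)
                                      for p in groups[ab[0]]
                                      for q in groups[ab[1]]))
        groups[i].extend(groups.pop(j))
    return groups
```

On two elongated, well-separated point sets, over-clustering with k-means and then merging by single link recovers the two shapes without specifying their form in advance.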
The cost of the optimal solution decreases with increasing k until it reaches zero when the number of clusters equals the number of distinct data points. This makes it more difficult to (a) directly compare solutions with different numbers of clusters and (b) find the optimal value of k. If the desired k is not known in advance, one will typically run k-means with different values of k, and then use a suitable criterion to select one of the results. For example, SAS uses the cubic clustering criterion, while X-means adds a complexity term (which increases with k) to the original cost function (Eq. 1) and then identifies the k which minimizes this adjusted cost. Alternatively, one can progressively increase the number of clusters, in conjunction with a suitable stopping criterion. Bisecting k-means [73] achieves this by first putting all the data into a single cluster, and then recursively splitting the least compact cluster into two using 2-means. The celebrated LBG algorithm [34] used for vector quantization doubles the number of clusters until a suitable code-book size is obtained. Both these approaches thus alleviate the need to know k beforehand.
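The bisecting strategy can be sketched as follows; the farthest-pair initialization and the compactness measure (within-cluster sum of squared distances) are assumptions of this sketch, not details from [73]:

```python
import math

def two_means(points, iters=20):
    # Plain 2-means, initialized deterministically with the two points
    # farthest apart (an assumption of this sketch).
    c0, c1 = max(((p, q) for p in points for q in points),
                 key=lambda pq: math.dist(pq[0], pq[1]))
    centers = [c0, c1]
    halves = [list(points), []]
    for _ in range(iters):
        halves = [[], []]
        for p in points:
            near = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            halves[near].append(p)
        centers = [tuple(sum(x) / len(h) for x in zip(*h)) if h else centers[j]
                   for j, h in enumerate(halves)]
    return [h for h in halves if h]

def sse(cluster):
    # Compactness: sum of squared distances to the cluster mean.
    mean = tuple(sum(x) / len(cluster) for x in zip(*cluster))
    return sum(math.dist(p, mean) ** 2 for p in cluster)

def bisecting_kmeans(points, k):
    # Start with all data in one cluster; repeatedly split the least
    # compact cluster into two until k clusters are obtained.
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        parts = two_means(clusters[worst])
        if len(parts) < 2:      # the worst cluster would not split further
            break
        clusters[worst : worst + 1] = parts
    return clusters
```

A stopping criterion on the worst cluster's compactness could replace the fixed k, which is how this family of methods avoids specifying k up front.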
The algorithm is also sensitive to the presence of outliers, since the “mean” is not a robust statistic. A preprocessing step to remove outliers can be helpful. Post-processing the results, for example to eliminate small clusters, or to merge close clusters into a large cluster, is also desirable. Ball and Hall’s ISODATA algorithm from 1967 effectively used both pre- and post-processing on k-means.
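As an illustration of such a preprocessing step, one simple heuristic (assumed here, not taken from the paper) drops points whose distance to the overall mean is far above the average such distance:

```python
import math

def drop_outliers(points, factor=2.0):
    # Remove points whose distance to the global mean exceeds `factor`
    # times the average distance; the factor is an assumed heuristic.
    mean = tuple(sum(x) / len(points) for x in zip(*points))
    dists = [math.dist(p, mean) for p in points]
    cutoff = factor * sum(dists) / len(dists)
    return [p for p, d in zip(points, dists) if d <= cutoff]
```

Running k-means on the filtered points keeps a single extreme value from dragging a cluster mean away from the bulk of its data.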
2.3 Generalizations and connections
As mentioned earlier, k-means is closely related to fitting a mixture of k isotropic Gaussians to the data. Moreover, the generalization of the distance measure to all Bregman divergences is related to fitting the data with a mixture of k components from the exponential family of distributions. Another broad generalization is to view the “means” as probabilistic models instead of points in R^d. Here, in the assignment step, each data point is assigned to the model most likely to have generated it. In the “relocation” step, the model parameters are updated to best fit the assigned data. Such model-based k-means allows one to cater to more complex data, e.g. sequences described by hidden Markov models.
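A minimal one-dimensional sketch of this model-based view, assuming each “mean” is a Gaussian with its own mean and variance (richer models such as HMMs would slot into the same two-step loop); all helper names are illustrative:

```python
import math

def fit(cluster):
    # Refit a model (mean, variance) to its assigned points.
    m = sum(cluster) / len(cluster)
    v = sum((x - m) ** 2 for x in cluster) / len(cluster)
    return m, v if v > 0 else 1e-6

def log_likelihood(x, model):
    # Log density of x under a univariate Gaussian model.
    m, v = model
    return -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)

def model_based_kmeans(xs, models, iters=10):
    for _ in range(iters):
        # Assignment step: each point goes to the most likely model.
        clusters = [[] for _ in models]
        for x in xs:
            j = max(range(len(models)), key=lambda i: log_likelihood(x, models[i]))
            clusters[j].append(x)
        # "Relocation" step: refit each model to its assigned points.
        models = [fit(c) if c else models[i] for i, c in enumerate(clusters)]
    return models, clusters
```

With isotropic Gaussians of fixed equal variance this reduces exactly to ordinary k-means, since the most likely model is then simply the nearest mean.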
One can also “kernelize” k-means [19]. Though boundaries between clusters are still linear in the implicit high-dimensional feature space, they can become non-linear when projected back to the original space, thus allowing kernel k-means to deal with more complex clusters. Dhillon et al. [19] have shown a close connection between kernel k-means and spectral clustering. The k-medoids algorithm is similar to k-means, except that the centroids have to belong to the data set being clustered. Fuzzy c-means is also similar, except that it computes a fuzzy membership function for each cluster rather than a hard one.
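Kernel k-means can be sketched directly from the standard kernel-trick expansion ||phi(x) - m_c||^2 = K(x, x) - (2/|c|) sum_j K(x, x_j) + (1/|c|^2) sum_{j,l} K(x_j, x_l), so the implicit means never need an explicit representation. The RBF kernel and the helper names are assumptions of this sketch:

```python
import math

def rbf(a, b, gamma=1.0):
    # Radial basis function kernel K(a, b) = exp(-gamma * ||a - b||^2).
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def kernel_kmeans(points, assign, iters=10, kernel=rbf):
    n = len(points)
    K = [[kernel(points[i], points[j]) for j in range(n)] for i in range(n)]
    for _ in range(iters):
        clusters = {}
        for i, c in enumerate(assign):
            clusters.setdefault(c, []).append(i)
        def dist2(i, idx):
            # ||phi(x_i) - mean_c||^2 expanded purely in kernel evaluations.
            s1 = sum(K[i][j] for j in idx) / len(idx)
            s2 = sum(K[j][l] for j in idx for l in idx) / len(idx) ** 2
            return K[i][i] - 2 * s1 + s2
        assign = [min(clusters, key=lambda c: dist2(i, clusters[c]))
                  for i in range(n)]
    return assign
```

Even though each boundary is a hyperplane in the feature space, its preimage in the input space can curve around clusters that ordinary k-means would cut through.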
Despite its drawbacks, k-means remains the most widely used partitional clustering algorithm in practice. The algorithm is simple, easily understandable and reasonably scalable, and can be easily modified to deal with streaming data. To deal with very large datasets, substantial effort has also gone into further speeding up k-means, most notably by using kd-trees or by exploiting the triangle inequality to avoid comparing each data point with all the centroids during the assignment step. Continual improvements and generalizations of the
basic algorithm have ensured its continued relevance and gradually increased its effectiveness
as well.
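The triangle-inequality speed-up mentioned above rests on a simple bound: if d(c, c') >= 2 d(x, c) for the currently best centre c, then d(x, c') >= d(c, c') - d(x, c) >= d(x, c), so d(x, c') never needs to be computed. A sketch with illustrative helper names:

```python
import math

def assign_with_pruning(points, centers):
    # Pairwise centre distances, computed once per assignment pass.
    cc = [[math.dist(a, b) for b in centers] for a in centers]
    labels, skipped = [], 0
    for p in points:
        best, d_best = 0, math.dist(p, centers[0])
        for j in range(1, len(centers)):
            if cc[best][j] >= 2 * d_best:
                skipped += 1   # pruned: c_j provably cannot be closer
                continue
            d = math.dist(p, centers[j])
            if d < d_best:
                best, d_best = j, d
        labels.append(best)
    return labels, skipped
```

The labels are identical to a brute-force assignment; only the number of distance computations drops, which is what makes the exact algorithm faster on large datasets.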
3 Support vector machines
In today’s machine learning applications, support vector machines (SVMs) [83] are considered a must-try: they offer one of the most robust and accurate methods among all well-known algorithms. SVMs have a sound theoretical foundation, require only a dozen examples for training, and are insensitive to the number of dimensions. In addition, efficient methods for training SVMs are being developed at a fast pace.
In a two-class learning task, the aim of SVM is to find the best classification function to distinguish between members of the two classes in the training data. The metric for the concept of the “best” classification function can be realized geometrically. For a linearly separable dataset, a linear classification function corresponds to a separating hyperplane f(x) that passes through the middle of the two classes, separating the two. Once this function is determined, a new data instance x_n can be classified by simply testing the sign of the function f(x_n); x_n belongs to the positive class if f(x_n) > 0.
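The decision rule just described is a one-liner; the weight vector and bias in the example below are assumed for illustration rather than learned:

```python
def f(x, w, b):
    # Linear classification function f(x) = w . x + b.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x, w, b):
    # x belongs to the positive class iff f(x) > 0.
    return 1 if f(x, w, b) > 0 else -1
```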
Because there are many such linear hyperplanes, what SVM additionally guarantees is that the best such function is found by maximizing the margin between the two classes. Intuitively, the margin is defined as the amount of space, or separation, between the two classes as defined by the hyperplane. Geometrically, the margin corresponds to the shortest distance from the closest data points to a point on the hyperplane. Having this geometric definition allows us to explore how to maximize the margin, so that even though there are an infinite number of hyperplanes, only a few qualify as the solution to SVM.
The reason why SVM insists on finding the maximum-margin hyperplane is that it offers the best generalization ability. It allows not only the best classification performance (e.g., accuracy) on the training data, but also leaves much room for the correct classification of future data. To ensure that the maximum-margin hyperplane is actually found, an SVM classifier attempts to minimize the following function with respect to w and b:
L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{t} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{t} \alpha_i   (2)
where t is the number of training examples, and α_i, i = 1, ..., t, are non-negative numbers such that the derivatives of L_P with respect to α_i are zero. The α_i are the Lagrange multipliers and L_P is called the Lagrangian. In this equation, the vector w and the constant b define the hyperplane.
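In practice a large-margin separator is often found by gradient methods rather than by solving Eq. (2) directly. The sketch below uses subgradient descent on the regularized hinge loss, a common soft-margin surrogate; the hyperparameters and helper names are assumptions of this sketch, not the Lagrangian optimization described above:

```python
def train_svm(data, lam=0.01, epochs=200, lr=0.1):
    # Subgradient descent on   lam/2 ||w||^2 + sum_i max(0, 1 - y_i (w.x_i + b)),
    # a soft-margin surrogate for the maximum-margin objective.
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point violates the margin: move w toward y * x.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Only the regularizer acts, pulling w toward zero.
                w = [wi * (1 - lr * lam) for wi in w]
    return w, b
```

For linearly separable data this drives all training margins toward the desired value, yielding a separating hyperplane comparable to the one characterized by Eq. (2).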
There are several important questions and related extensions on the above basic formula-
tion of support vector machines. We list these questions and extensions below.
1. Can we understand the meaning of the SVM through a solid theoretical foundation?
2. Can we extend the SVM formulation to handle cases where we allow errors to exist,
when even the best hyperplane must admit some errors on the training data?
3. Can we extend the SVM formulation so that it works in situations where the training
data are not linearly separable?
4. Can we extend the SVM formulation so that the task is to predict numerical values, or to rank instances by their likelihood of being positive class members, rather than to classify them?