Top 10 algorithms in data mining
k-means can be paired with another algorithm to describe non-convex clusters. One first clusters the data into a large number of groups using k-means. These groups are then agglomerated into larger clusters using single-link hierarchical clustering, which can detect complex shapes. This approach also makes the solution less sensitive to initialization, and since the hierarchical method provides results at multiple resolutions, one does not need to pre-specify k either.
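This two-stage idea can be sketched in a few lines. The helper names and the deterministic initialization below are illustrative assumptions of this sketch, not prescribed by the paper:

```python
import math

def kmeans(points, k, iters=50):
    # Plain k-means; for this sketch, centres are initialized
    # deterministically as evenly spaced data points.
    centers = points[:: max(1, len(points) // k)][:k]
    k = len(centers)
    groups = [points]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[j].append(p)
        centers = [tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return [g for g in groups if g]

def single_link_merge(groups, target):
    # Single-link agglomeration: repeatedly merge the two groups whose
    # closest members are nearest. Chains of such merges can trace out
    # non-convex cluster shapes that k-means alone cannot describe.
    groups = [list(g) for g in groups]
    while len(groups) > target:
        i, j = min(((a, b) for a in range(len(groups))
                    for b in range(a + 1, len(groups))),
                   key=lambda ab: min(math.dist(p, q)
                                      for p in groups[ab[0]]
                                      for q in groups[ab[1]]))
        groups[i].extend(groups.pop(j))
    return groups
```

On two elongated, well-separated point sets, over-clustering with k-means and then merging by single link recovers the two shapes without specifying their form in advance.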
The cost of the optimal solution decreases with increasing k until it reaches zero when the number of clusters equals the number of distinct data points. This makes it more difficult to (a) directly compare solutions with different numbers of clusters and (b) find the optimal value of k. If the desired k is not known in advance, one will typically run k-means with different values of k, and then use a suitable criterion to select one of the results. For example, SAS uses the cubic clustering criterion, while X-means adds a complexity term (which increases with k) to the original cost function (Eq. 1) and then identifies the k which minimizes this adjusted cost. Alternatively, one can progressively increase the number of clusters, in conjunction with a suitable stopping criterion. Bisecting k-means [73] achieves this by first putting all the data into a single cluster, and then recursively splitting the least compact cluster into two using 2-means. The celebrated LBG algorithm [34] used for vector quantization doubles the number of clusters until a suitable code-book size is obtained. Both these approaches thus alleviate the need to know k beforehand.
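The bisecting strategy can be sketched as follows; the farthest-pair initialization and the compactness measure (within-cluster sum of squared distances) are assumptions of this sketch, not details from [73]:

```python
import math

def two_means(points, iters=20):
    # Plain 2-means, initialized deterministically with the two points
    # farthest apart (an assumption of this sketch).
    c0, c1 = max(((p, q) for p in points for q in points),
                 key=lambda pq: math.dist(pq[0], pq[1]))
    centers = [c0, c1]
    halves = [list(points), []]
    for _ in range(iters):
        halves = [[], []]
        for p in points:
            near = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            halves[near].append(p)
        centers = [tuple(sum(x) / len(h) for x in zip(*h)) if h else centers[j]
                   for j, h in enumerate(halves)]
    return [h for h in halves if h]

def sse(cluster):
    # Compactness: sum of squared distances to the cluster mean.
    mean = tuple(sum(x) / len(cluster) for x in zip(*cluster))
    return sum(math.dist(p, mean) ** 2 for p in cluster)

def bisecting_kmeans(points, k):
    # Start with all data in one cluster; repeatedly split the least
    # compact cluster into two until k clusters are obtained.
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        parts = two_means(clusters[worst])
        if len(parts) < 2:      # the worst cluster would not split further
            break
        clusters[worst : worst + 1] = parts
    return clusters
```

A stopping criterion on the worst cluster's compactness could replace the fixed k, which is how this family of methods avoids specifying k up front.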
The algorithm is also sensitive to the presence of outliers, since the “mean” is not a robust statistic. A preprocessing step to remove outliers can be helpful. Post-processing the results, for example to eliminate small clusters, or to merge close clusters into a large cluster, is also desirable. Ball and Hall’s ISODATA algorithm from 1967 effectively used both pre- and post-processing on k-means.
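As an illustration of such a preprocessing step, one simple heuristic (assumed here, not taken from the paper) drops points whose distance to the overall mean is far above the average such distance:

```python
import math

def drop_outliers(points, factor=2.0):
    # Remove points whose distance to the global mean exceeds `factor`
    # times the average distance; the factor is an assumed heuristic.
    mean = tuple(sum(x) / len(points) for x in zip(*points))
    dists = [math.dist(p, mean) for p in points]
    cutoff = factor * sum(dists) / len(dists)
    return [p for p, d in zip(points, dists) if d <= cutoff]
```

Running k-means on the filtered points keeps a single extreme value from dragging a cluster mean away from the bulk of its data.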
2.3 Generalizations and connections
As mentioned earlier, k-means is closely related to fitting a mixture of k isotropic Gaussians to the data. Moreover, the generalization of the distance measure to all Bregman divergences is related to fitting the data with a mixture of k components from the exponential family of distributions. Another broad generalization is to view the “means” as probabilistic models instead of points in R^d. Here, in the assignment step, each data point is assigned to the model most likely to have generated it. In the “relocation” step, the model parameters are updated to best fit the assigned data. Such model-based k-means allows one to cater to more complex data, e.g. sequences described by hidden Markov models.
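A minimal one-dimensional sketch of this model-based view, assuming each “mean” is a Gaussian with its own mean and variance (richer models such as HMMs would slot into the same two-step loop); all helper names are illustrative:

```python
import math

def fit(cluster):
    # Refit a model (mean, variance) to its assigned points.
    m = sum(cluster) / len(cluster)
    v = sum((x - m) ** 2 for x in cluster) / len(cluster)
    return m, v if v > 0 else 1e-6

def log_likelihood(x, model):
    # Log density of x under a univariate Gaussian model.
    m, v = model
    return -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)

def model_based_kmeans(xs, models, iters=10):
    for _ in range(iters):
        # Assignment step: each point goes to the most likely model.
        clusters = [[] for _ in models]
        for x in xs:
            j = max(range(len(models)), key=lambda i: log_likelihood(x, models[i]))
            clusters[j].append(x)
        # "Relocation" step: refit each model to its assigned points.
        models = [fit(c) if c else models[i] for i, c in enumerate(clusters)]
    return models, clusters
```

With isotropic Gaussians of fixed equal variance this reduces exactly to ordinary k-means, since the most likely model is then simply the nearest mean.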
One can also “kernelize” k-means [19]. Though boundaries between clusters are still linear in the implicit high-dimensional feature space, they can become non-linear when projected back to the original space, thus allowing kernel k-means to deal with more complex clusters. Dhillon et al. [19] have shown a close connection between kernel k-means and spectral clustering. The k-medoids algorithm is similar to k-means, except that the centroids have to belong to the data set being clustered. Fuzzy c-means is also similar, except that it computes a fuzzy membership function for each cluster rather than a hard one.
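Kernel k-means can be sketched directly from the standard kernel-trick expansion ||phi(x) - m_c||^2 = K(x, x) - (2/|c|) sum_j K(x, x_j) + (1/|c|^2) sum_{j,l} K(x_j, x_l), so the implicit means never need an explicit representation. The RBF kernel and the helper names are assumptions of this sketch:

```python
import math

def rbf(a, b, gamma=1.0):
    # Radial basis function kernel K(a, b) = exp(-gamma * ||a - b||^2).
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def kernel_kmeans(points, assign, iters=10, kernel=rbf):
    n = len(points)
    K = [[kernel(points[i], points[j]) for j in range(n)] for i in range(n)]
    for _ in range(iters):
        clusters = {}
        for i, c in enumerate(assign):
            clusters.setdefault(c, []).append(i)
        def dist2(i, idx):
            # ||phi(x_i) - mean_c||^2 expanded purely in kernel evaluations.
            s1 = sum(K[i][j] for j in idx) / len(idx)
            s2 = sum(K[j][l] for j in idx for l in idx) / len(idx) ** 2
            return K[i][i] - 2 * s1 + s2
        assign = [min(clusters, key=lambda c: dist2(i, clusters[c]))
                  for i in range(n)]
    return assign
```

Even though each boundary is a hyperplane in the feature space, its preimage in the input space can curve around clusters that ordinary k-means would cut through.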
Despite its drawbacks, k-means remains the most widely used partitional clustering algorithm in practice. The algorithm is simple, easily understandable and reasonably scalable, and can be easily modified to deal with streaming data. To deal with very large datasets, substantial effort has also gone into further speeding up k-means, most notably by using kd-trees or by exploiting the triangle inequality to avoid comparing each data point with all the centroids during the assignment step. Continual improvements and generalizations of the
basic algorithm have ensured its continued relevance and gradually increased its effectiveness
as well.
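The triangle-inequality speed-up mentioned above rests on a simple bound: if d(c, c') >= 2 d(x, c) for the currently best centre c, then d(x, c') >= d(c, c') - d(x, c) >= d(x, c), so d(x, c') never needs to be computed. A sketch with illustrative helper names:

```python
import math

def assign_with_pruning(points, centers):
    # Pairwise centre distances, computed once per assignment pass.
    cc = [[math.dist(a, b) for b in centers] for a in centers]
    labels, skipped = [], 0
    for p in points:
        best, d_best = 0, math.dist(p, centers[0])
        for j in range(1, len(centers)):
            if cc[best][j] >= 2 * d_best:
                skipped += 1   # pruned: c_j provably cannot be closer
                continue
            d = math.dist(p, centers[j])
            if d < d_best:
                best, d_best = j, d
        labels.append(best)
    return labels, skipped
```

The labels are identical to a brute-force assignment; only the number of distance computations drops, which is what makes the exact algorithm faster on large datasets.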
3 Support vector machines
In today’s machine learning applications, support vector machines (SVMs) [83] are considered a must-try: they offer one of the most robust and accurate methods among all well-known algorithms. SVMs have a sound theoretical foundation, require only a dozen examples for training, and are insensitive to the number of dimensions. In addition, efficient methods for training SVMs are being developed at a fast pace.
In a two-class learning task, the aim of SVM is to find the best classification function to distinguish between members of the two classes in the training data. The metric for the concept of the “best” classification function can be realized geometrically. For a linearly separable dataset, a linear classification function corresponds to a separating hyperplane f(x) that passes through the middle of the two classes, separating the two. Once this function is determined, a new data instance x_n can be classified by simply testing the sign of the function f(x_n); x_n belongs to the positive class if f(x_n) > 0.
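The decision rule just described is a one-liner; the weight vector and bias in the example below are assumed for illustration rather than learned:

```python
def f(x, w, b):
    # Linear classification function f(x) = w . x + b.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x, w, b):
    # x belongs to the positive class iff f(x) > 0.
    return 1 if f(x, w, b) > 0 else -1
```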
Because there are many such linear hyperplanes, what SVM additionally guarantees is that the best such function is found by maximizing the margin between the two classes. Intuitively, the margin is defined as the amount of space, or separation, between the two classes as defined by the hyperplane. Geometrically, the margin corresponds to the shortest distance from the closest data points to a point on the hyperplane. Having this geometric definition allows us to explore how to maximize the margin, so that even though there are an infinite number of hyperplanes, only a few qualify as the solution to SVM.
The reason why SVM insists on finding the maximum-margin hyperplane is that it offers the best generalization ability. It allows not only the best classification performance (e.g., accuracy) on the training data, but also leaves much room for the correct classification of future data. To ensure that the maximum-margin hyperplane is actually found, an SVM classifier attempts to minimize the following function with respect to w and b:
L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{t} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{t} \alpha_i   (2)
where t is the number of training examples, and α_i, i = 1, ..., t, are non-negative numbers such that the derivatives of L_P with respect to α_i are zero. The α_i are the Lagrange multipliers and L_P is called the Lagrangian. In this equation, the vector w and the constant b define the hyperplane.
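In practice a large-margin separator is often found by gradient methods rather than by solving Eq. (2) directly. The sketch below uses subgradient descent on the regularized hinge loss, a common soft-margin surrogate; the hyperparameters and helper names are assumptions of this sketch, not the Lagrangian optimization described above:

```python
def train_svm(data, lam=0.01, epochs=200, lr=0.1):
    # Subgradient descent on   lam/2 ||w||^2 + sum_i max(0, 1 - y_i (w.x_i + b)),
    # a soft-margin surrogate for the maximum-margin objective.
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point violates the margin: move w toward y * x.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Only the regularizer acts, pulling w toward zero.
                w = [wi * (1 - lr * lam) for wi in w]
    return w, b
```

For linearly separable data this drives all training margins toward the desired value, yielding a separating hyperplane comparable to the one characterized by Eq. (2).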
There are several important questions and related extensions on the above basic formula-
tion of support vector machines. We list these questions and extensions below.
1. Can we understand the meaning of the SVM through a solid theoretical foundation?
2. Can we extend the SVM formulation to handle cases where we allow errors to exist,
when even the best hyperplane must admit some errors on the training data?
3. Can we extend the SVM formulation so that it works in situations where the training
data are not linearly separable?
4. Can we extend the SVM formulation so that the task is to predict numerical values, or to rank instances by their likelihood of being positive class members, rather than to classify them?