Top 10 algorithms in data mining
11
5. Can we scale up the algorithm for finding the maximum margin hyperplanes to thousands
and millions of instances?
Question 1
Can we understand the meaning of the SVM through a solid theoretical foun-
dation?
Several important theoretical results exist to answer this question.
A learning machine, such as the SVM, can be modeled as a function class based on some
parameters
α. Different function classes can have different capacity in learning, which is
represented by a parameter h known as the VC dimension [
83
]. The VC dimension measures
the maximum number of training examples where the function class can still be used to learn
perfectly, by obtaining zero error rates on the training data, for any assignment of class labels
on these points. It can be proven that the actual error on the future data is bounded by a sum
of two terms. The first term is the training error, and the second term if proportional to the
square root of the VC dimension h. Thus, if we can minimize h, we can minimize the future
error, as long as we also minimize the training error. In fact, the above maximum margin
function learned by SVM learning algorithms is one such function. Thus, theoretically, the
SVM algorithm is well founded.
Question 2
Can we extend the SVM formulation to handle cases where we allow errors to
exist, when even the best hyperplane must admit some errors on the training data?
To answer this question, imagine that there are a few points of the opposite classes that
cross the middle. These points represent the training error that existing even for the maximum
margin hyperplanes. The “soft margin” idea is aimed at extending the SVM algorithm [
83
]
so that the hyperplane allows a few of such noisy data to exist. In particular, introduce a slack
variable
ξ
i
to account for the amount of a violation of classification by the function f
(x
i
);
ξ
i
has a direct geometric explanation through the distance from a mistakenly classified data
instance to the hyperplane f
(x). Then, the total cost introduced by the slack variables can
be used to revise the original objective minimization function.
Question 3
Can we extend the SVM formulation so that it works in situations where the
training data are not linearly separable?
The answer to this question depends on an observation on the objective function where
the only appearances of x
i
is in the form of a dot product. Thus, if we extend the dot product
x
i
·x
j
through a functional mapping
(x
i
) of each x
i
to a different space
H
of larger and even
possibly infinite dimensions, then the equations still hold. In each equation, where we had
the dot product x
i
· x
j
, we now have the dot product of the transformed vectors
(x
i
) · (x
j
),
which is called a kernel function.
The kernel function can be used to define a variety of nonlinear relationship between its
inputs. For example, besides linear kernel functions, you can define quadratic or exponential
kernel functions. Much study in recent years have gone into the study of different kernels for
SVM classification [
70
] and for many other statistical tests. We can also extend the above
descriptions of the SVM classifiers from binary classifiers to problems that involve more than
two classes. This can be done by repeatedly using one of the classes as a positive class, and
the rest as the negative classes (thus, this method is known as the one-against-all method).
Question 4
Can we extend the SVM formulation so that the task is to learn to approximate
data using a linear function, or to rank the instances in the likelihood of being a positive class
member, rather a classification?
123
12
X. Wu et al.
SVM can be easily extended to perform numerical calculations. Here we discuss two such
extensions. The first is to extend SVM to perform regression analysis, where the goal is to
produce a linear function that can approximate that target function. Careful consideration
goes into the choice of the error models; in support vector regression, or SVR, the error is
defined to be zero when the difference between actual and predicted values are within a epsi-
lon amount. Otherwise, the epsilon insensitive error will grow linearly. The support vectors
can then be learned through the minimization of the Lagrangian. An advantage of support
vector regression is reported to be its insensitivity to outliers.
Another extension is to learn to rank elements rather than producing a classification for
individual elements [
39
]. Ranking can be reduced to comparing pairs of instances and pro-
ducing a
+1 estimate if the pair is in the correct ranking order, and −1 otherwise. Thus, a
way to reduce this task to SVM learning is to construct new instances for each pair of ranked
instance in the training data, and to learn a hyperplane on this new training data.
This method can be applied to many areas where ranking is important, such as in document
ranking in information retrieval areas.
Question 5
Can we scale up the algorithm for finding the maximum margin hyperplanes to
thousands and millions of instances?
One of the initial drawbacks of SVM is its computational inefficiency. However, this prob-
lem is being solved with great success. One approach is to break a large optimization problem
into a series of smaller problems, where each problem only involves a couple of carefully
chosen variables so that the optimization can be done efficiently. The process iterates until
all the decomposed optimization problems are solved successfully. A more recent approach
is to consider the problem of learning an SVM as that of finding an approximate minimum
enclosing ball of a set of instances.
These instances, when mapped to an N -dimensional space, represent a core set that can be
used to construct an approximation to the minimum enclosing ball. Solving the SVM learning
problem on these core sets can produce a good approximation solution in very fast speed.
For example, the core-vector machine [
81
] thus produced can learn an SVM for millions of
data in seconds.
4 The Apriori algorithm
4.1 Description of the algorithm
One of the most popular data mining approaches is to find frequent itemsets from a transaction
dataset and derive association rules. Finding frequent itemsets (itemsets with frequency larger
than or equal to a user specified minimum support) is not trivial because of its combinatorial
explosion. Once frequent itemsets are obtained, it is straightforward to generate association
rules with confidence larger than or equal to a user specified minimum confidence.
Apriori is a seminal algorithm for finding frequent itemsets using candidate generation
[
1
]. It is characterized as a level-wise complete search algorithm using anti-monotonicity of
itemsets, “if an itemset is not frequent, any of its superset is never frequent”. By convention,
Apriori assumes that items within a transaction or itemset are sorted in lexicographic order.
Let the set of frequent itemsets of size k be F
k
and their candidates be C
k
. Apriori first scans
the database and searches for frequent itemsets of size 1 by accumulating the count for each
item and collecting those that satisfy the minimum support requirement. It then iterates on
the following three steps and extracts all the frequent itemsets.
123