S u rv e y pa p e r

Yüklə 471,61 Kb.

Pdf görüntüsü

səhifə	5/16
tarix	08.10.2017
ölçüsü	471,61 Kb.
	#3814

1 2 3 4 5 6 7 8 9 ... 16

4 The Apriori algorithm

Top 10 algorithms in data mining

5. Can we scale up the algorithm for ﬁnding the maximum margin hyperplanes to thousands

and millions of instances?

Question 1

Can we understand the meaning of the SVM through a solid theoretical foun-

dation?

Several important theoretical results exist to answer this question.

A learning machine, such as the SVM, can be modeled as a function class based on some

parameters

α. Different function classes can have different capacity in learning, which is

represented by a parameter h known as the VC dimension [

]. The VC dimension measures

the maximum number of training examples where the function class can still be used to learn

perfectly, by obtaining zero error rates on the training data, for any assignment of class labels

on these points. It can be proven that the actual error on the future data is bounded by a sum

of two terms. The ﬁrst term is the training error, and the second term if proportional to the

square root of the VC dimension h. Thus, if we can minimize h, we can minimize the future

error, as long as we also minimize the training error. In fact, the above maximum margin

function learned by SVM learning algorithms is one such function. Thus, theoretically, the

SVM algorithm is well founded.

Question 2

Can we extend the SVM formulation to handle cases where we allow errors to

exist, when even the best hyperplane must admit some errors on the training data?

To answer this question, imagine that there are a few points of the opposite classes that

cross the middle. These points represent the training error that existing even for the maximum

margin hyperplanes. The “soft margin” idea is aimed at extending the SVM algorithm [

]

so that the hyperplane allows a few of such noisy data to exist. In particular, introduce a slack

variable

ξ

i

to account for the amount of a violation of classiﬁcation by the function f

);

has a direct geometric explanation through the distance from a mistakenly classiﬁed data

instance to the hyperplane f

(x). Then, the total cost introduced by the slack variables can

be used to revise the original objective minimization function.

Question 3

Can we extend the SVM formulation so that it works in situations where the

training data are not linearly separable?

The answer to this question depends on an observation on the objective function where

the only appearances of x

i

is in the form of a dot product. Thus, if we extend the dot product

x

i

·x

through a functional mapping

(x

i

) of each x

to a different space

of larger and even

possibly inﬁnite dimensions, then the equations still hold. In each equation, where we had

the dot product x

· x

, we now have the dot product of the transformed vectors

(x

i

) · (x

which is called a kernel function.

The kernel function can be used to deﬁne a variety of nonlinear relationship between its

inputs. For example, besides linear kernel functions, you can deﬁne quadratic or exponential

kernel functions. Much study in recent years have gone into the study of different kernels for

SVM classiﬁcation [

] and for many other statistical tests. We can also extend the above

descriptions of the SVM classiﬁers from binary classiﬁers to problems that involve more than

two classes. This can be done by repeatedly using one of the classes as a positive class, and

the rest as the negative classes (thus, this method is known as the one-against-all method).

Question 4

Can we extend the SVM formulation so that the task is to learn to approximate

data using a linear function, or to rank the instances in the likelihood of being a positive class

member, rather a classiﬁcation?

123

X. Wu et al.

SVM can be easily extended to perform numerical calculations. Here we discuss two such

extensions. The ﬁrst is to extend SVM to perform regression analysis, where the goal is to

produce a linear function that can approximate that target function. Careful consideration

goes into the choice of the error models; in support vector regression, or SVR, the error is

deﬁned to be zero when the difference between actual and predicted values are within a epsi-

lon amount. Otherwise, the epsilon insensitive error will grow linearly. The support vectors

can then be learned through the minimization of the Lagrangian. An advantage of support

vector regression is reported to be its insensitivity to outliers.

Another extension is to learn to rank elements rather than producing a classiﬁcation for

individual elements [

]. Ranking can be reduced to comparing pairs of instances and pro-

ducing a

+1 estimate if the pair is in the correct ranking order, and −1 otherwise. Thus, a

way to reduce this task to SVM learning is to construct new instances for each pair of ranked

instance in the training data, and to learn a hyperplane on this new training data.

This method can be applied to many areas where ranking is important, such as in document

ranking in information retrieval areas.

Question 5

Can we scale up the algorithm for ﬁnding the maximum margin hyperplanes to

thousands and millions of instances?

One of the initial drawbacks of SVM is its computational inefﬁciency. However, this prob-

lem is being solved with great success. One approach is to break a large optimization problem

into a series of smaller problems, where each problem only involves a couple of carefully

chosen variables so that the optimization can be done efﬁciently. The process iterates until

all the decomposed optimization problems are solved successfully. A more recent approach

is to consider the problem of learning an SVM as that of ﬁnding an approximate minimum

enclosing ball of a set of instances.

These instances, when mapped to an N -dimensional space, represent a core set that can be

used to construct an approximation to the minimum enclosing ball. Solving the SVM learning

problem on these core sets can produce a good approximation solution in very fast speed.

For example, the core-vector machine [

] thus produced can learn an SVM for millions of

data in seconds.

4 The Apriori algorithm

4.1 Description of the algorithm

One of the most popular data mining approaches is to ﬁnd frequent itemsets from a transaction

dataset and derive association rules. Finding frequent itemsets (itemsets with frequency larger

than or equal to a user speciﬁed minimum support) is not trivial because of its combinatorial

explosion. Once frequent itemsets are obtained, it is straightforward to generate association

rules with conﬁdence larger than or equal to a user speciﬁed minimum conﬁdence.

Apriori is a seminal algorithm for ﬁnding frequent itemsets using candidate generation

[

1

]. It is characterized as a level-wise complete search algorithm using anti-monotonicity of

itemsets, “if an itemset is not frequent, any of its superset is never frequent”. By convention,

Apriori assumes that items within a transaction or itemset are sorted in lexicographic order.

Let the set of frequent itemsets of size k be F

k

and their candidates be C

. Apriori ﬁrst scans

the database and searches for frequent itemsets of size 1 by accumulating the count for each

item and collecting those that satisfy the minimum support requirement. It then iterates on

the following three steps and extracts all the frequent itemsets.

123

Yüklə 471,61 Kb.

Dostları ilə paylaş:

1 2 3 4 5 6 7 8 9 ... 16