Top 10 algorithms in data mining
Fig. 4 The power iteration method for PageRank:

    PageRank-Iterate(G)
        P_0 ← e/n
        k ← 1
        repeat
            P_k ← (1 − d)e + dA^T P_{k−1};
            k ← k + 1;
        until ||P_k − P_{k−1}||_1 < ε
        return P_k
the largest eigenvalue and the PageRank vector P is the principal eigenvector. A well-known mathematical technique called power iteration [30] can be used to find P.
However, the problem is that Eq. (14) does not quite suffice because the Web graph does not meet the required conditions. In fact, Eq. (14) can also be derived from a Markov-chain formulation, which allows theoretical results on Markov chains to be applied. After augmenting the Web graph to satisfy the conditions, the following PageRank equation is produced:
P = (1 − d)e + dA^T P,    (15)
where e is a column vector of all 1's. This gives us the PageRank formula for each page i:

P(i) = (1 − d) + d ∑_{j=1}^{n} A_{ji} P(j),    (16)
which is equivalent to the formula given in the original PageRank papers [10, 61]:

P(i) = (1 − d) + d ∑_{(j,i)∈E} P(j)/O_j.    (17)
The parameter d is called the damping factor, which can be set to a value between 0 and 1; d = 0.85 is used in [10, 52].
The computation of the PageRank values of the Web pages can be done using the power iteration method [30], which produces the principal eigenvector with an eigenvalue of 1. The algorithm is simple and is given in Fig. 4. One can start with any initial assignment of PageRank values. The iteration ends when the PageRank values converge, i.e., when the 1-norm of the residual vector falls below a pre-specified threshold ε.
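As a concrete illustration, the iteration of Fig. 4 can be sketched in a few lines of Python. The three-page graph here is made up for illustration; the matrix A holds A_{ij} = 1/O_i for each link i → j (O_i being i's out-degree), so the matrix update of Eq. (15) agrees with the per-page sum of Eq. (17):

```python
# Power-iteration sketch for PageRank (Fig. 4), on a hypothetical 3-page graph.
# A[i][j] = 1/O_i if page i links to page j, where O_i is i's out-degree.
import numpy as np

def pagerank_iterate(A, d=0.85, eps=1e-8):
    n = A.shape[0]
    e = np.ones(n)                 # column vector of all 1's
    p = e / n                      # P_0 <- e/n
    while True:
        p_next = (1 - d) * e + d * (A.T @ p)   # Eq. (15): P <- (1-d)e + dA^T P
        if np.abs(p_next - p).sum() < eps:     # 1-norm of residual below eps
            return p_next
        p = p_next

# Links: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1 (0-indexed below)
A = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
p = pagerank_iterate(A)
print(p)                           # page 3 (index 2) receives the highest rank
```

Note that with this scaling the PageRank values sum to n rather than 1, consistent with Eq. (16), whose constant term is (1 − d) rather than (1 − d)/n.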
Since in Web search we are only interested in the ranking of the pages, actual convergence may not be necessary, so fewer iterations suffice. In [10], it is reported that on a database of 322 million links the algorithm converges to an acceptable tolerance in roughly 52 iterations.
6.3 Further references on PageRank
Since PageRank was presented in [10, 61], researchers have proposed many enhancements to the model, alternative models, improvements for its computation, the addition of the temporal dimension [91], etc. The books by Liu [52] and by Langville and Meyer [49] contain in-depth analyses of PageRank and several other link-based algorithms.
X. Wu et al.
7 AdaBoost
7.1 Description of the algorithm
Ensemble learning [20] deals with methods that employ multiple learners to solve a problem. The generalization ability of an ensemble is usually significantly better than that of a single learner, which makes ensemble methods very attractive. The AdaBoost algorithm [24] proposed by Yoav Freund and Robert Schapire is one of the most important ensemble methods, since it has a solid theoretical foundation, very accurate predictions, great simplicity (Schapire said it needs only "just 10 lines of code"), and wide and successful applications.
Let X denote the instance space and Y the set of class labels. Assume Y = {−1, +1}. Given a weak or base learning algorithm and a training set {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, where x_i ∈ X and y_i ∈ Y (i = 1, ..., m), the AdaBoost algorithm works as follows. First, it assigns equal weights to all the training examples (x_i, y_i) (i ∈ {1, ..., m}). Denote the distribution of the weights at the t-th learning round as D_t. From the training set and D_t, the algorithm generates a weak or base learner h_t : X → Y by calling the base learning algorithm. Then it uses the training examples to test h_t, and the weights of the incorrectly classified examples are increased. Thus, an updated weight distribution D_{t+1} is obtained. From the training set and D_{t+1}, AdaBoost generates another weak learner by calling the base learning algorithm again. This process is repeated for T rounds, and the final model is derived by weighted majority voting of the T weak learners, where the weights of the learners are determined during the training process. In practice, the base learning algorithm may be one that can use weighted training examples directly; otherwise the weights can be exploited by sampling the training examples according to the weight distribution D_t. The pseudo-code of AdaBoost is shown in Fig. 5.
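The round-by-round procedure just described can be sketched with simple threshold ("decision stump") base learners. This is only an illustrative toy, not the pseudo-code of Fig. 5; the 1-D data set is made up, and the learner weight α_t = ½ ln((1 − ε_t)/ε_t) is the standard AdaBoost choice:

```python
# Minimal AdaBoost sketch with threshold "decision stumps" on 1-D data.
# Labels are in {-1, +1}; the weights D_t are updated as described above.
import math

def stump_predict(x, thresh, sign):
    return sign if x > thresh else -sign

def train_stump(xs, ys, w):
    # The "base learning algorithm": pick the threshold/polarity with the
    # lowest weighted error under the current distribution D_t.
    best = None
    for t in sorted(set(xs)):
        for sign in (-1, 1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if stump_predict(xi, t, sign) != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, T=10):
    m = len(xs)
    w = [1.0 / m] * m                      # equal initial weights (D_1)
    learners = []
    for _ in range(T):
        err, t, sign = train_stump(xs, ys, w)
        if err >= 0.5:                     # base learner no better than chance
            break
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # weight of this learner
        learners.append((alpha, t, sign))
        # Increase the weights of misclassified examples, then renormalize
        # to obtain D_{t+1}.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, t, sign))
             for xi, yi, wi in zip(xs, ys, w)]
        s = sum(w)
        w = [wi / s for wi in w]
    return learners

def predict(learners, x):                  # weighted majority vote
    score = sum(a * stump_predict(x, t, sign) for a, t, sign in learners)
    return 1 if score >= 0 else -1

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [+1, +1, -1, -1, +1, +1]              # not separable by any single stump
model = adaboost(xs, ys, T=3)
print([predict(model, x) for x in xs])     # fits the data: [1, 1, -1, -1, 1, 1]
```

No single stump classifies this data set correctly, but the weighted vote of three stumps does, which is exactly the point of boosting.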
In order to deal with multi-class problems, Freund and Schapire presented the AdaBoost.M1 algorithm [24], which requires that the weak learners be strong enough even on the hard distributions generated during the AdaBoost process. Another popular multi-class version of AdaBoost is AdaBoost.MH [69], which works by decomposing the multi-class task into a series of binary tasks. AdaBoost algorithms for dealing with regression problems have also been studied. Since many variants of AdaBoost have been developed during the past decade, Boosting has become the most important "family" of ensemble methods.
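The decomposition idea behind AdaBoost.MH can be illustrated in a couple of lines (a sketch of the reduction only, not the full algorithm, on made-up data): each example is paired with every possible label, and the pair is marked +1 when the label is the example's true class and −1 otherwise, yielding a single binary problem over (example, label) pairs.

```python
# Sketch of the multi-class-to-binary reduction used by AdaBoost.MH.
# Hypothetical 3-class data; each (x, label) pair becomes one binary example.
classes = ["a", "b", "c"]
data = [("x1", "a"), ("x2", "c"), ("x3", "b")]

binary = [((x, l), +1 if l == y else -1) for (x, y) in data for l in classes]
print(len(binary))   # 3 examples x 3 classes = 9 binary examples
print(binary[0])     # (('x1', 'a'), 1)
```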
7.2 Impact of the algorithm
As mentioned in Sect. 7.1, AdaBoost is one of the most important ensemble methods, so it is not surprising that its high impact can be observed here and there. In this short article we only briefly introduce two issues, one theoretical and the other applied.
In 1988, Kearns and Valiant posed an interesting question: whether a weak learning algorithm that performs just slightly better than random guessing could be "boosted" into an arbitrarily accurate strong learning algorithm; in other words, whether the two complexity classes of weakly learnable and strongly learnable problems are equal. Schapire [67] found that the answer is "yes", and the proof he gave is a construction, which became the first Boosting algorithm. So it is evident that AdaBoost was born with theoretical significance. AdaBoost has given rise to abundant research on the theoretical aspects of ensemble methods, which can easily be found in the machine learning and statistics literature. It is worth mentioning that for their AdaBoost paper [24], Schapire and Freund won the Gödel Prize, one of the most prestigious awards in theoretical computer science, in 2003.