Top 10 algorithms in data mining
Fig. 4 The power iteration method for PageRank:

    PageRank-Iterate(G)
        P_0 ← e/n
        k ← 1
        repeat
            P_k ← (1 − d)e + dA^T P_{k−1};
            k ← k + 1;
        until ||P_k − P_{k−1}||_1 < ε
        return P_k
the largest eigenvalue and the PageRank vector P is the principal eigenvector. A well-known mathematical technique called power iteration [30] can be used to find P.
However, the problem is that Eq. (14) does not quite suffice because the Web graph does not meet the required conditions. In fact, Eq. (14) can also be derived from a Markov-chain formulation, which allows theoretical results on Markov chains to be applied. After augmenting the Web graph to satisfy the conditions, the following PageRank equation is produced:
P = (1 − d)e + dA^T P,    (15)
where e is a column vector of all 1's. This gives us the PageRank formula for each page i:

P(i) = (1 − d) + d ∑_{j=1}^{n} A_{ji} P(j),    (16)
which is equivalent to the formula given in the original PageRank papers [10, 61]:

P(i) = (1 − d) + d ∑_{(j,i)∈E} P(j)/O_j.    (17)
The parameter d is called the damping factor, which can be set to a value between 0 and 1; d = 0.85 is used in [10, 52].
The computation of the PageRank values of the Web pages can be done using the power iteration method [30], which produces the principal eigenvector with an eigenvalue of 1. The algorithm is simple and is given in Fig. 4. One can start with any initial assignment of PageRank values. The iteration ends when the PageRank values converge, i.e., when the 1-norm of the residual vector falls below a pre-specified threshold ε.
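As a concrete illustration, the iteration of Fig. 4 can be sketched in a few lines of Python. The three-page graph here is made up for illustration; the matrix A holds A_{ij} = 1/O_i for each link i → j (O_i being i's out-degree), so the matrix update of Eq. (15) agrees with the per-page sum of Eq. (17):

```python
# Power-iteration sketch for PageRank (Fig. 4), on a hypothetical 3-page graph.
# A[i][j] = 1/O_i if page i links to page j, where O_i is i's out-degree.
import numpy as np

def pagerank_iterate(A, d=0.85, eps=1e-8):
    n = A.shape[0]
    e = np.ones(n)                 # column vector of all 1's
    p = e / n                      # P_0 <- e/n
    while True:
        p_next = (1 - d) * e + d * (A.T @ p)   # Eq. (15): P <- (1-d)e + dA^T P
        if np.abs(p_next - p).sum() < eps:     # 1-norm of residual below eps
            return p_next
        p = p_next

# Links: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1 (0-indexed below)
A = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
p = pagerank_iterate(A)
print(p)                           # page 3 (index 2) receives the highest rank
```

Note that with this scaling the PageRank values sum to n rather than 1, consistent with Eq. (16), whose constant term is (1 − d) rather than (1 − d)/n.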
Since in Web search we are only interested in the ranking of the pages, actual convergence may not be necessary, so fewer iterations suffice. In [10], it is reported that on a database of 322 million links the algorithm converges to an acceptable tolerance in roughly 52 iterations.
6.3 Further references on PageRank
Since PageRank was presented in [10, 61], researchers have proposed many enhancements to the model, alternative models, improvements for its computation, the addition of the temporal dimension [91], etc. The books by Liu [52] and by Langville and Meyer [49] contain in-depth analyses of PageRank and several other link-based algorithms.
X. Wu et al.
7 AdaBoost
7.1 Description of the algorithm
Ensemble learning [20] deals with methods that employ multiple learners to solve a problem. The generalization ability of an ensemble is usually significantly better than that of a single learner, which makes ensemble methods very attractive. The AdaBoost algorithm [24] proposed by Yoav Freund and Robert Schapire is one of the most important ensemble methods, since it has a solid theoretical foundation, very accurate predictions, great simplicity (Schapire said it needs only "just 10 lines of code"), and wide and successful applications.
Let X denote the instance space and Y the set of class labels. Assume Y = {−1, +1}. Given a weak or base learning algorithm and a training set {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, where x_i ∈ X and y_i ∈ Y (i = 1, ..., m), the AdaBoost algorithm works as follows. First, it assigns equal weights to all the training examples (x_i, y_i) (i ∈ {1, ..., m}). Denote the distribution of the weights at the t-th learning round as D_t. From the training set and D_t, the algorithm generates a weak or base learner h_t : X → Y by calling the base learning algorithm. Then it uses the training examples to test h_t, and the weights of the incorrectly classified examples are increased. Thus, an updated weight distribution D_{t+1} is obtained. From the training set and D_{t+1}, AdaBoost generates another weak learner by calling the base learning algorithm again. This process is repeated for T rounds, and the final model is derived by weighted majority voting of the T weak learners, where the weights of the learners are determined during the training process. In practice, the base learning algorithm may be one that can use weighted training examples directly; otherwise the weights can be exploited by sampling the training examples according to the weight distribution D_t. The pseudo-code of AdaBoost is shown in Fig. 5.
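The round-by-round procedure just described can be sketched with simple threshold ("decision stump") base learners. This is only an illustrative toy, not the pseudo-code of Fig. 5; the 1-D data set is made up, and the learner weight α_t = ½ ln((1 − ε_t)/ε_t) is the standard AdaBoost choice:

```python
# Minimal AdaBoost sketch with threshold "decision stumps" on 1-D data.
# Labels are in {-1, +1}; the weights D_t are updated as described above.
import math

def stump_predict(x, thresh, sign):
    return sign if x > thresh else -sign

def train_stump(xs, ys, w):
    # The "base learning algorithm": pick the threshold/polarity with the
    # lowest weighted error under the current distribution D_t.
    best = None
    for t in sorted(set(xs)):
        for sign in (-1, 1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if stump_predict(xi, t, sign) != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, T=10):
    m = len(xs)
    w = [1.0 / m] * m                      # equal initial weights (D_1)
    learners = []
    for _ in range(T):
        err, t, sign = train_stump(xs, ys, w)
        if err >= 0.5:                     # base learner no better than chance
            break
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # weight of this learner
        learners.append((alpha, t, sign))
        # Increase the weights of misclassified examples, then renormalize
        # to obtain D_{t+1}.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, t, sign))
             for xi, yi, wi in zip(xs, ys, w)]
        s = sum(w)
        w = [wi / s for wi in w]
    return learners

def predict(learners, x):                  # weighted majority vote
    score = sum(a * stump_predict(x, t, sign) for a, t, sign in learners)
    return 1 if score >= 0 else -1

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [+1, +1, -1, -1, +1, +1]              # not separable by any single stump
model = adaboost(xs, ys, T=3)
print([predict(model, x) for x in xs])     # fits the data: [1, 1, -1, -1, 1, 1]
```

No single stump classifies this data set correctly, but the weighted vote of three stumps does, which is exactly the point of boosting.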
In order to deal with multi-class problems, Freund and Schapire presented the AdaBoost.M1 algorithm [24], which requires that the weak learners be strong enough even on the hard distributions generated during the AdaBoost process. Another popular multi-class version of AdaBoost is AdaBoost.MH [69], which works by decomposing the multi-class task into a series of binary tasks. AdaBoost algorithms for dealing with regression problems have also been studied. Since many variants of AdaBoost have been developed during the past decade, Boosting has become the most important "family" of ensemble methods.
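The decomposition idea behind AdaBoost.MH can be illustrated in a couple of lines (a sketch of the reduction only, not the full algorithm, on made-up data): each example is paired with every possible label, and the pair is marked +1 when the label is the example's true class and −1 otherwise, yielding a single binary problem over (example, label) pairs.

```python
# Sketch of the multi-class-to-binary reduction used by AdaBoost.MH.
# Hypothetical 3-class data; each (x, label) pair becomes one binary example.
classes = ["a", "b", "c"]
data = [("x1", "a"), ("x2", "c"), ("x3", "b")]

binary = [((x, l), +1 if l == y else -1) for (x, y) in data for l in classes]
print(len(binary))   # 3 examples x 3 classes = 9 binary examples
print(binary[0])     # (('x1', 'a'), 1)
```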
7.2 Impact of the algorithm
As mentioned in Sect. 7.1, AdaBoost is one of the most important ensemble methods, so it is not surprising that its high impact can be observed here and there. In this short article we only briefly introduce two issues, one theoretical and the other applied.
In 1988, Kearns and Valiant posed an interesting question: whether a weak learning algorithm that performs just slightly better than random guessing could be "boosted" into an arbitrarily accurate strong learning algorithm; in other words, whether the two complexity classes of weakly learnable and strongly learnable problems are equal. Schapire [67] found that the answer is "yes", and the proof he gave is a construction, which became the first Boosting algorithm. So it is evident that AdaBoost was born with theoretical significance. AdaBoost has given rise to abundant research on the theoretical aspects of ensemble methods, which can easily be found in the machine learning and statistics literature. It is worth mentioning that for their AdaBoost paper [24], Schapire and Freund won the Gödel Prize, one of the most prestigious awards in theoretical computer science, in 2003.