Top 10 algorithms in data mining
17
current fit
(
k)
for
(
i
= 1, . . . , g; j = 1, . . . , n). It can be expressed as
τ
(k)
i j
=
π
(
k)
i
φ(y
j
; µ
(
k)
i
,
(k)
i
)
f
(y
j
;
(k)
)
.
(8)
On the M-step, the updated estimates of the mixing proportion
π
j
, the mean vector
µ
i
, and
the covariance matrix
i
for the i th component are given by
π
(k+1)
i
=
n
j
=1
τ
(
k)
i j
n
,
(9)
µ
(k+1)
i
=
n
j
=1
τ
(
k)
i j
y
j
n
j
=1
τ
(
k)
i j
(10)
and
(k+1)
i
=
n
j
=1
τ
(
k)
i j
(y
j
− µ
(
k+1)
i
)(y
j
− µ
(
k+1)
i
)
T
n
j
=1
τ
(
k)
i j
.
(11)
It can be seen that the M-step exists in closed form.
These E- and M-steps are alternated until the changes in the estimated parameters or the
log likelihood are less than some specified threshold.
5.3 Number of clusters
We can make a choice as to an appropriate value of g by consideration of the likelihood
function. In the absence of any prior information as to the number of clusters present in the
data, we monitor the increase in the log likelihood function as the value of g increases.
At any stage, the choice of g
= g
0
versus g
=
g
1
, for instance g
1
= g
0
+ 1, can be made
by either performing the likelihood ratio test or by using some information-based criterion,
such as BIC (Bayesian information criterion). Unfortunately, regularity conditions do not
hold for the likelihood ratio test statistic
λ to have its usual null distribution of chi-squared
with degrees of freedom equal to the difference d in the number of parameters for g
= g
1
and g
=
g
0
components in the mixture models. One way to proceed is to use a resampling
approach as in [
55
]. Alternatively, one can apply BIC, which leads to the selection of g
=
g
1
over g
=
g
0
if
−2 log λ is greater than
d log(
n).
6 PageRank
6.1 Overview
PageRank [
10
] was presented and published by Sergey Brin and Larry Page at the Seventh
International World Wide Web Conference (WWW7) in April 1998. It is a search ranking
algorithm using hyperlinks on the Web. Based on the algorithm, they built the search engine
Google, which has been a huge success. Now, every search engine has its own hyperlink
based ranking method.
PageRank produces a static ranking of Web pages in the sense that a PageRank value is
computed for each page off-line and it does not depend on search queries. The algorithm
relies on the democratic nature of the Web by using its vast link structure as an indicator
of an individual page’s quality. In essence, PageRank interprets a hyperlink from page x to
page y as a vote, by page x, for page y. However, PageRank looks at more than just the sheer
123
18
X. Wu et al.
number of votes, or links that a page receives. It also analyzes the page that casts the vote.
Votes casted by pages that are themselves “important” weigh more heavily and help to make
other pages more “important”. This is exactly the idea of rank prestige in social networks
[
86
].
6.2 The algorithm
We now introduce the PageRank formula. Let us first state some main concepts in the Web
context.
In-links of page
i : These are the hyperlinks that point to page
i from other pages. Usually,
hyperlinks from the same site are not considered.
Out-links of page i : These are the hyperlinks that point out to other pages from page i .
Usually, links to pages of the same site are not considered.
The following ideas based on rank prestige [
86
] are used to derive the PageRank algorithm:
1. A hyperlink from a page pointing to another page is an implicit conveyance of authority
to the target page. Thus, the more in-links that a page i receives, the more prestige the
page i has.
2. Pages that point to page i also have their own prestige scores. A page with a higher
prestige score pointing to i is more important than a page with a lower prestige score
pointing to i . In other words, a page is important if it is pointed to by other important
pages.
According to rank prestige in social networks, the importance of page i (i ’s PageRank
score) is determined by summing up the PageRank scores of all pages that point to
i . Since a
page may point to many other pages, its prestige score should be shared among all the pages
that it points to.
To formulate the above ideas, we treat the Web as a directed graph G
= (V, E), where
V is the set of vertices or nodes, i.e., the set of all pages, and E is the set of directed edges
in the graph, i.e., hyperlinks. Let the total number of pages on the Web be n (i.e., n
= |V |).
The PageRank score of the page i (denoted by P
(i)) is defined by
P
(i) =
( j,i)∈E
P
( j)
O
j
,
(12)
where
O
j
is the number of out-links of page j . Mathematically, we have a system of n linear
equations (
12
) with n unknowns. We can use a matrix to represent all the equations. Let
P be a
n-dimensional column vector of PageRank values, i.e.,
P
= (P(1), P(2), . . . , P(n))T.
Let A be the adjacency matrix of our graph with
A
i j
=
1
O
i
if
(i, j) ∈ E
0
otherwise
(13)
We can write the system of n equations with
P
= A
T
P
.
(14)
This is the characteristic equation of the
eigensystem, where the solution to
P is an
eigenvector with the corresponding
eigenvalue of 1. Since this is a circular definition, an
iterative algorithm is used to solve it. It turns out that if some conditions are satisfied, 1 is
123