S u rv e y pa p e r

Yüklə 471,61 Kb.

Pdf görüntüsü

səhifə	8/16
tarix	08.10.2017
ölçüsü	471,61 Kb.
	#3814

1 ... 4 5 6 7 8 9 10 11 ... 16

Top 10 algorithms in data mining

current ﬁt

(k)

for

= 1, . . . , g; j = 1, . . . , n). It can be expressed as

(k)

i j

(k)

i

φ(y

; µ

(k)

i

(k)

)

f

(y

j

;

(k)

)

(8)

On the M-step, the updated estimates of the mixing proportion

π

j

, the mean vector

µ

i

, and

the covariance matrix

for the i th component are given by

(k+1)

=

n

(k)

i j

n

(9)

(k+1)

=

n

(k)

i j

y

j

n

j

(k)

i j

(10)

and

(k+1)

=

n

(k)

i j

− µ

(k+1)

i

)(y

− µ

(k+1)

i

)

T

n

j

(k)

i j

(11)

It can be seen that the M-step exists in closed form.

These E- and M-steps are alternated until the changes in the estimated parameters or the

log likelihood are less than some speciﬁed threshold.

5.3 Number of clusters

We can make a choice as to an appropriate value of g by consideration of the likelihood

function. In the absence of any prior information as to the number of clusters present in the

data, we monitor the increase in the log likelihood function as the value of g increases.

At any stage, the choice of g

= g

versus g

= g

, for instance g

= g

+ 1, can be made

by either performing the likelihood ratio test or by using some information-based criterion,

such as BIC (Bayesian information criterion). Unfortunately, regularity conditions do not

hold for the likelihood ratio test statistic

λ to have its usual null distribution of chi-squared

with degrees of freedom equal to the difference d in the number of parameters for g

= g

and g

= g

components in the mixture models. One way to proceed is to use a resampling

approach as in [

]. Alternatively, one can apply BIC, which leads to the selection of g

= g

over g

= g

−2 log λ is greater than d log(n).

6 PageRank

6.1 Overview

PageRank [

] was presented and published by Sergey Brin and Larry Page at the Seventh

International World Wide Web Conference (WWW7) in April 1998. It is a search ranking

algorithm using hyperlinks on the Web. Based on the algorithm, they built the search engine

Google, which has been a huge success. Now, every search engine has its own hyperlink

based ranking method.

PageRank produces a static ranking of Web pages in the sense that a PageRank value is

computed for each page off-line and it does not depend on search queries. The algorithm

relies on the democratic nature of the Web by using its vast link structure as an indicator

of an individual page’s quality. In essence, PageRank interprets a hyperlink from page x to

page y as a vote, by page x, for page y. However, PageRank looks at more than just the sheer

123

X. Wu et al.

number of votes, or links that a page receives. It also analyzes the page that casts the vote.

Votes casted by pages that are themselves “important” weigh more heavily and help to make

other pages more “important”. This is exactly the idea of rank prestige in social networks

[

6.2 The algorithm

We now introduce the PageRank formula. Let us ﬁrst state some main concepts in the Web

context.

In-links of page i : These are the hyperlinks that point to page i from other pages. Usually,

hyperlinks from the same site are not considered.

Out-links of page i : These are the hyperlinks that point out to other pages from page i .

Usually, links to pages of the same site are not considered.

The following ideas based on rank prestige [

] are used to derive the PageRank algorithm:

1. A hyperlink from a page pointing to another page is an implicit conveyance of authority

to the target page. Thus, the more in-links that a page i receives, the more prestige the

page i has.

2. Pages that point to page i also have their own prestige scores. A page with a higher

prestige score pointing to i is more important than a page with a lower prestige score

pointing to i . In other words, a page is important if it is pointed to by other important

pages.

According to rank prestige in social networks, the importance of page i (i ’s PageRank

score) is determined by summing up the PageRank scores of all pages that point to i . Since a

page may point to many other pages, its prestige score should be shared among all the pages

that it points to.

To formulate the above ideas, we treat the Web as a directed graph G

= (V, E), where

V is the set of vertices or nodes, i.e., the set of all pages, and E is the set of directed edges

in the graph, i.e., hyperlinks. Let the total number of pages on the Web be n (i.e., n

= |V |).

The PageRank score of the page i (denoted by P

(i)) is deﬁned by

P

(i) =

( j,i)∈E

P

( j)

O

j

(12)

where O

j

is the number of out-links of page j . Mathematically, we have a system of n linear

equations (

) with n unknowns. We can use a matrix to represent all the equations. Let

P be a n-dimensional column vector of PageRank values, i.e.,

P

= (P(1), P(2), . . . , P(n))T.

Let A be the adjacency matrix of our graph with

A

i j

O

i

(i, j) ∈ E

otherwise

(13)

We can write the system of n equations with

= A

T

P

(14)

This is the characteristic equation of the eigensystem, where the solution to P is an

eigenvector with the corresponding eigenvalue of 1. Since this is a circular deﬁnition, an

iterative algorithm is used to solve it. It turns out that if some conditions are satisﬁed, 1 is

123

Yüklə 471,61 Kb.

Dostları ilə paylaş:

1 ... 4 5 6 7 8 9 10 11 ... 16