The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin, Lawrence Page
Sergey Brin, Lawrence Page Presented By: Paolo Lim April 10, 2007
AKA: The Original Google Paper
Presentation Outline Design goals of Google search engine Link Analysis and other features System architecture and major structures Crawling, indexing, and searching the web Performance and results Conclusions Final exam questions
Linear Algebra Background PageRank involves knowledge of: - Matrix addition/multiplication
- Eigenvectors and Eigenvalues
- Power iteration
- Dot product
Not discussed in detail in presentation For reference: - http://cs.wellesley.edu/~cs249B/math/Linear%20Algebra/CS298LinAlgpart1.pdf
- http://www.cse.buffalo.edu/~hungngo/classes/2005/Expanders/notes/LA-intro.pdf
Design Goals Scaling with the web’s growth Improved search quality - Number of documents increasing rapidly, but users’ ability to look at documents lags
- Lots of “junk” results, little relevance
Academic search engine research - Development and understanding in academic realm
- System that a reasonable number of people can actually use
- Support novel research activities of large-scale web data by other researchers and students
Link Analysis Basics PageRank Algorithm - A Top 10 IEEE ICDM data mining algorithm
- Large basis for ranking system (discussed later)
- Tries to incorporate ideas from academic community (publishing and citations)
Anchor Text Analysis - <a href="http://www.com"> ANCHOR TEXT </a>
Intuition: Why Links, Anyway? Links represent citations Quality of links to a website also helps in computing rank Link structure was largely unused before Larry Page proposed it to his thesis advisor
Naïve PageRank Each link’s vote is proportional to the importance of its source page If page P with importance I has N outlinks, then each link gets I / N votes Simple recursive formulation: - PR(A) = PR(p1)/C(p1) + … + PR(pn)/C(pn)
- PR(X) PageRank of page X
- C(X) number of links going out of page X
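The recursive formulation above can be sketched in plain Python. The link graph and page names below are invented for illustration; each page passes PR(page)/C(page) along every outlink.

```python
# One update of the naive PageRank rule:
#   PR(A) = PR(p1)/C(p1) + ... + PR(pn)/C(pn)
# where p1..pn are the pages linking to A and C(p) is p's outlink count.

def naive_pagerank_step(ranks, outlinks):
    """Each page's new rank is the sum of the votes from its in-links."""
    new_ranks = {page: 0.0 for page in ranks}
    for page, targets in outlinks.items():
        share = ranks[page] / len(targets)   # each outlink carries PR(page)/C(page)
        for target in targets:
            new_ranks[target] += share
    return new_ranks

# Toy graph: A links to B and C; B links to C; C links to A.
outlinks = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {"A": 1/3, "B": 1/3, "C": 1/3}
ranks = naive_pagerank_step(ranks, outlinks)
print(ranks)  # A gets 1/3 (from C), B gets 1/6, C gets 1/6 + 1/3 = 1/2
```

Iterating this update is exactly the power iteration introduced a few slides later.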
Naïve PageRank Model (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
Solving the flow equations 3 equations, 3 unknowns, no constants - No unique solution
- All solutions equivalent modulo scale factor
Additional constraint forces uniqueness - y+a+m = 1
- y = 2/5, a = 2/5, m = 1/5
Gaussian elimination method works for small examples, but we need a better method for large graphs
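For the small Yahoo/Amazon/Microsoft example from the cited Stanford slides (y links to y and a; a links to y and m; m links to a), Gaussian elimination can be carried out exactly with Python's `fractions` module. This is a sketch of the "small example" case, not the method used at web scale.

```python
from fractions import Fraction as F

def gauss_solve(A, b):
    """Gauss-Jordan elimination over Fractions (picks the first nonzero pivot)."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Flow equations y = y/2 + a/2, a = y/2 + m, plus the normalization y + a + m = 1
# (one flow equation is redundant and is dropped), rearranged into Ax = b form:
A = [[F(1, 2), F(-1, 2), F(0)],   # y - y/2 - a/2 = 0
     [F(-1, 2), F(1), F(-1)],     # a - y/2 - m  = 0
     [F(1), F(1), F(1)]]          # y + a + m    = 1
b = [F(0), F(0), F(1)]
y, a, m = gauss_solve(A, b)
print(y, a, m)  # 2/5 2/5 1/5
```

Exact arithmetic reproduces the slide's solution y = 2/5, a = 2/5, m = 1/5.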
Matrix formulation Matrix M has one row and one column for each web page Suppose page j has n outlinks - If j → i, then Mij = 1/n
- Else Mij = 0
M is a column stochastic matrix Suppose r is a vector with one entry per web page - ri is the importance score of page i
- Call it the rank vector
Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
Eigenvector formulation The flow equations can be written r = Mr So the rank vector is an eigenvector of the stochastic web matrix - In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1
Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
Power Iteration Simple iterative scheme (aka relaxation) Suppose there are N web pages Initialize: r0 = [1,…,1]T Iterate: rk+1 = Mrk Stop when |rk+1 - rk|1 < ε - |x|1 = Σ1≤i≤N |xi| is the L1 norm
- Can use any other vector norm e.g., Euclidean
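The iteration above can be sketched with plain lists (no external libraries). The starting vector here is normalized to sum to 1 so it reads directly as a distribution; the matrix is the Yahoo/Amazon/Microsoft example from the cited Stanford slides.

```python
# Power iteration r_{k+1} = M r_k with an L1 stopping criterion.
# M is column-stochastic: M[i][j] = 1/n if page j (with n outlinks) links to i.

def power_iteration(M, eps=1e-10, max_iters=1000):
    N = len(M)
    r = [1.0 / N] * N                     # uniform starting vector
    for _ in range(max_iters):
        r_next = [sum(M[i][j] * r[j] for j in range(N)) for i in range(N)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:   # L1 norm
            return r_next
        r = r_next
    return r

# Example graph: y -> y,a; a -> y,m; m -> a.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iteration(M)
print(r)  # converges to [0.4, 0.4, 0.2], matching y = a = 2/5, m = 1/5
```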
Power Iteration Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
Random Surfer Imagine a random web surfer - At any time t, surfer is on some page P
- At time t+1, the surfer follows an outlink from P uniformly at random
- Ends up on some page Q linked from P
- Process repeats indefinitely
Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t - p(t) is a probability distribution on pages
The stationary distribution Where is the surfer at time t+1? - Follows a link uniformly at random
- p(t+1) = Mp(t)
Suppose the random walk reaches a state such that p(t+1) = Mp(t) = p(t) - Then p(t) is called a stationary distribution for the random walk
Our rank vector r satisfies r = Mr - So it is a stationary distribution for the random surfer
Spider traps A group of pages is a spider trap if there are no links from within the group to outside the group - Random surfer gets trapped
Microsoft becomes a spider trap (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
Random teleports The Google solution for spider traps At each time step, the random surfer has two options: - With probability β, follow a link at random
- With probability 1-β, jump to some page uniformly at random
- Common values for β are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within a few time steps
Matrix formulation Suppose there are N pages - Consider a page j, with set of outlinks O(j)
- We have Mij = 1/|O(j)| when j → i and Mij = 0 otherwise
- The random teleport is equivalent to
- adding a teleport link from j to every other page with probability (1-β)/N
- reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|
- Equivalent: tax each page a fraction (1-β) of its score and redistribute evenly
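The teleport adjustment in the bullets above amounts to building A = βM + (1-β)/N entrywise. The sketch below uses β = 0.8 and the spider-trap variant of the example graph (Microsoft links only to itself) to show that the trap no longer absorbs all the score.

```python
# Teleport-adjusted matrix: A[i][j] = beta * M[i][j] + (1 - beta) / N,
# i.e. follow a link with probability beta, jump uniformly otherwise.

def teleport_matrix(M, beta=0.8):
    N = len(M)
    return [[beta * M[i][j] + (1 - beta) / N for j in range(N)]
            for i in range(N)]

# Spider-trap example: y -> y,a; a -> y,m; m -> m (Microsoft traps the surfer).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
A = teleport_matrix(M, beta=0.8)

# Power iteration on A: the trapped page is heavy but not absorbing.
r = [1/3, 1/3, 1/3]
for _ in range(200):
    r = [sum(A[i][j] * r[j] for j in range(3)) for i in range(3)]
print(r)  # approximately [7/33, 5/33, 21/33] = [0.212, 0.152, 0.636]
```

Without the teleport, the same iteration drives all of the score into m.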
Page Rank Construct the NxN matrix A with Aij = βMij + (1-β)/N The PageRank vector r is the principal eigenvector of this matrix Equivalently, r is the stationary distribution of the random walk with teleports
Previous example with β=0.8 (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
Dead ends Pages with no outlinks are “dead ends” for the random surfer - Nowhere to go on next step
Microsoft becomes a dead end (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
Dealing with dead-ends Teleport - Follow random teleport links with probability 1.0 from dead-ends
- Adjust matrix accordingly
Prune and propagate - Preprocess the graph to eliminate dead-ends
- Might require multiple passes
- Compute page rank on reduced graph
- Approximate values for dead ends by propagating values from reduced graph
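The first dead-end fix above (teleport with probability 1.0 from a dead end) has a simple matrix reading: a page with no outlinks produces an all-zero column in M, and the fix replaces that column with a uniform column of 1/N. A minimal sketch:

```python
# Patch dead-end columns in a link matrix M: a page with no outlinks has an
# all-zero column, which we replace with a uniform 1/N column (teleport with
# probability 1.0 from that page).

def patch_dead_ends(M):
    N = len(M)
    for j in range(N):
        if all(M[i][j] == 0 for i in range(N)):   # column j is a dead end
            for i in range(N):
                M[i][j] = 1.0 / N
    return M

# Dead-end version of the example: Microsoft has no outlinks at all.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0]]
patch_dead_ends(M)

# Every column sums to 1 again, so M is column-stochastic once more.
print([sum(M[i][j] for i in range(3)) for j in range(3)])  # [1.0, 1.0, 1.0]
```

After this patch, the teleport-adjusted power iteration applies unchanged.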
Anchor Text Can be a more accurate description of the target site than the target site’s own text Makes it possible for pages that were never crawled to be returned in search results
Other Features List of occurrences of a particular word in a particular document (Hit List) Location information and proximity Keeps track of visual presentation details: - Font size of words
- Capitalization
- Bold/Italic/Underlined/etc.
Full raw HTML of all pages is available in repository
Google Architecture (from http://www.ics.uci.edu/~scott/google.htm)
Google Query Evaluation
Single Word Query Ranking Hitlist is retrieved for single word Each hit can be one of several types: title, anchor, URL, large font, small font, etc. Each hit type is assigned its own weight Type-weights make up vector of weights Number of hits of each type is counted to form count-weight vector Dot product of type-weight and count-weight vectors is used to compute IR score
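The dot-product scoring described above can be sketched as follows. The hit types come from the slide, but the weight values and the cap on count-weights are invented for illustration; the paper does not publish Google's actual weights.

```python
# Sketch of the single-word IR score: dot product of a type-weight vector
# and a count-weight vector. All numeric weights below are hypothetical.

HIT_TYPES = ["title", "anchor", "url", "large_font", "plain"]
TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0,
                "large_font": 3.0, "plain": 1.0}   # invented values

def count_weight(count, cap=8):
    """Count-weights grow with the hit count but are capped (here, crudely),
    so a flood of identical hits cannot dominate the score."""
    return min(count, cap)

def ir_score(hits):
    """hits: list of hit-type strings for one word in one document."""
    return sum(TYPE_WEIGHTS[t] * count_weight(hits.count(t))
               for t in HIT_TYPES)

# A word appearing once in the title and twice in plain text:
print(ir_score(["title", "plain", "plain"]))  # 10*1 + 1*2 = 12.0
```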
Multi-word Query Ranking Similar to single-word ranking except now must analyze proximity of words in a document Hits occurring closer together are weighted higher than those farther apart Each proximity relation is classified into 1 of 10 bins ranging from a “phrase match” to “not even close” Each type and proximity pair has a type-prox weight Counts are converted into count-weights Take dot product of count-weights and type-prox weights to compute the IR score
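The proximity-bin idea above can be sketched like this. The 10-bin structure is from the paper; the bin boundaries and weights below are invented, since the actual values are not published.

```python
# Sketch of proximity binning for a two-word query: each pair of hit
# positions maps to one of 10 bins, bin 0 = "phrase match" (adjacent),
# bin 9 = "not even close". Bin mapping and weights are hypothetical.

def proximity_bin(pos_a, pos_b, num_bins=10):
    """Distance between two hit positions -> bin index in [0, num_bins)."""
    return min(abs(pos_a - pos_b) - 1, num_bins - 1)

# Invented weights: closer bins score higher.
PROX_WEIGHTS = [10.0, 8.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.5, 1.0, 0.5]

def multiword_score(positions_a, positions_b):
    """Sum proximity weights over all hit pairs for the two query words."""
    return sum(PROX_WEIGHTS[proximity_bin(a, b)]
               for a in positions_a for b in positions_b)

# Adjacent occurrences (a phrase match) score far higher than distant ones:
print(multiword_score([5], [6]))    # bin 0 -> 10.0
print(multiword_score([5], [40]))   # bin 9 -> 0.5
```

In the full scheme each (hit type, proximity bin) pair has its own type-prox weight; this sketch collapses that to proximity only.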
Scalability Cluster architecture combined with Moore’s Law makes for high scalability. At time of writing: - ~24 million documents indexed in one week
- ~518 million hyperlinks indexed
- Four crawlers collected 100 documents/sec
Key Optimization Techniques Each crawler maintains its own DNS lookup cache Use flex to generate a lexical analyzer with its own stack for parsing documents Parallelization of indexing phase In-memory lexicon Compression of repository Compact encoding of hit lists for space savings Indexer is optimized so it runs just faster than the crawlers, so that crawling remains the bottleneck Document index is updated in bulk Critical data structures placed on local disk Overall architecture designed to avoid disk seeks wherever possible
Storage Requirements (from http://www.ics.uci.edu/~scott/google.htm) At the time of publication, Google reported a detailed statistical breakdown of its storage requirements (table not reproduced here).
Conclusions Search is far from perfect - Topic/Domain-specific PageRank
- Machine translation in search
- Non-hypertext search
Business potential - Brin and Page worth around $15 billion each… at 32 years old!
- If you have a better idea than how Google does search, please remember me when you’re hiring software engineers!
Possible Exam Questions Given a web/link graph, formulate a Naïve PageRank link matrix and do a few steps of power iteration. What are spider traps and dead ends, and how does Google deal with these? - Spider Trap: see the “Spider traps” and “Random teleports” slides
- Dead End: see the “Dead ends” slides
Explain difference between single and multiple word search query evaluation.
References Brin, Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Brin, Page, Motwani, Winograd. The PageRank Citation Ranking: Bringing Order to the Web. http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf http://www.cs.duke.edu/~junyang/courses/cps296.1-2002-spring/lectures/02-web-search.pdf http://www.ics.uci.edu/~scott/google.htm
Thank you!