

The Anatomy of a Large-Scale Hypertextual Web Search Engine

  • Sergey Brin, Lawrence Page

  • Presented By: Paolo Lim

  • April 10, 2007


AKA: The Original Google Paper



Presentation Outline

  • Design goals of Google search engine

  • Link Analysis and other features

  • System architecture and major structures

  • Crawling, indexing, and searching the web

  • Performance and results

  • Conclusions

  • Final exam questions



Linear Algebra Background

  • PageRank involves knowledge of:

    • Matrix addition/multiplication
    • Eigenvectors and Eigenvalues
    • Power iteration
    • Dot product
  • Not discussed in detail in presentation

  • For reference:

    • http://cs.wellesley.edu/~cs249B/math/Linear%20Algebra/CS298LinAlgpart1.pdf
    • http://www.cse.buffalo.edu/~hungngo/classes/2005/Expanders/notes/LA-intro.pdf


Google Design Goals

  • Scaling with the web’s growth

  • Improved search quality

    • Number of documents increasing rapidly, but user’s ability to look at documents lags
    • Lots of “junk” results, little relevance
  • Academic search engine research

    • Development and understanding in the academic realm
    • A system that a reasonable number of people can actually use
    • Support novel research on large-scale web data by other researchers and students


Link Analysis Basics

  • PageRank Algorithm

    • A Top 10 IEEE ICDM data mining algorithm
    • Forms a large part of the basis for Google’s ranking system (discussed later)
    • Tries to incorporate ideas from academic community (publishing and citations)
  • Anchor Text Analysis

    • <a href="http://www.com">ANCHOR TEXT</a>


Intuition: Why Links, Anyway?

  • Links represent citations

  • Quantity of links to a website makes the website more popular

  • Quality of links to a website also helps in computing rank

  • Link structure was largely unused before Larry Page proposed it to his thesis advisor



Naïve PageRank

  • Each link’s vote is proportional to the importance of its source page

  • If page P with importance I has N outlinks, then each link gets I / N votes

  • Simple recursive formulation:

    • PR(A) = PR(p1)/C(p1) + … + PR(pn)/C(pn), where p1 … pn are the pages linking to A (see the sketch below)
    • PR(X) = PageRank of page X
    • C(X) = number of links going out of page X
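To make the formula concrete, here is a minimal Python sketch; the page names and the tiny link graph are hypothetical, chosen to match the three-page example used later in the deck:

    # Naive PageRank of one page: sum, over the pages linking to it,
    # of their rank divided by their number of out-links.
    def naive_pagerank(page, ranks, outlinks):
        return sum(ranks[p] / len(targets)
                   for p, targets in outlinks.items()
                   if page in targets)

    # Hypothetical graph: y -> {y, a}, a -> {y, m}, m -> {a}
    outlinks = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
    ranks = {"y": 1.0, "a": 1.0, "m": 1.0}
    print(naive_pagerank("a", ranks, outlinks))   # 1.0/2 + 1.0/1 = 1.5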


Naïve PageRank Model (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

  • The web in 1839



Solving the flow equations

  • 3 equations, 3 unknowns, no constants

    • No unique solution
    • All solutions equivalent modulo scale factor
  • Additional constraint forces uniqueness

    • y+a+m = 1
    • y = 2/5, a = 2/5, m = 1/5
  • Gaussian elimination works for small examples, but we need a better method for large graphs (see the sketch below)
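A minimal sketch of the same computation, assuming the three-page graph y → {y, a}, a → {y, m}, m → {a} that produces the solution quoted above:

    import numpy as np

    # Flow equations r = M r for the assumed graph, rewritten as (M - I) r = 0,
    # plus the normalization constraint y + a + m = 1.
    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])
    A = np.vstack([M - np.eye(3), np.ones((1, 3))])   # 4 equations, 3 unknowns
    b = np.array([0.0, 0.0, 0.0, 1.0])
    r, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(r)   # [0.4, 0.4, 0.2]  ->  y = 2/5, a = 2/5, m = 1/5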



Matrix formulation

  • Matrix M has one row and one column for each web page

  • Suppose page j has n outlinks

    • If j → i, then Mij = 1/n
    • Else Mij = 0
  • M is a column stochastic matrix

    • Columns sum to 1
  • Suppose r is a vector with one entry per web page

    • ri is the importance score of page i
    • Call it the rank vector (a construction sketch follows below)
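A small sketch of how M could be built from an out-link dictionary (page names are hypothetical):

    import numpy as np

    def link_matrix(outlinks, pages):
        # Column j holds 1/|O(j)| in every row i that page j links to,
        # so each column sums to 1 (column stochastic).
        idx = {p: k for k, p in enumerate(pages)}
        M = np.zeros((len(pages), len(pages)))
        for j, targets in outlinks.items():
            for i in targets:
                M[idx[i], idx[j]] = 1.0 / len(targets)
        return M

    outlinks = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
    M = link_matrix(outlinks, ["y", "a", "m"])
    print(M.sum(axis=0))   # [1. 1. 1.]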


Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)



Eigenvector formulation

  • The flow equations can be written

  • r = Mr

  • So the rank vector is an eigenvector of the stochastic web matrix

    • In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1 (see the check below)
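A quick numeric check of this claim, using NumPy on the same hypothetical matrix:

    import numpy as np

    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])
    vals, vecs = np.linalg.eig(M)
    k = np.argmax(vals.real)            # the principal eigenvalue is 1
    r = vecs[:, k].real
    r = r / r.sum()                     # scale so the entries sum to 1
    print(round(vals[k].real, 6), r)    # 1.0 [0.4 0.4 0.2]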


Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)



Power Iteration

  • Simple iterative scheme (aka relaxation)

  • Suppose there are N web pages

  • Initialize: r_0 = [1, …, 1]^T

  • Iterate: r_(k+1) = M·r_k

  • Stop when |r_(k+1) − r_k|_1 < ε (see the sketch below)

    • |x|_1 = Σ_{1 ≤ i ≤ N} |x_i| is the L1 norm
    • Can use any other vector norm, e.g., Euclidean
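A minimal power-iteration sketch with the L1 stopping criterion (starting from the uniform vector, which equals [1, …, 1]^T up to scale):

    import numpy as np

    def power_iteration(M, eps=1e-8, max_iter=1000):
        n = M.shape[0]
        r = np.ones(n) / n                        # uniform starting vector
        for _ in range(max_iter):
            r_next = M @ r
            if np.abs(r_next - r).sum() < eps:    # L1 norm of the change
                break
            r = r_next
        return r_next

    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])
    print(power_iteration(M))   # ~[0.4, 0.4, 0.2]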


Power Iteration Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)



Random Surfer

  • Imagine a random web surfer

    • At any time t, surfer is on some page P
    • At time t+1, the surfer follows an outlink from P uniformly at random
    • Ends up on some page Q linked from P
    • Process repeats indefinitely
  • Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t

    • p(t) is a probability distribution on pages (a small simulation follows below)
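A small Monte Carlo sketch of the random surfer on the same hypothetical graph; the empirical visit frequencies approach the PageRank values:

    import random
    from collections import Counter

    def simulate_surfer(outlinks, steps=100_000, seed=0):
        # Follow a uniformly random out-link at every step and count the visits.
        rng = random.Random(seed)
        page = "y"
        visits = Counter()
        for _ in range(steps):
            page = rng.choice(outlinks[page])
            visits[page] += 1
        return {p: count / steps for p, count in visits.items()}

    outlinks = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
    print(simulate_surfer(outlinks))   # roughly {'y': 0.4, 'a': 0.4, 'm': 0.2}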


The stationary distribution

  • Where is the surfer at time t+1?

    • Follows a link uniformly at random
    • p(t+1) = Mp(t)
  • Suppose the random walk reaches a state such that p(t+1) = Mp(t) = p(t)

    • Then p(t) is called a stationary distribution for the random walk
  • Our rank vector r satisfies r = Mr

    • So it is a stationary distribution for the random surfer


Spider traps

  • A group of pages is a spider trap if there are no links from within the group to outside the group

    • Random surfer gets trapped
  • Spider traps violate the conditions (irreducibility and aperiodicity) needed for the random-walk convergence theorem



Microsoft becomes a spider trap (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)



Random teleports

  • The Google solution for spider traps

  • At each time step, the random surfer has two options:

    • With probability , follow a link at random
    • With probability 1-, jump to some page uniformly at random
    • Common values for  are in the range 0.8 to 0.9
  • Surfer will teleport out of spider trap within a few time steps



Matrix formulation

  • Suppose there are N pages

    • Consider a page j, with set of outlinks O(j)
    • We have Mij = 1/|O(j)| when j → i and Mij = 0 otherwise
    • The random teleport is equivalent to:
      • adding a teleport link from j to every other page with probability (1 − β)/N
      • reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|
      • Equivalent: tax each page a fraction (1 − β) of its score and redistribute it evenly


Page Rank

  • Construct the NxN matrix A as follows

    • Aij = Mij + (1-)/N
  • Verify that A is a stochastic matrix

  • The page rank vector r is the principal eigenvector of this matrix

    • satisfying r = Ar
  • Equivalently, r is the stationary distribution of the random walk with teleports (a complete sketch follows below)
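A minimal sketch putting the pieces together, assuming the same hypothetical graph and β = 0.8 (the teleports pull the values slightly toward uniform):

    import numpy as np

    def pagerank(M, beta=0.8, eps=1e-8, max_iter=1000):
        # Power iteration on the teleport-adjusted matrix A = beta*M + (1-beta)/N.
        n = M.shape[0]
        A = beta * M + (1.0 - beta) / n * np.ones((n, n))
        r = np.ones(n) / n
        for _ in range(max_iter):
            r_next = A @ r
            if np.abs(r_next - r).sum() < eps:
                break
            r = r_next
        return r_next

    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])
    print(pagerank(M))   # ~[0.376, 0.398, 0.226] for this graph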



Previous example with =0.8 (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)



Dead ends

  • Pages with no outlinks are “dead ends” for the random surfer

    • Nowhere to go on next step


Microsoft becomes a dead end (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)



Dealing with dead-ends

  • Teleport

    • Follow random teleport links with probability 1.0 from dead-ends (see the sketch after this list)
    • Adjust matrix accordingly
  • Prune and propagate

    • Preprocess the graph to eliminate dead-ends
    • Might require multiple passes
    • Compute page rank on reduced graph
    • Approximate values for dead ends by propagating values from reduced graph
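A minimal sketch of the teleport option, assuming dead ends show up as all-zero columns of M: replace each such column with a uniform teleport before computing PageRank:

    import numpy as np

    def fix_dead_ends(M):
        # Columns summing to 0 belong to dead-end pages; make the surfer
        # teleport uniformly (with probability 1.0) from those pages.
        M = M.copy()
        dead = M.sum(axis=0) == 0
        M[:, dead] = 1.0 / M.shape[0]
        return M

    # Hypothetical variant of the example where page m has no out-links.
    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 0.0],
                  [0.0, 0.5, 0.0]])
    print(fix_dead_ends(M))   # last column becomes [1/3, 1/3, 1/3]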


Anchor Text

  • Anchor text can be a more accurate description of the target page than the target page’s own text

  • Can point at content that cannot itself be indexed as text:

    • Images
    • Videos
    • Databases
  • Pages that have never been crawled can still be returned in the results



Other Features

  • List of occurrences of a particular word in a particular document (Hit List; a rough sketch follows after this list)

  • Location information and proximity

  • Keeps track of visual presentation details:

    • Font size of words
    • Capitalization
    • Bold/Italic/Underlined/etc.
  • Full raw HTML of all pages is available in repository
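A rough sketch of what one hit might carry, based only on the properties listed above; the field names and values are illustrative, not the paper’s compact two-byte encoding:

    from dataclasses import dataclass

    @dataclass
    class Hit:
        # One occurrence of a word in one document.
        doc_id: int
        position: int        # word offset in the document, used later for proximity
        hit_type: str        # 'title', 'anchor', 'url', 'large_font', 'plain', ...
        capitalized: bool
        font_size: int

    # A hit list is simply every occurrence of one word in one document.
    hit_list = [
        Hit(doc_id=42, position=3, hit_type="title", capitalized=True, font_size=4),
        Hit(doc_id=42, position=57, hit_type="plain", capitalized=False, font_size=2),
    ]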



Google Architecture (from http://www.ics.uci.edu/~scott/google.htm)




Google Query Evaluation



Single Word Query Ranking

  • The hit list is retrieved for the single query word

  • Each hit can be one of several types: title, anchor, URL, large font, small font, etc.

  • Each hit type is assigned its own weight

  • Type-weights make up vector of weights

  • Number of hits of each type is counted to form count-weight vector

  • Dot product of type-weight and count-weight vectors is used to compute IR score

  • IR score is combined with PageRank to compute the final rank (see the sketch below)
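A minimal sketch of this scoring scheme; the type weights and the way the IR score is combined with PageRank are illustrative assumptions, not the paper’s actual numbers:

    import numpy as np

    # Hypothetical type weights: title and anchor hits count more than plain-text hits.
    TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 5.0, "large_font": 3.0, "plain": 1.0}

    def single_word_ir_score(hit_types):
        # hit_types: the type of every hit of the query word in one document.
        # IR score = dot product of the type-weight and count-weight vectors.
        # (The paper tapers large counts; raw counts are used here for brevity.)
        counts = {t: 0 for t in TYPE_WEIGHTS}
        for t in hit_types:
            counts[t] += 1
        type_w = np.array(list(TYPE_WEIGHTS.values()))
        count_w = np.array([counts[t] for t in TYPE_WEIGHTS])
        return float(type_w @ count_w)

    def final_rank(ir_score, pagerank, mix=0.5):
        # Illustrative combination; the paper does not spell out the exact formula.
        return mix * ir_score + (1 - mix) * pagerank

    print(single_word_ir_score(["title", "plain", "plain"]))   # 8 + 1 + 1 = 10.0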



Multi-word Query Ranking

  • Similar to single-word ranking except now must analyze proximity of words in a document

  • Hits occurring closer together are weighted higher than those farther apart

  • Each proximity relation is classified into 1 of 10 bins ranging from a “phrase match” to “not even close”

  • Each type and proximity pair has a type-prox weight

  • Counts converted into count-weights

  • Take the dot product of count-weights and type-prox weights to compute the IR score (see the sketch below)
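A rough sketch of the proximity part; the bin boundaries and weights are illustrative assumptions, and only the proximity dimension is modeled here (not the full type-and-proximity pairs):

    import numpy as np

    def proximity_bin(distance):
        # Map the word distance between two query-term hits to 1 of 10 bins
        # (bin 0 = phrase match, bin 9 = "not even close").
        edges = [1, 2, 3, 5, 8, 13, 21, 34, 55]     # illustrative boundaries
        for b, edge in enumerate(edges):
            if distance <= edge:
                return b
        return 9

    PROX_WEIGHTS = np.linspace(10.0, 1.0, 10)       # closer hits are worth more

    def multi_word_ir_score(distances):
        # distances: word distances between matched hits of the query terms.
        counts = np.zeros(10)
        for d in distances:
            counts[proximity_bin(d)] += 1
        return float(PROX_WEIGHTS @ counts)

    print(multi_word_ir_score([1, 2, 40]))   # 10 + 9 + 2 = 21.0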



Scalability

  • Cluster architecture combined with Moore’s Law makes for high scalability. At time of writing:

    • ~ 24 million documents indexed in one week
    • ~518 million hyperlinks indexed
    • Four crawlers collected 100 documents/sec


Key Optimization Techniques

  • Each crawler maintains its own DNS lookup cache

  • Use flex to generate lexical analyzer with own stack for parsing documents

  • Parallelization of indexing phase

  • In-memory lexicon

  • Compression of repository

  • Compact encoding of hit lists for space saving

  • Indexer is optimized so it runs just faster than the crawler, so that crawling remains the bottleneck

  • Document index is updated in bulk

  • Critical data structures placed on local disk

  • Overall architecture designed to avoid disk seeks wherever possible



Storage Requirements (from http://www.ics.uci.edu/~scott/google.htm)

  • At the time of publication, Google had the following statistical breakdown for storage requirements:



Conclusions

  • Search is far from perfect

    • Topic/Domain-specific PageRank
    • Machine translation in search
    • Non-hypertext search
  • Business potential

    • Brin and Page worth around $15 billion each… at 32 years old!
    • If you have a better idea than how Google does search, please remember me when you’re hiring software engineers! 


Possible Exam Questions

  • Given a web/link graph, formulate a Naïve PageRank link matrix and do a few steps of power iteration.

    • Slides 14 – 16
  • What are spider traps and dead ends, and how does Google deal with these?

    • Spider Trap: Slides 19 – 21
    • Dead End: Slides 25 – 27
  • Explain difference between single and multiple word search query evaluation.

    • Slides 35 – 36


References

  • Brin, Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine.

  • Brin, Page, Motwani, Winograd. The PageRank Citation Ranking: Bringing Order to the Web.

  • http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf

  • www.cs.duke.edu/~junyang/courses/cps296.1-2002-spring/lectures/02-web-search.pdf

  • http://www.ics.uci.edu/~scott/google.htm



Thank you!


