The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin, Lawrence Page
Sergey Brin, Lawrence Page Presented By: Paolo Lim April 10, 2007
AKA: The Original Google Paper
Presentation Outline Design goals of Google search engine Link Analysis and other features System architecture and major structures Crawling, indexing, and searching the web Performance and results Conclusions Final exam questions
Linear Algebra Background PageRank involves knowledge of: - Matrix addition/multiplication
- Eigenvectors and Eigenvalues
- Power iteration
- Dot product
Not discussed in detail in presentation For reference: - http://cs.wellesley.edu/~cs249B/math/Linear%20Algebra/CS298LinAlgpart1.pdf
- http://www.cse.buffalo.edu/~hungngo/classes/2005/Expanders/notes/LA-intro.pdf
Design Goals Scaling with the web’s growth Improved search quality - Number of documents increasing rapidly, but users’ ability to look at documents lags
- Lots of “junk” results, little relevance
Academic search engine research - Development and understanding in academic realm
- System that a reasonable number of people can actually use
- Support novel research activities of large-scale web data by other researchers and students
Link Analysis Basics PageRank Algorithm - A Top 10 IEEE ICDM data mining algorithm
- Large basis for ranking system (discussed later)
- Tries to incorporate ideas from academic community (publishing and citations)
Anchor Text Analysis - <a href="http://www.com"> ANCHOR TEXT </a>
Intuition: Why Links, Anyway? Links represent citations Quality of links to a website also helps in computing rank Link structure was largely unused before Larry Page proposed it to his thesis advisor
Naïve PageRank Each link’s vote is proportional to the importance of its source page If page P with importance I has N outlinks, then each link gets I / N votes Simple recursive formulation: - PR(A) = PR(p1)/C(p1) + … + PR(pn)/C(pn)
- PR(X) PageRank of page X
- C(X) number of links going out of page X
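The recursive formulation above can be sketched in plain Python. The link graph and page names below are invented for illustration; each page passes PR(page)/C(page) along every outlink.

```python
# One update of the naive PageRank rule:
#   PR(A) = PR(p1)/C(p1) + ... + PR(pn)/C(pn)
# where p1..pn are the pages linking to A and C(p) is p's outlink count.

def naive_pagerank_step(ranks, outlinks):
    """Each page's new rank is the sum of the votes from its in-links."""
    new_ranks = {page: 0.0 for page in ranks}
    for page, targets in outlinks.items():
        share = ranks[page] / len(targets)   # each outlink carries PR(page)/C(page)
        for target in targets:
            new_ranks[target] += share
    return new_ranks

# Toy graph: A links to B and C; B links to C; C links to A.
outlinks = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {"A": 1/3, "B": 1/3, "C": 1/3}
ranks = naive_pagerank_step(ranks, outlinks)
print(ranks)  # A gets 1/3 (from C), B gets 1/6, C gets 1/6 + 1/3 = 1/2
```

Iterating this update is exactly the power iteration introduced a few slides later.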
Naïve PageRank Model (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
Solving the flow equations 3 equations, 3 unknowns, no constants - No unique solution
- All solutions equivalent modulo scale factor
Additional constraint forces uniqueness - y+a+m = 1
- y = 2/5, a = 2/5, m = 1/5
Gaussian elimination method works for small examples, but we need a better method for large graphs
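For the small Yahoo/Amazon/Microsoft example from the cited Stanford slides (y links to y and a; a links to y and m; m links to a), Gaussian elimination can be carried out exactly with Python's `fractions` module. This is a sketch of the "small example" case, not the method used at web scale.

```python
from fractions import Fraction as F

def gauss_solve(A, b):
    """Gauss-Jordan elimination over Fractions (picks the first nonzero pivot)."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Flow equations y = y/2 + a/2, a = y/2 + m, plus the normalization y + a + m = 1
# (one flow equation is redundant and is dropped), rearranged into Ax = b form:
A = [[F(1, 2), F(-1, 2), F(0)],   # y - y/2 - a/2 = 0
     [F(-1, 2), F(1), F(-1)],     # a - y/2 - m  = 0
     [F(1), F(1), F(1)]]          # y + a + m    = 1
b = [F(0), F(0), F(1)]
y, a, m = gauss_solve(A, b)
print(y, a, m)  # 2/5 2/5 1/5
```

Exact arithmetic reproduces the slide's solution y = 2/5, a = 2/5, m = 1/5.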
Matrix formulation Matrix M has one row and one column for each web page Suppose page j has n outlinks - If j → i, then Mij = 1/n
- Else Mij = 0
M is a column stochastic matrix Suppose r is a vector with one entry per web page - ri is the importance score of page i
- Call it the rank vector
Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
Eigenvector formulation The flow equations can be written r = Mr So the rank vector is an eigenvector of the stochastic web matrix - In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1
Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
Power Iteration Simple iterative scheme (aka relaxation) Suppose there are N web pages Initialize: r0 = [1,…,1]T Iterate: rk+1 = Mrk Stop when |rk+1 - rk|1 < ε - |x|1 = Σ1≤i≤N |xi| is the L1 norm
- Can use any other vector norm e.g., Euclidean
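The iteration above can be sketched with plain lists (no external libraries). The starting vector here is normalized to sum to 1 so it reads directly as a distribution; the matrix is the Yahoo/Amazon/Microsoft example from the cited Stanford slides.

```python
# Power iteration r_{k+1} = M r_k with an L1 stopping criterion.
# M is column-stochastic: M[i][j] = 1/n if page j (with n outlinks) links to i.

def power_iteration(M, eps=1e-10, max_iters=1000):
    N = len(M)
    r = [1.0 / N] * N                     # uniform starting vector
    for _ in range(max_iters):
        r_next = [sum(M[i][j] * r[j] for j in range(N)) for i in range(N)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:   # L1 norm
            return r_next
        r = r_next
    return r

# Example graph: y -> y,a; a -> y,m; m -> a.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iteration(M)
print(r)  # converges to [0.4, 0.4, 0.2], matching y = a = 2/5, m = 1/5
```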
Power Iteration Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
Random Surfer Imagine a random web surfer - At any time t, surfer is on some page P
- At time t+1, the surfer follows an outlink from P uniformly at random
- Ends up on some page Q linked from P
- Process repeats indefinitely
Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t - p(t) is a probability distribution on pages
The stationary distribution Where is the surfer at time t+1? - Follows a link uniformly at random
- p(t+1) = Mp(t)
Suppose the random walk reaches a state such that p(t+1) = Mp(t) = p(t) - Then p(t) is called a stationary distribution for the random walk
Our rank vector r satisfies r = Mr - So it is a stationary distribution for the random surfer
Spider traps A group of pages is a spider trap if there are no links from within the group to outside the group - Random surfer gets trapped
Microsoft becomes a spider trap (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
Random teleports The Google solution for spider traps At each time step, the random surfer has two options: - With probability β, follow a link at random
- With probability 1-β, jump to some page uniformly at random
- Common values for β are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within a few time steps
Matrix formulation Suppose there are N pages - Consider a page j, with set of outlinks O(j)
- We have Mij = 1/|O(j)| when j → i and Mij = 0 otherwise
- The random teleport is equivalent to
- adding a teleport link from j to every other page with probability (1-β)/N
- reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|
- Equivalent: tax each page a fraction (1-β) of its score and redistribute evenly
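The teleport adjustment in the bullets above amounts to building A = βM + (1-β)/N entrywise. The sketch below uses β = 0.8 and the spider-trap variant of the example graph (Microsoft links only to itself) to show that the trap no longer absorbs all the score.

```python
# Teleport-adjusted matrix: A[i][j] = beta * M[i][j] + (1 - beta) / N,
# i.e. follow a link with probability beta, jump uniformly otherwise.

def teleport_matrix(M, beta=0.8):
    N = len(M)
    return [[beta * M[i][j] + (1 - beta) / N for j in range(N)]
            for i in range(N)]

# Spider-trap example: y -> y,a; a -> y,m; m -> m (Microsoft traps the surfer).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
A = teleport_matrix(M, beta=0.8)

# Power iteration on A: the trapped page is heavy but not absorbing.
r = [1/3, 1/3, 1/3]
for _ in range(200):
    r = [sum(A[i][j] * r[j] for j in range(3)) for i in range(3)]
print(r)  # approximately [7/33, 5/33, 21/33] = [0.212, 0.152, 0.636]
```

Without the teleport, the same iteration drives all of the score into m.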
Page Rank Construct the NxN matrix A with Aij = βMij + (1-β)/N The PageRank vector r is the principal eigenvector of this matrix Equivalently, r is the stationary distribution of the random walk with teleports
Previous example with β=0.8 (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
Dead ends Pages with no outlinks are “dead ends” for the random surfer - Nowhere to go on next step
Microsoft becomes a dead end (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
Dealing with dead-ends Teleport - Follow random teleport links with probability 1.0 from dead-ends
- Adjust matrix accordingly
Prune and propagate - Preprocess the graph to eliminate dead-ends
- Might require multiple passes
- Compute page rank on reduced graph
- Approximate values for dead ends by propagating values from reduced graph
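The first dead-end fix above (teleport with probability 1.0 from a dead end) has a simple matrix reading: a page with no outlinks produces an all-zero column in M, and the fix replaces that column with a uniform column of 1/N. A minimal sketch:

```python
# Patch dead-end columns in a link matrix M: a page with no outlinks has an
# all-zero column, which we replace with a uniform 1/N column (teleport with
# probability 1.0 from that page).

def patch_dead_ends(M):
    N = len(M)
    for j in range(N):
        if all(M[i][j] == 0 for i in range(N)):   # column j is a dead end
            for i in range(N):
                M[i][j] = 1.0 / N
    return M

# Dead-end version of the example: Microsoft has no outlinks at all.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0]]
patch_dead_ends(M)

# Every column sums to 1 again, so M is column-stochastic once more.
print([sum(M[i][j] for i in range(3)) for j in range(3)])  # [1.0, 1.0, 1.0]
```

After this patch, the teleport-adjusted power iteration applies unchanged.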
Anchor Text Can be a more accurate description of the target site than the target site’s own text Makes it possible for pages that were never crawled to be returned in search results
Other Features List of occurrences of a particular word in a particular document (Hit List) Location information and proximity Keeps track of visual presentation details: - Font size of words
- Capitalization
- Bold/Italic/Underlined/etc.
Full raw HTML of all pages is available in repository
Google Architecture (from http://www.ics.uci.edu/~scott/google.htm)
Google Query Evaluation
Single Word Query Ranking Hitlist is retrieved for single word Each hit can be one of several types: title, anchor, URL, large font, small font, etc. Each hit type is assigned its own weight Type-weights make up vector of weights Number of hits of each type is counted to form count-weight vector Dot product of type-weight and count-weight vectors is used to compute IR score
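The dot-product scoring described above can be sketched as follows. The hit types come from the slide, but the weight values and the cap on count-weights are invented for illustration; the paper does not publish Google's actual weights.

```python
# Sketch of the single-word IR score: dot product of a type-weight vector
# and a count-weight vector. All numeric weights below are hypothetical.

HIT_TYPES = ["title", "anchor", "url", "large_font", "plain"]
TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0,
                "large_font": 3.0, "plain": 1.0}   # invented values

def count_weight(count, cap=8):
    """Count-weights grow with the hit count but are capped (here, crudely),
    so a flood of identical hits cannot dominate the score."""
    return min(count, cap)

def ir_score(hits):
    """hits: list of hit-type strings for one word in one document."""
    return sum(TYPE_WEIGHTS[t] * count_weight(hits.count(t))
               for t in HIT_TYPES)

# A word appearing once in the title and twice in plain text:
print(ir_score(["title", "plain", "plain"]))  # 10*1 + 1*2 = 12.0
```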
Multi-word Query Ranking Similar to single-word ranking except now must analyze proximity of words in a document Hits occurring closer together are weighted higher than those farther apart Each proximity relation is classified into 1 of 10 bins ranging from a “phrase match” to “not even close” Each type and proximity pair has a type-prox weight Counts are converted into count-weights Take dot product of count-weights and type-prox weights to compute the IR score
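The proximity-bin idea above can be sketched like this. The 10-bin structure is from the paper; the bin boundaries and weights below are invented, since the actual values are not published.

```python
# Sketch of proximity binning for a two-word query: each pair of hit
# positions maps to one of 10 bins, bin 0 = "phrase match" (adjacent),
# bin 9 = "not even close". Bin mapping and weights are hypothetical.

def proximity_bin(pos_a, pos_b, num_bins=10):
    """Distance between two hit positions -> bin index in [0, num_bins)."""
    return min(abs(pos_a - pos_b) - 1, num_bins - 1)

# Invented weights: closer bins score higher.
PROX_WEIGHTS = [10.0, 8.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.5, 1.0, 0.5]

def multiword_score(positions_a, positions_b):
    """Sum proximity weights over all hit pairs for the two query words."""
    return sum(PROX_WEIGHTS[proximity_bin(a, b)]
               for a in positions_a for b in positions_b)

# Adjacent occurrences (a phrase match) score far higher than distant ones:
print(multiword_score([5], [6]))    # bin 0 -> 10.0
print(multiword_score([5], [40]))   # bin 9 -> 0.5
```

In the full scheme each (hit type, proximity bin) pair has its own type-prox weight; this sketch collapses that to proximity only.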
Scalability Cluster architecture combined with Moore’s Law makes for high scalability. At time of writing: - ~24 million documents indexed in one week
- ~518 million hyperlinks indexed
- Four crawlers collected 100 documents/sec
Key Optimization Techniques Each crawler maintains its own DNS lookup cache Use flex to generate a lexical analyzer with its own stack for parsing documents Parallelization of indexing phase In-memory lexicon Compression of repository Compact encoding of hit lists for space savings Indexer is optimized so it runs just faster than the crawlers, so that crawling remains the bottleneck Document index is updated in bulk Critical data structures placed on local disk Overall architecture designed to avoid disk seeks wherever possible
Storage Requirements (from http://www.ics.uci.edu/~scott/google.htm) At the time of publication, Google reported a detailed statistical breakdown of its storage requirements (table not reproduced here).
Conclusions Search is far from perfect - Topic/Domain-specific PageRank
- Machine translation in search
- Non-hypertext search
Business potential - Brin and Page worth around $15 billion each… at 32 years old!
- If you have a better idea than how Google does search, please remember me when you’re hiring software engineers!
Possible Exam Questions Given a web/link graph, formulate a Naïve PageRank link matrix and do a few steps of power iteration. What are spider traps and dead ends, and how does Google deal with these? - Spider Trap: see the “Spider traps” and “Random teleports” slides
- Dead End: see the “Dead ends” slides
Explain difference between single and multiple word search query evaluation.
References Brin, Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Brin, Page, Motwani, Winograd. The PageRank Citation Ranking: Bringing Order to the Web. http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf http://www.cs.duke.edu/~junyang/courses/cps296.1-2002-spring/lectures/02-web-search.pdf http://www.ics.uci.edu/~scott/google.htm
Thank you!