
5.3 Application: A Recommender System


One useful class of information retrieval applications is recommender systems [10], where a program recommends new Web pages (or some other resource) judged likely to be of interest to a user, based on the user's initial set of liked pages P. A standard technique for recommender systems is extracting keywords that appear on the initial pages and returning pages that contain these keywords. Note that this technique is based purely on the text of a page, independent of any inter- or intra-document structure.

Another technique for making recommendations is collaborative filtering [13], where pages are recommended that were liked by other people who liked P. This is based on the assumption that items one user finds valuable or similar are likely to be judged the same way by another user. As collaborative filtering is currently practiced, users explicitly rate pages to indicate their recommendations. We can think of the act of creating hyperlinks to a page as being an implicit recommendation. In other words, if a person links to pages Q and R, we can guess that people who like Q may like R, especially if the links to Q and R appear near each other on the referencing page (such as within the same list). This makes use of intra-document structural information. One can use the following algorithm to find pages similar to P1 and P2:



  1. Generate a list of pages Parent that reference both P1 and P2.

  2. Generate a list of pages Result that are pointed to by the pages in Parent (i.e., are siblings of P1 and P2).

  3. List the pages in Result most frequently referenced by elements of Parent.

Figure 13 shows a sample run of this algorithm. Figure 14 shows the implementation.
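
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of Figure 14: the function name similar_pages, the in-memory list of (source, dest) link pairs, and the threshold parameter are all hypothetical, and duplicate links are collapsed so that each score counts distinct parent pages rather than individual links.

```python
from collections import Counter

def similar_pages(links, p1, p2, threshold=1):
    """Rank siblings of p1 and p2 (hypothetical helper, not from the paper).

    links is an iterable of (source, dest) pairs; the result is a list of
    (page, score) pairs sorted by descending score, then by page name.
    """
    # Collapse duplicate links and index destinations by source page.
    children = {}
    for source, dest in set(links):
        children.setdefault(source, set()).add(dest)

    # Step 1: Parent = pages that reference both p1 and p2.
    parents = [s for s, dests in children.items() if p1 in dests and p2 in dests]

    # Step 2: Result = pages pointed to by the pages in Parent,
    # counted once per parent that references them.
    scores = Counter()
    for s in parents:
        scores.update(children[s])

    # Step 3: list the most frequently referenced pages first.
    ranked = [(page, n) for page, n in scores.items() if n >= threshold]
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked
```

As in Figure 13, P1 and P2 themselves appear in the ranking, since every parent page references them.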

Some improvements to the application that have been implemented and described elsewhere [15] are:



  1. Only return target pages that include a keyword specified by the user.

  2. Return the names of hosts frequently containing referenced pages.

  3. Only return target pages that point to one or both of P1 and P2.

  4. Only follow links that appear in the same list and under the same header as the links to P1 and P2.

Preliminary evaluation [15] suggests that the last optimization yields results superior in some ways to the best text-based Web recommender system. Note that these heuristics simultaneously take advantage of inter-document, intra-document, and intra-URL structure.
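
To make the last heuristic concrete, here is a minimal Python sketch. It is not the implementation described in [15]: the Link record, its header and list_id fields, and the function same_list_siblings are hypothetical, and the sketch assumes that the links to P1 and P2 sit together in one list on the referencing page. It also leaves P1 and P2 out of the ranking.

```python
from collections import Counter
from typing import NamedTuple, Optional

class Link(NamedTuple):
    """One hyperlink plus its position on the source page (hypothetical record)."""
    source: str               # referencing page
    dest: str                 # referenced page
    header: Optional[str]     # text of the header the link appears under, if any
    list_id: Optional[int]    # identifier of the enclosing list on the page, if any

def same_list_siblings(links, p1, p2):
    """Score pages that share a list (and header) with links to both p1 and p2."""
    # Group destinations by (page, header, list); links outside any list are ignored.
    groups = {}
    for link in links:
        if link.list_id is None:
            continue
        key = (link.source, link.header, link.list_id)
        groups.setdefault(key, set()).add(link.dest)

    # Only lists that mention both p1 and p2 contribute siblings.
    scores = Counter()
    for dests in groups.values():
        if p1 in dests and p2 in dests:
            scores.update(dests - {p1, p2})

    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Links elsewhere on a parent page, such as in a different list or outside any list, contribute nothing, which is what distinguishes this variant from the basic algorithm.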

Elsewhere, we discuss a home page finder, a moved page finder, and a technique for finding pages on given topics [14][15].



v.vcvalue              score
www.nasa.gov/          13
www.nsf.gov/           12
www.fcc.gov/           5
www.nih.gov/           5
daac.gsfc.nasa.gov/    5
www.whitehouse.gov/    4
www.cdc.gov/           4
www.doc.gov/           4
www.doe.gov/           4
www.ed.gov/            4

Figure 13: First ten results of running SimPagesBasic (Figure 14) with the url_ids corresponding to “www.nsf.gov” (National Science Foundation) and “www.nasa.gov” (National Aeronautics and Space Administration).

6. Conclusions

6.1 Discussion


Just-In-Time Databases allow the user read access to a distributed semi-structured data set as though it were in a single relational database. This enables the use of SQL in querying the World-Wide Web, whose abundant structure has been underutilized because of the difficulty in cleanly accessing it. The representation supports distinguishing among data returned by different search engines (or other sources), information that was true in the past, and information that was recently verified. This provides a basis for powerful Web applications.

6.2 Related Work


An extractor developed within the TSIMMIS project uses user-specified wrappers to convert web pages into database objects, which can then be queried [3]. Specifically, hypertext pages are treated as text, from which site-specific information (such as a table of weather information) is extracted in the form of a database object. The Information Manifold provides a uniform query interface to structured information sources, such as databases, on the Web [5]. Both of these systems differ from our system, where each page is converted into a set of database relations according to the same schema.

SimPagesBasic(page1id, page2id, threshold)

  // Create temporary data structures
  create table parent(url_id int)
  create table results(url_id int, score int)

  // Insert into parent the pages that reportedly link
  // to both pages that we care about
  insert into parent (url_id)
    select distinct r1.source_url_id
    from rlink r1, rlink r2
    where r1.source_url_id = r2.source_url_id
      and r1.dest_url_id = page1id
      and r2.dest_url_id = page2id

  // Store the pages pointed to by the parent pages,
  // along with a count of the number of links
  insert into results (url_id, score)
    select l.dest_url_id, count(*)
    from link l, parent p
    where l.source_url_id = p.url_id
    group by l.dest_url_id

  // Show the URLs of pages most often pointed to
  // and the number of links to them
  select v.textvalue, count(*)
  from link l, parent p, valstring v, urls u
  where l.source_url_id = p.url_id
    and l.dest_url_id = u.url_id
    and u.value_id = v.value_id
  group by v.textvalue
  // The comparison operator is missing in the source; >= is assumed
  having count(*) >= threshold
  order by count(*) desc

Figure 14: Code for SimPagesBasic
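
As a rough, runnable counterpart to Figure 14, the following Python sketch runs the same queries against an in-memory SQLite database. Several things here are assumptions rather than part of the paper: the table definitions in make_demo_db (only the column names come from Figure 14), the sample data, the use of an ordinary table for rlink (which the figure populates from reported back-links), the >= comparison against threshold, and the extra ordering by URL to make ties deterministic. The intermediate results table is omitted because the final query does not read it.

```python
import sqlite3

def make_demo_db():
    """Build a tiny in-memory database with the tables Figure 14 refers to."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table urls(url_id int, value_id int);
        create table valstring(value_id int, textvalue text);
        create table link(source_url_id int, dest_url_id int);
        create table rlink(source_url_id int, dest_url_id int);
    """)
    # Hypothetical target pages 1..4; pages 10..12 are the referencing pages.
    names = {1: "p1.example/", 2: "p2.example/", 3: "x.example/", 4: "y.example/"}
    for url_id, text in names.items():
        conn.execute("insert into urls values (?, ?)", (url_id, url_id + 100))
        conn.execute("insert into valstring values (?, ?)", (url_id + 100, text))
    edges = [(10, 1), (10, 2), (10, 3),
             (11, 1), (11, 2), (11, 3), (11, 4),
             (12, 1), (12, 4)]
    conn.executemany("insert into link values (?, ?)", edges)
    conn.executemany("insert into rlink values (?, ?)", edges)
    return conn

def sim_pages_basic(conn, page1_id, page2_id, threshold):
    """Return (url, score) rows for pages linked from parents of both pages."""
    cur = conn.cursor()
    cur.execute("create temp table parent(url_id int)")
    # Pages that reportedly link to both pages of interest.
    cur.execute(
        """insert into parent (url_id)
           select distinct r1.source_url_id
           from rlink r1, rlink r2
           where r1.source_url_id = r2.source_url_id
             and r1.dest_url_id = ? and r2.dest_url_id = ?""",
        (page1_id, page2_id))
    # URLs most often pointed to by the parent pages, with link counts.
    rows = cur.execute(
        """select v.textvalue, count(*) as score
           from link l, parent p, valstring v, urls u
           where l.source_url_id = p.url_id
             and l.dest_url_id = u.url_id
             and u.value_id = v.value_id
           group by v.textvalue
           having count(*) >= ?
           order by score desc, v.textvalue""",
        (threshold,)).fetchall()
    cur.execute("drop table parent")
    return rows
```

Calling sim_pages_basic(make_demo_db(), 1, 2, 2) returns the two input pages and their one shared sibling, mirroring the way www.nasa.gov and www.nsf.gov head the list in Figure 13.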

This work is influenced by WebSQL, a language that allows queries about hyperlink paths among Web pages, with limited access to the text and internal structure of pages and URLs [8][7]. In the default configuration, hyperlinks are divided into three categories: internal links (within a page), local links (within a site), and global links. It is also possible to define new link types based on anchor text; for example, links with anchor text “next”. All of these facilities can be implemented in our system, although WebSQL’s syntax is more concise. While it is possible to access a region of a document based on text delimiters in WebSQL, one cannot do so on the basis of structure. Some queries that we can express but that are not expressible in WebSQL are:


  1. How many lists appear on a page?

  2. What is the second item of each list?

  3. Do any headings on a page consist of the same text as the title?
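
To show what these three structural questions compute, here is a small Python sketch built on the standard library's html.parser module. It is not how our system answers them (which is through SQL over the relations derived from each page); the class StructureScan and the function page_structure are hypothetical names, and only ul and ol elements are treated as lists.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class StructureScan(HTMLParser):
    """Collect list items, headings, and the title of one page."""

    def __init__(self):
        super().__init__()
        self.lists = []       # one list of item texts per <ul>/<ol>, in document order
        self.headings = []    # text of each heading
        self.title = ""
        self._open = []       # stack of indices into self.lists for open lists
        self._capture = None  # "title", "heading", or None

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.lists.append([])
            self._open.append(len(self.lists) - 1)
        elif tag == "li" and self._open:
            # A new <li> starts a new item even if the previous one was never closed.
            self.lists[self._open[-1]].append("")
        elif tag in HEADINGS:
            self.headings.append("")
            self._capture = "heading"
        elif tag == "title":
            self._capture = "title"

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self._open:
            self._open.pop()
        elif tag in HEADINGS or tag == "title":
            self._capture = None

    def handle_data(self, data):
        if self._capture == "title":
            self.title += data
        elif self._capture == "heading":
            self.headings[-1] += data
        elif self._open and self.lists[self._open[-1]]:
            # Text inside a list goes to the current item of the innermost list.
            self.lists[self._open[-1]][-1] += data

def page_structure(html):
    """Answer the three example questions for one HTML document."""
    scan = StructureScan()
    scan.feed(html)
    scan.close()
    clean = lambda text: " ".join(text.split())
    lists = [[clean(item) for item in items] for items in scan.lists]
    title = clean(scan.title)
    return {
        "list_count": len(lists),
        "second_items": [items[1] for items in lists if len(items) > 1],
        "heading_matches_title": bool(title) and title in map(clean, scan.headings),
    }
```

Lists with fewer than two items simply contribute nothing to second_items.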

W3QL is another language for accessing the web as a database, treating web pages as the fundamental units [4]. Information one can obtain about web pages includes:

  1. The hyperlink structure connecting web pages

  2. The title, contents, and links on a page

  3. Whether they are indices (“forms”) and how to access them

For example, it is possible to request that a specific value be entered into a form and to follow all links that are returned, giving the user the titles of the pages. In our system (as in WebSQL), the user cannot specify forms; instead, access to a few search engines is hardcoded. Access to the internal structure of a page is more restricted in W3QL than in our system: for example, one cannot specify all hyperlinks originating within a list.

An additional way in which Just-In-Time Databases differ from all of the other systems is in providing a data model guaranteeing that data is saved from one query to the next and (consequently) containing information about the time at which data was retrieved or interpreted. Because the data is written to a SQL database, it can be accessed by other applications. Another way our system is unique is in providing equal access to all tags and attributes, unlike WebSQL and W3QL, which can only refer to certain attributes of links and provide no access to attributes of other tags.


6.3 Future Work


The current system is a prototype. The Just-In-Time Database system would be much more efficient and robust if it were integrated with a SQL database server. Currently, no guarantees are made as to atomicity, consistency, isolation, and durability. Compilation of virtual queries would be more efficient than the current interpretation, as would be exposing virtual queries to optimization. We would also like to be able to include virtual table references within SQL procedures. Opportunities exist for more efficient memory usage. For example, text within nested tags is repeated in the database, instead of being referred to indirectly. Active database technology could be used to update pages in the database as they expire. We would also like to provide our system with direct access to a search engine's database to minimize data transfer delays.

7. ACKNOWLEDGMENTS

8. REFERENCES


  1. Gustavo O. Arocena, Alberto O. Mendelzon, and George A. Mihaila. Applications of a Web query language. In Proceedings of the Sixth International World Wide Web Conference, Santa Cruz, CA, April 1997.

  2. Peter Buneman, Susan Davidson, Gerd Hillebrand, and Dan Suciu. A Query Language and Optimization Techniques for Unstructured Data. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data.

  3. J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo. Extracting semistructured information from the Web. In Proceedings of the Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997.

  4. David Konopnicki and Oded Shmueli. Information gathering in the World-Wide Web: The W3QL query language and the W3QS system. Available from “www.cs.technion.ac.il/~konop/w3qs.html”, 1997.

  5. Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. Query answering algorithms for information agents. In Proceedings of the AAAI Thirteenth National Conference on Artificial Intelligence, 1996.

  6. Jason McHugh, Serge Abiteboul, Roy Goldman, Dallan Quass, and Jennifer Widom. Lore: A Database Management System for Semistructured Data. SIGMOD Record, 26(3):54-66, September 1997.

  7. Alberto Mendelzon, George Mihaila, and Tova Milo. Querying the world wide web. In Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems, Miami, FL, 1996.

  8. George A. Mihaila. WebSQL - an SQL-like query language for the world wide web. Master's thesis, University of Toronto, 1996.

  9. Dave Raggett. HTML 3.2 reference specification. World-Wide Web Consortium technical report, January 1997.

  10. Paul Resnick and Hal R. Varian. Recommender systems (introduction to special section). Communications of the ACM, 40(3):56–58, March 1997.

  11. Ronald Rivest. The MD5 message-digest algorithm. Network Working Group Request for Comments: 1321, April 1992.

  12. Jacques Savoy. Citation schemes in hypertext information retrieval. In Maristella Agosti and Alan F. Smeaton, editors, Information Retrieval and Hypertext, pages 99–120. Kluwer Academic Press, 1996.

  13. Upendra Shardanand and Pattie Maes. Social information filtering: Algorithms for automating “word of mouth”. In Computer-Human Interaction (CHI), 1995.

  14. Ellen Spertus. ParaSite: Mining structural information on the web. In The Sixth International World Wide Web Conference, April 1997.

  15. Ellen Spertus. ParaSite: Mining the Structural Information on the World-Wide Web. PhD Thesis, Department of EECS, MIT, Cambridge, MA, February 1998.

