%!ps-adobe 0


Virtual and Physical Tables



Yüklə 134,5 Kb.
səhifə2/3
tarix26.09.2018
ölçüsü134,5 Kb.
#70466
1   2   3

3.3Virtual and Physical Tables


We use the term virtual table to describe a table whose contents are computed in a lazy fashion. Every virtual table must have at least one defining column. The term physical table describes ordinary tables, whose contents are created with the “insert” instruction.

3.4Selectp and Selectv


While the user is not permitted to know what information is in the relational database underlying the virtual tables, it is useful for the system to be able to do this, which it does via the selectp statement, which performs the corresponding select query on the underlying physical database. In contrast, the selectv statement is used to query the virtual database. Each selectv statement is translated into one or more fetch statements, followed by a selectp statement. For example, consider the following statement:

selectv s.square


from squares s, primes p
where s.base = p.number
and p.number < 100

It is converted into the following sequence:

fetch squares(base=p.number)
from primes p
where p.number < 100

selectp s.square


from squares s, primes p
where s.base = p.number
and p.number < 100

The selectp statements are interpreted by the underlying RDBMS. The next section describes the algorithms for converting selectv statements into fetch and selectp statements.


4.ALGORITHMS


This section describes how selectv statements are processed. When all referenced tables are physical, the semantics are the same as in SQL, and the actual table is queried. When virtual tables are involved, they must be converted into fetch and selectp statements, which is done by the following routines:

  1. transform: top­level routine, which calls the below procedures and outputs required fetch statements (Figure 3).

  2. merge: recognize a conjunction idiom, possibly outputting a fetch statement (Figure 4).

  3. findBound: try to find a bound on a defining column of a virtual table appearing in the select statement; may call itself recursively (Figure 6).

  4. refDependences: help construct fetch statements, adding references to tables upon which the current table depends; may call itself recursively (Figure 5).

5.Example: The World-Wide Web


This section describes a Just-In-Time Database representing the Web. First, we describe the schema; then, we provide an example of its use.

5.1Background


All of its non-leaf nodes are written in hypertext markup language (HTML), which includes tags and attributes to specify intra-document structure, such as a page’s logical structure or how portions of it should be visually rendered. (Leaf nodes may be in HTML or other formats, such as Postscript.) HTML can also express inter-document structure through labeled hyperlinks relating one page to another. Each page is addressed through one or more uniform resource locators (URLs), which are themselves structured. This abundance of structure has been underutilized because of the absence of a unifying interface to the different types of structural information. Just-In-Time Databases address this problem by providing a relational database interface to the Web, allowing the user to query the Web in SQL, taking full advantage of its many types of structural information.

5.2Schema


Relations are divided into four categories: (1) basic relations, (2) tag and attribute relations, (3) relations built on tags, and (4) relations for second­hand information.

5.2.1Basic Relations


In order to provide useful information about the Web, it is necessary that strings, URLs, and Web pages can be represented. The valstring relation, shown in Figure 7 is used to associate an integer, value_id, with a string, textvalue; the integer is used in other tables to refer to the string. The column textvalue is a defining column. If a new textvalue is given, a new line is added to the table, with a new integer and the given textvalue.

The following relations all depend on valstring:



  1. urls, which describes the set of URLs representing the same page

  2. page, which represents information corresponding to a single page

  3. parse, which represents the components of a URL

The urls relation, also shown in Figure 7, is used to map one or more value_ids representing URL strings to a unique url_id. If there were a one­to­one correspondence between URL strings and url ids, it would not be needed, but multiple URL strings can reference the same page. For example, both “www.ai.mit.edu” and “www.ai.mit.edu/ index.html” reference the same document, so they are

Procedure transform(selectStatement)

  1. Initialization

  1. Let unboundedTables be the set of table aliases in selectStatement.

  2. Let whereDef be the WHERE clause in selectStatement.

  3. Let selectionTable by an empty hash table indexed on table aliases.

  4. Let orderedKeys be an empty vector.

  5. Let remainingTables be an empty vector.

  1. Call merge(selectStatement, unboundedTables)

  2. Set remainingTables to unboundedTables

  3. For each table alias currentTableAlias in unboundedTables, call
    findBound(currentTable, unboundedTables, fetchClauses, whereDef)

  4. Branch point on the value of |unboundedTables|:

  1. |remainingTables|, fail

  2. > 0, go to step 3

  3. else, execute completion code:

  1. For each alias currentAlias in orderedKeys corresponding to a virtual table

  1. Let fetchString = “FETCH ”

  2. Let tablesString = “”

  3. Let clausesString = “ WHERE (1=1) ”

  4. Let dependences be the empty set

  5. For each tuple [currentAlias, LHSstring, op, RHSstring, tablesReferenced] in selectionTable where LHSstring  the empty string

  1. Append to fetchString: LHSstring + op + RHSstring

  2. Add the elements of tablesReferenced to dependences

  3. Let processed be the set containing currentAlias

  4. Call refDependences(currentAlias, selectionTable, dependences, tablesString, clausesString, processed)

  5. Output clause: fetchString + “) FROM ” + tablesString + clausesString

  1. Return success

Figure 3: Pseudocode for transform. Given a selectStatement, transform returns a set of clauses or fails. Code to handle simple syntactic items, such as commas, has been omitted.

Procedure merge(selectStatement, unboundedTables)

Look for “where” clauses of the form:

t1.col1 = t2.col1 and t1.col2 = exp1 and t2.col2 = exp2

where t1 and t2 are different instantiations of the same table T. If found, remove t1 and t2 from unboundedTables and execute:

fetch T(col2 = (exp1 & exp2))

Figure 4: Pseudocode for merge


Procedure refDependences(topAlias, selectionTable, tablesReferenced, tablesString, clausesString, processed)

  1. Let newDependences be the empty set

  1. For each alias predAlias in tablesReferenced and selectionTable but not in processed

  1. Put predAlias in processed

(b) Append to tablesString: TableType(currentTableAlias) + “ ” + currentTableAlias

i. Append to tablesString: TableType(predAlias) + “ ” + predAlias

ii. For each tuple [currentTableAlias, LHSstring, operator, RHSstring, currentTablesReferenced] corresponding to predAlias in selectionTable where currentTablesReferenced does not contain topAlias

A. Append to clausesString: “ and ” + LHSstring + “ ”+ operator + “ ” + RHSstring

B. Add the elements of currentTablesReferenced to the set newDependences


  1. If |newDependences| > 0, call refDependences(topAlias, selectionTable, tablesReferenced, tablesString, clausesString, processed)

Figure 5: Pseudocode for refDependences. The procedure adds table definitions and clauses of referenced tables to the current fetch statement.

Procedure findBound(currentTableAlias, unboundedTables, fetchClauses, whereDef)



Branch point on the type of whereDef

  1. disjunction (clause1 or clause2)

  1. Let returnValue1 be the result of
    findBound(currentTableAlias, unboundedTables, fetchClauses, clause1)

  2. If returnValue = false, return false

  3. Let returnValue2 be the result of
    findBound(currentTableAlias, unboundedTables, fetchClauses, clause2)

  4. Merge any new clauses of fetchClauses with the same left­hand side to create fetch disjunctions

  5. Return returnValue2

  1. conjunction (clause1 and clause2)

  1. Let returnValue1 be findBound(currentTableAlias, unboundedTables, fetchClauses, clause1)

  2. Let returnValue2 be findBound(currentTableAlias, unboundedTables, fetchClauses, clause2)

  3. Merge any new clauses of fetchClauses with the same left­hand side, to create fetch conjunctions

  4. Return (returnValue1 V returnValue2)

  1. negation (e.g, “tab.col  7”), return false

  2. relational expression (LHS operator RHS)

  1. Let currentTableType be the type of currentTableAlias

  2. If (currentTableType is a virtual table and (operator is not “=” or “LIKE”))
    or (LHS is not of the form currentTableAlias.col)
    or (col is not a defining column for currentTableType)),
    return false

  3. Let referencedTables be the tables referenced in RHS

  4. If |referencedTablesunboundedTables|  0,
    return false

  5. Let newClause be the tuple
    [currentTableAlias, ToString(LHS), operator, ToString(RHS), tablesReferenced]

  6. Add newClauses to fetchClauses

  7. Remove currentTableAlias from unboundedTables

  8. Return true

Figure 6: Pseudocode for findBound. Returns true if an item was removed from unboundedTables, false otherwise.

valstring







urls







page




colname

type




colname

type




colname

type

value_id

int




url_id

int




url_id*

int

textvalue*

text




value_id*

int




contents*

text










variant

int




bytes

int



















md5

char(32)



















when

datetime

Figure 7: The valstring, urls, and page relations. Asterisks indicate defining columns.

parse

schema







parse

relation




colname

coltype




url_value_id

component

value

depth

url_value_id*

int




950

“host”

“www.ai.mit.edu”




component

char(5)




950

“port”

80




value

varchar(255)




950

“path”

“index.html”

1

depth

int




950

“path”

“people”

2










950

“ref”

“t”




Figure 8: The parse relation and an example of its contents for the url_id corresponding to “www.ai.mit.edu/people/index.html#t”

assigned the same url_id, with different variant numbers. Strictly speaking, there is no way to tell if two URLs reference the same file or merely two copies of a file. We use the heuristic that two URLs reference the same page and should have the same url_id if they retrieve identical contents and the URLs have the same host machine. Because the information known about URLs can change, it is possible for the url_id associated with a string to change, in which case all url_ids in system and user tables are changed correspondingly. The value_id column is a defining column. If a new value_id is given, a new line is created with a unique url_id and a variant value of zero.

The page relation (Figure 7) contains information obtained the last time a Web page was retrieved, including the text of a page, its size, a time stamp, and the results of a Message Digest 5 (MD5) hash function [11] on the page's text, providing a quick way to tell if two pages contain the same text. The valid field encodes the results of the last attempted retrieval of the page: “y” if it was successfully retrieved or “n” if it could not be retrieved. The columns url_id and contents are defining columns. If a url_id is given, the system can fetch the associated page from the

Web and fill in the other fields. If a wildcard expression is given for contents, such as “%VLDB%”, the system can ask a search service, such as AltaVista (www.alta-vista.digital.com), to return all pages (that it knows of) containing the substring. These are then added to the table.

The parse relation, defined and illustrated in Figure 8, encodes a parsed version of a URL string. This is useful for determining if two URLs are on the same host machine or if one is above the other in the directory hierarchy. The component field contains “host”, “port”, “path”, or “ref”, and the value field shows the associated value. For the “path” component, the depth within the file hierarchy is also indicated, starting at the file name. Note that one url_id can have multiple parses, if multiple URLs correspond to that page. The following transcript demonstrates the use of these basic relations.



What value_id corresponds to the string www.ai.mit.edu/index.html?
select value_id from valstring where textvalue=“www.ai.mit.edu/index.html

value_id

950

What url_id corresponds to the URL represented as “www.ai.mit.edu/index.html”?
s
elect url_id from urls where value_id = 950

url_id

817


Show all the known
url_ids for this page.
select * from urls where url_id = 817

url_id

value_id

variant

817

950

1

817

1140

2

What strings correspond to these url_ids?
select textvalue from valstring v, urls u
where u.url_id = 817 and v.value_id = u.value_id

textvalue

“www.ai.mit.edu”

“www.ai.mit.edu/index.html”

How many bytes long is the page?
select bytes from page where url_id = 817

bytes

8116

What is the host component of the URL corresponding to value_id 950?
select value from parse where url_value_id = 950 and component = “host”

value

“www.ai.mit.edu”



tag







att




colname

type




colname

type

url_id*

int




tag_id*

int

tag_id

int




name

char(15)

name

char(15)




value_id

int

startOffset

int










endOffset

int










Figure 9: The tag and att (attribute) relations

5.2.2Tag and Attribute Relations


Hypertext consists of text annotated with tags and attributes. Hence, it is necessary that our system can represent tag and attribute information, which it does through the tag and att relations, shown in Figure 9.

The tag relation contains one line for each tag or set of paired tags encountered. (Single tags include “


” (horizontal rule); paired tags include “” and “”.) Information stored includes the url_id, tag name, the start and end character offsets of the tag within the document, and a unique tag_id, which is used to reference attributes. The url_id column is a defining column. Given a url_id, the system can retrieve a page and parse it, filling in the tag table (and the att table as a side effect).

The att relation contains one line for each attribute name and value in a document. The name is a short string and the value an index into valstring. We consider the anchor text enclosed by a pair of tags to be the value of the attribute “anchor”. The following transcript illustrates the use of the tag and att relations:

What tags occur within the first 100 characters on “www.mit.edu”?
select t.name, t.tag_id from tag t, urls u, valstring v where .textvalue = “www.mit.edu” and u.value_id = v.value_id and t.url_id = u.url_id and startOffset < 100


name

tag_id

!DOCTYPE

1080

HTML

1081

HEAD

1082

TITLE

1083

What are the attributes of the “TITLE” tag?
select * from att where tag_id = 1083

tag_id

name

value_id

1083

anchor

1105

What is the text associated with the “TITLE” tag?
select textvalue from valstring where value_id = 1105

textvalue

“SIPB WWW Server Home Page”


5.2.3Relations Built on Tags


Information about the structure of headers and lists on a page and hyperlinks across pages can be inferred from the information in the tag and att tables, but for greater convenience we provide the user with specialized tables header, list, and link, shown in Figures Figure 10Figure 11.

Through the header table, we keep track of the levels of headings at each point of change. HTML supports six levels of headings (H1–H6). The struct column consists of an array of six bytes, each of which corresponds to one level of heading. At the beginning of a page, all six bytes are initialized to zero. Whenever a n> tag is encountered, byte (n-1) is incremented and all bytes with higher offsets are clear. The ord element is an ordinal number associated with each change of header structure. The defining column is url_id, since, given a url_id, a page can be fetched and parsed. As a side effect of computing the header relation, the tag and att relations are also computed.



The list relation keeps track of the list structure of a document. There are four types of lists supported by HTML: ordered lists, unordered lists, directories, and menus. These are not differentiated among in the list relation (but are in the tag relation). Lists can be arbitrarily nested in HTML; the list relation keeps track of up to six levels of nesting of lists. As with header, the only defining column is url_id. The details [15] are similar to, but not identical to, those for the header relationship.




header




list




colname

type




colname

type

url_id*

int




url_id

int

struct

binary(6)




struct

binary(6)

ord

int




ord

int

offset

int




offset

int

Figure 10: The header and list relations

link




colname

type

source_url_id*

int

anchor_value_id*

int

dest_url_id*

int

hstruct

binary(6)

lstruct

binary(6)

Figure 11: The link relation

The link relation contains one line for each hyperlink encountered, storing the source_url_id, anchor text value_id, and destination url_id. For example, there is a hyperlink from the MIT AI Lab home page with anchor text “LCS” to the MIT Laboratory for Computer Science. The corresponding url_ids for the source (www.ai.mit.edu) and destination (www.lcs.mit.edu) URLs and the value_id for the anchor text (“LCS”) appear as one line in the relation, as do the list and header structures at which the hyperlink occurs. The source_url_id column is a defining column because, given a URL, the system can retrieve the corresponding page, parse it, and determine the other fields. Less obviously, anchor_value_id and dest_url_id are also defining columns. This is done by having our system access AltaVista, which allows queries for pages that contain a given string as either anchor text or as the target of a hyperlink, although it does not (and could not) guarantee that it will return complete set. The system then verifies that the hyperlink still exists before adding it to the link relation.


5.2.4Relations for second­hand information


In addition to direct information about the text and hyperlinks of Web pages, indirect information can be obtained from Web tools. It is important however to distinguish between verified and second­hand information, hence the rcontains table, for reported contents of pages, and the rlink table, for reported links. These are illustrated in Figure 12.

The rcontains relation indicates the claim by the search engine shown in the helper column that a certain page contains a certain piece of text. The value of the num column (by default 10) determines the maximum number of such references added.



The rlink relation contains search engine claims about hyperlinks among pages. Because of the functionality provided by the public search engines, the source_url_id field will always have a non­null value, but exactly one of anchor and dest_url_id will be null. In other words, a line can indicate the reported source and destination of a hyperlink, or it can indicate the reported source and anchor text of a hyperlink. The missing information, assuming the reported link actually exists, can be extracted from the link relation. As with rcontains, the helper and num fields are used to specify the search engine and maximum number of lines added.

5.2.5Bookkeeping


For each table, the system adds a column, compute_id, to keep track of what fetch statement caused each line to be inserted and when the fetch was performed [15]. This provides a hook for the user to ask questions about how fresh data is and how the Web has changed.

rcontains







rlink




colname

coltype




colname

coltype

url_id

int




source_url_id

int

value

varchar(255)




anchor

varchar(255)

helper

char(16)




dest_url_id

int

num

int




helper

char(16)










num

char(16)

Figure 12: The rcontains and rlink relations

Yüklə 134,5 Kb.

Dostları ilə paylaş:
1   2   3




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©www.genderi.org 2024
rəhbərliyinə müraciət

    Ana səhifə