%!ps-adobe 0

Virtual and Physical Tables

Yüklə 134,5 Kb.

səhifə	2/3
tarix	26.09.2018
ölçüsü	134,5 Kb.
	#70466

1 2 3

3.4Select p and Select v
4.ALGORITHMS
5.Example: The World-Wide Web
5.2.1Basic Relations
5.2.2Tag and Attribute Relations
5.2.3Relations Built on Tags
5.2.4Relations for secondhand information
5.2.5Bookkeeping

3.3Virtual and Physical Tables

We use the term virtual table to describe a table whose contents are computed in a lazy fashion. Every virtual table must have at least one defining column. The term physical table describes ordinary tables, whose contents are created with the “insert” instruction.

3.4Select_pand Select_v

While the user is not permitted to know what information is in the relational database underlying the virtual tables, it is useful for the system to be able to do this, which it does via the select_p statement, which performs the corresponding select query on the underlying physical database. In contrast, the select_v statement is used to query the virtual database. Each select_v statement is translated into one or more fetch statements, followed by a select_p statement. For example, consider the following statement:

select_vs.square

from squares s, primes p
where s.base = p.number
and p.number < 100

It is converted into the following sequence:

fetch squares(base=p.number)
from primes p
where p.number < 100

select_ps.square

from squares s, primes p
where s.base = p.number
and p.number < 100

The select_p statements are interpreted by the underlying RDBMS. The next section describes the algorithms for converting select_v statements into fetch and select_p statements.

4.ALGORITHMS

This section describes how select_v statements are processed. When all referenced tables are physical, the semantics are the same as in SQL, and the actual table is queried. When virtual tables are involved, they must be converted into fetch and select_p statements, which is done by the following routines:

transform: toplevel routine, which calls the below procedures and outputs required fetch statements (Figure 3).
merge: recognize a conjunction idiom, possibly outputting a fetch statement (Figure 4).
findBound: try to find a bound on a defining column of a virtual table appearing in the select statement; may call itself recursively (Figure 6).
refDependences: help construct fetch statements, adding references to tables upon which the current table depends; may call itself recursively (Figure 5).

5.Example: The World-Wide Web

This section describes a Just-In-Time Database representing the Web. First, we describe the schema; then, we provide an example of its use.

5.1Background

All of its non-leaf nodes are written in hypertext markup language (HTML), which includes tags and attributes to specify intra-document structure, such as a page’s logical structure or how portions of it should be visually rendered. (Leaf nodes may be in HTML or other formats, such as Postscript.) HTML can also express inter-document structure through labeled hyperlinks relating one page to another. Each page is addressed through one or more uniform resource locators (URLs), which are themselves structured. This abundance of structure has been underutilized because of the absence of a unifying interface to the different types of structural information. Just-In-Time Databases address this problem by providing a relational database interface to the Web, allowing the user to query the Web in SQL, taking full advantage of its many types of structural information.

5.2Schema

Relations are divided into four categories: (1) basic relations, (2) tag and attribute relations, (3) relations built on tags, and (4) relations for secondhand information.

5.2.1Basic Relations

In order to provide useful information about the Web, it is necessary that strings, URLs, and Web pages can be represented. The valstring relation, shown in Figure 7 is used to associate an integer, value_id, with a string, textvalue; the integer is used in other tables to refer to the string. The column textvalue is a defining column. If a new textvalue is given, a new line is added to the table, with a new integer and the given textvalue.

The following relations all depend on valstring:

urls, which describes the set of URLs representing the same page
page, which represents information corresponding to a single page
parse, which represents the components of a URL

The urls relation, also shown in Figure 7, is used to map one or more value_ids representing URL strings to a unique url_id. If there were a onetoone correspondence between URL strings and url ids, it would not be needed, but multiple URL strings can reference the same page. For example, both “www.ai.mit.edu” and “www.ai.mit.edu/ index.html” reference the same document, so they are

Procedure transform(selectStatement)

Initialization

Let unboundedTables be the set of table aliases in selectStatement.
Let whereDef be the WHERE clause in selectStatement.
Let selectionTable by an empty hash table indexed on table aliases.
Let orderedKeys be an empty vector.
Let remainingTables be an empty vector.

Call merge(selectStatement, unboundedTables)
Set remainingTables to unboundedTables
For each table alias currentTableAlias in unboundedTables, call
findBound(currentTable, unboundedTables, fetchClauses, whereDef)
Branch point on the value of |unboundedTables|:

|remainingTables|, fail
> 0, go to step 3
else, execute completion code:

For each alias currentAlias in orderedKeys corresponding to a virtual table

Let fetchString = “FETCH ”
Let tablesString = “”
Let clausesString = “ WHERE (1=1) ”
Let dependences be the empty set
For each tuple [currentAlias, LHSstring, op, RHSstring, tablesReferenced] in selectionTable where LHSstring  the empty string

Append to fetchString: LHSstring + op + RHSstring
Add the elements of tablesReferenced to dependences
Let processed be the set containing currentAlias
Call refDependences(currentAlias, selectionTable, dependences, tablesString, clausesString, processed)
Output clause: fetchString + “) FROM ” + tablesString + clausesString

Return success

Figure 3: Pseudocode for transform. Given a selectStatement, transform returns a set of clauses or fails. Code to handle simple syntactic items, such as commas, has been omitted.

Procedure merge(selectStatement, unboundedTables)

Look for “where” clauses of the form:

t1.col1 = t2.col1 and t1.col2 = exp1 and t2.col2 = exp2

where t1 and t2 are different instantiations of the same table T. If found, remove t1 and t2 from unboundedTables and execute:

fetch T(col2 = (exp1 & exp2))

Figure 4: Pseudocode for merge

Procedure refDependences(topAlias, selectionTable, tablesReferenced, tablesString, clausesString, processed)

Let newDependences be the empty set

For each alias predAlias in tablesReferenced and selectionTable but not in processed

Put predAlias in processed

(b) Append to tablesString: TableType(currentTableAlias) + “ ” + currentTableAlias

i. Append to tablesString: TableType(predAlias) + “ ” + predAlias

ii. For each tuple [currentTableAlias, LHSstring, operator, RHSstring, currentTablesReferenced] corresponding to predAlias in selectionTable where currentTablesReferenced does not contain topAlias

A. Append to clausesString: “ and ” + LHSstring + “ ”+ operator + “ ” + RHSstring

B. Add the elements of currentTablesReferenced to the set newDependences

If |newDependences| > 0, call refDependences(topAlias, selectionTable, tablesReferenced, tablesString, clausesString, processed)

Figure 5: Pseudocode for refDependences. The procedure adds table definitions and clauses of referenced tables to the current fetch statement.

Procedure findBound(currentTableAlias, unboundedTables, fetchClauses, whereDef)

Branch point on the type of whereDef

disjunction (clause1 or clause2)

Let returnValue1 be the result of
findBound(currentTableAlias, unboundedTables, fetchClauses, clause1)
If returnValue = false, return false
Let returnValue2 be the result of
findBound(currentTableAlias, unboundedTables, fetchClauses, clause2)
Merge any new clauses of fetchClauses with the same lefthand side to create fetch disjunctions
Return returnValue2

conjunction (clause1 and clause2)

Let returnValue1 be findBound(currentTableAlias, unboundedTables, fetchClauses, clause1)
Let returnValue2 be findBound(currentTableAlias, unboundedTables, fetchClauses, clause2)
Merge any new clauses of fetchClauses with the same lefthand side, to create fetch conjunctions
Return (returnValue1 V returnValue2)

negation (e.g, “tab.col  7”), return false
relational expression (LHS operator RHS)

Let currentTableType be the type of currentTableAlias
If (currentTableType is a virtual table and (operator is not “=” or “LIKE”))
or (LHS is not of the form currentTableAlias.col)
or (col is not a defining column for currentTableType)),
return false
Let referencedTables be the tables referenced in RHS
If |referencedTables  unboundedTables|  0,
return false
Let newClause be the tuple
[currentTableAlias, ToString(LHS), operator, ToString(RHS), tablesReferenced]
Add newClauses to fetchClauses
Remove currentTableAlias from unboundedTables
Return true

Figure 6: Pseudocode for findBound. Returns true if an item was removed from unboundedTables, false otherwise.

valstring

urls

page

colname

type

colname

type

colname

type

value_id

int

url_id

int

url_id*

int

textvalue*

text

value_id*

int

contents*

text

variant

int

bytes

int

md5

char(32)

when

datetime

Figure 7: The valstring, urls, and page relations. Asterisks indicate defining columns.

parse	schema		parse	relation
colname	coltype	url_value_id	component	value	depth
url_value_id*	int	950	“host”	“www.ai.mit.edu”
component	char(5)	950	“port”	80
value	varchar(255)	950	“path”	“index.html”	1
depth	int	950	“path”	“people”	2
		950	“ref”	“t”

Figure 8: The parse relation and an example of its contents for the url_id corresponding to “www.ai.mit.edu/people/index.html#t”

assigned the same url_id, with different variant numbers. Strictly speaking, there is no way to tell if two URLs reference the same file or merely two copies of a file. We use the heuristic that two URLs reference the same page and should have the same url_id if they retrieve identical contents and the URLs have the same host machine. Because the information known about URLs can change, it is possible for the url_id associated with a string to change, in which case all url_ids in system and user tables are changed correspondingly. The value_id column is a defining column. If a new value_id is given, a new line is created with a unique url_id and a variant value of zero.

The page relation (Figure 7) contains information obtained the last time a Web page was retrieved, including the text of a page, its size, a time stamp, and the results of a Message Digest 5 (MD5) hash function [11] on the page's text, providing a quick way to tell if two pages contain the same text. The valid field encodes the results of the last attempted retrieval of the page: “y” if it was successfully retrieved or “n” if it could not be retrieved. The columns url_id and contents are defining columns. If a url_id is given, the system can fetch the associated page from the

Web and fill in the other fields. If a wildcard expression is given for contents, such as “%VLDB%”, the system can ask a search service, such as AltaVista (www.alta-vista.digital.com), to return all pages (that it knows of) containing the substring. These are then added to the table.

The parse relation, defined and illustrated in Figure 8, encodes a parsed version of a URL string. This is useful for determining if two URLs are on the same host machine or if one is above the other in the directory hierarchy. The component field contains “host”, “port”, “path”, or “ref”, and the value field shows the associated value. For the “path” component, the depth within the file hierarchy is also indicated, starting at the file name. Note that one url_id can have multiple parses, if multiple URLs correspond to that page. The following transcript demonstrates the use of these basic relations.

What value_id corresponds to the string www.ai.mit.edu/index.html?
select value_id from valstring where textvalue=“www.ai.mit.edu/index.html”

value_id

950

What url_id corresponds to the URL represented as “www.ai.mit.edu/index.html”?
select url_id from urls where value_id = 950

url_id

817

Show all the known url_ids for this page.
select * from urls where url_id = 817

url_id	value_id	variant
817	950	1
817	1140	2

What strings correspond to these url_ids?
select textvalue from valstring v, urls u
where u.url_id = 817 and v.value_id = u.value_id

textvalue

“www.ai.mit.edu”

“www.ai.mit.edu/index.html”

How many bytes long is the page?
select bytes from page where url_id = 817

bytes

8116

What is the host component of the URL corresponding to value_id 950?
select value from parse where url_value_id = 950 and component = “host”

value

“www.ai.mit.edu”

tag		att
colname	type	colname	type
url_id*	int	tag_id*	int
tag_id	int	name	char(15)
name	char(15)	value_id	int
startOffset	int
endOffset	int

Figure 9: The tag and att (attribute) relations

5.2.2Tag and Attribute Relations

Hypertext consists of text annotated with tags and attributes. Hence, it is necessary that our system can represent tag and attribute information, which it does through the tag and att relations, shown in Figure 9.

The tag relation contains one line for each tag or set of paired tags encountered. (Single tags include “

” (horizontal rule); paired tags include “” and “”.) Information stored includes the url_id, tag name, the start and end character offsets of the tag within the document, and a unique tag_id, which is used to reference attributes. The url_id column is a defining column. Given a url_id, the system can retrieve a page and parse it, filling in the tag table (and the att table as a side effect).

The att relation contains one line for each attribute name and value in a document. The name is a short string and the value an index into valstring. We consider the anchor text enclosed by a pair of tags to be the value of the attribute “anchor”. The following transcript illustrates the use of the tag and att relations:

What tags occur within the first 100 characters on “www.mit.edu”?
select t.name, t.tag_id from tag t, urls u, valstring v where .textvalue = “www.mit.edu” and u.value_id = v.value_id and t.url_id = u.url_id and startOffset < 100

name	tag_id
!DOCTYPE	1080
HTML	1081
HEAD	1082
TITLE	1083

What are the attributes of the “TITLE” tag?
select * from att where tag_id = 1083

tag_id	name	value_id
1083	anchor	1105

What is the text associated with the “TITLE” tag?
select textvalue from valstring where value_id = 1105

textvalue

“SIPB WWW Server Home Page”

5.2.3Relations Built on Tags

Information about the structure of headers and lists on a page and hyperlinks across pages can be inferred from the information in the tag and att tables, but for greater convenience we provide the user with specialized tables header, list, and link, shown in Figures Figure 10Figure 11.

Through the header table, we keep track of the levels of headings at each point of change. HTML supports six levels of headings (H1–H6). The struct column consists of an array of six bytes, each of which corresponds to one level of heading. At the beginning of a page, all six bytes are initialized to zero. Whenever a n> tag is encountered, byte (n-1) is incremented and all bytes with higher offsets are clear. The ord element is an ordinal number associated with each change of header structure. The defining column is url_id, since, given a url_id, a page can be fetched and parsed. As a side effect of computing the header relation, the tag and att relations are also computed.

The list relation keeps track of the list structure of a document. There are four types of lists supported by HTML: ordered lists, unordered lists, directories, and menus. These are not differentiated among in the list relation (but are in the tag relation). Lists can be arbitrarily nested in HTML; the list relation keeps track of up to six levels of nesting of lists. As with header, the only defining column is url_id. The details [15] are similar to, but not identical to, those for the header relationship.

	header		list
colname		type	colname	type
url_id*		int	url_id	int
struct		binary(6)	struct	binary(6)
ord		int	ord	int
offset		int	offset	int

Figure 10: The header and list relations

link
colname	type
source_url_id*	int
anchor_value_id*	int
dest_url_id*	int
hstruct	binary(6)
lstruct	binary(6)

Figure 11: The link relation

The link relation contains one line for each hyperlink encountered, storing the source_url_id, anchor text value_id, and destination url_id. For example, there is a hyperlink from the MIT AI Lab home page with anchor text “LCS” to the MIT Laboratory for Computer Science. The corresponding url_ids for the source (www.ai.mit.edu) and destination (www.lcs.mit.edu) URLs and the value_id for the anchor text (“LCS”) appear as one line in the relation, as do the list and header structures at which the hyperlink occurs. The source_url_id column is a defining column because, given a URL, the system can retrieve the corresponding page, parse it, and determine the other fields. Less obviously, anchor_value_id and dest_url_id are also defining columns. This is done by having our system access AltaVista, which allows queries for pages that contain a given string as either anchor text or as the target of a hyperlink, although it does not (and could not) guarantee that it will return complete set. The system then verifies that the hyperlink still exists before adding it to the link relation.

5.2.4Relations for secondhand information

In addition to direct information about the text and hyperlinks of Web pages, indirect information can be obtained from Web tools. It is important however to distinguish between verified and secondhand information, hence the rcontains table, for reported contents of pages, and the rlink table, for reported links. These are illustrated in Figure 12.

The rcontains relation indicates the claim by the search engine shown in the helper column that a certain page contains a certain piece of text. The value of the num column (by default 10) determines the maximum number of such references added.

The rlink relation contains search engine claims about hyperlinks among pages. Because of the functionality provided by the public search engines, the source_url_id field will always have a nonnull value, but exactly one of anchor and dest_url_id will be null. In other words, a line can indicate the reported source and destination of a hyperlink, or it can indicate the reported source and anchor text of a hyperlink. The missing information, assuming the reported link actually exists, can be extracted from the link relation. As with rcontains, the helper and num fields are used to specify the search engine and maximum number of lines added.

5.2.5Bookkeeping

For each table, the system adds a column, compute_id, to keep track of what fetch statement caused each line to be inserted and when the fetch was performed [15]. This provides a hook for the user to ask questions about how fresh data is and how the Web has changed.

rcontains		rlink
colname	coltype	colname	coltype
url_id	int	source_url_id	int
value	varchar(255)	anchor	varchar(255)
helper	char(16)	dest_url_id	int
num	int	helper	char(16)
		num	char(16)

Figure 12: The rcontains and rlink relations

Yüklə 134,5 Kb.

Dostları ilə paylaş:

1 2 3

%!ps-adobe 0

Virtual and Physical Tables

3.3Virtual and Physical Tables

3.4Selectp and Selectv

4.ALGORITHMS

5.Example: The World-Wide Web

5.1Background

5.2Schema

5.2.1Basic Relations

5.2.2Tag and Attribute Relations

5.2.3Relations Built on Tags

5.2.4Relations for second­hand information

5.2.5Bookkeeping

3.4Select_pand Select_v

5.2.4Relations for secondhand information