Copyright ©2000 W3C® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This document introduces the XML Query Algebra as a formal basis for an XML query language.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.
This is a First Public Working Draft for review by W3C Members and other interested parties. It is a draft document and may be updated, replaced or made obsolete by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress". This is work in progress and does not imply endorsement by the W3C membership.
This document has been produced as part of the W3C XML Activity, following the procedures set out for the W3C Process. The document has been written by the XML Query Working Group.
The purpose of this document is to present the current state of the XML Query Algebra and to elicit feedback on its current state. The XML Query Working Group feels it has made good progress on this document but this is still a work in progress and is liable to change in future versions. A number of important issues are still open - see also [B.3.1 Open Issues]. Comments on this document should be sent to the W3C mailing list www-xml-query-comments@w3.org (archived at http://lists.w3.org/Archives/Public/www-xml-query-comments/).
A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR/.
This document proposes an algebra for XML Query.
This work builds on long standing traditions in the database community. In particular, we have been inspired by systems such as SQL, OQL, and nested relational algebra (NRA). We have also been inspired by systems such as Quilt, UnQL, XDuce, XML-QL, XPath, XQL, and YaTL. We give citations for all these systems below.
In the database world, it is common to translate a query language into an algebra; this happens in SQL, OQL, and NRA, among others. The purpose of the algebra is twofold. First, the algebra is used to give a semantics for the query language, so the operations of the algebra should be well-defined. Second, the algebra is used to support query optimization, so the algebra should possess a rich set of laws. Our algebra is powerful enough to capture the semantics of many XML query languages, and the laws we give include analogues of most of the laws of relational algebra.
It is also common for a query language to exploit schemas or types; this happens in SQL, OQL, and NRA, among others. The purpose of types is twofold. Types can be used to detect certain kinds of errors at compile time and to support query optimization. DTDs and XML Schema can be thought of as providing something like types for XML. The XML Query algebra uses a simple type system that captures the essence of [XSchema1]. The type system is close to that used in XDuce [HP2000]. On this basis, the XML Query algebra is statically typed. This allows to determine and check the output type of a query on documents conforming to an input type at compile time rather than at runtime. Compare this to the situation with an untyped or dynamically typed query language, where each individual output has to be validated against a schema at runtime, and there is no guarantuee that this check will always succeed.
The document is organized as follows. A tutorial introduction is presented in [2 The Algebra by Example]. After reading the tutorial, the reader will have seen the entire algebra and should be able to write example queries. The components of the algebra are described in detail in [3 Algebra Components]. We present some equivalence and optimization laws of the algebra in [4.3 Laws]. Finally, we give the static typing rules for the algebra in [5 Type Rules]. Although it contains the most challenging material, we have tried to make the content as accessible as possible. Nonetheless, this section is not necessary for understanding or using the algebra, therefore the reader may skip this section. In [B.2 Issues list], we discuss open issues and problems.
In Appendix [A The XML Query Data Model], we give a formal mapping relating the algebra to the XML Query Data Model.
Cited literature includes: monads [Mog89], [Mog91], [Wad92], [Wad93], [Wad95], SQL [Date97], OQL [BK93], [BKD90], [CM93], NRA [BNTW95], [Col90], [LW97], [LMW96], Quilt [Quilt], UnQL [BFS00], XDuce [HP2000], XML-QL [XMLQL99], XPath [XPath], XQL [XQL99], and YaTL [YAT99].
This section introduces the main features of the algebra, using familiar examples based on accessing a database of books.
Consider the following sample data:
<bib> <book year="1999" isbn="1-55860-622-X"> <title>Data on the Web</title> <author>Abiteboul</author> <author>Buneman</author> <author>Suciu</author> </book> <book year="2001" isbn="1-XXXXX-YYY-Z"> <title>XML Query</title> <author>Fernandez</author> <author>Suciu</author> </book> </bib>
Here is a fragment of a XML Schema for such data:
<xsd:group name="Bib"> <xsd:element name="bib"> <xsd:complexType> <xsd:group ref="Book" minOccurs="0" maxOccurs="unbounded"/> </xsd:complexType> </xsd:element> </xsd:group> <xsd:group name="Book"> <xsd:element name="book"> <xsd:complexType> <xsd:attribute name="year" type="xsd:integer"/> <xsd:attribute name="isbn" type="xsd:string"/> <xsd:element name="title" type="xsd:string"/> <xsd:element name="author" type="xsd:string" maxOccurs="unbounded"/> </xsd:complexType> </xsd:element> </xsd:group>
This data and schema is represented in our algebra as follows:
type Bib = bib [ Book{0, *} ] type Book = book [ @year [ Integer ] & @isbn [ String ], title [ String ], author [ String ]{1, *} ] let bib0 : Bib = bib [ book [ @year [ 1999 ], @isbn [ "1-55860-622-X" ], title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] ], book [ @year [ 2001 ], @isbn [ "1-XXXXX-YYY-Z" ], title [ "XML Query" ], author [ "Fernandez" ], author [ "Suciu" ] ] ]
The expression above defines two types, Bib
and Book
, and defines one global variable,
bib0
.
The Bib
type corresponds to a single
bib
element, which contains a
forest of zero or more Book
elements. We use the term forest to refer to an sequence of (zero or more) attributes and elements.
Every attribute or element can be viewed as a forest of length one.
The Book
type corresponds to a single
book
element, which contains one
year
attribute
and one isbn
attribute, followed by one
title
element,
followed by one or more
author
elements.
The &
operator, called an all group,
indicates that the year
and
isbn
attributes may occur in any order.
A isbn
attribute and
a title
or
author
element contains a string value, and a
year
attribute contains an integer.
The variable bib0
is bound to a literal XML
value, which is the data model representation of the earlier XML document. The
bib
element contains two book
elements.
The algebra is a strongly typed language, therefore the value of
bib0
must be an instance of its declared type, or the
expression is ill-typed. Here the value of bib0
is an
instance of the Bib
type, because it contains one
bib
element, which contains two
book
elements, each of which contain
an integer-valued year
attribute, a string-valued
isbn
attribute, a string-valued
title
element,
and one or more string-valued author
elements.
For convenience, we define a second global variable
book0
also bound to a literal value, which is
equal to the first book in bib0
.
let book0 : Book = book [ @year [ 1999 ], @isbn [ "1-55860-622-X" ], title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] ]
The simplest operation is projection. The algebra uses a notation similar in
appearance and meaning to path navigation in XPath. The following expression
returns all author
elements contained in
book
elements contained in
bib0
:
bib0/book/author ==> author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ], author [ "Fernandez" ], author [ "Suciu" ] : author [ String ] {0, *}
Note that in the result, the document order of
author
elements is preserved and that duplicate
elements are also preserved.
The above example and the ones that follow have three parts. First is an
expression in the algebra. Second, following the
==>
is the value of this expression. Third,
following the :
is the type of the expression, which
is (of course) also a legal type for the value.
It may be unclear why the type of bib0/book/author
contains zero or more authors, even though the type of
a book
element contains one
or more authors. Let's look at the derivation of the result type by looking at
the type of each sub-expression:
bib0 : Bib bib0/book : Book{0, *} bib0/book/author : author [ String ]{0, *}
Recall that Bib
, the type of bib0
, may
contain zero or more Book
elements, therefore the expression bib0/book
might
contain zero book
elements, in which case,
bib0/book/author
would contain no authors.
This illustrates an important feature of the type system: the type of an
expression depends only on the type of its sub-expressions. It also illustrates
the difference between an expression's run-time value and its compile-time
type. Since the type of bib0
is
Bib
, the best type for
bib0/book/author
is one listing zero or more authors,
even though for the given value of bib0
, the expression will
always contain exactly five authors.
One may access atomic data (strings, integers, or booleans) using
the keyword data()
. For instance, if we wish to select all
author names in a book, rather than all author elements, we could
write the following.
book0/author/data() ==> "Abiteboul", "Buneman", "Suciu" : String {1, *}
Similarly, it is possible to project the atomic values of attributes. The following returns the year the book was published.
book0/@year/data() ==> 1999 : Integer
This notation is similar to the use of text()
in XPath.
We chose the keyword data()
because, as the second example
shows, not all data items are strings.
Another common operation is to iterate over elements in a document so that their content can be transformed into new content. Here is an example of how to process each book to list the author before the title, and remove the year and isbn.
for b in bib0/book do book [ b/author, b/title ] ==> book [ author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ], title [ "Data on the Web" ] ], book [ author [ "Fernandez" ], author [ "Suciu" ], title [ "XML Query" ] ] : book [ author[ String ]{1, *}, title[ String ] ]{0, *}
The for
expression iterates over all
book
elements in bib0
, and binds the
variable b
to each such element. For each element
bound to b
, the inner expression constructs a
new book
element containing the book's authors followed
by its title. The transformed elements appear in the same order as they occur
in bib0
.
In the result type, a book
element is guaranteed
to contain one or more authors followed by one title. Let's look at the
derivation of the result type to see why:
bib0/book : Book{0, *} b : Book b/author : author [ String ]{1, *} b/title : title [ String ]
The type system can determine that b
is
always Book
, therefore the type
of b/author
is author[ String ]{1, *}
,
and the type of b/title
is title[ String
]
.
In general, the value of a for
expression is a forest.
If the body of the for
expression itself yields a forest, then all of
the forests are concatenated together. For instance, the
expression:
for b in bib0/book do b/author
is exactly equivalent to the expression bib0/book/author
.
Here we have explained the typing of for
expressions by example.
In fact, the typing rules are rather subtle
and will be explained in detail in [5 Type Rules].
To select values that satisfy some predicate, we use
the where
expression. For example, the following
expression selects all book
elements in
bib0
that were published before 2000.
for b in bib0/book do where b/@year/data() <= 2000 do b ==> book [ @year [ 1999 ], @isbn [ "1-55860-622-X" ], title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] ] : Book{0, *}
In general, an expression of the form:
where
e1 do e2
is converted to the form
if
e1
then
e2 else
()
where e1 and e2 are
expressions. Here ()
is an expression that stands for
the empty sequence, a forest that contains no attributes or elements. We also write
()
for the type of the empty sequence.
According to this rule, the expression above translates to
for b in bib0/book do if b/@year/data() <= 2000 then b else ()
and this has the same value and the same type as the preceding expression.
The following expression selects all book
elements
in bib0
that have some author
named "Buneman".
for b in bib0/book do for a in distinct(b/author/data()) do where a = "Buneman" do b ==> book [ @year [ 1999 ], @isbn [ "1-55860-622-X" ], title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] ] : Book{0, *}
In contrast, we can use the empty
operator to find
all books that have no author whose name is
Buneman:
for b in bib0/book do where empty(for a in b/author do where a/data() = "Buneman" do a) do b ==> book [ @year [ 2001 ], @isbn [ "1-XXXXX-YYY-Z" ], title [ "XML Query" ], author [ "Fernandez" ], author [ "Suciu" ] ] : Book{0, *}
The empty
expression checks that its argument is
the empty sequence ()
.
We can also use the empty
operator to find all
books where all the authors are Buneman, by checking that there are no authors
that are not Buneman:
for b in bib0/book do where empty(for a in b/author do where a/data() <> "Buneman" do a) do b ==> () : Book{0, *}
There are no such books, so the result is the empty sequence. Appropriate
use of empty
(possibly combined with
not
) can express universally or existentially quantified
expressions.
Here is a good place to introduce the let
expression, which binds a local variable to a value. Introducing local
variables may improve readability. For example, the following expression is
exactly equivalent to the previous one.
for b in bib0/book do let nonbunemans = (for a in b/author do where a/data() <> "Buneman" do a) do where empty(nonbunemans) do b
Local variables can also be used to avoid repetition when the same subexpression appears more than once in a query.
Another common operation is to join values from one or more documents. To illustrate joins, we give a second data source that defines book reviews:
type Reviews = reviews [ book [ title [ String ], review [ String ] ]{0, *} ] let review0 : Reviews = reviews [ book [ title [ "XML Query" ], review [ "A darn fine book." ] ], book [ title [ "Data on the Web" ], review [ "This is great!" ] ] ]
The Reviews
type contains one
reviews
element, which contains zero or more
book
elements;
each book
contains a title and a review.
We can use nested for
expressions to join the two
sources review0
and bib0
on
title values. The result combines the title, authors, and reviews for each
book.
for b in bib0/book do for r in review0/book do where b/title/data() = r/title/data() do book [ b/title, b/author, r/review ] ==> book [ title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] review [ "A darn fine book." ] ], book [ title [ "XML Query" ], author [ "Fernandez" ], author [ "Suciu" ] review [ "This is great!" ] ] : book [ title [ String ], author [ String ] {1, *}, review [ String ] ] {0, *}
Note that the outer-most for
expression determines
the order of the result. Readers familiar with optimization of relational join
queries know that relational joins commute, i.e., they can be evaluated in any
order. This is not true for the XML Query Algebra: changing the order of the first
two for
expressions would produce different output.
In [B.2 Issues list], we discuss extending the algebra to
support unordered forests, which would permit commutable joins.
Often it is useful to regroup elements in an XML document. For example,
each book
element in bib0
groups one title with multiple authors. This expression groups each author
with the titles of his/her publications.
for a in distinct(bib0/book/author/data()) do biblio [ author[a], for b in bib0/book do for a2 in b/author/data() do where a = a2 do b/title ] ==> biblio [ author [ "Abiteboul" ], title [ "Data on the Web" ] ], biblio [ author [ "Buneman" ], title [ "Data on the Web" ] ], biblio [ author [ "Suciu" ], title [ "Data on the Web" ], title [ "XML Query" ] ], biblio [ author [ "Fernandez" ], title [ "XML Query" ] ] : biblio [ author [ String ], title [ String ] {0, *} ] {0, *}
Readers may recognize this expression as a self-join of books on authors.
The expression distinct(bib0/book/author)
produces a
forest of author elements with no duplicates. The outer
for
expression binds a
to each
author element, and the inner for
expression selects the
title of each book that has some author equal to a
.
Here distinct
is an example of a built-in function.
It takes a forest of elements and removes duplicates by content; the order of the
resulting forest is not defined.
The type of the result expression may seem surprising:
each biblio
element may contain
zero or more title
elements,
even though in bib0
every author
co-occurs with a title
. Recognizing such a constraint is
outside the scope of the type system, so the resulting type is not as precise
as we would like.
Often it is useful to query the order of elements in a forest. There are two kinds of order among elements. Document or global order refers to the total order among all elements in a document and corresponds to their sequential appearance. Local order refers to the order among sibling elements in a forest. Currently, the algebra supports querying of local order, but not global order. We discuss global order in [B.2 Issues list].
The index
function pairs an integer index with each
element in a forest:
index(book0/author) ==> pair [ fst [ 1 ], snd [ author [ "Abiteboul" ] ] ], pair [ fst [ 2 ], snd [ author [ "Buneman" ] ] ], pair [ fst [ 3 ], snd [ author [ "Suciu" ] ] ] : pair [ fst [ Integer ], snd [ author [ String ] ] ] {1, *}
Note that the result type guarantees that at least one pair exists in the result,
because (book0/author)
always contains one or more
authors.
Once we have paired authors with an integer index, we can select the first two authors:
for p in index(book0/author) do where (p/fst/data() <= 2) do p/snd/author ==> author [ "Abiteboul" ], author [ "Buneman" ] : author [ String ] {0, *}
The for
expression iterates over all
pair
elements produced by the index expression. It
selects elements whose index value in the fst
element
is between one and two inclusive, and it returns the original content in the
snd
element.
Once again, the result type may be surprising, because
the Book
type guarantees that each book has at least one
author. However, the type system cannot determine that the
conditional where
clause will always succeed, so the
inner clause may produce zero results. (A sophisticated analysis might improve
type precision, but is likely to require much work for little benefit.)
To sort a forest, the algebra provides a sort
expression, whose form is:
sort
Var in
Exp1
by
Exp2.
The variable Var ranges over the items in the forest Exp1
and sorts those items using the key value
defined by Exp2.
For example, this expression sorts book
elements in
review0
by their titles.
Note that the variable b
appear in the key expression
b/title
.
sort b in review0/book by b/title ==> book [ title [ "Data on the Web" ], review [ "This is great!" ] ], book [ title [ "XML Query" ], review [ "This is pretty good too!" ] ] : book [ title [ String ], review [ String ] ] ] {0, *}
The sort
operator is a restricted
form of higher-order
function, i.e., it takes a function as an argument.
In this case, sort
takes a single function, which should
extract the key value from each element.
We have already seen several several built-in
functions: index
, distinct
, and
sort
. In addition to these
functions, the algebra has five built-in aggregation
functions: avg
, count
,
max
, min
, and sum
.
This expression selects books that have more than two authors:
for b in bib0/book do where count(b/author) > 2 do b ==> book [ @year [ 1999 ], @isbn [ "1-55860-622-X" ], title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] ] : Book{0, *}
All the aggregation functions take a forest with repetition type and return
an integer value; count
returns the number of elements
in the forest.
So far, all our examples of attributes and elements use unqualified local names, i.e., names that do not include an explicit namespace URI. It is also possible to specify and match on the expanded name of an attribute or element. The expanded name {Namespace}LocalName consists of a namespace URI Namespace and a local name LocalName.
Consider an inventory of books that contains data from
both http://www.BooksRUs.com
and
http://www.cheapBooks.com
.
In this example, the first element contains values whose names are
defined in the BooksRUs.com
namespace, and the second element
contains values whose names are defined in the
cheapBooks.com
namespace:
let inventory = inv [ {http://www.BooksRUs.com/books.xsd}book [ @{http://www.BooksRUs.com/books.xsd}year [ 1999 ], @{http://www.BooksRUs.com/books.xsd}isbn [ "1-55860-622-X" ], {http://www.BooksRUs.com/books.xsd}title [ "Data on the Web" ], {http://www.BooksRUs.com/books.xsd}author [ "Abiteboul" ], {http://www.BooksRUs.com/books.xsd}author [ "Buneman" ], {http://www.BooksRUs.com/books.xsd}author [ "Suciu" ] ], {http://www.cheapBooks.com/ourschema.xsd}book [ @{http://www.cheapBooks.com/ourschema.xsd}year [ 2001 ], {http://www.cheapBooks.com/ourschema.xsd}title [ "XML Query" ], {http://www.cheapBooks.com/ourschema.xsd}author [ "Fernandez" ], {http://www.cheapBooks.com/ourschema.xsd}author [ "Suciu" ] {http://www.cheapBooks.com/ourschema.xsd}isbn [ "1-XXXXX-YYY-Z" ], ] ] ] : Inventory type Inventory = inv [ InvBook{0, *}]
In this example,
elements imported from existing schemas each refer
to a single namespace, thus the definition of InvBook
is:
type BooksRUBook = {http://www.BooksRUs.com/books.xsd}book [ @{http://www.BooksRUs.com/books.xsd}year [ Integer ] & @{http://www.BooksRUs.com/books.xsd}isbn [ String ], {http://www.BooksRUs.com/books.xsd}title [ String ], {http://www.BooksRUs.com/books.xsd}author [ String ]{1, *} ] type CheapBooksBook = {http://www.cheapBooks.com/ourschema.xsd}book [ @{http://www.cheapBooks.com/ourschema.xsd}year [ Integer ], {http://www.cheapBooks.com/ourschema.xsd}title [ String ], {http://www.cheapBooks.com/ourschema.xsd}author [ String ]{1, *}, {http://www.cheapBooks.com/ourschema.xsd}isbn [ String ], ] type InvBook = BooksRUBook | CheapBooksBook
Here vertical bar (|
) is used to indicate a
choice between types: each InvBook
is either a BooksRUBook
or
a CheapBooksBook
.
We have already seen how to project on the constant name of an attribute or
element.
It is also useful to project on wildcards, which
are used to match names with any namespace and/or any local name.
For example, this expression matches elements
with any local name and with
namespace URI
http://www.BooksRUs.com/books.xsd
:
inventory/{http://www.BooksRUs.com/books.xsd}* ==> {http://www.BooksRUs.com/books.xsd}book [ {http://www.BooksRUs.com/books.xsd}title [ "Data on the Web" ], {http://www.BooksRUs.com/books.xsd}author [ "Abiteboul" ], {http://www.BooksRUs.com/books.xsd}author [ "Buneman" ], {http://www.BooksRUs.com/books.xsd}author [ "Suciu" ] ] : {http://www.BooksRUs.com/books.xsd}book [ {http://www.BooksRUs.com/books.xsd}title [ String ], {http://www.BooksRUs.com/books.xsd}author [ String ] {0, *}, ]
Similarly, this expression first projects elements
in any namespace whose local name is book
and then projects on their year
attributes:
inventory/{*}book/@{*}year ==> @{http://www.BooksRUs.com/books.xsd}year [ 1999 ], @{http://www.cheapBooks.com/ourschema.xsd}year [ 2001 ], : (@{http://www.BooksRUs.com/books.xsd}year [ Integer ] | @{http://www.cheapBooks.com/ourschema.xsd}year [ Integer ]){0,*}
In general, the expression Exp/a is shorthand for
Exp/{*}a, that is,
the namespace wildcard can be omitted.
Similarly, *
is shorthand for {*}*
.
Functions can make queries more modular and concise. Recall that we used the following query to find all books that do not have "Buneman" as an author.
for b in bib0/book do where empty(for a in b/author do where a/data() = "Buneman" do a) do b ==> book [ @year [ 2001 ], @isbn [ "1-XXXXX-YYY-Z" ], title [ "XML Query" ], author [ "Fernandez" ], author [ "Suciu" ] ] : Book{0, *}
A different way to formulate this query is to first define a function that
takes a string s
and a book b
as arguments, and returns true if book b
does not have
an author with name s
.
fun notauthor (s : String; b : Book) : Boolean = empty(for a in b/author do where a/data() = s do a)
The query can then be re-expressed as follows.
for b in bib0/book do where notauthor("Buneman"; b) do b ==> book [ @year [ 2001 ], @isbn [ "1-XXXXX-YYY-Z" ], title [ "XML Query" ], author [ "Fernandez" ], author [ "Suciu" ] ] : Book{0, *}
We use semicolon rather than comma to separate function arguments, since comma is used to concatenate forests.
Note that a function declaration includes the types of all its arguments and the type of its result. This is necessary for the type system to guarantee that applications of functions are type correct.
In general, any number of functions may be declared at the top-level. The order of function declarations does not matter, and each function may refer to any other function. Among other things, this allows functions to be recursive (or mutually recursive), which supports structural recursion, the subject of the next section.
Functions make the algebra extensible. We have seen examples of built-in
functions (sort
and
distinct
) and examples of user-defined functions (notauthor
).
In addition to built-in and user-defined
functions, the algebra could support externally defined functions, i.e.,
functions that are not defined in the algebra itself, but in some external
language. This would make special-purpose implementations of, for example,
full-text search functions available in the algebra. We discuss support for
externally defined functions in [B.2 Issues list].
XML documents can be recursive in structure, for example, it is possible to
define a part
element that directly or indirectly
contains other part
elements. In the algebra, we use
recursive types to define documents with a recursive structure, and we use
recursive functions to process such documents. (We can also use mutually
recursive functions for more complex recursive structures.)
For instance, here is a recursive type defining a part hierarchy.
type Part = Basic | Composite type Basic = basic [ cost [ Integer ] ] type Composite = composite [ assembly_cost [ Integer ], subparts [ Part{1, *} ] ]
And here is some sample data.
let part0 : Part = composite [ assembly_cost [ 12 ], subparts [ composite [ assembly_cost [ 22 ], subparts [ basic [ cost [ 33 ] ] ] ], basic [ cost [ 7 ] ] ] ]
Here vertical bar (|
) is used to indicate a
choice between types: each part is either basic (no subparts), and has a cost,
or is composite, and includes an assembly cost and subparts.
We might want to translate to a second form, where every part has a total cost and a list of subparts (for a basic part, the list of subparts is empty).
type Part2 = part [ total_cost [ Integer ], subparts [ Part2{0, *} ] ]
Here is a recursive function that performs the desired transformation. It
uses a new construct, the match
expression.
fun convert(p : Part) : Part2 = match p case b : Basic do part [ total_cost [ b/cost/data() ], subparts [] ] case c : Composite do let s = (for y in children(c/subparts) do convert(y)) do part [ total_cost [ q/assembly_cost/data() + sum(s/total_cost/data()) ], subparts[ s ] ] else error
Each branch of the match expression is labeled with a type,
Basic
or Composite
, and with a corresponding variable, b
or c
.
The evaluator checks the type of the value of p
at run-time,
and evaluates the corresponding branch.
If the first branch is taken then b
is bound to the
value of p
, and the branch returns a new part with total cost the
same as the cost of b
, and with no subparts. If the second
branch is taken, then c
is bound to the value of p
. The
function is recursively applied to each of the subparts of c
,
giving a list of new subparts s
. The branch returns a new part
with total cost computed by adding the assembly cost of c
to the
sum of the total cost of each subpart in s
, and with subparts
s
.
One might wonder why b
and c
are required,
since they have the same value as p
. The reason why is
that p
, b
, and c
have different
types.
p : Part b : Basic c : Composite
The types of b
and c
are more precise than
the type of p
, because which branch is taken depends upon
the type of value in p
.
Applying the query to the given data gives the following result.
convert(part0) ==> part [ total_cost [ 74 ], subparts [ part [ total_cost [ 55 ], subparts [ part [ total_cost [ 33 ], subparts [] ] ] ], part [ total_cost [ 7 ], subparts [] ] ] ]
Of course, a match
expression may be used in any
query, not just in a recursive one.
Recursive types allow us to define a type that matches any
well-formed XML document. This type is called AnyTree
:
type AnyTree = AnyScalar | AnyElement | AnyAttribute type AnySimpleType = AnyScalar | AnyScalar{0,*} type AnyAttribute = @*[ AnySimpleType ] type AnyElement = *[ AnyComplexType ] type AnyComplexType = AnyTree{0, *} type AnyType = AnySimpleType | AnyComplexType
AnyTree
stands for any scalar, element, or attribute
value.
Here AnyScalar
is a built-in scalar type. It stands
for the most general scalar type, and all other scalar types
(like Integer
or String
) are
subtypes of it.
AnySimpleType
stands for the most
general simple type, which is a scalar or a list of scalars.
AnyAttribute
stands for the most general attribute type.
The asterisk (*
) is used to indicate a
wild-card type, i.e., a type whose name is not known statically.
The type AnyAttribute
may have any name, and its content must have type
AnySimpleType
, i.e., it may contain atomic values,
but no elements.
The AnyElement
stands for the most general element type,
which may have any name, and its content must be a
complex type, which is a repetition of zero or more
AnyTree
s.
Finally, an AnyTree
is either
an AnySimpleType
or an AnyComplexType
.
In other words, any element, attribute, or atomic value
has type AnyType
.
The use of * is a significant extension to XML Schema, because
XML Schema has no type corresponding to *[t],
where t is some type other than
AnyComplexType
.
It is not clear that this extension is necessary, since the more
restrictive expressiveness of XML Schema wildcards may be adequate.
In particular, our earlier data also has type AnyTree
.
book0 : AnyTree ==> book [ @year [ 1999 ], @isbn [ "1-55860-622-X" ], title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] ] : AnyTree
A specific type can be indicated for any expression in the query language, by writing a colon and the type after the expression.
As an example of a function that can be applied to all well-formed documents, we define a recursive function that converts any XML data into HTML. We first give a simplified definition of HTML.
type HTML_body = ( AnyScalar | b [ HTML_body ] | ul [ (li [ HTML_body ]){0, *} ] ) {0, *}
An HTML body consists of a sequence of zero or more items, each of which is
either a scalar, or a b
element, where the content is
an HTML body, or an ul
element, where the children
are li
elements, each of which has as content an HTML
body.
Now, here is the function that performs the conversion.
fun html_of_xml( x : AnyTree ) : Html_Body = match x case s : AnyScalar do s case v1 : AnyAttribute do b [ name(v1) ], ul [ for y in children(v1) do li [ html_of_xml(y) ] ], case v2 : AnyElement do b [ name(v2) ], ul [ for y in attributes(v2) do li [ html_of_xml(y) ], ul [ for y in children(v2) do li [ html_of_xml(y) ] ] else error
The first branch of the match match
expression checks whether the value
of x
is a subtype of AnyScalar
, and
if so then s
is bound to that value, so if this branch
is taken then s
is the same as x
, but
with a more precise type (it must be a scalar, not an element). This branch
returns the scalar.
The second branch checks whether the value
of x
is a subtype of AnyAttribute
. As before,
v1
is the same as
x
but with a more precise type (it must be an attribute,
not a scalar). This branch returns a b
element
containing the name of the attribute, and a ul
element
containing one li
element for each value of the
attribute. The function is recursively applied to get the content of the
li
element.
The last branch is analogous to the second, but it matches an element
instead of an attribute, and it applies html_of_xml
to each of the element's attributes and children.
Applying the query to the book element above gives the following result.
html_of_xml(book0) ==> b [ "book" ], ul [ li [ b [ "year" ], ul [ li [ 1999 ] ] ], li [ b [ "isbn" ], ul [ li [ "1-55860-622-X" ] ] ], li [ b [ "title" ], ul [ li [ "Data on the Web" ] ] ], li [ b [ "author" ], ul [ li [ "Abiteboul" ] ] ], li [ b [ "author" ], ul [ li [ "Buneman" ] ] ], li [ b [ "author" ], ul [ li [ "Suciu" ] ] ] ] : HTML_Body
A query consists of a sequence of top-level expressions, or query items, where each query item is either a type declaration, a function declaration, a global variable declaration, or a query expression. The order of query items is immaterial; all type, function, and global variable declarations may be mutually recursive.
Each query expression is evaluated in the environment specified by all of the declarations. (Typically, all of the declarations will precede all of the query expressions, but this is not required.) We have already seen examples of type, function, and global variable declarations. An example of a query expression is:
query html_of_xml(book0)
To transform any expression into a top-level query, we simply precede the
expression by the query
keyword.
In this section, we summarize the algebra and present the grammars for types and expressions. We start by defining the non-terminal and terminal symbols that appear in the type and expression grammars.
A Namespace is a URI, and a LocalName is an NCName, as in the Namespace recommendation [XML Names]. We let Namespace range over namespaces and the null value, which represents the absence of a namespace, and LocalName range over NCNames. The symbol Name ranges over expanded names. An expanded name is either a local name for which the namespace is absent or namespace, local name pairs.
|
Ed. Note: MF (Oct-13-2000): We need to add the data model accessors that extract the namespace and the local name from an expanded name.
A wildcard denotes a set of names. A wildcard item is
of the form {*}*
denoting any name in any namespace,
{Namespace}*
denoting any name in namespace Namespace,
{*
}LocalName
denoting the local name LocalName in any namespace,
or
{Namespace}LocalName denoting just the
name with namespace Namespace and local name LocalName.
A wildcard consists
of wildcard items, union of wildcards, or difference of wildcards.
We let WItem range over wildcard items, and
Wild range over wildcard expressions.
|
For example,
the wildcard {*}*!{http://www.foo.org/baz}*!{http://www.xsd.com/xsi}String
denotes any
name in any namespace except for names in namespace
http://www.foo.org/baz
and for local name
String
in namespace http://www.xsd.com/xsi
.
Wildcards denote sets of expanded names, so we can define a containment relation, <:wild that relate sets of wildcards. The inequalities that hold for this relation include:
{Namespace}LocalName | <:wild | {*}LocalName |
{Namespace}LocalName | <:wild | {Namespace}* |
{Namespace}* | <:wild | {*}* |
{*}LocalName | <:wild | {*}* |
{Namespace}LocalName | <:wild | {*}* |
The last inequality holds by transitivity. Note, however, that {*}LocalName <:wild {Namespace}* does not necessarily hold.
The Algebra takes as given the atomic datatypes from XML Schema Part 2 [XSchema2]. We let p range over all atomic datatypes.
|
Typically, an atomic datatype is either a primitive datatype, or is derived from another atomic datatype by specifying a set of facets. A type hierarchy is induced between scalar types by containment of facets. Note that lists of atomic datatypes are specified using repetition and unions are specified using alternation, as defined in [3.4 Types : abstract syntax]
The built-in atomic data type
AnyScalar
stands for the most
general scalar type, and all other scalar types (like Integer
or
String
) are subtypes of it.
[Figure 1] contains the abstract syntax for the Algebra's type system. A type is closely related to a model group as defined in [MSL]. MSL uses standard regular expression notation for model groups and we do the same for algebra types. This type system captures the essence of [XSchema1] and is similar to the type system used in XDuce [HP2000].
type variable y type t ::= y type variable | () empty sequence | Ø empty choice | Wild wildcard type | u unit type | t1 , t2 sequence, t1 followed by t2 | t1 | t2 choice, t1 or t2 | t1 & t2 all, t1 and t2 in either order | t {m, n} repetition of t between m and n times bound m, n ::= natural number or *
unit type u ::= p atomic datatype | @Wild[t] attribute with name in Wild and content in t | Wild[t] element with name in Wild and content in t prime type q ::= u | q | q
Figure 1: Abstract Syntax for Types
Ed. Note: MF (Nov-16-2000): Explain why prime type is here.
Ed. Note: MF (Nov-16-2000): Explain why Wild occurs in types.
The empty sequence matches only the empty document; it is an identity for sequence and all. The empty choice matches no document; it is an identity for choice.
() , t | = t = | t , () |
() & t | = t = | t & () |
Ø | t | = t = | t | Ø |
The bounds on a repetition type will be either a natural number (that is,
either a positive integer or zero) or the special value *
,
meaning unbounded. We extend arithmetic to include *
in the obvious way:
|
For technical reasons, we allow the lower bound of a repetition to
be *
. A repetition t
{m, n} is equivalent to the empty
choice Ø if m > n or if
m is *
.
The algebra's external type system, that is, the type definitions associated
with input and output documents, is XML Schema. The internal types are in some
ways more expressive than XML Schema, for example, XML Schema has no type
corresponding to
{*}*[t] where t is some
type other than AnyComplexType
. In general, mapping XML Schema
types into internal types will not lose information, however, mapping internal
types into XML Schema may lose information.
It is useful to define a concrete syntax for the Algebra's types using syntactic categories that specify various subsets of types. Syntactic classes can indicate, for example, that the content of an attribute should be a simple type and that the content of an element should consist of attributes followed by elements. These restrictions guarantee that errors in type construction can be caught during parsing of a query.
In the concrete syntax for types, we capitalize non-terminal symbols. We first distinguish between type variables used to name attribute groups, element groups, and the content types of elements.
TypeVar | ::= | AttrGroupVar |
| | ElemGroupVar | |
| | ContentTypeVar |
A simple type is either an atomic type, a list of atomic types, or a choice of atomic types, which allows a choice of atomic lists.
SimpleType | ::= | p | atomic |
| | p{m,n} | list of atomic | |
| | SimpleType | SimpleType | choice of atomic |
An attribute group defines the syntactic form of attributes: their content may only be a simple type and they are combined only in all groups, not sequences.
AttrGroup | ::= | @Wild[SimpleType] | |
| | @Wild[SimpleType]{0,1} | optional attribute | |
| | AttrGroup & AttrGroup | ||
| | AttrGroupVar | ||
| | () |
An element group contains elements with constant or wildcard names and they are combined in sequences, choices, all groups, and repetitions.
ElemGroup | ::= | Wild[ContentType] | element |
| | ElemGroup , ElemGroup | ||
| | ElemGroup | ElemGroup | ||
| | ElemGroup & ElemGroup | ||
| | ElemGroup{m,n} | ||
| | ElemGroupVar | ||
| | () | ||
| | Ø |
The content type of an element is either an attribute group, an element group, a sequence of an attribute group followed by an element group, or a content-type variable.
ContentType | ::= | AttrGroup | |
| | ElemGroup | ||
| | AttrGroup , ElemGroup | ||
| | ContentTypeVar | content-type variable |
Finally, a type in an algebraic expression may be an attribute group, an element group, or a content type:
Type | ::= | AttrGroup | ElemGroup | ContentType |
Ed. Note: MF (Nov-16-2000): This needs work.
[Figure 2] contains the concrete syntax for expressions. A few of these expressions can be rewritten as other expressions in a smaller core algebra; such reducible expressions are labeled with "*". We define the algebra's typing rules on the smaller core algebra. In [4.3 Laws], we give the laws that relate a reducible expression with its equivalent expression in the core algebra. Typing rules for the core algebra are defined in [5 Type Rules].
We have seen examples of most of the expressions, so we will only point out details here. We define a subset of expressions that correspond to data values. An expression is a data value if it consists only of scalar constant, element, sequence, and empty sequence expressions.
name Name function FuncName variable Var constant Const atomic-datatype constant binary operator BinaryOp ::= +
|-
|and
|or
| =
|!=
|<
|<=
|>=
|>
unary operator UnaryOp ::= +
|-
|not
name expression NameExp ::= Name | {Exp}Exp computed name expression Exp ::= Const atomic constant | NameExp name expression | Var variable | @Name[Exp] attribute | Name[Exp] element | @NameExp[Exp] computed attribute | NameExp[Exp] computed element | Exp, Exp sequence | () empty sequence | if
Expthen
Expelse
Expconditional | let
Var=
Expdo
Explocal binding | FuncName (Exp ;...; Exp) function application | Exp : Type explicit type | error
error | Exp BinaryOp Exp binary operator | UnaryOp Exp unary operator | attributes
(Exp)attributes | children
(Exp)children | name
(Exp)element name | for
Varin
Expdo
Expiteration | match
Expmatch case
Var : Typedo
Exp··· case
Var : Typedo
Expelse
Exp| Exp / @Wild attribute projection * | Exp / Wild element projection * | Exp / data()
atomic projection * | where
Expdo
Expwhere conditional * | empty
(Exp)emptiness predicate * | cast
Exp : Typerun-time type cast * | let
Var : Type=
Expdo
Explocal variable with type * query item Query ::= type
TypeVar = Typetype declaration | fun
FuncName (Var : Type ; ... ; Var : Type): Type =Exp function declaration | let
Var : Type = Expglobal variable declaration | query
Expquery expression
Figure 2: Algebra Expression Syntax
Ed. Note: MF (Oct-13-2000): We need to add the data model accessors that extract the namespace and the local name from an expanded name.
We have not defined the semantics of the binary operators in the algebra. It might be useful to define more than one type of equality over scalar and element values. We leave that to future work.
[Figure 3] summarizes the built-in functions of the algebra.
avg
,count
,min
,max
,sum
Aggregation functions index
Pairs each element of a forest with integer index. sort
Sorts elements on key value bound by variable argument. distinct
Removes duplicates from a forest. namespace
Extracts URI namespace from a name value. localname
Extracts NCName local name from a name value.
Figure 2: Built-In Functions
Ed. Note: MF (Nov-16-2000): We need a section here on operational semantics of expressions. The intent of the "Algebra by Example" section is to do this by example, but a formal operational semantics for each operator is necessary. See [Issue-0074: Operational semantics for expressions].
The examples use the /
operator liberally, but in fact
we use /
as a convenient abbreviation for expressions built from
lower-level operators: for
expressions, the children
function,
and match
expressions.
For example, the expression:
book0/author
is equivalent to the expression:
for v in children(book0) do match v case a : author[AnyComplexType] do a else ()
Here the children
function returns a forest consisting of the
children of the element book0
, namely, a title element
and three author elements (the order is preserved). The
for
expression binds the variable v
successively to each
of these elements. Then the match
expression selects a branch
based on the type of v
. If it is an author
element then
the first branch is evaluated, otherwise the second branch. If the
first branch is evaluated, it returns a
.
The variable a
contains the same
value as v
, but the
type of a
is author[String]
, which is
the intersection of the type of v
and the type author[AnyComplexType]
. If the
second branch is evaluated, then the branch returns ()
, the empty sequence.
To compose several expressions using /
, we again use for
expressions.
For example, the expression:
bib0/book/author
is equivalent to the expression:
for b in children(bib0) do match b case b : book[AnyComplexType] do for d in children(b) do match d case a : author[AnyComplexType] do a else () else ()
The for
expression iterates over all book
elements in
bib0
and binds the variable b
to each such element. For
each element bound to b
, the inner expression returns all the
author
elements in b
, and the resulting forests are
concatenated together in order.
In general, an expression of the form e / a is converted to the form
for v1 in e do for v2 in children(v1) do match v2 case v3 : a[AnyComplexType] do v3 else ()
where e is an expression, a is an element name, and v1, v2, and v3 are fresh variables (ones that do not appear in the expression being converted).
According to this rule, the expression bib0/book
translates to
for v1 in bib0 do for v2 in children(v1) do match v2 case v3 : book[AnyComplexType] do v3 else ()
In [4.3 Laws], we discuss laws which allow us to simplify this to the previous expression:
for v2 in children(bib0) do match v2 case v3 : book[AnyType] do v3 else ()
Similarly, the expression bib0/book/author
translates to
for v4 in (for v2 in children(bib0) do match v2 case v3 : book[AnyType] do v3 else ()) do for v5 in children(v4) do match v5 case v6 : author[AnyType] do v6 else ()
Again, the laws will allow us to simplify this to the previous expression
for v2 in children(bib0) do match v2 case v3 : book[AnyType] do for v5 in children(v3) do match v5 case v6 : author[AnyType] do v6 else () else ()
These examples illustrate an important feature of the algebra: high-level operators may be defined in terms of low-level operators, and the low-level operators may be subject to algebraic laws that can be used to further simplify the expression.
[Figure 4] contains the laws that relate the reducible expressions (i.e., those labeled with "*" in [Figure 2]) to equivalent expressions. In these definitions, Exp1{Exp2/ Var} denotes the expression Exp1 in which all occurrences of the variable Var are replaced by the expression Exp2.
In Rule 1, the projection expression
e / Wild is rewritten as a
for
expression, which binds v to
each element in the forest e, and returns the children
elements of v whose name is in Wild.
Similarly, Rule 2 rewrites the attribute projection expression
e / @Wild as a match embedded
in a for
expression.
Rule 3
rewrites the e / data()
expression
by iterating over items in e.
The nested
match
expression returns
e's children if e
is AnySimpleType
. Otherwise, it returns the empty sequence.
Rule 4 simply rewrites a
where
expression by an if
- then
- else
expression.
Rule 5 rewrites empty
(e) as a
match
expression that returns true if the type of
e is the empty sequence.
Rule 6 rewrites cast
e : t as a
match
expression that returns the value of e
with type t, if it has that type at runtime, otherwise
it raises an error.
Rule 7 rewrites the let
expression with a type as
a let
expression without a type by moving the type
constraint into the expression.
e / Wild Þ for
v1in
edo
for
v2in
children
(v1)do
match
v2case
v3 : Wild[AnyType]
do
v3else ()
(1) e / @Wild Þ for
v1in
edo
for
v2in
children
(v1)do
match
v2case
v3 : @Wild[AnyType]
do
v3else ()
(2) e / data()
Þ for
vin
edo
match children
(v)case
v :AnySimpleType do
velse ()
(3) where
e1do
e2Þ if
e1then
e2else
()(4) empty
(e)Þ match
ecase
v :() do true else false
(5) cast
e : tÞ match
ecase
v : tdo
velse error
(6) let
v : Type=
e1do
e2Þ let
v=
(e1: Type)do
e2(7)
Figure 4: Definitions
In this section, we describe some laws that hold for the algebra.
These laws are important for defining rules that simplify
algebraic expressions, such as eliminating
unnecessary for
or match
expressions.
The iteration construct of the algebra is closely related to an important mathematical object called a monad. A monad, among other things, generalizes set, bag, and list types. In functional languages, the comprehension construct is used to express iteration over set, bag, and list types. A comprehension corresponds directly to a monad [Wad92], [Wad93], [Wad95].
The correspondence between the algebra's iteration construct
and a monad is close, but not exact.
Each monad is
based on a unary type constructor, such as Set{t} or List{t},
representing a homogenous set or list where all elements are of type
t. In the algebra, we have more complex and heterogenous types,
such as a forest consisting of a title, a year, and a sequence of one
or more authors. Also, one important component of a monad is the unit
operator, which converts an element to a set or list. If x has type
t, then set{x} is a unit set of type Set{t} or [x] is a unit
list of type List{t}. In the algebra, we simply write, say,
author["Buneman"]
, which stands for both a tree and for the unit
forest containing that tree.
Monads satisfy three laws, and three corresponding laws are satisfied by the the Algebra's for expression.
First, iteration over a unit forest can be replaced by substition. This is called the left unit law.
for v in e1 do e2 = e2{v := e1}
provided that e1 is a unit type (e.g., is an element or a scalar constant). We write e2{v := e1} to denote the result of taking expression e2 and replacing occurrences of the variable v by the expression e1. For example,
for v in author["Buneman"] do auth[v/data()] = auth["Buneman"]
The iteration over a forest of one item can always be eliminated using variable substitution.
Second, an iteration that returns the iteration variable is equivalent to the identity. This is called the right unit law.
for v in e do v = e
For example
for v in book0 do v = book0
An important feature of the type system described here is that the left side of the above equation always has the same type as the right side.
Third, there are two ways of writing an iteration over an iteration, both of which are equivalent. This is called the associative law.
for v2 in (for v1 in e1 do e2) do e3 = for v1 in e1 do (for v2 in e2 do e3)
For example, a projection over a forest includes an implicit
iteration, so e / a =
for
v in
e
do
v / a. Say we
define a forest of bibliographies, bib1
=
bib0, bib0
. Then bib1/book/author
is equivalent
to the first expression below, which in turn is equivalent to the
second.
for b in (for a in bib1 do a/book) do b/author = for a in bib1 do (for b in a/book do b/author)
With nested relational algebra, the monad laws play a key role in optimizing queries. Similarly, the monad laws can also be exploited for optimization in this context.
For example, if b
is a book, the following finds all authors
of the book that are not Buneman:
for a in b do where a/data() != Buneman do a
If l
is a list of authors, the following renames all author
elements to auth
elements:
for a' in l do auth[ a'/data() ]
Combining these, we select all authors that are not Buneman, and rename the elements:
for a' in (for a in b do where a/data() != Buneman do a) do auth[ a'/data() ]
Applying the associative law for a monad, we get:
for a in b do for a' in (where a/data() != Buneman do a) do auth[ a'/data() ]
Expanding the where
clause to a conditional, we get:
for a in b do for a' in (if a/data() != Buneman then a else ()) do auth[ a'/data() ]
Applying a standard law for distributing loops over conditionals gives:
for a in b do if a/data() != Buneman then for a' in a do auth[ a'/data() ] else ()
Applying the left unit law for a monad, we get:
for a in b do if a/data() != Buneman then auth[ a/data() ] else ()
And replacing the conditional by a where
clause, we get:
for a in b do where a/data() != Buneman do auth[ a/data() ]
Thus, simple manipulations, including the monad laws, fuse the two loops.
[4.1 Relating projection to iteration] ended with two examples of simplification. Returning to these, we can now see that the simplifications are achieved by application of the left unit and associative monad laws.
[Figure 5] contains a dozen algebraic simplification
laws. In a relational query engine, algebraic simplifications are often applied
by a query optimizer before a physical execution plan is generated; algebraic
simplification can often reduce the size of the intermediate results computed
by a query evaluator. The purpose of our laws is similar -- they eliminate
unnecessary for
or match
expressions, or they enable other optimizations by reordering or distributing
computations. The set of laws given is suggestive, rather than complete.
E ::= if
[]then
e1else
e2| let
v=
[]in
e| for
vin
[]do
e| match
[]case
v : tdo
e...
case
v : tdo
e| else
efor
vin
()do
eÞ () (8) for
vin
(e1, e2)do
e3Þ ( for
vin
e1do
e3), (for
vin
e2do
e3)(9) for
vin
e1do
e2Þ e2{e1/ v}, if e1 : u (10) for
vin
edo
vÞ e (11) E[ if
e1then
e2else
e3]Þ if
e1then
E[e2]else
E[e3](12) E[ let
v=
e1do
e2]Þ let
v=
e1do
e[e2](13) E[ for
vin
e1do
e2]Þ for
vin
e1do
E[e2](14)
E[ match
e1case
v : tdo
e2case
v : tdo
en-1... else
en ]Þ
match
e1case
v : tdo
E[ e2 ]... case
v : tdo
E[ en-1 ]else
E[en ](15)
Figure 5: Optimization Laws
Rules 8, 9, and 10 simplify iterations. Rule 8 rewrites an iteration over the empty sequence as the empty sequence. Rule 9 distributes iteration through sequence: iterating over the sequence (e1, e2) is equivalent to the sequence of two iterations, one over e1 and one over e2. Rule 10 eliminates an iteration over a single element or scalar. If e1 is a unit type, then e1 can be substituted for occurrences of v in e2.
Ed. Note: MF (Oct-18-2000) The rules for eliminating trivial
match
expressions need to be written. They are more complex than those for the oldcase
expressions.
Rule 11 eliminates an iteration when the result expression is simply the iteration variable v.
Rules 12--15 are used to commute expressions. Each rule actually
abbreviates a number of other rules, since the context
variable
e stands for a number of different
expressions. The notation e
[e] stands for one of the six expressions given with
expression e replacing the hole [] that appears in
each of the alternatives. For instance, one of the expansions of Rule 14
is the following, when e is taken to
be for
v
in
[] do
e :
|
We explain our type system in the form commonly used in the programming languages community. The rules for typing expressions are described with an inference rule notation. Originally developed by logicians, this notation is now widely used for describing type systems and semantics of programming languages. For a textbook introduction to type systems, see, for example, [Mitchell].
In inference notation, when all judgements above the line hold, then the judgement below the line holds as well. Here is an example of a rule used later on:
|- Data1: t1 |- Data2: t2 |
|
|- (Data1, Data2): (t1, t2) |
Data is the subset of expressions that consists only of scalar content, attribute, element, sequence, and empty sequence expressions.
Data | ::= | Const | atomic constant | |
| | @Name[Data] | attribute | ||
| | Name[Data] | element | ||
| | Data, Data | sequence | ||
| | () | empty sequence |
The judgement "|- Data : t"
is read as "in the empty environment, the value Data has type
t."
The symbol |- is called a turnstyle, and is usually preceded by an
environment symbol, which represents
a mapping from variables to values.
In this example, there is no environment symbol,
which means the judgement holds in the empty environment.
In [5.5 Expressions], we will see examples of rules that
have non-empty environments.
The rule states that if both
Data1 : t1
and
Data2 : t2
hold, then
(Data1,
Data2):
(t1,
t2)
holds as well.
For instance,
take
Data1 = a[1]
,
Data2 = b["two"]
,
t1 = a[Integer]
, and
t2 = b[String]
.
Then since both
a[1]
: a[Integer]
and b["two"]
: b[String]
hold, we may conclude that
(a[1]
, b["two"]
) : (a[Integer]
,
b[String]
)
holds.
The following type rules relate Data values to types. The next rule states that any constant whose lexical form defines a value of type p, then the constant has type p:
|
|- Constp: p |
The next rule states that any constant also has type AnyScalar
:
|
|-
Const :
AnyScalar
|
The next two rules are for attribute and element construction. The first rule states that if Data has type t, then the attribute expression @Name[Data] has type @Name[t]. The subsequent rule is analogous for element expressions.
|- Data : t |
|
|- @Name[Data]: @Name[t] |
|- Data : t |
|
|- Name[Data]: Name[t] |
The next rule is for typing sequences and was already described at the beginning of [5 Type Rules].
|- Data1: t1 |- Data2: t2 |
|
|- (Data1, Data2): (t1, t2) |
The next rule states that the empty sequence value always has the empty sequence type:
|
|- (): () |
The rules above associate the most specific type possible with a Data value. The remaining rules in this section associate more general types with a Data value, which are necessary when the type system must determine whether a Data value with most specific type t is permissible when a value of type t' is expected. This occurs during type checking of a query.
The next two rules are also for attribute and element construction, but these rules specify more general types. The first rule states that if Data has type t, and Name is in the set of names defined by the wildcard expression Wild, then the given attribute expression also has type @Wild[t]. The subsequent rule is analogous for element expressions.
|
||
|
||
|- @Name[Data] : @Wild[t] |
|- Data : t |
|
|- Name[Data] : Wild[t] |
The next rule states that if the value Data has type t1, then it also has type t1| t2. The subsequent rule is analogous and states that the choice operator is associative.
|- Data : t1 |
|
|- Data : (t1| t2) |
|- Data : t2 |
|
|- Data : (t1| t2) |
The next rule states that the sequence of one value with type t followed by a value with repetition type t{m,n} has repetition type t{m+1,n +1}.
|- Data1: t |- Data2: t {m, n} |
|
|- (Data1, Data2): t {m +1, n +1} |
The next rule states that the sequence of one value with type t followed by a value with repetition type t{0,n} has repetition type t{0,n +1}.
|- Data1: t |- Data2: t {0, n} |
|
|- (Data1, Data2): t {0, n +1} |
The next rule states that the empty sequence is a repetition of any type with lower bound 0:
|
|- (): t {0, n} |
The symbol <: denotes the subtype relation. We write t1<: t2 if for every data Data such that |- Data : t1, it is also the case that |- Data : t2, i.e., is t1 is a subtype of t2. The subtyping relation is used in many of the type rules that follow. It is easy to see that the subtype relation <: is a partial order, i.e., it is reflexive, t <: t, and it is transitive, if t1<: t2and t2<: t3then t1<: t3. Here are some of the inequalities that hold:
|
Further, if Name <:wild Wild and t <: t', then
Name[t] | <: | Wild[t'] |
If t <: t' and m ³ m' and n £ n' then
t {m, n} | <: | t {m', n'} |
If t1<: t1' and t2<: t2' then
t1, t2 | <: | t1', t2' |
We write t1= t2 if t1<:t2 and t2<: t1. Here are some of the equalities that hold:
|
We also have that t1<: t2 if and only if t1| t2= t2.
We define the intersection t1 /\ t2 of two types t1 and t2 to be the largest type t that is smaller than both t1 and t2. That is, t = t1 /\ t2 if t <: t1 and t <: t2 and if for any t' such that t' <: t1 and t' <: t2, we have t' <: t.
Before giving the remaining type rules, we introduce
factoring:
Many of the operations in the algebra have an operand that should be of a
repetition type. This is true for the built-in operators index
,
sort
and distinct
.
For example, consider the following.
index (children (book0)) ==> pair [ fst [ 1 ], snd [ title [ "Data on the Web" ] ], pair [ fst [ 2 ], snd [ author [ "Abiteboul" ] ] ], pair [ fst [ 3 ], snd [ author [ "Buneman" ] ] ], pair [ fst [ 4 ], snd [ author [ "Suciu" ] ] ], pair [ fst [ 5 ], snd [ year [ 1999 ] ]
How should we describe the type of the result? By nesting the children of
book0
under snd
the original sequence of
title, author+, year
gets lost. snd
can contain
either a title
, an author
, or a year
.
More formally, we need to find
q, m, and n
such
that
children(book0) : q{m, n}
and then the type is given by
index(children(book0)) : pair [ fst [ String ], snd [q] ]{m, n}
In the case of books, the values of q are:
title [ String ] | author [ String ] | year [ Integer]
the value of m is 3 (because there will be
one
title
,
at
least one author
, and one year
element) and the value of
n
is *
(because
there may be any number of author elements).
We call a type like q a prime type. In general, it may contain scalar, attribute, element, choice, and empty choice types, but it will not contain repetition, sequence, or empty sequence types (except, perhaps, within an element or attribute type). The definition of prime types appears in [Figure 1].
factor (p) = p {1, 1} factor (Wild[t]) = Wild[t]{1, 1} factor (@Wild[t]) = @Wild[t]{1, 1} factor (t1 , t2) = (q1| q2){m1+ m2, n1+ n2} where qi{mi, ni} = factor (ti) factor (t1 & t2) = (q1| q2){m1+ m2, n1+ n2} where qi{mi, ni} = factor (ti) factor (t1| t2) = (q1| q2){m1min m2, n1max n2} where qi{mi, ni} = factor (ti) factor (t {m, n}) = q {m' · m, n' · n} where q {m', n'} = factor (t) factor (()) = Ø{0, 0} factor (Ø) = Ø{*, 0}
Figure 6: Definition of factoring
factor, as shown in [Figure 6], converts any type t to a type of the form q {m, n}, where t <: q {m, n}, so that any value that has type t also has type q {m, n}. For example,
|
We can see here that the factored type is less specific than the unfactored type. For mnemonic convenience we write q {m, n} = factor (t), but one should actually think of the function as returning a triple consisting of a prime type q and two bounds m and n.
Just as factoring a number yields a product of prime
numbers, factoring a type yields a repetition of prime types. Further, the
result yielded by factoring is in some sense optimal. If
q {m, n}
=
factor (t) then
t
<:
q {m, n}
and
furthermore for any
q', m', and
n' such that t <:
q'{m',
n'} we have
that
q <: q' and
m
>=
m'
and
n <= n'. Also, if
t =
t',
then factor (t) =
factor (t'). In particular,
the
choice of the lower bound *
for
factor (Ø)
guarantees that
factor (t) =
factor (t |
Ø), since m min
*
= m.
Note that factor is only used by the type inference rules, and thus is not part of the algebra expressions.
The type rules make use of an environment that specifies the types of variables and functions. The type environment is denoted by G, and is composed of a comma-separated list of variable types, Var : t or function types, FuncName :(t1 ; ... ; tn) : t. We retrieve type information from the environment by writing (Var : t) Î G to look up a variable, or by writing (FuncName : (t1;...; tn) : t) Î G to look up a function.
The type checking starts with an environment that contains all the types declared for functions and global variables. For instance, before typing the first query of [2.2 Projection], the environment contains:
G = bib0 : Bib, book0 : Book
While doing the
type-checking, new
variables will be added in the environment. For instance, when typing the query
of [2.4 Iteration],
variable b
will be typed with Book
,
and
added in the
environment. This will result in a new environment:
G' = G, b : Book
We write
G |- Exp : t if
in
environment G
the
expression
Exp has type
t.
Below are
all the rules except those for for
and match
expressions, which are discussed in
later subsections.
The next rule states that in any environment G,
a constant whose lexical form
defines a value of type p has type
p. So, for example, G |- 1 : Integer
and G |- "two" : String
.
|
G |-
Constp:
p
|
The next five rules are for typing expressions that define names.
|
|-
LocalName: {Null }LocalName
|
|
|- {Namespace}LocalName: {Namespace}LocalName |
G |-
Exp :
NCName
|
|
G |- {Namespace}Exp: {Namespace}* |
G |-
Exp : URI
|
|
G |- {Exp}LocalName: {*}LocalName |
G |-
Exp1:
URI
G |-
Exp2:
NCName
|
|
G |- {Exp1}Exp2: {*}* |
The next rule is a trivial definition of the variable Var having type t in environment G:
(Var : t) Î G |
|
G |- Var : t |
The next two rules are for attribute and element construction with a constant tag name. The first rule states that if Exp has type t, then the attribute expression @Name[Data] has type @Name[t]. The subsequent rule is analogous for element expressions.
G |- Exp : t |
|
G |- @Name[Exp]: @Name[t] |
G |- Exp : t |
|
G |- Name[Exp]: Name[t] |
The next two rules are for attribute and element construction in which the tag name is computed. The first rule states that if Exp1 is in the set of names defined by Wild, and Exp2 has type t, then the given computed attribute expression has type @Wild[t]. The subsequent rule is analogous for element expressions.
G |- Exp1: Wild G |- Exp2: t |
|
G |- *@Exp1[ Exp2]: @Wild[t] |
G |- Exp1: Wild G |- Exp2: t |
|
G |- *Exp1[ Exp2]: Wild[t] |
The following two rules are analogous to the sequence and empty sequence rules in [5.1 Relating data to types].
G |- Exp1: t1 G |- Exp2: t2 |
|
G |- Exp1, Exp2: t1, t2 |
|
G |- () : () |
Note that in the type rule for a conditional expression, the result type is the choice (t2|t3).
G |-
Exp1:
Boolean
G |-
Exp2:
t2
G |-
Exp3:
t3
|
|
G |-
if
Exp1
then
Exp2
else
Exp3:
(t2 |
t3)
|
A let
expression extends the current environment
G
with the variable Var with type
t.
Note that Exp2,
the body of the let
expression,
is typed in the extended environment, and the type of the entire
let
expression is t2.
G |- Exp1: t1 G, Var : t1 |- Exp2: t2 |
|
G |-
let
Var =
Exp1
do
Exp2:
t2
|
The next rule is for function application. In a function application, the type of each actual argument to the function must be a subtype of the corresponding formal argument to the function, i.e., it is not necessary for the actual and formal types to be equal.
|
|||||
|
|||||
G |- FuncName (Exp1;...; Expn) : t |
The next rule states that is always permissible to explicitly type an expression with a type t' that is a supertype of the expression's type t. In programming-language terminology, this operation is sometimes called an ``upcast''.
G |- Exp : t' t' <: t |
|
G |- (Exp : t) : t |
The error
expression always has the empty choice type.
|
G |-
error
: Ø
|
The attributes
, children
, and
name
operators only apply to values with element type.
The first two rules require that the content type t
is equal to an attribute group followed by an element group.
This constraint permits extraction of the appropriate result type.
G |- Exp : Wild[t] t = AttrGroup, ElemGroup |
|
G |-
attributes (Exp) : AttrGroup
|
G |- Exp : Wild[t] t = AttrGroup, ElemGroup |
|
G |-
children (Exp) : ElemGroup
|
Ed. Note: MF, Oct 23/2000: I'm not sure if using the abstract syntax for types here is fair game. An alternative is defining two type functions
attrgroup
andelemgroup
, which extract the constituent types, but that seemed overkill given that we already specify the constraint on content type in the abstract syntax.
The next rule extracts the type of the element's name from the element's type Wild[t].
G |- Exp : Wild[t] |
|
G |-
name (Exp) : Wild
|
The complete set of operators are not yet defined, so it is not possible to give the typing rules for operators. In general, however, arithmetic operators will have a type rule such as the following, in which t1 and t2 are numeric types and appropriate conversion exist between the two:
G |- Exp1 : t1 G |- Exp2 : t2 t1 <: Integer | Float | Decimal, etc. t2 <: Integer | Float | Decimal, etc. |
|
G |- Exp1 Oparith Exp2 : t |
Equality and inequality operators are typed similarly.
G |- Exp1 : t1 G |- Exp2 : t2 |
|
G |-
Exp1
Opeq
Exp2
: Boolean
|
The type rule for index
requires that
its argument be a factored type. The second expression
above the judgement line converts t into a factored type.
G |- Exp : t q {m, n} = factor (t) |
|
G |-
index
Exp : pair
[ fst
[ Integer ],
snd
[q]]{m,
n}
|
Similarly, the type rule for sort
also requires a
factored type. Note that the key expression
Exp2 is typed in the extended
environment G |-
Var : q.
|
||
|
||
G |-
sort Var in
Exp1 by Exp2: q{m, n}
|
Ed. Note: MF, Oct 23/2000: This definition assumes that the equality operator on t is defined. An alternative is requiring Exp2 to have
AnyScalar
, but that seems too restrictive.
The types of aggregated expressions must be factored, and their prime type must be a numeric type.
G |-
Exp
:
t
q{m,
n} =
factor (t)
q <: Integer | Float | Decimal, etc.
|
|||
|
|||
|
G |-
Exp
: t
|
|
G |-
count
Exp :
Integer
|
The distinct
operator also requires an expression with a
factored type, and because distinct
removes duplicates,
the minimum bound of the result type is the minimum of 1 and
the minimum bound of the factored type.
G |- Exp : t q {m, n} = factor (t) |
|
G |-
distinct
Exp :
q {m min 1,
n}
|
The last two rules are for name expressions: they extract the constituent parts of a name.
G |- Exp : Wild |
|
G |-
namespace (Exp)
: URI
|
G |- Exp : Wild |
|
G |-
localname (Exp) : NCName
|
The typing of for
expressions is rather subtle. We give an intuitive
explanation first and then the detailed typing rules below.
A unit type is defined in [Figure 1];
it is either an atomic type,
or an attribute or element type with a constant or wildcard
name. A for
expression
for
Var in
Exp1 do
Exp2
is typed as follows. First, one finds the type of expression Exp1. Next, for each unit type in this type one assumes the variable Var has the unit type and one types the body Exp2. Note that this means we may type the body of Exp2 several times, once for each unit type in the type of Exp1. Finally, the types of the body Exp2 are combined, according to how the types were combined in Exp1. That is, if the type of Exp1 is formed with sequencing, then sequencing is used to combine the types of Exp2, and similarly for choice or repetition.
For example, consider the following expression, which selects all author
elements from a book.
for c in children(book0) do match c case a : author[AnyType] do a else ()
The type of children(book0)
is
title[String], year[Integer], author[String]{1,*}
This is composed of three unit types, and so the body is typed three times.
|
The three result types are then combined in the same way the original unit types were, using sequencing and iteration.
(), (), author[String]{1,*}
as the type of the iteration, and simplifying yields
author[String]{1,*}
as the final type.
As a second example, consider the following expression, which selects all title
and author
elements from a book, and renames them.
for c in children(book0) do match c case t : title[String] do titl [ t/data() ] case y : year[Integer] do () case a : author[String] do auth [ a/data() ] else error()
Again, the type of children(book0)
is
title[String], year[Integer], author[String]{1,*}
This is composed of three unit types, and so the body is typed three times.
|
The three result types are then combined in the same way the original unit types were, using sequencing and iteration. This yields
titl[String], (), auth[String]{1,*}
as the type of the iteration, and simplifying yields
titl[String], auth[String]{1,*}
as the final type. Note that the title occurs just once and the author occurs one or more times, as one would expect.
As a third example, consider the following expression, which selects all basic parts from a sequence of parts.
for p in children(part0/subparts) do match p case b : Basic do b case c : Composite do () else error()
The type of children(part0/subparts)
is
(Basic | Composite){1,*}
This is composed of two unit types, and so the body is typed two times.
|
The two result types are then combined in the same way the original unit types were, using sequencing and iteration. This yields
(Basic | ()){1,*}
as the type of the iteration, and simplifying yields
Basic{0,*}
as the final type. Note that although the original type involves repetition one or more times, the final result is a repetition zero or more times. This is what one would expect, since if all the parts are composite the final result will be an empty sequence.
In this way, we see that for
expressions can be combined with match
expressions
to select and rename elements from a sequence, and that the result is given a
sensible type.
In order for this approach to typing to be sensible, it is necessary that the unit types can be uniquely identified. However, the type system given here satisfies the following law.
a [t1 | t2] = a [t1] | a [t2]
This has one unit type on the left, but two distinct unit types on the right, and so might cause trouble. Fortunately, our type system inherits an additional restriction from XML Schema: we insist that the regular expressions can be recognized by a top-down deterministic automaton. In that case, the regular expression must have the form on the left, the form on the right is outlawed because it requires a non-deterministic recognizer. With this additional restriction, there is no problem.
The method of translating projection to iteration described in the previous section combined with the typing rules given here yield optimal types for projections, in the following sense. Say that variable x has type t, and the projection x / a has type t'. The type assignment is sound if for every value of type t, the value of x / a has type t'. The type assignment is complete if for every value y of type t' there is a value x of type t such that x / a = y. In symbols, we can see that these conditions are complementary.
|
Any sensible type system must be sound, but it is rare for a type system to be complete. But, remarkably, the type assignment given by the above approach is both sound and complete.
The type rule for for expressions uses the following auxiliary judgement. We
write G |- for
Var : t do
Exp : t', if in environment G when the bound variable
Var of an iteration has type t
then the body Exp of the iteration has type
t'.
G, Var : t |- Exp : t' |
|
G |-
for Var : t do Exp : t
|
|
G |-
for Var : () do Exp : ()
|
G |-
for Var :
t1 Exp : t'1
G |-
for
Var : t2 Exp : t'2
|
|
G |-
for Var : t1, t2 do
Exp : t'1, t'2
|
|
G |-
for Var : Ø do Exp : Ø
|
G |-
for
Var : t1
Exp : t'1
G |-
for
Var : t2
Exp : t'2
|
|
G |-
for Var : t1 | t2 do
Exp : t'1 | t'2
|
G |-
for Var : t Exp : t'
|
|
G |-
for Var : t* do
Exp : t'*
|
Given the above rules, the type rule for for
expressions is immediate.
G |-
Exp1 : t1
G |-
for Var : t1
do Exp2 :
t2
|
|
G |-
for Var in Exp1
do Exp2 :
t2
|
The typing of match
expressions is closely related to
the typing of for
expressions.
Due to the typing rules of
for
expressions, it is possible that the body of an iteration is checked
many times. Thus, when a match
expression is checked, it is possible that quite
a lot is known about the type of the expression being matched, and one can
determine that only some of the clauses of the match
apply. The definition of
match
uses the auxiliary judgments to check whether a given
clause is applicable.
We
write G |- case
Var : t do
Exp : t', if in environment G, the bound variable
Var of the case
has type t,
then the body Exp of case
has type
t'. Note the type of the body is irrelevant if t = Ø.
¬(t = Ø) G, Var : t |- Exp : t' |
|
G |-
case Var : t do Exp : t'
|
|
G |-
case Var : Ø do Exp : Ø
|
We write G |- t <: t' else
Exp: t''
if in environment G when t <: t'
does not hold, then
the body Exp of the match
expression's
else
clause has type t''. Note that
the type of the body is irrelevant if t < t'.
t <: t' |
|
G |-
t <: t'
else Exp : Ø
|
¬ t <: t' G |- Exp : t'' |
|
G |-
t <: t'
else Exp : t''
|
Given the above, it is straightforward to construct the typing rule for a
match
expression. Recall that we write t /\ t'
for the intersection of two types.
|
||||||||
|
||||||||
|
We write G |- Query if in environment G, the query item Query is well-typed. A type declaration is always well typed:
|
G |-
type
x =
t
|
A function declaration is well-typed if in the environment extended with the type assignments for its formal variables, its body is well-typed.
G, Var1 : t1, ..., Varn : tn |- Exp : t' t' <: t |
|
G |-
fun
FuncName (Var1 :
t1; ...;
Varn :
tn)
: t =
Exp
|
A top-level let
expression is well-typed if the type of
its body, t', is a subtype of the bound variable's type.
G |- Exp : t' t' <: t |
|
G |-
let
Var : t =
Exp
|
Finally, a top-level query expression is well-typed if its body is well-typed.
G |- Exp : t |
|
G |-
query
Exp
|
We extract the relevant component of a type environment from a query item Query with the function environment (Query).
|
We write |- Query1 ... Queryn if the sequence of query items Query1 ... Queryn is well typed.
|
|||
|
|||
Query1... Queryn |
The XML Query Algebra uses the XML Query Data Model [XQDM00]. Because the XML examples in this document do not use many of the features of the data model, such as attributes, declared namespaces, comments, the algebra uses an abbreviated notation. Here, we give the mapping from the abbreviated algebraic expressions to complete expressions in the XML Query Data Model:
s = ref (stringValue (s, ref Def_string, ref [])); i = ref (integerValue(i, ref Def_integer)); a[ kids ] = ref (elemNode (qnameValue (null, a, ref Def_string), {}, {}, kids, (ref Def_t)) );
For example, the string s
constructs a string value
and
the integer i
constructs an integer value.
The term a[ kids ]
constructs an ElemNode
with the tag a
, an empty set
of namespaces, an empty set of attributes, the forest of children nodes
kids
, and the XML Schema type t
,
where
t
is the
type of the expression a[ kids ]
.
Again, to simplify presentation, types do not include attributes, but they can be added easily. Subtyping or containment relationships exist between types. We discuss subtyping in [5 Type Rules].
Ed. Note: MF (Aug-11-2000): This section needs to be completed.
The issues in [B.2 Issues list] serve as a design history for this document. The ordering of issues is irrelevant. Each issue has a unique id of the form Issue-<dddd> (where d is a digit). This can be used for referring to the issue by <url-of-this-document>#Issue-<dddd>. Furthermore, each issue has a mnemonic header, a date, an optional description, and an optional resolution. For convenience, resolved issues are displayed in green. Some of the issues contain references to W3C internal archives. These are marked with "W3C-members only". Some of the descriptions of the resolved issues are obsolete w.r.t. to the current version of the document.
Ed. Note: PF (Aug-05-2000): For the sake of archival, there are some duplicate issues raised in multiple instances. Duplicate issues are marked as "resolved" with reference to the representative issue.
Unless stated explicitly otherwise, [Issue-0001: Attributes] through [Issue-0039: Dereferencing semantics] have been raised in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jul/0142.html (W3C-members only).
Issue-0001: Attributes
Date: Jul-26-2000
Description: One example of the need for support of [Issue-0049: Unordered Collections], but also: Attributes need to be constrained to contain white space separated lists of simple types only.
Resolution: Attributes are represented by @attribute-name[content]. See [3.4 Types : abstract syntax], [3.5 Types : concrete syntax] for the constraint on white space seperated lists of simple types, and [3.6 Expressions] for selecting and constructing attributes.
Issue-0002: Namespaces
Date: Jul-26-2000
Resolution: Namespaces are represented by {uri-of-namespace}localname. See [3.1 Expanded names] and [3.2 Wildcards].
Issue-0003: Global Order
Date: Jul-26-2000
Description: The data model and algebra do not define a global order on documents. Querying global order is often required in document-oriented queries.
See the thread starting at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0179.html (W3C-members only).
Issue-0004: References vs containment
Date: Jul-26-2000
Description: The query-algebra datamodel currently does not explicitly model children-elements by references (other than the XML-Query Datamodel. This facilitates presentation, but may be an oversimplification with regard to [Issue-0005: Element identity].
Resolution: This issue is resolved by subsumption as follows: (1) As [A The XML Query Data Model] points out, all child-elements are (implicit) references to nodes. (2) Thus, having resolved [Issue-0005: Element identity] this issue is resolved too.
Issue-0005: Element identity
Date: Jul-26-2000
Description: Do expressions preserve element identity or don't they? And does "=" and distinct use comparison by reference or comparison by value?
Resolution: The first part of the question has been resolved by resolution of [Issue-0010: Construct values by copy]. The second part raises a more specific issue [Issue-0066: Shallow or Deep Equality?].
Issue-0006: Source and join syntax instead of "for"
Date: Jul-26-2000
Description: Another term for "source and join syntax" is "comprehension". See [4.3 Laws] for a discussion of the relationship between iteration by for and comprehension syntax.
Resolution: This issue is resolved by subsumption under [Issue-0021: Syntax]. List comprehension is a syntactic alternative to "for v in e1 do e2", which has been favored by the WG in the resolution of [Issue-0021: Syntax].
Issue-0007: References: IDREFS, Keyrefs, Joins
Date: Jul-26-2000
Description: Currently,
the algebra does not support reference values, such as IDREF, or Keyref
(not to be mixed up with "node-references" - see [Issue-0005:
Element identity],
which are defined in the XML
Query Data Model. The algebra's type system should be extended to support
reference types and the data model operators ref
, and
deref
should be supported (similar to id() in XPath).
Issue-0008: Fixed point operator or recursive functions
Date: Jul-26-2000
Description: It may be useful to add a fixed-point operator, which can be used in lieu of recursive functions to compute, for example, the transitive closure of a collection.
Currently, the algebra does not guarantee termination of recursive expressions. In order to ensure termination, we might require that a recursive function take one argument that is a singleton element, and any recursive invocation should be on a descendant of that element; since any element has a finite number of descendants, this avoids infinite regress. (Ideally, we should have a simple syntactic rule that enforces this restriction, but we have not yet devised such a rule.)
Impacts optimization; hard to do static type inference; current algebra is first-order
See for the subproblem of typing "//" or desc() http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0187.html (W3C-members only).
Issue-0009: Externally defined functions
Date: Jul-26-2000
Description: There is no explicit support for externally defined functions.
The set of built-in functions may be extended to support other important operators.
See also http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0003.html (W3C-members only).
Issue-0010: Construct values by copy
Date: Jul-26-2000
Description: Need to be able to construct new types from bits of old types by reference and by copy. Related to [Issue-0005: Element identity].
Resolution: The WG wishes to support both: construction of values by copy, as well as references to original nodes ( http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only)) See also http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0155.html (W3C-members only). This needs some further investigation to sort out all technical difficulties (see [Issue-0062: Open questions for constructing elements by reference]) so the change has not yet been reflected in the algebra document.
Issue-0011: XPath tumbler syntax instead of index?
Date: Jul-26-2000
Description: XPath provides as a shorthand syntax [integer] to select child-elements by their position on the sibling axes, whereas the xml-query algebra uses a combination of a built-in function index() and iteration. See http://lists.w3.org/Archives/Member/w3c-archive/2000Sep/0168.html (W3C-members only) for a suggestion to to support indexed iteration in the form "for v sub i in e1 do e2", and to express index() as a function (or macro).
Issue-0012: GroupBy - needs second order functions?
Date: Jul-26-2000
Description: The type system is currently first order: it does not support function types nor higher-order functions. Higher-order functions are useful for specifying, for example, sorting and grouping operators, which take other functions as arguments.
Resolution: The WG has decided to express groupBy
by a combination of for
and distinct
(see also
http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) and [Issue-0042:
GroupBy]):. Thus w.r.t. to GroupBy this
Issue is resolved. Because GroupBy is not the only use case for higher order functions,
a new issue [Issue-0063:
Do we need (user defined) higher order functions?] is raised.
Issue-0013: Collations
Date: Jul-26-2000
Description: Collations identify the ordering to be applied for sorting strings. Currently, it is considered to have an (optional parameter) collation "name" as follows: "SORT variable IN exp BY +(expression {ASCENDING|DESCENDING} {COLLATION name}) (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only)). An alternative would be to model a collation as a simple type derived from string, and use type-level casting, i.e. expression :collationtype (which is already supported in the XML Query Algebra), for specifying the collation. That would make: "SORT variable IN exp BY +(expression:collationname {ASCENDING|DESCENDING}). But that requires some support from XML-Schema.
More generally, collations are important for any operator in the XML Query Algebra that involves string comparison, among them: sort, distinct, "=" and "<".
Issue-0014: Polymorphic types
Date: Jul-26-2000
Description: The type system is currently monomorphic: it does not permit the definition of a function over generalized types. Polymorphic functions are useful for factoring equivalent functions, each of which operate on a fixed type.
The current type system has already a built-in polymorphic type (lists) and is likely to have more (unordered collections). The question is, whether to allow for user-defined polymorphic types and user defined polymorphic functions.
See also thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0111.html (W3C-members only).
Issue-0015: 3-valued logic to support NULLs
Date: Jul-26-2000
Issue-0016: Mixed content
Date: Jul-26-2000
Description: The XML-Query Algebra allows to generate elements with an arbitrary mixture of data (of simple type) and elements. XML-Schema only allows for a combination of strings interspersed with elements (aka mixed content). We need to figure out whether and how to constrain the XML-Query Algebra accordingly (e.g. by typing rules?)
Issue-0017: Unordered content
Date: Jul-26-2000
Description: All-groups in XML-Schema, not to be mixed up with [Issue-0049: Unordered Collections]
Resolution: The type system has been extended with the support of all-groups - see [3.4 Types : abstract syntax].
Issue-0018: Align algebra types with schema
Date: Jul-26-2000
Description: The algebra's internal type system is the type system of XDuce. A potentially significant problem is that the algebra's types may lose information when converted into XML Schema types, for example, when a result is serialized into an XML document and XML Schema.
This issue comprises also issues [Issue-0016: Mixed content], [Issue-0017: Unordered content], [Issue-0053: Global vs. local elements], [Issue-0054: Global vs. local complex types], [Issue-0019: Support derived types], substitution groups.
Issue-0019: Support derived types
Date: Jul-26-2000
Description: The current type system does not support user defined type hierarchies (by extension or by restriction).
Issue-0020: Structural vs. name equivalence
Date: Jul-26-2000
Description: The subtyping rules in [5.1 Relating data to types] only define structural subtyping. We need to extend this with support for subtyping via user defined type hierarchies - this is related to [Issue-0019: Support derived types].
Issue-0021: Syntax
Date: Jul-26-2000
Description: (e.g. for.<-.in vs for.in.do)
Resolution: The WG has voted for several syntax changes (see also http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only), [3.6 Expressions]: "for v in e do e", "let v = e do", "sort v in e by e ...", "distinct", "match case v:t e ... else e".
Issue-0022: Indentation, Whitespaces
Date: Jul-26-2000
Description: Is indentation significant?
Resolution: The WG has consensus that indentation is not significant (see http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Aug/0043.html (W3C-members only)), i.e., all documents are white space normalized.
Issue-0023: Catch exceptions and process in algebra?
Date: Jul-26-2000
Description: Does the algebra give explicit support for catching exceptions and processing them?
Resolution: Subsumed by new issue [Issue-0064: Error code handling in Query Algebra].
Issue-0024: Value for empty forests
Date: Jul-26-2000
Description: What does "value" do with empty forests?
Resolution: The definition of value(e) has changed to:
value(e) = match children(e) case v: AnyScalar do v else()
Furthermore, the typing rules for "for v in e1 do e2" have been changed such that the variable v is typed-checked seperately for each unit-type occuring in expression e1.
Consequently the example in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jul/0138.html (W3C-members only) would be typed as follows:
query for b in b0/book do value(b/year): Integer{0,*}
rather than leading to an error.
Issue-0025: Treatment of empty results at type level
Date: Jul-26-2000
Description: According to http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jul/0138.html (W3C-members only) this is closely related to [Issue-0024: Value for empty forests].
Resolution: Resolved by resolution of [Issue-0025: Treatment of empty results at type level].
Issue-0026: Project - one tag only
Date: Jul-26-2000
Description: Project is only parameterized by one tag. How can we translate a0/(b | c)?
Resolution: With the new syntax (and type system) a0/(b | c) can be translated to "for v in a0 do match case v1:b[AnyType] do v1 case v2:c[AnyType] do c else ()" - see also [4.1 Relating projection to iteration].
Issue-0027: Case syntax
Date: Jul-26-2000
Description: N-ary case can be realized by nested binary cases. For design alternatives of case see: http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Aug/0017.html (W3C-members only)
Resolution: New (n-ary) case syntax is introduced in [3.6 Expressions].
Issue-0028: Fusion
Date: Jul-26-2000
Description: Does the algebra support fusion as introduced by query languages such as LOREL? This is related to [Issue-0005: Element identity], because fusion only makes sense with support of element identity.
Issue-0029: Views
Date: Jul-26-2000
Description: One of the problems in views: Can we undeclare/hide things in environment? For example, if we support element-identity, can we explicitly discard a parent, and/or children from an element in the result-set? Related to [Issue-0005: Element identity]. See also description in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0047.html (W3C-members only).
Issue-0030: Automatic type coercion
Date: Jul-26-2000
Description: What do we do if a value does not have a type or a different type from what is required? See also thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0071.html (W3C-members only). This link also contains a recommendation, which has been agreed as the general direction to go in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0196.html (W3C-members only):
Suggested Resolution: We believe that the XML Query Language should specify default type coercions for mixed mode arithmetic should be performed according to a fixed precedence hierarchy of types, specifically integer to fixed decimal, fixed decimal to float, float to double. This policy has the advantage of simplicity, tradition, and static type inference. Programmers could explicitly specify alternative type coercions when desirable.
Issue-0031: Recursive functions
Date: Jul-26-2000
Resolution: subsumed by [Issue-0008: Fixed point operator or recursive functions]
Issue-0032: Full regular path expressions
Date: Jul-26-2000
Description: Full regular path expressions allow to constrain recursive navigation along paths by means of regular expressions, e.g. a/b*/c denotes all paths starting with an a, proceeding with arbitrarily many b's and ending in a c. Currently the XML-Query Algebra can express this by means of (structurally) recursive functions. An alternative may be the introduction of a fixpoint operator [Issue-0008: Fixed point operator or recursive functions].
Issue-0033: Metadata Queries
Date: Jul-26-2000
Description: Metadata queries are queries that require runtime access to type information. See also discussion starting at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0087.html (W3C-members only).
Issue-0034: Fusion
Date: Jul-26-2000
Resolution: Identical with [Issue-0028: Fusion]
Issue-0035: Exception handling
Date: Jul-26-2000
Resolution: Subsumed by [Issue-0023: Catch exceptions and process in algebra?] and [Issue-0064: Error code handling in Query Algebra].
Issue-0036: Global-order based operators
Date: Jul-26-2000
Resolution: Subsumed by [Issue-0003: Global Order]
Issue-0037: Copy vs identity semantics
Date: Jul-26-2000
Resolution: subsumed by [Issue-0005: Element identity]
Issue-0038: Copy by reachability
Date: Jul-26-2000
Description: Is it possible to copy children as well as IDREFs, Links, etc.? Related to [Issue-0005: Element identity] and [Issue-0008: Fixed point operator or recursive functions]
Issue-0039: Dereferencing semantics
Date: Jul-26-2000
Resolution: subsumed by [Issue-0005: Element identity]
[Issue-0040: Case Syntax] through [Issue-0047: Attributes] are raised in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Aug/0010.html (W3C-members only)
Issue-0040: Case Syntax
Date: Aug-01-2000
Description: We suggest that the syntax for "case" be made more regular. At present, it takes only two branches, the first labelled with a tag-name and the second labelled with a variable. A more traditional syntax for "case" would have multiple branches and label them in a uniform way. If the algebra is intended only for semantic specification, "case" may not even be necessary.
Resolution: subsumed by [Issue-0027: Case syntax]
Issue-0041: Sorting
Date: Aug-01-2000
Description: We are not happy about the three-step sorting process in the algebra. We would prefer a one-step sorting operator such as the one illustrated below, which handles multiple sort keys and mixed sorting directions: SORT emp <- employees BY emp/deptno ASCENDING emp/salary DESCENDING
Resolution: The WG has decided to go for the above syntax, with an (optional) indication of COLLATION. (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only), [2.10 Sorting]).
Issue-0042: GroupBy
Date: Aug-01-2000
Description: We do not think the algebra needs an explicit grouping operator. Quilt and other high-level languages perform grouping by nested iteration. The algebra can do the same.
related to [Issue-0012: GroupBy - needs second order functions?]
Resolution: The WG has decided (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only)) to skip groupBy for the time being (see also revised [2.8 Restructuring and grouping] and raise [Issue-0069: Organization of Document] for a possible future revision of this resolution.
Issue-0043: Recursive Descent for XPath
Date: Aug-01-2000
Description: The very important XPath operator "//" is supported in the algebra only by writing a recursive function. This is adequate for a semantic specification, but if the algebra is intended as an optimizable target language it will need better support for "//" (possibly in the form of a fix-point operator.)
Resolution: Resolved by subsumption under [Issue-0043: Recursive Descent for XPath] (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only)).
Issue-0044: Keys and IDREF
Date: Aug-01-2000
Description: We think the algebra needs some facility for dereferencing keys and IDREFs (exploiting information in the schema.)
Resolution: Subsumed by [Issue-0007: References: IDREFS, Keyrefs, Joins]
Issue-0045: Global Order
Date: Aug-01-2000
Description: We are concerned about absence of support for operators based on global document ordering such as BEFORE and AFTER.
Resolution: subsumed by [Issue-0003: Global Order]
Issue-0046: FOR Syntax
Date: Aug-01-2000
Description: We agree with comments made in the face-to-face meeting about the aesthetics of the algebra's syntax for iteration. For example, the following syntax is relatively easy to understand: FOR x IN some_expr EVAL f(x) whereas we find the current algebra equivalent to be confusing and misleading: FOR x <- some_expr IN f(x) This syntax appears to assign the result of some_expr to variable x, and uses the word IN in a non-intuitive way.
Resolution: subsumed by [Issue-0021: Syntax]
Issue-0047: Attributes
Date: Aug-01-2000
Description: See [Issue-0001: Attributes].
Resolution: subsumed by [Issue-0001: Attributes]
[Issue-0048: Explicit Type Declarations] through [Issue-0050: Recursive Descent for XPath] are raised in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jul/0148.html (W3C-members only)
Issue-0048: Explicit Type Declarations
Date: Jul-27-2000
Description: Type Declaration for the results of a query: The issue is whether to auto construct the result type from a query or to pre-declare the type of the result from a query and check for correct type on the return value. Suggestion: Support for pre-declared result data type and as well as to coerce the output to a new type is desirable. Runtime or compile time type checking is to be resolved? Once you attach a name to a type, it is preserved during the query processing.
Resolution: W.r.t. compile time type casts this is already possible with e:t (see [3.6 Expressions]). For run-time casts an issue has been raised in [Issue-0062: Open questions for constructing elements by reference].
Issue-0049: Unordered Collections
Date: Jul-27-2000
Description: Currently, all forests in the data model are
ordered. It may
be useful to have unordered forests. The distinct
operator,
for
example,
produces an inherently unordered forest. Unordered forests can benefit from
many
optimizations for the relational algebra, such as commutable joins.
Handling of collection of attributes is easy but the collection of elements is complex due to complex type support for the elements. It makes sense to allow casting from unordered to ordered collection and vice versa. It is not clear whether the new ordered or unordered collection is a new type or not. It affects function resolution, optimization.
See also thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0135.html (W3C-members only).
Our request to Schema to represent insignificance of ordering at schema level has not been fulfilled - see http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0136.html (W3C-members only). Thus we need to be aware that this information may get lost, when mapping to schema.
Issue-0050: Recursive Descent for XPath
Date: Jul-27-2000
Description: Suggestion: The group likes to add a support for fixed-point operator in the query language that will allow us to express the semantics of the // operator in an xpath expression. A path expression of the form a//b may be represented by a fixed-point operator fp(a, "/.")/b.
Resolution: subsumed by [Issue-0043: Recursive Descent for XPath]
Issue-0051: Project redundant?
Date: Aug-05-2000
Description: It appears that project a e could be reduced to sth. like
for v <- e in case v of a[v1] => a[v1] | v2 => ()
... or would that generate a less precise type?
Resolution: With the new type system and handling of the for operator, project is indeed redundant. See [4.1 Relating projection to iteration].
Issue-0052: Axes of XPath
Date: Aug-05-2000
Description: The current algebra makes navigation to parents difficult to impossible. With support of Element Identity [Issue-0005: Element identity] and recursive functions [Issue-0008: Fixed point operator or recursive functions] one can express parent() by a recursive function via the document root. More direct support needs to be investigated w.r.t its effect on the type system.
The WG wishes to support a built-in operator parent() (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only)). For the current state of affairs see thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0074.html (W3C-members only). For some use-cases see http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0011.html (W3C-members only).
Issue-0053: Global vs. local elements
Date: Aug-05-2000
Description: The current type system cannot represent global element-declarations of XML-Schema. All element declarations are local.
Issue-0054: Global vs. local complex types
Date: Aug-05-2000
Description: The current type system does not distinguish between global and local types as XML-Schema does. All types appear to be fully nested (i.e. local types)
Issue-0055: Types with non-wellformed instances
Date: Aug-05-2000
Description: The type system and algebra allows for sequences of simple types, which can usually be not represented as a well-formed document. How shall we constrain this? Related to [Issue-0016: Mixed content].
Issue-0056: Operators on Simple Types
Date: Jul-15-2000
Description: We intentionally did not define equality or relational operators on element and atomic type. These operators should be defined by consensus.
See also first designs for support of arithmetic operators http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0138.html (W3C-members only) and for support of operators for date/time http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0113.html (W3C-members only).
Issue-0057: More precise type system; choice in path
Date: Aug-07-2000
Description: (This subsumes [Issue-0051: Project redundant?]). If the type system were more precise, then (project a e) could be replaced by:
for v <- e in case v of a[v1] => a[v1] | v2 => ()
One could also represent (e/(a|b)) directly in a similar style.
for v <- e in case v of a[v1] => a[v1] | v2 => case v2 of b[v3] => b[v3] | v4 => ()
Currently, there is no way to represent (e/(a|b)) without loss of precision, so if we do not change the type system, we may need to have some way to represent (e/(a|b)) and similar terms without losing precision. (The LA team has a design for this more precise type system, but it is too large to fit in the margin of this web page!)
Resolution: See resolution of [Issue-0051: Project redundant?]
Issue-0058: Downward Navigation only?
Date: Aug-07-2000
Description: Related to [Issue-0052: Axes of XPath]. The current type system (and the more precise system alluded to in [Issue-0057: More precise type system; choice in path]) seems well suited for handling XPath children and descendant axes, but not parent, ancestor, sibling, preceding, or following axes. Is this limitation one we can live with?
Resolution: Subsumed by [Issue-0052: Axes of XPath]
Issue-0059: Testing Subtyping
Date: Aug-07-2000
Description: One operation required in the algebra is to test whether XML type t1 is a subtype of XML type t2, indicated by writing t1 <: t2. There is a well-known algorithm for this, based on tree automata, which is a straightforward variant of the well-known algorithm for testing whether the language generated by one regular-expression is a subset of the language generated by another. (The algorithm involves generating deterministic automata for both regular expressions or types.)
However, the naive implementation of the algorithm for comparing XML types can be slow in practice, whereas the naive algorithm for regular expressions is tolerably fast. The only acceptably fast implementation of a comparison for XML types that the LA team knows of has been implemented by Haruo Hasoya, Jerome Voullion, and Benjamin Pierce at the University of Pennsylvania, for their implementation of Xduce. (Our implementation of the XML query algebra re-uses their code, with permission.)
So, should we adopt a simpler definition of subtyping which is easier to test? One possibility is to adopt the sibling restriction from Schema, which requires that any two elements which appear a siblings in the same content model must themselves have contents of the same type. Jerome Simeon and Philip Wadler discovered that adopting the sibling restriction reduces the problem of checking subtyping of XML types to that of checking regular languages for inclusion, so it may be worth adopting the restriction for that reason.
Issue-0060: Internationalization aspects for strings
Date: Jun-26-2000
Description: These issues are taken from the comments on the Requirements Document by I18N (http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jun/0137.html (W3C-members only)).
Further information can be found at http://www.w3.org/TR/WD-charreq.
It is a goal of i18n that queries involving string matching ("select x where x='some_constant'") treat canonically equivalent strings (in the Unicode sense) as matching. If the query and the target are both XML, early normalization (as per the Character Model) is assumed and binary comparison ensures that the equivalence requirement is satisfied. However, if the target is originally a legacy database which logically has a layer that exports the data as XML, that XML must be exported in normalized form. The XML Query spec must impose the normalization requirement upon such layers.
Similarly, the query may come from a user-interface layer that creates the XML query. The XML Query spec must impose the normalization requirement upon such layers.
Provided that the query and the target are in normalized form C, the output of the query must itself be in normalized form C.
Queries involving string matching should support various kinds of loose matching (such as case-insensitivity, katakana-hiragana equivalence, accent-accentless equivalence, etc.)
If such features as case-insensitivity are present in queries involving string matching, these features must be properly internationalized (e.g. case folding works for accented letters) and language-dependence must be taken into account (e.g. Turkish dotless-i).
Queries involving character counting and indexing must take into account the Character Model. Specifically, they should follow Layer 3 (locale-independent graphemes). Additional details can be found in The Unicode Standard 3.0 and UTR#18. Queries involving word counting and indexing should similarly follow the recommendations in these references.
Issue-0061: Model for References
Date: Aug-16-2000
Description: Raised in: http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Aug/0063.html (W3C-members only). Related to a number of issues around [Issue-0005: Element identity].
Use Cases
Table of Contents
REF *could* do this well if it were restructured - it does not maintain unforeseen relationships or use them...
Bibliographies
Recursive parts
RDF assertions
Inversion of simple parent/child references (related to [Issue-0058: Downward Navigation only?]).
What can we leave out?
can we leave out transitive closure?
can we limit recursion?
can we leave out fixed point recursion?
related to [Issue-0008: Fixed point operator or recursive functions]
Do we need to be able to...
a. Find the person with the maximum number of descendants?
b. Airplane routes: how can I get from RDU to Raleigh? (fixed point: guaranteeing termination in reasonable time...)
c. Given children and their mothers, can I get mothers and their children? (without respect to the form of the original reference...)
related to [Issue-0008: Fixed point operator or recursive functions].
Should we abstract out the difference between different kinds of references? If so, should we be able to cast to a particular kind of reference in the output?
a. abstracting out the differences is cheaper, which is kewl...
b. the kind of reference gives me useful information about: locality (same document, same repository, big bad internet...) static vs. dynamic (xpointer *may* be resolved dynamically, or *may* be resolved at run time, ID/IDREF is static).
related to [Issue-0007: References: IDREFS, Keyrefs, Joins].
do we need to be able to generate ids, e.g. using skolem functions?
for a document in RAM, or in a persistent tree, identity may be present, implicit, system dependent, and cheap - it's nice to have an abstraction that requires no more than the implicit identity
persistable ID is more expensive, may want to be able to serialize with ID/IDREF to instantiate references in the data model
can use XPath instead of generating ID/IDREF, but these references are fragile, and one reason for queries is to create data that may be processed further
persistable ID unique within a repository context
persistable ID that is globally unique
related to [Issue-0005: Element identity].
copy vs. reference semantics
"MUST not preclude updates..."
in a pure query environment, sans update, we do not need to distinguish these
if we have update, we may need to distinguish, perhaps in a manner similar to "updatable cursors" in SQL
programs may do queries to get DOM nodes that can that be modified. It is essential to be able to distinguish copies of nodes from the nodes themselves.
copy semantics - what does it mean?
copy the descendant hierarchy?
copy the reachability tree? (to avoid dangling references)
related to [Issue-0038: Copy by reachability].
The following issues have been raised since Sep-25-2000.
Issue-0062: Open questions for constructing elements by reference
Date: Sep-25-2000
Description: (1) What is the value of parent() when constructing new elements with children refering to original nodes? See also discussion at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0155.html (W3C-members only).
(2) Is an approach to either make copies for all children or provide references to all children, or should we allow for a more flexible combination of copies and references?
Issue-0063: Do we need (user defined) higher order functions?
Date: Oct-16-2000
Description: The current XML-Query-Algebra does not allow functions to be parameters of another function - so called higher order functions. However, most of the algebra operators are (built-in) higher functions, taking expressions as an argument ("sort", "for", "case" to name a few). Even a fixpoint operator, "fun f(x)=e, fix f(x) in e" (see also [Issue-0008: Fixed point operator or recursive functions]), would be a built-in higher order function.
Resolution: As agreed in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0196.html (W3C-members only) the XML Query Algebra will not support user defined higher order functions. It does support a number of built-in higher order functions.
Issue-0064: Error code handling in Query Algebra
Date: Oct-04-2000
Description: How do we return an error code from a function defined in current Query algebra. Do we need to create an array (or a structure) to merge the return value and error code to do this. If that is true, it may be inefficient to implement. In order for cleaner and efficient implementation, it may be necessary to allow a function declaration to take a parameter of type "output" and allow it to return an error code as part of the function definition. See also thread starting with http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0051.html (W3C-members only).
Resolution: One does not need to create a structure to combine return values with error codes, provided each operator or function /either/ returns a value /or/ raises an error. The XML-Query Algebra supports means to raise errors, but does not define standard means to catch errors. Raising errors is accomplished by the expression "error" of type Ø (empty choice). Because Ø | t = t, such runtype errors do not influence static typing. The surface syntax and/or detailed specification of operators on simple types (see [Issue-0056: Operators on Simple Types]) may choose to differentiate errors into several error-codes.
Issue-0065: Built-In GroupBy?
Date: Oct-16-2000
Description: As discussed in http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only), we may revisit the resolution of [Issue-0042: GroupBy] and reintroduce GroupBy along the lines of sort: "group v in e1 by [e2 {collation}]". One reason for this may be that this allows to use collation for deciding about the equality of strings.
Resolution: In http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0196.html (W3C-members only) the WG has decided to close this issue, and for the time being not consider GroupBy as a built-in operator. Furthermore, [Issue-0013: Collations] is ammended to deal with collations for all operators involving a comparison of strings.
Issue-0066: Shallow or Deep Equality?
Date: Oct-16-2000
Description: What is the meaning of "=" and "distinct"? Equality of references to nodes or deep equality of data?
Issue-0067: Runtime Casts
Date: Sep-21-2000
Description: In some contexts it may be desirable to cast values at runtime. Such runtime casts lead to an error if a value cannot be cast to a given type. See also http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only), where the Algebra team has been put in charge of introducing run-time casts into the Algebra.
Resolution: cast
e :
t
has been introduced as a reducible operator expressed in terms of match
(see [4.2 Reducible Expressions]).
Issue-0068: Document Collections
Date: Oct-16-2000
Description: Per our requirements document we are chartered to support document collections. The current XML-Query Algebra deals with single documents only. There are a number of subissues:
(a) Do we need a more elaborate notion of node-references? E.g. pair of (URI of root-node, local node-ref)
(b) Does the namespace mechanism suffice to type collections of nodes from different documents? Probably yes.
(c) Provided (a) and (b) can be settled, will the approach taken for [Issue-0049: Unordered Collections] do the rest?
Issue-0069: Organization of Document
Date: Oct-16-2000
Description: The current document belongs more to the genre (scientific) paper than to the genre specification. One may consider the following modifications: (a) reorganize intro to give a short overview and then state the purpose (strongly typed, neutral syntax with formal semantics as a basis for possibly multiple syntaxes, etc.) (compared to version Aug-23, this version has already gone a good deal in this direction). (b) Equip various definitions and type rules with id's. (c) Elaborate appendices on mapping XML-Query-Algebra Model vs. XML-Query-Datamodel, XML-Query-Type System vs. XML-Schema-Type System. (d) Maybe add an appendix on use-case-solutions. The problem is of course: Part of this is a lot of work, and we may not achieve all for the first release.
Resolution: At http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0196.html (W3C-members only) the WG decided to dispose of this issue. The current overall organization of the document is quite adequate, but of course editorial decisions will have to made all the time.
Issue-0070: Stable vs. Unstable Sort/Distinct
Date: Oct-02-2000
Description: Should sort (and distinct) be stable on ordered collections, i.e. lists, and unstable on unordered collections (see [Issue-0049: Unordered Collections])? For more details see thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0007.html (W3C-members only).
Issue-0071: Alignment with the XML Query Datamodel
Date: Sep-26-2000
Description: Currently, the XML Query Algebra Datamodel does not model PI's and comments. For more details see thread starting with http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0167.html (W3C-members only).
Issue-0072: Facet value access in Query Algebra
Date: Oct-04-2000
Description: Each of the date-time data types have facet values as defined by the schema data types draft spec. This problem is general enough to be applied to other atomic data types.
The question is : Should we provide access to these facet values on an instance of a particular data types? If so, what type of access? My take is the facets are to be treated like read-only attributes of a data instance and one should have a read access to them. See also thread starting at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0044.html (W3C-members only).
Issue-0073: Facets for simple types and their role for typechecking
Date: Oct-16-2000
Description: XML-Schema introduces a number of constraining facets http://www.w3.org/TR/xmlschema-2/#dc-facets for simple types (among them: length, pattern, enumeration, ...). We need to figure out whether and how to use these constraining facets for type-checking. See also thread starting at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0146.html (W3C-members only).
Issue-0074: Operational semantics for expressions
Date: Nov-16-2000
Description: It is necessary to add an operational semantics that formally defines each operator in the Algebra.
Issue-0075: Overloading user defined functions
Date: Nov-17-2000
Description: User defined functions can not be overloaded in the XML Query Algebra, i.e., a function is exclusively identified by its name, and not by its signature. Should this restriction be relaxed and if so - to which extent?
[Issue-0015:
3-valued logic to support NULLs]
[Issue-0018:
Align algebra types with schema]
[Issue-0071:
Alignment with the XML Query Datamodel]
[Issue-0030:
Automatic type coercion]
[Issue-0052:
Axes of
XPath]
[Issue-0013:
Collations]
[Issue-0038:
Copy by reachability]
[Issue-0068:
Document Collections]
[Issue-0009:
Externally defined functions]
[Issue-0073:
Facets for simple types and their role for typechecking]
[Issue-0072:
Facet value access in Query Algebra]
[Issue-0008:
Fixed point operator or recursive
functions]
[Issue-0032:
Full regular
path expressions]
[Issue-0028:
Fusion]
[Issue-0003:
Global Order]
[Issue-0054:
Global vs. local complex
types]
[Issue-0053:
Global vs. local
elements]
[Issue-0060:
Internationalization aspects for strings]
[Issue-0033:
Metadata Queries]
[Issue-0016:
Mixed content]
[Issue-0061:
Model for References]
[Issue-0062:
Open questions for constructing elements by reference]
[Issue-0074:
Operational semantics for expressions]
[Issue-0056:
Operators
on Simple Types]
[Issue-0075:
Overloading user defined functions]
[Issue-0014:
Polymorphic types]
[Issue-0007:
References: IDREFS, Keyrefs, Joins]
[Issue-0066:
Shallow or Deep Equality?]
[Issue-0070:
Stable vs. Unstable Sort/Distinct]
[Issue-0020:
Structural vs. name equivalence]
[Issue-0019:
Support
derived types]
[Issue-0059:
Testing Subtyping]
[Issue-0055:
Types with
non-wellformed instances]
[Issue-0049:
Unordered
Collections]
[Issue-0029:
Views]
[Issue-0011:
XPath tumbler syntax instead of index?]
[Issue-0001:
Attributes]
[Issue-0047:
Attributes]
[Issue-0065:
Built-In GroupBy?]
[Issue-0027:
Case syntax]
[Issue-0040:
Case
Syntax]
[Issue-0023:
Catch exceptions and process in algebra?]
[Issue-0010:
Construct values by copy]
[Issue-0037:
Copy vs identity semantics]
[Issue-0039:
Dereferencing semantics]
[Issue-0063:
Do we need (user defined) higher order functions?]
[Issue-0058:
Downward Navigation only?]
[Issue-0005:
Element identity]
[Issue-0064:
Error code handling in Query Algebra]
[Issue-0035:
Exception handling]
[Issue-0048:
Explicit Type Declarations]
[Issue-0046:
FOR
Syntax]
[Issue-0034:
Fusion]
[Issue-0045:
Global
Order]
[Issue-0036:
Global-order based operators]
[Issue-0012:
GroupBy - needs second order functions?]
[Issue-0042:
GroupBy]
[Issue-0022:
Indentation, Whitespaces]
[Issue-0044:
Keys and IDREF]
[Issue-0057:
More precise type system; choice in path]
[Issue-0002:
Namespaces]
[Issue-0069:
Organization of Document]
[Issue-0026:
Project - one tag only]
[Issue-0051:
Project redundant?]
[Issue-0043:
Recursive Descent for
XPath]
[Issue-0050:
Recursive Descent for
XPath]
[Issue-0031:
Recursive functions]
[Issue-0004:
References vs containment]
[Issue-0067:
Runtime Casts]
[Issue-0041:
Sorting]
[Issue-0006:
Source and join syntax instead of "for"]
[Issue-0021:
Syntax]
[Issue-0025:
Treatment of empty results at type level]
[Issue-0017:
Unordered content]
[Issue-0024:
Value for empty forests]