Copyright ©2000 W3C® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
An XML syntax for describing collections of characters is proposed. This makes it possible to identify character collections by URIs and thus to reference them from other protocols and formats. The main usage areas for character collections are schemas, forms, and stylesheets. Several constructs, in particular kernels, hulls, and alternatives, are provided to allow incomplete specifications and to increase network efficiency.
This document is an initial proposal made available as a NOTE by the World Wide Web Consortium (W3C) for further discussion. This indicates no endorsement of its content by the W3C at this moment. Depending on the response from interested/affected W3C Working Groups and from the W3C Members at large, this NOTE will be assigned to a Working Group (potentially the Internationalization Working Group) for further work.
This document is an editorial revision of an earlier version. A list of changes can be found in Appendix D.
Please send comments on this document to www-international@w3.org for public discussion, to the mailing list of the W3C Internationalization Interest Group (members only) for W3C-internal discussion, and directly to the author at duerst@w3.org for editorial issues. Experience reports from experimental implementations are welcome, but the W3C will not allow early implementations to constrain its ability to make changes.
A list of current W3C technical reports and publications, including Working Drafts and Notes, can be found at http://www.w3.org/TR/.
1. Introduction
2. Specification of the Notation
2.1 Overview
2.2 Kernels and Hulls
2.3 Alternatives
3. Examples and Applications
3.1 Well-known collections
3.2 Open collections
3.3 Efficient network access and lazy evaluation
3.4 Built-in collections
3.5 Character properties as collections
3.6 CSS2: The 'unicode-range' property
3.7 Styling
3.8 Form Input
3.9 Schemas and Regular Expressions
Acknowledgements
References
Appendix A: Definitions of Operators
Appendix B: DTD
Appendix C: Definition of <range> contents
Appendix D: List of Changes
A precise and concise specification of operations on characters and strings is often desirable. Many of these operations depend on character types, character classes, or character properties. To simplify specifications and operations, a common notation helps. This NOTE is an attempt to develop such a notation, based on the concept of character collections.
A character collection is a set of characters. Once a notation for character collections is defined and agreed upon, collections can be easily defined and referenced via web addresses (URIs) [URI], can be enumerated, and can be constructed from other collections using set operators.
By making character collections available via a URI, they become first-class web objects. The URI of a collection serves both as an identifier (or name) of the collection and as a way to obtain the description of the collection if necessary. There is no need to define names for collections separately from the URI.
For various reasons, it is often difficult to exactly specify a set of characters. To take care of such cases, hulls and kernels are introduced. A kernel contains characters that are guaranteed to be in the set; the set may contain other characters. A hull gives an outer boundary; characters which are not in the hull are guaranteed not to be in the set.
The syntax given in this NOTE limits characters to be codepoints in the Universal Character Set (UCS, [ISO10646], [Unicode]). The term 'character' correspondingly can be read as a synonym for 'UCS codepoint' throughout this NOTE. There may be reasons for extending the syntax to include e.g. combinations of UCS codepoints for specific applications. This is left for future discussions.
On the Web, network accessibility and efficiency are primary concerns. Alternative descriptions, lazy evaluation, built-in or well-known collections, as well as kernels and hulls can be used to address these concerns. Application areas for character collections include styling, forms, and schemas. These topics are discussed in Section 3.
The term character collection has been chosen here because character set is already used (and misused) in various contexts. See [Harm] for some background.
The notation to specify collections uses XML [XML]. Each XML element in the syntax specifies a collection. Most elements are operators which combine a number of collections into a new collection. For the exact semantics, see the following sections and Appendix A. The DTD for a collection specification is given in Appendix B. The notation has the following elements: <collection>, <hull>, <kernel>, <union>, <intersection>, <difference>, <alt>, <ref>, <range>, and <enum>.
For details on how these elements can be combined, please see the DTD in Appendix B.
Because the repertoire of characters in the Universal Character Set is growing, characters may be continuously added to a set. In this case, it is impossible to specify the set exactly, but it is possible to specify that some characters are included, and others are excluded. Also, in some cases, collections may only be used as hints; in these cases, it is not necessary to specify a collection exactly. As an example, the CSS2 specification allows an author to indicate ranges of characters for which a given font may have appropriate glyphs; this is used to avoid downloading fonts that are completely unrelated to the characters that one tries to render.
For such cases, the concepts of hull and kernel are introduced. A kernel contains characters that are guaranteed to be in the collection; the collection may contain other characters. A hull gives an outer boundary so that characters which are not in the hull are guaranteed not to be in the collection; some characters in the hull may not actually be in the collection. This is shown in the following figure.
In the markup for collections, <kernel>A</kernel> should be read as '(X) has a kernel A', and not as 'A has a kernel'. Similarly, <hull>B</hull> should be read as '(X) has a hull B', and not as 'B has a hull'.
A collection may not be exactly known (that's when kernels and hulls have their use), but as a concept it is assumed to be well defined. On the other hand, for each collection, there are a huge number of collections that can serve as a hull or as a kernel. There is usually a trade-off between efficiency and precision.
<union>, <intersection>, and <difference> work according to set algebra (see e.g. [Gerstling]). Given a collection description only containing <union>, <intersection>, <difference>, <enum>, and <range>, for each character c, the question "Is c in the collection?" can be answered with "yes" or "no". Kernels and hulls introduce a third answer, namely "don't know". Under set operations, the answers behave in the usual way (see Appendix A).
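The three answers and their behaviour under the set operators can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names and the use of the strings "+", "?", and "-" (borrowed from the tables in Appendix A) are choices made here, not part of the notation.

```python
# Illustrative three-valued answers; the symbols follow the tables in Appendix A.
YES, UNKNOWN, NO = "+", "?", "-"

def union(a, b):
    """A character is in the union if either operand includes it."""
    if YES in (a, b):
        return YES
    if a == NO and b == NO:
        return NO
    return UNKNOWN

def intersection(a, b):
    """A character is excluded as soon as either operand excludes it."""
    if NO in (a, b):
        return NO
    if a == YES and b == YES:
        return YES
    return UNKNOWN

def complement(a):
    """Helper for difference: swap yes and no, keep unknown."""
    return {YES: NO, NO: YES, UNKNOWN: UNKNOWN}[a]

def difference(a, b):
    """A minus B, expressed as A intersected with the complement of B."""
    return intersection(a, complement(b))

def hull(a):
    """A hull can only exclude characters; everything else stays unknown."""
    return NO if a == NO else UNKNOWN

def kernel(a):
    """A kernel can only include characters; everything else stays unknown."""
    return YES if a == YES else UNKNOWN
```

For instance, `union(YES, UNKNOWN)` gives `"+"`, while `intersection(YES, UNKNOWN)` stays at `"?"`.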
Unions and intersections do not make it possible to combine a kernel with a hull in the obvious way (i.e. "both apply"). For this, the <alt> operator is defined. If one of its operands "knows" something about a character, and the other doesn't, the result is taken from the operand that "knows". Collection X in the figure in Section 2.2 can be expressed using <alt>, <kernel>, and <hull>, as follows:
<collection>
  <alt>
    <kernel>A</kernel>
    <hull>B</hull>
  </alt>
</collection>
The <alt> operator also makes it possible to deal with network access problems. As an example, assume that the collection we want to specify can be exactly and compactly specified in two ways, relying on two different network resources that are indicated with the <ref> element. Any of these network resources may be inaccessible, but the chance that both of them are inaccessible is lower than the chance that only one of them is inaccessible. In fact, implementations may have one or the other built in (see built-in collections, Section 3.4), in which case at least one of them is guaranteed to be accessible.
On usage of a collection description, the answer from a <ref> element that is not accessible is the same as e.g. for characters not in a kernel, i.e. "don't know". However, tools used to build collections should provide a way for somebody defining a collection to know whether all the referenced resources were accessible. This is very useful because the <alt> operator introduces a fourth potential answer, namely "error": if one of the operands of <alt> defines that a certain character is included in the overall collection, and another operand defines that this character is excluded, then there is a contradiction and therefore an error.
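The behaviour of <alt>, including the "error" answer and the treatment of an inaccessible <ref> as "don't know", might be sketched as follows. The names `alt`, `ref_answer`, and `alt_all`, and the convention that an unreachable resource raises `OSError`, are assumptions made for this illustration. How an "error" operand combines with further operands is not defined by the notation; here it is simply allowed to propagate.

```python
from functools import reduce

YES, UNKNOWN, NO, ERROR = "+", "?", "-", "error"

def alt(a, b):
    """Take the answer from whichever operand "knows"; contradictions are errors."""
    if a == UNKNOWN:
        return b
    if b == UNKNOWN:
        return a
    return a if a == b else ERROR

def ref_answer(lookup, cp):
    """Answer for a <ref>. `lookup` is a hypothetical callable that returns
    True/False for a code point and raises OSError if the resource is unreachable."""
    try:
        return YES if lookup(cp) else NO
    except OSError:
        return UNKNOWN   # inaccessible resource: "don't know"

def alt_all(answers):
    """Fold the binary operator over any number of operands."""
    return reduce(alt, answers)

def unreachable(cp):
    raise OSError("resource not accessible")

def ascii_digits(cp):
    return 0x30 <= cp <= 0x39

# One resource is down, the other answers: the result comes from the one that "knows".
result = alt_all([ref_answer(unreachable, 0x35), ref_answer(ascii_digits, 0x35)])
```

A tool for building collections could additionally report which `lookup` calls failed, so that the person defining the collection knows whether a contradiction might have gone unnoticed.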
This section gives some usage examples, starting with examples that are more geared towards general mechanisms and continuing with examples in particular application fields.
Various organizations may define character collections for general reference. As a very prominent and important example, Technical Corrigendum No. 2 to ISO/IEC 10646-1:1993 [ISO10646] provides a facility to define collections that can be identified by a collection number and a collection name. A significant number of collections are already defined in Annex A of ISO/IEC 10646-1. These collections mainly refer to scripts (e.g. collection 11, ARMENIAN), subsets of scripts (e.g. collection 12, BASIC HEBREW), and ranges of related characters (e.g. collection 34, CURRENCY SYMBOLS; please note that this collection only includes currency symbols in the range U+20A0-20CF; other currency symbols, e.g. '$', are not included).
ISO/IEC 10646-1 identifies some of the collections as fixed. A fixed collection is exactly defined and does not change in the future. As an example, collection 1, BASIC LATIN, is a fixed collection, and can be expressed as:
<collection><range>U+20-7E</range></collection>
Another kind of well-known collections are the repertoires of coded character sets and character encodings (see e.g. the International Register of Coded Character Sets to be used with Escape Sequences [ISOIR] and the IANA 'charset' registry [IANA]).
Definitions of well-known collections should be made easily available on the Internet in a computer-processable form. This might be done preferably by the organization defining the collection or alternatively by another organization. The URIs chosen should be extremely stable, and should follow a uniform naming convention for a given series of collections.
ISO/IEC 10646-1 [ISO10646] also has the concept of open collections. Open collections can include unassigned codepoints, and if characters are assigned to such a codepoint, they automatically become members of the collection. The notation of open collections is based on using a hull. As an example, collection 24 (MALAYALAM) can be defined as:
<collection><hull><range>U+D00-D7F, U+200C, U+200D</range></hull></collection>
This definition is somewhat imprecise because there are many characters that we actually know to be in the collection, while the above definition just tells us that they may be in the collection. This can be fixed in at least two ways. One way is to add a kernel that lists the currently defined characters:
<collection>
  <alt>
    <hull><range>U+D00-D7F, U+200C, U+200D</range></hull>
    <kernel>
      <range>
        U+D02-D03, U+D05-D0C, U+D0E-D10, U+D12-D28, U+D2A-D39, U+D3E-D43,
        U+D46-D48, U+D4A-D4D, U+D57, U+D60-D61, U+D66-D6F, U+200C, U+200D
      </range>
    </kernel>
  </alt>
</collection>
The other alternative is to assume the existence of a continuously updated collection containing all the currently defined characters:
<collection>
  <intersection>
    <range>U+D00-D7F, U+200C, U+200D</range>
    <ref href=" ... "/> <!-- pointer to collection of characters currently defined in ISO/IEC 10646 -->
  </intersection>
</collection>
For network efficiency, the two solutions above can be combined with an <alt> operator. This is discussed in the next section.
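To illustrate how the hull-plus-kernel definition of MALAYALAM answers membership questions, here is a minimal evaluator sketch using Python's standard `xml.etree.ElementTree`. It is an assumption-laden illustration, not a reference implementation: it handles only <collection>, <alt>, <hull>, <kernel>, and <range>, it ignores the question-mark block form of ranges, and the helper names are invented here. The kernel ranges are written with every Uvalue spelled out in full.

```python
import xml.etree.ElementTree as ET
from functools import reduce

YES, UNKNOWN, NO, ERROR = "+", "?", "-", "error"

MALAYALAM = """
<collection>
  <alt>
    <hull><range>U+D00-D7F, U+200C, U+200D</range></hull>
    <kernel>
      <range>
        U+D02-D03, U+D05-D0C, U+D0E-D10, U+D12-D28, U+D2A-D39, U+D3E-D43,
        U+D46-D48, U+D4A-D4D, U+D57, U+D60-D61, U+D66-D6F, U+200C, U+200D
      </range>
    </kernel>
  </alt>
</collection>
"""

def in_range(text, cp):
    """True if code point cp falls in a list such as 'U+D00-D7F, U+200C'."""
    for item in text.split(","):
        first, _, last = item.strip()[2:].partition("-")   # drop the leading 'U+'
        if int(first, 16) <= cp <= int(last or first, 16):
            return True
    return False

def alt(a, b):
    if a == UNKNOWN:
        return b
    if b == UNKNOWN:
        return a
    return a if a == b else ERROR

def evaluate(elem, cp):
    """Return "+", "?", "-", or "error" for code point cp."""
    if elem.tag == "range":
        return YES if in_range(elem.text, cp) else NO
    answers = [evaluate(child, cp) for child in elem]
    if elem.tag == "collection":
        return answers[0]
    if elem.tag == "hull":
        return NO if answers[0] == NO else UNKNOWN
    if elem.tag == "kernel":
        return YES if answers[0] == YES else UNKNOWN
    if elem.tag == "alt":
        return reduce(alt, answers)
    raise ValueError("unsupported element: " + elem.tag)

root = ET.fromstring(MALAYALAM)
```

With this sketch, `evaluate(root, 0x0D05)` yields `"+"` (listed in the kernel), `evaluate(root, 0x0D04)` yields `"?"` (inside the hull but not in the kernel), and `evaluate(root, 0x41)` yields `"-"` (outside the hull).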
There are clearly large speed differences between accessing data locally and over the network, and in some cases, data only available over the network may not be available at all. The evaluation of collections should rely on standard network performance improvement mechanisms wherever possible, and should do this in a transparent way. In particular, caching is very useful to avoid repeated download of collection descriptions. It is expected that collection descriptions change extremely slowly, if at all. Servers serving collection data should be configured properly in order to let proxies and clients take full advantage of caching wherever possible.
Appropriate design of collection descriptions and implementations can also increase network efficiency. An implementation only needs to download a collection description referenced in a <ref> if it cannot decide otherwise whether a character in question is in the collection or not. This is called lazy evaluation. A collection definition can be written to help such implementations, by providing simple cutoffs. For example, assume that a collection definition contains an enumeration for all Han characters defined in JIS X 0208-1997 (the standard Japanese set of Kanji characters available on computers in Japan). This is about a third of the CJK range U+4E00-9FA5, sprinkled all over very irregularly. Rather than just referencing this collection with a <ref>, the following syntax should be used:
<collection>
  <alt>
    <hull><range>U+4E00-9FA5</range></hull>
    <ref href=" ... "/> <!-- pointer to JIS X 0208-1997 collection -->
  </alt>
</collection>
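The effect of such a cutoff on a lazily evaluating implementation can be sketched as follows. The function names are hypothetical, and `remote_ref` is only a stand-in that records when it is consulted; its two-character set is a placeholder, not the real JIS X 0208 repertoire. Note the trade-off in this sketch: because evaluation stops at the first definite answer, a contradiction between operands would go undetected.

```python
YES, UNKNOWN, NO = "+", "?", "-"

calls = []   # records every code point for which the remote description was needed

def cjk_hull(cp):
    """<hull><range>U+4E00-9FA5</range></hull>: can only rule characters out."""
    return UNKNOWN if 0x4E00 <= cp <= 0x9FA5 else NO

def remote_ref(cp):
    """Stand-in for the <ref>; a real implementation would download
    (and cache) the referenced collection description here."""
    calls.append(cp)
    placeholder = {0x4E00, 0x4E09}   # illustrative subset only
    return YES if cp in placeholder else NO

def lazy_alt(operands, cp):
    """Evaluate operands in order (cheapest first) and stop at the first
    definite answer, so later operands are consulted only when needed."""
    for operand in operands:
        answer = operand(cp)
        if answer != UNKNOWN:
            return answer
    return UNKNOWN
```

Calling `lazy_alt([cjk_hull, remote_ref], 0x41)` is decided by the hull alone and leaves `calls` empty; only a character inside U+4E00-9FA5 causes `remote_ref` to be consulted.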
It should be noted that instead of an <alt> containing a <hull>, an <intersection> could also be used here. An <intersection> should be used for true intersections (i.e. when as a result of the intersection, characters drop out from both collections). A hull should be used when the collection is indeed not completely known. Another relation between intersections and hulls is that it would be possible to replace the hull element by an intersection with an element representing an unknown collection, e.g.
<hull><range>U+4E00-9FA5</range></hull>
could be replaced by
<intersection><range>U+4E00-9FA5</range><unknown/></intersection>
The same thing can be done with unions and kernels, i.e. a kernel can be replaced by a union of the kernel's contents and an unknown collection.
In order to allow efficient network access and caching, it is also important to realize in which cases a collection definition should be referenced, and in which cases it should be provided inline. Minor changes to existing collections and efficiency cuts should be provided inline to reduce network traffic. Well-known collections should be provided by reference to avoid duplication of data in caches and in memory.
Caching mechanisms not only help with efficient network access; they may also be used as one means to determine whether a collection is supposed to stay constant or whether it may change over time. This information is very important when designing collections referencing other collections. A designer of a collection wants to make sure that the referenced collection not only corresponds to his/her intentions at the time the new collection is designed, but also later on. In some cases, a constant collection (or a snapshot of an evolving collection) is desired, in other cases, an evolving collection is desired. This NOTE does not address the issues of metadata and trust regarding collections, because these issues are largely common to other Web-oriented notations, because they are largely orthogonal to the notation chosen, and because a stable registry for collections is already provided by ISO/IEC 10646 [ISO10646].
Operating systems, libraries, and applications contain more and more data about characters, and this data can be used to reduce network traffic to access collection information as follows.
If a <ref> element is met when querying a collection definition, the URI contained in the href attribute is compared with an internal list of collections known to the system. Such a collection is called a built-in collection. It may be built into the operating system or the collection implementation, or may be available in other parts of the configuration, e.g. through a kind of registry mechanism.
If the URI matches with a collection known to the system, the system checks the character in question using the necessary operations. Different collections may be stored and accessed in different ways on the same system; for example, one collection may be defined as "the characters for which glyphs are available in a certain system font", whereas another collection may be defined as "all the characters that have a certain property according to the built-in property table".
Different systems may have different collections available built-in. Because checking a built-in definition is much more efficient than checking a collection that has to be referenced over the network, an efficient way to define a collection is to define it based on several built-in collections, each of them available on a different system.
As an example, assume that we want to define a collection that can be seen as collection abcde from os1 plus some small additions, or as collection xyz from os2 minus a small set of characters. This can be written as:
<collection>
  <alt>
    <union>
      <ref href="http://www.os1.com/collections/abcde.charcol"/>
      <range> ... </range>
    </union>
    <difference>
      <ref href="http://www.os2.com/collections/xyz.charcol"/>
      <range> ... </range>
    </difference>
  </alt>
</collection>
On os1, the collection abcde will be built-in, and no network access will be needed, and similarly for os2.
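The lookup of built-in collections described in this section could look roughly like the following sketch. Everything named here is an assumption made for illustration: the registry `BUILTIN`, the function `resolve_ref`, the optional `fetch` callable, and the example URI are hypothetical, and Python's `unicodedata` module merely stands in for a system's built-in property table. For simplicity, a built-in collection is assumed to answer every question exactly.

```python
import unicodedata

YES, UNKNOWN, NO = "+", "?", "-"

# Hypothetical URI for a collection backed by the local property table.
UPPERCASE_URI = "http://www.example.org/collections/uppercase-letters.charcol"

# Internal list of collections known to the system, keyed by URI.
BUILTIN = {
    UPPERCASE_URI: lambda cp: unicodedata.category(chr(cp)) == "Lu",
}

def resolve_ref(href, cp, fetch=None):
    """Answer a <ref>: use a built-in collection if the URI matches,
    otherwise fall back to the network if a fetch function is supplied."""
    builtin = BUILTIN.get(href)
    if builtin is not None:
        return YES if builtin(cp) else NO      # no network access needed
    if fetch is None:
        return UNKNOWN
    try:
        remote = fetch(href)                   # hypothetical: returns a predicate
    except OSError:
        return UNKNOWN                         # resource not accessible
    return YES if remote(cp) else NO
```

Here `resolve_ref(UPPERCASE_URI, ord("A"))` is answered locally, whereas a URI that is neither built in nor fetchable yields "don't know", which an enclosing <alt> can then resolve from its other operand.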
Characters have a series of properties, such as whether they are letters, digits, or symbols, or, for letters, whether they are upper-case, or lower-case, or caseless. The Unicode Standard [Unicode] defines a large number of such properties.
Properties and collections in many cases can be seen as different views of the same data, because they are related in the following way:
Because many character properties are well known, the corresponding collections can be treated as well-known collections (see Section 3.1). Because many systems store character properties very efficiently internally, such collections can easily be treated as built-in collections (see Section 3.4). Character collections can then be used to easily define local changes to properties by adding or removing a few characters.
CSS2 [CSS2, http://www.w3.org/TR/REC-CSS2/fonts.html#dataqual] provides a 'unicode-range' property to reduce network traffic for font downloading. This property is defined to be inclusive, i.e. the range may contain characters for which the font in question does not have any glyphs available. This can be explicitly expressed by a <hull> in our syntax. When character collections are accessible via URIs, it would be possible to extend the syntax of the 'unicode-range' property to allow URIs as values.
Also, the 'unicode-range' property is currently not explicitly defined to be exclusive, i.e. it does not seem to be disallowed to use a glyph from the font in question even if the corresponding character is not contained in the value of the 'unicode-range' property. For finer control over font composition and glyph selection, it may be desirable to explicitly define that 'unicode-range' is exclusive, or to define another property for that purpose. Because the current range notation in CSS2 can become rather verbose, character collections as proposed in this document may help.
Text presentation and styling could often be specified conveniently by reference to a character collection. Examples include:
Character collections or references to character collections could appear either on the selector/template side of styling rules or as formatting property values. Using them on the selector/template side would define a general mechanism for all or a selected class of properties, i.e. it might be possible to say something like: Color all these characters red. Using character collections as formatting property values would be less general, limited to those cases where character collections are most useful.
In the context of HTML [HTML] form input, it is desirable to have an indication of the characters expected or allowed in a particular text field. This is useful for two reasons:
<enum>0123456789</enum>.
The two uses above should be clearly distinguished. Please note that while the first use fits the approach of testing individual characters very nicely, the second use does not directly fit this approach. The collection of characters covered by each input method/keyboard configuration available locally would have to be matched against the collection of characters expected in an input field. This may take considerably more time than checking only individual characters. However, there are various ways to reduce this effort. One way is to use very few characters 'typical' of the collections describing the available input method/keyboard configuration for sampling against the collection expected in an input field. Another is to limit the choice of collections allowed for describing the range of expected characters to some collections typically covered by input methods.
Limiting the choice of collection by explicitly listing the allowed URIs, while using the collection mechanism to clearly and unambiguously define the collection, may be a reasonable solution in other cases. It can limit implementation efforts while preserving forward compatibility.
XML Schema [Schema] is currently working on ways to define restrictions to XML attribute and element contents. Both Cobol 'pictures' and regular expressions rely on character classes, but these classes are not general and flexible enough for international purposes. Character collections may provide more flexibility; a way to bind a character collection to a letter of a picture or an escape sequence of a regular expression would have to be provided.
V.S. Umamaheswaran (Uma) and Bruce Paterson for information on ISO/IEC 10646 collections. W3C WG/IG members and W3C team members, in particular Bert Bos, Dan Connolly, John Cowan, Masayasu Ichikawa, Rick Jelliffe, Mike Ksar, Chris Lilley, Masahiro Sekiguchi, Tex Texin, Andrea Vine, and François Yergeau for discussions and suggestions for improvement. All these contributions are very gratefully acknowledged, but any shortcomings in this NOTE should only be blamed on the author.
This appendix exactly defines what each of the collection operators means. For operators that can take more than two operands, only the definition for two operands is given; this is sufficient because all these operators are associative. All the operations that allow more than two operands are also commutative. For each operator, a table and a textual description are given; these descriptions are identical. In the tables, the symbols "+", "-", and "?" stand for included, not included, and unknown.
 | A+ | A? | A- |
---|---|---|---|
<hull>A</hull> | ? | ? | - |
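For any character x and collection A: if x is not in A, then x is not in the collection described by <hull>A</hull>; otherwise, it is unknown whether x is in that collection.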
 | A+ | A? | A- |
---|---|---|---|
<kernel>A</kernel> | + | ? | ? |
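For any character x and collection A: if x is in A, then x is in the collection described by <kernel>A</kernel>; otherwise, it is unknown whether x is in that collection.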
<union> | A+ | A? | A- |
---|---|---|---|
B+ | + | + | + |
B? | + | ? | ? |
B- | + | ? | - |
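For any character x, collection A, and collection B: if x is in A or in B, then x is in the union; if x is not in A and not in B, then x is not in the union; otherwise, it is unknown whether x is in the union.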
<intersection> | A+ | A? | A- |
---|---|---|---|
B+ | + | ? | - |
B? | ? | ? | - |
B- | - | - | - |
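For any character x, collection A, and collection B: if x is in A and in B, then x is in the intersection; if x is not in A or not in B, then x is not in the intersection; otherwise, it is unknown whether x is in the intersection.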
<difference> | A+ | A? | A- |
---|---|---|---|
B+ | - | - | - |
B? | ? | ? | - |
B- | + | ? | - |
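For any character x, collection A, and collection B: if x is in A and not in B, then x is in the difference of A and B; if x is not in A or x is in B, then x is not in the difference; otherwise, it is unknown whether x is in the difference.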
<alt> | A+ | A? | A- |
---|---|---|---|
B+ | + | + | error |
B? | + | ? | - |
B- | error | - | - |
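For any character x, collection A, and collection B: if x is in one of A and B and not in the other, the result is an error; otherwise, if A or B gives a definite answer for x, that answer is the result; otherwise, it is unknown whether x is in the collection.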
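The binary tables above are small enough to be checked mechanically. The following sketch transcribes them into Python dictionaries and tests the algebraic properties that matter when an operator is applied to more than two operands; folding a binary operator over a list of operands gives a unique result only if the operator is associative. The helper names are assumptions made for this illustration, and associativity of <alt> is not checked because the tables do not define how an "error" operand combines further.

```python
from itertools import product

VALUES = "+?-"   # included, unknown, not included

def table(rows):
    """Build a lookup keyed by (answer for A, answer for B).
    rows[i][j] is the entry in row B = VALUES[i], column A = VALUES[j]."""
    return {(a, b): rows[i][j]
            for i, b in enumerate(VALUES)
            for j, a in enumerate(VALUES)}

UNION        = table(["+++", "+??", "+?-"])
INTERSECTION = table(["+?-", "??-", "---"])
DIFFERENCE   = table(["---", "??-", "+?-"])
ALT          = table([["+", "+", "error"],
                      ["+", "?", "-"],
                      ["error", "-", "-"]])

def is_commutative(t):
    return all(t[a, b] == t[b, a] for a, b in product(VALUES, repeat=2))

def is_associative(t):
    return all(t[t[a, b], c] == t[a, t[b, c]]
               for a, b, c in product(VALUES, repeat=3))
```

Under these definitions, `is_commutative` and `is_associative` both hold for `UNION` and `INTERSECTION`, `ALT` is commutative, and `DIFFERENCE` is not commutative, which is consistent with <difference> taking exactly two operands in the DTD.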
This is the DTD for the syntax defined in this NOTE. A namespace URI is currently not defined.
<!ENTITY % coll "(hull|kernel|union|intersection|difference|alt|ref|range|enum)">
<!ELEMENT collection (%coll;)>
<!ELEMENT hull (%coll;)>
<!ELEMENT kernel (%coll;)>
<!ELEMENT union (%coll;, (%coll;)+)>
<!ELEMENT intersection (%coll;, (%coll;)+)>
<!ELEMENT difference (%coll;, %coll;)>
<!ELEMENT alt (%coll;, (%coll;)+)>
<!ELEMENT ref EMPTY>
<!ATTLIST ref
    xmlns:xlink CDATA #FIXED "http://www.w3.org/XML/XLink/0.9"
    xlink:type (simple|extended|locator|arc) #FIXED "simple"
    xlink:role CDATA #FIXED "character collection"
    xlink:title CDATA #IMPLIED
    xlink:show (new|parsed|replace) #FIXED 'parsed'
    xlink:actuate (user|auto) #FIXED 'auto'
    xlink:href CDATA #REQUIRED>
<!ELEMENT range (#PCDATA)>
<!ELEMENT enum (#PCDATA)>
This is supposed to reflect the syntax in the CSS2 specification (http://www.w3.org/TR/REC-CSS2/fonts.html#dataqual) as exactly as possible. This is one of the low-level building blocks for collections, not captured by XML structure. To define the syntax, the notation used in the XML 1.0 Recommendation [XML1.0] is used, and some common definitions are taken from there. (Note: The CSS2 syntax allows \f in whitespace, but XML does not allow this, and therefore it is not allowed in <range>).
The content of <range>, a RangeEnumeration, is a comma-separated list of Uranges, with optional whitespace around each Urange.
RangeEnumeration ::= S? Urange (S? ',' S? Urange)* S?
Note: Is it possible in CSS to have a comma at the end?
An Urange is either given as an Ublock (allowing question marks) or with a starting and ending Uvalue. The Urange includes all the characters between the starting and the ending Uvalues, including the starting and ending Uvalues themselves.
Urange ::= ( 'U' '+' Ublock ) | ( 'U' '+' Uvalue '-' Uvalue )
An Ublock is a (potentially empty) sequence of hex characters followed by a (potentially empty) sequence of question marks, for a total of at least one and at most six symbols. The definition is a bit lengthy because the XML notation does not provide a repetition indicator.
Ublock ::= '?' '?'? '?'? '?'? '?'? '?'?
         | [0-9a-fA-F] ('?'? '?'? '?'? '?'? '?'?
           | [0-9a-fA-F] ('?'? '?'? '?'? '?'?
             | [0-9a-fA-F] ('?'? '?'? '?'?
               | [0-9a-fA-F] ('?'? '?'?
                 | [0-9a-fA-F] ('?'?
                   | [0-9a-fA-F] ) ) ) ) )
An Ublock is equivalent to an explicit range expressed with a hyphen by using the Ublock in both Uvalues and replacing the question marks in the first Uvalue by '0's, and the question marks in the second Uvalue by 'F's. As an example, U+123?? is the same as U+12300-123FF. An Uvalue is a sequence of at least one and at most six hexadecimal digits:
Uvalue ::= [0-9a-fA-F] [0-9a-fA-F]? [0-9a-fA-F]? [0-9a-fA-F]? [0-9a-fA-F]? [0-9a-fA-F]?
Please note that while this notation is originally taken from Unicode [Unicode], Unicode uses a different notation for cases with more than four hexadecimal digits, replacing the '+' with a '-' and requiring eight digits (e.g. U-00012300). The definition of 'unicode-range' mentions a max value of U+7FFFFFFF, but the syntax only allows values up to U+FFFFFF.
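As an illustration of the productions in this appendix, the following sketch parses the content of a <range> element into (first, last) pairs of code points, expanding Ublocks as described above. It is written under assumptions rather than as a normative parser: the function names are invented here, whitespace around each Urange is treated as optional, and empty items (such as one produced by a trailing comma) are rejected.

```python
import re

HEX = "[0-9a-fA-F]"
EXPLICIT = re.compile(r"U\+(%s{1,6})-(%s{1,6})" % (HEX, HEX))   # U+ Uvalue - Uvalue
BLOCK = re.compile(r"U\+(%s*\?*)" % HEX)                        # U+ Ublock
XML_WHITESPACE = " \t\r\n"                                      # the XML 1.0 S characters

def parse_urange(text):
    """Return (first, last) code points for a single Urange."""
    m = EXPLICIT.fullmatch(text)
    if m:
        return int(m.group(1), 16), int(m.group(2), 16)
    m = BLOCK.fullmatch(text)
    if m and 1 <= len(m.group(1)) <= 6:
        block = m.group(1)
        # Question marks become 0 for the start and F for the end of the range.
        return int(block.replace("?", "0"), 16), int(block.replace("?", "F"), 16)
    raise ValueError("not a valid Urange: %r" % text)

def parse_range_enumeration(text):
    """Return a list of (first, last) pairs for the content of <range>."""
    return [parse_urange(item.strip(XML_WHITESPACE)) for item in text.split(",")]
```

For example, `parse_urange("U+123??")` returns `(0x12300, 0x123FF)`, and a single value such as `U+200C` is treated as an Ublock without question marks, giving a range of one code point.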
Some more examples, from the CSS2 specification (adjusted for the fact that character collection specifications in general are more detailed, and therefore can be more verbose, than CSS2 unicode-range values, and updated for proposed registrations):
Changes since http://www.w3.org/TR/1999/NOTE-charcol-19991105: