3 A brief SGML tutorial

Contents

  1. HTML syntax
    1. Entities
    2. Elements
    3. Attributes
    4. HTML comments
    5. SGML features with limited support
  2. How to read the HTML DTD
    1. Block level and Inline elements
    2. DTD Comments
    3. Parameter entity definitions
    4. Element definitions
    5. Attribute definitions

This section of the document presents introductory information about SGML and its relationship to HTML.

HTML 4.0 is an SGML application conforming to International Standard ISO 8879 -- Standard Generalized Markup Language (SGML), defined in [ISO8879]. SGML provides a means for defining markup languages. The basic idea is to annotate the text of a document with markup tags that provide additional information about the document's structure and interpretation. A complete discussion of SGML parsing (e.g., the mapping of a sequence of characters to a sequence of tags and data) is left to the SGML standard. This section is only a summary.

An SGML application consists of several parts:

  1. The SGML declaration. The SGML declaration specifies which characters and delimiters may appear in the application.
  2. The document type definition (DTD). The DTD defines the syntax of markup constructs. The DTD may include additional definitions such as numeric and named character entities.
  3. A specification that describes the semantics to be ascribed to the markup. This specification also imposes syntax restrictions that cannot be expressed within the DTD.
  4. Document instances containing data (contents) and markup. Each instance contains a reference to the DTD to be used to interpret it.

The SGML declaration for HTML 4.0 and the DTD for HTML 4.0 are included in this reference manual, along with the entity sets referenced by the DTD.

3.1 HTML syntax

In this section, we discuss the syntax of HTML entities, elements, attributes, and comments.

3.1.1 Entities

Character entity references are numeric or symbolic names for characters that may be included in an HTML document. They are useful when your authoring tools make it difficult or impossible to enter a character that you seldom need. You will see character entities throughout this document; they begin with an ampersand ("&") and end with a semicolon (";"). Some examples include "&lt;" (the "<" sign), "&gt;" (the ">" sign), "&amp;" (the "&" sign), and "&quot;" (the double quotation mark).

We discuss HTML character entities in detail later in the section on the HTML document character set.
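
As an illustration only (Python is not part of this specification), the following sketch decodes a few character entity references with the standard library function html.unescape. That function follows present-day HTML parsing rules rather than the HTML 4.0 DTD, so treat it as an approximation. The sample string is made up for the example.

```python
import html

# A made-up sample containing named, decimal, and hexadecimal references.
encoded = "&lt;P&gt; AT&amp;T says &quot;hi&quot; &#229; &#x6C34;"

# Each reference is replaced by the character it names.
decoded = html.unescape(encoded)
print(decoded)
```

Here "&#229;" is a decimal reference and "&#x6C34;" a hexadecimal one; both are resolved to single characters, just as the named references are.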

3.1.2 Elements

An SGML application defines elements that represent structures or desired behavior. An element typically consists of three parts: a start tag, content, and an end tag.

An element's start tag is written <element-name>, where element-name is the name of the element. An element's end tag is written with a slash before the element name: </element-name>. For example,

<pre>The content of the PRE element is preformatted text.</pre>

HTML allows some elements to be marked up without end tags, for example P and LI. A few elements also allow the start tag to be omitted, for example HEAD and BODY. The HTML DTD indicates for each element whether the start tag and end tag are required.

Some HTML elements have no content. For example, the line break element BR has no content; its only role is to terminate a line of text. Such "empty" elements never have end tags. The definition of each element in the reference manual indicates whether it is empty (has no content) or, if it can have content, what is considered legal content.

Element names are always case-insensitive.

Elements are not tags. Some people refer incorrectly to elements as tags (e.g., "the P tag"). Remember that the element is one thing, and the tag (be it start or end tag) is another. For instance, the HEAD element is always present, even though both start and end HEAD tags may be missing in the markup.
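
The difference between tags and elements can be seen with a tokenizer that reports only what is written in the markup. The sketch below uses Python's html.parser module, which is an assumption of this example and not something the specification prescribes. The TagLogger class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record the tags and text exactly as they appear in the markup."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data))

logger = TagLogger()
# Two LI elements, written without end tags and with mixed-case names.
logger.feed("<UL><LI>one<li>two</UL>")
logger.close()
print(logger.events)
```

The log contains two LI start tags and no LI end tag, yet the document contains two complete LI elements. The tokenizer also reports every element name in lower case, reflecting the fact that element names are case-insensitive.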

3.1.3 Attributes

Elements may have associated properties, called attributes, to which authors assign values. Attribute/value pairs appear before the final ">" of an element's start tag. Any number of (legal) attribute/value pairs, separated by spaces, may appear in an element's start tag. They may appear in any order.

In this example, the align attribute is set for the H1 element:

<H1 align="center">
This is a centered heading thanks to the align attribute
</H1> 

By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa. You may also use numeric character entities to represent double quotes (&#34;) and single quotes (&#39;). For double quotes you can also use the named character entity &quot;.

In certain cases, it is possible in HTML to specify the value of an attribute without any quotation marks. The attribute value may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), and periods (ASCII decimal 46). We suggest using quotation marks even when it is possible to eliminate them.

Attribute names are always case-insensitive.

Attribute values are generally case-insensitive. The definition of each attribute in the reference manual indicates whether its value is case-insensitive.
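
The quoting rules above can be sketched as follows. This is a minimal illustration that assumes Python's html.parser module as the tokenizer; the names may_omit_quotes and AttributeCollector are hypothetical, and the character set in the pattern is the one listed in this section.

```python
import re
from html.parser import HTMLParser

# Characters this section allows in an unquoted attribute value:
# letters, digits, hyphens, and periods.
UNQUOTED_VALUE = re.compile(r"[A-Za-z0-9.-]+")

def may_omit_quotes(value):
    """Return True if the value could be written without quotation marks."""
    return UNQUOTED_VALUE.fullmatch(value) is not None

class AttributeCollector(HTMLParser):
    """Collect the (name, value) pairs found in start tags."""

    def __init__(self):
        super().__init__()
        self.attributes = []

    def handle_starttag(self, tag, attrs):
        self.attributes.extend(attrs)

collector = AttributeCollector()
# Double-quoted, single-quoted, and unquoted values in one start tag.
collector.feed("""<H1 ALIGN="center" title='say "hi"' class=intro>""")
collector.close()
print(collector.attributes)
```

The attribute name ALIGN is reported in lower case, while the values keep their case, and the double quotes inside the single-quoted value survive intact.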

3.1.4 HTML comments

HTML comments have the following syntax:

    
<!-- this is a comment -->
<!-- and so is this one,
    which occupies more than one line -->

White space is not permitted between the markup declaration open delimiter ("<!") and the comment open delimiter ("--"), but is permitted between the comment close delimiter ("--") and the markup declaration close delimiter (">"). A common error is to include a string of hyphens ("---") within a comment. Authors should avoid putting two or more adjacent hyphens inside comments.

Comments must not be rendered by user agents as part of a document.
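
An authoring tool could enforce the advice about adjacent hyphens mechanically. The following is a minimal sketch, assuming a hypothetical helper named make_comment; it is not taken from any existing tool.

```python
def make_comment(text):
    """Wrap text in an HTML comment, refusing text with adjacent hyphens."""
    if "--" in text:
        raise ValueError("comment text must not contain two adjacent hyphens")
    # The padding spaces keep a trailing hyphen in the text away from "-->".
    return "<!-- " + text + " -->"

print(make_comment("this is a comment"))
```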

3.1.5 SGML features with limited support

SGML applications conforming to [ISO8879] are expected to recognize a number of features that aren't widely supported by HTML user agents.

Marked Sections 

These play a role similar to the #ifdef construct in the C preprocessor.

<![INCLUDE[
 <!-- this will be included -->
]]>

<![IGNORE[
 <!-- this will be ignored -->
]]>

SGML also defines the use of marked sections for CDATA content, within which "<" is not treated as the start of a tag, e.g.,

<![CDATA[
 <an> example of <sgml> markup that is
 not <painful> to write with < and such.
]]>

We recommend that authors avoid using marked sections as most HTML user agents don't recognize them. The telltale sign is the appearance of "]]>", which is seen when the user agent mistakenly uses the first ">" character as the end of the tag starting with "<![".

Processing Instructions  

Processing instructions are a mechanism to capture platform-specific idioms. A processing instruction begins with <? and ends with >:

<?instruction >

For example:

<?>
<?style tt = font courier>
<?page break>
<?experiment> ... <?/experiment>

We recommend that authors avoid using processing instructions in HTML documents, as many user agents render them as part of the document's text.
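
For comparison, a tokenizer that recognizes processing instructions keeps them out of the rendered text. The sketch below assumes Python's html.parser module, which reports a processing instruction through its handle_pi method; the TextOnly class and the sample input are hypothetical.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Keep document text and processing instructions apart."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.instructions = []

    def handle_data(self, data):
        self.text.append(data)

    def handle_pi(self, data):
        # Called with everything between "<?" and the closing ">".
        self.instructions.append(data)

reader = TextOnly()
reader.feed("Chapter one<?page break>Chapter two")
reader.close()
rendered = "".join(reader.text)
print(rendered)
print(reader.instructions)
```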

Document Type Declaration Subset 

SGML provides a way to extend the DTD using the <!DOCTYPE> declaration, for instance to add a new set of entity definitions, or to set entities controlling the use of marked sections.

For example:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0//EN" [
<!ENTITY % AcmeCorpSymbols SYSTEM
   "-//Acme Corp//ENTITIES Corporate Symbols//EN">
%AcmeCorpSymbols;
]>

We recommend that authors avoid using the DTD subset as most HTML user agents do not support this feature.

User agents must not render SGML processing instructions (e.g., <?full volume>).

Shorthand Markup Prohibited 

Some constructs save typing, but add no expressive capability to the language. And while they technically introduce no ambiguity, they reduce the robustness of documents, especially when the language is enhanced to include new elements. Thus, while the SHORTTAG constructs of SGML related to attributes are widely used and implemented, those related to elements are not.

The element-related constructs are relatively straightforward to support, but they are not widely deployed. While documents that use them are conforming SGML documents, they will not work with the deployed HTML tools.

3.2 How to read the HTML DTD

This specification presents pertinent fragments of the DTD each time an element or attribute is defined. The DTD fragment gives concise information about an element and its attributes. We have chosen to include the DTD fragments in the specification rather than seek a more approachable, but longer and less precise means of describing an element. While almost all of the definitions include enough English text to make them comprehensible, for those who require definitive information, we complete this specification with a brief tutorial on reading the HTML DTD.

3.2.1 Block level and Inline elements

Certain HTML elements are said to be "block level" while others are "inline" (also known as "text level"). The distinction is founded on several notions:

Content model
Generally, block level elements may contain inline elements and other block level elements. Generally, inline elements may contain only data and other inline elements. Inherent in this structural distinction is the idea that block elements create "larger" structures than inline elements.
Formatting
By default, block level elements are formatted differently from inline elements. Generally, block level elements begin on new lines; inline elements do not. Block level elements end an unterminated paragraph element. This enables you to omit end tags for paragraphs in many cases.
Directionality
For technical reasons involving the [UNICODE] bidirectional text algorithm, block level and inline elements differ in how they inherit directionality information. For details, see the section on inheritance of text direction.

Style sheets provide the means to specify the rendering of arbitrary elements, including whether an element is rendered as block or inline. In some cases, such as an inline style for list elements, this may be appropriate, but generally speaking, authors are discouraged from overriding the conventional interpretation of HTML elements in this way.

The alteration of the traditional presentation idioms for block level and inline elements also has an impact on the bidirectional text algorithm. See the section on the effect of style sheets on bidirectionality for more information.

3.2.2 DTD Comments

In DTDs, comments may spread over one or more lines. In the DTD, comments are delimited by a pair of "--" marks, e.g.

<!ELEMENT PARAM - O EMPTY       -- named property value -->

Here, the comment "named property value" explains the use of the PARAM element. DTD comments for HTML do not have normative value.

3.2.3 Parameter entity definitions

The HTML DTD begins with a series of parameter entity definitions. A parameter entity definition (not to be confused with the character entities used in document text) defines a kind of macro that may be expanded elsewhere in the DTD. When the macro is referred to by name in the DTD, it is expanded into a string.

A parameter entity definition begins with the keyword <!ENTITY % followed by the entity name, the quoted string the entity expands to, and finally a closing >. The following example defines the string that the %font entity will expand to.

<!ENTITY % font "TT | I | B | U | S | BIG | SMALL">

The string the parameter entity expands to may contain other parameter entity names. These names are expanded recursively. In the following example, the %inline parameter entity is defined to include the %font, %phrase, %special and %formctrl parameter entities.

<!ENTITY % inline "#PCDATA | %font | %phrase | %special | %formctrl">

You will encounter two DTD entities frequently in the HTML DTD: %inline and %block. They are used when the content model includes inline and block level elements respectively.
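
Recursive expansion of parameter entities can be sketched in a few lines. The example below is an illustration under stated assumptions: the expand function is hypothetical, the %phrase definition is an invented, shortened stand-in, and %inline is simplified so that every name it refers to is defined in the example.

```python
import re

# Matches "%name" with an optional closing ";", as in the DTD fragments above.
REFERENCE = re.compile(r"%([A-Za-z][A-Za-z0-9.-]*);?")

def expand(text, entities, active=()):
    """Recursively replace parameter entity references in text."""
    def substitute(match):
        name = match.group(1)
        if name in active:
            raise ValueError("entity %" + name + " refers to itself")
        return expand(entities[name], entities, active + (name,))
    return REFERENCE.sub(substitute, text)

entities = {
    "font": "TT | I | B | U | S | BIG | SMALL",
    "phrase": "EM | STRONG",                  # hypothetical, shortened definition
    "inline": "#PCDATA | %font | %phrase",    # simplified from the example above
}

expanded = expand("%inline;", entities)
print(expanded)
```

The active tuple tracks the entities currently being expanded, so a definition that refers to itself is reported as an error instead of recursing without end.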

3.2.4 Element definitions

The bulk of the HTML DTD consists of the definitions of elements and their attributes. The <!ELEMENT keyword begins an element definition and the > character ends it. Between these are specified:

  1. The element's name.
  2. Whether the element's start and end tags are optional. Two hyphens that appear after the element name mean that the start and end tags are mandatory. One hyphen followed by the letter "O" (not zero) indicates that the end tag can be omitted. A pair of letter "O"s indicates that both the start and end tags can be omitted.
  3. The element's content, if any. The allowed content for an element is called its content model. Elements with no content are called empty elements. Empty elements are defined with the keyword "EMPTY".

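In this example:

    <!ELEMENT UL - - (LI)+>

the element being defined is UL, the two hyphens indicate that both the start tag and the end tag are required, and the content model requires at least one LI element.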

This example illustrates the definition of an empty element:

    <!ELEMENT IMG - O EMPTY>
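
Element definitions such as these can be picked apart mechanically. The following sketch assumes a hypothetical parse_element helper built on a regular expression. It handles only the simple form shown in this section (one element name, two omission flags, a content model, and optional DTD comments) and is not a general SGML parser.

```python
import re

# name, start-tag flag, end-tag flag, then everything up to the closing ">".
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\S+)\s+([-O])\s+([-O])\s+(.*?)\s*>", re.S)
# DTD comments are delimited by a pair of "--" marks.
COMMENT = re.compile(r"--.*?--", re.S)

def parse_element(declaration):
    """Split a simple element definition into its three parts."""
    match = ELEMENT_DECL.fullmatch(COMMENT.sub("", declaration).strip())
    if match is None:
        raise ValueError("not a simple element definition: " + declaration)
    name, start_flag, end_flag, content = match.groups()
    return {
        "name": name,
        "start_tag_optional": start_flag == "O",
        "end_tag_optional": end_flag == "O",
        "content": content,
    }

print(parse_element("<!ELEMENT UL - - (LI)+>"))
print(parse_element("<!ELEMENT IMG - O EMPTY>"))
```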

Content model definitions 

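The content model describes what may be contained by an element. Content definitions may include the names of allowed or forbidden elements, DTD entities (such as %inline), and document text (indicated by the SGML construct "#PCDATA").

The content model uses the following syntax to define what markup is allowed for the content of the element: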

( ... )
Specifies a group.
A | B
Either A or B must occur but not both.
A , B
Both A and B must occur in that order.
A & B
Both A and B must occur, but may do so in any order.
A?
A can occur zero or one times.
A*
A can occur zero or more times.
A+
A can occur one or more times.

Here are some examples from the HTML DTD:

<!ELEMENT SELECT - - (OPTION+)>

The SELECT element must contain one or more OPTION elements.

<!ELEMENT DL - - (DT|DD)+>

The DL element must contain one or more DT or DD elements in any order.

<!ELEMENT OPTION - O (#PCDATA)*>

The OPTION element may only contain text and entities, such as &amp;.
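
One way to get a feel for content models is to translate them into regular expressions over the sequence of child element names. The sketch below is a simplification under stated assumptions: the functions model_to_regex and allows are hypothetical, the "&" connector is not handled, parameter entities must already be expanded, and tag omission and exclusions are ignored.

```python
import re

# Element names (or #PCDATA) and the content model operators.
TOKEN = re.compile(r"#?[A-Za-z][A-Za-z0-9]*|[()|,?*+]")

def model_to_regex(model):
    """Translate a simple content model into a compiled regular expression."""
    if "&" in model:
        raise NotImplementedError('the "&" connector is not handled in this sketch')
    pieces = []
    for token in TOKEN.findall(model):
        if token == ",":
            continue                      # a sequence is plain concatenation
        elif token == "(":
            pieces.append("(?:")
        elif token in ")|?*+":
            pieces.append(token)          # same meaning in both notations
        else:
            # Each name is followed by one space in the subject string.
            pieces.append("(?:" + re.escape(token.upper()) + " )")
    return re.compile("".join(pieces))

def allows(model, children):
    """Return True if the list of child names satisfies the content model."""
    subject = "".join(name.upper() + " " for name in children)
    return model_to_regex(model).fullmatch(subject) is not None

print(allows("(DT|DD)+", ["DT", "DD", "DD"]))
print(allows("(OPTION+)", []))
```

The first call reports that a DT followed by two DD elements satisfies the DL content model; the second reports that an empty SELECT does not satisfy (OPTION+).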

A few HTML elements use an additional SGML feature to exclude certain elements from the content model. Excluded elements are preceded by a hyphen. Explicit exclusions override inclusions.

In this example, the -(A) signifies that the element A cannot be included in another A element (i.e., anchors may not be nested).

   <!ELEMENT A - - (%inline)* -(A)>

Note that the A element is part of the DTD parameter entity %inline, but is excluded explicitly because of -(A).

Similarly, the following element definition for FORM prohibits nested forms:

   <!ELEMENT FORM - - %block -(FORM)>
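
Because an exclusion applies to all descendants and not just to direct children, checking it means looking at every element that is currently open. A minimal sketch, assuming a hypothetical is_excluded helper and a table holding only the two exclusions shown above:

```python
# Element name -> names excluded anywhere inside it (from the two examples).
EXCLUSIONS = {"A": {"A"}, "FORM": {"FORM"}}

def is_excluded(name, open_elements):
    """Return True if any currently open element excludes `name`."""
    name = name.upper()
    return any(
        name in EXCLUSIONS.get(ancestor.upper(), ())
        for ancestor in open_elements
    )

# An A start tag inside an EM that is itself inside an A is not allowed.
print(is_excluded("A", ["BODY", "P", "A", "EM"]))
```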

3.2.5 Attribute definitions

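The <!ATTLIST keyword begins the definition of attributes that an element may take. It is followed by the name of the element in question, a list of attribute definitions, and a closing >. An attribute definition is a triplet that defines:

  1. The name of the attribute.
  2. The type of the attribute's value, or an explicit set of possible values.
  3. The attribute's default: either an explicit default value or a keyword such as "#IMPLIED" (the attribute is optional and the DTD declares no default) or "#REQUIRED" (the author must always supply a value).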

In this example, the name attribute is defined for the MAP element. The attribute is optional for this element.

<!ATTLIST MAP
  name        CDATA     #IMPLIED
  >

The type of values permitted for the attribute is given as CDATA, an SGML data type. CDATA is text that may include character entities.

For more information about "CDATA", "NAME", "ID", and other data types, please consult the section on HTML data types.

The following examples illustrate possible attribute definitions:

rowspan     NUMBER     1         -- number of rows spanned by cell --
http-equiv  NAME       #IMPLIED  -- HTTP response header name  --
id          ID         #IMPLIED  -- document-wide unique id -- 
valign      (top|middle|bottom|baseline) #IMPLIED

The rowspan attribute requires values of type NUMBER. The default value is given explicitly as "1". The optional http-equiv attribute requires values of type NAME. The optional id attribute requires values of type ID. The optional valign attribute is constrained to take values from the set {top, middle, bottom, baseline}.
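
The four definitions above can be modeled as a small table of triplets. The sketch below is only an illustration: the effective_value function is hypothetical, None stands for "#IMPLIED", only the NUMBER type and enumerated sets are checked, and treating enumerated values as case-insensitive is an assumption of the example.

```python
# Attribute name -> (declared type or allowed values, default).
# None stands for #IMPLIED, i.e. no default is declared in the DTD.
DEFINITIONS = {
    "rowspan": ("NUMBER", "1"),
    "http-equiv": ("NAME", None),
    "id": ("ID", None),
    "valign": (("top", "middle", "bottom", "baseline"), None),
}

def effective_value(name, supplied=None):
    """Return the value an attribute takes, applying defaults and basic checks."""
    declared, default = DEFINITIONS[name]
    if supplied is None:
        return default
    if isinstance(declared, tuple):
        value = supplied.lower()  # assumption: enumerated values ignore case
        if value not in declared:
            raise ValueError(name + " must be one of " + ", ".join(declared))
        return value
    if declared == "NUMBER" and not (supplied.isascii() and supplied.isdigit()):
        raise ValueError(name + " must be a number")
    return supplied

print(effective_value("rowspan"))
print(effective_value("valign", "TOP"))
```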

DTD entities in attribute definitions 

Attribute definitions may also include DTD entities.

In this example, we see that the attribute definition list for the LINK element begins with the %attrs parameter entity.

<!ELEMENT LINK - O EMPTY -- a media-independent link -->
<!ATTLIST LINK
  %attrs;                          -- %coreattrs, %i18n, %events --
  charset     CDATA      #IMPLIED  -- char encoding of linked resource --
  href        %URL;      #IMPLIED  -- URL for linked resource --
  rel         CDATA      #IMPLIED  -- forward link types --
  rev         CDATA      #IMPLIED  -- reverse link types --
  type    %ContentType;  #IMPLIED  -- advisory Internet content type --
  media       CDATA      #IMPLIED  -- for rendering on these media --
  target      CDATA      #IMPLIED  -- where to render linked resource --
  >

Start tag: required, End tag: forbidden

The %attrs parameter entity is defined as follows:

<!ENTITY % attrs "%coreattrs; %i18n; %events;">

The %coreattrs parameter entity in the %attrs definition expands as follows:

<!ENTITY % coreattrs
 "id          ID         #IMPLIED  -- document-wide unique id --
  class       CDATA      #IMPLIED  -- space separated list of classes --
  style       CDATA      #IMPLIED  -- associated style info --
  title       CDATA      #IMPLIED  -- advisory title/amplification --"
  >

The %attrs parameter entity has been defined for convenience since these attributes are defined for most HTML elements.

Similarly, the DTD defines the %URL parameter entity as expanding into the string CDATA.

<!ENTITY % URL "CDATA"
    -- a Uniform Resource Locator,
       see [RFC1808] and [RFC1738]
    -->

As this example illustrates, the parameter entity %URL provides readers of the DTD with more information as to the type of data expected for an attribute. Similar entities have been defined for %Color, %ContentType, %Length, %Pixels, etc.

Boolean attributes 

Some attributes play the role of boolean variables (e.g., selected). Their appearance in the start tag of an element implies that the value of the attribute is "true". Their absence implies a value of "false".

Boolean attributes may legally take a single value: the name of the attribute itself (e.g., selected="selected").

This example defines the selected attribute to be a boolean attribute.

selected     (selected)  #IMPLIED  -- option is pre-selected --

The attribute is set to "true" by appearing in the element's start tag:

<OPTION selected="selected">
...contents...
</OPTION>

Minimized boolean attributes

In HTML, boolean attributes may appear in "minimized form" -- the attribute's value appears alone in the element's start tag. Thus:

<OPTION selected>

instead of

<OPTION selected="selected">

Authors should be aware that many user agents only recognize the minimized form and not the full form.
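
A tool that reads HTML can avoid this problem by testing for the presence of a boolean attribute and ignoring how its value is written. The sketch below assumes Python's html.parser module, which reports a minimized attribute with the value None; the BooleanReader class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class BooleanReader(HTMLParser):
    """Record, for each OPTION start tag, whether selected is set."""

    def __init__(self):
        super().__init__()
        self.selected = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            # Presence alone means "true", whatever form the value takes.
            self.selected.append("selected" in dict(attrs))

reader = BooleanReader()
reader.feed('<OPTION selected>a<OPTION selected="selected">b<OPTION>c')
reader.close()
print(reader.selected)
```

The minimized and full forms are both read as "true", and the OPTION without the attribute as "false".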