Speech Synthesis Markup Language Specification

W3C Working Draft 5 April 2002

This version:: http://www.w3.org/TR/2002/WD-speech-synthesis-20020405/
Latest version:: http://www.w3.org/TR/speech-synthesis
Previous version:: http://www.w3.org/TR/2001/WD-speech-synthesis-20010103/
Editors:: Daniel C. Burnett, Nuance; Mark R. Walker, Intel; Andrew Hunt, SpeechWorks International

Abstract

The Voice Browser Working Group has sought to develop standards to enable access to the web using spoken interaction. The Speech Synthesis Markup Language Specification is part of this set of new markup specifications for voice browsers, and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate and etc. across different synthesis-capable platforms.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.

This is a working draft of the "Speech Synthesis Markup Language Specification". You are encouraged to subscribe to the public discussion list <www-voice@w3.org> and to mail in your comments as soon as possible. To subscribe, send an email to <mailto:www-voice-request@w3.org> with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe). A public archive is available online.

This specification describes markup for generating synthetic speech via a speech synthesizer, and forms part of the proposals for the W3C Speech Interface Framework. This document has been produced as part of the W3C Voice Browser Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Voice Browser Working Group (W3C Members only).

The previous draft of this specification was published as a Last Call Working Draft in January of 2001. Over the past year the Voice Browser Working Group has not focused its attention on this specification, but now it is ready to make more active and timely progress on the specification. The Working Group has meanwhile made progress on other specifications, such as the Speech Recognition Grammar Format and the VoiceXML 2.0 specification. These are related to the SSML specification, and in some areas depend on this specification.

In order to coordinate the advancements of these specification along the W3C track to Recommendation, the Working Group felt that it was necessary to update the SSML specification with changes necessary to support the VoiceXML specification. Due to changes in the state of the art of the technology of speech synthesis during this timeframe, the Working Group felt it would be appropriate to make a new release of the specification, with a small number of changes, as a Working Draft. The expectation and goal are that it will be possible to release another draft after that one as a Last Call Working Draft because the Working Group will have focused sufficient attention on the specification for it to be technically sound in today's world.

Following the publication of the previous draft of this specification, the group received a number of public comments. Those comments have not been addressed in this current Working Draft but will be addressed in the timeframe of the Last Call Working Draft. Commenters who have sent their comments to the public mailing list need not resubmit their comments in order for them to be addressed at that time.

To help the Voice Browser working group build an implementation report, (as part of advancing the document on the W3C Recommendation Track), you are encouraged to implement this specification and to indicate to W3C which features have been implemented, and any problems that arose.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite W3C Working Drafts as other than "work in progress". A list of current public W3C Working Drafts can be found at http://www.w3.org/TR/.

1. Introduction
2. Elements and Attributes
- 2.1 Document Structure, Text Processing and Pronunciation
  - 2.1.1 "speak" Root Element
  - 2.1.2 "xml:lang" Language Attribute
  - 2.1.3 "paragraph" and "sentence"
  - 2.1.4 "say-as" Element
  - 2.1.5 "phoneme" Element
  - 2.1.6 "sub" Element
- 2.2 Prosody and Style
  - 2.2.1 "voice" Element
  - 2.2.2 "emphasis" Element
  - 2.2.3 "break" Element
  - 2.2.4 "prosody" Element
- 2.3 Other Elements
  - 2.3.1 "audio" Element
  - 2.3.2 "mark" Element
3.SSML Documents
- 3.1 Document Form
- 3.2 Integration With Other Markup Languages
  - 3.2.1 SMIL
  - 3.2.2 ACSS
- 3.3 Fetching SSML Documents
4. Conformance
5. References
6. Acknowledgements
Appendix A. Example SSML (Non-Normative)
Appendix B. DTD for the Speech Synthesis Markup Language (Normative)
Appendix C. Schema for the Speech Synthesis Markup Language (Normative)
Appendix D. Audio File Formats (Normative)
Appendix E. MIME Types and File Suffix (Non-Normative)
Appendix F. Features under Consideration for Future Versions (Non-Normative)
Appendix G. Internationalization (Normative)

1. Introduction

The W3C Standard is known as the Speech Synthesis Markup Language specification and is based upon the JSGF and/or JSML specifications, which are owned by Sun Microsystems, Inc., California, U.S.A. The JSML specification can be found at [JSML].

The Speech Synthesis Markup Language specification is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C. It is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in web and other applications. The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch, rate and etc. across different synthesis-capable platforms.

1.1 Vocabulary and Design Concepts

There is some variance in the use of technical vocabulary in the speech synthesis community. The following definitions establish a common understanding for this document.

Voice Browser	A device which interprets a (voice) markup language and is capable of generating voice output and/or interpreting voice input, and possibly other input/output modalities.
Speech Synthesis	The process of automatic generation of speech output from data input which may include plain text, formatted text or binary objects.
Text-To-Speech	The process of automatic generation of speech output from text or annotated text input.

The design and standardization process has followed from the Speech Synthesis Markup Requirements for Voice Markup Languages published December 23, 1999 by the W3C Voice Browser Working Group.

The following items were the key design criteria.

Consistency: provide predictable control of voice output across platforms and across speech synthesis implementations.
Interoperability: support use along with other W3C specifications including (but not limited to) the Dialog Markup Language, Audio Cascading Style Sheets and SMIL.
Generality: support speech output for a wide range of applications with varied speech content.
Internationalization: Enable speech output in a large number of languages within or across documents.
Generation and Readability: Support automatic generation and hand authoring of documents. The documents should be human-readable.
Implementable: The specification should be implementable with existing, generally available technology and the number of optional features should be minimal.

1.2 Speech Synthesis Processes

A Text-To-Speech (TTS) system that supports the Speech Synthesis Markup Language will be responsible for rendering a document as spoken output and for using the information contained in the markup to render the document as intended by the author.

Document creation: A text document provided as input to the TTS system may be produced automatically, by human authoring, or through a combination of these forms. The Speech Synthesis markup language defines the form of the document.

Document processing: The following are the six major processing steps undertaken by a TTS system to convert marked-up text input into automatically generated voice output. The markup language is designed to be sufficiently rich so as to allow control over each of the steps described below so that the document author (human or machine) can control the final voice output.

XML Parse: An XML parser is used to extract the document tree and content from the incoming text document. The structure, tags and attributes obtained in this step influence each of the following steps.
Structure analysis: The structure of a document influences the way in which a document should be read. For example, there are common speaking patterns associated with paragraphs and sentences.

- Markup support: The "paragraph" and "sentence" elements defined in the TTS markup language explicitly indicate document structures that affect the speech output.

- Non-markup behavior: In documents and parts of documents where these elements are not used, the TTS system is responsible for inferring the structure by automated analysis of the text, often using punctuation and other language-specific data.
Text normalization: All written languages have special constructs that require a conversion of the written form (orthographic form) into the spoken form. Text normalization is an automated process of the TTS system that performs this conversion. For example, for English, when "$200" appears in a document it may be spoken as "two hundred dollars". Similarly, "1/2" may be spoken as "half", "January second", "February first", "one of two" and so on.

- Markup support: The "say-as" element can be used in the input document to explicitly indicate the presence and type of these constructs and to resolve ambiguities. The set of constructs that can be marked includes dates, times, numbers, acronyms, current amounts and more. The set covers many of the common constructs that require special treatment across a wide number of languages but is not and cannot be a complete set.

- Non-markup behavior: For text content that is not marked with the "say-as" element the TTS system is expected to make a reasonable effort to automatically locate and convert these constructs to a speakable form. Because of inherent ambiguities (such as the "1/2" example above) and because of the wide range of possible constructs in any language, this process may introduce errors in the speech output and may cause different systems to render the same document differently.
Text-to-phoneme conversion: Once the system has determined the set of words to be spoken it must convert those words to a string of phonemes. A phoneme is the basic unit of sound in a language. Each language (and sometimes each national or dialect variant of a language) has a specific phoneme set: e.g., most US English dialects have around 45 phonemes. In many languages this conversion is ambiguous since the same written word may have many spoken forms. For example, in English, "read" may be spoken as "reed" (I will read the book) or "red" (I have read the book). Another issue is the handling of words with non-standard spellings or pronunciations. For example, an English TTS system will often have trouble determining how to speak some non-English-origin names; e.g. "Tlalpachicatl" which has a Mexican/Aztec origin.

- Markup support: The "phoneme" element allows a phonemic sequence to be provided for any word or word sequence. This provides the content creator with explicit control over pronunciations. The "say-as" element may also be used to indicate that text is a proper name that may allow a TTS system to apply special rules to determine a pronunciation.

- Non-markup behavior: In the absence of a "phoneme" element the TTS system must apply automated capabilities to determine pronunciations. This is typically achieved by looking up words in a pronunciation dictionary and applying rules to determine other pronunciations. Most TTS systems are expert at performing text-to-phoneme conversions so most words of most documents can be handled automatically.
Prosody analysis: Prosody is the set of features of speech output that includes the pitch (also called intonation or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words and many other features. Producing human-like prosody is important for making speech sound natural and for correctly conveying the meaning of spoken language.
- Markup support: The "emphasis" element, "break" element and "prosody" element may all be used by document creators to guide the TTS system in generating appropriate prosodic features in the speech output.

- Non-markup behavior: In the absence of these elements, TTS systems are expert (but not perfect) in automatically generating suitable prosody. This is achieved through analysis of the document structure, sentence syntax, and other information that can be inferred from the text input.
Waveform production: The phonemes and prosodic information are used by the TTS system in the production of the audio waveform. There are many approaches to this processing step so there may be considerable platform-specific variation.

- Markup support: The TTS markup does not provide explicit controls over the generation of waveforms. The "voice" element allows the document creator to request a particular voice or specific voice qualities (e.g. a young male voice). The "audio" element allows for insertion of recorded audio data into the output stream.

1.3 Document Generation, Applications and Contexts

There are many classes of document creator that will produce marked-up documents to be spoken by a TTS system. Not all document creators (including human and machine) have access to information that can be used in all of the elements or in each of the processing steps described in the previous section. The following are some of the common cases.

The document creator has no access to information to mark up the text. All processing steps in the TTS system must be performed fully automatically on raw text. The document requires only the containing "speak" element to indicate the content is to be spoken.
When marked text is generated programmatically the creator may have specific knowledge of the structure and/or special text constructs in some or all of the document. For example, an email reader can mark the location of the time and date of receipt of email. Such applications may use elements that affect structure, text normalization, prosody and possibly text-to-phoneme conversion.
Some document creators make considerable effort to mark as many details of the document to ensure consistent speech quality across platforms and to more precisely specify output qualities. In these cases, the markup may use any or all of the available elements to tightly control the speech output. For example, prompts generated in telephony and voice browser applications may be fine-tuned to maximize the effectiveness of the overall system.
The most advanced document creators may skip the higher-level markup (structure, text normalization, text-to-phoneme conversion, and prosody analysis) and produce low-level TTS markup for segments of documents or for entire documents. This typically requires tools to generate sequences of phonemes, plus pitch and timing information. For instance, tools that do "copy synthesis" or "prosody transplant" try to emulate human speech by copying properties from recordings.

The following are important instances of architectures or designs from which marked-up TTS documents will be generated. The language design is intended to facilitate each of these approaches.

Dialog language: It is a requirement that it should be possible to include documents marked with the speech synthesis markup language into the dialog description document to be produced by the Voice Browser Working Group.
Interoperability with Aural CSS : Any HTML processor that is Aural CSS-enabled can produce Speech Synthesis Markup Language. ACSS is covered in Section 19 of the Cascading Style Sheets, level 2 (CSS2) Specification (12-May-1998). This usage of speech synthesis facilitates improved accessibility to existing HTML and XHTML content.
Application-specific style-sheet processing: As mentioned above, there are classes of application that have knowledge of text content to be spoken and this can be incorporated into the speech synthesis markup to enhance rendering of the document. In many cases, it is expected that the application will use style-sheets to perform transformations of existing XML documents to speech synthesis markup. This is equivalent to the use of ACSS with HTML and once again the speech synthesis markup language is the "final form" representation to be passed to the speech synthesis engine. In this context, SSML may be viewed as a superset of ACSS capabilities, excepting spatial audio.

1.4 Platform-Dependent Output Behavior of Speech Synthesis Content

The Speech Synthesis Markup Language Specification provides a standard way to specify gross properties of synthetic speech production such as pronunciation, volume, pitch, rate and etc. Exact specification of synthetic speech output behavior across disparate platforms, however, is beyond the scope of this document.

1.5 Terminology

Requirements terms

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. However, for readability, these words do not appear in all uppercase letters in this specification.

2. Elements and Attributes

The following elements are defined in this draft specification.

2.1 Document Structure, Text Processing and Pronunciation
- 2.1.1 "speak" Root Element
- 2.1.2 "xml:lang" Language Attribute
- 2.1.3 "paragraph" and "sentence"
- 2.1.4 "say-as" Element
- 2.1.5 "phoneme" Element
- 2.1.6 "sub" Element
2.2 Prosody and Style
- 2.2.1 "voice" Element
- 2.2.2 "emphasis" Element
- 2.2.3 "break" Element
- 2.2.4 "prosody" Element
2.3 Other Elements
- 2.3.1 "audio" Element
- 2.3.2 "mark" Element

2.1 Document Structure, Text Processing and Pronunciation

2.1.1 "speak" Root Element

The Speech Synthesis Markup Language is an XML application. The root element is speak. xml:lang is a defined attribute specifying the language of the root document. The version attribute is a required attribute that indicates the version of the specification to be used for the document. The version number for this specification is 1.0.

<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
  ... the body ...
</speak>

2.1.2 "xml:lang" Attribute: Language

Following the XML convention, languages are indicated by an xml:lang attribute on the enclosing element with the value following [RFC3066] to define language codes. A language is specified by an RFC 3066 identifier following the convention of XML 1.0. [Note: XML 1.0 adopted RFC3066 through Errata as of 2001-02-22].

Language information is inherited down the document hierarchy, i.e. it has to be given only once if the whole document is in one language, and language information nests, i.e. inner attributes overwrite outer attributes.

xml:lang is a defined attribute for speak , paragraph , sentence , p , and s elements.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  <paragraph>I don't speak Japanese.</paragraph>
  <paragraph xml:lang="ja">Nihongo-ga wakarimasen.</paragraph>
</speak>

General Notes (Normative)

The speech output platform largely determines behavior in the case that a document requires speech output in a language not supported by the speech output platform. In any case, if a value for xml:lang specifying an unsupported language is encountered, a conforming SSML processor should attempt to continue processing and should also notify the hosting environment in that case.
There may be variation across conformant platforms in the implementation of xml:lang for different markup elements (e.g. paragraph and sentence elements). A document author should beware that intra-sentential language changes may not be supported on all platforms.
A language change often necessitates a change in the voice. Where the platform does not have the same voice in both the enclosing and enclosed languages it should select a new voice with the inherited voice attributes. Any change in voice will reset the prosodic attributes to the default values for the new voice of the enclosed text. Where the "xml:lang" value is the same as the inherited value there is no need for any changes in the voice or prosody.
All elements should process their contents specific to the enclosing language. For instance, the phoneme, emphasis and break element should each be rendered in a manner that is appropriate to the current language.

2.1.3 "paragraph" and "sentence": Text Structure Elements

A paragraph element represents the paragraph structure in text. A sentence element represents the sentence structure in text. A paragraph contains zero or more sentences.

xml:lang is a defined attribute on both paragraph and sentence elements.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
<paragraph>
  <sentence>This is the first sentence of the paragraph.</sentence>
  <sentence>Here's another sentence.</sentence>
</paragraph>
</speak>

General Notes (Normative)

For brevity, the markup also supports <p> and <s> as exact equivalents of <paragraph> and <sentence>. (Note: XML requires that the opening and closing elements be identical so <p> text </paragraph> is not legal.). Also note that <s> means "strike-out" in HTML 4.0 and earlier, and in XHTML-1.0-Transitional, but not in XHTML-1.0-Strict.
The use of paragraph and sentence elements is optional. Where text occurs without an enclosing paragraph or sentence elements the speech output system should attempt to determine the structure using language-specific knowledge of the format of plain text.

2.1.4 "say-as" Element

The say-as element indicates the type of text construct contained within the element. This information is used to help specify the pronunciation of the contained text. Defining a comprehensive set of text format types is difficult because of the variety of languages that must be considered and because of the innate flexibility of written languages. The say-as element has been specified with a reasonable set of format types. Text substitution may be utilized for unsupported constructs.

The type attribute is a required attribute that indicates the contained text construct. The format is a text type optionally followed by a colon and a format.

The base set of type values, divided according to broad functionality, is as follows:

Pronunciation Types

acronym: The contained text is an acronym.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  <say-as type="acronym"> DEC </say-as>
</speak>

Output: "DEC."

spell-out: The characters in the contained text string are pronounced as individual characters.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  <say-as type="spell-out"> USA </say-as>
</speak>

Output: "U, S, A".

Numerical Types

number: contained text contains integers, fractions, floating points, Roman numerals or some other textual format that can be interpreted and spoken as a number in the current language. Format values for numbers are: ordinal, where the contained text should be interpreted as an ordinal. The content may be a digit sequence or some other textual format that can be interpreted and spoken as an ordinal in the current language; cardinal, where the contained text should be interpreted as a cardinal. The content may be a digit sequence or some other textual format that can be interpreted and spoken as a cardinal in the current language; and digits, where the contained text is to be read as a digit sequence, rather than as a number.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  Rocky <say-as type="number"> XIII </say-as>
</speak>

Output: "Rocky thirteen."

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  Pope John the <say-as type="number:ordinal"> VI </say-as>
  <!-- Pope John the sixth -->
  Deliver to <say-as type="number:digits"> 123 </say-as>
  Brookwood.
</speak>

Output: "Deliver to one two three Brookwood."

Time, Date and Measure Types

date: contained text is a date. Format values for date input content are:

"dmy", "mdy", "ymd" (day, month , year), (month, day, year), (year, month, day)

"ym", "my", "md" (year, month), (month, year), (month, day)

"y", "m", "d" (year), (month), (day).
time: contained text is a time of day. Format values for time input content are:

"hms", "hm", "h" (hours, minutes, seconds), (hours, minutes), (hours).
duration: contained text is a temporal duration. Format values for duration input content are:

"hms", "hm", "ms", "h", "m", "s" (hours, minutes, seconds), (hours, minutes), (minutes, seconds), (hours), (minutes), (seconds).
currency: contained text is a currency amount.
measure: contained text is a measurement.

telephone: contained text is a telephone number.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  <say-as type="date:ymd"> 2000/1/20 </say-as>
  <!-- January 20th two thousand -->
  Proposals are due in <say-as type="date:my"> 5/2001 </say-as>
  <!-- Proposals are due in May two thousand and one -->
  The total is <say-as type="currency">$20.45</say-as>
  <!-- The total is twenty dollars and forty-five cents -->
</speak>

Address, Name, Net Types

name: contained text is a proper name of a person, company etc.
net: contained text is an internet identifier. Format values for internet identifier input content are: "email", "uri".
address: contained text is a postal address.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  <say-as type="net:email"> road.runner@acme.com </say-as>
</speak>

General Notes (Normative)

When specified, format values of say-as attributes are to be interpreted by the conforming SSML processor as hints provided by the mark-up document author to aide text normalization and pronunciation.
In all cases, the text enclosed by any say-as element is intended to be a standard, orthographic form of the language currently in context. An SSML processor should be able to support the common, orthographic forms of the specified language. In the case of dates for example, <say-as type="date"> 2000/1/20 </say-as> may be read as "January twentieth two thousand" or as "the twentieth of January two thousand" and so on.
When character(s) designating currency units are included in the enclosed text, the SSML processor should include the units in the rendered output.

Usage Notes (Informative)

When multi-field quantities are specified in the format value attribute ("dmy", "my", etc.), the processor may assume that the fields are separated by a single, non-alphanumeric character. The resulting orthographic form may be language-specific, e.g. using a slash to delimit year, month and day in English.

2.1.5 "phoneme" Element

The phoneme element provides a phonetic pronunciation for the contained text. The "phoneme" element may be empty. However, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.

The ph attribute is a required attribute that specifies the phoneme string.

The alphabet attribute is an optional attribute that specifies the phonetic alphabet. The default value of alphabet for a conforming SSML processor is "ipa", corresponding to characters composing the International Phonetic Alphabet. In addition to an exhaustive set of vowel and consonant symbols, IPA supports a syllable delimiter, numerous diacritics, stress symbols, lexical tone symbols, intonational markers and more.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  <phoneme alphabet="ipa" ph="t&#x252;m&#x251;to&#x28A;">
    tomato
  </phoneme>
  <!-- This is an example of IPA using character entities -->
</speak>

If a value for alphabet specifying an unknown phonetic alphabet is encountered, a conforming SSML processor should continue processing and should notify the hosting environment in that case.

Usage Notes (Informative)

Characters composing many of the International Phonetic Alphabet (IPA) phonemes are known to display improperly on most platforms. Additional IPA limitations include the fact that IPA is difficult to understand even when using ASCII equivalents, IPA is missing symbols required for many of the world's languages, and IPA editors and fonts containing IPA characters are not widely available. The Voice Browser Working Group will address the issue of specifying a more robust phoneme alphabet at a later date.

Entity definitions may be used for repeated pronunciations. For example:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
            "http://www.w3.org/TR/speech-synthesis/synthesis.dtd"
[
  <!ENTITY uk_tomato "t&#x252;m&#x251;to&#x28A;">
]>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  ... you say <phoneme ph="&uk_tomato;"> tomato </phoneme>
  I say... 
</speak>

2.1.6 "sub" Element

The sub element is employed to indicate that the specified text replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. The required alias attribute specifies the string to be substituted for the enclosed string.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  <sub alias="World Wide Web Consortium"> W3C </sub>
  <!-- World Wide Web Consortium -->
</speak>

2.2 Prosody and Style

2.2.1 "voice" Element

The "voice" element is a production element that requests a change in speaking voice. Attributes are:

xml:lang: optional language specification attribute.
gender: optional attribute indicating the preferred gender of the voice to speak the contained text. Enumerated values are: "male", "female", "neutral".
age: optional attribute indicating the preferred age of the voice to speak the contained text. Acceptable values are of type (integer)
variant: optional attribute indicating a preferred variant of the other voice characteristics to speak the contained text. (e.g. the second or next male child voice). Valid values of variant are integers.
name: optional attribute indicating a platform-specific voice name to speak the contained text. The value may be a space-separated list of names ordered from top preference down.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">   
  <voice gender="female">Mary had a little lamb,</voice>
  <!-- now request a different female child's voice -->
  <voice gender="female" variant="2">
    It's fleece was white as snow.
  </voice>
  <!-- platform-specific voice selection -->
  <voice name="Mike">I want to be like Mike.</voice>
</speak>

General Notes (Normative)

When there is not a voice available that exactly matches the attributes specified in the document, or multiple voices that match the criteria, the voice selection algorithm may be platform-specific. In both cases, a conforming SSML processor should continue processing and should notify the hosting environment.
Voice attributes are inherited down the tree including to within elements that change the language.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  <voice gender="female"> 
    Any female voice here.
    <voice age="6"> 
      A female child voice here.
      <paragraph xml:lang="ja"> 
        <!-- A female child voice in Japanese. -->
      </paragraph>
    </voice>
  </voice>
</speak>

A change in voice resets the prosodic parameters since different voices have different natural pitch and speaking rates. Volume is the only exception.
The xml:lang attribute may be used specially to request usage of a voice with a specific dialect or other variant of the enclosing language.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  <voice xml:lang="en-cockney">
    Try a Cockney voice (London area).
  </voice>
  <voice xml:lang="en-brooklyn">
    Try one with a New York accent.
  </voice>
</speak>

2.2.2 "emphasis" Element

The "emphasis" element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The synthesizer determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices. The attributes are:

level: the level attribute indicates the strength of emphasis to be applied. Defined values are "strong", "moderate", "none" and "reduced". The default level is "moderate". The meaning of "strong" and "moderate" emphasis is interpreted according to the language being spoken (languages indicate emphasis using a possible combination of pitch change, timing changes, loudness and other acoustic differences). The "reduced" level is effectively the opposite of emphasizing a word. For example, when the phrase "going to" is reduced it may be spoken as "gonna". The "none" level is used to prevent the speech synthesizer from emphasizing words that it might typically emphasize.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  That is a <emphasis> big </emphasis> car!
  That is a <emphasis level="strong"> huge </emphasis>
  bank account!
</speak>

2.2.3 "break" Element

The break element is an empty element that controls the pausing or other prosodic boundaries between words. The use of the break element between any pair of words is optional. If the element is not defined, the speech synthesizer is expected to automatically determine a break based on the linguistic context. In practice, the break element is most often used to override the typical automatic behavior of a speech synthesizer. The attributes are:

size: the size attribute is an optional attribute having one of the following relative values: "none", "small", "medium" (default value), or "large". The value "none" indicates that a normal break boundary should be used. The other three values indicate increasingly large break boundaries between words. The larger boundaries are typically accompanied by pauses.
time: the time attribute is an optional attribute indicating the duration of a pause in seconds or milliseconds. It follows the "Times" attribute format from the Cascading Style Sheets, level 2 (CSS2) Specification . e.g. "250ms", "3s".

    <?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  Take a deep breath <break/> then continue. 
  Press 1 or wait for the tone. <break time="3s"/>
  I didn't hear you!
</speak>

Usage Notes (Informative)

Using the size attribute is generally preferable to the time attribute within normal speech. This is because the speech synthesizer will modify the properties of the break according to the speaking rate, voice and possibly other factors. As an example, a fixed 250ms pause (placed with the time attribute) sounds much longer in fast speech than in slow speech.

2.2.4 "prosody" Element

The prosody element permits control of the pitch, speaking rate and volume of the speech output. The attributes are:

pitch: the baseline pitch for the contained text in Hertz, a relative change or values "high", "medium", "low", "default".
contour: sets the actual pitch contour for the contained text. The format is outlined below.
range: the pitch range (variability) for the contained text in Hertz, a relative change or values "high", "medium", "low", "default".
rate: the speaking rate in words-per-minute for the contained text, a relative change or values "fast", "medium", "slow", "default".
duration: a value in seconds or milliseconds for the desired time to take to read the element contents. Follows the Times attribute format from the Cascading Style Sheets, level 2 (CSS2) Specification . e.g. "250ms", "3s".
volume: the volume for the contained text in the range 0.0 to 100.0 (higher values are louder and specifying a value of zero is equivalent to specifying "silent"), a relative change or values "silent", "soft", "medium", "loud" or "default".

Relative values

Relative changes for any of the attributes above are specified as floating-point values: "+10", "-5.5", "+15.2%", "-8.0%". For the pitch and range attributes, relative changes in semitones are permitted: "+0.5st", "+5st", "-2st".

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  The price of XYZ is <prosody rate="-10%">
  <say-as type="currency">$45</say-as></prosody>
</speak>

Pitch contour

The pitch contour is defined as a set of targets at specified intervals in the speech output. The algorithm for interpolating between the targets is platform-specific. In each pair of the form (interval,target), the first value is a percentage of the period of the contained text and the second value is the value of the pitch attribute (absolute, relative, relative semitone, or descriptive values are all permitted). Interval values outside 0% to 100% are ignored. If a value is not defined for 0% or 100% then the nearest pitch target is copied.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  <prosody contour="(0%,+20)(10%,+30%)(40%,+10)">
    good morning
  </prosody>
</speak>

General Notes (Normative)

The duration attribute takes precedence over the rate attribute. The contour attribute takes precedence over the pitch and range attributes.
All prosodic attribute values are indicative. If a conforming speech synthesizer is unable to accurately render a document as specified, (e.g. trying to set the pitch to 1Mhz, or the speaking rate to 1,000,000 words per minute.) it will make a best effort to continue processing by imposing a limit or a substitute for the specified, unsupported value.
In some cases, SSML processors may elect to ignore a given prosodic markup if the processor determines, for example, that the indicated value is redundant, improper or in error. In particular, concatenative-type synthetic speech systems that employ large acoustic units may reject prosody-modifying markup elements if they are redundant with the prosody of a given acoustic unit(s) or would otherwise result in degraded speech quality.
The default value of all prosodic attributes is no change. For example, omitting the rate attribute means that the rate is the same within the element as outside.

Usage Notes (Informative)

The descriptive values ("high", "medium" etc.) may be specific to the platform, to user preferences or to the current language and voice. As such, it is generally preferable to use the descriptive values or the relative changes over absolute values.

2.3 Other Elements

2.3.1 "audio" Element

The audio element supports the insertion of recorded audio files and the insertion of other audio formats in conjunction with synthesized speech output. The audio element may be empty. If the audio element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available. The alternate content may include text, speech markup, or another audio element. The alternate contents may also be used when rendering the document to non-audible output and for accessibility. The optional attribute is src, which is the URI of a document with an appropriate mime-type.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
                 
<!-- Empty element -->
Please say your name after the tone.  <audio src="beep.wav"/>

<!-- Container element with alternative text -->
<audio src="prompt.au">What city do you want to fly from?</audio>
<audio src="welcome.wav">  
     <emphasis>Welcome</emphasis>  to the Voice Portal. 
</audio>

</speak>

An audio element is sucessfully rendered if:

The referenced audio source is played, or
If the processor is unable to execute #1, then the alternative content is successfully rendered.

Deciding which conditions result in the alternative content being rendered is platform dependent. If the audio element is not successfully rendered, a conforming SSML processor should continue processing and should notify the hosting environment in that case. An SSML processor may determine after beginning playback of an audio source that it cannot be played in its entirety. For example, encoding problems, network disruptions, etc. may occur. The processor may designate this either as successful or unsuccessful rendering, but it must document this behavior.

Usage Notes (Informative)

The audio element is not intended to be a complete mechanism for synchronizing synthetic speech output with other audio output or other output media (video etc.). Instead the audio element is intended to support the common case of embedding audio files in voice output. See the SMIL integration example in Appendix A.

2.3.2 "mark" Element

A mark element is an element that places a marker into the text/tag sequence. The mark element that contains text is used to reference a special sequence of tags and text, either for internal reference within the SSML document, or externally by another document. The empty mark element can also be used to reference a specific location in the text/tag sequence, and can additionally be used to insert a marker into an output stream for asynchronous notification. When audio output of the TTS document reaches the mark, the speech synthesizer issues an event that includes the required name attribute of the element. The platform defines the destination of the event. The mark element does not affect the speech output process.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
                 
We would like 
<mark name="congrats">to extend our warmest congratulations</mark> 
to the members of the Voice Browser Working Group! 
Go from <mark name="here"/> here, to <mark name="there"/> there!

</speak>

Usage Notes (Informative)

When supported by the implementation, requests can be made to pause and resume at document locations specified by the mark values.

3. SSML Documents

3.1 Document Form

A legal Speech Synthesis Markup Language document must have a legal XML Prolog [XML §2.8].

The XML prolog in a synthesis document comprises the XML declaration and an optional DOCTYPE declaration referencing the synthesis DTD. It is followed by the root speak element. The XML prolog may also contain XML comments, processor instructions and other content permitted by XML in a prolog.

The version number of the XML declaration indicates which version of XML is being used. The version number of the speak element indicates which version of the SSML specification is being used -- "1.0" for this specification. The speak version is a required attribute.

The speak element must designate the SSML namespace using the xmlns attribute [XMLNS]. The namespace for SSML is defined to be http://www.w3.org/2001/10/synthesis.

If present, the DOCTYPE should reference the standard DOCTYPE and identifier.

The following are two examples of SSML headers:

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en">

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
        "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en">

3.2 Integration With Other Markup Languages

3.2.1 SMIL

The Synchronized Multimedia Integration Language (SMIL, pronounced "smile") enables simple authoring of interactive audiovisual presentations. SMIL is typically used for "rich media"/multimedia presentations which integrate streaming audio and video with images, text or any other media type. SMIL is an easy-to-learn HTML-like language, and many SMIL presentations are written using a simple text-editor. See the SMIL/SSML integration examples in Appendix A..

3.2.2 ACSS

Aural style sheets are employed to augment standard visual forms of documents (like HTML) with additional elements that assist in the synthesis of the text into audio. In comparison to SSML, ACSS-generated documents are capable of more complex specifications of the audio sequence, including the designation of 3D location of the audio source. Many of the other ACSS elements overlap SSML functionality, especially in the specification of voice type/quality. SSML may be viewed as a superset of ACSS capabilities, excepting spatial audio.

3.3 SSML Document Fetching

The fetching and caching behavior of SSML documents is defined by the environment in which the SSML processor operates. In a VoiceXML interpreter context for example, the caching policy is determined by the VoiceXML interpreter.

4. Conformance

This section is Normative.

4.1 Conforming Speech Synthesis Markup Language Fragments

A synthesis document fragment is a Conforming Speech Synthesis Markup Language Fragment if:

it is a well-formed XML document [XML] conforming to namespaces in XML [XMLNS]
and it conforms to the criteria for Conforming Stand-Alone Speech Synthesis Markup Language Documents after:
- all non-synthesis namespace elements and attributes and all xmlns attributes which refer to non-synthesis namespace elements are removed from the document,
- and, an appropriate XML declaration (i.e., <?xml...?>) is included at the top of the document,
- and, if the speak element does not already designate the synthesis namespace using the "xmlns" attribute, then xmlns="http://www.w3.org/2001/10/synthesis" is added to the element.

4.2 Conforming Stand-Alone Speech Synthesis Markup Language Documents

A document is a Conforming Stand-Alone Speech Synthesis Markup Language Document if:

it is a well-formed XML document [XML] conforming to namespaces in XML [XMLNS]
and it is valid XML document which adheres to the specification described in this document (Speech Synthesis Markup Language Specification) including the constraints expressed in the Schema (see Appendix C) and having an XML Prolog and <speak> root element as specified in Section 3.1.

The Speech Synthesis specification and these conformance criteria provide no designated size limits on any aspect of synthesis documents. There are no maximum values on the number of elements, the amount of character data, or the number of characters in attribute values.

4.3 Using SSML with other Namespaces

The SSML namespace may be used with other XML namespaces as per the Namespaces in XML Recommendation. Future work by W3C will address ways to specify conformance for documents involving multiple namespaces.

4.4 Conforming Speech Synthesis Markup Language Processors

A Speech Synthesis Markup Language processor is a program that can parse and process Speech Synthesis Markup Language documents.
In a Conforming Speech Synthesis Markup Language Processor, the XML parser must be able to parse and process all XML constructs defined within XML 1.0 and XML Namespaces.
A Conforming Speech Synthesis Markup Language Processor must correctly understand and apply the semantics defined for each markup element as described by this document.
A Conforming Speech Synthesis Markup Language Processor is required to parse all language declarations successfully.
A Conforming Speech Synthesis Markup Language Processor should inform its hosting environment if it encounters a language that it can not support.
There is no conformance requirement with respect to performance characteristics of the Speech Synthesis Markup Language Processor.

5. References

5.1 Normative References

[XML]: World Wide Web Consortium. Extensible Markup Language (XML) 1.0 (Second Edition). W3C Recommendation, 6 October 2000. See http://www.w3.org/TR/2000/REC-xml-20001006
[XMLNS]: World Wide Web Consortium. Namespaces in XML. W3C Recommendation. See http://www.w3.org/TR/REC-xml-names
[JSML]: Sun Microsystems. JSpeech Markup Language. Sun Microsystems submission to W3C, 5 June 2000. See http://www.w3.org/TR/jsml/
[RFC2119]: S. Bradner, Key words for use in RFCs to Indicate Requirement Levels , Harvard University, March 1997. See http://www.normos.org/ietf/rfc/rfc2119.txt
[RFC3066]: H. Alvestrand, Tags for the Identification of Languages. See http://www.ietf.org/rfc/rfc3066.txt
[CSS2]: World Wide Web Consortium, Cascading Style Sheets, level 2 CSS2 Specification. W3C Recommendation. See http://www.w3.org/TR/REC-CSS2/aural.html
[VXML]: World Wide Web Consortium. Voice Extensible Markup Language (VoiceXML) Version 2.0. W3C Working Draft. See http://www.w3.org/TR/2001/WD-voicexml20-20011023/

5.2 Informative References

[SABLE]: R. Sproat, SABLE. A Standard for TTS Markup. See http://www.research.att.com/~rws/Sable.v1_0.htm

6. Acknowledgements

This document was written with the participation of the members of the W3C Voice Browser Working Group (listed in alphabetical order):

Brian Eberman, SpeechWorks International
Andrew Hunt, SpeechWorks International
Jim Larson, Intel
Bruce Lucas, IBM
Scott McGlashan, PipeBeach
T.V. Raman, IBM
Dave Raggett, W3C/Openwave
Richard Sproat, ATT
Luc Van Tichelen, ScanSoft
Kuansan Wang, Microsoft
Mark Walker, Intel

Appendix A: Example SSML

This appendix is Non-Normative.

The following is an example of reading headers of email messages. The paragraph and sentence elements are used to mark the text structure. The say-as element is used to indicate text constructs such as the time and proper name. The break element is placed before the time and has the effect of marking the time as important information for the listener to pay attention to. The prosody element is used to slow the speaking rate of the email subject so that the user has extra time to listen and write down the details.

<?xml version="1.0"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">
  <paragraph>
    <sentence>You have 4 new messages.</sentence>
    <sentence>The first is from 
      <say-as type="name">
        Stephanie Williams
      </say-as>
      and arrived at <break/>
      <say-as type="time">3:45pm</say-as>.
    </sentence>
    <sentence>
      The subject is <prosody rate="-20%">ski trip</prosody>
    </sentence>
  </paragraph>
</speak>

The following example combines audio files and different spoken voices to provide information on a collection of music.

<?xml version="1.0"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">

  <paragraph>
    <voice gender="male">
      <sentence>Today we preview the latest romantic music
      from the W3C.</sentence>

      <sentence>Hear what the Software Reviews said about
      Tim Lee's newest hit.</sentence>
    </voice>
  </paragraph>

  <paragraph>
    <voice gender="female">
      He sings about issues that touch us all.
    </voice>
  </paragraph>

  <paragraph>
    <voice gender="male">
      Here's a sample.  <audio src="http://www.w3c.org/music.wav"/>
      Would you like to buy it?
    </voice>
  </paragraph>

</speak>

SMIL Integration Example

The SMIL language is an XML-based multimedia control language. It is especially well suited for describing dynamic media applications that include synthetic speech output.

File 'greetings.ssml' contains the following:

<?xml version="1.0"?>

<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">

  <sentence>
  <mark name="greetings">
    <emphasis>Greetings</emphasis> from the
    <sub alias="World Wide Web Consortium">W3C</sub>!
  </mark>
  </sentence>
</speak>

SMIL Example 1: W3C logo image appears, and then one second later, the speech sequence is rendered. File 'greetings.smil' contains the following:

<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <top-layout width="640" height="320">
      <region id="whole" width="640" height="320"/>
    </top-layout>
  </head>
  <body>
    <par>
      <img src="http://w3clogo.gif" region="whole" begin="0s"/>
      <ref src="greetings.ssml#greetings" begin="1s"/>
    </par>
  </body>
</smil>

SMIL Example 2: W3C logo image appears, then clicking on the image causes it to disappear and the speach sequence to be rendered. File 'greetings.smil' contains the following:

<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <top-layout width="640" height="320">
      <region id="whole" width="640" height="320"/>
    </top-layout>
  </head>
  <body>
    <seq>
      <img id="logo" src="http://w3clogo.gif" region="whole" 
      begin="0s" end="logo.activateEvent"/>
      <ref src="greetings.ssml#greetings"/>
    </seq>
  </body>
</smil>

Appendix B: DTD for the Speech Synthesis Markup Language

This appendix is Informative.

The synthesis DTD is located at http://www.w3.org/TR/speech-synthesis/synthesis.dtd.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- 

SSML DTD 20020313

Copyright 1998-2002 W3C (MIT, INRIA, Keio), All Rights Reserved. 

Permission to use, copy, modify and distribute the SSML DTD and 
its accompanying documentation for any purpose and without fee is 
hereby granted in perpetuity, provided that the above copyright 
notice and this paragraph appear in all copies.  

The copyright holders make no representation about the suitability 
of the DTD for any purpose. It is provided "as is" without expressed 
or implied warranty.

-->
<!ENTITY % duration "CDATA">
<!ENTITY % integer "CDATA">
<!ENTITY % uri "CDATA">
<!ENTITY % audio "#PCDATA | audio ">
<!ENTITY % structure "paragraph | p | sentence | s">
<!ENTITY % sentence-elements "break | emphasis | mark | phoneme | 
        prosody | say-as | voice | sub">
<!ENTITY % allowed-within-sentence " %audio; | %sentence-elements; ">
<!ENTITY % say-as-types "(acronym|spell-out|currency|measure|
        name|telephone|address|
        number|number:ordinal|number:digits|number:cardinal|
            date|date:dmy|date:mdy|date:ymd|
        date:ym|date:my|date:md|
        date:y|date:m|date:d|
            time|time:hms|time:hm|time:h|
            duration|duration:hms|duration:hm|duration:ms|
            duration:h|duration:m|duration:s|
            net|net:email|net:uri)">
<!ELEMENT speak (%allowed-within-sentence; | %structure;)*>
<!ATTLIST speak
    version NMTOKEN #REQUIRED
    xml:lang NMTOKEN #IMPLIED
    xmlns CDATA #REQUIRED
    xmlns:xsi CDATA #IMPLIED
    xsi:schemaLocation CDATA #IMPLIED
>
<!ELEMENT paragraph (%allowed-within-sentence; | sentence | s)*>
<!ATTLIST paragraph
    xml:lang NMTOKEN #IMPLIED
>
<!ELEMENT sentence (%allowed-within-sentence;)*>
<!ATTLIST sentence
    xml:lang NMTOKEN #IMPLIED
>
<!ELEMENT p (%allowed-within-sentence; | sentence | s)*>
<!ATTLIST p
    xml:lang NMTOKEN #IMPLIED
>
<!ELEMENT s (%allowed-within-sentence;)*>
<!ATTLIST s
    xml:lang NMTOKEN #IMPLIED
>
<!ELEMENT voice (%allowed-within-sentence; | %structure;)*>
<!ATTLIST voice
    xml:lang NMTOKEN #IMPLIED
    gender (male | female | neutral) #IMPLIED
    age %integer; #IMPLIED
    variant %integer; #IMPLIED
    name CDATA #IMPLIED
>
<!ELEMENT prosody (%allowed-within-sentence; | %structure;)*>
<!ATTLIST prosody
    pitch CDATA #IMPLIED
    contour CDATA #IMPLIED
    range CDATA #IMPLIED
    rate CDATA #IMPLIED
    duration %duration; #IMPLIED
    volume CDATA #IMPLIED
>
<!ELEMENT audio (%allowed-within-sentence; | %structure;)*>
<!ATTLIST audio
    src %uri; #IMPLIED
>
<!ELEMENT emphasis (%allowed-within-sentence;)*>
<!ATTLIST emphasis
    level (strong | moderate | none | reduced) "moderate"
>
<!ELEMENT say-as (#PCDATA)>
<!ATTLIST say-as
    type %say-as-types; #REQUIRED
>
<!ELEMENT sub (#PCDATA)>
<!ATTLIST sub
    alias CDATA #REQUIRED
>
<!ELEMENT phoneme (#PCDATA)>
<!ATTLIST phoneme
    ph CDATA #REQUIRED
    alphabet CDATA "ipa"
>
<!ELEMENT break EMPTY>
<!ATTLIST break
    size (large | medium | small | none) "medium"
    time %duration; #IMPLIED
>
<!ELEMENT mark (%allowed-within-sentence; | %structure;)*>
<!ATTLIST mark
    name ID #REQUIRED
>

Appendix C: Schema for the Speech Synthesis Markup Language

This appendix is Normative.

The synthesis schema is located at http://www.w3.org/TR/speech-synthesis/synthesis.xsd.

Note: the synthesis schema includes a no-namespace core schema, located at http://www.w3.org/TR/speech-synthesis/synthesis-core.xsd, which may be used as a basis for specifying Speech Synthesis Markup Language Fragments embedded in non-synthesis namespace schemas.

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsd:schema targetNamespace="http://www.w3.org/2001/10/synthesis" 
xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
xmlns="http://www.w3.org/2001/10/synthesis"
elementFormDefault="qualified">
    <xsd:annotation>
    <xsd:documentation>SSML 1.0 Schema (20020311)</xsd:documentation>
    </xsd:annotation>
    <xsd:annotation>
    <xsd:documentation>Copyright 1998-2002 W3C (MIT, INRIA, Keio),
  All Rights Reserved. Permission to use, copy, modify and
  distribute the SSML schema and its accompanying documentation
  for any purpose and without fee is hereby granted in
  perpetuity, provided that the above copyright notice and this
  paragraph appear in all copies.  The copyright holders make no
  representation about the suitability of the schema for any purpose. 
  It is provided "as is" without expressed or implied warranty.
  </xsd:documentation>
    </xsd:annotation>
    <xsd:include schemaLocation="synthesis-core.xsd"/>
</xsd:schema>

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified">
    <xsd:annotation>
        <xsd:documentation>SSML 1.0 Core Schema 
        (20020222)</xsd:documentation>
    </xsd:annotation>
    <xsd:annotation>
        <xsd:documentation>Copyright 1998-2002 W3C (MIT,
        INRIA, Keio), All Rights Reserved. Permission to use,
        copy, modify and distribute the SSML core schema and its
        accompanying documentation for any purpose and without
        fee is hereby granted in perpetuity, provided that the
        above copyright notice and this paragraph appear in all
        copies.  The copyright holders make no representation
        about the suitability of the schema for any purpose. 
        It is provided "as is" without expressed or implied
        warranty.</xsd:documentation>
    </xsd:annotation>
    <xsd:annotation>
        <xsd:documentation>Importing dependent
        namespaces</xsd:documentation>
    </xsd:annotation>
    <xsd:import namespace="http://www.w3.org/XML/1998/namespace"
    schemaLocation="http://www.w3.org/2001/xml.xsd"/>
    <xsd:annotation>
        <xsd:documentation>General Datatypes</xsd:documentation>
    </xsd:annotation>
    <xsd:simpleType name="duration">
        <xsd:annotation>
            <xsd:documentation>Duration follows "Times" in
            CCS specification; e.g. "25ms", "3s"</xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:string">
            <xsd:pattern value="[0-9]+m?s"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType name="relative.change">
        <xsd:annotation>
            <xsd:documentation>Relative change: e.g. +10, -5.5,
            +15%, -9.0%</xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:string">
            <xsd:pattern value="[+-][0-9]+(.[0-9]+)?[%]?"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType name="relative.change.st">
        <xsd:annotation>
            <xsd:documentation>Relative change in semi-tones:
            e.g. +10st, -5st</xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:string">
            <xsd:pattern value="[+-]?[0-9]+st"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType name="height.scale">
        <xsd:annotation>
            <xsd:documentation>values for height
                   </xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:string">
            <xsd:enumeration value="high"/>
            <xsd:enumeration value="medium"/>
            <xsd:enumeration value="low"/>
            <xsd:enumeration value="default"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType name="number.range">
        <xsd:annotation>
            <xsd:documentation>number range: e.g. 0-123, 23343-223333.
    No constraint that the second number is greater than the first.
    </xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:string">
            <xsd:pattern value="[0-9]+-.[0-9]+"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType name="speed.scale">
        <xsd:annotation>
            <xsd:documentation>values for speed
                     </xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:string">
            <xsd:enumeration value="fast"/>
            <xsd:enumeration value="medium"/>
            <xsd:enumeration value="slow"/>
            <xsd:enumeration value="default"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType name="volume.scale">
        <xsd:annotation>
            <xsd:documentation>values for speed
                 </xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:string">
            <xsd:enumeration value="silent"/>
            <xsd:enumeration value="soft"/>
            <xsd:enumeration value="medium"/>
            <xsd:enumeration value="loud"/>
            <xsd:enumeration value="default"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType name="float.range1">
        <xsd:annotation>
            <xsd:documentation>0.0 - 100.0
            </xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:float">
            <xsd:minInclusive value="0.0"/>
            <xsd:maxInclusive value="100.0"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType name="Say-as.datatype">
        <xsd:annotation>
            <xsd:documentation>say-as datatypes
            </xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:string">
            <xsd:enumeration value="acronym"/>
            <xsd:enumeration value="spell-out"/>
            <xsd:enumeration value="number"/>
            <xsd:enumeration value="number:ordinal"/>
            <xsd:enumeration value="number:digits"/>
            <xsd:enumeration value="number:cardinal"/>
            <xsd:enumeration value="date"/>
            <xsd:enumeration value="date:dmy"/>
            <xsd:enumeration value="date:mdy"/>
            <xsd:enumeration value="date:ymd"/>
            <xsd:enumeration value="date:ym"/>
            <xsd:enumeration value="date:my"/>
            <xsd:enumeration value="date:md"/>
            <xsd:enumeration value="date:y"/>
            <xsd:enumeration value="date:m"/>
            <xsd:enumeration value="date:d"/>
            <xsd:enumeration value="time"/>
            <xsd:enumeration value="time:hms"/>
            <xsd:enumeration value="time:hm"/>
            <xsd:enumeration value="time:h"/>
            <xsd:enumeration value="duration"/>
            <xsd:enumeration value="duration:hms"/>
            <xsd:enumeration value="duration:hm"/>
            <xsd:enumeration value="duration:ms"/>
            <xsd:enumeration value="duration:h"/>
            <xsd:enumeration value="duration:m"/>
            <xsd:enumeration value="duration:s"/>
            <xsd:enumeration value="currency"/>
            <xsd:enumeration value="measure"/>
            <xsd:enumeration value="name"/>
            <xsd:enumeration value="net"/>
            <xsd:enumeration value="net:email"/>
            <xsd:enumeration value="net:uri"/>
            <xsd:enumeration value="address"/>
            <xsd:enumeration value="telephone"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:annotation>
        <xsd:documentation>General attributes</xsd:documentation>
    </xsd:annotation>
    <xsd:annotation>
        <xsd:documentation>Elements</xsd:documentation>
    </xsd:annotation>
    <xsd:element name="aws" abstract="true">
        <xsd:annotation>
            <xsd:documentation>The 'allowed-within-sentence'
            group uses this abstract element. Elements with aws as
            their substitution class are then alternatives for
            'allowed-within-sentence'.</xsd:documentation>
        </xsd:annotation>
    </xsd:element>
    <xsd:group name="allowed-within-sentence">
        <xsd:choice>
            <xsd:element ref="aws"/>
        </xsd:choice>
    </xsd:group>
    <xsd:element name="struct" abstract="true"/>
    <xsd:group name="structure">
        <xsd:choice>
            <xsd:element ref="struct"/>
        </xsd:choice>
    </xsd:group>
    <xsd:element name="speak" type="speak"/>
    <xsd:complexType name="speak" mixed="true">
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
            <xsd:group ref="allowed-within-sentence"/>
            <xsd:group ref="structure"/>
        </xsd:choice>
        <xsd:attribute name="version" use="required">
            <xsd:simpleType>
                <xsd:restriction base="xsd:NMTOKEN"/>
            </xsd:simpleType>
        </xsd:attribute>
        <xsd:attribute ref="xml:lang"/>
    </xsd:complexType>
    <xsd:element name="paragraph" type="paragraph"
    substitutionGroup="struct"/>
    <xsd:element name="p" type="paragraph" substitutionGroup="struct"/>
    <xsd:complexType name="paragraph" mixed="true">
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
            <xsd:group ref="allowed-within-sentence"/>
            <xsd:element ref="sentence"/>
            <xsd:element ref="s"/>
        </xsd:choice>
        <xsd:attribute ref="xml:lang"/>
    </xsd:complexType>
    <xsd:element name="sentence" type="sentence"
    substitutionGroup="struct"/>
    <xsd:element name="s" type="sentence" substitutionGroup="struct"/>
    <xsd:complexType name="sentence" mixed="true">
        <xsd:sequence minOccurs="0" maxOccurs="unbounded">
            <xsd:group ref="allowed-within-sentence"/>
        </xsd:sequence>
        <xsd:attribute ref="xml:lang"/>
    </xsd:complexType>
    <xsd:element name="voice" type="voice" substitutionGroup="aws"/>
    <xsd:complexType name="voice" mixed="true">
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
            <xsd:group ref="allowed-within-sentence"/>
            <xsd:group ref="structure"/>
        </xsd:choice>
        <xsd:attribute name="gender">
            <xsd:simpleType>
                <xsd:restriction base="xsd:string">
                    <xsd:enumeration value="male"/>
                    <xsd:enumeration value="female"/>
                    <xsd:enumeration value="neutral"/>
                </xsd:restriction>
            </xsd:simpleType>
        </xsd:attribute>
        <xsd:attribute name="age" type="xsd:positiveInteger"/>
        <xsd:attribute name="variant" type="xsd:integer"/>
        <xsd:attribute name="name" type="xsd:string"/>
        <xsd:attribute ref="xml:lang"/>
    </xsd:complexType>
    <xsd:element name="prosody" type="prosody" substitutionGroup="aws"/>
    <xsd:complexType name="prosody" mixed="true">
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
            <xsd:group ref="allowed-within-sentence"/>
            <xsd:group ref="structure"/>
        </xsd:choice>
        <xsd:attribute name="pitch">
            <xsd:simpleType>
                <xsd:union memberTypes="xsd:positiveInteger
                relative.change relative.change.st height.scale"/>
            </xsd:simpleType>
        </xsd:attribute>
        <xsd:attribute name="contour" type="xsd:string"/>
        <xsd:attribute name="range">
            <xsd:simpleType>
                <xsd:union memberTypes="number.range
                relative.change relative.change.st height.scale"/>
            </xsd:simpleType>
        </xsd:attribute>
        <xsd:attribute name="rate">
            <xsd:simpleType>
                <xsd:union memberTypes="xsd:positiveInteger
                relative.change speed.scale"/>
            </xsd:simpleType>
        </xsd:attribute>
        <xsd:attribute name="duration" type="duration"/>
        <xsd:attribute name="volume">
            <xsd:simpleType>
                <xsd:union memberTypes="float.range1
                relative.change volume.scale"/>
            </xsd:simpleType>
        </xsd:attribute>
    </xsd:complexType>
    <xsd:element name="audio" type="audio" substitutionGroup="aws"/>
    <xsd:complexType name="audio" mixed="true">
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
            <xsd:group ref="allowed-within-sentence"/>
            <xsd:group ref="structure"/>
        </xsd:choice>
        <xsd:attribute name="src" type="xsd:anyURI"/>
    </xsd:complexType>
    <xsd:element name="emphasis" type="emphasis" substitutionGroup="aws"/>
    <xsd:complexType name="emphasis" mixed="true">
        <xsd:sequence minOccurs="0" maxOccurs="unbounded">
            <xsd:group ref="allowed-within-sentence"/>
        </xsd:sequence>
        <xsd:attribute name="level" default="moderate">
            <xsd:simpleType>
                <xsd:restriction base="xsd:string">
                    <xsd:enumeration value="strong"/>
                    <xsd:enumeration value="moderate"/>
                    <xsd:enumeration value="none"/>
                    <xsd:enumeration value="reduced"/>
                </xsd:restriction>
            </xsd:simpleType>
        </xsd:attribute>
    </xsd:complexType>
    <xsd:element name="sub" type="sub" substitutionGroup="aws"/>
    <xsd:complexType name="sub">
        <xsd:simpleContent>
            <xsd:extension base="xsd:string">
                <xsd:attribute name="alias" type="xsd:string"
                use="required"/>
            </xsd:extension>
        </xsd:simpleContent>
    </xsd:complexType>
    <xsd:element name="say-as" type="say-as" substitutionGroup="aws"/>
    <xsd:complexType name="say-as" mixed="true">
        <xsd:attribute name="type" type="Say-as.datatype" use="required"/>
    </xsd:complexType>
    <xsd:element name="phoneme" type="phoneme" substitutionGroup="aws"/>
    <xsd:complexType name="phoneme" mixed="true">
        <xsd:attribute name="ph" type="xsd:string" use="required"/>
        <xsd:attribute name="alphabet" type="xsd:string" default="ipa"/>
    </xsd:complexType>
    <xsd:element name="break" type="break" substitutionGroup="aws"/>
    <xsd:complexType name="break">
        <xsd:attribute name="size" default="medium">
            <xsd:simpleType>
                <xsd:restriction base="xsd:string">
                    <xsd:enumeration value="large"/>
                    <xsd:enumeration value="medium"/>
                    <xsd:enumeration value="small"/>
                    <xsd:enumeration value="none"/>
                </xsd:restriction>
            </xsd:simpleType>
        </xsd:attribute>
        <xsd:attribute name="time" type="duration"/>
    </xsd:complexType>
    <xsd:element name="mark" type="mark" substitutionGroup="aws"/>
    <xsd:complexType name="mark" mixed="true">
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
            <xsd:group ref="allowed-within-sentence"/>
            <xsd:group ref="structure"/>
        </xsd:choice>
        <xsd:attribute name="name" type="xsd:ID" use="required"/>
    </xsd:complexType>
</xsd:schema>

Appendix D: Audio File Formats

This appendix is Normative.

SSML requires that a platform support the playing of the audio formats specified below.

Audio Format	Media Type
Raw (headerless) 8kHz 8-bit mono mu-law [PCM] single channel. (G.711)	audio/basic (from http://www.ietf.org/rfc/rfc1521.txt)
Raw (headerless) 8kHz 8 bit mono A-law [PCM] single channel. (G.711)	audio/x-alaw-basic
WAV (RIFF header) 8kHz 8-bit mono mu-law [PCM] single channel.	audio/wav
WAV (RIFF header) 8kHz 8-bit mono A-law [PCM] single channel.	audio/wav

The 'audio/basic' mime type is commonly used with the 'au' header format as well as the headerless 8-bit 8Khz mu-law format. If this mime type is specified for recording, the mu-law format must be used. For playback with the 'audio/basic' mime type, platforms must support the mu-law format and may support the 'au' format.

Appendix E: MIME Types and File Suffix

This appendix is Non-Normative.

The W3C Voice Browser Working Group has applied to IETF to register a MIME type for the Speech Synthesis Markup Language. The current proposal is to use "application/ssml+xml".

The W3C Voice Browser Working Group has adopted the convention of using the ".ssml" filename suffix for Speech Synthesis Markup Language documents where "speak" is the root element.

Appendix F: Features Under Consideration for Future Versions

This appendix is Non-Normative.

The following features are under consideration for versions of the Speech Synthesis Markup Language Specification after version 1.0:

Support for other phoneme alphabets.
"Lowlevel" elements (fine-grained acoustic-prosodic control)
Intonational control elements
"Value" element for insertion of expressions
Support for pronunciation lexicon (when it becomes available)

Appendix G: Internationalization

This appendix is Normative.

SSML is an application of XML 1.0 and thus supports Unicode which defines a standard universal character set.

Additionally, SSML provides a mechanism for precise control of the input and output languages via the use of "xml:lang" attribute. This facility provides:

The ability to specify the input and output language overriding the SSML Processor default language
The ability to produce multi-language output
The ability to accept input in a language which is different from the language employed in the spoken output.