The HTML 4.0 document character set, in the SGML sense, is the Universal Character Set (UCS) of [ISO10646]. This character set is code-for-code identical to that of the [UNICODE] standard.
HTML documents can be transmitted in a variety of encodings, as described in the section "HTML Document Character Set" near the beginning of this specification. Characters that the chosen encoding cannot represent directly need to be written as character references (numeric character references, or entity references where one is defined). This is unnecessary with an encoding that covers all of Unicode directly, such as UTF-8 or UTF-16. After compression, the resulting file sizes are close to those for character encodings such as ISO-8859-1 and EUC-JP.
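As an illustration of the point above, the following minimal Python sketch encodes the same text in two encodings. It relies on Python's standard `xmlcharrefreplace` error handler, which emits decimal numeric character references (`&#N;`) rather than named entity references; the function name `encode_with_char_refs` is hypothetical and not part of this specification.

```python
def encode_with_char_refs(text: str, encoding: str) -> bytes:
    """Encode text, writing any character the encoding cannot
    represent as a decimal numeric character reference (&#N;)."""
    return text.encode(encoding, errors="xmlcharrefreplace")


sample = "caf\u00e9 \u65e5\u672c"  # "café" followed by two CJK ideographs

# ISO-8859-1 covers U+00E9 but not the ideographs, so they become references.
latin1_bytes = encode_with_char_refs(sample, "iso-8859-1")

# UTF-8 covers all of Unicode, so no references are needed.
utf8_bytes = encode_with_char_refs(sample, "utf-8")

print(latin1_bytes)
print(utf8_bytes)
```

In the ISO-8859-1 output the two ideographs appear as `&#26085;` and `&#26412;`, while the UTF-8 output carries them as ordinary encoded characters.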
When HTML text is transmitted directly in UTF-16 (charset="UTF-16"), text data should be transmitted in big-endian byte order (high order byte first) in accordance with ISO 10646 Section 6.3 and Unicode 2.0, clause C3, page 3-1 (see [UNICODE]).
Furthermore, to maximize the chances of proper interpretation, it is recommended that documents transmitted as UTF-16 always begin with a ZERO WIDTH NO-BREAK SPACE character (hexadecimal FEFF) which, when byte-reversed, becomes hexadecimal FFFE, a code value guaranteed never to be assigned. Thus, a user agent receiving FFFE as the first two octets of a text would know that the bytes have to be reversed for the remainder of the text.
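The byte-order check described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `encode_utf16` and `decode_utf16` are hypothetical, and treating unsigned input as big-endian is an assumption based on the big-endian recommendation in the preceding paragraph.

```python
SIGNATURE_BE = b"\xfe\xff"  # U+FEFF serialized high-order byte first
SIGNATURE_LE = b"\xff\xfe"  # the same character with its bytes reversed


def encode_utf16(text: str) -> bytes:
    """Serialize text as big-endian UTF-16, starting with U+FEFF."""
    return SIGNATURE_BE + text.encode("utf-16-be")


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 data, using the first two octets to pick the byte order."""
    if data[:2] == SIGNATURE_BE:
        return data[2:].decode("utf-16-be")
    if data[:2] == SIGNATURE_LE:
        # FFFE is never an assigned character, so the bytes must be reversed.
        return data[2:].decode("utf-16-le")
    # Assumption: with no signature, fall back to big-endian order.
    return data.decode("utf-16-be")
```

For example, `decode_utf16(encode_utf16("HTML"))` returns the original string, and so does decoding the same text when it arrives in little-endian order with a leading `FF FE`.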
The UTF-1 transformation format of [ISO10646] (registered by IANA as ISO-10646-UTF-1) should not be used.
Note that ISO Registration Number 177 strictly speaking refers to the original state of ISO 10646 in 1993, while in this specification, we always refer to the most up-to-date form of ISO 10646. Changes since 1993 have been the addition of characters and a one-time operation reallocating a large number of codepoints for Korean Hangul (Amendment 5).
<!SGML  "ISO 8879:1986"
    --
         SGML Declaration for HyperText Markup Language version 4.0

         With support for Unicode UCS-4 and increased limits
         for tag and literal lengths etc.
    --

    CHARSET
         BASESET  "ISO Registration Number 177//CHARSET
                   ISO/IEC 10646-1:1993 UCS-4 with
                   implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET  0       9           UNUSED
                  9       2           9
                  11      2           UNUSED
                  13      1           13
                  14      18          UNUSED
                  32      95          32
                  127     1           UNUSED
                  128     32          UNUSED
                  160     2147483486  160
    --
         In ISO 10646, the positions with hexadecimal
         values 0000D800 - 0000DFFF, used in the UTF-16
         encoding of UCS-4, are reserved, as well as the last
         two code values in each plane of UCS-4, i.e. all
         values of the hexadecimal form xxxxFFFE or xxxxFFFF.
         These code values or the corresponding numeric
         character references must not be included when
         generating a new HTML document, and they should be
         ignored if encountered when processing a HTML document.
    --

    CAPACITY SGMLREF
         TOTALCAP 150000
         GRPCAP   150000
         ENTCAP   150000

    SCOPE    DOCUMENT
    SYNTAX
         SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
                  17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
         BASESET  "ISO 646IRV:1991//CHARSET
                   International Reference Version
                   (IRV)//ESC 2/8 4/2"
         DESCSET  0 128 0

         FUNCTION
                  RE          13
                  RS          10
                  SPACE       32
                  TAB SEPCHAR  9

         NAMING   LCNMSTRT ""
                  UCNMSTRT ""
                  LCNMCHAR ".-"    -- ?include "~/_" for URLs? --
                  UCNMCHAR ".-"
                  NAMECASE GENERAL YES
                           ENTITY  NO
         DELIM    GENERAL  SGMLREF
                  SHORTREF SGMLREF
         NAMES    SGMLREF
         QUANTITY SGMLREF
                  ATTSPLEN 65536   -- These are the largest values --
                  LITLEN   65536   -- permitted in the declaration --
                  NAMELEN  65536   -- Avoid fixed limits in actual --
                  PILEN    65536   -- implementations of HTML UA's --
                  TAGLVL   100
                  TAGLEN   65536
                  GRPGTCNT 150
                  GRPCNT   64

    FEATURES
      MINIMIZE
        DATATAG  NO
        OMITTAG  YES
        RANK     NO
        SHORTTAG YES
      LINK
        SIMPLE   NO
        IMPLICIT NO
        EXPLICIT NO
      OTHER
        CONCUR   NO
        SUBDOC   NO
        FORMAL   YES
>
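To make the effect of the DESCSET entries and the reserved-value comment in the declaration concrete, here is a minimal Python sketch that tests whether a code value may appear in a generated HTML document. The names `MAPPED_RANGES` and `is_allowed_code` are hypothetical; the ranges are copied from the non-UNUSED DESCSET rows, and the two extra checks follow the comment about surrogate positions and the last two code values of each plane.

```python
# (first code value, count) for each DESCSET row that is not UNUSED.
MAPPED_RANGES = [
    (9, 2),             # TAB and LINE FEED
    (13, 1),            # CARRIAGE RETURN
    (32, 95),           # SPACE through TILDE
    (160, 2147483486),  # everything from NO-BREAK SPACE upward
]


def is_allowed_code(code: int) -> bool:
    """Return True if the code value is in the declared character set
    and is not one of the reserved values named in the declaration."""
    in_descset = any(first <= code < first + count
                     for first, count in MAPPED_RANGES)
    if not in_descset:
        return False
    if 0xD800 <= code <= 0xDFFF:
        return False  # positions reserved for the UTF-16 encoding
    if (code & 0xFFFF) in (0xFFFE, 0xFFFF):
        return False  # last two code values of every plane
    return True
```

Under this sketch, `is_allowed_code(0x41)` and `is_allowed_code(0xFEFF)` are true, while `is_allowed_code(0x0B)`, `is_allowed_code(0x80)`, `is_allowed_code(0xD800)` and `is_allowed_code(0x1FFFF)` are false. A generator could use such a check before emitting a numeric character reference.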