On the use of some MS Windows characters in HTML

The so-called MS Windows character set contains, in addition to ISO Latin 1 (ISO 8859-1) characters, some special characters like em dash, trademark symbol, and asymmetric quote characters. A Web author who works in a Windows environment may not realize that by using such characters he creates problems for many users who don't use Windows, and possibly for some Windows users, too. Typically, if an author naively types a trademark symbol, a browser running on Unix or some other non-Windows system will probably display a blank instead of the trademark symbol, or something worse. This document explains this problem in some detail, outlines the long-term solutions, and gives some suggestions for short-term solutions.

You should usually try to avoid using any of the following characters in HTML documents:
baseline single and double quote, florin, ellipsis, dagger, double dagger, circumflex accent, permile,
S and s Hacek, left and right single guillemet, OE and oe ligature, left and right single and double quote,
bullet, endash, emdash, tilde accent, trademark ligature, Y Dieresis

The same applies to euro sign, as well as to Z and z with caron, with the additional note that since they are additions to the original MS Windows character set, they cause even more problems than the others.

Please don't get this wrong

There is nothing wrong with the characters discussed here. They have their legitimate uses, and they are, as characters, part of many other character repertoires too, such as Unicode and ISO 10646. The problem is that currently or in the near future they cannot be presented in HTML reliably enough.

There is a very large number of useful characters in Unicode, but the great majority of them cannot be used in HTML documents with reasonably good success yet. The situation is improving, as the so-called internationalization of the Web proceeds and gets implemented. But currently we must live with rather limited character repertoires, unless we are willing to restrict accessibility radically. Compared with that, the need to use a hyphen instead of en dash, for example, is a rather small detail.

This document is not intended to make any judgment on the MS Windows systems themselves. And the characters discussed here can be used within and between Windows systems using the MS Windows character encoding.

The point is that on the World Wide Web, one should not expect that vendor-specific, system-dependent encodings work widely enough. And unfortunately, the standardized methods of presenting the characters under discussion do not work widely yet.

The main reason why the characters discussed here cause problems is that various attempts to present them create an illusion of working. When you create an HTML document and either consciously or unconsciously use, for example, the trademark symbol, you will probably see it right on your browser, and so will many others. But a large number of other people will see just a blank, or even have their display messed up by some control function.

This is what you may get:

This is what many others get:

Although the trademark symbol, for example, probably looks somewhat better than the result of using a replacement (like the HTML markup <SUP>(TM)</SUP>, which looks like the following on your current browser: ^(TM)), the gain is rather small compared with the damage caused when the vendor-specific method of presenting the symbol does not work at all, i.e. information is lost. Of course, in some cases this might not matter so much, while in others it can be quite serious (see the examples). But note that the effect varies; it need not be simply a space, though this is a common situation. (Bob Baumel's document on special characters contains some examples of different behavior.)

Naturally, the warnings equally apply to any cross-platform transfer of data. However, when data is transferred to a known system - instead of being made accessible from any platform - one can often use a suitable character code conversion program. For example, when transferring text data from Windows to Macintosh, one can handle some of the characters discussed here, if one correctly converts from the Windows encoding to the Mac encoding.
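As a rough illustration of such a conversion, the following Python sketch re-encodes Windows-encoded bytes for a Macintosh. It assumes that Python's `cp1252` and `mac_roman` codecs are acceptable stand-ins for "the Windows encoding" and "the Mac encoding"; the function name `windows_to_mac` is made up for this example.

```python
def windows_to_mac(data: bytes) -> bytes:
    """Re-encode windows-1252 bytes as Mac Roman bytes.

    Characters that Mac Roman lacks are replaced by "?" so that the
    loss is at least visible instead of silent.
    """
    text = data.decode("cp1252")  # interpret the octets the Windows way
    return text.encode("mac_roman", errors="replace")


# Octet 151 (em dash on Windows) is moved to the Mac position for em dash.
em_dash_on_mac = windows_to_mac(b"\x97")

# Octet 138 (S with caron) has no Mac Roman counterpart and becomes "?".
s_caron_on_mac = windows_to_mac(b"\x8a")
```

This also shows why only some of the characters can be handled: the conversion is lossy for any character that the target encoding does not contain.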

The characters

The following table lists the characters we are discussing, i.e. the Windows characters which are not ISO Latin 1 characters. The Windows and ISO 10646 names as well as code numbers are given, Windows code in decimal and ISO 10646 code in hexadecimal. The column "# ref." contains the numeric character references (containing the Unicode code number in decimal) that can be used in HTML, but see warnings below.

"Special" Windows characters and their ISO 10646 equivalents
Windows name	ISO 10646 name of character	Win	Unicode	# ref.
baseline single quote	single low-9 quotation mark	130	`U+201A`	`&#8218;`
florin	Latin small letter f with hook	131	`U+0192`	`&#402;`
baseline double quote	double low-9 quotation mark	132	`U+201E`	`&#8222;`
ellipsis	horizontal ellipsis	133	`U+2026`	`&#8230;`
dagger	dagger	134	`U+2020`	`&#8224;`
double dagger	double dagger	135	`U+2021`	`&#8225;`
circumflex accent	modifier letter circumflex accent	136	`U+02C6`	`&#710;`
permile	per mille sign	137	`U+2030`	`&#8240;`
S Hacek	Latin capital letter S with caron	138	`U+0160`	`&#352;`
left single guillemet	single left-pointing angle quot. m.	139	`U+2039`	`&#8249;`
OE ligature	Latin capital ligature OE	140	`U+0152`	`&#338;`
left single quote	left single quotation mark	145	`U+2018`	`&#8216;`
right single quote	right single quotation mark	146	`U+2019`	`&#8217;`
left double quote	left double quotation mark	147	`U+201C`	`&#8220;`
right double quote	right double quotation mark	148	`U+201D`	`&#8221;`
bullet	bullet	149	`U+2022`	`&#8226;`
endash	en dash	150	`U+2013`	`&#8211;`
emdash	em dash	151	`U+2014`	`&#8212;`
tilde accent	small tilde	152	`U+02DC`	`&#732;`
trademark ligature	trade mark sign	153	`U+2122`	`&#8482;`
s Hacek	Latin small letter s with caron	154	`U+0161`	`&#353;`
right single guillemet	single right-pointing angle quot. m.	155	`U+203A`	`&#8250;`
oe ligature	Latin small ligature oe	156	`U+0153`	`&#339;`
Y Dieresis	Latin capital letter Y with diaeresis	159	`U+0178`	`&#376;`

Notes:

"quot. m." is an abbreviation for "quotation mark", used here for convenience.
The official spelling of ISO 10646 and Unicode names for characters is upper-case only. Here, as usual, mixed case is used for readability.
What we call "the Windows character set" here is (since August 2000!) officially registered at IANA as windows-1252. Unofficial synonyms include cp-1252 and WinLatin1.
The Windows names for the characters listed in the table have been taken from the document Windows 3.1 Character Set written by Scott W. Adkins.
The correspondence between the Windows and Unicode code positions as given in the above-mentioned document and listed here is the same as in the cp1252 to Unicode table (in the Online Data by the Unicode consortium).
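The correspondence in the table can be recomputed mechanically. The sketch below assumes that Python's `cp1252` codec agrees with the cp1252 to Unicode table mentioned above; the function name is invented for this example. Note that it also lists the later additions (euro sign, Z and z with caron), which the table above leaves out.

```python
def windows_special_characters():
    """List (Windows code, Unicode notation, numeric reference) triples
    for the assigned positions 128-159 of windows-1252."""
    rows = []
    for code in range(128, 160):
        try:
            ch = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # Position is unassigned in this codec; nothing to list.
            continue
        rows.append((code, f"U+{ord(ch):04X}", f"&#{ord(ch)};"))
    return rows


for code, unicode_notation, reference in windows_special_characters():
    print(code, unicode_notation, reference, sep="\t")
```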

Where do these characters come from?

There are of course some reasons why the characters we are discussing were included in the "Windows character set" (as well as in some other character repertoires). People who need a character tend to use it if they can. And many people are accustomed to using programs like MS Word where a large character repertoire is available. They usually just use any way of inserting special characters they need. (On MS Windows systems, a rather universal way of inserting the characters under discussion is the so-called Alt-nnnn method.) Normally they are satisfied when they see the characters presented on paper. So far so good.

The problem is that the internal encoding of the characters can be interpreted in different ways if the data is transferred to or processed in different programs and systems. For instance, if you use Alt-0151 on Windows to insert an em dash into a file and that file is transferred, without conversion, to a Unix system, anything may happen. Unix systems typically use some ISO 8859 encoding nowadays, and that means that the octet (byte) with value 151 in decimal is in the range reserved for control characters. Problems may occur even if you don't transfer the file to a different computer. If you use e.g. the type command on the file at the DOS level, you will see something like ù (letter u with grave accent) instead of em dash!
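The effect can be reproduced by decoding one and the same octet under different encodings. This is only a sketch: it assumes that Python's `cp1252`, `latin-1` and `cp437` codecs represent the Windows, ISO 8859-1 and DOS interpretations, and that the DOS code page in use is 437. The helper name `interpret` is hypothetical.

```python
import unicodedata


def interpret(octet_value: int, encoding: str) -> str:
    """Decode a single octet according to the given encoding."""
    return bytes([octet_value]).decode(encoding)


as_windows = interpret(151, "cp1252")   # em dash
as_latin1 = interpret(151, "latin-1")   # a C1 control character
as_dos = interpret(151, "cp437")        # an accented letter u

for label, ch in (("Windows", as_windows), ("ISO 8859-1", as_latin1), ("DOS", as_dos)):
    # Show the code point and the general category ("Cc" means control).
    print(label, f"U+{ord(ch):04X}", unicodedata.category(ch))
```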

On the Web, people use different browsers on different systems. Therefore, anything you put onto the Web is thereby "virtually" transferred to a huge variety of systems. Consequently, an HTML document for the Web should not contain anything that works on some operating systems only, no matter how common they are.

The problematic characters are often produced by different programs, such as HTML editors or converters. Naturally, they shouldn't behave that way, but many of them actually do. It's often a good idea to check that output from such tools does not contain any octets (bytes) in the range 128 - 159 decimal (200 - 237 octal). (A very simple C program could do that, for example.)
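A check of this kind takes only a few lines in most languages. The following Python sketch (rather than C) reports every octet in the range 128 - 159; it assumes the file is meant to be in a single-byte encoding such as ISO 8859-1, since in a multi-byte encoding such octets can be perfectly legitimate. The function names are invented for this example.

```python
def find_suspect_octets(data: bytes):
    """Return (offset, value) pairs for octets in the range 128-159."""
    return [(offset, value) for offset, value in enumerate(data)
            if 128 <= value <= 159]


def report(path: str) -> int:
    """Print one line per suspect octet in the file; return the count."""
    with open(path, "rb") as f:
        hits = find_suspect_octets(f.read())
    for offset, value in hits:
        print(f"{path}: offset {offset}: octet {value} (octal {value:o})")
    return len(hits)
```

Calling `report("page.html")` for each generated file (the file name here is just a placeholder) and treating a nonzero result as a failure would be one way to use it.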

Attempts to present the characters

The following table summarizes the most common attempts to present in HTML the characters we discuss here. For concreteness, the table shows examples of presenting a particular character, the em dash.

method	example	problems
"raw data"	(octet with value 151 in decimal)	roughly speaking, works on Windows browsers only
numeric character ref. using Windows code	`&#151;`	roughly speaking, works on Windows browsers only
(symbolic) entity reference	`&mdash;`	increasing support, but not very wide yet
correct numeric character reference	`&#8212;`	better support than for the symbolic reference, but still limited
an alternative correct numeric character reference	`&#x2014;`	not supported yet in any popular browser
an image	`<IMG SRC="mdash.gif" ALT="--">`	does not match the size of normal characters (except by accident); cf. the notes on using an image in The euro sign in HTML

As regards the em dash in particular, Andreas Prilop has mentioned an interesting possibility:
<TT>-</TT>
(He also mentions a variant that requests a specific font for the hyphen; although that might give an even wider glyph, it relies on the user's system having a font with a particular name, whereas the TT element is universally supported.) This particular method essentially consists of using a hyphen (-) as a surrogate for em dash but with a presentation suggestion to display it using a font where the glyph for hyphen is expected to be wider than a normal hyphen. Although it often creates a good presentation, it has been said that the hyphen character of some monospace fonts looks bad, especially in the midst of normal text.

Yet another approach is to use two consecutive hyphens, with a style sheet suggestion to reduce the spacing between them, in the hope that they will look like a dash. This would apply to situations where "--" is an acceptable surrogate for a dash. For some odd reason, Internet Explorer seems to be immune to the style rule in this particular case, unless you use the nobr markup. Here is what your browser presents when the construct -- is used together with the style sheet rule .dash { letter-spacing: -0.1em; }: --.

Presenting a character as "raw data" simply means that the character is presented as an octet (byte) or a sequence of octets according to the encoding used for the document. This is how most characters are actually presented in HTML documents. There is nothing mystical about it. (If you type characters from a keyboard using an editor, what normally happens is that you actually enter characters as "raw data" in some encoding; in some cases, you use some special methods for entering characters when they cannot be directly typed.) The problem with the "raw data" method is that it works only for those browsers (and other user agents) which can handle data in the specific encoding used. There is a very large number of registered character encodings (and many unregistered encodings, too). One can hardly expect Web browsers generally to handle whatever encoding an author has decided to use. In fact, the ISO 8859-1 encoding is the only encoding which can reasonably be expected to be known to any browser. Although the Windows encoding is very widely used, it is usually not understood by browsers running on systems other than Windows; there are good reasons why it should be, but in fact it isn't. On the other hand, browsers running in a Windows environment usually treat documents according to the Windows encoding if the server does not specify the encoding or if the encoding is specified to be ISO 8859-1.

In principle, if the "raw data" method is used, the server should send an HTTP header which specifies the encoding used. When octets are to be interpreted according to the Windows encoding (e.g. octet 151 means em dash), the server should send
Content-Type:text/html;charset=windows-1252
However, for reasons explained above, such headers usually don't make browsers process the data any better than they would by default.
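What a user agent is expected to do with such a header can be sketched as follows. This is an illustration only, not a description of how any particular browser works: it uses Python's standard `email.message` module to pick the charset parameter out of the header value, and the function names and the ISO 8859-1 fallback are assumptions made for this example.

```python
from email.message import Message


def charset_of(content_type: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(body: bytes, content_type: str, default: str = "iso-8859-1") -> str:
    """Decode the document octets according to the declared encoding."""
    return body.decode(charset_of(content_type) or default)


# With the header, octet 151 is an em dash; without it, a control character.
declared = decode_body(b"a\x97b", "text/html;charset=windows-1252")
undeclared = decode_body(b"a\x97b", "text/html")
```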

The problem with notations like &#151; is that their meaning is undefined, i.e. anything may happen. In practice, perhaps the user sees an em dash, perhaps a space, perhaps nothing - or perhaps the screen gets messed up. After all, code positions 128 - 159 have been reserved for eventual use as control codes ("control characters"), and they might actually be used that way in some environments, e.g. according to the DEC Multinational Character Set.
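If existing documents already contain references like `&#151;`, they can be rewritten mechanically into references that use the Unicode code number. The sketch below assumes the author meant the Windows interpretation of the number and that Python's `cp1252` codec gives the right mapping; the name `fix_windows_refs` is hypothetical. Only decimal references are handled.

```python
import re

# Matches decimal numeric character references such as &#151;
_DECIMAL_REF = re.compile(r"&#(\d+);")


def fix_windows_refs(html: str) -> str:
    """Rewrite &#128; ... &#159; to references using Unicode numbers."""
    def replace(match):
        number = int(match.group(1))
        if not 128 <= number <= 159:
            return match.group(0)  # outside the problematic range
        try:
            ch = bytes([number]).decode("cp1252")
        except UnicodeDecodeError:
            return match.group(0)  # unassigned position: leave untouched
        return f"&#{ord(ch)};"

    return _DECIMAL_REF.sub(replace, html)
```

For example, `fix_windows_refs("Windows&#153;")` would give `Windows&#8482;`.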

In the long run, the problem will be solved when browsers widely enough support the two methods which are defined in the HTML 4.0 Specification:

Use a numeric character reference of the form &#n; where n is the code number, in decimal, of the character in Unicode and ISO 10646. (To use this method, you often need to convert numbers from hexadecimal notation to decimal, since the code numbers are given in hexadecimal in most references. HTML 4.0 also allows a notation using hexadecimal numbers, but it is practically not supported yet.)
Use a "symbolic" character entity reference of the form &name; which is defined for some characters, including those we are discussing here. There is a handy reference HTML 4.0 Entities by WDG; see its section on "Special Entities" for most of the characters discussed here.
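The hexadecimal-to-decimal conversion needed for the first method, and the lookup of an entity name for the second, might be sketched like this. The table `html.entities.codepoint2name` is part of Python's standard library and covers the HTML 4 entity set; the two function names are made up for this example.

```python
from html.entities import codepoint2name


def _code_number(unicode_notation: str) -> int:
    """Turn a notation like "U+2014" (or just "2014") into an integer."""
    return int(unicode_notation.upper().replace("U+", ""), 16)


def numeric_ref(unicode_notation: str) -> str:
    """Build a decimal numeric character reference, e.g. &#8212;"""
    return f"&#{_code_number(unicode_notation)};"


def entity_ref(unicode_notation: str):
    """Build a symbolic entity reference, or None if no name is defined."""
    name = codepoint2name.get(_code_number(unicode_notation))
    return f"&{name};" if name else None
```

For the em dash, `numeric_ref("U+2014")` gives `&#8212;` and `entity_ref("U+2014")` gives `&mdash;`.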

In an intranet where you can make sure that all browsers satisfy applicable requirements, you could use either of the above-mentioned methods even today, and expect them to work in the future on the Web too. Naturally you should carefully test that your selected method actually works on the browsers and browser settings (especially font settings) it needs to work on. - It is true that intranet operability could often be achieved using the Windows-specific hacks, too. But it shouldn't be more difficult to ensure that the method used works according to specifications and across platforms; and then you wouldn't need to worry in the future when you need to put the same documents on an extranet or the Internet, too.

There might be some other very special cases. If you need to include e.g. Greek and Cyrillic letters on one page, then any methods for using such a large character repertoire (one of which is described in my document Using national and special characters in HTML) considerably limit accessibility at present and in the near future. If you have good reasons to do so, then you might as well include "smart quotes", em and en dashes, and other characters discussed here, naturally using the method you have selected to solve the fundamental problem. (When using "the most universal way" described in that document of mine, you would use &#8212; for em dash etc., using the Unicode positions mentioned in the list of characters above.)

For more detailed explanations of some of the problems, see ISO-8859 briefing and resources by Alan J. Flavell.

Various hacks have also been suggested, such as using a few no-break spaces within a STRIKE element to "construct" an em dash! I have prepared a small test file containing examples of and annotations on such attempts as well as the above-mentioned methods.

If you decide to use characters like em dashes, en dashes, and "smart" quotes, make sure you use them properly, according to the rules of the natural language you write. It's easy to go wrong here, since there have been breaks in typographic traditions, when those characters have been (and still largely are) avoided when producing texts on computers. For dashes in particular, see some usage notes in Dashes and hyphens.

Suggested substitutes

Whenever you need a character and can't use it, you need to consider substitutes. For the characters discussed here, relatively good substitutes can be found:

Suggested substitutes for "special" Windows characters
Windows name	substitute	comments
baseline single quote	'	apostrophe used as single quote
florin	<i>f</i> or NLG or gulden(s)	letter f in italics or the currency code or name
baseline double quote	"	quotation mark (double quote)
ellipsis	...	three dots, possibly styled
dagger	¹	superscript 1: ¹ (assuming use as footnote reference)
double dagger	²	superscript 2: ² (assuming use as footnote reference)
circumflex accent	^	circumflex
permile	o/oo	usual, but somewhat illogical
S Hacek	Sh or SH	language-dependent
left single guillemet	< or '	"<" used as "left angle bracket", or an apostrophe used as single quote
OE ligature	Oe or OE	optionally styled; natural due to what "ligature" means
left single quote	'	apostrophe used as single quote
right single quote	'	apostrophe used as single quote
left double quote	"	quotation mark (double quote)
right double quote	"	quotation mark (double quote)
bullet	* or - or list markup	consider using <ul> and <li> markup instead
endash	-	hyphen
emdash	--	two hyphens
tilde accent	~ or <sup>~</sup>	tilde ~, possibly in superscript style: ^~
trademark ligature	<sup>(TM)</sup>	(TM) in superscript style: ^(TM)
s Hacek	sh	language-dependent
right single guillemet	> or '	">" used as "right angle bracket", or an apostrophe used as single quote
oe ligature	oe	natural due to what "ligature" means
Y Dieresis	IJ or Y	depending on intended meaning

Notes:

To simulate ellipsis, you might wish to use CSS in order to suggest a presentation of "..." that resembles typographers' ellipsis, i.e. where the dots have somewhat more spacing between them than what you normally get. E.g., you could have
<style type="text/css"></style> and
...
If dagger would be used as the symbol of year or date of death, due to its cross-like appearance, the suitable substitute is of course a word like "died" or abbreviation like "d." in the language used, or perhaps the symbol + (recommended in soc.genealogy.german FAQ).
If o/oo is used as a substitute for permile (more correctly "per mille", which means 'per one thousand') then perhaps one should similarly use o/o instead of % for stylistic uniformity. Please note that a speech generator, unless programmed to handle o/oo in a special way, would probably read it pronouncing "o" as a letter and "oo" as a word (or two letters)! A more logical alternative to o/oo might be to use zero instead of o and try to make the presentation better-looking by using SUP and SUB:
<SUP>0</SUP>/<SUB>00</SUB>
(This looks like the following in your current browsing situation: ⁰/₀₀). The main problem with this is that on many implementations, the use of SUB and SUP causes uneven vertical spacing between lines; using SMALL too seems to help a bit. There's also the possibility of using just 0/00 with FONT SIZE="1" markup for the digits, giving the following appearance on your browser: 0/00. In any case, it is best to put a no-break space between a number and a per mille symbol or its substitute; otherwise there's the risk that something intended to denote one thousandth looks like 10/00.
For s Hacek, the officially recommended substitute in Finnish is sh (which of course is to be written as Sh if it would be S Hacek, or as SH in uppercase-only text). In Czech texts presented in ISO Latin 1 on the Web, it seems to be usual to simply omit the diacritic, thus presenting s Hacek as s. In Estonian, it seems to be normal to use s^ for s Hacek when the character itself cannot be used. Notice that Estonian standard EVS 8:1993 defines a "basic table" (of characters) saying first that it "corresponds to ISO 8859-1", then that certain characters "are added", but in fact the table has s and z with Hacek (caron) replacing some ISO Latin 1 characters (namely the Icelandic letters Ð, ð, Þ, þ)!
For the bullet character, asterisk (*) or hyphen (-) would be a suitable surrogate in typical cases; the letter o is often used, but this implies the risk of being read as "oh" (especially by automatic speech generators). But consider whether you need the bullet character at all. Usually the "need" for it arises from illogical markup which tries to construct a bulleted list, instead of just using adequate markup: the UL element.
As regards substitutes for dashes, there are different practices in different languages, and perhaps even different recommendations. For the English language, according to Grammar, Punctuation, and Capitalization; A Handbook for Technical Writers and Editors by Mary K. McCaskill, the recommendation is:

In typewritten material, the em dash is represented by two hyphens with no space around them, and an en dash is represented by a hyphen.

and it seems natural to apply this to material where em and en dash cannot be used. (For example, for the Finnish language, there is an official recommendation which deviates from English practice: a single hyphen is used as a replacement, in many cases surrounded by spaces.) You might consider using font-level markup to suggest that a hyphen used as a surrogate for dash be displayed in a particular font (to make it look more like a dash).
Y dieresis is needed very rarely if ever. The lower case y dieresis (ÿ) is in ISO Latin 1, probably because it is used in French in some names like L'Haÿ; the dieresis (trema) there indicates, as usual in French, that each vowel keeps its own pronunciation. In such usage, ÿ cannot occur at the beginning of a word, so an uppercase Y dieresis would appear in French names only if they are written in all caps and preserving the diacritics. In HTML authoring, caps-only is hardly needed, since one can (and should) use logical markup (like H1 and STRONG) instead. On the other hand, it has been reported that y dieresis is used as a ligature for ij in Dutch. This probably means that ÿ is used as a surrogate for a real ij ligature (which exists in Unicode). Thus, anyone intending to write a Dutch word using a ligature for IJ should really type just IJ. (In situations where support for Unicode could be relied on, the real IJ ligature, U+0132, could be used.)
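For text that has already been decoded into Unicode, the context-independent substitutions from the table above could be applied mechanically, roughly as in the following sketch. Where the table offers alternatives, the choice made here (for example * for the bullet) is just one possibility, and the characters whose substitute depends on language or meaning (florin, the daggers, S and s Hacek, Y Dieresis) are deliberately left alone. The names `SUBSTITUTES` and `substitute` are invented for this example.

```python
# Unicode character -> plain substitute, following the table above.
SUBSTITUTES = {
    "\u201a": "'",     # baseline single quote
    "\u201e": '"',     # baseline double quote
    "\u2026": "...",   # ellipsis
    "\u02c6": "^",     # circumflex accent
    "\u2030": "o/oo",  # permile
    "\u2039": "<",     # left single guillemet
    "\u203a": ">",     # right single guillemet
    "\u0152": "OE",    # OE ligature
    "\u0153": "oe",    # oe ligature
    "\u2018": "'",     # left single quote
    "\u2019": "'",     # right single quote
    "\u201c": '"',     # left double quote
    "\u201d": '"',     # right double quote
    "\u2022": "*",     # bullet
    "\u2013": "-",     # endash
    "\u2014": "--",    # emdash
    "\u02dc": "~",     # tilde accent
    "\u2122": "(TM)",  # trademark ligature
}

_TABLE = str.maketrans(SUBSTITUTES)


def substitute(text: str) -> str:
    """Replace the "special" Windows characters by plain substitutes."""
    return text.translate(_TABLE)
```

The result is plain text, not HTML: characters like < and > produced by the substitution still need to be escaped before the text is placed in an HTML document, and markup-based substitutes such as superscript style have to be added separately.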

The article Window[s] Characters and HTML, based on an early version of this document, was published in Boardwatch in June 2000. The tone of the current document is somewhat different, since support for the use of these characters has become wider.

If you found this document useful, you might wish to check other documents on character problems in Web authoring by the same author.

Note to Finnish readers: Tämä dokumentti on laajennettu versio suomenkielisestä dokumentistani Mikrojen merkistöjen aiheuttamista ongelmista Webissä.