
Character Sets

A character set is a list of characters that may appear in a document, and a character encoding is a way of storing these characters on a computer as bits.
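To make the distinction concrete, here is a minimal Python sketch (Python is used only for illustration; the variable names are assumptions and not part of the original page). It takes one character, looks up its number in the character set, and shows that different encodings store that same character as different byte sequences.

```python
# One character has one number in the character set (here Unicode),
# but each encoding stores that number as a different sequence of bytes.
ch = "é"

code_point = ord(ch)                     # position in the character set
latin1_bytes = ch.encode("iso-8859-1")   # one byte
utf8_bytes = ch.encode("utf-8")          # two bytes
utf16_bytes = ch.encode("utf-16-be")     # two bytes, big-endian, no byte-order mark

print(code_point)     # 233
print(latin1_bytes)   # b'\xe9'
print(utf8_bytes)     # b'\xc3\xa9'
print(utf16_bytes)    # b'\x00\xe9'
```

The character set answers "which characters exist and what are their numbers", while the encoding answers "which bytes represent each number".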

 

Whenever you develop HTML documents you should declare the encoding they are stored in, for example with a META element in the document head:

 

<META HTTP-EQUIV="Content-type" CONTENT="text/html; charset=iso-8859-1">
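As a rough illustration of how a program might use such a declaration, the Python sketch below extracts the charset parameter from the CONTENT value and uses it to decode some bytes. The helper name `charset_from_content_type` and the fallback to ISO-8859-1 are assumptions made for this example; real browsers follow a more elaborate detection procedure.

```python
from email.message import Message


def charset_from_content_type(content: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = content
    # get_content_charset() returns the lower-cased charset, or None if absent.
    return msg.get_content_charset() or default


declared = charset_from_content_type("text/html; charset=iso-8859-1")

# The byte 0xE9 only becomes "é" once we know which encoding was declared.
raw = b"caf\xe9"
text = raw.decode(declared)

print(declared)   # iso-8859-1
print(text)       # café
```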

 

Common encodings include the ISO-8859-x series, Shift_JIS and EUC-JP for Japanese, and the various Unicode encodings (UTF-7, UTF-8, UTF-16 etc.).

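Historically, the most common encoding used in HTML documents is ISO Latin-1 (otherwise known as ISO-8859-1). It was the document character set of HTML 2.0 and 3.2, and its 256 code positions coincide with the first 256 of ISO 10646 (Unicode), the document character set of the HTML 4.0 and XHTML 1.0 specifications developed by the World Wide Web Consortium (W3C).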

Windows also uses a range of proprietary encodings (e.g. Windows-1252 Western European) which are similar to some of the popular encodings (most notably ISO Latin-1).  However, there are a few significant incompatibilities. For example, Windows-1252 assigns printable characters (such as the euro sign and curly quotation marks) to code positions 128-159, whereas ISO Latin-1 reserves these positions for control characters and defines no printable characters there.
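The incompatibility can be seen by decoding the same bytes both ways. This is a small Python sketch using the standard `cp1252` and `iso-8859-1` codecs; the sample bytes are chosen for illustration.

```python
import unicodedata

# Three bytes from the 128-159 range.
raw = bytes([0x80, 0x93, 0x94])

# Windows-1252 maps them to printable characters.
as_cp1252 = raw.decode("cp1252")

# ISO Latin-1 maps the same bytes to C1 control characters (U+0080 to U+009F).
as_latin1 = raw.decode("iso-8859-1")
latin1_categories = [unicodedata.category(c) for c in as_latin1]

print(as_cp1252)           # €“”
print(latin1_categories)   # ['Cc', 'Cc', 'Cc']  (Cc = control character)
```

A page saved as Windows-1252 but labelled ISO-8859-1 therefore risks losing exactly these characters, typically the "smart quotes" inserted by word processors.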

 

Character references

Character references allow web authors to refer to characters using either:

  • a symbolic name (character entity references) or
  • their number, as specified in the document character set (numeric character references).

 

Character entity references

Character entity references allow you to use a simple, memorable name instead of a number to refer to a character.  

The benefits are: 

  1. People find names easier to remember than numbers (e.g. "quot" is more memorable than 34, the number that represents a quotation mark in a numeric character reference).
  2. Browsers handle character entity references more reliably than numeric character references, as character entity references can refer to a character without making assumptions about the character set or encoding. 

The disadvantages are: 

  1. Some of the names are difficult to remember, and it can even be difficult to decipher what they represent from their description (e.g. "raquo" stands for "right-pointing double angle quotation mark", while the single version is "rsaquo").
  2. Some browsers (e.g. early versions of Netscape) will not understand all the character entity references specified in HTML 4.0.

 

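The first character entity references date from HTML 2.0, which defined names for the markup-significant characters and the accented ISO Latin-1 letters; HTML 3.2 completed the set for the ISO Latin-1 characters.  In HTML 4, the list was extended to include symbols, mathematical symbols and Greek letters plus further markup-significant and internationalisation characters.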

 

The syntax for a character entity reference is an ampersand (&) followed by the name of the entity, followed by a semicolon (;):

 

<P>Special offer on &laquo;War &amp; Peace&raquo;. Price only &pound;0.99.</P>

 

Special offer on «War & Peace». Price only £0.99.
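The same substitution can be reproduced outside a browser. The Python sketch below uses the standard `html` module to resolve the entity references from the example above and to look up the code points behind a few names; it is an illustration only, not how any particular browser is implemented.

```python
import html
from html.entities import name2codepoint

source = "Special offer on &laquo;War &amp; Peace&raquo;. Price only &pound;0.99."

# Resolve every character entity reference to the character it names.
rendered = html.unescape(source)
print(rendered)   # Special offer on «War & Peace». Price only £0.99.

# Each HTML 4 entity name maps to one code point.
print(name2codepoint["quot"])    # 34
print(name2codepoint["raquo"])   # 187  (double angle quotation mark)
print(name2codepoint["rsaquo"])  # 8250 (single angle quotation mark)

# Going the other way, markup-significant characters must be escaped.
escaped = html.escape("War & Peace")
print(escaped)    # War &amp; Peace
```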

 

Numeric character references

Numeric character references use a number to refer to a character in the document character set. The number can be either a decimal or hexadecimal number.

The benefits are: 

  1. Numeric character references in decimal form are understood by all browsers that conform to the HTML 2 specification, whereas many character entity names were only standardised in HTML 3.2 and HTML 4.

The disadvantages are: 

  1. The numbers used for numeric character references are more difficult to remember than the simple symbolic names used in character entity references. 
  2. Numeric character references can cause problems with browsers that are not properly internationalised.

 

The syntax for numeric character references is an ampersand and a hash mark (&#), followed by a number in decimal, or the letter "x" and a number in hexadecimal, followed by a semicolon (;):

 

<P>&#8240; or &#x2030; displays the "per mille sign".</P>

 

‰ or ‰ displays the "per mille sign".
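The decimal and hexadecimal forms are simply two spellings of the same number (8240 equals hexadecimal 2030). The following Python sketch checks this and shows one way to generate such references; the helper names `to_numeric_refs` and `to_hex_ref` are assumptions introduced for this example.

```python
import html

decimal_ref = "&#8240;"
hex_ref = "&#x2030;"

# Both references resolve to the same character, the per mille sign.
assert html.unescape(decimal_ref) == html.unescape(hex_ref) == "‰"
assert int("2030", 16) == 8240


def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_hex_ref(ch: str) -> str:
    """Build the hexadecimal reference for a single character."""
    return f"&#x{ord(ch):X};"


print(to_numeric_refs("5‰ of £20"))   # 5&#8240; of &#163;20
print(to_hex_ref("‰"))                # &#x2030;
```

Text converted this way contains only ASCII characters, so it survives being saved or served in almost any encoding.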

 

 

 
© 2000 cloford.com. All rights reserved.