    ASCII
    One of the most common character encodings is ASCII (American Standard Code for
    Information Interchange).
    ASCII uses 7 bits to store each character code, giving a total of 128
    possible values. The first 32 codes (0 to 31) are reserved for control characters such
    as Line Feed, Carriage Return, and Form Feed (new page). Code 127 is interpreted as
    "delete" and was used in the past to delete information stored on tape and punch
    card media.

    Although ASCII contains only 128 characters (95 when the control characters and
    "delete" are excluded), it is sufficient for the English language.
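
    As a quick check of the arithmetic above, the following Python sketch counts the control and printable codes in the 7-bit range. The variable names are illustrative only.

```python
# Codes 0-31 and 127 are control characters; codes 32-126 are printable.
control_codes = [c for c in range(128) if c < 32 or c == 127]
printable_codes = [c for c in range(128) if 32 <= c <= 126]

print(len(control_codes))    # 33
print(len(printable_codes))  # 95

# Every ASCII code fits in 7 bits.
fits_in_7_bits = all(c.bit_length() <= 7 for c in range(128))
print(fits_in_7_bits)        # True

# Codes of three control characters named in the text:
# Line Feed, Carriage Return, Form Feed.
print(ord("\n"), ord("\r"), ord("\f"))  # 10 13 12
```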
    ISO 646 - Making ASCII International
    
        ASCII, although sufficiently expressive for English, lacks many of the
        characters found in other world languages such as German and Dutch. Many languages contain
        accented versions of normal Latin characters and additional characters not found in
        English. ASCII simply did not have the space to store these codes. To help resolve this
        issue, the International Organization for Standardization (ISO) created the 646
        specification in 1972. This specification defined a number of national variants of the ASCII
        character set for use in different countries. ISO 646 is still a 7-bit
        encoding. To make room for the additional characters, some of the symbols in ASCII
        were replaced. For instance, in the Danish version, 646-DK, the backslash character was
        replaced with Ø. 646-US is identical to US-ASCII.
        For the most part, the versions are compatible with one another. However, problems can
        arise when information is passed between different sets. For instance, if a 646-US text
        file containing the "[" character is moved to a 646-DE (German) system, the character
        will be displayed as "Ä".
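
        The mismatch described above can be sketched in a few lines of Python. The `ISO646_VARIANTS` table and the `decode_iso646` helper are hypothetical names invented for this illustration, and the table lists only three replaced positions per variant; the real national standards replace more.

```python
# Hypothetical, partial substitution tables: code position -> national character.
# Only the positions 0x5B, 0x5C and 0x5D ("[", "\", "]") are listed here.
ISO646_VARIANTS = {
    "646-US": {},
    "646-DE": {0x5B: "Ä", 0x5C: "Ö", 0x5D: "Ü"},
    "646-DK": {0x5B: "Æ", 0x5C: "Ø", 0x5D: "Å"},
}

def decode_iso646(data: bytes, variant: str) -> str:
    """Interpret 7-bit data according to one (partial) ISO 646 variant."""
    table = ISO646_VARIANTS[variant]
    chars = []
    for b in data:
        if b > 0x7F:
            raise ValueError("ISO 646 is a 7-bit encoding")
        # Replaced positions come from the table; everything else is plain ASCII.
        chars.append(table.get(b, chr(b)))
    return "".join(chars)

stored = b"list[0]"  # written on a 646-US system
print(decode_iso646(stored, "646-US"))  # list[0]
print(decode_iso646(stored, "646-DE"))  # listÄ0Ü
```

        The bytes never change; only their interpretation does, which is why the problem shows up when a file moves between systems.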
        
          
        Some ISO-646 Sets

        | Set | Country |
        |--------|---------------|
        | 646-CA | Canada |
        | 646-DE | Germany |
        | 646-DK | Denmark |
        | 646-GB | Great Britain |
        | 646-JP | Japan |
        | 646-KR | Korea |
        | 646-NO | Norway |
        | 646-SE | Sweden |
        | 646-US | United States |
        | 646-YU | Yugoslavia |
       
     
    Extended ASCII
    
      
    In many cases, different companies extended ASCII (or ISO-646) to the full 8 bits
    available in a byte, creating a total of 256 possible values. However, the values from
    128 to 255 varied greatly from company to company. For instance, in the early 1980s each
    computer platform used the 128-255 range for its own particular needs, depending on the
    intended market.
    The Mattel Aquarius (sometimes called the worst computer of all time) was
    designed primarily for games and home use (it was ill-suited to both). The extended
    characters contained cartoons, explosions, and box-drawing graphics. These characters
    were the only graphics capability of the system.
    The IBM-PC was designed to be easily used in different countries for
    both business and science. As a result, the extended characters contained additional Latin
    characters, mathematical symbols, and symbols for drawing graphical boxes. When
    necessary, characters in the 128-255 range were modified for different
    languages. Like ISO 646, different versions of the IBM-PC/DOS encoding were created for
    different countries and different languages.
    These different versions of Extended ASCII were created by IBM and Microsoft
    and are generally known as "Code Pages". The chart below contains a
    number of the different character encodings that were used throughout the world. The
    characters between 0-127 followed the ISO 646 encodings, meaning that the characters
    would not always match.
    In most cases, the various Code Pages became the de facto standard.
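
    To see how the same byte changes meaning between code pages, here is a small Python sketch. It relies on the `cp437` and `cp866` codecs that ship with Python's standard library; the `decode_all` helper is a name made up for this example.

```python
def decode_all(raw: bytes, codecs_to_try):
    """Decode the same bytes under several code pages (illustrative helper)."""
    return {name: raw.decode(name) for name in codecs_to_try}

# 0x41 is in the shared lower range; 0x82 is in the "extended" range.
raw = bytes([0x41, 0x82])
decoded = decode_all(raw, ["cp437", "cp866"])

print(decoded["cp437"])  # Aé  (English / original IBM-PC code page)
print(decoded["cp866"])  # AВ  (Cyrillic code page; the second letter is Cyrillic "Ve")
```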
        
          
        DOS Code Pages

        | Code page | Language / region |
        |-------|--------------------|
        | CP437 | English |
        | CP737 | Greek |
        | CP775 | Baltic |
        | CP850 | Latin |
        | CP852 | Latin (Revised) |
        | CP855 | Cyrillic |
        | CP857 | Turkish |
        | CP860 | Portuguese |
        | CP861 | Icelandic |
        | CP862 | Hebrew |
        | CP863 | Canada |
        | CP864 | Arabic |
        | CP865 | Nordic |
        | CP866 | Cyrillic (Revised) |
        | CP869 | Greek (Revised) |
       
     
    The following two diagrams contain the Extended ASCII character codes for the IBM-PC
    and the Mattel Aquarius. Note the differences between the characters in the
    "extended" range. Both companies also assigned additional characters to the 0 - 31
    range. When printed, these codes acted as normal control characters, but they could be
    POKE'd to the screen to display the graphical characters.
    
      
          
        [Figure: CP437 - IBM-PC / DOS Extended ASCII]
        [Figure: Mattel Aquarius Extended ASCII]
       
     
    ISO 8859 - The 8-Bit Solution
    
      
        The existence of different, slightly incompatible versions of ISO-646
        and different versions of extended ASCII made it difficult to transport text between
        systems. The first attempt to resolve this issue was made in 1987 by the International
        Organization for Standardization (ISO). The ISO 8859 specifications were not designed to
        create a single uniform character set, but to avoid the incompatibilities of ISO-646.
        To accomplish this, rather than using only the first 7 bits of each byte, the 8859
        character sets use all 8 bits, giving a total of 256 codes.
        ISO could have accepted IBM-PC/DOS Extended ASCII as an international standard but
        instead decided to create a new encoding.
        The codes between 0 and 127 were set to the same values as in US-ASCII. This allowed easy
        portability of text, given that the lower 7 bits would be identical regardless of
        platform. The codes from 128 to 255, however, were specialized for different languages.
        While the first 128 codes are shared between the sets, the remaining 128 codes are
        not.
        ISO has published sets numbered 1 through 16 since 1987. The chart below
        contains each of the ISO 8859 sets along with its primary and secondary names.
        The 8859-12 set was rejected by the organization and numbering continued at 13, which
        leaves 15 sets in total. ISO 8859-15 was a revision of Latin-1 (a.k.a. "Western").
        Various characters were replaced with those in higher demand, such as the Euro symbol.
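
        The split between a shared lower half and a language-specific upper half can be illustrated with Python's built-in ISO 8859 codecs (`iso8859_1`, `iso8859_5`, `iso8859_7`, `iso8859_15`). The variable names are illustrative only.

```python
# 0x41 is in the shared 0-127 range; 0xE9 is in the language-specific 128-255 range.
raw = bytes([0x41, 0xE9])

results = {name: raw.decode(name) for name in ("iso8859_1", "iso8859_5", "iso8859_7")}
for name, text in results.items():
    print(name, text)
# The first character is always "A"; the second differs per set,
# e.g. "é" in Latin-1 and the Greek letter iota in 8859-7.

# Latin-9 (8859-15) replaced the generic currency sign of Latin-1 with the Euro sign.
currency_latin1 = bytes([0xA4]).decode("iso8859_1")   # "¤"
currency_latin9 = bytes([0xA4]).decode("iso8859_15")  # "€"
print(currency_latin1, currency_latin9)
```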
        
          
        ISO 8859

        | Set | Names |
        |---------|-----------------------|
        | 8859-1 | Latin-1, Western |
        | 8859-2 | Latin-2, C. Europe |
        | 8859-3 | Latin-3, S. Europe |
        | 8859-4 | Latin-4, N. Europe |
        | 8859-5 | Cyrillic |
        | 8859-6 | Arabic |
        | 8859-7 | Greek |
        | 8859-8 | Hebrew |
        | 8859-9 | Latin-5, Turkish |
        | 8859-10 | Latin-6, Nordic |
        | 8859-11 | Thai |
        | 8859-12 | Does not exist |
        | 8859-13 | Latin-7, Baltic |
        | 8859-14 | Latin-8, Celtic |
        | 8859-15 | Latin-9, Rev. Latin-1 |
        | 8859-16 | Latin-10, S.E. Europe |
       
     
    Windows-1252
    Microsoft modified the ISO 8859-1 character set for use in its Windows operating system.
    The codes between 128 and 159, which ISO 8859-1 reserves for control characters, were
    reassigned to commonly needed printable characters, including
    the Euro symbol (€) and the trademark symbol (™). This
    set is commonly, and inaccurately, referred to as "ANSI".

    Essentially, Windows-1252 is a superset of ISO 8859-1. Windows versions 3.1, 95, 98 and
    ME use this character set. The NT and XP series are based on Unicode.
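
    The difference in the 128-159 range can be checked with Python's `cp1252` and `iso8859_1` codecs. This is only a sketch; the variable names are illustrative.

```python
raw = bytes([0x80, 0x99])

as_cp1252 = raw.decode("cp1252")     # printable characters: Euro and trademark signs
as_latin1 = raw.decode("iso8859_1")  # the same bytes are control characters here

print(as_cp1252)        # €™
print(repr(as_latin1))  # '\x80\x99'

# From 160 to 255 the two encodings agree.
upper_range_matches = all(
    bytes([b]).decode("cp1252") == bytes([b]).decode("iso8859_1")
    for b in range(0xA0, 0x100)
)
print(upper_range_matches)  # True
```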
    Unicode - The Universal Code
    After the ISO 8859 standard was created, it became apparent that it was ill-suited for
    transmitting information between different languages. To resolve this problem, as well as
    the problems with earlier encodings, work began on a "universal" coding
    system. The system is called Unicode.
    The Unicode Consortium, which is based in Mountain View, California (near San
    Francisco), published "The Unicode Standard" in 1991. The primary premise of the
    Unicode system is that each character should have a single and unique code. This value,
    called a "code point", would be used universally, regardless of where in the world
    the system is used.
    The original Unicode standard set the coding system to 16 bits, giving a total of
    65536 possible code points. At the time, this was thought to be more than sufficient
    space to include all the characters for every language on the planet and still leave
    plenty of room for future expansion. The problems that plagued ASCII, IBM-PC/DOS
    Extended ASCII, ISO 646 and ISO 8859 would not affect Unicode.
    The characters that shared the 128-255 range in ISO 8859 were given unique code points
    by the Unicode system. The first 256 codes are identical to ISO 8859-1 and conversion
    between the two is simple. The characters themselves were organized into different ranges
    within the 65536 code range. For instance, Greek characters are stored between 880 and
    1023 (0x370 and 0x3FF); Hebrew characters are stored between 1424 and 1535 (0x590 and
    0x5FF).
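
    A short Python sketch of these two points, using the ranges quoted above. The `in_range` helper and the constant names are invented for this example.

```python
GREEK_RANGE = (0x370, 0x3FF)
HEBREW_RANGE = (0x590, 0x5FF)

def in_range(ch: str, bounds) -> bool:
    """True if the character's code point lies inside the inclusive range."""
    low, high = bounds
    return low <= ord(ch) <= high

print(ord("α"), in_range("α", GREEK_RANGE))   # 945 True
print(ord("א"), in_range("א", HEBREW_RANGE))  # 1488 True

# The first 256 code points match ISO 8859-1 byte for byte,
# so converting between the two is a direct mapping.
latin1_matches = all(bytes([i]).decode("iso8859_1") == chr(i) for i in range(256))
print(latin1_matches)  # True
```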
    Unicode was developed at the same time as many of the later ISO 8859 standards. It
    has since replaced them on most modern operating systems. The Unicode Consortium
    works with the International Organization for Standardization (ISO) on the Unicode standard.
    However, the ISO/IEC 10646 standard is considered a subset of the Unicode standard. While
    ISO/IEC 10646 contains the same code points as Unicode, it does not contain additional
    information such as how the character is displayed and other metrics. In other words, ISO
    simply validated the Unicode Consortium standard for international use.
    Beyond 16-bit
    
      
        In 2001, the Unicode Consortium released version 3.1 of the Unicode
        standard. By this point, the 16-bit code range had been expanded to 21 bits, which
        made it possible to store over 1 million different code points. The system was
        subdivided into different logical "planes" that contain different broad classes
        of characters. The initial 65536 code points of Unicode were organized into the Basic
        Multilingual Plane (BMP). This set includes most characters that are part of modern
        written languages, along with common symbols.
        
          
        Unicode Character Planes

        | Plane | Name |
        |----------|------------------------------------|
        | Plane 0 | Basic Multilingual Plane |
        | Plane 1 | Supplementary Multilingual Plane |
        | Plane 2 | Supplementary Ideographic Plane |
        | Plane 14 | Supplementary Special-purpose Plane (nonrecommended tags) |
        | Plane 15 | Open to private use |
        | Plane 16 | Open to private use |
       
     
    Plane 1, the Supplementary Multilingual Plane (SMP), is used to store characters that
    are part of historical languages such as Linear B. Musical and rare mathematical
    characters are also stored here. 
    Plane 2, the Supplementary Ideographic Plane (SIP), is used to store over
    40,000 rare historical Chinese characters. 
    Plane 14 is used to store a number of nonrecommended and experimental tag symbols. The
    nature of this plane is nebulous and will, no doubt, change over time. 
    Planes 15 and 16 are open for private use. 
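
    Because each plane holds 65536 code points, the plane of a code point is simply the code point shifted right by 16 bits. A minimal sketch, where `plane_of` is a name chosen for this example:

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0-16) that contains the code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16

print(plane_of(ord("A")))   # 0  (Basic Multilingual Plane)
print(plane_of(0x10000))    # 1  (first code point of the Linear B block)
print(plane_of(0x20000))    # 2  (start of the Supplementary Ideographic Plane)
print(plane_of(0xE0001))    # 14 (a tag character)
print(plane_of(0x10FFFF))   # 16 (last code point, private use)
```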
    Unicode Character Encoding
    
      
        Since Unicode is a multiple-byte encoding standard, byte ordering is of vital importance. The
        Unicode Consortium defined a number of Unicode Transformation Formats (UTF) to
        encode characters. These include UTF-7, UTF-8, UTF-16 and UTF-32. The International
        Organization for Standardization (ISO), in the ISO/IEC 10646 specification, also defined
        two encoding forms of its Universal Character Set (UCS), UCS-2 and UCS-4, to store
        Unicode code points.
        Essentially, UCS-2 is a subset of UTF-16, and UCS-4 is equivalent to UTF-32.
        
          
        Unicode Encoding Methods

        | Encoding | Size |
        |--------|-----------------------|
        | UCS-2 | 16 bits only |
        | UCS-4 | 32 bits only |
        | UTF-7 | 7 bits with override |
        | UTF-8 | 8 bits with override |
        | UTF-16 | 16 bits with override |
       
     
    UCS-2 and UCS-4
    UCS-2, like the original version of Unicode, is a fixed 16-bit format. As expected, UCS-2 is
    only able to store the Basic Multilingual Plane (the first 65536 Unicode code points).
    UCS-4 encoding uses a total of 32 bits to store each character code. The full Unicode
    encoding can currently be represented with only 21 bits, which makes UCS-4 a particularly
    inefficient format. However, since computers generally store integer values in sizes that
    are powers of 2, 32-bit integers are common on practically all platforms while 24-bit
    variants are exceedingly rare.
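
    The two fixed-width formats can be sketched as follows. `fits_in_ucs2` and `encode_ucs4_be` are hypothetical helpers written for this illustration; the result is compared against Python's built-in `utf-32-be` codec, which produces the same bytes.

```python
def fits_in_ucs2(ch: str) -> bool:
    """UCS-2 can only hold code points of the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF

def encode_ucs4_be(text: str) -> bytes:
    """Store every code point as a 4-byte big-endian integer."""
    return b"".join(ord(ch).to_bytes(4, "big") for ch in text)

print(fits_in_ucs2("€"))           # True  (U+20AC is in the BMP)
print(fits_in_ucs2(chr(0x1D11E)))  # False (musical symbol in Plane 1)

encoded = encode_ucs4_be("A€")
print(encoded.hex(" "))            # 00 00 00 41 00 00 20 ac
print(encoded == "A€".encode("utf-32-be"))  # True
```

    Note that 11 of the 32 bits in every unit are always zero, which is the inefficiency mentioned above.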
    UTF-16
    UTF-16 is almost identical to the UCS-2 format, with some very important exceptions.
    This format usually stores each character code using 2 bytes like UCS-2, but also provides
    override sequences (called surrogate pairs) for encoding characters that are not part of
    the Basic Multilingual Plane. This allows the system to represent the normal Unicode
    characters using 16 bits while still being able to represent characters in Plane 1,
    Plane 2 and the other supplementary planes.
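
    The surrogate-pair arithmetic can be sketched in Python as follows. `utf16_units` is a name invented for this example, and the result is checked against Python's built-in `utf-16-be` codec.

```python
def utf16_units(code_point: int) -> list:
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if code_point <= 0xFFFF:
        return [code_point]               # BMP: a single unit, same as UCS-2
    offset = code_point - 0x10000         # 20 bits remain
    high = 0xD800 + (offset >> 10)        # upper 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)       # lower 10 bits -> low surrogate
    return [high, low]

print([hex(u) for u in utf16_units(0x41)])     # ['0x41']
print([hex(u) for u in utf16_units(0x1D11E)])  # ['0xd834', '0xdd1e']

# Cross-check against the standard library codec.
units = utf16_units(0x1D11E)
as_bytes = b"".join(u.to_bytes(2, "big") for u in units)
print(as_bytes == chr(0x1D11E).encode("utf-16-be"))  # True
```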
    
      
        UTF-16 also supports different byte ordering sequences.
        To accomplish this, a transmitted UTF-16 string can be preceded by a Byte Order Mark (BOM)
        which tells the decoder the byte ordering of the Unicode code points that follow. The BOM
        is 2 bytes long, with one byte containing FF and the other containing FE. The byte
        sequence FF FE alerts the decoder that the information is stored in Little Endian; the
        sequence FE FF is for Big Endian.
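
    A decoder can inspect the first two bytes to pick the byte order. The sketch below uses the `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF16_BE` constants from Python's standard library; `detect_utf16_byte_order` is a helper name made up for this example.

```python
import codecs

def detect_utf16_byte_order(data: bytes) -> str:
    """Return 'little', 'big', or 'unknown' based on a leading BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    return "unknown"

little = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")

print(little.hex(" "), detect_utf16_byte_order(little))  # ff fe 41 00 little
print(big.hex(" "), detect_utf16_byte_order(big))        # fe ff 00 41 big

# Python's generic "utf-16" codec reads the BOM and strips it.
print(little.decode("utf-16"), big.decode("utf-16"))     # A A
```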
       
     
    Since most commonly used Unicode characters are part of the Basic Multilingual
    Plane, UCS-2 is usually sufficient. However, UTF-16 has the benefit of providing a method
    for supporting the full Unicode encoding. As a result, UTF-16 is predominantly used in
    many systems that support Unicode. Windows NT / XP uses UTF-16 internally, whereas Linux
    systems typically use UTF-8.

    Note that the most significant byte (the second listed in Little Endian) will contain
    0 for all ASCII characters.
    UTF-8 and UTF-7
    
      
        The UTF-8 format supports a number of override sequences such that each
        code in the Unicode encoding can be represented as a series of one to four 8-bit
        bytes. UTF-8 was designed specifically so that a string can be represented without any
        issues caused by byte ordering. The encoding also does not conflict with ASCII control
        characters, because byte values in the ASCII range never appear inside a multi-byte
        sequence. This means that the string can be stored in legacy programs that are strictly
        based on ASCII and use the null character to terminate strings. UTF-8 is popular for
        transmitting Unicode information over the Internet and, more notably, e-mail.
        Unfortunately, support for Unicode varies among e-mail clients, and most information
        sent via e-mail uses ISO 8859 or Windows-1252.
        UTF-7 is a 7-bit variant of UTF that uses a combination of Base64 (used in MIME) and
        override characters. However, since HTML-style encoding can also represent any Unicode
        code point, UTF-7 is rarely, if ever, used.
        
          
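        UTF-8 Encoding

        | Code point range (hex) | Byte sequence (binary) |
        |------------------|-------------------------------------|
        | 0000 ... 007F | 0xxxxxxx |
        | 0080 ... 07FF | 110xxxxx 10xxxxxx |
        | 0800 ... FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
        | 10000 ... 10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |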
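
    The bit patterns in the UTF-8 table translate almost directly into code. The following Python sketch is a minimal illustration, not a production encoder (for instance, it does not reject surrogate code points); `encode_utf8` is a name chosen for this example, and the output is compared with Python's built-in `utf-8` codec.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point using the bit patterns from the UTF-8 table."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for cp in (0x41, 0xE9, 0x20AC, 0x1D11E):
    encoded = encode_utf8(cp)
    # Every multi-byte sequence consists only of bytes >= 0x80,
    # so it can never be mistaken for an ASCII character or a null terminator.
    print(f"U+{cp:04X}", encoded.hex(" "), encoded == chr(cp).encode("utf-8"))
```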
       
     
    Special thanks to Mike Brown & Robert van Loenhout for their help.