Title:    From ASCII to Unicode
Author:   Alexander Arkhipov
Created:  2024-03-07
Modified: 2024-03-08

Every programmer should know how character encoding works. Unfortunately,
this topic is rarely brought up, and so many developers remain ignorant
about an issue that is not actually that hard. So let's discuss character
encodings!

Character encoding is a problem much older than computers. Telegraphy used
some really wacky 5-bit encodings that included one or more special
characters for switching character sets on the fly. If you are curious,
Google is your friend. But in this article I will start with the oldest
encoding relevant to modern computer systems -- ASCII. After that I'll
write a little about some of the pre-Unicode encodings, and, finally,
about Unicode and UTF-8.

ASCII

ASCII stands for American Standard Code for Information Interchange. It's
a 7-bit (though bytes on modern hardware are 8 bits, more on that later)
character encoding invented a very long time ago for paper terminals (or
teletypes, as they were known then). If you don't know what a teletype is,
imagine a typewriter (if you don't know what a typewriter is, I'm afraid I
can't help you). Now imagine that it's plugged into a computer. And that
it's used for all input/output. You just imagined a teletype!

So, if you ever wondered why ASCII has so many weird control characters
(like what even are DC[1-4] (0x11-0x14)?), this is part of the reason. In
particular, to this day we have to deal with two end-of-line characters:
carriage return (0x0d) and line feed (0x0a). That's because on typewriters
the paper was mounted on a moving part called the "carriage". Instead of
just pressing "Enter" like on modern computers, you would first return the
carriage to the first column (carriage return), and then move the paper
down one line (line feed). The idea carried over to teletypes.

Let's now look at the technical properties of ASCII. To do that, consider
the following table of every ASCII character with its binary code:

0000 0000 nul  0010 0000 sp   0100 0000 @    0110 0000 `
0000 0001 soh  0010 0001 !    0100 0001 A    0110 0001 a
0000 0010 stx  0010 0010 "    0100 0010 B    0110 0010 b
0000 0011 etx  0010 0011 #    0100 0011 C    0110 0011 c
0000 0100 eot  0010 0100 $    0100 0100 D    0110 0100 d
0000 0101 enq  0010 0101 %    0100 0101 E    0110 0101 e
0000 0110 ack  0010 0110 &    0100 0110 F    0110 0110 f
0000 0111 bel  0010 0111 '    0100 0111 G    0110 0111 g
0000 1000 bs   0010 1000 (    0100 1000 H    0110 1000 h
0000 1001 ht   0010 1001 )    0100 1001 I    0110 1001 i
0000 1010 lf   0010 1010 *    0100 1010 J    0110 1010 j
0000 1011 vt   0010 1011 +    0100 1011 K    0110 1011 k
0000 1100 ff   0010 1100 ,    0100 1100 L    0110 1100 l
0000 1101 cr   0010 1101 -    0100 1101 M    0110 1101 m
0000 1110 so   0010 1110 .    0100 1110 N    0110 1110 n
0000 1111 si   0010 1111 /    0100 1111 O    0110 1111 o
0001 0000 dle  0011 0000 0    0101 0000 P    0111 0000 p
0001 0001 dc1  0011 0001 1    0101 0001 Q    0111 0001 q
0001 0010 dc2  0011 0010 2    0101 0010 R    0111 0010 r
0001 0011 dc3  0011 0011 3    0101 0011 S    0111 0011 s
0001 0100 dc4  0011 0100 4    0101 0100 T    0111 0100 t
0001 0101 nak  0011 0101 5    0101 0101 U    0111 0101 u
0001 0110 syn  0011 0110 6    0101 0110 V    0111 0110 v
0001 0111 etb  0011 0111 7    0101 0111 W    0111 0111 w
0001 1000 can  0011 1000 8    0101 1000 X    0111 1000 x
0001 1001 em   0011 1001 9    0101 1001 Y    0111 1001 y
0001 1010 sub  0011 1010 :    0101 1010 Z    0111 1010 z
0001 1011 esc  0011 1011 ;    0101 1011 [    0111 1011 {
0001 1100 fs   0011 1100 <    0101 1100 \    0111 1100 |
0001 1101 gs   0011 1101 =    0101 1101 ]    0111 1101 }
0001 1110 rs   0011 1110 >    0101 1110 ^    0111 1110 ~
0001 1111 us   0011 1111 ?    0101 1111 _    0111 1111 del

In no particular order, here are some observations:

- The most significant bit is always 0.

- del (127/0x7f/0b01111111) and everything below sp (32/0x20/0b00100000)
  is a control character. Everything else is a normal printable character.

- It's obvious why '0' != 0: the character '0' is actually 48 (0x30),
  while nul == 0. If you want to compare digit characters with integers,
  subtract '0' (e.g. '9' - '0' == 9).

- The regular expression [a-Z] doesn't make any sense at all (the range
  runs backwards), and [A-z] is synonymous with [A-Z[\\\]^_`a-z]. What you
  usually want is [A-Za-z] or [a-zA-Z].

- Lowercase characters are formed by setting the third most significant
  bit of the corresponding uppercase character to 1, like so:

      0100 0001 A
      0110 0001 a
      --^

- Each character in the third column corresponds to a control character in
  the first column, but with the second most significant bit set, like so:

      0001 1011 esc
      0101 1011 [
      -^

  This is the origin of the Unix notation ^X, where X is a character from
  the third column (e.g. esc can be represented as ^[). It's also the
  reason why you can get these control characters by pressing the
  corresponding key while holding Ctrl (e.g. pressing Ctrl+[ to get esc).
  A small C sketch below demonstrates a few of these observations.
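These properties are easy to verify in code. Here is a minimal C sketch;
it relies only on the bit patterns shown in the table above:

    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
        /* Digits: subtracting '0' (0x30) turns a digit character into
           its numeric value. */
        assert('9' - '0' == 9);

        /* Case: upper- and lowercase letters differ only in the 0x20 bit. */
        assert(('A' | 0x20) == 'a');
        assert(('a' & ~0x20) == 'A');

        /* Control characters: holding Ctrl keeps only the five least
           significant bits of the key's code, so Ctrl+[ yields
           '[' (0x5b) & 0x1f == 0x1b == esc. */
        assert(('[' & 0x1f) == 0x1b);

        printf("all observations hold\n");
        return 0;
    }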
OTHER SINGLE-BYTE CHARACTER ENCODINGS

As discussed above, modern computers use 8-bit bytes, but ASCII only
really needs 7, so the most significant bit is always 0, and there remain
128 "empty" values. Well, as it turned out, American English is not the
only language in the world, and some people need a way to represent
characters such as ß, ç and Ω. So they created their own ASCII-compatible
character encodings with 128 extra characters. Here are a few such
encodings:

- ISO 8859-1 (Latin1)
- KOI8-R
- Windows-1252 (CP-1252)
- Windows-1251 (CP-1251)

Although these encodings did solve some problems, they also introduced a
lot of new ones, such as:

- People were still limited to only a few languages per file. You couldn't
  write an English text with a few German *and* Russian words.

- For identical character sets there existed several different encodings.
  That reduced interoperability between operating systems.

- If you were to get a file from "somewhere", you would not necessarily be
  able to read it. At the very least you would have to guess the file's
  encoding.

Fortunately, nowadays you are very unlikely to get some random file in
anything other than UTF-8. Unfortunately, if you have to work with
PostScript or matrix printers, you may still have to deal with the legacy.
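When you do, the usual tool on a Unix system is iconv: either the iconv(1)
utility (e.g. iconv -f ISO-8859-1 -t UTF-8 file.txt, with file.txt being
whatever legacy file you were handed) or the C interface declared in
<iconv.h>. As a rough sketch (the accepted encoding names, and whether you
need to link with -liconv, vary a little between platforms), converting a
Latin1 string to UTF-8 looks something like this:

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "Straße" encoded in ISO 8859-1: ß is the single byte 0xdf. */
        char in[] = "Stra\xdf" "e";
        char out[64] = {0};

        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof(out) - 1;

        iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
        if (cd == (iconv_t)-1) {
            perror("iconv_open");
            return 1;
        }
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
            perror("iconv");
            return 1;
        }
        iconv_close(cd);

        /* out now holds "Straße" in UTF-8, where ß is the two bytes
           0xc3 0x9f. */
        printf("%s\n", out);
        return 0;
    }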
DOUBLE-BYTE CHARACTER ENCODINGS

It turns out that there are scripts for which 128 extra characters are
just not enough. For this reason, before Unicode, the CJK (Chinese,
Japanese, Korean) languages used encodings that represent a character with
two bytes. Unfortunately, I am not an expert on CJK languages or their
representation. If you want to know more, I suggest doing your own
research.

UNICODE AND UTF-8

Thankfully, Unicode was invented so that every script one might reasonably
want on a computer can be represented using a single character set. Here's
the Unicode code point for the Hebrew letter Aleph (א): U+05D0. U+ means
"Unicode", and 05D0 is the character's number (its "code point") in
hexadecimal. For ASCII characters the number is simply the ASCII code:
U+0041 is A. Code points can be larger than two bytes; the highest
possible code point is U+10FFFF.

Code points are distinct from graphemes. For instance, ñ could be
represented as U+00F1, or as U+006E U+0303 (two code points). That's
because U+006E is plain ASCII n, and U+0303 is a combining character
called COMBINING TILDE. Combining characters are not supposed to represent
an independent grapheme, but to modify the grapheme before them.

Unicode itself is not an encoding, but a standard whose code points can be
represented with several encodings. If you use Unix, the only one you
should really care about is UTF-8. UTF-8 is a variable-length encoding
that represents a code point with one, two, three, or four bytes. The
encoding is as follows:

U+0000  - U+007F:                     0XXXXXXX (one byte, ASCII-compatible)
U+0080  - U+07FF:                     110XXXXX 10XXXXXX (two bytes)
U+0800  - U+D7FF and U+E000 - U+FFFF: 1110XXXX 10XXXXXX 10XXXXXX (three bytes)
U+10000 - U+10FFFF:                   11110XXX 10XXXXXX 10XXXXXX 10XXXXXX (four bytes)

How to operate on UTF-8 text will depend on the programming language and
environment. For C you may wish to see my [locales in C] article.
Alternatively you may want to use the [ICU] library.

CONCLUSION

I usually avoid conclusions and instead skip to further reading, but the
information in this article is a bit less straightforward than I usually
like. So, the key points are:

- The entire world agrees on Unicode and UTF-8.

- Unicode covers most writing systems on Earth.

- UTF-8 is ASCII-compatible. This means that if the most interesting thing
  your program does with characters is checking for '\0' and '\n', you
  probably don't need to modify it.

- For anything as advanced as "how many (human-readable) characters does
  this string have", you'll have to use special functions, possibly from a
  third-party library.

SEE ALSO

- Man pages ascii(7), utf8(7), iconv(1), locale(1), setlocale(3)

- The Absolute Minimum Every Software Developer Absolutely, Positively
  Must Know About Unicode and Character Sets (No Excuses!) by Joel
  Spolsky:
  https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

REFERENCES

[locales in C]
  https://manpager.org/usr/aa/art/014.locales_in_c.txt
  gopher://manpager.org/0/usr/aa/art/014.locales_in_c.txt
  gemini://manpager.org/usr/aa/art/014.locales_in_c.txt

[ICU]
  https://icu.unicode.org/