Unicode

Harold Nelson

10/18/2020

Goals

This presentation will give you enough comfort with Unicode and UTF-8 to be functional in almost all circumstances.

The simple way to think about this topic is to see it as an extension of the character numbering system in the ASCII table. One major difference is that we speak of the character numbers as “code points”. The highest legitimate character number / code point for ASCII characters is 127.

We saw that the functions chr() and ord() enabled us to relate a character written on our keyboard to its code point value.

Exercise

Use the ord() function to obtain the code point value for uppercase A.

Answer

ord("A")
## 65

Use the chr() function to see the character represented by code point 65.

Answer

chr(65)
## 'A'

Decimal to Hex

Use the hex() function to obtain a hexadecimal representation of the decimal integer 65.

Answer

hex(65)
## '0x41'

Note that 65 and 0x41 are different ways to represent the same underlying value. Can you use the hexadecimal representation with the chr() function?

chr(0x41)
## 'A'

Quotes

Can you leave the quote marks around 0x41?

Answer

# chr('0x41')
# No. chr() requires an integer, so passing the string '0x41' raises a TypeError.
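If the hexadecimal value arrives as a string, one option is to convert it with the built-in int() before calling chr(). This is a minimal sketch; the variable names are chosen only for illustration.

```python
# Hypothetical input: a hex value held as text, for example read from a file.
hex_text = "0x41"

# int() with base 16 accepts the optional 0x prefix.
code_point = int(hex_text, 16)
letter = chr(code_point)

print(code_point)  # 65
print(letter)      # A
```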

Beyond ASCII

The big thing to remember is that the usage of chr() and ord() remains the same for the entire range of Unicode characters. Visit https://home.unicode.org/ and look at the third character in the fifth row. It looks like a wishbone. What is written under it?

Answer

U+4EBA

Using chr()

You can use the hexadecimal number you see following U+ to display the Unicode character or insert it into a string.

chr(0x4EBA)
## '人'
s = "look at this " + chr(0x4EBA) + "!"
s
## 'look at this 人!'
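Python string literals also accept a \u escape followed by exactly four hex digits, which may be handier than concatenating with chr(). A short sketch, where the names s_concat and s_escape are illustrative only:

```python
# Build the same string two ways and compare them.
s_concat = "look at this " + chr(0x4EBA) + "!"
s_escape = "look at this \u4eba!"  # \u takes exactly four hex digits

print(s_escape)
print(s_concat == s_escape)  # True
```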

Using ord()

Now that you have the glyph for the code point in machine-readable form on your screen, you can get its code point value using ord().

ord('人')
## 20154

Note that ord() returns an integer, which Python displays in decimal. If you want the hex value you started with, use hex().

hex(20154)
## '0x4eba'
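If you would rather see the value in the U+ notation used on the Unicode site, an f-string format specifier is one way to produce it. This sketch assumes nothing beyond built-ins; the names person and label are illustrative.

```python
person = "人"

# 04X means: uppercase hexadecimal, zero-padded to at least four digits.
label = f"U+{ord(person):04X}"
print(label)  # U+4EBA

# Going back: drop the "U+" prefix and parse the rest as base 16.
print(int(label[2:], 16))  # 20154
```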

Back to chr()

You can also use the decimal representation with chr() to get the glyph.

chr(20154)
## '人'

What is it?

After you import unicodedata, you can see what the character is.

import unicodedata
unicodedata.name('人')
## 'CJK UNIFIED IDEOGRAPH-4EBA'

That’s not terribly informative, but at least you know it’s found in the CJK (Chinese, Japanese, Korean) area. From my limited knowledge, I think it represents a walking person.

The Whole Thing

Visit http://www.unicode.org/charts/ and poke around to use what you’ve learned.

UTF-8

There are many different ways you could represent Unicode characters digitally. One possibility is that every character could be represented in 32 bits.

If most of your data is ASCII, this would expand the size of your files and your memory requirements by a factor of 4.
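The factor of 4 can be checked by encoding the same ASCII text both ways. This is a small sketch using Python's built-in str.encode(); the codec name "utf-32-be" is chosen here because it writes no byte order mark, so the count is exactly four bytes per character.

```python
text = "hello"  # five ASCII characters

fixed_width = text.encode("utf-32-be")  # four bytes per character, no BOM
variable_width = text.encode("utf-8")   # one byte per ASCII character

print(len(fixed_width))     # 20
print(len(variable_width))  # 5
```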

UTF-8 is based on the idea that 8 bits, one byte, would be the starting point. If the character is ASCII, that single byte is the final result. However, if the character is not ASCII, its first byte has a numerical value > 127. That tells the software to read the following bytes as well (a character takes at most four bytes in total) to determine what the character is.

The software must have this logic built in to use UTF-8 data.
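Python has this logic built in through str.encode() and bytes.decode(). The sketch below shows the bytes UTF-8 produces for an ASCII character and for the character used earlier; the variable names are illustrative only.

```python
ascii_bytes = "A".encode("utf-8")
person_bytes = "人".encode("utf-8")

print(ascii_bytes)         # b'A'  (one byte, value 65)
print(person_bytes)        # b'\xe4\xba\xba'  (three bytes)
print(list(person_bytes))  # [228, 186, 186]  (every value is > 127)

# Decoding reverses the process and recovers the original character.
print(person_bytes.decode("utf-8"))  # 人
```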