Harold Nelson
10/18/2020
This presentation will give you enough comfort with Unicode and UTF-8 to be functional in almost all circumstances.
The simple way to think about this topic is to see it as an extension of the character numbering system in the ASCII table. One major difference is that we speak of the character numbers as “code points”. The highest legitimate character number / code point for ASCII characters is 127.
We saw that the functions chr() and ord() enabled us to relate a character written on our keyboard to its code point value.
Use the ord() function to obtain the code point value for uppercase A.
Use the hex() function to obtain a hexadecimal representation of the decimal integer 65.
## '0x41'
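A minimal sketch of the two steps above, assuming a standard Python 3 interpreter. The variable names are illustrative only.

```python
# Code point of uppercase A, returned as a decimal integer
code_point = ord("A")
print(code_point)   # 65

# Hexadecimal representation of the same integer, returned as a string
hex_text = hex(code_point)
print(hex_text)     # 0x41
```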
Note that 65 and 0x41 are different ways to represent the same underlying value. Can you use the hexadecimal representation with the chr() function?
Can you leave the quote marks around 0x41?
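One way to explore both questions is sketched below. The variable names are illustrative, and the int() conversion at the end is an extra step not mentioned above.

```python
# Without quote marks, 0x41 is an integer literal equal to 65
from_literal = chr(0x41)
print(from_literal)      # A
print(0x41 == 65)        # True

# With quote marks, '0x41' is a string, and chr() only accepts integers
try:
    chr("0x41")
    error_seen = False
except TypeError as err:
    error_seen = True
    print("TypeError:", err)

# int() with base 16 turns the string back into an integer first
from_string = chr(int("0x41", 16))
print(from_string)       # A
```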
The big thing to remember is that the usage of chr() and ord() remains the same for the entire range of Unicode characters. Visit https://home.unicode.org/ and look at the third character in the fifth row. It looks like a wishbone. What is written under it?
U+4EBA
You can use the hexadecimal number you see following U+ to display the Unicode character or insert it into a string.
## '人'
## 'look at this 人!'
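The two results above could be produced along these lines. This is a sketch with illustrative variable names. The \u escape in the last statement is an alternative way to write the same character inside a string literal.

```python
# Build the character from the hexadecimal number that follows U+
person = chr(0x4EBA)
print(person)

# Insert it into a string by concatenation
sentence = "look at this " + person + "!"
print(sentence)

# The same string written with a \u escape (four hex digits)
escaped = "look at this \u4eba!"
print(escaped == sentence)   # True
```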
Now that you have the glyph for the code point in machine-readable form on your screen, you can get its code point value using ord().
## 20154
Note that ord() returns a decimal representation of the value. If you want the hex value you started with, use hex().
## '0x4eba'
You can also use the decimal representation with chr() to get the glyph.
## '人'
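The round trip described above, from glyph to decimal to hex and back, can be sketched as follows. The variable names are illustrative.

```python
glyph = "人"

# ord() gives the code point as a decimal integer
decimal_value = ord(glyph)
print(decimal_value)             # 20154

# hex() recovers the hexadecimal form that followed U+
hex_value = hex(decimal_value)
print(hex_value)                 # 0x4eba

# chr() accepts the decimal value and returns the glyph
round_trip = chr(decimal_value)
print(round_trip == glyph)       # True
```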
After you import unicodedata, you can see what the character is.
## 'CJK UNIFIED IDEOGRAPH-4EBA'
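A short sketch of the lookup, using the standard library's unicodedata.name(). The second call, on an ASCII letter, is added only for comparison.

```python
import unicodedata

# Official Unicode name of the character
cjk_name = unicodedata.name("人")
print(cjk_name)      # CJK UNIFIED IDEOGRAPH-4EBA

# ASCII characters have names too
ascii_name = unicodedata.name("A")
print(ascii_name)    # LATIN CAPITAL LETTER A
```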
That’s not terribly informative, but at least you know it’s found in the CJK (Chinese, Japanese, Korean) area. From my limited knowledge, I think it represents a walking person.
Visit http://www.unicode.org/charts/ and poke into it to use what you’ve learned.
There are many different ways you could represent Unicode characters digitally. One possibility is that every character could be represented in 32 bits.
If most of your data is ASCII, this would expand the size of your files and your memory requirements by a factor of 4.
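To see the factor of 4 concretely, the sketch below encodes a sample ASCII string both ways. The sample text is arbitrary, and "utf-32-le" is chosen here so that no byte order mark is added to the count.

```python
text = "Hello, Unicode!"   # ASCII-only sample text

# Fixed-width encoding: 32 bits (4 bytes) for every character
utf32_bytes = text.encode("utf-32-le")

# UTF-8: one byte for every ASCII character
utf8_bytes = text.encode("utf-8")

print(len(text), len(utf8_bytes), len(utf32_bytes))
print(len(utf32_bytes) // len(utf8_bytes))   # 4
```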
UTF-8 is based on the idea that 8 bits, one byte, is the starting point. If the character is ASCII, that single byte is the final result. If the character is not ASCII, its first byte has a numerical value > 127, and the leading bits of that byte tell the software how many of the following bytes belong to the same character.
The software must have this logic built in to use UTF-8 data.
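The sketch below illustrates that logic. The helper sequence_length() is hypothetical, written only for this example. Real decoders, including Python's own, also validate the following bytes, which this sketch does not.

```python
def sequence_length(first_byte):
    """Hypothetical helper: number of bytes in a UTF-8 sequence, judged from its first byte."""
    if first_byte < 0x80:    # 0xxxxxxx: ASCII, a single byte
        return 1
    if first_byte < 0xC0:    # 10xxxxxx: a continuation byte, not a valid first byte
        raise ValueError("continuation byte cannot start a character")
    if first_byte < 0xE0:    # 110xxxxx: one more byte follows
        return 2
    if first_byte < 0xF0:    # 1110xxxx: two more bytes follow
        return 3
    if first_byte < 0xF8:    # 11110xxx: three more bytes follow
        return 4
    raise ValueError("not a valid UTF-8 first byte")

lengths = {}
for ch in ["A", "é", "人"]:
    encoded = ch.encode("utf-8")
    lengths[ch] = sequence_length(encoded[0])
    # Show each byte in hex next to the length predicted from the first byte
    print(ch, [hex(b) for b in encoded], lengths[ch])
```

For each sample character, the length predicted from the first byte matches the number of bytes that encode() actually produced.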