suppressPackageStartupMessages(library("tidyverse"))
1. What are the most important arguments to locale()?
The locale object has arguments to set the following:
- date and time formats: date_names, date_format, and time_format
- time zone: tz
- numbers: decimal_mark, grouping_mark
- encoding: encoding
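As a quick illustration, each of these arguments ends up as an element of the object that locale() returns. The sketch below assumes readr is installed; the name fr_locale and the particular values (French date names, Europe/Paris, ISO-8859-1) are arbitrary choices for the example, not settings taken from the exercise.

```r
library(readr)

# Illustrative values only; none of them come from the exercise itself.
fr_locale <- locale(
  date_names = "fr",          # language used for month and weekday names
  date_format = "%d/%m/%Y",   # default format used when parsing dates
  time_format = "%H:%M",      # default format used when parsing times
  decimal_mark = ",",
  grouping_mark = ".",
  tz = "Europe/Paris",
  encoding = "ISO-8859-1"
)

# The settings are stored as list elements on the locale object.
fr_locale$decimal_mark
fr_locale$tz
fr_locale$encoding
```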
2. What happens if you try and set decimal_mark and grouping_mark to the same character? What happens to the default value of grouping_mark when you set decimal_mark to ","? What happens to the default value of decimal_mark when you set the grouping_mark to "."?
If the decimal and grouping marks are set to the same character, locale throws an error:
#locale(decimal_mark = ".", grouping_mark = ".")
#> Error: `decimal_mark` and `grouping_mark` must be different
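The call above is commented out so that the notebook keeps running. One way to observe the error without stopping the script is to wrap the call in tryCatch(); this is a minimal sketch, and the variable name result is only illustrative.

```r
library(readr)

# Capture the error instead of letting it halt execution.
result <- tryCatch(
  locale(decimal_mark = ".", grouping_mark = "."),
  error = function(e) conditionMessage(e)
)

# `result` now holds the error message as a character string.
result
```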
If the decimal_mark is set to the comma “,”, then the grouping mark is set to the period “.”:
locale(decimal_mark = ",")
<locale>
Numbers: 123.456,78
Formats: %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days: Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday (Thu), Friday
(Fri), Saturday (Sat)
Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May), June (Jun),
July (Jul), August (Aug), September (Sep), October (Oct), November (Nov),
December (Dec)
AM/PM: AM/PM
If the grouping mark is set to a period ".", then the decimal mark is set to a comma ",":
locale(grouping_mark = ".")
<locale>
Numbers: 123.456,78
Formats: %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days: Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday (Thu), Friday
(Fri), Saturday (Sat)
Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May), June (Jun),
July (Jul), August (Aug), September (Sep), October (Oct), November (Nov),
December (Dec)
AM/PM: AM/PM
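The paired defaults can also be checked directly on the locale object, and their effect seen with parse_number(). This is a small sketch assuming readr is installed; the input strings and the names euro and us are made up for the example.

```r
library(readr)

# Setting one mark flips the default of the other.
locale(decimal_mark = ",")$grouping_mark   # "."
locale(grouping_mark = ".")$decimal_mark   # ","

# The same number written with European and US conventions.
euro <- parse_number("1.234,56", locale = locale(decimal_mark = ","))
us <- parse_number("1,234.56")

# Both parse to the same numeric value, 1234.56.
euro == us
```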
4. If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.
Read the help page for locale() using ?locale to learn about the different variables that can be set.
As an example, consider Australia. Most of the default values are valid, except that the date format is “(d)d/mm/yyyy”, meaning that January 2, 2006 is written as 02/01/2006.
However, the default locale cannot parse that date at all: its default date format (%AD) expects a year-month-day layout such as 2006-01-02, so the result is a parsing failure and NA.
parse_date("02/01/2006")
1 parsing failure.
row col expected actual
1 -- date like 02/01/2006
[1] NA
To correctly parse Australian dates, define a new locale object.
au_locale <- locale(date_format = "%d/%m/%Y")
Using parse_date() with the au_locale as its locale will correctly parse our example date.
parse_date("02/01/2006", locale = au_locale)
[1] "2006-01-02"
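The same locale works on whole vectors, and for a one-off parse the format can be passed straight to parse_date() instead. The sketch below redefines au_locale so that it runs on its own; the second date and the names dates and same are made up for illustration.

```r
library(readr)

au_locale <- locale(date_format = "%d/%m/%Y")

# Parse several day-first dates with the locale's default date format.
dates <- parse_date(c("02/01/2006", "25/12/2006"), locale = au_locale)
dates

# Alternative: give the format directly, without a custom locale.
same <- parse_date("02/01/2006", format = "%d/%m/%Y")
same
```

A locale object is the more convenient choice when the same convention applies to many columns or files, since it can be passed once to functions such as read_csv().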
5. What’s the difference between read_csv() and read_csv2()?
The delimiter. The function read_csv() uses a comma, while read_csv2() uses a semi-colon (;). Using a semi-colon is useful when commas are used as the decimal point (as in much of Europe), and read_csv2() accordingly treats the comma as the decimal mark when parsing numbers.
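To make the difference concrete, the sketch below reads the same small table written in both conventions. The inline data and the names comma_file, semicolon_file, df1, and df2 are invented for the example; a string containing a newline is treated by readr as literal data rather than a file path.

```r
library(readr)

# The same two rows, once with "," as delimiter and "." as decimal mark,
# once with ";" as delimiter and "," as decimal mark.
comma_file <- "x,y\n1.5,a\n2.25,b\n"
semicolon_file <- "x;y\n1,5;a\n2,25;b\n"

# col_types = cols() lets readr guess the types without printing them.
df1 <- read_csv(comma_file, col_types = cols())
df2 <- read_csv2(semicolon_file, col_types = cols())

# Both data frames contain the same numeric column x.
df1$x
df2$x
```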
6. What are the most common encodings used in Europe? What are the most common encodings used in Asia? Do some googling to find out.
UTF-8 is standard now, and ASCII has been around forever.
For the European languages, there are separate encodings for Romance languages and Eastern European languages using Latin script, Cyrillic, Greek, Hebrew, Turkish: usually with separate ISO and Windows encoding standards. There is also Mac OS Roman.
For Asian languages, Arabic and Vietnamese have ISO and Windows standards. The other major Asian scripts have their own:
- Japanese: JIS X 0208, Shift JIS, ISO-2022-JP
- Chinese: GB 2312, GBK, GB 18030
- Korean: KS X 1001, EUC-KR, ISO-2022-KR
The list in the documentation for stringi::stri_enc_detect() is a good reference, since it covers the most common encodings:
- Western European Latin script languages: ISO-8859-1, Windows-1252 (also known as CP-1252, for code page)
- Eastern European Latin script languages: ISO-8859-2, Windows-1250
- Greek: ISO-8859-7
- Turkish: ISO-8859-9, Windows-1254
- Hebrew: ISO-8859-8, IBM424, Windows-1255
- Russian: Windows-1251
- Japanese: Shift JIS, ISO-2022-JP, EUC-JP
- Korean: ISO-2022-KR, EUC-KR
- Chinese: GB18030, ISO-2022-CN (Simplified), Big5 (Traditional)
- Arabic: ISO-8859-6, IBM420, Windows-1256
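For more information on character encodings, see the following sources:
- The Wikipedia page Character encoding (https://en.wikipedia.org/wiki/Character_encoding), which has a good list of encodings
- The Unicode CLDR project (http://cldr.unicode.org/)
- What is the most common encoding of each language (https://stackoverflow.com/questions/8509339/what-is-the-most-common-encoding-of-each-language) (Stack Overflow)
- “What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text” (http://kunststube.net/encoding/)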
Programs that identify the encoding of text include:
- readr::guess_encoding()
- stringi::stri_enc_detect()
- iconv
- chardet (Python)
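As a rough sketch of how detection and re-encoding fit together in readr, the example below builds a string whose bytes are Latin-1 (byte 0xF1 is "ñ" in ISO-8859-1), asks guess_encoding() for candidates, and then parses it with an explicit encoding. The sample string and the names x, guesses, and fixed are assumptions made for illustration; detection on short inputs is heuristic, so the guesses should be treated as hints rather than answers.

```r
library(readr)

# A string stored as Latin-1 bytes; it prints incorrectly if read as UTF-8.
x <- "El Ni\xf1o was particularly bad this year"

# guess_encoding() accepts a raw vector and returns a data frame of
# candidate encodings with a confidence score for each.
guesses <- guess_encoding(charToRaw(x))
guesses

# Once the encoding is known, pass it through the locale.
fixed <- parse_character(x, locale = locale(encoding = "ISO-8859-1"))
fixed
```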