suppressPackageStartupMessages(library("tidyverse"))
package ‘tidyverse’ was built under R version 3.6.3
1. Split up a string like “apples, pears, and bananas” into individual components.
x <- c("apples, pears, and bananas")
str_split(x, ", +(and +)?")[[1]]
[1] "apples" "pears" "bananas"
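To see what each part of the pattern contributes, the sketch below runs it on a few variations of the input. The pattern `", +(and +)?"` matches a comma, one or more spaces, and then an optional "and" followed by one or more spaces. The extra test strings and the variable names (`pattern`, `extra_spaces`, `no_serial_comma`) are illustrative assumptions, not part of the original exercise.

```r
library(stringr)

# The pattern from the solution: comma, spaces, optional "and" plus spaces
pattern <- ", +(and +)?"

# Original input
original <- str_split("apples, pears, and bananas", pattern)[[1]]

# Hypothetical input with irregular spacing: " +" absorbs the extra spaces
extra_spaces <- str_split("apples,  pears,   and   bananas", pattern)[[1]]

# Hypothetical input without a comma before "and": the pattern requires
# a comma, so "pears and bananas" stays as one piece
no_serial_comma <- str_split("apples, pears and bananas", pattern)[[1]]

print(original)
print(extra_spaces)
print(no_serial_comma)
```

The last case shows a limitation of this pattern: it only splits where a comma is present, so a list written without a comma before "and" would need a different pattern.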
2. Why is it better to split up by boundary("word") than " "?
Splitting by boundary("word") is a more sophisticated way to split a string into words. It recognizes non-space punctuation that separates words, and it removes that punctuation while keeping internal non-letter characters that are part of a word, e.g., the apostrophe in “can’t”. See the ICU website for a description of the rules used to determine word boundaries.
Consider this sentence from the official Unicode Report on word boundaries,
sentence <- "The quick (“brown”) fox can’t jump 32.3 feet, right?"
Splitting the string on spaces groups the punctuation with the words,
str_split(sentence, " ")
[[1]]
[1] "The" "quick" "(“brown”)" "fox" "can’t" "jump" "32.3"
[8] "feet," "right?"
However, splitting the string using boundary("word") correctly removes punctuation, while not separating “32.3” and “can’t”,
str_split(sentence, boundary("word"))
[[1]]
[1] "The" "quick" "brown" "fox" "can’t" "jump" "32.3" "feet" "right"
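The difference between the two approaches can also be checked programmatically. The following is a minimal sketch that uses an ASCII-only variant of the sentence (straight quotes and plain parentheses, an assumption made here to avoid encoding issues) and hypothetical variable names. It lists the tokens that only the space-based split produces, and it uses `str_count()` with `boundary("word")` to count words.

```r
library(stringr)

# Hypothetical ASCII variant of the sentence above (no curly quotes)
ascii_sentence <- "The quick (brown) fox can't jump 32.3 feet, right?"

# Splitting on a single space keeps punctuation attached to the tokens
by_space <- str_split(ascii_sentence, " ")[[1]]

# Splitting on word boundaries drops the punctuation
by_word <- str_split(ascii_sentence, boundary("word"))[[1]]

# Tokens that appear only in the space-based split
only_in_space_split <- setdiff(by_space, by_word)
print(only_in_space_split)

# boundary("word") also works with other stringr functions,
# for example to count the words in a string
word_count <- str_count(ascii_sentence, boundary("word"))
print(word_count)
```

The tokens reported by `setdiff()` should be the ones that carry punctuation, while "can't" and "32.3" should appear unchanged in both results.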
3. What does splitting with an empty string ("") do? Experiment, and then read the documentation.
str_split("ab. cd|agt", "")[[1]]
[1] "a" "b" "." " " "c" "d" "|" "a" "g" "t"
It splits the string into individual characters.
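The stringr documentation for the `pattern` argument states that an empty pattern is equivalent to `boundary("character")`. A short sketch to confirm this on the same input (the variable names are illustrative):

```r
library(stringr)

x <- "ab. cd|agt"

# Split with an empty pattern
by_empty <- str_split(x, "")[[1]]

# Split explicitly on character boundaries
by_char <- str_split(x, boundary("character"))[[1]]

# Compare the two results element by element
same_result <- identical(by_empty, by_char)
print(same_result)
```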