suppressPackageStartupMessages(library("tidyverse"))

1. Split up a string like “apples, pears, and bananas” into individual components.

x <- c("apples, pears, and bananas")
str_split(x, ", +(and +)?")[[1]]
[1] "apples"  "pears"   "bananas"
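
The pattern above relies on a comma appearing before "and". As a hedged variation (not part of the original exercise), the sketch below uses an alternative pattern, chosen here purely for illustration, that also handles lists written without the serial comma. It assumes no list item contains the standalone word "and".

```r
library(stringr)

# Illustrative pattern (an assumption, not from the exercise):
#   ",? +and +"  matches an optional comma followed by " and "
#   ", +"        matches a plain comma separator
sep <- ",? +and +|, +"

with_serial_comma    <- "apples, pears, and bananas"
without_serial_comma <- "apples, pears and bananas"

# Both spellings split into the same three items
str_split(with_serial_comma, sep)[[1]]
str_split(without_serial_comma, sep)[[1]]

# The original pattern leaves "pears and bananas" as a single piece
str_split(without_serial_comma, ", +(and +)?")[[1]]
```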

2. Why is it better to split up by boundary("word") than " "?

Splitting by boundary("word") is a more sophisticated way to split a string into words. It recognizes non-space punctuation that separates words, and it removes that punctuation while retaining internal non-letter characters that are part of a word, e.g., “can’t”. See the ICU website for a description of the rules used to determine word boundaries.

Consider this sentence from the official Unicode Report on word boundaries,

sentence <- "The quick (“brown”) fox can’t jump 32.3 feet, right?"

Splitting the string on spaces groups the punctuation with the words,

str_split(sentence, " ")
[[1]]
[1] "The"       "quick"     "(“brown”)" "fox"       "can’t"     "jump"      "32.3"     
[8] "feet,"     "right?"   

However, splitting the string using boundary("word") correctly removes punctuation, while not separating “32.3” and “can’t”,

str_split(sentence, boundary("word"))
[[1]]
[1] "The"   "quick" "brown" "fox"   "can’t" "jump"  "32.3"  "feet"  "right"
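
To make the difference concrete, here is a small sketch on an ASCII-only sentence (a made-up example, not taken from the Unicode report). It checks which pieces still begin or end with punctuation after each kind of split.

```r
library(stringr)

# Made-up example sentence, used only for illustration
s <- "Well, it's 3.5 miles (roughly)."

by_space <- str_split(s, " ")[[1]]
by_word  <- str_split(s, boundary("word"))[[1]]

# Pieces that begin or end with a punctuation character
edge_punct <- "^[[:punct:]]|[[:punct:]]$"
by_space[str_detect(by_space, edge_punct)]
by_word[str_detect(by_word, edge_punct)]

# boundary("word") also works with other stringr functions, e.g. counting words
str_count(s, boundary("word"))
```

Splitting on spaces leaves pieces such as "Well," with trailing punctuation, while the word-boundary split should leave none and still keep "it's" and "3.5" intact.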

3. What does splitting with an empty string ("") do? Experiment, and then read the documentation.

str_split("ab. cd|agt", "")[[1]]
 [1] "a" "b" "." " " "c" "d" "|" "a" "g" "t"

It splits the string into individual characters.
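
The str_split() documentation describes an empty pattern as equivalent to boundary("character"). A minimal check of that statement, reusing the string from above:

```r
library(stringr)

x <- "ab. cd|agt"

chars_empty    <- str_split(x, "")[[1]]
chars_boundary <- str_split(x, boundary("character"))[[1]]

# Both calls should return the same character vector
identical(chars_empty, chars_boundary)

# One piece per character in the string
length(chars_empty) == str_length(x)
```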
