library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(stringi)

Task 1

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

college_majors <- read.csv("majors-list.csv")

data_or_stats_majors <- college_majors %>% filter(grepl('DATA|STATISTICS',Major))
data_or_stats_majors

##   FOD1P                                         Major          Major_Category
## 1  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 2  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics
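
The filter logic can be checked without the CSV file. The following is a minimal base R sketch, assuming a small hand-built data frame: the first three rows are copied from the output above, and the fourth row is an illustrative non-matching entry added for contrast.

```r
# Illustrative subset: three rows from the output above plus one
# non-matching row added for contrast (not read from majors-list.csv).
majors_demo <- data.frame(
  FOD1P = c(6212L, 2101L, 3702L, 1100L),
  Major = c("MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "GENERAL AGRICULTURE"),
  stringsAsFactors = FALSE
)

# The alternation DATA|STATISTICS keeps a row if either word appears in Major.
hits <- majors_demo[grepl("DATA|STATISTICS", majors_demo$Major), ]
print(hits$Major)
```

Because the pattern is unanchored, it matches the words anywhere in the major name, which is what the task asks for.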

Task 2

Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

Step 1

Load raw data

berries_raw <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"

[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  

[9] "elderberry"   "lime"         "lychee"       "mulberry"    

[13] "olive"        "salal berry"'

Step 2

Use the stri_extract_all_regex function from the stringi library to extract all of the quoted values. The regex matches an opening double quote, then zero or more characters that are not double quotes, then a closing double quote. The function returns a list rather than a vector, so unlist is used to convert the result into a character vector.

berries <- unlist(stri_extract_all_regex(berries_raw, '"[^"]*"'))

Step 3

Use the gsub function to remove the surrounding double quotes from each extracted value.

berries <- gsub('"', "", berries)
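
Steps 2 and 3 can also be sketched in base R alone. This is an alternative to the stringi call above, not a replacement for it; the shortened input string `raw_demo` is a made-up sample in the same printed-vector layout.

```r
# Shortened sample in the same layout as the printed vector (illustrative).
raw_demo <- '[1] "bell pepper"  "bilberry"\n[5] "blueberry"'

# gregexpr() locates every quoted run; regmatches() pulls the text out.
# The result is a list with one element per input string, hence [[1]].
quoted <- regmatches(raw_demo, gregexpr('"[^"]*"', raw_demo))[[1]]

# Strip the literal double quotes; fixed = TRUE treats '"' as plain text.
fruits <- gsub('"', "", quoted, fixed = TRUE)
print(fruits)
```

The index markers such as [1] and [5] are skipped automatically because they are not enclosed in quotes.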

Step 4

Compare the result with the expected result

expected <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

expected == berries

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
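
The element-wise comparison can be reduced to a single TRUE or FALSE, and the vector can be rendered as the c(...) text the task describes. The following is a sketch on a three-element sample; the names `berries_demo` and `as_code` are illustrative.

```r
berries_demo <- c("bell pepper", "bilberry", "lime")

# identical() returns one logical value and also checks length and type.
same <- identical(berries_demo, c("bell pepper", "bilberry", "lime"))

# Build the c("...", "...") text by quoting each element and joining with ", ".
as_code <- paste0("c(", paste0('"', berries_demo, '"', collapse = ", "), ")")
cat(as_code, "\n")

# Parsing and evaluating the text should give the original vector back.
round_trip <- eval(parse(text = as_code))
```

Base R's dput() prints a similar representation directly; the manual version is shown only to make the construction explicit.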

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

Task 3

Describe, in words, what these expressions will match:

(.)\1\1

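(.) matches any single character except newline
\1 written with a single backslash in an R string, this is the octal escape for the non-printing control character U+0001, not a backreference
\1 the same again: a second literal U+0001 character

Example matches would be “a\1\1” or “7\1\1”

Taken completely literally, the expression would not match anything: it has no double quotes around it, so R would report a syntax error. If the backslashes were escaped, as in "(.)\\1\\1", each \\1 would instead be a backreference to the captured character and the pattern would match the same character three times in a row, such as “aaa”.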
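
The difference between the two readings can be checked directly. This is a small base R sketch, assuming the expression is placed inside an R string; the variable names are illustrative.

```r
# "\\1" in R source is two characters, a backslash and a 1:
# the regex engine sees a backreference to group 1.
backref <- "(.)\\1\\1"

# "\1" in R source is an octal escape: one control character, U+0001.
octal <- "(.)\1\1"

len_escaped <- nchar("\\1")
code_point  <- utf8ToInt("\1")

inputs <- c("aaa", "a\1\1")
backref_hits <- grepl(backref, inputs, perl = TRUE)
octal_hits   <- grepl(octal,   inputs, perl = TRUE)
```

Under these assumptions the backreference form matches only “aaa”, while the single-backslash form matches only the string containing the two control characters.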

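"(.)(.)\\2\\1"

(.) matches any character except newline and captures it as group 1
(.) matches any character except newline and captures it as group 2
\\2 is a backreference that matches the text captured by group 2 again
\\1 is a backreference that matches the text captured by group 1 again

In other words, it matches two characters followed by the same two characters in reverse order, anywhere in the string.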

Example matches would be “aaaa” or “abba” but not “aabb”
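
The examples for this pattern can be verified with a short base R check. It is a sketch that uses grepl with perl = TRUE rather than str_view, so it runs without stringr.

```r
# Pattern written as an R string, so each backreference needs two backslashes.
palindrome4 <- "(.)(.)\\2\\1"

candidates <- c("aaaa", "abba", "aabb")
palindrome_hits <- grepl(palindrome4, candidates, perl = TRUE)

# Name the results so the printed vector is easy to read.
print(setNames(palindrome_hits, candidates))
```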

(..)\1

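(..) capture group matching any two characters except newline
\1 matches the non-printing character U+0001 (a single-backslash octal escape in an R string), not the captured pair

Example matches would be “ab\1”, “11\1”, or “a\1\1” but not “abab”

Again, this expression is not surrounded by double quotes, so taken literally it would fail with a syntax error. With escaped backslashes, "(..)\\1" would match any two characters repeated, such as “abab”.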
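
One way to type the expression exactly as printed and still get a backreference is a raw string. The sketch below assumes R 4.0 or later, where the r"(...)" syntax is available.

```r
# In a raw string, backslashes are not processed as escapes, so \1 reaches
# the regex engine as a backreference (requires R >= 4.0).
raw_pattern <- r"((..)\1)"

# The conventional spelling of the same pattern with a doubled backslash.
escaped_pattern <- "(..)\\1"

same_pattern <- identical(raw_pattern, escaped_pattern)
pair_hits <- grepl(raw_pattern, c("abab", "abba"), perl = TRUE)
```

Here “abab” repeats the pair “ab”, while “abba” contains no two-character run that is immediately repeated.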

"(.).\\1.\\1"

(.) capture group matching any character except newline
. matches any character except newline
\\1 matches the character captured by group 1
. matches any character except newline
\\1 matches the character captured by group 1 again

This matches any run of five characters in which the first, third, and fifth characters are the same, with any characters (except newline) in positions 2 and 4.

Examples: “a1a2a” and “aaaaa” match; “axaxx” does not.

"(.)(.)(.).*\\3\\2\\1"

(.) capture group 1, matching any character except newline
(.) capture group 2, matching any character except newline
(.) capture group 3, matching any character except newline
. matches any character except newline
* matches zero or more of the preceding .
\\3 matches the character captured by group 3
\\2 matches the character captured by group 2
\\1 matches the character captured by group 1

Matches strings of at least six characters in which three characters are later followed by the same three characters in reverse order, with any number of characters (except newline) in between.

Examples: “abc23456cba”, “aaaaaaaaa”
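
The last two expressions can be checked the same way. This base R sketch writes both as R strings with doubled backslashes; the extra inputs “abccba” and “abcabc” are added here for illustration.

```r
# First, third, and fifth characters equal.
alternating <- "(.).\\1.\\1"
alternating_hits <- grepl(alternating, c("a1a2a", "aaaaa", "axaxx"), perl = TRUE)

# Three characters, anything in between, then the same three reversed.
mirrored <- "(.)(.)(.).*\\3\\2\\1"
mirrored_inputs <- c("abc23456cba", "aaaaaaaaa", "abccba", "abcabc")
mirrored_hits <- grepl(mirrored, mirrored_inputs, perl = TRUE)

print(setNames(mirrored_hits, mirrored_inputs))
```

“abccba” shows the minimum length of six, where .* matches nothing; “abcabc” fails because the trailing three characters are not reversed.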

Task 4

Construct regular expressions to match words that:

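Start and end with the same character.

"^(.).*\\1$"

The ^ and $ anchors force the captured first character to also be the last character of the word; without them the pattern matches any word that merely contains a repeated character, such as “church”. The word “area” is added to the test vector so that at least one word matches.

str_view(c("church","banana","apple","area"), "^(.).*\\1$")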
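
The effect of anchoring can be seen by comparing the pattern with and without ^ and $. This is a base R sketch using grepl; the word “area” is an added example that really does start and end with the same letter.

```r
words <- c("church", "banana", "apple", "area")

# Unanchored: succeeds whenever some character occurs twice anywhere.
unanchored_hits <- grepl("(.).*\\1", words, perl = TRUE)

# Anchored: the captured character must be first and last in the word.
anchored_hits <- grepl("^(.).*\\1$", words, perl = TRUE)

print(data.frame(words, unanchored_hits, anchored_hits))
```

Note that the anchored form needs at least two characters, so a one-letter word would not match it.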

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

".*(.)(.).*\\1\\2.*"

str_view(c("church","banana","apple"), ".*(.)(.).*\\1\\2.*")
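
A shorter pattern appears to give the same yes/no answer, since a single two-character group can stand in for the two one-character groups. This base R sketch compares both forms; the name `short_form` is illustrative.

```r
words <- c("church", "banana", "apple")

# Pattern as used above: two single-character groups, padded with .*
long_form <- grepl(".*(.)(.).*\\1\\2.*", words, perl = TRUE)

# Equivalent idea: capture the pair as one group and look for it again.
short_form <- grepl("(..).*\\1", words, perl = TRUE)

print(data.frame(words, long_form, short_form))
```

The leading and trailing .* do not change which words match; they only widen the highlighted region in str_view.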

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

".*(.).*\\1.*\\1.*"

str_view(c("church","banana","apple","eleven"), ".*(.).*\\1.*\\1.*")
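
The same four words can be tested without stringr. This is a base R sketch, with the surrounding .* dropped because grepl only reports whether a match exists.

```r
words <- c("church", "banana", "apple", "eleven")

# One captured character that shows up at least two more times later on.
triple_hits <- grepl("(.).*\\1.*\\1", words, perl = TRUE)

print(setNames(triple_hits, words))
```

“church” and “apple” fail because no letter in them occurs more than twice.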

Homework 3

Nick Oliver