Problem 1

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset, provide code that identifies the majors that contain either “DATA” or “STATISTICS”

majors <- read_csv(url("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"), show_col_types = FALSE)

data_or_stat <- majors |>
  filter(str_detect(Major, "DATA") | str_detect(Major, "STATISTICS"))

knitr::kable(data_or_stat)
FOD1P Major Major_Category
6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business
2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
3702 STATISTICS AND DECISION SCIENCE Computers & Mathematics

Problem 2

Write code that transforms the data below:

[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"  
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
[9] "elderberry"   "lime"         "lychee"       "mulberry"  
[13] "olive"        "salal berry"  

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

Loading the Data

str1 <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"' 
str2 <- '[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"'
str3 <- '[9] "elderberry"   "lime"         "lychee"       "mulberry"'
str4 <- '[13] "olive"        "salal berry"'

full_string <- paste(str1, str2, str3, str4)

full_string
## [1] "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\" [5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\" [9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\" [13] \"olive\"        \"salal berry\""

Creating the Function

Initially I created the following function which replaces anything between the words with a comma, removes whatever is before the first entry or after the last entry, and then splits and formats the string:

formatString <- function(string) {
  temp <- gsub('(\\[\\d+\\])?(\\s)?\\"', "", gsub('\"(\\s+)?(\\[\\d+\\])?(\\s+)?"', ",", string))
  
  temp <- unlist(str_split(temp, ","))
  
  dput(temp)
}

formatString(full_string)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", 
## "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", 
## "lychee", "mulberry", "olive", "salal berry")

I also created an alternative function which instead extracts all the desired words from the string and then formats the string properly:

formatString <- function (string) {
  temp <- unlist(str_extract_all(string, '[a-zA-Z]+\\s?[a-zA_Z]+'))
  
  dput(temp)
}

formatString(full_string)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", 
## "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", 
## "lychee", "mulberry", "olive", "salal berry")

Both functions output the same desired format.

Problem 3

Describe, in words, what these expressions will match:

  • (.)\1\1

    This will match any three consecutive characters that are all the same (ex: 'ooo'). The (.) captures a character in a group and each \1 calls that group (i.e. matches to that same character).
  • "(.)(.)\\2\\1"

    This will match any four consecutive characters in the string where the last two characters are the same as the first two characters in reverse (ex: 'eppe' in pepper). Each (.) groups one characters and then \2 and \1 call those groups in the reverse.
  • (..)\1

    This will match any four consecutive characters where the first two are the same as the last two (ex: 'anan' in banana). (..) groups two consecutive characters and \1 matches the same two characters.
  • "(.).\\1.\\1"

    This will match any five consecutive characters where character 1, 3, and 5 are the same (ex: 'abana' in cabana). (.) groups a character, . can be any character, \1 matches the first character, . can again be any character, and \1 again matches the first character. 
  • "(.)(.)(.).*\\3\\2\\1"

    This will match any string that contains at least six characters where the last three characters are the same as the first three characters in reverse order (ex: 'abcanythingcba'). Each (.) captures a different character, .* matches any number of characters, and \3, \2, \1 match the first three characters in reverse order. 
test_string_3 <- c("coool", "banana", "cabana", "pepper", "boooo", "abcanythingcba", "deffed")

str_view(test_string_3, '(.)\\1\\1')
str_view(test_string_3, '(.)(.)\\2\\1')
str_view(test_string_3, '(..)\\1')
str_view(test_string_3, '(.).\\1.\\1')
str_view(test_string_3, '(.)(.)(.).*\\3\\2\\1')

Problem 4

Construct regular expressions to match words that:

  • Start and end with the same character.

    "^(.).*\\1$"

  • Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice).

    "(..).*\\1"

  • Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s).

    "(.).*\\1.*\\1"

test_string_4 <- c("stats", "church", "blurb", "eleven", "strawberry", "pepper")

str_view(test_string_4, '^(.).*\\1$')
str_view(test_string_4, '(..).*\\1')
str_view(test_string_4, '(.).*\\1.*\\1')