Overview

A good ol’ string manipulation hootenanny using R. Also, regex shenanigans.

College Majors Dataset

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

Alright, well, let’s start with getting the data.

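library(readr)    # read_csv()
library(stringr)  # str_view(), str_remove_all(), str_replace_all() used further down
majors <- read_csv(url("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"))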
## Rows: 174 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(majors)
## # A tibble: 6 × 3
##   FOD1P Major                                 Major_Category                 
##   <chr> <chr>                                 <chr>                          
## 1 1100  GENERAL AGRICULTURE                   Agriculture & Natural Resources
## 2 1101  AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3 1102  AGRICULTURAL ECONOMICS                Agriculture & Natural Resources
## 4 1103  ANIMAL SCIENCES                       Agriculture & Natural Resources
## 5 1104  FOOD SCIENCE                          Agriculture & Natural Resources
## 6 1105  PLANT SCIENCE AND AGRONOMY            Agriculture & Natural Resources

Now we pull out the majors whose names contain the substring DATA or STATISTICS. There are many ways to do this, both in base R and with other packages, such as the %like% operator from data.table.

majors$Major[grep("DATA|STATISTICS", majors$Major)]
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"
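
To illustrate that there is more than one way to do this filter, here is a minimal base R sketch of two alternatives to indexing with grep(). The vector sample_majors is a hypothetical stand-in for majors$Major (a few names copied from the output above plus two non-matching ones) so the example runs without downloading the CSV.

```r
# Hypothetical stand-in for majors$Major so the sketch runs offline
sample_majors <- c(
  "GENERAL AGRICULTURE",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "FOOD SCIENCE"
)

pattern <- "DATA|STATISTICS"

# Option 1: grep(value = TRUE) returns the matching strings directly
hits_value <- grep(pattern, sample_majors, value = TRUE)

# Option 2: grepl() returns a logical vector that works as a subset mask
hits_mask <- sample_majors[grepl(pattern, sample_majors)]

print(hits_value)
print(identical(hits_value, hits_mask))
```

Both options should give the same three majors; which one reads better is mostly a matter of taste.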

Fruit Data Manipulation

Write code that transforms the data below:

[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

My interpretation of this question is that we transform a string in the former format into a string in the latter. We’ll need to strip out the index marker at the start of each line, put everything on one line, trim the extra spaces, add the commas, wrap it all in parentheses, and finally prefix it with c.

origFruit <- r"--([1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
[9] "elderberry"   "lime"         "lychee"       "mulberry"    
[13] "olive"        "salal berry")--"
str_view(origFruit)
## [1] │ [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
##     │ [5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
##     │ [9] "elderberry"   "lime"         "lychee"       "mulberry"    
##     │ [13] "olive"        "salal berry"
newFruit <- str_remove_all(origFruit, "\\[\\d+\\]")        # drop the [n] index markers
newFruit <- str_replace_all(newFruit, "\\s+", " ")         # collapse newlines and runs of spaces
newFruit <- str_replace_all(newFruit, "\" \"", "\", \"")   # put a comma between adjacent items
newFruit <- trimws(newFruit)                               # remove leading/trailing whitespace
newFruit <- paste0("c(", newFruit, ")")                    # wrap the list in c( ... )
str_view(newFruit)
## [1] │ c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
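
As an alternative to stripping away everything that is not wanted, one could instead extract only the quoted items and rebuild the string from them. The following is a base R sketch of that idea, not the approach used above; the variable names (printed, items, as_code, fruits) are made up for illustration, and the input is shortened to two lines to keep the example small.

```r
# Shortened copy of the printed vector, in the same layout as origFruit
printed <- paste(
  '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"',
  '[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  ',
  sep = "\n"
)

# Pull out every double-quoted chunk, quotes included
items <- regmatches(printed, gregexpr('"[^"]*"', printed))[[1]]

# Join the chunks with commas and wrap them in c( ... )
as_code <- paste0("c(", paste(items, collapse = ", "), ")")
cat(as_code, "\n")

# Sanity check: the result parses back into a character vector
fruits <- eval(parse(text = as_code))
print(fruits)
```

Because only the quoted chunks are kept, the index markers and irregular spacing never need to be handled explicitly.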

RegEx Questions

Describe, in words, what these expressions will match:

  • (.)\1\1
    Character repeated three times consecutively anywhere in a string. e.g. aaa

  • “(.)(.)\\2\\1”
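    Read literally as a regex, quotes included: a quote, any two characters, the literal text \2\1, and a closing quote. e.g. “ab\2\1”. Read as an R string, the regex is (.)(.)\2\1: two characters followed by the same two characters in reverse order. e.g. abba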

  • (..)\1
    Two characters followed by the same two characters. e.g. abab

  • “(.).\\1.\\1”
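    Read literally as a regex, quotes included: a quote, any two characters, the literal text \1, any one character, the literal text \1 again, and a closing quote. e.g. “ab\1a\1”. Read as an R string, the regex is (.).\1.\1: a character that appears again two and four positions later, with any single character in each gap. e.g. abaca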

  • “(.)(.)(.).*\\3\\2\\1”
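    Read literally as a regex, quotes included: a quote, any three characters, any number of characters (including none), the literal text \3\2\1, and a closing quote. e.g. “abcDEFQRESXYZ\3\2\1”. Read as an R string, the regex is (.)(.)(.).*\3\2\1: three characters, then any number of characters, then the same three characters in reverse order. e.g. abcxyzcba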
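
The quoted expressions can be checked both ways in R. The sketch below uses base R's grepl() with made-up test strings; the first five calls treat the expression as an R string (so each doubled backslash becomes a single regex backslash and the group is a backreference), while the last call uses a raw string (R 4.0 or later) to keep the quotes and backslashes exactly as typed.

```r
# Each call tests a string that should match, then one that should not
triple  <- grepl("(.)\\1\\1",            c("aaa", "aab"))          # same character three times in a row
mirror2 <- grepl("(.)(.)\\2\\1",         c("abba", "abab"))        # two characters, then reversed
pair    <- grepl("(..)\\1",              c("abab", "abba"))        # two characters, then repeated
spaced  <- grepl("(.).\\1.\\1",          c("abaca", "abcde"))      # same character at positions 1, 3, 5
mirror3 <- grepl("(.)(.)(.).*\\3\\2\\1", c("abcxyzcba", "abcabc")) # three characters ... reversed

# Literal reading: quotes and backslashes are part of the pattern itself
literal <- grepl(r"("(.)(.)\\2\\1")", r"("ab\2\1")")

print(rbind(triple, mirror2, pair, spaced, mirror3))
print(literal)
```

Under the R-string reading, each pattern matches its first test string and rejects the second; under the literal reading, the pattern matches text that actually contains \2\1.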

Construct regular expressions to match words that:

  • Start and end with the same character.
    ^(.).*\1$
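    The first and last characters match, with anything (or nothing) in between. Note that this needs at least two characters, so a one-letter word such as “a” is not matched.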

  • Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
    (..).*\1
    Very similar to the first regex, but the captured group is two characters instead of one, and without the anchors the match can occur anywhere in the string rather than only at the first and last characters.

  • Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
    (.).*\1.*\1
    Similar to the previous one, but the group is one character instead of two, and it must recur at least twice more (three occurrences in total) rather than just once.
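
The three constructed expressions can be tried against a few words with base R's grepl(). This is only a quick sketch; the word list is an arbitrary choice for illustration, and the backslashes are doubled because the patterns are written as R strings.

```r
words <- c("church", "eleven", "level", "apple", "a")

same_ends   <- grepl("^(.).*\\1$",    words)  # starts and ends with the same character
repeat_pair <- grepl("(..).*\\1",     words)  # contains a repeated pair of characters
three_times <- grepl("(.).*\\1.*\\1", words)  # one character in at least three places

print(data.frame(words, same_ends, repeat_pair, three_times))
```

Here “level” is the only word with matching ends, “church” the only one with a repeated pair, and “eleven” the only one with a character in three places. The single-letter word “a” matches none of them, including the first pattern, since that pattern needs two separate characters.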