library(tidyverse)

Introduction

This week we’re focusing on character manipulation and processing with R.

Question 1

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

The csv containing all the majors can be found here: https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv

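# URL of the raw CSV, stored under a name that does not mask base R's file().
majors_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"

# The first row holds the column names, which is read_csv()'s default.
majors <- read_csv(majors_url, show_col_types = FALSE)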

majors |> 
  filter(str_detect(str_to_upper(Major), "DATA|STATISTICS"))
## # A tibble: 3 × 3
##   FOD1P Major                                         Major_Category         
##   <chr> <chr>                                         <chr>                  
## 1 6212  MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business               
## 2 2101  COMPUTER PROGRAMMING AND DATA PROCESSING      Computers & Mathematics
## 3 3702  STATISTICS AND DECISION SCIENCE               Computers & Mathematics
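As an alternative to upper-casing the column first, the match itself can be made case-insensitive. The following is only a sketch: `sample_majors` is a small hypothetical vector typed in mixed case for illustration, not the real dataset.

```r
library(stringr)

# Hypothetical sample of major names in mixed case (not read from the CSV).
sample_majors <- c(
  "Management Information Systems and Statistics",
  "Computer Programming and Data Processing",
  "General Agriculture"
)

# regex(..., ignore_case = TRUE) makes the pattern match regardless of case.
pattern <- regex("data|statistics", ignore_case = TRUE)

is_match <- str_detect(sample_majors, pattern)
matched  <- str_subset(sample_majors, pattern)
matched
```

The same `pattern` object could be passed to `str_detect()` inside `filter()` in place of the `str_to_upper()` call.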

Question 2

Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

We first store the printed output above as a single string, then parse that string into a properly formatted character vector.

raw_data <- r"([1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
[9] "elderberry"   "lime"         "lychee"       "mulberry"    
[13] "olive"        "salal berry")"

# Remove the index markers that print() adds at the start of each line, e.g. "[1] ".
# \d+ restricts the match to digits so nothing else inside brackets is removed.
processed_data_1 <- str_remove_all(raw_data, r"(\[\d+\]\s)")

# Drop the leading and trailing double quotes, which the split pattern below does not consume.
processed_data_2 <- str_sub(processed_data_1, 2, -2)

# Split on a closing quote, whitespace (spaces or a newline), and an opening quote,
# which is exactly the separator between items.
processed_list <- str_split_1(processed_data_2, r"("\s+")")

# Print the resulting character vector.
processed_list
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
# Check that the result is identical to the expected character vector.
test_list <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
identical(processed_list,test_list)
## [1] TRUE
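Another way to approach the same transformation is to capture everything between each pair of double quotes, rather than removing markers and splitting. The sketch below assumes a shortened, hypothetical copy of the printed output (`raw_fruits`) and also builds the literal `c(...)` text shown in the question.

```r
library(stringr)

# Shortened, hypothetical copy of the printed vector (same layout as raw_data).
raw_fruits <- '[1] "bell pepper"  "bilberry"\n[3] "blackberry"   "blood orange"'

# Capture the text between each pair of double quotes.
# str_match_all() returns a list of matrices; column 2 holds the capture group.
fruits <- str_match_all(raw_fruits, '"([^"]+)"')[[1]][, 2]

# Build the literal c("...", "...") text from the vector.
as_code <- str_c("c(", str_c('"', fruits, '"', collapse = ", "), ")")
cat(as_code, "\n")
```

Base R's `dput(fruits)` prints a similar literal without any string building.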

Question 3

Describe, in words, what these expressions will match:

- (.)\1\1
This matches any single character that appears three times in a row, e.g. "aaa". Written as an R string, the backslashes need to be doubled: "(.)\\1\\1".
- "(.)(.)\\2\\1"
This matches any two characters followed by the same two characters in reverse order, e.g. "abba".
- (..)\1
This matches any two characters that are immediately repeated, e.g. "abab". As with the first expression, the R string form is "(..)\\1".
- "(.).\\1.\\1"
This matches five characters in which the first, third, and fifth are the same and the second and fourth can be anything, e.g. "abaca".
- "(.)(.)(.).*\\3\\2\\1"
This matches any three characters, followed by zero or more characters, followed by the same three characters in reverse order, e.g. "abcxyzcba".
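The descriptions above can be checked against a few made-up strings. This is a minimal sketch: the test strings are my own examples, and every pattern is written as a regular R string with doubled backslashes.

```r
library(stringr)

# Each call returns the first substring matched by the pattern.
m1 <- str_extract("aaab",      "(.)\\1\\1")             # same character three times in a row
m2 <- str_extract("abba",      "(.)(.)\\2\\1")          # two characters, then the pair reversed
m3 <- str_extract("banana",    "(..)\\1")               # a pair immediately repeated
m4 <- str_extract("abaca",     "(.).\\1.\\1")           # positions 1, 3, and 5 are equal
m5 <- str_extract("abcXYZcba", "(.)(.)(.).*\\3\\2\\1")  # three characters ... same three reversed

# A string with no tripled character does not match the first pattern.
no_match <- str_detect("abcd", "(.)\\1\\1")

c(m1, m2, m3, m4, m5)
```

In "banana" the third pattern picks out "anan", since "an" is the first pair that is immediately followed by itself.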

Question 4

Construct regular expressions to match words that:

- Start and end with the same character.
^(.).*\1$
- Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
^.*(..).*\1.*$
- Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
^.*(.).*(\1.*){2,}$
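To sanity-check the three expressions, here is a small sketch that runs them over a hypothetical word list (`words_sample` is my own choice of words). It also shows one possible variant, labeled as an assumption, for the edge case where a single-letter word should count as starting and ending with the same character.

```r
library(stringr)

# Hypothetical test words chosen to exercise each pattern.
words_sample <- c("area", "church", "eleven", "banana", "apple", "a")

same_ends   <- "^(.).*\\1$"             # starts and ends with the same character
repeat_pair <- "^.*(..).*\\1.*$"        # contains a repeated pair
three_times <- "^.*(.).*(\\1.*){2,}$"   # one character in at least three places

ends_hits  <- str_subset(words_sample, same_ends)
pair_hits  <- str_subset(words_sample, repeat_pair)
three_hits <- str_subset(words_sample, three_times)

# Assumed variant: making the tail optional lets a one-letter word match too.
same_ends_single <- "^(.)(.*\\1)?$"
single_hits <- str_subset(words_sample, same_ends_single)

list(ends_hits, pair_hits, three_hits, single_hits)
```

Because `.` matches any character, these patterns are not limited to letters. That is fine for a list of plain words but would need a character class such as `[a-z]` for messier input.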

Conclusions

We’ve mainly focused on regex in this assignment. To build and debug patterns I use https://regex101.com/, which lets me paste in a portion of sample data and confirm that the pattern matches what I intend. The site supports several flavors of regex, but not the POSIX ERE flavor that base R functions such as grepl() use by default. Patterns built with the site's PCRE2 flavor will work in those base R functions if you pass perl = TRUE. The stringr functions used above rely on the ICU engine instead, so a pattern built on the site should still be checked in R. A breakdown of the differences between regex engines can be found here: https://gist.github.com/CMCDragonkai/6c933f4a7d713ef712145c5eb94a1816.
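As a brief illustration of switching engines in base R, the sketch below uses a hypothetical vector (`majors_sample`) and a negative lookahead, a Perl-style feature that I would only expect to work with `perl = TRUE`.

```r
# Hypothetical sample values for illustration.
majors_sample <- c("DATA PROCESSING", "STATISTICS", "GENERAL AGRICULTURE")

# Default engine: plain alternation needs no special flag.
has_keyword <- grepl("DATA|STATISTICS", majors_sample)

# perl = TRUE switches grepl() to the PCRE engine, so lookaheads are available.
# This pattern matches strings that do NOT contain "DATA".
lacks_data <- grepl("^(?!.*DATA).*$", majors_sample, perl = TRUE)

has_keyword
lacks_data
```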