1

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

Response1: str_detect

major_df <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")

spec(major_df)
## cols(
##   FOD1P = col_character(),
##   Major = col_character(),
##   Major_Category = col_character()
## )
major_df |>
  filter( str_detect(Major,"DATA") | str_detect(Major,"STATISTICS") | str_detect(Major_Category,"DATA") | str_detect(Major_Category,"STATISTICS") )
## # A tibble: 3 × 3
##   FOD1P Major                                         Major_Category         
##   <chr> <chr>                                         <chr>                  
## 1 6212  MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business               
## 2 2101  COMPUTER PROGRAMMING AND DATA PROCESSING      Computers & Mathematics
## 3 3702  STATISTICS AND DECISION SCIENCE               Computers & Mathematics

2

Write code that transforms the data below:

1 “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

Response1: Using Regex , str_squish and str_replace_all

task2 <-'[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"

[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  

[9] "elderberry"   "lime"         "lychee"       "mulberry"    

[13] "olive"        "salal berry"'

#Remove digits, [], additional line, and tab 
task2_result <- str_replace_all(task2, "\\d+|\\[|\\]|\\n|\\t", "")

#Remove extra whitespace from strings
task2_result <- str_squish(task2_result)

#Add a - between different elements in the vector
task2_result <- str_replace_all(task2_result, '" "', "-")

task2_result <- task2_result %>%
  str_replace_all('"', '') %>%    # Remove all quotes
  str_split("-", simplify = TRUE) %>%  # Split by dash
  as.vector()   # Convert to vector

#print the vector
print(task2_result)
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

3

Describe, in words, what these expressions will match:

3.1

(.)\1\1 , this will find the sequence of 3 characters. For example in the text, it replaced III for IV.

## [1] "This is a version IV"

3.2

“(.)(.)\2\1” , this will find a sequence of two characters followed by its inverse, within double quotes. For example in the text, it replaced IXXI for “The Roman numeral IXXI”.

## [1] "The Roman numeral IXXI is not valid"

3.3

(..)\1 , this will find the sequence of 2 characters followed by same two characters. For example in the text, it replaced XIXI for “The Roman numeral IXXI”.

## [1] "The Roman numeral XIXI is not valid"

3.4

“(.).\1.\1” , this will find the sequence of 2 characters structure followed by the first character, within double quotes. For example in the text, it replaced XIXIX for “The Roman numeral IXXI”.

## [1] "The Roman numeral XIXIX is not valid"

3.5

“(.)(.)(.).*\3\2\1” , this will find the sequence of 3 characters structure followed by any content and then the same sequence reversed, within double quotes. For example in the text, it replaced “XIVXIV is not valid same as VIXVIX” for “The Roman numeral IXXI”.

## [1] "String matches Regex"

4

Construct regular expressions to match words that:

4.1

Start and end with the same character : ^(.)[^\\1]*\1$

## [1] "String matches Regex"

4.2

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.): (.{2}).*\1

## [1] "String matches Regex .\""

4.3

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.):(.).\1.\1

## [1] "String matches Regex "