Exercise 1

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

# The URL of the dataset
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"

# Load readr, which provides read_csv()
library(readr)

# Read the dataset into an R data frame
college_majors <- read_csv(url)

# Identify the majors that contain either "DATA" or "STATISTICS"
filtered_majors <- college_majors[grep("(DATA|STATISTICS)", college_majors$Major, ignore.case = TRUE), ]

# View the filtered majors
print(filtered_majors)
## # A tibble: 3 × 3
##   FOD1P Major                                         Major_Category         
##   <chr> <chr>                                         <chr>                  
## 1 6212  MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business               
## 2 2101  COMPUTER PROGRAMMING AND DATA PROCESSING      Computers & Mathematics
## 3 3702  STATISTICS AND DECISION SCIENCE               Computers & Mathematics
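To check the pattern without a network connection, the same alternation can be tried on a few hand-typed major names. This is only a minimal sketch: the vector `sample_majors` and the other variable names are assumptions introduced for illustration, not the downloaded data.

```r
# Hypothetical hand-typed sample; the real data comes from majors-list.csv
sample_majors <- c(
  "STATISTICS AND DECISION SCIENCE",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "GENERAL AGRICULTURE"
)

# grepl() returns one TRUE/FALSE per element instead of row indices
has_keyword <- grepl("DATA|STATISTICS", sample_majors)

# Keep only the names that matched either keyword
matched_majors <- sample_majors[has_keyword]
print(matched_majors)
```

Using `grepl()` rather than `grep()` gives a logical vector, which may be easier to combine with other filtering conditions.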

Exercise 2

Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")


# Wrap each element in double quotes and join the pieces with ", "
quoted_items <- paste0('"', fruits, '"', collapse = ", ")

# Surround the joined items with c( ... ) to get the target format
formatted <- paste0("c(", quoted_items, ")")

# writeLines() shows the string without escaping the embedded quotes
writeLines(formatted)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
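The exercise's input is a console printout rather than an existing vector, so it may help to see how the printout itself could be parsed. The sketch below assumes the printout is available as one string with straight double quotes; `raw_text` is a shortened, hand-typed stand-in, and all variable names are hypothetical.

```r
# Assumed stand-in for the console printout, shortened to two rows
raw_text <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"'

# Find every run of characters enclosed in straight double quotes
quoted <- regmatches(raw_text, gregexpr('"[^"]+"', raw_text))[[1]]

# Drop the surrounding quotes to recover the bare values
fruits_parsed <- gsub('"', "", quoted, fixed = TRUE)

# dput() prints the vector as R code, i.e. in c("...", "...") form
dput(fruits_parsed)
```

Matching on the quotes, rather than splitting on spaces, keeps multi-word items such as "bell pepper" together and ignores the `[1]` and `[5]` index markers.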

Exercise 3

Describe, in words, what these expressions will match:

(.)\1\1 Matches any character repeated 3 times in a row (e.g., “aaa”).

grep("(.)\\1\\1", c("aaabbb", "abcde", "ffff"), value = TRUE)
## [1] "aaabbb" "ffff"

"(.)(.)\\2\\1" Written as an R string, this is the regex (.)(.)\2\1: it matches any two characters followed by the same two characters in reverse order (e.g., “abba”).

grep("(.)(.)\\2\\1", c("abba", "xyz", "mnop"), value = TRUE)
## [1] "abba"

(..)\1 Matches any two characters that are immediately repeated (e.g., “abab”).

grep("(..)\\1", c("aabb", "abab", "eeffgh"), value = TRUE)
## [1] "abab"

"(.).\\1.\\1" Written as an R string, this is the regex (.).\1.\1: it matches a character, then any character, then the first character again, then any character, then the first character a third time (e.g., “aXaYa”).

grep("(.).\\1.\\1", c("axaya", "aaxx", "ayay", "abc"), value = TRUE)
## [1] "axaya"

"(.)(.)(.).*\\3\\2\\1" Written as an R string, this is the regex (.)(.)(.).*\3\2\1: it matches any three characters, then any run of characters (possibly empty), then the same three characters in reverse order (e.g., “abc…cba”).

grep("(.)(.)(.).*\\3\\2\\1", c("abc...cba", "xyz", "abz...zba"), value = TRUE)
## [1] "abc...cba" "abz...zba"
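The difference between a bare regex and its R string form can be checked directly, along with the exact substring each pattern captures. The following is a minimal sketch; `first_match` is a hypothetical helper defined here, and `perl = TRUE` is an assumption chosen so that backreferences are handled by the PCRE engine.

```r
# Without doubled backslashes, "\1" is an octal escape for a control
# character, so the string no longer contains a backreference
len_unescaped <- nchar("(.)\1\1")    # ( . ) plus two control characters
len_escaped <- nchar("(.)\\1\\1")    # the seven-character regex (.)\1\1
print(c(len_unescaped, len_escaped))

# Hypothetical helper: return the first substring matched by a pattern
first_match <- function(pattern, text) {
  regmatches(text, regexpr(pattern, text, perl = TRUE))
}

# Each call shows which part of the input the pattern actually covers
print(first_match("(.)\\1\\1", "aaabbb"))
print(first_match("(.)(.)\\2\\1", "xabbay"))
print(first_match("(..)\\1", "xababy"))
print(first_match("(.).\\1.\\1", "axaya"))
print(first_match("(.)(.)(.).*\\3\\2\\1", "abc...cba"))
```

Looking at the matched substring, rather than only at which inputs match, makes it easier to confirm that a description of the pattern is accurate.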

Exercise 4

Construct regular expressions to match words that:
  • Start and end with the same character: ^(\w)(\w*\1)?$
grep("^(\\w)(\\w*\\1)?$", c("apple", "banana", "carrot", "abba"), value = TRUE, perl = TRUE)
## [1] "abba"
  • Contain a repeated pair of letters (e.g. “church” contains “ch” twice): \b\w*(\w\w).*\1
grep("\\b\\w*(\\w\\w).*\\1", c("church", "apple", "letter"), value = TRUE)
## [1] "church"
  • Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s): \b\w*(\w).*\1.*\1
grep("\\b\\w*(\\w).*\\1.*\\1", c("eleven", "apple", "banana"), value = TRUE)
## [1] "eleven" "banana"
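The three conditions can be compared side by side on one word list. This is a sketch under stated assumptions: the word list is made up, the patterns are one possible formulation (anchored with ^ and $ for the first condition, and using \w* between repeats so that matches stay within a word), and `perl = TRUE` selects the PCRE engine.

```r
# Hypothetical test words, chosen to exercise each condition
words <- c("abba", "church", "eleven", "apple", "a")

# Same first and last character; the optional group also admits one-letter words
same_ends <- grepl("^(\\w)(\\w*\\1)?$", words, perl = TRUE)

# Some pair of letters occurs twice within the word
repeated_pair <- grepl("(\\w\\w)\\w*\\1", words, perl = TRUE)

# Some single letter occurs at least three times
three_times <- grepl("(\\w)\\w*\\1\\w*\\1", words, perl = TRUE)

# Collect the results in one table for easy comparison
results <- data.frame(words, same_ends, repeated_pair, three_times)
print(results)
```

Anchoring matters for the first condition: without ^ and $, the pattern only requires the repeated character somewhere in the word, not at both ends.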