Exercise 1

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors whose names contain either “DATA” or “STATISTICS”.

#There should be three majors that belong to this subset.

# Read the CSV file.
majors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")
str(majors)
## 'data.frame':    174 obs. of  3 variables:
##  $ FOD1P         : chr  "1100" "1101" "1102" "1103" ...
##  $ Major         : chr  "GENERAL AGRICULTURE" "AGRICULTURE PRODUCTION AND MANAGEMENT" "AGRICULTURAL ECONOMICS" "ANIMAL SCIENCES" ...
##  $ Major_Category: chr  "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" ...
desired_majors <- grep("DATA|STATISTICS", majors$Major, value=TRUE)
desired_majors
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"
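
The same pattern can also keep whole rows instead of only the matching names. The sketch below is illustrative: it assumes a small stand-in data frame (`majors_demo`, a hypothetical name holding four `Major` values copied from the output above) so that it runs without downloading the CSV. `grepl()` returns one TRUE/FALSE per row, which can index the data frame directly.

```r
# Stand-in for the downloaded data: four Major values from the listing above.
# majors_demo is a hypothetical name used only for this sketch.
majors_demo <- data.frame(
  Major = c(
    "GENERAL AGRICULTURE",
    "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
    "COMPUTER PROGRAMMING AND DATA PROCESSING",
    "STATISTICS AND DECISION SCIENCE"
  ),
  stringsAsFactors = FALSE
)

# grepl() gives TRUE/FALSE per row, so it can subset the data frame directly.
is_match <- grepl("DATA|STATISTICS", majors_demo$Major)
matched_rows <- majors_demo[is_match, , drop = FALSE]
matched_rows$Major
```

With the real `majors` data frame, the same indexing would keep the FOD1P and Major_Category columns alongside each matching major.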

Exercise 2

Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

library(stringr)
ugly_string <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"

[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  

[9] "elderberry"   "lime"         "lychee"       "mulberry"    

[13] "olive"        "salal berry"'
# Extract every double-quoted item (quotes included); this drops the [n] index markers and the line breaks.
fruits <- str_extract_all(ugly_string, '"[^"]+"')[[1]]
# Join the items with commas and wrap them in c( ).
better_string <- str_c("c(", str_c(fruits, collapse = ", "), ")")
writeLines(better_string)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
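
One way to check a result in the target format is to parse it back into R and confirm that it yields the original items. The following is a minimal base-R sketch under stated assumptions: `printed` is a shortened, hypothetical stand-in for the input, and `quoted`, `as_call` and `fruits_check` are illustrative names.

```r
# Hypothetical, shortened input in the same printed-vector layout.
printed <- '[1] "bell pepper"  "bilberry"

[3] "lime"         "salal berry"'

# Pull out every double-quoted item, quotes included (base R only).
quoted <- regmatches(printed, gregexpr('"[^"]+"', printed))[[1]]

# Assemble the c(...) call as text.
as_call <- paste0("c(", paste(quoted, collapse = ", "), ")")
writeLines(as_call)

# Round trip: parsing and evaluating the text should give back the items.
fruits_check <- eval(parse(text = as_call))
fruits_check
```

If the round trip returns a character vector with one element per fruit, the text is valid R in the requested `c(...)` form.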

Exercise 3

Describe, in words, what these expressions will match:

(.)\1\1

library(htmlwidgets)
## Warning: package 'htmlwidgets' was built under R version 4.1.1
test <- c("a", "abab","aabb", "abba", "abracadabra", "a\1\1", "aaaaaaa", "aabbaa")
str_view_all(test, "(.)\1\1")

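Written as the R string "(.)\1\1", with single backslashes, this pattern does not contain a backreference: R reads \1 as an escape for the control character U+0001, so the pattern matches any character followed by two of those control characters (only "a\1\1" in test). Read as a regular expression, (.)\1\1 (the R string "(.)\\1\\1") matches the same character three times in a row, as in "aaa".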
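
The effect of the backslash count can be checked directly. This is a small sketch, assuming stringr is installed; `str_detect()` is used instead of `str_view_all()` so that the result is a plain TRUE/FALSE, and the sample strings and variable names are made up for illustration.

```r
library(stringr)

# Two backslashes in the R string: the regex engine sees \1, a backreference.
triple <- str_detect("aaa", "(.)\\1\\1")

# One backslash: R turns \1 into the control character U+0001 before the
# regex engine sees it, so there is no backreference in the pattern.
literal_on_aaa  <- str_detect("aaa", "(.)\1\1")
literal_on_ctrl <- str_detect("a\1\1", "(.)\1\1")

c(triple, literal_on_aaa, literal_on_ctrl)
```

Under these assumptions the three results should be TRUE, FALSE and TRUE, in that order.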

“(.)(.)\2\1”

str_view_all(test, "(.)(.)\\2\\1")

This expression matches two characters followed by the same two characters in reverse order, that is, the form char(a)char(b)char(b)char(a), as in "abba".

(..)\1

str_view_all(test, "(..)\1")

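Written as the R string "(..)\1", with a single backslash, \1 is the control character U+0001 rather than a backreference, so this pattern matches any two characters followed by that control character (only "a\1\1" in test). Read as a regular expression, (..)\1 (the R string "(..)\\1") matches a pair of characters that is immediately repeated, as in "abab".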

“(.).\1.\1”

str_view_all(test, "(.).\\1.\\1")

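This expression matches five characters of the form a[]a[]a: a character, any character, the first character again, any character, and the first character once more. The characters in the brackets can be anything, including a itself, so both "acada" (inside "abracadabra") and "aaaaa" match.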

"(.)(.)(.).*\3\2\1"

str_view_all(test, "(.)(.)(.).*\\3\\2\\1")

This expression matches any three characters abc, followed by zero or more characters of any kind, followed by the same three characters in reverse order: abc[]cba, where the brackets can hold any run of characters or nothing at all, as in "abccba".
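
The descriptions of the backreference patterns can be spot-checked with strings that should and should not match. This is only a sketch, assuming stringr is installed; the sample strings and variable names are chosen for illustration and are not part of the exercise.

```r
library(stringr)

# Two characters, then the same two reversed: "abba" yes, "abab" no.
abba <- str_detect(c("abba", "abab"), "(.)(.)\\2\\1")

# A pair of characters repeated immediately: "abab" yes, "abba" no.
pair <- str_detect(c("abab", "abba"), "(..)\\1")

# Same character at positions 1, 3 and 5 of the match.
spaced <- str_extract("abracadabra", "(.).\\1.\\1")

# Three characters, anything (or nothing), then the three reversed.
mirror <- str_detect(c("abc123cba", "abccba", "abcabc"), "(.)(.)(.).*\\3\\2\\1")

list(abba = abba, pair = pair, spaced = spaced, mirror = mirror)
```

Here `spaced` should come back as "acada", the first stretch of "abracadabra" where the letter a sits at every other position.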

Exercise 4

Construct regular expressions to match words that:

Start and end with the same character.

# The optional group lets one-character words such as "a" match as well.
str_view_all(test, "^(.)(.*\\1)?$")

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

test_new <- c("church", "eleven")
str_view_all(test_new, "(..).*\\1")

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

str_view_all(test_new, "(.).*\\1.*\\1")
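
The three patterns can be compared side by side on a small word list. This is a sketch under assumptions: stringr is installed, `words_demo` is a hypothetical vector made up for the comparison, and the first pattern includes an optional group so that one-letter words count as starting and ending with the same character, which is one possible reading of the task.

```r
library(stringr)

# Hypothetical word list chosen to exercise each pattern.
words_demo <- c("a", "abba", "church", "eleven", "banana", "level")

# Start and end with the same character (one-letter words included).
same_ends <- str_subset(words_demo, "^(.)(.*\\1)?$")

# Contain a repeated pair of characters.
repeat_pair <- str_subset(words_demo, "(..).*\\1")

# Contain one character in at least three places.
three_times <- str_subset(words_demo, "(.).*\\1.*\\1")

list(same_ends = same_ends, repeat_pair = repeat_pair, three_times = three_times)
```

`str_subset()` returns the matching words themselves, which makes it easier to read off the result than the highlighted widget from `str_view_all()`.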