607 Assignment 3

Exercise 1.

Question: Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

Answer: First, I’ll get the data from github, and create a data frame from the CSV of major names:

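library(RCurl)    # getURL(), curlOptions()
library(dplyr)    # %>%, filter()
library(stringr)  # str_detect(), str_extract_all(), str_subset()

x <- getURL("https://raw.github.com/fivethirtyeight/data/master/college-majors/majors-list.csv", .opts = curlOptions(followlocation = TRUE))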

majors <- read.csv(text = x, header=FALSE)

Then I’ll create a vector of major names and identify those that match “DATA” or “STATISTICS”:

major_names <- majors$V2

majors %>%
  filter(str_detect(major_names, "STATISTICS|DATA"))
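The same alternation pattern can be checked without a network connection. This is a minimal sketch in base R: `sample_majors` is a small hand-typed vector used only for illustration (an assumption, not the downloaded data), and `grep()` stands in for `str_detect()`.

```r
# Hypothetical sample of major names, typed by hand for illustration only
sample_majors <- c("GENERAL AGRICULTURE",
                   "COMPUTER PROGRAMMING AND DATA PROCESSING",
                   "STATISTICS AND DECISION SCIENCE",
                   "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
                   "ZOOLOGY")

# "DATA|STATISTICS" matches a name containing either word
matches <- grep("DATA|STATISTICS", sample_majors, value = TRUE)
print(matches)
```

The `|` acts as "or", so a name is kept if either word appears anywhere in it.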

Exercise 2.

Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

fruit_vector <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

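# Match one lowercase word, optionally followed by a single space and a second word
fruit_vector <- str_extract_all(fruit_vector, pattern = "[a-z]+( [a-z]+)?")
fruit_vector <- unlist(fruit_vector) # found on Stack Overflow: str_extract_all() returns a list, and unlist() turns it back into a vector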

# I wasn't sure whether the goal was a string output, so I took a crack at it here.
# paste0() avoids the separator spaces that paste() would insert after 'c("' and before '")'.
final_string <- paste0('c("', paste(fruit_vector, collapse = '", "'), '")')
final_string

## [1] "c(\"bell pepper\", \"bilberry\", \"blackberry\", \"blood orange\", \"blueberry\", \"cantaloupe\", \"chili pepper\", \"cloudberry\", \"elderberry\", \"lime\", \"lychee\", \"mulberry\", \"olive\", \"salal berry\")"
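One way to check a string like this is to parse it back into R and compare it with the extracted vector. The following is a sketch in base R only, not the approach used above: the names `raw`, `fruits`, `as_code`, and `round_trip` are illustrative, and the quoted items are pulled out with `regmatches()` instead of `str_extract_all()`.

```r
raw <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Grab every double-quoted item, then strip the surrounding quotes
quoted <- regmatches(raw, gregexpr('"[^"]+"', raw))[[1]]
fruits <- gsub('"', '', quoted, fixed = TRUE)

# Build the c("...", "...") text without stray spaces
as_code <- paste0('c(', paste0('"', fruits, '"', collapse = ', '), ')')
cat(as_code, "\n")

# Parsing the text back should reproduce the same character vector
round_trip <- eval(parse(text = as_code))
print(identical(round_trip, fruits))
```

If only the vector is needed and not the text, `dput(fruits)` prints it in `c(...)` form directly.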

Exercise 3.

Leveraged these Regex resources:
- https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
- https://www.regular-expressions.info/backref.html

Question: Describe, in words, what these expressions will match:
- (.)\1\1
- "(.)(.)\2\1"
- (..)\1
- "(.).\1.\1"
- "(.)(.)(.).*\3\2\1"

Answer:

(.)\1\1 : This isn’t in quotes, so it’s not ready for subsetting in R. If it were in quotes and written with a double \, it would match any character that appears three times in a row, e.g. ‘aaa’. Put into quotes as is, R reads each \1 as an octal escape for the control character with code 1, so the pattern would look for any character followed by two of those control characters instead of a repeat.
"(.)(.)\2\1" : This will match any two characters that aren’t a newline (.)(.), followed by the second captured character (\2) and then the first (\1). In other words, a pair of characters followed by the same pair reversed, e.g. in afternoon, ‘noon’.
(..)\1 : This isn’t in quotes, so it’s not ready for subsetting in R. If it were in quotes and written with a double \, it would match the first place where a captured pair of characters (..) is immediately repeated, e.g. in banana, ‘anan’. Put into quotes as is, the \1 is again read as a control character, so it looks for any two characters followed by that control character.
"(.).\1.\1" : This will match a captured character, any character, the captured character again, any character, and the captured character a third time, e.g. abaca.
"(.)(.)(.).*\3\2\1" : This will match 3 characters, followed by 0 or more characters, followed by the first 3 characters in reverse order, e.g. in paragraph, ‘paragrap’.
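The examples in these answers can be checked directly. Below is a minimal sketch in base R; `first_match()` is a hypothetical helper written for this illustration, and the test strings are the ones from the answers plus "aaah", which is made up. Every pattern is written with doubled backslashes so that R passes a real backreference to the regex engine.

```r
# Hypothetical helper: return the first substring of `text` that matches `pattern`
first_match <- function(pattern, text) {
  regmatches(text, regexpr(pattern, text, perl = TRUE))
}

results <- c(
  triple   = first_match("(.)\\1\\1", "aaah"),
  mirrored = first_match("(.)(.)\\2\\1", "afternoon"),
  doubled  = first_match("(..)\\1", "banana"),
  spaced   = first_match("(.).\\1.\\1", "abaca"),
  reversed = first_match("(.)(.)(.).*\\3\\2\\1", "paragraph")
)
print(results)

# A single backslash followed by a digit is an octal escape in an R string,
# so "\1" holds one control character while "\\1" holds a backslash and a 1
single_escape_length <- nchar("\1")
double_escape_length <- nchar("\\1")
print(c(single_escape_length, double_escape_length))
```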

Exercise 4.

Question: Construct regular expressions to match words that:
- Start and end with the same character.
- Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
- Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

Answer: Start and end with the same character: "^(.).*\1$"

str_subset(words, pattern = "^(.).*\\1$")

##  [1] "america"    "area"       "dad"        "dead"       "depend"    
##  [6] "educate"    "else"       "encourage"  "engine"     "europe"    
## [11] "evidence"   "example"    "excuse"     "exercise"   "expense"   
## [16] "experience" "eye"        "health"     "high"       "knock"     
## [21] "level"      "local"      "nation"     "non"        "rather"    
## [26] "refer"      "remember"   "serious"    "stairs"     "test"      
## [31] "tonight"    "transport"  "treat"      "trust"      "window"    
## [36] "yesterday"
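One detail of this pattern is that it needs at least two characters, so a one-letter word never matches even though it trivially starts and ends with the same character. The sketch below uses base R `grepl()` on a few hand-picked words; `lenient` is a hypothetical variant shown only for comparison and is not part of the answer above.

```r
sample_words <- c("a", "dad", "level", "apple", "window")

strict  <- "^(.).*\\1$"      # pattern from the answer above
lenient <- "^(.)(.*\\1)?$"   # hypothetical variant: the tail is optional

strict_hits  <- sample_words[grepl(strict, sample_words, perl = TRUE)]
lenient_hits <- sample_words[grepl(lenient, sample_words, perl = TRUE)]

print(strict_hits)
print(lenient_hits)
```

Whether a one-letter word should count is a judgment call about the question, so the stricter pattern may well be the intended reading.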

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.): "([a-z][a-z]).*\1"

str_subset(words, pattern = "([a-z][a-z]).*\\1")

##  [1] "appropriate" "church"      "condition"   "decide"      "environment"
##  [6] "london"      "paragraph"   "particular"  "photograph"  "prepare"    
## [11] "pressure"    "remember"    "represent"   "require"     "sense"      
## [16] "therefore"   "understand"  "whether"

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.): "([a-z]).*\1.*\1"

str_subset(words, pattern = "([a-z]).*\\1.*\\1")

##  [1] "appropriate" "available"   "believe"     "between"     "business"   
##  [6] "degree"      "difference"  "discuss"     "eleven"      "environment"
## [11] "evidence"    "exercise"    "expense"     "experience"  "individual" 
## [16] "paragraph"   "receive"     "remember"    "represent"   "telephone"  
## [21] "therefore"   "tomorrow"
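As a quick sanity check, the last two patterns can be run against the examples from the question. This is a sketch in base R with a hand-picked word list; "banana" and "apple" are added here for contrast and are not part of the question.

```r
repeated_pair <- "([a-z][a-z]).*\\1"   # same two letters appear twice
three_times   <- "([a-z]).*\\1.*\\1"   # same letter appears three times

sample_words <- c("church", "eleven", "banana", "apple")

pair_hits   <- sample_words[grepl(repeated_pair, sample_words, perl = TRUE)]
triple_hits <- sample_words[grepl(three_times, sample_words, perl = TRUE)]

print(pair_hits)
print(triple_hits)
```

The `.*` between backreferences is what lets the repeats sit anywhere in the word instead of having to be adjacent.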

Claire Meyer

2/17/2021
