1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

Load data

majors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")

We will use the grep function to find the values in the Major column that contain the word “DATA” or “STATISTICS”.

grep_output <- grep('DATA|STATISTICS',majors$Major, value = TRUE)
grep_output
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"
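As a small offline check of the same idea, the sketch below runs grep and grepl on a short, made-up vector (sample_majors is hypothetical, not the real dataset). grepl returns a logical vector, which is convenient for subsetting data frame rows, and ignore.case = TRUE is an optional safeguard, assumed here only for the case where names are not all upper case.

```r
# Hypothetical sample; the real data come from majors-list.csv.
sample_majors <- c("STATISTICS AND DECISION SCIENCE",
                   "Computer Programming and Data Processing",
                   "GENERAL BUSINESS")

# grep(value = TRUE) returns the matching strings themselves.
hits <- grep("DATA|STATISTICS", sample_majors, value = TRUE, ignore.case = TRUE)

# grepl() returns TRUE/FALSE per element, handy for filtering rows.
is_hit <- grepl("DATA|STATISTICS", sample_majors, ignore.case = TRUE)

print(hits)
print(is_hit)
```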

The same filter using str_detect from stringr, which keeps the full rows:

library(stringr)
stringr_output <- majors[str_detect(majors$Major, "STATISTICS|DATA"), ]
stringr_output
##    FOD1P                                         Major          Major_Category
## 44  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 52  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 59  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics

2. Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

I didn’t fully understand this assignment: should I just create the input and make the output look like c(“bell pepper”, etc.)?

input_str <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
[9] "elderberry"   "lime"         "lychee"       "mulberry"    
[13] "olive"        "salal berry"'
input_str
## [1] "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\"\n[5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"  \n[9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"    \n[13] \"olive\"        \"salal berry\""

Finding all quoted values:

extract_berry <- unlist(str_extract_all(input_str, '"[^"]*"'))
extract_berry 
##  [1] "\"bell pepper\""  "\"bilberry\""     "\"blackberry\""   "\"blood orange\""
##  [5] "\"blueberry\""    "\"cantaloupe\""   "\"chili pepper\"" "\"cloudberry\""  
##  [9] "\"elderberry\""   "\"lime\""         "\"lychee\""       "\"mulberry\""    
## [13] "\"olive\""        "\"salal berry\""

Removing the double quotes:

extract_berry_update <- str_remove_all(extract_berry, "\"")
extract_berry_update
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

Writing the result in the required form, with c and parentheses:

final <- paste0('c(', paste0('"', extract_berry_update, '"', collapse = ', '), ')')
writeLines(final)
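A shorter route to the same output, offered as a sketch and not necessarily the answer the assignment intends: base R's dput() prints a vector as the R code that would recreate it. The variable names quoted and items are illustrative, and base R regmatches()/gregexpr() stand in for str_extract_all() so the block runs without stringr.

```r
input_str <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Pull out every double-quoted token, then strip the quotes.
quoted <- regmatches(input_str, gregexpr('"[^"]*"', input_str))[[1]]
items  <- gsub('"', "", quoted, fixed = TRUE)

# dput() writes the vector as an R expression: c("bell pepper", ...)
dput(items)
```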

3. Describe, in words, what these expressions will match:

3.1 (.)\1\1

(.) is the 1st capturing group; it matches any single character except line terminators.
\1\1 refers twice to the text matched by the first capturing group, so the same character must follow two more times.
To use this regular expression in an R string, each backslash must be escaped: write \\1 instead of \1.
For example, 777 in 84572777. In other words, it matches the same character three times in a row.

stng_1 <- '84572777'
str_view(stng_1, "(.)\\1\\1", match = TRUE)
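The escaping point can be checked directly. In this sketch (base R only, with illustrative variable names and perl = TRUE chosen as an assumption about the regex engine), the doubled backslashes produce a seven-character pattern containing real backslashes, while a single backslash followed by a digit is read by R as an octal character escape, so that string contains no back-reference at all.

```r
good <- "(.)\\1\\1"   # the regex (.)\1\1
bad  <- "(.)\1\1"     # "\1" is the control character with code 1

nchar(good)           # 7 characters: ( . ) \ 1 \ 1
nchar(bad)            # 5 characters: ( . ) plus two control characters

x <- "84572777"
# Extract the first match of the correctly escaped pattern.
first_match <- regmatches(x, regexpr(good, x, perl = TRUE))
print(first_match)                  # "777"

# The unescaped version looks for control characters, which x lacks.
print(grepl(bad, x, perl = TRUE))   # FALSE
```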

3.2 “(.)(.)\2\1”

This regular expression is given as an R string, so it is wrapped in quotes and each backslash is doubled (\\2 and \\1), as in the code below.
(.)(.) are the 1st and 2nd capturing groups; each matches any single character except line terminators.
\2 refers to the text matched by the second capturing group.
\1 refers to the text matched by the first capturing group.
For example, 4444 in 4564444rt. In other words, it matches two characters followed by the same two characters in reverse order, such as abba; four identical characters are a special case.

stng_2 <- '4564444rt'
str_view(stng_2, "(.)(.)\\2\\1", match = TRUE)

3.3 (..)\1

(..) is the 1st capturing group; it matches any 2 characters except line terminators.
\1 refers to the text matched by the first capturing group.
For example, 1414 in ssff1414slfs. In other words, it matches any pair of characters that is immediately repeated.

stng_3 <- 'ssff1414slfs'
str_view(stng_3, "(..)\\1", match = TRUE)

3.4 “(.).\1.\1”

(.) is the 1st capturing group; it matches any single character except line terminators.
. matches any single character except line terminators.
\1 refers to the text matched by the first capturing group.
For example, e7e8e in e7e8e9e7. In other words, it matches five characters in which the 1st character is repeated in the 3rd and 5th positions; the 2nd and 4th characters can be anything and need not be equal.

stng_4 <- 'e7e8e9e7'
str_view(stng_4, "(.).\\1.\\1", match = TRUE)
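To see that only positions 1, 3 and 5 are constrained, the sketch below extracts the first match from a few inputs. It uses base R with perl = TRUE instead of str_view, and every test string except e7e8e9e7 is made up for illustration.

```r
pattern <- "(.).\\1.\\1"
tests <- c("e7e8e9e7", "abaca", "ababa", "abcde")

# Which strings contain a match at all?
matched <- grepl(pattern, tests, perl = TRUE)

# regmatches() keeps only the elements that matched.
first_match <- regmatches(tests, regexpr(pattern, tests, perl = TRUE))

print(matched)      # TRUE TRUE TRUE FALSE
print(first_match)  # "e7e8e" "abaca" "ababa"
```

In "abaca" the 2nd and 4th characters differ (b and c), yet the string still matches.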

3.5 “(.)(.)(.).*\3\2\1”

(.)(.)(.) are the 1st, 2nd and 3rd capturing groups; each matches any single character except line terminators.
. matches any single character except line terminators.
* repeats the preceding element between zero and unlimited times, as many times as possible.
\3 refers to the text matched by the third capturing group.
\2 refers to the text matched by the second capturing group.
\1 refers to the text matched by the first capturing group.
For example, 1234y4321 in 1234y4321rff. In other words, it matches three characters, then any number of characters, then the same three characters in reverse order; the shortest possible match is 6 characters long.

stng_5 <- '1234y4321rff'
str_view(stng_5, "(.)(.)(.).*\\3\\2\\1", match = TRUE)
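The description above can be tried on a few more inputs. This is a sketch in base R (perl = TRUE is an assumption about the engine, and all test strings other than 1234y4321rff are hypothetical); it shows the six-character minimum and a case where the order is not reversed.

```r
pattern <- "(.)(.)(.).*\\3\\2\\1"
tests <- c("1234y4321rff", "abccba", "abcXYZcba", "abcabc")

matched <- grepl(pattern, tests, perl = TRUE)
first_match <- regmatches(tests, regexpr(pattern, tests, perl = TRUE))

print(matched)      # TRUE TRUE TRUE FALSE
print(first_match)  # "1234y4321" "abccba" "abcXYZcba"
```

"abcabc" fails because the three characters repeat in the same order, not reversed.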

4. Construct regular expressions to match words that:

4.1 Start and end with the same character

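To match only a complete string, the pattern should start with ^ and end with $. (.) captures the first character and \\1 requires the last character to be the same. Note that this pattern needs at least two characters, so a one-letter word does not match.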

stng <- c('slow','momnm','slack','hopeh')
str_view(stng, "^(.).*\\1$")
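If one-letter words should also count as starting and ending with the same character, one possible variant makes the repeated part optional. The sketch below compares the two patterns in base R; the name extended, the added word "a", and the use of perl = TRUE are assumptions for illustration.

```r
words <- c("slow", "momnm", "slack", "hopeh", "a")

simple   <- "^(.).*\\1$"      # needs at least two characters
extended <- "^(.)(.*\\1)?$"   # the repeated part is optional

simple_hits   <- grepl(simple, words, perl = TRUE)
extended_hits <- grepl(extended, words, perl = TRUE)

print(simple_hits)    # FALSE TRUE FALSE TRUE FALSE
print(extended_hits)  # FALSE TRUE FALSE TRUE TRUE
```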

4.2 Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

st <- c('slowwew','momnm','scklack','ohopehop','banana')
str_view(st, ".*(.)(.).*\\1\\2.*")
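A slightly tighter variant, sketched in base R under the assumption that only ASCII letters should count: the leading and trailing .* are not needed to detect a match, and [A-Za-z] keeps digits and punctuation out of the pair. The word "church" is added from the exercise text.

```r
st <- c("slowwew", "momnm", "scklack", "ohopehop", "banana", "church")

# Two letters, anything in between, then the same two letters again.
pattern <- "([A-Za-z]{2}).*\\1"

has_pair <- grepl(pattern, st, perl = TRUE)
print(st[has_pair])  # "scklack" "ohopehop" "banana" "church"
```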

4.3 Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

s <- c('slowwew','momnm','scklack','ohopehop','banana')
str_view(s, ".*(.).*\\1.*\\1.*")
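The same check can be sketched in base R to list which words qualify. As before, restricting the group to [A-Za-z] and using perl = TRUE are assumptions, and "eleven" is added from the exercise text.

```r
s <- c("slowwew", "momnm", "scklack", "ohopehop", "banana", "eleven")

# One letter, then the same letter twice more, with anything in between.
pattern <- "([A-Za-z]).*\\1.*\\1"

has_triple <- grepl(pattern, s, perl = TRUE)
print(s[has_triple])  # "slowwew" "momnm" "ohopehop" "banana" "eleven"
```

"scklack" is excluded because none of its letters occurs more than twice.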