library(tidyverse)
Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
url = 'https://raw.githubusercontent.com/dab31415/DATA607/main/Homework/Assignment_3/majors-list.csv'
majors <- read_csv(url,show_col_types = FALSE)
majors %>%
filter(grepl('DATA|STATISTICS',Major))
## # A tibble: 3 x 3
## FOD1P Major Major_Category
## <chr> <chr> <chr>
## 1 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business
## 2 2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3 3702 STATISTICS AND DECISION SCIENCE Computers & Mathematics
Write code that transforms the data below:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”
Into a format like this:
c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
raw_input <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'
expected_result <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
Remove new line characters and positional indicators and trim whitespace.
new_input <- raw_input %>%
str_replace_all('\\n','') %>%
str_replace_all('(\\[\\d+\\])','') %>%
str_trim()
new_input
## [1] "\"bell pepper\" \"bilberry\" \"blackberry\" \"blood orange\" \"blueberry\" \"cantaloupe\" \"chili pepper\" \"cloudberry\" \"elderberry\" \"lime\" \"lychee\" \"mulberry\" \"olive\" \"salal berry\""
Introduce delimiter by searching for whitespace between double quotes, then remove the leading and trailing double quote.
new_input <- new_input %>%
str_replace_all('\\"[ ]*\\"',',') %>%
str_replace('^"','') %>%
str_replace('"$','')
new_input
## [1] "bell pepper,bilberry,blackberry,blood orange,blueberry,cantaloupe,chili pepper,cloudberry,elderberry,lime,lychee,mulberry,olive,salal berry"
Split the input string by the delimiter and convert to a vector.
new_input <- new_input %>%
str_split(pattern = ',') %>%
unlist()
new_input
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
Compare the converted input to our expected result.
identical(expected_result,new_input)
## [1] TRUE
Describe, in words, what these expressions will match:
(.)\1\1
Will match any character repeated 3 consecutive times, for example ‘aaa’.
(.)(.)\2\1
Will match any two characters that are repeated reversed, for example ‘abba’.
(..)\1
Will match any two characters that are repeated, for example ‘mama’
(.).\1.\1
Will match any character repeated two times, but separated by a character in between each occurrence, for example ‘anana’ in ‘banana’,
(.)(.)(.).*\3\2\1
Will match any three characters that are later reversed with any number of characters between them. For example, ‘abccba’, ‘abcdddcba’
Construct regular expressions to match words that:
Start and end with the same character.
^(.).*\1$
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
(..).*\1
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
I had to escape ’' the astrics in the following regex to prevent R Markdown from converting to italics.
(.).*\1.*\1