1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset, provide code that identifies the majors that contain either “DATA” or “STATISTICS”
I downloaded majors-list.csv from https://github.com/fivethirtyeight/data/tree/master/college-majors and saved it to my GitHub repository. This file contains college majors in the American Community Survey 2010-2012 Public Use Microdata Series provided by the US Census Bureau.
I read the raw data from GitHub into a data frame (tibble).
majors <- read_csv('https://media.githubusercontent.com/media/alexandersimon1/Data607/main/Assignment3/majors-list.csv')
## Rows: 174 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Then I searched for the majors that contained either “DATA” or “STATISTICS”. There were 3 majors that matched these criteria. (Note: The outermost parentheses print the result of the expression inside.)
(interesting_majors <- majors %>%
filter(str_detect(Major, "DATA") | str_detect(Major, "STATISTICS")))
## # A tibble: 3 × 3
## FOD1P Major Major_Category
## <chr> <chr> <chr>
## 1 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business
## 2 2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3 3702 STATISTICS AND DECISION SCIENCE Computers & Mathematics
2. Write code that transforms the data below:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”
Into a format like this:
c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
The input string is essentially a vector, so I created the vector:
foods <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
I performed the transformation from the inside out. First, I added quotes around each element (food). The quote character needs to be escaped, which introduces a backslash that will be omitted in a subsequent step.
(q <- paste0("\"", foods, "\""))
## [1] "\"bell pepper\"" "\"bilberry\"" "\"blackberry\"" "\"blood orange\""
## [5] "\"blueberry\"" "\"cantaloupe\"" "\"chili pepper\"" "\"cloudberry\""
## [9] "\"elderberry\"" "\"lime\"" "\"lychee\"" "\"mulberry\""
## [13] "\"olive\"" "\"salal berry\""
Next, I collapsed the quoted words into a string, separating each word with a comma.
(s <- paste0(q, collapse = ", "))
## [1] "\"bell pepper\", \"bilberry\", \"blackberry\", \"blood orange\", \"blueberry\", \"cantaloupe\", \"chili pepper\", \"cloudberry\", \"elderberry\", \"lime\", \"lychee\", \"mulberry\", \"olive\", \"salal berry\""
Then, I added the leading and trailing characters.
(w <- paste0("c(", s, ")"))
## [1] "c(\"bell pepper\", \"bilberry\", \"blackberry\", \"blood orange\", \"blueberry\", \"cantaloupe\", \"chili pepper\", \"cloudberry\", \"elderberry\", \"lime\", \"lychee\", \"mulberry\", \"olive\", \"salal berry\")"
Finally, I parsed the backslashes as escape characters. The output is now in the desired format.
cat(w)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
3. Describe, in words, what these expressions will match:
(.)\1\1 - matches all words with 3 or more adjacent characters that are identical (eg, aaa). \1 is a back reference to the most recent match of group 1, which is (.) [any character].
“(.)(.)\\2\\1” - matches a leading quotation mark, followed by 2 characters (any), then the literal string \2\1, and a trailing quotation mark (eg, “ab\2\1”). The regular expression \\2\\1 matches \2\1 because the first backslash escapes the second backslash.
(..)\1 - matches all words with 2 adjacent repeats of a pair of characters (eg, abab). \1 is a back reference to group 1, (..), which matches any 2 adjacent characters.
“(.).\\1.\\1” - matches a leading quotation mark, followed by 2 characters (any), then the literal string \1, followed by any character, then the literal string \1, and a trailing quotation mark. As explained above, the regular expression \\1 matches \1 because the first backslash escapes the second backslash.
“(.)(.)(.).*\\3\\2\\1” - matches a leading quotation mark, followed by 3 or more characters (any), then the literal string \3\2\1, and a trailing quotation mark. The regular expression .* matches any character 0 or more times. As explained above, the first backslash in \\ escapes the second backslash.
4. Construct regular expressions to match words that:
a) Start and end with the same character.
^(.).*\1$
Explanation:
^ and $ match the beginning and end of a word
(.) matches any character at the first position of a string. The parentheses enclose a group (group 1) for back reference
.* matches any character 0 or more times
\1 refers to group 1 (first character)
Together, this regular expression will match words that have identical characters at the beginning and end, with any number of characters in between.
b) Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice).
(.)(.).*\1\2
Explanation:
(.)(.) matches 2 adjacent characters (any). Each parenthesis encloses a group for back reference
\1\2 matches the characters captured by (.)(.)
.* matches any number of characters in between the 2 sets of characters above
Together, this regular expression will match words that repeat a pair of adjacent characters, with any number of characters in between.
c) Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s).
(.).*\1.*\1
Explanation:
(.) matches any character. The parentheses enclose a group (group 1) for back reference.
This is followed by 2 repetitions of:
.* that matches any character 0 or more times
\1 that refers to group 1 (matched by the first character)
Together, this regular expression will match a character repeated 3 or more times with any number of characters in between each occurrence of the character.