1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset, provide code that identifies the majors that contain either “DATA” or “STATISTICS”

I downloaded majors-list.csv from https://github.com/fivethirtyeight/data/tree/master/college-majors and saved it to my GitHub repository. This file contains college majors in the American Community Survey 2010-2012 Public Use Microdata Series provided by the US Census Bureau.

I read the raw data from GitHub into a data frame (tibble).

library(tidyverse)  # provides read_csv(), %>%, filter(), and str_detect()
majors <- read_csv('https://media.githubusercontent.com/media/alexandersimon1/Data607/main/Assignment3/majors-list.csv')
## Rows: 174 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Then I searched for the majors whose names contain either “DATA” or “STATISTICS”. Three majors matched. (Note: Wrapping the assignment in parentheses prints the assigned value.)

(interesting_majors <- majors %>%
  filter(str_detect(Major, "DATA") | str_detect(Major, "STATISTICS")))
## # A tibble: 3 × 3
##   FOD1P Major                                         Major_Category         
##   <chr> <chr>                                         <chr>                  
## 1 6212  MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business               
## 2 2101  COMPUTER PROGRAMMING AND DATA PROCESSING      Computers & Mathematics
## 3 3702  STATISTICS AND DECISION SCIENCE               Computers & Mathematics
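The two tests can also be folded into one pattern with regex alternation. The sketch below is a minimal illustration, not the code used above: it runs on a small stand-in vector (major_names, an illustrative name) instead of the full dataset, and uses base R's grepl() so that it has no package dependency. With stringr, the equivalent condition would be str_detect(Major, "DATA|STATISTICS").

```r
# Small stand-in for the Major column (not the full dataset)
major_names <- c(
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "GENERAL AGRICULTURE"
)

# "|" means "either side matches", so one pattern replaces the two tests
hits <- grepl("DATA|STATISTICS", major_names)

# Keep only the names that matched
matched <- major_names[hits]
print(matched)
```

Both forms are case-sensitive, which suits this dataset because the major names are stored in upper case.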

2. Write code that transforms the data below:

[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

The input is the printed form of a character vector, so I recreated that vector:

foods <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

I built the output from the inside out. First, I wrapped each element (food) in double quotes. Inside an R string literal the quote character must be escaped as \", and print() shows that escape in its output. The backslash is not part of the string itself; it is only how R displays an embedded quote.

(q <- paste0("\"", foods, "\""))
##  [1] "\"bell pepper\""  "\"bilberry\""     "\"blackberry\""   "\"blood orange\""
##  [5] "\"blueberry\""    "\"cantaloupe\""   "\"chili pepper\"" "\"cloudberry\""  
##  [9] "\"elderberry\""   "\"lime\""         "\"lychee\""       "\"mulberry\""    
## [13] "\"olive\""        "\"salal berry\""
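To make the role of the backslash concrete, here is a minimal sketch on a single element (q1 is an illustrative name, not part of the solution). It compares the escaped display from print() with the raw characters written by writeLines(), and counts the characters actually stored.

```r
# Wrap one element in double quotes, exactly as in the step above
q1 <- paste0("\"", "lime", "\"")

print(q1)       # escaped display: the embedded quotes appear as \"
writeLines(q1)  # raw characters: no backslashes are written

# Four letters plus two quote characters; the backslashes are not stored
n_chars <- nchar(q1)
print(n_chars)
```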

Next, I collapsed the quoted elements into a single string, separating them with a comma and a space.

(s <- paste0(q, collapse = ", "))
## [1] "\"bell pepper\", \"bilberry\", \"blackberry\", \"blood orange\", \"blueberry\", \"cantaloupe\", \"chili pepper\", \"cloudberry\", \"elderberry\", \"lime\", \"lychee\", \"mulberry\", \"olive\", \"salal berry\""

Then, I added the leading and trailing characters.

(w <- paste0("c(", s, ")"))
## [1] "c(\"bell pepper\", \"bilberry\", \"blackberry\", \"blood orange\", \"blueberry\", \"cantaloupe\", \"chili pepper\", \"cloudberry\", \"elderberry\", \"lime\", \"lychee\", \"mulberry\", \"olive\", \"salal berry\")"

Finally, I printed the string with cat(), which writes the raw characters instead of the escaped representation that print() shows. The output is now in the desired format.

cat(w)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
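As a sanity check, the generated text should be valid R code that rebuilds the original vector. The sketch below condenses the three paste0() steps into one call and then parses and evaluates the result. The name round_trip is introduced here for illustration, and eval(parse()) is used only as a test on text this script created itself.

```r
foods <- c("bell pepper", "bilberry", "blackberry", "blood orange",
           "blueberry", "cantaloupe", "chili pepper", "cloudberry",
           "elderberry", "lime", "lychee", "mulberry", "olive",
           "salal berry")

# Quote each element, join with ", ", then wrap the result in c( ... )
w <- paste0("c(", paste0("\"", foods, "\"", collapse = ", "), ")")
cat(w, "\n")

# Parse the generated text as R code and evaluate it
round_trip <- eval(parse(text = w))

# TRUE means the text reproduces the original vector exactly
print(identical(round_trip, foods))
```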

3. Describe, in words, what these expressions will match:

4. Construct regular expressions to match words that:

a) Start and end with the same character.

^(.).*\1$

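Explanation: (.) captures the first character, .* matches any run of characters in between, and \1 requires the last character to equal the captured one. The anchors ^ and $ force the match to span the whole string, so the pattern needs at least two characters and does not match a single-character word. In an R string literal the backslash must be doubled: "^(.).*\\1$".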

b) Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice).

(.)(.).*\1\2

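Explanation: (.)(.) captures two consecutive characters, .* allows anything in between, and \1\2 requires the same two characters to appear again in the same order. Because . matches any character, not only letters, replacing each . with [A-Za-z] restricts the match to pairs of letters.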

c) Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s).

(.).*\1.*\1

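Explanation: (.) captures one character and each \1 requires another occurrence of it, with .* allowing any characters in between, so the captured character must appear at least three times.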
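The three patterns from a), b), and c) can be checked against a few words. This is a minimal sketch: the test words and variable names are chosen for illustration, and it uses base R's grepl() with perl = TRUE instead of stringr so that it runs without packages.

```r
# Patterns from a), b), c); backslashes are doubled inside R string literals
same_ends     <- "^(.).*\\1$"
repeated_pair <- "(.)(.).*\\1\\2"
three_times   <- "(.).*\\1.*\\1"

# Illustrative test words (chosen for this sketch)
words <- c("abba", "church", "eleven", "banana", "a")

# perl = TRUE selects the PCRE engine, which supports backreferences
results <- data.frame(
  word          = words,
  same_ends     = grepl(same_ends, words, perl = TRUE),
  repeated_pair = grepl(repeated_pair, words, perl = TRUE),
  three_times   = grepl(three_times, words, perl = TRUE)
)
print(results)
```

Here "abba" matches only a), "church" only b), "eleven" only c), "banana" matches b) and c), and the single-character word "a" matches none, since every pattern needs the captured character to occur again.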