DATA 607 - Assignment 3

Problem 1

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset, provide code that identifies the majors that contain either “DATA” or “STATISTICS”

majors <- read_csv(url("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"), show_col_types = FALSE)

data_or_stat <- majors |>
  filter(str_detect(Major, "DATA") | str_detect(Major, "STATISTICS"))

knitr::kable(data_or_stat)

FOD1P	Major	Major_Category
6212	MANAGEMENT INFORMATION SYSTEMS AND STATISTICS	Business
2101	COMPUTER PROGRAMMING AND DATA PROCESSING	Computers & Mathematics
3702	STATISTICS AND DECISION SCIENCE	Computers & Mathematics

Problem 2

Write code that transforms the data below:

[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"  
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
[9] "elderberry"   "lime"         "lychee"       "mulberry"  
[13] "olive"        "salal berry"

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

Loading the Data

str1 <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"' 
str2 <- '[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"'
str3 <- '[9] "elderberry"   "lime"         "lychee"       "mulberry"'
str4 <- '[13] "olive"        "salal berry"'

full_string <- paste(str1, str2, str3, str4)

full_string

## [1] "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\" [5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\" [9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\" [13] \"olive\"        \"salal berry\""

Creating the Function

Initially I created the following function which replaces anything between the words with a comma, removes whatever is before the first entry or after the last entry, and then splits and formats the string:

formatString <- function(string) {
  temp <- gsub('(\\[\\d+\\])?(\\s)?\\"', "", gsub('\"(\\s+)?(\\[\\d+\\])?(\\s+)?"', ",", string))
  
  temp <- unlist(str_split(temp, ","))
  
  dput(temp)
}

formatString(full_string)

## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", 
## "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", 
## "lychee", "mulberry", "olive", "salal berry")

I also created an alternative function which instead extracts all the desired words from the string and then formats the string properly:

formatString <- function (string) {
  temp <- unlist(str_extract_all(string, '[a-zA-Z]+\\s?[a-zA_Z]+'))
  
  dput(temp)
}

formatString(full_string)

## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", 
## "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", 
## "lychee", "mulberry", "olive", "salal berry")

Both functions output the same desired format.

Problem 3

Describe, in words, what these expressions will match:

(.)\1\1

This will match any three consecutive characters that are all the same (ex: 'ooo'). The (.) captures a character in a group and each \1 calls that group (i.e. matches to that same character).

"(.)(.)\\2\\1"

This will match any four consecutive characters in the string where the last two characters are the same as the first two characters in reverse (ex: 'eppe' in pepper). Each (.) groups one characters and then \2 and \1 call those groups in the reverse.

(..)\1

This will match any four consecutive characters where the first two are the same as the last two (ex: 'anan' in banana). (..) groups two consecutive characters and \1 matches the same two characters.

"(.).\\1.\\1"

This will match any five consecutive characters where character 1, 3, and 5 are the same (ex: 'abana' in cabana). (.) groups a character, . can be any character, \1 matches the first character, . can again be any character, and \1 again matches the first character.

"(.)(.)(.).*\\3\\2\\1"

This will match any string that contains at least six characters where the last three characters are the same as the first three characters in reverse order (ex: 'abcanythingcba'). Each (.) captures a different character, .* matches any number of characters, and \3, \2, \1 match the first three characters in reverse order.

test_string_3 <- c("coool", "banana", "cabana", "pepper", "boooo", "abcanythingcba", "deffed")

str_view(test_string_3, '(.)\\1\\1')

str_view(test_string_3, '(.)(.)\\2\\1')

str_view(test_string_3, '(..)\\1')

str_view(test_string_3, '(.).\\1.\\1')

str_view(test_string_3, '(.)(.)(.).*\\3\\2\\1')

Problem 4

Construct regular expressions to match words that:

Start and end with the same character.

"^(.).*\\1$"
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice).

"(..).*\\1"
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s).

"(.).*\\1.*\\1"

test_string_4 <- c("stats", "church", "blurb", "eleven", "strawberry", "pepper")

str_view(test_string_4, '^(.).*\\1$')

str_view(test_string_4, '(..).*\\1')

str_view(test_string_4, '(.).*\\1.*\\1')