Assignment 3

Answer the following questions

Question #1

Provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

library(RCurl) #getURL()
library(dplyr) #filter() and %>%
library(gt)    #gt() and tab_options()

x <- getURL("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")
majors <- read.csv(text = x)

#At first, I had it as a vector and then decided to have it as a dataframe to display it as a table
#major <- majors$Major
#vdata_statistics_majors <- grep(pattern = "data|statistics", major, ignore.case = TRUE)
dfdata_statistics_majors <- majors %>% filter(grepl(pattern = "data|statistics", Major, ignore.case = TRUE))

dfdata_statistics_majors %>%
    gt() %>%
    tab_options(table.font.size = px(8))

FOD1P	Major	Major_Category
6212	MANAGEMENT INFORMATION SYSTEMS AND STATISTICS	Business
2101	COMPUTER PROGRAMMING AND DATA PROCESSING	Computers & Mathematics
3702	STATISTICS AND DECISION SCIENCE	Computers & Mathematics
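
The same filter can be tried without a network connection or extra packages. The following is a minimal base-R sketch, assuming a small hand-typed sample of the Major column: `sample_majors` is a hypothetical name, the first three values are taken from the table above, and the last two are illustrative non-matching entries.

```r
# Hypothetical sample: the first three values appear in the table above,
# the last two are illustrative entries that should not match.
sample_majors <- c(
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "GENERAL AGRICULTURE",
  "ACCOUNTING"
)

# grepl() returns one TRUE/FALSE per element; "|" means "either alternative"
is_match <- grepl("data|statistics", sample_majors, ignore.case = TRUE)

# Logical indexing keeps only the matching majors
matched_majors <- sample_majors[is_match]
print(matched_majors)
```

Because `grepl()` returns a logical vector, it can be used directly inside `filter()`, which is what the dplyr version above relies on.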

Question #2

Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

#create a vector with the data
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee","mulberry", "olive", "salal berry")

#Convert the vector to a single string that looks like R code
string_fruits <-
  paste("c(", #opening of the vector
        paste0('"', fruits, '"', collapse = ", "), #each element wrapped in quotation marks, joined with a comma and a space
        ")", #closing of the vector
        sep = "") #no separator, so no spaces are added between the three pieces

#Print the string using cat() to avoid the [1] index prefix and escaped quotation marks
cat(string_fruits, "\n")

## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
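
Another reading of the question is that the starting point is the printed output itself rather than an existing vector. The following is a sketch under that assumption, using base R only; `printed`, `quoted`, and `result` are hypothetical names, and the printed lines are typed in with straight quotation marks.

```r
# The printed output, one element per console line (single quotes let the
# double quotation marks stay unescaped)
printed <- c(
  '[1] "bell pepper" "bilberry" "blackberry" "blood orange"',
  '[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"',
  '[9] "elderberry" "lime" "lychee" "mulberry"',
  '[13] "olive" "salal berry"'
)

# Extract every double-quoted chunk; the [n] index prefixes are left behind
quoted <- unlist(regmatches(printed, gregexpr('"[^"]+"', printed)))

# Join the quoted items with commas and wrap them in c( )
result <- paste0("c(", paste(quoted, collapse = ", "), ")")
cat(result, "\n")
```

The pattern `"[^"]+"` matches an opening quotation mark, one or more characters that are not quotation marks, and a closing quotation mark, so items containing spaces such as "bell pepper" stay together.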

Question #3

Describe, in words, what these expressions will match:

(I am assuming that the inconsistency in quotation marks and double backslashes among these expressions is not relevant, and that all of them should be read with double backslashes, since R needs two backslashes in a string literal to produce one literal backslash.)
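
One way to check the assumption about double backslashes is to compare the two spellings directly. This is a small base-R sketch, not part of the original assignment; the variable names are hypothetical.

```r
# "\\1" in R source is two characters: a backslash and the digit 1,
# which the regex engine reads as a backreference to group 1
escaped_len <- nchar("\\1")

# "\1" in R source is an octal escape: a single control character
unescaped_len <- nchar("\1")

# With the doubled backslash the pattern means "same character three times"
with_escape <- grepl("(.)\\1\\1", "aaa")

# With the single backslash the pattern asks for any character followed by
# two control characters, so "aaa" is not matched
without_escape <- grepl("(.)\1\1", "aaa")

print(c(escaped_len = escaped_len, unescaped_len = unescaped_len))
print(c(with_escape = with_escape, without_escape = without_escape))
```

So when these expressions are typed as R strings, the single-backslash and double-backslash forms are not interchangeable; the descriptions below treat every expression as if it were written with double backslashes.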

(.)\1\1

This regex matches the same character three times in a row, like aaa. The (.) captures any single character, the first \1 requires the second character to be identical to it, and the second \1 requires the same of the third character.

“(.)(.)\2\1”

This regex matches any two characters followed by the same two characters in reverse order, like abba. The \2 requires the third character to match the second, and the following \1 requires the fourth character to match the first.

(..)\1

This regex matches a pair of characters that is immediately repeated, like haha. The (..) captures any two characters as one group, and the \1 requires that group to appear again right after it.

“(.).\1.\1”

This regex looks for the same character repeated in alternating positions, like a?a?a. The (.) captures a character and remembers it. The next . accepts any character at all, and the \1 after it requires the remembered character to appear again. This is followed by another . and another \1, which work the same way as the first pair.

“(.)(.)(.).*\3\2\1”

This regex is similar to the second expression, with some differences. It captures three characters instead of two, allows anything in between, and then requires the original three characters in reverse order, like abc?cba. Importantly, the middle .* can stand for any number of characters, including zero, so it can match abccba, abc???…cba, or anything in between.
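
The descriptions above can be checked against their own example strings. The following is a base-R sketch that assumes every expression is written with double backslashes; the names in `patterns` are hypothetical labels added for readability.

```r
# The five expressions, all written with doubled backslashes
patterns <- c(
  triple    = "(.)\\1\\1",
  mirror2   = "(.)(.)\\2\\1",
  pair      = "(..)\\1",
  alternate = "(.).\\1.\\1",
  mirror3   = "(.)(.)(.).*\\3\\2\\1"
)

# One example string per expression, taken from the descriptions above
examples <- c("aaa", "abba", "haha", "a?a?a", "abc?cba")

# mapply() pairs each pattern with its own example
hits <- mapply(grepl, patterns, examples)
print(hits)

# A string with no repeated characters should fail every pattern
misses <- vapply(patterns, grepl, logical(1), x = "abcdef")
print(misses)

# .* may match zero characters, so abccba also satisfies the last pattern
zero_gap <- grepl(patterns[["mirror3"]], "abccba")
print(zero_gap)
```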

Question #5

Construct regular expressions to match words that:

1. Start and end with the same character.

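library(stringr) #str_extract_all()

#\\b marks the word boundaries, (\\w) captures the first word character, \\w* allows any number of word characters in between, and \\1 requires the last character to equal the captured one
#Note: a one-character word such as "I" is not matched, because the pattern needs at least two characters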
regex <- "\\b(\\w)\\w*\\1\\b"
matches <- str_extract_all("hello everyone aha I am shaya", regex)
matches

## [[1]]
## [1] "everyone" "aha"
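
A one-character word arguably starts and ends with the same character too. The following is a sketch of a variant that accepts such words, assuming that this is the intended behaviour; it uses base R instead of stringr, and `extract_all`, `two_plus`, and `one_plus` are hypothetical names.

```r
text <- "hello everyone aha I am shaya"

# Pattern used above: needs at least two characters, so "I" is skipped
two_plus <- "\\b(\\w)\\w*\\1\\b"

# Variant: the optional group (\\w*\\1)? lets a lone character count as
# starting and ending with itself
one_plus <- "\\b(\\w)(\\w*\\1)?\\b"

# Small helper: return every match of `pattern` in the string `x`
extract_all <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

print(extract_all(two_plus, text))
print(extract_all(one_plus, text))
```

The only difference between the two results on this sentence is the word "I".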

2. Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

#\\b marks the word boundaries, \\w* at the beginning, middle, and end allows other word characters in all those places, (\\w{2,}) captures a run of two or more word characters, and \\1 requires that same run to appear again later in the word
regex <- "\\b\\w*(\\w{2,})\\w*\\1\\w*\\b"
matches <- str_extract_all("church test pishposh test test", regex) 
matches

## [[1]]
## [1] "church"   "pishposh"

3. Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

#\\b marks the word boundaries, (\\w) captures any single word character, the two \\1 references require that captured character to appear at least twice more, and the four \\w* allow any number of other word characters before, between, and after
regex <- "\\b\\w*(\\w)\\w*\\1\\w*\\1\\w*\\b"
matches <- str_extract_all("eleven seventeen eighteen test", regex)
matches

## [[1]]
## [1] "eleven"    "seventeen" "eighteen"
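
The regex result can be cross-checked without regular expressions by counting letters directly. This is a base-R sketch, not the method used above; `has_triple_letter` is a hypothetical helper name.

```r
words <- c("eleven", "seventeen", "eighteen", "test")

# Split a word into single characters and check whether any of them
# occurs three or more times
has_triple_letter <- function(word) {
  letters_in_word <- strsplit(word, "")[[1]]
  max(table(letters_in_word)) >= 3
}

# Count-based answer, one TRUE/FALSE per word
by_count <- vapply(words, has_triple_letter, logical(1))

# Regex-based answer using the core of the pattern above
by_regex <- grepl("(\\w)\\w*\\1\\w*\\1", words)

print(by_count)
print(by_regex)
```

Both approaches agree on these four words: the first three contain a letter in at least three places and "test" does not.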

Shaya Engelman

2023-09-20
