Data 607 - Week 3 Assignment

Load the Required Packages

library(tidyverse)
library(DT)
library(DescTools)

Part 1:

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/, provide code that identifies the majors that contain either “DATA” or “STATISTICS” in their name.

my_url1 <- "https://raw.githubusercontent.com/geedoubledee/data607_week3/main/majors-list.csv"
majors_df <- read.csv(my_url1, header = TRUE)
majors_filtered_df <- majors_df |> filter(Major %like any% c("%DATA%", "DATA%", "%DATA", "%STATISTICS%", "STATISTICS%", "%STATISTICS"))
datatable(majors_filtered_df, rownames = FALSE, options = list(dom = "t", order = list(list(2, "asc"))))

Part 2:

my_url2 <- "https://raw.githubusercontent.com/geedoubledee/data607_week3/main/data.txt"
data_str <- paste(readLines(my_url2))

Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”).

#eliminate bracketed nums and space after
data_str <- str_remove_all(data_str, r"{\[.+\] }")
#replace any num of spaces separating quoted words with ;
data_str <- str_replace_all(data_str, r"{\" +\"}", r"{\";\"}")
#eliminate quotation marks
data_str <- str_remove_all(data_str, r"{\"}")
#trim trailing whitespace
data_str <- trimws(data_str, which = "right")
#create a df from str so we can use separate_longer_delim func with ; delim
data_df <- as.data.frame(data_str)
data_df <- data_df |> separate_longer_delim(cols = 1, delim = ";")
#each item is now a separate row in a single column of the df
#so we can create our character vector from that
data_vec <- data_df$data_str
print(data_vec)

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

Part 3:

Describe, in words, what these expressions will match:

spell_check <- c("buff", "spoooky", "cheeers", "cattle", "gogggles", "boooks", "geese")
names <- c("Hannah", "Jesse", "Carl", "Steve", "Sarah", "Estelle", "Bobby", "Eliana", "Jana", "Amanda")
fruits <- c("banana", "coconut", "pear", "apple", "papaya", "tomato")
my_url3 <- "https://raw.githubusercontent.com/geedoubledee/data607_week3/main/zipcodes_partial.csv"
zip_codes <- read.csv(my_url3, header = FALSE, colClasses = "character")
zip_codes <- zip_codes$V1

(.)\1\1

This expression finds patterns where any character appears three times in succession. The . denotes that this can be any character, and the parentheses place that character in capture group 1, which is then back referenced twice so that only strings where the same character appears three times in succession are returned. Below is an example using a vector that includes misspelled words:

str_view(spell_check, "(.)\\1\\1")

## [2] │ sp<ooo>ky
## [3] │ ch<eee>rs
## [5] │ go<ggg>les
## [6] │ b<ooo>ks

“(.)(.)\2\1”

This expression finds patterns where any two characters are immediately followed by the same two characters in reverse order. That’s why capture group 2 is back referenced before capture group 1. Below is an example using a vector of names:

str_view(names, "(.)(.)\\2\\1")

## [1] │ H<anna>h
## [2] │ J<esse>
## [6] │ Est<elle>

(..)\1

This expression finds patterns where any two characters are immediately followed by the same two characters in the same order. Below is an example using a vector of fruits:

str_view(fruits, "(..)\\1")

## [1] │ b<anan>a
## [2] │ <coco>nut
## [5] │ <papa>ya

“(.).\1.\1”

This expression finds patterns where a character is repeated three times, with any other character in between instances 1 and 2, as well as any other character between instances 2 and 3. Below is an example using a vector of some zip codes:

str_view(zip_codes, "(.).\\1.\\1")

## [23] │ <11101>
## [50] │ <21222>
## [66] │ <30303>

“(.)(.)(.).*\3\2\1”

This expression finds patterns that contain any three characters, then also contain those same three characters in reverse order later. There can be any number of any characters between the first set of three characters and the later reverse set of those same characters, as denoted by the . and the *. Below is an example using the names vector again, with all elements in lowercase this time:

names_lower <- str_to_lower(names)
str_view(names_lower, "(.)(.)(.).*\\3\\2\\1")

## [1] │ <hannah>

Part 4:

Construct regular expressions to match words that:

Start and end with the same character.

“^(.).*\1$”

str_view(names_lower, "^(.).*\\1$")

##  [1] │ <hannah>
##  [6] │ <estelle>
## [10] │ <amanda>

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

“(..).*\1”

str_view(fruits, "(..).*\\1")

## [1] │ b<anan>a
## [2] │ <coco>nut
## [5] │ <papa>ya
## [6] │ <tomato>

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

“(.).\1.\1”

str_view(names_lower, "(.).*\\1.*\\1")

##  [6] │ <estelle>
##  [7] │ <bobb>y
## [10] │ <amanda>