Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to the problems below. You may work in a small group, but please submit separately with names of all group participants in your submission.
Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset (The Economic Guide To Picking A College Major), provide code that identifies the majors that contain either “DATA” or “STATISTICS”.
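# load the tidyverse packages used throughout (readr, dplyr, tibble, stringr)
library(tidyverse)
# define the URL for the raw csv file containing the list of majors
majors_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
# read the csv file and store the records in a data frame
majors <- read_csv(majors_url)
# convert the data frame to a tibble (read_csv already returns a tibble, so this only makes the type explicit)
mt <- as_tibble(majors)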
# from the list of majors filter only those which contain the words DATA OR STATISTICS
tbl2 <- mt %>% filter(grepl('DATA|STATISTICS', Major, ignore.case = TRUE))
# count the total number of majors in the majors list
total_majors <- nrow(majors) # actual total of majors in the data set
# An actual total of `total_majors` majors was found in the site's current data set.
# display the list of filtered majors
knitr::kable(tbl2, caption = 'Table: List of Majors that contain the words "DATA" or "STATISTICS"')
| FOD1P | Major | Major_Category |
|---|---|---|
| 6212 | MANAGEMENT INFORMATION SYSTEMS AND STATISTICS | Business |
| 2101 | COMPUTER PROGRAMMING AND DATA PROCESSING | Computers & Mathematics |
| 3702 | STATISTICS AND DECISION SCIENCE | Computers & Mathematics |
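The effect of `ignore.case` on this filter can be sketched on a small vector. This is a minimal illustration in base R; `sample_majors` is a hypothetical stand-in for the `Major` column, and its last entry is invented to show a mixed-case value.

```r
# Hypothetical sample standing in for the Major column of majors-list.csv
sample_majors <- c("STATISTICS AND DECISION SCIENCE",
                   "COMPUTER PROGRAMMING AND DATA PROCESSING",
                   "GENERAL BUSINESS",
                   "Applied statistics")

# Case-sensitive: only upper-case DATA or STATISTICS is found
upper_only <- grepl("DATA|STATISTICS", sample_majors)

# Case-insensitive: the mixed-case entry is found as well
any_case <- grepl("DATA|STATISTICS", sample_majors, ignore.case = TRUE)

sample_majors[any_case]
```

Because the pattern is not anchored to word boundaries, it would also match longer words that contain it, such as a major containing "DATABASE".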
Write code that transforms the data below:
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
# store the character data as is, including spaces, line breaks, and double quotes
raw_data <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'
# define a regex that matches the text between double quotes: one word, optionally followed by a space and a second word
regex_double_quote_surrounded_words <- '(?<=")\\w+(?: \\w+)?(?=")'
# extract the quoted words; str_extract_all returns a list, so take its first element
extracted_quoted_words <- str_extract_all(raw_data, regex_double_quote_surrounded_words)[[1]]
# wrap each word in double quotes and join them into a single c(...) string as per the ask
new_character_format <- str_c("c(", str_c('"', extracted_quoted_words, '"', collapse = ", "), ")")
extracted_quoted_words
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
writeLines(new_character_format)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
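The same transformation can also be sketched in base R alone. The example below is illustrative, not the assignment's required method: it uses a shortened, hypothetical three-item input (`short_raw`), pulls out the quoted items with `regmatches()`, and relies on `deparse()` to produce the `c(...)` source form.

```r
# Shortened, hypothetical input in the same printed-vector layout
short_raw <- '[1] "bell pepper" "bilberry"\n[3] "lime"'

# Find every run of non-quote characters enclosed in double quotes
quoted <- regmatches(short_raw, gregexpr('"[^"]+"', short_raw))[[1]]

# Remove the surrounding quotes to leave the bare names
fruits <- gsub('"', "", quoted, fixed = TRUE)

# deparse() gives the R source text of the vector; long vectors come back
# in several pieces, so paste them together into one string
vector_code <- paste(deparse(fruits), collapse = "")

writeLines(vector_code)
```

Letting `deparse()` add the quotes and commas avoids assembling the `c(...)` string by hand.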
# Build a tibble with a list of sample words to be tested against the regular expressions
dt_test <- tibble( words = c("Ohio","Pennelope","Kabbak","tomato","eleven","church","Pop","papa","Robocop","look","Pajama","noN","XXXrated"))
| words |
|---|
| Ohio |
| Pennelope |
| Kabbak |
| tomato |
| eleven |
| church |
| Pop |
| papa |
| Robocop |
| look |
| Pajama |
| noN |
| XXXrated |
The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:
Describe, in words, what these expressions will match:
(.)\1\1
It matches a character that repeats 3 times consecutively.
dt_matches <- dt_test %>% dplyr::filter(str_detect(words, regex('(.)\\1\\1', ignore_case = T)))
knitr::kable(dt_matches)
| words |
|---|
| XXXrated |
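Whether case is ignored changes what a backreference accepts. The following base R sketch uses hypothetical sample words (only "XXXrated" comes from the list above) and assumes `perl = TRUE` so that the backreferences are handled by PCRE.

```r
# Hypothetical sample words, chosen only to illustrate the pattern
words <- c("XXXrated", "bookkeeper", "Aaa", "look")

# (.) captures any one character; \\1\\1 requires that same character twice more
triple_pattern <- "(.)\\1\\1"

# Case-sensitive: "Aaa" fails because "A" and "a" are different characters
case_sensitive <- grepl(triple_pattern, words, perl = TRUE)

# Case-insensitive: the backreference ignores case too, so "Aaa" now matches
case_insensitive <- grepl(triple_pattern, words, perl = TRUE, ignore.case = TRUE)

data.frame(words, case_sensitive, case_insensitive)
```

"bookkeeper" has three doubled letters in a row but no tripled one, so it does not match in either mode.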
"(.)(.)\2\1"
It matches two characters followed by the same two characters but in reversed order.
dt_matches <- dt_test %>% dplyr::filter(str_detect(words, regex("(.)(.)\\2\\1", ignore_case = T)))
knitr::kable(dt_matches)
| words |
|---|
| Pennelope |
| Kabbak |
(..)\1
It matches two characters followed by the same two characters.
dt_matches <- dt_test %>% dplyr::filter(str_detect(words, regex("(..)\\1", ignore_case = T)))
knitr::kable(dt_matches)
| words |
|---|
| papa |
"(.).\1.\1"
It matches five consecutive characters in which the first, third, and fifth are the same character; the second and fourth can be any character.
dt_matches <- dt_test %>% dplyr::filter(str_detect(words, regex("(.).\\1.\\1", ignore_case = T)))
knitr::kable(dt_matches)
| words |
|---|
| eleven |
| Robocop |
| Pajama |
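To see which part of each word the pattern actually covers, the matched substring can be extracted. This is a small base R sketch using three of the words above plus one non-matching word; matching here is case-sensitive.

```r
# Three matching words from the table above and one word that does not match
words <- c("eleven", "Robocop", "Pajama", "look")

# One character, any character, the first again, any character, the first again
pattern <- "(.).\\1.\\1"

# regexpr() locates the first match in each word (-1 where there is none)
match_positions <- regexpr(pattern, words, perl = TRUE)

# regmatches() returns the matched text for the words that matched
matched_text <- regmatches(words, match_positions)
matched_text
```

Each extracted piece is five characters long, with the captured character in positions one, three, and five.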
"(.)(.)(.).*\3\2\1"
It matches three characters, followed by zero or more characters of any kind, followed by the same three characters in reverse order.
dt_matches <- dt_test %>% dplyr::filter(str_detect(words, regex("(.)(.)(.).*\\3\\2\\1", ignore_case = T)))
knitr::kable(dt_matches)
| words |
|---|
| Kabbak |
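The role of `.*` in this pattern can be sketched by comparing it with a variant that omits it. The strings below are hypothetical and were picked only to separate the two cases.

```r
# Hypothetical strings: reversed triple with no gap, with a gap, and not reversed
samples <- c("abccba", "abc-xyz-cba", "abcabc")

# With .* the reversed triple may appear any distance after the first triple
with_gap <- grepl("(.)(.)(.).*\\3\\2\\1", samples, perl = TRUE)

# Without .* the reversed triple must follow immediately
without_gap <- grepl("(.)(.)(.)\\3\\2\\1", samples, perl = TRUE)

data.frame(samples, with_gap, without_gap)
```

"abcabc" repeats the triple in the same order, not reversed, so neither pattern matches it.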
# define a regular expression that will match words starting and ending with the same character: \b(\w)\w*\1\b
regex_same_1st_n_last_char <- "\\b(\\w)\\w*\\1\\b"
# filter the words that match the regular expression
dt_flc_filtered <- dt_test %>% dplyr::filter(str_detect(words, regex(regex_same_1st_n_last_char, ignore_case = T)))
knitr::kable(dt_flc_filtered, caption = 'Table: List of words starting and ending with the same character')
| words |
|---|
| Ohio |
| Kabbak |
| Pop |
| noN |
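A pattern of the form `\b(\w)\w*\1\b` needs at least two characters, so a one-letter word is never matched. The sketch below shows one possible variant that makes the tail optional; the variant and the sample words are assumptions for illustration and are not part of the exercise.

```r
# Hypothetical sample words, including a one-letter word
words <- c("Ohio", "level", "a", "look")

# First and last character equal; requires at least two characters
two_or_more <- "\\b(\\w)\\w*\\1\\b"

# Illustrative variant: the tail is optional, so a single character also counts
one_or_more <- "\\b(\\w)(\\w*\\1)?\\b"

strict  <- grepl(two_or_more, words, perl = TRUE, ignore.case = TRUE)
relaxed <- grepl(one_or_more, words, perl = TRUE, ignore.case = TRUE)

data.frame(words, strict, relaxed)
```

Whether a one-letter word should count as starting and ending with the same character is a judgment call; the exercise does not say.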
# define a regular expression that will match words that contain a repeated pair of letters anywhere in the word: (\w\w)\w*\1
regex_2_rep_letters <- "(\\w\\w)\\w*\\1"
# filter the words that match the regular expression
dt_flc_filtered <- dt_test %>% dplyr::filter(str_detect(words, regex(regex_2_rep_letters, ignore_case = T)))
knitr::kable(dt_flc_filtered, caption = 'Table: List of words that contain a repeated pair of letters')
| words |
|---|
| Pennelope |
| tomato |
| church |
| papa |
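Word-boundary anchors change what "a repeated pair" means: with `\b` at both ends the pair must open and close the word, while without them the pair may repeat anywhere. A short base R sketch comparing the two forms, using hypothetical sample words:

```r
# Hypothetical sample words; "banana" repeats "an" in the middle of the word
words <- c("church", "banana", "papa", "look")

# Anchored: the word must begin and end with the same pair
anchored <- grepl("\\b(..)\\w*\\1\\b", words, perl = TRUE)

# Unanchored: the pair may repeat anywhere inside the word
unanchored <- grepl("(\\w\\w)\\w*\\1", words, perl = TRUE)

data.frame(words, anchored, unanchored)
```

Only "banana" separates the two forms here, which is why both give the same result on the thirteen test words above.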
# define a regular expression that will match words that contain one letter repeated in at least three places: ([a-z])\w*\1\w*\1
regex_3_rep_letters <- "([a-z])\\w*\\1\\w*\\1"
# filter the words that match the regular expression
dt_flc_filtered <- dt_test %>% dplyr::filter(str_detect(words, regex(regex_3_rep_letters, ignore_case = T)))
knitr::kable(dt_flc_filtered, caption = 'Table: List of words that contain one letter repeated in at least three places')
| words |
|---|
| Pennelope |
| eleven |
| Robocop |
| Pajama |
| XXXrated |
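The regex result can be cross-checked without regular expressions by counting letters directly. The sketch below reuses the thirteen test words from above; the counting approach is an independent check added for illustration, not part of the exercise.

```r
# The same thirteen test words used throughout
words <- c("Ohio", "Pennelope", "Kabbak", "tomato", "eleven", "church", "Pop",
           "papa", "Robocop", "look", "Pajama", "noN", "XXXrated")

# Regex approach: one letter that appears again at least twice later in the word
by_regex <- grepl("([a-z])\\w*\\1\\w*\\1", words, perl = TRUE, ignore.case = TRUE)

# Counting approach: split each lower-cased word into characters and
# take the count of its most frequent character
max_letter_count <- vapply(strsplit(tolower(words), ""),
                           function(chars) max(table(chars)),
                           integer(1))
by_count <- max_letter_count >= 3

# TRUE when both approaches flag the same words
identical(by_regex, by_count)
words[by_regex]
```

The two approaches agree on this list because every word consists only of letters; with digits or underscores in the input they could differ.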