In this assignment, we will go over string manipulation. This includes simple functions as well as regex operations.
Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.
library(magrittr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
majors_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
majors_df <- read.csv(majors_url)
head(majors_df)
Now that we’ve read in the dataset, we need to find all entries that contain either “DATA” or “STATISTICS”.
One way to do this is by using the str_detect() function:
filtered_majors <- majors_df %>%
  filter(str_detect(Major, "DATA") | str_detect(Major, "STATISTICS"))
filtered_majors
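To see what the filter condition evaluates to, str_detect() can be tried on a small vector first. This is a minimal sketch; the values in sample_majors are illustrative and were chosen for this example, not read from the dataset.

```r
library(stringr)

# Illustrative major names (assumed for this sketch, not read from the CSV)
sample_majors <- c("ART HISTORY", "DATA PROCESSING",
                   "STATISTICS AND DECISION SCIENCE", "ZOOLOGY")

# str_detect() returns one TRUE/FALSE per element
has_data  <- str_detect(sample_majors, "DATA")
has_stats <- str_detect(sample_majors, "STATISTICS")

# The element-wise OR is what filter() uses to decide which rows to keep
keep <- has_data | has_stats
sample_majors[keep]
```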
This works, but I would rather keep the keywords in a list, so that the search can be extended with new keywords without rewriting the filter condition.
To do this, we define the keyword list, create an empty data frame with the same columns as majors_df, and then loop over the keywords, adding each keyword’s matching rows with union():
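# Keywords to search for; add new entries here to widen the search
keywords_to_check <- list("DATA", "STATISTICS")
# Start from an empty data frame that has the same columns as majors_df
keyword_collection_df <- majors_df %>%
  filter(FALSE)
# Add the rows matching each keyword; union() drops duplicate rows
for (keyword in keywords_to_check) {
  keyword_collection_df <- union(
    keyword_collection_df,
    majors_df %>%
      filter(str_detect(Major, keyword))
  )
}
keyword_collection_df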
Let’s also try this using regex:
regex_pattern <- "DATA|STATISTICS"
majors_df %>%
  filter(str_detect(Major, pattern = regex_pattern))
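The keyword list and the regex approach can also be combined by building the pattern from the keywords. The sketch below assumes the keywords contain no regex metacharacters (otherwise they would need escaping first); keywords, combined_pattern, and sample_majors are names introduced only for this example.

```r
library(stringr)

keywords <- c("DATA", "STATISTICS")

# Join the keywords with "|" so the pattern matches any one of them
combined_pattern <- str_c(keywords, collapse = "|")

# Illustrative values, not read from the dataset
sample_majors <- c("ART HISTORY", "DATA PROCESSING",
                   "STATISTICS AND DECISION SCIENCE")

# str_subset() keeps only the elements that match the pattern
matching_majors <- str_subset(sample_majors, combined_pattern)
matching_majors
```

With this arrangement, widening the search only requires adding an entry to keywords.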
Comparing the results of these three methods, it seems that the row order differs but the number of results does not.
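Rather than comparing the outputs by eye, the check can be done in code. The following is a rough sketch on a made-up stand-in for majors_df (toy_df and its values are assumptions for illustration); sorting the Major column first makes the comparison independent of row order.

```r
library(dplyr)
library(stringr)

# Illustrative stand-in for majors_df; the values are made up
toy_df <- data.frame(
  Major = c("DATA PROCESSING", "ART HISTORY", "STATISTICS AND DECISION SCIENCE"),
  stringsAsFactors = FALSE
)

# Method 1: two str_detect() calls joined with OR
by_or <- toy_df %>%
  filter(str_detect(Major, "DATA") | str_detect(Major, "STATISTICS"))

# Method 2: loop over keywords and union the matches
by_union <- toy_df %>% filter(FALSE)
for (keyword in c("STATISTICS", "DATA")) {
  by_union <- union(by_union, toy_df %>% filter(str_detect(Major, keyword)))
}

# Sorting first means only the set of majors is compared, not their order
same_rows <- identical(sort(by_or$Major), sort(by_union$Major))
same_rows
```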
Write code that transforms the data below:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”
Into a format like this:
c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
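As I interpret it, this task could be asking for one of two things: storing the fruit names in an R character vector, or producing the literal text of the c() call from such a vector.
The following two chunks of code do each of these in turn.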
fruit_vector <- c("bell pepper", "bilberry", "blackberry", "blood orange",
                  "blueberry", "cantaloupe", "chili pepper", "cloudberry",
                  "elderberry", "lime", "lychee", "mulberry", "olive",
                  "salal berry")
fruit_vector
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
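If the starting point is the printed console output rather than an existing vector, the names can be pulled out with a regex instead of being retyped. This is a sketch under the assumption that the printed text has been pasted into a string called raw_text (a name introduced here):

```r
library(stringr)

# The printed output, pasted in as one string (single quotes let us keep the
# double quotes that surround each fruit name)
raw_text <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Grab every run of characters between double quotes, then drop the quotes
quoted <- str_extract_all(raw_text, '"[^"]+"')[[1]]
parsed_fruit <- str_remove_all(quoted, '"')
length(parsed_fruit)
```

The index markers such as [1] and [5] are skipped automatically because they are not enclosed in quotes.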
Now we will convert the vector into that exact string. Specifically: c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
string_prefix <- 'c("'
string_suffix <- '")'
string_sep <- '", "'
string_output <- str_c(string_prefix,
                       str_flatten(fruit_vector, string_sep),
                       string_suffix)
cat(string_output)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
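Base R can produce the same kind of text without assembling it by hand: dput() prints an object as the R code that would recreate it, and deparse() returns that code as character strings. The sketch below is offered as an alternative to the str_c() approach, not a replacement for it. It uses a shortened vector (fruit_sample, a name introduced here), and because deparse() may split long output across several strings, the pieces are pasted together.

```r
# Shortened example vector; the full fruit vector works the same way
fruit_sample <- c("bell pepper", "bilberry", "salal berry")

# deparse() returns the R code for the object; paste() joins any pieces
deparsed <- paste(deparse(fruit_sample), collapse = "")
cat(deparsed)

# Parsing and evaluating the text gives back the original vector
round_trip <- eval(parse(text = deparsed))
identical(round_trip, fruit_sample)
```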
Describe, in words, what these expressions will match:
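Construct regular expressions to match words that: start and end with the same character; contain a repeated pair of letters (e.g. “church” contains “ch” twice); contain one letter repeated in at least three places.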
For this we will use \b (a word boundary), \1 (a backreference to the first capture group), and [A-Za-z] (any single ASCII letter).
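Each of these three building blocks can be tried in isolation before they are combined. A minimal sketch, with test strings made up for illustration:

```r
library(stringr)

# \\b is a word boundary: only the standalone "cat" matches,
# not the "cat" inside "concat" or "cats"
boundary_hits <- str_extract_all("cat concat cats", "\\bcat\\b")[[1]]

# \\1 refers back to whatever the first group captured,
# so this pattern detects a doubled letter
doubled <- str_detect(c("aa", "ab"), "([A-Za-z])\\1")

# [A-Za-z] matches a single ASCII letter, so a digit does not match
is_letter <- str_detect(c("x", "7"), "[A-Za-z]")
```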
# Capture the first letter, allow any letters, then require the same letter at the word's end
t4_1_pattern <- "\\b([A-Za-z])[A-Za-z]*\\1\\b"
t4_1_test <- c("meriam", "health", "meriam is in perfect health")
str_view(t4_1_test, pattern = t4_1_pattern)
## [1] │ <meriam>
## [2] │ <health>
## [3] │ <meriam> is in perfect <health>
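One edge case is worth noting: because the pattern requires the captured letter to appear a second time, a single-letter word such as "a" is not matched, and the comparison is case-sensitive. If single-letter words should count, one possible variant makes the tail optional. This is an assumption about the intended behavior, not part of the solution above, and t4_1_variant and edge_cases are names introduced here.

```r
library(stringr)

t4_1_pattern <- "\\b([A-Za-z])[A-Za-z]*\\1\\b"
# Hypothetical variant: the "more letters, then the same letter" part is optional
t4_1_variant <- "\\b([A-Za-z])(?:[A-Za-z]*\\1)?\\b"

edge_cases <- c("a", "meriam", "hello")
original_hits <- str_detect(edge_cases, t4_1_pattern)
variant_hits  <- str_detect(edge_cases, t4_1_variant)
```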
# Capture two adjacent letters, then require the same pair again later in the word
t4_2_pattern <- "\\b[A-Za-z]*([A-Za-z][A-Za-z])[A-Za-z]*\\1[A-Za-z]*\\b"
t4_2_test <- c("church", "mississippi",
               "there are many churches in mississippi")
str_view(t4_2_test, t4_2_pattern)
## [1] │ <church>
## [2] │ <mississippi>
## [3] │ there are many <churches> in <mississippi>
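To see which pair of letters the capture group picked up, str_match() returns the whole match alongside each group. A small sketch reusing the pattern above; "purple" is included as a word with no repeated pair, and repeated_pair is a name introduced here.

```r
library(stringr)

t4_2_pattern <- "\\b[A-Za-z]*([A-Za-z][A-Za-z])[A-Za-z]*\\1[A-Za-z]*\\b"

# Column 1 holds the whole match, column 2 the pair captured by the group;
# a word without a match yields NA in both columns
pair_match <- str_match(c("church", "purple"), t4_2_pattern)
repeated_pair <- pair_match[, 2]
repeated_pair
```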
# Capture one letter, then require it two more times anywhere later in the word
t4_3_pattern <- "\\b[A-Za-z]*([A-Za-z])[A-Za-z]*\\1[A-Za-z]*\\1[A-Za-z]*\\b"
t4_3_test <- c("narrator", "antenna", "believe", "parallel", "purple",
               "the narrator believes that antenna must be parallel")
str_view(t4_3_test, t4_3_pattern)
## [1] │ <narrator>
## [2] │ <antenna>
## [3] │ <believe>
## [4] │ <parallel>
## [6] │ the <narrator> <believes> that <antenna> must be <parallel>
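The output above jumps from [4] to [6] because str_view() shows only the elements that match when a pattern is supplied; "purple" contains "p" just twice, so element 5 is hidden. str_detect() makes this explicit. A small sketch reusing the pattern and the single-word test cases (t4_3_words and has_triple are names introduced here):

```r
library(stringr)

t4_3_pattern <- "\\b[A-Za-z]*([A-Za-z])[A-Za-z]*\\1[A-Za-z]*\\1[A-Za-z]*\\b"
t4_3_words <- c("narrator", "antenna", "believe", "parallel", "purple")

# One TRUE/FALSE per word: does any letter occur at least three times?
has_triple <- str_detect(t4_3_words, t4_3_pattern)

# The words that str_view() left out of its output
t4_3_words[!has_triple]
```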
Although I’ve had exposure to regex in the past, I have never felt very confident in my ability to write or understand it. After working through these exercises, I feel much more comfortable writing regex and more capable of reading it.