In this assignment, we will go over string manipulation. This includes simple functions as well as regex operations.
Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.
library(magrittr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
majors_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
majors_df <- read.csv(majors_url)
head(majors_df)
Now that we’ve read in the dataset, we’ll need to find all entries that contain either “DATA” or “STATISTICS”.
One way to do this is by using the str_detect() function:
filtered_majors <- majors_df %>%
  filter(str_detect(Major, "DATA") | str_detect(Major, "STATISTICS"))
filtered_majors
This seems to work, although my preferred method would be to have a list of entries we would like to check for. By using a list, we can easily expand our search to include any new keywords.
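To use my preferred method, we’ll follow the steps below: store the keywords in a list, create an empty data frame with the same columns as majors_df, then loop over the keywords and union each keyword’s matching rows into that data frame.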
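# Keywords to search for; add new entries here to widen the search
keywords_to_check <- list("DATA", "STATISTICS")
# Start from an empty data frame with the same columns as majors_df
keyword_collection_df <- majors_df %>%
  filter(FALSE)
# Union in the rows matching each keyword (union() drops duplicate rows)
for (keyword in keywords_to_check) {
  keyword_collection_df <- union(
    keyword_collection_df,
    majors_df %>%
      filter(str_detect(Major, keyword))
  )
}
keyword_collection_df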
Let’s also try this using regex:
regex_pattern <- "DATA|STATISTICS"
majors_df %>%
  filter(str_detect(Major, pattern = regex_pattern))
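The keyword list and the regex approach can also be combined, so that adding a keyword only means editing the list. The sketch below is one possible way to do that, not part of the original solution. It assumes the keywords are held in a character vector rather than a list, and sample_majors is a small hand-typed stand-in for the Major column so the example runs without downloading the CSV.

```r
library(stringr)

# Hypothetical sample: a few major names typed by hand for illustration,
# not read from the real CSV.
sample_majors <- c(
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "GENERAL AGRICULTURE",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
)

keywords_to_check <- c("DATA", "STATISTICS")

# Collapse the keywords into one alternation pattern: "DATA|STATISTICS"
keyword_pattern <- str_c(keywords_to_check, collapse = "|")

# Keep only the majors that match at least one keyword
matching_majors <- sample_majors[str_detect(sample_majors, keyword_pattern)]
print(matching_majors)
```

This only works as written for plain keywords. A keyword containing regex metacharacters such as "." or "(" would need to be escaped before being pasted into the pattern.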
Comparing the results of these three methods, it seems that the order of the rows has changed but the number of results has not.
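Rather than comparing the three results by eye, the comparison can be checked in code. The sketch below assumes dplyr's setequal() method for data frames, which ignores row order. sample_df is a hypothetical stand-in for majors_df with only a Major column, and the keyword order in the loop is deliberately reversed to force a different row order.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for majors_df (hand-typed, Major column only)
sample_df <- data.frame(
  Major = c(
    "COMPUTER PROGRAMMING AND DATA PROCESSING",
    "GENERAL AGRICULTURE",
    "STATISTICS AND DECISION SCIENCE"
  ),
  stringsAsFactors = FALSE
)

# Method 1: two str_detect() calls joined with |
by_or <- sample_df %>%
  filter(str_detect(Major, "DATA") | str_detect(Major, "STATISTICS"))

# Method 2: loop over keywords and union the matches
by_loop <- sample_df %>% filter(FALSE)
for (keyword in c("STATISTICS", "DATA")) {
  by_loop <- union(by_loop, sample_df %>% filter(str_detect(Major, keyword)))
}

# Method 3: a single regex with alternation
by_regex <- sample_df %>%
  filter(str_detect(Major, "DATA|STATISTICS"))

# Same rows regardless of order?
same_rows <- setequal(by_or, by_loop) && setequal(by_or, by_regex)
# Same rows in the same order?
same_order <- identical(by_or$Major, by_loop$Major)
print(c(same_rows = same_rows, same_order = same_order))
```

The loop returns rows grouped by keyword, while filter() keeps the original row order, which is why the order can differ even though the set of rows is the same.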
Write code that transforms the data below:
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
As I interpret this, the task seems to require one of the two below:
The following 2 chunks of code handle each of these in turn.
fruit_vector <- c("bell pepper", "bilberry", "blackberry", "blood orange",
"blueberry", "cantaloupe", "chili pepper", "cloudberry",
"elderberry", "lime", "lychee", "mulberry", "olive",
"salal berry")
fruit_vector
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
Now we will convert the vector into that exact string. Specifically: c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
string_prefix <- 'c("'
string_suffix <- '")'
string_sep <- '", "'
string_output <- str_c(string_prefix,
                       str_flatten(fruit_vector, string_sep),
                       string_suffix)
cat(string_output)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
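One way to check that the generated string really is valid R code for the original vector is to parse and evaluate it, then compare the result with fruit_vector. The sketch below is only a sanity check under that assumption; rebuilt_vector, round_trip_ok and deparsed_string are names introduced here for illustration. It also shows that base R's deparse() yields text that rebuilds the same vector.

```r
library(stringr)

fruit_vector <- c("bell pepper", "bilberry", "blackberry", "blood orange",
                  "blueberry", "cantaloupe", "chili pepper", "cloudberry",
                  "elderberry", "lime", "lychee", "mulberry", "olive",
                  "salal berry")

# Build the c("...", "...") string the same way as above
string_output <- str_c('c("', str_flatten(fruit_vector, '", "'), '")')

# Parse the string as R code and evaluate it to rebuild the vector
rebuilt_vector <- eval(parse(text = string_output))
round_trip_ok <- identical(rebuilt_vector, fruit_vector)
print(round_trip_ok)

# Base R alternative: deparse() returns the code for an object as text,
# possibly split over several lines, so the pieces are pasted together
deparsed_string <- paste(deparse(fruit_vector), collapse = "")
deparse_ok <- identical(eval(parse(text = deparsed_string)), fruit_vector)
print(deparse_ok)
```

eval(parse()) is used here only because the input is a string we just built ourselves; it should not be run on untrusted text.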
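To start instead from the printed output as a single string, we’ll perform 2 steps: extract each quoted element with a regex, then join the elements and wrap them in c( and ).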
input_string <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'
# The extracted elements keep their own quotes, so the prefix and suffix
# must not add another pair
string_prefix <- "c("
string_suffix <- ")"
string_sep <- ", "
# Lazily match each double-quoted element, quotes included
string_pattern <- '"(.*?)"'
list_of_elements <- str_extract_all(input_string, pattern = string_pattern)
vector_of_elements <- unlist(list_of_elements)
combined_string <- str_flatten(vector_of_elements, string_sep)
output_string <- str_c(string_prefix, combined_string, string_suffix)
cat(output_string)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
Using that 3rd method, we can see that the string is output in the format specified.
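If the goal is an actual character vector rather than a string, the capture group in the pattern can be used to drop the surrounding quotes. The sketch below assumes stringr's str_match_all(), which returns one matrix per input string with the full match in column 1 and each capture group in the following columns. short_input is a shortened, hypothetical version of the printed output.

```r
library(stringr)

# Shortened, hand-typed version of the printed output (two lines only)
short_input <- '[1] "bell pepper" "bilberry"\n[3] "blackberry" "blood orange"'

# Column 1 holds the full match (with quotes),
# column 2 holds the first capture group (without quotes)
match_matrix <- str_match_all(short_input, '"(.*?)"')[[1]]
fruit_elements <- match_matrix[, 2]
print(fruit_elements)
```

The index markers such as [1] and [3] are skipped automatically because they are not inside double quotes.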
Describe, in words, what these expressions will match:
Construct regular expressions to match words that:
For this we will use \b, \1, and [A-Za-z].
# Words that start and end with the same letter
t4_1_pattern <- "\\b([A-Za-z])[A-Za-z]*\\1\\b"
t4_1_test <- c("meriam", "health", "meriam is in perfect health")
str_view(t4_1_test, pattern = t4_1_pattern)
## [1] │ <meriam>
## [2] │ <health>
## [3] │ <meriam> is in perfect <health>
# Words that contain a repeated pair of letters
t4_2_pattern <- "\\b[A-Za-z]*([A-Za-z][A-Za-z])[A-Za-z]*\\1[A-Za-z]*\\b"
t4_2_test <- c("church", "mississippi",
               "there are many churches in mississippi")
str_view(t4_2_test, t4_2_pattern)
## [1] │ <church>
## [2] │ <mississippi>
## [3] │ there are many <churches> in <mississippi>
# Words in which one letter appears at least three times
t4_3_pattern <- "\\b[A-Za-z]*([A-Za-z])[A-Za-z]*\\1[A-Za-z]*\\1[A-Za-z]*\\b"
t4_3_test <- c("narrator", "antenna", "believe", "parallel", "purple",
               "the narrator believes that antenna must be parallel")
str_view(t4_3_test, t4_3_pattern)
## [1] │ <narrator>
## [2] │ <antenna>
## [3] │ <believe>
## [4] │ <parallel>
## [6] │ the <narrator> <believes> that <antenna> must be <parallel>
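str_view() only prints the strings that match, which is why "purple" (entry [5]) is missing from the output above. To make the non-matches explicit, the same patterns can be run through str_detect(). The sketch below is a small check along those lines. The three variable names are introduced here for readability and hold the same patterns as t4_1_pattern, t4_2_pattern and t4_3_pattern.

```r
library(stringr)

# Same patterns as above, under descriptive (hypothetical) names
starts_and_ends_same <- "\\b([A-Za-z])[A-Za-z]*\\1\\b"
repeated_pair <- "\\b[A-Za-z]*([A-Za-z][A-Za-z])[A-Za-z]*\\1[A-Za-z]*\\b"
letter_three_times <- "\\b[A-Za-z]*([A-Za-z])[A-Za-z]*\\1[A-Za-z]*\\1[A-Za-z]*\\b"

# One matching and one non-matching word per pattern
check_same_ends <- str_detect(c("health", "wealth"), starts_and_ends_same)
check_pair <- str_detect(c("church", "chapel"), repeated_pair)
check_three <- str_detect(c("parallel", "purple"), letter_three_times)
print(check_same_ends)  # TRUE FALSE
print(check_pair)       # TRUE FALSE
print(check_three)      # TRUE FALSE

# Pull the matching words out of a sentence instead of highlighting them
sentence <- "the narrator believes that antenna must be parallel"
matched_words <- str_extract_all(sentence, letter_three_times)[[1]]
print(matched_words)
```

Because the backreference \1 compares characters exactly, these patterns are case-sensitive: a capitalized word such as "Meriam" would not count as starting and ending with the same letter.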
Although I’ve had exposure to regex in the past, I have never felt very confident in my ability to write or understand it. After working through these operations, I feel much more comfortable writing regex and more capable of reading it.