Let’s first delete all variables and load our tidyverse library.
Exercise 1
Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
ANSWER We will load the data directly from GitHub using the alternative (raw) site, which works with read_csv. I will use a regex and the str_detect function from stringr to match any major whose description contains either the word “DATA” or the word “STATISTICS”.
Since all descriptions were already CAPITALIZED, I didn’t have to transform any of the columns or adapt the regex to account for it.
# Insert code for Exercise 1 here
college_majors <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")
## Rows: 174 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# I applied a simple FILTER to get requested output
data_majors <- college_majors %>%
filter(str_detect(Major, "DATA|STATISTICS"))
data_majors
## # A tibble: 3 x 3
## FOD1P Major Major_Category
## <chr> <chr> <chr>
## 1 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business
## 2 2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3 3702 STATISTICS AND DECISION SCIENCE Computers & Mathematics
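The remark about capitalization matters because str_detect is case-sensitive by default. As a minimal sketch, assuming a hypothetical mixed-case sample (the `majors` data frame below is made up for illustration and is not taken from the dataset), the same pattern could be wrapped in stringr's regex() helper with ignore_case = TRUE:

```r
library(stringr)

# Hypothetical mixed-case sample; the real dataset is already upper case
majors <- data.frame(
  Major = c("Data Science", "STATISTICS AND DECISION SCIENCE",
            "Botany", "Computer programming and data processing"),
  stringsAsFactors = FALSE
)

# Same alternation as above, but ignoring letter case
pattern <- regex("DATA|STATISTICS", ignore_case = TRUE)
hits <- majors$Major[str_detect(majors$Major, pattern)]
hits
```

With the real data this is not needed, but it would keep the filter working if the column were ever lower or mixed case.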
Exercise 2
Write code that transforms the data below:
[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
ANSWER This one was more complicated than I expected, because I was not certain about the format of the output.
The complication was how to treat the quotation characters ", since the output itself contains quotation marks.
My interpretation was to return a single string that has quote characters within the string itself, something like Answer_String <- "c("word1","word2","etc")".
But this means that the answer string must include escapes and should be something like Answer_String <- "c(\"word1\",\"word2\",\"etc\")".
This way, if I write the code below, the output will look like what I think it should look like:
Answer_String <- "c(\"word1\",\"word2\",\"etc\")"
writeLines(Answer_String)
## c("word1","word2","etc")
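To make the role of the escapes concrete, here is a small sketch (the variable name `answer_string` is only illustrative) comparing how print() and writeLines() display the same string. The backslashes are part of the R source syntax, not of the stored string:

```r
# A string that contains quote characters, written with escapes
answer_string <- "c(\"word1\",\"word2\")"

print(answer_string)       # print() displays the escaped form
writeLines(answer_string)  # writeLines() displays the raw characters
nchar(answer_string)       # 18: the backslashes are not stored in the string
```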
So I tried this several ways using variations of regex. Let’s first load the text from a file I created. The approach I took in this problem is to read the whole text into a single variable.
my_file <- read_file("words.txt")
my_file
## [1] "[1] \"bell pepper\" \"bilberry\" \"blackberry\" \"blood orange\"\r\n[5] \"blueberry\" \"cantaloupe\" \"chili pepper\" \"cloudberry\" \r\n[9] \"elderberry\" \"lime\" \"lychee\" \"mulberry\" \r\n[13] \"olive\" \"salal berry\""
Then for approach #1 I decided to use REGEX to match the whole word INCLUDING THE QUOTATION characters
# Insert code for Exercise 2 here
pattern <- '["“][A-Za-z ]{2,}["”]'
str_view_all(my_file,pattern)
data1 <- str_match_all(my_file,pattern)
The match was as designed, and the result stored in a LIST.
Next step was to merge the words in quotes.
data1 <- paste(data1,collapse=",")
writeLines((data1))
## c("\"bell pepper\"", "\"bilberry\"", "\"blackberry\"", "\"blood orange\"", "\"blueberry\"", "\"cantaloupe\"", "\"chili pepper\"", "\"cloudberry\"", "\"elderberry\"", "\"lime\"", "\"lychee\"", "\"mulberry\"", "\"olive\"", "\"salal berry\"")
The result looks VERY CLOSE to the desired answer, but not exactly, since each word ended up wrapped in an extra layer of escaped quotation characters. So instead of ending with "word1", I ended with "\"word1\"". Not what I wanted.
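A likely cause of the extra quoting is that str_match_all returns a list holding a matrix, and paste() applied to a list converts each element to its R source representation. The sketch below uses a hypothetical two-word matrix `m` as a stand-in for that result and compares pasting the list with pasting the matrix itself:

```r
# Hypothetical stand-in for the one-column matrix inside the str_match_all() result
m <- matrix(c("\"lime\"", "\"olive\""), ncol = 1)
as_list <- list(m)

# paste() on the list turns the matrix element into R source text
deparsed <- paste(as_list, collapse = ",")
writeLines(deparsed)

# paste() on the matrix itself joins the matched strings directly
joined <- paste(as_list[[1]], collapse = ",")
writeLines(joined)
```

Indexing with [[1]] before pasting is one way to avoid the additional layer of quotes.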
So I switched to matching the words inside the quotation marks, but without the quotation characters themselves. Again I used regex, this time with a lookbehind and a lookahead, to isolate the words.
pattern <- '(?<=\")[A-Za-z ]{3,}(?=\")'
str_view_all(my_file,pattern)
data2 <- str_match_all(my_file,pattern)
I almost got what I wanted. Unfortunately, the pattern also matched the runs of blank spaces between one quoted word and the next. After some experimenting I couldn’t find a regex that would isolate just the words within quotes.
I decided to just go ahead and remove the whitespace-only strings from the result by keeping only the strings with fewer than two spaces. Not pretty, but it gets me there.
data3 <- data2[[1]][str_count(data2[[1]],pattern=" ") < 2]
data3
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
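One possible way to avoid the whitespace-only matches without a filtering step is to require the match to start with a letter. This is only a sketch under the assumption that no quoted word begins with a space; `sample_text` is a hypothetical shortened version of the file contents:

```r
library(stringr)

# Hypothetical sample mimicking the printed vector, with runs of spaces between words
sample_text <- "[1] \"bell pepper\"   \"bilberry\"\n[3] \"lime\""

# Lookbehind for a quote, then a letter, then letters or spaces, then lookahead for a quote
pattern <- '(?<=")[A-Za-z][A-Za-z ]*(?=")'
words <- str_extract_all(sample_text, pattern)[[1]]
words
```

Because the gap between a closing quote and the next opening quote starts with a space, it can no longer satisfy the pattern.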
So that gives me what I want: a vector of just the words in the file, without quotes around them.
My next step was to merge the words in the vector with "," (quote, comma, quote) as the separator.
data4<- paste(data3,collapse = '\",\"')
writeLines(data4)
## bell pepper","bilberry","blackberry","blood orange","blueberry","cantaloupe","chili pepper","cloudberry","elderberry","lime","lychee","mulberry","olive","salal berry
So this ALMOST completely did the trick. I just need to add c(" at the beginning and ") at the end and PRESTO.
final_string <- paste0('c(\"',data4,'\")')
writeLines(final_string)
## c("bell pepper","bilberry","blackberry","blood orange","blueberry","cantaloupe","chili pepper","cloudberry","elderberry","lime","lychee","mulberry","olive","salal berry")
Exactly the output requested in the exercise. DONE!
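As a sanity check on this kind of output, the string can be parsed back into R and compared with the original words. The sketch below assumes a short hypothetical vector `fruits` instead of the full list, and uses ", " as the separator to mirror the format shown in the exercise:

```r
# Hypothetical short vector standing in for the extracted words
fruits <- c("bell pepper", "bilberry", "lime")

# Wrap each word in quotes, join with ", ", then wrap the whole thing in c( )
final_string <- paste0("c(", paste0('"', fruits, '"', collapse = ", "), ")")
writeLines(final_string)

# Parsing and evaluating the string should give back the original vector
round_trip <- eval(parse(text = final_string))
identical(round_trip, fruits)
```

If the round trip returns the same vector, the quotes and commas are in the right places.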
Exercise 3
Describe, in words, what these expressions will match:
(.)\1\1
"(.)(.)\\2\\1"
(..)\1
"(.).\\1.\\1"
"(.)(.)(.).*\\3\\2\\1"
ANSWER
Let’s start by defining a test vector we can use to test our REGEX.
test <- c("bell pepper", "aaairlines", "bilberry", "blackberry", "blood orange",
"blueberry", "cantaloupe", "chili pepper", "church","cloudberry",
"elderberry", "lime", "rhbbbf", "lychee", "mulberry", "olive",
"salal berry", "1234ab345 ba343", "1234ab345ba343","alibaba", "bereb",
"bbgun", "asdbbb", "balacab", "balacaba", "1234568965443", "erberberbee")
Now let’s look at and test what each regex does.
#Searches the whole string and matches the same character 3 times in a row
str_view_all(test,"(.)\\1\\1")
#Searches the whole string for 4 consecutive characters where
#the second pair is the first pair in reverse order (e.g. "eppe")
str_view_all(test,"(.)(.)\\2\\1")
#Searches for 4 consecutive characters where the first
#pair of characters is immediately repeated (e.g. "baba")
str_view_all(test,"(..)\\1")
#Looks for the same character appearing 3 times, with any single
#character between each occurrence (e.g. "alaca")
str_view_all(test,"(.).\\1.\\1")
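As a complementary check that returns plain character vectors, str_subset can list which strings contain a match for each of the four patterns tested above. This is a sketch on a shortened vector (`short_test` is an illustrative subset of the test vector defined earlier):

```r
library(stringr)

# Illustrative subset of the test vector used above
short_test <- c("aaairlines", "church", "alibaba", "bereb",
                "balacaba", "bell pepper", "lime")

triple   <- str_subset(short_test, "(.)\\1\\1")     # "aaairlines" (aaa)
mirrored <- str_subset(short_test, "(.)(.)\\2\\1")  # "bell pepper" (eppe)
doubled  <- str_subset(short_test, "(..)\\1")       # "alibaba" (baba)
spaced   <- str_subset(short_test, "(.).\\1.\\1")   # "balacaba" (alaca)

list(triple = triple, mirrored = mirrored, doubled = doubled, spaced = spaced)
```

Note that "bereb" matches none of these: its first and last two characters mirror each other, but only around a middle character, which none of the four patterns allows.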
