Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.
There are three majors that contain either “DATA” or “STATISTICS”:
# Load packages --------------------------------------
library(tidyverse)
# Load data --------------------------------------
majordom <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv", stringsAsFactors = FALSE)
# Keep only the majors whose names contain "DATA" or "STATISTICS"
majordom_data <- str_subset(majordom$Major, pattern = "DATA|STATISTICS")
majordom_data
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
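The same filter can be checked without downloading the data. The sketch below is a minimal base R version that uses `grep()` in place of `str_subset()`; the vector `sample_majors` is a small illustrative subset made up for this example, not the full dataset.

```r
# Hypothetical subset of major names, used only for illustration
sample_majors <- c("ACCOUNTING",
                   "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
                   "COMPUTER PROGRAMMING AND DATA PROCESSING",
                   "STATISTICS AND DECISION SCIENCE",
                   "GENERAL BUSINESS")

# "|" is regex alternation: keep a name if it contains either word
matched <- grep("DATA|STATISTICS", sample_majors, value = TRUE)
matched
```

With `value = TRUE`, `grep()` returns the matching strings themselves rather than their positions, which mirrors what `str_subset()` does.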
Write code that transforms the data below:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”
Into a format like this:
c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
We can generate a string that looks like that with the following steps:
# Generate input
line1 <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"'
line2 <- '[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry" '
line3 <- '[9] "elderberry" "lime" "lychee" "mulberry" '
line4 <- '[13] "olive" "salal berry"'
input <- c(line1, line2, line3, line4)
writeLines(input)
## [1] "bell pepper" "bilberry" "blackberry" "blood orange"
## [5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
## [9] "elderberry" "lime" "lychee" "mulberry"
## [13] "olive" "salal berry"
# Combine the lines into one vector element
input <- str_c(input, collapse = " ")
writeLines(input)
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" [5] "blueberry" "cantaloupe" "chili pepper" "cloudberry" [9] "elderberry" "lime" "lychee" "mulberry" [13] "olive" "salal berry"
# Remove the numbers in square brackets
interim <- str_remove_all(input, "\\[(.|..)\\]")
writeLines(interim)
## "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry" "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime" "lychee" "mulberry" "olive" "salal berry"
# Trim the white space
interim <- str_squish(interim)
writeLines(interim)
## "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry" "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime" "lychee" "mulberry" "olive" "salal berry"
# Replace " " with ", "
interim <- str_replace_all(interim, "\" \"", "\", \"")
writeLines(interim)
## "bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry"
# Wrap the vectorized element with `c()`
output <- str_c("c(", interim, ")")
writeLines(output)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
However, this only looks like the target output; it is a single string, not a vector with these elements. If it needs to be a vector, we can extract just the fruit names from the original input after collapsing it into one vector element. writeLines() prints to the console and returns NULL invisibly, so its printed output cannot be assigned to a variable; otherwise we would have shown that, too.
# Extract names into a list
output2 <- str_extract_all(input, "\"[a-z ]+\"")
# "[a-z ]+" does not match the " " between names because each quote is already consumed by the preceding match
output2
## [[1]]
## [1] "\"bell pepper\"" "\"bilberry\"" "\"blackberry\"" "\"blood orange\""
## [5] "\"blueberry\"" "\"cantaloupe\"" "\"chili pepper\"" "\"cloudberry\""
## [9] "\"elderberry\"" "\"lime\"" "\"lychee\"" "\"mulberry\""
## [13] "\"olive\"" "\"salal berry\""
# Turn the list into a vector
output2 <- unlist(output2)
output2
## [1] "\"bell pepper\"" "\"bilberry\"" "\"blackberry\"" "\"blood orange\""
## [5] "\"blueberry\"" "\"cantaloupe\"" "\"chili pepper\"" "\"cloudberry\""
## [9] "\"elderberry\"" "\"lime\"" "\"lychee\"" "\"mulberry\""
## [13] "\"olive\"" "\"salal berry\""
# Remove explicit double quotes
output2 <- str_remove_all(output2, "\"")
output2
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
This matches the output of: c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
# Show output of "c("bell pepper", ..."salal berry")
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
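If the `c(...)` string from the first approach is already in hand, one alternative (not used above) is to let R parse it. This is a minimal base R sketch, and it assumes the string is trusted input, because `eval()` runs whatever code the string contains. The string is rebuilt here so the example is self-contained.

```r
# Same text as the c(...) string produced by the first approach
output <- 'c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")'

# parse() turns the text into an R expression; eval() evaluates it
fruit_vector <- eval(parse(text = output))

length(fruit_vector)        # number of elements in the resulting vector
is.character(fruit_vector)  # TRUE when the result is a character vector
```

The extraction approach shown earlier avoids evaluating text as code, which is the safer choice when the input comes from an unknown source.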
On #3, please don’t assume that it’s a typo when there is only a single backslash. That is, what happens when the second backslash that R requires is forgotten?
Just above 14.3.1.1 in R for Data Science, Wickham & Grolemund write, “In this book, I’ll write regular expression as \. and strings that represent the regular expression as "\\.".” So I assume in the two examples without quotes he’s demonstrating the difference between regex and the strings that represent regex.
Nevertheless, if you do escape a normal character, it doesn’t just disappear. Otherwise the second and third code chunks below would have equivalent output. Instead, R reads "\1" as an octal escape, so it becomes an invisible control character (character code 1). You can see in the examples below that while "A\1\1" is matched by both "(.)" and "(.)\1\1", the strings "AAA", "BBB" and "B11" don’t match "(.)\1\1" because they don’t contain those two escaped characters.
# Writing the string "(.)\1\1" prints what looks like "(.)", but the two are not equivalent in the examples below
writeLines("(.)\1\1")
## (.)
# Running str_view using "(.)\1\1" matches the first instance of any character followed by
# two escaped 1s.
sample_string <- c("AAA", "BBB", "B11", "A\1\1")
str_view(sample_string, "(.)\1\1")
# Running str_view using "(.)" matches the first instance of any character
sample_string <- c("AAA", "BBB", "B11", "A\1\1")
str_view(sample_string, "(.)")
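To make the invisible character visible, it helps to inspect character counts and code points directly. The sketch below is a minimal base R check; the variable names are made up for this example, and `perl = TRUE` is an assumption chosen so the backreference behaves as in PCRE.

```r
single_escape <- "\1"    # one backslash: octal escape for the control character with code 1
double_escape <- "\\1"   # two backslashes: a literal backslash followed by the digit 1

nchar(single_escape)     # 1 character
utf8ToInt(single_escape) # its code point is 1
nchar(double_escape)     # 2 characters
nchar("(.)\1\1")         # 5 characters: "(", ".", ")" and two control characters

# Only the doubled backslash produces a regex backreference
backref_hit <- grepl("(.)\\1\\1", "AAA", perl = TRUE)
literal_hit <- grepl("(.)\1\1", "AAA", perl = TRUE)
control_hit <- grepl("(.)\1\1", "A\1\1", perl = TRUE)
```

Here `backref_hit` and `control_hit` are `TRUE` while `literal_hit` is `FALSE`, which is the same split seen in the `str_view()` calls above.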
Describe, in words, what these expressions will match:
(.)\1\1
This will match any character (except \n) repeated three times consecutively.
sample_string <- c("AAA", "BBB", "BCCC", "ABBCB")
str_view(sample_string, "(.)\\1\\1")
"(.)(.)\\2\\1"
This will match any two characters followed by the same two characters in reverse order.
sample_string <- c("ANNA", "BABA", "UOMMO", "ABBACADABBADOO")
str_view_all(sample_string, "(.)(.)\\2\\1")
(..)\1
This will match any pair of characters that is immediately repeated.
sample_string <- c("MAMA", "PAPA", "GAGAGA", "HAHAHAHA")
str_view_all(sample_string, "(..)\\1")
"(.).\\1.\\1"
This will match five characters that have the same first, third and fifth character.
sample_string <- c("DADDY", "DADADY", "NANNA", "NANXN")
str_view(sample_string, "(.).\\1.\\1")
"(.)(.)(.).*\\3\\2\\1"
This will match any three characters followed later by the same three characters in reverse order, with zero or more characters in between.
sample_string <- c("ABCCB", "ABCCBA", "ABCXCBA", "ABCXYCBA", "ABCX CBAY")
str_view(sample_string, "(.)(.)(.).*\\3\\2\\1")
Construct regular expressions to match words that:
Start and end with the same character.
"^(.).*\\1$"
As shown below, only the vector elements that start and end with the same character are highlighted.
sample_string <- c("ABCCB", "ABCXCBA", "ZBCXYCBZ", "ZBCXYZCBA")
str_view(sample_string, "^(.).*\\1$")
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
"(..).*\\1"
See below for the regex in practice. Because .* is greedy, the churchitch match runs through the last ch; my attempts at adding {2} or ? did not end it after the second ch.
sample_string <- c("ABCCAB", "church", "ladeeadi", "ZBCXYXYZA", "churchitch")
str_view(sample_string, "(..).*\\1")
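One possible way to end the churchitch match after the second ch is to make the middle quantifier lazy by writing `.*?` instead of `.*`. This is a minimal base R sketch rather than the stringr call used above; `perl = TRUE` is an assumption chosen so that lazy quantifiers and backreferences follow PCRE rules.

```r
words <- c("church", "churchitch")

# Greedy: .* grabs as much as possible, so the match ends at the LAST repeat of the pair
greedy <- regmatches(words, regexpr("(..).*\\1", words, perl = TRUE))

# Lazy: .*? grabs as little as possible, so the match ends at the FIRST repeat of the pair
lazy <- regmatches(words, regexpr("(..).*?\\1", words, perl = TRUE))

greedy
lazy
```

Both patterns flag the same words as containing a repeated pair; they differ only in how much of each word is reported as the match.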
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
"(.).*\\1.*\\1"
See below:
sample_string <- c("ABCCAB", "eleven", "DADDY", "ladeeadi", "ZBCXYAYAAY")
str_view(sample_string, "(.).*\\1.*\\1")
The R Markdown file for this document is saved here, github.com/pkofy/DATA607, with the name “DATA607WK3Assignment.rmd”.