library(tidyverse)
This week we reviewed regular expressions, which are useful for string manipulation and identifying patterns within strings. Since understanding strings is foundational to understanding regular expressions, we also touched upon useful string functions in this assignment.
Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
majors <- read.csv("C:\\Users\\Kim\\Documents\\Data607\\all-ages.csv", header = TRUE, sep = ",")
DATAMAJORS <- majors %>% filter(str_detect(Major,"DATA") | str_detect(Major,"STATISTICS"))
DATAMAJORS
## Major_code Major
## 1 2101 COMPUTER PROGRAMMING AND DATA PROCESSING
## 2 3702 STATISTICS AND DECISION SCIENCE
## 3 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS
## Major_category Total Employed Employed_full_time_year_round
## 1 Computers & Mathematics 29317 22828 18747
## 2 Computers & Mathematics 24806 18808 14468
## 3 Business 156673 134478 118249
## Unemployed Unemployment_rate Median P25th P75th
## 1 2265 0.09026422 60000 40000 85000
## 2 1138 0.05705405 70000 43000 102000
## 3 6186 0.04397714 72000 50000 100000
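The two OR-ed str_detect calls can also be written as one pattern using regex alternation. The following is a minimal sketch on a small stand-in data frame; `majors_demo` and `data_majors_demo` are hypothetical names used only for illustration, and only the Major column is reproduced.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the majors data; only the Major column matters here.
majors_demo <- data.frame(
  Major = c("COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "GENERAL BUSINESS"),
  stringsAsFactors = FALSE
)

# "DATA|STATISTICS" matches a string containing either word,
# so a single str_detect() call covers both conditions.
data_majors_demo <- majors_demo %>%
  filter(str_detect(Major, "DATA|STATISTICS"))

data_majors_demo$Major
```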
My interpretation of this question is that the list currently prints like this:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”
as shown by
print(c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry"))
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
However, it is instead preferable to print
c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
like the following
writeLines('c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")')
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
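Rather than typing the target string by hand, it could also be built from the character vector itself. This is a sketch under the assumption that the data starts out as an ordinary character vector; `fruits` and `fruit_code` are hypothetical names.

```r
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange",
            "blueberry", "cantaloupe", "chili pepper", "cloudberry",
            "elderberry", "lime", "lychee", "mulberry", "olive",
            "salal berry")

# Wrap each element in double quotes, join the pieces with ", ",
# then wrap the whole thing in c( ... ).
fruit_code <- paste0("c(", paste0('"', fruits, '"', collapse = ", "), ")")

writeLines(fruit_code)
```

Because the result is valid R code, parsing and evaluating it should give back the original vector.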
Describe, in words, what these expressions will match:
(.)\1\1
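Typed as an R string without doubling the backslashes, the (.)\1\1 regular expression will look for any character (as determined by the “(.)” portion of the expression) followed by two literal \001 control characters. This happens because R reads \1 inside a string as an octal escape before the regex engine ever sees it. In RStudio, \001 renders like the following: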
writeLines("\1")
##
While other characters may render in the same way, the rendering does not change how R matches them. Characters that look identical on screen are still distinct to R. For example, \002 renders in the same manner as \001, but it will not match \001.
x <- c("a\1\1","b\1\1","\1\1c","d\2\2")
writeLines(x)
## a
## b
## c
## d
str_view(x,"(.)\1\1")
## [1] │ <a>
## [2] │ <b>
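For contrast, here is a small sketch of what happens when the backslashes are doubled, so that the regex engine receives \1 as a backreference instead of a control character. The vector `x_demo` and the name `triple_hits` are made up for illustration.

```r
library(stringr)

# In R source, "\1" is a single character (octal escape 001),
# while "\\1" is two characters: a backslash followed by the digit 1.
nchar("\1")
nchar("\\1")

x_demo <- c("aaa", "xbbbx", "abab", "a\1\1")

# With escaped backslashes, the pattern asks for one character
# repeated three times in a row.
triple_hits <- str_detect(x_demo, "(.)\\1\\1")
triple_hits
```

Under this reading, "aaa" and "xbbbx" match, while "a\1\1" no longer does, since no single character appears three times in a row.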
“(.)(.)\\2\\1”
The “(.)(.)\\2\\1” regular expression includes quotes in this case. Since \\2 and \\1 are properly escaped, they act as backreferences: the pattern matches any two characters, followed by the second character again, followed by the first character again, all enclosed in literal double quotes.
x <- c('"aaaa"','"abba"',"aaaa","abba")
writeLines(x)
## "aaaa"
## "abba"
## aaaa
## abba
str_view(x,'"(.)(.)\\2\\1"')
## [1] │ <"aaaa">
## [2] │ <"abba">
(..)\1
Typed as an R string without doubling the backslash, the (..)\1 regular expression will match any two characters followed by the \001 control character.
x <- c("a\1\1","za\1","\1\1\1","\2\2\2")
writeLines(x)
## a
## za
##
##
str_view(x,"(..)\1")
## [1] │ <a>
## [2] │ <za>
## [3] │ <>
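As a point of comparison, the sketch below escapes the backslash so that \1 refers back to the captured pair. The inputs in `pair_demo` are invented for illustration, and str_extract is used so the matched text itself is visible.

```r
library(stringr)

pair_demo <- c("abab", "xcdcdx", "abba")

# "(..)\\1" captures any two characters and then requires
# the same two characters to follow immediately.
pair_hits <- str_extract(pair_demo, "(..)\\1")
pair_hits
```

Here "abba" yields NA because its pair is reversed rather than repeated.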
“(.).\\1.\\1”
The “(.).\\1.\\1” regular expression will match any two characters, followed by a repeat of the first character, followed by any character, followed by a repeat of the first character again. As before, the whole match has to be enclosed in quotes since quotes were included in the regular expression.
x <- c('"abaxa"','"\1\2\1x\1"',"abaxa")
writeLines(x)
## "abaxa"
## "x"
## abaxa
str_view(x,'"(.).\\1.\\1"')
## [1] │ <"abaxa">
## [2] │ <"x">
“(.)(.)(.).*\\3\\2\\1”
The “(.)(.)(.).*\\3\\2\\1” regular expression will match any three characters, followed by zero or more of any character, followed by the 3rd character, 2nd character, and 1st character again, all enclosed in quotes.
x <- c('"abcxcba"','"abcccba"','abcdcba')
writeLines(x)
## "abcxcba"
## "abcccba"
## abcdcba
str_view(x,'"(.)(.)(.).*\\3\\2\\1"')
## [1] │ <"abcxcba">
## [2] │ <"abcccba">
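The difference between `.` and `.*` in the middle of this pattern only shows up when the gap between the two halves is not exactly one character. A minimal sketch, using invented inputs in `gap_demo`:

```r
library(stringr)

# The gap between "abc" and "cba" is 1, 0, and 3 characters respectively.
gap_demo <- c('"abcxcba"', '"abccba"', '"abcxyzcba"')

# `.` requires exactly one character in the gap.
one_char_gap <- str_detect(gap_demo, '"(.)(.)(.).\\3\\2\\1"')

# `.*` allows zero or more characters in the gap.
any_gap <- str_detect(gap_demo, '"(.)(.)(.).*\\3\\2\\1"')

one_char_gap
any_gap
```

Only the first string matches with a single `.`, whereas all three match with `.*`.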
Construct regular expressions that match words that:
Start and end with the same character.
x <- c("example","ee",'"anything in quotes should work"','"unless I put a different char at the end"!')
writeLines(x)
## example
## ee
## "anything in quotes should work"
## "unless I put a different char at the end"!
str_view(x,"^(.).*\\1$")
## [1] │ <example>
## [2] │ <ee>
## [3] │ <"anything in quotes should work">
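One edge case worth noting is a single-character word such as "a", which arguably starts and ends with the same character but needs at least two characters to satisfy "^(.).*\\1$". A possible variant, offered as a sketch and not the only way to handle it, makes the trailing part optional. The inputs in `words_demo` are invented.

```r
library(stringr)

words_demo <- c("a", "ee", "example", "test", "word")

# Pattern from the exercise: needs at least two characters.
two_plus <- str_detect(words_demo, "^(.).*\\1$")

# Variant: the "(.*\\1)?" group is optional, so a lone character also matches.
same_ends <- str_detect(words_demo, "^(.)(.*\\1)?$")

two_plus
same_ends
```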
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
x <- c("church","abxyxyab","no","abobo")
writeLines(x)
## church
## abxyxyab
## no
## abobo
str_view(x,"(..).*\\1")
## [1] │ <church>
## [2] │ <abxyxyab>
## [4] │ a<bobo>
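Since `.` matches any character, "(..).*\\1" will also accept repeated pairs of digits or punctuation. If the pair should be limited to letters, a character class could be used instead. This is a sketch assuming plain ASCII letters are what is meant by "letters"; the inputs in `pairs_demo` are invented.

```r
library(stringr)

pairs_demo <- c("church", "1212", "a--b--")

# Any two characters repeated later in the string.
any_pair <- str_detect(pairs_demo, "(..).*\\1")

# Only two ASCII letters repeated later in the string.
letter_pair <- str_detect(pairs_demo, "([A-Za-z]{2}).*\\1")

any_pair
letter_pair
```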
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
x <- c("church","abxyxyaba","no","eleven","elevene","elevn")
writeLines(x)
## church
## abxyxyaba
## no
## eleven
## elevene
## elevn
str_view(x,"(.).*\\1.*\\1")
## [2] │ <abxyxyaba>
## [4] │ <eleve>n
## [5] │ <elevene>
Learning about these functions has opened a door to better “querying” in R, where functions like str_detect act analogously to “LIKE” clauses in SQL. Regular expressions work similarly when matching patterns in strings, but they provide even more utility because the characters in the expression can be further “wildcarded” to identify patterns.
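To make the SQL comparison concrete, the sketch below pairs a few common LIKE patterns with rough regex counterparts. The vector `like_demo` is invented, and the SQL shown in the comments is only meant as an approximate analogy.

```r
library(stringr)

like_demo <- c("DATA SCIENCE", "BIG DATA", "STATISTICS")

# Roughly: WHERE major LIKE 'DATA%'  (starts with DATA)
starts_with_data <- str_detect(like_demo, "^DATA")

# Roughly: WHERE major LIKE '%DATA'  (ends with DATA)
ends_with_data <- str_detect(like_demo, "DATA$")

# Roughly: WHERE major LIKE '%DATA%' (contains DATA anywhere)
contains_data <- str_detect(like_demo, "DATA")

starts_with_data
ends_with_data
contains_data
```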