Data 607 Assignment 3

Load packages

library(tidyverse)

Overview / Introduction

This week we reviewed regular expressions, which are useful for string manipulation and identifying patterns within strings. Since understanding strings is foundational to understanding regular expressions, we also touched upon useful string functions in this assignment.

Question 1

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

majors <- read.csv("C:\\Users\\Kim\\Documents\\Data607\\all-ages.csv", header = TRUE, sep = ",")
DATAMAJORS <- majors %>% filter(str_detect(Major,"DATA") | str_detect(Major,"STATISTICS"))
DATAMAJORS

##   Major_code                                         Major
## 1       2101      COMPUTER PROGRAMMING AND DATA PROCESSING
## 2       3702               STATISTICS AND DECISION SCIENCE
## 3       6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS
##            Major_category  Total Employed Employed_full_time_year_round
## 1 Computers & Mathematics  29317    22828                         18747
## 2 Computers & Mathematics  24806    18808                         14468
## 3                Business 156673   134478                        118249
##   Unemployed Unemployment_rate Median P25th  P75th
## 1       2265        0.09026422  60000 40000  85000
## 2       1138        0.05705405  70000 43000 102000
## 3       6186        0.04397714  72000 50000 100000
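The two str_detect calls can also be folded into one pattern with regex alternation. The following is a minimal sketch, assuming a small hand-made data frame (majors_sample is a hypothetical stand-in for the Major column of all-ages.csv, not the real file):

```r
library(stringr)

# Hypothetical sample standing in for the Major column of all-ages.csv
majors_sample <- data.frame(
  Major = c("COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "GENERAL AGRICULTURE"),
  stringsAsFactors = FALSE
)

# "|" is regex alternation: keep rows whose Major contains either word
keep <- str_detect(majors_sample$Major, "DATA|STATISTICS")
data_majors <- majors_sample[keep, , drop = FALSE]
data_majors$Major
```

Either form should select the same rows; the alternation version simply keeps the condition in one place.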

Question 2

I interpret this question as starting from a list that prints like this:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

as shown by

print(c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry"))

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

However, the goal is to print the vector as the code that would recreate it:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

One way to do this is to pass that text to writeLines as a single string:

writeLines('c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")')

## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
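The same text can also be built from the vector itself rather than typed out by hand. This is a sketch under the assumption that the data already sits in a character vector (fruits is a hypothetical name, shortened to three items for illustration):

```r
library(stringr)

# Shortened, hypothetical version of the fruit vector
fruits <- c("bell pepper", "bilberry", "blackberry")

# Wrap each element in double quotes, join with ", ", then wrap in c( ... )
quoted <- str_c('"', fruits, '"', collapse = ", ")
as_code <- str_c("c(", quoted, ")")
writeLines(as_code)
```

Building the string this way keeps the output in sync with the data if the vector changes.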

Question 3

Describe, in words, what these expressions will match:

(.)\1\1

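When typed as an R string without doubling the backslashes, the (.)\1\1 regular expression looks for any single character (the “(.)” portion of the expression) followed by two literal \001 ASCII characters. This happens because R reads \1 inside a string as an octal escape, so no backreference ever reaches the regex engine. In RStudio, \001 renders like the following: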

writeLines("\1")

##

While other control characters may render the same way, the rendering does not change how R searches for the character: each non-printing character is still distinct. For example, \002 is rendered in the same manner, but it will not match \001.

x <- c("a\1\1","b\1\1","\1\1c","d\2\2")
writeLines(x)

## a
## b
## c
## d

str_view(x,"(.)\1\1")

## [1] │ <a>
## [2] │ <b>

“(.)(.)\\2\\1”

The “(.)(.)\\2\\1” regular expression includes quotes in this case, which I treat as part of the pattern. Since \\2 and \\1 are properly escaped in the R string, they reach the regex engine as backreferences. The pattern therefore matches any two characters, followed by the second character, followed by the first character (for example abba), all bounded in quotes.

x <- c('"aaaa"','"abba"',"aaaa","abba")
writeLines(x)

## "aaaa"
## "abba"
## aaaa
## abba

str_view(x,'"(.)(.)\\2\\1"')

## [1] │ <"aaaa">
## [2] │ <"abba">

(..)\1

The (..)\1 regular expression will match with any two characters followed by the ASCII \001 character.

x <- c("a\1\1","za\1","\1\1\1","\2\2\2")
writeLines(x)

## a
## za
## 
##

str_view(x,"(..)\1")

## [1] │ <a>
## [2] │ <za>
## [3] │ <>

“(.).\\1.\\1”

The “(.).\\1.\\1” regular expression matches any two characters, followed by a repeat of the first character, followed by any character, followed by a repeat of the first character again (for example abaxa). Again, the match has to be bounded by quotes since quotes were included in the regular expression.

x <- c('"abaxa"','"\1\2\1x\1"',"abaxa")
writeLines(x)

## "abaxa"
## "x"
## abaxa

str_view(x,'"(.).\\1.\\1"')

## [1] │ <"abaxa">
## [2] │ <"x">

“(.)(.)(.).*\\3\\2\\1”

The “(.)(.)(.).*\\3\\2\\1” regular expression matches any three characters, then zero or more characters of any kind, then the 3rd character, 2nd character, and 1st character, all bounded in quotes.

x <- c('"abcxcba"','"abcccba"','abcdcba')
writeLines(x)

## "abcxcba"
## "abcccba"
## abcdcba

str_view(x,'"(.)(.)(.).*\\3\\2\\1"')

## [1] │ <"abcxcba">
## [2] │ <"abcccba">
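The difference between ".*" and a single "." in the middle of this pattern only shows up when the middle section is empty or longer than one character. A small sketch with made-up strings (x_demo, with_star, and with_dot are hypothetical names):

```r
library(stringr)

# Middle sections of length 0, 1, and 5 between abc and cba
x_demo <- c('"abccba"', '"abcxcba"', '"abc12345cba"')

# ".*" allows the middle to be empty or any length
with_star <- str_detect(x_demo, '"(.)(.)(.).*\\3\\2\\1"')

# A single "." requires exactly one character in the middle
with_dot <- str_detect(x_demo, '"(.)(.)(.).\\3\\2\\1"')

with_star
with_dot
```

Only the ".*" version accepts all three strings; the single "." version accepts just the second.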

Question 4

Construct regular expressions that match words that:

Start and end with the same character.

x <- c("example","ee",'"anything in quotes should work"','"unless I put a different char at the end"!')
writeLines(x)

## example
## ee
## "anything in quotes should work"
## "unless I put a different char at the end"!

str_view(x,"^(.).*\\1$")

## [1] │ <example>
## [2] │ <ee>
## [3] │ <"anything in quotes should work">
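One edge case worth noting: the pattern "^(.).*\\1$" needs at least two characters, so a one-letter word such as "a" is not matched even though it trivially starts and ends with the same character. The sketch below shows one possible extension that makes the trailing part optional; it is an illustration, not the only way to handle this (x_demo, original, and extended are hypothetical names):

```r
library(stringr)

x_demo <- c("a", "ee", "example", "ab")

# Original pattern: requires a second occurrence of the first character
original <- str_detect(x_demo, "^(.).*\\1$")

# Possible extension: the "anything, then the first character again" part is optional,
# so a single-character string also matches
extended <- str_detect(x_demo, "^(.)(.*\\1)?$")

original
extended
```

Whether single-character words should count depends on how the question is read, so the original pattern may be the intended answer.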

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

x <- c("church","abxyxyab","no","abobo")
writeLines(x)

## church
## abxyxyab
## no
## abobo

str_view(x,"(..).*\\1")

## [1] │ <church>
## [2] │ <abxyxyab>
## [4] │ a<bobo>
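Because "." matches any character, the pattern "(..).*\\1" also accepts repeated pairs of digits, spaces, or punctuation. If the question is read strictly as pairs of letters, a character class can stand in for the dots. This is a sketch assuming plain ASCII letters are sufficient (x_demo, any_pair, and letter_pair are hypothetical names):

```r
library(stringr)

x_demo <- c("church", "abobo", "a1212", "no")

# Any two characters repeated later in the string
any_pair <- str_detect(x_demo, "(..).*\\1")

# Only two ASCII letters repeated later in the string
letter_pair <- str_detect(x_demo, "([A-Za-z]{2}).*\\1")

any_pair
letter_pair
```

The string "a1212" separates the two versions: the pair "12" repeats, but it is not a pair of letters.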

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

x <- c("church","abxyxyaba","no","eleven","elevene","elevn")
writeLines(x)

## church
## abxyxyaba
## no
## eleven
## elevene
## elevn

str_view(x,"(.).*\\1.*\\1")

## [2] │ <abxyxyaba>
## [4] │ <eleve>n
## [5] │ <elevene>
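The highlighted spans above reflect greedy matching: each ".*" takes as much as it can while still letting the whole pattern succeed. A short sketch comparing this with the lazy quantifier ".*?" (words, greedy, and lazy are hypothetical names):

```r
library(stringr)

words <- c("eleven", "elevene")

# Greedy: the match stretches to the last possible third occurrence
greedy <- str_extract(words, "(.).*\\1.*\\1")

# Lazy: the match stops at the earliest possible third occurrence
lazy <- str_extract(words, "(.).*?\\1.*?\\1")

greedy
lazy
```

Both versions detect the same set of words; only the extent of the matched text differs.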

Conclusion

Learning about these functions has opened a door to better “querying” in R, where functions like str_detect can act analogously to “like” clauses in SQL. Regular expressions serve a similar purpose when matching patterns in strings, but they provide even more utility because the individual characters in the expression can themselves be “wildcarded” to describe a pattern.