1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset, provide code that identifies the majors that contain either “DATA” or “STATISTICS”.
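Load the tidyverse (which provides read_csv, filter, the pipe, and the stringr functions used below), then import the dataset into R:
library(tidyverse)
college_majors_csv <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
college_majors <- read_csv(url(college_majors_csv))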
##
## -- Column specification --------------------------------------------------------
## cols(
## FOD1P = col_character(),
## Major = col_character(),
## Major_Category = col_character()
## )
Filter the data using a regular expression on the Major column. The grepl function returns a logical vector indicating whether each entry in Major matches the regular expression. filter then keeps the rows of college_majors for which grepl returned TRUE.
ds_majors <- college_majors %>%
  filter(grepl("DATA|STATISTICS", Major))
ds_majors$Major
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
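To make the intermediate step concrete, the sketch below runs grepl on a small hand-picked vector and then uses the result for logical indexing, which is what filter does row by row. The name sample_majors and its three entries are illustrative assumptions, not the full dataset. Note that the pattern is case-sensitive; this works here because the majors are stored in upper case.

```r
# Illustrative input: a hand-picked sample, not the full dataset
sample_majors <- c(
  "STATISTICS AND DECISION SCIENCE",
  "ZOOLOGY",
  "COMPUTER PROGRAMMING AND DATA PROCESSING"
)

# grepl returns one logical value per element of its input
is_match <- grepl("DATA|STATISTICS", sample_majors)
print(is_match)

# Logical indexing keeps only the matching elements,
# in the same way filter keeps only the matching rows
print(sample_majors[is_match])
```

If the data were in mixed case, passing ignore.case = TRUE to grepl would be one way to keep the same pattern.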
2. Write code that transforms \(\texttt{fruits}\), a character vector of 14 strings, into a single string beginning \(\texttt{c("}\).
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
stem <- "c(" # begin the output string
for (fruit in fruits) { # append each fruit, quoted and followed by a comma and a space
  stem <- str_c(stem, '"', fruit, '", ')
}
stem <- str_c(substr(stem, 1, nchar(stem) - 2), ")") # replace the trailing comma and space with a closing parenthesis
print.noquote(stem) # display the resulting string with escape characters hidden
## [1] c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
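The loop above can also be written without explicit iteration. The following is a minimal sketch of a vectorized alternative using base R's paste0 and paste with collapse. The names quoted and fruit_string are introduced here for illustration only.

```r
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange",
            "blueberry", "cantaloupe", "chili pepper", "cloudberry",
            "elderberry", "lime", "lychee", "mulberry", "olive",
            "salal berry")

# Wrap every element in double quotes in one vectorized call
quoted <- paste0('"', fruits, '"')

# Join the quoted elements with ", " and add the c( ... ) wrapper
fruit_string <- paste0("c(", paste(quoted, collapse = ", "), ")")

# writeLines prints the string without escaping the inner quotes
writeLines(fruit_string)
```

Because the separator is inserted only between elements, there is no trailing comma to trim afterwards.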
3. Describe, in words, what these expressions will match:
\(\texttt{(.)\1\1}\) This expression matches any string containing the same character three times in a row (for example, eee). No entry in stringr's \(\texttt{words}\) vector contains such a run, so the subset of \(\texttt{words}\) matching this pattern is empty.
str_subset(words, "(.)\\1\\1")
## character(0)
\(\texttt{"(.)(.)\\\2\\\1"}\) This expression matches any string containing two characters followed immediately by the same two characters in reverse order (for example, abba).
str_subset(words, "(.)(.)\\2\\1")
## [1] "afternoon" "apparent" "arrange" "bottom" "brilliant"
## [6] "common" "difficult" "effect" "follow" "indeed"
## [11] "letter" "million" "opportunity" "oppose" "tomorrow"
\(\texttt{(..)\1}\) This expression matches any string containing two characters followed immediately by the same two characters in the same order (for example, abab).
str_subset(words, "(..)\\1")
## [1] "remember"
\(\texttt{"(.).\\\1.\\\1"}\) This expression matches any string containing a sequence of 5 characters in which the first, third, and fifth characters are the same (for example, abaca).
str_subset(words, "(.).\\1.\\1")
## [1] "eleven"
\(\texttt{"(.)(.)(.).*\\\3\\\2\\\1"}\) This expression matches any string containing a sequence of any three characters followed, after zero or more other characters, by the same three characters in reverse order (for example, abc…cba).
str_subset(words, "(.)(.)(.).*\\3\\2\\1")
## [1] "paragraph"
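Because the matches in \(\texttt{words}\) are sparse (the first pattern has none), it can help to check each description against a constructed string. The sketch below is an illustrative check, not part of the original exercise: the vector examples holds one hand-made string per pattern, and base R's grepl with perl = TRUE is used so that it runs without stringr.

```r
# The five patterns from this exercise, written as R strings
patterns <- c(
  "(.)\\1\\1",
  "(.)(.)\\2\\1",
  "(..)\\1",
  "(.).\\1.\\1",
  "(.)(.)(.).*\\3\\2\\1"
)

# One hand-made example string per pattern (illustrative inputs)
examples <- c("eee", "abba", "abab", "abaca", "abcxcba")

# Build a logical matrix: rows are example strings, columns are patterns
hits <- sapply(patterns, function(p) grepl(p, examples, perl = TRUE))
rownames(hits) <- examples
print(hits)
```

With these particular inputs, each example string matches only the pattern it was built for, so the TRUE values fall on the diagonal of the matrix.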
4. Construct regular expressions to match words that:
Start and end with the same character.
\(\texttt{^(.)}\) Begin the string with any character and capture it.
\(\texttt{.*}\) Follow with any number of characters, including none.
\(\texttt{\1\$}\) End the string with a backreference to the character captured by the parenthesized period.
str_subset(words, "^(.).*\\1$")
## [1] "america" "area" "dad" "dead" "depend"
## [6] "educate" "else" "encourage" "engine" "europe"
## [11] "evidence" "example" "excuse" "exercise" "expense"
## [16] "experience" "eye" "health" "high" "knock"
## [21] "level" "local" "nation" "non" "rather"
## [26] "refer" "remember" "serious" "stairs" "test"
## [31] "tonight" "transport" "treat" "trust" "window"
## [36] "yesterday"
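One edge case is worth noting: the pattern above needs at least two characters, so a one-letter word such as "a" is not returned even though it trivially starts and ends with the same character. The following sketch shows one possible variant that makes everything after the first character optional. The input vector x and the names strict and lenient are illustrative assumptions, and base R's grepl with perl = TRUE is used so that the example runs without stringr.

```r
# Illustrative inputs, including a one-letter word
x <- c("a", "dad", "ab", "level", "to")

# Pattern used above: requires a second occurrence of the first character
strict <- grepl("^(.).*\\1$", x, perl = TRUE)

# Variant: the "anything, then the first character again" part is optional,
# so a single character on its own also matches
lenient <- grepl("^(.)(.*\\1)?$", x, perl = TRUE)

print(data.frame(word = x, strict = strict, lenient = lenient))
```

Whether one-letter words should count is a matter of interpretation; the variant only differs from the original pattern on strings of length one.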
Contain a repeated pair of letters (e.g., “church” contains “ch” repeated twice.)
\(\texttt{(..)}\) Capture any pair of characters.
\(\texttt{.*}\) Follow with any number of characters, including none.
\(\texttt{\1}\) Follow with a backreference to the pair of characters captured earlier.
str_subset(words, "(..).*\\1")
## [1] "appropriate" "church" "condition" "decide" "environment"
## [6] "london" "paragraph" "particular" "photograph" "prepare"
## [11] "pressure" "remember" "represent" "require" "sense"
## [16] "therefore" "understand" "whether"
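To see which pair is responsible for each match, the captured group can be extracted rather than only tested for. The sketch below uses base R's regexec and regmatches on a few words taken from the output above plus one non-matching word ("tree"). The names x, m, and pair are introduced here for illustration.

```r
# Three words from the output above, plus one word with no repeated pair
x <- c("church", "remember", "london", "tree")

# regexec reports the position of the full match and of each capture group;
# regmatches turns those positions into the matched substrings
m <- regmatches(x, regexec("(..).*\\1", x, perl = TRUE))

# Element 1 is the full match and element 2 is the captured pair;
# words with no match yield an empty vector, reported here as NA
pair <- vapply(
  m,
  function(v) if (length(v) > 0) v[2] else NA_character_,
  character(1)
)
names(pair) <- x
print(pair)
```

With stringr loaded, str_match(x, "(..).*\\1") returns the same information as a matrix whose second column holds the captured pair.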
Contain one letter repeated in at least three places (e.g., “eleven” contains three “e”s).
\(\texttt{(.)}\) Capture any character.
\(\texttt{.*}\) Follow with any number of characters, including none.
\(\texttt{\1}\) Follow with a backreference to the character captured by (.).
\(\texttt{.*}\) Follow with any number of characters, including none.
\(\texttt{\1}\) Follow with a second backreference to the character captured by (.).
str_subset(words, "(.).*\\1.*\\1")
## [1] "appropriate" "available" "believe" "between" "business"
## [6] "degree" "difference" "discuss" "eleven" "environment"
## [11] "evidence" "exercise" "expense" "experience" "individual"
## [16] "paragraph" "receive" "remember" "represent" "telephone"
## [21] "therefore" "tomorrow"
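As a sanity check on this pattern, the same condition can be tested without regular expressions by counting how often each character occurs in a word. The sketch below compares the two approaches on a few hand-picked words. The helper has_triple and the input vector x are hypothetical names introduced for this illustration.

```r
# Hypothetical helper: TRUE if any single character occurs at least 3 times
has_triple <- function(w) {
  counts <- table(strsplit(w, "")[[1]])
  any(counts >= 3)
}

# Illustrative inputs: three words that qualify and one that does not
x <- c("eleven", "banana", "letter", "tomorrow")

# Regex approach from above (base R grepl with Perl-style backreferences)
regex_hits <- grepl("(.).*\\1.*\\1", x, perl = TRUE)

# Counting approach
count_hits <- vapply(x, has_triple, logical(1), USE.NAMES = FALSE)

print(data.frame(word = x, regex = regex_hits, count = count_hits))
```

The two columns agree because the regular expression places no constraint on what lies between the three occurrences.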
Questions
A programming language like Python is Turing complete because it can be used to express any computable function. Is there a sense in which regular expressions are “complete”? That is, can a regular expression be written to express any possible pattern in a string? Computers are built from primitive logic gates, such as AND, OR, and NOT. Is there a comparable primitive set of operations for identifying patterns in strings?