Assignment on RPubs
Rmd on Github
Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
library(stringr)
majorCSV <- read.csv("https://raw.githubusercontent.com/logicalschema/DATA607/master/week3/majors-list.csv")
# Search the Major column for names that contain either DATA or STATISTICS.
grep('DATA|STATISTICS', majorCSV$Major, value = TRUE)
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
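As a minimal sketch of a variation on the search above, `grepl()` returns a logical vector, which makes it possible to keep whole rows rather than only the matching names. The small data frame `majors` below is a hand-typed stand-in for the downloaded file (an assumption made so the example runs without a network connection), and the mixed-case entry is invented purely to show `ignore.case`.

```r
# Hypothetical stand-in for majorCSV: a handful of rows typed by hand.
majors <- data.frame(
  Major = c("GENERAL AGRICULTURE",
            "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "Applied statistics"),   # invented mixed-case entry
  stringsAsFactors = FALSE
)

# grepl() gives TRUE/FALSE per row, so the full rows can be subset.
hits <- majors[grepl("DATA|STATISTICS", majors$Major), , drop = FALSE]

# ignore.case = TRUE also catches the mixed-case entry.
hits_any_case <- majors$Major[grepl("DATA|STATISTICS", majors$Major,
                                    ignore.case = TRUE)]
print(hits)
print(hits_any_case)
```

The `value = TRUE` form of `grep()` used above is the shorter choice when only the names are needed.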
Write code that transforms the data below:
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
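One programmatic route, sketched here under the assumption that the printed vector is available as a single block of text (the variable names `raw_text`, `fruits`, and `as_call` are illustrative), is to pull out each quoted item with a regular expression and rebuild the `c(...)` call from the pieces.

```r
# The printed vector pasted in as raw text, index markers such as [1] included.
raw_text <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'

# Extract every double-quoted item, then strip the surrounding quotes.
quoted <- regmatches(raw_text, gregexpr('"[^"]*"', raw_text))[[1]]
fruits <- gsub('"', "", quoted, fixed = TRUE)

# Rebuild the c(...) call as one string.
as_call <- paste0("c(", paste0('"', fruits, '"', collapse = ", "), ")")
cat(as_call, "\n")
```

Because the pattern only looks at what sits between double quotes, the `[1]`, `[5]` index markers and the line breaks are ignored without any extra cleanup.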
w <- c("bell pepper","bilberry","blackberry","blood orange")
x <- c("blueberry","cantaloupe","chili pepper","cloudberry")
y <- c("elderberry","lime","lychee","mulberry")
z <- c("olive","salal berry")
combined <- c(w, x, y, z)
print(combined)
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
Describe, in words, what these expressions will match:
(.)\1\1
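Written as the R string '(.)\1\1', this is not a backreference pattern: R reads each \1 as the control character \001. The expression therefore matches any single character, except a line break, followed by two \001 characters. Examples would be “b\1\1”, “c\1\1”, or “5\1\1”. With escaped backslashes, "(.)\\1\\1" would instead match the same character three times in a row, such as “aaa”.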
x <- c("b\1\1", "c\1\1", "hello\1\1", "yellow")
str_match(x, '(.)\1\1')
## [,1] [,2]
## [1,] "b\001\001" "b"
## [2,] "c\001\001" "c"
## [3,] "o\001\001" "o"
## [4,] NA NA
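For comparison, here is a small sketch of what the escaped form of the pattern does. It uses base R (`grepl()` and `regmatches()` with `perl = TRUE`) rather than stringr so that it runs without extra packages, and the test words are made up for illustration.

```r
# In an R string, "\1" is one control character; "\\1" is a backslash plus "1".
nchar("\1")    # 1
nchar("\\1")   # 2

x <- c("aaa", "bookkeeper", "hello", "zzz!")
triple <- "(.)\\1\\1"   # backreference: the same character three times in a row

has_triple   <- grepl(triple, x, perl = TRUE)
first_triple <- regmatches(x, regexpr(triple, x, perl = TRUE))
print(has_triple)
print(first_triple)
```

"bookkeeper" contains three doubled letters but no tripled one, so only "aaa" and "zzz!" match.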
"(.)(.)\\2\\1"
This expression matches a pair of characters, excluding line breaks, followed immediately by the same pair in reverse order. Examples would be “abba”, “0110”, or “daad”.
x <- c("abba", "0110", "ACTGGTCA", "yellow")
str_match(x, "(.)(.)\\2\\1")
## [,1] [,2] [,3]
## [1,] "abba" "a" "b"
## [2,] "0110" "0" "1"
## [3,] "TGGT" "T" "G"
## [4,] NA NA NA
(..)\1
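Written as the R string '(..)\1', this expression matches any two characters, excluding line breaks, followed by the control character \001, which is how R reads an unescaped \1. Examples would be “ab\1”, “54\1”, or “11\1”. With an escaped backslash, "(..)\\1" would instead match a pair of characters that is immediately repeated, such as “abab”.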
x <- c("ab\1", "red", "A\1", "AABBCC\1")
str_match(x, '(..)\1')
## [,1] [,2]
## [1,] "ab\001" "ab"
## [2,] NA NA
## [3,] NA NA
## [4,] "CC\001" "CC"
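If the expression is read as a regex backreference instead, the backslash has to be escaped in the R string. The following is a sketch of that reading in base R, with test words chosen only for illustration.

```r
x <- c("banana", "cucumber", "red", "papaya")
pair_twice <- "(..)\\1"   # two characters, immediately repeated

# regexpr() finds the first match in each string; regmatches() extracts it
# and silently drops the strings that did not match.
matched <- regmatches(x, regexpr(pair_twice, x, perl = TRUE))
print(matched)
```

In "banana" the match starts at the second letter, because "ba" is not followed by another "ba" but "an" is followed by "an".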
"(.).\\1.\\1"
This expression matches a character that appears again two and four positions after its first occurrence, with any single character, excluding line breaks, in each gap. Examples would be “a0a1a”, “c1d1e1”, and “-1-2-3”.
x <- c("a0a1a", "blue", "c1d1e1", "-1-2-3")
str_match(x, "(.).\\1.\\1")
## [,1] [,2]
## [1,] "a0a1a" "a"
## [2,] NA NA
## [3,] "1d1e1" "1"
## [4,] "-1-2-" "-"
"(.)(.)(.).*\\3\\2\\1"
This expression matches three characters, excluding line breaks, followed by any run of characters and then the same three characters in reverse order. Examples would be “abcjfkdjkfjicba” and “0110Middleofthestring110”.
x <- c("abcjfkdjkfjicba", "redyellow001middle100kdlskdls", "beginbegin1middlegebend", "98&^A")
str_match(x, "(.)(.)(.).*\\3\\2\\1") #Note the \ needs to be escaped in the expression.
## [,1] [,2] [,3] [,4]
## [1,] "abcjfkdjkfjicba" "a" "b" "c"
## [2,] "001middle100" "0" "0" "1"
## [3,] "beginbegin1middlegeb" "b" "e" "g"
## [4,] NA NA NA NA
Construct regular expressions to match words that:
Start and end with the same character:
"^(.)(.*)\\1$"
x <- c("amiddleofthestringa", "0red,yellow,green0", "red", "9*&^(^Hjshjshf9")
str_match(x, "^(.)(.*)\\1$") #Note the \ needs to be escaped in the expression.
## [,1] [,2] [,3]
## [1,] "amiddleofthestringa" "a" "middleofthestring"
## [2,] "0red,yellow,green0" "0" "red,yellow,green"
## [3,] NA NA NA
## [4,] "9*&^(^Hjshjshf9" "9" "*&^(^Hjshjshf"
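One edge case worth noting: the pattern above needs at least two characters, so a one-letter word such as "a" is not matched even though it arguably starts and ends with the same character. A possible variant, offered as a sketch and not as part of the original answer, makes everything after the first character optional.

```r
x <- c("a", "aba", "ab", "amiddleofthestringa", "red")

# Pattern from the answer above: requires two or more characters.
original <- grepl("^(.)(.*)\\1$", x, perl = TRUE)

# Variant: the "anything, then the first character again" part is optional,
# so a single-character word also matches.
with_single <- grepl("^(.)(.*\\1)?$", x, perl = TRUE)

print(data.frame(x, original, with_single))
```

The two patterns agree on every input except the single-character word.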
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.):
"(..)(.*)\\1"
x <- c("church", "blue", "red", "abracadabra")
str_match(x, "(..)(.*)\\1") #Note the \ needs to be escaped in the expression.
## [,1] [,2] [,3]
## [1,] "church" "ch" "ur"
## [2,] NA NA NA
## [3,] NA NA NA
## [4,] "abracadab" "ab" "racad"
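Because `.` matches any character, the pattern above also accepts repeated pairs of digits or spaces. If the pair should consist of letters only, one option is a character class. This is a sketch that assumes unaccented ASCII letters, with test strings made up for illustration.

```r
x <- c("church", "1212", "a b a b", "abracadabra")

# Any two characters repeated later in the string.
any_pair <- grepl("(..).*\\1", x, perl = TRUE)

# Only pairs of ASCII letters count.
letter_pair <- grepl("([A-Za-z]{2}).*\\1", x, perl = TRUE)

print(data.frame(x, any_pair, letter_pair))
```

"1212" and "a b a b" pass the first check through the pairs "12" and "a " but fail the second, since neither contains two adjacent letters that recur.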
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.):
"(.)(.*)\\1(.*)\\1"
x <- c("eleven", "blue", "010001001010", "yellow submarine light")
str_match(x, "(.)(.*)\\1(.*)\\1") #Note the \ needs to be escaped in the expression.
## [,1] [,2] [,3] [,4]
## [1,] "eleve" "e" "l" "v"
## [2,] NA NA NA NA
## [3,] "010001001010" "0" "10001001" "1"
## [4,] "llow submarine l" "l" "" "ow submarine "
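The result can be cross-checked without a regular expression by counting characters directly. The sketch below uses base R only; the word list is illustrative, and the names `by_regex` and `by_count` are not from the original code.

```r
words <- c("eleven", "blue", "banana", "yellow submarine light")

# Regex check: some character occurs at least three times.
by_regex <- grepl("(.).*\\1.*\\1", words, perl = TRUE)

# Counting check: split each word into characters and tabulate them.
by_count <- vapply(
  strsplit(words, ""),
  function(chars) any(table(chars) >= 3),
  logical(1)
)

print(data.frame(words, by_regex, by_count))
```

Both checks treat spaces and punctuation as ordinary characters, so they agree with each other but are slightly broader than "letter" in the strict sense.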