Week 3 Assignment

Assignment on RPubs
Rmd on Github

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

library(stringr)

majorCSV <- read.csv("https://raw.githubusercontent.com/logicalschema/DATA607/master/week3/majors-list.csv")

# Search the Major column with the regular expression DATA|STATISTICS and return the matching values.
grep('DATA|STATISTICS', majorCSV$Major, value = TRUE)

## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"
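
Because stringr is already loaded, the same search could also be written with str_subset(). The sketch below is only an illustration: it assumes a small hand-typed vector (`sample_majors`, a hypothetical stand-in that is not read from the CSV) so that it runs without a network connection.

```r
library(stringr)

# Hypothetical stand-in for majorCSV$Major; only a few majors are typed in by hand.
sample_majors <- c(
  "GENERAL AGRICULTURE",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "ACCOUNTING",
  "STATISTICS AND DECISION SCIENCE"
)

# str_subset() keeps the elements that match the pattern,
# much like grep(..., value = TRUE) in base R.
data_or_stats <- str_subset(sample_majors, "DATA|STATISTICS")
print(data_or_stats)
```

On the real data, `sample_majors` would be replaced by `majorCSV$Major`.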

Write code that transforms the data below:

[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

# Enter each printed row as its own character vector.
w <- c("bell pepper", "bilberry", "blackberry", "blood orange")
x <- c("blueberry", "cantaloupe", "chili pepper", "cloudberry")
y <- c("elderberry", "lime", "lychee", "mulberry")
z <- c("olive", "salal berry")

# Concatenate the rows in their original order.
combined <- c(w, x, y, z)

print(combined)

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
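
If the printed output were available as a single string rather than retyped row by row, the transformation could also be done with regular expressions. The following is a minimal sketch under that assumption; `raw_output`, `fruits`, and `formatted` are illustrative names that do not appear in the assignment.

```r
library(stringr)

# Assumption: the printed vector has been pasted into one string.
raw_output <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Pull out every double-quoted item; the [n] index markers are not quoted, so they are skipped.
quoted <- str_extract_all(raw_output, '"[^"]+"')[[1]]

# Drop the surrounding quotes to get a plain character vector.
fruits <- str_remove_all(quoted, '"')

# Rebuild the text of a c(...) call from the vector.
formatted <- str_c("c(", str_c('"', fruits, '"', collapse = ", "), ")")
writeLines(formatted)
```

Base R's dput(fruits) produces similar c(...) text directly from the vector.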

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

Describe, in words, what these expressions will match:

(.)\1\1
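Written as an R string without escaped backslashes, this expression matches any single character (except a line break) followed by two copies of the control character “\1”, which R prints as “\001”. Examples would be “b\1\1”, “c\1\1”, or “5\1\1”. If the backslashes are doubled ("(.)\\1\\1"), \1 becomes a backreference and the expression instead matches the same character three times in a row, as in “aaa”.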

x <- c("b\1\1", "c\1\1", "hello\1\1", "yellow")
str_match(x, '(.)\1\1')

##      [,1]        [,2]
## [1,] "b\001\001" "b" 
## [2,] "c\001\001" "c" 
## [3,] "o\001\001" "o" 
## [4,] NA          NA
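
The difference between the unescaped and escaped forms of this pattern can be checked directly. The sketch below is an illustration only; the variable names are made up for the example.

```r
library(stringr)

x <- c("aaa", "b\1\1", "yellow")

# Unescaped: R converts each \1 into the control character with code 1
# before the regex engine ever sees the pattern.
literal_pattern <- "(.)\1\1"

# Escaped: the regex engine receives (.)\1\1, so \1 is a backreference to group 1.
backref_pattern <- "(.)\\1\\1"

literal_hits <- str_detect(x, literal_pattern)
backref_hits <- str_detect(x, backref_pattern)

print(literal_hits)  # FALSE TRUE FALSE
print(backref_hits)  # TRUE FALSE FALSE
```

Only “b\1\1” contains the literal control characters, and only “aaa” contains one character three times in a row.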

"(.)(.)\\2\\1"
This expression matches two characters (each anything except a line break) followed immediately by the same two characters in reverse order. Examples would be “abba”, “0110”, or “daad”.

x <- c("abba", "0110", "ACTGGTCA", "yellow")
str_match(x, "(.)(.)\\2\\1")

##      [,1]   [,2] [,3]
## [1,] "abba" "a"  "b" 
## [2,] "0110" "0"  "1" 
## [3,] "TGGT" "T"  "G" 
## [4,] NA     NA   NA
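
As a quick check on which strings qualify, the sketch below runs the same pattern with str_extract() on a few extra words chosen for illustration. Note that “0101” does not match, because its second pair is not the reverse of the first.

```r
library(stringr)

words <- c("pepper", "0110", "0101", "banana")

# Return the first mirrored four-character run in each word, or NA if there is none.
mirrored <- str_extract(words, "(.)(.)\\2\\1")
print(mirrored)  # "eppe" "0110" NA NA
```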

(..)\1
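Written as an R string without escaped backslashes, this expression matches any two characters (except line breaks) followed by the control character “\1”, which R prints as “\001”. Examples would be “ab\1”, “54\1”, or “11\1”. With doubled backslashes ("(..)\\1"), \1 becomes a backreference and the expression matches a pair of characters that is immediately repeated, as in “abab”.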

x <- c("ab\1", "red", "A\1", "AABBCC\1")
str_match(x, '(..)\1')

##      [,1]     [,2]
## [1,] "ab\001" "ab"
## [2,] NA       NA  
## [3,] NA       NA  
## [4,] "CC\001" "CC"
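
For comparison, here is a short sketch of the escaped form of this pattern, in which \1 acts as a backreference. The test words are picked only for illustration.

```r
library(stringr)

words <- c("coconut", "salal berry", "lime")

# "(..)\\1" looks for any two characters that are immediately repeated.
repeated_pairs <- str_extract(words, "(..)\\1")
print(repeated_pairs)  # "coco" "alal" NA
```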

"(.).\\1.\\1"
This expression matches strings that contain a character that appears again two and four positions after its first occurrence, with any single character in between each time. Examples would be “a0a1a”, “c1d1e1”, and “-1-2-3”.

x <- c("a0a1a", "blue", "c1d1e1", "-1-2-3")
str_match(x, "(.).\\1.\\1")

##      [,1]    [,2]
## [1,] "a0a1a" "a" 
## [2,] NA      NA  
## [3,] "1d1e1" "1" 
## [4,] "-1-2-" "-"
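
One way to confirm the description is to extract each match and compare the characters at positions 1, 3, and 5. This is a minimal sketch with illustrative words and variable names.

```r
library(stringr)

words <- c("banana", "papaya", "apple")

# First five-character run in which positions 1, 3 and 5 hold the same character.
runs <- str_extract(words, "(.).\\1.\\1")
print(runs)  # "anana" "apaya" NA

# Positions 1, 3 and 5 of every match should be identical (NA where nothing matched).
same_char <- str_sub(runs, 1, 1) == str_sub(runs, 3, 3) &
  str_sub(runs, 3, 3) == str_sub(runs, 5, 5)
print(same_char)  # TRUE TRUE NA
```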

"(.)(.)(.).*\\3\\2\\1"
This expression matches strings that contain three characters (each anything except a line break), followed by zero or more characters, followed by the same three characters in reverse order. Examples would be “abcjfkdjkfjicba” and “0110Middleofthestring110”.

x <- c("abcjfkdjkfjicba", "redyellow001middle100kdlskdls", "beginbegin1middlegebend", "98&^A")
str_match(x, "(.)(.)(.).*\\3\\2\\1")  #Note the \ needs to be escaped in the expression.

##      [,1]                   [,2] [,3] [,4]
## [1,] "abcjfkdjkfjicba"      "a"  "b"  "c" 
## [2,] "001middle100"         "0"  "0"  "1" 
## [3,] "beginbegin1middlegeb" "b"  "e"  "g" 
## [4,] NA                     NA   NA   NA
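
Because .* is greedy, the match extends to the last possible reversed triple in the string. The sketch below contrasts it with the lazy quantifier .*?, which stops at the first one. The test string is invented for the example.

```r
library(stringr)

x <- "abccbaxcba"

# Greedy: .* grabs as much as it can, so the match ends at the final "cba".
greedy <- str_extract(x, "(.)(.)(.).*\\3\\2\\1")

# Lazy: .*? grabs as little as it can, so the match ends at the first "cba".
lazy <- str_extract(x, "(.)(.)(.).*?\\3\\2\\1")

print(greedy)  # "abccbaxcba"
print(lazy)    # "abccba"
```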

Construct regular expressions to match words that

Start and end with the same character:
"^(.)(.*)\1$"

x <- c("amiddleofthestringa", "0red,yellow,green0", "red", "9*&^(^Hjshjshf9")
str_match(x, "^(.)(.*)\\1$")  #Note the \ needs to be escaped in the expression.

##      [,1]                  [,2] [,3]               
## [1,] "amiddleofthestringa" "a"  "middleofthestring"
## [2,] "0red,yellow,green0"  "0"  "red,yellow,green" 
## [3,] NA                    NA   NA                 
## [4,] "9*&^(^Hjshjshf9"     "9"  "*&^(^Hjshjshf"
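
The pattern above requires at least two characters, so a one-character word such as “a” is not matched. Whether such a word should count is a matter of interpretation. If it should, one possible variant makes the tail optional, as in this sketch (the variant is an assumption, not part of the original answer).

```r
library(stringr)

words <- c("a", "aba", "ab", "level")

# Original pattern: needs a first character, any middle, and the same last character.
original <- str_detect(words, "^(.)(.*)\\1$")

# Variant: the "middle plus repeated character" part is optional,
# so a single character also counts as starting and ending with itself.
with_single <- str_detect(words, "^(.)(.*\\1)?$")

print(original)     # FALSE TRUE FALSE TRUE
print(with_single)  # TRUE TRUE FALSE TRUE
```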

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.):
"(..)(.*)\1"

x <- c("church", "blue", "red", "abracadabra")
str_match(x, "(..)(.*)\\1")  #Note the \ needs to be escaped in the expression.

##      [,1]        [,2] [,3]   
## [1,] "church"    "ch" "ur"   
## [2,] NA          NA   NA     
## [3,] NA          NA   NA     
## [4,] "abracadab" "ab" "racad"
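
Since . matches any character, the pattern above also accepts repeated pairs of digits or punctuation. If only letters should count, a character class could be used instead. The following sketch assumes unaccented ASCII letters, and the test strings are illustrative.

```r
library(stringr)

words <- c("church", "1212", "a--b--c", "abracadabra")

# Any two characters repeated later in the string.
any_chars <- str_detect(words, "(..)(.*)\\1")

# Only a pair of ASCII letters repeated later in the string.
letters_only <- str_detect(words, "([A-Za-z]{2}).*\\1")

print(any_chars)     # TRUE TRUE TRUE TRUE
print(letters_only)  # TRUE FALSE FALSE TRUE
```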

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.):
"(.)(.*)\1(.*)\1"

x <- c("eleven", "blue", "010001001010", "yellow submarine light")
str_match(x, "(.)(.*)\\1(.*)\\1")  #Note the \ needs to be escaped in the expression.

##      [,1]               [,2] [,3]       [,4]           
## [1,] "eleve"            "e"  "l"        "v"            
## [2,] NA                 NA   NA         NA             
## [3,] "010001001010"     "0"  "10001001" "1"            
## [4,] "llow submarine l" "l"  ""         "ow submarine "
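
The regular expression can be cross-checked by counting letters directly. The sketch below uses a hypothetical helper, `has_triple_letter()`, that considers only the letters a to z (case-insensitive). Unlike the pattern above, it therefore does not count the repeated digits in “010001001010”.

```r
library(stringr)

# Hypothetical helper: TRUE when some letter occurs at least three times in `word`.
has_triple_letter <- function(word) {
  chars <- str_extract_all(str_to_lower(word), "[a-z]")[[1]]
  # Guard against words with no letters before taking the maximum count.
  length(chars) > 0 && max(table(chars)) >= 3
}

words <- c("eleven", "blue", "010001001010", "yellow submarine light")
triple <- vapply(words, has_triple_letter, logical(1), USE.NAMES = FALSE)
print(triple)  # TRUE FALSE FALSE TRUE
```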

Sung Lee

2/12/2020
