Assignment Requirements

  1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

  2. Write code that transforms the data below:

    [1] “bell pepper” “bilberry” “blackberry” “blood orange”

    [5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

    [9] “elderberry” “lime” “lychee” “mulberry”

    [13] “olive” “salal berry”

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

  1. Describe, in words, what these expressions will match:

  1. Construct regular expressions to match words that:

    Start and end with the same character.

    Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

    Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

Question 1

The Economic Guide To Picking A College Major article data is stored on the fivethirtyeight github. I will be importing my data from the majors=list.csv and the grad-students.csv respectively.

majors_list<-
read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv",
           header= TRUE, sep=",")

majors_list<-majors_list[order(majors_list$Major),]

grad_students<-
read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/grad-students.csv", 
           header= TRUE, sep=",")

grad_students<-grad_students[order(grad_students$Major),]
## [1] "Columns for major_lists are:"
## [1] "FOD1P"          "Major"          "Major_Category"
## [1] "Columns for grad_students are:"
##  [1] "Major_code"                   "Major"                       
##  [3] "Major_category"               "Grad_total"                  
##  [5] "Grad_sample_size"             "Grad_employed"               
##  [7] "Grad_full_time_year_round"    "Grad_unemployed"             
##  [9] "Grad_unemployment_rate"       "Grad_median"                 
## [11] "Grad_P25"                     "Grad_P75"                    
## [13] "Nongrad_total"                "Nongrad_employed"            
## [15] "Nongrad_full_time_year_round" "Nongrad_unemployed"          
## [17] "Nongrad_unemployment_rate"    "Nongrad_median"              
## [19] "Nongrad_P25"                  "Nongrad_P75"                 
## [21] "Grad_share"                   "Grad_premium"

ANSWER

The majors that contain either “DATA” or “STATISTICS” in them, are as follows, according to both datasets:

dplyr::filter(majors_list, grepl('data|Data|DATA|statistic|Statistic|STATISTIC', Major))
##   FOD1P                                         Major          Major_Category
## 1  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 2  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics
dplyr::filter(majors_list, grepl('data|Data|DATA|statistic|Statistic|STATISTIC', Major))
##   FOD1P                                         Major          Major_Category
## 1  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 2  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics

Question 2

Write code that transforms the data below:

NOTE: the data below is in a matrix with the number values in brackets representing the row name.

##      [,1]          [,2]          [,3]           [,4]          
## [1]  "bell pepper" "bilberry"    "blackberry"   "blood orange"
## [5]  "blueberry"   "cantaloupe"  "chili pepper" "cloudberry"  
## [9]  "elderberry"  "lime"        "lychee"       "mulberry"    
## [13] "olive"       "salal berry" ""             ""

Into a format like this: c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

ANSWER

quotes = '"'
quotes
## [1] "\""
# Make vector
x_vector<-as.vector(t(x_matrix[c("[1]","[5]","[9]","[13]"),]))
# Remove "" based on index search
x_vector<- x_vector[-(match("",x_vector))]
x_vector<- x_vector[-(match("",x_vector))]
# add quotes on each element
x_vector <-paste(quotes,x_vector,quotes, sep = "")
# add commas at the end of all elements but the last one
x_vector <- paste(c("","","","","","","","","","","","","",""), x_vector, sep = "")
# collapse to a single string
x_vector <- paste(x_vector, collapse = ", ")
# add 'C()' wrapper
x_vector <- paste("c(",x_vector, ")", sep = "")
cat(x_vector)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

Question 3

(.)\1\1                   |`( )` group
                          |`(.)` match-any-character operator matches any character except line
                          |      terminator
                          |`\1`  back-reference-operator references capturing group `\1`
                          |`\1`  the second is just to replace duplicate words of group `\1`
                          |NOTE: Regex requires `\` escapes with `R` functions since parameter is inserted
                          |       as a string. This combiniation finds triples.
                          |
"(.)(.)\\2\\1"            |`( )` group
                          |`(.)` match-any-character operator group 1
                          |`(.)` match-any-character operator group 2
                          |`\\2` back-reference-operator with `\` to escape references capturing group `\2`
                          |`\\1` back-reference-operator with `\` to escape references capturing group `\1`
                          |NOTE: Regex matches 2 groups of doubles such as `bbbb` or `abba`
                          |
(..)\1                    |`(..)` same as match-any-character group 1
                          |`\1`  back-references group `\1`
                          |NOTE: Requires `\` escape. Regex finds 1 group of quadruples
                          |
"(.).\\1.\\1"             |. Works as wild cards outside of `( )`. Regex searches for a repeated character
                          |3 times and seperated by any character
                          |
"(.)(.)(.).*\\3\\2\\1"    |`(.)(.)(.)` 3 grouping
                          |. wild card for string beginning with group
                          |* 0 to ∞ combinations with these specific regex
                          |`\\3\\2\\1` back-reference 3 seperate groups
                          |This expression finds 3 seperate groups of matching characters

ANSWER

# Regex example (.)\1\1
str_view(example,"(.)\\1\\1",match = T)
# Regex example "(.)(.)\\2\\1"
str_view(example,"(.)(.)\\2\\1",match = T)
# Regex example (..)\1
str_view(example,"(..)\\1",match = T)
# Regex example  "(.).\\1.\\1" 
str_view(example,"(.).\\1.\\1",match = T)
# Regex example  "(.)(.)(.).*\\3\\2\\1"
str_view(example,"(.)(.)(.).*\\3\\2\\1",match = T)

Question 4

ANSWER

# Start and end with the same character.
str_view(example,"^(.).*\\1$",match = T)
# Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
str_view(example, "(..).*\\1",match = T)
#Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
str_view(example,"(.).*\\1.*\\1",match = T)