1 Problem

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

library(stringr)
library(tidyverse)
library(kableExtra)
library(knitr)
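# stringsAsFactors = TRUE keeps Major a factor under R >= 4.0, so the levels() calls below still work
major_df <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv", stringsAsFactors = TRUE)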
kable(head(major_df))
FOD1P  Major                                  Major_Category
1100   GENERAL AGRICULTURE                    Agriculture & Natural Resources
1101   AGRICULTURE PRODUCTION AND MANAGEMENT  Agriculture & Natural Resources
1102   AGRICULTURAL ECONOMICS                 Agriculture & Natural Resources
1103   ANIMAL SCIENCES                        Agriculture & Natural Resources
1104   FOOD SCIENCE                           Agriculture & Natural Resources
1105   PLANT SCIENCE AND AGRONOMY             Agriculture & Natural Resources
summary(major_df)
##      FOD1P                                       Major    
##  1100   :  1   ACCOUNTING                           :  1  
##  1101   :  1   ACTUARIAL SCIENCE                    :  1  
##  1102   :  1   ADVERTISING AND PUBLIC RELATIONS     :  1  
##  1103   :  1   AEROSPACE ENGINEERING                :  1  
##  1104   :  1   AGRICULTURAL ECONOMICS               :  1  
##  1105   :  1   AGRICULTURE PRODUCTION AND MANAGEMENT:  1  
##  (Other):168   (Other)                              :168  
##                    Major_Category
##  Engineering              :29    
##  Education                :16    
##  Humanities & Liberal Arts:15    
##  Biology & Life Science   :14    
##  Business                 :13    
##  (Other)                  :86    
##  NA's                     : 1
reg_data_stats = str_detect(levels(major_df$Major), regex("DATA|STATISTICS", ignore_case=TRUE))
reg_data_stats
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [166]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
levels(major_df$Major)[reg_data_stats]
## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [2] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [3] "STATISTICS AND DECISION SCIENCE"
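
As a hedged alternative, the same search can be written with str_subset(), which returns the matching names directly and does not depend on Major being a factor. The sketch below runs on a small hypothetical vector, majors, standing in for major_df$Major; it is an illustration, not the code that produced the output above.

```r
library(stringr)

# Hypothetical sample of major names (a stand-in for major_df$Major)
majors <- c("ACCOUNTING",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "STATISTICS AND DECISION SCIENCE",
            "ZOOLOGY")

# as.character() makes the call work for both factor and character input
data_stats_majors <- str_subset(as.character(majors),
                                regex("DATA|STATISTICS", ignore_case = TRUE))
print(data_stats_majors)
```

On the real data, major_df$Major would take the place of majors.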

2 Problem

Write code that transforms the data below into the text of a character vector definition of the form c("bell pepper", "bilberry", ...):

[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"

input_text <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
[9] "elderberry"   "lime"         "lychee"       "mulberry"    
[13] "olive"        "salal berry"'


# Extract each quoted item whole (so "bell pepper" stays one element), then drop the quotes
char_Vector <- str_remove_all(unlist(str_extract_all(input_text, '"[^"]+"')), '"')

# Re-quote every item and join the items with commas
vec_str <- str_c('"', char_Vector, '"', collapse = ", " )

final_text <- str_c('c(', vec_str, ')')

#Final Output text
writeLines(final_text)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
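
One way to sanity-check generated text of this kind is to parse it back into R and compare it with the vector it was built from. The sketch below does that for a small hypothetical vector, fruits; the names fruits, vec_text and round_trip are illustrative and not part of the solution above.

```r
library(stringr)

# Hypothetical subset of the items, including two that contain a space
fruits <- c("bell pepper", "bilberry", "blood orange", "lime")

# Build the text of a c(...) call: quote each item, then join with commas
vec_text <- str_c("c(", str_c('"', fruits, '"', collapse = ", "), ")")
writeLines(vec_text)

# Parse and evaluate the text; a faithful transformation reproduces the input
round_trip <- eval(parse(text = vec_text))
print(identical(round_trip, fruits))
```

If multi-word items had been split during extraction, the round trip would return more elements than the original vector and the comparison would fail.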

3 Problem

Describe, in words, what these expressions will match:

3.1 (.)\1\1

word_ex_char <- c("(.)\1\1", "(.)(.)\\2\\1", "(..)\1", "(.).\\1.\\1", "(.)(.)(.).*\\3\\2\\1")
# A character vector (rather than a list) is the usual input for str_view()
word_exprs <- c("aaa", "abc", "abba", "afada", "ab\1", "a\1\1", "abccba")

str_view(word_exprs, '(.)\1\1')

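As the R string '(.)\1\1' used above, this pattern contains no backreference: R reads \1 inside a string as the control character with code 1, so the pattern matches any character followed by two of those control characters (of the test strings, only “a\1\1”). Written with doubled backslashes, '(.)\\1\\1' is the regular expression (.)\1\1, which matches the same character three times in a row, as in “aaa”.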
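
The difference between a single and a doubled backslash in an R string can be checked directly. This is a small sketch; the variable names are illustrative only.

```r
library(stringr)

# In an R string, "\1" is one control character; "\\1" is a backslash plus the digit 1
len_single <- nchar("\1")
len_double <- nchar("\\1")

# Doubled backslashes reach the regex engine as the backreference \1
triple_char <- str_detect("aaa", "(.)\\1\\1")

# Single backslashes make the pattern look for literal control characters
literal_on_aaa  <- str_detect("aaa", "(.)\1\1")
literal_on_ctrl <- str_detect("a\1\1", "(.)\1\1")

print(c(len_single, len_double))
print(c(triple_char, literal_on_aaa, literal_on_ctrl))
```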

3.2 (.)(.)\2\1

str_view(word_exprs, '(.)(.)\\2\\1')

It matches a pair of characters followed immediately by the same pair in reverse order, as in “abba”.

3.3 (..)\1

str_view(word_exprs, '(..)\1')

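As the R string '(..)\1', this matches any two characters followed by the control character that R reads \1 as (for example “ab\1”). Written with doubled backslashes, '(..)\\1' matches a pair of characters that is immediately repeated, as in “abab”.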

3.4 (.).\1.\1

str_view(word_exprs, '(.).\\1.\\1')

This expression matches a run of five characters in which the first, third, and fifth are the same character and the second and fourth can be anything, as in “afada”.

3.5 (.)(.)(.).*\3\2\1

str_view(word_exprs, '(.)(.)(.).*\\3\\2\\1')

This expression matches three characters, followed by zero or more of any character (excluding line breaks), followed by the same three characters in reverse order, as in “abccba”.
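
To compare the five expressions side by side, the sketch below applies each one, written with doubled backslashes so that the backreferences reach the regex engine, to a handful of test strings. The vectors patterns and samples and the matrix hits are illustrative names, and “abab” is an extra test string added here.

```r
library(stringr)

# The five expressions as R strings with doubled backslashes
patterns <- c("(.)\\1\\1", "(.)(.)\\2\\1", "(..)\\1",
              "(.).\\1.\\1", "(.)(.)(.).*\\3\\2\\1")

# Test strings chosen so that each pattern matches at least one of them
samples <- c("aaa", "abba", "abab", "afada", "abccba", "abc")

# One column per pattern, one row per test string; TRUE marks a match
hits <- sapply(patterns, function(p) str_detect(samples, p))
rownames(hits) <- samples
print(hits)
```

Each column of hits corresponds to one expression, so a row that is FALSE throughout (here “abc”) is a string none of the patterns match.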

4 Problem

Construct regular expressions to match words that

4.1 Start and end with the same character

# A character vector (rather than a list) is the usual input for str_view()
word_1 <- c("blurb", "9Thousand9", "Light", "101DATA101", "MAGMA", "BANANA")
str_view(word_1, "^(.)(.*)\\1$") 
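
The pattern above needs at least two characters, so a one-character word such as “a” is not matched even though it trivially starts and ends with the same character. If such words should count, one possible variant makes the tail optional. This is a sketch; same_ends and words are illustrative names.

```r
library(stringr)

# Hypothetical test words, including a single-character word
words <- c("a", "blurb", "Light", "101DATA101")

# The first character, optionally followed by anything that ends in that character
same_ends <- "^(.)(.*\\1)?$"

matched <- str_subset(words, same_ends)
print(matched)
```

The comparison is case-sensitive, so “Light” is not matched by either version.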

4.2 Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

str_view(word_1, '([A-Za-z][A-Za-z]).*\\1')

4.3 Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

str_view(word_1, '([A-Za-z]).*\\1.*\\1')
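
The two patterns from 4.2 and 4.3 can also be tried on the example words from the problem statements. The sketch below uses a hypothetical vector, examples, and str_subset() to list the words each pattern matches.

```r
library(stringr)

# Example words taken from the problem statements, plus two extras
examples <- c("church", "eleven", "banana", "light")

# 4.2: a pair of letters that appears again later in the word
repeated_pair <- str_subset(examples, "([A-Za-z][A-Za-z]).*\\1")

# 4.3: one letter that appears in at least three places
three_times <- str_subset(examples, "([A-Za-z]).*\\1.*\\1")

print(repeated_pair)
print(three_times)
```

Both patterns are case-sensitive, so a word such as “Eleven” would be matched in 4.3 only through its lowercase letters.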