Week 3 Assignment

Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to the problems below. You may work in a small group, but please submit separately with names of all group participants in your submission.

#1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

library(tidyverse)
library(gt)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The code below reads the majors list and keeps the majors whose names contain either “DATA” or “STATISTICS”.

# Read the majors list from GitHub
major <- read.csv(
  "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv",
  stringsAsFactors = FALSE, header = TRUE
)
# Keep the rows whose Major contains "DATA" or "STATISTICS"
data_m <- major %>% filter(str_detect(Major, "DATA|STATISTICS"))

gt_data <- gt(data_m)
# Add a title and subtitle to the table
gt_data |>
  tab_header(
    title = "The Data Science and Technology Majors",
    subtitle = "The Only DATA and STATISTICS Majors"
  )

The Data Science and Technology Majors
The Only DATA and STATISTICS Majors
FOD1P	Major	Major_Category
6212	MANAGEMENT INFORMATION SYSTEMS AND STATISTICS	Business
2101	COMPUTER PROGRAMMING AND DATA PROCESSING	Computers & Mathematics
3702	STATISTICS AND DECISION SCIENCE	Computers & Mathematics

# Show the gt Table
gt_data

FOD1P	Major	Major_Category
6212	MANAGEMENT INFORMATION SYSTEMS AND STATISTICS	Business
2101	COMPUTER PROGRAMMING AND DATA PROCESSING	Computers & Mathematics
3702	STATISTICS AND DECISION SCIENCE	Computers & Mathematics
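
The same filter can be checked offline without the tidyverse. The sketch below is only an illustration: `majors_sample` is a made-up four-row stand-in for the downloaded file, and base R's `grepl()` takes the place of `str_detect()`.

```r
# Hypothetical stand-in for the downloaded majors list (not the real file)
majors_sample <- data.frame(
  Major = c("GENERAL AGRICULTURE",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "ACCOUNTING"),
  stringsAsFactors = FALSE
)

# "DATA|STATISTICS" is an alternation: a row is kept if either word appears
is_hit <- grepl("DATA|STATISTICS", majors_sample$Major)
hits <- majors_sample[is_hit, , drop = FALSE]
hits
```

Because the pattern is case-sensitive, it relies on the majors being stored in upper case, as they are in the output above.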

#2. Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”

Into a format like this: c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

str <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

str

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
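
If the starting point is the printed text rather than an existing vector, the transformation can also be done in code. This is a minimal sketch in base R, assuming the printed output has been pasted into a single string (here called `raw`, a name chosen for illustration) with straight double quotes.

```r
# The printed output, pasted as one string (assumed input)
raw <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Extract every run of characters enclosed in double quotes
quoted <- regmatches(raw, gregexpr('"[^"]+"', raw))[[1]]

# Drop the surrounding quotes to get a plain character vector
fruits <- gsub('"', "", quoted, fixed = TRUE)

# Rebuild the c(...) call as text
as_call <- paste0("c(", paste0('"', fruits, '"', collapse = ", "), ")")
cat(as_call, "\n")
```

The index markers such as `[1]` are ignored because only quoted text is extracted.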

#3. Describe, in words, what these expressions will match:

(.)\1\1
“(.)(.)\2\1”
(..)\1
“(.).\1.\1”
“(.)(.)(.).*\3\2\1”

abc <- c("abc\1", "a\1", "abc","z\001\001","Z\1\1","b\1\1","aaaa","aabbbbbcccc","dd")
v <- "(.)\1\1"
v

## [1] "(.)\001\001"

str_detect(abc,v)

## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE

str_match(abc,v)

##       [,1]        [,2]
##  [1,] NA          NA  
##  [2,] NA          NA  
##  [3,] NA          NA  
##  [4,] "z\001\001" "z" 
##  [5,] "Z\001\001" "Z" 
##  [6,] "b\001\001" "b" 
##  [7,] NA          NA  
##  [8,] NA          NA  
##  [9,] NA          NA

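The backreference \1 (backslash one) refers to the first capturing group and matches exactly the same text that the group matched. Inside an R string the backslash has to be escaped, so the regular expression (.)\1\1 is written “(.)\\1\\1”. Typed with single backslashes, as in v above, R reads \1 as the control character \001 instead of a backreference.

As typed, “(.)\1\1” therefore matches any single character followed by two \001 characters, such as “z\001\001”. The regular expression (.)\1\1 itself matches the same character three times in a row, such as “aaa”.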

According to the R for Data Science book, the first way to use a capturing group is to refer back to it within a match with a backreference: \1 refers to the match contained in the first parenthesis, \2 to the match in the second parenthesis, and so on.
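
To see the effect of escaping, the sketch below compares the two spellings of the first expression. It uses base R's `grepl()` with `perl = TRUE` rather than stringr, and the test strings are made up for illustration.

```r
# Doubled backslashes: the regex engine receives the backreference \1
triple <- "(.)\\1\\1"
samples <- c("aaa", "xbbbx", "aab", "abc")
has_triple <- grepl(triple, samples, perl = TRUE)
has_triple

# Single backslashes: R stores the control character \001 in the string
n_single <- nchar("(.)\1\1")  # the characters ( . ) \001 \001
n_double <- nchar(triple)     # the characters ( . ) \ 1 \ 1
c(n_single, n_double)
```

Only the doubled form asks for one character repeated three times in a row.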

d <- "(.)(.)\\2\\1"
str_detect(abc,d)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE

str_match(abc,d)

##       [,1]   [,2] [,3]
##  [1,] NA     NA   NA  
##  [2,] NA     NA   NA  
##  [3,] NA     NA   NA  
##  [4,] NA     NA   NA  
##  [5,] NA     NA   NA  
##  [6,] NA     NA   NA  
##  [7,] "aaaa" "a"  "a" 
##  [8,] "bbbb" "b"  "b" 
##  [9,] NA     NA   NA

“(.)(.)\2\1” matches a pair of characters followed by the same pair in reverse order, as in “abba”. Four identical characters are a special case of this, which is why only “aaaa” and the “bbbb” inside “aabbbbbcccc” match in the test vector above.
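
A short check of the mirrored-pair reading, sketched with base R and made-up test strings; `regmatches()` returns only the strings that matched.

```r
mirror <- "(.)(.)\\2\\1"
samples <- c("abba", "peppers", "aaaa", "abab")

# Extract the matched substring from each string that contains one
hits <- regmatches(samples, regexpr(mirror, samples, perl = TRUE))
hits
```

“abab” is left out because its second pair repeats the first in the same order instead of reversing it.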

j <- c("(..)\1")
str_detect(abc,j)

## [1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE

str_match(abc,j)

##       [,1]        [,2]   
##  [1,] "bc\001"    "bc"   
##  [2,] NA          NA     
##  [3,] NA          NA     
##  [4,] "z\001\001" "z\001"
##  [5,] "Z\001\001" "Z\001"
##  [6,] "b\001\001" "b\001"
##  [7,] NA          NA     
##  [8,] NA          NA     
##  [9,] NA          NA

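As typed, with a single backslash, “(..)\1” matches any two characters followed by the control character \001, and the group captures the two characters immediately before it. For example, in “abc\1” the group captures “bc”, and in “acgdeftstwrhyg9.\1” it captures “9.”. Written with an escaped backslash, “(..)\\1”, the expression matches a pair of characters that is immediately repeated, such as “abab”.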
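
For comparison, here is a sketch of the escaped form of this expression, using base R and test words chosen for illustration.

```r
pair_twice <- "(..)\\1"
samples <- c("banana", "coconut", "abc")

# Keep the matched substring for each word that contains a doubled pair
hits <- regmatches(samples, regexpr(pair_twice, samples, perl = TRUE))
hits
```

In “banana” the pair “an” is followed directly by “an”, and in “coconut” the pair “co” is followed directly by “co”.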

c_l <- c("cdcacdabbb11","dgdgdfg","abacgwabda","trtrtrtr")
p <- c("(.).\\1.\\1")
str_detect(c_l,p)

## [1]  TRUE  TRUE FALSE  TRUE

str_match(c_l,p)

##      [,1]    [,2]
## [1,] "cdcac" "c" 
## [2,] "dgdgd" "d" 
## [3,] NA      NA  
## [4,] "trtrt" "t"

“(.).\1.\1” matches five characters in which the first, third, and fifth are the same and the second and fourth can be anything, as in “abaca”. For example, in “cdcacdabbb11” it matches “cdcac” (capturing “c”), and in “trtrtrtr” it matches “trtrt” (capturing “t”).
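
The same pattern can be tried on ordinary words. This is a sketch with base R and a made-up test vector.

```r
every_other <- "(.).\\1.\\1"
samples <- c("abaca", "banana", "abcde")

# Matched substrings only; words without a match are dropped
hits <- regmatches(samples, regexpr(every_other, samples, perl = TRUE))
hits
```

In “banana” the match starts at the first “a”, since “a” then sits in the first, third, and fifth positions of “anana”.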

ch <- c("bcdbcbdbcbd","cdcacdabbb11","dfdfhjdfh","spsdlkjspsd","000550050005")
l <- c("(.)(.)(.).*\\3\\2\\1")
str_detect(ch,l)

## [1]  TRUE FALSE FALSE  TRUE  TRUE

str_match(ch,l)

##      [,1]          [,2] [,3] [,4]
## [1,] "dbcbdbcbd"   "d"  "b"  "c" 
## [2,] NA            NA   NA   NA  
## [3,] NA            NA   NA   NA  
## [4,] "spsdlkjsps"  "s"  "p"  "s" 
## [5,] "00055005000" "0"  "0"  "0"

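“(.)(.)(.).*\3\2\1” matches three characters, then any number of characters (possibly none), then the same three characters in reverse order, as in “abc123cba”. For example, in “bcdbcbdbcbd” it matches “dbcbdbcbd” (capturing “d”, “b”, “c”), and in “spsdlkjspsd” it matches “spsdlkjsps” (capturing “s”, “p”, “s”).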
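
A small check of that reading, sketched with base R and made-up strings.

```r
reversed_triple <- "(.)(.)(.).*\\3\\2\\1"
samples <- c("abccba", "abc123cba", "abcabc")

# TRUE where three characters reappear later in reverse order
found <- grepl(reversed_triple, samples, perl = TRUE)
found
```

“abcabc” does not match because the three characters come back in the same order, not reversed.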

#4. Construct regular expressions to match words that:

Start and end with the same character.
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

words <- c("alababa", "cardiac", "chaotic", "clementine", "blueberry", "guava", "jujube")
# Show only the words whose first and last characters are the same
str_view(words, "^(.).*\\1$", match = TRUE)

## [1] │ <alababa>
## [2] │ <cardiac>
## [3] │ <chaotic>
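
The pattern above needs at least two characters, so a one-letter word such as “a” is not matched even though it starts and ends with the same character. Below is a minimal sketch of one way to cover that case, assuming base R's `grepl()` with `perl = TRUE` and a made-up test vector; the optional group is an addition for illustration, not part of the original answer.

```r
# Original pattern: the first character must reappear as the last one
two_plus <- "^(.).*\\1$"
# Variant: everything after the first character is optional
any_length <- "^(.)(.*\\1)?$"

samples <- c("a", "abba", "cardiac", "ab")
res_two_plus <- grepl(two_plus, samples, perl = TRUE)
res_any_length <- grepl(any_length, samples, perl = TRUE)
rbind(res_two_plus, res_any_length)
```

The two patterns differ only on the single-character word.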

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

# Capture a pair of characters and require the same pair again later in the word
v <- "(..).*\\1"
str_detect(words, v)

## [1]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE

str_view(words, v, match = TRUE)

## [1] │ al<abab>a
## [7] │ <juju>be

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

# fruit is the character vector of fruit names that ships with stringr
k <- "(.).*\\1.*\\1"
str_match(fruit, k)

##       [,1]          [,2]
##  [1,] NA            NA  
##  [2,] NA            NA  
##  [3,] NA            NA  
##  [4,] "anana"       "a" 
##  [5,] "ell peppe"   "e" 
##  [6,] NA            NA  
##  [7,] NA            NA  
##  [8,] NA            NA  
##  [9,] "ood o"       "o" 
## [10,] NA            NA  
## [11,] NA            NA  
## [12,] NA            NA  
## [13,] NA            NA  
## [14,] NA            NA  
## [15,] NA            NA  
## [16,] NA            NA  
## [17,] "pepp"        "p" 
## [18,] "ementine"    "e" 
## [19,] NA            NA  
## [20,] NA            NA  
## [21,] "ranberr"     "r" 
## [22,] NA            NA  
## [23,] NA            NA  
## [24,] NA            NA  
## [25,] NA            NA  
## [26,] NA            NA  
## [27,] NA            NA  
## [28,] NA            NA  
## [29,] "elderbe"     "e" 
## [30,] NA            NA  
## [31,] NA            NA  
## [32,] NA            NA  
## [33,] NA            NA  
## [34,] NA            NA  
## [35,] NA            NA  
## [36,] NA            NA  
## [37,] NA            NA  
## [38,] NA            NA  
## [39,] NA            NA  
## [40,] NA            NA  
## [41,] NA            NA  
## [42,] "iwi frui"    "i" 
## [43,] NA            NA  
## [44,] NA            NA  
## [45,] NA            NA  
## [46,] NA            NA  
## [47,] NA            NA  
## [48,] NA            NA  
## [49,] NA            NA  
## [50,] NA            NA  
## [51,] NA            NA  
## [52,] NA            NA  
## [53,] NA            NA  
## [54,] NA            NA  
## [55,] NA            NA  
## [56,] "apaya"       "a" 
## [57,] NA            NA  
## [58,] NA            NA  
## [59,] NA            NA  
## [60,] NA            NA  
## [61,] NA            NA  
## [62,] "pineapp"     "p" 
## [63,] NA            NA  
## [64,] NA            NA  
## [65,] NA            NA  
## [66,] "e mangostee" "e" 
## [67,] NA            NA  
## [68,] NA            NA  
## [69,] NA            NA  
## [70,] "raspberr"    "r" 
## [71,] "redcurr"     "r" 
## [72,] NA            NA  
## [73,] NA            NA  
## [74,] NA            NA  
## [75,] NA            NA  
## [76,] "rawberr"     "r" 
## [77,] NA            NA  
## [78,] NA            NA  
## [79,] NA            NA  
## [80,] NA            NA
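
The regex result can be cross-checked by counting letters directly. The helper below, `has_triple_letter()`, is a hypothetical name introduced for this sketch. It uses base R only and, unlike `(.)`, ignores spaces and other non-letters.

```r
# TRUE if any single letter occurs at least three times in the word
has_triple_letter <- function(word) {
  # Keep lower-case letters only, then split into single characters
  chars <- strsplit(gsub("[^a-z]", "", tolower(word)), "")[[1]]
  length(chars) > 0 && max(table(chars)) >= 3
}

samples <- c("eleven", "banana", "apple", "")
checked <- vapply(samples, has_triple_letter, logical(1), USE.NAMES = FALSE)
checked
```

“apple” fails the check because its most frequent letter, “p”, appears only twice.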

Warner Alexis