DATA607 - Assignment 3

Question 1

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

# Read the majors CSV into a data frame. read.csv() is base R, so readr is loaded
# here but not strictly needed (its equivalent would be read_csv()).
library(readr, quietly = TRUE)
url <- 'https://github.com/fivethirtyeight/data/raw/master/college-majors/majors-list.csv'
dfMajors <- read.csv(file = url)
head(dfMajors)

##   FOD1P                                 Major                  Major_Category
## 1  1100                   GENERAL AGRICULTURE Agriculture & Natural Resources
## 2  1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3  1102                AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4  1103                       ANIMAL SCIENCES Agriculture & Natural Resources
## 5  1104                          FOOD SCIENCE Agriculture & Natural Resources
## 6  1105            PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources

Filter Major column for “DATA” or “STATISTICS”

# Load tidyverse for the pipe (%>%) and dplyr's filter(); stringr, which is attached
# with the tidyverse, provides str_detect()
library(tidyverse, quietly = TRUE)

## -- Attaching packages -------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v dplyr   1.0.2
## v tibble  3.0.3     v stringr 1.4.0
## v tidyr   1.1.2     v forcats 0.5.0
## v purrr   0.3.4

## -- Conflicts ----------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(stringr, quietly = TRUE)

dfMajors %>%
  filter(str_detect(Major, 'DATA|STATISTICS'))

##   FOD1P                                         Major          Major_Category
## 1  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 2  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics

Question 2

Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

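# Assuming the printed output above is stored as a single string x (with straight quotes),
# extract each quoted item, strip the quotes, and print the vector as a c(...) expression
fruits <- str_remove_all(str_extract_all(x, '"[^"]+"')[[1]], '"')
dput(fruits)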
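As a self-contained check of this kind of transformation, here is a minimal base R sketch. It assumes the printed output was pasted into one string with straight double quotes; the names x, quoted, fruits and as_code are illustrative choices.

```r
# Assumption: the console output is held in a single multi-line string
x <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Pull out every double-quoted item, then drop the surrounding quotes
quoted <- regmatches(x, gregexpr('"[^"]+"', x))[[1]]
fruits <- gsub('"', '', quoted, fixed = TRUE)

# Rebuild the items as the text of a c(...) call
as_code <- paste0('c(', paste0('"', fruits, '"', collapse = ', '), ')')
cat(as_code, "\n")
```

Matching on the quotes rather than on whitespace keeps two-word items such as "bell pepper" intact and ignores the [1], [5], ... index markers.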

Question 3

Describe, in words, what these expressions will match:

(.)\1\1

“(.)(.)\2\1”

(..)\1

“(.).\1.\1”

"(.)(.)(.).*\3\2\1"

# a. Matches any single character that appears three times in a row,
# e.g. "aaa"
a <- str_extract(sentences, '(.)\\1\\1') %>% na.omit()
a[1]

## [1] "???"

# b. Matches any two characters immediately followed by the same two characters
# in reverse order, e.g. "abba"
b <- str_extract(sentences, '(.)(.)\\2\\1') %>% na.omit()
b[1]

## [1] "ollo"

# c. Matches any two characters immediately followed by the same two characters
# in the same order, e.g. "abab"
c <- str_extract(sentences, '(..)\\1') %>% na.omit()
c[1]

## [1] " s s"

# d. Matches a character that occurs three times with exactly one arbitrary
# character between each occurrence, e.g. "abaca"
d <- str_extract(sentences, '(.).\\1.\\1') %>% na.omit()
d[1]

## [1] "e eve"

# e. Matches any three characters, followed by zero or more of any character,
# followed by the same three characters in reverse order, e.g. "abc...cba"
e <- str_extract(sentences, '(.)(.)(.).*\\3\\2\\1') %>% na.omit()
e[1]

## [1] " a chicken leg is a "
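The five descriptions can also be checked on short hand-made strings instead of the sentences data. This is a minimal base R sketch: the example strings and the helper first_match() are hypothetical, and perl = TRUE is used so that backreferences follow PCRE rules.

```r
# Patterns a to e from the question, written as R strings (backslashes doubled)
patterns <- c(
  a = "(.)\\1\\1",
  b = "(.)(.)\\2\\1",
  c = "(..)\\1",
  d = "(.).\\1.\\1",
  e = "(.)(.)(.).*\\3\\2\\1"
)

# One hand-made example string per pattern (illustrative only)
examples <- c(a = "zzz top", b = "abba", c = "cocoa",
              d = "banana", e = "abc-xyz-cba")

# Hypothetical helper: return the first match of pattern in text, or NA
first_match <- function(pattern, text) {
  m <- regexpr(pattern, text, perl = TRUE)
  if (m == -1) NA_character_ else regmatches(text, m)
}

result <- mapply(first_match, patterns, examples)
print(result)
```

For "banana" the match is "anana": the captured character is "a", with one arbitrary character ("n") between each of its three occurrences.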

Question 4

Construct regular expressions to match words that:

Start and end with the same character.

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

words <- c('church', 'statstics', 'eleven')
# Start and end with the same character: '^(.).*\\1$'
aa <- str_match_all(words, '^(.).*\\1$')
aa

## [[1]]
##      [,1] [,2]
## 
## [[2]]
##      [,1]        [,2]
## [1,] "statstics" "s" 
## 
## [[3]]
##      [,1] [,2]

# Contain a repeated pair of letters: '(.)(.).*\\1\\2'
bb <- str_match_all(words, '(.)(.).*\\1\\2')
bb

## [[1]]
##      [,1]     [,2] [,3]
## [1,] "church" "c"  "h" 
## 
## [[2]]
##      [,1]     [,2] [,3]
## [1,] "statst" "s"  "t" 
## 
## [[3]]
##      [,1] [,2] [,3]

# Contain one letter repeated in at least three places: '(.).*\\1.*\\1'
cc <- str_match_all(words, '(.).*\\1.*\\1')
cc

## [[1]]
##      [,1] [,2]
## 
## [[2]]
##      [,1]        [,2]
## [1,] "statstics" "s" 
## 
## [[3]]
##      [,1]    [,2]
## [1,] "eleve" "e"
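When only a yes/no answer per word is needed, a logical test is easier to read than the match matrices. The sketch below uses base R grepl() with perl = TRUE on the same three words. The pattern names and the last pattern, same_ends_any_length, are assumptions added for illustration: '^(.).*\\1$' needs at least two characters, so the variant makes the closing repeat optional to accept one-letter words as well.

```r
words <- c("church", "statstics", "eleven")

# The three patterns from this question, with illustrative names
patterns <- c(
  same_ends     = "^(.).*\\1$",
  repeated_pair = "(.)(.).*\\1\\2",
  three_times   = "(.).*\\1.*\\1"
)

# For each pattern, keep the words in which it finds a match
matches <- lapply(patterns, function(p) words[grepl(p, words, perl = TRUE)])
print(matches)

# Hypothetical variant: also accept a single-character word such as "a"
same_ends_any_length <- "^(.)(.*\\1)?$"
edge_cases <- grepl(same_ends_any_length, c("a", "aba", "ab"), perl = TRUE)
print(edge_cases)
```

The patterns use `.`, so they accept any character; replacing `.` with `[A-Za-z]` would restrict them to letters, as the wording of the question suggests.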