1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset, provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

[https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/]

library(readr)
library(RCurl)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
majorsList <- read.csv(url('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv'), stringsAsFactors = FALSE)
str(majorsList)
## 'data.frame':    174 obs. of  3 variables:
##  $ FOD1P         : chr  "1100" "1101" "1102" "1103" ...
##  $ Major         : chr  "GENERAL AGRICULTURE" "AGRICULTURE PRODUCTION AND MANAGEMENT" "AGRICULTURAL ECONOMICS" "ANIMAL SCIENCES" ...
##  $ Major_Category: chr  "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" ...
datamajor <- majorsList$Major[grepl("DATA", majorsList$Major)]; datamajor
## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"
statsmajor <- majorsList$Major[grepl("STATISTICS", majorsList$Major)]; statsmajor
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "STATISTICS AND DECISION SCIENCE"
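
Since the question asks for majors containing either word, the two searches can also be combined into a single pattern with the alternation operator `|`. The following is a minimal sketch; the short hand-typed `majors` vector is an assumption standing in for `majorsList$Major`, so the example runs without downloading the file.

```r
# Hypothetical sample of major names; the full list comes from majors-list.csv
majors <- c("GENERAL AGRICULTURE",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "STATISTICS AND DECISION SCIENCE",
            "ANIMAL SCIENCES")

# "DATA|STATISTICS" matches a name containing either word;
# value = TRUE returns the matching names instead of their positions
matches <- grep("DATA|STATISTICS", majors, value = TRUE)
print(matches)
```

With the real data, `grep("DATA|STATISTICS", majorsList$Major, value = TRUE)` should return the same three majors found by the two separate searches above.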

2. Write code that transforms the data below:

[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ forcats 0.5.0
## ✓ tidyr   1.1.2
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::complete() masks RCurl::complete()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()
## [[1]]
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
## [1] "c(bell pepper, bilberry, blackberry, blood orange, blueberry, cantaloupe, chili pepper, cloudberry, elderberry, lime, lychee, mulberry, olive, salal berry)"
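
The final string printed above drops the quotation marks around each item, so it does not quite reach the target format. One possible approach is sketched below in base R. It assumes the printed vector is available as a single string (the hypothetical variable `raw`), pulls out each quoted item with its quotes intact, and joins the items with commas.

```r
# Assumption: the printed vector is held in one string, line breaks included
raw <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Extract every double-quoted item, keeping the surrounding quotes
items <- regmatches(raw, gregexpr('"[^"]+"', raw))[[1]]

# Join the quoted items with ", " and wrap the result in c( ... )
result <- paste0("c(", paste(items, collapse = ", "), ")")
writeLines(result)
```

Because the quotes are kept during extraction, the resulting string is itself valid R code that rebuilds the character vector.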

3. Describe, in words, what these expressions will match:

(.)\1\1 - Read as a regular expression, this matches any single character repeated three times in a row (for example "aaa"): the period captures one character and each \1 matches that same character again. Typed into R exactly as written, though, it does not work. Without enclosing quotes it is not a string at all, so R throws an error; because that error stopped the document from knitting, I commented out the code below. With quotes but only single backslashes, R reads "\1" as an escape sequence for a control character rather than as a backreference, so the pattern matches nothing in ordinary text. The backslashes have to be doubled, as in "(.)\\1\\1", for the backreference to reach the regex engine.

library(tidyverse)
library(stringr)

a <- c("aaa", "bbbb", "adadad", "ghfdaert")
#part31 <- str_extract(a, (.)\1\1); part31
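
For comparison, here is a minimal sketch of the quoted, double-backslash form of the same expression, which does run. The variable name `triple` is illustrative.

```r
library(stringr)

a <- c("aaa", "bbbb", "adadad", "ghfdaert")

# "(.)\\1\\1": one captured character followed by two more copies of it
triple <- str_extract(a, "(.)\\1\\1")
print(triple)
# "aaa" "bbb" NA    NA

# With a single backslash, R turns "\1" into the control character with code 1
utf8ToInt("\1")
# 1
```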

“(.)(.)\2\1” - This signifies a pair of characters followed by the same pair in reverse order.

library(tidyverse)
library(stringr)

b <- c("abba", "baooab", "garrison", "aabbcdeffgg")
part32 <- str_extract(b, "(.)(.)\\2\\1"); part32
## [1] "abba" "aooa" NA     NA

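(..)\1 - Read as a regular expression, this matches any two characters immediately followed by the same two characters (for example "cfcf"). As with the first expression, I initially read it as if it were enclosed in double quotes with a doubled backslash. Typed without quotes it produces an error, so I have commented out the code below. With quotes but only a single backslash it returns no matches, because R treats "\1" as an escape sequence rather than a backreference.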

library(tidyverse)
library(stringr)

c <- c("kitchen", "banana", "ccfcfcff", "aabbcdeffgg")
#part33 <- str_extract(c, (..)\1); part33
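
A runnable version of this expression, sketched with the quotes and doubled backslash added. The names `words33` and `pair_twice` are illustrative; the vector is renamed so it does not share a name with the base function `c()`.

```r
library(stringr)

words33 <- c("kitchen", "banana", "ccfcfcff", "aabbcdeffgg")

# "(..)\\1": any two characters immediately followed by the same two characters
pair_twice <- str_extract(words33, "(..)\\1")
print(pair_twice)
# NA     "anan" "cfcf" NA
```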

“(.).\1.\1” - A character, followed by any character, followed by the original character (\1), followed by any character, and ending with the original character again. The two intervening characters are unconstrained, so they may or may not be the same as the captured one.

library(tidyverse)
library(stringr)

d <- c("kitchen", "banana", "malayalam", "aabbcdeffgg")
part34 <- str_extract(d, "(.).\\1.\\1"); part34
## [1] NA      "anana" "alaya" NA

"(.)(.)(.).*\3\2\1" - The .* signifies any character repeated 0 or more times, so this expression matches any three characters, followed by zero or more characters of any kind, followed by the original three characters in reverse order.

library(tidyverse)
library(stringr)

e <- c("kitchen", "abcdcba", "malayalam", "bcddcb")
part35 <- str_extract(e, "(.)(.)(.).*\\3\\2\\1"); part35
## [1] NA          "abcdcba"   "malayalam" "bcddcb"
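
To see what the `.*` in the code above contributes, the sketch below compares it with a variant that allows exactly one character in the middle. The variant pattern and the names `any_middle` and `one_middle` are illustrative and not part of the original exercise.

```r
library(stringr)

e <- c("kitchen", "abcdcba", "malayalam", "bcddcb")

# Zero or more characters between the three captured characters and their reverse
any_middle <- str_extract(e, "(.)(.)(.).*\\3\\2\\1")

# Exactly one character in the middle (illustrative variant)
one_middle <- str_extract(e, "(.)(.)(.).\\3\\2\\1")

print(any_middle)
# NA          "abcdcba"   "malayalam" "bcddcb"
print(one_middle)
# NA        "abcdcba" "alayala" NA
```

The variant no longer matches "bcddcb", which has nothing between "bcd" and "dcb", and in "malayalam" it only finds the shorter inner palindrome.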

4. Construct regular expressions to match words that:

Start and end with the same character.

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice).

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s).

#Part I: Start and end with the same character.

library(tidyverse)
library(stringr)

x <- c("america", "russia", "armenia", "canada")
part1 <- str_extract(x, '^(.).*\\1$'); part1
## [1] "america" NA        "armenia" NA
#Part II: Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)

y <- c("aabbccaa", "abcdefg", "aaa", "abac", "bbbbbb")
part2 <- str_extract(y, '(..).*\\1'); part2
## [1] "aabbccaa" NA         NA         NA         "bbbbbb"
#Part III: Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)

z <- c("bookkeeper", "successful", "tattoo", "dog", "cat")
part3 <- str_extract(z, '([a-z]).*\\1.*\\1'); part3
## [1] "eepe"    "success" "tatt"    NA        NA
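
`str_extract()` returns only the matched portion of each string, which is why Part III prints fragments such as "eepe". If the goal is to list the words themselves, `str_subset()` is one option. The sketch below uses a hypothetical word list and slightly adjusted patterns: the first also accepts single-character words, and the second restricts the pair to lowercase letters rather than any two characters.

```r
library(stringr)

# Hypothetical word list chosen to exercise each pattern
words <- c("a", "america", "church", "eleven", "bookkeeper", "dog")

# Part I: same first and last character; the optional group admits one-letter words
same_ends <- str_subset(words, "^(.)(.*\\1)?$")

# Part II: a pair of lowercase letters that appears again later in the word
repeated_pair <- str_subset(words, "([a-z][a-z]).*\\1")

# Part III: one lowercase letter occurring in at least three places
three_times <- str_subset(words, "([a-z]).*\\1.*\\1")

print(same_ends)      # "a" "america"
print(repeated_pair)  # "church"
print(three_times)    # "eleven" "bookkeeper"
```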