HW3

Problem 1

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)

download.file('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv', 'data.txt')

df <- read.csv('data.txt')

data_statistics <- df %>% filter(str_detect(Major, 'DATA|STATISTICS'))

data_statistics

##   FOD1P                                         Major          Major_Category
## 1  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 2  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics

Problem 2

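# Input: the printed form of a character vector, stored as a single string
test <- "[1] \"bell pepper\" \"bilberry\" \"blackberry\""
# Extract each quoted item, keeping its surrounding quote characters
res <- str_extract_all(test, '"[a-z ]+"')
# No extracted item contains the '" "' separator, so this split leaves the items unchanged
res <- str_split(res[[1]], '" "')
# Flatten the list into a character vector
final <- unlist(res, use.names = FALSE)
final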

## [1] "\"bell pepper\"" "\"bilberry\""    "\"blackberry\""

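This looks a little funky because print() shows each embedded double quote escaped as \". The backslashes are only part of the printed representation; writing the strings with writeLines() or cat() shows them as "bell pepper", "bilberry", and "blackberry". Note that the double quotes themselves are still part of each string.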
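
If a plain character vector without the embedded quote characters is wanted, one option is to capture only the text between the quotes. The following is a minimal sketch under that assumption, not the original solution; the names `quoted_text`, `matches`, and `fruits` are illustrative.

```r
library(stringr)

# Same input as above: the printed form of a character vector, stored as one string
quoted_text <- "[1] \"bell pepper\" \"bilberry\" \"blackberry\""

# str_match_all() returns one matrix per input string; column 1 holds the full
# match (with quotes) and column 2 holds the first capture group (without them)
matches <- str_match_all(quoted_text, '"([a-z ]+)"')[[1]]
fruits <- matches[, 2]

fruits              # print() shows the vector with R's usual quoting
writeLines(fruits)  # writeLines() shows the raw contents, one item per line
```

Because the quote characters sit outside the capture group, they are consumed by the match but left out of `fruits`.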

Problem 3

test_cases <- c('aaa', 'abba', 'abab', 'aaaa', 'abaca', 'abccba', 'acdc', 'acdddddca')

a <- '(.)\\1\\1' ## backslashes are doubled because \ must itself be escaped inside an R string
b <- "(.)(.)\\2\\1"
c <- '(..)\\1'   ## again, the backslash is doubled for the R string
d <- "(.).\\1.\\1"
e <- "(.)(.)(.).*\\3\\2\\1"

str_view_all(test_cases, e)
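
One quick way to check which pattern the regex engine actually receives is to write the string out with writeLines(), which shows the contents without R's escaping. A minimal sketch using the first pattern; the name `pattern_a` is illustrative.

```r
# The pattern as typed in R source, with each backslash doubled
pattern_a <- '(.)\\1\\1'

# writeLines() writes the raw contents, so each backslash appears once
writeLines(pattern_a)

# Each doubled backslash in the source is a single character in the string
nchar(pattern_a)
```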

(.)\1\1 will match any single character repeated 3 times in a row, such as aaa.

(.)(.)\2\1 will match two characters followed by the same two characters in reverse order, such as abba.

(..)\1 will match any two characters immediately repeated, such as abab.

(.).\1.\1 will match a character, then any character, then the first character again, then any character, then the first character once more, such as abaca.

(.)(.)(.).*\3\2\1 will match three characters, then zero or more of any characters, then the same three characters in reverse order, such as abccba or abcddddcba. The pattern is not anchored, so the match can occur anywhere in the string.

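The descriptions above can be checked against the test cases with str_detect(), which returns TRUE for every string that contains a match. This is a minimal sketch; the names `patterns` and `matched` are illustrative and not part of the original solution.

```r
library(stringr)

test_cases <- c('aaa', 'abba', 'abab', 'aaaa', 'abaca', 'abccba', 'acdc', 'acdddddca')

# The five patterns from this problem, as R strings (backslashes doubled)
patterns <- c(
  a = '(.)\\1\\1',
  b = '(.)(.)\\2\\1',
  c = '(..)\\1',
  d = '(.).\\1.\\1',
  e = '(.)(.)(.).*\\3\\2\\1'
)

# For each pattern, keep only the test cases that contain a match
matched <- lapply(patterns, function(p) test_cases[str_detect(test_cases, p)])
matched
```

Listing the matches side by side makes it easier to spot cases such as aaaa, which satisfies the second and third patterns because both wildcards can capture the same letter.
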
Problem 4

test_cases <- c('arma', 'alabama', 'church', 'chinchilla', 'mississippi', 'eleven')
a <- '^(.).*\\1$'
b <- '.*(..).*\\1.*'
c <- '.*(.).*\\1.*\\1.*'

str_view_all(test_cases, c)

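^(.).*\1$ matches a word that starts and ends with the same character (this makes the assumption that the word is at least 2 letters long).

.* (..).* \1.* matches a word that contains a repeated pair of letters, such as church.

.* (.).* \1.* \1.* matches a word that contains the same letter in at least three places, such as eleven.

Please note that the added spaces in these regexes are not part of the regular expressions. I had to add them, or else R Markdown would treat the *’s as formatting rather than as part of the expression.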
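
The three patterns can be checked against the test words in one pass with str_detect(). This is a minimal sketch; the names `words`, `word_patterns`, and `word_matches` are illustrative and not part of the original solution.

```r
library(stringr)

words <- c('arma', 'alabama', 'church', 'chinchilla', 'mississippi', 'eleven')

# The three patterns from this problem, written without the extra spaces
word_patterns <- c(
  a = '^(.).*\\1$',
  b = '.*(..).*\\1.*',
  c = '.*(.).*\\1.*\\1.*'
)

# For each pattern, keep only the words that contain a match
word_matches <- lapply(word_patterns, function(p) words[str_detect(words, p)])
word_matches
```

Since str_detect() looks for a match anywhere in the string, the leading and trailing .* in the second and third patterns do not change which words are detected; they only widen what str_view_all() highlights.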