Data607 Assignment for Week 3

#1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset, provide code that identifies the majors that contain either “DATA” or “STATISTICS”

majors <- read.csv(url('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv'), stringsAsFactors = FALSE)
str(majors)

## 'data.frame':    174 obs. of  3 variables:
##  $ FOD1P         : chr  "1100" "1101" "1102" "1103" ...
##  $ Major         : chr  "GENERAL AGRICULTURE" "AGRICULTURE PRODUCTION AND MANAGEMENT" "AGRICULTURAL ECONOMICS" "ANIMAL SCIENCES" ...
##  $ Major_Category: chr  "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" ...

majors$Major[grepl('DATA', majors$Major)]

## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"

majors$Major[grepl('STATISTICS', majors$Major)]

## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "STATISTICS AND DECISION SCIENCE"
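
Both searches can also be folded into one call by using alternation in the pattern. The following is a minimal sketch, not part of the original answer; `major_names` is a hypothetical stand-in for `majors$Major`, filled with a few values copied from the output above so the example runs without downloading the file.

```r
# Hypothetical stand-in for majors$Major (a handful of rows from the output above)
major_names <- c(
  "GENERAL AGRICULTURE",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "STATISTICS AND DECISION SCIENCE",
  "ANIMAL SCIENCES"
)

# "DATA|STATISTICS" matches a major containing either word
data_or_stats <- major_names[grepl("DATA|STATISTICS", major_names)]
print(data_or_stats)
```

With the full dataset, the same pattern applied to `majors$Major` should return the three majors found by the two separate calls.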

#2 Write code that transforms the data below:

[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

blob <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"

[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  

[9] "elderberry"   "lime"         "lychee"       "mulberry"    

[13] "olive"        "salal berry"'

library(stringr)

foods <- str_extract_all(blob, '[a-z]+\\s[a-z]+|[a-z]+')
unlist(foods)

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
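
The output above is a character vector, while the target format is the literal `c(...)` text. One way to produce that text is sketched below using base R only (`regmatches`, `gregexpr`, `paste0`) instead of stringr. The names `quoted`, `foods_vec`, and `foods_literal` are made up for this illustration, and the blob is retyped without blank lines.

```r
blob <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Pull out every double-quoted item, then strip the surrounding quotes
quoted <- regmatches(blob, gregexpr('"[^"]+"', blob))[[1]]
foods_vec <- gsub('"', '', quoted)

# Rebuild the literal c("...", "...") text from the clean vector
foods_literal <- paste0('c(', paste0('"', foods_vec, '"', collapse = ', '), ')')
cat(foods_literal, "\n")
```

Matching on the quotes rather than on `[a-z]+` avoids having to guess how many words an item contains.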

#3 Describe, in words, what these expressions will match:

(.)\1\1

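As a regular expression, this matches any character (other than a newline) repeated three times in a row, such as ‘aaa’. Typed into an R string without doubling the backslashes, however, \1 is read as an octal escape for the control character \001, so the pattern would instead match any character followed by two \001 characters.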

"(.)(.)\\2\\1"

This matches a four-character palindrome such as ‘abba’: one character, a second character, the second character again, then the first character again. The quotes only delimit the R string and are not part of the pattern.

(..)\1

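As a regular expression, this matches any two characters followed by the same two characters, such as ‘abab’. Without doubled backslashes in an R string, \1 becomes the control character \001, so the pattern would match any two characters followed by \001.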

"(.).\\1.\\1"

This matches a five-character sequence whose first, third, and fifth characters are the same, such as ‘abaca’. The second and fourth characters can be anything, including the repeated character itself. The quotes only delimit the R string and are not part of the pattern.

"(.)(.)(.).*\\3\\2\\1"

This matches any three characters, followed by zero or more characters, followed by the same three characters in reverse order, such as ‘abccba’ or ‘abcxyzcba’. The quotes only delimit the R string and are not part of the pattern.
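
These readings can be checked quickly in base R. The sketch below writes every pattern as an R string with doubled backslashes and tests it against one string that should match and one that should not. The pattern labels and test strings are made up for illustration, and `perl = TRUE` is an assumption chosen so that backreferences are handled by PCRE.

```r
# Each pattern is an R string, so every backslash is doubled
patterns <- c(
  triple      = "(.)\\1\\1",
  palindrome4 = "(.)(.)\\2\\1",
  pair_twice  = "(..)\\1",
  every_other = "(.).\\1.\\1",
  mirrored    = "(.)(.)(.).*\\3\\2\\1"
)

# One string expected to match and one expected to fail, per pattern
hits   <- c("aaa", "abba", "abab", "abaca", "abcxyzcba")
misses <- c("aab", "abab", "abba", "abcde", "abcabc")

matched_hits   <- mapply(grepl, patterns, hits,   MoreArgs = list(perl = TRUE))
matched_misses <- mapply(grepl, patterns, misses, MoreArgs = list(perl = TRUE))
print(rbind(matched_hits, matched_misses))

# With a single backslash, R parses "\1" as the octal escape for \001
print(identical("\1", "\001"))
```

The last line shows why the single-backslash forms behave differently when typed as R strings: the backreference never reaches the regex engine.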

#4 Construct regular expressions to match words that:

Start and end with the same character.

^([a-z])([a-z]*\\1)?$

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

([a-z]{2})[a-z]*\\1

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

[a-z]*([a-z])[a-z]*\\1[a-z]*\\1[a-z]*
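
A short check of the three conditions on a few sample words is sketched below. The word list is made up for illustration, lowercase input is assumed, and `perl = TRUE` is an assumption so that backreferences are handled by PCRE. The first pattern is an anchored variant, so the whole word has to start and end with the same letter, and a single-letter word counts as a match. The third pattern drops the leading and trailing `[a-z]*`, which do not affect whether a word matches.

```r
words <- c("area", "a", "church", "banana", "eleven", "apple")

# Whole word starts and ends with the same letter (single letters included)
same_ends <- words[grepl("^([a-z])([a-z]*\\1)?$", words, perl = TRUE)]

# Some pair of letters appears twice
repeated_pair <- words[grepl("([a-z]{2})[a-z]*\\1", words, perl = TRUE)]

# Some letter appears in at least three places
three_places <- words[grepl("([a-z])[a-z]*\\1[a-z]*\\1", words, perl = TRUE)]

print(same_ends)
print(repeated_pair)
print(three_places)
```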

Leo Yi

2020-02-12
