R Markdown of Week 3 Assn

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

##load stringr

library(stringr)

##import data

majors_list <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/2d2ff3e9457549d51f8e571c52099bfe9b2017ad/college-majors/majors-list.csv")

##look at import

str(majors_list)
## 'data.frame':    174 obs. of  3 variables:
##  $ FOD1P         : chr  "1100" "1101" "1102" "1103" ...
##  $ Major         : chr  "GENERAL AGRICULTURE" "AGRICULTURE PRODUCTION AND MANAGEMENT" "AGRICULTURAL ECONOMICS" "ANIMAL SCIENCES" ...
##  $ Major_Category: chr  "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" ...
summary(majors_list)
##     FOD1P              Major           Major_Category    
##  Length:174         Length:174         Length:174        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

##select majors whose titles contain DATA or STATISTICS

grep(pattern = "DATA|STATISTICS", majors_list$Major, value = TRUE)
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"
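Since stringr is already loaded, the same filter could also be written with `str_subset()`. The following is a minimal sketch, not the assignment's required approach; `majors_sample` is a hypothetical name for a small hand-picked sample copied from the output above, used here so the example runs without the download.

```r
library(stringr)

# Hypothetical sample: majors copied from the output shown above
majors_sample <- c(
  "GENERAL AGRICULTURE",
  "ANIMAL SCIENCES",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE"
)

# str_subset() keeps the elements that match the pattern,
# much like grep(..., value = TRUE)
data_stat_majors <- str_subset(majors_sample, "DATA|STATISTICS")
data_stat_majors

# str_detect() returns a logical vector, handy for filtering data frame rows
has_match <- str_detect(majors_sample, "DATA|STATISTICS")
has_match
```

With the full data, `majors_list[str_detect(majors_list$Major, "DATA|STATISTICS"), ]` would keep the matching rows together with their codes and categories.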

##2 Write code that transforms the data below:

[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

newfruit <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
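The line above retypes the vector by hand. If the starting point is instead the printed console output held as text, one possible transformation is to extract the quoted items and paste them into a `c(...)` call. This is only a sketch under that assumption; `printed`, `quoted`, `fruit_vec`, and `fruit_call` are hypothetical names.

```r
library(stringr)

# Assumption: the printed output is stored as one string with straight quotes
printed <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Pull out every double-quoted item, quotes included
quoted <- str_extract_all(printed, '"[^"]+"')[[1]]

# Strip the quotes to get a plain character vector
fruit_vec <- str_remove_all(quoted, '"')

# Build the text form c("...", "...") from the quoted items
fruit_call <- str_c("c(", str_c(quoted, collapse = ", "), ")")
writeLines(fruit_call)
```

The index markers such as `[1]` and `[5]` are dropped automatically because only quoted text is extracted.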

##describe, in words, what these expressions will match

(.)\1\1 A single character repeated three times in a row (e.g., "aaa")

"(.)(.)\\2\\1" Two characters followed by the same two characters in reverse order (e.g., "abba")

(..)\1 A pair of characters repeated immediately (e.g., "abab")

"(.).\\1.\\1" A character, then any character, then the first character again, then any character, then the first character a third time (e.g., "abaca")

"(.)(.)(.).*\\3\\2\\1" Three characters, then zero or more characters of any kind, then the same three characters in reverse order (e.g., "abcxyzcba")
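These descriptions can be checked by trying each pattern on a few strings. The sketch below uses made-up test strings (`tests` is a hypothetical name); note that inside an R string each backslash is doubled.

```r
library(stringr)

# Made-up test strings for illustration
tests <- c("aaa", "abba", "abab", "abaca", "abcxyzcba", "hello")

r1 <- str_subset(tests, "(.)\\1\\1")             # same character three times in a row
r2 <- str_subset(tests, "(.)(.)\\2\\1")          # two characters, then the pair reversed
r3 <- str_subset(tests, "(..)\\1")               # a pair repeated immediately
r4 <- str_subset(tests, "(.).\\1.\\1")           # first character recurs at every other position
r5 <- str_subset(tests, "(.)(.)(.).*\\3\\2\\1")  # three characters, anything, then reversed

# Show which test strings each pattern picked out
list(r1 = r1, r2 = r2, r3 = r3, r4 = r4, r5 = r5)
```

Each pattern matches exactly one of the test strings, in the order they are listed in `tests`; "hello" matches none because its repeated "l" occurs only twice.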

##construct regular expressions to match words that

##start and end with the same character

list <- c("anna", "church", "bob", "harry", "paul", "eleven", "bubble")
regex_expr <- "^(.)((.*\\1$)|\\1?$)"
str_subset(list, regex_expr)
## [1] "anna" "bob"

##contain a repeated pair of letters (e.g., "church" contains "ch" repeated twice)

regex_expr2 <- "([A-Za-z][A-Za-z]).*\\1"
str_subset(list,regex_expr2)
## [1] "church"

##contain one letter repeated in at least three places (e.g., "eleven" contains three "e"s)

regex_expr3 <- "([A-Za-z]).*\\1.*\\1"
str_subset(list,regex_expr3 )
## [1] "eleven" "bubble"
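To see which pair or letter triggered each match, `str_match()` can return the captured group alongside the full match. This is a small sketch using the same words under a hypothetical name, `word_list`.

```r
library(stringr)

word_list <- c("anna", "church", "bob", "harry", "paul", "eleven", "bubble")

# str_match() returns a matrix: column 1 is the full match,
# column 2 is the text captured by the first group
pair_hits <- str_match(word_list, "([A-Za-z][A-Za-z]).*\\1")
triple_hits <- str_match(word_list, "([A-Za-z]).*\\1.*\\1")

# NA marks words that did not match the pattern
data.frame(word = word_list, pair = pair_hits[, 2], letter = triple_hits[, 2])
```

Only "church" has a non-missing `pair` (the captured "ch"), while "eleven" and "bubble" report the letters "e" and "b".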