Homework3_ZhouxinShi

Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to the problems below. You may work in a small group, but please submit separately with names of all group participants in your submission.

1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

library(stringr)

college.majors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv", header = TRUE, stringsAsFactors = FALSE)

data <- str_detect(college.majors$Major, fixed("DATA"))
college.majors[data, ]

##    FOD1P                                    Major          Major_Category
## 52  2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics

stats <- str_detect(college.majors$Major, fixed("STATISTICS"))
college.majors[stats, ]

##    FOD1P                                         Major          Major_Category
## 44  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 59  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics
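Since the question asks for majors containing either keyword, the two searches can also be combined into one pattern. The following is a minimal sketch, not part of the original solution. To stay self-contained it runs on a small sample vector (`majors`, a name chosen for this sketch) rather than the downloaded file: the first three entries are the names shown in the output above, and the last two are filler rows.

```r
library(stringr)

# Illustrative sample only: the first three names appear in the output above;
# the last two are filler rows added for this sketch.
majors <- c(
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "STATISTICS AND DECISION SCIENCE",
  "GENERAL AGRICULTURE",
  "ACCOUNTING"
)

# "|" is regex alternation, so a single pattern covers both keywords.
# Note: no fixed() here, because fixed() would treat "|" literally.
either <- str_detect(majors, "DATA|STATISTICS")
majors[either]
```

On the full data frame, the same pattern would presumably be applied as college.majors[str_detect(college.majors$Major, "DATA|STATISTICS"), ].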

2. Write code that transforms the data below:

fruits <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"

[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  

[9] "elderberry"   "lime"         "lychee"       "mulberry"    

[13] "olive"        "salal berry"'
fruits

## [1] "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\"\n\n[5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"  \n\n[9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"    \n\n[13] \"olive\"        \"salal berry\""

fruits_extract <- unlist(str_extract_all(fruits, pattern = "\"([a-z]+.[a-z]+)\""))
fruits_extract

##  [1] "\"bell pepper\""  "\"bilberry\""     "\"blackberry\""   "\"blood orange\""
##  [5] "\"blueberry\""    "\"cantaloupe\""   "\"chili pepper\"" "\"cloudberry\""  
##  [9] "\"elderberry\""   "\"lime\""         "\"lychee\""       "\"mulberry\""    
## [13] "\"olive\""        "\"salal berry\""

fruits_remove <- str_remove_all(fruits_extract, "\"")
fruits_remove

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
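As an alternative sketch, the extraction and the quote removal can be done in one step with a capture group. This is not the method used above, only a variant: the pattern avoids the `.` wildcard by allowing just lowercase letters and spaces between the quotes, and `fruit_names` is a name introduced for this example. The input string is re-declared (without the blank lines) so the block runs on its own.

```r
library(stringr)

fruits <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# str_match_all() returns a list with one matrix per input string.
# Column 1 holds the full match (quotes included); column 2 holds the
# first capture group, i.e. the text between the quotes.
fruit_names <- str_match_all(fruits, '"([a-z ]+)"')[[1]][, 2]
fruit_names
```

Because each match consumes both of its quotes, the runs of spaces between neighbouring names are never mistaken for a quoted value.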

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

3. Describe, in words, what these expressions will match:

(.)\1\1

This matches any single character followed by two more copies of itself, i.e. the same character three times in a row, such as “ccc” or “111”. Written as an R string, each backslash must be escaped: "(.)\\1\\1".

str_match("ccc","(.)\\1\\1")

##      [,1]  [,2]
## [1,] "ccc" "c"
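The difference between the regular expression and the R string that encodes it can be checked directly. This is a small sketch; the test strings are made up for illustration.

```r
library(stringr)

pattern <- "(.)\\1\\1"

# writeLines() prints the string after R has processed its escapes,
# i.e. the regex the engine actually receives: (.)\1\1
writeLines(pattern)

# TRUE only where some character occurs three times in a row.
triple <- str_detect(c("ccc", "aab", "1112"), pattern)
triple
```

The string has seven characters even though nine are typed between the quotes, because each `\\` in the source stands for one backslash.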

“(.)(.)\2\1”

This matches any two characters followed by the same two characters in reverse order, such as “abba” or “1331”. Written as an R string, each backslash must be escaped: "(.)(.)\\2\\1".

str_match("1331","(.)(.)\\2\\1")

##      [,1]   [,2] [,3]
## [1,] "1331" "1"  "3"

(..)\1

This matches any two characters followed immediately by the same two characters, such as “abab” or “7878”. Written as an R string, the backslash must be escaped: "(..)\\1".

str_match("7878","(..)\\1")

##      [,1]   [,2]
## [1,] "7878" "78"

“(.).\1.\1”

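This matches a five-character sequence whose first, third, and fifth characters are identical, such as “cacdc” or “95929”; the second and fourth characters can be anything. Written as an R string, each backslash must be escaped: "(.).\\1.\\1".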

str_match("95929","(.).\\1.\\1")

##      [,1]    [,2]
## [1,] "95929" "9"

"(.)(.)(.).*\3\2\1"

This matches three characters, then any number of characters (including none), then the same three characters in reverse order, such as “abczegcba” or “1238879321”.

str_match("abczegcba","(.)(.)(.).*\\3\\2\\1")

##      [,1]        [,2] [,3] [,4]
## [1,] "abczegcba" "a"  "b"  "c"
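Because `.*` can match zero characters, the middle section may be empty. A quick sketch with made-up test strings (`candidates` and `mirrored` are names chosen for this example):

```r
library(stringr)

pattern <- "(.)(.)(.).*\\3\\2\\1"

# "abccba"    : nothing between "abc" and "cba"        -> match
# "abczegcba" : "zeg" between "abc" and "cba"          -> match
# "abcabc"    : second triple is not reversed          -> no match
candidates <- c("abccba", "abczegcba", "abcabc")
mirrored <- str_detect(candidates, pattern)
mirrored
```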

4. Construct regular expressions to match words that:

Start and end with the same character.

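"^(.)(.*\1)?$"

The anchors ^ and $ tie the match to the whole word; without them the pattern also succeeds on words such as “abcab”, which merely contain a repeated character. The optional group lets single-character words match.

str_detect("abba","^(.)(.*\\1)?$")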

## [1] TRUE
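The effect of anchoring can be seen by running an unanchored and an anchored pattern side by side. This is a sketch with made-up words; the anchored pattern with an optional group is one possible formulation, not the only one.

```r
library(stringr)

words <- c("abba", "abcab", "a")

# Unanchored: succeeds if any character recurs anywhere in the word.
unanchored <- str_detect(words, "(.).*\\1")

# Anchored: the captured first character must also be the last one;
# the optional group allows one-letter words.
anchored <- str_detect(words, "^(.)(.*\\1)?$")

data.frame(words, unanchored, anchored)
```

Here “abcab” passes the unanchored test although it starts with “a” and ends with “b”, while the one-letter word “a” passes only the anchored one.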

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

".*([A-Za-z][A-Za-z]).*\1.*"

str_detect("church",".*([A-Za-z][A-Za-z]).*\\1.*")

## [1] TRUE
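For a check against more than one word, `str_subset()` returns the elements that match. A brief sketch with made-up words; the leading and trailing `.*` are dropped here because `str_detect()` and `str_subset()` already search anywhere in the string.

```r
library(stringr)

words <- c("church", "banana", "apple")

# Two letters, captured, that appear again later in the word.
# "church" repeats "ch", "banana" repeats "an"; "apple" has no repeated pair.
repeated_pair <- str_subset(words, "([A-Za-z][A-Za-z]).*\\1")
repeated_pair
```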

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

".([A-Za-z]).\1.\1.*."

str_detect("eleven",".*([A-Za-z]).*\\1.\\1.*")

## [1] TRUE
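The gap between occurrences matters. A pattern with a single `.` between the last two back-references requires exactly one character there, so it can miss words whose repeated letters are spaced differently; using `.*` in both gaps accepts any spacing. The sketch below compares the two on made-up words (`strict` and `loose` are names chosen for this example).

```r
library(stringr)

words <- c("eleven", "evergreen", "banana", "apple")

# Exactly one character between the second and third occurrence.
strict <- str_detect(words, ".*([A-Za-z]).*\\1.\\1.*")

# Any number of characters (including none) between occurrences.
loose <- str_detect(words, "([A-Za-z]).*\\1.*\\1")

data.frame(words, strict, loose)
```

“evergreen” contains four “e”s but fails the strict pattern, because no three of them have the required spacing; the loose pattern accepts it.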