Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to the problems below. You may work in a small group, but please submit separately with names of all group participants in your submission.
1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
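library(stringr)
# Load the list of majors from the fivethirtyeight GitHub repository
college.majors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv", header = TRUE, stringsAsFactors = FALSE)
# Flag majors whose name contains the literal text "DATA"
has_data <- str_detect(college.majors$Major, fixed("DATA"))
college.majors[has_data, ]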
## FOD1P Major Major_Category
## 52 2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
stats <- str_detect(college.majors$Major, fixed("STATISTICS"))
college.majors[stats, ]
## FOD1P Major Major_Category
## 44 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business
## 59 3702 STATISTICS AND DECISION SCIENCE Computers & Mathematics
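Because the question asks for majors containing either term, the two searches can also be combined into one pattern with regex alternation. A minimal base R sketch, assuming a small hand-typed sample of major names (`sample_majors` is hypothetical and stands in for `college.majors$Major`):

```r
# Hypothetical sample standing in for college.majors$Major
sample_majors <- c(
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "STATISTICS AND DECISION SCIENCE",
  "GENERAL AGRICULTURE",
  "ACCOUNTING"
)

# "|" means "or": keep names that contain DATA or STATISTICS
is_match <- grepl("DATA|STATISTICS", sample_majors)
matching_majors <- sample_majors[is_match]
print(matching_majors)
```

With stringr, the same pattern can be passed as `str_detect(college.majors$Major, "DATA|STATISTICS")`, which returns all matching rows in a single step.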
2. Write code that transforms the data below:
fruits <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'
fruits
## [1] "[1] \"bell pepper\" \"bilberry\" \"blackberry\" \"blood orange\"\n\n[5] \"blueberry\" \"cantaloupe\" \"chili pepper\" \"cloudberry\" \n\n[9] \"elderberry\" \"lime\" \"lychee\" \"mulberry\" \n\n[13] \"olive\" \"salal berry\""
fruits_extract <- unlist(str_extract_all(fruits, pattern = "\"([a-z]+.[a-z]+)\""))
fruits_extract
## [1] "\"bell pepper\"" "\"bilberry\"" "\"blackberry\"" "\"blood orange\""
## [5] "\"blueberry\"" "\"cantaloupe\"" "\"chili pepper\"" "\"cloudberry\""
## [9] "\"elderberry\"" "\"lime\"" "\"lychee\"" "\"mulberry\""
## [13] "\"olive\"" "\"salal berry\""
fruits_remove <- str_remove_all(fruits_extract, "\"")
fruits_remove
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
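The extraction above can also be written with a negated character class, which captures everything between a pair of double quotes regardless of what the name contains. The following is a base R sketch rather than the stringr approach used above; the final step, which collapses the vector into a single `c(...)` string, assumes that such a string is the desired target format.

```r
fruits <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'

# A quote, then one or more non-quote characters, then a closing quote
quoted <- regmatches(fruits, gregexpr('"[^"]+"', fruits))[[1]]

# Drop the surrounding quotes to get the bare names
fruit_names <- gsub('"', "", quoted, fixed = TRUE)

# Assumed target format: one string that looks like a call to c()
as_call <- paste0("c(", paste0('"', fruit_names, '"', collapse = ", "), ")")
print(fruit_names)
print(as_call)
```

The index markers such as `[5]` are skipped automatically because they sit outside the quotes.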
The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:
3. Describe, in words, what these expressions will match:
(.)\1\1
This will match any single character that appears three times in a row, such as “ccc” or “111”. As written, it is a bare regular expression; in an R string the backslashes must be doubled, giving "(.)\\1\\1".
str_match("ccc","(.)\\1\\1")
## [,1] [,2]
## [1,] "ccc" "c"
"(.)(.)\\2\\1"
This will match two characters followed by the same two characters in reverse order, like “abba” or “1331”. The expression is already escaped as an R string, so it can be passed to str_match() unchanged.
str_match("1331","(.)(.)\\2\\1")
## [,1] [,2] [,3]
## [1,] "1331" "1" "3"
(..)\1
This will match any two characters immediately followed by the same two characters, like “abab” or “7878”. In an R string the backslash must be doubled, giving "(..)\\1".
str_match("7878","(..)\\1")
## [,1] [,2]
## [1,] "7878" "78"
"(.).\\1.\\1"
This will match five consecutive characters in which the first, third, and fifth are the same, like “cacdc” or “95929”. The expression is already escaped as an R string.
str_match("95929","(.).\\1.\\1")
## [,1] [,2]
## [1,] "95929" "9"
"(.)(.)(.).*\\3\\2\\1"
This will match any three characters, then zero or more arbitrary characters, then the same three characters in reverse order, like “abczegcba” or “1238879321”.
str_match("abczegcba","(.)(.)(.).*\\3\\2\\1")
## [,1] [,2] [,3] [,4]
## [1,] "abczegcba" "a" "b" "c"
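The five descriptions above can be spot-checked by pairing each pattern with one string it should match and one it should not. This is a small base R sketch; the helper `first_match` and the test strings are illustrative assumptions, and `perl = TRUE` is set so that backreferences follow PCRE rules.

```r
patterns <- c(
  triple        = "(.)\\1\\1",
  mirrored_pair = "(.)(.)\\2\\1",
  doubled_pair  = "(..)\\1",
  alternating   = "(.).\\1.\\1",
  mirrored_ends = "(.)(.)(.).*\\3\\2\\1"
)
should_match     <- c("xcccx", "abba", "abab", "cacdc", "abczegcba")
should_not_match <- c("abab", "abab", "abba", "abcde", "abcabc")

# Return the first matched substring, or NA when the pattern does not match
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  if (m[1] == -1) NA_character_ else regmatches(x, m)
}

hits   <- mapply(first_match, patterns, should_match, USE.NAMES = FALSE)
misses <- mapply(first_match, patterns, should_not_match, USE.NAMES = FALSE)
print(hits)
print(misses)
```

Note that “abab” matches the doubled-pair pattern but not the mirrored-pair one, and “abba” does the opposite, which makes the difference between `\\1` and `\\2\\1` easy to see.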
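4. Construct regular expressions to match words that:
Start and end with the same character.
"^(.).*\\1$"
The anchors ^ and $ pin the captured character to the first and last positions; without them the pattern matches any word that merely contains a repeated character.
str_detect("abba","^(.).*\\1$")
## [1] TRUE
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice).
".*([A-Za-z][A-Za-z]).*\\1.*"
str_detect("church",".*([A-Za-z][A-Za-z]).*\\1.*")
## [1] TRUE
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s).
".*([A-Za-z]).*\\1.*\\1.*"
The .* between the second and third occurrence allows any number of characters to separate the repeats, so words such as “eerie” also match.
str_detect("eleven",".*([A-Za-z]).*\\1.*\\1.*")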
## [1] TRUE
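A single TRUE result does not show that a pattern rejects the words it should reject, so it helps to test each expression against both matching and non-matching words. The sketch below uses base R. The word list is an illustrative assumption, and the `same_ends` pattern wraps its tail in an optional group so that a one-letter word also counts, which is a design choice rather than part of the exercise.

```r
patterns <- c(
  same_ends     = "^(.)(.*\\1)?$",            # first character equals last
  repeated_pair = "([A-Za-z][A-Za-z]).*\\1",  # some letter pair occurs twice
  three_times   = "([A-Za-z]).*\\1.*\\1"      # some letter occurs three times
)
words <- c("abba", "abbc", "church", "eleven", "banana", "a")

# One column per pattern, one row per word
results <- sapply(patterns, function(p) grepl(p, words, perl = TRUE))
rownames(results) <- words
print(results)
```

Here “abbc” is the useful negative case: it contains a repeated character but does not start and end with the same one, so an unanchored pattern would wrongly accept it.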