library(tidyverse)
library(stringr)
Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
college_major <- read_csv(url) #read the data
head(college_major)
## # A tibble: 6 x 3
## FOD1P Major Major_Category
## <chr> <chr> <chr>
## 1 1100 GENERAL AGRICULTURE Agriculture & Natural Resources
## 2 1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3 1102 AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4 1103 ANIMAL SCIENCES Agriculture & Natural Resources
## 5 1104 FOOD SCIENCE Agriculture & Natural Resources
## 6 1105 PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources
major_data_stat <- college_major[str_detect(college_major$Major, pattern = "DATA|STATISTICS"), ]
major_data_stat
## # A tibble: 3 x 3
## FOD1P Major Major_Category
## <chr> <chr> <chr>
## 1 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business
## 2 2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3 3702 STATISTICS AND DECISION SCIENCE Computers & Mathematics
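The same filter can also be written with dplyr's `filter()`. The snippet below is a minimal sketch on a small hand-made tibble, so it runs without downloading the CSV; `toy_majors` and `toy_hits` are hypothetical names, and the three rows are copied from the outputs above.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: three rows copied from the outputs shown above
toy_majors <- tibble(
  FOD1P = c("1100", "2101", "3702"),
  Major = c("GENERAL AGRICULTURE",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE")
)

# Keep rows whose Major contains either word; "|" is regex alternation
toy_hits <- toy_majors %>%
  filter(str_detect(Major, "DATA|STATISTICS"))

toy_hits$FOD1P
```

Because `str_detect()` is case sensitive by default, a lowercase "data" would not match with this pattern.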
Write code that transforms the data below:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”
Into a format like this:
c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
text_data <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'
#desired output
desire_output <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
#Collapse every run of whitespace (including the line breaks) into a single space and trim both ends
text_data_no_space <- str_squish(text_data)
writeLines(text_data_no_space)
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" [5] "blueberry" "cantaloupe" "chili pepper" "cloudberry" [9] "elderberry" "lime" "lychee" "mulberry" [13] "olive" "salal berry"
#Extract each fruit name: either two words separated by one whitespace character, or a single word
text_extract <- unlist(str_extract_all(text_data_no_space, pattern = "[[:alpha:]]+\\s[[:alpha:]]+|[[:alpha:]]+"))
#unlist() already returned a character vector, so no further combining is needed
text <- text_extract
text
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
Compare the desired output to the output
#Check if the output is the same as the desired output
identical(text, desire_output)
## [1] TRUE
identical() returns TRUE, so the extracted vector matches the desired output exactly, element by element.
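If the goal is the literal text `c("bell pepper", ...)` rather than a character vector, one possible approach is to quote each element and collapse the result into a single string. This is a sketch under that reading of the task; `fruits_vec` and `fruits_call` are hypothetical names, and the input is shortened to three elements.

```r
library(stringr)

# Assumed input: a character vector like the one extracted above (shortened)
fruits_vec <- c("bell pepper", "bilberry", "salal berry")

# Wrap each element in double quotes, join with ", ", then wrap in c( )
fruits_call <- str_c("c(", str_c('"', fruits_vec, '"', collapse = ", "), ")")

writeLines(fruits_call)
```

Parsing and evaluating the string with `eval(parse(text = fruits_call))` should give back the original vector, which is a quick way to check the format.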
Describe, in words, what these expressions will match:
(3a) (.)\1\1 : As a regular expression, this matches any single character followed by the same character two more times, i.e. three identical characters in a row such as “aaa”. Written as an R string it must be “(.)\\1\\1”. With single backslashes, R reads “\1” as an octal escape (the control character \001), so the pattern does not raise an error but matches almost nothing in ordinary text.
(3b) “(.)(.)\\2\\1” : This matches two characters followed by the same two characters in reverse order, i.e. a four-character palindrome such as “daad” or “abba”. The example below shows the matches in stringr::words.
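The difference between the escaped and unescaped spellings of (3a) can be checked directly. This is a small sketch on a hypothetical test vector `probe`; the variable names are not part of the exercise.

```r
library(stringr)

# Hypothetical test strings
probe <- c("aaa", "baaad", "abab", "aab")

# Escaped form: the regex engine receives (.)\1\1 and looks for
# three identical characters in a row
escaped_hits <- str_detect(probe, "(.)\\1\\1")

# Single-backslash form: R turns each "\1" into the control character \001,
# so the string is "(", ".", ")" plus two control characters
literal_len <- nchar("(.)\1\1")
unescaped_hits <- str_detect(probe, "(.)\1\1")

escaped_hits
literal_len
unescaped_hits
```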
str_view_all(stringr::words, pattern = "(.)(.)\\2\\1", match = TRUE)
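(3c) (..)\1 : As a regular expression, this matches any two characters immediately followed by the same two characters, as in “anan” in “banana”. As an R string it must be written “(..)\\1”; with a single backslash, R reads “\1” as the control character \001 rather than as a backreference, so no error is raised but the intended match is not found.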
str_view_all(fruit, "(..)\\1", match = TRUE)
(3d) “(.).\\1.\\1” : This matches a character, then any single character, then the first character again, then any single character, then the first character a third time. In the fruit example below, “banana” matches because “a” occurs three times with “n” in between. The second match, “papaya”, shows that the two in-between characters need not be the same (“p” and “y”).
str_view_all(fruit, "(.).\\1.\\1", match = TRUE)
str_view_all(words, "(.).\\1.\\1", match = TRUE)
(3e) “(.)(.)(.).*\\3\\2\\1” : This matches three characters, followed by zero or more characters of any kind, followed by the same three characters in reverse order. In the example below, “paragraph” matches: “par” is followed by “ag” and then by the reversed group “rap”.
str_view_all(words, "(.)(.)(.).*\\3\\2\\1", match = TRUE)
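To see exactly which part of a word each backreference pattern captures, `str_extract()` can be applied to one hand-picked word per pattern. This is a sketch; the test words and the variable names `m_3b` to `m_3e` are chosen here for illustration.

```r
library(stringr)

# Two characters, then the same two in reverse order
m_3b <- str_extract("abba", "(.)(.)\\2\\1")

# Two characters immediately repeated
m_3c <- str_extract("cucumber", "(..)\\1")

# The same character at positions 1, 3 and 5 of the match
m_3d <- str_extract("banana", "(.).\\1.\\1")

# Three characters, anything in between, then the three reversed
m_3e <- str_extract("paragraph", "(.)(.)(.).*\\3\\2\\1")

c(m_3b, m_3c, m_3d, m_3e)
```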
Construct regular expressions to match words that:
Start and end with the same character.
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
Start and end with the same character
#The optional group lets single-character words such as "a" match too
str_view_all(words, "^(.)(.*\\1)?$", match = TRUE)
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
#".*" lets the second occurrence of the pair appear anywhere later in the word, as in "church"
str_view_all(words, "(..).*\\1", match = TRUE)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
str_view_all(words, "([a-zA-Z]).*\\1.*\\1", match = TRUE)
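The three requirements can be checked on a small vector where the expected answers are easy to verify by eye. This is a sketch with a hypothetical vector `toy_words`; the patterns used here are one possible formulation that accepts single-character words and pairs that repeat with other letters in between.

```r
library(stringr)

# Hypothetical test words
toy_words <- c("a", "area", "church", "banana", "eleven", "dog")

# Start and end with the same character (single-character words included)
same_ends <- toy_words[str_detect(toy_words, "^(.)(.*\\1)?$")]

# Contain a repeated pair of letters, adjacent or not
rep_pair <- toy_words[str_detect(toy_words, "([A-Za-z]{2}).*\\1")]

# Contain one letter in at least three places
three_times <- toy_words[str_detect(toy_words, "([A-Za-z]).*\\1.*\\1")]

same_ends
rep_pair
three_times
```

Restricting the captured group to `[A-Za-z]` keeps digits, spaces and punctuation from counting as "letters", which matters for inputs other than `words`.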