Task: Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
Load the libraries to use:
library(readr)
library(dplyr)
library(stringr)
Read in the data:
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
majors <- read_csv(url)
# Look at the first 10 rows:
head(majors, 10)
## # A tibble: 10 × 3
## FOD1P Major Major_Category
## <chr> <chr> <chr>
## 1 1100 GENERAL AGRICULTURE Agriculture & Natural Resources
## 2 1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3 1102 AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4 1103 ANIMAL SCIENCES Agriculture & Natural Resources
## 5 1104 FOOD SCIENCE Agriculture & Natural Resources
## 6 1105 PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources
## 7 1106 SOIL SCIENCE Agriculture & Natural Resources
## 8 1199 MISCELLANEOUS AGRICULTURE Agriculture & Natural Resources
## 9 1302 FORESTRY Agriculture & Natural Resources
## 10 1303 NATURAL RESOURCES MANAGEMENT Agriculture & Natural Resources
Identify the majors that contain either “DATA” or “STATISTICS”:
data_stats <- majors %>%
  filter(str_detect(Major, "DATA|STATISTICS"))
data_stats
## # A tibble: 3 × 3
## FOD1P Major Major_Category
## <chr> <chr> <chr>
## 1 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business
## 2 2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3 3702 STATISTICS AND DECISION SCIENCE Computers & Mathematics
If the majors are needed as a character vector, extract the column:
data_stats$Major
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
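As a quick sanity check that does not require downloading the file, the same pattern can be tried on a handful of the majors shown above. This is only a minimal sketch: the vector `sample_majors` is a hypothetical hand-picked subset, and base R's `grepl()` stands in for `str_detect()`.

```r
# Hypothetical subset of majors copied from the output above
sample_majors <- c(
  "GENERAL AGRICULTURE",
  "FORESTRY",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE"
)

# "|" is alternation: keep entries containing DATA or STATISTICS
matches <- sample_majors[grepl("DATA|STATISTICS", sample_majors)]
print(matches)
```

The match is case-sensitive, which works here because the dataset stores majors in upper case; for mixed-case input, `grepl()` accepts `ignore.case = TRUE`.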
Task: Write code that transforms the data below:
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
# original string
x <- '
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
'
# split the input string into lines and drop the empty first element
x_clean <- strsplit(x, "\n")[[1]][-1] %>%
  # remove the index digits, then the leftover "[] " prefix, from each line
  str_remove_all(pattern = "[0-9]") %>%
  str_remove_all(pattern = "\\[\\] ") %>%
  # put a comma between adjacent quoted items (\\s+ tolerates column padding)
  str_replace_all(pattern = '"\\s+"', replacement = '", "') %>%
  # join all cleaned lines into one string separated by commas
  paste0(collapse = ", ")
# enclose the cleaned string in "c()" and print it to the console
x_clean <- paste0("c(", x_clean, ")")
cat(x_clean)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
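An alternative is to pull out the quoted items directly instead of stripping everything around them. The sketch below assumes every item is wrapped in straight double quotes and contains no embedded quotes; it uses base R's `gregexpr()` and `regmatches()`, and the names `items`, `x_alt` and `fruits` are introduced only for this illustration.

```r
# Console output like the one above, stored as one string (padding may vary)
x <- '
[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"
'

# Extract every quoted item, quotes included
items <- regmatches(x, gregexpr('"[^"]+"', x))[[1]]

# Join the items with commas and wrap them in c()
x_alt <- paste0("c(", paste(items, collapse = ", "), ")")
cat(x_alt, "\n")

# Evaluating the generated text should give back a character vector
fruits <- eval(parse(text = x_alt))
length(fruits)
```

Because the index markers such as `[13]` are never matched, this version does not need separate steps for digits and brackets.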
Task: Describe, in words, what these expressions will match:
The regular expression (.)\1\1 will match any character that appears three times in a row.
For example: “aaa”, or the “000” in “1000”.
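The regular expression “(.)(.)\2\1” will match two characters followed by the same two characters in reverse order, that is, a four-character sequence of the form xyyx. Reading the quotation marks as R string delimiters rather than literal characters, nothing else needs to surround the match.
For example: “abba”, or the “eppe” in “pepper”.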
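The regular expression (..)\1 will match any two characters immediately followed by the same two characters, that is, a sequence of the form xyxy. A doubled single letter, as in “hello” or “apple”, is not enough.
For example: the “anan” in “banana”, the “coco” in “coconut”, the “cucu” in “cucumber”.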
The regular expression “(.).\1.\1” will match a five-character sequence in which the first, third and fifth characters are the same and the second and fourth can be any characters.
For example: “abaca”, or the “anana” in “banana”.
The regular expression “(.)(.)(.).*\3\2\1” will match three characters, followed by any number of characters (including none), followed by the same three characters in reverse order. The pattern is not anchored, so the match can occur anywhere in the string.
For example: “abccba”, “abc1cba”, or “paragraph” (“par” followed later by “rap”).
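The descriptions above can be checked empirically. The following is a minimal sketch using base R's `grepl()` with `perl = TRUE`; the test words and the labels `triple`, `abba`, `pair`, `spaced` and `mirrored` are hypothetical names chosen for this illustration. Each regex is written as an R string, so every backslash is doubled.

```r
# The five patterns as R strings (backslashes doubled)
patterns <- c(
  triple   = "(.)\\1\\1",
  abba     = "(.)(.)\\2\\1",
  pair     = "(..)\\1",
  spaced   = "(.).\\1.\\1",
  mirrored = "(.)(.)(.).*\\3\\2\\1"
)

# Hypothetical test words
words <- c("aaa", "pepper", "banana", "paragraph", "hello")

# Rows are words, columns are patterns; TRUE means the pattern matches
hits <- sapply(patterns, function(p) grepl(p, words, perl = TRUE))
rownames(hits) <- words
print(hits)
```

In this sketch “banana” is matched by both `pair` and `spaced`, while “hello” is matched by none of the five patterns.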
Task: Construct regular expressions to match words that:
^(.).*\1$
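This regex matches any string whose first and last characters are the same. The first character is captured by (.), any number of characters (including none) may follow, and \1 requires the final character to equal the captured one; the anchors ^ and $ tie the match to the start and end of the string. Because (.) and \1 each consume a character, the string must be at least two characters long, so a one-letter word is not matched.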
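If one-letter words should also count as starting and ending with the same character, the tail of the pattern can be made optional. The sketch below compares the two versions with base R's `grepl()`; the variant `^(.)(.*\1)?$` and the test words are assumptions added for illustration, not part of the original answer.

```r
# Original pattern: needs at least two characters
same_ends <- "^(.).*\\1$"
# Assumed variant: the optional group also lets a one-character string match
same_ends_or_single <- "^(.)(.*\\1)?$"

# Hypothetical test words
words <- c("stats", "area", "a", "data")

strict  <- grepl(same_ends, words, perl = TRUE)
lenient <- grepl(same_ends_or_single, words, perl = TRUE)

# One row per word, one column per pattern
print(data.frame(words, strict, lenient))
```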
.*(.{2}).*\1.*
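This regex matches any string in which some pair of characters (captured by .{2}) appears again later in the string (via \1), with any number of characters, including none, between the two occurrences. The leading and trailing .* are optional when the regex is used with a function such as str_detect, which searches anywhere in the string.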
.*(.).*\1.*\1.*
This regex matches any string in which a single character (captured by (.)) appears at least three times: once where it is captured and twice more through the two \1 backreferences. The .* between the occurrences allows any number of characters, including none, to separate them.
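The repeated-pair and three-occurrence patterns can be tried on a few words. This is a minimal sketch with base R's `grepl()`; the test words are hypothetical, and the shortened patterns without the outer `.*` are an assumption that relies on `grepl()` searching anywhere in the string.

```r
# Hypothetical test words
words <- c("church", "eleven", "stats", "apple")

# Patterns as written above, with backslashes doubled for R strings
pair_full  <- ".*(.{2}).*\\1.*"
three_full <- ".*(.).*\\1.*\\1.*"

# Shortened forms without the outer .*
pair_short  <- "(..).*\\1"
three_short <- "(.).*\\1.*\\1"

has_pair  <- grepl(pair_full, words, perl = TRUE)
has_three <- grepl(three_full, words, perl = TRUE)
print(data.frame(words, has_pair, has_three))

# The shortened forms should select the same words
same_pair  <- identical(has_pair, grepl(pair_short, words, perl = TRUE))
same_three <- identical(has_three, grepl(three_short, words, perl = TRUE))
```

Here “church” contains the pair “ch” twice and “eleven” contains three “e”s, while “stats” and “apple” satisfy neither pattern.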