Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.
# Load necessary libraries
library(readr)
library(dplyr)
# Download and load the CSV from the GitHub raw URL
majors_df <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv")
# Check the structure to see the available columns
str(majors_df)
## spc_tbl_ [174 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ FOD1P : chr [1:174] "1100" "1101" "1102" "1103" ...
## $ Major : chr [1:174] "GENERAL AGRICULTURE" "AGRICULTURE PRODUCTION AND MANAGEMENT" "AGRICULTURAL ECONOMICS" "ANIMAL SCIENCES" ...
## $ Major_Category: chr [1:174] "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" ...
## - attr(*, "spec")=
## .. cols(
## .. FOD1P = col_character(),
## .. Major = col_character(),
## .. Major_Category = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
# Filter the majors that contain either "DATA" or "STATISTICS"
data_majors <- majors_df %>%
  filter(grepl("data|statistics", Major, ignore.case = TRUE))
# Display the filtered majors
data_majors
## # A tibble: 3 × 3
## FOD1P Major Major_Category
## <chr> <chr> <chr>
## 1 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business
## 2 2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3 3702 STATISTICS AND DECISION SCIENCE Computers & Mathematics
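The filter above matches substrings, so a name that merely contains the letters "DATA" inside a longer word would also be kept. The following is a minimal sketch of that behaviour on a small vector; the entry "DATABASE ADMINISTRATION" is a hypothetical major invented for illustration and is not claimed to be in the dataset. The word-boundary variant is one possible way to tighten the match, not a requirement of the exercise.

```r
# Small sample of major names; the last entry is hypothetical
sample_majors <- c("GENERAL AGRICULTURE",
                   "COMPUTER PROGRAMMING AND DATA PROCESSING",
                   "STATISTICS AND DECISION SCIENCE",
                   "DATABASE ADMINISTRATION")

# Substring match: TRUE wherever "DATA" or "STATISTICS" appears anywhere
loose <- grepl("DATA|STATISTICS", sample_majors)

# Whole-word match: \\b requires a word boundary on both sides
strict <- grepl("\\b(DATA|STATISTICS)\\b", sample_majors, perl = TRUE)

sample_majors[loose]
sample_majors[strict]
```

With the real dataset the two versions may well return the same rows; the difference only matters if a major name embeds one of the keywords in a longer word.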
Write code that transforms the data below:
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
# Define the original vector of fruit names
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange",
            "blueberry", "cantaloupe", "chili pepper", "cloudberry",
            "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
# Use dput() to print the vector in the literal format
dput(fruits)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry",
## "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime",
## "lychee", "mulberry", "olive", "salal berry")
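If the input is only available as printed console text rather than as an R object, the quoted items can be pulled out with a regular expression before calling dput(). This is a sketch under that assumption; the variable names printed, quoted, and fruits_parsed are introduced here for illustration.

```r
# The printed output, stored as a single multi-line string
printed <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Extract every double-quoted item; the [n] index markers are skipped
quoted <- regmatches(printed, gregexpr('"[^"]*"', printed))[[1]]

# Strip the surrounding quotation marks to recover the plain strings
fruits_parsed <- gsub('"', "", quoted, fixed = TRUE)

# Print the vector in c(...) form
dput(fruits_parsed)
```

This avoids retyping the values by hand, at the cost of assuming that no item contains an embedded double quote.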
Describe, in words, what these expressions will match:
(.)\1\1
The first element, (.), captures any character, and \1\1 repeats that character two more times. Written as an R string, the backslashes must be doubled: "(.)\\1\\1".
Description: Any character repeated three times consecutively (e.g., “aaa”).
"(.)(.)\\2\\1"
This expression captures two characters separately, then matches the second captured character followed by the first.
Description: A four-character palindrome (e.g., “abba”).
(..)\1
This captures any two characters as a group and then immediately repeats that exact pair.
Description: A repeated two-character sequence (e.g., “abab”).
"(.).\\1.\\1"
This captures a single character and then matches any character, followed by the captured character, another arbitrary character, and the captured character again.
Description: A five-character pattern where the first, third, and fifth characters are identical; the second and fourth can be anything (e.g., “ababa” or “abaca”).
"(.)(.)(.).*\\3\\2\\1"
This captures three characters and then, after any number of characters (.*), requires that the three captured characters appear in reverse order.
Description: Three characters followed, after zero or more other characters, by the same three characters in reverse order (e.g., “abc…cba” or “abccba”). Because the pattern has no ^ or $ anchors, the match can occur anywhere within a longer string.
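The five descriptions can be checked empirically. The sketch below runs each pattern against a handful of test strings chosen for illustration; the names patterns, test_strings, and results are not part of the original exercise.

```r
# The five patterns from this section, written as R strings
patterns <- c(
  triple      = "(.)\\1\\1",
  palindrome4 = "(.)(.)\\2\\1",
  pair_twice  = "(..)\\1",
  alternating = "(.).\\1.\\1",
  mirrored    = "(.)(.)(.).*\\3\\2\\1"
)

# One string intended to match each pattern, plus one that matches none
test_strings <- c("aaa", "abba", "abab", "abaca", "abcxcba", "abcdef")

# Logical matrix: one row per test string, one column per pattern
results <- sapply(patterns, function(p) grepl(p, test_strings))
rownames(results) <- test_strings
results
```

Each test string should be TRUE only in the column of the pattern it was written for, and "abcdef" should be FALSE everywhere.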
Construct regular expressions to match words that:
# Start and end with the same character (words of two or more characters)
regex_start_end <- "^(.).*\\1$"
regex_start_end
## [1] "^(.).*\\1$"
# Contain a repeated pair of characters (e.g., "ch" twice in "church")
regex_repeated_pair <- "(..).*?\\1"
regex_repeated_pair
## [1] "(..).*?\\1"
# Contain one letter that appears in at least three places (e.g., "e" in "eleven")
regex_letter_triplicate <- ".*([A-Za-z]).*\\1.*\\1.*"
regex_letter_triplicate
## [1] ".*([A-Za-z]).*\\1.*\\1.*"
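As a quick sanity check, the three patterns can be applied to a few sample words. This is a sketch with a hand-picked word list (an assumption for illustration); perl = TRUE is used so that the lazy quantifier .*? is handled by the PCRE engine.

```r
# Patterns as defined above, repeated so the block runs on its own
regex_start_end         <- "^(.).*\\1$"
regex_repeated_pair     <- "(..).*?\\1"
regex_letter_triplicate <- ".*([A-Za-z]).*\\1.*\\1.*"

# Hand-picked sample words
words <- c("church", "eleven", "level", "banana", "stats", "cat")

# Keep the words that each pattern matches
start_end_hits  <- words[grepl(regex_start_end, words, perl = TRUE)]
pair_hits       <- words[grepl(regex_repeated_pair, words, perl = TRUE)]
triplicate_hits <- words[grepl(regex_letter_triplicate, words, perl = TRUE)]

start_end_hits
pair_hits
triplicate_hits
```

On this list, "level" and "stats" start and end with the same character, "church" and "banana" contain a repeated pair, and "eleven" and "banana" contain a letter in three places.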
This was a fun assignment! It offered an interesting way to practice reading, understanding, and writing regular expressions, and much of it was new to me.