This is the third weekly assignment for the Fall 2024 edition of DATA 607. This week covers the interpretation and manipulation of regular expressions.
The first question states:
Using the 173 majors listed in [fivethirtyeight.com’s College Majors dataset](https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/), provide code that identifies the majors that contain either “DATA” or “STATISTICS”.
To do so, I downloaded majors-list.csv from the GitHub repository linked in the story and then uploaded the file to my GCP instance.
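To answer the question, I will load the CSV with read_csv(), drop rows where the major is missing or "N/A", and keep the majors whose name contains “DATA” or “STATISTICS”: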
library(knitr)
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
majors_csv <- "https://storage.googleapis.com/data_science_masters_files/2024_fall/data_607_data_management/week_three_files/majors-list.csv"
majors_data <- read_csv(url(majors_csv))
## Rows: 174 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
stats_data_majors <- majors_data %>%
  filter(
    !is.na(Major),
    Major != "N/A",
    grepl("DATA|STATISTICS", Major, ignore.case = TRUE)
  )
kable(stats_data_majors, caption = "Majors with 'STATISTICS' or 'DATA' in Name")
FOD1P | Major | Major_Category |
---|---|---|
6212 | MANAGEMENT INFORMATION SYSTEMS AND STATISTICS | Business |
2101 | COMPUTER PROGRAMMING AND DATA PROCESSING | Computers & Mathematics |
3702 | STATISTICS AND DECISION SCIENCE | Computers & Mathematics |
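The three majors with “STATISTICS” or “DATA” in the title are MANAGEMENT INFORMATION SYSTEMS AND STATISTICS, COMPUTER PROGRAMMING AND DATA PROCESSING, and STATISTICS AND DECISION SCIENCE.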
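As a minimal sketch of how the alternation pattern behaves, the same match can be tried in base R on a small made-up vector. The vector `sample_majors` is a hypothetical stand-in for the Major column, not the real dataset. It also shows that `grepl()` returns FALSE for NA entries.

```r
# Hypothetical sample standing in for the Major column (not the full dataset)
sample_majors <- c(
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "GENERAL AGRICULTURE",
  NA
)

# "DATA|STATISTICS" matches a string containing either word
pattern <- "DATA|STATISTICS"
hits <- grepl(pattern, sample_majors, ignore.case = TRUE)

# Keep only the entries that matched; the NA entry is not a match
matched <- sample_majors[hits]
print(matched)
```

Because `grepl()` treats NA as a non-match, the explicit `!is.na(Major)` check in the pipeline is a safeguard rather than a requirement for this pattern.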
For the next question, the code below takes the provided list of words, as R prints it, and turns it into a character vector. dput() then saves the vector in a form that dget() can read back to recover the list.
fruits_before <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'
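# Strip the "[1]"-style index markers that R prints at the start of each line
text_clean <- gsub("\\[\\d+\\]", "", fruits_before)
# Replace line breaks with spaces so the text sits on a single line
text_clean <- gsub("\n", " ", text_clean)
# Pull out every double-quoted item, then drop the surrounding quotes
elements <- regmatches(text_clean, gregexpr('"(.*?)"', text_clean))[[1]]
fruits_after <- gsub('"', '', elements)
# Write the vector out as R code, then read it back in
dput(fruits_after, file = "fruits_after.R")
fruits_before_two <- dget("fruits_after.R")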
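A quick way to check that the round trip works is to run the same steps on a shortened input and compare the restored vector with the extracted one. This is only a sketch: the three-item input and the temporary file are assumptions made for illustration, not part of the assignment data.

```r
# Shortened, hypothetical input in the same printed format
fruits_small <- '[1] "bell pepper" "bilberry"
[3] "lime"'

# Same cleaning steps as above
text_small <- gsub("\\[\\d+\\]", "", fruits_small)
text_small <- gsub("\n", " ", text_small)
items <- regmatches(text_small, gregexpr('"(.*?)"', text_small))[[1]]
fruits_small_after <- gsub('"', "", items)

# Write to a temporary file instead of the working directory
tmp <- tempfile(fileext = ".R")
dput(fruits_small_after, file = tmp)

# The file holds a c(...) call; dget() evaluates it back into a vector
restored <- dget(tmp)
print(readLines(tmp))
print(identical(restored, fruits_small_after))
```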
(.)\1\1
"(.)(.)\2\1"
(..)\1
"(.).\1.\1"
"(.)(.)(.).*\3\2\1"
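The five expressions above can be tried against a few test strings to see what each one matches. This is a sketch under two assumptions: each expression is read as a regular expression in which `\1`, `\2`, and `\3` are backreferences (so the backslashes are doubled inside R strings), and the test words are made up for illustration.

```r
# The five expressions as R strings, with backslashes doubled
patterns <- c(
  "(.)\\1\\1",            # one character three times in a row, e.g. "aaa"
  "(.)(.)\\2\\1",         # two characters, then the same two reversed, e.g. "abba"
  "(..)\\1",              # a pair of characters repeated immediately, e.g. "abab"
  "(.).\\1.\\1",          # same character at positions 1, 3 and 5 of a run, e.g. "abaca"
  "(.)(.)(.).*\\3\\2\\1"  # three characters, anything, then those three reversed
)

# Hypothetical test words
test_words <- c("aaa", "abba", "abab", "abaca", "abcxcba", "banana")

# For each pattern, list the test words it matches
matches <- lapply(patterns, function(p) test_words[grepl(p, test_words)])
names(matches) <- patterns
print(matches)
```

Note that "banana" is picked up by two of the patterns: it contains "anan" (a repeated pair) and "anana" (the same character at alternating positions).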
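The next question asks me to construct regular expressions that match the specified criteria: a string that starts and ends with the same character, a string that contains a repeated pair of characters, and a string in which one letter appears at least three times.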
regexers <- "^(.).*\\1$"
double_oh <- "(..).*\\1"
thirds <- "([a-zA-Z]).*\\1.*\\1"
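The three patterns can be checked with `grepl()` on a handful of words. This is a sketch: the test words are my own choices, and `regexers_single` is a hypothetical variant, not part of the original answer, that also accepts a one-character string (which `regexers` rejects because it needs at least two characters).

```r
regexers  <- "^(.).*\\1$"
double_oh <- "(..).*\\1"
thirds    <- "([a-zA-Z]).*\\1.*\\1"

# Hypothetical test words
test_words <- c("dad", "church", "eleven", "banana", "a")

# One row per word, one column per pattern
results <- data.frame(
  word          = test_words,
  same_ends     = grepl(regexers, test_words),
  repeated_pair = grepl(double_oh, test_words),
  three_times   = grepl(thirds, test_words)
)
print(results)

# Hypothetical variant: the optional group lets a single character match too
regexers_single <- "^(.)(.*\\1)?$"
single_ok <- grepl(regexers_single, c("a", "dad", "church"), perl = TRUE)
print(single_ok)
```

One design point worth noting: `double_oh` uses `.` and so accepts any repeated pair of characters, including digits or spaces. Restricting it to letters would require a character class such as `[a-zA-Z]`, as `thirds` already uses.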