1. Loading Data into a Dataframe

Via GitHub, I will be loading data from the FiveThirtyEight article “The Economic Guide To Picking A College Major”. The data is in CSV format and will be loaded into a dataframe. Since our objective is simple, I have specifically obtained the file “majors-list.csv”. This file provides the ID associated with each major, the name of the major, and the category of the major. Note: The major ID is unique to each major, but is not ordered in any particular way. The repository describes the codes as coming from the FOD1P variable in the ACS PUMS, the American Community Survey Public Use Microdata Sample from the US Census Bureau.

library(tidyverse)  # provides %>%, filter() and str_detect() used below
data <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv")
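The note about the major ID being unique but unordered can be checked directly. The sketch below works offline: instead of downloading the file, it builds a small data frame from the three rows that appear in the output of section 1.a. The name `sample_majors` is an illustrative stand-in and not the full data set.

```r
# Illustrative stand-in for the full data: three rows copied from the
# filter output shown in section 1.a (not the complete file).
sample_majors <- read.csv(text = "FOD1P,Major,Major_Category
6212,MANAGEMENT INFORMATION SYSTEMS AND STATISTICS,Business
2101,COMPUTER PROGRAMMING AND DATA PROCESSING,Computers & Mathematics
3702,STATISTICS AND DECISION SCIENCE,Computers & Mathematics")

# anyDuplicated() returns 0 when every value is distinct.
ids_unique <- anyDuplicated(sample_majors$FOD1P) == 0
ids_unique

# is.unsorted() is TRUE when the values are not in increasing order.
ids_sorted <- !is.unsorted(sample_majors$FOD1P)
ids_sorted
```

The same two checks can be run on the FOD1P column of the downloaded dataframe to test the claim on the complete file.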


1.a. Identifying Certain Majors by String Matching

The str_detect function from stringr, which is loaded with the tidyverse, returns a Boolean for each string indicating whether the pattern is detected. With it we can identify majors that contain the word “STATISTICS” or “DATA”. The regex pattern “STATISTICS|DATA” uses the alternation operator | to match either word. We then pass the result to the filter function, which returns the subset of rows for which str_detect is TRUE.

data %>%
    filter(str_detect(Major, "STATISTICS|DATA"))
##   FOD1P                                         Major          Major_Category
## 1  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 2  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics
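To make the filtering step concrete, here is a minimal sketch of what str_detect returns on its own. The vector `majors` is a small illustrative sample: two names taken from the output above plus one non-matching string added for contrast.

```r
library(stringr)

# Illustrative sample: two majors from the output above and one
# non-matching string added for contrast.
majors <- c("STATISTICS AND DECISION SCIENCE",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "GENERAL BUSINESS")

# One logical value per element: TRUE where either word is found.
hits <- str_detect(majors, "STATISTICS|DATA")
hits

# Logical indexing keeps the TRUE elements, which is what filter()
# does row by row on the dataframe.
majors[hits]
```

The match is case-sensitive, so this works here only because the major names are stored in upper case.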


2. Reformatting Data

Given the following printed output:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

2.a. Convert into a vector of strings:

The expected vector, kept for later checking:

word_vec <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry",
    "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry",
    "olive", "salal berry")


The regex pattern has to handle two kinds of items:

  1. A single word item
  2. A two-word item separated by a single space

The alternative for two-word items must come before the alternative for single-word items. Alternation is tried left to right, and every two-word item begins with text that the single-word alternative also matches, so putting the single-word alternative first would split each two-word item into two separate words.
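The effect of the ordering can be seen on a shortened input. This is a small sketch; `items` is an illustrative string holding just two of the quoted entries.

```r
library(stringr)

# Illustrative, shortened input: two quoted items.
items <- "\"bell pepper\" \"lime\""

# Two-word alternative first: "bell pepper" is kept together.
two_first <- str_extract_all(items, "[a-z]+\\s[a-z]+|[a-z]+")[[1]]

# Single-word alternative first: it already succeeds at "bell",
# so the two-word alternative is never tried at that position.
one_first <- str_extract_all(items, "[a-z]+|[a-z]+\\s[a-z]+")[[1]]

two_first
one_first
```

With the two-word alternative first the result has two elements; with the order swapped, “bell” and “pepper” come back as separate words.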

source <- "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\"

[5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"  

[9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"    

[13] \"olive\"        \"salal berry\""

regex_filtered <- unlist(str_extract_all(str_squish(source), "([a-z]+\\s[a-z]+|[a-z]+)"))

all(regex_filtered == word_vec)  # Check if all are the same as the word_vec
## [1] TRUE
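One caveat about this kind of check: == recycles the shorter vector, so all() can return TRUE even when the lengths differ. A stricter comparison, sketched here with a made-up faulty result named `doubled`, is identical().

```r
expected <- c("lime", "olive")

# Hypothetical faulty result: every item extracted twice.
doubled <- c("lime", "olive", "lime", "olive")

# The shorter vector is recycled, so the element-wise check passes.
loose_check <- all(doubled == expected)

# identical() also compares length, so it catches the problem.
strict_check <- identical(doubled, expected)

loose_check
strict_check
```

In this assignment both vectors have 14 elements, so the two checks agree.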


3. Describe, in words, what these expressions will match:

3.a. (.)\\1\\1

This matches any single character that appears three times in a row. The first character is captured in group 1, and the back reference \\1 then requires that same character twice more.

str_view("AA BB CCC", "(.)\\1\\1", match = TRUE)  # to test the regex
## [1] │ AA BB <CCC>
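Because the pattern is written as an R string, each \\ in the source stands for a single backslash in the regex itself. The sketch below prints the pattern as the regex engine receives it and tries it on a few illustrative inputs, using base R's grepl with perl = TRUE in place of str_view.

```r
pattern <- "(.)\\1\\1"

# writeLines() shows the string after R has processed the escapes.
writeLines(pattern)

# Seven characters: ( . ) \ 1 \ 1
n_chars <- nchar(pattern)
n_chars

# Illustrative inputs: only those with a character tripled match.
triple <- grepl(pattern, c("CCC", "CC", "aaab", "abab"), perl = TRUE)
triple
```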


3.b. (.)(.)\\2\\1

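This matches two characters followed by the same two characters in reverse order, as in “ABBA”. The first character is captured in group 1 and the second in group 2. The back references then require the group 2 character again, followed by the group 1 character. The two captured characters may be identical, and a space counts as a character, which is why “ BB ” is matched in the test below.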

str_view("AA BB CCC ABBA", "(.)(.)\\2\\1", match = TRUE)  # to test the regex
## [1] │ AA< BB >CCC <ABBA>
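A few more inputs help separate this pattern from a plain repeat. This is a sketch with made-up test strings, using base R's grepl with perl = TRUE instead of str_view.

```r
# Illustrative test strings, including one with surrounding spaces.
words <- c("abba", "noon", "aaaa", "abab", " BB ")

# TRUE where some four-character stretch reads x, y, y, x.
mirror <- grepl("(.)(.)\\2\\1", words, perl = TRUE)
setNames(mirror, words)
```

“abab” fails because the pair is repeated in the same order rather than reversed.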


3.c. (..)\\1

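This matches any two-character sequence that is immediately repeated, as in “anan”. The two characters are captured together in group 1, and the back reference requires the same pair again. The two characters within the pair do not have to be the same as each other.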

str_view("AA BB CCC Banana", "(..)\\1", match = TRUE)  # to test the regex
## [1] │ AA BB CCC B<anan>a

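We can see it matches anan rather than nana. The regex engine scans from left to right and reports the leftmost match, and the search for further matches resumes after the end of that match, so the overlapping nana is never reported.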
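The position of the match can be confirmed with base R. This is a minimal sketch using regexpr and gregexpr with perl = TRUE rather than the stringr functions used elsewhere in this document.

```r
word <- "Banana"
pattern <- "(..)\\1"

# regexpr() reports the first (leftmost) match and where it starts.
first <- regexpr(pattern, word, perl = TRUE)
start_pos <- as.integer(first)
first_match <- regmatches(word, first)

# gregexpr() finds every non-overlapping match.
all_matches <- regmatches(word, gregexpr(pattern, word, perl = TRUE))[[1]]

start_pos
first_match
all_matches
```

The match starts at the second character, and only one match is returned because the remaining text after “anan” is a single “a”.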


3.d. (.).\\1.\\1

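This matches a five-character sequence in which the first, third, and fifth characters are the same. The first character is captured in group 1; each . matches any single character, and each back reference requires the group 1 character again.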

str_view("AA BB CBCCC", "(.).\\1.\\1", match = TRUE)  # to test the regex
## [1] │ AA BB <CBCCC>
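The same pattern can be tried on a handful of made-up strings to see which positions are constrained. This sketch uses base R's grepl with perl = TRUE.

```r
# Illustrative test strings.
words <- c("CBCCC", "abaca", "aaaaa", "abcde", "abab")

# TRUE where positions 1, 3 and 5 of some five-character stretch agree.
alternating <- grepl("(.).\\1.\\1", words, perl = TRUE)
setNames(alternating, words)
```

“abab” fails only because it is four characters long; the pattern needs five.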


3.e. (.)(.)(.).*\\3\\2\\1

This matches any three characters, each captured in its own group, followed by zero or more arbitrary characters, followed by the same three characters in reverse order: the third, then the second, and finally the first.

str_view("ABCDDDDDDDDCBA", "(.)(.)(.).*\\3\\2\\1", match = TRUE)  # to test the regex
## [1] │ <ABCDDDDDDDDCBA>
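Two details of .* are easy to miss: it can match nothing at all, and it is greedy. The sketch below illustrates both with made-up strings, using base R with perl = TRUE.

```r
pattern <- "(.)(.)(.).*\\3\\2\\1"

# Illustrative test strings. "abccba" matches with .* matching nothing.
words <- c("abccba", "abcXYZcba", "abcabc", "abcba")
reversed <- grepl(pattern, words, perl = TRUE)
setNames(reversed, words)

# Greedy .*: with two candidate endings, the match runs to the last one.
text <- "abcXcbaYcba"
greedy_match <- regmatches(text, regexpr(pattern, text, perl = TRUE))
greedy_match
```

“abcba” fails because the pattern needs at least six characters: three captured and three back references.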


4. Create a Regex Pattern

The pattern should match words that start and end with the same character, contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice), and contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s).

I could not find a simple way to meet all of the requirements in a single regex, so I split the task into three separate regex checks. The code carries out the checks in the order of the assignment requirements and then combines them:

  1. Start and end with the same character.
  2. Contain a repeated pair of letters.
  3. Contain one letter repeated in at least three places.
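  4. Check if all three conditions are met.

string <- "baaccaab"
start_end <- grepl("^(.).*\\1$", string)  # first and last character are the same
repeat_pair <- grepl(".*(.{2}).*\\1", string)  # some pair of characters occurs twice
three_char <- grepl(".*(.).*\\1.*\\1.*", string)  # some character occurs at least three times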
regex_out <- start_end & repeat_pair & three_char  # Check all three
regex_out
## [1] TRUE
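The three checks can be wrapped in a function and applied to several words at once. The sketch below also shows one possible way to express all three conditions in a single pattern with lookaheads. Both `meets_all` and `single` are hypothetical names introduced for illustration, the single pattern relies on perl = TRUE, and it is offered as one option rather than the intended solution.

```r
# Hypothetical helper: applies the three separate checks to a vector.
meets_all <- function(x) {
  grepl("^(.).*\\1$", x, perl = TRUE) &      # same first and last character
    grepl("(..).*\\1", x, perl = TRUE) &     # a pair of characters occurs twice
    grepl("(.).*\\1.*\\1", x, perl = TRUE)   # a character occurs three times
}

# One possible single-pattern version: each lookahead (?=...) tests one
# condition from the start of the string without consuming characters,
# and the remainder checks the first and last character.
single <- "^(?=.*(..).*\\1)(?=.*(.).*\\2.*\\2)(.).*\\3$"

# Illustrative test words.
words <- c("baaccaab", "church", "eleven", "abcabca")
separate_result <- meets_all(words)
combined_result <- grepl(single, words, perl = TRUE)

setNames(separate_result, words)
setNames(combined_result, words)
```

“church” and “eleven” each satisfy only one of the three conditions, so both approaches reject them, while “baaccaab” and “abcabca” pass all three.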

Conclusion

In this assignment, we have loaded data from a CSV file into a dataframe, identified certain majors by string matching, and reformatted data. We have also described what certain regex expressions will match and created a regex pattern that meets certain requirements. Practicing these tasks has helped me to better understand how to load and interact with data and how to use regex to match certain patterns in strings while also exploring the limitations of my abilities. The prominent limitation I faced was in creating a regex pattern that met all three requirements in a single regex expression. I had to split the requirements into three separate regex checks. This assignment has been a great learning experience and has helped me to improve my skills in data manipulation and regex.