R Markdown

I am compiling my answers for this week’s assignment regarding regular expressions.

  1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
# URL of the CSV file
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"

# Read the CSV file from the URL (read_csv() comes from the readr package)
library(readr)
majors_list <- read_csv(url)
## Rows: 174 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Filter majors that include "DATA" or "STATISTICS"
filtered_majors <- majors_list[grepl("DATA|STATISTICS", majors_list$Major, ignore.case = TRUE), ]

# Print the filtered majors
print(filtered_majors)
## # A tibble: 3 × 3
##   FOD1P Major                                         Major_Category         
##   <chr> <chr>                                         <chr>                  
## 1 6212  MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business               
## 2 2101  COMPUTER PROGRAMMING AND DATA PROCESSING      Computers & Mathematics
## 3 3702  STATISTICS AND DECISION SCIENCE               Computers & Mathematics

I struggled with this problem quite a bit. I think a problem set in a broader context would have been easier for me to wrap my head around.

# Given our original list string:
my_list_string <- "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\"

[5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"  

[9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"    

[13] \"olive\"        \"salal berry\""

# Use str_extract_all from the stringr package to parse out the fruits
library(stringr)
string_vector <- str_extract_all(my_list_string, "\"[^\"]+\"")[[1]]

# Remove the quotes
string_vector <- str_replace_all(string_vector, "\"", "")

# Print out the result
print(string_vector)
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

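This code first uses str_extract_all() to find all instances of text sandwiched between double quotes (the pattern "\"[^\"]+\"" matches an opening quote, one or more non-quote characters, and a closing quote). It then uses str_replace_all() to remove the quotes from each match, leaving a character vector of fruit names.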


The third part of the assignment involves evaluating some regular expression patterns:

  1. If you come across the regular expression pattern (.)\1\1, it means you’re looking for any character that appears three times in a row. The (.) part captures any single character, and \1 refers back to the first captured character, requiring it to be repeated twice more consecutively.

  2. Now, if you encounter the regular expression "(.)(.)\\2\\1", it’s searching for two characters followed immediately by the same two characters in reverse order. So, the first (.) captures the first character, the second (.) captures the second character, and \\2\\1 requires the second captured character and then the first to appear next. In simpler terms, it’s looking for a four-character palindrome like “ABBA”.

  3. When you see the regular expression pattern (..)\1, it means you’re looking for any two characters that are repeated once. The (..) part captures any two characters, and \1 refers back to the first captured pair, requiring it to appear again.

  4. "(.).\\1.\\1": This regular expression looks for a single character that recurs at every other position, three times in total. (.) captures a single character, and each . matches any one character without capturing it. \\1 refers back to the first captured group. So this will match five-character patterns like ‘GeGeG’ or ‘abaca’.
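The four patterns above can be checked against a few short strings. This is a minimal sketch in base R; the sample strings and the names `patterns`, `samples`, and `hits` are made up for illustration and are not part of the assignment.

```r
# Hypothetical pattern labels and test strings, chosen only for illustration.
patterns <- c(
  triple      = "(.)\\1\\1",     # same character three times in a row
  mirrored    = "(.)(.)\\2\\1",  # two characters followed by their reverse
  pair_twice  = "(..)\\1",       # two characters, immediately repeated
  alternating = "(.).\\1.\\1"    # same character at positions 1, 3 and 5
)
samples <- c("aaa", "abba", "abab", "abaca", "GeGe")

# One column per pattern, one row per sample string.
# TRUE means the pattern matches somewhere in that string.
hits <- sapply(patterns, function(p) grepl(p, samples, perl = TRUE))
rownames(hits) <- samples
print(hits)
```

Each of the first four strings matches exactly one pattern. "GeGe" matches only the pair_twice pattern, because the alternating pattern needs at least five characters.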


#4 Construct regular expressions to match words that:

  1. Start and end with the same character.
  2. Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice).
  3. Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s).

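To match words that start and end with the same character, you can use the regular expression ^(.).*\1$. Here’s a breakdown of the expression: ^ anchors the match at the start of the word, (.) captures the first character, .* matches any characters in between, and \1$ requires the captured character to appear again at the very end.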
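A quick way to try the start-and-end pattern is to run it over a handful of words. The word list below is hypothetical and only meant as a sketch.

```r
# Hypothetical sample words, not taken from the assignment data.
words <- c("area", "level", "church", "eleven")

# TRUE where the first character also ends the word.
same_ends <- grepl("^(.).*\\1$", words, perl = TRUE)
print(words[same_ends])
## [1] "area"  "level"
```

Note that a one-letter word such as "a" does not match this pattern, because \1 needs a second character to match against.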

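To match words that contain a repeated pair of letters, you can use the regular expression (..).*\1. Here’s how it works: (..) captures any two characters, .* allows any number of characters in between, and \1 requires the same pair to appear again. The .* matters because the stricter pattern (.{2})\1 only matches a pair that repeats immediately, so it would miss “church”, where “ur” sits between the two “ch”s.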
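For the repeated-pair case, it may help to compare two candidate patterns on the assignment's own example word. This sketch assumes base R's grepl(); the variable names are illustrative only.

```r
word <- "church"

# (.{2})\1 requires the pair to repeat immediately, as in "abab".
adjacent_pair <- grepl("(.{2})\\1", word, perl = TRUE)    # FALSE

# (..).*\1 lets other characters sit between the two copies of the pair.
separated_pair <- grepl("(..).*\\1", word, perl = TRUE)   # TRUE

print(c(adjacent = adjacent_pair, separated = separated_pair))
```

Only the second pattern finds the two "ch" pairs in "church", since "ur" separates them.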

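To match words that contain one letter repeated in at least three places, you can use the regular expression (.).*\1.*\1. Here’s a breakdown: (.) captures a single character, and each .*\1 requires that character to appear again later in the word, with anything in between. The pattern (.)\1{2,} would only match three or more consecutive copies, so it would miss “eleven”.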
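The same kind of comparison works for the three-occurrences case, again using the assignment's example word. This is a sketch with illustrative variable names, not a definitive solution.

```r
word <- "eleven"

# (.)\1{2,} needs three or more consecutive copies, as in "aaa".
consecutive <- grepl("(.)\\1{2,}", word, perl = TRUE)     # FALSE

# (.).*\1.*\1 allows other characters between the three copies.
spread_out <- grepl("(.).*\\1.*\\1", word, perl = TRUE)   # TRUE

print(c(consecutive = consecutive, spread_out = spread_out))
```

The three "e"s in "eleven" are separated by "l" and "v", so only the second pattern matches.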