Confused - Week 3 Assignment

R Markdown

I am compiling my answers for this week’s assignment regarding regular expressions.

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

# URL of the CSV file
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"

# Load readr for read_csv(), then read the CSV file from the URL
library(readr)
majors_list <- read_csv(url)

## Rows: 174 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Filter majors that include "DATA" or "STATISTICS"
filtered_majors <- majors_list[grepl("DATA|STATISTICS", majors_list$Major, ignore.case = TRUE), ]

# Print the filtered majors
print(filtered_majors)

## # A tibble: 3 × 3
##   FOD1P Major                                         Major_Category         
##   <chr> <chr>                                         <chr>                  
## 1 6212  MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business               
## 2 2101  COMPUTER PROGRAMMING AND DATA PROCESSING      Computers & Mathematics
## 3 3702  STATISTICS AND DECISION SCIENCE               Computers & Mathematics
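To see what the filter is doing without downloading anything, here is a minimal sketch on a small made-up vector. The names `sample_majors` and `has_keyword` are hypothetical and not part of the assignment; only `grepl()` and the pattern come from the code above.

```r
# Hypothetical sample of major names (not the full dataset)
sample_majors <- c(
  "STATISTICS AND DECISION SCIENCE",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "GENERAL BUSINESS",
  "Applied Mathematics"
)

# grepl() returns one TRUE/FALSE per element; "|" means "either side matches"
has_keyword <- grepl("DATA|STATISTICS", sample_majors, ignore.case = TRUE)
print(has_keyword)

# Logical indexing keeps only the matching elements
print(sample_majors[has_keyword])
```

Because the pattern has no word boundaries, it would also match "DATA" inside a longer word, which is worth keeping in mind for other datasets.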

I struggled with this problem quite a bit, actually. I think seeing it in a larger context would have made it easier for me to wrap my head around.

# Given our original list string:
my_list_string <- "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\"

[5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"  

[9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"    

[13] \"olive\"        \"salal berry\""

# Use str_extract_all from the stringr package to parse out the fruits
library(stringr)
string_vector <- str_extract_all(my_list_string, "\"[^\"]+\"")[[1]]

# Remove the quotes
string_vector <- str_replace_all(string_vector, "\"", "")

# Print out the result
print(string_vector)

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

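This code first uses str_extract_all() to find all instances of text sandwiched between double quotes, then uses str_replace_all() to strip the quote characters, leaving a plain character vector of fruit names.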
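The same two steps can be sketched in base R on a shortened string. This is only an illustration under assumptions: `small_string`, `quoted`, and `fruits` are made-up names, and `regmatches()`/`gregexpr()`/`gsub()` stand in for the stringr calls used above.

```r
# Shortened, hypothetical version of the printed-vector string
small_string <- "[1] \"bell pepper\"  \"bilberry\"\n[3] \"lime\""

# Step 1: pull out every run of text between double quotes (quotes included)
quoted <- regmatches(small_string, gregexpr("\"[^\"]+\"", small_string))[[1]]

# Step 2: strip the quote characters themselves
fruits <- gsub("\"", "", quoted)
print(fruits)
```

The character class `[^\"]+` is what keeps each match from running past its closing quote, so the index markers like `[1]` and the line break are simply skipped.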

The third part of the assignment involves evaluating some regular expression patterns:

If you come across the regular expression pattern (.)\1\1, it means you’re looking for any character that appears three times in a row. The (.) part captures any single character, and \1 refers back to the first captured character, requiring it to be repeated twice more consecutively.
Now, if you encounter the regular expression "(.)(.)\\2\\1" (written as an R string, so each backslash is doubled), it’s searching for a pair of characters followed immediately by the same pair in reverse order. The first (.) captures one character, the second (.) captures the next, and \\2\\1 requires the second captured character and then the first to follow. In simpler terms, it’s looking for a four-character palindrome like “abba”.
When you see the regular expression pattern (..)\1, it means you’re looking for any two characters that are immediately repeated, as in “abab”. The (..) part captures any two characters, and \1 refers back to that captured pair, requiring it to appear again right away.
"(.).\\1.\\1": This regular expression looks for a single character that appears three times with exactly one arbitrary character between each occurrence. (.) captures a single character, each unparenthesized . matches (but does not capture) any one character, and \\1 refers back to the captured character. So this will match five-character patterns like ‘GeGeG’ or ‘abaca’; ‘GeGe’ on its own is too short, because the captured character has to appear a third time.

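One way to check readings like these is to run each pattern against a few hand-picked strings. The sketch below is an assumption-laden illustration: the labels in `patterns` and the strings in `tests` are made up for this check, and `perl = TRUE` is used only so the backreferences are handled by a single, well-known engine.

```r
# Each pattern is written as an R string, so every backslash is doubled
patterns <- c(
  triple      = "(.)\\1\\1",
  mirrored    = "(.)(.)\\2\\1",
  pair_twice  = "(..)\\1",
  alternating = "(.).\\1.\\1"
)

# Hypothetical test strings chosen to exercise the patterns
tests <- c("aaa", "abba", "abab", "abaca", "GeGe")

# Rows are test strings, columns are patterns
results <- sapply(patterns, function(p) grepl(p, tests, perl = TRUE))
rownames(results) <- tests
print(results)
```

In this table "GeGe" is matched by the pair pattern but not by the alternating one, which needs a fifth character.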
#4 Construct regular expressions to match words that:

Start and end with the same character.
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice).
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s).

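To match words that start and end with the same character, you can use the regular expression ^(.).*\1$ (as an R string, "^(.).*\\1$"). Note that it needs at least two characters, so a one-letter word does not match. Here’s a breakdown of the expression: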

^ denotes the start of the word.
(.) captures any single character.
.* matches zero or more characters between the first and last character.
\1 refers back to the first captured character, ensuring that it is also the last character.
$ denotes the end of the word.
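The breakdown above can be tried out on a few words. This is a minimal sketch: the word list is made up, and the second pattern is an assumed variant (not from the assignment) that also accepts one-letter words by making the tail optional.

```r
# Hypothetical test words
words <- c("level", "stats", "church", "aa", "a")

# Pattern from the text; backslash doubled because it is an R string
same_ends <- grepl("^(.).*\\1$", words, perl = TRUE)
print(setNames(same_ends, words))

# Assumed variant: an optional tail lets a single character match too
same_ends_or_single <- grepl("^(.)(.*\\1)?$", words, perl = TRUE)
print(setNames(same_ends_or_single, words))
```

Whether a one-letter word should count as starting and ending with the same character is a judgment call, so both versions are shown.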

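To match words that contain a repeated pair of letters, you can use the regular expression (..).*\1. Here’s how it works:

(..) captures any two consecutive characters.
.* allows zero or more characters between the two occurrences, which is what lets “church” match even though its two “ch” pairs are not adjacent.
\1 refers back to the captured pair, requiring it to appear again later in the word.

Without the .*, the pattern (.{2})\1 would only match a pair that repeats immediately, as in “banana”.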
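A quick check shows how a pattern for an immediately repeated pair differs from one that allows a gap, using the assignment's own example word "church". The word list and variable names are hypothetical.

```r
# Hypothetical test words, including the assignment's example "church"
words <- c("church", "banana", "murmur", "apple")

# Pair must repeat immediately (e.g. "anan" in "banana")
adjacent_only <- grepl("(.{2})\\1", words, perl = TRUE)
print(setNames(adjacent_only, words))

# Pair may repeat anywhere later in the word (e.g. "ch...ch" in "church")
with_gap <- grepl("(..).*\\1", words, perl = TRUE)
print(setNames(with_gap, words))
```

Only the version with `.*` between the group and the backreference matches "church", since other letters sit between its two "ch" pairs.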

To match words that contain one letter repeated in at least three places, you can use the regular expression (.).*\1.*\1. Here’s a breakdown:

(.) captures any single character.
Each .* allows zero or more characters between occurrences, so the repeats do not have to be adjacent (as in “eleven”).
Each \1 refers back to the captured character, so it must appear at least two more times.

By contrast, (.)\1{2,} only matches three or more consecutive copies of a character, as in “aaa”.
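The difference between "three in a row" and "three anywhere" can be checked against the assignment's example word "eleven". As before, this is only a sketch with a made-up word list and hypothetical variable names.

```r
# Hypothetical test words, including the assignment's example "eleven"
words <- c("eleven", "aaa", "banana", "apple")

# Same character three or more times consecutively
consecutive <- grepl("(.)\\1{2,}", words, perl = TRUE)
print(setNames(consecutive, words))

# Same character in at least three places, with anything in between
anywhere <- grepl("(.).*\\1.*\\1", words, perl = TRUE)
print(setNames(anywhere, words))
```

"eleven" only matches the second pattern, because its three "e"s are separated by other letters.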


captainjalbo

2023-09-25
