I am compiling my answers for this week’s assignment regarding regular expressions.
# URL of the CSV file
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
# Read the CSV file from the URL (read_csv() comes from the readr package)
library(readr)
majors_list <- read_csv(url)
## Rows: 174 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Filter majors that include "DATA" or "STATISTICS"
filtered_majors <- majors_list[grepl("DATA|STATISTICS", majors_list$Major, ignore.case = TRUE), ]
# Print the filtered majors
print(filtered_majors)
## # A tibble: 3 × 3
## FOD1P Major Major_Category
## <chr> <chr> <chr>
## 1 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business
## 2 2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3 3702 STATISTICS AND DECISION SCIENCE Computers & Mathematics
I struggled with this problem quite a bit. I think seeing it in a broader context would have made it easier for me to wrap my head around.
# Given our original list string:
my_list_string <- "[1] \"bell pepper\" \"bilberry\" \"blackberry\" \"blood orange\"
[5] \"blueberry\" \"cantaloupe\" \"chili pepper\" \"cloudberry\"
[9] \"elderberry\" \"lime\" \"lychee\" \"mulberry\"
[13] \"olive\" \"salal berry\""
# Use str_extract_all from the stringr package to parse out the fruits
library(stringr)
string_vector <- str_extract_all(my_list_string, "\"[^\"]+\"")[[1]]
# Remove the quotes
string_vector <- str_replace_all(string_vector, "\"", "")
# Print out the result
print(string_vector)
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
This code first uses str_extract_all() to find all
instances of text sandwiched between double quotes, then uses
str_replace_all() to strip the quotes from each match.
The third part of the assignment involves evaluating some regular expression patterns:
If you come across the regular expression pattern
(.)\1\1, it means you’re looking for any character that
appears three times in a row. The (.) part captures any
single character, and \1 refers back to the first captured
character, requiring it to be repeated twice more
consecutively.
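As a quick check of that reading, here is a minimal sketch in base R. The test strings are made up for illustration, and perl = TRUE is just one way to make sure back-references are handled.

```r
# Hypothetical test strings, chosen only to illustrate the pattern
x <- c("aaa", "xx111y", "aab", "abab")

# "(.)\\1\\1" is the R string literal for the regex (.)\1\1
triple_hits <- grepl("(.)\\1\\1", x, perl = TRUE)

# Pull out the matched text for the strings that do match
triple_found <- regmatches(x, regexpr("(.)\\1\\1", x, perl = TRUE))

print(triple_hits)   # TRUE TRUE FALSE FALSE
print(triple_found)  # "aaa" "111"
```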
Now, if you encounter the regular expression
"(.)(.)\\2\\1", it’s searching for two characters followed
immediately by the same two characters in reverse order.
The first (.) captures one character, the second
(.) captures the next character, and \\2\\1
requires the second captured character and then the first captured
character to follow. In simpler terms, it’s looking for a
four-character palindrome like “ABBA”.
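A small sketch of this pattern in base R, again with made-up test strings, shows that it picks out four-character palindromes wherever they occur inside a string:

```r
# Hypothetical test strings
x <- c("abba", "pepper", "abab", "noon")

# (.)(.) captures two characters; \\2\\1 requires them again in reverse order
mirror_hits <- grepl("(.)(.)\\2\\1", x, perl = TRUE)
mirror_found <- regmatches(x, regexpr("(.)(.)\\2\\1", x, perl = TRUE))

print(mirror_hits)   # TRUE TRUE FALSE TRUE
print(mirror_found)  # "abba" "eppe" "noon"
```

Note that "pepper" matches because of the "eppe" in the middle, while "abab" does not, since its second half is a repeat rather than a mirror image.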
When you see the regular expression pattern (..)\1,
it means you’re looking for any two characters that are repeated once.
The (..) part captures any two characters, and
\1 refers back to the first captured pair, requiring it to
appear again.
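To illustrate, a minimal base R sketch with hypothetical test strings:

```r
# Hypothetical test strings
x <- c("abab", "banana", "abba", "aaaa")

# (..) captures a pair of characters; \\1 requires the same pair right after it
pair_hits <- grepl("(..)\\1", x, perl = TRUE)
pair_found <- regmatches(x, regexpr("(..)\\1", x, perl = TRUE))

print(pair_hits)   # TRUE TRUE FALSE TRUE
print(pair_found)  # "abab" "anan" "aaaa"
```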
"(.).\\1.\\1": This regular expression will look for
a single character that recurs at every other position, three times in
total across five characters. (.) captures a single character, and each
. matches any one character without capturing it. \\1 refers back
to the first captured group. So this will match patterns like
‘GeGeG’ or ‘abaca’; ‘GeGe’ on its own is one character too short.
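The five-character requirement is easy to verify with a short base R sketch (the test strings are hypothetical):

```r
# Hypothetical test strings
x <- c("abaca", "GeGeG", "GeGe", "banana")

# The captured character must appear at positions 1, 3 and 5 of the match
alt_hits <- grepl("(.).\\1.\\1", x, perl = TRUE)
alt_found <- regmatches(x, regexpr("(.).\\1.\\1", x, perl = TRUE))

print(alt_hits)   # TRUE TRUE FALSE TRUE
print(alt_found)  # "abaca" "GeGeG" "anana"
```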
#4 Construct regular expressions to match words that:
Start and end with the same character.
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
To match words that start and end with the same character, you can
use the regular expression ^(.).*\1$. Here’s a breakdown of
the expression:
^ denotes the start of the word.
(.) captures any single character.
.* matches zero or more characters between the beginning and end of the word.
\1 refers back to the first captured character, ensuring that it is also the last character.
$ denotes the end of the word.
To match words that contain a repeated pair of letters, you can use
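the regular expression (..).*\1. Here’s how it works:
(..) captures any two characters.
.* matches zero or more characters, so the second occurrence does not have to follow immediately.
\1 refers back to the captured pair of characters, requiring it to appear again later in the word. This is what lets “church” match: “ch”, then “ur”, then “ch” again. The stricter (.{2})\1 only matches back-to-back pairs, such as the “anan” in “banana”.
To match words that contain one letter repeated in at least three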
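places, you can use the regular expression (.).*\1.*\1.
Here’s a breakdown:
(.) captures any single character.
.* matches zero or more characters between occurrences.
\1 refers back to the captured character; using it twice requires two further occurrences anywhere later in the word, as in “eleven”. Writing (.)\1{2,} instead would require the repeats to be consecutive (as in “aaa”), which “eleven” does not satisfy.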
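As a sketch, the three requirements can be checked against the assignment's own examples in base R. The patterns used here are ^(.).*\1$, (..).*\1 and (.).*\1.*\1, where the .* between back-references lets the repeats sit anywhere in the word, which "church" and "eleven" need. The word list itself is a hypothetical sample.

```r
# Hypothetical word list; "church" and "eleven" are the assignment's own examples
words <- c("church", "eleven", "stats", "banana", "apple")

# Same first and last character
same_ends <- words[grepl("^(.).*\\1$", words, perl = TRUE)]

# A pair of characters that occurs again anywhere later in the word
repeated_pair <- words[grepl("(..).*\\1", words, perl = TRUE)]

# One character that occurs at least three times, not necessarily adjacent
three_places <- words[grepl("(.).*\\1.*\\1", words, perl = TRUE)]

print(same_ends)      # "stats"
print(repeated_pair)  # "church" "banana"
print(three_places)   # "eleven" "banana"
```

For comparison, the adjacent-only forms (.{2})\1 and (.)\1{2,} return FALSE for "church" and "eleven" respectively, because neither word has its repeats side by side.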