## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The raw data can be read into a data frame
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
major_list <- read_csv(url, show_col_types = FALSE)
glimpse(major_list)
## Rows: 174
## Columns: 3
## $ FOD1P <chr> "1100", "1101", "1102", "1103", "1104", "1105", "1106",…
## $ Major <chr> "GENERAL AGRICULTURE", "AGRICULTURE PRODUCTION AND MANA…
## $ Major_Category <chr> "Agriculture & Natural Resources", "Agriculture & Natur…
Using the stringr library we can use regular expressions to identify majors that meet specific criteria. We can filter the original data frame to find the majors that include the word “DATA” or “STATISTICS”
major_list %>%
filter(str_detect(Major, "DATA|STATISTICS"))
I first created a vector with a long string containing many fruits. My goal was to match each fruit, create a seperate string, and store it in a character vector.
fruits <- "bell pepper, bilberry, blackberry, blood orange, blueberry, cantaloupe, chili pepper, cloudberry, elderberry, lime, lychee, mulberry, olive, salal berry"
# pattern to match each fruit
pattern <- "\\w+\\s?\\w+"
# find each fruit with str_match_all
fruit_list <- str_match_all(fruits, pattern)
#use unlist() to unpack the items into a vector
fruit_vector <- unlist(fruit_list)
fruit_vector
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
A. The expression (.)\1\1 will match strings with the character captured by the (.) appearing three times in a row. See the example below.
x <- "aaaxyzccc"
# The regular expression needs to expressed in it's string form
str_view(x, "(.)\\1\\1")
## [1] │ <aaa>xyz<ccc>
B. The expression “(.)(.)\2\1” will match any string with any two characters grouped by each (.) that appear consecutively in reverse order. See the example below.
str_view(words, "(.)(.)\\2\\1")
## [19] │ after<noon>
## [43] │ <appa>rent
## [53] │ <arra>nge
## [107] │ b<otto>m
## [112] │ br<illi>ant
## [174] │ c<ommo>n
## [230] │ d<iffi>cult
## [259] │ <effe>ct
## [329] │ f<ollo>w
## [422] │ in<deed>
## [470] │ l<ette>r
## [521] │ m<illi>on
## [581] │ <oppo>rtunity
## [582] │ <oppo>se
## [877] │ tom<orro>w
C. The expression “(..)\1” will return a string that repeats a group of any two characters “(..)”. See example below.
z <- "gntntlw"
str_view(z, "(..)\\1")
## [1] │ g<ntnt>lw
D. The expression “(.).\1.\1” will match any string that alternates the character grouped in (.) with any character. See the example below.
a <- "gafrabafa"
str_view(a,"(.).\\1.\\1" )
## [1] │ gafr<abafa>
E. The expression **“(.)(.)(.).*\3\2\1”** will match any string with three groups of characters that are sperated by any character appearing zero or more times followed by the same cahracters in each group but in reverse order. See the example below.
b <- "habcxycbai"
str_view(b, "(.)(.)(.).*\\3\\2\\1")
## [1] │ h<abcxycba>i
A. Start and end with the same character.
# we will view the first five matches
head(str_view(words, "^(.).*\\1$"))
## [36] │ <america>
## [49] │ <area>
## [209] │ <dad>
## [213] │ <dead>
## [223] │ <depend>
## [258] │ <educate>
B. Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.).
# we will view the first 5 matches
head(str_view(words, "(..).*\\1"))
## [48] │ ap<propr>iate
## [152] │ <church>
## [181] │ c<ondition>
## [217] │ <decide>
## [275] │ <environmen>t
## [487] │ l<ondon>
C. Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
# we will view the first five matches
head(str_view(words, "(.).*\\1.*\\1"))
## [48] │ a<pprop>riate
## [62] │ <availa>ble
## [86] │ b<elieve>
## [90] │ b<etwee>n
## [119] │ bu<siness>
## [221] │ d<egree>