We are examining the list of college majors from the FiveThirtyEight article "The Economic Guide To Picking A College Major".
Below we import the data from the FiveThirtyEight GitHub repository.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
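# library(tidyverse) already attaches readr, so a separate library(readr) call is not needed
fileURL <- 'https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv'
# read.csv() accepts a URL directly, so wrapping it in url() is optional
majorsDF <- read.csv(fileURL)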
We are interested in finding majors whose names contain “DATA” or “STATISTICS”.
majorsOfInterest <- grep("DATA|STATISTICS", majorsDF$Major, value = TRUE, ignore.case = FALSE)
print(majorsOfInterest)
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
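The same filter can also be written with stringr, which the tidyverse already attaches. The sketch below is only an illustration: sampleMajors is a hypothetical stand-in for majorsDF$Major so that the code runs without a network connection, and the mixed-case entry is made up to show how case sensitivity behaves.

```r
library(stringr)

# Hypothetical stand-in for majorsDF$Major, so no download is needed
sampleMajors <- c("STATISTICS AND DECISION SCIENCE",
                  "COMPUTER PROGRAMMING AND DATA PROCESSING",
                  "GENERAL AGRICULTURE",
                  "Data science")

# str_subset() keeps the matching elements, like grep(..., value = TRUE)
matched <- str_subset(sampleMajors, "DATA|STATISTICS")

# Matching is case-sensitive by default, so "Data science" is skipped above;
# wrapping the pattern in regex(ignore_case = TRUE) relaxes that
matchedAnyCase <- str_subset(sampleMajors,
                             regex("DATA|STATISTICS", ignore_case = TRUE))
```

Because the names in the majors list are all upper case, the case-sensitive version is enough for this data set.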
We want to collapse the vector below into a single string whose elements are separated by commas.
foodVector <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
str_flatten_comma(foodVector)
## [1] "bell pepper, bilberry, blackberry, blood orange, blueberry, cantaloupe, chili pepper, cloudberry, elderberry, lime, lychee, mulberry, olive, salal berry"
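For comparison, the same kind of join can be done with base R's paste() or with str_flatten() and an explicit separator. This is a minimal sketch on a shortened, illustrative vector (foods is not part of the original exercise).

```r
library(stringr)

# Shortened, illustrative vector (hypothetical; not the full foodVector)
foods <- c("lime", "lychee", "olive")

# Base R: collapse = ", " joins every element into one string
joinedBase <- paste(foods, collapse = ", ")

# stringr: str_flatten() with the separator spelled out
joinedStringr <- str_flatten(foods, collapse = ", ")
```

Both calls return one string, so either can stand in for str_flatten_comma() when an older stringr version is installed.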
Describe, in words, what these expressions will match:
(.)\1\1
Matches any character that appears three times in a row (for example, “aaa”).
No string in fruit or words meets this requirement, so the first two calls below produce no output.
str_view(fruit, "(.)\\1\\1")
str_view(words, "(.)\\1\\1")
str_view(c('abbba', 'goooal', 'ball'), "(.)\\1\\1") # returns only 'abbba' and 'goooal' with the repeating characters highlighted
## [1] │ a<bbb>a
## [2] │ g<ooo>al
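The doubled backslashes in the calls above are R string escapes: the regex engine itself receives (.)\1\1. The following sketch, using the same three test strings, shows the difference between the string as typed and the pattern as seen by the engine, and returns a logical vector instead of highlighted output.

```r
library(stringr)

# As typed in R source; the regex engine receives (.)\1\1
pattern <- "(.)\\1\\1"

# nchar() counts the characters actually stored: 7, because each \\ is one backslash
patternLength <- nchar(pattern)

testStrings <- c("abbba", "goooal", "ball")

# TRUE where some character occurs three times in a row
hasTriple <- str_detect(testStrings, pattern)
```

Calling writeLines(pattern) prints the pattern exactly as the regex engine sees it, which can help when debugging escapes.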
"(.)(.)\\2\\1"
Matches a pair of characters followed by the same pair in reverse order, such as “abba”.
str_view(words, "(.)(.)\\2\\1")
## [19] │ after<noon>
## [43] │ <appa>rent
## [53] │ <arra>nge
## [107] │ b<otto>m
## [112] │ br<illi>ant
## [174] │ c<ommo>n
## [230] │ d<iffi>cult
## [259] │ <effe>ct
## [329] │ f<ollo>w
## [422] │ in<deed>
## [470] │ l<ette>r
## [521] │ m<illi>on
## [581] │ <oppo>rtunity
## [582] │ <oppo>se
## [877] │ tom<orro>w
(..)\1
Matches any two characters that are immediately repeated, such as “anan” in “banana”.
str_view(words,"(..)\\1")
## [696] │ r<emem>ber
str_view(fruit,"(..)\\1")
## [4] │ b<anan>a
## [20] │ <coco>nut
## [22] │ <cucu>mber
## [41] │ <juju>be
## [56] │ <papa>ya
## [73] │ s<alal> berry
"(.).\\1.\\1"
Matches a character, then any character, then the original character, then any character, then the original character again. The in-between characters are matched by a plain dot, so they may be anything, including the original character itself.
str_view(words,"(.).\\1.\\1")
## [265] │ <eleve>n
str_view(fruit,"(.).\\1.\\1")
## [4] │ b<anana>
## [56] │ p<apaya>
"(.)(.)(.).*\\3\\2\\1"
Matches three characters, then zero or more characters of any kind, then the same three characters in reverse order.
str_view(words,"(.)(.)(.).*\\3\\2\\1")
## [598] │ <paragrap>h
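To see which characters each group captured, str_match() can be used in place of str_view(). This sketch applies the pattern to the single word “paragraph”, the only match reported above.

```r
library(stringr)

# str_match() returns a character matrix:
# column 1 is the full match, columns 2 to 4 are the three capture groups
groups <- str_match("paragraph", "(.)(.)(.).*\\3\\2\\1")

fullMatch <- groups[1, 1]   # the text matched by the whole pattern
captured  <- groups[1, 2:4] # the three characters captured by (.)(.)(.)
```

Here the groups capture “p”, “a” and “r”, the .* absorbs “ag”, and the backreferences then require “rap”, which gives the highlighted “paragrap”.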
Construct regular expressions to match words that:
Start and end with the same character.
str_view(words, "^(.).*\\1$")
## [36] │ <america>
## [49] │ <area>
## [209] │ <dad>
## [213] │ <dead>
## [223] │ <depend>
## [258] │ <educate>
## [266] │ <else>
## [268] │ <encourage>
## [270] │ <engine>
## [278] │ <europe>
## [283] │ <evidence>
## [285] │ <example>
## [287] │ <excuse>
## [288] │ <exercise>
## [291] │ <expense>
## [292] │ <experience>
## [296] │ <eye>
## [386] │ <health>
## [394] │ <high>
## [450] │ <knock>
## ... and 16 more
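One caveat: the pattern above needs at least two characters, so a one-letter word cannot match even though it trivially starts and ends with the same character. A possible variant, offered as a sketch rather than as the intended answer, makes everything after the first character optional. The test vector is illustrative.

```r
library(stringr)

# Pattern used above: requires a first and a separate last character
twoOrMore <- "^(.).*\\1$"

# Variant (an assumption, not from the original): the tail is optional,
# so a single character also counts as starting and ending with itself
oneOrMore <- "^(.)(.*\\1)?$"

testWords <- c("a", "dad", "eye", "dog", "ab")

matchesTwoOrMore <- str_detect(testWords, twoOrMore)
matchesOneOrMore <- str_detect(testWords, oneOrMore)
```

The two patterns agree on every word of two or more characters and differ only on the one-letter case.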
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
str_view(words, "(..).*\\1")
## [48] │ ap<propr>iate
## [152] │ <church>
## [181] │ c<ondition>
## [217] │ <decide>
## [275] │ <environmen>t
## [487] │ l<ondon>
## [598] │ pa<ragra>ph
## [603] │ p<articular>
## [617] │ <photograph>
## [638] │ p<repare>
## [641] │ p<ressure>
## [696] │ r<emem>ber
## [698] │ <repre>sent
## [699] │ <require>
## [739] │ <sense>
## [858] │ the<refore>
## [903] │ u<nderstand>
## [946] │ w<hethe>r
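The highlighted output shows where the match falls but not which pair was repeated. As a small sketch, str_match() can pull out the captured pair directly; the four words below are an illustrative subset chosen by hand.

```r
library(stringr)

sampleWords <- c("church", "remember", "decide", "dog")

# Column 2 of the str_match() result holds the first capture group,
# i.e. the two-character pair that appears again later in the word
repeatedPair <- str_match(sampleWords, "(..).*\\1")[, 2]
```

Words without a repeated pair, such as “dog”, come back as NA.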
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
str_view(words, "(.).*\\1.*\\1")
## [48] │ a<pprop>riate
## [62] │ <availa>ble
## [86] │ b<elieve>
## [90] │ b<etwee>n
## [119] │ bu<siness>
## [221] │ d<egree>
## [229] │ diff<erence>
## [233] │ di<scuss>
## [265] │ <eleve>n
## [275] │ e<nvironmen>t
## [283] │ <evidence>
## [288] │ <exercise>
## [291] │ <expense>
## [292] │ <experience>
## [423] │ <indivi>dual
## [598] │ p<aragra>ph
## [684] │ r<eceive>
## [696] │ r<emembe>r
## [698] │ r<eprese>nt
## [845] │ t<elephone>
## ... and 2 more
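As a sanity check on the regular expression, the same question can be answered by counting characters directly. The helper hasTripleLetter below is hypothetical and written only for this comparison; the sample words are an illustrative subset.

```r
library(stringr)

# Pattern from the exercise: one captured character that appears twice more
pattern <- "(.).*\\1.*\\1"

# Hypothetical helper: split the word into characters, count each one,
# and report whether any character occurs three or more times
hasTripleLetter <- function(word) {
  counts <- table(strsplit(word, "")[[1]])
  any(counts >= 3)
}

sampleWords <- c("eleven", "banana", "church", "believe")

byRegex <- str_detect(sampleWords, pattern)
byCount <- vapply(sampleWords, hasTripleLetter, logical(1), USE.NAMES = FALSE)
```

If the two vectors agree on a given set of words, that is some evidence the pattern captures the intended condition for that set.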