Manipulation and Data Processing

Author

Rashad Long

library(stringr)
Warning: package 'stringr' was built under R version 4.3.2

1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/]

majors_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
majors <- read.csv(majors_url)

Provide code that identifies the majors that contain either “DATA” or “STATISTICS”

grep(pattern = "DATA|STATISTICS", majors$Major, value = TRUE, ignore.case = TRUE)
[1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
[2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
[3] "STATISTICS AND DECISION SCIENCE"              
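
Since stringr is already loaded, the same filter could also be written with str_subset(). The following is a minimal sketch, assuming a small hypothetical vector sample_majors that stands in for majors$Major so that it runs without downloading the CSV.

```r
library(stringr)

# Hypothetical stand-in for majors$Major (not the real dataset)
sample_majors <- c(
  "STATISTICS AND DECISION SCIENCE",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "GENERAL AGRICULTURE"
)

# regex(..., ignore_case = TRUE) plays the role of grep()'s ignore.case = TRUE
data_stats <- str_subset(
  sample_majors,
  regex("DATA|STATISTICS", ignore_case = TRUE)
)
data_stats
```

On the real data, sample_majors would be replaced by majors$Major.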

2. Write code that transforms the data below:

[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"

foods <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
foods
 [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
 [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
[11] "lychee"       "mulberry"     "olive"        "salal berry" 

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

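# Quote each element, join with ", ", then wrap the result in c( ... )
foods_formatted <- paste0("c(", paste0('"', foods, '"', collapse = ", "), ")")
# writeLines() prints the string without escaping the inner quotes
writeLines(foods_formatted)
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")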

3. Describe, in words, what these expressions will match:

  • (.)\1\1
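    • As a regular expression, this matches any single character that occurs three times in a row, such as “aaa” or the “rrr” in “brrr”. Inside an R string the backslashes must be doubled, giving "(.)\\1\\1".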
sample_words <- c("aaa","bbb","xyz","bba", "cctv","brrr")
str_view(sample_words, "(.)\\1\\1")
[1] │ <aaa>
[2] │ <bbb>
[6] │ b<rrr>
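  • "(.)(.)\\2\\1"
    • This is a string that defines a regular expression matching any two characters followed by the same two characters in reverse order, such as “abba” or the “otto” in “bottom”. The two characters need not be different, so “aaaa” also matches.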
str_view(words,"(.)(.)\\2\\1")
 [19] │ after<noon>
 [43] │ <appa>rent
 [53] │ <arra>nge
[107] │ b<otto>m
[112] │ br<illi>ant
[174] │ c<ommo>n
[230] │ d<iffi>cult
[259] │ <effe>ct
[329] │ f<ollo>w
[422] │ in<deed>
[470] │ l<ette>r
[521] │ m<illi>on
[581] │ <oppo>rtunity
[582] │ <oppo>se
[877] │ tom<orro>w
  • (..)\1
    • As a regular expression, this matches any two characters immediately followed by the same two characters, such as the “emem” in “remember”.
str_view(words,"(..)\\1")
[696] │ r<emem>ber
  • "(.).\\1.\\1"
    • This is a string that defines a regular expression matching a character, then any character, then the first character again, then any character, then the first character a third time, such as the “eleve” in “eleven”.
str_view(words,"(.).\\1.\\1")
[265] │ <eleve>n
  • "(.)(.)(.).*\\3\\2\\1"
    • This is a string that defines a regular expression matching any three characters, followed by zero or more characters of any kind, followed by the same three characters in reverse order, such as the “paragrap” in “paragraph”.
str_view(words,"(.)(.)(.).*\\3\\2\\1")
[598] │ <paragrap>h
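
The descriptions in question 3 can be checked on a few hand-picked strings. This is a small sketch, assuming a made-up vector probe that is not part of the exercise. It also shows why the backslashes are doubled inside an R string: writeLines() prints the regular expression that the engine actually receives.

```r
library(stringr)

# The R string "(.)\\1\\1" holds the regular expression (.)\1\1
writeLines("(.)\\1\\1")

# Hypothetical probe strings chosen to exercise each pattern
probe <- c("abba", "aaaa", "abab", "eleven", "paragraph")

# Two characters, then the same two reversed: "abba" and "aaaa"
pair_reversed <- str_subset(probe, "(.)(.)\\2\\1")

# Two characters, then the same two again: "aaaa" and "abab"
pair_repeated <- str_subset(probe, "(..)\\1")

# Same character at every other position, three times: "eleven"
alternating <- str_subset(probe, "(.).\\1.\\1")

# Three characters, anything, then the three reversed: "paragraph"
three_reversed <- str_subset(probe, "(.)(.)(.).*\\3\\2\\1")
```

Note that “aaaa” appears in both of the first two results, which confirms that the captured characters are not required to differ.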

4. Construct regular expressions to match words that:

  • Start and end with the same character.
str_view(words, "^(.).*\\1$")
 [36] │ <america>
 [49] │ <area>
[209] │ <dad>
[213] │ <dead>
[223] │ <depend>
[258] │ <educate>
[266] │ <else>
[268] │ <encourage>
[270] │ <engine>
[278] │ <europe>
[283] │ <evidence>
[285] │ <example>
[287] │ <excuse>
[288] │ <exercise>
[291] │ <expense>
[292] │ <experience>
[296] │ <eye>
[386] │ <health>
[394] │ <high>
[450] │ <knock>
... and 16 more
  • Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
str_view(words, "(..).*\\1")
 [48] │ ap<propr>iate
[152] │ <church>
[181] │ c<ondition>
[217] │ <decide>
[275] │ <environmen>t
[487] │ l<ondon>
[598] │ pa<ragra>ph
[603] │ p<articular>
[617] │ <photograph>
[638] │ p<repare>
[641] │ p<ressure>
[696] │ r<emem>ber
[698] │ <repre>sent
[699] │ <require>
[739] │ <sense>
[858] │ the<refore>
[903] │ u<nderstand>
[946] │ w<hethe>r
  • Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
str_view(words, "(.).*\\1.*\\1")
 [48] │ a<pprop>riate
 [62] │ <availa>ble
 [86] │ b<elieve>
 [90] │ b<etwee>n
[119] │ bu<siness>
[221] │ d<egree>
[229] │ diff<erence>
[233] │ di<scuss>
[265] │ <eleve>n
[275] │ e<nvironmen>t
[283] │ <evidence>
[288] │ <exercise>
[291] │ <expense>
[292] │ <experience>
[423] │ <indivi>dual
[598] │ p<aragra>ph
[684] │ r<eceive>
[696] │ r<emembe>r
[698] │ r<eprese>nt
[845] │ t<elephone>
... and 2 more
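
One edge case for the first pattern in question 4: "^(.).*\\1$" needs at least two characters, so a one-letter word such as “a” is not returned even though it trivially starts and ends with the same character. The sketch below, on a hypothetical vector edge_cases, shows one possible variant that makes the tail optional. Whether one-letter words should count is an assumption about the intent of the exercise.

```r
library(stringr)

# Hypothetical inputs, including a one-letter word
edge_cases <- c("a", "dad", "area", "ab", "church")

# Original pattern: first character must reappear at the end (length >= 2)
strict <- str_subset(edge_cases, "^(.).*\\1$")

# Variant: the ".*\\1" tail is optional, so one-letter words also match
lenient <- str_subset(edge_cases, "^(.)(.*\\1)?$")

strict
lenient
```

Here strict keeps “dad” and “area”, while lenient additionally keeps “a”.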