Week 3 - Assignment

1. Identify Data or Stats Majors in a Table

The fivethirtyeight.com College Majors table (https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv) lists 173 college majors; the file itself loads as 174 rows. To identify the majors containing the strings “DATA” or “STATISTICS,” I first loaded the .csv from GitHub:

# Load the tidyverse (readr, dplyr, stringr), then load the csv file from GitHub and verify
library(tidyverse)

majors <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")

## Rows: 174 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

To find majors that contain the strings “DATA” or “STATISTICS” in any position, I used the stringr function str_detect to filter the dataframe:

# Use str_detect and filter function with OR operator
# (str_detect already returns TRUE/FALSE, so no comparison with 'TRUE' is needed)
majors %>% 
  filter(str_detect(Major, "DATA") | str_detect(Major, "STATISTICS"))

## # A tibble: 3 × 3
##   FOD1P Major                                         Major_Category         
##   <chr> <chr>                                         <chr>                  
## 1 6212  MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business               
## 2 2101  COMPUTER PROGRAMMING AND DATA PROCESSING      Computers & Mathematics
## 3 3702  STATISTICS AND DECISION SCIENCE               Computers & Mathematics
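
The two str_detect calls can also be collapsed into a single pattern, because "|" inside a regular expression means "or". The following is a minimal sketch of that idea, not the method used for the assignment. It assumes a small stand-in data frame (`sample_majors`, a hypothetical name) built from the three rows printed above plus one made-up non-matching row, so it runs without downloading the file.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the majors table: three rows copied from the
# output above plus one invented row that should not match.
sample_majors <- data.frame(
  FOD1P = c("6212", "2101", "3702", "0000"),
  Major = c(
    "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
    "COMPUTER PROGRAMMING AND DATA PROCESSING",
    "STATISTICS AND DECISION SCIENCE",
    "HYPOTHETICAL NON-MATCHING MAJOR"
  ),
  stringsAsFactors = FALSE
)

# "|" inside the pattern is regex alternation: match DATA or STATISTICS.
hits <- sample_majors %>%
  filter(str_detect(Major, "DATA|STATISTICS"))

hits$FOD1P
```

A single pattern keeps the filter short, while separate str_detect calls are easier to extend when each condition needs its own logic.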

Another way to accomplish this is to add flag fields to the dataframe that mark majors containing the words DATA or STATISTICS, and then print the records where either flag is TRUE. Adding flags to the dataframe (table, view) can be useful for frequently used searches, particularly those requiring complex logic:

# Add flags for Data and Stats majors
majors <- majors %>% 
  mutate(data_flag = str_detect(Major, "DATA"),
         stat_flag = str_detect(Major, "STATISTICS"))

# Print records where either is TRUE (the flags are already logical)
majors %>% 
  filter(data_flag | stat_flag)

## # A tibble: 3 × 5
##   FOD1P Major                                 Major_Category data_flag stat_flag
##   <chr> <chr>                                 <chr>          <lgl>     <lgl>    
## 1 6212  MANAGEMENT INFORMATION SYSTEMS AND S… Business       FALSE     TRUE     
## 2 2101  COMPUTER PROGRAMMING AND DATA PROCES… Computers & M… TRUE      FALSE    
## 3 3702  STATISTICS AND DECISION SCIENCE       Computers & M… FALSE     TRUE

2. Convert data in a vector to a string in a given format

First, I copied the text string provided in the question to create a vector v. I then collapsed that vector into a single string with the paste0 function, separating the elements with the characters "," (quote, comma, quote), and added c(" at the start and ") at the end; i.e., I recreated the call to the combine function used to create the vector.

# Create vector v and print
v<-c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
v

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

# Convert vector back to temporary string ("a") with commas and quotes
# (paste0 has no sep argument, so only collapse is needed)

a <- paste0(v, collapse = '","')
cat(a)

## bell pepper","bilberry","blackberry","blood orange","blueberry","cantaloupe","chili pepper","cloudberry","elderberry","lime","lychee","mulberry","olive","salal berry

# Use paste0 to add necessary beginning and ending characters to the final string "v_str"

v_str<-paste0('c("',a,'")')
cat(v_str)

## c("bell pepper","bilberry","blackberry","blood orange","blueberry","cantaloupe","chili pepper","cloudberry","elderberry","lime","lychee","mulberry","olive","salal berry")
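
As an alternative sketch, the quotes can be attached to each element before collapsing, which avoids the temporary string with unbalanced quotes at its ends. The vector is shortened here to three of the fruits to keep the example compact. The round-trip check with eval(parse()) is an addition for illustration and was not part of the assignment.

```r
# Shortened version of the fruit vector from the text.
v <- c("bell pepper", "bilberry", "blackberry")

# Quote each element first, then join the quoted elements with commas.
quoted <- paste0('"', v, '"', collapse = ",")
v_str <- paste0("c(", quoted, ")")
cat(v_str)

# Round-trip check: evaluating the string should rebuild the original vector.
identical(eval(parse(text = v_str)), v)
```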

3. Describe how given expressions will match

The given expressions will match strings as follows:

(.)\1\1 - Needs double quotes, and the backslashes need to be escaped (doubled) inside an R string. With those corrections (see below), "(.)\\1\\1" matches any character repeated three times in a row

"(.)(.)\\2\\1" - Two characters followed immediately by the same two characters in reverse order

(..)\1 - Needs double quotes, and the backslash needs to be escaped (doubled). With this done (see below), "(..)\\1" matches two characters followed immediately by the same two characters in the same order

"(.).\\1.\\1" - One character followed by another, followed by the first character again, followed by another character, followed by the first character again (i.e. a string of five characters, with one character repeating in places 1, 3, and 5)

"(.)(.)(.).*\\3\\2\\1" - Three characters, followed by zero or more other characters, then the same three characters in reverse order

test<-c("ed","bed","hundred","anna","banana","aaabbaaa")
str_view(test,"(.)\\1\\1")

## [6] │ <aaa>bb<aaa>

str_view(test,"(.)(.)\\2\\1")

## [4] │ <anna>
## [6] │ aa<abba>aa

str_view(test,"(..)\\1")

## [5] │ b<anan>a

str_view(test,"(.).\\1.\\1")

## [5] │ b<anana>

str_view(test,"(.)(.)(.).*\\3\\2\\1")

## [6] │ <aaabbaaa>
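
The escaping note for the first expression can be checked directly. The sketch below uses only base R (nchar and grepl with perl = TRUE) rather than str_view, so it has no package dependencies. The raw-string line assumes R 4.0.0 or later.

```r
test <- c("ed", "bed", "hundred", "anna", "banana", "aaabbaaa")

# Unescaped: R reads "\1" as the control character with octal code 1,
# so this string has only 5 characters and contains no backreference.
unescaped <- "(.)\1\1"
nchar(unescaped)

# Escaped: "\\1" is a literal backslash followed by 1 (7 characters in all).
escaped <- "(.)\\1\\1"
nchar(escaped)

# Raw strings (R 4.0.0 or later) avoid the doubling altogether.
identical(r"((.)\1\1)", escaped)

# Only the escaped pattern finds three identical characters in a row.
grepl(escaped, test, perl = TRUE)
grepl(unescaped, test, perl = TRUE)
```

The unescaped version is still a valid R string, which is why it fails silently instead of raising an error.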

4. Construct regular expressions

Construct regular expressions to match words that:

Start and end with the same character: str_view(words, "(^.).*\\1$")
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.): str_view(words, "(..).*\\1")
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.): str_view(words, ".*(.).*\\1.*\\1")

# Start and end with the same character
str_view(words,"(^.).*\\1$")

##  [36] │ <america>
##  [49] │ <area>
## [209] │ <dad>
## [213] │ <dead>
## [223] │ <depend>
## [258] │ <educate>
## [266] │ <else>
## [268] │ <encourage>
## [270] │ <engine>
## [278] │ <europe>
## [283] │ <evidence>
## [285] │ <example>
## [287] │ <excuse>
## [288] │ <exercise>
## [291] │ <expense>
## [292] │ <experience>
## [296] │ <eye>
## [386] │ <health>
## [394] │ <high>
## [450] │ <knock>
## ... and 16 more
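
One edge case of this pattern: it needs at least two characters, so a one-letter word such as "a" is not matched. Whether such words should count is a matter of interpretation. The sketch below compares the pattern from the text with a variant that also accepts one-letter words. The variant and the hand-picked `sample_words` vector are illustrative assumptions, and base R grepl is used to keep the example dependency-free.

```r
# Small hand-picked sample; "a" is a one-letter word.
sample_words <- c("area", "dad", "a", "church", "eleven")

# Pattern from the text: the backreference forces a second character.
two_plus <- grepl("(^.).*\\1$", sample_words, perl = TRUE)

# Variant (an assumption about the intended reading, not part of the
# assignment): making the tail optional also accepts one-letter words.
with_single <- grepl("^(.)(.*\\1)?$", sample_words, perl = TRUE)

data.frame(word = sample_words, two_plus, with_single)
```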

# Contain a repeated pair of letters
str_view(words,"(..).*\\1")

##  [48] │ ap<propr>iate
## [152] │ <church>
## [181] │ c<ondition>
## [217] │ <decide>
## [275] │ <environmen>t
## [487] │ l<ondon>
## [598] │ pa<ragra>ph
## [603] │ p<articular>
## [617] │ <photograph>
## [638] │ p<repare>
## [641] │ p<ressure>
## [696] │ r<emem>ber
## [698] │ <repre>sent
## [699] │ <require>
## [739] │ <sense>
## [858] │ the<refore>
## [903] │ u<nderstand>
## [946] │ w<hethe>r
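
To see which pair of letters is the repeated one, the capture group can be extracted instead of highlighted. This is a minimal sketch using base R's regexec and regmatches on three words taken from the output above. stringr's str_match would be the closer analogue to the code in this report, but base R keeps the example free of package dependencies.

```r
sample_words <- c("church", "condition", "remember")

# regexec() returns the full match plus each capture group;
# element 2 of every result is the text captured by (..).
m <- regmatches(sample_words, regexec("(..).*\\1", sample_words, perl = TRUE))
pairs <- vapply(m, function(x) x[2], character(1))

setNames(pairs, sample_words)
```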

# Contain one letter repeated in at least three places
str_view(words,".*(.).*\\1.*\\1")

##  [48] │ <approp>riate
##  [62] │ <availa>ble
##  [86] │ <believe>
##  [90] │ <betwee>n
## [119] │ <business>
## [221] │ <degree>
## [229] │ <difference>
## [233] │ <discuss>
## [265] │ <eleve>n
## [275] │ <environmen>t
## [283] │ <evidence>
## [288] │ <exercise>
## [291] │ <expense>
## [292] │ <experience>
## [423] │ <indivi>dual
## [598] │ <paragra>ph
## [684] │ <receive>
## [696] │ <remembe>r
## [698] │ <represe>nt
## [845] │ <telephone>
## ... and 2 more
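
The three-occurrence pattern can be cross-checked without regular expressions by counting letters. The sketch below does both on a small hand-picked sample (`sample_words` is an illustrative assumption, not the full words vector). It drops the leading ".*" from the pattern, which affects only how much of the word str_view highlights, not which words match.

```r
sample_words <- c("eleven", "banana", "church", "area")

# Regex check: some character, then the same character twice more,
# with anything in between.
by_regex <- grepl("(.).*\\1.*\\1", sample_words, perl = TRUE)

# Independent check: split each word into characters and test whether
# the most frequent character occurs at least three times.
by_count <- vapply(
  strsplit(sample_words, ""),
  function(chars) max(table(chars)) >= 3,
  logical(1)
)

data.frame(word = sample_words, by_regex, by_count)
```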

Amanda Fox

2024-02-09
