DATA606-Assignment3

Reading in FiveThirtyEight data, calling it “majors”

majors <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv')

Loading the dplyr package for data manipulation and the stringr package for string manipulation/regex

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)

#1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset, provide code that identifies the majors that contain either “DATA” or “STATISTICS”

majors %>% 
  filter(grepl('STATISTICS', Major) | grepl('DATA', Major))

##   Rank Major_code                                         Major Total   Men
## 1   25       6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS 18713 13496
## 2   47       3702               STATISTICS AND DECISION SCIENCE  6251  2960
## 3   54       2101      COMPUTER PROGRAMMING AND DATA PROCESSING  4168  3046
##   Women          Major_category ShareWomen Sample_size Employed Full_time
## 1  5217                Business  0.2787901         278    16413     15141
## 2  3291 Computers & Mathematics  0.5264758          37     4247      3190
## 3  1122 Computers & Mathematics  0.2691939          43     3257      3204
##   Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th
## 1      2420                13017       1015        0.05823961  51000 38000
## 2      1840                 2151        401        0.08627367  45000 26700
## 3       482                 2453        419        0.11398259  41300 20000
##   P75th College_jobs Non_college_jobs Low_wage_jobs
## 1 60000         6342             5741           708
## 2 60000         2298             1200           343
## 3 46000         2024             1033           263
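The two grepl calls can also be collapsed into a single pattern using the regex alternation operator |. Below is a minimal sketch, assuming a small hypothetical data frame named majors_demo (a stand-in for majors, not part of the assignment data) so that it runs without downloading the CSV.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the `majors` data frame (three rows only)
majors_demo <- data.frame(
  Major = c("STATISTICS AND DECISION SCIENCE",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "PETROLEUM ENGINEERING"),
  stringsAsFactors = FALSE
)

# "DATA|STATISTICS" matches a Major containing either word
matches <- majors_demo %>%
  filter(str_detect(Major, "DATA|STATISTICS"))

matches$Major
```

Applied to the real data, the same filter call on majors should return the three rows shown above.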

#2 Write code that transforms the data below:

[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

For this problem, I’m assuming the index-labeled fruit strings are an output from a df, as in this df I’ll create named “fruit”.

fruit <- data.frame(price = c(1.50, 0.25, 0.30, 2.00, 0.15, 5.00, 1.00, 0.50, 0.60, 0.35, 0.75, 0.60, 0.10, 0.40),
                    name = c('bell pepper', ' bilberry', 'blackberry', 'blood orange', 'blueberry', 'cantaloupe', 'chili pepper', 'cloudberry', 'elderberry', 'lime', 'lychee', 'mulberry', 'olive' ,'salal berry'),
                    type = c('Berry','Berry','Berry','Citrus','Berry','Melon','Berry','Berry','Berry','Citrus','Berry','Berry','Stone','Berry'))

In this scenario, the index-labeled fruit strings output would be generated by a call to the dataframe’s relevant column, such as:

fruit$name

##  [1] "bell pepper"  " bilberry"    "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
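The output above shows a leading space in " bilberry", which comes from how the name was typed into the data frame. If that space is unintended, stringr's str_trim can strip it. A short sketch, assuming a hypothetical three-element vector raw_names rather than the full column:

```r
library(stringr)

# Hypothetical sample that includes the stray leading space
raw_names <- c("bell pepper", " bilberry", "blackberry")

# str_trim removes whitespace from both ends of each string
clean_names <- str_trim(raw_names)
clean_names
```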

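Base R has a function, as.vector, which coerces its argument into a vector. Here fruit$name is already a character vector (the output above shows quoted strings rather than factor levels), so the call mainly makes the intent explicit. I’ll save the above output to a new variable, fruit_names, which contains this vector.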

fruit_names <- as.vector(fruit$name)

Now that my data frame’s column is a stand-alone vector, I can transform it, for example by appending a string to each value with the paste function.

paste(fruit_names,"are delicious",sep=" ")

##  [1] "bell pepper are delicious"  " bilberry are delicious"   
##  [3] "blackberry are delicious"   "blood orange are delicious"
##  [5] "blueberry are delicious"    "cantaloupe are delicious"  
##  [7] "chili pepper are delicious" "cloudberry are delicious"  
##  [9] "elderberry are delicious"   "lime are delicious"        
## [11] "lychee are delicious"       "mulberry are delicious"    
## [13] "olive are delicious"        "salal berry are delicious"
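The steps above leave the names as a character vector. If the goal is the literal c(...) text shown in the prompt, one possible approach is to quote each element and collapse the pieces with paste. This is only a sketch: fruit_demo, quoted, and as_code are hypothetical names, and the leading space in " bilberry" is assumed to have been removed.

```r
# Hypothetical copy of the 14 names, without the stray leading space
fruit_demo <- c("bell pepper", "bilberry", "blackberry", "blood orange",
                "blueberry", "cantaloupe", "chili pepper", "cloudberry",
                "elderberry", "lime", "lychee", "mulberry",
                "olive", "salal berry")

# Wrap each name in double quotes, then join with ", " inside c( ... )
quoted  <- paste0('"', fruit_demo, '"')
as_code <- paste0("c(", paste(quoted, collapse = ", "), ")")

# writeLines prints the string without escaping the inner quotes
writeLines(as_code)

# Base R's dput() prints a deparsed representation of the vector directly
dput(fruit_demo)
```

Parsing and evaluating as_code gives back the original vector, which is a quick way to check that the generated text is valid R.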

#3 Describe, in words, what these expressions will match:

For this exercise, I’m assuming we’re using the fruit_names vector defined above. It includes: ‘bell pepper’, ’ bilberry’, ‘blackberry’, ‘blood orange’, ‘blueberry’, ‘cantaloupe’, ‘chili pepper’, ‘cloudberry’, ‘elderberry’, ‘lime’, ‘lychee’, ‘mulberry’, ‘olive’ ,‘salal berry’

#3a (.)\1\1

- As written, this code will not run: the regex lacks quotes, and there’s only one backslash before the backreference group number 1 (an R string needs two). I’ll analyze it assuming those two issues are fixed.
- The dot is a wildcard character. Since there’s only one, it means “match any single character”.
- Parentheses form a capture group, so whatever matched inside them can be referenced later in the regex. There’s only one group in this instance.
- \1\1 means the captured character must repeat two more times (3 times in total).
- If the regex were "(.)\\1\\1", it would match any single character repeated 3 times in a row.

#Corrected regex, which finds any single character which repeats 3 times, like 'aaa'
str_view('aaa bb c',"(.)\\1\\1")

## [1] │ <aaa> bb c

#Doesn't produce anything since no text in fruit_names contains a single character repeating 3 times in a row.
str_view(fruit_names,"(.)\\1\\1")
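The backslash issue noted for 3a can be checked directly in base R. The sketch below assumes R 4.0.0 or later for the raw-string syntax; the variable names are hypothetical.

```r
# The regex engine needs (.)\1\1; inside an ordinary R string each
# backslash must itself be escaped
pattern <- "(.)\\1\\1"
writeLines(pattern)   # shows the pattern as the regex engine receives it

# With a single backslash, "\1" is an escape for the character with
# code 1, not a backreference
single <- "\1"
nchar(single)
utf8ToInt(single)

# Raw strings (assumes R >= 4.0.0) avoid the doubling
raw_pattern <- r"((.)\1\1)"
identical(pattern, raw_pattern)
```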

#3b “(.)(.)\2\1”

- (.)(.) matches any two characters, capturing each one in its own group
- \2 backreferences the second group’s match
- \1 backreferences the first group’s match
- To summarize, since the second group is referenced before the first, this regex matches a pair of characters followed by the same pair in reverse order, such as abba

#Finds character group then character group reversed, like ' bb ', 'dddd', '1221', and 'abba'
str_view('aaa bb c dddd 1221 abba 1212',"(.)(.)\\2\\1")

## [1] │ aaa< bb >c <dddd> <1221> <abba> 1212

#Within fruit_names text, matches p<eppe>r within "bell pepper" and "chili pepper"
str_view(fruit_names,"(.)(.)\\2\\1")

## [1] │ bell p<eppe>r
## [7] │ chili p<eppe>r
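To see what each capture group holds in a match like this, stringr's str_match returns the full match followed by one column per group. A small sketch on one of the fruit names (m is a hypothetical variable name):

```r
library(stringr)

# Column 1 is the whole match; columns 2 and 3 are groups 1 and 2
m <- str_match("bell pepper", "(.)(.)\\2\\1")
m
```

Here group 1 is "e" and group 2 is "p", so \2\1 requires "pe" to follow "ep", giving "eppe".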

#3c (..)\1

- As written, this code will not run: the regex lacks quotes, and there’s only one backslash before the backreference group number 1. I’ll analyze it assuming those two issues are fixed.
- (..) matches any two characters and captures them together as one group
- \1 backreferences the first group, requiring an immediate repeat of those same two characters
- To summarize, "(..)\\1" matches any two characters which immediately repeat, such as 1212, abab, and dddd.

#Finds character pair which repeats, like in the below dddd and 1212
str_view('aaa bb c dddd 1221 abba 1212',"(..)\\1")

## [1] │ aaa bb c <dddd> 1221 abba <1212>

#Within fruit_names text, matches s<alal> within "salal berry"
str_view(fruit_names,"(..)\\1")

## [14] │ s<alal> berry
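The same check can be made without str_view when only the matching elements or the matched text are needed. A sketch using str_subset and str_extract, assuming a hypothetical copy of the fruit names called fruit_demo:

```r
library(stringr)

# Hypothetical copy of the fruit names used in this exercise
fruit_demo <- c("bell pepper", "bilberry", "blackberry", "blood orange",
                "blueberry", "cantaloupe", "chili pepper", "cloudberry",
                "elderberry", "lime", "lychee", "mulberry",
                "olive", "salal berry")

# Elements containing a two-character pair that immediately repeats
hits <- str_subset(fruit_demo, "(..)\\1")
hits

# The repeated text itself
str_extract(hits, "(..)\\1")
```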

#3d “(.).\1.\1” - This is similar to 3a, except that exactly one arbitrary character sits between each repeat: the captured character appears three times with two single-character gaps, such as ababa or a a a.

#Within the below, views 1. "aba a" (the leftmost match, which uses up the first "a" of "ababa" so "ababa" cannot also match) and 2. "ddddd"
str_view('abca abcc cbaba ababa ddddd',"(.).\\1.\\1")

## [1] │ abca abcc cb<aba a>baba <ddddd>

#Within fruit_names text, doesn't match anything
str_view(fruit_names,"(.).\\1.\\1")
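The reason "ababa" is not highlighted in the test string is that matches are found left to right and do not overlap. The sketch below, using str_extract_all on two hypothetical inputs, shows "ababa" matching on its own but losing its first letter to an earlier match when preceded by "cbaba ".

```r
library(stringr)

pattern <- "(.).\\1.\\1"

# On its own, "ababa" matches in full
alone <- str_extract_all("ababa", pattern)[[1]]
alone

# Preceded by "cbaba ", the earlier match "aba a" consumes the first "a",
# and the remaining "baba" is too short to match
shared <- str_extract_all("cbaba ababa", pattern)[[1]]
shared
```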

#3e “(.)(.)(.).*\3\2\1”

- This is similar to 3b except 1. this regex captures three wildcard characters instead of two and looks for the same three in reverse order, and 2. the Kleene star asterisk * matches the preceding wildcard . zero (optional) or more times
- The Kleene star asterisk * means there can be any number of characters between the three captured characters and their reversal, whether 0 like abccba or 9 like abc123456789cba

#The below R code views "abca abcc cba" and " ababa "
str_view('abca abcc cbaba ababa ddddd',"(.)(.)(.).*\\3\\2\\1")

## [1] │ <abca abcc cba>ba< ababa >ddddd

#Within fruit_names text, doesn't view anything
str_view(fruit_names,"(.)(.)(.).*\\3\\2\\1")
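Because .* is greedy, it stretches the match to the last possible reversal rather than the first, which is why the first match in the test string runs all the way to "cba". A sketch contrasting the greedy form with a lazy variant (.*? is not part of the exercise, and the input string is hypothetical):

```r
library(stringr)

x <- "abc1cba2cba"

# Greedy: .* takes as much as it can while still leaving "cba" to match
greedy <- str_extract(x, "(.)(.)(.).*\\3\\2\\1")
greedy

# Lazy: .*? takes as little as it can, stopping at the first "cba"
lazy <- str_extract(x, "(.)(.)(.).*?\\3\\2\\1")
lazy
```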

#4 Construct regular expressions to match words that:

#4a Start and end with the same character. Answer = “^(.)(.*\1$)”

#Vector to test. Should view "racecar", "dad", "stress", and "high".
test4a <- c('apple','banana','orange','blueberry','racecar','umbrella','dad','stress','high','church')

#Passes check
str_view(test4a,"^(.)(.*\\1$)")

## [5] │ <racecar>
## [7] │ <dad>
## [8] │ <stress>
## [9] │ <high>
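One edge case: the answer above requires at least two characters, so a one-letter word such as "a" does not match. Whether a single character counts as starting and ending with the same character is a judgment call; if it should, one possible variant makes the tail optional. This is a sketch with a hypothetical test vector, not a change to the answer above.

```r
library(stringr)

# Hypothetical test words, including a one-letter word
words <- c("a", "dad", "ab")

# Original answer: needs a second occurrence of the first character
original <- str_detect(words, "^(.)(.*\\1$)")
original

# Variant: the optional group lets a single character match as well
variant <- str_detect(words, "^(.)(.*\\1)?$")
variant
```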

#4b Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.) Answer = “([A-Za-z][A-Za-z]).*\1”

#Vector to test. Should view <anan> from "banana" and "church"
test4a <- c('apple','banana','orange','blueberry','racecar','umbrella','dad','stress','high','church')

#Passes check
str_view(test4a,"([A-Za-z][A-Za-z]).*\\1")

##  [2] │ b<anan>a
## [10] │ <church>

#4c Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.) Answer = “([A-Za-z]).*\1.*\1”

#Vector to test. Should view "b<anana>" (three As) and "<stress>" (three Ss)
test4a <- c('apple','banana','orange','blueberry','racecar','umbrella','dad','stress','high','church')

#Passes check
str_view(test4a,"([A-Za-z]).*\\1.*\\1")

## [2] │ b<anana>
## [8] │ <stress>
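As a cross-check on the 4c regex, the same question can be answered by counting letters directly. The sketch below compares the two approaches on the test vector; words, by_regex, and by_count are hypothetical names, and the counting approach is an independent check rather than part of the answer.

```r
library(stringr)

words <- c("apple", "banana", "orange", "blueberry", "racecar",
           "umbrella", "dad", "stress", "high", "church")

# Regex approach from the answer above
by_regex <- str_detect(words, "([A-Za-z]).*\\1.*\\1")

# Counting approach: split into characters and check whether the most
# frequent character occurs at least 3 times
by_count <- vapply(strsplit(words, ""),
                   function(chars) max(table(chars)) >= 3L,
                   logical(1))

words[by_regex]
identical(by_regex, by_count)
```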

Ross Boehme

2023-02-10