Reading in FiveThirtyEight data, calling it “majors”
majors <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv')
Loading the dplyr package for data manipulation and the stringr package for string manipulation/regex
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
#1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset, provide code that identifies the majors that contain either “DATA” or “STATISTICS”
majors %>%
filter(grepl('STATISTICS', Major) | grepl('DATA', Major))
## Rank Major_code Major Total Men
## 1 25 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS 18713 13496
## 2 47 3702 STATISTICS AND DECISION SCIENCE 6251 2960
## 3 54 2101 COMPUTER PROGRAMMING AND DATA PROCESSING 4168 3046
## Women Major_category ShareWomen Sample_size Employed Full_time
## 1 5217 Business 0.2787901 278 16413 15141
## 2 3291 Computers & Mathematics 0.5264758 37 4247 3190
## 3 1122 Computers & Mathematics 0.2691939 43 3257 3204
## Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th
## 1 2420 13017 1015 0.05823961 51000 38000
## 2 1840 2151 401 0.08627367 45000 26700
## 3 482 2453 419 0.11398259 41300 20000
## P75th College_jobs Non_college_jobs Low_wage_jobs
## 1 60000 6342 5741 708
## 2 60000 2298 1200 343
## 3 46000 2024 1033 263
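The two grepl calls above can also be folded into one pattern with regex alternation. This is a minimal sketch, assuming a small hand-typed vector (the hypothetical name sample_majors) in place of the downloaded data so that it runs offline:

```r
# Hypothetical stand-in for majors$Major: the first three names come from
# the output above, the last is an illustrative non-matching entry.
sample_majors <- c("MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
                   "STATISTICS AND DECISION SCIENCE",
                   "COMPUTER PROGRAMMING AND DATA PROCESSING",
                   "PETROLEUM ENGINEERING")

# "|" means "either side", so a single pattern covers both keywords
hits <- sample_majors[grepl("DATA|STATISTICS", sample_majors)]
print(hits)
```

On the full data frame, the same pattern would slot into the existing pipeline as filter(grepl('DATA|STATISTICS', Major)).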
#2 Write code that transforms the data below:
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
#Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
For this problem, I’m assuming the index-labeled fruit strings are the printed output of a data frame column, as in the data frame I’ll create here, named “fruit”.
fruit <- data.frame(price = c(1.50, 0.25, 0.30, 2.00, 0.15, 5.00, 1.00, 0.50, 0.60, 0.35, 0.75, 0.60, 0.10, 0.40),
name = c('bell pepper', ' bilberry', 'blackberry', 'blood orange', 'blueberry', 'cantaloupe', 'chili pepper', 'cloudberry', 'elderberry', 'lime', 'lychee', 'mulberry', 'olive' ,'salal berry'),
type = c('Berry','Berry','Berry','Citrus','Berry','Melon','Berry','Berry','Berry','Citrus','Berry','Berry','Stone','Berry'))
In this scenario, the index-labeled fruit strings output would be generated by a call to the dataframe’s relevant column, such as:
fruit$name
## [1] "bell pepper" " bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
Base R has a function as.vector which coerces its argument into a vector. I’ll save the above output to a new variable, fruit_names, which contains this vector.
fruit_names <- as.vector(fruit$name)
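Whether as.vector changes anything depends on how the column is stored. The sketch below uses a hypothetical two-row data frame (demo) and forces stringsAsFactors = TRUE, which was the data.frame default before R 4.0.0, to show the case where the coercion matters. It also shows trimws, which could be used to drop a stray leading space like the one in ' bilberry'.

```r
# Hypothetical two-row data frame; stringsAsFactors = TRUE mimics the
# pre-4.0.0 default, where string columns were stored as factors.
demo <- data.frame(name = c("lime", " bilberry"), stringsAsFactors = TRUE)

was_factor <- is.factor(demo$name)     # the column starts out as a factor
demo_names <- as.vector(demo$name)     # as.vector turns it into plain character
demo_trimmed <- trimws(demo_names)     # strip leading/trailing whitespace

print(demo_trimmed)
```

In R 4.0.0 and later, string columns are character vectors by default, so as.vector leaves the values unchanged there.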
Now that the data frame’s column is a stand-alone vector, I can transform it, for example by appending a string to each value with the paste function.
paste(fruit_names,"are delicious",sep=" ")
## [1] "bell pepper are delicious" " bilberry are delicious"
## [3] "blackberry are delicious" "blood orange are delicious"
## [5] "blueberry are delicious" "cantaloupe are delicious"
## [7] "chili pepper are delicious" "cloudberry are delicious"
## [9] "elderberry are delicious" "lime are delicious"
## [11] "lychee are delicious" "mulberry are delicious"
## [13] "olive are delicious" "salal berry are delicious"
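If the goal is to reproduce the literal c(...) text from the problem statement, one possible approach is to quote each element and join the pieces with paste0. This is a sketch only; fruit_names_clean and fruit_call are hypothetical names, and the vector is typed in without the leading space on "bilberry".

```r
# Hypothetical clean copy of the fruit names from the problem statement
fruit_names_clean <- c("bell pepper", "bilberry", "blackberry", "blood orange",
                       "blueberry", "cantaloupe", "chili pepper", "cloudberry",
                       "elderberry", "lime", "lychee", "mulberry",
                       "olive", "salal berry")

# Wrap each name in double quotes, join with ", ", then wrap in c( )
quoted <- paste0('"', fruit_names_clean, '"')
fruit_call <- paste0("c(", paste(quoted, collapse = ", "), ")")

# cat prints the string without R's own escaping of the inner quotes
cat(fruit_call, "\n")
```

Base R’s dput(fruit_names_clean) prints a similar c(...) representation without any string handling.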
#3 Describe, in words, what these expressions will match:
For this exercise, I’m assuming we’re using the fruit_names vector defined above. It includes: 'bell pepper', ' bilberry' (with its leading space), 'blackberry', 'blood orange', 'blueberry', 'cantaloupe', 'chili pepper', 'cloudberry', 'elderberry', 'lime', 'lychee', 'mulberry', 'olive', 'salal berry'
#3a (.)\1\1
- As printed, this is the regex itself rather than an R string. To use it in R it needs quotes, and each backslash must be doubled ("(.)\\1\\1"), because a single backslash followed by a digit inside an R string is read as an escape sequence instead of a regex backreference. I’ll analyze it assuming those two issues are fixed.
- The dot is the wildcard character. Since there’s only one, it means “match any single character”.
- Parentheses create a capture group, whose matched text can be referred to later in the regex. There’s only one group in this instance.
- \1\1 means the character captured by group 1 must repeat two more times immediately afterwards (3 times in total).
- So “(.)\\1\\1” finds any single character that appears 3 times in a row.
#Corrected regex, which finds any single character which repeats 3 times, like 'aaa'
str_view('aaa bb c',"(.)\\1\\1")
## [1] │ <aaa> bb c
#Doesn't produce anything since no string in fruit_names contains a single character repeated 3 times in a row.
str_view(fruit_names,"(.)\\1\\1")
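The backslash doubling can be checked directly in base R. The sketch below compares the R string with what the regex engine receives; pattern is a hypothetical variable name, and grepl with perl = TRUE stands in for str_view so that no package is needed.

```r
pattern <- "(.)\\1\\1"

# writeLines shows the regex as the engine sees it, one backslash each
writeLines(pattern)
n_doubled <- nchar(pattern)      # 7 characters: ( . ) \ 1 \ 1

# With a single backslash, R reads "\1" as an octal escape for the control
# character with code 1, so no backreference reaches the regex engine
n_single <- nchar("(.)\1\1")     # 5 characters

# Base R check of the same behaviour shown by str_view above
found <- grepl(pattern, c("aaa bb c", "bell pepper"), perl = TRUE)
print(found)
```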
#3b “(.)(.)\2\1”
- (.)(.) matches any two consecutive characters, capturing the first in group 1 and the second in group 2
- \2 backreferences the character captured by group 2
- \1 backreferences the character captured by group 1
- To summarize, because group 2 is referenced before group 1, this regex matches two characters followed immediately by the same two characters in reverse order, as in “abba”
#Finds character group then character group reversed, like ' bb ', 'dddd', '1221', and 'abba'
str_view('aaa bb c dddd 1221 abba 1212',"(.)(.)\\2\\1")
## [1] │ aaa< bb >c <dddd> <1221> <abba> 1212
#Within fruit_names text, matches p<eppe>r within "bell pepper" and "chili pepper"
str_view(fruit_names,"(.)(.)\\2\\1")
## [1] │ bell p<eppe>r
## [7] │ chili p<eppe>r
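To pull the matches out as a character vector instead of viewing them, base R’s regmatches and gregexpr can be combined. This is a sketch on the same test string, with test_string and mirrored as hypothetical names:

```r
test_string <- "aaa bb c dddd 1221 abba 1212"

# gregexpr locates every non-overlapping match; regmatches extracts the text
mirrored <- regmatches(test_string,
                       gregexpr("(.)(.)\\2\\1", test_string, perl = TRUE))[[1]]
print(mirrored)
```

The first match includes the spaces around "bb", because a space is as valid a capture for (.) as any letter.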
#3c (..)\1
- As with 3a, this is printed as the bare regex. In R code it needs quotes and a doubled backslash before the backreference number. I’ll analyze it assuming those two issues are fixed.
- (..) matches any two consecutive characters and captures them together as group 1
- \1 backreferences group 1, so the same two characters must repeat immediately
- To summarize, “(..)\\1” matches any two characters that repeat straight away, such as 1212, abab, and dddd.
#Finds character pair which repeats, like in the below dddd and 1212
str_view('aaa bb c dddd 1221 abba 1212',"(..)\\1")
## [1] │ aaa bb c <dddd> 1221 abba <1212>
#Within fruit_names text, matches s<alal> within "salal berry"
str_view(fruit_names,"(..)\\1")
## [14] │ s<alal> berry
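The difference between 3b and 3c is easiest to see side by side. The following sketch runs both patterns over a hypothetical vector (pair_words) with base R’s grepl:

```r
pair_words <- c("abab", "abba", "dddd", "salal berry")

# 3c: a two-character unit repeated in the same order
repeated_pair <- grepl("(..)\\1", pair_words, perl = TRUE)

# 3b: two characters followed by the same two in reverse order
mirrored_pair <- grepl("(.)(.)\\2\\1", pair_words, perl = TRUE)

print(data.frame(pair_words, repeated_pair, mirrored_pair))
```

Only "dddd" satisfies both, since a run of one repeated character reads the same in either order.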
#3d “(.).\1.\1”
- This is similar to 3a, except that any single character may sit between each repeat: the captured character appears three times, at every other position, across five characters. Examples are ababa or a a a.
#Within the below, matches 1. "aba a" (matches are found left to right and cannot overlap, so this one starts inside "cbaba" and uses up the first "a" of "ababa") and 2. "ddddd"
str_view('abca abcc cbaba ababa ddddd',"(.).\\1.\\1")
## [1] │ abca abcc cb<aba a>baba <ddddd>
#Within fruit_names text, doesn't match anything
str_view(fruit_names,"(.).\\1.\\1")
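The "aba a" result comes from leftmost, non-overlapping matching, not from greediness (this pattern has no quantifier). A small base R sketch, using the hypothetical names long_string and alt_pattern, shows that "ababa" matches in full when nothing earlier in the string claims its first character:

```r
alt_pattern <- "(.).\\1.\\1"
long_string <- "abca abcc cbaba ababa ddddd"

# In the long string, the first match starts inside "cbaba" and runs into "ababa"
in_context <- regmatches(long_string,
                         gregexpr(alt_pattern, long_string, perl = TRUE))[[1]]

# On its own, "ababa" is matched as a whole
alone <- regmatches("ababa", gregexpr(alt_pattern, "ababa", perl = TRUE))[[1]]

print(in_context)
print(alone)
```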
#3e “(.)(.)(.).*\3\2\1”
- This is similar to 3b, except that 1. it captures three characters instead of two, and the backreferences \3\2\1 require those same three characters to appear again in reverse order, and 2. the Kleene star * applies to the wildcard . that precedes it, matching it 0 or more times.
- The .* means there can be any number of characters between the three captured characters and their reversed repeat, whether 0 as in abccba or 9 as in abc123456789cba
#The below R code views "abca abcc cba" and " ababa "
str_view('abca abcc cbaba ababa ddddd',"(.)(.)(.).*\\3\\2\\1")
## [1] │ <abca abcc cba>ba< ababa >ddddd
#Within fruit_names text, doesn't view anything
str_view(fruit_names,"(.)(.)(.).*\\3\\2\\1")
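Because .* is greedy, the match runs to the last place where the reversed triple can be found. The sketch below contrasts it with the lazy form .*? on a hypothetical string (gap_string) chosen so that the two differ:

```r
gap_string <- "abc1cba2cba"

# Greedy: .* grabs as much as possible, so the match ends at the final "cba"
greedy <- regmatches(gap_string,
                     regexpr("(.)(.)(.).*\\3\\2\\1", gap_string, perl = TRUE))

# Lazy: .*? grabs as little as possible, so the match ends at the first "cba"
lazy <- regmatches(gap_string,
                   regexpr("(.)(.)(.).*?\\3\\2\\1", gap_string, perl = TRUE))

# A gap of zero characters is also allowed
zero_gap <- grepl("(.)(.)(.).*\\3\\2\\1", "abccba", perl = TRUE)

print(c(greedy = greedy, lazy = lazy))
```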
#4 Construct regular expressions to match words that:
#4a Start and end with the same character. Answer = “^(.)(.*\1$)”
#Vector to test. Should view "racecar", "dad", "stress", and "high".
test4a <- c('apple','banana','orange','blueberry','racecar','umbrella','dad','stress','high','church')
#Passes check
str_view(test4a,"^(.)(.*\\1$)")
## [5] │ <racecar>
## [7] │ <dad>
## [8] │ <stress>
## [9] │ <high>
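One edge case worth noting: the pattern needs at least two characters, so a one-letter word such as "a" is not matched. If single-character words should count (an assumption, since the exercise does not say), a possible variant makes the tail optional. Here edge_words, strict, and lenient are hypothetical names:

```r
edge_words <- c("a", "dad", "high", "apple")

# Original answer: the first character must reappear as a separate last character
strict <- grepl("^(.)(.*\\1$)", edge_words, perl = TRUE)

# Variant: the "anything, then the first character again" part is optional
lenient <- grepl("^(.)(.*\\1)?$", edge_words, perl = TRUE)

print(data.frame(edge_words, strict, lenient))
```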
#4b Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.) Answer = “([A-Za-z][A-Za-z]).*\1”
#Vector to test. Should view <anan> from "banana" and "church"
test4a <- c('apple','banana','orange','blueberry','racecar','umbrella','dad','stress','high','church')
#Passes check
str_view(test4a,"([A-Za-z][A-Za-z]).*\\1")
## [2] │ b<anan>a
## [10] │ <church>
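The captured pair and its repeat cannot share characters, which matters for runs of a single letter. A short base R sketch with a few extra hypothetical words (pair_check) illustrates this:

```r
pair_pattern <- "([A-Za-z][A-Za-z]).*\\1"
pair_check <- c("church", "banana", "aaa", "aaaa", "eleven")

# "aaa" fails: after capturing "aa" only one "a" is left for the repeat
has_pair <- grepl(pair_pattern, pair_check, perl = TRUE)

print(setNames(has_pair, pair_check))
```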
#4c Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.) Answer = “([A-Za-z]).*\1.*\1”
#Vector to test. Should view "b<anana>" (three As) and "<stress>" (three Ss)
test4a <- c('apple','banana','orange','blueberry','racecar','umbrella','dad','stress','high','church')
#Passes check
str_view(test4a,"([A-Za-z]).*\\1.*\\1")
## [2] │ b<anana>
## [8] │ <stress>
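The .* between the backreferences is what allows any spacing between the three occurrences. The sketch below compares it with a single-dot version on a hypothetical vector (triple_words) that includes the exercise’s own example, "eleven":

```r
triple_words <- c("eleven", "banana", "stress", "apple")

# Any number of characters may separate the three occurrences
any_gap <- grepl("([A-Za-z]).*\\1.*\\1", triple_words, perl = TRUE)

# Exactly one character must separate them, which is too strict for "stress"
one_gap <- grepl("([A-Za-z]).\\1.\\1", triple_words, perl = TRUE)

print(data.frame(triple_words, any_gap, one_gap))
```

"eleven" and "banana" pass both versions only because their repeated letters happen to sit at every other position.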