library('dplyr')
library(stringr)
Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset, provide code that identifies the majors that contain either “DATA” or “STATISTICS”.
majors <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv')
tail(majors,1)
## FOD1P Major Major_Category
## 174 5599 MISCELLANEOUS SOCIAL SCIENCES Social Science
majors %>%
  filter(str_detect(Major, "DATA|STATISTICS"))
## FOD1P Major Major_Category
## 1 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business
## 2 2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3 3702 STATISTICS AND DECISION SCIENCE Computers & Mathematics
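The same filter can be tried without downloading the file. This is a minimal sketch, assuming a small hand-made data frame: `toy_majors` is a hypothetical stand-in that borrows the three matching names from the output above plus one non-matching name, and it is not the real dataset.

```r
library(stringr)

# Hypothetical stand-in for the majors data (not the real file).
toy_majors <- data.frame(
  Major = c("MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "MISCELLANEOUS SOCIAL SCIENCES"),
  stringsAsFactors = FALSE
)

# "DATA|STATISTICS" is an alternation: a row is kept if either word appears.
hits <- toy_majors[str_detect(toy_majors$Major, "DATA|STATISTICS"), , drop = FALSE]
hits$Major
```

The pattern is case-sensitive, which works here because the major names are stored in upper case.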
Write code that transforms the data below:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”
Into a format like this: c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
fruit_vector <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
dput(as.character(fruit_vector))
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry",
## "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime",
## "lychee", "mulberry", "olive", "salal berry")
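If the starting point is the printed console text rather than an existing vector, the names can be pulled out with a regular expression. This is a sketch under stated assumptions: the text is pasted into one string called `printed` (a hypothetical name), and it uses straight double quotes rather than curly ones.

```r
library(stringr)

# The printed output, pasted in as a single string (straight quotes assumed).
# The index markers such as [1] and [5] are part of the text.
printed <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Extract every run of characters that sits between a pair of double quotes.
quoted <- str_extract_all(printed, '"[^"]+"')[[1]]

# Drop the surrounding quotes to get the bare names.
fruit_parsed <- str_remove_all(quoted, '"')

# dput() prints the vector in c("...", "...") form.
dput(fruit_parsed)
```

Because only quoted text is extracted, the bracketed index markers and the line breaks are ignored automatically.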
Describe, in words, what these expressions will match:
mine <- c("apple", "apricot", "avocado", "banana", "bepp pepper", "bilberry", "blackberry", "blackcurrant", "blood orange", "blueberry", "cranberry", "myapplebarrel")
mine
## [1] "apple" "apricot" "avocado" "banana"
## [5] "bepp pepper" "bilberry" "blackberry" "blackcurrant"
## [9] "blood orange" "blueberry" "cranberry" "myapplebarrel"
pattern <- "(.)(.)(.).*\\3\\2\\1"
mine %>%
str_subset(pattern)
## [1] "bepp pepper"
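To see what each of the five expressions in this exercise matches, it helps to run them all against one small vector. This is a sketch with a hypothetical vector `words`, chosen so that each pattern has an obvious hit. Every pattern is written as an R string, so each backslash is doubled.

```r
library(stringr)

# Hypothetical test words (not part of the exercise data).
words <- c("aaa", "abba", "banana", "bepp pepper", "apple")

# One character three times in a row.
m1 <- str_subset(words, "(.)\\1\\1")             # "aaa"

# Two characters followed by the same two in reverse order.
m2 <- str_subset(words, "(.)(.)\\2\\1")          # "abba", "bepp pepper"

# A pair of characters immediately repeated.
m3 <- str_subset(words, "(..)\\1")               # "banana"

# One character at three positions, each separated by one arbitrary character.
m4 <- str_subset(words, "(.).\\1.\\1")           # "banana", "bepp pepper"

# Three characters, then later the same three in reverse order.
m5 <- str_subset(words, "(.)(.)(.).*\\3\\2\\1")  # "bepp pepper"
```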
(.)\1\1 As a regular expression, this matches any character repeated three times in a row, such as “aaa”. Typed into R exactly as shown it raises an error, because a pattern has to be passed as a quoted string with each backslash doubled: "(.)\\1\\1". Nothing in my data matches it.
"(.)(.)\\2\\1" The two sets of parentheses capture two single characters, and \\2\\1 requires the same two characters to follow in reverse order, as in “abba”. In my data above, “bepp pepper” is returned because of the “eppe” in “pepper”.
(..)\1 Same caveat as the first one: without quotes and doubled backslashes it raises an error in R. Written as "(..)\\1", it matches any two characters immediately repeated, such as the “anan” in “banana”.
"(.).\\1.\\1" Matches a character that appears three times with exactly one arbitrary character between each appearance. In my data above, “banana” and “bepp pepper” are returned.
"(.)(.)(.).*\\3\\2\\1" Matches three characters followed, after any number of other characters, by the same three characters in reverse order. In my data, “bepp pepper” meets this criterion: “epp” is later followed by “ppe”.
Construct regular expressions to match words that: Start and end with the same character.
data1 <- c("starts", "loses", "going", "frog", "gxxxfffg")
pattern1 <- "^([a-z]).*\\1$"
data1 %>%
str_subset(pattern1)
## [1] "starts" "going" "gxxxfffg"
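Two properties of pattern1 are worth noting: it needs at least two characters, and it only accepts a lowercase first letter. A possible variant, sketched here with hypothetical names (`pattern1_alt`, `test_words`), makes the tail optional so that a single-character word also counts as starting and ending with the same character.

```r
library(stringr)

# Hypothetical test words, including a one-letter word.
test_words <- c("a", "starts", "loses", "Anna")

# Original pattern: first lowercase letter must reappear as the last character.
pattern1 <- "^([a-z]).*\\1$"

# Variant: any first character; the "...same character at the end" part is optional,
# so a one-character word matches too.
pattern1_alt <- "^(.)(.*\\1)?$"

orig_hits <- str_subset(test_words, pattern1)
alt_hits  <- str_subset(test_words, pattern1_alt)
```

Both patterns are case-sensitive, so “Anna” is rejected by either one.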
Contain a repeated pair of letters
data2 <- c("banana", "apricot", "church")
# Capture a two-character pair, then require the same pair again later in the word.
pattern2 <- "(..).*\\1"
data2 %>%
str_subset(pattern2)
## [1] "banana" "church"
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
data3 <- c("eleven", "apricot", "abxbxbxcd", "kjjjjjjj", "Shannon", "reje")
# Allow any number of characters between the three occurrences of the letter.
pattern3 <- "(.).*\\1.*\\1"
data3 %>%
  str_subset(pattern3)
## [1] "eleven" "abxbxbxcd" "kjjjjjjj" "Shannon"
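A regex-free cross-check is to count the characters in each word and keep the words where some count reaches three. This is a sketch; `has_triple` and `triple_words` are hypothetical helper names. Because it counts every occurrence regardless of spacing, it also picks up “Shannon”, whose three “n”s are not evenly spaced.

```r
data3 <- c("eleven", "apricot", "abxbxbxcd", "kjjjjjjj", "Shannon", "reje")

# TRUE when some single character occurs three or more times in the word.
has_triple <- function(word) {
  counts <- table(strsplit(word, "")[[1]])
  any(counts >= 3)
}

# Apply the check to every word and keep the ones that pass.
keep <- vapply(data3, has_triple, logical(1), USE.NAMES = FALSE)
triple_words <- data3[keep]
triple_words
```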