Load data
majors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")
We will use the grep function to find the values in column Major that contain the word “DATA” or “STATISTICS”.
grep_output <- grep('DATA|STATISTICS',majors$Major, value = TRUE)
grep_output
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
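As a small offline sketch of the same idea, the block below uses a made-up vector `sample_majors` (hypothetical, not taken from the dataset) to show what `grep` returns with and without `value = TRUE`, and how `grepl` differs by returning one logical value per element.

```r
# Hypothetical sample, not the real majors list
sample_majors <- c("STATISTICS AND DECISION SCIENCE",
                   "COMPUTER PROGRAMMING AND DATA PROCESSING",
                   "BOTANY")

# value = TRUE returns the matching strings themselves
matched <- grep("DATA|STATISTICS", sample_majors, value = TRUE)

# Without value = TRUE, grep returns the positions of the matches
positions <- grep("DATA|STATISTICS", sample_majors)

# grepl returns TRUE/FALSE for every element, which is handy for subsetting
flags <- grepl("DATA|STATISTICS", sample_majors)

print(matched)
print(positions)
print(flags)
```

The logical vector from `grepl` can be used directly to subset rows of a data frame, which is the same role `str_detect` plays in stringr.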
Using stringr
library(stringr)
stringr_output <- majors[str_detect(majors$Major, "STATISTICS|DATA"), ]
stringr_output
## FOD1P Major Major_Category
## 44 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business
## 52 2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 59 3702 STATISTICS AND DECISION SCIENCE Computers & Mathematics
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
I didn’t fully get this assignment. Should I just create the input and make the output look like c("bell pepper", etc.)?
input_str <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'
input_str
## [1] "[1] \"bell pepper\" \"bilberry\" \"blackberry\" \"blood orange\"\n[5] \"blueberry\" \"cantaloupe\" \"chili pepper\" \"cloudberry\" \n[9] \"elderberry\" \"lime\" \"lychee\" \"mulberry\" \n[13] \"olive\" \"salal berry\""
Finding all quoted values:
extract_berry <- unlist(str_extract_all(input_str, '"[^"]*"'))
extract_berry
## [1] "\"bell pepper\"" "\"bilberry\"" "\"blackberry\"" "\"blood orange\""
## [5] "\"blueberry\"" "\"cantaloupe\"" "\"chili pepper\"" "\"cloudberry\""
## [9] "\"elderberry\"" "\"lime\"" "\"lychee\"" "\"mulberry\""
## [13] "\"olive\"" "\"salal berry\""
Removing the double quotes:
extract_berry_update <- str_remove_all(extract_berry, "\"")
extract_berry_update
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
Writing the result in the required form, wrapped in c( and ) with the values separated by commas:
final <- paste0('c(', paste0('"', extract_berry_update, '"', collapse = ', '), ')')
writeLines(final)
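For comparison, here is a sketch of the same transformation in base R only, assuming the input is a single string of quoted values shaped like `input_str`. The shortened input and the names `raw_txt`, `quoted`, `items`, `as_code`, and `round_trip` are made up for this illustration.

```r
# Shortened, hypothetical input in the same shape as input_str
raw_txt <- '[1] "bell pepper" "bilberry"\n[3] "lime" "salal berry"'

# gregexpr finds every quoted run; regmatches pulls the text out
quoted <- regmatches(raw_txt, gregexpr('"[^"]*"', raw_txt))[[1]]

# Drop the surrounding double quotes
items <- gsub('"', '', quoted, fixed = TRUE)

# Rebuild as the text of a c(...) call, separating items with ", "
as_code <- paste0('c(', paste0('"', items, '"', collapse = ', '), ')')
writeLines(as_code)

# The resulting text parses back to the same character vector
round_trip <- eval(parse(text = as_code))
```

Parsing the text back with `eval(parse(...))` is only a sanity check that the string really is valid R code for the vector; it is not needed to produce the required output.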
(.) is the 1st capturing group; it matches any single character except line terminators.
\1\1 refers back, twice, to the text matched by the first capturing group.
To use this regular expression in R we need to write it as a string, using \\1 instead of \1.
For example, it matches 777 in 84572777. In other words, it matches the same character three times in a row.
stng_1 <- '84572777'
str_view(stng_1, "(.)\\1\\1", match = TRUE)
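The doubled backslash can be checked directly. This sketch prints the pattern as R displays it and as the regex engine receives it, then extracts the match with base R's `regmatches`; the names `pattern`, `n_chars`, and `hit` are chosen here for illustration.

```r
pattern <- "(.)\\1\\1"

# print() shows the escaped form; writeLines() shows the actual characters
print(pattern)       # escaped form, with doubled backslashes
writeLines(pattern)  # the regex itself, with single backslashes

# The string holds 7 characters: ( . ) \ 1 \ 1
n_chars <- nchar(pattern)

# Extract the first run of three identical characters
hit <- regmatches("84572777", regexpr(pattern, "84572777", perl = TRUE))
print(hit)
```

The same distinction explains the `\"` and `\n` sequences seen when a string is printed: they are escapes in the displayed form, not extra characters in the string.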
The next regular expression is given as an R string, with double quotes and doubled backslashes (\\).
(.)(.) are the 1st and 2nd capturing groups; each matches any single character except line terminators.
\2 refers back to the text matched by the second capturing group.
\1 refers back to the text matched by the first capturing group.
For example, it matches 4444 in 4564444rt. In other words, it matches two characters followed by the same two characters in reverse order, as in abba; 4444 is the case where both characters are the same.
stng_2 <- '4564444rt'
str_view(stng_2, "(.)(.)\\2\\1", match = TRUE)
(..) is the 1st capturing group; it matches any 2 characters except line terminators.
\1 refers back to the text matched by the first capturing group.
For example, it matches 1414 in ssff1414slfs. In other words, it matches a pair of characters that is immediately repeated.
stng_3 <- 'ssff1414slfs'
str_view(stng_3, "(..)\\1", match = TRUE)
(.) is the 1st capturing group; it matches any single character except line terminators.
. matches any single character except line terminators.
\1 refers back to the text matched by the first capturing group.
For example, it matches e7e8e in e7e8e9e7. In other words, it matches five characters in which the 1st, 3rd, and 5th are the same; the 2nd and 4th can be any characters.
stng_4 <- 'e7e8e9e7'
str_view(stng_4, "(.).\\1.\\1", match = TRUE)
(.)(.)(.) are the 1st, 2nd, and 3rd capturing groups; each matches any single character except line terminators.
. matches any single character except line terminators.
* repeats the preceding element zero or more times, as many times as possible.
\3 refers back to the text matched by the third capturing group.
\2 refers back to the text matched by the second capturing group.
\1 refers back to the text matched by the first capturing group.
For example, it matches 1234y4321 in 1234y4321rff. In other words, it matches a string of at least 6 characters that begins with three characters and ends with the same three characters in reverse order; any characters can appear between the two groups.
stng_5 <- '1234y4321rff'
str_view(stng_5, "(.)(.)(.).*\\3\\2\\1", match = TRUE)
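To check the five descriptions in one place, the sketch below runs each pattern against the example string used above and extracts the first match with base R (`perl = TRUE` is an assumption made here so that backreferences follow PCRE rules). The helper `first_match` is hypothetical, written only for this illustration.

```r
# Hypothetical helper: return the first match of `pattern` in `x`
first_match <- function(pattern, x) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

m1 <- first_match("(.)\\1\\1", "84572777")
m2 <- first_match("(.)(.)\\2\\1", "4564444rt")
m3 <- first_match("(..)\\1", "ssff1414slfs")
m4 <- first_match("(.).\\1.\\1", "e7e8e9e7")
m5 <- first_match("(.)(.)(.).*\\3\\2\\1", "1234y4321rff")

# The second pattern is not limited to one repeated character:
# it also matches a mirrored pair such as "abba"
m2b <- first_match("(.)(.)\\2\\1", "xxabbayy")

print(c(m1, m2, m3, m4, m5, m2b))
```

Extracting the match as text, rather than viewing it in the widget, makes it easy to compare the result against the description written for each pattern.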
To match only a complete string, the pattern should start with ^ and end with $. The pattern below matches strings that start and end with the same character.
stng <- c('slow','momnm','slack','hopeh')
str_view(stng, "^(.).*\\1$")
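A quick sketch of which strings the anchored pattern accepts, using `grepl` from base R. One edge case worth noting: because the pattern needs a second occurrence of the captured character, a one-character string is not matched. The variant `pattern_with_single` is an assumption added here to cover that case, not part of the original answer, and the extra word "a" is added only to show it.

```r
words <- c("slow", "momnm", "slack", "hopeh", "a")

pattern <- "^(.).*\\1$"
# Variant (assumption): also accept a single character on its own
pattern_with_single <- "^(.)(.*\\1)?$"

same_ends <- grepl(pattern, words, perl = TRUE)
same_ends_or_single <- grepl(pattern_with_single, words, perl = TRUE)

print(setNames(same_ends, words))
print(setNames(same_ends_or_single, words))
```

Whether a single character should count as starting and ending with the same character depends on how the exercise is read, so the variant is offered as an option only.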
Strings that contain a repeated pair of characters (for example, an appears twice in banana):
st <- c('slowwew','momnm','scklack','ohopehop','banana')
str_view(st, ".*(.)(.).*\\1\\2.*")
Strings that contain the same character in at least three places (for example, a in banana):
s <- c('slowwew','momnm','scklack','ohopehop','banana')
str_view(s, ".*(.).*\\1.*\\1.*")
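In the last two patterns the leading and trailing `.*` only widen the highlighted match to the whole string; for a yes/no test they can be dropped. A sketch with base R `grepl`, reusing the example words above (the names `has_repeated_pair` and `has_triple_char` are chosen here for illustration):

```r
words <- c("slowwew", "momnm", "scklack", "ohopehop", "banana")

# A pair of characters that appears again later in the string
has_repeated_pair <- grepl("(.)(.).*\\1\\2", words, perl = TRUE)

# One character that appears in at least three places
has_triple_char <- grepl("(.).*\\1.*\\1", words, perl = TRUE)

print(words[has_repeated_pair])
print(words[has_triple_char])
```

Working with logical vectors makes it straightforward to count or filter the matching words instead of inspecting them visually.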