Part 1

Let’s start by loading the libraries.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Next, let’s grab our raw data and import it into a data frame with headers.

url.data <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"

raw <- read.csv(url(url.data), header = TRUE)

Let’s take a quick look at the head of the data to see what it looks like with headers.

head(raw)
##   FOD1P                                 Major                  Major_Category
## 1  1100                   GENERAL AGRICULTURE Agriculture & Natural Resources
## 2  1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3  1102                AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4  1103                       ANIMAL SCIENCES Agriculture & Natural Resources
## 5  1104                          FOOD SCIENCE Agriculture & Natural Resources
## 6  1105            PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources

Now let’s filter!

We will now use grepl to find the rows whose Major column contains STATISTICS or DATA.

majors <- rbind(raw[grepl("STATISTICS", raw$Major), ], raw[grepl("DATA", raw$Major), ])

head(majors)
##    FOD1P                                         Major          Major_Category
## 44  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 59  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics
## 52  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
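
The two grepl calls can also be folded into one with alternation. Here is a minimal sketch, assuming a small made-up data frame (majors_demo is hypothetical, not the real file). Note that the rows come back in their original order rather than with the STATISTICS matches first.

```r
# Hypothetical stand-in for `raw`, so the example runs without a download.
majors_demo <- data.frame(
  FOD1P = c(1100, 6212, 3702, 2101),
  Major = c("GENERAL AGRICULTURE",
            "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "STATISTICS AND DECISION SCIENCE",
            "COMPUTER PROGRAMMING AND DATA PROCESSING"),
  stringsAsFactors = FALSE
)

# "DATA|STATISTICS" matches a Major containing either word.
hits <- majors_demo[grepl("DATA|STATISTICS", majors_demo$Major), ]
print(hits$Major)
```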

Part 2

The Novel Way

Let’s start by loading the tidyverse. There will be conflicts, but those are outside the scope of this exercise.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v stringr 1.4.0
## v tidyr   1.2.0     v forcats 0.5.1
## v readr   2.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

From there, let’s grab our text file from the working folder and read it into a data frame with a set delimiter.

Using the double quote character (") as the delimiter reveals a pattern: every even-numbered column holds a valid entry. We filter each of those columns down to complete cases (that is, non-NA values) and then append them to the list. In this case we will have to ignore a parsing warning, caused by the \r characters in the file.

file <- "Assignment_2_Text.txt"
df <- read_delim(file, delim='\"', col_names=FALSE, show_col_types = FALSE) 
## Warning: One or more parsing issues, see `problems()` for details
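# Collect the even-numbered columns, dropping the NA padding in each.
# Named `fruits` rather than `list` to avoid masking base::list().
fruits <- df$X2[!is.na(df$X2)]
fruits <- c(fruits, df$X4[!is.na(df$X4)])
fruits <- c(fruits, df$X6[!is.na(df$X6)])
fruits <- c(fruits, df$X8[!is.na(df$X8)])
print(sort(fruits))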
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
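
To see why the even-numbered columns hold the entries, here is a rough base R sketch of the same idea on a single line of the text. It uses strsplit instead of read_delim, and the variable names (line_demo, pieces, entries) are illustrative only.

```r
# One line of the input, wrapped in single quotes so the double quotes survive.
line_demo <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"'

# Splitting on the double quote alternates between filler and entries.
pieces <- strsplit(line_demo, '"', fixed = TRUE)[[1]]

# Odd positions are the index marker and whitespace; even positions are the fruit.
entries <- pieces[seq(2, length(pieces), by = 2)]
print(entries)
```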

The One Shot Regex Way

We are first going to import the string in a slightly easier way, by wrapping it in single quotes. Because our data contains double quotes, this is the easiest way to handle it. At this point we will also store the expected output in the target variable.

We then create oneshot_extract with str_extract_all, using a positive lookbehind for a double quote, the pattern \w+.\w+, and a positive lookahead for a double quote. The . matches any single character, so the pattern covers both two-word entries (where . is the space) and single words of at least three characters (where . is a letter). In practice this finds the one- or two-word entries sitting between two quotes.

string_start <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
[9] "elderberry"   "lime"         "lychee"       "mulberry"    
[13] "olive"        "salal berry"'
target <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
oneshot_extract <- unlist(str_extract_all(string_start, "(?<=\")(\\w+.\\w+)(?=\")"))
target %in% oneshot_extract
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
oneshot_extract %in% target
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

To test this, we use %in% twice. First we compare the target to the extract to ensure that every value of the target is contained in oneshot_extract. Then we reverse the comparison to ensure that oneshot_extract has no values outside the target. With all of these results reading TRUE, the exercise is complete.
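
The two-way %in% check can also be collapsed into a single call. Here is a small sketch with made-up vectors (target_demo and extract_demo are hypothetical): setequal compares the values regardless of order, and comparing the sorted vectors additionally catches duplicates, which %in% alone would not.

```r
# Hypothetical short vectors standing in for target and oneshot_extract.
target_demo  <- c("lime", "olive", "lychee")
extract_demo <- c("olive", "lychee", "lime")

# TRUE when both vectors contain the same set of values, in any order.
same_values <- setequal(target_demo, extract_demo)

# Stricter: TRUE only when the values and their counts agree.
same_counts <- identical(sort(target_demo), sort(extract_demo))

print(c(same_values, same_counts))
```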

Part 3

Part 3A

test_case <- "(.)\1\1"
writeLines(test_case)
## (.)

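If we pass this string to a regex function in R, the R parser handles the backslashes first: \1 inside a string literal is an octal escape for the control character with code 1, so the string is (.) followed by two non-printing characters, which is why writeLines appears to show only “(.)”. As a regex, it would match any character followed by two of those control characters. If we treat this as language-agnostic input, with no language-specific escape characters, it would be read as (.)\1\1, which matches the same character three times in a row, such as AAA.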
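
To see the difference between the two readings in practice, here is a small base R sketch. It uses grepl with perl = TRUE rather than stringr, and the variable names are illustrative.

```r
unescaped <- "(.)\1\1"    # R turns each \1 into the control character with code 1
escaped   <- "(.)\\1\\1"  # the regex engine receives (.)\1\1

# The unescaped string is 5 characters long, the escaped one is 7.
print(nchar(unescaped))
print(nchar(escaped))

# The 4th character of the unescaped string has code point 1.
print(utf8ToInt(substr(unescaped, 4, 4)))

# Only the escaped version behaves as a backreference.
print(grepl(escaped, c("aaa", "aab"), perl = TRUE))
print(grepl(unescaped, "aaa", perl = TRUE))
```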

Part 3B

test_case <- "(.)(.)\\2\\1"
writeLines(test_case)
## (.)(.)\2\1

Through the interpreter, the regex is the output above. It matches a first character, a second character, then the second character again, then the first character again.

For example,

  1. ABBA

  2. NIIN
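
A quick way to check these examples is with grepl in base R. This is only a sketch, with one non-matching string added for contrast.

```r
mirror_pair <- "(.)(.)\\2\\1"

# "ABBA" and "NIIN" mirror their first two characters; "ABAB" does not.
print(grepl(mirror_pair, c("ABBA", "NIIN", "ABAB"), perl = TRUE))
```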

Part 3C

test_case1 <- "(..)\1"
writeLines(test_case1)
## (..)

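If we pass this string to a regex function in R, the \1 is again parsed as an octal escape for the control character with code 1, so the string is (..) followed by one non-printing character, which is why writeLines appears to show only “(..)”. As a regex, it would match any two characters followed by that control character. Treated as language-agnostic input, (..)\1 matches a pair of characters that is immediately repeated, such as ABAB.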
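
For comparison, here is a sketch of the properly escaped version in base R, which does behave as a backreference. The test words are made up for illustration.

```r
pair_repeat <- "(..)\\1"   # the regex engine receives (..)\1

# "abab" and "cucumber" contain an immediately repeated pair; "abba" does not.
print(grepl(pair_repeat, c("abab", "abba", "cucumber"), perl = TRUE))
```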

Part 3D

test_case2 <- "(.).\\1.\\1"
writeLines(test_case2)
## (.).\1.\1

A sequence of five characters in which the first, third, and fifth are the same character, and the second and fourth can be anything.

For example,

  1. ABACA
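
The same pattern can be checked with grepl in base R. In this sketch, "banana" is an extra made-up test word that also matches, through the substring "anana".

```r
every_other <- "(.).\\1.\\1"

# "ABACA" matches as a whole; "banana" matches on "anana"; "abcde" has no repeats.
print(grepl(every_other, c("ABACA", "banana", "abcde"), perl = TRUE))
```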

Part 3E

test_case3 <- "(.)(.)(.).*\\3\\2\\1"
writeLines(test_case3)
## (.)(.)(.).*\3\2\1

A sequence of three characters in order 123, followed by any number of characters, then the same three characters in reverse order 321.

For Example,

  1. ABC***CBA
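
As a sketch, the example above and two made-up contrasts can be tested in base R. Because .* may match nothing, "abccba" also qualifies.

```r
mirror_triple <- "(.)(.)(.).*\\3\\2\\1"

# The first two strings end with their opening triple reversed; "abcabc" does not.
print(grepl(mirror_triple, c("ABC***CBA", "abccba", "abcabc"), perl = TRUE))
```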

Part 4

Part 4A

Starts and ends with the same character: ‘^([:graph:])(.*)\1$’. I used [:graph:] to capture any letter, number, or punctuation character at the start of the string. The (.*) matches any run of non-newline characters, and the \1$ requires the string to end with the same character that was captured at the start.

test_case4a <- "^([:graph:])(.*)\\1$"
fruit_test <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry", "iamatesti")
str_view(fruit_test, test_case4a, match = TRUE)
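
Since str_view renders its result in a viewer, here is a rough base R sketch that prints the matching strings instead. It assumes the double-bracket form [[:graph:]], which base R's regex engines expect, and a shortened word list.

```r
starts_ends_same <- "^([[:graph:]])(.*)\\1$"
words_4a <- c("lime", "iamatesti", "elderberry")

# Keep only the words whose first and last characters are identical.
matches_4a <- words_4a[grepl(starts_ends_same, words_4a, perl = TRUE)]
print(matches_4a)
```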

Part 4B

A repeated pair of characters: “(.)(.).*\1\2”

test_case4b <- "(.)(.).*\\1\\2"
fruit_test <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry", "iamatestia")
str_view(fruit_test, test_case4b, match = TRUE)
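
A printed version of the same check, sketched in base R on a shortened word list:

```r
repeated_pair <- "(.)(.).*\\1\\2"
words_4b <- c("salal berry", "bell pepper", "lime", "olive")

# "salal" repeats "al" and "pepper" repeats "pe"; the other two have no repeated pair.
matches_4b <- words_4b[grepl(repeated_pair, words_4b, perl = TRUE)]
print(matches_4b)
```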

Part 4C

A character repeated three times. I am assuming that there has to be at least one character between the repeats. For example, “eleven” will match, but neither “elee” nor “eee” will. “(.).+\1.+\1”

test_case4c <- "(.).+\\1.+\\1"
fruit_test <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry", "iamatestia")
str_view(fruit_test, test_case4c, match = TRUE)
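
The three words mentioned above can be checked directly. This is a small base R sketch using grepl rather than str_view.

```r
three_repeats <- "(.).+\\1.+\\1"

# "eleven" has e, gap, e, gap, e; "elee" and "eee" are too short to leave both gaps.
print(grepl(three_repeats, c("eleven", "elee", "eee"), perl = TRUE))
```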

References

I found these sites a great help for regex testing and construction:

https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf

https://regex101.com/