Let’s start by loading the libraries.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Next, let’s grab our raw data and import it into a data frame with headers.
url.data <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
raw <- read.csv(url(url.data), header = TRUE)
Let’s take a quick look at the first rows with head() to confirm the headers were read correctly.
head(raw)
##   FOD1P                                 Major                  Major_Category
## 1  1100                   GENERAL AGRICULTURE Agriculture & Natural Resources
## 2  1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3  1102                AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4  1103                       ANIMAL SCIENCES Agriculture & Natural Resources
## 5  1104                          FOOD SCIENCE Agriculture & Natural Resources
## 6  1105            PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources
We will now use grepl() to keep the rows whose Major column contains either "STATISTICS" or "DATA".
majors<-rbind(raw[grepl("STATISTICS", raw$Major),], raw[grepl("DATA", raw$Major),])
head(majors)
##    FOD1P                                         Major          Major_Category
## 44  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 59  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics
## 52  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
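The two grepl() calls above can also be folded into a single pattern with alternation. The sketch below is only an illustration: `sample_majors` is a hypothetical stand-in for `raw`, built by hand from a few rows shown in the output above so that it runs without downloading the CSV.

```r
# Hypothetical stand-in for `raw`, using rows copied from the output above.
sample_majors <- data.frame(
  FOD1P = c(1100, 6212, 2101, 3702),
  Major = c("GENERAL AGRICULTURE",
            "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE"),
  stringsAsFactors = FALSE
)

# "DATA|STATISTICS" is TRUE for any Major containing either word.
matches <- sample_majors[grepl("DATA|STATISTICS", sample_majors$Major), ]
print(matches$Major)
```

Unlike the rbind() version, this keeps rows in their original order, and a major containing both words would appear only once.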
Let’s start by loading the tidyverse. There will be some namespace conflicts, but those are out of scope here.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v stringr 1.4.0
## v tidyr 1.2.0 v forcats 0.5.1
## v readr 2.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
From there, let’s grab our text file from the working folder and read it into a data frame using a fixed delimiter.
We split on the double-quote character (") and then look for patterns. Every even-numbered column clearly holds a valid entry, so we keep only the complete cases (that is, the non-NA values) from those columns and append them to the list. In this case we have to ignore a parsing warning, which is caused by the carriage returns (\r) in the file.
file <- "Assignment_2_Text.txt"
df <- read_delim(file, delim='\"', col_names=FALSE, show_col_types = FALSE)
## Warning: One or more parsing issues, see `problems()` for details
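# Collect the non-NA entries from each even-numbered column.
# (Named `fruits` rather than `list` to avoid shadowing base::list.)
fruits <- df$X2[complete.cases(df$X2)]
fruits <- c(fruits, df$X4[complete.cases(df$X4)])
fruits <- c(fruits, df$X6[complete.cases(df$X6)])
fruits <- c(fruits, df$X8[complete.cases(df$X8)])
print(sort(fruits))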
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
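To see why the even-numbered columns hold the entries, here is a minimal base R sketch of the same idea. It assumes the file's lines look like printed R output with quoted names; `sample_lines` is a hypothetical stand-in for the file contents, and strsplit() stands in for read_delim().

```r
# Hypothetical stand-in for two lines of the text file.
sample_lines <- c('[1] "bell pepper"  "bilberry"', '[3] "lime"')

# Splitting on the double quote alternates between "outside quotes" and
# "inside quotes", so the even-numbered pieces are the quoted entries.
pieces <- strsplit(sample_lines, '"', fixed = TRUE)
entries <- unlist(lapply(pieces, function(p) p[seq_along(p) %% 2 == 0]))
print(sort(entries))
```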
We first import the string in a slightly easier way by wrapping it in single quotes. Our data contains double quotes, and this is the easiest way to handle them without escaping each one. At this point we also store the expected output in the target variable.
Next we create oneshot_extract with str_extract_all, using a positive lookbehind for a double quote, the pattern \w+.\w+ (a run of word characters, any single character, then another run of word characters), and a positive lookahead for a double quote. In practice this captures the one- or two-word names that sit between two quotes.
string_start <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'
target <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
oneshot_extract <- unlist(str_extract_all(string_start, "(?<=\")(\\w+.\\w+)(?=\")"))
target %in% oneshot_extract
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
oneshot_extract %in% target
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
To test this, we use %in% twice. First we compare the target to the extract to ensure that every value of the target is contained in oneshot_extract. Then we reverse the comparison to ensure that oneshot_extract has no values that are missing from the target. With every result reading TRUE, the exercise is complete.
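The two-way %in% check can also be collapsed into a single setequal() call. The sketch below is a self-contained illustration under a few assumptions: it uses base R's regmatches() and gregexpr() with perl = TRUE as a stand-in for str_extract_all, and `short_string` and `short_target` are hypothetical shortened versions of the inputs above.

```r
# Hypothetical shortened inputs, in the same shape as string_start and target.
short_string <- '[1] "bell pepper" "bilberry"\n[3] "lime"'
short_target <- c("bell pepper", "bilberry", "lime")

# Same lookbehind / lookahead pattern as above, run through base R (PCRE).
pattern <- '(?<=")(\\w+.\\w+)(?=")'
extracted <- regmatches(short_string,
                        gregexpr(pattern, short_string, perl = TRUE))[[1]]

# setequal() is TRUE only when both vectors hold the same set of values.
print(setequal(extracted, short_target))
```

One limitation worth keeping in mind: because \w+.\w+ needs at least three characters, a name shorter than that would not be extracted by this pattern.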
test_case <- "(.)\1\1"
writeLines(test_case)
## (.)
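In an R string literal, \1 is not a backslash followed by a 1; it is an octal escape for the control character with code 1. The string is therefore “(.)” followed by two invisible control characters, which is why writeLines() appears to print only (.). Used as a regex in R, it would match any character followed by two of those control characters, not a repeated character. If we treat this as language-agnostic input, with no language-specific escape characters, it would be read as (.)\1\1, which matches the same character three times in a row.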
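A quick way to see the difference between the unescaped and escaped forms is to count characters and inspect their codes. This is a small sketch using base R only; the variable names and the sample words are illustrative.

```r
unescaped <- "(.)\1\1"    # \1 is parsed as the control character with code 1
escaped   <- "(.)\\1\\1"  # \\1 is a literal backslash followed by 1

print(nchar(unescaped))           # 5: "(", ".", ")" plus two control characters
print(nchar(escaped))             # 7: "(", ".", ")", "\", "1", "\", "1"
print(utf8ToInt(unescaped)[4:5])  # 1 1

# Only the escaped form behaves as a backreference pattern.
print(grepl(escaped, c("aaa", "aab"), perl = TRUE))  # TRUE FALSE
```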
test_case <- "(.)(.)\\2\\1"
writeLines(test_case)
## (.)(.)\2\1
After R processes the string escapes, the regex engine sees the output above. It matches a pair of characters followed by the same pair in reverse order: first character, second character, then second character, first character.
For example,
ABBA
NIIN
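The examples can be checked directly. This sketch uses base R's grepl() with perl = TRUE rather than stringr; the extra words "ABAB" and "noon" are added here purely for contrast.

```r
mirror_pair <- "(.)(.)\\2\\1"
words <- c("ABBA", "NIIN", "ABAB", "noon")

# TRUE where a two-character pair is immediately followed by its reverse.
print(grepl(mirror_pair, words, perl = TRUE))  # TRUE TRUE FALSE TRUE
```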
test_case1 <- "(..)\1"
writeLines(test_case1)
## (..)
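In an R string literal, the unescaped \1 is read as the control character with code 1, so this string is “(..)” followed by one invisible control character. Used as a regex in R, it would match any two characters followed by that control character. Written with the backslash escaped, as "(..)\\1", it would match a pair of characters that is immediately repeated, such as the “anan” in “banana”.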
test_case2 <- "(.).\\1.\\1"
writeLines(test_case2)
## (.).\1.\1
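This matches a sequence of five characters in which the first, third, and fifth are the same, with any single character in each of the two gaps.
For example, ABACA or the “eleve” in “eleven”.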
test_case3 <- "(.)(.)(.).*\\3\\2\\1"
writeLines(test_case3)
## (.)(.)(.).*\3\2\1
This matches three characters in the order 123, then any number of characters, then the same three characters in the reverse order 321.
For example, ABCCBA or ABC123CBA.
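Both of the last two patterns can be tried against a few hand-picked strings. This is a sketch with base R's grepl() and perl = TRUE; the sample strings are made up for illustration.

```r
alternating <- "(.).\\1.\\1"          # same pattern as test_case2
mirrored    <- "(.)(.)(.).*\\3\\2\\1" # same pattern as test_case3

# First, third, and fifth characters identical.
print(grepl(alternating, c("ABACA", "eleven", "ABCDE"), perl = TRUE))
# TRUE TRUE FALSE

# Three characters, anything, then the same three reversed.
print(grepl(mirrored, c("ABCCBA", "ABC123CBA", "ABCABC"), perl = TRUE))
# TRUE TRUE FALSE
```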
Starts and ends with the same character: ‘^([:graph:])(.*)\1$’. Using [:graph:], I capture any visible character (letters, numbers, and punctuation) at the start of the string. The (.*) matches any run of non-newline characters, and the \1$ requires the string to end with the same character that was captured at the start.
test_case4a <- "^([:graph:])(.*)\\1$"
fruit_test <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry", "iamatesti")
str_view(fruit_test, test_case4a, match = TRUE)
A repeated pair of characters: “(.)(.).*\1\2”. The two captured characters must appear again, in the same order, later in the string.
test_case4b <- "(.)(.).*\\1\\2"
fruit_test <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry", "iamatestia")
str_view(fruit_test, test_case4b, match = TRUE)
One character repeated in three places. I am assuming that there has to be at least one character between each occurrence. For example, “eleven” will match, but neither “elee” nor “eee” will. “(.).+\1.+\1”
test_case4c <- "(.).+\\1.+\\1"
fruit_test <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry", "iamatestia")
str_view(fruit_test, test_case4c, match = TRUE)
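As a plain-text check of the three patterns in this section, the sketch below lists which words match each one. It makes a few assumptions: it uses base R's grepl() with perl = TRUE instead of str_view, it writes the character class as [[:graph:]] because base R needs the double brackets, and `matching_words` is a hypothetical helper. Both made-up test words are included in one vector.

```r
fruit_test <- c("bell pepper", "bilberry", "blackberry", "blood orange",
                "blueberry", "cantaloupe", "chili pepper", "cloudberry",
                "elderberry", "lime", "lychee", "mulberry", "olive",
                "salal berry", "iamatesti", "iamatestia")

# Hypothetical helper: return the words that match a pattern.
matching_words <- function(pattern, words) {
  words[grepl(pattern, words, perl = TRUE)]
}

same_ends      <- matching_words("^([[:graph:]])(.*)\\1$", fruit_test)
repeated_pair  <- matching_words("(.)(.).*\\1\\2", fruit_test)
three_repeats  <- matching_words("(.).+\\1.+\\1", fruit_test)

print(same_ends)      # "iamatesti"
print(repeated_pair)  # "bell pepper" "chili pepper" "elderberry" "salal berry" "iamatestia"
print(three_repeats)  # "bell pepper" "elderberry" "iamatestia"
```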
I found these sites a great help for regex testing and construction:
https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf