This assignment is a collection of exercises, so I will address each one separately.
In this code block I import the majors data from FiveThirtyEight’s GitHub repository. I then add one column that flags whether the major in each row contains the word “DATA” and another that flags whether it contains the word “STATISTICS”. Next I create a new data frame containing only the rows of the majors data frame where at least one of the new columns is TRUE, and I drop the two helper columns so that only the columns from the original data frame remain. Finally, I print the data frame using kable.
majors_data <- read.csv(url("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"))
library(stringr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ purrr 1.0.1
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
majors_data$contains_data <- str_detect(majors_data$Major, "DATA")
majors_data$contains_statistics <- str_detect(majors_data$Major, "STATISTICS")
desired_majors <- subset(majors_data, contains_data | contains_statistics)
desired_majors <- subset(desired_majors, select = -c(contains_data, contains_statistics))
kable(desired_majors, format = "pipe", caption = "Majors Containing 'DATA' or 'STATISTICS'", align = "lll")
| | FOD1P | Major | Major_Category |
|---|---|---|---|
| 44 | 6212 | MANAGEMENT INFORMATION SYSTEMS AND STATISTICS | Business |
| 52 | 2101 | COMPUTER PROGRAMMING AND DATA PROCESSING | Computers & Mathematics |
| 59 | 3702 | STATISTICS AND DECISION SCIENCE | Computers & Mathematics |
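The two helper columns can also be avoided by testing for both words with a single alternation pattern. The following is a minimal sketch of that idea rather than the method used above. It uses base R's grepl() instead of str_detect(), and majors_sample is a small hypothetical stand-in for the downloaded data: the three matching rows are taken from the table above, and the other two rows are illustrative only.

```r
# Hypothetical stand-in for the majors list, so the example runs offline.
majors_sample <- data.frame(
  FOD1P = c("1100", "6212", "2101", "3702", "3700"),
  Major = c(
    "GENERAL AGRICULTURE",
    "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
    "COMPUTER PROGRAMMING AND DATA PROCESSING",
    "STATISTICS AND DECISION SCIENCE",
    "MATHEMATICS"
  ),
  Major_Category = c(
    "Agriculture & Natural Resources",
    "Business",
    "Computers & Mathematics",
    "Computers & Mathematics",
    "Computers & Mathematics"
  ),
  stringsAsFactors = FALSE
)

# "DATA|STATISTICS" matches a major that contains either word,
# so no temporary columns need to be added and removed.
is_desired <- grepl("DATA|STATISTICS", majors_sample$Major)
desired_sample <- majors_sample[is_desired, ]
print(desired_sample$Major)
```

The same pattern could be passed to str_detect() in place of the two separate calls, since alternation with | is part of ordinary regular expression syntax.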
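In this code block I store the provided data as a string and convert it into a character vector: I remove the extraneous characters (brackets, quotation marks, newlines, and index digits), split the string on spaces, and then drop the empty elements of the resulting vector. Because the split is on spaces, multi-word names such as “bell pepper” end up as two separate elements.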
library(stringi)
inputstring <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'
inputstring2 <- str_remove_all(inputstring, '\\[')
inputstring3 <- str_remove_all(inputstring2, '\\]')
inputstring4 <- str_remove_all(inputstring3, '\\"')
inputstring5 <- str_remove_all(inputstring4, '\n')
inputstring6 <- str_remove_all(inputstring5, '[0-9]')
string_vector <- str_split_1(inputstring6, " ")
string_vector_2 <- stri_remove_empty(string_vector)
string_vector_2
## [1] "bell" "pepper" "bilberry" "blackberry" "blood"
## [6] "orange" "blueberry" "cantaloupe" "chili" "pepper"
## [11] "cloudberry" "elderberry" "lime" "lychee" "mulberry"
## [16] "olive" "salal" "berry"
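If the goal is to keep multi-word names such as "bell pepper" together, one option is to extract each quoted item directly instead of splitting on spaces. The sketch below is an alternative under that assumption, not the approach used above. It relies only on base R (regmatches(), gregexpr(), and gsub()), and the names quoted_items and fruit_vector are introduced here for illustration.

```r
inputstring <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'

# Pull out every run of non-quote characters enclosed in double quotes.
# The bracketed indices and line breaks sit outside the quotes, so they are skipped.
quoted_items <- regmatches(inputstring, gregexpr('"[^"]+"', inputstring))[[1]]

# Strip the surrounding quotation marks from each extracted item.
fruit_vector <- gsub('"', "", quoted_items, fixed = TRUE)
print(fruit_vector)
```

Because each item is matched as a whole, this yields 14 elements rather than 18, and no separate cleanup of digits or empty strings is needed.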
Describe in words what these expressions will match:
(.)\1\1 will match any string containing the same character three times in a row (for example, “aaa”). To be used in R, it needs to be put in quotation marks with each backslash doubled.
"(.)(.)\\2\\1" will match any string containing two characters followed immediately by the same two characters in reverse order (for example, “abba”).
(..)\1 will match any string containing a pair of characters that is immediately repeated in the same order (for example, “abab”). To be used in R, it needs to be put in quotation marks with the backslash doubled.
"(.).\\1.\\1" will match any string containing a character, then any character, then the first character again, then any character, and then the first character once more (for example, “abaca”).
"(.)(.)(.).*\\3\\2\\1" will match any string containing three characters, followed by zero or more arbitrary characters, followed by the same three characters in reverse order (for example, “abcsdfgjfghjfjhdgjhdfcba”). The pattern is not anchored, so the match does not have to begin or end the string.
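These descriptions can be spot-checked by running each pattern against one short string. The sketch below does that with base R's grepl() and perl = TRUE, which I am assuming treats these backreferences the same way str_detect() does. The example strings and the names patterns, examples, and hits are my own choices for illustration.

```r
# Each pattern is written as an R string, so every backslash is doubled.
patterns <- c(
  triple      = "(.)\\1\\1",
  mirrored    = "(.)(.)\\2\\1",
  pair        = "(..)\\1",
  alternating = "(.).\\1.\\1",
  reversed    = "(.)(.)(.).*\\3\\2\\1"
)

# One string per pattern, in the same order, that should match it.
examples <- c("aaa", "abba", "abab", "abaca", "abcxyzcba")

# Test pattern i against example i.
hits <- mapply(function(p, s) grepl(p, s, perl = TRUE), patterns, examples)
print(hits)

# The fourth pattern consumes five characters, so a four-character
# string such as "abaa" is too short to match it.
print(grepl(patterns[["alternating"]], "abaa", perl = TRUE))
```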
This exercise asks for three regular expressions to match words meeting three different criteria. The criteria are:
1. Words that start and end with the same character.
2. Words that contain a repeated pair of letters.
3. Words that contain one letter repeated in at least three places.
I have created a list of words that match these criteria in every possible combination. There is one word that matches each criterion exclusively (and not the other two), one word that matches each possible pair of criteria but not the third, and one word that matches all three criteria. The words are as follows:
Matches all three categories: earthenware
Matches only categories 1 & 2: dartboard
Matches only categories 1 & 3: fluff
Matches only categories 2 & 3: insensitive
Matches only category 1: encapsulate
Matches only category 2: church
Matches only category 3: eleven
My solutions to the exercise are:
1. Words that start and end with the same character: ^(.).*\1$ [This should return “earthenware”, “dartboard”, “fluff”, and “encapsulate”]
2. Words that contain a repeated pair of letters: (..).*\1 [This should return “earthenware”, “dartboard”, “insensitive”, and “church”]
3. Words that contain one letter repeated in at least three places: (.).*\1.*\1 [This should return “earthenware”, “fluff”, “insensitive”, and “eleven”]
Using the code below, I demonstrate that these solutions produce the desired results.
words <- c("earthenware", "dartboard", "fluff", "insensitive", "encapsulate", "church", "eleven")
regex1 <- "^(.).*\\1$"
regex2 <- "(..).*\\1"
regex3 <- "(.).*\\1.*\\1"
solution1 <- words[str_detect(words, regex1)]
solution1
## [1] "earthenware" "dartboard" "fluff" "encapsulate"
solution2 <- words[str_detect(words, regex2)]
solution2
## [1] "earthenware" "dartboard" "insensitive" "church"
solution3 <- words[str_detect(words, regex3)]
solution3
## [1] "earthenware" "fluff" "insensitive" "eleven"
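The claim that each word covers a different combination of criteria can be checked in one step by building a word-by-criterion table. The sketch below is one way to do that, using base R's grepl() with perl = TRUE in place of str_detect(). The names regexes and membership, and the labels same_ends, repeated_pair, and triple_letter, are introduced here for illustration.

```r
words <- c("earthenware", "dartboard", "fluff", "insensitive",
           "encapsulate", "church", "eleven")

# The three solutions, labelled by the criterion each one tests.
regexes <- c(
  same_ends     = "^(.).*\\1$",
  repeated_pair = "(..).*\\1",
  triple_letter = "(.).*\\1.*\\1"
)

# One column per criterion, one row per word; TRUE means the word matches.
membership <- sapply(regexes, function(p) grepl(p, words, perl = TRUE))
rownames(membership) <- words
print(membership)

# Seven distinct rows means no two words share the same combination.
print(nrow(unique(membership)))
```

Reading across a row shows which criteria a word satisfies, which makes it easy to confirm that, for example, “church” matches only the repeated-pair pattern.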