Problem: Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
Solution to Question 1
To solve Question 1, I opened the article linked above on the
fivethirtyeight website and followed it to the raw college majors data
on GitHub.
I loaded the tidyverse, which already attaches dplyr, readr and stringr,
so the separate library(dplyr) and library(readr) calls are harmless but
redundant. I then read majors-list.csv into a data frame called majors
and searched it with filter(). The filter uses str_detect() with a regex
to keep only the majors whose names contain either ‘DATA’ or
‘STATISTICS’ and stores them in a new data frame called
mathematical_majors.
#import data and packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)
majors <- read.csv(url("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"))
#filter majors that contain "DATA" or "STATISTICS"
(mathematical_majors <- majors %>%
  filter(str_detect(Major, "DATA") | str_detect(Major, "STATISTICS")))
##   FOD1P                                         Major          Major_Category
## 1  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 2  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics
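The two str_detect() calls can also be folded into a single pattern, because `|` inside a regex means "or". The sketch below assumes dplyr and stringr are installed and uses a small hand-typed sample (`majors_sample`, a hypothetical stand-in for the full majors data frame) so that it runs without a download.

```r
library(dplyr)
library(stringr)

# Small hand-typed sample standing in for the full majors data frame
majors_sample <- data.frame(
  Major = c(
    "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
    "COMPUTER PROGRAMMING AND DATA PROCESSING",
    "STATISTICS AND DECISION SCIENCE",
    "GENERAL AGRICULTURE",
    "MATHEMATICS"
  )
)

# "DATA|STATISTICS" matches a major that contains either word
matched <- majors_sample %>%
  filter(str_detect(Major, "DATA|STATISTICS"))

print(matched$Major)
```

On the real data the same filter() call should return the same three rows as the two-condition version above, since both express "contains DATA or contains STATISTICS".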
Write code that transforms the data below:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”
Into a format like this:
c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
Solution to Question 2
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
fruit_vec <- paste0("\"", fruits, "\"")
fru_vec2 <- paste0(fruit_vec, collapse = ", ")
fru_vec3 <- paste0("c(", fru_vec2, ")")
cat(fru_vec3)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
Steps
I had a bit of a hard time getting the output to look exactly like the
target. My first attempt produced the right structure but without the
quotation marks around each fruit, so I eventually arrived at this
solution with a different approach.
I first created the character vector fruits. Then I added an escaped
quotation mark (a backslash followed by a quotation mark) before and
after each element and stored the result in a new vector called
“fruit_vec”.
Next, I collapsed the quoted words into a single string, separating each
word with a comma and a space. This string is “fru_vec2”.
Then, I added the leading “c(” and the trailing “)” to get “fru_vec3”.
Finally, I printed the string with the cat function, which writes the
quotation marks literally instead of showing the escaping backslashes.
The output is now in the desired format.
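The three steps can also be collapsed into one expression. This is a minimal sketch in base R, assuming a shortened three-element vector in place of the full fruit list; `fruit_code` is a hypothetical name. Writing the quotation mark inside single quotes (`'"'`) avoids the backslash escape, and the print() and cat() calls show the difference described in the last step.

```r
# Shortened stand-in for the full fruits vector used above
fruits <- c("bell pepper", "bilberry", "blackberry")

# Quote each element, join with ", ", then wrap the result in c( ... )
fruit_code <- sprintf("c(%s)", paste0('"', fruits, '"', collapse = ", "))

# print() shows the string with escaped quotes (\")
print(fruit_code)

# cat() writes the raw characters, so the quotes appear without backslashes
cat(fruit_code, "\n")
```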
Describe, in words, what these expressions will match:
Solution to Question 3
Note: in bullet points 2, 4, and 5 the expressions are written as R strings, so each backslash is doubled (the first backslash escapes the second). The question above shows the original expressions.
(.)\1\1 - matches any character repeated three times in a row, like ‘aaa’ or ‘bbb’.
“(.)(.)\\2\\1” - matches any two characters followed by the same two characters in reverse order, like ‘abba’ or ‘1441’.
(..)\1 - matches any pair of characters that is immediately repeated, like ‘3131’ or ‘abab’.
“(.).\\1.\\1” - matches a five-character sequence in which the first, third, and fifth characters are the same and the second and fourth can be anything, like ‘dedfd’ or ‘41424’.
“(.)(.)(.).*\\3\\2\\1” - matches any three characters, followed by zero or more characters of any kind, followed by the same three characters in reverse order, like ‘racecar’ or ‘123321’. The match is always at least six characters long, so ‘12321’ does not match.
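These descriptions can be checked against a few test strings. The sketch below assumes stringr is installed; the pattern names (`triple`, `mirror_pair`, and so on) and the test strings are illustrative choices, not part of the exercise.

```r
library(stringr)

# The five expressions, written as R strings (backslashes doubled)
patterns <- c(
  triple      = "(.)\\1\\1",
  mirror_pair = "(.)(.)\\2\\1",
  pair_twice  = "(..)\\1",
  alternating = "(.).\\1.\\1",
  mirror_trio = "(.)(.)(.).*\\3\\2\\1"
)

test_strings <- c("aaa", "abba", "abab", "dedfd", "racecar", "12321", "123321")

# One row per test string, one column per pattern; TRUE means a match
results <- sapply(patterns, function(p) str_detect(test_strings, p))
rownames(results) <- test_strings
print(results)
```

In the printed matrix, ‘racecar’ and ‘123321’ are TRUE in the `mirror_trio` column while ‘12321’ is FALSE, because the last pattern needs three characters for the group and three more for its reversal.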
Construct regular expressions to match words that:
Solution to Question 4
Note: the expressions below are written as R strings, so each backreference uses two backslashes. In the script, (a) to (c) are typed with three backslashes so that two are displayed in the HTML output.
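a). Start and end with the same character -> ^(.).*\\1$
b). Contain a repeated pair of letters -> (..).*\\1 (the group must capture two characters; a one-character group would match a repeated single letter instead)
c). Contain one letter repeated in at least three places -> (.).*\\1.*\\1 (the .* between the backreferences lets the repeats appear anywhere in the word, not only at every other position)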
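A quick way to sanity-check patterns for the three descriptions is to run them over a handful of words. This sketch assumes stringr is installed; the word list and the variable names are illustrative. It uses a two-character group for the repeated pair and `.*` between the backreferences for the letter that appears in three places.

```r
library(stringr)

# Illustrative sample words
words_sample <- c("area", "church", "eleven", "banana", "apple")

same_ends     <- "^(.).*\\1$"     # a) starts and ends with the same character
repeated_pair <- "(..).*\\1"      # b) some pair of characters occurs twice
three_places  <- "(.).*\\1.*\\1"  # c) some character occurs at least three times

ends_hits  <- str_subset(words_sample, same_ends)
pair_hits  <- str_subset(words_sample, repeated_pair)
three_hits <- str_subset(words_sample, three_places)

print(ends_hits)
print(pair_hits)
print(three_hits)
```

Here ‘church’ matches (b) because ‘ch’ occurs twice, and ‘eleven’ matches (c) because ‘e’ occurs three times. Note that pattern (a) requires at least two characters, so a one-letter word would not match it.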