Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to the problems below. You may work in a small group, but please submit separately with names of all group participants in your submission.
Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
#Firstly, im pulling in the data from github along with the packages i need.
library(tidyverse, quietly=TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
majors_df <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")
## Rows: 174 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(majors_df)
## # A tibble: 6 × 3
## FOD1P Major Major_Category
## <chr> <chr> <chr>
## 1 1100 GENERAL AGRICULTURE Agriculture & Natural Resources
## 2 1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3 1102 AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4 1103 ANIMAL SCIENCES Agriculture & Natural Resources
## 5 1104 FOOD SCIENCE Agriculture & Natural Resources
## 6 1105 PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources
# Ensuring the case is all upper
majors_df$Major <- toupper(majors_df$Major)
data_mjrs <- str_subset(majors_df$Major, pattern = "(DATA)|(STATISTICS)")
data_mjrs
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
Write code that transforms the data below:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”
Into a format like this:
c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
#looking at fruit
fruit
## [1] "apple" "apricot" "avocado"
## [4] "banana" "bell pepper" "bilberry"
## [7] "blackberry" "blackcurrant" "blood orange"
## [10] "blueberry" "boysenberry" "breadfruit"
## [13] "canary melon" "cantaloupe" "cherimoya"
## [16] "cherry" "chili pepper" "clementine"
## [19] "cloudberry" "coconut" "cranberry"
## [22] "cucumber" "currant" "damson"
## [25] "date" "dragonfruit" "durian"
## [28] "eggplant" "elderberry" "feijoa"
## [31] "fig" "goji berry" "gooseberry"
## [34] "grape" "grapefruit" "guava"
## [37] "honeydew" "huckleberry" "jackfruit"
## [40] "jambul" "jujube" "kiwi fruit"
## [43] "kumquat" "lemon" "lime"
## [46] "loquat" "lychee" "mandarine"
## [49] "mango" "mulberry" "nectarine"
## [52] "nut" "olive" "orange"
## [55] "pamelo" "papaya" "passionfruit"
## [58] "peach" "pear" "persimmon"
## [61] "physalis" "pineapple" "plum"
## [64] "pomegranate" "pomelo" "purple mangosteen"
## [67] "quince" "raisin" "rambutan"
## [70] "raspberry" "redcurrant" "rock melon"
## [73] "salal berry" "satsuma" "star fruit"
## [76] "strawberry" "tamarillo" "tangerine"
## [79] "ugli fruit" "watermelon"
#Picking out the index #s we want
subset_fruit = fruit[c(5:7,9:10,14,17,19,29,45,47,50,54,73)]
# Found this function online to print the literal vector as question 2 demands.
dput(subset_fruit)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry",
## "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime",
## "lychee", "mulberry", "orange", "salal berry")
Describe, in words, what these expressions will match:
#(.)\1\1
#This will match any character that is repeated three consecutive times. The "." is any character except line breaks. The first "\1" #matches the set of characters if the first finding is repeated twice. The second "\1" makes it so if a character is repeated a third consecutive time it will be captured.
#"(.)(.)\\2\\1"
#This will match any two characters where the second repeats 2 and the last character is a repeat of the first.
#(..)\1
#This will match any two characters that repeat themself once. For instance jkjk or hhhh. The "(..)" matches any group of any 2 #characters. The \1 essentially limits to any to characters that repeat once.
#"(.).\\1.\\1"
#This will match group of characters where hte first character is repeated after the second character. Additionally,the repeat of the second instance of the first character is followed by another character. The lats \\1 means that the first character has to repeat one more time after the second "." character.
#"(.)(.)(.).*\\3\\2\\1"
#This matches any combination of 3 initial single characters as three groups. The ".*" matches anything. However, these two conditons must be followed by the third, second, and first characters in that order.
Construct regular expressions to match words that:
Start and end with the same character. Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.) Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
#----
#Start and end with the same character.
#^(.).*\\1$
#----
#Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
#^(..).*\\1(.*)?$
#----
#----
#Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
#.?(.)(.*)?\\1(.*)?\\1.?
#----
#----