Load the tidyverse package
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
# read the CSV file linked in the article; read.csv() accepts a URL and returns a data frame
majors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv")
# filter for majors whose name contains DATA or STATISTICS
dat_or_stat <- majors %>%
  filter(str_detect(Major, "DATA|STATISTICS"))
glimpse(dat_or_stat)
## Rows: 3
## Columns: 11
## $ Major_code <int> 2101, 3702, 6212
## $ Major <chr> "COMPUTER PROGRAMMING AND DATA PROCESSIN…
## $ Major_category <chr> "Computers & Mathematics", "Computers & …
## $ Total <int> 29317, 24806, 156673
## $ Employed <int> 22828, 18808, 134478
## $ Employed_full_time_year_round <int> 18747, 14468, 118249
## $ Unemployed <int> 2265, 1138, 6186
## $ Unemployment_rate <dbl> 0.09026422, 0.05705405, 0.04397714
## $ Median <int> 60000, 70000, 72000
## $ P25th <int> 40000, 43000, 50000
## $ P75th <dbl> 85000, 102000, 100000
dat_or_stat
## Major_code Major
## 1 2101 COMPUTER PROGRAMMING AND DATA PROCESSING
## 2 3702 STATISTICS AND DECISION SCIENCE
## 3 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS
## Major_category Total Employed Employed_full_time_year_round
## 1 Computers & Mathematics 29317 22828 18747
## 2 Computers & Mathematics 24806 18808 14468
## 3 Business 156673 134478 118249
## Unemployed Unemployment_rate Median P25th P75th
## 1 2265 0.09026422 60000 40000 85000
## 2 1138 0.05705405 70000 43000 102000
## 3 6186 0.04397714 72000 50000 100000
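The filter above relies on the majors being stored in upper case. As a minimal sketch, assuming the names might arrive in mixed case, the same pattern can be wrapped in `regex(ignore_case = TRUE)`. The `toy_majors` data frame below is a made-up stand-in for illustration, not part of the FiveThirtyEight file.

```r
library(dplyr)
library(stringr)

# Hypothetical miniature stand-in for the majors data (not the real file)
toy_majors <- data.frame(
  Major = c("STATISTICS AND DECISION SCIENCE", "Data Science", "GENERAL BUSINESS"),
  stringsAsFactors = FALSE
)

# regex(ignore_case = TRUE) makes the match independent of capitalization
toy_matches <- toy_majors %>%
  filter(str_detect(Major, regex("DATA|STATISTICS", ignore_case = TRUE)))

print(toy_matches)
```

With the real data the case-sensitive pattern is enough, since every value in the Major column shown above is upper case.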
#2 Write code that transforms the data below:
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
a_vector <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
print(a_vector)
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
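The answer above types the vector in by hand. As a rough sketch of doing the transformation in code instead, the printed output can be stored as a single string and the quoted items pulled out with stringr. The names `raw_text` and `fruits` are illustrative and not part of the exercise.

```r
library(stringr)

# The printed output from the exercise, stored as one raw string
raw_text <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Extract everything between double quotes, then drop the quotes themselves
fruits <- str_extract_all(raw_text, '"[^"]+"')[[1]]
fruits <- str_remove_all(fruits, '"')

# dput() prints the vector in c("...", "...") form
dput(fruits)
```

The index markers such as `[1]` and `[5]` are ignored automatically because they are not inside quotes.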
#3 Describe, in words, what these expressions will match:
(.)\1\1
"(.)(.)\\2\\1"
(..)\1
"(.).\\1.\\1"
"(.)(.)(.).*\\3\\2\\1"
(.)\1\1 matches the same character three times in a row (e.g. "aaa"): (.) captures any single character, and each \1 must repeat that captured character. Written as an R string, this regex is "(.)\\1\\1".
"(.)(.)\\2\\1" matches any two characters followed by the same two characters in reverse order (e.g. "abba"): the third character has to match the second, and the fourth has to match the first.
(..)\1 matches any two characters that are immediately repeated (e.g. "abab"): the second pair has to be identical to the first pair.
"(.).\\1.\\1" matches a character that appears three times with exactly one arbitrary character between each appearance (e.g. "abaca"). Unlike the first expression, the repeats are not adjacent.
"(.)(.)(.).*\\3\\2\\1" matches any three characters, then zero or more characters of any kind, then the same three characters in reverse order (e.g. "abcxyzcba"). The reversed run does not have to sit at the end of the string.
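One way to check descriptions like these is to run each pattern against a string that should match. The sketch below assumes the patterns are written as R strings (so every backslash is doubled), and the example strings are made up for illustration.

```r
library(stringr)

# The five expressions from the question, written as R strings
patterns <- c(
  "(.)\\1\\1",
  "(.)(.)\\2\\1",
  "(..)\\1",
  "(.).\\1.\\1",
  "(.)(.)(.).*\\3\\2\\1"
)

# One illustrative string per pattern, in the same order
examples <- c("aaa", "abba", "abab", "abaca", "abcxyzcba")

# str_detect() is vectorised over both arguments, so this tests them pairwise
matches <- str_detect(examples, patterns)
print(setNames(matches, patterns))

# A counterexample: "abab" is not of the form x y y x
abab_is_mirrored <- str_detect("abab", "(.)(.)\\2\\1")
print(abab_is_mirrored)
```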
#4 Construct regular expressions to match words that:
Start and end with the same character.
Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
#Credit to ChatGPT for these, I’m sorry
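# same first and last character (a one-character word also counts)
start_and_end_same_pattern <- "^(.)(.*\\1)?$"
# any pair of letters that appears again later in the word
repeated_pair_pattern <- "([A-Za-z][A-Za-z]).*\\1"
# one letter that appears in at least three places
one_letter_repeat_in_at_least_three_pattern <- "([A-Za-z]).*\\1.*\\1"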
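A quick way to sanity-check patterns for these three tasks is to run them over a handful of words with `str_subset()`. The sketch below is one possible set of patterns, not the only valid answer. The variable names and the word list are hypothetical and chosen so the block runs on its own.

```r
library(stringr)

# Illustrative patterns for the three tasks (assumed, not the only solution)
same_ends  <- "^(.)(.*\\1)?$"             # first character equals last character
pair_twice <- "([A-Za-z][A-Za-z]).*\\1"   # a pair of letters that occurs again later
three_same <- "([A-Za-z]).*\\1.*\\1"      # one letter in at least three places

# Small made-up word list, including the examples from the question
test_words <- c("church", "eleven", "area", "banana", "dog")

same_ends_hits  <- str_subset(test_words, same_ends)   # "area"
pair_twice_hits <- str_subset(test_words, pair_twice)  # "church" "banana"
three_same_hits <- str_subset(test_words, three_same)  # "eleven" "banana"

print(same_ends_hits)
print(pair_twice_hits)
print(three_same_hits)
```

The anchors `^` and `$` matter only for the first pattern, because the other two tasks ask whether the word contains the repetition anywhere.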