Load the tidyverse package

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

#1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

# read the csv file linked in the article straight into a data frame
# (read.csv() accepts a URL and already returns a data.frame)
majors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv")

# keep only the majors whose name contains DATA or STATISTICS
dat_or_stat <- majors %>%
  filter(str_detect(Major, "DATA|STATISTICS"))

glimpse(dat_or_stat)
## Rows: 3
## Columns: 11
## $ Major_code                    <int> 2101, 3702, 6212
## $ Major                         <chr> "COMPUTER PROGRAMMING AND DATA PROCESSIN…
## $ Major_category                <chr> "Computers & Mathematics", "Computers & …
## $ Total                         <int> 29317, 24806, 156673
## $ Employed                      <int> 22828, 18808, 134478
## $ Employed_full_time_year_round <int> 18747, 14468, 118249
## $ Unemployed                    <int> 2265, 1138, 6186
## $ Unemployment_rate             <dbl> 0.09026422, 0.05705405, 0.04397714
## $ Median                        <int> 60000, 70000, 72000
## $ P25th                         <int> 40000, 43000, 50000
## $ P75th                         <dbl> 85000, 102000, 100000
dat_or_stat
##   Major_code                                         Major
## 1       2101      COMPUTER PROGRAMMING AND DATA PROCESSING
## 2       3702               STATISTICS AND DECISION SCIENCE
## 3       6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS
##            Major_category  Total Employed Employed_full_time_year_round
## 1 Computers & Mathematics  29317    22828                         18747
## 2 Computers & Mathematics  24806    18808                         14468
## 3                Business 156673   134478                        118249
##   Unemployed Unemployment_rate Median P25th  P75th
## 1       2265        0.09026422  60000 40000  85000
## 2       1138        0.05705405  70000 43000 102000
## 3       6186        0.04397714  72000 50000 100000
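
The pattern "DATA|STATISTICS" is case-sensitive, which works here because the Major column is stored in upper case. As a minimal sketch, assuming a small made-up vector of mixed-case names (not taken from the dataset), the same test could be made case-insensitive with stringr's regex() helper:

```r
library(stringr)

# Hypothetical mixed-case major names, for illustration only
example_majors <- c("Statistics and Decision Science", "ART HISTORY", "data processing")

# regex(..., ignore_case = TRUE) lets the alternation match regardless of case
is_match <- str_detect(example_majors, regex("DATA|STATISTICS", ignore_case = TRUE))
example_majors[is_match]
```

The same regex() call could be dropped into filter(str_detect(Major, ...)) if the column were not consistently upper case.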

#2 Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

a_vector <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
print(a_vector)
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
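
Typing the vector by hand works, but the exercise can also be read as starting from the printed text itself. A minimal sketch of that reading, assuming the printed output has been pasted into a single string (the names raw_text, fruits, and as_code are made up for this example):

```r
library(stringr)

# The printed output, pasted in as one string (single quotes so the inner " survive)
raw_text <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Pull out every double-quoted item, then strip the surrounding quotes
fruits <- str_extract_all(raw_text, '"[^"]+"')[[1]]
fruits <- str_remove_all(fruits, '"')

# Rebuild the items as the text of a c(...) call
as_code <- str_c("c(", str_c('"', fruits, '"', collapse = ", "), ")")
cat(as_code)
```

Calling dput(fruits) would be another way to print the vector in c(...) form.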

#3 Describe, in words, what these expressions will match:

(.)\1\1
"(.)(.)\\2\\1"
(..)\1
"(.).\\1.\\1"
"(.)(.)(.).*\\3\\2\\1"

(.)\1\1 matches the same character three times in a row, e.g. "aaa": (.) captures any single character and each \1 has to repeat it. (Inside an R string this regex is written "(.)\\1\\1".)

"(.)(.)\\2\\1" matches any two characters followed by the same two characters in reverse order, e.g. "abba": the third character has to match the second, and the fourth has to match the first.

(..)\1 matches any two characters that are immediately repeated, e.g. "abab".

"(.).\\1.\\1" is not the same as the first one: it matches a character, then any character, then the first character again, then any character, then the first character once more, e.g. "abaca" or the "anana" in "banana".

"(.)(.)(.).*\\3\\2\\1" matches any three characters, then zero or more characters of any kind, then the same three characters in reverse order: the third, then the second, then the first, e.g. "abccba" or "abc123cba". The match does not have to sit at the end of the string.
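
A quick way to check descriptions like these is to run each pattern against one string that should match and one that should not. A minimal sketch, with test strings chosen here purely for illustration and the unquoted patterns written as R strings (doubled backslashes):

```r
library(stringr)

# Each call tests a string expected to match, then one expected not to match
three_in_a_row  <- str_detect(c("aaa", "aab"), "(.)\\1\\1")
mirrored_pair   <- str_detect(c("abba", "abab"), "(.)(.)\\2\\1")
repeated_pair   <- str_detect(c("abab", "abba"), "(..)\\1")
every_other     <- str_detect(c("abaca", "abcde"), "(.).\\1.\\1")
mirrored_triple <- str_detect(c("abcxyzcba", "abcxyzabc"), "(.)(.)(.).*\\3\\2\\1")

# Collect the results in one table: first column should match, second should not
rbind(three_in_a_row, mirrored_pair, repeated_pair, every_other, mirrored_triple)
```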

#4 Construct regular expressions to match words that:

Start and end with the same character.
Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)

#Credit to ChatGPT for these, I’m sorry

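# anchored to the whole word; the optional group lets a one-letter word match too
start_and_end_same_pattern <- "^(.)(.*\\1)?$"

# any two letters that show up again later in the word
repeated_pair_pattern <- "([A-Za-z][A-Za-z]).*\\1"

# one letter that shows up at least three times, with anything in between
one_letter_repeat_in_at_least_three_pattern <- "([A-Za-z]).*\\1.*\\1"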
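
The three requirements can be tried out on a handful of words. This is a minimal sketch, not the only possible answer: the pattern strings and the sample word list below are assumptions made for the example, written as quoted R strings with doubled backslashes.

```r
library(stringr)

# Hypothetical sample words, for illustration only
sample_words <- c("area", "church", "eleven", "banana", "dog", "a")

# Candidate patterns (one possible formulation of each requirement)
same_start_end <- "^(.)(.*\\1)?$"            # first character equals last character
repeated_pair  <- "([A-Za-z][A-Za-z]).*\\1"  # a pair of letters occurs twice
three_places   <- "([A-Za-z]).*\\1.*\\1"     # one letter occurs at least three times

# str_subset() keeps only the words that match each pattern
starts_ends  <- str_subset(sample_words, same_start_end)
has_pair     <- str_subset(sample_words, repeated_pair)
has_triple   <- str_subset(sample_words, three_places)

starts_ends   # "area" "a"
has_pair      # "church" "banana"
has_triple    # "eleven" "banana"
```

The first pattern is anchored with ^ and $ because it describes the whole word, while the other two only need to find the repetition somewhere inside it.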