Data 607 Assignment 3

Libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

majors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv")

Question 1

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

majors_with_data_or_stats <- majors %>%
  filter(str_detect(Major, regex("DATA|STATISTICS", ignore_case = TRUE)))

print(majors_with_data_or_stats)

##   Major_code                                         Major
## 1       2101      COMPUTER PROGRAMMING AND DATA PROCESSING
## 2       3702               STATISTICS AND DECISION SCIENCE
## 3       6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS
##            Major_category  Total Employed Employed_full_time_year_round
## 1 Computers & Mathematics  29317    22828                         18747
## 2 Computers & Mathematics  24806    18808                         14468
## 3                Business 156673   134478                        118249
##   Unemployed Unemployment_rate Median P25th  P75th
## 1       2265        0.09026422  60000 40000  85000
## 2       1138        0.05705405  70000 43000 102000
## 3       6186        0.04397714  72000 50000 100000

Question 2

Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
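One possible way to perform this transformation is sketched below. It assumes the printed data has been pasted into a single string; the names `raw_fruits`, `fruits`, and `fruit_call` are hypothetical and chosen only for illustration.

```r
library(stringr)

# Hypothetical input: the console printout stored as one string
raw_fruits <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Pull out every double-quoted item, then strip the quotes
fruits <- str_extract_all(raw_fruits, '"[^"]+"')[[1]]
fruits <- str_remove_all(fruits, '"')

# Rebuild the items as the text of a c(...) call
fruit_call <- str_c("c(", str_c('"', fruits, '"', collapse = ", "), ")")
writeLines(fruit_call)
```

Here `fruits` is an ordinary character vector of 14 items, and `fruit_call` is the text of the `c(...)` expression that would recreate it.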

Question 3

Describe, in words, what these expressions will match:

(..)\1

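This matches any two characters, captured as a group by (..), followed immediately by the same two characters again (for example “abab”). In R code the backslash has to be doubled, because a single backslash inside a string literal is treated as an escape character; writing "(..)\\1" passes the regex (..)\1 to the regex engine.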

text3 <- "aedaddaxyzabab"
str_view(text3, "(..)\\1", match = TRUE)

## [1] │ aedaddaxyz<abab>
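To see the effect of the doubled backslash, the short sketch below prints the pattern as the regex engine receives it and extracts the matched text. The variable names `pattern` and `sample_text` are illustrative only.

```r
library(stringr)

pattern <- "(..)\\1"

# writeLines() shows the string after R has processed the escape,
# i.e. the regex with a single backslash
writeLines(pattern)

# Six characters: ( . . ) \ 1
nchar(pattern)

# Extract the first match from the same sample text used above
sample_text <- "aedaddaxyzabab"
str_extract(sample_text, pattern)   # "abab"
```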

“(.).\1.\1”

This matches a five-character sequence: any character (captured as group 1), then any character, then the captured character again, then any character, then the captured character once more. For example, it matches “abaza” and “sdsms”.

text4 <- "dcbabazadcbsdsms"
str_view(text4, "(.).\\1.\\1", match = TRUE)

## [1] │ dcb<abaza>dcb<sdsms>

“(.)(.)(.).*\3\2\1”

This matches any three characters (each captured as its own group), followed by zero or more characters, followed by the same three characters in reverse order. For example, it matches “abczcba”.

text5 <- "stuwyqabczcbaioeuo"
str_view(text5, "(.)(.)(.).*\\3\\2\\1", match = TRUE)

## [1] │ stuwyq<abczcba>ioeuo
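As a way to inspect what each back-reference refers to, `str_match()` returns the full match together with every capture group. The sketch below reuses the sample string from above; the names `sample_text` and `m` are hypothetical.

```r
library(stringr)

sample_text <- "stuwyqabczcbaioeuo"

# str_match() returns a matrix: column 1 is the whole match,
# columns 2 to 4 are capture groups 1 to 3
m <- str_match(sample_text, "(.)(.)(.).*\\3\\2\\1")

m[1, 1]     # whole match: "abczcba"
m[1, 2:4]   # groups 1, 2, 3: "a" "b" "c"
```

Reading the groups left to right and then right to left shows why the match ends in “cba”.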

Question 4

Construct regular expressions to match words that:

Start and end with the same character.

words <- c("banana", "apple", "cherry", "radar", "level", "pop")

str_detect(words, "^(.).*\\1$")

## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE
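The pattern above needs at least two characters, so a one-letter word such as “a” is not matched even though it trivially starts and ends with the same character. If that case matters, one possible variant (a sketch only; `edge_words` is a hypothetical test vector) makes everything after the first character optional:

```r
library(stringr)

edge_words <- c("a", "radar", "banana", "ab")

# Original pattern: requires a second occurrence of the first character
str_detect(edge_words, "^(.).*\\1$")      # FALSE  TRUE FALSE FALSE

# Variant: the "anything, then the first character again" part is optional,
# so a single character also counts
str_detect(edge_words, "^(.)(.*\\1)?$")   #  TRUE  TRUE FALSE FALSE
```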

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

# Named pair_words rather than letters to avoid masking base R's built-in letters vector
pair_words <- c("church", "success", "committee", "occurred", "aggressive", "necessary")
str_detect(pair_words, "(..).*\\1")

## [1]  TRUE FALSE FALSE FALSE FALSE FALSE
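Because `.` matches any character, the pattern above also accepts repeated pairs of digits or punctuation. If only letters should count, a stricter variant can be sketched with a character class; the vector `pair_check` below is a hypothetical example.

```r
library(stringr)

pair_check <- c("church", "1212", "banana")

# Any repeated pair of characters
str_detect(pair_check, "(..).*\\1")                # TRUE  TRUE TRUE

# Only a repeated pair of letters
str_detect(pair_check, "([[:alpha:]]{2}).*\\1")    # TRUE FALSE TRUE

# str_match() shows which pair was found in the first word
str_match(pair_check[1], "(..).*\\1")[1, 2]        # "ch"
```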

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

letters2 <- c("church", "success", "committee", "occurred", "aggressive", "necessary", "eleven", "accelerate")
# The trailing .* is not needed for detection, so it is dropped
str_detect(letters2, "(.).*\\1.*\\1")

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
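As a cross-check on the regex, the same question can be answered by counting characters directly. The sketch below is an illustration rather than part of the original solution; `sample_words`, `by_regex`, and `by_count` are hypothetical names.

```r
library(stringr)

sample_words <- c("church", "success", "eleven", "accelerate")

# Regex approach: some character occurs in at least three places
by_regex <- str_detect(sample_words, "(.).*\\1.*\\1")

# Counting approach: split each word into characters, tabulate them,
# and check whether the most frequent character occurs three or more times
by_count <- vapply(
  strsplit(sample_words, ""),
  function(chars) max(table(chars)) >= 3,
  logical(1)
)

# Both approaches should agree
identical(by_regex, by_count)
```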