Data 607 Week 3 Assignment

Problem 1

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

# Load necessary libraries
library(readr)
library(dplyr)

# Download and load the CSV from the GitHub raw URL
majors_df <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv")

# Check the structure to see the available columns
str(majors_df)

## spc_tbl_ [174 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ FOD1P         : chr [1:174] "1100" "1101" "1102" "1103" ...
##  $ Major         : chr [1:174] "GENERAL AGRICULTURE" "AGRICULTURE PRODUCTION AND MANAGEMENT" "AGRICULTURAL ECONOMICS" "ANIMAL SCIENCES" ...
##  $ Major_Category: chr [1:174] "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   FOD1P = col_character(),
##   ..   Major = col_character(),
##   ..   Major_Category = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

# Filter the majors that contain either "DATA" or "STATISTICS"
data_majors <- majors_df %>%
  filter(grepl("data|statistics", Major, ignore.case = TRUE))

# Display the filtered majors
data_majors

## # A tibble: 3 × 3
##   FOD1P Major                                         Major_Category         
##   <chr> <chr>                                         <chr>                  
## 1 6212  MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business               
## 2 2101  COMPUTER PROGRAMMING AND DATA PROCESSING      Computers & Mathematics
## 3 3702  STATISTICS AND DECISION SCIENCE               Computers & Mathematics
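The same filter can be checked without a network connection. The sketch below is a minimal base-R version on a small in-memory table; `majors_demo` is a hypothetical stand-in built from four rows that appear in the output above, not the full dataset.

```r
# Hypothetical stand-in for majors_df: four rows copied from the output above
majors_demo <- data.frame(
  FOD1P = c("1100", "2101", "3702", "6212"),
  Major = c("GENERAL AGRICULTURE",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"),
  stringsAsFactors = FALSE
)

# Same pattern as the dplyr filter: "data" or "statistics", ignoring case
is_match <- grepl("data|statistics", majors_demo$Major, ignore.case = TRUE)
matched_majors <- majors_demo$Major[is_match]
print(matched_majors)
```

Only the agriculture row should be dropped, which agrees with the three majors returned from the full file.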

Problem 2

Write code that transforms the data below:

[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
[9] "elderberry"   "lime"         "lychee"       "mulberry"    
[13] "olive"        "salal berry"

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

# Define the original vector of fruit names
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange",
            "blueberry", "cantaloupe", "chili pepper", "cloudberry",
            "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

# Use dput() to print the vector in the literal format
dput(fruits)

## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", 
## "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", 
## "lychee", "mulberry", "olive", "salal berry")
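The solution above starts from a vector that is already defined. If the starting point is instead the raw console printout, one possible approach is to extract the quoted pieces with a regular expression. This is a sketch under that assumption; `raw_text`, `quoted`, `fruits_parsed`, and `fruits_literal` are names introduced here for illustration.

```r
# Raw console printout as one string (copied from the problem statement)
raw_text <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Extract every double-quoted run, which skips the [n] index markers
quoted <- regmatches(raw_text, gregexpr('"[^"]+"', raw_text))[[1]]

# Strip the surrounding quotes to get a plain character vector
fruits_parsed <- gsub('"', "", quoted, fixed = TRUE)

# Build the single-line c(...) literal from the still-quoted pieces
fruits_literal <- paste0("c(", paste(quoted, collapse = ", "), ")")
cat(fruits_literal, "\n")
```

Unlike dput(), which wraps long output across lines, pasting the pieces together keeps the result on one line, as in the target format.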

Problem 3

Describe, in words, what these expressions will match:

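(.)\1\1
The group (.) captures any single character, and \1\1 requires that same character two more times.
Description: Any character repeated three times consecutively (e.g., “aaa”). This is the bare regular expression; written as an R string the backslashes must be doubled, "(.)\\1\\1", otherwise "\1" is read as a control character rather than a backreference.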
"(.)(.)\\2\\1"
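This expression captures two characters separately, then matches the second captured character followed by the first.
Description: A four-character sequence that reads the same forwards and backwards (e.g., “abba”). The two captured characters may be identical, so “aaaa” also matches.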
(..)\1
This captures any two characters as a group and then immediately repeats that exact pair.
Description: A repeated two-character sequence (e.g., “abab”).
"(.).\\1.\\1"
This captures a single character, then matches any character, the captured character again, another arbitrary character, and the captured character once more.
Description: A five-character sequence whose first, third, and fifth characters are identical; the second and fourth can be anything (e.g., “ababa” or “abaca”).
"(.)(.)(.).*\\3\\2\\1"
This captures three characters and then, after any number of characters (.* can also match nothing), requires the same three characters in reverse order.
Description: Because the pattern has no ^ or $ anchors, it matches anywhere in a string: any three characters, followed by zero or more characters, followed by those three characters in reverse order (e.g., “abccba” or “abc123cba”).
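The descriptions above can be checked against sample strings. The sketch below writes each pattern as an R string and returns the first substring it matches; `first_match` is a hypothetical helper, the test strings are made up, and `perl = TRUE` is an assumption chosen here for predictable backreference handling.

```r
# Patterns from Problem 3 as R strings (every backslash doubled)
patterns <- c(
  triple      = "(.)\\1\\1",
  mirror      = "(.)(.)\\2\\1",
  pair_twice  = "(..)\\1",
  alternating = "(.).\\1.\\1",
  reversed    = "(.)(.)(.).*\\3\\2\\1"
)

# Hypothetical helper: first matching substring, or NA when nothing matches
first_match <- function(pattern, text) {
  hit <- regmatches(text, regexpr(pattern, text, perl = TRUE))
  if (length(hit) == 0) NA_character_ else hit
}

# Each test string pads the expected match with unrelated characters
results <- c(
  triple      = first_match(patterns[["triple"]], "baaad"),
  mirror      = first_match(patterns[["mirror"]], "xabbay"),
  pair_twice  = first_match(patterns[["pair_twice"]], "xababy"),
  alternating = first_match(patterns[["alternating"]], "xabacay"),
  reversed    = first_match(patterns[["reversed"]], "xabc123cbay")
)
print(results)
```

In each case the match sits in the middle of the test string, which shows that none of these patterns is tied to the start or end of the input.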

Problem 4

Construct regular expressions to match words that:

a) Start and end with the same character.

regex_start_end <- "^(.).*\\1$"
regex_start_end

## [1] "^(.).*\\1$"
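One edge case worth checking is a one-letter word. The pattern in part (a) needs at least two characters, so a word such as “a” is not matched even though it arguably starts and ends with the same character. The sketch below compares it with a hypothetical variant, `with_single`, that makes the trailing part optional; the word list is made up and `perl = TRUE` is an assumption.

```r
# Made-up sample words
words <- c("a", "area", "stats", "data")

# Pattern from part (a), and a hypothetical variant that also accepts
# a single character by making ".*\\1" optional
original    <- "^(.).*\\1$"
with_single <- "^(.)(.*\\1)?$"

original_hits    <- grepl(original, words, perl = TRUE)
with_single_hits <- grepl(with_single, words, perl = TRUE)

print(data.frame(words, original_hits, with_single_hits))
```

Whether a one-letter word should count depends on how the prompt is read, so the variant is an option rather than a correction.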

b) Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice).

regex_repeated_pair <- "(..).*?\\1"
regex_repeated_pair

## [1] "(..).*?\\1"
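A quick check of the pattern in part (b) on a few made-up inputs is sketched below. Because (..) accepts any two characters, the pattern also matches repeated digit pairs; `letters_only` is a hypothetical stricter variant that restricts the pair to ASCII letters, and `perl = TRUE` is an assumption.

```r
# Made-up sample inputs, including one with no letters at all
words <- c("church", "banana", "apple", "1212")

# Pattern from part (b), and a hypothetical letters-only variant
original     <- "(..).*?\\1"
letters_only <- "([A-Za-z]{2}).*?\\1"

original_hits     <- grepl(original, words, perl = TRUE)
letters_only_hits <- grepl(letters_only, words, perl = TRUE)

print(data.frame(words, original_hits, letters_only_hits))
```

The two patterns agree on ordinary words and differ only on input such as “1212”.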

c) Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s).

regex_letter_triplicate <- ".*([A-Za-z]).*\\1.*\\1.*"
regex_letter_triplicate

## [1] ".*([A-Za-z]).*\\1.*\\1.*"
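The pattern in part (c) can be tested the same way. The sketch below uses made-up words and shows two points: the match is case-sensitive, so lowercasing the input first changes the result for “Eleven”, and the leading and trailing .* are optional when using grepl(), which searches anywhere in the string. `shorter` is a hypothetical trimmed variant and `perl = TRUE` is an assumption.

```r
# Made-up sample words; "Eleven" has one upper-case and two lower-case e's
words <- c("eleven", "banana", "apple", "Eleven")

# Pattern from part (c), and a hypothetical variant without the outer ".*"
original <- ".*([A-Za-z]).*\\1.*\\1.*"
shorter  <- "([A-Za-z]).*\\1.*\\1"

raw_hits     <- grepl(original, words, perl = TRUE)
lowered_hits <- grepl(original, tolower(words), perl = TRUE)
shorter_hits <- grepl(shorter, words, perl = TRUE)

print(data.frame(words, raw_hits, lowered_hits, shorter_hits))
```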

Conclusion

This was a fun assignment! It was an interesting way to practice reading, understanding, and writing regular expressions, and a new discovery for me.