Week 3 Assignment

1.

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(stringr)

Majors_list <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")

Majors_list[grep("DATA|STATISTICS", Majors_list$Major,ignore.case = TRUE),]
##    FOD1P                                         Major          Major_Category
## 44  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 52  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 59  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics
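
The same filter can also be written with dplyr and stringr. The sketch below is a minimal illustration, not the assignment's required method. To run without a network connection it uses a small hand-made data frame (`sample_majors`, a hypothetical name) holding the three majors from the output above plus one non-matching name.

```r
library(dplyr)
library(stringr)

# Hand-made sample (assumption): three majors copied from the output above
# plus one name that should not match.
sample_majors <- data.frame(
  Major = c(
    "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
    "COMPUTER PROGRAMMING AND DATA PROCESSING",
    "STATISTICS AND DECISION SCIENCE",
    "GENERAL AGRICULTURE"
  )
)

# Keep rows whose Major contains either word, ignoring case
matches <- sample_majors %>%
  filter(str_detect(Major, regex("DATA|STATISTICS", ignore_case = TRUE)))

matches
```

Replacing `sample_majors` with `Majors_list` should give the same rows as the `grep()` call.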

2.

Write code that transforms the data below:
I compared answers with peers and reviewed the lecture, but the meaning of the question is still somewhat unclear to me. The most straightforward reading I can find is answered below.

# first, store all the values in a character vector

Product <- c("bell pepper", "bilberry", "blackberry", "blood orange",
             "blueberry", "cantaloupe", "chili pepper", "cloudberry",
             "elderberry", "lime", "lychee", "mulberry",
             "olive", "salal berry")

# collapse the vector into one string, separating the items with a single space
str_c(Product, collapse = " ")
## [1] "bell pepper bilberry blackberry blood orange blueberry cantaloupe chili pepper cloudberry elderberry lime lychee mulberry olive salal berry"
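
Another possible reading of the question, offered here only as an assumption, is that the vector should be printed as the text of a `c(...)` call. A minimal sketch of that reading, with `quoted` and `as_call` as hypothetical helper names:

```r
library(stringr)

Product <- c("bell pepper", "bilberry", "blackberry", "blood orange",
             "blueberry", "cantaloupe", "chili pepper", "cloudberry",
             "elderberry", "lime", "lychee", "mulberry",
             "olive", "salal berry")

# Wrap each item in double quotes, join with ", ", then wrap in c( ... )
quoted  <- str_c('"', Product, '"')
as_call <- str_c("c(", str_c(quoted, collapse = ", "), ")")

# writeLines() prints the string without escaping the inner quotes
writeLines(as_call)
```

Evaluating the printed text as R code should rebuild the original vector, which is a quick way to check the transformation.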

3. Describe, in words, what these expressions will match. Code demonstrating each expression is shown below.

library(babynames)


letters = c("abc","abcd","abcdef","xyz", "AAA","BBB")

str_match(letters, "abc") 
##      [,1] 
## [1,] "abc"
## [2,] "abc"
## [3,] "abc"
## [4,] NA   
## [5,] NA   
## [6,] NA
str_match(letters, "\\d")
##      [,1]
## [1,] NA  
## [2,] NA  
## [3,] NA  
## [4,] NA  
## [5,] NA  
## [6,] NA
str_view(c(letters,"ccc","444"), "(.)\\1\\1")
## [5] │ <AAA>
## [6] │ <BBB>
## [7] │ <ccc>
## [8] │ <444>
str_view(fruit, "(.)(.)\\2.\\1")
str_view(fruit, "(.).\\1.\\1" )
##  [4] │ b<anana>
## [56] │ p<apaya>
str_view(words,"(.)(.)(.).*\\3\\2\\1")
## [598] │ <paragrap>h
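
In words, the backreference patterns used above can be read as in the comments of the sketch below. The test strings in `x` are hand-made (hypothetical) and chosen so that each pattern matches exactly one of them. The patterns are taken as they are written in the code above.

```r
library(stringr)

# Hand-made test strings (assumption), one per pattern plus a non-match
x <- c("aaa", "abbxa", "banana", "abcxcba", "hello")

# One character followed by two more copies of itself, e.g. "aaa"
triple <- "(.)\\1\\1"

# Characters A and B, then B again, then any one character, then A again,
# e.g. "abbxa"
a_bb_any_a <- "(.)(.)\\2.\\1"

# The same character at three alternating positions, e.g. "anana" in "banana"
alternating <- "(.).\\1.\\1"

# Three characters, then anything, then the same three in reverse order,
# e.g. "par...rap" in "paragraph"
mirror3 <- "(.)(.)(.).*\\3\\2\\1"

str_subset(x, triple)
str_subset(x, a_bb_any_a)
str_subset(x, alternating)
str_subset(x, mirror3)
```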


4.

Expression that starts and ends with the same character
As I get familiar with regex, I am relying on an LLM (ChatGPT) to help me generate expressions that work, and I verify and test them by trial and error.

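For a regex that matches strings that start and end with the same character, we can use the pattern below. It requires at least two characters, so a one-character string is not matched.

"^(.).*\\1$"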

For a regex that matches words containing a repeated pair of letters (for example, "church" contains "ch" twice), we can use

"(\\w\\w).*\\1"

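For a regex that matches words containing one letter repeated in at least three places (for example, "eleven" contains "e" three times, not necessarily next to each other), we can use

"(\\w).*\\1.*\\1"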