Week 3 Assignment

1.

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(stringr)

Majors_list <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")

Majors_list[grep("DATA|STATISTICS", Majors_list$Major,ignore.case = TRUE),]

##    FOD1P                                         Major          Major_Category
## 44  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 52  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 59  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics
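
The same filter can also be written with grepl(), which returns a logical vector rather than row indices. The following is a minimal sketch on a small hand-made data frame; the four rows are an illustrative sample assumed for the demonstration, not the full dataset.

```r
# Illustrative sample only: the three majors from the output above plus one
# non-matching row (assumed here purely for the demonstration).
majors_sample <- data.frame(
  FOD1P = c("6212", "2101", "3702", "1100"),
  Major = c("MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "GENERAL AGRICULTURE"),
  stringsAsFactors = FALSE
)

# grepl() gives TRUE/FALSE per row, so it can be used directly as a row filter.
is_match <- grepl("DATA|STATISTICS", majors_sample$Major, ignore.case = TRUE)
matching_majors <- majors_sample[is_match, ]
print(matching_majors$Major)
```

A logical vector is convenient when the condition needs to be combined with others using & or |.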

2.

Write code that transforms the data below:
Peers and the lecture suggested different answers, and the meaning of the question is still somewhat unclear to me. The most straightforward interpretation I can see is the following.

# first, store all values in a character vector

Product <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry",
             "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime",
             "lychee", "mulberry", "olive", "salal berry")

# then join the elements into a single string, separated by single spaces
str_c(Product, collapse = " ")

## [1] "bell pepper bilberry blackberry blood orange blueberry cantaloupe chili pepper cloudberry elderberry lime lychee mulberry olive salal berry"
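
If the question is instead read as asking for the vector to be printed in the form of a c(...) call, one possible sketch is to quote each element and join the pieces with commas. This reading is an assumption, not a confirmed interpretation of the question, and the shortened vector product_sample is used only to keep the example small.

```r
# Shortened copy of the Product vector (illustrative).
product_sample <- c("bell pepper", "bilberry", "blackberry")

# Wrap each element in double quotes, join with ", ", then wrap in c( ).
quoted <- paste0('"', product_sample, '"')
as_call <- paste0("c(", paste(quoted, collapse = ", "), ")")

# writeLines() prints the string without escaping the inner quotes.
writeLines(as_call)
```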

3. Describe, in words, what these expressions will match. Code demonstrations follow the descriptions.

(.)\1\1 matches the same character three times in a row, e.g. “aaa”.
“(.)(.)\\2\\1” matches two characters followed by the same two characters in reverse order, e.g. “abba”.
(..)\1 matches any two characters immediately repeated in the same order, e.g. “abab”.
“(.).\\1.\\1” matches five characters in which the first, third, and fifth are identical, e.g. “abaca”.
“(.)(.)(.).*\\3\\2\\1” matches three characters, followed by zero or more characters of any kind, followed by the same three characters in reverse order, e.g. “abcxcba”.

library(babynames)


letters = c("abc","abcd","abcdef","xyz", "AAA","BBB")

str_match(letters, "abc")

##      [,1] 
## [1,] "abc"
## [2,] "abc"
## [3,] "abc"
## [4,] NA   
## [5,] NA   
## [6,] NA

str_match(letters, "\\d")

##      [,1]
## [1,] NA  
## [2,] NA  
## [3,] NA  
## [4,] NA  
## [5,] NA  
## [6,] NA

str_view(c(letters,"ccc","444"), "(.)\\1\\1")

## [5] │ <AAA>
## [6] │ <BBB>
## [7] │ <ccc>
## [8] │ <444>

str_view(fruit, "(.)(.)\\2\\1")
str_view(fruit, "(.).\\1.\\1" )

##  [4] │ b<anana>
## [56] │ p<apaya>

str_view(words,"(.)(.)(.).*\\3\\2\\1")

## [598] │ <paragrap>h
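
As a compact cross-check of all five expressions, the sketch below runs each one against a few hand-picked strings using base R's grepl(). The sample strings and the pattern names are chosen for illustration and are not part of the assignment.

```r
# Hand-picked test strings (illustrative, not from the assignment).
samples <- c("abba", "abab", "aaa", "abaca", "abcxcba", "xyz")

# The five expressions from this question, with descriptive (made-up) names.
patterns <- c(
  triple       = "(.)\\1\\1",
  mirror_pair  = "(.)(.)\\2\\1",
  double_pair  = "(..)\\1",
  alt_same     = "(.).\\1.\\1",
  mirror_three = "(.)(.)(.).*\\3\\2\\1"
)

# For each pattern, list the sample strings that contain a match.
matches <- lapply(patterns, function(p) samples[grepl(p, samples, perl = TRUE)])
print(matches)
```

Each pattern matches exactly one of the sample strings, which makes it easy to see which structure each backreference enforces.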

4.

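Regular expressions for words that start and end with the same character, contain a repeated pair of letters, or contain one letter repeated in at least three places.
While I get familiar with regex, I am relying on an LLM (ChatGPT) to help me generate expressions that work; I then verify and test them by trial and error.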

For words that start and end with the same character, we can use

"^(.).*\\1$"

For words that contain a repeated pair of letters (e.g. “church” contains “ch” twice), we can use

"(..).*\\1"

For words that contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s, not necessarily adjacent), we can use

"(.).*\\1.*\\1"

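A quick way to check patterns for these three properties is to run them against a few words whose behaviour is easy to predict. The sketch below uses base R's grepl(); the test words, the variable names, and the particular patterns are one possible choice made for illustration.

```r
# Hand-picked example words (illustrative only).
test_words <- c("area", "church", "eleven", "dog", "moon")

starts_ends_same  <- "^(.).*\\1$"     # first character equals last character
repeated_pair     <- "(..).*\\1"      # some two-character pair occurs twice
three_occurrences <- "(.).*\\1.*\\1"  # some character occurs at least three times

same_ends   <- test_words[grepl(starts_ends_same,  test_words, perl = TRUE)]
pair_twice  <- test_words[grepl(repeated_pair,     test_words, perl = TRUE)]
three_times <- test_words[grepl(three_occurrences, test_words, perl = TRUE)]

print(list(same_ends = same_ends, pair_twice = pair_twice, three_times = three_times))
```

Note that "moon" matches none of the three: a doubled letter is neither a repeated pair nor a letter in three places, which is why the ".*" between the group and its backreference matters.
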
Bishoy Sokkar

2024-02-10
