Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(stringr)
Majors_list <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")
Majors_list[grep("DATA|STATISTICS", Majors_list$Major,ignore.case = TRUE),]
## FOD1P Major Major_Category
## 44 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business
## 52 2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 59 3702 STATISTICS AND DECISION SCIENCE Computers & Mathematics
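The same filter can also be written with a logical mask instead of row indices. The block below is only a sketch: `majors_demo` is a hypothetical stand-in for `Majors_list` (three majors copied from the output above plus one made-up non-matching row), used so the example runs without a network connection.

```r
# Hypothetical stand-in for Majors_list so the sketch runs offline.
majors_demo <- data.frame(
  FOD1P = c("6212", "2101", "3702", "0000"),
  Major = c(
    "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
    "COMPUTER PROGRAMMING AND DATA PROCESSING",
    "STATISTICS AND DECISION SCIENCE",
    "EXAMPLE MAJOR WITH NO MATCH"  # made-up row, not in the real dataset
  ),
  stringsAsFactors = FALSE
)

# grepl() returns one TRUE/FALSE per row, which can index the rows directly.
is_match <- grepl("DATA|STATISTICS", majors_demo$Major, ignore.case = TRUE)
matched_majors <- majors_demo[is_match, ]
print(matched_majors$Major)
```

With stringr loaded, `str_detect()` could stand in for `grepl()` here; base R is used only to keep the sketch free of extra packages.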
Write code that transforms the data below:
I compared different answers with peers and in lecture, and there is still
some confusion about what the question means. The most straightforward
way I understand to answer it is the following.
# first we store all values in a character vector
Product <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry",
"olive", "salal berry")
# then collapse the vector into a single string, with the items separated by spaces
str_c(Product, collapse = " ")
## [1] "bell pepper bilberry blackberry blood orange blueberry cantaloupe chili pepper cloudberry elderberry lime lychee mulberry olive salal berry"
(.)\1\1 This expression matches any single character that appears three times in a row, e.g. "aaa". Written as an R string, each backslash must be doubled: "(.)\\1\\1".
“(.)(.)\\2\\1” This expression matches two characters followed by the same two characters in reverse order, e.g. "abba".
(..)\1 This expression matches any two characters that are immediately repeated in the same order, e.g. "abab".
“(.).\\1.\\1” This expression matches five characters in which the first, third, and fifth are the same, e.g. "abaca".
“(.)(.)(.).*\\3\\2\\1” This expression matches three characters, then zero or more of any character, then the same three characters in reverse order, e.g. "paragrap" in "paragraph".
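The five patterns can be checked side by side against a few short strings. This is a sketch only: `test_strings` and the pattern names are made up for illustration, and base R's `grepl()` with `perl = TRUE` is used in place of stringr. Each backslash is doubled because, inside an R string literal, a single "\1" is read as an escape for the character with code 1 and not as a backreference.

```r
# Made-up test strings: one intended match per pattern, plus a control.
test_strings <- c("aaa", "abba", "abab", "abaca", "abcxyzcba", "hello")

# Patterns written as R strings, so every backslash is doubled.
patterns <- c(
  triple            = "(.)\\1\\1",
  mirrored_pair     = "(.)(.)\\2\\1",
  doubled_pair      = "(..)\\1",
  first_third_fifth = "(.).\\1.\\1",
  mirrored_triple   = "(.)(.)(.).*\\3\\2\\1"
)

# For each pattern, keep the test strings that contain a match.
matches <- lapply(patterns, function(p) {
  test_strings[grepl(p, test_strings, perl = TRUE)]
})
print(matches)
```

Each pattern picks out exactly one of the test strings, and "hello" matches none of them, since a doubled letter alone satisfies none of the five patterns.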
library(babynames)
letters = c("abc","abcd","abcdef","xyz", "AAA","BBB")
str_match(letters, "abc")
## [,1]
## [1,] "abc"
## [2,] "abc"
## [3,] "abc"
## [4,] NA
## [5,] NA
## [6,] NA
str_match(letters, "\\d")
## [,1]
## [1,] NA
## [2,] NA
## [3,] NA
## [4,] NA
## [5,] NA
## [6,] NA
str_view(c(letters,"ccc","444"), "(.)\\1\\1")
## [5] │ <AAA>
## [6] │ <BBB>
## [7] │ <ccc>
## [8] │ <444>
str_view(fruit, "(.)(.)\\2\\1")
str_view(fruit, "(.).\\1.\\1" )
## [4] │ b<anana>
## [56] │ p<apaya>
str_view(words,"(.)(.)(.).*\\3\\2\\1")
## [598] │ <paragrap>h
Expression that starts and ends with the same character
As I get familiar with regular expressions, I am relying on an LLM (ChatGPT) to help me
generate patterns that work, which I then verify and test by trial and error.
For a regex that matches strings that start and end with the same character, we can use:
^(.).*\\1$
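A quick check of the start-and-end pattern, as a sketch: `edge_words` is a made-up vector, and base R's `grepl()` with `perl = TRUE` stands in for stringr. As written, the pattern needs at least two characters, so a one-character string does not match. The second pattern is one possible variant (an assumption, not part of the original answer) that also accepts a single character.

```r
# Made-up sample words, including a one-character string.
edge_words <- c("area", "dad", "window", "a", "apple")

# Original pattern: first character captured, same character required at the end.
same_ends <- grepl("^(.).*\\1$", edge_words, perl = TRUE)

# Variant: the "anything, then the first character again" part is optional,
# so a single character counts as starting and ending with itself.
same_ends_or_single <- grepl("^(.)(.*\\1)?$", edge_words, perl = TRUE)

print(data.frame(edge_words, same_ends, same_ends_or_single))
```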
For a regex that matches strings containing a repeated pair of letters (for example, "church" contains "ch" twice), we can use
"([A-Za-z][A-Za-z]).*\\1"
For a regex that matches strings in which one letter appears in at least three places (for example, "eleven" contains three "e"s), we can use
"([A-Za-z]).*\\1.*\\1"