library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
majors_list <- read.csv(url)
selected_majors <- majors_list %>%
  filter(str_detect(Major, "DATA") | str_detect(Major, "STATISTICS"))
selected_majors
##   FOD1P                                         Major          Major_Category
## 1  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 2  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics
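The two `str_detect()` calls can also be folded into a single alternation pattern. The sketch below is only illustrative: the `majors` data frame is a hypothetical three-row stand-in for `majors_list` (so nothing is downloaded), and the case-insensitive flag is an added assumption, not something the original filter uses.

```r
library(stringr)

# Hypothetical stand-in for a few rows of majors_list (not the real download)
majors <- data.frame(
  Major = c("STATISTICS AND DECISION SCIENCE",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "GENERAL AGRICULTURE"),
  stringsAsFactors = FALSE
)

# "DATA|STATISTICS" matches either word; ignore_case is an optional extra
pattern <- regex("DATA|STATISTICS", ignore_case = TRUE)
matched <- majors[str_detect(majors$Major, pattern), , drop = FALSE]
matched
```

Keeping the pattern in one place makes it easier to add further keywords later.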
x <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'
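# Remove the index digits printed by R (e.g. the 1 in [1])
x <- str_remove_all(x, "\\d+")
# Remove the now-empty brackets
x <- str_replace_all(x, "\\[\\]", "")
# Collapse newlines and runs of spaces into single spaces
x <- str_replace_all(x, "\\s+", " ")
# Turn the space after each closing quote into a comma
x <- str_replace_all(x, "\"\\s+", "\",")
# Wrap the list in c( ... ) so it reads as R code
x <- str_c("c(", x, ")")
str_view(x)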
## [1] │ c( "bell pepper","bilberry","blackberry","blood orange","blueberry","cantaloupe","chili pepper","cloudberry","elderberry","lime","lychee","mulberry","olive","salal berry")
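An alternative to stripping the unwanted pieces is to extract only the quoted items and then join them. This is a minimal sketch on a shortened, made-up input of the same shape; the variable names `items`, `code` and `fruits` are illustrative.

```r
library(stringr)

# Shortened, hypothetical input in the same printed-vector format
x <- '[1] "bell pepper" "bilberry"
[5] "blueberry" "salal berry"'

# Pull out every double-quoted item, keeping the quotes
items <- str_extract_all(x, '"[^"]+"')[[1]]

# Join the items with commas and wrap them in c( ... )
code <- str_c("c(", str_c(items, collapse = ", "), ")")
code

# Optional check: the generated text parses back into a character vector
fruits <- eval(parse(text = code))
fruits
```

Because only quoted text is extracted, digits that happen to appear inside an item name are left untouched, whereas removing every `\\d+` would strip them as well.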
i) “(.)\1\1”
This regular expression matches any character repeated three times in a row. “(.)” is a capture group containing “.”, which matches any single character, and each “\1” is a backreference to that first group. Because the group is referenced twice, the same character must appear three times consecutively. In an R string literal each backslash is doubled, so the pattern is written "(.)\\1\\1".
For example:
Test_string <- c("abc312131cba", "abcabc", "abab", "111222","aabb","acca", "church","xyzzyx","reverence", "repeled", "pulp")
str_subset(Test_string, "(.)\\1\\1")
## [1] "111222"
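To see which part of a string triggers the match, rather than only which strings match, `str_extract()` can be used with the same pattern. A minimal sketch on made-up inputs (these are not part of Test_string):

```r
library(stringr)

# Hypothetical inputs: two contain a tripled character, one does not
samples <- c("111222", "aaab", "abab")

# Returns the first run of three identical characters, or NA if none exists
triples <- str_extract(samples, "(.)\\1\\1")
triples
```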
ii) “(.)(.)\2\1”
This regex has two groups, each capturing one character, which are then back-referenced in reverse order. In other words, it matches two characters followed by the same two characters reversed, such as “abba”.
For example:
str_subset(Test_string, "(.)(.)\\2\\1")
## [1] "acca" "xyzzyx"
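The captured groups can be inspected with `str_match()`, which returns the full match alongside each group. A short sketch on the two strings that matched above:

```r
library(stringr)

# Column 1 is the full match; columns 2 and 3 hold groups 1 and 2
m <- str_match(c("acca", "xyzzyx"), "(.)(.)\\2\\1")
m
```

For "xyzzyx" the match is the inner "yzzy", since no four-character mirror starts at the leading "x".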
iii) “(..)\1”
This regex has one group that captures two characters, followed by a backreference to that group. It matches any pair of characters that is immediately repeated, such as “abab”.
For example:
str_subset(Test_string, "(..)\\1")
## [1] "abab"
iv) “(.).\1.\1”
This regex matches a character that appears three times with exactly one arbitrary character between each occurrence. Each “.” between the group and the backreferences stands for one such intervening character.
For example:
str_subset(Test_string, "(.).\\1.\\1")
## [1] "abc312131cba" "reverence" "repeled"
v) “(.)(.)(.).*\3\2\1”
This regex matches three characters, followed by any characters (zero or more), and then the same three characters in reverse order.
For example:
str_subset(Test_string, "(.)(.)(.).*\\3\\2\\1")
## [1] "abc312131cba" "xyzzyx"
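To make the "zero or more" part visible, the sketch below uses a variant of the pattern with an extra capture group around “.*”. That fourth group is an addition for illustration only and is not part of the original expression.

```r
library(stringr)

# Variant pattern: group 4 captures whatever sits between the mirrored triples
m <- str_match(c("abc312131cba", "xyzzyx"), "(.)(.)(.)(.*)\\3\\2\\1")

# Column 5 of the result holds group 4 (column 1 is the full match)
middle <- unname(m[, 5])
middle
```

The second string yields an empty middle, showing that “.*” is allowed to match nothing.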
# Strings that start and end with the same character
str_subset(Test_string, "^(.).*\\1$")
## [1] "abc312131cba" "acca" "xyzzyx" "pulp"
# Strings that contain a repeated pair of characters
str_subset(Test_string, "(..).*\\1")
## [1] "abc312131cba" "abcabc" "abab" "church" "reverence"
# Strings in which one lowercase letter occurs at least three times
str_subset(Test_string, "([a-z]).*\\1.*\\1")
## [1] "reverence" "repeled"
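The last pattern can be cross-checked by counting letters directly. This is a rough sketch: `has_triple_letter` is a hypothetical helper written for this check, and `words` is a small subset of Test_string.

```r
library(stringr)

words <- c("reverence", "repeled", "church", "pulp")  # subset of Test_string

# Hypothetical helper: TRUE when some lowercase letter occurs at least 3 times
has_triple_letter <- function(w) {
  letters_only <- str_extract_all(w, "[a-z]")[[1]]
  length(letters_only) > 0 && max(table(letters_only)) >= 3
}

by_count <- words[vapply(words, has_triple_letter, logical(1))]
by_regex <- str_subset(words, "([a-z]).*\\1.*\\1")

# Both approaches should select the same strings
identical(by_count, by_regex)
```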