Load libraries.
library(RCurl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ readr 2.1.2
## ✔ tibble 3.1.7 ✔ purrr 0.3.4
## ✔ tidyr 1.2.0 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::complete() masks RCurl::complete()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Import the raw CSV file listing the 173 majors in fivethirtyeight.com’s College Majors dataset.
df_173_majors = read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv')
Identify the majors that contain either “DATA” or “STATISTICS” using string matching.
df_173_majors$Major %>%
str_subset(pattern = "DATA")
## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"
df_173_majors$Major %>%
str_subset(pattern = "STATISTICS")
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "STATISTICS AND DECISION SCIENCE"
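The two searches above can also be combined into one pattern with regex alternation. This is a minimal sketch; the `majors` vector is a hypothetical stand-in for `df_173_majors$Major` so that it runs without downloading the CSV.

```r
library(stringr)

# Hypothetical stand-in for df_173_majors$Major, so the sketch runs offline.
majors <- c(
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "GENERAL AGRICULTURE"
)

# "|" is regex alternation: keep entries that contain either word.
data_or_stats <- str_subset(majors, "DATA|STATISTICS")
print(data_or_stats)
```

With the real data frame, the same call would be `str_subset(df_173_majors$Major, "DATA|STATISTICS")`.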
The task is to transform this printed output:
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
into a format like this: c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
The first step was to store the data as a single string and print it to see how it looks in the console.
string_work <- ' [1] "bell pepper" "bilberry" "blackberry" "blood orange" [5] "blueberry" "cantaloupe" "chili pepper" "cloudberry" [9] "elderberry" "lime" "lychee" "mulberry" [13] "olive" "salal berry" '
print(string_work)
## [1] " [1] \"bell pepper\" \"bilberry\" \"blackberry\" \"blood orange\" [5] \"blueberry\" \"cantaloupe\" \"chili pepper\" \"cloudberry\" [9] \"elderberry\" \"lime\" \"lychee\" \"mulberry\" [13] \"olive\" \"salal berry\" "
Then a regular expression was used to pick out each item. The pattern is a word character (\w), followed by one or more lowercase letters ([a-z]+), an optional whitespace character (\s?), one or more lowercase letters again ([a-z]+), and a closing word character (\w). The bracketed index markers and the quotation marks do not fit this pattern, so they are left out of the matches.
alpha_2 <- str_extract_all(string_work, "\\w[a-z]+\\s?[a-z]+\\w")
print(alpha_2)
## [[1]]
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
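An alternative is to match on the quotation marks rather than on the shape of the words. The sketch below is not the approach used above; it assumes every item is wrapped in straight double quotes, and the names `quoted` and `items` are introduced only for illustration.

```r
library(stringr)

string_work <- ' [1] "bell pepper" "bilberry" "blackberry" "blood orange" [5] "blueberry" "cantaloupe" "chili pepper" "cloudberry" [9] "elderberry" "lime" "lychee" "mulberry" [13] "olive" "salal berry" '

# Match an opening quote, capture everything up to the next quote,
# then match the closing quote.
quoted <- str_match_all(string_work, '"([^"]+)"')[[1]]

# Column 1 holds the full match (with quotes); column 2 holds the capture group.
items <- quoted[, 2]
print(items)
```

Because this version keys on the quotes, it does not depend on item length or letter case. The `\\w[a-z]+\\s?[a-z]+\\w` pattern needs at least four characters per item, so a three-letter name would be skipped by it.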
I realized that this new variable was a list. The list needed to be changed back into a character vector so that I could use the cat function to print it without any [digit] index markers in my output. I then collapsed the x variable with ‘ , ’ between the words to produce a single comma-separated string.
class(alpha_2)
## [1] "list"
x = unlist(alpha_2)
print(x)
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
class(x)
## [1] "character"
cat(str_c(x, collapse = " , "))
## bell pepper , bilberry , blackberry , blood orange , blueberry , cantaloupe , chili pepper , cloudberry , elderberry , lime , lychee , mulberry , olive , salal berry
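The output above lists the items but leaves out the quotation marks and the surrounding c( ) of the target format. One way to add them is sketched below, using a shortened stand-in for `x`; the names `quoted_items` and `target` are introduced only for this example.

```r
library(stringr)

# Shortened stand-in for the full character vector x.
x <- c("bell pepper", "bilberry", "blackberry")

# Wrap each item in double quotes and join the items with ", ".
quoted_items <- str_c('"', x, '"', collapse = ", ")

# Wrap the joined string in c( ... ).
target <- str_c("c(", quoted_items, ")")
cat(target)
```

For the three items here, `cat(target)` prints `c("bell pepper", "bilberry", "blackberry")`.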
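Matches any character repeated three times in a row, for example ‘aaa’ or ‘,,,’. In an R string literal, however, the backslashes must be doubled, as in "(.)\\1\\1", or the pattern will not work: R reads "\1" as an escape for a control character rather than as a backreference. Example here is ‘yyy’.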
string_1 = 'yyy'
sting_output_1 <- str_extract_all(string_1, "(.)\1\1")
print(sting_output_1)
## [[1]]
## character(0)
sting_output_1_with_backslash <- str_extract_all(string_1, "(.)\\1\\1")
print(sting_output_1_with_backslash)
## [[1]]
## [1] "yyy"
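The difference between the two spellings shows up when the strings themselves are inspected. This is a small sketch; the raw-string form at the end assumes R 4.0.0 or later, and the variable names are introduced only for illustration.

```r
library(stringr)

single <- "(.)\1\1"    # "\1" is an octal escape: the control character with code 1
double <- "(.)\\1\\1"  # "\\1" is a backslash followed by the digit 1

n_single <- nchar(single)  # 5: "(", ".", ")", and two control characters
n_double <- nchar(double)  # 7: "(", ".", ")", "\", "1", "\", "1"

# writeLines() shows the text the regex engine actually receives.
writeLines(double)

# A raw string (assumes R >= 4.0.0) avoids the doubling.
raw_pattern <- r"((.)\1\1)"
same_pattern <- identical(raw_pattern, double)
raw_match <- str_extract("yyy", raw_pattern)
print(c(n_single, n_double))
print(same_pattern)
print(raw_match)
```

The single-backslash version never reaches the regex engine as a backreference, which is why it returns `character(0)`.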
Matches any two characters followed by the same two characters in reverse order, that is, a four-character palindrome. Example here is ‘anna’.
string_2 = 'anna'
sting_output_2 <- str_extract_all(string_2, "(.)(.)\\2\\1")
print(sting_output_2)
## [[1]]
## [1] "anna"
Matches any two characters immediately followed by the same two characters. As in the first question, R produces no match unless the backslash is doubled. Example here is ‘bebe’.
string_3 = 'bebe'
sting_output_3 <- str_extract_all(string_3, "(..)\1")
print(sting_output_3)
## [[1]]
## character(0)
sting_output_3_with_backslash <- str_extract_all(string_3, "(..)\\1")
print(sting_output_3_with_backslash)
## [[1]]
## [1] "bebe"
Matches five characters in which the first, third, and fifth are the same character; the second and fourth can be anything. Example here is ‘aba.amatag,’, where the match is ‘aba.a’.
string_4 = 'aba.amatag,'
string_4_2 = 'ababababa' # Will match this one as well.
sting_output_4 <- str_extract_all(string_4, "(.).\\1.\\1")
print(sting_output_4)
## [[1]]
## [1] "aba.a"
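The second test string, `string_4_2`, is defined above but not run. A quick check, sketched here with a result name introduced for illustration:

```r
library(stringr)

string_4_2 <- "ababababa"

# Matches do not overlap: once "ababa" is consumed, only "baba" remains,
# which is too short for the five-character pattern.
output_4_2 <- str_extract_all(string_4_2, "(.).\\1.\\1")
print(output_4_2)
```

Only one match, "ababa", is returned, even though the pattern could also fit starting at later positions if overlapping matches were allowed.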
Matches any three characters, then zero or more of any character (.*), then the same three characters in reverse order. Example here is ‘asdfdsa’.
string_5 = 'asdfdsa'
sting_output_5 <- str_extract_all(string_5, "(.)(.)(.).*\\3\\2\\1")
print(sting_output_5)
## [[1]]
## [1] "asdfdsa"
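Because `.*` can match nothing at all, the middle part is optional. The sketch below checks a few extra strings that are not in the original exercise and were chosen only for illustration.

```r
library(stringr)

pattern_5 <- "(.)(.)(.).*\\3\\2\\1"

# "abccba" has nothing between the two halves; "abcabc" repeats the
# three characters but does not reverse them.
candidates <- c("asdfdsa", "abccba", "abcabc")
detected <- str_detect(candidates, pattern_5)
print(detected)
```

The first two strings match and the third does not, since the pattern requires the reversed order.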
My regular expression for this was “(.).+\1”: a character, then one or more of any character, then the first character again.
char_1 = 'puioufghgfh'
char_output_1 <- str_extract_all(char_1, "(.).+\\1")
print(char_output_1)
## [[1]]
## [1] "uiou" "fghgf"
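The pattern above is unanchored, so it finds substrings that start and end with the same character. If the goal is to test whole strings (an assumption, since the question itself is not shown here), anchors can be added. The word list below is hypothetical.

```r
library(stringr)

# Hypothetical test words, not part of the original exercise.
words <- c("puioufghgfh", "stats", "area", "data")

# Unanchored: TRUE whenever some substring starts and ends with the same character.
unanchored <- str_detect(words, "(.).+\\1")

# Anchored: TRUE only when the whole string starts and ends with the same character.
anchored <- str_detect(words, "^(.).*\\1$")

print(unanchored)
print(anchored)
```

All four words contain a qualifying substring, but only "stats" and "area" start and end with the same character as whole strings.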
My regular expression for this was “(..).*\1”: any two characters, then zero or more of any character, then the same two characters again.
char_2 <- c("ChurCh", "ertyuierasdfghj")
char_output_2 <- str_extract_all(char_2, "(..).*\\1")
print(char_output_2)
## [[1]]
## [1] "ChurCh"
##
## [[2]]
## [1] "ertyuier"
My regular expression for this was “(.).*\1.*\1”: a character that appears again at least two more times, with anything in between.
char_3 = 'eltyetyven'
char_output_3 <- str_extract_all(char_3, "(.).*\\1.*\\1")
print(char_output_3)
## [[1]]
## [1] "eltyetyve"
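The last two patterns can also be used to filter a vector of words with str_subset(), which returns the matching elements rather than the matched text. This is a sketch with a hypothetical word list.

```r
library(stringr)

# Hypothetical word list, not part of the original exercise.
words <- c("church", "eleven", "banana", "apple")

# Words containing a repeated pair of characters.
repeated_pair <- str_subset(words, "(..).*\\1")

# Words containing one character in at least three places.
three_times <- str_subset(words, "(.).*\\1.*\\1")

print(repeated_pair)
print(three_times)
```

Here "church" and "banana" contain a repeated pair ("ch" and "an"), while "eleven" and "banana" contain a character that occurs three times.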