Assignment 3 for Data 607 posed four problems focused on regular expressions. The libraries needed for the task and the answers to the four problems follow.

Load libraries.

library(RCurl)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ readr   2.1.2
## ✔ tibble  3.1.7     ✔ purrr   0.3.4
## ✔ tidyr   1.2.0     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::complete() masks RCurl::complete()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()

1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

Import CSV from Raw File of 173 majors listed in fivethirtyeight.com’s College Majors dataset

df_173_majors = read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv')

Identifying the majors that contain either “DATA” or “STATISTICS” with string matching

df_173_majors$Major %>%
  str_subset(pattern = "DATA") 
## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"
df_173_majors$Major %>%
  str_subset(pattern = "STATISTICS")
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "STATISTICS AND DECISION SCIENCE"
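
The two searches can also be merged into one pattern with the regex alternation operator `|`. The following is a minimal sketch, assuming a small hypothetical vector `majors` that stands in for `df_173_majors$Major` so the example runs without downloading the CSV.

```r
library(stringr)

# Hypothetical sample of major names, standing in for df_173_majors$Major
majors <- c(
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "GENERAL BUSINESS",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
)

# "|" is regex alternation: keep entries that contain either word
data_or_stats <- str_subset(majors, "DATA|STATISTICS")
print(data_or_stats)
```

With the real data, the same call on `df_173_majors$Major` should return the three majors found by the two separate searches.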

2. Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”

Into a format like this: c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

The first step was to store the data as a single string and print it to see how it appears in the console.

string_work <- ' [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" [5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry" [9] "elderberry"   "lime"         "lychee"       "mulberry"      [13]   "olive"        "salal berry" '
print(string_work)
## [1] " [1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\" [5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\" [9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"      [13]   \"olive\"        \"salal berry\" "

Then a regular expression was used to pick out each fruit name. The pattern starts with one word character (\w), followed by one or more lowercase letters ([a-z]+), then an optional whitespace character (\s?), then one or more lowercase letters again ([a-z]+), and ends with one more word character (\w). Inside the R string each backslash is doubled, as in the code below.

alpha_2 <- str_extract_all(string_work, "\\w[a-z]+\\s?[a-z]+\\w")
print(alpha_2)
## [[1]]
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

I realized that this new variable was a list. The list needed to be changed back into a character vector so that I could use the cat function to print it without the [digit] index markers in the output. I then collapsed the x variable with ‘ , ’ between the words to create the desired output.

class(alpha_2)
## [1] "list"
x = unlist(alpha_2)

print(x)
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
class(x)
## [1] "character"
cat(str_c(x, collapse = " , "))
## bell pepper , bilberry , blackberry , blood orange , blueberry , cantaloupe , chili pepper , cloudberry , elderberry , lime , lychee , mulberry , olive , salal berry
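
The output above separates the names with commas but drops the quotation marks and the surrounding c( ) asked for in the problem. A minimal sketch of one way to get closer to the requested format follows. It is an alternative to the approach above, not a correction of it, and it assumes a shortened, hypothetical `sample_text` in place of `string_work`.

```r
library(stringr)

# Shortened stand-in for string_work (three of the fourteen fruits)
sample_text <- ' [1] "bell pepper"  "bilberry" [3] "salal berry" '

# Match each double-quoted item, keeping its quotation marks
quoted <- str_extract_all(sample_text, '"[^"]+"')[[1]]

# Join the items with commas and wrap them in c( ... )
formatted <- str_c("c(", str_c(quoted, collapse = ", "), ")")
cat(formatted)
## c("bell pepper", "bilberry", "salal berry")
```

Keeping the quotation marks in the match avoids having to add them back later.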

3. Describe, in words, what these expressions will match:

“(.)\1\1”

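Matches any character repeated three times in a row. Examples: ‘aaa’, ‘,,,’, etc. In R code, however, the pattern must be written with doubled backslashes, "(.)\\1\\1", or it will not work: inside an R string, "\1" is an escape for the control character with code 1, not a backreference. The example here is ‘yyy’.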

string_1 = 'yyy'

sting_output_1 <- str_extract_all(string_1, "(.)\1\1")
print(sting_output_1)
## [[1]]
## character(0)
sting_output_1_with_backslash <- str_extract_all(string_1, "(.)\\1\\1")
print(sting_output_1_with_backslash)
## [[1]]
## [1] "yyy"
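
The difference between the two patterns comes from how R reads string literals before the regex engine ever sees them. The short base R sketch below illustrates this; the variable names `single` and `double` are only for illustration.

```r
single <- "\1"    # octal escape: one control character (code 1)
double <- "\\1"   # a literal backslash followed by the digit 1

nchar(single)     # 1
utf8ToInt(single) # 1
nchar(double)     # 2

# writeLines() shows the pattern as the regex engine receives it
writeLines("(.)\\1\\1")
## (.)\1\1
```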

“(.)(.)\2\1”

Matches any two characters followed by the same two characters in reverse order. The example here is ‘anna’.

string_2 = 'anna'

sting_output_2 <- str_extract_all(string_2, "(.)(.)\\2\\1")
print(sting_output_2)
## [[1]]
## [1] "anna"

“(..)\1”

Matches any two characters immediately followed by the same two characters. As with the first expression, R produces no match without the extra backslash. The example here is ‘bebe’.

string_3 = 'bebe'

sting_output_3 <- str_extract_all(string_3, "(..)\1")
print(sting_output_3)
## [[1]]
## character(0)
sting_output_3_with_backslash <- str_extract_all(string_3, "(..)\\1")
print(sting_output_3_with_backslash)
## [[1]]
## [1] "bebe"

“(.).\1.\1”

Matches five characters in which the first, third, and fifth are the same character; the second and fourth can be anything. The example here is ‘aba.amatag,’.

string_4 = 'aba.amatag,'
string_4_2 = 'ababababa' # Will match this one as well. 

sting_output_4 <- str_extract_all(string_4, "(.).\\1.\\1")
print(sting_output_4)
## [[1]]
## [1] "aba.a"
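
The second string defined above, `string_4_2`, is never run through the pattern. A short sketch of that check follows; the name `sting_output_4_2` is introduced here only to follow the existing naming.

```r
library(stringr)

string_4_2 <- 'ababababa'

# The first five characters match; the four characters left over
# are too few for a second five-character match
sting_output_4_2 <- str_extract_all(string_4_2, "(.).\\1.\\1")
print(sting_output_4_2)
## [[1]]
## [1] "ababa"
```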

“(.)(.)(.).*\3\2\1”

Matches any three characters, then zero or more of any character (.*), then the same three characters in reverse order. The example here is ‘asdfdsa’.

string_5 = 'asdfdsa' 

sting_output_5 <- str_extract_all(string_5, "(.)(.)(.).*\\3\\2\\1")
print(sting_output_5)
## [[1]]
## [1] "asdfdsa"
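
Because `.*` can match zero characters, the reversed triple may follow the first triple directly. The sketch below checks this on a few hypothetical test strings (`candidates` is not part of the assignment data).

```r
library(stringr)

# Hypothetical test strings
candidates <- c("abccba", "asdfdsa", "abcabc")
pattern_5 <- "(.)(.)(.).*\\3\\2\\1"

# "abccba" has nothing between "abc" and "cba";
# "abcabc" repeats the triple without reversing it
matches_5 <- str_detect(candidates, pattern_5)
print(matches_5)
## [1]  TRUE  TRUE FALSE
```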

4. Construct regular expressions to match words that:

Start and end with the same character.

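“(.).+\1” was my regular expression solution. Because the pattern is not anchored, it extracts substrings that start and end with the same character rather than testing whole words.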

char_1 = 'puioufghgfh'
  
char_output_1 <- str_extract_all(char_1, "(.).+\\1")
print(char_output_1)
## [[1]]
## [1] "uiou"  "fghgf"
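
To test whole words instead of extracting substrings, the pattern can be anchored with `^` and `$`. This is a minimal sketch of that variation, assuming a hypothetical vector `words`; it is one possible reading of the problem, not the only valid answer.

```r
library(stringr)

# Hypothetical test words
words <- c("level", "church", "stats", "a")

# ^ and $ tie the pattern to the start and end of each word
same_ends <- str_subset(words, "^(.).*\\1$")
print(same_ends)
## [1] "level" "stats"

# Making the tail optional also accepts one-letter words
same_ends_or_single <- str_subset(words, "^(.)(.*\\1)?$")
print(same_ends_or_single)
## [1] "level" "stats" "a"
```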

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

“(..).*\1” was my regular expression solution.

char_2 <- list("ChurCh", "ertyuierasdfghj")

char_output_2 <- str_extract_all(char_2, "(..).*\\1")
print(char_output_2)
## [[1]]
## [1] "ChurCh"
## 
## [[2]]
## [1] "ertyuier"

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

“(.).*\1.*\1” was my regular expression solution.

char_3 = 'eltyetyven'
  
char_output_3 <- str_extract_all(char_3, "(.).*\\1.*\\1")
print(char_output_3)
## [[1]]
## [1] "eltyetyve"
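
The same pattern can filter a vector of words rather than extract the matched span. A brief sketch, assuming a hypothetical vector `words_3`:

```r
library(stringr)

# Hypothetical test words
words_3 <- c("eleven", "banana", "apple")

# Keep words in which some character occurs at least three times
three_times <- str_subset(words_3, "(.).*\\1.*\\1")
print(three_times)
## [1] "eleven" "banana"
```

Here "apple" is dropped because its most frequent letter, "p", appears only twice.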