CUNY SPS DATA607 HW3

Name: Chinedu Onyeka, Date: 9/12/2021

Load Libraries

library(tidyverse)
library(stringr)

Problem 1:

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

Solution 1:

url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
college_major <- read_csv(url) #read the data
head(college_major)
## # A tibble: 6 x 3
##   FOD1P Major                                 Major_Category                 
##   <chr> <chr>                                 <chr>                          
## 1 1100  GENERAL AGRICULTURE                   Agriculture & Natural Resources
## 2 1101  AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3 1102  AGRICULTURAL ECONOMICS                Agriculture & Natural Resources
## 4 1103  ANIMAL SCIENCES                       Agriculture & Natural Resources
## 5 1104  FOOD SCIENCE                          Agriculture & Natural Resources
## 6 1105  PLANT SCIENCE AND AGRONOMY            Agriculture & Natural Resources
major_data_stat <- college_major[str_detect(college_major$Major, pattern = "DATA|STATISTICS"), ]
major_data_stat
## # A tibble: 3 x 3
##   FOD1P Major                                         Major_Category         
##   <chr> <chr>                                         <chr>                  
## 1 6212  MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business               
## 2 2101  COMPUTER PROGRAMMING AND DATA PROCESSING      Computers & Mathematics
## 3 3702  STATISTICS AND DECISION SCIENCE               Computers & Mathematics
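
The same selection can also be written with dplyr's filter(). The following is a minimal sketch, not the submitted solution: `sample_majors` is a hypothetical stand-in for `college_major`, built from rows shown in the output above so that it runs without downloading the file.

```r
library(tidyverse)

# Hypothetical stand-in for college_major (rows copied from the output above)
sample_majors <- tibble(
  FOD1P = c("1100", "6212", "2101", "3702"),
  Major = c("GENERAL AGRICULTURE",
            "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE")
)

# Keep only the rows whose Major contains DATA or STATISTICS
data_stat <- sample_majors %>%
  filter(str_detect(Major, "DATA|STATISTICS"))
data_stat
```

The pattern is case-sensitive, which suits this dataset because every major is stored in upper case.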

Problem 2:

Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

Solution 2:

text_data <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"

[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  

[9] "elderberry"   "lime"         "lychee"       "mulberry"    

[13] "olive"        "salal berry"'

#desired output
desire_output <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

#Trim leading and trailing whitespace and collapse internal runs of whitespace (including newlines) into single spaces
text_data_no_space <- str_squish(text_data)
writeLines(text_data_no_space)
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" [5] "blueberry" "cantaloupe" "chili pepper" "cloudberry" [9] "elderberry" "lime" "lychee" "mulberry" [13] "olive" "salal berry"
#Extract each fruit name: two words separated by one whitespace character, or a single word
text_extract <- unlist(str_extract_all(text_data_no_space, pattern = "[[:alpha:]]+\\s[[:alpha:]]+|[[:alpha:]]+"))
#unlist() already returns a character vector, so no further combining is needed
text <- text_extract
text
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

Compare the desired output to the output

#Check if the output is the same as the desired output
identical(text, desire_output)
## [1] TRUE

identical() returns TRUE, so the extracted vector matches the desired output exactly.
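
The problem statement can also be read as asking for the literal text of a c(...) call. The sketch below is one possible way to build that text, under the assumption that every item is wrapped in double quotes; `raw_text`, `items`, and `c_call` are illustrative names and the input is a shortened version of the data above.

```r
library(stringr)

# Shortened version of the printed vector, kept as raw text
raw_text <- '[1] "bell pepper"  "bilberry"
[3] "lime"         "salal berry"'

# Pull out everything between double quotes, then drop the quotes
quoted <- str_extract_all(raw_text, '"[^"]+"')[[1]]
items <- str_remove_all(quoted, '"')

# Build the text of a c(...) call from the extracted items
c_call <- str_c("c(", str_c('"', items, '"', collapse = ", "), ")")
writeLines(c_call)
```

Extracting by quotation marks rather than by letters also keeps items with more than two words intact.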

Problem 3:

Describe, in words, what these expressions will match:

  1. (.)\1\1
  2. “(.)(.)\\2\\1”
  3. (..)\1
  4. “(.).\\1.\\1”
  5. "(.)(.)(.).*\\3\\2\\1"

Solution 3:

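(3a) (.)\1\1 : As a regular expression, this matches any single character repeated three times in a row, as in “aaa”. To use it in R it must be written as the string "(.)\\1\\1". If it is typed as "(.)\1\1", R does not raise an error; it reads \1 as the control character with octal code 1, so the pattern matches no ordinary word.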
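
A quick check of how R reads the backslash, as a small sketch with hand-picked test strings (the variable names are illustrative):

```r
library(stringr)

# Doubled backslashes: the regex engine receives (.)\1\1,
# i.e. one character repeated three times in a row
triple <- str_detect(c("aaa", "aab", "bbbb"), "(.)\\1\\1")
triple

# Single backslash: R parses "\1" as an octal escape, a single control character
single_escape_length <- nchar("\1")
same_as_octal <- identical("\1", "\001")
```

Here `triple` is TRUE for "aaa" and "bbbb" only, and "\1" is a one-character string rather than a backreference.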

(3b) “(.)(.)\\2\\1” : This matches two characters followed by the same two characters in reverse order, that is, a four-character palindrome such as “abba” or “daad”. Group 1 captures the first character and group 2 the second; \\2\\1 then requires the second character followed by the first. The example below shows the matching words in stringr::words.

str_view_all(stringr::words, pattern = "(.)(.)\\2\\1", match = TRUE)

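(3c) (..)\1 : As a regular expression, this matches any two characters immediately followed by the same two characters, as in “coco” in “coconut”. In R the pattern must be written as the string “(..)\\1”; without the second backslash R reads \1 as a control character rather than a backreference, so no error is raised but the pattern no longer refers to the first group.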

str_view_all(fruit, "(..)\\1", match = TRUE)

(3d) “(.).\\1.\\1” : This matches a character that occurs three times with exactly one character, of any kind, between consecutive occurrences. In the fruit example below, “banana” matches because “a” occurs three times with “n” in between. The second match, “papaya”, shows that the in-between characters need not be the same letter: here they are “p” and “y”.

str_view_all(fruit, "(.).\\1.\\1", match = TRUE)
str_view_all(words, "(.).\\1.\\1", match = TRUE)

(3e) “(.)(.)(.).*\\3\\2\\1” : This matches three characters followed, after zero or more other characters, by the same three characters in reverse order. In the example below, “par” is followed by “ag” and then by the reversed “rap”, as in “paragraph”.

str_view_all(words, "(.)(.)(.).*\\3\\2\\1", match = TRUE)
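
As a sanity check on the descriptions above, the five patterns can be run against one hand-picked string each. This is only a sketch; the example strings and the names `patterns`, `examples`, `hits`, and `matched` are chosen for illustration.

```r
library(stringr)

patterns <- c("(.)\\1\\1",             # same character three times in a row
              "(.)(.)\\2\\1",          # two characters, then the same two reversed
              "(..)\\1",               # two characters repeated immediately
              "(.).\\1.\\1",           # same character three times, one character apart
              "(.)(.)(.).*\\3\\2\\1")  # three characters, later the same three reversed
examples <- c("aaa", "abba", "coco", "banana", "paragraph")

# str_detect() and str_extract() pair each string with its own pattern
hits <- str_detect(examples, patterns)
matched <- str_extract(examples, patterns)
data.frame(examples, patterns, hits, matched)
```

The `matched` column shows the exact span each pattern consumed, for instance “anana” inside “banana”.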

Problem 4:

Construct regular expressions to match words that:

  1. Start and end with the same character.
  2. Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
  3. Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

Solution 4:

Start and end with the same character

str_view_all(words, "^(.).*\\1$", match = TRUE)

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

str_view_all(words, "([A-Za-z][A-Za-z]).*\\1", match = TRUE)

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

str_view_all(words, "([a-zA-Z]).*\\1.*\\1", match = TRUE)
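
The three requirements can be checked against a handful of words whose answers are easy to verify by eye. This is a sketch under stated assumptions: `test_words` is a made-up list, `repeat_pair` allows other characters between the two occurrences of the pair (as “church” requires), and `same_ends` is a variant, not taken from the solution above, that also accepts one-letter words.

```r
library(stringr)

test_words <- c("church", "eleven", "area", "a", "banana", "stats")

same_ends   <- "^(.)((.*\\1$)|\\1?$)"      # first and last character equal; "a" counts too
repeat_pair <- "([A-Za-z][A-Za-z]).*\\1"   # a letter pair that appears again later
three_times <- "([A-Za-z]).*\\1.*\\1"      # one letter in at least three places

# str_subset() returns the words that match each pattern
ends_match  <- str_subset(test_words, same_ends)
pair_match  <- str_subset(test_words, repeat_pair)
three_match <- str_subset(test_words, three_times)
```

With this list, `ends_match` holds “area”, “a”, and “stats”; `pair_match` holds “church” and “banana”; `three_match` holds “eleven” and “banana”.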