DATA 607 - Homework Assignment # 3

Vladimir Nimchenko

1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

library(stringr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
 # Load the list of majors from GitHub
 majors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv", header = TRUE)
 
 # Keep only the rows whose Major contains "DATA" or "STATISTICS" (case-sensitive)
 majors %>% filter(str_detect(Major, "DATA|STATISTICS"))
##   FOD1P                                         Major          Major_Category
## 1  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 2  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics
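The same filter can be checked offline without downloading the CSV. The sketch below uses base R's `grepl()` on a small hand-made vector; `sample_majors` and its last two entries are illustrative assumptions, not rows taken from the dataset.

```r
# Offline check of the "DATA|STATISTICS" filter using base R only.
# sample_majors is a hypothetical stand-in for the Major column.
sample_majors <- c(
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "EXAMPLE MAJOR WITHOUT A KEYWORD",  # made-up non-matching entry
  "EXAMPLE DATABASE MAJOR"            # made-up entry: contains "DATA" inside a longer word
)

# Plain alternation: matches "DATA" anywhere, including inside "DATABASE"
hits <- sample_majors[grepl("DATA|STATISTICS", sample_majors)]

# Word boundaries restrict the match to the whole words DATA or STATISTICS
whole_word_hits <- sample_majors[grepl("\\b(DATA|STATISTICS)\\b", sample_majors, perl = TRUE)]

print(hits)
print(whole_word_hits)
```

Whether the stricter word-boundary form is wanted depends on how the question is read; for the three majors returned above, both patterns agree.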

2. Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

 # Create a fruits string as it is listed initially.
 fruits_list <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"

                 [5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"  

                 [9] "elderberry" "lime" "lychee" "mulberry"    

                 [13] "olive" "salal berry"'

# Extract each fruit name (one or two lowercase words) into a character vector
fruits <- unlist(str_extract_all(fruits_list, '[a-z]+\\s[a-z]+|[a-z]+'))
# Wrap each name in double quotes, join the names with ", ", and wrap the result in c( )
fruits_list <- paste0('c(', paste0('"', fruits, '"', collapse = ", "), ')')
# Print the resulting string as-is, without escaping the quotes
cat(fruits_list)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
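One way to confirm that the printed text really is valid R code for the vector is to parse it back and compare. The sketch below is a base R variant on a shortened input; `raw`, `fruits`, and `code_text` are names chosen here for illustration, and `deparse()` is used as an alternative way to build the `c(...)` text.

```r
# Shortened version of the printed vector (illustrative input)
raw <- '[1] "bell pepper" "bilberry"\n[3] "olive" "salal berry"'

# Grab every double-quoted chunk, then strip the surrounding quotes
quoted <- regmatches(raw, gregexpr('"[^"]+"', raw))[[1]]
fruits <- gsub('"', "", quoted, fixed = TRUE)

# deparse() writes the vector as R source; a large width.cutoff keeps it on one line
code_text <- paste(deparse(fruits, width.cutoff = 500L), collapse = "")
cat(code_text, "\n")

# Round trip: evaluating the text should give back the same vector
round_trip <- eval(parse(text = code_text))
print(identical(round_trip, fruits))
```

The round trip is only a sanity check; it does not replace the regular-expression step the question asks for.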

3. Describe, in words, what these expressions will match:

a.  (.)\1\1
b. "(.)(.)\\2\\1"
c. (..)\1
d. "(.).\\1.\\1"
e. "(.)(.)(.).*\\3\\2\\1"
#a. (.)\1\1 matches any character repeated three times in a row (ex. "bbb"). As written it lacks the doubled backslashes an R string needs; in R code it must be "(.)\\1\\1", because "\1" inside a string literal is read as an octal escape, not a backreference.
#b. "(.)(.)\\2\\1" matches two characters followed by the same two characters in reverse order (ex. "yxxy").
#c. (..)\1 matches any two characters immediately repeated (ex. "mnmn"). Like (a), it must be written as "(..)\\1" in R code.
#d. "(.).\\1.\\1" matches a character, then any character, then the first character again, then any character, then the first character a third time (ex. "gxgng").
#e. "(.)(.)(.).*\\3\\2\\1" matches three characters, followed by zero or more characters, followed by the same three characters in reverse order (ex. "efggfe" or "abcxyzcba").
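The five expressions can be tried against short test strings. The sketch below uses base R's `grepl()` with `perl = TRUE` instead of stringr so it runs without extra packages; the test strings are chosen here for illustration.

```r
# Each pattern is written with doubled backslashes, as R string literals require
res <- c(
  a_aaa       = grepl("(.)\\1\\1", "aaa", perl = TRUE),               # same character three times in a row
  a_bbxb      = grepl("(.)\\1\\1", "bbxb", perl = TRUE),              # only two in a row, so no match
  b_yxxy      = grepl("(.)(.)\\2\\1", "yxxy", perl = TRUE),           # pair followed by its reverse
  b_abab      = grepl("(.)(.)\\2\\1", "abab", perl = TRUE),           # repeated pair, not reversed
  c_mnmn      = grepl("(..)\\1", "mnmn", perl = TRUE),                # two characters repeated
  d_gxgng     = grepl("(.).\\1.\\1", "gxgng", perl = TRUE),           # g ? g ? g
  e_efggfe    = grepl("(.)(.)(.).*\\3\\2\\1", "efggfe", perl = TRUE), # efg followed by gfe
  e_abcxyzcba = grepl("(.)(.)(.).*\\3\\2\\1", "abcxyzcba", perl = TRUE)
)
print(res)

# Without the doubled backslash, "\1" is a single control character, not a backreference
n_unescaped <- nchar("(.)\1\1")    # 5 characters: ( . ) plus two control characters
n_escaped   <- nchar("(.)\\1\\1")  # 7 characters: ( . ) \ 1 \ 1
print(c(n_unescaped, n_escaped))
```

The character counts at the end show why the patterns in (a) and (c) do not behave as backreferences when typed into R code exactly as printed.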

4. Construct regular expressions to match words that:

  1. Start and end with the same character
  2. Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
  3. Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
#a.

# test vector of words that start and end with the same character
r <- c("civic", "radar", "level", "refer")

# show the matching words to validate the regular expression:
# capture the first character, allow anything in between, and require the same character at the end
str_view(r, "^(.).*\\1$", match = TRUE)
#b.

# test vector for a repeated pair of letters; "George" is not shown as a match
# because the comparison is case-sensitive ("Ge" is not the same as "ge")
r <- c("church", "shush", "gigi", "George")
# show the matching words to validate the regular expression
str_view(r, "(..).*\\1", match = TRUE)
#c.

# test vector of words that have one letter repeated in at least 3 places
r <- c("eleven", "Mississippi", "Melville", "glasses")
# show the matching words to validate the regular expression
str_view(r, "(.).*\\1.*\\1", match = TRUE)
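The test vectors above contain mostly words that should match, so they cannot reveal a pattern that matches too much. The sketch below adds non-matching words and uses base R's `grepl()` with `perl = TRUE`. The pattern variants `same_ends`, `repeated_pair`, and `three_times` are assumptions introduced here: they restrict matches to letters and let a one-letter word count as starting and ending with the same character.

```r
# Hypothetical stricter variants of the three patterns
same_ends     <- "^(.)(.*\\1)?$"           # optional tail lets a one-character word match
repeated_pair <- "([A-Za-z]{2}).*\\1"      # a pair of letters that appears again later
three_times   <- "([A-Za-z]).*\\1.*\\1"    # one letter in at least three places

# Words expected to match and words expected not to match
ends_res  <- grepl(same_ends, c("civic", "a", "ab", "radio"), perl = TRUE)
pair_res  <- grepl(repeated_pair, c("church", "banana", "George", "george"), perl = TRUE)
three_res <- grepl(three_times, c("eleven", "banana", "letter"), perl = TRUE)

print(ends_res)   # TRUE TRUE FALSE FALSE
print(pair_res)   # TRUE TRUE FALSE TRUE
print(three_res)  # TRUE TRUE FALSE
```

"George" fails the pair test only because of letter case; lowercasing the input first (for example with `tolower()`) is one way to treat "Ge" and "ge" as the same pair.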