607W3

Question One

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

After loading the data into a data frame, I added two new columns that test whether the major name contains the word “STATISTICS” or the word “DATA”. I could have added a single column that tests for either word, but I kept the two tests separate so I could check each one for spelling errors or other issues. The str_detect function returns TRUE or FALSE for each row. I then filtered the table for rows with a TRUE value in either of the two new columns and dropped the added columns for the final presentation.

library(RCurl)
library(tidyverse)
library(knitr)

# Download the raw CSV text and parse it into a data frame
Major_538 <- getURL("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")
Major_538 <- read.csv(text = Major_538, check.names=FALSE)
# Add one TRUE/FALSE helper column per keyword
Major_538 <- transform(Major_538, Data = str_detect(Major_538$Major,"DATA"), STATS = str_detect(Major_538$Major,"STATISTICS") )
# Keep rows where either helper column is TRUE
Major_538 <- filter(Major_538, Data | STATS)
# Drop the helper columns, keeping the three original ones
Major_538 <- select(Major_538, 1, 2, 3)
kable(Major_538)

FOD1P	Major	Major_Category
6212	MANAGEMENT INFORMATION SYSTEMS AND STATISTICS	Business
2101	COMPUTER PROGRAMMING AND DATA PROCESSING	Computers & Mathematics
3702	STATISTICS AND DECISION SCIENCE	Computers & Mathematics
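The single-column alternative mentioned above can be sketched with one alternation pattern. This is a minimal illustration, not the code used for the answer. It uses base R's grepl in place of str_detect so it runs without any packages, and `majors_demo` is a hypothetical stand-in for the downloaded Major column (the non-matching entries are illustrative).

```r
# Hypothetical stand-in for Major_538$Major; not the real download
majors_demo <- c(
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "GENERAL AGRICULTURE",
  "ZOOLOGY"
)

# "DATA|STATISTICS" matches a name that contains either word
either <- grepl("DATA|STATISTICS", majors_demo)

# Keep only the names flagged TRUE
matched <- majors_demo[either]
print(matched)
```

The trade-off is the one described above: a single pattern is shorter, but separate columns make it easier to see which keyword produced each match.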

Question Two

Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

Reading the question, I understood the exercise as transforming the text from its printed form into a vector containing each of the food items. As shown above, each target item is enclosed in quotation marks. To isolate the items, I used strsplit to split the text at every quotation mark. Because strsplit returns a list, here with a single element that holds our character vector, I used unlist to get the vector itself.

FruitString <- '
[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"

[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  

[9] "elderberry"   "lime"         "lychee"       "mulberry"    

[13] "olive"        "salal berry" '

FruitString<- strsplit(FruitString,"\"")
FruitString<- unlist(FruitString)

As shown below, the result includes not only the isolated food items but also leftover pieces made of spaces, newlines, and index markers such as [1]. I used str_which to get the positions of the elements that contain a vowel. Because the food items are the only elements containing vowels, subsetting by those positions gives a vector identical to our goal for the exercise.

FruitString

##  [1] "\n[1] "        "bell pepper"   "  "            "bilberry"     
##  [5] "     "         "blackberry"    "   "           "blood orange" 
##  [9] "\n\n[5] "      "blueberry"     "    "          "cantaloupe"   
## [13] "   "           "chili pepper"  " "             "cloudberry"   
## [17] "  \n\n[9] "    "elderberry"    "   "           "lime"         
## [21] "         "     "lychee"        "       "       "mulberry"     
## [25] "    \n\n[13] " "olive"         "        "      "salal berry"  
## [29] " "

str_which(FruitString,"[aeiou]")

##  [1]  2  4  6  8 10 12 14 16 18 20 22 24 26 28

FruitString <- FruitString[str_which(FruitString,"[aeiou]")]

FruitString

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
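One detail about the vowel filter: inside square brackets every character is a literal member of the class, so commas do not act as separators. A class written as "[a,e,i,o,u]" therefore also matches a comma, while "[aeiou]" matches vowels only. The short check below uses base R's grepl and a made-up `samples` vector to show the difference. The fruit data contains no commas, so both forms give the same result there.

```r
# Made-up inputs: a lone comma, a fruit name, and a run of spaces
samples <- c(",", "lime", "   ")

# The comma inside the brackets is treated as one more allowed character
with_commas <- grepl("[a,e,i,o,u]", samples)

# Listing only the vowels avoids matching the comma
vowels_only <- grepl("[aeiou]", samples)

print(with_commas)
print(vowels_only)
```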

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
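As an alternative to splitting and then filtering, the quoted items can be extracted directly. The sketch below is one possible approach under stated assumptions, not the method used above: it relies on base R's gregexpr and regmatches, and `fruit_text` is an abridged, hypothetical copy of the input. The call to dput prints the vector in the c(...) form that the question shows.

```r
# Abridged, hypothetical copy of the printed input
fruit_text <- '[1] "bell pepper"  "bilberry"
[3] "lime"         "salal berry"'

# Find every run of text enclosed in double quotes
quoted <- regmatches(fruit_text, gregexpr('"[^"]*"', fruit_text))[[1]]

# Strip the surrounding quotation marks from each match
fruits <- gsub('"', "", quoted, fixed = TRUE)

# Print the vector as R code, i.e. c("bell pepper", ...)
dput(fruits)
```

Because the pattern only captures what sits between quotation marks, the index markers and padding never enter the result, so no second filtering step is needed.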

Question Three

Describe, in words, what these expressions will match:

(.)\1\1

“(.)(.)\2\1”

(..)\1

“(.).\1.\1”

"(.)(.)(.).*\3\2\1"

I think this question leaves some room for interpretation. The section of the text where it appears is mostly concerned with the str_view function, which requires the pattern to be passed as a quoted string. Because some of these expressions are written with quotation marks and some without, I interpreted the question to mean that each should be entered in the pattern argument exactly as written. This seems more probable than wrapping every pattern in additional quotes, which would make the already-quoted patterns match literal quotation marks. For the expressions written without quotes, I have also described what the function would return if quotation marks were added.
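How R reads the backslashes matters for this interpretation. In an R string literal, "\1" is an octal escape for the single control character with code 1, while "\\1" is a backslash followed by the digit 1, which the regex engine treats as a backreference. The sketch below checks this with base R (grepl with perl = TRUE rather than str_view). The variable names `single` and `double` are only for illustration.

```r
# "\1" becomes one control character, so this pattern is 5 characters long
single <- "(.)\1\1"

# "\\1" stays as backslash + 1, so this pattern is 7 characters long
double <- "(.)\\1\\1"

print(nchar(single))
print(nchar(double))
print(utf8ToInt("\1"))  # code point of the control character

# The escaped form is a backreference: same character three times in a row
triple_hit <- grepl(double, "xxx", perl = TRUE)

# The unescaped form looks for two literal control characters instead
literal_miss <- grepl(single, "xxx", perl = TRUE)
literal_hit <- grepl(single, "testtest\1\1", perl = TRUE)

print(c(triple_hit, literal_miss, literal_hit))
```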

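(.)\1\1 : Because this expression is not enclosed in quotation marks, it will return an error if passed to str_view. With quotation marks, R reads each \1 in the string “(.)\1\1” as an escape for the control character with code 1, not as a backreference, so the pattern matches any character followed by two of those control characters. Written with escaped backslashes, "(.)\\1\\1" would instead match the same character three times in a row.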

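“(.)(.)\2\1” : With the backslashes escaped as in the test below, this matches a 4 character palindrome, that is, two characters followed by the same two in reverse order: “abba”, “c11c”, “xxxx”.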

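(..)\1 : Again, this expression is not enclosed in quotation marks, so it will return an error. With quotation marks, “(..)\1” matches any two characters followed by the control character that R reads \1 as. With the backslash escaped, "(..)\\1" would match any two characters immediately repeated, such as “abab”.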

“(.).\1.\1” : With the backslashes escaped, this matches a run of 5 characters in which the first, third, and fifth are the same, such as “xkxdx”.

"(.)(.)(.).*\3\2\1" : With the backslashes escaped, this matches a run of at least 6 characters that begins with any three characters and ends with the same three in reverse order. A match of exactly 6 or 7 characters is a palindrome (abccba, abcxcba). In a longer match, the additional characters sit in the “middle” and do not need to be palindromic; only the first and last 3 characters must mirror each other.

Please see the following test:

test <- c('racecar','abc11102cba','abccba','xxx','12321','xox',"testtest\1\1","xkxdxcxbxax","church","eleven")

str_view(test, "(.)\1\1", match = TRUE)

str_view(test, "(.)(.)\\2\\1", match = TRUE)

str_view(test, "(..)\1", match = TRUE)

str_view(test, "(.).\\1.\\1", match = TRUE)

str_view(test, "(.)(.)(.).*\\3\\2\\1", match = TRUE)
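For readers without the interactive str_view output, the same checks can be approximated in base R. The sketch below is an illustration under stated assumptions: it uses grepl with perl = TRUE instead of str_view, escapes every backslash so the numbers act as backreferences, and runs on `words`, a copy of the test strings with the control-character entry left out. The pattern labels are made up for readability.

```r
# Copy of the test words, minus the entry containing control characters
words <- c("racecar", "abc11102cba", "abccba", "xxx", "12321", "xox",
           "xkxdxcxbxax", "church", "eleven")

# Escaped versions of the five expressions, with illustrative labels
patterns <- c(
  triple      = "(.)\\1\\1",
  mirror_pair = "(.)(.)\\2\\1",
  double_pair = "(..)\\1",
  alternating = "(.).\\1.\\1",
  mirror_ends = "(.)(.)(.).*\\3\\2\\1"
)

# For each pattern, collect the words that contain a match
hits <- lapply(patterns, function(p) words[grepl(p, words, perl = TRUE)])
print(hits)
```

On this word list the last pattern picks out racecar, abc11102cba, and abccba, which fits the description of three characters mirrored at the two ends of the match.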

Question Four

Construct regular expressions to match words that:

Start and end with the same character.

"^(.).*\\1$"

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

"(\\w)(\\w).*\\1\\2"

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

"(\\w).*\\1.*\\1"

str_view(test,"^(.).*\\1$", match = TRUE)

str_view(test,"(\\w)(\\w).*\\1\\2", match = TRUE)

str_view(test,"(\\w).*\\1.*\\1", match = TRUE)
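The three requirements can also be checked as TRUE/FALSE values in base R. This is a sketch under stated assumptions rather than the tested answer: it uses grepl with perl = TRUE, a made-up word list `q4_words`, and an unanchored pair pattern, "(\\w\\w).*\\1", so that a repeated pair is found anywhere in the word and not only at its start.

```r
# Made-up word list for the three checks
q4_words <- c("racecar", "church", "eleven", "banana")

# Starts and ends with the same character
same_ends <- grepl("^(.).*\\1$", q4_words, perl = TRUE)

# Contains a repeated pair of letters anywhere in the word
repeated_pair <- grepl("(\\w\\w).*\\1", q4_words, perl = TRUE)

# Contains one letter in at least three places
three_times <- grepl("(\\w).*\\1.*\\1", q4_words, perl = TRUE)

# One row per word, one column per requirement
result <- data.frame(word = q4_words, same_ends, repeated_pair, three_times)
print(result)
```

The word “banana” shows why the pair pattern is left unanchored: its repeated pair “an” begins at the second letter, so a pattern that starts with ^ would miss it.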


Christopher Bloome

2/12/2020
