DATA607WK3Assignment

Picture of the salal berries mentioned in Exercise 2, from www.diys.com/salal-plant/salal-berries

Exercise 1

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

Three majors contain either “DATA” or “STATISTICS”:

# Load packages --------------------------------------
library(tidyverse)

# Load data --------------------------------------
majordom <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv", stringsAsFactors = FALSE)

majordom_data <- str_subset(majordom$Major, pattern = "DATA|STATISTICS")
majordom_data

## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"
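
As a cross-check on the count, the same pattern could be applied with str_detect(), which returns a logical vector that can be summed. This is only a sketch: the `majors` vector below is a hand-typed stand-in (the three matches shown above plus one hypothetical non-matching entry), not the downloaded dataset.

```r
library(stringr)

# Stand-in for majordom$Major: the three matches shown above plus one
# made-up non-matching entry (an assumption, not read from the CSV)
majors <- c(
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "EXAMPLE MAJOR WITHOUT KEYWORDS"
)

# TRUE where the major contains either keyword
has_keyword <- str_detect(majors, "DATA|STATISTICS")

n_matches <- sum(has_keyword)    # number of matching majors
matched <- majors[has_keyword]   # the matching majors themselves

print(n_matches)
print(matched)
```

Subsetting with the logical vector gives the same result as str_subset(), while the sum makes the count explicit.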

Exercise 2

Write code that transforms the data below:

[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

Generate a string that looks like the output

We can generate a string that looks like that with the following steps:

# Generate input
line1 <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"'
line2 <- '[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  '
line3 <- '[9] "elderberry"   "lime"         "lychee"       "mulberry"    '
line4 <- '[13] "olive"        "salal berry"'
input <- c(line1, line2, line3, line4)
writeLines(input)

## [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
## [5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
## [9] "elderberry"   "lime"         "lychee"       "mulberry"    
## [13] "olive"        "salal berry"

# Combine the lines into one vector element
input <- str_c(input, collapse = " ")
writeLines(input)

## [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" [5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"   [9] "elderberry"   "lime"         "lychee"       "mulberry"     [13] "olive"        "salal berry"

# Remove the bracketed index numbers, e.g. [1] or [13]
interim <- str_remove_all(input, "\\[\\d+\\]")
writeLines(interim)

##  "bell pepper"  "bilberry"     "blackberry"   "blood orange"  "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"    "elderberry"   "lime"         "lychee"       "mulberry"      "olive"        "salal berry"

# Trim leading/trailing whitespace and collapse repeated spaces into one
interim <- str_squish(interim)
writeLines(interim)

## "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry" "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime" "lychee" "mulberry" "olive" "salal berry"

# Replace " " with ", "
interim <- str_replace_all(interim, "\" \"", "\", \"")
writeLines(interim)

## "bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry"

# Wrap the string in `c()`
output <- str_c("c(", interim, ")")
writeLines(output)

## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

However, this string only looks like the target output; it is not a character vector containing these elements. If an actual vector is needed, we can extract just the fruit names from the original input after it has been collapsed into one vector element. Note that writeLines() prints to the console and returns NULL invisibly, so its printed output cannot be assigned to a variable; otherwise we would have shown that, too.
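
If the c(...) string itself has to become a vector, one option is to let R parse and evaluate it. The following is a minimal sketch, assuming the string is trusted (evaluating arbitrary text is unsafe in general); `output_demo` is a shortened stand-in for the `output` string built above.

```r
# Shortened stand-in for the `output` string built above (assumption)
output_demo <- 'c("bell pepper", "bilberry", "salal berry")'

# parse() turns the text into an R expression; eval() runs it,
# producing an ordinary character vector
fruit_vec <- eval(parse(text = output_demo))

print(fruit_vec)
print(length(fruit_vec))
```

This goes from the printed form back to a vector in one step, at the cost of executing whatever code the string contains.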

Generate a vector that produces the same output

# Extract the quoted names into a list.
# The quoted gaps between names (e.g. "  ") are not matched because matches
# cannot overlap: each closing quote is already consumed by the preceding match.
output2 <- str_extract_all(input, "\"[a-z ]+\"")
output2

## [[1]]
##  [1] "\"bell pepper\""  "\"bilberry\""     "\"blackberry\""   "\"blood orange\""
##  [5] "\"blueberry\""    "\"cantaloupe\""   "\"chili pepper\"" "\"cloudberry\""  
##  [9] "\"elderberry\""   "\"lime\""         "\"lychee\""       "\"mulberry\""    
## [13] "\"olive\""        "\"salal berry\""

# Turn the list into a vector
output2 <- unlist(output2)
output2

##  [1] "\"bell pepper\""  "\"bilberry\""     "\"blackberry\""   "\"blood orange\""
##  [5] "\"blueberry\""    "\"cantaloupe\""   "\"chili pepper\"" "\"cloudberry\""  
##  [9] "\"elderberry\""   "\"lime\""         "\"lychee\""       "\"mulberry\""    
## [13] "\"olive\""        "\"salal berry\""

# Remove explicit double quotes
output2 <- str_remove_all(output2, "\"")
output2

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

This matches the output of c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry"):

# Show the output of c("bell pepper", ..., "salal berry")
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
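
The extract, unlist and strip-quotes steps above could arguably be collapsed into a single pattern that matches the names without their quotes. This is a sketch under the assumption that every name consists only of lowercase letters separated by single spaces; `input_demo` is a shortened stand-in for the collapsed `input` string.

```r
library(stringr)

# Shortened stand-in for the collapsed `input` string (assumption)
input_demo <- '[1] "bell pepper"  "bilberry"     [13] "olive"        "salal berry"'

# One or more lowercase words separated by single spaces. Quotes, digits and
# brackets are not in the character class, so they act as delimiters.
fruits <- str_extract_all(input_demo, "[a-z]+( [a-z]+)*")[[1]]

print(fruits)
```

Because the double spaces between quoted names sit outside any run of letters, they never become part of a match.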

Exercise 3

Escaped non-special characters

On #3, please don’t assume that it’s a typo when there is only a single backslash. That is, what happens when the second backslash that R requires is forgotten?

Just above 14.3.1.1 in R for Data Science, Wickham & Grolemund write, “In this book, I’ll write regular expression as \. and strings that represent the regular expression as "\\.".” So I assume that the two examples written without quotes are demonstrating the difference between a regular expression and the string that represents it.

Nevertheless, if you escape an ordinary character in an R string, it doesn’t just disappear; otherwise the second and third code chunks below would have equivalent output. Instead, R reads "\1" as an octal escape, so it becomes a single non-printing control character (character code 1). The examples below show that "A\1\1" is matched by both "(.)" and "(.)\1\1", whereas "AAA", "BBB" and "B11" don’t match "(.)\1\1" because they don’t contain those two control characters.

# Writing the string "(.)\1\1" prints what looks like "(.)", but the two are not equivalent in the examples below
writeLines("(.)\1\1")

## (.)

# Running str_view() using "(.)\1\1" matches the first instance of any character followed by
# two "\1" control characters.
sample_string <- c("AAA", "BBB", "B11", "A\1\1")
str_view(sample_string, "(.)\1\1")

# Running str_view() using "(.)" matches the first instance of any character
sample_string <- c("AAA", "BBB", "B11", "A\1\1")
str_view(sample_string, "(.)")
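
Since the extra characters are invisible when printed, counting characters is one way to make the difference concrete. The sketch below uses only base R; the variable names are illustrative.

```r
# With single backslashes, R turns each "\1" into one control character
unescaped <- "(.)\1\1"
# With doubled backslashes, the string holds a literal backslash followed by 1
escaped <- "(.)\\1\\1"

print(nchar(unescaped))  # 5: "(", ".", ")" plus two control characters
print(nchar(escaped))    # 7: "(", ".", ")" plus two backslash-digit pairs
print(utf8ToInt("\1"))   # 1: the code point of the control character
```

Only the 7-character version reaches the regex engine as a pattern containing the backreference \1.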

Exercise 3A

Describe, in words, what these expressions will match:

(.)\1\1

This will match any character (except \n) repeated three times consecutively.

sample_string <- c("AAA", "BBB", "BCCC", "ABBCB")
str_view(sample_string, "(.)\\1\\1")

Exercise 3B

"(.)(.)\\2\\1"

This will match any two characters followed by the same two characters in reverse order, as in "ANNA".

sample_string <- c("ANNA", "BABA", "UOMMO", "ABBACADABBADOO")
str_view_all(sample_string, "(.)(.)\\2\\1")

Exercise 3C

(..)\1

This will match any two characters immediately followed by the same two characters, as in "MAMA".

sample_string <- c("MAMA", "PAPA", "GAGAGA", "HAHAHAHA")
str_view_all(sample_string, "(..)\\1")

Exercise 3D

"(.).\\1.\\1"

This will match five characters that have the same first, third and fifth character.

sample_string <- c("DADDY", "DADADY", "NANNA", "NANXN")
str_view(sample_string, "(.).\\1.\\1")

Exercise 3E

"(.)(.)(.).*\\3\\2\\1"

This will match any three characters followed later by the same three characters in reverse order, with zero or more characters in between.

sample_string <- c("ABCCB", "ABCCBA", "ABCXCBA", "ABCXYCBA", "ABCX CBAY")
str_view(sample_string, "(.)(.)(.).*\\3\\2\\1")
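
Because str_view() output is visual, the descriptions in Exercises 3A to 3E can also be checked as plain text with str_extract(), which returns the first matched substring. This is a sketch that takes one sample string from each exercise above; the result names are illustrative.

```r
library(stringr)

# First match for each pattern, using one sample string per exercise
res_3a <- str_extract("BCCC", "(.)\\1\\1")                  # "CCC"
res_3b <- str_extract("UOMMO", "(.)(.)\\2\\1")              # "OMMO"
res_3c <- str_extract("GAGAGA", "(..)\\1")                  # "GAGA"
res_3d <- str_extract("DADADY", "(.).\\1.\\1")              # "DADAD"
res_3e <- str_extract("ABCXYCBA", "(.)(.)(.).*\\3\\2\\1")   # "ABCXYCBA"

print(c(res_3a, res_3b, res_3c, res_3d, res_3e))
```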

Exercise 4

Exercise 4A

Construct regular expressions to match words that:

Start and end with the same character.

"^(.).*\\1$"

As shown below, only the vector elements that start and end with the same character are displayed.

sample_string <- c("ABCCB", "ABCXCBA", "ZBCXYCBZ", "ZBCXYZCBA")
str_view(sample_string, "^(.).*\\1$")
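
One edge case worth noting is a one-character word, which arguably starts and ends with the same character but cannot match a pattern that requires the captured character to appear a second time. The variant below makes the tail optional. It is a sketch of one possible relaxation, not part of the original answer, and `words_demo` is an illustrative vector.

```r
library(stringr)

words_demo <- c("a", "abca", "ab")

# Original pattern: needs at least two characters
strict <- str_detect(words_demo, "^(.).*\\1$")

# Variant with an optional tail, so a single character also matches
relaxed <- str_detect(words_demo, "^(.)(.*\\1)?$")

print(strict)   # FALSE TRUE FALSE
print(relaxed)  # TRUE TRUE FALSE
```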

Exercise 4B

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

"(..).*\\1"

See below for the regex in practice. I tried adding {2} or ?, but neither ended the match in "churchitch" after the second "ch".

sample_string <- c("ABCCAB", "church", "ladeeadi", "ZBCXYXYZA", "churchitch")
str_view(sample_string, "(..).*\\1")
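
A likely reason the match runs to the end of "churchitch" is that .* is greedy and takes as many characters as it can before the backreference. Placing ? directly after .* makes that quantifier lazy, which appears to stop the match at the first repeat. This is a sketch of that idea, not the approach originally tried.

```r
library(stringr)

# Greedy: .* grabs as much as possible, so the match ends at the last "ch"
greedy <- str_extract("churchitch", "(..).*\\1")

# Lazy: .*? grabs as little as possible, so the match ends at the second "ch"
lazy <- str_extract("churchitch", "(..).*?\\1")

print(greedy)  # "churchitch"
print(lazy)    # "church"
```

Either form detects the same words; the difference only affects how much of each word is highlighted.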

Exercise 4C

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

"(.).*\\1.*\\1"

The sample strings below show this regex in practice.

sample_string <- c("ABCCAB", "eleven", "DADDY", "ladeeadi", "ZBCXYAYAAY")
str_view(sample_string, "(.).*\\1.*\\1")
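
To list which of these sample strings match without relying on the viewer, str_detect() can be applied to the same vector. This is a minimal sketch; `has_triple` is an illustrative name.

```r
library(stringr)

sample_string <- c("ABCCAB", "eleven", "DADDY", "ladeeadi", "ZBCXYAYAAY")

# TRUE where some character occurs in at least three places
has_triple <- str_detect(sample_string, "(.).*\\1.*\\1")

print(sample_string[has_triple])  # "eleven" "DADDY" "ZBCXYAYAAY"
```

"ABCCAB" and "ladeeadi" are not returned because no character in them occurs more than twice.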

Source Files

The R Markdown file for this document is saved at github.com/pkofy/DATA607 under the name “DATA607WK3Assignment.rmd”.