library(tidyverse)  # attaches stringr (used below) along with ggplot2 and the other core packages

Problem 1

First, we read in the dataset and then keep only the majors whose names contain "DATA" or "STATISTICS".

df.college.majors = read.csv( url("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"))

# Keep only the majors whose names contain "DATA" or "STATISTICS"
vec.majors = grep("DATA|STATISTICS", df.college.majors$Major, value = TRUE)
print(vec.majors)

## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"
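
The same filter can also be written with stringr's str_subset(), which returns the matching elements directly. This is only a sketch: the majors vector below is a small hypothetical sample standing in for df.college.majors$Major, so that it runs without downloading the CSV.

```r
library(stringr)

# Hypothetical sample standing in for df.college.majors$Major
majors <- c("GENERAL AGRICULTURE",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "ACCOUNTING")

# str_subset() keeps the elements that match the pattern
data_stats_majors <- str_subset(majors, "DATA|STATISTICS")
print(data_stats_majors)
```

The pattern is case-sensitive, which works here because the major names are stored in upper case.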

Problem 2

To convert this printed output into a proper character vector, we take the following steps:

1. Remove the bracketed index numbers ([1], [5], ...) from the string
2. Split the string on quotation marks
3. Keep only the pieces that contain letters

vec.text = c('[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
[9] "elderberry"   "lime"         "lychee"       "mulberry"    
[13] "olive"        "salal berry"')

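# Strip the bracketed index markers ([1], [5], ...) at the start of each line
vec.text.char = gsub("(\\n\\[\\d+\\])|(^\\[\\d+\\])", "", vec.text)
# Split on double quotes; strsplit() returns a list, so flatten it
vec.text.char = strsplit(vec.text.char, '\\"')
vec.text.char = unlist(vec.text.char)
# Keep only the pieces that contain letters (drops the whitespace-only pieces)
vec.text.char = vec.text.char[grep("[a-z]", vec.text.char)]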
print(vec.text.char)

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
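
An alternative sketch extracts the quoted pieces directly instead of splitting and filtering. It assumes that no value itself contains a double quote. The shortened input string and the names printed_text, quoted and fruits are illustrative only and are not part of the solution above.

```r
library(stringr)

# Shortened, hypothetical version of the printed vector
printed_text <- '[1] "bell pepper"  "bilberry"\n[3] "lime"'

# Pull out every run of characters enclosed in double quotes
quoted <- str_extract_all(printed_text, '"[^"]+"')[[1]]

# Drop the surrounding quotes to leave the bare values
fruits <- str_remove_all(quoted, '"')
print(fruits)
```

Calling dput(fruits) afterwards prints the vector as R code, c("bell pepper", "bilberry", "lime"), which can be pasted back into a script.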

Problem 3

1. (.)\1\1
2. "(.)(.)\\2\\1"
3. (..)\1
4. "(.).\\1.\\1"
5. "(.)(.)(.).*\\3\\2\\1"

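(.)\1\1: Written as the R string "(.)\1\1", without escaping the backslash, R reads \1 as an octal escape for the control character with code 1, so the pattern matches any character followed by two of those control characters. Written as "(.)\\1\\1", \1 reaches the regex engine as a backreference, and the pattern matches one character appearing three times in a row (e.g. "ZZZ").

Examples: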

str_detect("ZZZ", "(.)\1\1")

## [1] FALSE

str_detect("Z\1\1", "(.)\1\1")

## [1] TRUE

str_detect("ZZZ", "(.)\\1\\1")

## [1] TRUE

str_detect("Z\1\1", "(.)\\1\\1")

## [1] FALSE
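
The difference comes from how R parses string literals before the regex engine ever sees them. A quick check, using only base R, shows what each literal actually contains:

```r
# "\1" is an octal escape: a single control character with code 1
one_char <- "\1"
nchar(one_char)      # 1
utf8ToInt(one_char)  # 1

# "\\1" is two characters, a backslash and the digit 1,
# which the regex engine reads as a backreference to group 1
two_chars <- "\\1"
nchar(two_chars)     # 2

# writeLines() shows the pattern as the regex engine receives it
writeLines("(.)\\1\\1")  # prints (.)\1\1
```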

"(.)(.)\\2\\1": Two characters followed by the same two characters in reverse order, so the second character appears twice in the middle (e.g. "zooz" or "abba").
(..)\1: Written as the R string "(..)\1", without escaping the backslash, the pattern matches any two characters followed by the control character \1. Written as "(..)\\1", it matches any two characters (which may differ from each other) immediately repeated (e.g. "dodo").

Examples:

str_detect("zz\1", "(..)\1")

## [1] TRUE

str_detect("dodo", "(..)\1")

## [1] FALSE

str_detect("zz\1", "(..)\\1")

## [1] FALSE

str_detect("dodo", "(..)\\1")

## [1] TRUE

"(.).\\1.\\1": A character, then any character, then the first character again, then any character, then the first character once more (e.g. "abaca").
"(.)(.)(.).*\\3\\2\\1": Three characters, followed by zero or more characters of any kind, followed by the same three characters in reverse order (e.g. "abcxcba").
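
The three double-escaped patterns have no worked examples above, so here is a short check with str_detect(). The test strings and result names are made up for illustration.

```r
library(stringr)

# "(.)(.)\\2\\1": a pair of characters followed by the same pair reversed
reversed_pair <- str_detect(c("abba", "zooz", "abab"), "(.)(.)\\2\\1")
print(reversed_pair)   # TRUE TRUE FALSE

# "(.).\\1.\\1": the same character in positions 1, 3 and 5 of a five-character run
alternating <- str_detect(c("abaca", "abcde"), "(.).\\1.\\1")
print(alternating)     # TRUE FALSE

# "(.)(.)(.).*\\3\\2\\1": three characters, anything (possibly nothing), then the three reversed
mirrored <- str_detect(c("abccba", "abcxyzcba", "abcabc"), "(.)(.)(.).*\\3\\2\\1")
print(mirrored)        # TRUE TRUE FALSE
```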

Problem 4

Construct regular expressions to match words that:

Start and end with the same character.

str_view("civic", "^(.).*\\1$", match = TRUE)

Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)

str_view("church", "(..).*\\1", match = TRUE)

Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)

str_view("eleven", "(.).*\\1.*\\1", match = TRUE)
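
To see how the three patterns behave on more than one word at a time, the sketch below runs them over a small hypothetical vector, test_words, with str_subset(). One deliberate change from the solution above: the first pattern wraps the tail in an optional group so that a single-letter word such as "a" also counts as starting and ending with the same character. Whether that case should count is an assumption about the exercise.

```r
library(stringr)

# Hypothetical word list for illustration
test_words <- c("civic", "a", "church", "eleven", "banana", "dog")

# Start and end with the same character (optional group admits one-letter words)
same_ends <- str_subset(test_words, "^(.)(.*\\1)?$")
print(same_ends)       # "civic" "a"

# Contain a repeated pair of characters
repeated_pair <- str_subset(test_words, "(..).*\\1")
print(repeated_pair)   # "church" "banana"

# Contain one character in at least three places
three_times <- str_subset(test_words, "(.).*\\1.*\\1")
print(three_times)     # "eleven" "banana"
```

Because . matches any character, not just letters, replacing it with [A-Za-z] would be a stricter reading of "letters" in the exercise text.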

Data 607: Assignment 3

Deepika Dilip
