library(knitr)
library(stringr)
library(readr)

Overview

This is the Week 3 assignment for Data 607 - Data Acquisition and Management.

1.

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

# Import the dataset
majors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv", header = TRUE, sep = ",")

I extract the “Major” column from the data frame as a character vector.

majors_col <- majors$Major

This code creates the variable “data”, which holds the majors that contain the word “DATA”, and the variable “statistics”, which holds the majors that contain the word “STATISTICS”. It then combines the two into a single character vector of the majors that contain either word and prints it.

data <- str_subset(majors_col, pattern = "DATA")
statistics <- str_subset(majors_col, pattern = "STATISTICS")
data_stats_majors <- c(data, statistics)
data_stats_majors
## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [2] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [3] "STATISTICS AND DECISION SCIENCE"
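The same result can probably be reached in one step by using alternation in the pattern. The sketch below is an illustration rather than part of the original solution: `sample_majors` is a hypothetical stand-in for `majors$Major`, so it runs without downloading the dataset.

```r
library(stringr)

# Hypothetical sample standing in for majors$Major
sample_majors <- c(
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "STATISTICS AND DECISION SCIENCE",
  "GENERAL AGRICULTURE"
)

# "DATA|STATISTICS" matches a major containing either word
data_or_stats <- str_subset(sample_majors, "DATA|STATISTICS")
data_or_stats
```

Because each major is tested once, a major containing both words would appear only once in the result, whereas combining two separate subsets with c() would list it twice.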

2. Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

I got help on this problem from Zig Dukuray.

This code stores the data as a string.

input <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
[9] "elderberry"   "lime"         "lychee"       "mulberry"    
[13] "olive"        "salal berry"'
input
## [1] "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\"\n[5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"  \n[9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"    \n[13] \"olive\"        \"salal berry\""

This code extracts every quoted element in the input that matches the regex. Because str_extract_all() returns a list, I then use unlist() to turn the result into an atomic character vector so we can continue to use stringr commands on it.

input2 <- str_extract_all(input, "\"([a-z]+.[a-z]+)\"")
input2
## [[1]]
##  [1] "\"bell pepper\""  "\"bilberry\""     "\"blackberry\""   "\"blood orange\""
##  [5] "\"blueberry\""    "\"cantaloupe\""   "\"chili pepper\"" "\"cloudberry\""  
##  [9] "\"elderberry\""   "\"lime\""         "\"lychee\""       "\"mulberry\""    
## [13] "\"olive\""        "\"salal berry\""
input3 <- unlist(input2)
input3
##  [1] "\"bell pepper\""  "\"bilberry\""     "\"blackberry\""   "\"blood orange\""
##  [5] "\"blueberry\""    "\"cantaloupe\""   "\"chili pepper\"" "\"cloudberry\""  
##  [9] "\"elderberry\""   "\"lime\""         "\"lychee\""       "\"mulberry\""    
## [13] "\"olive\""        "\"salal berry\""

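Remove the double quotes around each element. The backslashes in the printed output above are only R's escape notation for the quote characters; they are not part of the strings.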

input4 <- str_replace_all(input3, "\"", "")
input4
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
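The extract and strip steps above could likely be combined by capturing only the text between the quotes. This is a minimal sketch, not the approach used in this assignment: it runs on a shortened, hypothetical version of the input and uses `str_match_all()`, which returns one matrix per input string with the capture groups in the later columns.

```r
library(stringr)

# Shortened, hypothetical version of the printed vector
input <- '[1] "bell pepper"  "bilberry"\n[5] "blueberry"    "lime"'

# "([^"]+)" captures everything between a pair of double quotes
matches <- str_match_all(input, '"([^"]+)"')

# Column 1 is the full match (with quotes); column 2 is capture group 1
items <- matches[[1]][, 2]
items
```

The pattern `[^"]+` accepts any characters other than a double quote, so it does not depend on the items being lowercase letters with a single separator.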

This code attempts to paste the elements of the vector into a single c(...) string.

input5 <- paste0('c("' ,input4, '")', collapse = ',')
input6 <- paste0(c(input4))
input5
## [1] "c(\"bell pepper\"),c(\"bilberry\"),c(\"blackberry\"),c(\"blood orange\"),c(\"blueberry\"),c(\"cantaloupe\"),c(\"chili pepper\"),c(\"cloudberry\"),c(\"elderberry\"),c(\"lime\"),c(\"lychee\"),c(\"mulberry\"),c(\"olive\"),c(\"salal berry\")"
input6
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

I know that I did not reach the desired format. I’m not sure how to use the paste0() function with c() to “concatenate” the whole list and not each element in the list.
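One way to get the desired format is to split the work into two paste0() calls. In paste0(), the pieces are combined element by element first and collapse is applied afterwards, so any c(" ... ") wrapper passed alongside the vector is repeated for every element. The sketch below uses `fruits`, a short hypothetical stand-in for `input4`.

```r
# Short hypothetical stand-in for input4
fruits <- c("bell pepper", "bilberry", "blackberry")

# Step 1: quote each element, then collapse them into one string
quoted <- paste0('"', fruits, '"', collapse = ", ")

# Step 2: wrap that single string in c( ... ) exactly once
result <- paste0("c(", quoted, ")")

# writeLines() shows the raw text instead of R's escaped \" display
writeLines(result)
## c("bell pepper", "bilberry", "blackberry")
```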

3. Describe, in words, what these expressions will match:

4. Construct regular expressions to match words that:

Start and end with the same character. “^(.).*\1$”

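Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.) “(..).*\1” The “.*” has to sit between the group and the backreference so that the second occurrence of the pair does not need to follow the first immediately.

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.) “([a-z]).*\1.*\1” Using “.*” rather than “.” allows any number of characters between the repeats.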
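Patterns like these can be checked against a few sample words with str_subset(). This is a sketch under some assumptions: the word list is made up for illustration, and the versions of the patterns used here put `.*` between a group and its backreference so that the repeats do not have to be adjacent. Inside an R string the backreference is written `\\1`, because the backslash itself has to be escaped.

```r
library(stringr)

# Hypothetical test words
words <- c("church", "eleven", "area", "banana", "data")

same_ends     <- "^(.).*\\1$"          # first and last character are equal
repeated_pair <- "([a-z][a-z]).*\\1"   # some two-letter pair occurs twice
three_times   <- "([a-z]).*\\1.*\\1"   # some letter occurs at least three times

str_subset(words, same_ends)
## [1] "area"
str_subset(words, repeated_pair)
## [1] "church" "banana"
str_subset(words, three_times)
## [1] "eleven" "banana"
```

Note that `same_ends` needs at least two characters, so a one-letter word would not match it.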