Week3 - Character Manipulation and Date Processing

1. Provide code that identifies the majors that contain either “DATA” or “STATISTICS”

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

# Load the CSV from Github
all_ages = read_csv("https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/week3/college-majors/all-ages.csv", show_col_types=FALSE)

# Now grep for DATA or STATISTICS
data_statistics_majors <- grep(pattern = "(STATISTICS|DATA)", all_ages$Major)

# Output the matched majors
for (i in (data_statistics_majors)) {
  cat (sprintf("%s\n", all_ages$Major[i]))
}

## COMPUTER PROGRAMMING AND DATA PROCESSING
## STATISTICS AND DECISION SCIENCE
## MANAGEMENT INFORMATION SYSTEMS AND STATISTICS

2. Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange” [5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

# Create a character vector representing the original comma separted, data string
stuff <- paste("bell pepper, bilberry, blackberry, blood orange, blueberry, cantaloupe, chili pepper, cloudberry, elderberry, lime, lychee, mulberry, olive, salal berry")

# Initialize
output <- ""

# Concatenate each element wrapped in double quotes and a comma
for (i in (str_split(stuff, " *, *"))) {
  output <- str_c(output, '"',i,'"', sep="", collapse=",")
}

# Wrap with parentheses
output <-str_c('c(',output,')', sep="", collapse="")

# Display the result
cat(output)

## c("bell pepper","bilberry","blackberry","blood orange","blueberry","cantaloupe","chili pepper","cloudberry","elderberry","lime","lychee","mulberry","olive","salal berry")

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

3. Describe, in words, what these expressions will match:

(.)\1\1

will match any 3 consecutive characters (ex: CCC)

“(.)(.)\2\1”

will match any 2 characters followed by the same 2 captured characters in reverse order (ex: eppe)

(..)\1

will match any 2 characters followed by the same 2 characters (ex: anan)

“(.).\1.\1”

will match any character followed by any character, followed by the 1st captured character, followed by any character apart from 1st captured character, followed by the 1st captured character (ex: anana, anama)

"(.)(.)(.).*\3\2\1"

will match any 3 characters, followed by any or no characters, followed by the first captured 3 characters in reverse order (ex: aprrpa, apricotrpa)

4. Construct regular expressions to match words that:

Start and end with the same character.

words[str_detect(words, "^(.)((.*\\1$)|\\1?$)")]

##  [1] "a"          "america"    "area"       "dad"        "dead"      
##  [6] "depend"     "educate"    "else"       "encourage"  "engine"    
## [11] "europe"     "evidence"   "example"    "excuse"     "exercise"  
## [16] "expense"    "experience" "eye"        "health"     "high"      
## [21] "knock"      "level"      "local"      "nation"     "non"       
## [26] "rather"     "refer"      "remember"   "serious"    "stairs"    
## [31] "test"       "tonight"    "transport"  "treat"      "trust"     
## [36] "window"     "yesterday"

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

str_view(words, "([a-zA-Z][a-zA-Z]).*\\1", match = TRUE)

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

str_view(words, "([a-zA-Z]).*\\1.*\\1", match = TRUE)