Introduction

Below, I approach each question with a combination of explanatory text and R code chunks for testing, examples, and solutions.

Imports

Pull in tidyverse for any analyses I might like to run.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Question 1

I need to pull in the majors from the provided FiveThirtyEight data. While I could download the file outright, I wanted to practice reading data directly from a GitHub URL, which requires the “raw” link.

urlfile = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv'

df <- read.csv(url(urlfile))

Search for key terms

The majors are already all upper-case, so I can safely search using upper-case strings. I made two separate vectors, one searching for “DATA” and the other for “STATISTICS”, then combined them.

c(str_subset(df$Major, pattern = 'DATA'),
  str_subset(df$Major, pattern = 'STATISTICS')
)

## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [2] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [3] "STATISTICS AND DECISION SCIENCE"

I came back to this problem after realizing my previous solution was a bit clunky. I wanted to base a more elegant solution on combining search terms into a single string and piping it into the grep function.

terms <- c("DATA", "STATISTICS")

str_c(terms, collapse = '|') %>%
  grep(df$Major, value = TRUE)

## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"
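The same idea can be kept inside a data frame pipeline. This is only a sketch under assumptions: `majors` is a small hypothetical stand-in for the real table (so the example runs without a download), and `matches` is a name introduced here for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the majors table (not the real 538 data)
majors <- data.frame(
  Major = c("GENERAL AGRICULTURE",
            "STATISTICS AND DECISION SCIENCE",
            "COMPUTER PROGRAMMING AND DATA PROCESSING")
)

terms <- c("DATA", "STATISTICS")

# Collapse the terms into one alternation pattern, "DATA|STATISTICS",
# and keep the rows whose Major matches either term
matches <- majors %>%
  filter(str_detect(Major, str_c(terms, collapse = "|")))

matches$Major
```

Filtering rows rather than extracting strings keeps any other columns of the table attached to the matching majors.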

Question 2

Copy in the long string of terms.

strStart = '[1] "bell pepper" "bilberry"   "blackberry"  "blood orange"
[5] "blueberry"  "cantaloupe"  "chili pepper" "cloudberry"
[9] "elderberry"  "lime"     "lychee"    "mulberry"
[13] "olive"    "salal berry"'

strStart

## [1] "[1] \"bell pepper\" \"bilberry\"   \"blackberry\"  \"blood orange\"\n[5] \"blueberry\"  \"cantaloupe\"  \"chili pepper\" \"cloudberry\"\n[9] \"elderberry\"  \"lime\"     \"lychee\"    \"mulberry\"\n[13] \"olive\"    \"salal berry\""

I know there are literal quotation marks separating all the values I care about, so I can use str_split to break them up.

listSplit <- str_split(strStart, pattern = '"')

#str_split() returns a list with one element per input string, so listSplit is a one-element list holding a character vector. I want to save a new version that selects only this one element.

listSplit <- listSplit[[1]]

listSplit

##  [1] "[1] "         "bell pepper"  " "            "bilberry"     "   "         
##  [6] "blackberry"   "  "           "blood orange" "\n[5] "       "blueberry"   
## [11] "  "           "cantaloupe"   "  "           "chili pepper" " "           
## [16] "cloudberry"   "\n[9] "       "elderberry"   "  "           "lime"        
## [21] "     "        "lychee"       "    "         "mulberry"     "\n[13] "     
## [26] "olive"        "    "         "salal berry"  ""

The result is messy, but all of the values I care about are contained within it. All of the foods in the list contain lowercase letters, while the values I want to lose (runs of spaces of various lengths, line breaks, numbers in brackets) do not. I can use a regex to select only the elements that contain a lowercase letter.

vectorFood <- grep('[a-z]', listSplit, value = TRUE)

vectorFood

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
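The split-then-filter steps above could also be collapsed into a single capture of whatever sits between each pair of quotation marks. This is a sketch, not the only way to do it: `sample_str` is a shortened, hypothetical string in the same printed-vector layout, and `foods` is a name introduced here.

```r
library(stringr)

# Shortened, hypothetical sample in the same layout as strStart
sample_str <- '[1] "bell pepper" "bilberry"   "blackberry"\n[4] "lime"     "salal berry"'

# Match a quote, capture lowercase letters and spaces, match the closing quote.
# str_match_all() returns one matrix per input string; column 2 holds the capture.
foods <- str_match_all(sample_str, '"([a-z][a-z ]*)"')[[1]][, 2]

foods

# dput() prints the vector in c("...", "...") form
dput(foods)
```

Because each match consumes both of its quotation marks, the whitespace between two neighboring items is never mistaken for a quoted value.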

Question 3

(.)\1\1

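answer: Any character followed by two copies of the control character U+0001. Because the backslashes are not doubled, R reads each \1 in the string as an octal escape for that character rather than passing a backreference to the regex engine. Written as "(.)\\1\\1", the pattern would instead match the same character three times in a row.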

test_vec = c('a\1\1', 'a11', 'z\1\1', '566')

str_view(test_vec, pattern = '(.)\1\1')

## [1] │ <a>
## [3] │ <z>
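To see why the single backslash changes the meaning, it helps to inspect what R actually stores for each version of the string. The sketch below uses the hypothetical names `single`, `double`, and `hits`.

```r
library(stringr)

single <- '\1'    # one character: the control character U+0001
double <- '\\1'   # two characters: a backslash and the digit 1

nchar(single)      # 1
utf8ToInt(single)  # 1
nchar(double)      # 2

# With doubled backslashes the regex engine receives a backreference,
# so the pattern matches the same character three times in a row.
hits <- str_detect(c('aaa', 'a11', 'aab'), '(.)\\1\\1')
hits
```

The control character is invisible when printed, which is why the str_view output above appears to highlight only the leading letter.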

“(.)(.)\2\1”

answer: The string must contain a four-character palindromic stretch, i.e. xyyx or 1221.

test_vec = c('xyyx', 'aba', '1221', '0123', 'ajsdfkttksasdf')

str_view(test_vec, pattern = "(.)(.)\\2\\1")

## [1] │ <xyyx>
## [3] │ <1221>
## [5] │ ajsdf<kttk>sasdf

(..)\1

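answer: Any two characters followed by the control character U+0001. As in the first example, the single backslash makes \1 an octal escape in the R string rather than a backreference.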

test_vec = c('a1', 'a\1', 'ab1', 'ab\1')

str_view(test_vec, pattern = '(..)\1')

## [4] │ <ab>

“(.).\1.\1”

answer: The string must contain the same character three times, with exactly one arbitrary character between consecutive occurrences (as in “12131” or “cabana”).

test_vec = c('12345', '12131', 'cabana', 'bonobo', 'babble', 'baaa')

str_view(test_vec, pattern = '(.).\\1.\\1')

## [2] │ <12131>
## [3] │ c<abana>
## [4] │ b<onobo>

“(.)(.)(.).*\3\2\1”

answer: String must have a stretch of 3 characters and then, somewhere later in the string, the same stretch backwards. So something like “abcdcba” or “nab some bananas”

test_vec = c('nab some bananas', 'abcdefg', 'abcdcba', '123454321', '98765')

str_view(test_vec, pattern = '(.)(.)(.).*\\3\\2\\1')

## [1] │ <nab some ban>anas
## [3] │ <abcdcba>
## [4] │ <123454321>

Question 4

Start and end with the same character.

answer: ^(.)(.*)\1$

explanation: ^ anchors the match to the start of the string, so (.) captures the first character. ‘\1’ calls back that same character, and $ requires it to fall at the end of the string. The (.*) in the middle allows any number of characters (even 0) between them.

#test

test_vec <- c("bob", "david", "fred", "joe", "roger")

str_view(test_vec, pattern = '^(.)(.*)\\1$')

## [1] │ <bob>
## [2] │ <david>
## [5] │ <roger>
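One edge case worth checking is a single-character string, which arguably starts and ends with the same character but cannot match a pattern that needs at least two characters. The sketch below compares the answer above with a more lenient variant. The variant and the names `edge_cases`, `original`, and `lenient` are my own additions, and whether a one-character string should count depends on how the question is read.

```r
library(stringr)

edge_cases <- c("a", "bob", "fred")

# Pattern from the answer above: requires at least two characters
original <- str_detect(edge_cases, '^(.)(.*)\\1$')

# Lenient variant: the "anything, then the first character again" part is optional
lenient <- str_detect(edge_cases, '^(.)(.*\\1)?$')

original  # FALSE  TRUE FALSE
lenient   #  TRUE  TRUE FALSE
```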

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

answer: ([a-z][a-z]).*\1

explanation: [a-z] matches any lowercase letter, so two in a row match two consecutive lowercase letters. Parentheses group them so they can be recalled later with ‘\1’. As before, .* shows that the two instances of the two-letter string need not occur right next to each other.

test_vec = c('church', 'peewee', 'burger', 'mammoth', 'torpor')

str_view(test_vec, pattern = '([a-z][a-z]).*\\1')

## [1] │ <church>
## [2] │ p<eewee>
## [5] │ t<orpor>
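To confirm which pair is actually being repeated, the capture group can be pulled out directly. This is a small sketch using `str_match`, with `words` and `pairs` as hypothetical names. Column 2 of the result holds the first capture group, and words without a match yield NA.

```r
library(stringr)

words <- c('church', 'peewee', 'burger', 'mammoth', 'torpor')

# Column 1 is the full match, column 2 is the captured two-letter pair
pairs <- str_match(words, '([a-z][a-z]).*\\1')[, 2]

pairs
```

For these five words the captured pairs are "ch", "ee", and "or", with NA for the two words that do not match.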

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

answer: (.).*\1.*\1

explanation: “(.)” captures an individual character, which does not need to occur first in the string. ‘\1’ appears twice, calling the original character back each time. Because the question did not specify how close together these occurrences must be, .* was inserted between them.

test_vec = c('eleven', 'ghost', 'banana', 'pizazz', 'basketball')

str_view(test_vec, pattern = '(.).*\\1.*\\1')

## [1] │ <eleve>n
## [3] │ b<anana>
## [4] │ pi<zazz>
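As a sanity check on the regex, the same question can be answered by counting characters directly. This is a sketch under the assumption that "at least three places" means any character appearing three or more times. The names `words`, `by_regex`, and `by_count` are introduced here for illustration.

```r
library(stringr)

words <- c('eleven', 'ghost', 'banana', 'pizazz', 'basketball')

# Regex approach from the answer above
by_regex <- str_detect(words, '(.).*\\1.*\\1')

# Counting approach: split each word into characters and tally them
by_count <- vapply(
  strsplit(words, ''),
  function(chars) max(table(chars)) >= 3,
  logical(1)
)

by_regex
identical(by_regex, by_count)
```

The two approaches agree on these words: "basketball" has several letters that appear twice but none that appears three times, so both reject it.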

data607_hw3

2023-02-08