Data607_Assignment3

Libraries

I used the following libraries to answer the questions below.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(stringr)

Question 1

The raw data used to answer this question is available here: https://github.com/fivethirtyeight/data/blob/master/college-majors/majors-list.csv

college_majors <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")

## Rows: 174 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

college_majors

## # A tibble: 174 × 3
##    FOD1P Major                                 Major_Category                 
##    <chr> <chr>                                 <chr>                          
##  1 1100  GENERAL AGRICULTURE                   Agriculture & Natural Resources
##  2 1101  AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
##  3 1102  AGRICULTURAL ECONOMICS                Agriculture & Natural Resources
##  4 1103  ANIMAL SCIENCES                       Agriculture & Natural Resources
##  5 1104  FOOD SCIENCE                          Agriculture & Natural Resources
##  6 1105  PLANT SCIENCE AND AGRONOMY            Agriculture & Natural Resources
##  7 1106  SOIL SCIENCE                          Agriculture & Natural Resources
##  8 1199  MISCELLANEOUS AGRICULTURE             Agriculture & Natural Resources
##  9 1302  FORESTRY                              Agriculture & Natural Resources
## 10 1303  NATURAL RESOURCES MANAGEMENT          Agriculture & Natural Resources
## # ℹ 164 more rows

college_majors |>
  filter(str_detect(Major, "DATA|STATISTICS")) |>
  count(Major, sort=TRUE)

## # A tibble: 3 × 2
##   Major                                             n
##   <chr>                                         <int>
## 1 COMPUTER PROGRAMMING AND DATA PROCESSING          1
## 2 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS     1
## 3 STATISTICS AND DECISION SCIENCE                   1
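
As a small offline check of the filter above, the sketch below applies the same `str_detect()` pattern to a hand-made tibble. The tibble `toy_majors` and its rows are made up for illustration and are not part of the FiveThirtyEight data; only the pattern `"DATA|STATISTICS"` comes from the analysis above. The second filter is an assumption about what one might want if the majors were not all upper case.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical stand-in for college_majors (illustrative rows only)
toy_majors <- tibble(
  Major = c("STATISTICS AND DECISION SCIENCE",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "FORESTRY",
            "Data Science")  # mixed case on purpose
)

# Case-sensitive alternation: keeps rows containing "DATA" or "STATISTICS"
exact <- toy_majors |>
  filter(str_detect(Major, "DATA|STATISTICS"))

# Case-insensitive variant using regex(ignore_case = TRUE)
loose <- toy_majors |>
  filter(str_detect(Major, regex("DATA|STATISTICS", ignore_case = TRUE)))

nrow(exact)
nrow(loose)
```

The case-sensitive filter skips the mixed-case row, which is harmless here because the majors in the real file are upper case.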


Question 2

Write code that transforms the data below:

[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

Note: I found the wording of this question slightly ambiguous. I assumed the goal was to start from the printed output as a single string and build a new character vector from it whose output matches that of c(fruits).

output_string <- "[1] \"bell pepper\"  \"bilberry\"  \"blackberry\"  \"blood orange\"\n[5] \"blueberry\"  \"cantaloupe\"  \"chili pepper\"  \"cloudberry\"\n[9] \"elderberry\"  \"lime\"  \"lychee\"  \"mulberry\"\n[13] \"olive\"  \"salal berry\""

"OG String Vector:"

## [1] "OG String Vector:"

cat(output_string)

## [1] "bell pepper"  "bilberry"  "blackberry"  "blood orange"
## [5] "blueberry"  "cantaloupe"  "chili pepper"  "cloudberry"
## [9] "elderberry"  "lime"  "lychee"  "mulberry"
## [13] "olive"  "salal berry"

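# Turn printed vector output (one string) into a character vector of its words
new_word_vec <- function(string) {
  # Extract every double-quoted run of letters and spaces
  words <- str_extract_all(string, '"[a-zA-Z ]+"')[[1]]
  # Strip the surrounding double quotes
  words <- str_replace_all(words, '"', '')
  return(words)
}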

"Desired Output:"

## [1] "Desired Output:"

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

vec <- new_word_vec(output_string)

"Reorganized output:"

## [1] "Reorganized output:"

print(vec)

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
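
Another reading of the question is that the target is the literal text `c("bell pepper", ...)` itself. The sketch below shows one way that could be produced with `str_c()`, under that assumption. The names `fruits`, `quoted`, and `literal` are introduced here for illustration only.

```r
library(stringr)

fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange",
            "blueberry", "cantaloupe", "chili pepper", "cloudberry",
            "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

# Wrap each element in double quotes, then join with ", " inside c( ... )
quoted  <- str_c('"', fruits, '"')
literal <- str_c("c(", str_c(quoted, collapse = ", "), ")")

cat(literal, "\n")

# Round trip: parsing and evaluating the literal should give back the vector
identical(eval(parse(text = literal)), fruits)
```

The round trip at the end is a quick way to confirm that the generated text is valid R code for the same vector.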

Question 3

Describe, in words, what these expressions will match:

(.)\1\1
Matches any single character that appears three times in a row: (.) captures one character and \1\1 requires two more consecutive copies of it, e.g. "aaa".
"(.)(.)\\2\\1"
Matches a four-character palindrome: any two characters, followed by the second captured character and then the first, e.g. "abba".
(..)\1
Matches any two characters immediately followed by the same two characters, e.g. "abab".
"(.).\\1.\\1"
Matches five characters in which the first, third, and fifth are the same: a captured character, any character, the captured character again, any character, and the captured character once more, e.g. "abaca".
"(.)(.)(.).*\\3\\2\\1"
Matches three captured characters, then ".*" matches zero or more characters of any kind except a newline, and finally the three captured characters repeat in reverse order (third, second, first), e.g. "abccba" or "abcxyzcba".
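
The descriptions above can be checked against a few hand-picked strings. This is a minimal sketch: the pattern names and the `samples` vector are made up for illustration, every pattern is written as an R string (so each backslash is doubled), and the last pattern is assumed to contain `.*` as its description implies.

```r
library(stringr)

# Regexes written as R strings, so each backslash is doubled
patterns <- c(
  triple      = "(.)\\1\\1",
  abba        = "(.)(.)\\2\\1",
  pair_twice  = "(..)\\1",
  every_other = "(.).\\1.\\1",
  mirrored    = "(.)(.)(.).*\\3\\2\\1"
)

# Hypothetical test strings, one intended match per pattern plus a non-match
samples <- c("aaa", "abba", "abab", "abaca", "abcxyzcba", "abcdef")

# Rows are samples, columns are patterns; TRUE means the pattern matches
hits <- sapply(patterns, function(p) str_detect(samples, p))
rownames(hits) <- samples
hits
```

Each sample string is built to match exactly one pattern, and "abcdef" should match none of them.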

Question 4

Construct regular expressions to match words that:

Start and end with the same character.

str_view(words, "\\b(.).*\\1$")

##  [36] │ <america>
##  [49] │ <area>
## [209] │ <dad>
## [213] │ <dead>
## [223] │ <depend>
## [258] │ <educate>
## [266] │ <else>
## [268] │ <encourage>
## [270] │ <engine>
## [278] │ <europe>
## [283] │ <evidence>
## [285] │ <example>
## [287] │ <excuse>
## [288] │ <exercise>
## [291] │ <expense>
## [292] │ <experience>
## [296] │ <eye>
## [386] │ <health>
## [394] │ <high>
## [450] │ <knock>
## ... and 16 more
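
One edge case worth noting: a pattern of the form "first character, anything, same character" needs at least two characters, so a one-letter word is never matched. The sketch below, on a made-up vector `toy_words`, compares an anchored two-character form with a variant whose tail is optional. Both patterns are illustrative alternatives, not the one used above.

```r
library(stringr)

# Hypothetical sample words (not stringr::words)
toy_words <- c("a", "dad", "area", "high", "apple", "no")

# Needs at least two characters: first character, anything, same character
two_plus <- str_subset(toy_words, "^(.).*\\1$")

# Optional tail, so a one-letter word also counts as starting and ending alike
with_single <- str_subset(toy_words, "^(.)(.*\\1)?$")

two_plus
with_single
```

Whether a one-letter word should count as starting and ending with the same character is a judgment call, so both forms are shown.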

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

str_view(words, "(..).*\\1")

##  [48] │ ap<propr>iate
## [152] │ <church>
## [181] │ c<ondition>
## [217] │ <decide>
## [275] │ <environmen>t
## [487] │ l<ondon>
## [598] │ pa<ragra>ph
## [603] │ p<articular>
## [617] │ <photograph>
## [638] │ p<repare>
## [641] │ p<ressure>
## [696] │ r<emem>ber
## [698] │ <repre>sent
## [699] │ <require>
## [739] │ <sense>
## [858] │ the<refore>
## [903] │ u<nderstand>
## [946] │ w<hethe>r
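
To see which pair is actually repeated, the capture group can be pulled out with `str_match()`. This is a small sketch on a made-up vector `toy_words`; it also restricts the pair to lower-case letters with `[a-z]{2}`, which is an assumption about the input rather than something the pattern above requires.

```r
library(stringr)

# Hypothetical sample words (not stringr::words)
toy_words <- c("church", "remember", "apple", "banana")

# str_match() returns the full match in column 1 and capture group 1 in column 2
m <- str_match(toy_words, "([a-z]{2}).*\\1")

# Name each captured pair by its word; NA means no repeated pair was found
repeated_pair <- setNames(m[, 2], toy_words)
repeated_pair
```

Because the regex engine reports the leftmost match, "remember" yields "em", in line with the highlighted span for that word in the output above.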

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

str_view(words, "(.)(.*\\1){2,}")

##  [48] │ a<pprop>riate
##  [62] │ <availa>ble
##  [86] │ b<elieve>
##  [90] │ b<etwee>n
## [119] │ bu<siness>
## [221] │ d<egree>
## [229] │ diff<erence>
## [233] │ di<scuss>
## [265] │ <eleve>n
## [275] │ e<nvironmen>t
## [283] │ <evidence>
## [288] │ <exercise>
## [291] │ <expense>
## [292] │ <experience>
## [423] │ <indivi>dual
## [598] │ p<aragra>ph
## [684] │ r<eceive>
## [696] │ r<emembe>r
## [698] │ r<eprese>nt
## [845] │ t<elephone>
## ... and 2 more
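
The quantified pattern above can also be written out longhand. The sketch below compares the two forms on a made-up vector `toy_words`; the spelled-out pattern is an illustrative alternative, and `[a-z]` in place of `.` is an assumption that the input is lower-case letters only.

```r
library(stringr)

# Hypothetical sample words (not stringr::words)
toy_words <- c("eleven", "banana", "letter", "apple")

# Quantified form: one letter, then two or more later occurrences of it
quantified <- str_detect(toy_words, "([a-z])(.*\\1){2,}")

# Spelled-out form: the same letter three times with anything in between
spelled_out <- str_detect(toy_words, "([a-z]).*\\1.*\\1")

quantified
identical(quantified, spelled_out)
```

Both forms flag the same words here; the quantified version is shorter, while the spelled-out one may be easier to read.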