Assignment – Character Manipulation and Data Processing

Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to the problems below. You may work in a small group, but please submit separately with names of all group participants in your submission.

1) Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

#load the data
library(tidyverse)
library(openintro)
library(RMySQL)
college_majors <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv')

#view the data
head(college_majors)
##   FOD1P                                 Major                  Major_Category
## 1  1100                   GENERAL AGRICULTURE Agriculture & Natural Resources
## 2  1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3  1102                AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4  1103                       ANIMAL SCIENCES Agriculture & Natural Resources
## 5  1104                          FOOD SCIENCE Agriculture & Natural Resources
## 6  1105            PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources
#use str_detect on the Major column (not the whole data frame) and count the matches
sum(str_detect(college_majors$Major, 'DATA|STATISTICS'))
## [1] 3
#find the pattern using grep
grep('DATA|STATISTICS',college_majors$Major, value = TRUE,ignore.case = TRUE)
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"

Only 3 majors in the list contain either “DATA” or “STATISTICS”.
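As a side note on why `ignore.case = TRUE` can matter, here is a minimal base R sketch. The `sample_majors` vector is hypothetical: one entry is deliberately written in mixed case for illustration and is not how the dataset stores it.

```r
# Hypothetical sample; the mixed-case entry is an assumption for illustration.
sample_majors <- c("GENERAL AGRICULTURE",
                   "STATISTICS AND DECISION SCIENCE",
                   "Computer Programming and Data Processing")

pattern <- "DATA|STATISTICS"

# A case-sensitive match misses the mixed-case "Data".
strict <- grepl(pattern, sample_majors)

# A case-insensitive match catches it.
relaxed <- grepl(pattern, sample_majors, ignore.case = TRUE)

print(sample_majors[strict])
print(sample_majors[relaxed])
```

With upper-case data like the majors list, both calls should agree, so the flag only acts as a safeguard.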

2) Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”

Into a format like this: c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

As I understand it, the task is to take the printed output above as a single string and produce the text of the equivalent c() call that would create it as a character vector.

fruits <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
[9] "elderberry"   "lime"         "lychee"       "mulberry"    
[13] "olive"        "salal berry"'


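#extract each quoted fruit name; this leaves the [x] index markers behind
list_fruits <- str_extract_all(string = fruits, pattern = '".*?"')

#join the quoted names with commas and wrap them in c( )
items <- str_c(list_fruits[[1]], collapse = ', ')
str_glue('c({items})', items = items)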
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
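To confirm that the generated text is valid R, one option is to parse and evaluate it and inspect the result. The following is a base R sketch of that check, not part of the original solution; it uses `regmatches()`/`gregexpr()` in place of stringr, and the names `quoted`, `vector_code`, and `fruit_vector` are my own.

```r
# Raw console output stored as one string (copied from the problem statement).
fruits <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Pull out every double-quoted item, keeping the quotes.
quoted <- regmatches(fruits, gregexpr('"[^"]*"', fruits))[[1]]

# Build the source text of a c(...) call.
vector_code <- paste0("c(", paste(quoted, collapse = ", "), ")")

# Parse and evaluate that text to recover an actual character vector.
fruit_vector <- eval(parse(text = vector_code))

print(length(fruit_vector))
print(fruit_vector)
```

`eval(parse())` is used here only as a check on a string we built ourselves; it is generally not something to run on untrusted input.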

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

3) Describe, in words, what these expressions will match:

The unquoted expressions are read here as regular expressions; written as R strings, their backslashes would need to be doubled, as in the quoted ones.

(.)\1\1 - Matches the same character three times in a row, e.g. aaa
"(.)(.)\\2\\1" - Matches two characters followed by the same two characters in reverse order, e.g. appa
(..)\1 - Matches any two characters repeated immediately, e.g. abab
"(.).\\1.\\1" - Matches a character, then any character, then the original character again, then any character, then the original character once more, e.g. qrqsq
"(.)(.)(.).*\\3\\2\\1" - Matches three characters, followed by zero or more characters of any kind, followed by the original three characters in reverse order, e.g. abcdrlmnopcba

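One way to sanity-check these descriptions is to run each pattern against a string that should match and one that should not. This is a sketch in base R (`grepl()` with `perl = TRUE`) rather than stringr; the `checks` table and its non-matching examples are my own illustrations.

```r
# Each pattern paired with one string expected to match and one expected not to.
checks <- data.frame(
  pattern      = c("(.)\\1\\1", "(.)(.)\\2\\1", "(..)\\1",
                   "(.).\\1.\\1", "(.)(.)(.).*\\3\\2\\1"),
  should_match = c("aaa", "appa", "abab", "qrqsq", "abcdrlmnopcba"),
  should_not   = c("aab", "abab", "abba", "qrqsr", "abcabc"),
  stringsAsFactors = FALSE
)

# Apply each pattern to its own pair of test strings.
match_one <- function(p, s) grepl(p, s, perl = TRUE)
checks$hit  <- mapply(match_one, checks$pattern, checks$should_match,
                      USE.NAMES = FALSE)
checks$miss <- mapply(match_one, checks$pattern, checks$should_not,
                      USE.NAMES = FALSE)

print(checks)
```

If the descriptions are right, every value in `hit` is TRUE and every value in `miss` is FALSE.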
4) Construct regular expressions to match words that:

random_words <- c("apple", "keys", "america","high","tonight","onomonopia","window","door","cocoa","bucket","eye","leg","arm")

#Start and end with the same character.
str_subset(random_words, "^(.)((.*\\1$)|\\1?$)")
## [1] "america" "high"    "tonight" "window"  "eye"
#Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
str_subset(random_words,"([A-Za-z][A-Za-z]).*\\1")
## [1] "onomonopia" "cocoa"
#Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
str_subset(random_words, "([a-z]).*\\1.*\\1")
## [1] "onomonopia"
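The three patterns can also be checked against the example words from the exercise text ("church" and "eleven"), plus a single-letter word for the edge case that the first pattern handles with `\\1?$`. This is a base R sketch using `grepl()` with `perl = TRUE`; the `test_words` vector and the variable names are my own.

```r
# The three patterns from the answers above, stored under descriptive names.
same_ends     <- "^(.)((.*\\1$)|\\1?$)"
repeated_pair <- "([A-Za-z][A-Za-z]).*\\1"
three_times   <- "([a-z]).*\\1.*\\1"

# Illustrative words: the two examples from the exercise plus a few edge cases.
test_words <- c("church", "eleven", "eye", "a", "apple")

results <- data.frame(
  word          = test_words,
  same_ends     = grepl(same_ends, test_words, perl = TRUE),
  repeated_pair = grepl(repeated_pair, test_words, perl = TRUE),
  three_times   = grepl(three_times, test_words, perl = TRUE),
  stringsAsFactors = FALSE
)

print(results)
```

The single-letter word "a" counts as starting and ending with the same character, which is why the first pattern includes the optional-backreference alternative.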