Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to the problems below. You may work in a small group, but please submit separately with names of all group participants in your submission.
#load the data
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(RMySQL)
## Warning: package 'RMySQL' was built under R version 4.2.3
## Loading required package: DBI
college_majors <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv')
#view the data
head(college_majors)
## FOD1P Major Major_Category
## 1 1100 GENERAL AGRICULTURE Agriculture & Natural Resources
## 2 1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3 1102 AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4 1103 ANIMAL SCIENCES Agriculture & Natural Resources
## 5 1104 FOOD SCIENCE Agriculture & Natural Resources
## 6 1105 PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources
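#use str_detect on the whole data frame: each column is coerced to a single string, so the result has one value per column (FOD1P, Major, Major_Category)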
str_detect(college_majors, '(DATA|STATISTICS)')
## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): argument is not an atomic vector; coercing
## [1] FALSE TRUE FALSE
#find the majors that match the pattern using grep
grep('DATA|STATISTICS', college_majors$Major, value = TRUE, ignore.case = TRUE)
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
We see that only 3 majors in the list contain either “DATA” or “STATISTICS”.
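To apply str_detect() to the majors themselves rather than to the whole data frame, the pattern can be run against the Major column. The following is a minimal sketch, assuming stringr is installed; the short `majors` vector is a hypothetical stand-in for college_majors$Major so that the example runs without a download.

```r
library(stringr)

# Hypothetical stand-in for college_majors$Major (a few names from the output above)
majors <- c("GENERAL AGRICULTURE",
            "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "FOOD SCIENCE")

# One TRUE/FALSE per major, because the input is an atomic character vector
has_match <- str_detect(majors, "DATA|STATISTICS")

# Keep only the matching majors
matching_majors <- majors[has_match]
matching_majors
```

Because the input is a plain character vector, no coercion warning should appear, and the logical result can be used directly for subsetting.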
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
Into a format like this: c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
As I understand it, I am supposed to take the string above and print it in the form it would have when written as a vector definition.
fruits <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'
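#extract each quoted fruit name; this also drops the [x] index markers
list_fruits <- str_extract_all(string = fruits, pattern = '\\".*?\\"')
#join the quoted names with commas and wrap them in c()
items <- str_c(list_fruits[[1]], collapse = ', ')
str_glue('c({items})', items = items)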
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
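One way to check a result like this is to parse the generated text back into R and compare it with the extracted names. The sketch below is only an illustration, assuming stringr is installed; `printed` is a shortened, hypothetical version of the fruit listing, and the names `quoted`, `fruit_names`, `vector_code`, and `rebuilt` are introduced here for the example.

```r
library(stringr)

# Shortened, hypothetical version of the printed vector from the exercise
printed <- '[1] "bell pepper" "bilberry"
[3] "olive" "salal berry"'

# Pull out the quoted names, then strip the surrounding quotes
quoted <- str_extract_all(printed, '"[^"]*"')[[1]]
fruit_names <- str_remove_all(quoted, '"')

# Build the c(...) text and confirm it parses back to the same vector
vector_code <- str_c("c(", str_c(quoted, collapse = ", "), ")")
rebuilt <- eval(parse(text = vector_code))
identical(rebuilt, fruit_names)
```

The pattern `"[^"]*"` is a swapped-in alternative to the lazy `".*?"` used above; both should select the same quoted items for input of this shape.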
The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:
### 3) Describe, in words, what these expressions will match:
(.)\1\1 - Matches the same character three times in a row, e.g. "aaa". (As an R string this regex is written "(.)\\1\\1".)
"(.)(.)\\2\\1" - Matches a pair of characters followed by the same pair in reverse order, e.g. "appa".
(..)\1 - Matches any two characters immediately repeated, e.g. "abab". (As an R string this regex is written "(..)\\1".)
"(.).\\1.\\1" - Matches a character, then any character, then the original character again, then any character, then the original character once more, e.g. "qrqsq".
"(.)(.)(.).*\\3\\2\\1" - Matches three characters, followed by zero or more characters of any kind, followed by the original three characters in reverse order, e.g. "abcdrlmnopcba".
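The descriptions above can be checked by running each pattern against its example string. This is a minimal sketch, assuming stringr is installed; the vector names (`triple`, `mirror`, and so on) are labels made up for this illustration, and every pattern is written as an R string with doubled backslashes.

```r
library(stringr)

# Each pattern from the exercise, written as an R string (backslashes doubled)
patterns <- c(triple   = "(.)\\1\\1",
              mirror   = "(.)(.)\\2\\1",
              pair     = "(..)\\1",
              spaced   = "(.).\\1.\\1",
              reversed = "(.)(.)(.).*\\3\\2\\1")

# One example string per pattern, taken from the descriptions above
examples <- c("aaa", "appa", "abab", "qrqsq", "abcdrlmnopcba")

# str_detect is vectorised over both arguments, so pattern i is tested on example i
matches <- str_detect(examples, patterns)
matches

# A string without any repetition should match none of the patterns
no_repeats <- str_detect("abcdef", patterns)
no_repeats
```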
random_words <- c("apple", "keys", "america", "high", "tonight", "onomonopia", "window", "door", "cocoa", "bucket", "eye", "leg", "arm")
#Start and end with the same character.
str_subset(random_words, "^(.)((.*\\1$)|\\1?$)")
## [1] "america" "high" "tonight" "window" "eye"
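The second branch of the alternation, `\\1?$`, is what lets a one-letter word count as starting and ending with the same character. The sketch below compares the pattern with a simpler version that omits that branch; it assumes stringr is installed, and the `words` vector is a hypothetical test set chosen for this illustration.

```r
library(stringr)

# Hypothetical test words, including a single-letter word
words <- c("a", "eye", "high", "door")

# Full pattern from above: the \\1?$ branch lets a one-letter word match
with_single <- str_subset(words, "^(.)((.*\\1$)|\\1?$)")
with_single

# Simpler pattern: needs at least two characters, so "a" is dropped
without_single <- str_subset(words, "^(.).*\\1$")
without_single
```

For the random_words list, which has no one-letter entries, both patterns should select the same words.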
#Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
str_subset(random_words, "([A-Za-z][A-Za-z]).*\\1")
## [1] "onomonopia" "cocoa"
#Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
str_subset(random_words, "([a-z]).*\\1.*\\1")
## [1] "onomonopia"
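The "three places" regex can be cross-checked by counting letters directly. This is a rough sketch under the assumption that stringr is installed and that all words are lowercase; the `words` vector and the helper name `max_letter_count` are made up for this illustration.

```r
library(stringr)

# Hypothetical test words
words <- c("eleven", "cocoa", "onomonopia", "door")

# Count how often the most frequent letter occurs in each word
max_letter_count <- sapply(str_split(words, ""), function(chars) max(table(chars)))

# Words selected by counting versus words selected by the regex
by_count <- words[max_letter_count >= 3]
by_regex <- str_subset(words, "([a-z]).*\\1.*\\1")
identical(by_count, by_regex)
```

Counting makes the threshold explicit, while the regex keeps the whole check in a single str_subset() call.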