#1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
First, let's pull the data from the GitHub repository linked in the article.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.8 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.1
## v readr 2.1.2 v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
path = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv'
# read.csv() treats only double quotes as quoting characters, so an apostrophe
# inside a major name cannot open an unterminated quoted string
majors = read.csv(file=path, header=TRUE)
df = data.frame(majors)
head(df)
## FOD1P Major Major_Category
## 1 1100 GENERAL AGRICULTURE Agriculture & Natural Resources
## 2 1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3 1102 AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4 1103 ANIMAL SCIENCES Agriculture & Natural Resources
## 5 1104 FOOD SCIENCE Agriculture & Natural Resources
## 6 1105 PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources
Next, let's identify the majors whose names contain “DATA” or “STATISTICS”. The filter returns three majors: MANAGEMENT INFORMATION SYSTEMS AND STATISTICS, COMPUTER PROGRAMMING AND DATA PROCESSING, and STATISTICS AND DECISION SCIENCE.
majors %>% filter(str_detect(Major, "DATA|STATISTICS"))
## FOD1P Major Major_Category
## 1 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business
## 2 2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3 3702 STATISTICS AND DECISION SCIENCE Computers & Mathematics
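The same filter can be checked without downloading the file. The following is a minimal sketch in base R, assuming a small hand-typed sample of major names (the vector `sample_majors` is hypothetical, copied from the output above); `grepl()` plays the role that `str_detect()` plays in the tidyverse version.

```r
# Hypothetical sample of major names copied from the output above
sample_majors <- c(
  "GENERAL AGRICULTURE",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "FOOD SCIENCE"
)

# The alternation "DATA|STATISTICS" matches a name containing either word
matches <- sample_majors[grepl("DATA|STATISTICS", sample_majors)]
print(matches)
```

Because the pattern is case-sensitive, it relies on the major names being stored in upper case, as they are in this dataset.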
#2 Write code that transforms the data below:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”
Into a format like this:
c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
data = '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'
data
## [1] "[1] \"bell pepper\" \"bilberry\" \"blackberry\" \"blood orange\"\n\n [5] \"blueberry\" \"cantaloupe\" \"chili pepper\" \"cloudberry\" \n\n [9] \"elderberry\" \"lime\" \"lychee\" \"mulberry\" \n\n [13] \"olive\" \"salal berry\""
w = c("bell pepper","bilberry","blackberry","blood orange")
x = c("blueberry","cantaloupe","chili pepper","cloudberry")
y = c("elderberry","lime","lychee","mulberry")
z = c("olive","salal berry")
join = c(w,x,y,z)
join
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
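The steps above retype the fruit names by hand. As an alternative, here is a minimal sketch that parses the printed text directly, assuming every item is wrapped in straight double quotes. The names `raw`, `items`, and `result` are introduced for illustration only.

```r
# The printed vector, stored as one multi-line string
raw <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Pull out every double-quoted item, quotes included;
# the bracketed indices are not quoted, so they are skipped
items <- regmatches(raw, gregexpr('"[^"]*"', raw))[[1]]

# Join the quoted items into a single c(...) expression
result <- paste0("c(", paste(items, collapse = ", "), ")")
cat(result, "\n")
```

Matching on the quotes rather than splitting on whitespace keeps two-word items such as "bell pepper" together.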
#3 Describe, in words, what these expressions will match:
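“(.)\1\1” = Any single character followed by two more copies of itself, i.e. the same character three times in a row (e.g. “aaa”).
“(.)(.)\2\1” = Two characters followed by the same two characters in reverse order (e.g. “abba”).
“(..)\1” = Any two characters immediately repeated (e.g. “abab”).
“(.).\1.\1” = A five-character sequence in which the same character sits in positions 1, 3, and 5; positions 2 and 4 can be anything (e.g. “abaca”).
“(.)(.)(.).*\3\2\1” = Three characters, then zero or more of any characters, then the same three characters in reverse order (e.g. “abcxyzcba”).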
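These descriptions can be checked empirically. The following is a rough sketch in base R, assuming the expressions are meant as regular expressions, so the backslashes are doubled inside R strings. The labels and test strings are hypothetical, chosen only to give one matching case per pattern.

```r
# Each pattern from the exercise, written as an R string (backslashes doubled)
patterns <- c(
  triple      = "(.)\\1\\1",
  mirror_pair = "(.)(.)\\2\\1",
  double_pair = "(..)\\1",
  alternating = "(.).\\1.\\1",
  mirror_ends = "(.)(.)(.).*\\3\\2\\1"
)

# Hypothetical test strings, one per pattern
examples <- c(
  triple      = "aaa",
  mirror_pair = "abba",
  double_pair = "abab",
  alternating = "abaca",
  mirror_ends = "abcxyzcba"
)

# mapply pairs each pattern with its example string
hits <- mapply(function(p, s) grepl(p, s, perl = TRUE), patterns, examples)
print(hits)
```

A single matching example does not prove a description correct, so it is worth trying near misses as well, such as "aab" against the first pattern.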
#4 Construct regular expressions to match words that:
Start and end with the same character.
“^(.)(.*)\1$”
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
“([A-Za-z][A-Za-z]).*\1”
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
“([A-Za-z]).*\1.*\1”
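Patterns of this kind are easy to test against a handful of words. The following is a minimal sketch in base R, with two assumptions about intent: the first pattern makes the tail optional so that a one-letter word also counts as starting and ending with the same character, and the third uses `.*` between the repeats so the letter may recur anywhere in the word. The sample words are hypothetical.

```r
# Hypothetical sample words
words <- c("area", "church", "eleven", "banana", "dog", "a")

# Starts and ends with the same character
# (the optional group admits one-letter words; this is an assumption about intent)
same_ends <- grepl("^(.)(.*\\1)?$", words, perl = TRUE)

# Contains a repeated pair of letters
repeated_pair <- grepl("([A-Za-z][A-Za-z]).*\\1", words, perl = TRUE)

# Contains one letter in at least three places
three_times <- grepl("([A-Za-z]).*\\1.*\\1", words, perl = TRUE)

print(data.frame(words, same_ends, repeated_pair, three_times))
```

Laying the results out in a data frame makes it easy to see which words satisfy more than one condition, such as "banana".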