Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.
Write code that transforms the data below:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:
Construct regular expressions to match words that:
Start and end with the same character.
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
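One way these three patterns could be written is sketched below. This is a minimal base R illustration, not a definitive answer: the `words` vector is a hypothetical sample chosen for the example, and `perl = TRUE` is an assumption made so that back-references follow PCRE rules.

```r
# Hypothetical word list chosen to exercise each pattern
words <- c("area", "church", "eleven", "lime", "a")

# 1. Start and end with the same character (single-character words count too)
same_ends <- grepl("^(.)(.*\\1)?$", words, perl = TRUE)

# 2. Contain a repeated pair of letters, e.g. "ch" ... "ch" in "church"
repeated_pair <- grepl("([A-Za-z][A-Za-z]).*\\1", words, perl = TRUE)

# 3. Contain one letter in at least three places, e.g. the "e"s in "eleven"
three_times <- grepl("([A-Za-z]).*\\1.*\\1", words, perl = TRUE)

print(words[same_ends])
print(words[repeated_pair])
print(words[three_times])
```

Each pattern captures the character or pair of interest in group 1 and uses the back-reference `\\1` to require that the same text appears again later in the word.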
The data for the article “The Economic Guide To Picking A College Major” is stored in the fivethirtyeight GitHub repository. I import it from two files, majors-list.csv and grad-students.csv.
majors_list <- read.csv(
  "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv",
  header = TRUE, sep = ",")
majors_list <- majors_list[order(majors_list$Major), ]
grad_students <- read.csv(
  "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/grad-students.csv",
  header = TRUE, sep = ",")
grad_students <- grad_students[order(grad_students$Major), ]
## [1] "Columns for major_lists are:"
## [1] "FOD1P" "Major" "Major_Category"
## [1] "Columns for grad_students are:"
## [1] "Major_code" "Major"
## [3] "Major_category" "Grad_total"
## [5] "Grad_sample_size" "Grad_employed"
## [7] "Grad_full_time_year_round" "Grad_unemployed"
## [9] "Grad_unemployment_rate" "Grad_median"
## [11] "Grad_P25" "Grad_P75"
## [13] "Nongrad_total" "Nongrad_employed"
## [15] "Nongrad_full_time_year_round" "Nongrad_unemployed"
## [17] "Nongrad_unemployment_rate" "Nongrad_median"
## [19] "Nongrad_P25" "Nongrad_P75"
## [21] "Grad_share" "Grad_premium"
According to both datasets, the majors that contain either “DATA” or “STATISTICS” are as follows:
##   FOD1P                                         Major          Major_Category
## 1  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 2  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics
##   FOD1P                                         Major          Major_Category
## 1  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 2  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics
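The filtering step that produced this output is not shown. A minimal sketch of how it might look uses `grepl()` with the alternation pattern `DATA|STATISTICS`. The small `majors_sample` data frame is a hypothetical stand-in so the snippet runs without a network connection; in practice the `majors_list` data frame loaded earlier would take its place.

```r
# Hypothetical stand-in for majors_list: three majors from the output above
# plus two that should not match
majors_sample <- data.frame(
  Major = c("COMPUTER PROGRAMMING AND DATA PROCESSING",
            "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "STATISTICS AND DECISION SCIENCE",
            "GENERAL AGRICULTURE",
            "PHYSICS"),
  stringsAsFactors = FALSE
)

# grepl() returns TRUE for every major whose name contains either word
is_match <- grepl("DATA|STATISTICS", majors_sample$Major)

# Keep only the matching rows; drop = FALSE preserves the data frame shape
matches <- majors_sample[is_match, , drop = FALSE]
print(matches)
```

The same logical index could be applied to `grad_students$Major` to repeat the check on the second dataset.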
Write code that transforms the data below:
NOTE: the data below is stored in a matrix; the bracketed numbers are the row names.
## [,1] [,2] [,3] [,4]
## [1] "bell pepper" "bilberry" "blackberry" "blood orange"
## [5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
## [9] "elderberry" "lime" "lychee" "mulberry"
## [13] "olive" "salal berry" "" ""
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
## [1] "\""
# Make vector: transpose so the matrix is read row by row
x_vector <- as.vector(t(x_matrix[c("[1]", "[5]", "[9]", "[13]"), ]))
# Remove the two empty padding strings
x_vector <- x_vector[x_vector != ""]
# Add quotes around each element (quotes holds the double-quote character printed above)
x_vector <- paste(quotes, x_vector, quotes, sep = "")
# Collapse to a single string; collapse = ", " puts a comma between elements,
# so no comma follows the last one
x_vector <- paste(x_vector, collapse = ", ")
# Add the 'c()' wrapper
x_vector <- paste("c(", x_vector, ")", sep = "")
cat(x_vector)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
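The chunk that builds `x_matrix` is not shown above, so the following is a self-contained sketch of the whole transformation under an assumption: that `x_matrix` is a 4 x 4 character matrix filled row by row, padded with empty strings, with the bracketed indices as row names. The names `fruits` and `result` are introduced here for illustration only.

```r
# The 14 items from the exercise
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange",
            "blueberry", "cantaloupe", "chili pepper", "cloudberry",
            "elderberry", "lime", "lychee", "mulberry",
            "olive", "salal berry")

# Assumed layout: 4 x 4, filled by row, padded with two empty strings
x_matrix <- matrix(c(fruits, "", ""), nrow = 4, byrow = TRUE,
                   dimnames = list(c("[1]", "[5]", "[9]", "[13]"), NULL))

# Transpose before flattening so the elements come out row by row
x_vector <- as.vector(t(x_matrix))
x_vector <- x_vector[x_vector != ""]

# Quote each element, join with ", ", and wrap in c( )
result <- paste0("c(", paste0('"', x_vector, '"', collapse = ", "), ")")
cat(result, "\n")

# Sanity check: the generated text parses back to the original vector
stopifnot(identical(eval(parse(text = result)), fruits))
```

Parsing the generated string back with `eval(parse(text = ...))` is a quick way to confirm that the output is valid R and round-trips to the same 14 values.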
(.)\1\1                |`( )` capturing group
                       |`(.)` captures any single character except a line terminator (group 1)
                       |`\1` back-reference: matches the same text that group 1 captured
                       |`\1` second back-reference: matches that same character once more
                       |NOTE: R functions take the pattern as a string, so each `\` must itself be
                       |      escaped: "(.)\\1\\1". Matches one character three times in a row, e.g. `aaa`.
                       |
"(.)(.)\\2\\1"         |`(.)` captures any character (group 1)
                       |`(.)` captures any character (group 2)
                       |`\\2` escaped back-reference to group 2
                       |`\\1` escaped back-reference to group 1
                       |NOTE: Matches two characters followed by the same two in reverse order,
                       |      such as `abba` (or `bbbb`, when both groups capture the same character).
                       |
(..)\1                 |`(..)` captures any two characters (group 1)
                       |`\1` back-reference to group 1
                       |NOTE: Written "(..)\\1" as an R string. Matches a pair of characters that is
                       |      immediately repeated, e.g. `abab`.
                       |
"(.).\\1.\\1"          |`(.)` captures any character (group 1)
                       |`.` outside `( )` matches any single character without capturing it
                       |NOTE: Matches a character that occurs three times with exactly one arbitrary
                       |      character between occurrences, e.g. `abaca`.
                       |
"(.)(.)(.).*\\3\\2\\1" |`(.)(.)(.)` three capturing groups, one character each
                       |`.*` any character, repeated zero or more times
                       |`\\3\\2\\1` back-references to groups 3, 2 and 1, in that order
                       |NOTE: Matches three characters, then any run of characters, then the same
                       |      three characters in reverse order, e.g. `abccba` or `abcxcba`.
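The five expressions in the table can be checked empirically. The sketch below runs each pattern, written as an R string, against a few hypothetical test strings with `grepl()`. The pattern labels and the `samples` vector are illustrative names, not part of the exercise, and `perl = TRUE` is an assumption made so that back-references follow PCRE rules.

```r
# Patterns from the table, written as R strings (backslashes doubled)
patterns <- c(
  triple        = "(.)\\1\\1",
  mirrored_pair = "(.)(.)\\2\\1",
  repeated_pair = "(..)\\1",
  alternating   = "(.).\\1.\\1",
  mirrored_trio = "(.)(.)(.).*\\3\\2\\1"
)

# Hypothetical test strings, one aimed at each pattern plus a non-match
samples <- c("aaa", "abba", "abab", "abaca", "abcxcba", "plain")

# Rows are test strings, columns are patterns; TRUE marks a match
hits <- sapply(patterns, function(p) grepl(p, samples, perl = TRUE))
rownames(hits) <- samples
print(hits)
```

With these inputs each test string should match only the pattern it was designed for, and `plain` should match none of them.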