Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors whose names contain either “DATA” or “STATISTICS”.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RCurl)
##
## Attaching package: 'RCurl'
##
## The following object is masked from 'package:tidyr':
##
## complete
x <- getURL("https://raw.githubusercontent.com/isaias-soto/CUNY_DAT607/08e7bcca8ddd127ac68612a511c7d05f8d16261e/majors-list.csv")
major_list <- read.csv(text = x)
major_list[146,]
## FOD1P Major Major_Category
## 146 bbbb N/A (less than bachelor's degree) <NA>
major_list <- major_list[-146,] # remove missing value
str_view(major_list$Major,"DATA|STATISTICS")
## [44] │ MANAGEMENT INFORMATION SYSTEMS AND <STATISTICS>
## [52] │ COMPUTER PROGRAMMING AND <DATA> PROCESSING
## [59] │ <STATISTICS> AND DECISION SCIENCE
major_list[c(44,52,59),]
## FOD1P Major Major_Category
## 44 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business
## 52 2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 59 3702 STATISTICS AND DECISION SCIENCE Computers & Mathematics
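The row numbers 44, 52 and 59 were read off the `str_view()` output by hand, so they depend on the row order of the CSV. A minimal sketch of selecting the matching rows by pattern instead is given here. It assumes a small stand-in data frame, `majors_demo` (a hypothetical name; its last row is invented for illustration), and uses base R's `grepl()` so that it runs without any packages.

```r
# Hypothetical four-row stand-in for major_list. The first three rows are
# copied from the output above; the last row is invented as a non-match.
majors_demo <- data.frame(
  FOD1P = c("6212", "2101", "3702", "0000"),
  Major = c("MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "EXAMPLE MAJOR WITHOUT A MATCH"),
  Major_Category = c("Business", "Computers & Mathematics",
                     "Computers & Mathematics", "Example"),
  stringsAsFactors = FALSE
)

# Keep the rows whose Major contains either keyword; no row numbers needed.
matches <- majors_demo[grepl("DATA|STATISTICS", majors_demo$Major), ]
print(matches)
```

With the tidyverse already loaded, `filter(major_list, str_detect(Major, "DATA|STATISTICS"))` should express the same idea on the full data frame.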
Write code that transforms the data below:
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
The following code sets up the problem as a character vector with 14 elements:
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange",
"blueberry", "cantaloupe", "chili pepper", "cloudberry",
"elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
fruits
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
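Next, the target appears to be a single string, so the vector is collapsed into one element. The call below joins the elements with ", "; note that its result omits the quotation marks and the surrounding c( ) shown in the target format: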
str_view(str_flatten(fruits, ", "))
## [1] │ bell pepper, bilberry, blackberry, blood orange, blueberry, cantaloupe, chili pepper, cloudberry, elderberry, lime, lychee, mulberry, olive, salal berry
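The flattened string above has no quotation marks and no `c( )` wrapper. A minimal sketch of one way to reproduce the target format exactly follows, assuming the goal is a single string that reads as valid R code. It uses base R's `paste0()` and `paste()` rather than stringr, and the names `quoted` and `as_code` are illustrative.

```r
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange",
            "blueberry", "cantaloupe", "chili pepper", "cloudberry",
            "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

# Wrap every element in double quotes.
quoted <- paste0('"', fruits, '"')

# Join with ", " and wrap the result in c( ... ) to get one string.
as_code <- paste0("c(", paste(quoted, collapse = ", "), ")")

# writeLines() prints the string without escaping the inner quotes.
writeLines(as_code)

# Round trip: evaluating the string as R code recreates the vector.
identical(eval(parse(text = as_code)), fruits)
```

The round-trip check at the end is a convenient way to confirm that the generated text really is the code for the original vector.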
The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:
Describe, in words, what these expressions will match:
(.)\1\1
"(.)(.)\\2\\1"
(..)\1
"(.).\\1.\\1"
"(.)(.)(.).*\\3\\2\\1"
Exercise | Description |
---|---|
(.)\1\1 | Matches any character that appears three times in a row, for example, “bbb”. |
"(.)(.)\\2\\1" | Matches two characters followed by the same two characters in reverse order, for example, “abba”. |
(..)\1 | Matches any two characters immediately followed by the same two characters, for example, “cdcd”. |
"(.).\\1.\\1" | Matches a character, then any character, then the first character again, then any character, then the first character once more. The characters in between need not differ from the first, so both “abaca” and “aaaaa” match. |
"(.)(.)(.).*\\3\\2\\1" | Matches three characters (not necessarily distinct), followed by zero or more characters of any kind, followed by the same three characters in reverse order. The pattern is not anchored, so the match may occur anywhere in the string. For example, “abcdcba” or “abccba”. |
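The descriptions in the table can be checked directly. The sketch below is only an illustration: it pairs each pattern with the example word from the table, using base R's `grepl()` with `perl = TRUE` (stringr's `str_detect()` should behave the same way for these patterns). Every pattern is written as an R string, so the backslashes are doubled, and the names `patterns`, `examples`, `any_char` and `unanchored` are illustrative.

```r
# Patterns from the table as R strings, paired with the example words.
patterns <- c("(.)\\1\\1", "(.)(.)\\2\\1", "(..)\\1",
              "(.).\\1.\\1", "(.)(.)(.).*\\3\\2\\1")
examples <- c("bbb", "abba", "cdcd", "abaca", "abcdcba")

# Test each pattern against its own example word.
hits <- mapply(function(p, s) grepl(p, s, perl = TRUE), patterns, examples)
print(unname(hits))

# "." matches any character, including the captured one.
any_char <- grepl("(.).\\1.\\1", "aaaaa", perl = TRUE)

# Without ^ and $, a match may sit in the middle of a longer string.
unanchored <- grepl("(.)(.)(.).*\\3\\2\\1", "xabccbay", perl = TRUE)

# A string with no tripled character does not match the first pattern.
no_triple <- grepl("(.)\\1\\1", "abab", perl = TRUE)
```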
Construct regular expressions to match words that:
Start and end with the same character.
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice).
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s).
Exercise | regex |
---|---|
Start and end with the same character. | ^(.)(.*\1)?$ |
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.) | ([A-Za-z][A-Za-z]).*\1 |
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.) | ([A-Za-z]).*\1.*\1 |
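A short sketch for trying the three patterns on a handful of words follows. The word list is made up for illustration, and base R's `grepl()` with `perl = TRUE` stands in for stringr. Two assumptions are worth flagging: the first pattern makes the tail optional so that a one-character word such as "a" also counts, and the third pattern wraps the letter in a capture group, because `\1` has nothing to refer to without one.

```r
# Illustrative word list (not taken from any dataset).
words <- c("church", "eleven", "level", "a", "banana", "stats", "cat")

# Patterns as R strings, so each backslash is doubled.
same_ends     <- "^(.)(.*\\1)?$"            # tail optional: "a" qualifies
repeated_pair <- "([A-Za-z][A-Za-z]).*\\1"  # same two letters occur again
three_times   <- "([A-Za-z]).*\\1.*\\1"     # captured letter occurs 3 times

ends_hits  <- words[grepl(same_ends, words, perl = TRUE)]
pair_hits  <- words[grepl(repeated_pair, words, perl = TRUE)]
three_hits <- words[grepl(three_times, words, perl = TRUE)]

print(ends_hits)   # "level"  "a"      "stats"
print(pair_hits)   # "church" "banana"
print(three_hits)  # "eleven" "banana"
```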