Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to the problems below. You may work in a small group, but please submit separately with names of all group participants in your submission.
1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Data Acquisition
Input file: "all-ages.csv" from https://github.com/fivethirtyeight/data/tree/master/college-majors
url <- 'https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv'
database <- read_csv(url)
## Rows: 173 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Major, Major_category
## dbl (9): Major_code, Total, Employed, Employed_full_time_year_round, Unemplo...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
database
## # A tibble: 173 × 11
## Major_code Major Major_category Total Employed Employed_full_time_y…¹
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1100 GENERAL AGR… Agriculture &… 128148 90245 74078
## 2 1101 AGRICULTURE… Agriculture &… 95326 76865 64240
## 3 1102 AGRICULTURA… Agriculture &… 33955 26321 22810
## 4 1103 ANIMAL SCIE… Agriculture &… 103549 81177 64937
## 5 1104 FOOD SCIENCE Agriculture &… 24280 17281 12722
## 6 1105 PLANT SCIEN… Agriculture &… 79409 63043 51077
## 7 1106 SOIL SCIENCE Agriculture &… 6586 4926 4042
## 8 1199 MISCELLANEO… Agriculture &… 8549 6392 5074
## 9 1301 ENVIRONMENT… Biology & Lif… 106106 87602 65238
## 10 1302 FORESTRY Agriculture &… 69447 48228 39613
## # ℹ 163 more rows
## # ℹ abbreviated name: ¹Employed_full_time_year_round
## # ℹ 5 more variables: Unemployed <dbl>, Unemployment_rate <dbl>, Median <dbl>,
## # P25th <dbl>, P75th <dbl>
Then we can use grep() to select the rows whose Major contains either keyword.
majors_containing_data_or_statistics <- database[grep("DATA|STATISTICS", database$Major),]
majors_containing_data_or_statistics
## # A tibble: 3 × 11
## Major_code Major Major_category Total Employed Employed_full_time_y…¹
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2101 COMPUTER PRO… Computers & M… 29317 22828 18747
## 2 3702 STATISTICS A… Computers & M… 24806 18808 14468
## 3 6212 MANAGEMENT I… Business 156673 134478 118249
## # ℹ abbreviated name: ¹Employed_full_time_year_round
## # ℹ 5 more variables: Unemployed <dbl>, Unemployment_rate <dbl>, Median <dbl>,
## # P25th <dbl>, P75th <dbl>
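The lookup above relies on the major names being stored in upper case. As a minimal sketch of a more defensive variant, the base R example below matches the same pattern case-insensitively with `grepl()`. The `majors` vector holds made-up stand-in values for illustration only; they are not rows from all-ages.csv.

```r
# Hypothetical stand-in values, NOT taken from all-ages.csv.
majors <- c("FOOD SCIENCE", "Applied Statistics", "data engineering", "FORESTRY")

# grepl() returns one TRUE/FALSE per element; ignore.case = TRUE lets
# "DATA|STATISTICS" match regardless of capitalisation.
hits <- grepl("DATA|STATISTICS", majors, ignore.case = TRUE)

# Keep only the matching names.
majors[hits]
```

With the tidyverse loaded, the same row filter could also be written as `filter(database, str_detect(Major, "DATA|STATISTICS"))`, which returns the matching rows as a tibble.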
2. PLEASE SEE HINT/CLARIFICATION AFTER #4 BELOW. Write code that transforms the data below:
[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:
First, create the character vector.
fruit <- c("bell pepper", "bilberry", "blackberry", "blood orange",
"blueberry", "cantaloupe", "chili pepper", "cloudberry",
"elderberry", "lime", "lychee", "mulberry",
"olive", "salal berry")
fruit
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
With the character vector fruit available, we can wrap each element in double quotes, join the elements with ", ", and enclose the result in c( ). The outer call uses paste0() rather than paste() so that no extra space is inserted after “c(” or before “)”.
strOut <- paste0("c(", paste0("\"", fruit, "\"", collapse = ", "), ")")
str_view(strOut)
## [1] │ c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
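One way to check a string like this is to parse it back into R and compare the result with the original vector. The sketch below is an alternative construction, not the approach used above: it assumes R 3.6.0 or later (for the `q = FALSE` argument of `dQuote()`) and uses only base R functions.

```r
fruit <- c("bell pepper", "bilberry", "blackberry", "blood orange",
           "blueberry", "cantaloupe", "chili pepper", "cloudberry",
           "elderberry", "lime", "lychee", "mulberry",
           "olive", "salal berry")

# dQuote(q = FALSE) wraps each element in straight double quotes,
# toString() joins the elements with ", ", and sprintf() adds c( ).
quoted <- dQuote(fruit, q = FALSE)
out <- sprintf("c(%s)", toString(quoted))
cat(out, "\n")

# Round-trip check: evaluating the string as R code should give back
# the original character vector.
rebuilt <- eval(parse(text = out))
identical(rebuilt, fruit)
```

The round-trip comparison only works if the string contains no stray characters, so it doubles as a quick test of the formatting.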
3. Describe, in words, what these expressions will match:
1-(.)\1\1
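(.) captures any single character, and \1 refers back to whatever was captured, so \1\1 requires that same character twice more. The pattern matches the same character three times in a row.
Example: with “a” captured, (.)\1\1 matches “aaa”;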
2-“(.)(.)\2\1”
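As above, except that \2 refers back to the second captured character. The pattern matches two characters followed by the same two characters in reverse order.
Example: with “a” and “b” captured, (.)(.)\2\1 matches “abba”;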
3-(..)\1
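Here the parentheses contain two “.”, so the group captures a pair of characters, and \1 requires that same pair again. The pattern matches any two characters immediately followed by the same two characters.
Example: with “ab” captured, (..)\1 matches “abab”;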
4-“(.).\1.\1”
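The character captured by (.) must reappear wherever \1 occurs; each “.” outside the parentheses can be any character. The pattern matches five characters in which the first, third and fifth are identical.
Example: with “a” captured and “b” and “c” as the free characters, (.).\1.\1 matches “abaca”;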
5-“(.)(.)(.).*\3\2\1”
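The new element here is “.*”, which matches any sequence of characters, including none. The pattern matches three characters, then any number of characters, then the same three characters in reverse order.
Example: with “abc” captured and “.*” matching nothing, (.)(.)(.).*\3\2\1 matches “abccba”;
Example: with “abc” captured and “.*” matching “123”, (.)(.)(.).*\3\2\1 matches “abc123cba”;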
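The descriptions above can be checked against a few test strings. The following is a minimal sketch in base R; the names in `patterns` and the strings in `words` are illustrative choices, and each backslash is doubled because the patterns are written as R strings.

```r
# Illustrative names for the five patterns; "\\1" in an R string is the
# regex backreference \1.
patterns <- c(
  triple     = "(.)\\1\\1",
  mirror2    = "(.)(.)\\2\\1",
  pair_twice = "(..)\\1",
  alternate  = "(.).\\1.\\1",
  mirror3    = "(.)(.)(.).*\\3\\2\\1"
)

words <- c("aaa", "abba", "abab", "abaca", "abccba", "abc123cba", "abc")

# One row per test string, one column per pattern; TRUE means the
# pattern matches somewhere in the string.
result <- sapply(patterns, function(p) grepl(p, words, perl = TRUE))
rownames(result) <- words
result
```

Because `grepl()` looks for a match anywhere in the string, “abccba” matches both `mirror3` and `mirror2` (through its substring “bccb”).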
4. Construct regular expressions to match words that:
sample_use <- c("civic", "abab", "transmission")
sample_use
## [1] "civic" "abab" "transmission"
Start and end with the same character.
str_view(sample_use, "^(.).*\\1$")
## [1] │ <civic>
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
str_view(sample_use, "(.)(.).*\\1\\2")
## [2] │ <abab>
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
str_view(sample_use, "(.).*\\1.*\\1")
## [3] │ tran<smiss>ion
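The three patterns above can be tried on a slightly larger word list that includes the examples from the exercise text. This is a minimal sketch in base R: the `words` vector is an illustrative choice, and the first pattern is a variant (an assumption, not the expression used above) that makes everything after the first character optional, so that a one-character word also counts as starting and ending with the same character.

```r
words <- c("a", "civic", "church", "eleven", "abab", "transmission")

# Same first and last character. The optional group (.*\\1)? lets a
# single-character word such as "a" match as well.
same_ends <- grepl("^(.)(.*\\1)?$", words, perl = TRUE)

# A pair of characters that occurs again later in the word.
repeated_pair <- grepl("(..).*\\1", words, perl = TRUE)

# One character that occurs in at least three places.
three_times <- grepl("(.).*\\1.*\\1", words, perl = TRUE)

words[same_ends]
words[repeated_pair]
words[three_times]
```

These patterns use “.”, which matches any character rather than letters only; replacing “.” inside the capturing group with a class such as `[a-z]` would restrict the match to lower-case letters.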