Week 3 assignment

Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to the problems below. You may work in a small group, but please submit separately with names of all group participants in your submission.

1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Data acquisition. Input file: “all-ages.csv” from https://github.com/fivethirtyeight/data/tree/master/college-majors

url <- 'https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv'
database <- read_csv(url)

## Rows: 173 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Major, Major_category
## dbl (9): Major_code, Total, Employed, Employed_full_time_year_round, Unemplo...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

database

## # A tibble: 173 × 11
##    Major_code Major        Major_category  Total Employed Employed_full_time_y…¹
##         <dbl> <chr>        <chr>           <dbl>    <dbl>                  <dbl>
##  1       1100 GENERAL AGR… Agriculture &… 128148    90245                  74078
##  2       1101 AGRICULTURE… Agriculture &…  95326    76865                  64240
##  3       1102 AGRICULTURA… Agriculture &…  33955    26321                  22810
##  4       1103 ANIMAL SCIE… Agriculture &… 103549    81177                  64937
##  5       1104 FOOD SCIENCE Agriculture &…  24280    17281                  12722
##  6       1105 PLANT SCIEN… Agriculture &…  79409    63043                  51077
##  7       1106 SOIL SCIENCE Agriculture &…   6586     4926                   4042
##  8       1199 MISCELLANEO… Agriculture &…   8549     6392                   5074
##  9       1301 ENVIRONMENT… Biology & Lif… 106106    87602                  65238
## 10       1302 FORESTRY     Agriculture &…  69447    48228                  39613
## # ℹ 163 more rows
## # ℹ abbreviated name: ¹Employed_full_time_year_round
## # ℹ 5 more variables: Unemployed <dbl>, Unemployment_rate <dbl>, Median <dbl>,
## #   P25th <dbl>, P75th <dbl>

Then we can select the rows whose Major column contains either keyword.

majors_containing_data_or_statistics <- database[grep("DATA|STATISTICS", database$Major),]
majors_containing_data_or_statistics

## # A tibble: 3 × 11
##   Major_code Major         Major_category  Total Employed Employed_full_time_y…¹
##        <dbl> <chr>         <chr>           <dbl>    <dbl>                  <dbl>
## 1       2101 COMPUTER PRO… Computers & M…  29317    22828                  18747
## 2       3702 STATISTICS A… Computers & M…  24806    18808                  14468
## 3       6212 MANAGEMENT I… Business       156673   134478                 118249
## # ℹ abbreviated name: ¹Employed_full_time_year_round
## # ℹ 5 more variables: Unemployed <dbl>, Unemployment_rate <dbl>, Median <dbl>,
## #   P25th <dbl>, P75th <dbl>
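The same filter can also be written with stringr, which is already attached by the tidyverse. The sketch below is only an illustration: `majors` is a made-up stand-in for `database$Major`, and its values are not taken from the dataset.

```r
library(stringr)

# Hypothetical stand-in for database$Major; these names are illustrative only.
majors <- c("ART HISTORY", "APPLIED STATISTICS", "DATA PROCESSING", "BOTANY")

# str_detect() returns one TRUE/FALSE per element, so the result can be used
# directly as a logical index. "DATA|STATISTICS" matches either keyword.
hits <- majors[str_detect(majors, "DATA|STATISTICS")]
hits
```

On the real data, the same logical test could be applied to the rows of `database` instead of a plain vector.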

2. PLEASE SEE HINT/CLARIFICATION AFTER #4 BELOW. Write code that transforms the data below:

[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

First, create the character vector.

fruit <- c("bell pepper", "bilberry", "blackberry", "blood orange",
          "blueberry", "cantaloupe", "chili pepper", "cloudberry",
          "elderberry", "lime", "lychee", "mulberry",
          "olive", "salal berry")
fruit

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

Once the character vector fruit is available, we can wrap each element in quotes and use paste0() to concatenate everything into a single string.

# paste0() adds no separator, so no stray spaces appear inside the parentheses
strOut <- paste0("c(", paste0("\"", fruit, "\"", collapse = ", "), ")")
str_view(strOut)

## [1] │ c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
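One way to check that the generated text really is valid R code for the original vector is to parse and evaluate it, then compare the result with `fruit`. This is a minimal sketch using base R only; the names `as_code` and `round_trip` are introduced here for illustration.

```r
fruit <- c("bell pepper", "bilberry", "blackberry", "blood orange",
           "blueberry", "cantaloupe", "chili pepper", "cloudberry",
           "elderberry", "lime", "lychee", "mulberry",
           "olive", "salal berry")

# Build the c(...) text: quote each element, join with ", ", wrap in c( ).
as_code <- paste0("c(", paste0('"', fruit, '"', collapse = ", "), ")")

# Parse the text as R code and evaluate it back into a character vector.
round_trip <- eval(parse(text = as_code))

# TRUE when the reconstructed vector equals the original one.
identical(round_trip, fruit)
```

Evaluating generated text is reasonable here because the input is a fixed vector; it would not be a safe pattern for untrusted strings.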

3. Describe, in words, what these expressions will match:

1-(.)\1\1

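(.) captures any single character, and \1 is a backreference to whatever that group captured, so \1\1 requires the same character twice more. The expression matches any character repeated three times in a row.

Example: if the group captures “a”, the match is “aaa”.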

2-“(.)(.)\2\1”

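As above, but with two capture groups: \2 refers back to the second captured character and \1 to the first. The expression matches two characters followed by the same two characters in reverse order.

Example: if the groups capture “a” and “b”, the match is “abba”.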

3-(..)\1

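Here the parentheses enclose two “.”, so the group captures a pair of characters and \1 repeats that pair. The expression matches any two characters repeated immediately.

Example: if the group captures “ab”, the match is “abab”.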

4-“(.).\1.\1”

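The “.” inside the parentheses is captured and must reappear at each \1; each “.” outside the parentheses can be any character. The expression matches five characters in which the first, third, and fifth are the same.

Example: if the group captures “a” and the free characters are “b” and “c”, the match is “abaca”.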

5-“(.)(.)(.).*\3\2\1”

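The new element here is “.*”, which matches zero or more of any character. The expression matches three characters, then any run of characters (possibly empty), then the same three characters in reverse order.

Example: “abc” with nothing in between gives “abccba”.

Example: “abc” with “123” in between gives “abc123cba”.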
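The descriptions above can be checked by running each pattern against its example string. The following is a small sketch in base R; the pattern names are labels chosen here, and the backslashes are doubled because the patterns are written as R string literals.

```r
# Each regex from the exercise, written as an R string (note the doubled "\\").
patterns <- c(
  triple   = "(.)\\1\\1",
  mirror   = "(.)(.)\\2\\1",
  pair     = "(..)\\1",
  every2nd = "(.).\\1.\\1",
  reversed = "(.)(.)(.).*\\3\\2\\1"
)

# One example string per pattern, in the same order.
examples <- c("aaa", "abba", "abab", "abaca", "abc123cba")

# Test pattern i against example i; perl = TRUE selects the PCRE engine.
matches <- mapply(function(p, x) grepl(p, x, perl = TRUE), patterns, examples)
matches

# A string with no repeated characters should match none of the patterns.
none <- vapply(patterns, function(p) grepl(p, "abcd", perl = TRUE), logical(1))
none
```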

4. Construct regular expressions to match words that:

sample_use <- c("civic", "abab", "transmission")
sample_use

## [1] "civic"        "abab"         "transmission"

Start and end with the same character.

str_view(sample_use, "^(.).*\\1$")

## [1] │ <civic>

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

str_view(sample_use, "(.)(.).*\\1\\2")

## [2] │ <abab>

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

str_view(sample_use, "(.).*\\1.*\\1")

## [3] │ tran<smiss>ion
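As a further check, the three patterns can be tried on the words the exercise itself mentions (“church”, “eleven”). The sketch below uses base R and makes two assumptions that go beyond the answers above: `same_ends` is a variant that also accepts a one-letter word such as “a”, and `repeated_pair` writes the pair as a single two-character group. Both are offered as alternatives, not as the required answers.

```r
words_to_check <- c("civic", "a", "church", "eleven", "transmission")

# Start and end with the same character; the optional group lets a
# single-character word count as starting and ending with itself.
same_ends <- "^(.)(.*\\1)?$"

# A pair of characters that appears again later in the word.
repeated_pair <- "(..).*\\1"

# One character that occurs in at least three places.
three_times <- "(.).*\\1.*\\1"

res_ends  <- grepl(same_ends, words_to_check, perl = TRUE)
res_pair  <- grepl(repeated_pair, words_to_check, perl = TRUE)
res_three <- grepl(three_times, words_to_check, perl = TRUE)

# Words matched by each pattern.
words_to_check[res_ends]
words_to_check[res_pair]
words_to_check[res_three]
```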

ZIXIAN LIANG

2024-02-10