#1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
library(tidyverse)
Read the file from GitHub into a data frame (read.csv returns a base data.frame, not a tibble) and preview the first rows.
url <- "https://raw.githubusercontent.com/omocharly/Data607_Assignment3/main/all-ages.csv"
college_majors <- read.csv(url)
head(college_majors)
## Major_code Major
## 1 1100 GENERAL AGRICULTURE
## 2 1101 AGRICULTURE PRODUCTION AND MANAGEMENT
## 3 1102 AGRICULTURAL ECONOMICS
## 4 1103 ANIMAL SCIENCES
## 5 1104 FOOD SCIENCE
## 6 1105 PLANT SCIENCE AND AGRONOMY
## Major_category Total Employed Employed_full_time_year_round
## 1 Agriculture & Natural Resources 128148 90245 74078
## 2 Agriculture & Natural Resources 95326 76865 64240
## 3 Agriculture & Natural Resources 33955 26321 22810
## 4 Agriculture & Natural Resources 103549 81177 64937
## 5 Agriculture & Natural Resources 24280 17281 12722
## 6 Agriculture & Natural Resources 79409 63043 51077
## Unemployed Unemployment_rate Median P25th P75th
## 1 2423 0.02614711 50000 34000 80000
## 2 2266 0.02863606 54000 36000 80000
## 3 821 0.03024832 63000 40000 98000
## 4 3619 0.04267890 46000 30000 72000
## 5 894 0.04918845 62000 38500 90000
## 6 2070 0.03179089 50000 35000 75000
sub_filter <- dplyr::filter(college_majors, grepl('DATA|STATISTIC', Major))
sub_filter
## Major_code Major
## 1 2101 COMPUTER PROGRAMMING AND DATA PROCESSING
## 2 3702 STATISTICS AND DECISION SCIENCE
## 3 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS
## Major_category Total Employed Employed_full_time_year_round
## 1 Computers & Mathematics 29317 22828 18747
## 2 Computers & Mathematics 24806 18808 14468
## 3 Business 156673 134478 118249
## Unemployed Unemployment_rate Median P25th P75th
## 1 2265 0.09026422 60000 40000 85000
## 2 1138 0.05705405 70000 43000 102000
## 3 6186 0.04397714 72000 50000 100000
sub_col_majors <- sub_filter %>% select(Major_code:Employed)
sub_col_majors
## Major_code Major
## 1 2101 COMPUTER PROGRAMMING AND DATA PROCESSING
## 2 3702 STATISTICS AND DECISION SCIENCE
## 3 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS
## Major_category Total Employed
## 1 Computers & Mathematics 29317 22828
## 2 Computers & Mathematics 24806 18808
## 3 Business 156673 134478
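The filter above can also be checked without downloading the file. The following is a minimal base-R sketch that assumes a small hand-typed vector `majors` (hypothetical: three names copied from the output above plus one non-matching name) and spells out the full word STATISTICS from the question. The `ignore.case = TRUE` argument is an addition, on the assumption that some versions of the data might not be upper case.

```r
# Hypothetical sample: three majors from the output above and one that should not match
majors <- c(
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "GENERAL AGRICULTURE"
)

# TRUE where the name contains DATA or STATISTICS, regardless of case
hits <- grepl("DATA|STATISTICS", majors, ignore.case = TRUE)

# Keep only the matching names
matched_majors <- majors[hits]
print(matched_majors)
```

On the real data the same logical vector can be passed to dplyr::filter, as in the code above.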
#2 Write code that transforms the data below:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”
Into a format like this:
c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
library(stringr)
main_data <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'
# str_extract_all() returns a list, so take its first element to get the character vector
fruits <- str_extract_all(main_data, pattern = '[a-z]+\\s?[a-z]+')[[1]]
# Quote each name, join the names with commas, and wrap the result in c( )
output_string <- str_c('c(', str_c('"', fruits, '"', collapse = ", "), ')')
writeLines(output_string)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
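One way to confirm that the printed string really is valid R code for the vector is to parse it back and compare. The sketch below is a base-R variant under stated assumptions: it uses regmatches and gregexpr instead of stringr, and a slightly different pattern (`[a-z]+( [a-z]+)?`) that also accepts one-letter words; the names `fruits` and `round_trip` are illustrative.

```r
main_data <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'

# Pull out each fruit name: a lowercase word, optionally followed by a space and a second word
fruits <- regmatches(main_data, gregexpr("[a-z]+( [a-z]+)?", main_data))[[1]]

# Quote each name, join with commas, and wrap in c( )
output_string <- paste0("c(", paste0('"', fruits, '"', collapse = ", "), ")")
writeLines(output_string)

# Round trip: evaluating the string as R code should give back the same vector
round_trip <- eval(parse(text = output_string))
print(identical(round_trip, fruits))
```

The round trip only works because the fruit names contain no quotes or backslashes; names with such characters would need escaping first.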
#3 Describe, in words, what these expressions will match:
(.)\1\1 "(.)(.)\\2\\1" (..)\1 "(.).\\1.\\1" "(.)(.)(.).*\\3\\2\\1"
(1) (.)\1\1 Read as a regular expression, this matches any single character (except newline) repeated three times in a row, e.g. "aaa". Typed as the R string "(.)\1\1", however, \1 is an escape for the control character with code 1, so the pattern instead matches any single character followed by two of those control characters.
(2) "(.)(.)\\2\\1" Matches any two characters that are immediately followed by the same two characters in the opposite order, e.g. "abba". In essence, it matches a 4-character palindrome.
(3) (..)\1 Read as a regular expression, this matches any two characters (except newline) that are immediately repeated, e.g. "abab". Typed as the R string "(..)\1", it matches any two characters followed by the control character with code 1.
(4) "(.).\\1.\\1" Matches a character, then any character, then the first character again, then any character, then the first character a third time, e.g. "abaca".
(5) "(.)(.)(.).*\\3\\2\\1" Matches any three characters, followed by zero or more characters, followed by the same three characters in reverse order, e.g. "abcxcba".
example_1 <- c('a\1\1', 'b\1\1', 'ccc\1\1c', 'd\1d')
str_view(example_1, "(.)\1\1")
example_2 <- c('abcccba', 'badmomdad', 'maddam')
str_view(example_2, "(.)(.)\\2\\1")
example_3 <- c('abc\1d', 'abc\2dd', '\1d')
str_view(example_3, "(..)\1")
example_04 <- c('philzlyl')
str_view(example_04, "(.).\\1.\\1")
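The difference between a doubled and a single backslash can be checked directly. The sketch below is a base-R illustration under stated assumptions: the pattern names and the test words are hypothetical, chosen so that each pattern matches exactly one word, and grepl with `perl = TRUE` stands in for str_view.

```r
# Backreference patterns written as R strings: each regex backslash is doubled
patterns <- c(
  triple      = "(.)\\1\\1",
  mirrored    = "(.)(.)\\2\\1",
  pair        = "(..)\\1",
  alternating = "(.).\\1.\\1",
  reversed    = "(.)(.)(.).*\\3\\2\\1"
)

# Hypothetical test words; "xyz" should match none of the patterns
words <- c("aaa", "abba", "abab", "abaca", "abcxcba", "xyz")

# For each pattern, keep the words it matches
matches <- sapply(patterns, function(p) words[grepl(p, words, perl = TRUE)])
print(matches)

# With a single backslash, R converts \1 into the control character with code 1
single_backslash <- "(.)\1\1"
print(nchar(single_backslash))  # 5: "(", ".", ")" and two control characters
print(utf8ToInt("\1"))          # 1
```

This is why the examples that pass "(.)\1\1" and "(..)\1" to str_view only match strings that literally contain that control character.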
#4 Construct regular expressions to match words that:
(1) Start and end with the same character.
(2) Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice).
(3) Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s).
answer.str <- c("CHURCH","ELEVEN","PROP","SEVENTEEN","BANANA" ,"EAGLE")
code1 <- "^(.).*\\1$"
answer.str %>% str_subset(code1)
## [1] "PROP" "EAGLE"
code2 <- "(..).*\\1"
answer.str %>% str_subset(code2)
## [1] "CHURCH" "SEVENTEEN" "BANANA"
code3 <- "(.).*\\1.*\\1"
answer.str %>% str_subset(code3)
## [1] "ELEVEN" "SEVENTEEN" "BANANA"
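One edge case worth noting: the pattern "^(.).*\\1$" needs at least two characters, so a one-letter word is not matched even though it arguably starts and ends with the same character. The sketch below compares it with a hypothetical lenient variant that makes the tail optional; whether single-letter words should count is an assumption about the exercise, and the test words are illustrative.

```r
# Hypothetical test words, including a one-letter word
words <- c("PROP", "EAGLE", "A", "AB", "CHURCH")

strict  <- "^(.).*\\1$"     # first character must reappear at the end: two characters minimum
lenient <- "^(.)(.*\\1)?$"  # the tail is optional, so a single character also matches

strict_hits  <- words[grepl(strict, words, perl = TRUE)]
lenient_hits <- words[grepl(lenient, words, perl = TRUE)]

print(strict_hits)
print(lenient_hits)
```

Both patterns reject "AB" and "CHURCH"; they differ only on "A".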