Week 3

###1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

data = read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv', header = TRUE)
head(data)

##   Major_code                                 Major
## 1       1100                   GENERAL AGRICULTURE
## 2       1101 AGRICULTURE PRODUCTION AND MANAGEMENT
## 3       1102                AGRICULTURAL ECONOMICS
## 4       1103                       ANIMAL SCIENCES
## 5       1104                          FOOD SCIENCE
## 6       1105            PLANT SCIENCE AND AGRONOMY
##                    Major_category  Total Employed Employed_full_time_year_round
## 1 Agriculture & Natural Resources 128148    90245                         74078
## 2 Agriculture & Natural Resources  95326    76865                         64240
## 3 Agriculture & Natural Resources  33955    26321                         22810
## 4 Agriculture & Natural Resources 103549    81177                         64937
## 5 Agriculture & Natural Resources  24280    17281                         12722
## 6 Agriculture & Natural Resources  79409    63043                         51077
##   Unemployed Unemployment_rate Median P25th P75th
## 1       2423        0.02614711  50000 34000 80000
## 2       2266        0.02863606  54000 36000 80000
## 3        821        0.03024832  63000 40000 98000
## 4       3619        0.04267890  46000 30000 72000
## 5        894        0.04918845  62000 38500 90000
## 6       2070        0.03179089  50000 35000 75000

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v stringr 1.4.0
## v tidyr   1.1.2     v forcats 0.5.1
## v readr   1.4.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# Keep the rows whose Major contains "DATA" or "STATISTICS" (case-insensitive)
ds1 = data[str_detect(data$Major, regex("DATA|STATISTICS", ignore_case = TRUE)), ]
head(ds1)

##     Major_code                                         Major
## 20        2101      COMPUTER PROGRAMMING AND DATA PROCESSING
## 93        3702               STATISTICS AND DECISION SCIENCE
## 170       6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS
##              Major_category  Total Employed Employed_full_time_year_round
## 20  Computers & Mathematics  29317    22828                         18747
## 93  Computers & Mathematics  24806    18808                         14468
## 170                Business 156673   134478                        118249
##     Unemployed Unemployment_rate Median P25th  P75th
## 20        2265        0.09026422  60000 40000  85000
## 93        1138        0.05705405  70000 43000 102000
## 170       6186        0.04397714  72000 50000 100000
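The same filter can be written in base R without loading stringr. The sketch below is illustrative only: the `majors` vector is a small hand-picked subset of the names shown above, not the full dataset, so that it runs without a download.

```r
# Hypothetical subset of major names, copied from the output above
majors <- c("GENERAL AGRICULTURE",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "FOOD SCIENCE")

# grep() with value = TRUE returns the matching strings themselves;
# the alternation "DATA|STATISTICS" matches either word
hits <- grep("DATA|STATISTICS", majors, ignore.case = TRUE, value = TRUE)
print(hits)
```

On the full data frame, the same pattern could be applied to `data$Major` and used as a row index.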

###2. Write code that transforms the data below:

[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"

Into a format like this: c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

# Store the fruit names as a character vector
fruits = c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
# Wrap each name in double quotes, join with commas, and enclose the result in c(...)
strng = paste0('c(', paste0('"', fruits, '"', collapse = ','), ')')
message(strng)

## c("bell pepper","bilberry","blackberry","blood orange","blueberry","cantaloupe","chili pepper","cloudberry","elderberry","lime","lychee","mulberry","olive","salal berry")
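If the starting point is the printed console output itself rather than separately typed strings, one possible approach is to pull out each quoted item with a regular expression. This is a minimal sketch in base R, assuming the output has been pasted into a single string with straight double quotes; the names `raw`, `items`, and `result` are illustrative.

```r
# Console output pasted as one string (assumes straight double quotes)
raw <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Extract every double-quoted item, keeping the quotes
items <- regmatches(raw, gregexpr('"[^"]+"', raw))[[1]]

# Join the quoted items with commas and wrap them in c(...)
result <- paste0("c(", paste(items, collapse = ", "), ")")
cat(result, "\n")
```

Because the index markers such as `[1]` sit outside the quotes, they are skipped without any extra cleanup.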

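###3. Describe, in words, what these expressions will match:

- (.)\1\1 : the first capturing group matches any character, and each \1 matches that same character again, so this matches one character repeated three times in a row, e.g. "aaa".
- (.)(.)\2\1 : two capturing groups match any two characters, followed by the second character and then the first, so this matches a pair followed by the same pair reversed, e.g. "abba".
- (..)\1 : any two characters followed immediately by the same two characters, e.g. "abab". The characters need not be letters.
- (.).\1.\1 : a character, then any character, the first character again, any character, and the first character a third time, e.g. "abaca".
- (.)(.)(.).*\3\2\1 : three characters, then zero or more characters of any kind, then the same three characters in reverse order, e.g. "abccba" or "abcxcba".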
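These descriptions can be checked against a few hand-picked strings. The sketch below uses base R `grepl()` with `perl = TRUE` so that it runs without extra packages; the test strings are made up for illustration, and each backslash is doubled because the patterns are written as R string literals.

```r
# Each pattern is tested on one string expected to match and one expected not to
check <- function(pattern, yes, no) {
  c(match = grepl(pattern, yes, perl = TRUE),
    no_match = !grepl(pattern, no, perl = TRUE))
}

r1 <- check("(.)\\1\\1", "aaa", "aab")                 # same character three times in a row
r2 <- check("(.)(.)\\2\\1", "abba", "abab")            # pair followed by its reverse
r3 <- check("(..)\\1", "abab", "abba")                 # pair immediately repeated
r4 <- check("(.).\\1.\\1", "abaca", "abcde")           # same character at positions 1, 3, 5
r5 <- check("(.)(.)(.).*\\3\\2\\1", "abcxcba", "abcabc") # three characters, later reversed

print(rbind(r1, r2, r3, r4, r5))
```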

###4. Construct regular expressions to match words that:

Start and end with the same character.

str_view(c("qwq", "qwe"), "^(.)(.*\\1)?$", match = TRUE)

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

str_view(c("chur", "church", "chch"), "(..).*\\1", match = TRUE)

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

str_view(c("eleven", "church"), "(.).*\\1.*\\1", match = TRUE)
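One possible set of patterns for the three tasks above can be checked against a small word list. The sketch uses base R `grepl()` with `perl = TRUE` rather than `str_view()` so that the results come back as logical vectors; the `words` vector and the variable names are hypothetical.

```r
# Hypothetical word list chosen to exercise each pattern
words <- c("eleven", "church", "level", "data", "a", "banana")

# Starts and ends with the same character; the optional group lets a
# single-character word count as well
same_ends <- grepl("^(.)(.*\\1)?$", words, perl = TRUE)

# Contains a pair of characters that appears again later in the word
repeated_pair <- grepl("(..).*\\1", words, perl = TRUE)

# Contains one character in at least three places
three_times <- grepl("(.).*\\1.*\\1", words, perl = TRUE)

print(data.frame(words, same_ends, repeated_pair, three_times))
```

The patterns use `.` and so accept any character; replacing `.` with `[A-Za-z]` would restrict them to letters.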