Assignment3_Chunjie

1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

library(stringr)
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
major <- read.csv(url)  # read.csv() already returns a data frame
head(major)

##   FOD1P                                 Major                  Major_Category
## 1  1100                   GENERAL AGRICULTURE Agriculture & Natural Resources
## 2  1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3  1102                AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4  1103                       ANIMAL SCIENCES Agriculture & Natural Resources
## 5  1104                          FOOD SCIENCE Agriculture & Natural Resources
## 6  1105            PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources

str(major)

## 'data.frame':    174 obs. of  3 variables:
##  $ FOD1P         : chr  "1100" "1101" "1102" "1103" ...
##  $ Major         : chr  "GENERAL AGRICULTURE" "AGRICULTURE PRODUCTION AND MANAGEMENT" "AGRICULTURAL ECONOMICS" "ANIMAL SCIENCES" ...
##  $ Major_Category: chr  "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" ...

# There are 174 observations and 3 variables.

# Row indices of majors whose name contains DATA or STATISTICS (case-insensitive)
data_stat <- grep("DATA|STATISTICS", major$Major, value = FALSE, ignore.case = TRUE)
data_stat

## [1] 44 52 59

# Rows 44, 52, and 59 hold the majors that contain DATA or STATISTICS.

# Subset with the index vector instead of retyping the row numbers
major[data_stat, ]

##    FOD1P                                         Major          Major_Category
## 44  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 52  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 59  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics

Three majors contain either DATA or STATISTICS: “Management Information Systems and Statistics”, “Computer Programming and Data Processing”, and “Statistics and Decision Science”.
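The same selection can be done with a logical vector instead of row numbers. A minimal sketch using base R `grepl`; the `sample_majors` vector is a hypothetical subset of the Major column, assumed here so the example runs without downloading the CSV:

```r
# Hypothetical subset of the Major column, so the example runs offline.
sample_majors <- c(
  "GENERAL AGRICULTURE",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "FOOD SCIENCE"
)

# grepl() returns one TRUE/FALSE per element, which can subset the vector directly.
is_match <- grepl("DATA|STATISTICS", sample_majors, ignore.case = TRUE)
matched <- sample_majors[is_match]
print(matched)
```

On the full data set the same idea would read `major[grepl("DATA|STATISTICS", major$Major, ignore.case = TRUE), ]`.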

2 Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

fruit<-c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

dput(fruit)

## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", 
## "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", 
## "lychee", "mulberry", "olive", "salal berry")
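Typing the vector by hand works, but it does not transform the printed output itself. A rough sketch of one way to do that in base R, assuming the printed output arrives as a single string (the names `raw_text`, `fruit_vec`, and `fruit_code` are made up for this illustration):

```r
# The printed output as one string (an assumption about how the raw data arrives).
raw_text <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Pull out every double-quoted token, then strip the surrounding quotes.
quoted <- regmatches(raw_text, gregexpr('"[^"]+"', raw_text))[[1]]
fruit_vec <- gsub('"', "", quoted, fixed = TRUE)

# Rebuild the c("...", "...") text form as a single string.
fruit_code <- paste0("c(", paste0('"', fruit_vec, '"', collapse = ", "), ")")
cat(fruit_code, "\n")
```

The index markers such as `[1]` and `[5]` are dropped automatically because only quoted tokens are extracted.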

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

3 Describe, in words, what these expressions will match:

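(.)\1\1
The same character three times in a row, e.g. “aaa”.

"(.)(.)\\2\\1"
Two characters followed by the same two characters in reverse order, e.g. “abba”.

(..)\1
Any two characters repeated immediately, e.g. “abab”.

"(.).\\1.\\1"
A character, then any character, then the first character again, then any character, then the first character a third time, e.g. “abaca”.

"(.)(.)(.).*\\3\\2\\1"
Three characters, then zero or more characters of any kind, then the same three characters in reverse order, e.g. “abcxcba”.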
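The five patterns in this exercise can be checked against short test strings. A minimal sketch in base R, with every backslash doubled because the patterns are written as R strings; the test strings are hypothetical, and `perl = TRUE` is an assumption made so the backreferences are handled by PCRE:

```r
# Each regex is written as an R string, so every backslash is doubled.
patterns <- c(
  triple      = "(.)\\1\\1",
  mirrored    = "(.)(.)\\2\\1",
  pair_twice  = "(..)\\1",
  alternating = "(.).\\1.\\1",
  reversed_3  = "(.)(.)(.).*\\3\\2\\1"
)

# Hypothetical test strings, one intended match per pattern.
examples <- c(
  triple      = "aaa",
  mirrored    = "abba",
  pair_twice  = "abab",
  alternating = "abaca",
  reversed_3  = "abcxcba"
)

# mapply() pairs each pattern with its own example string.
hits <- mapply(grepl, patterns, examples, MoreArgs = list(perl = TRUE))
print(hits)
```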

4 Construct regular expressions to match words that:

Start and end with the same character.

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

library(stringr)
data("fruit")  # reload stringr's fruit vector, replacing the one defined in exercise 2
head(fruit)

## [1] "apple"       "apricot"     "avocado"     "banana"      "bell pepper"
## [6] "bilberry"

The fruit data set from the stringr package is used for this exercise.

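# Start and end with the same character.
# The backreference must be written \\1 inside an R string; a bare \1 is read as the control character \001.
str_view(fruit, "^(.).*\\1$", match = TRUE)

# It seems no entry in the fruit data set starts and ends with the same character.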

# Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
str_view(fruit, "(..).*\\1", match = TRUE)

# Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
str_view(fruit, "(.).*\\1.*\\1", match = TRUE)
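Because the fruit data may not contain a match for every pattern, the three expressions can also be tried on words known to match. A small sketch in base R; the `words` vector is hypothetical and not part of the fruit data, and `perl = TRUE` is an assumption made so the backreferences are handled by PCRE:

```r
# In an R string "\1" is a single control character, while "\\1" is a
# backslash followed by the digit 1, which the regex engine reads as a backreference.
escape_lengths <- c(single = nchar("\1"), doubled = nchar("\\1"))

# Hypothetical test words, not taken from the fruit data.
words <- c("church", "eleven", "area", "stats", "apple")

same_ends     <- grepl("^(.).*\\1$", words, perl = TRUE)
repeated_pair <- grepl("(..).*\\1", words, perl = TRUE)
three_times   <- grepl("(.).*\\1.*\\1", words, perl = TRUE)

print(escape_lengths)
print(data.frame(words, same_ends, repeated_pair, three_times))
```

Note that `^(.).*\\1$` requires at least two characters, so a one-letter word would not match it.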

Assignment3_Chunjie_Nan

Chunjie Nan

9/12/2021
