Week 2 - Data 607

Task 1

Task: Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

Solution

Load the libraries to use:

library(readr)
library(dplyr)
library(stringr)

Read in the data:

url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
majors <- read_csv(url)

# Look at the first 10 rows:
head(majors, 10)

## # A tibble: 10 × 3
##    FOD1P Major                                 Major_Category                 
##    <chr> <chr>                                 <chr>                          
##  1 1100  GENERAL AGRICULTURE                   Agriculture & Natural Resources
##  2 1101  AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
##  3 1102  AGRICULTURAL ECONOMICS                Agriculture & Natural Resources
##  4 1103  ANIMAL SCIENCES                       Agriculture & Natural Resources
##  5 1104  FOOD SCIENCE                          Agriculture & Natural Resources
##  6 1105  PLANT SCIENCE AND AGRONOMY            Agriculture & Natural Resources
##  7 1106  SOIL SCIENCE                          Agriculture & Natural Resources
##  8 1199  MISCELLANEOUS AGRICULTURE             Agriculture & Natural Resources
##  9 1302  FORESTRY                              Agriculture & Natural Resources
## 10 1303  NATURAL RESOURCES MANAGEMENT          Agriculture & Natural Resources

Identify the majors that contain either “DATA” or “STATISTICS”:

data_stats <- majors %>%
  filter(str_detect(Major, "DATA|STATISTICS"))

data_stats

## # A tibble: 3 × 3
##   FOD1P Major                                         Major_Category         
##   <chr> <chr>                                         <chr>                  
## 1 6212  MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business               
## 2 2101  COMPUTER PROGRAMMING AND DATA PROCESSING      Computers & Mathematics
## 3 3702  STATISTICS AND DECISION SCIENCE               Computers & Mathematics

If we need the majors as a vector we can do:

data_stats$Major

## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"

Task 2

Task: Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange” [5] “blueberry” “cantaloupe” “chili pepper” “cloudberry” [9] “elderberry” “lime” “lychee” “mulberry” [13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

Solution

# original string
x <- '
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
'

# split the input string into lines and remove the first line
x_clean <- strsplit(x, "\n")[[1]][-1] %>%
  # remove all digits and the square brackets in each line
  # replace the space between two double quotes with a comma and a space
  str_remove_all(pattern = "[0-9]") %>%
  str_remove_all(pattern = "\\[\\] ") %>%
  str_replace_all(pattern = '" "', replacement = '", "') %>%
  # join all cleaned lines together into one string separated by commas
  paste0(collapse = ", ")

# enclose the cleaned string in "c()" and print it to the console
x_clean <- paste0("c(", x_clean, ")")
cat(x_clean)

## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

Task 3

Task: Describe, in words, what these expressions will match:

(.)\1\1
“(.)(.)\2\1”
(..)\1
“(.).\1.\1”
“(.)(.)(.).*\3\2\1”

Solution

(.)\1\1

The regular expression (.)\1\1 will match any sequence of three consecutive identical characters.

For example:

“aaa”
“bbb”
“111”
“+++”

“(.)(.)\2\1”

The regular expression “(.)(.)\2\1” will match any string surrounded by double quotes that consists of two pairs of identical characters in reverse order.

For example:

“abba”
“123321”
“xyyx”
“&&00&&”

(..)\1

The regular expression (..)\1 will match any string that contains a sequence of two consecutive identical two-character substrings.

For example: - “hello” - “goodbye” - “apple” - “banana”

“(.).\1.\1”

The regular expression “(.).\1.\1” will match any string that has a character followed by any character, the original character, any other character, the original character again.

For example:

“abaca”
“b8b.b”

“(.)(.)(.).*\3\2\1”

The regular expression “(.)(.)(.).*\3\2\1” will match any string surrounded by double quotes that starts with three characters, where the third character is repeated later in the string, followed by any number of additional characters, and then ends with the three characters in reverse order.

For example:

“abcdecba”
“1234d4321”
“xyzyzx”
“&&&qwerty&&&”

Task 4

Task: Construct regular expressions to match words that:

Start and end with the same character.
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

Solution

To match words that start and end with the same character, we can use:

^(.).*\1$

This regex matches any string that starts with a single character, followed by any number of characters (including none), and then ends with the same character that it started with.

To match words that contain a repeated pair of letters, we can use:

.*(.{2}).*\1.*

This regex matches any string that contains at least two consecutive characters (captured by .{2}), followed by any number of characters (including none), and then the same two characters that were captured earlier (using \1).

To match words that contain one letter repeated in at least three places, we can use:

.*(.).*\1.*\1.*

This regex matches any string that contains at least one character (captured by .), followed by any number of characters (including none), and then the same character that was captured earlier (using \1), repeated at least two more times. The .* before and after each instance of \1 allow any number of characters to appear between each repetition of the captured character.

Week 2 - Data 607

Mohammed Rahman

2024-02-11

Task 1

Solution

Task 2

Solution

Task 3

Solution

Task 4

Solution