practice

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(readr)
library(babynames)

#1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

The code below reads the college majors all-ages.csv file into the majors_data data frame. The next step selects the Major column and returns it as a vector. The str_view function is then used to select the majors that contain the words DATA or STATISTICS, using alternation.

##1.1 Load the college majors data into a data frame and preview it

majors_data <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv")

## Rows: 173 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Major, Major_category
## dbl (9): Major_code, Total, Employed, Employed_full_time_year_round, Unemplo...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(majors_data)

## Rows: 173
## Columns: 11
## $ Major_code                    <dbl> 1100, 1101, 1102, 1103, 1104, 1105, 1106…
## $ Major                         <chr> "GENERAL AGRICULTURE", "AGRICULTURE PROD…
## $ Major_category                <chr> "Agriculture & Natural Resources", "Agri…
## $ Total                         <dbl> 128148, 95326, 33955, 103549, 24280, 794…
## $ Employed                      <dbl> 90245, 76865, 26321, 81177, 17281, 63043…
## $ Employed_full_time_year_round <dbl> 74078, 64240, 22810, 64937, 12722, 51077…
## $ Unemployed                    <dbl> 2423, 2266, 821, 3619, 894, 2070, 264, 2…
## $ Unemployment_rate             <dbl> 0.02614711, 0.02863606, 0.03024832, 0.04…
## $ Median                        <dbl> 50000, 54000, 63000, 46000, 62000, 50000…
## $ P25th                         <dbl> 34000, 36000, 40000, 30000, 38500, 35000…
## $ P75th                         <dbl> 80000, 80000, 98000, 72000, 90000, 75000…

Select the Major column and return it as a vector

majors <- majors_data |> 
  select(Major) |> 
  pull()
head(majors, n=12)

##  [1] "GENERAL AGRICULTURE"                  
##  [2] "AGRICULTURE PRODUCTION AND MANAGEMENT"
##  [3] "AGRICULTURAL ECONOMICS"               
##  [4] "ANIMAL SCIENCES"                      
##  [5] "FOOD SCIENCE"                         
##  [6] "PLANT SCIENCE AND AGRONOMY"           
##  [7] "SOIL SCIENCE"                         
##  [8] "MISCELLANEOUS AGRICULTURE"            
##  [9] "ENVIRONMENTAL SCIENCE"                
## [10] "FORESTRY"                             
## [11] "NATURAL RESOURCES MANAGEMENT"         
## [12] "ARCHITECTURE"

Identify the majors that contain either “DATA” or “STATISTICS”

str_view(majors, "DATA|STATISTICS")

##  [20] │ COMPUTER PROGRAMMING AND <DATA> PROCESSING
##  [93] │ <STATISTICS> AND DECISION SCIENCE
## [170] │ MANAGEMENT INFORMATION SYSTEMS AND <STATISTICS>
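
If the matching majors are needed as values rather than as a highlighted view, the str_subset and str_which functions are one option. The sketch below is illustrative only: majors_sample is a hypothetical stand-in for the full majors vector (a few values copied from the output above), used so the example runs without downloading the file.

```r
library(stringr)

# Hypothetical stand-in for the majors vector (values copied from the output above)
majors_sample <- c(
  "GENERAL AGRICULTURE",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "ARCHITECTURE"
)

# str_subset() keeps only the elements that match the alternation
data_stat_majors <- str_subset(majors_sample, "DATA|STATISTICS")

# str_which() returns the positions of those elements instead
data_stat_positions <- str_which(majors_sample, "DATA|STATISTICS")

print(data_stat_majors)
print(data_stat_positions)
```

In this sample the second, third, and fourth entries match, so both calls point at the same three majors.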

#2 Transform the printed character vector below into an R character vector

# Define the original_data string
original_data <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
[9] "elderberry"   "lime"         "lychee"       "mulberry"    
[13] "olive"        "salal berry"'

# Extract each double-quoted item (the match still includes the quotes)
extracted_text <- str_extract_all(original_data, '"(.*?)"')[[1]]

# Remove the surrounding double quotes to get a plain character vector
vector_data <- str_remove_all(extracted_text, '"')

# Print the resulting vector
str_view(vector_data)

##  [1] │ bell pepper
##  [2] │ bilberry
##  [3] │ blackberry
##  [4] │ blood orange
##  [5] │ blueberry
##  [6] │ cantaloupe
##  [7] │ chili pepper
##  [8] │ cloudberry
##  [9] │ elderberry
## [10] │ lime
## [11] │ lychee
## [12] │ mulberry
## [13] │ olive
## [14] │ salal berry
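
Another way to pull out the text between the quotes is to read the capture group directly with str_match_all, and then, if a c(...) style listing is wanted, paste the items back together with str_c. This is only a sketch under assumptions: printed is a shortened, hypothetical input, and items and as_call are illustrative names.

```r
library(stringr)

# Shortened, hypothetical input in the same printed-vector layout
printed <- '[1] "bell pepper"  "bilberry"\n[3] "lime"'

# Column 2 of the match matrix holds capture group 1: the text between the quotes
items <- str_match_all(printed, '"(.*?)"')[[1]][, 2]

# Rebuild a c(...) call as a single string
as_call <- str_c("c(", str_c('"', items, '"', collapse = ", "), ")")

print(items)
writeLines(as_call)
```

The capture-group route avoids a separate quote-stripping step, because the quotes sit outside the parentheses and are never part of the extracted value.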

#3 Describe, in words, what these expressions will match:

Description of (.)\1\1

(.): Matches any single character and captures it in a group. The parentheses (.) create a capturing group that remembers the character it matches.

\1: This is a backreference to the first capturing group (.). It matches whatever character was captured by the first group.

\1: This is another backreference to the first capturing group (.). It again matches the same character that was captured by the first group.

In simple terms, this regular expression is looking for three consecutive identical characters. It will match any sequence of three characters where all three characters are the same.

three_consecutive_words <- c("godessship", "skulllike", "headmistressship", "hostessship", "godess-ship", "skull-like", "headmistress-ship", "hostess-ship")
str_view(three_consecutive_words, "(.)\\1\\1")

## [1] │ gode<sss>hip
## [2] │ sku<lll>ike
## [3] │ headmistre<sss>hip
## [4] │ hoste<sss>hip
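
To see the matched text itself rather than a highlighted view, str_extract can be used. The sketch below uses a small hypothetical vector and also checks that (.)\1{2} behaves the same as (.)\1\1 on these inputs.

```r
library(stringr)

# Hypothetical test words
words <- c("godessship", "skull-like", "aaa", "aabb")

# str_extract() returns the first match, or NA when nothing matches
triples <- str_extract(words, "(.)\\1\\1")

# (.)\1{2} is an alternative spelling: the captured character, then two more copies
same_result <- identical(triples, str_extract(words, "(.)\\1{2}"))

print(triples)
print(same_result)
```

Only "godessship" and "aaa" contain a character three times in a row, so the other two entries come back as NA.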

Description of “(.)(.)\2\1”

" Match an opening double quotation mark.

(.) Match any single character and capture it (this is the first captured group).

(.) Match any single character and capture it (this is the second captured group).

\2 Match the same character that was captured in the second group (a backreference to the second captured group).

\1 Match the same character that was captured in the first group (a backreference to the first captured group).

" Match a closing double quotation mark.

In summary, this pattern matches an opening double quotation mark, a four-character sequence of the form abba (the first and fourth characters are identical, and the second and third characters are identical), and a closing double quotation mark, for example "abba" or "noon". If the quotation marks are treated as R string delimiters rather than as part of the regex, as in the test below, the pattern matches the four-character sequence anywhere in a string.

sample <- c('"tattarrattat"', '"maddam"', '"racecar"')
str_view(sample, "(.)(.)\\2\\1")

## [1] │ "t<atta>rr<atta>t"
## [2] │ "m<adda>m"
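
The difference between the two readings of the pattern (quotation marks as part of the regex versus quotation marks as string delimiters) can be checked directly. A minimal sketch with hypothetical inputs:

```r
library(stringr)

# Hypothetical strings that contain literal double quotes
quoted <- c('"abba"', '"maddam"')

# Regex with literal quotes: the four characters must fill the whole quoted text
with_quotes <- str_detect(quoted, '"(.)(.)\\2\\1"')

# Regex without quotes: the four characters may sit anywhere in the string
without_quotes <- str_detect(quoted, "(.)(.)\\2\\1")

print(with_quotes)
print(without_quotes)
```

With the literal quotes only "abba" matches, because "maddam" has extra characters between the quotes and the four-character sequence.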

Description of (..)\1

(..): This part of the pattern consists of two consecutive dots inside parentheses. It is a capturing group that matches and remembers any two characters in the input string.

\1: This is a backreference to the first capturing group (the two characters matched by (..)). It ensures that the same two characters that were captured by the first part of the pattern are repeated immediately after. In other words, it looks for a repetition of the exact same two characters that were found earlier in the string.

This regex pattern matches any two characters followed immediately by the same two characters, such as "tata" or "cece". The match itself is four characters long, but it can occur anywhere inside a longer string.

sample <- c('"tatatr"', '"madman"', '"racececar"')
str_view(sample, "(..)\\1")

## [1] │ "<tata>tr"
## [3] │ "ra<cece>car"
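
As a quick check that the match can sit in the middle of a word, str_extract can return the repeated pair itself. The words below are hypothetical test inputs:

```r
library(stringr)

# Hypothetical test words
words <- c("tatatr", "madman", "banana")

# First occurrence of two characters immediately followed by the same two
repeated_pairs <- str_extract(words, "(..)\\1")

print(repeated_pairs)
```

In "banana" the match starts at the second letter ("anan"), and "madman" has no immediately repeated pair, so it yields NA.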

Description of “(.).\1.\1”

" - The pattern starts with a double quotation mark, a literal character that must appear exactly as written.

(.) - This part of the pattern defines a capture group (Group 1). It captures any single character and stores it for later reference.

. - The dot is a special character in regex that matches any single character except a newline.

\1 - A backreference to Group 1: it matches the same character that Group 1 captured.

. - Another dot, which again matches any single character except a newline.

\1 - Another backreference to Group 1, so the captured character must appear a third time.

" - The pattern ends with a closing double quotation mark.

Overall, the pattern matches five characters in which the first, third, and fifth are identical, such as "tatat" or "cecec". With the literal quotation marks, those five characters must sit directly between double quotes; the test below omits the quotation marks, so it finds the sequence anywhere in a string.

sample <- c('"tatatr"', '"ABACDC"', '"racececar"')
str_view(sample,  '(.).\\1.\\1')

## [1] │ "<tatat>r"
## [3] │ "ra<cecec>ar"
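
The same pattern can be tried on ordinary words to confirm that only the first, third, and fifth characters are constrained. A small sketch with hypothetical inputs:

```r
library(stringr)

# Hypothetical test words
words <- c("abaca", "ABACDC", "eleven")

# One captured character that reappears two and four positions later
alternating <- str_extract(words, "(.).\\1.\\1")

print(alternating)
```

"eleven" matches on "eleve" because the second and fourth characters are free, while "ABACDC" never has the same character in three alternating positions.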

Description of “(.)(.)(.).*\3\2\1”

" The pattern starts by matching a double quotation mark.

(.)(.)(.): Inside the double quotes, it expects to find any three characters, and it captures each of them individually using parentheses. This means it captures three arbitrary characters in sequence.

.* After capturing the first three characters, the pattern allows for zero or more arbitrary characters to appear (denoted by .*). This means it matches any characters (including none) between the initial three and the next part of the pattern.

\3\2\1: Finally, the pattern checks for a sequence of characters that corresponds to the reverse order of the first three captured characters. \3 refers to the third captured character, \2 refers to the second captured character, and \1 refers to the first captured character. So, this part checks for the same characters in reverse order as the initial capture.

Overall, this pattern matches a double-quoted stretch of at least six characters whose last three characters are the first three in reverse order, with any number of characters (including none) in between, for example "abccba" or "tapstarpat".

sample <- c('"tapstarpat"', '"ABACDC"', '"racececar"')
str_view(sample,  '"(.)(.)(.).*\\3\\2\\1"')

## [1] │ <"tapstarpat">
## [3] │ <"racececar">
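
Because .* can match zero characters, the shortest possible match has exactly six characters between the quotes. The sketch below, with hypothetical inputs, checks that case alongside a longer one and a near miss:

```r
library(stringr)

# Hypothetical quoted strings
quoted <- c('"abccba"', '"abcxyzcba"', '"abcab"')

# First three characters, anything in between, then the same three reversed
mirrored <- str_detect(quoted, '"(.)(.)(.).*\\3\\2\\1"')

print(mirrored)
```

The last input fails because the text after "abc" is too short to contain "cba".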

#4 Construct regular expressions to match words that:

Start and end with the same character.

sample <- c("podstardop",  "made", "noon", "june", "racecar", "bob")
str_view(sample, "^(\\w)(\\w*)\\1$")

## [1] │ <podstardop>
## [3] │ <noon>
## [5] │ <racecar>
## [6] │ <bob>
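
The pattern above needs at least two characters, so a one-letter word such as "a" is not matched. Whether such a word should count is a matter of interpretation; if it should, one possible variant makes the tail optional. A sketch with hypothetical inputs:

```r
library(stringr)

# Hypothetical test words
words <- c("a", "noon", "made")

# Pattern from above: first character must reappear as a separate last character
original <- str_detect(words, "^(\\w)(\\w*)\\1$")

# Variant: the "rest of the word ending in the first character" part is optional
extended <- str_detect(words, "^(\\w)(\\w*\\1)?$")

print(original)
print(extended)
```

The two versions agree on every word longer than one character and differ only on "a".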

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

sample <- c("pop",  "made", "noon", "june", "arcecar", "church", "noonnoon", "indeeded")
str_view(sample, "(\\w{2})(\\w*)\\1")

## [5] │ <arcecar>
## [6] │ <church>
## [7] │ <noonno>on
## [8] │ in<deede>d
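
Note that \w also matches digits and underscores, so the pattern above accepts inputs that are not letters. If only letters should count, a character class such as [[:alpha:]] is one alternative. A sketch with hypothetical inputs:

```r
library(stringr)

# Hypothetical inputs, including one made of digits
inputs <- c("church", "1212", "made")

# \w treats digits as word characters, so "1212" matches
word_chars <- str_detect(inputs, "(\\w{2})\\w*\\1")

# [[:alpha:]] restricts both the pair and the gap to letters
letters_only <- str_detect(inputs, "([[:alpha:]]{2})[[:alpha:]]*\\1")

print(word_chars)
print(letters_only)
```

For ordinary English words the two patterns give the same answer, so the stricter class only matters when the input may contain digits or underscores.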

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

sample <- c("podorot",  "madadad", "noon", "june", "racecar", "church", "mississippi", "bookkeeper", "hello")
str_view(sample, "\\w*(\\w)\\w*\\1\\w*\\1\\w*")

## [1] │ <podorot>
## [2] │ <madadad>
## [7] │ <mississippi>
## [8] │ <bookkeeper>
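
For a simple yes/no answer, a shorter pattern without the leading and trailing \w* is enough, and the result can be cross-checked by counting characters without any regex. Both versions below are sketches on hypothetical inputs:

```r
library(stringr)

# Hypothetical test words
words <- c("eleven", "hello", "mississippi")

# One captured character that shows up again twice, with anything in between
by_regex <- str_detect(words, "(\\w).*\\1.*\\1")

# Cross-check: split into characters and test whether any count reaches 3
by_count <- vapply(
  strsplit(words, ""),
  function(chars) max(table(chars)) >= 3,
  logical(1)
)

print(by_regex)
print(by_count)
```

The two approaches agree on these words; the counting version treats every character alike, so it would differ from the regex only for strings that contain non-word characters.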