Problem 1

Here I bring the data in using the rvest and tidyverse libraries. html_table() reads every table on the page into a list. Since the page has only one table, I select the first element. The “[-1,]” removes the first row of data, since it was not relevant. The data is pretty messy, so for the purpose of the problem I drop all the other columns and keep only the “MAJOR” column. Then I use a regex to match either “data” or “statistics”, ignoring case.

library(rvest)
library(tidyverse)

content <- read_html("https://projects.fivethirtyeight.com/mid-levels/college-majors/index.html")

tables <- content %>% html_table(fill = TRUE)

majors <- tables[[1]][-1,]
majors <- majors %>% select(2)

data_statistics_majors <- filter(majors, grepl('(data|statistics)', MAJOR, ignore.case = TRUE))

head(data_statistics_majors)

##                                    MAJOR
## 1 Mgmt. Information Systems & Statistics
## 2          Statistics & Decision Science
## 3 Computer Programming & Data Processing
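As a rough, self-contained sketch of the row-dropping and filtering steps, the snippet below uses base R only. The data frame is a made-up stand-in for the scraped table (its rows and the RANK column are assumptions for illustration, not values from the page), so it runs without a network connection.

```r
# Hypothetical stand-in for tables[[1]]; these rows are invented, not scraped.
majors <- data.frame(
  RANK  = c("junk", "1", "2", "3", "4"),
  MAJOR = c("junk header row", "Statistics & Decision Science",
            "Nursing", "Computer Programming & Data Processing", "Geology"),
  stringsAsFactors = FALSE
)

# Drop the irrelevant first row and keep only the MAJOR column.
majors <- majors[-1, "MAJOR", drop = FALSE]

# "data|statistics" matches either word anywhere in the string;
# ignore.case = TRUE makes "DATA", "Data" and "data" all count.
is_match <- grepl("data|statistics", majors$MAJOR, ignore.case = TRUE)
data_statistics_majors <- majors[is_match, , drop = FALSE]

print(data_statistics_majors$MAJOR)
```

Note that the pattern matches substrings, so a major such as "Database Administration" would also be kept. Wrapping the words in word boundaries, for example "\\b(data|statistics)\\b", would restrict the match to whole words.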

Problem 2

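Here I used the regular expression \"[a-z\s?]+\" to extract each quoted element from the string (inside the brackets the ? is a literal question mark and is not actually needed). The extracted elements still contained the escaped quotation marks, so I used str_remove_all to get rid of them.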

fruits <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"

[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  

[9] "elderberry"   "lime"         "lychee"       "mulberry"    

[13] "olive"        "salal berry"'

print(fruits)

## [1] "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\"\n\n[5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"  \n\n[9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"    \n\n[13] \"olive\"        \"salal berry\""

fruit_new <- str_extract_all(fruits,'\\"[a-z\\s?]+\\"')[[1]]

print(fruit_new)

##  [1] "\"bell pepper\""  "\"bilberry\""     "\"blackberry\""   "\"blood orange\""
##  [5] "\"blueberry\""    "\"cantaloupe\""   "\"chili pepper\"" "\"cloudberry\""  
##  [9] "\"elderberry\""   "\"lime\""         "\"lychee\""       "\"mulberry\""    
## [13] "\"olive\""        "\"salal berry\""

fruit_new <- str_remove_all(fruit_new, '\"')

print(fruit_new)

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
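As an alternative sketch (not the approach used above), a capture group can pull out the text between the quotes in one step, so the separate removal step is not needed. This assumes stringr is installed. The shortened fruits_short string is a made-up sample in the same layout as the original input.

```r
library(stringr)

# Shortened, hypothetical sample in the same format as the original string.
fruits_short <- '[1] "bell pepper"  "bilberry"\n\n[3] "blood orange" "lime"'

# The quotes are matched but sit outside the parentheses, so only the
# text between them is captured. str_match_all() returns a list with one
# matrix per input string: column 1 is the full match, column 2 the group.
matches <- str_match_all(fruits_short, '"([a-z ]+)"')[[1]]
fruit_vec <- matches[, 2]

print(fruit_vec)
```

The character class here is just lowercase letters and a space, which is enough for this input. Names containing hyphens or capitals would need a wider class.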

Problem 3

(.)\1\1

Answer: This matches any single character repeated three times in a row. For example: 333, hhh, lll, ???

“(.)(.)\2\1”

Answer: This matches two characters followed by the same two characters in reverse order, i.e. a four-character palindrome. For example: “assa”, “1221”, “1pp1”

(..)\1

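Answer: The capture group here is two characters, and the backreference requires that same pair to appear again immediately afterwards. The two characters do not have to be equal to each other. For example: abab, 1212, and also aaaa as a special case.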

“(.).\1.\1”

Answer: This matches five characters in which the 1st, 3rd, and 5th characters are the same. The 2nd and 4th characters can be anything. For example: “ahapa”, “b1b4b”, “p5p6p”

"(.)(.)(.).*\3\2\1"

Answer: This matches three characters, followed by zero or more of any character, followed by the same three characters in reverse order. The middle part can be any length. For example: 95122335159. See how the string starts with 951 and ends with 159.
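The answers above can be checked with a short sketch in base R. The sample strings are taken from the examples above or chosen for illustration, and the names in the patterns vector are only labels. Inside an R string each backslash has to be doubled, so \1 is written "\\1".

```r
# Each backreference pattern from Problem 3, written as an R string.
patterns <- c(
  triple        = "(.)\\1\\1",
  mirrored_pair = "(.)(.)\\2\\1",
  doubled_pair  = "(..)\\1",
  alternating   = "(.).\\1.\\1",
  reversed_ends = "(.)(.)(.).*\\3\\2\\1"
)

# One string per pattern that should match.
samples <- c("hhh", "assa", "abab", "b1b4b", "95122335159")

# Test pattern i against sample i.
matched <- mapply(function(p, s) grepl(p, s, perl = TRUE), patterns, samples)
print(matched)

# Counterexamples: "abba" is mirrored, not a doubled pair,
# and "hhx" has only two equal characters in a row.
print(grepl("(..)\\1", "abba", perl = TRUE))
print(grepl("(.)\\1\\1", "hhx", perl = TRUE))
```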

Problem 4

(.).*\1
.(..).\1.*
.(.).\1.\1.
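The prompts for Problem 4 are not reproduced here, so the sketch below only shows what the three patterns match as written, and compares them with looser variants. The looser variants are my assumption about the likely intent (start and end with the same character, contain a repeated pair, contain one character at least three times), not part of the original answer. The test words are chosen for illustration.

```r
# The three patterns exactly as written above.
as_written <- c(".*same ends*" = "(.).*\\1",
                "pair"         = ".(..).\\1.*",
                "triple"       = ".(.).\\1.\\1.")

# Assumed intent (hypothetical): anchored or loosened variants.
variant <- c("same ends" = "^(.).*\\1$",
             "pair"      = "(..).*\\1",
             "triple"    = "(.).*\\1.*\\1")

# Without anchors, pattern 1 matches anywhere inside the string,
# so "banana" matches through its repeated "a".
ends_written <- grepl(as_written[1], c("area", "banana", "dog"), perl = TRUE)
ends_variant <- grepl(variant[1],    c("area", "banana", "dog"), perl = TRUE)

# Pattern 2 as written needs exactly one character before the pair
# and exactly one between the two copies.
pair_written <- grepl(as_written[2], c("xabyab", "church"), perl = TRUE)
pair_variant <- grepl(variant[2],    c("xabyab", "church"), perl = TRUE)

# Pattern 3 as written needs the repeats at fixed positions 2, 4 and 6
# of a seven-character stretch.
triple_written <- grepl(as_written[3], c("xayazaw", "eleven"), perl = TRUE)
triple_variant <- grepl(variant[3],    c("xayazaw", "eleven"), perl = TRUE)

print(rbind(ends_written, ends_variant))
print(rbind(pair_written, pair_variant, triple_written, triple_variant))
```

If the assumed intent is right, the fixed-position dots in the second and third patterns are what make them miss words such as "church" and "eleven".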

Homework 3

Jordan Glendrange

February 21, 2021
