DATA_607_HW_3

R Markdown

This is an R Markdown document containing Sean Amato’s work pertaining to HW #3.

Exercise 1:

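# Load the packages used below: stringr for str_detect(), dplyr for select(), and gt for the table.
library(stringr)
library(dplyr)
library(gt)

# First I need to delete row 146 because it represents less than a Bachelor's Degree (df is assumed to be loaded already).
df2 <- df[-146,]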

# Now I'm going to create a column that is TRUE or FALSE for each row, based on whether the 'Major' column contains any of the following text: SCIENCE, OLOGY, COMPUTER, MATH, ENGINEERING, or TECHNO. While this isn't a perfect approach, in my opinion it's a very good first pass. For example, my classifier misses "Physics", even though physicists collect data during experiments and run statistical analyses to make sure that data is sound.

df2$Data_or_Stats <- str_detect(df2$Major, "SCIENCE|OLOGY|COMPUTER|MATH|ENGINEERING|TECHNO")

# Printing the top 10 rows to show the results
head(df2, 10) |>
  select(Major, Data_or_Stats) |>
  gt() |>
  tab_header(title = md("**Data or Stats**"), subtitle = md("Are Data or Stats used in a particular major?")) |>
  cols_width(everything() ~ px(200))

Data or Stats
Are Data or Stats used in a particular major?
Major	Data_or_Stats
GENERAL AGRICULTURE	FALSE
AGRICULTURE PRODUCTION AND MANAGEMENT	FALSE
AGRICULTURAL ECONOMICS	FALSE
ANIMAL SCIENCES	TRUE
FOOD SCIENCE	TRUE
PLANT SCIENCE AND AGRONOMY	TRUE
SOIL SCIENCE	TRUE
MISCELLANEOUS AGRICULTURE	FALSE
FORESTRY	FALSE
NATURAL RESOURCES MANAGEMENT	FALSE
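
To make the limits of this keyword approach concrete, here is a minimal sketch in base R (grepl() rather than str_detect(), so it runs without extra packages). The `majors` vector is a hypothetical sample written for illustration, not rows taken from the data. It shows one miss (PHYSICS) and one likely false positive (a name that happens to contain OLOGY).

```r
# Hypothetical major names, chosen only to illustrate the classifier.
majors <- c(
  "FOOD SCIENCE",
  "PHYSICS",
  "COSMETOLOGY SERVICES AND CULINARY ARTS",
  "COMPUTER ENGINEERING",
  "FORESTRY"
)

# Build one alternation pattern from the keyword list used above.
keywords <- c("SCIENCE", "OLOGY", "COMPUTER", "MATH", "ENGINEERING", "TECHNO")
pattern <- paste(keywords, collapse = "|")

# TRUE where the major name contains at least one keyword.
flags <- grepl(pattern, majors)
print(data.frame(Major = majors, Data_or_Stats = flags))
```

Keeping the keywords in a vector makes it easy to add or drop a term (for example PHYSICS) without retyping the whole condition.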

Exercise 2:

# For this problem I copied and pasted the data directly from the homework and stored it in "string1", wrapped in single quotes, without any formatting.
string1 <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
[9] "elderberry"   "lime"         "lychee"       "mulberry"    
[13] "olive"        "salal berry"'

# In the following lines I remove the digits first, which turns each index such as [1] into [], and then remove the now-empty square brackets.
string2 <- str_remove_all(string1, "[0-9]")
string3 <- str_remove_all(string2, "\\[\\]")

# Here I've extracted items in quotations and stored them in matches.
matches <- str_extract_all(string3, '"([^"]+)"') # The regex here does all the heavy lifting.
x <- unlist(matches)
x2 <- str_remove_all(x, "\"")
print(x2)

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
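
As a side note, the quoted-item pattern can also be applied to the raw text directly: it only captures what sits between double quotes, so the bracketed indices are ignored anyway, and an item name that contains a digit would be left intact. The sketch below uses base R and a shortened, three-item version of the input; `raw`, `quoted`, and `items` are illustrative names, not part of the homework.

```r
# Shortened sample of the printed vector (two lines, three items).
raw <- '[1] "bell pepper"  "bilberry"\n[3] "lime"'

# Pull out every run of text enclosed in double quotes.
quoted <- regmatches(raw, gregexpr('"[^"]+"', raw))[[1]]

# Strip the surrounding quote characters.
items <- gsub('"', "", quoted, fixed = TRUE)
print(items)
```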

Exercise 3: Describe in words what the following expressions will match:
(.)\1\1 - This expression matches any single character repeated three times in a row, e.g. “aaa”.
“(.)(.)\2\1” - This expression matches any four-character palindrome, that is, two characters followed by the same two characters in reverse order, e.g. “abba” or “....”.
(..)\1 - This expression matches any pair of characters that is immediately repeated, e.g. “abab”.
“(.).\1.\1” - This expression matches a five-character string in which positions 1, 3, and 5 hold the same character and positions 2 and 4 hold any characters, e.g. “abaca” or “a.a.a”.
“(.)(.)(.).*\3\2\1” - This expression matches any three characters, followed by zero or more of any characters, followed by the same three characters in reverse order, e.g. “abccba” or “abc I love taco tuesday! cba”.
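
The descriptions above can be checked with a short sketch. It uses base R's grepl() with perl = TRUE, and the test strings are examples chosen for illustration. Note that each backslash has to be doubled inside an R string.

```r
# The five expressions, written as R strings (backslashes doubled).
patterns <- c(
  triple      = "(.)\\1\\1",
  palindrome4 = "(.)(.)\\2\\1",
  pair_twice  = "(..)\\1",
  alternating = "(.).\\1.\\1",
  mirrored    = "(.)(.)(.).*\\3\\2\\1"
)

# One string per pattern that should match it.
examples <- c("aaa", "abba", "abab", "abaca", "abc-xyz-cba")

# Test each pattern against its own example.
hits <- mapply(function(p, s) grepl(p, s, perl = TRUE), patterns, examples)

# A string with no repeated characters should match none of them.
misses <- sapply(patterns, grepl, x = "abcd", perl = TRUE)

print(hits)
print(misses)
```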

Exercise 4: Construct regular expressions to match words that:
1. Start and end with the same character.
2. Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
3. Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

word_list <- c("racecar", "church", "eleven", "taco", "spinning", "retroactive", "ubuntu", "eel", "lyrically")
# 1: Start and end with the same character.
answer1 <- str_view(word_list, "^(.).*\\1$")
print(answer1)

## [1] │ <racecar>
## [7] │ <ubuntu>
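
One edge case worth noting: the pattern above needs at least two characters, so a one-letter word such as "a" is not matched even though it trivially starts and ends with the same character. A possible variant, shown here as a sketch in base R with an illustrative word list, makes everything after the first character optional.

```r
# Illustrative words, including a one-letter and a two-letter case.
words <- c("a", "aa", "racecar", "ubuntu", "taco", "eel")

# Original pattern: requires a second occurrence of the first character.
strict <- grepl("^(.).*\\1$", words, perl = TRUE)

# Variant: the "anything, then the first character again" part is optional.
relaxed <- grepl("^(.)(.*\\1)?$", words, perl = TRUE)

print(data.frame(word = words, strict = strict, relaxed = relaxed))
```

Whether one-letter words should count is a judgment call; for the word list in this exercise both patterns give the same result.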

# 2: Contain a repeated pair of letters.
answer2 <- str_view(word_list, "(..).*\\1")
print(answer2)

## [2] │ <church>
## [5] │ sp<innin>g
## [9] │ <lyrically>

# 3: Contain one letter repeated in at least three places.
answer3 <- str_view(word_list, "(.).*\\1.*\\1.*")
print(answer3)

## [3] │ <eleven>
## [5] │ spi<nning>
## [7] │ <ubuntu>
## [9] │ <lyrically>
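
The trailing .* in the third pattern only widens the highlighted span to the end of the word; it should not change which words match. A quick check in base R, reusing the same word list and assuming perl = TRUE regex semantics:

```r
word_list <- c("racecar", "church", "eleven", "taco", "spinning",
               "retroactive", "ubuntu", "eel", "lyrically")

# Same pattern with and without the trailing ".*".
with_tail    <- grepl("(.).*\\1.*\\1.*", word_list, perl = TRUE)
without_tail <- grepl("(.).*\\1.*\\1",   word_list, perl = TRUE)

# Both versions flag the same words.
print(identical(with_tail, without_tail))
print(word_list[without_tail])
```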

DATA_607_HW_3_SA

Sean Amato

2023-09-21
