Lab 5: Working with Text and Strings

Author

Amanda Rose Knudsen

Overview

In this lab you will perform a series of exercises that use text and string manipulation to analyze data with text, manipulate data containing strings, apply regular expressions, or handle data files with unusual formats or text strings.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(fivethirtyeight)

Problems

Problem 1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset, provide code that identifies the majors that contain either “DATA” or “STATISTICS”, case insensitive. You can find this dataset in R by installing the package fivethirtyeight and using the major column in either college_recent_grads, college_new_grads, or college_all_ages.

collegemajorslist <- college_all_ages |> 
  select(major)

collegemajorslist <- separate_longer_delim(
  collegemajorslist, major, delim = ",") 

collegemajorslist |> 
  # ignore_case belongs to the pattern (via regex()), not to filter()
  filter(str_detect(major, regex("data|statistics", ignore_case = TRUE)))
# A tibble: 3 × 1
  major                                        
  <chr>                                        
1 Computer Programming And Data Processing     
2 Statistics And Decision Science              
3 Management Information Systems And Statistics
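
In stringr, case-insensitive matching is requested on the pattern itself, by wrapping it in regex(..., ignore_case = TRUE), rather than through an argument to filter() or str_detect(). A minimal sketch on a hypothetical vector of major names (illustrative values, not the real dataset):

```r
library(stringr)

# Hypothetical major names in mixed case, for illustration only
majors <- c("STATISTICS AND DECISION SCIENCE", "Data Processing",
            "General Biology", "applied statistics")

# regex() carries the ignore_case flag, so one pattern covers every casing
pattern <- regex("data|statistics", ignore_case = TRUE)
matched <- majors[str_detect(majors, pattern)]
matched
```

Because the flag travels with the pattern, the same pattern object can be reused in str_subset(), str_extract(), and similar functions.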

Problem 2. Write code that transforms the data below:

[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

As your starting point take the string defined in the following code chunk:

messyString = ' [1] "bell pepper" "bilberry" "blackberry" "blood orange" \n
 [5] "blueberry" "cantaloupe" "chili pepper" "cloudberry" \n
 [9] "elderberry" "lime" "lychee" "mulberry" \n
 [13] "olive"  "salal berry" '
cleanString <- messyString |> 
  str_replace_all("\n", " ") |>   # Remove line breaks 
  str_replace_all("\\[\\s*\\d+\\]", "") |> # Remove leading indices 
  str_extract_all('"([^"]+)"') |>    # Extract values within quotes
  unlist() 

cleanString
 [1] "\"bell pepper\""  "\"bilberry\""     "\"blackberry\""   "\"blood orange\""
 [5] "\"blueberry\""    "\"cantaloupe\""   "\"chili pepper\"" "\"cloudberry\""  
 [9] "\"elderberry\""   "\"lime\""         "\"lychee\""       "\"mulberry\""    
[13] "\"olive\""        "\"salal berry\"" 
cleanString2 <- str_flatten(cleanString, collapse = ", ")  # Join the quoted items with commas
endString <- str_c("c(", cleanString2, ")")                # Wrap in c( ... ) to match the target format
cat(endString)

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

Hint: There are many different ways to solve this problem, but if you use str_extract_all a helpful flag that returns a character vector instead of a list is simplify=TRUE. Then you can apply other tools from stringr if needed.
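
The simplify = TRUE route mentioned in the hint can be sketched as follows. This is one possible variant, not the only solution; the input here is a hypothetical shortened string with the same shape as messyString, and the names messy, quoted, fruits, and result are illustrative.

```r
library(stringr)

# Hypothetical shortened input, same shape as messyString
messy <- ' [1] "bell pepper" "bilberry" \n [3] "lime" "salal berry" '

# simplify = TRUE returns a character matrix (one row per input string)
quoted <- str_extract_all(messy, '"[^"]+"', simplify = TRUE)

# Drop the surrounding quotes and flatten the matrix to a plain vector
fruits <- str_remove_all(as.vector(quoted), '"')

# Rebuild the c("...", "...") text from the clean vector
result <- str_c("c(", str_flatten(str_c('"', fruits, '"'), ", "), ")")
cat(result, "\n")
```

Keeping an unquoted vector (fruits) as an intermediate step means the same values can be used directly in R, while result holds the printable c(...) form.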

Problem 3. Describe, in words, what these regular expressions will match. Read carefully to see if each entry is a regular expression or a string that defines a regular expression.

  • ^.*$
    • I think that this is a regular expression rather than a string that defines a regular expression. Written as a string in R it would only need quotation marks, since nothing in it has to be escaped.
    • It matches an entire string: ^ anchors the match at the beginning and $ anchors it at the end. The period . and the asterisk * are both metacharacters: . matches any single character (except a newline, by default), and * means zero or more repetitions of the preceding element. Together they match any string with no line breaks, including the empty string.
  • "\\{.+\\}"
    • This is, I think, a string that defines a regular expression.
    • It would match a literal “{”, followed by one or more of any character (because of “.+”), followed by a literal “}”. The braces are escaped because { and } otherwise denote repetition counts.
  • \d{4}-\d{2}-\d{2}
    • This, I think, is a regular expression rather than a string that defines a regular expression. If we were putting it in a string to define a regular expression we’d need to escape the backslashes before the “\d”s and the whole thing would need to be in quotation marks.
    • This would match 4 digits followed by a hyphen followed by 2 digits followed by a hyphen followed by 2 digits. This is like the “YYYY-MM-DD” date format.
  • "\\\\{4}"
    • This is a string that defines a regular expression, I think. The string "\\\\{4}" defines the regular expression \\{4}, which matches exactly four backslashes in a row.
    • Based on what it says in the textbook in 15.4.1, the string “\\\\” matches just one literal “\”: R first reads the string and turns each “\\” into a single backslash, leaving the regular expression \\, which is itself an escaped backslash. The {4} then means exactly four of the preceding element, so the text being searched needs four literal backslashes in a row. Where this would ever be needed, I am not certain, but working through it helped me finally feel like I have a grasp on why there are so many layers of escapes in R.
  • "(..)\\1"
    • This is a string that defines a regular expression. It would find any two characters that are immediately repeated. The “(..)” is the first capturing group and matches any two characters, and the “\\1” refers back to whatever that group matched, so it finds instances of those two characters repeating. (It would match, for example, “coconut” because “co” repeats right after the first “co”.)
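
The descriptions above can be checked with a short sketch. All input strings below are hypothetical examples chosen for illustration; writeLines() is used to show the regular expression that an R string actually defines.

```r
library(stringr)

# writeLines() prints the regular expression that a string defines
writeLines("\\\\{4}")          # prints \\{4}

# The text below contains four literal backslashes between "a" and "b"
four_slashes <- "a\\\\\\\\b"
two_slashes  <- "a\\\\b"
has_four <- str_detect(four_slashes, "\\\\{4}")   # TRUE
has_two  <- str_detect(two_slashes, "\\\\{4}")    # FALSE

# Backreference: two characters immediately repeated
repeated <- str_extract("coconut", "(..)\\1")     # "coco"

# Escaped braces around one or more characters
braces <- str_extract("f({x + 1})", "\\{.+\\}")   # "{x + 1}"

# Date-shaped text; the regex \d{4}-\d{2}-\d{2} written as an R string
date_like <- str_detect("2024-03-15", "\\d{4}-\\d{2}-\\d{2}")  # TRUE
```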

Problem 4. Construct regular expressions to match words that:

  • Start with “y”.

    • "\\b[y]\\w+": the word boundary “\b” makes sure that the “y” is at the beginning of the word, and the “+” after “\w” requires one or more word characters after the “y”, so the match covers the full word that starts with a lowercase “y”.
    pattern1 <- "\\b[y]\\w+"
    matchpattern1 <- stringr::words[str_detect(words, pattern1)]
    head(matchpattern1, 10)
    [1] "year"      "yes"       "yesterday" "yet"       "you"       "young"    
  • Have seven letters or more.

    • “(.......)+”
    pattern2 <- "(.......)+"
    matchpattern2 <- stringr::words[str_detect(words, pattern2)]
    head(matchpattern2, 10)
     [1] "absolute"  "account"   "achieve"   "address"   "advertise" "afternoon"
     [7] "against"   "already"   "alright"   "although" 
  • Contain a vowel-consonant pair

    • “[aeiou][^aeiou]”
pattern3 <- "[aeiou][^aeiou]"
matchpattern3 <- stringr::words[str_detect(words, pattern3)]
head(matchpattern3, 10)
 [1] "able"     "about"    "absolute" "accept"   "account"  "achieve" 
 [7] "across"   "act"      "active"   "actual"  
  • Contain at least two vowel-consonant pairs in a row.

    • “([aeiou][^aeiou])([aeiou][^aeiou])”
    pattern4 <- "([aeiou][^aeiou])([aeiou][^aeiou])"
    matchpattern4 <- stringr::words[str_detect(words, pattern4)]
    head(matchpattern4, 10)
     [1] "absolute"  "agent"     "along"     "america"   "another"   "apart"    
     [7] "apparent"  "authority" "available" "aware"    
  • Contain the same vowel-consonant pair repeated twice in a row.

    • “([aeiou][^aeiou])\\1”
pattern5 <- "([aeiou][^aeiou])\\1"
matchpattern5 <- stringr::words[str_detect(words, pattern5)]
head(matchpattern5, 10)
[1] "remember"

For each example, verify that they work by running them on the stringr::words dataset and show the first 10 results (hint: combine str_detect and logical subsetting).
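
As a cross-check, the same conditions can also be written with anchors and counted repetition. This is a sketch of alternative spellings, not a replacement for the patterns above; sample_words is a hypothetical vector, and str_subset() is shorthand for combining str_detect() with logical subsetting.

```r
library(stringr)

# Hypothetical words chosen to exercise each pattern
sample_words <- c("yes", "by", "against", "remember", "tree", "absolute")

# Starts with "y": anchor at the start of the string
starts_y <- str_subset(sample_words, "^y")

# Seven letters or more: counted repetition instead of seven dots
seven_plus <- str_subset(sample_words, "^[a-z]{7,}$")

# Two vowel-consonant pairs in a row: repeat the pair with {2}
two_pairs <- str_subset(sample_words, "([aeiou][^aeiou]){2}")

# The same pair twice in a row: backreference to the captured pair
same_pair <- str_subset(sample_words, "([aeiou][^aeiou])\\1")
```

Note that {2} repeats the pattern (any two pairs), while \\1 repeats the matched text (the identical pair), which is why “absolute” appears in two_pairs but not in same_pair.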

Problem 5. Consider the gss_cat data-frame discussed in Chapter 16 of R4DS (provided as part of the forcats package):

  • Create a new variable that describes whether the party-id of a survey respondent is “strong” if they are a strong republican or strong democrat, “weak” if they are a not strong democrat, not strong republican, or independent of any type, and “other” for the rest.
stronggroup <- c("Strong democrat", "Strong republican")
weakgroup <- c("Not str democrat", "Not str republican", "Independent", 
               "Ind,near dem", "Ind,near rep")
othergroup <- c("No answer", "Don't know", "Other party")

gss_cat_updated <- gss_cat |> 
  mutate(
    # Recode each party ID into one of three strength groups
    partyid_type = case_when(
      partyid %in% stronggroup ~ "strong",
      partyid %in% weakgroup ~ "weak",
      partyid %in% othergroup ~ "other"
    )
  )

gss_cat_updated |> 
  group_by(partyid_type) |> 
  summarise(n = n())
# A tibble: 3 × 2
  partyid_type     n
  <chr>        <int>
1 other          548
2 strong        5804
3 weak         15131
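
Since partyid is a factor, the same recoding could also be expressed with forcats::fct_collapse(), which keeps the result as a factor instead of a character column. A minimal sketch of that alternative, assuming the level names used in the groups above:

```r
library(forcats)

# fct_collapse() maps groups of existing levels onto new level names
partyid_type <- fct_collapse(
  gss_cat$partyid,
  strong = c("Strong democrat", "Strong republican"),
  weak   = c("Not str democrat", "Not str republican", "Independent",
             "Ind,near dem", "Ind,near rep"),
  other  = c("No answer", "Don't know", "Other party")
)

# Count respondents per collapsed level
counts <- table(partyid_type)
counts
```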
  • Calculate the mean hours of TV watched by each of the groups “strong”, “weak”, and “other” and display it with a dot-plot (geom_point). Sort the levels in the dot-plot so that the group appears in order of most mean TV hours watched.
gss_cat_updated |> 
  group_by(partyid_type) |> 
  summarise(mean_tv = mean(tvhours, na.rm = TRUE)) |> 
  # Keep mean_tv numeric and order the groups by it
  ggplot(aes(x = mean_tv, y = fct_reorder(partyid_type, mean_tv))) +
  geom_point() + 
  labs(title = "Mean TV Hours Watched by PartyId Types",
       x = "Mean TV hours per day", y = "Party ID type")
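
Sorting the groups in a dot plot comes down to reordering the factor levels by the numeric summary. A minimal sketch of how fct_reorder() does this; the mean_tv values here are hypothetical placeholders, not results computed from gss_cat:

```r
library(forcats)

groups  <- c("other", "strong", "weak")
mean_tv <- c(2.5, 3.1, 2.9)   # hypothetical values, for illustration only

# Levels are ordered by the second argument, ascending by default
ordered <- fct_reorder(groups, mean_tv)
levels(ordered)

# .desc = TRUE reverses the order, largest value first
ordered_desc <- fct_reorder(groups, mean_tv, .desc = TRUE)
levels(ordered_desc)
```

On a ggplot2 y-axis the first level is drawn at the bottom, so the default ascending order places the group with the highest mean at the top of the plot.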