Lab 5: Working with Text and Strings

Author

Darwhin Gomez

Overview

In this lab you will work through a series of exercises that use text and string manipulation to analyze data containing text, manipulate string data, apply regular expressions, or handle data files with unusual formats or text strings.

Problems

Problem 1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset, provide code that identifies the majors that contain either “DATA” or “STATISTICS”, case insensitive. You can find this dataset in R by installing the package fivethirtyeight and using the major column in either college_recent_grads, college_grad_students, or college_all_ages.

library(tidyverse)
library(fivethirtyeight)
theme_set(theme_minimal())
college_all_ages|>
  filter(str_detect(major,"(?i)data|statistics"))|>
  select(major)
# A tibble: 3 × 1
  major                                        
  <chr>                                        
1 Computer Programming And Data Processing     
2 Statistics And Decision Science              
3 Management Information Systems And Statistics
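
The inline (?i) flag above can also be written with stringr's regex() helper, which some readers find easier to scan. The following is a minimal sketch of that alternative; majors_demo is a small hand-typed vector used only for illustration (it is not the dataset), so the example runs without the fivethirtyeight package.

```r
library(stringr)

# Hypothetical sample of major names, typed by hand for illustration only
majors_demo <- c(
  "Computer Programming And Data Processing",
  "Statistics And Decision Science",
  "Zoology",
  "STATISTICS AND DECISION SCIENCE"
)

# regex(..., ignore_case = TRUE) makes the whole pattern case-insensitive
pattern <- regex("data|statistics", ignore_case = TRUE)

# str_subset() keeps only the elements that match the pattern
matched <- str_subset(majors_demo, pattern)
print(matched)
```

On the real data, the same pattern object could presumably be passed to str_detect() inside filter() in place of the "(?i)data|statistics" string.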

Problem 2. Write code that transforms the data below:

[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

As your starting point take the string defined in the following code chunk:

messyString = ' [1] "bell pepper" "bilberry" "blackberry" "blood orange" \n
 [5] "blueberry" "cantaloupe" "chili pepper" "cloudberry" \n
 [9] "elderberry" "lime" "lychee" "mulberry" \n
 [13] "olive"  "salal berry" '

Hint: There are many different ways to solve this problem, but if you use str_extract_all a helpful flag that returns a character vector instead of a list is simplify=TRUE. Then you can apply other tools from stringr if needed.

foodthings<-str_extract_all(messyString,'"[^"]+"',simplify = TRUE)
foodthings
     [,1]              [,2]           [,3]             [,4]              
[1,] "\"bell pepper\"" "\"bilberry\"" "\"blackberry\"" "\"blood orange\""
     [,5]            [,6]             [,7]               [,8]            
[1,] "\"blueberry\"" "\"cantaloupe\"" "\"chili pepper\"" "\"cloudberry\""
     [,9]             [,10]      [,11]        [,12]          [,13]      
[1,] "\"elderberry\"" "\"lime\"" "\"lychee\"" "\"mulberry\"" "\"olive\""
     [,14]            
[1,] "\"salal berry\""
# row 1 of the matrix holds all 14 matches
foodstring <- str_c(foodthings[1, ], collapse = ", ")

foodstring
[1] "\"bell pepper\", \"bilberry\", \"blackberry\", \"blood orange\", \"blueberry\", \"cantaloupe\", \"chili pepper\", \"cloudberry\", \"elderberry\", \"lime\", \"lychee\", \"mulberry\", \"olive\", \"salal berry\""
cat("c(", foodstring, ")", sep = "")
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
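
If an actual character vector is wanted rather than printed text, one option is to strip the quote characters from each match and let dput() write the vector out in c(...) form. This is only a sketch of that alternative; the names quoted and fruits are illustrative, and messyString is repeated here so the block is self-contained.

```r
library(stringr)

messyString <- ' [1] "bell pepper" "bilberry" "blackberry" "blood orange" \n
 [5] "blueberry" "cantaloupe" "chili pepper" "cloudberry" \n
 [9] "elderberry" "lime" "lychee" "mulberry" \n
 [13] "olive"  "salal berry" '

# Extract each double-quoted item; [[1]] takes the matches for the single input string
quoted <- str_extract_all(messyString, '"[^"]+"')[[1]]

# Drop the quote characters themselves, leaving the bare names
fruits <- str_remove_all(quoted, '"')

# dput() prints the vector as R code, i.e. in c("...", "...") form
dput(fruits)
```

The result is a length-14 character vector that can be reused directly, which the cat() approach does not provide.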

Problem 3. Describe, in words, what these regular expressions will match. Read carefully to see if each entry is a regular expression or a string that defines a regular expression.

  • ^.*$

    A regular expression that matches any entire string containing no newline, including the empty string. ^ is the starting anchor, . is a wildcard meaning any character except a newline, * means zero or more repetitions (which is why an empty string also matches), and $ is the ending anchor.

  • "\\{.+\\}"

    A string that defines the regular expression \{.+\}. It matches text enclosed in curly braces: a literal {, then at least one character other than a newline, then a literal }.

  • \d{4}-\d{2}-\d{2}

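    A regular expression for a date-like format: four digits, a hyphen, two digits, a hyphen, and two digits, as in yyyy-mm-dd.

    The value inside each { } sets how many digits are required. The pattern checks only the shape, so it does not verify that the month and day are valid.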

  • "\\\\{4}"

    A string that defines the regular expression \\{4}, which matches a literal backslash four times in a row. The matched text is \\\\.

  • "(..)\\1"

    A string that defines the regular expression (..)\1, which matches any two characters that are immediately repeated. For example:
    coconut: it would find coco
    banana: it would find anan
    papal: it would find papa
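
The descriptions above can be checked directly in R. The sketch below is one way to do so: writeLines() prints the regular expression that a string actually defines, and str_detect() / str_extract() are run on a few hand-picked test inputs (the inputs and variable names are illustrative assumptions, not part of the lab).

```r
library(stringr)

# writeLines() shows the regex a string defines, after R's own escaping is removed
writeLines("\\{.+\\}")   # the regex \{.+\}
writeLines("\\\\{4}")    # the regex \\{4}
writeLines("(..)\\1")    # the regex (..)\1

# Four literal backslashes (written as eight in R source)
four_backslashes <- "\\\\\\\\"
has_four <- str_detect(four_backslashes, "\\\\{4}")

# Two characters that immediately repeat
repeats <- str_extract(c("coconut", "banana", "papal"), "(..)\\1")

# Text wrapped in curly braces
braces <- str_extract("x = {abc}", "\\{.+\\}")

# A yyyy-mm-dd shaped date; inside an R string, \d must be written \\d
date_found <- str_extract("due 2024-10-05", "\\d{4}-\\d{2}-\\d{2}")

print(has_four)
print(repeats)
print(braces)
print(date_found)
```

The third bullet is given as a bare regular expression, which is why it needs the extra backslashes once it is typed as an R string.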

Problem 4. Construct regular expressions to match words that:

head(words)
[1] "a"        "able"     "about"    "absolute" "accept"   "account" 
  • Start with “y”.

    lettersY<- words|>
      str_detect("^y")
    words[lettersY]
    [1] "year"      "yes"       "yesterday" "yet"       "you"       "young"    
  • Have seven letters or more.

    bigwords <- words |>
      str_detect("[A-Za-z]{7,}") 
    
    words[bigwords][1:10]
     [1] "absolute"  "account"   "achieve"   "address"   "advertise" "afternoon"
     [7] "against"   "already"   "alright"   "although" 
  • Contain a vowel-consonant pair

    vow_cons<- words|>
      str_detect("[aeiouAEIOU][^aeiouAEIOU]")
    words[vow_cons][1:10]
     [1] "able"     "about"    "absolute" "accept"   "account"  "achieve" 
     [7] "across"   "act"      "active"   "actual"  
  • Contain at least two vowel-consonant pairs in a row.

    vow_cons2x<- words|>
      str_detect("([aeiouAEIOU][^aeiouAEIOU]){2}")
    
    words[vow_cons2x][30:40]
     [1] "depend"     "design"     "develop"    "difference" "difficult" 
     [6] "direct"     "divide"     "document"   "during"     "economy"   
    [11] "educate"   
  • Contain the same vowel-consonant pair repeated twice in a row.

    vow_cons_exact <- words |>
      str_detect("([aeiouAEIOU][^aeiouAEIOU])\\1")
    words[vow_cons_exact]
    [1] "remember"

For each example, verify that they work by running them on the stringr::words dataset and show the first 10 results (hint: combine str_detect and logical subsetting).
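
As a shorter route to the same kind of check, str_subset() combines str_detect() and logical subsetting in one call, and head() caps the display at ten results. This is a sketch of that variant, not a replacement for the answers above; the variable names are illustrative, and "^.{7,}$" is an anchored alternative that counts every character in the word rather than only letters.

```r
library(stringr)

# str_subset(x, p) is shorthand for x[str_detect(x, p)]
starts_with_y <- str_subset(words, "^y")
seven_plus    <- str_subset(words, "^.{7,}$")
repeated_pair <- str_subset(words, "([aeiou][^aeiou])\\1")

# head(x, 10) shows at most the first 10 matches
head(starts_with_y, 10)
head(seven_plus, 10)
head(repeated_pair, 10)
```

The lowercase-only character classes assume the input is all lowercase, as stringr::words appears to be; mixed-case input would need the uppercase vowels as well.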


Problem 5. Consider the gss_cat data frame discussed in Chapter 16 of R4DS (provided as part of the forcats package):

  • Create a new variable that describes whether the party-id of a survey respondent is “strong” if they are a strong republican or strong democrat, “weak” if they are a not strong democrat, not strong republican, or independent of any type, and “other” for the rest.

    # let's look at the levels of partyid
    levels(gss_cat$partyid)
     [1] "No answer"          "Don't know"         "Other party"       
     [4] "Strong republican"  "Not str republican" "Ind,near rep"      
     [7] "Independent"        "Ind,near dem"       "Not str democrat"  
    [10] "Strong democrat"   
    # count() works better: in the console the row number also shows each level's integer code
    gss_cat|>
      count(partyid)
    # A tibble: 10 × 2
       partyid                n
       <fct>              <int>
     1 No answer            154
     2 Don't know             1
     3 Other party          393
     4 Strong republican   2314
     5 Not str republican  3032
     6 Ind,near rep        1791
     7 Independent         4119
     8 Ind,near dem        2499
     9 Not str democrat    3690
    10 Strong democrat     3490
    politic_levels<-c("strong","weak","other")
    # use the integer codes of the partyid levels to fill party_commitment
    # (this depends on the level order shown above)
    gss_cat <- gss_cat |>
      mutate(party_commitment = case_when(
        as.numeric(partyid) %in% c(4, 10) ~ "strong",        # strong republican or democrat
        as.numeric(partyid) %in% c(5, 6, 7, 8, 9) ~ "weak",  # not strong, or any independent
        TRUE ~ "other"                                       # no answer, don't know, other party
      )) |>
      mutate(party_commitment = factor(party_commitment, levels = politic_levels))
    gss_cat|>
      count(party_commitment)
    # A tibble: 3 × 2
      party_commitment     n
      <fct>            <int>
    1 strong            5804
    2 weak             15131
    3 other              548
  • Calculate the mean hours of TV watched by each of the groups “strong”, “weak”, and “other” and display it with a dot-plot (geom_point). Sort the levels in the dot-plot so that the group appears in order of most mean TV hours watched.

    gss_cat |>
      group_by(party_commitment) |>
      summarise(mean_tv = mean(tvhours, na.rm = TRUE)) |>
      # fct_reorder() sorts the groups by mean TV hours along the y-axis
      ggplot(aes(x = mean_tv, y = fct_reorder(party_commitment, mean_tv))) +
      geom_point() +
      labs(
        x = "Average Hours of TV watched",
        y = "Commitment to party",
        title = "Average TV \nWatched by levels of \nPolitical Commitment"
      )
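
The party_commitment grouping in this problem selects levels by their integer position, which would silently break if the level order of partyid ever changed. A possible alternative, sketched below, matches on the level names instead. The names strong_levels, weak_levels, commitment, and counts are illustrative, and the result is stored in a new object so gss_cat itself is left untouched.

```r
library(dplyr)
library(forcats)  # provides the gss_cat data frame

# Group partyid by level name rather than by integer position
strong_levels <- c("Strong republican", "Strong democrat")
weak_levels <- c("Not str republican", "Ind,near rep", "Independent",
                 "Ind,near dem", "Not str democrat")

commitment <- gss_cat |>
  mutate(party_commitment = case_when(
    partyid %in% strong_levels ~ "strong",
    partyid %in% weak_levels ~ "weak",
    TRUE ~ "other"  # everything else: no answer, don't know, other party
  )) |>
  mutate(party_commitment = factor(party_commitment,
                                   levels = c("strong", "weak", "other")))

# Tabulate the three groups to compare against the count() output above
counts <- count(commitment, party_commitment)
print(counts)
```

Matching by name is more verbose, but a misspelled level simply falls into "other" and shows up as an unexpected count, which is easier to notice than a shifted index.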