Strings

Strings (just sequences of natural language letters, intended to be legible by humans) are super useful for doing categorical data analysis. Whenever people from outside R look at the way it’s used, often they’re struck by how flexible and fundamental R’s string implementation is.

Manipulating and interacting with label text is often super useful and allows for very concise R code (for instance, you want to collapse degrees of “Agreement” and “Disagreement” across a 5 value scale). But strings also are super easy to coerce into formulas–now all of a sudden strings are the way we manage sets of models.
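As a minimal sketch of the strings-to-formulas idea, assuming the built-in mtcars data and an arbitrary set of predictors chosen purely for illustration, a character vector can become a list of formulas and then a list of fitted models:

```r
# Hypothetical example: the outcome and predictor sets are illustrative choices,
# using the mtcars data that ships with R.
outcome <- "mpg"
rhs <- c("wt", "wt + hp", "wt + hp + factor(cyl)")

# Build one formula string per model, then coerce each string to a formula
formula_strings <- paste(outcome, "~", rhs)
formulas <- lapply(formula_strings, as.formula)

# Fit every model in the set with the same call
models <- lapply(formulas, lm, data = mtcars)
names(models) <- formula_strings

# Number of estimated coefficients in each model
n_coef <- sapply(models, function(m) length(coef(m)))
```

Adding a model to the set then only takes one more string in rhs.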

Strings are a pretty poorly taught concept in classic graduate stats texts–I suspect it’s a result of the fact that so many statisticians are trained using continuous data (agronomists, biostatisticians, epidemiologists, econometricians, etc. have written the seminal graduate stats texts from the 20th century) and they basically looked down on strings as a mathematically uninteresting application for people interested in data analysis. But doing statistics to attitudinal data is where strings shine.

Separate from learning how to use the stringr library in the tidyverse, but comparably important, is the need to understand Regular Expressions, or regex–a way to summarize patterns of strings in a symbolic fashion. There are no shortcuts on regex–you google, you ChatGPT, you go to StackOverflow. You kvetch, you lament. Just don’t give up. It’s tough for everyone. Even XKCD jokes about it.

Regex is like garlic–a little goes a long way.

The stringr library

stringr, the Tidyverse string package, wraps the stringi library, which itself is mostly written in C++ to achieve speed.

stringr features a lovely design choice–a common prefix for all functions, str_, which lets you use RStudio’s autocomplete magic with TAB.

stringr function | Base equivalent | Purpose
str_c() | paste() | string combine, to join multiple strings into one
str_detect() | grepl() | detects if a pattern is found in a string
str_extract() | regmatches() | extracts a pattern from a string
str_remove() | sub() | removes a pattern from a string
str_locate() | regexpr() | locates a pattern in a string, and returns the location as a numeric matrix
str_sub() | substr() | returns a substring, i.e. a small subsection of a longer string
str_pad() | sprintf() | pads a string up to a minimum width
str_length() | nchar() | measures the length of a string
str_to_lower(), str_to_upper(), str_to_title() | tolower(), toupper(), no equivalent | pretty formatting of strings

Whoa, there’s a lot to cover. Let’s get to typing.

Pasting with str_c()

library(tidyverse)
library(magrittr)

str_c(
  "First",
  c("Second 1",
    "Second 2", 
    "Second 3"),
  "Third", 
  sep = " "
  )
## [1] "First Second 1 Third" "First Second 2 Third" "First Second 3 Third"

Recycling is (excuse me: was, until December 15 2022, when stringr adopted the tidyverse recycling rules; see footnote 1) a super useful trait for pasting. Base R’s paste0() still recycles shorter vectors freely:

paste0(
    LETTERS[1:4],
  "_",
  letters[1:2]
  )
## [1] "A_a" "B_b" "C_a" "D_b"
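To make the contrast concrete, here is a small sketch of how the same call is expected to behave in stringr, assuming version 1.5.0 or later is installed. The tryCatch() wrapper is only there so the script keeps running:

```r
library(stringr)

# A length-1 vector is still recycled under the tidyverse rules
one <- str_c(LETTERS[1:4], "_", "a")

# Length 4 against length 2: base paste0() recycles,
# while stringr 1.5.0 or later is expected to raise an error instead
base_result <- paste0(LETTERS[1:4], "_", letters[1:2])
stringr_result <- tryCatch(
  str_c(LETTERS[1:4], "_", letters[1:2]),
  error = function(e) "error"
)

# Recycling explicitly with rep() gives the old behavior back
explicit <- str_c(LETTERS[1:4], "_", rep(letters[1:2], times = 2))
```

Writing the rep() call out is more typing, but it makes the intended recycling visible to the next reader.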

Detecting strings with str_detect() and subsetting with str_subset()

str_detect() converts a character vector into a logical vector: TRUE where the pattern has been detected, and FALSE where it has not.

Look at this table t0–how would one subset it for the presence of a specific pattern, absent a nice str_detect()?

t0 <- tibble(
  x1 = c(
    "I like to read books and watch movies, especially on rainy days.",
    "She is a great singer and dancer, and she also plays the guitar.",
    "He went to the park with his dog, and they had a lot of fun together.",
    "They are having a party tonight, and they invited all their friends.",
    "You are very smart and kind, and I appreciate your help."
  )
)

t0$x1 %>% 
  str_detect(" I")

t0$x1 %>% 
  str_detect("and they")

# or with a Regex

t0$x1 %>% 
  str_detect("\\s\\w{3,6},\\s")
## [1] FALSE FALSE FALSE FALSE  TRUE
## [1] FALSE FALSE  TRUE  TRUE FALSE
## [1]  TRUE  TRUE  TRUE FALSE  TRUE
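Because str_detect() returns a logical vector, it slots straight into a row filter. A minimal sketch, using a hypothetical three-row stand-in for t0 (the toy tibble and its sentences are made up for illustration):

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for t0, with shorter sentences
toy <- tibble(
  x1 = c(
    "He went to the park, and they had fun.",
    "She sings and dances.",
    "They are having a party, and they invited everyone."
  )
)

# Keep only the rows where the pattern is detected
with_stringr <- toy %>%
  filter(str_detect(x1, "and they"))

# Base R equivalent: grepl() builds the same logical vector
with_base <- toy[grepl("and they", toy$x1), ]
```

Both versions keep the first and third rows. The stringr version reads more naturally inside a pipe.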

What’s a very typical behavioral application for str_detect()? Open ended survey responses!

Let’s do an example from the 2022 CES. CC22_433_t reports all the open ended answers when a subject opted out of the party ID branches:

tf <- tempfile()

download.file(
  "https://github.com/thomasjwood/constraint/raw/main/ces/ces_22_c.rds",
  tf
  )

t1 <- tf %>% 
  readRDS

t1$CC22_433_t %>% 
  unique %>%
  extract(
    order(
      decreasing = T,
      t1$CC22_433_t %>%
        unique %>% 
        str_length
    )
  ) %>% 
  extract(1:10)
##  [1] "I am registered as a democrat , but I do not think we should put labels on people. It makes people act like teenagers, and gives them reasons to hate. You can be against abortion and believe in the right to carry a gun at the same time. Why do we need to choose sides?"
##  [2] "I don’t consider myself to be either of those and I really hate labels. I don’t understand why this is the United States but every state has different laws and we just can’t agree to disagree is without there being Pardison bias"                                  
##  [3] "Disgruntled, not being supported citizen that wants change nut not seeing anything good coming from that lot in Washington. Registered Democrat because I have been a long time but neither party is looking out for us."                                                    
##  [4] "usually I base my decision on the views of the Candidate that has the closest to the same as myself in relation to the topics le am most concerned about at the time of election."                                                                                           
##  [5] "Ex democrat , I still believe democrats are better for the county. I now lean slightly to the right because the far left had taken over the democrat party"                                                                                                                  
##  [6] "Person who wishes we had more than 2 parties, and wishes citizens could vote on legislation themselves without having middlemen messing it up."                                                                                                                              
##  [7] "Formerly an Independent voting the candidate, now a Dem-leaning Independent who will possibly be a Dem by the time next election comes round."                                                                                                                               
##  [8] "I don't know enough about politics to identify with any of the selections... democrat, republican, or independent. I will get educated soon!"                                                                                                                                
##  [9] "Register Democratic but I diss agree and agree with something in both parties, but when it comes to human rights more with democratic"                                                                                                                                       
## [10] "I dont really fit into either party as far as my core beliefs some align with each party and some do not align with either party."

How many respondents feel like MAGA is somehow indicated in their partisanship? We’ll use str_subset(), which takes a vector and returns every element that matches the pattern:

t1$CC22_433_t %>% 
  str_subset("MAGA")
##  [1] "MAGA AMERICAN"   "MAGA"            "MAGA patriot."   "MAGA"           
##  [5] "Ultra MAGA"      "MAGA"            "MAGA"            "MAGA Communist" 
##  [9] "MAGA"            "MAGA Republican"

and …whoa why does this return 4 more matches?!?

t1$CC22_433_t %>% 
  str_to_lower %>% 
  str_subset("maga")
##  [1] "maga american"   "maga"            "maga"            "maga patriot."  
##  [5] "maga"            "ultra maga"      "maga"            "maga"           
##  [9] "maga"            "maga"            "ultra maga"      "maga communist" 
## [13] "maga"            "maga republican"
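The extra matches appear because pattern matching is case sensitive by default. Lowercasing first is one fix. Another, sketched here on a made-up vector (the values in x are hypothetical), is to wrap the pattern in regex() with ignore_case = TRUE, which leaves the original capitalization intact:

```r
library(stringr)

# Hypothetical responses, for illustration only
x <- c("MAGA", "maga", "Ultra MAGA", "Maga patriot", "Independent")

# Default matching is case sensitive
case_sensitive <- str_subset(x, "MAGA")

# regex(ignore_case = TRUE) matches any capitalization
# and returns the strings as they were typed
case_insensitive <- str_subset(x, regex("maga", ignore_case = TRUE))
```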

these data are rich

t1$CC22_433_t %>% 
  str_to_lower %>% 
  str_subset("nazi")

t1$CC22_433_t %>% 
  str_to_lower %>% 
  str_subset("communist")
## [1] "nazi"                                        
## [2] "anti nazi anti racism x president nazi trump"
## [3] "nazi anti-communist anti-zionist"            
##  [1] "communist ðÿ’—ðÿ’—"               "communist"                       
##  [3] "communist"                        "communist"                       
##  [5] "communist"                        "leftist/communist"               
##  [7] "communist"                        "libertarian communist"           
##  [9] "communist"                        "socialist/communist"             
## [11] "communist"                        "communist"                       
## [13] "communist"                        "communist"                       
## [15] "communist"                        "communist"                       
## [17] "communist"                        "communist"                       
## [19] "communist"                        "communist"                       
## [21] "communist"                        "communist"                       
## [23] "socialist/ communist"             "communist"                       
## [25] "communist"                        "communist"                       
## [27] "anti-communist"                   "communist"                       
## [29] "communist"                        "communist"                       
## [31] "communist"                        "communist"                       
## [33] "communist"                        "libertarian communist"           
## [35] "communist/socialist"              "maga communist"                  
## [37] "communist"                        "socialist/communist"             
## [39] "communist"                        "nazi anti-communist anti-zionist"
## [41] "communist"                        "communist-adjacent"              
## [43] "communist"

Let’s have an interlude with Regex

So during my simply lovely grad school years, I was sitting in my ground floor apartment in Hitchcock Hall, maybe 2010, when my dear wife stifled a chuckle at my choice in reading material.

I genuinely did not understand the source of her mirth. She eventually asked “why would anyone need to read a book to begin using regular expressions?! And does this man look like the guy to teach you?!?”

Let’s have a silly but hopefully useful summary of writing a Regex

Regex | Matches
\\w \\d \\s | a word character, a digit, a whitespace character
\\W \\D \\S | NOT a word character, NOT a digit, NOT a whitespace character
[abc] | any of a, b, or c
[^abc] | anything except a, b, or c
[a-gs] | any character from a through g, or s
a{5} a{2,} | exactly five a’s, two or more a’s
a{1,3} | between one and three a’s
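The patterns in the table can be tried directly in R. This sketch runs them against a made-up three-element vector; the strings in x are arbitrary and chosen so that each pattern matches at least one of them:

```r
library(stringr)

# Hypothetical test strings
x <- c("abc 123", "aaaaa", "gas")

first_digit <- str_extract(x, "\\d")      # first digit, NA when there is none
has_space   <- str_detect(x, "\\s")       # is there any whitespace?
not_abc     <- str_extract(x, "[^abc]")   # first character that is not a, b, or c
a_to_g_or_s <- str_extract(x, "[a-gs]+")  # run of characters in a-g, or s
two_plus_a  <- str_extract(x, "a{2,}")    # two or more consecutive a's
up_to_3_a   <- str_extract(x, "a{1,3}")   # one to three a's, as many as possible
```

Quantifiers are greedy, so a{1,3} takes three a's from "aaaaa" rather than stopping at one.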

AutoRegex, when it works, uses GPT to make Regex a snap. Regexr lets you try out a match on sample text. Here’s how Regexr works, on the 12 longest open-ended explanations for non-voting among CES respondents.

t1$CC22_402a_t %>% 
  unique %>% 
  extract(
    t1$CC22_402a_t %>% 
      unique %>% 
      str_length %>% 
      order(decreasing = T)
    ) %>% 
  extract(1:12) %>% 
  dput
## c("witnesses,and chose Jesus be my King as King of God's Kingdom. The Kingdom that we pray for when we pray the Lord's prayer found in the Bible at Matthew chapter 6 verses 9,10 and the earth willbe restored to a paradise like the garden of Eden, according to Psalms 37 verse 9-11. For more information, please go to JW.ORG. Thank you, Consuelo Johnson King, as King of God's Kingdm. The Kingdom that many pray for as Jesus taught us to pray for God's Kingdom in the Lords prayer", 
## "Lost interest, strongly feel that poltical leaders don't serve the American people. It's outright depressing, upsetting, an stressful to follow. Neither side can put aside differences on the most partisian issues, can't pass overwhelmingly supported agendas that majority of Americans support, popular vote doesn't mean anything because of the map districting/gerrymandering & electorial college so one person one vote is an objective lie.", 
## "I did not know how many days before Election Day I needed to send my mail in ballot so it will be received and counted on Election Day. The deadline date for people voting by mail I did not know. My only transportation is the bus, bike or walking. Due to my declining health I haven’t been able to travel often. I did not know where my closest polling place to where I live now to drop off my completed mail in ballot", 
## "It was a combination. I have not registered at my new address since moving from North Carolina last year. I have not bothered to register because I dislike most candidates and my vote does not matter because I live in an area that blindly selects the \"R\" because the cult leaders have convinced them that the party actually represents them, while the party just uses them as pawns in reality.", 
## "I cannot out vote stupidity. People in IL keep voting for people who rip us off.. Becoming tired of fraudelent voting practices such as the tech voting machines and mass distribution of mail in ballots. Our government is out of control and in some was leary of voting because of the labeling of republicans as domestic terrorists. Current government intimidates opposition.", 
## "I can vote in Missouri but live in Texas. I signed up for a real ID in Texas and it takes literal MONTHS just to get a chance to walk in. It’s disgusting cuz I know they’re doing it on purpose to keep people from voting. I don’t have money to just go out to St. Louis to vote and come back. WE DONT HAVE MONEY. I can’t stand this.", 
## "I committed a nonviolent, non-drug related crime years ago and Alabama government is still in the dark ages and nonforgiving despite living a respectful, law-abiding life, college attending life since the conviction. Hated putting this in writing about my past, but honest.", 
## "I just recently got my voting rights restored and honestly didn't know enough about the candidates or election to feel comfortable voting. I'm not even sure if I'm republican or democrat? I want to vote and be able to make a difference!", 
## "I am one of Jehovah’s Witnesses. Around the world we do not get involved in politics. We follow the example and teachings of Christ Jesus. To learn more about this topic you can visit the website jw.org and do a search on the subject.", 
## "I figured out how voting works, every midterm it switches to the other party. Plus, both parties are heavy into neoliberalist tactics and capitalism. Both major focus of parties are social issues, and nothing really fiscal.", 
## "I haven't been able to register because I don't have a valid driver's license and I can't afford to get them renewed so now I have to take the written test all over again and it's been over twenty years .", 
## "Candidates that run for office promise to do what is correct then do not follow thru, because they are all liars.Just want your vote to get into office to corrupt & make it more crooked than it is."
## )

Paste these strings into Regexr, and try a couple of matches out:

  • Are there any respondents who use double spacing around punctuation?
  • How many respondents mention a state’s name?
  • How many mention America or their religion?
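Once a pattern works in Regexr, it can be carried back into R. The sketch below is one possible starting point, not a definitive answer. It uses four made-up responses, reads "double spacing" as whitespace on both sides of a punctuation mark, and uses a deliberately short list of religious terms; all three are assumptions.

```r
library(stringr)

# Hypothetical responses, loosely modeled on the ones above
resp <- c(
  "Ex democrat , I still believe in voting.",
  "I can vote in Missouri but live in Texas.",
  "Around the world we follow the teachings of Christ Jesus.",
  "Neither party serves the American people."
)

# Assumption: "double spacing" means whitespace on both sides of punctuation
spaced_punct <- str_detect(resp, "\\s[,.;:!?]\\s")

# state.name is a built-in vector holding the 50 state names
state_pattern <- str_c("\\b(", str_c(state.name, collapse = "|"), ")\\b")
mentions_state <- str_detect(resp, state_pattern)

# Assumption: a short, incomplete list of terms, matched in any capitalization
topic_pattern <- regex("america|jesus|christ|bible|god", ignore_case = TRUE)
mentions_topic <- str_detect(resp, topic_pattern)

# Counting TRUE values answers the "how many" questions
n_state <- sum(mentions_state)
n_topic <- sum(mentions_topic)
```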

Substrings with str_sub()

Returning substrings via a numeric vector of starts and ends is a super general and useful tool. Negative positions count backwards from the end of the string:

str_sub(
  "test string",
  start = 2
  )

str_sub(
  "test string",
  end = 8
)

str_sub(
  "test string",
  start = 2, end = 8
)

str_sub(
  "test string",
   end = -4
  )
## [1] "est string"
## [1] "test str"
## [1] "est str"
## [1] "test str"

str_sub_all() generalizes this to vectors of strings and locations, returning one set of substrings per input string:

rec_pres <- c(
  "Ronald Reagan",
  "George Bush",
  "Bill Clinton",
  "George W Bush",
  "Barack Obama",
  "Donald Trump",
  "Joe Biden"
  )

rec_pres %>% 
  str_sub_all(
  start = 1:7,
  end = -7:-1
  )
## [[1]]
## [1] "Ronald " "onald R" "nald Re" "ald Rea" "ld Reag" "d Reaga" " Reagan"
## 
## [[2]]
## [1] "Georg" "eorge" "orge " "rge B" "ge Bu" "e Bus" " Bush"
## 
## [[3]]
## [1] "Bill C" "ill Cl" "ll Cli" "l Clin" " Clint" "Clinto" "linton"
## 
## [[4]]
## [1] "George " "eorge W" "orge W " "rge W B" "ge W Bu" "e W Bus" " W Bush"
## 
## [[5]]
## [1] "Barack" "arack " "rack O" "ack Ob" "ck Oba" "k Obam" " Obama"
## 
## [[6]]
## [1] "Donald" "onald " "nald T" "ald Tr" "ld Tru" "d Trum" " Trump"
## 
## [[7]]
## [1] "Joe" "oe " "e B" " Bi" "Bid" "ide" "den"

Finding locations in strings with str_locate()

str_locate() takes a string, and a pattern, and returns a numeric matrix with the starting and ending location for the pattern in each string element.

var2 <- 
  c("_test string",
    "te_t string",
    "test_string",
    "test str_ng",
    "test string_")

var2 %>% 
  str_locate("_")
##      start end
## [1,]     1   1
## [2,]     3   3
## [3,]     5   5
## [4,]     9   9
## [5,]    12  12
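The location matrix pairs naturally with str_sub(). As a small sketch on a made-up vector (the ids values are hypothetical), the start and end columns can be used to split each string around the first underscore:

```r
library(stringr)

# Hypothetical identifiers with one underscore each
ids <- c("id_001", "group_a", "x_y")

# One row per string, with "start" and "end" columns for the first match
loc <- str_locate(ids, "_")

# Everything before the match, and everything after it
before <- str_sub(ids, end = loc[, "start"] - 1)
after  <- str_sub(ids, start = loc[, "end"] + 1)
```

str_locate() reports only the first match in each string; str_locate_all() returns every match, as a list of matrices.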

A strings exercise

Take this vector of street addresses

var_3 <- c(
  "2299 Piedmont Ave, Berkeley CA, 94720",
  "35th St NW, Washington, DC 20057", 
  "5828 S University Ave, Chicago IL, 60637",
  "154 N Oval Mall, Columbus OH, 43210",
  "353 Aragon Ave, Coral Gables FL, 33134"
  )

and return

  • The state abbreviation
  • The zip code
  • The street address
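Try it yourself first. If you get stuck, here is one possible approach, sketched with str_extract(). It assumes that the state is always two capital letters directly before the zip code, that the zip code is the last five digits, and that the street address is everything before the first comma. Those assumptions hold for these five addresses but would need checking on messier data.

```r
library(stringr)

var_3 <- c(
  "2299 Piedmont Ave, Berkeley CA, 94720",
  "35th St NW, Washington, DC 20057",
  "5828 S University Ave, Chicago IL, 60637",
  "154 N Oval Mall, Columbus OH, 43210",
  "353 Aragon Ave, Coral Gables FL, 33134"
)

# Two capital letters, followed (via a lookahead) by an optional comma,
# a space, and five digits; the lookahead keeps "NW" from matching
state <- str_extract(var_3, "\\b[A-Z]{2}(?=,? \\d{5})")

# Five digits at the very end of the string
zip <- str_extract(var_3, "\\d{5}$")

# Everything from the start of the string up to the first comma
street <- str_extract(var_3, "^[^,]+")
```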

  1. This change to stringr’s recycling rules was announced when stringr was updated to version 1.5.0, described here: https://www.tidyverse.org/blog/2022/12/stringr-1-5-0/