Using stringr to handle Character and String Data

Overview

This is a brief overview of the stringr package, part of Hadley Wickham’s Tidyverse. Strings are among the most common data types a data scientist encounters, and stringr simplifies working with them. Below are a handful of useful stringr functions, each demonstrated on an example dataset.

The example dataset was downloaded from Kaggle.com at the following link: https://www.kaggle.com/rtatman/every-pub-in-england?select=open_pubs.csv. This dataset contains information on every pub in England.

Load Data

# load packages and read data file

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# library(stringr)
# library(readr)
# library(dplyr)

pub.data <- read_csv("https://raw.githubusercontent.com/SaneSky109/DATA607/main/Tidyverse/Data/open_pubs.csv")
## Rows: 51566 Columns: 9
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (6): name, address, postcode, latitude, longitude, local_authority
## dbl (3): fas_id, easting, northing
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Glimpse of Dataset

glimpse(pub.data)
## Rows: 51,566
## Columns: 9
## $ fas_id          <dbl> 24, 30, 63, 64, 65, 85, 101, 126, 140, 153, 154, 197, ~
## $ name            <chr> "Anchor Inn", "Angel Inn", "Black Boy Hotel", "Black H~
## $ address         <chr> "Upper Street, Stratford St Mary, COLCHESTER, Essex", ~
## $ postcode        <chr> "CO7 6LW", "CO10 7SA", "CO10 2EA", "CO7 6JS", "CO10 7R~
## $ easting         <dbl> 604748, 582888, 587356, 604270, 582750, 624667, 620709~
## $ northing        <dbl> 234405, 247368, 241327, 233920, 248298, 233744, 237978~
## $ latitude        <chr> "51.97039", "52.094427", "52.038683", "51.966211", "52~
## $ longitude       <chr> "0.979328", "0.668408", "0.730226", "0.972091", "0.666~
## $ local_authority <chr> "Babergh", "Babergh", "Babergh", "Babergh", "Babergh",~

Combine Strings

Sometimes the data you are working with is not in an ideal form. For example, you may want to combine first and last names into a single column. The str_c function joins strings together.

Usage: str_c(..., sep = "", collapse = NULL)

Purpose: Concatenate strings.

Input: One or more strings or character vectors, separated by commas.

Output: Vector containing the combined strings.

Example: Using the pub dataset, combine the name column with the address column.

pub.data$Location <- str_c(pub.data$name, pub.data$address, sep = " located at ")

head(pub.data$Location)
## [1] "Anchor Inn located at Upper Street, Stratford St Mary, COLCHESTER, Essex" 
## [2] "Angel Inn located at Egremont Street, Glemsford, SUDBURY, Suffolk"        
## [3] "Black Boy Hotel located at 7 Market Hill, SUDBURY, Suffolk"               
## [4] "Black Horse located at Lower Street, Stratford St Mary, COLCHESTER, Essex"
## [5] "Black Lion located at Lion Road, Glemsford, SUDBURY, Suffolk"             
## [6] "Bristol Arms located at Bristol Hill, Shotley, IPSWICH, Suffolk"
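
The sep and collapse arguments listed under Usage behave differently, and missing values propagate through str_c. The following is a minimal sketch on hypothetical toy vectors (pub_names and towns are made up for illustration and are not columns of the pub dataset).

```r
library(stringr)

# Hypothetical toy vectors, not taken from the pub dataset
pub_names <- c("Anchor Inn", "Angel Inn", NA)
towns <- c("COLCHESTER", "SUDBURY", "IPSWICH")

# sep joins the inputs element by element; a missing input gives a missing result
combined <- str_c(pub_names, towns, sep = " in ")
combined
# "Anchor Inn in COLCHESTER" "Angel Inn in SUDBURY" NA

# collapse joins all elements of the result into a single string
one_string <- str_c(towns, collapse = "; ")
one_string
# "COLCHESTER; SUDBURY; IPSWICH"

# str_replace_na swaps NA for a placeholder when a missing result is not wanted
safe <- str_c(str_replace_na(pub_names, "Unknown pub"), towns, sep = " in ")
safe
```

Replacing NA before concatenating is one option; whether a placeholder is appropriate depends on how the combined column will be used.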

Filter Data using str_detect

Filtering data is a valuable skill for data scientists. str_detect returns a logical (TRUE/FALSE) value for each element of the input vector, so it can be combined with other tidyverse functions, such as filter, to subset data.

Usage: str_detect(string, pattern, negate = FALSE)

Purpose: Determine whether the pattern occurs within each string.

Input: Vector that is or can be coerced to a character data type

Output: Logical vector that is TRUE where the string contains the pattern and FALSE where it does not

Example: Using the pub dataset, find the rows that have “Essex” in the address column and store those rows in a new data frame.

example <- str_detect(pub.data$address,"Essex")
head(example)
## [1]  TRUE FALSE FALSE  TRUE FALSE FALSE
essex.data <- pub.data %>%
  filter(str_detect(address,"Essex"))
head(essex.data)
## # A tibble: 6 x 10
##   fas_id name            address    postcode easting northing latitude longitude
##    <dbl> <chr>           <chr>      <chr>      <dbl>    <dbl> <chr>    <chr>    
## 1     24 Anchor Inn      Upper Str~ CO7 6LW   604748   234405 51.97039 0.979328 
## 2     64 Black Horse     Lower Str~ CO7 6JS   604270   233920 51.9662~ 0.972091 
## 3    126 Carriers Arms   Heath Roa~ CO7 6RA   607463   235397 51.9782~ 1.019392 
## 4    197 Crown Inn       Cattawade~ CO11 1RE  610208   233105 51.9566~ 1.057899 
## 5    307 Hare and Hounds Harrow St~ CO6 4PW   595658   237394 52.0005~ 0.848865 
## 6    308 Hare And Hounds Heath Roa~ CO7 6RL   607632   235424 51.9784~ 1.021866 
## # ... with 2 more variables: local_authority <chr>, Location <chr>
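
The pattern passed to str_detect is a regular expression by default, so it can be anchored or made case-insensitive, and the negate argument from the Usage line inverts the result. A small sketch on hypothetical addresses (the third address is invented to show the case-sensitivity difference):

```r
library(stringr)

# Hypothetical addresses; the third one is made up for illustration
addresses <- c(
  "Upper Street, Stratford St Mary, COLCHESTER, Essex",
  "Egremont Street, Glemsford, SUDBURY, Suffolk",
  "1 High Street, ESSEX ROAD, London"
)

# Anchor with $ so that only addresses ending in "Essex" match
ends_in_essex <- str_detect(addresses, "Essex$")
ends_in_essex
# TRUE FALSE FALSE

# negate = TRUE returns TRUE where the pattern is absent
not_essex <- str_detect(addresses, "Essex", negate = TRUE)
not_essex
# FALSE TRUE TRUE

# regex(ignore_case = TRUE) matches regardless of capitalisation
any_case <- str_detect(addresses, regex("essex", ignore_case = TRUE))
any_case
# TRUE FALSE TRUE
```

Anchoring the pattern can matter for this kind of data, since an unanchored "Essex" would also match a street or pub name that happens to contain the word.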

Separate Strings to make new columns

Occasionally the data you work with will not be tidy. For example, several variables may be stored together in a single column. The stringr package offers a solution to this problem in the str_split function, which splits a string into pieces that can be used to make new columns.

Usage: str_split(string, pattern, n = Inf, simplify = FALSE)

Purpose: Split each string into pieces at every match of the pattern.

Input: Vector that is or can be coerced to a character data type

Output: List of vectors that contain the separated strings

Example: Using the pub dataset, separate the address column into its individual components. These components can be used to construct new columns for street, village, town, and county.

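# For this example, keep only addresses with exactly 4 comma-separated components
# Step 1: drop addresses that contain four or more commas (5+ components)
pub.data1 <- pub.data %>%
  filter(!str_detect(address,".*,.*,.*,.*,"))

# Step 2: keep addresses that have three commas with text after the last one
pub.data1 <- pub.data1 %>%
  filter(str_detect(address,".*,.*,.*,.+"))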


split.address <- str_split(pub.data1$address, pattern = ",")
head(split.address)
## [[1]]
## [1] "Upper Street"       " Stratford St Mary" " COLCHESTER"       
## [4] " Essex"            
## 
## [[2]]
## [1] "Egremont Street" " Glemsford"      " SUDBURY"        " Suffolk"       
## 
## [[3]]
## [1] "Lower Street"       " Stratford St Mary" " COLCHESTER"       
## [4] " Essex"            
## 
## [[4]]
## [1] "Lion Road"  " Glemsford" " SUDBURY"   " Suffolk"  
## 
## [[5]]
## [1] "Bristol Hill" " Shotley"     " IPSWICH"     " Suffolk"    
## 
## [[6]]
## [1] "Pin Mill Road"  " Chelmondiston" " IPSWICH"       " Suffolk"
# Safety check: keep only list elements with exactly 4 pieces
split.address <- split.address[sapply(split.address, length)==4]

# Flatten the list into one vector; each run of 4 elements is one address
split.address <- unlist(split.address)

# Take every 4th element, starting at positions 1 to 4, to build each column
pub.data1$street <- split.address[seq(1, length(split.address), 4)]
pub.data1$village <- split.address[seq(2, length(split.address), 4)]
pub.data1$town <- split.address[seq(3, length(split.address), 4)]
pub.data1$county <- split.address[seq(4, length(split.address), 4)]

head(pub.data1[,c("street","village","town","county")])
## # A tibble: 6 x 4
##   street          village              town          county    
##   <chr>           <chr>                <chr>         <chr>     
## 1 Upper Street    " Stratford St Mary" " COLCHESTER" " Essex"  
## 2 Egremont Street " Glemsford"         " SUDBURY"    " Suffolk"
## 3 Lower Street    " Stratford St Mary" " COLCHESTER" " Essex"  
## 4 Lion Road       " Glemsford"         " SUDBURY"    " Suffolk"
## 5 Bristol Hill    " Shotley"           " IPSWICH"    " Suffolk"
## 6 Pin Mill Road   " Chelmondiston"     " IPSWICH"    " Suffolk"
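
The village, town, and county values above keep the space that followed each comma. As an alternative sketch (not the approach used above), the simplify argument from the Usage line returns a character matrix, and str_trim removes the surrounding whitespace. The two addresses and the address_parts data frame are illustrative assumptions.

```r
library(stringr)

# Two addresses copied from the output above, used as a small stand-in
addresses <- c(
  "Upper Street, Stratford St Mary, COLCHESTER, Essex",
  "Egremont Street, Glemsford, SUDBURY, Suffolk"
)

# simplify = TRUE returns a character matrix:
# one row per address, one column per piece
parts <- str_split(addresses, pattern = ",", simplify = TRUE)
dim(parts)
# 2 4

# str_trim strips the leading space left behind after each comma
address_parts <- data.frame(
  street  = str_trim(parts[, 1]),
  village = str_trim(parts[, 2]),
  town    = str_trim(parts[, 3]),
  county  = str_trim(parts[, 4]),
  stringsAsFactors = FALSE
)
address_parts
```

With simplify = TRUE, shorter addresses are padded with empty strings, so this form assumes the rows have already been filtered to a common number of components, as was done above.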

Ordering Strings Alphabetically

Sorting strings is useful when preparing data tables and data visualizations. The str_sort function, provided by the stringr package, handles this task.

Usage: str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "en", numeric = FALSE, ...)

Purpose: Sort strings.

Input: Vector that is or can be coerced to a character data type

Output: Vector in desired order

Example: Using the pub dataset, sort the name column alphabetically in ascending and then descending order. Note that names beginning with special characters, such as a period (.) or an apostrophe ('), are placed before names beginning with the letter a.

decrease.FALSE <- str_sort(pub.data$name, decreasing = FALSE)
head(decrease.FALSE)
## [1] ".burger"                     "'Oswaldtwistle Social Club'"
## [3] "'The Commercial Hotel'"      "'The Dog & Otter'"          
## [5] "'The Park Inn '"             "@75"
decrease.TRUE <- str_sort(pub.data$name, decreasing = TRUE)
head(decrease.TRUE)
## [1] "Zynk"                 "Zy Bar"               "Zu Studios (Zutopia)"
## [4] "Zorita's Kitchen"     "Zoo Too"              "Zoo Bar"
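
The numeric and na_last arguments from the Usage line also affect the result. A minimal sketch on a hypothetical vector (pub_labels is made up for illustration):

```r
library(stringr)

# Hypothetical labels containing numbers and a missing value
pub_labels <- c("Bar 10", "Bar 2", "Bar 1", NA)

# Default: digits are compared character by character,
# so "Bar 10" sorts before "Bar 2"; NA goes last
alphabetical <- str_sort(pub_labels)
alphabetical
# "Bar 1" "Bar 10" "Bar 2" NA

# numeric = TRUE compares runs of digits as numbers
natural <- str_sort(pub_labels, numeric = TRUE)
natural
# "Bar 1" "Bar 2" "Bar 10" NA

# na_last = NA drops missing values from the result
no_na <- str_sort(pub_labels, numeric = TRUE, na_last = NA)
no_na
# "Bar 1" "Bar 2" "Bar 10"
```

Setting numeric = TRUE may be worth considering for names that embed numbers, where character-by-character ordering can look unintuitive.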

Conclusion

These are just a handful of useful functions in the stringr package. Feel free to check out the cheatsheet of stringr functions at: https://github.com/rstudio/cheatsheets/blob/master/strings.pdf.