This is a brief overview of the stringr package from Hadley Wickham’s Tidyverse. Strings are among the most common data types a data scientist encounters, and the stringr package simplifies working with them. Below are a handful of useful stringr functions, each demonstrated on an example dataset.
The example dataset was downloaded from Kaggle.com at the following link: https://www.kaggle.com/rtatman/every-pub-in-england?select=open_pubs.csv. It contains information on every pub in England.
# load packages and read data file
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# library(stringr)
# library(readr)
# library(dplyr)
pub.data <- read_csv("https://raw.githubusercontent.com/SaneSky109/DATA607/main/Tidyverse/Data/open_pubs.csv")
## Rows: 51566 Columns: 9
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (6): name, address, postcode, latitude, longitude, local_authority
## dbl (3): fas_id, easting, northing
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(pub.data)
## Rows: 51,566
## Columns: 9
## $ fas_id <dbl> 24, 30, 63, 64, 65, 85, 101, 126, 140, 153, 154, 197, ~
## $ name <chr> "Anchor Inn", "Angel Inn", "Black Boy Hotel", "Black H~
## $ address <chr> "Upper Street, Stratford St Mary, COLCHESTER, Essex", ~
## $ postcode <chr> "CO7 6LW", "CO10 7SA", "CO10 2EA", "CO7 6JS", "CO10 7R~
## $ easting <dbl> 604748, 582888, 587356, 604270, 582750, 624667, 620709~
## $ northing <dbl> 234405, 247368, 241327, 233920, 248298, 233744, 237978~
## $ latitude <chr> "51.97039", "52.094427", "52.038683", "51.966211", "52~
## $ longitude <chr> "0.979328", "0.668408", "0.730226", "0.972091", "0.666~
## $ local_authority <chr> "Babergh", "Babergh", "Babergh", "Babergh", "Babergh",~
Sometimes the data you are working with is not in an ideal form. For example, you may want to combine first and last names into a single column. The str_c function joins strings together.
Usage: str_c(..., sep = "", collapse = NULL)
Purpose: Concatenate strings.
Input: One or more character vectors, passed as comma-separated arguments.
Output: Character vector containing the combined strings.
Example: Using the pub dataset, combine the name column with the address column.
pub.data$Location <- str_c(pub.data$name, pub.data$address, sep = " located at ")
head(pub.data$Location)
## [1] "Anchor Inn located at Upper Street, Stratford St Mary, COLCHESTER, Essex"
## [2] "Angel Inn located at Egremont Street, Glemsford, SUDBURY, Suffolk"
## [3] "Black Boy Hotel located at 7 Market Hill, SUDBURY, Suffolk"
## [4] "Black Horse located at Lower Street, Stratford St Mary, COLCHESTER, Essex"
## [5] "Black Lion located at Lion Road, Glemsford, SUDBURY, Suffolk"
## [6] "Bristol Arms located at Bristol Hill, Shotley, IPSWICH, Suffolk"
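The sep and collapse arguments are easy to confuse, so here is a minimal sketch on small made-up vectors (pub_name and pub_town are illustrative stand-ins, not columns of the dataset). It also shows that a missing input produces a missing result.

```r
library(stringr)

# Small made-up vectors standing in for the name and address columns
pub_name <- c("Anchor Inn", "Angel Inn", NA)
pub_town <- c("COLCHESTER", "SUDBURY", "IPSWICH")

# sep is placed between the arguments, element by element;
# the third result is NA because one of its inputs is NA
joined <- str_c(pub_name, pub_town, sep = " in ")
joined

# collapse joins all elements of the result into a single string
towns_line <- str_c(pub_town, collapse = "; ")
towns_line
```

If missing values should appear as text instead of propagating, str_replace_na can be applied to the inputs first.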
Filtering data is a valuable skill for data scientists. str_detect returns a logical value for each element of the input vector, which can be combined with other tidyverse functions to filter data.
Usage: str_detect(string, pattern, negate = FALSE)
Purpose: Determine whether a pattern occurs within a string.
Input: Vector that is, or can be coerced to, a character data type
Output: Logical vector that is TRUE where the string contains the pattern and FALSE where it does not
Example: Using the pub dataset, find the rows that have “Essex” in the address column and store those rows in a new data frame.
example <- str_detect(pub.data$address,"Essex")
head(example)
## [1] TRUE FALSE FALSE TRUE FALSE FALSE
essex.data <- pub.data %>%
  filter(str_detect(address, "Essex"))
head(essex.data)
## # A tibble: 6 x 10
## fas_id name address postcode easting northing latitude longitude
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 24 Anchor Inn Upper Str~ CO7 6LW 604748 234405 51.97039 0.979328
## 2 64 Black Horse Lower Str~ CO7 6JS 604270 233920 51.9662~ 0.972091
## 3 126 Carriers Arms Heath Roa~ CO7 6RA 607463 235397 51.9782~ 1.019392
## 4 197 Crown Inn Cattawade~ CO11 1RE 610208 233105 51.9566~ 1.057899
## 5 307 Hare and Hounds Harrow St~ CO6 4PW 595658 237394 52.0005~ 0.848865
## 6 308 Hare And Hounds Heath Roa~ CO7 6RL 607632 235424 51.9784~ 1.021866
## # ... with 2 more variables: local_authority <chr>, Location <chr>
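The negate argument from the usage line, and the fact that the pattern is treated as a regular expression, can be sketched on a few made-up addresses (the addresses vector below is illustrative, not taken from the dataset):

```r
library(stringr)

# Made-up addresses; the last one uses a lowercase county on purpose
addresses <- c("Upper Street, COLCHESTER, Essex",
               "Market Hill, SUDBURY, Suffolk",
               "High Street, Chelmsford, essex")

# negate = TRUE flips the result: TRUE where the pattern is absent
not_essex <- str_detect(addresses, "Essex", negate = TRUE)

# Matching is case sensitive by default; regex(ignore_case = TRUE) relaxes that
any_case <- str_detect(addresses, regex("essex", ignore_case = TRUE))

# "$" anchors the pattern to the end of the string
ends_suffolk <- str_detect(addresses, "Suffolk$")
```

Any of these logical vectors can be passed to filter() in the same way as the Essex example above.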
Occasionally the data that you work with will not be very tidy. For example, several variables may be stored in a single column. The stringr package provides a good solution to this problem via the str_split function, which separates a string into pieces that can be used to make new columns.
Usage: str_split(string, pattern, n = Inf, simplify = FALSE)
Purpose: Split strings into pieces wherever the pattern matches.
Input: Vector that is, or can be coerced to, a character data type
Output: List of character vectors, one per input string, containing the separated pieces
Example: Using the pub dataset, separate the address column into individual components. These components can be used to construct new columns for street, village, town, and county.
# For this example, keep only rows whose address has exactly 4 components:
# first drop addresses with 4 or more commas, then keep those with at least 3
pub.data1 <- pub.data %>%
  filter(!str_detect(address, ".*,.*,.*,.*,"))
pub.data1 <- pub.data1 %>%
  filter(str_detect(address, ".*,.*,.*,.+"))
split.address <- str_split(pub.data1$address, pattern = ",")
head(split.address)
## [[1]]
## [1] "Upper Street" " Stratford St Mary" " COLCHESTER"
## [4] " Essex"
##
## [[2]]
## [1] "Egremont Street" " Glemsford" " SUDBURY" " Suffolk"
##
## [[3]]
## [1] "Lower Street" " Stratford St Mary" " COLCHESTER"
## [4] " Essex"
##
## [[4]]
## [1] "Lion Road" " Glemsford" " SUDBURY" " Suffolk"
##
## [[5]]
## [1] "Bristol Hill" " Shotley" " IPSWICH" " Suffolk"
##
## [[6]]
## [1] "Pin Mill Road" " Chelmondiston" " IPSWICH" " Suffolk"
# Keep only the splits that produced exactly 4 pieces, then flatten the list
split.address <- split.address[sapply(split.address, length)==4]
split.address <- unlist(split.address)
# In the flattened vector, every 4th element belongs to the same component
pub.data1$street <- split.address[seq(1, length(split.address), 4)]
pub.data1$village <- split.address[seq(2, length(split.address), 4)]
pub.data1$town <- split.address[seq(3, length(split.address), 4)]
pub.data1$county <- split.address[seq(4, length(split.address), 4)]
head(pub.data1[,c("street","village","town","county")])
## # A tibble: 6 x 4
## street village town county
## <chr> <chr> <chr> <chr>
## 1 Upper Street " Stratford St Mary" " COLCHESTER" " Essex"
## 2 Egremont Street " Glemsford" " SUDBURY" " Suffolk"
## 3 Lower Street " Stratford St Mary" " COLCHESTER" " Essex"
## 4 Lion Road " Glemsford" " SUDBURY" " Suffolk"
## 5 Bristol Hill " Shotley" " IPSWICH" " Suffolk"
## 6 Pin Mill Road " Chelmondiston" " IPSWICH" " Suffolk"
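The leading spaces in the village, town, and county values above come from splitting on a bare comma. As a minimal sketch of one possible alternative, the simplify = TRUE argument from the usage line returns a matrix instead of a list, and str_trim removes the surrounding whitespace. The two addresses are copied from the output above, and the name address_parts is only illustrative.

```r
library(stringr)

# Two addresses copied from the example output, not the full dataset
addresses <- c("Upper Street, Stratford St Mary, COLCHESTER, Essex",
               "Egremont Street, Glemsford, SUDBURY, Suffolk")

# simplify = TRUE returns a character matrix:
# one row per input string, one column per piece
parts <- str_split(addresses, pattern = ",", simplify = TRUE)

# Trim the whitespace left over from the ", " separators, column by column
address_parts <- data.frame(
  street  = str_trim(parts[, 1]),
  village = str_trim(parts[, 2]),
  town    = str_trim(parts[, 3]),
  county  = str_trim(parts[, 4])
)
address_parts
```

This sketch assumes every address has exactly four components, as in the filtered data above; with simplify = TRUE, shorter addresses would be padded with empty strings.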
Ordering strings can be useful in data tables and data visualizations. The str_sort function, provided by the stringr package, handles this task.
Usage: str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "en", numeric = FALSE, ...)
Purpose: Sort strings.
Input: Vector that is or can be coerced to a character data type
Output: Vector in desired order
Example: Using the pub dataset, sort the name column alphabetically in ascending and descending order. Note that names beginning with special characters such as the period (.), apostrophe ('), and at sign (@) are placed before the letter a.
decrease.FALSE <- str_sort(pub.data$name, decreasing = FALSE)
head(decrease.FALSE)
## [1] ".burger" "'Oswaldtwistle Social Club'"
## [3] "'The Commercial Hotel'" "'The Dog & Otter'"
## [5] "'The Park Inn '" "@75"
decrease.TRUE <- str_sort(pub.data$name, decreasing = TRUE)
head(decrease.TRUE)
## [1] "Zynk" "Zy Bar" "Zu Studios (Zutopia)"
## [4] "Zorita's Kitchen" "Zoo Too" "Zoo Bar"
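The numeric and na_last arguments from the usage line matter when strings contain numbers or missing values. Here is a minimal sketch on a made-up vector (pubs is illustrative and not part of the dataset):

```r
library(stringr)

# Made-up names containing numbers and a missing value
pubs <- c("Bar 10", "Bar 2", "Bar 1", NA)

# Default: digits are compared character by character,
# so "Bar 10" sorts before "Bar 2"; NA is placed last
alphabetical <- str_sort(pubs)

# numeric = TRUE compares runs of digits as numbers
natural <- str_sort(pubs, numeric = TRUE)

# na_last = NA drops missing values instead of placing them last
without_na <- str_sort(pubs, na_last = NA)
```

The related function str_order returns the sorted positions instead of the sorted values, which may be more convenient when reordering the rows of a data frame.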
These are just a handful of useful functions in the stringr package. Feel free to check out the cheatsheet of stringr functions at: https://github.com/rstudio/cheatsheets/blob/master/strings.pdf.