Getting started

First we need to load these packages:

  • tidyverse
  • stringr
  • dplyr - used for subsetting data in our analysis
  • rmdformats - used for styling the HTML document

We’re going to load a dataset from fivethirtyeight.com to help us show examples of stringr at work. Our data shows murders in cities in America from 2014 to 2015.

We’ll take the first 10 rows of the data for simplicity’s sake.

## Parsed with column specification:
## cols(
##   city = col_character(),
##   state = col_character(),
##   `2014_murders` = col_double(),
##   `2015_murders` = col_double(),
##   change = col_double()
## )

Ordering Strings

str_order(character vector,decreasing = X)

Purpose:
Order a character vector alphabetically.

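Input:
character vector - what you want to order
X - whether to order in decreasing order (FALSE, the default - alphabetically from A to Z; TRUE - from Z to A)

Output:
An integer vector of positions that puts the input in order. To get the ordered character vector itself, index the input with these positions or use str_sort(), which takes the same arguments.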

Example:
We’ll order the column ‘city’ from our dataframe ‘murder’

##  [1] "Baltimore"    "Chicago"      "Cleveland"    "Houston"      "Kansas City" 
##  [6] "Milwaukee"    "Nashville"    "Philadelphia" "St. Louis"    "Washington"

If you want to reverse the order to Z-A, you can set decreasing = TRUE.

##  [1] "Washington"   "St. Louis"    "Philadelphia" "Nashville"    "Milwaukee"   
##  [6] "Kansas City"  "Houston"      "Cleveland"    "Chicago"      "Baltimore"
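The two outputs above can be reproduced along these lines. This is a minimal sketch: the city vector is typed in by hand from the output shown here instead of being read from the murder dataframe, and the variable names are illustrative.

```r
library(stringr)

# City names copied from the example output (assumed to match murder$city)
city <- c("Baltimore", "Chicago", "Houston", "Cleveland", "Washington",
          "Milwaukee", "Philadelphia", "Kansas City", "Nashville", "St. Louis")

# str_order() returns integer positions; indexing with them gives A-Z order
idx <- str_order(city)
sorted_az <- city[idx]

# str_sort() returns the sorted values directly; decreasing = TRUE gives Z-A
sorted_za <- str_sort(city, decreasing = TRUE)

print(sorted_az)
print(sorted_za)
```

Indexing with str_order() is handy when the same ordering has to be applied to several columns at once, while str_sort() is the shorter route when only the sorted values are needed.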

Combining Strings

str_c(String1,String2,…Stringn)

Purpose:
The function takes strings or vectors of strings and concatenates them.

Input:
String or vector of strings separated by comma

Output:
A single string or a vector of combined strings

Example:
You can combine as many strings as you want together at once

## [1] "abcdefgh"

Let’s see how we can combine two vectors of strings from our dataframe: the city and the state.

##  [1] "BaltimoreMaryland"        "ChicagoIllinois"         
##  [3] "HoustonTexas"             "ClevelandOhio"           
##  [5] "WashingtonD.C."           "MilwaukeeWisconsin"      
##  [7] "PhiladelphiaPennsylvania" "Kansas CityMissouri"     
##  [9] "NashvilleTennessee"       "St. LouisMissouri"

You can add a separator between the strings you’re combining using the sep argument. Let’s separate the city and state with a comma.

Add this new data as a column, named City_State, in our dataframe murder.

##  [1] "Baltimore,Maryland"        "Chicago,Illinois"         
##  [3] "Houston,Texas"             "Cleveland,Ohio"           
##  [5] "Washington,D.C."           "Milwaukee,Wisconsin"      
##  [7] "Philadelphia,Pennsylvania" "Kansas City,Missouri"     
##  [9] "Nashville,Tennessee"       "St. Louis,Missouri"
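A minimal sketch of the str_c() calls in this section. The four short input strings are an assumption chosen to reproduce "abcdefgh", and the small murder dataframe below is a hand-typed three-row stand-in for the real data.

```r
library(stringr)

# Combine any number of single strings (inputs here are illustrative)
joined <- str_c("ab", "cd", "ef", "gh")

# Stand-in for the murder dataframe, with three of its rows
murder <- data.frame(
  city  = c("Baltimore", "Washington", "St. Louis"),
  state = c("Maryland", "D.C.", "Missouri"),
  stringsAsFactors = FALSE
)

# Element-wise combination of two vectors, without and with a separator
no_sep <- str_c(murder$city, murder$state)
murder$City_State <- str_c(murder$city, murder$state, sep = ",")

print(joined)
print(murder$City_State)
```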

Replacing Strings

str_replace_all(string, pattern, replacement)

Purpose:
This function will replace all instances of a pattern with the given replacement

Input:
String or vector of strings
Pattern - you can use regular expressions here
Replacement - the string to put in place of each match

Output:
A string or vector of strings with every match replaced

Example:
Suppose we wanted to replace all appearances of . in the column ‘City_State’ with *. We can easily do this with str_replace_all().

##  [1] "Baltimore,Maryland"        "Chicago,Illinois"         
##  [3] "Houston,Texas"             "Cleveland,Ohio"           
##  [5] "Washington,D*C*"           "Milwaukee,Wisconsin"      
##  [7] "Philadelphia,Pennsylvania" "Kansas City,Missouri"     
##  [9] "Nashville,Tennessee"       "St* Louis,Missouri"
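One detail worth spelling out: in a regular expression, an unescaped . matches any character, so the dot has to be matched literally. The sketch below shows two ways to do that, using a hand-typed three-element stand-in for the City_State column.

```r
library(stringr)

# Stand-in for murder$City_State
city_state <- c("Baltimore,Maryland", "Washington,D.C.", "St. Louis,Missouri")

# fixed() treats the pattern as a literal string, not a regular expression
replaced <- str_replace_all(city_state, fixed("."), "*")

# Equivalent: escape the dot inside a regular expression
replaced_regex <- str_replace_all(city_state, "\\.", "*")

# For contrast, an unescaped dot matches every character
everything <- str_replace_all("D.C.", ".", "*")

print(replaced)
```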

Get the Length of a String

str_length(string)

Purpose:
Find out the length of a string or a vector of strings

Input:
String or vector of strings

Output:
Integer

Example:
Let’s find out how long each city name is.

##  [1]  9  7  7  9 10  9 12 11  9  9

Let’s only view the rows in the dataframe where the city name has more than 9 characters (spaces count too). To do this we’ll also use the filter function from the package dplyr.

## # A tibble: 3 x 6
##   city       state      `2014_murders` `2015_murders` change City_State         
##   <chr>      <chr>               <dbl>          <dbl>  <dbl> <chr>              
## 1 Washington D.C.                  105            162     57 Washington,D*C*    
## 2 Philadelp… Pennsylva…            248            280     32 Philadelphia,Penns…
## 3 Kansas Ci… Missouri               78            109     31 Kansas City,Missou…
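A minimal sketch of the length check and the filter step, assuming a hand-typed five-row stand-in for the murder dataframe with only the city and state columns.

```r
library(stringr)
library(dplyr)

# Stand-in for the murder dataframe
murder <- data.frame(
  city  = c("Baltimore", "Chicago", "Washington", "Philadelphia", "Kansas City"),
  state = c("Maryland", "Illinois", "D.C.", "Pennsylvania", "Missouri"),
  stringsAsFactors = FALSE
)

# Number of characters in each city name (spaces are counted)
name_lengths <- str_length(murder$city)

# Keep only rows whose city name is longer than 9 characters
long_names <- murder %>% filter(str_length(city) > 9)

print(name_lengths)
print(long_names)
```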

Conclusion

These examples are just the beginning of what you can do with stringr. If you need to manipulate, combine or work with strings in general, stringr is a great package to do so. Here’s a great stringr cheatsheet released by RStudio (https://rstudio.com/resources/cheatsheets/).

Resources:

Extended By Amit Kapoor

stringr is built on top of stringi, which provides fast and accurate string manipulation. stringr exposes the most frequently used string manipulation functions, while stringi provides a more comprehensive set. All stringr functions start with the str_ prefix, which simplifies working with character strings in R.

## # A tibble: 6 x 6
##   city       state     `2014_murders` `2015_murders` change City_State         
##   <chr>      <chr>              <dbl>          <dbl>  <dbl> <chr>              
## 1 Baltimore  Maryland             211            344    133 Baltimore,Maryland 
## 2 Chicago    Illinois             411            478     67 Chicago,Illinois   
## 3 Houston    Texas                242            303     61 Houston,Texas      
## 4 Cleveland  Ohio                  63            120     57 Cleveland,Ohio     
## 5 Washington D.C.                 105            162     57 Washington,D*C*    
## 6 Milwaukee  Wisconsin             90            145     55 Milwaukee,Wisconsin

str_which()

str_which(string, pattern)

Purpose:
Identify which character strings contain a certain pattern. It returns the indices of the entries that contain the pattern.

Input:
String or vector of strings
Pattern - you can use regular expressions here

Output:
An integer vector

Example:

We’ll check whether the column ‘City_State’ from our dataframe ‘murder’ follows the pattern ‘letters,letters’.

##  [1]  1  2  3  4  5  6  7  8  9 10

Here [a-zA-Z ] matches any lowercase or uppercase letter or a space, and ‘+’ means one or more.
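The call behind this output is not shown, so the pattern below is an assumption reconstructed from the description of [a-zA-Z ] and ‘+’. A minimal sketch with a hand-typed stand-in for City_State, plus one extra value without a comma to show a non-match:

```r
library(stringr)

# Stand-in for murder$City_State, plus one value that has no comma
city_state <- c("Baltimore,Maryland", "Washington,D*C*",
                "St* Louis,Missouri", "Nashville")

# Assumed pattern: letters or spaces, a comma, then letters or spaces
pattern <- "[a-zA-Z ]+,[a-zA-Z ]+"

# Indices of the elements that contain the pattern
hits <- str_which(city_state, pattern)

print(hits)
```

The pattern only has to match somewhere inside the string, which is why values containing ‘*’ still count as hits.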

str_match()

str_match(string, pattern)

Purpose:
Extract the first match of a pattern from each string, together with the capture groups defined in the pattern.

Input:
String or vector of strings
Pattern - you can use regular expressions here

Output:
Character matrix with one column for the complete match and one column for each group.

Example:

We’ll use the column ‘City_State’ from our dataframe ‘murder’ to identify the groups.

##       [,1]                        [,2]           [,3] [,4]          
##  [1,] "Baltimore,Maryland"        "Baltimore"    ","  "Maryland"    
##  [2,] "Chicago,Illinois"          "Chicago"      ","  "Illinois"    
##  [3,] "Houston,Texas"             "Houston"      ","  "Texas"       
##  [4,] "Cleveland,Ohio"            "Cleveland"    ","  "Ohio"        
##  [5,] "Washington,D"              "Washington"   ","  "D"           
##  [6,] "Milwaukee,Wisconsin"       "Milwaukee"    ","  "Wisconsin"   
##  [7,] "Philadelphia,Pennsylvania" "Philadelphia" ","  "Pennsylvania"
##  [8,] "Kansas City,Missouri"      "Kansas City"  ","  "Missouri"    
##  [9,] "Nashville,Tennessee"       "Nashville"    ","  "Tennessee"   
## [10,] " Louis,Missouri"           " Louis"       ","  "Missouri"

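As described under Output, the first column holds the complete match and the remaining three hold each group of the pattern. Rows 5 and 10 show only partial matches because the ‘*’ characters in ‘Washington,D*C*’ and ‘St* Louis,Missouri’ are not part of the pattern.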
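A sketch of a call that yields a matrix of this shape. The three-group pattern is an assumption inferred from the four output columns, and the input is a hand-typed stand-in for City_State.

```r
library(stringr)

# Stand-in for murder$City_State
city_state <- c("Baltimore,Maryland", "Washington,D*C*", "St* Louis,Missouri")

# Assumed pattern with three capture groups: city part, comma, state part
m <- str_match(city_state, "([a-zA-Z ]+)(,)([a-zA-Z ]+)")

# Column 1 is the full match; columns 2 to 4 are the groups
print(m)
```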

str_pad()

str_pad(string, width, side = c("left", "right", "both"), pad = " ")

Purpose:
Adds padding characters (default white space) to string to make it a fixed size.

Input:
String or vector of strings
width - Minimum width of padded strings.
side - Side on which padding character is added (left, right or both).
pad - Single padding character (default is a space).

Output:
String or vector of strings

Example:

Let’s use the column ‘city’ of dataframe ‘murder’ to demonstrate padding. Here each name is padded with ‘*’ to a width of 25. By default, padding is added on the left.

##  [1] "****************Baltimore" "******************Chicago"
##  [3] "******************Houston" "****************Cleveland"
##  [5] "***************Washington" "****************Milwaukee"
##  [7] "*************Philadelphia" "**************Kansas City"
##  [9] "****************Nashville" "****************St. Louis"

We can also pass a vector of padding characters. Here ‘-’ and a space alternate, and each name is padded to a width of 15.

##  [1] "------Baltimore" "        Chicago" "--------Houston" "      Cleveland"
##  [5] "-----Washington" "      Milwaukee" "---Philadelphia" "    Kansas City"
##  [9] "------Nashville" "      St. Louis"
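A minimal sketch of both padding examples, using a hand-typed three-city stand-in for murder$city. The widths (25 and 15) are read off the outputs above. The pad vector is expanded to the length of the input with rep_len() as a precaution, on the assumption that some stringr versions do not recycle a shorter vector.

```r
library(stringr)

# Stand-in for murder$city
city <- c("Baltimore", "Chicago", "Houston")

# Pad on the left (the default side) with "*" up to 25 characters
padded_left <- str_pad(city, width = 25, side = "left", pad = "*")

# Alternate "-" and " " as pad characters, one per input string
pads <- rep_len(c("-", " "), length(city))
padded_alt <- str_pad(city, width = 15, pad = pads)

print(padded_left)
print(padded_alt)
```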

str_dup()

str_dup(string, times)

Purpose:
Repeat a string.

Input:
String or vector of strings
times - Number of times to duplicate each string.

Output:
String or vector of strings

Example:

We will use ‘state’ here to show how str_dup works. In this example each value is repeated twice.

##  [1] "MarylandMaryland"         "IllinoisIllinois"        
##  [3] "TexasTexas"               "OhioOhio"                
##  [5] "D.C.D.C."                 "WisconsinWisconsin"      
##  [7] "PennsylvaniaPennsylvania" "MissouriMissouri"        
##  [9] "TennesseeTennessee"       "MissouriMissouri"

times can also be a vector. Here the values are alternately repeated twice and three times.

##  [1] "MarylandMaryland"            "IllinoisIllinoisIllinois"   
##  [3] "TexasTexas"                  "OhioOhioOhio"               
##  [5] "D.C.D.C."                    "WisconsinWisconsinWisconsin"
##  [7] "PennsylvaniaPennsylvania"    "MissouriMissouriMissouri"   
##  [9] "TennesseeTennessee"          "MissouriMissouriMissouri"

We can also use a range, as in the example below, where the number of repetitions increases from 1 to 10.

##  [1] "Maryland"                                                                            
##  [2] "IllinoisIllinois"                                                                    
##  [3] "TexasTexasTexas"                                                                     
##  [4] "OhioOhioOhioOhio"                                                                    
##  [5] "D.C.D.C.D.C.D.C.D.C."                                                                
##  [6] "WisconsinWisconsinWisconsinWisconsinWisconsinWisconsin"                              
##  [7] "PennsylvaniaPennsylvaniaPennsylvaniaPennsylvaniaPennsylvaniaPennsylvaniaPennsylvania"
##  [8] "MissouriMissouriMissouriMissouriMissouriMissouriMissouriMissouri"                    
##  [9] "TennesseeTennesseeTennesseeTennesseeTennesseeTennesseeTennesseeTennesseeTennessee"   
## [10] "MissouriMissouriMissouriMissouriMissouriMissouriMissouriMissouriMissouriMissouri"
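The three str_dup() variants can be sketched as follows, with a hand-typed four-state stand-in for murder$state. The times vectors are built to the same length as the input, on the assumption that some stringr versions do not recycle a shorter vector.

```r
library(stringr)

# Stand-in for murder$state
state <- c("Maryland", "Illinois", "Texas", "Ohio")

# Repeat every value twice
twice <- str_dup(state, 2)

# Alternate between two and three repetitions
alternating <- str_dup(state, rep_len(c(2, 3), length(state)))

# Increasing number of repetitions: 1, 2, 3, ...
ramp <- str_dup(state, seq_along(state))

print(twice)
print(alternating)
print(ramp)
```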

str_detect()

str_detect(string, pattern)

Purpose:
Detects pattern in the string.

Input:
String or vector of strings
Pattern - you can use regular expressions here

Output:
A logical value or vector of logical values

Example:

We will again use the ‘City_State’ column of our dataframe to detect the pattern. str_detect() checks for the presence or absence of a pattern and returns a logical vector.

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

If we change the pattern so that it doesn’t match, it returns FALSE. Here I introduced a space after the comma.

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
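A sketch of the two detection calls. Both patterns are assumptions reconstructed from the text (the second adds a space after the comma), and the input is a hand-typed stand-in for City_State.

```r
library(stringr)

# Stand-in for murder$City_State
city_state <- c("Baltimore,Maryland", "Washington,D*C*", "St* Louis,Missouri")

# Assumed pattern: letters or spaces on both sides of a comma
has_pair <- str_detect(city_state, "[a-zA-Z ]+,[a-zA-Z ]+")

# Same pattern, but requiring a space right after the comma
has_pair_with_space <- str_detect(city_state, "[a-zA-Z ]+, [a-zA-Z ]+")

print(has_pair)
print(has_pair_with_space)
```

Because the result is a logical vector, it can be passed straight to dplyr's filter() or used for base subsetting.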

str_locate()

str_locate(string, pattern)

Purpose:
Locates the first position of a pattern and returns a numeric matrix with columns start and end.

Input:
String or vector of strings
Pattern - you can use regular expressions here

Output:
An integer matrix with columns start and end.

Example:

Here we use the column City_State to locate the given pattern. Every value in the column contains the pattern, and the result gives the start and end positions of the first match. The match in row 10 starts at position 4 because ‘St*’ precedes it.

##       start end
##  [1,]     1  18
##  [2,]     1  16
##  [3,]     1  13
##  [4,]     1  14
##  [5,]     1  12
##  [6,]     1  19
##  [7,]     1  25
##  [8,]     1  20
##  [9,]     1  19
## [10,]     4  18
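A sketch of the locate call, again assuming the letters-comma-letters pattern and a hand-typed stand-in for City_State. The last line uses the located positions with str_sub() to pull out the matched text.

```r
library(stringr)

# Stand-in for murder$City_State
city_state <- c("Baltimore,Maryland", "Washington,D*C*", "St* Louis,Missouri")

# Start and end position of the first match in each string
loc <- str_locate(city_state, "[a-zA-Z ]+,[a-zA-Z ]+")

# Use the positions to extract the matched substring
matched <- str_sub(city_state, loc[, "start"], loc[, "end"])

print(loc)
print(matched)
```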

Conclusion

The stringr package is extremely useful for string manipulation, which is a large part of data cleaning. It works well with regex patterns, as demonstrated above, and is part of the tidyverse. All functions deal with NAs and zero-length vectors in the same way, and the output from one function is easy to feed into the input of another. It is also helpful when preparing large datasets for presentation.