Introduction

Most data analysis deals with numeric data. Text mining and other work with text data, however, require some familiarity with string handling. String variables usually need pre-processing before any meaningful analysis can be carried out. Even a variable as simple as gender may appear in raw data as M, F, m, f, male, or female, and the dataset must be cleaned before, say, calculating the proportion of females. This can be done by hand in Excel, but that is time-consuming and prone to manual errors. With the stringr package in R, the same can be done efficiently and reproducibly.
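
As a minimal sketch of the gender example, the snippet below standardises a hypothetical vector, gender_raw (invented for illustration and not part of the dataset used later), and then computes the proportion of females. It assumes that every value starts with "m" or "f" once whitespace and case are normalised.

```r
library(stringr)

# Hypothetical raw values, invented for illustration
gender_raw <- c("M", "F", "m", "f", "male", "female", " Female ")

# Trim whitespace, convert to lowercase, then keep only the first letter
gender_clean <- str_sub(str_to_lower(str_trim(gender_raw)), start = 1, end = 1)

# Proportion of females among the cleaned values
prop_female <- mean(gender_clean == "f")
```

Real data may contain other codings (for example missing values or free text), in which case the first-letter rule would need to be replaced by an explicit lookup.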

Expectations from this script

This script shows readers how to do the following in R with the ‘stringr’ package:

  1. Detect patterns
  2. Subset according to patterns
  3. Adjust length of a string variable
  4. Transform string variables
  5. Join string variables
  6. Split string variables

Creating a dummy dataset with a string (text) variable named “column”

library(tidyverse)
## -- Attaching packages ------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.0     v purrr   0.3.3
## v tibble  3.0.0     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts --------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(stringr)
# Build the data frame with a named character column directly
dat <- data.frame(
  column = c(
    "To do, or not, to do - 1234.",
    "How to do?",
    "LET US Do IT",
    "12345"
  ),
  stringsAsFactors = FALSE # keep the text as character, not factor
)

01. DETECT PATTERNS IN A STRING

For illustration, the pattern used throughout is “Do” or “do”, written as the regular expression (D|d)o.

Which rows contain the pattern?

dat$column %>% 
  str_which(
  "(D|d)o", 
  negate = F
  )
## [1] 1 2 3

An alternative way to detect the rows with the pattern

dat$column %>% 
  str_detect(
  "(D|d)o", 
  negate = F # negate = T flags the strings without the pattern instead
  )
## [1]  TRUE  TRUE  TRUE FALSE
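
Since str_detect() returns a logical vector, its result can be summed or averaged, and the same strings can also be found with a case-insensitive pattern. The sketch below rebuilds the four example strings in a standalone vector x (a name used here only for illustration). Note that regex("do", ignore_case = TRUE) would also match "DO" and "dO", so it is slightly broader than (D|d)o.

```r
library(stringr)

# The four example strings, rebuilt as a plain character vector
x <- c("To do, or not, to do - 1234.", "How to do?", "LET US Do IT", "12345")

# Same pattern as above, and a case-insensitive alternative
has_do    <- str_detect(x, "(D|d)o")
has_do_ci <- str_detect(x, regex("do", ignore_case = TRUE))

n_with_pattern     <- sum(has_do)  # number of strings containing the pattern
share_with_pattern <- mean(has_do) # proportion of strings containing the pattern
```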

How many times does the pattern appear in each string?

dat$column %>%
  str_count(
  "(D|d)o"
  )
## [1] 2 1 1 0

What is the location of the first occurrence of the pattern in each string?

dat$column %>% 
  str_locate(
  "(D|d)o"
  )
##      start end
## [1,]     4   5
## [2,]     8   9
## [3,]     8   9
## [4,]    NA  NA
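
The start and end positions returned by str_locate() can be passed to str_sub() to pull out the matched text. The sketch below is only meant to show how the two functions relate; x is an illustrative copy of the example strings, and in practice str_extract() gives the same result in one step.

```r
library(stringr)

x <- c("To do, or not, to do - 1234.", "How to do?", "LET US Do IT", "12345")

# Two-column matrix of start/end positions of the first match (NA if none)
loc <- str_locate(x, "(D|d)o")

# Cut each string at those positions; strings without a match give NA
first_match <- str_sub(x, start = loc[, "start"], end = loc[, "end"])
```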

What is the location of every occurrence of the pattern in each string?

dat$column %>% 
  str_locate_all(
  "(D|d)o"
  )
## [[1]]
##      start end
## [1,]     4   5
## [2,]    19  20
## 
## [[2]]
##      start end
## [1,]     8   9
## 
## [[3]]
##      start end
## [1,]     8   9
## 
## [[4]]
##      start end

02. SUBSET ACCORDING TO PATTERNS

How to extract the first pattern match in a string?

dat$column %>% 
  str_extract(
  "(D|d)o"
  )
## [1] "do" "do" "Do" NA

How to extract all pattern matches in a string?

dat$column %>% 
  str_extract_all(
    "(D|d)o",
    simplify = T
  )
##      [,1] [,2]
## [1,] "do" "do"
## [2,] "do" ""  
## [3,] "Do" ""  
## [4,] ""   ""

How to extract the first match in a string together with the match for each sub-group () of the pattern?

dat$column %>% 
  str_match(
    "(D|d)o"
  )
##      [,1] [,2]
## [1,] "do" "d" 
## [2,] "do" "d" 
## [3,] "Do" "D" 
## [4,] NA   NA
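
Sub-groups are most useful when only part of the match is of interest. As a hedged illustration, the pattern below (an assumption chosen for this example, not used elsewhere in the script) captures the digits that sit between "- " and a closing full stop; column 1 of the result holds the whole match and column 2 holds the captured digits.

```r
library(stringr)

x <- c("To do, or not, to do - 1234.", "How to do?", "LET US Do IT", "12345")

# One capture group: the digits between "- " and the final "."
m <- str_match(x, "- (\\d+)\\.")

# Column 2 is the first sub-group; convert it to a number
trailing_number <- as.integer(m[, 2])
```

Only the first string matches this pattern, so the remaining entries are NA. The string "12345" consists of digits but lacks the surrounding "- " and ".", so it is not matched either.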

How to extract every match in a string together with the match for each sub-group () of the pattern?

dat$column %>% 
  str_match_all(
    "(D|d)o"
  )
## [[1]]
##      [,1] [,2]
## [1,] "do" "d" 
## [2,] "do" "d" 
## 
## [[2]]
##      [,1] [,2]
## [1,] "do" "d" 
## 
## [[3]]
##      [,1] [,2]
## [1,] "Do" "D" 
## 
## [[4]]
##      [,1] [,2]

How to subset only those observations (rows) containing a pattern match?

dat$column %>% 
  str_subset(
    "(D|d)o",
    negate = F
  )
## [1] "To do, or not, to do - 1234." "How to do?"                  
## [3] "LET US Do IT"

How to subset a string according to a specified location?

dat$column %>% 
  str_sub(
    start = 2,
    end = -3
  )
## [1] "o do, or not, to do - 123" "ow to d"                  
## [3] "ET US Do "                 "23"
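
Negative positions in str_sub() count from the end of the string, which makes it easy to take the first or last few characters. A small sketch, using an illustrative copy x of the original example strings:

```r
library(stringr)

x <- c("To do, or not, to do - 1234.", "How to do?", "LET US Do IT", "12345")

# Last three characters: start three positions from the end
last_three <- str_sub(x, start = -3)

# First three characters: stop at position 3
first_three <- str_sub(x, end = 3)
```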

03. ADJUST LENGTH OF A STRING VARIABLE

How to determine the length of a string?

dat$column %>% 
  str_length()
## [1] 28 10 12  5

How to trim whitespace from the ends of a string?

dat$column %>% 
  str_trim(
    side = "both" # can also be "right" or "left"
  )
## [1] "To do, or not, to do - 1234." "How to do?"                  
## [3] "LET US Do IT"                 "12345"

How to remove extra whitespace from a string?

dat$column %>% 
  str_squish()
## [1] "To do, or not, to do - 1234." "How to do?"                  
## [3] "LET US Do IT"                 "12345"
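
The example strings contain no leading, trailing, or repeated whitespace, so the outputs of str_trim() and str_squish() above are identical to the input. The sketch below uses a hypothetical vector, messy (invented for illustration), to make the difference between the two functions visible.

```r
library(stringr)

# Hypothetical strings with extra whitespace, invented for illustration
messy <- c("  To   do  ", "How  to do? ")

# str_trim() removes whitespace at the ends only
trimmed <- str_trim(messy, side = "both")

# str_squish() also collapses repeated inner whitespace to a single space
squished <- str_squish(messy)
```

Here trimmed keeps the inner runs of spaces ("To   do"), while squished reduces them to one ("To do").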

How to pad strings so that they all have the same length?

dat$column %>% 
  str_pad(
    width = 100,
    side = "right", # can also be "left" or "both"
    pad = "." # must be a single character; use " " to pad with spaces
  )
## [1] "To do, or not, to do - 1234........................................................................."
## [2] "How to do?.........................................................................................."
## [3] "LET US Do IT........................................................................................"
## [4] "12345..............................................................................................."
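
Two common uses of str_pad() are sketched below: padding to the length of the longest string rather than a fixed width, and left-padding identifiers with zeros. The vector ids is hypothetical, and x is an illustrative copy of the example strings.

```r
library(stringr)

x <- c("To do, or not, to do - 1234.", "How to do?", "LET US Do IT", "12345")

# Pad every string with spaces up to the length of the longest one
padded <- str_pad(x, width = max(str_length(x)), side = "right", pad = " ")

# Hypothetical identifiers, left-padded with zeros to three characters
ids <- c("1", "23", "456")
ids_padded <- str_pad(ids, width = 3, side = "left", pad = "0")
```

Strings that are already at least as long as width are returned unchanged; str_pad() never truncates.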

04. TRANSFORM STRING VARIABLES

How to replace part of a string by specifying location details?

Replacing characters 2 to 4 with “AAA”

# Note: this assignment overwrites dat$column in place
str_sub(
  dat$column,
  start = 2,
  end = 4
) <- "AAA"
dat$column
## [1] "TAAAo, or not, to do - 1234." "HAAAto do?"                  
## [3] "LAAAUS Do IT"                 "1AAA5"

How to replace the first matched pattern in a string?

dat$column %>% 
  str_replace(
    "(D|d)o", # pattern to be replaced
    "XXXXXX" # replacement
  )
## [1] "TAAAo, or not, to XXXXXX - 1234." "HAAAto XXXXXX?"                  
## [3] "LAAAUS XXXXXX IT"                 "1AAA5"

How to replace all matched patterns?

dat$column %>% 
  str_replace_all(
    "(D|d)o",
    "XXXXXX"
  )
## [1] "TAAAo, or not, to XXXXXX - 1234." "HAAAto XXXXXX?"                  
## [3] "LAAAUS XXXXXX IT"                 "1AAA5"
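
Because the earlier str_sub() assignment overwrote the first “do” in row 1, each modified string now contains at most one match, which hides the difference between str_replace() and str_replace_all(). The sketch below applies both to the original, unmodified first sentence (stored in an illustrative variable x) to show that difference.

```r
library(stringr)

# Original first sentence, which contains the pattern twice
x <- "To do, or not, to do - 1234."

# Only the first match is replaced
first_only <- str_replace(x, "(D|d)o", "XXXXXX")

# Every match is replaced
every_match <- str_replace_all(x, "(D|d)o", "XXXXXX")
```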

How to transform a string to lowercase?

dat$column %>% 
  str_to_lower()
## [1] "taaao, or not, to do - 1234." "haaato do?"                  
## [3] "laaaus do it"                 "1aaa5"

How to transform a string to uppercase?

dat$column %>% 
  str_to_upper()
## [1] "TAAAO, OR NOT, TO DO - 1234." "HAAATO DO?"                  
## [3] "LAAAUS DO IT"                 "1AAA5"

How to transform a string to title case (every word starting with a capital letter)?

dat$column %>% 
  str_to_title()
## [1] "Taaao, Or Not, To Do - 1234." "Haaato Do?"                  
## [3] "Laaaus Do It"                 "1aaa5"

How to transform a string to sentence case (only the first word starting with a capital letter)?

dat$column %>% 
  str_to_sentence()
## [1] "Taaao, or not, to do - 1234." "Haaato do?"                  
## [3] "Laaaus do it"                 "1aaa5"

05. JOIN STRING VARIABLES

How to add a pattern at the end of a string?

Joining “IIIIII” to the end of the existing string, separated by “---”

dat$column %>% 
  str_c(
    "IIIIII", 
    sep = "---"
  )
## [1] "TAAAo, or not, to do - 1234.---IIIIII"
## [2] "HAAAto do?---IIIIII"                  
## [3] "LAAAUS Do IT---IIIIII"                
## [4] "1AAA5---IIIIII"

How to add a pattern at the beginning of a string?

(dat$column1 <- str_c("DDDD", dat$column, sep = "----"))
## [1] "DDDD----TAAAo, or not, to do - 1234."
## [2] "DDDD----HAAAto do?"                  
## [3] "DDDD----LAAAUS Do IT"                
## [4] "DDDD----1AAA5"
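
Besides joining vectors element by element with sep, str_c() can also collapse a whole vector into a single string through its collapse argument. A short sketch, in which parts is a hypothetical vector made up for illustration:

```r
library(stringr)

# Hypothetical pieces of text, invented for illustration
parts <- c("To do", "or not", "to do")

# collapse joins all elements of one vector into a single string
one_string <- str_c(parts, collapse = ", ")

# sep joins several vectors element by element (shorter inputs are recycled)
labels <- str_c("row", 1:3, sep = "_")
```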

06. SPLIT STRING VARIABLES

How to split a string based on a pattern?

Splitting the string variables based on “,”

dat$column %>% 
  str_split(
    ",",
    n = Inf,
    simplify = F
  )
## [[1]]
## [1] "TAAAo"          " or not"        " to do - 1234."
## 
## [[2]]
## [1] "HAAAto do?"
## 
## [[3]]
## [1] "LAAAUS Do IT"
## 
## [[4]]
## [1] "1AAA5"
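
The n and simplify arguments change the shape of the result. The sketch below uses two of the original, unmodified example strings (stored in an illustrative vector x): simplify = TRUE returns a character matrix padded with empty strings, and n = 2 stops splitting after the first comma.

```r
library(stringr)

x <- c("To do, or not, to do - 1234.", "How to do?")

# Character matrix: one row per string, one column per piece
pieces <- str_split(x, ",", simplify = TRUE)

# At most two pieces: everything after the first comma stays together
two_parts <- str_split(x[1], ",", n = 2)
```

The pieces keep the space that followed each comma; str_trim() can be applied afterwards if that is not wanted.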