Most data analysis deals with numeric data. Text mining and other work with text, however, requires some familiarity with string handling, because string variables usually need pre-processing before any meaningful analysis can be carried out. Even a variable as simple as gender may appear in raw data as M, F, m, f, male, or female, so the values must be standardised before calculating, for example, the proportion of females. This can be done by hand in Excel, but that is time-consuming and prone to manual error. With the stringr package in R, the same work can be done efficiently and reproducibly.
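As a minimal sketch of the gender example, the snippet below standardises the six variants mentioned above and then computes the proportion of females. The vector `gender_raw` is made up for illustration, and keeping only the first letter assumes that every value starts with "m" or "f"; real data may need more careful handling.

```r
library(stringr)

# Hypothetical raw values, mirroring the variants listed above
gender_raw <- c("M", "F", "m", "f", "male", "female")

# Trim stray whitespace, lower-case, and keep only the first letter
gender_std <- str_sub(str_to_lower(str_trim(gender_raw)), start = 1, end = 1)
gender_std
# "m" "f" "m" "f" "m" "f"

# Proportion of females in the cleaned vector
prop_female <- mean(gender_std == "f")
prop_female
# 0.5
```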
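This post shows how to do the following in R using the ‘stringr’ package: detect, count, locate, extract, and subset patterns; take substrings and measure string length; trim and pad; replace; change case; and join and split strings.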
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.3
## v tibble 3.0.0 v dplyr 0.8.5
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts --------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stringr)
# Name the column directly instead of renaming the auto-generated one
dat <- data.frame(
  column = c(
    "To do, or not, to do - 1234.",
    "How to do?",
    "LET US Do IT",
    "12345"
  ),
  stringsAsFactors = FALSE # keep the text as character (matters in R < 4.0)
)
For illustration, the pattern used throughout is “Do” or “do”, written as the regular expression (D|d)o.
dat$column %>%
str_which(
"(D|d)o",
negate = FALSE
)
## [1] 1 2 3
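The same pattern can be written in other ways. The sketch below rebuilds the four sample strings as a plain vector `x` (an illustrative name) and compares a character class with a case-insensitive match via `regex()`. Note that the case-insensitive version is slightly broader: it would also match "DO" and "dO", which do not occur in this data.

```r
library(stringr)

x <- c("To do, or not, to do - 1234.", "How to do?", "LET US Do IT", "12345")

# Character class: matches "Do" or "do", equivalent to "(D|d)o"
idx_class <- str_which(x, "[Dd]o")
idx_class
# 1 2 3

# Case-insensitive match: also accepts "DO" and "dO"
idx_nocase <- str_which(x, regex("do", ignore_case = TRUE))
idx_nocase
# 1 2 3
```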
dat$column %>%
str_detect(
"(D|d)o",
negate = FALSE # negate = TRUE flags the strings that do not contain the pattern
)
## [1] TRUE TRUE TRUE FALSE
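To see what `negate = TRUE` does, here is a short sketch on a plain copy of the sample strings (the vector name `x` is illustrative). The `negate` argument is available in stringr 1.4.0 and later.

```r
library(stringr)

x <- c("To do, or not, to do - 1234.", "How to do?", "LET US Do IT", "12345")

# TRUE where the pattern is absent
no_match <- str_detect(x, "(D|d)o", negate = TRUE)
no_match
# FALSE FALSE FALSE TRUE

# The strings themselves, rather than a logical vector
without_do <- str_subset(x, "(D|d)o", negate = TRUE)
without_do
# "12345"
```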
dat$column %>%
str_count(
"(D|d)o"
)
## [1] 2 1 1 0
dat$column %>%
str_locate(
"(D|d)o"
)
## start end
## [1,] 4 5
## [2,] 8 9
## [3,] 8 9
## [4,] NA NA
dat$column %>%
str_locate_all(
"(D|d)o"
)
## [[1]]
## start end
## [1,] 4 5
## [2,] 19 20
##
## [[2]]
## start end
## [1,] 8 9
##
## [[3]]
## start end
## [1,] 8 9
##
## [[4]]
## start end
dat$column %>%
str_extract(
"(D|d)o"
)
## [1] "do" "do" "Do" NA
dat$column %>%
str_extract_all(
"(D|d)o",
simplify = TRUE
)
## [,1] [,2]
## [1,] "do" "do"
## [2,] "do" ""
## [3,] "Do" ""
## [4,] "" ""
dat$column %>%
str_match(
"(D|d)o"
)
## [,1] [,2]
## [1,] "do" "d"
## [2,] "do" "d"
## [3,] "Do" "D"
## [4,] NA NA
dat$column %>%
str_match_all(
"(D|d)o"
)
## [[1]]
## [,1] [,2]
## [1,] "do" "d"
## [2,] "do" "d"
##
## [[2]]
## [,1] [,2]
## [1,] "do" "d"
##
## [[3]]
## [,1] [,2]
## [1,] "Do" "D"
##
## [[4]]
## [,1] [,2]
dat$column %>%
str_subset(
"(D|d)o",
negate = FALSE
)
## [1] "To do, or not, to do - 1234." "How to do?"
## [3] "LET US Do IT"
dat$column %>%
str_sub(
start = 2,
end = -3
)
## [1] "o do, or not, to do - 123" "ow to d"
## [3] "ET US Do " "23"
dat$column %>%
str_length()
## [1] 28 10 12 5
dat$column %>%
str_trim(
side = "both" # can also be "left" or "right"
)
## [1] "To do, or not, to do - 1234." "How to do?"
## [3] "LET US Do IT" "12345"
dat$column %>%
str_squish()
## [1] "To do, or not, to do - 1234." "How to do?"
## [3] "LET US Do IT" "12345"
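None of the sample strings contains leading, trailing, or repeated whitespace, which is why both outputs above are identical to the input. A minimal sketch with a made-up string shows the difference between the two functions:

```r
library(stringr)

# Hypothetical messy input: two leading, two trailing, and repeated inner spaces
messy <- "  LET   US  Do IT  "

# str_trim() removes whitespace at the ends only
trimmed <- str_trim(messy)
trimmed
# "LET   US  Do IT"

# side = "left" leaves the trailing spaces in place
trimmed_left <- str_trim(messy, side = "left")
trimmed_left
# "LET   US  Do IT  "

# str_squish() also collapses repeated inner whitespace to a single space
squished <- str_squish(messy)
squished
# "LET US Do IT"
```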
dat$column %>%
str_pad(
width = 100,
side = "right", # can also be "left" or "both"
pad = "." # must be a single character; use " " to pad with spaces
)
## [1] "To do, or not, to do - 1234........................................................................."
## [2] "How to do?.........................................................................................."
## [3] "LET US Do IT........................................................................................"
## [4] "12345..............................................................................................."
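A common practical use of padding is to bring identifiers to a fixed width with leading zeros. The sketch below uses a made-up vector `ids`; strings already at or beyond the requested width are returned unchanged rather than truncated.

```r
library(stringr)

# Hypothetical identifiers of varying length
ids <- c("7", "42", "12345")

# Pad on the left with zeros up to a width of 5
padded <- str_pad(ids, width = 5, side = "left", pad = "0")
padded
# "00007" "00042" "12345"

# A string longer than the width is left as it is
too_long <- str_pad("123456", width = 5, side = "left", pad = "0")
too_long
# "123456"
```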
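Replacing the characters at positions 2 to 4 with “AAA” (this modifies dat$column in place, so the outputs that follow use the modified strings):
str_sub(
  dat$column,
  start = 2,
  end = 4
) <- "AAA"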
dat$column
## [1] "TAAAo, or not, to do - 1234." "HAAAto do?"
## [3] "LAAAUS Do IT" "1AAA5"
dat$column %>%
  str_replace(
    "(D|d)o", # pattern to replace (first match only)
    "XXXXXX" # replacement
  )
## [1] "TAAAo, or not, to XXXXXX - 1234." "HAAAto XXXXXX?"
## [3] "LAAAUS XXXXXX IT" "1AAA5"
dat$column %>%
str_replace_all(
"(D|d)o",
"XXXXXX"
)
## [1] "TAAAo, or not, to XXXXXX - 1234." "HAAAto XXXXXX?"
## [3] "LAAAUS XXXXXX IT" "1AAA5"
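After the “AAA” substitution, each string in dat$column contains at most one occurrence of “do” or “Do”, so the data does not show how the two functions differ. A minimal sketch on the unmodified first sentence, which contains the pattern twice, makes the difference visible (the variable name `sentence` is illustrative):

```r
library(stringr)

sentence <- "To do, or not, to do - 1234."

# str_replace() changes only the first match
first_only <- str_replace(sentence, "(D|d)o", "XXXXXX")
first_only
# "To XXXXXX, or not, to do - 1234."

# str_replace_all() changes every match
every_match <- str_replace_all(sentence, "(D|d)o", "XXXXXX")
every_match
# "To XXXXXX, or not, to XXXXXX - 1234."
```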
dat$column %>%
str_to_lower()
## [1] "taaao, or not, to do - 1234." "haaato do?"
## [3] "laaaus do it" "1aaa5"
dat$column %>%
str_to_upper()
## [1] "TAAAO, OR NOT, TO DO - 1234." "HAAATO DO?"
## [3] "LAAAUS DO IT" "1AAA5"
dat$column %>%
str_to_title()
## [1] "Taaao, Or Not, To Do - 1234." "Haaato Do?"
## [3] "Laaaus Do It" "1aaa5"
dat$column %>%
str_to_sentence()
## [1] "Taaao, or not, to do - 1234." "Haaato do?"
## [3] "Laaaus do it" "1aaa5"
Joining “IIIIII” to the end of each existing string, separated by “---”:
dat$column %>%
str_c(
"IIIIII",
sep = "---"
)
## [1] "TAAAo, or not, to do - 1234.---IIIIII"
## [2] "HAAAto do?---IIIIII"
## [3] "LAAAUS Do IT---IIIIII"
## [4] "1AAA5---IIIIII"
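Joining “DDDD” to the start of each string, separated by “----”, and storing the result in a new column:
(dat$column1 <- str_c("DDDD", dat$column, sep = "----"))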
## [1] "DDDD----TAAAo, or not, to do - 1234."
## [2] "DDDD----HAAAto do?"
## [3] "DDDD----LAAAUS Do IT"
## [4] "DDDD----1AAA5"
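The `sep` argument joins strings element by element, so the result keeps one entry per input. To merge a whole vector into a single string, `str_c()` also takes a `collapse` argument. A short sketch with made-up inputs:

```r
library(stringr)

# Hypothetical word vector
words <- c("To", "do", "or", "not")

# sep: element-wise join, one result per element
prefixed <- str_c("ID", c("1", "2"), sep = "-")
prefixed
# "ID-1" "ID-2"

# collapse: the whole vector becomes one string
one_string <- str_c(words, collapse = " ")
one_string
# "To do or not"

# Missing values propagate: joining with NA gives NA
with_na <- str_c("a", NA)
with_na
# NA
```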
Splitting each string at every “,”:
dat$column %>%
str_split(
",",
n = Inf,
simplify = FALSE
)
## [[1]]
## [1] "TAAAo" " or not" " to do - 1234."
##
## [[2]]
## [1] "HAAAto do?"
##
## [[3]]
## [1] "LAAAUS Do IT"
##
## [[4]]
## [1] "1AAA5"
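The `n` and `simplify` arguments change the shape of the result. The sketch below, on two of the unmodified sample strings (the vector name `x` is illustrative), caps the number of pieces, returns a character matrix instead of a list, and uses a regular expression to drop the space that follows each comma.

```r
library(stringr)

x <- c("To do, or not, to do - 1234.", "How to do?")

# n = 2: split at the first comma only; the rest stays in the last piece
parts_n2 <- str_split(x, ",", n = 2)
parts_n2[[1]]
# "To do" " or not, to do - 1234."

# simplify = TRUE: a character matrix, padded with "" where pieces are missing
parts_mat <- str_split(x, ",", simplify = TRUE)
dim(parts_mat)
# 2 3

# Split on a comma plus any following whitespace to avoid leading spaces
parts_clean <- str_split(x[1], ",\\s*")[[1]]
parts_clean
# "To do" "or not" "to do - 1234."
```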