The string is one of the fundamental data types in any programming language. It is usually a sequence of characters. In this R vignette, I explore basic string manipulation and then look further into regular expressions (regex), a powerful way of matching, finding, and filtering text. I start with the stringr package, which is part of the tidyverse.
# Loading "tidyverse" package in the current R environment
# "tidyverse" package contains "stringr" package
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Strings are commonly defined within double quotes. Single quotes are also valid syntax for defining strings.
string_1 <- "Hello Sydney"
string_2 <- "let's start with strings"
string_3 <- 'More manipulation with strings'
We can also create multiple strings in R and store them in a vector. Vectors are created in R with c().
# Strings in vector
sports <- c("Football", "Baseball", "Cricket", "Rugby", "Badminton")
# number of characters in string_1
str_length(string_1)
## [1] 12
# number of characters in each element of the vector sports
str_length(sports)
## [1] 8 8 7 5 9
# combining 2 strings without a separator
str_c(string_1, string_2)
## [1] "Hello Sydneylet's start with strings"
# combining 2 strings with the separator ". "
str_c(string_1, string_2, sep = ". ")
## [1] "Hello Sydney. let's start with strings"
# extracting the characters from position 1 to 17, both inclusive
str_sub(string_3, 1, 17)
## [1] "More manipulation"
# extracting the characters from position 2 to 4 in every element of the vector
str_sub(sports, 2, 4)
## [1] "oot" "ase" "ric" "ugb" "adm"
# converting all characters to upper case
str_to_upper(string_1)
## [1] "HELLO SYDNEY"
# converting all characters to lower case
str_to_lower(sports)
## [1] "football" "baseball" "cricket" "rugby" "badminton"
# capitalizing the first character of each word
str_to_title(string_3)
## [1] "More Manipulation With Strings"
# capitalizing the first character of the string
str_to_sentence(string_2)
## [1] "Let's start with strings"
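These functions also combine naturally on whole vectors. The following is a minimal sketch, assuming the sports vector from the examples above (redefined here so the block runs on its own). It shows the collapse argument of str_c(), which joins a vector into a single string, and negative positions in str_sub(), which count from the end of the string. The variable names all_sports, last_three, and codes are illustrative only.

```r
library(stringr)

# Same vector as in the examples above, repeated so the block is self-contained
sports <- c("Football", "Baseball", "Cricket", "Rugby", "Badminton")

# collapse joins all elements of a vector into one string
all_sports <- str_c(sports, collapse = ", ")

# negative positions count from the end: the last three characters of each element
last_three <- str_sub(sports, -3, -1)

# calls can be nested: upper-case the first three characters of each element
codes <- str_to_upper(str_sub(sports, 1, 3))

print(all_sports)
print(last_three)
print(codes)
```

Note that sep places a separator between the arguments of str_c(), whereas collapse joins the elements of the resulting vector.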
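Regular expressions are a powerful tool in R for matching patterns. They make it easier to filter a vector and to pick out the elements that match a specific pattern. In stringr 1.4.0, the version loaded above, the functions str_view() (shows the first match in each string) and str_view_all() (shows all matches) display what a pattern matches. From stringr 1.5.0 onward, str_view() shows all matches and str_view_all() is deprecated.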
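For filtering rather than viewing, the following is a minimal sketch, assuming the sports vector from earlier (redefined here so the block is self-contained). str_detect() returns one TRUE or FALSE per element, and str_subset() keeps only the elements that match. The variable names are illustrative only.

```r
library(stringr)

# Same vector as in the earlier examples, repeated so the block runs on its own
sports <- c("Football", "Baseball", "Cricket", "Rugby", "Badminton")

# TRUE for each element that contains "ball"
has_ball <- str_detect(sports, "ball")

# keep only the elements that contain "ball"
ball_sports <- str_subset(sports, "ball")

# "." matches any single character, so ".e." needs a character on each side of "e"
e_inside <- str_subset(sports, ".e.")

print(has_ball)
print(ball_sports)
print(e_inside)
```

The logical vector from str_detect() can also be used directly for indexing, for example sports[has_ball].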
# Matching pattern that has "ball"
str_view(sports, "ball")
# Matching pattern that has the character "e" with any one character on each side of it
str_view(sports, ".e.")
One can also match strings by how they begin or end. ^ anchors the pattern to the start of the string, while $ anchors it to the end. Let's check a few examples:
# checking strings with the first character "B" in sports
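str_view(sports, "^B")
# checking strings with the last character "l" in sports
str_view(sports, "l$")
# checking strings that are exactly "Baseball" from start to end
str_view(sports, "^Baseball$")
To match special characters literally, such as ".", "!", "?", "\", "(", ")", "{", and "}", escape them with a backslash (\). Newline, tab, and whitespace are likewise written with backslash escapes (\n, \t, \s). Because the backslash is also the escape character inside R strings, it must be doubled in code, so a literal dot is written "\\.".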
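As a minimal sketch of escaping, the block below uses a hypothetical vector of file names, files, which is not part of the original vignette. It compares an unescaped dot with an escaped one and also shows fixed(), which tells stringr to treat the whole pattern as literal text.

```r
library(stringr)

# Hypothetical example data, for illustration only
files <- c("report.csv", "report_csv", "notes.txt", "what?.csv")

# An unescaped "." matches any character, so "report_csv" matches too
loose <- str_detect(files, "report.csv")

# The escaped "\\." matches only a literal dot
strict <- str_detect(files, "report\\.csv")

# "\\?" matches a literal question mark instead of acting as a quantifier
question <- str_subset(files, "\\?")

# fixed() treats the pattern as plain text, so no escaping is needed
dot_csv <- str_subset(files, fixed(".csv"))

print(loose)
print(strict)
print(question)
print(dot_csv)
```

When the pattern is plain text with no regex features, fixed() is often easier to read than a string full of doubled backslashes.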
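Character classes and alternation match one of several alternatives: [abc] matches a, b, or c; [^abc] matches any character except a, b, or c; [a-d] matches any character from a to d; and (ab|xy) matches either ab or xy.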
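Character classes and alternation can also be used to count, extract, or remove matches rather than only highlight them. The following is a minimal sketch, assuming the sports vector from earlier (redefined here so the block runs on its own). The variable names are illustrative only.

```r
library(stringr)

# Same vector as in the earlier examples, repeated so the block is self-contained
sports <- c("Football", "Baseball", "Cricket", "Rugby", "Badminton")

# count the characters in each element that are a, b, or c (lower case only)
abc_count <- str_count(sports, "[abc]")

# extract the first occurrence of "ball" or "ket"; NA when neither is present
first_alt <- str_extract(sports, "(ball|ket)")

# remove every lower-case character in the range a to d
no_a_to_d <- str_remove_all(sports, "[a-d]")

print(abc_count)
print(first_alt)
print(no_a_to_d)
```

Character classes are case-sensitive, so the capital "B" in "Baseball" and "Badminton" is neither counted nor removed here.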
# matching those which either have a, b, or c
str_view_all(sports, "[abc]")
# matching all characters except a, b, or c
str_view_all(sports, "[^abc]")
# matching either ball or ket
str_view_all(sports, "(ball|ket)")
# matching any character from a to d, inclusive
str_view_all(sports, "[a-d]")
In this vignette, I tried to explain basic string manipulation and the main functions of the stringr package with novice R learners in mind. For further in-depth learning, one can refer here or to the R documentation. For a quick overview, this cheatsheet is a great resource.