Overview:

The string is one of the fundamental data types in any programming language. It is usually a sequence of characters. In this R vignette, I will explore basic string manipulation and then look further into regular expressions, or regex, which are a powerful way of matching, finding, and filtering text. I will start with the stringr package, which is part of the tidyverse.

# Loading "tidyverse" package in the current R environment
# "tidyverse" package contains "stringr" package
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Defining strings:

Strings are commonly defined within double quotes. Single quotes are also valid syntax for defining strings.

string_1 <- "Hello Sydney" 
string_2 <- "let's start with strings"
string_3 <- 'More manipulation with strings'
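
The choice between the two quote styles matters mostly when the string itself contains a quote character. The following is a minimal base R sketch of that point; the variable names quote_1 and quote_2 are made up for illustration and do not appear elsewhere in this vignette.

```r
# Single quotes let a string contain double quotes without escaping
quote_1 <- 'He said "hello"'

# Inside double quotes, an embedded double quote is escaped with a backslash
quote_2 <- "He said \"hello\""

# Both lines define the same string
identical(quote_1, quote_2)
## [1] TRUE

# writeLines() prints the string contents without the escape characters
writeLines(quote_2)
## He said "hello"
```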

We can also create multiple strings in R and store them in a vector. Vectors are defined in R using c().

# Strings in vector
sports <- c("Football", "Baseball", "Cricket", "Rugby", "Badminton")

Basic manipulations and functions:

  1. Finding the length of a string: str_length() returns the number of characters in a string.
# number of characters in string_1
str_length(string_1)
## [1] 12
# number of characters in each element of the vector sports
str_length(sports)
## [1] 8 8 7 5 9
  2. Joining strings: str_c() joins two or more strings into one. The optional sep argument sets the separator placed between the joined strings.
# combining 2 strings without separator
str_c(string_1,string_2)
## [1] "Hello Sydneylet's start with strings"
# combining 2 strings with separator ". "
str_c(string_1,string_2, sep = ". ")
## [1] "Hello Sydney. let's start with strings"
  3. Sub-strings: str_sub() extracts any part of a string by start and end position.
# Extracting all characters from position 1 to 17, both inclusive
str_sub(string_3, 1, 17)
## [1] "More manipulation"
# Extracting characters from position 2 till 4, in all elements of the vector
str_sub(sports, 2, 4)
## [1] "oot" "ase" "ric" "ugb" "adm"
  4. Case conversion: four basic functions change the case of a string: str_to_upper(), str_to_lower(), str_to_title(), and str_to_sentence().
# converting all characters to upper case
str_to_upper(string_1)
## [1] "HELLO SYDNEY"
# converting all characters to lower case
str_to_lower(sports)
## [1] "football"  "baseball"  "cricket"   "rugby"     "badminton"
# capitalize first character of each word
str_to_title(string_3)
## [1] "More Manipulation With Strings"
# capitalize first character of the string
str_to_sentence(string_2)
## [1] "Let's start with strings"
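
Two further options of the functions above are worth a short sketch: the collapse argument of str_c() and negative positions in str_sub(). Neither is used elsewhere in this vignette; the variable names all_sports and last_three are made up for illustration.

```r
library(stringr)

sports <- c("Football", "Baseball", "Cricket", "Rugby", "Badminton")

# collapse joins the elements of a single vector into one string
all_sports <- str_c(sports, collapse = ", ")
all_sports
## [1] "Football, Baseball, Cricket, Rugby, Badminton"

# negative positions in str_sub() count backwards from the end of the string
last_three <- str_sub(sports, -3, -1)
last_three
## [1] "all" "all" "ket" "gby" "ton"
```

Note the difference between sep and collapse: sep goes between the separate arguments passed to str_c(), while collapse goes between the elements of the resulting vector.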

Regular expression or regex:

Regular expressions are a powerful tool for matching patterns in strings. They make it easier to filter a vector and to pick out the elements that match a specific pattern. In stringr 1.4.0, the version loaded above, str_view() highlights the first match in each string and str_view_all() highlights all matches.

# Matching pattern that has "ball"
str_view(sports,"ball")
# Matching the character "e" with any single character on either side of it
str_view(sports,".e.")
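
In stringr 1.4.0, str_view() renders its result as an HTML widget rather than console output, so the matches cannot be shown here as text. As a rough console-friendly sketch of the same two patterns, the stringr functions str_detect(), str_subset(), and str_extract() can be used instead. These functions are not otherwise covered in this vignette, and the variable names are made up for illustration.

```r
library(stringr)

sports <- c("Football", "Baseball", "Cricket", "Rugby", "Badminton")

# str_detect() returns TRUE for each element that contains the pattern
has_ball <- str_detect(sports, "ball")
has_ball
## [1]  TRUE  TRUE FALSE FALSE FALSE

# str_subset() keeps only the elements that contain the pattern
with_e <- str_subset(sports, ".e.")
with_e
## [1] "Baseball" "Cricket"

# str_extract() returns the first matched text, or NA when nothing matches
e_match <- str_extract(sports, ".e.")
e_match
## [1] NA    "seb" "ket" NA    NA
```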

A pattern can also be anchored to the start or the end of a string: ^ matches at the start of the string, while $ matches at the end. Let’s check a few examples:

# checking strings with the first character "B" in sports
str_view(sports, "^B")
# checking strings with the last character "l" in sports
str_view(sports, "l$")
# anchoring at both ends: matches only the complete string "Baseball"
str_view(sports, "^Baseball$")
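
The anchors can also be checked in the console with str_subset(), which returns the matching elements. This is only a sketch: str_subset() and the pattern "^B.*l$" are additions for illustration and are not part of the examples above. The ".*" in that pattern stands for any run of characters between the first "B" and the last "l".

```r
library(stringr)

sports <- c("Football", "Baseball", "Cricket", "Rugby", "Badminton")

# elements that start with "B"
starts_b <- str_subset(sports, "^B")
starts_b
## [1] "Baseball"  "Badminton"

# elements that end with "l"
ends_l <- str_subset(sports, "l$")
ends_l
## [1] "Football" "Baseball"

# elements that start with "B" and end with "l", with anything in between
b_to_l <- str_subset(sports, "^B.*l$")
b_to_l
## [1] "Baseball"
```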

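To match special characters literally, such as “.”, “!”, “?”, “\”, “(”, “)”, “{”, and “}”, we escape them with a backslash \. Because the backslash is also the escape character inside R strings, it has to be doubled in the pattern string, for example "\\." to match a literal dot. New line, tab, and whitespace are matched with the escapes \n, \t, and \s.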
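
A small sketch of escaping in practice follows. The sample vector and variable names are made up for illustration, and fixed() is a stringr helper that is not otherwise covered in this vignette.

```r
library(stringr)

samples <- c("a.b", "axb")

# an unescaped "." matches any character, so both strings match
dot_any <- str_detect(samples, "a.b")
dot_any
## [1] TRUE TRUE

# "\\." in R source is the regex \. and matches only a literal dot
dot_literal <- str_detect(samples, "a\\.b")
dot_literal
## [1]  TRUE FALSE

# fixed() treats the pattern as plain text, so no escaping is needed
dot_fixed <- str_detect(samples, fixed("."))
dot_fixed
## [1]  TRUE FALSE

# writeLines() shows the pattern as the regex engine receives it
writeLines("a\\.b")
## a\.b
```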

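Alternation and character classes: (ab|xy) matches either ab or xy, [abc] matches any one of a, b, or c, [^abc] matches any character except a, b, or c, and [a-d] matches any character from a to d.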

# matching every occurrence of a, b, or c
str_view_all(sports, "[abc]")
# matching every character except a, b, or c
str_view_all(sports, "[^abc]")
# matching either ball or ket
str_view_all(sports, "(ball|ket)")
# matching every character in the range a to d, both inclusive
str_view_all(sports, "[a-d]")
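
To see these matches as console output, one option is to count or extract them. The sketch below uses str_count() and str_extract(), which are stringr functions not otherwise covered in this vignette; the variable names are made up for illustration. Note that the classes are case sensitive, so the capital "B" and "C" at the start of a name are not counted.

```r
library(stringr)

sports <- c("Football", "Baseball", "Cricket", "Rugby", "Badminton")

# number of characters in each element that are a, b, or c
n_abc <- str_count(sports, "[abc]")
n_abc
## [1] 2 3 1 1 1

# widening the class to the range a-d also counts the "d" in "Badminton"
n_a_to_d <- str_count(sports, "[a-d]")
n_a_to_d
## [1] 2 3 1 1 2

# first match of either alternative, NA where neither occurs
ball_or_ket <- str_extract(sports, "(ball|ket)")
ball_or_ket
## [1] "ball" "ball" "ket"  NA     NA
```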

Conclusion:

In this vignette, I tried to explain basic string manipulation and the core functions of the stringr package with novice R learners in mind. For more in-depth learning, see the strings chapter of R for Data Science (reference 1 below) or the R documentation (reference 2). For a quick overview, the cheatsheet (reference 3) is a great resource.

References:

  1. https://r4ds.had.co.nz/strings.html#string-length
  2. https://www.rdocumentation.org/packages/stringr/versions/1.4.0/topics/str_view
  3. https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf