The most basic way to create strings is to use quotation marks and assign a string to an object.
quote <- "The Legend"
author <- "Fraser Myers"
The paste() function in base R is used for creating and building strings. str_c() from the stringr package is similar, but by default it joins strings without a separator, like paste0().
paste(quote, "by", author)
## [1] "The Legend by Fraser Myers"
Use paste0() to paste strings together with no separator between them.
paste0("I", "love", "data")
## [1] "Ilovedata"
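paste() also accepts the sep and collapse arguments. The short sketch below shows the difference between the two; the values and the object names dated, states and joined are made up for illustration.

```r
# sep sets the string placed between the arguments
dated <- paste("2024", "01", "31", sep = "-")
dated
## [1] "2024-01-31"

# collapse joins the elements of one vector into a single string
states <- c("VIC", "NSW", "WA")
joined <- paste(states, collapse = ", ")
joined
## [1] "VIC, NSW, WA"
```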
Strings and characters can be tested with is.character() and any other data format can be converted into strings/characters with as.character().
is.character(quote)
## [1] TRUE
as.character(pi)
## [1] "3.14159265358979"
Printing strings/characters can be done with the following:
Print without quotes.
print( paste(quote,author) , quote = FALSE)
## [1] The Legend Fraser Myers
Same as above, but cat() does not print the index prefix [1].
cat( paste(quote,author) )
## The Legend Fraser Myers
Print alphabet
cat(letters)
## a b c d e f g h i j k l m n o p q r s t u v w x y z
Specify a separator between the combined characters
cat(letters, sep = "-")
## a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z
Print with no breaks between lines
cat(quote, author, fill = FALSE)
## The Legend Fraser Myers
Print with breaks between lines
cat(letters, letters, letters, fill = TRUE)
## a b c d e f g h i j k l m n o p q r s t u v w x y z a b c d e f g h i j k l m n
## o p q r s t u v w x y z a b c d e f g h i j k l m n o p q r s t u v w x y z
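To force a line break at a specific place, one option is to pass "\n" as the separator. This is a small sketch; the names title and writer are illustrative stand-ins chosen so that the objects used earlier are not overwritten.

```r
title <- "The Legend"
writer <- "Fraser Myers"

# "\n" is the newline character, so each argument starts a new line
cat(title, writer, sep = "\n")
## The Legend
## Fraser Myers

# capture.output() stores the printed lines instead of showing them
printed <- capture.output(cat(title, writer, sep = "\n"))
```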
Count the number of elements in a character vector using length(). A single string counts as one element, no matter how many characters it contains.
length("Fraser Myers is a God")
## [1] 1
length( c("How", "many", "elements", "are", "in", "this", "string?") )
## [1] 7
Count the number of characters in each string using nchar(). Spaces are counted as characters.
nchar("Fraser Myers is a God")
## [1] 21
nchar(c("How", "many", "characters", "are", "in", "this", "string?"))
## [1] 3 4 10 3 2 4 7
To convert all upper case characters to lower use tolower()
To convert all lower case characters to upper use toupper()
a <- "MATH2349 is AWesomE"
tolower(a)
## [1] "math2349 is awesome"
toupper(a)
## [1] "MATH2349 IS AWESOME"
To replace a character in a string use chartr()
# replace 'z' with 's'
american <- "This is how we analyze."
chartr(old = "z", new = "s", american)
## [1] "This is how we analyse."
# replace 'i' with 'w', 'X' with 'h' and 's' with 'y'
x <- "MiXeD cAsE 123"
chartr(old ="iXs", new ="why", x)
## [1] "MwheD cAyE 123"
To replace a pattern in a string use gsub()
# replace "ot" pattern with "ut"
x <- "R Totorial"
gsub(pattern = "ot", replacement="ut", x)
## [1] "R Tutorial"
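gsub() replaces every match, while its sibling sub() replaces only the first one. Both treat the pattern as a regular expression unless fixed = TRUE is set. A minimal sketch with made-up inputs:

```r
word <- "banana"

# sub() stops after the first match
first_only <- sub(pattern = "an", replacement = "AN", word)
first_only
## [1] "bANana"

# gsub() replaces all matches
every_match <- gsub(pattern = "an", replacement = "AN", word)
every_match
## [1] "bANANa"

# "." is a regex wildcard, so without fixed = TRUE every character matches
as_regex <- gsub(".", "-", "a.b")
as_regex
## [1] "---"
as_literal <- gsub(".", "-", "a.b", fixed = TRUE)
as_literal
## [1] "a-b"
```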
To abbreviate strings we can use abbreviate()
streets <- c("Victoria", "Yarra", "Russell", "Williams", "Swanston")
# default abbreviations
abbreviate(streets)
## Victoria Yarra Russell Williams Swanston
## "Vctr" "Yarr" "Rssl" "Wllm" "Swns"
# set minimum length of abbreviation
abbreviate(streets, minlength = 2)
## Victoria Yarra Russell Williams Swanston
## "Vc" "Yr" "Rs" "Wl" "Sw"
The purpose of substr() is to extract and replace substrings between specified start and stop positions.
alphabet <- paste(LETTERS, collapse = "")
alphabet
## [1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# extract 18-24th characters in alphabet
substr(alphabet, start = 18, stop = 24)
## [1] "RSTUVWX"
# replace 19-24th characters with `R`
substr(alphabet, start = 19, stop = 24) <- "RRRRRR"
alphabet
## [1] "ABCDEFGHIJKLMNOPQRRRRRRRYZ"
To split the elements of a character string use strsplit()
z <- "Victoria Yarra Russell Williams Swanston"
strsplit(z, split = " ")
## [[1]]
## [1] "Victoria" "Yarra" "Russell" "Williams" "Swanston"
a <- "Victoria-Yarra-Russell-Williams-Swanston"
strsplit(a, split = "-")
## [[1]]
## [1] "Victoria" "Yarra" "Russell" "Williams" "Swanston"
Note that the output of strsplit() is a list. To convert the output to an atomic vector, use unlist().
unlist(strsplit(a, split = "-"))
## [1] "Victoria" "Yarra" "Russell" "Williams" "Swanston"
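The list output makes more sense when strsplit() is given several strings, because it returns one list element per input string. The sketch below uses a made-up vector called routes.

```r
routes <- c("Victoria-Yarra", "Russell-Williams-Swanston")
pieces <- strsplit(routes, split = "-")

# number of pieces found in each input string
lengths(pieces)
## [1] 2 3

# take the first piece of every element
first_piece <- sapply(pieces, `[`, 1)
first_piece
## [1] "Victoria" "Russell"

# an empty split breaks a string into single characters
chars <- strsplit("Yarra", split = "")[[1]]
chars
## [1] "Y" "a" "r" "r" "a"
```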
Set operations such as union(), intersect() and setdiff() also work on character vectors.
set_1 <- c("VIC", "NSW", "WA", "TAS")
set_2 <- c("TAS", "QLD", "SA", "NSW")
union(set_1, set_2)
## [1] "VIC" "NSW" "WA" "TAS" "QLD" "SA"
intersect(set_1, set_2)
## [1] "NSW" "TAS"
setdiff(set_1, set_2)
## [1] "VIC" "WA"
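Two related base R tools are %in%, which tests membership, and setequal(), which compares two sets while ignoring order. A brief sketch using the same two sets (the object names is_member and same_set are illustrative):

```r
set_1 <- c("VIC", "NSW", "WA", "TAS")
set_2 <- c("TAS", "QLD", "SA", "NSW")

# is each element of set_1 also in set_2?
is_member <- set_1 %in% set_2
is_member
## [1] FALSE TRUE FALSE TRUE

# order does not matter when comparing sets
same_set <- setequal(c("VIC", "NSW"), c("NSW", "VIC"))
same_set
## [1] TRUE
```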
The stringr package provides str_dup() for duplicating strings. (Base R gained a comparable function, strrep(), in version 3.3.0.)
library(stringr)
str_dup("apples", times = 4)
## [1] "applesapplesapplesapples"
library(stringr)
str_dup("apples", times = 1:4)
## [1] "apples" "applesapples"
## [3] "applesapplesapples" "applesapplesapplesapples"
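For comparison, here is a sketch of the same duplication without stringr, assuming R 3.3.0 or later, where strrep() is available in base R.

```r
# repeat one string four times
once <- strrep("apples", times = 4)
once
## [1] "applesapplesapplesapples"

# times is vectorised, like in str_dup()
stepped <- strrep("ab", times = 1:3)
stepped
## [1] "ab" "abab" "ababab"
```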
In string processing, a common task is parsing text into individual words.
Often, this results in words having blank spaces (whitespace) at either end. str_trim() can be used to remove these spaces.
text <- c("Text ", " with", " whitespace ", " on", "both ", " sides ")
text
## [1] "Text " " with" " whitespace " " on" "both "
## [6] " sides "
str_trim(text, side = "left")
## [1] "Text " "with" "whitespace " "on" "both "
## [6] "sides "
str_trim(text, side = "right")
## [1] "Text" " with" " whitespace" " on" "both"
## [6] " sides"
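When side is not given, str_trim() removes whitespace from both ends. The sketch below also shows str_squish(), which additionally collapses repeated spaces inside a string; this assumes a stringr version that includes str_squish() (1.3.0 or later). The vector messy is made up for illustration.

```r
library(stringr)

messy <- c("  both sides  ", "inner   gaps   here")

# default is side = "both"; inner spaces are left alone
trimmed <- str_trim(messy)
trimmed
## [1] "both sides" "inner   gaps   here"

# str_squish() trims both ends and reduces inner runs to one space
squished <- str_squish(messy)
squished
## [1] "both sides" "inner gaps here"
```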
Conversely, to add whitespace, or to pad a string, we can use str_pad().
str_pad("apples", width = 10, side = "left")
## [1] "    apples"
str_pad("apples", width = 10, side = "both")
## [1] "  apples  "
Use str_pad() to pad a string with specified characters. The width argument will give width of padded strings and the pad argument will specify the padding characters.
str_pad("apples", width = 10, side = "right", pad = "!")
## [1] "apples!!!!"
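A common use of the pad argument is zero-padding identifiers to a fixed width. The following is a small sketch with a made-up vector ids; it also shows that strings already longer than width are returned unchanged.

```r
library(stringr)

ids <- c("7", "42", "365")

# pad on the left with zeros up to three characters
padded <- str_pad(ids, width = 3, side = "left", pad = "0")
padded
## [1] "007" "042" "365"

# a string longer than width is not truncated
untouched <- str_pad("apples", width = 3)
untouched
## [1] "apples"
```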
str_detect() detects the presence or absence of a pattern and returns a logical vector.
# detects pattern "ea"
x <- c("apple", "banana", "pear")
str_detect(x, pattern ="ea")
## [1] FALSE FALSE TRUE
#same as above
str_detect(x, "ea")
## [1] FALSE FALSE TRUE
When matching patterns, you can also use regular expressions.
Regular expressions (a.k.a. regexes) are a language that allows you to describe patterns in strings.
You can perform a case-insensitive match using ignore_case = TRUE:
bananas <- c("banana", "Banana", "BANANA")
#case insensitive match
str_detect(bananas, regex("banana",ignore_case = TRUE))
## [1] TRUE TRUE TRUE
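For reference, a base R version of the same check can be sketched with grepl() and its ignore.case argument. The object names strict and relaxed are illustrative.

```r
bananas <- c("banana", "Banana", "BANANA")

# case-sensitive by default
strict <- grepl("banana", bananas)
strict
## [1] TRUE FALSE FALSE

# ignore.case = TRUE matches regardless of case
relaxed <- grepl("banana", bananas, ignore.case = TRUE)
relaxed
## [1] TRUE TRUE TRUE
```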
With regex, you can create your own character classes using [ ].
[abc]: matches a, b, or c.
[a-z]: matches every character between a and z (in Unicode code point order).
[^abc]: matches anything except a, b, or c.
[\^\-]: matches ^ or -.
There are a number of pre-built classes that you can use inside [].
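The character classes above can be tried out with str_detect(). This is a sketch on a made-up vector called samples. Note that inside an R string each backslash has to be doubled, so [\^\-] is written as "[\\^\\-]".

```r
library(stringr)

samples <- c("cab", "xyz", "123", "a-b", "x^y")

# contains a, b, or c
has_abc <- str_detect(samples, "[abc]")
has_abc
## [1] TRUE FALSE FALSE TRUE FALSE

# contains at least one character that is not a lower-case letter
has_other <- str_detect(samples, "[^a-z]")
has_other
## [1] FALSE FALSE TRUE TRUE TRUE

# contains a literal ^ or -
has_symbol <- str_detect(samples, "[\\^\\-]")
has_symbol
## [1] FALSE FALSE FALSE TRUE TRUE

# a pre-built class used inside [ ]
has_digit <- str_detect(samples, "[[:digit:]]")
has_digit
## [1] FALSE FALSE TRUE FALSE FALSE
```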
str_subset() returns the elements of a character vector that match a regular expression.
Using a few character names from the starwars data set, let's subset the names that contain any punctuation:
stars <- c("Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", "Owen Lars")
stars
## [1] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader"
## [5] "Leia Organa" "Owen Lars"
str_subset(stars, "[:punct:]")
## [1] "C-3PO" "R2-D2"
str_extract() extracts text corresponding to the first match, returning a character vector.
str_extract(stars, "[:punct:]")
## [1] NA "-" "-" NA NA NA
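To get every match rather than only the first, stringr offers str_extract_all(), which returns a list with one element per input string. A short sketch, using a made-up vector droids and a digit pattern:

```r
library(stringr)

droids <- c("C-3PO", "R2-D2", "Luke Skywalker")

# one list element per string; each holds all matches found
digits <- str_extract_all(droids, "[0-9]")

digits[[1]]
## [1] "3"
digits[[2]]
## [1] "2" "2"

# a string with no match gives an empty character vector
length(digits[[3]])
## [1] 0
```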
str_locate() locates the first position of a pattern and returns a numeric matrix with columns start and end whereas str_locate_all() locates all positions of a given pattern.
str_locate(stars, "[:punct:]") %>% head()
## start end
## [1,] NA NA
## [2,] 2 2
## [3,] 3 3
## [4,] NA NA
## [5,] NA NA
## [6,] NA NA
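str_locate_all() can be sketched in the same way. It returns a list of matrices, one per input string, with one row per match. The example string and the object name positions are chosen for illustration.

```r
library(stringr)

# locate every digit in a single string
positions <- str_locate_all("R2-D2", "[0-9]")

# the first (and only) list element is a matrix of start/end positions
positions[[1]]
## start end
## [1,] 2 2
## [2,] 5 5
```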
str_count() counts the number of matches of a pattern in each string.
str_count(stars, "[:punct:]")
## [1] 0 1 1 0 0 0
str_replace() replaces the first match of a pattern in each string, while str_replace_all() replaces every match.
The pattern argument gives the text to be replaced and the replacement argument specifies the replacement string.
head(fruit)
## [1] "apple" "apricot" "avocado" "banana" "bell pepper"
## [6] "bilberry"
# Replace berry with berries
head(str_replace(fruit, pattern = "berry", replacement = "berries"))
## [1] "apple" "apricot" "avocado" "banana" "bell pepper"
## [6] "bilberries"
#replace first l with "" (delete first l)
str_replace("Hello world", pattern = "l", replacement = "")
## [1] "Helo world"
# replace all l's with "" (delete l's)
str_replace_all("Hello world", pattern = "l", replacement = "")
## [1] "Heo word"
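str_replace_all() can also perform several replacements in one call when it is given a named character vector, where the names are the patterns and the values are the replacements. A minimal sketch with made-up replacements:

```r
library(stringr)

# names are patterns, values are replacements; they are applied in order
swapped <- str_replace_all("Hello world", c("l" = "L", "o" = "0"))
swapped
## [1] "HeLL0 w0rLd"
```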