String Manipulations

Base R Functions

Creating Strings

The most basic way to create strings is to use quotation marks and assign a string to an object.

quote <- "The Legend"

author <- "Fraser Myers"

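The paste() function in base R builds strings by joining its arguments, separated by a space by default. The stringr function str_c() plays a similar role, but its default separator is the empty string, so it behaves more like paste0().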

paste(quote, "by", author)

## [1] "The Legend by Fraser Myers"

Use paste0() to join strings with no separator between them.

paste0("I", "love", "data")

## [1] "Ilovedata"
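As a brief sketch of the two arguments that control how paste() joins things: sep goes between the arguments of one call, while collapse joins the elements of a single vector. The vector name words is made up for this illustration.

```r
# Illustrative vector; `words` is not part of the original examples
words <- c("I", "love", "data")

# sep is placed between the separate arguments of one call
with_sep <- paste("I", "love", "data", sep = "-")   # "I-love-data"

# collapse joins the elements of one vector into a single string
collapsed <- paste(words, collapse = " ")           # "I love data"

# paste() is vectorised: the shorter argument is recycled
labels <- paste("item", 1:3, sep = "_")             # "item_1" "item_2" "item_3"

print(with_sep)
print(collapsed)
print(labels)
```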

Converting to Strings

Use is.character() to test whether an object is a character vector, and as.character() to convert other data types (numeric, logical, factor and so on) into character strings.

is.character(quote)

## [1] TRUE

as.character(pi)

## [1] "3.14159265358979"
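The sketch below shows the conversion applied to whole vectors and, for comparison, the reverse conversion with as.numeric(). The values are arbitrary examples.

```r
# Numeric and logical vectors are converted element by element
num_as_chr <- as.character(c(1.5, 2, 3))      # "1.5" "2" "3"
lgl_as_chr <- as.character(c(TRUE, FALSE))    # "TRUE" "FALSE"

# The reverse conversion; text that is not a number becomes NA (with a warning)
back_to_num <- as.numeric("3.14")                  # 3.14
not_a_number <- suppressWarnings(as.numeric("abc")) # NA

# Mixing a number and a string in one vector coerces everything to character
mixed <- c(1, "a")
is.character(mixed)                                # TRUE
```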

Printing Strings

Printing strings/characters can be done with the following:

Print without quotes.

print(paste(quote, author), quote = FALSE)

## [1] The Legend Fraser Myers

cat() prints the same text, but without the leading index [1].

cat(paste(quote, author))

## The Legend Fraser Myers

Print alphabet

cat(letters)

## a b c d e f g h i j k l m n o p q r s t u v w x y z

Specify a separator between the combined characters

cat(letters, sep = "-")

## a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z

Print everything on one line, with no line breaks added (fill = FALSE, the default)

cat(quote, author, fill = FALSE)

## The Legend Fraser Myers

Break the output into lines that fit the console width (fill = TRUE)

cat(letters, letters, letters, fill = TRUE)

## a b c d e f g h i j k l m n o p q r s t u v w x y z a b c d e f g h i j k l m n 
## o p q r s t u v w x y z a b c d e f g h i j k l m n o p q r s t u v w x y z
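One detail worth illustrating: cat() does not end its output with a newline unless asked to. The sketch below uses a newline as the separator and as the final item. The names title_text and author_name are placeholders for this example only.

```r
# Hypothetical values used only for this sketch
title_text <- "The Legend"
author_name <- "Fraser Myers"

# Put each item on its own line and finish with a newline
cat(title_text, "\n", author_name, "\n", sep = "")

# capture.output() collects what cat() would print, one element per line
printed <- capture.output(cat(title_text, author_name, sep = "\n"))
print(printed)
```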

Counting String Elements

Count the number of elements in a character vector using length(). A single string, however long, is one element.

length("Fraser Myers is a God")

## [1] 1

length( c("How", "many", "elements", "are", "in", "this", "string?") )

## [1] 7

Count the number of characters in each string using nchar()

nchar("Fraser Myers is a God")

## [1] 21

nchar(c("How", "many", "characters", "are", "in", "this", "string?"))

## [1]  3  4 10  3  2  4  7
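To make the difference between the two functions concrete, the following sketch counts elements, characters and words for one sentence. Splitting on a single space with strsplit() is an assumption that works here because the words are separated by exactly one space.

```r
sentence <- "How many words are in this string?"

n_elements <- length(sentence)   # 1: one string is one element
n_chars <- nchar(sentence)       # 34: letters, spaces and punctuation all count

# Split into words, then count the pieces
words <- strsplit(sentence, split = " ")[[1]]
n_words <- length(words)         # 7

print(c(elements = n_elements, characters = n_chars, words = n_words))
```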

Upper/Lower Case Conversion

To convert all upper case characters to lower use tolower()

To convert all lower case characters to upper use toupper()

a <- "MATH2349 is AWesomE"

tolower(a)

## [1] "math2349 is awesome"

toupper(a)

## [1] "MATH2349 IS AWESOME"
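A common use of case conversion is comparing text without regard to case. The sketch below also shows one possible way to capitalise only the first letter. The helper capitalise_first() is a hypothetical function written for this example, not part of base R.

```r
# Compare user input without regard to case
answers <- c("Yes", "YES", "no", "yEs")
is_yes <- tolower(answers) == "yes"   # TRUE TRUE FALSE TRUE

# Hypothetical helper: upper-case the first letter, lower-case the rest
capitalise_first <- function(s) {
  paste0(toupper(substr(s, 1, 1)), tolower(substr(s, 2, nchar(s))))
}
street_names <- capitalise_first(c("victoria", "YARRA"))  # "Victoria" "Yarra"

print(is_yes)
print(street_names)
```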

Simple Character Replacement

To replace individual characters in a string, use chartr(). Each character in old is replaced by the character at the same position in new.

# replace 'z' with 's'
american <- "This is how we analyze."
chartr(old = "z", new = "s", american)

## [1] "This is how we analyse."

# replace 'i' with 'w', 'X' with 'h' and 's' with 'y'
x <- "MiXeD cAsE 123"
chartr(old ="iXs", new ="why", x)

## [1] "MwheD cAyE 123"

Pattern Replacement

To replace every occurrence of a pattern in a string, use gsub(). The related function sub() replaces only the first occurrence.

# replace "ot" pattern with "ut"
x <- "R Totorial"
gsub(pattern = "ot", replacement="ut", x)

## [1] "R Tutorial"
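The following sketch contrasts sub() with gsub() and shows why fixed = TRUE matters when the pattern contains a regular-expression metacharacter such as the dot. The input strings are invented for illustration.

```r
typo <- "R Totorial: a totorial on strings"

first_only <- sub("ot", "ut", typo)    # "R Tutorial: a totorial on strings"
every_match <- gsub("ot", "ut", typo)  # "R Tutorial: a tutorial on strings"

# By default the pattern is a regular expression, where "." matches any character
as_regex <- gsub(".", "-", "a.b.c")                  # "-----"
# fixed = TRUE treats the pattern as literal text
as_literal <- gsub(".", "-", "a.b.c", fixed = TRUE)  # "a-b-c"

print(c(first_only, every_match, as_regex, as_literal))
```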

String Abbreviations

To abbreviate strings we can use abbreviate()

streets <- c("Victoria", "Yarra", "Russell", "Williams", "Swanston")
# default abbreviations
abbreviate(streets)

## Victoria    Yarra  Russell Williams Swanston 
##   "Vctr"   "Yarr"   "Rssl"   "Wllm"   "Swns"

# set minimum length of abbreviation
abbreviate(streets, minlength = 2)

## Victoria    Yarra  Russell Williams Swanston 
##     "Vc"     "Yr"     "Rs"     "Wl"     "Sw"

Extract/Replace Substrings

The purpose of substr() is to extract or replace substrings at specified start and stop positions.

alphabet <- paste(LETTERS, collapse = "")
alphabet

## [1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# extract 18-24th characters in alphabet
substr(alphabet, start = 18, stop = 24)

## [1] "RSTUVWX"

# replace 19-24th characters with `R`
substr(alphabet, start = 19, stop = 24) <- "RRRRRR"
alphabet

## [1] "ABCDEFGHIJKLMNOPQRRRRRRRYZ"
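Because substr() works with positions, it combines naturally with nchar() when the substring is counted from the end of the string. The sketch below also uses substring(), the vectorised relative of substr(). The course code is just an example value.

```r
code <- "MATH2349"
n <- nchar(code)

# First four and last four characters
prefix <- substr(code, start = 1, stop = 4)      # "MATH"
suffix <- substr(code, start = n - 3, stop = n)  # "2349"

# substring() accepts vectors of positions and returns one piece per pair
pieces <- substring(code, first = 1:3, last = 3:5)  # "MAT" "ATH" "TH2"

print(c(prefix, suffix))
print(pieces)
```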

To split a character string into pieces at a given separator, use strsplit()

z <- "Victoria Yarra Russell Williams Swanston"
strsplit(z, split = " ")

## [[1]]
## [1] "Victoria" "Yarra"    "Russell"  "Williams" "Swanston"

a <- "Victoria-Yarra-Russell-Williams-Swanston"
strsplit(a, split = "-")

## [[1]]
## [1] "Victoria" "Yarra"    "Russell"  "Williams" "Swanston"

Note that the output of strsplit() is a list with one element per input string. To convert the output to an atomic vector, use unlist()

unlist(strsplit(a, split = "-"))

## [1] "Victoria" "Yarra"    "Russell"  "Williams" "Swanston"
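When the input holds several strings, the list returned by strsplit() keeps the pieces of each string apart. The sketch below picks the first piece from each and also splits on a regular expression. The date strings are made-up sample data.

```r
dates <- c("2020-01-15", "2021-12-31")

# One list element per input string; fixed = TRUE treats "-" as literal text
parts <- strsplit(dates, split = "-", fixed = TRUE)

# Take the first piece (the year) from every element
years <- sapply(parts, `[`, 1)   # "2020" "2021"

# split is a regular expression by default: " +" means one or more spaces
tokens <- strsplit("a  b   c", split = " +")[[1]]   # "a" "b" "c"

print(years)
print(tokens)
```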

Set Operations for Character Strings

The base R set functions union(), intersect() and setdiff() also work on character vectors.

set_1 <- c("VIC", "NSW", "WA", "TAS")
set_2 <- c("TAS", "QLD", "SA", "NSW")
union(set_1, set_2)

## [1] "VIC" "NSW" "WA"  "TAS" "QLD" "SA"

intersect(set_1, set_2)

## [1] "NSW" "TAS"

setdiff(set_1, set_2)

## [1] "VIC" "WA"
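Two related base R tools are membership tests with %in% and equality of sets with setequal(). The sketch below repeats the two vectors so that it can be run on its own. Note that setdiff() depends on the order of its arguments.

```r
set_1 <- c("VIC", "NSW", "WA", "TAS")
set_2 <- c("TAS", "QLD", "SA", "NSW")

# Membership: is a value (or each value of a vector) in a set?
qld_in_1 <- "QLD" %in% set_1   # FALSE
shared <- set_1 %in% set_2     # FALSE TRUE FALSE TRUE

# Same elements regardless of order?
same_set <- setequal(c("WA", "VIC"), c("VIC", "WA"))   # TRUE

# setdiff() is not symmetric: elements of set_2 that are not in set_1
only_in_2 <- setdiff(set_2, set_1)   # "QLD" "SA"

print(shared)
print(only_in_2)
```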

Stringr Functions

Duplicate Characters within a String

The stringr package provides str_dup() for duplicating strings. Base R had no dedicated function for this until strrep() was added in R 3.3.0.

library(stringr)

str_dup("apples", times = 4)

## [1] "applesapplesapplesapples"

str_dup("apples", times = 1:4)

## [1] "apples"                   "applesapples"            
## [3] "applesapplesapples"       "applesapplesapplesapples"
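For comparison, here is a short sketch of the same duplication in base R using strrep(), which is available from R 3.3.0 onwards, so no package is needed.

```r
# strrep() repeats a string a given number of times
repeated <- strrep("apples", times = 2)   # "applesapples"

# times is vectorised, just like in str_dup()
growing <- strrep("ab", times = 1:3)      # "ab" "abab" "ababab"

print(repeated)
print(growing)
```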

Remove Leading and Trailing Whitespace

In string processing, a common task is parsing text into individual words.

Often, this leaves blank spaces (whitespace) at either end of a word. str_trim() removes these spaces. Its side argument accepts "left", "right" or "both" (the default).

text <- c("Text ", "  with", " whitespace ", " on", "both ", " sides ")
text

## [1] "Text "        "  with"       " whitespace " " on"          "both "       
## [6] " sides "

str_trim(text, side = "left")

## [1] "Text "       "with"        "whitespace " "on"          "both "      
## [6] "sides "

str_trim(text, side = "right")

## [1] "Text"        "  with"      " whitespace" " on"         "both"       
## [6] " sides"
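The examples above trim one side at a time. The sketch below shows the default, which trims both sides, together with str_squish(), which also collapses repeated spaces inside the string, and the base R counterpart trimws(). The sample strings are invented for this illustration.

```r
library(stringr)

messy <- c(" whitespace ", "  too   many    spaces  ")

# Default side = "both" removes leading and trailing whitespace only
trimmed <- str_trim(messy)      # "whitespace" "too   many    spaces"

# str_squish() also reduces internal runs of whitespace to a single space
squished <- str_squish(messy)   # "whitespace" "too many spaces"

# Base R equivalent of str_trim()
trimmed_base <- trimws(messy)

print(trimmed)
print(squished)
```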

Pad a String With Whitespace

Conversely, to add whitespace, or to pad a string, we can use str_pad().

str_pad("apples", width = 10, side = "left")

## [1] "    apples"

str_pad("apples", width = 10, side = "both")

## [1] "  apples  "

str_pad() can also pad with a character other than a space. The width argument gives the total width of the padded string and the pad argument gives the padding character.

str_pad("apples", width = 10, side = "right", pad = "!")

## [1] "apples!!!!"
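A typical use of padding is giving identifiers a fixed width with leading zeros. In the sketch below the ids vector is hypothetical sample data. Note that a string already longer than width is returned unchanged rather than truncated.

```r
library(stringr)

# Hypothetical ID numbers of different lengths
ids <- c(7L, 42L, 123L, 1234567L)

# Pad on the left with zeros to a total width of 5
padded <- str_pad(as.character(ids), width = 5, side = "left", pad = "0")
print(padded)   # "00007" "00042" "00123" "1234567"

# Base R alternative for integers
padded_base <- sprintf("%05d", ids)
```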

Pattern Detection with str_detect()

str_detect() detects the presence or absence of a pattern and returns a logical vector.

# detects pattern "ea"
x <- c("apple", "banana", "pear")
str_detect(x, pattern ="ea")

## [1] FALSE FALSE  TRUE

# same as above, with the pattern argument matched by position
str_detect(x, "ea")

## [1] FALSE FALSE  TRUE

Regular Expressions (Regex)

When matching patterns, you can also use regular expressions.

Regular expressions (a.k.a. regexes) are a language that allows you to describe patterns in strings.

You can perform a case-insensitive match using ignore_case = TRUE:

bananas <- c("banana", "Banana", "BANANA")
# case-insensitive match
str_detect(bananas, regex("banana", ignore_case = TRUE))

## [1] TRUE TRUE TRUE

With regex, you can create your own character classes using [ ].

[abc]: matches a, b, or c
[a-z]: matches every character between a and z (in Unicode code point order).
[^abc]: matches anything except a, b, or c.
[\^\-]: matches ^ or -.

There are also a number of pre-built classes that you can use inside [], such as [:alpha:] (letters), [:digit:] (digits) and [:punct:] (punctuation).
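The character classes listed above can be tried out with str_detect(). This is a small sketch on an invented vector called items. Inside an R string each backslash of the regex has to be doubled, so the class that matches ^ or - is written "[\\^\\-]".

```r
library(stringr)

# Made-up test strings
items <- c("apple", "Banana", "pear42", "kiwi-fruit")

has_digit <- str_detect(items, "[0-9]")           # FALSE FALSE  TRUE FALSE
has_upper <- str_detect(items, "[A-Z]")           # FALSE  TRUE FALSE FALSE
has_non_lower <- str_detect(items, "[^a-z]")      # FALSE  TRUE  TRUE  TRUE
has_caret_dash <- str_detect(items, "[\\^\\-]")   # FALSE FALSE FALSE  TRUE

# A pre-built class placed inside []
has_digit_class <- str_detect(items, "[[:digit:]]")  # same result as has_digit

print(rbind(has_digit, has_upper, has_non_lower, has_caret_dash))
```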

String Subsetting with str_subset()

str_subset() returns the elements of a character vector that match a regular expression.

Using a few character names from the starwars data set, let’s subset the names that contain any punctuation:

stars <- c("Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", "Owen Lars")
stars

## [1] "Luke Skywalker" "C-3PO"          "R2-D2"          "Darth Vader"   
## [5] "Leia Organa"    "Owen Lars"

str_subset(stars, "[:punct:]")

## [1] "C-3PO" "R2-D2"

String Extract using str_extract()

str_extract() extracts text corresponding to the first match, returning a character vector.

str_extract(stars, "[:punct:]")

## [1] NA  "-" "-" NA  NA  NA
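Since str_extract() stops at the first match, str_extract_all() is the function to reach for when a string may contain several matches. The sketch below extracts digits from a shortened, separately named vector so that it can be run on its own.

```r
library(stringr)

# A shorter copy of the names, under a different name for this sketch
droid_names <- c("Luke Skywalker", "C-3PO", "R2-D2")

# First digit only; NA where there is no match
first_digit <- str_extract(droid_names, "[0-9]")     # NA "3" "2"

# Every digit; returns a list with one character vector per input string
all_digits <- str_extract_all(droid_names, "[0-9]")
n_digits <- lengths(all_digits)                      # 0 1 2

print(first_digit)
print(all_digits)
```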

Finding Patterns using str_locate()

str_locate() locates the first position of a pattern in each string and returns an integer matrix with columns start and end, whereas str_locate_all() locates all positions of a given pattern and returns a list of such matrices, one per string.

str_locate(stars, "[:punct:]") %>% head()

##      start end
## [1,]    NA  NA
## [2,]     2   2
## [3,]     3   3
## [4,]    NA  NA
## [5,]    NA  NA
## [6,]    NA  NA
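The positions returned by str_locate() are most useful when passed on to another function. As a sketch, the code below uses them with str_sub() to cut a name at its hyphen, and then shows the matrix that str_locate_all() returns for a string with two matches. The variable names are chosen for this example only.

```r
library(stringr)

droid <- "R2-D2"

# Position of the first hyphen (a 1-row matrix with columns start and end)
loc <- str_locate(droid, "-")

# Text before and after the hyphen
before <- str_sub(droid, 1, loc[, "start"] - 1)   # "R2"
after <- str_sub(droid, loc[, "end"] + 1)         # "D2"

# All hyphen positions in a string with two matches
all_loc <- str_locate_all("a-b-c", "-")[[1]]
hyphen_starts <- all_loc[, "start"]               # 2 4

print(c(before, after))
print(all_loc)
```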

Pattern Counting using str_count()

str_count() counts the number of matches of a pattern in each string.

str_count(stars, "[:punct:]")

## [1] 0 1 1 0 0 0

String Replacing with str_replace()

str_replace() replaces the first match of a pattern in each string with another string; str_replace_all() replaces every match.

The pattern argument gives the pattern to be replaced and the replacement argument gives the replacement string.

head(fruit)

## [1] "apple"       "apricot"     "avocado"     "banana"      "bell pepper"
## [6] "bilberry"

# Replace berry with berries
head(str_replace(fruit, pattern = "berry", replacement = "berries"))

## [1] "apple"       "apricot"     "avocado"     "banana"      "bell pepper"
## [6] "bilberries"

# replace first l with "" (delete first l)
str_replace("Hello world", pattern = "l", replacement = "")

## [1] "Helo world"

# replace all l's with "" (delete l's)
str_replace_all("Hello world", pattern = "l", replacement = "")

## [1] "Heo word"
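Two further features of the replacement functions are sketched below: str_replace_all() accepts a named vector to perform several replacements in one call, and the replacement string can refer back to groups captured by the pattern with \\1, \\2 and so on. The input strings are invented for illustration.

```r
library(stringr)

# Several replacements at once: names are patterns, values are replacements
numbered <- str_replace_all("one apple, two pears",
                            c("one" = "1", "two" = "2"))
print(numbered)   # "1 apple, 2 pears"

# Backreferences: swap the two words captured by the parentheses
swapped <- str_replace("Fraser Myers", "(\\w+) (\\w+)", "\\2, \\1")
print(swapped)    # "Myers, Fraser"
```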