Julius Schmid
STRING MANIPULATION
Although R is a statistical language with numeric vectors and
matrices playing a central role, character strings are surprisingly
important as well. From birth dates stored in medical research data
files to text-mining applications, character data arises quite
frequently in R programs. Accordingly, R has a number of
string-manipulation utilities, several of which are introduced in this
chapter.
grep()
The call grep(pattern,x) searches for a specified substring pattern
in a vector x of strings. By default, pattern is interpreted as a
regular expression. If x has n elements (that is, it contains n
strings), then grep(pattern,x) will return a vector of length up to
n. Each element of this vector is an index i in x at which pattern
was found as a substring of x[i].
Here’s an example of using grep():
Our pattern is the string “Pole”, and we search for it within the
strings “Equator”, “North Pole”, and “South Pole”.
grep("Pole",c("Equator","North Pole","South Pole"))
[1] 2 3
Since the pattern is only contained in the latter two, the grep()
function returns the entry indices 2 and 3.
Note that the grep() function is case-sensitive! See what happens if
our pattern is “pole” instead of “Pole”:
grep("pole",c("Equator","North Pole","South Pole"))
integer(0)
grep() returns no indices, which means that the exact case-sensitive
pattern could not be found within the strings above. In the first
case, the string “Pole” was found in elements 2 and 3 of the second
argument, hence the output (2,3). In the second case, the string
“pole” was not found anywhere, so an empty vector was returned.
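The case-sensitivity noted above can be relaxed, and grep() can return
the matching strings rather than their positions. The following is a
minimal sketch using only base R arguments; the variable name regions
is just an illustrative choice, not part of the original example.

```r
# 'regions' is a hypothetical name for the vector used in the text
regions <- c("Equator", "North Pole", "South Pole")

# ignore.case = TRUE makes the match case-insensitive
idx <- grep("pole", regions, ignore.case = TRUE)
print(idx)

# value = TRUE returns the matching strings instead of their indices
matched <- grep("Pole", regions, value = TRUE)
print(matched)

# grepl() returns one logical value per element of x
has_pole <- grepl("Pole", regions)
print(has_pole)

# fixed = TRUE treats the pattern as a literal string rather than a
# regular expression, so "." matches only an actual period
dot_idx <- grep(".", c("a.b", "ab"), fixed = TRUE)
print(dot_idx)
```

grepl() is often the more convenient form when the result is used to
subset x or to build a condition.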
nchar()
The call nchar(x) finds the length of a string x. Here’s an
example:
nchar("South Pole")
[1] 10
The string “South Pole” was found to have 10 characters. C
programmers, take note: There is no null character terminating R
strings. Also note that the results of nchar() can be surprising if
x is not in character mode. For instance, nchar(NA) turns out to be 2,
and in current versions of R nchar(factor("abc")) raises an error
(older versions returned 1). For more consistent results on nonstring
objects, use Hadley Wickham’s stringr package on CRAN.
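To sidestep the surprises described above without leaving base R, one
option is to convert the object to character mode explicitly before
counting. The sketch below also shows that nchar() works on whole
vectors; the variable names are illustrative only.

```r
# nchar() is vectorized: it returns one length per element
lens <- nchar(c("Equator", "North Pole", "South Pole"))
print(lens)

# Convert a factor to character mode explicitly before counting
f <- factor("abc")
f_len <- nchar(as.character(f))
print(f_len)

# A missing string (NA_character_) yields NA rather than a count
na_len <- nchar(NA_character_)
print(na_len)
```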
paste()
The call paste(...) concatenates several strings, returning the result
in one long string. The arguments of paste() are the strings we want
to concatenate, together with an optional argument sep, which
determines the connector placed between them. The default connector is
a single space. Here are some examples:
paste("North","Pole")
[1] "North Pole"
paste("North","Pole",sep="")
[1] "NorthPole"
paste("North","Pole",sep=".")
[1] "North.Pole"
paste("North","and","South","Poles")
[1] "North and South Poles"
As you can see, the optional argument sep can be used to put
something other than a space between the pieces being spliced together.
If you specify sep as an empty string, the pieces won’t have any
character between them.
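paste() also accepts vectors, which the examples above do not show. As
a brief sketch using base R only (variable names are illustrative):
shorter arguments are recycled, the collapse argument joins the
resulting elements into one string, and paste0() is shorthand for
sep="".

```r
# paste() is vectorized: the single string "Pole" is recycled
labels <- paste(c("North", "South"), "Pole")
print(labels)

# collapse joins the elements of the result into a single string
joined <- paste(c("North", "South"), collapse = " and ")
print(joined)

# paste0() is equivalent to paste(..., sep = "")
fused <- paste0("North", "Pole")
print(fused)
```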
Concatenate “Final” and “Exam” using the steps shown above.
We call exactly the same functions as above, changing only the
strings and leaving the connectors the same:
# Enter Answer Here
paste("Final","Exam")
[1] "Final Exam"
paste("Final","Exam",sep="")
[1] "FinalExam"
paste("Final","Exam",sep=".")
[1] "Final.Exam"
paste("Final","and","Midterm","Exams")
[1] "Final and Midterm Exams"
Depending on the value passed to the sep argument, we get different
outputs.
substr()
The call substr(x,start,stop) returns the substring in the given
character position range start:stop in the given string x. Here’s an
example:
substr("Equator",3,5)
[1] "uat"
We start with letter number 3 (u) and return everything until letter
number 5 (t). Hence, the output is “uat”.
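A few further uses of substr() may help here. The sketch below, which
relies only on base R, shows that substr() works on vectors, that it
combines with nchar() to take characters from the end of a string, and
that it has a replacement form. The variable names are illustrative.

```r
# substr() is vectorized over x
first3 <- substr(c("Equator", "North Pole"), 1, 3)
print(first3)

# Combine with nchar() to extract the last four characters
s <- "South Pole"
last4 <- substr(s, nchar(s) - 3, nchar(s))
print(last4)

# The replacement form overwrites the given character range
word <- "Equator"
substr(word, 1, 3) <- "EQU"
print(word)
```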
strsplit()
The call strsplit(x,split) splits a string x into substrings at each
occurrence of the string split, returning the pieces in an R list.
Here’s an example:
strsplit("6-16-2011",split="-")
[[1]]
[1] "6" "16" "2011"
The strsplit() function split the date into the three components
month (June), day (16) and year (2011).
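Because strsplit() returns a list, the pieces usually need to be
extracted before further use. The following sketch, in base R with
illustrative variable names, pulls out the character vector, converts
it to integers, and shows the fixed=TRUE argument, which is needed when
the separator has a special meaning in regular expressions.

```r
# [[1]] extracts the character vector for the first (only) input string
parts <- strsplit("6-16-2011", split = "-")[[1]]
print(parts)

# Convert the pieces to integers for numeric work
nums <- as.integer(parts)
print(nums)

# With several input strings, the list has one element per string
dates <- strsplit(c("6-16-2011", "11-28-2022"), split = "-")
print(length(dates))

# split is a regular expression by default; fixed = TRUE makes "."
# a literal period instead of "any character"
pieces <- strsplit("a.b.c", split = ".", fixed = TRUE)[[1]]
print(pieces)
```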
Use the function above to split “11-28-2022”
#Enter Answer Here
strsplit("11-28-2022",split="-")
[[1]]
[1] "11" "28" "2022"
Again, the strsplit() function split the date into the three
components month (November), day (28) and year (2022).