Activity 15

Michael Marx

STRING MANIPULATION


Although R is a statistical language with numeric vectors and matrices playing a central role, character strings are surprisingly important as well. Ranging from birth dates stored in medical research data files to text-mining applications, character data arises quite frequently in R programs. Accordingly, R has a number of string-manipulation utilities, many of which will be introduced in this chapter.

grep()

The call grep(pattern,x) searches for a specified substring pattern in a vector x of strings. If x has n elements (that is, it contains n strings), then grep(pattern,x) will return a vector of length up to n. Each element of this vector will be an index i in x at which pattern was found as a substring of x[i].

Here’s an example of using grep:

grep("Pole",c("Equator","North Pole","South Pole"))
[1] 2 3

-> grep returns the indices of the elements of the vector that contain the pattern; here “Pole” occurs in the 2nd and 3rd elements

grep("pole",c("Equator","North Pole","South Pole"))
integer(0)

-> case sensitive!

In the first case, the string “Pole” was found in elements 2 and 3 of the second argument, hence the output (2,3). In the second case, string “pole” was not found anywhere, so an empty vector was returned.
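
Since the failed match above comes down to letter case, it may help to see a few optional arguments of base R's grep() next to the basic call. The following is only a sketch: the variable names places and idx are illustrative and not part of the original activity.

```r
# 'places' and 'idx' are illustrative names, not part of the original activity
places <- c("Equator", "North Pole", "South Pole")

# Case-insensitive search: "pole" now matches "Pole"
idx <- grep("pole", places, ignore.case = TRUE)
idx
# [1] 2 3

# value = TRUE returns the matching strings instead of their positions
grep("Pole", places, value = TRUE)
# [1] "North Pole" "South Pole"

# grepl() gives one TRUE/FALSE per element, which is handy for subsetting
grepl("Pole", places)
# [1] FALSE  TRUE  TRUE
```

Which form is most convenient depends on whether positions, the strings themselves, or a logical mask are needed later.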

nchar()

The call nchar(x) finds the length of a string x. Here’s an example:

nchar("South Pole")
[1] 10

-> nchar returns the number of characters within a string

The string “South Pole” was found to have 10 characters. C programmers, take note: There is no null character terminating R strings. Also note that the results of nchar() can be surprising if x is not in character mode. For instance, nchar(NA) turns out to be 2, because the logical NA is coerced to the string “NA”. In older versions of R, nchar(factor("abc")) is 1, while current versions raise an error when given a factor. For more consistent results on nonstring objects, use Hadley Wickham’s stringr package on CRAN.
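
The remarks about nonstring input can be checked with a short sketch. It assumes a reasonably recent version of R; the names lens and f are illustrative only.

```r
# nchar() is vectorized: it returns one count per element
lens <- nchar(c("Equator", "North Pole", "South Pole"))
lens
# [1]  7 10 10

# A logical NA is coerced to the string "NA", hence 2 characters
nchar(NA)
# [1] 2

# For a factor, convert to character explicitly before counting
f <- factor("abc")   # 'f' is an illustrative name
nchar(as.character(f))
# [1] 3
```

Converting with as.character() first makes the intent explicit and avoids depending on how nchar() treats non-character input.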

paste()

The call paste(...) concatenates several strings, returning the result in one long string. Here are some examples:

paste("North","Pole")
[1] "North Pole"
paste("North","Pole",sep="")
[1] "NorthPole"
paste("North","Pole",sep=".")
[1] "North.Pole"
paste("North","and","South","Poles")
[1] "North and South Poles"

-> notice the automatic space in between -> it can be avoided by specifying the separator

As you can see, the optional argument sep can be used to put something other than a space between the pieces being spliced together. If you specify sep as an empty string, the pieces won’t have any character between them.
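
Beyond sep, base R offers two related conveniences: paste0(), which behaves like paste() with sep="", and the collapse argument, which joins the elements of a vector into a single string. A brief sketch follows; the names labels and joined are illustrative.

```r
# paste0() is shorthand for paste(..., sep = "")
paste0("North", "Pole")
# [1] "NorthPole"

# paste() is vectorized: the shorter argument is recycled
labels <- paste("Pole", c("North", "South"), sep = ": ")
labels
# [1] "Pole: North" "Pole: South"

# collapse joins the elements of the result into one string
joined <- paste(c("North", "South"), collapse = " and ")
joined
# [1] "North and South"
```

Note the difference: sep goes between the arguments of one call, while collapse goes between the elements of the resulting vector.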

Concatenate “Final” and “Exam” using the steps shown above.

paste('Final', 'Exam')
[1] "Final Exam"

substr()

The call substr(x,start,stop) returns the substring in the given character position range start:stop in the given string x. Here’s an example:

substr("Equator",3,5)
[1] "uat"

-> picks the 3rd through the 5th characters from the supplied string
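
As a small extension of the example above, substr() also works on whole vectors and has an assignment form that overwrites characters in place. This is a sketch using only base R; the name s is illustrative.

```r
# substr() is vectorized over its first argument
substr(c("Equator", "North Pole"), 1, 3)
# [1] "Equ" "Nor"

# substring() is a close relative whose end position is optional
substring("Equator", 3)
# [1] "uator"

# The assignment form replaces characters 3 through 5 in place
s <- "Equator"   # 's' is an illustrative name
substr(s, 3, 5) <- "UAT"
s
# [1] "EqUATor"
```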

strsplit()

The call strsplit(x,split) splits a string x into an R list of substrings, breaking x at each occurrence of another string split. Here’s an example:

strsplit("6-16-2011",split="-")
[[1]]
[1] "6"    "16"   "2011"

-> splits a string at each occurrence of the specified split string; note that the split string itself is not included in the output

Use the function above to split “11-28-2022”.

strsplit("11-28-2022", split="-")
[[1]]
[1] "11"   "28"   "2022"
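
Because strsplit() returns a list, the pieces usually have to be extracted with [[1]] before further use. The sketch below shows that step, plus the fixed=TRUE argument, which matters when the split string contains characters that are special in regular expressions. The names parts and dotted are illustrative.

```r
# Extract the character vector from the one-element list
parts <- strsplit("6-16-2011", split = "-")[[1]]
parts
# [1] "6"    "16"   "2011"

# Convert the pieces to numbers for further computation
as.integer(parts)
# [1]    6   16 2011

# split is treated as a regular expression by default, so a literal "."
# needs fixed = TRUE (otherwise "." would match every character)
dotted <- strsplit("6.16.2011", split = ".", fixed = TRUE)[[1]]
dotted
# [1] "6"    "16"   "2011"
```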