Introduction to Text Manipulation in R

1. Loading the dataset

First, we load the BrexitData.csv file, which we used in the recitation on cross-validation, into our R session using read.csv(). This file contains sociodemographic information and official voting data from the Brexit referendum.

# Loading the data using read.csv()
myData <- read.csv("/Users/evelynebrie/Dropbox/TA/RPubs/BrexitData.csv")

# Looking at the content of our myData dataset
colnames(myData)

##  [1] "X"                      "ID"                    
##  [3] "Region.Code"            "Region"                
##  [5] "Code"                   "Area.x"                
##  [7] "Electorate"             "Expected.Ballots"      
##  [9] "Verified.Ballot.Papers" "Percent.Turnout"       
## [11] "Votes.Cast"             "Valid.Votes"           
## [13] "Remain"                 "Leave"                 
## [15] "Rejected.Ballots"       "No.Official.Mark"      
## [17] "Multiple.Marks"         "Writing.or.Mark"       
## [19] "Unmarked.or.Void"       "Percent.Remain"        
## [21] "Percent.Leave"          "Percent.Rejected"      
## [23] "Type"                   "Area.y"                
## [25] "All.Residents"          "Age.0.to.4"            
## [27] "Age.5.to.9"             "Age.10.to.14"          
## [29] "Age.15.to.19"           "Age.20.to.24"          
## [31] "Age.25.to.29"           "Age.30.to.34"          
## [33] "Age.35.to.39"           "Age.40.to.44"          
## [35] "Age.45.to.49"           "Age.50.to.54"          
## [37] "Age.55.to.59"           "Age.60.to.64"          
## [39] "Age.65.to.69"           "Age.70.to.74"          
## [41] "Age.75.to.79"           "Age.80.to.84"          
## [43] "Age.85.to.89"           "Age.90.and.Over"

2. Basic Text Manipulation

In our dataset, the Area.x and Area.y variables are slightly different. In this case, we want to modify the area names within Area.y to fit the format of the Area.x variable.

2.1 Identifying the differences between Area.x and Area.y

myData$Area.x <- as.character(myData$Area.x) # Converting to character
myData$Area.y <- as.character(myData$Area.y) # Converting to character

table(myData$Area.x == myData$Area.y, useNA="always") # 4 of these area names are not identical, and 7 rows have an NA in one or both columns

## 
## FALSE  TRUE  <NA> 
##     4   371     7

which(myData$Area.x != myData$Area.y) # Here are the rows with different area names

## [1]  33 310 322 324
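The table above reports seven NAs, yet which() returns only four rows. A minimal sketch of why, using two made-up vectors (a and b are hypothetical and not taken from BrexitData.csv): comparing anything with NA yields NA, and which() keeps only the positions that are TRUE.

```r
# Toy vectors (hypothetical values, for illustration only)
a <- c("Leeds", "York", NA, "Bath")
b <- c("Leeds", "Yorke", "Hull", NA)

cmp <- a == b                  # element-wise comparison; any NA on either side gives NA
cmp

table(cmp, useNA = "always")   # counts FALSE, TRUE and NA separately

mismatch_rows <- which(a != b) # which() ignores NA, so only true mismatches remain
mismatch_rows
```

This is why rows with a missing area name never show up among the mismatches and have to be inspected separately, for instance with is.na().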

rows <- which(myData$Area.x != myData$Area.y) # Selecting the rows with different area names
cols <- c(6, 24) # Selecting the columns for "Area.x" and "Area.y" within the myData dataframe

myData[rows,cols] # Displaying the area names of the mismatches

##                           Area.x                       Area.y
## 33  King's Lynn and West Norfolk King`s Lynn and West Norfolk
## 310             Isle of Anglesey                     Anglesey
## 322            Vale of Glamorgan        The Vale of Glamorgan
## 324            Rhondda Cynon Taf         Rhondda, Cynon, Taff

Can you find the differences between the format of the area names by looking at the output? In words, what would need to be done to make the Area.y format fit the Area.x format?
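As a side note on the column selection used above: numeric positions such as 6 and 24 stop pointing at the right columns if the file gains or loses a column. A possible alternative, sketched here on a small made-up data frame (toy, toy_rows and toy_cols are hypothetical names), is to select the columns by name.

```r
# Toy data frame standing in for myData (hypothetical values)
toy <- data.frame(
  ID     = 1:3,
  Area.x = c("A", "B", "C"),
  Area.y = c("A", "b", "C"),
  stringsAsFactors = FALSE
)

toy_rows <- which(toy$Area.x != toy$Area.y) # rows where the two names differ
toy_cols <- c("Area.x", "Area.y")           # columns selected by name, not position

mismatches <- toy[toy_rows, toy_cols]
mismatches
```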

2.2 Performing textual manipulation

Here are some useful functions to modify text strings in R:

Function	Description
`gsub()`	Replaces every match of a pattern within a text string
`strsplit()`	Divides a given text string
`paste()`	Joins separate elements to form a single text string
`substr()`	Selects a subset within a text string
`toupper()`	Converts all elements of a text string to uppercase
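The examples in this section cover gsub(), strsplit() and paste(). For the two remaining functions, here is a short sketch on a single string (the variable names area, first_four, last_eight and upper are only illustrative).

```r
area <- "Isle of Anglesey"

first_four <- substr(area, 1, 4)      # characters 1 to 4
first_four

n <- nchar(area)                      # number of characters in the string
last_eight <- substr(area, n - 7, n)  # the last eight characters
last_eight

upper <- toupper(area)                # every letter converted to uppercase
upper
```

substr() takes a start and a stop position, both counted from 1 and both included in the result.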

2.2.1 Example using `gsub()`

For this first example, we will replace a grave accent (backtick) with an apostrophe.


myData[rows[1],cols] # Displaying the relevant row

##                          Area.x                       Area.y
## 33 King's Lynn and West Norfolk King`s Lynn and West Norfolk

myData$Area.y[33] <- gsub("`","'",myData$Area.y[33]) # Running that command on this specific row

myData[rows[1],cols] # Sanity check

##                          Area.x                       Area.y
## 33 King's Lynn and West Norfolk King's Lynn and West Norfolk

# You could also run this command on the entire vector if this were a frequent problem (this applies gsub() to all rows of myData$Area.y)

# myData$Area.y <- gsub("`","'",myData$Area.y) 

identical(myData$Area.y[33],myData$Area.x[33]) # Proving that they are now identical

## [1] TRUE
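One caveat when reusing gsub() on other characters: its first argument is treated as a regular expression by default. The backtick has no special meaning, so the call above behaves as expected, but a character such as the period does. A small sketch using one of the column names from this dataset (the object names col_name, as_regex, as_literal and escaped are illustrative):

```r
col_name <- "Age.0.to.4"

as_regex <- gsub(".", " ", col_name)                  # "." matches ANY character
as_regex

as_literal <- gsub(".", " ", col_name, fixed = TRUE)  # fixed = TRUE matches the period literally
as_literal

escaped <- gsub("\\.", " ", col_name)                 # escaping the period also works
escaped
```

Setting fixed = TRUE is usually the simplest option when the pattern is plain text rather than a regular expression.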

2.2.2 Example using `strsplit()` and `paste()`

For this second example, we will split a text string from the Area.x variable and paste part of its content onto the Area.y variable. This could be done more efficiently (for instance, by using the paste() function directly), but I’m using this method for demonstration purposes.


myData[rows[2],cols] # Displaying the relevant row

##               Area.x   Area.y
## 310 Isle of Anglesey Anglesey

x <- strsplit(myData$Area.x[310], " ") # Using strsplit to divide the text string within the "Area.x" column
x

## [[1]]
## [1] "Isle"     "of"       "Anglesey"

firstword <- x[[1]][1] # Selecting the first word from the first (and only) element of that list
firstword

## [1] "Isle"

secondword <- x[[1]][2] # Selecting the second word from the first element of that list
secondword

## [1] "of"

y <- paste(firstword,secondword) # Using paste to combine these words in a "y" object
y

## [1] "Isle of"

# If you don't want paste() to insert spaces automatically, include the sep="" argument

paste(y, myData$Area.y[310]) # Pasting this "y" object to the "Area.y" name

## [1] "Isle of Anglesey"

myData$Area.y[310] <- paste(y, myData$Area.y[310]) # Same, but changing the content of this row in the "Area.y" column

myData$Area.y[310] # Sanity check

## [1] "Isle of Anglesey"

identical(myData$Area.y[310],myData$Area.x[310]) # Proving that they are now identical

## [1] TRUE
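For comparison, here is a sketch of the more direct route mentioned at the start of this example, along with the sep and collapse arguments of paste(). The object area_y is a stand-in for the original value of myData$Area.y[310], so the block runs without the dataset.

```r
area_y <- "Anglesey"                  # stand-in for the value before the fix

direct <- paste("Isle of", area_y)    # pasting the missing words directly
direct

no_space <- paste("Isle", "of", sep = "")  # sep = "" removes the default space
no_space

same_as_paste0 <- paste0("Isle", "of")     # paste0() is shorthand for sep = ""
same_as_paste0

words <- strsplit("Isle of Anglesey", " ")[[1]] # character vector of three words
rejoined <- paste(words, collapse = " ")        # collapse joins one vector into one string
rejoined
```

Note the difference between sep, which separates the arguments passed to paste(), and collapse, which joins the elements of a single vector.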

Exercises

Exercise 1

Convert the “Area.y” and “Area.x” variables from factor to character. Make the “Area.y” variable similar to the “Area.x” variable for the 322nd row of the dataset. Then, prove that these observations are now identical.

Relevant functions: as.character(), gsub(), identical().

Exercise 2

Make the “Area.y” variable similar to the “Area.x” variable for the 324th row of the dataset. Then, prove that these observations are now identical.

Relevant functions: gsub(), identical().