First, we load the BrexitData.csv file, which we used in the recitation on cross-validation, into R using read.csv(). This file contains sociodemographic information and official voting data from the Brexit referendum.
# Loading the data using read.csv()
myData <- read.csv("/Users/evelynebrie/Dropbox/TA/RPubs/BrexitData.csv")
# Looking at the content of our myData dataset
colnames(myData)
## [1] "X" "ID"
## [3] "Region.Code" "Region"
## [5] "Code" "Area.x"
## [7] "Electorate" "Expected.Ballots"
## [9] "Verified.Ballot.Papers" "Percent.Turnout"
## [11] "Votes.Cast" "Valid.Votes"
## [13] "Remain" "Leave"
## [15] "Rejected.Ballots" "No.Official.Mark"
## [17] "Multiple.Marks" "Writing.or.Mark"
## [19] "Unmarked.or.Void" "Percent.Remain"
## [21] "Percent.Leave" "Percent.Rejected"
## [23] "Type" "Area.y"
## [25] "All.Residents" "Age.0.to.4"
## [27] "Age.5.to.9" "Age.10.to.14"
## [29] "Age.15.to.19" "Age.20.to.24"
## [31] "Age.25.to.29" "Age.30.to.34"
## [33] "Age.35.to.39" "Age.40.to.44"
## [35] "Age.45.to.49" "Age.50.to.54"
## [37] "Age.55.to.59" "Age.60.to.64"
## [39] "Age.65.to.69" "Age.70.to.74"
## [41] "Age.75.to.79" "Age.80.to.84"
## [43] "Age.85.to.89" "Age.90.and.Over"
In our dataset, the Area.x and Area.y variables are slightly different. In this case, we want to modify the area names within Area.y to fit the format of the Area.x variable.
myData$Area.x <- as.character(myData$Area.x) # Converting to character
myData$Area.y <- as.character(myData$Area.y) # Converting to character
table(myData$Area.x == myData$Area.y, useNA="always") # 4 of these area names are not identical, and 7 of the rows have NAs in either/both columns
##
## FALSE TRUE <NA>
## 4 371 7
which(myData$Area.x != myData$Area.y) # Here are the rows with different area names
## [1] 33 310 322 324
rows <- which(myData$Area.x != myData$Area.y) # Selecting the rows with different area names
cols <- c(6, 24) # Selecting the columns for "Area.x" and "Area.y" within the myData dataframe
myData[rows,cols] # Displaying the area names of the mismatches
## Area.x Area.y
## 33 King's Lynn and West Norfolk King`s Lynn and West Norfolk
## 310 Isle of Anglesey Anglesey
## 322 Vale of Glamorgan The Vale of Glamorgan
## 324 Rhondda Cynon Taf Rhondda, Cynon, Taff
Can you find the differences between the format of the area names by looking at the output? In words, what would need to be done to make the Area.y format fit the Area.x format?
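The column positions 6 and 24 used above depend on the exact layout of this file. As a minimal sketch, assuming a small made-up data frame (the `toy` object and its values are hypothetical and not taken from BrexitData.csv), the same selection can be written with column names instead:

```r
# Hypothetical toy data, used only to illustrate the idea
toy <- data.frame(
  Area.x = c("Isle of Anglesey", "Cardiff", NA),
  Area.y = c("Anglesey", "Cardiff", "Swansea"),
  stringsAsFactors = FALSE
)

# which() keeps only the TRUE comparisons, so the NA row is left out
mismatch_rows <- which(toy$Area.x != toy$Area.y)

# Selecting columns by name keeps working if the column order changes
toy[mismatch_rows, c("Area.x", "Area.y")]
```

Selecting by name is a design choice rather than a requirement; the numeric indices give the same result as long as the columns stay in the same positions.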
Here are some useful functions to modify text strings in R:
Function | Description |
---|---|
gsub() | Removes and substitutes a text string |
strsplit() | Divides a given text string |
paste() | Puts together separate elements to form a single text string |
substr() | Selects a subset within a text string |
toupper() | Converts all elements of a text string to uppercase |
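As a quick illustration of the functions in the table, here is a minimal sketch applied to a made-up string (the `area` object and the result names are hypothetical and not part of the dataset):

```r
area <- "city of london"  # hypothetical example string

replaced <- gsub("city", "City", area)   # "City of london"
words    <- strsplit(area, " ")[[1]]     # "city" "of" "london"
joined   <- paste("The", area)           # "The city of london"
first4   <- substr(area, 1, 4)           # "city"
upper    <- toupper(area)                # "CITY OF LONDON"
```

Note that strsplit() returns a list, which is why `[[1]]` is used to pull out the character vector of words.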
gsub()
For this first example, we will use gsub() to replace a grave accent (`) with an apostrophe (').
myData[rows[1],cols] # Displaying the relevant row
## Area.x Area.y
## 33 King's Lynn and West Norfolk King`s Lynn and West Norfolk
myData$Area.y[33] <- gsub("`","'",myData$Area.y[33]) # Running that command on this specific row
myData[rows[1],cols] # Sanity check
## Area.x Area.y
## 33 King's Lynn and West Norfolk King's Lynn and West Norfolk
# You could also run this command on the entire vector if this were a frequent problem (this applies the gsub() function to all rows within "myData$Area.y")
# myData$Area.y <- gsub("`","'",myData$Area.y)
identical(myData$Area.y[33],myData$Area.x[33]) # Proving that they are now identical
## [1] TRUE
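The commented-out line above applies gsub() to the whole column at once. Here is a minimal sketch of that vectorised use on a made-up vector (the `areas` values are hypothetical). The `fixed = TRUE` argument is an optional addition, not used in the original code, that makes gsub() treat the pattern as a literal string rather than a regular expression:

```r
# Hypothetical area names, two of which contain a grave accent
areas <- c("King`s Lynn", "St Albans", "Bishop`s Stortford")

# gsub() is vectorised: every element is processed in one call
cleaned <- gsub("`", "'", areas, fixed = TRUE)
cleaned
```

Elements without a grave accent are returned unchanged, so running the command on the full column does not affect names that were already correct.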
strsplit() and paste()
For this second example, we will split a text string from the Area.x variable and paste part of its content onto the corresponding Area.y value. One could do this more efficiently (for instance, by using the paste() function directly), but I’m using this method for demonstration purposes.
myData[rows[2],cols] # Displaying the relevant row
## Area.x Area.y
## 310 Isle of Anglesey Anglesey
x <- strsplit(myData$Area.x[310], " ") # Using strsplit to divide the text string within the "Area.x" column
x
## [[1]]
## [1] "Isle" "of" "Anglesey"
firstword <- x[[1]][1] # Selecting the first word of the vector stored in that list
firstword
## [1] "Isle"
secondword <- x[[1]][2] # Selecting the second word of the vector stored in that list
secondword
## [1] "of"
y <- paste(firstword,secondword) # Using paste to combine these words in a "y" object
y
## [1] "Isle of"
# If you don't want to include spaces automatically when using paste(), include the sep="" argument
paste(y, myData$Area.y[310]) # Pasting this "y" object to the "Area.y" name
## [1] "Isle of Anglesey"
myData$Area.y[310] <- paste(y, myData$Area.y[310]) # Same, but changing the content of this row in the "Area.y" column
myData$Area.y[310] # Sanity check
## [1] "Isle of Anglesey"
identical(myData$Area.y[310],myData$Area.x[310]) # Proving that they are now identical
## [1] TRUE
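As mentioned at the start of this example, the same result can be reached more directly with paste() alone. Here is a minimal sketch on a stand-alone string (the `area_y` object is a hypothetical stand-in for `myData$Area.y[310]` before it was modified):

```r
area_y <- "Anglesey"  # hypothetical stand-in for the original Area.y value

# paste() inserts a single space between its arguments by default
fixed_name <- paste("Isle of", area_y)
fixed_name

# paste0() is equivalent to paste() with sep = "", so the space is typed explicitly
fixed_name0 <- paste0("Isle of ", area_y)

identical(fixed_name, fixed_name0)
```

This skips the strsplit() step entirely, at the cost of typing the prefix by hand instead of deriving it from Area.x.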
Convert the “Area.y” and “Area.x” variables from factor to character. Make the “Area.y” variable similar to the “Area.x” variable for the 322nd row of the dataset. Then, prove that these observations are now identical.
Relevant functions: as.character(), gsub(), identical().
Make the “Area.y” variable similar to the “Area.x” variable for the 324th row of the dataset. Then, prove that these observations are now identical.
Relevant functions: gsub(), identical().