In this recitation, we (1) create a unique ID number within two datasets and (2) merge datasets using various functions from the dplyr package.
Relevant functions: rename(), unique(), merge(), right_join(), left_join(), inner_join(), full_join(), anti_join(), na.omit().
We begin by loading the PassengerData.csv and the SurvivalData.csv datasets from the Canvas class website. These are actual datasets from the Titanic disaster.
The first thing we want to do is check whether there already exists a unique ID within these datasets.
# Don't forget to set your working directory using setwd()
# Loading the dplyr package
# library(dplyr)
# Load the "PassengerData" dataset
PassengerData <- read.csv(file = "/Users/evelynebrie/Dropbox/TA/PSCI_107_Fall2018/Recitation/Week4/PassengerData.csv")
# Load the "SurvivalData" dataset
SurvivalData <- read.csv(file = "/Users/evelynebrie/Dropbox/TA/PSCI_107_Fall2018/Recitation/Week4/SurvivalData.csv")
# Looking at the content of the variables using head()
head(PassengerData,5)
## X PassengerId Sex Age SibSp Parch Ticket Fare Cabin
## 1 1 1 male 22 1 0 A/5 21171 7.2500
## 2 2 2 female 38 1 0 PC 17599 71.2833 C85
## 3 3 3 female 26 0 0 STON/O2. 3101282 7.9250
## 4 4 4 female 35 1 0 113803 53.1000 C123
## 5 5 5 male 35 0 0 373450 8.0500
## Embarked
## 1 S
## 2 C
## 3 S
## 4 S
## 5 S
# Looking at the content of the variables using head()
head(SurvivalData,5)
## X Survived Pclass Name ID
## 1 1 0 3 Braund, Mr. Owen Harris 1
## 2 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2
## 3 3 1 3 Heikkinen, Miss. Laina 3
## 4 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 4
## 5 5 0 3 Allen, Mr. William Henry 5
It looks like the PassengerId variable and the ID variable are similar. The next thing we want to do is check whether these variables have the same class using class().
# Verifying that both columns have the similar class using class()
class(PassengerData$PassengerId)
## [1] "integer"
class(SurvivalData$ID)
## [1] "integer"
# Please note that integers are a type of numeric variable
is.numeric(PassengerData$PassengerId)
## [1] TRUE
is.numeric(SurvivalData$ID)
## [1] TRUE
The following step is making sure that both unique identifier columns have the same name. Here, we change the ID variable name in the SurvivalData dataset to match that of the PassengerData (= PassengerId).
# Using the rename() function in dplyr, renaming the "PassengerId" column "ID" within a new dataset called "New_SurvivalData"
New_SurvivalData <- rename(SurvivalData,"PassengerId"="ID")
# Another way to do this without creating a new dataset is by using the following syntax
# SurvivalData$PassengerId <- SurvivalData$ID
# Removing the first Survival dataset
rm(SurvivalData)
We can’t be sure whether a given passenger appears more than one in this dataset. In order to avoid problems during merging, we need to verify that each unique ID appears only once in each dataset.
# Verifying that the number of unique IDs is similar to the number of rows
length(unique(PassengerData$PassengerId))==dim(PassengerData)[1]
## [1] TRUE
# How many passengers in this dataset?
dim(PassengerData)[1]
## [1] 891
# Verifying that the number of unique IDs is similar to the number of rows
length(unique(New_SurvivalData$PassengerId))==dim(New_SurvivalData)[1]
## [1] TRUE
# How many passengers in this dataset?
dim(New_SurvivalData)[1]
## [1] 750
Because there isn’t the same number of passengers in each dataset, we might want to remove the 141 unmatched passengers when merging to have a final dataset with 750 passengers.
Let’s merge both datasets by passenger ID. We will keep all rows from Y and all columns from X and Y.
# Merging using merge()
Data_RightJoin <- merge(PassengerData,New_SurvivalData,by="PassengerId")
# Merging using right_join()
Data_RightJoin <- right_join(PassengerData,New_SurvivalData,by="PassengerId")
# Checking the dimensions of the new dataset
dim(Data_RightJoin)
## [1] 750 14
# Remove old datasets (if needed)
#rm(New_SurvivalData)
#rm(PassengerSurvivalData)
We have successfully merged both datasets! Let’s take a look at the content of our new dataset.
# Looking at the content of the first five rows using head()
head(Data_RightJoin,5)
## X.x PassengerId Sex Age SibSp Parch Ticket Fare Cabin
## 1 1 1 male 22 1 0 A/5 21171 7.2500
## 2 2 2 female 38 1 0 PC 17599 71.2833 C85
## 3 3 3 female 26 0 0 STON/O2. 3101282 7.9250
## 4 4 4 female 35 1 0 113803 53.1000 C123
## 5 5 5 male 35 0 0 373450 8.0500
## Embarked X.y Survived Pclass
## 1 S 1 0 3
## 2 C 2 1 1
## 3 S 3 1 3
## 4 S 4 1 1
## 5 S 5 0 3
## Name
## 1 Braund, Mr. Owen Harris
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
## 3 Heikkinen, Miss. Laina
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel)
## 5 Allen, Mr. William Henry
We can further manipulate the data to remove missing values and look at the data that was removed with merging (remember, we know that the original PassengerData dataset had 891 observations, and that the original SurvivalData had only 750 observations).
# Removing all rows with missing values using the na.omit() function (we call this "listwise deletion")
Clean_DataSet <- na.omit(Data_RightJoin)
# Which rows from the PassengerData file are not in the New_SurvivalData file (i.e. could not be matched)?
Data_AntiJoin <- anti_join(PassengerData,New_SurvivalData)
## Joining, by = c("X", "PassengerId")
dim(Data_AntiJoin)
## [1] 141 10
What can we learn from merging these datasets?
Download the PassengerData.csv and the SurvivalData.csv from the Canvas class website. Load both datasets in R. Create a unique ID variable. (You already did this if you followed along the recitation code)
Merge both datasets (Left Join) and check the dimensions of the new merged dataset.
Merge both datasets (Inner Join) and check the dimensions of the new merged dataset
Merge both datasets (Full Join) and check the dimensions of the new merged dataset. Please perform listwise deletion (i.e. removing any rows with missing data). How many rows and columns does the resulting data set have?
Which rows from the PassengerData dataset are not in the SurvivalData dataset?