Merging and Manipulating Data in R

In this recitation, we (1) create a unique ID number within two datasets and (2) merge datasets using various functions from the dplyr package.

Relevant functions: rename(), unique(), merge(), right_join(), left_join(), inner_join(), full_join(), anti_join(), na.omit().

1. Creating a Unique ID Number

We begin by loading the PassengerData.csv and the SurvivalData.csv datasets from the Canvas class website. These are actual datasets from the Titanic disaster.

1.1 Looking for a Similar Column in Both Datasets

The first thing we want to do is check whether there already exists a unique ID within these datasets.

# Don't forget to set your working directory using setwd()

# Loading the dplyr package
# library(dplyr)

# Load the "PassengerData" dataset
PassengerData <- read.csv(file = "/Users/evelynebrie/Dropbox/TA/PSCI_107_Fall2018/Recitation/Week4/PassengerData.csv")

# Load the "SurvivalData" dataset
SurvivalData <- read.csv(file = "/Users/evelynebrie/Dropbox/TA/PSCI_107_Fall2018/Recitation/Week4/SurvivalData.csv")

# Looking at the content of the variables using head()
head(PassengerData,5)

##   X PassengerId    Sex Age SibSp Parch           Ticket    Fare Cabin
## 1 1           1   male  22     1     0        A/5 21171  7.2500      
## 2 2           2 female  38     1     0         PC 17599 71.2833   C85
## 3 3           3 female  26     0     0 STON/O2. 3101282  7.9250      
## 4 4           4 female  35     1     0           113803 53.1000  C123
## 5 5           5   male  35     0     0           373450  8.0500      
##   Embarked
## 1        S
## 2        C
## 3        S
## 4        S
## 5        S

# Looking at the content of the variables using head()
head(SurvivalData,5)

##   X Survived Pclass                                                Name ID
## 1 1        0      3                             Braund, Mr. Owen Harris  1
## 2 2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)  2
## 3 3        1      3                              Heikkinen, Miss. Laina  3
## 4 4        1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel)  4
## 5 5        0      3                            Allen, Mr. William Henry  5

It looks like the PassengerId variable and the ID variable are similar. The next thing we want to do is check whether these variables have the same class using class().

# Verifying that both columns have the similar class using class()
class(PassengerData$PassengerId)

## [1] "integer"

class(SurvivalData$ID)

## [1] "integer"

# Please note that integers are a type of numeric variable
is.numeric(PassengerData$PassengerId)

## [1] TRUE

is.numeric(SurvivalData$ID)

## [1] TRUE

1.2 Renaming a Column

The following step is making sure that both unique identifier columns have the same name. Here, we change the ID variable name in the SurvivalData dataset to match that of the PassengerData (= PassengerId).

# Using the rename() function in dplyr, renaming the "PassengerId" column "ID" within a new dataset called "New_SurvivalData"
New_SurvivalData <- rename(SurvivalData,"PassengerId"="ID")

# Another way to do this without creating a new dataset is by using the following syntax
# SurvivalData$PassengerId <- SurvivalData$ID

# Removing the first Survival dataset
rm(SurvivalData)

2. Looking for Duplicates

We can’t be sure whether a given passenger appears more than one in this dataset. In order to avoid problems during merging, we need to verify that each unique ID appears only once in each dataset.

# Verifying that the number of unique IDs is similar to the number of rows
length(unique(PassengerData$PassengerId))==dim(PassengerData)[1]

## [1] TRUE

# How many passengers in this dataset?
dim(PassengerData)[1]

## [1] 891

# Verifying that the number of unique IDs is similar to the number of rows
length(unique(New_SurvivalData$PassengerId))==dim(New_SurvivalData)[1]

## [1] TRUE

# How many passengers in this dataset?
dim(New_SurvivalData)[1]

## [1] 750

Because there isn’t the same number of passengers in each dataset, we might want to remove the 141 unmatched passengers when merging to have a final dataset with 750 passengers.

3. Merging Datasets

Let’s merge both datasets by passenger ID. We will keep all rows from Y and all columns from X and Y.

# Merging using merge()
Data_RightJoin <- merge(PassengerData,New_SurvivalData,by="PassengerId")

# Merging using right_join()
Data_RightJoin <- right_join(PassengerData,New_SurvivalData,by="PassengerId")

# Checking the dimensions of the new dataset
dim(Data_RightJoin)

## [1] 750  14

# Remove old datasets (if needed)
#rm(New_SurvivalData)
#rm(PassengerSurvivalData)

We have successfully merged both datasets! Let’s take a look at the content of our new dataset.

# Looking at the content of the first five rows using head()
head(Data_RightJoin,5)

##   X.x PassengerId    Sex Age SibSp Parch           Ticket    Fare Cabin
## 1   1           1   male  22     1     0        A/5 21171  7.2500      
## 2   2           2 female  38     1     0         PC 17599 71.2833   C85
## 3   3           3 female  26     0     0 STON/O2. 3101282  7.9250      
## 4   4           4 female  35     1     0           113803 53.1000  C123
## 5   5           5   male  35     0     0           373450  8.0500      
##   Embarked X.y Survived Pclass
## 1        S   1        0      3
## 2        C   2        1      1
## 3        S   3        1      3
## 4        S   4        1      1
## 5        S   5        0      3
##                                                  Name
## 1                             Braund, Mr. Owen Harris
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
## 3                              Heikkinen, Miss. Laina
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel)
## 5                            Allen, Mr. William Henry

We can further manipulate the data to remove missing values and look at the data that was removed with merging (remember, we know that the original PassengerData dataset had 891 observations, and that the original SurvivalData had only 750 observations).

# Removing all rows with missing values using the na.omit() function (we call this "listwise deletion")
Clean_DataSet <- na.omit(Data_RightJoin)

# Which rows from the PassengerData file are not in the New_SurvivalData file (i.e. could not be matched)?
Data_AntiJoin <- anti_join(PassengerData,New_SurvivalData)

## Joining, by = c("X", "PassengerId")

dim(Data_AntiJoin)

## [1] 141  10

What can we learn from merging these datasets?

Exercises

Step 1

Download the PassengerData.csv and the SurvivalData.csv from the Canvas class website. Load both datasets in R. Create a unique ID variable. (You already did this if you followed along the recitation code)

Step 2

Merge both datasets (Left Join) and check the dimensions of the new merged dataset.

Step 3

Merge both datasets (Inner Join) and check the dimensions of the new merged dataset

Step 4

Merge both datasets (Full Join) and check the dimensions of the new merged dataset. Please perform listwise deletion (i.e. removing any rows with missing data). How many rows and columns does the resulting data set have?

Step 5

Which rows from the PassengerData dataset are not in the SurvivalData dataset?