Merging, Subsetting and Manipulating Data
In this lab, we (1) import datasets, (2) create variables, (3) rename variables, (4) merge datasets and (5) create subsets of datasets.
Relevant functions: setwd(),
read.csv(), subset(), rename(),
length(),, unique(), merge(),
right_join(), left_join()
1. Importing Data
We begin by loading the dataset named “Data_Poll.csv”, which is available on OWL (Resources -> Exercises -> Week 2). Please don’t use the “import dataset” function in the environment window. The script alone should be sufficient to replicate your work.
We input the CSV file using the read.csv() command. The
following code will work if the dataset and your R script are located
within the same folder.
As an alternative, you can skip the setwd() function and
set the path to your file directly within the read.csv()
function (ex.:
read.csv("/Users/evelynebrie/Desktop/myFolder/ExitPoll.csv")).
# Setting the folder where the script is as our working directory
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
# Loading the dataset
PollData <- read.csv("Data_Poll.csv")
The data frame should now appear within your environment (upper right window).
2. Manipulating Data
2.1 Creating Variables
Let’s say we want to create a new indicator (or dummy) variable which takes a value of 1 everytime the voteChoice variable has a value of blue, and 0 otherwise. Additional ways of doing this using the Tidyverse grammar are available here.
# Create a new empty variable called votedBlue within our dataframe
PollData$votedBlue <- NA
PollData$votedBlue[PollData$voteChoice=="blue"] <- 1
PollData$votedBlue[PollData$voteChoice!="blue"] <- 0
# Sanity Check
table(PollData$votedBlue)
##
## 0 1
## 7 4
# Let's just print out the entire dataset to display the result
PollData
## voteChoice age female educHS educCollege educGrad votedBlue
## 1 red 28 1 0 1 0 0
## 2 blue 18 0 1 0 0 1
## 3 blue 65 0 0 1 0 1
## 4 yellow 40 1 0 0 1 0
## 5 red 44 1 0 0 1 0
## 6 blue 21 1 0 0 1 1
## 7 red 52 0 1 0 0 0
## 8 yellow 33 1 0 1 0 0
## 9 blue 29 0 1 0 0 1
## 10 yellow 37 1 0 0 1 0
## 11 red 72 0 0 1 0 0
# We now have an extra variable called "votedBlue"
Exercise 1
Create a numeric variable called education which takes a value of 1 when the respondent has a high school education, a value of 2 when the respondent has college-level education, and a value of 3 when the respondent has a graduate-level education.
# Hint: start by creating an empty variable (i.e. vector) called "education"
PollData$education <- NA
Once you’re done, the distribution of your education variable should look like this:
# Creating a table using the table() function
table(PollData$education)
##
## 1 2 3
## 3 4 4
Now, let’s say we want to have a unique ID for all respondents. We thus create another variable called ID.
# Creating a variable encompassing all numbers from 1 to the last row number
PollData$ID <- NA
PollData$ID <- seq(1,dim(PollData)[1])
# Sanity check
length(unique(PollData$ID)) == dim(PollData)[1] # you could also use nrow(PollData) instead of dim(PollData)[1]
## [1] TRUE
# We can thus confirm that this is a unique ID
# Let's print the ID variable to be extra explicit here
PollData$ID
## [1] 1 2 3 4 5 6 7 8 9 10 11
# Makes sense, right? All people in the dataset are associated with a unique number
2.2 Renaming Variables
Let’s now import another dataset called “Names.csv”
using read.csv(). This dataset is also available on OWL
(Resources -> Exercises -> Week 2).
# Importing the dataset
NamesData <- read.csv("Names.csv")
# Printing out the first 5 rows
head(NamesData,5)
## ident name
## 1 1 Emma
## 2 2 William
## 3 3 Thomas
## 4 4 Olivia
## 5 5 Alice
I want to rename the ident variable in this dataset into ID to merge both datasets later on. Here are two different ways to proceed.
# FIRST WAY
# Loading the dplyr packages
library(dplyr)
# Renaming the column
NamesData <- rename(NamesData,"ID"="ident")
# Sanity check
colnames(NamesData)
## [1] "ID" "name"
# SECOND WAY
# Renaming the column
colnames(NamesData)[1] <- "ID"
# Sanity check
colnames(NamesData)
## [1] "ID" "name"
2.3 Merging Data
Let’s now merge both datasets. Merging implies that you (1) have a column with the same name in both datasets, with a unique ID and (2) that this column has the same class in both datasets. Note that it is theoretically possible to merge datasets when there are duplicates, but let’s ignore this for now.
| Function | Use |
|---|---|
merge() |
generic merging of two datasets (i.e. includes all observations) |
right_join() (dplyr package) |
merging the first dataset to the second dataset (i.e. includes only observations from the second dataset) |
left_join() (dplyr package) |
merging the second dataset to the first dataset (i.e. includes only observations from the first dataset) |
The baseline way these functions work is the following:
functionName(firstDataset, secondDataset, by="ID") (replace
“functionName” by the appropriate merging function and
“firstDataset”/“secondDataset” by the appropriate data frame names, and
note that the “ID” variable here refers to any unique ID variable in
these datasets and could therefore have any other name).
Exercise 2
Merge the NamesData and the PollData datasets by
the ID column using the merge() function
from the dplyr package. Your new dataset should be
called MergedData.
# Hint: make sure to save your merged dataset in an object called MergedData
# MergedData <- ...
The dimensions of MergedData should look like this:
dim(MergedData)
## [1] 11 10
And the content of MergedData should look like this:
MergedData
## ID name voteChoice age female educHS educCollege educGrad votedBlue
## 1 1 Emma red 28 1 0 1 0 0
## 2 2 William blue 18 0 1 0 0 1
## 3 3 Thomas blue 65 0 0 1 0 1
## 4 4 Olivia yellow 40 1 0 0 1 0
## 5 5 Alice red 44 1 0 0 1 0
## 6 6 Florence blue 21 1 0 0 1 1
## 7 7 Leo red 52 0 1 0 0 0
## 8 8 Livia yellow 33 1 0 1 0 0
## 9 9 Nathan blue 29 0 1 0 0 1
## 10 10 Charlotte yellow 37 1 0 0 1 0
## 11 11 Arthur red 72 0 0 1 0 0
## education
## 1 2
## 2 1
## 3 2
## 4 3
## 5 3
## 6 3
## 7 1
## 8 2
## 9 1
## 10 3
## 11 2
2.4 Subsetting Data
There are different ways to subset a dataframe using R grammar. Here are the symbols (or logical operators) we need to use to select observations.
| Logical operators | Meaning |
|---|---|
| & | and |
| | | or |
| == | equals |
| != | doesn’t equal |
| > | greater than |
| >= | greater or equal |
| < | smaller than |
| <= | smaller or equal |
Let’s try this out: we’ll create different subsets below. Here are
three different methods to do so, using filter(),
subset() or basic R grammar.
# Selecting only people 30 or older (using dplyr)
library(dplyr)
subset1 <- MergedData %>% filter(age >= 30)
dim(subset1)
## [1] 7 10
# Selecting only women and blue voters (using base R)
subset2 <- subset(MergedData, MergedData$female == 1 & MergedData$voteChoice == "blue")
dim(subset2)
## [1] 1 10
# Selecting only people who voted for the red or the blue party (using base R)
subset3 <- MergedData[MergedData$voteChoice == "red" | MergedData$voteChoice == "blue",]
dim(subset3)
## [1] 8 10
Exercise 3
Find the mistake in each of the following subsetting attempts.
# BlueVoters <- subset(PollData, PollData$voteChoice = "blue")
# YoungVoters <- subset(PollData, PollData$age =< 25)
# Female_and_Grad_Voters <- subset(PollData, PollData$female == 1 | PollData$educGrad == 1)
# NoYellowVoters <- subset(PollData, PollData$voteChoice != yellow)