POLISCI 3325G - Data Science for Political Science

Evelyne Brie

Winter 2023

Merging, Subsetting and Manipulating Data

In this lab, we (1) import datasets, (2) create variables, (3) rename variables, (4) merge datasets and (5) create subsets of datasets.

Relevant functions: setwd(), read.csv(), subset(), rename(), length(),, unique(), merge(), right_join(), left_join()

1. Importing Data

We begin by loading the dataset named “Data_Poll.csv”, which is available on OWL (Resources -> Exercises -> Week 2). Please don’t use the “import dataset” function in the environment window. The script alone should be sufficient to replicate your work.

We input the CSV file using the read.csv() command. The following code will work if the dataset and your R script are located within the same folder.

As an alternative, you can skip the setwd() function and set the path to your file directly within the read.csv() function (ex.: read.csv("/Users/evelynebrie/Desktop/myFolder/ExitPoll.csv")).

# Setting the folder where the script is as our working directory
setwd(dirname(rstudioapi::getSourceEditorContext()$path))

# Loading the dataset
PollData <- read.csv("Data_Poll.csv")

The data frame should now appear within your environment (upper right window).

2. Manipulating Data

2.1 Creating Variables

Let’s say we want to create a new indicator (or dummy) variable which takes a value of 1 everytime the voteChoice variable has a value of blue, and 0 otherwise. Additional ways of doing this using the Tidyverse grammar are available here.

# Create a new empty variable called votedBlue within our dataframe
PollData$votedBlue <- NA 

PollData$votedBlue[PollData$voteChoice=="blue"] <- 1
PollData$votedBlue[PollData$voteChoice!="blue"] <- 0

# Sanity Check
table(PollData$votedBlue)

## 
## 0 1 
## 7 4

# Let's just print out the entire dataset to display the result
PollData

##    voteChoice age female educHS educCollege educGrad votedBlue
## 1         red  28      1      0           1        0         0
## 2        blue  18      0      1           0        0         1
## 3        blue  65      0      0           1        0         1
## 4      yellow  40      1      0           0        1         0
## 5         red  44      1      0           0        1         0
## 6        blue  21      1      0           0        1         1
## 7         red  52      0      1           0        0         0
## 8      yellow  33      1      0           1        0         0
## 9        blue  29      0      1           0        0         1
## 10     yellow  37      1      0           0        1         0
## 11        red  72      0      0           1        0         0

# We now have an extra variable called "votedBlue"

Exercise 1

Create a numeric variable called education which takes a value of 1 when the respondent has a high school education, a value of 2 when the respondent has college-level education, and a value of 3 when the respondent has a graduate-level education.

# Hint: start by creating an empty variable (i.e. vector) called "education"
PollData$education <- NA

Once you’re done, the distribution of your education variable should look like this:

# Creating a table using the table() function
table(PollData$education)

## 
## 1 2 3 
## 3 4 4

Now, let’s say we want to have a unique ID for all respondents. We thus create another variable called ID.

# Creating a variable encompassing all numbers from 1 to the last row number
PollData$ID <- NA
PollData$ID <- seq(1,dim(PollData)[1])

# Sanity check
length(unique(PollData$ID)) == dim(PollData)[1] # you could also use nrow(PollData) instead of dim(PollData)[1]

## [1] TRUE

# We can thus confirm that this is a unique ID

# Let's print the ID variable to be extra explicit here
PollData$ID

##  [1]  1  2  3  4  5  6  7  8  9 10 11

# Makes sense, right? All people in the dataset are associated with a unique number

2.2 Renaming Variables

Let’s now import another dataset called “Names.csv” using read.csv(). This dataset is also available on OWL (Resources -> Exercises -> Week 2).

# Importing the dataset
NamesData <- read.csv("Names.csv")

# Printing out the first 5 rows
head(NamesData,5)

##   ident    name
## 1     1    Emma
## 2     2 William
## 3     3  Thomas
## 4     4  Olivia
## 5     5   Alice

I want to rename the ident variable in this dataset into ID to merge both datasets later on. Here are two different ways to proceed.

# FIRST WAY

# Loading the dplyr packages
library(dplyr)

# Renaming the column
NamesData <- rename(NamesData,"ID"="ident")

# Sanity check
colnames(NamesData)

## [1] "ID"   "name"

# SECOND WAY

# Renaming the column
colnames(NamesData)[1] <- "ID"

# Sanity check
colnames(NamesData)

## [1] "ID"   "name"

2.3 Merging Data

Let’s now merge both datasets. Merging implies that you (1) have a column with the same name in both datasets, with a unique ID and (2) that this column has the same class in both datasets. Note that it is theoretically possible to merge datasets when there are duplicates, but let’s ignore this for now.

Function	Use
`merge()`	generic merging of two datasets (i.e. includes all observations)
`right_join()` (dplyr package)	merging the first dataset to the second dataset (i.e. includes only observations from the second dataset)
`left_join()` (dplyr package)	merging the second dataset to the first dataset (i.e. includes only observations from the first dataset)

The baseline way these functions work is the following: functionName(firstDataset, secondDataset, by="ID") (replace “functionName” by the appropriate merging function and “firstDataset”/“secondDataset” by the appropriate data frame names, and note that the “ID” variable here refers to any unique ID variable in these datasets and could therefore have any other name).

Exercise 2

Merge the NamesData and the PollData datasets by the ID column using the merge() function from the dplyr package. Your new dataset should be called MergedData.

# Hint: make sure to save your merged dataset in an object called MergedData 
# MergedData <- ...

The dimensions of MergedData should look like this:

dim(MergedData)

## [1] 11 10

And the content of MergedData should look like this:

MergedData

##    ID      name voteChoice age female educHS educCollege educGrad votedBlue
## 1   1      Emma        red  28      1      0           1        0         0
## 2   2   William       blue  18      0      1           0        0         1
## 3   3    Thomas       blue  65      0      0           1        0         1
## 4   4    Olivia     yellow  40      1      0           0        1         0
## 5   5     Alice        red  44      1      0           0        1         0
## 6   6  Florence       blue  21      1      0           0        1         1
## 7   7       Leo        red  52      0      1           0        0         0
## 8   8     Livia     yellow  33      1      0           1        0         0
## 9   9    Nathan       blue  29      0      1           0        0         1
## 10 10 Charlotte     yellow  37      1      0           0        1         0
## 11 11    Arthur        red  72      0      0           1        0         0
##    education
## 1          2
## 2          1
## 3          2
## 4          3
## 5          3
## 6          3
## 7          1
## 8          2
## 9          1
## 10         3
## 11         2

2.4 Subsetting Data

There are different ways to subset a dataframe using R grammar. Here are the symbols (or logical operators) we need to use to select observations.

Logical operators	Meaning
&	and
\|	or
==	equals
!=	doesn’t equal
>	greater than
>=	greater or equal
<	smaller than
<=	smaller or equal

Let’s try this out: we’ll create different subsets below. Here are three different methods to do so, using filter(), subset() or basic R grammar.

# Selecting only people 30 or older (using dplyr)

library(dplyr)

subset1 <- MergedData %>% filter(age >= 30)
  
dim(subset1)

## [1]  7 10

# Selecting only women and blue voters (using base R)

subset2 <- subset(MergedData, MergedData$female == 1 & MergedData$voteChoice == "blue")

dim(subset2)

## [1]  1 10

# Selecting only people who voted for the red or the blue party (using base R)

subset3 <- MergedData[MergedData$voteChoice == "red" | MergedData$voteChoice == "blue",]

dim(subset3)

## [1]  8 10

Exercise 3

Find the mistake in each of the following subsetting attempts.

# BlueVoters <- subset(PollData, PollData$voteChoice = "blue")

# YoungVoters <- subset(PollData, PollData$age =< 25)

# Female_and_Grad_Voters <- subset(PollData, PollData$female == 1 | PollData$educGrad == 1)

# NoYellowVoters <- subset(PollData, PollData$voteChoice != yellow)