PSCI 9590A - Introduction to Quantitative Methods

Evelyne Brie

Fall 2023

Inputting and Manipulating Data

In this lab, we (1) import datasets, (2) explore their properties and the properties of their variables, (3) create variables, (4) rename variables, (5) merge datasets and (6) create subsets of datasets.

Relevant functions: setwd(), read.csv(), dim(), colnames(), head(), class(), as.factor(), levels(), nlevels(), summary(), table(), ifelse(), seq(), length(), which(), min(), max(), which.min(), which.max(), subset(), filter(), rename(), unique(), merge(), right_join(), left_join()

1. Importing and Exploring Data

1.1 Importing Data

We begin by loading the dataset named “ExitPoll.csv”, which is available on OWL. Please don’t use the “import dataset” function in the environment window. The script alone should be sufficient to replicate your work.

We input the CSV file using the read.csv() command. The following code will work if the dataset and your R script are located within the same folder.

As an alternative, you can skip the setwd() function and set the path to your file directly within the read.csv() function (ex.: read.csv("/Users/evelynebrie/Desktop/myFolder/ExitPoll.csv")).

# Setting the folder that contains this script as our working directory
# (this line requires RStudio and the rstudioapi package)
setwd(dirname(rstudioapi::getSourceEditorContext()$path))

# Loading the dataset
PollData <- read.csv("ExitPoll.csv")

The data frame should now appear within your environment (upper right window).
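To illustrate the alternative mentioned above (passing the full path directly to read.csv()), here is a minimal sketch. It does not use ExitPoll.csv: the file name "ExampleData.csv", the temporary folder and the two-row data frame are assumptions made only so that the example runs on any computer.

```r
# Hypothetical example: write a tiny CSV into a temporary folder,
# then read it back by giving read.csv() the full path (no setwd() needed)
csvPath <- file.path(tempdir(), "ExampleData.csv")

toyData <- data.frame(voteChoice = c("red", "blue"), age = c(28, 18))
write.csv(toyData, csvPath, row.names = FALSE)

# Reading the file using its full path
ExampleData <- read.csv(csvPath)
dim(ExampleData)
## [1] 2 2
```

Building the path with file.path() avoids typing the folder separators by hand, which differ between Windows and macOS.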

1.2 Exploring the Data

This dataset represents the results of an exit poll conducted on residents of a fictitious country. It contains the following variables: vote choice of respondent, age of respondent, sex of respondent, and three indicator variables (or dummy variables) for the respondent’s education level.

Some of these examples are inspired by Yannick Dufresne’s “POL-7004 Analyse Quantitative” seminar.

1.2.1 Properties of the Dataset

The first things we typically want to know about a dataset are: (1) how many rows and columns the data frame has, (2) what its columns are named and (3) what the data itself looks like.

# Looking at the dimensions of the dataset using dim()
dim(PollData)

## [1] 11  6

# We can see that the dataset has 11 rows and 6 columns

# Looking at the variable names using colnames()
colnames(PollData)

## [1] "voteChoice"  "age"         "female"      "educHS"      "educCollege"
## [6] "educGrad"

# Displaying the content of the first 5 rows using head()
head(PollData,5)

##   voteChoice age female educHS educCollege educGrad
## 1        red  28      1      0           1        0
## 2       blue  18      0      1           0        0
## 3       blue  65      0      0           1        0
## 4     yellow  40      1      0           0        1
## 5        red  44      1      0           0        1

# Displaying the oldest respondent using which.max()
PollData[which.max(PollData$age),]

##    voteChoice age female educHS educCollege educGrad
## 11        red  72      0      0           1        0

# Displaying the youngest respondent using which() and min()
PollData[which(PollData$age==min(PollData$age)),]

##   voteChoice age female educHS educCollege educGrad
## 2       blue  18      0      1           0        0
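The two approaches used above (which.max() on its own, and which() combined with min()) behave differently when several observations are tied. The sketch below uses a hypothetical vector of ages, not the exit poll data, to show the difference.

```r
# Hypothetical ages, with two respondents tied for youngest
ages <- c(28, 18, 65, 18, 44)

# which.min() returns only the first position of the minimum
which.min(ages)
## [1] 2

# which() combined with min() returns every position that matches
which(ages == min(ages))
## [1] 2 4
```

If ties matter for your analysis, the which() version is the safer choice; which.min() and which.max() are convenient when a single row is enough.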

1.2.2 Properties of the Variables

Let’s focus on the variable that might be the most interesting to our pollster: vote choice. In our dataset, this variable is called voteChoice.

This variable is currently a character variable, but we will convert it to a factor variable. A factor is categorical: R stores each distinct value as a level, and each level is assigned an integer code.

# Viewing the class attribute of the "voteChoice" variable using class()
class(PollData$voteChoice)

## [1] "character"

# Converting the "voteChoice" variable to a factor variable
PollData$voteChoice <- as.factor(PollData$voteChoice)

# Check what the levels of this factor are using levels()
levels(PollData$voteChoice)

## [1] "blue"   "red"    "yellow"

# Select the second level of this factor variable using brackets
levels(PollData$voteChoice)[2]

## [1] "red"

# Check how many levels there are within this factor using nlevels()
nlevels(PollData$voteChoice)

## [1] 3

# Check how many respondents there are per level (i.e. political party) using summary()
summary(PollData$voteChoice)

##   blue    red yellow 
##      4      4      3

The class of this variable ensures that R recognizes identical vote choices as belonging to the same category. This step does not serve a specific purpose for us right now, but it’s important to familiarize yourself with the concept of a factor. Please note that you can also convert variables to numeric (using as.numeric()) and to character (using as.character()) if needed.
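One thing to keep in mind when converting a factor: as.numeric() returns the underlying level codes, not the labels. The sketch below uses two hypothetical vectors (votes and years) that are not part of the exit poll data.

```r
# Hypothetical factor, for illustration only
votes <- as.factor(c("red", "blue", "blue", "yellow"))

# as.character() returns the labels
as.character(votes)
## [1] "red"    "blue"   "blue"   "yellow"

# as.numeric() returns the level codes (levels are sorted alphabetically)
as.numeric(votes)
## [1] 2 1 1 3

# When numbers are stored as a factor, convert to character first
years <- as.factor(c("2019", "2023", "2021"))
as.numeric(years)
## [1] 1 3 2
as.numeric(as.character(years))
## [1] 2019 2023 2021
```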

2. Manipulating Data

2.1 Creating Variables

Let’s say we want to create a new indicator (or dummy) variable which takes a value of 1 every time the voteChoice variable has a value of blue, and 0 otherwise. Below are two different ways to do this. Additional ways of doing this using the Tidyverse grammar are available here.

# Create a new empty variable called votedBlue within our dataframe
PollData$votedBlue <- NA 

# FIRST WAY
PollData$votedBlue[PollData$voteChoice=="blue"] <- 1
PollData$votedBlue[PollData$voteChoice!="blue"] <- 0

# Sanity Check
table(PollData$votedBlue)

## 
## 0 1 
## 7 4

# SECOND WAY
PollData$votedBlue <- ifelse(PollData$voteChoice=="blue", 1, 0)

# Sanity Check
table(PollData$votedBlue)

## 
## 0 1 
## 7 4
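For reference, here is one possible way to build the same dummy variable with the Tidyverse grammar, using mutate() and if_else() from the dplyr package. This is only a sketch: the ExampleData data frame is an assumption created for this example, with vote choices copied from the exit poll data so that it runs without the CSV file.

```r
library(dplyr)

# Stand-in data frame with the 11 vote choices from the exit poll
ExampleData <- data.frame(
  voteChoice = c("red", "blue", "blue", "yellow", "red", "blue",
                 "red", "yellow", "blue", "yellow", "red")
)

# mutate() adds the new column; if_else() works like ifelse()
ExampleData <- ExampleData %>%
  mutate(votedBlue = if_else(voteChoice == "blue", 1, 0))

# Sanity check: should show 7 zeros and 4 ones, as above
table(ExampleData$votedBlue)
```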

Exercise 1

Create a numeric variable called education which takes a value of 1 when the respondent has a high school education, a value of 2 when the respondent has college-level education, and a value of 3 when the respondent has a graduate-level education.

# Hint: start by creating an empty variable (i.e. vector) called "education"
PollData$education <- NA

# Solution
PollData$education[PollData$educHS==1] <- 1
PollData$education[PollData$educCollege==1] <- 2
PollData$education[PollData$educGrad==1] <- 3

Once you’re done, the distribution of your education variable should look like this:

# Creating a table using the table() function
table(PollData$education)

## 
## 1 2 3 
## 3 4 4

Now, let’s say we want to have a unique ID for all respondents. We thus create another variable called ID.

# Creating a variable containing all numbers from 1 to the last row number
PollData$ID <- seq(1,dim(PollData)[1])

# Sanity check
length(unique(PollData$ID)) == dim(PollData)[1] # you could also use nrow(PollData) instead of dim(PollData)[1]

## [1] TRUE

# We can thus confirm that this is a unique ID
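As a variation on the same idea, the sketch below uses seq_len() and nrow() to build the ID, and anyDuplicated() for the sanity check. The three-row ExampleData data frame is hypothetical and only serves to make the example self-contained.

```r
# Hypothetical three-row data frame
ExampleData <- data.frame(age = c(28, 18, 65))

# seq_len(n) returns 1, 2, ..., n
ExampleData$ID <- seq_len(nrow(ExampleData))

# anyDuplicated() returns 0 when no value is repeated
anyDuplicated(ExampleData$ID) == 0
## [1] TRUE
```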

2.2 Renaming Variables

Let’s now import another dataset called “Names.csv” using read.csv().

# Importing the dataset
NamesData <- read.csv("Names.csv")

# Printing out the first 5 rows
head(NamesData,5)

##   ident    name
## 1     1    Emma
## 2     2 William
## 3     3  Thomas
## 4     4  Olivia
## 5     5   Alice

We want to rename the ident variable in this dataset to ID so that we can merge both datasets later on. Here are two different ways to proceed.

# FIRST WAY

# Loading the dplyr package
library(dplyr)

# Renaming the column
NamesData <- rename(NamesData,"ID"="ident")

# Sanity check
colnames(NamesData)

## [1] "ID"   "name"

# SECOND WAY

# Renaming the column
colnames(NamesData)[1] <- "ID"

# Sanity check
colnames(NamesData)

## [1] "ID"   "name"

2.3 Merging Data

Let’s now merge both datasets. For the approach used here, merging requires that (1) both datasets share a column with the same name that contains a unique ID and (2) this column has the same class in both datasets. Note that it is technically possible to merge datasets when the ID contains duplicates, but let’s ignore this for now.

Function	Use
`merge()` (base R)	generic merging of two datasets (by default, keeps only observations with a match in both datasets; add all=TRUE to keep all observations)
`right_join()` (dplyr package)	merging the first dataset into the second dataset (i.e. keeps all observations from the second dataset, plus matching information from the first)
`left_join()` (dplyr package)	merging the second dataset into the first dataset (i.e. keeps all observations from the first dataset, plus matching information from the second)

The baseline way these functions work is the following: functionName(firstDataset, secondDataset, by="ID") (replace “functionName” by the appropriate merging function and “firstDataset”/“secondDataset” by the appropriate data frame names, and note that the “ID” variable here refers to any unique ID variable in these datasets and could therefore have any other name).
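The differences between these functions only show up when some IDs are missing from one of the datasets. The sketch below uses two hypothetical data frames (first and second, not the lab datasets) in which ID 3 appears only in the first and ID 4 only in the second.

```r
library(dplyr)

# Hypothetical data frames with partially overlapping IDs
first  <- data.frame(ID = c(1, 2, 3), name = c("Emma", "William", "Thomas"))
second <- data.frame(ID = c(1, 2, 4), age = c(28, 18, 40))

# merge() with default settings: only IDs found in both (1 and 2)
mergedDefault <- merge(first, second, by = "ID")

# merge() with all = TRUE: every ID from either dataset (1, 2, 3 and 4)
mergedAll <- merge(first, second, by = "ID", all = TRUE)

# left_join(): every ID from the first dataset (1, 2 and 3)
joinedLeft <- left_join(first, second, by = "ID")

# right_join(): every ID from the second dataset (1, 2 and 4)
joinedRight <- right_join(first, second, by = "ID")

# Comparing the number of rows in each result
c(nrow(mergedDefault), nrow(mergedAll), nrow(joinedLeft), nrow(joinedRight))
## [1] 2 4 3 3
```

Observations without a match receive NA in the columns coming from the other dataset (for example, ID 3 has no age in joinedLeft). In this lab both datasets contain the same 11 IDs, so all of these functions would return the same rows.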

Exercise 2

Merge the NamesData and the PollData datasets by the ID column using the merge() function from base R. Your new dataset should be called MergedData.

# Hint: make sure to save your merged dataset in an object called MergedData 
# MergedData <- ...

MergedData <- merge(NamesData, PollData, by="ID")

The dimensions of MergedData should look like this:

dim(MergedData)

## [1] 11 10

And the content of MergedData should look like this:

MergedData

##    ID      name voteChoice age female educHS educCollege educGrad votedBlue
## 1   1      Emma        red  28      1      0           1        0         0
## 2   2   William       blue  18      0      1           0        0         1
## 3   3    Thomas       blue  65      0      0           1        0         1
## 4   4    Olivia     yellow  40      1      0           0        1         0
## 5   5     Alice        red  44      1      0           0        1         0
## 6   6  Florence       blue  21      1      0           0        1         1
## 7   7       Leo        red  52      0      1           0        0         0
## 8   8     Livia     yellow  33      1      0           1        0         0
## 9   9    Nathan       blue  29      0      1           0        0         1
## 10 10 Charlotte     yellow  37      1      0           0        1         0
## 11 11    Arthur        red  72      0      0           1        0         0
##    education
## 1          2
## 2          1
## 3          2
## 4          3
## 5          3
## 6          3
## 7          1
## 8          2
## 9          1
## 10         3
## 11         2

2.4 Subsetting Data

There are different ways to subset a dataframe using R grammar. Here are the symbols (or logical operators) we need to use to select observations.

Operator	Meaning
&	and
|	or
==	equals
!=	doesn’t equal
>	greater than
>=	greater than or equal to
<	less than
<=	less than or equal to
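Each of these operators returns a vector of TRUE and FALSE values, one per observation, and subsetting keeps the observations for which the result is TRUE. Here is a minimal sketch using two hypothetical vectors (age and female) rather than the merged dataset.

```r
# Hypothetical values, for illustration only
age    <- c(28, 18, 65, 40)
female <- c(1, 0, 0, 1)

# A single condition
age >= 30
## [1] FALSE FALSE  TRUE  TRUE

# Both conditions must be true
age >= 30 & female == 1
## [1] FALSE FALSE FALSE  TRUE

# At least one condition must be true
age >= 30 | female == 1
## [1]  TRUE FALSE  TRUE  TRUE

# Using the result inside brackets keeps only the TRUE positions
age[age >= 30 | female == 1]
## [1] 28 65 40
```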

Let’s try this out: we’ll create different subsets below. Here are three different methods to do so, using filter() from the dplyr package, subset() or base R bracket indexing.

# Selecting only people 30 or older (using dplyr)

library(dplyr)

subset1 <- MergedData %>% filter(age >= 30)
  
dim(subset1)

## [1]  7 10

# Selecting only women who voted blue (using subset() from base R)

subset2 <- subset(MergedData, MergedData$female == 1 & MergedData$voteChoice == "blue")

dim(subset2)

## [1]  1 10

# Selecting only people who voted for the red or the blue party (using base R brackets)

subset3 <- MergedData[MergedData$voteChoice == "red" | MergedData$voteChoice == "blue",]

dim(subset3)

## [1]  8 10

Exercise 3

Find the mistake in each of the following subsetting attempts.

# BlueVoters <- subset(PollData, PollData$voteChoice = "blue")

# YoungVoters <- subset(PollData, PollData$age =< 25)

# Female_and_Grad_Voters <- subset(PollData, PollData$female == 1 | PollData$educGrad == 1)

# NoYellowVoters <- subset(PollData, PollData$voteChoice != yellow)