Inputting and Manipulating Data
In this lab, we (1) import datasets, (2) explore their properties and the properties of their variables, (3) create variables, (4) rename variables, (5) merge datasets and (6) create subsets of datasets.
Relevant functions: setwd(), read.csv(), dim(), colnames(), head(), class(), levels(), summary(), length(), which(), min(), max(), which.min(), which.max(), subset(), rename(), unique(), merge(), right_join(), left_join()
1. Importing and Exploring Data
1.1 Importing Data
We begin by loading the dataset named “ExitPoll.csv”, which is available on OWL. Please don’t use the “import dataset” function in the environment window. The script alone should be sufficient to replicate your work.
We input the CSV file using the read.csv() command. The following code will work if the dataset and your R script are located within the same folder and you run the script from RStudio (the setwd() line relies on the rstudioapi package).
As an alternative, you can skip the setwd() function and set the path to your file directly within the read.csv() function (e.g., read.csv("/Users/evelynebrie/Desktop/myFolder/ExitPoll.csv")).
# Setting the folder where the script is as our working directory
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
# Loading the dataset
PollData <- read.csv("ExitPoll.csv")
The data frame should now appear within your environment (upper right window).
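If read.csv() fails, the most common cause is a path that does not point to the file. The sketch below checks the path before reading. It writes a small stand-in file to a temporary folder so that it runs anywhere; the file name DemoPoll.csv and its contents are made up for illustration and are not the real ExitPoll.csv.

```r
# Build a path to a stand-in CSV in a temporary folder (illustrative file only)
demo_path <- file.path(tempdir(), "DemoPoll.csv")
write.csv(data.frame(voteChoice = c("red", "blue"), age = c(28, 18)),
          demo_path, row.names = FALSE)

# Stop early with a clear message if the file is not where we expect it
if (!file.exists(demo_path)) stop("File not found: ", demo_path)

# Read the file and confirm its dimensions
DemoData <- read.csv(demo_path)
dim(DemoData)
```

Using file.path() to assemble the path avoids typing separators by hand, which differ between Windows and macOS/Linux.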
1.2 Exploring the Data
This dataset represents the results of an exit poll conducted on residents of a fictitious country. It contains the following variables: vote choice of respondent, age of respondent, sex of respondent, and three indicator variables (or dummy variables) for the respondent’s education level.
Some of these examples are inspired by Yannick Dufresne’s “POL-7004 Analyse Quantitative” seminar.
1.2.1 Properties of the Dataset
The first things we typically want to know from a dataset are: (1) how many dimensions this data frame has, (2) what are the names of its columns and (3) what the data itself looks like.
# Looking at the dimensions of the dataset using dim()
dim(PollData)
## [1] 11 6
# We can see that the dataset has 11 rows and 6 columns
# Looking at the variable names using colnames()
colnames(PollData)
## [1] "voteChoice" "age" "female" "educHS" "educCollege"
## [6] "educGrad"
# Displaying the content of the first 5 rows using head()
head(PollData,5)
## voteChoice age female educHS educCollege educGrad
## 1 red 28 1 0 1 0
## 2 blue 18 0 1 0 0
## 3 blue 65 0 0 1 0
## 4 yellow 40 1 0 0 1
## 5 red 44 1 0 0 1
# Displaying the oldest respondent using which.max()
PollData[which.max(PollData$age),]
## voteChoice age female educHS educCollege educGrad
## 11 red 72 0 0 1 0
# Displaying the youngest respondent using which() and min()
PollData[which(PollData$age==min(PollData$age)),]
## voteChoice age female educHS educCollege educGrad
## 2 blue 18 0 1 0 0
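The two approaches above differ when several respondents share the same extreme value: which.min() and which.max() return only the first matching position, while which() combined with min() or max() returns all of them. A minimal sketch on a made-up vector of ages (not the exit poll data):

```r
# Made-up ages with a tie for the youngest respondent
ages <- c(28, 18, 65, 18, 72)

# Position of the first minimum only
firstYoungest <- which.min(ages)

# Positions of every respondent tied for the minimum
allYoungest <- which(ages == min(ages))

firstYoungest
allYoungest
```

In the exit poll data there are no ties at either extreme, so both approaches select the same row.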
1.2.2 Properties of the Variables
Let’s focus on the variable that might be the most interesting to our pollster: vote choice. In our dataset, this variable is called voteChoice.
This variable is currently a character variable, but we will convert it to a factor variable. A factor is R’s type for categorical data: each distinct value becomes a level, and each level is stored internally as an integer code.
# Viewing the class attribute of the "voteChoice" variable using class()
class(PollData$voteChoice)
## [1] "character"
# Converting the "voteChoice" variable to a factor variable
PollData$voteChoice <- as.factor(PollData$voteChoice)
# Check what the levels of this factor are using levels()
levels(PollData$voteChoice)
## [1] "blue" "red" "yellow"
# Select the second level of this factor variable using brackets
levels(PollData$voteChoice)[2]
## [1] "red"
# Check how many levels there are within this factor using nlevels()
nlevels(PollData$voteChoice)
## [1] 3
# Check how many respondents there are per level (i.e. political party) using summary()
summary(PollData$voteChoice)
## blue red yellow
## 4 4 3
The class of this variable ensures that R treats identical vote choices as belonging to the same category. This step does not serve a specific purpose for us right now, but it’s important to familiarize yourself with the concept of a factor. Please note that you can also convert variables to numeric (using as.numeric()) and to character (using as.character()) if needed.
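One thing to watch for when converting: as.numeric() applied to a factor returns the internal level codes, not the labels. The sketch below uses small made-up vectors (the object names vote and ageGroup are illustrative) to show the difference.

```r
# A factor of vote choices: levels are sorted alphabetically (blue, red, yellow)
vote <- factor(c("red", "blue", "blue", "yellow"))
voteCodes  <- as.integer(vote)     # internal codes of each observation
voteLabels <- as.character(vote)   # original labels

# A factor whose labels happen to look like numbers
ageGroup <- factor(c("30", "18", "65"))
codes  <- as.numeric(ageGroup)                 # level codes, not the ages
values <- as.numeric(as.character(ageGroup))   # the ages themselves

codes
values
```

Going through as.character() first is the usual way to recover the numbers that a factor's labels represent.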
2. Manipulating Data
2.1 Creating Variables
Let’s say we want to create a new indicator (or dummy) variable which takes a value of 1 every time the voteChoice variable has a value of blue, and 0 otherwise. Below are two different ways to do this. Additional approaches using Tidyverse grammar also exist.
# Create a new empty variable called votedBlue within our dataframe
PollData$votedBlue <- NA
# FIRST WAY
PollData$votedBlue[PollData$voteChoice=="blue"] <- 1
PollData$votedBlue[PollData$voteChoice!="blue"] <- 0
# Sanity Check
table(PollData$votedBlue)
##
## 0 1
## 7 4
# SECOND WAY
PollData$votedBlue <- ifelse(PollData$voteChoice=="blue", 1, 0)
# Sanity Check
table(PollData$votedBlue)
##
## 0 1
## 7 4
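A third base R option, shown here as a sketch on a made-up vector, is to convert the logical comparison directly to 0/1 with as.integer(). The missing value is added only to illustrate that NA stays NA in the new variable.

```r
# Made-up vote choices, including one missing answer
voteChoice <- c("red", "blue", "blue", "yellow", NA)

# TRUE becomes 1, FALSE becomes 0, and NA stays NA
votedBlue <- as.integer(voteChoice == "blue")
votedBlue

# useNA = "ifany" makes table() report missing values too
table(votedBlue, useNA = "ifany")
```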
Exercise 1
Create a numeric variable called education which takes a value of 1 when the respondent has a high school education, a value of 2 when the respondent has college-level education, and a value of 3 when the respondent has a graduate-level education.
# Hint: start by creating an empty variable (i.e. vector) called "education"
PollData$education <- NA
# Solution
PollData$education[PollData$educHS==1] <- 1
PollData$education[PollData$educCollege==1] <- 2
PollData$education[PollData$educGrad==1] <- 3
Once you’re done, the distribution of your education variable should look like this:
# Creating a table using the table() function
table(PollData$education)
##
## 1 2 3
## 3 4 4
Now, let’s say we want to have a unique ID for all respondents. We thus create another variable called ID.
# Creating a variable encompassing all numbers from 1 to the last row number
PollData$ID <- NA
PollData$ID <- seq(1,dim(PollData)[1])
# Sanity check
length(unique(PollData$ID)) == dim(PollData)[1] # you could also use nrow(PollData) instead of dim(PollData)[1]
## [1] TRUE
# We can thus confirm that this is a unique ID
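The same idea can be written a little more compactly. The sketch below uses a small made-up data frame called demo; seq_len() and anyDuplicated() are base R functions.

```r
# Small made-up data frame
demo <- data.frame(age = c(28, 18, 65))

# seq_len(n) returns 1, 2, ..., n (and an empty vector when n is 0)
demo$ID <- seq_len(nrow(demo))

# anyDuplicated() returns 0 when every value is unique
isUnique <- anyDuplicated(demo$ID) == 0
isUnique
```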
2.2 Renaming Variables
Let’s now import another dataset called “Names.csv” using read.csv().
# Importing the dataset
NamesData <- read.csv("Names.csv")
# Printing out the first 5 rows
head(NamesData,5)
## ident name
## 1 1 Emma
## 2 2 William
## 3 3 Thomas
## 4 4 Olivia
## 5 5 Alice
We want to rename the ident variable in this dataset to ID so that we can merge both datasets later on. Here are two different ways to proceed.
# FIRST WAY
# Loading the dplyr package
library(dplyr)
# Renaming the column (syntax: new name = old name)
NamesData <- rename(NamesData, ID = ident)
# Sanity check
colnames(NamesData)
## [1] "ID" "name"
# SECOND WAY
# Renaming the column
colnames(NamesData)[1] <- "ID"
# Sanity check
colnames(NamesData)
## [1] "ID" "name"
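The second way relies on the column's position, which breaks silently if the column order changes. A sketch of a base R variant that looks the column up by name instead, using a made-up data frame called NamesDemo:

```r
# Made-up data frame mirroring the structure of NamesData
NamesDemo <- data.frame(ident = 1:3, name = c("Emma", "William", "Thomas"))

# Rename whichever column is currently called "ident", wherever it sits
names(NamesDemo)[names(NamesDemo) == "ident"] <- "ID"

# Sanity check
colnames(NamesDemo)
```

If no column is called "ident", this line changes nothing, so the sanity check remains worthwhile.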
2.3 Merging Data
Let’s now merge both datasets. Merging requires that (1) both datasets share a column with the same name that holds a unique ID, and (2) this column has the same class in both datasets. Note that it is technically possible to merge datasets when the ID contains duplicates, but let’s ignore this for now.
| Function | Use |
|---|---|
| merge() (base R) | generic merging of two datasets (by default keeps only observations whose ID appears in both datasets; add all = TRUE to keep all observations) |
| right_join() (dplyr package) | merging the first dataset into the second dataset (i.e. keeps every observation from the second dataset) |
| left_join() (dplyr package) | merging the second dataset into the first dataset (i.e. keeps every observation from the first dataset) |
The baseline way these functions work is the following: functionName(firstDataset, secondDataset, by="ID"). Replace “functionName” with the appropriate merging function and “firstDataset”/“secondDataset” with the appropriate data frame names. The “ID” variable here refers to any unique ID variable in these datasets and could therefore have any other name.
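The differences between these functions only show up when some IDs are missing from one of the datasets. The sketch below uses two made-up data frames (people and votes) and base R's merge() only: by default it keeps the IDs found in both datasets, and its all.x, all.y and all arguments control which unmatched observations are kept. Setting all.x = TRUE keeps the same observations as left_join(), and all.y = TRUE the same as right_join().

```r
# Two made-up datasets that only partly overlap: ID 1 has no vote, ID 4 has no name
people <- data.frame(ID = c(1, 2, 3), name = c("Emma", "William", "Thomas"))
votes  <- data.frame(ID = c(2, 3, 4), voteChoice = c("blue", "blue", "yellow"))

# Default: only IDs present in both datasets (2 and 3)
inner <- merge(people, votes, by = "ID")

# Keep every row of the first dataset (IDs 1, 2, 3); voteChoice is NA for ID 1
left <- merge(people, votes, by = "ID", all.x = TRUE)

# Keep every row of the second dataset (IDs 2, 3, 4); name is NA for ID 4
right <- merge(people, votes, by = "ID", all.y = TRUE)

# Keep every row of both datasets (IDs 1 to 4)
full <- merge(people, votes, by = "ID", all = TRUE)

nrow(inner); nrow(left); nrow(right); nrow(full)
```

In this lab both datasets contain the same 11 IDs, so all of these variants would return the same 11 rows.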
Exercise 2
Merge the NamesData and the PollData datasets by the ID column using the merge() function from base R. Your new dataset should be called MergedData.
# Hint: make sure to save your merged dataset in an object called MergedData
# MergedData <- ...
MergedData <- merge(NamesData, PollData, by="ID")
The dimensions of MergedData should look like this:
dim(MergedData)
## [1] 11 10
And the content of MergedData should look like this:
MergedData
## ID name voteChoice age female educHS educCollege educGrad votedBlue
## 1 1 Emma red 28 1 0 1 0 0
## 2 2 William blue 18 0 1 0 0 1
## 3 3 Thomas blue 65 0 0 1 0 1
## 4 4 Olivia yellow 40 1 0 0 1 0
## 5 5 Alice red 44 1 0 0 1 0
## 6 6 Florence blue 21 1 0 0 1 1
## 7 7 Leo red 52 0 1 0 0 0
## 8 8 Livia yellow 33 1 0 1 0 0
## 9 9 Nathan blue 29 0 1 0 0 1
## 10 10 Charlotte yellow 37 1 0 0 1 0
## 11 11 Arthur red 72 0 0 1 0 0
## education
## 1 2
## 2 1
## 3 2
## 4 3
## 5 3
## 6 3
## 7 1
## 8 2
## 9 1
## 10 3
## 11 2
2.4 Subsetting Data
There are different ways to subset a dataframe using R grammar. Here are the symbols (or logical operators) we need to use to select observations.
| Logical operators | Meaning |
|---|---|
| & | and |
| \| | or |
| == | equals |
| != | doesn’t equal |
| > | greater than |
| >= | greater than or equal to |
| < | smaller than |
| <= | smaller than or equal to |
Let’s try this out: we’ll create different subsets below. Here are
three different methods to do so, using filter(),
subset() or basic R grammar.
# Selecting only people 30 or older (using dplyr)
library(dplyr)
subset1 <- MergedData %>% filter(age >= 30)
dim(subset1)
## [1] 7 10
# Selecting only women who voted blue (using subset() from base R)
subset2 <- subset(MergedData, MergedData$female == 1 & MergedData$voteChoice == "blue")
dim(subset2)
## [1] 1 10
# Selecting only people who voted for the red or the blue party (using base R brackets)
subset3 <- MergedData[MergedData$voteChoice == "red" | MergedData$voteChoice == "blue",]
dim(subset3)
## [1] 8 10
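When a condition lists several acceptable values for the same variable, chaining == with | gets long quickly. A sketch of an equivalent base R alternative using the %in% operator, on a small made-up data frame called demo:

```r
# Small made-up data frame
demo <- data.frame(voteChoice = c("red", "blue", "yellow", "blue"),
                   age = c(28, 18, 40, 65))

# Keep rows whose voteChoice is any of the listed values
redOrBlue <- demo[demo$voteChoice %in% c("red", "blue"), ]

# Same selection written with == and |
redOrBlue2 <- demo[demo$voteChoice == "red" | demo$voteChoice == "blue", ]

dim(redOrBlue)
```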
Exercise 3
Find the mistake in each of the following subsetting attempts.
# BlueVoters <- subset(PollData, PollData$voteChoice = "blue")
# YoungVoters <- subset(PollData, PollData$age =< 25)
# Female_and_Grad_Voters <- subset(PollData, PollData$female == 1 | PollData$educGrad == 1)
# NoYellowVoters <- subset(PollData, PollData$voteChoice != yellow)
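Once you have tried the exercise, you can compare your answers with the sketch below, which shows one possible set of corrections. To keep the example self-contained, it rebuilds the relevant PollData columns from the values printed in the MergedData output above.

```r
# Rebuilding the relevant columns of PollData from the printed output
PollData <- data.frame(
  voteChoice = c("red", "blue", "blue", "yellow", "red", "blue",
                 "red", "yellow", "blue", "yellow", "red"),
  age      = c(28, 18, 65, 40, 44, 21, 52, 33, 29, 37, 72),
  female   = c(1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0),
  educGrad = c(0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0)
)

# 1. A single = assigns; comparing values requires ==
BlueVoters <- subset(PollData, voteChoice == "blue")

# 2. The operator is written <=, not =<
YoungVoters <- subset(PollData, age <= 25)

# 3. "Female and grad" calls for & (and), not | (or)
Female_and_Grad_Voters <- subset(PollData, female == 1 & educGrad == 1)

# 4. Character values must be quoted; an unquoted yellow is read as an object name
NoYellowVoters <- subset(PollData, voteChoice != "yellow")

nrow(BlueVoters); nrow(YoungVoters); nrow(Female_and_Grad_Voters); nrow(NoYellowVoters)
```

Inside subset(), column names can be used directly, so the PollData$ prefix is optional.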