Election Data Science: Simple R Data Visualization with CPS Turnout Data

Introduction

The purpose of this document is to demonstrate the usage of simple R plot outputs and options.

As a motivating example, I use voter turnout data drawn from the Census Bureau’s 2016 and 2018 Current Population Survey’s Voting and Registration Supplement. I’ve placed a zip file onto Dropbox containing these data:

https://www.dropbox.com/s/vhk8rekf0qasvew/turnout_rate_data.zip?dl=0

Download this file by clicking on the URL or copying it into a web browser. In the Dropbox interface, click the ... button in the upper right-hand corner to download the file to the directory you will do your work in. (If at some later point Dropbox changes its web interface, you are on your own to figure out how to download the file.) Unzip the files into this directory. You are now ready to use these data.

To accompany the programming, I assign Chapter 4 of Wattenberg’s Is Voting for Young People?, which discusses youth turnout rates drawn from the CPS. His conclusion that younger people shouldn’t be allowed to vote will spark discussion.

Load Packages

As with all R code, we begin by loading the necessary packages to run the code.

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.0.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.6
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Packages are code that has been developed to do certain tasks. Tidyverse is a set of tools that manage and visualize data in R.

If you try to load a package that hasn’t previously been installed, you will get an error message. To install a new package, go to Tools->Install Packages in the RStudio menu and follow the instructions.
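Packages can also be installed from the console instead of the menu. The sketch below is one way to do it, not the only one. Note that load_or_install is a hypothetical helper name of my own (it is not part of the tidyverse), and the CRAN mirror URL is an assumption you can swap for any mirror you prefer.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
# The default mirror below is an assumption; any CRAN mirror works.
load_or_install <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  # character.only = TRUE lets library() accept the name as a string.
  library(pkg, character.only = TRUE)
}

load_or_install("tidyverse")
```

The check with requireNamespace means the download only happens the first time, so the same script can be re-run safely.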

Other Global Setup

Once the packages are loaded, I like to do other global setup, like setting the working directory, which I’ve called working.dir. Let’s break down how this R statement works for a moment, since it is basic to R coding.

The <- indicates that I’m assigning something to working.dir.

That something is the string "D:/Classes/Election Science/Class 2"

This directory working.dir is usually where I keep my data and files for a single project. Sometimes, though, data may be stored in a different directory that has many different programs pointing to it, like a large voter file or a lookup table of county names and codes. In this case, create a second object, say data.dir, for the data directory, and so on.

working.dir <- "D:/Classes/Election Science/Class 2"
working.dir

## [1] "D:/Classes/Election Science/Class 2"

As with just about everything when it comes to programming, there is more than one way to skin a cat. Another way to manage the directory R is pointing to is to set the working directory with the command setwd.

setwd("D:/Classes/Election Science/Class 2")

I do not like this approach since it is easy to forget to set the working directory back. But to each their own.
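If you do use setwd, one way to avoid forgetting is to save the old directory and restore it when you are done. This is a minimal sketch: tempdir() is only a stand-in so the example runs on any machine, and start.dir and old.dir are names I made up for illustration.

```r
# Remember where we started.
start.dir <- getwd()

# setwd() invisibly returns the directory that was active before the change.
# tempdir() stands in for a real project directory here.
old.dir <- setwd(tempdir())

# ... work that relies on the new working directory goes here ...

# Restore the original working directory when finished.
setwd(old.dir)

# Check that we are back where we started.
identical(getwd(), start.dir)
```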

/ or \

Note that my directory structure uses a forward slash / to separate the names of directories and subdirectories. Windows itself displays paths with a backslash \, but inside an R string the backslash is an escape character, so a Windows-style path must be written with doubled backslashes (\\) or, more simply, with forward slashes, which R accepts on every operating system. This leads to a pretty common error message for newbies, and even not newbies. If you’re having trouble loading a file, check your directory name for typos and the direction of your slashes.
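To make the slash issue concrete, here is a small sketch that writes the same directory both ways. The object names path.forward and path.backward are made up for this illustration.

```r
# Forward slashes work in R on every operating system, including Windows.
path.forward <- "D:/Classes/Election Science/Class 2"

# Inside an R string a backslash starts an escape sequence, so each one
# must be doubled to get a single literal backslash.
path.backward <- "D:\\Classes\\Election Science\\Class 2"

# Both strings have the same number of characters: "\\" is stored as one "\".
nchar(path.forward) == nchar(path.backward)

# Converting backslashes to forward slashes recovers the first form.
# fixed = TRUE treats the pattern as plain text, not a regular expression.
gsub("\\", "/", path.backward, fixed = TRUE)
```

Writing a single backslash, as in "D:\Classes", is what triggers the error, because R tries to read \C as an escape sequence.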

Creating a filename with the working directory path

I next need to create a string with the path and name of the data file I want to work with (one of the ones you downloaded in the first step, above). I’m calling this string R object turnout_2016.file.

To do this, I use the paste command, which joins (or concatenates) together strings.

turnout_2016.file <- paste(working.dir,"/turnout_rates_by_age_2016.csv", sep="")

Notice that next to paste are some parentheses with stuff inside: paste(stuff inside). This is how R and most programming languages pass information into a function, which takes that information, manipulates it, and returns the result.

In this example, the code passes the working.dir R object (which, recall, also happens to be a string) and the string "/turnout_rates_by_age_2016.csv". These are the two strings I would like to concatenate. You can add more strings if you wish and tap your heels together three times.

I also pass to the paste function another bit of information, sep="". Here, I am interacting with one of the paste options. I’m overriding the default paste separator, or what paste will put between the strings it concatenates together. The default separator is a space, or sep=" ", which will put an awkward space in a place that I don’t want it and break my code. So, I’d like to use no separator at all, which is the same as sep="".

Finally, note that each of the three pieces of information I pass on to paste is separated by a comma. This lets R tell the different pieces apart. Another common programming error is forgetting a ,.

The result is that paste creates a new string, which is passed to the turnout_2016.file R object. I can find out what is in this object – and any other object – by calling the name of the object.

turnout_2016.file

## [1] "D:/Classes/Election Science/Class 2/turnout_rates_by_age_2016.csv"
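For comparison, the sketch below shows what the default separator would have done, along with two base R alternatives to paste with sep="": paste0 and file.path. The object names with.default.sep, with.paste0, and with.file.path are made up for this illustration.

```r
working.dir <- "D:/Classes/Election Science/Class 2"

# With the default separator, paste() inserts a space before the slash,
# which produces a path that does not exist.
with.default.sep <- paste(working.dir, "/turnout_rates_by_age_2016.csv")

# paste0() is shorthand for paste(..., sep = "").
with.paste0 <- paste0(working.dir, "/turnout_rates_by_age_2016.csv")

# file.path() inserts the "/" between the pieces for you.
with.file.path <- file.path(working.dir, "turnout_rates_by_age_2016.csv")

with.default.sep
with.paste0

# The two alternatives build exactly the same string.
identical(with.paste0, with.file.path)
```

Which one to use is a matter of taste; file.path has the small advantage that you cannot forget the slash.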

Loading a data file

I can now read in the data from turnout_2016.file. I’ve pre-processed these data in this file to speed up the class. In practice, there will be more preparation work. Much more.

We will talk about reading in different file formats in the future.

turnout_2016.data <- read_csv(turnout_2016.file, col_names = T)

## Parsed with column specification:
## cols(
##   age = col_integer(),
##   citizen_turnout_rate = col_double(),
##   citizen_turnout_rate_white = col_double(),
##   citizen_turnout_rate_black = col_double(),
##   citizen_turnout_rate_hispanic = col_double(),
##   registered_turnout_rate = col_double()
## )

The read_csv command reads into R a data file with comma separated values format, which means for each row of data, the columns are separated by a comma. A file name with .csv at the end is usually a good sign it has rows with commas separating the column values. Of course, it would be silly to expect that if you renamed a file with the .csv extension it would magically convert itself into comma separated values format.

The other piece of information I pass to read_csv is col_names = T. This tells read_csv that the first row of the file holds the names of the variables rather than data. That is already read_csv’s default, so the argument is optional here, but spelling it out makes the assumption visible. T and F are shorthand for TRUE and FALSE. The shorthand saves typing, but T and F are ordinary objects that can be overwritten, so writing out TRUE and FALSE is the safer habit.
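A quick way to see what col_names does is to read the same small file both ways. This is only a sketch: the file is a throwaway written to a temporary location, its two columns and values are invented for illustration (they are NOT CPS estimates), and toy.file, with.header, and without.header are made-up names.

```r
library(readr)

# Write a tiny toy CSV to a temporary file.
# These values are invented for illustration and are NOT CPS estimates.
toy.file <- tempfile(fileext = ".csv")
writeLines(c("age,rate", "18,0.1", "19,0.2"), toy.file)

# col_names = TRUE: the first row supplies the column names.
with.header <- read_csv(toy.file, col_names = TRUE)

# col_names = FALSE: the first row is read as data, and readr
# names the columns X1, X2, ... itself.
without.header <- read_csv(toy.file, col_names = FALSE)

names(with.header)
names(without.header)

# The header row counts as a data row in the second version.
nrow(with.header)
nrow(without.header)
```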

I place the data that read_csv read into the R object turnout_2016.data. R objects can be many things: strings, numbers, and in this case the entire dataset nicely formatted into rows and columns with variable names.

If you are using RStudio, the object turnout_2016.data will have appeared in the Environment pane on the right. Go ahead and click on it. You will see the raw data for this example. Also try simply typing turnout_2016.data into the console.

Creating a plot

If you actually looked at the data like you were supposed to, you will see it is hard to interpret turnout_2016.data by just looking at the columns of data. Let’s now create our first plot!

Let’s plot the turnout rate for citizens by age in the 2016 presidential election.

ggplot(data = turnout_2016.data) + 
  geom_point(aes(x=age, y=citizen_turnout_rate))

We see here a common pattern in the relationship between age and voter turnout: the youngest voters vote at the lowest rate, turnout increases with age, peaks around retirement age, and slumps again in the twilight years.

I am using the geom_point function to generate this plot, where each dot represents the turnout rate (on the Y axis) for citizens of each age (on the X axis).

Let’s take a moment to break this command down further to understand what is going on.

The function I’m calling is ggplot, which comes from ggplot2, a visualization package for R created by Wickham. ggplot2 is part of the tidyverse, so it was loaded along with the other packages above.

Next to ggplot you will find (data = turnout_2016.data). Here, we are telling ggplot to visualize the data found in turnout_2016.data.

The + sign tells R that I’m not done with the ggplot command yet. As long as there is a + at the end of a line, R will continue to process ggplot commands. If you leave a stray + at the end of the last command that you want to run with ggplot, you will break your code - another common bug!
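Because + simply adds a layer to a plot, a plot can also be built up in steps and stored in an object along the way. The sketch below uses a synthetic stand-in dataset so it runs without the downloaded files: the curve is invented and is NOT CPS data, and toy.data, base.plot, and point.plot are made-up names.

```r
library(ggplot2)

# Synthetic stand-in data: an invented curve, NOT CPS turnout estimates.
toy.data <- data.frame(age = 18:85)
toy.data$citizen_turnout_rate <- 0.4 + 0.3 * sin((toy.data$age - 18) / 67 * pi)

# Start with an empty plot that only knows about the data.
base.plot <- ggplot(data = toy.data)

# The + goes at the END of the line that is being continued.
point.plot <- base.plot +
  geom_point(aes(x = age, y = citizen_turnout_rate))

# Typing the object name draws the plot.
point.plot
```

Putting the + at the start of the next line instead of the end of the current one is the mirror image of the stray + bug: R treats the first line as complete and never adds the layer.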

The next command that is run is geom_point, a function that creates a point plot (thus the name point). I need to tell geom_point which variables in the dataset I want to plot against one another. I want to plot the variable age on the x axis and citizen_turnout_rate on the y axis, which I do with the code snippet aes(x=age, y=citizen_turnout_rate). But wait, there is more. See the aes? This stands for aesthetic, and it tells geom_point which variables to map to which visual properties of the plot.

There is always more than one way to skin a cat. This also works:

ggplot(data = turnout_2016.data, aes(x = age, y = citizen_turnout_rate)) +
  geom_point()

As does this:

ggplot(data = turnout_2016.data, aes(age, citizen_turnout_rate)) +
  geom_point()

And this, too:

ggplot(data = turnout_2016.data) + 
  geom_point(aes(age, citizen_turnout_rate))

I like to explicitly tell R which variables to associate with the x and y axes. In just a minute, we will see it is possible to overlay more than one plot on top of each other, which is why I prefer to put the aes() with geom_point.

A Smooth Line Plot

Let’s look at a different type of plot. How about a nice looking line instead of those dots?

ggplot(data = turnout_2016.data) + 
  geom_smooth(aes(x=age, y=citizen_turnout_rate))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

What does this plot do? It creates a line fitted to the points in the previous graph, using a technique known as loess, or local regression. Basically, loess moves a window across the data from left to right. At each position it fits a small regression to the points inside the window, giving more weight to the points nearest the center, and records the fitted value. It draws a line connecting the fitted values generated as the window sweeps across the full range of the data. The shaded region is a confidence band around the line (95% by default). It is wider where the data are sparse or more spread out, which is where the line is less certain.
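To see roughly what happens behind the smooth line, the fit can be run directly with base R’s loess function. This is a sketch under stated assumptions: the data are synthetic (an invented curve plus random noise, NOT CPS estimates), the names toy.data, toy.fit, and toy.pred are made up, and the band uses a simple 1.96 normal approximation, so it may differ slightly from the band geom_smooth draws.

```r
# Synthetic stand-in data: an invented curve plus noise, NOT CPS estimates.
set.seed(1)
toy.data <- data.frame(age = 18:85)
toy.data$citizen_turnout_rate <-
  0.4 + 0.3 * sin((toy.data$age - 18) / 67 * pi) +
  rnorm(nrow(toy.data), sd = 0.02)

# Fit a local regression. span sets the window width as a share of the data;
# 0.75 is the default for loess().
toy.fit <- loess(citizen_turnout_rate ~ age, data = toy.data, span = 0.75)

# Fitted values and their standard errors at a few ages.
toy.ages <- c(20, 45, 70)
toy.pred <- predict(toy.fit, newdata = data.frame(age = toy.ages), se = TRUE)

# Approximate 95% band built from the standard errors.
data.frame(
  age   = toy.ages,
  fit   = toy.pred$fit,
  lower = toy.pred$fit - 1.96 * toy.pred$se.fit,
  upper = toy.pred$fit + 1.96 * toy.pred$se.fit
)
```

A smaller span makes the window narrower, so the line follows the points more closely and becomes wigglier.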

The geom_smooth plot is more pleasing to the eye and easier for people to interpret, so smoothed lines are generally preferred. However, you may still wish to create a point graph. It is generally good practice to plot your data to detect outliers - an outlier could be a data error, or could be very influential to your analysis. You may also wish to use a point plot to highlight certain data points of interest to the data story you are telling.

Overlay Plots

It is possible to overlay multiple plots onto each other using ggplot. Just add the new plot that you want with another + before you end the ggplot command.

ggplot(data = turnout_2016.data) + 
  geom_point(aes(x=age, y=citizen_turnout_rate)) +
  geom_smooth(aes(x=age, y=citizen_turnout_rate))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
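When every layer uses the same variables, as here, the aes() can instead be written once inside ggplot and shared by all the layers. The sketch below shows that variant on a synthetic stand-in dataset (an invented curve plus noise, NOT CPS data); toy.data and overlay.plot are made-up names. Naming the method and formula explicitly also silences the message printed above.

```r
library(ggplot2)

# Synthetic stand-in data: an invented curve plus noise, NOT CPS estimates.
set.seed(1)
toy.data <- data.frame(age = 18:85)
toy.data$citizen_turnout_rate <-
  0.4 + 0.3 * sin((toy.data$age - 18) / 67 * pi) +
  rnorm(nrow(toy.data), sd = 0.02)

# aes() placed in ggplot() is inherited by every layer that follows.
overlay.plot <- ggplot(data = toy.data, aes(x = age, y = citizen_turnout_rate)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

overlay.plot
```

Keeping aes() inside each geom, as in the code above, remains the more flexible choice once the layers need different variables.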

Combining these two plots tells us that the loess line is generally good, but it under-estimates the turnout rate around the peak age of 70 and over-estimates turnout for the oldest people. There is also something else going on with the data points for the oldest people that you will see upon closer inspection. What is it? (hint: top-coding)