Election Data Science: Simple R Data Visualization with CPS Turnout Data

Introduction

The purpose of this document is to demonstrate the usage of simple R plot outputs and options.

As a motivating example, I use voter turnout data drawn from the Census Bureau’s 2016 and 2018 Current Population Survey’s Voting and Registration Supplement. I’ve placed a zip file onto Dropbox containing these data:

https://www.dropbox.com/s/vhk8rekf0qasvew/turnout_rate_data.zip?dl=0

Download this file by clicking on the URL or copying the URL into a web browser. When in the Dropbox interface, click the ... button in the upper right-hand side to download the file to the directory you will do your work in. (If at some later point Dropbox changes its web interface, you are on your own to figure out how to download the file.) Unzip the files into this directory. You are now ready to use these data.

To accompany the programming, I assign Chapter 4 of Wattenberg’s Is Voting for Young People?, which discusses youth turnout rates drawn from the CPS. His conclusion that younger people shouldn’t be allowed to vote will spark discussion.

Load Packages

As with all R code, we begin by loading the necessary packages to run the code.

library(tidyverse)

## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Packages are code that has been developed to do certain tasks. Tidyverse is a set of tools that manage and visualize data in R.

If you try to load a package that hasn’t previously been installed, you will get an error message. To install a new package, go to the Tools->Install Packages menu item in RStudio and follow the instructions.
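You can also check for and install packages from the console. Here is a minimal sketch, assuming you have an internet connection for the install step. The helper name missing_packages is made up for illustration, and the install line is commented out so that running the block does not change your system.

```r
# Hypothetical helper (not part of any package): return the names of the
# packages in `pkgs` that are not installed on this machine.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse"))

if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to download and install
}
```

install.packages() is the console equivalent of the RStudio menu item, and you only need to run it once per machine.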

Other Global Setup

Once the packages are loaded, I like to do other global setup, like setting the working directory, which I’ve called working.dir. Let’s break down how this R statement works for a moment, since it is basic to R coding.

The <- indicates that I’m assigning something to working.dir.

That something is the string "D:/Classes/Election Science/Class 2"

This directory working.dir is usually where I keep my data and files for a single project. Sometimes, though, data may be stored in a different directory that has many different programs pointing to it, like a large voter file or a lookup table of county names and codes. In this case, create another variable, such as data.dir for the data directory, and so on.

working.dir <- "D:/Classes/Election Science/Class 2"
working.dir

## [1] "D:/Classes/Election Science/Class 2"

As with just about everything when it comes to programming, there is more than one way to skin a cat. Another way to manage the directory R is pointing to is to set the working directory with the command setwd.

setwd("D:/Classes/Election Science/Class 2")

I do not like this approach since it is easy to forget to set the working directory back. But to each their own.
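If you do use setwd, one way to avoid forgetting is to save the old directory at the moment you change it. This is a sketch rather than a recommendation; tempdir() stands in for a real project directory, and old.dir is a name I made up.

```r
# setwd() invisibly returns the directory you were in before the change.
old.dir <- setwd(tempdir())   # tempdir() stands in for a project directory

# ... work that depends on the new working directory goes here ...

setwd(old.dir)                # put things back the way they were
getwd()
```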

/ or \

Note that my directory structure uses a forward slash / to separate the names of directories and subdirectories. Windows normally displays paths with a backslash \ instead. R accepts forward slashes on every operating system, but a backslash inside an R string must be doubled (\\), because a single backslash starts an escape sequence. This leads to a pretty common error message for newbies, and even not newbies. If you’re having trouble with loading a file, check your directory name for typos and the direction of your slashes.
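A short sketch of the two slash styles, using the same directory name as above. The object names forward, backward, and converted are made up for illustration.

```r
# Forward slashes work in R on Windows, macOS, and Linux.
forward <- "D:/Classes/Election Science/Class 2"

# A backslash inside an R string has to be doubled, because a single
# backslash starts an escape sequence such as \n (newline).
backward <- "D:\\Classes\\Election Science\\Class 2"

# Convert backslashes to forward slashes. fixed = TRUE treats "\\" as one
# literal backslash rather than as a regular expression.
converted <- gsub("\\", "/", backward, fixed = TRUE)

identical(converted, forward)
```

This is handy when you copy a path out of Windows Explorer and paste it into your code.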

Creating a filename with the working directory path

I next need to create a string with the path and name of the data file I want to work with (one of the ones you downloaded in the first step, above). In this case, I’m calling the string R object turnout_2016.file.

To do this, I use the paste command, which joins (or concatenates) together strings.

turnout_2016.file <- paste(working.dir,"/turnout_rates_by_age_2016.csv", sep="")

Notice that next to paste are some parentheses with stuff inside: paste(stuff inside). This is how R and most programming languages pass information into a function, which takes that information, manipulates it, and returns the output from the manipulated information.

In this example, the code passes the working.dir R object (which, recall, also happens to be a string) and the string "/turnout_rates_by_age_2016.csv". These are the two strings I would like to concatenate. You can add more strings if you wish and tap your heels together three times.

I also pass to the paste function another bit of information, sep="". Here, I am interacting with one of the paste options. I’m overriding the default paste separator, or what paste will put between the strings it concatenates together. The default separator is a space, or sep=" ", which will put an awkward space in a place that I don’t want it and break my code. So, I’d like to use no separator at all, which is the same as sep="".

Finally, note that each of the three pieces of information I pass on to paste is separated by a comma. This lets R tell the different pieces apart. Another common programming error is forgetting a ,.

The result is that paste creates a new string, which is passed to the turnout_2016.file R object. I can find out what is in this object – and any other object – by calling the name of the object.

turnout_2016.file

## [1] "D:/Classes/Election Science/Class 2/turnout_rates_by_age_2016.csv"
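To see the effect of the separator side by side, here is a small sketch that builds the same path four ways. paste0 and file.path are base R functions not used elsewhere in this document; the object names are made up for illustration.

```r
working.dir <- "D:/Classes/Election Science/Class 2"
file.name   <- "turnout_rates_by_age_2016.csv"

# The default separator is a single space, which corrupts the path.
with.space <- paste(working.dir, "/", file.name)

# sep = "" removes the separator, as in the text above.
no.space <- paste(working.dir, "/", file.name, sep = "")

# paste0() is shorthand for paste(..., sep = "").
also.no.space <- paste0(working.dir, "/", file.name)

# file.path() inserts the "/" between the pieces for you.
with.file.path <- file.path(working.dir, file.name)

with.space
no.space
```

Whichever you choose, print the result once to confirm the path looks the way you expect.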

Loading a data file

I can now read in the data from turnout_2016.file. I’ve pre-processed these data in this file to speed up the class. In practice, there will be more preparation work. Much more.

We will talk about reading in different file formats in the future.

turnout_2016.data <- read_csv(turnout_2016.file, col_names = T)

## Parsed with column specification:
## cols(
##   age = col_double(),
##   citizen_turnout_rate = col_double(),
##   citizen_turnout_rate_white = col_double(),
##   citizen_turnout_rate_black = col_double(),
##   citizen_turnout_rate_hispanic = col_double(),
##   registered_turnout_rate = col_double()
## )

The read_csv command reads into R a data file with comma separated values format, which means for each row of data, the columns are separated by a comma. A file name with .csv at the end is usually a good sign it has rows with commas separating the column values. Of course, it would be silly to expect that if you renamed a file with the .csv extension it would magically convert itself into comma separated values format.

The other piece of information I pass to read_csv is col_names = T. This tells read_csv that the first row of the file holds headers, or the names of the variables, rather than data. These data do have variable names, so col_names should be True. (True is in fact the read_csv default, so I am spelling the option out here rather than changing it; col_names = F would tell read_csv to treat the first row as data.) R lets you abbreviate TRUE and FALSE as T and F, and who wants to write more than they need to?

I place the data that read_csv read into the R object turnout_2016.data. R objects can be many things: strings, numbers, and in this case the entire dataset nicely formatted into rows and columns with variable names.

If you are using RStudio, the object turnout_2016.data appeared in the right pane. Go ahead and click on it. You will see the raw data for this example. Also try simply typing turnout_2016.data into the command line console.
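To see what col_names does, here is a sketch that reads the same tiny file both ways. The file is written to a temporary location, and the numbers in it are invented for illustration; they are not CPS estimates.

```r
library(readr)

# Write a tiny CSV to a temporary file. The numbers are invented for
# illustration and are NOT taken from the CPS.
toy.file <- tempfile(fileext = ".csv")
writeLines(c("age,rate", "18,0.40", "19,0.45"), toy.file)

# col_names = TRUE: the first row supplies the variable names.
with.header <- suppressMessages(read_csv(toy.file, col_names = TRUE))
names(with.header)
nrow(with.header)

# col_names = FALSE: the first row is read as data, and the columns
# get the default names X1, X2, ...
without.header <- suppressMessages(read_csv(toy.file, col_names = FALSE))
names(without.header)
nrow(without.header)
```

If a dataset ever shows up with columns called X1 and X2 and one more row than you expected, this option is the first thing to check.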

Creating a plot

If you actually looked at the data like you were supposed to, you will see it is hard to interpret turnout_2016.data by just looking at the columns of data. Let’s now create our first plot!

Let’s plot the turnout rate for citizens by age in the 2016 presidential election.

ggplot(data = turnout_2016.data) + 
  geom_point(aes(x=age, y=citizen_turnout_rate))

We see here a common pattern in the relationship between age and voter turnout, where the youngest voters vote at the lowest rate, turnout increases with age, peaks around retirement age, and slumps again in the twilight years.

I am using the geom_point function to generate this plot, where each dot represents the turnout rate (on the Y axis) for citizens of each age (on the X axis).

Let’s take a moment to break this command down further to understand what is going on.

The function I’m calling is ggplot, which comes from ggplot2, a visualization package for R created by Hadley Wickham. ggplot2 is part of the tidyverse, which is why loading tidyverse above was enough to make ggplot available.

Next to ggplot you will find (data = turnout_2016.data). Here, we are telling ggplot to visualize the data found in turnout_2016.data.

The + sign tells R that I’m not done with the ggplot command yet. As long as there is a + at the end of a line, R will continue to process ggplot commands. If you leave a stray + at the end of the last command that you want to run with ggplot, you will break your code - another common bug!

The next command that is run is geom_point, a function that creates a point plot (thus the name point). I need to tell geom_point which variables in the dataset I want to plot against one another. I do this with the code snippet aes(x = age, y = citizen_turnout_rate): I want to plot the variable age on the x axis and citizen_turnout_rate on the y axis. See the aes? This stands for aesthetic, and it tells geom_point which variables to map to which visual properties of the plot.

There is always more than one way to skin a cat. This also works:

ggplot(data = turnout_2016.data, aes(x = age, y = citizen_turnout_rate)) +
  geom_point()

As does this:

ggplot(data = turnout_2016.data, aes(age, citizen_turnout_rate)) +
  geom_point()

And this, too:

ggplot(data = turnout_2016.data) + 
  geom_point(aes(age, citizen_turnout_rate))

I like to explicitly tell R which variables to associate with the x and y axes. In just a minute, we will see it is possible to overlay more than one plot on top of each other, which is why I prefer to put the aes() with geom_point.

Changing basic `geom_point` Plot Options

In the dataset, there are variables not only for the citizen turnout rate for everyone, but also for turnout rates for persons of different races. What if I want to compare turnout rates for different races using the same plot? There are a couple of ways to do this. First, let’s use a ggplot trick that allows me to overlay plots on top of one another.

I tell ggplot I want to make another plot like the one I just made, but this time I make one plot with the citizen_turnout_rate variable and add a second plot with another variable in the same dataset, citizen_turnout_rate_white. This second plot also uses geom_point, and it is added to the same output with a + on the preceding line telling R that you still want to add things to ggplot.

ggplot(data = turnout_2016.data) + 
  geom_point(aes(x=age, y=citizen_turnout_rate)) +
  geom_point(aes(x=age, y=citizen_turnout_rate_white))

A problem with this plot is that you really can’t tell the dots apart. We need a way to distinguish them. Fortunately, there are options for what sort of markers to use for the dots, marker colors, and more. Let’s try this again, but this time I’m going to tell geom_point to plot a diamond for each of the dots associated with citizen_turnout_rate_white.

ggplot(data = turnout_2016.data) + 
  geom_point(aes(x=age, y=citizen_turnout_rate)) +
  geom_point(shape = 23, aes(x=age, y=citizen_turnout_rate_white))

That looks a little better.

How do I know shape=23 is a diamond with an empty fill? You can see all the options for geom_point plot at:

http://www.sthda.com/english/wiki/ggplot2-point-shapes

We can do more to distinguish the dots.

ggplot(data = turnout_2016.data) + 
    geom_point(size = 3, shape = 21, fill = "orange", aes(x=age, y=citizen_turnout_rate)) +
    geom_point(size = 3, shape = 23, fill = "blue", aes(x=age, y=citizen_turnout_rate_white))

What can we learn from this plot? With just a few exceptions, turnout rates for whites are higher than for the overall population. The few deviations - like for 25 year olds - are probably due more to random variation in the survey than to a real lower turnout.

Here, I am using color fills. An R color palette is available here:

http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

and here:

https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf

I must pass a string to the fill option, which is why fill = "orange" works while fill = orange doesn’t: without the quotes, R goes looking for an object named orange. That also means I can pass fill an R object, as long as the object contains a string that is an acceptable value for the option. For example, this works.

fill_color <- "orange"

ggplot(data = turnout_2016.data) + 
    geom_point(size = 3, shape = 21, fill = fill_color, aes(x=age, y=citizen_turnout_rate))

Sometimes this trick is helpful when you are doing visualizations in bulk.
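Here is a sketch of what doing it in bulk might look like: one plot per fill color, stored in a list. The data frame toy.data holds invented values (not CPS estimates) so the block runs on its own, and the object names are made up for illustration.

```r
library(ggplot2)

# Toy data with invented values; these are NOT CPS estimates.
toy.data <- data.frame(age = c(20, 40, 60), rate = c(0.4, 0.6, 0.7))

fill.colors <- c("orange", "blue", "darkgreen")

# Build one plot per fill color and keep them in a named list.
plots <- lapply(fill.colors, function(fill.color) {
  ggplot(data = toy.data) +
    geom_point(size = 3, shape = 21, fill = fill.color,
               aes(x = age, y = rate))
})
names(plots) <- fill.colors

plots[["blue"]]   # print one of the stored plots
```

The same pattern works for any option that takes a string, such as shape names or titles.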

Why doesn’t this work the same as before?

ggplot(data = turnout_2016.data) + 
    geom_point(size = 3, shape = 19, fill = "orange", aes(x=age, y=citizen_turnout_rate)) +
    geom_point(size = 3, shape = 23, fill = "blue", aes(x=age, y=citizen_turnout_rate_white))

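The circle shape I’m using for the overall citizen turnout rate, shape = 19, is a solid circle that is drawn entirely in its outline color (black by default). Only shapes 21 through 25 have a separate interior that fill can change, so R ignores that I tried to set the fill to orange.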
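If you do want an orange solid circle, set color rather than fill. A minimal sketch, again using an invented toy data frame rather than the CPS data:

```r
library(ggplot2)

# Toy data with invented values; these are NOT CPS estimates.
toy.data <- data.frame(age = c(20, 40, 60), rate = c(0.4, 0.6, 0.7))

# Shape 19 is a solid circle: `color` paints the whole marker.
solid <- ggplot(data = toy.data) +
  geom_point(size = 3, shape = 19, color = "orange",
             aes(x = age, y = rate))

# Shape 21 is a circle with a separate outline (`color`) and
# interior (`fill`).
outlined <- ggplot(data = toy.data) +
  geom_point(size = 3, shape = 21, color = "black", fill = "orange",
             aes(x = age, y = rate))

solid
outlined
```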

I can also change the color of the outline of the shape.

ggplot(data = turnout_2016.data) + 
    geom_point(size = 3, shape = 21, fill = "orange", color = "red", aes(x=age, y=citizen_turnout_rate)) +
    geom_point(size = 3, shape = 23, fill = "blue", color = "red", aes(x=age, y=citizen_turnout_rate_white))

Having R Automatically Choose For You

What if I want R to do the work of choosing the colors and dots for me, rather than doing the work manually? There is a way to do this, but it requires the data to be in a certain format. I’ve pre-processed the CPS data and created a file called turnout_rates_by_age_race_2016.csv which is suitable for this example. In later classes we’ll find out how to reshape the turnout_rates_by_age_2016.csv data we’ve been working with. But for now, let’s skip this step so we can learn more about plot options.

I need to load in the new file.

turnout_2016_race.file <- paste(working.dir,"/turnout_rates_by_age_race_2016.csv", sep="")
turnout_2016_race.data <- read_csv(turnout_2016_race.file, col_names = T)

## Parsed with column specification:
## cols(
##   age = col_double(),
##   turnout_rate = col_double(),
##   race = col_double()
## )

Note that this file has only three variables: age and turnout_rate, which we’ve been using before, and a new variable, race. Examine the data frame contents. What do you see? There are 256 observations where before there were 64. There are 4 different values for race, repeated for each age.

If I plot all these data, I get a plot that is difficult to interpret, so I am going to need to tell R to distinguish the data points.

ggplot(data = turnout_2016_race.data) + 
    geom_point(aes(x=age, y=turnout_rate))

Adding a color option kind of does what I want, but a problem is that R wants to color the 4 values of race on a scale.

ggplot(data = turnout_2016_race.data) + 
    geom_point(aes(x=age, y=turnout_rate, color = race))

The four values of race are different categories; they are not on a scale. I can tell R that these values are different categories this way:

turnout_2016_race.data$race = factor(turnout_2016_race.data$race)

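A factor is a special kind of R data type, like a number or string. R associates each distinct value of a factor with a category, called a level, rather than treating the values as numbers on a scale.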

ggplot(data = turnout_2016_race.data) + 
    geom_point(aes(x=age, y=turnout_rate, color = race))

This looks better. R has chosen a different default color value for each of the four different race classifications.
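To see what factor does on its own, here is a sketch with a short vector of invented category codes. The labels "group A" through "group D" are placeholders; I am not assuming which code stands for which race in the CPS file.

```r
# Invented category codes, standing in for a coded variable like race.
codes <- c(1, 2, 3, 4, 1, 2)

is.numeric(codes)          # numbers: R would treat them as a scale

codes.factor <- factor(codes)
is.numeric(codes.factor)   # no longer numeric
levels(codes.factor)       # the distinct categories, stored as labels

# Optional: attach readable labels (placeholder names, not the CPS coding).
codes.labeled <- factor(codes, levels = 1:4,
                        labels = c("group A", "group B", "group C", "group D"))
table(codes.labeled)
```

With labels attached, the plot legend shows the names instead of the bare codes.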

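Notice that I’m now putting color inside the aes(), because the color depends on a variable in the data. This won’t work, because outside of aes() R looks for a standalone object called race rather than the race column in the data.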

ggplot(data = turnout_2016_race.data) + 
    geom_point(color = race, aes(x=age, y=turnout_rate))
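The rule of thumb is that anything that varies with the data goes inside aes(), and anything that is the same for every point goes outside. A sketch with an invented toy data frame (not the CPS file) that shows both:

```r
library(ggplot2)

# Toy data with invented values; these are NOT CPS estimates.
toy.data <- data.frame(
  age          = c(20, 40, 20, 40),
  turnout_rate = c(0.4, 0.6, 0.5, 0.7),
  race         = factor(c(1, 1, 2, 2))
)

# Inside aes(): color is mapped to a column, one color per category.
mapped <- ggplot(data = toy.data) +
  geom_point(aes(x = age, y = turnout_rate, color = race))

# Outside aes(): color is a fixed value applied to every point.
fixed <- ggplot(data = toy.data) +
  geom_point(color = "red", aes(x = age, y = turnout_rate))

# layer_data() returns the values ggplot2 computed for a layer.
length(unique(layer_data(mapped)$colour))
length(unique(layer_data(fixed)$colour))
```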

What else can I do? How about automatically changing the size of the dots?

ggplot(data = turnout_2016_race.data) + 
    geom_point(aes(x=age, y=turnout_rate, size = race))

## Warning: Using size for a discrete variable is not advised.

R is actually giving us a warning that this is probably not a good thing to do with a factor. Plotting the size of a dot by a continuous variable like population might help you make a point in a visualization of, say, election results by counties, with larger dots representing the size of the county.

Here is one more, which automatically changes the markers for each race classification.

ggplot(data = turnout_2016_race.data) + 
    geom_point(aes(x=age, y=turnout_rate, shape = race))

Facets

Perhaps you are not satisfied with this plot because we are attempting to plot too much data at once. It is possible to create separate plots for each race using facets.

ggplot(data = turnout_2016_race.data) + 
    geom_point(aes(x=age, y=turnout_rate)) +
    facet_wrap(~race, nrow = 2)

~race tells R that you want to create a separate plot for each level of race and the option nrow = 2 tells R that you want two rows of plots.

You can do more with facets, even telling R to do a 2x2 table with different variables!
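One way to get a table of plots is facet_grid, which takes one variable for the rows and another for the columns. This is a sketch on an invented toy data frame; row.var and col.var are made-up names standing in for two categorical variables.

```r
library(ggplot2)

# Toy data: every combination of two 2-level variables, with invented
# x and y values. These are NOT CPS data.
toy.data <- expand.grid(x = 1:3,
                        row.var = c("a", "b"),
                        col.var = c("u", "v"))
toy.data$y <- seq_len(nrow(toy.data))

# row.var ~ col.var: levels of row.var become rows of panels, and
# levels of col.var become columns.
p <- ggplot(data = toy.data) +
  geom_point(aes(x = x, y = y)) +
  facet_grid(row.var ~ col.var)

p
```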

A Smooth Line Plot

Let’s look at a different type of plot. How about a nice looking line instead of those dots?

ggplot(data = turnout_2016.data) + 
  geom_smooth(aes(x=age, y=citizen_turnout_rate))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

What does this plot do? It creates a line fitted to the points in the previous graph, using a technique known as loess or local regression. Basically, loess fits a small regression to the set of points nearest the left edge of the data, giving the closest points the most weight, then slides the window to the right and fits again. It draws a line connecting the fitted values generated as the function sweeps across the full range of the data. The shaded region is a confidence interval around the line (95% by default): it is wider where the fit is less certain, such as where the data are sparse or spread out.
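To look at the numbers behind a smoothed line, the same kind of fit can be run directly with the base R function loess. This is a sketch on cars, a small dataset that ships with R, rather than on the turnout data; the span value shown is the loess default, and I am assuming it matches what geom_smooth uses when it picks loess.

```r
# `cars` ships with R: stopping distance (dist) by speed.
# span is the share of the data used in each local fit.
fit <- loess(dist ~ speed, data = cars, span = 0.75)

# Fitted values (the line) and standard errors (the basis of the band).
smoothed <- predict(fit, se = TRUE)

head(smoothed$fit)
head(smoothed$se.fit)
```

A smaller span follows the points more closely and gives a wigglier line; a larger span gives a smoother one.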

The geom_smooth plot is more pleasing to the eye and easier for people to interpret, so smoothed lines are generally preferred. However, you may still wish to create a point graph. It is generally good practice to plot your data to detect outliers - an outlier could be a data error, or could be very influential to your analysis. You may also wish to use a point plot to highlight certain data points of interest to the data story you are telling.

There are, of course, options for line types and colors.

ggplot(data = turnout_2016_race.data) + 
  geom_smooth(aes(x=age, y=turnout_rate, linetype = race, color = race))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Overlay Plots

It is possible to overlay multiple plots onto each other using ggplot. Just invoke the new plot that you want to add before you stop ggplot commands.

ggplot(data = turnout_2016.data) + 
  geom_point(aes(x=age, y=citizen_turnout_rate)) +
  geom_smooth(aes(x=age, y=citizen_turnout_rate))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Combining these two plots tells us that the loess line is generally good, but is under-estimating the turnout rate around the peak age of 70 and over-estimating turnout for the oldest people. There is also something going on with the data points for the oldest people that you will see upon closer inspection. What is it? (hint: top-coding)

Bar charts

Okay, so I’m going to cheat and use a tidyverse example dataset, diamonds.

Here is a bar chart.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

The cool thing happening here is that R is doing a calculation when it creates the plot, which is the number of observations in the data frame diamonds with a cut that is Fair, Good, and so on.

The calculation that geom_bar is doing is called the stat and the default for a geom_bar is a count.
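You can reproduce the count yourself with the base R function table, which is a handy check on what the bar heights mean. A minimal sketch:

```r
library(ggplot2)  # provides the diamonds data

# Count the observations in each category of cut, which is the same
# calculation geom_bar does by default.
cut.counts <- table(diamonds$cut)

cut.counts
```

The five numbers should match the heights of the five bars in the chart above.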

If, perchance, you have preprocessed data ready for a bar chart, you can tell R to skip the stat and just plot the data, as is done in this example.

demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)

ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")

You can also tell geom_bar to output proportions instead of the raw numbers.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))

You can plot other summary statistics besides a count. Here, stat_summary plots the median depth for each cut, with a line running from the minimum to the maximum.

ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )

You can color bar charts, like before.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))

Here is a cool trick. You can set the fill from another variable!

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity))

The result is a stacked bar plot: each bar is divided into colored segments, one per value of clarity, sized by the number of observations with that cut and clarity!

You can use the alpha option to change the transparency of the fill, to get a nice shaded plot.

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
  geom_bar(alpha = 1/5, position = "identity")

You can put the bars beside each other!

ggplot(data = diamonds) + 
  geom_bar(position="dodge", mapping = aes(x = cut, fill = clarity))

Michael McDonald - University of Florida

August 26, 2019
