Introduction

I’ve followed politics since high school, so I decided to take a look at FiveThirtyEight’s Generic Ballot dataset. The dataset collects polls that ask voters which party they support in the current election cycle.

The Dataset

First, import the dataset and view its structure.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ stringr 1.4.0
## ✔ tidyr   1.2.0     ✔ forcats 0.5.1
## ✔ readr   2.1.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
url <- "https://projects.fivethirtyeight.com/generic-ballot-data/generic_polllist.csv"
df <- read.csv(url) 
glimpse(df)
## Rows: 426
## Columns: 21
## $ subgroup      <chr> "All polls", "All polls", "All polls", "All polls", "All…
## $ modeldate     <chr> "9/3/2022", "9/3/2022", "9/3/2022", "9/3/2022", "9/3/202…
## $ startdate     <chr> "11/21/2020", "12/9/2020", "2/12/2021", "2/18/2021", "2/…
## $ enddate       <chr> "11/23/2020", "12/13/2020", "2/18/2021", "2/23/2021", "2…
## $ pollster      <chr> "McLaughlin & Associates", "McLaughlin & Associates", "E…
## $ grade         <chr> "C/D", "C/D", "B/C", "B", "", "C/D", "B/C", "B/C", "", "…
## $ samplesize    <int> 1000, 1000, 1005, 1000, 1200, 1000, 1002, 1008, 1000, 10…
## $ population    <chr> "lv", "lv", "rv", "rv", "lv", "lv", "a", "rv", "lv", "rv…
## $ weight        <dbl> 0.4267553, 0.4056115, 0.7268514, 1.1342245, 1.0558690, 0…
## $ influence     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ dem           <dbl> 45.00000, 46.00000, 44.00000, 47.66667, 39.00000, 46.000…
## $ rep           <dbl> 47.00000, 48.00000, 44.00000, 42.33333, 38.00000, 46.000…
## $ adjusted_dem  <dbl> 44.04946, 45.04946, 41.81625, 45.69885, 40.64968, 45.049…
## $ adjusted_rep  <dbl> 44.03849, 45.03849, 43.73029, 40.90631, 38.93957, 43.038…
## $ multiversions <chr> "", "", "", "*", "", "", "", "", "", "", "", "", "", "",…
## $ tracking      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ url           <chr> "https://mclaughlinonline.com/pols/wp-content/uploads/20…
## $ poll_id       <int> 73339, 73755, 74400, 74469, 74745, 74444, 74438, 74557, …
## $ question_id   <int> 139355, 139356, 140351, 140129, 140879, 140041, 139992, …
## $ createddate   <chr> "1/20/2021", "1/20/2021", "3/24/2021", "3/10/2021", "4/3…
## $ timestamp     <chr> "17:05:09  3 Sep 2022", "17:05:09  3 Sep 2022", "17:05:0…

The dataset consists of 426 observations (rows) and 21 columns.

Data Cleaning

This analysis focuses on how voter preference between Democrats and Republicans changes over time, so the dataset is reduced to the three relevant columns: enddate, dem, and rep.

# convert enddate column to date format
# (%Y matches the four-digit years in the CSV; %y would read only the first two digits)
df$enddate <- df$enddate %>%
  as.Date(format = "%m/%d/%Y")

# select columns to subset
cols <- c('enddate', 'dem', 'rep')
df_clean <- select(df, all_of(cols))

# rename enddate column: date
df_clean <- rename(df_clean, date = enddate)

# view dataframe header
head(df_clean)
##         date      dem      rep
## 1 2020-11-23 45.00000 47.00000
## 2 2020-12-13 46.00000 48.00000
## 3 2021-02-18 44.00000 44.00000
## 4 2021-02-23 47.66667 42.33333
## 5 2021-02-25 39.00000 38.00000
## 6 2021-02-28 46.00000 46.00000

We now have each poll’s Democratic and Republican share alongside its end date, but the individual rows don’t tell us much at first glance.
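
The format string passed to as.Date() is worth checking against the raw values, because the CSV stores four-digit years (for example "2/18/2021"). Below is a minimal sketch of the difference between %y and %Y, using two end dates copied from the glimpse() output; the variable names are only for illustration.

```r
# Two end dates as they appear in the raw CSV (month/day/four-digit year)
raw_dates <- c("12/13/2020", "2/18/2021")

# %y consumes only two year digits ("20"), so both dates land in 2020
two_digit <- as.Date(raw_dates, format = "%m/%d/%y")

# %Y consumes all four year digits and keeps 2021 intact
four_digit <- as.Date(raw_dates, format = "%m/%d/%Y")

format(two_digit, "%Y")   # "2020" "2020"
format(four_digit, "%Y")  # "2020" "2021"
```

lubridate::mdy() is an alternative that parses month/day/year strings like these without an explicit format string.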

Data Exploration

Let’s take a look at the average poll scores by month.

by_month <- df_clean %>%
  group_by(date = floor_date(date, 'month')) %>%
  summarize(rep = mean(rep), dem = mean(dem)) %>%
  mutate(date = format(date, "%m-%y"))

by_month
## # A tibble: 12 × 3
##    date    rep   dem
##    <chr> <dbl> <dbl>
##  1 01-20  44.3  43.0
##  2 02-20  42.9  42.8
##  3 03-20  43.6  42.7
##  4 04-20  43.0  43.2
##  5 05-20  43.0  43.7
##  6 06-20  42.9  42.8
##  7 07-20  42.9  43.7
##  8 08-20  42.1  43.9
##  9 09-20  41.4  44.9
## 10 10-20  41    42.5
## 11 11-20  41.9  41.8
## 12 12-20  42.2  42.1

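Grouping the polls by month with floor_date() and averaging each party’s share with summarize() condenses 426 individual polls into one row per month, which is much easier to read. One caveat: every label in the table above ends in -20, which is what a two-digit %y date format produces when it is applied to four-digit years such as 2021. Each row of that table therefore pools the same calendar month across the whole cycle rather than tracking a single month.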
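
To make the grouping step concrete, here is a small sketch on made-up data (the values in `toy` are invented for illustration and are not drawn from the polls). It suggests that floor_date() keeps the year, so February 2021 and February 2022 stay in separate groups as long as the dates carry the right year.

```r
library(dplyr)
library(lubridate)

# Made-up polls for illustration only (not taken from the dataset)
toy <- tibble(
  date = as.Date(c("2021-02-18", "2021-02-23", "2022-02-10")),
  dem  = c(44, 48, 42),
  rep  = c(44, 42, 46)
)

toy_by_month <- toy %>%
  # floor_date() maps every date to the first day of its month, keeping the year
  group_by(month = floor_date(date, "month")) %>%
  # average each party's share and count the polls behind each average
  summarize(dem = mean(dem), rep = mean(rep), n_polls = n())

toy_by_month
```

Including n() alongside the means is a design choice rather than a requirement: it shows how many polls sit behind each monthly average, which helps when some months have very few polls.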

Conclusions

We now have a better way to observe voter preference over time. Future steps for this dataset include plotting the Republican and Democratic averages over time as a line graph and adding a column for the difference between the rep and dem values.
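
As a rough sketch of those next steps, the code below adds a difference column and draws one line per party. The `monthly` values are made up to keep the example self-contained, and the names `margin`, `monthly_long`, and `p` are hypothetical; in practice the real monthly averages would take the place of the toy table.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up monthly averages standing in for the real by-month table
monthly <- tibble(
  month = as.Date(c("2022-06-01", "2022-07-01", "2022-08-01")),
  dem   = c(43.0, 43.5, 44.0),
  rep   = c(45.0, 44.5, 44.0)
)

# Positive margin means Democrats lead; negative means Republicans lead
monthly <- monthly %>%
  mutate(margin = dem - rep)

# Reshape to long format so each party becomes its own line
monthly_long <- monthly %>%
  pivot_longer(c(dem, rep), names_to = "party", values_to = "share")

p <- ggplot(monthly_long, aes(x = month, y = share, color = party)) +
  geom_line() +
  labs(x = "Month", y = "Average poll share (%)", color = "Party")

p
```

Keeping the month as a Date (rather than a formatted string) lets ggplot2 place the points on a continuous time axis in chronological order.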