Project 2

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("C:/Users/Mulut/Desktop/Classes/Data101/projects/project 2")

voter_count <- read_csv("voter_count.csv")

## Rows: 936 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): region
## dbl (6): year, voting_eligible_population, total_ballots_counted, highest_of...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Does the average voter turnout percentage for the highest office (percent_highest_office) differ between presidential election years and midterm election years in the United States from 1980 to 2014?

Introduction

This project explores how voter turnout in the United States changes between presidential and midterm election years. Voter turnout is an important measure of civic engagement and helps us understand how citizens participate in democracy. Historically, turnout tends to be higher in presidential elections, but this project aims to test that difference using real data.

The dataset used is the US Voter Turnout Data (voter_count) from OpenIntro (openintro.org), which includes 936 observations on elections from 1980 to 2014. Each observation represents the voting data for one region (a state or the national total) in a given election year. The dataset provides information about the voting-eligible population, the total number of ballots counted, and the percentage of ballots cast for the highest office. By analyzing this dataset, I hope to determine whether voter turnout, measured as the percent of ballots cast for the highest office, is significantly higher during presidential elections than during midterm elections.

Data Analysis

I began by loading the dataset and checking its structure using the head and str functions to view the first few rows and identify the variable types. Next, I focused only on state-level rows to avoid double counting national totals. I also checked for missing values and used na.rm = TRUE to handle them during calculations. After cleaning the data, I created a new variable called election_type, which classifies each election year as either “Presidential” (years divisible by 4) or “Midterm” (other years). Then, I summarized and compared the average percent turnout for the highest office between these two categories.

To better understand the data, I generated summary statistics and a bar plot comparing mean turnout in presidential and midterm elections. Finally, I conducted a two-sample t-test to test whether the mean voter turnout percentage is significantly higher in presidential elections.

The variables I used in this project are:

percent_highest_office – Voter turnout for the highest office of the election, stored as a proportion between 0 and 1

year – Year the election took place (used to classify election type)

region – Name of the state, or “United States” for the national total

highest_office – Number of ballots that contained a vote for the highest office

head(voter_count)

## # A tibble: 6 × 7
##    year region       voting_eligible_popu…¹ total_ballots_counted highest_office
##   <dbl> <chr>                         <dbl>                 <dbl>          <dbl>
## 1  2014 United Stat…              227157964              83262122       81687059
## 2  2014 Alabama                     3588783               1191274        1180413
## 3  2014 Alaska                       520562                285431         282382
## 4  2014 Arizona                     4510186               1537671        1506416
## 5  2014 Arkansas                    2117881                852642         848592
## 6  2014 California                 24440416               7513972        7317581
## # ℹ abbreviated name: ¹voting_eligible_population
## # ℹ 2 more variables: percent_total_ballots_counted <dbl>,
## #   percent_highest_office <dbl>

str(voter_count)

## spc_tbl_ [936 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ year                         : num [1:936] 2014 2014 2014 2014 2014 ...
##  $ region                       : chr [1:936] "United States" "Alabama" "Alaska" "Arizona" ...
##  $ voting_eligible_population   : num [1:936] 2.27e+08 3.59e+06 5.21e+05 4.51e+06 2.12e+06 ...
##  $ total_ballots_counted        : num [1:936] 83262122 1191274 285431 1537671 852642 ...
##  $ highest_office               : num [1:936] 81687059 1180413 282382 1506416 848592 ...
##  $ percent_total_ballots_counted: num [1:936] 0.367 0.332 0.548 0.341 0.403 ...
##  $ percent_highest_office       : num [1:936] 0.36 0.329 0.542 0.334 0.401 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   year = col_double(),
##   ..   region = col_character(),
##   ..   voting_eligible_population = col_double(),
##   ..   total_ballots_counted = col_double(),
##   ..   highest_office = col_double(),
##   ..   percent_total_ballots_counted = col_double(),
##   ..   percent_highest_office = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
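The printed values suggest that percent_highest_office is highest_office divided by voting_eligible_population. This is an inference from the output above, not a definition taken from the dataset documentation. A minimal sketch that checks it on three state rows copied from the head output (the data frame name rows and the column ratio are illustrative):

```r
# Three rows copied from the head(voter_count) output above
rows <- data.frame(
  region = c("Alabama", "Alaska", "Arizona"),
  voting_eligible_population = c(3588783, 520562, 4510186),
  highest_office = c(1180413, 282382, 1506416)
)

# Assumed relationship: ballots for the highest office / eligible population
rows$ratio <- rows$highest_office / rows$voting_eligible_population

# Compare with the str() output, which shows 0.329, 0.542, 0.334 for these states
round(rows$ratio, 3)
```

The rounded ratios can be compared directly with the percent_highest_office values shown by str for the same states.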

turnout <- voter_count |>
  filter(region != "United States") |>
  mutate(election_type = ifelse(year %% 4 == 0, "Presidential", "Midterm"))
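The missing-value check and the year classification can be sanity-checked separately. The sketch below is only an illustration: toy is a made-up data frame standing in for voter_count, and the years vector assumes one election every two years from 1980 to 2014.

```r
# Hypothetical toy data standing in for voter_count; the values are made up
toy <- data.frame(
  year = c(2008, 2010, 2012, 2014),
  region = c("Alabama", "Alabama", "United States", "Alaska"),
  percent_highest_office = c(0.61, NA, 0.58, 0.54)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Assumed election years: every two years from 1980 to 2014
years <- seq(1980, 2014, by = 2)

# Same rule as in the mutate() call: years divisible by 4 are presidential
election_type <- ifelse(years %% 4 == 0, "Presidential", "Midterm")

# Under this assumption the 18 election years split into nine of each type
table(election_type)
```

Running the same colSums(is.na(...)) call on the real data would show which columns contain the missing value that na.rm = TRUE skips.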

Overall Summaries

summary(turnout$percent_highest_office)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.2020  0.4154  0.5009  0.4996  0.5857  0.7837       1

mean(turnout$percent_highest_office, na.rm = TRUE)

## [1] 0.4995872

sd(turnout$percent_highest_office, na.rm = TRUE)

## [1] 0.1102049

Group and Summarise by Election Type

election_by_type <- turnout |>
  group_by(election_type) |>
  summarise(
    n = n(),
    mean_turnout   = mean(percent_highest_office, na.rm = TRUE),
    median_turnout = median(percent_highest_office, na.rm = TRUE),
    sd_turnout     = sd(percent_highest_office, na.rm = TRUE),
    min_turnout    = min(percent_highest_office, na.rm = TRUE),
    max_turnout    = max(percent_highest_office, na.rm = TRUE)
  )

election_by_type

## # A tibble: 2 × 7
##   election_type     n mean_turnout median_turnout sd_turnout min_turnout
##   <chr>         <int>        <dbl>          <dbl>      <dbl>       <dbl>
## 1 Midterm         460        0.420          0.417     0.0800       0.202
## 2 Presidential    459        0.579          0.581     0.0729       0.372
## # ℹ 1 more variable: max_turnout <dbl>

Bar Plot of Voter Turnout

library(ggplot2)


ggplot(election_by_type, aes(x = election_type, y = mean_turnout, fill = election_type)) +
  geom_col() +
  labs(title = "Average Voter Turnout by Election Type",
       x = "Election Type",
       y = "Mean Turnout") +
  theme_minimal()
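A bar plot of group means does not show how turnout is spread within each election type. One way to see the distributions is a boxplot. The sketch below is an assumption about how that could look, not part of the original analysis; toy_turnout is a made-up data frame standing in for turnout, and its values are invented for illustration.

```r
library(ggplot2)

# Hypothetical toy rows standing in for the turnout data frame; values are made up
toy_turnout <- data.frame(
  election_type = rep(c("Presidential", "Midterm"), each = 4),
  percent_highest_office = c(0.55, 0.60, 0.52, 0.63, 0.40, 0.44, 0.38, 0.47)
)

# One box per election type shows the median, quartiles, and range
p <- ggplot(toy_turnout,
            aes(x = election_type, y = percent_highest_office, fill = election_type)) +
  geom_boxplot() +
  labs(title = "Turnout Distribution by Election Type (toy data)",
       x = "Election Type",
       y = "Turnout (proportion)") +
  theme_minimal()

p
```

Replacing toy_turnout with turnout would draw the same plot for the real state-level data.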

Hypotheses

I am using a hypothesis test to determine whether the average voter turnout is higher in presidential election years than in midterm election years. The variable election_type is categorical, because it places each observation into one of two groups: Presidential or Midterm. The variable percent_highest_office is quantitative, because it represents numerical turnout values for each state. Since I am comparing the means of a quantitative variable across two groups, a two-sample t-test is an appropriate method. R's t.test function runs the Welch version by default, which does not assume equal variances in the two groups. The null hypothesis states that the mean turnout is the same in both election types, while the alternative hypothesis states that turnout is higher during presidential years. I am using a 5% significance level to decide whether the observed difference is large enough to conclude that presidential elections have higher turnout and that the difference is not due to random chance.

\(H_0: \mu_1 = \mu_2\)

\(H_a: \mu_1 > \mu_2\)

where:

\(\mu_1\) = mean percent_highest_office in presidential election years

\(\mu_2\) = mean percent_highest_office in midterm election years

presidential_turnout <- turnout |>
  filter(election_type == "Presidential") |>
  pull(percent_highest_office)

midterm_turnout <- turnout |>
  filter(election_type == "Midterm") |>
  pull(percent_highest_office)

t.test(presidential_turnout, midterm_turnout, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  presidential_turnout and midterm_turnout
## t = 31.4, df = 908.24, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.1502897       Inf
## sample estimates:
## mean of x mean of y 
## 0.5788905 0.4202839
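As a rough cross-check, the Welch statistic can be recomputed by hand from the group summaries. The sketch below uses the rounded standard deviations and the row counts from the election_by_type table, so it should land near the reported t = 31.4 and df = 908.24 rather than match them exactly. One row has a missing turnout value, so one of the group sizes is overstated by one. The variable names are illustrative.

```r
# Group means from the t.test output; SDs and counts from the election_by_type table
mean_pres <- 0.5788905
mean_mid  <- 0.4202839
sd_pres   <- 0.0729   # rounded
sd_mid    <- 0.0800   # rounded
n_pres    <- 459      # row count; may include the one missing value
n_mid     <- 460      # row count; may include the one missing value

# Per-group variance of the sample mean
v_pres <- sd_pres^2 / n_pres
v_mid  <- sd_mid^2 / n_mid

# Welch t statistic: difference in means over its standard error
t_stat <- (mean_pres - mean_mid) / sqrt(v_pres + v_mid)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (v_pres + v_mid)^2 /
  (v_pres^2 / (n_pres - 1) + v_mid^2 / (n_mid - 1))

# One-sided p-value for the alternative mu_1 > mu_2
p_value <- pt(t_stat, df = df_welch, lower.tail = FALSE)

c(t = t_stat, df = df_welch, p = p_value)
```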

Interpret Results

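Because the p-value is less than 2.2e-16, far below the 0.05 significance level, I reject the null hypothesis. The sample mean turnout is about 0.579 in presidential years and 0.420 in midterm years, a difference of about 0.159 (roughly 16 percentage points), and the one-sided 95% confidence interval puts the true difference at 0.150 or more. There is strong statistical evidence that voter turnout is higher in presidential elections than in midterm elections.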

Conclusion

The results show that average state-level voter turnout was higher in presidential election years than in midterm years in the United States from 1980 to 2014. This aligns with the historical pattern that presidential elections receive more media attention, national campaigning, and voter engagement than midterms. One possible explanation, which this analysis does not test, is that midterms lack competitive statewide races in many states.

Future analyses could explore whether certain states show larger differences than others, whether there are regional trends, or whether these patterns change over time.

References

OpenIntro. (n.d.). voter_count in the openintro R package.