Thesis: The Project 538 prediction models overestimate the popular
vote support for third-party candidates, resulting in an underestimation
of Republican popular vote support and creating the false impression
that the race is less competitive than it truly is.
This analysis uses popular vote data from 2016 and 2020 to predict
the popular vote per candidate for the 2024 election. Its data includes
the election year, the candidate, the grade of the pollster, and the
popular vote prediction per pollster. If Project 538 pollster data over
predicts the third-party candidate and under predicts the republican and
democratic candidate, I expect the actual popular vote to have less
votes for the third-party candidate and more votes for republican and
democratic candidates. Proportionally, though, the republican candidate
should get more of the third-party votes than the democratic candidate.
However, if the predictions for the republican, democratic, and
third-party candidate are accurate, the Project 538 pollster data is
accurate as is.
Data Description
Project 538 provided over thirty-thousand rows of data from the 2016,
2020, 2024 presidential elections. Twenty-thousand rows related to
specific states were excluded.
Key variables included:
- Candidate: The presidential candidate termed either
by republican, democrat, or other.
- Average Error: The error of actual percentage minus
predicted percentage on average per candidate.
- Predicted Percentage: The percentage of the popular
vote per candidate that was predicted by the 538 pollster data.
- Expected Percentage: The percentage of the popular
vote per candidate that is expected based on the 538 pollster data
predictions.
- Actual Percentage: The percentage of the popular
vote per candidate that actually occurred for previous elections.
- Average Percentage: The average percentage of the
popular vote per candidate predicted with and without a model.
- Frequency: The number of times a value occurs.
- Residuals: The actual percentage result per
candidate minus the predicted percentage result per candidate.
Methods
Average Error in Popular Vote Predictions Per Candidate
In 2016 and in 2020, the popular vote was overpredicted for the
third-party, or “other”, candidate. This overprediction was mainly taken
from the republican candidate. However, both the republican and
democratic candidate were underpredicted.
── Attaching core tidyverse packages ────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’:
chisq.test, fisher.test

Expected Popular Vote Results Without Using a Regression Model
By using the average underprediction of the republican and democratic
candidates and the average overprediction of the other candidates, a
simple model can be made based on predicted 2024 popular vote per
candidate to predict the actual 2024 popular vote per candidate.
Attaching package: ‘gridExtra’
The following object is masked from ‘package:dplyr’:
combine

Creating a Regression Model Based on 2016 and 2020 Data
Using 2016 and 2020 popular vote data, a training and testing dataset
can be created to create a model that predicts the popular vote by
candidate based on the predicted percentage of votes, the grade of the
pollster, and the candidate in the election.
Loading required package: lattice
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Attaching package: ‘caret’
The following object is masked from ‘package:purrr’:
lift

Results of the 2016/2020 Model on the Testing Data
The values and histogram below show the results of the training model
being applied to the testing data. It appears that the model is
overfitted due to the extremely high R^2 value and the small
residuals.
[1] "Testing RMSE: 0.0075"
[1] "R-squared for testing: 0.998"

Applying the Model to 2024 Predicted Data
The output below does not produce the result that is wanted. The
output shows that the other candidate will receive more votes than
expected, and that proportionally, the democratic candidate will receive
more votes than expected than the republican candidate.

Limitations
The original model exhibited overfitting, leading to unrealistic
predictions. A primary concern is that the model assumes the Democratic
candidate will win the popular vote in 2024, based on their victories in
2016 and 2020. As a result, the model unjustly deducts points from the
Republican candidate’s expected outcomes. To fix this overfitting, it
would be beneficial to include data from multiple elections rather than
relying solely on these two recent elections.
Discussion
Before the election, I created a predictive model by hand and a
predictive model using linear regression that were intended to predict
the popular vote for each party (Republican, Democrat, Other) in the
2024 election. My model by hand, which utilized Project 538 pollster
predictions adjusted by average 2016 and 2020 error predicted that Trump
would receive 49.8% of the vote, Harris would receive 48.7% of the vote,
and that other candidates would receive 0.4% of the vote. My linear
regression model, which used the Project 538 pollster predicted
percentage of votes, the candidate, and the grade of pollsters as
inputs, predicted that Trump would receive 47.1% of the vote, Harris
would receive 49.8% of the vote, and that other candidates would receive
4.1% of the vote. In reality, Trump received 50.2% of the popular vote,
Harris received 48.1% of the popular vote, and the other candidates
received 1.7% of the popular vote.
My model by hand was a better predictor of the popular vote than my
linear regression model. My model by hand correctly predicted which
candidate would win the popular vote and had an average margin of error
of roughly 0.8%. My linear regression model was not a good predictor of
the popular vote. My linear regression model incorrectly predicted which
candidate would win the popular vote and had an average margin of error
of roughly 2.4%.
I am happy with my initial analysis, thesis, code, and thought
process throughout the project. Using a rough model by hand, I was able
to correctly predict who would win the popular vote within a reasonable
margin of error. However, I am not happy that my thesis did not
correlate to my linear regression model, which was the main focus of the
project. That error and lack of correlation definitely came from the
overfitting of my model that I used to predict the 2024 popular vote.
Because the Democratic candidate won the popular vote in 2016 and 2020,
my overfit linear regression model assumed that they would as well in
2024. Overall, I am happy with my thesis and handmade model, but
frustrated with my linear regression model.
References
Sources included:
---
title: "R Project 2"
author: "Josh Freeman"
date: "10/15/2024"
output: html_notebook
---

## Thesis: The Project 538 prediction models overestimate the popular vote support for third-party candidates, resulting in an underestimation of Republican popular vote support and creating the false impression that the race is less competitive than it truly is.

This analysis uses popular vote data from 2016 and 2020 to predict the popular vote per candidate for the 2024 election. Its data includes the election year, the candidate, the grade of the pollster, and the popular vote prediction per pollster. If Project 538 pollster data over predicts the third-party candidate and under predicts the republican and democratic candidate, I expect the actual popular vote to have less votes for the third-party candidate and more votes for republican and democratic candidates. Proportionally, though, the republican candidate should get more of the third-party votes than the democratic candidate. However, if the predictions for the republican, democratic, and third-party candidate are accurate, the Project 538 pollster data is accurate as is.


## Data Description

Project 538 provided over thirty-thousand rows of data from the 2016, 2020, 2024 presidential elections. Twenty-thousand rows related to specific states were excluded.

Key variables included:

- **Candidate**: The presidential candidate termed either by republican, democrat, or other.
- **Average Error**: The error of actual percentage minus predicted percentage on average per candidate.
- **Predicted Percentage**: The percentage of the popular vote per candidate that was predicted by the 538 pollster data.
- **Expected Percentage**: The percentage of the popular vote per candidate that is expected based on the 538 pollster data predictions.
- **Actual Percentage**: The percentage of the popular vote per candidate that actually occurred for previous elections.
- **Average Percentage**: The average percentage of the popular vote per candidate predicted with and without a model.
- **Frequency**: The number of times a value occurs.
- **Residuals**: The actual percentage result per candidate minus the predicted percentage result per candidate.


## Methods


### Average Error in Popular Vote Predictions Per Candidate


In 2016 and in 2020, the popular vote was overpredicted for the third-party, or "other", candidate. This overprediction was mainly taken from the republican candidate. However, both the republican and democratic candidate were underpredicted. 



```{r echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(lubridate)
library(janitor)
library(dplyr)
library(ggplot2)

grade_mapping <- c(
  "C-" = 0.5, "C" = 0.8, "C+" = 1.1,
  "B-" = 1.4, "B" = 1.75, "B+" = 2.1,
  "A-" = 2.4, "A" = 2.7, "A+" = 3.0
)

t_president_polls_US_2016 <- read_csv('president_general_polls_2016 (1).csv', show_col_types = FALSE) %>% 
  janitor::clean_names() %>% 
     select(cycle, state, poll_id, grade, rawpoll_clinton, rawpoll_trump, rawpoll_johnson) %>% 
    mutate(democrat = as.numeric(gsub("%", "", rawpoll_clinton)) / 100,
           republican = as.numeric(gsub("%", "", rawpoll_trump)) / 100,
           other = as.numeric(gsub("%", "", rawpoll_johnson)) / 100) %>% 
    mutate(year = cycle,
         poll = poll_id) %>% 
  filter(state == "U.S.") %>%
 drop_na(other, grade) %>% 
  select(year, state, poll, grade, democrat, republican, other) %>% 
  pivot_longer(cols = c(democrat, republican, other), names_to = "candidate", values_to = "predicted_percentage") %>%
    mutate(actual_percentage = case_when(
    candidate == "democrat" ~ 0.485,
    candidate == "republican" ~ 0.464,
    candidate == "other" ~ 0.051,
    TRUE ~ NA_real_
  )) %>% 
    mutate(error = actual_percentage - predicted_percentage) %>% 
  mutate(grade = ifelse(year == 2016, recode(grade, !!!grade_mapping), grade))


t_president_polls_US_2020 <- read_csv('president_polls_historical.csv', show_col_types = FALSE) %>% 
  janitor::clean_names() %>% 
  select(cycle, state_name, poll_id, candidate_id, numeric_grade, candidate_name, percentage) %>% 
    mutate(percentage = as.numeric(gsub("%", "", percentage)) / 100,
           state = state_name,
           year = cycle,
           predicted_percentage = percentage) %>% 
    select(-state_name, -cycle) %>% 
  filter(is.na(state)) %>% 
      replace_na(list(state = "U.S.")) %>% 
  mutate(candidate = case_when(
    candidate_name == "Joe Biden" ~ "democrat",
    candidate_name == "Donald Trump" ~ "republican",
    candidate_name == "Jo Jorgensen" ~ "other"),
    poll = poll_id,
    grade = numeric_grade
  ) %>%
  filter(candidate %in% c("democrat", "republican", "other")) %>% 
  select(year, state, poll, grade, candidate, predicted_percentage) %>% 
      mutate(actual_percentage = case_when(
    candidate == "democrat" ~ 0.513,
    candidate == "republican" ~ 0.469,
    candidate == "other" ~ 0.018,    
    TRUE ~ NA_real_
  )) %>% 
    mutate(error = actual_percentage - predicted_percentage) %>% 
  filter(!is.na(grade))


average_error_2016 <- t_president_polls_US_2016 %>% 
    group_by(candidate) %>%  
  summarise(average_error = mean(error, na.rm = TRUE)) %>% 
  mutate(year = "2016")


average_error_2020 <- t_president_polls_US_2020 %>% 
    group_by(candidate) %>%  
  summarise(average_error = mean(error, na.rm = TRUE)) %>% 
  mutate(year = "2020")


average_error <- bind_rows(average_error_2016, average_error_2020)

  
ggplot(average_error, aes(x = candidate, y = average_error, fill = candidate)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ year) +  
  labs(title = "Average Error by Candidate for 2016 and 2020",
       x = "Candidate",
       y = "Average Error") +
  scale_fill_manual(values = c("democrat" = "blue", 
                                "republican" = "red",
                               "other" = "green")) +
  theme_minimal() +
  theme(legend.position = "none") +
  geom_hline(yintercept = 0, color = "black", size = 1)

```


### Expected Popular Vote Results Without Using a Regression Model


By using the average underprediction of the republican and democratic candidates and the average overprediction of the other candidates, a simple model can be made based on predicted 2024 popular vote per candidate to predict the actual 2024 popular vote per candidate.  


```{r echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(lubridate)
library(janitor)
library(dplyr)
library(ggplot2)
library(gridExtra)


t_president_polls_US_2024 <- read_csv('president_polls.csv', show_col_types = FALSE) %>% 
  janitor::clean_names() %>% 
  select(cycle, state_name, poll_id, numeric_grade, candidate_name, percentage) %>% 
    mutate(predicted_percentage = as.numeric(gsub("%", "", percentage)) / 100,
           year = cycle,
           state = state_name,
           poll = poll_id, 
           grade = numeric_grade, 
           candidate = candidate_name) %>% 
  select(year, state, poll, grade, candidate, predicted_percentage) %>% 
  filter(is.na(state)) %>%
  mutate(state = "U.S.") %>% 
  filter(candidate %in% c("Donald Trump", "Kamala Harris", "Cornel West", "Jill Stein", "Chase Oliver")) %>% 
      mutate(candidate = case_when(
    candidate == "Kamala Harris" ~ "democrat",
    candidate == "Donald Trump" ~ "republican",
    candidate %in% c("Chase Oliver", "Cornel West", "Jill Stein") ~ "other",
    TRUE ~ candidate
  )) %>% 
   filter(!is.na(grade))


average_error_per_candidate <- average_error %>%
  select(-year) %>% 
  group_by(candidate) %>% 
  summarize(average_error = mean(average_error, na.rm = TRUE))

average_percentage_2024 <- t_president_polls_US_2024 %>%
  group_by(candidate) %>%
  summarize(average_predicted_percentage = mean(predicted_percentage, na.rm = TRUE))


t_president_polls_US_2024_no_model <- average_percentage_2024 %>% 
    left_join(average_error_per_candidate, by = "candidate") %>%
  mutate(expected_percentage = average_predicted_percentage + average_error)

max_y <- max(t_president_polls_US_2024_no_model$average_predicted_percentage, 
              t_president_polls_US_2024_no_model$expected_percentage)

plot1 <- ggplot(t_president_polls_US_2024_no_model, aes(x = candidate, y = average_predicted_percentage, fill = candidate)) +
  geom_bar(stat = "identity", show.legend = FALSE) +  
  labs(
    title = "2024 Popular Vote (Before Adjustment)", 
    x = "Candidate",
    y = "Predicted Percentage"
  ) +
  scale_fill_manual(values = c("republican" = "red", "democrat" = "blue", "other" = "green")) +
  ylim(0, max_y) +  # Set the same y-axis limit
  theme_minimal()

plot2 <- ggplot(t_president_polls_US_2024_no_model, aes(x = candidate, y = expected_percentage, fill = candidate)) +
  geom_bar(stat = "identity", show.legend = FALSE) +  
  labs(
    title = "2024 Popular Vote (Adjusted for Error)",  
    x = "Candidate",
    y = "Expected Percentage"
  ) +
  scale_fill_manual(values = c("republican" = "red", "democrat" = "blue", "other" = "green")) +
  ylim(0, max_y) +  # Set the same y-axis limit
  theme_minimal()

grid.arrange(plot1, plot2, ncol = 2)
```


### Creating a Regression Model Based on 2016 and 2020 Data


Using 2016 and 2020 popular vote data, a training and testing dataset can be created to create a model that predicts the popular vote by candidate based on the predicted percentage of votes, the grade of the pollster, and the candidate in the election. 


```{r echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(dplyr)
library(ggcorrplot)
library(ggplot2)
library(caret)


t_president_polls_US_all <- rbind(t_president_polls_US_2016, t_president_polls_US_2020)

set.seed(123)

train_indices <- createDataPartition(t_president_polls_US_all$actual_percentage, p = 0.5, list = FALSE)
t_training_set <- t_president_polls_US_all[train_indices, ]
t_testing_set <- t_president_polls_US_all[-train_indices, ]

model <- lm(actual_percentage ~ predicted_percentage + candidate + grade, data = t_training_set)

t_testing_set <- t_testing_set %>% 
  mutate(predicted = predict(model, newdata = t_testing_set),
         residuals = predicted - actual_percentage)

regression_plot <- ggplot(t_testing_set, aes(x = predicted, y = actual_percentage)) +
  geom_point(alpha = 0.5, color = "blue") + 
  geom_smooth(method = "lm", color = "red", se = FALSE) + 
  labs(
    title = "Regression Model Predictions vs Actual Results for 2016 & 2020 Testing Data",
    x = "Predicted Percentage",
    y = "Actual Percentage"
  ) +
  theme_minimal()

suppressMessages(
  print(regression_plot)
)

```


### Results of the 2016/2020 Model on the Testing Data


The values and histogram below show the results of the training model being applied to the testing data. It appears that the model is overfitted due to the extremely high R^2 value and the small residuals. 


```{r echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(dplyr)
library(ggcorrplot)
library(ggplot2)
library(caret)

rmse_test <- sqrt(mean(t_testing_set$residuals^2))
rmse_test_rounded <- round(rmse_test, 4)
print(paste("Testing RMSE:", rmse_test_rounded))

residual_sum_of_squares_test <- sum(t_testing_set$residuals ^ 2)
total_variation_in_model_test <- t_testing_set$actual_percentage - mean(t_testing_set$actual_percentage)
total_sum_of_squares_test <- sum(total_variation_in_model_test ^ 2)
r2_test <- 1 - (residual_sum_of_squares_test / total_sum_of_squares_test)
r2_test_rounded <- round(r2_test, 4)
print(paste("R-squared for testing:", r2_test_rounded))

hist(t_testing_set$residuals, main = "Histogram of Testing Set Residuals for Percentage", xlab = "Residuals")


```

### Applying the Model to 2024 Predicted Data

The output below does not produce the result that is wanted. The output shows that the other candidate will receive more votes than expected, and that proportionally, the democratic candidate will receive more votes than expected than the republican candidate. 

```{r echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(dplyr)
library(ggcorrplot)
library(ggplot2)

predictions_2024 <- t_president_polls_US_2024 %>% 
  mutate(predicted = predict(model, newdata = t_president_polls_US_2024))


average_results_2024 <- predictions_2024 %>%
  group_by(candidate) %>%
  summarise(
    avg_predicted_percentage = mean(predicted_percentage, na.rm = TRUE),
    avg_predicted_by_model = mean(predicted, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = c(avg_predicted_percentage, avg_predicted_by_model),
               names_to = "prediction_type",
               values_to = "average_percentage") %>%
  # Set the levels for prediction_type in the desired order
  mutate(prediction_type = factor(prediction_type, levels = c("avg_predicted_percentage", "avg_predicted_by_model")))

# Create the combined plot with fixed y-axis limits
ggplot(data = average_results_2024, aes(x = candidate, y = average_percentage, fill = candidate)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  facet_wrap(~ prediction_type, scales = "fixed", labeller = labeller(prediction_type = c(
    avg_predicted_percentage = "Predicted Beforehand",
    avg_predicted_by_model = "Predicted with Model"
  ))) +
  labs(
    title = "Average Predicted Percentages by Candidate for 2024",
    x = "Candidate",
    y = "Average Percentage",
    fill = "Candidate"
  ) +
  scale_fill_manual(values = c("democrat" = "blue", "republican" = "red", "other" = "green")) +
  theme_minimal() +
  theme(legend.position = "none")
```

## Limitations

The original model exhibited overfitting, leading to unrealistic predictions. A primary concern is that the model assumes the Democratic candidate will win the popular vote in 2024, based on their victories in 2016 and 2020. As a result, the model unjustly deducts points from the Republican candidate's expected outcomes. To fix this overfitting, it would be beneficial to include data from multiple elections rather than relying solely on these two recent elections.

## Discussion

Before the election, I created a predictive model by hand and a predictive model using linear regression that were intended to predict the popular vote for each party (Republican, Democrat, Other) in the 2024 election. My model by hand, which utilized Project 538 pollster predictions adjusted by average 2016 and 2020 error predicted that Trump would receive 49.8% of the vote, Harris would receive 48.7% of the vote, and that other candidates would receive 0.4% of the vote. My linear regression model, which used the Project 538 pollster predicted percentage of votes, the candidate, and the grade of pollsters as inputs, predicted that Trump would receive 47.1% of the vote, Harris would receive 49.8% of the vote, and that other candidates would receive 4.1% of the vote. In reality, Trump received 50.2% of the popular vote, Harris received 48.1% of the popular vote, and the other candidates received 1.7% of the popular vote. 

My model by hand was a better predictor of the popular vote than my linear regression model. My model by hand correctly predicted which candidate would win the popular vote and had an average margin of error of roughly 0.8%. My linear regression model was not a good predictor of the popular vote. My linear regression model incorrectly predicted which candidate would win the popular vote and had an average margin of error of roughly 2.4%. 

I am happy with my initial analysis, thesis, code, and thought process throughout the project. Using a rough model by hand, I was able to correctly predict who would win the popular vote within a reasonable margin of error. However, I am not happy that my thesis did not correlate to my linear regression model, which was the main focus of the project. That error and lack of correlation definitely came from the overfitting of my model that I used to predict the 2024 popular vote. Because the Democratic candidate won the popular vote in 2016 and 2020, my overfit linear regression model assumed that they would as well in 2024. Overall, I am happy with my thesis and handmade model, but frustrated with my linear regression model. 

## References

Sources included:

- Project 538 2016 presidential data:
https://projects.fivethirtyeight.com/2016-election-forecast/
- Project 538 2020 and 2024 presidential data: https://projects.fivethirtyeight.com/polls/president-general/2024/
- ChatGPT: https://openai.com/chatgpt/overview/
ChatGPT helped me throughout the project and on each section. I would write up the code that I wanted, and then ask ChatGPT what I should do if any issues arose and asked ChatGPT for recommendations on how to do things that I wasn't sure of how to do. I also asked Chat GPT for advice on how to work and achieve my output efficiently. 


