Data Description

Project 538 provided over thirty-thousand rows of data from the 2016, 2020, 2024 presidential elections. Twenty-thousand rows related to specific states were excluded.

Key variables included:

Methods

Creating a Regression Model Based on 2016 and 2020 Data

Using 2016 and 2020 popular vote data, a training and testing dataset can be created to create a model that predicts the popular vote by candidate based on the predicted percentage of votes, the grade of the pollster, and the candidate in the election.

Loading required package: lattice
Registered S3 method overwritten by 'data.table':
  method           from
  print.data.table     

Attaching package: ‘caret’

The following object is masked from ‘package:purrr’:

    lift

Results of the 2016/2020 Model on the Testing Data

The values and histogram below show the results of the training model being applied to the testing data. It appears that the model is overfitted due to the extremely high R^2 value and the small residuals.

[1] "Testing RMSE: 0.0075"
[1] "R-squared for testing: 0.998"

Applying the Model to 2024 Predicted Data

The output below does not produce the result that is wanted. The output shows that the other candidate will receive more votes than expected, and that proportionally, the democratic candidate will receive more votes than expected than the republican candidate.

Limitations

The original model exhibited overfitting, leading to unrealistic predictions. A primary concern is that the model assumes the Democratic candidate will win the popular vote in 2024, based on their victories in 2016 and 2020. As a result, the model unjustly deducts points from the Republican candidate’s expected outcomes. To fix this overfitting, it would be beneficial to include data from multiple elections rather than relying solely on these two recent elections.

Discussion

Before the election, I created a predictive model by hand and a predictive model using linear regression that were intended to predict the popular vote for each party (Republican, Democrat, Other) in the 2024 election. My model by hand, which utilized Project 538 pollster predictions adjusted by average 2016 and 2020 error predicted that Trump would receive 49.8% of the vote, Harris would receive 48.7% of the vote, and that other candidates would receive 0.4% of the vote. My linear regression model, which used the Project 538 pollster predicted percentage of votes, the candidate, and the grade of pollsters as inputs, predicted that Trump would receive 47.1% of the vote, Harris would receive 49.8% of the vote, and that other candidates would receive 4.1% of the vote. In reality, Trump received 50.2% of the popular vote, Harris received 48.1% of the popular vote, and the other candidates received 1.7% of the popular vote.

My model by hand was a better predictor of the popular vote than my linear regression model. My model by hand correctly predicted which candidate would win the popular vote and had an average margin of error of roughly 0.8%. My linear regression model was not a good predictor of the popular vote. My linear regression model incorrectly predicted which candidate would win the popular vote and had an average margin of error of roughly 2.4%.

I am happy with my initial analysis, thesis, code, and thought process throughout the project. Using a rough model by hand, I was able to correctly predict who would win the popular vote within a reasonable margin of error. However, I am not happy that my thesis did not correlate to my linear regression model, which was the main focus of the project. That error and lack of correlation definitely came from the overfitting of my model that I used to predict the 2024 popular vote. Because the Democratic candidate won the popular vote in 2016 and 2020, my overfit linear regression model assumed that they would as well in 2024. Overall, I am happy with my thesis and handmade model, but frustrated with my linear regression model.

References

Sources included:

---
title: "R Project 2"
author: "Josh Freeman"
date: "10/15/2024"
output: html_notebook
---

## Thesis: The Project 538 prediction models overestimate the popular vote support for third-party candidates, resulting in an underestimation of Republican popular vote support and creating the false impression that the race is less competitive than it truly is.

This analysis uses popular vote data from 2016 and 2020 to predict the popular vote per candidate for the 2024 election. Its data includes the election year, the candidate, the grade of the pollster, and the popular vote prediction per pollster. If Project 538 pollster data over predicts the third-party candidate and under predicts the republican and democratic candidate, I expect the actual popular vote to have less votes for the third-party candidate and more votes for republican and democratic candidates. Proportionally, though, the republican candidate should get more of the third-party votes than the democratic candidate. However, if the predictions for the republican, democratic, and third-party candidate are accurate, the Project 538 pollster data is accurate as is.


## Data Description

Project 538 provided over thirty-thousand rows of data from the 2016, 2020, 2024 presidential elections. Twenty-thousand rows related to specific states were excluded.

Key variables included:

- **Candidate**: The presidential candidate termed either by republican, democrat, or other.
- **Average Error**: The error of actual percentage minus predicted percentage on average per candidate.
- **Predicted Percentage**: The percentage of the popular vote per candidate that was predicted by the 538 pollster data.
- **Expected Percentage**: The percentage of the popular vote per candidate that is expected based on the 538 pollster data predictions.
- **Actual Percentage**: The percentage of the popular vote per candidate that actually occurred for previous elections.
- **Average Percentage**: The average percentage of the popular vote per candidate predicted with and without a model.
- **Frequency**: The number of times a value occurs.
- **Residuals**: The actual percentage result per candidate minus the predicted percentage result per candidate.


## Methods


### Average Error in Popular Vote Predictions Per Candidate


In 2016 and in 2020, the popular vote was overpredicted for the third-party, or "other", candidate. This overprediction was mainly taken from the republican candidate. However, both the republican and democratic candidate were underpredicted. 



```{r echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(lubridate)
library(janitor)
library(dplyr)
library(ggplot2)

grade_mapping <- c(
  "C-" = 0.5, "C" = 0.8, "C+" = 1.1,
  "B-" = 1.4, "B" = 1.75, "B+" = 2.1,
  "A-" = 2.4, "A" = 2.7, "A+" = 3.0
)

t_president_polls_US_2016 <- read_csv('president_general_polls_2016 (1).csv', show_col_types = FALSE) %>% 
  janitor::clean_names() %>% 
     select(cycle, state, poll_id, grade, rawpoll_clinton, rawpoll_trump, rawpoll_johnson) %>% 
    mutate(democrat = as.numeric(gsub("%", "", rawpoll_clinton)) / 100,
           republican = as.numeric(gsub("%", "", rawpoll_trump)) / 100,
           other = as.numeric(gsub("%", "", rawpoll_johnson)) / 100) %>% 
    mutate(year = cycle,
         poll = poll_id) %>% 
  filter(state == "U.S.") %>%
 drop_na(other, grade) %>% 
  select(year, state, poll, grade, democrat, republican, other) %>% 
  pivot_longer(cols = c(democrat, republican, other), names_to = "candidate", values_to = "predicted_percentage") %>%
    mutate(actual_percentage = case_when(
    candidate == "democrat" ~ 0.485,
    candidate == "republican" ~ 0.464,
    candidate == "other" ~ 0.051,
    TRUE ~ NA_real_
  )) %>% 
    mutate(error = actual_percentage - predicted_percentage) %>% 
  mutate(grade = ifelse(year == 2016, recode(grade, !!!grade_mapping), grade))


t_president_polls_US_2020 <- read_csv('president_polls_historical.csv', show_col_types = FALSE) %>% 
  janitor::clean_names() %>% 
  select(cycle, state_name, poll_id, candidate_id, numeric_grade, candidate_name, percentage) %>% 
    mutate(percentage = as.numeric(gsub("%", "", percentage)) / 100,
           state = state_name,
           year = cycle,
           predicted_percentage = percentage) %>% 
    select(-state_name, -cycle) %>% 
  filter(is.na(state)) %>% 
      replace_na(list(state = "U.S.")) %>% 
  mutate(candidate = case_when(
    candidate_name == "Joe Biden" ~ "democrat",
    candidate_name == "Donald Trump" ~ "republican",
    candidate_name == "Jo Jorgensen" ~ "other"),
    poll = poll_id,
    grade = numeric_grade
  ) %>%
  filter(candidate %in% c("democrat", "republican", "other")) %>% 
  select(year, state, poll, grade, candidate, predicted_percentage) %>% 
      mutate(actual_percentage = case_when(
    candidate == "democrat" ~ 0.513,
    candidate == "republican" ~ 0.469,
    candidate == "other" ~ 0.018,    
    TRUE ~ NA_real_
  )) %>% 
    mutate(error = actual_percentage - predicted_percentage) %>% 
  filter(!is.na(grade))


average_error_2016 <- t_president_polls_US_2016 %>% 
    group_by(candidate) %>%  
  summarise(average_error = mean(error, na.rm = TRUE)) %>% 
  mutate(year = "2016")


average_error_2020 <- t_president_polls_US_2020 %>% 
    group_by(candidate) %>%  
  summarise(average_error = mean(error, na.rm = TRUE)) %>% 
  mutate(year = "2020")


average_error <- bind_rows(average_error_2016, average_error_2020)

  
ggplot(average_error, aes(x = candidate, y = average_error, fill = candidate)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ year) +  
  labs(title = "Average Error by Candidate for 2016 and 2020",
       x = "Candidate",
       y = "Average Error") +
  scale_fill_manual(values = c("democrat" = "blue", 
                                "republican" = "red",
                               "other" = "green")) +
  theme_minimal() +
  theme(legend.position = "none") +
  geom_hline(yintercept = 0, color = "black", size = 1)

```


### Expected Popular Vote Results Without Using a Regression Model


By using the average underprediction of the republican and democratic candidates and the average overprediction of the other candidates, a simple model can be made based on predicted 2024 popular vote per candidate to predict the actual 2024 popular vote per candidate.  


```{r echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(lubridate)
library(janitor)
library(dplyr)
library(ggplot2)
library(gridExtra)


t_president_polls_US_2024 <- read_csv('president_polls.csv', show_col_types = FALSE) %>% 
  janitor::clean_names() %>% 
  select(cycle, state_name, poll_id, numeric_grade, candidate_name, percentage) %>% 
    mutate(predicted_percentage = as.numeric(gsub("%", "", percentage)) / 100,
           year = cycle,
           state = state_name,
           poll = poll_id, 
           grade = numeric_grade, 
           candidate = candidate_name) %>% 
  select(year, state, poll, grade, candidate, predicted_percentage) %>% 
  filter(is.na(state)) %>%
  mutate(state = "U.S.") %>% 
  filter(candidate %in% c("Donald Trump", "Kamala Harris", "Cornel West", "Jill Stein", "Chase Oliver")) %>% 
      mutate(candidate = case_when(
    candidate == "Kamala Harris" ~ "democrat",
    candidate == "Donald Trump" ~ "republican",
    candidate %in% c("Chase Oliver", "Cornel West", "Jill Stein") ~ "other",
    TRUE ~ candidate
  )) %>% 
   filter(!is.na(grade))


average_error_per_candidate <- average_error %>%
  select(-year) %>% 
  group_by(candidate) %>% 
  summarize(average_error = mean(average_error, na.rm = TRUE))

average_percentage_2024 <- t_president_polls_US_2024 %>%
  group_by(candidate) %>%
  summarize(average_predicted_percentage = mean(predicted_percentage, na.rm = TRUE))


t_president_polls_US_2024_no_model <- average_percentage_2024 %>% 
    left_join(average_error_per_candidate, by = "candidate") %>%
  mutate(expected_percentage = average_predicted_percentage + average_error)

max_y <- max(t_president_polls_US_2024_no_model$average_predicted_percentage, 
              t_president_polls_US_2024_no_model$expected_percentage)

plot1 <- ggplot(t_president_polls_US_2024_no_model, aes(x = candidate, y = average_predicted_percentage, fill = candidate)) +
  geom_bar(stat = "identity", show.legend = FALSE) +  
  labs(
    title = "2024 Popular Vote (Before Adjustment)", 
    x = "Candidate",
    y = "Predicted Percentage"
  ) +
  scale_fill_manual(values = c("republican" = "red", "democrat" = "blue", "other" = "green")) +
  ylim(0, max_y) +  # Set the same y-axis limit
  theme_minimal()

plot2 <- ggplot(t_president_polls_US_2024_no_model, aes(x = candidate, y = expected_percentage, fill = candidate)) +
  geom_bar(stat = "identity", show.legend = FALSE) +  
  labs(
    title = "2024 Popular Vote (Adjusted for Error)",  
    x = "Candidate",
    y = "Expected Percentage"
  ) +
  scale_fill_manual(values = c("republican" = "red", "democrat" = "blue", "other" = "green")) +
  ylim(0, max_y) +  # Set the same y-axis limit
  theme_minimal()

grid.arrange(plot1, plot2, ncol = 2)
```


### Creating a Regression Model Based on 2016 and 2020 Data


Using 2016 and 2020 popular vote data, a training and testing dataset can be created to create a model that predicts the popular vote by candidate based on the predicted percentage of votes, the grade of the pollster, and the candidate in the election. 


```{r echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(dplyr)
library(ggcorrplot)
library(ggplot2)
library(caret)


t_president_polls_US_all <- rbind(t_president_polls_US_2016, t_president_polls_US_2020)

set.seed(123)

train_indices <- createDataPartition(t_president_polls_US_all$actual_percentage, p = 0.5, list = FALSE)
t_training_set <- t_president_polls_US_all[train_indices, ]
t_testing_set <- t_president_polls_US_all[-train_indices, ]

model <- lm(actual_percentage ~ predicted_percentage + candidate + grade, data = t_training_set)

t_testing_set <- t_testing_set %>% 
  mutate(predicted = predict(model, newdata = t_testing_set),
         residuals = predicted - actual_percentage)

regression_plot <- ggplot(t_testing_set, aes(x = predicted, y = actual_percentage)) +
  geom_point(alpha = 0.5, color = "blue") + 
  geom_smooth(method = "lm", color = "red", se = FALSE) + 
  labs(
    title = "Regression Model Predictions vs Actual Results for 2016 & 2020 Testing Data",
    x = "Predicted Percentage",
    y = "Actual Percentage"
  ) +
  theme_minimal()

suppressMessages(
  print(regression_plot)
)

```


### Results of the 2016/2020 Model on the Testing Data


The values and histogram below show the results of the training model being applied to the testing data. It appears that the model is overfitted due to the extremely high R^2 value and the small residuals. 


```{r echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(dplyr)
library(ggcorrplot)
library(ggplot2)
library(caret)

rmse_test <- sqrt(mean(t_testing_set$residuals^2))
rmse_test_rounded <- round(rmse_test, 4)
print(paste("Testing RMSE:", rmse_test_rounded))

residual_sum_of_squares_test <- sum(t_testing_set$residuals ^ 2)
total_variation_in_model_test <- t_testing_set$actual_percentage - mean(t_testing_set$actual_percentage)
total_sum_of_squares_test <- sum(total_variation_in_model_test ^ 2)
r2_test <- 1 - (residual_sum_of_squares_test / total_sum_of_squares_test)
r2_test_rounded <- round(r2_test, 4)
print(paste("R-squared for testing:", r2_test_rounded))

hist(t_testing_set$residuals, main = "Histogram of Testing Set Residuals for Percentage", xlab = "Residuals")


```

### Applying the Model to 2024 Predicted Data

The output below does not produce the result that is wanted. The output shows that the other candidate will receive more votes than expected, and that proportionally, the democratic candidate will receive more votes than expected than the republican candidate. 

```{r echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(dplyr)
library(ggcorrplot)
library(ggplot2)

predictions_2024 <- t_president_polls_US_2024 %>% 
  mutate(predicted = predict(model, newdata = t_president_polls_US_2024))


average_results_2024 <- predictions_2024 %>%
  group_by(candidate) %>%
  summarise(
    avg_predicted_percentage = mean(predicted_percentage, na.rm = TRUE),
    avg_predicted_by_model = mean(predicted, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = c(avg_predicted_percentage, avg_predicted_by_model),
               names_to = "prediction_type",
               values_to = "average_percentage") %>%
  # Set the levels for prediction_type in the desired order
  mutate(prediction_type = factor(prediction_type, levels = c("avg_predicted_percentage", "avg_predicted_by_model")))

# Create the combined plot with fixed y-axis limits
ggplot(data = average_results_2024, aes(x = candidate, y = average_percentage, fill = candidate)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  facet_wrap(~ prediction_type, scales = "fixed", labeller = labeller(prediction_type = c(
    avg_predicted_percentage = "Predicted Beforehand",
    avg_predicted_by_model = "Predicted with Model"
  ))) +
  labs(
    title = "Average Predicted Percentages by Candidate for 2024",
    x = "Candidate",
    y = "Average Percentage",
    fill = "Candidate"
  ) +
  scale_fill_manual(values = c("democrat" = "blue", "republican" = "red", "other" = "green")) +
  theme_minimal() +
  theme(legend.position = "none")
```

## Limitations

The original model exhibited overfitting, leading to unrealistic predictions. A primary concern is that the model assumes the Democratic candidate will win the popular vote in 2024, based on their victories in 2016 and 2020. As a result, the model unjustly deducts points from the Republican candidate's expected outcomes. To fix this overfitting, it would be beneficial to include data from multiple elections rather than relying solely on these two recent elections.

## Discussion

Before the election, I created a predictive model by hand and a predictive model using linear regression that were intended to predict the popular vote for each party (Republican, Democrat, Other) in the 2024 election. My model by hand, which utilized Project 538 pollster predictions adjusted by average 2016 and 2020 error predicted that Trump would receive 49.8% of the vote, Harris would receive 48.7% of the vote, and that other candidates would receive 0.4% of the vote. My linear regression model, which used the Project 538 pollster predicted percentage of votes, the candidate, and the grade of pollsters as inputs, predicted that Trump would receive 47.1% of the vote, Harris would receive 49.8% of the vote, and that other candidates would receive 4.1% of the vote. In reality, Trump received 50.2% of the popular vote, Harris received 48.1% of the popular vote, and the other candidates received 1.7% of the popular vote. 

My model by hand was a better predictor of the popular vote than my linear regression model. My model by hand correctly predicted which candidate would win the popular vote and had an average margin of error of roughly 0.8%. My linear regression model was not a good predictor of the popular vote. My linear regression model incorrectly predicted which candidate would win the popular vote and had an average margin of error of roughly 2.4%. 

I am happy with my initial analysis, thesis, code, and thought process throughout the project. Using a rough model by hand, I was able to correctly predict who would win the popular vote within a reasonable margin of error. However, I am not happy that my thesis did not correlate to my linear regression model, which was the main focus of the project. That error and lack of correlation definitely came from the overfitting of my model that I used to predict the 2024 popular vote. Because the Democratic candidate won the popular vote in 2016 and 2020, my overfit linear regression model assumed that they would as well in 2024. Overall, I am happy with my thesis and handmade model, but frustrated with my linear regression model. 

## References

Sources included:

- Project 538 2016 presidential data:
https://projects.fivethirtyeight.com/2016-election-forecast/
- Project 538 2020 and 2024 presidential data: https://projects.fivethirtyeight.com/polls/president-general/2024/
- ChatGPT: https://openai.com/chatgpt/overview/
ChatGPT helped me throughout the project and on each section. I would write up the code that I wanted, and then ask ChatGPT what I should do if any issues arose and asked ChatGPT for recommendations on how to do things that I wasn't sure of how to do. I also asked Chat GPT for advice on how to work and achieve my output efficiently. 


