Thesis: The Project 538 prediction models overestimate the popular
vote support for third-party candidates, resulting in an underestimation
of Republican popular vote support and creating the false impression
that the race is less competitive than it truly is.
This analysis uses popular vote data from 2016 and 2020 to predict
the popular vote per candidate for the 2024 election. The data includes
the election year, the candidate, the grade of the pollster, and the
popular vote prediction per pollster. If the Project 538 pollster data
overpredicts the third-party candidate and underpredicts the Republican
and Democratic candidates, I expect the actual popular vote to have fewer
votes for the third-party candidate and more votes for the Republican and
Democratic candidates than predicted. Proportionally, though, the
Republican candidate should gain more of the overpredicted third-party
share than the Democratic candidate. However, if the predictions for the
Republican, Democratic, and third-party candidates are accurate, the
Project 538 pollster data can be used as is.
Data Description
Project 538 provided over thirty thousand rows of data from the 2016,
2020, and 2024 presidential elections. Twenty thousand rows related to
specific states were excluded.
Key variables included:
- Candidate: The presidential candidate, labeled as
Republican, Democrat, or Other.
- Average Error: The mean of actual percentage minus
predicted percentage for each candidate.
- Predicted Percentage: The percentage of the popular
vote per candidate that was predicted by the 538 pollster data.
- Expected Percentage: The percentage of the popular
vote each candidate is expected to receive, based on the 538 pollster
data predictions.
- Actual Percentage: The percentage of the popular
vote each candidate actually received in previous elections.
- Average Percentage: The average percentage of the
popular vote per candidate predicted with and without a model.
- Frequency: The number of times a value occurs.
- Residuals: The actual percentage result per
candidate minus the predicted percentage result per candidate.
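The Residuals and Average Error definitions above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in the analysis: the function name `average_error` and every number in `rows` are hypothetical placeholders, not Project 538 data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical placeholder rows: (candidate, predicted_pct, actual_pct).
# These values are made up for illustration and are NOT the 538 data.
rows = [
    ("Republican", 42.0, 46.0),
    ("Republican", 44.0, 46.0),
    ("Democrat", 46.0, 48.0),
    ("Democrat", 47.0, 48.0),
    ("Other", 9.0, 6.0),
    ("Other", 7.0, 6.0),
]


def average_error(rows):
    """Return the mean residual (actual - predicted) per candidate."""
    residuals = defaultdict(list)
    for candidate, predicted, actual in rows:
        residuals[candidate].append(actual - predicted)
    return {candidate: mean(values) for candidate, values in residuals.items()}


errors = average_error(rows)
for candidate, error in errors.items():
    # Positive: the polls underpredicted. Negative: the polls overpredicted.
    print(f"{candidate}: {error:+.2f}")
```

With this sign convention, an underpredicted candidate has a positive average error and an overpredicted candidate has a negative one.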
Methods
Average Error in Popular Vote Predictions Per Candidate
In both 2016 and 2020, the polls overpredicted the popular vote for the
third-party, or “other”, candidate. Both the Republican and Democratic
candidates were underpredicted, with most of the overpredicted
third-party share coming from the Republican candidate.

Expected Popular Vote Results Without Using a Regression Model
The average underprediction of the Republican and Democratic candidates
and the average overprediction of the other candidates can be applied to
the predicted 2024 popular vote per candidate. This gives a simple model
of the actual 2024 popular vote per candidate.
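As a rough sketch of this adjustment, each candidate's predicted 2024 percentage is shifted by that candidate's average historical error (actual minus predicted). The function name `expected_vote` and all input values below are hypothetical placeholders, not the figures used in the analysis.

```python
def expected_vote(predicted_2024, average_error):
    """Shift each candidate's polled percentage by its historical mean error."""
    return {
        candidate: predicted_2024[candidate] + average_error[candidate]
        for candidate in predicted_2024
    }


# Hypothetical inputs in percentage points; NOT the values from the analysis.
predicted_2024 = {"Republican": 46.0, "Democrat": 47.0, "Other": 5.0}
# Average error = actual - predicted, so an overpredicted candidate is negative.
average_error = {"Republican": 3.0, "Democrat": 1.5, "Other": -4.0}

expected = expected_vote(predicted_2024, average_error)
for candidate, pct in expected.items():
    print(f"{candidate}: {pct:.1f}%")
```

Because each candidate is adjusted independently, the adjusted percentages are not guaranteed to sum to 100.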

Creating a Regression Model Based on 2016 and 2020 Data
The 2016 and 2020 popular vote data is split into a training dataset and
a testing dataset. The training data is used to fit a model that predicts
the popular vote by candidate from the predicted percentage of votes, the
grade of the pollster, and the candidate in the election.

Results of the 2016/2020 Model on the Testing Data
The values and histogram below show the results of applying the model
fitted on the training data to the testing data. The extremely high R^2
value and the small residuals suggest that the model is overfitted.
[1] "Testing RMSE: 0.0075"
[1] "R-squared for testing: 0.998"
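For reference, the two testing metrics reported above are commonly defined as in the sketch below. The function names and the example values are illustrative assumptions, and the analysis itself may have computed these quantities differently.

```python
from math import sqrt
from statistics import mean


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return sqrt(mean([(a - p) ** 2 for a, p in zip(actual, predicted)]))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical values for illustration only; NOT results from the analysis.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
print(f"RMSE: {rmse(actual, predicted):.4f}")
print(f"R-squared: {r_squared(actual, predicted):.3f}")
```

RMSE is expressed in the same units as the predicted quantity, so a value of 0.0075 is small only if the percentages were stored as proportions between 0 and 1 or as percentage points.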

Applying the Model to 2024 Predicted Data
The output below does not match the result expected under the thesis. It
shows that the other candidate will receive more votes than expected, and
that the Democratic candidate will gain proportionally more votes over
the expected value than the Republican candidate.

Limitations
The original model exhibited overfitting, leading to unrealistic
predictions. A primary concern is that the model assumes the Democratic
candidate will win the popular vote in 2024, based on their victories in
2016 and 2020. As a result, the model unjustly deducts points from the
Republican candidate’s expected outcome. To reduce this overfitting, it
would be beneficial to include data from additional elections rather than
relying solely on these two recent elections.
Discussion
Before the election, I created two predictive models, one by hand and
one using linear regression, to predict the popular vote for each party
(Republican, Democrat, Other) in the 2024 election. My model by hand,
which used Project 538 pollster predictions adjusted by the average 2016
and 2020 error, predicted that Trump would receive 49.8% of the vote,
Harris would receive 48.7% of the vote, and other candidates would
receive 0.4% of the vote. My linear regression model, which used the
Project 538 pollster predicted percentage of votes, the candidate, and
the grade of pollsters as inputs, predicted that Trump would receive
47.1% of the vote, Harris would receive 49.8% of the vote, and other
candidates would receive 4.1% of the vote. In reality, Trump received
50.2% of the popular vote, Harris received 48.1% of the popular vote,
and the other candidates received 1.7% of the popular vote.
My model by hand was a better predictor of the popular vote than my
linear regression model. It correctly predicted which candidate would
win the popular vote, with an average absolute error of roughly 0.8
percentage points across the three candidates. My linear regression
model was not a good predictor of the popular vote. It incorrectly
predicted which candidate would win the popular vote, with an average
absolute error of roughly 2.4 percentage points.
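The two average error figures can be checked from the percentages quoted in this section. The sketch below assumes the average is the mean absolute error across the three candidates, which reproduces both figures. The function name `mean_absolute_error` is a label chosen here for illustration.

```python
from statistics import mean

# Percentages quoted in the Discussion section.
actual = {"Trump": 50.2, "Harris": 48.1, "Other": 1.7}
by_hand = {"Trump": 49.8, "Harris": 48.7, "Other": 0.4}
regression = {"Trump": 47.1, "Harris": 49.8, "Other": 4.1}


def mean_absolute_error(predicted, actual):
    """Average absolute gap, in percentage points, across candidates."""
    return mean([abs(predicted[name] - actual[name]) for name in actual])


mae_by_hand = mean_absolute_error(by_hand, actual)
mae_regression = mean_absolute_error(regression, actual)

# The predicted winner is the candidate with the largest predicted share.
winner_by_hand = max(by_hand, key=by_hand.get)
winner_regression = max(regression, key=regression.get)

print(f"By hand: {mae_by_hand:.2f} points, predicted winner {winner_by_hand}")
print(f"Regression: {mae_regression:.2f} points, predicted winner {winner_regression}")
```

The by-hand errors are 0.4, 0.6, and 1.3 points (mean about 0.77), and the regression errors are 3.1, 1.7, and 2.4 points (mean 2.4).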
I am happy with my initial analysis, thesis, code, and thought
process throughout the project. Using a rough model by hand, I was able
to correctly predict who would win the popular vote within a reasonable
margin of error. However, I am not happy that my linear regression
model, which was the main focus of the project, did not support my
thesis. That error and mismatch most likely came from the overfitting of
the model that I used to predict the 2024 popular vote. Because the
Democratic candidate won the popular vote in 2016 and 2020, my overfit
linear regression model assumed that they would win it again in 2024.
Overall, I am happy with my thesis and handmade model, but frustrated
with my linear regression model.
