The aim of this analysis is to predict how Pennsylvania will vote in the 2024 presidential election. Using county-level demographic data and election results from 2020, I built a linear regression model that predicts the difference between the Republican and Democratic vote shares, with an adjusted R-squared of 0.9222. When applied to statewide demographic data from 2024, the model predicts that Republicans will win Pennsylvania by 8.628 percentage points.
The data comes from two CSV files. The first contains county-level election results from 2020 and was downloaded from the electionreturns.pa.gov website. The data was originally given as raw vote totals for each county, but I converted the totals to percentages so that counties with vastly different populations can be compared directly.
I created the second CSV file manually from census data on rural.pa.gov. I chose the demographics that I wanted to explore, exported each field into an Excel file, and then converted that file to CSV. I also included a row for current (2024) statewide data.
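A minimal sketch of the count-to-share conversion described above. The column names and vote totals are hypothetical, and dividing by all votes cast (rather than by the two-party total) is an assumption, since the report does not state which denominator was used.

```r
# Hypothetical raw county totals; the numbers are made up for illustration.
votes <- data.frame(
  county      = c("A", "B", "C"),
  rep_votes   = c(12000, 300000, 4500),
  dem_votes   = c(8000, 500000, 1500),
  other_votes = c(200, 10000, 100)
)

# Assumption: shares are taken over all votes cast in the county.
total <- votes$rep_votes + votes$dem_votes + votes$other_votes
votes$rep_share <- votes$rep_votes / total
votes$dem_share <- votes$dem_votes / total

# Positive values mean a Republican lead, negative values a Democratic lead.
votes$rep_advantage <- votes$rep_share - votes$dem_share

print(votes[, c("county", "rep_share", "dem_share", "rep_advantage")])
```

Working in shares puts a county with 6,000 voters and one with 800,000 voters on the same scale, which is what makes them comparable as rows in a regression.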
Key Variables used in my analysis:
I began by looking at the correlations between my variables. Using the ggcorrplot package, I found that the Republican advantage had a very strong positive correlation with ‘white’ and strong negative correlations with ‘bachelors’, ‘foreign’, and ‘population’.
After experimenting with different combinations of variables, I settled on a set of five that yielded an adjusted R-squared of 0.9222 (multiple R-squared 0.9282) and an overall p-value below 2.2e-16. The variables were ‘white’, ‘bachelors’, ‘pop_change2010_2020’, ‘median_home_value’, and ‘foreign’. Notably, I decided against using the ‘population’ variable despite its strong correlation, as there were too many extreme outliers.
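The correlation step can be sketched in base R as follows. The data here is simulated purely so the example runs on its own; only the column names follow the report, and the simulated relationships are an assumption chosen to mimic the signs described above.

```r
set.seed(1)
n <- 67  # Pennsylvania has 67 counties

# Simulated stand-in for the joined county data (not the real values).
white     <- runif(n, 0.50, 0.98)
bachelors <- runif(n, 0.10, 0.50)
rep_advantage <- 0.8 * white - 1.5 * bachelors + rnorm(n, sd = 0.05)
demo <- data.frame(rep_advantage, white, bachelors)

# Pairwise-complete correlations tolerate the occasional missing value.
cor_mat <- cor(demo, use = "pairwise.complete.obs")
print(round(cor_mat["rep_advantage", ], 2))

# With the ggcorrplot package installed, the same matrix could be drawn with
# ggcorrplot::ggcorrplot(cor_mat)
```

Reading one row of the matrix, as in the `print` call, is a quick way to rank candidate predictors by their correlation with the response before trying them in a model.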
##
## Call:
## lm(formula = rep_advantage ~ white + bachelors + pop_change2010_2020 +
## median_home_value + foreign, data = t_joined)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.208309 -0.047430 -0.004968  0.049161  0.155354
##
## Coefficients:
##                          Estimate    Std. Error t value           Pr(>|t|)
## (Intercept)         -0.2406910146  0.1950913558  -1.234            0.22211
## white                1.0554221224  0.2212434859   4.770 0.0000121605794162 ***
## bachelors           -1.8868336714  0.1922469591  -9.815 0.0000000000000432 ***
## pop_change2010_2020  0.7600016431  0.2871439654   2.647            0.01036 *
## median_home_value    0.0000010043  0.0000003252   3.089            0.00304 **
## foreign             -2.1450269587  0.9547055220  -2.247            0.02834 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07369 on 60 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9282, Adjusted R-squared: 0.9222
## F-statistic: 155.1 on 5 and 60 DF, p-value: < 2.2e-16
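The summary reports two R-squared values, and 0.9222 is the adjusted one. As a rough sketch of how the two relate, the block below fits the same formula to simulated data (the data frame `t_demo` and all of its values are hypothetical stand-ins for `t_joined`) and then applies the adjustment formula to the figures printed above.

```r
# Simulated stand-in for the joined county table; only the column names
# follow the report, every value is made up.
set.seed(42)
n <- 66
t_demo <- data.frame(
  white               = runif(n, 0.50, 0.98),
  bachelors           = runif(n, 0.10, 0.50),
  pop_change2010_2020 = runif(n, -0.10, 0.10),
  median_home_value   = runif(n, 80000, 400000),
  foreign             = runif(n, 0.00, 0.15)
)
t_demo$rep_advantage <- with(
  t_demo,
  -0.2 + 1.0 * white - 1.9 * bachelors + rnorm(n, sd = 0.07)
)

fit <- lm(rep_advantage ~ white + bachelors + pop_change2010_2020 +
            median_home_value + foreign, data = t_demo)
s <- summary(fit)

# Adjusted R-squared penalises multiple R-squared for the predictor count p.
p <- length(coef(fit)) - 1
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

# Same formula applied to the reported figures: R-squared 0.9282,
# 66 counties used in the fit, 5 predictors.
adj_reported <- 1 - (1 - 0.9282) * (66 - 1) / (66 - 5 - 1)
print(round(adj_reported, 4))
```

The 66 comes from the 60 residual degrees of freedom plus the 6 estimated coefficients, which matches 67 counties minus the one dropped for missingness.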
After testing the model, I found that the residuals follow an interesting pattern. As shown in the histogram, the residuals are almost normally distributed, with one small exception: there are two peaks, one on either side of 0. When the predictions are plotted against the 2020 election results, they form a very clear linear pattern around y = x. The model’s RMSE was 0.0703, or about 7 percentage points of vote margin.
## [1] 0.07026153
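For reference, here is a small sketch of the RMSE calculation; the helper name `rmse` and the three-value example are hypothetical. The last lines check how the RMSE relates to the residual standard error from the model summary. The two figures in this report agree under that relation, which suggests (though the report does not say so explicitly) that the RMSE was computed on the same 66 counties used to fit the model.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Tiny made-up example, only to show the function in use.
actual    <- c(0.30, -0.10, 0.55)
predicted <- c(0.25, -0.02, 0.60)
print(rmse(actual, predicted))

# In-sample, RMSE divides the squared residuals by n, while the residual
# standard error divides by the residual degrees of freedom, so
# RMSE = RSE * sqrt(df / n).
rse      <- 0.07369  # residual standard error from the summary
df_resid <- 60       # residual degrees of freedom
n_obs    <- 66       # counties used in the fit
rmse_implied <- rse * sqrt(df_resid / n_obs)
print(rmse_implied)
```

Because it is measured on the training counties, this RMSE is best read as a description of fit rather than an estimate of out-of-sample error.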
Running the 2024 Pennsylvania demographic data through the model gives a prediction of 0.08628. The positive value means that the model predicts that Republicans will take Pennsylvania by about 8.6 percentage points, and that it won’t be particularly close.
## 1
## 0.08628
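To make the prediction step concrete, the sketch below applies the fitted coefficients from the summary to a single row of inputs. The coefficients are the reported ones, but the input values are placeholders and not the actual 2024 statewide figures, which are not listed in this report, so the result will not reproduce 0.08628. Treating the demographic variables as proportions is also an assumption.

```r
# Coefficients as reported in the model summary.
coefs <- c(
  "(Intercept)"       = -0.2406910146,
  white               =  1.0554221224,
  bachelors           = -1.8868336714,
  pop_change2010_2020 =  0.7600016431,
  median_home_value   =  0.0000010043,
  foreign             = -2.1450269587
)

# Placeholder inputs (hypothetical, NOT the real 2024 Pennsylvania values).
state_row <- c(
  "(Intercept)"       = 1,
  white               = 0.80,
  bachelors           = 0.33,
  pop_change2010_2020 = 0.02,
  median_home_value   = 200000,
  foreign             = 0.07
)

# A linear model's prediction is the sum of coefficient * input.
pred <- sum(coefs * state_row[names(coefs)])
print(pred)
```

With a fitted `lm` object in hand, `predict(fit, newdata = ...)` performs the same calculation; writing it out shows how much each variable contributes to the final margin.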
There are some clear limitations in this model. The main one is that I am using a model fitted on county-level data to predict the election at the state level, a decision I made because accurate 2024 demographic data was not available at the county level. Another concern is the use of median home value in the model. Since home values have skyrocketed across the country in the past four years, median home value may relate to the election results differently during this cycle.
https://www.rural.pa.gov/data/county-profiles.cfm - County Demographic Data
https://www.electionreturns.pa.gov/ReportCenter/Reports - 2020 Election Data