Project 2 Mackenzie Hoeflinger

Introduction

Age plays a role in party preference: voters over 65 tend to lean Republican. Other factors discussed in this paper include education (the share of residents with at least a bachelor's degree), median household income, and unemployment. Using Pennsylvania county data from the 2020 election as a model, I predict that 76.67% of Texas will vote for Trump in the upcoming election.

Data

The following are the links for data used for this project:

Key variables in my data include:

People Over the Age of 65 Voted for Trump in the 2020 Election

This graph shows the relationship between the percentage of people over 65 and the percentage of votes for Trump across Pennsylvania counties in the 2020 election between Donald Trump and Joe Biden. The two variables are positively correlated, meaning that Pennsylvania counties with more residents over 65 tend to vote Republican, in this case for Donald Trump.

Correlation Test

After noticing this relationship, I ran a correlation test on the data. The test also showed several negative correlations, so I continued investigating the variables to see which factors affect the vote for Trump.
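A correlation test of this kind can be sketched in R with `cor.test()`. This is only an illustration under stated assumptions: the data frame below is synthetic stand-in data, not the project's data, and only the column names are borrowed from the model output later in this report.

```r
# Synthetic stand-in for the county table; these numbers are NOT the real data.
set.seed(1)
n <- 67  # Pennsylvania has 67 counties
t_county_votes <- data.frame(
  percent_people_over_65 = runif(n, 12, 28)  # hypothetical range, in percent
)
# Hypothetical positive relationship plus noise (vote share as a proportion)
t_county_votes$trump_vote_percent <-
  0.30 + 0.015 * t_county_votes$percent_people_over_65 + rnorm(n, sd = 0.08)

# Pearson correlation test between the age variable and Trump vote share
ct <- cor.test(t_county_votes$percent_people_over_65,
               t_county_votes$trump_vote_percent)

print(ct$estimate)  # correlation coefficient r
print(ct$p.value)   # p-value for the null hypothesis r = 0
```

A positive `ct$estimate` with a small p-value would match the positive relationship described above; the sign and size on the real data may of course differ.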

Outliers by County

This histogram shows a few outliers with a very low percentage of votes for Trump. The one farthest to the left is Philadelphia County, whose demographics differ sharply from the rest of the state. The city draws a much younger population, mostly ages 18-29, that is more educated, and it is urban rather than rural like most other counties. This could explain why Philadelphia is not grouped with the rest.

I also looked at median household income, as the graph here shows an interesting pattern. Counties with a higher percentage of votes for Trump tended to have a lower median household income, with a few outliers. The largest outlier, at the bottom left, was again Philadelphia County.

Methods

The regression model that performed best used age (percentage over 65), the percentage with at least a bachelor's degree, and unemployment. I dropped median household income because it had little effect on the model. The summary of the model is below. The model shows a strong relationship between these variables and the percentage of people who voted for Trump, with an R-squared value of 0.7394. This means that approximately 73.94% of the variability in Trump vote percentage is explained by the model. All coefficients are statistically significant: their p-values are all below 0.001, meaning that the relationships are unlikely to be due to random chance.

## 
## Call:
## lm(formula = trump_vote_percent ~ percent_people_over_65 + percent_bachelors + 
##     unemploy_percent, data = t_county_votes)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.206769 -0.030365  0.003069  0.029108  0.119145 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.7831947  0.0905215   8.652 2.60e-12 ***
## percent_people_over_65  0.0148030  0.0033995   4.355 5.00e-05 ***
## percent_bachelors      -0.0063556  0.0009473  -6.709 6.41e-09 ***
## unemploy_percent       -0.0330907  0.0056190  -5.889 1.64e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05614 on 63 degrees of freedom
## Multiple R-squared:  0.7394, Adjusted R-squared:  0.727 
## F-statistic: 59.59 on 3 and 63 DF,  p-value: < 2.2e-16
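To make the coefficients concrete, the short sketch below plugs one county into the fitted equation by hand. The coefficient values are copied from the summary above; the county itself (20% over 65, 25% with a bachelor's degree, 5% unemployment) is hypothetical and chosen only for illustration. The response appears to be on a 0 to 1 scale, judging by the size of the intercept and residuals.

```r
# Coefficients copied from the model summary above
coefs <- c(
  intercept = 0.7831947,
  over_65   = 0.0148030,
  bachelors = -0.0063556,
  unemploy  = -0.0330907
)

# Hypothetical county (values in percent); not a real county from the data
county <- c(over_65 = 20, bachelors = 25, unemploy = 5)

# Linear prediction: intercept + sum of coefficient * predictor value
pred <- coefs[["intercept"]] + sum(coefs[names(county)] * county)
print(round(pred, 4))  # about 0.7549, i.e. roughly 75.5% predicted Trump share
```

Read this way, each additional percentage point of residents over 65 raises the predicted share by about 1.5 points, while each additional point of bachelor's degree holders or unemployment lowers it.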

Histogram of Residuals

Looking at the histogram of the residuals of this model, the residuals are roughly normally distributed, with most of the values in the center. The peak is slightly above zero. Because a residual is the observed value minus the fitted value, this means the model slightly underpredicts Trump's share in the typical county; overall, however, the model fits the data well.
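A residual histogram like this one can be produced along the following lines. This is a minimal sketch, assuming a data frame shaped like `t_county_votes`; the values generated here are synthetic placeholders, not the Pennsylvania data, so the resulting plot will not match the one described above.

```r
# Synthetic stand-in data with the same column names as the model summary.
# These values are made up for illustration only.
set.seed(42)
n <- 67
t_county_votes <- data.frame(
  percent_people_over_65 = runif(n, 12, 28),
  percent_bachelors      = runif(n, 10, 55),
  unemploy_percent       = runif(n, 3, 10)
)
t_county_votes$trump_vote_percent <-
  0.78 + 0.015 * t_county_votes$percent_people_over_65 -
  0.006 * t_county_votes$percent_bachelors -
  0.033 * t_county_votes$unemploy_percent + rnorm(n, sd = 0.05)

# Same formula as in the summary above
fit <- lm(trump_vote_percent ~ percent_people_over_65 + percent_bachelors +
            unemploy_percent, data = t_county_votes)

# Residual = observed - fitted; positive values mean the model underpredicted
res <- residuals(fit)
hist(res, breaks = 15, main = "Histogram of Residuals",
     xlab = "Residual (observed - fitted)")
```

Because the model includes an intercept, the residuals average to zero by construction, so the histogram is mainly useful for judging skew and spotting outlying counties.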

Using the Model to Predict

The model, trained on Pennsylvania data, can be applied to counties in Texas to make predictions about the upcoming election. Below is a graph of the predicted percentage of votes for Donald Trump in each Texas county compared to the percentage of people in that county who are over 65. The mean of these predicted percentages is 76.67%, which is my prediction for the 2024 election.
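The prediction step can be sketched with `predict()` and `newdata`. Everything about the data here is an assumption made for illustration: `make_counties()` is a hypothetical helper that generates synthetic counties, `texas_counties` and `predicted_trump` are made-up names, and the numbers will not reproduce the 76.67% figure.

```r
set.seed(7)

# Hypothetical helper: generates synthetic county-level predictors (in percent)
make_counties <- function(n) {
  data.frame(
    percent_people_over_65 = runif(n, 10, 30),
    percent_bachelors      = runif(n, 10, 55),
    unemploy_percent       = runif(n, 3, 10)
  )
}

# Synthetic "Pennsylvania" training data (67 counties), NOT the real data
pa <- make_counties(67)
pa$trump_vote_percent <-
  0.78 + 0.015 * pa$percent_people_over_65 - 0.006 * pa$percent_bachelors -
  0.033 * pa$unemploy_percent + rnorm(67, sd = 0.05)

fit <- lm(trump_vote_percent ~ percent_people_over_65 + percent_bachelors +
            unemploy_percent, data = pa)

# Synthetic "Texas" data (254 counties) with the same predictor columns
texas_counties <- make_counties(254)

# Apply the Pennsylvania-trained model to the Texas counties
texas_counties$predicted_trump <- predict(fit, newdata = texas_counties)

# Unweighted mean across counties, as described in the text
mean_prediction <- mean(texas_counties$predicted_trump)
print(round(100 * mean_prediction, 2))
```

Note that `mean()` here gives every county equal weight, as the text describes. A statewide vote share would instead weight each county by its number of voters, which this sketch does not attempt.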

Limitations

One limitation of this project is that Texas and Pennsylvania may be too different to compare, although in terms of demographics they are not very far apart, and the model should be able to account for differences in education, age, and unemployment. In addition, the model was trained only on Pennsylvania and could have been expanded to other states. Even so, with an R-squared of 0.7394, it explains about 73.94% of the variability in the Pennsylvania data, which is a substantial share for a model based on a single state.

Discussion

I predicted that 76.67% of Texas would vote for Donald Trump; however, only 56.2% of Texas voted for him. I am unhappy with this result, as the race was much closer than my model predicted, which could be because the model was limited in scope. Trained on only a few demographic variables from Pennsylvania, it could not accurately predict the results in Texas and overshot the actual result by about 20 percentage points. The model also failed to incorporate events such as his felony charges and other factors that affect voter behavior in Texas, which differ significantly from those in Pennsylvania. As a result, the prediction was overly optimistic and did not account for key factors that may have influenced the actual outcome.
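The size of the miss can be checked with simple arithmetic. The sketch below uses only the two figures stated above (76.67% predicted, 56.2% actual) and distinguishes the gap in percentage points from the error relative to the actual result; the variable names are illustrative.

```r
predicted <- 76.67  # model's predicted Trump share for Texas, in percent
actual    <- 56.2   # actual Trump share in Texas, in percent

# Absolute error in percentage points
abs_error <- predicted - actual
print(abs_error)  # 20.47

# Error relative to the actual result, in percent
rel_error <- 100 * abs_error / actual
print(round(rel_error, 1))
```

The absolute gap is the figure usually quoted when comparing election forecasts, which is why the miss is best described in percentage points.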

References

Sources included: