The HD Pulse link contains Pennsylvania population by age for each county and will be useful in determining which counties have a generally younger or older population. I will mostly use this data in my explanations of visuals, as it gives the percentage of each county's population that is over 65 and the percentage that is between 18 and 39. Centre and Philadelphia Counties have the highest percentage of people aged 18-39, and Sullivan and Cameron Counties have the highest percentage of people aged 65 and up in PA.
https://www.rural.pa.gov/data/county-profiles
This link will help illustrate how population age trended from 2010 to 2020, also by county. Because we are using this data to predict the coming election from past trends, I think it is important to look at trends in age as well, since things are constantly changing. This will also be used mainly in my explanation, to support forecasting for this election.
https://projects.fivethirtyeight.com/polls/president-general/2024/national/
Next, the Projects 538 link will show how the current election stands and how I can relate past trends to this November’s outcome. This will come after I have completed my analysis of the 2020 election. I will use:
- county_name: displays the county name in PA
- candidate_name: the candidate with votes
- votes: displays how many votes have currently been counted at this point in time (not official)
https://www.electionreturns.pa.gov/ReportCenter/Reports
Lastly, I will use the Election Returns link to show the percentage of each Pennsylvania county that voted Democratic or Republican in the 2020 election, and apply this to show a correlation between county age and vote preference.
The columns I will be using are:
I started by taking the 2020 election data, which includes the votes for Biden and Trump by county, and joining it to the two datasets showing each county's percentage of people aged 65 and up and aged 18-39. Although my thesis says 18-29, the data only had 18-39.
I also used the Official dataset from the 2020 election to turn the number of votes per county for each candidate into percentages.
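A minimal sketch of the join and percentage steps, in plain Python. The county names, vote counts, and age percentages below are made-up placeholders, not the real 2020 data, and the helper names `vote_shares` and `join_with_age` are hypothetical. The shares are computed only over the candidates listed in the rows, which is an assumption about how the percentages were defined.

```python
from collections import defaultdict

# Hypothetical rows in the shape (county_name, candidate_name, votes).
# The vote counts are placeholders, not real 2020 returns.
returns = [
    ("County A", "Biden", 600),
    ("County A", "Trump", 400),
    ("County B", "Biden", 250),
    ("County B", "Trump", 750),
]

# Hypothetical age table: county -> (percent aged 65+, percent aged 18-39).
age_by_county = {
    "County A": (15.0, 35.0),
    "County B": (25.0, 22.0),
}

def vote_shares(rows):
    """Return {county: {candidate: share of that county's listed votes}}."""
    totals = defaultdict(int)
    for county, _, votes in rows:
        totals[county] += votes
    shares = defaultdict(dict)
    for county, candidate, votes in rows:
        shares[county][candidate] = votes / totals[county]
    return dict(shares)

def join_with_age(shares, ages):
    """Inner join on county name; counties missing from either table are dropped."""
    joined = []
    for county in sorted(shares.keys() & ages.keys()):
        pct_65_plus, pct_18_39 = ages[county]
        joined.append({
            "county_name": county,
            "biden_percent": shares[county].get("Biden", 0.0),
            "trump_percent": shares[county].get("Trump", 0.0),
            "pct_65_plus": pct_65_plus,
            "pct_18_39": pct_18_39,
        })
    return joined

table = join_with_age(vote_shares(returns), age_by_county)
for row in table:
    print(row)
```

Joining on the county name only works if both sources spell the names the same way, so it is worth checking that the joined table still has all 67 counties.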
The strongest correlation I saw in my data was between the share of people over the age of 65 and votes for Trump in the 2020 election. There was also a correlation between people aged 18 to 39 and votes for Biden; however, it was not as strong.
This suggests that there are other factors at play for Democratic voters and those aged 18-39, so I looked into education and median household income. Although neither seemed to have a significant correlation with age and voter preference, median household income did have a correlation with education, which makes sense.
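For reference, the kind of correlation discussed here can be computed as a Pearson coefficient. The sketch below is a plain-Python version with placeholder numbers; `pearson_r` is a hypothetical helper name, and I am assuming Pearson correlation is the measure being compared.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-products around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# Placeholder values, not the real county data.
pct_65_plus = [15.0, 18.0, 21.0, 24.0]
trump_share = [0.35, 0.50, 0.55, 0.70]
print(round(pearson_r(pct_65_plus, trump_share), 3))
```

A value near +1 or -1 means a strong linear relationship, and a value near 0 means little linear relationship.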
Model 1: Simple Linear Regression
a) First I ran a regression for Trump voters and the PA counties’ population aged 65+. These were my results:
Residuals:
    Min      1Q  Median      3Q     Max
-4.2310 -1.4310 -0.1769  0.8789  7.5137
Coefficients:
              Estimate Std. Error t value      Pr(>|t|)
(Intercept)      8.878      1.762   5.039 0.00000398653
trump_percent   15.629      2.285   6.841 0.00000000331
Residual standard error: 1.994 on 65 degrees of freedom
Multiple R-squared: 0.4186, Adjusted R-squared: 0.4096
F-statistic: 46.79 on 1 and 65 DF, p-value: 0.000000003314
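This shows that the coefficient on trump_percent is 15.629, a positive relationship, and its t value of 6.841 shows the estimate is large relative to its standard error. The multiple R-squared of 0.4186 means the model explains about 42% of the variation in the 65+ population share across counties, which leaves a lot unexplained, but the p-value is very low, indicating the relationship is probably not just due to random chance. When graphed, there is a cluster of data points in the top left corner, and the fitted line slopes up, as the positive coefficient implies.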
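To make the pieces of this output concrete, here is a minimal sketch of how a simple least-squares fit produces an intercept, a slope, and an R-squared. It is plain Python rather than the software that produced the results above, `fit_simple_ols` is a hypothetical helper name, and the four data points are placeholders, not the county data.

```python
def fit_simple_ols(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared = 1 - (unexplained variation / total variation).
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1 - ss_res / ss_tot

# Placeholder data: Trump vote share (as a proportion) and percent aged 65+.
trump_share = [0.2, 0.4, 0.6, 0.8]
pct_65_plus = [12.0, 15.0, 19.0, 22.0]

intercept, slope, r_squared = fit_simple_ols(trump_share, pct_65_plus)
print(intercept, slope, r_squared)
```

The slope is the change in the response for a one-unit change in the predictor, so its size depends on the units of both variables. R-squared and the t value are better guides to how strong the relationship is.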
b) Next I ran the same regression for Biden voters and the PA counties’ population aged 18-39:
Coefficients:
              Estimate Std. Error t value             Pr(>|t|)
(Intercept)    22.2778     0.9069  24.566 < 0.0000000000000002
biden_percent  17.5168     3.4995   5.005           0.00000451
Residual standard error: 3.055 on 65 degrees of freedom
Multiple R-squared: 0.2782, Adjusted R-squared: 0.2671
F-statistic: 25.05 on 1 and 65 DF, p-value: 0.000004512
This shows that the coefficient on biden_percent is 17.5168, which again indicates a positive relationship. The multiple R-squared of 0.2782 means the model explains only about 28% of the variation in the 18-39 population share, but the p-value is also very low, meaning the relationship is unlikely to be due to random chance. When graphed, there is a cluster of data points in the bottom left corner and the line slopes up.
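In a regression with a single predictor, R-squared is the square of the Pearson correlation, so the two reported R-squared values can be converted back into correlations and compared directly. A short sketch of that arithmetic (the helper name `implied_r` is hypothetical):

```python
import math

def implied_r(r_squared, slope):
    """Correlation implied by a one-predictor regression.

    The magnitude is sqrt(R-squared); the sign matches the slope.
    """
    return math.copysign(math.sqrt(r_squared), slope)

# R-squared values and slopes taken from the two outputs above.
r_trump_65 = implied_r(0.4186, 15.629)
r_biden_18_39 = implied_r(0.2782, 17.5168)

print(round(r_trump_65, 3), round(r_biden_18_39, 3))
```

This gives correlations of roughly 0.65 for Trump votes and the 65+ population and roughly 0.53 for Biden votes and the 18-39 population. That matches the observation that the first relationship is the stronger one, even though the Biden slope is numerically larger.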
Model 2: Multiple Linear Regression (Continuous Variables)
This model looks at the population aged 18-39 as the response, with the percentage of Biden voters, median household income (value_dollars), and the percentage of those with at least a bachelor’s degree (value_percent) as predictors. These were my results:
Residuals:
    Min      1Q  Median      3Q     Max
-4.8118 -2.1632 -0.4646  1.7299  8.9188
Coefficients:
                 Estimate  Std. Error t value             Pr(>|t|)
(Intercept)   28.49053385  1.97999395  14.389 < 0.0000000000000002
biden_percent  5.84120829  4.74746825   1.230             0.223128
value_dollars -0.00018365  0.00004433  -4.143             0.000104
value_percent  0.35124232  0.08164256   4.302              0.00006 ***
Residual standard error: 2.699 on 63 degrees of freedom
Multiple R-squared: 0.4537, Adjusted R-squared: 0.4277
F-statistic: 17.44 on 3 and 63 DF, p-value: 0.00000002351
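The coefficient on biden_percent is 5.8412, which indicates a positive relationship, but it is not statistically significant because its p-value is 0.223. Median household income (value_dollars) and bachelor’s degree attainment (value_percent) are both significant in this model, with p-values of 0.000104 and 0.00006; income has a negative coefficient and education a positive one. The R-squared of 0.4537 means the model explains about 45% of the variation in the 18-39 share, which is an okay fit. The overall p-value is very low, so the model as a whole is statistically significant.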
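The adjusted R-squared and F-statistic in each output follow directly from R-squared, the number of counties, and the number of predictors, so they can be recomputed as a sanity check. The sketch below uses the standard formulas. The count of 67 observations is inferred from the reported degrees of freedom (63 residual plus 4 estimated coefficients), and the function names are hypothetical.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalises R-squared for each added predictor."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

def f_statistic(r2, n_obs, n_predictors):
    """Overall F-statistic of a regression with an intercept."""
    df_resid = n_obs - n_predictors - 1
    return (r2 / n_predictors) / ((1 - r2) / df_resid)

N_COUNTIES = 67  # inferred from the residual degrees of freedom

# (reported R-squared, number of predictors) for each model above.
models = {
    "Model 1 (Trump, 65+)": (0.4186, 1),
    "Model 1 (Biden, 18-39)": (0.2782, 1),
    "Model 2": (0.4537, 3),
}

for name, (r2, k) in models.items():
    adj = adjusted_r2(r2, N_COUNTIES, k)
    f_val = f_statistic(r2, N_COUNTIES, k)
    print(f"{name}: adjusted R2 = {adj:.4f}, F = {f_val:.2f}")
```

The recomputed values agree with the reported ones up to rounding. Going from one predictor to three raises the adjusted R-squared for the 18-39 share from about 0.27 to about 0.43.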
Model 3: Regression with a Categorical Variable
After adding county to the regression, these are the results:
Not much can be drawn from this because of the number of counties: with 67 counties and one row per county, a separate term for each county leaves the model nothing to estimate the other relationships from. I am going to continue to tweak this to see different results, but this is the path I will be taking, with the prediction that Centre and Philadelphia Counties will show a relationship with Biden votes since they have a younger population, and Sullivan and Cameron Counties will show a relationship with Trump votes since they have an older population.
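The problem with using county itself as the categorical variable can be seen by counting parameters. The sketch below assumes one row per county and the usual dummy coding, where a variable with k levels adds k - 1 indicator columns. `residual_df` is a hypothetical helper, and the two-level grouping at the end is only an illustration, not something from the analysis above.

```python
def residual_df(n_obs, n_numeric, n_levels):
    """Residual degrees of freedom for a model with an intercept,
    n_numeric continuous predictors, and one categorical predictor
    dummy-coded into n_levels - 1 indicator columns."""
    n_params = 1 + n_numeric + (n_levels - 1)
    return n_obs - n_params

n_counties = 67

# County as the only predictor: every county gets its own parameter.
print(residual_df(n_counties, 0, n_counties))  # 0

# County plus one continuous predictor: more parameters than rows.
print(residual_df(n_counties, 1, n_counties))  # -1

# Illustration only: a two-level grouping of counties leaves plenty of rows.
print(residual_df(n_counties, 1, 2))  # 64
```

With zero residual degrees of freedom the model reproduces every county exactly, so standard errors and p-values cannot be computed. A categorical variable with far fewer levels than counties avoids this.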
The first linear regressions I ran show the strongest positive relationships, and as more variables are added the relationship stays positive but is not as strong. The overall p-values are consistently very low, which is a good sign, but the R-squared values could be improved, as 45% and 27% do not explain much of the variation. The residuals are in a reasonable range and do not show many significant outliers. The first regression models will be the most important in determining how the current election will turn out, as age seems to be a significantly correlated factor.
Because Trump votes and the 65+ population are more strongly correlated than Biden votes and the 18-39 population, more analysis is needed on the other factors that affected Democratic voters than on those that affected Republican voters. Pennsylvania is a swing state and has switched back and forth over the past few elections, so it is important to remove biases when predicting its influence on this election. I am going to continue to tweak these regressions to include age trends and try to find a better relationship, as education and median household income have not shown a strong correlation with the data.