Part 1: Data Source

https://hdpulse.nimhd.nih.gov/data-portal/social/table?socialtopic=070&socialtopic_options=social_6&demo=00003&demo_options=pop_12&race=00&race_options=race_7&sex=0&sex_options=sex_3&age=999&age_options=ageNA_1&statefips=42&statefips_options=area_states

The HD Pulse link contains Pennsylvania population age data by county, which will be useful in determining which counties have a generally younger population and which have an older one. I will mostly use this data in my explanations of visuals, as it gives the percentage of each PA county's population that is over 65 and the percentage that is between 18 and 39. Centre and Philadelphia Counties have the highest percentages of people aged 18-39, and Sullivan and Cameron Counties have the highest percentages of people aged 65 and up in PA.

https://www.rural.pa.gov/data/county-profiles

This link will help illustrate how population age trended from 2010 to 2020, also by county. Because we are using this data to predict the coming election from past trends, I think it is important to look at trends in age as well, since things are constantly changing. I will also use this in my explanations to support forecasting for this election.

https://projects.fivethirtyeight.com/polls/president-general/2024/national/

Next, the FiveThirtyEight (538) polls page will show how the current election stands and how I can relate past trends to this November’s outcome. I will do this after I have completed my analysis of the 2020 election. I will use:

- county_name: displays the county name in PA
- candidate_name: the candidate with votes
- votes: displays how many votes have currently been counted at this point in time (not official)

https://www.electionreturns.pa.gov/ReportCenter/Reports

Lastly, I will use the Election Returns link to show the percentage of each Pennsylvania county that voted Democratic or Republican in the 2020 election, and I will use this to look for a correlation between county age and vote preference.
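A minimal sketch of how per-county vote percentages could be derived from rows with the county_name, candidate_name, and votes columns listed earlier. The function name vote_shares and the sample rows are hypothetical and the counts are made up for illustration; they are not real returns.

```python
from collections import defaultdict

def vote_shares(rows):
    """Return {county_name: {candidate_name: share of that county's votes}}."""
    totals = defaultdict(int)
    by_candidate = defaultdict(lambda: defaultdict(int))
    for row in rows:
        county = row["county_name"]
        votes = int(row["votes"])
        totals[county] += votes
        by_candidate[county][row["candidate_name"]] += votes
    # Divide each candidate's count by the county total; skip empty counties.
    return {
        county: {name: count / totals[county] for name, count in candidates.items()}
        for county, candidates in by_candidate.items()
        if totals[county] > 0
    }

# Made-up rows for illustration only (placeholder counties and candidates).
example_rows = [
    {"county_name": "County A", "candidate_name": "Candidate X", "votes": 600},
    {"county_name": "County A", "candidate_name": "Candidate Y", "votes": 400},
    {"county_name": "County B", "candidate_name": "Candidate X", "votes": 250},
    {"county_name": "County B", "candidate_name": "Candidate Y", "votes": 750},
]
shares = vote_shares(example_rows)
print(shares["County A"])
```

The shares come out as fractions between 0 and 1. Multiply by 100 if the analysis needs percentages instead.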

The columns I will be using are:

Part 2: Data Transformation

Part 3: Correlations

The strongest correlation I saw in my data was between the share of people over the age of 65 and votes for Trump in the 2020 election. There was also a correlation between the share of people aged 18 to 39 and votes for Biden; however, it was not as strong.
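As a rough illustration of what "correlation" measures here, the sketch below computes a Pearson correlation coefficient between two equal-length lists, such as a county age share and a county vote share. The function name pearson_r and the toy numbers are assumptions for illustration and are not the project's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return sxy / math.sqrt(sxx * syy)

# Toy values: the second list is exactly twice the first, so r is 1.
toy_r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
print(toy_r)
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.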

This suggests that other factors are at play for Democratic voters and those aged 18-39, so I looked into education and median household income. Although neither seemed to have a significant correlation with age and voter preference, median household income did have a correlation with education, which makes sense.

Part 4: Modeling

Model 1: Simple Linear Regression

a) First, I ran a regression for Trump voters and the PA counties' population aged 65+. These were my results:

Residuals:
    Min      1Q  Median      3Q     Max
-4.2310 -1.4310 -0.1769  0.8789  7.5137

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)      8.878      1.762   5.039 0.00000398653
trump_percent   15.629      2.285   6.841 0.00000000331

Residual standard error: 1.994 on 65 degrees of freedom
Multiple R-squared: 0.4186, Adjusted R-squared: 0.4096
F-statistic: 46.79 on 1 and 65 DF, p-value: 0.000000003314

The trump_percent coefficient is 15.629, which indicates a positive, statistically significant relationship. The multiple R-squared is 0.4186, so the model explains about 41.86% of the variation, which is not a large share, but the p-value is very low, indicating the relationship is probably not just due to random chance. When graphed, there is a cluster of data points in the top left corner and the line slopes down.
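For readers who want to see what a simple linear regression computes, here is a minimal least-squares sketch in plain Python. It is not the code used for the results above. The function name fit_line and the toy data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared is the share of the variation in y explained by the line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return intercept, slope, r_squared

# Toy data for illustration only.
intercept, slope, r_squared = fit_line([0, 1, 2, 3], [1, 2, 2, 3])
print(intercept, slope, r_squared)
```

The slope plays the role of the trump_percent estimate in the summary, and r_squared corresponds to the "Multiple R-squared" line.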

b) Next, I ran a regression for Biden votes and the PA counties' population aged 18-39. These were my results:

Residuals:
    Min      1Q  Median      3Q     Max
-5.2008 -1.8756 -0.4651  0.7804 14.6203

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    22.2778     0.9069  24.566 < 0.0000000000000002
biden_percent  17.5168     3.4995   5.005 0.00000451

Residual standard error: 3.055 on 65 degrees of freedom
Multiple R-squared: 0.2782, Adjusted R-squared: 0.2671
F-statistic: 25.05 on 1 and 65 DF, p-value: 0.000004512

The biden_percent coefficient is 17.5168, which indicates a positive, statistically significant relationship. The multiple R-squared is 0.2782, so the model explains only 27.82% of the variation, but the p-value is also very low, meaning the relationship is probably not due to random chance. When graphed, there is a cluster of data points in the bottom left corner and the line slopes up.
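The two summaries can be sanity-checked with a little arithmetic. In a regression with one predictor, the t value is the estimate divided by its standard error, the F-statistic equals the square of that t value, and the absolute correlation equals the square root of R-squared. The sketch below applies those identities to the numbers reported above. The helper name check is an assumption, and small differences from the printed output come from rounding in the reported values.

```python
import math

# Values copied from the two summaries: (estimate, std. error, R-squared).
reported = {
    "trump_percent": (15.629, 2.285, 0.4186),
    "biden_percent": (17.5168, 3.4995, 0.2782),
}

def check(estimate, std_error, r_squared):
    """Recompute t, F, and |r| for a one-predictor regression."""
    t_value = estimate / std_error
    f_stat = t_value ** 2          # holds only with a single predictor
    abs_r = math.sqrt(r_squared)   # |correlation| between the two variables
    return t_value, f_stat, abs_r

checks = {name: check(*values) for name, values in reported.items()}
for name, (t_value, f_stat, abs_r) in checks.items():
    print(f"{name}: t = {t_value:.3f}, F = {f_stat:.2f}, |r| = {abs_r:.3f}")
```

The recomputed values agree with the reported t values and F-statistics to within rounding. The implied correlations are roughly 0.65 for the 65+ model and 0.53 for the 18-39 model, which matches the observation that the first relationship is the stronger of the two.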

Model 2: Multiple Linear Regression (Continuous Variables)

This model looks at the population aged 18-39 while also looking at the percentage of Biden voters, median household income (value_dollars), and the percentage of those with at least a bachelor’s degree (value_percent). These were my results:

Residuals:
    Min      1Q  Median      3Q     Max
-4.8118 -2.1632 -0.4646  1.7299  8.9188

Coefficients:
                 Estimate  Std. Error t value Pr(>|t|)
(Intercept)   28.49053385  1.97999395  14.389 < 0.0000000000000002
biden_percent  5.84120829  4.74746825   1.230 0.223128
value_dollars -0.00018365  0.00004433  -4.143 0.000104
value_percent  0.35124232  0.08164256   4.302 0.00006 ***

Residual standard error: 2.699 on 63 degrees of freedom
Multiple R-squared: 0.4537, Adjusted R-squared: 0.4277
F-statistic: 17.44 on 3 and 63 DF, p-value: 0.00000002351

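The biden_percent coefficient is 5.8412, which indicates a positive relationship, but it is not statistically significant because its p-value is 0.223. Median household income (value_dollars) and the percentage with at least a bachelor's degree (value_percent) are both statistically significant in this model. The multiple R-squared is 0.4537 (45.37%), which shows an okay fit to the data. The p-value of the overall F-test is very low, so the model as a whole is statistically significant.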
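To make the coefficients easier to read, the sketch below plugs inputs into the fitted equation from the Model 2 summary. The coefficients are copied from the output above. The input values are hypothetical, and it is an assumption that biden_percent is recorded as a fraction between 0 and 1, value_dollars in dollars, and value_percent in percentage points.

```python
# Coefficients copied from the Model 2 summary.
COEFFICIENTS = {
    "intercept": 28.49053385,
    "biden_percent": 5.84120829,
    "value_dollars": -0.00018365,
    "value_percent": 0.35124232,
}

def predict_share_18_39(biden_percent, value_dollars, value_percent):
    """Predicted percentage of a county's population aged 18-39."""
    return (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["biden_percent"] * biden_percent
        + COEFFICIENTS["value_dollars"] * value_dollars
        + COEFFICIENTS["value_percent"] * value_percent
    )

# Hypothetical county, assumed units: fraction, dollars, percentage points.
baseline = predict_share_18_39(0.5, 60000, 25)
richer = predict_share_18_39(0.5, 70000, 25)
print(round(baseline, 2), round(richer - baseline, 2))
```

Holding the other inputs fixed, a $10,000 increase in median household income lowers the predicted 18-39 share by about 1.84 percentage points, and each additional percentage point of bachelor's degree attainment raises it by about 0.35.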

Model 3: Regression with a Categorical Variable

Adding county to the regression, these are the results:

Not much can be drawn from this because of the number of counties: the data appear to have one row per county, so giving each county its own category leaves little or no data to estimate anything else. I am going to continue to tweak this to see different results, but this is the path I will take, with the prediction that Centre and Philadelphia Counties will show a relationship with Biden votes since they have younger populations, and Sullivan and Cameron Counties will show a relationship with Trump votes since they have older populations.
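One way to see why a county variable overwhelms this data is to count residual degrees of freedom, which is the number of observations minus the number of estimated coefficients. The sketch below assumes 67 observations (one per Pennsylvania county), which is what the reported "65 degrees of freedom" with two coefficients implies. The function name residual_df is hypothetical.

```python
def residual_df(n_observations, n_coefficients):
    """Residual degrees of freedom left after fitting the model."""
    return n_observations - n_coefficients

N_COUNTIES = 67  # assumed: one row per county, consistent with the reported df

# Model 1: intercept + one slope.
df_model_1 = residual_df(N_COUNTIES, 2)
# Model 2: intercept + three slopes.
df_model_2 = residual_df(N_COUNTIES, 4)
# County as a categorical variable: intercept + one dummy per county except
# the reference county, before any other predictor is even added.
df_county_only = residual_df(N_COUNTIES, 1 + (N_COUNTIES - 1))

print(df_model_1, df_model_2, df_county_only)
```

With zero residual degrees of freedom, the county categories alone can fit every observation exactly, so there is nothing left over to estimate standard errors or the effect of age.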

Part 5: Analysis of Results

The first linear regressions I ran show the strongest positive relationships, and as more variables are added the relationship stays positive but is not as strong. The p-values are consistently very low, which is a good thing, but the R-squared values could be improved, as 45% and 27% are not great fits. The residuals are in a reasonable range and do not show too many significant outliers. The first regression models will be the most important in determining how the current election will turn out, as age seems to be a significantly correlated factor. Because Trump votes and the 65+ population are more strongly correlated than Biden votes and the 18-39 population, more analysis is needed on the other factors that affected Democratic voters than on those that affected Republican voters.

Pennsylvania is a swing state and has switched back and forth over the past few elections, so it is important to remove biases when predicting its influence on this election. I am going to continue to tweak these regressions to include age trends and try to find a better relationship, as education and median household income have not shown a great correlation with the data.