Part 1: Data Source

1. https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/42MVDX

- Years, parties, candidate votes, total votes
- This dataset has basic historical data explaining how many people voted for each candidate each year.

2. https://www.rural.pa.gov/data/county-profiles.cfm

- (still sorting through these but looking at population and income numbers)
- This dataset has numbers for numerous demographics for each county in Pennsylvania.

3. https://fred.stlouisfed.org/series/MEHOINUSPAA646N

- Date and median household income, plus I added a new column of just the year
- This dataset has median household income from 1984 to 2023.

4. https://github.com/profgarrett/profgarrettdata

- PCA_states
- This dataset has numbers for numerous demographics in each state in the US

Part 2: Data Transformation

1.

- I got rid of columns that were mostly or entirely NAs
- I filtered the party data down to just Democratic and Republican candidates, since these are the top candidates
- I filtered the state to just PA because it is my focus state (and also a swing state)
- I also mutated a new column that divides candidate votes by total votes and multiplies by 100 to get each candidate’s percentage of that year’s votes
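
The filtering and percentage steps above can be sketched as follows. This is a minimal Python illustration, not the code used for the analysis: the field names (`state`, `party`, `candidatevotes`, `totalvotes`) and the party labels are assumptions about how the file is laid out, and every number in `rows` is made up.

```python
# Made-up rows in the assumed shape of the election data (not real vote counts).
rows = [
    {"year": 2000, "state": "PA", "party": "DEMOCRAT", "candidatevotes": 520, "totalvotes": 1000},
    {"year": 2000, "state": "PA", "party": "REPUBLICAN", "candidatevotes": 460, "totalvotes": 1000},
    {"year": 2000, "state": "PA", "party": "GREEN", "candidatevotes": 20, "totalvotes": 1000},
    {"year": 2000, "state": "OH", "party": "DEMOCRAT", "candidatevotes": 480, "totalvotes": 1000},
]

MAJOR_PARTIES = {"DEMOCRAT", "REPUBLICAN"}  # assumed party labels


def pa_major_party_share(rows):
    """Keep PA rows for the two major parties and add a vote-percentage field."""
    kept = []
    for row in rows:
        if row["state"] == "PA" and row["party"] in MAJOR_PARTIES:
            pct = row["candidatevotes"] / row["totalvotes"] * 100
            kept.append({**row, "pct_votes": pct})
    return kept


result = pa_major_party_share(rows)
for row in result:
    print(row["year"], row["party"], round(row["pct_votes"], 1))
```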

2.

- This had a lot of columns; I removed some in Excel before uploading, but there were still too many
- I had to make it into a tibble, transpose the data, clean the names, and shift the rows
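
As a rough picture of what transposing and cleaning the names involves, here is a small Python sketch. The table layout (measures as rows, counties as columns) is an assumption about the county profile file, and the county names and values are placeholders.

```python
# Placeholder wide table: one row per measure, one column per county (values made up).
wide = [
    ["Measure", "County A", "County B"],
    ["Total Population", 1000, 2000],
    ["Median Income", 50000, 60000],
]


def transpose(table):
    """Swap the rows and columns of a rectangular list-of-lists table."""
    return [list(column) for column in zip(*table)]


def clean_name(name):
    """Lowercase a column name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


flipped = transpose(wide)
header = [clean_name(name) for name in flipped[0]]  # first row becomes the column names
header[0] = "county"  # after transposing, the first column holds county names
records = [dict(zip(header, row)) for row in flipped[1:]]
print(records)
```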

3.

- I didn't have to clean this up at all; it only had two columns, which were both of use to me, but I also added a year column
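
Adding the year column amounts to pulling the year out of each date. A minimal Python sketch, assuming the dates are ISO-formatted strings such as `1984-01-01`; the field names and income values here are placeholders, not the real series.

```python
from datetime import date

# Placeholder observations: (date string, income); the incomes are made up.
observations = [("1984-01-01", 100), ("1985-01-01", 110), ("1986-01-01", 125)]

# Keep both original columns and add the year as a third one.
with_year = [
    {"date": day, "income": income, "year": date.fromisoformat(day).year}
    for day, income in observations
]
print(with_year)
```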

4.

- The only thing I changed was mutating a new column to categorize regions
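
One way to build a region column is a lookup from state to region. The sketch below assumes the four US Census Bureau regions and two-letter state abbreviations; the grouping actually used in this project may differ.

```python
# Assumed grouping: the four US Census Bureau regions (50 states plus DC).
REGIONS = {
    "Northeast": "CT ME MA NH RI VT NJ NY PA",
    "Midwest": "IL IN MI OH WI IA KS MN MO NE ND SD",
    "South": "DE FL GA MD NC SC VA DC WV AL KY MS TN AR LA OK TX",
    "West": "AZ CO ID MT NV NM UT WY AK CA HI OR WA",
}

# Invert the table into a state -> region lookup.
STATE_TO_REGION = {
    state: region for region, states in REGIONS.items() for state in states.split()
}


def region_of(state_abbreviation):
    """Return the region for a two-letter state abbreviation."""
    return STATE_TO_REGION[state_abbreviation.upper()]


print(region_of("PA"))
```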

Part 3: Correlations

Looking at the correlation between median household income and the number of Democratic candidate votes, there is a correlation of 0.9047118, which we are very confident about (p-value = 0.0003211).
Then, looking at the same median household income but the number of Republican candidate votes, there is a correlation of 0.7417894, which we are somewhat less confident in (p-value = 0.01405).
Overall, there is a stronger (positive) correlation between higher incomes and Democratic candidate votes in Pennsylvania.
- However, Pittsburgh and Philadelphia could act as outliers that skew the data, since these urban areas tend to have higher incomes and vote more Democratic.
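
The p-values for these correlations come from a t statistic, t = r·sqrt(n − 2) / sqrt(1 − r²). The Python sketch below works that out for both parties. It assumes n = 10 observations, which is inferred from the 8 degrees of freedom reported for the simple regression in this report rather than stated directly.

```python
import math


def t_from_r(r, n):
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


N_OBS = 10  # assumption: inferred from the 8 residual degrees of freedom (n - 2 = 8)

t_dem = t_from_r(0.9047118, N_OBS)  # Democratic votes vs. income
t_rep = t_from_r(0.7417894, N_OBS)  # Republican votes vs. income

print(round(t_dem, 2), round(t_rep, 2))
```

The larger t value for the Democratic correlation is what produces its smaller p-value.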

Part 4: Modeling

### Model 1: Simple Linear Regression

(looking at Democratic votes)

    Residuals:
        Min      1Q  Median      3Q     Max
    -9299.3 -5259.1   -21.8  5002.0 10745.7

    Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
    (Intercept)    -3.729e+04  1.361e+04  -2.740 0.025467 *
    candidatevotes  2.991e-02  4.979e-03   6.006 0.000321 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 7163 on 8 degrees of freedom
    Multiple R-squared:  0.8185,  Adjusted R-squared:  0.7958
    F-statistic: 36.08 on 1 and 8 DF,  p-value: 0.0003211

The median residual of -21.8 suggests that the predictions slightly overestimate the observed value for the typical observation, but this is very close to zero considering how much income data can vary (the residuals range from about -9,300 to 10,700).
We can also be confident in these numbers, since the p-value is under 0.05.
Whether looking at R-squared or adjusted R-squared, our model is relatively strong: it explains about 80-82% of the variation in the data.
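
To see what the fitted line means in practice, the coefficients can be plugged into a prediction. This Python sketch assumes the response is median household income in dollars (which the scale of the residuals suggests) and uses a hypothetical vote count of 3,000,000 that is not taken from the data.

```python
INTERCEPT = -3.729e4  # (Intercept) estimate from the Model 1 output
SLOPE = 2.991e-2      # candidatevotes estimate from the Model 1 output


def predict_income(candidate_votes):
    """Predicted median household income for a given number of candidate votes."""
    return INTERCEPT + SLOPE * candidate_votes


# Hypothetical input: 3,000,000 votes (an illustration, not an observed value).
example = predict_income(3_000_000)
print(round(example))
```

The slope says that each additional 1,000 votes goes with roughly a $30 higher predicted income, which describes an association and not a causal effect.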

### Model 2: Multiple Linear Regression (Continuous Variables)

- Looking across the US at factors that may have an influence on median household income (based on average family size)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.28606 -0.07422 -0.01821  0.08438  0.32144

    Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)
    (Intercept)                             -1.1571     1.3283  -0.871  0.38811
    residence_a_year_ago_-_different_state  -1.9237     1.4556  -1.322  0.19270
    total_households_with_a_computer         4.7688     1.4657   3.254  0.00211 **
    bachelor's_degree_or_higher             -0.6017     0.3282  -1.833  0.07311 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.1218 on 47 degrees of freedom
    Multiple R-squared:  0.2021,  Adjusted R-squared:  0.1512
    F-statistic: 3.968 on 3 and 47 DF,  p-value: 0.01333

Here I picked some variables in the US that would impact people’s median income as well as their voting stances:

- starting a family can alter voting habits as priorities change, and having large families can increase voting behavior because what is important to parents is likely to be instilled as important to their kids
- changing geographical area could affect someone’s values according to their surroundings
- households with computers have greater access to news outlets, social media, and other platforms that could sway their voting behavior
- higher education is associated with more liberal views

The relationship between family size and moving to a different state is negative, but it is not significant according to the standard error, t value, and p-value (0.19270).
Total households with a computer, however, shows a positive relationship with strong significance (p-value = 0.00211).
A bachelor’s degree or higher has a negative relationship (this is surprising, as I would guess more education means more income, which can make starting a family more affordable), but it is only marginally significant (p-value = 0.07311, just above 0.05).
R-squared (0.2021) also shows us the model does not explain the variability very well at all.

### Model 3: Regression with a Categorical Variable

- Adding a categorical variable: region (Northeast, Midwest, South, and West)

    Residuals:
          Min        1Q    Median        3Q       Max
    -0.225441 -0.054603 -0.000204  0.068285  0.291214

    Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)
    (Intercept)                             0.67845    1.51508   0.448  0.65650
    residence_a_year_ago_-_different_state -4.17415    1.45804  -2.863  0.00641 **
    total_households_with_a_computer        2.58771    1.68383   1.537  0.13150
    bachelor's_degree_or_higher             0.08183    0.38061   0.215  0.83075
    regionNortheast                        -0.03637    0.05322  -0.683  0.49800
    regionSouth                             0.09917    0.04278   2.318  0.02513 *
    regionWest                              0.14889    0.05298   2.810  0.00736 **
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.1099 on 44 degrees of freedom
    Multiple R-squared:  0.3917,  Adjusted R-squared:  0.3087
    F-statistic: 4.721 on 6 and 44 DF,  p-value: 0.0008635

Looking at just the residuals, this seems like a strong model, as they only range from -0.225 to 0.291.
Both the South and West regions show a positive and statistically significant effect on the dependent variable relative to the Midwest, which is the reference region left out of the coefficient table.
The Northeast region shows a small negative effect, but it is not significant (p-value = 0.49800).
R-squared (0.3917), again, shows us this is not a great model for explaining the variability.
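
Each region coefficient is a difference from the omitted region, so any two regions can be compared by subtracting their coefficients. The Python sketch below assumes the Midwest is the baseline because it is the only region without its own row in the output.

```python
# Region coefficients from the Model 3 output; Midwest is the assumed baseline (0).
REGION_COEF = {
    "Midwest": 0.0,
    "Northeast": -0.03637,
    "South": 0.09917,
    "West": 0.14889,
}


def region_gap(region_a, region_b):
    """Predicted difference in the response between two regions, other variables held fixed."""
    return REGION_COEF[region_a] - REGION_COEF[region_b]


print(round(region_gap("South", "Midwest"), 5))
print(round(region_gap("West", "Northeast"), 5))
```

Note that the significance stars in the output only test each region against the baseline, so a gap such as West versus Northeast would need its own test.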

Part 5: Analysis of Results

Looking at votes alone in the first model, R-squared is very strong. It is weaker in the other two, but they are still statistically significant.
The linear model that includes the categorical region variable does not explain the variance well, but it has a very small p-value of 0.0008635, which tells us it is very significant. The model with the continuous variables alone also has a significant p-value (0.01333), even though it is not as small as that of the region model (both are under 0.05).
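
The adjusted R-squared and F-statistic for all three models follow directly from R-squared, the sample size n, and the number of predictors p. The Python sketch below recomputes them as a check. The sample sizes are inferred from the reported degrees of freedom (n = residual df + p + 1) and are not stated directly in the output.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def f_statistic(r2, n, p):
    """Overall F-statistic of the regression, computed from R-squared."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))


# (R-squared, n, p); n is inferred from the residual degrees of freedom.
MODELS = {
    "Model 1": (0.8185, 10, 1),
    "Model 2": (0.2021, 51, 3),
    "Model 3": (0.3917, 51, 6),
}

summary = {
    name: (adjusted_r2(r2, n, p), f_statistic(r2, n, p))
    for name, (r2, n, p) in MODELS.items()
}

for name, (adj, f_stat) in summary.items():
    print(name, round(adj, 3), round(f_stat, 2))
```

Adding region raises R-squared from 0.2021 to 0.3917, and the adjusted value still roughly doubles after the penalty for three extra coefficients.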
To show a more direct relationship between these variables and voting, rather than relating the variables to each other separately from the votes, I still need to combine the tables so that votes and demographics are together. This will let me predict the number of votes in Pennsylvania more directly.
- For this I might need to make two tables (one Democratic, one Republican) and base them on the percentage of votes rather than the number of votes, so the data reflects proportions rather than just more or fewer voters.
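
A possible shape for that next step is to join the vote percentages to the other data by year and split the result by party. The Python sketch below is only an outline: the field names are hypothetical and every value is made up.

```python
# Made-up vote percentages and incomes, keyed by year (not real data).
votes = [
    {"year": 2000, "party": "DEMOCRAT", "pct_votes": 51.0},
    {"year": 2000, "party": "REPUBLICAN", "pct_votes": 47.0},
    {"year": 2004, "party": "DEMOCRAT", "pct_votes": 50.0},
    {"year": 2004, "party": "REPUBLICAN", "pct_votes": 49.0},
]
income_by_year = {2000: 100, 2004: 110}


def join_by_party(votes, income_by_year):
    """Attach income to each vote row by year and group the rows by party."""
    tables = {}
    for row in votes:
        if row["year"] in income_by_year:  # keep only years present in both sources
            joined = {
                "year": row["year"],
                "pct_votes": row["pct_votes"],
                "income": income_by_year[row["year"]],
            }
            tables.setdefault(row["party"], []).append(joined)
    return tables


tables = join_by_party(votes, income_by_year)
print(tables["DEMOCRAT"])
```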

HW10

Heidi Hartje

2024-10-21