https://rpubs.com/hrh00009/HW10
- Years, parties, candidate votes, total votes
- This dataset has basic historical data explaining how many people voted for each candidate each year.
- (still sorting through these but looking at population and income numbers)
- This dataset has numbers for numerous demographics for each county in Pennsylvania.
- Date and median household income, plus I added a new column of just the year
- This dataset has median household income from 1984 to 2023.
- PCA_states
- This dataset has numbers for numerous demographics in each state in the US
- I got rid of columns that were mostly/all NAs
- I filtered the party data to just Democratic and Republican information, as these are the top candidates
- I filtered the state to just PA because it is my focus state (and also a swing state)
- I also mutated a new column with candidate votes over total votes times 100 to get each candidate's percentage of that year's votes.
- This had a lot of columns; I got rid of some in Excel before uploading, but still not enough
- I had to make it into a tibble, transpose the data, and clean the names/shift the rows
- I didn't have to clean this up at all; it only had two columns, which were both of use to me, but I also added a year column
- The only thing I changed was mutating a new column to categorize regions
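A minimal sketch of the vote-table cleaning steps described above, written in base R so it runs without extra packages (the original work used `mutate`). The data frame and all of its values are made up. The column names `state`, `party`, and `totalvotes`, the party labels, and the 50% NA cutoff are assumptions; only `candidatevotes` appears in the model output.

```r
# Toy stand-in for the presidential vote table; every value here is made up.
votes <- data.frame(
  year = c(2000, 2000, 2000, 2000),
  state = c("PENNSYLVANIA", "PENNSYLVANIA", "PENNSYLVANIA", "OHIO"),
  party = c("DEMOCRAT", "REPUBLICAN", "GREEN", "DEMOCRAT"),
  candidatevotes = c(50, 40, 10, 30),
  totalvotes = c(100, 100, 100, 60),
  notes = NA  # an all-NA column, standing in for the ones that were dropped
)

# Drop columns that are mostly NA (the 50% cutoff is an assumption).
votes <- votes[, colMeans(is.na(votes)) < 0.5]

# Keep Pennsylvania rows for the two major parties only.
pa_votes <- subset(
  votes,
  state == "PENNSYLVANIA" & party %in% c("DEMOCRAT", "REPUBLICAN")
)

# Each candidate's share of that year's total vote, in percent.
pa_votes$vote_pct <- pa_votes$candidatevotes / pa_votes$totalvotes * 100

print(pa_votes)
```

With the toy values, two rows survive the filter and the new `vote_pct` column holds each candidate's share of the total.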
###Model 1: Simple Linear Regression
Residuals:

| Min | 1Q | Median | 3Q | Max |
| --- | --- | --- | --- | --- |
| -9299.3 | -5259.1 | -21.8 | 5002.0 | 10745.7 |

Coefficients:

| Term | Estimate | Std. Error | t value | Pr(>\|t\|) | Signif. |
| --- | --- | --- | --- | --- | --- |
| (Intercept) | -3.729e+04 | 1.361e+04 | -2.740 | 0.025467 | `*` |
| candidatevotes | 2.991e-02 | 4.979e-03 | 6.006 | 0.000321 | `***` |

`Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1`

- Residual standard error: 7163 on 8 degrees of freedom
- Multiple R-squared: 0.8185, Adjusted R-squared: 0.7958
- F-statistic: 36.08 on 1 and 8 DF, p-value: 0.0003211
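One way to read the Model 1 coefficients is to plug a vote count into the fitted line. The sketch below uses the estimates reported above. The output does not name the response variable, so treating it as median household income in dollars is an assumption, and the 3,000,000-vote input is a hypothetical value chosen only for illustration.

```r
# Estimates copied from the Model 1 output.
intercept <- -3.729e+04
slope <- 2.991e-02

# Fitted line: predicted response for a given number of candidate votes.
# (Response assumed to be median household income; not stated in the output.)
predict_response <- function(candidate_votes) {
  intercept + slope * candidate_votes
}

# Hypothetical input of 3,000,000 votes.
predicted <- predict_response(3e6)
print(predicted)

# Change in the predicted response per additional 100,000 votes.
per_100k <- slope * 1e5
print(per_100k)

# In a one-predictor regression the F-statistic equals the slope's t value squared,
# so this should land close to the reported 36.08 (differences are rounding).
t_slope <- 6.006
print(t_slope^2)
```

The last check is a quick way to confirm that the coefficient table and the F-statistic line describe the same fit.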
###Model 2: Multiple Linear Regression (Continuous Variables)
- Looking across the US at factors that may have influence on median household income (based on average family size)
Residuals:

| Min | 1Q | Median | 3Q | Max |
| --- | --- | --- | --- | --- |
| -0.28606 | -0.07422 | -0.01821 | 0.08438 | 0.32144 |

Coefficients:

| Term | Estimate | Std. Error | t value | Pr(>\|t\|) | Signif. |
| --- | --- | --- | --- | --- | --- |
| (Intercept) | -1.1571 | 1.3283 | -0.871 | 0.38811 | |
| `residence_a_year_ago_-_different_state` | -1.9237 | 1.4556 | -1.322 | 0.19270 | |
| `total_households_with_a_computer` | 4.7688 | 1.4657 | 3.254 | 0.00211 | `**` |
| `bachelor's_degree_or_higher` | -0.6017 | 0.3282 | -1.833 | 0.07311 | `.` |

`Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1`

- Residual standard error: 0.1218 on 47 degrees of freedom
- Multiple R-squared: 0.2021, Adjusted R-squared: 0.1512
- F-statistic: 3.968 on 3 and 47 DF, p-value: 0.01333
Here I picked some variables across the US that could affect people's median income as well as their voting stances:
- Starting a family can alter voting habits as priorities change, and large families can increase voting behavior because what is important to parents is likely to be instilled as important to their kids.
- Changing geographical area could shift someone's values toward those of their surroundings.
- Households with computers have greater access to news outlets, social media, and other platforms that could sway their voting behavior.
- Higher education is associated with more liberal views.
The relationship between family size and having moved from a different state in the past year is negative, but it is not statistically significant according to its standard error, t value, and p-value (0.1927).
Total households with a computer, however, shows a positive relationship that is highly significant (p-value = 0.00211).
A bachelor's degree or higher has a negative relationship (this is surprising, as I would guess more education means more income, which makes starting a family more affordable), but it is only marginally significant: its p-value of 0.0731 is just above 0.05.
R-squared (0.2021) also shows that the model explains only about 20% of the variability in the response.
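The adjusted R-squared values in the output can be recomputed from R-squared, the number of observations, and the number of predictors, which makes it clearer why adding predictors is penalized. A small sketch using the standard formula; the observation counts are inferred from the residual degrees of freedom (residual df plus the number of estimated coefficients), not stated directly in the output.

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
# n = number of observations, p = number of predictors (excluding the intercept).
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Model 1: 8 residual df + 2 coefficients = 10 observations, 1 predictor.
adj_model1 <- adj_r2(0.8185, n = 10, p = 1)

# Model 2: 47 residual df + 4 coefficients = 51 observations, 3 predictors.
adj_model2 <- adj_r2(0.2021, n = 51, p = 3)

# Compare with the reported 0.7958 and 0.1512.
print(round(c(model1 = adj_model1, model2 = adj_model2), 4))
```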
###Model 3: Regression with a Categorical Variable
- Adding a categorical variable: region (Northeast, Midwest, South, and West)
Residuals:

| Min | 1Q | Median | 3Q | Max |
| --- | --- | --- | --- | --- |
| -0.225441 | -0.054603 | -0.000204 | 0.068285 | 0.291214 |

Coefficients:

| Term | Estimate | Std. Error | t value | Pr(>\|t\|) | Signif. |
| --- | --- | --- | --- | --- | --- |
| (Intercept) | 0.67845 | 1.51508 | 0.448 | 0.65650 | |
| `residence_a_year_ago_-_different_state` | -4.17415 | 1.45804 | -2.863 | 0.00641 | `**` |
| `total_households_with_a_computer` | 2.58771 | 1.68383 | 1.537 | 0.13150 | |
| `bachelor's_degree_or_higher` | 0.08183 | 0.38061 | 0.215 | 0.83075 | |
| regionNortheast | -0.03637 | 0.05322 | -0.683 | 0.49800 | |
| regionSouth | 0.09917 | 0.04278 | 2.318 | 0.02513 | `*` |
| regionWest | 0.14889 | 0.05298 | 2.810 | 0.00736 | `**` |

`Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1`

- Residual standard error: 0.1099 on 44 degrees of freedom
- Multiple R-squared: 0.3917, Adjusted R-squared: 0.3087
- F-statistic: 4.721 on 6 and 44 DF, p-value: 0.0008635
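The region rows in the Model 3 output are easier to read with R's default dummy coding in mind: a factor with four levels becomes three 0/1 columns, and the level left out (here Midwest, the first alphabetically) is the baseline that the other regions are compared against. A small sketch with a four-value factor; the vector is only an illustration, not the author's data.

```r
# One observation per region, just to show how the factor is encoded.
region <- factor(c("Midwest", "Northeast", "South", "West"))

# The first level is the reference category by default.
baseline <- levels(region)[1]
print(baseline)

# Design matrix that lm() would build for a `~ region` term.
mm <- model.matrix(~ region)
print(mm)
```

The column names match the rows in the output (`regionNortheast`, `regionSouth`, `regionWest`), so each of those estimates is a difference from the Midwest with the other predictors held fixed.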
Looking at votes alone in the first model, R-squared is very strong (0.8185). It is much weaker in the other two (0.2021 and 0.3917), but both models are still statistically significant overall.
The model that includes the categorical region variable does not explain variance well, but its overall p-value of 0.0008635 is very small, which tells us the model as a whole is very significant. The model with the continuous variables alone is also significant, although its p-value (0.01333) is not as small. Both are under 0.05.
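The overall p-values compared here come from each model's F-statistic and its two degrees of freedom. As a rough check, they can be recomputed with `pf()` from the rounded values in the output; small differences from the reported p-values are expected because the F-statistics are rounded.

```r
# Upper-tail probability of the F distribution = overall model p-value.
f_pvalue <- function(f_stat, df1, df2) {
  pf(f_stat, df1, df2, lower.tail = FALSE)
}

# F-statistics and degrees of freedom copied from the three model outputs.
p_model1 <- f_pvalue(36.08, 1, 8)   # reported: 0.0003211
p_model2 <- f_pvalue(3.968, 3, 47)  # reported: 0.01333
p_model3 <- f_pvalue(4.721, 6, 44)  # reported: 0.0008635

print(signif(c(model1 = p_model1, model2 = p_model2, model3 = p_model3), 4))
```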
To show a more direct relationship between these variables and voting, instead of only relating the variables to each other and looking at voting separately, I still need to combine the tables so that votes and demographics sit together. This will let me predict the number of votes in Pennsylvania more directly.
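A possible starting point for that combining step, assuming the vote and income tables share a year: derive the year from the income table's date column and join on it with `merge()`. Both data frames and all values below are made up, and the column names `vote_pct`, `date`, and `median_income` are hypothetical.

```r
# Toy vote table (made-up values), one row per election year.
pa_votes <- data.frame(
  year = c(2000L, 2004L, 2008L),
  vote_pct = c(50, 48, 54)
)

# Toy income table (made-up values) with a date column, as in the income dataset.
income <- data.frame(
  date = c("2000-01-01", "2004-01-01", "2008-01-01", "2012-01-01"),
  median_income = c(100, 110, 120, 130)
)

# Add a year column extracted from the date.
income$year <- as.integer(format(as.Date(income$date), "%Y"))

# Inner join on year: only years present in both tables are kept.
combined <- merge(pa_votes, income[, c("year", "median_income")], by = "year")

print(combined)
```

Once votes and demographics are in one table like `combined`, a single `lm()` call can use vote counts or percentages as the response and the demographic columns as predictors.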