https://rpubs.com/hrh00009/HW10

Part 1: Data Source

1. https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/42MVDX

- Years, parties, candidate votes, total votes
- This dataset has basic historical data showing how many votes each candidate received each year.

2. https://www.rural.pa.gov/data/county-profiles.cfm

- (still sorting through these but looking at population and income numbers)
- This dataset has figures for many demographic measures for each county in Pennsylvania.

3. https://fred.stlouisfed.org/series/MEHOINUSPAA646N

- Date and median household income, plus I added a new column of just the year
- This dataset has median household income from 1984 to 2023. 

4. https://github.com/profgarrett/profgarrettdata

- PCA_states
- This dataset has figures for many demographic measures for each state in the US.

Part 2: Data Transformation

1.

- I got rid of columns that were mostly or entirely NA.
- I filtered the party data to keep only the Democratic and Republican candidates, since these are the top candidates.
- I filtered the state to just PA because it is my focus state (and also a swing state).
- I also mutated a new column that divides candidate votes by total votes and multiplies by 100, giving each candidate's percentage of that year's votes.
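
As a rough sketch of the filtering and percentage steps above, here is the same logic in plain Python (the original work was done in R). The rows are invented toy numbers, not real election results, and the column names `state`, `party`, and `totalvotes` are assumptions about the dataset's layout; only `candidatevotes` shows up in the model output later on.

```python
# Toy rows only: these vote counts are invented for illustration,
# and the column names are assumptions about the dataset's layout.
rows = [
    {"year": 2000, "state": "PENNSYLVANIA", "party": "DEMOCRAT", "candidatevotes": 520, "totalvotes": 1000},
    {"year": 2000, "state": "PENNSYLVANIA", "party": "REPUBLICAN", "candidatevotes": 450, "totalvotes": 1000},
    {"year": 2000, "state": "PENNSYLVANIA", "party": "GREEN", "candidatevotes": 30, "totalvotes": 1000},
    {"year": 2000, "state": "OHIO", "party": "DEMOCRAT", "candidatevotes": 400, "totalvotes": 900},
]

MAJOR_PARTIES = {"DEMOCRAT", "REPUBLICAN"}


def pa_major_party_share(rows):
    """Keep PA rows for the two major parties and add a vote percentage."""
    kept = []
    for row in rows:
        if row["state"] != "PENNSYLVANIA" or row["party"] not in MAJOR_PARTIES:
            continue
        # Candidate votes over total votes, times 100.
        vote_pct = row["candidatevotes"] / row["totalvotes"] * 100
        kept.append({**row, "vote_pct": vote_pct})
    return kept


result = pa_major_party_share(rows)
for row in result:
    print(row["year"], row["party"], round(row["vote_pct"], 1))
```

The third-party and non-PA rows drop out, and each remaining row gains a `vote_pct` value.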

2.

- This dataset had a lot of columns. I removed some in Excel before uploading, but there were still too many.
- I had to make it into a tibble, transpose the data, and clean the names/shift the rows.
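
A minimal Python sketch of the transpose-and-clean step, assuming the county file arrives with one row per measure and one column per county. The table values and labels below are invented placeholders, `clean_name` and `transpose_and_clean` are hypothetical helper names, and "shifting the rows" is read here as promoting the first row to a header after the flip.

```python
import re

# Invented values; the real county-profile export has many more rows and columns.
wide = [
    ["Measure", "Adams County", "Allegheny County"],
    ["Total Population", 100, 200],
    ["Median Household Income ($)", 50, 60],
]


def clean_name(name):
    """Lowercase a label and turn runs of non-alphanumerics into underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")


def transpose_and_clean(table):
    # zip(*table) swaps rows and columns.
    flipped = [list(col) for col in zip(*table)]
    # After the flip, the first row holds the measure names: use it as the header.
    header = [clean_name(str(label)) for label in flipped[0]]
    header[0] = "county"
    return [dict(zip(header, row)) for row in flipped[1:]]


counties = transpose_and_clean(wide)
for county in counties:
    print(county)
```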

3.

- I didn't have to clean this up at all. It only had two columns, which were both of use to me, but I also added a year column.
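
Adding the year column amounts to pulling the year out of each date. A small Python sketch, assuming ISO-formatted dates and a `DATE` column name (an assumption about the download format); the income values are placeholders, not real observations.

```python
from datetime import date

# Column names are an assumption about the CSV download; values are placeholders.
observations = [
    {"DATE": "1984-01-01", "MEHOINUSPAA646N": 1.0},
    {"DATE": "2023-01-01", "MEHOINUSPAA646N": 2.0},
]


def add_year(rows, date_key="DATE"):
    """Return copies of the rows with an integer 'year' column added."""
    return [{**row, "year": date.fromisoformat(row[date_key]).year} for row in rows]


with_year = add_year(observations)
for row in with_year:
    print(row["DATE"], row["year"])
```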

4.

- The only thing I changed was mutating a new column that assigns each state to a region.
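
The region column can be built with a simple lookup. The Python sketch below covers only four states to keep it short, and `REGION_BY_STATE` and `add_region` are hypothetical names; a full version would list every state under Northeast, Midwest, South, or West.

```python
# Partial lookup for illustration; a full version would cover every state.
REGION_BY_STATE = {
    "Pennsylvania": "Northeast",
    "Ohio": "Midwest",
    "Texas": "South",
    "California": "West",
}


def add_region(rows, lookup=REGION_BY_STATE):
    """Return copies of the rows with a 'region' column; unmapped states get 'Unknown'."""
    return [{**row, "region": lookup.get(row["state"], "Unknown")} for row in rows]


states = [{"state": "Pennsylvania"}, {"state": "Texas"}, {"state": "Atlantis"}]
with_region = add_region(states)
for row in with_region:
    print(row["state"], row["region"])
```

Flagging unmapped names as "Unknown" rather than dropping them makes gaps in the lookup easy to spot.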

Part 3: Correlations

Part 4: Modeling

### Model 1: Simple Linear Regression

    Residuals:
        Min      1Q  Median      3Q     Max
    -9299.3 -5259.1   -21.8  5002.0 10745.7

    Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
    (Intercept)    -3.729e+04  1.361e+04  -2.740 0.025467 *
    candidatevotes  2.991e-02  4.979e-03   6.006 0.000321 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 7163 on 8 degrees of freedom
    Multiple R-squared: 0.8185, Adjusted R-squared: 0.7958
    F-statistic: 36.08 on 1 and 8 DF, p-value: 0.0003211
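
The numbers in this summary are tied together by a few standard formulas, which makes a quick sanity check possible. The Python sketch below recomputes the t value (estimate divided by standard error), the F-statistic (which equals the squared t value when there is a single predictor), and the adjusted R-squared from the rounded values printed above, so small rounding differences are expected.

```python
# Values copied from the Model 1 summary (already rounded by R).
estimate = 2.991e-02      # slope for candidatevotes
std_error = 4.979e-03
r_squared = 0.8185
residual_df = 8
n_predictors = 1

# t value is the estimate divided by its standard error.
t_value = estimate / std_error

# With one predictor, the overall F-statistic is the square of that t value.
f_stat = t_value ** 2

# Sample size recovered from the residual degrees of freedom.
n_obs = residual_df + n_predictors + 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / residual_df

print(n_obs, round(t_value, 3), round(f_stat, 2), round(adj_r_squared, 4))
```

The recovered sample size of 10 is worth keeping in mind: with so few observations, the fit is sensitive to any single year.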


### Model 2: Multiple Linear Regression (Continuous Variables)

Looking across the US at factors that may influence median household income (based on average family size).


    Residuals:
         Min       1Q   Median       3Q      Max
    -0.28606 -0.07422 -0.01821  0.08438  0.32144

    Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)
    (Intercept)                             -1.1571     1.3283  -0.871  0.38811
    residence_a_year_ago_-_different_state  -1.9237     1.4556  -1.322  0.19270
    total_households_with_a_computer         4.7688     1.4657   3.254  0.00211 **
    bachelor's_degree_or_higher             -0.6017     0.3282  -1.833  0.07311 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.1218 on 47 degrees of freedom
    Multiple R-squared: 0.2021, Adjusted R-squared: 0.1512
    F-statistic: 3.968 on 3 and 47 DF, p-value: 0.01333


### Model 3: Regression with a Categorical Variable

Adding a categorical variable: region (Northeast, Midwest, South, and West).


    Residuals:
          Min        1Q    Median        3Q       Max
    -0.225441 -0.054603 -0.000204  0.068285  0.291214

    Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)
    (Intercept)                             0.67845    1.51508   0.448  0.65650
    residence_a_year_ago_-_different_state -4.17415    1.45804  -2.863  0.00641 **
    total_households_with_a_computer        2.58771    1.68383   1.537  0.13150
    bachelor's_degree_or_higher             0.08183    0.38061   0.215  0.83075
    regionNortheast                        -0.03637    0.05322  -0.683  0.49800
    regionSouth                             0.09917    0.04278   2.318  0.02513 *
    regionWest                              0.14889    0.05298   2.810  0.00736 **
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.1099 on 44 degrees of freedom
    Multiple R-squared: 0.3917, Adjusted R-squared: 0.3087
    F-statistic: 4.721 on 6 and 44 DF, p-value: 0.0008635
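
The coefficient list has rows for Northeast, South, and West but none for Midwest. Assuming the default dummy (treatment) coding, that makes Midwest the baseline absorbed into the intercept, and each region coefficient is the fitted difference from a Midwest state with the same values on the three continuous predictors. The Python sketch below turns that reading into a small comparison helper; `REGION_COEF` and `region_gap` are hypothetical names, and the estimates are copied from the summary above.

```python
# Region estimates from the Model 3 summary; Midwest is assumed to be the
# baseline level, so its offset is zero.
REGION_COEF = {
    "Midwest": 0.0,
    "Northeast": -0.03637,
    "South": 0.09917,
    "West": 0.14889,
}


def region_gap(region_a, region_b):
    """Fitted difference (a minus b) for two states that match on the
    continuous predictors and differ only in region."""
    return REGION_COEF[region_a] - REGION_COEF[region_b]


# West vs. the Midwest baseline is just the West coefficient.
print(round(region_gap("West", "Midwest"), 5))
# West vs. Northeast combines two coefficients.
print(round(region_gap("West", "Northeast"), 5))
```

Only the South and West gaps from the baseline are statistically significant at the 0.05 level in the summary, so the Northeast offset should be read as indistinguishable from zero.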


Part 5: Analysis of Results