Data verification

  • Continuation of last weeks project link
  • my data set is a 2016 election result county-wide data set.
    • Looks like this:
kable(uspres_results[1:5,])
county.fips county.name state.name party vote.count county.total.count national.party.percent national.count is.national.winner
45001 abbeville south carolina D 3741 10775 48.098104 135851595 FALSE
45001 abbeville south carolina O 271 10775 5.789663 135851595 FALSE
45001 abbeville south carolina R 6763 10775 46.112232 135851595 TRUE
22001 acadia louisiana D 5638 27389 48.098104 135851595 FALSE
22001 acadia louisiana O 589 27389 5.789663 135851595 FALSE
## Warning: Setting row names on a tibble is deprecated.
our_data_set wiki_data
total_pop 310184565.00000 318000000.0
med_age 37.49154 37.0
white 63.27000 62.0
african_american 12.24000 12.6
asian 4.82000 5.2
hispanic 16.69000 17.0

Modeling

  • I had many columns I had to take out because of collinearity issues
    • i built exploratory factor variables, which directly correlate with specific columns, these were all removed
    • States and county names would leave far too many Independent variables so they were removed
    • Total population was removed so that I could look at a population factor column of small, to large size counties
    • During backwards selection, per capita income is also removed.
  • I built a over/under 40 column to create some dichotomous data
  • R has a cool functions in the Car package called av plots. Pretty much it gives you residual plots of the coefficients of each of your dependent variables plotted against your independent variable
df_2 <- merged_df[,-c(1,2,3,4,5,7,9,10,19,20,21,22,23,24,25,26,27,28,29)]
df_2$age_strata <-  age_strata
my_fit <- lm(Dem.pct ~ ., data = df_2)

layout(matrix(c(1, 2, 3, 4), 2, 2))
summary(my_fit)
## 
## Call:
## lm(formula = Dem.pct ~ ., data = df_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46583 -0.06641 -0.00507  0.06054  0.41334 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.439e-01  2.903e-02  18.737  < 2e-16 ***
## O                  1.427e-06  5.236e-07   2.725 0.006466 ** 
## percent_white     -4.882e-03  2.588e-04 -18.869  < 2e-16 ***
## percent_black      1.208e-03  2.670e-04   4.526 6.25e-06 ***
## percent_asian      3.661e-03  1.038e-03   3.526 0.000428 ***
## percent_hispanic  -2.019e-03  2.797e-04  -7.219 6.58e-13 ***
## per_capita_income -1.408e-06  5.539e-07  -2.542 0.011057 *  
## median_rent        2.560e-04  1.600e-05  15.995  < 2e-16 ***
## median_age        -3.751e-03  5.933e-04  -6.323 2.93e-10 ***
## voter_turnout      4.133e-01  3.277e-02  12.613  < 2e-16 ***
## county_age         2.433e-10  3.455e-10   0.704 0.481303    
## age_strataover_40  3.752e-02  5.404e-03   6.942 4.68e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09512 on 3080 degrees of freedom
## Multiple R-squared:  0.6098, Adjusted R-squared:  0.6084 
## F-statistic: 437.6 on 11 and 3080 DF,  p-value: < 2.2e-16

Cool R fucntions

  • av plots- gives us the slope of our dependent via independent variable given everything in our model
  • crPlots- gives us our residual plots given the model.
avPlots(my_fit)

crPlots(my_fit)

Update model

  • for this project we wanted a quadratic term, the cr plot above shows us some good options for quadratic terms
    • percent white and O
  • we also want to have interaction effect between a dichotomous variable and factor column so lets add that
  • check in with out model summary and plots again
fit <- update(my_fit,.~.+age_strata*income_levels)
fit <- update(fit,.~.+I(O^3))
fit <- update(fit,.~.+I(percent_white^2))
fit <- update(fit,.~.-median_rent)
fit <- update(fit,.~.-median_age)
layout(matrix(c(1, 2, 3, 4), 2, 2))
summary(fit)
## 
## Call:
## lm(formula = Dem.pct ~ O + percent_white + percent_black + percent_asian + 
##     percent_hispanic + per_capita_income + voter_turnout + county_age + 
##     age_strata + income_levels + I(O^3) + I(percent_white^2) + 
##     age_strata:income_levels, data = df_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51629 -0.05804 -0.00562  0.05352  0.37018 
## 
## Coefficients:
##                                         Estimate Std. Error t value
## (Intercept)                            6.904e-01  3.017e-02  22.882
## O                                      4.351e-06  5.276e-07   8.246
## percent_white                         -1.494e-02  6.027e-04 -24.792
## percent_black                          2.360e-03  2.672e-04   8.831
## percent_asian                          8.392e-03  9.943e-04   8.440
## percent_hispanic                      -1.678e-03  2.763e-04  -6.072
## per_capita_income                      6.284e-06  9.725e-07   6.461
## voter_turnout                          2.613e-01  3.068e-02   8.518
## county_age                             2.101e-10  3.586e-10   0.586
## age_strataover_40                      1.378e-03  1.706e-02   0.081
## income_levels5-25%                    -1.556e-02  1.171e-02  -1.328
## income_levels25-50%                   -2.151e-02  1.312e-02  -1.640
## income_levels50-75%                   -3.720e-03  1.491e-02  -0.249
## income_levels75-95%                   -1.549e-02  1.762e-02  -0.879
## income_levelstop 5%                   -4.275e-02  2.639e-02  -1.620
## I(O^3)                                -1.543e-16  1.967e-17  -7.844
## I(percent_white^2)                     8.038e-05  4.122e-06  19.500
## age_strataover_40:income_levels5-25%  -1.219e-02  1.841e-02  -0.662
## age_strataover_40:income_levels25-50%  6.274e-04  1.824e-02   0.034
## age_strataover_40:income_levels50-75% -2.120e-02  1.849e-02  -1.147
## age_strataover_40:income_levels75-95% -1.905e-02  1.874e-02  -1.017
## age_strataover_40:income_levelstop 5%  7.965e-03  2.246e-02   0.355
##                                       Pr(>|t|)    
## (Intercept)                            < 2e-16 ***
## O                                     2.40e-16 ***
## percent_white                          < 2e-16 ***
## percent_black                          < 2e-16 ***
## percent_asian                          < 2e-16 ***
## percent_hispanic                      1.42e-09 ***
## per_capita_income                     1.20e-10 ***
## voter_turnout                          < 2e-16 ***
## county_age                               0.558    
## age_strataover_40                        0.936    
## income_levels5-25%                       0.184    
## income_levels25-50%                      0.101    
## income_levels50-75%                      0.803    
## income_levels75-95%                      0.379    
## income_levelstop 5%                      0.105    
## I(O^3)                                5.98e-15 ***
## I(percent_white^2)                     < 2e-16 ***
## age_strataover_40:income_levels5-25%     0.508    
## age_strataover_40:income_levels25-50%    0.973    
## age_strataover_40:income_levels50-75%    0.252    
## age_strataover_40:income_levels75-95%    0.309    
## age_strataover_40:income_levelstop 5%    0.723    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09315 on 3070 degrees of freedom
## Multiple R-squared:  0.6271, Adjusted R-squared:  0.6245 
## F-statistic: 245.8 on 21 and 3070 DF,  p-value: < 2.2e-16
plot(fit)

  • the model doesn’t take well to the dichotomous* categorical variable
    • I am going to drop this predictor
  • the transformation of column O didnt go as planned, im gonna try log transform below instead of exponential
fit <- update(fit,.~.-age_strata*income_levels)
#fit <- update(fit,.~.+ log(percent_asian+.0001))
fit <- update(fit,.~.- I(O^3))
fit <- update(fit,.~.+ log(O+.0001))
fit <- update(fit,.~.- O)
summary(fit)
## 
## Call:
## lm(formula = Dem.pct ~ percent_white + percent_black + percent_asian + 
##     percent_hispanic + per_capita_income + voter_turnout + county_age + 
##     I(percent_white^2) + log(O + 1e-04), data = df_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41048 -0.05110 -0.00713  0.04615  0.40520 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.430e-01  2.378e-02  22.833  < 2e-16 ***
## percent_white      -1.653e-02  4.746e-04 -34.841  < 2e-16 ***
## percent_black       2.666e-03  2.269e-04  11.748  < 2e-16 ***
## percent_asian       5.260e-03  8.307e-04   6.333 2.76e-10 ***
## percent_hispanic   -1.408e-03  2.338e-04  -6.023 1.91e-09 ***
## per_capita_income   2.594e-06  4.050e-07   6.405 1.73e-10 ***
## voter_turnout       3.141e-01  2.515e-02  12.488  < 2e-16 ***
## county_age         -7.093e-10  1.608e-10  -4.410 1.07e-05 ***
## I(percent_white^2)  9.125e-05  3.332e-06  27.386  < 2e-16 ***
## log(O + 1e-04)      4.028e-02  1.153e-03  34.929  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08049 on 3082 degrees of freedom
## Multiple R-squared:  0.7205, Adjusted R-squared:  0.7196 
## F-statistic: 882.6 on 9 and 3082 DF,  p-value: < 2.2e-16
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(fit)

LR assumptions for regression seem to be met

  • independent and dependent variables have a linear relationship with each other
  • all observations are independent
  • our residuals seem to be constant and non linear
  • cooks distance seems like there could be one possible outlier with leverage(1271)

overall improvement in model from adding quadratics and log transforms

  • adj r ^2 of .72, started with .62
  • f statistic of 879 from 258
  • All coefficents are significant

Questions

  • If anyone can further explain the crplots and how they can be used to transform our predictors around I would appreciate it.