Data verification

Continuation of last weeks project link
my data set is a 2016 election result county-wide data set.
- Looks like this:

kable(uspres_results[1:5,])

county.fips	county.name	state.name	party	vote.count	county.total.count	national.party.percent	national.count	is.national.winner
45001	abbeville	south carolina	D	3741	10775	48.098104	135851595	FALSE
45001	abbeville	south carolina	O	271	10775	5.789663	135851595	FALSE
45001	abbeville	south carolina	R	6763	10775	46.112232	135851595	TRUE
22001	acadia	louisiana	D	5638	27389	48.098104	135851595	FALSE
22001	acadia	louisiana	O	589	27389	5.789663	135851595	FALSE

I downloaded this data set last week when i took this datacamp class
- The course can be found hereAnalyzing Election and Polling Data in R

## Warning: Setting row names on a tibble is deprecated.

	our_data_set	wiki_data
total_pop	310184565.00000	318000000.0
med_age	37.49154	37.0
white	63.27000	62.0
african_american	12.24000	12.6
asian	4.82000	5.2
hispanic	16.69000	17.0

Modeling

I had many columns I had to take out because of collinearity issues
- i built exploratory factor variables, which directly correlate with specific columns, these were all removed
- States and county names would leave far too many Independent variables so they were removed
- Total population was removed so that I could look at a population factor column of small, to large size counties
- During backwards selection, per capita income is also removed.
I built a over/under 40 column to create some dichotomous data
R has a cool functions in the Car package called av plots. Pretty much it gives you residual plots of the coefficients of each of your dependent variables plotted against your independent variable

df_2 <- merged_df[,-c(1,2,3,4,5,7,9,10,19,20,21,22,23,24,25,26,27,28,29)]
df_2$age_strata <-  age_strata
my_fit <- lm(Dem.pct ~ ., data = df_2)

layout(matrix(c(1, 2, 3, 4), 2, 2))
summary(my_fit)

## 
## Call:
## lm(formula = Dem.pct ~ ., data = df_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46583 -0.06641 -0.00507  0.06054  0.41334 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.439e-01  2.903e-02  18.737  < 2e-16 ***
## O                  1.427e-06  5.236e-07   2.725 0.006466 ** 
## percent_white     -4.882e-03  2.588e-04 -18.869  < 2e-16 ***
## percent_black      1.208e-03  2.670e-04   4.526 6.25e-06 ***
## percent_asian      3.661e-03  1.038e-03   3.526 0.000428 ***
## percent_hispanic  -2.019e-03  2.797e-04  -7.219 6.58e-13 ***
## per_capita_income -1.408e-06  5.539e-07  -2.542 0.011057 *  
## median_rent        2.560e-04  1.600e-05  15.995  < 2e-16 ***
## median_age        -3.751e-03  5.933e-04  -6.323 2.93e-10 ***
## voter_turnout      4.133e-01  3.277e-02  12.613  < 2e-16 ***
## county_age         2.433e-10  3.455e-10   0.704 0.481303    
## age_strataover_40  3.752e-02  5.404e-03   6.942 4.68e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09512 on 3080 degrees of freedom
## Multiple R-squared:  0.6098, Adjusted R-squared:  0.6084 
## F-statistic: 437.6 on 11 and 3080 DF,  p-value: < 2.2e-16

Cool R fucntions

av plots- gives us the slope of our dependent via independent variable given everything in our model
crPlots- gives us our residual plots given the model.

avPlots(my_fit)

crPlots(my_fit)

Update model

for this project we wanted a quadratic term, the cr plot above shows us some good options for quadratic terms
- percent white and O
we also want to have interaction effect between a dichotomous variable and factor column so lets add that
check in with out model summary and plots again

fit <- update(my_fit,.~.+age_strata*income_levels)
fit <- update(fit,.~.+I(O^3))
fit <- update(fit,.~.+I(percent_white^2))
fit <- update(fit,.~.-median_rent)
fit <- update(fit,.~.-median_age)
layout(matrix(c(1, 2, 3, 4), 2, 2))
summary(fit)

## 
## Call:
## lm(formula = Dem.pct ~ O + percent_white + percent_black + percent_asian + 
##     percent_hispanic + per_capita_income + voter_turnout + county_age + 
##     age_strata + income_levels + I(O^3) + I(percent_white^2) + 
##     age_strata:income_levels, data = df_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51629 -0.05804 -0.00562  0.05352  0.37018 
## 
## Coefficients:
##                                         Estimate Std. Error t value
## (Intercept)                            6.904e-01  3.017e-02  22.882
## O                                      4.351e-06  5.276e-07   8.246
## percent_white                         -1.494e-02  6.027e-04 -24.792
## percent_black                          2.360e-03  2.672e-04   8.831
## percent_asian                          8.392e-03  9.943e-04   8.440
## percent_hispanic                      -1.678e-03  2.763e-04  -6.072
## per_capita_income                      6.284e-06  9.725e-07   6.461
## voter_turnout                          2.613e-01  3.068e-02   8.518
## county_age                             2.101e-10  3.586e-10   0.586
## age_strataover_40                      1.378e-03  1.706e-02   0.081
## income_levels5-25%                    -1.556e-02  1.171e-02  -1.328
## income_levels25-50%                   -2.151e-02  1.312e-02  -1.640
## income_levels50-75%                   -3.720e-03  1.491e-02  -0.249
## income_levels75-95%                   -1.549e-02  1.762e-02  -0.879
## income_levelstop 5%                   -4.275e-02  2.639e-02  -1.620
## I(O^3)                                -1.543e-16  1.967e-17  -7.844
## I(percent_white^2)                     8.038e-05  4.122e-06  19.500
## age_strataover_40:income_levels5-25%  -1.219e-02  1.841e-02  -0.662
## age_strataover_40:income_levels25-50%  6.274e-04  1.824e-02   0.034
## age_strataover_40:income_levels50-75% -2.120e-02  1.849e-02  -1.147
## age_strataover_40:income_levels75-95% -1.905e-02  1.874e-02  -1.017
## age_strataover_40:income_levelstop 5%  7.965e-03  2.246e-02   0.355
##                                       Pr(>|t|)    
## (Intercept)                            < 2e-16 ***
## O                                     2.40e-16 ***
## percent_white                          < 2e-16 ***
## percent_black                          < 2e-16 ***
## percent_asian                          < 2e-16 ***
## percent_hispanic                      1.42e-09 ***
## per_capita_income                     1.20e-10 ***
## voter_turnout                          < 2e-16 ***
## county_age                               0.558    
## age_strataover_40                        0.936    
## income_levels5-25%                       0.184    
## income_levels25-50%                      0.101    
## income_levels50-75%                      0.803    
## income_levels75-95%                      0.379    
## income_levelstop 5%                      0.105    
## I(O^3)                                5.98e-15 ***
## I(percent_white^2)                     < 2e-16 ***
## age_strataover_40:income_levels5-25%     0.508    
## age_strataover_40:income_levels25-50%    0.973    
## age_strataover_40:income_levels50-75%    0.252    
## age_strataover_40:income_levels75-95%    0.309    
## age_strataover_40:income_levelstop 5%    0.723    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09315 on 3070 degrees of freedom
## Multiple R-squared:  0.6271, Adjusted R-squared:  0.6245 
## F-statistic: 245.8 on 21 and 3070 DF,  p-value: < 2.2e-16

plot(fit)

the model doesn’t take well to the dichotomous* categorical variable
- I am going to drop this predictor
the transformation of column O didnt go as planned, im gonna try log transform below instead of exponential

fit <- update(fit,.~.-age_strata*income_levels)
#fit <- update(fit,.~.+ log(percent_asian+.0001))
fit <- update(fit,.~.- I(O^3))
fit <- update(fit,.~.+ log(O+.0001))
fit <- update(fit,.~.- O)
summary(fit)

## 
## Call:
## lm(formula = Dem.pct ~ percent_white + percent_black + percent_asian + 
##     percent_hispanic + per_capita_income + voter_turnout + county_age + 
##     I(percent_white^2) + log(O + 1e-04), data = df_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41048 -0.05110 -0.00713  0.04615  0.40520 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.430e-01  2.378e-02  22.833  < 2e-16 ***
## percent_white      -1.653e-02  4.746e-04 -34.841  < 2e-16 ***
## percent_black       2.666e-03  2.269e-04  11.748  < 2e-16 ***
## percent_asian       5.260e-03  8.307e-04   6.333 2.76e-10 ***
## percent_hispanic   -1.408e-03  2.338e-04  -6.023 1.91e-09 ***
## per_capita_income   2.594e-06  4.050e-07   6.405 1.73e-10 ***
## voter_turnout       3.141e-01  2.515e-02  12.488  < 2e-16 ***
## county_age         -7.093e-10  1.608e-10  -4.410 1.07e-05 ***
## I(percent_white^2)  9.125e-05  3.332e-06  27.386  < 2e-16 ***
## log(O + 1e-04)      4.028e-02  1.153e-03  34.929  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08049 on 3082 degrees of freedom
## Multiple R-squared:  0.7205, Adjusted R-squared:  0.7196 
## F-statistic: 882.6 on 9 and 3082 DF,  p-value: < 2.2e-16

layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(fit)

LR assumptions for regression seem to be met

independent and dependent variables have a linear relationship with each other
all observations are independent
our residuals seem to be constant and non linear
cooks distance seems like there could be one possible outlier with leverage(1271)

overall improvement in model from adding quadratics and log transforms

adj r ^2 of .72, started with .62
f statistic of 879 from 258
All coefficents are significant

Questions

If anyone can further explain the crplots and how they can be used to transform our predictors around I would appreciate it.

County-wide voting behavior based on racial demographics

Justin Herman

November 9, 2018

Data verification

Modeling

Cool R fucntions

Update model

LR assumptions for regression seem to be met

overall improvement in model from adding quadratics and log transforms

Questions