Part 1 - Introduction

We’re going to investigate whether specific characteristics of county residents can help us predict the share of a county that voted for the Democratic Presidential nominee in 1992. We will focus on income, population density, and the percentage of college-educated residents, along with other variables.

This research question is important because if you can identify the characteristics associated with voting for Democrats vs. Republicans, you can begin to investigate the thoughts and ideas that lie behind those characteristics. As you learn more about people’s habits and thoughts, you can target them to sway their opinions and influence elections.

Part 2 - Data

Data collection: This dataset can be found on the Department of Biostatistics website at Vanderbilt University. We will also pull in a CSV from GitHub that contains regional data.
Cases: Each case is a county in the United States, described by county-level summary characteristics of its residents at the time, such as average income.
Variables: We’ll be looking at income (numerical, continuous), region (categorical, nominal), and percent of residents with a bachelor’s degree (numerical, continuous).
Type of study: This is an observational study, since we’re only looking at data that was already collected and published on the Department of Biostatistics website. Unfortunately, we don’t know more about how the data was collected. No experiment was conducted to gather this data.
Scope of inference - generalizability: The population of interest is all counties in the United States. Although it’s unclear, we’re going to assume that a sample of residents responded from each county, and that this subset of respondents can be generalized to the population of each county. There is potential for bias because we aren’t sure how the data was collected. Whether it came from a census or another survey, residents who actually respond often differ systematically from those who do not.
Scope of inference - causality: Since this is an observational study, we can only establish an association between variables, not a causal connection.
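
The step of combining the county dataset with the regional CSV might look like the sketch below. The two small data frames are made-up stand-ins for the real sources (the actual file locations are not given in this report), and the column names `county`, `state`, and `Division` are assumptions.

```r
# Toy stand-ins for the two sources; income values are invented for illustration.
counties <- data.frame(
  county = c("A", "B", "C"),
  state  = c("AL", "CT", "TN"),
  income = c(25000, 41000, 28000)
)
regions <- data.frame(
  state    = c("AL", "CT", "TN"),
  Division = c("East South Central", "New England", "East South Central")
)

# Left join: keep every county and attach its Division by state code.
county_data <- merge(counties, regions, by = "state", all.x = TRUE)

# Treat Division as a categorical (nominal) variable.
county_data$Division <- factor(county_data$Division)
str(county_data)
```

Using `all.x = TRUE` keeps counties whose state has no match in the regional file, so missing divisions show up as `NA` rather than silently dropping rows.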

Part 3 - Exploratory data analysis

Variables

Voting Breakdown by County

The median county has about 40% of residents voting Democratic, and the median Republican share is also about 40%. There are clear outliers where some counties vote primarily Democratic or Republican (greater than 80% voting one way). Both distributions look approximately normal.

Percent Residents with College Education

On average, about 13.5% of residents per county have a college education. There are clear outliers: some counties have over 50% of residents with a college education.

There appears to be a positive association between the percent of residents who voted Republican and college education. For the percent of residents who voted Democratic, the association is less clear. There may be a negative relationship in counties where fewer than 15% of residents voted Democratic, and a positive relationship above that. We will look into this more later.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.70    9.20   11.80   13.49   15.60   53.40
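
As a rough check on the outlier claim, the quartiles in the summary above can be fed into the common 1.5 × IQR rule. This rule is one conventional choice and is not necessarily how the outliers were identified in this report.

```r
# Quartiles and maximum of the college variable, copied from the summary output.
q1  <- 9.20
q3  <- 15.60
max_college <- 53.40

iqr         <- q3 - q1          # interquartile range
upper_fence <- q3 + 1.5 * iqr   # values above this are flagged as high outliers

upper_fence
max_college > upper_fence
```

Under this rule, any county above the upper fence would be flagged, which includes the counties with more than 50% college-educated residents.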


Bringing Location into the Mix

Let’s investigate whether Division, a grouping of states into specific areas of the United States, has an influence on voting patterns. Do certain parts of the country vote certain ways?

  • There are definite differences in the slopes of the fitted lines across divisions
  • Most divisions appear to follow a linear pattern for both % Republican and % Democratic
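
One way to put numbers on the differing slopes is to fit a separate simple regression within each division and compare the slope estimates. The sketch below uses simulated data with invented slopes, and assumes the x-axis is a single numeric predictor such as `college`; it is not the code that produced the plots in this report.

```r
set.seed(1)
toy <- data.frame(
  Division = rep(c("Mountain", "New England"), each = 50),
  college  = runif(100, 5, 40)
)

# Simulated outcome with a different slope in each division (made-up numbers).
true_slope   <- ifelse(toy$Division == "Mountain", -0.2, 0.3)
toy$democrat <- 40 + true_slope * toy$college + rnorm(100, sd = 3)

# One simple regression per division; extract the fitted slope on college.
slopes <- sapply(split(toy, toy$Division), function(d) {
  coef(lm(democrat ~ college, data = d))[["college"]]
})
slopes
```

A single model with an interaction term, `lm(democrat ~ college * Division)`, would estimate the same per-division slopes while also testing whether they differ.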

Multiple Regression

I’m going to fit a multiple regression to predict the percent of a county that will vote for the Democratic nominee. My approach is backward elimination: starting from a full model, repeatedly remove the variable with the highest p-value, that is, the variable contributing least to predicting a county’s Democratic vote share.

First we will remove one variable from each pair of highly correlated predictors in order to reduce model complexity. From there, we will remove variables that are not statistically significant, meaning Pr(>|t|) > 0.05. We do this one variable at a time, because removing a variable changes the estimates and p-values of the variables that remain in the model.
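
The p-value-based backward elimination described above can be sketched as a loop. This example runs on simulated data with invented coefficients (the `noise` column is unrelated to the outcome by construction) and handles numeric predictors only. It is an illustration of the procedure, not the code used to produce the model output in this report.

```r
set.seed(42)
n <- 200
toy <- data.frame(
  income = rnorm(n, 30000, 5000),
  farm   = runif(n, 0, 30),
  noise  = rnorm(n)   # unrelated to the outcome by construction
)
toy$democrat <- 60 - 0.0004 * toy$income - 0.4 * toy$farm + rnorm(n, sd = 5)

predictors <- c("income", "farm", "noise")
repeat {
  fit <- lm(reformulate(predictors, response = "democrat"), data = toy)
  # p-values for every term except the intercept
  pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)", drop = FALSE]
  worst <- which.max(pvals[, 1])
  # Stop once every remaining term is significant (or only one term is left).
  if (pvals[worst, 1] <= 0.05 || length(predictors) == 1) break
  # Otherwise drop the single least significant term and refit.
  predictors <- setdiff(predictors, rownames(pvals)[worst])
}
formula(fit)
```

A factor such as Division contributes several coefficient rows at once, so in practice it would be kept or dropped as a whole (for example with `drop1(fit, test = "F")`) rather than row by row.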

## 
## Call:
## lm(formula = democrat ~ Division + pop.density + pop + pop.change + 
##     income + farm + white + turnout, data = county_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.303  -5.819  -0.487   5.429  37.088 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 6.972e+01  1.537e+00  45.367  < 2e-16 ***
## DivisionEast South Central  2.082e+00  6.568e-01   3.171 0.001535 ** 
## DivisionMiddle Atlantic    -1.400e+00  8.376e-01  -1.672 0.094691 .  
## DivisionMountain           -6.907e+00  6.745e-01 -10.241  < 2e-16 ***
## DivisionNew England         3.924e+00  1.140e+00   3.443 0.000582 ***
## DivisionPacific            -8.344e-01  8.660e-01  -0.964 0.335329    
## DivisionSouth Atlantic     -3.177e-01  6.088e-01  -0.522 0.601801    
## DivisionWest North Central -2.656e+00  5.823e-01  -4.560 5.30e-06 ***
## DivisionWest South Central -2.661e+00  6.170e-01  -4.312 1.67e-05 ***
## pop.density                 8.571e-04  1.183e-04   7.245 5.44e-13 ***
## pop                         2.586e-06  6.723e-07   3.847 0.000122 ***
## pop.change                 -8.163e-02  9.279e-03  -8.797  < 2e-16 ***
## income                     -3.983e-04  2.950e-05 -13.501  < 2e-16 ***
## farm                       -4.037e-01  2.874e-02 -14.046  < 2e-16 ***
## white                      -2.320e-01  1.253e-02 -18.515  < 2e-16 ***
## turnout                     1.289e-01  2.601e-02   4.956 7.60e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.538 on 3098 degrees of freedom
## Multiple R-squared:  0.3752, Adjusted R-squared:  0.3722 
## F-statistic:   124 on 15 and 3098 DF,  p-value: < 2.2e-16

This leaves us with the model:

$ \widehat{\text{democrat}} = 69.72 + 2.082\,\text{DivisionEastSouthCentral} - 1.400\,\text{DivisionMiddleAtlantic} - 6.907\,\text{DivisionMountain} + 3.924\,\text{DivisionNewEngland} - 0.8344\,\text{DivisionPacific} - 0.3177\,\text{DivisionSouthAtlantic} - 2.656\,\text{DivisionWestNorthCentral} - 2.661\,\text{DivisionWestSouthCentral} + 0.000857\,\text{pop.density} + 2.586 \times 10^{-6}\,\text{pop} - 0.0816\,\text{pop.change} - 0.000398\,\text{income} - 0.4037\,\text{farm} - 0.2320\,\text{white} + 0.1289\,\text{turnout} $

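Here we see our R-squared value is 0.3752, so the model accounts for 37.52% of the variability in county-level Democratic vote share. The Division coefficients are measured relative to the reference division omitted from the output (presumably East North Central, the one census division that does not appear). East South Central (+2.08 percentage points) and New England (+3.92) are the only divisions with a significantly higher predicted Democratic share than that baseline, holding the other variables fixed.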
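
As a consistency check, the adjusted R-squared in the output can be recovered from the multiple R-squared. This assumes $n = 3114$ counties, inferred from the 3098 residual degrees of freedom plus the 16 estimated coefficients, with $p = 15$ predictors excluding the intercept:

$$ R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.3752)\,\frac{3113}{3098} \approx 1 - 0.6278 = 0.3722 $$

This agrees with the reported value of 0.3722. The small gap between the two figures reflects how few parameters the model has relative to the number of counties.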

Part 4 - Inference

To run multiple regression, the residuals should be nearly normal and independent, the residual variability should be nearly constant, and each variable should be linearly related to the outcome. Based on the residual graphs below, I’m confident that these requirements are met. Not all variables examined earlier in this report were normally distributed (see % college educated), but the normality condition concerns the residuals rather than the predictors themselves, so we proceed.
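
The residual conditions listed above are usually checked with a few standard plots. The sketch below fits a model to simulated data (invented coefficients) and draws the three diagnostics; it is meant to show the kind of check involved, not to reproduce this report's graphs.

```r
set.seed(7)
toy   <- data.frame(x = runif(150, 0, 10))
toy$y <- 2 + 3 * toy$x + rnorm(150, sd = 1)
fit   <- lm(y ~ x, data = toy)

res <- resid(fit)

# Residuals vs fitted: curvature suggests non-linearity,
# a funnel shape suggests non-constant variance.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest nearly normal residuals.
qqnorm(res)
qqline(res)

# Histogram as a second look at the residual distribution.
hist(res, main = "Residuals", xlab = "Residual")
```

Independence cannot be read from these plots alone. For county data, neighbouring counties may have correlated residuals, which is worth keeping in mind when interpreting the p-values.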

Another method, which automates model selection, is the stepAIC() function from the MASS package. Instead of p-values, it searches for the set of variables that minimizes the model’s AIC.

## Start:  AIC=13370.28
## democrat ~ Division + pop.density + pop + pop.change + age6574 + 
##     crime + college + income + farm + white + turnout
## 
##               Df Sum of Sq    RSS   AIC
## - age6574      1     126.8 225399 13370
## <none>                     225272 13370
## - college      1     230.7 225503 13372
## - crime        1     397.7 225670 13374
## - turnout      1     724.7 225997 13378
## - pop          1    1153.6 226425 13384
## - pop.density  1    3487.8 228760 13416
## - pop.change   1    5349.3 230621 13441
## - income       1    7832.2 233104 13475
## - farm         1   14052.1 239324 13557
## - Division     8   15954.4 241226 13567
## - white        1   23992.7 249265 13683
## 
## Step:  AIC=13370.03
## democrat ~ Division + pop.density + pop + pop.change + crime + 
##     college + income + farm + white + turnout
## 
##               Df Sum of Sq    RSS   AIC
## <none>                     225399 13370
## - college      1     177.6 225576 13370
## - crime        1     357.8 225756 13373
## - pop          1    1216.2 226615 13385
## - turnout      1    1271.2 226670 13386
## - pop.density  1    3583.5 228982 13417
## - pop.change   1    5487.2 230886 13443
## - income       1    9320.9 234720 13494
## - farm         1   13936.9 239336 13555
## - Division     8   16158.2 241557 13570
## - white        1   24086.0 249485 13684
## 
## Call:
## lm(formula = democrat ~ Division + pop.density + pop + pop.change + 
##     crime + college + income + farm + white + turnout, data = county_data)
## 
## Coefficients:
##                (Intercept)  DivisionEast South Central  
##                  7.112e+01                   1.823e+00  
##    DivisionMiddle Atlantic            DivisionMountain  
##                 -1.688e+00                  -7.180e+00  
##        DivisionNew England             DivisionPacific  
##                  3.646e+00                  -8.881e-01  
##     DivisionSouth Atlantic  DivisionWest North Central  
##                 -4.900e-01                  -2.880e+00  
## DivisionWest South Central                 pop.density  
##                 -2.833e+00                   8.363e-04  
##                        pop                  pop.change  
##                  2.787e-06                  -8.061e-02  
##                      crime                     college  
##                 -1.894e-04                   5.971e-02  
##                     income                        farm  
##                 -4.213e-04                  -4.150e-01  
##                      white                     turnout  
##                 -2.326e-01                   1.133e-01
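
For reference, a minimal call to stepAIC() might look like the following sketch. It runs on simulated data with invented coefficients rather than on `county_data`, so its result is unrelated to the output above. The `direction = "backward"` setting is an assumption chosen to mirror the manual approach.

```r
library(MASS)  # provides stepAIC()

set.seed(123)
n <- 300
toy <- data.frame(
  income = rnorm(n, 30000, 5000),
  farm   = runif(n, 0, 30),
  noise  = rnorm(n)   # unrelated to the outcome by construction
)
toy$democrat <- 60 - 0.0004 * toy$income - 0.4 * toy$farm + rnorm(n, sd = 5)

full <- lm(democrat ~ income + farm + noise, data = toy)

# Backward search: at each step, drop the term whose removal lowers AIC the most;
# stop when no single removal lowers AIC any further.
selected <- stepAIC(full, direction = "backward", trace = FALSE)

formula(selected)
AIC(selected) <= AIC(full)
```

Setting `trace = TRUE` (the default behaviour prints each step) produces step-by-step tables like the ones shown in this report.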

stepAIC() dropped only age6574 and retained the ten terms listed below, which isn’t far from our manual model (the manual model additionally excludes crime and college). Division is among the most influential terms in both cases: in the final stepAIC table, removing it raises AIC from 13370 to 13570, an increase exceeded only by white (13684).

  • Division
  • pop.density
  • pop
  • pop.change
  • crime
  • college
  • income
  • farm
  • white
  • turnout

Part 5 - Conclusion

Division is clearly among the most influential parameters associated with how strongly a county votes Democratic. A future analysis that may prove insightful would include more complete data on whether counties are suburban or urban. We had a data column pmsa, Primary Metropolitan Statistical Areas, but it was mostly blank in our data set.

Here is the breakdown of states in the two divisions with the highest predicted Democratic share, New England and East South Central:

##    state           Division
## 1     AL East South Central
## 2     KY East South Central
## 3     MS East South Central
## 4     TN East South Central
## 5     CT        New England
## 6     ME        New England
## 7     MA        New England
## 8     NH        New England
## 9     RI        New England
## 10    VT        New England

References