Final Project DATA 606
Part 1 - Introduction
We’re going to investigate whether specific characteristics of a county’s residents can help us predict the share of that county’s vote that went to the Democratic Presidential nominee in 1992. I will focus on income, population density, and the percentage of college-educated residents, along with other variables.
This research question is important because if you can identify the characteristics associated with voting for Democrats vs. Republicans, you can begin to investigate the thoughts and ideas that lie behind those characteristics. As you learn more about people’s habits and thoughts, you can target them to sway their opinions and influence elections.
Part 2 - Data
Data collection: This dataset can be found on the Department of Biostatistics website at Vanderbilt University. We will also be pulling in a csv from GitHub that has regional data.
Cases: Each case is a county in the United States, described by aggregate characteristics of its residents at the time, such as income.
Variables: I’ll be looking at income (numerical, continuous), region (categorical, nominal), and percent of residents with a bachelor’s degree (numerical, continuous).
Type of study: This is an observational study, since I’m just looking at data that was already collected and published on the Department of Biostatistics website. No experiment was conducted to gather this data. Unfortunately, we don’t know more about how the data was collected.
Scope of inference - generalizability: The population of interest is all counties in the United States. Although it’s unclear, we’re going to assume that a sample of residents responded from each county and that this subset of respondents can be generalized to the population of each county. There is potential for bias because we aren’t sure how the data was collected. Whether it came from a census or another survey, the residents who actually respond often share underlying patterns that non-respondents do not.
Scope of inference - causality: Since this is an observational study, we can only show an association between variables, not a causal connection.
Part 3 - Exploratory data analysis
Import & Clean Data
library(dplyr)   # %>%, left_join(), filter(), select()
library(readr)   # read_csv()
# load() restores the `counties` data frame stored in the .sav file
load(url('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/counties.sav'))
state <- read_csv('https://raw.githubusercontent.com/devinteran/DATA606-Project/master/states.csv')
county_data <- left_join(counties,state,by = c("state" = "State Code"))
# keep counties that have at least one of the two vote shares recorded
county_data <- county_data %>% filter(!is.na(democrat) | !is.na(republican))
#remove msa and pmsa because the columns are very sparse
county_data <- county_data %>% dplyr::select(-msa,-pmsa)
Variables
Voting Breakdown by County
The median county gave about 40% of its vote to each of the Democratic and Republican nominees. There are clear outliers where some counties voted primarily Democratic or Republican (greater than 80% voting one way). Both variables look approximately normally distributed.
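The normality claim can be checked informally with a Q-Q plot and a Shapiro-Wilk test. The following is a minimal sketch on simulated percentages; the vector vote_share and its mean and standard deviation are assumptions standing in for county_data$democrat, not values taken from the dataset.

```r
# Hypothetical stand-in for county_data$democrat: simulated vote shares (percent).
set.seed(606)
vote_share <- rnorm(500, mean = 40, sd = 10)

# Normal Q-Q plot: points close to the reference line suggest approximate normality.
qqnorm(vote_share, main = "Normal Q-Q plot of vote share")
qqline(vote_share)

# Shapiro-Wilk test (base R accepts between 3 and 5000 observations).
# With thousands of counties even tiny departures give small p-values,
# so the result is best read together with the plot.
sw <- shapiro.test(vote_share)
sw$p.value
```

On the real data the same calls would take county_data$democrat and county_data$republican in place of vote_share.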
attach(county_data)
par(mfrow=c(1,2))
hist(county_data$democrat,main="% Residents Voted for Democrat",xlab="")
hist(county_data$republican,main="% Residents Voted for Republican",xlab="")
par(mfrow=c(1,2))
boxplot(county_data$democrat,main="% Residents Voted for Democrat",xlab="")
# second panel uses the republican column to match its title
boxplot(county_data$republican,main="% Residents Voted for Republican",xlab="")
Percent Residents with College Education
On average, about 13.5% of a county’s residents have a college education. There are clear outliers where some counties have over 50% of residents with a college education.
There appears to be a positive association between the percent of residents that voted Republican and the percent with a college education. For the percentage of residents that voted Democratic, the association is a little less clear. It appears there may be a negative relationship below 15% of residents voting Democratic, then a positive relationship. We will look into this more later.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.70 9.20 11.80 13.49 15.60 53.40
boxplot(county_data$college,main="% with Bachelor Degree",xlab="")
library(ggplot2)   # ggplot(), geom_point(), ggtitle(), ...
library(ggpubr)    # ggarrange()
# color is mapped to a constant, so every point takes the first Dark2 palette color
college_rep <- ggplot(county_data,aes(x=college,y=republican,color="Dark2")) +
  geom_point() +
  ggtitle("% with Bachelor Degree vs % Residents Voted for Republicans") +
  xlab("") +
  ylab("") +
  scale_color_brewer(palette = "Dark2")
college_dem <- ggplot(county_data,aes(x=college,y=democrat,color="Dark2")) +
  geom_point() +
  ggtitle("% with Bachelor Degree vs % Residents Voted for Democrat") +
  xlab("") +
  ylab("") +
  scale_color_brewer(palette = "Dark2")
# the color scale is already set on each plot, so nothing is added to the arranged figure
ggarrange(college_rep, college_dem,ncol = 2, nrow = 2)
Income
The mean income across all counties is about $28,000. There seems to be a positive relationship between % Republican voters and income, and potentially a negative relationship between % Democratic voters and income. Both plots show outlier counties where residents earn far more money.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10903 23820 27319 28366 31659 65201
boxplot(county_data$income,main="Median Family Income",xlab="")
inc_dem <- ggplot(county_data,aes(x=income,y=democrat,color="Dark2")) +
  geom_point() +
  ggtitle("Median Family Income vs % Residents Voted for Democrats") +
  xlab("") +
  ylab("") +
  scale_color_brewer(palette = "Dark2")
inc_rep <- ggplot(county_data,aes(x=income,y=republican,color="Dark2")) +
  geom_point() +
  ggtitle("Median Family Income vs % Residents Voted for Republicans") +
  xlab("") +
  ylab("") +
  scale_color_brewer(palette = "Dark2")
ggarrange(inc_rep, inc_dem,ncol = 2, nrow = 2)
Bringing Location into the Mix
Let’s investigate whether Division, which groups states into specific areas of the United States, has an influence on voting patterns. Do certain parts of the country vote certain ways?
- There are definite differences in the slopes of the lines across divisions
- Most divisions appear to follow a linear pattern between % Republican and % Democratic
ggplot(county_data,aes(x=republican,y=democrat,fill=Division,color=Division)) +
  geom_point() +
  ggtitle("Voter Breakdown") +
  xlab("% Republican") +
  ylab("% Democratic") +
  facet_wrap(~Division, ncol = 2, scales = "free_y")
Pairs Plotting
The following variables are correlated and should not both be included in the model:
- White & Black
- Age6574 & Age75
- College & Income
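Pairs like these can also be pulled out of a correlation matrix programmatically instead of being read off the plot. The sketch below uses a simulated data frame; the name toy, its columns, and the 0.7 cutoff are illustrative assumptions, not values from the county data.

```r
# Simulated stand-in for a few county columns (all values are made up).
set.seed(606)
n <- 200
white <- runif(n, 40, 100)
toy <- data.frame(
  white  = white,
  black  = 100 - white + rnorm(n, sd = 3),  # built to be strongly negatively correlated with white
  income = rnorm(n, 28000, 6000)            # generated independently of the other two
)

cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Assumed cutoff for "highly correlated"; the upper triangle avoids listing each pair twice.
cutoff <- 0.7
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
high_pairs
```

One variable from each listed pair would then be a candidate to leave out of the model.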
library(corrplot)   # corrplot()
county_sub <- county_data %>% dplyr::select(age6574,age75,white,black,college,income)
count_cor_matrix <- cor(county_sub)
corrplot(count_cor_matrix, type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45)
Multiple Regression
I’m going to create a multiple regression model to try to predict the percent of a county that will vote for the Democratic nominee. My approach is backward elimination: starting from a full model, repeatedly remove the variable with the highest p-value, that is, variables that do not help predict whether a county will vote for a Democratic Presidential nominee.
First we will remove one variable from each of the correlated pairs in order to reduce model complexity. From there, we will remove variables that are not statistically significant, meaning Pr(>|t|) > 0.05. We do this a single variable at a time, because variables can be related to one another, so removing one can change the significance of the others in the model.
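The one-variable-at-a-time procedure can be written as a loop. This is a minimal sketch on simulated data, not the code used for the county model: the data frame toy, its columns, and the use of drop1() with an F test (which tests a multi-level factor such as Division as a single term) are assumptions made for illustration.

```r
# Simulated data: y depends on x1 and x2; noise is unrelated to y by construction.
set.seed(606)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 2 + 1.5 * toy$x1 - 1.0 * toy$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + noise, data = toy)
alpha <- 0.05   # significance threshold described in the text

repeat {
  tests <- drop1(fit, test = "F")        # first row is <none>, then one row per term
  pvals <- tests[["Pr(>F)"]][-1]
  terms_left <- rownames(tests)[-1]
  # stop once every remaining term is significant at alpha
  if (length(pvals) == 0 || max(pvals) <= alpha) break
  worst <- terms_left[which.max(pvals)]  # least significant term
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}

formula(fit)
```

Refitting after each removal matters because the remaining p-values change once a term is gone.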
#remove crime
lm5 <- lm(democrat ~ Division + pop.density + pop + pop.change + income + farm + white + turnout,data = county_data)
summary(lm5)
##
## Call:
## lm(formula = democrat ~ Division + pop.density + pop + pop.change +
## income + farm + white + turnout, data = county_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.303 -5.819 -0.487 5.429 37.088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.972e+01 1.537e+00 45.367 < 2e-16 ***
## DivisionEast South Central 2.082e+00 6.568e-01 3.171 0.001535 **
## DivisionMiddle Atlantic -1.400e+00 8.376e-01 -1.672 0.094691 .
## DivisionMountain -6.907e+00 6.745e-01 -10.241 < 2e-16 ***
## DivisionNew England 3.924e+00 1.140e+00 3.443 0.000582 ***
## DivisionPacific -8.344e-01 8.660e-01 -0.964 0.335329
## DivisionSouth Atlantic -3.177e-01 6.088e-01 -0.522 0.601801
## DivisionWest North Central -2.656e+00 5.823e-01 -4.560 5.30e-06 ***
## DivisionWest South Central -2.661e+00 6.170e-01 -4.312 1.67e-05 ***
## pop.density 8.571e-04 1.183e-04 7.245 5.44e-13 ***
## pop 2.586e-06 6.723e-07 3.847 0.000122 ***
## pop.change -8.163e-02 9.279e-03 -8.797 < 2e-16 ***
## income -3.983e-04 2.950e-05 -13.501 < 2e-16 ***
## farm -4.037e-01 2.874e-02 -14.046 < 2e-16 ***
## white -2.320e-01 1.253e-02 -18.515 < 2e-16 ***
## turnout 1.289e-01 2.601e-02 4.956 7.60e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.538 on 3098 degrees of freedom
## Multiple R-squared: 0.3752, Adjusted R-squared: 0.3722
## F-statistic: 124 on 15 and 3098 DF, p-value: < 2.2e-16
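This leaves us with the model:
$$
\begin{aligned}
\widehat{\text{democrat}} = {} & 69.72 + 2.082\,\text{DivisionEastSouthCentral} - 1.400\,\text{DivisionMiddleAtlantic} - 6.907\,\text{DivisionMountain} \\
& + 3.924\,\text{DivisionNewEngland} - 0.834\,\text{DivisionPacific} - 0.318\,\text{DivisionSouthAtlantic} \\
& - 2.656\,\text{DivisionWestNorthCentral} - 2.661\,\text{DivisionWestSouthCentral} \\
& + 0.000857\,\text{pop.density} + 2.586 \times 10^{-6}\,\text{pop} - 0.0816\,\text{pop.change} \\
& - 0.000398\,\text{income} - 0.4037\,\text{farm} - 0.2320\,\text{white} + 0.1289\,\text{turnout}
\end{aligned}
$$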
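Here we see our R-squared value is 0.3752, so the model accounts for 37.52% of the variability in county-level Democratic vote share. Relative to the baseline division absorbed into the intercept (East North Central, the only division without its own coefficient), East South Central and New England are the only divisions with positive coefficients, so they are the most likely to return higher Democratic-voting counties when the other variables are held constant.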
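As a consistency check on the summary output, adjusted R-squared follows from R-squared, the number of observations (3098 residual degrees of freedom plus 16 coefficients, so 3114), and the 15 predictors: 1 - (1 - 0.3752) × 3113 / 3098 ≈ 0.3722, which matches the reported value. The sketch below shows both quantities computed by hand on a small simulated regression; the variables x and y are made up for illustration.

```r
# Simulated one-predictor regression (values are illustrative only).
set.seed(606)
n <- 200
x <- rnorm(n)
y <- 1 + 0.8 * x + rnorm(n)
fit <- lm(y ~ x)

# R-squared: share of the total variability in y explained by the model.
rss <- sum(residuals(fit)^2)       # residual sum of squares
tss <- sum((y - mean(y))^2)        # total sum of squares
r2_manual <- 1 - rss / tss

# Adjusted R-squared penalizes the number of predictors p.
p <- length(coef(fit)) - 1
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

c(manual = r2_manual, summary = summary(fit)$r.squared)
```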
Part 4: Inference
In order to run multiple regression, the residuals should be nearly normal and independent, the residual variability should be nearly constant, and the variables should be linearly related to the outcome. Based on the residual graphs below, I’m confident that the requirements are being met. Not all variables that were looked at earlier in this report were normally distributed (see % college educated), but we will assume they are close enough to a normal distribution to proceed.
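Residual checks of this kind can be produced with a few base R plots. This is a minimal sketch on a simulated fit; for the county model the object fit would be replaced by the fitted model (for example lm5), and the three-panel layout is just one possible choice.

```r
# Simulated regression standing in for the county model (values are made up).
set.seed(606)
n <- 200
x <- rnorm(n)
y <- 1 + 0.8 * x + rnorm(n)
fit <- lm(y ~ x)
res <- residuals(fit)

par(mfrow = c(1, 3))
# 1. Constant variability and linearity: residuals vs fitted values should show
#    an even band around zero with no curve or funnel shape.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)
# 2. Nearly normal residuals: points should follow the reference line.
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)
# 3. Independence: residuals in data order should show no trend or clustering.
plot(res, xlab = "Order of observation", ylab = "Residuals",
     main = "Residuals in data order")
abline(h = 0, lty = 2)
par(mfrow = c(1, 1))
```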
Another method that automates model selection is the function stepAIC() from the MASS package. This function searches for the set of model terms that gives the lowest AIC for you.
library(MASS)   # provides stepAIC()
lm_AIC <- lm(democrat ~ Division + pop.density + pop + pop.change + age6574 + crime + college + income + farm + white + turnout,data = county_data)
output <- stepAIC(lm_AIC,direction='backward')
## Start: AIC=13370.28
## democrat ~ Division + pop.density + pop + pop.change + age6574 +
## crime + college + income + farm + white + turnout
##
## Df Sum of Sq RSS AIC
## - age6574 1 126.8 225399 13370
## <none> 225272 13370
## - college 1 230.7 225503 13372
## - crime 1 397.7 225670 13374
## - turnout 1 724.7 225997 13378
## - pop 1 1153.6 226425 13384
## - pop.density 1 3487.8 228760 13416
## - pop.change 1 5349.3 230621 13441
## - income 1 7832.2 233104 13475
## - farm 1 14052.1 239324 13557
## - Division 8 15954.4 241226 13567
## - white 1 23992.7 249265 13683
##
## Step: AIC=13370.03
## democrat ~ Division + pop.density + pop + pop.change + crime +
## college + income + farm + white + turnout
##
## Df Sum of Sq RSS AIC
## <none> 225399 13370
## - college 1 177.6 225576 13370
## - crime 1 357.8 225756 13373
## - pop 1 1216.2 226615 13385
## - turnout 1 1271.2 226670 13386
## - pop.density 1 3583.5 228982 13417
## - pop.change 1 5487.2 230886 13443
## - income 1 9320.9 234720 13494
## - farm 1 13936.9 239336 13555
## - Division 8 16158.2 241557 13570
## - white 1 24086.0 249485 13684
##
## Call:
## lm(formula = democrat ~ Division + pop.density + pop + pop.change +
## crime + college + income + farm + white + turnout, data = county_data)
##
## Coefficients:
## (Intercept) DivisionEast South Central
## 7.112e+01 1.823e+00
## DivisionMiddle Atlantic DivisionMountain
## -1.688e+00 -7.180e+00
## DivisionNew England DivisionPacific
## 3.646e+00 -8.881e-01
## DivisionSouth Atlantic DivisionWest North Central
## -4.900e-01 -2.880e+00
## DivisionWest South Central pop.density
## -2.833e+00 8.363e-04
## pop pop.change
## 2.787e-06 -8.061e-02
## crime college
## -1.894e-04 5.971e-02
## income farm
## -4.213e-04 -4.150e-01
## white turnout
## -2.326e-01 1.133e-01
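The following terms were retained by this method, which isn’t far from our manual model creation; only age6574 was dropped from the starting model. In the final step of the trace, removing white would raise the AIC the most (to 13684), followed by Division (13570) and farm (13555), so Division is one of the most influential terms: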
- college
- crime
- turnout
- pop
- pop.density
- pop.change
- income
- farm
- Division
- white
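The AIC column in the stepAIC() trace comes from extractAIC(), which for a linear model is n·log(RSS/n) + 2·edf, where edf is the number of estimated coefficients. The sketch below reproduces that value by hand on simulated data; the data frame toy and its columns are assumptions made for illustration.

```r
# Simulated data (values are made up).
set.seed(606)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 0.5 * toy$x1 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = toy)

rss <- sum(residuals(fit)^2)   # residual sum of squares (the RSS column in the trace)
edf <- length(coef(fit))       # number of estimated coefficients, intercept included

aic_manual <- n * log(rss / n) + 2 * edf
aic_step   <- extractAIC(fit)[2]   # second element is the AIC; the first is edf

c(manual = aic_manual, extractAIC = aic_step)
```

This value differs from AIC(fit) by an additive constant that depends only on n, so both give the same ranking of models fitted to the same data.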
Part 5: Conclusion
Division is clearly one of the parameters most strongly associated with how Democratic a county votes, alongside white and farm. Some future analysis, which may prove insightful, would be to include more complete data on whether counties are suburban or urban. We had a data column pmsa, Primary Metropolitan Statistical Areas, but it was mainly blank in our data set.
Here are the states in the divisions most likely to vote Democratic, New England and East South Central:
## state Division
## 1 AL East South Central
## 2 KY East South Central
## 3 MS East South Central
## 4 TN East South Central
## 5 CT New England
## 6 ME New England
## 7 MA New England
## 8 NH New England
## 9 RI New England
## 10 VT New England
References
- Probability & Statistics Class - Lecture on Multiple Regression - April 22, 2020
- OpenIntro Statistics: Fourth Edition by David Diez, Mine Çetinkaya-Rundel, and Christopher D Barr