Data verification

  • The dataset contains county-level results for the 2016 US presidential election.
    • The first five rows look like this:
kable(uspres_results[1:5,])
county.fips  county.name  state.name      party  vote.count  county.total.count  national.party.percent  national.count  is.national.winner
45001        abbeville    south carolina  D      3741        10775               48.098104               135851595       FALSE
45001        abbeville    south carolina  O      271         10775               5.789663                135851595       FALSE
45001        abbeville    south carolina  R      6763        10775               46.112232               135851595       TRUE
22001        acadia       louisiana       D      5638        27389               48.098104               135851595       FALSE
22001        acadia       louisiana       O      589         27389               5.789663                135851595       FALSE
  • This dataset comes from a DataCamp class on election data
  • Some outliers have been removed
  • If anyone is interested, I really haven’t gotten far in the analysis (some graphs are mislabeled), and it may be offensive and typo-filled (I’m admittedly left-leaning), but here is the link to the overall project I have going so far. My main goal was to get used to the graphing packages.

Do Not Blindly Trust Data

  • How well does this data set reflect real-world facts?
  • One easy way to validate our data set is to compare summary statistics of the population in these observations with Wikipedia's figures on US demography
    • Population Of US
    • Age demographics
    • Race demographics
  • The results of comparing our data set to the wiki data are displayed below
    • I would say our data set is highly representative of actual real-world demographics: every figure is within a few points of the wiki value
    • This is very reassuring considering we have little understanding of how this data was collected
                  our_data_set     wiki_data
total_pop         310184565.00000  318000000.0
med_age           37.49154         37.0
white             63.27000         62.0
african_american  12.24000         12.6
asian             4.82000          5.2
hispanic          16.69000         17.0
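
To put a number on "highly representative", the comparison above can be turned into relative differences. This is a minimal sketch that retypes the figures from the table; the `comparison` data frame, the `pct_diff` and `flagged` columns, and the 10% tolerance are illustrative assumptions, not part of the original project.

```r
# Figures copied from the comparison table above.
comparison <- data.frame(
  metric       = c("total_pop", "med_age", "white",
                   "african_american", "asian", "hispanic"),
  our_data_set = c(310184565, 37.49154, 63.27, 12.24, 4.82, 16.69),
  wiki_data    = c(318000000, 37.0, 62.0, 12.6, 5.2, 17.0)
)

# Relative gap between our figure and the reference figure, in percent.
comparison$pct_diff <- 100 * (comparison$our_data_set - comparison$wiki_data) /
  comparison$wiki_data

# Hypothetical tolerance: flag anything more than 10% away from the reference.
comparison$flagged <- abs(comparison$pct_diff) > 10

print(comparison)
```

On these figures no metric is flagged; the largest relative gap is the Asian share, at roughly 7%.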

Data Exploration

Explore the relationship between white voters and party support

  • Let’s take a deeper look at how the share of white residents in a county can affect its voting behavior
## [1] "-0.590907977723247 Correlation between democratic party suport and white voters"
## [1] "0.53088523554694 Correlation between Republican party suport and white voters"

  • Moderately strong negative relationship between white voters and Democratic Party support
    • Correlation of -0.59
  • Looking at this relationship, we want to know how well a straight line fits the data.
    • In theory, the shaded area that the smoother draws around the blue regression line is a confidence band for the fitted line. At low levels of white voters the band gets wider, which indicates the line is estimated less precisely in that region, typically because fewer observations fall there.
    • We can also explore this by modelling the relationship and looking at the residuals
      • The added benefit of this approach is that it lets us plot several diagnostic checks, including outlier checks
  • Below I run a simple model, Democratic Party Support ~ percent_white. The plotted residuals and regression checks follow.
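
The correlation and the widening band described above can be reproduced on stand-in data. This sketch simulates counties rather than using the real `merged_df`; the `toy` data frame, its coefficients, and the evaluation points in `grid` are all assumptions made for illustration. It uses `predict()` with `interval = "confidence"` to get the band around the fitted line.

```r
set.seed(1)

# Simulated stand-in for the county data (NOT the real merged_df):
# most counties are heavily white, so low values of percent_white are sparse.
n <- 500
percent_white <- pmin(100, pmax(5, rnorm(n, mean = 80, sd = 15)))
dem_pct <- 0.75 - 0.005 * percent_white + rnorm(n, sd = 0.1)
toy <- data.frame(percent_white, dem_pct)

# Pearson correlation, the statistic quoted in the text.
r <- cor(toy$dem_pct, toy$percent_white)

# Simple regression and the confidence band for the fitted mean.
fit <- lm(dem_pct ~ percent_white, data = toy)
grid <- data.frame(percent_white = c(10, 50, 80))
band <- predict(fit, newdata = grid, interval = "confidence")
band_width <- band[, "upr"] - band[, "lwr"]

print(r)
print(cbind(grid, band_width))
```

The band is narrowest near the average value of `percent_white` and widest at the sparse low end, which is the pattern the shaded region shows.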

Modeling

  • I had to take out many columns because of likely collinearity issues
    • I built many exploratory factor variables, which correlate directly with specific columns; these were all removed
    • State and county names would add far too many independent variables, so they were removed
    • Total population was removed so that I could look at a population factor column ranging from small to large counties
    • During backward selection, per capita income is also removed.
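
One way to check a collinearity concern like the one above is a variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R^2). The sketch below is not the author's procedure. It uses simulated columns that borrow the names `total_population`, `white_pop`, and `median_age`, and `vif_one` is a hypothetical helper written with base R only.

```r
set.seed(2)

# Simulated stand-in data (NOT the real merged_df). total_population and
# white_pop are built to be nearly redundant, like a raw count and a
# column derived from it.
n <- 300
total_population <- rlnorm(n, meanlog = 10, sdlog = 1)
white_pop        <- 0.7 * total_population * runif(n, 0.95, 1.05)
median_age       <- rnorm(n, mean = 40, sd = 5)
predictors <- data.frame(total_population, white_pop, median_age)

# VIF for one predictor: regress it on the others, then 1 / (1 - R^2).
vif_one <- function(name, data) {
  others <- setdiff(names(data), name)
  r2 <- summary(lm(reformulate(others, response = name), data = data))$r.squared
  1 / (1 - r2)
}

vifs <- sapply(names(predictors), vif_one, data = predictors)
print(round(vifs, 1))
```

A common rule of thumb treats a VIF above about 10 as a sign that a column is largely explained by the others; the exact cutoff is a judgment call.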
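# Drop identifier, raw vote-count, Rep.pct, total population, and derived
# factor/population columns by position
df_2 <- merged_df[,-c(1,2,3,4,5,7,9,10,19,20,21,22,23,24,25,26,27,28,29)]
# Regress Democratic vote share on every remaining column
my_fit <- lm(Dem.pct ~ ., data = df_2)
# 2x2 grid for the diagnostic plots drawn later
layout(matrix(c(1, 2, 3, 4), 2, 2))
summary(my_fit)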
## 
## Call:
## lm(formula = Dem.pct ~ ., data = df_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44169 -0.06796 -0.00488  0.06118  0.42889 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.642e-01  2.687e-02  17.279  < 2e-16 ***
## O                  1.389e-06  5.276e-07   2.633 0.008515 ** 
## percent_white     -4.849e-03  2.607e-04 -18.600  < 2e-16 ***
## percent_black      1.122e-03  2.687e-04   4.174 3.08e-05 ***
## percent_asian      3.604e-03  1.046e-03   3.445 0.000578 ***
## percent_hispanic  -2.104e-03  2.816e-04  -7.471 1.03e-13 ***
## per_capita_income -1.521e-06  5.579e-07  -2.727 0.006430 ** 
## median_rent        2.534e-04  1.612e-05  15.717  < 2e-16 ***
## median_age        -1.129e-03  4.610e-04  -2.450 0.014359 *  
## voter_turnout      4.065e-01  3.300e-02  12.317  < 2e-16 ***
## county_age         2.607e-10  3.481e-10   0.749 0.454027    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09585 on 3081 degrees of freedom
## Multiple R-squared:  0.6037, Adjusted R-squared:  0.6024 
## F-statistic: 469.4 on 10 and 3081 DF,  p-value: < 2.2e-16
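
The backward selection mentioned earlier is not shown in the write-up. A minimal sketch of one common approach is base R's `step()`, which drops terms while AIC improves; this may differ from whatever criterion the author used. The data here are simulated, and `x_strong`, `x_noise`, and `toy` are made-up names.

```r
set.seed(3)

# Simulated stand-in data (NOT the real df_2): the response depends on
# x_strong only; x_noise is pure noise.
n <- 400
toy <- data.frame(x_strong = rnorm(n), x_noise = rnorm(n))
toy$y <- 0.5 + 0.8 * toy$x_strong + rnorm(n, sd = 0.5)

full_fit <- lm(y ~ ., data = toy)

# Backward elimination: start from the full model and drop terms
# while doing so lowers AIC.
reduced_fit <- step(full_fit, direction = "backward", trace = 0)

print(formula(reduced_fit))
```

Because `step()` works on AIC rather than p-values, a term with a modest p-value can survive, and a "significant" term can occasionally be dropped.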

Analysis From Summary Model Stats

  • percent_white is a statistically significant predictor of Democratic vote share (p < 2e-16)
  • The min and max of the residuals are roughly equal in magnitude, as are the quartiles, and the median is close to 0. The residuals are therefore roughly symmetric around zero, which is consistent with (though not proof of) a normal distribution.
  • The simple model Dem.pct ~ percent_white explains about 35% of the data’s variance (R-squared of 0.349); the fuller model summarized above has an R-squared of 0.604
  • Let’s take a deeper look at the residuals by plotting them
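
The 0.349 figure matches the square of the -0.59 correlation reported earlier (0.5909^2 ≈ 0.349), which is what a one-predictor regression should give. The sketch below checks that identity and pulls the residual quantiles out of a fit. It runs on simulated data, and the variable names are stand-ins, not the real `merged_df` columns.

```r
set.seed(4)

# Simulated stand-in data (NOT the real merged_df).
n <- 200
percent_white <- runif(n, 10, 100)
dem_pct <- 0.7 - 0.004 * percent_white + rnorm(n, sd = 0.1)

simple_fit  <- lm(dem_pct ~ percent_white)
fit_summary <- summary(simple_fit)

r_squared <- fit_summary$r.squared
r <- cor(dem_pct, percent_white)

# For a one-predictor model, R^2 equals the squared correlation.
print(c(r_squared = r_squared, r_squared_from_cor = r^2))

# Five-number view of the residuals, as printed at the top of summary().
print(quantile(residuals(simple_fit)))
```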

Plot Residuals

layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit)

  • This model diagnostic shows that at low levels of Democratic Party support, the white share of the population does an excellent job of predicting Democratic Party support.
  • The residuals are the errors: how far each observation falls from the fitted line. In theory, if the line correctly describes the relationship, then the residuals should have a consistent spread across all observations.
  • However, as Democratic support increases, we can see the model’s residuals start to become inconsistent. This can also be seen on the normal Q-Q plot
    • This is a relationship I will look into more deeply below by separating the data at the point where my residuals seem to start spreading out.
      • This deeper analysis will also reveal some outlier cases for us to analyze.
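
The "residuals spread out as support rises" reading of the diagnostic plots can also be checked numerically by comparing residual spread across bins of fitted values. This is a sketch on simulated data where the noise is built to grow with the predictor; `bins` and `spread` are illustrative names and the four-bin split is an arbitrary choice.

```r
set.seed(5)

# Simulated stand-in data (NOT the real merged_df): noise grows with x,
# so the residuals fan out at higher fitted values.
n <- 2000
x <- runif(n, 0, 1)
y <- 0.2 + 0.5 * x + rnorm(n, sd = 0.02 + 0.15 * x)
fit <- lm(y ~ x)

# Split the fitted values into four equal-count bins and compare
# the standard deviation of the residuals in each.
bins <- cut(fitted(fit),
            breaks = quantile(fitted(fit), probs = seq(0, 1, 0.25)),
            include.lowest = TRUE, labels = paste0("Q", 1:4))
spread <- tapply(residuals(fit), bins, sd)

print(round(spread, 3))
```

If the spread were roughly equal across bins, the constant-variance assumption would look reasonable; a steady increase from Q1 to Q4 is the fanning pattern described above.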

Run Alternative models

  • Model 1 will model how percent white influences vote share in low-Democratic counties (less than 40% Democratic)
    • I expect our residuals to look much better here
  • Model 2 will model how percent white influences vote share in high-Democratic counties (more than 50% Democratic)
    • I expect our residuals to be all over the place here
library(dplyr)

# Counties where Democrats took less than 40% of the vote
low_support <- merged_df %>% 
    filter(Dem.pct < 0.4) %>% 
    mutate(Dem.pct = round(Dem.pct, 2),
           Rep.pct = round(Rep.pct, 2))

# Counties where Democrats took more than 50% of the vote
high_support <- merged_df %>% 
    filter(Dem.pct > 0.5) %>% 
    mutate(Dem.pct = round(Dem.pct, 2),
           Rep.pct = round(Rep.pct, 2))

# cor(low_support$Rep.pct,low_support$percent_white)

# Model 1: low Democratic support, four diagnostic plots on a 2x2 grid
my_fit <- lm(Dem.pct ~ percent_white, data = low_support)
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit)
mtext("LOW DEMOCRATIC PARTY SUPPORT", side = 3, line = -1, outer = TRUE)

# Model 2: high Democratic support, same diagnostics
my_fit <- lm(Dem.pct ~ percent_white, data = high_support)
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit)
mtext("HIGH DEMOCRATIC PARTY SUPPORT", side = 3, line = -1, outer = TRUE)

Conclusions from Residuals

  • The above plots show
    • In areas of high Democratic support, the white share of the population is an inconsistent predictor of Democratic support levels
    • In areas of low Democratic support, the white share of the population is a consistent predictor of Democratic support levels

Explore outliers

  • The neat part about our residual checks and models is that they let us explore outliers in the data relatively easily
  • Let’s take a look at some of these outliers
  • First outlier analysis below
Outliers in low democratic support (one column per county)
(row)                     1125                  1296                  1407                  1579
county.fips               48269                 48301                 48311                 48357
county.name               king                  loving                mcmullen              ochiltree
state.name                texas                 texas                 texas                 texas
county.total.count        159                   65                    499                   3002
D                         5                     4                     40                    274
O                         5                     3                     5                     100
R                         149                   58                    454                   2628
Dem.pct                   0.03                  0.06                  0.08                  0.09
Rep.pct                   0.94                  0.89                  0.91                  0.88
total_population          321                   87                    616                   10467
percent_white             76                    51                    42                    48
percent_black             0                     0                     0                     0
percent_asian             0                     0                     0                     0
percent_hispanic          24                    34                    57                    50
per_capita_income         29836                 34068                 27375                 23382
median_rent               525                   525                   550                   544
median_age                46.9                  44.4                  40.6                  31.3
voter_turnout             0.4953271             0.7471264             0.8100649             0.2868062
party_supprt_levels       heavily republican    heavily republican    heavily republican    heavily republican
population_levels         under 10k             under 10k             under 10k             10k-100k
income_levels             75-95%                top 5%                75-95%                50-75%
african_american_strata   0                     0                     0                     0
white_strata              70-80                 50-60                 40-50                 40-50
hispanic_american_strata  10-30                 30-40                 50-60                 40-50
asian_american_strata     0                     0                     0                     0
white_pop                 243.96                44.37                 258.72                5024.16
black_pop                 0                     0                     0                     0
asian_pop                 0                     0                     0                     0
hispanic_pop              77.04                 29.58                 351.12                5233.50
county_age                15054.9               3862.8                25009.6               327617.1
Outliers in high democratic support (one column per county)
(row)                     254                   94                    309
county.fips               44005                 11001                 6077
county.name               newport               district of columbia  san joaquin
state.name                rhode island          district of columbia  california
county.total.count        41045                 311268                204595
D                         22851                 282830                108559
O                         3117                  15715                 13559
R                         15077                 12723                 82477
Dem.pct                   0.56                  0.91                  0.53
Rep.pct                   0.37                  0.04                  0.40
total_population          82545                 619371                693177
percent_white             87                    35                    35
percent_black             3                     49                    7
percent_asian             2                     3                     14
percent_hispanic          5                     10                    39
per_capita_income         40293                 45290                 22589
median_rent               1029                  1154                  872
median_age                43.7                  33.8                  32.9
voter_turnout             0.4972439             0.5025550             0.2951555
party_supprt_levels       heavily Democratic    heavily Democratic    5-15% Democratic
population_levels         10k-100k              100k-1 million        100k-1 million
income_levels             top 5%                top 5%                25-50%
african_american_strata   2-10                  40-70                 2-10
white_strata              80-90                 30-40                 30-40
hispanic_american_strata  2-10                  2-10                  30-40
asian_american_strata     0-2                   2-10                  10-20
white_pop                 71814.15              216779.85             242611.95
black_pop                 2476.35               303491.79             48522.39
asian_pop                 1650.90               18581.13              97044.78
hispanic_pop              4127.25               61937.10              270339.03
county_age                3607217               20934740              22805523


  • Our outliers reveal some interesting possibilities
    • In areas of low Democratic support, our outliers are counties that support the Democratic Party even less than would typically be expected.
    • These areas have extremely low levels of Democratic support despite sizable Hispanic populations (24% to 57%); two of them, McMullen and Ochiltree, are majority-minority communities
    • Even more striking, all four of these observations happen to be in Texas, with Ochiltree, Texas having a 50% Hispanic population yet only 9% Democratic Party support
    • Population size could have something to do with this, as three of the four counties have fewer than 1,000 residents
      • I believe voter turnout is significant here as well.
    • In areas of high Democratic support, our outliers are counties where our model underpredicts Democratic Party support
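
The write-up does not show how the outlier rows were selected. One hedged way to do it is to rank observations by standardized residual with `rstandard()`. The sketch below uses simulated data with one planted outlier; the `toy` frame, the planted row, and the cutoff of 3 are assumptions, not the author's settings.

```r
set.seed(6)

# Simulated stand-in data (NOT the real low_support/high_support frames).
n <- 100
toy <- data.frame(percent_white = runif(n, 20, 100))
toy$dem_pct <- 0.6 - 0.004 * toy$percent_white + rnorm(n, sd = 0.03)

# Plant one obvious outlier so the sketch has something to find.
toy$dem_pct[42] <- toy$dem_pct[42] + 0.5

fit <- lm(dem_pct ~ percent_white, data = toy)

# Standardized residuals; |value| > 3 is a common (assumed) cutoff.
toy$std_resid <- rstandard(fit)
outliers <- toy[abs(toy$std_resid) > 3, ]

# Largest residuals first.
outliers <- outliers[order(-abs(outliers$std_resid)), ]
print(outliers)
```

A positive standardized residual means the model underpredicts that observation, and a negative one means it overpredicts, which is the distinction drawn between the high-support and low-support outliers above.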

Observations from general exploration

  • The obvious point: counties with larger white populations tend to support Republicans more than Democrats
  • Areas with low Democratic support are very likely to have a high percentage of white residents
  • Areas with high Democratic support vary widely in the size of their white population
  • Put together: if I tell you an area leans Republican, you can assume with some confidence that it is largely white, but if I tell you an area has high Democratic support, you cannot assume that its white population is small.
    • Intuitively I think this makes sense. We have not even begun to get into the many social, racial, and economic layers that influence voting, but we know that, generally speaking, some areas with large white populations swing Democratic.

Limits of Data

  • Unfortunately these are exit polls, which presumably carry some sampling bias; that should be a concern when extrapolating to the general population.