Data verification
- The dataset contains county-level results from the 2016 US presidential election.
kable(uspres_results[1:5,])
| county.fips | county.name | state.name | party | vote.count | county.total.count | national.party.percent | national.count | is.national.winner |
|---|---|---|---|---|---|---|---|---|
| 45001 | abbeville | south carolina | D | 3741 | 10775 | 48.098104 | 135851595 | FALSE |
| 45001 | abbeville | south carolina | O | 271 | 10775 | 5.789663 | 135851595 | FALSE |
| 45001 | abbeville | south carolina | R | 6763 | 10775 | 46.112232 | 135851595 | TRUE |
| 22001 | acadia | louisiana | D | 5638 | 27389 | 48.098104 | 135851595 | FALSE |
| 22001 | acadia | louisiana | O | 589 | 27389 | 5.789663 | 135851595 | FALSE |
- This dataset is from a DataCamp class on election data
- Some outliers have been removed
- If anyone is interested, here is the link to the overall project I have going so far. I really haven’t gotten far in the analysis (some graphs are mislabeled), and it may be offensive and typo-filled (I’m admittedly left-leaning). My main goal was to get used to the graphing packages
Do Not Blindly Trust Data
- How representative of real world facts is this data set?
- One easy way to validate our dataset is to compare summary statistics of the population in these observations with data taken from Wiki US demography
- Population Of US
- Age demographics
- Race demographics
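- The results of comparing our data set to the wiki data set are displayed below
- I would say our data set is highly representative of real-world demographics: total population is within about 2.5% of the wiki figure, median age within half a year, and each racial share within about 1.3 percentage points
- This is very reassuring, considering we have little understanding of how this data was collected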
## Warning: Setting row names on a tibble is deprecated.
|  | our_data_set | wiki_data |
|---|---|---|
| total_pop | 310184565.00000 | 318000000.0 |
| med_age | 37.49154 | 37.0 |
| white | 63.27000 | 62.0 |
| african_american | 12.24000 | 12.6 |
| asian | 4.82000 | 5.2 |
| hispanic | 16.69000 | 17.0 |
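To put a number on how close the two columns are, the gaps in the table above can be computed directly. This is only a sketch that re-enters the table's values by hand; the column names `abs_diff` and `pct_diff` are my own additions, not part of the original analysis.

```r
# Values copied from the comparison table above
comparison <- data.frame(
  our_data_set = c(310184565, 37.49154, 63.27, 12.24, 4.82, 16.69),
  wiki_data    = c(318000000, 37.0, 62.0, 12.6, 5.2, 17.0),
  row.names    = c("total_pop", "med_age", "white",
                   "african_american", "asian", "hispanic")
)

# Absolute gap, and the gap relative to the wiki figure (in percent)
comparison$abs_diff <- comparison$our_data_set - comparison$wiki_data
comparison$pct_diff <- round(100 * comparison$abs_diff / comparison$wiki_data, 2)

print(comparison)
```

In relative terms the largest gap is the Asian share, which sits roughly 7% below the wiki figure even though the absolute gap is under half a percentage point.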
Data Exploration
Explore relationship between white voters and party support
- Let’s take a deeper look at how the white share of a county’s population can affect its voting behavior
## [1] "-0.590907977723247 Correlation between democratic party suport and white voters"
## [1] "0.53088523554694 Correlation between Republican party suport and white voters"

- Moderately strong negative relationship between a county’s white population share and Democratic party support (r ≈ -0.59)
- When looking at the relationship above, we want to know how well the line fits the data.
- In theory, the shaded band around the blue regression line is the confidence interval for the fitted line, so it shows how precisely the line is estimated. At low white population shares the band gets wider, indicating the fit is less certain in that region.
- We can also explore this by modelling the relationship and looking at the residuals
- The added benefit of this approach is that it lets us plot several diagnostic checks, including outlier checks
- Below I run a simple model, Democratic Party Support ~ percent_white. The plotted residuals and regression checks follow
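The widening band can be reproduced numerically. Here is a minimal sketch, assuming a small hypothetical data frame `toy` (invented values, not the real county data) with few observations at low `percent_white`. It uses `predict()` with `interval = "confidence"` to get the same kind of band a smoother draws.

```r
# Hypothetical toy data: sparse at low percent_white, dense at high values
toy <- data.frame(
  percent_white = c(20, 35, 60, 70, 75, 80, 85, 88, 90, 92, 95, 97),
  Dem.pct       = c(0.78, 0.66, 0.47, 0.44, 0.36, 0.37,
                    0.29, 0.31, 0.25, 0.27, 0.21, 0.22)
)
fit <- lm(Dem.pct ~ percent_white, data = toy)

# Sign of the linear relationship
r <- cor(toy$percent_white, toy$Dem.pct)

# 95% confidence interval for the fitted line at a sparse and a dense x value
new_x <- data.frame(percent_white = c(20, 80))
band  <- predict(fit, newdata = new_x, interval = "confidence")
band_width <- band[, "upr"] - band[, "lwr"]

print(r)
print(cbind(new_x, band, band_width))
```

The band is narrowest near the mean of `percent_white` and widens as we move away from it, which is why the sparse low end looks less certain.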
Modeling
- I had to take out many columns because of likely collinearity issues
- I built many exploratory factor variables that correlate directly with specific columns; these were all removed
- State and county names would add far too many independent variables, so they were removed
- Total population was removed so that I could instead use a population factor column ranging from small to large counties
- During backward selection, per capita income is also removed.
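# Drop identifier, raw count, Rep.pct, total_population and derived factor/population columns (by position)
df_2 <- merged_df[,-c(1,2,3,4,5,7,9,10,19,20,21,22,23,24,25,26,27,28,29)]
# Regress Democratic vote share on every remaining column
my_fit <- lm(Dem.pct ~ ., data = df_2)
summary(my_fit)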
##
## Call:
## lm(formula = Dem.pct ~ ., data = df_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44169 -0.06796 -0.00488 0.06118 0.42889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.642e-01 2.687e-02 17.279 < 2e-16 ***
## O 1.389e-06 5.276e-07 2.633 0.008515 **
## percent_white -4.849e-03 2.607e-04 -18.600 < 2e-16 ***
## percent_black 1.122e-03 2.687e-04 4.174 3.08e-05 ***
## percent_asian 3.604e-03 1.046e-03 3.445 0.000578 ***
## percent_hispanic -2.104e-03 2.816e-04 -7.471 1.03e-13 ***
## per_capita_income -1.521e-06 5.579e-07 -2.727 0.006430 **
## median_rent 2.534e-04 1.612e-05 15.717 < 2e-16 ***
## median_age -1.129e-03 4.610e-04 -2.450 0.014359 *
## voter_turnout 4.065e-01 3.300e-02 12.317 < 2e-16 ***
## county_age 2.607e-10 3.481e-10 0.749 0.454027
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09585 on 3081 degrees of freedom
## Multiple R-squared: 0.6037, Adjusted R-squared: 0.6024
## F-statistic: 469.4 on 10 and 3081 DF, p-value: < 2.2e-16
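The backward selection step mentioned above is not shown in the code. One way it could be done is with base R's `step()`; this is a sketch only, and `step()` selects by AIC, which may not be the criterion the original analysis used. The data frame `toy` and the column `noise_var` are hypothetical stand-ins.

```r
# Hypothetical toy data; names mirror the model above, values are simulated
set.seed(1)
n <- 200
toy <- data.frame(
  percent_white = runif(n, 20, 98),
  median_rent   = runif(n, 400, 1500),
  noise_var     = rnorm(n)  # unrelated to the response by construction
)
toy$Dem.pct <- 0.7 - 0.005 * toy$percent_white +
  0.0002 * toy$median_rent + rnorm(n, sd = 0.05)

# Start from the full model and let step() drop terms while AIC improves
full_fit    <- lm(Dem.pct ~ ., data = toy)
reduced_fit <- step(full_fit, direction = "backward", trace = 0)

# Terms that survived the elimination
kept_terms <- attr(terms(reduced_fit), "term.labels")
print(kept_terms)
```

Selecting by AIC rather than by individual p-values avoids dropping a term only because it narrowly misses a significance cutoff.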
Analysis From Summary Model Stats
- percent_white is a statistically significant predictor of Democratic vote share (t = -18.6, p < 2e-16)
- The min and max of the residuals are roughly symmetric, as are the quartiles, and the median is close to 0. This is consistent with roughly normally distributed residuals, though it does not prove it.
- The full model above explains about 60% of the variance in Dem.pct (multiple R-squared of 0.6037). percent_white alone explains about 35% (R-squared of .349, the square of the -0.59 correlation)
- Let’s take a deeper look at the residuals by plotting them
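The .349 figure is the square of the correlation printed earlier, because in a regression with a single predictor the R-squared equals the squared Pearson correlation. A short check follows; the data frame `toy` holds hypothetical values used only to confirm the identity.

```r
# Correlation between Democratic support and percent_white printed earlier
r <- -0.590907977723247
r_squared <- r^2
print(round(r_squared, 3))

# Same identity on hypothetical toy values
toy <- data.frame(
  percent_white = c(30, 45, 60, 70, 80, 90, 95),
  Dem.pct       = c(0.70, 0.55, 0.50, 0.38, 0.35, 0.24, 0.22)
)
fit <- lm(Dem.pct ~ percent_white, data = toy)
identity_holds <- isTRUE(all.equal(
  summary(fit)$r.squared,
  cor(toy$percent_white, toy$Dem.pct)^2
))
print(identity_holds)
```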
Plot Residuals
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit)

- This model diagnostic shows that at low levels of Democratic party support, the percentage of the white population does a good job of predicting Democratic party support.
- The residuals are the errors: how far each observation falls from the fitted line. In theory, if the line correctly explains the relationship, then the residuals would have a consistent spread across all observations.
- However, as Democratic support increases, the model’s residuals start to become inconsistent. This can also be seen in the normal Q-Q plot
- I will look into this relationship more deeply below by separating the data at the point where the residuals start to spread out.
- This deeper analysis will also reveal some outlier cases for us to analyze.
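The "inconsistent residuals" idea can also be checked with numbers instead of by eye. The sketch below uses hypothetical data (`toy`, built so the noise grows as `percent_white` falls) and compares the residual standard deviation in the lower and upper halves of the fitted values.

```r
# Hypothetical toy data: noise grows as percent_white falls
percent_white <- seq(20, 98, by = 2)
noise_size    <- 0.2 * (100 - percent_white) / 100
noise_sign    <- rep(c(1, -1), length.out = length(percent_white))
toy <- data.frame(
  percent_white = percent_white,
  Dem.pct       = 0.8 - 0.005 * percent_white + noise_sign * noise_size
)
fit <- lm(Dem.pct ~ percent_white, data = toy)

# Split at the median fitted value and compare residual spread in each half
high_fitted <- fitted(fit) > median(fitted(fit))
resid_sd <- tapply(resid(fit), high_fitted, sd)
names(resid_sd) <- c("low fitted support", "high fitted support")
print(resid_sd)
```

A clearly larger spread in one half is the numeric counterpart of the fan shape in the residuals-vs-fitted panel.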
Run Alternative models
- Model 1 will model how percent white influences vote share in low-Democratic counties (less than 40% Democratic)
- I expect our residuals to look much better here
- Model 2 will model how percent white influences vote share in high-Democratic counties (more than 50% Democratic)
- I expect our residuals to be all over the place here
library(dplyr)
# Counties with less than 40% Democratic vote share
low_support <- merged_df %>%
  filter(Dem.pct < 0.4) %>%
  mutate(Dem.pct = round(Dem.pct, 2),
         Rep.pct = round(Rep.pct, 2))
# Counties with more than 50% Democratic vote share
high_support <- merged_df %>%
  filter(Dem.pct > 0.5) %>%
  mutate(Dem.pct = round(Dem.pct, 2),
         Rep.pct = round(Rep.pct, 2))
# cor(low_support$Rep.pct,low_support$percent_white)
my_fit <- lm(Dem.pct ~ percent_white, data = low_support)
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit)
mtext("LOW DEMOCRATIC PARTY SUPPORT", side = 3, line = -1, outer = TRUE)

my_fit <- lm(Dem.pct ~ percent_white, data = high_support)
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit)
mtext("HIGH DEMOCRATIC PARTY SUPPORT", side = 3, line = -1, outer = TRUE)

Conclusions from Residuals
- The above plots show:
- In areas of high Democratic support, the white share of the population is an inconsistent predictor of Democratic support levels
- In areas of low Democratic support, the white share of the population is a consistent predictor of Democratic support levels
Explore outliers
- The neat part about our residual checks and models is that they let us explore outliers in the data relatively easily
- Let’s take a look at some of these outliers
- First outlier analysis below
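The code that picked the outlier rows is not shown. One common way to pull them out of a fitted model is to flag large standardized residuals with `rstandard()`. The sketch below assumes a hypothetical data frame `toy` with one planted outlier; the cutoff of 2 is a rule of thumb, not necessarily the one used here.

```r
# Hypothetical toy data with one planted outlier in row 10
toy <- data.frame(percent_white = seq(5, 100, by = 5))
toy$Dem.pct <- 0.8 - 0.005 * toy$percent_white + rep(c(0.01, -0.01), 10)
toy$Dem.pct[10] <- toy$Dem.pct[10] - 0.3

fit <- lm(Dem.pct ~ percent_white, data = toy)

# Flag rows whose standardized residual is beyond +/- 2
std_resid <- rstandard(fit)
is_outlier <- abs(std_resid) > 2
flagged <- toy[is_outlier, ]
flagged$std_resid <- std_resid[is_outlier]
print(flagged)
```

A negative standardized residual means the county supports Democrats less than the line predicts; a positive one means the model underpredicts.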
Outliers in low democratic support
|  | county.fips | county.name | state.name | county.total.count | D | O | R | Dem.pct | Rep.pct | total_population | percent_white | percent_black | percent_asian | percent_hispanic | per_capita_income | median_rent | median_age | voter_turnout | party_supprt_levels | population_levels | income_levels | african_american_strata | white_strata | hispanic_american_strata | asian_american_strata | white_pop | black_pop | asian_pop | hispanic_pop | county_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1125 | 48269 | king | texas | 159 | 5 | 5 | 149 | 0.03 | 0.94 | 321 | 76 | 0 | 0 | 24 | 29836 | 525 | 46.9 | 0.4953271 | heavily republican | under 10k | 75-95% | 0 | 70-80 | 10-30 | 0 | 243.96 | 0 | 0 | 77.04 | 15054.9 |
| 1296 | 48301 | loving | texas | 65 | 4 | 3 | 58 | 0.06 | 0.89 | 87 | 51 | 0 | 0 | 34 | 34068 | 525 | 44.4 | 0.7471264 | heavily republican | under 10k | top 5% | 0 | 50-60 | 30-40 | 0 | 44.37 | 0 | 0 | 29.58 | 3862.8 |
| 1407 | 48311 | mcmullen | texas | 499 | 40 | 5 | 454 | 0.08 | 0.91 | 616 | 42 | 0 | 0 | 57 | 27375 | 550 | 40.6 | 0.8100649 | heavily republican | under 10k | 75-95% | 0 | 40-50 | 50-60 | 0 | 258.72 | 0 | 0 | 351.12 | 25009.6 |
| 1579 | 48357 | ochiltree | texas | 3002 | 274 | 100 | 2628 | 0.09 | 0.88 | 10467 | 48 | 0 | 0 | 50 | 23382 | 544 | 31.3 | 0.2868062 | heavily republican | 10k-100k | 50-75% | 0 | 40-50 | 40-50 | 0 | 5024.16 | 0 | 0 | 5233.50 | 327617.1 |
Outliers in high democratic support
|  | county.fips | county.name | state.name | county.total.count | D | O | R | Dem.pct | Rep.pct | total_population | percent_white | percent_black | percent_asian | percent_hispanic | per_capita_income | median_rent | median_age | voter_turnout | party_supprt_levels | population_levels | income_levels | african_american_strata | white_strata | hispanic_american_strata | asian_american_strata | white_pop | black_pop | asian_pop | hispanic_pop | county_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 254 | 44005 | newport | rhode island | 41045 | 22851 | 3117 | 15077 | 0.56 | 0.37 | 82545 | 87 | 3 | 2 | 5 | 40293 | 1029 | 43.7 | 0.4972439 | heavily Democratic | 10k-100k | top 5% | 2-10 | 80-90 | 2-10 | 0-2 | 71814.15 | 2476.35 | 1650.90 | 4127.25 | 3607217 |
| 94 | 11001 | district of columbia | district of columbia | 311268 | 282830 | 15715 | 12723 | 0.91 | 0.04 | 619371 | 35 | 49 | 3 | 10 | 45290 | 1154 | 33.8 | 0.5025550 | heavily Democratic | 100k-1 million | top 5% | 40-70 | 30-40 | 2-10 | 2-10 | 216779.85 | 303491.79 | 18581.13 | 61937.10 | 20934740 |
| 309 | 6077 | san joaquin | california | 204595 | 108559 | 13559 | 82477 | 0.53 | 0.40 | 693177 | 35 | 7 | 14 | 39 | 22589 | 872 | 32.9 | 0.2951555 | 5-15% Democratic | 100k-1 million | 25-50% | 2-10 | 30-40 | 30-40 | 10-20 | 242611.95 | 48522.39 | 97044.78 | 270339.03 | 22805523 |
- Our outliers reveal some interesting possibilities
- In areas of low Democratic support, our outliers are counties that support the Democratic party even less than would typically be expected.
- These areas have extremely low levels of Democratic party support (3% to 9%) despite large Hispanic populations (24% to 57%); two of the four, McMullen and Ochiltree, are majority-minority
- Even more striking, all four of these observations are in Texas, with Ochiltree County having a 50% Hispanic population yet only 9% Democratic party support
- Population size could have something to do with this, as three of the four counties have fewer than 1,000 residents
- I believe voter turnout is significant here as well.
- For areas of high Democratic support, our outliers are counties where our model underpredicts Democratic party support
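A quick check of the four Texas counties against the low-support outlier table shows which ones are actually majority-minority. The values below are copied from that table. Treating `percent_white` below 50 as "majority-minority" is my assumption about what the column measures.

```r
# Values copied from the low-support outlier table
texas_outliers <- data.frame(
  county           = c("king", "loving", "mcmullen", "ochiltree"),
  percent_white    = c(76, 51, 42, 48),
  percent_hispanic = c(24, 34, 57, 50),
  Dem.pct          = c(0.03, 0.06, 0.08, 0.09)
)

# Assumption: majority-minority means the white share is below 50%
texas_outliers$majority_minority <- texas_outliers$percent_white < 50
print(texas_outliers)
```

On these figures McMullen and Ochiltree are majority-minority, while King and Loving are not.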
Observations from general exploration
- The obvious point: counties with larger white populations tend to support Republicans more than Democrats
- Areas with low Democratic support are very likely to have a high percentage of white residents
- Areas with high Democratic support vary widely in the size of their white population
- Put together, this means that if I tell you an area leans Republican, you can assume with some confidence that it is a largely white county, but you can’t assume that an area with high Democratic support has a small white population.
- Intuitively I think this makes sense. We have not even begun to get into the many social, racial and economic layers that influence voting. But we do know that, generally speaking, some areas with large white populations swing Democratic.
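The asymmetry described above is a statement about conditional proportions, and a row-wise `prop.table()` makes it concrete. The counties in this sketch are invented purely for illustration; `leans` and `mostly_white` are hypothetical column names.

```r
# Hypothetical toy counties, invented to illustrate the asymmetry
toy <- data.frame(
  leans        = c("R", "R", "R", "R", "R", "R", "D", "D", "D", "D"),
  mostly_white = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE,
                   TRUE, TRUE, FALSE, FALSE)
)

# Row-wise proportions: share of mostly-white counties within each lean
share <- prop.table(table(toy$leans, toy$mostly_white), margin = 1)
print(round(share, 2))
```

In this toy example most Republican-leaning counties are mostly white, while Democratic-leaning counties split evenly, so knowing a county leans Democratic tells you little about its white share.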
Limits of Data
- Unfortunately these are exit polls, which presumably carry some sampling bias; that should be a concern when extrapolating to the general population.