Data verification

The dataset contains county-level results from the 2016 US presidential election.
- The first five rows look like this:

kable(uspres_results[1:5,])

county.fips	county.name	state.name	party	vote.count	county.total.count	national.party.percent	national.count	is.national.winner
45001	abbeville	south carolina	D	3741	10775	48.098104	135851595	FALSE
45001	abbeville	south carolina	O	271	10775	5.789663	135851595	FALSE
45001	abbeville	south carolina	R	6763	10775	46.112232	135851595	TRUE
22001	acadia	louisiana	D	5638	27389	48.098104	135851595	FALSE
22001	acadia	louisiana	O	589	27389	5.789663	135851595	FALSE

This dataset comes from a DataCamp course on election data.
- The course is Analyzing Election and Polling Data in R
Some outliers have been removed.
If anyone is interested, here is the link to the overall project I have going so far. I haven't gotten far in the analysis (some graphs are mislabeled), and it may be offensive and typo-filled (I'm admittedly left-leaning). My main goal was to get used to the graphing packages.
- myproject

Do Not Blindly Trust Data

How representative of real-world facts is this dataset?
One easy way to validate our dataset is to compare summary statistics of the population in these observations against figures from Wikipedia's US demographics pages:
- Population of the US
- Age demographics
- Race demographics
The results of comparing our dataset to the Wikipedia figures are displayed below.
- I would say our dataset is highly representative of real-world demographics: the demographic percentages agree to within about 1.3 points, median age to within half a year, and total population to within about 2.5%
- This is reassuring, considering we have little understanding of how this data was collected

## Warning: Setting row names on a tibble is deprecated.

	our_data_set	wiki_data
total_pop	310184565.00000	318000000.0
med_age	37.49154	37.0
white	63.27000	62.0
african_american	12.24000	12.6
asian	4.82000	5.2
hispanic	16.69000	17.0
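A minimal sketch of how a comparison like the one above might be computed. The data frame `counties` and its three toy rows are hypothetical stand-ins, not the real data, and the helper `weighted_summary` is an illustrative name. It assumes the county percentages should be weighted by county population; the weighted mean of county median ages is only an approximation of a national median.

```r
# Hypothetical county-level data; values are made up for illustration only
counties <- data.frame(
  total_population = c(1000, 3000, 6000),
  percent_white    = c(90, 60, 50),
  median_age       = c(45, 38, 35)
)

# Weight each county by its population so large counties count for more
weighted_summary <- function(df) {
  w <- df$total_population
  c(
    total_pop = sum(w),
    med_age   = weighted.mean(df$median_age, w),  # approximation, not a true national median
    white     = weighted.mean(df$percent_white, w)
  )
}

ours <- weighted_summary(counties)

# Reference figures copied from the wiki_data column of the table above
reference <- c(total_pop = 318000000, med_age = 37, white = 62)

# Side-by-side comparison; the toy rows will not match the reference
comparison <- data.frame(our_data_set = ours, wiki_data = reference)
print(comparison)
```

Weighting matters here: an unweighted average of county percentages would let a county of 87 people count as much as one of several hundred thousand.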

Data Exploration

Explore relationship with white voters and party support

Let’s take a closer look at how the share of white residents in a county can affect its voting behavior.

## [1] "-0.590907977723247 Correlation between democratic party suport and white voters"

## [1] "0.53088523554694 Correlation between Republican party suport and white voters"

There is a fairly strong negative relationship between the share of white residents and Democratic party support
- Correlation of -0.59
When looking at the relationship above, we want to know how well the line fits the data.
- The shaded area around the blue regression line is the confidence band drawn by the smoother; it shows how uncertain the fitted line is. At low levels of white population the band gets wider, indicating the fit is less certain in that region.
- We can also explore this by modeling the relationship and looking at the residuals
  - The added benefit of this approach is that it lets us plot many error checks, including outlier checks
Below I run a simple model, Democratic party support ~ percent_white. The plotted residuals and regression checks follow.
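As a rough illustration of the two ideas above (the correlation and the width of the confidence band), here is a sketch on synthetic data. The data frame `toy` and its coefficients are invented for the example and only mimic the direction of the real relationship; the column names `percent_white` and `Dem.pct` are borrowed from the report.

```r
set.seed(1)

# Synthetic stand-in for the county data: Democratic share falls as percent_white rises
n <- 300
percent_white <- runif(n, 10, 100)
Dem.pct <- 0.75 - 0.005 * percent_white + rnorm(n, sd = 0.1)
toy <- data.frame(percent_white, Dem.pct)

simple_fit <- lm(Dem.pct ~ percent_white, data = toy)

# With a single predictor, R-squared equals the squared correlation
r  <- cor(toy$percent_white, toy$Dem.pct)
r2 <- summary(simple_fit)$r.squared
print(all.equal(r^2, r2))

# Width of the 95% confidence band for the fitted mean at three values of percent_white
grid <- data.frame(percent_white = c(10, 55, 100))
band <- predict(simple_fit, newdata = grid, interval = "confidence")
band_width <- band[, "upr"] - band[, "lwr"]
print(cbind(grid, band_width))
```

The band is narrowest near the average value of the predictor and widens toward the edges, and it widens further wherever observations are sparse. A wide band therefore says the line is less well pinned down in that region, not necessarily that individual counties sit far from it.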

Modeling

I had to take out many columns because of likely collinearity issues
- I built many exploratory factor variables that map directly onto specific columns; these were all removed
- State and county names would add far too many independent variables, so they were removed
- Total population was removed so that I could instead look at a population factor column ranging from small to large counties
- During backward selection, per capita income was also removed.
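One common way to check for the kind of collinearity described above is the variance inflation factor (VIF). The sketch below computes it by hand in base R on synthetic predictors; `x1`, `x2`, `x3`, and the helper `vif_one` are hypothetical names, and this is not necessarily how the columns above were chosen.

```r
set.seed(42)

# Synthetic predictors: x2 is nearly a copy of x1, x3 is unrelated to both
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# VIF for one column: regress it on the other columns, then take 1 / (1 - R^2)
vif_one <- function(col, df) {
  fit <- lm(reformulate(setdiff(names(df), col), response = col), data = df)
  1 / (1 - summary(fit)$r.squared)
}

vif <- sapply(names(predictors), vif_one, df = predictors)
print(round(vif, 1))
```

A frequently quoted rule of thumb treats values above roughly 5 to 10 as a sign that a column is largely explained by the others. Here `x1` and `x2` should land far above that range while `x3` stays near 1.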

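# Drop, by position, the columns described above (names, derived factors, total population)
df_2 <- merged_df[,-c(1,2,3,4,5,7,9,10,19,20,21,22,23,24,25,26,27,28,29)]
# Regress Democratic vote share on every remaining column
my_fit <- lm(Dem.pct ~ ., data = df_2)
summary(my_fit)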

## 
## Call:
## lm(formula = Dem.pct ~ ., data = df_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44169 -0.06796 -0.00488  0.06118  0.42889 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.642e-01  2.687e-02  17.279  < 2e-16 ***
## O                  1.389e-06  5.276e-07   2.633 0.008515 ** 
## percent_white     -4.849e-03  2.607e-04 -18.600  < 2e-16 ***
## percent_black      1.122e-03  2.687e-04   4.174 3.08e-05 ***
## percent_asian      3.604e-03  1.046e-03   3.445 0.000578 ***
## percent_hispanic  -2.104e-03  2.816e-04  -7.471 1.03e-13 ***
## per_capita_income -1.521e-06  5.579e-07  -2.727 0.006430 ** 
## median_rent        2.534e-04  1.612e-05  15.717  < 2e-16 ***
## median_age        -1.129e-03  4.610e-04  -2.450 0.014359 *  
## voter_turnout      4.065e-01  3.300e-02  12.317  < 2e-16 ***
## county_age         2.607e-10  3.481e-10   0.749 0.454027    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09585 on 3081 degrees of freedom
## Multiple R-squared:  0.6037, Adjusted R-squared:  0.6024 
## F-statistic: 469.4 on 10 and 3081 DF,  p-value: < 2.2e-16

Analysis From Summary Model Stats

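percent_white is statistically significant in explaining how counties voted (p < 2e-16).
The min and max of the residuals are in line with each other, as are the quartiles, and the median is close to 0. This suggests the residuals are roughly symmetric around zero, which is consistent with (though not proof of) a normal distribution.
The full model summarized above has a multiple R-squared of 0.6037, so it explains about 60% of the variance in Dem.pct. The simple model Dem.pct ~ percent_white on its own explains about 35% (R-squared of 0.349, the square of the -0.59 correlation).
Let's take a deeper look at the residuals by plotting them.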

Plot Residuals

layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit)

This model diagnostic shows that at low levels of Democratic party support, the percentage of the white population does an excellent job of predicting Democratic party support.
The residuals are the errors: how far each observation falls from the fitted line. If the line correctly explains the relationship, then the spread of the residuals should be consistent across all observations.
However, as Democratic support increases, the model's residuals become inconsistent. This can also be seen on the normal Q-Q plot.
- This is a relationship I will look into more deeply below by separating the data at the points where the residuals start to spread out.
  - This deeper analysis will also reveal some outlier cases for us to analyze.
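The "inconsistent residuals" pattern described above can also be checked numerically rather than only by eye. This is a minimal sketch on synthetic data whose noise grows with the predictor by construction; the variables `x` and `y` are invented, and the split at the median fitted value is just one simple choice.

```r
set.seed(7)

# Synthetic data whose noise grows with the predictor (heteroscedastic by construction)
n <- 1000
x <- runif(n, 0, 1)
y <- 0.2 + 0.5 * x + rnorm(n, sd = 0.02 + 0.2 * x)
fit <- lm(y ~ x)

# Split observations at the median fitted value and compare residual spread
res    <- resid(fit)
upper  <- fitted(fit) > median(fitted(fit))
spread <- c(low_fitted = sd(res[!upper]), high_fitted = sd(res[upper]))
print(round(spread, 3))

# Shapiro-Wilk test of residual normality; a small p-value suggests non-normal residuals
print(shapiro.test(res)$p.value)
```

If the residual standard deviation differs a lot between the two halves, the constant-variance assumption behind the standard errors in `summary()` is doubtful, which matches what a fanning-out residuals-vs-fitted plot shows.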

Run Alternative models

Model 1 will model how percent white influences vote share in low-Democratic counties (less than 40% Democratic)
- I expect our residuals to look much better here
Model 2 will model how percent white influences vote share in high-Democratic counties (more than 50% Democratic)
- I expect our residuals to be all over the place here
Counties between 40% and 50% Democratic are left out of both models.

library(dplyr)

# Counties where Democrats took less than 40% of the vote
low_support <- merged_df %>% 
    filter(Dem.pct < 0.4) %>% 
    mutate(Dem.pct = round(Dem.pct, 2),
           Rep.pct = round(Rep.pct, 2))

# Counties where Democrats took more than 50% of the vote
high_support <- merged_df %>% 
    filter(Dem.pct > 0.5) %>% 
    mutate(Dem.pct = round(Dem.pct, 2),
           Rep.pct = round(Rep.pct, 2))

# cor(low_support$Rep.pct,low_support$percent_white)

# Model 1: low-support counties, with the four standard diagnostic plots in a 2x2 grid
my_fit <- lm(Dem.pct ~ percent_white, data = low_support)
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit)
mtext("LOW DEMOCRATIC PARTY SUPPORT", side = 3, line = -1, outer = TRUE)

# Model 2: high-support counties, same diagnostics
my_fit <- lm(Dem.pct ~ percent_white, data = high_support)
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit)
mtext("HIGH DEMOCRATIC PARTY SUPPORT", side = 3, line = -1, outer = TRUE)

Conclusions from Residuals

The above plots show
- In areas of high Democratic support, the percentage of the population that is white is an inconsistent predictor of Democratic support levels
- In areas of low Democratic support, the percentage of the population that is white is a consistent predictor of Democratic support levels
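To put numbers next to the two sets of diagnostic plots, one option is a small helper that fits the same one-predictor model on each subset and returns comparable statistics. The sketch below uses synthetic data; `toy`, `fit_subset`, and the noise pattern are assumptions made for illustration, and only the 0.4 and 0.5 cutoffs come from the code above.

```r
set.seed(3)

# Synthetic counties; noise is larger where percent_white is low
n <- 2000
percent_white <- runif(n, 10, 100)
Dem.pct <- 0.8 - 0.006 * percent_white +
  rnorm(n, sd = 0.15 - 0.001 * percent_white)
toy <- data.frame(percent_white, Dem.pct)

# Fit the same one-predictor model on a subset and return comparable summary numbers
fit_subset <- function(df) {
  s <- summary(lm(Dem.pct ~ percent_white, data = df))
  c(n = nrow(df), sigma = s$sigma, r_squared = s$r.squared)
}

results <- rbind(
  low_support  = fit_subset(toy[toy$Dem.pct < 0.4, ]),
  high_support = fit_subset(toy[toy$Dem.pct > 0.5, ])
)
print(round(results, 3))
```

Here `sigma` is the residual standard error, so a larger value in one row means a looser fit in that subset. One thing to keep in mind when reading such a table: both subsets are defined by cutting on the response itself, which truncates `Dem.pct` and can change the residual pattern on its own.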

Explore outliers

The neat part about our residual checks and models is that they let us explore outliers in the data relatively easily.
Let's take a look at some of these outliers.
The first outlier analysis is below.

Outliers in low democratic support
	county.fips	county.name	state.name	county.total.count	D	O	R	Dem.pct	Rep.pct	total_population	percent_white	percent_hispanic	per_capita_income	median_rent	median_age	voter_turnout	party_supprt_levels	population_levels	income_levels	white_strata	hispanic_american_strata	white_pop	hispanic_pop	county_age
1125	48269	king	texas	159	5	5	149	0.03	0.94	321	76	24	29836	525	46.9	0.4953271	heavily republican	under 10k	75-95%	70-80	10-30	243.96	77.04	15054.9
1296	48301	loving	texas	65	4	3	58	0.06	0.89	87	51	34	34068	525	44.4	0.7471264	heavily republican	under 10k	top 5%	50-60	30-40	44.37	29.58	3862.8
1407	48311	mcmullen	texas	499	40	5	454	0.08	0.91	616	42	57	27375	550	40.6	0.8100649	heavily republican	under 10k	75-95%	40-50	50-60	258.72	351.12	25009.6
1579	48357	ochiltree	texas	3002	274	100	2628	0.09	0.88	10467	48	50	23382	544	31.3	0.2868062	heavily republican	10k-100k	50-75%	40-50	40-50	5024.16	5233.50	327617.1

Outliers in high democratic support
	county.fips	county.name	state.name	county.total.count	D	O	R	Dem.pct	Rep.pct	total_population	percent_white	percent_black	percent_asian	percent_hispanic	per_capita_income	median_rent	median_age	voter_turnout	party_supprt_levels	population_levels	income_levels	african_american_strata	white_strata	hispanic_american_strata	asian_american_strata	white_pop	black_pop	asian_pop	hispanic_pop	county_age
254	44005	newport	rhode island	41045	22851	3117	15077	0.56	0.37	82545	87	3	2	5	40293	1029	43.7	0.4972439	heavily Democratic	10k-100k	top 5%	2-10	80-90	2-10	0-2	71814.15	2476.35	1650.90	4127.25	3607217
94	11001	district of columbia	district of columbia	311268	282830	15715	12723	0.91	0.04	619371	35	49	3	10	45290	1154	33.8	0.5025550	heavily Democratic	100k-1 million	top 5%	40-70	30-40	2-10	2-10	216779.85	303491.79	18581.13	61937.10	20934740
309	6077	san joaquin	california	204595	108559	13559	82477	0.53	0.40	693177	35	7	14	39	22589	872	32.9	0.2951555	5-15% Democratic	100k-1 million	25-50%	2-10	30-40	30-40	10-20	242611.95	48522.39	97044.78	270339.03	22805523

Our outliers reveal some interesting possibilities
- In areas of low Democratic support, our outliers are counties that support the Democratic party even less than would typically be expected.
- These areas have extremely low levels of Democratic support (3% to 9%) despite sizable Hispanic populations (24% to 57%); McMullen and Ochiltree are majority-minority counties
- Even more striking, all four of these observations are in Texas, with Ochiltree County being 50% Hispanic yet showing only 9% Democratic party support
- Population size could have something to do with this, as three of the four counties have fewer than 1,000 residents
  - I believe voter turnout is significant here as well.
- In areas of high Democratic support, our outliers are counties where our model underpredicts Democratic party support
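Outlier rows like the ones listed above can be pulled from a fitted model with standardized residuals. This is a sketch on synthetic data with one deliberately planted outlier; the `toy` data frame and the choice of the top four rows are assumptions for the example, not the procedure actually used for the tables above.

```r
set.seed(11)

# Synthetic counties with one planted outlier in the last row
n <- 200
toy <- data.frame(
  county.name   = paste0("county_", seq_len(n)),
  percent_white = runif(n, 10, 100)
)
toy$Dem.pct <- 0.65 - 0.005 * toy$percent_white + rnorm(n, sd = 0.05)
toy$Dem.pct[n] <- toy$Dem.pct[n] + 0.35  # planted: far more Democratic than the line predicts

fit <- lm(Dem.pct ~ percent_white, data = toy)

# Standardized residuals; large absolute values mark poorly fitted counties
toy$std_resid <- rstandard(fit)

# The four most extreme counties, largest absolute residual first
outliers <- toy[order(-abs(toy$std_resid)), ][1:4, ]
print(outliers[, c("county.name", "percent_white", "Dem.pct", "std_resid")])
```

The sign of `std_resid` tells the direction of the miss: positive means the county is more Democratic than the model predicts (the model underpredicts), negative means it is less Democratic than predicted.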

Observations from general exploration

The obvious point: counties with larger white populations tend to support Republicans more than Democrats.
Areas with low Democratic support are very likely to have a high percentage of white residents.
Areas with high Democratic support vary widely in the size of their white population.
Put together, this means that if I tell you a county leans Republican, you can assume with some confidence that it is largely white, but if a county has high Democratic support, you can't necessarily assume it has a small white population.
- Intuitively I think this makes sense. We have not even begun to get into the multiple layers of social, racial, and economic factors that influence voting. But we know that, generally speaking, some areas with large white populations swing Democratic.

Limits of Data

Unfortunately, these are exit polls, which presumably carry some sampling bias. That should be a concern when extrapolating to the general population.

County-wide voting behavior based on racial demographics

Justin Herman

November 9, 2018
