The goal of this report is to predict election day turnout and results in NC using early voting data. The preliminary data processing and model testing will be avaliable on my rpubs page. I predicted outcome at the county level then weighted previous county election results from 2008 and 2012 to come up with a projection. Trump came out slightly ahead but it is a coin flip and either candidate winning is within the margin of error.
| Candidate | Votes | Percent |
|---|---|---|
| TRUMP | 2,376,740 | 49.9% |
| CLINTON | 2,356,125 | 49.5% |
## Warning in arrange_impl(.data, dots): '.Random.seed' is not an integer
## vector but of type 'NULL', so ignored
| Party | Voters | Percent |
|---|---|---|
| DEM | 1,915,618 | 70.29% |
| REP | 1,553,310 | 74.69% |
| UNA | 1,295,391 | 62.89% |
| TOTAL: | 4,783,524 | 69.36% |
I used data from the NCBOE and the following R packages:
load("models.RData")
load("modelset.RData")
load("TOTAL.RData")
library(ggplot2)
library(dplyr)
library(caret)
library(data.table)
library(randomForest)
To model turnout I used Random Forests for each party. See my rpubs page. for model testing of glm, lm and regressor/year selection. Modelset is the set of aggregated data which I used for training the model. On my final model I chose to use only on-election years (2006,2008,2010,2012,2014) in order to minimize the variance.
set.seed(22)
model_all_demF <- train(TOE_DEM~
DOS_DEM+
ROS_REP+
UOS_UNA+
LOS_LIB+
DBM_DEM+
RBM_REP+
UBM_UNA+
LBM_LIB+
factor(county_desc)+
factor(type), method = "rf", data = modelset)
model_all_repF <- train(TOE_REP~
DOS_DEM+
ROS_REP+
UOS_UNA+
LOS_LIB+
DBM_DEM+
RBM_REP+
UBM_UNA+
LBM_LIB+
Democrats+
Republicans+
Unaffiliated+
Libertarians+
Male+
Female+
American.Indian+
Black+
White+
Hispanic+
Other+
factor(county_desc)+
factor(type), method = "rf", data = modelset)
model_all_unaF <- train(TOE_UNA~
DOS_DEM+
ROS_REP+
UOS_UNA+
LOS_LIB+
DBM_DEM+
RBM_REP+
UBM_UNA+
LBM_LIB+
Democrats+
Republicans+
Unaffiliated+
Libertarians+
Male+
Female+
American.Indian+
Black+
White+
Hispanic+
Other+
factor(county_desc)+
factor(type), method = "rf", data = modelset)
model_all_libF <- train(TOE_LIB~
DOS_DEM+
ROS_REP+
UOS_UNA+
LOS_LIB+
DBM_DEM+
RBM_REP+
UBM_UNA+
LBM_LIB+
factor(county_desc)+
factor(type), method = "rf", data = modelset, intercept = 0)
I understand it is a cardinal sin of machine learning not to use a test set. In choosing the models and regressors I went through many tests and tested different models, see my rpubs page for all of it. However, I wanted the final model to use as much data as possible. The fit is based largely on predicting outcome at a county level and considering there are only 5 results per county so every one counts.
The question now is if the models are overfit. We will find out tomorrow….
Some interesting outliers in the data include high election day turnout based on low early voting numbers in Currituck county in 2006 and 2014 which might suggest limited access to early voting sites. Another is Robeson County in 2012 which has a very high early voter turnout for democrats but the election day turnout did not increase as(it was capped). This was a point of saturation where the early voting was so common that less people showed up for election day. If my model is overfit I would consider weighting this point more heavily. Other can be noted on the charts below.
model_all_repF
## Random Forest
##
## 500 samples
## 21 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 500, 500, 500, 500, 500, 500, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 2 0.05222052 0.4068448
## 60 0.04340902 0.5164246
## 119 0.04401155 0.4927232
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 60.
#Residual standard error: 0.01176 on 498 degrees of freedom
#Multiple R-squared: 0.9625, Adjusted R-squared: 0.9624
#F-statistic: 1.277e+04 on 1 and 498 DF, p-value: < 2.2e-16
pred_modelsetFr <- predict(model_all_repF,modelset)
qplot(modelset$TOE_REP,pred_modelsetFr)
mFr <- lm(modelset$TOE_REP~pred_modelsetFr)
summary(mFr)
##
## Call:
## lm(formula = modelset$TOE_REP ~ pred_modelsetFr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.046945 -0.006990 -0.000080 0.007186 0.063034
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.071177 0.003427 -20.77 <2e-16 ***
## pred_modelsetFr 1.229181 0.010878 113.00 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01176 on 498 degrees of freedom
## Multiple R-squared: 0.9625, Adjusted R-squared: 0.9624
## F-statistic: 1.277e+04 on 1 and 498 DF, p-value: < 2.2e-16
plot(mFr)
model_all_demF
## Random Forest
##
## 500 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 500, 500, 500, 500, 500, 500, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 2 0.05221708 0.5014064
## 55 0.04564429 0.5102502
## 108 0.04634928 0.4904360
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 55.
#Residual standard error: 0.01385 on 498 degrees of freedom
#Multiple R-squared: 0.9504, Adjusted R-squared: 0.9503
#F-statistic: 9545 on 1 and 498 DF, p-value: < 2.2e-16
pred_modelsetFd <- predict(model_all_demF,modelset)
qplot(modelset$TOE_DEM,pred_modelsetFd)
mFd <- lm(modelset$TOE_DEM~pred_modelsetFd)
summary(mFd)
##
## Call:
## lm(formula = modelset$TOE_DEM ~ pred_modelsetFd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.041317 -0.009298 0.000323 0.009368 0.054648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.054897 0.003488 -15.74 <2e-16 ***
## pred_modelsetFd 1.197653 0.012259 97.70 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01385 on 498 degrees of freedom
## Multiple R-squared: 0.9504, Adjusted R-squared: 0.9503
## F-statistic: 9545 on 1 and 498 DF, p-value: < 2.2e-16
plot(mFd)
model_all_unaF
## Random Forest
##
## 500 samples
## 21 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 500, 500, 500, 500, 500, 500, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 2 0.04585918 0.4216473
## 60 0.03874313 0.5351081
## 119 0.03937912 0.5189696
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 60.
#Residual standard error: 0.01222 on 498 degrees of freedom
#Multiple R-squared: 0.9534, Adjusted R-squared: 0.9533
#F-statistic: 1.018e+04 on 1 and 498 DF, p-value: < 2.2e-16
pred_modelsetFu <- predict(model_all_unaF,modelset)
qplot(modelset$TOE_UNA,pred_modelsetFu)
mFu <- lm(modelset$TOE_UNA~pred_modelsetFu)
summary(mFu)
##
## Call:
## lm(formula = modelset$TOE_UNA ~ pred_modelsetFu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.037048 -0.007097 -0.000234 0.007226 0.065527
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.040025 0.002844 -14.07 <2e-16 ***
## pred_modelsetFu 1.165595 0.011552 100.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01222 on 498 degrees of freedom
## Multiple R-squared: 0.9534, Adjusted R-squared: 0.9533
## F-statistic: 1.018e+04 on 1 and 498 DF, p-value: < 2.2e-16
plot(mFu)
I chose to force the model for libertarian turnout through the origin as a way of mitigating standard error for the model. There are a lot of occasions when no early voters
model_all_libF
## Random Forest
##
## 500 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 500, 500, 500, 500, 500, 500, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 2 0.1668136 0.3694696
## 55 0.1663285 0.3601420
## 108 0.1731729 0.3262059
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 55.
#Residual standard error: 0.06656 on 498 degrees of freedom
#Multiple R-squared: 0.8955, Adjusted R-squared: 0.8953
#F-statistic: 4267 on 1 and 498 DF, p-value: < 2.2e-16
pred_modelsetFl <- predict(model_all_libF,modelset)
qplot(modelset$TOE_LIB,pred_modelsetFl)
mFl <- lm(modelset$TOE_LIB~pred_modelsetFl)
summary(mFl)
##
## Call:
## lm(formula = modelset$TOE_LIB ~ pred_modelsetFl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57726 -0.03050 0.00180 0.03118 0.75075
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04635 0.00513 -9.036 <2e-16 ***
## pred_modelsetFl 1.20958 0.01852 65.322 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06656 on 498 degrees of freedom
## Multiple R-squared: 0.8955, Adjusted R-squared: 0.8953
## F-statistic: 4267 on 1 and 498 DF, p-value: < 2.2e-16
plot(mFl)
#0.702965
#1915618
predict_2016dF <- predict(model_all_demF, TOTAL)
PDEM_16 <- sum(predict_2016dF*TOTAL$Democrats+
+ TOTAL$DOS_DEM*TOTAL$Democrats+
+ TOTAL$DBM_DEM*TOTAL$Democrats)/(sum(TOTAL$Democrats))
DEM_16 <- sum(predict_2016dF*TOTAL$Democrats+
+ TOTAL$DOS_DEM*TOTAL$Democrats+
+ TOTAL$DBM_DEM*TOTAL$Democrats)
#0.7469204
#1553310
predict_2016rF <- predict(model_all_repF, TOTAL)
PREP_16 <- sum(predict_2016rF*TOTAL$Republicans+
TOTAL$ROS_REP*TOTAL$Republicans+
TOTAL$RBM_REP*TOTAL$Republicans)/(sum(TOTAL$Republicans))
REP_16 <- sum(predict_2016rF*TOTAL$Republicans+
TOTAL$ROS_REP*TOTAL$Republicans+
TOTAL$RBM_REP*TOTAL$Republicans)
#0.6289593
#1295391
predict_2016uF <- predict(model_all_unaF, TOTAL)
PUNA_16 <- sum(predict_2016uF*TOTAL$Unaffiliated+
TOTAL$UOS_UNA*TOTAL$Unaffiliated+
TOTAL$UBM_UNA*TOTAL$Unaffiliated)/(sum(TOTAL$Unaffiliated))
UNA_16 <- sum(predict_2016uF*TOTAL$Unaffiliated+
TOTAL$UOS_UNA*TOTAL$Unaffiliated+
TOTAL$UBM_UNA*TOTAL$Unaffiliated)
#0.5983549
#19205.4
predict_2016lF <- predict(model_all_libF, TOTAL)
PLIB_16 <- sum(predict_2016lF*TOTAL$Libertarians+
TOTAL$LOS_LIB*TOTAL$Libertarians+
TOTAL$LBM_LIB*TOTAL$Libertarians)/(sum(TOTAL$Libertarians))
LIB_16 <- sum(predict_2016lF*TOTAL$Libertarians+
TOTAL$LOS_LIB*TOTAL$Libertarians+
TOTAL$LBM_LIB*TOTAL$Libertarians)
#2947459
EV_16 <- sum(TOTAL$LOS_LIB*TOTAL$Libertarians+TOTAL$UOS_UNA*TOTAL$Unaffiliated+TOTAL$ROS_REP*TOTAL$Republicans+TOTAL$DOS_DEM*TOTAL$Democrats)
#Total by mail 155265
BM_16 <- sum(TOTAL$LBM_LIB*TOTAL$Libertarians+TOTAL$UBM_UNA*TOTAL$Unaffiliated+TOTAL$RBM_REP*TOTAL$Republicans+TOTAL$DBM_DEM*TOTAL$Democrats)
#total election day 1680800
ED_16 <- sum(predict_2016dF*TOTAL$Democrats+predict_2016rF*TOTAL$Republicans+predict_2016uF*TOTAL$Unaffiliated+predict_2016lF*TOTAL$Libertarians)
#0.6936314
#4783524
PTOT_16 <- sum(predict_2016rF*TOTAL$Republicans+
TOTAL$ROS_REP*TOTAL$Republicans+
TOTAL$RBM_REP*TOTAL$Republicans+
predict_2016uF*TOTAL$Unaffiliated+
TOTAL$UOS_UNA*TOTAL$Unaffiliated+
TOTAL$UBM_UNA*TOTAL$Unaffiliated+
predict_2016lF*TOTAL$Libertarians+
TOTAL$LOS_LIB*TOTAL$Libertarians+
TOTAL$LBM_LIB*TOTAL$Libertarians+
predict_2016dF*TOTAL$Democrats+
TOTAL$DOS_DEM*TOTAL$Democrats+
TOTAL$DBM_DEM*TOTAL$Democrats)/(sum(TOTAL$Republicans+TOTAL$Unaffiliated+TOTAL$Libertarians+TOTAL$Democrats))
TOT_16 <- sum(predict_2016rF*TOTAL$Republicans+
TOTAL$ROS_REP*TOTAL$Republicans+
TOTAL$RBM_REP*TOTAL$Republicans+
predict_2016uF*TOTAL$Unaffiliated+
TOTAL$UOS_UNA*TOTAL$Unaffiliated+
TOTAL$UBM_UNA*TOTAL$Unaffiliated+
predict_2016lF*TOTAL$Libertarians+
TOTAL$LOS_LIB*TOTAL$Libertarians+
TOTAL$LBM_LIB*TOTAL$Libertarians+
predict_2016dF*TOTAL$Democrats+
TOTAL$DOS_DEM*TOTAL$Democrats+
TOTAL$DBM_DEM*TOTAL$Democrats)
PTOT_16w <- sum(predict_2016rF*TOTAL$Republicans+
TOTAL$ROS_REP*TOTAL$Republicans+
TOTAL$RBM_REP*TOTAL$Republicans+
predict_2016uF*TOTAL$Unaffiliated+
TOTAL$UOS_UNA*TOTAL$Unaffiliated+
TOTAL$UBM_UNA*TOTAL$Unaffiliated+
predict_2016dF*TOTAL$Democrats+
TOTAL$DOS_DEM*TOTAL$Democrats+
TOTAL$DBM_DEM*TOTAL$Democrats)/(sum(TOTAL$Republicans+TOTAL$Unaffiliated+TOTAL$Democrats))
modelset <- mutate(modelset, DCS_DEM = DCS/Democrats, RCS_REP = RCS/Republicans, UCS_UNA = UCS/Unaffiliated, LCS_LIB = LCS/Libertarians)
DT <- as.data.table(modelset)
invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =0)))
modelset <- DT
modelset[is.na(modelset)] <- 0
#2008 0.7163673
PDEM_08 <- sum(modelset[101:200,]$TOE_DEM*modelset[101:200,]$Democrats+
+ modelset[101:200,]$DOS_DEM*modelset[101:200,]$Democrats+
+ modelset[101:200,]$DBM_DEM*modelset[101:200,]$Democrats+modelset[101:200,]$DCS)/(sum(modelset[101:200,]$Democrats))
DEM_08 <- sum(modelset[101:200,]$TOE_DEM*modelset[101:200,]$Democrats+
+ modelset[101:200,]$DOS_DEM*modelset[101:200,]$Democrats+
+ modelset[101:200,]$DBM_DEM*modelset[101:200,]$Democrats+modelset[101:200,]$DCS)
#2012 0.6991179
PDEM_12 <- sum(modelset[301:400,]$TOE_DEM*modelset[301:400,]$Democrats+
modelset[301:400,]$DOS_DEM*modelset[301:400,]$Democrats+
modelset[301:400,]$DBM_DEM*modelset[301:400,]$Democrats+modelset[301:400,]$DCS)/(sum(modelset[301:400,]$Democrats))
DEM_12 <- sum(modelset[301:400,]$TOE_DEM*modelset[301:400,]$Democrats+
modelset[301:400,]$DOS_DEM*modelset[301:400,]$Democrats+
modelset[301:400,]$DBM_DEM*modelset[301:400,]$Democrats+modelset[301:400,]$DCS)
#0.7130856 2008
PREP_08 <- sum(modelset[101:200,]$TOE_REP*modelset[101:200,]$Republicans+
modelset[101:200,]$ROS_REP*modelset[101:200,]$Republicans+
modelset[101:200,]$RBM_REP*modelset[101:200,]$Republicans+modelset[101:200,]$RCS)/(sum(modelset[101:200,]$Republicans))
REP_08 <- sum(modelset[101:200,]$TOE_REP*modelset[101:200,]$Republicans+
modelset[101:200,]$ROS_REP*modelset[101:200,]$Republicans+
modelset[101:200,]$RBM_REP*modelset[101:200,]$Republicans+modelset[101:200,]$RCS)
# 0.7279284 2012
PREP_12 <- sum(modelset[301:400,]$TOE_REP*modelset[301:400,]$Republicans+
modelset[301:400,]$ROS_REP*modelset[301:400,]$Republicans+
modelset[301:400,]$RBM_REP*modelset[301:400,]$Republicans+modelset[301:400,]$RCS)/(sum(modelset[301:400,]$Republicans))
REP_12 <- sum(modelset[301:400,]$TOE_REP*modelset[301:400,]$Republicans+
modelset[301:400,]$ROS_REP*modelset[301:400,]$Republicans+
modelset[301:400,]$RBM_REP*modelset[301:400,]$Republicans+modelset[301:400,]$RCS)
#0.6205267 2008
PUNA_08 <- sum(modelset[101:200,]$TOE_UNA*modelset[101:200,]$Unaffiliated+
modelset[101:200,]$UOS_UNA*modelset[101:200,]$Unaffiliated+
modelset[101:200,]$UBM_UNA*modelset[101:200,]$Unaffiliated+modelset[101:200,]$RCS)/(sum(modelset[101:200,]$Unaffiliated))
UNA_08 <- sum(modelset[101:200,]$TOE_UNA*modelset[101:200,]$Unaffiliated+
modelset[101:200,]$UOS_UNA*modelset[101:200,]$Unaffiliated+
modelset[101:200,]$UBM_UNA*modelset[101:200,]$Unaffiliated+modelset[101:200,]$RCS)
#0.603038 2012
PUNA_12 <- sum(modelset[301:400,]$TOE_UNA*modelset[301:400,]$Unaffiliated+
modelset[301:400,]$UOS_UNA*modelset[301:400,]$Unaffiliated+
modelset[301:400,]$UBM_UNA*modelset[301:400,]$Unaffiliated+modelset[301:400,]$RCS)/(sum(modelset[301:400,]$Unaffiliated))
UNA_12 <- sum(modelset[301:400,]$TOE_UNA*modelset[301:400,]$Unaffiliated+
modelset[301:400,]$UOS_UNA*modelset[301:400,]$Unaffiliated+
modelset[301:400,]$UBM_UNA*modelset[301:400,]$Unaffiliated+modelset[301:400,]$RCS)
TOT_08 <- sum(REP_08+UNA_08+DEM_08)
TOT_12 <- sum(REP_12+UNA_12+DEM_12)
PTOT_08 <- sum(REP_08+UNA_08+DEM_08)/sum(modelset[101:200,]$Unaffiliated+modelset[101:200,]$Republicans+modelset[101:200,]$Democrats)
PTOT_12 <- sum(REP_12+UNA_12+DEM_12)/sum(modelset[301:400,]$Unaffiliated+modelset[301:400,]$Republicans+modelset[301:400,]$Democrats)
My prediction suggests that this year’s turnout will exceed 2008 but fall short of 2012. See charts below. ##Total Turnout