Ryan Tillis

Candidate	Votes	Percent
TRUMP	2,376,740	49.9%
CLINTON	2,356,125	49.5%

Party	Voters	Percent
DEM	1,915,618	70.29%
REP	1,553,310	74.69%
UNA	1,295,391	62.89%
TOTAL:	4,783,524	69.36%

The Method

I used data from the NCBOE and the following R packages:

load("models.RData")
load("modelset.RData")
load("TOTAL.RData")
library(ggplot2)
library(dplyr)
library(caret)
library(data.table)
library(randomForest)

Random Forests

To model turnout I used Random Forests for each party. See my rpubs page. for model testing of glm, lm and regressor/year selection. Modelset is the set of aggregated data which I used for training the model. On my final model I chose to use only on-election years (2006,2008,2010,2012,2014) in order to minimize the variance.

set.seed(22)

model_all_demF <- train(TOE_DEM~
                               DOS_DEM+
                               ROS_REP+
                               UOS_UNA+
                               LOS_LIB+
                               DBM_DEM+
                               RBM_REP+
                               UBM_UNA+
                               LBM_LIB+
                               factor(county_desc)+
                               factor(type), method = "rf", data = modelset)

model_all_repF <- train(TOE_REP~
                               DOS_DEM+
                               ROS_REP+
                               UOS_UNA+
                               LOS_LIB+
                               DBM_DEM+
                               RBM_REP+
                               UBM_UNA+
                               LBM_LIB+
                               Democrats+
                               Republicans+
                               Unaffiliated+
                               Libertarians+
                               Male+
                               Female+
                               American.Indian+
                               Black+
                               White+
                               Hispanic+
                               Other+
                               factor(county_desc)+
                               factor(type), method = "rf", data = modelset)

model_all_unaF <- train(TOE_UNA~
                               DOS_DEM+
                               ROS_REP+
                               UOS_UNA+
                               LOS_LIB+
                               DBM_DEM+
                               RBM_REP+
                               UBM_UNA+
                               LBM_LIB+
                               Democrats+
                               Republicans+
                               Unaffiliated+
                               Libertarians+
                               Male+
                               Female+
                               American.Indian+
                               Black+
                               White+
                               Hispanic+
                               Other+
                               factor(county_desc)+
                               factor(type), method = "rf", data = modelset)


model_all_libF <- train(TOE_LIB~
                               DOS_DEM+
                               ROS_REP+
                               UOS_UNA+
                               LOS_LIB+
                               DBM_DEM+
                               RBM_REP+
                               UBM_UNA+
                               LBM_LIB+
                               factor(county_desc)+
                               factor(type), method = "rf", data = modelset, intercept = 0)

Diagnosing Residuals and Checking the Fit

I understand it is a cardinal sin of machine learning not to use a test set. In choosing the models and regressors I went through many tests and tested different models, see my rpubs page for all of it. However, I wanted the final model to use as much data as possible. The fit is based largely on predicting outcome at a county level and considering there are only 5 results per county so every one counts.

The question now is if the models are overfit. We will find out tomorrow….

Some interesting outliers in the data include high election day turnout based on low early voting numbers in Currituck county in 2006 and 2014 which might suggest limited access to early voting sites. Another is Robeson County in 2012 which has a very high early voter turnout for democrats but the election day turnout did not increase as(it was capped). This was a point of saturation where the early voting was so common that less people showed up for election day. If my model is overfit I would consider weighting this point more heavily. Other can be noted on the charts below.

Republican Turnout Model

model_all_repF

## Random Forest 
## 
## 500 samples
##  21 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 500, 500, 500, 500, 500, 500, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE        Rsquared 
##     2   0.05222052  0.4068448
##    60   0.04340902  0.5164246
##   119   0.04401155  0.4927232
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 60.

#Residual standard error: 0.01176 on 498 degrees of freedom
#Multiple R-squared:  0.9625,   Adjusted R-squared:  0.9624 
#F-statistic: 1.277e+04 on 1 and 498 DF,  p-value: < 2.2e-16
pred_modelsetFr <- predict(model_all_repF,modelset)
qplot(modelset$TOE_REP,pred_modelsetFr)

mFr <- lm(modelset$TOE_REP~pred_modelsetFr)
summary(mFr)

## 
## Call:
## lm(formula = modelset$TOE_REP ~ pred_modelsetFr)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.046945 -0.006990 -0.000080  0.007186  0.063034 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.071177   0.003427  -20.77   <2e-16 ***
## pred_modelsetFr  1.229181   0.010878  113.00   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01176 on 498 degrees of freedom
## Multiple R-squared:  0.9625, Adjusted R-squared:  0.9624 
## F-statistic: 1.277e+04 on 1 and 498 DF,  p-value: < 2.2e-16

plot(mFr)

Democrat Turnout Model

model_all_demF

## Random Forest 
## 
## 500 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 500, 500, 500, 500, 500, 500, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE        Rsquared 
##     2   0.05221708  0.5014064
##    55   0.04564429  0.5102502
##   108   0.04634928  0.4904360
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 55.

#Residual standard error: 0.01385 on 498 degrees of freedom
#Multiple R-squared:  0.9504,   Adjusted R-squared:  0.9503 
#F-statistic:  9545 on 1 and 498 DF,  p-value: < 2.2e-16
pred_modelsetFd <- predict(model_all_demF,modelset)
qplot(modelset$TOE_DEM,pred_modelsetFd)

mFd <- lm(modelset$TOE_DEM~pred_modelsetFd)
summary(mFd)

## 
## Call:
## lm(formula = modelset$TOE_DEM ~ pred_modelsetFd)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.041317 -0.009298  0.000323  0.009368  0.054648 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.054897   0.003488  -15.74   <2e-16 ***
## pred_modelsetFd  1.197653   0.012259   97.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01385 on 498 degrees of freedom
## Multiple R-squared:  0.9504, Adjusted R-squared:  0.9503 
## F-statistic:  9545 on 1 and 498 DF,  p-value: < 2.2e-16

plot(mFd)

Unaffiliated Turnout Model

model_all_unaF

## Random Forest 
## 
## 500 samples
##  21 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 500, 500, 500, 500, 500, 500, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE        Rsquared 
##     2   0.04585918  0.4216473
##    60   0.03874313  0.5351081
##   119   0.03937912  0.5189696
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 60.

#Residual standard error: 0.01222 on 498 degrees of freedom
#Multiple R-squared:  0.9534,   Adjusted R-squared:  0.9533 
#F-statistic: 1.018e+04 on 1 and 498 DF,  p-value: < 2.2e-16
pred_modelsetFu <- predict(model_all_unaF,modelset)
qplot(modelset$TOE_UNA,pred_modelsetFu)

mFu <- lm(modelset$TOE_UNA~pred_modelsetFu)
summary(mFu)

## 
## Call:
## lm(formula = modelset$TOE_UNA ~ pred_modelsetFu)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.037048 -0.007097 -0.000234  0.007226  0.065527 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.040025   0.002844  -14.07   <2e-16 ***
## pred_modelsetFu  1.165595   0.011552  100.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01222 on 498 degrees of freedom
## Multiple R-squared:  0.9534, Adjusted R-squared:  0.9533 
## F-statistic: 1.018e+04 on 1 and 498 DF,  p-value: < 2.2e-16

plot(mFu)

Libertarian Turnout Model

I chose to force the model for libertarian turnout through the origin as a way of mitigating standard error for the model. There are a lot of occasions when no early voters

model_all_libF

## Random Forest 
## 
## 500 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 500, 500, 500, 500, 500, 500, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared 
##     2   0.1668136  0.3694696
##    55   0.1663285  0.3601420
##   108   0.1731729  0.3262059
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 55.

#Residual standard error: 0.06656 on 498 degrees of freedom
#Multiple R-squared:  0.8955,   Adjusted R-squared:  0.8953 
#F-statistic:  4267 on 1 and 498 DF,  p-value: < 2.2e-16
pred_modelsetFl <- predict(model_all_libF,modelset)
qplot(modelset$TOE_LIB,pred_modelsetFl)

mFl <- lm(modelset$TOE_LIB~pred_modelsetFl)
summary(mFl)

## 
## Call:
## lm(formula = modelset$TOE_LIB ~ pred_modelsetFl)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57726 -0.03050  0.00180  0.03118  0.75075 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.04635    0.00513  -9.036   <2e-16 ***
## pred_modelsetFl  1.20958    0.01852  65.322   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06656 on 498 degrees of freedom
## Multiple R-squared:  0.8955, Adjusted R-squared:  0.8953 
## F-statistic:  4267 on 1 and 498 DF,  p-value: < 2.2e-16

plot(mFl)

Predictions

#0.702965
#1915618
predict_2016dF <- predict(model_all_demF, TOTAL)

PDEM_16 <- sum(predict_2016dF*TOTAL$Democrats+
           +         TOTAL$DOS_DEM*TOTAL$Democrats+
           +         TOTAL$DBM_DEM*TOTAL$Democrats)/(sum(TOTAL$Democrats))
DEM_16 <- sum(predict_2016dF*TOTAL$Democrats+
           +         TOTAL$DOS_DEM*TOTAL$Democrats+
           +         TOTAL$DBM_DEM*TOTAL$Democrats)


#0.7469204
#1553310
predict_2016rF <- predict(model_all_repF, TOTAL)

PREP_16 <- sum(predict_2016rF*TOTAL$Republicans+
           TOTAL$ROS_REP*TOTAL$Republicans+
           TOTAL$RBM_REP*TOTAL$Republicans)/(sum(TOTAL$Republicans))

REP_16 <- sum(predict_2016rF*TOTAL$Republicans+
           TOTAL$ROS_REP*TOTAL$Republicans+
           TOTAL$RBM_REP*TOTAL$Republicans)

#0.6289593
#1295391
predict_2016uF <- predict(model_all_unaF, TOTAL)

PUNA_16 <- sum(predict_2016uF*TOTAL$Unaffiliated+
           TOTAL$UOS_UNA*TOTAL$Unaffiliated+
           TOTAL$UBM_UNA*TOTAL$Unaffiliated)/(sum(TOTAL$Unaffiliated))

UNA_16 <- sum(predict_2016uF*TOTAL$Unaffiliated+
           TOTAL$UOS_UNA*TOTAL$Unaffiliated+
           TOTAL$UBM_UNA*TOTAL$Unaffiliated)

#0.5983549
#19205.4
predict_2016lF <- predict(model_all_libF, TOTAL)
PLIB_16 <- sum(predict_2016lF*TOTAL$Libertarians+
           TOTAL$LOS_LIB*TOTAL$Libertarians+
           TOTAL$LBM_LIB*TOTAL$Libertarians)/(sum(TOTAL$Libertarians))
LIB_16 <- sum(predict_2016lF*TOTAL$Libertarians+
           
           TOTAL$LOS_LIB*TOTAL$Libertarians+
           TOTAL$LBM_LIB*TOTAL$Libertarians)

#2947459
EV_16 <- sum(TOTAL$LOS_LIB*TOTAL$Libertarians+TOTAL$UOS_UNA*TOTAL$Unaffiliated+TOTAL$ROS_REP*TOTAL$Republicans+TOTAL$DOS_DEM*TOTAL$Democrats)
#Total by mail 155265
BM_16 <- sum(TOTAL$LBM_LIB*TOTAL$Libertarians+TOTAL$UBM_UNA*TOTAL$Unaffiliated+TOTAL$RBM_REP*TOTAL$Republicans+TOTAL$DBM_DEM*TOTAL$Democrats)
#total election day 1680800
ED_16 <- sum(predict_2016dF*TOTAL$Democrats+predict_2016rF*TOTAL$Republicans+predict_2016uF*TOTAL$Unaffiliated+predict_2016lF*TOTAL$Libertarians)

#0.6936314
#4783524
PTOT_16 <- sum(predict_2016rF*TOTAL$Republicans+
           TOTAL$ROS_REP*TOTAL$Republicans+
           TOTAL$RBM_REP*TOTAL$Republicans+
           predict_2016uF*TOTAL$Unaffiliated+
           TOTAL$UOS_UNA*TOTAL$Unaffiliated+
           TOTAL$UBM_UNA*TOTAL$Unaffiliated+
           predict_2016lF*TOTAL$Libertarians+
           TOTAL$LOS_LIB*TOTAL$Libertarians+
           TOTAL$LBM_LIB*TOTAL$Libertarians+
           predict_2016dF*TOTAL$Democrats+
           TOTAL$DOS_DEM*TOTAL$Democrats+
           TOTAL$DBM_DEM*TOTAL$Democrats)/(sum(TOTAL$Republicans+TOTAL$Unaffiliated+TOTAL$Libertarians+TOTAL$Democrats))

TOT_16 <- sum(predict_2016rF*TOTAL$Republicans+
           TOTAL$ROS_REP*TOTAL$Republicans+
           TOTAL$RBM_REP*TOTAL$Republicans+
           predict_2016uF*TOTAL$Unaffiliated+
           TOTAL$UOS_UNA*TOTAL$Unaffiliated+
           TOTAL$UBM_UNA*TOTAL$Unaffiliated+
           predict_2016lF*TOTAL$Libertarians+
           TOTAL$LOS_LIB*TOTAL$Libertarians+
           TOTAL$LBM_LIB*TOTAL$Libertarians+
           predict_2016dF*TOTAL$Democrats+
           TOTAL$DOS_DEM*TOTAL$Democrats+
           TOTAL$DBM_DEM*TOTAL$Democrats)

PTOT_16w <- sum(predict_2016rF*TOTAL$Republicans+
           TOTAL$ROS_REP*TOTAL$Republicans+
           TOTAL$RBM_REP*TOTAL$Republicans+
           predict_2016uF*TOTAL$Unaffiliated+
           TOTAL$UOS_UNA*TOTAL$Unaffiliated+
           TOTAL$UBM_UNA*TOTAL$Unaffiliated+
           predict_2016dF*TOTAL$Democrats+
           TOTAL$DOS_DEM*TOTAL$Democrats+
           TOTAL$DBM_DEM*TOTAL$Democrats)/(sum(TOTAL$Republicans+TOTAL$Unaffiliated+TOTAL$Democrats))

Calculations

modelset <- mutate(modelset, DCS_DEM = DCS/Democrats, RCS_REP = RCS/Republicans, UCS_UNA = UCS/Unaffiliated, LCS_LIB = LCS/Libertarians)

DT <- as.data.table(modelset)
invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =0)))

modelset <- DT

modelset[is.na(modelset)] <- 0


#2008 0.7163673
PDEM_08 <- sum(modelset[101:200,]$TOE_DEM*modelset[101:200,]$Democrats+
+         modelset[101:200,]$DOS_DEM*modelset[101:200,]$Democrats+
+         modelset[101:200,]$DBM_DEM*modelset[101:200,]$Democrats+modelset[101:200,]$DCS)/(sum(modelset[101:200,]$Democrats))
DEM_08 <- sum(modelset[101:200,]$TOE_DEM*modelset[101:200,]$Democrats+
+         modelset[101:200,]$DOS_DEM*modelset[101:200,]$Democrats+
+         modelset[101:200,]$DBM_DEM*modelset[101:200,]$Democrats+modelset[101:200,]$DCS)

#2012 0.6991179
PDEM_12 <- sum(modelset[301:400,]$TOE_DEM*modelset[301:400,]$Democrats+
        modelset[301:400,]$DOS_DEM*modelset[301:400,]$Democrats+
        modelset[301:400,]$DBM_DEM*modelset[301:400,]$Democrats+modelset[301:400,]$DCS)/(sum(modelset[301:400,]$Democrats))
DEM_12 <- sum(modelset[301:400,]$TOE_DEM*modelset[301:400,]$Democrats+
        modelset[301:400,]$DOS_DEM*modelset[301:400,]$Democrats+
        modelset[301:400,]$DBM_DEM*modelset[301:400,]$Democrats+modelset[301:400,]$DCS)

#0.7130856 2008
PREP_08 <- sum(modelset[101:200,]$TOE_REP*modelset[101:200,]$Republicans+
        modelset[101:200,]$ROS_REP*modelset[101:200,]$Republicans+
        modelset[101:200,]$RBM_REP*modelset[101:200,]$Republicans+modelset[101:200,]$RCS)/(sum(modelset[101:200,]$Republicans))
REP_08 <- sum(modelset[101:200,]$TOE_REP*modelset[101:200,]$Republicans+
        modelset[101:200,]$ROS_REP*modelset[101:200,]$Republicans+
        modelset[101:200,]$RBM_REP*modelset[101:200,]$Republicans+modelset[101:200,]$RCS)

# 0.7279284 2012
PREP_12 <- sum(modelset[301:400,]$TOE_REP*modelset[301:400,]$Republicans+
        modelset[301:400,]$ROS_REP*modelset[301:400,]$Republicans+
        modelset[301:400,]$RBM_REP*modelset[301:400,]$Republicans+modelset[301:400,]$RCS)/(sum(modelset[301:400,]$Republicans))

REP_12 <- sum(modelset[301:400,]$TOE_REP*modelset[301:400,]$Republicans+
        modelset[301:400,]$ROS_REP*modelset[301:400,]$Republicans+
        modelset[301:400,]$RBM_REP*modelset[301:400,]$Republicans+modelset[301:400,]$RCS)

#0.6205267 2008
PUNA_08 <- sum(modelset[101:200,]$TOE_UNA*modelset[101:200,]$Unaffiliated+
        modelset[101:200,]$UOS_UNA*modelset[101:200,]$Unaffiliated+
        modelset[101:200,]$UBM_UNA*modelset[101:200,]$Unaffiliated+modelset[101:200,]$RCS)/(sum(modelset[101:200,]$Unaffiliated))

UNA_08 <- sum(modelset[101:200,]$TOE_UNA*modelset[101:200,]$Unaffiliated+
        modelset[101:200,]$UOS_UNA*modelset[101:200,]$Unaffiliated+
        modelset[101:200,]$UBM_UNA*modelset[101:200,]$Unaffiliated+modelset[101:200,]$RCS)
#0.603038 2012
PUNA_12 <- sum(modelset[301:400,]$TOE_UNA*modelset[301:400,]$Unaffiliated+
        modelset[301:400,]$UOS_UNA*modelset[301:400,]$Unaffiliated+
        modelset[301:400,]$UBM_UNA*modelset[301:400,]$Unaffiliated+modelset[301:400,]$RCS)/(sum(modelset[301:400,]$Unaffiliated))

UNA_12 <- sum(modelset[301:400,]$TOE_UNA*modelset[301:400,]$Unaffiliated+
        modelset[301:400,]$UOS_UNA*modelset[301:400,]$Unaffiliated+
        modelset[301:400,]$UBM_UNA*modelset[301:400,]$Unaffiliated+modelset[301:400,]$RCS)

TOT_08 <- sum(REP_08+UNA_08+DEM_08)
TOT_12 <- sum(REP_12+UNA_12+DEM_12)

PTOT_08 <- sum(REP_08+UNA_08+DEM_08)/sum(modelset[101:200,]$Unaffiliated+modelset[101:200,]$Republicans+modelset[101:200,]$Democrats)
PTOT_12 <- sum(REP_12+UNA_12+DEM_12)/sum(modelset[301:400,]$Unaffiliated+modelset[301:400,]$Republicans+modelset[301:400,]$Democrats)

Charts

My prediction suggests that this year’s turnout will exceed 2008 but fall short of 2012. See charts below. ##Total Turnout

Ryan Tillis - Predicting Election Results - Random Forests

11/7/2016

Predict NC - Random Forests Project

This chart below shows the split of the chosen ballots in the 2016 primary by unaffiliateds who have already cast their vote in the 2016 General.

Predicted Turnout by Party

Predicted Turnout by County