This was my entry for the following Kaggle competition: https://www.kaggle.com/benhamner/2016-us-election

The dataset consisted of primary results from the “early” states for the 2016 US Presidential Election, along with “County Facts”: variables describing the population of each county.

Libraries Used

Library Description
choroplethr A library for plotting US county and state data
rpart A library used for its recursive partitioning algorithms
ggplot2 Grammar of Graphics plotting library
reshape2 Pivoting data from long to wide
caret Machine learning library (data splitting and confusion matrices)
RSQLite Connect to SQLite tables
dplyr The data wrangling package
sqldf Query data frames like SQL tables
RColorBrewer Color palettes for nicer-looking tree plots

Connect to the SQLite DB

dbpath = "C:/Users/Salil/Downloads/PE/2016_presidential_election/database.sqlite"
##dbpath = "H:/PE/2016_presidential_election/database.sqlite"
db <- dbConnect(dbDriver("SQLite"), dbpath)
dbListTables(db)
## [1] "county_facts"            "county_facts_dictionary"
## [3] "primary_results"
countyfacts <- dbGetQuery(db, "select * from county_facts")
cfd <- dbGetQuery(db, "select * from county_facts_dictionary")
presults <- dbGetQuery(db, "select * from primary_results")
## Keep only the columns needed for the pivot
rsPR <- data.frame(presults$fips, presults$party, presults$candidate, presults$fraction_votes)
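
The county_facts columns are Census QuickFacts codes, so the dictionary table is handy for decoding them. A minimal sketch, assuming the dictionary stores the code in column_name and its label in description:

## Look up the meaning of a county_facts code (e.g. median household income)
cfd[cfd$column_name == "INC110213", ]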

Pivot Each Candidate’s Percentage of Votes from Rows to Columns for Each County Code

colnames(rsPR) <- c("fips","party","cand","perVotes")
newdf <- dcast(rsPR,fips ~ cand , value.var = "perVotes")
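
As a quick sanity check (a sketch, assuming the cast succeeded), the wide table should now have one row per fips code and one column per candidate:

## One row per county, one column per candidate
dim(newdf)
head(newdf[, c("fips", "Donald Trump", "Hillary Clinton")])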

mydf <- merge(newdf, countyfacts, by = "fips")
## Derived row ID, used for the later merges
mydf$ID <- rownames(mydf)
##head(mydf,1)

Derived Columns for the County Winners, and Binary Fields for the Candidates of Interest (Trump and Clinton)

demdf <- data.frame(mydf$`Hillary Clinton`, mydf$`Bernie Sanders`)
demwinners <- colnames(demdf)[apply(demdf,1,which.max)]

repubdf <- data.frame(mydf$`Ben Carson`, mydf$`Carly Fiorina`, mydf$`Chris Christie`, mydf$`Jeb Bush`, mydf$`John Kasich`, mydf$`Marco Rubio`, mydf$`Ted Cruz`, mydf$`Donald Trump`, mydf$`Mike Huckabee`, mydf$`Rand Paul`, mydf$`Rick Santorum`)
rwinners <- colnames(repubdf)[apply(repubdf,1,which.max)]

dwdf <- data.frame(demwinners)
dwdf$ID <- rownames(dwdf)

rwdf <- data.frame(rwinners)
rwdf$ID <- rownames(rwdf)

winners <- merge(rwdf,dwdf, by = "ID")
mydf <- merge(winners,mydf, by = "ID")
BinaryCandidateWin <- sqldf("select ID,
                   case when rwinners   = 'mydf..Donald.Trump.'    then 1 else 0 end as TrumpWin,
                   case when demwinners = 'mydf..Hillary.Clinton.' then 1 else 0 end as HilWin
                   from mydf")
mydf <- merge(BinaryCandidateWin,mydf, by = "ID")
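
The same binary columns can also be derived without sqldf via a vectorized comparison; a minimal sketch (TrumpWin2 and HilWin2 are hypothetical names used only for illustration):

## Vectorized equivalent of the CASE expressions above
mydf$TrumpWin2 <- as.integer(mydf$rwinners == "mydf..Donald.Trump.")
mydf$HilWin2 <- as.integer(mydf$demwinners == "mydf..Hillary.Clinton.")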

Map Plots: Income and Trump Percentage

  df <- data.frame(mydf$fips, mydf$INC110213)
  colnames(df) <- c("region", "value")

  county_choropleth(df, title = "Income", legend = "Dollars")

  df <- data.frame(mydf$fips, mydf$`Donald Trump`)
  colnames(df) <- c("region", "value")

  county_choropleth(df, title = "Percentage of Votes for Trump")
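
Since only the early-primary states have results, most of the national map is empty. choroplethr can zoom in on just those states; a sketch, assuming a choroplethr version that supports the state_zoom argument (and an illustrative subset of states):

  ## Zoom the Trump map to a few of the early states
  county_choropleth(df, title = "Percentage of Votes for Trump",
                    state_zoom = c("iowa", "new hampshire", "south carolina", "nevada"))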

This is a noteworthy plot: it offers a plausible narrative for why certain areas are not voting for Marco Rubio, as some might expect them to.

  ggplot(data = mydf, aes(x = RHI725214, y = AGE775214, color = rwinners)) +
    geom_jitter() +
    ggtitle("Republicans") +
    xlab("Percentage of Hispanics") +
    ylab("Percentage of People Over 65")

Can we apply the same narrative to the Democratic Candidates?

   ggplot(data = mydf, aes(x = RHI725214, y = AGE775214, color = demwinners)) +
     geom_jitter() +
     ggtitle("Democrats") +
     xlab("Percentage of Hispanics") +
     ylab("Percentage of People Over 65")

Martin O’Malley hardly took any votes at all.

Here I construct a pairwise scatterplot matrix to check whether there is an obvious relationship between any of the variables. I was mostly interested in whether two candidates from opposite sides of the aisle had a strongly positive or negative relationship. The only noteworthy observation is that Hillary Clinton’s and Bernie Sanders’ vote shares are almost perfectly negatively correlated. There was a third candidate in their primary, but he took such a negligible share of the votes that [% Votes for Bernie] + [% Votes for Hillary] ~ 1.

  pairs(mydf[,8:21])
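
That observation is easy to verify numerically; a quick sketch using the wide candidate columns created earlier:

  ## Clinton and Sanders shares should be strongly negatively correlated and sum to ~1
  cor(mydf$`Hillary Clinton`, mydf$`Bernie Sanders`, use = "complete.obs")
  summary(mydf$`Hillary Clinton` + mydf$`Bernie Sanders`)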

Now I wanted to see whether I could predict Trump victories from the County Facts data. I used a recursively partitioned decision tree (rpart) and passed in all the variables.

I split the data into equally sized training and testing sets with createDataPartition; by passing a stratification variable to the y argument (here the Republican county winner), the random sampling occurs within each class and should roughly preserve the overall class distribution.
As the confusion matrices below show, the model overfits: predictions on the training set are noticeably better than on the testing set.

  ##dropcols <- c("rwinners", "demwinners","area_name","state_abbreviation" )
##cordf <- mydf[ , !names(mydf) %in% dropcols ]

##i <- sapply(cordf, is.character)
##cordf[i] <- lapply(cordf[i], as.numeric)


##mycors <- cor(cordf, use="complete.obs", method="kendall") 

##myelectioncorplot <- corrplot(mycors,method = "circle")

treeFormu <- formula(TrumpWin ~ PST045214 + PST040210 + PST120214 + POP010210 + AGE135214 +
                       AGE295214 + AGE775214 + SEX255214 + RHI125214 + RHI225214 +
                       RHI325214 + RHI425214 + RHI525214 + RHI625214 + RHI725214 +
                       RHI825214 + POP715213 + POP645213 + POP815213 + EDU635213 +
                       EDU685213 + VET605213 + LFE305213 + HSG010214 + HSG445213 +
                       HSG096213 + HSG495213 + HSD410213 + HSD310213 + INC910213 +
                       INC110213 + PVY020213 + BZA010213 + BZA110213 + BZA115213 +
                       NES010213 + SBO001207 + SBO315207 + SBO115207 + SBO215207 +
                       SBO515207 + SBO415207 + SBO015207 + MAN450207 + WTN220207 +
                       RTN130207 + RTN131207 + AFN120207 + BPS030214 + LND110210 +
                       POP060210)
 set.seed(1001)
rprimindex <- createDataPartition(y=mydf$rwinners, times = 1, p =.5, list = F)
mydf.train <-mydf[rprimindex, ]
mydf.test  <-mydf[-rprimindex, ]
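
To confirm the stratified split behaves as described, the class proportions in the two halves can be compared; a minimal sketch:

## Class balance should be roughly the same in both halves
prop.table(table(mydf.train$TrumpWin))
prop.table(table(mydf.test$TrumpWin))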

treeModel <- rpart(treeFormu,  data=mydf.train)


## rpart fits a regression tree to the 0/1 outcome, so round the predicted
## values to obtain hard class labels
trumpPred.train <- factor(format(round(predict(treeModel, mydf.train))))
trumpPred.test <- factor(format(round(predict(treeModel, mydf.test))))
fancyRpartPlot(treeModel)
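
The same information can be read numerically from the fitted tree's variable importance scores; a short sketch:

## Variables ranked by their contribution to the splits
head(sort(treeModel$variable.importance, decreasing = TRUE))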

The variables of note were

Variable Description
RHI825214 White alone, not Hispanic or Latino, percent
RTN131207 Retail sales per capita, 2007
AGE135214 Persons under 5 years, percent
LFE305213 Mean travel time to work (minutes), workers age 16+
POP645213 Foreign born persons, percent

As shown by the confusion matrix results, the model overfit: accuracy was much better on the training set (85%) than on the testing set (72%).

confusionMatrix(trumpPred.train, mydf.train$TrumpWin)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 22  2
##          1 11 51
##                                          
##                Accuracy : 0.8488         
##                  95% CI : (0.7554, 0.917)
##     No Information Rate : 0.6163         
##     P-Value [Acc > NIR] : 2.11e-06       
##                                          
##                   Kappa : 0.6631         
##  Mcnemar's Test P-Value : 0.0265         
##                                          
##             Sensitivity : 0.6667         
##             Specificity : 0.9623         
##          Pos Pred Value : 0.9167         
##          Neg Pred Value : 0.8226         
##              Prevalence : 0.3837         
##          Detection Rate : 0.2558         
##    Detection Prevalence : 0.2791         
##       Balanced Accuracy : 0.8145         
##                                          
##        'Positive' Class : 0              
## 
confusionMatrix(trumpPred.test, mydf.test$TrumpWin)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 16  8
##          1 16 45
##                                         
##                Accuracy : 0.7176        
##                  95% CI : (0.6096, 0.81)
##     No Information Rate : 0.6235        
##     P-Value [Acc > NIR] : 0.04456       
##                                         
##                   Kappa : 0.3672        
##  Mcnemar's Test P-Value : 0.15304       
##                                         
##             Sensitivity : 0.5000        
##             Specificity : 0.8491        
##          Pos Pred Value : 0.6667        
##          Neg Pred Value : 0.7377        
##              Prevalence : 0.3765        
##          Detection Rate : 0.1882        
##    Detection Prevalence : 0.2824        
##       Balanced Accuracy : 0.6745        
##                                         
##        'Positive' Class : 0             
##