This was my input for the following Kaggle Competition: https://www.kaggle.com/benhamner/2016-us-election
The dataset was primary results from the “early” states for the 2016 US Presidential Election. Also, included with the data were “County Facts”, variables that described the population in that County.
| Library | Description |
|---|---|
| Chropleth | A library for plotting US County and State data |
| RPart | A library usued for its recursive partioning algorithms |
| GGPlot2 | Grammar of Graphics Plotting library |
| Reshape2 | Pivoting data from long to wide |
| caret | ML Library |
| RSQLite | Connect to SQLite tables |
| DPlyr | The Data Wrangling Package |
| SQLDF | Query dfs like tables |
| RColorBrewer | Produce Better Looking Trees |
dbpath = "C:/Users/Salil/Downloads/PE/2016_presidential_election/database.sqlite"
##dbpath = "H:/PE/2016_presidential_election/database.sqlite"
db <- dbConnect(dbDriver("SQLite"), dbpath)
dbListTables(db)
## [1] "county_facts" "county_facts_dictionary"
## [3] "primary_results"
countyfacts <- dbGetQuery(db, "select * from county_facts")
cfd <- dbGetQuery(db, "select * from county_facts_dictionary")
presults <- dbGetQuery(db, "select * from primary_results")
rsPR <- data.frame(presults$fips,presults$party,presults$candidate, presults$fraction_votes)
colnames(rsPR) <- c("fips","party","cand","perVotes")
newdf <- dcast(rsPR,fips ~ cand , value.var = "perVotes")
mydf <- merge(newdf,countyfacts, by = "fips")
mydf$ID <- rownames(mydf)
##Derived column
##head(mydf,1)
demdf <- data.frame(mydf$`Hillary Clinton`, mydf$`Bernie Sanders`)
demwinners <- colnames(demdf)[apply(demdf,1,which.max)]
repubdf <- data.frame(mydf$`Ben Carson`, mydf$`Carly Fiorina`, mydf$`Chris Christie`, mydf$`Jeb Bush`, mydf$`John Kasich`, mydf$`Marco Rubio`, mydf$`Ted Cruz`, mydf$`Donald Trump`, mydf$`Ted Cruz`, mydf$`Mike Huckabee` ,mydf$`Rand Paul`, mydf$`Rick Santorum`)
rwinners <- colnames(repubdf)[apply(repubdf,1,which.max)]
dwdf <- data.frame(demwinners)
dwdf$ID <- rownames(dwdf)
rwdf <- data.frame(rwinners)
rwdf$ID <- rownames(rwdf)
winners <- merge(rwdf,dwdf, by = "ID")
mydf <- merge(winners,mydf, by = "ID")
BinaryCandidateWin <- sqldf("select ID,
case when rwinners = 'mydf..Donald.Trump.' then 1 else 0 end as TrumpWin,
case when demwinners = 'mydf..Hillary.Clinton.' then 1 else 0 end as HilWin
from mydf ")
## Loading required package: tcltk
mydf <- merge(BinaryCandidateWin,mydf, by = "ID")
df <- data.frame(mydf$fips, mydf$INC110213 )
colnames(df) <- cbind("region", "value")
county_choropleth(df, title = "Income", legend = "Dollars")
df <- data.frame(mydf$fips, mydf$`Donald Trump` )
colnames(df) <- cbind("region", "value")
county_choropleth(df, title ="Percentage of Votes for Trump")
This is a noteworthy plot, and we can provide a good narrative for why certain areas aren’t voting for Marco Rubio, as people might possibly expect
ggplot(aes(x=RHI725214,y=AGE775214,color=mydf$rwinners),data=mydf)+geom_jitter() + ggtitle("Republicans")+xlab("Percentage of Hispanics") + ylab("Percentage of People Over 65")
Can we apply the same narrative to the Democratic Candidates?
ggplot(aes(x=RHI725214,y=AGE775214,color=mydf$demwinners),data=mydf)+geom_jitter() + ggtitle("Democrats") +xlab("Percentage of Hispanics") +ylab("Percentage of People Over 65")
Here I construct a pairwise correlation matrix that shows if there is indeed an obvious correlation between any of the variables. I was mostly interested in seeing if there were two candidates from the opposite sides of the aisle that either had a strong positive or negative relation. The only noteworthy observation is that Hillary Clinton and Bernie Sanders are nearly perfectly related. There was a third candidate in their primary but he took such a negligble amount of votes that [% Votes for Bernie] + [% Votes for Hillary] ~ 1
pairs(mydf[,8:21])
Now I thought I would see if I can predict Trump Victories by looking at the different County Facts data. I used a Recursive Partitioned Decision Tree, and passed in all the variables.
I split the data into equal sized testing and training data. I used createDataParition, and by passing the variable of interest to the y argument, the random samppling occurs within each class (Trump Wins/ Trump Doesnt Win) and should preserve the overall class distribution.
As shown by the confusion matrix results the model overfitted. We got a much better result on the Training set prediction than the Testing set.
##dropcols <- c("rwinners", "demwinners","area_name","state_abbreviation" )
##cordf <- mydf[ , !names(mydf) %in% dropcols ]
##i <- sapply(cordf, is.character)
##cordf[i] <- lapply(cordf[i], as.numeric)
##mycors <- cor(cordf, use="complete.obs", method="kendall")
##myelectioncorplot <- corrplot(mycors,method = "circle")
treeFormu <- formula(TrumpWin ~ PST045214 +
PST040210 + PST120214 + POP010210 + AGE135214 +
AGE295214 + AGE775214 + SEX255214 + RHI125214 +
RHI225214 + RHI325214 + RHI425214 + RHI525214 +
RHI625214 + RHI725214 + RHI825214 + POP715213 +
POP645213 + POP815213 + EDU635213 + EDU685213 +
VET605213 + LFE305213 + HSG010214 + HSG445213 +
HSG096213 + HSG495213 + HSD410213 + HSD310213 +
INC910213 + INC110213 + PVY020213 + BZA010213 +
BZA110213 + BZA115213 + NES010213 + SBO001207 +
SBO315207 + SBO115207 + SBO215207 + SBO515207 +
SBO415207 + SBO015207 + MAN450207 + WTN220207 +
RTN130207 + RTN131207 + AFN120207 + BPS030214 +
LND110210 + POP060210 )
set.seed(1001)
rprimindex <- createDataPartition(y=mydf$rwinners, times = 1, p =.5, list = F)
mydf.train <-mydf[rprimindex, ]
mydf.test <-mydf[-rprimindex, ]
treeModel <- rpart(treeFormu, data=mydf.train)
trumpPred.train<- factor(format(round(predict(treeModel, mydf.train))))
trumpPred.test<- factor(format(round(predict(treeModel, mydf.test))))
fancyRpartPlot(treeModel)
The variables of note were
| Variable | Description |
|---|---|
| RHI825214 | White (Alone), percent |
| RTN131207 | Retail Sales 2007 |
| AGE135214 | Perons under 5 year, percent |
| LFE305213 | Pivoting data from long to wide |
| POP645213 | Overall Population Amount |
As shown by the confusion matrix results the model overfitted. We got a much better result on the Training set (85% accuracy) than the Testing set (72%)
confusionMatrix(trumpPred.train, mydf.train$TrumpWin)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 22 2
## 1 11 51
##
## Accuracy : 0.8488
## 95% CI : (0.7554, 0.917)
## No Information Rate : 0.6163
## P-Value [Acc > NIR] : 2.11e-06
##
## Kappa : 0.6631
## Mcnemar's Test P-Value : 0.0265
##
## Sensitivity : 0.6667
## Specificity : 0.9623
## Pos Pred Value : 0.9167
## Neg Pred Value : 0.8226
## Prevalence : 0.3837
## Detection Rate : 0.2558
## Detection Prevalence : 0.2791
## Balanced Accuracy : 0.8145
##
## 'Positive' Class : 0
##
confusionMatrix(trumpPred.test, mydf.test$TrumpWin)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16 8
## 1 16 45
##
## Accuracy : 0.7176
## 95% CI : (0.6096, 0.81)
## No Information Rate : 0.6235
## P-Value [Acc > NIR] : 0.04456
##
## Kappa : 0.3672
## Mcnemar's Test P-Value : 0.15304
##
## Sensitivity : 0.5000
## Specificity : 0.8491
## Pos Pred Value : 0.6667
## Neg Pred Value : 0.7377
## Prevalence : 0.3765
## Detection Rate : 0.1882
## Detection Prevalence : 0.2824
## Balanced Accuracy : 0.6745
##
## 'Positive' Class : 0
##