Introduction

What predicts voting outcomes? In this analysis, we use data from Show of Hands, an informal polling platform for mobile devices and the web, to see which aspects and characteristics of people’s lives predict how they will vote in the presidential election.

Show of Hands has been downloaded over 300,000 times across Apple and Android app stores, and users have cast more than 75 million votes. In this problem, we’ll use data from thousands of users and one hundred different questions to see which responses predict voting outcomes.

The aim of the project is to compare the accuracy of various machine learning models in predicting voting outcomes. We are given training data with labelled results and must predict the voting outcomes on the test data; those predictions can be checked by uploading them to Kaggle, which reports the accuracy and a ranking on the leaderboard.

Load Packages

library(data.table)    # fast data manipulation
library(rpart)         # decision trees
library(rpart.plot)    # tree plotting
library(randomForest)  # random forests
library(ROCR)          # ROC curves and AUC
library(mice)          # multiple imputation
library(Hmisc)         # cut2 for binning
library(caret)         # tuning, dummy variables, feature selection
library(mlbench)       # benchmark data utilities
library(glmnet)        # penalised regression
library(ggplot2)       # plotting
library(xgboost)       # gradient-boosted trees
library(Boruta)        # all-relevant feature selection

Load Data

Train <- fread("train2016.csv", na.strings = c("NA",""), stringsAsFactors = T)
Test <- fread("test2016.csv", na.strings = c("NA",""), stringsAsFactors = T)

Data Cleaning

# Flag the rows so Train and Test can be separated again after joint cleaning
Train$trainflag <- 1
Test$trainflag <- 0
combined <- rbind(Train[, .SD, .SDcols = -"Party"], Test)  # drop the outcome before stacking
combined$USER_ID <- NULL
summary(combined)

The results point to a large number of missing values in the dataset. We are going to impute the NAs in the demographic variables using the mice package and replace the NAs in all the question variables with -1.
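As a quick check (not part of the original pipeline), the missingness can be quantified per column with base R before imputing:

# Proportion of NAs per column, highest first
missShare <- sort(colMeans(is.na(combined)), decreasing = TRUE)
head(round(missShare, 2), 10)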

boxplot(combined$YOB)

tail(sort(combined$YOB))
## [1] 2003 2003 2011 2013 2039 2039
# Years of birth after 2013 cannot belong to real respondents, so blank them out
set(combined, i = which(combined[[1]] > 2013), j = 1L, value = NA)

The value 2039 is clearly invalid, so we set it to NA. Converting the year of birth to an age-group variable can be more useful for the classifiers, and it also helps with missing-value imputation, since imputing the raw year of birth was causing problems.

We’ll also recode the NAs in the question variables, which have a large proportion of missing values, to -1.

combined$ageGroup <- cut2(combined$YOB, c(1880,1959,1969,1979,1989,1996,2014))
levels(combined$ageGroup)  <- c(">55","45-55","35-45","25-35","18-25","0-18")
combined$YOB <- NULL
exclude <- c("Income","HouseholdStatus","EducationLevel","ageGroup")
# Recode NAs in the question variables to -1 (ifelse turns factors into their integer codes)
minus <- combined[, lapply(.SD, function(x){ifelse(is.na(x),-1,x)}), .SDcols = setdiff(names(combined), exclude)]
untreated <- combined[, .SD, .SDcols = setdiff(names(combined), exclude)]  # untouched copy, kept for reference

Now we compute imputed values for the demographic variables.

set.seed(144)
miceImpute <- combined[, exclude, with=F]
# Run five multiple imputations; complete() returns the first completed dataset
miceTreated <- complete(mice(miceImpute, m = 5, maxit = 5, printFlag = F))
# Round-trip through csv so the imputed columns are read back in as factors
write.csv(miceTreated, "miceTreated.csv", row.names = FALSE)
miceTreated <- fread("miceTreated.csv", stringsAsFactors = T)
# One-hot encode the demographics, dropping one level per variable (fullRank)
dummy <- dummyVars( ~ HouseholdStatus + Income + EducationLevel + ageGroup, data = miceTreated, fullRank = T)
miceTreated <- cbind(miceTreated, predict(dummy, newdata = miceTreated))
miceTreated[, (exclude) := NULL]
names(miceTreated) <- make.names(names(miceTreated))

Combining the two imputations:

imputedData <- cbind(miceTreated, minus)
imputedTrain <- imputedData[combined$trainflag == 1]
imputedTrain$Party <- Train$Party
imputedTest <- imputedData[combined$trainflag == 0]
imputedTrain[, trainflag := NULL]
imputedTest[, trainflag := NULL]
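A quick sanity check, not in the original write-up, confirms the recombined tables line up with the raw files:

# Row counts should match the original Train/Test data
stopifnot(nrow(imputedTrain) == nrow(Train), nrow(imputedTest) == nrow(Test))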

Modelling

We’ll compare how accurately the following models predict the outcome:

  1. Logistic regression and Penalised logistic regression

  2. Decision Trees

  3. RandomForests

  4. Boosted Trees using xgboost

Logistic Regression

modglm <- glm(Party ~ ., data = imputedTrain, family = "binomial")

predlmTrain <- predict(modglm, type = "response")
table(imputedTrain$Party, predlmTrain > 0.5)
##             
##              FALSE TRUE
##   Democrat    2015  936
##   Republican  1195 1422
predlmROCR <- prediction(predlmTrain, imputedTrain$Party)
# Area under the ROC curve
as.numeric(performance(predlmROCR,"auc")@y.values)
## [1] 0.6648381
# Making Test Predictions
predlmTest <- predict(modglm, newdata = imputedTest, type = "response")
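The AUC above summarises the ROC curve, which ROCR can also draw from the same prediction object (a supplementary plot, not in the original analysis):

# ROC curve for the logistic model on the training set
perfROC <- performance(predlmROCR, "tpr", "fpr")
plot(perfROC, colorize = TRUE)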

Since there are so many variables in the input dataset, let’s try feature selection using rfe from the caret package.

rfControl <- rfeControl(rfFuncs, method = "cv", number = 10)
# Column 107 is the Party outcome, so it is excluded from the predictors
results <- rfe(imputedTrain[,-107,with=F], Train$Party, rfeControl = rfControl)
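The retained variables can be read off the rfe object with caret’s predictors() accessor:

# Variables chosen by recursive feature elimination
predictors(results)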

The four important variables selected are Q109244, Q115611, Q113181, and Q98197. Let’s train a logistic regression model using these variables.

modrfelm <- glm(Party ~ Q109244+ Q115611+ Q113181+ Q98197, data = imputedTrain, family = "binomial")
predrfelmTrain <- predict(modrfelm, type = "response")
table(imputedTrain$Party, predrfelmTrain > 0.5)
##             
##              FALSE TRUE
##   Democrat    2124  827
##   Republican  1620  997
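The training accuracy quoted below follows directly from this table:

# (2124 + 997) / 5568
mean(ifelse(predrfelmTrain > 0.5, "Republican", "Democrat") == imputedTrain$Party)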

The training set accuracy is 0.5605244, which is quite low, so it won’t make sense to use this model. Let’s try the Boruta package for determining important variables.

boruta.train <- Boruta(Party ~ ., data = imputedTrain, doTrace = 2)
final.boruta <- TentativeRoughFix(boruta.train)
getSelectedAttributes(final.boruta)
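Boruta objects also have a plot method, which gives a quick visual check of each attribute’s importance distribution (green for confirmed, red for rejected attributes):

# Importance boxplots across Boruta's random-forest runs
plot(final.boruta, las = 2, cex.axis = 0.6)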

The features selected by Boruta are: “Gender” “Q121699” “Q120650” “Q120472” “Q118233” “Q115611” “Q113181” “Q110740” “Q109244” “Q108855” “Q106272” “Q106042” “Q105655” “Q101163” “Q100010” “Q99716” “Q99480” “Q98869” “Q98197”. Now we’ll implement the penalised regression techniques lasso and ridge, using the selected variables for modelling.

borSelect <- c("Gender","Q121699","Q120650","Q120472","Q118233","Q115611", "Q113181","Q110740","Q109244","Q108855","Q106272","Q106042","Q105655","Q101163","Q100010","Q99716","Q99480","Q98869","Q98197")

The glmnet package is used for penalised logistic regression. It requires a numeric matrix and doesn’t accept data frames, so we have to convert our data accordingly.

set.seed(100)
partyTrain <- as.matrix(ifelse(imputedTrain$Party == "Democrat" ,0 ,1))
inputMatrix <- data.matrix(imputedTrain[,borSelect,with = F]) # Converts dataframe to numeric matrix
testMatrix <- data.matrix(imputedTest[,borSelect,with = F])
# Lasso Regression
modLasso <- cv.glmnet(inputMatrix,partyTrain,alpha=1,family="binomial")
plot(modLasso)

#Making predictions on Test set
predLassoTest <- predict(modLasso, newx = testMatrix, s = "lambda.min", type = "class")
# Ridge Regression
modRidge <- cv.glmnet(inputMatrix,partyTrain,alpha=0,family="binomial")
plot(modRidge)

#Making predictions on Test set
predRidgeTest <- predict(modRidge, newx = testMatrix, s = "lambda.min", type = "class")
#modglmnet <- train(x = inputMatrix, y = imputedTrain$Party, method = "glmnet", trControl = trnCtrl, tuneGrid = expand.grid(alpha = seq(0, 1, 0.1), lambda = 10 ^ seq(10, -2, length = 100)))
#modglmnet$bestTune
# selected values are alpha = 0.2 and lambda = 0.01
# Elastic Net Regression
modelnet <- cv.glmnet(inputMatrix,partyTrain,alpha = 0.2, family = "binomial")
predElTest <- predict(modelnet, newx = testMatrix, s = "lambda.min", type = "class")
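One advantage of the penalised fits is interpretability: the coefficients that survive shrinkage can be read off directly with coef() at the cross-validated lambda (an extra inspection step, not in the original code):

# Coefficients of the lasso model at the CV-chosen lambda
coef(modLasso, s = "lambda.min")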

Trees!

Now we move to decision trees for possibly better accuracy. Here we tune the cp parameter with cross-validation.

set.seed(244)
trnCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
# The caret tuning run below was used once to select cp; the chosen value is hard-coded
#rpartTrained <- train(Party ~ ., data = imputedTrain, method = "rpart", trControl = trnCtrl, tuneLength = 100)
modTree <- rpart(Party ~ ., data = imputedTrain, method = "class", cp = 0.0364748)

prp(modTree)

predTreeTrain <- predict(modTree, type = "class")
table(imputedTrain$Party, predTreeTrain)
##             predTreeTrain
##              Democrat Republican
##   Democrat       2257        694
##   Republican     1430       1187
predTreeTest <- predict(modTree, newdata = imputedTest, type = "class")

Cross-validation gave a cp value of 0.0364748, and using it in the rpart model yields a tree with just three splits.
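For comparison with the other models, the tree’s training accuracy follows directly from the confusion matrix above:

# (2257 + 1187) / 5568
mean(predTreeTrain == imputedTrain$Party)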

Now we try random forests on the data.

set.seed(10)
modrf <- randomForest(Party ~ ., data = imputedTrain[,c(borSelect,"Party"),with = F], ntree = 1000,importance = T)
predrfTrain <- predict(modrf, type = "class")
table(imputedTrain$Party, predrfTrain)
##             predrfTrain
##              Democrat Republican
##   Democrat       1962        989
##   Republican     1088       1529
# Making Predictions on the Test Set
predrfTest <- predict(modrf, newdata = imputedTest[,borSelect,with=F], type = "response")
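Since the forest was grown with importance = T, the relative influence of the Boruta-selected variables can be inspected with randomForest’s varImpPlot (a supplementary plot, not in the original code):

# Mean decrease in accuracy and Gini for the top variables
varImpPlot(modrf, n.var = 10, main = "Random forest variable importance")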

Finally, we’ll try boosted trees with the xgboost package. Variable importance from xgb.importance was used to select the features to include in the xgboost model.

xgInclude <- c("Gender","Q106042","Q113181","Q100010","Q115611","Q109244","Q110740","Q98197","Q99480","Q120650","Q101163","Q118233","Q98869","Q99716","Q120472","Q106272","Q105655","Q108855","Q121699")
# Creating dummy variables, as xgboost doesn't accept factor variables.
dummy2 <- dummyVars(~., data = combined[,xgInclude,with=F], fullRank = TRUE)
trialTrain <- predict(dummy2, newdata = combined[trainflag==1, .SD, .SDcols = xgInclude])
trialTest <- predict(dummy2, newdata = combined[trainflag==0, .SD, .SDcols = xgInclude])
X_train <- data.matrix(trialTrain)
X_test <- data.matrix(trialTest)
xgtrain <- xgb.DMatrix(X_train, label = partyTrain, missing = NA)
xgtest <- xgb.DMatrix(X_test, missing = NA)

# Setting Parameters for xgboost model
params <- list()
params$objective <- "binary:logistic"
params$eta <- 0.05
params$max_depth <- 5
params$subsample <- 0.5
params$colsample_bytree <- 0.9
params$min_child_weight <- 1
params$eval_metric <- "auc"

# Using cv for tuning parameters
modelXgbCv <- xgb.cv(params = params, data = xgtrain, nrounds = 100, nfold = 10, early.stop.round = 30, prediction = TRUE, verbose = FALSE)
## Stopping. Best iteration: 33
modelXgb <- xgb.train(params = params, data = xgtrain, nrounds = 100)
# Label the importance table with the dummy-encoded column names of the training matrix
# (borSelect lists the same variables in a different order and would mislabel them)
vimp <- xgb.importance(model = modelXgb, feature_names = colnames(X_train))
# Checking important variables in the model
head(vimp)
# Making predictions on the Test set
predxgbTest <- predict(modelXgb, newdata = xgtest)
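The importance table can also be plotted with xgboost’s own helper (an extra visual, not in the original code):

# Gain-based feature importance for the boosted model
xgb.plot.importance(importance_matrix = vimp)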

Final Submission and Results

threshold <- 0.5
predglmLabel <- as.factor(ifelse(predlmTest < threshold, "Democrat","Republican")) # Acc - 0.60345 , Rank - 2504
predLassoLabel <- as.factor(ifelse(predLassoTest == 0, "Democrat","Republican")) # Acc - 0.61207 , Rank - 2194
predRidgeLabel <- as.factor(ifelse(predRidgeTest == 0, "Democrat","Republican")) # Acc - 0.61603, Rank - 2416
predElLabel <- as.factor(ifelse(predElTest == 0, "Democrat","Republican")) # Acc- 0.59626 , Rank - 2579
# predTreeTest - Acc - 0.61207 , Rank - 2194
# predrfTest - Acc - 0.64368 , Rank - 158
predxgbLabel <- as.factor(ifelse(predxgbTest < threshold, "Democrat","Republican")) # Acc - 0.63506, Rank - 648

#mySubmission <- data.table(USER_ID = Test$USER_ID, Predictions = predxgbLabel)
#write.csv(mySubmission, "xgb.csv", row.names = F)

The final results show that random forests and xgboost give us the most accurate models, well ahead of decision trees and logistic regression. The results from lasso and ridge regression improve on plain logistic regression, showing that penalisation can strengthen a logistic regression model while retaining its interpretability. So when accuracy is the main concern, random forests and xgboost give the best models, whereas when interpretability is important, penalised logistic regression and decision trees can be used.