ShowOfHands Competition

In this competition, we were to use data from Show of Hands, an informal polling platform for use on mobile devices and the web, to see what aspects and characteristics of people's lives predict happiness.

We start off by loading the necessary R libraries and reading in the data. More details about the data can be found at www.kaggle.com/c/the-analytics-edge-mit-15-071x/data

library(caTools)  # for splitting data into train and test set
library(glmnet)  # for ridge regression
library(ROCR)  # for finding the AUC value of our model
showofhands = read.csv("train.csv", stringsAsFactors = TRUE)  # R >= 4.0 no longer converts strings to factors by default
summary(showofhands)
##      UserID          YOB          Gender                     Income    
##  Min.   :   1   Min.   :1900         : 537                      :1215  
##  1st Qu.:1770   1st Qu.:1969   Female:1650   $100,001 - $150,000: 571  
##  Median :3717   Median :1982   Male  :2432   $25,001 - $50,000  : 545  
##  Mean   :3830   Mean   :1979                 $50,000 - $74,999  : 642  
##  3rd Qu.:5674   3rd Qu.:1992                 $75,000 - $100,000 : 567  
##  Max.   :9503   Max.   :2039                 over $150,000      : 536  
##                 NA's   :684                  under $25,000      : 543  
##                     HouseholdStatus               EducationLevel
##                             : 800                        :1091  
##  Domestic Partners (no kids): 118   Bachelor's Degree    : 935  
##  Domestic Partners (w/kids) :  34   Current K-12         : 607  
##  Married (no kids)          : 522   Current Undergraduate: 557  
##  Married (w/kids)           :1226   Master's Degree      : 503  
##  Single (no kids)           :1760   High School Diploma  : 487  
##  Single (w/kids)            : 159   (Other)              : 439  
##          Party          Happy       Q124742    Q124122    Q123464   
##             : 728   Min.   :0.000      :2563      :1613      :1455  
##  Democrat   : 926   1st Qu.:0.000   No :1300   No :1233   No :2966  
##  Independent:1126   Median :1.000   Yes: 756   Yes:1773   Yes: 198  
##  Libertarian: 409   Mean   :0.564                                   
##  Other      : 245   3rd Qu.:1.000                                   
##  Republican :1185   Max.   :1.000                                   
##                                                                     
##  Q123621    Q122769    Q122770       Q122771     Q122120    Q121699   
##     :1524      :1333      :1211          :1201      :1230      :1080  
##  No :1506   No :2019   No :1445   Private: 567   No :2499   No : 936  
##  Yes:1589   Yes:1267   Yes:1963   Public :2851   Yes: 890   Yes:2603  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##  Q121700    Q120978    Q121011    Q120379    Q120650       Q120472    
##     :1113      :1175      :1134      :1219      :1225          :1276  
##  No :3037   No :1530   No :1563   No :1847   No : 270   Art    :1061  
##  Yes: 469   Yes:1914   Yes:1922   Yes:1553   Yes:3124   Science:2282  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##         Q120194     Q120012    Q120014    Q119334    Q119851   
##             :1337      :1188      :1322      :1207      :1098  
##  Study first:1910   No :1813   No :1333   No :1748   No :2050  
##  Try first  :1372   Yes:1618   Yes:1964   Yes:1664   Yes:1471  
##                                                                
##                                                                
##                                                                
##                                                                
##       Q119650     Q118892    Q118117          Q118232     Q118233   
##           :1190      :1019      :1092             :1580      :1289  
##  Giving   :2611   No :1331   No :2087   Idealist  :1327   No :2392  
##  Receiving: 818   Yes:2269   Yes:1440   Pragmatist:1712   Yes: 938  
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##  Q118237           Q117186               Q117193     Q116797   
##     :1240              :1429                 :1410      :1338  
##  No :1831   Cool headed:2054   Odd hours     :1299   No :2169  
##  Yes:1548   Hot headed :1136   Standard hours:1910   Yes:1112  
##                                                                
##                                                                
##                                                                
##                                                                
##   Q116881     Q116953    Q116601    Q116441    Q116448    Q116197    
##       :1445      :1412      :1217      :1255      :1304       :1251  
##  Happy:2268   No :1063   No : 578   No :2098   No :1839   A.M.:1172  
##  Right: 906   Yes:2144   Yes:2824   Yes:1266   Yes:1476   P.M.:2196  
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##  Q115602     Q115777     Q115610    Q115611             Q115899    
##     :1239        :1346      :1280      :1120                :1342  
##  No : 719   End  :1346   No : 581   No :2230   Circumstances:1448  
##  Yes:2661   Start:1927   Yes:2758   Yes:1269   Me           :1829  
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##  Q115390    Q114961    Q114748    Q115195    Q114517          Q114386    
##     :1421      :1280      :1132      :1283      :1189             :1309  
##  No :1304   No :1707   No :1494   No :1193   No :2283   Mysterious:1891  
##  Yes:1894   Yes:1632   Yes:1993   Yes:2143   Yes:1147   TMI       :1419  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##  Q113992    Q114152     Q113583           Q113584     Q113181   
##     :1195      :1422        :1292             :1306      :1262  
##  No :2387   No :2194   Talk :1085   People    :1666   No :1954  
##  Yes:1037   Yes:1003   Tunes:2242   Technology:1647   Yes:1403  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##  Q112478    Q112512    Q112270    Q111848          Q111580     Q111220   
##     :1372      :1305      :1417      :1173             :1355      :1255  
##  No :1278   No : 645   No :1765   No :1349   Demanding :1158   No :2468  
##  Yes:1969   Yes:2669   Yes:1437   Yes:2097   Supportive:2106   Yes: 896  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##  Q110740    Q109367             Q108950     Q109244      Q108855    
##     :1246      :1311                :1271      :1399         :1587  
##  Mac:1395   No :1285   Cautious     :2289   No :2353   Umm...:1203  
##  PC :1978   Yes:2023   Risk-friendly:1059   Yes: 867   Yes!  :1829  
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##  Q108617         Q108856     Q108754         Q108342     Q108343   
##     :1336            :1591      :1399            :1369      :1358  
##  No :2884   Socialize: 880   No :2166   In-person:2240   No :1981  
##  Yes: 399   Space    :2148   Yes:1054   Online   :1010   Yes:1280  
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##  Q107869    Q107491    Q106993           Q106997     Q106272    Q106388   
##     :1375      :1302      :1310              :1319      :1321      :1380  
##  No :1468   No : 436   No : 567   Grrr people:1778   No : 937   No :2359  
##  Yes:1776   Yes:2881   Yes:2742   Yay people!:1522   Yes:2361   Yes: 880  
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##  Q106389    Q106042    Q105840    Q105655    Q104996    Q103293   
##     :1429      :1334      :1438      :1229      :1252      :1306  
##  No :1684   No :1723   No :1720   No :1521   No :1652   No :1775  
##  Yes:1506   Yes:1562   Yes:1461   Yes:1869   Yes:1715   Yes:1538  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##  Q102906    Q102674    Q102687    Q102289    Q102089          Q101162    
##     :1432      :1441      :1315      :1365       :1329            :1381  
##  No :2038   No :2016   No :1609   No :2259   Own :2281   Optimist :2021  
##  Yes:1149   Yes:1162   Yes:1695   Yes: 995   Rent:1009   Pessimist:1217  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##  Q101163    Q101596    Q100689    Q100680    Q100562       Q99982    
##     :1485      :1368      :1198      :1341      :1349         :1418  
##  Dad:1760   No :2117   No :1346   No :1293   No : 643   Check!:1677  
##  Mom:1374   Yes:1134   Yes:2075   Yes:1985   Yes:2627   Nope  :1524  
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##  Q100010    Q99716     Q99581     Q99480     Q98869     Q98578    
##     :1247      :1355      :1288      :1310      :1476      :1448  
##  No : 653   No :2884   No :2868   No : 738   No : 704   No :1993  
##  Yes:2719   Yes: 380   Yes: 463   Yes:2571   Yes:2439   Yes:1178  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Q98059     Q98078     Q98197     Q96024         votes      
##            :1267      :1517      :1438      :1459   Min.   : 20.0  
##  Only-child: 330   No :1768   No :1895   No :1232   1st Qu.: 45.0  
##  Yes       :3022   Yes:1334   Yes:1286   Yes:1928   Median : 82.0  
##                                                     Mean   : 71.9  
##                                                     3rd Qu.: 99.0  
##                                                     Max.   :101.0  
## 

From the summary we can see that the YOB variable has 684 NAs. We impute these with the mean of the YOB variable, rounded to a whole year. I believe the exact fill value makes little difference to the model, though that is an assumption.

showofhands[is.na(showofhands$YOB), "YOB"] = round(mean(showofhands$YOB, na.rm = TRUE))
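
As a minimal sketch of the imputation step, the toy vector below (made-up values, not taken from the data) is filled once with the rounded mean and once with the median, so the two choices can be compared. The names was_missing, fill_mean and fill_median are illustrative only.

```r
# Toy birth years; the values are made up for illustration.
yob <- c(1980, NA, 1975, 1992, NA, 1969)

was_missing <- is.na(yob)                      # remember which entries get imputed
fill_mean   <- round(mean(yob, na.rm = TRUE))  # 1979 for this toy vector
fill_median <- median(yob, na.rm = TRUE)       # alternative fill value

yob_mean_imputed   <- ifelse(was_missing, fill_mean, yob)
yob_median_imputed <- ifelse(was_missing, fill_median, yob)

# Side-by-side view of the two imputations
data.frame(yob, was_missing, yob_mean_imputed, yob_median_imputed)
```

Keeping was_missing as an extra column is one way to let a model see that a value was imputed, should that turn out to matter.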

We now split the data set into a training and a test set using the sample.split function of the caTools library. The function preserves the ratio of the classes of the dependent variable, i.e. 'Happy', in both sets, so performance measured on the test set is a more reliable estimate of how the model generalizes.

set.seed(123)
split = sample.split(showofhands$Happy, SplitRatio = 0.75)
train = subset(showofhands, split == TRUE)
test = subset(showofhands, split == FALSE)
dim(train)
## [1] 3464  110
dim(test)
## [1] 1155  110
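
The idea of a class-preserving split can be sketched in base R as follows. This is an illustration of the principle on simulated labels, not the implementation used by caTools; stratified_split is a hypothetical helper.

```r
set.seed(123)

# Simulated 0/1 outcome with roughly the same share of 1s as 'Happy' (assumed 0.56)
happy <- rbinom(1000, 1, 0.56)

# Hypothetical helper: sample the same fraction from each class separately
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx    <- which(y == cls)
    n_take <- round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_take)]] <- TRUE
  }
  in_train
}

in_train <- stratified_split(happy, 0.75)

# The share of 1s should be nearly identical in all three
c(all = mean(happy), train = mean(happy[in_train]), test = mean(happy[!in_train]))
```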

We then prepare the data for the ridge regression. The glmnet package requires the independent variables of the training set as a numeric matrix of dimension nobs x nvars. Our data set has 110 columns, but many of them are factors, and an n-level factor can be represented by (n-1) dummy variables. The model.matrix function handles both transformations, i.e. dummy coding of the factors and conversion to a matrix. The '-1' in the formula drops the intercept column from the design matrix, since glmnet fits its own intercept; as a side effect, the first factor in the formula keeps one dummy column for each of its levels. UserID is excluded because it is only an identifier. After the transformation, the resulting matrix 'x' has 231 columns.

Further, the dependent variable needs to be a numeric vector (as we are predicting probabilities). Finally, we apply the same transformation to the test set.

x = model.matrix(Happy ~ . - 1 - UserID, data = train)
y = as.numeric(train$Happy)
x.test = model.matrix(Happy ~ . - 1 - UserID, data = test)
dim(x)
## [1] 3464  231
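
To see what model.matrix does with factors and with the '-1' term, here is a small sketch on a made-up data frame that borrows a few column names from the data set (the rows are invented).

```r
# Toy data; the rows are invented for illustration.
toy <- data.frame(
  Happy  = c(1, 0, 1, 0, 1, 0),
  Gender = factor(c("Female", "Male", "Male", "Female", "Male", "Female")),
  Party  = factor(c("Democrat", "Independent", "Republican",
                    "Democrat", "Republican", "Independent")),
  YOB    = c(1980, 1975, 1990, 1985, 1969, 1992)
)

x_with_intercept <- model.matrix(Happy ~ ., data = toy)
x_no_intercept   <- model.matrix(Happy ~ . - 1, data = toy)

colnames(x_with_intercept)
# "(Intercept)" "GenderMale" "PartyIndependent" "PartyRepublican" "YOB"
colnames(x_no_intercept)
# "GenderFemale" "GenderMale" "PartyIndependent" "PartyRepublican" "YOB"
```

With the intercept removed, the first factor (Gender) gets a column for both of its levels, while Party still drops its first level.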

We now fit the ridge regression model to our data using the glmnet function (alpha = 0 signifies ridge regression, whereas alpha = 1 signifies lasso) and plot the resulting variable coefficients as a function of the regularization parameter lambda. As we can see, with increasing values of lambda the coefficients of the independent variables are shrunk towards zero, but never exactly to zero. This is the essence of a ridge regression model. The shrinkage trades a little bias for lower variance, which tends to give a more robust model that generalizes well.

ridge = glmnet(x, y, alpha = 0)
plot(ridge, xvar = "lambda", label = TRUE)

[Figure: ridge coefficient paths plotted against log(lambda)]
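
The shrinkage behaviour can be reproduced without glmnet using the closed-form ridge solution (X'X + lambda I)^-1 X'y on simulated data. This is only a sketch: the data, the helper ridge_coef and the lambda grid are assumptions, and the lambda scale is not comparable to the one glmnet uses.

```r
set.seed(1)
n <- 100; p <- 5

# Simulated, standardized predictors and a centred response (no intercept needed)
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.numeric(X %*% c(2, -1, 0.5, 0, 0) + rnorm(n))
y <- y - mean(y)

# Hypothetical helper: closed-form ridge coefficients for one lambda
ridge_coef <- function(X, y, lambda) {
  as.numeric(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

lambdas <- c(0, 1, 10, 100, 1000)
betas <- sapply(lambdas, function(l) ridge_coef(X, y, l))
colnames(betas) <- lambdas

round(betas, 3)                    # one column of coefficients per lambda
l2_norm <- sqrt(colSums(betas^2))  # overall size of the coefficient vector
l2_norm
```

The norm of the coefficient vector falls as lambda grows, yet no coefficient becomes exactly zero, which is the behaviour visible in the coefficient path plot.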

The glmnet library has a very handy function named cv.glmnet, which runs a 10-fold cross-validation on our data set. It reports two values of lambda: the one with the lowest cross-validated mean-squared error (lambda.min, shown by the first vertical line from the left in the plot below) and the largest lambda whose error is within one standard error of that minimum (lambda.1se, shown by the second vertical line from the left). The latter gives a more strongly regularized model at little cost in error.

cv.ridge = cv.glmnet(x, y, alpha = 0)
plot(cv.ridge)

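[Figure: 10-fold cross-validated mean-squared error plotted against log(lambda), with two vertical lines marking the selected lambda values]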
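
The one-standard-error rule itself is easy to sketch by hand. The code below runs a 10-fold cross-validation for closed-form ridge on simulated data and then picks both lambdas. It illustrates the rule only, not cv.glmnet's internals; ridge_coef, the lambda grid and the data are assumptions.

```r
set.seed(42)
n <- 120; p <- 8
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.numeric(X %*% c(1, -1, rep(0, p - 2)) + rnorm(n))
y <- y - mean(y)

# Hypothetical helper: closed-form ridge coefficients for one lambda
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

lambdas <- 10^seq(-2, 3, length.out = 30)
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))  # random fold assignment

# fold_mse[i, j] = held-out MSE of fold i at lambdas[j]
fold_mse <- sapply(lambdas, function(lam) {
  sapply(seq_len(k), function(i) {
    held_out <- fold == i
    beta <- ridge_coef(X[!held_out, , drop = FALSE], y[!held_out], lam)
    mean((y[held_out] - X[held_out, , drop = FALSE] %*% beta)^2)
  })
})

cv_mean <- colMeans(fold_mse)
cv_se   <- apply(fold_mse, 2, sd) / sqrt(k)

i_min      <- which.min(cv_mean)
lambda_min <- lambdas[i_min]
# Largest lambda whose CV error is within one standard error of the minimum
lambda_1se <- max(lambdas[cv_mean <= cv_mean[i_min] + cv_se[i_min]])

c(lambda_min = lambda_min, lambda_1se = lambda_1se)
```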

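We store the coefficients for the value of lambda determined above in a variable named coefi; by default, coef on a cv.glmnet object returns the coefficients at lambda.1se. Row 1 of coefi is the intercept, so rows 2 to 232 hold the 231 variable coefficients. We then multiply the test matrix by these coefficients and pass the result through the sigmoid function to map it into the interval (0, 1). Note that the model was fit with glmnet's default Gaussian family and that the intercept is left out of the product, so these values are a monotone rescaling of the linear score rather than calibrated probabilities. The ranking of the observations, and hence the AUC computed below, is unaffected.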

coefi = coef(cv.ridge)
predict = (1/(1 + exp(-(x.test[, ] %*% coefi[2:232, ]))))
summary(predict)
##        V1       
##  Min.   :0.477  
##  1st Qu.:0.582  
##  Median :0.607  
##  Mean   :0.605  
##  3rd Qu.:0.630  
##  Max.   :0.711
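
For comparison, one possible alternative (not the approach taken in this write-up) is to fit a ridge-penalized logistic regression with family = "binomial" and let predict return probabilities directly. The sketch below uses simulated data; x_sim, y_sim and cv_logit are illustrative names.

```r
library(glmnet)

set.seed(1)
n <- 300; p <- 6
x_sim <- matrix(rnorm(n * p), n, p)
y_sim <- rbinom(n, 1, plogis(0.5 + x_sim[, 1] - x_sim[, 2]))

# Ridge-penalized logistic regression with 10-fold cross-validation
cv_logit <- cv.glmnet(x_sim, y_sim, alpha = 0, family = "binomial")

# Probabilities straight from predict()
prob_hat <- predict(cv_logit, newx = x_sim, s = "lambda.1se", type = "response")

# The same by hand: intercept (first coefficient) plus X %*% beta, then sigmoid
b <- as.matrix(coef(cv_logit, s = "lambda.1se"))[, 1]
prob_manual <- plogis(b[1] + x_sim %*% b[-1])

all.equal(as.numeric(prob_hat), as.numeric(prob_manual))
```

Here the sigmoid is the model's actual link function and the intercept is included, so the outputs can be read as probabilities.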

Finally we find the AUC value for our model using the ROCR library.

ROCRpredTest = prediction(predict, test$Happy)
auc = as.numeric(performance(ROCRpredTest, "auc")@y.values)
auc
## [1] 0.7406
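
The AUC depends only on how the scores rank the observations. A minimal base R sketch, using the rank-sum (Mann-Whitney) form of the AUC on simulated scores, shows that applying the sigmoid leaves it unchanged; auc_rank is a hypothetical helper and not part of ROCR.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) statistic
auc_rank <- function(score, label) {
  r     <- rank(score)            # average ranks handle ties
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(7)
label <- rbinom(500, 1, 0.56)
score <- label + rnorm(500)       # simulated scores, higher on average for label 1

auc_raw     <- auc_rank(score, label)
auc_sigmoid <- auc_rank(plogis(score), label)  # sigmoid is monotone, ranks are kept

c(raw = auc_raw, sigmoid = auc_sigmoid)
```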

We will now use our model to predict the outcome values from the submission dataset. We read in the data and process the variables in the same way as we did earlier for the training data.

testdata = read.csv("test.csv", stringsAsFactors = TRUE)  # R >= 4.0 no longer converts strings to factors by default
testdata[is.na(testdata$YOB), "YOB"] = round(mean(testdata$YOB, na.rm = TRUE))
testdata$Happy = 0
x.submit = model.matrix(Happy ~ . - 1 - UserID, data = testdata)
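
The coefficients can only be applied to x.submit if its columns match those of the training matrix, which requires every factor to have the same levels in both files. A hedged sketch of how one might check and enforce this on toy data (train_toy and new_toy are invented):

```r
# Toy training data and new data; the rows are invented for illustration.
train_toy <- data.frame(
  Party = factor(c("Democrat", "Independent", "Republican", "Democrat")),
  YOB   = c(1980, 1975, 1990, 1985)
)
new_toy <- data.frame(
  Party = c("Republican", "Democrat"),   # "Independent" never occurs here
  YOB   = c(1969, 1992),
  stringsAsFactors = FALSE
)

# Unaligned: the factor is built from the new data alone and loses a level
x_unaligned <- model.matrix(~ . - 1, data = transform(new_toy, Party = factor(Party)))

# Aligned: reuse the training levels so the dummy columns line up
new_toy$Party <- factor(new_toy$Party, levels = levels(train_toy$Party))
x_train <- model.matrix(~ . - 1, data = train_toy)
x_new   <- model.matrix(~ . - 1, data = new_toy)

ncol(x_unaligned)                                 # 3 columns
identical(colnames(x_train), colnames(x_new))     # TRUE
```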

Having done the pre-processing, we now make our predictions for the submission data using the coefficients that we obtained from our ridge regression model. Finally, we prepare a .csv submission file.

predictiontest = (1/(1 + exp(-(x.submit[, ] %*% coefi[2:232, ]))))
summary(predictiontest)
##        V1       
##  Min.   :0.469  
##  1st Qu.:0.581  
##  Median :0.604  
##  Mean   :0.603  
##  3rd Qu.:0.629  
##  Max.   :0.718
submission = data.frame(UserID = testdata$UserID, Probability1 = predictiontest)
write.csv(submission, "submissionRidge.csv", row.names = FALSE)
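
Before uploading, it can be worth reading the file back and checking its shape. The sketch below does this for a toy submission written to a temporary file; the values are made up and the checks are suggestions, not requirements stated by the competition.

```r
# Toy submission; the values are made up for illustration.
submission_toy <- data.frame(
  UserID       = c(2, 5, 9),
  Probability1 = c(0.61, 0.58, 0.47)
)

path <- tempfile(fileext = ".csv")
write.csv(submission_toy, path, row.names = FALSE)

# Read the file back and run a few sanity checks
check <- read.csv(path)
stopifnot(
  identical(names(check), c("UserID", "Probability1")),
  !anyNA(check$Probability1),
  all(check$Probability1 >= 0 & check$Probability1 <= 1),
  !any(duplicated(check$UserID))
)
```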