In this competition, we were to use data from Show of Hands, an informal polling platform for use on mobile devices and the web, to see what aspects and characteristics of people's lives predict happiness.
We start off by loading the necessary R libraries and reading in the data. More details about the data can be found at www.kaggle.com/c/the-analytics-edge-mit-15-071x/data
library(caTools) # for splitting data into train and test set
library(glmnet) # for ridge regression
library(ROCR) # for finding the AUC value of our model
showofhands = read.csv("train.csv")
summary(showofhands)
## UserID YOB Gender Income
## Min. : 1 Min. :1900 : 537 :1215
## 1st Qu.:1770 1st Qu.:1969 Female:1650 $100,001 - $150,000: 571
## Median :3717 Median :1982 Male :2432 $25,001 - $50,000 : 545
## Mean :3830 Mean :1979 $50,000 - $74,999 : 642
## 3rd Qu.:5674 3rd Qu.:1992 $75,000 - $100,000 : 567
## Max. :9503 Max. :2039 over $150,000 : 536
## NA's :684 under $25,000 : 543
## HouseholdStatus EducationLevel
## : 800 :1091
## Domestic Partners (no kids): 118 Bachelor's Degree : 935
## Domestic Partners (w/kids) : 34 Current K-12 : 607
## Married (no kids) : 522 Current Undergraduate: 557
## Married (w/kids) :1226 Master's Degree : 503
## Single (no kids) :1760 High School Diploma : 487
## Single (w/kids) : 159 (Other) : 439
## Party Happy Q124742 Q124122 Q123464
## : 728 Min. :0.000 :2563 :1613 :1455
## Democrat : 926 1st Qu.:0.000 No :1300 No :1233 No :2966
## Independent:1126 Median :1.000 Yes: 756 Yes:1773 Yes: 198
## Libertarian: 409 Mean :0.564
## Other : 245 3rd Qu.:1.000
## Republican :1185 Max. :1.000
##
## Q123621 Q122769 Q122770 Q122771 Q122120 Q121699
## :1524 :1333 :1211 :1201 :1230 :1080
## No :1506 No :2019 No :1445 Private: 567 No :2499 No : 936
## Yes:1589 Yes:1267 Yes:1963 Public :2851 Yes: 890 Yes:2603
##
##
##
##
## Q121700 Q120978 Q121011 Q120379 Q120650 Q120472
## :1113 :1175 :1134 :1219 :1225 :1276
## No :3037 No :1530 No :1563 No :1847 No : 270 Art :1061
## Yes: 469 Yes:1914 Yes:1922 Yes:1553 Yes:3124 Science:2282
##
##
##
##
## Q120194 Q120012 Q120014 Q119334 Q119851
## :1337 :1188 :1322 :1207 :1098
## Study first:1910 No :1813 No :1333 No :1748 No :2050
## Try first :1372 Yes:1618 Yes:1964 Yes:1664 Yes:1471
##
##
##
##
## Q119650 Q118892 Q118117 Q118232 Q118233
## :1190 :1019 :1092 :1580 :1289
## Giving :2611 No :1331 No :2087 Idealist :1327 No :2392
## Receiving: 818 Yes:2269 Yes:1440 Pragmatist:1712 Yes: 938
##
##
##
##
## Q118237 Q117186 Q117193 Q116797
## :1240 :1429 :1410 :1338
## No :1831 Cool headed:2054 Odd hours :1299 No :2169
## Yes:1548 Hot headed :1136 Standard hours:1910 Yes:1112
##
##
##
##
## Q116881 Q116953 Q116601 Q116441 Q116448 Q116197
## :1445 :1412 :1217 :1255 :1304 :1251
## Happy:2268 No :1063 No : 578 No :2098 No :1839 A.M.:1172
## Right: 906 Yes:2144 Yes:2824 Yes:1266 Yes:1476 P.M.:2196
##
##
##
##
## Q115602 Q115777 Q115610 Q115611 Q115899
## :1239 :1346 :1280 :1120 :1342
## No : 719 End :1346 No : 581 No :2230 Circumstances:1448
## Yes:2661 Start:1927 Yes:2758 Yes:1269 Me :1829
##
##
##
##
## Q115390 Q114961 Q114748 Q115195 Q114517 Q114386
## :1421 :1280 :1132 :1283 :1189 :1309
## No :1304 No :1707 No :1494 No :1193 No :2283 Mysterious:1891
## Yes:1894 Yes:1632 Yes:1993 Yes:2143 Yes:1147 TMI :1419
##
##
##
##
## Q113992 Q114152 Q113583 Q113584 Q113181
## :1195 :1422 :1292 :1306 :1262
## No :2387 No :2194 Talk :1085 People :1666 No :1954
## Yes:1037 Yes:1003 Tunes:2242 Technology:1647 Yes:1403
##
##
##
##
## Q112478 Q112512 Q112270 Q111848 Q111580 Q111220
## :1372 :1305 :1417 :1173 :1355 :1255
## No :1278 No : 645 No :1765 No :1349 Demanding :1158 No :2468
## Yes:1969 Yes:2669 Yes:1437 Yes:2097 Supportive:2106 Yes: 896
##
##
##
##
## Q110740 Q109367 Q108950 Q109244 Q108855
## :1246 :1311 :1271 :1399 :1587
## Mac:1395 No :1285 Cautious :2289 No :2353 Umm...:1203
## PC :1978 Yes:2023 Risk-friendly:1059 Yes: 867 Yes! :1829
##
##
##
##
## Q108617 Q108856 Q108754 Q108342 Q108343
## :1336 :1591 :1399 :1369 :1358
## No :2884 Socialize: 880 No :2166 In-person:2240 No :1981
## Yes: 399 Space :2148 Yes:1054 Online :1010 Yes:1280
##
##
##
##
## Q107869 Q107491 Q106993 Q106997 Q106272 Q106388
## :1375 :1302 :1310 :1319 :1321 :1380
## No :1468 No : 436 No : 567 Grrr people:1778 No : 937 No :2359
## Yes:1776 Yes:2881 Yes:2742 Yay people!:1522 Yes:2361 Yes: 880
##
##
##
##
## Q106389 Q106042 Q105840 Q105655 Q104996 Q103293
## :1429 :1334 :1438 :1229 :1252 :1306
## No :1684 No :1723 No :1720 No :1521 No :1652 No :1775
## Yes:1506 Yes:1562 Yes:1461 Yes:1869 Yes:1715 Yes:1538
##
##
##
##
## Q102906 Q102674 Q102687 Q102289 Q102089 Q101162
## :1432 :1441 :1315 :1365 :1329 :1381
## No :2038 No :2016 No :1609 No :2259 Own :2281 Optimist :2021
## Yes:1149 Yes:1162 Yes:1695 Yes: 995 Rent:1009 Pessimist:1217
##
##
##
##
## Q101163 Q101596 Q100689 Q100680 Q100562 Q99982
## :1485 :1368 :1198 :1341 :1349 :1418
## Dad:1760 No :2117 No :1346 No :1293 No : 643 Check!:1677
## Mom:1374 Yes:1134 Yes:2075 Yes:1985 Yes:2627 Nope :1524
##
##
##
##
## Q100010 Q99716 Q99581 Q99480 Q98869 Q98578
## :1247 :1355 :1288 :1310 :1476 :1448
## No : 653 No :2884 No :2868 No : 738 No : 704 No :1993
## Yes:2719 Yes: 380 Yes: 463 Yes:2571 Yes:2439 Yes:1178
##
##
##
##
## Q98059 Q98078 Q98197 Q96024 votes
## :1267 :1517 :1438 :1459 Min. : 20.0
## Only-child: 330 No :1768 No :1895 No :1232 1st Qu.: 45.0
## Yes :3022 Yes:1334 Yes:1286 Yes:1928 Median : 82.0
## Mean : 71.9
## 3rd Qu.: 99.0
## Max. :101.0
##
From the summary we can see that the YOB variable has 684 NAs. We impute these by setting them equal to the rounded mean of the observed YOB values. I believe the exact fill value makes little difference here, though a different choice (the median, say) could shift the results slightly.
showofhands[is.na(showofhands$YOB), "YOB"] = round(mean(showofhands$YOB, na.rm = TRUE))
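The imputation step can be tried out on a small vector first. The sketch below uses made-up birth years and also records which entries were missing, a flag that is not part of the analysis above and is shown only as an optional extra.

```r
# Toy birth-year vector with missing entries (values are made up for illustration).
yob <- c(1980, NA, 1992, 1975, NA, 2001)

# Remember which entries were missing before filling them in.
yob_missing <- is.na(yob)

# Replace the NAs with the rounded mean of the observed values.
yob_filled <- yob
yob_filled[yob_missing] <- round(mean(yob, na.rm = TRUE))

print(yob_filled)
print(as.integer(yob_missing))
```

Keeping the flag makes it possible to check later whether the imputed rows behave differently from the rest.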
We now split the data into a training set and a test set using the sample.split function of the caTools library. The function keeps the proportion of the dependent variable, 'Happy', roughly the same in the training and test sets, so that performance on the test set is a fairer estimate of how well the model generalizes.
set.seed(123)
split = sample.split(showofhands$Happy, SplitRatio = 0.75)
train = subset(showofhands, split == TRUE)
test = subset(showofhands, split == FALSE)
dim(train)
## [1] 3464 110
dim(test)
## [1] 1155 110
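To illustrate what a split that preserves the outcome proportion looks like, here is a minimal base-R sketch on simulated data. The helper stratified_split is hypothetical and is not how caTools implements sample.split; it only mimics the idea of sampling within each outcome class.

```r
set.seed(123)

# Toy binary outcome with roughly 56% ones (simulated, not the competition data).
happy <- rbinom(1000, size = 1, prob = 0.56)

# Hypothetical helper: mark `ratio` of the rows within each outcome class as training rows.
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_pick <- round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_pick)]] <- TRUE
  }
  in_train
}

split_toy <- stratified_split(happy, 0.75)

# The share of ones should be nearly identical in both parts.
print(mean(happy[split_toy]))
print(mean(happy[!split_toy]))
```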
We then prepare the data for the ridge regression. The glmnet package, which fits the ridge regression, requires the independent variables of the training set in the form of a matrix of dimension nobs x nvars. Our data set has 110 columns, but many of them are factor variables, and an n-level factor can be represented by (n-1) dummy variables. The model.matrix function handles both transformations at once, i.e. the dummy coding of the factors and the conversion to a matrix. The '-1' in the formula drops the intercept column from the matrix; glmnet fits its own intercept by default, so the column is not needed. (Without an intercept column, model.matrix keeps all n levels of the first factor in the formula.) UserID is excluded as well, since it is only an identifier. After the transformation we can see that the resulting matrix 'x' has 231 columns.
Further, the dependent variable needs to be numeric (as we are predicting probabilities). Finally, we apply the same transformation to the test set.
x = model.matrix(Happy ~ . - 1 - UserID, data = train)
y = as.numeric(train$Happy)
x.test = model.matrix(Happy ~ . - 1 - UserID, data = test)
dim(x)
## [1] 3464 231
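The effect of model.matrix is easier to see on a tiny example. The data frame below is made up; it has one numeric column, a two-level factor and a three-level factor, and the formula drops the intercept column in the same way as above.

```r
# Made-up data frame: one numeric predictor and two factors.
toy <- data.frame(
  Happy  = c(1, 0, 1, 0, 1, 0),
  YOB    = c(1980, 1992, 1975, 2001, 1988, 1969),
  Gender = factor(c("Female", "Male", "Male", "Female", "Male", "Female")),
  Party  = factor(c("Democrat", "Republican", "Independent",
                    "Democrat", "Independent", "Republican"))
)

# Dummy-code the factors and drop the intercept column.
x_toy <- model.matrix(Happy ~ . - 1, data = toy)

print(colnames(x_toy))
print(dim(x_toy))
```

The first factor (Gender) keeps both of its levels because there is no intercept column, while Party is reduced to two dummy columns, which gives five columns in total.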
We now fit the ridge regression model to our data using the glmnet function (alpha = 0 signifies ridge regression, whereas alpha = 1 signifies lasso) and plot the resulting coefficients as a function of the regularization parameter lambda. As we can see, with increasing values of lambda the coefficients of the independent variables are shrunk towards zero, but never exactly to zero. This is the essence of a ridge regression model: it accepts a little bias in exchange for lower variance, which tends to give a more robust model that generalizes well.
ridge = glmnet(x, y, alpha = 0)
plot(ridge, xvar = "lambda", label = TRUE)
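The shrinkage can be reproduced without glmnet using the textbook closed-form ridge solution. This is a sketch on simulated data; the helper ridge_coef is hypothetical, and its lambda is on a different scale from the one glmnet reports, so only the qualitative behaviour carries over.

```r
set.seed(1)

# Simulated standardized predictors and a centred response.
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.numeric(X %*% c(2, -1, 0.5, 0, 0) + rnorm(n))
y <- y - mean(y)

# Hypothetical helper: closed-form ridge estimate (X'X + lambda * I)^(-1) X'y.
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

# The length of the coefficient vector shrinks as lambda grows, without reaching zero.
lambdas <- c(0, 1, 10, 100, 1000)
norms <- sapply(lambdas, function(l) sqrt(sum(ridge_coef(X, y, l)^2)))
print(round(norms, 3))
```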
The glmnet library has a very handy function named cv.glmnet, which runs a 10-fold cross-validation on our data set and reports two values of lambda: the one with the lowest cross-validated mean-squared error (lambda.min), and the largest one whose error is still within one standard error of that minimum (lambda.1se). Both are marked by vertical dotted lines in the plot below; lambda.1se is the more heavily regularized of the two.
cv.ridge = cv.glmnet(x, y, alpha = 0)
plot(cv.ridge)
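The one-standard-error rule itself is simple enough to write out by hand. The numbers below are made up; the vector names cvm and cvsd mirror the components of the same name in a cv.glmnet result, but nothing here is taken from the fit above.

```r
# Made-up cross-validation results for seven candidate values of lambda.
lambda <- c(0.01, 0.03, 0.1, 0.3, 1, 3, 10)
cvm    <- c(0.250, 0.245, 0.241, 0.240, 0.242, 0.246, 0.255)  # mean CV error
cvsd   <- c(0.004, 0.004, 0.003, 0.003, 0.003, 0.004, 0.004)  # its standard error

# lambda with the lowest mean cross-validated error.
i_min <- which.min(cvm)
lambda_min <- lambda[i_min]

# Largest lambda whose error is within one standard error of the minimum.
threshold <- cvm[i_min] + cvsd[i_min]
lambda_1se <- max(lambda[cvm <= threshold])

print(c(lambda_min = lambda_min, lambda_1se = lambda_1se))
```

Choosing the larger lambda trades a statistically negligible increase in error for a more strongly regularized model.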
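We store the coefficients for the value of lambda determined above in a variable named coefi; coef() applied to a cv.glmnet object uses lambda.1se by default. We then use these coefficients to score the test set. Rows 2 to 232 of coefi hold the 231 variable coefficients (row 1 is the intercept, which is left out here), and the resulting linear scores are mapped into the range (0, 1) by the sigmoid transformation. Because the model was fitted with glmnet's default Gaussian family and the intercept is omitted, these values are best read as scores rather than calibrated probabilities; the AUC computed below depends only on their ordering and is unaffected.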
coefi = coef(cv.ridge)
predict = (1/(1 + exp(-(x.test[, ] %*% coefi[2:232, ]))))
summary(predict)
## V1
## Min. :0.477
## 1st Qu.:0.582
## Median :0.607
## Mean :0.605
## 3rd Qu.:0.630
## Max. :0.711
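For comparison, here is a sketch of an alternative that was not used above: fitting the ridge model with family = "binomial" and letting predict() apply the intercept and the inverse logit. It runs on simulated data, so its numbers say nothing about the competition data.

```r
library(glmnet)

set.seed(42)

# Simulated predictors and a binary outcome (not the competition data).
n <- 400; p <- 10
x_sim <- matrix(rnorm(n * p), n, p)
y_sim <- rbinom(n, 1, plogis(0.5 + x_sim[, 1] - 0.5 * x_sim[, 2]))

# Ridge-penalized logistic regression with 10-fold cross-validation.
cv_logit <- cv.glmnet(x_sim, y_sim, alpha = 0, family = "binomial")

# predict() includes the intercept; type = "response" returns probabilities.
x_new <- matrix(rnorm(5 * p), 5, p)
prob <- predict(cv_logit, newx = x_new, s = "lambda.1se", type = "response")
print(prob)
```

With this approach there is no need to index the coefficient vector by hand, which also removes the hard-coded 2:232 range.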
Finally, we compute the AUC of our model on the test set using the ROCR library.
ROCRpredTest = prediction(predict, test$Happy)
auc = as.numeric(performance(ROCRpredTest, "auc")@y.values)
auc
## [1] 0.7406
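The AUC is the probability that a randomly chosen happy user receives a higher score than a randomly chosen unhappy one, so it depends only on the ordering of the scores. A small base-R sketch with made-up labels and scores shows this; auc_rank is a hypothetical helper, not the ROCR implementation.

```r
# Hypothetical helper: AUC from ranks (equivalent to the Mann-Whitney U statistic).
auc_rank <- function(score, label) {
  r <- rank(score)                      # average ranks are used for ties
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up labels and scores.
label <- c(1, 0, 1, 1, 0, 0, 1, 0)
score <- c(0.9, 0.4, 0.7, 0.3, 0.5, 0.2, 0.8, 0.6)

auc_raw     <- auc_rank(score, label)
auc_sigmoid <- auc_rank(plogis(score), label)  # a monotone transform keeps the ordering

print(c(raw = auc_raw, sigmoid = auc_sigmoid))
```

Here 13 of the 16 positive-negative pairs are ordered correctly, so both values equal 0.8125.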
We now use our model to predict the outcomes for the submission data set. We read in the data and process the variables in the same way as we did earlier for the training data. The submission data has no Happy column, so we add a placeholder one to make the model.matrix formula work.
testdata = read.csv("test.csv")
testdata[is.na(testdata$YOB), "YOB"] = round(mean(testdata$YOB, na.rm = TRUE))
testdata$Happy = 0
x.submit = model.matrix(Happy ~ . - 1 - UserID, data = testdata)
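Multiplying the submission matrix by the stored coefficients only makes sense if its columns line up with those of the training matrix. The sketch below, on made-up data, shows one way to guard against a factor level that is missing from the new data by reusing the training levels; whether this is needed for the competition files is an assumption and was not checked here.

```r
# Made-up training and submission data; "Independent" and "Other" are absent from the latter.
train_toy  <- data.frame(Party = factor(c("Democrat", "Republican", "Independent", "Other")))
submit_toy <- data.frame(Party = c("Republican", "Democrat", "Democrat"),
                         stringsAsFactors = FALSE)

# Reuse the training levels so that both matrices get the same dummy columns.
submit_toy$Party <- factor(submit_toy$Party, levels = levels(train_toy$Party))

x_train_toy  <- model.matrix(~ Party - 1, data = train_toy)
x_submit_toy <- model.matrix(~ Party - 1, data = submit_toy)

# Both matrices now share the same column names in the same order.
print(identical(colnames(x_train_toy), colnames(x_submit_toy)))
```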
Having done the pre-processing, we now make our predictions for the submission data using the coefficients obtained from our ridge regression model. Finally, we prepare a .csv submission file.
predictiontest = (1/(1 + exp(-(x.submit[, ] %*% coefi[2:232, ]))))
summary(predictiontest)
## V1
## Min. :0.469
## 1st Qu.:0.581
## Median :0.604
## Mean :0.603
## 3rd Qu.:0.629
## Max. :0.718
submission = data.frame(UserID = testdata$UserID, Probability1 = predictiontest)
write.csv(submission, "submissionRidge.csv", row.names = FALSE)