Experiments with data

This report shows how I reached accuracy of 0.84 on the public leaderboard for hackathon “Experiments with data” on Analytics Vidhya website. The datasets for the analysis are available here.

The target is to classify whether a person’s salary is in the <=50K range, or >50K, without having explicit salary data.

The data

I downloaded the datsets and read them into R from a local directory.

library(mlr)
train<-read.csv("train_gbW7HTd.csv", na.strings = c(""," ",NA))
test<-read.csv("test_2AFBew7.csv", na.strings = c(""," ",NA))

summarizeColumns(train)

##              name    type   na        mean         disp median        mad
## 1              ID integer    0 16281.00000 9399.6953940  16281 12068.3640
## 2             Age integer    0    38.58165   13.6404326     37    14.8260
## 3       Workclass  factor 1836          NA           NA     NA         NA
## 4       Education  factor    0          NA    0.6774976     NA         NA
## 5  Marital.Status  factor    0          NA    0.5400633     NA         NA
## 6      Occupation  factor 1843          NA           NA     NA         NA
## 7    Relationship  factor    0          NA    0.5948220     NA         NA
## 8            Race  factor    0          NA    0.1457265     NA         NA
## 9             Sex  factor    0          NA    0.3307945     NA         NA
## 10 Hours.Per.Week integer    0    40.43746   12.3474287     40     4.4478
## 11 Native.Country  factor  583          NA           NA     NA         NA
## 12   Income.Group  factor    0          NA    0.2408096     NA         NA
##      min   max nlevs
## 1      1 32561     0
## 2     17    90     0
## 3      7 22696     8
## 4     51 10501    16
## 5     23 14976     7
## 6      9  4140    14
## 7    981 13193     6
## 8    271 27816     5
## 9  10771 21790     2
## 10     1    99     0
## 11     1 29170    41
## 12  7841 24720     2

summarizeColumns(test)

##              name    type  na        mean         disp median       mad
## 1              ID integer   0 40702.00000 4700.0642017  40702 6034.1820
## 2             Age integer   0    38.76746   13.8491868     37   14.8260
## 3       Workclass  factor 963          NA           NA     NA        NA
## 4       Education  factor   0          NA    0.6755113     NA        NA
## 5  Marital.Status  factor   0          NA    0.5452982     NA        NA
## 6      Occupation  factor 966          NA           NA     NA        NA
## 7    Relationship  factor   0          NA    0.5993489     NA        NA
## 8            Race  factor   0          NA    0.1434187     NA        NA
## 9             Sex  factor   0          NA    0.3329648     NA        NA
## 10 Hours.Per.Week integer   0    40.39224   12.4793322     40    4.4478
## 11 Native.Country  factor 274          NA           NA     NA        NA
##      min   max nlevs
## 1  32562 48842     0
## 2     17    90     0
## 3      3 11210     8
## 4     32  5283    16
## 5     14  7403     7
## 6      6  2032    14
## 7    525  6523     6
## 8    135 13946     5
## 9   5421 10860     2
## 10     1    99     0
## 11     5 14662    40

The test set has no variable “Income.Group” .This is the dependant variable.

levels(train$Income.Group)

## [1] "<=50K" ">50K"

p = prop.table(table(train$Income.Group))
p

## 
##     <=50K      >50K 
## 0.7591904 0.2408096

NA’s are present in the both sets in variables:“Workclass”,“occupation”,“Native Country”. These will need to be analysed first. They are all factor variables.

#join train and test sets to explore all data at once
mydata<-rbind(train[,1:11],test)
#inspect Native.Country variable

as.matrix(summary(mydata$Native.Country))

##                             [,1]
## Cambodia                      28
## Canada                       182
## China                        122
## Columbia                      85
## Cuba                         138
## Dominican-Republic           103
## Ecuador                       45
## El-Salvador                  155
## England                      127
## France                        38
## Germany                      206
## Greece                        49
## Guatemala                     88
## Haiti                         75
## Holand-Netherlands             1
## Honduras                      20
## Hong                          30
## Hungary                       19
## India                        151
## Iran                          59
## Ireland                       37
## Italy                        105
## Jamaica                      106
## Japan                         92
## Laos                          23
## Mexico                       951
## Nicaragua                     49
## Outlying-US(Guam-USVI-etc)    23
## Peru                          46
## Philippines                  295
## Poland                        87
## Portugal                      67
## Puerto-Rico                  184
## Scotland                      21
## South                        115
## Taiwan                        65
## Thailand                      30
## Trinadad&Tobago               27
## United-States              43832
## Vietnam                       86
## Yugoslavia                    23
## NA's                         857

#most of the country data is United States. The only country having more 
#than the number of NA's is Mexico. All the others have btween 1-295 counts.
#There are so many levels. Does the country make a difference?  

country<-aggregate(Income.Group~Native.Country,data=train,table)
country$total<-table(train$Native.Country)
country[,2][,1]<-country[,2][,1]/country$total
country[,2][,2]<-country[,2][,2]/country$total
country

##                Native.Country Income.Group.<=50K Income.Group.>50K total
## 1                    Cambodia         0.63157895        0.36842105    19
## 2                      Canada         0.67768595        0.32231405   121
## 3                       China         0.73333333        0.26666667    75
## 4                    Columbia         0.96610169        0.03389831    59
## 5                        Cuba         0.73684211        0.26315789    95
## 6          Dominican-Republic         0.97142857        0.02857143    70
## 7                     Ecuador         0.85714286        0.14285714    28
## 8                 El-Salvador         0.91509434        0.08490566   106
## 9                     England         0.66666667        0.33333333    90
## 10                     France         0.58620690        0.41379310    29
## 11                    Germany         0.67883212        0.32116788   137
## 12                     Greece         0.72413793        0.27586207    29
## 13                  Guatemala         0.95312500        0.04687500    64
## 14                      Haiti         0.90909091        0.09090909    44
## 15         Holand-Netherlands         1.00000000        0.00000000     1
## 16                   Honduras         0.92307692        0.07692308    13
## 17                       Hong         0.70000000        0.30000000    20
## 18                    Hungary         0.76923077        0.23076923    13
## 19                      India         0.60000000        0.40000000   100
## 20                       Iran         0.58139535        0.41860465    43
## 21                    Ireland         0.79166667        0.20833333    24
## 22                      Italy         0.65753425        0.34246575    73
## 23                    Jamaica         0.87654321        0.12345679    81
## 24                      Japan         0.61290323        0.38709677    62
## 25                       Laos         0.88888889        0.11111111    18
## 26                     Mexico         0.94867807        0.05132193   643
## 27                  Nicaragua         0.94117647        0.05882353    34
## 28 Outlying-US(Guam-USVI-etc)         1.00000000        0.00000000    14
## 29                       Peru         0.93548387        0.06451613    31
## 30                Philippines         0.69191919        0.30808081   198
## 31                     Poland         0.80000000        0.20000000    60
## 32                   Portugal         0.89189189        0.10810811    37
## 33                Puerto-Rico         0.89473684        0.10526316   114
## 34                   Scotland         0.75000000        0.25000000    12
## 35                      South         0.80000000        0.20000000    80
## 36                     Taiwan         0.60784314        0.39215686    51
## 37                   Thailand         0.83333333        0.16666667    18
## 38            Trinadad&Tobago         0.89473684        0.10526316    19
## 39              United-States         0.75416524        0.24583476 29170
## 40                    Vietnam         0.92537313        0.07462687    67
## 41                 Yugoslavia         0.62500000        0.37500000    16

prop.table(table(train$Income.Group[is.na(train$Native.Country)]))

## 
##     <=50K      >50K 
## 0.7495712 0.2504288

#The observations with NA's have a similar proportion to US or Scotland
#It makes sense to impute missing values with Mode (=US)
train_imp <-train
train_imp$Native.Country[is.na(train_imp$Native.Country)]<-"United-States"
test_imp<-test
test_imp$Native.Country[is.na(test_imp$Native.Country)]<-"United-States"

Next, look at workclass variable.

as.matrix(summary(mydata$Workclass))

##                   [,1]
## Federal-gov       1432
## Local-gov         3136
## Never-worked        10
## Private          33906
## Self-emp-inc      1695
## Self-emp-not-inc  3862
## State-gov         1981
## Without-pay         21
## NA's              2799

workclass<-aggregate(Income.Group~Workclass,data=train,table)
workclass$total<-table(train$Workclass)
workclass[,2][,1]<-workclass[,2][,1]/workclass$total
workclass[,2][,2]<-workclass[,2][,2]/workclass$total
workclass

##          Workclass Income.Group.<=50K Income.Group.>50K total
## 1      Federal-gov          0.6135417         0.3864583   960
## 2        Local-gov          0.7052078         0.2947922  2093
## 3     Never-worked          1.0000000         0.0000000     7
## 4          Private          0.7813271         0.2186729 22696
## 5     Self-emp-inc          0.4426523         0.5573477  1116
## 6 Self-emp-not-inc          0.7150728         0.2849272  2541
## 7        State-gov          0.7280431         0.2719569  1298
## 8      Without-pay          1.0000000         0.0000000    14

prop.table(table(train$Income.Group[is.na(train$Workclass)]))

## 
##     <=50K      >50K 
## 0.8959695 0.1040305

In this case it will be necessary to use a more sophisticated imputation method, since the proportions are very different from all of the groups. I will use rpart.

Notice that two levels can be combined, Never-worked and Without-pay. These will always be in <50 category anyway.

Finally, the Ocuupation variable:

as.matrix(summary(mydata$Occupation))

##                   [,1]
## Adm-clerical      5611
## Armed-Forces        15
## Craft-repair      6112
## Exec-managerial   6086
## Farming-fishing   1490
## Handlers-cleaners 2072
## Machine-op-inspct 3022
## Other-service     4923
## Priv-house-serv    242
## Prof-specialty    6172
## Protective-serv    983
## Sales             5504
## Tech-support      1446
## Transport-moving  2355
## NA's              2809

occ<-aggregate(Income.Group~Occupation,data=train,table)
occ$total<-table(train$Occupation)
occ[,2][,1]<-occ[,2][,1]/occ$total
occ[,2][,2]<-occ[,2][,2]/occ$total
occ

##           Occupation Income.Group.<=50K Income.Group.>50K total
## 1       Adm-clerical        0.865517241       0.134482759  3770
## 2       Armed-Forces        0.888888889       0.111111111     9
## 3       Craft-repair        0.773359356       0.226640644  4099
## 4    Exec-managerial        0.515986227       0.484013773  4066
## 5    Farming-fishing        0.884305835       0.115694165   994
## 6  Handlers-cleaners        0.937226277       0.062773723  1370
## 7  Machine-op-inspct        0.875124875       0.124875125  2002
## 8      Other-service        0.958421851       0.041578149  3295
## 9    Priv-house-serv        0.993288591       0.006711409   149
## 10    Prof-specialty        0.550966184       0.449033816  4140
## 11   Protective-serv        0.674884438       0.325115562   649
## 12             Sales        0.730684932       0.269315068  3650
## 13      Tech-support        0.695043103       0.304956897   928
## 14  Transport-moving        0.799624296       0.200375704  1597

prop.table(table(train$Income.Group[is.na(train$Occupation)]))

## 
##     <=50K      >50K 
## 0.8963646 0.1036354

This variable has similar proportions to Armed-Forces, but very few of this category exist in the data. It is also similar to Farming-fishing, but again there are more NA’s than observations in this category, again it will be better to impute using rpart.

Join sets for preprocessing:

#combine two levels of workclass
mydata<-rbind(train_imp[,1:11],test_imp)
mydata$Workclass[mydata$Workclass=="Without-pay"]<-"Never-worked"
mydata$Workclass<-as.character(mydata$Workclass)
mydata$Workclass<-as.factor(mydata$Workclass)

library(caret)
#separate into two data sets with known workclass and na's,leaving out occupation
workclass<-mydata[!is.na(mydata$Workclass),c(2:5,7:11)]
workna<-mydata[is.na(mydata$Workclass),c(2:5,7:11)]
t2<-createDataPartition(workclass$Workclass,list=FALSE,p=0.8)
worktrain<-workclass[t2,]
worktest<-workclass[-t2,]
worktree<-train(data=worktrain,Workclass~.,"rpart")
workpred1<-predict(worktree,newdata=worktest)

confusionMatrix(workpred1,worktest$Workclass)$overall

##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.7365048      0.0000000      0.7273783      0.7454815      0.7365048 
## AccuracyPValue  McnemarPValue 
##      0.5054627            NaN

#quite good!
mydata$Workclass[is.na(mydata$Workclass)]<-predict(worktree,mydata[is.na(mydata$Workclass),])

#do the same for occupation,use workclass, including predicted values
occ<-mydata[!is.na(mydata$Occupation),2:11]
occna<-mydata[is.na(mydata$Occupation),c(2:11)]
t1<-createDataPartition(occ$Occupation,list=FALSE,p=0.8)
occtrain<-occ[t1,]
occtest<-occ[-t1,]
occtree<-train(data=occtrain,Occupation~.,"rpart")
occpred1<-predict(occtree,newdata=occtest)

confusionMatrix(occpred1,occtest$Occupation)$overall

##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   2.325581e-01   1.187346e-01   2.239578e-01   2.413281e-01   1.341013e-01 
## AccuracyPValue  McnemarPValue 
##  6.820372e-144            NaN

#This performs really badly

#replacing them all with the mode is even worse
occpred2<-rep("Prof-specialty",dim(occtest)[1])
confusionMatrix(occpred2,occtest$Occupation)$overall

##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.1341013      0.0000000      0.1272012      0.1412339      0.1341013 
## AccuracyPValue  McnemarPValue 
##      0.5046136            NaN

#using rpart seems to be more reasonable
mydata$Occupation[is.na(mydata$Occupation)]<-predict(occtree,mydata[is.na(mydata$Occupation),])


#reseparate into train and test sets
train<-cbind(mydata[1:32561,],Income.Group=train$Income.Group)
test<-mydata[32562:48842,]

I now take a look at the remaining variables to see if they each independently appear to have a relationship with the Income.Group target variable.

#Inspecting age variable:  
print(ggplot(data=train,aes(x=Income.Group,y=Age))+geom_boxplot())

#age is apparently important, the distribution of ages for each income group are different.

#workclass
print(ggplot(data=train,aes(x=Workclass))+geom_bar()+facet_grid(Income.Group)~.)

## ggplot(data = train, aes(x = Workclass)) + geom_bar() + facet_grid(Income.Group) ~ 
##     .

#makes a difference

#education
print(ggplot(data=train,aes(x=Education))+geom_bar()+facet_grid(Income.Group~.))

#also makes a difference

#marital status
print(ggplot(data=train,aes(x=Marital.Status))+geom_bar()+facet_grid(Income.Group~.))

#a big difference

#occupation
print(ggplot(data=train,aes(x=Occupation))+geom_bar()+facet_grid(Income.Group~.))

#also makes a difference

#relationship
print(ggplot(data=train,aes(x=Relationship))+geom_bar()+facet_grid(Income.Group~.))

#this too

#race
print(ggplot(data=train,aes(x=Race))+geom_bar()+facet_grid(Income.Group~.))

#and this

#sex
print(ggplot(data=train,aes(x=Sex))+geom_bar()+facet_grid(Income.Group~.))

#definitely

#native country
print(ggplot(data=train,aes(x=Native.Country))+geom_bar()+facet_grid(Income.Group~.))

#there are too many levels. 
summary(country$Income.Group[,2])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.08491 0.20000 0.20240 0.32120 0.41860

#this shows the statistics of the proportion of >50 by country
#group countries with less than median as poor
poor<-country$Native.Country[country$Income.Group[,2]<=0.2]
mydata$Native.Country<-as.character(mydata$Native.Country)
mydata$Native.Country[mydata$Native.Country%in% poor]<-"poor"
mydata$Native.Country<-as.factor(mydata$Native.Country)

#reseparate into train and test sets
train<-cbind(mydata[1:32561,],Income.Group=train$Income.Group)
test<-mydata[32562:48842,]

print(ggplot(data=train,aes(x=Native.Country))+geom_bar()+facet_grid(Income.Group~.))

#hours per week
print(ggplot(data=train,aes(x=Income.Group,y=Hours.Per.Week))+geom_boxplot())

#75% of the <50 work less hours than 25% of >50.Many outliers

Prepare the data for modelling

It is mecessary to create a fictive Income group for test set, so that the mlr package can be used.

test$Income.Group<-as.factor(sample(levels(train$Income.Group),dim(test)[1],replace=TRUE,c(0.5,0.5)))

#remove ID variable. This should not be included in the anaylsis.  
train<-train[,2:12]
test<-test[,2:12]
#model
trainTask <- makeClassifTask(data = train,target = "Income.Group")
testTask <- makeClassifTask(data = test, target = "Income.Group")

#both numeric variables are skewed, they need to be normalized before training
hist(train$Age)

hist(train$Hours.Per.Week)

trainTask <- normalizeFeatures(trainTask,method = "standardize")
testTask <- normalizeFeatures(testTask,method = "standardize")

Tune and train models, and predict

rpart

#make tree learner
makeatree <- makeLearner("classif.rpart", predict.type = "response")

#set 5 fold cross validation
set_cv <- makeResampleDesc("CV",iters = 5L)

#Search for hyperparameters
gs <- makeParamSet(
        makeIntegerParam("minsplit",lower = 10, upper = 50),
        makeIntegerParam("minbucket", lower = 5, upper = 50),
        makeNumericParam("cp", lower = 0.001, upper = 0.2)
)
gscontrol <- makeTuneControlGrid()

#hypertune the parameters
stune <- tuneParams(learner = makeatree, resampling = set_cv, 
                    task = trainTask, par.set = gs, control = gscontrol, measures = acc)

stune

## Tune result:
## Op. pars: minsplit=37; minbucket=15; cp=0.001
## acc.test.mean=0.84

#using hyperparameters for modeling,train the model using caret package
treefit<-train(data=train,Income.Group~.,"rpart",minsplit=37,minbucket=15,cp=0.001)

#make predictions
tpmodel <- predict(treefit, test)

# #create a submission file
# submit <- data.frame(ID = mydata$ID[32562:48842], Income.Group = tpmodel)
# write.csv(submit, "submission1.csv",row.names = F)

This model gives about 0.82 accuracy on the public leaderboard. Perhaps this can be improved using random forest.
2. rf

#create a learner
rf <- makeLearner("classif.randomForest", predict.type = "response"
                  , par.vals = list(ntree = 200, mtry = 3))
rf$par.vals <- list(importance = TRUE)

#set tunable parameters
#grid search to find hyperparameters
rf_param <- makeParamSet(
        makeIntegerParam("ntree",lower = 50, upper = 500),
        makeIntegerParam("mtry", lower = 3, upper = 10),
        makeIntegerParam("nodesize", lower = 10, upper = 50)
)

#random search for 50 iterations, it will take far too long to cover all the grid
rancontrol <- makeTuneControlRandom(maxit = 50L)

#set 5 fold cross validation
set_cv <- makeResampleDesc("CV",iters = 5L)

#hypertuning
rf_tune <- tuneParams(learner = rf, resampling = set_cv, task = trainTask, par.set = rf_param, control = rancontrol, measures = acc)

#cv accuracy
rf_tune$y

## acc.test.mean 
##     0.8438929

#best parameters
rf_tune$x

## $ntree
## [1] 407
## 
## $mtry
## [1] 3
## 
## $nodesize
## [1] 44

Trying to use mlr to train the model after hypertuning resulted in errors.(EDIT:I later realised that there were conflictions between caret train and mlr train, and caret must be unloaded to prevent this error.) I used caret train instead. This did not allow me to override its search for optimal mtry but I fixed the other tuning parameters to the above values.

#train the model
rffit<-train(data=train,Income.Group~.,"rf",ntree=407,nodesize=44)

#make predictions
rfmodel <- predict(rffit, test)

# #create a submission file
# submit <- data.frame(ID = mydata$ID[32562:48842], Income.Group = rfmodel)
# write.csv(submit, "submission2.csv",row.names = F)

Interestingly, the optimal mtry turned out to be 2, which I hadn’t included in the hypertuning search.
The accuracy on the public leaderboard increased to 0.84.

Looking for further improvement

library(caret)
varImp(rffit)

## rf variable importance
## 
##   only 20 most important variables shown (out of 72)
## 
##                                  Overall
## Marital.StatusMarried-civ-spouse 100.000
## Age                               41.357
## Hours.Per.Week                    28.873
## OccupationProf-specialty          21.030
## Marital.StatusNever-married       20.687
## OccupationExec-managerial         18.458
## EducationBachelors                17.291
## EducationMasters                   8.576
## SexMale                            5.446
## EducationProf-school               4.997
## EducationHS-grad                   4.446
## WorkclassSelf-emp-not-inc          4.379
## EducationSome-college              4.014
## EducationDoctorate                 3.961
## RelationshipWife                   3.286
## OccupationSales                    3.261
## OccupationTech-support             3.143
## OccupationOther-service            2.725
## RelationshipNot-in-family          2.494
## Native.Countrypoor                 2.286

varImp(treefit)

## rpart variable importance
## 
##   only 20 most important variables shown (out of 78)
## 
##                                    Overall
## Marital.StatusMarried-civ-spouse   100.000
## Age                                 64.699
## Marital.StatusNever-married         51.278
## Hours.Per.Week                      41.753
## EducationBachelors                  37.147
## OccupationExec-managerial           32.574
## RelationshipOwn-child               26.410
## OccupationProf-specialty            14.447
## EducationHS-grad                    11.153
## EducationMasters                     6.726
## OccupationSales                      4.712
## Native.Countrypoor                   2.657
## `OccupationArmed-Forces`             0.000
## Native.CountryGreece                 0.000
## `Marital.StatusMarried-civ-spouse`   0.000
## `OccupationProtective-serv`          0.000
## `EducationHS-grad`                   0.000
## Marital.StatusSeparated              0.000
## Native.CountryHong                   0.000
## `RelationshipOther-relative`         0.000

I now try to improve accuracy by combining levels of factors that are not important.

library(car)
train$Occupation<-recode(train$Occupation,"c('Adm-clerical','Armed-Forces','Craft-repair','Farming-fishing', 'Handlers-cleaners', 'Machine-op-inspct', 'Protective-serv' , 'Priv-house-serv','Transport-moving' ) = 'other'")
test$Occupation<- recode(test$Occupation,"c('Adm-clerical','Armed-Forces','Craft-repair','Farming-fishing', 'Handlers-cleaners', 'Machine-op-inspct', 'Protective-serv' , 'Priv-house-serv','Transport-moving' )='other'")
train$Education<- recode(train$Education,"c('10th','11th','12th','1st-4th','5th-6th','7th-8th', '9th','Assoc-acdm','Assoc-voc','Preschool' )='low'")
test$Education<- recode(test$Education,"c('10th','11th','12th','1st-4th','5th-6th','7th-8th', '9th','Assoc-acdm','Assoc-voc','Preschool' )='low'")
train$Marital.Status<-recode(train$Marital.Status,"c('Divorced','Married-AF-spouse','Married-spouse-absent','Separated','Widowed')='other'")
test$Marital.Status<-recode(test$Marital.Status,"c('Divorced','Married-AF-spouse','Married-spouse-absent','Separated','Widowed')='other'")
train$Workclass<-recode(train$Workclass,"c('Federal-gov','Local-gov','State-gov')='gov'")
test$Workclass<-recode(test$Workclass,"c('Federal-gov','Local-gov','State-gov')='gov'")
train$Native.Country <-recode(train$Native.Country,"c('Cambodia','Canada', 'China','Cuba','England','France','Germany','Greece','Hong','Hungary','India', 'Iran','Ireland','Italy','Japan','Philippines','Scotland','Taiwan','Yugoslavia'   )='other'")
test$Native.Country <-recode(test$Native.Country,"c('Cambodia','Canada', 'China','Cuba','England','France','Germany','Greece','Hong','Hungary','India', 'Iran','Ireland','Italy','Japan','Philippines','Scotland','Taiwan','Yugoslavia'   )='other'")

Make new rf:

#make task
trainTask <- makeClassifTask(data = train,target = "Income.Group")
testTask <- makeClassifTask(data = test, target = "Income.Group")

#normalizae

trainTask <- normalizeFeatures(trainTask,method = "standardize")
testTask <- normalizeFeatures(testTask,method = "standardize")
#create a learner
rf2 <- makeLearner("classif.randomForest", predict.type = "response"
                  , par.vals = list(ntree = 200, mtry = 3))
rf$par.vals <- list(importance = TRUE)

#set tunable parameters
#grid search to find hyperparameters
rf_param <- makeParamSet(
        makeIntegerParam("ntree",lower = 50, upper = 500),
        makeIntegerParam("mtry", lower = 2, upper = 4),
        makeIntegerParam("nodesize", lower = 10, upper = 50)
)

#random search for 50 iterations, it will take far too long to cover all the grid
rancontrol <- makeTuneControlRandom(maxit = 50L)

#set 5 fold cross validation
set_cv <- makeResampleDesc("CV",iters = 5L)

#hypertuning
rf_tune2 <- tuneParams(learner = rf2, resampling = set_cv, task = trainTask, par.set = rf_param, control = rancontrol, measures = acc)

#using hyperparameters for modeling
rf.tree <- setHyperPars(rf2, par.vals = rf_tune2$x)
#train a model
unloadNamespace("caret")
rforest <- train(learner=rf.tree, task=trainTask)

#make predictions
rfmodel2 <- predict(rforest, testTask)

# #create a submission file
# submit <- data.frame(ID = mydata$ID[32562:48842], Income.Group = rfmodel2$data$response)
#  write.csv(submit, "submission3.csv",row.names = F)

This didn’t lead to an improvement, of course, but achieved the same accuracy a little more efficiently.

Now I’ll try xgboost using the mlr package.

#create a learner
xgb <- makeLearner("classif.xgboost", predict.type = "response"
                  , par.vals = list(eta=0.1,gamma=0,max_depth=6,min_child_weight=1,subsample=0.8,colsample_bytree=0.8,nthread=2,nrounds=1))
xgb$par.vals <- list(importance = TRUE)

#set tunable parameters
#grid search to find hyperparameters
xgb_param <- makeParamSet(
        makeNumericParam("eta",lower = 0.05, upper = 0.3),
        makeNumericParam("gamma", lower = 0, upper = 3),
        makeNumericParam("subsample", lower = 0.5, upper = 1),
        makeNumericParam("colsample_bytree", lower = 0.5, upper = 1),
        makeDiscreteParam("max_depth" ,values=3:10),
        makeDiscreteParam("min_child_weight",values=1:8),
        makeDiscreteParam("nrounds",values=1:100)
)

#let's do random search for 100 iterations
rancontrol <- makeTuneControlRandom(maxit = 100L)

#set 5 fold cross validation
set_cv <- makeResampleDesc("CV",iters = 5L)

#hypertuning
xgb_tune <- tuneParams(learner = xgb, resampling = set_cv, task = trainTask, par.set = xgb_param, control = rancontrol, measures = acc)

#cv accuracy
xgb_tune$y

## acc.test.mean 
##     0.8393168

#best parameters
xgb_tune$x

## $eta
## [1] 0.1482234
## 
## $gamma
## [1] 1.766518
## 
## $subsample
## [1] 0.735649
## 
## $colsample_bytree
## [1] 0.5425878
## 
## $max_depth
## [1] 8
## 
## $min_child_weight
## [1] 4
## 
## $nrounds
## [1] 48

#using hyperparameters for modeling
xgb.tuned <- setHyperPars(xgb, par.vals = xgb_tune$x)

#train a model
xgb.fit <- train(xgb.tuned, trainTask)

#make predictions
xgbmodel <- predict(xgb.fit, testTask)

#submission file
#  submit <- data.frame(ID = mydata$ID[32562:48842], Income.Group = xgbmodel$data$response)
# write.csv(submit, "submission4.csv",row.names = F)

This actually improved the accuracy on the public leaderboard to just over 0.84. I should probably retune the paramters because the tuned value of min_child_weight was 8 and this was the maximum value in the grid, so possibly it could be more than that, but I am going to leave it at this for now! Also more iterations might improve the accuracy, since there are an enormous amount of possible combinations of so many tuning parameters. Maybe another day…

Experiments with data

Elisa Lerner

5 September 2016

The data

Prepare the data for modelling

Tune and train models, and predict

Looking for further improvement