This report shows how I reached accuracy of 0.84 on the public leaderboard for hackathon “Experiments with data” on Analytics Vidhya website. The datasets for the analysis are available here.
The target is to classify whether a person’s salary is in the <=50K range, or >50K, without having explicit salary data.
I downloaded the datsets and read them into R from a local directory.
library(mlr)
train<-read.csv("train_gbW7HTd.csv", na.strings = c(""," ",NA))
test<-read.csv("test_2AFBew7.csv", na.strings = c(""," ",NA))
summarizeColumns(train)
## name type na mean disp median mad
## 1 ID integer 0 16281.00000 9399.6953940 16281 12068.3640
## 2 Age integer 0 38.58165 13.6404326 37 14.8260
## 3 Workclass factor 1836 NA NA NA NA
## 4 Education factor 0 NA 0.6774976 NA NA
## 5 Marital.Status factor 0 NA 0.5400633 NA NA
## 6 Occupation factor 1843 NA NA NA NA
## 7 Relationship factor 0 NA 0.5948220 NA NA
## 8 Race factor 0 NA 0.1457265 NA NA
## 9 Sex factor 0 NA 0.3307945 NA NA
## 10 Hours.Per.Week integer 0 40.43746 12.3474287 40 4.4478
## 11 Native.Country factor 583 NA NA NA NA
## 12 Income.Group factor 0 NA 0.2408096 NA NA
## min max nlevs
## 1 1 32561 0
## 2 17 90 0
## 3 7 22696 8
## 4 51 10501 16
## 5 23 14976 7
## 6 9 4140 14
## 7 981 13193 6
## 8 271 27816 5
## 9 10771 21790 2
## 10 1 99 0
## 11 1 29170 41
## 12 7841 24720 2
summarizeColumns(test)
## name type na mean disp median mad
## 1 ID integer 0 40702.00000 4700.0642017 40702 6034.1820
## 2 Age integer 0 38.76746 13.8491868 37 14.8260
## 3 Workclass factor 963 NA NA NA NA
## 4 Education factor 0 NA 0.6755113 NA NA
## 5 Marital.Status factor 0 NA 0.5452982 NA NA
## 6 Occupation factor 966 NA NA NA NA
## 7 Relationship factor 0 NA 0.5993489 NA NA
## 8 Race factor 0 NA 0.1434187 NA NA
## 9 Sex factor 0 NA 0.3329648 NA NA
## 10 Hours.Per.Week integer 0 40.39224 12.4793322 40 4.4478
## 11 Native.Country factor 274 NA NA NA NA
## min max nlevs
## 1 32562 48842 0
## 2 17 90 0
## 3 3 11210 8
## 4 32 5283 16
## 5 14 7403 7
## 6 6 2032 14
## 7 525 6523 6
## 8 135 13946 5
## 9 5421 10860 2
## 10 1 99 0
## 11 5 14662 40
The test set has no variable “Income.Group” .This is the dependant variable.
levels(train$Income.Group)
## [1] "<=50K" ">50K"
p = prop.table(table(train$Income.Group))
p
##
## <=50K >50K
## 0.7591904 0.2408096
NA’s are present in the both sets in variables:“Workclass”,“occupation”,“Native Country”. These will need to be analysed first. They are all factor variables.
#join train and test sets to explore all data at once
mydata<-rbind(train[,1:11],test)
#inspect Native.Country variable
as.matrix(summary(mydata$Native.Country))
## [,1]
## Cambodia 28
## Canada 182
## China 122
## Columbia 85
## Cuba 138
## Dominican-Republic 103
## Ecuador 45
## El-Salvador 155
## England 127
## France 38
## Germany 206
## Greece 49
## Guatemala 88
## Haiti 75
## Holand-Netherlands 1
## Honduras 20
## Hong 30
## Hungary 19
## India 151
## Iran 59
## Ireland 37
## Italy 105
## Jamaica 106
## Japan 92
## Laos 23
## Mexico 951
## Nicaragua 49
## Outlying-US(Guam-USVI-etc) 23
## Peru 46
## Philippines 295
## Poland 87
## Portugal 67
## Puerto-Rico 184
## Scotland 21
## South 115
## Taiwan 65
## Thailand 30
## Trinadad&Tobago 27
## United-States 43832
## Vietnam 86
## Yugoslavia 23
## NA's 857
#most of the country data is United States. The only country having more
#than the number of NA's is Mexico. All the others have btween 1-295 counts.
#There are so many levels. Does the country make a difference?
country<-aggregate(Income.Group~Native.Country,data=train,table)
country$total<-table(train$Native.Country)
country[,2][,1]<-country[,2][,1]/country$total
country[,2][,2]<-country[,2][,2]/country$total
country
## Native.Country Income.Group.<=50K Income.Group.>50K total
## 1 Cambodia 0.63157895 0.36842105 19
## 2 Canada 0.67768595 0.32231405 121
## 3 China 0.73333333 0.26666667 75
## 4 Columbia 0.96610169 0.03389831 59
## 5 Cuba 0.73684211 0.26315789 95
## 6 Dominican-Republic 0.97142857 0.02857143 70
## 7 Ecuador 0.85714286 0.14285714 28
## 8 El-Salvador 0.91509434 0.08490566 106
## 9 England 0.66666667 0.33333333 90
## 10 France 0.58620690 0.41379310 29
## 11 Germany 0.67883212 0.32116788 137
## 12 Greece 0.72413793 0.27586207 29
## 13 Guatemala 0.95312500 0.04687500 64
## 14 Haiti 0.90909091 0.09090909 44
## 15 Holand-Netherlands 1.00000000 0.00000000 1
## 16 Honduras 0.92307692 0.07692308 13
## 17 Hong 0.70000000 0.30000000 20
## 18 Hungary 0.76923077 0.23076923 13
## 19 India 0.60000000 0.40000000 100
## 20 Iran 0.58139535 0.41860465 43
## 21 Ireland 0.79166667 0.20833333 24
## 22 Italy 0.65753425 0.34246575 73
## 23 Jamaica 0.87654321 0.12345679 81
## 24 Japan 0.61290323 0.38709677 62
## 25 Laos 0.88888889 0.11111111 18
## 26 Mexico 0.94867807 0.05132193 643
## 27 Nicaragua 0.94117647 0.05882353 34
## 28 Outlying-US(Guam-USVI-etc) 1.00000000 0.00000000 14
## 29 Peru 0.93548387 0.06451613 31
## 30 Philippines 0.69191919 0.30808081 198
## 31 Poland 0.80000000 0.20000000 60
## 32 Portugal 0.89189189 0.10810811 37
## 33 Puerto-Rico 0.89473684 0.10526316 114
## 34 Scotland 0.75000000 0.25000000 12
## 35 South 0.80000000 0.20000000 80
## 36 Taiwan 0.60784314 0.39215686 51
## 37 Thailand 0.83333333 0.16666667 18
## 38 Trinadad&Tobago 0.89473684 0.10526316 19
## 39 United-States 0.75416524 0.24583476 29170
## 40 Vietnam 0.92537313 0.07462687 67
## 41 Yugoslavia 0.62500000 0.37500000 16
prop.table(table(train$Income.Group[is.na(train$Native.Country)]))
##
## <=50K >50K
## 0.7495712 0.2504288
#The observations with NA's have a similar proportion to US or Scotland
#It makes sense to impute missing values with Mode (=US)
train_imp <-train
train_imp$Native.Country[is.na(train_imp$Native.Country)]<-"United-States"
test_imp<-test
test_imp$Native.Country[is.na(test_imp$Native.Country)]<-"United-States"
Next, look at workclass variable.
as.matrix(summary(mydata$Workclass))
## [,1]
## Federal-gov 1432
## Local-gov 3136
## Never-worked 10
## Private 33906
## Self-emp-inc 1695
## Self-emp-not-inc 3862
## State-gov 1981
## Without-pay 21
## NA's 2799
workclass<-aggregate(Income.Group~Workclass,data=train,table)
workclass$total<-table(train$Workclass)
workclass[,2][,1]<-workclass[,2][,1]/workclass$total
workclass[,2][,2]<-workclass[,2][,2]/workclass$total
workclass
## Workclass Income.Group.<=50K Income.Group.>50K total
## 1 Federal-gov 0.6135417 0.3864583 960
## 2 Local-gov 0.7052078 0.2947922 2093
## 3 Never-worked 1.0000000 0.0000000 7
## 4 Private 0.7813271 0.2186729 22696
## 5 Self-emp-inc 0.4426523 0.5573477 1116
## 6 Self-emp-not-inc 0.7150728 0.2849272 2541
## 7 State-gov 0.7280431 0.2719569 1298
## 8 Without-pay 1.0000000 0.0000000 14
prop.table(table(train$Income.Group[is.na(train$Workclass)]))
##
## <=50K >50K
## 0.8959695 0.1040305
In this case it will be necessary to use a more sophisticated imputation method, since the proportions are very different from all of the groups. I will use rpart.
Notice that two levels can be combined, Never-worked and Without-pay. These will always be in <50 category anyway.
Finally, the Ocuupation variable:
as.matrix(summary(mydata$Occupation))
## [,1]
## Adm-clerical 5611
## Armed-Forces 15
## Craft-repair 6112
## Exec-managerial 6086
## Farming-fishing 1490
## Handlers-cleaners 2072
## Machine-op-inspct 3022
## Other-service 4923
## Priv-house-serv 242
## Prof-specialty 6172
## Protective-serv 983
## Sales 5504
## Tech-support 1446
## Transport-moving 2355
## NA's 2809
occ<-aggregate(Income.Group~Occupation,data=train,table)
occ$total<-table(train$Occupation)
occ[,2][,1]<-occ[,2][,1]/occ$total
occ[,2][,2]<-occ[,2][,2]/occ$total
occ
## Occupation Income.Group.<=50K Income.Group.>50K total
## 1 Adm-clerical 0.865517241 0.134482759 3770
## 2 Armed-Forces 0.888888889 0.111111111 9
## 3 Craft-repair 0.773359356 0.226640644 4099
## 4 Exec-managerial 0.515986227 0.484013773 4066
## 5 Farming-fishing 0.884305835 0.115694165 994
## 6 Handlers-cleaners 0.937226277 0.062773723 1370
## 7 Machine-op-inspct 0.875124875 0.124875125 2002
## 8 Other-service 0.958421851 0.041578149 3295
## 9 Priv-house-serv 0.993288591 0.006711409 149
## 10 Prof-specialty 0.550966184 0.449033816 4140
## 11 Protective-serv 0.674884438 0.325115562 649
## 12 Sales 0.730684932 0.269315068 3650
## 13 Tech-support 0.695043103 0.304956897 928
## 14 Transport-moving 0.799624296 0.200375704 1597
prop.table(table(train$Income.Group[is.na(train$Occupation)]))
##
## <=50K >50K
## 0.8963646 0.1036354
This variable has similar proportions to Armed-Forces, but very few of this category exist in the data. It is also similar to Farming-fishing, but again there are more NA’s than observations in this category, again it will be better to impute using rpart.
Join sets for preprocessing:
#combine two levels of workclass
mydata<-rbind(train_imp[,1:11],test_imp)
mydata$Workclass[mydata$Workclass=="Without-pay"]<-"Never-worked"
mydata$Workclass<-as.character(mydata$Workclass)
mydata$Workclass<-as.factor(mydata$Workclass)
library(caret)
#separate into two data sets with known workclass and na's,leaving out occupation
workclass<-mydata[!is.na(mydata$Workclass),c(2:5,7:11)]
workna<-mydata[is.na(mydata$Workclass),c(2:5,7:11)]
t2<-createDataPartition(workclass$Workclass,list=FALSE,p=0.8)
worktrain<-workclass[t2,]
worktest<-workclass[-t2,]
worktree<-train(data=worktrain,Workclass~.,"rpart")
workpred1<-predict(worktree,newdata=worktest)
confusionMatrix(workpred1,worktest$Workclass)$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.7365048 0.0000000 0.7273783 0.7454815 0.7365048
## AccuracyPValue McnemarPValue
## 0.5054627 NaN
#quite good!
mydata$Workclass[is.na(mydata$Workclass)]<-predict(worktree,mydata[is.na(mydata$Workclass),])
#do the same for occupation,use workclass, including predicted values
occ<-mydata[!is.na(mydata$Occupation),2:11]
occna<-mydata[is.na(mydata$Occupation),c(2:11)]
t1<-createDataPartition(occ$Occupation,list=FALSE,p=0.8)
occtrain<-occ[t1,]
occtest<-occ[-t1,]
occtree<-train(data=occtrain,Occupation~.,"rpart")
occpred1<-predict(occtree,newdata=occtest)
confusionMatrix(occpred1,occtest$Occupation)$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 2.325581e-01 1.187346e-01 2.239578e-01 2.413281e-01 1.341013e-01
## AccuracyPValue McnemarPValue
## 6.820372e-144 NaN
#This performs really badly
#replacing them all with the mode is even worse
occpred2<-rep("Prof-specialty",dim(occtest)[1])
confusionMatrix(occpred2,occtest$Occupation)$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.1341013 0.0000000 0.1272012 0.1412339 0.1341013
## AccuracyPValue McnemarPValue
## 0.5046136 NaN
#using rpart seems to be more reasonable
mydata$Occupation[is.na(mydata$Occupation)]<-predict(occtree,mydata[is.na(mydata$Occupation),])
#reseparate into train and test sets
train<-cbind(mydata[1:32561,],Income.Group=train$Income.Group)
test<-mydata[32562:48842,]
I now take a look at the remaining variables to see if they each independently appear to have a relationship with the Income.Group target variable.
#Inspecting age variable:
print(ggplot(data=train,aes(x=Income.Group,y=Age))+geom_boxplot())
#age is apparently important, the distribution of ages for each income group are different.
#workclass
print(ggplot(data=train,aes(x=Workclass))+geom_bar()+facet_grid(Income.Group)~.)
## ggplot(data = train, aes(x = Workclass)) + geom_bar() + facet_grid(Income.Group) ~
## .
#makes a difference
#education
print(ggplot(data=train,aes(x=Education))+geom_bar()+facet_grid(Income.Group~.))
#also makes a difference
#marital status
print(ggplot(data=train,aes(x=Marital.Status))+geom_bar()+facet_grid(Income.Group~.))
#a big difference
#occupation
print(ggplot(data=train,aes(x=Occupation))+geom_bar()+facet_grid(Income.Group~.))
#also makes a difference
#relationship
print(ggplot(data=train,aes(x=Relationship))+geom_bar()+facet_grid(Income.Group~.))
#this too
#race
print(ggplot(data=train,aes(x=Race))+geom_bar()+facet_grid(Income.Group~.))
#and this
#sex
print(ggplot(data=train,aes(x=Sex))+geom_bar()+facet_grid(Income.Group~.))
#definitely
#native country
print(ggplot(data=train,aes(x=Native.Country))+geom_bar()+facet_grid(Income.Group~.))
#there are too many levels.
summary(country$Income.Group[,2])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.08491 0.20000 0.20240 0.32120 0.41860
#this shows the statistics of the proportion of >50 by country
#group countries with less than median as poor
poor<-country$Native.Country[country$Income.Group[,2]<=0.2]
mydata$Native.Country<-as.character(mydata$Native.Country)
mydata$Native.Country[mydata$Native.Country%in% poor]<-"poor"
mydata$Native.Country<-as.factor(mydata$Native.Country)
#reseparate into train and test sets
train<-cbind(mydata[1:32561,],Income.Group=train$Income.Group)
test<-mydata[32562:48842,]
print(ggplot(data=train,aes(x=Native.Country))+geom_bar()+facet_grid(Income.Group~.))
#hours per week
print(ggplot(data=train,aes(x=Income.Group,y=Hours.Per.Week))+geom_boxplot())
#75% of the <50 work less hours than 25% of >50.Many outliers
It is mecessary to create a fictive Income group for test set, so that the mlr package can be used.
test$Income.Group<-as.factor(sample(levels(train$Income.Group),dim(test)[1],replace=TRUE,c(0.5,0.5)))
#remove ID variable. This should not be included in the anaylsis.
train<-train[,2:12]
test<-test[,2:12]
#model
trainTask <- makeClassifTask(data = train,target = "Income.Group")
testTask <- makeClassifTask(data = test, target = "Income.Group")
#both numeric variables are skewed, they need to be normalized before training
hist(train$Age)
hist(train$Hours.Per.Week)
trainTask <- normalizeFeatures(trainTask,method = "standardize")
testTask <- normalizeFeatures(testTask,method = "standardize")
#make tree learner
makeatree <- makeLearner("classif.rpart", predict.type = "response")
#set 5 fold cross validation
set_cv <- makeResampleDesc("CV",iters = 5L)
#Search for hyperparameters
gs <- makeParamSet(
makeIntegerParam("minsplit",lower = 10, upper = 50),
makeIntegerParam("minbucket", lower = 5, upper = 50),
makeNumericParam("cp", lower = 0.001, upper = 0.2)
)
gscontrol <- makeTuneControlGrid()
#hypertune the parameters
stune <- tuneParams(learner = makeatree, resampling = set_cv,
task = trainTask, par.set = gs, control = gscontrol, measures = acc)
stune
## Tune result:
## Op. pars: minsplit=37; minbucket=15; cp=0.001
## acc.test.mean=0.84
#using hyperparameters for modeling,train the model using caret package
treefit<-train(data=train,Income.Group~.,"rpart",minsplit=37,minbucket=15,cp=0.001)
#make predictions
tpmodel <- predict(treefit, test)
# #create a submission file
# submit <- data.frame(ID = mydata$ID[32562:48842], Income.Group = tpmodel)
# write.csv(submit, "submission1.csv",row.names = F)
This model gives about 0.82 accuracy on the public leaderboard. Perhaps this can be improved using random forest.
2. rf
#create a learner
rf <- makeLearner("classif.randomForest", predict.type = "response"
, par.vals = list(ntree = 200, mtry = 3))
rf$par.vals <- list(importance = TRUE)
#set tunable parameters
#grid search to find hyperparameters
rf_param <- makeParamSet(
makeIntegerParam("ntree",lower = 50, upper = 500),
makeIntegerParam("mtry", lower = 3, upper = 10),
makeIntegerParam("nodesize", lower = 10, upper = 50)
)
#random search for 50 iterations, it will take far too long to cover all the grid
rancontrol <- makeTuneControlRandom(maxit = 50L)
#set 5 fold cross validation
set_cv <- makeResampleDesc("CV",iters = 5L)
#hypertuning
rf_tune <- tuneParams(learner = rf, resampling = set_cv, task = trainTask, par.set = rf_param, control = rancontrol, measures = acc)
#cv accuracy
rf_tune$y
## acc.test.mean
## 0.8438929
#best parameters
rf_tune$x
## $ntree
## [1] 407
##
## $mtry
## [1] 3
##
## $nodesize
## [1] 44
Trying to use mlr to train the model after hypertuning resulted in errors.(EDIT:I later realised that there were conflictions between caret train and mlr train, and caret must be unloaded to prevent this error.) I used caret train instead. This did not allow me to override its search for optimal mtry but I fixed the other tuning parameters to the above values.
#train the model
rffit<-train(data=train,Income.Group~.,"rf",ntree=407,nodesize=44)
#make predictions
rfmodel <- predict(rffit, test)
# #create a submission file
# submit <- data.frame(ID = mydata$ID[32562:48842], Income.Group = rfmodel)
# write.csv(submit, "submission2.csv",row.names = F)
Interestingly, the optimal mtry turned out to be 2, which I hadn’t included in the hypertuning search.
The accuracy on the public leaderboard increased to 0.84.
library(caret)
varImp(rffit)
## rf variable importance
##
## only 20 most important variables shown (out of 72)
##
## Overall
## Marital.StatusMarried-civ-spouse 100.000
## Age 41.357
## Hours.Per.Week 28.873
## OccupationProf-specialty 21.030
## Marital.StatusNever-married 20.687
## OccupationExec-managerial 18.458
## EducationBachelors 17.291
## EducationMasters 8.576
## SexMale 5.446
## EducationProf-school 4.997
## EducationHS-grad 4.446
## WorkclassSelf-emp-not-inc 4.379
## EducationSome-college 4.014
## EducationDoctorate 3.961
## RelationshipWife 3.286
## OccupationSales 3.261
## OccupationTech-support 3.143
## OccupationOther-service 2.725
## RelationshipNot-in-family 2.494
## Native.Countrypoor 2.286
varImp(treefit)
## rpart variable importance
##
## only 20 most important variables shown (out of 78)
##
## Overall
## Marital.StatusMarried-civ-spouse 100.000
## Age 64.699
## Marital.StatusNever-married 51.278
## Hours.Per.Week 41.753
## EducationBachelors 37.147
## OccupationExec-managerial 32.574
## RelationshipOwn-child 26.410
## OccupationProf-specialty 14.447
## EducationHS-grad 11.153
## EducationMasters 6.726
## OccupationSales 4.712
## Native.Countrypoor 2.657
## `OccupationArmed-Forces` 0.000
## Native.CountryGreece 0.000
## `Marital.StatusMarried-civ-spouse` 0.000
## `OccupationProtective-serv` 0.000
## `EducationHS-grad` 0.000
## Marital.StatusSeparated 0.000
## Native.CountryHong 0.000
## `RelationshipOther-relative` 0.000
I now try to improve accuracy by combining levels of factors that are not important.
library(car)
train$Occupation<-recode(train$Occupation,"c('Adm-clerical','Armed-Forces','Craft-repair','Farming-fishing', 'Handlers-cleaners', 'Machine-op-inspct', 'Protective-serv' , 'Priv-house-serv','Transport-moving' ) = 'other'")
test$Occupation<- recode(test$Occupation,"c('Adm-clerical','Armed-Forces','Craft-repair','Farming-fishing', 'Handlers-cleaners', 'Machine-op-inspct', 'Protective-serv' , 'Priv-house-serv','Transport-moving' )='other'")
train$Education<- recode(train$Education,"c('10th','11th','12th','1st-4th','5th-6th','7th-8th', '9th','Assoc-acdm','Assoc-voc','Preschool' )='low'")
test$Education<- recode(test$Education,"c('10th','11th','12th','1st-4th','5th-6th','7th-8th', '9th','Assoc-acdm','Assoc-voc','Preschool' )='low'")
train$Marital.Status<-recode(train$Marital.Status,"c('Divorced','Married-AF-spouse','Married-spouse-absent','Separated','Widowed')='other'")
test$Marital.Status<-recode(test$Marital.Status,"c('Divorced','Married-AF-spouse','Married-spouse-absent','Separated','Widowed')='other'")
train$Workclass<-recode(train$Workclass,"c('Federal-gov','Local-gov','State-gov')='gov'")
test$Workclass<-recode(test$Workclass,"c('Federal-gov','Local-gov','State-gov')='gov'")
train$Native.Country <-recode(train$Native.Country,"c('Cambodia','Canada', 'China','Cuba','England','France','Germany','Greece','Hong','Hungary','India', 'Iran','Ireland','Italy','Japan','Philippines','Scotland','Taiwan','Yugoslavia' )='other'")
test$Native.Country <-recode(test$Native.Country,"c('Cambodia','Canada', 'China','Cuba','England','France','Germany','Greece','Hong','Hungary','India', 'Iran','Ireland','Italy','Japan','Philippines','Scotland','Taiwan','Yugoslavia' )='other'")
Make new rf:
#make task
trainTask <- makeClassifTask(data = train,target = "Income.Group")
testTask <- makeClassifTask(data = test, target = "Income.Group")
#normalizae
trainTask <- normalizeFeatures(trainTask,method = "standardize")
testTask <- normalizeFeatures(testTask,method = "standardize")
#create a learner
rf2 <- makeLearner("classif.randomForest", predict.type = "response"
, par.vals = list(ntree = 200, mtry = 3))
rf$par.vals <- list(importance = TRUE)
#set tunable parameters
#grid search to find hyperparameters
rf_param <- makeParamSet(
makeIntegerParam("ntree",lower = 50, upper = 500),
makeIntegerParam("mtry", lower = 2, upper = 4),
makeIntegerParam("nodesize", lower = 10, upper = 50)
)
#random search for 50 iterations, it will take far too long to cover all the grid
rancontrol <- makeTuneControlRandom(maxit = 50L)
#set 5 fold cross validation
set_cv <- makeResampleDesc("CV",iters = 5L)
#hypertuning
rf_tune2 <- tuneParams(learner = rf2, resampling = set_cv, task = trainTask, par.set = rf_param, control = rancontrol, measures = acc)
#using hyperparameters for modeling
rf.tree <- setHyperPars(rf2, par.vals = rf_tune2$x)
#train a model
unloadNamespace("caret")
rforest <- train(learner=rf.tree, task=trainTask)
#make predictions
rfmodel2 <- predict(rforest, testTask)
# #create a submission file
# submit <- data.frame(ID = mydata$ID[32562:48842], Income.Group = rfmodel2$data$response)
# write.csv(submit, "submission3.csv",row.names = F)
This didn’t lead to an improvement, of course, but achieved the same accuracy a little more efficiently.
Now I’ll try xgboost using the mlr package.
#create a learner
xgb <- makeLearner("classif.xgboost", predict.type = "response"
, par.vals = list(eta=0.1,gamma=0,max_depth=6,min_child_weight=1,subsample=0.8,colsample_bytree=0.8,nthread=2,nrounds=1))
xgb$par.vals <- list(importance = TRUE)
#set tunable parameters
#grid search to find hyperparameters
xgb_param <- makeParamSet(
makeNumericParam("eta",lower = 0.05, upper = 0.3),
makeNumericParam("gamma", lower = 0, upper = 3),
makeNumericParam("subsample", lower = 0.5, upper = 1),
makeNumericParam("colsample_bytree", lower = 0.5, upper = 1),
makeDiscreteParam("max_depth" ,values=3:10),
makeDiscreteParam("min_child_weight",values=1:8),
makeDiscreteParam("nrounds",values=1:100)
)
#let's do random search for 100 iterations
rancontrol <- makeTuneControlRandom(maxit = 100L)
#set 5 fold cross validation
set_cv <- makeResampleDesc("CV",iters = 5L)
#hypertuning
xgb_tune <- tuneParams(learner = xgb, resampling = set_cv, task = trainTask, par.set = xgb_param, control = rancontrol, measures = acc)
#cv accuracy
xgb_tune$y
## acc.test.mean
## 0.8393168
#best parameters
xgb_tune$x
## $eta
## [1] 0.1482234
##
## $gamma
## [1] 1.766518
##
## $subsample
## [1] 0.735649
##
## $colsample_bytree
## [1] 0.5425878
##
## $max_depth
## [1] 8
##
## $min_child_weight
## [1] 4
##
## $nrounds
## [1] 48
#using hyperparameters for modeling
xgb.tuned <- setHyperPars(xgb, par.vals = xgb_tune$x)
#train a model
xgb.fit <- train(xgb.tuned, trainTask)
#make predictions
xgbmodel <- predict(xgb.fit, testTask)
#submission file
# submit <- data.frame(ID = mydata$ID[32562:48842], Income.Group = xgbmodel$data$response)
# write.csv(submit, "submission4.csv",row.names = F)
This actually improved the accuracy on the public leaderboard to just over 0.84. I should probably retune the paramters because the tuned value of min_child_weight was 8 and this was the maximum value in the grid, so possibly it could be more than that, but I am going to leave it at this for now! Also more iterations might improve the accuracy, since there are an enormous amount of possible combinations of so many tuning parameters. Maybe another day…