Introduction

Long ago, in the distant, fragrant mists of time, there was a competition… It was not just any competition.

It was a competition that challenged mere mortals to model a 20,000x200 matrix of continuous variables using only 250 training samples… without overfitting.

Data scientists, including Kaggle’s very own Will Cukierski, competed by the hundreds. Legends were made. (Will took 5th place, and eventually ended up working at Kaggle!) People overfit like crazy. It was a Kaggle-y, data science-y madhouse.

So… we’re doing it again.

Don’t Overfit II: The Overfittening

This is the next logical step in the evolution of weird competitions. Once again we have 20,000 rows of continuous variables, and a mere handful of training samples. Once again, we challenge you not to overfit. Do your best, model without overfitting, and add, perhaps, to your own legend.

In addition to bragging rights, the winner also gets swag. Enjoy!

Step#1 - Load packages and datasets.

write(paste0(date(), "::", "Start!!!"), file = "log.txt", append = TRUE)
library(gbm)
library(dplyr)
setwd("G:/DataScienceProject/Kaggle-dont-overfit-ii")
TrainDF <- read.csv(file = "train.csv", header = TRUE)
TestDF <- read.csv(file = "test.csv", header = TRUE)
TestDF$target <- 0 #Placeholder target so the test set matches the train layout
TestDF <- TestDF[, c(1, 302, 2:301)] #Reorder so target becomes column 2, as in TrainDF
head(glimpse(TrainDF), 25) #glimpse prints the structure; head shows the first 25 rows
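
With only 250 labelled rows against roughly 19,750 test rows, a quick sanity check on shapes and class balance is cheap insurance; a minimal sketch (the expected dimensions are taken from the competition description):

dim(TrainDF) #Expect 250 rows x 302 columns (id, target, X0..X299)
dim(TestDF) #Expect 19,750 rows x 302 columns after the reorder
table(TrainDF$target) #Class balance of the binary target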

Step#2 - Run a quick GBM (gradient boosting machine), just to check the impact of the dataset's variables.

modFit <- gbm(target ~ . - id, data = TrainDF,
              distribution = 'bernoulli',
              n.trees = 55, #Number of boosting iterations (trees)
              shrinkage = 0.1, #Learning rate
              interaction.depth = 2, #Maximum depth of each tree
              bag.fraction = 0.5, #Fraction of the data sampled for each tree
              train.fraction = 0.95, #Fraction used for training; the rest tracks held-out error
              n.minobsinnode = 10, #Minimum observations in a terminal node
              cv.folds = 10, #Number of cross-validation folds
              keep.data = TRUE,
              verbose = FALSE, n.cores = 1)

gbm.perf(modFit, method = "OOB") #Optimal iteration count estimated from the out-of-bag error
saveRDS(modFit, "modFit-gbm-1.rds")
write(paste0(date(), "::", "BoostingTrees - DONE"), file = "log.txt", append = TRUE)
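
Since cv.folds = 10 was set, the model also carries a cross-validation error curve, and comparing it with the OOB estimate is a worthwhile check: the gbm documentation notes that the OOB method tends to underestimate the optimal number of iterations. A short sketch:

bestOOB <- gbm.perf(modFit, method = "OOB", plot.it = FALSE)
bestCV <- gbm.perf(modFit, method = "cv", plot.it = FALSE)
c(OOB = bestOOB, CV = bestCV) #The CV estimate is usually the more reliable of the two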

Step#3 - Experiment to find the minimal set of variables that still predicts well (the 80/20 idea). Let's take the top 5 variables by relative influence.

topColDF <- as.data.frame(summary(modFit)) #Relative influence of each variable, sorted descending
topColDF <- head(topColDF, 5) #Keep the top 5
topColDF$var <- as.character(topColDF$var)
topColDF$var <- gsub("X", "", topColDF$var)
topColDF$var <- as.integer(topColDF$var) + 3 #Map "Xn" names to column indices (X0 is column 3)
topColDF <- topColDF[order(topColDF$var),]
tunningDF <- TrainDF[, c(1, 2)] #Start with id and target
predicDF <- TestDF[, c(1, 2)]
for (i in seq_along(topColDF$var)) {
  D <- paste0("X", topColDF[i, 1] - 3) #Rebuild the original column name
  tunningDF[, D] <- TrainDF[, topColDF[i, 1]]
  predicDF[, D] <- TestDF[, topColDF[i, 1]]
}
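
To confirm the loop picked up the intended columns, a quick look at the names and the selection table (nothing beyond base R):

names(tunningDF) #Should list id, target, and the five selected X columns
topColDF #Chosen column indices (var) and their relative influence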

Step#4 - Run GBM on the new datasets. tunningDF is used to fit the model, while predicDF is used for the actual prediction.

modFit5 <- gbm(target ~ . - id, data = tunningDF,
              distribution = 'bernoulli',
              n.trees = 55, #Number of boosting iterations (trees)
              shrinkage = 0.1, #Learning rate
              interaction.depth = 2, #Maximum depth of each tree
              bag.fraction = 0.5, #Fraction of the data sampled for each tree
              train.fraction = 0.95, #Fraction used for training; the rest tracks held-out error
              n.minobsinnode = 10, #Minimum observations in a terminal node
              cv.folds = 10, #Number of cross-validation folds
              keep.data = TRUE,
              verbose = FALSE, n.cores = 1)

gbm.perf(modFit5, method = "OOB")
saveRDS(modFit5, "modFit-gbm-5.rds")
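
The competition is scored on AUC, so an out-of-fold estimate before submitting is useful. Because cv.folds = 10 was set, the fitted object exposes out-of-fold predictions in modFit5$cv.fitted (on the link scale); below is a minimal rank-based (Mann-Whitney) AUC sketch in base R, assuming cv.fitted aligns with the leading rows of TrainDF:

cvLink <- modFit5$cv.fitted #Out-of-fold predictions on the log-odds scale
y <- TrainDF$target[seq_along(cvLink)] #Align labels (train.fraction < 1 can shorten cv.fitted)
pos <- y == 1
r <- rank(plogis(cvLink)) #AUC only needs ranks; plogis is monotone
auc <- (sum(r[pos]) - sum(pos) * (sum(pos) + 1) / 2) / (sum(pos) * sum(!pos)) #Mann-Whitney form
auc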

bestIter <- gbm.perf(modFit5, method = "OOB", plot.it = FALSE) #Use the OOB-optimal number of trees
Prediction <- plogis(predict.gbm(modFit5, newdata = predicDF, n.trees = bestIter)) #Map log-odds to probabilities
predicDF <- as.data.frame(predicDF[, 1])
predicDF <- cbind(predicDF, Prediction)
names(predicDF) <- c("id", "target")
predicDF$target <- ifelse(predicDF$target > 0.5, 1, 0) #Threshold at 0.5 to get hard labels
predicDF <- predicDF[order(predicDF$id),]
write.csv(predicDF, file = "submit.csv", row.names = FALSE)
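
One caveat: AUC only ranks predictions, so collapsing probabilities to hard 0/1 labels throws information away. A variant worth trying is to submit the raw probabilities instead; the sketch below writes them to a second, hypothetical file (submit-prob.csv) so the submission above stays untouched:

probDF <- data.frame(id = TestDF$id, target = Prediction) #Prediction is still in TestDF row order
probDF <- probDF[order(probDF$id), ]
write.csv(probDF, file = "submit-prob.csv", row.names = FALSE)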