Kaggle-life
My team and I were fortunate to win the Kaggle competition hosted by Stanford University’s Data Mining course in 2019, which drew 430 submissions from 76 teams (135 competitors). In this post we share our approach and code. The source data and challenge details are here: https://www.kaggle.com/c/stats20219/overview
Interestingly, during the competition phase we held only the second-best score, but moved into first place when our method was evaluated on the final test set; the leading team dropped three places. The averaged predictions of three tuned GBM models achieved the best Private Score (computed on 71% of the test data) and the second-best Public Score.
Kaggle scores
In this report, we describe the following:
Team meeting 1: All members had explored the data individually, and we discussed various transformations of the predictors and the outcome variable. We decided to log-transform house price (the outcome). Each member would investigate a different class of method (KNN, GLM, random forest, etc.) so that we could average diverse predictions at the end, and we agreed to look for additional data. A key focus was the bias-variance trade-off: do not overfit the training data, but do not assume an overly simplistic model either, so that we capture the complexity of the underlying signal without fitting the noise.
Team meeting 2: Members had submitted their respective predictions to Kaggle, including averaged model predictions. We discussed how to improve the best-performing models (Cubist and GBM) by looking at additional data and transformations. We submitted an improved Cubist prediction, a GBM prediction, and a GBM prediction averaged with Cubist. Adding external data (downloaded from the web) improved most of the linear models, but not the best-performing model.
The provided data had no missing values to worry about (thanks!), so we could focus on variable transformations; a quick NA check appears after the data summary below.
We plotted histograms of all variables, noted their data types, and identified potential transformations to improve normality (assessed with QQ-plots):
## This section simplifies R package management: it determines which packages still need to be installed, installs them, and loads everything.
## ---
## add any required R packages to this list
my_packages = c("ggplot2",
"corrplot",
"caret",
"mlbench",
"xgboost",
## packages used by the modelling code further down
"ggpubr",
"kernlab",
"ranger",
"gbm",
"Cubist"
)
## generates a list of required packages not yet installed
new_packages = my_packages[!(my_packages %in% installed.packages()[,"Package"])]
## install any required packages that are not yet installed
if(length(new_packages)) install.packages(new_packages)
## load required packages
lapply(my_packages, require, character.only = TRUE)
## Load and explore the data
path.data = getwd()
df.train = read.csv(paste(path.data,"/train.csv", sep = ""), header = T, sep = ",")
df.test = read.csv(paste(path.data,"/test.csv", sep = ""), header = T, sep = ",")
## Explore data distributions
colnames(df.train) = c("id", "price","year","bedrooms", "transport.prox",
"sqft", "year.built", "scl.qlty", "crime", "lot")
colnames(df.test) = c("id","year","bedrooms", "transport.prox",
"sqft", "year.built", "scl.qlty", "crime", "lot")
summary(df.train)
## id price year bedrooms
## Min. : 1.0 Min. : 371300 Min. :2000 Min. :2.000
## 1st Qu.: 69.5 1st Qu.: 556050 1st Qu.:2005 1st Qu.:2.000
## Median :138.0 Median : 696100 Median :2009 Median :3.000
## Mean :138.0 Mean : 758570 Mean :2009 Mean :2.953
## 3rd Qu.:206.5 3rd Qu.: 903500 3rd Qu.:2014 3rd Qu.:4.000
## Max. :275.0 Max. :1859000 Max. :2018 Max. :4.000
## transport.prox sqft year.built scl.qlty
## Min. :0.0000 Min. :1100 Min. :1928 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1508 1st Qu.:1942 1st Qu.:0.0000
## Median :0.0000 Median :1944 Median :1958 Median :1.0000
## Mean :0.2764 Mean :1969 Mean :1958 Mean :0.8945
## 3rd Qu.:1.0000 3rd Qu.:2309 3rd Qu.:1974 3rd Qu.:1.0000
## Max. :1.0000 Max. :3376 Max. :1988 Max. :2.0000
## crime lot
## Min. :1.000 Min. : 5020
## 1st Qu.:2.000 1st Qu.: 6126
## Median :2.000 Median : 7488
## Mean :2.622 Mean : 7501
## 3rd Qu.:3.000 3rd Qu.: 8664
## Max. :4.000 Max. :11946
pairs(df.train)
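The absence of missing values is easy to confirm (a minimal check; all counts come back zero for this dataset):
## count NA values per column in the training and test sets
colSums(is.na(df.train))
colSums(is.na(df.test))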
## Check the distributions of all variables, including the dependent variable (price)
par(mfrow=c(2,5))
for(i in 1:10) {
hist(df.train[,i], main=names(df.train)[i])
}
## Check normality and the impact of outcome transformations; with an optimal
## transformation, even plain linear regression might perform well
# Transformation 1
t1.price = log(df.train$price)
# Transformation 2
t2.price = 1/sqrt(df.train$price)
df.price = data.frame(price = df.train$price, log_price = t1.price, sqrt_price = t2.price)
# QQ-plots of the price transformations; points falling on the straight line indicate an approximately normal distribution.
a = ggplot(df.price, aes(sample = price)) +
stat_qq() +
stat_qq_line() +
theme_bw() +
ggtitle("Original")
b = ggplot(df.price, aes(sample = log_price)) +
stat_qq() +
stat_qq_line() +
theme_bw() +
ggtitle("Log-transformed")
c = ggplot(df.price, aes(sample = sqrt_price)) +
stat_qq() +
stat_qq_line() +
theme_bw() +
ggtitle("1/sqrt transformed")
## graph all
ggpubr::ggarrange(a,b,c, nrow = 1, align = "h")
Transform the outcome variable (price) and plot histograms after transformation.
# Transformation 1
df.train$price = log(df.train$price)
par(mfrow=c(2,5))
for(i in 1:10) {
hist(df.train[,i], main=names(df.train)[i])
}
Check the predictor correlations to see whether any variables need to be excluded because of collinearity.
# plot correlation matrix
cordata = cor(df.train[,3:10])
corrplot(cordata, method='square')
# Flag any highly correlated variables (|r| > 0.7). Since there appear to be none, expect an empty result.
highCorr = caret::findCorrelation(cordata, cutoff=0.70)
# Identify variables that are highly correlated
names(df.train[highCorr])
## character(0)
The correlation plot showed that none of the predictors were strongly correlated (no pair with |r| > 0.7), so we did not exclude any variables from the analysis, although an ANOVA suggested that only year, bedrooms, transport.prox, and scl.qlty were significant predictors. Two points were identified as potential outliers or high-leverage points; no points were excluded from the analysis.
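For reference, here is a minimal sketch of the kind of significance check described above (not the exact call we ran, and at this point the categorical predictors are still coded as integers, so it is illustrative only):
# additive linear model on log price, dropping the id column
fit.lm.check = lm(price ~ ., data = df.train[, -1])
# sequential ANOVA table; small p-values flag the influential predictors
anova(fit.lm.check)
# residuals-vs-leverage plot to spot potential outliers / high-leverage points
plot(fit.lm.check, which = 5)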
Today I would recommend a dplyr pipeline (a sketch follows the str() output below), but this is how it was coded back then:
# Remove ID variable from data frame
df.one = df.train[,-1] # take out ID
df.two = df.test[,-1]
## Training set: factorize
df.one$bedrooms = as.factor(df.one$bedrooms)
df.one$scl.qlty = as.factor(df.one$scl.qlty)
df.one$transport.prox = as.factor(df.one$transport.prox)
df.one$crime = as.factor(df.one$crime)
# Test set: factorize
df.two$bedrooms = as.factor(df.two$bedrooms)
df.two$scl.qlty = as.factor(df.two$scl.qlty)
df.two$transport.prox = as.factor(df.two$transport.prox)
df.two$crime = as.factor(df.two$crime)
# View and compare dataframes
str(df.one)
## 'data.frame': 275 obs. of 9 variables:
## $ price : num 13.2 13.7 13.7 13.2 13.4 ...
## $ year : int 2013 2004 2018 2013 2011 2003 2000 2004 2017 2005 ...
## $ bedrooms : Factor w/ 3 levels "2","3","4": 1 1 3 1 1 2 1 2 1 3 ...
## $ transport.prox: Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
## $ sqft : int 1818 2433 1922 1599 2238 1989 1535 1480 1657 1581 ...
## $ year.built : int 1936 1950 1971 1964 1968 1936 1951 1965 1956 1974 ...
## $ scl.qlty : Factor w/ 3 levels "0","1","2": 2 3 2 3 2 2 3 1 1 2 ...
## $ crime : Factor w/ 4 levels "1","2","3","4": 1 2 4 1 3 4 1 3 2 4 ...
## $ lot : int 9818 7874 8911 9032 7844 7766 9625 8056 7595 5601 ...
str(df.two)
## 'data.frame': 75 obs. of 8 variables:
## $ year : int 2009 2017 2016 2008 2016 2017 2009 2000 2004 2009 ...
## $ bedrooms : Factor w/ 3 levels "2","3","4": 3 3 3 1 3 1 1 2 1 2 ...
## $ transport.prox: Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 2 1 1 ...
## $ sqft : int 2929 2706 1526 2704 2033 1463 2985 1604 1643 3131 ...
## $ year.built : int 1974 1972 1936 1974 1939 1964 1951 1949 1947 1966 ...
## $ scl.qlty : Factor w/ 3 levels "0","1","2": 1 2 1 2 1 2 1 3 2 1 ...
## $ crime : Factor w/ 4 levels "1","2","3","4": 3 2 2 2 4 3 4 4 1 2 ...
## $ lot : int 9417 6886 8873 9329 5355 5050 5234 9479 6412 5895 ...
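As mentioned above, a dplyr pipeline would do the same thing more compactly. A rough equivalent (a sketch only, not the code we submitted; it assumes dplyr >= 1.0 for across()):
library(dplyr)
# columns to treat as categorical
factor_cols = c("bedrooms", "scl.qlty", "transport.prox", "crime")
# training set: drop the id column and factorize in one step
df.one = df.train %>%
  select(-id) %>%
  mutate(across(all_of(factor_cols), as.factor))
# test set: same treatment
df.two = df.test %>%
  select(-id) %>%
  mutate(across(all_of(factor_cols), as.factor))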
We used the caret and mlbench packages to explore multiple statistical methods, computing cross-validated R2 and RMSE for each trained model: linear regression, generalized linear regression, penalized linear regression, regression trees, radial support vector machines, KNN, random forest, gradient boosting, and Cubist.
## Fit multiple models to the centered and scaled version of the dataset df.one.
## set control and optimization metric
control = trainControl(method='repeatedcv', number=10, repeats=3)
metric = 'RMSE'
# Linear Regression (LR)
set.seed(44)
fit.lm = train(price~., data = df.one, method='lm', metric = metric,
preProc=c('center', 'scale'), trControl = control)
# Generalized Linear Regression (GLM)
set.seed(44)
fit.glm = train(price~., data = df.one, method='glm', metric = metric,
preProc=c('center', 'scale'), trControl = control)
# Penalized Linear Regression (GLMNET)
set.seed(44)
fit.glmnet = train(price~., data = df.one, method='glmnet', metric = metric,
preProc=c('center', 'scale'), trControl = control)
# Support Vector Machines (SVM)
set.seed(44)
fit.svm = train(price~., data = df.one, method='svmRadial', metric = metric,
preProc=c('center', 'scale','BoxCox'), trControl = control)
# k-Nearest Neighbors (KNN)
set.seed(44)
fit.knn = train(price~., data = df.one, method='knn', metric = metric,
preProc=c('center', 'scale'), trControl = control)
# random forest
set.seed(44)
fit.rf = train(price~., data=df.one, method='ranger', metric=metric,
trControl = control)
# Gradient Boosting Machines (GBM)
set.seed(44)
fit.gbm = train(price~., data=df.one, method='gbm', metric=metric,
trControl = control)
# XGBoost
set.seed(44)
fit.xgb = train(price~., data=df.one, method='xgbTree', metric=metric,
trControl = control)
# Cubist
set.seed(44)
fit.cubist = train(price~., data=df.one, method='cubist', metric=metric,
trControl = control)
# Compare the results of these algorithms
results = resamples(list(lm=fit.lm,
glm=fit.glm,
glmnet=fit.glmnet,
svm=fit.svm,
knn=fit.knn,
rf=fit.rf,
xgb=fit.xgb,
gbm=fit.gbm,
cubist=fit.cubist))
# Summary and Plot of Results
dotplot(results)
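The resampling distributions behind the dotplot can also be printed directly (a small sketch; the columns of results$values follow caret's model~metric naming convention):
# tabulate the cross-validated RMSE, R2 and MAE for each model
summary(results)
# mean cross-validated RMSE per model, sorted from best to worst
sort(sapply(results$values[, grep("RMSE$", names(results$values))], mean))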
Based on this model comparison, the KNN, SVM, GBM, and Cubist models were optimized further; repeated cross-validated parameter tuning was performed for the radial SVM, GBM, XGBoost, and Cubist models.
As a sanity check, we also quickly tuned some of the lower-scoring methods to confirm that no obvious parameter settings we might have missed would move them back into contention for best model.
# Optimize k-nearest neighbors
set.seed(44)
knnGrid = expand.grid(k = 1:25)
# note: this fit uses df.train, which still contains the id column
fit.knn = train(price~., data = df.train, method='knn', metric = metric,
tuneGrid = knnGrid, trControl = control)
plot(fit.knn)
fit.knn$bestTune
## k
## 6 6
# Tune the SVM model to check performance: RADIAL
set.seed(44)
trainControl = trainControl(method='repeatedcv',number=10,repeats=3)
metric = 'RMSE'
svmGrid = expand.grid(.sigma=c(0.005, 0.01,0.015,0.0175,0.02,0.03),
.C = seq(5,40, by= 5))
svmModel = train(price~., data=df.train, method='svmRadial',
preProc=c('center', 'scale', 'BoxCox'), metric=metric, tuneGrid=svmGrid,
trControl=trainControl)
plot(svmModel)
# print the best-performing parameter settings
svmModel$bestTune
## sigma C
## 14 0.01 30
We spent considerably more time tuning the best-performing models: Cubist, GBM, and XGBoost. Even when well tuned, the XGBoost predictions scored lower on the final test set than GBM, but higher than Cubist, as evaluated on submission to Kaggle.
# Further tune the cubist model
set.seed(44)
trainControl = trainControl(method='repeatedcv', number=10, repeats=3)
metric = 'RMSE'
grid = expand.grid(.committees=seq(10,110, by=25), .neighbors=c(5,8,9))
tune.cubist = train(price~., data=df.one, method='cubist', metric=metric,
tuneGrid=grid, trControl=trainControl)
plot(tune.cubist)
set.seed(44)
grid = expand.grid(.committees=seq(34,44,by=1), .neighbors=c(8,9))
tune.cubist = train(price~., data=df.one, method='cubist', metric=metric,
tuneGrid=grid, trControl=trainControl)
plot(tune.cubist)
## best model tuning parameters
tune.cubist$bestTune
## committees neighbors
## 16 41 9
Best tune for CUBIST is committees = 41 and neighbors = 9.
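Since we also submitted a Cubist-only prediction, here is a sketch of how such a submission file can be generated from this fit; caret refits the best tune on the full training set, so predict() already uses committees = 41 and neighbors = 9 (the file name below is illustrative):
# predict log price for the test set and back-transform to dollars
cubist_log_pred = predict(tune.cubist, newdata = df.two)
cubist_out = data.frame(Id = df.test$id, Predicted = round(exp(cubist_log_pred), 0))
write.csv(cubist_out, file = "prediction_cubist.csv", row.names = FALSE)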
# Further tune the XGBoost model
# Define search grid:
xgbGrid = expand.grid(nrounds = c(400,800,1600), # this is n_estimators
max_depth = c(1,2,3),
colsample_bytree = c(0.5, 0.6),
## The values below are default values in the sklearn-api.
eta = c(0.2),
gamma=c(0,1),
min_child_weight = c(1),
subsample = c(0.8,1))
set.seed(44)
xgb_model = train(price~., data=df.one, method = "xgbTree", trControl = control, tuneGrid = xgbGrid)
plot(xgb_model)
## best model tuning parameters
xgb_model$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 3 1600 1 0.2 0 0.5 1 0.8
# Further tune the GBM model
set.seed(44)
grid = expand.grid(interaction.depth = c(1,3),
n.trees = c(150,400,3600,6000),
shrinkage = c(0.01, 0.025, 0.05),
n.minobsinnode = c(2,5,9))
tune.gbm = train(price~., data=df.one, method='gbm', metric=metric,
tuneGrid=grid, trControl=control)
plot(tune.gbm)
## best model tuning parameters
tune.gbm$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 52 6000 1 0.05 2
The best-performing submission was the averaged GBM model, and GBM alone performed better than the average of GBM and Cubist. Tuned model fits were compared by RMSE and R2. The tuned Cubist model had R2 = 0.991 with committees = 41 and neighbors = 9. The winning tuned GBM model had R2 = 0.992 with n.trees = 150, interaction.depth = 3, shrinkage = 0.1, and n.minobsinnode = 10. We then trained GBM models aimed at broad, granular, and intermediate levels of "detail" in the dataset and averaged their predictions; this averaging led to the best-performing model.
## Each of the following models was optimized for its interaction depth and n.minobsinnode;
## for computational reasons, only the optimal value combinations are used below
## broad model -- few trees, broad interaction depth
grid_t3 = expand.grid(interaction.depth = c(3),
n.trees = c(150),
shrinkage = c(0.1),
n.minobsinnode = c(10))
tune.gbm_t3 = train(price~., data=df.one, method='gbm', metric=metric,
tuneGrid=grid_t3, trControl=control)
## granular model -- many trees, narrow interaction depth
grid_t4 = expand.grid(interaction.depth = c(1),
n.trees = c(6000),
shrinkage = c(0.1),
n.minobsinnode = c(2))
tune.gbm_t4 = train(price~., data=df.one, method='gbm', metric=metric,
tuneGrid=grid_t4, trControl=control)
## intermediate model
grid_t6 = expand.grid(interaction.depth = c(1),
n.trees = c(3600),
shrinkage = c(0.1),
n.minobsinnode = c(3))
tune.gbm_t6 = train(price~., data=df.one, method='gbm', metric=metric,
tuneGrid=grid_t6, trControl=control)
## best model tuning parameters
tune.gbm_t3$bestTune
tune.gbm_t4$bestTune
tune.gbm_t6$bestTune
Predicted house prices for the test set were estimated on the log scale and then exponentiated before being saved to a .csv file.
## Average the GBM models into one ensemble prediction
# generate log-scale price predictions for the test set
number_of_models = 3
h_price_m = matrix(data = NA, nrow = dim(df.two)[1], ncol = number_of_models) # initialize empty matrix
h_price_m[,1] = predict(tune.gbm_t3, newdata = df.two)
h_price_m[,2] = predict(tune.gbm_t4, newdata = df.two)
h_price_m[,3] = predict(tune.gbm_t6, newdata = df.two)
# transform predictions back from the log scale to the original price scale
h_price.m = exp(h_price_m)
# combine the model predictions
avg_price.m = rowMeans(h_price.m)
out = data.frame(Id = df.test$id, Predicted = round(avg_price.m,0))
row.names(out) = NULL
## save the predicted value files and best parameters
version = "avg_gbms"
file_name = "prediction"
write.csv(out, file = paste0(file_name, "_", version, ".csv"), row.names = FALSE)
#save the best models
final_model = list(tune.gbm_t3, tune.gbm_t4, tune.gbm_t6)
saveRDS(final_model, "./final_model.rds")
We submitted the .csv file to Kaggle. Each team was allowed 3 submissions per day. We increased our number of daily submissions by first competing as individuals (each with 3 submissions of their own) and later "(e)merging" as a team out of "stealth mode", once we felt that many more submissions were no longer necessary.
The average of the three tuned GBM models (t3, t4, and t6) had the best Private Score (computed on 71% of the data) and the second-best Public Score. Averaging the three tuned models substantially improved the prediction on the private dataset, though not the public score. The tuned GBM models outperformed the XGBoost and Cubist models. For final scoring, we had to choose two of our models; the averaged GBM_t346 model was not an option because we submitted it too close to the challenge deadline. Pro tip: give yourself enough time to submit all models well before the Kaggle cut-off time! (duh!)
Public and private scores for our team
About the gradient boosting machine (GBM) model: “Whereas random forests build an ensemble of deep independent trees, GBMs build an ensemble of shallow and weak successive trees with each tree learning and improving on the previous. When combined, these many weak successive trees produce a powerful “committee” that are often hard to beat with other algorithms.” http://uc-r.github.io/gbm_regression
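To make that concrete, here is a toy sketch of the core boosting idea for squared-error loss (illustrative only, not the gbm package's implementation): each shallow tree is fit to the current residuals, and its shrunken predictions are added to the running fit.
library(rpart)
# toy gradient boosting with depth-1 trees (stumps) on the training data
shrinkage = 0.1
n_trees = 100
pred = rep(mean(df.one$price), nrow(df.one)) # start from the mean prediction
for (m in 1:n_trees) {
  resid_m = df.one$price - pred # current residuals
  stump = rpart(r ~ ., data = cbind(r = resid_m, df.one[, -1]),
                control = rpart.control(maxdepth = 1))
  pred = pred + shrinkage * predict(stump, df.one) # shrunken additive update
}
# in-sample RMSE of the hand-rolled ensemble (for illustration only)
sqrt(mean((df.one$price - pred)^2))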
About the Cubist model: “Cubist is a rule–based model. A tree is grown where the terminal leaves contain linear regression models. These models are based on the predictors used in previous splits. Also, there are intermediate linear models at each step of the tree. A prediction is made using the linear regression model at the terminal node of the tree, but is “smoothed” by taking into account the prediction from the linear model in the previous node of the tree (which also occurs recursively up the tree). The tree is reduced to a set of rules, which initially are paths from the top of the tree to the bottom. Rules are eliminated via pruning and/or combined for simplification. This is explained better in Quinlan (1992).” https://cran.r-project.org/web/packages/Cubist/vignettes/cubist.html
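With caret, the rule set of the tuned Cubist model can be inspected directly, which makes the rule-plus-linear-model structure described above visible (a quick sketch using the fit from earlier):
# print the rules and the per-rule linear models of the tuned Cubist fit
summary(tune.cubist$finalModel)
# caret's variable-importance view of the same fit
varImp(tune.cubist)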
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] xgboost_0.90.0.2 mlbench_2.1-1 caret_6.0-85 lattice_0.20-38
## [5] corrplot_0.84 ggplot2_3.3.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.3 lubridate_1.7.4 class_7.3-15
## [4] digest_0.6.23 ipred_0.9-9 foreach_1.4.7
## [7] R6_2.4.1 ranger_0.12.1 plyr_1.8.5
## [10] stats4_3.6.2 evaluate_0.14 e1071_1.7-3
## [13] pillar_1.4.3 rlang_0.4.7 data.table_1.12.8
## [16] kernlab_0.9-29 rpart_4.1-15 Matrix_1.2-18
## [19] rmarkdown_2.3 labeling_0.3 splines_3.6.2
## [22] gower_0.2.1 stringr_1.4.0 Cubist_0.2.3
## [25] munsell_0.5.0 compiler_3.6.2 xfun_0.12
## [28] pkgconfig_2.0.3 gbm_2.1.5 htmltools_0.4.0
## [31] nnet_7.3-12 tidyselect_1.1.0 tibble_3.0.1
## [34] gridExtra_2.3 prodlim_2019.11.13 codetools_0.2-16
## [37] crayon_1.3.4 dplyr_1.0.2 withr_2.1.2
## [40] ggpubr_0.2.4 MASS_7.3-51.5 recipes_0.1.9
## [43] ModelMetrics_1.2.2.1 grid_3.6.2 nlme_3.1-143
## [46] gtable_0.3.0 lifecycle_0.2.0 magrittr_1.5
## [49] pROC_1.16.1 scales_1.1.0 stringi_1.4.5
## [52] farver_2.0.3 ggsignif_0.6.0 reshape2_1.4.3
## [55] timeDate_3043.102 ellipsis_0.3.0 generics_0.0.2
## [58] vctrs_0.3.4 cowplot_1.0.0 lava_1.6.6
## [61] iterators_1.0.12 tools_3.6.2 glue_1.4.0
## [64] purrr_0.3.3 survival_3.1-8 yaml_2.2.0
## [67] colorspace_1.4-1 knitr_1.28