Kaggle Competition Autumn 2019 (Stanford University, STATS202)

Team Name: CJNS

Kaggle-life

My team and I were fortunate to win the Kaggle competition hosted by Stanford University’s Data Mining course (STATS202) in 2019. The competition drew 430 submissions from 76 teams (135 competitors). We share our approach and code in this post. The source data and challenge details are available here: https://www.kaggle.com/c/stats20219/overview

Interestingly, during the competition phase, we had only the second-best score, but rocketed into first place when our method was evaluated on the final test set. The leading team dropped three places. The averaged predictions of three tuned GBM models achieved the best Private Score (on 71% of the data), and the second best Public Score.

Kaggle scores

In this report, we describe the following:

Team Approach

Team meeting 1: All members had explored the data individually, and we discussed various transformations of the predictors and the outcome variable. We decided to log-transform house price (the outcome). The team agreed that each member would investigate different, diversified methods (KNN, GLM, random forest, etc.) so that predictions could be averaged at the end, and that we would look for additional data. A key focus was optimizing the bias-variance trade-off: don’t overfit the training data, but don’t assume an overly simplistic model either, so that we capture the complexity of the underlying data without fitting the noise.

Team meeting 2: Members had submitted their respective predictions to Kaggle, including averaged model predictions. We discussed how to improve the best performing models (Cubist and GBM) by looking at additional data and transformations. We submitted an improved Cubist prediction, a GBM prediction, and a gradient boosting machine prediction averaged with Cubist. Adding external data (downloaded from the web) did not further improve the best performing model, although it did improve most of the linear models.

Data exploration and wrangling

The provided data had no missing values to worry about (thanks!), so we could focus on variable transformations.

Transformations:

We plotted histograms of all variables, checked normality with QQ-plots, and identified potential transformations and appropriate data types for each variable:

  • price (log), numeric transformed
  • year (order.norm), integer not transformed
  • year.built (order.norm), integer not transformed
  • bedrooms, transport proximity, school quality, and crime were encoded as factors
  • ID was removed from the training and test data
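
As an illustration of the order-norm check mentioned above, here is a minimal sketch. It assumes "order.norm" refers to the ordered-quantile (orderNorm) transform from the bestNormalize package and uses the df.train data frame loaded in the next section; we ultimately left year and year.built untransformed.

## Sketch only: inspect an ordered-quantile transform of a candidate variable
library(bestNormalize)
on.year = orderNorm(df.train$year)                  # fit the ordered-quantile transform
hist(predict(on.year), main = "year (orderNorm)")   # inspect the transformed distribution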

Load the libraries

## This section simplifies R package management: it determines which packages still need to be installed, installs them, and loads everything.
## --- 
## add any required R packages to this list
my_packages = c("ggplot2",
                "corrplot",
                "caret",
                "mlbench",
                "xgboost"
                )

## generates a list of required packages not yet installed
new_packages = my_packages[!(my_packages %in% installed.packages()[,"Package"])]

## install required new packages
if(length(new_packages)) install.packages(new_packages)

## load required packages
lapply(my_packages, require, character.only = TRUE)

Explore the data

## Load and explore the data
path.data = paste(getwd(),sep="")
df.train = read.csv(paste(path.data,"/train.csv", sep = ""), header = T, sep = ",")
df.test = read.csv(paste(path.data,"/test.csv", sep = ""), header = T, sep = ",")

## Explore data distributions
colnames(df.train) = c("id", "price","year","bedrooms", "transport.prox", 
                       "sqft", "year.built", "scl.qlty", "crime", "lot")
colnames(df.test) = c("id","year","bedrooms", "transport.prox", 
                       "sqft", "year.built", "scl.qlty", "crime", "lot")
summary(df.train)
##        id            price              year         bedrooms    
##  Min.   :  1.0   Min.   : 371300   Min.   :2000   Min.   :2.000  
##  1st Qu.: 69.5   1st Qu.: 556050   1st Qu.:2005   1st Qu.:2.000  
##  Median :138.0   Median : 696100   Median :2009   Median :3.000  
##  Mean   :138.0   Mean   : 758570   Mean   :2009   Mean   :2.953  
##  3rd Qu.:206.5   3rd Qu.: 903500   3rd Qu.:2014   3rd Qu.:4.000  
##  Max.   :275.0   Max.   :1859000   Max.   :2018   Max.   :4.000  
##  transport.prox        sqft        year.built      scl.qlty     
##  Min.   :0.0000   Min.   :1100   Min.   :1928   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1508   1st Qu.:1942   1st Qu.:0.0000  
##  Median :0.0000   Median :1944   Median :1958   Median :1.0000  
##  Mean   :0.2764   Mean   :1969   Mean   :1958   Mean   :0.8945  
##  3rd Qu.:1.0000   3rd Qu.:2309   3rd Qu.:1974   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :3376   Max.   :1988   Max.   :2.0000  
##      crime            lot       
##  Min.   :1.000   Min.   : 5020  
##  1st Qu.:2.000   1st Qu.: 6126  
##  Median :2.000   Median : 7488  
##  Mean   :2.622   Mean   : 7501  
##  3rd Qu.:3.000   3rd Qu.: 8664  
##  Max.   :4.000   Max.   :11946
pairs(df.train)

## Check the distributions of all variables, including the dependent variable
par(mfrow=c(2,5))
for(i in 1:10) {
  hist(df.train[,i], main=names(df.train)[i])
}

## Check normality and the impact of variable transformations; with an optimal
## transformation of the outcome, even linear regression might perform well
# Transformation 1
t1.price = log(df.train$price)
# Transformation 2
t2.price = 1/sqrt(df.train$price)
df.price = data.frame(price = df.train$price, log_price = t1.price, sqrt_price = t2.price)

# Plot various price transformations; if the points fall on a straight line, the distribution is approximately normal.
a = ggplot(df.price, aes(sample = price)) +
  stat_qq() + 
  stat_qq_line() +
  theme_bw() + 
  ggtitle("Original")
b = ggplot(df.price, aes(sample = log_price)) +
  stat_qq() + 
  stat_qq_line() +
  theme_bw() + 
  ggtitle("Log-transformed")
c = ggplot(df.price, aes(sample = sqrt_price)) +
  stat_qq() + 
  stat_qq_line() +
  theme_bw() + 
  ggtitle("1/sqrt transformed")
## graph all
ggpubr::ggarrange(a,b,c, nrow = 1, align = "h")

Transform the outcome variable (price) and plot histograms after transformation.

# Transformation 1
df.train$price = log(df.train$price)
par(mfrow=c(2,5))
for(i in 1:10) {
  hist(df.train[,i], main=names(df.train)[i])
}

Check the predictor variable correlations to determine whether any of them need to be excluded because high pairwise correlation would lead to collinearity issues.

# plot correlation matrix
cordata = cor(df.train[,3:10])
corrplot(cordata, method='square')

# Flag any highly correlated variables (pairwise |r| > 0.7). Since there appear to be none, we expect an empty vector.
highCorr = caret::findCorrelation(cordata, cutoff=0.70)

# Identify variables that are highly correlated
names(df.train[highCorr])
## character(0)

Interactions:

A correlation plot showed that none of the independent variables were highly correlated (|r| > 0.7), so we did not exclude any variables from the analysis, although an ANOVA suggested that only year, bedrooms, transport.prox, and scl.qlty were significant predictors. Two points were identified as potential outliers or high-leverage points; no points were excluded from the analysis.
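
The ANOVA referred to above could be reproduced roughly as follows (a sketch on the factorized data frame df.one built in the next section; the exact call is not part of the submitted code):

# Sequential ANOVA on a linear model of log price against all predictors
fit.lm.all = lm(price ~ ., data = df.one)
anova(fit.lm.all)      # F-tests per predictor term
plot(fit.lm.all)       # diagnostic plots flag potential outliers / high-leverage points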

Recode some of the variables

I would recommend using dplyr pipelines these days (a sketch follows after the output below), but this is how it was coded back then:

# Remove ID variable from data frame
df.one = df.train[,-1] # take out ID
df.two = df.test[,-1]

## Training set: factorize
df.one$bedrooms = as.factor(df.one$bedrooms)
df.one$scl.qlty = as.factor(df.one$scl.qlty)
df.one$transport.prox = as.factor(df.one$transport.prox)
df.one$crime = as.factor(df.one$crime)

# Test set: factorize 
df.two$bedrooms = as.factor(df.two$bedrooms)
df.two$scl.qlty = as.factor(df.two$scl.qlty)
df.two$transport.prox = as.factor(df.two$transport.prox)
df.two$crime = as.factor(df.two$crime)

# View and compare dataframes
str(df.one)
## 'data.frame':    275 obs. of  9 variables:
##  $ price         : num  13.2 13.7 13.7 13.2 13.4 ...
##  $ year          : int  2013 2004 2018 2013 2011 2003 2000 2004 2017 2005 ...
##  $ bedrooms      : Factor w/ 3 levels "2","3","4": 1 1 3 1 1 2 1 2 1 3 ...
##  $ transport.prox: Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
##  $ sqft          : int  1818 2433 1922 1599 2238 1989 1535 1480 1657 1581 ...
##  $ year.built    : int  1936 1950 1971 1964 1968 1936 1951 1965 1956 1974 ...
##  $ scl.qlty      : Factor w/ 3 levels "0","1","2": 2 3 2 3 2 2 3 1 1 2 ...
##  $ crime         : Factor w/ 4 levels "1","2","3","4": 1 2 4 1 3 4 1 3 2 4 ...
##  $ lot           : int  9818 7874 8911 9032 7844 7766 9625 8056 7595 5601 ...
str(df.two)
## 'data.frame':    75 obs. of  8 variables:
##  $ year          : int  2009 2017 2016 2008 2016 2017 2009 2000 2004 2009 ...
##  $ bedrooms      : Factor w/ 3 levels "2","3","4": 3 3 3 1 3 1 1 2 1 2 ...
##  $ transport.prox: Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 2 1 1 ...
##  $ sqft          : int  2929 2706 1526 2704 2033 1463 2985 1604 1643 3131 ...
##  $ year.built    : int  1974 1972 1936 1974 1939 1964 1951 1949 1947 1966 ...
##  $ scl.qlty      : Factor w/ 3 levels "0","1","2": 1 2 1 2 1 2 1 3 2 1 ...
##  $ crime         : Factor w/ 4 levels "1","2","3","4": 3 2 2 2 4 3 4 4 1 2 ...
##  $ lot           : int  9417 6886 8873 9329 5355 5050 5234 9479 6412 5895 ...
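
The dplyr version mentioned above would look roughly like this (a sketch assuming dplyr >= 1.0 for across(); it produces the same df.one and df.two):

## Drop the ID column and convert the categorical predictors to factors in one pipeline
library(dplyr)
to_factor = c("bedrooms", "scl.qlty", "transport.prox", "crime")
df.one = df.train %>% select(-id) %>% mutate(across(all_of(to_factor), as.factor))
df.two = df.test  %>% select(-id) %>% mutate(across(all_of(to_factor), as.factor))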

Model selection

We used the “caret” and “mlbench” packages to explore multiple statistical methods, computing R2 and RMSE for each trained model: linear regression, generalized linear regression, penalized linear regression, regression trees, radial support vector machines, KNN, random forest, gradient boosting, and Cubist.

## Fit multiple models to the dataset df.one (the linear, SVM, and KNN models use centered and scaled predictors).
## set control and optimization metric
control = trainControl(method='repeatedcv', number=10, repeats=3)
metric = 'RMSE'

# Linear Regression (LR)
set.seed(44)
fit.lm = train(price~., data = df.one, method='lm', metric = metric, 
                preProc=c('center', 'scale'), trControl = control)  

# Generalized Linear Regression (GLM)
set.seed(44)
fit.glm = train(price~., data = df.one, method='glm', metric = metric, 
                 preProc=c('center', 'scale'), trControl = control)

# Penalized Linear Regression (GLMNET)
set.seed(44)
fit.glmnet = train(price~., data = df.one, method='glmnet', metric = metric, 
                    preProc=c('center', 'scale'), trControl = control)  # 'glmnet' (not 'glm') fits the penalized model

# Support Vector Machines (SVM) 
set.seed(44)
fit.svm = train(price~., data = df.one, method='svmRadial', metric = metric,
                 preProc=c('center', 'scale','BoxCox'), trControl = control)  

# k-Nearest Neighbors (KNN)
set.seed(44)
fit.knn = train(price~., data = df.one, method='knn', metric = metric, 
                    preProc=c('center', 'scale'), trControl = control)

# Random Forest (RF)
set.seed(44)
fit.rf = train(price~., data=df.one, method='ranger', metric=metric, 
              trControl = control)

# Gradient Boosting Machines (GBM) 
set.seed(44)
fit.gbm = train(price~., data=df.one, method='gbm', metric=metric, 
                trControl = control)

# XGBoost
set.seed(44)
fit.xgb = train(price~., data=df.one, method='xgbTree', metric=metric, 
              trControl = control)

# Cubist
set.seed(44)
fit.cubist = train(price~., data=df.one, method='cubist', metric=metric, 
               trControl = control)

# Compare the results of these algorithms
results = resamples(list(lm=fit.lm, 
                         glm=fit.glm,
                         glmnet=fit.glmnet, 
                         svm=fit.svm, 
                         knn=fit.knn,
                         rf=fit.rf, 
                         xgb=fit.xgb,
                         gbm=fit.gbm,
                         cubist=fit.cubist))

# Summary and Plot of Results
dotplot(results)
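
For the numeric RMSE and R2 comparison mentioned above, summary() on the resamples object prints the cross-validated distributions as well (output omitted here):

# Tabulate cross-validated RMSE, R-squared and MAE across all models
summary(results)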

Based on model selection, the KNN, SVM, GBM, and Cubist models were optimized further. Cross-validated parameter tuning was performed for the radial SVM, GBM, XGBoost, and Cubist models.

Model optimization

As a sanity check, we quickly optimized some of the lower-scoring methods to confirm that no obvious parameter settings we might have missed would move them back into contention for the best model.

KNN

# Optimize K-nearest neighbors
set.seed(44)
knnGrid = expand.grid(k = 1:25)
fit.knn = train(price~., data = df.train, method='knn', metric = metric, 
                tuneGrid = knnGrid, trControl = control)
plot(fit.knn)

fit.knn$bestTune
##   k
## 6 6

Support vector machine (radial)

# Tune the SVM model to check performance: RADIAL
set.seed(44)
trainControl = trainControl(method='repeatedcv',number=10,repeats=3)
metric = 'RMSE'
svmGrid = expand.grid(.sigma=c(0.005, 0.01,0.015,0.0175,0.02,0.03), 
                       .C = seq(5,40, by= 5))
svmModel = train(price~., data=df.train, method='svmRadial', 
                preProc=c('center', 'scale', 'BoxCox'), metric=metric, tuneGrid=svmGrid,
                trControl=trainControl)
plot(svmModel)

# print the best performing parameter settings
svmModel$bestTune
##    sigma  C
## 14  0.01 30

We spent considerably more time tuning the best performing models: Cubist, GBM, and XGBoost. Even when well tuned, the XGBoost predictions achieved a lower score on the final test set than GBM, but a higher one than Cubist, as evaluated upon submission to Kaggle.

Cubist model

# Further tune the cubist model
set.seed(44)
trainControl = trainControl(method='repeatedcv', number=10, repeats=3)
metric = 'RMSE'
grid = expand.grid(.committees=seq(10,110, by=25), .neighbors=c(5,8,9))
tune.cubist = train(price~., data=df.one, method='cubist', metric=metric,
              tuneGrid=grid, trControl=trainControl)
plot(tune.cubist)

set.seed(44)
grid = expand.grid(.committees=seq(34,44,by=1), .neighbors=c(8,9))
tune.cubist = train(price~., data=df.one, method='cubist', metric=metric,
              tuneGrid=grid, trControl=trainControl)
plot(tune.cubist)

## best model tuning parameters
tune.cubist$bestTune
##    committees neighbors
## 16         41         9

Best tune for CUBIST is committees = 41 and neighbors = 9.

XGBoost model

# Further tune the XGBoost model
# Define search grid:
xgbGrid = expand.grid(nrounds = c(400,800,1600),  # this is n_estimators 
                       max_depth = c(1,2,3),
                       colsample_bytree = seq(0.5, 0.6),  # note: with seq's default step of 1, this evaluates only 0.5
                       ## The values below are default values in the sklearn-api. 
                       eta = c(0.2),
                       gamma=c(0,1),
                       min_child_weight = c(1),
                       subsample = c(0.8,1))
xgb_model = train(price~., data=df.one, method = "xgbTree", trControl = control, tuneGrid = xgbGrid)
plot(xgb_model)

## best model tuning parameters
xgb_model$bestTune
##   nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 3    1600         1 0.2     0              0.5                1       0.8

Gradient boosting machine (GBM) model

# Further tune the GBM model
set.seed(44)
grid = expand.grid(interaction.depth = c(1,3), 
                   n.trees = c(150,400,3600,6000), 
                   shrinkage = c(0.01, 0.025, 0.05),
                   n.minobsinnode = c(2,5,9))
tune.gbm = train(price~., data=df.one, method='gbm', metric=metric,
              tuneGrid=grid, trControl=control)
plot(tune.gbm)

## best model tuning parameters
tune.gbm$bestTune
##    n.trees interaction.depth shrinkage n.minobsinnode
## 52    6000                 1      0.05              2

The best performing submission was the averaged GBM model, and GBM alone performed better than the averaged GBM and Cubist models. Tuned model fits were compared by RMSE and R2. The tuned Cubist model had R2 = 0.991 and used committees = 41 and neighbors = 9. The winning tuned GBM model had R2 = 0.992 and used n.trees = 150, interaction.depth = 3, shrinkage = 0.1, and n.minobsinnode = 10. We then explored training GBM models towards broad, granular, and intermediate “detail” in the dataset, and subsequently averaged the predictions. The latter led to the best performing model.

Multiple tuned GBM models

## Each of the following models was optimized for its interaction depth and n.minobsinnode;
## for computational reasons, only the optimal value combinations are used below

## broad model -- few trees, broad interaction depth
grid_t3 = expand.grid(interaction.depth = c(3), 
                   n.trees = c(150), 
                   shrinkage = c(0.1),
                   n.minobsinnode = c(10))
tune.gbm_t3 = train(price~., data=df.one, method='gbm', metric=metric,
              tuneGrid=grid_t3, trControl=control)

## granular model -- many trees, narrow interaction depth
grid_t4 = expand.grid(interaction.depth = c(1), 
                   n.trees = c(6000), 
                   shrinkage = c(0.1),
                   n.minobsinnode = c(2))
tune.gbm_t4 = train(price~., data=df.one, method='gbm', metric=metric,
              tuneGrid=grid_t4, trControl=control)

## intermediate model 
grid_t6 = expand.grid(interaction.depth = c(1), 
                   n.trees = c(3600), 
                   shrinkage = c(0.1),
                   n.minobsinnode = c(3))
tune.gbm_t6 = train(price~., data=df.one, method='gbm', metric=metric,
              tuneGrid=grid_t6, trControl=control)

## best model tuning parameters
tune.gbm_t3$bestTune
tune.gbm_t4$bestTune
tune.gbm_t6$bestTune

Predicted house prices for the test set were estimated on the log scale and then exponentiated prior to saving as a .csv file.

## Average the GBM models into one ensemble prediction file
# apply models to log-transformed data
number_of_models = 3 
h_price_m = matrix(data = NA, nrow = dim(df.two)[1], ncol = number_of_models) # initialize empty matrix
h_price_m[,1] = predict(tune.gbm_t3, newdata = df.two) 
h_price_m[,2] = predict(tune.gbm_t4, newdata = df.two)
h_price_m[,3] = predict(tune.gbm_t6, newdata = df.two)

# transform predictions back from the log scale to the original scale
h_price.m = exp(h_price_m)

# combine the model predictions
avg_price.m = rowMeans(h_price.m)
out = data.frame(Id = df.test$id, Predicted = round(avg_price.m,0))
row.names(out) = NULL

## save the predicted value files and best parameters 
version = "avg_gbms"
file_name = "prediction"
write.csv(out, file = paste0(file_name,"_",version,".csv"), row.names = FALSE)

#save the best models
final_model = list(tune.gbm_t3, tune.gbm_t4, tune.gbm_t6)
saveRDS(final_model, "./final_model.rds")

Kaggle submissions

We submitted the .csv file to Kaggle. Our team was allowed 3 submissions per day. We increased our number of daily submissions by first acting as individuals (each with 3 submissions per day) and later “(e)merging” as a team out of “stealth mode”, once we felt that many more submissions would not be necessary.

A word on the final scores

The average of the three tuned GBM models t3, t4, and t6 had the best Private Score (on 71% of the data) and the second-best Public Score. In this case, averaging three tuned models substantially improved the prediction for the Private dataset, but not the Public Score. The tuned GBM models outperformed the XGBoost and Cubist models. For the final scoring of the challenge, we had to choose two of our models. The GBM_t346 model was not an option because we submitted it too close to the challenge deadline. “Pro tip”: give yourself enough time to submit all models well before the Kaggle cut-off time! (duh!)

Public and private scores for our team

How GBM and Cubist models work

About the gradient boosting machine (GBM) model: “Whereas random forests build an ensemble of deep independent trees, GBMs build an ensemble of shallow and weak successive trees with each tree learning and improving on the previous. When combined, these many weak successive trees produce a powerful “committee” that are often hard to beat with other algorithms.” http://uc-r.github.io/gbm_regression
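
To make the “many shallow, successive trees” idea concrete, here is a minimal standalone gbm sketch (illustrative only; the hyperparameter values are arbitrary and are not our tuned settings):

library(gbm)
set.seed(44)
# Stumps (interaction.depth = 1) are added sequentially, each fit to the residuals
# of the ensemble so far; shrinkage damps the contribution of every new tree
gbm.demo = gbm(price ~ ., data = df.one, distribution = "gaussian",
               n.trees = 2000, interaction.depth = 1, shrinkage = 0.01,
               n.minobsinnode = 5)
# Estimate how many boosting iterations are actually useful (out-of-bag criterion)
best.iter = gbm.perf(gbm.demo, method = "OOB")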

About the Cubist model: “Cubist is a rule–based model. A tree is grown where the terminal leaves contain linear regression models. These models are based on the predictors used in previous splits. Also, there are intermediate linear models at each step of the tree. A prediction is made using the linear regression model at the terminal node of the tree, but is “smoothed” by taking into account the prediction from the linear model in the previous node of the tree (which also occurs recursively up the tree). The tree is reduced to a set of rules, which initially are paths from the top of the tree to the bottom. Rules are eliminated via pruning and/or combined for simplification. This is explained better in Quinlan (1992).” https://cran.r-project.org/web/packages/Cubist/vignettes/cubist.html
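
Similarly, a minimal Cubist sketch (committees behave like boosting-style iterations over rule sets, and neighbors adds instance-based smoothing at prediction time; the values here are illustrative, not our tuned settings):

library(Cubist)
set.seed(44)
x = df.one[, setdiff(names(df.one), "price")]     # predictors
y = df.one$price                                  # log-transformed outcome
cub.demo = cubist(x = x, y = y, committees = 10)  # rule-based model with linear-model leaves
summary(cub.demo)                                 # prints the rules and their linear models
# neighbors adjusts each prediction using similar training points
pred.demo = predict(cub.demo, newdata = df.two, neighbors = 9)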

Package versions

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] xgboost_0.90.0.2 mlbench_2.1-1    caret_6.0-85     lattice_0.20-38 
## [5] corrplot_0.84    ggplot2_3.3.0   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.3           lubridate_1.7.4      class_7.3-15        
##  [4] digest_0.6.23        ipred_0.9-9          foreach_1.4.7       
##  [7] R6_2.4.1             ranger_0.12.1        plyr_1.8.5          
## [10] stats4_3.6.2         evaluate_0.14        e1071_1.7-3         
## [13] pillar_1.4.3         rlang_0.4.7          data.table_1.12.8   
## [16] kernlab_0.9-29       rpart_4.1-15         Matrix_1.2-18       
## [19] rmarkdown_2.3        labeling_0.3         splines_3.6.2       
## [22] gower_0.2.1          stringr_1.4.0        Cubist_0.2.3        
## [25] munsell_0.5.0        compiler_3.6.2       xfun_0.12           
## [28] pkgconfig_2.0.3      gbm_2.1.5            htmltools_0.4.0     
## [31] nnet_7.3-12          tidyselect_1.1.0     tibble_3.0.1        
## [34] gridExtra_2.3        prodlim_2019.11.13   codetools_0.2-16    
## [37] crayon_1.3.4         dplyr_1.0.2          withr_2.1.2         
## [40] ggpubr_0.2.4         MASS_7.3-51.5        recipes_0.1.9       
## [43] ModelMetrics_1.2.2.1 grid_3.6.2           nlme_3.1-143        
## [46] gtable_0.3.0         lifecycle_0.2.0      magrittr_1.5        
## [49] pROC_1.16.1          scales_1.1.0         stringi_1.4.5       
## [52] farver_2.0.3         ggsignif_0.6.0       reshape2_1.4.3      
## [55] timeDate_3043.102    ellipsis_0.3.0       generics_0.0.2      
## [58] vctrs_0.3.4          cowplot_1.0.0        lava_1.6.6          
## [61] iterators_1.0.12     tools_3.6.2          glue_1.4.0          
## [64] purrr_0.3.3          survival_3.1-8       yaml_2.2.0          
## [67] colorspace_1.4-1     knitr_1.28