This document summarizes the solution to the Santander Customer Satisfaction competition hosted by Kaggle in winter-spring 2016. See details of the competition and download the files at https://www.kaggle.com/c/santander-customer-satisfaction.

The final solution was based on several XGBoost models with some pre-processing of the data.

Import and pre-processing

The data files are quite large, so we will use data.table::fread to read them in. We will remove the ID variable and extract the TARGET variable to avoid accidental leakage.

library(xgboost)
library(Matrix)
require(MatrixModels)
## Loading required package: MatrixModels
library(data.table)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
require(rms)
## Loading required package: rms
## Loading required package: Hmisc
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
library(vtreat)


rm(list=ls())

train <- fread("./input/train.csv", integer64 = "numeric", data.table = FALSE)
## Read 13.2% of 76020 rows
## Read 52.6% of 76020 rows
## Read 76020 rows and 371 (of 371) columns from 0.055 GB file in 00:00:05
test <- fread("./input/test.csv", integer64 = "numeric", data.table = FALSE)
## Read 52.8% of 75818 rows
## Read 75818 rows and 370 (of 370) columns from 0.055 GB file in 00:00:04
#str(train); summary(train)

##### Removing IDs
train$ID <- NULL
test.id <- test$ID
test$ID <- NULL

##### Extracting TARGET
train.y <- train$TARGET
train$TARGET <- NULL

The data seems to be very sparse. In such cases NAs and zero values may carry additional information, especially since the TARGET variable is related to customer satisfaction. Looking closer at the data, we see that some of the variables look like ordinal values that are multiples of 3. We will count, for each row, how many features are zero and how many are divisible by 3.

##### 0 count per line
count0 <- function(x) {
  sum(x == 0, na.rm = T)
}

count3mod <- function(x) {
  sum(x %% 3 == 0, na.rm = T)
}

train$n0 <- apply(train, 1, FUN=count0) 
test$n0 <- apply(test, 1, FUN=count0)

# note: n0 has just been added, so it is itself included in the mod3 count
train$mod3 <- apply(train, 1, FUN=count3mod) 
test$mod3 <- apply(test, 1, FUN=count3mod)

Outliers

Quickly looking at the histograms, we notice that some variables contain extreme values (like 99 or -999999), which are likely placeholders for missing values. Both var3 and var36 seem to be categorical; we will convert them to factors later.

Also, var38 has a special mode of 117310.979016494, which looks like a mean or median value filled in instead of missing values.
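
A placeholder like this can be spotted with a quick frequency count; a minimal sketch:

# the most frequent values of var38 -- the top one is the suspected fill-in
head(sort(table(train$var38), decreasing = TRUE), 3)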

Let’s reintroduce the missing values so we can deal with them in a consistent manner later.

Once the mode value is removed, var38 looks near-perfect. The shape of the distribution suggests that it might be net worth, income, or another measure of monetary value.

train[train$var3==-999999, "var3"] <- NA
test[test$var3==-999999, "var3"] <- NA

train[train$var36==99, "var36"] <- NA
test[test$var36==99, "var36"] <- NA

train[train$var38==117310.979016494, "var38"] <- NA
test[test$var38==117310.979016494, "var38"] <- NA

hist(log1p(train$var38), 100)

As a result of extensive EDA, the following variable map was built: Santander Feature Map

The map above illustrates the relationships between the variables (aggregation) and the grouping of variables by the type of activity they likely describe. Since the dataset is semi-anonymized it is difficult to be certain, but the hypothesis was that the variable group rolling up to var30 describes cash products, the variables rolling up to var01 relate to card products, and those rolling up to var31 are likely loan products. The nature of the other groups of variables is less clear.
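
Such aggregation hypotheses can be spot-checked by comparing a candidate "parent" variable against the row-sum of its suspected "children"; a minimal sketch with hypothetical groupings (the actual map came from manual EDA):

# hypothetical spot-check: the children below are illustrative guesses,
# not the groupings from the actual feature map
children <- c("saldo_var5", "saldo_var8")
isTRUE(all.equal(train$saldo_var30, rowSums(train[, children])))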

We will now remove constant and identical features to reduce dimensions of the dataset before further processing.

##### Removing constant features
cat("\n## Removing the constants features.\n")
## 
## ## Removing the constants features.
for (f in names(train)) {
  if (length(unique(train[[f]])) == 1) {
    cat(f, "is constant in train. We delete it.\n")
    train[[f]] <- NULL
    test[[f]] <- NULL
  }
}
## ind_var2_0 is constant in train. We delete it.
## ind_var2 is constant in train. We delete it.
## ind_var27_0 is constant in train. We delete it.
## ind_var28_0 is constant in train. We delete it.
## ind_var28 is constant in train. We delete it.
## ind_var27 is constant in train. We delete it.
## ind_var41 is constant in train. We delete it.
## ind_var46_0 is constant in train. We delete it.
## ind_var46 is constant in train. We delete it.
## num_var27_0 is constant in train. We delete it.
## num_var28_0 is constant in train. We delete it.
## num_var28 is constant in train. We delete it.
## num_var27 is constant in train. We delete it.
## num_var41 is constant in train. We delete it.
## num_var46_0 is constant in train. We delete it.
## num_var46 is constant in train. We delete it.
## saldo_var28 is constant in train. We delete it.
## saldo_var27 is constant in train. We delete it.
## saldo_var41 is constant in train. We delete it.
## saldo_var46 is constant in train. We delete it.
## imp_amort_var18_hace3 is constant in train. We delete it.
## imp_amort_var34_hace3 is constant in train. We delete it.
## imp_reemb_var13_hace3 is constant in train. We delete it.
## imp_reemb_var33_hace3 is constant in train. We delete it.
## imp_trasp_var17_out_hace3 is constant in train. We delete it.
## imp_trasp_var33_out_hace3 is constant in train. We delete it.
## num_var2_0_ult1 is constant in train. We delete it.
## num_var2_ult1 is constant in train. We delete it.
## num_reemb_var13_hace3 is constant in train. We delete it.
## num_reemb_var33_hace3 is constant in train. We delete it.
## num_trasp_var17_out_hace3 is constant in train. We delete it.
## num_trasp_var33_out_hace3 is constant in train. We delete it.
## saldo_var2_ult1 is constant in train. We delete it.
## saldo_medio_var13_medio_hace3 is constant in train. We delete it.
##### Removing identical features
features_pair <- combn(names(train), 2, simplify = F)
toRemove <- c()
for(pair in features_pair) {
  f1 <- pair[1]
  f2 <- pair[2]
  
  if (!(f1 %in% toRemove) & !(f2 %in% toRemove)) {
    if (all(train[[f1]] == train[[f2]])) {
      cat(f1, "and", f2, "are identical.\n")
      toRemove <- c(toRemove, f2)
    }
  }
}
## ind_var6_0 and ind_var29_0 are identical.
## ind_var6 and ind_var29 are identical.
## ind_var13_medio_0 and ind_var13_medio are identical.
## ind_var18_0 and ind_var18 are identical.
## ind_var26_0 and ind_var26 are identical.
## ind_var25_0 and ind_var25 are identical.
## ind_var32_0 and ind_var32 are identical.
## ind_var34_0 and ind_var34 are identical.
## ind_var37_0 and ind_var37 are identical.
## ind_var40 and ind_var39 are identical.
## num_var6_0 and num_var29_0 are identical.
## num_var6 and num_var29 are identical.
## num_var13_medio_0 and num_var13_medio are identical.
## num_var18_0 and num_var18 are identical.
## num_var26_0 and num_var26 are identical.
## num_var25_0 and num_var25 are identical.
## num_var32_0 and num_var32 are identical.
## num_var34_0 and num_var34 are identical.
## num_var37_0 and num_var37 are identical.
## num_var40 and num_var39 are identical.
## saldo_var6 and saldo_var29 are identical.
## saldo_var13_medio and saldo_medio_var13_medio_ult1 are identical.
## delta_imp_reemb_var13_1y3 and delta_num_reemb_var13_1y3 are identical.
## delta_imp_reemb_var17_1y3 and delta_num_reemb_var17_1y3 are identical.
## delta_imp_reemb_var33_1y3 and delta_num_reemb_var33_1y3 are identical.
## delta_imp_trasp_var17_in_1y3 and delta_num_trasp_var17_in_1y3 are identical.
## delta_imp_trasp_var17_out_1y3 and delta_num_trasp_var17_out_1y3 are identical.
## delta_imp_trasp_var33_in_1y3 and delta_num_trasp_var33_in_1y3 are identical.
## delta_imp_trasp_var33_out_1y3 and delta_num_trasp_var33_out_1y3 are identical.
feature.names <- setdiff(names(train), toRemove)

train <- train[, feature.names]
test <- test[, feature.names]

Missing values

Dealing with missing values can be tricky due to the danger of overfitting. In this competition we will use the excellent vtreat package by Win-Vector LLC to clean and normalize the variables. More information about the package is available in the reference manual and vignettes on CRAN.

First of all, we will reintroduce the label variable and mark some of the features as categorical so that vtreat applies the proper method of imputation/processing.

train$TARGET <- train.y

# Make some variable categorical

train$var3 <- as.factor(train$var3)
test$var3 <- as.factor(test$var3)

train$var36 <- as.factor(train$var36)
test$var36 <- as.factor(test$var36)

library(parallel)
no_cores <- detectCores()-1 # Calculate the number of cores
cl <- makeCluster(no_cores) # Initiate cluster

##### VTREAT ########
#set.seed(1234)
#treatmentsC <- designTreatmentsC(train,colnames(train),'TARGET',1, verbose=F)

#dTrainCTreated <- prepare(treatmentsC,train,pruneSig=0.5,scale=FALSE)
#dTestCTreated <- prepare(treatmentsC,test,pruneSig=0.5,scale=FALSE)

set.seed(1234)
prep <- vtreat::mkCrossFrameCExperiment(dframe=train, varlist=colnames(train),outcomename = "TARGET",
                                        outcometarget=1, rareCount=2, scale = F,
                                         ncross = 5, parallelCluster=cl)
dTrainCTreated <- prep$crossFrame
treatments <- prep$treatments
treatmentsSF <- treatments$scoreFrame

dTestCTreated <- vtreat::prepare(treatments,test,pruneSig=0.5,scale=F)
## Warning in vtreat::prepare(treatments, test, pruneSig = 0.5, scale = F):
## variable imp_aport_var33_hace3 expected type/class integer integer saw
## double numeric
## Warning in vtreat::prepare(treatments, test, pruneSig = 0.5, scale = F):
## variable imp_aport_var33_ult1 expected type/class integer integer saw
## double numeric
# limit train columns to those found significant by the prepare function at the given threshold, plus the TARGET variable
dTrainCTreated <- dTrainCTreated[, c(colnames(dTestCTreated), "TARGET")]

train <- dTrainCTreated
test <- dTestCTreated

rm(dTrainCTreated, dTestCTreated, treatments, treatmentsSF)
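
The parallel cluster is no longer needed once the vtreat step is complete, so we can release it:

parallel::stopCluster(cl)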

We will now build a simple XGBoost model that we can cross-validate. The data is quite noisy and the model is difficult to tune. This competition is evaluated using the area under the ROC curve ("auc"), but we will also use the other classification metrics available through the rms package to calculate, among other things, the Brier score. We will use these metrics for more reliable cross-validation.

See the rms reference on CRAN for details of the val.prob function.
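
For reference, the Brier score reported by val.prob is simply the mean squared difference between the predicted probabilities and the 0/1 outcomes:

# Brier score: mean squared error of predicted probabilities vs. 0/1 labels
brier <- function(pred, y) mean((pred - y)^2)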

trainLGOCVxgb_C <- function(train, valid){
  train.y <- train[, 'TARGET']
  valid.y <- valid[, 'TARGET']
  
  # Matrix
  train <- sparse.model.matrix(TARGET ~ .-1, data = train)
  valid <- sparse.model.matrix(TARGET ~ .-1, data = valid)

  dtrain <- xgb.DMatrix(train, label = train.y)
  dvalid <- xgb.DMatrix(valid, label= valid.y)
  
  watchlist <- list(valid = dvalid, train = dtrain)
  
  
  params <- list(booster = "gbtree", objective = "binary:logistic", eval_metric = "auc"
                 , max_depth = 5
                 , eta = 0.04 
                 , colsample_bytree = 0.75
                 , subsample = 0.75
                )
  
  set.seed(1234)
  model <- xgb.train(params = params
                     , data = dtrain
                     , nrounds = 560
                     , verbose = 1
                     , early.stop.round = 40
                     , watchlist = watchlist
                     , print.every.n = 20
  )
  
  pred <- predict(model, dvalid)
  model$valprob <- rms::val.prob(pred, valid.y)
  
  return(model)
} 

A similar function is built for making the final predictions. The number of rounds is chosen based on the cross-validated early stopping.

trainPredxgb_C <- function(train){
  train.y <- train[, 'TARGET']
  # Matrix
  train <- sparse.model.matrix(TARGET ~ .-1, data = train)
  dtrain <- xgb.DMatrix(data=train, label = train.y)
  
  watchlist <- list(train = dtrain)
  
  params <- list(booster = "gbtree", objective = "binary:logistic", eval_metric = "auc"
                 , max_depth = 5
                 , eta = 0.04 
                 , colsample_bytree = 0.75
                 , subsample = 0.75
                )
  
  set.seed(1234)
  model <- xgb.train(params = params
                     , data = dtrain
                     , nrounds = 300
                     , verbose = 1
                     , watchlist = watchlist
                     , print.every.n = 20
                    )
    
  return(model)
}  

We will then proceed to model customer satisfaction (the TARGET variable) through 10-fold cross-validation. We will collect the classification measures from the folds to trace the mean and SD of the scores across the resampling rounds.

# perform cross-validation
 
  set.seed(120)
  k=10 #7
  t=1 #2
  folds <- createMultiFolds(train[, "TARGET"], k = k, times=t)
  r <- k*t
  CVpred <- NULL
  CVlabels <- NULL
  CVvalprob <- matrix(NA, nrow=r, ncol=17) # one row per resampling round; val.prob returns 17 statistics

  for (i in 1:r) { # start CV
    cat("\nResampling round", i,"\n")
    m2m <- trainLGOCVxgb_C(train=train[folds[[i]],] , valid=train[-folds[[i]],])
    CVvalprob[i,] <- m2m$valprob
    colnames(CVvalprob) <- names(m2m$valprob)
  } # end of CV
## 
## Resampling round 1
## Warning in xgb.train(params = params, data = dtrain, nrounds = 560, verbose
## = 1, : Only the first data set in watchlist is used for early stopping
## process.
## [0]  valid-auc:0.798390  train-auc:0.812304
## [20] valid-auc:0.827746  train-auc:0.840313
## [40] valid-auc:0.832511  train-auc:0.846894
## [60] valid-auc:0.838726  train-auc:0.853556
## [80] valid-auc:0.842743  train-auc:0.858556
## [100]    valid-auc:0.845698  train-auc:0.864177
## [120]    valid-auc:0.846751  train-auc:0.869149
## [140]    valid-auc:0.847907  train-auc:0.873286
## [160]    valid-auc:0.848563  train-auc:0.876437
## [180]    valid-auc:0.849196  train-auc:0.879360
## [200]    valid-auc:0.849258  train-auc:0.882344
## [220]    valid-auc:0.849378  train-auc:0.885014
## [240]    valid-auc:0.849562  train-auc:0.887074
## [260]    valid-auc:0.849767  train-auc:0.889517
## [280]    valid-auc:0.849651  train-auc:0.891901
## Stopping. Best iteration: 249
## 
## Resampling round 2
## Warning in xgb.train(params = params, data = dtrain, nrounds = 560, verbose
## = 1, : Only the first data set in watchlist is used for early stopping
## process.

## [0]  valid-auc:0.779119  train-auc:0.815678
## [20] valid-auc:0.810477  train-auc:0.846187
## [40] valid-auc:0.810584  train-auc:0.851360
## [60] valid-auc:0.815591  train-auc:0.857757
## [80] valid-auc:0.818495  train-auc:0.862582
## [100]    valid-auc:0.819811  train-auc:0.867476
## [120]    valid-auc:0.820694  train-auc:0.871634
## [140]    valid-auc:0.821022  train-auc:0.875537
## [160]    valid-auc:0.820999  train-auc:0.879030
## [180]    valid-auc:0.820733  train-auc:0.882312
## Stopping. Best iteration: 145
## 
## Resampling round 3
## Warning in xgb.train(params = params, data = dtrain, nrounds = 560, verbose
## = 1, : Only the first data set in watchlist is used for early stopping
## process.

## [0]  valid-auc:0.806939  train-auc:0.815943
## [20] valid-auc:0.835925  train-auc:0.841134
## [40] valid-auc:0.837040  train-auc:0.848050
## [60] valid-auc:0.841376  train-auc:0.853647
## [80] valid-auc:0.843213  train-auc:0.859292
## [100]    valid-auc:0.843224  train-auc:0.864974
## [120]    valid-auc:0.844314  train-auc:0.869554
## [140]    valid-auc:0.843719  train-auc:0.873451
## [160]    valid-auc:0.843800  train-auc:0.877286
## Stopping. Best iteration: 121
## 
## Resampling round 4
## Warning in xgb.train(params = params, data = dtrain, nrounds = 560, verbose
## = 1, : Only the first data set in watchlist is used for early stopping
## process.

## [0]  valid-auc:0.784890  train-auc:0.812542
## [20] valid-auc:0.819146  train-auc:0.842661
## [40] valid-auc:0.818752  train-auc:0.847750
## [60] valid-auc:0.825504  train-auc:0.854882
## [80] valid-auc:0.829247  train-auc:0.860129
## [100]    valid-auc:0.832721  train-auc:0.865246
## [120]    valid-auc:0.834451  train-auc:0.869740
## [140]    valid-auc:0.836093  train-auc:0.874329
## [160]    valid-auc:0.837631  train-auc:0.877452
## [180]    valid-auc:0.837950  train-auc:0.880554
## [200]    valid-auc:0.837873  train-auc:0.883378
## [220]    valid-auc:0.837902  train-auc:0.885463
## [240]    valid-auc:0.837863  train-auc:0.888463
## Stopping. Best iteration: 204
## 
## Resampling round 5
## Warning in xgb.train(params = params, data = dtrain, nrounds = 560, verbose
## = 1, : Only the first data set in watchlist is used for early stopping
## process.

## [0]  valid-auc:0.814316  train-auc:0.809990
## [20] valid-auc:0.840854  train-auc:0.839184
## [40] valid-auc:0.847417  train-auc:0.847449
## [60] valid-auc:0.848482  train-auc:0.853322
## [80] valid-auc:0.851602  train-auc:0.858589
## [100]    valid-auc:0.852320  train-auc:0.863457
## [120]    valid-auc:0.853293  train-auc:0.867730
## [140]    valid-auc:0.854684  train-auc:0.871549
## [160]    valid-auc:0.855598  train-auc:0.875628
## [180]    valid-auc:0.855792  train-auc:0.879188
## [200]    valid-auc:0.856465  train-auc:0.882226
## [220]    valid-auc:0.857304  train-auc:0.884958
## [240]    valid-auc:0.858117  train-auc:0.887320
## [260]    valid-auc:0.858596  train-auc:0.889450
## [280]    valid-auc:0.858987  train-auc:0.891753
## [300]    valid-auc:0.859531  train-auc:0.893647
## [320]    valid-auc:0.859490  train-auc:0.895716
## [340]    valid-auc:0.859625  train-auc:0.897530
## [360]    valid-auc:0.860391  train-auc:0.899594
## [380]    valid-auc:0.860532  train-auc:0.901476
## [400]    valid-auc:0.860337  train-auc:0.903417
## Stopping. Best iteration: 380
## 
## Resampling round 6
## Warning in xgb.train(params = params, data = dtrain, nrounds = 560, verbose
## = 1, : Only the first data set in watchlist is used for early stopping
## process.

## [0]  valid-auc:0.800968  train-auc:0.811410
## [20] valid-auc:0.827667  train-auc:0.842045
## [40] valid-auc:0.833039  train-auc:0.849013
## [60] valid-auc:0.839367  train-auc:0.855348
## [80] valid-auc:0.839773  train-auc:0.860723
## [100]    valid-auc:0.839713  train-auc:0.865577
## [120]    valid-auc:0.840553  train-auc:0.869804
## [140]    valid-auc:0.841513  train-auc:0.873576
## [160]    valid-auc:0.842456  train-auc:0.877239
## [180]    valid-auc:0.842921  train-auc:0.880588
## [200]    valid-auc:0.842224  train-auc:0.883314
## [220]    valid-auc:0.841925  train-auc:0.886193
## Stopping. Best iteration: 181
## 
## Resampling round 7
## Warning in xgb.train(params = params, data = dtrain, nrounds = 560, verbose
## = 1, : Only the first data set in watchlist is used for early stopping
## process.

## [0]  valid-auc:0.803594  train-auc:0.807368
## [20] valid-auc:0.831374  train-auc:0.841771
## [40] valid-auc:0.835528  train-auc:0.847556
## [60] valid-auc:0.839605  train-auc:0.853725
## [80] valid-auc:0.842495  train-auc:0.859265
## [100]    valid-auc:0.843390  train-auc:0.864893
## [120]    valid-auc:0.844821  train-auc:0.869286
## [140]    valid-auc:0.845229  train-auc:0.873536
## [160]    valid-auc:0.845331  train-auc:0.876599
## [180]    valid-auc:0.845947  train-auc:0.880291
## [200]    valid-auc:0.846009  train-auc:0.883495
## Stopping. Best iteration: 173
## 
## Resampling round 8
## Warning in xgb.train(params = params, data = dtrain, nrounds = 560, verbose
## = 1, : Only the first data set in watchlist is used for early stopping
## process.

## [0]  valid-auc:0.816285  train-auc:0.811096
## [20] valid-auc:0.834251  train-auc:0.841090
## [40] valid-auc:0.838270  train-auc:0.847343
## [60] valid-auc:0.841923  train-auc:0.855037
## [80] valid-auc:0.844856  train-auc:0.861027
## [100]    valid-auc:0.844697  train-auc:0.866154
## [120]    valid-auc:0.844838  train-auc:0.870411
## [140]    valid-auc:0.845198  train-auc:0.874319
## [160]    valid-auc:0.845333  train-auc:0.877221
## [180]    valid-auc:0.846205  train-auc:0.880337
## [200]    valid-auc:0.846576  train-auc:0.883735
## [220]    valid-auc:0.846268  train-auc:0.886435
## Stopping. Best iteration: 195
## 
## Resampling round 9
## Warning in xgb.train(params = params, data = dtrain, nrounds = 560, verbose
## = 1, : Only the first data set in watchlist is used for early stopping
## process.

## [0]  valid-auc:0.806238  train-auc:0.817467
## [20] valid-auc:0.825399  train-auc:0.842917
## [40] valid-auc:0.825904  train-auc:0.848433
## [60] valid-auc:0.831145  train-auc:0.855515
## [80] valid-auc:0.833601  train-auc:0.860764
## [100]    valid-auc:0.835537  train-auc:0.866775
## [120]    valid-auc:0.835919  train-auc:0.870838
## [140]    valid-auc:0.835534  train-auc:0.874806
## [160]    valid-auc:0.835287  train-auc:0.877899
## Stopping. Best iteration: 129
## 
## Resampling round 10
## Warning in xgb.train(params = params, data = dtrain, nrounds = 560, verbose
## = 1, : Only the first data set in watchlist is used for early stopping
## process.

## [0]  valid-auc:0.796626  train-auc:0.814269
## [20] valid-auc:0.820444  train-auc:0.844613
## [40] valid-auc:0.822156  train-auc:0.849720
## [60] valid-auc:0.825449  train-auc:0.856312
## [80] valid-auc:0.826399  train-auc:0.861326
## [100]    valid-auc:0.827708  train-auc:0.866078
## [120]    valid-auc:0.828470  train-auc:0.870403
## [140]    valid-auc:0.828681  train-auc:0.874283
## [160]    valid-auc:0.829436  train-auc:0.878770
## [180]    valid-auc:0.829649  train-auc:0.881454
## [200]    valid-auc:0.829624  train-auc:0.884762
## [220]    valid-auc:0.829941  train-auc:0.888216
## [240]    valid-auc:0.829517  train-auc:0.890189
## Stopping. Best iteration: 219

  cat("\n")
  cat("CV score: \n")
## CV score:
  scores <- rbind(apply(CVvalprob, 2, mean), apply(CVvalprob, 2, sd))
  rownames(scores) <- c("mean", "sd")
  print(scores)
##             Dxy   C (ROC)        R2          D  D:Chi-sq D:p             U
## mean 0.68124354 0.8406218 0.2245267 0.06558355 499.56617  NA -0.0000942426
## sd   0.02187359 0.0109368 0.0156069 0.00458612  34.86369  NA  0.0001831644
##      U:Chi-sq       U:p           Q        Brier  Intercept      Slope
## mean 1.283568 0.6191425 0.065677795 0.0344364875 0.02368246 1.00647979
## sd   1.392416 0.2942358 0.004651485 0.0008411937 0.11705484 0.05298757
##            Emax       S:z       S:p         Eavg
## mean 0.02837047 0.1816441 0.7194494 0.0029800177
## sd   0.01764127 0.6634078 0.3113167 0.0009152832
  # calculate and show feature importance from the last resampling round
  f <- xgb.importance(feature_names = colnames(test), model = m2m)
  head(f,50)
##                            Feature        Gain       Cover   Frequence
##  1:                    var15_clean 0.268463490 0.194272374 0.083877129
##  2:              saldo_var30_clean 0.167816916 0.128352380 0.046952093
##  3:                    var38_clean 0.068617748 0.072706435 0.097564858
##  4:                       n0_clean 0.042572843 0.020404546 0.024351424
##  5:   saldo_medio_var5_hace3_clean 0.029674546 0.032244611 0.037084195
##  6:    saldo_medio_var5_ult3_clean 0.029167107 0.026853052 0.043132262
##  7:   saldo_medio_var5_hace2_clean 0.025264761 0.025250417 0.038994111
##  8:           num_var22_ult3_clean 0.020213546 0.026112114 0.031195289
##  9:          num_var45_hace3_clean 0.018668896 0.020087042 0.038994111
## 10:    saldo_medio_var5_ult1_clean 0.016606118 0.011543835 0.024669744
## 11:  imp_op_var41_efect_ult3_clean 0.012893958 0.014676484 0.017825879
## 12:           num_var22_ult1_clean 0.012345215 0.024951802 0.018939997
## 13:                     mod3_clean 0.011932514 0.016813066 0.021804870
## 14:           num_var45_ult1_clean 0.010877274 0.007862311 0.020690753
## 15:              saldo_var42_clean 0.010711639 0.011341639 0.015120166
## 16:          num_var22_hace3_clean 0.010017284 0.014607306 0.023396467
## 17:        imp_op_var39_ult1_clean 0.009700368 0.012425803 0.013846888
## 18:               saldo_var5_clean 0.009581209 0.010287524 0.018462518
## 19:              saldo_var37_clean 0.009180095 0.010767366 0.015438485
## 20:               saldo_var8_clean 0.008848297 0.031999816 0.009867898
## 21:                 num_var4_clean 0.008805998 0.004269736 0.002864873
## 22:                      var3_catB 0.008052801 0.008886556 0.013210250
## 23:        imp_op_var41_ult1_clean 0.007642434 0.011678609 0.014165208
## 24:     imp_trans_var37_ult1_clean 0.007521902 0.010971744 0.015438485
## 25: num_meses_var39_vig_ult3_clean 0.007122657 0.010938460 0.013687729
## 26:                      var3_catP 0.007077299 0.007937457 0.014642687
## 27:  imp_op_var41_efect_ult1_clean 0.007062190 0.016375040 0.011141175
## 28:                     var36_catB 0.006478173 0.002173671 0.013846888
## 29:              num_var30_0_clean 0.006469239 0.007905537 0.004138151
## 30:      num_meses_var5_ult3_clean 0.006439109 0.005367138 0.003978991
## 31:                num_var35_clean 0.006242707 0.008795362 0.003342352
## 32:  imp_op_var39_comer_ult3_clean 0.006096856 0.008039103 0.012573611
## 33:  imp_op_var39_comer_ult1_clean 0.005631149 0.009000989 0.011459494
## 34:          num_var22_hace2_clean 0.004987933 0.002772061 0.010504536
## 35:  imp_op_var41_comer_ult3_clean 0.004605716 0.004951422 0.008753780
## 36:                     var36_catP 0.004532460 0.002630844 0.012891931
## 37:      imp_var43_emit_ult1_clean 0.004521308 0.005760601 0.009549578
## 38:       num_ent_var16_ult1_clean 0.004448427 0.012878239 0.009072099
## 39:   saldo_medio_var8_hace2_clean 0.004142743 0.008869759 0.008117141
## 40:  imp_op_var39_efect_ult3_clean 0.004072686 0.004700536 0.005729747
## 41:  imp_op_var41_comer_ult1_clean 0.003464496 0.004382961 0.008753780
## 42:        num_op_var41_ult1_clean 0.003401376 0.002079729 0.006048066
## 43:       num_med_var22_ult3_clean 0.003344084 0.003591988 0.005729747
## 44:  num_op_var41_efect_ult3_clean 0.003300449 0.002933236 0.006048066
## 45:       num_op_var41_hace2_clean 0.003223464 0.003387859 0.005570587
## 46:  imp_op_var39_efect_ult1_clean 0.002980164 0.005178179 0.004138151
## 47:                    var38_isBAD 0.002955948 0.002513535 0.006525545
## 48:        num_op_var41_ult3_clean 0.002891716 0.001691668 0.005570587
## 49:    saldo_medio_var8_ult1_clean 0.002867156 0.005501656 0.005729747
## 50:              num_var42_0_clean 0.002699580 0.004705893 0.002705714
##                            Feature        Gain       Cover   Frequence
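
The top of this table can also be plotted; a minimal sketch (depending on the xgboost version, xgb.plot.importance may additionally require the Ckmeans.1d.dp package):

# plot the 20 most important features from the last CV model
xgb.plot.importance(head(f, 20))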

This model can now be combined with others in an ensemble for submission to the competition.
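
For completeness, here is a minimal sketch of producing a single-model submission with the functions defined above; the file name submission.csv is an assumption, and the dummy TARGET column exists only so the model-matrix formula can be applied to test:

# fit on the full training set using the prediction function defined earlier
final_model <- trainPredxgb_C(train)

test$TARGET <- -1 # dummy column so the TARGET ~ . - 1 formula works on test
dtest <- xgb.DMatrix(sparse.model.matrix(TARGET ~ . - 1, data = test))

submission <- data.frame(ID = test.id, TARGET = predict(final_model, dtest))
write.csv(submission, "submission.csv", row.names = FALSE)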