Titanic Kaggle competition

Pablo Adames
April 6, 2020

Kaggle Kernels

The work was done in a Kaggle kernel running a Jupyter notebook.

  • Created this tutorial as a kernel in my Kaggle account

  • Kaggle uses Docker containers to sandbox notebooks

    • 2 (fast) CPUs
    • A free NVIDIA Tesla P100 attached to your notebook (a data-centre GPU worth about US$7,000)
    • A free TPU v3-8 attached to your notebook
  • I/O

    • input is read from “../input/”
    • output is written to the current folder (accessible after committing a new version; see the R sketch after this list)
  • Submissions (still done from the command line)

  • Installing packages (enable Internet in kernel settings)
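
For example, an R kernel reads the competition data from that input folder and writes outputs to the working directory. A minimal sketch (the titanic subfolder name is an assumption):

# Competition data is mounted read-only under ../input/
train <- read.csv("../input/titanic/train.csv", stringsAsFactors = FALSE)
test  <- read.csv("../input/titanic/test.csv",  stringsAsFactors = FALSE)

# Files written to the working directory become downloadable kernel output
write.csv(data.frame(PassengerId = test$PassengerId, Survived = 0),
          "submission.csv", row.names = FALSE)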

Data

The rules:

  • A fixed proportion of the data goes into the train and test sets
  • A submission says nothing about how the predictions were made; it must simply contain one for every test passenger
  • This forces you to impute missing values in the test set
  • 11 submissions are allowed per 24-hour period

Data Exploration

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | | Q |
| 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.0750 | | S |
| 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27 | 0 | 2 | 347742 | 11.1333 | | S |
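
A preview like the table above can be reproduced from the first rows of the training data (assuming the train data frame read earlier):

# Structure and first nine passengers of the training set
str(train)
head(train, 9)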

Continuous variable distribution

[Figure: distributions of the continuous variables]
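
A sketch of how such a plot could be generated, assuming Age and Fare are the continuous variables shown:

# Histograms of the continuous variables; hist() drops NA values
par(mfrow = c(1, 2))
hist(train$Age,  main = "Age",  xlab = "Years")
hist(train$Fare, main = "Fare", xlab = "Fare")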

Categorical variables

  • Sex for gender (two levels; encoding sketched after this list):

    • Female (0)
    • Male (1)
  • Pclass for the passenger category (three levels):

    • 1st
    • 2nd
    • 3rd
  • Survived

    • Perished (0): 196 (59.6%)
    • Survived (1): 133 (40.4%)
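
A minimal sketch of this encoding, assuming the raw character columns read from train.csv:

# Encode Sex as 0/1 and Pclass as a three-level factor
train$Sex    <- ifelse(train$Sex == "female", 0, 1)
train$Pclass <- factor(train$Pclass, levels = c(1, 2, 3))

# Class balance of the outcome, as counts and percentages
table(train$Survived)
round(100 * prop.table(table(train$Survived)), 1)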

Irrelevant variables

  • PassengerId
  • Name (as entered)
  • Ticket

Pre-processing

  • Only numerical variables
    • centering
    • scaling
    • normalization
    • imputation

The same procedure is applied to both the training and test sets.

Imputation is done with KNN, averaging the k = 10 nearest-neighbour vectors.
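
A minimal sketch of this pipeline, assuming caret's preProcess (num_cols is a hypothetical vector naming the numeric predictor columns):

library(caret)

# Learn centering, scaling, and KNN imputation (k = 10) on the training set only
pre <- preProcess(train[, num_cols],
                  method = c("center", "scale", "knnImpute"),
                  k = 10)

# Apply the same fitted transform to both sets
train_pp <- predict(pre, train[, num_cols])
test_pp  <- predict(pre, test[, num_cols])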

Models

Model families commonly used for binary classification problems like this one in Kaggle competitions:

  • trees (T)
  • Bayesian predictors (B)
  • support vector machines (S)
  • generalized linear models (G)
  • their ensembles

The seven models fitted here (family codes in parentheses; see the caret sketch after this list):

  1. Logistic model trees (G+T)
  2. Bayesian generalized linear model (B+G)
  3. Generalized linear model (G)
  4. XGBoost (T + gradient boosting)
  5. Random forest (T + bagging)
  6. SVM with linear kernel (S)
  7. Least-squares SVM with radial basis function kernel (S + least squares)
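
A sketch of how one of these models could be fitted, assuming caret with 10-fold cross-validation (the resampling scheme is an assumption; the method names come from caret's model list):

library(caret)

# caret needs a factor outcome for classification; train_pp and test_pp
# come from the pre-processing sketch above
train_pp$Survived <- factor(train$Survived, labels = c("Perished", "Survived"))

ctrl <- trainControl(method = "cv", number = 10)

# Other caret methods for the list above: "bayesglm", "glm", "xgbTree",
# "rf", "svmLinear", "lssvmRadial"
fit  <- train(Survived ~ ., data = train_pp, method = "LMT", trControl = ctrl)
pred <- predict(fit, newdata = test_pp)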

Results

| # | Model | Score |
|---|-------|-------|
| 1 | Logistic model trees | 0.79425 |
| 4 | XGBoost | 0.77990 |
| 5 | Random forest | 0.77990 |
| 7 | Least-squares SVM, RBF kernel | 0.77511 |
| 6 | SVM with linear kernel | 0.76555 |
| 2 | Bayesian generalized linear model | 0.76076 |
| 3 | Generalized linear model | 0.76076 |

(The # column refers to the model numbering above.)

Downloading results from the kernel

[Figure: downloading the result files from the kernel]

Submitting from the Kaggle command line

$ kaggle competitions submit -c titanic -f results/glm_default.csv -m "typo fixed"
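
Past submissions and their scores can then be listed with the same CLI:

$ kaggle competitions submissions -c titanic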

Conclusions

  • Fun
  • An excellent way to learn how to compete in real Kaggle events
  • Very hard to score high
  • Feature engineering and hyperparameter optimization are the natural next steps
  • The ensemble of trees and logistic regression (logistic model trees) was the best model