Pablo Adames
April 6, 2020
The work was done in a Kaggle kernel with a Jupyter notebook.
Created tutorial as a kernel in my Kaggle account
Kaggle uses Docker containers to sandbox notebooks
I/O
Submissions (still command line)
Installing packages (enable Internet in kernel settings)
Fixed proportion of data in the train and test sets
Submission says nothing how
Forces you to impute missing values in the test set
11 submissions per 24-hour period
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | | S |
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | | Q |
7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S |
8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.0750 | | S |
9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27 | 0 | 2 | 347742 | 11.1333 | | S |
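Before any imputation it helps to know which columns have gaps. The following is a minimal sketch, using only the Python standard library, that counts missing cells per column on a handful of rows copied from the table above. The inline sample, the `MISSING` marker set, and the `count_missing` helper are illustrative assumptions and not part of the original kernel; the Name column is left out to keep the sample short.

```python
import csv
import io

# A few rows copied from the table above (Name omitted for brevity).
# Assumption: both "NA" and the empty string mean "missing".
SAMPLE = """PassengerId,Survived,Pclass,Sex,Age,Cabin,Embarked
1,0,3,male,22,,S
2,1,1,female,38,C85,C
5,0,3,male,35,,S
6,0,3,male,NA,,Q
7,0,1,male,54,E46,S
"""

MISSING = {"", "NA"}


def count_missing(rows):
    """Return {column: number of missing cells} for a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if value.strip() in MISSING:
                counts[column] = counts.get(column, 0) + 1
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
missing_counts = count_missing(rows)
print(missing_counts)
```

On this sample only Age and Cabin report gaps, which matches what the table shows for these passengers.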
Sex
for gender (two levels):
Pclass
for the passenger category (three levels):
Survived
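The three columns listed above are categorical, so each value can be replaced by the index of its level. The sketch below shows one way to do that in plain Python. The `LEVELS` mapping, the level order, the `encode` helper, and the two example rows are assumptions made for illustration; the original kernel may have used a different encoding.

```python
# Levels for each categorical column; the order is an illustrative choice.
LEVELS = {
    "Sex": ["female", "male"],
    "Pclass": ["1", "2", "3"],
    "Survived": ["0", "1"],
}


def encode(row, levels=LEVELS):
    """Return a copy of row with each categorical value replaced by its level index."""
    encoded = dict(row)
    for column, column_levels in levels.items():
        if column in row:  # the test set has no Survived column
            encoded[column] = column_levels.index(row[column])
    return encoded


# Illustrative rows: one shaped like the training set, one like the test set.
train_row = {"PassengerId": "1", "Survived": "0", "Pclass": "3", "Sex": "male"}
test_row = {"PassengerId": "892", "Pclass": "3", "Sex": "male"}

encoded_train = encode(train_row)
encoded_test = encode(test_row)
print(encoded_train)
print(encoded_test)
```

Because the same `LEVELS` mapping is applied to both rows, the training and test sets end up with identical codes for identical values.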
Same procedure for both Training and Test sets
Imputation by k-nearest-neighbour (KNN) averaging, using the k=10 nearest vectors.
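To make the idea of KNN averaging concrete, here is a minimal pure-Python sketch: a missing entry is replaced by the mean of that column over the k closest rows that have the value. It is not necessarily the routine used in the kernel. The `knn_impute` function, the Euclidean distance over shared columns, and the toy rows are assumptions for illustration, and a real pipeline would normally scale the features first.

```python
import math


def knn_impute(rows, k=10):
    """Fill None entries with the column mean over the k nearest rows.

    Distance is Euclidean over the columns observed in both rows.
    """
    completed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_index, other in enumerate(rows):
                if other_index == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((row[c] - other[c]) ** 2 for c in shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                completed[i][j] = sum(nearest) / len(nearest)
    return completed


# Toy rows with columns (Pclass, SibSp, Age); the fourth row lacks Age.
toy = [
    [3, 1, 22.0],
    [1, 1, 38.0],
    [3, 0, 26.0],
    [3, 0, None],
    [1, 0, 54.0],
]
imputed_k2 = knn_impute(toy, k=2)
imputed_k10 = knn_impute(toy)  # k=10 exceeds the 4 available neighbours
print(imputed_k2[3][2], imputed_k10[3][2])
```

With k=2 the gap is filled from the two closest rows only; with k=10 on this tiny sample every other row contributes, so the result is just the column mean.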
Binary classification problems like this in Kaggle competitions:
row | models | scores |
---|---|---|
1 | Logistic model trees | 0.79425 |
4 | XGBoost | 0.77990 |
5 | Random forest | 0.77990 |
7 | SVM LLRBS | 0.77511 |
6 | SVM linear kernel | 0.76555 |
2 | Bayesian generalized model | 0.76076 |
3 | Generalized linear model | 0.76076 |
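As a reading aid for the scores above, the sketch below assumes the score is plain classification accuracy, that is, the fraction of passengers whose predicted `Survived` value matches the true one. The source does not state the metric, so treat this as an assumption. The `accuracy` helper and the toy label lists are illustrative; the `scores` dictionary simply repeats the table.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the prediction equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy labels, not competition data: three of four predictions are right.
toy_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])

# Scores repeated from the table above, in table order.
scores = {
    "Logistic model trees": 0.79425,
    "XGBoost": 0.77990,
    "Random forest": 0.77990,
    "SVM LLRBS": 0.77511,
    "SVM linear kernel": 0.76555,
    "Bayesian generalized model": 0.76076,
    "Generalized linear model": 0.76076,
}

# Rank models from best to worst; ties keep their table order.
ranking = sorted(scores, key=scores.get, reverse=True)
spread = max(scores.values()) - min(scores.values())
print(toy_accuracy, ranking[0], round(spread, 5))
```

The gap between the best and worst model is a little over three percentage points, so the ranking is sensitive to a small number of test passengers.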
$ kaggle competitions submit -c titanic -f results/glm_default.csv -m "typo fixed"
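The file passed with `-f` has to exist before the command runs. A minimal sketch of producing it with the Python standard library follows, assuming the usual two-column Titanic submission layout (`PassengerId`, `Survived`). The `write_submission` helper and the three example predictions are hypothetical; only the path `results/glm_default.csv` comes from the command above.

```python
import csv
import os


def write_submission(predictions, path):
    """Write (PassengerId, Survived) pairs as a two-column CSV with a header."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["PassengerId", "Survived"])
        for passenger_id, survived in predictions:
            writer.writerow([passenger_id, int(survived)])


# Hypothetical predictions for three test passengers.
example = [(892, 0), (893, 1), (894, 0)]
write_submission(example, "results/glm_default.csv")
```

Writing `Survived` as an integer keeps the column to 0 and 1 even if the model returns booleans.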