1. Abstract

This document presents a collection of common techniques widely used in industry for model building, especially in Kaggle competitions. Most of the methods come from the solutions of top Kaggle winners. The collection covers not only general-purpose methods for data processing and EDA, model training, and ensembling and stacking, but also some methods specific to image data processing.

2. Introduction

The top solutions are mostly published as separate blog posts by their authors, or summarized on the discussion board of a single competition. However, no one has done the collecting work of gathering all the common methods together, classifying them by process step, explaining their usage and advantages, and attaching the corresponding code.

This paper covers the most common methods from top solutions described in over 50 blogs and papers, to help people build models that perform better in machine learning tasks.

3. Common Methods for All Models

This part covers methods for models whose input features are already a numerical list or vector.
Image data and natural language data need more pre-processing steps. The methods for image data are listed later, but NLP problems are not covered in this paper.

3.1 Data Processing & EDA (Exploratory Data Analysis)

3.1.1 Resampling for Unbalanced Data

  • Definition:

    Resampling: Drawing randomly with replacement from a set of data points.

    Unbalanced: A dataset in which the classes have different proportions of cases is called unbalanced.

  • Purpose [1]:
    Downsampling creates a balanced dataset by matching the number of samples in the minority class with a random sample from the majority class.

    Upsampling (Oversampling) matches the number of samples in the majority class with resampling from the minority class.
    Both of them correct for a bias in the original dataset.

  • Potential Drawbacks [2]:
    Upsampling: The machine learning algorithm sees the minority cases many times and thus tends to overfit to these specific examples.
    Downsampling: We risk removing some of the more representative majority-class instances, thus discarding useful information.

  • Related Works:
    RUSBoost, SMOTEBagging, and UnderBagging have all been regarded as more promising approaches since SMOTE (Synthetic Minority Over-sampling Technique) was introduced. SMOTE itself is still very popular due to its simplicity.

  • Implementation:
    Kaggle: Planet: Understanding the Amazon from Space – 13th place

  • Packages:
    imbalanced-learn, scikit-learn
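
The two resampling strategies described above can be sketched with NumPy alone. This is a minimal illustration, not the method used in any of the cited solutions; the function name `resample_to_balance` and the toy data are assumptions made for the example. The packages listed above provide tested implementations for real use.

```python
import numpy as np

def resample_to_balance(X, y, mode="up", seed=0):
    """Return a class-balanced copy of (X, y).

    mode="up":   draw minority classes with replacement up to the majority size.
    mode="down": draw majority classes without replacement down to the minority size.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max() if mode == "up" else counts.min()
    picked = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        # Replacement is only needed when a class has fewer rows than the target.
        picked.append(rng.choice(idx, size=target, replace=len(idx) < target))
    picked = np.concatenate(picked)
    rng.shuffle(picked)
    return X[picked], y[picked]

# Toy data (hypothetical): 90 negatives, 10 positives.
X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_up, y_up = resample_to_balance(X, y, mode="up")
X_down, y_down = resample_to_balance(X, y, mode="down")
```

Resampling should be applied to the training split only, so that validation and test data keep the original class ratio.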

3.1.2 Confusion Matrix

3.1.3 Autoencoder

  • Definition:
    An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. I.e., it uses y(i)=x(i). [3]

  • Purpose:
    If there is structure in the data, for example, if some of the input features are correlated, then this algorithm will be able to discover some of those correlations. In fact, this simple autoencoder often ends up learning a low-dimensional representation very similar to PCA's.

  • Benefits:
    Denoising autoencoders (DAE) are good at finding a better representation of numeric data for later supervised learning with a neural net. One can use train+test features to build the DAE. The larger the test set, the better.
    A denoising autoencoder takes a noisy version of the features as input and tries to reconstruct the clean version, so it has to find a representation of the data from which the clean features can be recovered.

  • Implementation:
    Kaggle: Porto Seguro’s Safe Driver Prediction – 1st place
    Kaggle: Data Science Bowl – 1st place

  • Code:
    Kaggle: National Data Science Bowl – 1st place
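
To make the denoising idea concrete, here is a small NumPy sketch of a one-hidden-layer denoising autoencoder trained by plain gradient descent. It is only an illustration: the toy data, Gaussian input noise, layer sizes, and learning rate are all assumptions, and the cited solutions use much larger networks built with deep learning frameworks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with structure: 8 observed features driven by 2 latent factors.
latent = rng.normal(size=(256, 2))
X = latent @ rng.normal(scale=0.5, size=(2, 8))

n_hidden, lr, noise_std = 4, 0.1, 0.3
W1 = rng.normal(scale=0.1, size=(8, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 8))
b2 = np.zeros(8)

def encode(inputs):
    """Hidden representation produced by the encoder half."""
    return np.tanh(inputs @ W1 + b1)

losses = []
for step in range(1000):
    X_noisy = X + rng.normal(scale=noise_std, size=X.shape)  # corrupt the input
    H = encode(X_noisy)
    X_hat = H @ W2 + b2
    err = X_hat - X                      # the target is the CLEAN input
    losses.append(float(np.mean(err ** 2)))
    # Backpropagation of the mean squared error.
    g_out = 2.0 * err / err.size
    gW2, gb2 = H.T @ g_out, g_out.sum(axis=0)
    g_hidden = (g_out @ W2.T) * (1.0 - H ** 2)
    gW1, gb1 = X_noisy.T @ g_hidden, g_hidden.sum(axis=0)
    W1 -= lr * gW1
    b1 -= lr * gb1
    W2 -= lr * gW2
    b2 -= lr * gb2

# Learned representation that a later supervised model could consume.
features = encode(X)
```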

3.1.4 Normalization/Batch Normalization

  • Definition:
    In the simplest cases, normalization of ratings means adjusting values measured on different scales to a notionally common scale, often prior to averaging. In more complicated cases, normalization may refer to more sophisticated adjustments where the intention is to bring the entire probability distributions of adjusted values into alignment.

  • Benefits [4][5]:
  1. Networks train faster.
  2. Allows higher learning rates.
  3. Makes weights easier to initialise.
  4. Makes more activation functions viable.
  5. Simplifies the creation of deeper networks.
  6. Provides some regularisation.
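
As a sketch of what batch normalization computes in training mode, following the transform described in [4]: each feature is standardized over the mini-batch and then scaled and shifted by learnable parameters. The function name and toy input are assumptions for the example, and the running averages used at inference time are omitted.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
# Hypothetical activations on a very different scale from a unit Gaussian.
x = rng.normal(loc=50.0, scale=10.0, size=(32, 3))
out = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
```

With `gamma` set to ones and `beta` to zeros, every output feature has approximately zero mean and unit variance regardless of the input scale.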

3.2 Model Training

3.2.1 Train Hard Example Mining Network

3.2.2 Use Network to Train Features

3.2.3 Transfer Learning

  • Definition:
    Transfer learning is the process of taking a pre-trained model (the weights and parameters of a network that has been trained on a large dataset by somebody else) and “fine-tuning” the model with your own dataset. The idea is that this pre-trained model will act as a feature extractor.

  • Usage:
    You will remove the last layer of the network and replace it with your own classifier (depending on what your problem space is). You then freeze the weights of all the other layers and train the network normally (Freezing the layers means not changing the weights during gradient descent/optimization).

  • Implementation:
    Kaggle: Planet: Understanding the Amazon from Space – 7th place

  • Code:
    Kaggle: Transfer Learning on Stack Exchange Tags – Top 1% place
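
The freeze-and-replace recipe can be illustrated with a deliberately tiny NumPy stand-in. Everything here is an assumption made for the sketch: `W_pretrained` is a random matrix playing the role of pre-trained layers, and the new classifier is a logistic-regression head. A real project would load actual pre-trained weights in a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for pre-trained layers: in practice these weights
# would be loaded from a network trained on a large dataset.
W_pretrained = rng.normal(size=(10, 16))
W_frozen_copy = W_pretrained.copy()

def extract_features(x):
    """Frozen layers: used in the forward pass but never updated."""
    return np.maximum(x @ W_pretrained, 0.0)  # ReLU features

# Small labeled dataset for the new task (binary labels).
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# New classifier head that replaces the original last layer.
w_head = np.zeros(16)
b_head = 0.0

def head_probs(feats):
    return 1.0 / (1.0 + np.exp(-(feats @ w_head + b_head)))

def head_loss(feats):
    p = head_probs(feats)
    return float(-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))

feats = extract_features(X)  # computed once, because the extractor is frozen
loss_before = head_loss(feats)
for _ in range(300):
    grad = head_probs(feats) - y
    # Only the head receives gradient updates.
    w_head -= 0.05 * feats.T @ grad / len(y)
    b_head -= 0.05 * grad.mean()
loss_after = head_loss(feats)
```

Because the frozen features never change, they can be computed once and cached, which is one reason this setup trains quickly.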

3.2.4 Optimizer

Some of the models use Adam, but most of them use SGD and its variants. Two recent papers also point out drawbacks of Adam.

  • Definition[7]:
    The goal of the algorithm is to set the parameters \(\Theta\) so as to minimize the total loss \(L(\Theta) = \sum_{i=1}^{n} L(f(x_i; \Theta), y_i)\) over the training set. It works by repeatedly sampling a training example and computing the gradient of the error on that example with respect to the parameters \(\Theta\); the input and expected output are assumed to be fixed, and the loss is treated as a function of \(\Theta\). The parameters \(\Theta\) are then updated in the opposite direction of the gradient, scaled by a learning rate \(\eta_t\). The learning rate can either be fixed throughout the training process or decay as a function of the time step \(t\).

  • Related Papers:
    On the Convergence of Adam and Beyond (under review as a conference paper at ICLR 2018):
    It has been empirically observed that sometimes these algorithms fail to converge to an optimal solution.
    The Marginal Value of Adaptive Gradient Methods in Machine Learning:
    The solutions found by adaptive methods generalize worse (often significantly worse) than SGD.

  • Implementation:
    Kaggle: Porto Seguro’s Safe Driver Prediction – 1st place

  • Code:
    Kaggle: Web Traffic Time Series Forecasting – 1st place
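
The SGD update from the definition above can be written in a few lines of NumPy. This is a sketch on an assumed toy problem (linear regression with a squared-error loss); the decay schedule `eta0 / (1 + decay * t)` is one common choice used here for illustration, not something prescribed by the cited solutions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 3*x0 - 2*x1 + small noise.
X = rng.normal(size=(500, 2))
true_theta = np.array([3.0, -2.0])
y = X @ true_theta + 0.01 * rng.normal(size=500)

theta = np.zeros(2)
eta0, decay = 0.05, 0.001

for t in range(5000):
    i = rng.integers(len(X))            # sample one training example
    residual = X[i] @ theta - y[i]
    grad = 2.0 * residual * X[i]        # gradient of the squared error on example i
    eta_t = eta0 / (1.0 + decay * t)    # learning rate decays with the time step t
    theta -= eta_t * grad               # step in the opposite direction of the gradient
```

Setting `decay = 0` gives the fixed-learning-rate variant mentioned in the definition.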

3.2.5 Distilling the Knowledge in a Neural Network

Paper: (https://arxiv.org/pdf/1503.02531.pdf)

3.2.6 Pseudo-labeling

  • Definition:
    Pseudo-labeling entails adding test data to the training set to create a much larger dataset. The labels of the test data points (so-called pseudo-labels) are based on predictions from a previously trained model or an ensemble of models.

  • Benefits:
    This mostly had a regularizing effect, which allowed us to train bigger networks.

  • Attention:
    Balance the original data and the pseudo-labeled data in the resulting dataset: in most of our experiments, 33% of each minibatch was sampled from the pseudo-labeled dataset and 67% from the real training set. Using too much pseudo-labeled data regularizes much more strongly and makes the results more similar to the pseudo-labels. We also have to reduce or disable dropout, or the models will underfit.

  • Implementation:
    Kaggle: Data Science Bowl – 1st place

  • Code:
    Kaggle: National Data Science Bowl – 1st place
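
The 67%/33% mixing described above can be sketched as a mini-batch sampler. The function `mixed_minibatch` and the random placeholder labels are assumptions for the example; in practice the pseudo-labels would come from a previously trained model or ensemble.

```python
import numpy as np

def mixed_minibatch(X_train, y_train, X_test, pseudo_labels,
                    batch_size=96, pseudo_fraction=1 / 3, rng=None):
    """Sample a mini-batch mixing real training rows and pseudo-labeled test rows."""
    if rng is None:
        rng = np.random.default_rng()
    n_pseudo = int(round(batch_size * pseudo_fraction))
    n_real = batch_size - n_pseudo
    real_idx = rng.choice(len(X_train), size=n_real, replace=False)
    pseudo_idx = rng.choice(len(X_test), size=n_pseudo, replace=False)
    X_batch = np.concatenate([X_train[real_idx], X_test[pseudo_idx]])
    y_batch = np.concatenate([y_train[real_idx], pseudo_labels[pseudo_idx]])
    is_pseudo = np.concatenate([np.zeros(n_real, bool), np.ones(n_pseudo, bool)])
    return X_batch, y_batch, is_pseudo

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(300, 5)), rng.integers(0, 2, size=300)
X_test = rng.normal(size=(200, 5))
# Placeholder only: real pseudo-labels are predictions of an earlier model.
pseudo_labels = rng.integers(0, 2, size=200)

X_batch, y_batch, is_pseudo = mixed_minibatch(
    X_train, y_train, X_test, pseudo_labels, rng=rng
)
```

The `is_pseudo` mask is returned so that the two sources can be weighted or monitored separately if needed.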

3.3 Model Ensemble

3.3.2 Ridge regression / Logistic regression / XGBoost regression on different model results to predict each label separately
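
A minimal sketch of the ridge variant for a multi-label problem, assuming each base model outputs one score per label. The base-model predictions are simulated here (true label plus model-specific noise), and the closed-form helper `fit_ridge` is written for the example; in practice the inputs should be out-of-fold predictions so that the second-level model does not see leaked labels.

```python
import numpy as np

def fit_ridge(P, y, alpha=1.0):
    """Closed-form ridge regression: solve (P^T P + alpha*I) w = P^T y."""
    return np.linalg.solve(P.T @ P + alpha * np.eye(P.shape[1]), P.T @ y)

rng = np.random.default_rng(0)
n_samples, n_models, n_labels = 400, 3, 4

# Simulated predictions with shape (n_models, n_samples, n_labels):
# each base model is the true label plus its own level of noise.
Y = rng.integers(0, 2, size=(n_samples, n_labels)).astype(float)
noise_levels = np.array([0.2, 0.5, 1.0])
preds = Y[None] + noise_levels[:, None, None] * rng.normal(
    size=(n_models, n_samples, n_labels)
)

# One ridge model per label, fitted on that label's column from every base model.
weights = np.stack([fit_ridge(preds[:, :, j].T, Y[:, j]) for j in range(n_labels)])
blended = np.stack(
    [preds[:, :, j].T @ weights[j] for j in range(n_labels)], axis=1
)
```

Fitting each label separately lets the blend lean on whichever base model is strongest for that label.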

3.3.3 Bagging Ensemble Selection

Paper:
Ensemble Selection from Libraries of Models

3.4 Tools Collection

3.4.1 Parameter Tuning

  1. SMAC3
    Kaggle: Planet: Understanding the Amazon from Space – 9th place

    Benefits:
    Support for conditional parameters (e.g. jointly tuning the number of layers and the dropout for each layer; dropout on the second layer is tuned only if n_layers > 1).
    Explicit handling of model variance: SMAC trains several instances of each model on different seeds and compares models only if the instances were trained on the same seed. One model wins if it is better than another model on all equal seeds.

  2. Hyperopt: Distributed Asynchronous Hyper-parameter Optimization
    Hyperopt is a Python library for serial and parallel optimization over awkward search spaces, which may include real-valued, discrete, and conditional dimensions.

3.4.2 Stacking

StackNet, published by the top-ranked Kaggler KazAnova.

StackNet is a computational, scalable, and analytical framework implemented in Java that resembles a feedforward neural network and uses Wolpert’s stacked generalization in multiple levels to improve accuracy in machine learning problems. In contrast to feedforward neural networks, rather than being trained through backpropagation, the network is built iteratively one layer at a time (using stacked generalization), and each layer uses the final target as its target.

4. Special Techniques for Image Data

4.1 Data Processing

4.1.1 “Single Image Haze Removal using Dark Channel Prior”

4.1.2 Data Augmentation

  • Common Augment Patterns:
  1. Horizontal/Vertical Flip
  2. Random Crop
  3. Random Resize, Zooming
  4. Random change in Brightness, Saturation, Contrast
  5. Rotation
  6. Blurring
  7. Gaussian Noise
  8. Translation, Distortion, Mirror, Transposition
  9. Elastic Transformation
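
Several of the patterns above (flips, 90-degree rotation, random crop, brightness change, Gaussian noise) can be combined in one NumPy function. This is a sketch under stated assumptions: images are `(H, W, C)` float arrays in `[0, 1]`, and the crop size, brightness range, and noise level are arbitrary example values. Blurring, zooming, and elastic transformations are not covered here.

```python
import numpy as np

def augment(image, rng, crop=24, noise_std=0.02):
    """Apply random flips, a 90-degree rotation, a crop, a brightness shift, and noise.

    `image` is an (H, W, C) float array with values in [0, 1].
    """
    if rng.random() < 0.5:
        image = image[:, ::-1]                        # horizontal flip
    if rng.random() < 0.5:
        image = image[::-1, :]                        # vertical flip
    image = np.rot90(image, k=int(rng.integers(4)))   # rotate by 0/90/180/270 degrees
    h, w = image.shape[:2]
    top = int(rng.integers(h - crop + 1))
    left = int(rng.integers(w - crop + 1))
    image = image[top:top + crop, left:left + crop]   # random crop
    image = image + rng.uniform(-0.1, 0.1)            # random brightness change
    image = image + rng.normal(scale=noise_std, size=image.shape)  # Gaussian noise
    return np.clip(image, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # hypothetical input image
augmented = augment(image, rng)
```

Calling `augment` again on the same image yields a different variant, which is how one source image can supply many training examples.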

4.2 Model Training

4.2.1 Different Network Sizes

I trained different networks with 64*64, 224*224, and 256*256 inputs, and used dilation in a ResNet-34 network. Different networks have different strengths on different labels. For example, SimpleNet 64 performs well on the label “clear”.

4.2.2 TTA (Test Time Augmentation)

The previous section covered data augmentation for the training dataset; applying augmentation at test time (TTA) is also very useful for improving model predictions.
It is also very easy to implement. For example, we apply rotations and flips to the test dataset and predict on all of the transformed copies as well. We can then combine the predictions with a method such as averaging or voting to get the final result.
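
A minimal sketch of TTA with averaging, assuming square images and a model exposed as a function from a batch to class probabilities. `toy_predict` is a made-up stand-in for a trained network, used only so the example runs.

```python
import numpy as np

def predict_with_tta(predict_fn, images):
    """Average predictions over rotated and flipped copies of each image.

    `predict_fn` maps a batch (N, H, W, C) to class probabilities (N, K).
    Square images are assumed so that 90-degree rotations keep the shape.
    """
    views = []
    for k in range(4):                                   # 0, 90, 180, 270 degrees
        rotated = np.rot90(images, k=k, axes=(1, 2))
        views.append(rotated)
        views.append(rotated[:, :, ::-1])                # horizontal flip of each rotation
    return np.mean([predict_fn(v) for v in views], axis=0)

def toy_predict(batch):
    """Hypothetical model: score based on the mean intensity of the top half."""
    p = batch[:, : batch.shape[1] // 2].mean(axis=(1, 2, 3))
    return np.stack([p, 1.0 - p], axis=1)

rng = np.random.default_rng(0)
images = rng.random((5, 8, 8, 3))
probs = predict_with_tta(toy_predict, images)
```

Averaging over all eight rotation and flip combinations makes the final prediction the same whichever of those orientations the test image arrives in, even though the underlying model is orientation-sensitive.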

4.2.3 Use Pre-trained Models

5. Conclusion

Overall, this paper gathers the most common general techniques that, to the best of my knowledge, have recently improved model performance. Most of the sources are recent competitions or papers, but I have also checked some older blogs and discussions. Most of the methods from those older sources are already included here, and the rest have been superseded by more recent work that is more efficient or accurate, especially in image competitions. Many other competitions focused on image segmentation or time series; I skimmed several blogs about them but have not included them so far.
Of course, this paper has limitations due to my own ability, and I encourage everyone reading it to discuss their findings and thoughts with me. I hope everyone can train more and more useful models, even though all models are wrong :P.
I also want to thank Dr. Guang Chen, who gathered many good blogs from different sources and reposted them on Weibo, including many interviews, overviews, discussions, and code for top Kaggle solutions. This made it much easier for me to finish this paper. Dr. Chenlong Chen, who was a top-10 Kaggler, also shared a lot of his experience on Zhihu.

References

[1] Managing unbalanced data for building machine learning models (http://www.simafore.com/blog/handling-unbalanced-data-machine-learning-models)
[2] Class Imbalance Problem (http://www.chioka.in/class-imbalance-problem/)
[3] Autoencoders (http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/)
[4] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (https://arxiv.org/pdf/1502.03167.pdf)
[5] Glossary of Deep Learning: Batch Normalisation (https://medium.com/deeper-learning/glossary-of-deep-learning-batch-normalisation-8266dcd2fa82)
[6] Training Region-based Object Detectors with Online Hard Example Mining (https://arxiv.org/pdf/1604.03540.pdf)
[7] Neural Network Methods for Natural Language Processing, Yoav Goldberg (http://www.morganclaypool.com/doi/abs/10.2200/S00762ED1V01Y201703HLT037)
[8] Kaggle: Planet: Understanding the Amazon from Space – 6th place (https://zhuanlan.zhihu.com/p/28084438)