This document presents a collection of common techniques widely used in industry for model building, especially in Kaggle competitions. Most of the methods come from the solutions of top Kaggle winners. The collection covers not only all-purpose methods for data processing and EDA, model training, and ensembling and stacking, but also some methods specific to image data processing.
Most top solutions are published separately as blog posts by their authors, or summarized on the discussion board of a single competition. However, no one has done the collecting work of gathering the common methods together, classifying them by process step, explaining their usage and advantages, and attaching the corresponding code.
This paper covers the most common methods from top solutions described in over 50 blogs and papers, to help people build models that perform better on machine learning tasks.
This part covers methods for models whose input features are already a numerical list or vector.
Image data and natural language data need more preprocessing steps. The methods for image data are listed later, but no NLP problems are covered in this paper.
Definition:
Resampling: Drawing randomly with replacement from a set of data points
Unbalanced: A dataset in which the classes have different proportions of cases is called unbalanced.
Purpose [1]:
Downsampling creates a balanced dataset by matching the number of samples in the minority class with a random sample from the majority class.
Upsampling (Oversampling) matches the number of samples in the majority class with resampling from the minority class.
Both of them correct for a bias in the original dataset.
Potential Drawbacks [2]:
Upsampling: The machine learning algorithm sees the minority cases many times and thus tends to overfit to these specific examples.
Downsampling: We risk removing some of the more representative majority class instances, thus discarding useful information.
Related Works:
RUSBoost, SMOTEBagging, and UnderBagging are all regarded as more promising approaches proposed since SMOTE (Synthetic Minority Over-sampling Technique). SMOTE is still very popular due to its simplicity.
Implementation:
Kaggle: Planet: Understanding the Amazon from Space – 13th place
Packages:
imbalanced-learn, scikit-learn
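The random upsampling and downsampling described above can be sketched with the standard library alone. This is a minimal illustration, not the imbalanced-learn API: the function name `resample_balanced` and its `mode` and `seed` parameters are assumptions made for this example.

```python
import random
from collections import defaultdict

def resample_balanced(samples, labels, mode="up", seed=0):
    """Randomly balance classes (hypothetical helper for illustration).

    mode="up":   add random duplicates (drawn with replacement) to smaller
                 classes until every class matches the largest class.
    mode="down": draw without replacement from larger classes until every
                 class matches the smallest class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    sizes = [len(xs) for xs in by_class.values()]
    target = max(sizes) if mode == "up" else min(sizes)

    out = []
    for y, xs in by_class.items():
        if len(xs) < target:
            # Keep every original and top up with random duplicates.
            picked = xs + rng.choices(xs, k=target - len(xs))
        else:
            # Keep a random subset (all of them if already at target).
            picked = rng.sample(xs, target)
        out.extend((x, y) for x in picked)
    rng.shuffle(out)
    return out
```

Resampling should be applied to the training split only; resampling before the train/validation split would leak duplicated minority examples into validation.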
Definition:
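A specific table layout that allows visualization of the performance of an algorithm. Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class. Some libraries, including scikit-learn, use the transposed layout, with actual classes as rows and predicted classes as columns.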
Purpose:
The name stems from the fact that it makes it easy to see if the system is confusing two classes.
Hence, it can guide hard example mining during training. Refer to 3.2.1.
Implementation & Code:
Kaggle: Quora Question Pairs – Top 1% place
Kaggle: Digit Recognizer – Top 6% place (single model)
Package: scikit-learn
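As a small illustration of how a confusion matrix exposes the classes a model mixes up, the sketch below builds one by hand. It puts actual classes on rows and predicted classes on columns (the orientation scikit-learn's `confusion_matrix` uses; transpose it for the other convention). The helper `most_confused_pair` is a hypothetical name introduced for this example.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count matrix: rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def most_confused_pair(matrix, labels):
    """Return (actual, predicted, count) for the largest off-diagonal cell."""
    best = None
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and (best is None or count > best[2]):
                best = (labels[i], labels[j], count)
    return best

labels = ["bird", "cat", "dog"]
y_true = ["cat", "cat", "cat", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "dog", "bird"]
cm = confusion_matrix(y_true, y_pred, labels)
# The largest off-diagonal cell marks the pair of classes confused most often.
print(most_confused_pair(cm, labels))
```

The examples that fall in the largest off-diagonal cells are natural candidates for closer inspection or for hard example mining.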
Definition:
An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. I.e., it uses y(i)=x(i). [3]
Purpose:
If there is structure in the data, for example, if some of the input features are correlated, then this algorithm will be able to discover some of those correlations. In fact, this simple autoencoder often ends up learning a low-dimensional representation very similar to PCA's.
Benefits:
Denoising autoencoders (DAE) are good at finding a better representation of the numeric data for later supervised learning with a neural net. One can use train+test features to build the DAE. The larger the test set, the better.
A denoising autoencoder receives a noisy version of the features as input and tries to reconstruct the clean ones. To do so, it has to find a representation of the data that allows it to better reconstruct the clean features.
Implementation:
Kaggle: Porto Seguro’s Safe Driver Prediction – 1st place
Kaggle: Data Science Bowl – 1st place
Definition:
In the simplest cases, normalization of ratings means adjusting values measured on different scales to a notionally common scale, often prior to averaging. In more complicated cases, normalization may refer to more sophisticated adjustments where the intention is to bring the entire probability distributions of adjusted values into alignment.
Comments From Best Solution:
Input normalization for gradient-based models such as neural nets is critical. For lightgbm/xgb it does not matter. The best I have found in the past, and one that works straight out of the box, is “RankGauss”. It is based on a rank transformation. The first step is to assign a linspace from 0..1 to the sorted features, then apply the inverse of the error function (ErfInv) to shape them like Gaussians, and then subtract the mean. Binary features (e.g. one-hot ones) are not touched by this transformation. This usually works much better than a standard mean/std scaler or min/max.
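The RankGauss steps quoted above can be sketched as follows. This is an illustrative reading of the description, not the winner's code. Two details are assumptions: the ranks are rescaled from 0..1 to the open interval (-1, 1) and shrunk by a small `eps`, because the inverse error function is infinite at ±1; and ties receive distinct ranks in input order.

```python
import math
from statistics import NormalDist

def erfinv(x):
    """Inverse error function via the standard normal quantile function."""
    return NormalDist().inv_cdf((x + 1.0) / 2.0) / math.sqrt(2.0)

def rank_gauss(values, eps=1e-3):
    """Rank-transform one numeric feature into a roughly Gaussian shape."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    # Indices of the values in ascending order (ties keep input order).
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        u = rank / (n - 1)                  # linspace over 0..1
        x = (2.0 * u - 1.0) * (1.0 - eps)   # assumed rescaling into (-1, 1)
        out[i] = erfinv(x)
    mean = sum(out) / n
    return [v - mean for v in out]          # subtract the mean

# A heavily skewed feature: the outlier 1000 no longer dominates the scale.
print(rank_gauss([10, 1000, 3, 50, 7]))
```

Because only the ranks matter, the transformation is insensitive to outliers and to any monotonic rescaling of the raw feature.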
Implementation:
Kaggle: Porto Seguro’s Safe Driver Prediction – 1st place
Code:
Kaggle: Web Traffic Time Series Forecasting – 1st place
National Data Science Bowl – 17th place
Paper:
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Packages:
PyTorch, Keras, TensorFlow
Details:
Select the subset of examples with the largest loss and do back-propagation on them.
Benefits [6]:
Automatic selection of hard examples can make training more effective and efficient. OHEM (Online Hard Example Mining) is a simple and intuitive algorithm that eliminates several heuristics and hyperparameters in common use. But more importantly, it yields consistent and significant boosts in detection performance on benchmarks like PASCAL VOC 2007 and 2012. Its effectiveness increases as datasets become larger and more difficult, as demonstrated by the results on the MS COCO dataset.
Implementation:
Kaggle: Planet: Understanding the Amazon from Space – 1st place
Paper: Training Region-based Object Detectors with Online Hard Example Mining
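The selection step of hard example mining can be sketched in a few lines. This is a simplified per-example version under stated assumptions: the paper applies the idea to region proposals inside an object detector, while here each batch item is a plain classification example, and `keep_ratio` is a hypothetical parameter.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class for one example."""
    return -math.log(max(probs[label], 1e-12))

def select_hard_examples(batch_probs, batch_labels, keep_ratio=0.5):
    """Return the indices of the highest-loss examples, plus all losses."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probs, batch_labels)]
    k = max(1, int(len(losses) * keep_ratio))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:k]), losses

probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
labels = [0, 0, 1, 1]
hard, losses = select_hard_examples(probs, labels, keep_ratio=0.5)
# Only the examples in `hard` would contribute to the backward pass.
print(hard)
```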
Details:
Stage 1 : Around 300 models, Paul and Lam’s neural nets, and classical algorithms like XGB, LGBM, which worked pretty well, and a lot of Scikit-learn classification algorithms (ET, RF, KNN, etc.)
Stage 2 : Around 150 models using:
All the input features
Predictions of all the algorithms above
We added hidden layers of the best L1 pure text ESIM model
Stage 3 : 2 Linear models
Ridge by perimeter (3 perimeters were created, based on min/max degrees) on 3 least Spearman correlated L2 predictions
Lasso with logit preprocessing of all L1 and L2 predictions
Stage 4 : Blend
Benefits:
Different models can learn different patterns from the original features. Combining these patterns can give a much better representation of the data points.
Attention:
As the number of models used to extract features increases, the stacking model overfits easily, so it is better to use simple models in the later stages.
Implementation:
Kaggle: Quora Question Pairs – Top 1% place
Kaggle: Quora Question Pairs – 1st place
Kaggle: Planet: Understanding the Amazon from Space – 13th place
Code:
Kaggle: Two Sigma Connect: Rental Listing Inquiries – 1st place
Kaggle: Instacart Market Basket Analysis – 3rd place
Packages:
StackNet: Stacking
Hyperopt: Hyper-parameter optimization
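A minimal sketch of how stage-2 inputs can be built from stage-1 predictions. It assumes the common practice of using out-of-fold predictions, so that no stage-1 model predicts a row it was trained on; the solutions listed above do not spell out their exact procedure. The two toy base models and all function names are hypothetical.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def fit_mean(xs, ys):
    """Toy base model: always predicts the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_nearest(xs, ys):
    """Toy base model: 1-nearest-neighbour on a single numeric feature."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def out_of_fold_predictions(fit, xs, ys, k=3):
    """Predict every row with a model that never saw that row."""
    oof = [None] * len(xs)
    for fold in kfold_indices(len(xs), k):
        held = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        for i in fold:
            oof[i] = model(xs[i])
    return oof

def stage2_features(xs, ys, base_fits, k=3):
    """Stage-2 row = original feature + one prediction per stage-1 model."""
    columns = [out_of_fold_predictions(f, xs, ys, k) for f in base_fits]
    return [[x] + [col[i] for col in columns] for i, x in enumerate(xs)]

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 1, 2, 3, 4, 5]
features = stage2_features(xs, ys, [fit_mean, fit_nearest], k=3)
```

A simple model such as a ridge or lasso regression would then be trained on `features`, in line with the advice to keep later stages simple.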
Definition:
Transfer learning is the process of taking a pre-trained model (the weights and parameters of a network that has been trained on a large dataset by somebody else) and “fine-tuning” the model with your own dataset. The idea is that this pre-trained model will act as a feature extractor.
Usage:
You will remove the last layer of the network and replace it with your own classifier (depending on what your problem space is). You then freeze the weights of all the other layers and train the network normally (Freezing the layers means not changing the weights during gradient descent/optimization).
Implementation:
Kaggle: Planet: Understanding the Amazon from Space – 7th place
Code:
Kaggle: Transfer Learning on Stack Exchange Tags – Top 1% place
Some of the models use Adam, but most of them use SGD and its variants. There are also two recent papers that show the drawbacks of Adam.
Definition[7]:
The goal of the algorithm is to set the parameters \(\Theta\) so as to minimize the total loss \(L(\Theta) = \sum_{i=1}^{n} L(f(x_i; \Theta), y_i)\) over the training set. It works by repeatedly sampling a training example and computing the gradient of the error on the example with respect to the parameters \(\Theta\); the input and expected output are assumed to be fixed, and the loss is treated as a function of the parameters \(\Theta\). The parameters \(\Theta\) are then updated in the opposite direction of the gradient, scaled by a learning rate \(\eta_t\). The learning rate can either be fixed throughout the training process or decay as a function of the time step \(t\).
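The update rule above can be made concrete with a one-feature linear model. This is a minimal sketch: the squared-error loss, the decay schedule \(\eta_t = \eta_0 / (1 + \text{decay} \cdot t)\), and all parameter values are assumptions chosen for illustration.

```python
import random

def sgd_linear_regression(xs, ys, eta0=0.1, decay=0.001, epochs=200, seed=0):
    """Fit y ~ w*x + b by SGD on the per-example loss 0.5*(w*x + b - y)^2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    t = 0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)                    # visit examples in random order
        for i in order:
            eta_t = eta0 / (1.0 + decay * t)  # learning rate decays with t
            err = (w * xs[i] + b) - ys[i]
            # Gradient of the loss is (err * x, err); step against it.
            w -= eta_t * err * xs[i]
            b -= eta_t * err
            t += 1
    return w, b

xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]
w, b = sgd_linear_regression(xs, ys)
# On this noise-free data, w and b approach 2 and 1.
```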
Related Papers:
On The Convergence of Adam and beyond: Under review as a conference paper at ICLR 2018
It has been empirically observed that sometimes these algorithms fail to converge to an optimal solution.
The Marginal Value of Adaptive Gradient Methods in Machine Learning
The solutions found by adaptive methods generalize worse (often significantly worse) than SGD.
Implementation:
Kaggle: Porto Seguro’s Safe Driver Prediction – 1st place
Code:
Kaggle: Web Traffic Time Series Forecasting – 1st place
Paper: (https://arxiv.org/pdf/1503.02531.pdf)
Abstract:
Compress the knowledge in an ensemble into a single model which is much easier to deploy.
Introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
Benefit:
Achieve some surprising results on MNIST and show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model.
Models can be trained rapidly and in parallel.
Implementation:
Kaggle: Data Science Bowl – 1st place
Kaggle: Cdiscount’s Image Classification Challenge – 2nd place
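The core of the distillation paper is training a small model on the softened output distribution of a large model or ensemble, where the softening comes from dividing the logits by a temperature before the softmax. The sketch below shows only that soft-target term; the weighting against a hard-label loss is left out, and the function names are chosen for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

teacher = [5.0, 1.0, -2.0]
# A student that matches the teacher gets a lower loss than one that does not.
print(distillation_loss(teacher, teacher))
print(distillation_loss([-2.0, 1.0, 5.0], teacher))
```

A higher temperature spreads probability mass onto the non-argmax classes, which is where the teacher's knowledge about class similarity is carried.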
Definition:
Pseudo-labeling entails adding test data to the training set to create a much larger dataset. The labels of the test datapoints (so called pseudo-labels) are based on predictions from a previously trained model or an ensemble of models.
Benefits:
This mostly had a regularizing effect, which allowed us to train bigger networks.
Attention:
Balance the original data and the pseudo-labeled data in the resulting dataset: in most of our experiments, 33% of each minibatch was sampled from the pseudo-labeled dataset and 67% from the real training set. Using too much pseudo-labeled data regularizes a lot more, but the results become more similar to the pseudo-labels. Because of the stronger regularization, we had to reduce or disable dropout, or the models would underfit.
Implementation:
Kaggle: Data Science Bowl – 1st place
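The 33%/67% minibatch mix described above could be implemented along these lines. This is a sketch only: the function name, the rounding rule for the pseudo-labeled share, and the tagged-tuple data layout are assumptions.

```python
import random

def mixed_minibatch(real, pseudo, batch_size, pseudo_fraction=0.33, rng=None):
    """Draw one minibatch with a fixed share of pseudo-labeled examples."""
    if rng is None:
        rng = random.Random(0)
    n_pseudo = round(batch_size * pseudo_fraction)
    n_real = batch_size - n_pseudo
    # Sample without replacement from each pool, then mix.
    batch = rng.sample(real, n_real) + rng.sample(pseudo, n_pseudo)
    rng.shuffle(batch)
    return batch

# Each item is (source, index); real items carry true labels in practice,
# pseudo items carry labels predicted by an earlier model or ensemble.
real = [("real", i) for i in range(20)]
pseudo = [("pseudo", i) for i in range(10)]
batch = mixed_minibatch(real, pseudo, batch_size=12)
```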
Implementation:
Kaggle: Walmart Recruiting - Store Sales Forecasting – 2nd place
Kaggle: Invasive Species Monitor – 3rd place
Kaggle: Porto Seguro’s Safe Driver Prediction – Top 4% place
Code:
Kaggle: Statoil/C-CORE Iceberg Classifier Challenge – Top 5% place
Kaggle: Porto Seguro’s Safe Driver Prediction – 2nd place
Usage:
When different models predict the same label, build a regression model to combine their results. This gives each model its own weight in the final result instead of taking a simple mean.
When different models predict different labels, we can also exploit the relationships between labels by building a regression model. For example, in the Planet: Understanding the Amazon from Space Kaggle competition, we need to predict 17 labels, and some labels are related and mutually exclusive, such as clear, haze, and cloudy. If the probability of clear is very high, we should not get a very high probability of cloudy at the same time.
Implementation:
Kaggle: Planet: Understanding the Amazon from Space – 1st place
Kaggle: Planet: Understanding the Amazon from Space – 6th place
Kaggle: Planet: Understanding the Amazon from Space – 9th place
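For the first usage above (several models predicting the same label), the simplest regression model is a linear one without intercept, fitted on held-out predictions. The sketch below solves the two-model case in closed form through the normal equations. The choice of plain least squares and the function name are assumptions; the listed solutions may have used other regressors.

```python
def fit_blend_weights(pred_a, pred_b, target):
    """Least-squares weights (w_a, w_b) for w_a*pred_a + w_b*pred_b ~ target."""
    a = sum(p * p for p in pred_a)
    b = sum(p * q for p, q in zip(pred_a, pred_b))
    c = sum(q * q for q in pred_b)
    d = sum(p * y for p, y in zip(pred_a, target))
    e = sum(q * y for q, y in zip(pred_b, target))
    det = a * c - b * b
    if abs(det) < 1e-12:
        raise ValueError("predictions are collinear; weights are not unique")
    # Cramer's rule on the 2x2 normal equations.
    w_a = (d * c - b * e) / det
    w_b = (a * e - b * d) / det
    return w_a, w_b

pred_a = [0.1, 0.8, 0.4, 0.9]
pred_b = [0.3, 0.6, 0.5, 0.7]
target = [0.7 * p + 0.3 * q for p, q in zip(pred_a, pred_b)]
w_a, w_b = fit_blend_weights(pred_a, pred_b, target)
```

The weights should be fitted on validation or out-of-fold predictions, not on predictions for the rows the base models were trained on.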
Paper:
Ensemble Selection from Libraries of Models
Abstract:
Present a method for constructing ensembles from libraries of thousands of models. Forward stepwise selection is used to add to the ensemble the models that maximize its performance. Ensemble selection allows ensembles to be optimized to performance metric such as accuracy, cross entropy, mean precision, or ROC Area.
Implementation:
Kaggle: Planet: Understanding the Amazon from Space – 6th place
Kaggle: Quora Question Pairs – 33rd place
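Forward stepwise ensemble selection can be sketched as a greedy loop: at each step, add the model whose inclusion most improves the averaged prediction on a validation set, and stop when nothing helps. The sketch assumes mean squared error as the metric and allows a model to be picked more than once; the function names and `max_steps` are hypothetical.

```python
def mse(pred, target):
    return sum((p - y) ** 2 for p, y in zip(pred, target)) / len(target)

def ensemble_selection(library, target, max_steps=10):
    """Greedy forward selection; library maps name -> validation predictions."""
    chosen = []
    total = [0.0] * len(target)   # running sum of the chosen predictions
    best_score = float("inf")
    for _ in range(max_steps):
        candidate = None
        for name, pred in library.items():
            k = len(chosen) + 1
            avg = [(s + p) / k for s, p in zip(total, pred)]
            score = mse(avg, target)
            if score < best_score - 1e-12:   # require a real improvement
                best_score, candidate = score, name
        if candidate is None:
            break                             # no model improves the ensemble
        chosen.append(candidate)
        total = [s + p for s, p in zip(total, library[candidate])]
    return chosen, best_score

target = [1, 0, 1, 0]
library = {
    "A": [0.8, 0.2, 0.8, 0.2],    # under-confident
    "B": [1.2, -0.2, 1.2, -0.2],  # over-confident
    "C": [0.5, 0.5, 0.5, 0.5],    # uninformative
}
chosen, score = ensemble_selection(library, target)
```

Here models A and B err in opposite directions, so their average beats either one alone, while C is never selected.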
Benefits:
Support for conditional parameters (e.g. jointly tune number of layers and dropout for each layer; dropout on second layer will be tuned only if n_layers > 1)
Explicit handling of model variance. SMAC trains several instances of each model on different seeds, and compares models only if instances were trained on same seed. One model wins if it’s better than another model on all equal seeds.
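The seed-wise comparison rule described above can be written down directly. This illustrates the rule as stated here, not SMAC's actual API; the function name and the dictionary layout (seed mapped to validation score, higher is better) are assumptions.

```python
def wins(scores_a, scores_b):
    """True if model A beats model B on every seed both were trained on."""
    shared = set(scores_a) & set(scores_b)
    if not shared:
        return False              # no common seed, so no comparison
    return all(scores_a[s] > scores_b[s] for s in shared)

model_a = {0: 0.81, 1: 0.79, 2: 0.80}
model_b = {0: 0.78, 1: 0.77}
model_c = {1: 0.80, 2: 0.70}

print(wins(model_a, model_b))   # A is better on both shared seeds
print(wins(model_a, model_c))   # A loses on seed 1, so no winner
print(wins(model_c, model_a))   # C loses on seed 2, so no winner
```

Comparing only on shared seeds keeps seed-to-seed noise from being mistaken for a real difference between configurations.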
StackNet Published by Top 1 Kaggler KazAnova.
StackNet is a computational, scalable and analytical framework implemented in Java that resembles a feedforward neural network and uses Wolpert’s stacked generalization in multiple levels to improve accuracy in machine learning problems. In contrast to feedforward neural networks, rather than being trained through back propagation, the network is built iteratively one layer at a time (using stacked generalization), each of which uses the final target as its target.
Paper: (http://ieeexplore.ieee.org/abstract/document/5567108/)
Usage:
Directly estimate the thickness of the haze and recover a high-quality haze-free image.
Implementation:
Kaggle: Planet: Understanding the Amazon from Space – 1st place
Advantage:
Data augmentation adds new samples that lie within the original distribution, using prior knowledge from EDA or experience. This lets the training set represent the distribution better and mitigates overfitting.
Notice [8]:
It is better not to enlarge the range of the data samples. For example, if the brightness in the train and test sets is uniform and similar, we should not augment brightness. Otherwise training becomes more difficult without much gain in performance.
In addition, it is better if the augmented data differ noticeably from the original data. Very slight changes, such as changes of a few pixels, are negligible for model training.
Implementation:
Kaggle: Invasive Species Monitor – 3rd place
Kaggle: Planet: Understanding the Amazon from Space – 6th place
Kaggle: Planet: Understanding the Amazon from Space – 9th place
I trained different networks with 64×64, 224×224, and 256×256 inputs, and I used dilation in a ResNet-34 network. Different networks have different capabilities on different labels. For example, SimpleNet 64 performs well on the label clear.
In the last section we mentioned data augmentation for the training dataset; test-time augmentation (TTA) is also very useful for improving a model.
It is also very easy to implement. For example, we apply rotations and flips to the test dataset and predict on all of the transformed data as well. Then we can combine the predictions, for example by averaging or voting, to get the final result.
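A minimal sketch of test-time augmentation with flips and 90-degree rotations, averaging the predictions over all eight views. Images are plain nested lists here, and `model` is any callable returning a number; both are assumptions made to keep the example dependency-free.

```python
def hflip(img):
    """Mirror left to right."""
    return [row[::-1] for row in img]

def vflip(img):
    """Mirror top to bottom."""
    return img[::-1]

def rot90(img):
    """Rotate 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def tta_predict(model, img):
    """Average the model's prediction over 8 flipped and rotated views."""
    views = [img, hflip(img), vflip(img), hflip(vflip(img))]
    views += [rot90(v) for v in views]
    preds = [model(v) for v in views]
    return sum(preds) / len(preds)

img = [[1, 2],
       [3, 4]]
# Toy "model" that only looks at the top-left pixel: a single view gives 1,
# while averaging over all views gives the mean of the four corners.
toy_model = lambda v: v[0][0]
print(tta_predict(toy_model, img))
```

This only makes sense when the label does not depend on orientation, as with satellite imagery; for per-class probabilities the same averaging is applied to each class.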
Advantages:
Some models are pre-trained, so the training step is very fast and convenient.
Notice: The models sometimes need to be revised.
Implementation: Kaggle: Planet: Understanding the Amazon from Space – 6th place
Packages:
PyTorch: model zoo
Keras
Overall, this paper gathers the most common general techniques that have recently improved model performance, to the best of my knowledge. Most of the sources are recent competitions or papers, but I have also checked some older blogs and discussions. Most of their methods are already included, and the others have been superseded by recent work with better efficiency or accuracy, especially for image competitions. Many other competitions focused on image segmentation or time series; I skimmed several blogs about them but have not included them so far.
Of course, this paper has limitations due to my own ability, and I encourage everyone reading it to discuss their findings and considerations with me. I hope everyone can train more and more useful models, although all models are wrong :P.
I also want to thank Dr. Guang Chen, who gathered many good blogs from different sources and reposted them on Weibo, including interviews, overviews, discussions, and code for top Kaggle solutions. This made it much easier for me to finish this paper. Dr. Chenlong Chen, who was a Top 10 Kaggler, also shared a lot of his experience on Zhihu.
[1] Managing unbalanced data for building machine learning models (http://www.simafore.com/blog/handling-unbalanced-data-machine-learning-models)
[2] Class Imbalance Problem (http://www.chioka.in/class-imbalance-problem/)
[3] Autoencoders (http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/)
[4] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (https://arxiv.org/pdf/1502.03167.pdf)
[5] Glossary of Deep Learning: Batch Normalisation (https://medium.com/deeper-learning/glossary-of-deep-learning-batch-normalisation-8266dcd2fa82)
[6] Training Region-based Object Detectors with Online Hard Example Mining (https://arxiv.org/pdf/1604.03540.pdf)
[7] Neural Network Methods for Natural Language Processing, Yoav Goldberg (http://www.morganclaypool.com/doi/abs/10.2200/S00762ED1V01Y201703HLT037)
[8] Kaggle: Planet: Understanding the Amazon from Space – 6th place (https://zhuanlan.zhihu.com/p/28084438)