Machine learning competitions have become an extremely popular format for
solving prediction and classification problems of all kinds.
The central component of any competition is the leaderboard which ranks all
teams in the competition by the score of their best submission.
Often, participants incorporate the feedback from the leaderboard into the
design of their classifier thus creating a dependence between the classifier
and the data on which it is evaluated.
Our Shiny application is designed
to demonstrate how this dependence leads to a biased estimate of the
classifier’s true performance.
Objective
Typically, the competition is designed such that the data is partitioned into
two sets: a training set (instances with labels) and a test set
(instances without labels).
To avoid overfitting to the test set, the competition organizers further
partition the test set into two parts:
One part of the test set is used for computing scores on the public
leaderboard.
The other is used to rank all submissions after the competition has ended.
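To make this setup concrete, here is a rough R sketch of the split and of the public score \( s_H \) (the sizes and names below, e.g. n_public, true_labels, s_final, are illustrative choices, not the app's actual code):

```r
# Toy test set: N hidden true labels, half scored publicly, half held out.
set.seed(1)

N        <- 4000                                   # total number of test instances
n_public <- 2000                                   # instances used for the public leaderboard
true_labels <- sample(c(0L, 1L), N, replace = TRUE)

public_idx  <- sample(N, n_public)                 # rows behind the public leaderboard
private_idx <- setdiff(seq_len(N), public_idx)     # rows used for the final ranking

# Public score of a submission y: classification error on the public part only.
s_H <- function(y) mean(y[public_idx] != true_labels[public_idx])

# Final (private) score, revealed only after the competition has ended.
s_final <- function(y) mean(y[private_idx] != true_labels[private_idx])
```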
Our goal is to climb the public leaderboard without even looking at the
data.
Algorithm (Wacky Boosting)
Notation
\( y \in \{0,1\}^N \): a submission, i.e., a vector of \( N \) predicted labels
\( s_H(y) \): public score of a submission \( y \)
Algorithm
Choose \( y_1,...,y_k \in \{0,1\}^N \) uniformly at random.
Let \( I= \{ i \in [k]:s_H(y_i)<0.5 \} \).
Output \( \hat{y} = \text{majority} \{ y_i:i \in I\} \), where the majority is
component-wise.
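Here is a minimal R sketch of these three steps, reusing the illustrative s_H(), s_final(), and N from the sketch above (k, Y, and y_hat are again made-up names):

```r
# Wacky boosting; assumes s_H(), s_final(), and N from the previous sketch.
k <- 200

# Step 1: k submissions chosen uniformly at random (one per row).
Y <- matrix(sample(c(0L, 1L), k * N, replace = TRUE), nrow = k)

# Step 2: keep only the submissions whose public score happens to beat chance.
I <- which(apply(Y, 1, s_H) < 0.5)

# Step 3: component-wise majority vote over the selected submissions.
y_hat <- as.integer(colMeans(Y[I, , drop = FALSE]) > 0.5)

s_H(y_hat)       # well below 0.5: the public leaderboard rewards y_hat
s_final(y_hat)   # around 0.5: pure chance on the held-out part
```

Each selected \( y_i \) is, by construction, slightly biased toward the correct label on every public instance, so the component-wise majority amplifies that bias on the public part; on the private part the selection carries no information, so \( \hat{y} \) stays at chance.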
Results
Lo and behold, this is what happens:
We keep climbing the public leaderboard! :)
But wacky boosting does nothing whatsoever on the final test set :(
Further Reading
To see what just happened, see
the excellent blog post on Moody Rd.
To play around with a dynamic simulation of this phenomenon,
click here.