Kaggle competitions are data-oriented challenges in which teams and individuals try to make accurate predictions about unlabeled data using a wide variety of statistical techniques and machine learning methods. Some approaches work better for some problems than others, but which strategies work for all problems? This report investigates iteration as a strategy for completing Kaggle competitions. Iteration, as a problem-solving strategy, means making many stepwise changes to a solution with the goal of improving it incrementally each time. Do teams that make more submissions have a better chance of winning Kaggle competitions?

What is Kaggle?

Kaggle competitions are data-oriented challenges in which teams and individuals use statistics and machine learning to make accurate predictions about unlabeled data. For example, a standard Kaggle competition is to predict the survivors of the Titanic. Given a passenger’s ticket information, how accurately can teams predict whether or not the passenger survived? This example is a bit more morbid than usual, but it illustrates the fact that perfect performance is rarely possible in a Kaggle competition. It is unlikely that anyone can predict exactly who survived the Titanic based only on ticket information, but models can do far better than chance. Teams improve their predictions by refining their models, and they can use any method in statistics and machine learning to do so, which means there are many possible solutions to any Kaggle competition. The goal of this report is to determine whether making more submissions is an effective strategy for completing Kaggle competitions over and above other predictors of final place.

Survival rates for Titanic passengers. Women were more likely to survive than men, and survival rate by age varied by gender. In a Kaggle competition, these relationships and others are used to build predictive models that guess the likelihood of survival based only on a passenger's ticket information.

Number of submissions

Teams that make more submissions do better overall. Top-placing teams make more submissions across competitions (top left) and within the same competition (top right). Making more total submissions is associated with a better final place (bottom left). Subsequent submissions made by individual teams improved their predicted final place (bottom right).

Submission interval

Teams that make more submissions also tend to spend more time on the problem (r = 0.49). Perhaps it is the amount of time spent on the problem that is positively related to competition success, and not the number of submissions made by each team. Even though the two values are positively correlated, average submission interval does not vary among top 100 place teams across competitions (top left). One possible explanation is that the measure of submission interval is inadequate, either because it varies dramatically across competitions or because submission intervals are already at ceiling (in which case there is no room for variation between teams). However, converting submission intervals into percentages of competition time used gives the same result: submission interval is largely unrelated to final place (top right).
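
The conversion to a percentage of competition time can be sketched as follows. This is a minimal illustration on made-up values, not the report's actual code: the data frame `submissions`, its columns `TeamId` and `SubmissionDay`, and the 100-day competition length are all hypothetical, and it assumes that a team's submission interval is the span between its first and last submission.

```r
# Toy data: hypothetical column names and values, for illustration only.
submissions <- data.frame(
  TeamId = c(1, 1, 1, 2, 2),
  SubmissionDay = c(2, 10, 50, 40, 45)  # days since the competition opened
)
competition_days <- 100  # assumed competition length in days

# Submission interval per team: last submission day minus first submission day.
total_time <- tapply(
  submissions$SubmissionDay,
  submissions$TeamId,
  function(days) max(days) - min(days)
)

# Express each interval as a proportion of the competition's duration,
# so that competitions of different lengths are put on the same scale.
prop_time_used <- total_time / competition_days
print(prop_time_used)
```

Dividing by the competition's own duration is what makes intervals comparable across competitions that run for different lengths of time.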

Correlation

Given the positive correlation between submissions and total time, it’s worth exploring the relationship between these two values in more detail. Both distributions are strongly positively skewed (possibly Poisson-like). Although the majority of teams fall close to the origin, making only a few submissions, there is a full range of team types in the Kaggle data.

A different way to summarize these data is by place. Here the average number of submissions and the average total time for the teams in each of the top 100 places are plotted. This view of the data reveals that what separates top-placing teams is not how much time they spend on the problem but how many submissions they make.
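
A by-place summary of this kind could be computed roughly as below. The data frame `teams` and its values are invented for illustration; the column names `Place`, `TotalSubmissions`, and `TotalTime` are assumptions modelled on the variable names used in this report. Each place appears more than once because teams from several competitions can finish in the same place.

```r
# Toy data: two hypothetical teams per place, drawn from different competitions.
teams <- data.frame(
  Place = c(1, 1, 2, 2, 3, 3),
  TotalSubmissions = c(40, 60, 30, 50, 10, 30),
  TotalTime = c(20, 30, 25, 35, 22, 28)
)

# Average both measures within each final place.
by_place <- aggregate(
  cbind(TotalSubmissions, TotalTime) ~ Place,
  data = teams,
  FUN = mean
)
print(by_place)
```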

The results of a simple linear model fit to these data reveal that making more submissions is associated with an improvement in place (a decrease in place number), while spending more time on the problem is associated with an increase in place number, and so a worse place overall. This discrepancy in the signs of the two coefficients suggests that there might be some collinearity between the predictors.

interaction_mod <- lm(Place ~ TotalSubmissionsZ * TotalTimeZ, data = top_100)

term                            estimate  std.error   statistic    p.value
(Intercept)                    44.101145  0.2285496  192.960902  0.0000000
TotalSubmissionsZ              -3.400067  0.3736674   -9.099181  0.0000000
TotalTimeZ                      1.791212  0.2416723    7.411738  0.0000000
TotalSubmissionsZ:TotalTimeZ   -0.270573  0.1822703   -1.484460  0.1377035
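
One way to follow up on the collinearity concern is to check how strongly the two predictors are correlated and what variance inflation that implies. The sketch below runs on simulated data, not the Kaggle data. The data frame `top_100_sim` is hypothetical, and it assumes that the `Z` suffix in the model above denotes z-scored predictors. For a model with exactly two predictors, the variance inflation factor is 1 / (1 - r^2), so it is computed here for a main-effects model without the interaction term.

```r
set.seed(1)
n <- 200

# Simulated, positively correlated predictors (illustrative only).
total_time <- rexp(n, rate = 1 / 30)
total_submissions <- rpois(n, lambda = 5 + 0.3 * total_time)

top_100_sim <- data.frame(
  Place = sample(1:100, n, replace = TRUE),  # random places, no real effect
  TotalSubmissionsZ = as.numeric(scale(total_submissions)),  # z-scores
  TotalTimeZ = as.numeric(scale(total_time))
)

# Correlation between the predictors and the implied variance inflation factor.
r <- cor(top_100_sim$TotalSubmissionsZ, top_100_sim$TotalTimeZ)
vif <- 1 / (1 - r^2)

# Main-effects model; compare its standard errors with the VIF in mind.
main_effects_mod <- lm(Place ~ TotalSubmissionsZ + TotalTimeZ, data = top_100_sim)
print(summary(main_effects_mod)$coefficients)
print(c(r = r, vif = vif))
```

A VIF close to 1 indicates little inflation of the standard errors. As a point of reference, if the correlation of r = 0.49 reported earlier also held in the modelled subset, the same formula would give a VIF of about 1.32.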

Team types

Rather than thinking about submissions and submission intervals as continuous dimensions, we can use median splits to divide the teams into four types based on where they fall on these dimensions. Below are four views of the same data depicting the relationship between number of submissions and submission interval in four quadrants, where each quadrant is a type of team. Steady and short teams make submissions in proportion to the amount of time spent on the problem. Long teams make few submissions, but engage with the problem over the duration of the competition. Rapid teams make many submissions over a short interval. If the number of submissions matters more in determining final place than the total length of time spent on the problem, then steady and rapid teams should perform better than short and long teams.
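
The median-split assignment described above can be sketched like this. The data frame `teams` and its values are made up for illustration, and the handling of ties is an assumption: a team whose value equals the median is placed in the lower group.

```r
# Toy data: hypothetical teams, for illustration only.
teams <- data.frame(
  TotalSubmissions = c(2, 3, 40, 50, 4, 45),
  TotalTime = c(1, 60, 5, 70, 2, 80)
)

# Median splits on each dimension (values equal to the median count as "low").
many_subs <- teams$TotalSubmissions > median(teams$TotalSubmissions)
long_time <- teams$TotalTime > median(teams$TotalTime)

# Combine the two splits into the four team types.
teams$TeamType <- ifelse(
  many_subs & long_time, "steady",
  ifelse(many_subs & !long_time, "rapid",
         ifelse(!many_subs & long_time, "long", "short"))
)
print(teams)
```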

When the actual data are divided into team types based on median splits, the positive skew in the distributions and the positive correlation between the two variables both become apparent (left). Although the first quadrant (steady teams) and the third quadrant (short teams) cover very different ranges of the two variables, there are approximately the same number of teams in both (middle). Investigating the average place of teams in each quadrant revealed that the rapid teams were the most successful in completing the Kaggle competitions (right). That is, the teams that spent less time on the problem than half of the other teams, but made more submissions than half of the other teams, placed better overall than every other type, including the steady teams that made just as many submissions over a longer time.
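
Counting the teams of each type and averaging their final place could look roughly like the following. The `teams` data frame and all of its values are arbitrary placeholders, not results from the Kaggle data; only the column names `TeamType` and `Place` follow the terms used in this report.

```r
# Toy data: arbitrary places for two hypothetical teams of each type.
teams <- data.frame(
  TeamType = c("steady", "steady", "rapid", "rapid",
               "long", "long", "short", "short"),
  Place = c(20, 30, 10, 14, 60, 70, 50, 40)
)

# Number of teams of each type.
n_teams <- table(teams$TeamType)

# Mean final place per type; a lower place number is a better finish.
mean_place <- tapply(teams$Place, teams$TeamType, mean)

print(n_teams)
print(mean_place[order(mean_place)])
```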

Distribution and performance of team types. Team types were assigned by median splitting both number of submissions and submission interval. The correlation between number of submissions and submission interval explains why there are more **short** and **steady** teams than there are **long** and **rapid** teams. When looking at how teams of each type do in the final rankings, **rapid** teams outperform the others.