This brief analysis looks at the results of the Higgs Boson Machine Learning Challenge hosted by Kaggle. The competition documents state that “The goal of the Higgs Boson Machine Learning Challenge is to explore the potential of advanced machine learning methods to improve the discovery significance of the experiment.” The competition had 1941 individuals participating in 1785 teams. There were over 35,000 submissions, with many teams making more than 50 each.
The competition was evaluated using a custom metric called Approximate Median Significance (AMS). This is a non-linear metric that rewards the total weight of true positives and penalizes the total weight of false positives.
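For reference, here is a minimal R sketch of the AMS calculation as I understand it from the competition’s evaluation page, where s and b are the weighted sums of selected signal and background events and b_r = 10 is a regularization constant:
# Approximate Median Significance (AMS) - a minimal sketch of the competition metric
# s: total weight of true positives (selected signal events)
# b: total weight of false positives (selected background events)
# b_r: regularization constant (10 in this competition)
ams <- function(s, b, b_r = 10) {
  sqrt(2 * ((s + b + b_r) * log(1 + s / (b + b_r)) - s))
}
ams(s = 350, b = 200)  # example values only, not taken from the competition data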
On a personal note, I completed this analysis while waiting for a root canal. I’m sure there is some insight here in comparing excruciating tooth pain with statistical analysis. But I’m in too much pain right now to figure it out.
This analysis indicates that both skill and luck were significant factors in winning this competition. Significant modeling skill is required to understand how to extract the most useful information from this complex data. Skill is also required to build a model that produces consistent results when processing new data. There is also a strong element of luck involved, since the difference between the leaderboard scores of the top performers is quite small compared with model variance.
Simple recommendation: Participants would benefit from additional information on their public leaderboard performance. Perhaps Kaggle could split the public leaderboard results into two equal subsets and report both scores along with the average score for ranking. This would give participants insight into how well their model performs against different subsets of the public data, and it would give competitors a better view of one another’s performance. Finally, it would give Kaggle administrators a tool to produce better leaderboards.
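As a rough illustration of the two-subset idea, the sketch below scores each half of a held-out set separately using the ams() helper above. The public_test data frame, its Label, Weight, and Pred columns, and the even split are all hypothetical, and in practice the event weights would also need to be renormalized for a half-sized subset.
# Hypothetical sketch: score two halves of a public test set separately
# public_test is assumed to have columns Label ("s"/"b"), Weight, and a
# predicted class Pred ("s"/"b")
split_scores <- function(public_test) {
  n    <- nrow(public_test)
  half <- seq_len(n) %in% sample(n, n %/% 2)  # random 50/50 split
  score <- function(df) {
    s <- sum(df$Weight[df$Pred == "s" & df$Label == "s"])
    b <- sum(df$Weight[df$Pred == "s" & df$Label == "b"])
    ams(s, b)  # weights would need rescaling for a half-sized subset in practice
  }
  scores <- c(half1 = score(public_test[half, ]), half2 = score(public_test[!half, ]))
  c(scores, mean = mean(scores))
}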
I start with a function to pull leaderboard information based on code from David Thaler. This function grabs data from both the public and private leaderboards at the end of the competition.
#################################################################
# Packages and Functions to make life easier
#################################################################
library(RCurl, quietly = TRUE)
library(XML, quietly = TRUE)
library(plyr, quietly = TRUE)
leaderboard <- function(url.root){
  # url.root is the bare LB url ending in /leaderboard
  # usage: leaderboard('http://www.kaggle.com/c/mlsp-2014-mri/leaderboard')
  # Modified from shakeup function published by David Thaler
  # Note that this function pulls the leaderboard status at the end of the competition
  # per release notes by Kaggle Admin Jeff Moser published at,
  # http://www.kaggle.com/forums/t/827/reliving-the-leaderboards-of-the-past
  pub.url <- paste0(url.root, '/public')
  pvt.url <- paste0(url.root, '/private')
  pub.raw <- getURL(pub.url)
  pvt.raw <- getURL(pvt.url)
  pub.doc <- htmlTreeParse(pub.raw, useInternalNodes = TRUE)
  pvt.doc <- htmlTreeParse(pvt.raw, useInternalNodes = TRUE)
  pub.ids <- xpathSApply(pub.doc, '//tr[@id]/@id')
  pvt.ids <- xpathSApply(pvt.doc, '//tr[@id]/@id')
  pub.score <- as.numeric(xpathSApply(pub.doc, "//abbr[@class='score']", xmlValue))
  pvt.score <- as.numeric(xpathSApply(pvt.doc, "//abbr[@class='score']", xmlValue))
  n <- length(pub.ids)
  pub.df <- data.frame('id' = pub.ids, 'pub.idx' = 1:n, 'pub.score' = pub.score)
  pvt.df <- data.frame('id' = pvt.ids, 'pvt.idx' = 1:n, 'pvt.score' = pvt.score)
  all.df <- join(pub.df, pvt.df)
  return(all.df)
}
Let’s pull the leaderboard data as it stood at the end of the competition.
#################################################################
# Get Data from Kaggle Leaderboard
#################################################################
contest_url <- "http://www.kaggle.com/c/higgs-boson/leaderboard"
higgs_lb <- leaderboard(contest_url)
## Joining by: id
Now that we have the data, let’s start by looking at histograms of the public and private leaderboards. I’ve marked the various starting kits provided for this competition. Many participants just uploaded the random submission data. The Naive Bayes method was also quite popular. The XGBoost model was a popular choice later in the competition, and it seems that many participants who used that model were unable to improve upon it.
# Generate histogram of AMS Scores - This includes all submissions that improved a previous AMS
hist(higgs_lb$pub.score, breaks = 400, main = "Higgs Boson Public Leaderboard at End of Competition", xlab = "Public AMS", xlim = c(0,4), ylim = c(0,200))
# Add benchmarks
abline(v = 0.58477, col="blue", lty = 3) # Random submission
abline(v = 1.54451, col="blue", lty = 3) # Simple window (single feature)
abline(v = 2.06036, col="blue", lty = 3) # Naive Bayes
abline(v = 3.24954, col="blue", lty = 3) # TMVA Boosted trees
abline(v = 3.34085, col="blue", lty = 3) # Multiboost
abline(v = 3.60003, col="blue", lty = 3) # XGBoost kit
# Generate histogram of AMS Scores - This includes all submissions that improved a previous AMS
hist(higgs_lb$pvt.score, breaks = 400, main = "Higgs Boson Private Leaderboard at End of Competition", xlab = "Private AMS", xlim = c(0,4), ylim = c(0,200))
# Add benchmarks
abline(v = 0.58648, col="blue", lty = 3) # Random submission
abline(v = 1.53518, col="blue", lty = 3) # Simple window (single feature)
abline(v = 2.06021, col="blue", lty = 3) # Naive Bayes
abline(v = 3.19956, col="blue", lty = 3) # TMVA Boosted trees
abline(v = 3.40488, col="blue", lty = 3) # Multiboost
abline(v = 3.64655, col="blue", lty = 3) # XGBoost kit
The private AMS distribution looks very similar to the public AMS distribution. Let’s dig a little deeper by examining the relationship between the public and private AMS scores for each team. The metrics are closely related. There are a few interesting outliers, but nearly all the points are clustered along the equivalence line.
# Generate some plots of Private AMS vs Public AMS
plot(higgs_lb$pub.score, higgs_lb$pvt.score, main="Higgs Boson Leaderboard", xlab="Public AMS", ylab="Private AMS")
lines(c(0,4),c(0,4))
I calculated Private AMS improvement by subtracting Public AMS from Private AMS. With this metric, we can look at how much the AMS changes between the public and private leaderboards. Now this is interesting. We still see the occasional outlier point. The variation seems to increase with improving AMS scores. But there is something unusual going on for AMS values around 3.5. Let’s adjust the scales to look closer at the points with AMS between 3.0 and 4.0.
plot(higgs_lb$pub.score, higgs_lb$pvt.score-higgs_lb$pub.score, main="Higgs Boson Leaderboard AMS Improvement", xlab="Public AMS", ylab="Private AMS Improvement")
abline(h = 0, col="blue", lty = 3)
plot(higgs_lb$pub.score, higgs_lb$pvt.score-higgs_lb$pub.score, main="Higgs Boson Leaderboard AMS Improvement", xlab="Public AMS", ylab="Private AMS Improvement", ylim = c(-0.2,0.2))
abline(h = 0, col="blue", lty = 3)
plot(higgs_lb$pub.score, higgs_lb$pvt.score-higgs_lb$pub.score, main="Higgs Boson Leaderboard AMS Improvement", xlab="Public AMS", ylab="Private AMS Improvement", ylim = c(-0.2,0.2), xlim = c(3.0,4.0))
abline(h = 0, col="blue", lty = 3)
It looks like teams who had AMS scores below 3.6 generally gained 0.05 AMS points on the private leaderboard. However, the teams at the very top of the public leaderboard generally lost 0.05 AMS points on the private leaderboard. It seems clear to me that this is due to random over-fitting on the public leaderboard test data. This picture is somewhat distorted by the way the Kaggle leaderboard works. The public scores shown are the best result across all submissions by the team, while private leaderboard scores are calculated from only the two submissions each team selects. This is bound to create some shakeup, because it is impossible to predict which model will have the best private score. Teams must pick the submissions they are most confident in, typically based on a technique like cross-validation. But this leads to a crisis of confidence, because the models with the smallest CV variance may not score best on the leaderboard.
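To make the model-selection step concrete, here is a minimal sketch, not taken from any particular team’s workflow, of comparing candidate submissions by cross-validated AMS. The fold scores below are simulated stand-ins for real CV results:
# Hypothetical matrix of AMS scores: one row per CV fold, one column per candidate model
set.seed(42)
cv_ams <- cbind(modelA = rnorm(5, mean = 3.55, sd = 0.04),
                modelB = rnorm(5, mean = 3.60, sd = 0.10))
cv_mean <- apply(cv_ams, 2, mean)
cv_sd <- apply(cv_ams, 2, sd)
# One common heuristic: pick the model with the best mean AMS,
# discounted by its fold-to-fold variability
data.frame(cv_mean, cv_sd, penalized = cv_mean - cv_sd)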
OK, let’s change gears and look at the public and private rankings. Since there is so much variation in the top AMS scores, we would expect to see a lot of variation in the top rankings. Sure enough, there is a lot of variation in the top positions. Again, I calculate rank improvement to get a better sense of what is going on.
# Plot of Private Rank vs Public RANK
plot(higgs_lb$pub.idx, higgs_lb$pvt.idx, main="Higgs Boson Leaderboard", xlab="Public Rank", ylab="Private Rank")
# Plot of Private Rank Improvement vs Public RANK
plot(higgs_lb$pub.idx, higgs_lb$pvt.idx-higgs_lb$pub.idx, main="Higgs Boson Leaderboard Rank Improvement", xlab="Public Rank", ylab="Private Rank Improvement")
# Plot range of interest in Private Rank Improvement vs Public RANK
plot(higgs_lb$pub.idx, higgs_lb$pvt.idx-higgs_lb$pub.idx, main="Higgs Boson Leaderboard Rank Improvement", xlab="Public Rank", ylab="Private Rank Improvement", xlim = c(0,500))
abline(h = 0, col="blue", lty = 3)
There is a lot of noise in this plot. The flat line around rank of 500 is the group of XGBoost submissions mentioned during the initial AMS histogram analysis. Each of these teams advanced about 115 positions. Let’s look at the rank improvement related to public AMS scores.
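One way to sanity-check that claim is to pull out the teams whose public score sits exactly on the XGBoost benchmark value marked earlier and summarize their rank change:
# Teams sitting exactly on the XGBoost benchmark public score
xgb_bench <- higgs_lb[higgs_lb$pub.score == 3.60003, ]
nrow(xgb_bench)                                 # size of the flat line
summary(xgb_bench$pvt.idx - xgb_bench$pub.idx)  # negative values = positions gained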
# Private Rank Improvement vs AMS
plot(higgs_lb$pub.score, higgs_lb$pvt.idx-higgs_lb$pub.idx, main="Higgs Boson Leaderboard Rank Improvement", xlab="Public AMS", ylab="Private Rank Improvement")
plot(higgs_lb$pub.score, higgs_lb$pvt.idx-higgs_lb$pub.idx, main="Higgs Boson Leaderboard Rank Improvement", xlab="Public AMS", ylab="Private Rank Improvement", xlim = c(3.0,4.0))
abline(h = 0, col="blue", lty = 3)
There is quite a bit of mixing for teams with Public AMS around 3.6. Many moved ahead by 200 to 400 places on the Private leaderboard. But many fell back 200 to 400 places on the Private leaderboard.
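A rough way to quantify that mixing is to summarize the rank change for teams in a narrow band of public AMS scores; the 3.55 to 3.65 window below is an arbitrary choice for illustration:
# Rank movement for teams with public AMS near 3.6
band <- higgs_lb[higgs_lb$pub.score >= 3.55 & higgs_lb$pub.score <= 3.65, ]
summary(band$pvt.idx - band$pub.idx)  # positive = fell back, negative = moved up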
Let’s look at the interaction of AMS improvement and rank improvement. To set some reference points, we need subsets of the top 10 teams on the private leaderboard and the XGBoost submissions.
# Subsets of top 10 Private Leaderboard Teams, and the XGBoost teams
top_lb <- higgs_lb[higgs_lb$pvt.idx<=10, ]
xgb_lb <- higgs_lb[higgs_lb$pvt.score==3.64655 & higgs_lb$pub.score==3.60003, ]
Well, this is interesting. There are some clear patterns here, along with what looks like a lot of noise. First, the overall pattern indicates that a larger AMS improvement tends to move a team up the board, while a larger drop in AMS tends to push a team down. This makes sense, but there are some striking anomalies. There are quite a few points in the upper right quadrant, which indicates teams that improved their AMS but still lost ground in the ranking. This is likely due to other teams having even better improvements in AMS. There are essentially no points in the lower left quadrant, which means there were no teams that improved their ranking while losing ground in their AMS score. There are also quite a few linear clusters. I suspect that these clusters represent similar models that systematically outperform other types of models, or are systematically beaten by them.
# Private Rank Improvement vs Private AMS Improvement
plot(higgs_lb$pvt.score-higgs_lb$pub.score, higgs_lb$pvt.idx-higgs_lb$pub.idx, main="Higgs Boson Leaderboard Rank Improvement", xlab="Private AMS Improvement", ylab="Private Rank Improvement", xlim = c(-0.2,0.2), ylim = c(-400,400))
abline(h = 0, col="blue", lty = 3); abline(v = 0, col="blue", lty = 3)
points(top_lb$pvt.score-top_lb$pub.score, top_lb$pvt.idx-top_lb$pub.idx, col="blue", pch=19)
points(xgb_lb$pvt.score-xgb_lb$pub.score, xgb_lb$pvt.idx-xgb_lb$pub.idx, col="red", pch=19)
legend("topright", legend = c("Top Teams", "XGBoost Teams"), col=c("blue","red"), pch=c(19,19))
So what did we learn from all this? There is quite a bit of noise in the public leaderboard. This is to be expected, and the heat of competition drives many teams to over-fit their models. On the public leaderboard, there is no penalty for over-fitting.