The present exercise provides a high-scoring solution to a kaggle competition that is obtained by exploiting a data leakage and leaderboard probing.
The solution herein presented has reached 0.9912 accuracy.
This exercise is undertaken as a part of the Coursera MOOC, How to Win a Data Science Competition.
It proposes to solve a real kaggle competition just by exploiting a data leakage of the test data-set via leaderboard probing technique. Thus neither training data set nor machine learning models are required.
The contest is a binary classification of images’ pairs, e.i.: whether a determined pair of images belong to the only category or not.
The required solution consist on two-columns matrix, where:
Solution can be obtained by performing an Exploratory Data Analysis (EDA) on the test data-set, which contains only 3 columns: pair ID, first image ID and second image ID:
df<- read.csv("test_pairs.csv")
head(df)
## pairId FirstId SecondId
## 1 0 1427 8053
## 2 1 17044 7681
## 3 2 19237 20966
## 4 3 8005 20765
## 5 4 16837 599
## 6 5 3657 12504
A very first plot of the first and second images’ IDs shows a strange pattern (values paralell to blue line) that can be exploited via leaderboard probing.
library(ggplot2)
set.seed(42)
sam<-df[sample(nrow(df), 2000),]
g<-ggplot(sam, aes(x=FirstId, y=SecondId, xlab= "First Image ID", ylab= "Second Image ID")) + geom_point(alpha= 0.3) + geom_abline(slope=1, intercept = 0, col="blue")
g
Different hypotheses based upon the observed pattern are going to be proposed and will be adapted depending on the results that are obtained when submitting a solution to the competition.
hypothesis: All images’ pairs are category 1.
Accuracy: 0.5
Conclusion: This indicates that there is a pre-designed distribution of number of pairs in category 1 and 0.
df$Prediction<-0
df$Prediction<-ifelse(df$SecondId > 12300,1,0)
sam<-df[sample(nrow(df), 2000),]
g<-ggplot(sam, aes(x=FirstId, y=SecondId, col = factor(Prediction))) +
geom_point(alpha= 0.3) +
geom_abline(slope=0, intercept = 12300, col="blue") +
scale_color_manual(values=c("black", "red")) +
labs(col = "Category") +
xlab("First Image ID") +
ylab("Second Image ID")
g
hypothesis: Only pair IDs whose second images’ IDs values are above 12300.
Accuracy: 0.8984
Conclusion:This shows that the observed pattern was not casual, the category 1 pairs are actually following it. However there is an area where noisy pairs are mixed up with those from the pattern.
## code for s2 in the annex section
s2<-s2[sample(nrow(s2), 2000),]
g<- ggplot(s2, aes(x= FirstId, y= SecondId, col = as.factor(Prediction))) +
geom_point(alpha= 0.3) +
scale_color_manual(values=c("black", "red")) +
labs(col = "Category") +
xlab("First Image ID") +
ylab("Second Image ID")
g
hypothesis: Only pair IDs following the pattern.
Accuracy: 0.9914
Conclusion:See annex for the walkthrough of this solution.
High accuracy has been reached just by exploiting a data leakage of the test data-set. Hypotheses have been developed from EDA and confirmed via leaderboard probing. Only three attempts are required to propose a competent solution.
There is noisy data that can be easily removed:
## Step 1: Remove X < Y
theZone<-df[df$FirstId<= df$SecondId,]
tz<-theZone[sample(nrow(theZone), 2000),]
g1<-ggplot(tz, aes(x=FirstId, y=SecondId))+
geom_point(alpha= 0.3) +
geom_abline(slope=1, intercept = 0, col="blue") +
labs(col = "Category", title = "Step 1: Remove X < Y") +
xlab("First Image ID") +
ylab("Second Image ID")
g1
In order to capture the pattern, the vertical component of it will be removed by performing a 45° turn:
rotate2D<- function(x, angle){
r<-x
for (i in 1:nrow(x)){
x1<- x[i,1]
y1<- x[i,2]
r[i,1]<-round((x1*cos(angle) - y1*sin(angle)))
r[i,2]<-round((x1*sin(angle) + y1*cos(angle)))
}
return(r)
}
tz90<-rotate2D(theZone[,2:3], -pi/4)
tz90<-tz90[tz90$FirstId <= 35000,]
tz9<-tz90[sample(nrow(tz90), 2000),]
g2<-ggplot(tz9, aes(x=FirstId, y=SecondId))+
geom_point(alpha= 0.3) +
labs(col = "Category", title = "Step 2: 45° turn") +
xlab("First Image ID") +
ylab("Second Image ID")
g2
Now, those pairs’ IDs whose second IDs values are bigger than the average of the occurrences of the second IDs are retained as category 1:
tz90sub<-tz90[sample(nrow(tz90), 2000),]
g3 <-ggplot(tz90sub, aes(x=FirstId, y=SecondId, col= factor(col)))+
geom_point(alpha= 0.3, show.legend = FALSE) +
scale_color_manual(values=c("black", "red")) +
coord_cartesian(ylim = c(0, 8800), xlim=c(0,17500)) +
labs(title = "Step 3: Second IDs histogram") +
xlab("First Image ID") +
ylab("Second Image ID")
a<-aggregate(tz90$SecondId, by = list(tz90$SecondId), length)
a$high <- ifelse(a$x>mean(a$x), 1, 0)
g4<-ggplot(a, aes(x=Group.1, y=x, fill= factor(high)))+
geom_bar(stat = "identity") +
geom_abline(slope=0, intercept = mean(a$x), col="blue") +
coord_flip(xlim=c(0, 8800), ylim = c(0, 100)) +
scale_fill_manual(values=c("black", "red"), name= "Category", labels=c("0", "1")) +
labs(title = " ") +
xlab("Second Image ID") +
ylab("Frequency")
library(gridExtra)
grid.arrange(g3, g4, ncol=2)
# Compiling all the hypotheses
# Adding Prediction column
df$Prediction<-0
# Accounting for Solution 1
df$Prediction<-ifelse(df$SecondId > 12300,1,0)
sum(df$Prediction)
# Accounting for Solution 2
c<-tz90$pairId[tz90$col==1]
df$Prediction[df$pairId %in% c]<-1
s2<- df