Data Leakage Exploit

Synopsis

The present exercise provides a high-scoring solution to a kaggle competition that is obtained by exploiting a data leakage and leaderboard probing.

The solution herein presented has reached 0.9912 accuracy.

Introduction

This exercise is undertaken as a part of the Coursera MOOC, How to Win a Data Science Competition.

It proposes to solve a real kaggle competition just by exploiting a data leakage of the test data-set via leaderboard probing technique. Thus neither training data set nor machine learning models are required.

Problem

The contest is a binary classification of images’ pairs, e.i.: whether a determined pair of images belong to the only category or not.

The required solution consist on two-columns matrix, where:

the first column is the images pairs’ Ids and
the prediction column, where 1 means the pair belongs to the category and 0 otherwise.

Exploratory Data Analysis

Solution can be obtained by performing an Exploratory Data Analysis (EDA) on the test data-set, which contains only 3 columns: pair ID, first image ID and second image ID:

df<- read.csv("test_pairs.csv")
head(df)

##   pairId FirstId SecondId
## 1      0    1427     8053
## 2      1   17044     7681
## 3      2   19237    20966
## 4      3    8005    20765
## 5      4   16837      599
## 6      5    3657    12504

A very first plot of the first and second images’ IDs shows a strange pattern (values paralell to blue line) that can be exploited via leaderboard probing.

library(ggplot2)

set.seed(42)

sam<-df[sample(nrow(df), 2000),]

g<-ggplot(sam, aes(x=FirstId, y=SecondId, xlab= "First Image ID", ylab= "Second Image ID")) + geom_point(alpha= 0.3) + geom_abline(slope=1, intercept = 0, col="blue")

g

Leaderboard probing

Different hypotheses based upon the observed pattern are going to be proposed and will be adapted depending on the results that are obtained when submitting a solution to the competition.

Solution 0

hypothesis: All images’ pairs are category 1.

Accuracy: 0.5

Conclusion: This indicates that there is a pre-designed distribution of number of pairs in category 1 and 0.

Solution 1

df$Prediction<-0
df$Prediction<-ifelse(df$SecondId > 12300,1,0)
sam<-df[sample(nrow(df), 2000),]

g<-ggplot(sam, aes(x=FirstId, y=SecondId, col = factor(Prediction))) + 
  geom_point(alpha= 0.3) + 
  geom_abline(slope=0, intercept = 12300, col="blue") +
  scale_color_manual(values=c("black", "red")) +
  labs(col = "Category") +
  xlab("First Image ID") + 
  ylab("Second Image ID")

g

hypothesis: Only pair IDs whose second images’ IDs values are above 12300.

Accuracy: 0.8984

Conclusion:This shows that the observed pattern was not casual, the category 1 pairs are actually following it. However there is an area where noisy pairs are mixed up with those from the pattern.

Solution 2

## code for s2 in the annex section  
s2<-s2[sample(nrow(s2), 2000),]

g<- ggplot(s2, aes(x= FirstId, y= SecondId, col = as.factor(Prediction))) + 
  geom_point(alpha= 0.3) + 
  scale_color_manual(values=c("black", "red")) +
  labs(col = "Category") +
  xlab("First Image ID") + 
  ylab("Second Image ID")

g

hypothesis: Only pair IDs following the pattern.

Accuracy: 0.9914

Conclusion:See annex for the walkthrough of this solution.

Conclusion

High accuracy has been reached just by exploiting a data leakage of the test data-set. Hypotheses have been developed from EDA and confirmed via leaderboard probing. Only three attempts are required to propose a competent solution.

ANNEX 1: Solution 2 walkthrough

Step 1: Remove X < Y

There is noisy data that can be easily removed:

## Step 1: Remove X < Y

theZone<-df[df$FirstId<= df$SecondId,]
tz<-theZone[sample(nrow(theZone), 2000),]

g1<-ggplot(tz, aes(x=FirstId, y=SecondId))+ 
  geom_point(alpha= 0.3) + 
  geom_abline(slope=1, intercept = 0, col="blue") + 
  labs(col = "Category", title = "Step 1: Remove X < Y") +
  xlab("First Image ID") + 
  ylab("Second Image ID")

g1

Step 2: 45° turn

In order to capture the pattern, the vertical component of it will be removed by performing a 45° turn:

rotate2D<- function(x, angle){
  
  r<-x
    for (i in 1:nrow(x)){
      x1<- x[i,1]
      y1<- x[i,2] 
      
      r[i,1]<-round((x1*cos(angle) - y1*sin(angle)))
      r[i,2]<-round((x1*sin(angle) + y1*cos(angle)))
  }
    
  return(r)
}

tz90<-rotate2D(theZone[,2:3], -pi/4)
tz90<-tz90[tz90$FirstId <= 35000,]

tz9<-tz90[sample(nrow(tz90), 2000),]

g2<-ggplot(tz9, aes(x=FirstId, y=SecondId))+ 
  geom_point(alpha= 0.3) + 
  labs(col = "Category", title = "Step 2: 45° turn") +
  xlab("First Image ID") + 
  ylab("Second Image ID")

g2

Step 3: Second IDs histogram

Now, those pairs’ IDs whose second IDs values are bigger than the average of the occurrences of the second IDs are retained as category 1:

tz90sub<-tz90[sample(nrow(tz90), 2000),]

g3 <-ggplot(tz90sub, aes(x=FirstId, y=SecondId, col= factor(col)))+ 
  geom_point(alpha= 0.3, show.legend = FALSE) + 
  scale_color_manual(values=c("black", "red")) +
  coord_cartesian(ylim = c(0, 8800), xlim=c(0,17500)) +
  labs(title = "Step 3: Second IDs histogram") +
  xlab("First Image ID") + 
  ylab("Second Image ID")
  

a<-aggregate(tz90$SecondId, by = list(tz90$SecondId), length)
a$high <- ifelse(a$x>mean(a$x), 1, 0)

g4<-ggplot(a, aes(x=Group.1, y=x, fill= factor(high)))+ 
  geom_bar(stat = "identity") + 
  geom_abline(slope=0, intercept = mean(a$x), col="blue") +
  coord_flip(xlim=c(0, 8800), ylim = c(0, 100)) +
  scale_fill_manual(values=c("black", "red"), name= "Category", labels=c("0", "1")) +
  labs(title = " ") +
  xlab("Second Image ID") + 
  ylab("Frequency")

library(gridExtra)
grid.arrange(g3, g4, ncol=2)

# Compiling all the hypotheses

# Adding Prediction column 
df$Prediction<-0

# Accounting for Solution 1 
df$Prediction<-ifelse(df$SecondId > 12300,1,0)
sum(df$Prediction)

# Accounting for Solution 2
c<-tz90$pairId[tz90$col==1]
df$Prediction[df$pairId %in% c]<-1
s2<- df