My Simple Shiny

94% on Kaggle without building any prediction function

Kevin Siswandi
Student

Introduction

https://www.kaggle.com/c/homesite-quote-conversion

  • This project is inspired by the Homesite Quote Conversion challenge on Kaggle (9 Nov 2015 to 8 Feb 2016).
  • Every analysis in this project is based on the training data from the competition.
  • Surprisingly, just six lines of R code yield 0.94 accuracy on the test set without building any prediction function; a toy illustration of the underlying idea follows this list.
  • In the Shiny demo, only the first 5,000 observations of the original training data were used, due to memory limitations on the RStudio server.
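The trick behind the 6-liner is that a group-by mean acts as a lookup table of conversion rates: each combination of field values "predicts" the average flag of the training rows that share it. A toy illustration of this idea (made-up data, not from the competition):

library(dplyr)

toy <- data.frame(FieldA = c("x", "x", "y", "y"),
                  QuoteConversion_Flag = c(1, 1, 0, 1))

# Each group's mean flag becomes its "predicted" conversion rate
toy %>%
  group_by(FieldA) %>%
  summarise(QuoteConversion_Flag = mean(QuoteConversion_Flag))
# FieldA  QuoteConversion_Flag
# x       1.0
# y       0.5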

The 6-liner

The 6 lines of code that achieve this remarkable feat are as follows.

library(dplyr)

# Read the competition data
train <- read.csv("../input/train.csv")
test  <- read.csv("../input/test.csv")

# Average conversion rate per combination of six fields:
# this table of group means serves as the "prediction"
preds <- train %>%
  group_by(PersonalField9, PropertyField37, SalesField5,
           PropertyField29, SalesField1A, PersonalField1) %>%
  summarise(QuoteConversion_Flag = mean(QuoteConversion_Flag))

# Attach the group means to the test rows (left join on the six fields)
testPreds <- merge(test, preds, all.x = TRUE)

# Combinations unseen in training get the overall mean conversion rate
testPreds$QuoteConversion_Flag[is.na(testPreds$QuoteConversion_Flag)] <-
  mean(testPreds$QuoteConversion_Flag, na.rm = TRUE)
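To actually score on Kaggle, the merged means would then be written out as a submission file. A minimal sketch, assuming the competition's standard QuoteNumber / QuoteConversion_Flag submission format (this step is not part of the original 6 lines):

# Hypothetical final step: write predictions in submission format
submission <- testPreds[, c("QuoteNumber", "QuoteConversion_Flag")]
write.csv(submission, "submission.csv", row.names = FALSE)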

Our server.R

Building upon the 6-liner, our server.R looks like the following.

library(shiny)
library(dplyr)

# First 5,000 rows only, due to memory limits on the RStudio server
data <- read.csv("train.csv", nrows = 5000)

# Reproducible 50/50 train/test split
set.seed(12345)
train_index <- sample(1:nrow(data), nrow(data)/2, replace = FALSE)
training <- data[train_index, ]
testing  <- data[-train_index, ]

shinyServer(
  function(input, output) {
    # Echo the fields the user ticked in the UI
    output$test <- renderPrint({ paste(input$checkbox, collapse = ", ") })
    output$model <- renderPrint({
      # Standard-evaluation group_by over the user-selected fields
      dots <- lapply(input$checkbox, as.symbol)
      preds <- training %>%
        group_by_(.dots = dots) %>%
        summarise(QuoteConversion_Flag2 = mean(QuoteConversion_Flag))
      # Left-join the group means onto the held-out half
      testPreds <- merge(testing, preds, all.x = TRUE)
      # Fill unmatched rows with the overall mean
      testPreds$QuoteConversion_Flag2[is.na(testPreds$QuoteConversion_Flag2)] <-
        mean(testPreds$QuoteConversion_Flag2, na.rm = TRUE)
      # Fraction of rows where the group mean exactly equals the true flag
      sum(testPreds$QuoteConversion_Flag2 == testPreds$QuoteConversion_Flag) /
        nrow(testPreds)
    })
  }
)
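The server assumes a ui.R that provides a checkboxGroupInput named checkbox and displays the two outputs. The original UI is not reproduced here; a minimal sketch, using the six fields from the 6-liner as the choices:

library(shiny)

shinyUI(fluidPage(
  titlePanel("My Simple Shiny"),
  sidebarLayout(
    sidebarPanel(
      # Candidate grouping fields, taken from the 6-liner
      checkboxGroupInput("checkbox", "Fields to group by:",
                         choices = c("PersonalField9", "PropertyField37",
                                     "SalesField5", "PropertyField29",
                                     "SalesField1A", "PersonalField1"),
                         selected = c("PersonalField9", "PropertyField37"))
    ),
    mainPanel(
      verbatimTextOutput("test"),   # selected fields
      verbatimTextOutput("model")   # accuracy on the held-out half
    )
  )
))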

Discussion & Conclusion

  • The 0.94 on the test data was obtained by summarising the entire training data, while this demo used only the first 5,000 rows and further split them evenly into train (50%) and test (50%) sets.
  • The performance in this demo was quite poor: my best was about 0.5.
  • These findings suggest that this method cannot be applied to the Quote Conversion problem in general.
  • It's always safer to fall back to a proper learner such as xgboost, gbm, or random forests; a minimal sketch follows.
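As a sketch of such a fallback, the same six fields could feed a small xgboost model (illustrative only: the hyperparameters below are assumptions, not tuned, and a competitive score would require feature engineering over the full set of columns):

library(xgboost)

# Same six fields as a minimal feature matrix; factors become integer codes
fields <- c("PersonalField9", "PropertyField37", "SalesField5",
            "PropertyField29", "SalesField1A", "PersonalField1")
X <- data.matrix(training[, fields])
y <- training$QuoteConversion_Flag

# Small illustrative model predicting conversion probabilities
bst <- xgboost(data = X, label = y, nrounds = 50,
               objective = "binary:logistic", verbose = 0)
probs <- predict(bst, newdata = data.matrix(testing[, fields]))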