My Simple Shiny

94% on Kaggle without building any prediction function

Kevin Siswandi
Student

Introduction

https://www.kaggle.com/c/homesite-quote-conversion

  • This project is inspired by the Homesite Quote Conversion challenge on Kaggle (9 Nov 2015 to 8 Feb 2016).
  • Every analysis in this project is based on the training data from the competition.
  • Surprisingly, just six lines of R code yield 0.94 accuracy on the test set without building any prediction function; a toy illustration of the underlying idea follows this list.
  • In the Shiny demo, only the first 5,000 observations of the original training data were used, due to memory limitations on the RStudio server.
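The trick behind the 6-liner is that a group-by mean acts as a lookup table of conversion rates: each combination of field values "predicts" the average flag of the training rows that share it. A toy illustration of this idea (made-up data, not from the competition):

library(dplyr)

toy <- data.frame(FieldA = c("x", "x", "y", "y"),
                  QuoteConversion_Flag = c(1, 1, 0, 1))

# Each group's mean flag becomes its "predicted" conversion rate
toy %>%
  group_by(FieldA) %>%
  summarise(QuoteConversion_Flag = mean(QuoteConversion_Flag))
# FieldA  QuoteConversion_Flag
# x       1.0
# y       0.5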

The 6-liner

The 6 lines of code that achieve this remarkable feat are as follows.

library(dplyr)

# Read the competition data
train <- read.csv("../input/train.csv")
test  <- read.csv("../input/test.csv")

# Average conversion rate per combination of six fields:
# this table of group means serves as the "prediction"
preds <- train %>%
  group_by(PersonalField9, PropertyField37, SalesField5,
           PropertyField29, SalesField1A, PersonalField1) %>%
  summarise(QuoteConversion_Flag = mean(QuoteConversion_Flag))

# Attach the group means to the test rows (left join on the six fields)
testPreds <- merge(test, preds, all.x = TRUE)

# Combinations unseen in training get the overall mean conversion rate
testPreds$QuoteConversion_Flag[is.na(testPreds$QuoteConversion_Flag)] <-
  mean(testPreds$QuoteConversion_Flag, na.rm = TRUE)
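To actually score on Kaggle, the merged means would then be written out as a submission file. A minimal sketch, assuming the competition's standard QuoteNumber / QuoteConversion_Flag submission format (this step is not part of the original 6 lines):

# Hypothetical final step: write predictions in submission format
submission <- testPreds[, c("QuoteNumber", "QuoteConversion_Flag")]
write.csv(submission, "submission.csv", row.names = FALSE)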

Our server.R

Building upon the 6-liner, our server.R looks like the following.

library(shiny)
library(dplyr)

# First 5,000 rows only, due to memory limits on the RStudio server
data <- read.csv("train.csv", nrows = 5000)

# Reproducible 50/50 train/test split
set.seed(12345)
train_index <- sample(1:nrow(data), nrow(data)/2, replace = FALSE)
training <- data[train_index, ]
testing  <- data[-train_index, ]

shinyServer(
  function(input, output) {
    # Echo the fields the user ticked in the UI
    output$test <- renderPrint({ paste(input$checkbox, collapse = ", ") })
    output$model <- renderPrint({
      # Standard-evaluation group_by over the user-selected fields
      dots <- lapply(input$checkbox, as.symbol)
      preds <- training %>%
        group_by_(.dots = dots) %>%
        summarise(QuoteConversion_Flag2 = mean(QuoteConversion_Flag))
      # Left-join the group means onto the held-out half
      testPreds <- merge(testing, preds, all.x = TRUE)
      # Fill unmatched rows with the overall mean
      testPreds$QuoteConversion_Flag2[is.na(testPreds$QuoteConversion_Flag2)] <-
        mean(testPreds$QuoteConversion_Flag2, na.rm = TRUE)
      # Fraction of rows where the group mean exactly equals the true flag
      sum(testPreds$QuoteConversion_Flag2 == testPreds$QuoteConversion_Flag) /
        nrow(testPreds)
    })
  }
)
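The server assumes a ui.R that provides a checkboxGroupInput named checkbox and displays the two outputs. The original UI is not reproduced here; a minimal sketch, using the six fields from the 6-liner as the choices:

library(shiny)

shinyUI(fluidPage(
  titlePanel("My Simple Shiny"),
  sidebarLayout(
    sidebarPanel(
      # Candidate grouping fields, taken from the 6-liner
      checkboxGroupInput("checkbox", "Fields to group by:",
                         choices = c("PersonalField9", "PropertyField37",
                                     "SalesField5", "PropertyField29",
                                     "SalesField1A", "PersonalField1"),
                         selected = c("PersonalField9", "PropertyField37"))
    ),
    mainPanel(
      verbatimTextOutput("test"),   # selected fields
      verbatimTextOutput("model")   # accuracy on the held-out half
    )
  )
))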

Discussion & Conclusion

  • The 0.94 on the test data was obtained by summarising the entire training data, while this demo used only the first 5,000 rows and further split them evenly into train (50%) and test (50%) sets.
  • The performance in this demo was quite poor: my best was about 0.5.
  • These findings suggest that this method cannot be applied to the Quote Conversion problem in general.
  • It's always safer to fall back to a proper learner such as xgboost, gbm, or random forests; a minimal sketch follows.
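As a sketch of such a fallback, the same six fields could feed a small xgboost model (illustrative only: the hyperparameters below are assumptions, not tuned, and a competitive score would require feature engineering over the full set of columns):

library(xgboost)

# Same six fields as a minimal feature matrix; factors become integer codes
fields <- c("PersonalField9", "PropertyField37", "SalesField5",
            "PropertyField29", "SalesField1A", "PersonalField1")
X <- data.matrix(training[, fields])
y <- training$QuoteConversion_Flag

# Small illustrative model predicting conversion probabilities
bst <- xgboost(data = X, label = y, nrounds = 50,
               objective = "binary:logistic", verbose = 0)
probs <- predict(bst, newdata = data.matrix(testing[, fields]))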