Titanic Competition in Kaggle

The Titanic Competition is a well-known introductory challenge on Kaggle.
Many entrants on its leaderboard score 100%.
These people are usually assumed to be “cheating”* by using the full data set to reveal what should be hidden.
I wondered how many people were scoring 100% and had a look round but couldn’t find the answer… so here’s my afternoon distraction!
Data correct as of 28/03/2023

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.0
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
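# Bin edges: (-0.01, 0] catches zero scores, then 0.05-wide bins up to 1.05
breaks <- c(-0.01, 0, seq(0.05, 1.05, by = 0.05))

# Drop the identifying columns and assign each score to a bin
df_leaderb_anon <- read.csv("titanic-publicleaderboard.csv") %>% 
  select(-TeamName, -TeamId) %>% 
  mutate(score_bin = cut(Score, breaks = breaks))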

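I had to put a custom breaks list in because zero scores were being counted as NA: cut() builds left-open intervals by default, so with a lowest break of 0 a score of exactly 0 falls outside every bin. Adding -0.01 below it gives the zeros their own (-0.01, 0] bin.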
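
To see the NA behaviour in isolation, here is a minimal sketch using three made-up scores (not the leaderboard data). It compares plain breaks, the custom breaks used above, and `include.lowest = TRUE`, a base R option that closes the lowest interval instead; the variable names are purely illustrative.

```r
# Toy scores for illustration only, not the real leaderboard
scores <- c(0, 0.5, 1)

# Default intervals are left-open, so 0 is not inside (0, 0.05]
naive <- cut(scores, breaks = seq(0, 1.05, by = 0.05))

# The approach used in this post: an extra break below zero
custom <- cut(scores, breaks = c(-0.01, 0, seq(0.05, 1.05, by = 0.05)))

# Alternative: keep the plain breaks but close the lowest interval
closed <- cut(scores, breaks = seq(0, 1.05, by = 0.05), include.lowest = TRUE)

print(is.na(naive))   # only the zero score is NA
print(is.na(custom))  # nothing is NA
print(is.na(closed))  # nothing is NA
```

The two fixes are not equivalent for plotting: with `include.lowest = TRUE` the zeros share a bin with scores up to 0.05, whereas the extra -0.01 break gives them a bar of their own.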

ggplot(df_leaderb_anon, aes(x = score_bin)) +
  geom_bar() +
  xlab("Score Bin") +
  ylab("Count") +
  ggtitle("Histogram of Score Counts by Bin") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

By following the tutorial, I managed to get 0.77511, so I’m feeling pretty good about myself at this point. But the tall bars make it really hard to see the shorter ones, so here is the same chart with a logarithmic y-axis…

ggplot(df_leaderb_anon, aes(x = score_bin)) +
  geom_bar() +
  xlab("Score Bin") +
  ylab("Count") +
  ggtitle("Histogram of Titanic Kaggle Score Counts by Bin (Logarithmic Scale)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_log10()
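
The charts show the shape, but the counts at the two extremes can also be read straight from the data. This is a rough sketch on a hypothetical `toy` data frame standing in for the `Score` column of the leaderboard file; the values are invented, so the numbers it prints say nothing about the real leaderboard.

```r
# Hypothetical scores standing in for the Score column of the leaderboard CSV
toy <- data.frame(Score = c(0, 0, 0.62, 0.77511, 0.77511, 0.81, 1, 1, 1))

n_perfect <- sum(toy$Score == 1)       # entries scoring exactly 100%
n_zero <- sum(toy$Score == 0)          # entries scoring exactly 0
share_perfect <- mean(toy$Score == 1)  # fraction of all entries at 100%

print(c(perfect = n_perfect, zero = n_zero))
print(round(share_perfect, 3))
```

Swapping `toy` for `df_leaderb_anon` should give the real figures, assuming perfect scores are stored as exactly 1 in the CSV.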

What does this mean? My guesses:
1) Not that many people are concerned about their Titanic score: about as many “cheated”* as scored zero.
2) Lots of folk did the tutorial.
3) There’s a guy who said that if you scored between 0.8 and 1.0, you cheated, but badly… I’m not so sure. I think the scores between 0.8 and 0.9 could just be luck or being really good; the shape of the distribution seems to be naturally explained that way.
Please comment and let me know if I’m wrong!