The Titanic competition is a well-known introductory challenge on
Kaggle.
Many entrants score 100%.
They are usually assumed to be “cheating”* by using the full
data set to reveal what should be hidden.
I wondered how many people were scoring 100% and had a look round but
couldn’t find the answer… so here’s my afternoon distraction!
Data correct as of 28/03/2023
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.0
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
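# Bin edges: (-0.01, 0] gives zero scores their own bin; the rest are 0.05 wide
breaks <- c(-0.01, 0, seq(0.05, 1.05, by = 0.05))
# Drop the identifying columns, then assign each score to a bin
df_leaderb_anon <- read.csv("titanic-publicleaderboard.csv") %>%
  subset(select = c(-TeamName, -TeamId)) %>%
  mutate(score_bin = cut(Score, breaks = breaks))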
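I had to use a custom list of breaks because scores of exactly zero were otherwise being binned as NA: cut() builds intervals that are open on the left by default, so a first break at 0 excludes 0 itself.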
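The zero-as-NA behaviour is easy to reproduce on a tiny vector. The sketch below uses made-up scores and coarse breaks (both are illustrative assumptions, not the leaderboard data) to show the problem and one alternative, `include.lowest = TRUE`, which is a standard argument of base R's `cut()`.

```r
# Toy scores, made up for illustration: an exact zero, a middling score, an exact one.
scores <- c(0, 0.5, 1)

# With a first break at 0 the first interval is (0, 0.5], which excludes 0 itself,
# so the zero score is not assigned to any bin and comes back as NA.
naive <- cut(scores, breaks = c(0, 0.5, 1))
print(naive)

# include.lowest = TRUE closes the first interval, turning it into [0, 0.5].
closed <- cut(scores, breaks = c(0, 0.5, 1), include.lowest = TRUE)
print(levels(closed))
```

Note the trade-off: `include.lowest = TRUE` folds the zeros into the first ordinary bin, whereas an extra break at -0.01 gives them a bar of their own, which is what makes them visible in the plots here.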
ggplot(df_leaderb_anon, aes(x = score_bin)) +
  geom_bar() +
  xlab("Score Bin") +
  ylab("Count") +
  ggtitle("Histogram of Score Counts by Bin") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
By following the tutorial, I managed to get 0.77511, so I’m feeling pretty good about myself at this point. But it’s really hard to see the shorter bars and their distributions, so…
ggplot(df_leaderb_anon, aes(x = score_bin)) +
  geom_bar() +
  xlab("Score Bin") +
  ylab("Count") +
  ggtitle("Histogram of Titanic Kaggle Score Counts by Bin (Logarithmic Scale)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_log10()
What does this mean? My guesses:
1) Not that many people are concerned about their Titanic score (about as many “cheated”* as scored zero).
2) Lots of folk did the tutorial.
3) There’s a guy who said that if you scored between 0.8 and 1.0, you cheated, but badly… I’m not so sure. I think the 0.8 to 0.9 scores could just be down to luck or being really good; the shape of the distribution seems to be naturally explained that way.
Please comment and let me know if I’m wrong!
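The bar heights only give a rough comparison between perfect and zero scores, so a direct count may be more convincing. A minimal sketch follows, assuming the leaderboard has a numeric `Score` column as in the code above; `toy_leaderboard` and its values are hypothetical stand-ins, not the real data.

```r
# Hypothetical stand-in for the leaderboard; the real scores come from the CSV.
toy_leaderboard <- data.frame(
  Score = c(0, 0, 0.62200, 0.77511, 0.77511, 0.80143, 1, 1)
)

# Count entries at each extreme, and the share of entries with a perfect score.
n_perfect <- sum(toy_leaderboard$Score == 1)
n_zero <- sum(toy_leaderboard$Score == 0)
share_perfect <- mean(toy_leaderboard$Score == 1)

print(c(perfect = n_perfect, zero = n_zero))
print(share_perfect)
```

Swapping `toy_leaderboard` for `df_leaderb_anon` should give the real figures, provided the perfect scores are stored as exactly 1 in the file.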