Titanic analysis

Titanic Competition in Kaggle

The Titanic Competition is a well-known introductory challenge on Kaggle.
Many entrants on its leaderboard score 100%.
These people are usually assumed to be “cheating”* by using the full data set to reveal the answers that should be hidden.
I wondered how many people were scoring 100% and had a look round but couldn’t find the answer… so here’s my afternoon distraction!
Data correct as of 28/03/2023

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.0
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

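# Bin edges: a first bin (-0.01, 0] to catch scores of exactly 0, then 0.05-wide bins
breaks <- c(-0.01, 0, seq(0.05, 1.05, by = 0.05))

# Read the public leaderboard, drop the identifying columns, and bin the scores
df_leaderb_anon <- read.csv("titanic-publicleaderboard.csv") %>% 
  select(-TeamName, -TeamId) %>% 
  mutate(score_bin = cut(Score, breaks = breaks))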

I had to put in a custom breaks list because cut() leaves the lowest break out of the first interval by default, so scores of exactly zero were being counted as NA.
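To see the zero-score problem in isolation, here is a minimal sketch on a handful of made-up scores (these are not the leaderboard data, and `naive_breaks` is just a name chosen for this illustration). It also shows what is probably the more conventional fix, the `include.lowest` argument of `cut()`, as an alternative to adding an extra break below zero.

```r
# Toy scores, invented for illustration; not the leaderboard data
scores <- c(0, 0.5, 0.77511, 1)

# Breaks starting exactly at 0: cut() builds (a, b] intervals by default,
# so a score of exactly 0 sits outside the first bin and becomes NA
naive_breaks <- seq(0, 1, by = 0.05)
naive_bins <- cut(scores, breaks = naive_breaks)
is.na(naive_bins)

# Alternative to a custom breaks list: close the lowest interval at 0
closed_bins <- cut(scores, breaks = naive_breaks, include.lowest = TRUE)
is.na(closed_bins)
```

Either approach keeps the zero scores in the plot; the custom breaks list has the side effect of giving exact zeros a bin of their own, which is handy when comparing them with the perfect scores.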

ggplot(df_leaderb_anon, aes(x = score_bin)) +
  geom_bar() +
  xlab("Score Bin") +
  ylab("Count") +
  ggtitle("Histogram of Score Counts by Bin") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

By following the tutorial, I managed to get 0.77511, so I’m feeling pretty good about myself at this point. But it’s really hard to see the shorter bars and their distributions, so…

ggplot(df_leaderb_anon, aes(x = score_bin)) +
  geom_bar() +
  xlab("Score Bin") +
  ylab("Count") +
  ggtitle("Histogram of Titanic Kaggle Score Counts by Bin (Logarithmic Scale)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_log10()

What does this mean? I guess that:
1) Not that many people are concerned about their Titanic score (about as many “cheated”* as scored zero).
2) Lots of folk did the tutorial.
3) There’s a guy who said that if you scored between 0.8 and 1.0, you cheated, but badly… I’m not so sure. I think the 0.8 to 0.9 scores could just be luck or being really good, and the distribution seems to be naturally explained that way.

Please comment and let me know if I’m wrong!
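For anyone who wants exact numbers rather than bar heights, the comparison between perfect and zero scores can be counted directly. This is only a sketch: `toy_leaderboard` is a made-up stand-in with invented scores, and the only thing it shares with the real file is the `Score` column name used above. Swapping in `df_leaderb_anon` should give the real counts.

```r
library(dplyr)

# Toy stand-in for the leaderboard; these scores are invented for illustration
toy_leaderboard <- data.frame(
  Score = c(0, 0, 0.62, 0.77511, 0.77511, 0.79, 1, 1)
)

# Count entries with a perfect score, entries with a zero score,
# and the share of all entries that are perfect
toy_counts <- toy_leaderboard %>%
  summarise(
    n_total = n(),
    n_perfect = sum(Score == 1),
    n_zero = sum(Score == 0),
    share_perfect = mean(Score == 1)
  )

toy_counts
```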


BWesterman

2023-03-28
