Did Rose survive because she had a “Lucky” name? How CONFOUNDING!

While thinking about a fun way to show some spurious correlations (which are often a byproduct of confounding variables), I came across this Kaggle post: https://www.kaggle.com/antgoldbloom/lucky-names/comments? which, in response to a machine learning competition to predict survivors of the Titanic tragedy, floated the idea that names could have an effect on the probability of survival.

This notion, of course, while fun, definitely involves confounding variables, as many of the comments left under the original post point out.

Now, I understand this might not be exactly the task, i.e. the dataset isn’t real-world, but it was a fun way to demonstrate how looking at the data incorrectly could lead to some funny (wrong) conclusions.

This post is more about thought process than about the technical ability to adjust for confounding variables, and is intended as a basic way to look at the results you get and question their validity. For a more technical, real-world view, I highly recommend the many posts on Canvas and CIC Around by other students; mine was really just an excuse to do some fun things in R to illustrate the issue of confounding variables.

A little about confounding variables

“In statistics, a confounder (also confounding variable, confounding factor, or lurking variable) is a variable that influences both the dependent variable and independent variable, causing a spurious association. Confounding is a causal concept, and as such, cannot be described in terms of correlations or associations.” ~ Wikipedia: Confounding
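To make the definition concrete, here is a minimal simulation sketch in base R. Everything in it is made up for illustration: `z` plays the role of a confounder (think of sex), while `x` (say, having a particular name) and `y` (say, surviving) each depend on `z` but never on each other. The probabilities and the seed are arbitrary assumptions, not values from the Titanic data.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
n <- 10000

# z is the (hypothetical) confounder: it influences both x and y
z <- rbinom(n, 1, 0.5)

# x and y each depend on z, but not on each other
x <- rbinom(n, 1, ifelse(z == 1, 0.8, 0.2))
y <- rbinom(n, 1, ifelse(z == 1, 0.7, 0.2))

# Ignoring z, x and y look associated
overall_cor <- cor(x, y)

# Within each level of z, the association should be close to zero
cor_z0 <- cor(x[z == 0], y[z == 0])
cor_z1 <- cor(x[z == 1], y[z == 1])

print(c(overall = overall_cor, z0 = cor_z0, z1 = cor_z1))
```

The overall correlation comes out clearly positive even though `x` has no effect on `y`; once you look within each level of `z`, it largely disappears.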

Load data from CSV

The original data is available here: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv

I must admit I cheated a bit here: I downloaded the CSV and then did the munging of the data in Excel. It is currently a little beyond my skill level to munge the names data efficiently in R, so in the interest of time the first names were extracted in Excel.

I also added a column to the data called freq_firstName; this is just a count of the first name and age groupings.
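For anyone who would rather stay in R, the Excel step could be sketched roughly as follows. This is not what I actually did; it assumes the names follow a "Surname, Title. Given names" layout, uses a few made-up rows, and `passengers` and `firstName` are hypothetical names. It also counts first names only (no age grouping) and does not handle married women whose own name appears in parentheses.

```r
# Made-up names in an assumed "Surname, Title. Given names" layout
passengers <- data.frame(
  name = c("Dawson, Mr. Jack", "Bukater, Miss. Rose DeWitt",
           "Smith, Miss. Rose", "Brown, Mr. Jack Henry"),
  stringsAsFactors = FALSE
)

# Drop everything up to and including the title ("Surname, Title. ")
given <- sub("^[^,]*,\\s*[^.]*\\.\\s*", "", passengers$name)

# Keep only the first word of what remains
passengers$firstName <- sub("\\s.*$", "", given)

# Look up how often each passenger's first name occurs in the data
name_counts <- table(passengers$firstName)
passengers$freq_firstName <- as.vector(name_counts[passengers$firstName])

print(passengers)
```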

library(readr)  # provides read_csv()
titanicData <- read_csv("titanicdata.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   pclass = col_double(),
##   survived = col_double(),
##   freq_firstName = col_double(),
##   age = col_double(),
##   sibsp = col_double(),
##   parch = col_double(),
##   fare = col_double(),
##   body = col_double()
## )
## See spec(...) for full column specifications.

Conclusion

These names might have been lucky, because they belonged to many people who survived, but it is much more likely that these were just commonly used names for the time, and that age, class and sex are much better predictors of the chance of survival than name alone.
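One simple way to question a result like this is to compare survival by name overall and then again within each sex. The sketch below uses a small made-up table (not the Titanic data) in which `luckyName` is a hypothetical flag; the counts were chosen so that lucky names are more common among women, who survive more often in this toy example.

```r
# Made-up rows for illustration only; not taken from the Titanic data
toy <- data.frame(
  sex       = rep(c("female", "female", "male", "male"), times = c(8, 4, 4, 8)),
  luckyName = rep(c(TRUE, FALSE, TRUE, FALSE), times = c(8, 4, 4, 8)),
  survived  = c(rep(c(1, 0), times = c(6, 2)),   # female, lucky name
                rep(c(1, 0), times = c(3, 1)),   # female, other name
                rep(c(1, 0), times = c(1, 3)),   # male, lucky name
                rep(c(1, 0), times = c(2, 6)))   # male, other name
)

# Survival rate by name alone: lucky names appear to do better
by_name <- aggregate(survived ~ luckyName, data = toy, FUN = mean)

# Survival rate by name within each sex: the gap vanishes
by_name_and_sex <- aggregate(survived ~ luckyName + sex, data = toy, FUN = mean)

print(by_name)
print(by_name_and_sex)
```

In this toy table the "lucky" names look better overall, yet within each sex the survival rate is identical, so the name adds nothing once sex is accounted for.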