Question 6: Data Question

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this question, you are to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

Import the titanic train.csv file into R. You will have to clean the data to run some of commands/set up the workflow.

train <- read.csv("~/Library/CloudStorage/Dropbox/WCAS/Summer/Data Analysis/Summer 2024/Day 7/train.csv")

1. Run some preliminary correlations of Survived with some other variables.

# Install and load ggcorrplot
# install.packages("ggcorrplot")
library(ggcorrplot)
## Loading required package: ggplot2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
train_numeric <- select(.data = train, -Name, -Ticket,-Cabin, -Embarked)
train_numeric$sex_male <- ifelse(test = train_numeric$Sex=="male", yes = 1, no = 0)
train_numeric$Sex <- NULL

train_corr_matrix <- cor(train_numeric)
# Visualize the correlation matrix
ggcorrplot(corr = train_corr_matrix, method = "circle", type = "lower")

2. Conduct descriptive statistics of the dataset. Anything interesting you find ?

library("stargazer")
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
train_clean <- train
train_clean$Sex_Male <- as.numeric(train_clean$Sex=="male")

stargazer(train_clean, type="text")
## 
## ==============================================
## Statistic    N   Mean   St. Dev.  Min    Max  
## ----------------------------------------------
## PassengerId 891 446.000 257.354    1     891  
## Survived    891  0.384   0.487     0      1   
## Pclass      891  2.309   0.836     1      3   
## Age         714 29.699   14.526  0.420 80.000 
## SibSp       891  0.523   1.103     0      8   
## Parch       891  0.382   0.806     0      6   
## Fare        891 32.204   49.693  0.000 512.329
## Sex_Male    891  0.648   0.478     0      1   
## ----------------------------------------------

3. Use set.seed(100) command, and create a subset of train dataset that has only 500 observations.

  • You can use sample_n command from tidyverse ecosystem dplyr package which you will have to install.

train <- sample_n(train, 500)

train_subset <- sample_n(train, 500)
test_subset <- dplyr::anti_join(x = train, y = train_subset, by = "PassengerId")

You can use alternative commands to subset the data if you wish.

4. Create an Ordinary Least Squares model / linear regression where Survived is the dependent variable on your n=500 sample.

model1 <- lm(data = train_subset, formula = Survived ~ as.factor(Pclass) + Age + I(Age^2)+SibSp + Parch + Fare + Sex )
summary(model1)
## 
## Call:
## lm(formula = Survived ~ as.factor(Pclass) + Age + I(Age^2) + 
##     SibSp + Parch + Fare + Sex, data = train_subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95333 -0.21893 -0.06136  0.18031  1.02954 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.408e+00  1.020e-01  13.804  < 2e-16 ***
## as.factor(Pclass)2 -1.920e-01  6.117e-02  -3.138  0.00183 ** 
## as.factor(Pclass)3 -3.858e-01  6.011e-02  -6.418 3.99e-10 ***
## Age                -1.899e-02  4.426e-03  -4.290 2.26e-05 ***
## I(Age^2)            1.643e-04  6.004e-05   2.736  0.00651 ** 
## SibSp              -6.944e-02  2.350e-02  -2.954  0.00332 ** 
## Parch              -4.470e-02  2.432e-02  -1.838  0.06680 .  
## Fare                5.014e-04  5.382e-04   0.932  0.35205    
## Sexmale            -5.342e-01  4.097e-02 -13.040  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3673 on 392 degrees of freedom
##   (99 observations deleted due to missingness)
## Multiple R-squared:  0.4511, Adjusted R-squared:  0.4399 
## F-statistic: 40.27 on 8 and 392 DF,  p-value: < 2.2e-16

5. Create an estimate of whether an individual survived or not (binary variable) using the predict command on your estimated model. Essentially, you are using the coefficient from your linear model to forecast/predict/estimate the survival variable given independant variable values /data.

test_subset$prediction <- predict(model1, newdata = test_subset)

summary(test_subset$prediction)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -0.2427  0.1312  0.3232  0.4188  0.7249  1.2880      78
test_subset$predic_survived <- ifelse(test_subset$prediction >.5,  yes= 1, no= 0 )

table(test_subset$Survived, test_subset$predic_survived)
##    
##       0   1
##   0 154  30
##   1  39  90

prediction <- predict(model1, newdata = wine_test)

https://www.analyticsvidhya.com/blog/2020/12/predicting-using-linear-regression-in-r/

  • You may have to transform your predicted values into 0 or 1, say using an ifelse rule/command.

?ifelse in R shows the syntax and usage. EG -

ifelse(predicted>.5, yes= 1, no= 0 )