The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this question, you are to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).
train <- read.csv("~/Library/CloudStorage/Dropbox/WCAS/Summer/Data Analysis/Summer 2024/Day 7/train.csv")
# Install and load ggcorrplot
# install.packages("ggcorrplot")
library(ggcorrplot)
## Loading required package: ggplot2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
train_numeric <- select(.data = train, -Name, -Ticket,-Cabin, -Embarked)
train_numeric$sex_male <- ifelse(test = train_numeric$Sex=="male", yes = 1, no = 0)
train_numeric$Sex <- NULL
train_corr_matrix <- cor(train_numeric)
# Visualize the correlation matrix
ggcorrplot(corr = train_corr_matrix, method = "circle", type = "lower")
library("stargazer")
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
train_clean <- train
train_clean$Sex_Male <- as.numeric(train_clean$Sex=="male")
stargazer(train_clean, type="text")
##
## ==============================================
## Statistic N Mean St. Dev. Min Max
## ----------------------------------------------
## PassengerId 891 446.000 257.354 1 891
## Survived 891 0.384 0.487 0 1
## Pclass 891 2.309 0.836 1 3
## Age 714 29.699 14.526 0.420 80.000
## SibSp 891 0.523 1.103 0 8
## Parch 891 0.382 0.806 0 6
## Fare 891 32.204 49.693 0.000 512.329
## Sex_Male 891 0.648 0.478 0 1
## ----------------------------------------------
train <- sample_n(train, 500)
train_subset <- sample_n(train, 500)
test_subset <- dplyr::anti_join(x = train, y = train_subset, by = "PassengerId")
You can use alternative commands to subset the data if you wish.
n=500
sample.model1 <- lm(data = train_subset, formula = Survived ~ as.factor(Pclass) + Age + I(Age^2)+SibSp + Parch + Fare + Sex )
summary(model1)
##
## Call:
## lm(formula = Survived ~ as.factor(Pclass) + Age + I(Age^2) +
## SibSp + Parch + Fare + Sex, data = train_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95333 -0.21893 -0.06136 0.18031 1.02954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.408e+00 1.020e-01 13.804 < 2e-16 ***
## as.factor(Pclass)2 -1.920e-01 6.117e-02 -3.138 0.00183 **
## as.factor(Pclass)3 -3.858e-01 6.011e-02 -6.418 3.99e-10 ***
## Age -1.899e-02 4.426e-03 -4.290 2.26e-05 ***
## I(Age^2) 1.643e-04 6.004e-05 2.736 0.00651 **
## SibSp -6.944e-02 2.350e-02 -2.954 0.00332 **
## Parch -4.470e-02 2.432e-02 -1.838 0.06680 .
## Fare 5.014e-04 5.382e-04 0.932 0.35205
## Sexmale -5.342e-01 4.097e-02 -13.040 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3673 on 392 degrees of freedom
## (99 observations deleted due to missingness)
## Multiple R-squared: 0.4511, Adjusted R-squared: 0.4399
## F-statistic: 40.27 on 8 and 392 DF, p-value: < 2.2e-16
test_subset$prediction <- predict(model1, newdata = test_subset)
summary(test_subset$prediction)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.2427 0.1312 0.3232 0.4188 0.7249 1.2880 78
test_subset$predic_survived <- ifelse(test_subset$prediction >.5, yes= 1, no= 0 )
table(test_subset$Survived, test_subset$predic_survived)
##
## 0 1
## 0 154 30
## 1 39 90
prediction <- predict(model1, newdata = wine_test)
https://www.analyticsvidhya.com/blog/2020/12/predicting-using-linear-regression-in-r/
?ifelse in R shows the syntax and usage. EG -
ifelse(predicted>.5, yes= 1, no= 0 )