Import the titanic train.csv file into R. You will have to
clean the data to run some of commands/set up the workflow.
remove(list = ls())
train <- read.csv("~/Library/CloudStorage/Dropbox/WCAS/Summer/Data Analysis/Summer 2024/Day 7/train.csv")
1. Run some preliminary correlations of Survived with some
other variables.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# cor(train)
train_numeric <- select(.data = train, -Name, -Sex, -Ticket,-Cabin,-Embarked)
train_corr_matrix <- cor(train_numeric, use = "pairwise.complete.obs")
?cor
#install.packages(ggcorrplot)
library(ggcorrplot)
?ggcorrplot
ggcorrplot(corr = train_corr_matrix, type = "lower")

2. Conduct descriptive statistics of the dataset. Anything
interesting you find ?
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
train$Sex_Male <- as.numeric(train$Sex=="male")
stargazer(train, type = "text")
##
## ==============================================
## Statistic N Mean St. Dev. Min Max
## ----------------------------------------------
## PassengerId 891 446.000 257.354 1 891
## Survived 891 0.384 0.487 0 1
## Pclass 891 2.309 0.836 1 3
## Age 714 29.699 14.526 0.420 80.000
## SibSp 891 0.523 1.103 0 8
## Parch 891 0.382 0.806 0 6
## Fare 891 32.204 49.693 0.000 512.329
## Sex_Male 891 0.648 0.478 0 1
## ----------------------------------------------
3. Use set.seed(100) command, and create a subset of
train dataset that has only 500 observations.
set.seed(100)
?sample_n
train_subset <- sample_n(tbl = train,
size = 500)
test_subset <- dplyr::anti_join(x = train,
y = train_subset,
by = "PassengerId")
4. Create an Ordinary Least Squares model / linear regression where
Survived is the dependent variable on your n=500
sample.
model1 <-
lm(data = train_subset,
formula = Survived ~ as.factor(Pclass) + Sex + Age + I(Age^2) + Fare)
summary(model1)
##
## Call:
## lm(formula = Survived ~ as.factor(Pclass) + Sex + Age + I(Age^2) +
## Fare, data = train_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0888 -0.2825 -0.1048 0.2729 0.9990
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.100e+00 1.071e-01 10.271 < 2e-16 ***
## as.factor(Pclass)2 -1.845e-01 6.996e-02 -2.637 0.00869 **
## as.factor(Pclass)3 -3.892e-01 6.749e-02 -5.767 1.62e-08 ***
## Sexmale -4.222e-01 4.411e-02 -9.571 < 2e-16 ***
## Age -6.234e-03 4.695e-03 -1.328 0.18496
## I(Age^2) -3.374e-06 6.923e-05 -0.049 0.96116
## Fare 1.008e-05 5.241e-04 0.019 0.98466
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4079 on 398 degrees of freedom
## (95 observations deleted due to missingness)
## Multiple R-squared: 0.3144, Adjusted R-squared: 0.3041
## F-statistic: 30.42 on 6 and 398 DF, p-value: < 2.2e-16
5. Create an estimate of whether an individual survived or not
(binary variable) using the predict command on your estimated
model. Essentially, you are using the coefficient from your linear model
to forecast/predict/estimate the survival variable given
independant variable values /data.
?predict
test_subset$prediction <- predict(object = model1, newdata = test_subset)
summary(test_subset$prediction)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.1914 0.1625 0.3649 0.4217 0.6796 1.0130 82
test_subset$predicted_Survived <- ifelse(test = test_subset$prediction > .5, yes = 1, no = 0)
table(test_subset$Survived, test_subset$predicted_Survived)
##
## 0 1
## 0 156 22
## 1 27 104