Import the titanic train.csv file into R. You will have to clean the data to run some of commands/set up the workflow.

remove(list = ls())

train <- read.csv("~/Library/CloudStorage/Dropbox/WCAS/Summer/Data Analysis/Summer 2024/Day 7/train.csv")

1. Run some preliminary correlations of Survived with some other variables.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# cor(train)
train_numeric <- select(.data = train, -Name, -Sex, -Ticket,-Cabin,-Embarked)

train_corr_matrix <- cor(train_numeric, use = "pairwise.complete.obs")
?cor

#install.packages(ggcorrplot)
library(ggcorrplot)
?ggcorrplot
ggcorrplot(corr = train_corr_matrix, type = "lower")

2. Conduct descriptive statistics of the dataset. Anything interesting you find ?

library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
train$Sex_Male <- as.numeric(train$Sex=="male")

stargazer(train, type = "text")
## 
## ==============================================
## Statistic    N   Mean   St. Dev.  Min    Max  
## ----------------------------------------------
## PassengerId 891 446.000 257.354    1     891  
## Survived    891  0.384   0.487     0      1   
## Pclass      891  2.309   0.836     1      3   
## Age         714 29.699   14.526  0.420 80.000 
## SibSp       891  0.523   1.103     0      8   
## Parch       891  0.382   0.806     0      6   
## Fare        891 32.204   49.693  0.000 512.329
## Sex_Male    891  0.648   0.478     0      1   
## ----------------------------------------------

3. Use set.seed(100) command, and create a subset of train dataset that has only 500 observations.

set.seed(100)
?sample_n
train_subset <- sample_n(tbl = train, 
                         size = 500)

test_subset <- dplyr::anti_join(x = train,
                                y = train_subset, 
                                by = "PassengerId")

4. Create an Ordinary Least Squares model / linear regression where Survived is the dependent variable on your n=500 sample.

model1 <-
lm(data = train_subset, 
    formula = Survived ~ as.factor(Pclass) + Sex + Age + I(Age^2) + Fare)


summary(model1)
## 
## Call:
## lm(formula = Survived ~ as.factor(Pclass) + Sex + Age + I(Age^2) + 
##     Fare, data = train_subset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0888 -0.2825 -0.1048  0.2729  0.9990 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.100e+00  1.071e-01  10.271  < 2e-16 ***
## as.factor(Pclass)2 -1.845e-01  6.996e-02  -2.637  0.00869 ** 
## as.factor(Pclass)3 -3.892e-01  6.749e-02  -5.767 1.62e-08 ***
## Sexmale            -4.222e-01  4.411e-02  -9.571  < 2e-16 ***
## Age                -6.234e-03  4.695e-03  -1.328  0.18496    
## I(Age^2)           -3.374e-06  6.923e-05  -0.049  0.96116    
## Fare                1.008e-05  5.241e-04   0.019  0.98466    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4079 on 398 degrees of freedom
##   (95 observations deleted due to missingness)
## Multiple R-squared:  0.3144, Adjusted R-squared:  0.3041 
## F-statistic: 30.42 on 6 and 398 DF,  p-value: < 2.2e-16

5. Create an estimate of whether an individual survived or not (binary variable) using the predict command on your estimated model. Essentially, you are using the coefficient from your linear model to forecast/predict/estimate the survival variable given independant variable values /data.

?predict

test_subset$prediction <- predict(object = model1, newdata = test_subset)

summary(test_subset$prediction)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -0.1914  0.1625  0.3649  0.4217  0.6796  1.0130      82
test_subset$predicted_Survived <- ifelse(test = test_subset$prediction > .5, yes = 1, no = 0)

table(test_subset$Survived, test_subset$predicted_Survived)
##    
##       0   1
##   0 156  22
##   1  27 104