library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(titanic)
## Warning: package 'titanic' was built under R version 4.1.3
library(ggplot2)
library(pscl)
## Warning: package 'pscl' was built under R version 4.1.3
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
df <- as.data.frame(titanic::titanic_train) %>%
  mutate(Pclass = factor(Pclass),
         Sex = factor(Sex))
head(df)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q
df <- na.omit(df)
model <- glm(Survived ~ Pclass + Sex + Age, data = df, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age, family = "binomial", 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7303  -0.6780  -0.3953   0.6485   2.4657  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.777013   0.401123   9.416  < 2e-16 ***
## Pclass2     -1.309799   0.278066  -4.710 2.47e-06 ***
## Pclass3     -2.580625   0.281442  -9.169  < 2e-16 ***
## Sexmale     -2.522781   0.207391 -12.164  < 2e-16 ***
## Age         -0.036985   0.007656  -4.831 1.36e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 647.28  on 709  degrees of freedom
## AIC: 657.28
## 
## Number of Fisher Scoring iterations: 5
plot(model)

pr2 <- pR2(model)
## fitting null model for pseudo-r2
pr2
##          llh      llhNull           G2     McFadden         r2ML         r2CU 
## -323.6415628 -482.2579824  317.2328394    0.3289037    0.3587294    0.4841261

The selected predictors are Pclass, Sex, and Age, and all of their coefficients are statistically significant at the 0.001 level (***). Every coefficient is negative, so being in a lower class (Pclass2 or Pclass3 relative to first class), being male, and being older are all associated with lower odds of survival. Sexmale has the largest magnitude, so sex has the strongest effect on the outcome. The intercept is also significant. The McFadden pseudo-R² of about 0.33 suggests the model fits the data reasonably well.
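To make that interpretation more concrete, the log-odds coefficients can be exponentiated into odds ratios, and McFadden's pseudo-R² can be recomputed by hand from the fitted and null log-likelihoods. This is a minimal sketch that reuses the model object above; null_model is defined here only for this check and is not part of the original analysis.

exp(cbind(OddsRatio = coef(model), confint(model)))            # odds ratios with 95% profile CIs
null_model <- glm(Survived ~ 1, data = df, family = "binomial") # intercept-only model
1 - as.numeric(logLik(model)) / as.numeric(logLik(null_model))  # should match the McFadden value from pR2()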

We should not use an ordinary (multiple) linear regression here because the outcome, whether a passenger survived, is binary, and what we want is a predicted probability. A linear model can produce fitted values outside [0, 1] and its assumptions about the errors do not hold for a binary response, whereas logistic regression models the log-odds of survival directly and relies on weaker assumptions. In practice, plain linear models are also less commonly used for this kind of classification task; tree-based models and neural networks often perform better.
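As a sketch of how the fitted logistic model returns probabilities rather than unbounded values, we can predict survival for two hypothetical passengers (the data frame below is made up purely for illustration):

new_passengers <- data.frame(
  Pclass = factor(c(1, 3), levels = levels(df$Pclass)),  # first-class vs third-class
  Sex    = factor(c("female", "male"), levels = levels(df$Sex)),
  Age    = c(30, 30)
)
predict(model, newdata = new_passengers, type = "response")  # predicted survival probabilities in [0, 1]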

Part B

First, I want to thank our professor for his kindness and patience. I have learned a great deal, including data manipulation, probability with R, plotting, and modeling. The most important takeaway, however, is the mindset we should bring to a data analytics project. With this mindset and the experience gained from the homework and discussions, I believe I can now handle a simple Kaggle project.