library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(titanic)
## Warning: package 'titanic' was built under R version 4.1.3
library(ggplot2)
library(pscl)
## Warning: package 'pscl' was built under R version 4.1.3
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
df <- as.data.frame(titanic::titanic_train) %>%
  mutate(Pclass = factor(Pclass),
         Sex = factor(Sex))
head(df)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
df <- na.omit(df)
model <- glm(Survived ~ Pclass + Sex + Age, data = df, family = "binomial")
summary(model)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age, family = "binomial",
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7303 -0.6780 -0.3953 0.6485 2.4657
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.777013 0.401123 9.416 < 2e-16 ***
## Pclass2 -1.309799 0.278066 -4.710 2.47e-06 ***
## Pclass3 -2.580625 0.281442 -9.169 < 2e-16 ***
## Sexmale -2.522781 0.207391 -12.164 < 2e-16 ***
## Age -0.036985 0.007656 -4.831 1.36e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 964.52 on 713 degrees of freedom
## Residual deviance: 647.28 on 709 degrees of freedom
## AIC: 657.28
##
## Number of Fisher Scoring iterations: 5
plot(model)
pr2 <- pR2(model)
## fitting null model for pseudo-r2
pr2
## llh llhNull G2 McFadden r2ML r2CU
## -323.6415628 -482.2579824 317.2328394 0.3289037 0.3587294 0.4841261
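As a quick check on the pseudo-R2 output, McFadden's measure and the G2 statistic can be recomputed by hand from the two log-likelihoods printed above. This is only a sketch: the values are hard-coded from the pR2() output rather than taken from a refitted model, and the variable names are illustrative.

```r
# Log-likelihoods copied from the pR2() output above (not recomputed here)
llh      <- -323.6415628  # fitted model
llh_null <- -482.2579824  # intercept-only (null) model

# McFadden's pseudo-R2: 1 - llh / llhNull
mcfadden <- 1 - llh / llh_null

# Likelihood-ratio statistic: G2 = -2 * (llhNull - llh)
g2 <- -2 * (llh_null - llh)

round(c(McFadden = mcfadden, G2 = g2), 4)
```

G2 computed this way should also agree, up to rounding, with the difference between the null and residual deviances reported by summary(model).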
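The selected predictors are Pclass, Sex, and Age, and every coefficient is statistically significant at the 0.001 level (***), as is the intercept. All coefficients are negative: relative to the reference groups (first class, female), being in second or third class or being male lowers the log-odds of survival, and each additional year of age lowers it as well. Sexmale has the largest z value in magnitude (-12.164), while Pclass3 has a slightly larger coefficient (-2.581 versus -2.523). McFadden's pseudo-R2 is 0.329, which is commonly read as a good fit for a logistic model.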
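Because the coefficients are on the log-odds scale, they may be easier to read as odds ratios. The sketch below exponentiates the estimates copied from the summary(model) output; with the fitted object available, exp(coef(model)) should give the same numbers. The vector names are illustrative.

```r
# Coefficient estimates copied from the summary(model) output above
coefs <- c(Pclass2 = -1.309799,
           Pclass3 = -2.580625,
           Sexmale = -2.522781,
           Age     = -0.036985)

# Odds ratio = exp(coefficient); values below 1 mean lower odds of survival
odds_ratios <- exp(coefs)
round(odds_ratios, 3)
```

Every odds ratio is below 1. For Sexmale it is roughly 0.08, meaning that the odds of survival for a man are about 8% of those for a woman of the same class and age.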
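We should not run a multivariate linear regression here because the outcome is binary (a passenger either survived or did not) and the goal is to estimate the probability of survival. Logistic regression also relies on weaker assumptions than linear regression (for example, it does not require normally distributed errors), and with only three predictors the model is unlikely to overfit. Linear models are simple, though, and more flexible methods such as tree-based models and neural networks often predict better and are widely used in industry.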
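To illustrate why a logistic model suits a binary outcome, the sketch below turns the linear predictor into a survival probability with the inverse logit (plogis() in base R), so the result always lies between 0 and 1. The coefficients are copied from the summary(model) output, and the two example passengers are hypothetical.

```r
# Coefficient estimates copied from the summary(model) output above
b <- c(Intercept = 3.777013, Pclass2 = -1.309799, Pclass3 = -2.580625,
       Sexmale = -2.522781, Age = -0.036985)

# Hypothetical passenger 1: third-class male, age 22
eta_1 <- b[["Intercept"]] + b[["Pclass3"]] + b[["Sexmale"]] + b[["Age"]] * 22

# Hypothetical passenger 2: first-class female, age 38 (both reference levels)
eta_2 <- b[["Intercept"]] + b[["Age"]] * 38

# Inverse logit: p = 1 / (1 + exp(-eta)), always strictly between 0 and 1
p <- plogis(c(male_3rd_22 = eta_1, female_1st_38 = eta_2))
round(p, 3)
```

With the fitted object available, predict(model, newdata = ..., type = "response") should return the same probabilities without hard-coding the coefficients.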
Part B
First, I want to thank our professor for his kindness and patience. I have learned a great deal, including data manipulation, probability with R, plotting, and modeling. The most important takeaway, though, is the mindset needed to approach a data analytics project. With this mindset and the experience I gained from the homework and discussions, I think I can now handle some easy Kaggle projects.