The mean age of passengers was 29.8811377.
The standard deviation of the age was 14.4134932.
The number of passengers who died was 809.
The number of passengers who lived was 500.
The number of passengers who were in first class was 323.
The number of passengers who were in second class was 277.
The number of passengers who were in third class was 709.
The mean age of the survivors was 28.9182436.
The mean age of the dead was 30.5453635.
The mean age of the first class passengers was 39.1599296.
The mean age of the second class passengers was 29.506705.
The mean age of the third class passengers was 24.8163673.
| passenger.class | gender | number.survived | number.died |
|---|---|---|---|
| first class | female | 139 | 5 |
| second class | female | 94 | 12 |
| third class | female | 106 | 110 |
| first class | male | 61 | 118 |
| second class | male | 25 | 146 |
| third class | male | 75 | 418 |
set.seed(243)
random.subset<-sample_n(Titanic_passengers, 130)
titanic.glm.minimal <- glm(random.subset$survived ~ -1+1, family = binomial())
summary(titanic.glm.minimal)
##
## Call:
## glm(formula = random.subset$survived ~ -1 + 1, family = binomial())
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.062 -1.062 -1.062 1.298 1.298
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2787 0.1771 -1.574 0.116
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 177.72 on 129 degrees of freedom
## Residual deviance: 177.72 on 129 degrees of freedom
## AIC: 179.72
##
## Number of Fisher Scoring iterations: 4
titanic.glm.2vars <- glm(random.subset$survived ~ random.subset$pclass + random.subset$age, family = binomial())
summary(titanic.glm.2vars)
##
## Call:
## glm(formula = random.subset$survived ~ random.subset$pclass +
## random.subset$age, family = binomial())
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0939 -0.8653 -0.4611 0.9616 1.9424
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.72958 1.47820 3.876 0.000106 ***
## random.subset$pclass -1.71864 0.41089 -4.183 2.88e-05 ***
## random.subset$age -0.07174 0.02282 -3.144 0.001667 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 137.63 on 99 degrees of freedom
## Residual deviance: 110.50 on 97 degrees of freedom
## (30 observations deleted due to missingness)
## AIC: 116.5
##
## Number of Fisher Scoring iterations: 4
titanic.glm.3vars <- glm(random.subset$survived ~ random.subset$pclass + random.subset$age + random.subset$sex, family = binomial())
summary(titanic.glm.3vars)
##
## Call:
## glm(formula = random.subset$survived ~ random.subset$pclass +
## random.subset$age + random.subset$sex, family = binomial())
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4017 -0.6172 -0.3452 0.5440 2.3163
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.43985 1.81679 4.095 4.22e-05 ***
## random.subset$pclass -1.66233 0.47166 -3.524 0.000424 ***
## random.subset$age -0.07159 0.02588 -2.767 0.005665 **
## random.subset$sexmale -2.77370 0.61072 -4.542 5.58e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 137.63 on 99 degrees of freedom
## Residual deviance: 84.05 on 96 degrees of freedom
## (30 observations deleted due to missingness)
## AIC: 92.05
##
## Number of Fisher Scoring iterations: 5
Titanic_passengers <-na.omit(Titanic_passengers[,c(1,2,4,5)])
hoslem.data<-c(1:1046)
hoslem.data<-cbind(hoslem.data,hoslem.data)
for (i in 1:1046){
pc<-Titanic_passengers$pclass[i]
a<-Titanic_passengers$age[i]
if (Titanic_passengers$sex[i]=='male') {m<-1} else {m<-0}
prediction<-(7.43985-1.66233*pc-.07159*a-2.77370*m)
hoslem.data[i,1]<-exp(prediction)/(1+exp(prediction))
hoslem.data[i,2]<-Titanic_passengers$survived[i]
}
hoslem.test(hoslem.data[,2], hoslem.data[,1])
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: hoslem.data[, 2], hoslem.data[, 1]
## X-squared = 89.537, df = 8, p-value = 5.551e-16
When we add the gender variable to the 2 factor model, the AIC is reduced from 116.5 to 92.05. This is compared to 179.72 for the null model. The model with gender, even after penalizing for the two parameters, fits our data best. For this model, pclass and maleness are statistically significant at a confidence level of .001. Age is significant at a level of .01.
The result of a Hosmer Lemeshow test, a chi-squared test, is a p-value of 5.551e-16. Our model is a good fit.
residuals.3vars<-residuals(titanic.glm.3vars)
ggplot(x=seq(1,100), y=residuals.3vars)+geom_point(aes(x=seq(1,100), y=residuals.3vars),shape=21,fill='red')+ theme(panel.background = element_rect(fill = '#7fc4e0'))+geom_abline(slope=0, intercept=0,color='red')+labs(x='',y='residuals')
predict.accuracy<-rep(0,30)
predict.value<-rep(0,30)
empirical.value<-rep(0,30)
for (i in 1:30) {
random.passenger<-sample_n(Titanic_passengers, 1)
pc<-random.passenger$pclass
a<-random.passenger$age
if (random.passenger$sex=='male') {m<-1} else {m<-0}
prediction<-(7.43985-1.66233*pc-.07159*a-2.77370*m)
prediction<-exp(prediction)/(1+exp(prediction))
predict.value[i]<-prediction
empirical.value[i]<-random.passenger$survived
gap<-abs(random.passenger$survived-prediction)
predict.accuracy[i]<-findInterval(gap, c(0,.5) ) == 1
}
We ran a simulation to test how often our model would correctly guess if a passenger survived. When we took a sample of 30 and tested it against the actual data, the model produced 90.9% accuracy.
mean(predict.accuracy,na.rm=TRUE)
## [1] 0.9
model.graphic<-data.frame(cbind(predict.value,empirical.value))
lower.CI<-coefficients(titanic.glm.3vars)[2] -1.96*summary(titanic.glm.3vars)$coefficients[2,2]
upper.CI<-coefficients(titanic.glm.3vars)[2] +1.96*summary(titanic.glm.3vars)$coefficients[2,2]
ggplot(data=model.graphic)+geom_point(aes(x=seq(1,30),y=predict.value,color=empirical.value),size=2)+ theme(panel.background = element_rect(fill = '#c4b8a1')) + theme(legend.position="none")+labs(title='Prediction Value vs. Actual',subtitle='light means survival',x='',y='predicted value')+theme(title = element_text(size = rel(1.5),color='#3988ad'))
\(\large\frac{\pi}{1+\pi}= 7.44-1.66(pclass_{i})-.07(age_{i})-2.77(maleness_{i})\)