About the Data
For this assignment, we will be analyzing data on the passengers from the ill-fated maiden voyage of the Titanic. Within R’s PASWR library, the data set titanic3 can be accessed using the command data(titanic3). A description of the variables is available in the help file for this data set by typing help(titanic3).
Question 1: Data Exploration
1a: Survival
How many passengers survived, and how many perished as a result of the sinking?
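The code that generated the listing below is not shown in the output; a minimal sketch that would reproduce it (assuming the PASWR package is installed):
library(PASWR)
data(titanic3)
head(titanic3, n = 10)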
   pclass survived                            name    sex     age sibsp
1     1st        1   Allen, Miss. Elisabeth Walton female 29.0000     0
2     1st        1  Allison, Master. Hudson Trevor   male  0.9167     1
3     1st        0    Allison, Miss. Helen Loraine female  2.0000     1
4     1st        0 Allison, Mr. Hudson Joshua Crei   male 30.0000     1
5     1st        0 Allison, Mrs. Hudson J C (Bessi female 25.0000     1
6     1st        1             Anderson, Mr. Harry   male 48.0000     0
7     1st        1 Andrews, Miss. Kornelia Theodos female 63.0000     1
8     1st        0          Andrews, Mr. Thomas Jr   male 39.0000     0
9     1st        1 Appleton, Mrs. Edward Dale (Cha female 53.0000     2
10    1st        0         Artagaveytia, Mr. Ramon   male 71.0000     0
   parch   ticket     fare   cabin    embarked boat body
1      0    24160 211.3375      B5 Southampton    2   NA
2      2   113781 151.5500 C22 C26 Southampton   11   NA
3      2   113781 151.5500 C22 C26 Southampton        NA
4      2   113781 151.5500 C22 C26 Southampton       135
5      2   113781 151.5500 C22 C26 Southampton        NA
6      0    19952  26.5500     E12 Southampton    3   NA
7      0    13502  77.9583      D7 Southampton   10   NA
8      0   112050   0.0000     A36 Southampton        NA
9      0    11769  51.4792    C101 Southampton    D   NA
10     0 PC 17609  49.5042         Cherbourg          22
                         home.dest
1                     St Louis, MO
2  Montreal, PQ / Chesterville, ON
3  Montreal, PQ / Chesterville, ON
4  Montreal, PQ / Chesterville, ON
5  Montreal, PQ / Chesterville, ON
6                     New York, NY
7                       Hudson, NY
8                      Belfast, NI
9              Bayside, Queens, NY
10             Montevideo, Uruguay
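The counts below likewise appear without their generating code; presumably something like:
sum(titanic3$survived == 1)
sum(titanic3$survived == 0)
table(titanic3$survived)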
[1] 500
[1] 809
0 1
809 500
1b: Class of Tickets
The ship sold tickets in 1st, 2nd, and 3rd class areas. How many passengers were in each class?
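(The generating call is not shown; a one-line table of pclass would produce these counts:)
table(titanic3$pclass)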
1st 2nd 3rd
323 277 709
1c: Point of Embarkation
Most of the passengers embarked from Southampton. Let’s create a new variable in the data set called southampton that will have the value 1 for passengers who embarked from Southampton and 0 for all other passengers (including any missing values). Create this variable and then show how many passengers had the value 1 and how many had the value 0 for the southampton variable.
library(tidyverse)
# 1 for passengers who embarked from Southampton; all others, including missing values, get 0:
titanic3 <- titanic3 %>%
  mutate(southampton = ifelse(!is.na(embarked) & embarked == "Southampton", 1, 0))
table(titanic3$southampton)
0 1
395 914
1d: Sex of the Passengers
Create a variable called female with the value 1 for females and 0 for males. Show the counts of each category.
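(Only the counts appear below; code along these lines would create the variable and tabulate it:)
titanic3 <- titanic3 %>%
  mutate(female = ifelse(sex == "female", 1, 0))
table(titanic3$female)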
0 1
843 466
1e: Fares
Use the summary function to display some key figures about the distribution of the fares that the passengers paid.
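(The call itself is not shown; presumably:)
summary(titanic3$fare)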
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 7.896 14.454 33.295 31.275 512.329 1
1f: Family Members
Now use the summary function to display the key figures about the distribution of the number of siblings (sibsp) and separately for the number of parents or children (parch) on board.
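(Again the generating calls are not shown; presumably, for sibsp:)
summary(titanic3$sibsp)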
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.4989 1.0000 8.0000
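(and for parch:)
summary(titanic3$parch)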
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.000 0.385 0.000 9.000
1g: Appropriate Summarization
Do you think that providing the quartiles, minimum, maximum, and mean value is the best way to summarize the counts of the number of family members for each passenger? If not, what would be a better way to summarize these variables?
# Providing the quartiles, minimum, maximum, and mean is not the best way to summarize the counts of family members, because these counts are discrete: only a small, finite set of values was recorded for each passenger.
# A frequency table is a better way to summarize discrete data. In terms of graphs, a boxplot (which depicts the quartiles, minimum, and maximum) suits continuous data, while a barplot of the frequencies is better suited to discrete counts.
table(titanic3$sibsp)
0 1 2 3 4 5 8
891 319 42 20 22 6 9
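# The parch counts below were presumably generated by the matching call:
table(titanic3$parch)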
0 1 2 3 4 5 6 9
1002 170 113 8 6 6 2 2
Question 2: Clustering Models
To better investigate the relationships between the passengers, we will use clustering analysis on the following variables:
- age
- female
- fare
- sibsp
- parch
- southampton
For this exercise, create a separate data.frame called measured.dat. This object should include only the variables listed above, along with the survived variable. Use the na.omit function to restrict the rows to those completely measured (without any missing data). Then use the scale function to standardize all of the values in units of standard deviations above the mean, storing the result in a matrix called subdat. The subdat matrix will contain only the predictors, not the survived outcome; we will use it to answer the following questions.
#create a separate data.frame called measured.dat:
measured.dat <- titanic3 %>% select(age, female, fare, sibsp, parch, southampton, survived)
#remove missing data:
measured.dat <- na.omit(measured.dat)
predictors <- c("age", "female", "fare", "sibsp", "parch", "southampton")
outcome <- c("survived")
subdat <- scale(measured.dat[predictors])
2a: Hierarchical Clustering
Use hierarchical clustering with complete linkage to cluster the subdat. Assign each passenger to one of 5 clusters. Add these assignments as a column called hclust.group to the measured.dat. Then show a dendrogram depicting the results of the hierarchical clustering. Using the color_branches method of the dendextend library may provide a helpful visualization.
# Compute pairwise Euclidean distances between passengers:
dist.mat <- dist(subdat, method = "euclidean")
clusters <- hclust(dist.mat, method = "complete")
# Cut the dendrogram to assign each passenger to one of 5 clusters:
cluster_cut <- cutree(clusters, k = 5)
# Add the cluster assignments as a column of measured.dat:
measured.dat <- measured.dat %>%
  mutate(hclust.group = cluster_cut)
# Use the color_branches method of the dendextend library to visualize the clustering:
library(dendextend)
plot(color_branches(as.dendrogram(clusters), k = 5, groupLabels = FALSE))
2b: K-Means
Now use K-Means clustering with 5 centers and iter.max = 20 to cluster the subdat. Set the randomization seed to 821 just prior to running the algorithm. Add the clustering assignments as a column called kmeans.group to the measured.dat. Then plot the fare versus the age for each passenger while assigning different colors to their kmeans clustering assignment.
set.seed(seed = 821)
# K-means clustering with 5 centers and at most 20 iterations:
km <- kmeans(subdat, centers = 5, iter.max = 20)
# Add the k-means cluster assignments as a column of measured.dat:
measured.dat <- measured.dat %>%
  mutate(kmeans.group = km$cluster)
head(measured.dat)
      age female     fare sibsp parch southampton survived hclust.group
1 29.0000 1 211.3375 0 0 1 1 1
2 0.9167 0 151.5500 1 2 1 1 2
3 2.0000 1 151.5500 1 2 1 0 1
4 30.0000 0 151.5500 1 2 1 0 3
5 25.0000 1 151.5500 1 2 1 0 1
6 48.0000 0 26.5500 0 0 1 1 3
kmeans.group
1 5
2 4
3 4
4 4
5 4
6 3
# Plot fare as a function of age, colored by k-means cluster assignment:
plot(measured.dat$age, measured.dat$fare, xlab = "Age", ylab = "Fare",
     col = measured.dat$kmeans.group)
Question 3: Models
3a
Now we would like to build a logistic regression model for survived using the measured.dat. Include all of the following predictor variables:
- age
- female
- fare
- sibsp
- parch
- southampton
Fit this model and show a summary of the estimated coefficients.
mod.glm <- glm(formula = survived ~ age + female + fare + sibsp + parch + southampton, data = measured.dat, family = binomial)
summary(mod.glm)
Call:
glm(formula = survived ~ age + female + fare + sibsp + parch +
southampton, family = binomial, data = measured.dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2772 -0.6696 -0.5455 0.7576 2.2726
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.657172 0.250226 -2.626 0.008632 **
age -0.018024 0.005794 -3.111 0.001865 **
female 2.438544 0.163843 14.883 < 2e-16 ***
fare 0.011229 0.002119 5.298 1.17e-07 ***
sibsp -0.351742 0.101161 -3.477 0.000507 ***
parch -0.064259 0.100206 -0.641 0.521347
southampton -0.451819 0.183425 -2.463 0.013769 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1413.6 on 1044 degrees of freedom
Residual deviance: 1037.4 on 1038 degrees of freedom
AIC: 1051.4
Number of Fisher Scoring iterations: 5
3b: Odds Ratios
The model’s estimated coefficients are on a logarithmic scale. The estimated odds ratio, which is the exponential of the estimated coefficient, is easier to interpret. Compute the estimated odds ratio for each variable. Then compute a 95% confidence interval for each odds ratio; to do so, you can exponentiate the 95% confidence interval for the coefficient.
# Compute 95% CIs on the coefficient scale using the qnorm function:
mod.glm.coef.matrix <- summary(mod.glm)$coefficients
mod.glm.coef <- mod.glm.coef.matrix[, "Estimate"]
mod.glm.coef.ste <- mod.glm.coef.matrix[, "Std. Error"]
lower.ci <- qnorm(0.025, mean = mod.glm.coef, sd = mod.glm.coef.ste)
upper.ci <- qnorm(0.975, mean = mod.glm.coef, sd = mod.glm.coef.ste)
# Exponentiate the coefficient estimates and CI bounds, rounding to 2 digits:
Variable <- names(mod.glm.coef)
Coef <- round(exp(mod.glm.coef), digits = 2)
Lower <- round(exp(lower.ci), digits = 2)
Upper <- round(exp(upper.ci), digits = 2)
# Display in a data.frame:
mod.glm.res <- data.frame(Variable, Coef, Lower, Upper, row.names = NULL)
mod.glm.res
     Variable  Coef Lower Upper
1 (Intercept) 0.52 0.32 0.85
2 age 0.98 0.97 0.99
3 female 11.46 8.31 15.79
4 fare 1.01 1.01 1.02
5 sibsp 0.70 0.58 0.86
6 parch 0.94 0.77 1.14
7 southampton 0.64 0.44 0.91
# Alternatively, use the confint and cbind functions to summarize the odds ratios and 95% CIs.
# Note: confint on a glm profiles the likelihood, so these intervals differ slightly from the Wald intervals above.
exp(cbind(coef = coef(mod.glm), confint(mod.glm)))
                  coef     2.5 %    97.5 %
(Intercept) 0.5183152 0.3164465 0.8448001
age 0.9821373 0.9709323 0.9932594
female 11.4563515 8.3492390 15.8782229
fare 1.0112923 1.0072729 1.0156778
sibsp 0.7034618 0.5736165 0.8534267
parch 0.9377619 0.7699818 1.1419401
southampton 0.6364696 0.4443296 0.9126204
3c: Increased Odds of Survival
Which factors led to an increased likelihood of survival that was statistically significant at the 0.05 level?
Being female (odds ratio 11.46, 95% CI 8.31 to 15.79) and paying a higher fare (odds ratio 1.01 per unit of fare, 95% CI 1.01 to 1.02) were each associated with a significantly increased likelihood of survival.
3d: Decreased Odds of Survival
Which factors led to a decreased likelihood of survival that was statistically significant at the 0.05 level?
Older age (odds ratio 0.98 per year), each additional sibling or spouse aboard (odds ratio 0.70), and embarking at Southampton (odds ratio 0.64) were each associated with a significantly decreased likelihood of survival. The parch variable was not statistically significant (p = 0.52).
3e: In-Sample Results
For the first 10 rows of the measured.dat, show the model’s fitted value (an estimated probability of survival) and the passenger’s actual survival status.
pred.glm.10 <- predict(mod.glm, newdata = measured.dat[1:10, ], type = "response")
data.frame(Actual = measured.dat[1:10, "survived"], Pred = round(pred.glm.10, 2))
   Actual Pred
1 1 0.96
2 1 0.52
3 0 0.93
4 0 0.39
5 0 0.89
6 1 0.16
7 1 0.67
8 0 0.14
9 1 0.56
10 0 0.20
3f: In-Sample Evaluation
For all of the rows of the measured.dat, create an estimated classification of a passenger’s survival status by rounding the logistic regression model’s estimated likelihood to 1 or 0. Then compute the percentage of false positives, the percentage of false negatives, and the overall percentage of correct classifications.
pred.glm <- predict(mod.glm, type = "response")
# Classify a passenger as surviving when the fitted probability is at least 0.5:
measured.dat[["mod.glm.pred"]] <- factor(ifelse(pred.glm >= 0.5, 1, 0))
# Create a confusion matrix using confusionMatrix from the caret library:
library("caret")
conf <- confusionMatrix(measured.dat[["mod.glm.pred"]], reference = factor(measured.dat[["survived"]]), positive = "1")
conf
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 517 133
1 101 294
Accuracy : 0.7761
95% CI : (0.7496, 0.801)
No Information Rate : 0.5914
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.5312
Mcnemar's Test P-Value : 0.04271
Sensitivity : 0.6885
Specificity : 0.8366
Pos Pred Value : 0.7443
Neg Pred Value : 0.7954
Prevalence : 0.4086
Detection Rate : 0.2813
Detection Prevalence : 0.3780
Balanced Accuracy : 0.7625
'Positive' Class : 1
# Create a new column mod.glm.newclass with levels FP, FN, TP, and TN:
measured.dat[["mod.glm.newclass"]] <- ifelse(measured.dat[["survived"]] == 0 & measured.dat[["mod.glm.pred"]] == 0, "TN",
  ifelse(measured.dat[["survived"]] == 0 & measured.dat[["mod.glm.pred"]] == 1, "FP",
    ifelse(measured.dat[["survived"]] == 1 & measured.dat[["mod.glm.pred"]] == 0, "FN", "TP")))
# Create a raw confusion matrix using the table function:
conf.val <- table(measured.dat[["mod.glm.newclass"]])
conf.val
FN FP TN TP
133 101 517 294
# Compute the percentage of false positives, the percentage of false negatives, and the overall percentage of correct classifications (accuracy):
library("scales")
false.positive.rate <- percent(conf.val[["FP"]]/sum(conf.val[["FP"]],conf.val[["TN"]]))
false.negative.rate <- percent(conf.val[["FN"]]/sum(conf.val[["FN"]],conf.val[["TP"]]))
accuracy <- percent((conf.val[["TP"]]+conf.val[["TN"]])/sum(conf.val))
false.positive.rate
[1] "16.3%"
false.negative.rate
[1] "31.1%"
accuracy
[1] "77.6%"
Question 4: Added Information
Does a clustering assignment add information to a predictive model? One way to assess this question is to add the group assignments of a clustering procedure (as a categorical variable) to a regression model that already includes the variables that were used to build the clustering model. With this approach, we will evaluate the information added by our earlier clustering work in Question 2 to the logistic regression model built in Question 3.
4a: Hierarchical Clustering Assignments as a Predictor of Survival
Fit a logistic regression model of survived in terms of the following variables:
- age
- female
- fare
- sibsp
- parch
- southampton
- hclust.group (as a categorical variable)
Show a summary of the estimated coefficients and evaluate whether the hierarchical clustering assignments add information to the predictions.
# Convert hclust.group to a categorical (factor) variable:
measured.dat$hclust.group <- factor(measured.dat$hclust.group)
# Fit a logistic regression that adds the hierarchical cluster assignments:
mod.glm2 <- glm(formula = survived ~ age + female + fare + sibsp + parch + southampton + hclust.group, data = measured.dat, family = binomial)
# Summary of the model:
summary(mod.glm2)
Call:
glm(formula = survived ~ age + female + fare + sibsp + parch +
southampton + hclust.group, family = binomial, data = measured.dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2004 -0.6328 -0.5341 0.7120 2.6281
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.304e+00 7.651e-01 -4.318 1.58e-05 ***
age -1.009e-02 6.495e-03 -1.553 0.12042
female 2.436e+00 1.668e-01 14.605 < 2e-16 ***
fare 2.121e-02 3.288e-03 6.450 1.12e-10 ***
sibsp -5.301e-01 1.102e-01 -4.812 1.49e-06 ***
parch -4.517e-01 1.429e-01 -3.160 0.00158 **
southampton -3.419e-01 1.871e-01 -1.827 0.06768 .
hclust.group2 3.507e+00 7.399e-01 4.740 2.14e-06 ***
hclust.group3 2.092e+00 6.824e-01 3.066 0.00217 **
hclust.group4 7.678e+00 6.137e+02 0.013 0.99002
hclust.group5 -8.105e+00 1.455e+03 -0.006 0.99556
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1413.6 on 1044 degrees of freedom
Residual deviance: 1008.7 on 1034 degrees of freedom
AIC: 1030.7
Number of Fisher Scoring iterations: 14
The hierarchical clustering assignments do add information: the AIC drops from 1051.4 to 1030.7, and the coefficients for clusters 2 and 3 are statistically significant. The enormous standard errors for clusters 4 and 5 suggest quasi-complete separation in those small groups, so their estimates should not be over-interpreted.
4b: K-Means Clustering Assignments as a Predictor of Survival
Repeat the exercise of 4a using the kmeans.group as a predictor instead of hclust.group.
# Convert kmeans.group to a categorical (factor) variable:
measured.dat$kmeans.group <- factor(measured.dat$kmeans.group)
# Fit a logistic regression that adds the k-means cluster assignments:
mod.glm3 <- glm(formula = survived ~ age + female + fare + sibsp + parch + southampton + kmeans.group, data = measured.dat, family = binomial)
# Summary of the model:
summary(mod.glm3)
Call:
glm(formula = survived ~ age + female + fare + sibsp + parch +
southampton + kmeans.group, family = binomial, data = measured.dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0859 -0.6561 -0.5273 0.7357 2.4794
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.538135 0.653598 -0.823 0.41031
age -0.018471 0.007034 -2.626 0.00864 **
female 2.197216 0.262255 8.378 < 2e-16 ***
fare 0.011068 0.002092 5.291 1.21e-07 ***
sibsp -0.150956 0.120248 -1.255 0.20935
parch 0.149058 0.124370 1.199 0.23072
southampton -0.626664 0.623789 -1.005 0.31508
kmeans.group2 -0.113335 0.673133 -0.168 0.86629
kmeans.group3 -0.117413 0.326320 -0.360 0.71899
kmeans.group4 -1.265825 0.544604 -2.324 0.02011 *
kmeans.group5 0.309035 0.340495 0.908 0.36409
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1413.6 on 1044 degrees of freedom
Residual deviance: 1026.1 on 1034 degrees of freedom
AIC: 1048.1
Number of Fisher Scoring iterations: 5
The k-means assignments add comparatively little: the AIC improves only slightly (1048.1 versus 1051.4), and only the kmeans.group4 coefficient is significant at the 0.05 level.
4c: Both Clustering Assignments
Now repeat the exercise using both clustering assignments as predictors.
mod.glm4 <- glm(formula = survived ~ age + female + fare + sibsp + parch + southampton + hclust.group + kmeans.group, data = measured.dat, family = binomial)
# Summary of the model:
summary(mod.glm4)
Call:
glm(formula = survived ~ age + female + fare + sibsp + parch +
southampton + hclust.group + kmeans.group, family = binomial,
data = measured.dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1617 -0.6223 -0.4983 0.7487 2.7386
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.105e+00 9.488e-01 -3.273 0.00106 **
age -6.696e-03 8.189e-03 -0.818 0.41356
female 2.217e+00 2.664e-01 8.325 < 2e-16 ***
fare 2.033e-02 3.270e-03 6.215 5.12e-10 ***
sibsp -3.572e-01 1.311e-01 -2.725 0.00642 **
parch -2.814e-01 1.691e-01 -1.664 0.09615 .
southampton -4.203e-01 6.121e-01 -0.687 0.49233
hclust.group2 3.309e+00 7.444e-01 4.446 8.76e-06 ***
hclust.group3 1.892e+00 6.894e-01 2.744 0.00606 **
hclust.group4 7.870e+00 6.347e+02 0.012 0.99011
hclust.group5 -8.944e+00 1.455e+03 -0.006 0.99510
kmeans.group2 -7.841e-02 6.617e-01 -0.119 0.90567
kmeans.group3 -3.731e-01 3.423e-01 -1.090 0.27577
kmeans.group4 -9.932e-01 5.454e-01 -1.821 0.06859 .
kmeans.group5 2.359e-01 3.464e-01 0.681 0.49592
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1413.6 on 1044 degrees of freedom
Residual deviance: 1000.1 on 1030 degrees of freedom
AIC: 1030.1
Number of Fisher Scoring iterations: 14
With both sets of assignments included, the AIC (1030.1) is essentially the same as with the hierarchical assignments alone (1030.7), so the k-means groups add little beyond the hierarchical ones.
4d: Interpretation of Added Information
What does it mean to add information to a model? Is clustering a worthwhile exercise as a method of building a predictive model? Write a few sentences to explain your answers.
We have several candidate models to represent the true relationship between the predictors and the outcome. If we knew the true model, we could measure the information lost by using a candidate model in its place, and we would choose the candidate that minimizes that loss. The AIC estimates this relative information loss for a given data set: between two candidates, the one with the higher AIC loses more information and is of relatively lower quality. AIC is thus a means of model selection that trades off goodness of fit against model complexity.
Clustering is not a worthwhile exercise as a method of building a predictive model, because the cluster assignments are derived from the other predictors and are therefore largely redundant with them. Although this redundancy does not necessarily reduce the predictive power of the model as a whole, it muddies the interpretation of individual predictors, and the coefficient estimates may change erratically in response to small changes in the model or the data.
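Since all four models were fit to the same data, their relative information loss can be compared directly; a minimal check using the model objects defined above:
AIC(mod.glm, mod.glm2, mod.glm3, mod.glm4)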
Question 5: Description Versus Prediction
5a: Training and Testing Sets?
The logistic regressions developed above did not split the data into separate training and testing sets. What would we have gained by doing so? Explain your answer in a few sentences.
Here we are more interested in inference than prediction: we want unbiased coefficient estimates and to judge the effect each predictor has on the odds of survival, so the focus is on description rather than prediction. Splitting the data into training and testing sets would have given us an estimate of out-of-sample performance, allowing us to detect over-fitting and to judge how well the model generalizes to unseen data.
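For illustration, a minimal sketch of what such a split might look like (the 70/30 proportion and the seed are arbitrary choices, not part of the assignment):
# Hold out 30% of the rows as a testing set:
set.seed(821)
train.idx <- sample(nrow(measured.dat), size = floor(0.7 * nrow(measured.dat)))
train.dat <- measured.dat[train.idx, ]
test.dat <- measured.dat[-train.idx, ]
# Fit on the training set only:
mod.train <- glm(survived ~ age + female + fare + sibsp + parch + southampton,
                 data = train.dat, family = binomial)
# Estimate out-of-sample accuracy on the held-out rows:
test.pred <- ifelse(predict(mod.train, newdata = test.dat, type = "response") >= 0.5, 1, 0)
mean(test.pred == test.dat$survived)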
5b: Application of the Titanic’s Predictive Model
Would it even make sense to use the logistic regression model for the Titanic’s passengers to make a prediction? Explain your opinion with a few sentences.
No, it would not make sense to use this logistic regression model to make predictions. The model has never been evaluated on samples that were not used to build it, so we have no unbiased sense of its effectiveness; a model assessed only in-sample may be over-fit, and its predictions on unseen data would be unreliable. A train/test (or train/validation/test) process would be needed to estimate the over-fitting and establish that the model generalizes.