About the Data
For this assignment, we will be analyzing data on the passengers from the ill-fated maiden voyage of the Titanic. Within R’s PASWR library, the data set titanic3 can be accessed using the command data(titanic3). A description of the variables is available in the help file for this data set by typing help(titanic3).
Question 1: Data Exploration
1a: Survival
How many passengers survived, and how many perished as a result of the sinking?
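The code that generated the listing below is not shown in the output; a minimal sketch that would reproduce it (assuming the PASWR package is installed):
library(PASWR)
data(titanic3)
head(titanic3, n = 10)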
   pclass survived                            name    sex     age sibsp
1     1st        1   Allen, Miss. Elisabeth Walton female 29.0000     0
2     1st        1  Allison, Master. Hudson Trevor   male  0.9167     1
3     1st        0    Allison, Miss. Helen Loraine female  2.0000     1
4     1st        0 Allison, Mr. Hudson Joshua Crei   male 30.0000     1
5     1st        0 Allison, Mrs. Hudson J C (Bessi female 25.0000     1
6     1st        1             Anderson, Mr. Harry   male 48.0000     0
7     1st        1 Andrews, Miss. Kornelia Theodos female 63.0000     1
8     1st        0          Andrews, Mr. Thomas Jr   male 39.0000     0
9     1st        1 Appleton, Mrs. Edward Dale (Cha female 53.0000     2
10    1st        0         Artagaveytia, Mr. Ramon   male 71.0000     0
   parch   ticket     fare   cabin    embarked boat body
1      0    24160 211.3375      B5 Southampton    2   NA
2      2   113781 151.5500 C22 C26 Southampton   11   NA
3      2   113781 151.5500 C22 C26 Southampton        NA
4      2   113781 151.5500 C22 C26 Southampton       135
5      2   113781 151.5500 C22 C26 Southampton        NA
6      0    19952  26.5500     E12 Southampton    3   NA
7      0    13502  77.9583      D7 Southampton   10   NA
8      0   112050   0.0000     A36 Southampton        NA
9      0    11769  51.4792    C101 Southampton    D   NA
10     0 PC 17609  49.5042         Cherbourg          22
                         home.dest
1                     St Louis, MO
2  Montreal, PQ / Chesterville, ON
3  Montreal, PQ / Chesterville, ON
4  Montreal, PQ / Chesterville, ON
5  Montreal, PQ / Chesterville, ON
6                     New York, NY
7                       Hudson, NY
8                      Belfast, NI
9              Bayside, Queens, NY
10             Montevideo, Uruguay
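The counts below likewise appear without their generating code; presumably something like:
sum(titanic3$survived == 1)
sum(titanic3$survived == 0)
table(titanic3$survived)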
[1] 500
[1] 809
0 1
809 500
1b: Class of Tickets
The ship sold tickets in 1st, 2nd, and 3rd class areas. How many passengers were in each class?
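(The generating call is not shown; a one-line table of pclass would produce these counts:)
table(titanic3$pclass)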
1st 2nd 3rd
323 277 709
1c: Point of Embarkation
Most of the passengers embarked from Southampton. Let’s create a new variable in the data set called southampton that will have the value 1 for passengers who embarked from Southampton and 0 for all other passengers (including any missing values). Create this variable and then show how many passengers had the value 1 and how many had the value 0 for the southampton variable.
library(tidyverse)
# 1 for passengers who embarked from Southampton; all others, including missing values, get 0:
titanic3 <- titanic3 %>%
  mutate(southampton = ifelse(!is.na(embarked) & embarked == "Southampton", 1, 0))
table(titanic3$southampton)
0 1
395 914
1d: Sex of the Passengers
Create a variable called female with the value 1 for females and 0 for males. Show the counts of each category.
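(Only the counts appear below; code along these lines would create the variable and tabulate it:)
titanic3 <- titanic3 %>%
  mutate(female = ifelse(sex == "female", 1, 0))
table(titanic3$female)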
0 1
843 466
1e: Fares
Use the summary function to display some key figures about the distribution of the fares that the passengers paid.
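(The call itself is not shown; presumably:)
summary(titanic3$fare)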
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 7.896 14.454 33.295 31.275 512.329 1
1f: Family Members
Now use the summary function to display the key figures about the distribution of the number of siblings (sibsp) and separately for the number of parents or children (parch) on board.
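(Again the generating calls are not shown; presumably, for sibsp:)
summary(titanic3$sibsp)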
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.4989 1.0000 8.0000
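(and for parch:)
summary(titanic3$parch)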
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.000 0.385 0.000 9.000
1g: Appropriate Summarization
Do you think that providing the quartiles, minimum, maximum, and mean value is the best way to summarize the counts of the number of family members for each passenger? If not, what would be a better way to summarize these variables?
# Providing the quartiles, minimum, maximum, and mean is not the best way to summarize the counts of family members, because these counts are discrete: only a small, finite set of values was recorded for each passenger.
# A frequency table is a better way to summarize discrete data. In terms of graphs, a boxplot (which depicts the quartiles, minimum, and maximum) suits continuous data, while a barplot of the frequencies is better suited to discrete counts.
table(titanic3$sibsp)
0 1 2 3 4 5 8
891 319 42 20 22 6 9
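# The parch counts below were presumably generated by the matching call:
table(titanic3$parch)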
0 1 2 3 4 5 6 9
1002 170 113 8 6 6 2 2
Question 2: Clustering Models
To better investigate the relationships between the passengers, we will use clustering analysis on the following variables:
- age
- female
- fare
- sibsp
- parch
- southampton
For this exercise, create a separate data.frame called measured.dat. This object should include only the variables listed above, along with the survived variable. Use the na.omit function to restrict the rows to those completely measured (without any missing data). Then use the scale function to standardize all of the values in units of standard deviations above the mean, storing the result in a matrix called subdat. The subdat matrix will contain only the predictors, not the survived outcome; we will use it to answer the following questions.
#create a separate data.frame called measured.dat:
measured.dat <- titanic3 %>% select(age, female, fare, sibsp, parch, southampton, survived)
#remove missing data:
measured.dat <- na.omit(measured.dat)
predictors <- c("age", "female", "fare", "sibsp", "parch", "southampton")
outcome <- c("survived")
subdat <- scale(measured.dat[predictors])
2a: Hierarchical Clustering
Use hierarchical clustering with complete linkage to cluster the subdat. Assign each passenger to one of 5 clusters. Add these assignments as a column called hclust.group to the measured.dat. Then show a dendrogram depicting the results of the hierarchical clustering. Using the color_branches method of the dendextend library may provide a helpful visualization.
# Compute pairwise Euclidean distances between passengers:
dist.mat <- dist(subdat, method = "euclidean")
clusters <- hclust(dist.mat, method = "complete")
# Cut the dendrogram to assign each passenger to one of 5 clusters:
cluster_cut <- cutree(clusters, k = 5)
# Add the cluster assignments as a column of measured.dat:
measured.dat <- measured.dat %>%
  mutate(hclust.group = cluster_cut)
# Use the color_branches method of the dendextend library to visualize the clustering:
library(dendextend)
plot(color_branches(as.dendrogram(clusters), k = 5, groupLabels = FALSE))
2b: K-Means
Now use K-Means clustering with 5 centers and iter.max = 20 to cluster the subdat. Set the randomization seed to 821 just prior to running the algorithm. Add the clustering assignments as a column called kmeans.group to the measured.dat. Then plot the fare versus the age for each passenger while assigning different colors to their kmeans clustering assignment.
set.seed(seed = 821)
# K-means clustering with 5 centers and at most 20 iterations:
km <- kmeans(subdat, centers = 5, iter.max = 20)
# Add the k-means cluster assignments as a column of measured.dat:
measured.dat <- measured.dat %>%
  mutate(kmeans.group = km$cluster)
head(measured.dat)
      age female     fare sibsp parch southampton survived hclust.group
1 29.0000 1 211.3375 0 0 1 1 1
2 0.9167 0 151.5500 1 2 1 1 2
3 2.0000 1 151.5500 1 2 1 0 1
4 30.0000 0 151.5500 1 2 1 0 3
5 25.0000 1 151.5500 1 2 1 0 1
6 48.0000 0 26.5500 0 0 1 1 3
kmeans.group
1 5
2 4
3 4
4 4
5 4
6 3
# Plot fare as a function of age, colored by k-means cluster assignment:
plot(measured.dat$age, measured.dat$fare, xlab = "Age", ylab = "Fare",
     col = measured.dat$kmeans.group)
Question 3: Models
3a
Now we would like to build a logistic regression model for survived using the measured.dat. Include all of the following predictor variables:
- age
- female
- fare
- sibsp
- parch
- southampton
Fit this model and show a summary of the estimated coefficients.
mod.glm <- glm(formula = survived ~ age + female + fare + sibsp + parch + southampton, data = measured.dat, family = binomial)
summary(mod.glm)
Call:
glm(formula = survived ~ age + female + fare + sibsp + parch +
southampton, family = binomial, data = measured.dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2772 -0.6696 -0.5455 0.7576 2.2726
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.657172 0.250226 -2.626 0.008632 **
age -0.018024 0.005794 -3.111 0.001865 **
female 2.438544 0.163843 14.883 < 2e-16 ***
fare 0.011229 0.002119 5.298 1.17e-07 ***
sibsp -0.351742 0.101161 -3.477 0.000507 ***
parch -0.064259 0.100206 -0.641 0.521347
southampton -0.451819 0.183425 -2.463 0.013769 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1413.6 on 1044 degrees of freedom
Residual deviance: 1037.4 on 1038 degrees of freedom
AIC: 1051.4
Number of Fisher Scoring iterations: 5
3b: Odds Ratios
The model’s estimated coefficients are on a logarithmic scale. The estimated odds ratio, which is the exponential of the estimated coefficient, is easier to interpret. Compute the estimated odds ratio for each variable. Then compute a 95% confidence interval for each odds ratio; to do so, you can exponentiate the 95% confidence interval for the coefficient.
# Compute 95% CIs on the coefficient scale using the qnorm function:
mod.glm.coef.matrix <- summary(mod.glm)$coefficients
mod.glm.coef <- mod.glm.coef.matrix[, "Estimate"]
mod.glm.coef.ste <- mod.glm.coef.matrix[, "Std. Error"]
lower.ci <- qnorm(0.025, mean = mod.glm.coef, sd = mod.glm.coef.ste)
upper.ci <- qnorm(0.975, mean = mod.glm.coef, sd = mod.glm.coef.ste)
# Exponentiate the coefficient estimates and CI bounds, rounding to 2 digits:
Variable <- names(mod.glm.coef)
Coef <- round(exp(mod.glm.coef), digits = 2)
Lower <- round(exp(lower.ci), digits = 2)
Upper <- round(exp(upper.ci), digits = 2)
# Display in a data.frame:
mod.glm.res <- data.frame(Variable, Coef, Lower, Upper, row.names = NULL)
mod.glm.res
     Variable  Coef Lower Upper
1 (Intercept) 0.52 0.32 0.85
2 age 0.98 0.97 0.99
3 female 11.46 8.31 15.79
4 fare 1.01 1.01 1.02
5 sibsp 0.70 0.58 0.86
6 parch 0.94 0.77 1.14
7 southampton 0.64 0.44 0.91
# Alternatively, use the confint and cbind functions to summarize the odds ratios and 95% CIs.
# Note: confint on a glm profiles the likelihood, so these intervals differ slightly from the Wald intervals above.
exp(cbind(coef = coef(mod.glm), confint(mod.glm)))
                  coef     2.5 %    97.5 %
(Intercept) 0.5183152 0.3164465 0.8448001
age 0.9821373 0.9709323 0.9932594
female 11.4563515 8.3492390 15.8782229
fare 1.0112923 1.0072729 1.0156778
sibsp 0.7034618 0.5736165 0.8534267
parch 0.9377619 0.7699818 1.1419401
southampton 0.6364696 0.4443296 0.9126204
3c: Increased Odds of Survival
Which factors led to an increased likelihood of survival that was statistically significant at the 0.05 level?
Being female (odds ratio 11.46, 95% CI 8.31 to 15.79) and paying a higher fare (odds ratio 1.01 per unit of fare, 95% CI 1.01 to 1.02) were each associated with a significantly increased likelihood of survival.
3d: Decreased Odds of Survival
Which factors led to a decreased likelihood of survival that was statistically significant at the 0.05 level?
Older age (odds ratio 0.98 per year), each additional sibling or spouse aboard (odds ratio 0.70), and embarking at Southampton (odds ratio 0.64) were each associated with a significantly decreased likelihood of survival. The parch variable was not statistically significant (p = 0.52).
3e: In-Sample Results
For the first 10 rows of the measured.dat, show the model’s fitted value (an estimated probability of survival) and the passenger’s actual survival status.
pred.glm.10 <- predict(mod.glm, newdata = measured.dat[1:10, ], type = "response")
data.frame(Actual = measured.dat[1:10, "survived"], Pred = round(pred.glm.10, 2))
   Actual Pred
1 1 0.96
2 1 0.52
3 0 0.93
4 0 0.39
5 0 0.89
6 1 0.16
7 1 0.67
8 0 0.14
9 1 0.56
10 0 0.20
3f: In-Sample Evaluation
For all of the rows of the measured.dat, create an estimated classification of a passenger’s survival status by rounding the logistic regression model’s estimated likelihood to 1 or 0. Then compute the percentage of false positives, the percentage of false negatives, and the overall percentage of correct classifications.
pred.glm <- predict(mod.glm, type = "response")
# Classify a passenger as surviving when the fitted probability is at least 0.5:
measured.dat[["mod.glm.pred"]] <- factor(ifelse(pred.glm >= 0.5, 1, 0))
# Create a confusion matrix using confusionMatrix from the caret library:
library("caret")
conf <- confusionMatrix(measured.dat[["mod.glm.pred"]], reference = factor(measured.dat[["survived"]]), positive = "1")
conf
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 517 133
1 101 294
Accuracy : 0.7761
95% CI : (0.7496, 0.801)
No Information Rate : 0.5914
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.5312
Mcnemar's Test P-Value : 0.04271
Sensitivity : 0.6885
Specificity : 0.8366
Pos Pred Value : 0.7443
Neg Pred Value : 0.7954
Prevalence : 0.4086
Detection Rate : 0.2813
Detection Prevalence : 0.3780
Balanced Accuracy : 0.7625
'Positive' Class : 1
# Create a new column mod.glm.newclass with levels FP, FN, TP, and TN:
measured.dat[["mod.glm.newclass"]] <- ifelse(measured.dat[["survived"]] == 0 & measured.dat[["mod.glm.pred"]] == 0, "TN",
  ifelse(measured.dat[["survived"]] == 0 & measured.dat[["mod.glm.pred"]] == 1, "FP",
    ifelse(measured.dat[["survived"]] == 1 & measured.dat[["mod.glm.pred"]] == 0, "FN", "TP")))
# Create a raw confusion matrix using the table function:
conf.val <- table(measured.dat[["mod.glm.newclass"]])
conf.val
FN FP TN TP
133 101 517 294
# Compute the percentage of false positives, the percentage of false negatives, and the overall percentage of correct classifications (accuracy):
library("scales")
false.positive.rate <- percent(conf.val[["FP"]]/sum(conf.val[["FP"]],conf.val[["TN"]]))
false.negative.rate <- percent(conf.val[["FN"]]/sum(conf.val[["FN"]],conf.val[["TP"]]))
accuracy <- percent((conf.val[["TP"]]+conf.val[["TN"]])/sum(conf.val))
false.positive.rate
[1] "16.3%"
false.negative.rate
[1] "31.1%"
accuracy
[1] "77.6%"
Question 4: Added Information
Does a clustering assignment add information to a predictive model? One way to assess this question is to add the group assignments of a clustering procedure (as a categorical variable) to a regression model that already includes the variables that were used to build the clustering model. With this approach, we will evaluate the information added by our earlier clustering work in Question 2 to the logistic regression model built in Question 3.
4a: Hierarchical Clustering Assignments as a Predictor of Survival
Fit a logistic regression model of survived in terms of the following variables:
- age
- female
- fare
- sibsp
- parch
- southampton
- hclust.group (as a categorical variable)
Show a summary of the estimated coefficients and evaluate whether the hierarchical clustering assignments add information to the predictions.
# Convert hclust.group to a categorical (factor) variable:
measured.dat$hclust.group <- factor(measured.dat$hclust.group)
# Fit a logistic regression that adds the hierarchical cluster assignments:
mod.glm2 <- glm(formula = survived ~ age + female + fare + sibsp + parch + southampton + hclust.group, data = measured.dat, family = binomial)
# Summary of the model:
summary(mod.glm2)
Call:
glm(formula = survived ~ age + female + fare + sibsp + parch +
southampton + hclust.group, family = binomial, data = measured.dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2004 -0.6328 -0.5341 0.7120 2.6281
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.304e+00 7.651e-01 -4.318 1.58e-05 ***
age -1.009e-02 6.495e-03 -1.553 0.12042
female 2.436e+00 1.668e-01 14.605 < 2e-16 ***
fare 2.121e-02 3.288e-03 6.450 1.12e-10 ***
sibsp -5.301e-01 1.102e-01 -4.812 1.49e-06 ***
parch -4.517e-01 1.429e-01 -3.160 0.00158 **
southampton -3.419e-01 1.871e-01 -1.827 0.06768 .
hclust.group2 3.507e+00 7.399e-01 4.740 2.14e-06 ***
hclust.group3 2.092e+00 6.824e-01 3.066 0.00217 **
hclust.group4 7.678e+00 6.137e+02 0.013 0.99002
hclust.group5 -8.105e+00 1.455e+03 -0.006 0.99556
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1413.6 on 1044 degrees of freedom
Residual deviance: 1008.7 on 1034 degrees of freedom
AIC: 1030.7
Number of Fisher Scoring iterations: 14
The hierarchical clustering assignments do add information: the AIC drops from 1051.4 to 1030.7, and the coefficients for clusters 2 and 3 are statistically significant. The enormous standard errors for clusters 4 and 5 suggest quasi-complete separation in those small groups, so their estimates should not be over-interpreted.
4b: K-Means Clustering Assignments as a Predictor of Survival
Repeat the exercise of 4a using the kmeans.group as a predictor instead of hclust.group.
# Convert kmeans.group to a categorical (factor) variable:
measured.dat$kmeans.group <- factor(measured.dat$kmeans.group)
# Fit a logistic regression that adds the k-means cluster assignments:
mod.glm3 <- glm(formula = survived ~ age + female + fare + sibsp + parch + southampton + kmeans.group, data = measured.dat, family = binomial)
# Summary of the model:
summary(mod.glm3)
Call:
glm(formula = survived ~ age + female + fare + sibsp + parch +
southampton + kmeans.group, family = binomial, data = measured.dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0859 -0.6561 -0.5273 0.7357 2.4794
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.538135 0.653598 -0.823 0.41031
age -0.018471 0.007034 -2.626 0.00864 **
female 2.197216 0.262255 8.378 < 2e-16 ***
fare 0.011068 0.002092 5.291 1.21e-07 ***
sibsp -0.150956 0.120248 -1.255 0.20935
parch 0.149058 0.124370 1.199 0.23072
southampton -0.626664 0.623789 -1.005 0.31508
kmeans.group2 -0.113335 0.673133 -0.168 0.86629
kmeans.group3 -0.117413 0.326320 -0.360 0.71899
kmeans.group4 -1.265825 0.544604 -2.324 0.02011 *
kmeans.group5 0.309035 0.340495 0.908 0.36409
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1413.6 on 1044 degrees of freedom
Residual deviance: 1026.1 on 1034 degrees of freedom
AIC: 1048.1
Number of Fisher Scoring iterations: 5
The k-means assignments add comparatively little: the AIC improves only slightly (1048.1 versus 1051.4), and only the kmeans.group4 coefficient is significant at the 0.05 level.
4c: Both Clustering Assignments
Now repeat the exercise using both clustering assignments as predictors.
mod.glm4 <- glm(formula = survived ~ age + female + fare + sibsp + parch + southampton + hclust.group + kmeans.group, data = measured.dat, family = binomial)
# Summary of the model:
summary(mod.glm4)
Call:
glm(formula = survived ~ age + female + fare + sibsp + parch +
southampton + hclust.group + kmeans.group, family = binomial,
data = measured.dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1617 -0.6223 -0.4983 0.7487 2.7386
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.105e+00 9.488e-01 -3.273 0.00106 **
age -6.696e-03 8.189e-03 -0.818 0.41356
female 2.217e+00 2.664e-01 8.325 < 2e-16 ***
fare 2.033e-02 3.270e-03 6.215 5.12e-10 ***
sibsp -3.572e-01 1.311e-01 -2.725 0.00642 **
parch -2.814e-01 1.691e-01 -1.664 0.09615 .
southampton -4.203e-01 6.121e-01 -0.687 0.49233
hclust.group2 3.309e+00 7.444e-01 4.446 8.76e-06 ***
hclust.group3 1.892e+00 6.894e-01 2.744 0.00606 **
hclust.group4 7.870e+00 6.347e+02 0.012 0.99011
hclust.group5 -8.944e+00 1.455e+03 -0.006 0.99510
kmeans.group2 -7.841e-02 6.617e-01 -0.119 0.90567
kmeans.group3 -3.731e-01 3.423e-01 -1.090 0.27577
kmeans.group4 -9.932e-01 5.454e-01 -1.821 0.06859 .
kmeans.group5 2.359e-01 3.464e-01 0.681 0.49592
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1413.6 on 1044 degrees of freedom
Residual deviance: 1000.1 on 1030 degrees of freedom
AIC: 1030.1
Number of Fisher Scoring iterations: 14
With both sets of assignments included, the AIC (1030.1) is essentially the same as with the hierarchical assignments alone (1030.7), so the k-means groups add little beyond the hierarchical ones.
4d: Interpretation of Added Information
What does it mean to add information to a model? Is clustering a worthwhile exercise as a method of building a predictive model? Write a few sentences to explain your answers.
We have several candidate models to represent the true relationship between the predictors and the outcome. If we knew the true model, we could measure the information lost by using a candidate model in its place, and we would choose the candidate that minimizes that loss. The AIC estimates this relative information loss for a given data set: between two candidates, the one with the higher AIC loses more information and is of relatively lower quality. AIC is thus a means of model selection that trades off goodness of fit against model complexity.
Clustering is not a worthwhile exercise as a method of building a predictive model, because the cluster assignments are derived from the other predictors and are therefore largely redundant with them. Although this redundancy does not necessarily reduce the predictive power of the model as a whole, it muddies the interpretation of individual predictors, and the coefficient estimates may change erratically in response to small changes in the model or the data.
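Since all four models were fit to the same data, their relative information loss can be compared directly; a minimal check using the model objects defined above:
AIC(mod.glm, mod.glm2, mod.glm3, mod.glm4)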
Question 5: Description Versus Prediction
5a: Training and Testing Sets?
The logistic regressions developed above did not split the data into separate training and testing sets. What would we have gained by doing so? Explain your answer in a few sentences.
Here we are more interested in inference than prediction: we want unbiased coefficient estimates and to judge the effect each predictor has on the odds of survival, so the focus is on description rather than prediction. Splitting the data into training and testing sets would have given us an estimate of out-of-sample performance, allowing us to detect over-fitting and to judge how well the model generalizes to unseen data.
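For illustration, a minimal sketch of what such a split might look like (the 70/30 proportion and the seed are arbitrary choices, not part of the assignment):
# Hold out 30% of the rows as a testing set:
set.seed(821)
train.idx <- sample(nrow(measured.dat), size = floor(0.7 * nrow(measured.dat)))
train.dat <- measured.dat[train.idx, ]
test.dat <- measured.dat[-train.idx, ]
# Fit on the training set only:
mod.train <- glm(survived ~ age + female + fare + sibsp + parch + southampton,
                 data = train.dat, family = binomial)
# Estimate out-of-sample accuracy on the held-out rows:
test.pred <- ifelse(predict(mod.train, newdata = test.dat, type = "response") >= 0.5, 1, 0)
mean(test.pred == test.dat$survived)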
5b: Application of the Titanic’s Predictive Model
Would it even make sense to use the logistic regression model for the Titanic’s passengers to make a prediction? Explain your opinion with a few sentences.
No, it would not make sense to use this logistic regression model to make predictions. The model has never been evaluated on samples that were not used to build it, so we have no unbiased sense of its effectiveness; a model assessed only in-sample may be over-fit, and its predictions on unseen data would be unreliable. A train/test (or train/validation/test) process would be needed to estimate the over-fitting and establish that the model generalizes.