set.seed(seed = 9291)
library(PASWR)
library(dendextend)

About the Data

For this assignment, we will be analyzing data on the passengers from the ill-fated maiden voyage of the Titanic. Within R’s PASWR library, the data set titanic3 can be accessed using the command data(titanic3). A description of the variables is available in the help file for this data set by typing help(titanic3).

Question 1: Data Exploration

1a: Survival

How many passengers survived, and how many passed away as a result of the crash?

   pclass survived                            name    sex     age sibsp
1     1st        1   Allen, Miss. Elisabeth Walton female 29.0000     0
2     1st        1  Allison, Master. Hudson Trevor   male  0.9167     1
3     1st        0    Allison, Miss. Helen Loraine female  2.0000     1
4     1st        0 Allison, Mr. Hudson Joshua Crei   male 30.0000     1
5     1st        0 Allison, Mrs. Hudson J C (Bessi female 25.0000     1
6     1st        1             Anderson, Mr. Harry   male 48.0000     0
7     1st        1 Andrews, Miss. Kornelia Theodos female 63.0000     1
8     1st        0          Andrews, Mr. Thomas Jr   male 39.0000     0
9     1st        1 Appleton, Mrs. Edward Dale (Cha female 53.0000     2
10    1st        0         Artagaveytia, Mr. Ramon   male 71.0000     0
   parch   ticket     fare   cabin    embarked boat body
1      0    24160 211.3375      B5 Southampton    2   NA
2      2   113781 151.5500 C22 C26 Southampton   11   NA
3      2   113781 151.5500 C22 C26 Southampton        NA
4      2   113781 151.5500 C22 C26 Southampton       135
5      2   113781 151.5500 C22 C26 Southampton        NA
6      0    19952  26.5500     E12 Southampton    3   NA
7      0    13502  77.9583      D7 Southampton   10   NA
8      0   112050   0.0000     A36 Southampton        NA
9      0    11769  51.4792    C101 Southampton    D   NA
10     0 PC 17609  49.5042           Cherbourg        22
                         home.dest
1                     St Louis, MO
2  Montreal, PQ / Chesterville, ON
3  Montreal, PQ / Chesterville, ON
4  Montreal, PQ / Chesterville, ON
5  Montreal, PQ / Chesterville, ON
6                     New York, NY
7                       Hudson, NY
8                      Belfast, NI
9              Bayside, Queens, NY
10             Montevideo, Uruguay
[1] 500
[1] 809

  0   1 
809 500 

1b: Class of Tickets

The ship sold tickets in 1st, 2nd, and 3rd class areas. How many passengers were in each class?


1st 2nd 3rd 
323 277 709 

1c: Point of Embarcation

Most of the passengers embarked from Southampton. Let’s create a new variable in the data set called southampton that will have the value 1 for passengers who embarked from Southampton and 0 for all other passengers (including any missing values). Create this variable and then show how many passengers had the values of 1 and how many had the value of 0 for the southampton variable.


  0   1 
395 914 

1d: Sex of the Passengers

Create a variable called female with the value 1 for females and 0 for males. Show the counts of each category.


  0   1 
843 466 

1e: Fares

Use the summary function to display some key figures about the distribution of the fares that the passengers paid.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   7.896  14.454  33.295  31.275 512.329       1 

1f: Family Members

Now use the summary function to display the key figures about the distribution of the number of siblings (sibsp) and separately for the number of parents or children (parch) on board.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.4989  1.0000  8.0000 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   0.385   0.000   9.000 

Question 2: Clustering Models

To better investigate the relationships between the passengers, we will use clustering analysis on the following variables:

For this exercise, create a separate data.frame called measured.dat. This object should include only the variables listed above, along with the survived variable. Use the na.omit function to restrict the rows to those completely measured (without any missing data). Then use the scale function to standardize all of the values in units of the number of standard deviations above average, which will be stored in a matrix called subdat. The subdat will only contain the predictors, not the survived outcome. We will use the subdat object to answer the following questions.

2a: Hierarchical Clustering

Use hierarchicical clustering with complete linkage to cluster the subdat. Assign each passenger to one of 5 clusters. Add these assignments as a column called hclust.group to the measured.dat. Then show a dendrogram depicting the results of the hierarchical clustering. Using the color_branches method of the dendextend library may provide a helpful visualization.

2b: K-Means

Now use K-Means clustering with 5 centers and iter.max = 20 to cluster the subdat. Set the randomization seed to 821 just prior to running the algorithm. Add the clustering assignments as a column called kmeans.group to the measured.dat. Then plot the fare versus the age for each passenger while assigning different colors to their kmeans clustering assignment.

      age female     fare sibsp parch southampton survived hclust.group
1 29.0000      1 211.3375     0     0           1        1            1
2  0.9167      0 151.5500     1     2           1        1            2
3  2.0000      1 151.5500     1     2           1        0            1
4 30.0000      0 151.5500     1     2           1        0            3
5 25.0000      1 151.5500     1     2           1        0            1
6 48.0000      0  26.5500     0     0           1        1            3
  kmeans.group
1            5
2            4
3            4
4            4
5            4
6            3

Question 3: Models

3a

Now we would like to build a logistic regression model for survived using the measured.dat. Include all of the following predictor variables:

  • age
  • female
  • fare
  • sibsp
  • parch
  • southampton

Fit this model and show a summary of the estimated coefficients.

      age female     fare sibsp parch southampton survived hclust.group
1 29.0000      1 211.3375     0     0           1        1            1
2  0.9167      0 151.5500     1     2           1        1            2
3  2.0000      1 151.5500     1     2           1        0            1
4 30.0000      0 151.5500     1     2           1        0            3
5 25.0000      1 151.5500     1     2           1        0            1
6 48.0000      0  26.5500     0     0           1        1            3
  kmeans.group
1            5
2            4
3            4
4            4
5            4
6            3

Call:
glm(formula = survived ~ age + female + fare + sibsp + parch + 
    southampton, family = binomial, data = measured.dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2772  -0.6696  -0.5455   0.7576   2.2726  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.657172   0.250226  -2.626 0.008632 ** 
age         -0.018024   0.005794  -3.111 0.001865 ** 
female       2.438544   0.163843  14.883  < 2e-16 ***
fare         0.011229   0.002119   5.298 1.17e-07 ***
sibsp       -0.351742   0.101161  -3.477 0.000507 ***
parch       -0.064259   0.100206  -0.641 0.521347    
southampton -0.451819   0.183425  -2.463 0.013769 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1413.6  on 1044  degrees of freedom
Residual deviance: 1037.4  on 1038  degrees of freedom
AIC: 1051.4

Number of Fisher Scoring iterations: 5

3b: Odds Ratios

The model’s estimated coefficients are on a logarithmic scale. The estimated Odds Ratio, which is the exponential of the estimated coefficient, can be more easily interpreted. Compute the estimated Odds Ratio of each variable. Then compute a 95% confidence interval for the Odds Ratio. To do so, you can exponentiate the 95% confidence interval for the coefficient.

     Variable  Coef Lower Upper
1 (Intercept)  0.52  0.32  0.85
2         age  0.98  0.97  0.99
3      female 11.46  8.31 15.79
4        fare  1.01  1.01  1.02
5       sibsp  0.70  0.58  0.86
6       parch  0.94  0.77  1.14
7 southampton  0.64  0.44  0.91
                  coef     2.5 %     97.5 %
(Intercept)  0.5183152 0.3164465  0.8448001
age          0.9821373 0.9709323  0.9932594
female      11.4563515 8.3492390 15.8782229
fare         1.0112923 1.0072729  1.0156778
sibsp        0.7034618 0.5736165  0.8534267
parch        0.9377619 0.7699818  1.1419401
southampton  0.6364696 0.4443296  0.9126204

3c: Increased Odds of Survival

Which factors led to an increased likelihood of survival that was statistically significant at the 0.05 level?

3d: Decreased Odds of Survival

Which factors led to an decreased likelihood of survival that was statistically significant at the 0.05 level?

3e: In-Sample Results

For the first 10 rows of the measured.dat, show the model’s fitted value (an estimated probability of survival) and the passenger’s actual survival status.

   Actual Pred
1       1 0.96
2       1 0.52
3       0 0.93
4       0 0.39
5       0 0.89
6       1 0.16
7       1 0.67
8       0 0.14
9       1 0.56
10      0 0.20

3f: In-Sample Evaluation

For all of the rows of the measured.dat, create an estimated classification of a passenger’s survival status by rounding the logistic regression model’s estimated likelihood to 1 or 0. Then compute the percentage of false positives, the percentage of false negatives, and the overall percentage of correct classifications.

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 517 133
         1 101 294
                                         
               Accuracy : 0.7761         
                 95% CI : (0.7496, 0.801)
    No Information Rate : 0.5914         
    P-Value [Acc > NIR] : < 2e-16        
                                         
                  Kappa : 0.5312         
                                         
 Mcnemar's Test P-Value : 0.04271        
                                         
            Sensitivity : 0.6885         
            Specificity : 0.8366         
         Pos Pred Value : 0.7443         
         Neg Pred Value : 0.7954         
             Prevalence : 0.4086         
         Detection Rate : 0.2813         
   Detection Prevalence : 0.3780         
      Balanced Accuracy : 0.7625         
                                         
       'Positive' Class : 1              
                                         

 FN  FP  TN  TP 
133 101 517 294 
[1] "16.3%"
[1] "31.1%"
[1] "77.6%"

Question 4: Added Information

Does a clustering assignment add information to a predictive model? One way to assess this question is to add the group assignments of a clustering procedure (as a categorical variable) to a regression model that already includes the variables that were used to build the clustering model. With this approach, we will evaluate the information added by our earlier clustering work in Question 2 to the logistic regression model built in Question 3.

4a: Hierarchical Clustering Assignments as a Predictor of Survival

Fit a logistic regression model of survived in terms of the following variables:

  • age
  • female
  • fare
  • sibsp
  • parch
  • southampton
  • hclust.group (as a categorical variable)

Show a summary of the estimated coefficients and evaluate whether the hierarchical clustering assignments add information to the predictions.


Call:
glm(formula = survived ~ age + female + fare + sibsp + parch + 
    southampton + hclust.group, family = binomial, data = measured.dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2004  -0.6328  -0.5341   0.7120   2.6281  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -3.304e+00  7.651e-01  -4.318 1.58e-05 ***
age           -1.009e-02  6.495e-03  -1.553  0.12042    
female         2.436e+00  1.668e-01  14.605  < 2e-16 ***
fare           2.121e-02  3.288e-03   6.450 1.12e-10 ***
sibsp         -5.301e-01  1.102e-01  -4.812 1.49e-06 ***
parch         -4.517e-01  1.429e-01  -3.160  0.00158 ** 
southampton   -3.419e-01  1.871e-01  -1.827  0.06768 .  
hclust.group2  3.507e+00  7.399e-01   4.740 2.14e-06 ***
hclust.group3  2.092e+00  6.824e-01   3.066  0.00217 ** 
hclust.group4  7.678e+00  6.137e+02   0.013  0.99002    
hclust.group5 -8.105e+00  1.455e+03  -0.006  0.99556    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1413.6  on 1044  degrees of freedom
Residual deviance: 1008.7  on 1034  degrees of freedom
AIC: 1030.7

Number of Fisher Scoring iterations: 14

4b: K-Means Clustering Assignments as a Predictor of Survival

Repeat the exercise of 4a using the kmeans.group as a predictor instead of hclust.group.


Call:
glm(formula = survived ~ age + female + fare + sibsp + parch + 
    southampton + kmeans.group, family = binomial, data = measured.dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0859  -0.6561  -0.5273   0.7357   2.4794  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -0.538135   0.653598  -0.823  0.41031    
age           -0.018471   0.007034  -2.626  0.00864 ** 
female         2.197216   0.262255   8.378  < 2e-16 ***
fare           0.011068   0.002092   5.291 1.21e-07 ***
sibsp         -0.150956   0.120248  -1.255  0.20935    
parch          0.149058   0.124370   1.199  0.23072    
southampton   -0.626664   0.623789  -1.005  0.31508    
kmeans.group2 -0.113335   0.673133  -0.168  0.86629    
kmeans.group3 -0.117413   0.326320  -0.360  0.71899    
kmeans.group4 -1.265825   0.544604  -2.324  0.02011 *  
kmeans.group5  0.309035   0.340495   0.908  0.36409    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1413.6  on 1044  degrees of freedom
Residual deviance: 1026.1  on 1034  degrees of freedom
AIC: 1048.1

Number of Fisher Scoring iterations: 5

4c: Both Clustering Assignments

Now repeat the exercise using both clustering assignments as predictors.


Call:
glm(formula = survived ~ age + female + fare + sibsp + parch + 
    southampton + hclust.group + kmeans.group, family = binomial, 
    data = measured.dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1617  -0.6223  -0.4983   0.7487   2.7386  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -3.105e+00  9.488e-01  -3.273  0.00106 ** 
age           -6.696e-03  8.189e-03  -0.818  0.41356    
female         2.217e+00  2.664e-01   8.325  < 2e-16 ***
fare           2.033e-02  3.270e-03   6.215 5.12e-10 ***
sibsp         -3.572e-01  1.311e-01  -2.725  0.00642 ** 
parch         -2.814e-01  1.691e-01  -1.664  0.09615 .  
southampton   -4.203e-01  6.121e-01  -0.687  0.49233    
hclust.group2  3.309e+00  7.444e-01   4.446 8.76e-06 ***
hclust.group3  1.892e+00  6.894e-01   2.744  0.00606 ** 
hclust.group4  7.870e+00  6.347e+02   0.012  0.99011    
hclust.group5 -8.944e+00  1.455e+03  -0.006  0.99510    
kmeans.group2 -7.841e-02  6.617e-01  -0.119  0.90567    
kmeans.group3 -3.731e-01  3.423e-01  -1.090  0.27577    
kmeans.group4 -9.932e-01  5.454e-01  -1.821  0.06859 .  
kmeans.group5  2.359e-01  3.464e-01   0.681  0.49592    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1413.6  on 1044  degrees of freedom
Residual deviance: 1000.1  on 1030  degrees of freedom
AIC: 1030.1

Number of Fisher Scoring iterations: 14

4d: Interpretation of Added Information

What does it mean to add information to a model? Is clustering a worthwhile exercise as a method of building a predictive model? Write a few sentences to explain your answers.

We have several candidate models here to represent the true relation between predictors and outcome. If we knew the true relation, then we could find the information lost from using a candidate model to represent the true model. We want to choose the candidate model that minimized the information loss. If a candidate model, compared to another candidate model, has more information gain, the model quality is relatively lower. AIC is the estimator of the relative quality of statistical models for a given set of data. It is a means for model selection that deals with the trade-off between bias and variance.

Clustering is not a worthwile exercise as a method of building a predictive model, because clustering assignments can be directly predicted from the other predictors, resulting in redundancy. Although it doesn’t reduce the predictive power of the model as a whole, it affects our interpretation of indivadual predictors. Also, the coefficient estimates may change erratically in response to small changes in the model or the data.

Question 5: Description Versus Prediction

5a: Training and Testing Sets?

The logistic regressions developed above did not split the data into separate training and testing sets. What would we have gained by doing so? Explain your answer in a few sentences.

We are more interested in inference rather than prediction here, to find the unbiased estimate of predictors and to judge the effect that changing one predictor variable may have on the siginificance of coefficients in the logit model. It’s more focused on description than prediction. The Train-Test-Validation process attempts to estimate the over-fitting of the model, in order to to achive generalization.

5b: Application of the Titanic’s Predictive Model

Would it even make sense to use the logistic regression model for the Titanic’s passengers to make a prediction? Explain your opinion with a few sentences.

No, it would not make sense to use the logistic regression model to make a prediction. Because the logistic regression model we built hasn’t been evaluated on any samples that were not used to build the model, so that it can provide an unbiased sense of model effectiveness. A biased model that will lead to overfitting, if used to run prediction on unseen data. The Train-Test-Validation process attempts to estimate the over-fitting of the model, in order to to achive generalization.