Discussion 1: Error Metrics

Consider the following confusion matrix computed from a validation set. The “yes” outcome here indicates whether a person commits to purchasing an item during a phone advertising call.

##         TrueNo TrueYes
## PredNo    6326     120
## PredYes   1638     958
  A. Compute by hand the sensitivity and specificity for the confusion matrix, assuming “yes” is the positive class.

Answer A.

1. Sensitivity and Specificity
   - From the confusion matrix, with “yes” as the positive class: TP = 958 (PredYes, TrueYes), FN = 120 (PredNo, TrueYes), FP = 1638 (PredYes, TrueNo), and TN = 6326 (PredNo, TrueNo).
   - Sensitivity (True Positive Rate): the proportion of actual positives that are correctly identified. \[ \text{Sensitivity} = \frac{TP}{TP + FN} = \frac{958}{958 + 120} \]
   - Specificity (True Negative Rate): the proportion of actual negatives that are correctly identified. \[ \text{Specificity} = \frac{TN}{TN + FP} = \frac{6326}{6326 + 1638} \]

2. Sensitivity and Specificity with calculated values
   - Sensitivity = \(\frac{958}{958 + 120} = \frac{958}{1078} \approx 0.8887\), or 88.87%
   - Specificity = \(\frac{6326}{6326 + 1638} = \frac{6326}{7964} \approx 0.7943\), or 79.43%
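As a quick check, here is a minimal R sketch (with the four cell counts hard-coded from the confusion matrix above) that reproduces these two values:

# Counts taken from the confusion matrix, with "yes" as the positive class
TP <- 958    # PredYes, TrueYes
FN <- 120    # PredNo,  TrueYes
FP <- 1638   # PredYes, TrueNo
TN <- 6326   # PredNo,  TrueNo

sensitivity <- TP / (TP + FN)   # 958/1078  ~ 0.8887
specificity <- TN / (TN + FP)   # 6326/7964 ~ 0.7943
round(c(sensitivity = sensitivity, specificity = specificity), 4)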

  B. Compute by hand the negative and positive predictive values.

Answer B.

1. Negative and Positive Predictive Values
   - Positive Predictive Value (PPV): the proportion of predicted positives that are actually correct. \[ \text{PPV} = \frac{TP}{TP + FP} = \frac{958}{958 + 1638} \]
   - Negative Predictive Value (NPV): the proportion of predicted negatives that are actually correct. \[ \text{NPV} = \frac{TN}{TN + FN} = \frac{6326}{6326 + 120} \]

2. Positive and Negative Predictive Values with calculated values
   - Positive Predictive Value (PPV) = \(\frac{958}{958 + 1638} = \frac{958}{2596} \approx 0.3690\), or 36.90%
   - Negative Predictive Value (NPV) = \(\frac{6326}{6326 + 120} = \frac{6326}{6446} \approx 0.9814\), or 98.14%
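The same style of check for the predictive values (counts hard-coded as before):

# PPV/NPV from the same confusion-matrix counts
PPV <- 958 / (958 + 1638)    # TP / (TP + FP) ~ 0.3690
NPV <- 6326 / (6326 + 120)   # TN / (TN + FN) ~ 0.9814
round(c(PPV = PPV, NPV = NPV), 4)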

  C. What percentage of the entire validation set made a purchase?

Answer C.

1. Percentage of the Entire Validation Set That Made a Purchase
   - This is the proportion of actual “yes” outcomes (TP + FN) among all observations. \[ \text{Percentage} = \frac{TP + FN}{\text{Total}} = \frac{958 + 120}{6326 + 120 + 1638 + 958} = \frac{1078}{9042} \]

2. Percentage of the Entire Validation Set That Made a Purchase with calculated values
   - Percentage = \(\frac{958 + 120}{6326 + 120 + 1638 + 958} = \frac{1078}{9042} \approx 0.1192\), or 11.92%
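And the prevalence calculation in R, again with the counts hard-coded:

# Proportion of the validation set that actually made a purchase
prevalence <- (958 + 120) / (6326 + 120 + 1638 + 958)   # 1078/9042
round(prevalence, 4)                                     # ~ 0.1192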

  D. Suppose the company has a list of 1000 potential new customers to which the model (with the performance metrics computed above) can be applied, and that 300 of these 1000 are predicted to make a purchase. How many of these 300 customers predicted as “yes” will actually be a “yes”? Which of the previously calculated metrics tells you this?

Answer D.

1. Predicted “Yes” Actually Being a “Yes”
   - Out of 1000 potential new customers, 300 were predicted to make a purchase.
   - The question asks how many of these 300 are likely to actually make a purchase; this is given by the Positive Predictive Value. \[ \text{Expected “yes” among the 300 predicted “yes”} = 300 \times \text{PPV} \]

2. Predicted “Yes” Actually Being a “Yes” with calculated values
   - Expected “yes” among the 300 predicted “yes” = \(300 \times 0.3690 \approx 111\)
   - This means that of the 300 customers predicted “yes”, about 111 will actually make a purchase.
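A one-line check of that arithmetic in R (PPV recomputed from the validation confusion matrix):

# Expected purchasers among 300 new customers predicted "yes"
PPV <- 958 / (958 + 1638)
round(300 * PPV)   # ~ 111 customers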

  E. (Something to chew on) Suppose that, for every sale, the company makes a profit of 200 dollars, and that every unsuccessful sales pitch costs them 20 dollars. They lose out on 200 dollars for every customer they failed to identify who would have said “yes” to the campaign, and they lose 0 dollars on those they didn’t call who would have said “no” to begin with. What do you think the expected net profit (after accounting for costs) is when making the sales pitch to the 1000 prospective clients? What is the expected profit lost (or missed) under this scenario? Make whatever assumptions you need to come up with your answer, and state those assumptions along with your answer.

Answer E.

1. Expected Net Profit and Profit Loss
   - Assumptions:
     - Profit per sale: $200
     - Cost per unsuccessful pitch: $20
     - Loss per missed sale: $200
     - No loss for not calling someone who would have said “no”

2. Expected Net Profit and Profit Loss with calculated values (applying the cost structure to the validation-set counts)
   - Profit from sales = TP × $200 = 958 × $200 = $191,600
   - Cost of unsuccessful pitches = FP × $20 = 1638 × $20 = $32,760
   - Loss from missed sales = FN × $200 = 120 × $200 = $24,000
   - Expected net profit = $191,600 - $32,760 = $158,840
   - Expected profit lost (missed) = $24,000
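A minimal sketch of that arithmetic in R, applying the stated cost assumptions to the validation-set counts:

# Profit/cost components from the problem statement
profit_sales  <- 958  * 200   # successful pitches:   $191,600
cost_pitches  <- 1638 * 20    # unsuccessful pitches: $32,760
missed_profit <- 120  * 200   # missed sales:         $24,000

c(net_profit = profit_sales - cost_pitches,   # $158,840
  missed_profit = missed_profit)              # $24,000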

Discussion 2, Part 1: LDA Assumptions

Consider the following simulated data sets. Based on what you got out of the videos, comment for each data set on whether a discriminant analysis tool would be appropriate and, if so, whether it would be better to fit an LDA or a QDA model.

Discussion 2, Part 2: Confusion Matrix Sanity Checks

It is quite common for people to read confusion matrix results without paying enough attention to the details, in particular to what you want as output versus what R actually gives. Using one of the simulated data sets from Part 1, suppose we add columns that define the response variable using different coding strategies. The first recodes Healthy and Cancer as 0 and 1, respectively; the second recodes them as Control and Sick. In all three settings, the “positive” class is the Cancer group.

full2$status2<-ifelse(full2$status=="Healthy",0,1)
full2$status3<-factor(ifelse(full2$status=="Healthy","Control","Sick"))
head(full2)
##         x1        x2 status status2 status3
## 1  9.24283  8.466924 Cancer       1    Sick
## 2 10.25176 10.274625 Cancer       1    Sick
## 3 11.16205 10.895535 Cancer       1    Sick
## 4  8.05908  7.490272 Cancer       1    Sick
## 5 10.27510 10.539110 Cancer       1    Sick
## 6 10.43156 10.528610 Cancer       1    Sick

The following code runs an LDA model on the data and produces the confusion matrix on the training data itself. This is done a number of times using slightly different sets of code. Compare the resulting confusion matrices and summarize what is going on with the code and options. Please note that the CV part of the code and error metric is irrelevant here because LDA doesn’t have any tuning parameters.

library(caret)
fitControl<-trainControl(method="repeatedcv",number=5,repeats=1,classProbs=TRUE, summaryFunction=mnLogLoss)
set.seed(1234)

#Version 1
lda.fit<-train(status~x1+x2,
               data=full2,
               method="lda",
               trControl=fitControl,
               metric="logLoss")
print(lda.fit)

#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "prob")[,"Cancer"]


#Getting confusion matrix
threshold=0.5
lda.preds<-factor(ifelse(predictions>threshold,"Cancer","Healthy"),levels=c("Cancer","Healthy"))
confusionMatrix(data = lda.preds, reference = full2$status)


#Version2
lda.fit<-train(status3~x1+x2,
               data=full2,
               method="lda",
               trControl=fitControl,
               metric="logLoss")
print(lda.fit)

#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "prob")[,"Sick"]

#Getting confusion matrix
threshold=0.5
lda.preds<-factor(ifelse(predictions>threshold,"Sick","Control"),levels=c("Sick","Control"))
confusionMatrix(data = lda.preds, reference = full2$status3)


#Version3
lda.fit<-train(status3~x1+x2,
               data=full2,
               method="lda",
               trControl=fitControl,
               metric="logLoss")
print(lda.fit)

#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "prob")[,"Sick"]


#Getting confusion matrix
threshold=0.5
lda.preds<-factor(ifelse(predictions>threshold,"Sick","Control"))
confusionMatrix(data = lda.preds, reference = full2$status3)



#Version 4
lda.fit<-train(status3~x1+x2,
               data=full2,
               method="lda",
               trControl=fitControl,
               metric="logLoss")
print(lda.fit)

#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "prob")[,"Sick"]

#Getting confusion matrix
threshold=0.5
lda.preds<-factor(ifelse(predictions>threshold,"Sick","Control"))
confusionMatrix(data = lda.preds, reference = full2$status3,positive="Sick")




#Version 5
lda.fit<-train(status3~x1+x2,
               data=full2,
               method="lda",
               trControl=fitControl,
               metric="logLoss")
print(lda.fit)
#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "raw")
head(predictions)

confusionMatrix(data = predictions, reference = full2$status3)



Output:

library(caret)
fitControl<-trainControl(method="repeatedcv",number=5,repeats=1,classProbs=TRUE, summaryFunction=mnLogLoss)
set.seed(1234)

#Version 1
lda.fit<-train(status~x1+x2,
               data=full2,
               method="lda",
               trControl=fitControl,
               metric="logLoss")
print(lda.fit)
Linear Discriminant Analysis

800 samples
  2 predictor
  2 classes: 'Cancer', 'Healthy'

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 640, 640, 640, 640, 640
Resampling results:

  logLoss
  0.1460222

#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "prob")[,"Cancer"]

#Getting confusion matrix
threshold=0.5
lda.preds<-factor(ifelse(predictions>threshold,"Cancer","Healthy"),levels=c("Cancer","Healthy"))
confusionMatrix(data = lda.preds, reference = full2$status)
Confusion Matrix and Statistics

          Reference
Prediction Cancer Healthy
   Cancer     389      35
   Healthy     11     365

               Accuracy : 0.9425
                 95% CI : (0.924, 0.9576)
    No Information Rate : 0.5
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.885

 Mcnemar's Test P-Value : 0.000696

            Sensitivity : 0.9725
            Specificity : 0.9125
         Pos Pred Value : 0.9175
         Neg Pred Value : 0.9707
             Prevalence : 0.5000
         Detection Rate : 0.4863
   Detection Prevalence : 0.5300
      Balanced Accuracy : 0.9425

       'Positive' Class : Cancer

#Version2
lda.fit<-train(status3~x1+x2,
               data=full2,
               method="lda",
               trControl=fitControl,
               metric="logLoss")
print(lda.fit)
Linear Discriminant Analysis

800 samples
  2 predictor
  2 classes: 'Control', 'Sick'

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 640, 640, 640, 640, 640
Resampling results:

  logLoss
  0.1477328

#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "prob")[,"Sick"]

#Getting confusion matrix
threshold=0.5
lda.preds<-factor(ifelse(predictions>threshold,"Sick","Control"),levels=c("Sick","Control"))
confusionMatrix(data = lda.preds, reference = full2$status3)
Confusion Matrix and Statistics

          Reference
Prediction Control Sick
   Control     365   11
   Sick         35  389

               Accuracy : 0.9425
                 95% CI : (0.924, 0.9576)
    No Information Rate : 0.5
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.885

 Mcnemar's Test P-Value : 0.000696

            Sensitivity : 0.9125
            Specificity : 0.9725
         Pos Pred Value : 0.9707
         Neg Pred Value : 0.9175
             Prevalence : 0.5000
         Detection Rate : 0.4562
   Detection Prevalence : 0.4700
      Balanced Accuracy : 0.9425

       'Positive' Class : Control

Warning message:
In confusionMatrix.default(data = lda.preds, reference = full2$status3) :
  Levels are not in the same order for reference and data. Refactoring data to match.

#Version3
lda.fit<-train(status3~x1+x2,
               data=full2,
               method="lda",
               trControl=fitControl,
               metric="logLoss")
print(lda.fit)
Linear Discriminant Analysis

800 samples
  2 predictor
  2 classes: 'Control', 'Sick'

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 640, 640, 640, 640, 640
Resampling results:

  logLoss
  0.1464863

#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "prob")[,"Sick"]

#Getting confusion matrix
threshold=0.5
lda.preds<-factor(ifelse(predictions>threshold,"Sick","Control"))
confusionMatrix(data = lda.preds, reference = full2$status3)
Confusion Matrix and Statistics

          Reference
Prediction Control Sick
   Control     365   11
   Sick         35  389

               Accuracy : 0.9425
                 95% CI : (0.924, 0.9576)
    No Information Rate : 0.5
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.885

 Mcnemar's Test P-Value : 0.000696

            Sensitivity : 0.9125
            Specificity : 0.9725
         Pos Pred Value : 0.9707
         Neg Pred Value : 0.9175
             Prevalence : 0.5000
         Detection Rate : 0.4562
   Detection Prevalence : 0.4700
      Balanced Accuracy : 0.9425

       'Positive' Class : Control

#Version 4
lda.fit<-train(status3~x1+x2,
               data=full2,
               method="lda",
               trControl=fitControl,
               metric="logLoss")
print(lda.fit)
Linear Discriminant Analysis

800 samples
  2 predictor
  2 classes: 'Control', 'Sick'

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 640, 640, 640, 640, 640
Resampling results:

  logLoss
  0.1481523

#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "prob")[,"Sick"]

#Getting confusion matrix
threshold=0.5
lda.preds<-factor(ifelse(predictions>threshold,"Sick","Control"))
confusionMatrix(data = lda.preds, reference = full2$status3,positive="Sick")
Confusion Matrix and Statistics

          Reference
Prediction Control Sick
   Control     365   11
   Sick         35  389

               Accuracy : 0.9425
                 95% CI : (0.924, 0.9576)
    No Information Rate : 0.5
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.885

 Mcnemar's Test P-Value : 0.000696

            Sensitivity : 0.9725
            Specificity : 0.9125
         Pos Pred Value : 0.9175
         Neg Pred Value : 0.9707
             Prevalence : 0.5000
         Detection Rate : 0.4863
   Detection Prevalence : 0.5300
      Balanced Accuracy : 0.9425

       'Positive' Class : Sick

#Version 5
lda.fit<-train(status3~x1+x2,
               data=full2,
               method="lda",
               trControl=fitControl,
               metric="logLoss")
print(lda.fit)
Linear Discriminant Analysis

800 samples
  2 predictor
  2 classes: 'Control', 'Sick'

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 640, 640, 640, 640, 640
Resampling results:

  logLoss
  0.1488042

#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "raw")
head(predictions)
[1] Sick    Sick    Sick    Control Sick    Sick
Levels: Control Sick

confusionMatrix(data = predictions, reference = full2$status3)
Confusion Matrix and Statistics

          Reference
Prediction Control Sick
   Control     365   11
   Sick         35  389

               Accuracy : 0.9425
                 95% CI : (0.924, 0.9576)
    No Information Rate : 0.5
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.885

 Mcnemar's Test P-Value : 0.000696

            Sensitivity : 0.9125
            Specificity : 0.9725
         Pos Pred Value : 0.9707
         Neg Pred Value : 0.9175
             Prevalence : 0.5000
         Detection Rate : 0.4562
   Detection Prevalence : 0.4700
      Balanced Accuracy : 0.9425

       'Positive' Class : Control

Answer Discussion 2.

Part 1.

Second Data Set (Middle Plot): Here the contours for the two classes are distinctly separate, and there is a clear non-linear boundary between them, so a discriminant analysis tool is appropriate. Given the curved nature of the boundary, a QDA model would likely perform better, because it allows each class its own variance-covariance structure and therefore produces a quadratic boundary that can accommodate the non-linearity.
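For completeness, here is a hedged sketch of how the LDA and QDA fits could be compared with caret on one of these simulated data sets, assuming the data sit in a data frame full2 with predictors x1, x2 and response status (the same names used in Part 2; this is illustrative, not the assignment code):

library(caret)

fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 1,
                           classProbs = TRUE, summaryFunction = mnLogLoss)

set.seed(1234)
lda.fit <- train(status ~ x1 + x2, data = full2, method = "lda",
                 trControl = fitControl, metric = "logLoss")

set.seed(1234)
qda.fit <- train(status ~ x1 + x2, data = full2, method = "qda",
                 trControl = fitControl, metric = "logLoss")

# Lower cross-validated log loss suggests which boundary (linear vs quadratic) suits the data
summary(resamples(list(LDA = lda.fit, QDA = qda.fit)))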

Part 2.

Sanity Checks

Version 1: Predictions are made from the LDA model fit to status and compared against full2$status. The levels of the prediction factor are explicitly set to c(“Cancer”, “Healthy”), which matches the original coding of the response, and the confusion matrix reports Cancer as the positive class, which is exactly what we want.

Version 2: This version uses full2$status3, the recoded factor with levels “Control” and “Sick”, and the predicted probabilities for the “Sick” class. The prediction factor’s levels are set to c(“Sick”, “Control”), which is not the same order as the reference, so caret issues a warning and refactors the data to match. Because the positive class defaults to the first factor level (“Control”), the reported sensitivity and specificity (and PPV/NPV) are swapped relative to Version 1, even though the underlying counts are identical.

Version 3: Similar to Version 2, but the levels of lda.preds are not explicitly set, so factor() defaults to alphabetical order (“Control”, “Sick”). That happens to match the reference, so no warning appears, yet the positive class is still Control: the reported sensitivity and PPV describe how well the model identifies the Control (Healthy) group rather than the Cancer group we actually care about.
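A tiny standalone illustration of that default level ordering (not part of the assignment code):

# factor() orders levels alphabetically unless told otherwise
x <- c("Sick", "Control", "Sick")
levels(factor(x))                                   # "Control" "Sick"
levels(factor(x, levels = c("Sick", "Control")))    # "Sick" "Control"

Because confusionMatrix() takes the first factor level as the positive class by default, checking levels() (or passing positive= explicitly, as in Version 4) avoids this trap.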

Version 4: The same as Version 3, but with positive = “Sick” added to the confusionMatrix call. This explicitly sets “Sick” as the positive class, so sensitivity, specificity, PPV, and NPV are reported with respect to the “Sick” (Cancer) group, matching the interpretation in Version 1.

Version 5: This version uses type = “raw” in predict, which returns the predicted classes directly instead of probabilities, so the confusion matrix is computed without an explicit threshold; for a two-class model this is equivalent to thresholding the “Sick” probability at 0.5. The head(predictions) output shows the first few predicted classes along with the factor levels. Since positive is not specified, the positive class again defaults to Control.
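A quick way to verify that equivalence, reusing the Version 5 objects (assuming lda.fit and full2 are still in the workspace):

probs  <- predict(lda.fit, full2, type = "prob")[, "Sick"]
raw    <- predict(lda.fit, full2, type = "raw")
manual <- factor(ifelse(probs > 0.5, "Sick", "Control"), levels = levels(raw))

# Should print TRUE: the raw class predictions match thresholding the probability at 0.5
all(raw == manual)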

These versions highlight common pitfalls when working with confusion matrices in R: factor levels must be coded consistently between the predictions and the reference, and the positive class must be specified (or at least checked) before interpreting sensitivity, specificity, PPV, and NPV. It is essential to understand what R is actually outputting and to verify that it aligns with the original intent of the analysis.