Consider the following confusion matrix computed from a validation set. The "yes" outcome here is whether a person commits to purchasing an item during a phone advertising call.
##         TrueNo TrueYes
## PredNo    6326     120
## PredYes   1638     958
1. Sensitivity and Specificity
   - Sensitivity (True Positive Rate): the proportion of actual positives that are correctly identified. Here the true positives (TP) are the 958 people predicted "yes" who actually purchased, and the false negatives (FN) are the 120 predicted "no" who actually purchased. \[ \text{Sensitivity} = \frac{TP}{TP + FN} = \frac{958}{958 + 120} \]
   - Specificity (True Negative Rate): the proportion of actual negatives that are correctly identified. The true negatives (TN) are the 6326 predicted "no" who did not purchase, and the false positives (FP) are the 1638 predicted "yes" who did not purchase. \[ \text{Specificity} = \frac{TN}{TN + FP} = \frac{6326}{6326 + 1638} \]
2. Sensitivity and Specificity with calculated metrics
   - Sensitivity = \(\frac{958}{958 + 120} = \frac{958}{1078} \approx 0.8887\), or 88.87%
   - Specificity = \(\frac{6326}{6326 + 1638} = \frac{6326}{7964} \approx 0.7943\), or 79.43%
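As a quick check of the arithmetic, the same quantities can be computed directly in R; the TP, FN, FP, and TN objects below are simply labels for the four cells of the confusion matrix above.

# Cell counts from the validation confusion matrix
TP <- 958    # PredYes & TrueYes
FN <- 120    # PredNo  & TrueYes
FP <- 1638   # PredYes & TrueNo
TN <- 6326   # PredNo  & TrueNo

sensitivity <- TP / (TP + FN)   # 958 / 1078  ~ 0.8887
specificity <- TN / (TN + FP)   # 6326 / 7964 ~ 0.7943
round(c(sensitivity = sensitivity, specificity = specificity), 4)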
1. Positive and Negative Predictive Values
   - Positive Predictive Value (PPV): the proportion of predicted positives that are actually correct. \[ \text{PPV} = \frac{TP}{TP + FP} = \frac{958}{958 + 1638} \]
   - Negative Predictive Value (NPV): the proportion of predicted negatives that are actually correct. \[ \text{NPV} = \frac{TN}{TN + FN} = \frac{6326}{6326 + 120} \]
2. Positive and Negative Predictive Values with calculated metrics
   - Positive Predictive Value (PPV) = \(\frac{958}{958 + 1638} = \frac{958}{2596} \approx 0.3690\), or 36.90%
   - Negative Predictive Value (NPV) = \(\frac{6326}{6326 + 120} = \frac{6326}{6446} \approx 0.9814\), or 98.14%
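Continuing the same R sketch (TP, FN, FP, and TN as defined above):

PPV <- TP / (TP + FP)   # 958 / 2596  ~ 0.3690
NPV <- TN / (TN + FN)   # 6326 / 6446 ~ 0.9814
round(c(PPV = PPV, NPV = NPV), 4)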
1. Percentage of the Entire Validation Set That Made a Purchase
   - This is the proportion of actual "yes" outcomes among all observations. \[ \text{Percentage} = \frac{TP + FN}{\text{Total}} = \frac{958 + 120}{6326 + 120 + 1638 + 958} \]
2. Percentage of the Entire Validation Set That Made a Purchase with calculated metrics
   - Percentage = \(\frac{958 + 120}{6326 + 120 + 1638 + 958} = \frac{1078}{9042} \approx 0.1192\), or 11.92%
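In R, with the same cell counts:

total <- TP + FN + FP + TN        # 9042 validation observations
prevalence <- (TP + FN) / total   # 1078 / 9042 ~ 0.1192
round(prevalence, 4)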
1. Predicted "Yes" Actually Being a "Yes"
   - Out of 1000 potential new customers, 300 were predicted to make a purchase.
   - The question asks how many of these 300 are likely to actually make a purchase, which is given by the positive predictive value. \[ \text{Expected actual "yes" among 300 predicted "yes"} = 300 \times \text{PPV} \]
2. Predicted "Yes" Actually Being a "Yes" with calculated metrics
   - Expected actual "yes" among 300 predicted "yes" = \(300 \times 0.3690 \approx 111\)
   - That is, of the 300 customers predicted "yes", about 111 are expected to actually make a purchase.
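The corresponding one-liner in R, using the PPV computed above:

expected_buyers <- 300 * PPV   # 300 * 0.3690 ~ 110.7, i.e. about 111
round(expected_buyers)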
1. Expected Net Profit and Profit Loss
   - Assumptions:
     - Profit per sale: $200
     - Cost per unsuccessful pitch: $20
     - Loss per missed sale: $200
     - No loss for not calling someone who would have said "no"
Net profit:

Expected net profit on the validation set = (profit from sales) - (cost of unsuccessful pitches) - (loss from missed sales)

Expected profit loss:

Expected profit loss = (missed sales, i.e. false negatives) × $200
2. Expected Net Profit and Profit Loss with calculated metrics
   - Profit from sales = true positives × $200 = 958 × $200 = $191,600
   - Cost of unsuccessful pitches = false positives × $20 = 1638 × $20 = $32,760
   - Loss from missed sales = false negatives × $200 = 120 × $200 = $24,000
Expected net profit on the validation set = $191,600 - $32,760 - $24,000 = $134,840
Expected profit loss from missed sales (false negatives) = 120 × $200 = $24,000
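The same bookkeeping in R, reusing the TP, FP, and FN counts defined in the earlier sketch (the variable names are just labels for the stated assumptions):

profit_per_sale <- 200
cost_per_pitch  <- 20
loss_per_miss   <- 200

profit_from_sales <- TP * profit_per_sale   # 958  * 200 = 191600
pitch_costs       <- FP * cost_per_pitch    # 1638 *  20 =  32760
missed_sale_loss  <- FN * loss_per_miss     # 120  * 200 =  24000

net_profit <- profit_from_sales - pitch_costs - missed_sale_loss   # 134840
c(net_profit = net_profit, profit_loss = missed_sale_loss)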
Consider the following simulated data sets. Based on what you got out
of the videos, for each data set comment on whether you think a
discriminant analysis tool would be appropriate and if so would it be
better to fit an LDA or QDA model.
It is quite common for people to read confusion matrix results without paying enough attention to the details, in particular what they want as output versus what R actually gives. Using one of the data sets from the second discussion, suppose we add columns that define the response variable using different coding strategies. The first recodes Healthy and Cancer as 0 and 1, respectively. The second recodes them as Control and Sick. In all three settings, the "positive" class is meant to be the Cancer group.
full2$status2<-ifelse(full2$status=="Healthy",0,1)
full2$status3<-factor(ifelse(full2$status=="Healthy","Control","Sick"))
head(full2)
##         x1        x2 status status2 status3
## 1  9.24283  8.466924 Cancer       1    Sick
## 2 10.25176 10.274625 Cancer       1    Sick
## 3 11.16205 10.895535 Cancer       1    Sick
## 4  8.05908  7.490272 Cancer       1    Sick
## 5 10.27510 10.539110 Cancer       1    Sick
## 6 10.43156 10.528610 Cancer       1    Sick
The following code runs an LDA model on the data and produces the confusion matrix on the training data itself. This is done several times using slightly different sets of code. Compare the resulting confusion matrices and summarize what is going on with the code and options. Note that the cross-validation settings and error metric are irrelevant here because LDA has no tuning parameters.
library(caret)
fitControl<-trainControl(method="repeatedcv",number=5,repeats=1,classProbs=TRUE, summaryFunction=mnLogLoss)
set.seed(1234)
#Version 1
lda.fit<-train(status~x1+x2,
data=full2,
method="lda",
trControl=fitControl,
metric="logLoss")
print(lda.fit)
#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "prob")[,"Cancer"]
#Getting confusion matrix
threshold=0.5
lda.preds<-factor(ifelse(predictions>threshold,"Cancer","Healthy"),levels=c("Cancer","Healthy"))
confusionMatrix(data = lda.preds, reference = full2$status)
#Version2
lda.fit<-train(status3~x1+x2,
data=full2,
method="lda",
trControl=fitControl,
metric="logLoss")
print(lda.fit)
#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "prob")[,"Sick"]
#Getting confusion matrix
threshold=0.5
lda.preds<-factor(ifelse(predictions>threshold,"Sick","Control"),levels=c("Sick","Control"))
confusionMatrix(data = lda.preds, reference = full2$status3)
#Version3
lda.fit<-train(status3~x1+x2,
data=full2,
method="lda",
trControl=fitControl,
metric="logLoss")
print(lda.fit)
#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "prob")[,"Sick"]
#Getting confusion matrix
threshold=0.5
lda.preds<-factor(ifelse(predictions>threshold,"Sick","Control"))
confusionMatrix(data = lda.preds, reference = full2$status3)
#Version 4
lda.fit<-train(status3~x1+x2,
data=full2,
method="lda",
trControl=fitControl,
metric="logLoss")
print(lda.fit)
#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "prob")[,"Sick"]
#Getting confusion matrix
threshold=0.5
lda.preds<-factor(ifelse(predictions>threshold,"Sick","Control"))
confusionMatrix(data = lda.preds, reference = full2$status3,positive="Sick")
#Version 5
lda.fit<-train(status3~x1+x2,
data=full2,
method="lda",
trControl=fitControl,
metric="logLoss")
print(lda.fit)
#Computing predicted classes on the training data (type = "raw" returns classes, not probabilities)
predictions <- predict(lda.fit, full2, type = "raw")
head(predictions)
confusionMatrix(data = predictions, reference = full2$status3)
### Output:
library(caret)
fitControl<-trainControl(method="repeatedcv",number=5,repeats=1,classProbs=TRUE, summaryFunction=mnLogLoss)
set.seed(1234)

#Version 1
lda.fit<-train(status~x1+x2,
               data=full2,
               method="lda",
               trControl=fitControl,
               metric="logLoss")
print(lda.fit)

Linear Discriminant Analysis

800 samples
  2 predictor
  2 classes: 'Cancer', 'Healthy'

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 640, 640, 640, 640, 640
Resampling results:

  logLoss
  0.1460222

#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "prob")[,"Cancer"]

#Getting confusion matrix
threshold=0.5
lda.preds<-factor(ifelse(predictions>threshold,"Cancer","Healthy"),levels=c("Cancer","Healthy"))
confusionMatrix(data = lda.preds, reference = full2$status)

Confusion Matrix and Statistics

          Reference
Prediction Cancer Healthy
   Cancer     389      35
   Healthy     11     365
Accuracy : 0.9425
95% CI : (0.924, 0.9576)
No Information Rate : 0.5
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.885
Mcnemar’s Test P-Value : 0.000696
Sensitivity : 0.9725
Specificity : 0.9125
Pos Pred Value : 0.9175
Neg Pred Value : 0.9707
Prevalence : 0.5000
Detection Rate : 0.4863
Detection Prevalence : 0.5300
Balanced Accuracy : 0.9425
'Positive' Class : Cancer
#Version2
lda.fit<-train(status3~x1+x2,
               data=full2,
               method="lda",
               trControl=fitControl,
               metric="logLoss")
print(lda.fit)

Linear Discriminant Analysis

800 samples
  2 predictor
  2 classes: 'Control', 'Sick'

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 640, 640, 640, 640, 640
Resampling results:

  logLoss
  0.1477328

#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "prob")[,"Sick"]

#Getting confusion matrix
threshold=0.5
lda.preds<-factor(ifelse(predictions>threshold,"Sick","Control"),levels=c("Sick","Control"))
confusionMatrix(data = lda.preds, reference = full2$status3)

Confusion Matrix and Statistics

          Reference
Prediction Control Sick
   Control     365   11
   Sick         35  389
Accuracy : 0.9425
95% CI : (0.924, 0.9576)
No Information Rate : 0.5
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.885
Mcnemar’s Test P-Value : 0.000696
Sensitivity : 0.9125
Specificity : 0.9725
Pos Pred Value : 0.9707
Neg Pred Value : 0.9175
Prevalence : 0.5000
Detection Rate : 0.4562
Detection Prevalence : 0.4700
Balanced Accuracy : 0.9425
'Positive' Class : Control
Warning message:
In confusionMatrix.default(data = lda.preds, reference = full2$status3) :
  Levels are not in the same order for reference and data. Refactoring data to match.
#Version3
lda.fit<-train(status3~x1+x2,
               data=full2,
               method="lda",
               trControl=fitControl,
               metric="logLoss")
print(lda.fit)

Linear Discriminant Analysis

800 samples
  2 predictor
  2 classes: 'Control', 'Sick'

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 640, 640, 640, 640, 640
Resampling results:

  logLoss
  0.1464863

#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "prob")[,"Sick"]

#Getting confusion matrix
threshold=0.5
lda.preds<-factor(ifelse(predictions>threshold,"Sick","Control"))
confusionMatrix(data = lda.preds, reference = full2$status3)

Confusion Matrix and Statistics

          Reference
Prediction Control Sick
   Control     365   11
   Sick         35  389
Accuracy : 0.9425
95% CI : (0.924, 0.9576)
No Information Rate : 0.5
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.885
Mcnemar’s Test P-Value : 0.000696
Sensitivity : 0.9125
Specificity : 0.9725
Pos Pred Value : 0.9707
Neg Pred Value : 0.9175
Prevalence : 0.5000
Detection Rate : 0.4562
Detection Prevalence : 0.4700
Balanced Accuracy : 0.9425
'Positive' Class : Control
#Version 4
lda.fit<-train(status3~x1+x2,
               data=full2,
               method="lda",
               trControl=fitControl,
               metric="logLoss")
print(lda.fit)

Linear Discriminant Analysis

800 samples
  2 predictor
  2 classes: 'Control', 'Sick'

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 640, 640, 640, 640, 640
Resampling results:

  logLoss
  0.1481523

#Computing predicted probabilities on the training data
predictions <- predict(lda.fit, full2, type = "prob")[,"Sick"]

#Getting confusion matrix
threshold=0.5
lda.preds<-factor(ifelse(predictions>threshold,"Sick","Control"))
confusionMatrix(data = lda.preds, reference = full2$status3,positive="Sick")

Confusion Matrix and Statistics

          Reference
Prediction Control Sick
   Control     365   11
   Sick         35  389
Accuracy : 0.9425
95% CI : (0.924, 0.9576)
No Information Rate : 0.5
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.885
Mcnemar’s Test P-Value : 0.000696
Sensitivity : 0.9725
Specificity : 0.9125
Pos Pred Value : 0.9175
Neg Pred Value : 0.9707
Prevalence : 0.5000
Detection Rate : 0.4863
Detection Prevalence : 0.5300
Balanced Accuracy : 0.9425
'Positive' Class : Sick
#Version 5
lda.fit<-train(status3~x1+x2,
               data=full2,
               method="lda",
               trControl=fitControl,
               metric="logLoss")
print(lda.fit)

Linear Discriminant Analysis

800 samples
  2 predictor
  2 classes: 'Control', 'Sick'

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 640, 640, 640, 640, 640
Resampling results:

  logLoss
  0.1488042

#Computing predicted classes on the training data (type = "raw" returns classes, not probabilities)
predictions <- predict(lda.fit, full2, type = "raw")
head(predictions)

[1] Sick    Sick    Sick    Control Sick    Sick
Levels: Control Sick

confusionMatrix(data = predictions, reference = full2$status3)

Confusion Matrix and Statistics

          Reference
Prediction Control Sick
   Control     365   11
   Sick         35  389
Accuracy : 0.9425
95% CI : (0.924, 0.9576)
No Information Rate : 0.5
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.885
Mcnemar’s Test P-Value : 0.000696
Sensitivity : 0.9125
Specificity : 0.9725
Pos Pred Value : 0.9707
Neg Pred Value : 0.9175
Prevalence : 0.5000
Detection Rate : 0.4562
Detection Prevalence : 0.4700
Balanced Accuracy : 0.9425
'Positive' Class : Control
Second Data Set (Middle Plot): Here the contours for the two classes are distinctly separate, and a clear non-linear boundary can be drawn between them. Given the curved nature of that boundary, a QDA model would likely perform better because it allows each class its own variance-covariance structure and can therefore accommodate the non-linearity.
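If one wanted to fit the QDA in the same caret workflow used above, only the method argument changes. A minimal sketch, assuming the simulated data sit in a data frame called sim2 (a hypothetical name) with predictors x1, x2 and a factor response group:

library(caret)
library(MASS)   # caret's method = "qda" uses MASS::qda under the hood

fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 1,
                           classProbs = TRUE, summaryFunction = mnLogLoss)

set.seed(1234)
qda.fit <- train(group ~ x1 + x2,
                 data = sim2,          # hypothetical simulated data set
                 method = "qda",       # quadratic boundary; swap in "lda" for a linear one
                 trControl = fitControl,
                 metric = "logLoss")
print(qda.fit)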
Sanity Checks
Version 1: Predictions are made from the LDA model and compared against the full2$status variable. The levels of the factor used for prediction are explicitly set to "Cancer" and "Healthy", which matches the original coding of the response, so "Cancer" is treated as the positive class, as intended.
Version 2: This version uses the full2$status3 variable, a recoded factor with levels "Control" and "Sick", and the predictions are compared against that factor, with the predicted probabilities extracted for the "Sick" class. Because lda.preds is built with levels = c("Sick", "Control") while the reference factor has levels c("Control", "Sick"), confusionMatrix issues a warning and refactors the data to match the reference, so "Control" ends up as the positive class.
Version 3: Similar to Version 2, but the levels of the factor in lda.preds are not explicitly set, so R defaults to alphabetical order (Control, Sick). "Control" is again treated as the positive class, which swaps the reported sensitivity and specificity relative to Version 1 if not accounted for.
Version 4: This version is the same as Version 3 but includes the positive argument in the confusionMatrix function call to explicitly set “Sick” as the positive level. This ensures that metrics like sensitivity and PPV are calculated with respect to the “Sick” class.
Version 5: This version uses the type = "raw" option in the prediction, which returns the class predictions directly instead of probabilities, so the confusion matrix is computed without any thresholding. The head(predictions) call shows the first few predicted classes. Since no positive argument is supplied, "Control" (the first factor level) is again reported as the positive class.
The varying versions highlight common pitfalls when dealing with confusion matrices in R, particularly the importance of consistent factor level coding and ensuring that the positive class is correctly specified when interpreting model performance. It is essential to understand what R is outputting and to verify that it aligns with the initial intent of the model.
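To see the level-ordering behavior in isolation, here is a small toy sketch (made-up vectors, not the full2 data): confusionMatrix() takes the first factor level of the reference as the positive class unless positive= is supplied.

library(caret)

# Toy truth and predictions; factor() defaults to alphabetical levels: Control, Sick
truth <- factor(c("Sick","Sick","Control","Control","Sick","Control"))
preds <- factor(c("Sick","Control","Control","Control","Sick","Sick"))

# Default: "Control" (first level) is treated as the positive class
confusionMatrix(data = preds, reference = truth)

# Explicitly make "Sick" the positive class so sensitivity/PPV refer to it
confusionMatrix(data = preds, reference = truth, positive = "Sick")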