LDA works best when these assumptions are met, but it can still perform reasonably well even if some (but not all) of them are violated to a certain extent.
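As a quick, informal way to eyeball these assumptions before fitting, the per-class covariance matrices and within-class normality can be inspected directly. This is only a sketch; the data frame df, its predictors X1 and X2, and the factor column Class are hypothetical placeholders, not objects from this assignment.
# Informal checks of the LDA assumptions on a hypothetical data frame `df`
# with numeric predictors X1, X2 and a factor column Class
num_cols <- c("X1", "X2")
# Common-covariance assumption: compare the per-class covariance matrices;
# large differences suggest QDA may be a better fit than LDA
by(df[, num_cols], df$Class, cov)
# Within-class normality: Shapiro-Wilk test for each class/predictor pair
for (cl in levels(df$Class)) {
  for (v in num_cols) {
    cat(cl, v, "Shapiro-Wilk p-value:",
        shapiro.test(df[df$Class == cl, v])$p.value, "\n")
  }
}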
A confusion matrix is a table often used to describe the performance of a classification model. It shows the number of correct and incorrect predictions made by the model compared with the actual classifications in the test data. The matrix is divided into four parts: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
This matrix is fundamental for computing various performance metrics, including accuracy, precision, recall, and F1-score.
Sensitivity (also known as Recall or True Positive Rate) measures the proportion of actual positives that are correctly identified. It answers the question: “Of all the actual positives, how many did we correctly identify as positive?” Formula: Sensitivity = TP / (TP + FN).
Precision (also known as Positive Predictive Value) measures the proportion of positive identifications that were actually correct. It answers the question: “Of all the positives we identified, how many were actually positive?” Formula: Precision = TP / (TP + FP).
In summary, while sensitivity is about correctly identifying actual positives, precision is about how many of the identified positives are actually positive.
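As a concrete illustration, these metrics can be computed directly from the four cells of a confusion matrix. The counts below are made-up numbers for illustration only, not results from this assignment.
# Toy confusion-matrix counts (made-up numbers, for illustration only)
TP <- 40; FN <- 10   # actual positives: correctly and incorrectly predicted
FP <- 5;  TN <- 45   # actual negatives: incorrectly and correctly predicted
sensitivity <- TP / (TP + FN)   # recall / true positive rate
precision   <- TP / (TP + FP)   # positive predictive value
specificity <- TN / (TN + FP)   # true negative rate
accuracy    <- (TP + TN) / (TP + FP + FN + TN)
c(sensitivity = sensitivity, precision = precision,
  specificity = specificity, accuracy = accuracy)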
Consider the following data set:
library(RSSL)
## Warning: package 'RSSL' was built under R version 4.3.3
set.seed(1234)
mydata<-generateCrescentMoon(200,2,2)
mydata$Class<-factor(ifelse(mydata$Class=="+","Yes","No"))
plot(mydata$X1,mydata$X2,col=mydata$Class,asp=1,pch=16,cex=.7,xlab="X1",ylab="X2")
Would a KNN model or LDA/QDA model fare better for this example? State your reasons. If you are stumped, complete part B and come back.
The following code fits a KNN model using the caret package and provides a visualization of the prediction boundary. Repurpose the code to see what the decision boundary is for an LDA model and a QDA model.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
fitControl<-trainControl(method="repeatedcv",number=5,repeats=1,classProbs=TRUE, summaryFunction=mnLogLoss)
set.seed(1234)
#knn
knn.fit<-train(Class~X1+X2,
data=mydata,
method="knn",
trControl=fitControl,
metric="logLoss")
#Creating a grid of points spanning the plot region for visualizing predictions
validate <- expand.grid(X1=seq(-10, 10, by=0.25), # sample points in X1
X2=seq(-15, 15, by=0.5)) # sample points in X2
#Computing predicted class probabilities over the grid
predictions <- predict(knn.fit, validate, type = "prob")[,"Yes"]
threshold=0.5
class_pred = as.factor(ifelse(predictions>threshold, "1", "0"))
color_array <- c("black", "red")[as.numeric(class_pred)]
par(mfrow=c(1,2))
plot(mydata$X1,mydata$X2,col=mydata$Class,asp=1,pch=16,cex=.7,xlab="X1",ylab="X2")
plot(validate,col=color_array,pch=20,cex=0.15)
par(mfrow=c(1,1)) # reset the plotting layout
From the scatter plots, it appears that the data forms two crescent-shaped clusters. These clusters are non-linear and are not well separated by a straight line, which suggests that the LDA model may not perform optimally since LDA assumes linear decision boundaries due to its reliance on a common covariance matrix for all classes. The QDA model allows for more flexibility as it can learn a quadratic decision boundary due to class-specific covariance matrices, but it still may not capture the complexity of the crescent shapes perfectly.
A K-Nearest Neighbors (KNN) model does not make any assumptions about the distribution of the data and can handle complex decision boundaries, as it classifies data based on the majority vote of its nearest neighbors. Given the non-linear, complex structure of the data, a KNN model is likely to perform better than both LDA and QDA for this particular dataset.
library(caret)
fitControl <- trainControl(method="repeatedcv", number=5, repeats=1, classProbs=TRUE, summaryFunction=mnLogLoss)
set.seed(1234)
# Fitting an LDA model
lda_fit <- train(Class ~ ., data=mydata, method="lda", trControl=fitControl)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. logLoss will be used instead.
# Create a grid for the decision boundary plot
grid <- expand.grid(X1=seq(min(mydata$X1), max(mydata$X1), length=100),
X2=seq(min(mydata$X2), max(mydata$X2), length=100))
# Predicting using the LDA model
lda_predict <- predict(lda_fit, grid)
# Plot the decision boundary
plot(mydata$X1, mydata$X2, col=mydata$Class, asp=1, pch=16, cex=.7, xlab="X1", ylab="X2")
points(grid$X1, grid$X2, pch=16, col=as.numeric(lda_predict), cex=.1)
# Fitting a QDA model
qda_fit <- train(Class ~ ., data=mydata, method="qda", trControl=fitControl)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. logLoss will be used instead.
# Predicting using the QDA model
qda_predict <- predict(qda_fit, grid)
# Plot the decision boundary
plot(mydata$X1, mydata$X2, col=mydata$Class, asp=1, pch=16, cex=.7, xlab="X1", ylab="X2")
points(grid$X1, grid$X2, pch=16, col=as.numeric(qda_predict), cex=.1)
library(RSSL)
library(caret)
library(ggplot2)
# Generate the data
set.seed(1234)
mydata <- generateCrescentMoon(200,2,2)
mydata$Class <- factor(ifelse(mydata$Class=="+","Yes","No"))
# Plot original data
plot(mydata$X1, mydata$X2, col=mydata$Class, asp=1, pch=16, cex=.7, xlab="X1", ylab="X2")
# Set up training control
fitControl <- trainControl(method="repeatedcv", number=5, repeats=1, classProbs=TRUE, summaryFunction=mnLogLoss)
# Fit LDA model
set.seed(1234)
lda_fit <- train(Class ~ ., data=mydata, method="lda", trControl=fitControl)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. logLoss will be used instead.
# Fit QDA model
set.seed(1234)
qda_fit <- train(Class ~ ., data=mydata, method="qda", trControl=fitControl)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. logLoss will be used instead.
# Create a grid for the decision boundary plot
grid <- with(mydata, expand.grid(X1=seq(min(X1), max(X1), length.out=200),
X2=seq(min(X2), max(X2), length.out=200)))
# Predict using the LDA model
grid$lda_predict <- predict(lda_fit, grid)
# Predict using the QDA model
grid$qda_predict <- predict(qda_fit, grid)
# Plot the decision boundary for LDA
ggplot(mydata, aes(x=X1, y=X2, color=Class)) +
geom_point() +
geom_point(data=grid, aes(x=X1, y=X2, color=lda_predict), size=1, alpha=0.5) +
ggtitle("LDA Decision Boundary") +
theme_minimal()
# Plot the decision boundary for QDA
ggplot(mydata, aes(x=X1, y=X2, color=Class)) +
geom_point() +
geom_point(data=grid, aes(x=X1, y=X2, color=qda_predict), size=1, alpha=0.5) +
ggtitle("QDA Decision Boundary") +
theme_minimal()
# alternate - compare for visuals
library(MASS)
library(caret)
fitControl <- trainControl(method="repeatedcv", number=5, repeats=1, classProbs=TRUE, summaryFunction=mnLogLoss)
set.seed(1234)
# LDA
lda.fit <- train(Class~X1+X2, data=mydata, method="lda", trControl=fitControl, metric="logLoss")
# QDA
qda.fit <- train(Class~X1+X2, data=mydata, method="qda", trControl=fitControl, metric="logLoss")
# Predictions for LDA
lda_predictions <- predict(lda.fit, validate, type = "prob")[,"Yes"]
lda_class_pred = as.factor(ifelse(lda_predictions>threshold, "1", "0"))
lda_color_array <- c("black", "red")[as.numeric(lda_class_pred)]
# Predictions for QDA
qda_predictions <- predict(qda.fit, validate, type = "prob")[,"Yes"]
qda_class_pred = as.factor(ifelse(qda_predictions>threshold, "1", "0"))
qda_color_array <- c("black", "red")[as.numeric(qda_class_pred)]
# Plotting decision boundaries
par(mfrow=c(1,3))
plot(mydata$X1,mydata$X2,col=mydata$Class,asp=1,pch=16,cex=.7,xlab="X1",ylab="X2")
plot(validate,col=lda_color_array,pch=20,cex=0.15)
plot(validate,col=qda_color_array,pch=20,cex=0.15)
The cancer data set contains multiple genetic variables measured in the blood of patients with tumors in the throat/esophagus. After surgery, the tumors were biopsied and classified as malignant or benign. The data set is actually four data sets merged together from different hospitals. For this exercise, the code below collapses them into a single train/validation split. The Censor variable is the response, coded as 0 and 1.
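The merging and splitting code itself is not reproduced in this write-up; as a rough illustration only, a hypothetical sketch of what "collapsing" four sources into a train/validation split might look like (hosp1 through hosp4 are placeholder data frames, not the actual files used here).
# Hypothetical sketch only: assume hosp1-hosp4 are the four hospital data
# frames with identical columns (Censor, V1-V9); two are stacked into the
# training set and the other two into the validation set
training <- rbind(hosp1, hosp2)
validate <- rbind(hosp3, hosp4)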
## Censor V1 V2 V3 V4 V5 V6 V7 V8
## 1 1 102.259 124.664 103.400 112.696 100.125 149.494 100.621 107.191
## 2 1 110.415 136.268 102.570 113.661 100.291 115.294 101.316 102.108
## 3 1 118.051 139.745 104.439 101.719 101.013 114.360 101.917 101.834
## 4 1 112.506 132.449 104.609 118.067 100.334 126.537 101.303 103.651
## 5 1 106.437 134.123 103.517 112.575 100.501 111.032 101.138 105.840
## 6 1 110.746 129.957 103.303 112.440 100.306 111.918 100.323 101.670
## 7 1 115.371 141.641 105.789 116.426 100.580 103.255 101.235 104.512
## 8 1 103.468 118.735 100.993 101.525 100.249 103.056 100.363 101.342
## 9 1 125.784 134.208 105.762 116.190 100.879 122.893 101.233 102.385
## 10 1 131.796 131.517 103.646 114.567 100.587 121.571 100.711 103.979
## 11 1 113.999 137.443 103.514 112.968 100.679 116.092 101.915 103.677
## 12 1 116.246 135.993 103.600 119.158 100.306 107.903 101.222 101.835
## 13 1 111.759 127.671 102.340 113.540 100.298 114.124 100.423 102.045
## 14 1 136.134 138.941 104.280 130.984 101.973 112.272 102.648 106.077
## 15 1 124.091 134.535 104.488 128.259 100.864 118.227 101.308 103.180
## 16 1 110.796 122.799 102.512 122.826 100.245 112.523 100.392 101.259
## 17 1 157.106 160.328 110.105 168.287 100.648 112.082 103.730 104.042
## 18 1 112.009 132.457 103.118 126.481 100.493 107.591 100.605 100.905
## 19 1 121.981 147.457 103.496 124.987 100.335 115.133 101.499 101.409
## 20 1 129.680 154.614 103.195 124.661 101.019 117.084 104.226 103.728
## 21 1 133.765 177.891 107.270 178.076 100.919 153.011 101.228 105.411
## 22 1 112.346 126.554 103.041 121.972 100.404 114.517 100.755 102.681
## 23 1 119.418 164.069 107.336 145.334 100.780 101.164 100.802 100.678
## 24 1 137.459 155.788 104.960 136.043 101.675 107.159 103.044 101.978
## 25 1 119.307 161.443 103.918 142.005 100.688 112.686 100.863 100.821
## 26 1 117.711 124.039 101.471 114.358 100.165 115.353 100.324 100.684
## 27 1 109.443 122.538 101.892 114.686 100.268 110.946 100.504 101.196
## 28 1 120.762 135.380 104.096 130.494 100.835 111.399 101.408 101.834
## 29 1 154.781 161.211 106.154 156.059 101.285 129.114 104.082 105.367
## 30 1 104.174 100.721 100.097 101.813 100.022 114.982 100.016 101.555
## 31 1 129.069 152.946 104.975 150.212 100.624 136.504 102.203 103.144
## 32 1 190.809 168.878 107.193 143.370 100.965 122.635 102.916 103.100
## 33 1 118.274 127.687 102.767 116.379 100.739 119.246 100.972 101.834
## 34 1 121.162 141.923 104.428 126.017 101.112 115.920 102.007 101.819
## 35 1 157.043 140.232 103.932 123.389 102.896 111.169 101.157 103.361
## 36 1 158.861 159.481 109.195 162.563 104.405 131.317 102.480 106.191
## 37 1 177.110 132.216 104.515 134.446 103.478 118.756 105.409 106.598
## 38 1 314.181 135.610 111.231 208.408 102.861 113.035 102.625 103.763
## 39 1 124.662 120.555 103.732 124.383 101.290 100.899 101.020 102.016
## 40 1 203.834 381.451 107.275 203.015 101.097 106.818 101.223 101.594
## 41 1 134.591 120.897 105.285 147.673 100.481 109.481 102.547 101.678
## 42 1 121.584 122.038 104.200 127.524 100.232 110.358 102.142 103.217
## 43 1 117.704 116.927 101.090 124.774 100.161 108.402 101.116 100.725
## 44 1 122.215 117.349 102.006 120.775 100.223 115.678 101.621 101.564
## 45 1 115.787 118.024 103.923 135.248 100.941 119.530 104.609 105.556
## 46 1 121.051 143.698 103.939 150.240 100.524 113.395 102.418 101.813
## 47 1 116.645 127.927 103.560 131.027 100.184 104.286 101.722 100.304
## 48 1 115.216 136.878 103.847 130.334 100.228 104.710 103.097 100.206
## 49 0 101.642 107.432 100.974 105.417 100.162 115.319 100.226 100.471
## 50 0 113.701 131.664 102.957 112.839 101.183 111.029 102.536 102.109
## 51 0 104.506 121.306 101.538 110.365 100.614 112.371 100.955 101.450
## 52 0 105.519 120.051 102.539 111.385 100.887 110.934 101.153 103.254
## 53 0 108.862 116.334 102.552 111.076 100.394 107.536 100.821 100.673
## 54 0 108.606 115.044 102.371 111.586 100.423 106.972 100.691 101.802
## 55 0 102.487 113.382 100.710 111.269 100.128 105.985 100.398 100.328
## 56 0 108.527 120.965 104.439 117.526 101.488 108.603 101.376 103.288
## 57 0 102.203 116.082 101.399 108.698 100.239 103.700 100.285 100.651
## 58 0 114.813 120.204 103.559 113.571 100.849 117.010 101.368 102.764
## 59 0 124.543 110.769 101.492 110.006 101.072 102.065 102.077 109.827
## 60 0 106.485 114.217 101.044 106.862 100.587 114.809 100.912 102.411
## 61 0 114.229 110.320 102.180 112.440 100.829 109.454 102.634 100.684
## 62 0 109.800 117.058 102.729 112.872 100.472 106.808 101.070 100.358
## 63 0 126.141 120.924 103.142 122.636 100.426 109.804 102.065 101.664
## 64 0 102.736 108.882 102.415 106.566 100.146 119.297 100.286 101.628
## 65 0 105.557 108.335 101.887 104.903 100.314 106.379 100.426 100.600
## 66 0 103.462 110.053 102.488 108.409 100.251 108.410 100.322 102.020
## 67 0 101.296 130.223 102.307 123.282 100.427 106.135 100.119 100.411
## 68 0 105.224 122.248 102.838 126.875 100.766 127.824 100.929 101.416
## 69 0 103.763 130.721 103.168 117.761 100.677 112.802 101.423 100.565
## 70 0 104.471 128.812 102.465 117.251 100.735 104.649 101.322 101.458
## 71 0 105.095 131.032 103.492 127.274 100.655 108.249 100.743 102.759
## 72 0 105.114 127.366 103.595 119.544 100.820 108.510 101.348 101.510
## 73 0 102.365 132.329 102.486 120.682 100.724 124.822 101.812 102.244
## 74 0 100.710 116.026 101.550 113.207 100.237 106.238 100.146 100.670
## 75 0 103.529 147.546 103.867 122.394 100.987 110.409 100.681 101.512
## 76 0 104.521 116.922 101.657 106.288 100.262 109.639 100.682 100.295
## 77 0 119.699 127.351 104.154 112.870 100.547 108.488 102.750 102.311
## 78 0 106.963 136.430 102.526 114.538 100.128 108.985 100.581 102.142
## 79 0 104.403 109.447 101.295 104.795 100.058 106.588 100.544 100.687
## 80 0 109.893 120.133 102.593 110.167 100.131 113.069 101.231 100.924
## 81 0 111.790 121.641 104.123 111.702 100.318 108.822 101.964 102.312
## 82 0 119.803 123.870 103.167 115.992 100.392 103.769 101.493 101.069
## 83 0 104.136 117.664 101.896 108.511 100.169 123.842 101.893 102.356
## 84 0 109.225 116.925 101.918 108.754 100.158 108.845 101.335 101.109
## 85 0 113.320 135.125 103.251 111.201 100.285 114.597 102.620 102.636
## 86 0 105.029 108.111 101.140 112.114 100.167 105.430 100.379 101.149
## 87 0 104.300 126.233 101.810 112.882 100.091 107.447 100.506 100.650
## 88 0 107.456 115.265 104.304 110.005 100.223 105.686 100.698 100.909
## 89 0 111.124 115.416 103.950 112.644 100.540 110.662 102.768 101.318
## 90 0 102.652 113.204 100.966 108.008 100.179 105.234 100.760 100.034
## 91 0 103.621 111.879 101.607 111.783 100.112 110.584 100.523 100.272
## 92 0 108.703 125.159 101.591 128.967 100.584 100.084 101.258 100.318
## V9
## 1 101.653
## 2 100.383
## 3 100.159
## 4 100.292
## 5 100.403
## 6 100.115
## 7 100.227
## 8 100.115
## 9 100.138
## 10 100.698
## 11 100.781
## 12 100.229
## 13 100.442
## 14 100.851
## 15 100.340
## 16 100.226
## 17 101.090
## 18 100.135
## 19 100.299
## 20 100.543
## 21 100.654
## 22 100.416
## 23 100.052
## 24 100.482
## 25 100.344
## 26 100.234
## 27 100.271
## 28 100.352
## 29 101.240
## 30 100.259
## 31 100.589
## 32 100.197
## 33 100.556
## 34 100.106
## 35 100.595
## 36 101.486
## 37 100.428
## 38 100.381
## 39 100.268
## 40 100.386
## 41 100.310
## 42 101.068
## 43 100.431
## 44 100.364
## 45 100.487
## 46 100.303
## 47 100.084
## 48 100.145
## 49 100.212
## 50 100.688
## 51 100.319
## 52 100.353
## 53 100.113
## 54 100.282
## 55 100.092
## 56 100.927
## 57 100.061
## 58 100.252
## 59 100.154
## 60 100.650
## 61 100.250
## 62 100.048
## 63 100.723
## 64 100.272
## 65 100.024
## 66 100.110
## 67 100.081
## 68 100.279
## 69 100.231
## 70 100.246
## 71 100.081
## 72 100.088
## 73 100.165
## 74 100.088
## 75 100.040
## 76 100.294
## 77 100.120
## 78 100.171
## 79 100.082
## 80 100.239
## 81 100.306
## 82 100.379
## 83 101.271
## 84 100.081
## 85 100.277
## 86 100.170
## 87 100.085
## 88 100.043
## 89 100.128
## 90 100.001
## 91 100.071
## 92 100.016
## Censor V1 V2 V3
## Min. :0.0000 Min. :100.7 Min. :100.7 Min. :100.1
## 1st Qu.:0.0000 1st Qu.:105.1 1st Qu.:117.3 1st Qu.:102.1
## Median :1.0000 Median :112.2 Median :127.0 Median :103.2
## Mean :0.5217 Mean :119.6 Mean :131.5 Mean :103.4
## 3rd Qu.:1.0000 3rd Qu.:121.3 3rd Qu.:136.1 3rd Qu.:104.1
## Max. :1.0000 Max. :314.2 Max. :381.5 Max. :111.2
## V4 V5 V6 V7
## Min. :101.5 Min. :100.0 Min. :100.1 Min. :100.0
## 1st Qu.:111.7 1st Qu.:100.2 1st Qu.:107.4 1st Qu.:100.7
## Median :116.3 Median :100.5 Median :111.0 Median :101.2
## Mean :122.6 Mean :100.7 Mean :112.9 Mean :101.4
## 3rd Qu.:127.0 3rd Qu.:100.8 3rd Qu.:115.3 3rd Qu.:102.0
## Max. :208.4 Max. :104.4 Max. :153.0 Max. :105.4
## V8 V9
## Min. :100.0 Min. :100.0
## 1st Qu.:100.9 1st Qu.:100.1
## Median :101.8 Median :100.3
## Mean :102.2 Mean :100.3
## 3rd Qu.:102.8 3rd Qu.:100.4
## Max. :109.8 Max. :101.7
## Censor V1 V2 V3 V4 V5 V6 V7 V8
## 93 1 119.046 133.608 101.485 114.441 100.290 426.743 101.544 102.646
## 94 1 159.079 159.654 106.289 136.676 101.378 162.038 107.383 102.330
## 95 1 132.633 150.516 105.060 130.100 100.791 228.177 105.533 102.153
## 96 1 117.832 144.863 102.696 121.371 100.600 152.121 105.493 100.864
## 97 1 134.167 121.001 107.656 140.960 101.664 226.206 105.614 110.263
## 98 1 105.753 127.931 101.414 113.470 100.158 134.903 100.537 100.931
## 99 1 104.971 128.974 104.044 126.803 100.272 147.366 100.557 102.085
## 100 1 104.048 129.639 101.532 112.519 100.191 119.467 102.160 100.608
## 101 1 105.443 124.679 103.982 124.874 100.268 184.381 101.244 101.971
## 102 1 107.592 127.138 102.150 118.710 100.265 130.375 100.670 100.677
## 103 1 110.549 122.575 102.416 122.751 100.263 117.912 101.165 100.907
## 104 1 107.315 136.052 102.607 128.496 100.331 121.435 101.815 100.571
## 105 1 107.333 123.238 102.512 126.385 100.150 121.682 101.185 101.213
## 106 1 103.339 125.343 101.419 122.896 100.301 103.776 101.557 101.212
## 107 1 102.673 118.742 100.992 112.684 100.182 123.778 100.861 102.540
## 108 1 110.218 142.843 104.864 146.435 100.838 106.743 103.083 100.630
## 109 1 127.535 149.701 104.137 128.970 100.617 134.568 105.512 101.244
## 110 1 107.157 134.226 101.632 118.651 100.060 134.350 100.345 100.277
## 111 1 107.829 142.228 101.954 121.481 100.687 177.345 102.938 100.889
## 112 1 102.451 134.605 102.397 122.184 100.715 165.739 103.227 101.797
## 113 1 118.664 170.914 107.143 140.203 101.026 138.335 107.177 102.798
## 114 1 116.478 120.895 103.833 130.447 100.413 135.278 101.100 100.744
## 115 1 114.916 123.092 103.682 122.735 100.221 108.396 101.566 100.440
## 116 1 124.956 134.464 104.477 129.351 100.346 105.788 101.402 100.338
## 117 1 210.841 174.558 112.570 193.598 100.684 108.470 106.832 100.825
## 118 1 148.423 137.918 109.789 135.180 100.869 107.050 108.479 101.331
## 119 1 123.803 134.960 105.149 132.395 100.340 100.443 101.825 100.036
## 120 1 122.799 143.460 102.416 127.676 100.182 107.274 100.691 100.057
## 121 1 151.976 132.600 102.572 116.217 100.218 100.270 103.312 100.019
## 122 1 135.189 154.474 107.480 143.261 101.066 100.722 107.494 100.142
## 123 1 111.968 138.710 103.047 121.687 100.147 163.094 100.709 102.115
## 124 1 157.391 155.655 104.362 128.366 100.607 103.864 104.223 100.313
## 125 1 127.068 131.310 104.617 122.682 101.080 116.438 101.528 100.485
## 126 1 144.009 153.344 107.596 148.484 101.022 125.418 104.303 102.552
## 127 1 134.289 151.648 106.380 139.949 100.649 111.825 103.513 102.363
## 128 1 117.934 120.469 102.477 120.952 100.681 124.266 102.110 101.225
## 129 1 123.286 150.952 105.908 134.525 100.749 101.921 102.839 100.195
## 130 1 128.597 138.969 105.880 137.848 100.704 107.596 101.295 101.232
## 131 1 149.202 143.626 105.548 142.657 100.770 104.858 105.274 100.888
## 132 1 107.485 130.557 102.675 122.516 100.163 109.413 100.980 100.450
## 133 1 109.362 143.543 103.887 130.298 100.235 115.881 100.662 100.981
## 134 1 111.875 161.715 103.778 123.703 100.434 115.471 106.248 101.864
## 135 1 158.043 164.474 105.755 142.190 101.047 100.403 106.083 100.009
## 136 1 123.011 142.757 102.654 119.859 100.285 102.182 103.541 100.190
## 137 1 117.677 156.431 105.754 133.371 100.703 100.623 104.499 100.100
## 138 1 106.587 152.645 102.718 127.846 100.344 116.426 102.841 101.276
## 139 1 117.617 135.038 102.559 107.618 100.220 124.722 101.442 100.881
## 140 1 106.270 112.398 101.398 106.114 100.251 125.466 100.374 100.498
## 141 1 123.374 148.874 106.293 121.237 100.707 118.048 104.834 101.522
## 142 1 103.554 125.824 101.740 111.055 100.393 112.108 100.709 100.676
## 143 1 133.190 318.366 107.240 130.345 100.618 120.395 104.895 102.975
## 144 1 122.586 127.119 103.670 110.608 100.486 122.672 102.220 101.972
## 145 1 117.165 140.993 102.554 110.880 100.434 100.879 101.966 100.124
## 146 1 108.494 140.329 101.719 107.960 100.382 108.107 104.280 100.369
## 147 1 103.844 120.347 101.259 105.665 100.409 108.658 101.006 100.835
## 148 1 111.097 129.051 101.434 112.368 100.632 112.288 102.198 101.073
## 149 1 105.499 122.649 101.471 108.328 100.301 117.799 101.515 100.545
## 150 1 101.560 119.936 100.567 104.968 100.202 112.929 100.853 100.369
## 151 1 102.184 122.345 101.413 107.992 100.163 106.538 100.320 100.196
## 152 1 102.868 110.164 100.714 102.761 100.155 107.088 100.227 100.855
## 153 1 103.354 121.530 101.504 106.058 100.150 111.962 100.863 100.741
## 154 1 145.863 164.773 103.503 114.288 101.727 100.957 105.958 100.091
## 155 1 107.534 116.443 101.117 102.830 100.155 114.055 100.242 100.254
## 156 1 116.997 150.966 102.867 113.085 100.781 111.208 101.939 102.297
## 157 1 109.557 128.044 101.485 105.489 100.440 127.177 101.818 100.837
## 158 1 135.548 162.785 103.949 100.016 100.348 118.459 102.888 100.848
## 159 1 105.951 107.487 101.339 105.438 100.000 111.211 100.492 100.668
## 160 1 117.691 136.524 103.536 100.332 100.446 112.854 104.015 102.757
## 161 1 126.174 131.191 104.167 114.697 100.950 104.412 101.988 101.238
## 162 1 109.898 126.291 102.307 109.687 100.438 108.904 101.692 101.267
## 163 1 118.392 134.705 102.623 112.032 100.905 110.058 102.572 102.207
## 164 1 104.605 118.734 101.971 120.389 100.254 116.730 100.458 100.565
## 165 1 103.683 123.057 102.594 109.031 100.246 125.245 100.605 102.557
## 166 1 106.210 118.284 102.366 105.737 100.155 113.842 100.728 101.409
## 167 1 101.356 117.075 101.927 106.775 100.109 106.536 100.206 100.758
## 168 1 102.607 124.596 102.404 107.720 100.204 109.749 100.524 102.013
## 169 1 106.283 116.612 100.653 106.609 100.068 112.870 100.748 100.166
## 170 1 134.721 143.808 104.066 141.080 101.289 103.894 116.324 100.651
## 171 1 202.541 160.165 110.482 150.567 100.749 113.226 103.615 103.121
## 172 1 129.581 148.697 107.470 138.697 100.717 126.179 103.649 101.457
## 173 1 123.508 155.777 107.833 133.261 100.760 138.179 108.392 107.703
## 174 1 115.631 137.188 107.850 141.281 100.608 125.728 103.013 102.516
## 175 1 171.516 168.056 112.158 176.812 101.816 123.484 108.069 103.866
## 176 1 107.827 138.937 106.501 127.032 100.374 122.075 101.295 102.117
## 177 1 117.369 142.555 104.973 124.597 100.313 132.906 103.274 101.027
## 178 1 143.847 155.868 111.945 153.340 100.961 109.351 103.203 101.546
## 179 1 118.871 141.402 107.924 135.021 100.422 112.778 102.621 101.320
## 180 1 150.848 167.951 107.704 141.426 100.730 110.287 103.707 103.205
## 181 1 116.132 129.457 104.861 126.090 100.738 111.926 103.556 101.043
## 182 1 112.232 130.098 102.746 115.960 100.774 106.387 102.090 101.788
## 183 1 143.046 151.158 107.234 137.351 102.378 105.136 107.902 101.519
## 184 1 111.630 124.276 102.082 107.880 101.364 113.812 105.322 102.112
## 185 1 119.521 125.107 102.549 114.487 100.637 115.471 102.694 100.999
## 186 1 127.165 137.545 104.060 127.190 101.533 108.918 107.393 103.783
## 187 1 106.284 117.272 101.061 107.489 100.100 107.806 100.371 100.452
## 188 1 104.743 115.027 102.621 107.692 100.134 112.181 100.516 100.316
## 189 1 124.074 156.724 107.845 130.552 101.310 114.853 102.288 100.412
## 190 1 109.625 122.152 101.936 111.736 100.497 126.835 100.976 101.644
## 191 1 103.730 114.084 101.177 105.665 100.158 110.401 100.805 101.416
## 192 1 103.592 128.421 103.814 115.654 100.178 108.988 100.098 101.763
## 193 1 106.358 115.345 102.001 108.756 100.272 110.656 101.311 100.985
## 194 1 109.414 118.658 101.947 110.661 100.394 114.490 101.991 101.296
## 195 1 107.695 118.946 103.319 110.731 100.269 109.599 101.258 100.564
## 196 1 106.244 115.099 102.089 109.155 100.214 107.886 100.667 100.430
## 197 1 107.505 119.658 102.942 108.229 100.305 112.814 101.253 100.580
## 198 0 109.747 110.149 101.032 117.689 100.516 102.602 101.128 100.172
## 199 0 103.681 136.131 104.044 127.280 101.807 108.779 101.713 101.300
## 200 0 100.192 153.432 108.202 118.943 100.433 109.639 105.494 101.012
## 201 0 114.151 173.975 105.950 135.236 100.295 109.631 101.027 100.100
## 202 0 107.365 110.107 105.486 117.023 100.111 108.209 100.662 100.086
## 203 0 108.379 115.109 101.737 112.935 100.534 114.900 101.235 100.802
## 204 0 102.435 113.803 101.450 111.771 100.344 101.719 101.427 100.195
## 205 0 125.496 120.261 102.648 122.302 101.536 105.599 103.923 101.308
## 206 0 106.728 118.549 101.771 114.320 100.324 104.326 100.215 100.092
## 207 0 100.415 102.458 100.498 102.081 100.079 107.091 100.090 100.440
## 208 0 101.109 106.978 100.901 104.560 100.088 108.502 100.169 100.270
## 209 0 100.410 103.825 100.518 102.222 100.099 116.153 100.624 101.557
## 210 0 100.595 102.773 100.293 101.828 100.359 104.460 100.233 100.029
## 211 0 101.725 105.238 100.397 102.616 100.116 107.865 100.725 101.522
## 212 0 100.862 104.515 100.480 101.931 100.154 104.747 100.254 100.613
## 213 0 100.886 105.172 100.606 101.701 100.111 116.309 100.961 100.974
## 214 0 101.058 105.452 100.319 101.671 100.108 116.216 101.006 100.123
## 215 0 101.518 103.517 100.303 102.086 100.072 107.196 100.703 100.461
## V9
## 93 102.727
## 94 100.934
## 95 101.405
## 96 100.548
## 97 103.729
## 98 100.442
## 99 100.344
## 100 100.254
## 101 100.480
## 102 100.394
## 103 100.174
## 104 100.507
## 105 100.309
## 106 100.163
## 107 100.179
## 108 100.200
## 109 100.289
## 110 100.433
## 111 100.385
## 112 100.747
## 113 101.526
## 114 100.515
## 115 100.144
## 116 100.100
## 117 100.454
## 118 100.325
## 119 100.007
## 120 100.038
## 121 100.011
## 122 100.063
## 123 101.162
## 124 100.008
## 125 100.434
## 126 100.773
## 127 100.519
## 128 100.481
## 129 100.047
## 130 100.263
## 131 100.580
## 132 100.141
## 133 100.338
## 134 100.291
## 135 100.013
## 136 100.059
## 137 100.058
## 138 100.268
## 139 100.539
## 140 100.364
## 141 100.744
## 142 100.216
## 143 101.129
## 144 100.357
## 145 100.035
## 146 100.295
## 147 100.131
## 148 100.560
## 149 100.303
## 150 100.192
## 151 100.059
## 152 100.243
## 153 100.171
## 154 100.046
## 155 100.225
## 156 100.308
## 157 101.229
## 158 100.686
## 159 100.207
## 160 100.807
## 161 100.157
## 162 100.258
## 163 100.298
## 164 100.238
## 165 100.141
## 166 100.053
## 167 100.112
## 168 100.276
## 169 100.058
## 170 100.129
## 171 101.510
## 172 100.796
## 173 101.632
## 174 100.205
## 175 101.019
## 176 100.266
## 177 100.619
## 178 100.395
## 179 100.388
## 180 100.492
## 181 100.636
## 182 100.159
## 183 100.317
## 184 100.290
## 185 100.420
## 186 100.161
## 187 100.048
## 188 100.055
## 189 100.072
## 190 100.490
## 191 100.079
## 192 100.010
## 193 100.025
## 194 100.133
## 195 100.014
## 196 100.058
## 197 100.037
## 198 100.007
## 199 100.057
## 200 100.141
## 201 100.027
## 202 100.033
## 203 100.016
## 204 100.015
## 205 100.039
## 206 100.012
## 207 100.007
## 208 100.015
## 209 100.189
## 210 100.034
## 211 100.065
## 212 100.010
## 213 100.029
## 214 100.100
## 215 100.422
## Censor V1 V2 V3
## Min. :0.0000 Min. :100.2 Min. :102.5 Min. :100.3
## 1st Qu.:1.0000 1st Qu.:105.5 1st Qu.:119.8 1st Qu.:101.7
## Median :1.0000 Median :110.2 Median :130.1 Median :102.7
## Mean :0.8537 Mean :117.7 Mean :134.2 Mean :103.6
## 3rd Qu.:1.0000 3rd Qu.:123.7 3rd Qu.:143.7 3rd Qu.:105.0
## Max. :1.0000 Max. :210.8 Max. :318.4 Max. :112.6
## V4 V5 V6 V7
## Min. :100.0 Min. :100.0 Min. :100.3 Min. :100.1
## 1st Qu.:108.3 1st Qu.:100.2 1st Qu.:107.4 1st Qu.:100.7
## Median :118.7 Median :100.4 Median :112.2 Median :101.7
## Mean :120.8 Mean :100.5 Mean :120.6 Mean :102.6
## 3rd Qu.:129.7 3rd Qu.:100.7 3rd Qu.:121.9 3rd Qu.:103.6
## Max. :193.6 Max. :102.4 Max. :426.7 Max. :116.3
## V8 V9
## Min. :100.0 Min. :100.0
## 1st Qu.:100.4 1st Qu.:100.1
## Median :100.9 Median :100.2
## Mean :101.2 Mean :100.4
## 3rd Qu.:101.6 3rd Qu.:100.4
## Max. :110.3 Max. :103.7
Recode the response variable so that it is a factor with levels Benign (cancer) and Malignant (healthy). Make sure the recoded response exists in both training and validation sets. Provide a ggpairs plot of the nine genetic predictor variables by cancer status on the training data. Do the assumptions of LDA look reasonable? Does it look like the predictors will be able to predict the response effectively given how LDA tries to classify? Provide commentary and offer a suggestion to fix issues if you mention any.
Compute an LDA model on the training data and compute the confusion matrix on the validation set using a threshold of 0.5. Under this setting, is the classification more specific or more sensitive? These data sets are not random samples, so we cannot really look into PPV and NPV that closely. Compute a second confusion matrix by changing the threshold to get something that has an even balance between sensitivity and specificity. Note: this is not about reading into the values but rather working through the computational steps.
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Recode the response variable
training$Censor <- factor(training$Censor, levels = c(0, 1), labels = c("Malignant", "Benign"))
validate$Censor <- factor(validate$Censor, levels = c(0, 1), labels = c("Malignant", "Benign"))
# Generate ggpairs plot for the training data
ggpairs(training, columns = 2:10, ggplot2::aes(color = Censor), title = "Genetic Predictors by Cancer Status")
# keep only the Censor column and the V1-V9 predictor columns
training <- training[, c("Censor", "V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9")]
ggpairs(training, ggplot2::aes(color = Censor))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
I’ve analyzed the ggpairs plot that provides a visual examination of the relationships between nine genetic predictor variables and cancer status, where “Censor” is the response variable. This plot presents individual variable distributions on the diagonal, categorized by cancer status (Benign vs. Malignant), scatter plots for variable pairs below the diagonal, and correlation coefficients above the diagonal.
My observations from the histograms on the diagonal are:
Variables V1, V2, V3, V4, and V6 display a level of class separation (Benign vs. Malignant), though there’s notable overlap. V5, V7, V8, and V9 demonstrate more class overlap, indicating they might be less effective for individual discrimination. In terms of Linear Discriminant Analysis (LDA) assumptions:
Multivariate normality seems questionable for some variables, as their distributions aren’t uniformly normal; some are skewed or show potential outliers. The assumption of homogeneity of variance-covariance matrices might be breached, as evidenced by the scatter plots’ varied spreads and relationship strengths. The correlation coefficients provide insights into linear relationships between variable pairs. Very high correlations can still be problematic: although LDA models correlation through a common covariance matrix, near-collinear predictors make that covariance estimate unstable and the discriminant coefficients harder to interpret.
Assessing the predictors’ effectiveness in predicting the response in light of LDA’s classification approach, I notice that while certain variables indicate class separation potential, the overlap and potential assumption violations could hinder model performance.
To address these issues:
1. Non-normal distributions might benefit from transformations such as logarithmic, square root, or Box-Cox to better approximate normality.
2. Investigating and potentially removing outliers can improve model stability.
3. Given the correlated predictors, employing dimensionality reduction methods like PCA before LDA, or switching to models like regularized discriminant analysis that handle such correlations more gracefully, could be beneficial.
Implementing these strategies could improve adherence to the LDA assumptions and classification effectiveness. However, considering the potential mismatch with LDA for these data, exploring alternative models that don’t rely on these assumptions, such as logistic regression, support vector machines, or tree-based methods, might also be advantageous.
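To make two of these suggestions concrete, here is a minimal sketch using caret's preprocessing options on the training data defined above. The specific preprocessing choices (Yeo-Johnson transformation, centering/scaling, PCA) are illustrative, not the assignment's required approach.
library(caret)
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 1,
                           classProbs = TRUE, summaryFunction = mnLogLoss)
# Suggestion 1: transform and rescale the skewed predictors before LDA
set.seed(1234)
lda_yj <- train(Censor ~ ., data = training, method = "lda",
                preProcess = c("YeoJohnson", "center", "scale"),
                trControl = fitControl, metric = "logLoss")
# Suggestion 3: compress the correlated predictors with PCA before LDA
set.seed(1234)
lda_pca <- train(Censor ~ ., data = training, method = "lda",
                 preProcess = c("center", "scale", "pca"),
                 trControl = fitControl, metric = "logLoss")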
library(MASS)
library(caret)
# Fit LDA model
lda_fit <- lda(Censor ~ ., data = training)
# Predict on validation set using a threshold of 0.5
validate_prob <- predict(lda_fit, validate)$posterior[, "Benign"]
validate_pred <- ifelse(validate_prob > 0.5, "Benign", "Malignant")
validate_pred <- factor(validate_pred, levels = c("Malignant", "Benign"))
# Compute confusion matrix using threshold of 0.5
conf_matrix_05 <- confusionMatrix(validate_pred, validate$Censor)
print(conf_matrix_05)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Malignant Benign
## Malignant 16 50
## Benign 2 55
##
## Accuracy : 0.5772
## 95% CI : (0.4849, 0.6658)
## No Information Rate : 0.8537
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1961
##
## Mcnemar's Test P-Value : 7.138e-11
##
## Sensitivity : 0.8889
## Specificity : 0.5238
## Pos Pred Value : 0.2424
## Neg Pred Value : 0.9649
## Prevalence : 0.1463
## Detection Rate : 0.1301
## Detection Prevalence : 0.5366
## Balanced Accuracy : 0.7063
##
## 'Positive' Class : Malignant
##
# Adjust the threshold to balance sensitivity and specificity
# use ROC analysis to find such a threshold - I'll choose 0.3 for this example
validate_pred_balanced <- ifelse(validate_prob > 0.3, "Benign", "Malignant")
validate_pred_balanced <- factor(validate_pred_balanced, levels = c("Malignant", "Benign"))
# Compute second confusion matrix with the new threshold
conf_matrix_balanced <- confusionMatrix(validate_pred_balanced, validate$Censor)
print(conf_matrix_balanced)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Malignant Benign
## Malignant 15 23
## Benign 3 82
##
## Accuracy : 0.7886
## 95% CI : (0.7058, 0.857)
## No Information Rate : 0.8537
## P-Value [Acc > NIR] : 0.9811561
##
## Kappa : 0.4207
##
## Mcnemar's Test P-Value : 0.0001944
##
## Sensitivity : 0.8333
## Specificity : 0.7810
## Pos Pred Value : 0.3947
## Neg Pred Value : 0.9647
## Prevalence : 0.1463
## Detection Rate : 0.1220
## Detection Prevalence : 0.3089
## Balanced Accuracy : 0.8071
##
## 'Positive' Class : Malignant
##
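As noted in the comment above, the 0.3 cutoff was chosen by hand; a more systematic way to find a threshold that balances sensitivity and specificity is an ROC analysis. A minimal sketch with the pROC package (assuming it is installed), using the validation-set posterior probabilities computed above:
library(pROC)
# ROC curve for the LDA posterior probability of "Benign" on the validation set
roc_obj <- roc(response = validate$Censor, predictor = validate_prob,
               levels = c("Malignant", "Benign"))
# Threshold closest to the top-left corner, balancing sensitivity and specificity
coords(roc_obj, "best", best.method = "closest.topleft",
       ret = c("threshold", "sensitivity", "specificity"))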
Looking at the confusion matrices from my LDA model predictions on the validation set, I can draw several conclusions.
Using a 0.5 threshold, the first confusion matrix shows a sensitivity of 0.889 but a specificity of only 0.524 (accuracy 0.577), so at this cutoff the classification is more sensitive than specific toward the Malignant class.
Adjusting the threshold to 0.3, the second confusion matrix reveals a much more even balance, with a sensitivity of 0.833 and a specificity of 0.781 (accuracy 0.789, balanced accuracy 0.807).
The following code simulates a training and validation set from multivariate normal distributions. The first two predictors are important while the remaining predictors are just random noise. I’ve included a plot of the first five for verification.
library(mvtnorm)
set.seed(1234)
muYes<-c(10,10)
muNo<-c(8,8)
Sigma<-matrix(c(1,.8,.8,1),2,2,byrow=T)
nY<-30
nN<-30
dataYes<-rmvnorm(nY,muYes,Sigma)
dataNo<- rmvnorm(nN,muNo,Sigma)
train<-rbind(dataYes,dataNo)
train<-data.frame(train)
for (i in 3:20){
train<-cbind(train,rnorm(nY+nN))
}
names(train)<-paste("X",1:20,sep="")
train$Response<-rep(c("Yes","No"),each=30)
train$Response<-factor(train$Response)
#Creating a validation set
muYes<-c(10,10)
muNo<-c(8,8)
Sigma<-matrix(c(1,.8,.8,1),2,2,byrow=T)
nY<-500
nN<-500
dataYes<-rmvnorm(nY,muYes,Sigma)
dataNo<- rmvnorm(nN,muNo,Sigma)
validate<-rbind(dataYes,dataNo)
validate<-data.frame(validate)
for (i in 3:20){
validate<-cbind(validate,rnorm(nY+nN))
}
names(validate)<-paste("X",1:20,sep="")
validate$Response<-rep(c("Yes","No"),each=500)
validate$Response<-factor(validate$Response)
pairs(train[,1:5],col=rep(c("red","black"),each=30))
Fit an LDA model on the training data using just the first two predictors, \(X1\) and \(X2\). Observe the overall accuracy on the validation set. (Do not worry about the other metrics for this exercise, but it might be interesting to examine those as well.)
Fit an additional LDA model using the first 10 predictors which will include eight random noise predictors. Observe the overall accuracy on the validation set.
Repeat again one last time using all 20 predictors (18 random noise predictors) and observe the overall accuracy on the validation set.
What trend do you see in the accuracy as you add more predictors that are just random noise? This is referred to as the “curse of dimensionality” and also occurs in other algorithms such as KNN. It highlights the importance of feature selection in the case of LDA. To make matters worse, these tools do not traditionally have feature selection procedures built in, so we have to rely on other tools for feature selection and include only variables that we think will truly be meaningful, to safeguard against this issue. There are some other strategies, one of which utilizes a technique called PCA, which we will discuss later.
By fitting an LDA model with just X1 and X2, the important predictors, I expect high accuracy on the validation set. These predictors are significant and should effectively classify observations.
library(MASS)
# Fit LDA model using X1 and X2
lda.fit.2 <- lda(Response ~ X1 + X2, data = train)
# Predict on validation set and calculate accuracy
predictions.2 <- predict(lda.fit.2, newdata = validate)$class
accuracy.2 <- mean(predictions.2 == validate$Response)
print(accuracy.2)
## [1] 0.833
Incorporating the first 10 predictors, including eight random noise predictors, into the LDA model is likely to decrease accuracy. The noise from these additional predictors could interfere with the model’s classification ability.
# Fit LDA model using X1 to X10
lda.fit.10 <- lda(Response ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10, data = train)
# Predict on validation set and calculate accuracy
predictions.10 <- predict(lda.fit.10, newdata = validate)$class
accuracy.10 <- mean(predictions.10 == validate$Response)
print(accuracy.10)
## [1] 0.823
Using all 20 predictors, where only two are significant and the rest are noise, the accuracy on the validation set is anticipated to worsen. The irrelevant predictors can overwhelm the relevant information, reducing accuracy.
# Fit LDA model using X1 to X20
lda.fit.20 <- lda(Response ~ ., data = train)
# Predict on validation set and calculate accuracy
predictions.20 <- predict(lda.fit.20, newdata = validate)$class
accuracy.20 <- mean(predictions.20 == validate$Response)
print(accuracy.20)
## [1] 0.717
Adding random noise predictors to the LDA model steadily degrades performance: accuracy drops from 0.833 with the two informative predictors to 0.823 with ten predictors and 0.717 with all twenty. This illustrates the “curse of dimensionality,” where additional dimensions, especially irrelevant ones, dilute the signal and reduce the model’s performance. This problem affects not just LDA but also other algorithms like KNN. Feature selection is therefore crucial for reducing complexity and maintaining performance, and dimensionality reduction techniques like PCA can also help by summarizing the predictors in a smaller number of components.
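As a follow-up, one way to see PCA's potential as a safeguard is to compress all 20 predictors before fitting LDA. This is an illustrative sketch on the simulated train/validate objects above; keeping two components is an arbitrary choice, and with so few training rows the leading components are not guaranteed to isolate the signal.
# Illustrative sketch: PCA on all 20 predictors, then LDA on two components
pca <- prcomp(train[, 1:20], center = TRUE, scale. = TRUE)
train_pca    <- data.frame(pca$x[, 1:2], Response = train$Response)
validate_pca <- data.frame(predict(pca, validate[, 1:20])[, 1:2],
                           Response = validate$Response)
lda.fit.pca <- lda(Response ~ ., data = train_pca)
predictions.pca <- predict(lda.fit.pca, newdata = validate_pca)$class
mean(predictions.pca == validate$Response)   # overall accuracy on the validation set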