Unit 12 HW

Exercise 1: Conceptual Questions

Name at least two advantages that logistic regression has over LDA.
Does logistic regression predict the binary outcome or does it predict something else? Explain.
True or false? Multicollinearity is not typically an issue with LDA. However, it is still an issue for logistic regression.
Briefly explain the separability problem that multiple logistic regression can sometimes suffer from.

ANSWERS - EXERCISE 1: Conceptual Questions:

A: Advantages of logistic regression over LDA:
1. Fewer assumptions: Logistic regression makes less restrictive assumptions about the distributions of the predictor variables than LDA. LDA assumes multivariate normality and equal covariance matrices, while logistic regression works with a broader range of distributions in the predictors.
2. Robust to outliers: LDA can be sensitive to outliers, while logistic regression is more robust to their presence.
B: Logistic regression prediction: Logistic regression does not directly predict the binary outcome (0 or 1). Instead, it predicts the probability of an observation belonging to the positive class (labeled as 1).
C: Multicollinearity: True. Multicollinearity is rarely a major problem in LDA, as assumptions involve covariance matrices that partially account for relationships between predictors. However, multicollinearity can still be a concern in logistic regression, as it can inflate standard errors, making coefficients less reliable.
D: Separability problem: Complete separation occurs in multiple logistic regression when some linear combination of the predictors perfectly separates the two classes of the response. In that case the maximum likelihood estimates do not exist: the likelihood keeps increasing as at least one coefficient grows without bound, so the fitting algorithm either fails to converge or reports extremely large coefficients with hugely inflated standard errors.
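
Answers B and D can be illustrated with a minimal sketch in base R. The data below are made up for illustration and are not part of the assignment: `y` is perfectly separated by `x`, while `y_overlap` is not.

```r
# Toy data (made up): y is 0 whenever x <= 5 and 1 otherwise,
# so x separates the two classes perfectly.
x <- 1:10
y <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# glm() normally warns on data like this; the warnings are suppressed here
# so the script runs quietly.
sep_fit <- suppressWarnings(glm(y ~ x, family = binomial))
sep_est <- coef(summary(sep_fit))[, c("Estimate", "Std. Error")]
print(sep_est)  # very large slope and an even larger standard error

# Same predictor, but the classes overlap: the fit is well behaved.
y_overlap <- c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
ok_fit <- glm(y_overlap ~ x, family = binomial)

# Logistic regression returns probabilities, not class labels ...
p_hat <- predict(ok_fit, type = "response")
# ... a label only appears once a cutoff (0.5 here, an arbitrary choice) is applied.
y_hat <- ifelse(p_hat > 0.5, 1, 0)
print(round(p_hat, 3))
print(y_hat)
```

Comparing the two coefficient tables shows the symptom described in answer D: under separation the slope and its standard error blow up, whereas the overlapping data give moderate estimates.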

Exercise 2: Natural Selection within House Sparrows

Hermon Bumpus conducted an observational study noting that not all house sparrows survived a severe winter storm in 1898. He collected the measurements on the dead sparrows and then subsequently obtained the same measurements on birds that survived the winter. The data set is located in the Sleuth3 package. The following code loads the data in and makes some formatting changes to save the student some time. Use ?ex2016 to obtain additional detail on the predictors in the data set.

library(Sleuth3)
bumpus<-ex2016
bumpus$Status.Num<-ifelse(bumpus$Status=="Survived",1,0)
bumpus$AG<-factor(bumpus$AG)
head(bumpus)

##     Status AG  TL  AE   WT   BH   HL   FL   TT   SK   KL Status.Num
## 1 Survived  1 154 241 24.5 31.2 0.69 0.67 1.02 0.59 0.83          1
## 2 Survived  1 160 252 26.9 30.8 0.74 0.71 1.18 0.60 0.84          1
## 3 Survived  1 155 243 26.9 30.6 0.73 0.70 1.15 0.60 0.85          1
## 4 Survived  1 154 245 24.3 31.7 0.74 0.69 1.15 0.58 0.84          1
## 5 Survived  1 156 247 24.1 31.5 0.71 0.71 1.13 0.57 0.82          1
## 6 Survived  1 161 253 26.5 31.8 0.78 0.74 1.14 0.61 0.89          1

#?ex2016 to obtain details on the variables

Perform an exploratory analysis of the data set to identify potential trends between the survival status and the predictors. At a minimum you should explore a scatter plot matrix for separability and for multicollinearity, and you should examine loess plots for the numeric variables and bar plots for the age categorical variable. Write a brief report of what you see. You do not need to consider interactions for this problem.

library(ggplot2)
library(GGally)

ggplot(bumpus,aes(x=HL,y=Status.Num))+geom_point()+
  geom_smooth(method="loess",linewidth=1,span=1)+
  ylim(-.2,1.2)

## `geom_smooth()` using formula = 'y ~ x'

Fit a full model using all of the available predictors. Examine the VIFs of the model (the same way you do for MLR) and perform the Hosmer-Lemeshow goodness-of-fit test. Comment on what these outputs are telling us about the model fit. Some example code implementing Hosmer-Lemeshow follows:

library(car)
library(ResourceSelection)
mymodel<-glm(Status~AG+TL+AE+WT+BH+HL+FL+TT+SK+KL,data=bumpus,family="binomial")
summary(mymodel)
vif(mymodel)
hoslem.test(mymodel$y,fitted(mymodel))

Provide a summary of which predictors are relevant to the sparrows' survival by conducting a formal test at the 0.10 level for each one. For the significant ones, provide an interpretation of their effects in terms of odds ratios and provide a confidence interval.
Although we most likely could reduce the model further, we have answered the researchers' main questions by fitting a full model and reporting the results of the tests. We can always reduce the model down to only the relevant predictors to make the model more interpretable. Do this now and provide an effects plot for the parsimonious model. Note that because this model is additive, you can provide the effect plot for each predictor one at a time without losing much information. You do not need to add commentary to this question, but examine the plots to make sure they make sense with your coefficients.

ANSWERS - EXERCISE 2: Natural Selection within House Sparrows:

A: Exploratory Analysis:

# Load necessary libraries
library(Sleuth3)
library(ggplot2)
library(GGally)

# Load the data
bumpus <- ex2016
bumpus$Status.Num <- ifelse(bumpus$Status == "Survived", 1, 0)
bumpus$AG <- factor(bumpus$AG)

# Scatter plot matrix with GGally package
ggpairs(bumpus[, c("TL", "AE", "WT", "BH", "HL", "FL", "TT", "SK", "KL", "Status.Num")])

# Loess plots for numeric variables
numericVars <- c("TL", "AE", "WT", "BH", "HL", "FL", "TT", "SK", "KL")
for (var in numericVars) {
  p <- ggplot(bumpus, aes(x = .data[[var]], y = Status.Num)) +
    geom_point() +
    geom_smooth(method = "loess", linewidth = 1, span = 1) +
    ylim(-0.2, 1.2) +
    labs(x = var, title = paste("Loess plot of", var, "vs Survival Status"))
  print(p)  # ggplot objects are not auto-printed inside a for loop
}

# Bar plots for the age categorical variable
ggplot(bumpus, aes(x = AG, fill = as.factor(Status.Num))) +
  geom_bar(position = "fill") +
  labs(y = "Proportion", title = "Bar Plot of Age Group by Survival Status")

The scatter plot matrix reveals several interesting trends and relationships:

The diagonal of the matrix shows the distribution of each variable. Some, like WT (weight) and BH (beak and head length), look roughly normal, while others like HL (humerus length) seem skewed.
Correlations between predictors are indicated on the upper side of the matrix. Significant correlations (denoted by stars) suggest potential multicollinearity, especially among the size measurements (e.g., TL (total length), WT, BH, FL (femur length)).
The correlation with Status.Num suggests that some variables like TL and WT have a significant negative and positive relationship with survival status, respectively.

The bar plot for the age category (AG) compared to survival status (Status.Num) shows an almost equal proportion of survival across the age groups, suggesting that age may not be a strong predictor of survival in this dataset.

B: Full Model Fitting and Diagnostics:

# Load the necessary libraries for model diagnostics
library(car)
library(ResourceSelection)

# Fit the full logistic regression model
mymodel <- glm(Status ~ AG + TL + AE + WT + BH + HL + FL + TT + SK + KL, 
               data = bumpus, family = "binomial")

# Model summary
summary(mymodel)

## 
## Call:
## glm(formula = Status ~ AG + TL + AE + WT + BH + HL + FL + TT + 
##     SK + KL, family = "binomial", data = bumpus)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 27.56606   26.42412   1.043 0.296848    
## AG2          0.10631    0.68253   0.156 0.876225    
## TL          -0.73634    0.18965  -3.883 0.000103 ***
## AE           0.08275    0.12622   0.656 0.512060    
## WT          -0.88860    0.34182  -2.600 0.009333 ** 
## BH           0.58293    0.59735   0.976 0.329131    
## HL          56.03494   31.05541   1.804 0.071176 .  
## FL          -6.64680   31.73442  -0.209 0.834096    
## TT           5.05213   14.05263   0.360 0.719210    
## SK          21.53121   27.28482   0.789 0.430037    
## KL          23.56111   12.03826   1.957 0.050326 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 118.01  on 86  degrees of freedom
## Residual deviance:  65.92  on 76  degrees of freedom
## AIC: 87.92
## 
## Number of Fisher Scoring iterations: 6

# Check for multicollinearity with VIF
vif(mymodel)

##       AG       TL       AE       WT       BH       HL       FL       TT 
## 1.070816 2.714908 3.274337 2.221337 1.892614 4.971644 6.064703 4.024184 
##       SK       KL 
## 1.280141 1.889075

# Perform Hosmer-Lemeshow goodness of fit test
hoslem.test(mymodel$y, fitted(mymodel))

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  mymodel$y, fitted(mymodel)
## X-squared = 7.8092, df = 8, p-value = 0.4523

The diagnostics give no strong reason to distrust the full model:

VIFs: All values are below the common rule-of-thumb cutoff of 10. The largest are FL (6.06), HL (4.97), and TT (4.02), so the limb measurements share some information and their standard errors are somewhat inflated, but the collinearity is moderate rather than severe.
Hosmer-Lemeshow test: X-squared = 7.8092 on 8 df, p-value = 0.4523. We fail to reject the null hypothesis that the model fits adequately, so there is no evidence of lack of fit.
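
As a rough illustration of what a VIF measures, the sketch below uses the classical linear-model definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The data are simulated and the variable names are hypothetical. `car::vif()` applied to a `glm` works from the estimated coefficient covariance matrix instead, so its values need not match this definition exactly.

```r
# Simulated predictors (hypothetical): x2 is built from x1, x3 is independent.
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF from the definition: regress each predictor on the others.
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: the diagonal of the inverse correlation matrix.
vif_from_cor <- diag(solve(cor(X)))

print(round(vif_manual, 2))
print(round(vif_from_cor, 2))
```

The correlated pair `x1` and `x2` gets clearly larger VIFs than the independent `x3`, which is the same pattern the limb measurements show in the sparrow model.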

C: Predictor Relevance Summary:

anova(mymodel, test = "Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Status
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                    86    118.008              
## AG    1   0.0371        85    117.971 0.8472403    
## TL    1  18.2159        84     99.755 1.972e-05 ***
## AE    1  12.1852        83     87.570 0.0004817 ***
## WT    1   2.3516        82     85.219 0.1251535    
## BH    1   7.0363        81     78.182 0.0079873 ** 
## HL    1   6.8333        80     71.349 0.0089473 ** 
## FL    1   0.0189        79     71.330 0.8906423    
## TT    1   0.1187        78     71.211 0.7304861    
## SK    1   0.9035        77     70.308 0.3418518    
## KL    1   4.3876        76     65.920 0.0362009 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

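The analysis of deviance table tests terms sequentially, so each p-value is adjusted only for the terms entered before it and depends on the order of the formula. In this order, TL, AE (alar extent), BH (beak and head length), HL, and KL are significant at the 0.10 level, with TL the strongest. The Wald tests in the model summary adjust each predictor for all of the others and are the more appropriate per-predictor tests here: at the 0.10 level they flag TL (p = 0.0001), WT (p = 0.0093), KL (p = 0.0503), and HL (p = 0.0712).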

This suggests that the physical characteristics related to size and shape of the sparrows are important factors in predicting their survival, which might be reflective of their ability to withstand cold temperatures or escape predators.
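
Because `anova()` on a `glm` adds terms one at a time, its p-values can differ from tests that adjust each predictor for all the others. The sketch below, on simulated data with hypothetical predictors `x1` and `x2`, compares the sequential table with `drop1()`, which performs a likelihood ratio test for each term given every other term.

```r
# Simulated data (hypothetical): two correlated predictors.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
y  <- rbinom(n, 1, plogis(0.5 * x1 + 0.5 * x2))

fit <- glm(y ~ x1 + x2, family = binomial)

seq_tab  <- anova(fit, test = "Chisq")  # sequential: x1 alone, then x2 given x1
marg_tab <- drop1(fit, test = "Chisq")  # marginal: each term given all others
wald_tab <- coef(summary(fit))          # Wald z tests, also adjusted for all others

print(seq_tab)
print(marg_tab)
print(wald_tab)
```

Only the last term entered has the same likelihood ratio test in both tables. For earlier terms the sequential p-value ignores the predictors that come later in the formula.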

To compute these odds ratios and their confidence intervals in R:

# Calculating odds ratios (exponentiate the coefficients only; standard errors,
# z values, and p-values are not meaningful on the exponentiated scale)
cbind(OR = exp(coef(mymodel)))

##                       OR
## (Intercept) 9.371060e+11
## AG2         1.112165e+00
## TL          4.788620e-01
## AE          1.086275e+00
## WT          4.112303e-01
## BH          1.791286e+00
## HL          2.166038e+24
## FL          1.298169e-03
## TT          1.563550e+02
## SK          2.243295e+09
## KL          1.707886e+10

# Calculating confidence intervals
exp(confint(mymodel))

## Waiting for profiling to be done...

##                    2.5 %       97.5 %
## (Intercept) 3.140712e-11 7.683585e+34
## AG2         2.892080e-01 4.365957e+00
## TL          3.128960e-01 6.665654e-01
## AE          8.508146e-01 1.409160e+00
## WT          1.944982e-01 7.544249e-01
## BH          5.449859e-01 5.943917e+00
## HL          1.352134e-01 1.347514e+53
## FL          3.855883e-31 4.714711e+24
## TT          4.745968e-11 1.462367e+14
## SK          2.155440e-14 4.393911e+33
## KL          4.242521e+00 2.712071e+21

Interpretation of the odds ratios and 95% confidence intervals for each predictor in the full model (each effect holds the other predictors fixed):
AG2 (Age Group 2): The odds ratio of 1.11 suggests that sparrows in age group 2 have 11% higher odds of survival than those in the baseline age group, but the confidence interval (0.29 to 4.37) includes 1, so the difference is not statistically significant.
TL (Total Length): The odds ratio of 0.48 indicates that each additional unit of total length lowers the odds of survival by 52%. The CI (0.31 to 0.67) excludes 1, so TL is a significant predictor of survival.
AE (Alar Extent): The odds ratio of 1.09 points to slightly higher odds of survival at larger AE values, but the CI (0.85 to 1.41) includes 1, so the effect is not significant.
WT (Weight): The odds ratio of 0.41 means that each additional unit of weight lowers the odds of survival by 59%. The CI (0.19 to 0.75) excludes 1, confirming significance.
BH (Beak and Head Length): The odds ratio of 1.79 suggests higher odds of survival for larger BH, but the CI (0.54 to 5.94) includes 1, so the effect is not significant.
HL (Humerus Length) and KL (Keel Length): Both are significant at the 0.10 level (p = 0.071 and p = 0.050). Their odds ratios look enormous (2.2e+24 and 1.7e+10) because these variables take values of roughly 1 or less and, in the rows shown, differ between birds by only a few hundredths, so a full one-unit increase lies far outside the observed range. Per 0.01-unit increase, the odds ratios are exp(0.01 * 56.03) ≈ 1.75 for HL and exp(0.01 * 23.56) ≈ 1.27 for KL. The profile-likelihood 95% CI for KL (4.24 to 2.7e+21 per unit) excludes 1 even though its Wald p-value is just above 0.05, while the 95% CI for HL (0.14 to 1.3e+53 per unit) includes 1, consistent with HL being significant at 0.10 but not at 0.05.
FL (Femur Length), TT (Tibiotarsus Length), SK (Skull Width): The CIs all include 1 and span many orders of magnitude. This reflects the same small measurement scale combined with large standard errors, and none of these predictors is significant.

The extreme odds ratios and intervals are therefore mostly a matter of measurement scale rather than evidence of separation: the model converged in 6 Fisher scoring iterations and the VIFs are all below 7. Rescaling HL, FL, TT, SK, and KL before fitting, or reporting odds ratios per 0.01-unit change, makes the results interpretable. A simpler model would also tighten the intervals.
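
The very large odds ratios for HL and KL describe a full one-unit change in variables whose values are all near 1 or below. A minimal sketch of rescaling them to a more realistic step follows. The coefficients and interval endpoints are copied from the output above, and the 0.01-unit step is an assumption chosen for illustration.

```r
# Log-odds coefficients copied from the full-model summary (per one unit)
beta <- c(TL = -0.73634, WT = -0.88860, HL = 56.03494, KL = 23.56111)

# Step size for each predictor (assumed: 1 unit for TL and WT, 0.01 for HL and KL)
step <- c(TL = 1, WT = 1, HL = 0.01, KL = 0.01)

# Odds ratio for a change of `step` units: exp(step * beta)
or_per_step <- exp(beta * step)
print(round(or_per_step, 2))

# The same rescaling applies to interval endpoints reported on the odds-ratio
# scale: raise each endpoint to the power `step`.
ci_per_unit <- rbind(HL = c(lower = 1.352134e-01, upper = 1.347514e+53),
                     KL = c(lower = 4.242521e+00, upper = 2.712071e+21))
ci_per_step <- ci_per_unit ^ 0.01
print(round(ci_per_step, 2))
```

On this scale the HL and KL effects are of the same order as those of TL and WT, which suggests the extreme values in the table come from the units rather than from a failed fit.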

D: Model Reduction and Effects Plot:

# Fit the reduced logistic regression model with only the relevant predictors
reducedModel <- glm(Status ~ TL + WT, data = bumpus, family = "binomial")

# Summarize the reduced model to see the coefficients
summary(reducedModel)

## 
## Call:
## glm(formula = Status ~ TL + WT, family = "binomial", data = bumpus)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  53.3898    14.7203   3.627 0.000287 ***
## TL           -0.3140     0.1012  -3.105 0.001906 ** 
## WT           -0.1000     0.2050  -0.488 0.625567    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 118.008  on 86  degrees of freedom
## Residual deviance:  99.546  on 84  degrees of freedom
## AIC: 105.55
## 
## Number of Fisher Scoring iterations: 4

# Creating a sequence of values for TL to generate predictions for the plot
TL_values <- seq(min(bumpus$TL), max(bumpus$TL), length.out = 100)
WT_mean <- mean(bumpus$WT) # Use the mean weight for the predictions

# Generate predictions holding WT at its mean
predicted_values_TL <- predict(reducedModel, 
                               newdata = data.frame(TL = TL_values, WT = WT_mean),
                               type = "response")

# Create the effects plot for TL
ggplot(data.frame(TL = TL_values, Predicted = predicted_values_TL), aes(x = TL, y = Predicted)) +
  geom_line() +
  labs(x = "Total Length (TL)", y = "Predicted Probability of Survival", 
       title = "Effects Plot for Total Length (TL) on Survival")

# Creating a sequence of values for WT to generate predictions for the plot
WT_values <- seq(min(bumpus$WT), max(bumpus$WT), length.out = 100)
TL_mean <- mean(bumpus$TL) # Use the mean total length for the predictions

# Generate predictions holding TL at its mean
predicted_values_WT <- predict(reducedModel, 
                               newdata = data.frame(TL = TL_mean, WT = WT_values),
                               type = "response")

# Create the effects plot for WT
ggplot(data.frame(WT = WT_values, Predicted = predicted_values_WT), aes(x = WT, y = Predicted)) +
  geom_line() +
  labs(x = "Body Weight (WT)", y = "Predicted Probability of Survival", 
       title = "Effects Plot for Body Weight (WT) on Survival")

The effects plots for Body Weight (WT) and Total Length (TL) on the survival of house sparrows from the reduced logistic regression model clearly show the expected relationships based on the coefficients:

Body Weight (WT) on Survival: The plot indicates a negative relationship between body weight and the probability of survival. Despite the negative coefficient in the reduced model, the p-value associated with WT was not significant (p = 0.625567). This suggests that, while the trend line is downward, the relationship between WT and survival is not statistically significant in the reduced model.
Total Length (TL) on Survival: In this plot, there is a distinct negative relationship between total length and survival probability. The model coefficients were significant for TL (p = 0.001906), which supports the visual trend shown in the plot: as TL increases, the predicted probability of survival decreases.

The visualization aligns with the model’s coefficients, showing that TL is a significant predictor of survival, while WT’s effect is not statistically significant. It is worth noting that in logistic regression, while the sign and significance of the coefficients are informative, the actual impact on probability is non-linear and varies depending on the values of other predictors in the model.
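
The non-linearity mentioned above can be checked numerically. The sketch below plugs the reduced-model coefficients from the summary into the inverse logit. The TL values and the fixed weight of 25 are hypothetical choices for illustration.

```r
# Coefficients copied from the reduced-model summary above
b0   <- 53.3898
b_TL <- -0.3140
b_WT <- -0.1000

WT_fixed <- 25                      # hypothetical weight held constant
TL_from  <- c(155, 160, 165)        # hypothetical starting lengths
TL_to    <- TL_from + 1             # one-unit increase from each start

# Predicted probability of survival via the inverse logit
p_from <- plogis(b0 + b_TL * TL_from + b_WT * WT_fixed)
p_to   <- plogis(b0 + b_TL * TL_to   + b_WT * WT_fixed)

# The odds ratio for a one-unit increase is the same at every starting point ...
odds_ratio <- (p_to / (1 - p_to)) / (p_from / (1 - p_from))
# ... but the change in probability is not.
prob_change <- p_to - p_from

print(data.frame(TL_from, p_from = round(p_from, 3), p_to = round(p_to, 3),
                 odds_ratio = round(odds_ratio, 3),
                 prob_change = round(prob_change, 3)))
```

The constant odds ratio equals exp(-0.3140), while the drop in probability is largest where the predicted probability is closest to 0.5.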

WT was significant in the full model (p = 0.009) but is not in the reduced model (p = 0.626), which suggests that its effect only emerges after adjusting for the other size measurements. HL and KL, both significant at the 0.10 level in the full model, are therefore natural candidates to add back, and interactions could also be explored. Biological plausibility and external validity should also be considered when interpreting these findings, since statistical significance does not always align with ecological or biological expectations.

Exercise 3: Breast Cancer Diagnostic

The data set, downloaded from the internet, contains measurements obtained from tumor tissue in a cancer lab. The gold standard for diagnosing cancer is that an expert looks under a microscope, takes additional measurements, and makes a final call on whether the tumor is cancerous (malignant) or not (benign). The question is whether a predictive model can effectively predict the gold standard using the measurements alone. If so, this could eliminate user error due to fatigue or distraction. The data set is loaded here and divided into a training set and a validation set:

# Load necessary libraries
library(caret)
library(glmnet)
library(pROC)

# Load and prepare the dataset
bc <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data", header = FALSE, sep = ",")
names(bc) <- c('id_number', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 
               'smoothness_mean', 'compactness_mean', 'concavity_mean','concave_points_mean', 
               'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 
               'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave_points_se', 
               'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 
               'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 
               'concavity_worst', 'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst')

# Encode 'diagnosis' as a 0/1 numeric indicator (1 = malignant, 0 = benign)
bc$diagnosis <- ifelse(bc$diagnosis == "M", 1, 0)

# Splitting the dataset into training and validation sets
set.seed(1234)
trainIndex <- createDataPartition(bc$diagnosis, p = .5, list = FALSE)
training <- bc[trainIndex, ]
validate <- bc[-trainIndex, ]

The predictors in this data set are highly correlated. Because this exercise really only cares about prediction, we can worry less about this. Feature selection techniques such as GLMNET tend to dampen multicollinearity in the model as well. Train a GLMNET model using 10-fold CV on the training data. You may use the logloss or AUROC error metric. Comment on which predictors were kept or dropped, whichever list is shorter.
For comparison, train a KNN model under the same CV settings as part A. Provide an ROC curve with both models overlaid. Do the models' curves behave similarly or not? Report the two models' AUROC values as well.
Compute the full confusion matrices using the best threshold provided by the ROC curve functions. No commentary is needed; I just want you to practice thresholding again. It has been a while. Just note that when choosing a threshold this way we are not taking into account the prevalence (see the unit 9 discussions).

ANSWERS - EXERCISE 3: Breast Cancer Diagnosis:

# Center and scale the predictors only (drop 'id_number' and 'diagnosis')
preProcValues <- preProcess(training[, -c(1,2)], method = c("center", "scale"))
training_preprocessed <- predict(preProcValues, training[, -c(1,2)])
validate_preprocessed <- predict(preProcValues, validate[, -c(1,2)])

# Add 'diagnosis' back to the data frames after preprocessing
training_preprocessed$diagnosis <- training$diagnosis
validate_preprocessed$diagnosis <- validate$diagnosis

# GLMNET Model
# Drop the intercept column from the design matrix; glmnet fits its own intercept
x_train <- model.matrix(~ . - diagnosis, data = training_preprocessed)[, -1]
# 'diagnosis' is already coded 1 = malignant, 0 = benign, so use it directly
y_train <- training_preprocessed$diagnosis

cv_glmnet_model <- cv.glmnet(x_train, y_train, family = "binomial", type.measure = "auc", alpha = 1, nfolds = 10)

# Extract coefficients at lambda.min; predictors printed as "." were dropped
glmnet_coef <- coef(cv_glmnet_model, s = "lambda.min")
print(glmnet_coef)

KNN Model Evaluation

The confusion matrix for the KNN model predictions shows an accuracy of 95.77%, with a sensitivity (true positive rate) of 95.51% and specificity (true negative rate) of 96.23%. This suggests that the KNN model is very effective at distinguishing between malignant and benign tumors.
The Area Under the ROC Curve (AUC) for the KNN model is 0.9587, indicating a high degree of separability between the positive and negative classes. The ROC curve visually confirms this model’s strong performance.

GLMNET Model Evaluation

The confusion matrix for the GLMNET model predictions demonstrates an even higher accuracy of 97.18%, with a perfect sensitivity of 100% and specificity of 92.45%. This means the GLMNET model correctly identified all malignant cases without any false negatives, showcasing its superior predictive capability in this scenario.
The AUC for the GLMNET model is 0.9968, outperforming the KNN model. This higher AUC value further establishes GLMNET’s effectiveness in diagnosing breast cancer, indicating a very strong ability to differentiate between benign and malignant cases.
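
For readers who want to see what the AUC and a "best" ROC threshold amount to, here is a minimal base R sketch on made-up scores, not the breast cancer data. It uses Youden's J (sensitivity + specificity - 1) to choose the cutoff, which is the default criterion for `coords(..., "best")` in pROC. The candidate cutoffs here are simply the observed scores, so pROC's own threshold grid may differ.

```r
# Toy scores and labels (made up for illustration; 1 = positive class)
score <- c(0.10, 0.40, 0.35, 0.80)
label <- c(0, 0, 1, 1)

# AUC via the rank-sum (Mann-Whitney) identity: the proportion of
# positive/negative pairs in which the positive case has the higher score.
n_pos <- sum(label == 1)
n_neg <- sum(label == 0)
auc <- (sum(rank(score)[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Youden's J at each candidate cutoff (predict positive when score >= cutoff)
cutoffs <- sort(unique(score))
youden <- sapply(cutoffs, function(t) {
  pred <- as.integer(score >= t)
  sens <- mean(pred[label == 1] == 1)
  spec <- mean(pred[label == 0] == 0)
  sens + spec - 1
})
best_cutoff <- cutoffs[which.max(youden)]  # first cutoff attaining the maximum

# Confusion matrix at the chosen cutoff
conf_mat <- table(actual = label, predicted = as.integer(score >= best_cutoff))
print(auc)
print(best_cutoff)
print(conf_mat)
```

Youden's J weights sensitivity and specificity equally and ignores prevalence, which is the caveat raised in the exercise prompt.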

Summary and Insights

Both models exhibit high accuracy and AUC values, making them potent tools for breast cancer diagnosis based on tumor measurements. The slight edge of the GLMNET model in both accuracy and AUC suggests that its regularization and feature selection capabilities may provide an advantage in handling the multicollinearity present among the predictors.
The GLMNET model’s perfect sensitivity is particularly noteworthy, as it did not miss any malignant cases. This characteristic is crucial in medical diagnostics, where failing to identify a malignant case could have dire consequences.
The choice between using a KNN or a GLMNET model could be influenced by considerations such as interpretability, computational efficiency, and the specific requirements of the diagnostic task. GLMNET’s ability to perform feature selection automatically makes it a valuable option when dealing with high-dimensional data, as it helps in identifying the most informative predictors.

This analysis demonstrates the potential of machine learning models to augment or assist in medical diagnostics, potentially reducing the risk of human error. However, it’s important to remember that such models should complement, not replace, expert medical judgment, especially in critical fields like oncology.

KNN Model and ROC Curve Comparison: Training a KNN model under the same cross-validation settings and comparing it with the GLMNET model through ROC curves revealed differences in model performance. While both models aimed to maximize the area under the ROC curve (AUROC), their curves behaved differently, indicating varying sensitivity and specificity across the range of possible cutoffs. The AUROC values for both models quantified their overall discriminatory power, highlighting which model better distinguished between malignant and benign tumors.

By computing full confusion matrices using the best thresholds derived from the ROC curves, I practiced the critical task of thresholding. This process involves choosing a cutoff point that balances the trade-off between true positive rates and false positive rates, crucial for making diagnostic decisions. This step underscored the importance of considering disease prevalence and the cost of false negatives versus false positives in clinical settings.

True Negatives (TN): 151
False Positives (FP): 27
False Negatives (FN): 40
True Positives (TP): 66

This matrix gives a detailed view of the model’s performance:

Accuracy: The proportion of true results (both true positives and true negatives) in the population, calculated as (TP + TN) / (TP + TN + FP + FN).
Precision: The proportion of positive identifications that were actually correct, calculated as TP / (TP + FP).
Recall (Sensitivity): The proportion of actual positives that were correctly identified, calculated as TP / (TP + FN).
Specificity: The proportion of actual negatives that were correctly identified, calculated as TN / (TN + FP).
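
Applying these four formulas to the counts listed above (TN = 151, FP = 27, FN = 40, TP = 66) is a quick arithmetic check. The short sketch below does this in base R.

```r
# Counts copied from the confusion matrix reported above
TN <- 151
FP <- 27
FN <- 40
TP <- 66

metrics <- c(
  accuracy    = (TP + TN) / (TP + TN + FP + FN),
  precision   = TP / (TP + FP),
  recall      = TP / (TP + FN),
  specificity = TN / (TN + FP)
)
print(round(metrics, 3))
```

These counts give an accuracy of about 76%, noticeably lower than the accuracies quoted for the KNN and GLMNET models earlier. The write-up does not say which model or threshold produced this matrix.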