Introduction

Analysis of Machine Learning Algorithms on Breast Cancer Data

CMDA 4654 - Machine Learning | Spring 2025

Sully Stefanik, Pierce Hamlin, Sid Somashekar

Algorithms Used:

1. KNN Classification

2. Naive Bayes

3. Logistic Regression

4. Loess Regression

5. Ridge Regression

6. Multiple Regression

KNN

Predicting N Stage from age, tumor size, and survival months.

Confusion Matrix

          Reference
Prediction  N1  N2  N3
        N1 782 240 108
        N2  22   9  17
        N3  14   7   8
                 cm.overall
Accuracy       6.619718e-01
Kappa          4.630067e-02
AccuracyLower  6.345013e-01
AccuracyUpper  6.886506e-01
AccuracyNull   6.777133e-01
AccuracyPValue 8.848227e-01
McnemarPValue  1.228241e-55

Scatterplot Results

Confusion Matrix By Class

          Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
Class: N1  0.95599022   0.1053985      0.6920354      0.5324675 0.6920354
Class: N2  0.03515625   0.9589905      0.1875000      0.7868852 0.1875000
Class: N3  0.06015038   0.9804469      0.2758621      0.8938879 0.2758621
              Recall         F1 Prevalence Detection Rate Detection Prevalence
Class: N1 0.95599022 0.80287474  0.6777133    0.647887324           0.93620547
Class: N2 0.03515625 0.05921053  0.2120961    0.007456504           0.03976802
Class: N3 0.06015038 0.09876543  0.1101906    0.006628003           0.02402651
          Balanced Accuracy
Class: N1         0.5306943
Class: N2         0.4970734
Class: N3         0.5202987
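
The overall and per-class statistics above can be recomputed from the confusion matrix counts alone. The Python sketch below is an illustration added to check the arithmetic, not the code that produced this dashboard; the variable and function names are our own.

```python
# Confusion matrix copied from the output above:
# rows = predicted class, columns = reference (true) class.
classes = ["N1", "N2", "N3"]
cm = [
    [782, 240, 108],  # predicted N1
    [22, 9, 17],      # predicted N2
    [14, 7, 8],       # predicted N3
]

k = len(classes)
n = sum(sum(row) for row in cm)
row_totals = [sum(row) for row in cm]
col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]

# Overall accuracy: share of cases on the diagonal.
accuracy = sum(cm[i][i] for i in range(k)) / n

# No-information rate: accuracy of always predicting the largest class.
no_information_rate = max(col_totals) / n

# Cohen's kappa: agreement beyond what the marginal totals give by chance.
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
kappa = (accuracy - expected) / (1 - expected)


def sensitivity(i):
    """True positives divided by all reference cases of class i."""
    return cm[i][i] / col_totals[i]


def specificity(i):
    """True negatives divided by all reference cases not in class i."""
    true_neg = n - row_totals[i] - col_totals[i] + cm[i][i]
    return true_neg / (n - col_totals[i])


print(f"accuracy={accuracy:.4f}  NIR={no_information_rate:.4f}  kappa={kappa:.4f}")
for i, name in enumerate(classes):
    print(f"{name}: sensitivity={sensitivity(i):.4f}  specificity={specificity(i):.4f}")
```

Comparing accuracy with the no-information rate is a quick way to see whether the classifier does better than always predicting the most common class.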

Model Results and Accuracy


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Col Total |
|-------------------------|

 
Total Observations in Table:  1207 

 
               | test_classes 
df_knn_classes |        N1 |        N2 |        N3 | Row Total | 
---------------|-----------|-----------|-----------|-----------|
            N1 |       782 |       240 |       108 |      1130 | 
               |     0.956 |     0.938 |     0.812 |           | 
---------------|-----------|-----------|-----------|-----------|
            N2 |        22 |         9 |        17 |        48 | 
               |     0.027 |     0.035 |     0.128 |           | 
---------------|-----------|-----------|-----------|-----------|
            N3 |        14 |         7 |         8 |        29 | 
               |     0.017 |     0.027 |     0.060 |           | 
---------------|-----------|-----------|-----------|-----------|
  Column Total |       818 |       256 |       133 |      1207 | 
               |     0.678 |     0.212 |     0.110 |           | 
---------------|-----------|-----------|-----------|-----------|

 

Naive Bayes Classification

Predicting Estrogen Status from marital status, race, and progesterone status.

Estrogen Status

Distribution of Estrogen Status

Results

preds
Negative Positive 
      21     1186 
          Actual
Predicted  Negative Positive
  Negative        8       13
  Positive       80     1106
Accuracy:  0.923
  Predicted   Actual Count
1  Negative Negative     8
2  Positive Negative    80
3  Negative Positive    13
4  Positive Positive  1106
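
Because the Positive class dominates the test set, it can help to read the accuracy next to the per-class recall and a majority-class baseline. The Python sketch below recomputes these from the four counts in the table above; it is an added illustration, and the variable names are our own.

```python
# Counts copied from the Predicted x Actual table above.
tn = 8      # predicted Negative, actual Negative
fn = 13     # predicted Negative, actual Positive
fp = 80     # predicted Positive, actual Negative
tp = 1106   # predicted Positive, actual Positive

n = tn + fn + fp + tp
actual_negative = tn + fp
actual_positive = fn + tp

accuracy = (tn + tp) / n

# Recall for each class: how many of the true cases were recovered.
recall_positive = tp / actual_positive
recall_negative = tn / actual_negative
balanced_accuracy = (recall_positive + recall_negative) / 2

# Baseline: always predict the more common class.
majority_baseline = max(actual_negative, actual_positive) / n

print(f"accuracy          = {accuracy:.3f}")
print(f"recall (Positive) = {recall_positive:.3f}")
print(f"recall (Negative) = {recall_negative:.3f}")
print(f"balanced accuracy = {balanced_accuracy:.3f}")
print(f"majority baseline = {majority_baseline:.3f}")
```

When one class is rare, balanced accuracy and the recall of the minority class are usually more informative than overall accuracy.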

Logistic Regression

Predicting Tumor Grade from tumor size and age.

Logistic Regression Model Summary


Call:
glm(formula = Grade_binary ~ Age + `Tumor Size`, family = binomial(link = "logit"), 
    data = training_df)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.287392   0.266004  -1.080     0.28    
Age          -0.018616   0.004724  -3.940 8.13e-05 ***
`Tumor Size`  0.011586   0.001882   6.158 7.39e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3367.2  on 2816  degrees of freedom
Residual deviance: 3311.4  on 2814  degrees of freedom
AIC: 3317.4

Number of Fisher Scoring iterations: 4
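
The coefficients above are on the log-odds scale. The Python sketch below shows how they translate into a predicted probability and into odds ratios. It assumes that `Grade_binary = 1` codes the high-grade category (consistent with the plot titles, but not shown in the output), and the example patient is hypothetical.

```python
import math

# Coefficients copied from the glm summary above.
INTERCEPT = -0.287392
B_AGE = -0.018616
B_SIZE = 0.011586


def prob_high_grade(age, tumor_size):
    """Inverse logit of the linear predictor (assumes 1 = high grade)."""
    eta = INTERCEPT + B_AGE * age + B_SIZE * tumor_size
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical patient: age 50, tumor size 30 (units as recorded in the data).
p = prob_high_grade(50, 30)

# Odds ratios for a 10-unit increase in each predictor, holding the other fixed.
or_age_10 = math.exp(10 * B_AGE)
or_size_10 = math.exp(10 * B_SIZE)

print(f"P(high grade | age=50, size=30) = {p:.3f}")
print(f"odds ratio per 10 years of age  = {or_age_10:.3f}")
print(f"odds ratio per 10 units of size = {or_size_10:.3f}")
```

An odds ratio below 1 means the odds of a high-grade tumor fall as the predictor increases; above 1 means they rise.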

Tumor Size Effect on Probability of High Grade Tumors

Effect of Age on Probability of High Grade Tumors

Loess

Predicting Survival Months from tumor size.

Fit 1 (Span = 0.5, Degree = 1)

Fit 2 (Span = 0.5, Degree = 2)

Fit 3 (Span = 0.25, Degree = 1)

Fit 4 (Span = 0.25, Degree = 2)

Ridge Regression

Predicting Regional Node Examined from tumor size, survival months, regional node positive, and age.

Model Summary

          Length Class     Mode   
a0        100    -none-    numeric
beta      400    dgCMatrix S4     
df        100    -none-    numeric
dim         2    -none-    numeric
lambda    100    -none-    numeric
dev.ratio 100    -none-    numeric
nulldev     1    -none-    numeric
npasses     1    -none-    numeric
jerr        1    -none-    numeric
offset      1    -none-    logical
call        4    -none-    call   
nobs        1    -none-    numeric

CV Model


Call:  cv.glmnet(x = x, y = y, alpha = 0) 

Measure: Mean-Squared Error 

    Lambda Index Measure    SE Nonzero
min  0.333   100   54.44 1.676       4
1se  4.950    71   56.04 1.623       4
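
The `1se` row follows the one-standard-error rule: `cv.glmnet` reports the largest lambda whose cross-validated error is still within one standard error of the minimum. The short Python check below, an added illustration using only the numbers in the table above, confirms that the reported `1se` error sits under that threshold.

```python
# Values copied from the cv.glmnet table above.
lambda_min, mse_min, se_min = 0.333, 54.44, 1.676
lambda_1se, mse_1se = 4.950, 56.04

# One-standard-error threshold: minimum CV error plus its standard error.
threshold = mse_min + se_min

# The 1se lambda applies a stronger penalty while staying under the threshold.
within_one_se = mse_1se <= threshold
stronger_penalty = lambda_1se > lambda_min

print(f"threshold = {threshold:.3f}, 1se error = {mse_1se:.2f}")
print(f"within one SE: {within_one_se}, stronger penalty: {stronger_penalty}")
```

Choosing `lambda.1se` instead of `lambda.min` trades a small increase in cross-validated error for more shrinkage.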

Best Model


Call:  glmnet(x = x, y = y, alpha = 0, lambda = best_lambda) 

  Df  %Dev Lambda
1  4 17.17 0.3333

Best Model Coefficients

5 x 1 sparse Matrix of class "dgCMatrix"
                                s0
(Intercept)           12.625925669
Tumor Size             0.002874603
Survival Months        0.010885987
Reginol Node Positive  0.630939584
Age                   -0.032536142
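
To show how these coefficients are used, the Python sketch below applies them to a single hypothetical patient. The patient values are invented for illustration; the coefficient names mirror the column names in the output, including the data set's own spelling "Reginol".

```python
# Coefficients copied from the ridge output above.
coef = {
    "(Intercept)": 12.625925669,
    "Tumor Size": 0.002874603,
    "Survival Months": 0.010885987,
    "Reginol Node Positive": 0.630939584,
    "Age": -0.032536142,
}

# Hypothetical patient (values invented for illustration).
patient = {
    "Tumor Size": 30,
    "Survival Months": 60,
    "Reginol Node Positive": 2,
    "Age": 50,
}

# Prediction = intercept + sum of coefficient * predictor value.
predicted_nodes_examined = coef["(Intercept)"] + sum(
    coef[name] * value for name, value in patient.items()
)

print(f"predicted regional nodes examined: {predicted_nodes_examined:.2f}")
```

The largest coefficient per unit is the one on positive nodes: each additional positive node raises the predicted number of nodes examined by about 0.63.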

RSQ (Explains Variation)

[1] 0.1717268

Plotted CV Model

Plotted Model

Multiple Regression

Predicting Survival Months from tumor size, regional node examined, grade, and regional node positive.

Model Summary


Call:
lm(formula = `Survival Months` ~ `Tumor Size` + Grade + `Reginol Node Positive` + 
    `Regional Node Examined`, data = df2)

Residuals:
    Min      1Q  Median      3Q     Max 
-73.891 -15.088   1.224  18.284  45.794 

Coefficients:
                                            Estimate Std. Error t value
(Intercept)                                 74.80438    1.19427  62.636
`Tumor Size`                                -0.05772    0.01750  -3.298
GradeModerately differentiated; Grade II    -0.25583    1.08221  -0.236
GradePoorly differentiated; Grade III       -2.80054    1.19975  -2.334
GradeUndifferentiated; anaplastic; Grade IV -5.86191    5.29593  -1.107
`Reginol Node Positive`                     -0.59461    0.07894  -7.532
`Regional Node Examined`                     0.11668    0.04843   2.409
                                            Pr(>|t|)    
(Intercept)                                  < 2e-16 ***
`Tumor Size`                                0.000981 ***
GradeModerately differentiated; Grade II    0.813137    
GradePoorly differentiated; Grade III       0.019631 *  
GradeUndifferentiated; anaplastic; Grade IV 0.268416    
`Reginol Node Positive`                     6.13e-14 ***
`Regional Node Examined`                    0.016034 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.64 on 4017 degrees of freedom
Multiple R-squared:  0.0255,    Adjusted R-squared:  0.02404 
F-statistic: 17.52 on 6 and 4017 DF,  p-value: < 2.2e-16
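
The summary statistics at the bottom of this output are linked by standard formulas. The Python sketch below, added as an illustration, recovers the adjusted R-squared, the F-statistic, and one t value from the other numbers printed above. Because the inputs are rounded, the results agree with the output only to rounding precision.

```python
# Values copied from the lm summary above.
r_squared = 0.0255
residual_df = 4017
n_predictors = 6                      # model terms excluding the intercept
n_obs = residual_df + n_predictors + 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / residual_df

# Overall F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom.
f_statistic = (r_squared / n_predictors) / ((1 - r_squared) / residual_df)

# t value for Tumor Size: estimate divided by its standard error.
t_tumor_size = -0.05772 / 0.01750

print(f"n = {n_obs}")
print(f"adjusted R-squared = {adj_r_squared:.5f}")
print(f"F-statistic        = {f_statistic:.2f}")
print(f"t (Tumor Size)     = {t_tumor_size:.3f}")
```

The highly significant F-statistic alongside a very small R-squared reflects the large sample: the predictors have a detectable effect but explain little of the variation.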

MLR Plots

Scatterplot Matrix

Results and Conclusions

KNN: Sully

Using only tumor size, survival months, and patient age, the KNN model (K = 15) predicted the correct N stage for about 66% of the test cases. That figure is slightly below the 67.8% no-information rate obtained by always predicting N1, and the low Kappa (0.046) together with the N2 and N3 sensitivities (both under 7%) shows that the model assigns almost every case to N1. This is a start, but it could definitely be expanded upon. Part of the problem is that these variables are not strongly indicative of N stage; given the limited variables in the data set, however, they were the best predictors available.
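
As a reminder of what the algorithm does, the Python sketch below implements a bare-bones KNN classifier: it finds the K nearest training points by Euclidean distance and takes a majority vote. The tiny data set is invented for illustration and is not taken from the breast cancer data; the function name is our own.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbors = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Invented toy data: (age, tumor size, survival months) -> N stage.
train_x = [
    (45, 20, 70), (50, 22, 65), (48, 25, 72),
    (60, 60, 30), (62, 55, 25), (58, 70, 28),
]
train_y = ["N1", "N1", "N1", "N3", "N3", "N3"]

print(knn_predict(train_x, train_y, (49, 23, 68), k=3))
print(knn_predict(train_x, train_y, (61, 58, 27), k=3))
```

Because the distance mixes years, tumor size, and months, predictors are usually standardized before running KNN so that no single variable dominates; this sketch skips that step for brevity.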

Naive Bayes: Pierce

In this Naïve Bayes classification, I predicted estrogen receptor status using marital status, race, and progesterone status as categorical predictors. I split the dataset into 70% training and 30% testing subsets. After fitting the model on the training data and making predictions on the test set, I used a confusion matrix to evaluate the model's performance. The model achieved a respectable 92.3% accuracy in identifying estrogen-positive vs. estrogen-negative cases. I also included a bar chart of the outcome distribution to provide insight into class balance, which can affect model performance in some cases.

Logistic Regression: Sully

The logistic regression model aimed to predict a tumor's grade from its size and the patient's age. Tumor grade has four levels, so to fit a binomial model we condensed grades one and two into a low category and grades three and four into a high category. Tumor size and patient age were both statistically significant; the intercept was not. Overall, the model improves on the null model (residual deviance 3311.4 versus null deviance 3367.2), and we were able to determine that age and tumor size are both associated with grade in this data.

Loess: Pierce

In this section, I used LOESS regression to explore the relationship between tumor size and survival months using two degrees (1 for locally linear, 2 for locally quadratic) and two spans (0.5 and 0.25). I believe the models with the larger span (0.5) are better for visualization because they produce smoother trends and capture less noise. The models with the smaller span (0.25) are much more jagged and capture more localized trends, which are likely just noise here, since we don’t expect many small-scale trends in this context. Overall, the LOESS fits suggest that survival months tend to decrease as tumor size increases, with the higher-degree and lower-span models providing more detailed, but noisier, interpretations.
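
The role of the span can be seen in a stripped-down version of the method. The Python sketch below fits a degree-1 local regression at a single point using tricube weights over the nearest `span * n` observations. It is a simplified illustration of the idea, not a reimplementation of R's `loess` (which adds interpolation and other refinements); the function names and toy data are our own.

```python
import math


def tricube(u):
    """Tricube kernel: weight falls smoothly to 0 at |u| = 1."""
    u = abs(u)
    return (1 - u**3) ** 3 if u < 1 else 0.0


def local_linear_fit(x0, xs, ys, span=0.5):
    """Weighted least-squares line through the points nearest x0, evaluated at x0."""
    n = len(xs)
    q = max(2, math.ceil(span * n))           # number of neighbors in the window
    h = sorted(abs(x - x0) for x in xs)[q - 1] or 1e-12  # window half-width
    w = [tricube((x - x0) / h) for x in xs]

    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, xs)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, xs))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return ybar + slope * (x0 - xbar)


# Toy data: an exact line, which a degree-1 local fit should reproduce.
xs = list(range(20))
ys = [2 * x + 1 for x in xs]

print(local_linear_fit(7.5, xs, ys, span=0.5))
print(local_linear_fit(7.5, xs, ys, span=0.25))
```

A smaller span shrinks the window, so each local line is driven by fewer points and the fitted curve follows local wiggles more closely, which is why the span 0.25 fits look more jagged.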

Ridge Regression: Sid

In this ridge regression model, we predicted the number of regional lymph nodes examined using tumor size, survival months, regional node positive count, and age as the predictors. We used cross-validation to find the optimal penalty (lambda = 0.333), which helps prevent overfitting by shrinking the coefficients of correlated predictors toward zero without eliminating them. The model returned an R-squared value of 0.1717, indicating these predictors explain 17.17% of the variance in the number of regional lymph nodes examined. To summarize, ridge regression provided a stable and interpretable model, especially given the potential for multicollinearity among some of these predictors.

Multiple Regression: Sid

We used multiple linear regression to model the relationship between our response variable “Survival Months” and our predictor variables “Grade” (how serious the tumor is), “Regional Node Positive”, “Regional Node Examined”, and “Tumor Size”. The low R-squared value from the summary (0.0255) shows that the model explains only about 2.6% of the variance in survival months, indicating the need to explore additional relevant variables, consider potential non-linear relationships or interactions, and carefully evaluate the diagnostic plots for violations of model assumptions. Based on the model plots and testing, we conclude that the residuals are approximately normal, but the assumption of constant variance is violated: in the Residuals vs. Fitted plot, the values are concentrated toward the end of the plot rather than the beginning. Overall, MLR is not effective for modeling how long a patient survives based on the grade of the tumor, the size of the tumor, the number of regional lymph nodes that tested positive for cancer, and the total number of lymph nodes examined. More research, along with other variables and models, is needed to predict survival more reliably.

Data Source, Resources, Dictionary

Data Source:

https://zenodo.org/records/5120960

Dictionary:

Age = How old in years the patient is

Tumor Size = Size of the tumor in millimeters

Survival Months = Number of months the patient survived from the date of diagnosis until death or the end of follow-up

N Stage = The extent to which the tumor has spread to nearby lymph nodes (N1 = Fewest, N3 = Most)

Grade = Originally four levels, condensed into high and low for binary. Level one meaning the tumor cells closely resemble normal cell tissue, and level four meaning they are very clearly tumor cells under a microscope.

Regional Node Examined = How many lymph nodes were examined

Regional Node Positive = How many of the examined lymph nodes were cancerous

Resources

KNN = Class Lecture 6

Naive Bayes = Class Lecture 5

Loess = Group Exercise 2

Logistic Regression: https://stats.oarc.ucla.edu/r/dae/logit-regression/

Ridge Regression: https://www.statology.org/ridge-regression-in-r/

Multiple Regression: none used