Analysis of Machine Learning Algorithms on Breast Cancer Data
CMDA 4654 - Machine Learning | Spring 2025
Sully Stefanik, Pierce Hamlin, Sid Somashekar
Algorithms Used:
1. KNN Classification
2. Naive Bayes
3. Logistic Regression
4. Loess Regression
5. Ridge Regression
6. Multiple Regression
Predicting N Stage from age, tumor size, and survival months.
Reference
Prediction N1 N2 N3
N1 782 240 108
N2 22 9 17
N3 14 7 8
cm.overall
Accuracy 6.619718e-01
Kappa 4.630067e-02
AccuracyLower 6.345013e-01
AccuracyUpper 6.886506e-01
AccuracyNull 6.777133e-01
AccuracyPValue 8.848227e-01
McnemarPValue 1.228241e-55
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
Class: N1 0.95599022 0.1053985 0.6920354 0.5324675 0.6920354
Class: N2 0.03515625 0.9589905 0.1875000 0.7868852 0.1875000
Class: N3 0.06015038 0.9804469 0.2758621 0.8938879 0.2758621
Recall F1 Prevalence Detection Rate Detection Prevalence
Class: N1 0.95599022 0.80287474 0.6777133 0.647887324 0.93620547
Class: N2 0.03515625 0.05921053 0.2120961 0.007456504 0.03976802
Class: N3 0.06015038 0.09876543 0.1101906 0.006628003 0.02402651
Balanced Accuracy
Class: N1 0.5306943
Class: N2 0.4970734
Class: N3 0.5202987
Cell Contents
|-------------------------|
| N |
| N / Col Total |
|-------------------------|
Total Observations in Table: 1207
| test_classes
df_knn_classes | N1 | N2 | N3 | Row Total |
---------------|-----------|-----------|-----------|-----------|
N1 | 782 | 240 | 108 | 1130 |
| 0.956 | 0.938 | 0.812 | |
---------------|-----------|-----------|-----------|-----------|
N2 | 22 | 9 | 17 | 48 |
| 0.027 | 0.035 | 0.128 | |
---------------|-----------|-----------|-----------|-----------|
N3 | 14 | 7 | 8 | 29 |
| 0.017 | 0.027 | 0.060 | |
---------------|-----------|-----------|-----------|-----------|
Column Total | 818 | 256 | 133 | 1207 |
| 0.678 | 0.212 | 0.110 | |
---------------|-----------|-----------|-----------|-----------|
Predicting Estrogen Status from marital status, race, and progesterone status.
preds
Negative Positive
21 1186
Actual
Predicted Negative Positive
Negative 8 13
Positive 80 1106
Accuracy: 0.923
Predicted Actual Count
1 Negative Negative 8
2 Positive Negative 80
3 Negative Positive 13
4 Positive Positive 1106
Predicting Tumor Grade from tumor size and age.
Call:
glm(formula = Grade_binary ~ Age + `Tumor Size`, family = binomial(link = "logit"),
data = training_df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.287392 0.266004 -1.080 0.28
Age -0.018616 0.004724 -3.940 8.13e-05 ***
`Tumor Size` 0.011586 0.001882 6.158 7.39e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3367.2 on 2816 degrees of freedom
Residual deviance: 3311.4 on 2814 degrees of freedom
AIC: 3317.4
Number of Fisher Scoring iterations: 4
Predicting Survival Months from tumor size.
Predicting Regional Node Examined from tumor size, survival months, regional node positive, and age.
Length Class Mode
a0 100 -none- numeric
beta 400 dgCMatrix S4
df 100 -none- numeric
dim 2 -none- numeric
lambda 100 -none- numeric
dev.ratio 100 -none- numeric
nulldev 1 -none- numeric
npasses 1 -none- numeric
jerr 1 -none- numeric
offset 1 -none- logical
call 4 -none- call
nobs 1 -none- numeric
Call: cv.glmnet(x = x, y = y, alpha = 0)
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 0.333 100 54.44 1.676 4
1se 4.950 71 56.04 1.623 4
Call: glmnet(x = x, y = y, alpha = 0, lambda = best_lambda)
Df %Dev Lambda
1 4 17.17 0.3333
5 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) 12.625925669
Tumor Size 0.002874603
Survival Months 0.010885987
Reginol Node Positive 0.630939584
Age -0.032536142
[1] 0.1717268
Predicting Survival Months from tumor size, regional node examined, grade, and regional node positive.
Call:
lm(formula = `Survival Months` ~ `Tumor Size` + Grade + `Reginol Node Positive` +
`Regional Node Examined`, data = df2)
Residuals:
Min 1Q Median 3Q Max
-73.891 -15.088 1.224 18.284 45.794
Coefficients:
Estimate Std. Error t value
(Intercept) 74.80438 1.19427 62.636
`Tumor Size` -0.05772 0.01750 -3.298
GradeModerately differentiated; Grade II -0.25583 1.08221 -0.236
GradePoorly differentiated; Grade III -2.80054 1.19975 -2.334
GradeUndifferentiated; anaplastic; Grade IV -5.86191 5.29593 -1.107
`Reginol Node Positive` -0.59461 0.07894 -7.532
`Regional Node Examined` 0.11668 0.04843 2.409
Pr(>|t|)
(Intercept) < 2e-16 ***
`Tumor Size` 0.000981 ***
GradeModerately differentiated; Grade II 0.813137
GradePoorly differentiated; Grade III 0.019631 *
GradeUndifferentiated; anaplastic; Grade IV 0.268416
`Reginol Node Positive` 6.13e-14 ***
`Regional Node Examined` 0.016034 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22.64 on 4017 degrees of freedom
Multiple R-squared: 0.0255, Adjusted R-squared: 0.02404
F-statistic: 17.52 on 6 and 4017 DF, p-value: < 2.2e-16
KNN: Sully
Based on a limited set of predictors, the KNN algorithm with K set to 15 predicted the tumor's N stage from tumor size, survival months, and the patient's age correctly about 66% of the time. That is slightly below the 67.8% that always predicting N1 would achieve, and sensitivity for N2 and N3 is under 10%, so the model mostly predicts the majority class. This is a start, but it could definitely be expanded upon. Part of the limitation is that these variables are not the most indicative of N stage, but given the available data they were the best candidates.
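As a rough check on how to read the KNN output, the headline figures can be recomputed from the reported confusion matrix alone. The sketch below uses base R only; the object names (cm, nir, cohen_kappa) are illustrative and are not taken from the original analysis.

```r
# Confusion matrix copied from the KNN output (rows = predicted, columns = reference).
cm <- matrix(c(782, 240, 108,
                22,   9,  17,
                14,   7,   8),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("N1", "N2", "N3"),
                             Reference  = c("N1", "N2", "N3")))

n        <- sum(cm)                # total test observations
accuracy <- sum(diag(cm)) / n      # share of correct predictions
nir      <- max(colSums(cm)) / n   # no-information rate: always predict the largest class

# Cohen's kappa: agreement beyond what the row and column totals give by chance.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Per-class sensitivity (recall): correct predictions divided by true class size.
sensitivity <- diag(cm) / colSums(cm)

print(round(c(accuracy = accuracy, no_information_rate = nir, kappa = cohen_kappa), 4))
print(round(sensitivity, 4))
```

Comparing accuracy with the no-information rate, rather than reading accuracy on its own, is one way to judge how much the predictors add when one class dominates.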
Naive Bayes: Pierce
In this Naïve Bayes classification, I predicted estrogen receptor status using marital status, race, and progesterone status as categorical predictors. I split the dataset into 70% training and 30% testing subsets. After fitting the model on the training data and making predictions on the test set, I used a confusion matrix to evaluate performance. The model achieved 92.3% accuracy in identifying estrogen-positive vs. estrogen-negative cases. I also included a bar chart of the outcome distribution to provide insight into class balance, which can affect model performance.
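To make the class-balance point concrete, the sketch below recomputes accuracy, the share of each actual class, and per-class recall from the reported Naive Bayes confusion table. It uses base R only, and the object names are illustrative rather than taken from the original code.

```r
# Confusion table copied from the Naive Bayes output (rows = predicted, columns = actual).
cm_nb <- matrix(c( 8,   13,
                  80, 1106),
                nrow = 2, byrow = TRUE,
                dimnames = list(Predicted = c("Negative", "Positive"),
                                Actual    = c("Negative", "Positive")))

accuracy    <- sum(diag(cm_nb)) / sum(cm_nb)   # overall share of correct predictions
class_share <- colSums(cm_nb) / sum(cm_nb)     # how common each actual class is
recall      <- diag(cm_nb) / colSums(cm_nb)    # share of each actual class recovered

print(round(accuracy, 3))
print(round(class_share, 3))
print(round(recall, 3))
```

With roughly 93% of test cases being estrogen-positive, overall accuracy says little about how well the negative class is detected, so per-class recall is worth reporting alongside it.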
Logistic Regression: Sully
The logistic regression model aimed to predict tumor grade from tumor size and the patient's age. Tumor grade has four levels, so to fit the binomial family we condensed grades one and two into a low category and grades three and four into a high category. Tumor size and age were both statistically significant, although the intercept was not. The residual deviance (3311.4) is only slightly below the null deviance (3367.2), so the two predictors account for a small share of the variation in grade, but we were able to determine that age and tumor size are associated with grade in this data.
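One way to read the logistic regression summary is to convert the coefficients to odds ratios and to compare the null and residual deviances. The sketch below works only from the numbers printed in the glm() output; which grade category is coded as the event is not stated in the output, so the odds ratios are described relative to "the modeled category". Object names are illustrative.

```r
# Coefficient estimates copied from the glm() summary (log-odds scale).
coefs <- c(Age = -0.018616, `Tumor Size` = 0.011586)

# Exponentiating gives the multiplicative change in odds per one-unit increase.
odds_ratios <- exp(coefs)

# Likelihood-ratio test of the fitted model against the intercept-only model.
null_dev  <- 3367.2; null_df  <- 2816
resid_dev <- 3311.4; resid_df <- 2814
lr_stat   <- null_dev - resid_dev
lr_p      <- pchisq(lr_stat, df = null_df - resid_df, lower.tail = FALSE)

# Share of the null deviance removed by the predictors (a deviance-based pseudo R-squared).
deviance_explained <- 1 - resid_dev / null_dev

print(round(odds_ratios, 4))
print(c(lr_stat = lr_stat, lr_p = lr_p, deviance_explained = deviance_explained))
```

Under these numbers, each additional year of age multiplies the odds of the modeled category by about 0.98 and each additional unit of tumor size by about 1.01. The predictors are jointly significant but remove under 2% of the null deviance.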
Loess: Pierce
In this section, I used LOESS regression to explore the relationship between tumor size and survival months using varying degrees (1 for linear and 2 for quadratic) and spans (0.5 and 0.25). I believe the models with the larger span (0.5) are better for visualization because they produce smoother trends and capture less noise. The models with the smaller span (0.25) are much more jagged and capture more local trends, which are likely just noise in this case, as we don’t expect many small trends in this context. Overall, the LOESS fits suggest that survival months tend to decrease as tumor size increases, with higher-degree, lower-span models providing more detailed but noisier interpretations.
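The span and degree comparison described above can be sketched with base R's loess(). The data here are synthetic stand-ins generated for illustration (the column names tumor_size and survival_months and the simulated trend are assumptions, not the study data), so only the qualitative behavior carries over.

```r
set.seed(1)

# Synthetic stand-in data (hypothetical; not the study data).
sim <- data.frame(tumor_size = runif(300, min = 1, max = 140))
sim$survival_months <- 75 - 0.06 * sim$tumor_size + rnorm(300, sd = 20)

# The four span/degree combinations described in the text.
grid <- expand.grid(span = c(0.5, 0.25), degree = c(1, 2))

# Fit one LOESS curve per combination.
fits <- Map(function(s, d) {
  loess(survival_months ~ tumor_size, data = sim, span = s, degree = d)
}, grid$span, grid$degree)

# enp is the equivalent number of parameters: larger values mean a wigglier curve.
grid$enp <- sapply(fits, function(f) f$enp)
print(grid)

# Fitted values on an evenly spaced grid, ready for plotting with lines().
new_x  <- data.frame(tumor_size = seq(5, 135, by = 5))
curves <- sapply(fits, predict, newdata = new_x)
```

The equivalent number of parameters gives a numeric handle on the smooth-versus-jagged trade-off: it grows as the span shrinks or the degree rises.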
Ridge Regression: Sid
In this ridge regression model, we predicted the number of regional lymph nodes examined using tumor size, survival months, regional node positive count, and age as the predictors. We used cross-validation to find the optimal penalty (lambda = 0.333), which helps prevent overfitting by shrinking the coefficients of highly correlated predictors without eliminating them. The model returned a modest R-squared value of 0.1717, indicating these predictors explain 17.17% of the variance in the number of regional lymph nodes examined. To summarize, ridge regression provided a stable and interpretable model, especially given the potential for multicollinearity among some of these predictors.
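The shrinkage behavior described above can be illustrated with the closed-form ridge estimator in base R. This is a minimal sketch on synthetic data, not the glmnet fit used in the analysis: the predictors x1, x2, x3 and the helper ridge_coef are hypothetical, and glmnet scales its objective by the number of observations, so its lambda values are not directly comparable to the ones below.

```r
set.seed(42)
n <- 200

# Hypothetical synthetic predictors; x2 is built to be highly correlated with x1.
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
X  <- scale(cbind(x1, x2, x3))       # center and standardize the predictors
y  <- 2 * x1 + 0.5 * x3 + rnorm(n)
y  <- y - mean(y)                    # center the response so no intercept is needed

# Closed-form ridge solution: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

lambdas <- c(0, 1, 10, 100)
coefs   <- sapply(lambdas, function(l) ridge_coef(X, y, l))
colnames(coefs) <- paste0("lambda=", lambdas)
print(round(coefs, 3))

# The squared length of the coefficient vector shrinks as lambda grows.
coef_norms <- colSums(coefs^2)
print(round(coef_norms, 3))
```

Coefficients move toward zero as lambda increases but are never set exactly to zero, which is why all four predictors remain in the reported ridge fit.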
Multiple Regression: Sid
We used multiple linear regression to model the relationship between our response variable, “Survival Months”, and our predictor variables “Grade” (how serious the tumor is), “Regional Node Positive”, “Regional Node Examined”, and “Tumor Size”. The low R-squared value from the summary (0.0255) suggests the model explains only about 2.6% of the variance in survival months, indicating the need to explore additional relevant variables, consider potential non-linear relationships or interactions, and carefully evaluate the diagnostic plots for violations of model assumptions to improve its predictive power and the reliability of the conclusions regarding the effects of the examined factors. Based on the model plots and testing, we can conclude that the residuals are approximately normal, but the assumption of constant variance is violated based on the Residuals vs. Fitted plot, which shows a high concentration of values towards the end of the plot rather than the beginning. Overall, using MLR to model how long a patient survives based on the grade of the tumor, the size of the tumor, the number of regional lymph nodes that tested positive for cancer, and the total number of lymph nodes examined is not effective, calling for more research and an exploration of other variables and models that can better predict survival.
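The summary statistics behind the "low R-squared" conclusion are tied together by standard formulas, which the sketch below applies to the values printed in the lm() output. It uses base R only; the variable names are illustrative.

```r
# Values copied from the lm() summary.
r2       <- 0.0255            # Multiple R-squared
p        <- 6                 # number of slope coefficients (numerator df of the F-statistic)
resid_df <- 4017              # residual degrees of freedom
n        <- resid_df + p + 1  # observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / resid_df

# Overall F-statistic expressed in terms of R-squared.
f_stat <- (r2 / p) / ((1 - r2) / resid_df)
f_p    <- pf(f_stat, df1 = p, df2 = resid_df, lower.tail = FALSE)

print(round(c(adj_r2 = adj_r2, f_stat = f_stat), 4))
print(f_p)
```

With about four thousand observations, even an R-squared near 0.03 produces a very small overall p-value, so statistical significance here does not imply useful predictive power.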
Data Source:
https://zenodo.org/records/5120960
Dictionary:
Age = How old the patient is, in years
Tumor Size = Size of the tumor in centimeters
Survival Months = Number of months the patient is alive from the date of diagnosis until death or end of follow-up
N Stage = The extent to which the tumor has spread to nearby lymph nodes (N1 = fewest, N3 = most)
Grade = Originally four levels, condensed into high and low for the binary model. Level one means the tumor cells closely resemble normal tissue, and level four means they are very clearly tumor cells under a microscope.
Regional Node Examined = How many lymph nodes were examined
Regional Node Positive = How many of the examined lymph nodes were cancerous
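The condensing of Grade into two categories could be done along the following lines. This is a sketch under assumptions: the grade II to IV labels are copied from the regression output, the grade I label is assumed because the baseline level is not printed, and grade_binary is an illustrative name rather than the original Grade_binary column.

```r
# Grade labels as they appear in the regression output, plus an assumed baseline label.
grade <- c("Well differentiated; Grade I",            # assumed label, not shown in the output
           "Moderately differentiated; Grade II",
           "Poorly differentiated; Grade III",
           "Undifferentiated; anaplastic; Grade IV")

# Grades III and IV become "high"; everything else becomes "low".
is_high      <- grepl("Grade (III|IV)$", grade)
grade_binary <- factor(ifelse(is_high, "high", "low"), levels = c("low", "high"))

print(data.frame(grade, grade_binary))
```

Setting the factor levels explicitly fixes which category glm() treats as the reference, which in turn determines how the signs of the coefficients should be read.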
Resources
KNN: Class Lecture 6
Naive Bayes: Class Lecture 5
Loess: Group Exercise 2
Logistic Regression: https://stats.oarc.ucla.edu/r/dae/logit-regression/
Ridge Regression: https://www.statology.org/ridge-regression-in-r/
Multiple Regression: none used