As breast cancer continues to affect millions of women around the world, more precise treatments and diagnostic tests have become available, such as fine needle aspiration (FNA) tests. This case study focuses on building the most accurate model for classifying tumors in women's breasts as benign or malignant. We will compare six different models to help make this determination: logistic regression, linear discriminant analysis (LDA), support vector machine (SVM), decision tree, boosting, and random forest.
Breast cancer is the second most common cancer among women, after skin cancer. It forms when abnormal cells within the breast tissue grow together to produce a tumor. The tumor can be benign (noncancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Women often must have surgery to remove the tumor even when it is noncancerous, as it can develop into cancer. Several tests can help with a diagnosis, such as biopsies, ultrasounds, and mammograms, which are recommended on a regular schedule as women age.
Women born in the United States have a 1 in 8 chance of developing breast cancer within their lifetime. As of 2022, 3.8 million women in the United States alone have been affected by breast cancer.
This case study examines the accuracy of cancer-cell prediction across several parametric and nonparametric models. The findings matter because early diagnosis of cancer can improve health outcomes and life expectancy for many women around the world.
There are 212 malignant samples and 357 benign samples.
Dependent variable:
- Diagnosis (binary): malignant or benign

Independent variables (ten real-valued features are computed for each cell nucleus):
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
The data also include the mean, standard error, and “worst” for each of the cell features, resulting in 30 features.
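For reference, the feature columns follow a consistent feature-plus-suffix naming pattern; a quick way to list each group (assuming the data frame df loaded in the preprocessing code below):
grep("_mean$",  names(df), value = TRUE)   # the ten "_mean" columns
grep("_se$",    names(df), value = TRUE)   # the ten "_se" columns
grep("_worst$", names(df), value = TRUE)   # the ten "_worst" columns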
The models being tested are logistic regression, linear discriminant analysis (LDA), support vector machine (SVM), decision tree, boosting, and random forest.
All 30 features are used for the nonparametric models (SVM, decision tree, boosting, and random forest), while the parametric models (logistic regression and LDA) use only the variables ending in '_mean'. Both groups of models are evaluated with a 70/30 training/testing split.
The data come from the Wisconsin Breast Cancer Database and contain 569 samples.
We generate logistic regression, linear discriminant analysis, and support vector machine models (alongside the tree-based models) to test which variables matter when determining whether a tumor is benign or malignant. To choose the best model and predictions possible, we review the sensitivity, specificity, and accuracy of each individual model.
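For reference, all three metrics come straight from the confusion matrix. Below is a minimal illustrative helper (not part of the original analysis; class_metrics is a hypothetical name) that treats 'B' as the positive class, matching the caret output later on.
# Illustrative only: accuracy, sensitivity, and specificity from predicted
# and true class labels, with "B" as the positive class.
class_metrics <- function(pred, truth, positive = "B") {
  tp <- sum(pred == positive & truth == positive)  # true positives
  tn <- sum(pred != positive & truth != positive)  # true negatives
  fp <- sum(pred == positive & truth != positive)  # false positives
  fn <- sum(pred != positive & truth == positive)  # false negatives
  c(accuracy    = (tp + tn) / length(truth),
    sensitivity = tp / (tp + fn),   # true-positive rate
    specificity = tn / (tn + fp))   # true-negative rate
}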
Logistic regression has the following assumptions:
- The dependent variable must be categorical and binary, which it is in this case
- The observations must be independent of one another
- Multicollinearity should be investigated and eliminated; independent variables should not be correlated with one another
- There must be a large sample size

LDA has these assumptions:
- Sample measurements are independent of each other
- The distributions are normal
- The covariance of the measurements is identical across the different classes (a sketch probing this assumption follows this list)

SVM model assumptions:
- The margin should be as large as possible
- The data are independent and identically distributed
- Normality of the data is not assumed, however

The methodological assumptions of the decision tree are as follows:
- The dependent variable must be binary
- The observations are assumed to be independent of each other
- The data are also assumed to be mostly free of outliers and missing values
- The limitations of decision trees are high variance and potential overfitting

Random forest assumptions and limitations:
- The dependent variable is binary
- The observations are independent of one another
- The data are mostly free of outliers and missing values
- The main limitation is that the ensemble is difficult to interpret compared to a single tree
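The equal-covariance assumption behind LDA can be probed directly. Below is a minimal sketch using Box's M test, assuming the heplots package is installed and using the train split created in the preprocessing code below; small p-values would suggest the class covariance matrices differ (the test is sensitive to non-normality, so interpret with care).
library(heplots)
# Box's M test for homogeneity of covariance matrices across diagnosis classes
boxM(train[, setdiff(names(train), "diagnosis")], train$diagnosis)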
This dataset is exceptionally clean and requires very little modification. The only data cleaning necessary was to change the outcome variable, diagnosis, to a factor to perform binary classification analysis. The diagnosis variable was changed to recognize the two possible levels: ‘B’ for benign, and ‘M’ for malignant.
library(MASS)
library(corrplot)
## corrplot 0.92 loaded
library(tidyverse)
## ── Attaching packages ─────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
library(SMCRM) # CRM data
library(dplyr) # data wrangling
library(tidyr) # data wrangling
library(ggplot2) # plotting
library(survival) # survival
library(rpart) # DT
library(randomForestSRC)
## Warning: package 'randomForestSRC' was built under R version 4.2.3
##
## randomForestSRC 3.2.1
##
## Type rfsrc.news() to see new features, changes, and bug fixes.
##
##
##
## Attaching package: 'randomForestSRC'
##
## The following object is masked from 'package:purrr':
##
## partial
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:survival':
##
## cluster
##
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
df <- read.csv("CancerData.csv")
#Change to factor
df$diagnosis <- as.factor(df$diagnosis)
#Create correlation plot (must have numerical values)
num = select_if(df, is.numeric)
num <- num[,2:11] # keep the ten "_mean" feature columns (drop the id column)
M = cor(num)
corrplot::corrplot(M, method = "number")
high_corr <- which(M > 0.9 & M != 1, arr.ind = TRUE) # variable pairs correlated above 0.9
high_corr
## row col
## perimeter_mean 3 1
## area_mean 4 1
## radius_mean 1 3
## area_mean 4 3
## radius_mean 1 4
## perimeter_mean 3 4
## concave.points_mean 8 7
## concavity_mean 7 8
The variance inflation factor (VIF) was excessively high for numerous variables in the dataset. As a result, the final parametric models only contained the dependent variable diagnosis and the independent variables radius_mean, texture_mean, smoothness_mean, compactness_mean, concave.points_mean, and symmetry_mean. By using these variables, we were able to ensure that the VIF for each variable was below 5 and would not cause multicollinearity issues within the models. As the machine learning, nonparametric models do not require the removal of multicollinearity issues, they were fitted using the entire dataset.
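The pruning above was done by hand. As an illustration only, here is one way it could be automated (assuming car::vif, a binary response named diagnosis, and the hypothetical helper name prune_vif):
# Illustrative sketch: repeatedly drop the predictor with the largest VIF
# until every remaining VIF falls below the threshold.
prune_vif <- function(data, response = "diagnosis", threshold = 5) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- glm(reformulate(predictors, response), data = data, family = binomial)
    v <- car::vif(fit)
    if (max(v) < threshold || length(predictors) <= 2) break
    predictors <- setdiff(predictors, names(which.max(v)))  # drop the worst offender
  }
  predictors  # the surviving low-VIF predictor set
}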
# Keep diagnosis and nine of the "_mean" columns (drops the id column), then
# remove the highly collinear means identified above
df <- df[,2:11]
df = dplyr::select(df, -c(area_mean, concavity_mean, perimeter_mean))
# Create training and testing data
set.seed(1)
idx.train <- sample(1:nrow(df), size = 0.7 * nrow(df))
train <- df[idx.train,]
test <- df[-idx.train,]
table(train$diagnosis)
##
## B M
## 254 144
table(test$diagnosis)
##
## B M
## 103 68
# Check VIF for variables to include:
library(car) #for VIF
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
model <- glm(diagnosis~., data = train, family = binomial)
vif(model)
## radius_mean texture_mean smoothness_mean compactness_mean
## 2.349201 1.927692 2.497485 2.706049
## concave.points_mean symmetry_mean
## 3.897135 1.911515
The only limitation in the dataset was the high collinearity between numerous variables. This issue was dealt with as described in the “Data preprocessing” section above.
#Logistic Regression model
logreg <- glm(diagnosis ~ ., data = train, family = "binomial")
summary(logreg)
##
## Call:
## glm(formula = diagnosis ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.26698 -0.14933 -0.03961 0.02398 2.88528
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -31.90431 6.61478 -4.823 1.41e-06 ***
## radius_mean 0.74338 0.21535 3.452 0.000556 ***
## texture_mean 0.39548 0.07529 5.253 1.50e-07 ***
## smoothness_mean 43.66871 29.20245 1.495 0.134816
## compactness_mean -12.60970 10.49676 -1.201 0.229637
## concave.points_mean 89.90824 27.44312 3.276 0.001052 **
## symmetry_mean 31.34013 13.90255 2.254 0.024179 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 520.94 on 397 degrees of freedom
## Residual deviance: 102.80 on 391 degrees of freedom
## AIC: 116.8
##
## Number of Fisher Scoring iterations: 8
predprob = predict.glm(logreg, newdata = test, type = "response")
predclass_log = ifelse(predprob >= 0.5,"M","B")
caret::confusionMatrix(as.factor(predclass_log), as.factor(test$diagnosis))
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 97 6
## M 6 62
##
## Accuracy : 0.9298
## 95% CI : (0.8806, 0.9632)
## No Information Rate : 0.6023
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8535
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9417
## Specificity : 0.9118
## Pos Pred Value : 0.9417
## Neg Pred Value : 0.9118
## Prevalence : 0.6023
## Detection Rate : 0.5673
## Detection Prevalence : 0.6023
## Balanced Accuracy : 0.9268
##
## 'Positive' Class : B
##
The logistic regression model has an accuracy of 92.98%, with a sensitivity of 94.17% and a specificity of 91.18%.
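Because the 0.5 cutoff is one choice among many, a threshold-free summary can be a useful complement. A hedged sketch with the pROC package (not part of the original analysis):
library(pROC)
# ROC curve and AUC for the logistic model's predicted probabilities
roc_obj <- roc(response = test$diagnosis, predictor = predprob, levels = c("B", "M"))
auc(roc_obj)   # area under the ROC curve; closer to 1 is better
plot(roc_obj)  # sensitivity vs. specificity across all thresholds
# coords() can suggest a cutoff that balances sensitivity and specificity
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))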
#LDA model
lda = lda(diagnosis ~ ., data = train)
lda
## Call:
## lda(diagnosis ~ ., data = train)
##
## Prior probabilities of groups:
## B M
## 0.638191 0.361809
##
## Group means:
## radius_mean texture_mean smoothness_mean compactness_mean concave.points_mean
## B 12.13154 17.83161 0.09166201 0.07898228 0.02508475
## M 17.64986 21.65347 0.10291486 0.14485500 0.09007229
## symmetry_mean
## B 0.1730815
## M 0.1926306
##
## Coefficients of linear discriminants:
## LD1
## radius_mean 0.16955009
## texture_mean 0.09408202
## smoothness_mean 11.23571469
## compactness_mean -4.03515325
## concave.points_mean 26.32177015
## symmetry_mean 6.29311302
predclass_lda = predict(lda, newdata = test)
caret::confusionMatrix(as.factor(predclass_lda$class), as.factor(test$diagnosis))
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 102 16
## M 1 52
##
## Accuracy : 0.9006
## 95% CI : (0.8456, 0.941)
## No Information Rate : 0.6023
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7844
##
## Mcnemar's Test P-Value : 0.000685
##
## Sensitivity : 0.9903
## Specificity : 0.7647
## Pos Pred Value : 0.8644
## Neg Pred Value : 0.9811
## Prevalence : 0.6023
## Detection Rate : 0.5965
## Detection Prevalence : 0.6901
## Balanced Accuracy : 0.8775
##
## 'Positive' Class : B
##
Compared to the logistic regression model, the linear discriminant analysis model has a lower accuracy (90.06%) and a markedly lower specificity (76.47%). It does, however, have a higher sensitivity of 99.03%.
We reload the dataset so that all 30 features are available to the nonparametric models.
#Reload the dataset
df1 <- read.csv("CancerData.csv")
df1$diagnosis <- as.factor(df1$diagnosis)
#Remove ID
df1 <- df1[,-1]
set.seed(1)
idx.train <- sample(1:nrow(df1), size = 0.7 * nrow(df1))
train1 <- df1[idx.train,]
test1 <- df1[-idx.train,]
table(train1$diagnosis)
##
## B M
## 254 144
table(test1$diagnosis)
##
## B M
## 103 68
#SVM model:
library(e1071) #for SVM model
##
## Attaching package: 'e1071'
## The following objects are masked from 'package:randomForestSRC':
##
## impute, tune
form1 = diagnosis ~ .
# Tune gamma and cost via cross-validation; this may take a few minutes
tuned = tune.svm(form1, data = train1, gamma = seq(0.01, 0.1, by = 0.01),
                 cost = seq(0.1, 1, by = 0.1))
# Fit the SVM on the training data using the tuned hyperparameters
mysvm = svm(formula = form1, data = train1, gamma = tuned$best.parameters$gamma,
            cost = tuned$best.parameters$cost)
summary(mysvm)
##
## Call:
## svm(formula = form1, data = train1, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.9
##
## Number of Support Vectors: 169
##
## ( 85 84 )
##
##
## Number of Classes: 2
##
## Levels:
## B M
svmpredict = predict(mysvm, test1, type = 'response') # classify the test data; the confusion matrix follows
caret::confusionMatrix(as.factor(svmpredict), as.factor(test1$diagnosis))
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 98 3
## M 5 65
##
## Accuracy : 0.9532
## 95% CI : (0.9099, 0.9796)
## No Information Rate : 0.6023
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9028
##
## Mcnemar's Test P-Value : 0.7237
##
## Sensitivity : 0.9515
## Specificity : 0.9559
## Pos Pred Value : 0.9703
## Neg Pred Value : 0.9286
## Prevalence : 0.6023
## Detection Rate : 0.5731
## Detection Prevalence : 0.5906
## Balanced Accuracy : 0.9537
##
## 'Positive' Class : B
##
Compared with the LDA and logistic regression models, the SVM model achieves a higher accuracy of 95.32% and a higher specificity of 95.59%. Its sensitivity of 95.15% is, however, lower than the LDA model's.
#Decision Tree
dt.model <- rpart(diagnosis ~ ., data = train1) # simple DT model
rattle::fancyRpartPlot(dt.model, sub = "") # visualize the DT (requires the rattle package)
From the decision tree, we see that concave.points_mean is the most influential predictor, since the tree's first split is on this variable.
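This can be confirmed directly from the fitted rpart object:
print(dt.model)                # the root split is on concave.points_mean
dt.model$variable.importance   # rpart's variable-importance scores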
#Random forest
set.seed(123)
forest1 <- randomForest(diagnosis ~ ., data = train1)
predictions <- predict(forest1, newdata = test1)
caret::confusionMatrix(predictions, test1$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 102 7
## M 1 61
##
## Accuracy : 0.9532
## 95% CI : (0.9099, 0.9796)
## No Information Rate : 0.6023
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9009
##
## Mcnemar's Test P-Value : 0.0771
##
## Sensitivity : 0.9903
## Specificity : 0.8971
## Pos Pred Value : 0.9358
## Neg Pred Value : 0.9839
## Prevalence : 0.6023
## Detection Rate : 0.5965
## Detection Prevalence : 0.6374
## Balanced Accuracy : 0.9437
##
## 'Positive' Class : B
##
Compared to the SVM model, the random forest achieves the same accuracy of 95.32%. However, its sensitivity is much better at 99.03%, while its specificity is lower at 89.71%.
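As a complementary check (not shown in the original run), the random forest's out-of-bag (OOB) error provides an internal performance estimate that never touches the test set:
print(forest1)  # OOB error rate and the per-class confusion matrix
plot(forest1)   # OOB error as a function of the number of trees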
We then try boosting to see whether it can improve the model's accuracy:
#Boosting set up:
library(gbm)
## Warning: package 'gbm' was built under R version 4.2.3
## Loaded gbm 2.1.8.1
set.seed(1)
train1$diagnosis <- ifelse(train1$diagnosis == "B", 1, 0) # Needed for boosting function
test1$diagnosis <- ifelse(test1$diagnosis == "B", 1, 0) # Needed for boosting function.
#Perform boosting (on the single strongest predictor, concave.points_mean):
boost <- gbm(diagnosis ~ concave.points_mean, data = train1, distribution = "bernoulli",
             n.trees = 1000, interaction.depth = 1, shrinkage = 0.001, verbose = F)
boost.probs <- predict(boost, newdata = test1, n.trees = 1000, type = "response")
boost.pred <- ifelse(boost.probs > 0.5, 1, 0)
caret::confusionMatrix(as.factor(boost.pred), as.factor(test1$diagnosis))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 58 10
## 1 10 93
##
## Accuracy : 0.883
## 95% CI : (0.8252, 0.9271)
## No Information Rate : 0.6023
## P-Value [Acc > NIR] : 4.103e-16
##
## Kappa : 0.7559
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8529
## Specificity : 0.9029
## Pos Pred Value : 0.8529
## Neg Pred Value : 0.9029
## Prevalence : 0.3977
## Detection Rate : 0.3392
## Detection Prevalence : 0.3977
## Balanced Accuracy : 0.8779
##
## 'Positive' Class : 0
##
Since the random forest predictions show no strong bias, boosting does little to improve prediction accuracy. Its accuracy and sensitivity are lower than the random forest's, at 88.3% and 85.29% respectively, and its specificity is only slightly higher at 90.29%. (Note that after the 0/1 recoding the boosting confusion matrix treats the malignant class, coded 0, as the positive class, so its sensitivity and specificity are not directly comparable to those of the other models.)
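Note also that the boosting model above uses only concave.points_mean as a predictor. A hypothetical variant (results not shown) would boost on all 30 features and let cross-validation pick the number of trees:
# Hypothetical sketch: gbm on all predictors with CV-selected iterations
boost_all <- gbm(diagnosis ~ ., data = train1, distribution = "bernoulli",
                 n.trees = 3000, interaction.depth = 2, shrinkage = 0.01,
                 cv.folds = 5, verbose = FALSE)
best_iter <- gbm.perf(boost_all, method = "cv")  # CV-chosen iteration count
boost.probs_all <- predict(boost_all, newdata = test1, n.trees = best_iter,
                           type = "response")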
models <- c("Log Reg", "LDA", "SVM", "RF", "Boosting")
Accuracy <- c(0.9298, 0.9006, 0.9532, 0.9532, 0.883)
Sensitivity <- c(0.9417, 0.9903, 0.9515, 0.9903, 0.8529)
Specificity <- c(0.9118, 0.7647, 0.9559, 0.8971, 0.9029)
models.df <- data.frame(models, Accuracy, Sensitivity, Specificity)
df_long <- tidyr::pivot_longer(models.df, -1, names_to = "variable", values_to = "value")
ggplot(df_long, aes(x = variable, y = value, fill = models)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("#9933FF", "#33FFFF", "red", "darkblue", "green")) +
xlab("Performance Metrics") +
ylab("Value") +
ggtitle("Comparison of Five Models") +
theme(plot.title = element_text(hjust = 0.5))
As we can see from the plot, the SVM model is the best overall, with high accuracy, sensitivity, and specificity. Random forest and LDA perform slightly better than SVM only on sensitivity; the LDA's accuracy and specificity are both lower, and the random forest matches the SVM's accuracy while trailing on specificity.
# Random forest variable importance
varImpPlot(forest1, type = 2)
# Logistic regression beta coefficient
summary(logreg)
##
## Call:
## glm(formula = diagnosis ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.26698 -0.14933 -0.03961 0.02398 2.88528
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -31.90431 6.61478 -4.823 1.41e-06 ***
## radius_mean 0.74338 0.21535 3.452 0.000556 ***
## texture_mean 0.39548 0.07529 5.253 1.50e-07 ***
## smoothness_mean 43.66871 29.20245 1.495 0.134816
## compactness_mean -12.60970 10.49676 -1.201 0.229637
## concave.points_mean 89.90824 27.44312 3.276 0.001052 **
## symmetry_mean 31.34013 13.90255 2.254 0.024179 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 520.94 on 397 degrees of freedom
## Residual deviance: 102.80 on 391 degrees of freedom
## AIC: 116.8
##
## Number of Fisher Scoring iterations: 8
When we evaluate importance based on the random forest, concave.points_mean has the highest predictive power. This agrees with the regression results: concave.points_mean has the largest beta coefficient (89.91) among the variables with p-values below 0.05.
The parametric models performed very well with this dataset and were easily interpretable as the final models only contained seven variables including the dependent variable. The logistic regression model was more accurate overall (93%) than the LDA (90%), but the LDA had an impressive sensitivity of 99.03%. If the target of the parametric model is overall accuracy, then the logistic regression would be the best choice. If the target is to identify true positives, then the LDA model would be the best choice.
The best performing models that we built were the SVM and random forest models. The accuracy, sensitivity, and specificity of the SVM model were all just over 95%. The random forest model had an accuracy of 95%, sensitivity of 99%, and specificity of 90%. These models were more accurate than the parametric models, but much more difficult to interpret as they do not provide beta coefficients to help describe the relationship of the independent variables and the dependent variable.
In conclusion, this analysis shows that a small number of features carry most of the predictive value for diagnosing cancer. These observations were confirmed by the variable importance plot as well as the beta coefficients, which show that concave.points_mean, perimeter_worst, and concave.points_worst are the most influential predictors.
We found that the nonparametric models predict better than the parametric ones, with the SVM performing best, followed by the random forest. The dataset is clean and fairly balanced between benign and malignant samples (357 vs. 212), which helps explain why boosting does little to improve accuracy.
#Reload data
df <- read.csv("CancerData.csv")
#Break up columns into groups according to suffix (_mean, _se, and _worst)
data_mean <- dplyr::select(df, c("radius_mean", "texture_mean","perimeter_mean", "area_mean", "smoothness_mean", "compactness_mean", "concavity_mean", "concave.points_mean", "symmetry_mean", "fractal_dimension_mean" ))
data_se <- dplyr::select(df, c("radius_se", "texture_se","perimeter_se", "area_se", "smoothness_se", "compactness_se", "concavity_se", "concave.points_se", "symmetry_se", "fractal_dimension_se" ))
data_worst <- dplyr::select(df, c("radius_worst", "texture_worst","perimeter_worst", "area_worst", "smoothness_worst", "compactness_worst", "concavity_worst", "concave.points_worst", "symmetry_worst", "fractal_dimension_worst" ))
#Plot histograms of the "_mean" variables
par(mfrow=c(3,3))
var_names <- names(data_mean)
for (var in var_names) {
var_data <- data_mean[[var]]
hist(var_data, main = paste("Distribution of", var), xlab = var)
}
As we can see, most of the '_mean' variables are approximately normally distributed, except for area, concavity, and concave points.
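A quick numerical complement to the visual impression (illustrative only): Shapiro-Wilk p-values below 0.05 suggest departure from normality.
# Shapiro-Wilk normality test for each "_mean" feature (n = 569 is within
# the test's supported sample-size range)
round(sapply(data_mean, function(x) shapiro.test(x)$p.value), 4)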
#Plot histograms of the "_se" variables
par(mfrow=c(3,3))
var_names <- names(data_se)
for (i in var_names) {
var_data <- data_se[[i]]
hist(var_data, main = paste("Distribution of", i), xlab = i)
}
#Plot histograms of the "_worst" variables
par(mfrow=c(3,3))
var_names <- names(data_worst)
for (i in var_names) {
var_data <- data_worst[[i]]
hist(var_data, main = paste("Distribution of", i), xlab = i)
}
The variables in '_se' and '_worst' are all right-skewed.
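To quantify the skew, one could compute the sample skewness with e1071 (loaded earlier for the SVM); clearly positive values confirm the right skew.
# Sample skewness for each "_se" and "_worst" feature
round(sapply(data_se, e1071::skewness), 2)
round(sapply(data_worst, e1071::skewness), 2)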