Based on these physical properties of the tumors, can we build an ML model that accurately classifies them as malignant or benign?
# Load in data
wisconsin <- read.csv("data.csv")
The dataset we used was called “Breast Cancer Wisconsin (Diagnostic) taken from the UCI ML Repository. It contains data from various people from a hospital in Wisconsin from 1995. There are various physical features used as variables, like texture, area, radius, and more detailing the breasts. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. We will use these features to predict whether a mass is benign or malignant.
This dataset has been analyzed before by researchers and Kaggle users. Researchers focused on methods such as Random Forest, Perceptrons, or Support Vector Machines. This was usually done with TensorFlow in Python to create their models, usually around a 96-98% accuracy. We will build upon this by using an ensemble learning method where we combine the results from multiple models and the result becomes the most common prediction from the models. This produces more accurate results, since failed cases from some models are negated by the correct predictions of other models.
# Inspect metadata
# Verify that the data was read in correctly
# We know how many columns/rows to expect
is.data.frame(wisconsin)
## [1] TRUE
ncol(wisconsin)
## [1] 33
nrow(wisconsin)
## [1] 569
head(wisconsin)
## id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1 842302 M 17.99 10.38 122.80 1001.0
## 2 842517 M 20.57 17.77 132.90 1326.0
## 3 84300903 M 19.69 21.25 130.00 1203.0
## 4 84348301 M 11.42 20.38 77.58 386.1
## 5 84358402 M 20.29 14.34 135.10 1297.0
## 6 843786 M 12.45 15.70 82.57 477.1
## smoothness_mean compactness_mean concavity_mean concave.points_mean
## 1 0.11840 0.27760 0.3001 0.14710
## 2 0.08474 0.07864 0.0869 0.07017
## 3 0.10960 0.15990 0.1974 0.12790
## 4 0.14250 0.28390 0.2414 0.10520
## 5 0.10030 0.13280 0.1980 0.10430
## 6 0.12780 0.17000 0.1578 0.08089
## symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1 0.2419 0.07871 1.0950 0.9053 8.589
## 2 0.1812 0.05667 0.5435 0.7339 3.398
## 3 0.2069 0.05999 0.7456 0.7869 4.585
## 4 0.2597 0.09744 0.4956 1.1560 3.445
## 5 0.1809 0.05883 0.7572 0.7813 5.438
## 6 0.2087 0.07613 0.3345 0.8902 2.217
## area_se smoothness_se compactness_se concavity_se concave.points_se
## 1 153.40 0.006399 0.04904 0.05373 0.01587
## 2 74.08 0.005225 0.01308 0.01860 0.01340
## 3 94.03 0.006150 0.04006 0.03832 0.02058
## 4 27.23 0.009110 0.07458 0.05661 0.01867
## 5 94.44 0.011490 0.02461 0.05688 0.01885
## 6 27.19 0.007510 0.03345 0.03672 0.01137
## symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1 0.03003 0.006193 25.38 17.33 184.60
## 2 0.01389 0.003532 24.99 23.41 158.80
## 3 0.02250 0.004571 23.57 25.53 152.50
## 4 0.05963 0.009208 14.91 26.50 98.87
## 5 0.01756 0.005115 22.54 16.67 152.20
## 6 0.02165 0.005082 15.47 23.75 103.40
## area_worst smoothness_worst compactness_worst concavity_worst
## 1 2019.0 0.1622 0.6656 0.7119
## 2 1956.0 0.1238 0.1866 0.2416
## 3 1709.0 0.1444 0.4245 0.4504
## 4 567.7 0.2098 0.8663 0.6869
## 5 1575.0 0.1374 0.2050 0.4000
## 6 741.6 0.1791 0.5249 0.5355
## concave.points_worst symmetry_worst fractal_dimension_worst X
## 1 0.2654 0.4601 0.11890 NA
## 2 0.1860 0.2750 0.08902 NA
## 3 0.2430 0.3613 0.08758 NA
## 4 0.2575 0.6638 0.17300 NA
## 5 0.1625 0.2364 0.07678 NA
## 6 0.1741 0.3985 0.12440 NA
table(wisconsin$diagnosis)
##
## B M
## 357 212
wisconsin_numeric <- wisconsin %>% select_if(is.numeric)
# Histogram of all the columns
for (i in 1:ncol(wisconsin_numeric)) {
if (i %% 4 == 1)
par(mfrow=c(2,2))
hist(wisconsin_numeric[,i], main = colnames(wisconsin_numeric)[i], xlab = "")
}
# Convert "B" and "M" to 0 and 1 respectively
wisconsin <- wisconsin %>%
add_column(diagnosis_num = if_else(.$diagnosis == 'M', 1, 0))
temp <- wisconsin$diagnosis_num
# Subtract out columns we don't want to train with
wisconsin <- subset(wisconsin, select = -c(X, diagnosis, id, diagnosis_num))
# Normalize data
wisconsin <- wisconsin %>%
mutate_all(scale)
# Add back in the prediction column
wisconsin$diagnosis_num <- temp
# Split
set.seed(3001)
wisconsin.train <- sample_frac(wisconsin, 0.7)
sample_id <- as.numeric(rownames(wisconsin.train))
wisconsin.test <- wisconsin[-sample_id,]
# Principal Component Analysis
par(mfrow=c(1,1))
pca <- prcomp(wisconsin, scale. = T)
autoplot(pca, data = wisconsin, colour = 'diagnosis_num')
std_dev <- pca$sdev
pr_var <- std_dev^2
prop_varex <- pr_var/sum(pr_var)
plot(prop_varex, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
type = "b")
# Correlation heatmap
cormat <- round(cor(wisconsin), 2)
head(cormat)
## radius_mean texture_mean perimeter_mean area_mean
## radius_mean 1.00 0.32 1.00 0.99
## texture_mean 0.32 1.00 0.33 0.32
## perimeter_mean 1.00 0.33 1.00 0.99
## area_mean 0.99 0.32 0.99 1.00
## smoothness_mean 0.17 -0.02 0.21 0.18
## compactness_mean 0.51 0.24 0.56 0.50
## smoothness_mean compactness_mean concavity_mean
## radius_mean 0.17 0.51 0.68
## texture_mean -0.02 0.24 0.30
## perimeter_mean 0.21 0.56 0.72
## area_mean 0.18 0.50 0.69
## smoothness_mean 1.00 0.66 0.52
## compactness_mean 0.66 1.00 0.88
## concave.points_mean symmetry_mean fractal_dimension_mean
## radius_mean 0.82 0.15 -0.31
## texture_mean 0.29 0.07 -0.08
## perimeter_mean 0.85 0.18 -0.26
## area_mean 0.82 0.15 -0.28
## smoothness_mean 0.55 0.56 0.58
## compactness_mean 0.83 0.60 0.57
## radius_se texture_se perimeter_se area_se smoothness_se
## radius_mean 0.68 -0.10 0.67 0.74 -0.22
## texture_mean 0.28 0.39 0.28 0.26 0.01
## perimeter_mean 0.69 -0.09 0.69 0.74 -0.20
## area_mean 0.73 -0.07 0.73 0.80 -0.17
## smoothness_mean 0.30 0.07 0.30 0.25 0.33
## compactness_mean 0.50 0.05 0.55 0.46 0.14
## compactness_se concavity_se concave.points_se symmetry_se
## radius_mean 0.21 0.19 0.38 -0.10
## texture_mean 0.19 0.14 0.16 0.01
## perimeter_mean 0.25 0.23 0.41 -0.08
## area_mean 0.21 0.21 0.37 -0.07
## smoothness_mean 0.32 0.25 0.38 0.20
## compactness_mean 0.74 0.57 0.64 0.23
## fractal_dimension_se radius_worst texture_worst
## radius_mean -0.04 0.97 0.30
## texture_mean 0.05 0.35 0.91
## perimeter_mean -0.01 0.97 0.30
## area_mean -0.02 0.96 0.29
## smoothness_mean 0.28 0.21 0.04
## compactness_mean 0.51 0.54 0.25
## perimeter_worst area_worst smoothness_worst compactness_worst
## radius_mean 0.97 0.94 0.12 0.41
## texture_mean 0.36 0.34 0.08 0.28
## perimeter_mean 0.97 0.94 0.15 0.46
## area_mean 0.96 0.96 0.12 0.39
## smoothness_mean 0.24 0.21 0.81 0.47
## compactness_mean 0.59 0.51 0.57 0.87
## concavity_worst concave.points_worst symmetry_worst
## radius_mean 0.53 0.74 0.16
## texture_mean 0.30 0.30 0.11
## perimeter_mean 0.56 0.77 0.19
## area_mean 0.51 0.72 0.14
## smoothness_mean 0.43 0.50 0.39
## compactness_mean 0.82 0.82 0.51
## fractal_dimension_worst diagnosis_num
## radius_mean 0.01 0.73
## texture_mean 0.12 0.42
## perimeter_mean 0.05 0.74
## area_mean 0.00 0.71
## smoothness_mean 0.50 0.36
## compactness_mean 0.69 0.60
melted_cormat <- melt(cormat)
# Plot correlations between features
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile()
# Feature importance
diagnosis_corr = abs(cor(wisconsin)["diagnosis_num",])
sort(diagnosis_corr)
## symmetry_se texture_se fractal_dimension_mean
## 0.006521756 0.008303333 0.012837603
## smoothness_se fractal_dimension_se concavity_se
## 0.067016011 0.077972417 0.253729766
## compactness_se fractal_dimension_worst symmetry_mean
## 0.292999244 0.323872189 0.330498554
## smoothness_mean concave.points_se texture_mean
## 0.358559965 0.408042333 0.415185300
## symmetry_worst smoothness_worst texture_worst
## 0.416294311 0.421464861 0.456902821
## area_se perimeter_se radius_se
## 0.548235940 0.556140703 0.567133821
## compactness_worst compactness_mean concavity_worst
## 0.590998238 0.596533678 0.659610210
## concavity_mean area_mean radius_mean
## 0.696359707 0.708983837 0.730028511
## area_worst perimeter_mean radius_worst
## 0.733825035 0.742635530 0.776453779
## concave.points_mean perimeter_worst concave.points_worst
## 0.776613840 0.782914137 0.793566017
## diagnosis_num
## 1.000000000
# The most important features are concave.points_worst, perimeter_worst, and radius_worst
importance(randomForest(as.factor(diagnosis_num)~., data=wisconsin, ntree=5000, importance=TRUE))
## 0 1 MeanDecreaseAccuracy
## radius_mean 27.369641 17.661502 30.609674
## texture_mean 28.966023 31.318041 39.650957
## perimeter_mean 25.343734 18.442057 30.163747
## area_mean 30.349797 18.778953 33.741489
## smoothness_mean 8.525414 21.937620 23.387121
## compactness_mean 15.398597 12.036553 19.784787
## concavity_mean 24.625565 26.797775 36.316581
## concave.points_mean 32.181939 35.720130 46.403778
## symmetry_mean 4.720898 12.445773 12.533167
## fractal_dimension_mean 10.532355 4.556438 12.049652
## radius_se 25.967235 18.622309 32.223952
## texture_se 10.059447 6.790069 12.562791
## perimeter_se 23.915344 19.473354 31.308008
## area_se 36.653855 25.278367 44.031427
## smoothness_se 9.572910 2.719263 9.658280
## compactness_se 16.511069 6.460409 17.806649
## concavity_se 13.887479 13.236017 18.573752
## concave.points_se 13.170662 7.619243 15.413761
## symmetry_se 8.922516 1.752871 8.788754
## fractal_dimension_se 12.328152 3.610863 12.417300
## radius_worst 45.921254 34.826698 53.557306
## texture_worst 33.384116 38.402865 45.526593
## perimeter_worst 43.921727 39.552028 55.545632
## area_worst 47.443674 38.814262 57.305002
## smoothness_worst 24.295446 28.315983 36.061088
## compactness_worst 18.328836 17.307315 25.274069
## concavity_worst 24.099342 34.428700 42.214567
## concave.points_worst 45.256997 39.372371 58.722016
## symmetry_worst 16.129591 19.067016 23.837619
## fractal_dimension_worst 13.486838 11.174198 17.398093
## MeanDecreaseGini
## radius_mean 10.8473666
## texture_mean 3.9073005
## perimeter_mean 12.6406155
## area_mean 13.0869988
## smoothness_mean 1.6772331
## compactness_mean 2.9893338
## concavity_mean 12.7929722
## concave.points_mean 28.3110588
## symmetry_mean 1.0130205
## fractal_dimension_mean 0.9660721
## radius_se 4.0087138
## texture_se 1.1548624
## perimeter_se 3.6808610
## area_se 9.1752615
## smoothness_se 1.1224452
## compactness_se 1.1942600
## concavity_se 1.6496709
## concave.points_se 1.3548956
## symmetry_se 1.0479490
## fractal_dimension_se 1.2783071
## radius_worst 29.2175997
## texture_worst 5.1789805
## perimeter_worst 34.4639206
## area_worst 29.4162206
## smoothness_worst 3.6711528
## compactness_worst 4.1943871
## concavity_worst 8.9407574
## concave.points_worst 32.1377454
## symmetry_worst 2.7454015
## fractal_dimension_worst 1.7848085
# Calculate the prevalence
prevalence <- sum(wisconsin$diagnosis_num == 1) / length(wisconsin$diagnosis_num)
prevalence
## [1] 0.3725835
set.seed(3001)
rf_model = randomForest(as.factor(diagnosis_num)~., data=wisconsin.train, ntree=5000)
rf_model
##
## Call:
## randomForest(formula = as.factor(diagnosis_num) ~ ., data = wisconsin.train, ntree = 5000)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 3.77%
## Confusion matrix:
## 0 1 class.error
## 0 247 6 0.02371542
## 1 9 136 0.06206897
# Testing the model
rf_predictions_train<-predict(rf_model, type = "response")
train_acc <- nrow(filter(wisconsin.train, diagnosis_num == rf_predictions_train)) / nrow(wisconsin.train)
train_acc
## [1] 0.9623116
rf_predictions_test<-predict(rf_model, wisconsin.test, type = "response")
test_acc <- nrow(filter(wisconsin.test, diagnosis_num == rf_predictions_test)) / nrow(wisconsin.test)
test_acc
## [1] 0.9766082
# Train logistic model against 6 most important features
set.seed(3001)
logistic_model = glm(data = wisconsin.train,
as.factor(diagnosis_num) ~ concave.points_worst + concave.points_mean +
perimeter_worst + radius_worst + perimeter_mean + area_worst,
family = binomial())
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model)
##
## Call:
## glm(formula = as.factor(diagnosis_num) ~ concave.points_worst +
## concave.points_mean + perimeter_worst + radius_worst + perimeter_mean +
## area_worst, family = binomial(), data = wisconsin.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6186 -0.1439 -0.0578 0.0004 3.6731
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1300 1.0612 2.007 0.04473 *
## concave.points_worst 1.4139 0.9001 1.571 0.11622
## concave.points_mean 2.5970 1.0587 2.453 0.01417 *
## perimeter_worst -0.2396 3.8278 -0.063 0.95010
## radius_worst -4.8223 6.5717 -0.734 0.46307
## perimeter_mean -6.0025 2.1566 -2.783 0.00538 **
## area_worst 19.3468 8.2226 2.353 0.01863 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 522.068 on 397 degrees of freedom
## Residual deviance: 88.094 on 391 degrees of freedom
## AIC: 102.09
##
## Number of Fisher Scoring iterations: 9
# Test logistic model
logistic_predictions_train<-round(predict(logistic_model, type = "response"))
train_acc <- nrow(filter(wisconsin.train, diagnosis_num == logistic_predictions_train)) / nrow(wisconsin.train)
train_acc
## [1] 0.9623116
logistic_predictions_test<-round(predict(logistic_model, wisconsin.test, type = "response"))
test_acc <- nrow(filter(wisconsin.test, diagnosis_num == logistic_predictions_test)) / nrow(wisconsin.test)
test_acc
## [1] 0.9766082
set.seed(3001)
svm_model = svm(formula = as.factor(diagnosis_num) ~ .,
data = wisconsin.train,
type = 'C-classification',
kernel = 'linear')
summary(svm_model)
##
## Call:
## svm(formula = as.factor(diagnosis_num) ~ ., data = wisconsin.train,
## type = "C-classification", kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 33
##
## ( 15 18 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_predictions_train<-predict(svm_model, type = "response")
train_acc <- nrow(filter(wisconsin.train, diagnosis_num == svm_predictions_train)) / nrow(wisconsin.train)
train_acc
## [1] 0.9849246
svm_predictions_test<-predict(svm_model, wisconsin.test, type = "response")
test_acc <- nrow(filter(wisconsin.test, diagnosis_num == svm_predictions_test)) / nrow(wisconsin.test)
test_acc
## [1] 0.9824561
# For Tensorflow, we must subtract out the prediction column
X_train <- wisconsin.train %>%
select(-diagnosis_num) %>%
scale()
y_train <- to_categorical(wisconsin.train$diagnosis_num)
## Loaded Tensorflow version 2.7.0
X_test <- wisconsin.test %>%
select(-diagnosis_num) %>%
scale()
y_test <- to_categorical(wisconsin.test$diagnosis_num)
# Model definition
# 2 hidden layers with dropout to avoid overfitting
model <- keras_model_sequential()
model %>%
layer_dense(units = 64, activation = 'relu', input_shape = ncol(wisconsin.train) - 1) %>%
layer_dropout(rate = 0.5) %>%
layer_dense(units = 32, activation = 'relu') %>%
layer_dropout(rate = 0.5) %>%
layer_dense(units = 2, activation = 'sigmoid')
# Choose loss and optimizer
history <- model %>% compile(
loss = 'binary_crossentropy',
optimizer = 'adam',
metrics = c('accuracy')
)
set.seed(3001)
# Train the model with 100 epochs
model %>% fit(
X_train, y_train,
epochs = 100,
batch_size = 30,
validation_split = 0.4
)
summary(model)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense_2 (Dense) (None, 64) 1984
##
## dropout_1 (Dropout) (None, 64) 0
##
## dense_1 (Dense) (None, 32) 2080
##
## dropout (Dropout) (None, 32) 0
##
## dense (Dense) (None, 2) 66
##
## ================================================================================
## Total params: 4,130
## Trainable params: 4,130
## Non-trainable params: 0
## ________________________________________________________________________________
# Calculating accuracy
train_predictions <- model %>% predict(X_train) %>% `>`(0.5) %>% k_cast("int32")
test_predictions <- model %>% predict(X_test) %>% `>`(0.5) %>% k_cast("int32")
train_acc <- sum(as.array(train_predictions[,2]) == y_train[,2]) / nrow(train_predictions)
train_acc
## [1] 0.9824121
test_acc <- sum(as.array(test_predictions[,2]) == y_test[,2]) / nrow(test_predictions)
test_acc
## [1] 0.9649123
# Create a matrix with predictions from all models
model_predictions_train <- data.frame(matrix(ncol = 0, nrow = length(logistic_predictions_train)))
model_predictions_train$rf <- as.numeric(rf_predictions_train) - 1
model_predictions_train$logistic <- logistic_predictions_train
model_predictions_train$svm <- svm_predictions_train
model_predictions_train$ann <- as.array(train_predictions[,2])
model_predictions_train$target <- wisconsin.train$diagnosis_num
head(model_predictions_train)
## rf logistic svm ann target
## 1 1 0 1 1 1
## 2 0 1 1 1 1
## 3 0 0 0 0 0
## 4 1 1 0 0 0
## 5 0 0 0 0 0
## 6 1 1 1 1 1
model_predictions_test <- data.frame(matrix(ncol = 0, nrow = length(logistic_predictions_test)))
model_predictions_test$rf <- as.numeric(rf_predictions_test) - 1
model_predictions_test$logistic <- logistic_predictions_test
model_predictions_test$svm <- svm_predictions_test
model_predictions_test$ann <- as.array(test_predictions[,2])
model_predictions_test$target <- wisconsin.test$diagnosis_num
head(model_predictions_test)
## rf logistic svm ann target
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 1 1 1 1 1
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
# Pass in outputs into another random forest
set.seed(3001)
ensemble = randomForest(as.factor(target)~., data=model_predictions_train, ntree=5000, mtry=2, nodesize=3)
train_preds<-predict(ensemble, type = "response")
train_acc <- nrow(filter(model_predictions_train, target == train_preds)) / nrow(model_predictions_train)
train_acc
## [1] 0.9748744
test_preds<-predict(ensemble, model_predictions_test, type = "response")
test_acc <- nrow(filter(model_predictions_test, target == test_preds)) / nrow(model_predictions_test)
test_acc
## [1] 0.9824561
confusionMatrix(as.factor(test_preds), as.factor(model_predictions_test$target), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 129 0
## 1 3 39
##
## Accuracy : 0.9825
## 95% CI : (0.9496, 0.9964)
## No Information Rate : 0.7719
## P-Value [Acc > NIR] : 1.339e-15
##
## Kappa : 0.9515
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.0000
## Specificity : 0.9773
## Pos Pred Value : 0.9286
## Neg Pred Value : 1.0000
## Prevalence : 0.2281
## Detection Rate : 0.2281
## Detection Prevalence : 0.2456
## Balanced Accuracy : 0.9886
##
## 'Positive' Class : 1
##
The data set used contains no demographic or personal information. It is solely composed of physical characteristics of tumors. Because of this, we feel that any in depth fairness assessments are not necessary as our model does not take into account factors that could lead to results that favor one group over another.
It should be noted that a vast majority of breast cancer cases occur with women (1% of cases are from males). This dataset did not explicitly make that distinction, so if we were to extrapolate this statistic, it should be assumed that the Wisconsin Breast Cancer dataset primarily consists of cases involving women. If we were to target and make predictions for males, we would have to reconsider our data to make that distinction clear.
Using this model, we were able to accurately predict whether a breast mass was benign or malignant based on ten distinct features of the mass. This can be applied to a real situation where a simple image of a breast mass could output whether it is benign or malignant, which could potentially save lives, as finding a mass to be malignant earlier on is incredibly important. We found that the most important features that determined whether a breast mass was benign or malignant were the features concave.points.worst, perimeter.worst, and radius.worst.
For future work, something that could have given us more confidence in our results would be more data. The initial dataset had 600 entries, which could result in overfitting. Thus, having more data would prevent such overfitting in our neural network, and overall more models (and subsequently our ensemble model) more robust. However, even with our small initial dataset, we were able to make accurate predictions. Additionally, if we wanted to test / predict breast cancer for males, we will have to make that distinction during data selection and preprocessing, since a vast majority of breast cancer cases come from women.