Question

Based on these physical properties of the tumors, can we build an ML model that accurately classifies them as malignant or benign?

Data Acquisition

# Load in data
wisconsin <- read.csv("data.csv")

About the Dataset

The dataset we used was called “Breast Cancer Wisconsin (Diagnostic) taken from the UCI ML Repository. It contains data from various people from a hospital in Wisconsin from 1995. There are various physical features used as variables, like texture, area, radius, and more detailing the breasts. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. We will use these features to predict whether a mass is benign or malignant.

Prior Research

This dataset has been analyzed before by researchers and Kaggle users. Researchers focused on methods such as Random Forest, Perceptrons, or Support Vector Machines. This was usually done with TensorFlow in Python to create their models, usually around a 96-98% accuracy. We will build upon this by using an ensemble learning method where we combine the results from multiple models and the result becomes the most common prediction from the models. This produces more accurate results, since failed cases from some models are negated by the correct predictions of other models.

Viewing Data

# Inspect metadata
# Verify that the data was read in correctly
# We know how many columns/rows to expect
is.data.frame(wisconsin)

## [1] TRUE

ncol(wisconsin)

## [1] 33

nrow(wisconsin)

## [1] 569

head(wisconsin)

##         id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1   842302         M       17.99        10.38         122.80    1001.0
## 2   842517         M       20.57        17.77         132.90    1326.0
## 3 84300903         M       19.69        21.25         130.00    1203.0
## 4 84348301         M       11.42        20.38          77.58     386.1
## 5 84358402         M       20.29        14.34         135.10    1297.0
## 6   843786         M       12.45        15.70          82.57     477.1
##   smoothness_mean compactness_mean concavity_mean concave.points_mean
## 1         0.11840          0.27760         0.3001             0.14710
## 2         0.08474          0.07864         0.0869             0.07017
## 3         0.10960          0.15990         0.1974             0.12790
## 4         0.14250          0.28390         0.2414             0.10520
## 5         0.10030          0.13280         0.1980             0.10430
## 6         0.12780          0.17000         0.1578             0.08089
##   symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1        0.2419                0.07871    1.0950     0.9053        8.589
## 2        0.1812                0.05667    0.5435     0.7339        3.398
## 3        0.2069                0.05999    0.7456     0.7869        4.585
## 4        0.2597                0.09744    0.4956     1.1560        3.445
## 5        0.1809                0.05883    0.7572     0.7813        5.438
## 6        0.2087                0.07613    0.3345     0.8902        2.217
##   area_se smoothness_se compactness_se concavity_se concave.points_se
## 1  153.40      0.006399        0.04904      0.05373           0.01587
## 2   74.08      0.005225        0.01308      0.01860           0.01340
## 3   94.03      0.006150        0.04006      0.03832           0.02058
## 4   27.23      0.009110        0.07458      0.05661           0.01867
## 5   94.44      0.011490        0.02461      0.05688           0.01885
## 6   27.19      0.007510        0.03345      0.03672           0.01137
##   symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1     0.03003             0.006193        25.38         17.33          184.60
## 2     0.01389             0.003532        24.99         23.41          158.80
## 3     0.02250             0.004571        23.57         25.53          152.50
## 4     0.05963             0.009208        14.91         26.50           98.87
## 5     0.01756             0.005115        22.54         16.67          152.20
## 6     0.02165             0.005082        15.47         23.75          103.40
##   area_worst smoothness_worst compactness_worst concavity_worst
## 1     2019.0           0.1622            0.6656          0.7119
## 2     1956.0           0.1238            0.1866          0.2416
## 3     1709.0           0.1444            0.4245          0.4504
## 4      567.7           0.2098            0.8663          0.6869
## 5     1575.0           0.1374            0.2050          0.4000
## 6      741.6           0.1791            0.5249          0.5355
##   concave.points_worst symmetry_worst fractal_dimension_worst  X
## 1               0.2654         0.4601                 0.11890 NA
## 2               0.1860         0.2750                 0.08902 NA
## 3               0.2430         0.3613                 0.08758 NA
## 4               0.2575         0.6638                 0.17300 NA
## 5               0.1625         0.2364                 0.07678 NA
## 6               0.1741         0.3985                 0.12440 NA

table(wisconsin$diagnosis)

## 
##   B   M 
## 357 212

wisconsin_numeric <- wisconsin %>% select_if(is.numeric)

# Histogram of all the columns 
for (i in 1:ncol(wisconsin_numeric)) {
  if (i %% 4 == 1)
    par(mfrow=c(2,2))
  hist(wisconsin_numeric[,i], main = colnames(wisconsin_numeric)[i], xlab = "")
}

Data preprocessing

# Convert "B" and "M" to 0 and 1 respectively
wisconsin <- wisconsin %>%
  add_column(diagnosis_num = if_else(.$diagnosis == 'M', 1, 0))
temp <- wisconsin$diagnosis_num

# Subtract out columns we don't want to train with
wisconsin <- subset(wisconsin, select = -c(X, diagnosis, id, diagnosis_num))

# Normalize data
wisconsin <- wisconsin %>%
  mutate_all(scale)

# Add back in the prediction column
wisconsin$diagnosis_num <- temp

Train/Tune/Test Split

# Split
set.seed(3001)
wisconsin.train <- sample_frac(wisconsin, 0.7)
sample_id <- as.numeric(rownames(wisconsin.train))
wisconsin.test <- wisconsin[-sample_id,]

EDA

# Principal Component Analysis
par(mfrow=c(1,1))
pca <- prcomp(wisconsin, scale. = T)
autoplot(pca, data = wisconsin, colour = 'diagnosis_num')

std_dev <- pca$sdev
pr_var <- std_dev^2
prop_varex <- pr_var/sum(pr_var)
plot(prop_varex, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     type = "b")

# Correlation heatmap
cormat <- round(cor(wisconsin), 2)
head(cormat)

##                  radius_mean texture_mean perimeter_mean area_mean
## radius_mean             1.00         0.32           1.00      0.99
## texture_mean            0.32         1.00           0.33      0.32
## perimeter_mean          1.00         0.33           1.00      0.99
## area_mean               0.99         0.32           0.99      1.00
## smoothness_mean         0.17        -0.02           0.21      0.18
## compactness_mean        0.51         0.24           0.56      0.50
##                  smoothness_mean compactness_mean concavity_mean
## radius_mean                 0.17             0.51           0.68
## texture_mean               -0.02             0.24           0.30
## perimeter_mean              0.21             0.56           0.72
## area_mean                   0.18             0.50           0.69
## smoothness_mean             1.00             0.66           0.52
## compactness_mean            0.66             1.00           0.88
##                  concave.points_mean symmetry_mean fractal_dimension_mean
## radius_mean                     0.82          0.15                  -0.31
## texture_mean                    0.29          0.07                  -0.08
## perimeter_mean                  0.85          0.18                  -0.26
## area_mean                       0.82          0.15                  -0.28
## smoothness_mean                 0.55          0.56                   0.58
## compactness_mean                0.83          0.60                   0.57
##                  radius_se texture_se perimeter_se area_se smoothness_se
## radius_mean           0.68      -0.10         0.67    0.74         -0.22
## texture_mean          0.28       0.39         0.28    0.26          0.01
## perimeter_mean        0.69      -0.09         0.69    0.74         -0.20
## area_mean             0.73      -0.07         0.73    0.80         -0.17
## smoothness_mean       0.30       0.07         0.30    0.25          0.33
## compactness_mean      0.50       0.05         0.55    0.46          0.14
##                  compactness_se concavity_se concave.points_se symmetry_se
## radius_mean                0.21         0.19              0.38       -0.10
## texture_mean               0.19         0.14              0.16        0.01
## perimeter_mean             0.25         0.23              0.41       -0.08
## area_mean                  0.21         0.21              0.37       -0.07
## smoothness_mean            0.32         0.25              0.38        0.20
## compactness_mean           0.74         0.57              0.64        0.23
##                  fractal_dimension_se radius_worst texture_worst
## radius_mean                     -0.04         0.97          0.30
## texture_mean                     0.05         0.35          0.91
## perimeter_mean                  -0.01         0.97          0.30
## area_mean                       -0.02         0.96          0.29
## smoothness_mean                  0.28         0.21          0.04
## compactness_mean                 0.51         0.54          0.25
##                  perimeter_worst area_worst smoothness_worst compactness_worst
## radius_mean                 0.97       0.94             0.12              0.41
## texture_mean                0.36       0.34             0.08              0.28
## perimeter_mean              0.97       0.94             0.15              0.46
## area_mean                   0.96       0.96             0.12              0.39
## smoothness_mean             0.24       0.21             0.81              0.47
## compactness_mean            0.59       0.51             0.57              0.87
##                  concavity_worst concave.points_worst symmetry_worst
## radius_mean                 0.53                 0.74           0.16
## texture_mean                0.30                 0.30           0.11
## perimeter_mean              0.56                 0.77           0.19
## area_mean                   0.51                 0.72           0.14
## smoothness_mean             0.43                 0.50           0.39
## compactness_mean            0.82                 0.82           0.51
##                  fractal_dimension_worst diagnosis_num
## radius_mean                         0.01          0.73
## texture_mean                        0.12          0.42
## perimeter_mean                      0.05          0.74
## area_mean                           0.00          0.71
## smoothness_mean                     0.50          0.36
## compactness_mean                    0.69          0.60

melted_cormat <- melt(cormat)

# Plot correlations between features
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile()

# Feature importance
diagnosis_corr = abs(cor(wisconsin)["diagnosis_num",])
sort(diagnosis_corr)

##             symmetry_se              texture_se  fractal_dimension_mean 
##             0.006521756             0.008303333             0.012837603 
##           smoothness_se    fractal_dimension_se            concavity_se 
##             0.067016011             0.077972417             0.253729766 
##          compactness_se fractal_dimension_worst           symmetry_mean 
##             0.292999244             0.323872189             0.330498554 
##         smoothness_mean       concave.points_se            texture_mean 
##             0.358559965             0.408042333             0.415185300 
##          symmetry_worst        smoothness_worst           texture_worst 
##             0.416294311             0.421464861             0.456902821 
##                 area_se            perimeter_se               radius_se 
##             0.548235940             0.556140703             0.567133821 
##       compactness_worst        compactness_mean         concavity_worst 
##             0.590998238             0.596533678             0.659610210 
##          concavity_mean               area_mean             radius_mean 
##             0.696359707             0.708983837             0.730028511 
##              area_worst          perimeter_mean            radius_worst 
##             0.733825035             0.742635530             0.776453779 
##     concave.points_mean         perimeter_worst    concave.points_worst 
##             0.776613840             0.782914137             0.793566017 
##           diagnosis_num 
##             1.000000000

# The most important features are concave.points_worst, perimeter_worst, and radius_worst
importance(randomForest(as.factor(diagnosis_num)~., data=wisconsin, ntree=5000, importance=TRUE))

##                                 0         1 MeanDecreaseAccuracy
## radius_mean             27.369641 17.661502            30.609674
## texture_mean            28.966023 31.318041            39.650957
## perimeter_mean          25.343734 18.442057            30.163747
## area_mean               30.349797 18.778953            33.741489
## smoothness_mean          8.525414 21.937620            23.387121
## compactness_mean        15.398597 12.036553            19.784787
## concavity_mean          24.625565 26.797775            36.316581
## concave.points_mean     32.181939 35.720130            46.403778
## symmetry_mean            4.720898 12.445773            12.533167
## fractal_dimension_mean  10.532355  4.556438            12.049652
## radius_se               25.967235 18.622309            32.223952
## texture_se              10.059447  6.790069            12.562791
## perimeter_se            23.915344 19.473354            31.308008
## area_se                 36.653855 25.278367            44.031427
## smoothness_se            9.572910  2.719263             9.658280
## compactness_se          16.511069  6.460409            17.806649
## concavity_se            13.887479 13.236017            18.573752
## concave.points_se       13.170662  7.619243            15.413761
## symmetry_se              8.922516  1.752871             8.788754
## fractal_dimension_se    12.328152  3.610863            12.417300
## radius_worst            45.921254 34.826698            53.557306
## texture_worst           33.384116 38.402865            45.526593
## perimeter_worst         43.921727 39.552028            55.545632
## area_worst              47.443674 38.814262            57.305002
## smoothness_worst        24.295446 28.315983            36.061088
## compactness_worst       18.328836 17.307315            25.274069
## concavity_worst         24.099342 34.428700            42.214567
## concave.points_worst    45.256997 39.372371            58.722016
## symmetry_worst          16.129591 19.067016            23.837619
## fractal_dimension_worst 13.486838 11.174198            17.398093
##                         MeanDecreaseGini
## radius_mean                   10.8473666
## texture_mean                   3.9073005
## perimeter_mean                12.6406155
## area_mean                     13.0869988
## smoothness_mean                1.6772331
## compactness_mean               2.9893338
## concavity_mean                12.7929722
## concave.points_mean           28.3110588
## symmetry_mean                  1.0130205
## fractal_dimension_mean         0.9660721
## radius_se                      4.0087138
## texture_se                     1.1548624
## perimeter_se                   3.6808610
## area_se                        9.1752615
## smoothness_se                  1.1224452
## compactness_se                 1.1942600
## concavity_se                   1.6496709
## concave.points_se              1.3548956
## symmetry_se                    1.0479490
## fractal_dimension_se           1.2783071
## radius_worst                  29.2175997
## texture_worst                  5.1789805
## perimeter_worst               34.4639206
## area_worst                    29.4162206
## smoothness_worst               3.6711528
## compactness_worst              4.1943871
## concavity_worst                8.9407574
## concave.points_worst          32.1377454
## symmetry_worst                 2.7454015
## fractal_dimension_worst        1.7848085

# Calculate the prevalence
prevalence <- sum(wisconsin$diagnosis_num == 1) / length(wisconsin$diagnosis_num)
prevalence

## [1] 0.3725835

Random Forests

set.seed(3001)
rf_model = randomForest(as.factor(diagnosis_num)~., data=wisconsin.train, ntree=5000)
rf_model

## 
## Call:
##  randomForest(formula = as.factor(diagnosis_num) ~ ., data = wisconsin.train,      ntree = 5000) 
##                Type of random forest: classification
##                      Number of trees: 5000
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 3.77%
## Confusion matrix:
##     0   1 class.error
## 0 247   6  0.02371542
## 1   9 136  0.06206897

# Testing the model
rf_predictions_train<-predict(rf_model, type = "response")
train_acc <- nrow(filter(wisconsin.train, diagnosis_num == rf_predictions_train)) / nrow(wisconsin.train)
train_acc

## [1] 0.9623116

rf_predictions_test<-predict(rf_model, wisconsin.test, type = "response")
test_acc <- nrow(filter(wisconsin.test, diagnosis_num == rf_predictions_test)) / nrow(wisconsin.test)
test_acc

## [1] 0.9766082

Multiple logistic model

# Train logistic model against 6 most important features
set.seed(3001)
logistic_model = glm(data = wisconsin.train,
                     as.factor(diagnosis_num) ~ concave.points_worst + concave.points_mean + 
                       perimeter_worst + radius_worst + perimeter_mean + area_worst,
                     family = binomial())

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(logistic_model)

## 
## Call:
## glm(formula = as.factor(diagnosis_num) ~ concave.points_worst + 
##     concave.points_mean + perimeter_worst + radius_worst + perimeter_mean + 
##     area_worst, family = binomial(), data = wisconsin.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6186  -0.1439  -0.0578   0.0004   3.6731  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)   
## (Intercept)            2.1300     1.0612   2.007  0.04473 * 
## concave.points_worst   1.4139     0.9001   1.571  0.11622   
## concave.points_mean    2.5970     1.0587   2.453  0.01417 * 
## perimeter_worst       -0.2396     3.8278  -0.063  0.95010   
## radius_worst          -4.8223     6.5717  -0.734  0.46307   
## perimeter_mean        -6.0025     2.1566  -2.783  0.00538 **
## area_worst            19.3468     8.2226   2.353  0.01863 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 522.068  on 397  degrees of freedom
## Residual deviance:  88.094  on 391  degrees of freedom
## AIC: 102.09
## 
## Number of Fisher Scoring iterations: 9

# Test logistic model
logistic_predictions_train<-round(predict(logistic_model, type = "response"))
train_acc <- nrow(filter(wisconsin.train, diagnosis_num == logistic_predictions_train)) / nrow(wisconsin.train)
train_acc

## [1] 0.9623116

logistic_predictions_test<-round(predict(logistic_model, wisconsin.test, type = "response"))
test_acc <- nrow(filter(wisconsin.test, diagnosis_num == logistic_predictions_test)) / nrow(wisconsin.test)
test_acc

## [1] 0.9766082

Support Vector Machine

set.seed(3001)
svm_model = svm(formula = as.factor(diagnosis_num) ~ .,
                 data = wisconsin.train,
                 type = 'C-classification',
                 kernel = 'linear')
summary(svm_model)

## 
## Call:
## svm(formula = as.factor(diagnosis_num) ~ ., data = wisconsin.train, 
##     type = "C-classification", kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  33
## 
##  ( 15 18 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

svm_predictions_train<-predict(svm_model, type = "response")
train_acc <- nrow(filter(wisconsin.train, diagnosis_num == svm_predictions_train)) / nrow(wisconsin.train)
train_acc

## [1] 0.9849246

svm_predictions_test<-predict(svm_model, wisconsin.test, type = "response")
test_acc <- nrow(filter(wisconsin.test, diagnosis_num == svm_predictions_test)) / nrow(wisconsin.test)
test_acc

## [1] 0.9824561

Vanilla Neural Network

# For Tensorflow, we must subtract out the prediction column
X_train <- wisconsin.train %>% 
  select(-diagnosis_num) %>% 
  scale()
y_train <- to_categorical(wisconsin.train$diagnosis_num)

## Loaded Tensorflow version 2.7.0

X_test <- wisconsin.test %>% 
  select(-diagnosis_num) %>% 
  scale()
y_test <- to_categorical(wisconsin.test$diagnosis_num)

# Model definition
# 2 hidden layers with dropout to avoid overfitting
model <- keras_model_sequential()
model %>%
  layer_dense(units = 64, activation = 'relu', input_shape =  ncol(wisconsin.train) - 1) %>% 
  layer_dropout(rate = 0.5) %>% 
  layer_dense(units = 32, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 2, activation = 'sigmoid')

# Choose loss and optimizer
history <- model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)

set.seed(3001)
# Train the model with 100 epochs
model %>% fit(
  X_train, y_train, 
  epochs = 100, 
  batch_size = 30,
  validation_split = 0.4
)
summary(model)

## Model: "sequential"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  dense_2 (Dense)                    (None, 64)                      1984        
##                                                                                 
##  dropout_1 (Dropout)                (None, 64)                      0           
##                                                                                 
##  dense_1 (Dense)                    (None, 32)                      2080        
##                                                                                 
##  dropout (Dropout)                  (None, 32)                      0           
##                                                                                 
##  dense (Dense)                      (None, 2)                       66          
##                                                                                 
## ================================================================================
## Total params: 4,130
## Trainable params: 4,130
## Non-trainable params: 0
## ________________________________________________________________________________

# Calculating accuracy
train_predictions <- model %>% predict(X_train) %>% `>`(0.5) %>% k_cast("int32")
test_predictions <- model %>% predict(X_test) %>% `>`(0.5) %>% k_cast("int32")

train_acc <- sum(as.array(train_predictions[,2]) == y_train[,2]) / nrow(train_predictions)
train_acc

## [1] 0.9824121

test_acc <- sum(as.array(test_predictions[,2]) == y_test[,2]) / nrow(test_predictions)
test_acc

## [1] 0.9649123

Evaluation

# Create a matrix with predictions from all models
model_predictions_train <- data.frame(matrix(ncol = 0, nrow = length(logistic_predictions_train)))
model_predictions_train$rf <- as.numeric(rf_predictions_train) - 1
model_predictions_train$logistic <- logistic_predictions_train
model_predictions_train$svm <- svm_predictions_train
model_predictions_train$ann <- as.array(train_predictions[,2])
model_predictions_train$target <- wisconsin.train$diagnosis_num
head(model_predictions_train)

##   rf logistic svm ann target
## 1  1        0   1   1      1
## 2  0        1   1   1      1
## 3  0        0   0   0      0
## 4  1        1   0   0      0
## 5  0        0   0   0      0
## 6  1        1   1   1      1

model_predictions_test <- data.frame(matrix(ncol = 0, nrow = length(logistic_predictions_test)))
model_predictions_test$rf <- as.numeric(rf_predictions_test) - 1
model_predictions_test$logistic <- logistic_predictions_test
model_predictions_test$svm <- svm_predictions_test
model_predictions_test$ann <- as.array(test_predictions[,2])
model_predictions_test$target <- wisconsin.test$diagnosis_num
head(model_predictions_test)

##   rf logistic svm ann target
## 1  0        0   0   0      0
## 2  0        0   0   0      0
## 3  1        1   1   1      1
## 4  0        0   0   0      0
## 5  0        0   0   0      0
## 6  0        0   0   0      0

# Pass in outputs into another random forest
set.seed(3001)
ensemble = randomForest(as.factor(target)~., data=model_predictions_train, ntree=5000, mtry=2, nodesize=3)

train_preds<-predict(ensemble, type = "response")
train_acc <- nrow(filter(model_predictions_train, target == train_preds)) / nrow(model_predictions_train)
train_acc

## [1] 0.9748744

test_preds<-predict(ensemble, model_predictions_test, type = "response")
test_acc <- nrow(filter(model_predictions_test, target == test_preds)) / nrow(model_predictions_test)
test_acc

## [1] 0.9824561

confusionMatrix(as.factor(test_preds), as.factor(model_predictions_test$target), positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 129   0
##          1   3  39
##                                           
##                Accuracy : 0.9825          
##                  95% CI : (0.9496, 0.9964)
##     No Information Rate : 0.7719          
##     P-Value [Acc > NIR] : 1.339e-15       
##                                           
##                   Kappa : 0.9515          
##                                           
##  Mcnemar's Test P-Value : 0.2482          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9773          
##          Pos Pred Value : 0.9286          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.2281          
##          Detection Rate : 0.2281          
##    Detection Prevalence : 0.2456          
##       Balanced Accuracy : 0.9886          
##                                           
##        'Positive' Class : 1               
##

Fairness Assessment

The data set used contains no demographic or personal information. It is solely composed of physical characteristics of tumors. Because of this, we feel that any in depth fairness assessments are not necessary as our model does not take into account factors that could lead to results that favor one group over another.

It should be noted that a vast majority of breast cancer cases occur with women (1% of cases are from males). This dataset did not explicitly make that distinction, so if we were to extrapolate this statistic, it should be assumed that the Wisconsin Breast Cancer dataset primarily consists of cases involving women. If we were to target and make predictions for males, we would have to reconsider our data to make that distinction clear.

Conclusions

Using this model, we were able to accurately predict whether a breast mass was benign or malignant based on ten distinct features of the mass. This can be applied to a real situation where a simple image of a breast mass could output whether it is benign or malignant, which could potentially save lives, as finding a mass to be malignant earlier on is incredibly important. We found that the most important features that determined whether a breast mass was benign or malignant were the features concave.points.worst, perimeter.worst, and radius.worst.

Future work

For future work, something that could have given us more confidence in our results would be more data. The initial dataset had 600 entries, which could result in overfitting. Thus, having more data would prevent such overfitting in our neural network, and overall more models (and subsequently our ensemble model) more robust. However, even with our small initial dataset, we were able to make accurate predictions. Additionally, if we wanted to test / predict breast cancer for males, we will have to make that distinction during data selection and preprocessing, since a vast majority of breast cancer cases come from women.

Group 15 DS 3001 Final Project

Alan Zheng, Daniel, Malinowski, Jason Deng

5/4/2022