Today we are working with the Wisconsin Breast Cancer Database, collected by Dr. William H. Wolberg at the University of Wisconsin Hospitals, Madison.

Warning: there will be no fancy visualization here; this project is purely ‘machine learning focused’. We are, however, going to make a rainbow out of the ROC curve.

Summary:

1/ Preparing data: factors, pre-processing
2/ Logistic regression and Accuracy & ROC curve
3/ Caret: Gradient boosting, Random forest, Neural Network 

Loading packages and data:

library(magrittr) # enables the pipe operator (%>%)
library(caret) # for pre-processing, cross-validation and ML algorithms
library(ROCR) # for ROC curve and AUROC

cancerData <- read.csv("CancerData.csv", header = T) 
attach(cancerData); str(cancerData)
## 'data.frame':    699 obs. of  11 variables:
##  $ group           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ thickness       : int  5 5 3 6 4 8 1 2 2 4 ...
##  $ size.unif       : int  1 4 1 8 1 10 1 1 1 2 ...
##  $ shape.unif      : int  1 4 1 8 1 10 1 2 1 1 ...
##  $ adhesion        : int  1 5 1 1 3 8 1 1 1 1 ...
##  $ epi             : int  2 7 2 3 2 7 2 2 2 2 ...
##  $ bare            : int  1 10 2 4 1 10 10 1 1 1 ...
##  $ bland           : int  3 3 3 3 3 9 3 3 1 2 ...
##  $ normal          : int  1 2 1 7 1 7 1 1 1 1 ...
##  $ mitoses         : int  1 1 1 1 1 1 1 1 5 1 ...
##  $ diagnosis.factor: Factor w/ 2 levels "ben","mal": 1 1 1 1 1 2 1 1 1 1 ...

P.S. At the time of publishing this project, the URL with the data was blocked, so I refer to self-generated data. Perhaps the original URL will be available again later at www.stat.yale.edu/~pollard/Courses/230.spring03/WBC/breastcancer.data .

Let’s head over!

Data samples were collected periodically and consist of visually assessed nuclear features of fine needle aspirates (FNAs) taken from patients’ breasts. Each component is on a scale from 1 to 10, with 1 corresponding to a normal state and 10 to the most abnormal state.

Field            Attribute

   1             Sample code number
   2             Class: 2 for benign, 4 for malignant
   3             Clump Thickness
   4             Uniformity of Cell Size
   5             Uniformity of Cell Shape
   6             Marginal Adhesion
   7             Single Epithelial Cell Size
   8             Bare Nuclei
   9             Bland Chromatin
  10             Normal Nucleoli
  11             Mitoses

In this project we would like to predict the presence of a malignant tumor for future unseen samples based on collected FNA results. So we head over directly to the model and its analysis.

Part 1: Preparing data

We want to know probabilities/classes for the variable ‘diagnosis’: whether the tumor is malignant or not. As our target is a categorical variable with two classes we can build a binomial logistic regression.

First we need to convert two categorical variables to factors: the dependent variable ‘diagnosis’ and the independent variable ‘group’. Initially ‘diagnosis’ contained two numeric values, 2 and 4; beforehand I recoded them to “ben” (benign) and “mal” (malignant) respectively (a sketch of that recode follows the code below).

cancerData$group <- factor(cancerData$group)
id <- cancerData$id; cancerData$id <- NULL # the sample id is not relevant for the model
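
For completeness, here is a hedged sketch of how that earlier recode could have looked; the raw column name is an assumption (only the 2 = benign / 4 = malignant coding comes from the field description above):

# Hypothetical sketch: recoding the raw numeric class (2 = benign, 4 = malignant)
# into the "ben"/"mal" factor used above; 'rawClass' stands in for the raw column.
rawClass <- c(2, 2, 4, 2, 4)
factor(ifelse(rawClass == 2, "ben", "mal"), levels = c("ben", "mal"))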

Secondly we pre-process our data to maximize the performance of the machine learning algorithms that we are going to use further in the project. If we wanted to explore patterns in the raw data, we would postpone this step; but as our focus is the models, we do it now:

cancerDataPP <- preProcess(x = cancerData, 
                           method = c("center", "scale")) 
cancerDataProcessed <- predict(cancerDataPP, cancerData) #stored

Combining the ‘center’ and ‘scale’ methods standardizes the data, so each numeric attribute ends up with a mean of 0 and a standard deviation of 1.

We do this to put all predictors on a common scale, which helps the scale-sensitive algorithms used later (the neural network in particular) and makes the coefficients easier to compare.
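
As a quick check (a small sketch, assuming the objects created just above are still in the workspace), the ‘center’/‘scale’ transform is simply (x - mean) / sd applied to each numeric column:

# Manual standardization of one column should reproduce preProcess's output
manual <- (cancerData$thickness - mean(cancerData$thickness)) / sd(cancerData$thickness)
all.equal(manual, cancerDataProcessed$thickness) # expected: TRUE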

Lastly we split the data into train and test sets: the first is the one on which we train the model, the second holds unseen observations to which we want it to generalize.

set.seed(1) # for reproducibility of results
trainIndex <- createDataPartition(cancerDataProcessed$diagnosis.factor, 
                                  times = 1,
                                  p = .75,
                                  list = FALSE)
cancerTrain <- cancerDataProcessed[trainIndex,]
cancerTest <- cancerDataProcessed[-trainIndex,]
testValues <- cancerTest$diagnosis.factor
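
A quick optional check: createDataPartition samples within each outcome class, so the benign/malignant proportions should be roughly the same in both splits.

# Class balance should be preserved by the stratified split
prop.table(table(cancerTrain$diagnosis.factor))
prop.table(table(cancerTest$diagnosis.factor))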

Part 2: Binomial logistic regression

Basically what we want here is to obtain a model that fits the unseen data set best. But first we need to decide which independent variables to take as predictors: should we use all of them, or should we try to minimize their number? Let’s try both.

set.seed(1)
cancerFitAll <- glm(diagnosis.factor ~., 
                    family = binomial(link = "logit"), 
                    data = cancerTrain)
summary(cancerFitAll)
## 
## Call:
## glm(formula = diagnosis.factor ~ ., family = binomial(link = "logit"), 
##     data = cancerTrain)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.99255  -0.06953  -0.02403   0.00776   1.83939  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.5836     0.5005  -1.166 0.243616    
## group2       -5.3532     1.9899  -2.690 0.007140 ** 
## group3        0.1203     2.3779   0.051 0.959660    
## group4       -2.8542     9.5080  -0.300 0.764030    
## group5        0.6205     1.2313   0.504 0.614295    
## group6       -2.1227     2.9477  -0.720 0.471445    
## group7       -1.3136     3.0932  -0.425 0.671064    
## group8       -2.5554     1.9775  -1.292 0.196270    
## thickness     1.9106     0.6493   2.943 0.003256 ** 
## size.unif     0.3683     0.9256   0.398 0.690733    
## shape.unif    1.8936     1.1589   1.634 0.102266    
## adhesion      0.4375     0.4815   0.909 0.363548    
## epi          -0.1525     0.4726  -0.323 0.746871    
## bare          1.6696     0.4900   3.407 0.000656 ***
## bland         1.1306     0.5893   1.919 0.055020 .  
## normal        0.6284     0.4746   1.324 0.185494    
## mitoses       0.6569     0.8621   0.762 0.446090    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 676.351  on 524  degrees of freedom
## Residual deviance:  60.592  on 508  degrees of freedom
## AIC: 94.592
## 
## Number of Fisher Scoring iterations: 9

There is apparently something we do not know about group 2. The estimated coefficient speaks for itself: belonging to the second group is associated with much lower odds of having a malignant tumor. Could it be some specific group of patients with similar living conditions or characteristics? No additional information is provided.
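
To put that coefficient on the odds scale (a one-line sketch using the model fitted above), we can exponentiate it to get the odds ratio relative to the reference group:

# Odds ratio for group 2 vs the reference group: exp(-5.35) is roughly 0.005,
# i.e. the odds of a malignant tumor are far lower for patients in group 2
exp(coef(cancerFitAll)["group2"])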

Let’s test now another model with fewer predictors.

If we were to keep only the statistically significant predictors (at the 0.05 significance level), we would be left with only two variables, and the model would fit very poorly. So it makes little sense to use p-values alone to decide whether a variable is relevant for this model.

Instead we could take variables that have the most relative importance in the model:

varImp(cancerFitAll) # the importance of each variable in model
##               Overall
## group2     2.69023418
## group3     0.05057974
## group4     0.30019240
## group5     0.50395251
## group6     0.72012984
## group7     0.42468812
## group8     1.29225263
## thickness  2.94252058
## size.unif  0.39786048
## shape.unif 1.63396768
## adhesion   0.90862558
## epi        0.32276729
## bare       3.40727579
## bland      1.91871847
## normal     1.32402864
## mitoses    0.76195004

For the second model we could try dropping predictors such as ‘size.unif’ and ‘epi’. Let’s check if it is any good.

set.seed(1)
cancerFitFew <- glm(diagnosis.factor ~. -size.unif -epi, 
                      family = binomial(link = "logit"), 
                      data = cancerTrain)
summary(cancerFitFew)
## 
## Call:
## glm(formula = diagnosis.factor ~ . - size.unif - epi, family = binomial(link = "logit"), 
##     data = cancerTrain)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.97727  -0.06913  -0.02649   0.00830   1.81357  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.6390     0.4870  -1.312 0.189454    
## group2       -5.0156     1.8158  -2.762 0.005742 ** 
## group3        0.1854     2.2998   0.081 0.935760    
## group4       -2.6477     9.1436  -0.290 0.772145    
## group5        0.6849     1.2109   0.566 0.571621    
## group6       -2.0210     2.9644  -0.682 0.495383    
## group7       -1.2694     3.0149  -0.421 0.673712    
## group8       -2.4339     1.8507  -1.315 0.188476    
## thickness     1.8914     0.6370   2.969 0.002987 ** 
## shape.unif    2.0516     0.9067   2.263 0.023663 *  
## adhesion      0.4615     0.4490   1.028 0.303997    
## bare          1.6662     0.4919   3.388 0.000705 ***
## bland         1.1719     0.5761   2.034 0.041939 *  
## normal        0.6231     0.4658   1.338 0.180967    
## mitoses       0.6913     0.8767   0.789 0.430354    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 676.351  on 524  degrees of freedom
## Residual deviance:  60.808  on 510  degrees of freedom
## AIC: 90.808
## 
## Number of Fisher Scoring iterations: 9

Ultimately we would like to see a large drop from the null deviance to the residual deviance, as well as a lower AIC. The latter, by the way, plays a role similar to adjusted R² in linear models: it penalizes a model for additional coefficients, so the model with the minimum AIC is preferred.
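
As a quick sanity check (a small sketch, assuming the two fitted models from above are still in the workspace), the AIC can be recomputed by hand from the log-likelihood:

# AIC = 2 * (number of estimated coefficients) - 2 * log-likelihood
k <- length(coef(cancerFitFew))
2 * k - 2 * as.numeric(logLik(cancerFitFew)) # should match the reported AIC (~90.8)
AIC(cancerFitAll, cancerFitFew)              # built-in side-by-side comparison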

Even though in our example the model with fewer predictors is preferable to the first one because of its lower AIC, the overall difference is still minuscule.

Let’s run an analysis of deviance (the glm analogue of ANOVA) on these two models with a chi-square test. This tests whether the difference in residual deviance between the two models is statistically significant.

anova(cancerFitAll, cancerFitFew, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: diagnosis.factor ~ group + thickness + size.unif + shape.unif + 
##     adhesion + epi + bare + bland + normal + mitoses
## Model 2: diagnosis.factor ~ (group + thickness + size.unif + shape.unif + 
##     adhesion + epi + bare + bland + normal + mitoses) - size.unif - 
##     epi
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       508     60.592                     
## 2       510     60.808 -2 -0.21665   0.8973

Right, as we noticed before, the two models are pretty much the same, as shown by the very high p-value (0.8973).

You know, let’s take the second one. I believe that if a model can generate the same result (or a slightly better one) with fewer predictors, then the less useful ones can be dropped for the sake of simplicity.

Before running the chosen model on our test set, it is worth training it with repeated cross-validation, configured through the ‘fitControl’ control object. This limits overfitting and helps make sure that the model will generalize well to an unseen dataset.

set.seed(1)
fitControl <- trainControl(method = "repeatedcv",
                           #number of folds is 10 by default
                           repeats = 3, 
                           savePredictions = T)

glmCancerFit <- train(diagnosis.factor ~.-size.unif -epi, 
                      data = cancerTrain,
                      method = "glm",
                      family = "binomial",
                      trControl = fitControl)

glmFitAcc <- train(diagnosis.factor ~.-size.unif -epi, 
                      data = cancerTrain,
                      method = "glm",
                      metric = "Accuracy",
                      trControl = fitControl) %>% print
## Generalized Linear Model 
## 
## 525 samples
##  10 predictor
##   2 classes: 'ben', 'mal' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 472, 473, 473, 472, 473, 473, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9695452  0.9323695
## 
## 

The resulting accuracy of the model is about 97%, which is great. Alongside the overall accuracy we usually also want to plot a ROC curve. What is that?

ROC, short for Receiver Operating Characteristic, measures the classifier’s performance using the proportion of positive data points correctly classified as positive (True Positive Rate, or Sensitivity) and the proportion of negative data points mistakenly classified as positive (False Positive Rate, or Fall-out).

In our example True Positives are the observations correctly predicted as positive (as malignant). The higher the TP rate, the fewer “malignant” points are missed. False Positives, on the other hand, are observations predicted as positive (“malignant”) that are in reality negative (“benign”). The higher the FP rate, the more “benign” points we misclassify.
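
To make these rates concrete, here is a tiny standalone sketch (toy labels and probabilities, not the project’s data) computing TPR and FPR at a single cutoff of 0.5:

truth <- factor(c("ben", "mal", "mal", "ben", "mal", "ben"), levels = c("ben", "mal"))
prob  <- c(0.10, 0.90, 0.40, 0.30, 0.85, 0.05) # toy predicted probabilities of "mal"
pred  <- ifelse(prob > 0.5, "mal", "ben")      # classify at a single cutoff of 0.5
TP <- sum(pred == "mal" & truth == "mal"); FN <- sum(pred == "ben" & truth == "mal")
FP <- sum(pred == "mal" & truth == "ben"); TN <- sum(pred == "ben" & truth == "ben")
TP / (TP + FN) # True Positive Rate (sensitivity): here 2/3
FP / (FP + TN) # False Positive Rate (fall-out): here 0/3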

The ROC curve is then a graphic showing the trade-off between the rate of correct positive predictions and the rate of incorrect positive predictions.

We are also interested in the area under the ROC curve (AUROC). Its value ranges from 0.5 to 1, and a value above 0.8 indicates that the model classifies well. In our case the target variable, ‘diagnosis’, has two classes: “benign” and “malignant”.

To plot a ROC curve, though, we pass the glm ‘cancerFitFew’ built earlier as the argument, because it already gives us the predicted probabilities we want to use.

We could also use caret’s ROC metric to get an AUROC value, but we would need to change the ‘fitControl’ object so that it computes class probabilities (we will do that later for the random forest model).

And also we want to plot a rainbow:

probROC <- predict(cancerFitFew, 
                   newdata = cancerTest, 
                   type = "response") # predicted probabilities of "mal"

predROC <- prediction(probROC,
                      testValues)     # ROCR prediction object

perfROC <- performance(predROC, 
                       measure = 'tpr', 
                       x.measure = 'fpr') # TPR vs FPR at every cutoff

plot(perfROC, lwd = 2, colorize = TRUE) # rainbow!
lines(x = c(0, 1), y = c(0, 1), col = "black", lwd = 1)

auc <- performance(predROC, measure = "auc") # area under the ROC curve
auc <- auc@y.values[[1]] %>% print
## [1] 0.9912281

ROC curves are used to see how well the classifier separates positive and negative examples, and ours does its job very well.

The diagonal line is the baseline, which corresponds to a random classifier. The optimal point on the ROC curve is (FPR, TPR) = (0, 1): no false positives and all true positives, i.e. the top-left corner. The closer we get to it (and the further from the baseline), the better the classification.
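
If you want the probability cutoff closest to that corner, it can be read off the ROCR objects built above; here is a small sketch using Youden’s J statistic (TPR - FPR), which is maximized at the “best” threshold:

# Extract TPR, FPR and the corresponding cutoffs from the performance object
tpr     <- perfROC@y.values[[1]]
fpr     <- perfROC@x.values[[1]]
cutoffs <- perfROC@alpha.values[[1]]
cutoffs[which.max(tpr - fpr)] # threshold maximizing Youden's J = TPR - FPR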

Summary: Binomial Logistic Regression

  • dropped variables for model: size.unif, epi (and id beforehand)
  • overall accuracy: ~ 97%
  • area under the curve of the Receiver Operating Characteristic: ~ 99%

We have got a great model here. Still, it would be even better to compare our logistic regression to more complex models.

In the next part we will use Gradient Boosting, Random Forest and Neural Network on the same dataset to find out if there is one which performs better.

Part 3: Caret: Gradient Boosting, Random Forest and Neural Network

All models below are built with the caret package:

library(gbm)
library(randomForest)
library(nnet)

# fitControlProb: we are going to use it later for AUROC
set.seed(1)
fitControlProb <- trainControl(method = "repeatedcv",
                               repeats = 3, 
                               savePredictions = T, 
                               classProbs = T, # probability instead of response
                               summaryFunction = twoClassSummary)
#Gradient boosting: caret
gbmCancerFit <- train(diagnosis.factor ~.-size.unif -epi, 
                      data = cancerTrain, 
                      method = "gbm", 
                      trControl = fitControl,
                      verbose = F)

#Random forest: caret
rfCancerFit <- train(diagnosis.factor ~.-size.unif -epi, 
                     data = cancerTrain,
                     method = "rf",
                     trControl = fitControl)

#Neural network: caret
nnCancerFit <- train(diagnosis.factor ~.-size.unif -epi, 
                     data = cancerTrain,
                     method = "avNNet", 
                     trControl = fitControl, 
                     linout = T)

All four models together:

# glmCancerFit from Part1 added for comparison
results <- resamples(list(RF = rfCancerFit, 
                          GBM = gbmCancerFit, 
                          NNET = nnCancerFit, 
                          GLM = glmCancerFit)) 
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: RF, GBM, NNET, GLM 
## Number of resamples: 30 
## 
## Accuracy 
##        Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
## RF   0.9434  0.9624 0.9808 0.9778  0.9814    1    0
## GBM  0.9231  0.9615 0.9811 0.9759  1.0000    1    0
## NNET 0.9231  0.9615 0.9808 0.9752  0.9953    1    0
## GLM  0.9231  0.9615 0.9808 0.9701  0.9811    1    0
## 
## Kappa 
##        Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
## RF   0.8755  0.9176 0.9581 0.9510  0.9588    1    0
## GBM  0.8317  0.9152 0.9581 0.9468  1.0000    1    0
## NNET 0.8255  0.9156 0.9574 0.9454  0.9896    1    0
## GLM  0.8301  0.9133 0.9570 0.9339  0.9585    1    0

The random forest model worked a bit better than the others, with the highest minimum and mean accuracies, though all four generate pretty much the same result.

diffs <- diff(results) # summarize p-values for pair-wise comparisons
summary(diffs)
## 
## Call:
## summary.diff.resamples(object = diffs)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## Accuracy 
##      RF     GBM       NNET      GLM      
## RF          0.0019473 0.0026595 0.0076928
## GBM  1.0000           0.0007123 0.0057455
## NNET 1.0000 1.0000              0.0050332
## GLM  1.0000 0.1558    1.0000             
## 
## Kappa 
##      RF     GBM      NNET     GLM     
## RF          0.004200 0.005623 0.017126
## GBM  1.0000          0.001422 0.012926
## NNET 1.0000 1.0000            0.011504
## GLM  1.0000 0.1521   1.0000

The lower diagonal of the table shows p-values for the null hypothesis (H0: the distributions are the same). As we noticed before, no pair of models shows a statistically significant difference; the smallest p-value belongs to the Gradient Boosting Model versus the Generalized Linear Model (the logistic regression), but even that one (0.1558) is well above 0.05.

The upper diagonal shows the estimated differences between the distributions. As we noticed before, the Random Forest model is the most accurate one, so here we can see an estimate of how much “better” it actually is compared to the others.
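
If you prefer to see these resampling distributions rather than just the summary tables, the resamples object can be plotted directly with caret’s lattice methods (a quick optional sketch):

# Lattice plots of the cross-validated accuracies for the four models
bwplot(results, metric = "Accuracy")  # box-and-whisker plot per model
dotplot(results, metric = "Accuracy") # mean accuracy with confidence interval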

Lastly, let’s measure the predictive power of the Random Forest model on the test set and compute its AUROC:

rfPredict <- predict(rfCancerFit, 
                     newdata = cancerTest)
confusionMatrix(rfPredict, testValues)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ben mal
##        ben 108   1
##        mal   6  59
##                                           
##                Accuracy : 0.9598          
##                  95% CI : (0.9189, 0.9837)
##     No Information Rate : 0.6552          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9127          
##  Mcnemar's Test P-Value : 0.1306          
##                                           
##             Sensitivity : 0.9474          
##             Specificity : 0.9833          
##          Pos Pred Value : 0.9908          
##          Neg Pred Value : 0.9077          
##              Prevalence : 0.6552          
##          Detection Rate : 0.6207          
##    Detection Prevalence : 0.6264          
##       Balanced Accuracy : 0.9654          
##                                           
##        'Positive' Class : ben             
## 

The resulting accuracy is ~96%, almost the same as we got before for the logistic regression model.
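
Note that caret took ‘ben’ (the first factor level) as the positive class, so the reported sensitivity of 0.9474 refers to detecting benign cases. Since the clinically interesting class is ‘mal’, you may prefer to flip it; here is a small sketch using confusionMatrix’s positive argument:

# Recompute the confusion matrix with "mal" as the positive class,
# so that Sensitivity becomes the malignant detection rate
confusionMatrix(rfPredict, testValues, positive = "mal")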

rfFitROC <- train(diagnosis.factor ~.-size.unif -epi, 
                      data = cancerTrain,
                      method = "rf",
                      metric = "ROC",
                      trControl = fitControlProb) %>% print
## Random Forest 
## 
## 525 samples
##  10 predictor
##   2 classes: 'ben', 'mal' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 473, 473, 472, 471, 473, 473, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##    2    0.9961144  0.9816527  0.9760234
##    8    0.9949599  0.9816527  0.9630604
##   14    0.9938296  0.9778151  0.9593567
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.

The (cross-validated) AUROC for the Random Forest model is about 99.6%, a great result, similar to what we obtained before for the simple logistic regression model.
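
If you also want a test-set AUROC for the random forest, directly comparable to the ~0.991 obtained for the glm, here is a small sketch reusing the ROCR steps from Part 2 with the probability-enabled fit:

# Class probabilities on the test set, then the same ROCR pipeline as in Part 2
rfProb    <- predict(rfFitROC, newdata = cancerTest, type = "prob")[, "mal"]
rfPredROC <- prediction(rfProb, testValues)
performance(rfPredROC, measure = "auc")@y.values[[1]]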

Final thoughts

So it is up to you which model to use for this dataset. All four models seem to generalize well to the unseen observations in the test set. It doesn’t hurt to try them all, but in the future, for binomial logistic regression, I would stick to the simplest glm().

I know there is still a lot to do with this dataset (for example, I didn’t do any fancy visualizations here), but I think it is a good start for further data exploration.

Thanks for reading,

Data Geekette