1 Presentation of the dataset

The dataset used for this analysis is called the Wisconsin Breast Cancer Database and can be found in the UCI Machine Learning Repository.

This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg, and describes 699 clinical cases of malignant or benign tumors.

Nine cytological attributes were measured on a 1-10 scale for these patients between January 1989 and November 1991.

2 Loading and preparing the data

According to the help file associated with the database, some missing values are present in the data, denoted by “?”.
We can also add column names when reading the file.
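
A minimal sketch of how the file can be read, assuming a tidyverse workflow (the file name is an assumption):

```r
# Read the raw file, treating "?" as NA and supplying explicit column names
library(tidyverse)

cancer_cols <- c("sample_code_number", "clump_thickness", "uniformity_of_cell_size",
                 "uniformity_of_cell_shape", "marginal_adhesion",
                 "single_epithelial_cell_size", "bare_nuclei", "bland_chromatin",
                 "normal_nucleoli", "mitoses", "class")

cancer <- read_csv("breast-cancer-wisconsin.data", col_names = cancer_cols, na = "?")
```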

2.1 Identifier variable

The sample_code_number variable is an identifier and is not relevant for a machine learning approach.
However, although there are 699 records, only 645 unique identifiers are present.
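
One way to check this, assuming the cancer tibble read above:

```r
# Number of distinct patient identifiers
n_distinct(cancer$sample_code_number)
```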

## [1] 645

Which observations share the same identifier?
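
For example:

```r
# Identifiers that appear more than once, most frequent first
cancer %>%
  count(sample_code_number, sort = TRUE) %>%
  filter(n > 1)
```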

| sample_code_number | n |
|---|---|
| 1182404 | 6 |
| 1276091 | 5 |
| 1198641 | 3 |
| 1017023 | 2 |
| 1033078 | 2 |
| 1070935 | 2 |
| 1100524 | 2 |
| 1105524 | 2 |
| 1115293 | 2 |
| 1116116 | 2 |

Some observations occur up to 6 times under the same identifier!
Since we don't have any information about the exact time of evaluation, we can assume that some patients were followed over several months and were evaluated several times during the campaign.

 

However, a more concerning issue is that some observations (8 in total) are exact duplicates in the database, which could bias our analysis.
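
One possible way to list them, assuming the same cancer tibble (a sketch, not necessarily the original code):

```r
# Identifiers that carry at least one fully duplicated record
dup_ids <- unique(cancer$sample_code_number[duplicated(cancer)])

# Show every record with one of those identifiers
cancer %>%
  filter(sample_code_number %in% dup_ids) %>%
  arrange(sample_code_number)
```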

| sample_code_number | clump_thickness | uniformity_of_cell_size | uniformity_of_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|
| 1100524 | 6 | 10 | 10 | 2 | 8 | 10 | 7 | 3 | 3 | 4 |
| 1100524 | 6 | 10 | 10 | 2 | 8 | 10 | 7 | 3 | 3 | 4 |
| 1116116 | 9 | 10 | 10 | 1 | 10 | 8 | 3 | 3 | 1 | 4 |
| 1116116 | 9 | 10 | 10 | 1 | 10 | 8 | 3 | 3 | 1 | 4 |
| 1198641 | 3 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1198641 | 3 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1198641 | 10 | 10 | 6 | 3 | 3 | 10 | 4 | 3 | 2 | 4 |
| 1218860 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 2 |
| 1218860 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 2 |
| 320675 | 3 | 3 | 5 | 2 | 3 | 10 | 7 | 1 | 1 | 4 |
| 320675 | 3 | 3 | 5 | 2 | 3 | 10 | 7 | 1 | 1 | 4 |
| 704097 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 |
| 704097 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 |
| 1321942 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1321942 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 466906 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |
| 466906 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |

 

Since we don't know why these records are duplicated, and because they could bias the analysis, we will delete them.
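
A minimal sketch of this cleaning step (converting class to a factor is assumed to happen here as well):

```r
cancer <- cancer %>%
  distinct() %>%                    # drop the 8 fully duplicated records
  select(-sample_code_number) %>%   # drop the identifier column
  mutate(class = factor(class))     # 2 = benign, 4 = malignant

glimpse(cancer)
```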

## Observations: 691
## Variables: 10
## $ clump_thickness             <dbl> 5, 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, …
## $ uniformity_of_cell_size     <dbl> 1, 4, 1, 8, 1, 10, 1, 1, 1, 2, 1, 1,…
## $ uniformity_of_cell_shape    <dbl> 1, 4, 1, 8, 1, 10, 1, 2, 1, 1, 1, 1,…
## $ marginal_adhesion           <dbl> 1, 5, 1, 1, 3, 8, 1, 1, 1, 1, 1, 1, …
## $ single_epithelial_cell_size <dbl> 2, 7, 2, 3, 2, 7, 2, 2, 2, 2, 1, 2, …
## $ bare_nuclei                 <dbl> 1, 10, 2, 4, 1, 10, 10, 1, 1, 1, 1, …
## $ bland_chromatin             <dbl> 3, 3, 3, 3, 3, 9, 3, 3, 1, 2, 3, 2, …
## $ normal_nucleoli             <dbl> 1, 2, 1, 7, 1, 7, 1, 1, 1, 1, 1, 1, …
## $ mitoses                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, …
## $ class                       <fct> 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, …

2.3 Handling missing values

##  clump_thickness  uniformity_of_cell_size uniformity_of_cell_shape
##  Min.   : 1.000   Min.   : 1.00           Min.   : 1.000          
##  1st Qu.: 2.000   1st Qu.: 1.00           1st Qu.: 1.000          
##  Median : 4.000   Median : 1.00           Median : 1.000          
##  Mean   : 4.427   Mean   : 3.13           Mean   : 3.201          
##  3rd Qu.: 6.000   3rd Qu.: 5.00           3rd Qu.: 5.000          
##  Max.   :10.000   Max.   :10.00           Max.   :10.000          
##                                                                   
##  marginal_adhesion single_epithelial_cell_size  bare_nuclei    
##  Min.   : 1.000    Min.   : 1.000              Min.   : 1.000  
##  1st Qu.: 1.000    1st Qu.: 2.000              1st Qu.: 1.000  
##  Median : 1.000    Median : 2.000              Median : 1.000  
##  Mean   : 2.825    Mean   : 3.211              Mean   : 3.538  
##  3rd Qu.: 4.000    3rd Qu.: 4.000              3rd Qu.: 6.000  
##  Max.   :10.000    Max.   :10.000              Max.   :10.000  
##                                                NA's   :16      
##  bland_chromatin  normal_nucleoli     mitoses             class    
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   benign   :453  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000   malignant:238  
##  Median : 3.000   Median : 1.000   Median : 1.000                  
##  Mean   : 3.436   Mean   : 2.883   Mean   : 1.593                  
##  3rd Qu.: 5.000   3rd Qu.: 4.000   3rd Qu.: 1.000                  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000                  
## 

We have 16 missing values in the bare_nuclei variable. How do they relate to the response variable?
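
One way to visualize this, assuming ggplot2 (the aesthetic choices are assumptions):

```r
# Density of bare_nuclei for each class; the 16 NA rows are dropped, hence the warning
ggplot(cancer, aes(x = bare_nuclei, fill = class)) +
  geom_density(alpha = 0.5)
```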

## Warning: Removed 16 rows containing non-finite values (stat_density).

We can see that the bare_nuclei distribution is very different between the two class values (benign and malignant), so a basic mean imputation may not be a good solution.

2.3.1 Imputation

These missing values can be imputed and tracked, but the process is always tricky.
An easy way to implement imputation is to use the mice package.
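
A possible implementation with mice; m = 8 matches the eight imputation columns shown below, while the other arguments are assumptions:

```r
library(mice)

# Run 8 multiple imputations and collect one completed dataset
imp <- mice(cancer, m = 8, seed = 42, printFlag = FALSE)
imp$imp$bare_nuclei              # imputed values for the 16 missing bare_nuclei entries
cancer_complete <- complete(imp)
```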

##      1  2  3  4 5  6 7  8
## 24  10  3 10 10 7 10 8 10
## 41   5  7  2  5 5  5 7  5
## 140  1  1  1  1 1  1 1  1
## 146  1  1  1  1 1  5 1  2
## 159  1  1  1  1 2  2 1  1
## 165  1  1  1  1 1  4 1  1
## 235  1  5  5  1 1  1 1  1
## 249  1  1  1  1 8  1 1  1
## 271  1  1  8  1 1  2 2  1
## 288  6 10  3  6 3 10 9  1
## 290  1  1  5  1 1  1 1  1
## 293  1  1  3  1 1  1 3  1
## 311  8  2  2  1 8  1 7  1
## 317  1  1  1  1 8  1 1  2
## 406  1  1  1  1 1  1 1  1
## 611  1  1  1  1 1  1 1  3

The new cancer_complete dataset no longer contains any missing values.
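
We can verify it, for instance, with:

```r
sum(is.na(cancer_complete))
```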

## [1] 0

2.3.2 Deletion

Imputation, by construction, can introduce spurious patterns into the analysis.
Since we only have 16 missing values (2.32% of the observations), another solution is simply to delete the observations containing missing data.

This solution will be kept for the rest of the analysis.
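
A minimal sketch of this step, assuming tidyr's drop_na() and keeping the object name cancer:

```r
# Keep only the complete cases for the remainder of the analysis
cancer <- cancer %>% drop_na()
```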

2.4 Other variables

Let's take a quick look at the distribution of the other variables.

Some variables, such as bare_nuclei or clump_thickness, seem to create a clear separation between benign and malignant cases.

3 Model Building

Since we have a classification problem, we can use the following algorithms:

  • logistic regression
  • random forest

Before building these models, we need to create training and testing sets to assess the models' performance.

3.1 Training/Testing datasets

In order to fine-tune model parameters, a validation set is normally needed in addition to the training and testing sets.
Here, we won't need a separate validation dataset. Indeed, the random forest tuning wrapper tuneRF() relies on the out-of-bag (OOB) error, computed on the observations left out of each bootstrap sample, which acts as built-in validation for choosing the parameters.
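
A simple random split could look like this (the 75/25 ratio, the seed, and the way the split is drawn are assumptions, not necessarily the original code):

```r
set.seed(42)
train_idx <- sample(seq_len(nrow(cancer)), size = floor(0.75 * nrow(cancer)))

cancer_train <- cancer[train_idx, ]
cancer_test  <- cancer[-train_idx, ]
```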

Let's check the class distribution in the new sets.
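
For example:

```r
# Class proportions in the training and testing sets
prop.table(table(cancer_train$class))
prop.table(table(cancer_test$class))
```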

## 
##    benign malignant 
## 0.6534091 0.3465909
## 
##    benign malignant 
## 0.6394558 0.3605442

3.2 Scaling

Since all variables are on the same 1-10 scale, we don't need to scale the training variables.

3.3 Logistic regression
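
The model is fitted with glm() on the training set; a minimal sketch (the object name logit_model is an assumption):

```r
logit_model <- glm(class ~ ., family = "binomial", data = cancer_train)
summary(logit_model)
```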

## 
## Call:
## glm(formula = class ~ ., family = "binomial", data = cancer_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4016  -0.1252  -0.0643   0.0280   2.5307  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -10.07249    1.30683  -7.708 1.28e-14 ***
## clump_thickness               0.56222    0.15897   3.537 0.000405 ***
## uniformity_of_cell_size       0.01942    0.21580   0.090 0.928304    
## uniformity_of_cell_shape      0.24103    0.24071   1.001 0.316657    
## marginal_adhesion             0.30872    0.12484   2.473 0.013401 *  
## single_epithelial_cell_size   0.16143    0.16399   0.984 0.324908    
## bare_nuclei                   0.33035    0.09378   3.523 0.000427 ***
## bland_chromatin               0.44280    0.17722   2.499 0.012467 *  
## normal_nucleoli               0.19318    0.11598   1.666 0.095800 .  
## mitoses                       0.58425    0.38337   1.524 0.127509    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 681.448  on 527  degrees of freedom
## Residual deviance:  91.238  on 518  degrees of freedom
## AIC: 111.24
## 
## Number of Fisher Scoring iterations: 8

As already suggested by the earlier plots, clump_thickness and bare_nuclei are significant variables, with higher values of these predictors leading to a higher probability of a malignant tumor.

3.4 Random Forest

Random forest has a convenient tuning wrapper to determine the best parameters.
We will tune mtry, the number of variables randomly sampled as split candidates at each node, with the tuneRF() function; the number of trees will then be chosen 'by hand'.
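
A hedged sketch of the tuning call; the output below suggests stepFactor = 2 and improve = 0.05, while the seed is an assumption:

```r
library(randomForest)
set.seed(42)

tuned <- tuneRF(
  x          = select(cancer_train, -class),  # predictors
  y          = cancer_train$class,            # response
  stepFactor = 2,     # multiply / divide mtry by 2 at each step
  improve    = 0.05,  # minimum relative OOB improvement to keep searching
  trace      = TRUE
)
```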

## mtry = 3  OOB error = 2.84% 
## Searching left ...
## mtry = 2     OOB error = 3.22% 
## -0.1333333 0.05 
## Searching right ...
## mtry = 6     OOB error = 3.98% 
## -0.4 0.05

We can see that the classification error is minimized when mtry is equal to 3.
Let's build a model using this optimized value of mtry and visualize the Out-Of-Bag (OOB) error versus the number of trees.
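
A sketch of the final fit (the ntree upper bound and the object name rf_model are assumptions):

```r
rf_model <- randomForest(class ~ ., data = cancer_train, mtry = 3, ntree = 1000)

# OOB error as a function of the number of trees
plot(rf_model)
```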

We can see that beyond roughly 700 trees, there is no significant improvement in the classification error, so we will pick this value as the optimized one.

Plot the variable importance.
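
With the randomForest package this is a one-liner (rf_model from the sketch above):

```r
varImpPlot(rf_model)
```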

Contrary to the logistic regression model, uniformity_of_cell_size and uniformity_of_cell_shape appear to be the most relevant variables in the random forest model.

4 Bonus: H2O modeling approach

As a complement, we can try modeling with the H2O platform.
A very quick implementation can be done using “Automatic Machine Learning” via the h2o.automl() function.
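
The cluster is started locally from R:

```r
library(h2o)
h2o.init()
```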

## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     /tmp/RtmpZyGgNG/h2o_alex_started_from_r.out
##     /tmp/RtmpZyGgNG/h2o_alex_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: .. Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         4 seconds 892 milliseconds 
##     H2O cluster timezone:       Europe/Paris 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.22.1.1 
##     H2O cluster version age:    4 months and 14 days !!! 
##     H2O cluster name:           H2O_started_from_R_alex_uml253 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.56 GB 
##     H2O cluster total cores:    2 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.6.0 (2019-04-26)
## [1] 0

We will use the imputed cancer_complete dataset, which contains no missing values.

As with the previous approach, we have to build training and testing datasets.
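
A possible split (the 80/20 ratio, the seed, and the object names are assumptions):

```r
cancer_h2o <- as.h2o(cancer_complete)

splits    <- h2o.splitFrame(cancer_h2o, ratios = 0.8, seed = 42)
train_h2o <- splits[[1]]
test_h2o  <- splits[[2]]
```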

AutoML only needs a dataset, a target variable, and a time or model-count limit for training.
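
A minimal call could look like this (max_runtime_secs, the seed, and the object name aml are assumptions):

```r
aml <- h2o.automl(
  y                = "class",
  training_frame   = train_h2o,
  max_runtime_secs = 120,
  seed             = 42
)
```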

By extracting the created leaderboard, we can see that the best model for our data was built using a Gradient Boosting Machine algorithm.
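
The leaderboard is stored in the AutoML object (aml from the sketch above):

```r
aml@leaderboard
```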

##                                     model_id       auc    logloss
## 1 GBM_grid_1_AutoML_20190512_214820_model_11 0.9960819 0.10748830
## 2  GLM_grid_1_AutoML_20190512_214820_model_1 0.9960429 0.07734271
## 3 GBM_grid_1_AutoML_20190512_214820_model_17 0.9957115 0.07023189
## 4 GBM_grid_1_AutoML_20190512_214820_model_20 0.9954776 0.07604255
## 5 GBM_grid_1_AutoML_20190512_214820_model_15 0.9953996 0.08017143
## 6 GBM_grid_1_AutoML_20190512_214820_model_18 0.9952827 0.10242014
##   mean_per_class_error      rmse        mse
## 1           0.02461988 0.1527318 0.02332700
## 2           0.02210526 0.1450744 0.02104659
## 3           0.01918129 0.1367856 0.01871029
## 4           0.02125731 0.1467313 0.02153008
## 5           0.02669591 0.1515410 0.02296468
## 6           0.02000000 0.1512859 0.02288742
## 
## [51 rows x 6 columns]

Let's compare the performance of this model on the previously unseen test set.
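
A sketch of the comparison, assuming the objects defined above:

```r
pred   <- as.data.frame(h2o.predict(aml@leader, test_h2o))
actual <- as.data.frame(test_h2o)$class

# Does the leader model classify every test observation correctly?
all(as.character(pred$predict) == as.character(actual))
```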

The GBM model created with h2o.automl() achieves a perfect score of 100% on this previously unseen data!

## [1] TRUE