Instructions

Consider the Cervical Cancer (Risk Factors) data set (available from the UCI repository) and try to accurately classify Dx.Cancer.

You must compare different approaches and parameters of a) single decision tree, b) random forest.

Evaluation of derived models should follow a correct methodology, comparing different estimates of generalization error (e.g. holdout, cross-validation, bootstrap, …).

Submit a report (in PDF, generated from R) with the code and the resulting analysis.

The Dataset

The dataset comprises 36 variables: 24 categorical and 12 continuous. It was collected at ‘Hospital Universitario de Caracas’ in Caracas, Venezuela, and contains demographic information, habits, and historic medical records of 858 patients. Several patients decided not to answer some of the questions because of privacy concerns, which produces missing values, mainly in the questions about use of contraceptives, IUD and STDs.

The Exercise

The objective of this exercise is to attempt to classify the variable Dx:Cancer using some of the available algorithms and to compare them.

There are several algorithms that we may use for this:

Here is a “brief” example of the assumptions that we may take into account for each method:

All the Annoying Assumptions, from Dip Ranjan Chatterjee

Regression or Classification?

First, we need to explain what regression and classification are, since this often causes confusion:

  • Both are under the same umbrella of supervised machine learning.
  • Both utilize a known dataset (training dataset) to make predictions.
  • The main difference is the output: for regression it is numerical (or continuous), while for classification it is categorical (or discrete).

In this exercise we have a binary classification problem: each patient is classified as true (has cancer) or false (doesn’t have cancer), with no other class involved.

So, the current exercise will use the following classification methods:

  1. Decision tree
  2. Random forest

Decision Tree

First, let’s create a training partition with 80% of the data, leaving the remaining 20% as a test set:
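As an illustration of what such a partition involves, here is a minimal base R sketch of a stratified 80/20 split. The data frame `toy`, its columns and the class counts are made up for the example; in caret, `createDataPartition()` offers a comparable stratified split, although its exact row counts may differ from this sketch.

```r
# Synthetic stand-in for the real data (assumed sizes: 858 rows, 18 positive cases)
set.seed(42)
toy <- data.frame(
  age       = round(runif(858, 15, 70)),
  dx_cancer = factor(sample(rep(c("yes", "no"), times = c(18, 840))),
                     levels = c("yes", "no"))
)

# Draw 80% of the row indices within each class, so that both partitions
# keep roughly the same class ratio as the full data
rows_by_class <- split(seq_len(nrow(toy)), toy$dx_cancer)
split_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

train_set <- toy[split_idx, ]
test_set  <- toy[-split_idx, ]

table(train_set$dx_cancer)
table(test_set$dx_cancer)
```

Stratifying matters here because, with so few positive cases, a purely random split could leave the test set with almost no positives at all.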

Then, let’s fit one model using the Gini impurity index and another using information gain as the splitting criterion:
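To make the two criteria concrete, the sketch below computes the impurity reduction of a single candidate split under both of them, in base R. The labels and the `hpv` indicator are toy values, not the real data, and the helper names are hypothetical. In rpart the criterion is selected with `parms = list(split = "gini")` or `parms = list(split = "information")`.

```r
# Gini impurity of a vector of class labels
gini_impurity <- function(y) {
  p <- prop.table(table(y))
  1 - sum(p^2)
}

# Shannon entropy (in bits) of a vector of class labels
entropy <- function(y) {
  p <- prop.table(table(y))
  p <- p[p > 0]                 # 0 * log2(0) is treated as 0
  -sum(p * log2(p))
}

# Reduction in impurity obtained by splitting y on a logical condition
impurity_gain <- function(y, condition, impurity) {
  w <- mean(condition)          # share of rows sent to the first branch
  impurity(y) - (w * impurity(y[condition]) + (1 - w) * impurity(y[!condition]))
}

# Toy example (hypothetical values): 10 patients, 4 of them positive
y   <- c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no")
hpv <- c(TRUE,  TRUE,  TRUE,  FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)

gain_gini <- impurity_gain(y, hpv, gini_impurity)
gain_info <- impurity_gain(y, hpv, entropy)
c(gini = gain_gini, information = gain_info)
```

Both criteria reward splits that produce purer branches, so they frequently pick the same variable, which is consistent with the two trees agreeing on their first split.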

Both splitting criteria find that dx_hpv is a strong predictor for dx_cancer.

This seems a little odd (though not impossible), so let’s take a look at the dataset:

Using nearZeroVar from caret, we see that several variables are near-zero-variance predictors (i.e., variables that take almost a single value across samples).
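The idea behind a near-zero-variance check can be sketched in base R as follows. The function `nzv_metrics()` is a hypothetical helper, not caret's implementation: it loosely mirrors the two quantities that `nearZeroVar()` reports (frequency ratio and percentage of unique values), and the cutoffs of 95/5 and 10 are assumed here to match caret's defaults. The toy columns are invented for the example.

```r
# Flag a predictor when one value dominates and there are few distinct values
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)   # NAs are dropped by table()
  n_obs  <- sum(!is.na(x))
  # Ratio of the most common value to the second most common one
  freq_ratio     <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
  percent_unique <- 100 * length(counts) / n_obs
  zero_var       <- length(counts) == 1
  data.frame(
    freq_ratio     = freq_ratio,
    percent_unique = percent_unique,
    zero_var       = zero_var,
    nzv            = zero_var || (freq_ratio > freq_cut && percent_unique <= unique_cut)
  )
}

# Toy predictors (hypothetical): one spread out, one very rare, one moderately rare
toy <- data.frame(
  age    = rep(18:57, times = 5),
  dx_hpv = rep(c(1, 0), times = c(4, 196)),
  smokes = rep(c(1, 0), times = c(30, 170))
)

nzv_table <- do.call(rbind, lapply(toy, nzv_metrics))
nzv_table
```

In this toy data only `dx_hpv` is flagged, because 196 of its 200 values are identical. A rare binary indicator like this is flagged for the same reason that the outcome itself is hard to model: very few positive cases.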

We don’t want to simply remove these variables, so we will try another approach: subsampling.

Subsampling

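Subsampling tries to balance a class-imbalanced dataset. We don’t want to artificially balance the test set, so we will take the training set and subsample it before model fitting. According to the caret package help pages, there are two issues with this approach: the holdout samples generated during resampling are also subsampled, so they may not reflect the class imbalance that future predictions will encounter, which tends to produce overly optimistic performance estimates; and the subsampling process adds model uncertainty, since the results could differ under a different subsample.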

As an alternative, the author also suggests including the subsampling inside the usual resampling procedure (cross-validation, bootstrap, …), at the cost of additional computing time.
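A minimal sketch of subsampling inside the resampling loop is given below, using stratified 5-fold cross-validation with rpart. Everything about the data is an assumption made for illustration: the predictors `x1` and `x2`, the 30 positive cases out of 400, and the way the outcome is generated. With caret, a similar effect is requested through `trainControl(sampling = "down")`.

```r
library(rpart)  # recommended package, normally installed together with R

set.seed(2024)
n     <- 400
toy   <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
score <- toy$x1 + rnorm(n, sd = 0.5)
# Hypothetical rare outcome: the 30 highest scores are the positive cases
toy$dx_cancer <- factor(ifelse(rank(-score) <= 30, "yes", "no"),
                        levels = c("yes", "no"))

# Stratified fold assignment, so that every fold contains positive cases
k     <- 5
folds <- integer(n)
for (idx in split(seq_len(n), toy$dx_cancer)) {
  folds[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
}

sens <- spec <- numeric(k)
for (i in seq_len(k)) {
  train_fold <- toy[folds != i, ]
  test_fold  <- toy[folds == i, ]   # left untouched: keeps the natural imbalance

  # Down-sample the majority class in the training fold only
  pos <- which(train_fold$dx_cancer == "yes")
  neg <- which(train_fold$dx_cancer == "no")
  neg <- neg[sample.int(length(neg), length(pos))]
  balanced <- train_fold[c(pos, neg), ]

  fit  <- rpart(dx_cancer ~ ., data = balanced, method = "class")
  pred <- predict(fit, test_fold, type = "class")

  sens[i] <- mean(pred[test_fold$dx_cancer == "yes"] == "yes")
  spec[i] <- mean(pred[test_fold$dx_cancer == "no"] == "no")
}

c(mean_sensitivity = mean(sens), mean_specificity = mean(spec))
```

Because each held-out fold keeps its original class ratio, the resulting estimates are less prone to the optimism described above. The price is that the subsampling is repeated for every fold.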

Here we will try two algorithms: down-sampling and up-sampling. The latter is known to be more optimistic, but since we have so few positive cases, it seems the more interesting one for us.
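The two strategies can be sketched in base R as follows. The data frame `train_toy` and its 5-versus-95 class split are invented for the example; caret provides `downSample()` and `upSample()` for the same purpose.

```r
set.seed(123)
# Hypothetical imbalanced training set: 5 positive and 95 negative rows
train_toy <- data.frame(
  x         = rnorm(100),
  dx_cancer = factor(rep(c("yes", "no"), times = c(5, 95)),
                     levels = c("yes", "no"))
)

rows_by_class <- split(seq_len(nrow(train_toy)), train_toy$dx_cancer)
class_sizes   <- lengths(rows_by_class)

# Down-sampling: keep as many rows of each class as the smallest class has,
# drawn without replacement
down_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), min(class_sizes))]
}))

# Up-sampling: keep every original row and add rows drawn with replacement
# until each class is as large as the largest class
up_idx <- unlist(lapply(rows_by_class, function(idx) {
  extra <- max(class_sizes) - length(idx)
  c(idx, idx[sample.int(length(idx), extra, replace = TRUE)])
}))

down_train <- train_toy[down_idx, ]
up_train   <- train_toy[up_idx, ]

table(down_train$dx_cancer)   # 5 yes, 5 no
table(up_train$dx_cancer)     # 95 yes, 95 no
```

Down-sampling throws away most of the negative cases, while up-sampling repeats the same few positive cases many times. Those repeated rows can land on both sides of a resampling split, which is one reason up-sampled estimates tend to look optimistic.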

Random Forest and evaluations

First we will compare both models using the down-sampled training data:


Call:
summary.resamples(object = results)

Models: rpart, rf 
Number of resamples: 150 

ROC 
           Min. 1st Qu. Median      Mean 3rd Qu. Max. NA's
rpart 0.6666667       1      1 0.9655556       1    1    0
rf    1.0000000       1      1 1.0000000       1    1    0

Sens 
           Min. 1st Qu. Median      Mean 3rd Qu. Max. NA's
rpart 0.3333333       1      1 0.9311111       1    1    0
rf    0.3333333       1      1 0.9311111       1    1    0

Spec 
      Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rpart    1       1      1    1       1    1    0
rf       1       1      1    1       1    1    0

Next, the same comparison using the up-sampled training data:


Call:
summary.resamples(object = results)

Models: rpart, rf 
Number of resamples: 150 

ROC 
           Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
rpart 0.9948207 0.9990811 0.9996658 0.9988108 0.9999095    1    0
rf    1.0000000 1.0000000 1.0000000 1.0000000 1.0000000    1    0

Sens 
      Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rpart    1       1      1    1       1    1    0
rf       1       1      1    1       1    1    0

Spec 
           Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
rpart 0.9701493 0.9850746 0.9925373 0.9896020 0.9925373    1    0
rf    0.9850746 1.0000000 1.0000000 0.9979602 1.0000000    1    0
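For reference, a comparison of this kind can be produced with caret roughly as sketched below. This is not the report's original code: the toy data, the seeds and the resampling settings (5 folds, 3 repeats) are assumptions chosen to keep the example small, and `train()` additionally needs the rpart, randomForest and pROC packages to be installed.

```r
library(caret)  # method = "rf" needs randomForest; twoClassSummary needs pROC

set.seed(7)
n     <- 300
toy   <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
score <- toy$x1 + rnorm(n, sd = 0.5)
# Hypothetical rare outcome: the 20 highest scores are the positive cases
y <- factor(ifelse(rank(-score) <= 20, "yes", "no"), levels = c("yes", "no"))

# Balance the training data before fitting (down-sampling)
down_train <- downSample(x = toy, y = y, yname = "dx_cancer")

# Repeated cross-validation reporting ROC, Sens and Spec
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# Same seed before each fit so both models see the same resamples
set.seed(99)
fit_rpart <- train(dx_cancer ~ ., data = down_train, method = "rpart",
                   metric = "ROC", trControl = ctrl)
set.seed(99)
fit_rf <- train(dx_cancer ~ ., data = down_train, method = "rf",
                metric = "ROC", trControl = ctrl)

results <- resamples(list(rpart = fit_rpart, rf = fit_rf))
summary(results)
```

Note that the held-out folds in this sketch come from data that has already been balanced, so the resampled ROC, Sens and Spec values are subject to the optimism discussed in the Subsampling section.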

Validation

Now it is time to validate the models using the test set we created from the original dataset:

Confusion Matrix and Statistics

          Reference
Prediction yes  no
       yes   3   4
       no    0 164
                                          
               Accuracy : 0.9766          
                 95% CI : (0.9412, 0.9936)
    No Information Rate : 0.9825          
    P-Value [Acc > NIR] : 0.8168          
                                          
                  Kappa : 0.5899          
                                          
 Mcnemar's Test P-Value : 0.1336          
                                          
            Sensitivity : 1.00000         
            Specificity : 0.97619         
         Pos Pred Value : 0.42857         
         Neg Pred Value : 1.00000         
             Prevalence : 0.01754         
         Detection Rate : 0.01754         
   Detection Prevalence : 0.04094         
      Balanced Accuracy : 0.98810         
                                          
       'Positive' Class : yes             
                                          
Confusion Matrix and Statistics

          Reference
Prediction yes  no
       yes   3   4
       no    0 164
                                          
               Accuracy : 0.9766          
                 95% CI : (0.9412, 0.9936)
    No Information Rate : 0.9825          
    P-Value [Acc > NIR] : 0.8168          
                                          
                  Kappa : 0.5899          
                                          
 Mcnemar's Test P-Value : 0.1336          
                                          
            Sensitivity : 1.00000         
            Specificity : 0.97619         
         Pos Pred Value : 0.42857         
         Neg Pred Value : 1.00000         
             Prevalence : 0.01754         
         Detection Rate : 0.01754         
   Detection Prevalence : 0.04094         
      Balanced Accuracy : 0.98810         
                                          
       'Positive' Class : yes             
                                          
Confusion Matrix and Statistics

          Reference
Prediction yes  no
       yes   2   1
       no    1 167
                                          
               Accuracy : 0.9883          
                 95% CI : (0.9584, 0.9986)
    No Information Rate : 0.9825          
    P-Value [Acc > NIR] : 0.4212          
                                          
                  Kappa : 0.6607          
                                          
 Mcnemar's Test P-Value : 1.0000          
                                          
            Sensitivity : 0.66667         
            Specificity : 0.99405         
         Pos Pred Value : 0.66667         
         Neg Pred Value : 0.99405         
             Prevalence : 0.01754         
         Detection Rate : 0.01170         
   Detection Prevalence : 0.01754         
      Balanced Accuracy : 0.83036         
                                          
       'Positive' Class : yes             
                                          
Confusion Matrix and Statistics

          Reference
Prediction yes  no
       yes   3   4
       no    0 164
                                          
               Accuracy : 0.9766          
                 95% CI : (0.9412, 0.9936)
    No Information Rate : 0.9825          
    P-Value [Acc > NIR] : 0.8168          
                                          
                  Kappa : 0.5899          
                                          
 Mcnemar's Test P-Value : 0.1336          
                                          
            Sensitivity : 1.00000         
            Specificity : 0.97619         
         Pos Pred Value : 0.42857         
         Neg Pred Value : 1.00000         
             Prevalence : 0.01754         
         Detection Rate : 0.01754         
   Detection Prevalence : 0.04094         
      Balanced Accuracy : 0.98810         
                                          
       'Positive' Class : yes             
                                          

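We see that all models performed similarly in the validation step: three of the four confusion matrices are identical (3 true positives, 4 false positives, no false negatives), and the remaining one has 2 true positives, 1 false positive and 1 false negative. With only 3 positive cases among the 171 test patients, these differences rest on very few observations. Even augmenting the dataset isn’t enough to overcome the inherent problem of such heavy class imbalance. I would dare to say that, in the validation step, the best result came from down-sampling rather than up-sampling, since it produced fewer false negatives, which are a serious problem in diagnostic tests.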
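To see how the reported statistics follow from the four cells of each confusion matrix, the base R sketch below recomputes them from the counts shown above. The helper `cm_metrics()` is a hypothetical name introduced only for this illustration.

```r
# Derive test-set metrics from the four cells of a 2x2 confusion matrix
cm_metrics <- function(tp, fp, fn, tn) {
  n           <- tp + fp + fn + tn
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  accuracy    <- (tp + tn) / n
  # Agreement expected by chance, from the row and column totals (Cohen's kappa)
  expected    <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
  c(accuracy          = accuracy,
    sensitivity       = sensitivity,
    specificity       = specificity,
    pos_pred_value    = tp / (tp + fp),
    balanced_accuracy = (sensitivity + specificity) / 2,
    kappa             = (accuracy - expected) / (1 - expected))
}

# The two distinct confusion matrices reported in the validation step
no_false_negatives <- cm_metrics(tp = 3, fp = 4, fn = 0, tn = 164)
one_false_negative <- cm_metrics(tp = 2, fp = 1, fn = 1, tn = 167)

round(rbind(no_false_negatives, one_false_negative), 4)
```

The comparison shows why accuracy alone is misleading here: the matrix with one false negative has the higher accuracy and kappa, yet it misses one of only three cancer cases, which drops its sensitivity to 0.6667 and its balanced accuracy to 0.8304.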

