Instructions
Consider the the Cervical Cancer (Risk Factors) data set (available from UCI repository) and try to accurately classify Dx.Cancer.
You must compare different approaches and parameters of a) single decision tree, b) random forest.
Evaluation of derived models should follow a correct methodology, comparing different estimates of generalization error (i.e. holdout, cross-validation, bootstrap, …)
Submit a report (in PDF, generated from R) with the code and the resulting analysis.
The Dataset
The dataset comprises of 36 variables, being 24 categorical and 12 continuous variables. The dataset was collected at ‘Hospital Universitario de Caracas’ in Caracas, Venezuela. The dataset comprises demographic information, habits, and historic medical records of 858 patients. Several patients decided not to answer some of the questions because of privacy concerns (missing values), mainly use of contraceptives, IUD and concerning the STDs.
The Exercise
The objective of this exercise is to attemtp to classify the variable Dx:Cancer using some available algorithms and compare them.
There are several algorithms for that we may use:
- Linear regression, Logistic Regression and Support Vector Machines
- Perceptrons and Artificial Neural Networks
- Naive Bayes Classifier
- Decision Trees
- k-nearest neighbor
Here is a “brief” example of assumptions that we may take in account for each method:
Regression or Classification?
First, we need to explain what is Regression and what is Classification, since this often causes confusion:
- Both are under the same umbrella of supervised machine learning.
- Both utilize a known dataset (training dataset) to make predictions.
- The main difference is the output: for regression is numerical (or continuous) while for classification is categorical (or discrete).
In this exercise, we have a classification problem, that will attempt to classify for true (has cancer) or false (doesn’t have cancer), having no other class involved.
So, the current exercise will use the following classification methods:
- Decision tree
- Random forest
Decision Tree
First, let’s create a partition with 80% of the data:
Than, let’s create a model with Gini impurity index and Information Gain:
# Learn tree with Gini impurity
tree.gini <- rpart(
dx_cancer ~ .,
data = data.train,
parms = list(split = "gini"),
method = "class"
)
# Lean tree with information gain
tree.information <- rpart(
dx_cancer ~ .,
data = data.train,
parms = list(split = "information"),
method = "class"
)
par(mfrow = c(1, 2))
rpart.plot(tree.gini, main = "Gini Index", type = 5)
rpart.plot(tree.information, main = "Information Gain", type = 5)

Both algorithms find that dx_hpv is a strong predictor for dx_cancer.
This seems a little odd (not impossible), let’s take a look in the dataset:
Using the nearZeroVar from caret we see that there are several variables that contains near-zero variance predictors (i.e.: variables that take an unique value across samples).
We don’t want to simply remove these variables. So we will try another approach: subsampling.
Subsampling
Subsampling tries to balance a class-unbalanced dataset. We don’t want to artificially balance the test set, so we will take the training set and, before model fitting, and sample the data. There are two issues with this approach according to caret package help pages:
Firstly, during model tuning the holdout samples generated during resampling are also glanced and may not reflect the class imbalance that future predictions would encounter. This is likely to lead to overly optimistic estimates of performance.
Secondly, the subsampling process will probably induce more model uncertainty. Would the model results differ under a different subsample? As above, the resampling statistics are more likely to make the model appear more effective than it actually is.
The author also suggests as an alternative, to include the subsampling inside of the usual resampling procedure (cross-validation, bootstrap, …), in exchange of computing time.
There we will try two algorithms, down-sampling and up-sampling. The latter is know to be more optimistic, but since we have so few cases, it seems the more interesting for us.
Down Sampling
First, create the down-sampled training set.
Now let’s model again:
# Learn tree with Gini impurity
tree.gini <- rpart(
dx_cancer ~ .,
data = down_train,
parms = list(split = "gini"),
method = "class"
)
# Lean tree with information gain
tree.information <- rpart(
dx_cancer ~ .,
data = down_train,
parms = list(split = "information"),
method = "class"
)
par(mfrow = c(1, 2))
rpart.plot(tree.gini, main = "Gini Index", type = 5)
rpart.plot(tree.information, main = "Information Gain", type = 5)

We still have the same as before: a strong prediction based on dx_hpv.
Up Sampling
First, create the up-sampled training set.
Now let’s model again:
# Learn tree with Gini impurity
tree.gini <- rpart(
dx_cancer ~ .,
data = up_train,
parms = list(split = "gini"),
method = "class"
)
# Lean tree with information gain
tree.information <- rpart(
dx_cancer ~ .,
data = up_train,
parms = list(split = "information"),
method = "class"
)
par(mfrow = c(1, 2))
rpart.plot(tree.gini, main = "Gini Index", type = 5)
rpart.plot(tree.information, main = "Information Gain", type = 5)

Now we have something interesting. It seems that dx became also important for our decision tree.
In the next step we will then evaluate the Decision Tree model and the Random Forest model.
Random Forest and evaluations
First we will compare both models using the down-sampled training data:
metric <- "ROC"
control <- trainControl(
method = "LGOCV", p = 0.8, number = 150,
summaryFunction = twoClassSummary,
classProbs = T,
savePredictions = T
)
set.seed(7)
fit.rpart.down <- train(
dx_cancer ~ .,
data = down_train,
method = "rpart",
metric = metric,
trControl = control,
na.action = na.rpart
)
# Random Forest
set.seed(7)
fit.rf.down <- train(
dx_cancer ~ .,
data = down_train,
method = "rf",
metric = metric,
trControl = control,
na.action = randomForest::na.roughfix
)
# Summarize accuracy of models
results <- resamples(list(rpart = fit.rpart.down, rf = fit.rf.down))
summary(results)
Call:
summary.resamples(object = results)
Models: rpart, rf
Number of resamples: 150
ROC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rpart 0.6666667 1 1 0.9655556 1 1 0
rf 1.0000000 1 1 1.0000000 1 1 0
Sens
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rpart 0.3333333 1 1 0.9311111 1 1 0
rf 0.3333333 1 1 0.9311111 1 1 0
Spec
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rpart 1 1 1 1 1 1 0
rf 1 1 1 1 1 1 0
Setting levels: control = 1, case = 2
Setting direction: controls < cases

Setting levels: control = 1, case = 2
Setting direction: controls < cases


metric <- "ROC"
control <- trainControl(
method = "LGOCV", p = 0.8, number = 150,
summaryFunction = twoClassSummary,
classProbs = T,
savePredictions = T
)
set.seed(7)
fit.rpart.up <- train(
dx_cancer ~ .,
data = up_train,
method = "rpart",
metric = metric,
trControl = control,
na.action = na.rpart
)
# Random Forest
set.seed(7)
fit.rf.up <- train(
dx_cancer ~ .,
data = up_train,
method = "rf",
metric = metric,
trControl = control,
na.action = randomForest::na.roughfix
)
# Summarize accuracy of models
results <- resamples(list(rpart = fit.rpart.up, rf = fit.rf.up))
summary(results)
Call:
summary.resamples(object = results)
Models: rpart, rf
Number of resamples: 150
ROC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rpart 0.9948207 0.9990811 0.9996658 0.9988108 0.9999095 1 0
rf 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1 0
Sens
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rpart 1 1 1 1 1 1 0
rf 1 1 1 1 1 1 0
Spec
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rpart 0.9701493 0.9850746 0.9925373 0.9896020 0.9925373 1 0
rf 0.9850746 1.0000000 1.0000000 0.9979602 1.0000000 1 0
Setting levels: control = 1, case = 2
Setting direction: controls < cases

Setting levels: control = 1, case = 2
Setting direction: controls < cases


Validation
Now it is time to validate the models using the test set we created from the original dataset:
Confusion Matrix and Statistics
Reference
Prediction yes no
yes 3 4
no 0 164
Accuracy : 0.9766
95% CI : (0.9412, 0.9936)
No Information Rate : 0.9825
P-Value [Acc > NIR] : 0.8168
Kappa : 0.5899
Mcnemar's Test P-Value : 0.1336
Sensitivity : 1.00000
Specificity : 0.97619
Pos Pred Value : 0.42857
Neg Pred Value : 1.00000
Prevalence : 0.01754
Detection Rate : 0.01754
Detection Prevalence : 0.04094
Balanced Accuracy : 0.98810
'Positive' Class : yes
Confusion Matrix and Statistics
Reference
Prediction yes no
yes 3 4
no 0 164
Accuracy : 0.9766
95% CI : (0.9412, 0.9936)
No Information Rate : 0.9825
P-Value [Acc > NIR] : 0.8168
Kappa : 0.5899
Mcnemar's Test P-Value : 0.1336
Sensitivity : 1.00000
Specificity : 0.97619
Pos Pred Value : 0.42857
Neg Pred Value : 1.00000
Prevalence : 0.01754
Detection Rate : 0.01754
Detection Prevalence : 0.04094
Balanced Accuracy : 0.98810
'Positive' Class : yes
Confusion Matrix and Statistics
Reference
Prediction yes no
yes 2 1
no 1 167
Accuracy : 0.9883
95% CI : (0.9584, 0.9986)
No Information Rate : 0.9825
P-Value [Acc > NIR] : 0.4212
Kappa : 0.6607
Mcnemar's Test P-Value : 1.0000
Sensitivity : 0.66667
Specificity : 0.99405
Pos Pred Value : 0.66667
Neg Pred Value : 0.99405
Prevalence : 0.01754
Detection Rate : 0.01170
Detection Prevalence : 0.01754
Balanced Accuracy : 0.83036
'Positive' Class : yes
Confusion Matrix and Statistics
Reference
Prediction yes no
yes 3 4
no 0 164
Accuracy : 0.9766
95% CI : (0.9412, 0.9936)
No Information Rate : 0.9825
P-Value [Acc > NIR] : 0.8168
Kappa : 0.5899
Mcnemar's Test P-Value : 0.1336
Sensitivity : 1.00000
Specificity : 0.97619
Pos Pred Value : 0.42857
Neg Pred Value : 1.00000
Prevalence : 0.01754
Detection Rate : 0.01754
Detection Prevalence : 0.04094
Balanced Accuracy : 0.98810
'Positive' Class : yes
We see that all algorithms had similar performances in the validation step. Even augmenting the dataset isn’t enough to overcome the inherent problems of this dataset with such heavy class imbalances. I would dare to say that in the validation step, the best result was with down sampling instead of upsampling, since we had fewer false negatives (that is a big problem in diagnostic tests).
---
title: "HEADS - HIDA - LEARN: Assignment 1"
output: 
  html_notebook: 
    fig_height: 8
    fig_width: 10
    highlight: pygments
    theme: united
    toc: yes
author: Francisco Bischoff
---

## Instructions

Consider the the Cervical Cancer (Risk Factors) data set (available from UCI repository) and try to accurately classify Dx.Cancer.

You must compare different approaches and parameters of a) single decision tree, b) random forest.

Evaluation of derived models should follow a correct methodology, comparing different estimates of generalization error (i.e. holdout, cross-validation, bootstrap, ...)

Submit a report (in PDF, generated from R) with the code and the resulting analysis. 

```{r setup, include = FALSE}
knitr::opts_chunk$set(warning = TRUE)
library(readr)
library(caret)
library(expss)
library(rpart)
library(rpart.plot) # better plots
library(pROC)
dataset <- read_csv("risk_factors_cervical_cancer.csv", na = "?")
# coerce it to data.frame, for simplicity
dataset <- data.frame(dataset)
columns <- c(
  "age",
  "partners",
  "sexarca",
  "pregnancies",
  "smokes",
  "smokes_years",
  "smokers_pack_year",
  "ohc",
  "ohc_years",
  "iud",
  "iud_years",
  "std",
  "std_num",
  "std_condyloma",
  "std_cervical",
  "std_vaginal",
  "std_vulvo_perit",
  "std_syphilis",
  "std_pid",
  "std_herpes",
  "std_molluscum",
  "std_aids",
  "std_hiv",
  "std_hepb",
  "std_hpv",
  "std_num_of_diag",
  "std_time_since_first",
  "std_time_since_last",
  "dx_cancer",
  "dx_cin",
  "dx_hpv",
  "dx",
  "hinselmann",
  "schiller",
  "citology",
  "biopsy"
)
names(dataset) <- columns
# ugly way to move the dependant variable to the end:
dataset <- cbind(dataset[, -which(columns == "dx_cancer")],
                 dx_cancer = dataset$dx_cancer)

dataset$smokes <- factor(dataset$smokes,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$ohc <- factor(dataset$ohc,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$iud <- factor(dataset$iud,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$std <- factor(dataset$std,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$std_condyloma <- factor(dataset$std_condyloma,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$std_cervical <- factor(dataset$std_cervical,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$std_vaginal <- factor(dataset$std_vaginal,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$std_vulvo_perit <- factor(dataset$std_vulvo_perit,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$std_syphilis <- factor(dataset$std_syphilis,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$std_pid <- factor(dataset$std_pid,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$std_herpes <- factor(dataset$std_herpes,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$std_molluscum <- factor(dataset$std_molluscum,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$std_aids <- factor(dataset$std_aids,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$std_hiv <- factor(dataset$std_hiv,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$std_hepb <- factor(dataset$std_hepb,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$std_hpv <- factor(dataset$std_hpv,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$dx_cancer <- factor(dataset$dx_cancer,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$dx_cin <- factor(dataset$dx_cin,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$dx_hpv <- factor(dataset$dx_hpv,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$dx <- factor(dataset$dx,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$hinselmann <- factor(dataset$hinselmann,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$schiller <- factor(dataset$schiller,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$citology <- factor(dataset$citology,
  levels = c(1, 0),
  labels = c("yes", "no")
)
dataset$biopsy <- factor(dataset$biopsy,
  levels = c(1, 0),
  labels = c("yes", "no")
)
```

## The Dataset

The dataset comprises of 36 variables, being 24 categorical and 12 continuous variables. The dataset was collected at 'Hospital Universitario de Caracas' in Caracas, Venezuela. The dataset comprises demographic information, habits, and historic medical records of 858 patients. Several patients decided not to answer some of the questions because of privacy concerns (missing values), mainly use of contraceptives, IUD and concerning the STDs.

## The Exercise

The objective of this exercise is to attemtp to classify the variable Dx:Cancer using some available algorithms and compare them.

There are several algorithms for that we may use:

- Linear regression, Logistic Regression and Support Vector Machines
- Perceptrons and Artificial Neural Networks
- Naive Bayes Classifier
- Decision Trees
- k-nearest neighbor

Here is a "brief" example of assumptions that we may take in account for each method:

![](https://miro.medium.com/max/1728/1*kO2xE3tQFxg9xp0ZSKWOIQ.png)<div align = "center">All the Annoying Assumptions, from [Dip Ranjan Chatterjee
](https://towardsdatascience.com/all-the-annoying-assumptions-31b55df246c3)</div>

### Regression or Classification?
First, we need to explain what is Regression and what is Classification, since this often causes confusion:

- Both are under the same umbrella of supervised machine learning.
- Both utilize a known dataset (training dataset) to make predictions.
- The main difference is the output: for regression is numerical (or continuous) while for classification is categorical (or discrete).

In this exercise, we have a classification problem, that will attempt to classify for **`true`** (has cancer) or **`false`** (doesn't have cancer), having no other class involved.

So, the current exercise will use the following classification methods:

a) Decision tree
b) Random forest

## Decision Tree

First, let's create a partition with 80% of the data:

```{r prepare_data}
set.seed(2020)
# Create the indexes of the 80% partition
part_idxs <-
  createDataPartition(dataset$dx_cancer, p = 0.8, list = FALSE)
data.train <- dataset[part_idxs, ]
data.test <- dataset[-part_idxs, ]
```

Than, let's create a model with Gini impurity index and Information Gain:

```{r single_tree, fig.height=4, fig.width=10}
# Learn tree with Gini impurity
tree.gini <- rpart(
  dx_cancer ~ .,
  data = data.train,
  parms = list(split = "gini"),
  method = "class"
)
# Lean tree with information gain
tree.information <- rpart(
  dx_cancer ~ .,
  data = data.train,
  parms = list(split = "information"),
  method = "class"
)
par(mfrow = c(1, 2))
rpart.plot(tree.gini, main = "Gini Index", type = 5)
rpart.plot(tree.information, main = "Information Gain", type = 5)
```

Both algorithms find that **dx_hpv** is a strong predictor for **dx_cancer**.

This seems a little odd (not impossible), let's take a look in the dataset:

```{r check_data, echo=TRUE}
near <- nearZeroVar(dataset, saveMetrics = TRUE)
near[near[, "zeroVar"] + near[, "nzv"] > 0, ]
```

Using the `nearZeroVar` from `caret` we see that there are several variables that contains near-zero variance predictors (i.e.: variables that take an unique value across samples).

We don't want to simply remove these variables. So we will try another approach: subsampling.

## Subsampling

Subsampling tries to balance a class-unbalanced dataset. We don't want to artificially balance the test set, so we will take the training set and, before model fitting, and sample the data. There are two issues with this approach according to `caret` package help pages:

- Firstly, during model tuning the holdout samples generated during resampling are also glanced and may not reflect the class imbalance that future predictions would encounter. This is likely to lead to overly optimistic estimates of performance.

- Secondly, the subsampling process will probably induce more model uncertainty. Would the model results differ under a different subsample? As above, the resampling statistics are more likely to make the model appear more effective than it actually is.

The author also suggests as an alternative, to include the subsampling inside of the usual resampling procedure (cross-validation, bootstrap, ...), in exchange of computing time.

There we will try two algorithms, _down-sampling_ and _up-sampling_. The latter is know to be more optimistic, but since we have so few cases, it seems the more interesting for us.

### Down Sampling

First, create the down-sampled training set.

```{r down_sample}
set.seed(2020)
down_train <- downSample(
  x = data.train[, -ncol(data.train)],
  y = data.train$dx_cancer,
  yname = "dx_cancer"
)
```

Now let's model again:

```{r down_tree, fig.height=4, fig.width=10}
# Learn tree with Gini impurity
tree.gini <- rpart(
  dx_cancer ~ .,
  data = down_train,
  parms = list(split = "gini"),
  method = "class"
)
# Lean tree with information gain
tree.information <- rpart(
  dx_cancer ~ .,
  data = down_train,
  parms = list(split = "information"),
  method = "class"
)
par(mfrow = c(1, 2))
rpart.plot(tree.gini, main = "Gini Index", type = 5)
rpart.plot(tree.information, main = "Information Gain", type = 5)
```

We still have the same as before: a strong prediction based on **dx_hpv**.

### Up Sampling

First, create the up-sampled training set.

```{r up_sample}
set.seed(2020)
up_train <- upSample(
  x = data.train[, -ncol(data.train)],
  y = data.train$dx_cancer,
  yname = "dx_cancer"
)
```

Now let's model again:

```{r up_tree, fig.height=4, fig.width=10}
# Learn tree with Gini impurity
tree.gini <- rpart(
  dx_cancer ~ .,
  data = up_train,
  parms = list(split = "gini"),
  method = "class"
)
# Lean tree with information gain
tree.information <- rpart(
  dx_cancer ~ .,
  data = up_train,
  parms = list(split = "information"),
  method = "class"
)
par(mfrow = c(1, 2))
rpart.plot(tree.gini, main = "Gini Index", type = 5)
rpart.plot(tree.information, main = "Information Gain", type = 5)
```

Now we have something interesting. It seems that **dx** became also important for our decision tree.

In the next step we will then evaluate the Decision Tree model and the Random Forest model.

## Random Forest and evaluations

First we will compare both models using the down-sampled training data:

```{r down_evaluation, warning=FALSE}
metric <- "ROC"
control <- trainControl(
  method = "LGOCV", p = 0.8, number = 150,
  summaryFunction = twoClassSummary,
  classProbs = T,
  savePredictions = T
)

set.seed(7)
fit.rpart.down <- train(
  dx_cancer ~ .,
  data = down_train,
  method = "rpart",
  metric = metric,
  trControl = control,
  na.action = na.rpart
)
# Random Forest
set.seed(7)
fit.rf.down <- train(
  dx_cancer ~ .,
  data = down_train,
  method = "rf",
  metric = metric,
  trControl = control,
  na.action = randomForest::na.roughfix
)

# Summarize accuracy of models
results <- resamples(list(rpart = fit.rpart.down, rf = fit.rf.down))
summary(results)
plot.roc(as.numeric(fit.rpart.down$pred$obs), as.numeric(fit.rpart.down$pred$pred), main = fit.rpart.down$method, print.auc = T)
plot.roc(as.numeric(fit.rf.down$pred$obs), as.numeric(fit.rf.down$pred$pred), main = fit.rf.down$method, print.auc = T)
# Compare accuracy of models
dotplot(results)
```

```{r up_evaluation, warning=FALSE}
metric <- "ROC"
control <- trainControl(
  method = "LGOCV", p = 0.8, number = 150,
  summaryFunction = twoClassSummary,
  classProbs = T,
  savePredictions = T
)

set.seed(7)
fit.rpart.up <- train(
  dx_cancer ~ .,
  data = up_train,
  method = "rpart",
  metric = metric,
  trControl = control,
  na.action = na.rpart
)
# Random Forest
set.seed(7)
fit.rf.up <- train(
  dx_cancer ~ .,
  data = up_train,
  method = "rf",
  metric = metric,
  trControl = control,
  na.action = randomForest::na.roughfix
)

# Summarize accuracy of models
results <- resamples(list(rpart = fit.rpart.up, rf = fit.rf.up))
summary(results)
plot.roc(as.numeric(fit.rpart.up$pred$obs), as.numeric(fit.rpart.up$pred$pred), main = fit.rpart.up$method, print.auc = T)
plot.roc(as.numeric(fit.rf.up$pred$obs), as.numeric(fit.rf.up$pred$pred), main = fit.rf.up$method, print.auc = T)
# Compare accuracy of models
dotplot(results)
```

## Validation

Now it is time to validate the models using the test set we created from the original dataset:

```{r validation}
# Make predictions
pred.rpart.up <- predict(fit.rpart.up, data.test, na.action = na.rpart)
pred.rpart.down <- predict(fit.rpart.up, data.test, na.action = na.rpart)
pred.rf.up <- predict(fit.rf.up, data.test, na.action = randomForest::na.roughfix)
pred.rf.down <- predict(fit.rf.down, data.test, na.action = randomForest::na.roughfix)

confusionMatrix(pred.rpart.up, data.test$dx_cancer)
confusionMatrix(pred.rpart.down, data.test$dx_cancer)
confusionMatrix(pred.rf.up, data.test$dx_cancer)
confusionMatrix(pred.rf.down, data.test$dx_cancer)
```

We see that all algorithms had similar performances in the validation step. Even augmenting the dataset isn't enough to overcome the inherent problems of this dataset with such heavy class imbalances. I would dare to say that in the validation step, the best result was with down sampling instead of upsampling, since we had fewer false negatives (that is a big problem in diagnostic tests).
