Congrats! You just graduated from medical school and earned a PhD in Data Science at the same time. Wow, impressive. Because of these incredible accomplishments, the world now believes you will be able to cure cancer… no pressure. To start, you figured you had better create some way to detect cancer when it is present. Luckily, because you are now an MD and a DS PhD (or MDSDPhD), you have access to data sets and know your way around an ML classifier. So, on the way to fulfilling your destiny to rid the world of cancer, you start by building several classifiers that can help determine whether patients have cancer and the type of tumor.

The included dataset (clinical_data_breast_cancer_modified.csv) has information on 105 patients across 17 variables. Your goal is to build two classifiers: one for PR.Status (progesterone receptor), a biomarker that routinely leads to a cancer diagnosis, indicating whether the outcome was positive or negative, and one for Tumor, a multi-class variable. You would like to be able to explain the model to the mere mortals around you but need a fairly robust and flexible approach, so you’ve chosen to use decision trees to get started. In building both models, use CART and C5.0 and compare the differences.

In doing so, like the great data scientists of the past, you remembered the excellent education provided to you at UVA in an undergrad data science course and have outlined the steps that will need to be undertaken to complete this task (you can add more or combine if needed).
As always, you will need to make sure to #comment your work heavily and render the results in a clear report (knitted), as the non-MDSDPhDs of the world will someday need to understand the wonder and spectacle that will be your R code. Good luck, and the world thanks you.

Footnotes:
- Some of the steps will not need to be repeated for the second model; use your judgment.
- You can add or combine steps if needed.
- Also, remember to try several methods during evaluation and always be mindful of how the model will be used in practice.
- Do not include ER.Status in your first tree; it’s basically the same as PR.Status.

## ── Attaching packages ──────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

## Loading required package: bitops

## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)

##                                     vars   n    mean     sd median trimmed
## Gender*                                1 105    1.02   0.14      1    1.00
## Age.at.Initial.Pathologic.Diagnosis    2 105   58.69  13.07     58   58.31
## ER.Status*                             3 105    2.64   0.50      3    2.68
## PR.Status                              4 105    0.51   0.50      1    0.52
## HER2.Final.Status*                     5 105    2.25   0.46      2    2.20
## Tumor*                                 6 105    2.15   0.73      2    2.12
## Node.Coded*                            7 105    1.50   0.50      1    1.49
## Metastasis*                            8 105    1.02   0.14      1    1.00
## Metastasis.Coded*                      9 105    1.02   0.14      1    1.00
## AJCC.Stage*                           10 105    5.79   2.25      5    5.78
## Converted.Stage*                      11 105    2.62   1.64      3    2.41
## Survival.Data.Form*                   12 105    1.45   0.50      1    1.44
## Vital.Status*                         13 105    1.90   0.31      2    1.99
## Days.to.Date.of.Last.Contact          14 105  788.39 645.28    643  737.09
## Days.to.date.of.Death                 15  11 1254.45 678.05   1364 1239.56
## OS.event                              16 105    0.10   0.31      0    0.01
## OS.Time                               17 105  817.65 672.03    665  763.09
##                                        mad min  max range  skew kurtosis     se
## Gender*                               0.00   1    2     1  6.94    46.56   0.01
## Age.at.Initial.Pathologic.Diagnosis  13.34  30   88    58  0.23    -0.60   1.28
## ER.Status*                            0.00   1    3     2 -0.79    -0.86   0.05
## PR.Status                             0.00   0    1     1 -0.06    -2.02   0.05
## HER2.Final.Status*                    0.00   1    3     2  0.85    -0.48   0.04
## Tumor*                                0.00   1    4     3  0.64     0.54   0.07
## Node.Coded*                           0.00   1    2     1  0.02    -2.02   0.05
## Metastasis*                           0.00   1    2     1  6.94    46.56   0.01
## Metastasis.Coded*                     0.00   1    2     1  6.94    46.56   0.01
## AJCC.Stage*                           1.48   1   11    10  0.19    -0.18   0.22
## Converted.Stage*                      2.97   1    7     6  0.79     0.01   0.16
## Survival.Data.Form*                   0.00   1    2     1  0.21    -1.98   0.05
## Vital.Status*                         0.00   1    2     1 -2.54     4.52   0.03
## Days.to.Date.of.Last.Contact        848.05   0 2850  2850  0.61    -0.18  62.97
## Days.to.date.of.Death               486.29 160 2483  2323 -0.12    -0.88 204.44
## OS.event                              0.00   0    1     1  2.54     4.52   0.03
## OS.Time                             836.19   0 2850  2850  0.60    -0.30  65.58

Checking to see if the variables are classified correctly

Removed ER.Status, Age.at.Initial.Pathologic.Diagnosis, the two Days.* variables, OS.Time, and Tumor, then converted the remaining 11 variables to factors.

## List of 11
##  $ Gender            : Factor w/ 2 levels "FEMALE","MALE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PR.Status         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ HER2.Final.Status : Factor w/ 3 levels "Equivocal","Negative",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Node.Coded        : Factor w/ 2 levels "Negative","Positive": 2 1 2 2 2 1 1 1 1 1 ...
##  $ Metastasis        : Factor w/ 2 levels "M0","M1": 2 1 1 1 1 1 1 1 1 1 ...
##  $ Metastasis.Coded  : Factor w/ 2 levels "Negative","Positive": 2 1 1 1 1 1 1 1 1 1 ...
##  $ AJCC.Stage        : Factor w/ 11 levels "Stage I","Stage IA",..: 11 5 6 6 10 5 6 5 5 5 ...
##  $ Converted.Stage   : Factor w/ 7 levels "No_Conversion",..: 1 3 1 1 1 3 4 3 3 3 ...
##  $ Survival.Data.Form: Factor w/ 2 levels "enrollment","followup": 2 2 1 1 2 2 2 2 2 2 ...
##  $ Vital.Status      : Factor w/ 2 levels "DECEASED","LIVING": 1 1 1 1 2 2 2 2 2 2 ...
##  $ OS.event          : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 1 1 1 1 ...

## 
##  0  1 
## 51 54
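The cleaning step described above can be sketched in base R. This is only an illustration on a small made-up data frame (the object names `toy`, `drop_cols`, and `model_data` and the values in them are assumptions, not the report's actual code or data): drop the columns that should not feed the tree, then convert everything that remains to a factor.

```r
# Toy stand-in for the clinical data; the real CSV is not read here.
toy <- data.frame(
  Gender = c("FEMALE", "FEMALE", "MALE", "FEMALE"),
  Age.at.Initial.Pathologic.Diagnosis = c(66, 40, 39, 71),
  ER.Status = c("Negative", "Positive", "Positive", "Negative"),
  PR.Status = c(0, 1, 1, 0),
  Node.Coded = c("Positive", "Negative", "Positive", "Negative"),
  OS.Time = c(240, 754, 1555, 1692),
  stringsAsFactors = FALSE
)

# Columns to exclude before modelling (a subset of those named in the text).
drop_cols <- c("ER.Status", "Age.at.Initial.Pathologic.Diagnosis", "OS.Time")
model_data <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

# Convert every remaining column to a factor, keeping the data frame shape.
model_data[] <- lapply(model_data, factor)

str(model_data)              # check that each column is now a factor
table(model_data$PR.Status)  # class counts for the target
```

Checking the structure right after the conversion, as the report does, is what confirms that a 0/1 column such as PR.Status will be treated as a class label rather than a number.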

Split the data into two sets, one for training (80%) and one for testing

## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
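A stratified 80/20 split can be sketched in base R as follows. The data frame `toy`, the seed, and the helper names are assumptions made for illustration; only the class counts (51 and 54) come from the table above. The tibble warning printed above is consistent with indexing a tibble by a one-column matrix of row numbers (which is what `caret::createDataPartition()` returns when `list = FALSE`); indexing with a plain integer vector, as in this sketch, is one way to avoid it.

```r
set.seed(1984)  # arbitrary seed, only so the sketch is reproducible

# Toy outcome with the same class counts as PR.Status (51 zeros, 54 ones).
toy <- data.frame(
  id = 1:105,
  PR.Status = factor(rep(c("0", "1"), times = c(51, 54)))
)

# Sample about 80% of the row numbers within each class, so the training
# and test sets keep roughly the same class balance as the full data.
rows_by_class <- split(seq_len(nrow(toy)), toy$PR.Status)
train_rows <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = ceiling(0.8 * length(rows)))
}), use.names = FALSE)

train <- toy[train_rows, , drop = FALSE]   # plain integer vector index
test  <- toy[-train_rows, , drop = FALSE]

table(train$PR.Status)
table(test$PR.Status)
```

With these class counts the sketch yields 85 training rows and 20 test rows, the same sizes that appear in the tree and confusion matrix output of this report.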

Determining the baserate for the classifier

## [1] 54

## [1] 105

## [1] 0.4857143

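The value printed above, 0.4857, equals 51/105, which is the share of patients who are PR-negative. The share of patients who have the biomarker (PR.Status = 1) is 54/105 = 0.5143, and that is the base rate for the positive class.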
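As a quick check of the arithmetic, both class shares follow directly from the counts in the PR.Status table (51 negative, 54 positive). The object names below are illustrative.

```r
# Class counts taken from the PR.Status table in this report.
counts <- c("0" = 51, "1" = 54)

# Base rate of each class = class count / total number of patients.
base_rates <- counts / sum(counts)
round(base_rates, 4)
```

A classifier that always predicts the majority class would be right about 51.4% of the time on the full data, so that is the figure any tree has to beat.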

Building the model with the default settings

Viewing the results

## n= 85 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 85 41 1 (0.4823529 0.5176471)  
##    2) Converted.Stage=No_Conversion,Stage IIA,Stage IIIB 60 26 0 (0.5666667 0.4333333)  
##      4) AJCC.Stage=Stage IB,Stage IIA,Stage III,Stage IIIA,Stage IIIC,Stage IV 35 12 0 (0.6571429 0.3428571)  
##        8) AJCC.Stage=Stage IB,Stage III,Stage IIIA,Stage IIIC,Stage IV 10  2 0 (0.8000000 0.2000000) *
##        9) AJCC.Stage=Stage IIA 25 10 0 (0.6000000 0.4000000)  
##         18) HER2.Final.Status=Negative 18  6 0 (0.6666667 0.3333333) *
##         19) HER2.Final.Status=Positive 7  3 1 (0.4285714 0.5714286) *
##      5) AJCC.Stage=Stage IA,Stage II,Stage IIB,Stage IIIB 25 11 1 (0.4400000 0.5600000)  
##       10) Node.Coded=Negative 11  5 0 (0.5454545 0.4545455) *
##       11) Node.Coded=Positive 14  5 1 (0.3571429 0.6428571) *
##    3) Converted.Stage=Stage I,Stage IIB,Stage IIIA,Stage IIIC 25  7 1 (0.2800000 0.7200000) *

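Converted.Stage and AJCC.Stage seem to be the most important variables for the tree: the root split is on Converted.Stage, the next two splits are on AJCC.Stage, and the variable importance scores printed with the cp table rank them first (3.48 and 2.49). HER2.Final.Status and Node.Coded appear only in the lower splits.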
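For readers who want to see where output like this comes from, here is a minimal sketch of fitting a classification tree with default settings in rpart. It uses `kyphosis`, an example data set bundled with rpart, as a stand-in, so the formula and column names are not those of the clinical data.

```r
library(rpart)  # recommended package that ships with standard R installs

# Fit a CART classification tree with rpart's default control settings.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

print(fit)               # node-by-node text view, same layout as above
fit$variable.importance  # named numeric vector of importance scores
fit$cptable              # complexity table used when choosing a tree size
```

Reading `fit$variable.importance` alongside the printed splits is a safer way to name the "most important" variables than reading the splits alone, because importance also credits surrogate splits.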

Plotting the tree with the rpart package

Plot the cp chart and note the optimal size of the tree

## # A tibble: 5 x 5
##       CP nsplit `rel error` xerror  xstd
##    <dbl>  <dbl>       <dbl>  <dbl> <dbl>
## 1 0.195       0       1       1.27 0.110
## 2 0.0732      1       0.805   1.05 0.112
## 3 0.0244      2       0.732   1.10 0.112
## 4 0.0122      3       0.707   1.12 0.112
## 5 0.01        5       0.683   1.12 0.112
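The step of "noting the optimal size" can be made explicit. The sketch below re-enters the cp table above by hand and applies two common selection rules: the lowest cross-validated error, and the 1-SE rule. The object names are illustrative, and the rules are standard practice rather than something the report states it used.

```r
# CP table copied from the output above (rows ordered from smallest tree).
cp_table <- data.frame(
  CP     = c(0.195, 0.0732, 0.0244, 0.0122, 0.01),
  nsplit = c(0, 1, 2, 3, 5),
  xerror = c(1.27, 1.05, 1.10, 1.12, 1.12),
  xstd   = c(0.110, 0.112, 0.112, 0.112, 0.112)
)

# Rule 1: the row with the lowest cross-validated error.
best <- cp_table[which.min(cp_table$xerror), ]

# Rule 2 (1-SE): the smallest tree whose xerror is within one standard
# error of that minimum.
threshold <- best$xerror + best$xstd
one_se <- cp_table[which(cp_table$xerror <= threshold)[1], ]

best$nsplit    # 1
one_se$nsplit  # 1
```

Both rules point to the one-split tree (CP of about 0.073). Every xerror value is at or above 1, which suggests that under cross-validation none of these trees clearly improves on the root node.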

##    Converted.Stage         AJCC.Stage         Node.Coded  HER2.Final.Status 
##         3.48344450         2.49329696         0.77974026         0.68744426 
## Survival.Data.Form             Gender         Metastasis   Metastasis.Coded 
##         0.13444282         0.11601569         0.11428571         0.11428571 
##       Vital.Status 
##         0.05714286

Using the predict function and the model to predict the target variable on the test set.

##                  var  n wt dev yval complexity ncompete nsurrogate    yval2.V1
## 1    Converted.Stage 85 85  41    2 0.19512195        4          3  2.00000000
## 2         AJCC.Stage 60 60  26    1 0.07317073        4          2  1.00000000
## 4         AJCC.Stage 35 35  12    1 0.01219512        4          5  1.00000000
## 8             <leaf> 10 10   2    1 0.01000000        0          0  1.00000000
## 9  HER2.Final.Status 25 25  10    1 0.01219512        1          0  1.00000000
## 18            <leaf> 18 18   6    1 0.01000000        0          0  1.00000000
## 19            <leaf>  7  7   3    2 0.01000000        0          0  2.00000000
## 5         Node.Coded 25 25  11    2 0.02439024        2          3  2.00000000
## 10            <leaf> 11 11   5    1 0.01000000        0          0  1.00000000
## 11            <leaf> 14 14   5    2 0.01000000        0          0  2.00000000
## 3             <leaf> 25 25   7    2 0.00000000        0          0  2.00000000
##       yval2.V2    yval2.V3    yval2.V4    yval2.V5 yval2.nodeprob
## 1  41.00000000 44.00000000  0.48235294  0.51764706     1.00000000
## 2  34.00000000 26.00000000  0.56666667  0.43333333     0.70588235
## 4  23.00000000 12.00000000  0.65714286  0.34285714     0.41176471
## 8   8.00000000  2.00000000  0.80000000  0.20000000     0.11764706
## 9  15.00000000 10.00000000  0.60000000  0.40000000     0.29411765
## 18 12.00000000  6.00000000  0.66666667  0.33333333     0.21176471
## 19  3.00000000  4.00000000  0.42857143  0.57142857     0.08235294
## 5  11.00000000 14.00000000  0.44000000  0.56000000     0.29411765
## 10  6.00000000  5.00000000  0.54545455  0.45454545     0.12941176
## 11  5.00000000  9.00000000  0.35714286  0.64285714     0.16470588
## 3   7.00000000 18.00000000  0.28000000  0.72000000     0.29411765

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction 0 1
##          0 5 4
##          1 5 6
##                                           
##                Accuracy : 0.55            
##                  95% CI : (0.3153, 0.7694)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.4119          
##                                           
##                   Kappa : 0.1             
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.6000          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.5455          
##          Neg Pred Value : 0.5556          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3000          
##    Detection Prevalence : 0.5500          
##       Balanced Accuracy : 0.5500          
##                                           
##        'Positive' Class : 1               
##

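The confusion matrix is not very good, and it looks like there may have been some mistake. The accuracy is 0.55 on the 20 test patients, barely above the no-information rate of 0.50 (p = 0.41), which is awful. Sensitivity (0.60) and specificity (0.50) are both close to chance, and kappa is only 0.1, which is what leads me to think there has been some error in the model.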
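The headline metrics can be recomputed by hand from the four cells of the confusion matrix above, which is a useful check when the printed summary looks suspicious. The sketch below is base R only; the variable names are illustrative, and class 1 is treated as the positive class, as in the output.

```r
# Confusion matrix from the output above: rows = Prediction, cols = Actual.
cm <- matrix(c(5, 5, 4, 6), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Actual = c("0", "1")))

tp <- cm["1", "1"]  # predicted 1, actually 1
tn <- cm["0", "0"]  # predicted 0, actually 0
fp <- cm["1", "0"]  # predicted 1, actually 0
fn <- cm["0", "1"]  # predicted 0, actually 1

accuracy       <- (tp + tn) / sum(cm)  # 0.55
sensitivity    <- tp / (tp + fn)       # 0.60
specificity    <- tn / (tn + fp)       # 0.50
detection_rate <- tp / sum(cm)         # 0.30
```

These match the caret summary, so the summary itself is computed correctly; the weak numbers come from the model's predictions on only 20 test patients.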

Generating the hit rate and detection rate

Using the confusion matrix function in caret to check a variety of metrics

Generate a ROC and AUC output

The AUC came out as 0.5. As with the near-chance confusion matrix values, I believe this probably reflects an error in the model. The ROC plot was unable to run.
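As a cross-check that does not depend on any ROC package, AUC can be computed from ranks (the Mann-Whitney form). The function name `auc_from_scores` and the toy labels and scores are assumptions for illustration. One thing worth checking in a case like this is whether the ROC function was given predicted class probabilities (for an rpart model, `predict(..., type = "prob")`) rather than a column with no variation, because a constant score yields an AUC of exactly 0.5.

```r
auc_from_scores <- function(labels, scores) {
  # labels: 0/1 vector (1 = positive class)
  # scores: predicted probability of the positive class
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)  # average ranks, so ties are handled
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)

auc_from_scores(labels, scores)        # 0.75
auc_from_scores(labels, rep(0.5, 4))   # 0.5: a constant score carries no ranking
```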

Second Model using the multiple class Tumor Variable

## List of 11
##  $ Gender            : Factor w/ 2 levels "FEMALE","MALE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ HER2.Final.Status : Factor w/ 3 levels "Equivocal","Negative",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Tumor             : Factor w/ 4 levels "T1","T2","T3",..: 3 2 2 2 3 2 3 2 2 2 ...
##  $ Node.Coded        : Factor w/ 2 levels "Negative","Positive": 2 1 2 2 2 1 1 1 1 1 ...
##  $ Metastasis        : Factor w/ 2 levels "M0","M1": 2 1 1 1 1 1 1 1 1 1 ...
##  $ Metastasis.Coded  : Factor w/ 2 levels "Negative","Positive": 2 1 1 1 1 1 1 1 1 1 ...
##  $ AJCC.Stage        : Factor w/ 11 levels "Stage I","Stage IA",..: 11 5 6 6 10 5 6 5 5 5 ...
##  $ Converted.Stage   : Factor w/ 7 levels "No_Conversion",..: 1 3 1 1 1 3 4 3 3 3 ...
##  $ Survival.Data.Form: Factor w/ 2 levels "enrollment","followup": 2 2 1 1 2 2 2 2 2 2 ...
##  $ Vital.Status      : Factor w/ 2 levels "DECEASED","LIVING": 1 1 1 1 2 2 2 2 2 2 ...
##  $ OS.event          : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 1 1 1 1 ...

## 
## T1 T2 T3 T4 
## 15 65 19  6

Splitting the data into test and train sets

Finding the baserate for the classifier

##   base_rate_1 base_rate_2 base_rate_3 base_rate_4
## 1   0.1428571   0.6190476   0.1809524  0.05714286

The base rate for T1 is .14, T2 is .62, T3 is .18, and T4 is .06.

Building the model using default settings

Viewing the results

## n= 85 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 85 33 T2 (0.14117647 0.61176471 0.18823529 0.05882353)  
##    2) AJCC.Stage=Stage I,Stage IA,Stage IIIB 13  5 T1 (0.61538462 0.00000000 0.00000000 0.38461538) *
##    3) AJCC.Stage=Stage IB,Stage II,Stage IIA,Stage IIB,Stage III,Stage IIIA,Stage IIIC,Stage IV 72 20 T2 (0.05555556 0.72222222 0.22222222 0.00000000)  
##      6) AJCC.Stage=Stage IB,Stage II,Stage IIA,Stage III,Stage IV 38  3 T2 (0.07894737 0.92105263 0.00000000 0.00000000) *
##      7) AJCC.Stage=Stage IIB,Stage IIIA,Stage IIIC 34 17 T2 (0.02941176 0.50000000 0.47058824 0.00000000)  
##       14) Converted.Stage=No_Conversion,Stage IIA,Stage IIIC 22  7 T2 (0.04545455 0.68181818 0.27272727 0.00000000) *
##       15) Converted.Stage=Stage IIB,Stage IIIA 12  2 T3 (0.00000000 0.16666667 0.83333333 0.00000000) *

This shows that AJCC.Stage and Converted.Stage are the most important variables: they are the only two the tree splits on, and Node.Coded does not appear in any split.

Using the predict function and the model to predict the target variable on the test set

##                var  n wt dev yval complexity ncompete nsurrogate    yval2.V1
## 1       AJCC.Stage 85 85  33    2  0.2424242        4          1  2.00000000
## 2           <leaf> 13 13   5    1  0.0100000        0          0  1.00000000
## 3       AJCC.Stage 72 72  20    2  0.1212121        4          4  2.00000000
## 6           <leaf> 38 38   3    2  0.0000000        0          0  2.00000000
## 7  Converted.Stage 34 34  17    2  0.1212121        4          3  2.00000000
## 14          <leaf> 22 22   7    2  0.0000000        0          0  2.00000000
## 15          <leaf> 12 12   2    3  0.0100000        0          0  3.00000000
##       yval2.V2    yval2.V3    yval2.V4    yval2.V5    yval2.V6    yval2.V7
## 1  12.00000000 52.00000000 16.00000000  5.00000000  0.14117647  0.61176471
## 2   8.00000000  0.00000000  0.00000000  5.00000000  0.61538462  0.00000000
## 3   4.00000000 52.00000000 16.00000000  0.00000000  0.05555556  0.72222222
## 6   3.00000000 35.00000000  0.00000000  0.00000000  0.07894737  0.92105263
## 7   1.00000000 17.00000000 16.00000000  0.00000000  0.02941176  0.50000000
## 14  1.00000000 15.00000000  6.00000000  0.00000000  0.04545455  0.68181818
## 15  0.00000000  2.00000000 10.00000000  0.00000000  0.00000000  0.16666667
##       yval2.V8    yval2.V9 yval2.nodeprob
## 1   0.18823529  0.05882353     1.00000000
## 2   0.00000000  0.38461538     0.15294118
## 3   0.22222222  0.00000000     0.84705882
## 6   0.00000000  0.00000000     0.44705882
## 7   0.47058824  0.00000000     0.40000000
## 14  0.27272727  0.00000000     0.25882353
## 15  0.83333333  0.00000000     0.14117647

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction T1 T2 T3 T4
##         T1  2  0  0  1
##         T2  1 11  2  0
##         T3  0  2  1  0
##         T4  0  0  0  0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7             
##                  95% CI : (0.4572, 0.8811)
##     No Information Rate : 0.65            
##     P-Value [Acc > NIR] : 0.4166          
##                                           
##                   Kappa : 0.4             
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: T1 Class: T2 Class: T3 Class: T4
## Sensitivity             0.6667    0.8462    0.3333      0.00
## Specificity             0.9412    0.5714    0.8824      1.00
## Pos Pred Value          0.6667    0.7857    0.3333       NaN
## Neg Pred Value          0.9412    0.6667    0.8824      0.95
## Prevalence              0.1500    0.6500    0.1500      0.05
## Detection Rate          0.1000    0.5500    0.0500      0.00
## Detection Prevalence    0.1500    0.7000    0.1500      0.00
## Balanced Accuracy       0.8039    0.7088    0.6078      0.50

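The accuracy is 0.70, only slightly above the no-information rate of 0.65, and some of the sensitivity and specificity values are strange. T1 looks good with values of .67 and .94, and T2 is reasonable at .85 and .57, but T3 has a sensitivity of only .33, and T4 has values of 0.0 and 1.0, which is not what we want. The test set contains a single T4 case and the tree has no leaf that predicts T4, so that class is never detected.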
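The per-class figures can be reproduced from the 4 x 4 confusion matrix using a one-vs-rest view of each class. The sketch below is base R, with illustrative variable names and the matrix entered by hand from the output above.

```r
# Confusion matrix from the output above: rows = Prediction, cols = Actual.
cm <- matrix(
  c(2, 1, 0, 0,    # Actual T1
    0, 11, 2, 0,   # Actual T2
    0, 2, 1, 0,    # Actual T3
    1, 0, 0, 0),   # Actual T4
  nrow = 4,
  dimnames = list(Prediction = paste0("T", 1:4), Actual = paste0("T", 1:4))
)

total <- sum(cm)
tp <- diag(cm)            # correct predictions per class
fn <- colSums(cm) - tp    # actual members of the class that were missed
fp <- rowSums(cm) - tp    # cases wrongly assigned to the class
tn <- total - tp - fn - fp

accuracy    <- sum(tp) / total   # 0.7
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

round(rbind(sensitivity, specificity), 4)
```

The T4 column shows why its metrics are degenerate: with one actual T4 case and zero T4 predictions, sensitivity is 0/1 and specificity is 19/19.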

Generating the hit and detection rates

Using the confusion matrix to check a variety of metrics

Had an error with this step and was unable to get it to work.

C50 analysis for the second model

Had a multitude of errors with this last section of C50 code. I really wasn’t able to get any of it to run, unfortunately.
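Since the failing code is not shown, the cause of the errors cannot be determined here. As a starting point, this is a minimal sketch of a multi-class C5.0 fit using the formula interface of the C50 package. It uses the built-in `iris` data as a stand-in (an assumption, not the clinical data) and only runs the C5.0 calls when the package is installed. One requirement that is easy to miss is that the outcome must be a factor.

```r
# Stand-in data with a 3-level factor outcome (Species).
data(iris)
stopifnot(is.factor(iris$Species))  # C5.0 needs a factor outcome

if (requireNamespace("C50", quietly = TRUE)) {
  # Fit a single C5.0 tree with default settings.
  c5_fit <- C50::C5.0(Species ~ ., data = iris)

  # Predicted classes; training data is reused here only to show the calls.
  c5_pred <- predict(c5_fit, newdata = iris, type = "class")
  print(table(Prediction = c5_pred, Actual = iris$Species))
} else {
  message("C50 is not installed; install.packages('C50') would be needed first.")
}
```

For a real comparison with CART, the same train and test rows used for the rpart model would be passed to both algorithms, with predictions made on the test set only.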

While I was able to get most of the CART model for PR.Status to run, the model itself was very poor. I’m not sure if there was an error or if I took out the wrong variables, but the accuracy was only 0.55, with a sensitivity of 0.60 and a specificity of 0.50. This is not exactly what we are looking for.

My code for C50 was a disaster. More of it produced errors than ran, and the code that did run didn’t give the output I was expecting. I’m not sure exactly why this was, but maybe I just didn’t fully understand what I needed to be doing.

It’s hard to compare two models when you don’t really have output for one of them. I don’t think I understand what happened well enough to even try. I will say a working model to help predict cancer by looking at biomarkers would be invaluable. Any way this could be incorporated into hospitals and other medical centers should be attempted. This model could be directly responsible for saving lives if it is able to predict cancer occurring in an individual. The earlier cancer is found, the easier it is to treat and the more likely one is to recover from it.

In Class DT

Brian Wright

December 7, 2017
