Congrats! You just graduated from medical school and earned a PhD in Data Science at the same time. Wow, impressive. Because of these incredible accomplishments, the world now believes you will be able to cure cancer… no pressure. To start, you figured you had better create some way to detect cancer when it is present. Luckily, because you are now an MD and DS PhD, or MDSDPhD, you have access to data sets and know your way around an ML classifier. So, on the way to fulfilling your destiny to rid the world of cancer, you start by building several classifiers that can aid in determining whether patients have cancer and the type of tumor.
The included dataset (clinical_data_breast_cancer_modified.csv) has information on 105 patients across 17 variables. Your goal is to build two classifiers: one for PR.Status (progesterone receptor), a biomarker that routinely leads to a cancer diagnosis, indicating whether there was a positive or negative outcome, and one for Tumor, a multi-class variable. You would like to be able to explain the model to the mere mortals around you but need a fairly robust and flexible approach, so you’ve chosen to use decision trees to get started. In building both models, use CART and C5.0 and compare the differences.
In doing so, similar to great data scientists of the past, you remembered the excellent education provided to you at UVA in an undergrad data science course and have outlined the steps that will need to be undertaken to complete this task (you can add more or combine if needed).
As always, you will need to make sure to #comment your work heavily and render the results in a clear report (knitted), as the non-MDSDPhDs of the world will someday need to understand the wonder and spectacle that will be your R code. Good luck, and the world thanks you.
Footnotes:
- Some of the steps will not need to be repeated for the second model; use your judgment.
- You can add or combine steps if needed.
- Also, remember to try several methods during evaluation and always be mindful of how the model will be used in practice.
- Do not include ER.Status in your first tree; it’s basically the same as PR.Status.
Packages loaded: tidyverse 1.3.0 (ggplot2 3.3.2, tibble 3.0.3, tidyr 1.1.2, readr 1.3.1, purrr 0.3.4, dplyr 1.0.2, stringr 1.4.0, forcats 0.5.0), plyr, psych, pROC, rattle 5.4.0, and caret.
## vars n mean sd median trimmed
## Gender* 1 105 1.02 0.14 1 1.00
## Age.at.Initial.Pathologic.Diagnosis 2 105 58.69 13.07 58 58.31
## ER.Status* 3 105 2.64 0.50 3 2.68
## PR.Status 4 105 0.51 0.50 1 0.52
## HER2.Final.Status* 5 105 2.25 0.46 2 2.20
## Tumor* 6 105 2.15 0.73 2 2.12
## Node.Coded* 7 105 1.50 0.50 1 1.49
## Metastasis* 8 105 1.02 0.14 1 1.00
## Metastasis.Coded* 9 105 1.02 0.14 1 1.00
## AJCC.Stage* 10 105 5.79 2.25 5 5.78
## Converted.Stage* 11 105 2.62 1.64 3 2.41
## Survival.Data.Form* 12 105 1.45 0.50 1 1.44
## Vital.Status* 13 105 1.90 0.31 2 1.99
## Days.to.Date.of.Last.Contact 14 105 788.39 645.28 643 737.09
## Days.to.date.of.Death 15 11 1254.45 678.05 1364 1239.56
## OS.event 16 105 0.10 0.31 0 0.01
## OS.Time 17 105 817.65 672.03 665 763.09
## mad min max range skew kurtosis se
## Gender* 0.00 1 2 1 6.94 46.56 0.01
## Age.at.Initial.Pathologic.Diagnosis 13.34 30 88 58 0.23 -0.60 1.28
## ER.Status* 0.00 1 3 2 -0.79 -0.86 0.05
## PR.Status 0.00 0 1 1 -0.06 -2.02 0.05
## HER2.Final.Status* 0.00 1 3 2 0.85 -0.48 0.04
## Tumor* 0.00 1 4 3 0.64 0.54 0.07
## Node.Coded* 0.00 1 2 1 0.02 -2.02 0.05
## Metastasis* 0.00 1 2 1 6.94 46.56 0.01
## Metastasis.Coded* 0.00 1 2 1 6.94 46.56 0.01
## AJCC.Stage* 1.48 1 11 10 0.19 -0.18 0.22
## Converted.Stage* 2.97 1 7 6 0.79 0.01 0.16
## Survival.Data.Form* 0.00 1 2 1 0.21 -1.98 0.05
## Vital.Status* 0.00 1 2 1 -2.54 4.52 0.03
## Days.to.Date.of.Last.Contact 848.05 0 2850 2850 0.61 -0.18 62.97
## Days.to.date.of.Death 486.29 160 2483 2323 -0.12 -0.88 204.44
## OS.event 0.00 0 1 1 2.54 4.52 0.03
## OS.Time 836.19 0 2850 2850 0.60 -0.30 65.58
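I removed ER.Status, the age variable (Age.at.Initial.Pathologic.Diagnosis), the day-count variables (Days.to.Date.of.Last.Contact, Days.to.date.of.Death, and OS.Time), and Tumor, then converted the remaining 11 variables to factors.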
## List of 11
## $ Gender : Factor w/ 2 levels "FEMALE","MALE": 1 1 1 1 1 1 1 1 1 1 ...
## $ PR.Status : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ HER2.Final.Status : Factor w/ 3 levels "Equivocal","Negative",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Node.Coded : Factor w/ 2 levels "Negative","Positive": 2 1 2 2 2 1 1 1 1 1 ...
## $ Metastasis : Factor w/ 2 levels "M0","M1": 2 1 1 1 1 1 1 1 1 1 ...
## $ Metastasis.Coded : Factor w/ 2 levels "Negative","Positive": 2 1 1 1 1 1 1 1 1 1 ...
## $ AJCC.Stage : Factor w/ 11 levels "Stage I","Stage IA",..: 11 5 6 6 10 5 6 5 5 5 ...
## $ Converted.Stage : Factor w/ 7 levels "No_Conversion",..: 1 3 1 1 1 3 4 3 3 3 ...
## $ Survival.Data.Form: Factor w/ 2 levels "enrollment","followup": 2 2 1 1 2 2 2 2 2 2 ...
## $ Vital.Status : Factor w/ 2 levels "DECEASED","LIVING": 1 1 1 1 2 2 2 2 2 2 ...
## $ OS.event : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 1 1 1 1 ...
##
## 0 1
## 51 54
## [1] 54
## [1] 105
## [1] 0.4857143
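The value printed above, 0.4857, is 51/105, which is the share of patients with PR.Status = 0. The share of patients who have the biomarker (PR.Status = 1) is 54/105 ≈ 0.5143, so the base rate for the positive class is about 0.51.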
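The base-rate arithmetic can be checked directly from the PR.Status counts printed above (51 zeros, 54 ones). The snippet below is a small Python sketch for illustration rather than the R code used in this report, and the variable names are hypothetical.

```python
# Counts taken from the PR.Status frequency table above.
counts = {"0": 51, "1": 54}
total = sum(counts.values())  # 105 patients

# Share of each class; the positive class ("1") is the base rate of interest.
shares = {label: n / total for label, n in counts.items()}

print(f"PR.Status = 1: {shares['1']:.4f}")
print(f"PR.Status = 0: {shares['0']:.4f}")
```

The 0.4857 figure in the output corresponds to the PR.Status = 0 share, not the share of patients with the biomarker.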
## n= 85
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 85 41 1 (0.4823529 0.5176471)
## 2) Converted.Stage=No_Conversion,Stage IIA,Stage IIIB 60 26 0 (0.5666667 0.4333333)
## 4) AJCC.Stage=Stage IB,Stage IIA,Stage III,Stage IIIA,Stage IIIC,Stage IV 35 12 0 (0.6571429 0.3428571)
## 8) AJCC.Stage=Stage IB,Stage III,Stage IIIA,Stage IIIC,Stage IV 10 2 0 (0.8000000 0.2000000) *
## 9) AJCC.Stage=Stage IIA 25 10 0 (0.6000000 0.4000000)
## 18) HER2.Final.Status=Negative 18 6 0 (0.6666667 0.3333333) *
## 19) HER2.Final.Status=Positive 7 3 1 (0.4285714 0.5714286) *
## 5) AJCC.Stage=Stage IA,Stage II,Stage IIB,Stage IIIB 25 11 1 (0.4400000 0.5600000)
## 10) Node.Coded=Negative 11 5 0 (0.5454545 0.4545455) *
## 11) Node.Coded=Positive 14 5 1 (0.3571429 0.6428571) *
## 3) Converted.Stage=Stage I,Stage IIB,Stage IIIA,Stage IIIC 25 7 1 (0.2800000 0.7200000) *
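Converted.Stage and AJCC.Stage seem to be the most important variables for the tree: Converted.Stage defines the root split and AJCC.Stage the next two, and they have the highest variable importance scores in the output below (3.48 and 2.49). Node.Coded and HER2.Final.Status appear only in the lower splits.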
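To see how much the first split actually buys, the sketch below recomputes the Gini impurity reduction for the root split from the class counts in the printed tree (root 41/44, node 2 34/26, node 3 7/18). It is written in Python for illustration and assumes the tree was grown with the Gini criterion; the helper names are hypothetical.

```python
def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gini_gain(parent, children):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - weighted


# (PR.Status = 0, PR.Status = 1) counts read from nodes 1, 2 and 3 of the tree.
root = (41, 44)
left = (34, 26)   # node 2
right = (7, 18)   # node 3

gain = gini_gain(root, [left, right])
print(f"root impurity: {gini(root):.4f}")
print(f"impurity reduction from the first split: {gain:.4f}")
```

The root is almost perfectly mixed (impurity close to the two-class maximum of 0.5), and the best available split removes only a small part of that impurity.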
## # A tibble: 5 x 5
## CP nsplit `rel error` xerror xstd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.195 0 1 1.27 0.110
## 2 0.0732 1 0.805 1.05 0.112
## 3 0.0244 2 0.732 1.10 0.112
## 4 0.0122 3 0.707 1.12 0.112
## 5 0.01 5 0.683 1.12 0.112
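One common way to read a complexity table like this is to pick the row with the lowest cross-validated error (`xerror`), or the simplest row within one standard error of it. The Python sketch below applies both rules to the values printed above; the tuple layout and variable names are illustrative, not output from the R code.

```python
# Rows copied from the complexity table: (CP, nsplit, rel_error, xerror, xstd).
cp_table = [
    (0.195, 0, 1.000, 1.27, 0.110),
    (0.0732, 1, 0.805, 1.05, 0.112),
    (0.0244, 2, 0.732, 1.10, 0.112),
    (0.0122, 3, 0.707, 1.12, 0.112),
    (0.01, 5, 0.683, 1.12, 0.112),
]

# Rule 1: the row with the lowest cross-validated error.
best = min(cp_table, key=lambda row: row[3])

# Rule 2 (one-standard-error rule): the fewest splits whose xerror is
# within one xstd of the minimum.
threshold = best[3] + best[4]
one_se = min((row for row in cp_table if row[3] <= threshold), key=lambda row: row[1])

print(f"lowest xerror: CP={best[0]}, nsplit={best[1]}, xerror={best[3]}")
print(f"1-SE choice:   CP={one_se[0]}, nsplit={one_se[1]}")
print("any xerror below 1:", any(row[3] < 1 for row in cp_table))
```

Both rules point to the one-split tree, and no row has a cross-validated error below 1. Assuming the usual scaling, where the root node's training error equals 1, that suggests none of these trees generalizes better than simply predicting the majority class.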
## Converted.Stage AJCC.Stage Node.Coded HER2.Final.Status
## 3.48344450 2.49329696 0.77974026 0.68744426
## Survival.Data.Form Gender Metastasis Metastasis.Coded
## 0.13444282 0.11601569 0.11428571 0.11428571
## Vital.Status
## 0.05714286
## var n wt dev yval complexity ncompete nsurrogate yval2.V1
## 1 Converted.Stage 85 85 41 2 0.19512195 4 3 2.00000000
## 2 AJCC.Stage 60 60 26 1 0.07317073 4 2 1.00000000
## 4 AJCC.Stage 35 35 12 1 0.01219512 4 5 1.00000000
## 8 <leaf> 10 10 2 1 0.01000000 0 0 1.00000000
## 9 HER2.Final.Status 25 25 10 1 0.01219512 1 0 1.00000000
## 18 <leaf> 18 18 6 1 0.01000000 0 0 1.00000000
## 19 <leaf> 7 7 3 2 0.01000000 0 0 2.00000000
## 5 Node.Coded 25 25 11 2 0.02439024 2 3 2.00000000
## 10 <leaf> 11 11 5 1 0.01000000 0 0 1.00000000
## 11 <leaf> 14 14 5 2 0.01000000 0 0 2.00000000
## 3 <leaf> 25 25 7 2 0.00000000 0 0 2.00000000
## yval2.V2 yval2.V3 yval2.V4 yval2.V5 yval2.nodeprob
## 1 41.00000000 44.00000000 0.48235294 0.51764706 1.00000000
## 2 34.00000000 26.00000000 0.56666667 0.43333333 0.70588235
## 4 23.00000000 12.00000000 0.65714286 0.34285714 0.41176471
## 8 8.00000000 2.00000000 0.80000000 0.20000000 0.11764706
## 9 15.00000000 10.00000000 0.60000000 0.40000000 0.29411765
## 18 12.00000000 6.00000000 0.66666667 0.33333333 0.21176471
## 19 3.00000000 4.00000000 0.42857143 0.57142857 0.08235294
## 5 11.00000000 14.00000000 0.44000000 0.56000000 0.29411765
## 10 6.00000000 5.00000000 0.54545455 0.45454545 0.12941176
## 11 5.00000000 9.00000000 0.35714286 0.64285714 0.16470588
## 3 7.00000000 18.00000000 0.28000000 0.72000000 0.29411765
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 5 4
## 1 5 6
##
## Accuracy : 0.55
## 95% CI : (0.3153, 0.7694)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.4119
##
## Kappa : 0.1
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.6000
## Specificity : 0.5000
## Pos Pred Value : 0.5455
## Neg Pred Value : 0.5556
## Prevalence : 0.5000
## Detection Rate : 0.3000
## Detection Prevalence : 0.5500
## Balanced Accuracy : 0.5500
##
## 'Positive' Class : 1
##
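The confusion matrix is not very good, and it looks like there may have been some mistake. Accuracy is 0.55, barely above the no-information rate of 0.50, and with only 20 test patients the 95% CI (0.32 to 0.77) comfortably includes chance. Sensitivity is 0.60, specificity is 0.50, and Kappa is just 0.1. Results this close to a coin flip are what lead me to think there has been some error in the model.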
The AUC came out at 0.5, which is what random guessing would give. As with the near-chance confusion matrix values, I believe this probably reflects an error in the model. The ROC curve itself would not run.
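The statistics in the confusion matrix output can be reproduced by hand from its four cells, which makes it easier to see what each one measures. This is a Python sketch for illustration (the report itself was produced in R), with 1 treated as the positive class as in the output.

```python
# Cells of the 2x2 confusion matrix above (rows = prediction, columns = actual).
tp, fn = 6, 4   # actual 1: predicted 1, predicted 0
fp, tn = 5, 5   # actual 0: predicted 1, predicted 0
n = tp + fn + fp + tn

accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)   # share of actual positives that were caught
specificity = tn / (tn + fp)   # share of actual negatives that were cleared
ppv = tp / (tp + fp)           # share of positive predictions that were right
npv = tn / (tn + fn)           # share of negative predictions that were right

# Cohen's kappa: agreement beyond what the row/column totals give by chance.
expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
kappa = (accuracy - expected) / (1 - expected)

for name, value in [("accuracy", accuracy), ("sensitivity", sensitivity),
                    ("specificity", specificity), ("PPV", ppv),
                    ("NPV", npv), ("kappa", kappa)]:
    print(f"{name:12s}{value:.4f}")
```

In a screening setting, sensitivity matters most: 4 of the 10 PR-positive test patients were missed.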
## List of 11
## $ Gender : Factor w/ 2 levels "FEMALE","MALE": 1 1 1 1 1 1 1 1 1 1 ...
## $ HER2.Final.Status : Factor w/ 3 levels "Equivocal","Negative",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Tumor : Factor w/ 4 levels "T1","T2","T3",..: 3 2 2 2 3 2 3 2 2 2 ...
## $ Node.Coded : Factor w/ 2 levels "Negative","Positive": 2 1 2 2 2 1 1 1 1 1 ...
## $ Metastasis : Factor w/ 2 levels "M0","M1": 2 1 1 1 1 1 1 1 1 1 ...
## $ Metastasis.Coded : Factor w/ 2 levels "Negative","Positive": 2 1 1 1 1 1 1 1 1 1 ...
## $ AJCC.Stage : Factor w/ 11 levels "Stage I","Stage IA",..: 11 5 6 6 10 5 6 5 5 5 ...
## $ Converted.Stage : Factor w/ 7 levels "No_Conversion",..: 1 3 1 1 1 3 4 3 3 3 ...
## $ Survival.Data.Form: Factor w/ 2 levels "enrollment","followup": 2 2 1 1 2 2 2 2 2 2 ...
## $ Vital.Status : Factor w/ 2 levels "DECEASED","LIVING": 1 1 1 1 2 2 2 2 2 2 ...
## $ OS.event : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 1 1 1 1 ...
##
## T1 T2 T3 T4
## 15 65 19 6
## base_rate_1 base_rate_2 base_rate_3 base_rate_4
## 1 0.1428571 0.6190476 0.1809524 0.05714286
The base rate for T1 is .14, T2 is .62, T3 is .18, and T4 is .06.
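With four classes, the largest base rate doubles as a naive benchmark: a model that always predicts the most common class. A Python sketch using the counts from the table above (names are illustrative):

```python
# Tumor class counts from the frequency table above.
tumor_counts = {"T1": 15, "T2": 65, "T3": 19, "T4": 6}
total = sum(tumor_counts.values())

base_rates = {label: n / total for label, n in tumor_counts.items()}
for label, rate in base_rates.items():
    print(f"{label}: {rate:.4f}")

# Accuracy of always guessing the most frequent class.
majority = max(tumor_counts, key=tumor_counts.get)
print(f"always predicting {majority} is right {base_rates[majority]:.1%} of the time")
```

Any tumor classifier needs to beat roughly 62% accuracy before it adds anything over guessing T2 every time.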
## n= 85
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 85 33 T2 (0.14117647 0.61176471 0.18823529 0.05882353)
## 2) AJCC.Stage=Stage I,Stage IA,Stage IIIB 13 5 T1 (0.61538462 0.00000000 0.00000000 0.38461538) *
## 3) AJCC.Stage=Stage IB,Stage II,Stage IIA,Stage IIB,Stage III,Stage IIIA,Stage IIIC,Stage IV 72 20 T2 (0.05555556 0.72222222 0.22222222 0.00000000)
## 6) AJCC.Stage=Stage IB,Stage II,Stage IIA,Stage III,Stage IV 38 3 T2 (0.07894737 0.92105263 0.00000000 0.00000000) *
## 7) AJCC.Stage=Stage IIB,Stage IIIA,Stage IIIC 34 17 T2 (0.02941176 0.50000000 0.47058824 0.00000000)
## 14) Converted.Stage=No_Conversion,Stage IIA,Stage IIIC 22 7 T2 (0.04545455 0.68181818 0.27272727 0.00000000) *
## 15) Converted.Stage=Stage IIB,Stage IIIA 12 2 T3 (0.00000000 0.16666667 0.83333333 0.00000000) *
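This shows that AJCC.Stage and Converted.Stage are the variables the tree relies on: AJCC.Stage defines the first two splits and Converted.Stage the third. Node.Coded does not appear as a primary split in the printed tree.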
## var n wt dev yval complexity ncompete nsurrogate yval2.V1
## 1 AJCC.Stage 85 85 33 2 0.2424242 4 1 2.00000000
## 2 <leaf> 13 13 5 1 0.0100000 0 0 1.00000000
## 3 AJCC.Stage 72 72 20 2 0.1212121 4 4 2.00000000
## 6 <leaf> 38 38 3 2 0.0000000 0 0 2.00000000
## 7 Converted.Stage 34 34 17 2 0.1212121 4 3 2.00000000
## 14 <leaf> 22 22 7 2 0.0000000 0 0 2.00000000
## 15 <leaf> 12 12 2 3 0.0100000 0 0 3.00000000
## yval2.V2 yval2.V3 yval2.V4 yval2.V5 yval2.V6 yval2.V7
## 1 12.00000000 52.00000000 16.00000000 5.00000000 0.14117647 0.61176471
## 2 8.00000000 0.00000000 0.00000000 5.00000000 0.61538462 0.00000000
## 3 4.00000000 52.00000000 16.00000000 0.00000000 0.05555556 0.72222222
## 6 3.00000000 35.00000000 0.00000000 0.00000000 0.07894737 0.92105263
## 7 1.00000000 17.00000000 16.00000000 0.00000000 0.02941176 0.50000000
## 14 1.00000000 15.00000000 6.00000000 0.00000000 0.04545455 0.68181818
## 15 0.00000000 2.00000000 10.00000000 0.00000000 0.00000000 0.16666667
## yval2.V8 yval2.V9 yval2.nodeprob
## 1 0.18823529 0.05882353 1.00000000
## 2 0.00000000 0.38461538 0.15294118
## 3 0.22222222 0.00000000 0.84705882
## 6 0.00000000 0.00000000 0.44705882
## 7 0.47058824 0.00000000 0.40000000
## 14 0.27272727 0.00000000 0.25882353
## 15 0.83333333 0.00000000 0.14117647
## Confusion Matrix and Statistics
##
## Actual
## Prediction T1 T2 T3 T4
## T1 2 0 0 1
## T2 1 11 2 0
## T3 0 2 1 0
## T4 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.7
## 95% CI : (0.4572, 0.8811)
## No Information Rate : 0.65
## P-Value [Acc > NIR] : 0.4166
##
## Kappa : 0.4
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: T1 Class: T2 Class: T3 Class: T4
## Sensitivity 0.6667 0.8462 0.3333 0.00
## Specificity 0.9412 0.5714 0.8824 1.00
## Pos Pred Value 0.6667 0.7857 0.3333 NaN
## Neg Pred Value 0.9412 0.6667 0.8824 0.95
## Prevalence 0.1500 0.6500 0.1500 0.05
## Detection Rate 0.1000 0.5500 0.0500 0.00
## Detection Prevalence 0.1500 0.7000 0.1500 0.00
## Balanced Accuracy 0.8039 0.7088 0.6078 0.50
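The accuracy is 0.70, only slightly above the no-information rate of 0.65 (p = 0.42), and some of the sensitivity and specificity values are strange. T1 looks good, with sensitivity of 0.67 and specificity of 0.94. T2 has sensitivity of 0.85 but specificity of only 0.57, T3 has sensitivity of just 0.33, and T4 is never predicted at all, which gives it sensitivity of 0.00 and specificity of 1.00. That is not what we want.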
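The per-class figures come from treating each tumor class in turn as "positive" and all others as "negative" (one-vs-rest). The Python sketch below recomputes them from the confusion matrix above; it is an illustration of the arithmetic, not the R code used to produce the output.

```python
labels = ["T1", "T2", "T3", "T4"]
# Confusion matrix above: rows = predicted class, columns = actual class.
matrix = [
    [2, 0, 0, 1],   # predicted T1
    [1, 11, 2, 0],  # predicted T2
    [0, 2, 1, 0],   # predicted T3
    [0, 0, 0, 0],   # predicted T4
]
n = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / n

per_class = {}
for k, label in enumerate(labels):
    tp = matrix[k][k]
    fp = sum(matrix[k]) - tp                 # predicted k, actually another class
    fn = sum(row[k] for row in matrix) - tp  # actually k, predicted another class
    tn = n - tp - fp - fn
    per_class[label] = (tp / (tp + fn), tn / (tn + fp))

print(f"accuracy: {accuracy:.2f}")
for label, (sens, spec) in per_class.items():
    print(f"{label}: sensitivity={sens:.4f} specificity={spec:.4f}")
```

The T4 row of zeros explains its 0.00 sensitivity and 1.00 specificity: the test set holds a single T4 case, and the tree never predicts that class.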
I had an error with this step and was unable to get it to work.
I had a multitude of errors with this last section of C5.0 code. Unfortunately, I wasn’t able to get any of it to run.
While I was able to get most of the CART model for the binary PR.Status outcome to run, the model itself was very poor. I’m not sure whether there was an error or I took out the wrong variables, but accuracy was only 0.55, with sensitivity of 0.60 and specificity of 0.50. This is not what we are looking for.
My code for C5.0 was a disaster. More things had errors and wouldn’t run than would run. The code that did run didn’t produce the output I was expecting. I’m not sure exactly why, but maybe I just didn’t fully understand what I needed to be doing.
It’s hard to compare two models that you don’t really have output for. I don’t think I understand what happened well enough to even try. I will say that a working model that helps predict cancer from biomarkers would be invaluable. Any way this could be incorporated into hospitals and other medical centers should be attempted. Such a model could be directly responsible for saving lives if it is able to predict cancer occurring in an individual. The earlier cancer is found, the easier it is to treat and the more likely one is to recover from it.