The included dataset (clinical_data_breast_cancer_modified.csv) has information on 105 patients across 17 variables. Your goal is to build two classifiers: one for PR.Status (progesterone receptor), a biomarker that routinely leads to a cancer diagnosis, indicating whether the outcome was positive or negative, and one for Tumor, a multi-class variable. You would like to be able to explain the model to the mere mortals around you but need a fairly robust and flexible approach, so you’ve chosen to use decision trees to get started. In building both models, use CART and C5.0 and compare the differences.

In doing so, similar to great data scientists of the past, you remembered the excellent education provided to you at UVA in an undergrad data science course and have outlined the steps that will need to be undertaken to complete this task (you can add more or combine if needed).
As always, you will need to make sure to #comment your work heavily and render the results in a clear report (knitted) as the non MDSDPhDs of the world will someday need to understand the wonder and spectacle that will be your R code. Good luck and the world thanks you.

Footnotes:
- Some of the steps will not need to be repeated for the second model, use your judgment
- You can add or combine steps if needed
- Also, remember to try several methods during evaluation and always be mindful of how the model will be used in practice.
- Do not include ER.Status in your first tree, it’s basically the same as PR.Status

Prep

#3 Don't check for correlated variables....because it doesn't matter with Decision Trees...that was easy

Split

#5 Guess what, you also don't need to standardize the data, because DTs don't 
# give a ish, they make local decisions...keeps getting easier 

Base Rate

## [1] 54
## [1] 105
## [1] 0.4857143
## [1] 0.4857143
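
The chunk that produced these numbers is not shown in the knitted output. A rough sketch of how they could be reproduced, assuming PR.Status is coded 0/1 and using the class counts reported at the root node of the CART tree (54 ones, 51 zeros). The vector `pr_status` is a stand-in built from those counts, not the real column:

```r
# Stand-in for breastcancer$PR.Status, rebuilt from the class counts
# (54 ones, 51 zeros). The 0/1 coding is an assumption.
pr_status <- rep(c(1, 0), times = c(54, 51))

n_pos <- sum(pr_status)       # number of PR-positive patients
n_all <- length(pr_status)    # number of patients

share_pos <- n_pos / n_all    # share of PR-positive patients
share_neg <- 1 - share_pos    # share of PR-negative patients

round(c(positive = share_pos, negative = share_neg), 7)
```

Under these assumptions the printed 0.4857143 is 51/105, the negative share. The positive share is 54/105 = 0.5142857, so always guessing the majority class would be right about 51% of the time.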

#CART Model

#7 Build your model using the default settings

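# First, wrangle data into proper form...
# - make each variable a factor (note: apply() coerces the data to a character
#   matrix first, so numeric columns such as Days.to.Date.of.Last.Contact
#   are also treated as categorical, which is why the tree below splits on
#   a list of individual day counts)
# - convert to data frame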
breastcancer_factored <- breastcancer %>% 
  apply(2, function(x) as.factor(x)) %>% 
  as.data.frame()
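
One possible alternative, sketched here in base R on a made-up three-row data frame (the column names come from this report, the values are invented): convert only the character columns to factors, so that day counts stay numeric and a tree can split on a threshold instead of on a set of individual values.

```r
# Toy data: the values are invented for illustration only.
toy <- data.frame(
  Days.to.Date.of.Last.Contact = c(15, 240, 1027),
  Vital.Status = c("LIVING", "DECEASED", "LIVING"),
  stringsAsFactors = FALSE
)

# Convert character columns to factors, leave everything else untouched.
# toy[] keeps the data frame structure while replacing every column.
toy[] <- lapply(toy, function(col) {
  if (is.character(col)) factor(col) else col
})

str(toy)
```

This is only a sketch. Rerunning the model on data prepared this way would change the tree and every result that follows, so the outputs in this report still reflect the all-factor version.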



# set seed for reproducibility purposes
set.seed(1980)

# build default decision tree model for PR.Status classification
breastcancer_tree <- rpart(
  PR.Status ~ ., # model formula
  method = "class", # tree method
  parms = list(split = "gini"), # split method
  data = breastcancer_factored,
  control = rpart.control(cp = .01)
  )

breastcancer_tree
## n= 105 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 105 51 1 (0.48571429 0.51428571)  
##   2) Days.to.Date.of.Last.Contact=   0,   5,   7,   9, 133, 178, 197, 230, 240, 309, 414, 425, 450, 469, 515, 520, 544, 591, 643, 713, 735, 754, 775, 943, 964,1027,1051,1099,1180,1288,1305,1309,1319,1364,1471,1555,1627,1679,1692,1948,1965,2426 55  4 0 (0.92727273 0.07272727) *
##   3) Days.to.Date.of.Last.Contact=  15,  21,  31,  89, 212, 243, 274, 325, 362, 372, 387, 409, 441, 445, 477, 502, 549, 569, 575, 606, 631, 665, 769, 904, 968, 989, 993,1006,1072,1148,1215,1217,1229,1242,1255,1270,1295,1317,1338,1393,1405,1492,1512,1519,1547,1641,1742,1826,2359,2850 50  0 1 (0.00000000 1.00000000) *

Variable Importance

#8 View the results, what is the most important variable for the tree? 

table(breastcancer_tree$variable.importance)
## 
## 2.70233766233766 5.40467532467533 7.20623376623376 9.00779220779221 
##                1                1                1                1 
## 21.6187012987013  45.038961038961 
##                1                1

Note: Days.to.Date.of.Last.Contact (days to last contact) is the most important variable here; it is the only variable the tree actually splits on.
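
Wrapping the importance vector in table() tallies how often each value occurs, which is why the output above shows only numbers and no variable names. A small sketch of reading the names directly, using the kyphosis data that ships with rpart as a stand-in for the cancer data:

```r
library(rpart)

# Stand-in model: kyphosis is a built-in rpart dataset, not the cancer data.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# variable.importance is a named numeric vector; sort it to be explicit.
imp <- sort(fit$variable.importance, decreasing = TRUE)

# Express each variable's importance as a share of the total.
imp_share <- round(imp / sum(imp), 3)
imp_share

# Name of the most important variable.
top_variable <- names(imp)[1]
top_variable
```

The same two lines (`sort(...)` and `names(...)[1]`) should work on breastcancer_tree, assuming it is an rpart object as built above.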

#9 Plot the tree using the rpart.plot package (CART only).

rpart.plot(breastcancer_tree, type = 4, extra = 101)

#10 plot the cp chart and note the optimal size of the tree (CART only).

plotcp(breastcancer_tree)
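
The cp chart can also be read off numerically from the model's cptable. A minimal sketch, again using rpart's built-in kyphosis data as a stand-in, that picks the cp with the lowest cross-validated error and prunes to it (picking the minimum xerror is one common rule, not the only one):

```r
library(rpart)

set.seed(1980)  # the cross-validated error column depends on random folds
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# cptable has columns CP, nsplit, rel error, xerror, xstd.
cp_table <- fit$cptable

# Row with the smallest cross-validated error.
best_row <- which.min(cp_table[, "xerror"])
best_cp <- cp_table[best_row, "CP"]

# Tree size in leaves is the number of splits plus one.
best_size <- cp_table[best_row, "nsplit"] + 1

# Prune the full tree back to that complexity.
pruned <- prune(fit, cp = best_cp)

c(cp = best_cp, leaves = best_size)
```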

#11 Use the predict function and your models to predict the target variable using
#test set. 

cancertest_factored <- cancertest %>% 
  apply(2, function(x) as.factor(x)) %>% 
  as.data.frame()

tree_predict <- predict(breastcancer_tree, cancertest_factored, type = "class")

as.data.frame(tree_predict)
##    tree_predict
## 1             0
## 2             0
## 3             0
## 4             0
## 5             0
## 6             0
## 7             0
## 8             1
## 9             0
## 10            1
## 11            1
## 12            1
## 13            1
## 14            1
## 15            1
## 16            0
## 17            1
## 18            1
## 19            1
## 20            1
## 21            1
#tree_predict <- as.numeric(tree_predict)
tree_predict
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 
##  0  0  0  0  0  0  0  1  0  1  1  1  1  1  1  0  1  1  1  1  1 
## Levels: 0 1
#12 Generate, "by-hand", the hit rate and detection rate and compare the 
#detection rate to your original baseline rate. How did your models work?

str(tree_predict)
##  Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 2 ...
##  - attr(*, "names")= chr [1:21] "1" "2" "3" "4" ...
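# convert the factor predictions back to 0/1
# (as.numeric() alone would return the level codes 1/2)
tree_tibble <- as.data.frame(as.numeric(as.character(tree_predict)))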


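# counts from the confusion matrix in step 13: TP = 12, TN = 8, FP = 0, FN = 1
# error rate = (FP + FN) / Total = (0 + 1) / 21 = 0.048 (4.8%)
# thus, hit rate = 95.2%

# Detection Rate = A / (A + B + C + D) = 12 / 21 = 0.571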
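
A sketch of the "by-hand" calculation in base R. The two vectors are rebuilt from the counts in the caret confusion matrix of step 13 (8 true negatives, 1 false negative, 0 false positives, 12 true positives); their row order is arbitrary and does not match the real test set.

```r
# Rebuilt from the confusion-matrix counts; order is arbitrary.
predicted <- factor(c(rep(0, 8), 0, rep(1, 12)), levels = c(0, 1))
actual    <- factor(c(rep(0, 8), 1, rep(1, 12)), levels = c(0, 1))

# Rows are predictions, columns are actual values.
cm <- table(Prediction = predicted, Actual = actual)

tp <- cm["1", "1"]  # predicted 1, actually 1
tn <- cm["0", "0"]  # predicted 0, actually 0
fp <- cm["1", "0"]  # predicted 1, actually 0
fn <- cm["0", "1"]  # predicted 0, actually 1
total <- sum(cm)

error_rate     <- (fp + fn) / total  # share of wrong predictions
hit_rate       <- (tp + tn) / total  # share of correct predictions
detection_rate <- tp / total         # true positives over all cases

round(c(error = error_rate, hit = hit_rate, detection = detection_rate), 4)
```

These match the Accuracy (0.9524) and Detection Rate (0.5714) lines that caret reports.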
#13 Use the confusion matrix function in caret to
#check a variety of metrics and comment on the metric that might be best for
#each type of analysis.


confusionMatrix(
  as.factor(tree_predict), 
  as.factor(cancertest_factored$PR.Status), 
  positive = "1", 
  dnn=c("Prediction", "Actual"), 
  mode = "sens_spec"
  )
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0  8  1
##          1  0 12
##                                           
##                Accuracy : 0.9524          
##                  95% CI : (0.7618, 0.9988)
##     No Information Rate : 0.619           
##     P-Value [Acc > NIR] : 0.0005888       
##                                           
##                   Kappa : 0.9014          
##                                           
##  Mcnemar's Test P-Value : 1.0000000       
##                                           
##             Sensitivity : 0.9231          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.8889          
##              Prevalence : 0.6190          
##          Detection Rate : 0.5714          
##    Detection Prevalence : 0.5714          
##       Balanced Accuracy : 0.9615          
##                                           
##        'Positive' Class : 1               
## 
#14 Generate a ROC and AUC output, interpret the results


# generate roc output
cancerroc <- roc(as.numeric(cancertest_factored$PR.Status), as.numeric(tree_predict), plot = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
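
The AUC itself is not printed above. As a rough cross-check that does not rely on pROC, the AUC can be computed from the rank-sum (Mann-Whitney) identity in base R. The helper `auc_rank` is a hypothetical name, and the vectors are rebuilt from the step 13 confusion-matrix counts rather than taken from the real test set.

```r
# AUC via the rank-sum identity: the probability that a randomly chosen
# positive case gets a higher score than a randomly chosen negative case,
# with ties counted as one half.
auc_rank <- function(actual, score) {
  is_pos <- actual == 1
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  ranks <- rank(score)  # average ranks, so ties are handled
  (sum(ranks[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Rebuilt from the confusion-matrix counts; order is arbitrary.
actual    <- c(rep(0, 8), 1, rep(1, 12))
predicted <- c(rep(0, 8), 0, rep(1, 12))

auc_hard <- auc_rank(actual, predicted)
round(auc_hard, 4)  # 0.9615
```

With hard 0/1 predictions the ROC curve has a single interior point, so the AUC reduces to the average of sensitivity and specificity, which is the Balanced Accuracy (0.9615) in the caret output. Passing class probabilities, for example from predict(..., type = "prob"), would trace the curve over more thresholds.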

Base Rate (Tumor)

#15 Follow the same steps for the multi-class target, tumor, aside from step 1, 
# 2 and 14. For step 13 compare to the four base rates and see how you did. 



(t1 <-  sum(breastcancer$Tumor == "T1")/length(breastcancer$Tumor))
## [1] 0.1428571
(t2 <- sum(breastcancer$Tumor == "T2")/length(breastcancer$Tumor))
## [1] 0.6190476
(t3 <- sum(breastcancer$Tumor == "T3")/length(breastcancer$Tumor))
## [1] 0.1809524
(t4 <- sum(breastcancer$Tumor == "T4")/length(breastcancer$Tumor))
## [1] 0.05714286
Tumor_baserate <- as_tibble(c(t1,t2,t3,t4))

Tumor_baserate
## # A tibble: 4 x 1
##    value
##    <dbl>
## 1 0.143 
## 2 0.619 
## 3 0.181 
## 4 0.0571
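
The four base rates can also be produced in one call with prop.table(). A minimal sketch on a stand-in vector rebuilt from the rates above (15, 65, 19 and 6 patients out of 105):

```r
# Stand-in for breastcancer$Tumor, rebuilt from the printed base rates.
tumor <- rep(c("T1", "T2", "T3", "T4"), times = c(15, 65, 19, 6))

# Class counts, then class shares.
tumor_counts <- table(tumor)
tumor_rates <- prop.table(tumor_counts)

round(tumor_rates, 4)

# The largest share is the accuracy of always predicting the majority class.
majority_class <- names(which.max(tumor_rates))
majority_rate <- max(tumor_rates)
```

Always predicting T2 would already be right about 62% of the time, so that is the number a multi-class tree has to beat.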

Split (Tumor)

# set seed so the partition is reproducible
set.seed(1980)
split <- createDataPartition(breastcancer$Tumor, times = 1, p = 0.8, list = FALSE)

cancertrain_multi <- breastcancer[split,]
cancertest_multi <- breastcancer[-split,]

Cross validation process

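fitter <- trainControl(method = "repeatedcv",
  number = 5, # k-fold CV needs at least 2 folds; with 1 fold nothing is held out
  repeats = 1, returnResamp = "all") # setting up our cross validation

# Small data set, so 5 folds (about 17 patients each) instead of 10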
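
caret does the resampling internally. To make clear what k-fold cross validation is doing, here is a sketch of 5-fold CV written out by hand. It uses rpart and the built-in iris data as stand-ins (three classes instead of four tumor stages), so it shows the mechanics only, not results for this dataset.

```r
library(rpart)

set.seed(1980)
k <- 5

# Assign every row to one of k folds at random.
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))

# For each fold: train on the other k - 1 folds, test on the held-out fold.
fold_accuracy <- vapply(seq_len(k), function(i) {
  train_rows <- iris[folds != i, ]
  test_rows  <- iris[folds == i, ]
  fit <- rpart(Species ~ ., data = train_rows, method = "class")
  pred <- predict(fit, test_rows, type = "class")
  mean(pred == test_rows$Species)
}, numeric(1))

fold_accuracy        # one accuracy per fold
mean(fold_accuracy)  # cross-validated accuracy estimate
```

With a single fold there is no "other" data to train on, which is why at least two folds are required.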

Cleaning

# recode the four two-level text columns as 1/0 in a single mutate call
cancertrain_multi <- cancertrain_multi %>% 
  mutate(
    Node.Coded = if_else(Node.Coded == "Positive", 1, 0),
    HER2.Final.Status = if_else(HER2.Final.Status == "Positive", 1, 0),
    Metastasis.Coded = if_else(Metastasis.Coded == "Positive", 1, 0),
    Vital.Status = if_else(Vital.Status == "LIVING", 1, 0)
  )


cancertrain_multi <- cancertrain_multi %>% 
  apply(2, function(x) as.factor(x)) %>% 
  as.data.frame()
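
Only the training set is recoded above, so the test set would need the same treatment before prediction. One way to keep the two in sync is to wrap the recoding in a function and apply it to both. A base R sketch follows; `recode_flags` is a hypothetical helper, and the "Negative" and "DECEASED" values in the toy data are assumptions about what the other levels look like.

```r
# Hypothetical helper: recode the two-level text columns as 1/0.
recode_flags <- function(df) {
  positive_cols <- intersect(
    c("Node.Coded", "HER2.Final.Status", "Metastasis.Coded"),
    names(df)
  )
  for (col in positive_cols) {
    df[[col]] <- ifelse(df[[col]] == "Positive", 1, 0)
  }
  if ("Vital.Status" %in% names(df)) {
    df$Vital.Status <- ifelse(df$Vital.Status == "LIVING", 1, 0)
  }
  df
}

# Toy rows with invented values, standing in for train and test sets.
toy_train <- data.frame(
  Node.Coded = c("Positive", "Negative"),
  Vital.Status = c("LIVING", "DECEASED"),
  stringsAsFactors = FALSE
)
toy_test <- data.frame(
  Node.Coded = "Negative",
  Vital.Status = "LIVING",
  stringsAsFactors = FALSE
)

toy_train_coded <- recode_flags(toy_train)
toy_test_coded  <- recode_flags(toy_test)
toy_train_coded
```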



# Choosing target and training features for C5.0
# (drop Tumor from the features so the target is not used to predict itself)
features <- cancertrain_multi %>% select(-Tumor)
target <- as.factor(cancertrain_multi$Tumor)

Grid

grid <- expand.grid(.winnow = c(TRUE, FALSE), .trials = c(1, 5, 10, 15, 20), .model = "tree")

Train

#cancertrain_tree2 <- train(x=features,y=target,tuneGrid=grid,trControl=fitter,method="C5.0"
  #          ,verbose=TRUE)


#fiver <-C5.0(x=features, y=target)

Plotting

Prediction and Closing

# 16 Summarize what you learned for each model along the way and make 
# recommendations to the world on how this could be used moving forward, 
# being careful not to over promise. 

It seems that the C5.0 and CART models are practically the same, but the CART model achieved higher overall accuracy. This may be partly because it had fewer classes to handle: its target, PR.Status, is binary, while Tumor has four levels. The base rates of the Tumor classes other than T2 were comparatively small (T4 covers under 6% of patients), so C5.0 should probably only be applied going forward with more robust data that allows the tree to be trained properly on every class.