DS 1870: Module 7 Homework

For this homework assignment, you’ll be using ten (10) different features to try to predict if a tumor is malignant (cancerous) or benign (harmless).

Question 1: Exploratory Data Analysis

Part 1a) EDA Graph

Create a pair of boxplots for each feature to compare the malignant and benign tumors. Place diagnosis inside fct_rev() when mapping diagnosis to the fill aesthetic to get it to match the solutions.

cancer |> 
  # Stacking the columns again
  pivot_longer(
    cols = -diagnosis,
    names_to = "feature"
  ) |> 
  
  # Creating small multiples for the boxplots
  ggplot(
    mapping = aes(
      x = value,
      y = diagnosis,
      fill = fct_rev(diagnosis)
    )
  ) + 
  
  geom_boxplot(show.legend = FALSE) + 
  
  facet_wrap(
    facets = ~ feature,
    scales = "free_x",
    nrow = 5
  ) + 
  
  labs(
    y = NULL,
    x = NULL
  ) + 
  
  # Removing the tick marks on the x-axis
  scale_x_continuous(breaks = NULL)

Part 1b) EDA Findings

Which feature seems to be the most useful at determining if a tumor is malignant? Points seems to be the most helpful feature, while area, concavity, perimeter, radius, texture, and compactness appear to be at least somewhat helpful.

Which feature seems to be the least useful at determining if a tumor is malignant? The box plots for dimension have the most overlap, followed by smoothness and symmetry.

Question 2: Classification Tree

For Question 2, you’ll use a classification tree to predict if a tumor is malignant or benign.

Part 2A) Full classification tree

Create the pruned classification tree and name it pruned_tree. Display the CP table of the pruned tree.

# Keep this at the top of the code chunk
RNGversion("4.1.0"); set.seed(1234)

# Reading in the prune tree function
source("prune trees.R")

# Growing the pruned tree
pruned_tree <- 
  rpart_pruned(
    formula_ct = diagnosis ~ .,  # The . means all the other columns
    df = cancer
  )

# Displaying the cp table
pruned_tree |> 
  pluck("cptable") |> 
  data.frame()

##           CP nsplit rel.error    xerror       xstd
## 1 0.77358491      0 1.0000000 1.0000000 0.05440140
## 2 0.02122642      1 0.2264151 0.2405660 0.03214090
## 3 0.01729560      3 0.1839623 0.2311321 0.03156514
## 4 0.01650943      6 0.1320755 0.2122642 0.03036546
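
As a minimal sketch of how to read this table, the printed values can be retyped into a data frame and queried for the row with the lowest cross-validated error (xerror). The name cp_table is an assumption made for this illustration and is not part of the homework code; the numbers are copied by hand from the output above.

```r
# CP table values copied by hand from the output above;
# cp_table is a hypothetical name used only for this illustration
cp_table <- data.frame(
  CP        = c(0.77358491, 0.02122642, 0.01729560, 0.01650943),
  nsplit    = c(0, 1, 3, 6),
  rel.error = c(1.0000000, 0.2264151, 0.1839623, 0.1320755),
  xerror    = c(1.0000000, 0.2405660, 0.2311321, 0.2122642),
  xstd      = c(0.05440140, 0.03214090, 0.03156514, 0.03036546)
)

# Row with the smallest cross-validated relative error
best_row <- cp_table[which.min(cp_table$xerror), ]
best_row

# A binary tree with k splits has k + 1 terminal nodes (leaves)
n_leaves <- best_row$nsplit + 1
n_leaves
```

The last row has the lowest xerror, so the displayed tree uses 6 splits and has 7 leaves.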

Part 2B) Displaying the tree

Display the classification tree.

rpart.plot(
  x = pruned_tree,
  type = 5,
  extra = 101
)

Part 2C) Interpret the left-most and right-most nodes

Left-most node: If a tumor has points below 0.051 and an area below 696, it is expected to be benign (low points + small area = benign).

Right-most node: If a tumor has points above 0.051 and an area above 791, it is expected to be malignant (high points + large area = malignant).

Part 2D) Most important variables

Which features, if any, are important when diagnosing a tumor as benign or malignant, according to the pruned classification tree? Which two are the least useful?

varImp(object = pruned_tree) |> 
  arrange(-Overall)

##                Overall
## points      269.193890
## area        243.159674
## perimeter   240.623872
## radius      238.276245
## concavity   193.172384
## texture      64.485358
## smoothness    3.329745
## symmetry      3.329745
## compactness   0.000000
## dimension     0.000000

Points, area, perimeter, radius, and concavity all seem to have high predictive strength.

Compactness and dimension don’t add any predictive power to the classification tree.
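
The importance scores are easiest to compare as shares of their total. The sketch below retypes the printed scores into a named vector (importance is a hypothetical name, not an object from the homework) and converts them to proportions. It assumes only that the scores are meaningful relative to one another within this one tree.

```r
# Importance scores copied by hand from the varImp() output above;
# importance is a hypothetical name used only for this illustration
importance <- c(
  points = 269.193890, area = 243.159674, perimeter = 240.623872,
  radius = 238.276245, concavity = 193.172384, texture = 64.485358,
  smoothness = 3.329745, symmetry = 3.329745,
  compactness = 0, dimension = 0
)

# Share of the total importance carried by each feature
share <- importance / sum(importance)
round(share, 3)

# Combined share of the five strongest features
sum(share[1:5])
```

Seen this way, the top five features carry nearly all of the total, which matches the conclusion stated above.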

Part 2E) Estimating the Classification Error Rate

pruned_tree |> 
  pluck("cptable") |> 
  data.frame()

##           CP nsplit rel.error    xerror       xstd
## 1 0.77358491      0 1.0000000 1.0000000 0.05440140
## 2 0.02122642      1 0.2264151 0.2405660 0.03214090
## 3 0.01729560      3 0.1839623 0.2311321 0.03156514
## 4 0.01650943      6 0.1320755 0.2122642 0.03036546

# Confusion matrix
confusionMatrix(
  data = predict(pruned_tree, type = "class"),
  reference = cancer$diagnosis
)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Benign Malignant
##   Benign       355        26
##   Malignant      2       186
##                                           
##                Accuracy : 0.9508          
##                  95% CI : (0.9297, 0.9671)
##     No Information Rate : 0.6274          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8923          
##                                           
##  Mcnemar's Test P-Value : 1.383e-05       
##                                           
##             Sensitivity : 0.9944          
##             Specificity : 0.8774          
##          Pos Pred Value : 0.9318          
##          Neg Pred Value : 0.9894          
##              Prevalence : 0.6274          
##          Detection Rate : 0.6239          
##    Detection Prevalence : 0.6696          
##       Balanced Accuracy : 0.9359          
##                                           
##        'Positive' Class : Benign          
##
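
The summary statistics in this output can be reproduced from the four cell counts alone. The following is a small base R sketch; the variable names (tp, fp, fn, tn) are assumptions for illustration, with "Benign" treated as the positive class as reported above.

```r
# Cell counts copied from the confusion matrix above
# (rows = prediction, columns = reference; "Benign" is the positive class)
tp <- 355  # predicted Benign,    actually Benign
fp <- 26   # predicted Benign,    actually Malignant
fn <- 2    # predicted Malignant, actually Benign
tn <- 186  # predicted Malignant, actually Malignant

n <- tp + fp + fn + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
nir         <- max(tp + fn, tn + fp) / n  # no information rate

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, nir = nir), 4)
```

Because Benign is the positive class, the 26 malignant tumors that were predicted to be benign lower the specificity, not the sensitivity.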

Use the CP table and confusion matrix above to calculate the estimated error rate (how often the tree is expected to misclassify a tumor) from the cross-validation results.

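The xerror column is reported relative to the error rate of the no-information classifier (always predicting the majority class), so the estimated error using cross-validation is (1 - no information rate) \(\times\) the pruned tree’s xerror: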

\[\textrm{Estimated error} = (1-0.627) \times 0.212 = 0.079\]
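
The same arithmetic can be checked in base R. This is a sketch using values copied by hand from the CP table and confusion matrix; the variable names are assumptions for illustration. Applying the same rescaling to rel.error gives the training error, which serves as a consistency check against the reported accuracy.

```r
# Values copied from the outputs above
n_total     <- 355 + 26 + 2 + 186   # all tumors in the confusion matrix
n_malignant <- 26 + 186             # the minority class
xerror      <- 0.2122642            # last row of the CP table
rel_error   <- 0.1320755            # last row of the CP table

# Error rate of always predicting the majority class (1 - NIR)
root_error <- n_malignant / n_total

# rpart reports errors relative to the root node, so rescale them
cv_error       <- root_error * xerror
training_error <- root_error * rel_error

round(c(cv_error = cv_error, training_error = training_error), 3)
```

The training error of about 0.049 agrees with 1 - 0.9508 from the confusion matrix, while the cross-validated estimate of about 0.079 is higher because it is computed on held-out folds.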

yourname

2025-04-28