knitr::opts_chunk$set(echo = TRUE,
                      message = FALSE,
                      warning = FALSE,
                      fig.align = "center")
# load packages: typical - tidyverse and skimr
# Classification - caret, rpart, rpart.plot
pacman::p_load(tidyverse, skimr, caret, rpart, rpart.plot)
# Setting the default theme
theme_set(theme_bw())
# Reading in the data
cancer <-
  read.csv(
    'cancer.csv',
    stringsAsFactors = TRUE
  ) |>
  mutate(
    diagnosis = factor(diagnosis,
                       levels = c("B", "M"),
                       labels = c("Benign", "Malignant"))
  )
For this homework assignment, you’ll be using ten (10) different features to try to predict if a tumor is malignant (cancerous) or benign (harmless).
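The 10 different measurements (features) about each tumor are: radius, texture, perimeter, area, smoothness, compactness, concavity, points, symmetry, and dimension.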
Create a pair of boxplots for each feature to compare the malignant and benign tumors. Place diagnosis inside fct_rev() when mapping diagnosis to the fill aesthetic to get it to match the solutions.
cancer |>
  # Stacking the columns again
  pivot_longer(
    cols = -diagnosis,
    names_to = "feature"
  ) |>
  # Creating small multiples for the boxplots
  ggplot(
    mapping = aes(
      x = value,
      y = diagnosis,
      fill = fct_rev(diagnosis)
    )
  ) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(
    facets = ~ feature,
    scales = "free_x",
    nrow = 5
  ) +
  labs(
    y = NULL,
    x = NULL
  ) +
  # Removing the tick marks on the x-axis
  scale_x_continuous(breaks = NULL)
Which feature seems to be the most useful at determining if a tumor is malignant?
The points feature seems to be the most helpful, while area, concavity, perimeter, radius, texture, and compactness appear to be at least somewhat helpful.
Which feature seems to be the least useful at determining if a tumor is malignant?
The boxplots for dimension have the most overlap between the two diagnoses, followed by smoothness and symmetry.
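The visual judgment of "overlap" can be roughly quantified. The sketch below is only an illustration on simulated data, not the cancer data set: the data frame `toy`, its columns, and the helper `separation()` are all hypothetical. It scores each feature by the gap between the group medians relative to the average within-group IQR, so a larger score means less overlap between the boxes.

```r
# Hypothetical toy data: NOT the cancer data set from this assignment
set.seed(1)
toy <- data.frame(
  diagnosis   = rep(c("Benign", "Malignant"), each = 200),
  separated   = c(rnorm(200, mean = 0), rnorm(200, mean = 3)),
  overlapping = c(rnorm(200, mean = 0), rnorm(200, mean = 0.1))
)

# Gap between the two group medians, scaled by the average within-group IQR
# (hypothetical helper; assumes exactly two groups)
separation <- function(x, group) {
  medians <- tapply(x, group, median)
  iqrs    <- tapply(x, group, IQR)
  unname(abs(diff(medians)) / mean(iqrs))
}

scores <- sapply(
  toy[c("separated", "overlapping")],
  separation,
  group = toy$diagnosis
)

# Features ordered from least to most overlap
sort(scores, decreasing = TRUE)
```

A score like this is only a rough companion to the boxplots; it looks at one feature at a time and ignores how features combine.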
For question 2, you’ll use a classification tree to predict if a tumor is malignant or benign.
Create the pruned classification tree and name it pruned_tree. Display the CP table of the pruned tree.
# Keep this at the top of the code chunk
RNGversion("4.1.0"); set.seed(1234)
# Reading in the prune tree function
source("prune trees.R")
# Growing the pruned tree
pruned_tree <-
  rpart_pruned(
    formula_ct = diagnosis ~ ., # The . means all the other columns
    df = cancer
  )
# Displaying the cp table
pruned_tree |>
  pluck("cptable") |>
  data.frame()
## CP nsplit rel.error xerror xstd
## 1 0.77358491 0 1.0000000 1.0000000 0.05440140
## 2 0.02122642 1 0.2264151 0.2405660 0.03214090
## 3 0.01729560 3 0.1839623 0.2311321 0.03156514
## 4 0.01650943 6 0.1320755 0.2122642 0.03036546
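The helper rpart_pruned() comes from "prune trees.R", whose contents are not shown here, so its exact pruning rule is unknown. As a minimal sketch of one common approach (an assumption, not necessarily what the helper does), the code below grows a full tree on the `kyphosis` data that ships with rpart and prunes it back to the CP value with the smallest cross-validated error (xerror).

```r
library(rpart)

set.seed(1234)

# Grow a deliberately large tree (cp = 0 disables early stopping on complexity).
# kyphosis is an example data set bundled with rpart, not the cancer data.
full_tree <- rpart(
  Kyphosis ~ .,
  data = kyphosis,
  method = "class",
  cp = 0
)

# Find the row of the CP table with the lowest cross-validated error
cp_table <- full_tree$cptable
best_row <- which.min(cp_table[, "xerror"])

# Prune back to that complexity parameter
pruned <- prune(full_tree, cp = cp_table[best_row, "CP"])

data.frame(pruned$cptable)
```

Another common choice is the one-standard-error rule, which picks the smallest tree whose xerror is within one xstd of the minimum; which rule the course helper uses cannot be determined from this document.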
Display the classification tree.
rpart.plot(
  x = pruned_tree,
  type = 5,
  extra = 101
)
Left-most node: If a tumor has points below 0.051 and an area less than 696, it is expected to be benign (low points + small area = benign)
Right-most node: If a tumor has points above 0.051 and an area above 791, it is expected to be malignant (high points + large area = malignant)
Which features, if any, are important when diagnosing a tumor as benign or malignant, according to the pruned classification tree? Which two are the least useful?
varImp(object = pruned_tree) |>
  arrange(-Overall)
## Overall
## points 269.193890
## area 243.159674
## perimeter 240.623872
## radius 238.276245
## concavity 193.172384
## texture 64.485358
## smoothness 3.329745
## symmetry 3.329745
## compactness 0.000000
## dimension 0.000000
Points, area, perimeter, radius, and concavity all seem to have high predictive strength.
Compactness and dimension don’t add any predictive power to the classification tree.
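As a cross-check on a ranking like this, an rpart fit also stores its own importance scores in the `variable.importance` element. This is a related but not identical measure to caret's varImp(), so the numbers will generally differ. The sketch below uses the `kyphosis` example data bundled with rpart rather than the cancer data, and rescales the scores to percentages for easier reading.

```r
library(rpart)

# Example fit on rpart's bundled kyphosis data (not the cancer data)
fit <- rpart(
  Kyphosis ~ Age + Number + Start,
  data = kyphosis,
  method = "class"
)

# Named numeric vector, already sorted from most to least important
imp <- fit$variable.importance

# Express each feature's importance as a share of the total
round(100 * imp / sum(imp), 1)
```

Features that never appear as a primary or surrogate split are simply absent from this vector, which corresponds to the zero rows in the table above.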
pruned_tree |>
  pluck("cptable") |>
  data.frame()
## CP nsplit rel.error xerror xstd
## 1 0.77358491 0 1.0000000 1.0000000 0.05440140
## 2 0.02122642 1 0.2264151 0.2405660 0.03214090
## 3 0.01729560 3 0.1839623 0.2311321 0.03156514
## 4 0.01650943 6 0.1320755 0.2122642 0.03036546
# Confusion matrix
confusionMatrix(
  data = predict(pruned_tree, type = "class"),
  reference = cancer$diagnosis
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Malignant
## Benign 355 26
## Malignant 2 186
##
## Accuracy : 0.9508
## 95% CI : (0.9297, 0.9671)
## No Information Rate : 0.6274
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8923
##
## Mcnemar's Test P-Value : 1.383e-05
##
## Sensitivity : 0.9944
## Specificity : 0.8774
## Pos Pred Value : 0.9318
## Neg Pred Value : 0.9894
## Prevalence : 0.6274
## Detection Rate : 0.6239
## Detection Prevalence : 0.6696
## Balanced Accuracy : 0.9359
##
## 'Positive' Class : Benign
##
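The headline statistics in this output can be reproduced by hand from the four cell counts. The sketch below re-enters the counts from the confusion matrix above (the object name `cm` is just for illustration) and treats Benign as the positive class, as confusionMatrix() reports.

```r
# Cell counts copied from the confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(
  c(355, 2, 26, 186),
  nrow = 2,
  dimnames = list(
    Prediction = c("Benign", "Malignant"),
    Reference  = c("Benign", "Malignant")
  )
)

# Share of all tumors that were classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: share of truly benign tumors predicted benign
sensitivity <- cm["Benign", "Benign"] / sum(cm[, "Benign"])

# Specificity: share of truly malignant tumors predicted malignant
specificity <- cm["Malignant", "Malignant"] / sum(cm[, "Malignant"])

# Balanced accuracy: average of sensitivity and specificity
balanced_accuracy <- (sensitivity + specificity) / 2

round(
  c(accuracy = accuracy,
    sensitivity = sensitivity,
    specificity = specificity,
    balanced_accuracy = balanced_accuracy),
  4
)
```

Because Benign is the positive class here, the 26 malignant tumors predicted as benign show up in the specificity, not the sensitivity.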
Use the CP table and confusion matrix above to calculate the estimated error rate (how often the tree will predict the cancer incorrectly) from the cross-validation results.
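The xerror column of the CP table is measured relative to the error of a tree with no splits, which always predicts the majority class (Benign). That baseline error is 1 minus the No Information Rate. The estimated error using cross-validation is therefore the no-information error rate \(\times\) the pruned tree’s xerror: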
\[\textrm{Estimated error} = (1-0.627) \times 0.212 = 0.079\]
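The same arithmetic can be checked in R using the unrounded values. The sketch below re-enters the counts from the confusion matrix and the last row of the CP table; the variable names are only for illustration. It also confirms that applying the same scaling to rel.error recovers the training error rate, that is, 1 minus the reported accuracy.

```r
# Reference totals from the confusion matrix
n_benign    <- 355 + 2
n_malignant <- 26 + 186
n_total     <- n_benign + n_malignant

# Error of always predicting the majority class (1 - No Information Rate)
root_error <- n_malignant / n_total

# Last row of the CP table for the pruned tree
rel_error <- 0.1320755
xerror    <- 0.2122642

# Cross-validated estimate of the misclassification rate
estimated_error <- root_error * xerror

# Training (resubstitution) error, for comparison
training_error <- root_error * rel_error

round(c(estimated = estimated_error, training = training_error), 3)
```

The cross-validated estimate (about 7.9%) is larger than the training error (about 4.9%), which is expected because the training error is computed on the same tumors used to grow the tree.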