knitr::opts_chunk$set(echo = FALSE,
                      warning = FALSE,
                      message = FALSE)
# load packages
pacman::p_load(tidyverse, class, skimr, caret, rpart, rpart.plot)
# Setting the seed for the markdown
RNGversion("4.1.0"); set.seed(2870)
# Changing the default theme
theme_set(theme_bw())
For the code chunk below to run, make sure to install (but not load) the alr4 package.
## # A tibble: 970 × 3
## diameter severity status
## <dbl> <dbl> <chr>
## 1 13 0.0480 lived
## 2 21 0.0525 lived
## 3 26 0.0525 lived
## 4 19 0.0598 lived
## 5 6 0.0650 lived
## 6 8 0.0650 lived
## 7 9 0.0650 lived
## 8 18 0.0650 lived
## 9 10 0.0650 lived
## 10 13 0.0650 lived
## # ℹ 960 more rows
The trees_storm data set was collected after a severe storm in the Boundary Waters Canoe Area Wilderness. The full data set has nine different species of trees, but we’ll focus on just the most common: cedars.
The three columns are:
diameter: The diameter of the tree at 6 ft off the ground
severity: The proportion of trees within 25 meters that were knocked over. It’s a way of measuring how severe the storm was in that area.
status: Whether the tree survived (lived) or died (died) in the storm.
This homework will try to predict if a tree survived based on diameter and severity.
We’ll start by visualizing and summarizing the data.
Create an appropriate graph to display the relationship between diameter, severity, and status. Make sure the graph looks nice and appropriately displays how status differs based on the two explanatory variables.
Calculate the 5 number summary, mean, and standard deviation of diameter and severity separately for trees that survived and trees that died. Display them in a way that makes it easy to compare how the two groups of cedars differ on diameter and severity.
## # A tibble: 14 × 4
##    name     stat     lived   died
##    <chr>    <chr>    <dbl>  <dbl>
##  1 diameter min          5      6
##  2 diameter Q1           8     12
##  3 diameter median      12     15
##  4 diameter Q3          15     18
##  5 diameter max         51     35
##  6 diameter average   12.4   15.4
##  7 diameter sd        5.59   5.20
##  8 severity min     0.0480 0.0662
##  9 severity Q1       0.129  0.405
## 10 severity median   0.256  0.554
## 11 severity Q3       0.470  0.684
## 12 severity max      0.834  0.974
## 13 severity average  0.310  0.544
## 14 severity sd       0.201  0.198
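One way to build a side-by-side table like the one above is to split each variable by status and apply a small summary function to each group. The sketch below uses base R only and a made-up `toy` data frame (its values are hypothetical, not the cedar data); `seven_stats` is an illustrative helper name, not something from the original solution.

```r
# Hypothetical toy data with the same column names as trees_storm
toy <- data.frame(
  diameter = c(6, 9, 12, 15, 20, 10, 14, 16, 18, 25),
  severity = c(0.05, 0.10, 0.20, 0.30, 0.45, 0.40, 0.55, 0.60, 0.70, 0.90),
  status   = rep(c("lived", "died"), each = 5)
)

# Five-number summary plus mean and standard deviation
seven_stats <- function(x) {
  c(min     = min(x),
    Q1      = quantile(x, 0.25, names = FALSE),
    median  = median(x),
    Q3      = quantile(x, 0.75, names = FALSE),
    max     = max(x),
    average = mean(x),
    sd      = sd(x))
}

# One column per status group, one row per statistic
diameter_summary <- sapply(split(toy$diameter, toy$status), seven_stats)
severity_summary <- sapply(split(toy$severity, toy$status), seven_stats)

print(round(diameter_summary, 2))
print(round(severity_summary, 3))
```

Keeping the two groups as adjacent columns makes it easy to read across a row and compare, for example, the median diameter of trees that lived and trees that died.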
When comparing diameter and severity by the cedars that lived and died:
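Is it better for the tree to have a small or wide diameter?
A smaller diameter appears to be better: trees that lived had a median diameter of 12, compared with 15 for trees that died.
Is it better for the storm to be strong or weak in the area?
A weaker storm (lower severity) is better: the median severity was 0.256 for trees that lived, compared with 0.554 for trees that died.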
Find the best choice of k to use for the kNN algorithm. Start by searching over the range of 10 to 20. Once you get the code to work for k = 10 to 20, change it to search over the range 10 to 200.
## # A tibble: 191 × 2
## k acc
## <int> <dbl>
## 1 10 0.756
## 2 11 0.764
## 3 12 0.758
## 4 13 0.761
## 5 14 0.747
## 6 15 0.762
## 7 16 0.761
## 8 17 0.764
## 9 18 0.760
## 10 19 0.754
## # ℹ 181 more rows
Create a graph of the results
What is the best choice of k for kNN when trying to predict if a cedar will die during a storm?
## # A tibble: 1 × 2
## k acc
## <int> <dbl>
## 1 105 0.775
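The search over k can be sketched with leave-one-out cross-validation from `class::knn.cv()`. This is a minimal sketch under stated assumptions, not the original solution code: the `toy` data are simulated, the rule that generates `status` is hypothetical, and standardizing the predictors with `scale()` is an assumption about the preprocessing.

```r
library(class)  # provides knn.cv()

set.seed(2870)
n <- 300

# Simulated stand-in for trees_storm (hypothetical values)
toy <- data.frame(
  diameter = runif(n, 5, 40),
  severity = runif(n, 0, 1)
)
# Hypothetical rule: trees in harder-hit areas die more often
toy$status <- factor(
  ifelse(toy$severity + rnorm(n, sd = 0.25) > 0.45, "died", "lived")
)

# Put both predictors on the same scale so neither dominates the distance
X <- scale(toy[, c("diameter", "severity")])

# Leave-one-out accuracy for each candidate k
k_grid <- 10:20
acc <- sapply(k_grid, function(k) {
  pred <- knn.cv(train = X, cl = toy$status, k = k)
  mean(pred == toy$status)
})

knn_results <- data.frame(k = k_grid, acc = acc)
best <- knn_results[which.max(knn_results$acc), ]
print(best)
```

Widening the search only requires changing `k_grid`, for example to `10:200`.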
Form a confusion matrix for the best “model”. Then answer the questions below the code chunk.
## Confusion Matrix and Statistics
##
## Reference
## Prediction died lived
## died 451 137
## lived 81 301
##
## Accuracy : 0.7753
## 95% CI : (0.7477, 0.8012)
## No Information Rate : 0.5485
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5411
##
## Mcnemar's Test P-Value : 0.0001953
##
## Sensitivity : 0.8477
## Specificity : 0.6872
## Pos Pred Value : 0.7670
## Neg Pred Value : 0.7880
## Prevalence : 0.5485
## Detection Rate : 0.4649
## Detection Prevalence : 0.6062
## Balanced Accuracy : 0.7675
##
## 'Positive' Class : died
##
How much more accurate is the model compared to just guessing that all of the trees died?
kNN accuracy: 77.5%
No information accuracy: 54.9%
kNN is noticeably more accurate than just guessing that all the trees died, by about 22.7 percentage points (0.7753 - 0.5485 = 0.2268).
What percentage of trees that died are correctly classified as having died?
Trees that died: 532
Trees that died that were classified as having died: 451
Percentage of trees that died that were correctly classified: \(\frac{451}{532} = 0.848\) or 84.8%
What percentage of trees that were classified as lived actually survived the storm?
Trees classified as lived: 382
Trees classified as lived that actually survived: 301
Percentage of trees classified as lived that actually survived: \(\frac{301}{382} = 0.788\) or 78.8%
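The arithmetic behind these answers can be checked directly from the four counts in the confusion matrix. The sketch below uses base R only; the object names (`cm`, `nir`, `npv`) are illustrative.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(c(451, 81, 137, 301), nrow = 2,
             dimnames = list(Prediction = c("died", "lived"),
                             Reference  = c("died", "lived")))

n_trees     <- sum(cm)
accuracy    <- sum(diag(cm)) / n_trees               # correct / all trees
nir         <- max(colSums(cm)) / n_trees            # always guess "died"
sensitivity <- cm["died", "died"] / sum(cm[, "died"])    # of trees that died
npv         <- cm["lived", "lived"] / sum(cm["lived", ]) # of predicted "lived"

print(round(c(accuracy = accuracy, NIR = nir, gain = accuracy - nir,
              sensitivity = sensitivity, NPV = npv), 4))
```

The results line up with the Accuracy, No Information Rate, Sensitivity, and Neg Pred Value rows of the output above.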
Using the kNN results only, can you determine if diameter or severity is better at predicting if the tree survived the storm? If so, which is more important?
For question 3, you’ll be using a classification tree to predict if a tree survived or died.
Using all the other columns of the data, grow a full classification tree to predict status. Display the complexity parameter table of the full tree.
##              CP nsplit  rel.error    xerror       xstd
## 6  0.0028538813     18 0.40867580 0.5091324 0.02991939
## 7  0.0022831050     22 0.39726027 0.5091324 0.02991939
## 8  0.0017123288     82 0.24657534 0.5410959 0.03055387
## 9  0.0015220700     86 0.23972603 0.5502283 0.03072644
## 10 0.0013698630    116 0.18949772 0.5479452 0.03068365
## 11 0.0011415525    124 0.17808219 0.5776256 0.03122194
## 12 0.0009132420    184 0.10958904 0.5821918 0.03130135
## 13 0.0007610350    200 0.09132420 0.5958904 0.03153431
## 14 0.0005707763    221 0.07534247 0.5958904 0.03153431
## 15 0.0000000000    245 0.05707763 0.6141553 0.03183282
Find the xerror and cp value to prune the full tree.
## [1] 0.005707763
xerror: 0.5204023
cp value: 0.0057078
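A common way to choose these two numbers is the one-standard-error rule: take the smallest xerror in the cp table, add its xstd to get a cutoff, and prune at the cp of the simplest tree whose xerror falls below that cutoff. The sketch below assumes that rule is the one intended here; the `toy` data are simulated, and the settings used to grow the full tree (`minsplit = 2`, `minbucket = 1`, `cp = 0`) are assumptions rather than the original call.

```r
library(rpart)

set.seed(2870)
n <- 300

# Simulated stand-in for trees_storm (hypothetical values)
toy <- data.frame(
  diameter = runif(n, 5, 40),
  severity = runif(n, 0, 1)
)
toy$status <- factor(
  ifelse(toy$severity + rnorm(n, sd = 0.25) > 0.45, "died", "lived")
)

# Grow the tree as far as it will go
full_tree <- rpart(status ~ diameter + severity, data = toy, method = "class",
                   control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))

cp_tab <- full_tree$cptable

# Cutoff = smallest cross-validated error plus its standard error
best_row      <- which.min(cp_tab[, "xerror"])
xerror_cutoff <- cp_tab[best_row, "xerror"] + cp_tab[best_row, "xstd"]

# Simplest tree (first row) whose xerror is below the cutoff
prune_row <- which(cp_tab[, "xerror"] < xerror_cutoff)[1]
prune_cp  <- cp_tab[prune_row, "CP"]

pruned_tree <- prune(full_tree, cp = prune_cp)
print(c(xerror_cutoff = unname(xerror_cutoff), prune_cp = unname(prune_cp)))
```

Because rows of the cp table are ordered from the smallest tree to the largest, taking the first row under the cutoff favors the tree with the fewest splits.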
Prune the full tree from part a) and plot it.
Interpret the right most and left most leaf nodes in context
Left: If severity is greater than 0.27 and the diameter of the tree is greater than 9.8 cm, we predict the tree will die.
Right: If severity is less than 0.27, we predict the tree will survive (lived).
What is the estimated accuracy of the classification tree, using cross validation? Make sure to show your work!
## CP nsplit rel.error xerror xstd
## 3 0.005707763 2 0.4817352 0.4931507 0.02958371
The pruned tree’s xerror is 0.4932 (from the cp table row above).
The no information rate (NIR) is 0.5485.
The CV estimated error rate is: \(\textrm{xerror}\times(1 - \textrm{NIR}) = 0.4932(1 - 0.5485) = 0.2227\)
So the estimated accuracy is 0.777 or 77.7%.
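rpart reports xerror relative to the root node error, which is the error rate from always predicting the majority class, so it has to be multiplied by \(1 - \textrm{NIR}\) to get an actual error rate. The sketch below plugs in the xerror from the cp table row printed above and the class counts from the earlier confusion matrix; the variable names are illustrative.

```r
# xerror of the pruned tree, from the cp table row printed above
xerror <- 0.4931507

# Class counts from the confusion matrix: 532 died out of 970 trees
n_died  <- 451 + 81
n_total <- 970

nir        <- n_died / n_total   # accuracy of always guessing "died"
root_error <- 1 - nir            # error rate of the root node

cv_error    <- xerror * root_error
cv_accuracy <- 1 - cv_error

print(round(c(NIR = nir, cv_error = cv_error, cv_accuracy = cv_accuracy), 4))
```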
Using the result of the classification tree, can you determine which variable is more important when predicting if the tree will survive? If so, which variable is more important?
## Overall
## severity 136.4482
## diameter 103.6479
Severity is more important than diameter, but they aren’t drastically different, so both are important to the classification tree.
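For comparison, an rpart fit also stores its own importance scores in the `variable.importance` element. The sketch below fits a tree to simulated `toy` data in which severity is deliberately the stronger predictor (a hypothetical rule, not the cedar data) and rescales the scores to shares. The numbers it prints are not expected to match the `Overall` values above, which may come from a different importance calculation.

```r
library(rpart)

set.seed(2870)
n <- 300

# Simulated data: severity drives the outcome, diameter adds a small effect
toy <- data.frame(
  diameter = runif(n, 5, 40),
  severity = runif(n, 0, 1)
)
score <- toy$severity + 0.01 * (toy$diameter - 20) + rnorm(n, sd = 0.15)
toy$status <- factor(ifelse(score > 0.45, "died", "lived"))

tree <- rpart(status ~ diameter + severity, data = toy, method = "class")

# Named numeric vector: larger values mean the variable did more of the splitting
importance <- tree$variable.importance
print(round(importance / sum(importance), 3))
```

Rescaling to shares makes it easier to judge whether one variable clearly dominates or both contribute.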