knitr::opts_chunk$set(echo = FALSE,
                      warning = FALSE,
                      message = FALSE)
# load packages
pacman::p_load(tidyverse, class, skimr, caret, rpart, rpart.plot)

# Setting the seed for the markdown
RNGversion("4.1.0"); set.seed(2870)

# Changing the default theme
theme_set(theme_bw())

Data Description

For the code chunk below to run, make sure to install (but not load) the alr4 package.
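
As a point of reference, here is a sketch of how trees_storm might have been built from alr4’s Blowdown data. The column names (d, s, y, spp), the coding y = 1 for a tree that died, and the "cedar" species label are assumptions based on the alr4 documentation.

# A sketch only: the column names and codings are assumptions from alr4's docs
trees_storm <- 
  alr4::Blowdown |> 
  as_tibble() |> 
  filter(spp == "cedar") |>                  # keep just the cedars
  transmute(diameter = d,                    # tree diameter
            severity = s,                    # local storm severity
            status = if_else(y == 1, "died", "lived")) |> 
  arrange(severity)

trees_storm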

## # A tibble: 970 × 3
##    diameter severity status
##       <dbl>    <dbl> <chr> 
##  1       13   0.0480 lived 
##  2       21   0.0525 lived 
##  3       26   0.0525 lived 
##  4       19   0.0598 lived 
##  5        6   0.0650 lived 
##  6        8   0.0650 lived 
##  7        9   0.0650 lived 
##  8       18   0.0650 lived 
##  9       10   0.0650 lived 
## 10       13   0.0650 lived 
## # ℹ 960 more rows

The trees_storm data set is from after a severe storm in the Boundary Waters Canoe Area Wilderness. The full data set has nine different species of trees, but we’ll focus on just the most common: cedars.

The three columns are:

  1. diameter: The diameter of the tree, in cm, at 6 ft off the ground.

  2. severity: The proportion of trees within 25 meters that were knocked over. It’s a way of measuring how severe the storm was in that area.

  3. status: Whether the tree survived (lived) or died (died) in the storm.

This homework will try to predict whether a tree survived based on its diameter and the storm’s severity.

Question 1) Exploratory Analysis

We’ll start by visualizing and summarizing the data.

1a) Graphing the data

Create an appropriate graph to display the relationship between diameter, severity, and status. Make sure the graph looks nice and appropriately displays how status differs based on the two explanatory variables.
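
One reasonable choice is a scatterplot of severity against diameter with the points colored by status. A sketch, assuming trees_storm as built above:

# Scatterplot of the two explanatory variables, colored by survival status
ggplot(data = trees_storm,
       mapping = aes(x = diameter, y = severity, color = status)) + 
  geom_point(alpha = 0.5) + 
  labs(x = "Diameter (cm)",
       y = "Storm severity",
       color = "Tree status")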

1b) Summarizing the data

Calculate the 5 number summary, mean, and standard deviation of diameter and severity separately for trees that survived and trees that died. Display them in a way that makes it easy to compare how the two groups of cedars differ on diameter and severity.
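
One way to build the table shown below (a sketch, assuming trees_storm): pivot the two numeric columns into long form, summarize by variable and status, then pivot the two groups back out into side-by-side columns.

# Five number summary, mean, and sd of each variable, by survival status
trees_storm |> 
  pivot_longer(cols = c(diameter, severity)) |> 
  group_by(name, status) |> 
  summarize(min = min(value),
            Q1 = quantile(value, 0.25),
            median = median(value),
            Q3 = quantile(value, 0.75),
            max = max(value),
            average = mean(value),
            sd = sd(value),
            .groups = "drop") |> 
  pivot_longer(cols = min:sd, names_to = "stat") |> 
  pivot_wider(names_from = status, values_from = value)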

## # A tibble: 14 × 4
##    name     stat      lived    died
##    <chr>    <chr>     <dbl>   <dbl>
##  1 diameter min      5       6     
##  2 diameter Q1       8      12     
##  3 diameter median  12      15     
##  4 diameter Q3      15      18     
##  5 diameter max     51      35     
##  6 diameter average 12.4    15.4   
##  7 diameter sd       5.59    5.20  
##  8 severity min      0.0480  0.0662
##  9 severity Q1       0.129   0.405 
## 10 severity median   0.256   0.554 
## 11 severity Q3       0.470   0.684 
## 12 severity max      0.834   0.974 
## 13 severity average  0.310   0.544 
## 14 severity sd       0.201   0.198

Part 1c) EDA Conclusions

When comparing diameter and severity by the cedars that lived and died:

Is it better for the tree to have a small or wide diameter? It appears to be better for the tree to have a smaller diameter: the cedars that lived had a smaller median diameter (12) than the cedars that died (15).

Is it better for the storm to be strong or weak in the area? It is better for the storm to be weaker (lower severity): the cedars that lived experienced a much lower median severity (0.256) than the cedars that died (0.554).

Question 2) kNN

2a) Finding best kNN “model”

Find the best choice of k to use for the kNN algorithm. Start by searching over the range of 10 to 20. Once you get the code to work for k = 10 to 20, change it to search over the range 10 to 200.
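
A sketch of one way to run the search, using leave-one-out cross-validation via class::knn.cv(). Rescaling both predictors to a 0 to 1 range first is an assumption, made because diameter and severity are on very different scales.

# Rescale the predictors so diameter doesn't dominate the distance calculation
trees_scaled <- 
  trees_storm |> 
  mutate(across(c(diameter, severity),
                ~ (.x - min(.x)) / (max(.x) - min(.x))))

# Leave-one-out CV accuracy for each choice of k
knn_acc <- 
  tibble(k = 10:200) |> 
  mutate(acc = map_dbl(k, \(kk) {
    preds <- knn.cv(train = trees_scaled |> select(diameter, severity),
                    cl = factor(trees_scaled$status),
                    k = kk)
    mean(preds == trees_scaled$status)
  }))

knn_acc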

## # A tibble: 191 × 2
##        k   acc
##    <int> <dbl>
##  1    10 0.756
##  2    11 0.764
##  3    12 0.758
##  4    13 0.761
##  5    14 0.747
##  6    15 0.762
##  7    16 0.761
##  8    17 0.764
##  9    18 0.760
## 10    19 0.754
## # ℹ 181 more rows

Create a graph of the results.
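
For example, assuming knn_acc from above:

# Accuracy for each choice of k; the best k is the peak of the line
ggplot(data = knn_acc,
       mapping = aes(x = k, y = acc)) + 
  geom_line() + 
  labs(x = "k: number of neighbors",
       y = "Estimated accuracy")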

What is the best choice of k for kNN when trying to predict if a cedar will die during a storm?
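
One way to pull out the winner, assuming knn_acc from above:

# The row with the highest estimated accuracy
knn_acc |> 
  slice_max(acc, n = 1)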

## # A tibble: 1 × 2
##       k   acc
##   <int> <dbl>
## 1   105 0.775

Part 2b) Accuracy of the best model

Form a confusion matrix for the best “model”. Then answer the questions below the code chunk.
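
A sketch, assuming trees_scaled from 2a and the best choice found above (k = 105):

# Predictions from the best kNN "model", again using leave-one-out CV
best_preds <- knn.cv(train = trees_scaled |> select(diameter, severity),
                     cl = factor(trees_scaled$status),
                     k = 105)

# Confusion matrix with "died" treated as the positive class
confusionMatrix(data = best_preds,
                reference = factor(trees_scaled$status),
                positive = "died")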

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died lived
##      died   451   137
##      lived   81   301
##                                           
##                Accuracy : 0.7753          
##                  95% CI : (0.7477, 0.8012)
##     No Information Rate : 0.5485          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5411          
##                                           
##  Mcnemar's Test P-Value : 0.0001953       
##                                           
##             Sensitivity : 0.8477          
##             Specificity : 0.6872          
##          Pos Pred Value : 0.7670          
##          Neg Pred Value : 0.7880          
##              Prevalence : 0.5485          
##          Detection Rate : 0.4649          
##    Detection Prevalence : 0.6062          
##       Balanced Accuracy : 0.7675          
##                                           
##        'Positive' Class : died            
## 

How much more accurate is the model compared to just guessing that all of the trees died? kNN accuracy: 77.5%. No information accuracy: 54.9%. kNN is noticeably more accurate than just guessing that all the trees died, by about 22.7 percentage points.

What percentage of trees that died are correctly classified as having died? Trees that died: 451 + 81 = 532. Trees that died that were classified as having died: 451. Percentage of trees that died that were correctly classified: \(\frac{451}{532} = 0.848\), or 84.8% (the sensitivity).

What percentage of trees that were classified as lived actually survived the storm? Trees classified as lived: 81 + 301 = 382. Trees classified as lived that actually survived: 301. Percentage of trees classified as lived that actually survived: \(\frac{301}{382} = 0.788\), or 78.8% (the negative predictive value).

2c) Important variables

Using the kNN results only, can you determine if diameter or severity is better at predicting if the tree survived the storm? If so, which is more important? No: kNN only returns predictions based on distance, not a measure of how much each predictor contributes, so we can’t tell which variable is more important from the kNN results alone.

Question 3) Classification Tree

For question 3, you’ll be using a classification tree to predict if a tree survived or died.

3a) Grow the full tree

Using all the other columns of the data, grow a full classification tree to predict status. Display the complexity parameter table of the full tree.
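
A sketch of growing the tree out fully with rpart. The control settings below (cp = 0 with tiny minsplit and minbucket) are the usual way to force a full tree, but are an assumption about what was actually used.

# Grow the full tree: keep splitting until the leaves are (nearly) pure
tree_full <- rpart(status ~ ., 
                   data = trees_storm |> mutate(status = factor(status)),
                   method = "class",
                   cp = 0,
                   minsplit = 2,
                   minbucket = 1)

# Complexity parameter table of the full tree
tree_full$cptable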

##              CP nsplit  rel.error    xerror       xstd
## 6  0.0028538813     18 0.40867580 0.5091324 0.02991939
## 7  0.0022831050     22 0.39726027 0.5091324 0.02991939
## 8  0.0017123288     82 0.24657534 0.5410959 0.03055387
## 9  0.0015220700     86 0.23972603 0.5502283 0.03072644
## 10 0.0013698630    116 0.18949772 0.5479452 0.03068365
## 11 0.0011415525    124 0.17808219 0.5776256 0.03122194
## 12 0.0009132420    184 0.10958904 0.5821918 0.03130135
## 13 0.0007610350    200 0.09132420 0.5958904 0.03153431
## 14 0.0005707763    221 0.07534247 0.5958904 0.03153431
## 15 0.0000000000    245 0.05707763 0.6141553 0.03183282

Part 3b) Find the cut off to prune the tree

Find the xerror cutoff and the cp value to use to prune the full tree.
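
A sketch of the usual 1-SE rule: the cutoff is the smallest xerror plus that row’s xstd, and we prune at the cp of the simplest tree whose xerror falls below the cutoff.

cp_table <- data.frame(tree_full$cptable)

# Cutoff: minimum xerror plus the xstd of that same row
xerror_cutoff <- min(cp_table$xerror) + 
  cp_table$xstd[which.min(cp_table$xerror)]

# cp of the simplest tree (fewest splits) with xerror below the cutoff
cp_prune <- 
  cp_table |> 
  filter(xerror < xerror_cutoff) |> 
  slice(1) |> 
  pull(CP)

cp_prune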

## [1] 0.005707763

xerror cutoff: 0.5204023

cp value: 0.0057078

Part 3c) Creating the pruned tree

Prune the full tree from part a) and plot it.
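
A sketch, assuming tree_full and cp_prune from above (extra = 104 is just one display option):

# Prune the full tree at the chosen cp, then plot it
tree_pruned <- prune(tree_full, cp = cp_prune)

rpart.plot(tree_pruned, extra = 104)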

Part 3d) Interpreting two leaf nodes

Interpret the right-most and left-most leaf nodes in context.

Left: If the storm’s severity is greater than 0.27 and the tree’s diameter is greater than 9.8 cm, we predict the tree will die.

Right: If the storm’s severity is less than 0.27, we predict the tree will survive (lived).

Part 3e) Estimated Accuracy

What is the estimated accuracy of the classification tree, using cross validation? Make sure to show your work!

##            CP nsplit rel.error    xerror       xstd
## 3 0.005707763      2 0.4817352 0.4931507 0.02958371

The pruned tree’s xerror is 0.493.

The no information rate (NIR) is 0.549.

Since xerror is measured relative to the root-node error rate, which equals 1 − NIR, the CV estimated error rate is: \(\textrm{xerror}\times(1 - \textrm{NIR}) = 0.493(1 - 0.549) = 0.222\)

So the estimated accuracy is \(1 - 0.222 = 0.778\), or 77.8%.
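
A quick check of the arithmetic, using the values from the output above:

# xerror is scaled by the root-node error rate, which equals 1 - NIR
xerror <- 0.4931507
NIR    <- 0.5485

1 - xerror * (1 - NIR)   # estimated accuracy, about 0.78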

Part 3f) Variable Importance

Using the result of the classification tree, can you determine which variable is more important when predicting if the tree will survive? If so, which variable is more important?
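
One way to get the importance scores shown below is caret::varImp(), which has a method for rpart objects. Assuming tree_pruned from 3c:

# Variable importance from the pruned classification tree
varImp(tree_pruned)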

##           Overall
## severity 136.4482
## diameter 103.6479

Severity is more important than diameter (an importance of about 136 vs 104), but they aren’t drastically different, so both variables are useful to the classification tree.