Question 1

fir <- read.csv("HW6data.csv")

a.

counts <- fir %>% group_by(y) %>% summarise(n = n(), proportion = n() / nrow(fir))
counts
# A tibble: 2 × 3
  y         n proportion
  <chr> <int>      <dbl>
1 No      426      0.646
2 Yes     233      0.354
fir$y <- as.factor(fir$y)

b.

set.seed(1)
kGrid <- expand.grid(k = seq(1, 30, by = 2))
fitControl <- trainControl(method = "cv", number = 10)

model.knn <- train(y ~.,
                        data = fir,
                        method = "knn",
                        trControl = fitControl,
                        tuneGrid = kGrid)
model.knn
k-Nearest Neighbors 

659 samples
  2 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 594, 592, 594, 593, 592, 594, ... 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   1  0.7449097  0.4448469
   3  0.7934920  0.5383872
   5  0.7816456  0.5109847
   7  0.7937682  0.5351687
   9  0.8090603  0.5703620
  11  0.8013913  0.5521750
  13  0.8028146  0.5538941
  15  0.8012994  0.5502945
  17  0.8012768  0.5496301
  19  0.7996924  0.5499072
  21  0.7846533  0.5185537
  23  0.7800838  0.5109016
  25  0.7740452  0.5024982
  27  0.7709682  0.4944898
  29  0.7770522  0.5091332

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.

The highest cross-validated accuracy (0.8090603) occurs at k = 9. With k = 9, the model correctly predicted whether a tree survived or died about 80.91% of the time on the held-out folds.

c.

The kappa statistic is lower than the accuracy because it discounts the agreement that would be expected by chance alone. With two classes, No and Yes, a model that guessed at random with equal probability would already be right about 50% of the time. Because more trees survived (No) than died (Yes), a model that chose No for every tree would reach about 65% accuracy without learning anything. Kappa rescales accuracy so that this kind of chance-level performance scores near 0, which is why it comes out lower than the raw accuracy.

d.

predict(model.knn, fir[1, ])
[1] No
Levels: No Yes
d <- as.matrix(dist(fir[ , -3]))

e.

fir$y[order(d[1,])[2:10]]
[1] No  No  No  No  No  No  Yes No  No 
Levels: No Yes

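Of the 9 nearest neighbors of observation 1, 8 are No and 1 is Yes, so the majority vote is No, which matches the prediction in part (d).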

f.

set.seed(1)
model.knn.s <- train(y ~.,
                        data = fir,
                        method = "knn",
                        preProc = c("center", "scale"),
                        trControl = fitControl,
                        tuneGrid = kGrid)
model.knn.s
k-Nearest Neighbors 

659 samples
  2 predictor
  2 classes: 'No', 'Yes' 

Pre-processing: centered (2), scaled (2) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 594, 592, 594, 593, 592, 594, ... 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   1  0.7524629  0.4543978
   3  0.7874095  0.5238633
   5  0.7997843  0.5474779
   7  0.7982232  0.5468836
   9  0.8074526  0.5656214
  11  0.8150983  0.5812651
  13  0.8089225  0.5649740
  15  0.8196218  0.5887324
  17  0.8226295  0.5948393
  19  0.8181293  0.5860307
  21  0.8181978  0.5858372
  23  0.8272223  0.6048862
  25  0.8257297  0.6012081
  27  0.8271764  0.6021629
  29  0.8240994  0.5961436

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 23.

The new best cross-validated accuracy (0.8272223) occurs at k = 23. D was measured in cm, while S is recorded on a much smaller numeric scale, so differences in D dominated the unscaled Euclidean distances. Centering subtracts each predictor's mean, so values become distances from the average; scaling then divides by the standard deviation, so those distances are expressed in standard-deviation units. This puts D and S on the same scale: a value 1 standard deviation above the mean is now recorded as 1 for both D and S.

g.

set.seed(1)
model.logit <- train(data = fir,
                     y ~ .,
                     method = "glm",
                     family = "binomial",
                     trControl = fitControl)
model.logit
Generalized Linear Model 

659 samples
  2 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 594, 592, 594, 593, 592, 594, ... 
Resampling results:

  Accuracy   Kappa    
  0.8134015  0.5736139

The cross-validated accuracy of the logistic regression model is 0.8134015.

summary(model.logit)

Call:
NULL

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -5.2238     0.3902 -13.387   <2e-16 ***
D             0.2850     0.0290   9.828   <2e-16 ***
S             4.4242     0.5037   8.784   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 856.21  on 658  degrees of freedom
Residual deviance: 582.66  on 656  degrees of freedom
AIC: 588.66

Number of Fisher Scoring iterations: 5

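The coefficient for D is 0.2850. When all other variables are held constant, a 1 unit increase in D increases the log-odds of the outcome by 0.2850. Here the outcome is y = Yes (the tree dies), so larger values of D are associated with higher odds of dying.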

h.

predict(model.logit, newdata = data.frame(D = 10, S = 0.40), type = "prob")
         No       Yes
1 0.6466098 0.3533902

The predicted probability that this tree will die is 0.3534. The odds that the tree dies are 0.3533902/0.6466098 = 0.547. In other words, among trees like this one, about 0.547 trees die for every 1 tree that survives.

i.

predict(model.logit, newdata = data.frame(D = 11, S = 0.40), type = "prob")
         No       Yes
1 0.5791239 0.4208761

In part (g), we stated: “When all other variables are held constant, a 1 unit increase in D increases the log-odds of the outcome by 0.2850.” Since S was held constant and D increased by 1 unit (from 10 to 11), we would expect the log-odds of Yes to increase by 0.2850. The calculation below confirms this, up to rounding.

d10 <- log(0.3533902/0.6466098)
d11 <- log(0.4208761/0.5791239)
d11 - d10
[1] 0.2849922

j.

set.seed(1)

grid <- expand.grid(
  usekernel = FALSE, # Use Normal instead of KDE
  fL = 0, # Laplace correction
  adjust = 1 # Bandwidth adjustment factor
)

model.nb <- train(data = fir,
                  y ~ .,
                  method = "nb",
                  trControl = fitControl,
                  tuneGrid = grid)
model.nb
Naive Bayes 

659 samples
  2 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 594, 592, 594, 593, 592, 594, ... 
Resampling results:

  Accuracy   Kappa    
  0.8013235  0.5422791

Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'usekernel' was held constant at a value of FALSE
Tuning parameter 'adjust' was held constant at a value of 1

The cross-validated accuracy of the naive Bayes model was 0.8013235.

k.

# Named class_stats rather than table to avoid masking base::table()
class_stats <- fir %>% group_by(y) %>% summarise(meanD = mean(D), stdD = sd(D), meanS = mean(S), stdS = sd(S))
class_stats
# A tibble: 2 × 5
  y     meanD  stdD meanS  stdS
  <fct> <dbl> <dbl> <dbl> <dbl>
1 No     8.01  3.31 0.313 0.189
2 Yes   12.8   4.89 0.520 0.216
fir[1,]
  D        S  y
1 9 0.024212 No

Normal densities (likelihoods) of observation 1's D and S values, given class No

dnorm(9, mean = 8.005869, sd = 3.311916)
[1] 0.1151504
dnorm(0.024212, mean = 0.3128129, sd = 0.1891409)
[1] 0.6585027

Normal densities (likelihoods) of observation 1's D and S values, given class Yes

dnorm(9, mean = 12.798283, sd = 4.889300)
[1] 0.06034118
dnorm(0.024212, mean = 0.5195392, sd = 0.2162452)
[1] 0.1338578

Prior times likelihood for No vs. Yes (unnormalized posterior scores)

pNo <- mean(fir$y == "No") * 0.1151504 * 0.6585027
pNo
[1] 0.04901705
pYes <- mean(fir$y == "Yes") * 0.06034118 * 0.1338578
pYes
[1] 0.002855801

The score for class No when D = 9 and S = 0.024212 is much higher than the score for class Yes (0.0490 vs. 0.0029), so naive Bayes predicts No: the tree is classified as having survived, which matches its observed label.

l.

KNN estimates the probability from nearby points: it finds the k closest observations and takes the proportion of them that fall into each class. Logistic regression models the probability directly by plugging the predictors into a linear equation for the log-odds and passing the result through the logistic (sigmoid) function, which keeps it between 0 and 1. Naive Bayes works in the other direction: it asks how likely the observed feature values are under each class using a distribution fitted per class (here, normal), multiplies those likelihoods by the class prior, and compares the results across classes, assuming the features are independent within each class.

---
title: "Homework 6"
author: "Charlie Morgan"
date: "Due: 04/26/26"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: no
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
editor_options: 
  chunk_output_type: inline
---

```{css, echo = FALSE}
#TOC::before {
  content: "Table of Contents";
  font-weight: bold;
  font-size: 1.2em;
  display: block;
  color: navy;
  margin-bottom: 10px;
}


div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:no-repeat;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 15px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Gill Sans", sans-serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("factoextra")) {
  install.packages("factoextra")
  library(factoextra)
}

if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}

if (!require("FactoMineR")) {
  install.packages("FactoMineR")
  library(FactoMineR)
}

if (!require("corrplot")) {
  install.packages("corrplot")
  library(corrplot)
}

if (!require("mice")) {
  install.packages("mice")
  library(mice)
}

if (!require("kableExtra")) {
  install.packages("kableExtra")
  library(kableExtra)
}

if (!require("cluster")) {
  install.packages("cluster")
  library(cluster)
}

if (!require("mclust")) {
  install.packages("mclust")
  library(mclust)
}

if (!require("dbscan")) {
  install.packages("dbscan")
  library(dbscan)
}

if (!require("caret")) {
  install.packages("caret")
  library(caret)
}
####
knitr::opts_chunk$set(echo = TRUE,       # include code chunk in the output file
                      warning = FALSE,   # sometimes, your code may produce warning messages;
                                         # you can choose whether to include them in
                                         # the output file.
                      results = TRUE,    # you can also decide whether to include the output
                                         # in the output file.
                      message = FALSE,
                      comment = NA
                      )  

```

# Question 1
```{r}
fir <- read.csv("HW6data.csv")
```

## a.
```{r}
counts <- fir %>% group_by(y) %>% summarise(n = n(), proportion = n() / nrow(fir))
counts
fir$y <- as.factor(fir$y)
```

## b.
```{r}
set.seed(1)
kGrid <- expand.grid(k = seq(1, 30, by = 2))
fitControl <- trainControl(method = "cv", number = 10)

model.knn <- train(y ~.,
                        data = fir,
                        method = "knn",
                        trControl = fitControl,
                        tuneGrid = kGrid)
model.knn
```
The highest cross-validated accuracy (0.8090603) occurs at k = 9. With k = 9, the model correctly predicted whether a tree survived or died about 80.91% of the time on the held-out folds.

## c.
The kappa statistic is lower than the accuracy because it discounts the agreement that would be expected by chance alone. With two classes, No and Yes, a model that guessed at random with equal probability would already be right about 50% of the time. Because more trees survived (No) than died (Yes), a model that chose No for every tree would reach about 65% accuracy without learning anything. Kappa rescales accuracy so that this kind of chance-level performance scores near 0, which is why it comes out lower than the raw accuracy.
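
As a rough illustration of the chance correction, the sketch below applies the usual formula kappa = (p_obs - p_exp) / (1 - p_exp) to a classifier that always predicts No, using the class counts from part (a). The helper name `kappa_stat` is made up for this example and is not a caret function.

```{r}
# Hypothetical helper: Cohen's kappa from observed and chance-expected agreement
kappa_stat <- function(p_obs, p_exp) (p_obs - p_exp) / (1 - p_exp)

p_no  <- 426 / 659   # proportion of No from part (a)
p_yes <- 233 / 659   # proportion of Yes from part (a)

# A classifier that always predicts No is right whenever the truth is No
acc_always_no <- p_no

# Chance agreement = P(pred No) * P(true No) + P(pred Yes) * P(true Yes)
exp_always_no <- 1 * p_no + 0 * p_yes

kappa_always_no <- kappa_stat(acc_always_no, exp_always_no)
kappa_always_no
```

Under these assumptions the always-No rule has roughly 65% accuracy but a kappa of 0, which is the sense in which kappa removes credit for chance agreement.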

## d.
```{r}
predict(model.knn, fir[1, ])
d <- as.matrix(dist(fir[ , -3]))
```

## e.
```{r}
fir$y[order(d[1,])[2:10]]
```
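Of the 9 nearest neighbors of observation 1, 8 are No and 1 is Yes, so the majority vote is No, which matches the prediction in part (d).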
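
The vote can be tallied explicitly. The sketch below hard-codes the nine neighbor labels from the part (e) output instead of recomputing them, so it runs without the data file; the variable names are chosen for this example only.

```{r}
# Labels of the 9 nearest neighbors, copied from the part (e) output
neighbors <- factor(c("No", "No", "No", "No", "No", "No", "Yes", "No", "No"),
                    levels = c("No", "Yes"))

votes <- table(neighbors)                    # count of each class among the neighbors
votes

prop_yes <- mean(neighbors == "Yes")         # KNN's estimate of P(Yes) for observation 1
majority <- names(votes)[which.max(votes)]   # class with the most votes
majority
```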

## f.
```{r}
set.seed(1)
model.knn.s <- train(y ~.,
                        data = fir,
                        method = "knn",
                        preProc = c("center", "scale"),
                        trControl = fitControl,
                        tuneGrid = kGrid)
model.knn.s
```
The new best cross-validated accuracy (0.8272223) occurs at k = 23. D was measured in cm, while S is recorded on a much smaller numeric scale, so differences in D dominated the unscaled Euclidean distances. Centering subtracts each predictor's mean, so values become distances from the average; scaling then divides by the standard deviation, so those distances are expressed in standard-deviation units. This puts D and S on the same scale: a value 1 standard deviation above the mean is now recorded as 1 for both D and S.
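
To make the effect of centering and scaling concrete, here is a minimal sketch on a small made-up data frame. The values in `toy` are hypothetical and only mimic the two scales; they are not taken from `HW6data.csv`.

```{r}
# Hypothetical measurements, chosen only to mimic the two scales
toy <- data.frame(D = c(5, 9, 12, 16), S = c(0.10, 0.30, 0.45, 0.70))

# Center (subtract the column mean) and scale (divide by the column sd)
toy_z <- scale(toy, center = TRUE, scale = TRUE)

round(colMeans(toy_z), 10)   # both columns now have mean 0
apply(toy_z, 2, sd)          # and standard deviation 1

# Euclidean distance between rows 1 and 2, before and after standardizing
dist(toy[1:2, ])             # driven almost entirely by D
dist(toy_z[1:2, ])           # D and S now contribute on comparable terms
```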

## g.
```{r}
set.seed(1)
model.logit <- train(data = fir,
                     y ~ .,
                     method = "glm",
                     family = "binomial",
                     trControl = fitControl)
model.logit
```
The cross-validated accuracy of the logistic regression model is 0.8134015.

```{r}
summary(model.logit)
```
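The coefficient for D is 0.2850. When all other variables are held constant, a 1 unit increase in D increases the log-odds of the outcome by 0.2850. Here the outcome is y = Yes (the tree dies), so larger values of D are associated with higher odds of dying.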
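
The same coefficient can be read on the odds scale by exponentiating it. The sketch below uses the rounded estimate from the summary table, so the result is approximate.

```{r}
b_D <- 0.2850              # coefficient for D from summary(model.logit), rounded

odds_ratio_D <- exp(b_D)   # multiplicative change in the odds of Yes per 1-unit increase in D
odds_ratio_D
```

Under this reading, each additional unit of D multiplies the odds of dying by roughly 1.33, holding S fixed.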

## h.
```{r}
predict(model.logit, newdata = data.frame(D = 10, S = 0.40), type = "prob")
```
The predicted probability that this tree will die is 0.3534. The odds that the tree dies are 0.3533902/0.6466098 = 0.547. In other words, among trees like this one, about 0.547 trees die for every 1 tree that survives.
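
As a cross-check, the prediction can be reproduced by hand from the fitted coefficients. This sketch plugs in the rounded estimates from part (g), so it matches the `predict()` output only to about three or four decimal places.

```{r}
# Rounded coefficients from summary(model.logit)
b0  <- -5.2238
b_D <- 0.2850
b_S <- 4.4242

eta  <- b0 + b_D * 10 + b_S * 0.40   # linear predictor: log-odds of Yes
p    <- plogis(eta)                  # inverse logit, 1 / (1 + exp(-eta))
odds <- p / (1 - p)                  # equivalently exp(eta)

c(log_odds = eta, prob = p, odds = odds)
```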

## i.
```{r}
predict(model.logit, newdata = data.frame(D = 11, S = 0.40), type = "prob")
```
In part (g), we stated: "When all other variables are held constant, a 1 unit increase in D increases the log-odds of the outcome by 0.2850." Since S was held constant and D increased by 1 unit (from 10 to 11), we would expect the log-odds of Yes to increase by 0.2850. The calculation below confirms this, up to rounding.
```{r}
d10 <- log(0.3533902/0.6466098)
d11 <- log(0.4208761/0.5791239)
d11 - d10
```
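
The reason the difference equals the coefficient can be written out in one line. The symbols $\beta_0$, $\beta_D$, $\beta_S$ (the fitted intercept and slopes) and $p_{10}$, $p_{11}$ (the predicted probabilities of Yes at D = 10 and D = 11) are notation introduced here for illustration.

$$
\begin{aligned}
\log\frac{p_{11}}{1-p_{11}} - \log\frac{p_{10}}{1-p_{10}}
&= (\beta_0 + 11\,\beta_D + 0.40\,\beta_S) - (\beta_0 + 10\,\beta_D + 0.40\,\beta_S) \\
&= \beta_D \approx 0.2850
\end{aligned}
$$

The intercept and the S term cancel, which is why the computed difference, 0.2849922, agrees with the reported coefficient to four decimal places.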

## j.
```{r}
set.seed(1)

grid <- expand.grid(
  usekernel = FALSE, # Use Normal instead of KDE
  fL = 0, # Laplace correction
  adjust = 1 # Bandwidth adjustment factor
)

model.nb <- train(data = fir,
                  y ~ .,
                  method = "nb",
                  trControl = fitControl,
                  tuneGrid = grid)
model.nb
```
The cross-validated accuracy of the naive Bayes model was 0.8013235.

## k.
```{r}
# Named class_stats rather than table to avoid masking base::table()
class_stats <- fir %>% group_by(y) %>% summarise(meanD = mean(D), stdD = sd(D), meanS = mean(S), stdS = sd(S))
class_stats
```
```{r}
fir[1,]
```
Normal densities (likelihoods) of observation 1's D and S values, given class No
```{r}
dnorm(9, mean = 8.005869, sd = 3.311916)
dnorm(0.024212, mean = 0.3128129, sd = 0.1891409)
```
Normal densities (likelihoods) of observation 1's D and S values, given class Yes
```{r}
dnorm(9, mean = 12.798283, sd = 4.889300)
dnorm(0.024212, mean = 0.5195392, sd = 0.2162452)
```
Prior times likelihood for No vs. Yes (unnormalized posterior scores)
```{r}
pNo <- mean(fir$y == "No") * 0.1151504 * 0.6585027
pNo
pYes <- mean(fir$y == "Yes") * 0.06034118 * 0.1338578
pYes
```
The score for class No when D = 9 and S = 0.024212 is much higher than the score for class Yes (0.0490 vs. 0.0029), so naive Bayes predicts No: the tree is classified as having survived, which matches its observed label.
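
The two scores are prior times likelihood, so they do not sum to 1. Dividing each by their total turns them into posterior probabilities. The sketch below hard-codes the two values from the part (k) output; the variable names are illustrative.

```{r}
# Prior x likelihood scores from the part (k) output
score_no  <- 0.04901705
score_yes <- 0.002855801

# Normalize so the two posterior probabilities sum to 1
post_no  <- score_no / (score_no + score_yes)
post_yes <- 1 - post_no

c(No = post_no, Yes = post_yes)
```

Under the normal, independent-features assumption used above, this puts the posterior probability of No at roughly 0.94 for observation 1.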

## l.
KNN estimates the probability from nearby points: it finds the k closest observations and takes the proportion of them that fall into each class. Logistic regression models the probability directly by plugging the predictors into a linear equation for the log-odds and passing the result through the logistic (sigmoid) function, which keeps it between 0 and 1. Naive Bayes works in the other direction: it asks how likely the observed feature values are under each class using a distribution fitted per class (here, normal), multiplies those likelihoods by the class prior, and compares the results across classes, assuming the features are independent within each class.