Exercises

1.

data <- read.csv("Exam2Data.csv")
fitControl <- trainControl(method = "cv", number = 5)

2.

set.seed(1)
model.cv <- train(Proline ~.,
                  data = data,
                  method = "lm",
                  trControl = fitControl)

model.cv$finalModel

Call:
lm(formula = .outcome ~ ., data = dat)

Coefficients:
   (Intercept)         RegionB         RegionC         Alcohol       MalicAcid  
       502.560        -506.343        -513.535          18.659         -16.079  
           Ash      Alcalinity       Magnesium         Phenols      Flavanoids  
         6.000          -1.205           1.789          54.315         -54.122  
  Nonflavanoid         Proanth  ColorIntensity             Hue           OD280  
       -54.780          34.304          24.828         184.853         -54.195  

When all other variables are held constant, a 1 unit increase in Color Intensity results in a 24.828 unit increase in Proline. When all other variables are held constant, RegionB is associated with 506.343 lower Proline than Region A. When all other variables are held constant, RegionC is associated with 513.535 lower Proline than Region A.

3.

model.cv$results
  intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
1      TRUE 168.0626 0.7333614 129.8525 12.42053 0.03451967 12.82251

The test RMSE was 168.0626, meaning the model’s predictions are typically about 168 units away from the true Proline values.

4.

model.lm <- lm(Proline ~., data = data)
rmse <- sqrt(mean(residuals(model.lm)^2))
rmse
[1] 154.4184

The training RMSE is 154.4184. The training RMSE is less than the testing RMSE since the model is being evaluated on data it was trained on.

5.

set.seed(1)

tune.grid <- expand.grid(alpha = 1, lambda = 10^seq(-2, 1, length = 25))

model.lasso <- train(Proline ~ .,
                     data = data,
                     method = "glmnet",
                     tuneGrid = tune.grid,
                     standardize = TRUE,
                     trControl = fitControl)
model.lasso
glmnet 

178 samples
 13 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 142, 143, 142, 143, 142 
Resampling results across tuning parameters:

  lambda       RMSE      Rsquared   MAE     
   0.01000000  167.7167  0.7343222  129.7797
   0.01333521  167.7167  0.7343222  129.7797
   0.01778279  167.7167  0.7343222  129.7797
   0.02371374  167.7167  0.7343222  129.7797
   0.03162278  167.7167  0.7343222  129.7797
   0.04216965  167.7167  0.7343222  129.7797
   0.05623413  167.7167  0.7343222  129.7797
   0.07498942  167.7167  0.7343222  129.7797
   0.10000000  167.7167  0.7343222  129.7797
   0.13335214  167.7167  0.7343222  129.7797
   0.17782794  167.7144  0.7343355  129.8004
   0.23713737  167.7047  0.7343841  129.8424
   0.31622777  167.6579  0.7345308  129.8712
   0.42169650  167.5510  0.7348530  129.8712
   0.56234133  167.4477  0.7351406  129.8815
   0.74989421  167.3700  0.7353262  129.9154
   1.00000000  167.3575  0.7352228  130.0115
   1.33352143  167.4027  0.7348441  130.3183
   1.77827941  167.4069  0.7346477  130.5143
   2.37137371  167.4025  0.7344871  130.6858
   3.16227766  167.4484  0.7346740  130.9283
   4.21696503  167.7314  0.7346227  131.4540
   5.62341325  167.8945  0.7350463  132.0989
   7.49894209  168.1832  0.7352647  132.8636
  10.00000000  169.0439  0.7338634  134.1911

Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 1.

The test RMSE is 167.3575. We did better with Ridge than OLS regression since Ridge is allowed to set the coefficients of unhelpful variables close to zero. These variables won’t contribute as much to the models predictions, reducing noise, overfitting, and collinearity.

6.

classes <- data %>% group_by(Region) %>% summarise(n = n())
kable(classes)
Region n
A 59
B 71
C 48

There are three regions: A, B, and C. C has less observations than both A and B, and B has the most, but there isn’t an extreme class imbalance.

7.

set.seed(1)
kGrid <- expand.grid(k = seq(1, 13, 2))

model.knn <- train(Region ~.,
                        data = data,
                        method = "knn",
                        preProc = c("center", "scale"),
                        trControl = fitControl,
                        tuneGrid = kGrid)
model.knn
k-Nearest Neighbors 

178 samples
 13 predictor
  3 classes: 'A', 'B', 'C' 

Pre-processing: centered (13), scaled (13) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 141, 143, 143, 143, 142 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   1  0.9604676  0.9403076
   3  0.9661819  0.9488648
   5  0.9661819  0.9488648
   7  0.9604676  0.9402229
   9  0.9715873  0.9569956
  11  0.9491978  0.9234248
  13  0.9604676  0.9402229

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.

Optimal value for k was 9, with a test accuracy of 0.9715873 and kappa statistic of 0.9569956. The model correctly predicted the region 97.16% of the time. The kappa statistic alters the accuracy to reflect the fact that some of it could be due to random chance. Since the kappa statistic is still really high this means our model is good and there isn’t a class imbalance (which we noted above).

8.

set.seed(1)

grid <- expand.grid(
  usekernel = FALSE, 
  fL = 0, 
  adjust = 1 
)

model.nb <- train(data = data,
                  Region ~ .,
                  method = "nb",
                  trControl = fitControl,
                  tuneGrid = grid)
model.nb
Naive Bayes 

178 samples
 13 predictor
  3 classes: 'A', 'B', 'C' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 141, 143, 143, 143, 142 
Resampling results:

  Accuracy   Kappa   
  0.9666409  0.949126

Tuning parameter 'fL' was held constant at a value of 0
Tuning
 parameter 'usekernel' was held constant at a value of FALSE
Tuning
 parameter 'adjust' was held constant at a value of 1

The test accuracy with naive bayes was 0.9666409.

9.

t <- data %>%
  group_by(Region) %>%
  summarise(across(
    where(is.numeric),
    list(mean = mean, sd = sd),
    .names = "{.col}_{.fn}" #got code from web
  ))

kable(t)
Region Alcohol_mean Alcohol_sd MalicAcid_mean MalicAcid_sd Ash_mean Ash_sd Alcalinity_mean Alcalinity_sd Magnesium_mean Magnesium_sd Phenols_mean Phenols_sd Flavanoids_mean Flavanoids_sd Nonflavanoid_mean Nonflavanoid_sd Proanth_mean Proanth_sd ColorIntensity_mean ColorIntensity_sd Hue_mean Hue_sd OD280_mean OD280_sd Proline_mean Proline_sd
A 13.74475 0.4621254 2.010678 0.6885489 2.455593 0.2271660 17.03729 2.546323 106.3390 10.49895 2.840170 0.3389614 2.9823729 0.3974936 0.290000 0.0700492 1.899322 0.4121092 5.528305 1.2385728 1.0620339 0.1164826 3.157797 0.3570766 1115.7119 221.5208
B 12.27873 0.5379642 1.932676 1.0155687 2.244789 0.3154673 20.23803 3.349770 94.5493 16.75350 2.258873 0.5453611 2.0808451 0.7057008 0.363662 0.1239613 1.630282 0.6020678 3.086620 0.9249293 1.0562817 0.2029368 2.785352 0.4965735 519.5070 157.2112
C 13.15375 0.5302413 3.333750 1.0879057 2.437083 0.1846902 21.41667 2.258161 99.3125 10.89047 1.678750 0.3569709 0.7814583 0.2935041 0.447500 0.1241396 1.153542 0.4088359 7.396250 2.3109421 0.6827083 0.1144411 1.683542 0.2721114 629.8958 115.0970

You assume each variable is normally distributed, which allows you to find the probability of certain values for each variable. So if an observation has an alcohol value of 14, you can standardize it using the mean and standard deviation of alcohol levels and get the probability from a z-score table for each region. Now you know the probability of the observation being in Region A, B, or C based solely on its alcohol value. If you do this for every variable and multiply the probabilities together, you get a good estimate of what region it belongs to. This is essentially what Naive Bayes does.

10.

KNN estimates probability based on nearby points by looking at the k closest observations and taking the proportion of those that fall into each class. An observation is assigned the class that the majority of its k nearest neighbors are in. Logistic regression models the probability directly by plugging the predictors into a linear equation and passing it through a function to keep it between 0 and 1. An observation is assigned to class 1 if the predicted probability is above a threshold (usually 0.5), otherwise it is assigned to class 0.

11.

set.seed(1)

model.tree.prune <- train(Region ~ .,
                          data = data,
                          method = "rpart",
                          trControl = fitControl,
                          tuneLength = 10)

par(xpd = NA)
plot(model.tree.prune$finalModel)
text(model.tree.prune$finalModel)

There are 5 terminal nodes, their values are their predicted regions (A, B, or C).

testing <- data %>% filter(Proline >= 755 & Flavanoids >= 2.165) %>% group_by(Region) %>% summarise(n = n())
kable(testing)
Region n
A 57
B 2

Since majority of the observations where Proline is >= 755 and Flavanoids >= 2.165 are in Region A, that terminal node has the correct value.

12.

The Gini Index measures how pure a node is. A node is pure if most of the observations in it belong to one class. A Gini value of 0 means the node is perfectly pure, while higher values indicate the classes are more mixed. When building trees, we choose splits that reduce the Gini Index the most, meaning they create more pure groups. We use Gini instead of RMSE because RMSE is for regression, and instead of accuracy because accuracy isn’t sensitive enough to small improvements in splits.

13.

prop.table(table(data$Region))

        A         B         C 
0.3314607 0.3988764 0.2696629 
0.3314607*(1-0.3314607) + 0.3988764*(1-0.3988764) + 0.2696629*(1-0.2696629)
[1] 0.6583133

The gini index with no splits is 0.6583133.

14.

test <- data %>% filter(Proline >= 755) %>% group_by(Region) %>% summarise(n = n())
kable(test)
Region n
A 57
B 4
C 6
test <- data %>% filter(Proline < 755) %>% group_by(Region) %>% summarise(n = n())
kable(test)
Region n
A 2
B 67
C 42
gini_progr <- (57/67)*(1-57/67) + (4/67)*(1-4/67) + (6/67)*(1-6/67)
gini_prole <- (2/111)*(1-2/111) + (67/111)*(1-67/111) + (42/111)*(1-42/111)
gini_imp <- (67/178) * gini_progr + (111/178) * gini_prole
gini_imp
[1] 0.4065279

The gini index is now 0.4065279, so we did better.

15.

test <- data %>% filter(Proline >= 800) %>% group_by(Region) %>% summarise(n = n())
kable(test)
Region n
A 53
B 4
C 5
test <- data %>% filter(Proline < 800) %>% group_by(Region) %>% summarise(n = n())
kable(test)
Region n
A 6
B 67
C 43
gini_progr <- (53/62)*(1-53/62) + (4/62)*(1-4/62) + (5/62)*(1-5/62)
gini_prole <- (6/116)*(1-6/116) + (67/116)*(1-67/116) + (43/116)*(1-43/116)
gini_imp <- (62/178) * gini_progr + (116/178) * gini_prole
gini_imp
[1] 0.4330561

The gini index is now 0.4330561, so we did worse than the previous split.

16.

test <- data %>% filter(Flavanoids >= 2.165) %>% group_by(Region) %>% summarise(n = n())
kable(test)
Region n
A 59
B 29
test <- data %>% filter(Flavanoids < 2.165) %>% group_by(Region) %>% summarise(n = n())
kable(test)
Region n
B 42
C 48
gini_progr <- (59/88)*(1-59/88) + (29/88)*(1-29/88) + (0/88)*(1-0/88)
gini_prole <- (0/90)*(1-0/90) + (42/90)*(1-42/90) + (48/90)*(1-48/90)
gini_imp <- (88/178) * gini_progr + (90/178) * gini_prole
gini_imp
[1] 0.4701481

The new Gini index is 0.4701481, so once again we did worse. The initial split chosen by the train() function did the best. This function loops through possible initial splits and chooses the one that results in the lowest Gini index, so any other split will generally perform the same or worse at that step.

---
title: "Exam 2"
author: "Charlie Morgan"
date: " Due: 05/03/26"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: no
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
editor_options: 
  chunk_output_type: inline
---

```{css, echo = FALSE}
#TOC::before {
  content: "Table of Contents";
  font-weight: bold;
  font-size: 1.2em;
  display: block;
  color: navy;
  margin-bottom: 10px;
}


div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 15px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Gill Sans", sans-serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

}
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("factoextra")) {
  install.packages("factoextra")
  library(factoextra)
}

if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}

if (!require("FactoMineR")) {
  install.packages("FactoMineR")
  library(FactoMineR)
}

if (!require("corrplot")) {
  install.packages("corrplot")
  library(corrplot)
}

if (!require("mice")) {
  install.packages("mice")
  library(mice)
}

if (!require("kableExtra")) {
  install.packages("kableExtra")
  library(kableExtra)
}

if (!require("cluster")) {
  install.packages("cluster")
  library(cluster)
}

if (!require("mclust")) {
  install.packages("mclust")
  library(mclust)
}

if (!require("dbscan")) {
  install.packages("dbscan")
  library(dbscan)
}

if (!require("caret")) {
  install.packages("caret")
  library(caret)
}
####
knitr::opts_chunk$set(echo = TRUE,       # include code chunk in the output file
                      warning = FALSE,   # sometimes, you code may produce warning messages,
                                         # you can choose to include the warning messages in
                                         # the output file. 
                      results = TRUE,    # you can also decide whether to include the output
                                         # in the output file.
                      message = FALSE,
                      comment = NA
                      )  

```

# Exercises

## 1.
```{r}
data <- read.csv("Exam2Data.csv")
fitControl <- trainControl(method = "cv", number = 5)
```

## 2.
```{r}
set.seed(1)
model.cv <- train(Proline ~.,
                  data = data,
                  method = "lm",
                  trControl = fitControl)

model.cv$finalModel
```
When all other variables are held constant, a 1 unit increase in Color Intensity results in a 24.828 unit increase in Proline. When all other variables are held constant, RegionB is associated with 506.343 lower Proline than Region A. When all other variables are held constant, RegionC is associated with 513.535 lower Proline than Region A.

## 3.
```{r}
model.cv$results
```
The test RMSE was 168.0626, meaning the model’s predictions are typically about 168 units away from the true Proline values.

## 4.
```{r}
model.lm <- lm(Proline ~., data = data)
rmse <- sqrt(mean(residuals(model.lm)^2))
rmse
```
The training RMSE is 154.4184. The training RMSE is less than the testing RMSE since the model is being evaluated on data it was trained on.

## 5.
```{r}
set.seed(1)

tune.grid <- expand.grid(alpha = 1, lambda = 10^seq(-2, 1, length = 25))

model.lasso <- train(Proline ~ .,
                     data = data,
                     method = "glmnet",
                     tuneGrid = tune.grid,
                     standardize = TRUE,
                     trControl = fitControl)
model.lasso
```
The test RMSE is 167.3575. We did better with Ridge than OLS regression since Ridge is allowed to set the coefficients of unhelpful variables close to zero. These variables won't contribute as much to the models predictions, reducing noise, overfitting, and collinearity.

## 6.
```{r}
classes <- data %>% group_by(Region) %>% summarise(n = n())
kable(classes)
```
There are three regions: A, B, and C. C has less observations than both A and B, and B has the most, but there isn't an extreme class imbalance.

## 7.
```{r}
set.seed(1)
kGrid <- expand.grid(k = seq(1, 13, 2))

model.knn <- train(Region ~.,
                        data = data,
                        method = "knn",
                        preProc = c("center", "scale"),
                        trControl = fitControl,
                        tuneGrid = kGrid)
model.knn
```
Optimal value for k was 9, with a test accuracy of 0.9715873 and kappa statistic of 0.9569956. The model correctly predicted the region 97.16% of the time. The kappa statistic alters the accuracy to reflect the fact that some of it could be due to random chance. Since the kappa statistic is still really high this means our model is good and there isn't a class imbalance (which we noted above).

## 8.
```{r}
set.seed(1)

grid <- expand.grid(
  usekernel = FALSE, 
  fL = 0, 
  adjust = 1 
)

model.nb <- train(data = data,
                  Region ~ .,
                  method = "nb",
                  trControl = fitControl,
                  tuneGrid = grid)
model.nb
```
The test accuracy with naive bayes was 0.9666409.

## 9.
```{r}
t <- data %>%
  group_by(Region) %>%
  summarise(across(
    where(is.numeric),
    list(mean = mean, sd = sd),
    .names = "{.col}_{.fn}" #got code from web
  ))

kable(t)
```
You assume each variable is normally distributed, which allows you to find the probability of certain values for each variable. So if an observation has an alcohol value of 14, you can standardize it using the mean and standard deviation of alcohol levels and get the probability from a z-score table for each region. Now you know the probability of the observation being in Region A, B, or C based solely on its alcohol value. If you do this for every variable and multiply the probabilities together, you get a good estimate of what region it belongs to. This is essentially what Naive Bayes does.

## 10.

KNN estimates probability based on nearby points by looking at the k closest observations and taking the proportion of those that fall into each class. An observation is assigned the class that the majority of its k nearest neighbors are in. Logistic regression models the probability directly by plugging the predictors into a linear equation and passing it through a function to keep it between 0 and 1. An observation is assigned to class 1 if the predicted probability is above a threshold (usually 0.5), otherwise it is assigned to class 0.

## 11.
```{r}
set.seed(1)

model.tree.prune <- train(Region ~ .,
                          data = data,
                          method = "rpart",
                          trControl = fitControl,
                          tuneLength = 10)

par(xpd = NA)
plot(model.tree.prune$finalModel)
text(model.tree.prune$finalModel)
```

There are 5 terminal nodes, their values are their predicted regions (A, B, or C).
```{r}
testing <- data %>% filter(Proline >= 755 & Flavanoids >= 2.165) %>% group_by(Region) %>% summarise(n = n())
kable(testing)
```
Since majority of the observations where Proline is >= 755 and Flavanoids >= 2.165 are in Region A, that terminal node has the correct value.

## 12.
The Gini Index measures how pure a node is. A node is pure if most of the observations in it belong to one class. A Gini value of 0 means the node is perfectly pure, while higher values indicate the classes are more mixed. When building trees, we choose splits that reduce the Gini Index the most, meaning they create more pure groups. We use Gini instead of RMSE because RMSE is for regression, and instead of accuracy because accuracy isn’t sensitive enough to small improvements in splits.

## 13.
```{r}
prop.table(table(data$Region))
0.3314607*(1-0.3314607) + 0.3988764*(1-0.3988764) + 0.2696629*(1-0.2696629)
```
The gini index with no splits is 0.6583133.

## 14.
```{r}
test <- data %>% filter(Proline >= 755) %>% group_by(Region) %>% summarise(n = n())
kable(test)
```
```{r}
test <- data %>% filter(Proline < 755) %>% group_by(Region) %>% summarise(n = n())
kable(test)
```

```{r}
gini_progr <- (57/67)*(1-57/67) + (4/67)*(1-4/67) + (6/67)*(1-6/67)
gini_prole <- (2/111)*(1-2/111) + (67/111)*(1-67/111) + (42/111)*(1-42/111)
gini_imp <- (67/178) * gini_progr + (111/178) * gini_prole
gini_imp
```
The gini index is now 0.4065279, so we did better.

## 15.
```{r}
test <- data %>% filter(Proline >= 800) %>% group_by(Region) %>% summarise(n = n())
kable(test)
```

```{r}
test <- data %>% filter(Proline < 800) %>% group_by(Region) %>% summarise(n = n())
kable(test)
```

```{r}
gini_progr <- (53/62)*(1-53/62) + (4/62)*(1-4/62) + (5/62)*(1-5/62)
gini_prole <- (6/116)*(1-6/116) + (67/116)*(1-67/116) + (43/116)*(1-43/116)
gini_imp <- (62/178) * gini_progr + (116/178) * gini_prole
gini_imp
```
The gini index is now 0.4330561, so we did worse than the previous split.

## 16.
```{r}
test <- data %>% filter(Flavanoids >= 2.165) %>% group_by(Region) %>% summarise(n = n())
kable(test)
```

```{r}
test <- data %>% filter(Flavanoids < 2.165) %>% group_by(Region) %>% summarise(n = n())
kable(test)
```

```{r}
gini_progr <- (59/88)*(1-59/88) + (29/88)*(1-29/88) + (0/88)*(1-0/88)
gini_prole <- (0/90)*(1-0/90) + (42/90)*(1-42/90) + (48/90)*(1-48/90)
gini_imp <- (88/178) * gini_progr + (90/178) * gini_prole
gini_imp
```
The new Gini index is 0.4701481, so once again we did worse. The initial split chosen by the train() function did the best. This function loops through possible initial splits and chooses the one that results in the lowest Gini index, so any other split will generally perform the same or worse at that step.