Question 1

heart <- read.csv("HW7data.csv")

a.

heart$target <- factor(heart$target, levels = c("Yes", "No")) #I changed the levels because it hurts my brain when no is the positive class.

heart$sex <- as.character(heart$sex)
heart$cp <- as.character(heart$cp)
heart$fbs <- as.character(heart$fbs)
heart$restecg <- as.character(heart$restecg)
heart$exang <- as.character(heart$exang)
heart$slope <- as.character(heart$slope)
heart$thal <- as.character(heart$thal)

imbalance <- heart %>% group_by(target) %>% summarise(n = n(), p = n / 303)
imbalance
# A tibble: 2 × 3
  target     n     p
  <fct>  <int> <dbl>
1 Yes      165 0.545
2 No       138 0.455

There are more yes than no’s, but the imbalance is not severe.

b.

set.seed(1)

fitControl <- trainControl(method = "cv", number = 10, classProbs = TRUE, savePredictions = "all", summaryFunction = twoClassSummary)

model.logit <- train(data = heart,
                     target ~ .,
                     method = "glm",
                     family = "binomial",
                     trControl = fitControl)
model.logit
Generalized Linear Model 

303 samples
 13 predictor
  2 classes: 'Yes', 'No' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 274, 272, 273, 272, 273, 272, ... 
Resampling results:

  ROC        Sens       Spec     
  0.8888474  0.8797794  0.7769231

Test accuracy of 83.31%.

c.

set.seed(1)

freq_class <- table(heart$target)

weights_vector <- ifelse(heart$target == "No", nrow(heart)/(length(freq_class)*freq_class[2]), nrow(heart)/(length(freq_class)*freq_class[1]))

model.logit2 <- train(data = heart,
                     target ~ .,
                     method = "glm",
                     family = "quasibinomial",
                     weights = weights_vector,
                     trControl = fitControl)
model.logit2
Generalized Linear Model 

303 samples
 13 predictor
  2 classes: 'Yes', 'No' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 274, 272, 273, 272, 273, 272, ... 
Resampling results:

  ROC        Sens       Spec     
  0.8892069  0.8492647  0.7917582

The new test accuracy is 82.33%. We did worse because there was no severe class imbalance.

d.

The no class would be the positive class if I hadn’t changed the levels. I changed the levels so the yes class is the positive class since your testing if a person has heart disease. Sensitivity measures the proportion of actual positives (person has heart disease) that the model correctly identified. So out of the people with heart disease, how many times did the model correctly label them as ‘Yes’. Specificity measures the proportion of actual negatives. So out of the people who didn’t have heart disease, how many times did the model correctly label them as ‘No’. The sensitivity and specificity for regular log reg were 0.8797794 and 0.7769231. The sensitivity and specificity for weighted log reg were 0.8492647 and 0.7917582. So the weighted log reg did better with the people who didn’t have heart disease, which makes sense since these people were assigned larger weights.

e.

thresholder(model.logit, threshold = 0.50) 
  parameter prob_threshold Sensitivity Specificity Pos Pred Value
1      none            0.5   0.8797794   0.7769231      0.8287918
  Neg Pred Value Precision    Recall        F1 Prevalence Detection Rate
1      0.8513736 0.8287918 0.8797794 0.8516563  0.5445384      0.4789247
  Detection Prevalence Balanced Accuracy  Accuracy     Kappa         J
1            0.5802373         0.8283512 0.8330738 0.6608442 0.6567025
       Dist
1 0.2629323
thresholder(model.logit, threshold = 0.25) 
  parameter prob_threshold Sensitivity Specificity Pos Pred Value
1      none           0.25   0.9272059   0.6434066      0.7649135
  Neg Pred Value Precision    Recall       F1 Prevalence Detection Rate
1      0.8821503 0.7649135 0.9272059 0.836072  0.5445384      0.5049537
  Detection Prevalence Balanced Accuracy  Accuracy     Kappa         J
1             0.666548         0.7853062 0.7988209 0.5825124 0.5706125
       Dist
1 0.3675353
thresholder(model.logit, threshold = 0.75) 
  parameter prob_threshold Sensitivity Specificity Pos Pred Value
1      none           0.75   0.6797794   0.9065934      0.8990113
  Neg Pred Value Precision    Recall        F1 Prevalence Detection Rate
1      0.7072092 0.8990113 0.6797794 0.7712848  0.5445384       0.370178
  Detection Prevalence Balanced Accuracy  Accuracy     Kappa         J
1             0.412551         0.7931864 0.7832666 0.5735424 0.5863728
       Dist
1 0.3413992

When we decrease the threshold (0.25), we make it easier for the model to label someone as a ‘Yes”. As a result, the sensitivity goes up and the specificity goes down. When we increase the threshold (0.75), we make it harder for the model to label someone as a ’Yes’. As a result, the sensitivity goes down and the specificity goes up.

f.

cpGrid <- expand.grid(cp = 0) # No pruning
set.seed(1)
fitControl <- trainControl(method = "cv", number = 10)

model.tree <- train(target ~ .,
                    data = heart,
                    method = "rpart",
                    trControl = fitControl,
                    tuneGrid = cpGrid)
par(xpd = NA)
plot(model.tree$finalModel)
text(model.tree$finalModel)

The first decision is when thal is equal to 2. Go to the left if it equals 2, go to the right otherwise.

g.

p <- prop.table(table(heart$target))
(0.5445545)*(0.4554455) + (0.5445545)*(0.4554455)
[1] 0.4960298

The Gini index with no decisions is 0.4960298.

h.

heart %>%
  filter(thal == "2") %>%
  group_by(target) %>%
  summarize(n = n())
# A tibble: 2 × 2
  target     n
  <fct>  <int>
1 Yes      130
2 No        36
heart %>%
  filter(thal != "2") %>%
  group_by(target) %>%
  summarize(n = n())
# A tibble: 2 × 2
  target     n
  <fct>  <int>
1 Yes       35
2 No       102
gini_thal2 <- (130/166)*(1 - 130/166) + (36/166)*(1 - 36/166)
gini_not_thal2 <- (35/137)*(1 - 35/137) + (102/137)*(1 - 102/137)
split_gini <- (166/303)*gini_thal2 + (137/303)*gini_not_thal2
split_gini
[1] 0.3580935

They are now more pure.

i.

heart %>%
  filter(thal == "3") %>%
  group_by(target) %>%
  summarize(n = n())
# A tibble: 2 × 2
  target     n
  <fct>  <int>
1 Yes       28
2 No        89
heart %>%
  filter(thal != "3") %>%
  group_by(target) %>%
  summarize(n = n())
# A tibble: 2 × 2
  target     n
  <fct>  <int>
1 Yes      137
2 No        49
gini_thal3 <- (137/186)*(1 - 137/186) + (49/186)*(1 - 49/186)
gini_not_thal3 <- (28/117)*(1 - 28/117) + (89/117)*(1 - 89/117)
split_gini <- (186/303)*gini_thal3 + (117/303)*gini_not_thal3
split_gini
[1] 0.3788155

Slightly less pure now.

j.

heart %>%
  filter(cp == "2") %>%
  group_by(target) %>%
  summarize(n = n())
# A tibble: 2 × 2
  target     n
  <fct>  <int>
1 Yes       69
2 No        18
heart %>%
  filter(cp != "2") %>%
  group_by(target) %>%
  summarize(n = n())
# A tibble: 2 × 2
  target     n
  <fct>  <int>
1 Yes       96
2 No       120
gini_cp2 <- (69/87)*(1 - 69/87) + (18/87)*(1 - 18/87)
gini_not_cp2 <- (96/216)*(1 - 96/216) + (120/216)*(1 - 120/216)
split_gini <- (87/303)*gini_cp2 + (216/303)*gini_not_cp2
split_gini
[1] 0.4462653

Less pure then the split on thal2.

k.

set.seed(1)

model.tree.prune <- train(target ~ .,
                          data = heart,
                          method = "rpart",
                          trControl = fitControl,
                          tuneLength = 10) # Let function pick 10 reasonable tuning parameters to prune the tree
model.tree.prune
CART 

303 samples
 13 predictor
  2 classes: 'Yes', 'No' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 274, 272, 273, 272, 273, 272, ... 
Resampling results across tuning parameters:

  cp          Accuracy   Kappa     
  0.00000000  0.7338821  0.46342416
  0.05394525  0.7534594  0.50171008
  0.10789050  0.7667927  0.52861591
  0.16183575  0.7667927  0.52861591
  0.21578100  0.7667927  0.52861591
  0.26972625  0.7667927  0.52861591
  0.32367150  0.7667927  0.52861591
  0.37761675  0.7667927  0.52861591
  0.43156200  0.7667927  0.52861591
  0.48550725  0.5800222  0.09607575

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.431562.

The accuracy when not pruning was 0.7338821, the best accuracy with pruning was 0.7667927, so we did better with pruning. This makes sense since pruning limits the number of splits a tree can do, keeping it from overfitting and removing unnecessary splits.

l.

par(xpd = NA)
plot(model.tree.prune$finalModel)
text(model.tree.prune$finalModel)

heart %>%
  filter(thal == "2") %>%
  group_by(target) %>%
  summarize(n = n())
# A tibble: 2 × 2
  target     n
  <fct>  <int>
1 Yes      130
2 No        36
heart %>%
  filter(thal != "2") %>%
  group_by(target) %>%
  summarize(n = n())
# A tibble: 2 × 2
  target     n
  <fct>  <int>
1 Yes       35
2 No       102
232/303
[1] 0.7656766

Everything checks out.

---
title: "Homework 7"
author: "Charlie Morgan"
date: " Due: 05/03/26"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: no
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
editor_options: 
  chunk_output_type: inline
---

```{css, echo = FALSE}
#TOC::before {
  content: "Table of Contents";
  font-weight: bold;
  font-size: 1.2em;
  display: block;
  color: navy;
  margin-bottom: 10px;
}


div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 15px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Gill Sans", sans-serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

}
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("factoextra")) {
  install.packages("factoextra")
  library(factoextra)
}

if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}

if (!require("FactoMineR")) {
  install.packages("FactoMineR")
  library(FactoMineR)
}

if (!require("corrplot")) {
  install.packages("corrplot")
  library(corrplot)
}

if (!require("mice")) {
  install.packages("mice")
  library(mice)
}

if (!require("kableExtra")) {
  install.packages("kableExtra")
  library(kableExtra)
}

if (!require("cluster")) {
  install.packages("cluster")
  library(cluster)
}

if (!require("mclust")) {
  install.packages("mclust")
  library(mclust)
}

if (!require("dbscan")) {
  install.packages("dbscan")
  library(dbscan)
}

if (!require("caret")) {
  install.packages("caret")
  library(caret)
}
####
knitr::opts_chunk$set(echo = TRUE,       # include code chunk in the output file
                      warning = FALSE,   # sometimes, you code may produce warning messages,
                                         # you can choose to include the warning messages in
                                         # the output file. 
                      results = TRUE,    # you can also decide whether to include the output
                                         # in the output file.
                      message = FALSE,
                      comment = NA
                      )  

```

# Question 1

```{r}
heart <- read.csv("HW7data.csv")
```

## a.
```{r}
heart$target <- factor(heart$target, levels = c("Yes", "No")) #I changed the levels because it hurts my brain when no is the positive class.

heart$sex <- as.character(heart$sex)
heart$cp <- as.character(heart$cp)
heart$fbs <- as.character(heart$fbs)
heart$restecg <- as.character(heart$restecg)
heart$exang <- as.character(heart$exang)
heart$slope <- as.character(heart$slope)
heart$thal <- as.character(heart$thal)

imbalance <- heart %>% group_by(target) %>% summarise(n = n(), p = n / 303)
imbalance
```
There are more yes than no's, but the imbalance is not severe.

## b.
```{r}
set.seed(1)

fitControl <- trainControl(method = "cv", number = 10, classProbs = TRUE, savePredictions = "all", summaryFunction = twoClassSummary)

model.logit <- train(data = heart,
                     target ~ .,
                     method = "glm",
                     family = "binomial",
                     trControl = fitControl)
model.logit
```
Test accuracy of 83.31%.

## c.
```{r}
set.seed(1)

freq_class <- table(heart$target)

weights_vector <- ifelse(heart$target == "No", nrow(heart)/(length(freq_class)*freq_class[2]), nrow(heart)/(length(freq_class)*freq_class[1]))

model.logit2 <- train(data = heart,
                     target ~ .,
                     method = "glm",
                     family = "quasibinomial",
                     weights = weights_vector,
                     trControl = fitControl)
model.logit2
```
The new test accuracy is 82.33%. We did worse because there was no severe class imbalance.

## d.
The no class would be the positive class if I hadn't changed the levels. I changed the levels so the yes class is the positive class since your testing if a person has heart disease. Sensitivity measures the proportion of actual positives (person has heart disease) that the model correctly identified. So out of the people with heart disease, how many times did the model correctly label them as 'Yes'. Specificity measures the proportion of actual negatives. So out of the people who didn't have heart disease, how many times did the model correctly label them as 'No'. The sensitivity and specificity for regular log reg were 0.8797794 and 0.7769231. The sensitivity and specificity for weighted log reg were  0.8492647 and 0.7917582. So the weighted log reg did better with the people who didn't have heart disease, which makes sense since these people were assigned larger weights.

## e.
```{r}
thresholder(model.logit, threshold = 0.50) 
```
```{r}
thresholder(model.logit, threshold = 0.25) 
```
```{r}
thresholder(model.logit, threshold = 0.75) 
```
When we decrease the threshold (0.25), we make it easier for the model to label someone as a 'Yes". As a result, the sensitivity goes up and the specificity goes down. When we increase the threshold (0.75), we make it harder for the model to label someone as a 'Yes'. As a result, the sensitivity goes down and the specificity goes up. 

## f.
```{r}
cpGrid <- expand.grid(cp = 0) # No pruning
set.seed(1)
fitControl <- trainControl(method = "cv", number = 10)

model.tree <- train(target ~ .,
                    data = heart,
                    method = "rpart",
                    trControl = fitControl,
                    tuneGrid = cpGrid)
par(xpd = NA)
plot(model.tree$finalModel)
text(model.tree$finalModel)
```

The first decision is when thal is equal to 2. Go to the left if it equals 2, go to the right otherwise.

## g.
```{r}
p <- prop.table(table(heart$target))
(0.5445545)*(0.4554455) + (0.5445545)*(0.4554455)
```
The Gini index with no decisions is 0.4960298.

## h.
```{r}
heart %>%
  filter(thal == "2") %>%
  group_by(target) %>%
  summarize(n = n())
```
```{r}
heart %>%
  filter(thal != "2") %>%
  group_by(target) %>%
  summarize(n = n())
```

```{r}
gini_thal2 <- (130/166)*(1 - 130/166) + (36/166)*(1 - 36/166)
gini_not_thal2 <- (35/137)*(1 - 35/137) + (102/137)*(1 - 102/137)
split_gini <- (166/303)*gini_thal2 + (137/303)*gini_not_thal2
split_gini
```
They are now more pure.

## i.
```{r}
heart %>%
  filter(thal == "3") %>%
  group_by(target) %>%
  summarize(n = n())
```
```{r}
heart %>%
  filter(thal != "3") %>%
  group_by(target) %>%
  summarize(n = n())
```

```{r}
gini_thal3 <- (137/186)*(1 - 137/186) + (49/186)*(1 - 49/186)
gini_not_thal3 <- (28/117)*(1 - 28/117) + (89/117)*(1 - 89/117)
split_gini <- (186/303)*gini_thal3 + (117/303)*gini_not_thal3
split_gini
```
Slightly less pure now.

## j.

```{r}
heart %>%
  filter(cp == "2") %>%
  group_by(target) %>%
  summarize(n = n())
```

```{r}
heart %>%
  filter(cp != "2") %>%
  group_by(target) %>%
  summarize(n = n())
```

```{r}
gini_cp2 <- (69/87)*(1 - 69/87) + (18/87)*(1 - 18/87)
gini_not_cp2 <- (96/216)*(1 - 96/216) + (120/216)*(1 - 120/216)
split_gini <- (87/303)*gini_cp2 + (216/303)*gini_not_cp2
split_gini
```
Less pure then the split on thal2.

## k.
```{r}
set.seed(1)

model.tree.prune <- train(target ~ .,
                          data = heart,
                          method = "rpart",
                          trControl = fitControl,
                          tuneLength = 10) # Let function pick 10 reasonable tuning parameters to prune the tree
model.tree.prune
```
The accuracy when not pruning was 0.7338821, the best accuracy with pruning was 0.7667927, so we did better with pruning. This makes sense since pruning limits the number of splits a tree can do, keeping it from overfitting and removing unnecessary splits.

## l.
```{r}
par(xpd = NA)
plot(model.tree.prune$finalModel)
text(model.tree.prune$finalModel)
```
```{r}
heart %>%
  filter(thal == "2") %>%
  group_by(target) %>%
  summarize(n = n())
```
```{r}
heart %>%
  filter(thal != "2") %>%
  group_by(target) %>%
  summarize(n = n())
```
```{r}
232/303
```
Everything checks out.