---
title: "STA 6543 Assignment 3"
author: "Allyssa Weinbrecht"
date: "2025-03-07"
output:
  html_notebook:
    toc: true
    toc_float: true
---

```{r echo=FALSE, warning=FALSE, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  fig.align="center",
  fig.pos="b",
  strip.white = TRUE
)
```

**1. (Exercise 4.8.13) This question should be answered using the `Weekly` data set, which is part of the `ISLR2` package. This data is similar in nature to the `Smarket` data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.**

(a) Produce some numerical and graphical summaries of the `Weekly` data. Do there appear to be any patterns?

``` {r Prob1a}
#load the ISLR2 library and attach the Weekly data set
library(ISLR2)
data(Weekly)
    
#Display histograms of each variable
par(mfrow = c(3,3), mar = c(3.4,3.5,1.2,0), mgp =c(2,1,0))
hist(Weekly$Year, main = "Freq of Years Sampled",
      xlab= "Year", col = "lightpink1")
    
hist(Weekly$Lag1, main = "Prev Week's % Return",
      xlab= "Values of Prev Week's Return (%)", col = "palegreen3")
    
hist(Weekly$Lag2, main = "Prev 2 Weeks' % Return",
      xlab= "Values of Prev 2 Weeks' Return (%)", col = "slateblue1")
    
hist(Weekly$Lag3, main = "Prev 3 Weeks' % Return",
      xlab= "Values of Prev 3 Weeks' Return (%)", col = "indianred3")
    
hist(Weekly$Lag4, main = "Prev 4 Weeks' % Return",
      xlab= "Values of Prev 4 Weeks' Return (%)", col = "steelblue2")
    
hist(Weekly$Lag5, main = "Prev 5 Weeks' % Return",
      xlab= "Values of Prev 5 Weeks' Return (%)", col = "thistle2")
    
hist(Weekly$Volume, main = "Share Trading Volume",
      xlab= "Daily Traded Share Avgs (billions)", col = "orange1")
    
hist(Weekly$Today, main = "Current Week % Return",
      xlab= "Values of Current Week Returns (%)", col = "olivedrab3")
    
plot(Weekly$Direction, main = "Obs Market Direction",
      xlab= "Weekly Directional Market Trends", col = "darkorchid2")

library(corrplot)
par(mfrow=c(1,1))
corrplot.mixed(cor(Weekly[,1:8]), order = "hclust", tl.cex = .65, number.cex = 0.75, 
               lower="number", upper="circle")
    
```
Most predictors' histograms appear approximately normally distributed, with two notable exceptions:

* The predictor `Year` appears to be approximately uniformly distributed with some slight right skew.
* The predictor `Volume` is extremely right-skewed.

Additionally, there is a strong positive correlation between `Year` and `Volume`, which may indicate multicollinearity.
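
The exercise also asks for numerical summaries; a minimal base-R sketch using only the objects created above:

``` {r Prob1aNum}
#numerical summaries of each variable
summary(Weekly)

#quantify the Year-Volume relationship visible in the correlation plot
cor(Weekly$Year, Weekly$Volume)
```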

(b) Use the full data set to perform a logistic regression with `Direction` as the response and the five lag variables plus `Volume` as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

``` {r Prob1b}
model.1b <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, 
                family = binomial, data = Weekly)
summary(model.1b)
```
In this model, only the predictor `Lag2` appears to be statistically significant for $0.01 < \alpha < 0.05$.
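
The same conclusion can be read off programmatically; a small sketch that extracts the p-values from the fitted model object:

``` {r Prob1bPvals}
#z-test p-values for each coefficient, smallest first
sort(coef(summary(model.1b))[, "Pr(>|z|)"])
```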

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

``` {r Prob1c}
probabilities.1b <- predict(model.1b, type = "response")
predictions.1b <- rep("Down", length(probabilities.1b))
predictions.1b[probabilities.1b >= 0.5] <- "Up"
conf.1c <- table(Weekly$Direction, predictions.1b)
conf.1c

accuracy.1b <- mean(predictions.1b == Weekly$Direction)
cat("The accuracy or fraction of correct predictions is:", accuracy.1b, "\n")

#Type I error (False Positive Rate): false "Up" predictions among actual "Down" weeks
FPR <- 100 * conf.1c["Down", "Up"] / sum(conf.1c["Down", ])
cat("The Type I Error or False Positive Rate is: ", round(FPR, 2), "% \n", sep = "")

#Type II error (False Negative Rate): false "Down" predictions among actual "Up" weeks
FNR <- 100 * conf.1c["Up", "Down"] / sum(conf.1c["Up", ])
cat("The Type II Error or False Negative Rate is: ", round(FNR, 2), "% \n", sep = "")
```

This tells us that the model correctly predicted the value of `Direction` (either `Up` or `Down`) approximately 56.11% of the time, which indicates that the predictive accuracy of our model is not particularly strong.

Additionally, we have the Type I and Type II error rates: the rates of false positive and false negative predictions, respectively. Because the model predicts `Up` for the vast majority of weeks, nearly all of its mistakes are false positives, i.e., weeks when the market actually went `Down` but the model predicted `Up`.
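
For context, the accuracy above is worth comparing against the naive strategy of always predicting the majority class; a one-line check:

``` {r Prob1cBase}
#accuracy of always predicting "Up"
mean(Weekly$Direction == "Up")
```
Always predicting `Up` is correct for 605 of the 1,089 weeks (about 55.6%), so the fitted model barely improves on this baseline.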

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with `Lag2` as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

``` {r Prob1d}
#training period: 1990-2008; held-out period: 2009-2010
p1.train <- Weekly[Weekly$Year < 2009, ]
p1.test <- Weekly[Weekly$Year >= 2009, ]

model.1d <- glm(Direction ~ Lag2, data = p1.train, family = 'binomial')

probs.1d <- predict(model.1d, p1.test, type = "response")
preds.1d <- rep("Down", length(probs.1d))
preds.1d[probs.1d >= 0.5] <- "Up"
table(p1.test$Direction, preds.1d)

accuracy.1d <- mean(preds.1d == p1.test$Direction)
cat("The accuracy or fraction of correct predictions is:", accuracy.1d, "\n")
```

(e) Repeat (d) using LDA.

``` {r Prob1e, warning = FALSE}
library(MASS)
model.1e <- lda(Direction ~ Lag2, data = p1.train)

preds.1e <- predict(model.1e, p1.test)
table(p1.test$Direction, preds.1e$class)

accuracy.1e <- mean(preds.1e$class == p1.test$Direction)
cat("The accuracy or fraction of correct predictions is:", accuracy.1e, "\n")
```

(f) Repeat (d) using QDA.

``` {r Prob1f}
model.1f <- qda(Direction ~ Lag2, data = p1.train)

preds.1f <- predict(model.1f, p1.test)
table(p1.test$Direction, preds.1f$class)

accuracy.1f <- mean(preds.1f$class == p1.test$Direction)
cat("The accuracy or fraction of correct predictions is:", accuracy.1f, "\n")
```

(g) Repeat (d) using KNN with $K=1$.

``` {r Prob1g}
train2 <- as.matrix(Weekly$Lag2[Weekly$Year < 2009])
test2 <- as.matrix(Weekly$Lag2[Weekly$Year >= 2009])
tr.resp <- Weekly$Direction[Weekly$Year < 2009]

library(class)
set.seed(1)  #knn() breaks distance ties at random; fix the seed for reproducibility
model.1g <- knn(train2, test2, tr.resp, k = 1)
table(model.1g, p1.test$Direction)

accuracy.1g <- mean(model.1g == p1.test$Direction)
cat("The accuracy or fraction of correct predictions is:", accuracy.1g, "\n")
```

(h) Repeat (d) using naive Bayes.

``` {r Prob1h}
library(e1071)
model.1h <- naiveBayes(Direction ~ Lag2, data = p1.train)

preds.1h <- predict(model.1h, p1.test, type = 'class')
table(p1.test$Direction, preds.1h)

accuracy.1h <- mean(preds.1h == p1.test$Direction)
cat("The accuracy or fraction of correct predictions is:", accuracy.1h, "\n")
```

(i) Which of these methods appears to provide the best results on this data?

LDA and logistic regression are the two methods with the highest accuracy rate (62.5\%).
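
A compact way to see this is to tabulate the held-out accuracies computed in the chunks above (all objects already exist at this point):

``` {r Prob1iSummary}
data.frame(
  method = c("Logistic regression", "LDA", "QDA", "KNN (K = 1)", "Naive Bayes"),
  accuracy = round(c(accuracy.1d, accuracy.1e, accuracy.1f,
                     accuracy.1g, accuracy.1h), 4)
)
```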


**2. (Exercise 4.8.14) In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the `Auto` data set.**

(a) Create a binary variable, `mpg01`, that contains a 1 if `mpg` contains a value above its median, and a 0 if `mpg` contains a value below its median. You can compute the median using the `median()` function. Note you may find it helpful to use the `data.frame()` function to create a single data set containing both `mpg01` and the other `Auto` variables.

``` {r Prob2a}
data(Auto)
mpg.med <- median(Auto$mpg)
mpg01 <- rep(0, length(Auto$mpg))
mpg01[Auto$mpg >= mpg.med] <- 1


Auto <- data.frame(mpg01, Auto)
```
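
As a quick sanity check, the median split should leave the two classes roughly balanced; a short sketch:

``` {r Prob2aCheck}
#mpg01 should split the observations roughly in half
table(Auto$mpg01)
prop.table(table(Auto$mpg01))
```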

(b) Explore the data graphically in order to investigate the association between `mpg01` and the other features. Which of the other features seem most likely to be useful in predicting `mpg01`? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

``` {r Prob2b}
#scatterplot matrix of all variables (the high-cardinality name factor is excluded)
plot(Auto[, names(Auto) != "name"])
library(ggplot2)
ggplot(Auto, aes(x=as.factor(origin), y=mpg, fill=as.factor(cylinders))) + 
    geom_boxplot()
ggplot(Auto, aes(x=weight, y=mpg, fill=as.factor(cylinders))) + 
    geom_boxplot()
ggplot(Auto, aes(x=weight, y=horsepower, fill=as.factor(cylinders))) + 
    geom_boxplot()
ggplot(Auto, aes(x=as.factor(origin), y=mpg, fill=as.factor(year))) + 
    geom_boxplot()

```
The value of `mpg` tends to decrease as the number of `cylinders` increases in vehicles of both American and European `origin`, but this trend does not hold with Japanese-made vehicles.

In vehicles with 4 or more `cylinders`, as the number of `cylinders` increases, the vehicle `weight` increases and the `mpg` decreases. Swapping `mpg` for `horsepower`, we find the opposite trend: generally, as `weight` and `cylinders` increase, so does `horsepower`.

Across all three places of `origin`, as the vehicle model `year` increases (becomes newer), the `mpg` also generally tends to increase.
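
These visual impressions can be cross-checked numerically; a small sketch correlating `mpg01` with each numeric predictor:

``` {r Prob2bCor}
#correlation of mpg01 with the numeric predictors (name excluded)
round(cor(Auto[, c("mpg01", "cylinders", "displacement", "horsepower",
                   "weight", "acceleration", "year", "origin")])["mpg01", ], 2)
```
The engine-size variables (`cylinders`, `displacement`, `horsepower`, `weight`) should show the strongest associations with `mpg01`, in line with the plots above.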

(c) Split the data into a training set and a test set.

``` {r Prob2c}
library(caret)
set.seed(4814)
auto.index <- createDataPartition(Auto$mpg01, p = 0.75, list = FALSE)
tr.auto <- Auto[auto.index,]
te.auto <- Auto[-auto.index,]
```

(d) Perform LDA on the training data in order to predict `mpg01` using the variables that seemed most associated with `mpg01` in (b). What is the test error of the model obtained?

``` {r Prob2d}
library(scales)  #provides percent(), used for reporting error rates below
lda.auto <- lda(mpg01 ~ cylinders + horsepower + weight + year, data = tr.auto)

preds.2d <- predict(lda.auto, te.auto)
table(te.auto$mpg01, preds.2d$class)

te.error.2d <- mean(preds.2d$class != te.auto$mpg01)
cat("The test error of this LDA model is:", percent(te.error.2d, accuracy = 0.01), "\n")
```

(e) Perform QDA on the training data in order to predict `mpg01` using the variables that seemed most associated with `mpg01` in (b). What is the test error of the model obtained?

``` {r Prob2e}
qda.auto <- qda(mpg01 ~ cylinders + horsepower + weight + year, data = tr.auto)

preds.2e <- predict(qda.auto, te.auto)
table(te.auto$mpg01, preds.2e$class)

te.error.2e <- mean(preds.2e$class != te.auto$mpg01)
cat("The test error of this QDA model is:", percent(te.error.2e, accuracy = 0.01), "\n")
```
The test error of this QDA model is slightly higher than that of the previous LDA model, fitted with the same predictors.

(f) Perform logistic regression on the training data in order to predict `mpg01` using the variables that seemed most associated with `mpg01` in (b). What is the test error of the model obtained?

``` {r Prob2f}
log.auto <- glm(mpg01 ~ cylinders + horsepower + weight + year, data = tr.auto, family = 'binomial')
summary(log.auto)

probs.2f <- predict(log.auto, te.auto, type = "response")
preds.2f <- rep(0, length(probs.2f))
preds.2f[probs.2f >= 0.5] <- 1
table(te.auto$mpg01, preds.2f)

te.error.2f <- mean(preds.2f != te.auto$mpg01)
cat("The test error of this Logistic model is:", percent(te.error.2f, accuracy = 0.01), "\n")
```

(g) Perform naive Bayes on the training data in order to predict `mpg01` using the variables that seemed most associated with `mpg01` in (b). What is the test error of the model obtained?

``` {r Prob2g}
nb.auto <- naiveBayes(mpg01 ~ cylinders + horsepower + weight + year, data = tr.auto)

preds.2g <- predict(nb.auto, te.auto, type = 'class')
table(te.auto$mpg01, preds.2g)

te.error.2g <- mean(preds.2g != te.auto$mpg01)
cat("The test error of this Naive Bayes model is:", percent(te.error.2g, accuracy = 0.01), "\n")
```

(h) Perform KNN on the training data, with several values of $K$, in order to predict `mpg01`. Use only the variables that seemed most associated with `mpg01` in (b). What test errors do you obtain? Which value of $K$ seems to perform the best on this data set?

``` {r Prob2h}
x1.tr <- tr.auto[, c("cylinders", "horsepower", "weight", "year")]
y1.tr <- tr.auto$mpg01
x1.te <- te.auto[, c("cylinders", "horsepower", "weight", "year")]
y1.te <- te.auto$mpg01

set.seed(481)
## KNN - K = 1 ##
knn.auto <- knn(x1.tr, x1.te, y1.tr, k = 1)
table(knn.auto, y1.te)

te.error.2h <- mean(knn.auto != y1.te)
cat("The test error of this KNN model with k = 1 is:", percent(te.error.2h, accuracy = 0.01), "\n")

## KNN - K = 10 ##
knn.auto <- knn(x1.tr, x1.te, y1.tr, k = 10)
table(knn.auto, y1.te)

te.error.2h <- mean(knn.auto != y1.te)
cat("The test error of this KNN model with k = 10 is:", percent(te.error.2h, accuracy = 0.01), "\n")

## KNN - K = 15 ##
knn.auto <- knn(x1.tr, x1.te, y1.tr, k = 15)
table(knn.auto, y1.te)

te.error.2h <- mean(knn.auto != y1.te)
cat("The test error of this KNN model with k = 15 is:", percent(te.error.2h, accuracy = 0.01), "\n")
```
Based on this data set, $K=10$ seems to perform the best.
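
Rather than checking a handful of values by hand, the search over $K$ can be scripted; a sketch reusing the train/test objects above (the grid of values is arbitrary):

``` {r Prob2hGrid}
set.seed(481)  #knn() breaks distance ties at random
for (k in c(1, 3, 5, 7, 10, 15, 20, 50)) {
  preds <- knn(x1.tr, x1.te, y1.tr, k = k)
  cat("K =", k, "-> test error:", round(mean(preds != y1.te), 4), "\n")
}
```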

**3. (Exercise 4.8.16) Using the `Boston` data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.**

*Hint: You will have to create the response variable yourself, using the variables that are contained in the `Boston` data set.*

```{r Prob3, message=FALSE}
data(Boston, package = "ISLR2")  #be explicit: MASS (loaded earlier) also ships a Boston data set
library(dplyr)
library(scales)
library(olsrr)

#get median crime rate and create response vector
crim.med <- median(Boston$crim)
response <- rep(0, length(Boston$crim))
response[Boston$crim >= crim.med] <- 1

#attach response vector to data frame
Boston$response <- response

#Split the data into test and train
set.seed(4816)
index.boston <- createDataPartition(Boston$response, p = 0.8, list = FALSE)
tr.bos <- Boston[index.boston,]
te.bos <- Boston[-index.boston,]
```
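
A quick check that both partitions keep the two classes roughly balanced:

``` {r Prob3Balance}
#class balance in the training and test partitions
prop.table(table(tr.bos$response))
prop.table(table(te.bos$response))
```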

``` {r Prob3Log, warning = FALSE}
## LOGISTIC - FULL ##
log.bos <- glm(response ~ ., data = tr.bos, family = 'binomial')
summary(log.bos)

probs.log <- predict(log.bos, te.bos, type = "response")
preds.log <- rep(0, length(probs.log))
preds.log[probs.log >= 0.5] <- 1
table(te.bos$response, preds.log)

acc.log <- mean(preds.log == te.bos$response)
cat("The prediction accuracy of the full logistic model is:", percent(acc.log, accuracy = 0.01), "\n")

## LOGISTIC - REDUCED ##
#model selection (note: olsrr is designed for OLS fits, so treat this stepwise selection as approximate for a logistic model)
ols_step_both_p(log.bos, p_enter = 0.05, p_remove = 0.1)
log.red <- glm(response ~ nox + rad + medv + age, data = tr.bos, family = 'binomial')
summary(log.red)

probs.log.red <- predict(log.red, te.bos, type = "response")
preds.log.red <- rep(0, length(probs.log.red))
preds.log.red[probs.log.red >= 0.5] <- 1
table(te.bos$response, preds.log.red)

acc.log.red <- mean(preds.log.red == te.bos$response)
cat("The prediction accuracy of the reduced logistic model is:", percent(acc.log.red, accuracy = 0.01), "\n")
```
The full logistic regression model outperformed the reduced model selected by stepwise regression: it had higher prediction accuracy, and its AIC was much lower (28 vs. 209.26). An AIC of 28 with fourteen estimated coefficients, however, corresponds to a residual deviance of essentially zero, which signals that the training classes are perfectly separable rather than that the model strikes a good balance between fit and complexity.
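
The separation is no mystery: `response` was created by thresholding `crim`, and `crim` itself is among the predictors in `response ~ .`. A one-line check:

``` {r Prob3Sep}
#response equals 1 exactly when crim >= its median, so crim alone
#classifies the training data perfectly
table(tr.bos$response, tr.bos$crim >= crim.med)
```
A fairer comparison would exclude `crim` from the predictor set, e.g. `response ~ . - crim`.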

``` {r Prob3LDA}
## LDA - FULL ##
lda.bos <- lda(response ~ ., data = tr.bos)

preds.lda <- predict(lda.bos, te.bos)
table(te.bos$response, preds.lda$class)

acc.lda <- mean(preds.lda$class == te.bos$response)
cat("The prediction accuracy of the full LDA model is:", percent(acc.lda, accuracy = 0.01), "\n")

## LDA - REDUCED ##
lda.red <- lda(response ~ nox + age + rad + medv, data = tr.bos)

preds.lda.red <- predict(lda.red, te.bos)
table(te.bos$response, preds.lda.red$class)

acc.lda.red <- mean(preds.lda.red$class == te.bos$response)
cat("The prediction accuracy of the reduced LDA model is:", percent(acc.lda.red, accuracy = 0.01), "\n")
```
The prediction accuracy improved by one percentage point from the full LDA model (83.00\%) to the reduced LDA model (84.00\%). Neither LDA model outperformed the full logistic model.

``` {r Prob3NB}
## NAIVE BAYES - FULL ##
nb.bos <- naiveBayes(response ~ ., data = tr.bos)

preds.nb <- predict(nb.bos, te.bos, type = 'class')
table(te.bos$response, preds.nb)

acc.nb <- mean(preds.nb == te.bos$response)
cat("The prediction accuracy of the full Naive Bayes model is:", percent(acc.nb, accuracy = 0.01), "\n")

## NAIVE BAYES - REDUCED ##
nb.bos.red <- naiveBayes(response ~ zn + nox + age + rad + ptratio + medv, data = tr.bos)

preds.nb.red <- predict(nb.bos.red, te.bos, type = 'class')
table(te.bos$response, preds.nb.red)

acc.nb.red <- mean(preds.nb.red == te.bos$response)
cat("The prediction accuracy of the reduced Naive Bayes model is:", percent(acc.nb.red, accuracy = 0.01), "\n")
```
The full Naive Bayes model outperforms the reduced model, but both remain slightly less accurate than the full logistic regression model.

``` {r Prob3KNN}
## KNN - K = 1 ##
x.tr <- tr.bos[, setdiff(names(tr.bos), "response")]
y.tr <- tr.bos$response
x.te <- te.bos[, setdiff(names(te.bos), "response")]
y.te <- te.bos$response

set.seed(4816)  #knn() breaks distance ties at random
knn.bos <- knn(x.tr, x.te, y.tr, k = 1)
table(knn.bos, y.te)

acc.knn <- mean(knn.bos == y.te)
cat("The prediction accuracy of the KNN model with k = 1 is:", percent(acc.knn, accuracy = 0.01), "\n")
```
The full logistic regression model remains the highest performing of those tested here, though the caveat above applies: with `crim` among the predictors, separating the two classes is nearly trivial for the full models.
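
Finally, the accuracies can be collected into one table for comparison (all objects from the chunks above):

``` {r Prob3Summary}
data.frame(
  model = c("Logistic (full)", "Logistic (reduced)", "LDA (full)", "LDA (reduced)",
            "Naive Bayes (full)", "Naive Bayes (reduced)", "KNN (K = 1)"),
  accuracy = percent(c(acc.log, acc.log.red, acc.lda, acc.lda.red,
                       acc.nb, acc.nb.red, acc.knn), accuracy = 0.01)
)
```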




