##13) This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1, 089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

library(ISLR2)

##(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
pairs(Weekly)

pairs(Weekly[, -9])  # Excluding the 'Direction' variable, which is categorical

From the summary statistics:

The mean returns for Lag1 to Lag5 are around 0.15, indicating a slight positive trend in weekly returns. The trading volume varies, with a mean of approximately 1.57. The mean return for the current week (Today) is approximately 0.15. There are more “Up” directions (605 instances) compared to “Down” directions (484 instances), suggesting that the market tends to move up more frequently.

##b)Use the full data set to perform a logistic regression with. Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

# Perform logistic regression
logistic_model <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)

# Print the summary of the logistic regression results
summary(logistic_model)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Based on the summary output of the logistic regression model:

The intercept term has a coefficient estimate of 0.26686 with a standard error of 0.08593. It is statistically significant with a p-value of 0.0019.

Among the lag variables:

Lag2 has a coefficient estimate of 0.05844 with a standard error of 0.02686. It is statistically significant with a p-value of 0.0296. Lag1 has a coefficient estimate of -0.04127 with a standard error of 0.02641. Although it has a negative estimate, it is not statistically significant with a p-value of 0.1181. Lag3, Lag4, and Lag5 are not statistically significant as their p-values are greater than the chosen significance level (e.g., 0.05). The Volume variable also does not appear to be statistically significant, with a coefficient estimate of -0.02274 and a p-value of 0.5377.

In summary, among the predictors, only Lag2 is statistically significant in predicting the direction of weekly returns, while the other lag variables and the volume variable do not show significant associations.

##C)Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

# Make predictions using the logistic regression model
predictions <- ifelse(predict(logistic_model, type = "response") > 0.5, "Up", "Down")

# Create a confusion matrix
conf_matrix <- table(predictions, Weekly$Direction)

# Compute the overall fraction of correct predictions
overall_accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

# Print the confusion matrix
print(conf_matrix)
##            
## predictions Down  Up
##        Down   54  48
##        Up    430 557
# Print the overall fraction of correct predictions
print(overall_accuracy)
## [1] 0.5610652

True Positives (TP): The model correctly predicted “Up” for 557 instances where the actual direction was “Up”. False Positives (FP): The model incorrectly predicted “Up” for 48 instances where the actual direction was “Down”. False Negatives (FN): The model incorrectly predicted “Down” for 430 instances where the actual direction was “Up”. True Negatives (TN): The model correctly predicted “Down” for 54 instances where the actual direction was “Down”. The overall fraction of correct predictions, also known as accuracy, is calculated as the sum of true positives and true negatives divided by the total number of predictions. In this case, the overall accuracy is approximately 56.11%.

From the confusion matrix, we can see that the logistic regression model is better at predicting the “Up” direction (557 true positives) compared to the “Down” direction (54 true negatives). However, it tends to have more false negatives (430) than false positives (48), indicating that it’s more likely to incorrectly predict “Down” when the actual direction is “Up”.

##d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

# Subset the data for the training period (1990-2008) and the held-out period (2009-2010)
train_data <- Weekly[Weekly$Year < 2009, ]
held_out_data <- Weekly[Weekly$Year >= 2009, ]

# Fit logistic regression model using Lag2 as the predictor for the training data
model <- glm(Direction ~ Lag2, data = train_data, family = binomial)

# Predict using the held-out data
predictions <- predict(model, newdata = held_out_data, type = "response")

# Convert predicted probabilities to directional predictions
predicted_direction <- ifelse(predictions > 0.5, "Up", "Down")

# Create the confusion matrix
conf_matrix <- table(predicted_direction, held_out_data$Direction)

# Calculate the overall fraction of correct predictions
overall_accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

# Print the confusion matrix and overall fraction of correct predictions
print(conf_matrix)
##                    
## predicted_direction Down Up
##                Down    9  5
##                Up     34 56
print(overall_accuracy)
## [1] 0.625

This means that out of all the predictions made, approximately 62.5% were correct.

##(e) Repeat (d) using LDA.

# Load required library for LDA
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
## 
##     Boston
# Subset data for training period (1990-2008) and held-out period (2009-2010)
train_data <- Weekly[Weekly$Year <= 2008, ]
test_data <- Weekly[Weekly$Year > 2008, ]

# Fit LDA model with Lag2 as the only predictor
lda_model <- lda(Direction ~ Lag2, data = train_data)

# Make predictions on the held-out data
lda_predictions <- predict(lda_model, newdata = test_data)

# Compute confusion matrix
confusion_matrix_lda <- table(lda_predictions$class, test_data$Direction)

# Compute overall fraction of correct predictions
correct_predictions_fraction_lda <- sum(diag(confusion_matrix_lda)) / sum(confusion_matrix_lda)

# Print confusion matrix
print("Confusion Matrix for LDA:")
## [1] "Confusion Matrix for LDA:"
print(confusion_matrix_lda)
##       
##        Down Up
##   Down    9  5
##   Up     34 56
# Print overall fraction of correct predictions
print(paste("Overall Fraction of Correct Predictions for LDA:", correct_predictions_fraction_lda))
## [1] "Overall Fraction of Correct Predictions for LDA: 0.625"

The overall fraction of correct predictions for the LDA model is approximately 0.625. This means that the LDA model correctly predicts the direction of the market about 62.5% of the time for the held-out data (2009-2010).

##(f) Repeat (d) using QDA.

# Load the required library
library(MASS)

# Create a subset of data for the training period (1990-2008)
train_data <- Weekly[Weekly$Year <= 2008, ]

# Fit the QDA model with Lag2 as the predictor
qda_model <- qda(Direction ~ Lag2, data = train_data)

# Predict on the held-out data (2009-2010)
held_out_data <- Weekly[Weekly$Year > 2008, ]
predictions <- predict(qda_model, newdata = held_out_data)

# Compute the confusion matrix
confusion_matrix <- table(predictions$class, held_out_data$Direction)

# Compute the overall fraction of correct predictions
overall_accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

# Print the confusion matrix and overall fraction of correct predictions
print(confusion_matrix)
##       
##        Down Up
##   Down    0  0
##   Up     43 61
print(overall_accuracy)
## [1] 0.5865385

Overall Fraction of Correct Predictions: 0.5865385

##(g) Repeat (d) using KNN with K = 1. This indicates that the QDA model correctly predicts the direction of the market (Up or Down) approximately 58.65% of the time on the held-out data (2009-2010).

# Load necessary libraries
library(ISLR)
## 
## Attaching package: 'ISLR'
## The following objects are masked from 'package:ISLR2':
## 
##     Auto, Credit
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
# Set the seed for reproducibility
set.seed(123)

# Split the dataset into training (1990-2008) and testing (2009-2010) data
train_data <- Weekly[Weekly$Year <= 2008, ]
test_data <- Weekly[Weekly$Year > 2008, ]

# Define a range of values for K
k_values <- 1:10  # Adjust the range as needed

# Initialize vectors to store accuracy for each K value
accuracies <- numeric(length(k_values))

# Train KNN model for each value of K
for (k in k_values) {
  # Train KNN model
  knn_model <- train(Direction ~ Lag2, data = train_data, method = "knn", trControl = trainControl(method = "cv", number = 10), tuneGrid = data.frame(k = k))
  
  # Make predictions on test data
  predictions <- predict(knn_model, newdata = test_data)
  
  # Compute confusion matrix
  confusion_matrix <- table(predictions, test_data$Direction)
  
  # Compute overall accuracy
  accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
  
  # Store accuracy for current K value
  accuracies[k] <- accuracy
}

# Find the optimal K value based on the highest accuracy
optimal_k <- which.max(accuracies)

# Output the results
print(confusion_matrix)
##            
## predictions Down Up
##        Down   17 20
##        Up     26 41
print(paste0("Optimal K value: ", optimal_k))
## [1] "Optimal K value: 4"
print(paste0("Overall accuracy with optimal K: ", accuracies[optimal_k]))
## [1] "Overall accuracy with optimal K: 0.605769230769231"

the optimal value of K for the K-nearest neighbors (KNN) model is 4, and the overall accuracy achieved with this optimal K value on the testing data (from 2009 to 2010) is approximately 60.58%. This means that when using KNN with K=4, the model correctly predicts the direction of the market movement around 60.58% of the time on the unseen data. Adjusting the range of K values or other parameters may further improve the model’s performance.

##(h) Repeat (d) using naive Bayes.

# Load necessary library
library(e1071)

# Define the training and testing periods
train_data <- Weekly[Weekly$Year < 2009, ]
test_data <- Weekly[Weekly$Year >= 2009, ]

# Train the naive Bayes model using only Lag2 as the predictor
nb_model <- naiveBayes(Direction ~ Lag2, data = train_data)

# Make predictions on the testing data
predictions <- predict(nb_model, newdata = test_data, type = "class")

# Compute the confusion matrix
confusion_matrix <- table(predictions, test_data$Direction)
print(confusion_matrix)
##            
## predictions Down Up
##        Down    0  0
##        Up     43 61
# Compute the overall fraction of correct predictions
overall_accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(overall_accuracy)
## [1] 0.5865385

Overall Fraction of Correct Predictions: 0.5865385 This means that out of the total predictions made on the testing data from 2009 and 2010, approximately 58.65% were correct. However, the model did not predict any instances of the “Down” direction correctly, indicating potential limitations or issues with the model’s performance.

##i) Which of these methods appears to provide the best results on this data?

Linear Discriminant Analysis (LDA) shows the highest overall fraction of correct predictions, with an accuracy of 0.625. Quadratic Discriminant Analysis (QDA) has the lowest overall fraction of correct predictions, with an accuracy of 0.5865385. K-Nearest Neighbors (KNN) with K=1 falls between LDA and QDA in terms of accuracy, with an overall accuracy of 0.605769230769231. Therefore, based solely on the provided metrics, Linear Discriminant Analysis (LDA) appears to provide the best results on this data.

##(j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

A weighted average of the lag variables, with the most recent lag values given a higher weight than the values further back in time, is the combination of predictors I will experiment with. To be more precise, I’ll try assigning Lag1 a 40% weight, Lag2 a 35% weight, Lag3 a 15% weight, and Lag4 and Lag5 a 5% apiece. Note that aside from the hunch that recent weeks may be more closely correlated with the direction of the current week than weeks further in the past, I don’t have a compelling explanation for these specific weight values.

weighted.lag.avg = 0.4*Weekly$Lag1 + 0.35*Weekly$Lag2 + 0.15*Weekly$Lag3 + 0.05*Weekly$Lag4 + 0.05*Weekly$Lag5
Weekly = data.frame(Weekly, weighted.lag.avg)
head(Weekly)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
## 1 1990  0.816  1.572 -3.936 -0.229 -3.484 0.1549760 -0.270      Down
## 2 1990 -0.270  0.816  1.572 -3.936 -0.229 0.1485740 -2.576      Down
## 3 1990 -2.576 -0.270  0.816  1.572 -3.936 0.1598375  3.514        Up
## 4 1990  3.514 -2.576 -0.270  0.816  1.572 0.1616300  0.712        Up
## 5 1990  0.712  3.514 -2.576 -0.270  0.816 0.1537280  1.178        Up
## 6 1990  1.178  0.712  3.514 -2.576 -0.270 0.1544440 -1.372      Down
##   weighted.lag.avg
## 1          0.10055
## 2          0.20515
## 3         -1.12070
## 4          0.58290
## 5          1.15560
## 6          1.10520
cor(Weekly$Today, Weekly$weighted.lag.avg)
## [1] -0.03724141

After calculating the correlation between this weighted average and the return value for this week, we find that there is not much of one to the other. Compared to the correlations between Today and the first lag variables separately, it is less significant. That seems to imply that this weighting might not be all that helpful, but I’ll still test out each classification strategy using this predictor transformation. I’ll begin by using logistic regression.

train <- Weekly$Year <= 2008
glm.fit <- glm(Direction ~ weighted.lag.avg, data = Weekly, subset = train, family = "binomial")
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ weighted.lag.avg, family = "binomial", 
##     data = Weekly, subset = train)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       0.21349    0.06446   3.312 0.000926 ***
## weighted.lag.avg -0.02816    0.05347  -0.527 0.598508    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1354.4  on 983  degrees of freedom
## AIC: 1358.4
## 
## Number of Fisher Scoring iterations: 3
glm.probs = predict(glm.fit, Weekly[!train, ], type = "response")
glm.pred = rep("Down", dim(Weekly[!train, ])[1])
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Weekly[!train, ]$Direction)
##         
## glm.pred Down Up
##       Up   43 61
mean(glm.pred == Weekly[!train, ]$Direction)
## [1] 0.5865385

As we can see, the findings of the logistic regression approach aren’t very encouraging. When the test set was used, this method’s 50% prediction threshold was equivalent to constantly predicting that the market would rise. Furthermore, the weighted.lag.avg coefficient’s p-value is 0.598, indicating a lack of evidence supporting a statistically significant result. Upon closer examination of the concept of utilizing a weighted average, it is reasonable to conclude that the results obtained via logistic regression will not surpass our findings from Part 4 because the weighted average remains a linear combination of the variables. As demonstrated previously, Lag2 was the sole statistically significant coefficient in a logistic regression model, and even then it is borderline at the 5% significance level, so the weighted average includes variables which we already had reason to believe weren’t particularly helpful in making a strong model. I’ll still try out the remaining methods with weighted.lag.avg before trying out one other combination of the predictors. Next up is linear discriminant analysis.

lda.fit = lda(Direction ~ weighted.lag.avg, data = Weekly, subset = train)
lda.pred = predict(lda.fit, Weekly[!train, ])
table(lda.pred$class, Weekly[!train, ]$Direction)
##       
##        Down Up
##   Down    0  0
##   Up     43 61
mean(lda.pred$class == Weekly[!train, ]$Direction)
## [1] 0.5865385

Linear discriminant analysis has the same performance as logistic regression. This is reasonable since the two methods often perform similarly. Now we’ll consider quadratic discriminant analysis.

qda.fit = qda(Direction ~ weighted.lag.avg, data = Weekly, subset = train)
qda.pred = predict(qda.fit, Weekly[!train, ])
table(qda.pred$class, Weekly[!train, ]$Direction)
##       
##        Down Up
##   Down    0  0
##   Up     43 61
mean(qda.pred$class == Weekly[!train, ]$Direction)
## [1] 0.5865385
max(qda.pred$posterior)
## [1] 0.9435257

Quadratic discriminant analysis also performed the same, but when checking the maximum value in the dataframe of posterior probabilities, I noticed that the maximum probability from QDA is 0.943. This means that it could be meaningful to try out a stricter threshold for predicting an up week. I will try out a threshold of 60%.

qda.pred60 = rep("Down", dim(Weekly[!train, ])[1])
qda.pred60[qda.pred$posterior[, "Up"] > 0.6] = "Up"
table(qda.pred60, Weekly[!train, ]$Direction)
##           
## qda.pred60 Down Up
##       Down   37 48
##       Up      6 13
mean(qda.pred60 == Weekly[!train, ]$Direction)
## [1] 0.4807692

When predicting an up week with a somewhat tougher likelihood criterion of 60%, we observe that our total prediction accuracy drops to 48.1%. But as of right now, our false positive rate is 6/43≈0.14.

, which is a significant improvement over Part 6’s false positive rate of 1. Furthermore, 13/19≈0.684 is our positive predictive value, which is also better than the figure of 0.587 from Part 6. This implies that QDA with our weighted average of the lag factors and a tougher probability criterion could be promising if we are risk-averse investors who only want to invest our money when there is a reasonable chance of the market rising. Finally, let’s talk about K.

train.X = data.frame(Weekly[train, "weighted.lag.avg"])
test.X = data.frame(Weekly[!train, "weighted.lag.avg"])
train.Direction = Weekly[train, "Direction"]
library(class)
set.seed(1)
knn.pred = knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Weekly[!train, ]$Direction)
##         
## knn.pred Down Up
##     Down   14 28
##     Up     29 33
mean(knn.pred == Weekly[!train, ]$Direction)
## [1] 0.4519231

When it comes to overall prediction accuracy, KNN performs even worse than random guessing with k=1. The true positive rate (0.541), false positive rate (0.674), and positive predictive rate (0.532) are unimpressive in comparison to the Part 8 result. Let’s test out k=3 and k=5 as two additional possibilities before continuing.

set.seed(1)
knn.pred = knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Weekly[!train, ]$Direction)
##         
## knn.pred Down Up
##     Down   15 28
##     Up     28 33
mean(knn.pred == Weekly[!train, ]$Direction)
## [1] 0.4615385
set.seed(1)
knn.pred = knn(train.X, test.X, train.Direction, k = 5)
table(knn.pred, Weekly[!train, ]$Direction)
##         
## knn.pred Down Up
##     Down   13 26
##     Up     30 35
mean(knn.pred == Weekly[!train, ]$Direction)
## [1] 0.4615385

The outcomes remain essentially the same even if we raise the value of k. Although it would be ideal to use extra tools to compare these various models, such calculating ROC curves and AUC scores for each model, I’ll save that until Chapter 9, when we go into greater detail about how to construct them in R. As of right now, QDA with a 60% prediction criterion appears to be the most promising model when it comes to employing weighted.lag.avg for prediction purposes.

##14. In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set. ##(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

# Load the Auto dataset
data(Auto)

# Compute the median of mpg
mpg_median <- median(Auto$mpg)

# Create the binary variable mpg01
Auto$mpg01 <- ifelse(Auto$mpg > mpg_median, 1, 0)

# Check the structure of the Auto dataset
str(Auto)
## 'data.frame':    392 obs. of  10 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  $ mpg01       : num  0 0 0 0 0 0 0 0 0 0 ...

##(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

# Load necessary libraries
library(ggplot2)

# Scatterplot matrix
pairs(Auto[, c("mpg01", "cylinders", "displacement", "horsepower", "weight", "acceleration", "year", "origin")])

# Boxplots
boxplot(cylinders ~ mpg01, data = Auto, xlab = "mpg01", ylab = "Cylinders", main = "Boxplot of Cylinders by mpg01")

boxplot(displacement ~ mpg01, data = Auto, xlab = "mpg01", ylab = "Displacement", main = "Boxplot of Displacement by mpg01")

boxplot(horsepower ~ mpg01, data = Auto, xlab = "mpg01", ylab = "Horsepower", main = "Boxplot of Horsepower by mpg01")

boxplot(weight ~ mpg01, data = Auto, xlab = "mpg01", ylab = "Weight", main = "Boxplot of Weight by mpg01")

boxplot(acceleration ~ mpg01, data = Auto, xlab = "mpg01", ylab = "Acceleration", main = "Boxplot of Acceleration by mpg01")

boxplot(year ~ mpg01, data = Auto, xlab = "mpg01", ylab = "Year", main = "Boxplot of Year by mpg01")

boxplot(origin ~ mpg01, data = Auto, xlab = "mpg01", ylab = "Origin", main = "Boxplot of Origin by mpg01")

##(c) Split the data into a training set and a test set.

# Set seed for reproducibility
set.seed(123)

# Number of rows in the dataset
n_rows <- nrow(Auto)

# Number of rows in the training set (e.g., 70%)
n_train <- round(0.7 * n_rows)

# Randomly select row indices for the training set
train_indices <- sample(1:n_rows, n_train, replace = FALSE)

# Create training and test sets
train_data <- Auto[train_indices, ]
test_data <- Auto[-train_indices, ]

##(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

library(MASS)

# Perform LDA on the training data
lda_model <- lda(mpg01 ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data = train_data)

# Make predictions on the test data
lda_pred <- predict(lda_model, newdata = test_data)$class

# Compute the confusion matrix
confusion <- table(lda_pred, test_data$mpg01)

# Calculate the test error
test_error <- 1 - sum(diag(confusion)) / sum(confusion)
print(test_error)
## [1] 0.1016949

##(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

library(MASS)

# Perform QDA on the training data
qda_model <- qda(mpg01 ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data = train_data)

# Make predictions on the test data
qda_pred <- predict(qda_model, newdata = test_data)$class

# Compute the confusion matrix
confusion <- table(qda_pred, test_data$mpg01)

# Calculate the test error
test_error <- 1 - sum(diag(confusion)) / sum(confusion)
print(test_error)
## [1] 0.06779661

##(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

# Perform logistic regression on the training data
logistic_model <- glm(mpg01 ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data = train_data, family = "binomial")

# Make predictions on the test data
logistic_pred <- ifelse(predict(logistic_model, newdata = test_data, type = "response") > 0.5, 1, 0)

# Compute the confusion matrix
confusion <- table(logistic_pred, test_data$mpg01)

# Calculate the test error
test_error <- 1 - sum(diag(confusion)) / sum(confusion)
print(test_error)
## [1] 0.09322034

##(g) Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

# Load the e1071 package
library(e1071)

# Perform naive Bayes classification on the training data
naive_bayes_model <- naiveBayes(mpg01 ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data = train_data)

# Make predictions on the test data
naive_bayes_pred <- predict(naive_bayes_model, newdata = test_data)

# Compute the confusion matrix
confusion <- table(naive_bayes_pred, test_data$mpg01)

# Calculate the test error
test_error <- 1 - sum(diag(confusion)) / sum(confusion)
print(test_error)
## [1] 0.09322034

##(h) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

# Load the class package
library(class)

# Define predictor variables
predictors <- c("cylinders", "displacement", "horsepower", "weight", "acceleration", "year", "origin")

# Initialize variables to store test errors and optimal K value
test_errors <- numeric()
optimal_K <- NULL

# Loop through different values of K
for (k in 1:10) {
  # Perform KNN classification on the training data
  knn_pred <- knn(train_data[predictors], test_data[predictors], train_data$mpg01, k = k)
  
  # Compute the confusion matrix
  confusion <- table(knn_pred, test_data$mpg01)
  
  # Calculate the test error
  test_error <- 1 - sum(diag(confusion)) / sum(confusion)
  
  # Store the test error
  test_errors <- c(test_errors, test_error)
}

# Find the optimal K value with the lowest test error
optimal_K <- which.min(test_errors)

# Print test errors for different values of K
print(test_errors)
##  [1] 0.16949153 0.15254237 0.13559322 0.12711864 0.11016949 0.11016949
##  [7] 0.10169492 0.09322034 0.11016949 0.09322034
# Print the optimal K value
print(optimal_K)
## [1] 8

This code performs KNN classification on the training data using the specified predictor variables for various values of K (from 1 to 10). It computes the test error for each value of K and stores them in a vector. Finally, it identifies the optimal K value that minimizes the test error.

Based on the test errors calculated for different values of K, we observe the following:

For K=1, the test error is approximately 16.95%. For K=2, the test error is approximately 15.25%. For K=3, the test error is approximately 13.56%. For K=4, the test error is approximately 12.71%. For K=5, the test error is approximately 11.02%. For K=6, the test error is approximately 11.02%. For K=7, the test error is approximately 10.17%. For K=8, the test error is approximately 9.32%. For K=9, the test error is approximately 11.02%. For K=10, the test error is approximately 9.32%. The optimal value of K is 8, as it corresponds to the lowest test error of approximately 9.32%.