This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
## Rows: 1,089
## Columns: 9
## $ Year <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990,…
## $ Lag1 <dbl> 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807, …
## $ Lag2 <dbl> 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, …
## $ Lag3 <dbl> -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, …
## $ Lag4 <dbl> -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.712,…
## $ Lag5 <dbl> -3.484, -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.514…
## $ Volume <dbl> 0.1549760, 0.1485740, 0.1598375, 0.1616300, 0.1537280, 0.15…
## $ Today <dbl> -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807, 0.041, …
## $ Direction <fct> Down, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, U…
Weekly is a data frame with 1,089 observations on the following 9 variables.
Year: The year that the observation was recorded
Lag1: Percentage return for previous week
Lag2: Percentage return for 2 weeks previous
Lag3: Percentage return for 3 weeks previous
Lag4: Percentage return for 4 weeks previous
Lag5: Percentage return for 5 weeks previous
Volume: Volume of shares traded (average number of daily shares traded in billions)
Today: Percentage return for this week
Direction: A factor with levels Down and Up indicating whether the market had a positive or negative return on a given week
Produce some numerical and graphical summaries of the Weekly data.
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
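The code that produced the data overview and correlation matrix above is not shown in this extract; a minimal sketch that would reproduce them is below, assuming the packages used throughout this section (ISLR, dplyr, MASS, class, ggplot2, ggthemes) are installed and possibly already loaded earlier in the document.
library(ISLR)      # Weekly, Auto data sets
library(dplyr)     # glimpse(), %>%, group_by(), summarize()
library(MASS)      # lda(), qda(), Boston data set
library(class)     # knn()
library(ggplot2)   # ggplot()
library(ggthemes)  # theme_economist()
glimpse(Weekly)    # structure shown above
cor(Weekly[, -9])  # correlation matrix above; column 9 (Direction) is a factor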
Looking at the correlation matrix for the Weekly data, the only pair of variables with a substantial correlation is Year and Volume (roughly 0.84); the lag variables and Today are essentially uncorrelated with each other. This positive relationship is shown in the chart below, where Volume increases as Year increases.
#Exercise 10-a Year & Volume
Weekly$Week = 1:nrow(Weekly)
year_breaks = Weekly %>%
group_by(Year) %>%
summarize(Week = min(Week))
year_volume_plot = ggplot(data = Weekly, aes(x = Week, y = Volume)) +
geom_line() +
geom_smooth() +
scale_x_continuous(breaks = year_breaks$Week, minor_breaks = NULL, labels = year_breaks$Year) +
theme_economist() +
labs(title = "Average Number of Daily Shares Traded Over Time",
x = "Year")
year_volume_plot
Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors.
#Exercise 10-b
logreg_1 = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume , data = Weekly , family = "binomial")
summary(logreg_1)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = "binomial", data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Lag2 is the only statistically significant predictor, as it is the only one with a p-value (0.0296) below the 0.05 significance level.
#Exercise 10-c
prob_1 = predict(logreg_1, Weekly, type ="response")
preds_1 = rep("Down", 1089) #1089 rows in the data
preds_1[prob_1 > 0.5] = "Up"
table(preds_1, Weekly$Direction)
##
## preds_1 Down Up
## Down 54 48
## Up 430 557
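From the confusion matrix, the overall fraction of correct predictions is (54 + 557) / 1089 ≈ 56.1%, only slightly better than always predicting Up (48 + 557 = 605 Up weeks, about 55.6%). A quick check, as a sketch:
mean(preds_1 == Weekly$Direction)   # overall accuracy, about 0.561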
Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
#Exercise 10-d
training_data = (Weekly$Year <= 2008)
Weekly.20092010 = Weekly[!training_data, ]
Direction.20092010 = Weekly$Direction[!training_data]
logreg_2 = glm(Direction ~ Lag2,
data = Weekly,
family = "binomial",
subset = training_data)
summary(logreg_2)
##
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = Weekly,
## subset = training_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.536 -1.264 1.021 1.091 1.368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
#Exercise 10-d Matrix
probs2 = predict(logreg_2, Weekly.20092010, type = "response")
preds2 = rep("Down", length(probs2))
preds2[probs2 > 0.5] = "Up"
table(preds2, Direction.20092010)
## Direction.20092010
## preds2 Down Up
## Down 9 5
## Up 34 56
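The held-out accuracy is (9 + 56) / 104 ≈ 62.5%, i.e. a test error of about 37.5%. A quick check (sketch):
mean(preds2 == Direction.20092010)   # about 0.625 on the 2009-2010 weeks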
#Exercise 10-e
lda.fit = lda(Direction ~ Lag2, data = Weekly, subset = training_data)
preds3 = predict(lda.fit, Weekly.20092010)
table(preds3$class, Direction.20092010)
## Direction.20092010
## Down Up
## Down 9 5
## Up 34 56
#Exercise 10-f
qda.fit = qda(Direction ~ Lag2, data = Weekly, subset = training_data)
preds4 = predict(qda.fit, Weekly.20092010)
table(preds4$class, Direction.20092010)
## Direction.20092010
## Down Up
## Down 0 0
## Up 43 61
#Exercise 10-g
train.X = as.matrix(Weekly$Lag2[training_data])
test.X = as.matrix(Weekly$Lag2[!training_data])
train.Direction = Weekly$Direction[training_data]
set.seed(1)
pred.knn = knn(train.X, test.X, train.Direction, k = 1)
table(pred.knn, Direction.20092010)
## Direction.20092010
## pred.knn Down Up
## Down 21 30
## Up 22 31
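For comparison across parts (d) through (g), the held-out accuracies can be computed directly from the fitted objects above (a sketch; the values follow from the confusion matrices):
mean(preds2 == Direction.20092010)         # logistic regression (Lag2): (9+56)/104  ~ 0.625
mean(preds3$class == Direction.20092010)   # LDA (Lag2):                 (9+56)/104  ~ 0.625
mean(preds4$class == Direction.20092010)   # QDA (Lag2):                 (0+61)/104  ~ 0.587
mean(pred.knn == Direction.20092010)       # KNN, k = 1:                 (21+31)/104 ~ 0.500
Logistic regression and LDA with Lag2 alone give the highest held-out accuracy here.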
#Exercise 10-i Establish Models
#Logistic Regression Model
logreg_3 = glm(Direction ~ Lag1:Lag2,
data = Weekly,
family = "binomial",
subset = training_data)
#LDA Model
lda.fit_2 = lda(Direction ~ Lag1:Lag2, data = Weekly, subset = training_data)
#QDA Model
qda.fit_2 = qda(Direction ~ Lag1:Lag2, data = Weekly, subset = training_data)
#KNN Model
set.seed(1)
pred.knn_2 = knn(train.X, test.X, train.Direction, k = 12)
# 10-i Logistic Regression Model Results (with interaction Lag1:Lag2)
probs2.1 = predict(logreg_3, Weekly.20092010, type = "response")
preds2.1 = rep("Down", length(probs2.1))
preds2.1[probs2.1 > 0.5] = "Up"
table(preds2.1, Direction.20092010)
## Direction.20092010
## preds2.1 Down Up
## Down 1 1
## Up 42 60
# 10-i LDA Model Results (with interaction Lag1:Lag2)
preds3.1 = predict(lda.fit_2, Weekly.20092010)
table(preds3.1$class, Direction.20092010)
## Direction.20092010
## Down Up
## Down 0 1
## Up 43 60
# 10-i QDA Model Results (with interaction Lag1:Lag2)
preds4.1 = predict(qda.fit_2, Weekly.20092010)
table(preds4.1$class, Direction.20092010)
## Direction.20092010
## Down Up
## Down 16 32
## Up 27 29
# 10-i KNN Model Results (k = 12)
table(pred.knn_2, Direction.20092010)
## Direction.20092010
## pred.knn_2 Down Up
## Down 18 21
## Up 25 40
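The same comparison for the part (i) fits (a sketch; values follow from the confusion matrices above) shows that none of them improves on the Lag2-only logistic regression and LDA models:
mean(preds2.1 == Direction.20092010)         # logistic, Lag1:Lag2: (1+60)/104  ~ 0.587
mean(preds3.1$class == Direction.20092010)   # LDA, Lag1:Lag2:      (0+60)/104  ~ 0.577
mean(preds4.1$class == Direction.20092010)   # QDA, Lag1:Lag2:      (16+29)/104 ~ 0.433
mean(pred.knn_2 == Direction.20092010)       # KNN, k = 12:         (18+40)/104 ~ 0.558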
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set. Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
#Exercise 11-a
mpg01 = rep(0, length(Auto$mpg))
mpg01[Auto$mpg > median(Auto$mpg)] = 1
Auto = data.frame(Auto, mpg01)
table(Auto$mpg01)
##
## 0 1
## 196 196
Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
## Rows: 392
## Columns: 8
## $ cylinders <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4,…
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 3…
## $ horsepower <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 1…
## $ weight <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 38…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5,…
## $ year <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, …
## $ origin <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3,…
## $ mpg01 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
## cylinders displacement horsepower weight acceleration
## cylinders 1.0000000 0.9508233 0.8429834 0.8975273 -0.5046834
## displacement 0.9508233 1.0000000 0.8972570 0.9329944 -0.5438005
## horsepower 0.8429834 0.8972570 1.0000000 0.8645377 -0.6891955
## weight 0.8975273 0.9329944 0.8645377 1.0000000 -0.4168392
## acceleration -0.5046834 -0.5438005 -0.6891955 -0.4168392 1.0000000
## year -0.3456474 -0.3698552 -0.4163615 -0.3091199 0.2903161
## origin -0.5689316 -0.6145351 -0.4551715 -0.5850054 0.2127458
## mpg01 -0.7591939 -0.7534766 -0.6670526 -0.7577566 0.3468215
## year origin mpg01
## cylinders -0.3456474 -0.5689316 -0.7591939
## displacement -0.3698552 -0.6145351 -0.7534766
## horsepower -0.4163615 -0.4551715 -0.6670526
## weight -0.3091199 -0.5850054 -0.7577566
## acceleration 0.2903161 0.2127458 0.3468215
## year 1.0000000 0.1815277 0.4299042
## origin 0.1815277 1.0000000 0.5136984
## mpg01 0.4299042 0.5136984 1.0000000
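The correlation matrix above excludes mpg itself and the name column (a factor, which cannot be passed to cor()); a sketch of code that would reproduce it:
cor(subset(Auto, select = -c(mpg, name)))   # numeric features plus mpg01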
#Exercise 11-b Charts
par(mfrow=c(2,2))
boxplot(cylinders ~ mpg01, data = Auto, main = "Cylinders vs mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "Displacement vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "Horsepower vs mpg01")
boxplot(weight ~ mpg01, data = Auto, main = "Weight vs mpg01")
The features most strongly associated with mpg01 based on the above analysis (correlation with mpg01 shown):
cylinders: -0.7591939
displacement: -0.7534766
horsepower: -0.6670526
weight: -0.7577566
Split the data into a training set and a test set.
#Exercise 11-c
train = (Auto$year %% 2 == 0)
Auto.train = Auto[train, ]
Auto.test = Auto[!train, ]
mpg01.test = Auto$mpg01[!train]
Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
#Exercise 11-d
#LDA Model
Auto.lda = lda(mpg01 ~ cylinders + displacement + horsepower + weight, data = Auto, subset = train)
pred.lda = predict(Auto.lda, Auto.test)
table(pred.lda$class, mpg01.test)
## mpg01.test
## 0 1
## 0 86 9
## 1 14 73
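From the confusion matrix, the LDA test error is (14 + 9) / 182 ≈ 12.6%. A quick check (sketch):
mean(pred.lda$class != mpg01.test)   # about 0.126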
Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
#Exercise 11-e
#QDA Model
Auto.qda = qda(mpg01 ~ cylinders + displacement + horsepower + weight, data = Auto, subset = train)
pred.qda = predict(Auto.qda, Auto.test)
table(pred.qda$class, mpg01.test)
## mpg01.test
## 0 1
## 0 89 13
## 1 11 69
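The QDA test error is (13 + 11) / 182 ≈ 13.2%, slightly worse than LDA here. A quick check (sketch):
mean(pred.qda$class != mpg01.test)   # about 0.132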
Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
#Exercise 11-f
#Logistic Regression Model
Auto.log = glm(mpg01 ~ cylinders + displacement + horsepower + weight,
data = Auto,
family = "binomial",
subset = train)
Auto.probs = predict(Auto.log, Auto.test, type = "response")
pred.auto = rep(0, length(Auto.probs))
pred.auto[Auto.probs > 0.5] <- 1
table(pred.auto, mpg01.test)
## mpg01.test
## pred.auto 0 1
## 0 89 11
## 1 11 71
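The logistic regression test error is (11 + 11) / 182 ≈ 12.1%, the lowest of the three models so far. A quick check (sketch):
mean(pred.auto != mpg01.test)   # about 0.121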
Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
#Exercise 11-g
attach(Auto)
#KNN Model
Autotrain.X = cbind(cylinders, weight, displacement, horsepower)[train, ]
Autotest.X = cbind(cylinders, weight, displacement, horsepower)[!train, ]
train.mpg01 = mpg01[train]
set.seed(1)
Autopred.knn_1 = knn(Autotrain.X, Autotest.X, train.mpg01, k = 1)
table(Autopred.knn_1, mpg01.test)
## mpg01.test
## Autopred.knn_1 0 1
## 0 83 11
## 1 17 71
Autopred.knn_50 = knn(Autotrain.X, Autotest.X, train.mpg01, k = 50)
table(Autopred.knn_50, mpg01.test)
## mpg01.test
## Autopred.knn_50 0 1
## 0 80 7
## 1 20 75
Autopred.knn_100 = knn(Autotrain.X, Autotest.X, train.mpg01, k = 100)
table(Autopred.knn_100, mpg01.test)
## mpg01.test
## Autopred.knn_100 0 1
## 0 81 7
## 1 19 75
Autopred.knn_150 = knn(Autotrain.X, Autotest.X, train.mpg01, k = 150)
table(Autopred.knn_150, mpg01.test)
## mpg01.test
## Autopred.knn_150 0 1
## 0 72 6
## 1 28 76
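The KNN test errors for each value of K follow from the confusion matrices above (a sketch):
mean(Autopred.knn_1 != mpg01.test)     # k = 1:   (17+11)/182 ~ 0.154
mean(Autopred.knn_50 != mpg01.test)    # k = 50:  (20+7)/182  ~ 0.148
mean(Autopred.knn_100 != mpg01.test)   # k = 100: (19+7)/182  ~ 0.143
mean(Autopred.knn_150 != mpg01.test)   # k = 150: (28+6)/182  ~ 0.187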
Based on the test errors above, the KNN model with k = 100 PERFORMS THE BEST among the values of K tried.
Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.
#Exercise 13
attach(Boston)
crim01 = rep(0, length(crim))
crim01[crim > median(crim)] = 1
Boston = data.frame(Boston, crim01)
table(crim01)
## crim01
## 0 1
## 253 253
#Exercise 13 Split
train = 1:(length(crim) / 2)
test = (length(crim) / 2 + 1):length(crim)
Boston.train = Boston[train, ]
Boston.test = Boston[test, ]
crim01.test = crim01[test]
#Exercise 13 Logistic Regression Model
Boston.glm_1 = glm(crim01 ~ . - crim01 - crim,
data = Boston,
family = "binomial",
subset = train)
Boston.glm_2 = glm(crim01 ~ . - crim01 - crim - chas - nox,
data = Boston,
family = 'binomial',
subset = train)
Boston.probs_1 = predict(Boston.glm_1, Boston.test, type = "response")
Boston.pred_1 = rep(0, length(Boston.probs_1))
Boston.pred_1[Boston.probs_1 > 0.5] = 1
table(Boston.pred_1, crim01.test)
## crim01.test
## Boston.pred_1 0 1
## 0 68 24
## 1 22 139
Boston.probs_2 = predict(Boston.glm_2, Boston.test, type = "response")
Boston.pred_2 = rep(0, length(Boston.probs_2))
Boston.pred_2[Boston.probs_2 > 0.5] = 1
table(Boston.pred_2, crim01.test)
## crim01.test
## Boston.pred_2 0 1
## 0 78 28
## 1 12 135
#Exercise 13 LDA Model
Boston.lda_1 = lda(crim01 ~ . - crim01 - crim, data = Boston, subset = train)
pred.lda_1 = predict(Boston.lda_1, Boston.test)
table(pred.lda_1$class, crim01.test)
## crim01.test
## 0 1
## 0 80 24
## 1 10 139
Boston.lda_2 = lda(crim01 ~ . - crim01 - crim - chas - nox - tax, data = Boston, subset = train)
pred.lda_2 = predict(Boston.lda_2, Boston.test)
table(pred.lda_2$class, crim01.test)
## crim01.test
## 0 1
## 0 83 28
## 1 7 135
#Exercise 13 KNN Model
Boston.train.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[train, ]
Boston.test.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[test, ]
train.crim01 = crim01[train]
set.seed(1)
Boston.pred.knn_1 = knn(Boston.train.X, Boston.test.X, train.crim01, k = 1)
table(Boston.pred.knn_1, crim01.test)
## crim01.test
## Boston.pred.knn_1 0 1
## 0 85 111
## 1 5 52
Boston.pred.knn_50 = knn(Boston.train.X, Boston.test.X, train.crim01, k = 50)
table(Boston.pred.knn_50, crim01.test)
## crim01.test
## Boston.pred.knn_50 0 1
## 0 80 23
## 1 10 140
Boston.pred.knn_150 = knn(Boston.train.X, Boston.test.X, train.crim01, k = 150)
table(Boston.pred.knn_150, crim01.test)
## crim01.test
## Boston.pred.knn_150 0 1
## 0 84 31
## 1 6 132
Boston.glm_1 Test Error = 1 - (68+139) / 253 OR 18.18%
Boston.glm_2 Test Error = 1 - (78+135) / 253 OR 15.81%
Boston.lda_1 Test Error = 1 - (80+139) / 253 OR 13.44%
Boston.lda_2 Test Error = 1 - (83+135) / 253 OR 13.83%
Boston.pred.knn_1 Test Error = 1 - (85+52) / 253 OR 45.85%
Boston.pred.knn_50 Test Error = 1 - (80+140) / 253 OR 13.04% PERFORMS THE BEST
Boston.pred.knn_150 Test Error = 1 - (84+132) / 253 OR 14.62%