library(tidyverse)
library(openintro)
library(ISLR2)
library(stats)
library(MASS)
library(e1071)
library(class)
library(dplyr)

Problem 13

This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

#load the data
data(Weekly)

13(a)

Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
plot(Weekly)

cor(Weekly[,-9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000
plot(Weekly$Volume)

From the summary statistics, Lag1 through Lag5 have nearly identical marginal distributions, which makes sense because they are lagged copies of the same weekly return series. From the scatterplot matrix there appears to be a strong relationship between Year and Volume, which the correlation matrix confirms (r ≈ 0.84); none of the other pairs of variables shows a strong correlation. Finally, the index plot of Volume shows that trading volume increases over time.

13(b)

Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

#logistic regression model
glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = Weekly, family = binomial)

summary(glm.fits)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Lag2 is the only predictor that appears to be statistically significant (p ≈ 0.03 < 0.05) according to this model.

13(c)

Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

#prepare confusion matrix
glm.probs = predict(glm.fits, type = 'response')
glm.pred = rep("Down", 1089)
glm.pred[glm.probs > 0.5] = "Up"

writeLines("Confusion Matrix and Overall Fraction of Correct Predictions:")
## Confusion Matrix and Overall Fraction of Correct Predictions:
table(glm.pred, Weekly$Direction)
##         
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557
mean(glm.pred == Weekly$Direction)
## [1] 0.5610652

Overall fraction of correct predictions: 0.5610652, or about 56%.

When the market actually went up, the model predicted “Up” 557/(48+557) ≈ 92% of the time. However, when the market went down, it predicted “Down” only 54/(54+430) ≈ 11% of the time, so most of its mistakes are weeks that went down but were predicted “Up”.

13(d)

Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

#define training and testing data sets
train = (Weekly$Year < 2009)
Weekly.train = Weekly[train,]
Weekly.test = Weekly[!train,]

#logistic regression with `Lag2` as only predictor
glm.fitsTrained <- glm(Direction ~ Lag2,
                       data = Weekly.train, family = binomial)

#prepare confusion matrix
glm.probs = predict(glm.fitsTrained, Weekly.test, type = 'response')
glm.pred = rep("Down", 104)
glm.pred[glm.probs > 0.5] = "Up"

writeLines("Confusion Matrix and Overall Fraction of Correct Predictions:")
## Confusion Matrix and Overall Fraction of Correct Predictions:
table(glm.pred, Weekly.test$Direction)
##         
## glm.pred Down Up
##     Down    9  5
##     Up     34 56
mean(glm.pred == Weekly.test$Direction)
## [1] 0.625

13(e)

Repeat (d) using LDA.

#linear discriminant analysis model
lda.fit <- lda(Direction ~ Lag2, data = Weekly.train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = Weekly.train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162
#prepare confusion matrix
lda.pred <- predict(lda.fit, Weekly.test)
names(lda.pred)
## [1] "class"     "posterior" "x"
writeLines("\n\nConfusion Matrix and Overall Fraction of Correct Predictions:")
## 
## 
## Confusion Matrix and Overall Fraction of Correct Predictions:
table(lda.pred$class, Weekly.test$Direction)
##       
##        Down Up
##   Down    9  5
##   Up     34 56
mean(lda.pred$class == Weekly.test$Direction)
## [1] 0.625

13(f)

Repeat (d) using QDA.

#quadratic discriminant analysis model
qda.fit <- qda(Direction ~ Lag2, data = Weekly.train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = Weekly.train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
#prepare confusion matrix
qda.pred <- predict(qda.fit, Weekly.test)
names(qda.pred)
## [1] "class"     "posterior"
writeLines("\n\nConfusion Matrix and Overall Fraction of Correct Predictions:")
## 
## 
## Confusion Matrix and Overall Fraction of Correct Predictions:
table(qda.pred$class, Weekly.test$Direction)
##       
##        Down Up
##   Down    0  0
##   Up     43 61
mean(qda.pred$class == Weekly.test$Direction)
## [1] 0.5865385

13(g)

Repeat (d) using KNN with K = 1.

#KNN, K = 1
train.x = cbind(Weekly[train,]$Lag2)
test.x = cbind(Weekly[!train,]$Lag2)
train.Direction = Weekly[train,]$Direction
set.seed(291)
knn.pred <- knn(train.x, test.x, train.Direction, k = 1)

writeLines("\n\nConfusion Matrix and Overall Fraction of Correct Predictions:")
## 
## 
## Confusion Matrix and Overall Fraction of Correct Predictions:
table(knn.pred, Weekly.test$Direction)
##         
## knn.pred Down Up
##     Down   21 30
##     Up     22 31
mean(knn.pred == Weekly.test$Direction)
## [1] 0.5

13(h)

Repeat (d) using naive Bayes.

#naive Bayes model
nb.fit <- naiveBayes(Direction ~ Lag2, data = Weekly.train)
nb.fit
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Conditional probabilities:
##       Lag2
## Y             [,1]     [,2]
##   Down -0.03568254 2.199504
##   Up    0.26036581 2.317485
writeLines("\n\nConfusion Matrix and Overall Fraction of Correct Predictions:")
## 
## 
## Confusion Matrix and Overall Fraction of Correct Predictions:
nb.class <- predict(nb.fit, Weekly.test)
table(nb.class, Weekly.test$Direction)
##         
## nb.class Down Up
##     Down    0  0
##     Up     43 61
mean(nb.class == Weekly.test$Direction)
## [1] 0.5865385

13(i)

Which of these methods appears to provide the best results on this data?

Logistic regression and linear discriminant analysis give the best results here, each with an overall fraction of correct predictions of 0.625 (62.5%) on the held-out data.

13(j)

Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

Problem 14

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

data(Auto)

14(a)

Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

mpg01 <- c(Auto$mpg > median(Auto$mpg))
Auto1 <- data.frame(Auto, mpg01)

14(b)

Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

plot(Auto1)

boxplot(Auto1$mpg01 ~ Auto1$cylinders)

boxplot(Auto1$mpg01 ~ Auto1$displacement)

boxplot(Auto1$mpg01 ~ Auto1$horsepower)

boxplot(Auto1$mpg01 ~ Auto1$weight)

boxplot(Auto1$mpg01 ~ Auto1$acceleration)

boxplot(Auto1$mpg01 ~ Auto1$year)

boxplot(Auto1$mpg01 ~ Auto1$origin)

From the scatterplot matrix, it looks like displacement, horsepower, weight, and acceleration might be useful predictors for mpg01, because the TRUE and FALSE values for mpg01 are bunched together in distinct regions of those plots.

14(c)

Split the data into a training set and a test set.

#define training and testing data sets by randomly sampling 80% of Auto1
set.seed(191)
Atrain = sample(nrow(Auto1), 314)

Autotrain <- Auto1[Atrain,]
Autotest <- Auto1[!(as.numeric(rownames(Auto1)) %in% Atrain),]

14(d)

Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

#linear discriminant analysis model
lda.Auto <- lda(mpg01 ~ displacement + horsepower + weight + acceleration, data = Autotrain)
lda.Auto
## Call:
## lda(mpg01 ~ displacement + horsepower + weight + acceleration, 
##     data = Autotrain)
## 
## Prior probabilities of groups:
##     FALSE      TRUE 
## 0.5191083 0.4808917 
## 
## Group means:
##       displacement horsepower   weight acceleration
## FALSE     276.0307   132.0798 3637.454     14.53988
## TRUE      115.7185    78.5894 2328.781     16.49073
## 
## Coefficients of linear discriminants:
##                        LD1
## displacement -0.0078245278
## horsepower    0.0036968317
## weight       -0.0009948677
## acceleration -0.0094873543
#prepare confusion matrix
lda.pred <- predict(lda.Auto, Autotest)
names(lda.pred)
## [1] "class"     "posterior" "x"
writeLines("\n\nConfusion Matrix and Overall Fraction of Incorrect Predictions:")
## 
## 
## Confusion Matrix and Overall Fraction of Incorrect Predictions:
table(lda.pred$class, Autotest$mpg01)
##        
##         FALSE TRUE
##   FALSE    29    2
##   TRUE      4   48
1 - mean(lda.pred$class == Autotest$mpg01)
## [1] 0.07228916

The test error rate is about 7%.

14(e)

Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

#quadratic discriminant analysis model
qda.Auto <- qda(mpg01 ~ displacement + horsepower + weight + acceleration, data = Autotrain)
qda.Auto
## Call:
## qda(mpg01 ~ displacement + horsepower + weight + acceleration, 
##     data = Autotrain)
## 
## Prior probabilities of groups:
##     FALSE      TRUE 
## 0.5191083 0.4808917 
## 
## Group means:
##       displacement horsepower   weight acceleration
## FALSE     276.0307   132.0798 3637.454     14.53988
## TRUE      115.7185    78.5894 2328.781     16.49073
#prepare confusion matrix
qda.pred <- predict(qda.Auto, Autotest)
names(qda.pred)
## [1] "class"     "posterior"
writeLines("\n\nConfusion Matrix and Overall Fraction of Incorrect Predictions:")
## 
## 
## Confusion Matrix and Overall Fraction of Incorrect Predictions:
table(qda.pred$class, Autotest$mpg01)
##        
##         FALSE TRUE
##   FALSE    32    2
##   TRUE      1   48
1 - mean(qda.pred$class == Autotest$mpg01)
## [1] 0.03614458

The test error rate of this model is about 3.6%.

14(f)

Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

#logistic regression model
glm.Auto <- glm(mpg01 ~ displacement + horsepower + weight + acceleration, 
                data = Autotrain, family = binomial)
glm.Auto
## 
## Call:  glm(formula = mpg01 ~ displacement + horsepower + weight + acceleration, 
##     family = binomial, data = Autotrain)
## 
## Coefficients:
##  (Intercept)  displacement    horsepower        weight  acceleration  
##     12.96911      -0.01260      -0.05740      -0.00129      -0.11074  
## 
## Degrees of Freedom: 313 Total (i.e. Null);  309 Residual
## Null Deviance:       434.8 
## Residual Deviance: 172.2     AIC: 182.2
#prepare confusion matrix
glm.Aprobs = predict(glm.Auto, Autotest, type = 'response')
glm.Apred = rep("High", 83)
glm.Apred[glm.Aprobs > 0.5] = "Low"

#writeLines("Confusion Matrix and Overall Fraction of Correct Predictions:")
table(glm.Apred, Autotest$mpg01)
##          
## glm.Apred FALSE TRUE
##      High    32    5
##      Low      1   45
mean(glm.Apred == Autotest$mpg01)
## [1] 0
(5/83)*100
## [1] 6.024096

Note that the labels are reversed in the code above: rows labeled “Low” are actually the cars predicted to have high mpg (probability > 0.5), and vice versa. Because these character labels never match the logical mpg01 values, mean(glm.Apred == Autotest$mpg01) is 0 by construction. Counting the off-diagonal cells of the confusion matrix directly, the test error rate is (5 + 1)/83 ≈ 7.2%; the (5/83)*100 computation above counts only the 5 misclassified high-mpg cars and misses the 1 misclassified low-mpg car.

14(g)

Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

#naive Bayes model
nb.Auto <- naiveBayes(mpg01 ~ displacement + horsepower + weight + acceleration,
                      data = Autotrain)
nb.Auto
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##     FALSE      TRUE 
## 0.5191083 0.4808917 
## 
## Conditional probabilities:
##        displacement
## Y           [,1]     [,2]
##   FALSE 276.0307 92.00204
##   TRUE  115.7185 41.02814
## 
##        horsepower
## Y           [,1]     [,2]
##   FALSE 132.0798 39.10574
##   TRUE   78.5894 15.60482
## 
##        weight
## Y           [,1]     [,2]
##   FALSE 3637.454 699.0066
##   TRUE  2328.781 410.3934
## 
##        acceleration
## Y           [,1]     [,2]
##   FALSE 14.53988 2.809981
##   TRUE  16.49073 2.524001
writeLines("\n\nConfusion Matrix and Overall Fraction of Incorrect Predictions:")
## 
## 
## Confusion Matrix and Overall Fraction of Incorrect Predictions:
nb.Aclass <- predict(nb.Auto, Autotest)
table(nb.Aclass, Autotest$mpg01)
##          
## nb.Aclass FALSE TRUE
##     FALSE    30    2
##     TRUE      3   48
1 - mean(nb.Aclass == Autotest$mpg01)
## [1] 0.06024096

The error rate is about 6%.

14(h)

Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

Problem 16

Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.

Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.

#prepare the data, create new variable
data(Boston)
high_crime <- c(Boston$crim > median(Boston$crim))
Boston$high_crime = high_crime 
#preliminary logistic regression
glm.Boston <- glm(high_crime ~ zn + indus + chas + nox + rm + age + dis + 
                  rad + tax + ptratio + black + lstat + medv,
                  data = Boston, family = binomial)
summary(glm.Boston)
## 
## Call:
## glm(formula = high_crime ~ zn + indus + chas + nox + rm + age + 
##     dis + rad + tax + ptratio + black + lstat + medv, family = binomial, 
##     data = Boston)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3946  -0.1585  -0.0004   0.0023   3.4239  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -34.103704   6.530014  -5.223 1.76e-07 ***
## zn           -0.079918   0.033731  -2.369  0.01782 *  
## indus        -0.059389   0.043722  -1.358  0.17436    
## chas          0.785327   0.728930   1.077  0.28132    
## nox          48.523782   7.396497   6.560 5.37e-11 ***
## rm           -0.425596   0.701104  -0.607  0.54383    
## age           0.022172   0.012221   1.814  0.06963 .  
## dis           0.691400   0.218308   3.167  0.00154 ** 
## rad           0.656465   0.152452   4.306 1.66e-05 ***
## tax          -0.006412   0.002689  -2.385  0.01709 *  
## ptratio       0.368716   0.122136   3.019  0.00254 ** 
## black        -0.013524   0.006536  -2.069  0.03853 *  
## lstat         0.043862   0.048981   0.895  0.37052    
## medv          0.167130   0.066940   2.497  0.01254 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 701.46  on 505  degrees of freedom
## Residual deviance: 211.93  on 492  degrees of freedom
## AIC: 239.93
## 
## Number of Fisher Scoring iterations: 9
#logistic regression with high p-value predictors removed
glm.Boston <- glm(high_crime ~ zn + nox + age + dis + rad + tax + ptratio + black + medv,
                  data = Boston, family = binomial)
summary(glm.Boston)
## 
## Call:
## glm(formula = high_crime ~ zn + nox + age + dis + rad + tax + 
##     ptratio + black + medv, family = binomial, data = Boston)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4197  -0.1840  -0.0004   0.0022   3.4087  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -31.441272   6.048989  -5.198 2.02e-07 ***
## zn           -0.082567   0.031424  -2.628  0.00860 ** 
## nox          43.195824   6.452812   6.694 2.17e-11 ***
## age           0.022851   0.009894   2.310  0.02091 *  
## dis           0.634380   0.207634   3.055  0.00225 ** 
## rad           0.718773   0.142066   5.059 4.21e-07 ***
## tax          -0.007676   0.002503  -3.066  0.00217 ** 
## ptratio       0.303502   0.109255   2.778  0.00547 ** 
## black        -0.012866   0.006334  -2.031  0.04224 *  
## medv          0.112882   0.034362   3.285  0.00102 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 701.46  on 505  degrees of freedom
## Residual deviance: 216.22  on 496  degrees of freedom
## AIC: 236.22
## 
## Number of Fisher Scoring iterations: 9
#linear discriminant analysis
lda.Boston <- lda(high_crime ~ zn + nox + age + dis + rad + tax + ptratio + black + medv,
                  data = Boston)
#naive Bayes model
nb.Boston <- naiveBayes(high_crime ~ zn + nox + age + dis + rad + tax + ptratio + black + medv,
                        data = Boston)
nb.Boston
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
## FALSE  TRUE 
##   0.5   0.5 
## 
## Conditional probabilities:
##        zn
## Y            [,1]      [,2]
##   FALSE 21.525692 29.319808
##   TRUE   1.201581  4.798611
## 
##        nox
## Y            [,1]       [,2]
##   FALSE 0.4709711 0.05559789
##   TRUE  0.6384190 0.09870365
## 
##        age
## Y           [,1]     [,2]
##   FALSE 51.31028 25.88190
##   TRUE  85.83953 17.87423
## 
##        dis
## Y           [,1]     [,2]
##   FALSE 5.091596 2.081304
##   TRUE  2.498489 1.085521
## 
##        rad
## Y            [,1]     [,2]
##   FALSE  4.158103 1.659121
##   TRUE  14.940711 9.529843
## 
##        tax
## Y           [,1]     [,2]
##   FALSE 305.7431  87.4837
##   TRUE  510.7312 167.8553
## 
##        ptratio
## Y           [,1]     [,2]
##   FALSE 17.90711 1.811216
##   TRUE  19.00395 2.346947
## 
##        black
## Y           [,1]      [,2]
##   FALSE 388.7061  22.83774
##   TRUE  324.6420 118.83084
## 
##        medv
## Y           [,1]      [,2]
##   FALSE 24.94941  7.232047
##   TRUE  20.11621 10.270362
---
title: "STA 4143 Assignment #3"
author: "Jordan Miller"
date: "`r Sys.Date()`"
output: openintro::lab_report
---

```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
library(ISLR2)
library(stats)
library(MASS)
library(e1071)
library(class)
library(dplyr)
```

# Problem 13

**This question should be answered using the `Weekly` data set, which is part of the `ISLR2` package. This data is similar in nature to the `Smarket` data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.**

```{r load-data}
#load the data
data(Weekly)
```

## 13(a)

**Produce some numerical and graphical summaries of the `Weekly` data. Do there appear to be any patterns?**

```{r 13a}
summary(Weekly)
plot(Weekly)
cor(Weekly[,-9])
plot(Weekly$Volume)
```

From the summary statistics, `Lag1` through `Lag5` have nearly identical marginal distributions, which makes sense because they are lagged copies of the same weekly return series. From the scatterplot matrix there appears to be a strong relationship between `Year` and `Volume`, which the correlation matrix confirms (r ≈ 0.84); none of the other pairs of variables shows a strong correlation. Finally, the index plot of `Volume` shows that trading volume increases over time.
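
One way to see the time trend directly is to plot `Volume` against `Year` (a small sketch; the log scale is only a readability choice, not part of the assignment):

```{r 13a-volume-trend}
# weekly trading volume over time; log scale spreads out the early, low-volume years
plot(Weekly$Year, log(Weekly$Volume),
     xlab = "Year", ylab = "log(Volume)")
```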


## 13(b) 

**Use the full data set to perform a logistic regression with `Direction` as the response and the five lag variables plus `Volume` as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?**

```{r 13b}
#logistic regression model
glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = Weekly, family = binomial)

summary(glm.fits)
```
`Lag2` is the only predictor that appears to be statistically significant (p ≈ 0.03 < 0.05) according to this model.
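
The p-values can also be pulled straight from the model summary rather than read off the printed table (a minimal sketch):

```{r 13b-pvalues}
# coefficient p-values from the fitted logistic regression
summary(glm.fits)$coefficients[, "Pr(>|z|)"]
```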


## 13(c)

**Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.**

```{r 13c}
#prepare confusion matrix
glm.probs = predict(glm.fits, type = 'response')
glm.pred = rep("Down", 1089)
glm.pred[glm.probs > 0.5] = "Up"

writeLines("Confusion Matrix and Overall Fraction of Correct Predictions:")
table(glm.pred, Weekly$Direction)
mean(glm.pred == Weekly$Direction)
```
Overall fraction of correct predictions: 0.5610652, or about 56%.

When the market actually went up, the model predicted "Up" `557/(48+557)` ≈ 92% of the time. However, when the market went down, it predicted "Down" only `54/(54+430)` ≈ 11% of the time, so most of its mistakes are weeks that went down but were predicted "Up".
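
These two rates can be computed straight from the confusion matrix (a small sketch; `conf` is just a temporary name used here):

```{r 13c-class-rates}
# class-specific accuracy from the confusion matrix
conf <- table(glm.pred, Weekly$Direction)
conf["Up", "Up"] / sum(conf[, "Up"])       # actual Up weeks predicted Up
conf["Down", "Down"] / sum(conf[, "Down"]) # actual Down weeks predicted Down
```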


## 13(d)

**Now fit the logistic regression model using a training data period from 1990 to 2008, with `Lag2` as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).**

```{r 13d}
#define training and testing data sets
train = (Weekly$Year < 2009)
Weekly.train = Weekly[train,]
Weekly.test = Weekly[!train,]

#logistic regression with `Lag2` as only predictor
glm.fitsTrained <- glm(Direction ~ Lag2,
                       data = Weekly.train, family = binomial)

#prepare confusion matrix
glm.probs = predict(glm.fitsTrained, Weekly.test, type = 'response')
glm.pred = rep("Down", 104)
glm.pred[glm.probs > 0.5] = "Up"

writeLines("Confusion Matrix and Overall Fraction of Correct Predictions:")
table(glm.pred, Weekly.test$Direction)
mean(glm.pred == Weekly.test$Direction)
```


## 13(e)

**Repeat (d) using LDA.**

```{r 13e}
#linear discriminant analysis model
lda.fit <- lda(Direction ~ Lag2, data = Weekly.train)
lda.fit

#prepare confusion matrix
lda.pred <- predict(lda.fit, Weekly.test)
names(lda.pred)

writeLines("\n\nConfusion Matrix and Overall Fraction of Correct Predictions:")
table(lda.pred$class, Weekly.test$Direction)
mean(lda.pred$class == Weekly.test$Direction)
```


## 13(f)

**Repeat (d) using QDA.**

```{r 13f}
#quadratic discriminant analysis model
qda.fit <- qda(Direction ~ Lag2, data = Weekly.train)
qda.fit

#prepare confusion matrix
qda.pred <- predict(qda.fit, Weekly.test)
names(qda.pred)

writeLines("\n\nConfusion Matrix and Overall Fraction of Correct Predictions:")
table(qda.pred$class, Weekly.test$Direction)
mean(qda.pred$class == Weekly.test$Direction)
```


## 13(g)

**Repeat (d) using KNN with _K_ = 1.**

```{r 13g}
#KNN, K = 1
train.x = cbind(Weekly[train,]$Lag2)
test.x = cbind(Weekly[!train,]$Lag2)
train.Direction = Weekly[train,]$Direction
set.seed(291)
knn.pred <- knn(train.x, test.x, train.Direction, k = 1)

writeLines("\n\nConfusion Matrix and Overall Fraction of Correct Predictions:")
table(knn.pred, Weekly.test$Direction)
mean(knn.pred == Weekly.test$Direction)
```



## 13(h)

**Repeat (d) using naive Bayes.**

```{r 13h}
#naive Bayes model
nb.fit <- naiveBayes(Direction ~ Lag2, data = Weekly.train)
nb.fit

writeLines("\n\nConfusion Matrix and Overall Fraction of Correct Predictions:")
nb.class <- predict(nb.fit, Weekly.test)
table(nb.class, Weekly.test$Direction)
mean(nb.class == Weekly.test$Direction)
```


## 13(i)

**Which of these methods appears to provide the best results on this data?**

Logistic regression and linear discriminant analysis give the best results here, each with an overall fraction of correct predictions of 0.625 (62.5%) on the held-out data.


## 13(j)

**Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for _K_ in the KNN classifier.**

```{r 13j}
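# A sketch of one possible experiment (names introduced here: glm.fitInt,
# glm.probsInt, glm.predInt, knn.k); train.x, test.x, and train.Direction
# come from the 13(g) chunk, and Weekly.train/Weekly.test from 13(d).

# logistic regression with an interaction between Lag1 and Lag2
glm.fitInt <- glm(Direction ~ Lag2 + Lag1:Lag2,
                  data = Weekly.train, family = binomial)
glm.probsInt <- predict(glm.fitInt, Weekly.test, type = "response")
glm.predInt <- ifelse(glm.probsInt > 0.5, "Up", "Down")
table(glm.predInt, Weekly.test$Direction)
mean(glm.predInt == Weekly.test$Direction)

# KNN on Lag2 for several values of K
set.seed(291)
for (k in c(1, 5, 10, 25)) {
  knn.k <- knn(train.x, test.x, train.Direction, k = k)
  cat("K =", k, " accuracy =", mean(knn.k == Weekly.test$Direction), "\n")
}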

```



# Problem 14

**In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the `Auto` data set.**

```{r 14}
data(Auto)
```


## 14(a)

**Create a binary variable, `mpg01`, that contains a 1 if `mpg` contains a value above its median, and a 0 if `mpg` contains a value below its median. You can compute the median using the `median()` function. Note you may find it helpful to use the `data.frame()` function to create a single data set containing both `mpg01` and the other `Auto` variables.**

```{r 14a}
mpg01 <- c(Auto$mpg > median(Auto$mpg))
Auto1 <- data.frame(Auto, mpg01)
```


## 14(b)

**Explore the data graphically in order to investigate the association between `mpg01` and the other features. Which of the other features seem most likely to be useful in predicting `mpg01`? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.**

```{r 14b}
plot(Auto1)
boxplot(Auto1$mpg01 ~ Auto1$cylinders)
boxplot(Auto1$mpg01 ~ Auto1$displacement)
boxplot(Auto1$mpg01 ~ Auto1$horsepower)
boxplot(Auto1$mpg01 ~ Auto1$weight)
boxplot(Auto1$mpg01 ~ Auto1$acceleration)
boxplot(Auto1$mpg01 ~ Auto1$year)
boxplot(Auto1$mpg01 ~ Auto1$origin)
```
From the scatterplot matrix, it looks like `displacement`, `horsepower`, `weight`, and `acceleration` might be useful predictors for `mpg01`, because the `TRUE` and `FALSE` values for `mpg01` are bunched together in distinct regions of those plots.
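
The boxplots above put the binary `mpg01` on the left-hand side of the formula; flipping the formula so that each candidate predictor is summarized within the two `mpg01` groups may be easier to read (a small sketch, limited to the four variables singled out above):

```{r 14b-group-boxplots}
# each candidate predictor summarized within the two mpg01 groups
boxplot(displacement ~ mpg01, data = Auto1)
boxplot(horsepower   ~ mpg01, data = Auto1)
boxplot(weight       ~ mpg01, data = Auto1)
boxplot(acceleration ~ mpg01, data = Auto1)
```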


## 14(c)

**Split the data into a training set and a test set.**

```{r 14c}
#define training and testing data sets by randomly sampling 80% of Auto1
set.seed(191)
Atrain = sample(nrow(Auto1), 314)

Autotrain <- Auto1[Atrain,]
Autotest <- Auto1[!(as.numeric(rownames(Auto1)) %in% Atrain),]
```
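
Note that `sample(nrow(Auto1), 314)` returns positional indices while `rownames(Auto1)` keeps the original `Auto` row names, so the complement taken above is only approximate (the held-out set has 83 rows rather than 392 − 314 = 78). A minimal sketch of the exact positional complement, shown for reference only so the results below stay tied to the split actually used:

```{r 14c-alt-split}
# exact positional complement of the training indices (reference only;
# Autotest_alt is a new name introduced here)
Autotest_alt <- Auto1[-Atrain, ]
nrow(Autotest_alt)
```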


## 14(d)

**Perform LDA on the training data in order to predict `mpg01` using the variables that seemed most associated with `mpg01` in (b). What is the test error of the model obtained?**

```{r 14d}
#linear discriminant analysis model
lda.Auto <- lda(mpg01 ~ displacement + horsepower + weight + acceleration, data = Autotrain)
lda.Auto

#prepare confusion matrix
lda.pred <- predict(lda.Auto, Autotest)
names(lda.pred)

writeLines("\n\nConfusion Matrix and Overall Fraction of Incorrect Predictions:")
table(lda.pred$class, Autotest$mpg01)
1 - mean(lda.pred$class == Autotest$mpg01)
```
The test error rate is about 7%.


## 14(e)

**Perform QDA on the training data in order to predict `mpg01` using the variables that seemed most associated with `mpg01` in (b). What is the test error of the model obtained?**

```{r 14e}
#quadratic discriminant analysis model
qda.Auto <- qda(mpg01 ~ displacement + horsepower + weight + acceleration, data = Autotrain)
qda.Auto

#prepare confusion matrix
qda.pred <- predict(qda.Auto, Autotest)
names(qda.pred)

writeLines("\n\nConfusion Matrix and Overall Fraction of Incorrect Predictions:")
table(qda.pred$class, Autotest$mpg01)
1 - mean(qda.pred$class == Autotest$mpg01)
```
The test error rate of this model is about 3.6%.


## 14(f)

**Perform logistic regression on the training data in order to predict `mpg01` using the variables that seemed most associated with `mpg01` in (b). What is the test error of the model obtained?**

```{r 14f}
#logistic regression model
glm.Auto <- glm(mpg01 ~ displacement + horsepower + weight + acceleration, 
                data = Autotrain, family = binomial)
glm.Auto

#prepare confusion matrix
glm.Aprobs = predict(glm.Auto, Autotest, type = 'response')
glm.Apred = rep("High", 83)
glm.Apred[glm.Aprobs > 0.5] = "Low"

#writeLines("Confusion Matrix and Overall Fraction of Correct Predictions:")
table(glm.Apred, Autotest$mpg01)
mean(glm.Apred == Autotest$mpg01)
(5/83)*100
```
Note that the labels are reversed in the code above: rows labeled "Low" are actually the cars predicted to have high mpg (probability > 0.5), and vice versa. Because these character labels never match the logical `mpg01` values, `mean(glm.Apred == Autotest$mpg01)` is 0 by construction. Counting the off-diagonal cells of the confusion matrix directly, the test error rate is (5 + 1)/83 ≈ 7.2%; the `(5/83)*100` computation counts only the 5 misclassified high-mpg cars and misses the 1 misclassified low-mpg car.
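
A small sketch of a more direct computation, thresholding the fitted probabilities at 0.5 so the predictions are logicals that line up with `mpg01` (`glm.Apred.logical` is a new name introduced here):

```{r 14f-test-error}
# predicted class as a logical matching mpg01, then the test error rate
glm.Apred.logical <- glm.Aprobs > 0.5
mean(glm.Apred.logical != Autotest$mpg01)
```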


## 14(g)

**Perform naive Bayes on the training data in order to predict `mpg01` using the variables that seemed most associated with `mpg01` in (b). What is the test error of the model obtained?**

```{r 14g}
#naive Bayes model
nb.Auto <- naiveBayes(mpg01 ~ displacement + horsepower + weight + acceleration,
                      data = Autotrain)
nb.Auto

writeLines("\n\nConfusion Matrix and Overall Fraction of Incorrect Predictions:")
nb.Aclass <- predict(nb.Auto, Autotest)
table(nb.Aclass, Autotest$mpg01)
1 - mean(nb.Aclass == Autotest$mpg01)
```
The error rate is about 6%.


## 14(h)

**Perform KNN on the training data, with several values of _K_, in order to predict `mpg01`. Use only the variables that seemed most associated with `mpg01` in (b). What test errors do you obtain? Which value of _K_ seems to perform the best on this data set?**

```{r 14h}
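# A sketch (several values of K, standardized predictors). New names introduced
# here: vars, knn.train, knn.test, knn.k; Autotrain and Autotest come from 14(c).
vars <- c("displacement", "horsepower", "weight", "acceleration")
knn.train <- scale(Autotrain[, vars])
knn.test  <- scale(Autotest[, vars],
                   center = attr(knn.train, "scaled:center"),
                   scale  = attr(knn.train, "scaled:scale"))
set.seed(191)
for (k in c(1, 3, 5, 10, 20)) {
  knn.k <- knn(knn.train, knn.test, Autotrain$mpg01, k = k)
  cat("K =", k, " test error =", mean(knn.k != Autotest$mpg01), "\n")
}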

```



# Problem 16

**Using the `Boston` data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.**

_Hint: You will have to create the response variable yourself, using the variables that are contained in the `Boston` data set._

```{r data-prep}
#prepare the data, create new variable
data(Boston)
high_crime <- c(Boston$crim > median(Boston$crim))
Boston$high_crime = high_crime 
```

```{r 16-LGR}
#preliminary logistic regression
glm.Boston <- glm(high_crime ~ zn + indus + chas + nox + rm + age + dis + 
                  rad + tax + ptratio + black + lstat + medv,
                  data = Boston, family = binomial)
summary(glm.Boston)
```
```{r LGR-signif}
#logistic regression with high p-value predictors removed
glm.Boston <- glm(high_crime ~ zn + nox + age + dis + rad + tax + ptratio + black + medv,
                  data = Boston, family = binomial)
summary(glm.Boston)
```


```{r 16-LDA}
#linear discriminant analysis
lda.Boston <- lda(high_crime ~ zn + nox + age + dis + rad + tax + ptratio + black + medv,
                  data = Boston)
```
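
The LDA model above is fit but never evaluated; a minimal sketch of an in-sample confusion matrix (calling `predict()` with no new data returns the fitted classifications; `lda.Boston.pred` is a new name introduced here):

```{r 16-LDA-eval}
# in-sample confusion matrix and error rate for the LDA model
lda.Boston.pred <- predict(lda.Boston)
table(lda.Boston.pred$class, Boston$high_crime)
mean(lda.Boston.pred$class != Boston$high_crime)
```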

```{r 16-NB}
#naive Bayes model
nb.Boston <- naiveBayes(high_crime ~ zn + nox + age + dis + rad + tax + ptratio + black + medv,
                        data = Boston)
nb.Boston
```
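
Likewise, a minimal in-sample check of the naive Bayes model (`nb.Boston.class` is a new name introduced here):

```{r 16-NB-eval}
# in-sample confusion matrix and error rate for the naive Bayes model
nb.Boston.class <- predict(nb.Boston, Boston)
table(nb.Boston.class, Boston$high_crime)
mean(nb.Boston.class != Boston$high_crime)
```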

```{r 16-KNN}
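# A sketch: KNN on the same predictors used above, fit and scored on the full
# Boston data since no train/test split is defined in this problem, so the
# error rate is in-sample and optimistic. New names introduced here: Bvars,
# B.x, knn.Boston.
Bvars <- c("zn", "nox", "age", "dis", "rad", "tax", "ptratio", "black", "medv")
B.x <- scale(Boston[, Bvars])
set.seed(191)
knn.Boston <- knn(B.x, B.x, Boston$high_crime, k = 5)
table(knn.Boston, Boston$high_crime)
mean(knn.Boston != Boston$high_crime)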

```

