---
title: "Assignment_3"
author: "Monique Villarreal"
date: "2022-09-24"
output: html_document
---

Question 13. Using the Weekly data set.

a) Produce some numerical and graphical summaries of the Weekly data. Are there any patterns?

In reviewing the output, the correlations between the lag variables and today's return are all close to zero, i.e. there is little relationship between this week's return and the returns of previous weeks. The one substantial correlation is between Year and Volume (about 0.84); the scatter-plot matrix shows trading volume increasing over time, and a direct plot of that trend is sketched after the summaries below.

library(ISLR2)
dim(Weekly)
## [1] 1089    9
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
pairs(Weekly)
cor(Weekly[,-9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000
attach(Weekly)
plot(Weekly)
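
The pairs() and plot() calls above draw the full scatter-plot matrix. To isolate the Year-Volume relationship noted above, a direct plot (a sketch) is:

plot(Year, Volume, main = "Weekly trading volume by year")   # volume rises steadily through the sample period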

b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant?

In reviewing the output, Lag2 appears to be the only predictor with a statistically significant p-value (about 0.03).

glm.wfit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Weekly, family=binomial)
summary(glm.wfit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4
summary(glm.wfit)$coef[,4]
## (Intercept)        Lag1        Lag2        Lag3        Lag4        Lag5 
## 0.001898848 0.118144368 0.029601361 0.546923890 0.293653342 0.583348244 
##      Volume 
## 0.537674762

c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

Overall the model correctly predicts the market's direction 56.1% of the time. The confusion matrix shows the errors are one-sided: the model predicts "Up" almost every week, so it identifies 557 of the 605 Up weeks but misclassifies 430 of the 484 Down weeks as Up.
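
The prediction step that produces the table below is not shown; a minimal sketch that reproduces it, assuming the standard 0.5 cutoff on the fitted probabilities:

glm.probs = predict(glm.wfit, type = "response")   # fitted P(market moves Up)
glm.pred = rep("Down", length(glm.probs))          # start every week at "Down"
glm.pred[glm.probs > 0.5] = "Up"                   # switch to "Up" when P > 0.5
table(glm.pred, Direction)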

##         Direction
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557
mean(glm.pred==Direction)
## [1] 0.5610652

d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held-out data (from 2009 and 2010).

The model fit on 1990-2008 with Lag2 alone has a 62.5% accuracy rate on the 2009-2010 data.

train=(Year<2009)
Weekly2.0.2009=Weekly[!train, ]
Direction2.0.2009=Direction[!train]
Weekly.fit=glm(Direction~Lag2,data=Weekly,family=binomial,subset=train)
Weekly.prob=predict(Weekly.fit,Weekly2.0.2009,type="response")
Weekly.pred=rep("Down",length(Weekly.prob))
Weekly.pred[Weekly.prob>0.5]="Up"
table(Weekly.pred,Direction2.0.2009)
##            Direction2.0.2009
## Weekly.pred Down Up
##        Down    9  5
##        Up     34 56
mean(Weekly.pred==Direction2.0.2009)
## [1] 0.625

e) Repeat (d) using LDA.

LDA also has a 62.5% accuracy rate, with the same confusion matrix as the logistic regression.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
## 
##     Boston
lda.fit = lda(Direction~Lag2, data = Weekly, subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162
lda.pred = predict(lda.fit, Weekly2.0.2009)
names(lda.pred)
## [1] "class"     "posterior" "x"
lda.class = lda.pred$class
table(lda.class, Direction2.0.2009)
##          Direction2.0.2009
## lda.class Down Up
##      Down    9  5
##      Up     34 56
mean(lda.class==Direction2.0.2009)
## [1] 0.625

f) Repeat (d) using QDA.

QDA has a 58.7% accuracy rate, but only because it predicts "Up" for every week in the test period.

qda.fit = qda(Direction~Lag2, data = Weekly, subset = train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
qda.class = predict(qda.fit, Weekly2.0.2009)$class
table(qda.class, Direction2.0.2009)
##          Direction2.0.2009
## qda.class Down Up
##      Down    0  0
##      Up     43 61
mean(qda.class==Direction2.0.2009)
## [1] 0.5865385

g) Repeat (d) using KNN with K = 1.

KNN with K = 1 has a 50% accuracy rate, no better than random guessing.

library(class)
train.X = as.matrix(Lag2[train])
test.X = as.matrix(Lag2[!train])
train.direction = Direction[train]
set.seed(1)
KNN.pred = knn(train.X, test.X, train.direction, k = 1)
table(KNN.pred, Direction2.0.2009)
##         Direction2.0.2009
## KNN.pred Down Up
##     Down   21 30
##     Up     22 31
(21+31)/104
## [1] 0.5

h) Repeat (d) using naive Bayes.

Naive Bayes has a 58.7% accuracy rate and, like QDA, predicts "Up" for every test week.

library(e1071)
NB.fit = naiveBayes(Direction~Lag2, data = Weekly, subset = train)
NB.fit
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Conditional probabilities:
##       Lag2
## Y             [,1]     [,2]
##   Down -0.03568254 2.199504
##   Up    0.26036581 2.317485
NB.class = predict(NB.fit, Weekly2.0.2009)
table(NB.class, Direction2.0.2009)
##         Direction2.0.2009
## NB.class Down Up
##     Down    0  0
##     Up     43 61
mean(NB.class==Direction2.0.2009)
## [1] 0.5865385

i) Which provides the best results on the data?

Logistic regression and LDA tie for the best held-out accuracy at 62.5%.

j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to have the best results on the held out data. Note that you should also experiment with the values for K in the KNN classifier.

Adding a Lag2:Lag3 interaction term leaves the logistic regression and LDA held-out accuracy unchanged at 62.5%. Increasing K from 1 to 7 improves the KNN accuracy from 50% to about 53.8% (a sketch for scanning a wider range of K values follows the code below).

Weekly.fit = glm(Direction~Lag2:Lag3 + Lag2, family = binomial, data = Weekly, subset = train)
Weekly.prob = predict(Weekly.fit, Weekly2.0.2009, type = "response")
Weekly.pred = rep("Down", length(Weekly.prob))
Weekly.pred[Weekly.prob > 0.5] = "Up"
Direction2.0.2009 = Direction[!train]
table(Weekly.pred, Direction2.0.2009)
##            Direction2.0.2009
## Weekly.pred Down Up
##        Down    9  5
##        Up     34 56
mean(Weekly.pred==Direction2.0.2009)
## [1] 0.625
lda.fit = lda(Direction~Lag2:Lag3 + Lag2, data = Weekly, subset = train)
lda.pred = predict(lda.fit, Weekly2.0.2009)
lda.class = lda.pred$class
table(lda.class, Direction2.0.2009)
##          Direction2.0.2009
## lda.class Down Up
##      Down    9  5
##      Up     34 56
mean(lda.class == Direction2.0.2009)
## [1] 0.625
train.X = as.matrix(Lag2[train])
test.X = as.matrix(Lag2[!train])
train.direction = Direction[train]
set.seed(1)
KNN.pred = knn(train.X, test.X, train.direction, k = 7)
table(KNN.pred, Direction2.0.2009)
##         Direction2.0.2009
## KNN.pred Down Up
##     Down   15 20
##     Up     28 41
mean(KNN.pred == Direction2.0.2009)
## [1] 0.5384615
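
Only K = 7 is shown above; a minimal sketch for scanning a wider, arbitrary grid of K values with the same training and test matrices:

set.seed(1)
for (k in c(1, 3, 5, 7, 15, 25)) {
  pred = knn(train.X, test.X, train.direction, k = k)   # refit KNN for each value of K
  cat("K =", k, "accuracy =", round(mean(pred == Direction2.0.2009), 3), "\n")
}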

Question 14. In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

library(ISLR2)
attach(Auto)

a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median.

mpg01 = rep(0, length(mpg))
mpg01[mpg>median(mpg)] = 1
Auto = data.frame(Auto, mpg01)

b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatter plots and box plots may be useful.

The scatter-plot matrix shows that mpg01 is most clearly associated with cylinders, displacement, horsepower, and weight; box plots of these variables by mpg01 (sketched below) make the separation clearer.

plot(Auto[,-9])   # scatter-plot matrix, excluding the name column
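
A few box plots of the candidate predictors by mpg01; a sketch, assuming the Auto columns attached above:

par(mfrow = c(2, 2))
boxplot(cylinders ~ mpg01, data = Auto, main = "cylinders by mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "displacement by mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "horsepower by mpg01")
boxplot(weight ~ mpg01, data = Auto, main = "weight by mpg01")
par(mfrow = c(1, 1))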

c) Split the data into a training set and a test set

Auto = data.frame(mpg01, apply(cbind(cylinders, weight, displacement, horsepower, acceleration), 2, scale), year)
train.A = (year %% 2 ==0)   # even model years form the training set; odd years are held out
test.A = !train.A
Auto.train = Auto[train.A,]
Auto.test = Auto[test.A,]
mpg01.test = mpg01[test.A]

d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in part (b). What is the test error of the model obtained?

Test error is 12.6%

lda.Afit = lda(mpg01~cylinders + displacement + weight + horsepower, data = Auto, subset = train.A )
autolda.pred = predict(lda.Afit, Auto.test)
table(autolda.pred$class, mpg01.test)
##    mpg01.test
##      0  1
##   0 86  9
##   1 14 73
mean(autolda.pred$class != mpg01.test)
## [1] 0.1263736

e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in part (b). What is the test error of the model obtained?

Test error is 13.2%

qda.Afit = qda(mpg01~cylinders + displacement + weight + horsepower, data = Auto, subset = train.A)
qda.Aclass = predict(qda.Afit, Auto.test)$class
table(qda.Aclass, mpg01.test)
##           mpg01.test
## qda.Aclass  0  1
##          0 89 13
##          1 11 69
mean(qda.Aclass!=mpg01.test)
## [1] 0.1318681

f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

Test error is 12.6%

glm.Afit = glm(mpg01~cylinders + displacement + weight + horsepower, data = Auto, family = binomial, subset = train.A)   # logistic regression requires family = binomial
glm.autoprob = predict(glm.Afit, Auto.test, type = "response")
glm.autopred = rep(0, length(glm.autoprob))
glm.autopred[glm.autoprob>0.5] = 1
table(glm.autopred, mpg01.test)
##             mpg01.test
## glm.autopred  0  1
##            0 86  9
##            1 14 73
mean(glm.autopred!=mpg01.test)
## [1] 0.1263736
g) Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

Test error is 12.6%

nb.Afit = naiveBayes(mpg01~cylinders + displacement + weight + horsepower, data = Auto, subset = train.A)
nb.Aclass = predict(nb.Afit, Auto.test)
mean(nb.Aclass!= mpg01.test)
## [1] 0.1263736

h) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

train.auto = cbind(cylinders, displacement, weight, horsepower)[train.A,]
test.auto = cbind(cylinders, displacement, weight, horsepower)[!train.A,]
train.mpg01 = mpg01[train.A]
set.seed(1)
knn.autopred = knn(train.auto, test.auto, train.mpg01, k = 1)
mean(knn.autopred != mpg01.test)
## [1] 0.1538462
knn.autopred = knn(train.auto, test.auto, train.mpg01, k = 25)
mean(knn.autopred != mpg01.test)
## [1] 0.1428571
knn.autopred = knn(train.auto, test.auto, train.mpg01, k = 50)
mean(knn.autopred != mpg01.test)
## [1] 0.1428571
knn.autopred = knn(train.auto, test.auto, train.mpg01, k = 100)
mean(knn.autopred != mpg01.test)
## [1] 0.1428571

The test error is 15.4% with K = 1 and drops to about 14.3% for K = 25 and larger, so a K of roughly 25 or more seems to perform best on this data set.

Question 16. Using Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.

attach(Boston)
crime1 = rep(0, length(crim))
crime1[crim>median(crim)] = 1
Boston = data.frame(Boston, crime1)
train = 1:(dim(Boston)[1]/2)
test = (dim(Boston)[1]/2+1):dim(Boston)[1]
Boston.train = Boston[train,]
Boston.test = Boston[test,]
crime1.test = crime1[test]
plot(Boston)

The pairs plot suggests that the crime indicator is most clearly associated with nox, tax, dis, medv, and lstat.
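
The strength of these associations can also be checked numerically; a sketch using the crime1 column added above:

round(cor(Boston)[, "crime1"], 2)   # correlations of every variable with the binary crime indicator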

set.seed(1)
Boston.fit = glm(crime1~nox + tax + dis + medv + lstat, data = Boston.train, family = binomial)   # fit on the training half only
Boston.probs = predict(Boston.fit, Boston.test, type ="response")
Boston.pred = rep(0, length(Boston.probs))
Boston.pred[Boston.probs > 0.5] = 1
table(Boston.pred, crime1.test)
##            crime1.test
## Boston.pred   0   1
##           0  75  15
##           1  15 148
mean(Boston.pred == crime1.test)
## [1] 0.8814229

Logistic regression achieves an 88.1% accuracy rate on the test half using these predictors.

ldaBoston.fit = lda(crime1~nox+tax+dis+medv+lstat, data =  Boston.train)
ldaBoston.pred = predict(ldaBoston.fit, Boston.test)
table(ldaBoston.pred$class, crime1.test)
##    crime1.test
##       0   1
##   0  80  16
##   1  10 147
mean(ldaBoston.pred$class == crime1.test)
## [1] 0.8972332

The LDA shows an 89.7% accuracy rate.

nbBoston.fit = naiveBayes(crime1~nox+tax+dis+lstat, data = Boston, subset = train)
nbBoston.class = predict(nbBoston.fit, Boston.test)
table(nbBoston.class, crime1.test)
##               crime1.test
## nbBoston.class   0   1
##              0  75  18
##              1  15 145
mean(nbBoston.class == crime1.test)
## [1] 0.8695652

Naive Bayes results in an 87% accuracy rate.

train.B = cbind(nox,tax,dis,lstat)[train,]
test.B = cbind(nox,tax,dis,lstat)[test,]
train.crime = crime1[train]   # class labels must come from the training rows
set.seed(1)
Bostonknn.pred = knn(train.B, test.B, train.crime, k=1)
table(Bostonknn.pred, crime1.test)
##               crime1.test
## Bostonknn.pred   0   1
##              0  33 143
##              1  57  20
mean(Bostonknn.pred == crime1.test)
## [1] 0.2094862
Bostonknn.pred = knn(train.B, test.B, train.crime, k=10)
table(Bostonknn.pred, crime1.test)
##               crime1.test
## Bostonknn.pred   0   1
##              0  43  20
##              1  47 143
mean(Bostonknn.pred==crime1.test)
## [1] 0.7351779
Bostonknn.pred = knn(train.B, test.B, train.crime, k=25)
table(Bostonknn.pred, crime1.test)
##               crime1.test
## Bostonknn.pred   0   1
##              0  38  14
##              1  52 149
mean(Bostonknn.pred==crime1.test)
## [1] 0.7391304

Increasing K from 1 to 25 produces a substantial improvement in the accuracy rate; increasing K further does not improve it.
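
KNN is sensitive to predictor scale, and nox, tax, dis, and lstat are measured in very different units, so standardizing them first is worth trying; a sketch using the same split (the choice K = 10 is arbitrary):

X.std = scale(Boston[, c("nox", "tax", "dis", "lstat")])   # center and scale each predictor
set.seed(1)
knn.std = knn(X.std[train, ], X.std[test, ], crime1[train], k = 10)
table(knn.std, crime1.test)
mean(knn.std == crime1.test)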