This exercise uses the Weekly data set, which is part of the ISLR package. The data is similar in nature to the Smarket data from this chapter's lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
week = Weekly
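For reference, the code in this section leans on several packages; a sketch of the assumed setup (they are presumably attached earlier in the document):
library(ISLR)     # Weekly and Auto data sets
library(MASS)     # lda(), qda(), and the Boston data
library(class)    # knn()
library(dplyr)    # group_by(), summarize()
library(ggplot2)  # plots
library(corrplot) # correlation matrices
library(caTools)  # sample.split()
# caret and scales are called with :: below and only need to be installed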
summary(week)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
The variables are:
Year: The year that the observation was recorded
Lag1: Percentage return for previous week
Lag2: Percentage return for 2 weeks previous
Lag3: Percentage return for 3 weeks previous
Lag4: Percentage return for 4 weeks previous
Lag5: Percentage return for 5 weeks previous
Volume: Volume of shares traded (average number of daily shares traded in billions)
Today: Percentage return for this week
Direction: A factor with levels Down and Up indicating whether the market had a positive or negative return on a given week
Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
plot(week)
corrplot(cor(week[,-9]), method = "number")
I want to plot a more granular breakdown of Volume over time, so I will need to create a Week variable within this dataset.
week$Week = 1:nrow(week)
yrs = week %>%
group_by(Year) %>%
summarize(Week = min(Week)) %>%
ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(week, aes(Week, Volume)) +
geom_line() +
geom_smooth(se = FALSE) +
scale_x_continuous(breaks = yrs$Week ,minor_breaks = NULL ,labels = yrs$Year) +
labs(title = "Average Weekly Shares Traded vs Time",
x = "Year") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
There is a clear relationship between Year and Volume: the volume of shares traded increases steadily through the 1990s and 2000s and begins to decline around 2009-2010.
Let's see a breakdown of the variable Direction over time.
ggplot(week, aes(Year, fill = Direction)) +
geom_bar(position = "fill") +
geom_hline(yintercept = 0.5, col = "white") +
scale_x_continuous(breaks = seq(1990, 2010)) +
scale_y_continuous(labels = scales::percent_format()) +
labs(title = "Percentage of Direction vs Time",
x = "Year") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90))
The plot shows only 4 years in which fewer than 50% of the weeks saw a positive return. As the table below shows, we could get a classifier that does better than random chance simply by predicting the return will be positive each week. Let's also investigate the volatility of the market over time by looking at its weekly percentage change.
prop.table(table(week$Direction))
##
## Down Up
## 0.4444444 0.5555556
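As a sanity check, here is a sketch of the always-Up baseline this table implies:
# Accuracy of a naive classifier that predicts "Up" every week
naive.preds = rep("Up", nrow(week))
mean(naive.preds == week$Direction) # ~0.5556, matching the proportion above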
ggplot(week, aes(Week, Today/100)) +
geom_line() +
scale_x_continuous(breaks = yrs$Week, minor_breaks = NULL, labels = yrs$Year) +
scale_y_continuous(labels = scales::percent_format(), breaks = seq(-0.2, 0.2, 0.05)) +
geom_hline(yintercept = 0, col = "blue") +
theme_light() +
labs(title = "Weekly Percentage Return vs Time",
x = "Year",
y = "Percentage Return") +
theme(axis.text.x = element_text(angle = 90))
Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
# Note: `~ . - Today` also includes Year and the derived Week column, not just the five lags and Volume
glm.fit1 = glm(Direction~.-Today, week, family = binomial)
summary(glm.fit1)
##
## Call:
## glm(formula = Direction ~ . - Today, family = binomial, data = week)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7631 -1.2574 0.9678 1.0826 1.4948
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 708.898033 425.223143 1.667 0.0955 .
## Year -0.356148 0.213721 -1.666 0.0956 .
## Lag1 -0.041577 0.026488 -1.570 0.1165
## Lag2 0.059148 0.026986 2.192 0.0284 *
## Lag3 -0.015509 0.026774 -0.579 0.5624
## Lag4 -0.027782 0.026543 -1.047 0.2952
## Lag5 -0.014453 0.026472 -0.546 0.5851
## Volume -0.001679 0.068948 -0.024 0.9806
## Week 0.006698 0.004101 1.633 0.1024
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1483.5 on 1080 degrees of freedom
## AIC: 1501.5
##
## Number of Fisher Scoring iterations: 4
After running the logistic regression on the full data set, only Lag2, the percentage return from two weeks prior, appears statistically significant for the Direction of the weekly returns. It has the smallest p-value (0.028), and its positive coefficient suggests that a higher return two weeks ago raises the odds of an Up week.
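Since Lag2 is the only significant predictor, a quick sketch of its effect size as an odds ratio (using glm.fit1 fit above):
# Convert the Lag2 log-odds coefficient to an odds ratio
exp(coef(glm.fit1)["Lag2"]) # ~1.06: a one-point higher return two weeks ago multiplies the odds of "Up" by about 1.06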
Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
glm.probs = predict(glm.fit1, type = "response")
glm.preds = ifelse(glm.probs>=0.5, "Up", "Down")
caret::confusionMatrix(as.factor(glm.preds), week$Direction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 73 68
## Up 411 537
##
## Accuracy : 0.5601
## 95% CI : (0.5301, 0.5899)
## No Information Rate : 0.5556
## P-Value [Acc > NIR] : 0.3923
##
## Kappa : 0.0414
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.15083
## Specificity : 0.88760
## Pos Pred Value : 0.51773
## Neg Pred Value : 0.56646
## Prevalence : 0.44444
## Detection Rate : 0.06703
## Detection Prevalence : 0.12948
## Balanced Accuracy : 0.51921
##
## 'Positive' Class : Down
##
mean(glm.preds==week$Direction) # training accuracy
## [1] 0.5601469
mean(glm.preds!=week$Direction) # training error rate (no held-out data yet)
## [1] 0.4398531
We achieve an accuracy of 56.01%, only slightly above the baseline accuracy we calculated earlier (55.56%).
The confusion matrix shows that the model almost always predicts Up: it catches only 73 of the 484 Down weeks (a sensitivity of 15.1% with Down as the positive class) while getting 537 of the 605 Up weeks right. Its mistakes are therefore concentrated in false Up predictions, as the prediction breakdown below confirms.
prop.table(table(glm.preds))
## glm.preds
## Down Up
## 0.1294766 0.8705234
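Note that caret treats the first factor level, Down, as the positive class by default; to read sensitivity and specificity from the Up perspective instead, we can pass the positive argument:
# Same confusion matrix, with "Up" as the positive class
caret::confusionMatrix(as.factor(glm.preds), week$Direction, positive = "Up")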
Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
train = week$Year<=2008
x_test = week[!train,]
x_train = week[train,]
y_test = week$Direction[!train]
glm.fit2=glm(Direction~Lag2, x_train, family = binomial)
glm.probs=predict(glm.fit2, x_test, type="response")
glm.preds = ifelse(glm.probs >= 0.5, "Up", "Down")
caret::confusionMatrix(as.factor(glm.preds), y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.20930
## Specificity : 0.91803
## Pos Pred Value : 0.64286
## Neg Pred Value : 0.62222
## Prevalence : 0.41346
## Detection Rate : 0.08654
## Detection Prevalence : 0.13462
## Balanced Accuracy : 0.56367
##
## 'Positive' Class : Down
##
mean(glm.preds==y_test) # test accuracy
## [1] 0.625
mean(glm.preds!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.375
We achieve an accuracy of 62.50%.
With the data split by year and only one predictor, the model performs better on new data: the test error rate is 0.375 (37.5%), so the model correctly predicted the held-out data 62.5% of the time.
Repeat (d) using LDA
lda.fit = lda(Direction~Lag2, x_train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = x_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
lda.class = predict(lda.fit, x_test)$class
caret::confusionMatrix(lda.class, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.20930
## Specificity : 0.91803
## Pos Pred Value : 0.64286
## Neg Pred Value : 0.62222
## Prevalence : 0.41346
## Detection Rate : 0.08654
## Detection Prevalence : 0.13462
## Balanced Accuracy : 0.56367
##
## 'Positive' Class : Down
##
mean(lda.class==y_test) # test accuracy
## [1] 0.625
mean(lda.class!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.375
We achieve an accuracy of 62.50%.
In this LDA model, 44.8% of the training observations are weeks the market went down and 55.2% are weeks it went up. The group means show that the average Lag2 return is slightly negative for Down weeks and positive for Up weeks. The confusion matrix matches the results from (d) exactly: 62.5% test accuracy, or a 37.5% test error rate.
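For reference, the class labels above come from thresholding LDA's posterior probabilities at 0.5; a quick look at the first few test weeks:
# Posterior probabilities behind the predicted classes
head(predict(lda.fit, x_test)$posterior)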
Repeat (d) using QDA
qda.fit = qda(Direction~Lag2, x_train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = x_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda.class = predict(qda.fit, x_test)$class
caret::confusionMatrix(qda.class, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 0
## Up 43 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.5419
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.5865
## Prevalence : 0.4135
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Down
##
mean(qda.class==y_test) # test accuracy
## [1] 0.5865385
mean(qda.class!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.4134615
We achieve an accuracy of 58.65%.
Fitting a QDA model, the prior probabilities and group means are identical to those seen in the LDA model. However, this model predicts Up for every test week, so its accuracy simply equals the no-information rate of 58.65%, the same as the naive always-Up baseline and worse than the logistic regression and LDA results in (d).
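We can confirm this degenerate behavior directly: QDA's posterior probability for Down never reaches 0.5 on any test week.
# Largest posterior assigned to "Down" across the test set (stays below 0.5)
max(predict(qda.fit, x_test)$posterior[, "Down"])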
Repeat (d) using KNN with K = 1
set.seed(123)
x.train = as.matrix(week$Lag2[train])
x.test = as.matrix(week$Lag2[!train])
y.train = week$Direction[train]
knn.preds = knn(x.train, x.test, y.train, k=1)
caret::confusionMatrix(knn.preds, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 21 29
## Up 22 32
##
## Accuracy : 0.5096
## 95% CI : (0.4097, 0.609)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.9540
##
## Kappa : 0.0127
##
## Mcnemar's Test P-Value : 0.4008
##
## Sensitivity : 0.4884
## Specificity : 0.5246
## Pos Pred Value : 0.4200
## Neg Pred Value : 0.5926
## Prevalence : 0.4135
## Detection Rate : 0.2019
## Detection Prevalence : 0.4808
## Balanced Accuracy : 0.5065
##
## 'Positive' Class : Down
##
mean(knn.preds == y_test)
## [1] 0.5096154
mean(knn.preds != y_test)
## [1] 0.4903846
We achieve an accuracy of 50.96%.
With KNN (K = 1), both the accuracy and the test error rate worsen: at roughly 51% we are barely beating random chance and fall short of the always-Up baseline. This model does the poorest job so far of predicting new data.
Which of these methods appears to provide the best results on this data? Logistic regression and LDA, each fit on the training years with Lag2 alone, did the best compared to the other models: both correctly predicted 62.5% of the held-out data, a test error rate of 37.5%.
Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
hist(week$Volume)
Because the Volume variable is right-skewed, it may be worth log-transforming it and seeing how this affects the model.
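Before fitting, a quick check that the transform behaves as expected:
# The log transform should pull in Volume's long right tail
hist(log(week$Volume), main = "Histogram of log(Volume)", xlab = "log(Volume)")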
glm.fit3=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+log(Volume), x_train, family = binomial)
glm.probs=predict(glm.fit3, x_test, type="response")
glm.preds = ifelse(glm.probs>=0.5, "Up", "Down")
caret::confusionMatrix(as.factor(glm.preds), y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 23 31
## Up 20 30
##
## Accuracy : 0.5096
## 95% CI : (0.4097, 0.609)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.9540
##
## Kappa : 0.0257
##
## Mcnemar's Test P-Value : 0.1614
##
## Sensitivity : 0.5349
## Specificity : 0.4918
## Pos Pred Value : 0.4259
## Neg Pred Value : 0.6000
## Prevalence : 0.4135
## Detection Rate : 0.2212
## Detection Prevalence : 0.5192
## Balanced Accuracy : 0.5133
##
## 'Positive' Class : Down
##
mean(glm.preds==y_test) # test accuracy
## [1] 0.5096154
mean(glm.preds!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.4903846
Despite the more symmetric histogram, fitting this model with all five lag predictors plus a log-transformed Volume worsens the test error rate: the model now predicts new data correctly only about 51% of the time, compared with the 62.5% seen in (d).
lda.fit2 = lda(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+log(Volume)+Lag1:Lag2, x_train)
lda.fit2
## Call:
## lda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + log(Volume) +
## Lag1:Lag2, data = x_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag1 Lag2 Lag3 Lag4 Lag5 log(Volume)
## Down 0.289444444 -0.03568254 0.17080045 0.15925624 0.21409297 -0.2386605
## Up -0.009213235 0.26036581 0.08404044 0.09220956 0.04548897 -0.3281402
## Lag1:Lag2
## Down -0.8014495
## Up -0.1393632
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.24453155
## Lag2 0.20964391
## Lag3 -0.05951762
## Lag4 -0.13167014
## Lag5 -0.15170339
## log(Volume) -0.45849491
## Lag1:Lag2 0.01148372
lda.class = predict(lda.fit2, x_test)$class
caret::confusionMatrix(lda.class, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 22 30
## Up 21 31
##
## Accuracy : 0.5096
## 95% CI : (0.4097, 0.609)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.9540
##
## Kappa : 0.0192
##
## Mcnemar's Test P-Value : 0.2626
##
## Sensitivity : 0.5116
## Specificity : 0.5082
## Pos Pred Value : 0.4231
## Neg Pred Value : 0.5962
## Prevalence : 0.4135
## Detection Rate : 0.2115
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.5099
##
## 'Positive' Class : Down
##
mean(lda.class==y_test) # test accuracy
## [1] 0.5096154
mean(lda.class!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.4903846
The LDA model shows similar results to the logistic regression above even after adding the Lag1:Lag2 interaction, correctly predicting only 51% of the test data.
qda.fit = qda(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+log(Volume), x_train)
qda.fit
## Call:
## qda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + log(Volume),
## data = x_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag1 Lag2 Lag3 Lag4 Lag5 log(Volume)
## Down 0.289444444 -0.03568254 0.17080045 0.15925624 0.21409297 -0.2386605
## Up -0.009213235 0.26036581 0.08404044 0.09220956 0.04548897 -0.3281402
qda.class = predict(qda.fit, x_test)$class
caret::confusionMatrix(qda.class, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 18 25
## Up 25 36
##
## Accuracy : 0.5192
## 95% CI : (0.4191, 0.6183)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.9316
##
## Kappa : 0.0088
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.4186
## Specificity : 0.5902
## Pos Pred Value : 0.4186
## Neg Pred Value : 0.5902
## Prevalence : 0.4135
## Detection Rate : 0.1731
## Detection Prevalence : 0.4135
## Balanced Accuracy : 0.5044
##
## 'Positive' Class : Down
##
mean(qda.class==y_test) # test accuracy
## [1] 0.5192308
mean(qda.class!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.4807692
qda.fit = qda(Direction~Lag2+log(Volume), x_train)
qda.fit
## Call:
## qda(Direction ~ Lag2 + log(Volume), data = x_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2 log(Volume)
## Down -0.03568254 -0.2386605
## Up 0.26036581 -0.3281402
qda.class = predict(qda.fit, x_test)$class
caret::confusionMatrix(qda.class, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 21 24
## Up 22 37
##
## Accuracy : 0.5577
## 95% CI : (0.457, 0.655)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.7579
##
## Kappa : 0.0943
##
## Mcnemar's Test P-Value : 0.8828
##
## Sensitivity : 0.4884
## Specificity : 0.6066
## Pos Pred Value : 0.4667
## Neg Pred Value : 0.6271
## Prevalence : 0.4135
## Detection Rate : 0.2019
## Detection Prevalence : 0.4327
## Balanced Accuracy : 0.5475
##
## 'Positive' Class : Down
##
mean(qda.class==y_test) # test accuracy
## [1] 0.5576923
mean(qda.class!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.4423077
qda.fit = qda(Direction~I(Lag2**3)+log(Volume), x_train)
qda.fit
## Call:
## qda(Direction ~ I(Lag2^3) + log(Volume), data = x_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## I(Lag2^3) log(Volume)
## Down -13.71386 -0.2386605
## Up 1.09383 -0.3281402
qda.class = predict(qda.fit, x_test)$class
caret::confusionMatrix(qda.class, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 3 5
## Up 40 56
##
## Accuracy : 0.5673
## 95% CI : (0.4665, 0.6641)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.6921
##
## Kappa : -0.0139
##
## Mcnemar's Test P-Value : 4.011e-07
##
## Sensitivity : 0.06977
## Specificity : 0.91803
## Pos Pred Value : 0.37500
## Neg Pred Value : 0.58333
## Prevalence : 0.41346
## Detection Rate : 0.02885
## Detection Prevalence : 0.07692
## Balanced Accuracy : 0.49390
##
## 'Positive' Class : Down
##
mean(qda.class==y_test) # test accuracy
## [1] 0.5673077
mean(qda.class!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.4326923
qda.fit = qda(Direction~Lag2, x_train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = x_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda.class = predict(qda.fit, x_test)$class
caret::confusionMatrix(qda.class, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 0
## Up 43 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.5419
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.5865
## Prevalence : 0.4135
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Down
##
mean(qda.class==y_test) # test accuracy
## [1] 0.5865385
mean(qda.class!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.4134615
The test error rate improves as predictors are removed, the right-skewed Volume variable is log-transformed, and the remaining predictor is cubed; each change brought a small improvement on new data. The final (Lag2-only) QDA fit predicted new data about 58.7% of the time, though only by always predicting Up, and none of these models beat the original model in (d).
Turning back to KNN: increasing the value of k (here to 90, with Lag1 and Lag2 as predictors) lowers the test error rate to about 45%.
set.seed(123)
combo.train = cbind(week$Lag1, week$Lag2)[train,]
combo.test = cbind(week$Lag1, week$Lag2)[!train,]
x.train = as.matrix(combo.train)
x.test = as.matrix(combo.test)
y.train = week$Direction[train]
knn.preds = knn(x.train, x.test, y.train, k=90)
caret::confusionMatrix(knn.preds, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 14 18
## Up 29 43
##
## Accuracy : 0.5481
## 95% CI : (0.4474, 0.6459)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.8152
##
## Kappa : 0.0317
##
## Mcnemar's Test P-Value : 0.1447
##
## Sensitivity : 0.3256
## Specificity : 0.7049
## Pos Pred Value : 0.4375
## Neg Pred Value : 0.5972
## Prevalence : 0.4135
## Detection Rate : 0.1346
## Detection Prevalence : 0.3077
## Balanced Accuracy : 0.5152
##
## 'Positive' Class : Down
##
mean(knn.preds == y_test)
## [1] 0.5480769
mean(knn.preds != y_test)
## [1] 0.4519231
Including 2 predictors and increasing k to 90 raises the accuracy to about 55%, a test error rate of about 45%. That improves on KNN with k = 1 but still falls short of the model in (d).
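Rather than trying values of k one at a time, a small sweep (a sketch reusing the two-predictor matrices above) makes the error-versus-k trend easier to see:
# Test error across a range of k values (Lag1 + Lag2 as predictors)
for (k in c(1, 5, 15, 45, 90)) {
  set.seed(123)
  preds = knn(x.train, x.test, y.train, k = k)
  cat("k =", k, " test error =", round(mean(preds != y_test), 3), "\n")
}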
set.seed(123)
combo.train = cbind(week$Lag1, week$Lag2, week$Lag3, week$Lag4, week$Lag5)[train,]
combo.test = cbind(week$Lag1, week$Lag2, week$Lag3, week$Lag4, week$Lag5)[!train,]
x.train = as.matrix(combo.train)
x.test = as.matrix(combo.test)
y.train = week$Direction[train]
knn.preds = knn(x.train, x.test, y.train, k=3)
caret::confusionMatrix(knn.preds, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 21 24
## Up 22 37
##
## Accuracy : 0.5577
## 95% CI : (0.457, 0.655)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.7579
##
## Kappa : 0.0943
##
## Mcnemar's Test P-Value : 0.8828
##
## Sensitivity : 0.4884
## Specificity : 0.6066
## Pos Pred Value : 0.4667
## Neg Pred Value : 0.6271
## Prevalence : 0.4135
## Detection Rate : 0.2019
## Detection Prevalence : 0.4327
## Balanced Accuracy : 0.5475
##
## 'Positive' Class : Down
##
mean(knn.preds == y_test)
## [1] 0.5576923
mean(knn.preds != y_test)
## [1] 0.4423077
Using all five lag variables with k = 3 gives 55.8% accuracy, still below the 62.5% from (d). Across all of these experiments, logistic regression and LDA with Lag2 alone remain the best performers on the held-out data.
Next, we develop a model to predict whether a given car gets high or low gas mileage, based on the Auto data set.
cars = Auto
any(is.na(cars))
## [1] FALSE
Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
cars$mpg01 = ifelse(cars$mpg > median(cars$mpg), 1, 0)
median(cars$mpg)
## [1] 22.75
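By construction the two classes should be close to balanced; a quick check:
# Class counts for the new binary variable
table(cars$mpg01)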
Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
corrplot(cor(cars[,-9]), method = "number")
The variables most positively correlated with our binary variable mpg01 are mpg, origin, and year.
The variables most negatively correlated with mpg01 are cylinders, displacement, and weight; these show the strongest correlations overall.
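As the prompt suggests, boxplots make this separation easy to see; a sketch for the three strongest negative correlates:
# Distribution of each predictor split by mpg01 (0 = low mpg, 1 = high mpg)
par(mfrow = c(1, 3))
boxplot(cylinders ~ mpg01, data = cars, main = "cylinders")
boxplot(displacement ~ mpg01, data = cars, main = "displacement")
boxplot(weight ~ mpg01, data = cars, main = "weight")
par(mfrow = c(1, 1))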
Split the data into a training set and a test set.
set.seed(123)
split = sample.split(cars, SplitRatio = 0.50)
x_train = subset(cars, split == TRUE)
x_test = subset(cars, split == FALSE)
y_train = subset(cars$mpg01, split == TRUE)
y_test = subset(cars$mpg01, split == FALSE)
cars$mpg01 = as.factor(cars$mpg01)
Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
lda.fit = lda(mpg01~cylinders+displacement+weight, x_train)
lda.fit
## Call:
## lda(mpg01 ~ cylinders + displacement + weight, data = x_train)
##
## Prior probabilities of groups:
## 0 1
## 0.5102041 0.4897959
##
## Group means:
## cylinders displacement weight
## 0 6.670000 267.4300 3556.360
## 1 4.229167 117.7812 2347.333
##
## Coefficients of linear discriminants:
## LD1
## cylinders -0.4496559276
## displacement -0.0002958369
## weight -0.0008855457
lda.class = predict(lda.fit,x_test)$class
caret::confusionMatrix(lda.class, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 84 5
## 1 12 95
##
## Accuracy : 0.9133
## 95% CI : (0.8648, 0.9487)
## No Information Rate : 0.5102
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8262
##
## Mcnemar's Test P-Value : 0.1456
##
## Sensitivity : 0.8750
## Specificity : 0.9500
## Pos Pred Value : 0.9438
## Neg Pred Value : 0.8879
## Prevalence : 0.4898
## Detection Rate : 0.4286
## Detection Prevalence : 0.4541
## Balanced Accuracy : 0.9125
##
## 'Positive' Class : 0
##
mean(lda.class!=y_test)
## [1] 0.08673469
1-mean(lda.class!=y_test)
## [1] 0.9132653
This LDA model achieves a test error rate of 0.087, so the model correctly classified about 91.3% of the new data.
Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
qda.fit = qda(mpg01~cylinders+displacement+weight, x_train)
qda.fit
## Call:
## qda(mpg01 ~ cylinders + displacement + weight, data = x_train)
##
## Prior probabilities of groups:
## 0 1
## 0.5102041 0.4897959
##
## Group means:
## cylinders displacement weight
## 0 6.670000 267.4300 3556.360
## 1 4.229167 117.7812 2347.333
qda.class = predict(qda.fit, x_test)$class
caret::confusionMatrix(qda.class, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 87 7
## 1 9 93
##
## Accuracy : 0.9184
## 95% CI : (0.8708, 0.9526)
## No Information Rate : 0.5102
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8366
##
## Mcnemar's Test P-Value : 0.8026
##
## Sensitivity : 0.9062
## Specificity : 0.9300
## Pos Pred Value : 0.9255
## Neg Pred Value : 0.9118
## Prevalence : 0.4898
## Detection Rate : 0.4439
## Detection Prevalence : 0.4796
## Balanced Accuracy : 0.9181
##
## 'Positive' Class : 0
##
mean(qda.class!=y_test)
## [1] 0.08163265
1-mean(qda.class!=y_test)
## [1] 0.9183673
This QDA model achieves a slightly lower test error rate of 0.082, correctly classifying about 91.8% of the new data.
Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
glm.fit = glm(mpg01~cylinders+displacement+weight, x_train, family = binomial)
glm.probs=predict(glm.fit, x_test, type="response")
glm.preds = ifelse(glm.probs>=0.5, 1,0)
caret::confusionMatrix(as.factor(glm.preds), as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 86 7
## 1 10 93
##
## Accuracy : 0.9133
## 95% CI : (0.8648, 0.9487)
## No Information Rate : 0.5102
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8263
##
## Mcnemar's Test P-Value : 0.6276
##
## Sensitivity : 0.8958
## Specificity : 0.9300
## Pos Pred Value : 0.9247
## Neg Pred Value : 0.9029
## Prevalence : 0.4898
## Detection Rate : 0.4388
## Detection Prevalence : 0.4745
## Balanced Accuracy : 0.9129
##
## 'Positive' Class : 0
##
mean(glm.preds!=y_test)
## [1] 0.08673469
1-mean(glm.preds!=y_test)
## [1] 0.9132653
The logistic regression model performs slightly worse than QDA, with a test error rate of 0.087, matching the LDA model; it correctly classified 91.3% of the new data.
Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
combo.train = cbind(cars$cylinders, cars$displacement, cars$weight)[split,]
combo.test = cbind(cars$cylinders, cars$displacement, cars$weight)[!split,]
x.train = as.matrix(combo.train)
x.test = as.matrix(combo.test)
y.train = cars$mpg01[split]
knn.preds = knn(x.train, x.test, y.train, k=1)
caret::confusionMatrix(knn.preds, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 82 17
## 1 14 83
##
## Accuracy : 0.8418
## 95% CI : (0.7831, 0.8899)
## No Information Rate : 0.5102
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6837
##
## Mcnemar's Test P-Value : 0.7194
##
## Sensitivity : 0.8542
## Specificity : 0.8300
## Pos Pred Value : 0.8283
## Neg Pred Value : 0.8557
## Prevalence : 0.4898
## Detection Rate : 0.4184
## Detection Prevalence : 0.5051
## Balanced Accuracy : 0.8421
##
## 'Positive' Class : 0
##
mean(knn.preds != y_test)
## [1] 0.1581633
1-mean(knn.preds != y_test)
## [1] 0.8418367
combo.train = cbind(cars$cylinders, cars$displacement, cars$weight)[split,]
combo.test = cbind(cars$cylinders, cars$displacement, cars$weight)[!split,]
x.train = as.matrix(combo.train)
x.test = as.matrix(combo.test)
y.train = cars$mpg01[split]
knn.preds = knn(x.train, x.test, y.train, k=3)
caret::confusionMatrix(knn.preds, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 85 15
## 1 11 85
##
## Accuracy : 0.8673
## 95% CI : (0.8117, 0.9115)
## No Information Rate : 0.5102
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7348
##
## Mcnemar's Test P-Value : 0.5563
##
## Sensitivity : 0.8854
## Specificity : 0.8500
## Pos Pred Value : 0.8500
## Neg Pred Value : 0.8854
## Prevalence : 0.4898
## Detection Rate : 0.4337
## Detection Prevalence : 0.5102
## Balanced Accuracy : 0.8677
##
## 'Positive' Class : 0
##
mean(knn.preds != y_test)
## [1] 0.1326531
1-mean(knn.preds != y_test)
## [1] 0.8673469
combo.train = cbind(cars$cylinders, cars$displacement, cars$weight)[split,]
combo.test = cbind(cars$cylinders, cars$displacement, cars$weight)[!split,]
x.train = as.matrix(combo.train)
x.test = as.matrix(combo.test)
y.train = cars$mpg01[split]
knn.preds = knn(x.train, x.test, y.train, k=50)
caret::confusionMatrix(knn.preds, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 86 7
## 1 10 93
##
## Accuracy : 0.9133
## 95% CI : (0.8648, 0.9487)
## No Information Rate : 0.5102
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8263
##
## Mcnemar's Test P-Value : 0.6276
##
## Sensitivity : 0.8958
## Specificity : 0.9300
## Pos Pred Value : 0.9247
## Neg Pred Value : 0.9029
## Prevalence : 0.4898
## Detection Rate : 0.4388
## Detection Prevalence : 0.4745
## Balanced Accuracy : 0.9129
##
## 'Positive' Class : 0
##
mean(knn.preds != y_test)
## [1] 0.08673469
1-mean(knn.preds != y_test)
## [1] 0.9132653
combo.train = cbind(cars$cylinders, cars$displacement, cars$weight)[split,]
combo.test = cbind(cars$cylinders, cars$displacement, cars$weight)[!split,]
x.train = as.matrix(combo.train)
x.test = as.matrix(combo.test)
y.train = cars$mpg01[split]
knn.preds = knn(x.train, x.test, y.train, k=149)
caret::confusionMatrix(knn.preds, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 80 3
## 1 16 97
##
## Accuracy : 0.9031
## 95% CI : (0.8528, 0.9406)
## No Information Rate : 0.5102
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8055
##
## Mcnemar's Test P-Value : 0.005905
##
## Sensitivity : 0.8333
## Specificity : 0.9700
## Pos Pred Value : 0.9639
## Neg Pred Value : 0.8584
## Prevalence : 0.4898
## Detection Rate : 0.4082
## Detection Prevalence : 0.4235
## Balanced Accuracy : 0.9017
##
## 'Positive' Class : 0
##
mean(knn.preds != y_test)
## [1] 0.09693878
1-mean(knn.preds != y_test)
## [1] 0.9030612
The best KNN model achieves a test error rate of 0.087, correctly predicting 91.3% of the new data, similar to the LDA and logistic regression models above. This was achieved with K = 50; the other values tried (1, 3, and 149) all performed worse.
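One caveat worth noting: these three predictors sit on very different scales (weight is in the thousands while cylinders is single digits), so weight dominates the Euclidean distances KNN uses. A hedged variant that standardizes each column with training-set statistics (not done above):
# Standardize predictors before KNN so no single column dominates the distance
mu = colMeans(x.train); sdev = apply(x.train, 2, sd)
knn.preds.std = knn(scale(x.train, mu, sdev), scale(x.test, mu, sdev), y.train, k = 50)
mean(knn.preds.std != y_test) # test error with standardized predictors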
Finally, using the Boston data set, we fit classification models to predict whether a given suburb has a crime rate above or below the median.
bost = Boston
any(is.na(bost))
## [1] FALSE
# create binary variable
bost$crim.pred = ifelse(bost$crim > median(bost$crim), 1, 0)
median(bost$crim)
## [1] 0.25651
corrplot(cor(bost), method = "number", type = "upper")
Variables that appear to be highly correlated with our binary response variable are: nox, age, dis, rad, indus, and tax
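A quick graphical check of a few of these (crim.pred is still numeric 0/1 at this point):
# Boxplots of three strong correlates split by the crime-rate class
par(mfrow = c(1, 3))
boxplot(nox ~ crim.pred, data = bost, main = "nox")
boxplot(rad ~ crim.pred, data = bost, main = "rad")
boxplot(tax ~ crim.pred, data = bost, main = "tax")
par(mfrow = c(1, 1))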
Now we will explore a few models to see which one best predicts new data points.
# splitting the data
set.seed(1)
split = sample.split(bost, SplitRatio = 0.50)
x_train = subset(bost, split == TRUE)
x_test = subset(bost, split == FALSE)
y_train = subset(bost$crim.pred, split == TRUE)
y_test = subset(bost$crim.pred, split == FALSE)
bost$crim.pred = as.factor(bost$crim.pred)
Logistic Regression:
glm.fit = glm(crim.pred~nox+age+dis+rad+indus+tax, x_train, family = binomial)
glm.probs=predict(glm.fit, x_test, type="response")
glm.preds = ifelse(glm.probs>=0.5, 1, 0)
caret::confusionMatrix(as.factor(glm.preds), as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 115 18
## 1 23 114
##
## Accuracy : 0.8481
## 95% CI : (0.7997, 0.8888)
## No Information Rate : 0.5111
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6964
##
## Mcnemar's Test P-Value : 0.5322
##
## Sensitivity : 0.8333
## Specificity : 0.8636
## Pos Pred Value : 0.8647
## Neg Pred Value : 0.8321
## Prevalence : 0.5111
## Detection Rate : 0.4259
## Detection Prevalence : 0.4926
## Balanced Accuracy : 0.8485
##
## 'Positive' Class : 0
##
mean(glm.preds!=y_test)
## [1] 0.1518519
1-mean(glm.preds!=y_test)
## [1] 0.8481481
Test error rate: 0.152
The model correctly predicted 84.8% of the new data
Linear Discriminant Analysis:
lda.fit = lda(crim.pred~nox+age+dis+rad+indus+tax, x_train)
lda.fit
## Call:
## lda(crim.pred ~ nox + age + dis + rad + indus + tax, data = x_train)
##
## Prior probabilities of groups:
## 0 1
## 0.4872881 0.5127119
##
## Group means:
## nox age dis rad indus tax
## 0 0.4702070 50.53739 5.160057 4.069565 7.049478 310.6696
## 1 0.6380331 84.82231 2.548820 14.595041 15.066529 504.6860
##
## Coefficients of linear discriminants:
## LD1
## nox 7.272485559
## age 0.007580730
## dis -0.081812204
## rad 0.101940253
## indus 0.010295835
## tax -0.002892852
lda.class = predict(lda.fit,x_test)$class
caret::confusionMatrix(lda.class, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 129 29
## 1 9 103
##
## Accuracy : 0.8593
## 95% CI : (0.812, 0.8984)
## No Information Rate : 0.5111
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7174
##
## Mcnemar's Test P-Value : 0.002055
##
## Sensitivity : 0.9348
## Specificity : 0.7803
## Pos Pred Value : 0.8165
## Neg Pred Value : 0.9196
## Prevalence : 0.5111
## Detection Rate : 0.4778
## Detection Prevalence : 0.5852
## Balanced Accuracy : 0.8575
##
## 'Positive' Class : 0
##
mean(lda.class!=y_test)
## [1] 0.1407407
1-mean(lda.class!=y_test)
## [1] 0.8592593
Test error rate: 0.141
The model correctly predicted 85.9% of the new data
Quadratic Discriminant Analysis:
qda.fit = qda(crim.pred~nox+age+dis+rad+indus+tax, x_train)
qda.fit
## Call:
## qda(crim.pred ~ nox + age + dis + rad + indus + tax, data = x_train)
##
## Prior probabilities of groups:
## 0 1
## 0.4872881 0.5127119
##
## Group means:
## nox age dis rad indus tax
## 0 0.4702070 50.53739 5.160057 4.069565 7.049478 310.6696
## 1 0.6380331 84.82231 2.548820 14.595041 15.066529 504.6860
qda.class = predict(qda.fit, x_test)$class
caret::confusionMatrix(qda.class, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 133 19
## 1 5 113
##
## Accuracy : 0.9111
## 95% CI : (0.8706, 0.9422)
## No Information Rate : 0.5111
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8217
##
## Mcnemar's Test P-Value : 0.007963
##
## Sensitivity : 0.9638
## Specificity : 0.8561
## Pos Pred Value : 0.8750
## Neg Pred Value : 0.9576
## Prevalence : 0.5111
## Detection Rate : 0.4926
## Detection Prevalence : 0.5630
## Balanced Accuracy : 0.9099
##
## 'Positive' Class : 0
##
mean(qda.class!=y_test)
## [1] 0.08888889
1-mean(qda.class!=y_test)
## [1] 0.9111111
Test error rate: 0.089
The model correctly predicted 91.1% of the new data
K-Nearest Neighbors:
combo.train = cbind(bost$nox, bost$age, bost$dis, bost$rad, bost$indus, bost$tax)[split,]
combo.test = cbind(bost$nox, bost$age, bost$dis, bost$rad, bost$indus, bost$tax)[!split,]
x.train = as.matrix(combo.train)
x.test = as.matrix(combo.test)
y.train = bost$crim.pred[split]
knn.preds = knn(x.train, x.test, y.train, k=1)
caret::confusionMatrix(knn.preds, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 124 7
## 1 14 125
##
## Accuracy : 0.9222
## 95% CI : (0.8836, 0.9512)
## No Information Rate : 0.5111
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8445
##
## Mcnemar's Test P-Value : 0.1904
##
## Sensitivity : 0.8986
## Specificity : 0.9470
## Pos Pred Value : 0.9466
## Neg Pred Value : 0.8993
## Prevalence : 0.5111
## Detection Rate : 0.4593
## Detection Prevalence : 0.4852
## Balanced Accuracy : 0.9228
##
## 'Positive' Class : 0
##
mean(knn.preds != y_test)
## [1] 0.07777778
1-mean(knn.preds != y_test)
## [1] 0.9222222
Increasing k to 3:
knn.preds = knn(x.train, x.test, y.train, k=3)
caret::confusionMatrix(knn.preds, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 126 9
## 1 12 123
##
## Accuracy : 0.9222
## 95% CI : (0.8836, 0.9512)
## No Information Rate : 0.5111
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8444
##
## Mcnemar's Test P-Value : 0.6625
##
## Sensitivity : 0.9130
## Specificity : 0.9318
## Pos Pred Value : 0.9333
## Neg Pred Value : 0.9111
## Prevalence : 0.5111
## Detection Rate : 0.4667
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9224
##
## 'Positive' Class : 0
##
mean(knn.preds != y_test)
## [1] 0.07777778
1-mean(knn.preds != y_test)
## [1] 0.9222222
Test error rate: 0.078
The model correctly predicted 92.2% of the new data
The KNN model performed the best on this data set, correctly predicting the highest share of new data at 92.2%. This held for both K = 1 and K = 3.