Assignment 3

10

a

Looking at our plot of weekly we don’t see any significant linear relationships. We’re given a bunch of quantitative variables with only one factor variable being Direction. Looking specfically at the direction variable that over the time of this data we have more instances of up, rather than down, but they figures are still fairly close. Next I wanted to look at the correlaion of the variables, and we really see that most have a negative relationship except for a few of the lags. The only real take away from this is that we see a high correlation between volume and year. Because of this correlation I wanted to get a quick graph of the two variables.Looking at the graph we see a steady increase in volume until about year 2000 where the increase becomes very large, and only grows from there.

plot(Weekly)

#View(Weekly)
summary(Weekly)

##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
##

str(Weekly)

## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...

head(Weekly)

##   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
## 1 1990  0.816  1.572 -3.936 -0.229 -3.484 0.1549760 -0.270      Down
## 2 1990 -0.270  0.816  1.572 -3.936 -0.229 0.1485740 -2.576      Down
## 3 1990 -2.576 -0.270  0.816  1.572 -3.936 0.1598375  3.514        Up
## 4 1990  3.514 -2.576 -0.270  0.816  1.572 0.1616300  0.712        Up
## 5 1990  0.712  3.514 -2.576 -0.270  0.816 0.1537280  1.178        Up
## 6 1990  1.178  0.712  3.514 -2.576 -0.270 0.1544440 -1.372      Down

table(Weekly$Direction)/sum(table(Weekly$Direction))

## 
##      Down        Up 
## 0.4444444 0.5555556

plot(Weekly$Direction)

cor(Weekly[,-9])

##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

ggplot(Weekly, aes(x = Year, y = Volume))+
  geom_line(size = 1)+
  theme_gray()

b

After running the logestic regression model we see that the only significant variable is Lag2, assuming a level significance of 5%.

week.glm = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = 'binomial')
summary(week.glm)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = "binomial", data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

c

Looking at the confusion matrix we see that our accurracy for prediction is pretty low, 56%. Looking closer We see that we have a large number of false positives.

pred = predict.glm(week.glm, type = 'response')

pred_direction = ifelse(pred >= 0.5, "Up", "Down")

caret::confusionMatrix(as.factor(Weekly$Direction), as.factor(pred_direction))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down  Up
##       Down   54 430
##       Up     48 557
##                                          
##                Accuracy : 0.5611         
##                  95% CI : (0.531, 0.5908)
##     No Information Rate : 0.9063         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.035          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.52941        
##             Specificity : 0.56434        
##          Pos Pred Value : 0.11157        
##          Neg Pred Value : 0.92066        
##              Prevalence : 0.09366        
##          Detection Rate : 0.04959        
##    Detection Prevalence : 0.44444        
##       Balanced Accuracy : 0.54687        
##                                          
##        'Positive' Class : Down           
##

d

Fitting it with the data set we see that we now get a prediction rate of 62.5%. We still have a higher false positive rate, but its much lower than before.

week_train = Weekly[Weekly$Year <= 2008, ]
week_test = Weekly[Weekly$Year > 2008, ]

week.glm.train = glm(Direction ~ Lag2, data = week_train, family = 'binomial')
#summary(week.glm.train)

pred2 = predict.glm(week.glm.train, newdata = week_test, type = 'response')

pred_direction2 = ifelse(pred2 >= 0.5, "Up", "Down")

caret::confusionMatrix(as.factor(week_test$Direction), as.factor(pred_direction2))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9 34
##       Up      5 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.8654         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.64286        
##             Specificity : 0.62222        
##          Pos Pred Value : 0.20930        
##          Neg Pred Value : 0.91803        
##              Prevalence : 0.13462        
##          Detection Rate : 0.08654        
##    Detection Prevalence : 0.41346        
##       Balanced Accuracy : 0.63254        
##                                          
##        'Positive' Class : Down           
##

e

We get an accuracy of 62.5% from lda.

week.lda = lda(Direction ~ Lag2, data = week_train)
week.lda

## Call:
## lda(Direction ~ Lag2, data = week_train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162

pred.lda = predict(week.lda, week_test)

confusionMatrix(data = pred.lda$class, reference = week_test$Direction)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.20930        
##             Specificity : 0.91803        
##          Pos Pred Value : 0.64286        
##          Neg Pred Value : 0.62222        
##              Prevalence : 0.41346        
##          Detection Rate : 0.08654        
##    Detection Prevalence : 0.13462        
##       Balanced Accuracy : 0.56367        
##                                          
##        'Positive' Class : Down           
##

f

Accuracy for QDA is 58%, but it should be noted that it predicts Up every time

week.qda = qda(Direction ~ Lag2, data = week_train)
week.qda

## Call:
## qda(Direction ~ Lag2, data = week_train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581

pred.qda = predict(week.qda, week_test)

confusionMatrix(data = pred.qda$class, reference = week_test$Direction)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    0  0
##       Up     43 61
##                                           
##                Accuracy : 0.5865          
##                  95% CI : (0.4858, 0.6823)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.5419          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.504e-10       
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.5865          
##              Prevalence : 0.4135          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Down            
##

g

Using KNN we get an accuracy score of 50%

set.seed(1)

pred.knn = knn(data.frame(week_train$Lag2),
               data.frame(week_test$Lag2),
               cl = week_train$Direction,
               k = 1,
               prob = T)

confusionMatrix(pred.knn, week_test$Direction)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   21 30
##       Up     22 31
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.4003, 0.5997)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.9700          
##                                           
##                   Kappa : -0.0033         
##                                           
##  Mcnemar's Test P-Value : 0.3317          
##                                           
##             Sensitivity : 0.4884          
##             Specificity : 0.5082          
##          Pos Pred Value : 0.4118          
##          Neg Pred Value : 0.5849          
##              Prevalence : 0.4135          
##          Detection Rate : 0.2019          
##    Detection Prevalence : 0.4904          
##       Balanced Accuracy : 0.4983          
##                                           
##        'Positive' Class : Down            
##

h

Looking at the accuracies we see that LDA and logistic regression both are the highest at 62.5%

i

First i’ll just the logistic regression using the interaction between Lag2 and Lag1 becuase they were one of the few lags that were correlated. After running the confusion matrix we return an accuracy rating of only 55.6%

week.glm2 = glm(Direction ~ Lag2:Lag1, data = Weekly, family = 'binomial')
summary(week.glm2)

## 
## Call:
## glm(formula = Direction ~ Lag2:Lag1, family = "binomial", data = Weekly)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.362  -1.274   1.073   1.084   1.302  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.225716   0.061080   3.695  0.00022 ***
## Lag2:Lag1   0.005982   0.006285   0.952  0.34113    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1495.3  on 1087  degrees of freedom
## AIC: 1499.3
## 
## Number of Fisher Scoring iterations: 4

pred2 = predict.glm(week.glm2, type = 'response')

pred_direction2 = ifelse(pred2 >= 0.5, "Up", "Down")

caret::confusionMatrix(as.factor(Weekly$Direction), as.factor(pred_direction2))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down  Up
##       Down    4 480
##       Up      3 602
##                                           
##                Accuracy : 0.5565          
##                  95% CI : (0.5264, 0.5863)
##     No Information Rate : 0.9936          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0037          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.571429        
##             Specificity : 0.556377        
##          Pos Pred Value : 0.008264        
##          Neg Pred Value : 0.995041        
##              Prevalence : 0.006428        
##          Detection Rate : 0.003673        
##    Detection Prevalence : 0.444444        
##       Balanced Accuracy : 0.563903        
##                                           
##        'Positive' Class : Down            
##

Continuing with the same interaction terms we compute the confusion matrix for lda and return a score of 57.69%.

week.lda2 = lda(Direction ~ Lag2:Lag1, data = week_train)
week.lda2

## Call:
## lda(Direction ~ Lag2:Lag1, data = week_train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##       Lag2:Lag1
## Down -0.8014495
## Up   -0.1393632
## 
## Coefficients of linear discriminants:
##                 LD1
## Lag2:Lag1 0.1013404

pred.lda2 = predict(week.lda2, week_test)

confusionMatrix(data = pred.lda2$class, reference = week_test$Direction)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    0  1
##       Up     43 60
##                                           
##                Accuracy : 0.5769          
##                  95% CI : (0.4761, 0.6732)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.6193          
##                                           
##                   Kappa : -0.0192         
##                                           
##  Mcnemar's Test P-Value : 6.37e-10        
##                                           
##             Sensitivity : 0.000000        
##             Specificity : 0.983607        
##          Pos Pred Value : 0.000000        
##          Neg Pred Value : 0.582524        
##              Prevalence : 0.413462        
##          Detection Rate : 0.000000        
##    Detection Prevalence : 0.009615        
##       Balanced Accuracy : 0.491803        
##                                           
##        'Positive' Class : Down            
##

For QDA I decided to experiment with a different interaction, that being between Lag2 and Lag3, which after running the confusion matrix gave us an score of 56.7%. We primarily made this change as the interaction between Lag2 and Lag1 gave us a very low score with QDA.

week.qda2 = qda(Direction ~ Lag2:Lag3, data = week_train)
week.qda2

## Call:
## qda(Direction ~ Lag2:Lag3, data = week_train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##       Lag2:Lag3
## Down -0.1937158
## Up   -0.6405132

pred.qda2 = predict(week.qda2, week_test)

confusionMatrix(data = pred.qda2$class, reference = week_test$Direction)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    6  8
##       Up     37 53
##                                           
##                Accuracy : 0.5673          
##                  95% CI : (0.4665, 0.6641)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.6921          
##                                           
##                   Kappa : 0.0093          
##                                           
##  Mcnemar's Test P-Value : 2.993e-05       
##                                           
##             Sensitivity : 0.13953         
##             Specificity : 0.86885         
##          Pos Pred Value : 0.42857         
##          Neg Pred Value : 0.58889         
##              Prevalence : 0.41346         
##          Detection Rate : 0.05769         
##    Detection Prevalence : 0.13462         
##       Balanced Accuracy : 0.50419         
##                                           
##        'Positive' Class : Down            
##

For KNN we ended setting k = 10 as it was able to slightly improve our score over 5. Still though, its accuracy was only raised a few points to 54.81%

set.seed(1)

pred.knn5 = knn(data.frame(week_train$Lag2),
               data.frame(week_test$Lag2),
               cl = week_train$Direction,
               k = 10,
               prob = T)

confusionMatrix(pred.knn5, week_test$Direction)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   17 21
##       Up     26 40
##                                           
##                Accuracy : 0.5481          
##                  95% CI : (0.4474, 0.6459)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.8152          
##                                           
##                   Kappa : 0.052           
##                                           
##  Mcnemar's Test P-Value : 0.5596          
##                                           
##             Sensitivity : 0.3953          
##             Specificity : 0.6557          
##          Pos Pred Value : 0.4474          
##          Neg Pred Value : 0.6061          
##              Prevalence : 0.4135          
##          Detection Rate : 0.1635          
##    Detection Prevalence : 0.3654          
##       Balanced Accuracy : 0.5255          
##                                           
##        'Positive' Class : Down            
##

Overall, even with some experimentation we weren’t able to get better than our initial 62.5% accuracy score produced from the GLM and LDA.

11

a

median(Auto$mpg)

## [1] 22.75

mpg01 = as.factor(ifelse(Auto$mpg >= 22.75, 1, 0))
str(mpg01)

##  Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

table(mpg01)

## mpg01
##   0   1 
## 196 196

Auto$origin = factor(Auto$origin, labels = c("American", "European", "Japanese"))

auto = data.frame(Auto, mpg01)
head(auto)

##   mpg cylinders displacement horsepower weight acceleration year   origin
## 1  18         8          307        130   3504         12.0   70 American
## 2  15         8          350        165   3693         11.5   70 American
## 3  18         8          318        150   3436         11.0   70 American
## 4  16         8          304        150   3433         12.0   70 American
## 5  17         8          302        140   3449         10.5   70 American
## 6  15         8          429        198   4341         10.0   70 American
##                        name mpg01
## 1 chevrolet chevelle malibu     0
## 2         buick skylark 320     0
## 3        plymouth satellite     0
## 4             amc rebel sst     0
## 5               ford torino     0
## 6          ford galaxie 500     0

b

plot(auto)

boxplot(auto$horsepower ~ auto$mpg01, auto)

boxplot(auto$cylinders ~ auto$mpg01, auto)

boxplot(auto$weight ~ auto$mpg01, auto)

boxplot(auto$year ~ auto$mpg01, auto)

c

set.seed(1)

sub = sample(nrow(auto), nrow(auto) * .6)
auto.train = auto[sub, ]
auto.test = auto[-sub, ]

d

After running LDA on the variables most associated with mpg01 we returned a test error rate of 13%

auto.train.lda = lda(mpg01 ~ cylinders + displacement + horsepower + weight, data = auto.train)

auto.test.pred = predict(auto.train.lda, auto.test)

auto.test.lda.class = auto.test.pred$class

#table(auto.test$mpg01, auto.test.lda.class)
mean(auto.test.lda.class!=auto.test$mpg01)

## [1] 0.133758

e

Running QDA we return a test error rate of 12%

auto.train.qda = qda(mpg01 ~ cylinders + displacement + horsepower + weight, data = auto.train)

auto.test.pred.qda = predict(auto.train.qda, auto.test)

auto.test.qda.class = auto.test.pred.qda$class

mean(auto.test.qda.class!=auto.test$mpg01)

## [1] 0.1210191

f

We get the same error rate as we did with QDA for logistic regression, 12%.

auto.train.glm = glm(mpg01 ~ cylinders + displacement + horsepower + weight, 
                     data = auto.train, family = binomial)

auto.test.pred.glm = predict(auto.train.glm, auto.test, type = 'response')

auto.test.glm.class = ifelse(auto.test.pred.glm > .5, 1, 0)

mean(auto.test.glm.class!=auto.test$mpg01)

## [1] 0.1210191

g

Running KNN we can look at k = 1:20 where we see that k = 9 preforms the best. We do end up computing a very high error rate of 50%.

set.seed(1)

knn.mpg = train(mpg01 ~ cylinders + displacement + weight + horsepower, data = auto.train,
                method = 'knn', tuneGrid = expand.grid(k=seq(1,20,1)))

knn.mpg

## k-Nearest Neighbors 
## 
## 235 samples
##   4 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 235, 235, 235, 235, 235, 235, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.8587632  0.7167391
##    2  0.8611897  0.7221149
##    3  0.8627032  0.7252875
##    4  0.8727974  0.7449759
##    5  0.8718777  0.7429520
##    6  0.8727477  0.7446311
##    7  0.8794954  0.7582691
##    8  0.8784786  0.7562528
##    9  0.8807195  0.7606923
##   10  0.8800241  0.7593469
##   11  0.8772008  0.7535948
##   12  0.8740404  0.7475105
##   13  0.8753979  0.7501298
##   14  0.8720116  0.7434766
##   15  0.8726215  0.7447116
##   16  0.8767539  0.7527069
##   17  0.8733906  0.7459947
##   18  0.8733731  0.7460016
##   19  0.8716154  0.7424155
##   20  0.8715201  0.7422480
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

pred.knn = predict(knn.mpg, data = auto.test, type = 'raw')

mean(pred.knn != auto.test$mpg01)

## Warning in `!=.default`(pred.knn, auto.test$mpg01): longer object length is not
## a multiple of shorter object length

## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length

## [1] 0.5021277

13

median(Boston$crim)

## [1] 0.25651

Boston$crim = factor(ifelse(Boston$crim > .25651, 1, 0))

Running glm we get an accuracy of 91.5%

boston.glm = glm(crim ~ ., Boston, family = binomial)

summary(boston.glm)

## 
## Call:
## glm(formula = crim ~ ., family = binomial, data = Boston)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3946  -0.1585  -0.0004   0.0023   3.4239  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -34.103704   6.530014  -5.223 1.76e-07 ***
## zn           -0.079918   0.033731  -2.369  0.01782 *  
## indus        -0.059389   0.043722  -1.358  0.17436    
## chas          0.785327   0.728930   1.077  0.28132    
## nox          48.523782   7.396497   6.560 5.37e-11 ***
## rm           -0.425596   0.701104  -0.607  0.54383    
## age           0.022172   0.012221   1.814  0.06963 .  
## dis           0.691400   0.218308   3.167  0.00154 ** 
## rad           0.656465   0.152452   4.306 1.66e-05 ***
## tax          -0.006412   0.002689  -2.385  0.01709 *  
## ptratio       0.368716   0.122136   3.019  0.00254 ** 
## black        -0.013524   0.006536  -2.069  0.03853 *  
## lstat         0.043862   0.048981   0.895  0.37052    
## medv          0.167130   0.066940   2.497  0.01254 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 701.46  on 505  degrees of freedom
## Residual deviance: 211.93  on 492  degrees of freedom
## AIC: 239.93
## 
## Number of Fisher Scoring iterations: 9

pred.glm.bos = predict.glm(boston.glm, type = 'response')

pred.bos = ifelse(pred.glm.bos >= 0.5, 1, 0)

caret::confusionMatrix(as.factor(Boston$crim), as.factor(pred.bos))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 234  19
##          1  24 229
##                                           
##                Accuracy : 0.915           
##                  95% CI : (0.8872, 0.9378)
##     No Information Rate : 0.5099          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.83            
##                                           
##  Mcnemar's Test P-Value : 0.5419          
##                                           
##             Sensitivity : 0.9070          
##             Specificity : 0.9234          
##          Pos Pred Value : 0.9249          
##          Neg Pred Value : 0.9051          
##              Prevalence : 0.5099          
##          Detection Rate : 0.4625          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.9152          
##                                           
##        'Positive' Class : 0               
##

For LDA we return a accuracy score of 85.5%

boston.lda = lda(crim ~ ., Boston)

pred.lda.boston = predict(boston.lda, Boston)

confusionMatrix(data = pred.lda.boston$class, reference = Boston$crim)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 240  60
##          1  13 193
##                                           
##                Accuracy : 0.8557          
##                  95% CI : (0.8221, 0.8852)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7115          
##                                           
##  Mcnemar's Test P-Value : 7.289e-08       
##                                           
##             Sensitivity : 0.9486          
##             Specificity : 0.7628          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.9369          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4743          
##    Detection Prevalence : 0.5929          
##       Balanced Accuracy : 0.8557          
##                                           
##        'Positive' Class : 0               
##

Running the model with QDA we return an accuracy of 90.5%

boston.qda = qda(crim ~ ., Boston)

pred.qda.boston = predict(boston.qda, Boston)

confusionMatrix(data = pred.qda.boston$class, reference = Boston$crim)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 252  47
##          1   1 206
##                                           
##                Accuracy : 0.9051          
##                  95% CI : (0.8762, 0.9292)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8103          
##                                           
##  Mcnemar's Test P-Value : 8.293e-11       
##                                           
##             Sensitivity : 0.9960          
##             Specificity : 0.8142          
##          Pos Pred Value : 0.8428          
##          Neg Pred Value : 0.9952          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4980          
##    Detection Prevalence : 0.5909          
##       Balanced Accuracy : 0.9051          
##                                           
##        'Positive' Class : 0               
##

Using KNN we return an accuracy score of 91.6% when k= 1.

set.seed(1)

boston.knn = train(crim ~ ., Boston, method = 'knn', tuneGrid = expand.grid(k=seq(1,20,1)))

boston.knn

## k-Nearest Neighbors 
## 
## 506 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 506, 506, 506, 506, 506, 506, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.9167006  0.8331366
##    2  0.9091468  0.8179958
##    3  0.9020065  0.8036794
##    4  0.9008807  0.8014285
##    5  0.9056188  0.8108326
##    6  0.9042593  0.8081688
##    7  0.9016401  0.8028562
##    8  0.8981840  0.7960143
##    9  0.8934816  0.7865572
##   10  0.8909119  0.7814292
##   11  0.8883550  0.7762695
##   12  0.8864353  0.7724074
##   13  0.8829908  0.7655497
##   14  0.8821341  0.7638773
##   15  0.8806973  0.7610447
##   16  0.8767248  0.7531522
##   17  0.8714638  0.7426165
##   18  0.8713321  0.7422955
##   19  0.8682418  0.7361468
##   20  0.8655964  0.7308633
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.

Overall it became a close decision between GLM and KNN, but KNN slightly wins out with the better accuracy score.

Assignment 3

Travis Compton

2/24/2021

10

a

b

c

d

e

f

g

h

i

11

a

b

c

d

e

f

g

13