Question 10A.
library(ISLR)   # the Weekly data set comes from the ISLR package
week = Weekly
summary(week)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
hist(week$Lag1)

hist(week$Lag2)

hist(week$Lag3)

hist(week$Lag4)

hist(week$Lag5)

hist(week$Volume)

hist(week$Today)

pairs(week[1:8])

The pairs plot shows a clear pattern only between Year and Volume (trading volume grows over time); the remaining variables show no obvious relationships.
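As a quick numeric check (a sketch, assuming week is still in memory), the correlation matrix of the quantitative columns backs this up:
# Correlations among the numeric columns; Direction (a factor) is excluded.
round(cor(week[, 1:8]), 2)
# Year and Volume are strongly positively correlated; all other pairs are close to zero.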
10B.
week.glm = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = week, family = binomial)
summary(week.glm)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = week)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
From these results, Lag2 is the only predictor that is statistically significant (p ≈ 0.03); the intercept is also significant, but none of the other lags or Volume are.
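To put the Lag2 estimate on a more interpretable scale (a sketch reusing the week.glm fit above), we can exponentiate it into an odds ratio:
exp(coef(week.glm)["Lag2"])                # about exp(0.058) = 1.06: each 1% rise in Lag2
                                           # multiplies the odds of an Up week by roughly 1.06
exp(confint.default(week.glm)["Lag2", ])   # Wald confidence interval on the odds-ratio scale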
10C.
#Method A
week.probs = predict(week.glm, type = "response")
contrasts(week$Direction)
## Up
## Down 0
## Up 1
week.preds = rep("Down", 1089)
week.preds[week.probs>.5] = "Up"
table(week.preds, week$Direction)
##
## week.preds Down Up
## Down 54 48
## Up 430 557
(557+54)/1089
## [1] 0.5610652
#Method B
week$PredProb = predict.glm(week.glm, newdata = week, type = "response")
week$PredSur = ifelse(week$PredProb >= .5,"Up","Down")
caret::confusionMatrix(as.factor(week$Direction), as.factor(week$PredSur), positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 54 430
## Up 48 557
##
## Accuracy : 0.5611
## 95% CI : (0.531, 0.5908)
## No Information Rate : 0.9063
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.035
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5643
## Specificity : 0.5294
## Pos Pred Value : 0.9207
## Neg Pred Value : 0.1116
## Prevalence : 0.9063
## Detection Rate : 0.5115
## Detection Prevalence : 0.5556
## Balanced Accuracy : 0.5469
##
## 'Positive' Class : Up
##
The overall accuracy of the model is 56.11%, only slightly better than chance, and it is computed on the same data used to fit the model. The model is much better at identifying Up weeks than Down weeks: most Up weeks are classified correctly, while the large majority of Down weeks are misclassified as Up.
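A useful point of comparison (a sketch on the same data) is the naive rule that always predicts the majority class, Up:
mean(week$Direction == "Up")   # 605/1089 = 0.5556, so the fitted model beats "always Up" by barely half a percentage point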
10D.
set.seed(1)
train =(week$Year<2009)
week.test = week[!train,1:8]
week.train = week[train,]
direction.test = week$Direction[!train]
week.glm2 = glm(Direction ~ Lag2, data = week, family = binomial, subset = train)
week.test$PredProb = predict.glm(week.glm2, newdata = week.test, type = "response")
week.test$PredSur = ifelse(week.test$PredProb >= .5,"Up","Down")
caret::confusionMatrix(as.factor(direction.test), as.factor(week.test$PredSur), positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 34
## Up 5 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.8654
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.6222
## Specificity : 0.6429
## Pos Pred Value : 0.9180
## Neg Pred Value : 0.2093
## Prevalence : 0.8654
## Detection Rate : 0.5385
## Detection Prevalence : 0.5865
## Balanced Accuracy : 0.6325
##
## 'Positive' Class : Up
##
10E.
library(MASS)
week.lda = lda(Direction ~ Lag2, data = week, subset = train)
week.lda
## Call:
## lda(Direction ~ Lag2, data = week, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
lda.pred = predict(week.lda, week.test)
lda.class= lda.pred$class
caret::confusionMatrix(as.factor(direction.test), as.factor(lda.class), positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 34
## Up 5 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.8654
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.6222
## Specificity : 0.6429
## Pos Pred Value : 0.9180
## Neg Pred Value : 0.2093
## Prevalence : 0.8654
## Detection Rate : 0.5385
## Detection Prevalence : 0.5865
## Balanced Accuracy : 0.6325
##
## 'Positive' Class : Up
##
10F.
week.qda = qda(Direction ~ Lag2, data = week, subset = train)
week.qda
## Call:
## qda(Direction ~ Lag2, data = week, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda.pred = predict(week.qda, week.test)
qda.class = qda.pred$class
caret::confusionMatrix(as.factor(direction.test), as.factor(qda.class), positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 43
## Up 0 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 1
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 0.5865
## Specificity : NA
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : 1.0000
## Detection Rate : 0.5865
## Detection Prevalence : 0.5865
## Balanced Accuracy : NA
##
## 'Positive' Class : Up
##
10G.
library(class)
train.x = data.frame(week.train$Lag2)
test.x = data.frame(week.test$Lag2)
train.direction = week.train$Direction
train.direction = as.character(train.direction)
set.seed(1)
knn.pred = knn(train.x, test.x, train.direction, k=1)
caret::confusionMatrix(as.factor(direction.test), as.factor(knn.pred), positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 21 22
## Up 30 31
##
## Accuracy : 0.5
## 95% CI : (0.4003, 0.5997)
## No Information Rate : 0.5096
## P-Value [Acc > NIR] : 0.6158
##
## Kappa : -0.0033
##
## Mcnemar's Test P-Value : 0.3317
##
## Sensitivity : 0.5849
## Specificity : 0.4118
## Pos Pred Value : 0.5082
## Neg Pred Value : 0.4884
## Prevalence : 0.5096
## Detection Rate : 0.2981
## Detection Prevalence : 0.5865
## Balanced Accuracy : 0.4983
##
## 'Positive' Class : Up
##
10H.
The logistic regression (using Lag2 only) and the LDA model perform best, each with a test accuracy of 62.5% on the 2009-2010 hold-out data.
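For context (a sketch reusing direction.test from 10D), the majority-class baseline on the same test set is:
table(direction.test)            # Down 43, Up 61
mean(direction.test == "Up")     # 61/104 = 0.587, so 62.5% is only a modest improvement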
10I.
Logistic regression with interaction terms
week.glm3 = glm(Direction ~ Lag1*Lag2 + Lag1*Lag3 + Lag1*Lag4 + Lag1*Lag5 + Lag1*Volume + Lag1*Today + Lag2*Lag3 + Lag2*Lag4 + Lag2*Lag5 + Lag2*Volume + Lag2*Today + Lag3*Lag4 + Lag3*Lag5 + Lag3*Volume + Lag3*Today + Lag4*Lag5 + Lag4*Volume + Lag4*Today + Lag5*Volume + Lag5*Today + Volume*Today, data = week, family = binomial, subset = train)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
week.test$PredProb2 = predict.glm(week.glm3, newdata = week.test, type = "response")
week.test$PredSur2 = ifelse(week.test$PredProb2 >= .5,"Up","Down")
caret::confusionMatrix(as.factor(direction.test), as.factor(week.test$PredSur2), positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 41 2
## Up 2 59
##
## Accuracy : 0.9615
## 95% CI : (0.9044, 0.9894)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9207
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9672
## Specificity : 0.9535
## Pos Pred Value : 0.9672
## Neg Pred Value : 0.9535
## Prevalence : 0.5865
## Detection Rate : 0.5673
## Detection Prevalence : 0.5865
## Balanced Accuracy : 0.9604
##
## 'Positive' Class : Up
##
KNN model with K values of 5, 10, 15, 20
knn.pred5 = knn(train.x, test.x, train.direction, k=5)
caret::confusionMatrix(as.factor(direction.test), as.factor(knn.pred5), positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 15 28
## Up 20 41
##
## Accuracy : 0.5385
## 95% CI : (0.438, 0.6367)
## No Information Rate : 0.6635
## P-Value [Acc > NIR] : 0.9970
##
## Kappa : 0.0216
##
## Mcnemar's Test P-Value : 0.3123
##
## Sensitivity : 0.5942
## Specificity : 0.4286
## Pos Pred Value : 0.6721
## Neg Pred Value : 0.3488
## Prevalence : 0.6635
## Detection Rate : 0.3942
## Detection Prevalence : 0.5865
## Balanced Accuracy : 0.5114
##
## 'Positive' Class : Up
##
knn.pred10 = knn(train.x, test.x, train.direction, k=10)
caret::confusionMatrix(as.factor(direction.test), as.factor(knn.pred10), positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 17 26
## Up 19 42
##
## Accuracy : 0.5673
## 95% CI : (0.4665, 0.6641)
## No Information Rate : 0.6538
## P-Value [Acc > NIR] : 0.9734
##
## Kappa : 0.0859
##
## Mcnemar's Test P-Value : 0.3711
##
## Sensitivity : 0.6176
## Specificity : 0.4722
## Pos Pred Value : 0.6885
## Neg Pred Value : 0.3953
## Prevalence : 0.6538
## Detection Rate : 0.4038
## Detection Prevalence : 0.5865
## Balanced Accuracy : 0.5449
##
## 'Positive' Class : Up
##
knn.pred15 = knn(train.x, test.x, train.direction, k=15)
caret::confusionMatrix(as.factor(direction.test), as.factor(knn.pred15), positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 20 23
## Up 20 41
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.6154
## P-Value [Acc > NIR] : 0.7609
##
## Kappa : 0.1387
##
## Mcnemar's Test P-Value : 0.7604
##
## Sensitivity : 0.6406
## Specificity : 0.5000
## Pos Pred Value : 0.6721
## Neg Pred Value : 0.4651
## Prevalence : 0.6154
## Detection Rate : 0.3942
## Detection Prevalence : 0.5865
## Balanced Accuracy : 0.5703
##
## 'Positive' Class : Up
##
knn.pred20 = knn(train.x, test.x, train.direction, k=20)
caret::confusionMatrix(as.factor(direction.test), as.factor(knn.pred20), positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 21 22
## Up 20 41
##
## Accuracy : 0.5962
## 95% CI : (0.4954, 0.6913)
## No Information Rate : 0.6058
## P-Value [Acc > NIR] : 0.6207
##
## Kappa : 0.1616
##
## Mcnemar's Test P-Value : 0.8774
##
## Sensitivity : 0.6508
## Specificity : 0.5122
## Pos Pred Value : 0.6721
## Neg Pred Value : 0.4884
## Prevalence : 0.6058
## Detection Rate : 0.3942
## Detection Prevalence : 0.5865
## Balanced Accuracy : 0.5815
##
## 'Positive' Class : Up
##
Among the models tried here, the logistic regression that includes all pairwise interaction terms gives by far the highest test accuracy, 96.15%, and its sensitivity and specificity are also much higher than those of the earlier models. For the KNN models, accuracy increases as K increases, reaching 59.62% at K = 20.
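One caveat on the 96.15% figure: the interaction model includes Today as a predictor, and Direction is defined directly from the sign of Today (a week is Up exactly when its return is positive), so that model is effectively handed the answer; the glm.fit warnings about non-convergence and fitted probabilities of 0 or 1 are symptoms of this. A quick check (a sketch, assuming week is still in memory):
table(week$Direction, week$Today > 0)   # Direction should simply track the sign of Today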
Question 11A.
setwd("/Users/shamstabrez/Documents/Algo/Homework1/")
auto = read.csv("Auto.csv", na.strings = "?")
auto = na.omit(auto)
mpg.median = median(auto$mpg)
auto$mpg01 = auto$mpg
auto$mpg01[auto$mpg >= mpg.median] = 1
auto$mpg01[auto$mpg < mpg.median] = 0
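For reference, the same binary response can be built in one line (a sketch; it produces exactly the values assigned above):
auto$mpg01 = as.numeric(auto$mpg >= mpg.median)
table(auto$mpg01)   # roughly half the cars fall on each side of the median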
11B.
str(auto)
## 'data.frame': 392 obs. of 10 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## $ mpg01 : num 0 0 0 0 0 0 0 0 0 0 ...
pairs(auto[c(1:8, 10)])

par(mfrow=c(1,2))
boxplot(auto$cylinders ~ auto$mpg01)
boxplot(auto$displacement ~ auto$mpg01)

boxplot(auto$horsepower ~ auto$mpg01)

boxplot(auto$weight ~ auto$mpg01)

boxplot(auto$acceleration ~ auto$mpg01)

boxplot(auto$year ~ auto$mpg01)

The pairs plot does not show strong linear relationships between mpg01 and the other variables at a glance, but the boxplots are much clearer: more cylinders and higher displacement, horsepower, and weight all increase the likelihood of mpg01 being 0, while lower acceleration and earlier model years are also associated with mpg01 being 0.
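The same pattern shows up in a simple numeric summary (a sketch using the auto data frame built above): group means of the candidate predictors by mpg01.
# Mean of each candidate predictor within the two mpg01 groups.
aggregate(cbind(cylinders, displacement, horsepower, weight, acceleration, year) ~ mpg01,
          data = auto, FUN = mean)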
11C.
set.seed(1)
train=sample(392,310)
auto.train = auto[train,]
auto.test.x = auto[-train,1:9]
auto.test.y = auto[-train, 10]
The data are split into a training set of 310 observations and a test set of the remaining 82.
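A quick sanity check on the split (a sketch):
dim(auto.train)           # 310 rows
table(auto.train$mpg01)   # both classes should be well represented
table(auto.test.y)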
11D.
auto.lm = lm(mpg01 ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data = auto)
summary(auto.lm)
##
## Call:
## lm(formula = mpg01 ~ cylinders + displacement + horsepower +
## weight + acceleration + year + origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93858 -0.15035 0.06735 0.19175 0.90105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.366e-01 4.177e-01 -1.524 0.1284
## cylinders -1.183e-01 2.908e-02 -4.067 5.78e-05 ***
## displacement 3.395e-04 6.760e-04 0.502 0.6158
## horsepower 2.130e-03 1.240e-03 1.718 0.0867 .
## weight -2.873e-04 5.865e-05 -4.899 1.43e-06 ***
## acceleration 2.305e-03 8.891e-03 0.259 0.7956
## year 2.949e-02 4.585e-03 6.433 3.73e-10 ***
## origin 4.683e-02 2.502e-02 1.872 0.0620 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2993 on 384 degrees of freedom
## Multiple R-squared: 0.649, Adjusted R-squared: 0.6426
## F-statistic: 101.4 on 7 and 384 DF, p-value: < 2.2e-16
auto.lda = lda(mpg01 ~ cylinders + weight + year, data = auto.train)
auto.lda
## Call:
## lda(mpg01 ~ cylinders + weight + year, data = auto.train)
##
## Prior probabilities of groups:
## 0 1
## 0.4935484 0.5064516
##
## Group means:
## cylinders weight year
## 0 6.771242 3604.667 74.54248
## 1 4.203822 2346.051 77.59873
##
## Coefficients of linear discriminants:
## LD1
## cylinders -0.4212284939
## weight -0.0009673581
## year 0.1022662664
auto.lda.pred = predict(auto.lda, auto.test.x)
auto.lda.class= auto.lda.pred$class
caret::confusionMatrix(as.factor(auto.test.y), as.factor(auto.lda.class), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 36 7
## 1 0 39
##
## Accuracy : 0.9146
## 95% CI : (0.832, 0.965)
## No Information Rate : 0.561
## P-Value [Acc > NIR] : 2.002e-12
##
## Kappa : 0.8303
##
## Mcnemar's Test P-Value : 0.02334
##
## Sensitivity : 0.8478
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.8372
## Prevalence : 0.5610
## Detection Rate : 0.4756
## Detection Prevalence : 0.4756
## Balanced Accuracy : 0.9239
##
## 'Positive' Class : 1
##
The accuracy of this LDA model on the test set is 91.46%, so it performs well.
11E.
auto.qda = qda(mpg01 ~ cylinders + weight + year, data = auto.train)
auto.qda
## Call:
## qda(mpg01 ~ cylinders + weight + year, data = auto.train)
##
## Prior probabilities of groups:
## 0 1
## 0.4935484 0.5064516
##
## Group means:
## cylinders weight year
## 0 6.771242 3604.667 74.54248
## 1 4.203822 2346.051 77.59873
auto.qda.pred = predict(auto.qda, auto.test.x)
auto.qda.class = auto.qda.pred$class
caret::confusionMatrix(as.factor(auto.test.y), as.factor(auto.qda.class), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 38 5
## 1 0 39
##
## Accuracy : 0.939
## 95% CI : (0.8634, 0.9799)
## No Information Rate : 0.5366
## P-Value [Acc > NIR] : 9.57e-16
##
## Kappa : 0.8785
##
## Mcnemar's Test P-Value : 0.07364
##
## Sensitivity : 0.8864
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.8837
## Prevalence : 0.5366
## Detection Rate : 0.4756
## Detection Prevalence : 0.4756
## Balanced Accuracy : 0.9432
##
## 'Positive' Class : 1
##
The QDA model is slightly better than the LDA model fitted earlier, with a test accuracy of 93.9% compared to 91.46% for LDA.
11F.
auto.log = glm(mpg01 ~ cylinders + weight + year, data = auto.train, family = binomial)
auto.PredProb = predict.glm(auto.log, newdata = auto.test.x, type = "response")
auto.PredSur = ifelse(auto.PredProb >= .5,1,0)
caret::confusionMatrix(as.factor(auto.test.y), as.factor(auto.PredSur), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 39 4
## 1 3 36
##
## Accuracy : 0.9146
## 95% CI : (0.832, 0.965)
## No Information Rate : 0.5122
## P-Value [Acc > NIR] : 4.455e-15
##
## Kappa : 0.8291
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9000
## Specificity : 0.9286
## Pos Pred Value : 0.9231
## Neg Pred Value : 0.9070
## Prevalence : 0.4878
## Detection Rate : 0.4390
## Detection Prevalence : 0.4756
## Balanced Accuracy : 0.9143
##
## 'Positive' Class : 1
##
The accuracy of this logistic regression model is 91.46%, the same as the LDA model.
11G.
auto.knn.train.x = auto.train[c(2,5,7)]
auto.knn.train.y = auto.train[,10]
auto.knn.test.x = auto.test.x[c(2,5,7)]
set.seed(1)
auto.knn.pred = knn(auto.knn.train.x, auto.knn.test.x, auto.knn.train.y, k =1)
caret::confusionMatrix(as.factor(auto.test.y), as.factor(auto.knn.pred), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 38 5
## 1 5 34
##
## Accuracy : 0.878
## 95% CI : (0.7871, 0.9399)
## No Information Rate : 0.5244
## P-Value [Acc > NIR] : 9.717e-12
##
## Kappa : 0.7555
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8718
## Specificity : 0.8837
## Pos Pred Value : 0.8718
## Neg Pred Value : 0.8837
## Prevalence : 0.4756
## Detection Rate : 0.4146
## Detection Prevalence : 0.4756
## Balanced Accuracy : 0.8778
##
## 'Positive' Class : 1
##
auto.knn.pred5 = knn(auto.knn.train.x, auto.knn.test.x, auto.knn.train.y, k =5)
caret::confusionMatrix(as.factor(auto.test.y), as.factor(auto.knn.pred5), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 38 5
## 1 5 34
##
## Accuracy : 0.878
## 95% CI : (0.7871, 0.9399)
## No Information Rate : 0.5244
## P-Value [Acc > NIR] : 9.717e-12
##
## Kappa : 0.7555
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8718
## Specificity : 0.8837
## Pos Pred Value : 0.8718
## Neg Pred Value : 0.8837
## Prevalence : 0.4756
## Detection Rate : 0.4146
## Detection Prevalence : 0.4756
## Balanced Accuracy : 0.8778
##
## 'Positive' Class : 1
##
auto.knn.pred10 = knn(auto.knn.train.x, auto.knn.test.x, auto.knn.train.y, k =10)
caret::confusionMatrix(as.factor(auto.test.y), as.factor(auto.knn.pred10), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 38 5
## 1 5 34
##
## Accuracy : 0.878
## 95% CI : (0.7871, 0.9399)
## No Information Rate : 0.5244
## P-Value [Acc > NIR] : 9.717e-12
##
## Kappa : 0.7555
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8718
## Specificity : 0.8837
## Pos Pred Value : 0.8718
## Neg Pred Value : 0.8837
## Prevalence : 0.4756
## Detection Rate : 0.4146
## Detection Prevalence : 0.4756
## Balanced Accuracy : 0.8778
##
## 'Positive' Class : 1
##
auto.knn.pred15 = knn(auto.knn.train.x, auto.knn.test.x, auto.knn.train.y, k =15)
caret::confusionMatrix(as.factor(auto.test.y), as.factor(auto.knn.pred15), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 6
## 1 5 34
##
## Accuracy : 0.8659
## 95% CI : (0.7726, 0.9311)
## No Information Rate : 0.5122
## P-Value [Acc > NIR] : 1.449e-11
##
## Kappa : 0.7314
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8500
## Specificity : 0.8810
## Pos Pred Value : 0.8718
## Neg Pred Value : 0.8605
## Prevalence : 0.4878
## Detection Rate : 0.4146
## Detection Prevalence : 0.4756
## Balanced Accuracy : 0.8655
##
## 'Positive' Class : 1
##
auto.knn.pred20 = knn(auto.knn.train.x, auto.knn.test.x, auto.knn.train.y, k =20)
caret::confusionMatrix(as.factor(auto.test.y), as.factor(auto.knn.pred20), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 36 7
## 1 5 34
##
## Accuracy : 0.8537
## 95% CI : (0.7583, 0.922)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 2.054e-11
##
## Kappa : 0.7073
##
## Mcnemar's Test P-Value : 0.7728
##
## Sensitivity : 0.8293
## Specificity : 0.8780
## Pos Pred Value : 0.8718
## Neg Pred Value : 0.8372
## Prevalence : 0.5000
## Detection Rate : 0.4146
## Detection Prevalence : 0.4756
## Balanced Accuracy : 0.8537
##
## 'Positive' Class : 1
##
auto.knn.pred25 = knn(auto.knn.train.x, auto.knn.test.x, auto.knn.train.y, k =25)
caret::confusionMatrix(as.factor(auto.test.y), as.factor(auto.knn.pred25), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 6
## 1 5 34
##
## Accuracy : 0.8659
## 95% CI : (0.7726, 0.9311)
## No Information Rate : 0.5122
## P-Value [Acc > NIR] : 1.449e-11
##
## Kappa : 0.7314
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8500
## Specificity : 0.8810
## Pos Pred Value : 0.8718
## Neg Pred Value : 0.8605
## Prevalence : 0.4878
## Detection Rate : 0.4146
## Detection Prevalence : 0.4756
## Balanced Accuracy : 0.8655
##
## 'Positive' Class : 1
##
For KNN, the best test accuracy is 87.8%, obtained with K = 1, 5, and 10; larger values of K (15, 20, 25) give slightly lower accuracy.
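The same comparison can be run in one pass instead of one call per K (a sketch reusing the objects above; because knn breaks ties at random, the numbers may differ very slightly from the individual runs):
set.seed(1)
ks  = c(1, 5, 10, 15, 20, 25)
acc = sapply(ks, function(k) {
  pred = class::knn(auto.knn.train.x, auto.knn.test.x, auto.knn.train.y, k = k)
  mean(pred == auto.test.y)        # pred is a factor of "0"/"1"; the comparison coerces to character
})
data.frame(k = ks, accuracy = round(acc, 4))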
13.
boston = Boston
crim.median = median(boston$crim)
boston$crim.above.med = boston$crim
boston$crim.above.med[boston$crim >= crim.median] = 1
boston$crim.above.med[boston$crim < crim.median] = 0
set.seed(1)
train = sample(506, 400)
boston.train = boston[train,]
boston.test.x = boston[-train,1:14]
boston.test.y = boston[-train, 15]
Logistic regression model
boston.log = glm(crim.above.med ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + lstat + medv, data = boston.train, family = binomial)
summary(boston.log)
##
## Call:
## glm(formula = crim.above.med ~ zn + indus + chas + nox + rm +
## age + dis + rad + tax + ptratio + black + lstat + medv, family = binomial,
## data = boston.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0405 -0.1412 -0.0002 0.0017 3.6053
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -42.413075 7.663745 -5.534 3.13e-08 ***
## zn -0.105480 0.042747 -2.468 0.013605 *
## indus -0.064406 0.051401 -1.253 0.210203
## chas 0.471499 0.781593 0.603 0.546340
## nox 56.009198 9.235915 6.064 1.33e-09 ***
## rm -0.151684 0.850862 -0.178 0.858511
## age 0.021783 0.013687 1.591 0.111499
## dis 1.062174 0.277926 3.822 0.000132 ***
## rad 0.681962 0.175689 3.882 0.000104 ***
## tax -0.007108 0.003163 -2.248 0.024607 *
## ptratio 0.361211 0.147963 2.441 0.014637 *
## black -0.010394 0.005750 -1.808 0.070671 .
## lstat 0.080476 0.056675 1.420 0.155626
## medv 0.169017 0.080670 2.095 0.036155 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 554.52 on 399 degrees of freedom
## Residual deviance: 159.23 on 386 degrees of freedom
## AIC: 187.23
##
## Number of Fisher Scoring iterations: 9
boston.log.PredProb = predict.glm(boston.log, newdata = boston.test.x, type = "response")
boston.log.PredSur = ifelse(boston.log.PredProb >= .5,1,0)
caret::confusionMatrix(as.factor(boston.test.y), as.factor(boston.log.PredSur), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 45 8
## 1 5 48
##
## Accuracy : 0.8774
## 95% CI : (0.7994, 0.9331)
## No Information Rate : 0.5283
## P-Value [Acc > NIR] : 1.817e-14
##
## Kappa : 0.7547
##
## Mcnemar's Test P-Value : 0.5791
##
## Sensitivity : 0.8571
## Specificity : 0.9000
## Pos Pred Value : 0.9057
## Neg Pred Value : 0.8491
## Prevalence : 0.5283
## Detection Rate : 0.4528
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8786
##
## 'Positive' Class : 1
##
boston.log2 = glm(crim.above.med ~ zn + nox + age + dis + rad + tax + ptratio + black + medv, data = boston.train, family = binomial)
summary(boston.log2)
##
## Call:
## glm(formula = crim.above.med ~ zn + nox + age + dis + rad + tax +
## ptratio + black + medv, family = binomial, data = boston.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9541 -0.1624 -0.0002 0.0015 3.6164
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -39.093913 7.230444 -5.407 6.41e-08 ***
## zn -0.104871 0.039008 -2.688 0.00718 **
## nox 50.727147 8.131533 6.238 4.42e-10 ***
## age 0.026352 0.011576 2.276 0.02282 *
## dis 0.999550 0.269695 3.706 0.00021 ***
## rad 0.759663 0.163470 4.647 3.37e-06 ***
## tax -0.008527 0.002964 -2.877 0.00402 **
## ptratio 0.350180 0.134863 2.597 0.00942 **
## black -0.010352 0.005822 -1.778 0.07538 .
## medv 0.128555 0.040044 3.210 0.00133 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 554.52 on 399 degrees of freedom
## Residual deviance: 163.07 on 390 degrees of freedom
## AIC: 183.07
##
## Number of Fisher Scoring iterations: 9
boston.log.PredProb2 = predict.glm(boston.log2, newdata = boston.test.x, type = "response")
boston.log.PredSur2 = ifelse(boston.log.PredProb2 >= .5,1,0)
caret::confusionMatrix(as.factor(boston.test.y), as.factor(boston.log.PredSur2), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 47 6
## 1 6 47
##
## Accuracy : 0.8868
## 95% CI : (0.8106, 0.9401)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7736
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8868
## Specificity : 0.8868
## Pos Pred Value : 0.8868
## Neg Pred Value : 0.8868
## Prevalence : 0.5000
## Detection Rate : 0.4434
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8868
##
## 'Positive' Class : 1
##
LDA Model
boston.lda = lda(crim.above.med ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + lstat + medv, data = boston.train)
boston.lda.pred = predict(boston.lda, boston.test.x)
boston.lda.class= boston.lda.pred$class
caret::confusionMatrix(as.factor(boston.test.y), as.factor(boston.lda.class), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 52 1
## 1 11 42
##
## Accuracy : 0.8868
## 95% CI : (0.8106, 0.9401)
## No Information Rate : 0.5943
## P-Value [Acc > NIR] : 3.067e-11
##
## Kappa : 0.7736
##
## Mcnemar's Test P-Value : 0.009375
##
## Sensitivity : 0.9767
## Specificity : 0.8254
## Pos Pred Value : 0.7925
## Neg Pred Value : 0.9811
## Prevalence : 0.4057
## Detection Rate : 0.3962
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9011
##
## 'Positive' Class : 1
##
boston.lda2 = lda(crim.above.med ~ zn + nox + age + dis + rad + tax + ptratio + black + medv, data = boston.train)
boston.lda.pred2 = predict(boston.lda2, boston.test.x)
boston.lda.class2= boston.lda.pred2$class
caret::confusionMatrix(as.factor(boston.test.y), as.factor(boston.lda.class2), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 52 1
## 1 13 40
##
## Accuracy : 0.8679
## 95% CI : (0.7883, 0.9259)
## No Information Rate : 0.6132
## P-Value [Acc > NIR] : 6.666e-09
##
## Kappa : 0.7358
##
## Mcnemar's Test P-Value : 0.003283
##
## Sensitivity : 0.9756
## Specificity : 0.8000
## Pos Pred Value : 0.7547
## Neg Pred Value : 0.9811
## Prevalence : 0.3868
## Detection Rate : 0.3774
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8878
##
## 'Positive' Class : 1
##
QDA Model
boston.qda = qda(crim.above.med ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + lstat + medv, data = boston.train)
boston.qda.pred = predict(boston.qda, boston.test.x)
boston.qda.class = boston.qda.pred$class
caret::confusionMatrix(as.factor(boston.test.y), as.factor(boston.qda.class), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 52 1
## 1 9 44
##
## Accuracy : 0.9057
## 95% CI : (0.8333, 0.9538)
## No Information Rate : 0.5755
## P-Value [Acc > NIR] : 6.437e-14
##
## Kappa : 0.8113
##
## Mcnemar's Test P-Value : 0.02686
##
## Sensitivity : 0.9778
## Specificity : 0.8525
## Pos Pred Value : 0.8302
## Neg Pred Value : 0.9811
## Prevalence : 0.4245
## Detection Rate : 0.4151
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9151
##
## 'Positive' Class : 1
##
boston.qda2 = qda(crim.above.med ~ zn + nox + age + dis + rad + tax + ptratio + black + medv, data = boston.train)
boston.qda.pred2 = predict(boston.qda2, boston.test.x)
boston.qda.class2 = boston.qda.pred2$class
caret::confusionMatrix(as.factor(boston.test.y), as.factor(boston.qda.class2), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 52 1
## 1 11 42
##
## Accuracy : 0.8868
## 95% CI : (0.8106, 0.9401)
## No Information Rate : 0.5943
## P-Value [Acc > NIR] : 3.067e-11
##
## Kappa : 0.7736
##
## Mcnemar's Test P-Value : 0.009375
##
## Sensitivity : 0.9767
## Specificity : 0.8254
## Pos Pred Value : 0.7925
## Neg Pred Value : 0.9811
## Prevalence : 0.4057
## Detection Rate : 0.3962
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9011
##
## 'Positive' Class : 1
##
KNN Model
boston.knn.train.x = boston.train[c(2:13)]
boston.knn.train.y = boston.train[,15]
boston.knn.test.x = boston.test.x[,2:13]
set.seed(1)
boston.knn.pred = knn(boston.knn.train.x, boston.knn.test.x, boston.knn.train.y, k =1)
caret::confusionMatrix(as.factor(boston.test.y), as.factor(boston.knn.pred), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 50 3
## 1 4 49
##
## Accuracy : 0.934
## 95% CI : (0.8687, 0.973)
## No Information Rate : 0.5094
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8679
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9423
## Specificity : 0.9259
## Pos Pred Value : 0.9245
## Neg Pred Value : 0.9434
## Prevalence : 0.4906
## Detection Rate : 0.4623
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9341
##
## 'Positive' Class : 1
##
boston.knn.pred = knn(boston.knn.train.x, boston.knn.test.x, boston.knn.train.y, k =5)
caret::confusionMatrix(as.factor(boston.test.y), as.factor(boston.knn.pred), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 45 8
## 1 4 49
##
## Accuracy : 0.8868
## 95% CI : (0.8106, 0.9401)
## No Information Rate : 0.5377
## P-Value [Acc > NIR] : 1.155e-14
##
## Kappa : 0.7736
##
## Mcnemar's Test P-Value : 0.3865
##
## Sensitivity : 0.8596
## Specificity : 0.9184
## Pos Pred Value : 0.9245
## Neg Pred Value : 0.8491
## Prevalence : 0.5377
## Detection Rate : 0.4623
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8890
##
## 'Positive' Class : 1
##
boston.knn.pred = knn(boston.knn.train.x, boston.knn.test.x, boston.knn.train.y, k =10)
caret::confusionMatrix(as.factor(boston.test.y), as.factor(boston.knn.pred), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 44 9
## 1 4 49
##
## Accuracy : 0.8774
## 95% CI : (0.7994, 0.9331)
## No Information Rate : 0.5472
## P-Value [Acc > NIR] : 2.833e-13
##
## Kappa : 0.7547
##
## Mcnemar's Test P-Value : 0.2673
##
## Sensitivity : 0.8448
## Specificity : 0.9167
## Pos Pred Value : 0.9245
## Neg Pred Value : 0.8302
## Prevalence : 0.5472
## Detection Rate : 0.4623
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8807
##
## 'Positive' Class : 1
##
We fitted each model first with all thirteen predictors and then with a reduced set that drops indus, chas, rm, and lstat. The reduced logistic regression reaches 88.68% test accuracy, better than the 87.74% of the full logistic model, and the full LDA model also reaches 88.68%. The full QDA model does better still at 90.57%. Finally, KNN with K = 1 is the best model overall, with a test accuracy of 93.4%.
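For reference, the test accuracies reported above collected in one place (values copied from the confusion matrices; nothing is re-fitted here):
data.frame(
  model = c("Logistic (all)", "Logistic (reduced)", "LDA (all)", "LDA (reduced)",
            "QDA (all)", "QDA (reduced)", "KNN, K = 1", "KNN, K = 5", "KNN, K = 10"),
  test.accuracy = c(0.8774, 0.8868, 0.8868, 0.8679, 0.9057, 0.8868, 0.9340, 0.8868, 0.8774)
)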