10. This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

Load the ISLR, MASS, and class libraries

library(ISLR)
library(MASS)
## Warning: package 'MASS' was built under R version 3.6.2
library(class)
## Warning: package 'class' was built under R version 3.6.2

Get the structure of the Weekly data set

str(Weekly)
## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...

Let's look at a summary of the Weekly data set

summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 

Quick read:
Observations (n): 1089
Columns: 9
Response variable: Direction

10(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

Let's produce a scatterplot matrix

pairs(Weekly)

At first glance, the relationships between the lag variables and Volume/Today/Year look weak.
Year and Volume, on the other hand, appear strongly related.
Let's print the correlation matrix to confirm this, removing the non-numeric column "Direction" first.

cor(Weekly[,-9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

The correlation values confirm our initial read from the scatterplot matrix. In other words, there appears to be little correlation between this week's return and the previous weeks' returns.
Year and Volume, however, do appear related. Let's build a plot to see this.

boxplot(Volume~Year,data=Weekly,main="Average number of daily shares traded across Years")

By plotting the data we see that Volume increases over time, with a dip in 2010. In other words, the average number of shares traded grew from 1990 to 2009 and then fell in 2010.

From the summary, we noticed that the data set contains more observations with Direction equal to Up. Let's look at the proportions.

table(Weekly$Direction)/sum(table(Weekly$Direction))
## 
##      Down        Up 
## 0.4444444 0.5555556

This indicates that roughly 44% of the weeks are Down and 56% are Up. Let's compare Direction with Today to understand the relationship between the two.

head(Weekly[,c(8,9)])
##    Today Direction
## 1 -0.270      Down
## 2 -2.576      Down
## 3  3.514        Up
## 4  0.712        Up
## 5  1.178        Up
## 6 -1.372      Down

Based on the above rows, Direction is simply Up when Today (this week's percentage return) is positive and Down when it is negative.
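
A quick way to verify this (a sketch, output not shown) is to cross-tabulate the sign of Today against Direction:

# If Direction is determined by the sign of Today, the off-diagonal counts will be zero
table(ifelse(Weekly$Today > 0, "Up", "Down"), Weekly$Direction)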

10(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Based on the above model, Lag2 (p-value 0.0296) is the only predictor significant at the 5% level. All other predictors appear to be insignificant.
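
The same conclusion can be read off programmatically from the fitted model; a minimal sketch:

# Extract the p-values and keep those below 0.05 (note the intercept is included too)
pvals <- coef(summary(glm.fit))[, "Pr(>|z|)"]
pvals[pvals < 0.05]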

10(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

Let's build the confusion matrix:
Rows -> predicted direction
Columns -> actual direction (training data)

glm.probs <- predict(glm.fit,type = "response")
glm.pred <- rep("Down",nrow(Weekly))
glm.pred[glm.probs>0.5] = "Up"

table(glm.pred,Weekly$Direction)
##         
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557

Let's compute the accuracy (correct prediction rate):

mean(glm.pred == Weekly$Direction)
## [1] 0.5610652

Calculating the accuracy rate from the confusion matrix:

(54+557)/(54+557+48+430)
## [1] 0.5610652

Model prediction rate when the market is Up:

557/(48+557)
## [1] 0.9206612

Model prediction rate when the market is Down:

54/(54+430)
## [1] 0.1115702

We may conclude that the percentage of correct predictions on the training data is (54+557)/1089, which is equal to 56.1%. In other words, the training error rate is 43.9%, and since it is computed on the training data it is likely overly optimistic. We could also say that for weeks when the market goes up, the model is right 92.0661157% of the time (557/(48+557)), i.e. there are very few false negatives (Up weeks predicted as Down).

However, when the market is actually down, the model is right only 11.1570248% of the time (54/(54+430)); most Down weeks are predicted as Up, so the model produces heavy false positives, which is a concern.
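
Rather than typing the counts by hand, the same rates can be computed directly from the table object; a minimal sketch:

# Confusion matrix of training predictions (rows) versus actual direction (columns)
conf <- table(glm.pred, Weekly$Direction)
sum(diag(conf)) / sum(conf)                  # overall accuracy
conf["Up", "Up"] / sum(conf[, "Up"])         # correct rate when the market is actually Up
conf["Down", "Down"] / sum(conf[, "Down"])   # correct rate when the market is actually Down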

10(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

Fit the model for the training data set i.e. from 1990 to 2008.

train <- (Weekly$Year < 2009)
Weekly.20092010 <- Weekly[!train, ]
Direction.20092010 <- Weekly$Direction[!train]
glm.fit.subset <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
summary(glm.fit.subset)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly, 
##     subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.536  -1.264   1.021   1.091   1.368  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4

Build the confusion matrix for the held out data i.e. test data set.

glm.probs <- predict(glm.fit.subset,Weekly.20092010,type = "response")

glm.pred <- rep("Down",nrow(Weekly.20092010))
glm.pred[glm.probs>0.5] = "Up"

table(glm.pred,Direction.20092010)
##         Direction.20092010
## glm.pred Down Up
##     Down    9  5
##     Up     34 56

Check the fraction of correct predictions for the held out data:

mean(glm.pred == Direction.20092010)
## [1] 0.625
mean(glm.pred != Direction.20092010)
## [1] 0.375
56/(56+5)
## [1] 0.9180328
9/(9+34)
## [1] 0.2093023

In this case, we may conclude that the percentage of correct predictions on the test data is (9+56)/104, which is equal to 62.5%. In other words, the test error rate is 37.5%. We could also say that for weeks when the market goes up, the model is right 91.8032787% of the time (56/(56+5)). For weeks when the market goes down, the model is right only 20.9302326% of the time (9/(9+34)). Though the overall accuracy rate improved, the number of false positives (Down weeks predicted as Up) is still high.

10(e) Repeat (d) using LDA.

Fit the model with LDA.

lda.fit <- lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162

Build the confusion matrix.

lda.pred <- predict(lda.fit, Weekly.20092010)
table(lda.pred$class, Direction.20092010)
##       Direction.20092010
##        Down Up
##   Down    9  5
##   Up     34 56

Check the fraction of correct predictions for the held out data:

mean(lda.pred$class == Direction.20092010)
## [1] 0.625

The LDA and logistic regression predictions are identical.
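
This can be verified directly by comparing the two prediction vectors (a sketch, assuming glm.pred still holds the test-set predictions from part (d)):

# TRUE if LDA and logistic regression agree on every test week
all(as.character(lda.pred$class) == glm.pred)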

10(f) Repeat (d) using QDA.

Let's fit the model.

qda.fit <- qda(Direction ~ Lag2, data = Weekly, subset = train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581

Build the confusion matrix.

qda.pred <- predict(qda.fit, Weekly.20092010)
table(qda.pred$class, Direction.20092010)
##       Direction.20092010
##        Down Up
##   Down    0  0
##   Up     43 61

Check the fraction of correct predictions for the held out data:

mean(qda.pred$class == Direction.20092010)
## [1] 0.5865385

We can conclude that the percentage of correct predictions on the test data is 58.6538462%. In other words, the test error rate is 41.3461538%. As there are no false negatives, for weeks when the market goes up the model is right 100% of the time, while for weeks when the market goes down it is right 0% of the time (every Down week is a false positive). Note that QDA achieves 58.6538462% accuracy even though it predicts "Up" for every week.
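
One way to see this is to compare the QDA accuracy against a naive rule that always predicts Up over the test period; a quick sketch:

# Proportion of Up weeks in 2009-2010; an always-"Up" rule achieves exactly this accuracy
mean(Direction.20092010 == "Up")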

10(g) Repeat (d) using KNN with K = 1.
train.X <- cbind(Weekly$Lag2)[train,]
test.X <- cbind(Weekly$Lag2)[!train,]
train.Direction <- Weekly$Direction[train]
set.seed(1)
knn.pred <- knn(data.frame(train.X), data.frame(test.X), train.Direction, k = 1)
table(knn.pred, Direction.20092010)
##         Direction.20092010
## knn.pred Down Up
##     Down   21 30
##     Up     22 31
mean(knn.pred == Direction.20092010)
## [1] 0.5
31/(30+31)
## [1] 0.5081967
21/(21+22)
## [1] 0.4883721

With KNN, we may conclude that the percentage of correct predictions on the test data is 50%. In other words, the test error rate is 50%. For weeks when the market goes up, the model is right 50.8196721% of the time, and for weeks when the market goes down it is right only 48.8372093% of the time. The model produces considerable numbers of both false positives and false negatives.

10(h) Which of these methods appears to provide the best results on this data?

Logistic regression and LDA appear to be the best, with an accuracy rate of 62.5%.
They do mislabel most Down weeks as Up (many false positives), which would mean being invested during losing weeks, but they miss very few Up weeks: when the market actually rises they call it correctly more than 90% of the time, so few opportunities are lost.
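
For a side-by-side view, the held-out accuracies can be gathered into one vector (a sketch, assuming the prediction objects from parts (d)-(g) are still in the workspace; knn.pred here is the K = 1 fit):

# Collect and sort the test accuracies of the four classifiers
sort(c(Logistic = mean(glm.pred == Direction.20092010),
       LDA      = mean(lda.pred$class == Direction.20092010),
       QDA      = mean(qda.pred$class == Direction.20092010),
       KNN      = mean(knn.pred == Direction.20092010)),
     decreasing = TRUE)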

10(h)(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

Trying KNN with K = 3

# KNN with K=3
knn.pred <- knn(data.frame(train.X), data.frame(test.X), train.Direction, k = 3)
table(knn.pred, Direction.20092010)
##         Direction.20092010
## knn.pred Down Up
##     Down   16 19
##     Up     27 42
mean(knn.pred == Direction.20092010)
## [1] 0.5576923

Trying KNN with K = 5

# KNN with K=5
knn.pred <- knn(data.frame(train.X), data.frame(test.X), train.Direction, k = 5)
mean(knn.pred == Direction.20092010)
## [1] 0.5384615

Increasing K to 3 gives KNN a better accuracy rate (55.8% versus 50% at K = 1), but increasing K further does not help, and none of the KNN fits matches the original LDA/logistic regression models.
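
Instead of fitting each K by hand, a small loop can sweep over several values (a sketch using the train.X, test.X and train.Direction objects from part (g); set.seed is reset each time because knn() breaks ties at random):

# Held-out accuracy for a range of K values
for (k in c(1, 3, 5, 10, 20, 50)) {
  set.seed(1)
  pred <- knn(data.frame(train.X), data.frame(test.X), train.Direction, k = k)
  cat("K =", k, " accuracy =", round(mean(pred == Direction.20092010), 4), "\n")
}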

Let's try logistic regression using the interaction term Lag1:Lag2 as the predictor, since Lag1 has the lowest p-value (0.1181) after Lag2.

# Logistic regression with Lag2:Lag1
glm.fit.L2L1 <- glm(Direction ~ Lag2:Lag1, data = Weekly, family = binomial, subset = train)
glm.probs.L2L1 <- predict(glm.fit.L2L1, Weekly.20092010, type = "response")
glm.pred.L2L1 <- rep("Down", length(glm.probs.L2L1))
glm.pred.L2L1[glm.probs.L2L1 > 0.5] = "Up"
table(glm.pred.L2L1, Direction.20092010)
##              Direction.20092010
## glm.pred.L2L1 Down Up
##          Down    1  1
##          Up     42 60
mean(glm.pred.L2L1 == Direction.20092010)
## [1] 0.5865385

The prediction accuracy rate is 58.65%, which is worse than the 62.5% of the test model with just Lag2 as the predictor.

Let's fit the LDA model with the same interaction term.

# LDA with Lag2:Lag1 interaction term
lda.fit.L2L1 <- lda(Direction ~ Lag2:Lag1, data = Weekly, subset = train)
lda.pred.L2L1 <- predict(lda.fit.L2L1, Weekly.20092010)
mean(lda.pred.L2L1$class == Direction.20092010)
## [1] 0.5769231

The prediction accuracy rate is 57.69%, which is worse than the test LDA model with just Lag2 as the predictor.

# QDA with Sqrt function
qda.fit.sqrt <- qda(Direction ~ Lag2 + sqrt(abs(Lag2)), data = Weekly, subset = train)
qda.fit.sqrt
## Call:
## qda(Direction ~ Lag2 + sqrt(abs(Lag2)), data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2 sqrt(abs(Lag2))
## Down -0.03568254        1.140078
## Up    0.26036581        1.169635
qda.pred.sqrt <- predict(qda.fit.sqrt, Weekly.20092010)
table(qda.pred.sqrt$class, Direction.20092010)
##       Direction.20092010
##        Down Up
##   Down   12 13
##   Up     31 48
mean(qda.pred.sqrt$class == Direction.20092010)
## [1] 0.5769231

The prediction accuracy rate is 57.69%, which is still below the 62.5% of the models with just Lag2 as the predictor.

Out of all the new combinations tried, the original logistic regression and LDA have the best performance (62.5%).

11. In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

11(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

# Let's create a new data set, Auto1, with mpg01 added along with all columns of Auto.
mpg01 <- rep(0, nrow(Auto))
mpg01[Auto[,'mpg']>median(Auto[,'mpg'])] <- 1

Auto1 <- data.frame(Auto, mpg01)
str(Auto1)
## 'data.frame':    392 obs. of  10 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  $ mpg01       : num  0 0 0 0 0 0 0 0 0 0 ...
summary(Auto1)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365  
##      mpg01    
##  Min.   :0.0  
##  1st Qu.:0.0  
##  Median :0.5  
##  Mean   :0.5  
##  3rd Qu.:1.0  
##  Max.   :1.0  
## 
table(Auto1$mpg01)
## 
##   0   1 
## 196 196
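
As an aside, the same indicator can be built in a single line; a quick sketch to check equivalence with the construction above:

# Should return TRUE: the one-liner reproduces the mpg01 indicator built above
identical(mpg01, as.numeric(Auto$mpg > median(Auto$mpg)))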

11(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

pairs(Auto1[,-9])

cor(Auto1[,-9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
## mpg01         0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566
##              acceleration       year     origin      mpg01
## mpg             0.4233285  0.5805410  0.5652088  0.8369392
## cylinders      -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement   -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower     -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight         -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration    1.0000000  0.2903161  0.2127458  0.3468215
## year            0.2903161  1.0000000  0.1815277  0.4299042
## origin          0.2127458  0.1815277  1.0000000  0.5136984
## mpg01           0.3468215  0.4299042  0.5136984  1.0000000

The correlation matrix shows that mpg01 has a strong (negative) association with cylinders, displacement, horsepower and weight.

Year and origin also show a moderate association with mpg01.

Let's draw boxplots to see the same.

par(mfrow = c(2,3))
boxplot(weight~mpg01,Auto1,main="Weight~mpg01")
boxplot(horsepower~mpg01,Auto1,main="Horsepower~mpg01")
boxplot(displacement~mpg01,Auto1,main="Displacement~mpg01")
boxplot(acceleration~mpg01,Auto1,main="Acceleration~mpg01")
boxplot(origin~mpg01,Auto1,main="Origin~mpg01")
boxplot(cylinders~mpg01,Auto1,main="Cylinders~mpg01")

boxplot(year~mpg01,Auto1,main="Year~mpg01")

Based on the correlation matrix and boxplots, there is clear evidence of an association between mpg01 and horsepower, displacement, cylinders and weight.
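
A numerical counterpart to these boxplots is to compare group means of the candidate predictors across the two mpg01 classes; a minimal sketch:

# Mean of each candidate predictor within the mpg01 = 0 and mpg01 = 1 groups
aggregate(cbind(cylinders, displacement, horsepower, weight) ~ mpg01,
          data = Auto1, FUN = mean)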

11(c) Split the data into a training set and a test set.

The data is evenly split, with 196 observations having mpg01 = 0 and 196 having mpg01 = 1. We split on model year: observations from even years form the training set and those from odd years form the test set.

train <- (Auto1$year %% 2 == 0)
Auto1.train <- Auto1[train, ]
Auto1.test <- Auto1[!train, ]
mpg01.test <- mpg01[!train]
# training data set observations count
nrow(Auto1.train)
## [1] 210
# test data set observations count
nrow(Auto1.test)
## [1] 182

11(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

fit.lda <- lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = Auto1, subset = train)
fit.lda
## Call:
## lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = Auto1, 
##     subset = train)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4571429 0.5428571 
## 
## Group means:
##   cylinders   weight displacement horsepower
## 0  6.812500 3604.823     271.7396  133.14583
## 1  4.070175 2314.763     111.6623   77.92105
## 
## Coefficients of linear discriminants:
##                        LD1
## cylinders    -0.6741402638
## weight       -0.0011465750
## displacement  0.0004481325
## horsepower    0.0059035377
pred.lda <- predict(fit.lda, Auto1.test)
table(pred.lda$class, mpg01.test)
##    mpg01.test
##      0  1
##   0 86  9
##   1 14 73
#LDA test error rate
mean(pred.lda$class != mpg01.test)
## [1] 0.1263736

The LDA test error rate is 12.63736%.

11(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

#perform QDA
fit.qda <- qda(mpg01 ~ cylinders + weight + displacement + horsepower, data = Auto1, subset = train)
fit.qda
## Call:
## qda(mpg01 ~ cylinders + weight + displacement + horsepower, data = Auto1, 
##     subset = train)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4571429 0.5428571 
## 
## Group means:
##   cylinders   weight displacement horsepower
## 0  6.812500 3604.823     271.7396  133.14583
## 1  4.070175 2314.763     111.6623   77.92105
pred.qda <- predict(fit.qda, Auto1.test)
table(pred.qda$class, mpg01.test)
##    mpg01.test
##      0  1
##   0 89 13
##   1 11 69
# QDA test error rate
mean(pred.qda$class != mpg01.test)
## [1] 0.1318681

The QDA test error rate is 13.18681%.

11(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

fit.glm <- glm(mpg01 ~ cylinders + weight + displacement + horsepower, data = Auto1, family = binomial, subset = train)
summary(fit.glm)
## 
## Call:
## glm(formula = mpg01 ~ cylinders + weight + displacement + horsepower, 
##     family = binomial, data = Auto1, subset = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.48027  -0.03413   0.10583   0.29634   2.57584  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  17.658730   3.409012   5.180 2.22e-07 ***
## cylinders    -1.028032   0.653607  -1.573   0.1158    
## weight       -0.002922   0.001137  -2.569   0.0102 *  
## displacement  0.002462   0.015030   0.164   0.8699    
## horsepower   -0.050611   0.025209  -2.008   0.0447 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 289.58  on 209  degrees of freedom
## Residual deviance:  83.24  on 205  degrees of freedom
## AIC: 93.24
## 
## Number of Fisher Scoring iterations: 7
probs.glm <- predict(fit.glm, Auto1.test, type = "response")
pred.glm <- rep(0, length(probs.glm))
pred.glm[probs.glm > 0.5] <- 1
table(pred.glm, mpg01.test)
##         mpg01.test
## pred.glm  0  1
##        0 89 11
##        1 11 71
#Logistic Regression test error rate
mean(pred.glm != mpg01.test)
## [1] 0.1208791

The Logistic Regression test error rate is 12.08791%.

11(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

# Perform KNN test with k=1
train.X <- cbind(Auto1$cylinders, Auto1$weight, Auto1$displacement, Auto1$horsepower)[train, ]
test.X <- cbind(Auto1$cylinders, Auto1$weight, Auto1$displacement, Auto1$horsepower)[!train, ]
train.mpg01 <- Auto1$mpg01[train]

set.seed(1)
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.mpg01, k = 1)
table(pred.knn, mpg01.test)
##         mpg01.test
## pred.knn  0  1
##        0 83 11
##        1 17 71
# KNN with k=1, error rate
mean(pred.knn != mpg01.test)
## [1] 0.1538462

The KNN test error rate is 15.38462% with K=1.

# KNN with k= 3 - Test error rate
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.mpg01, k = 3)
mean(pred.knn != mpg01.test)
## [1] 0.1373626

The KNN test error rate is 13.73626% with K=3.

# KNN with k= 5 - test error rate
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.mpg01, k = 5)
mean(pred.knn != mpg01.test)
## [1] 0.1483516

The KNN test error rate is 14.83516% with K=5.

# KNN with k= 10 - test error rate
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.mpg01, k = 10)
mean(pred.knn != mpg01.test)
## [1] 0.1538462

The KNN test error rate is 15.38462% with K=10.

# KNN with k= 100 - test error rate
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.mpg01, k = 100)
mean(pred.knn != mpg01.test)
## [1] 0.1428571

The KNN test error rate is 14.28571% with K=100.

The KNN test error rate varies quite a lot with K. Of the values tried, K = 3 performs best, even though such a small K can be sensitive to noise. A common rule of thumb is K near the square root of the number of training observations, i.e. sqrt(210), which is about 14, but K = 10 and K = 100 did not improve on K = 3, so K = 3 is the choice here.
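
As with the Weekly data, a short loop makes the K comparison easier to read (a sketch using the train.X, test.X and train.mpg01 objects built above; set.seed is reset each time because knn() breaks ties at random):

# Test error for a grid of K values
for (k in c(1, 3, 5, 7, 10, 14, 50, 100)) {
  set.seed(1)
  pred <- knn(data.frame(train.X), data.frame(test.X), train.mpg01, k = k)
  cat("K =", k, " test error =", round(mean(pred != mpg01.test), 4), "\n")
}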

13. Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.

# Boston data set
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
cor(Boston)
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##               black      lstat       medv
## crim    -0.38506394  0.4556215 -0.3883046
## zn       0.17552032 -0.4129946  0.3604453
## indus   -0.35697654  0.6037997 -0.4837252
## chas     0.04878848 -0.0539293  0.1752602
## nox     -0.38005064  0.5908789 -0.4273208
## rm       0.12806864 -0.6138083  0.6953599
## age     -0.27353398  0.6023385 -0.3769546
## dis      0.29151167 -0.4969958  0.2499287
## rad     -0.44441282  0.4886763 -0.3816262
## tax     -0.44180801  0.5439934 -0.4685359
## ptratio -0.17738330  0.3740443 -0.5077867
## black    1.00000000 -0.3660869  0.3334608
## lstat   -0.36608690  1.0000000 -0.7376627
## medv     0.33346082 -0.7376627  1.0000000

Let's create a binary response variable, crim01, set to 1 if crim is greater than its median and 0 otherwise.

# create a binary crime rate variable similar to mpg01 in Q11 - crim01
crim01 <- rep(0, nrow(Boston))
crim01[Boston$crim > median(Boston$crim)] <- 1
Boston1 <- data.frame(Boston, crim01)
str(Boston1)
## 'data.frame':    506 obs. of  15 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
##  $ crim01 : num  0 0 0 0 0 0 0 0 0 0 ...
table(crim01)
## crim01
##   0   1 
## 253 253

Let's create training and test samples, using the first half of the observations for training and the second half for testing.

#training and test dataset creation. 

train <- 1:(nrow(Boston1) / 2)
test <- (nrow(Boston1) / 2 + 1):nrow(Boston1)
Boston1.train <- Boston1[train, ]
Boston1.test <- Boston1[test, ]
crim01.test <- crim01[test]
nrow(Boston1.test)
## [1] 253

Let's fit logistic regression first, with all predictors except crim itself.

# Fit logistic regression 
fit.glm <- glm(crim01 ~ . - crim01 - crim, data = Boston1, family = binomial, subset = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit.glm)
## 
## Call:
## glm(formula = crim01 ~ . - crim01 - crim, family = binomial, 
##     data = Boston1, subset = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.83229  -0.06593   0.00000   0.06181   2.61513  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -91.319906  19.490273  -4.685 2.79e-06 ***
## zn           -0.815573   0.193373  -4.218 2.47e-05 ***
## indus         0.354172   0.173862   2.037  0.04164 *  
## chas          0.167396   0.991922   0.169  0.86599    
## nox          93.706326  21.202008   4.420 9.88e-06 ***
## rm           -4.719108   1.788765  -2.638  0.00833 ** 
## age           0.048634   0.024199   2.010  0.04446 *  
## dis           4.301493   0.979996   4.389 1.14e-05 ***
## rad           3.039983   0.719592   4.225 2.39e-05 ***
## tax          -0.006546   0.007855  -0.833  0.40461    
## ptratio       1.430877   0.359572   3.979 6.91e-05 ***
## black        -0.017552   0.006734  -2.606  0.00915 ** 
## lstat         0.190439   0.086722   2.196  0.02809 *  
## medv          0.598533   0.185514   3.226  0.00125 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 329.367  on 252  degrees of freedom
## Residual deviance:  69.568  on 239  degrees of freedom
## AIC: 97.568
## 
## Number of Fisher Scoring iterations: 10

Several of the predictors appear to be significant. Let's get the confusion matrix and test error rate.

probs.glm <- predict(fit.glm, Boston1.test, type = "response")
pred.glm <- rep(0, length(probs.glm))
pred.glm[probs.glm > 0.5] <- 1
table(pred.glm, crim01.test)
##         crim01.test
## pred.glm   0   1
##        0  68  24
##        1  22 139
mean(pred.glm != crim01.test)
## [1] 0.1818182

The logistic regression model above has an 18.18% test error rate.

Now, let's fit the logistic regression model with only a subset of the predictors.

# Logistic regression with fewer predictors: removing the insignificant predictors chas and tax
fit2.glm <- glm(crim01 ~ . - crim01 - crim - chas - tax, data = Boston1, family = binomial, subset = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
probs2.glm <- predict(fit2.glm, Boston1.test, type = "response")
pred2.glm <- rep(0, length(probs2.glm))
pred2.glm[probs2.glm > 0.5] <- 1
table(pred2.glm, crim01.test)
##          crim01.test
## pred2.glm   0   1
##         0  67  24
##         1  23 139
mean(pred2.glm != crim01.test)
## [1] 0.1857708

Removing the insignificant predictors chas and tax did not improve the error rate.

Let's try removing a different predictor. First, let's check the association between crime rate and nox.

boxplot(Boston1$crim~Boston1$nox)

The association is not clearly evident from this plot, and the correlation matrix shows only a moderate correlation (0.42) between crim and nox, even though the logistic regression coefficient for nox is highly significant. Let's try fitting a logistic regression model without nox.

# Logistic regression with fewer predictors: removing chas (insignificant) and nox (which has a very large coefficient)
fit3.glm <- glm(crim01 ~ . - crim01 - crim - chas - nox, data = Boston1, family = binomial, subset = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
probs3.glm <- predict(fit3.glm, Boston1.test, type = "response")
pred3.glm <- rep(0, length(probs3.glm))
pred3.glm[probs3.glm > 0.5] <- 1
table(pred3.glm, crim01.test)
##          crim01.test
## pred3.glm   0   1
##         0  78  28
##         1  12 135
mean(pred3.glm != crim01.test)
## [1] 0.1581028

This new model's error rate (15.8%) improves on the previous ones (18.2% and 18.6%). We can conclude that logistic regression without the chas and nox predictors performs better than logistic regression with all predictors.

Let's fit LDA with all predictors.

# LDA with all predictors
fit.lda <- lda(crim01 ~ . - crim01 - crim, data = Boston1, subset = train)
pred.lda <- predict(fit.lda, Boston1.test)
table(pred.lda$class, crim01.test)
##    crim01.test
##       0   1
##   0  80  24
##   1  10 139
mean(pred.lda$class != crim01.test)
## [1] 0.1343874

LDA with all predictors has an error rate of 13.44%.

Let's fit LDA again, first without chas and tax, then without chas and nox.

# LDA without chas and tax 
fit2.lda <- lda(crim01 ~ . - crim01 - crim -chas -tax, data = Boston1, subset = train)
pred2.lda <- predict(fit2.lda, Boston1.test)
mean(pred2.lda$class != crim01.test)
## [1] 0.1225296
# LDA without chas and nox 
fit3.lda <- lda(crim01 ~ . - crim01 - crim -chas -nox, data = Boston1, subset = train)
pred3.lda <- predict(fit3.lda, Boston1.test)
mean(pred3.lda$class != crim01.test)
## [1] 0.1501976

LDA without chas and tax (the insignificant predictors from logistic regression) has the better error rate of 12.25%.

Let's build KNN models.

#KNN with k=1
train.X <- cbind(Boston1$zn, Boston1$indus, Boston1$chas, Boston1$nox, Boston1$rm, Boston1$age, Boston1$dis, Boston1$rad, Boston1$tax, Boston1$ptratio, Boston1$black, Boston1$lstat, Boston1$medv)[train, ]
test.X <- cbind(Boston1$zn, Boston1$indus, Boston1$chas, Boston1$nox, Boston1$rm, Boston1$age, Boston1$dis, Boston1$rad, Boston1$tax, Boston1$ptratio, Boston1$black, Boston1$lstat, Boston1$medv)[test, ]
train.crim01 <- Boston1$crim01[train]
set.seed(1)
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.crim01, k = 1)
table(pred.knn, crim01.test)
##         crim01.test
## pred.knn   0   1
##        0  85 111
##        1   5  52
mean(pred.knn != crim01.test)
## [1] 0.458498

For K = 1, the test error rate is 45.85%.

#KNN with k=9
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.crim01, k = 9)
mean(pred.knn != crim01.test)
## [1] 0.1106719

For K = 9, the test error rate is 11.07%.

#KNN with k=10
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.crim01, k = 10)
mean(pred.knn != crim01.test)
## [1] 0.1185771

For K = 10, the test error rate is 11.86%.

#KNN with k=15, which is approximately sqrt(253), the number of training observations
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.crim01, k = 15)
mean(pred.knn != crim01.test)
## [1] 0.1185771

For K = 15, the test error rate is 11.86%.

K = 9 performs best among the K values tried.

Overall, KNN (with K = 9) performed best on this split, compared to logistic regression and LDA.
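
One refinement worth trying (not evaluated here): KNN is distance-based, and these predictors sit on very different scales (tax in the hundreds, nox below 1), so standardizing them could change the results. A minimal sketch, reusing the train.X and test.X matrices built above:

# Standardize the training predictors and apply the same centering/scaling to the test set
train.X.std <- scale(train.X)
test.X.std  <- scale(test.X,
                     center = attr(train.X.std, "scaled:center"),
                     scale  = attr(train.X.std, "scaled:scale"))
set.seed(1)
pred.knn.std <- knn(data.frame(train.X.std), data.frame(test.X.std), train.crim01, k = 9)
mean(pred.knn.std != crim01.test)  # compare with the unscaled K = 9 error above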