Exercise 10

This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.2
library(MASS)
attach(Weekly)
str(Weekly)
## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

pairs(Weekly)

summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 

Comments: The summary function gives a good picture of how the lag variables behave. They share the same Min and Max (and nearly identical quartiles) because each Lag variable is just the Today return series shifted back by the corresponding number of weeks, so all of them summarize essentially the same distribution of weekly returns. The small differences in the means and medians come only from the few boundary weeks at the start and end of the sample.
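
As a quick check (sketch), each Lag variable should simply be the Today series shifted back by the corresponding number of weeks, which is why their summaries nearly coincide:

# Lag1 for week i should equal Today for week i-1 (sketch using the attached Weekly data)
all.equal(Lag1[-1], Today[-nrow(Weekly)])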



cor(Weekly [,-9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

Comments: From the correlation matrix and the pairs plot, the only strong relationship is between Year and Volume (correlation of about 0.84). This makes sense: average weekly trading volume grew substantially over the 1990–2010 period, which we can also see by plotting it.



plot(Volume)



(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

Model_Log = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Weekly,family=binomial)
summary(Model_Log)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Comments: The only predictor that appears statistically significant is Lag2, with a p-value of 0.0296, which is below the 0.05 significance level. Its coefficient is positive, suggesting that a positive return two weeks ago is associated with a slightly higher probability that the market goes up this week.
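
To make the size of the Lag2 effect concrete, we can put the coefficient on the odds scale (a small sketch using the fitted model above): exp(0.05844) is about 1.06, i.e. a one-unit increase in Lag2 increases the estimated odds of an Up week by roughly 6%.

# odds ratio for Lag2 (sketch)
exp(coef(Model_Log)["Lag2"])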



(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

Probs=predict(Model_Log,type="response")
Probs [1:10]
##         1         2         3         4         5         6         7         8 
## 0.6086249 0.6010314 0.5875699 0.4816416 0.6169013 0.5684190 0.5786097 0.5151972 
##         9        10 
## 0.5715200 0.5554287
contrasts(Direction)
##      Up
## Down  0
## Up    1
Preds=ifelse(Probs > 0.5,"Up","Down")
Preds [1:10]
##      1      2      3      4      5      6      7      8      9     10 
##   "Up"   "Up"   "Up" "Down"   "Up"   "Up"   "Up"   "Up"   "Up"   "Up"
contrasts(Direction)
##      Up
## Down  0
## Up    1
table(Preds,Direction)
##       Direction
## Preds  Down  Up
##   Down   54  48
##   Up    430 557
(54+557)/(54+48+430+557)
## [1] 0.5610652

Comments: The overall fraction of correct predictions is (TP + TN) / total observations = (54 + 557) / (54 + 48 + 430 + 557) = 0.5611. The confusion matrix also shows the kinds of mistakes being made: the model predicts “Up” for the vast majority of weeks (987 of 1,089), so it correctly identifies most Up weeks (557 of 605) but misclassifies most Down weeks (430 of 484). Keep in mind that this 56.1% is a training accuracy, since the model was evaluated on the same data it was fit on; training performance tends to be optimistic relative to performance on new data, so a proper assessment requires a separate test set, which is what the following parts do.
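
Beyond overall accuracy, the same table gives sensitivity and specificity. A minimal sketch, reusing the Preds and Direction objects above and treating "Up" as the positive class:

conf <- table(Preds, Direction)
sum(diag(conf)) / sum(conf)                  # accuracy: (54 + 557) / 1089, about 0.561
conf["Up", "Up"] / sum(conf[, "Up"])         # sensitivity: 557 / 605, about 0.92
conf["Down", "Down"] / sum(conf[, "Down"])   # specificity: 54 / 484, about 0.11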



(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

train=(Year>=1990 & Year<=2008)
Weekly_test= Weekly[!train ,]
Weekly_train= Weekly[train ,]


Direction_test= Direction[!train]
Direction_train= Direction[train]

dim(Weekly_test)
## [1] 104   9
dim(Weekly_train)
## [1] 985   9

Here the data set is split into a training period (1990–2008) and a test period (2009–2010). We fit the logistic regression on the training data and then evaluate its predictions on the held-out test data.
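
A quick sanity check (sketch) that the held-out set really covers only 2009 and 2010:

table(Weekly_test$Year)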



Model_Log2=glm(Direction~Lag2, data=Weekly_train,family=binomial)
summary(Model_Log2)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly_train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.536  -1.264   1.021   1.091   1.368  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4



Probs_test=predict(Model_Log2, Weekly_test, type="response")
Preds_test=ifelse(Probs_test > 0.5,"Up","Down")
Preds_test [1:10]
##    986    987    988    989    990    991    992    993    994    995 
##   "Up"   "Up" "Down" "Down"   "Up"   "Up"   "Up" "Down" "Down" "Down"
Probs_train=predict(Model_Log2, Weekly_train, type="response")
Preds_train=ifelse(Probs_train > 0.5,"Up","Down")
Preds_train [1:10]
##    1    2    3    4    5    6    7    8    9   10 
## "Up" "Up" "Up" "Up" "Up" "Up" "Up" "Up" "Up" "Up"



Confusion matrices (test and training)
table(Preds_test,Direction_test)
##           Direction_test
## Preds_test Down Up
##       Down    9  5
##       Up     34 56
table(Preds_train,Direction_train)
##            Direction_train
## Preds_train Down  Up
##        Down   23  20
##        Up    418 524
mean(Preds_test==Direction_test)
## [1] 0.625
mean(Preds_train==Direction_train)
## [1] 0.5553299

Comments: The mean function gives the same accuracy as the confusion-matrix formula. On the held-out 2009–2010 data the model predicts the weekly direction correctly 62.5% of the time. For comparison, the accuracy on the training data itself is only 55.5%, so in this case the test performance is actually slightly better than the training performance.



(e) LDA Model

Model_LDA = lda(Direction ~ Lag2, data = Weekly_train)
Model_LDA
## Call:
## lda(Direction ~ Lag2, data = Weekly_train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162
Preds_LDA=predict(Model_LDA, Weekly_test) 
LDA_class=Preds_LDA$class 
table(LDA_class,Direction_test)
##          Direction_test
## LDA_class Down Up
##      Down    9  5
##      Up     34 56
mean(LDA_class==Direction_test)
## [1] 0.625

Comments: We obtain the same confusion matrix and the same accuracy as with logistic regression: 62.5% of the test weeks are predicted correctly.
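
The identical confusion matrices suggest the two classifiers are making the same calls on the test set; a quick check of observation-by-observation agreement (a sketch reusing the objects above):

table(Preds_test, LDA_class)
mean(Preds_test == as.character(LDA_class))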



(f) QDA model

Model_QDA = qda(Direction~Lag2, data=Weekly_train)
Model_QDA
## Call:
## qda(Direction ~ Lag2, data = Weekly_train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
QDA_class=predict(Model_QDA,Weekly_test)$class
table(QDA_class,Direction_test)
##          Direction_test
## QDA_class Down Up
##      Down    0  0
##      Up     43 61
mean(QDA_class==Direction_test)
## [1] 0.5865385

Comments: The accuracy of the QDA model is 58.65%, lower than the LDA and logistic regression models. Note that QDA predicts “Up” for every test week, so this accuracy is simply the proportion of Up weeks in 2009–2010 (61 of 104).



(g) KNN using K = 1

Building the Lag2 predictor matrices (column 3 of Weekly) and the corresponding Direction vectors for the training and test periods:

library(class)
train_X = cbind(Weekly[train,3])
test_X = cbind(Weekly[!train,3])
train_Direction = Weekly[train,c(9)]
test_Direction = Weekly[!train,c(9)]
knn_pred=knn(train_X,test_X,train_Direction,k=1)
table(knn_pred,test_Direction)
##         test_Direction
## knn_pred Down Up
##     Down   21 29
##     Up     22 32
mean(knn_pred==test_Direction)
## [1] 0.5096154

Comments: KNN with K = 1 achieves only 50.96% accuracy, barely better than random guessing.



(h) Which of these methods appears to provide the best results on this data?

Logistic regression and LDA give the best results, both reaching 62.5% test accuracy, followed by QDA at 58.65% and KNN (K = 1) at 50.96%, which is essentially no better than random chance. The sketch below collects these numbers in one place.
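
A small sketch that gathers the four test accuracies side by side, reusing the prediction objects created in parts (d)–(g); the values are the ones reported above.

accuracies <- c(Logistic = mean(Preds_test == Direction_test),   # 0.625
                LDA      = mean(LDA_class  == Direction_test),   # 0.625
                QDA      = mean(QDA_class  == Direction_test),   # about 0.587
                KNN_1    = mean(knn_pred   == test_Direction))   # about 0.510
sort(accuracies, decreasing = TRUE)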



(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

Since LDA and logistic regression had the highest accuracy, I will experiment with those first, and then try KNN with different values of K.

Logistic Regression with interactions

Model_Log_Interac = glm(Direction ~ Lag1+Lag2+Lag1*Lag2, data = Weekly_train, family = "binomial")
summary(Model_Log_Interac)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag1 * Lag2, family = "binomial", 
##     data = Weekly_train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.573  -1.259   1.003   1.086   1.596  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.211419   0.064589   3.273  0.00106 **
## Lag1        -0.051505   0.030727  -1.676  0.09370 . 
## Lag2         0.053471   0.029193   1.832  0.06700 . 
## Lag1:Lag2    0.001921   0.007460   0.257  0.79680   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1346.9  on 981  degrees of freedom
## AIC: 1354.9
## 
## Number of Fisher Scoring iterations: 4
Probs_Interac = predict(Model_Log_Interac, Weekly_test, type = "response")
Preds_Interac = rep("Down", nrow(Weekly_test))
Preds_Interac[Probs_Interac > 0.5] = "Up"

table(Preds_Interac, Direction_test)
##              Direction_test
## Preds_Interac Down Up
##          Down    7  8
##          Up     36 53
mean(Preds_Interac==Direction_test)
## [1] 0.5769231

Comments: Adding the Lag1 × Lag2 interaction lowers the test accuracy of the logistic model to 57.7%.



Logistic Regression with Log transformation of predictors

Model_Log_Transf = glm(Direction ~ log10(abs(Lag2)), data = Weekly_train, family = "binomial")
summary(Model_Log_Transf)
## 
## Call:
## glm(formula = Direction ~ log10(abs(Lag2)), family = "binomial", 
##     data = Weekly_train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.305  -1.269   1.074   1.088   1.167  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)       0.20916    0.06410   3.263   0.0011 **
## log10(abs(Lag2))  0.06823    0.13149   0.519   0.6038   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1354.4  on 983  degrees of freedom
## AIC: 1358.4
## 
## Number of Fisher Scoring iterations: 3
Probs_log_trans = predict(Model_Log_Transf, Weekly_test, type = "response")
Preds_log_trans = rep("Down", nrow(Weekly_test))
Preds_log_trans[Probs_log_trans > 0.5] = "Up"

table(Preds_log_trans, Direction_test)
##                Direction_test
## Preds_log_trans Down Up
##              Up   43 61
mean(Preds_log_trans==Direction_test)
## [1] 0.5865385

Comments: The log-transformed model predicts “Up” for every test week, so its 58.65% accuracy is just the proportion of Up weeks in 2009–2010 (61 of 104), the same figure we obtained with QDA.



KNN with K=2

knn_pred_2=knn(train_X,test_X,train_Direction,k=2)
table(knn_pred_2,test_Direction)
##           test_Direction
## knn_pred_2 Down Up
##       Down   21 20
##       Up     22 41
mean(knn_pred_2==test_Direction)
## [1] 0.5961538



KNN with K=3

knn_pred_3=knn(train_X,test_X,train_Direction,k=3)
table(knn_pred_3,test_Direction)
##           test_Direction
## knn_pred_3 Down Up
##       Down   15 19
##       Up     28 42
mean(knn_pred_3==test_Direction)
## [1] 0.5480769

Comments: Moving from K = 1 to K = 2 improves the accuracy (51.0% to 59.6%), but K = 3 does worse again (54.8%), so there is no clear trend from these three values; a more systematic scan over K is sketched below. Either way, the logistic regression and LDA models using Lag2 alone remain the best performers on the held-out data.
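
A sketch for scanning K more systematically (it reuses the train_X/test_X matrices built in part (g); knn() breaks ties at random, so a seed is set for reproducibility):

set.seed(1)
ks  <- 1:10
acc <- sapply(ks, function(k) mean(knn(train_X, test_X, train_Direction, k = k) == test_Direction))
data.frame(K = ks, accuracy = acc)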



Exercise 11

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

detach(Weekly)
attach(Auto)
summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365



(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
Auto$mpg01 = ifelse(mpg >= median(mpg), 1, 0)



(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
pairs(Auto[,-9])

cor(Auto[,-9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
## mpg01         0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566
##              acceleration       year     origin      mpg01
## mpg             0.4233285  0.5805410  0.5652088  0.8369392
## cylinders      -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement   -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower     -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight         -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration    1.0000000  0.2903161  0.2127458  0.3468215
## year            0.2903161  1.0000000  0.1815277  0.4299042
## origin          0.2127458  0.1815277  1.0000000  0.5136984
## mpg01           0.3468215  0.4299042  0.5136984  1.0000000

Comments: Several variables are strongly (negatively) correlated with mpg01: cylinders, displacement, horsepower and weight, with correlations between roughly -0.67 and -0.76. Acceleration, year and origin show only weak to moderate correlations (about 0.35 to 0.51), so they look less useful as predictors. Boxplots of the four strongest candidates, split by mpg01, are sketched below.
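
A sketch of the suggested boxplots for the four strongest candidates, split by mpg01 (this assumes the mpg01 column created in part (a)):

par(mfrow = c(2, 2))
boxplot(cylinders ~ mpg01, data = Auto, main = "cylinders by mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "displacement by mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "horsepower by mpg01")
boxplot(weight ~ mpg01, data = Auto, main = "weight by mpg01")
par(mfrow = c(1, 1))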



(c) Split the data into a training set and a test set.
set.seed(110)
index = sample(nrow(Auto), 0.8*nrow(Auto), replace = F)
train_Auto = Auto[index,]
test_Auto = Auto[-index,]



(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
Model.lda=lda(mpg01~cylinders+weight+displacement+horsepower,data=train_Auto)
Model.lda
## Call:
## lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = train_Auto)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4888179 0.5111821 
## 
## Group means:
##   cylinders   weight displacement horsepower
## 0  6.836601 3666.431     278.0261  130.94118
## 1  4.212500 2329.219     116.2219   78.40625
## 
## Coefficients of linear discriminants:
##                       LD1
## cylinders    -0.343935265
## weight       -0.000943983
## displacement -0.004455063
## horsepower    0.006439206
Preds.LDA=predict(Model.lda, test_Auto)
LDA.class=Preds.LDA$class
table(LDA.class,test_Auto$mpg01) 
##          
## LDA.class  0  1
##         0 31  1
##         1 12 35
mean(LDA.class!=test_Auto$mpg01)
## [1] 0.164557

Comments: The test error is about 16.5%, i.e. the LDA model classifies roughly 83.5% of the test observations correctly.



(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
QDA.model=qda(mpg01~cylinders+weight+displacement+horsepower,data=train_Auto)
QDA.model
## Call:
## qda(mpg01 ~ cylinders + weight + displacement + horsepower, data = train_Auto)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4888179 0.5111821 
## 
## Group means:
##   cylinders   weight displacement horsepower
## 0  6.836601 3666.431     278.0261  130.94118
## 1  4.212500 2329.219     116.2219   78.40625
Preds.QDA=predict(QDA.model, test_Auto)
QDA.class=Preds.QDA$class
table(QDA.class,test_Auto$mpg01)
##          
## QDA.class  0  1
##         0 36  1
##         1  7 35
mean(QDA.class!=test_Auto$mpg01)
## [1] 0.1012658

Comments: The test error is about 10.1%, i.e. the QDA model classifies roughly 89.9% of the test observations correctly, the best result so far.



(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
Model_Log_Auto=glm(mpg01~cylinders+weight+displacement+horsepower,data=train_Auto, family = "binomial")
summary(Model_Log_Auto)
## 
## Call:
## glm(formula = mpg01 ~ cylinders + weight + displacement + horsepower, 
##     family = "binomial", data = train_Auto)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2669  -0.1556   0.1217   0.3497   3.2780  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   9.7943532  1.7704628   5.532 3.16e-08 ***
## cylinders     0.5001436  0.4040011   1.238  0.21572    
## weight       -0.0014658  0.0007962  -1.841  0.06561 .  
## displacement -0.0234400  0.0095747  -2.448  0.01436 *  
## horsepower   -0.0419434  0.0151859  -2.762  0.00574 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 433.75  on 312  degrees of freedom
## Residual deviance: 163.38  on 308  degrees of freedom
## AIC: 173.38
## 
## Number of Fisher Scoring iterations: 7
Probs_Log_Auto=predict(Model_Log_Auto, test_Auto, type = "response")
Preds_Log_Auto = rep(0, nrow(test_Auto))
Preds_Log_Auto[Probs_Log_Auto > 0.5] = 1
table(Preds_Log_Auto,test_Auto$mpg01) 
##               
## Preds_Log_Auto  0  1
##              0 31  0
##              1 12 36
mean(Preds_Log_Auto!=test_Auto$mpg01)
## [1] 0.1518987

Comments: The test error is about 15.2%, i.e. the logistic regression model classifies roughly 84.8% of the test observations correctly.



(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

K=1

train_knn = cbind(cylinders,weight,displacement,horsepower)[index,]
test_knn = cbind(cylinders,weight,displacement,horsepower)[-index,]

train_mpg01 = Auto[index,c(10)]
test_mpg01 = Auto[-index,c(10)]

knn_preds=knn(train_knn,test_knn,train_mpg01,k=1)
table(knn_preds,test_mpg01)
##          test_mpg01
## knn_preds  0  1
##         0 33  4
##         1 10 32
mean(knn_preds!=test_mpg01)
## [1] 0.1772152

K=5

train_knn = cbind(cylinders,weight,displacement,horsepower)[index,]
test_knn = cbind(cylinders,weight,displacement,horsepower)[-index,]

train_mpg01 = Auto[index,c(10)]
test_mpg01 = Auto[-index,c(10)]

knn_preds=knn(train_knn,test_knn,train_mpg01,k=5)
table(knn_preds,test_mpg01)
##          test_mpg01
## knn_preds  0  1
##         0 31  1
##         1 12 35
mean(knn_preds!=test_mpg01)
## [1] 0.164557

K=10

train_knn = cbind(cylinders,weight,displacement,horsepower)[index,]
test_knn = cbind(cylinders,weight,displacement,horsepower)[-index,]

train_mpg01 = Auto[index,c(10)]
test_mpg01 = Auto[-index,c(10)]

knn_preds=knn(train_knn,test_knn,train_mpg01,k=10)
table(knn_preds,test_mpg01)
##          test_mpg01
## knn_preds  0  1
##         0 32  2
##         1 11 34
mean(knn_preds!=test_mpg01)
## [1] 0.164557

Comments: The test error actually decreases slightly as K grows: 17.7% for K = 1, then 16.5% for both K = 5 and K = 10. Among the values tried, K = 5 (or K = 10) performs best; a scan over a wider range of K is sketched below.
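
A sketch for scanning a wider range of K values (reusing the train_knn/test_knn matrices above; knn() breaks ties at random, so a seed is set):

set.seed(110)
ks   <- 1:20
errs <- sapply(ks, function(k) mean(knn(train_knn, test_knn, train_mpg01, k = k) != test_mpg01))
data.frame(K = ks, test_error = errs)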



Exercise 13

Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.

detach(Auto)
attach(Boston)
Boston$crim01 = ifelse(crim >= median(crim), 1, 0)

set.seed(100)
index = sample(nrow(Boston), 0.8*nrow(Boston), replace = F)
train = Boston[index,]
test = Boston[-index,]

Logistic Regression Model

glm.model = glm(crim01 ~ . -crim, data = train, family = "binomial")
summary(glm.model)
## 
## Call:
## glm(formula = crim01 ~ . - crim, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1490  -0.1839  -0.0020   0.0036   3.3823  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -35.563369   7.234946  -4.915 8.86e-07 ***
## zn           -0.061499   0.035264  -1.744  0.08116 .  
## indus        -0.031617   0.049685  -0.636  0.52455    
## chas          0.440435   0.806740   0.546  0.58510    
## nox          45.505591   8.154368   5.581 2.40e-08 ***
## rm           -0.614829   0.769615  -0.799  0.42436    
## age           0.034393   0.014270   2.410  0.01595 *  
## dis           0.780795   0.246066   3.173  0.00151 ** 
## rad           0.635762   0.175862   3.615  0.00030 ***
## tax          -0.006423   0.002977  -2.158  0.03096 *  
## ptratio       0.450300   0.138823   3.244  0.00118 ** 
## black        -0.011476   0.006177  -1.858  0.06320 .  
## lstat         0.010570   0.059725   0.177  0.85953    
## medv          0.206090   0.077958   2.644  0.00820 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 560.05  on 403  degrees of freedom
## Residual deviance: 173.37  on 390  degrees of freedom
## AIC: 201.37
## 
## Number of Fisher Scoring iterations: 9

Comments: The following predictors appear statistically significant at the 5% level: nox, age, dis, rad, tax, ptratio and medv.

glm.prob=predict(glm.model, test, type = "response")
glm.pred = rep(0, nrow(test))
glm.pred[glm.prob > 0.5] = 1
table(glm.pred,test$crim01)
##         
## glm.pred  0  1
##        0 48  4
##        1  2 48
mean(glm.pred==test$crim01)
## [1] 0.9411765

Comments: The full logistic regression model achieves about 94.1% test accuracy, which is very good. A sketch that refits the model using only the significant predictors is shown below.
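
As an extension (a sketch, not part of the original fits), we can refit the logistic model using only the predictors flagged as significant above and compare its test accuracy with the full model:

glm.sub  <- glm(crim01 ~ nox + age + dis + rad + tax + ptratio + medv, data = train, family = "binomial")
sub.prob <- predict(glm.sub, test, type = "response")
sub.pred <- ifelse(sub.prob > 0.5, 1, 0)
mean(sub.pred == test$crim01)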



LDA Model

lda.model = lda(crim01 ~ . -crim, data = train)
lda.model
## Call:
## lda(crim01 ~ . - crim, data = train)
## 
## Prior probabilities of groups:
##         0         1 
## 0.5024752 0.4975248 
## 
## Group means:
##          zn     indus       chas       nox       rm      age      dis       rad
## 0 21.342365  7.171084 0.03940887 0.4737714 6.375404 51.20690 5.034042  4.152709
## 1  1.114428 15.483333 0.07960199 0.6410199 6.161751 85.28756 2.516774 15.303483
##        tax  ptratio    black     lstat     medv
## 0 311.5320 17.89458 388.7609  9.370887 24.62906
## 1 518.4279 19.14428 325.2534 16.075672 19.67313
## 
## Coefficients of linear discriminants:
##                   LD1
## zn      -0.0054037426
## indus    0.0167809902
## chas    -0.0815189869
## nox      8.4834502301
## rm       0.1381303165
## age      0.0110929665
## dis      0.0814731543
## rad      0.0825844056
## tax     -0.0015452330
## ptratio  0.0871344821
## black   -0.0008842299
## lstat    0.0198033538
## medv     0.0428799230
lda.pred=predict(lda.model, test)
lda.class=lda.pred$class

table(lda.class,test$crim01) 
##          
## lda.class  0  1
##         0 47 14
##         1  3 38
mean(lda.class==test$crim01)
## [1] 0.8333333

Comments: The LDA model also achieves a good accuracy of 83.3%; however, the logistic regression model still outperforms it.



KNN Classification, K=1

train.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[index,]
test.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[-index,]

train.crim01 = Boston[index,c(15)]
test.crim01 = Boston[-index,c(15)]

knn.pred=knn(train.X,test.X,train.crim01,k=1)
table(knn.pred,test.crim01)
##         test.crim01
## knn.pred  0  1
##        0 48  3
##        1  2 49
mean(knn.pred==test.crim01)
## [1] 0.9509804

Comments: Of all the models so far, this is the best result, with 95.1% of the test observations predicted correctly. Since KNN is distance based, it is also worth checking whether standardizing the predictors changes the result (see the sketch below).
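
A sketch with standardized predictors; the use of scale() here is an addition on my part, not part of the original analysis:

set.seed(100)
X.std   <- scale(Boston[, c("zn","indus","chas","nox","rm","age","dis","rad","tax","ptratio","black","lstat","medv")])
knn.std <- knn(X.std[index, ], X.std[-index, ], train.crim01, k = 1)
mean(knn.std == test.crim01)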



KNN Classification, K=2

train.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[index,]
test.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[-index,]

train.crim01 = Boston[index,c(15)]
test.crim01 = Boston[-index,c(15)]

knn.pred=knn(train.X,test.X,train.crim01,k=2)
table(knn.pred,test.crim01)
##         test.crim01
## knn.pred  0  1
##        0 46  4
##        1  4 48
mean(knn.pred==test.crim01)
## [1] 0.9215686

Comments: With K = 2 the accuracy drops from 95.1% to 92.2%. It is tempting to conclude that larger K always hurts, but the K = 10 fit below shows the trend is not monotone.



KNN Classification, K=10

train.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[index,]
test.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[-index,]

train.crim01 = Boston[index,c(15)]
test.crim01 = Boston[-index,c(15)]

knn.pred=knn(train.X,test.X,train.crim01,k=10)
table(knn.pred,test.crim01)
##         test.crim01
## knn.pred  0  1
##        0 48  3
##        1  2 49
mean(knn.pred==test.crim01)
## [1] 0.9509804

Comments: Contrary to what the K = 2 result suggested, the accuracy goes back up to 95.1% with K = 10. To choose K more systematically, it would be worth computing the test accuracy over a grid of K values and plotting it, as sketched below.
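
A sketch for that grid search, reusing train.X/test.X from above (a seed is set because knn() breaks ties at random):

set.seed(100)
ks  <- 1:20
acc <- sapply(ks, function(k) mean(knn(train.X, test.X, train.crim01, k = k) == test.crim01))
plot(ks, acc, type = "b", xlab = "K", ylab = "Test accuracy")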