library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.2
library(MASS)
attach(Weekly)
str(Weekly)
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
pairs(Weekly)
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
Comments: The summary function gives a good picture of how the lag variables behave. They share essentially the same Min, Max and quartiles because each lag is just the same weekly return series shifted back in time. The small differences show up in the means and medians, but all of the lags stay within the same range of weekly percentage returns.
cor(Weekly [,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
Comments: From the correlation matrix and the pairs plot, the only strong relationship is between Year and Volume (about 0.84). This makes sense, since trading volume has trended upward over time; we can confirm it by plotting Volume.
plot(Volume)
Model_Log = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Weekly,family=binomial)
summary(Model_Log)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Comments: The only statistically significant predictor is Lag2, with a p-value of 0.0296, which is below the 0.05 significance level. Its coefficient is positive, suggesting that a positive return two weeks ago slightly increases the probability that the market goes up this week.
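To make the Lag2 effect more concrete, the coefficient can be put on the odds scale; a minimal sketch using the fitted Model_Log from above:
exp(coef(Model_Log)["Lag2"]) # roughly exp(0.05844) = 1.06, i.e. about 6% higher odds of an Up week per one-unit increase in Lag2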
Probs=predict(Model_Log,type="response")
Probs [1:10]
## 1 2 3 4 5 6 7 8
## 0.6086249 0.6010314 0.5875699 0.4816416 0.6169013 0.5684190 0.5786097 0.5151972
## 9 10
## 0.5715200 0.5554287
contrasts(Direction)
## Up
## Down 0
## Up 1
Preds=ifelse(Probs > 0.5,"Up","Down")
Preds [1:10]
## 1 2 3 4 5 6 7 8 9 10
## "Up" "Up" "Up" "Down" "Up" "Up" "Up" "Up" "Up" "Up"
contrasts(Direction)
## Up
## Down 0
## Up 1
table(Preds,Direction)
## Direction
## Preds Down Up
## Down 54 48
## Up 430 557
(54+557)/(54+48+430+557)
## [1] 0.5610652
Comments: Accuracy is (TP + TN) / total observations = (54 + 557) / (54 + 48 + 430 + 557) = 0.56106, where the total counts all of the TP, TN, FP and FN. This might look like a reasonable result, but since we evaluated the model on the same data it was trained on, it does not give an honest picture of performance. Training metrics (accuracy, sensitivity, specificity, etc.) tend to be optimistic compared to test metrics, so we should hold out part of the data for testing and fit the model on the rest.
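The same accuracy (and the corresponding training error rate) can be read directly off the confusion matrix; a small sketch, storing the table first:
conf = table(Preds, Direction) # rows = predicted, columns = observed
sum(diag(conf)) / sum(conf) # overall accuracy, (54 + 557) / 1089
1 - sum(diag(conf)) / sum(conf) # training error rate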
train=(Year>=1990 & Year<=2008)
Weekly_test= Weekly[!train ,]
Weekly_train= Weekly[train ,]
Direction_test= Direction[!train]
Direction_train= Direction[train]
dim(Weekly_test)
## [1] 104 9
dim(Weekly_train)
## [1] 985 9
Here the data set is split into a training set (1990 through 2008) and a test set (2009-2010); we fit the logistic regression on the training set and evaluate its predictions on the test set.
Model_Log2=glm(Direction~Lag2, data=Weekly_train,family=binomial)
summary(Model_Log2)
##
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.536 -1.264 1.021 1.091 1.368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
Probs_test=predict(Model_Log2, Weekly_test, type="response")
Preds_test=ifelse(Probs_test > 0.5,"Up","Down")
Preds_test [1:10]
## 986 987 988 989 990 991 992 993 994 995
## "Up" "Up" "Down" "Down" "Up" "Up" "Up" "Down" "Down" "Down"
Probs_train=predict(Model_Log2, Weekly_train, type="response")
Preds_train=ifelse(Probs_train > 0.5,"Up","Down")
Preds_train [1:10]
## 1 2 3 4 5 6 7 8 9 10
## "Up" "Up" "Up" "Up" "Up" "Up" "Up" "Up" "Up" "Up"
table(Preds_test,Direction_test)
## Direction_test
## Preds_test Down Up
## Down 9 5
## Up 34 56
table(Preds_train,Direction_train)
## Direction_train
## Preds_train Down Up
## Down 23 20
## Up 418 524
mean(Preds_test==Direction_test)
## [1] 0.625
mean(Preds_train==Direction_train)
## [1] 0.5553299
Comments: The mean() function is a shortcut for the accuracy formula and gives the same result. Here the model correctly predicts 62.5% of the weekly directions in the test set. For comparison, the accuracy on the training set itself is only 55.5%, so the model generalizes to the held-out 2009-2010 data at least as well as it fits the training years.
Model_LDA = lda(Direction ~ Lag2, data = Weekly_train)
Model_LDA
## Call:
## lda(Direction ~ Lag2, data = Weekly_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
Preds_LDA=predict(Model_LDA, Weekly_test)
LDA_class=Preds_LDA$class
table(LDA_class,Direction_test)
## Direction_test
## LDA_class Down Up
## Down 9 5
## Up 34 56
mean(LDA_class==Direction_test)
## [1] 0.625
Comments: LDA gives the same result as the logistic regression: 62.5% of the test weeks are predicted correctly.
Model_QDA = qda(Direction~Lag2, data=Weekly_train)
Model_QDA
## Call:
## qda(Direction ~ Lag2, data = Weekly_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
QDA_class=predict(Model_QDA,Weekly_test)$class
table(QDA_class,Direction_test)
## Direction_test
## QDA_class Down Up
## Down 0 0
## Up 43 61
mean(QDA_class==Direction_test)
## [1] 0.5865385
Comments: The accuracy of the QDA model is 58.65%, lower than LDA and logistic regression; note from the confusion matrix that QDA simply predicts "Up" for every test week.
Building the predictor matrices and response vectors for KNN from the training and test rows:
library(class)
train_X = cbind(Weekly[train,3])
test_X = cbind(Weekly[!train,3])
train_Direction = Weekly[train,c(9)]
test_Direction = Weekly[!train,c(9)]
knn_pred=knn(train_X,test_X,train_Direction,k=1)
table(knn_pred,test_Direction)
## test_Direction
## knn_pred Down Up
## Down 21 29
## Up 22 32
mean(knn_pred==test_Direction)
## [1] 0.5096154
Comments: KNN (K-nearest neighbors) with K = 1 had an accuracy of 50.96%, which is barely better than random guessing.
Both LDA and logistic regression gave the best accuracy, 62.5%; QDA follows with 58.65%, and KNN with K = 1 trails at 50.96%, which is essentially random chance.
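For reference, the four test accuracies can be collected into one named vector; a minimal sketch using the prediction objects created above:
Accuracies = c(Logistic = mean(Preds_test == Direction_test),
               LDA = mean(LDA_class == Direction_test),
               QDA = mean(QDA_class == Direction_test),
               KNN1 = mean(knn_pred == test_Direction))
sort(Accuracies, decreasing = TRUE)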
Since LDA and logistic regression performed best, I will try modifications of the logistic model first, and then run KNN with different values of K.
Logistic Regression with interactions
Model_Log_Interac = glm(Direction ~ Lag1+Lag2+Lag1*Lag2, data = Weekly_train, family = "binomial")
summary(Model_Log_Interac)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag1 * Lag2, family = "binomial",
## data = Weekly_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.573 -1.259 1.003 1.086 1.596
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.211419 0.064589 3.273 0.00106 **
## Lag1 -0.051505 0.030727 -1.676 0.09370 .
## Lag2 0.053471 0.029193 1.832 0.06700 .
## Lag1:Lag2 0.001921 0.007460 0.257 0.79680
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1346.9 on 981 degrees of freedom
## AIC: 1354.9
##
## Number of Fisher Scoring iterations: 4
Probs_Interac = predict(Model_Log_Interac, Weekly_test, type = "response")
Preds_Interac = rep("Down", nrow(Weekly_test))
Preds_Interac[Probs_Interac > 0.5] = "Up"
table(Preds_Interac, Direction_test)
## Direction_test
## Preds_Interac Down Up
## Down 7 8
## Up 36 53
mean(Preds_Interac==Direction_test)
## [1] 0.5769231
Comments: Adding the Lag1:Lag2 interaction lowers the test accuracy of the logistic model to 57.7%.
Logistic Regression with Log transformation of predictors
Model_Log_Transf = glm(Direction ~ log10(abs(Lag2)), data = Weekly_train, family = "binomial")
summary(Model_Log_Transf)
##
## Call:
## glm(formula = Direction ~ log10(abs(Lag2)), family = "binomial",
## data = Weekly_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.305 -1.269 1.074 1.088 1.167
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20916 0.06410 3.263 0.0011 **
## log10(abs(Lag2)) 0.06823 0.13149 0.519 0.6038
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1354.4 on 983 degrees of freedom
## AIC: 1358.4
##
## Number of Fisher Scoring iterations: 3
Probs_log_trans = predict(Model_Log_Transf, Weekly_test, type = "response")
Preds_log_trans = rep("Down", nrow(Weekly_test))
Preds_log_trans[Probs_log_trans > 0.5] = "Up"
table(Preds_log_trans, Direction_test)
## Direction_test
## Preds_log_trans Down Up
## Up 43 61
mean(Preds_log_trans==Direction_test)
## [1] 0.5865385
KNN with K=2
knn_pred_2=knn(train_X,test_X,train_Direction,k=2)
table(knn_pred_2,test_Direction)
## test_Direction
## knn_pred_2 Down Up
## Down 21 20
## Up 22 41
mean(knn_pred_2==test_Direction)
## [1] 0.5961538
KNN with K=3
knn_pred_3=knn(train_X,test_X,train_Direction,k=3)
table(knn_pred_3,test_Direction)
## test_Direction
## knn_pred_3 Down Up
## Down 15 19
## Up 28 42
mean(knn_pred_3==test_Direction)
## [1] 0.5480769
Comments: Increasing K improves the accuracy from 50.96% at K = 1 to 59.62% at K = 2, but it drops again to 54.81% at K = 3. Either way, the logistic regression and LDA models using only Lag2 remain the best performers.
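Rather than checking K = 1, 2, 3 one at a time, the whole sweep can be run in a loop; a sketch (knn() breaks distance ties at random, so a seed keeps the sweep reproducible):
set.seed(1)
k_values = 1:10
knn_acc = sapply(k_values, function(k) {
  pred = knn(train_X, test_X, train_Direction, k = k)
  mean(pred == test_Direction)
})
data.frame(k = k_values, accuracy = knn_acc)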
detach(Weekly)
attach(Auto)
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
Auto$mpg01 = ifelse(mpg >= median(mpg), 1, 0)
pairs(Auto[,-9])
cor(Auto[,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
Comments: Several variables are strongly (negatively) correlated with mpg01: cylinders, displacement, horsepower and weight, all roughly between -0.67 and -0.76. Acceleration, year and origin have weaker correlations (about 0.35 to 0.51), so they are less promising predictors.
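The relationship with mpg01 can also be inspected visually, for example with boxplots of the strongest predictors split by the new class label; a minimal sketch:
par(mfrow = c(2, 2))
boxplot(cylinders ~ mpg01, data = Auto, main = "cylinders vs mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "displacement vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "horsepower vs mpg01")
boxplot(weight ~ mpg01, data = Auto, main = "weight vs mpg01")
par(mfrow = c(1, 1))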
set.seed(110)
index = sample(nrow(Auto), 0.8*nrow(Auto), replace = F)
train_Auto = Auto[index,]
test_Auto = Auto[-index,]
Model.lda=lda(mpg01~cylinders+weight+displacement+horsepower,data=train_Auto)
Model.lda
## Call:
## lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = train_Auto)
##
## Prior probabilities of groups:
## 0 1
## 0.4888179 0.5111821
##
## Group means:
## cylinders weight displacement horsepower
## 0 6.836601 3666.431 278.0261 130.94118
## 1 4.212500 2329.219 116.2219 78.40625
##
## Coefficients of linear discriminants:
## LD1
## cylinders -0.343935265
## weight -0.000943983
## displacement -0.004455063
## horsepower 0.006439206
Preds.LDA=predict(Model.lda, test_Auto)
LDA.class=Preds.LDA$class
table(LDA.class,test_Auto$mpg01)
##
## LDA.class 0 1
## 0 31 1
## 1 12 35
mean(LDA.class!=test_Auto$mpg01)
## [1] 0.164557
Comments: The test error is 16.46%, meaning the LDA model classifies about 83.5% of the test observations correctly.
QDA.model=qda(mpg01~cylinders+weight+displacement+horsepower,data=train_Auto)
QDA.model
## Call:
## qda(mpg01 ~ cylinders + weight + displacement + horsepower, data = train_Auto)
##
## Prior probabilities of groups:
## 0 1
## 0.4888179 0.5111821
##
## Group means:
## cylinders weight displacement horsepower
## 0 6.836601 3666.431 278.0261 130.94118
## 1 4.212500 2329.219 116.2219 78.40625
Preds.QDA=predict(QDA.model, test_Auto)
QDA.class=Preds.QDA$class
table(QDA.class,test_Auto$mpg01)
##
## QDA.class 0 1
## 0 36 1
## 1 7 35
mean(QDA.class!=test_Auto$mpg01)
## [1] 0.1012658
Comments: The test error is 10.13%, meaning the QDA model classifies about 89.9% of the test observations correctly; it outperforms LDA on this split.
Model_Log_Auto=glm(mpg01~cylinders+weight+displacement+horsepower,data=train_Auto, family = "binomial")
summary(Model_Log_Auto)
##
## Call:
## glm(formula = mpg01 ~ cylinders + weight + displacement + horsepower,
## family = "binomial", data = train_Auto)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2669 -0.1556 0.1217 0.3497 3.2780
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.7943532 1.7704628 5.532 3.16e-08 ***
## cylinders 0.5001436 0.4040011 1.238 0.21572
## weight -0.0014658 0.0007962 -1.841 0.06561 .
## displacement -0.0234400 0.0095747 -2.448 0.01436 *
## horsepower -0.0419434 0.0151859 -2.762 0.00574 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 433.75 on 312 degrees of freedom
## Residual deviance: 163.38 on 308 degrees of freedom
## AIC: 173.38
##
## Number of Fisher Scoring iterations: 7
Probs_Log_Auto=predict(Model_Log_Auto, test_Auto, type = "response")
Preds_Log_Auto = rep(0, nrow(test_Auto))
Preds_Log_Auto[Probs_Log_Auto > 0.5] = 1
table(Preds_Log_Auto,test_Auto$mpg01)
##
## Preds_Log_Auto 0 1
## 0 31 0
## 1 12 36
mean(Preds_Log_Auto!=test_Auto$mpg01)
## [1] 0.1518987
Comments: The test error is 15.19%, meaning the logistic regression model classifies about 84.8% of the test observations correctly.
K=1
train_knn = cbind(cylinders,weight,displacement,horsepower)[index,]
test_knn = cbind(cylinders,weight,displacement,horsepower)[-index,]
train_mpg01 = Auto[index,c(10)]
test_mpg01 = Auto[-index,c(10)]
knn_preds=knn(train_knn,test_knn,train_mpg01,k=1)
table(knn_preds,test_mpg01)
## test_mpg01
## knn_preds 0 1
## 0 33 4
## 1 10 32
mean(knn_preds!=test_mpg01)
## [1] 0.1772152
K=5
train_knn = cbind(cylinders,weight,displacement,horsepower)[index,]
test_knn = cbind(cylinders,weight,displacement,horsepower)[-index,]
train_mpg01 = Auto[index,c(10)]
test_mpg01 = Auto[-index,c(10)]
knn_preds=knn(train_knn,test_knn,train_mpg01,k=5)
table(knn_preds,test_mpg01)
## test_mpg01
## knn_preds 0 1
## 0 31 1
## 1 12 35
mean(knn_preds!=test_mpg01)
## [1] 0.164557
K=10
train_knn = cbind(cylinders,weight,displacement,horsepower)[index,]
test_knn = cbind(cylinders,weight,displacement,horsepower)[-index,]
train_mpg01 = Auto[index,c(10)]
test_mpg01 = Auto[-index,c(10)]
knn_preds=knn(train_knn,test_knn,train_mpg01,k=10)
table(knn_preds,test_mpg01)
## test_mpg01
## knn_preds 0 1
## 0 32 2
## 1 11 34
mean(knn_preds!=test_mpg01)
## [1] 0.164557
Comments: On this split the test error actually improves slightly as K increases, going from 17.7% at K = 1 to 16.5% at K = 5, and staying at 16.5% at K = 10.
detach(Auto)
attach(Boston)
Boston$crim01 = ifelse(crim >= median(crim), 1, 0)
set.seed(100)
index = sample(nrow(Boston), 0.8*nrow(Boston), replace = F)
train = Boston[index,]
test = Boston[-index,]
Logistic Regression Model
glm.model = glm(crim01 ~ . -crim, data = train, family = "binomial")
summary(glm.model)
##
## Call:
## glm(formula = crim01 ~ . - crim, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1490 -0.1839 -0.0020 0.0036 3.3823
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -35.563369 7.234946 -4.915 8.86e-07 ***
## zn -0.061499 0.035264 -1.744 0.08116 .
## indus -0.031617 0.049685 -0.636 0.52455
## chas 0.440435 0.806740 0.546 0.58510
## nox 45.505591 8.154368 5.581 2.40e-08 ***
## rm -0.614829 0.769615 -0.799 0.42436
## age 0.034393 0.014270 2.410 0.01595 *
## dis 0.780795 0.246066 3.173 0.00151 **
## rad 0.635762 0.175862 3.615 0.00030 ***
## tax -0.006423 0.002977 -2.158 0.03096 *
## ptratio 0.450300 0.138823 3.244 0.00118 **
## black -0.011476 0.006177 -1.858 0.06320 .
## lstat 0.010570 0.059725 0.177 0.85953
## medv 0.206090 0.077958 2.644 0.00820 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 560.05 on 403 degrees of freedom
## Residual deviance: 173.37 on 390 degrees of freedom
## AIC: 201.37
##
## Number of Fisher Scoring iterations: 9
Comments: The following predictors appear statistically significant in the model: nox, age, dis, rad, tax, ptratio and medv.
glm.prob=predict(glm.model, test, type = "response")
glm.pred = rep(0, nrow(test))
glm.pred[glm.prob > 0.5] = 1
table(glm.pred,test$crim01)
##
## glm.pred 0 1
## 0 48 4
## 1 2 48
mean(glm.pred==test$crim01)
## [1] 0.9411765
Comments: The logistic regression model achieves 94.1% test accuracy, which is very good.
LDA Model
lda.model = lda(crim01 ~ . -crim, data = train)
lda.model
## Call:
## lda(crim01 ~ . - crim, data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.5024752 0.4975248
##
## Group means:
## zn indus chas nox rm age dis rad
## 0 21.342365 7.171084 0.03940887 0.4737714 6.375404 51.20690 5.034042 4.152709
## 1 1.114428 15.483333 0.07960199 0.6410199 6.161751 85.28756 2.516774 15.303483
## tax ptratio black lstat medv
## 0 311.5320 17.89458 388.7609 9.370887 24.62906
## 1 518.4279 19.14428 325.2534 16.075672 19.67313
##
## Coefficients of linear discriminants:
## LD1
## zn -0.0054037426
## indus 0.0167809902
## chas -0.0815189869
## nox 8.4834502301
## rm 0.1381303165
## age 0.0110929665
## dis 0.0814731543
## rad 0.0825844056
## tax -0.0015452330
## ptratio 0.0871344821
## black -0.0008842299
## lstat 0.0198033538
## medv 0.0428799230
lda.pred=predict(lda.model, test)
lda.class=lda.pred$class
table(lda.class,test$crim01)
##
## lda.class 0 1
## 0 47 14
## 1 3 38
mean(lda.class==test$crim01)
## [1] 0.8333333
Comments: The LDA model also has a good accuracy of 83.33%; however, the logistic regression model still outperforms it.
KNN Classification, K=1
train.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[index,]
test.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[-index,]
train.crim01 = Boston[index,c(15)]
test.crim01 = Boston[-index,c(15)]
knn.pred=knn(train.X,test.X,train.crim01,k=1)
table(knn.pred,test.crim01)
## test.crim01
## knn.pred 0 1
## 0 48 3
## 1 2 49
mean(knn.pred==test.crim01)
## [1] 0.9509804
Comments: Of all the models so far, KNN with K = 1 has the best test accuracy, 95.1%.
KNN Classification, K=2
train.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[index,]
test.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[-index,]
train.crim01 = Boston[index,c(15)]
test.crim01 = Boston[-index,c(15)]
knn.pred=knn(train.X,test.X,train.crim01,k=2)
table(knn.pred,test.crim01)
## test.crim01
## knn.pred 0 1
## 0 46 4
## 1 4 48
mean(knn.pred==test.crim01)
## [1] 0.9215686
Comments: The accuracy drops from 95.1% to 92.16% at K = 2, which might suggest that increasing K reduces predictive power here.
KNN Classification, K=10
train.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[index,]
test.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[-index,]
train.crim01 = Boston[index,c(15)]
test.crim01 = Boston[-index,c(15)]
knn.pred=knn(train.X,test.X,train.crim01,k=10)
table(knn.pred,test.crim01)
## test.crim01
## knn.pred 0 1
## 0 48 3
## 1 2 49
mean(knn.pred==test.crim01)
## [1] 0.9509804
Comments: Contrary to what we expected, the accuracy goes back up to 95.1% at K = 10. It would be interesting to plot the test accuracy over a range of K values to find the best K for this model.
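A quick way to do that (a sketch; the exact curve depends on the random tie-breaking inside knn(), so a seed is set):
set.seed(100)
k_grid = 1:25
acc_by_k = sapply(k_grid, function(k) {
  pred = knn(train.X, test.X, train.crim01, k = k)
  mean(pred == test.crim01)
})
plot(k_grid, acc_by_k, type = "b", xlab = "K", ylab = "Test accuracy")
k_grid[which.max(acc_by_k)] # K with the highest test accuracy on this split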