Load the ISLR and MASS libraries
library(ISLR)
library(MASS)
## Warning: package 'MASS' was built under R version 3.6.2
library(class)
## Warning: package 'class' was built under R version 3.6.2
Get structure of weekly data set
str(Weekly)
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
Lets do the summary of the Weekly data set
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
Quick read:
Observations(n): 1089
Columns: 9
Response Variable: Direction
Lets do scatter plot matrix
pairs(Weekly)
On first look, the relationship is weak between lag variables and Volume/Today/Year.
Year and volume seems to have strong relationship.
Lets print correlation values to observe this. Remove the non-numeric column “Direction” when we do this.
cor(Weekly[,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
The coorelation numbers confirm our initial analysis using scatter plot matrix. In other words, there appears to be little correlation between this week’s returns and previous weeks’ returns.
Year and volume seems to have relationship. Lets build a plot to see this.
boxplot(Volume~Year,data=Weekly,main="Average number of daily shares traded across Years")
By plotting the data we see that Volume is increasing over time and dipped a bit in 2010. In other words, the average number of shares traded increased from 1990 to 2009 and reduced in 2010.
From the summary, we noticed that the dataset contains more observations with direction as Up Lets look at the ratios.
table(Weekly$Direction)/sum(table(Weekly$Direction))
##
## Down Up
## 0.4444444 0.5555556
This indicates that we have 44% of down and 56% of Up. Lets compare it with Today to understand the relationship between Today and Direction.
head(Weekly[,c(8,9)])
## Today Direction
## 1 -0.270 Down
## 2 -2.576 Down
## 3 3.514 Up
## 4 0.712 Up
## 5 1.178 Up
## 6 -1.372 Down
Based on the above list, we can say that direction is marked as Up or Down based on percentage return for this week.
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Based on the above model, with p-value of 0.0296, it seems that Lag2 is the only significant predictor. All otherpredictors seem to be insignificant.
Lets build Conusion Matrix:
Horizontal -> Prediction
Vertical -> training data
glm.probs <- predict(glm.fit,type = "response")
glm.pred <- rep("Down",nrow(Weekly))
glm.pred[glm.probs>0.5] = "Up"
table(glm.pred,Weekly$Direction)
##
## glm.pred Down Up
## Down 54 48
## Up 430 557
Lets identify the Accuracy/correct prediction rate:
mean(glm.pred == Weekly$Direction)
## [1] 0.5610652
Calculating the accuracy rate from confusion matrix:
(54+557)/(54+557+48+430)
## [1] 0.5610652
Model prediction rate when the market is Up:
557/(48+557)
## [1] 0.9206612
Model prediction rate when the market is Down:
54/(54+430)
## [1] 0.1115702
We may conclude that the percentage of correct predictions on the training data is (54+557)/1089 wich is equal to 56.1% In other words 43.9% is the training error rate, which is overly optimistic. We could also say that for weeks when the market goes up, the model is right 92.0661157% of the time (557/(48+557)) i.e. very few False Positives.
However, when the market is actually down, the model is right only 11.1570248% of the time (54/(54+430)) and this is a concern with this model that there are heavy False Negatives.
Fit the model for the training data set i.e. from 1990 to 2008.
train <- (Weekly$Year < 2009)
Weekly.20092010 <- Weekly[!train, ]
Direction.20092010 <- Weekly$Direction[!train]
glm.fit.subset <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
summary(glm.fit.subset)
##
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly,
## subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.536 -1.264 1.021 1.091 1.368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
Build the confusion matrix for the held out data i.e. test data set.
glm.probs <- predict(glm.fit.subset,Weekly.20092010,type = "response")
glm.pred <- rep("Down",nrow(Weekly.20092010))
glm.pred[glm.probs>0.5] = "Up"
table(glm.pred,Direction.20092010)
## Direction.20092010
## glm.pred Down Up
## Down 9 5
## Up 34 56
Check the fraction of correct predictions for the held out data:
mean(glm.pred == Direction.20092010)
## [1] 0.625
mean(glm.pred != Direction.20092010)
## [1] 0.375
56/(56+5)
## [1] 0.9180328
9/(9+34)
## [1] 0.2093023
In this case, we may conclude that the percentage of correct predictions on the test data is (9+56)/104 which is equal to 62.5%. In other words 37.5% is the test error rate. We could also say that for weeks when the market goes up, the model is right 91.8032787% of the time (56/(56+5)). For weeks when the market goes down, the model is right only 20.9302326% of the time (9/(9+34)). Though the overall accuracy rate improved, the False Negatives are still high.
Fit the model with LDA.
lda.fit <- lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
Build the confusion matrix.
lda.pred <- predict(lda.fit, Weekly.20092010)
table(lda.pred$class, Direction.20092010)
## Direction.20092010
## Down Up
## Down 9 5
## Up 34 56
Check the fraction of correct predictions for the held out data:
mean(lda.pred$class == Direction.20092010)
## [1] 0.625
The LDA and logistic regression predictions are identical.
Lets fit the model.
qda.fit <- qda(Direction ~ Lag2, data = Weekly, subset = train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
Build the confusion matrix.
qda.pred <- predict(qda.fit, Weekly.20092010)
table(qda.pred$class, Direction.20092010)
## Direction.20092010
## Down Up
## Down 0 0
## Up 43 61
Check the fraction of correct predictions for the held out data:
mean(qda.pred$class == Direction.20092010)
## [1] 0.5865385
We can conclude that the percentage of correct predictions on the test data is 58.6538462%. In other words 41.3461538% is the test error rate. As there are no False Positives, for weeks when the market goes up, the model is right 100% of the time. For weeks when the market goes down, the model is right only 0% of the time. We may note, that QDA achieves a correctness of 58.6538462% even though the model chooses “Up” the whole time.
train.X <- cbind(Weekly$Lag2)[train,]
test.X <- cbind(Weekly$Lag2)[!train,]
train.Direction <- Weekly$Direction[train]
set.seed(1)
knn.pred <- knn(data.frame(train.X), data.frame(test.X), train.Direction, k = 1)
table(knn.pred, Direction.20092010)
## Direction.20092010
## knn.pred Down Up
## Down 21 30
## Up 22 31
mean(knn.pred == Direction.20092010)
## [1] 0.5
31/(30+31)
## [1] 0.5081967
21/(21+22)
## [1] 0.4883721
With KNN, we may conclude that the percentage of correct predictions on the test data is 50%. In other words 50% is the test error rate. We could also say that for weeks when the market goes up, the model is right 50.8196721% of the time. For weeks when the market goes down, the model is right only 48.8372093% of the time. The model has considerable False positives and False negatives.
The Logisic Regression and LDA seem to be better with accuracy rate of 62.5%.
Though the False negatives are more, it is ok that you lose out the opportunity. But the false positives are less which means you are right more than 90% of the time and not end up with loss by investing.
Trying KNN = 3
# KNN with K=3
knn.pred <- knn(data.frame(train.X), data.frame(test.X), train.Direction, k = 3)
table(knn.pred, Direction.20092010)
## Direction.20092010
## knn.pred Down Up
## Down 16 19
## Up 27 42
mean(knn.pred == Direction.20092010)
## [1] 0.5576923
Trying KNN = 5
# KNN with K=5
knn.pred <- knn(data.frame(train.X), data.frame(test.X), train.Direction, k = 5)
mean(knn.pred == Direction.20092010)
## [1] 0.5384615
As we increase k value to 3, KNN gave a better accuracy rate. However, increasing k further, there is no improvement compared to the original LDA/Logistic Regression model.
Lets try Logistic regression with adding one more predictor Lag2:Lag1 as Lag1’s p-value is 0.1181 and lowest after lag2.
# Logistic regression with Lag2:Lag1
glm.fit.L2L1 <- glm(Direction ~ Lag2:Lag1, data = Weekly, family = binomial, subset = train)
glm.probs.L2L1 <- predict(glm.fit.L2L1, Weekly.20092010, type = "response")
glm.pred.L2L1 <- rep("Down", length(glm.probs.L2L1))
glm.pred.L2L1[glm.probs.L2L1 > 0.5] = "Up"
table(glm.pred.L2L1, Direction.20092010)
## Direction.20092010
## glm.pred.L2L1 Down Up
## Down 1 1
## Up 42 60
mean(glm.pred.L2L1 == Direction.20092010)
## [1] 0.5865385
The prediction accuracy rate is 58.65% which is not good when compared to the test model with just Lag2 as predictor.
Lets fit the LDA model with same interaction term.
# LDA with Lag2:Lag1 interaction term
lda.fit.L2L1 <- lda(Direction ~ Lag2:Lag1, data = Weekly, subset = train)
lda.pred.L2L1 <- predict(lda.fit.L2L1, Weekly.20092010)
mean(lda.pred.L2L1$class == Direction.20092010)
## [1] 0.5769231
The prediction accuracy rate is 57.69% which is not good when compared to the test LDA model with just Lag2 as predictor.
# QDA with Sqrt function
qda.fit.sqrt <- qda(Direction ~ Lag2 + sqrt(abs(Lag2)), data = Weekly, subset = train)
qda.fit.sqrt
## Call:
## qda(Direction ~ Lag2 + sqrt(abs(Lag2)), data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2 sqrt(abs(Lag2))
## Down -0.03568254 1.140078
## Up 0.26036581 1.169635
qda.pred.sqrt <- predict(qda.fit.sqrt, Weekly.20092010)
table(qda.pred.sqrt$class, Direction.20092010)
## Direction.20092010
## Down Up
## Down 12 13
## Up 31 48
mean(qda.pred.sqrt$class == Direction.20092010)
## [1] 0.5769231
The prediction accuracy rate is 57.69% which is not good when compared to the test LDA model with just Lag2 as predictor.
Out of all the new combinations tried, the original logistic regression and LDA have the best performance (62.5%).
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
# Lets create a new data set Auto1 with mpg01 added along with all columns of Auto.
mpg01 <- rep(0, nrow(Auto))
mpg01[Auto[,'mpg']>median(Auto[,'mpg'])] <- 1
Auto1 <- data.frame(Auto, mpg01)
str(Auto1)
## 'data.frame': 392 obs. of 10 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## $ mpg01 : num 0 0 0 0 0 0 0 0 0 0 ...
summary(Auto1)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
## mpg01
## Min. :0.0
## 1st Qu.:0.0
## Median :0.5
## Mean :0.5
## 3rd Qu.:1.0
## Max. :1.0
##
table(Auto1$mpg01)
##
## 0 1
## 196 196
pairs(Auto1[,-9])
cor(Auto1[,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
Correlation matrix shows that mpg01 has strong relationship with cylinders, displacement, horsepower and weight.
Year and origin also seem to have some
Lets do box plots to see the same.
par(mfrow = c(2,3))
boxplot(weight~mpg01,Auto1,main="Weight~mpg01")
boxplot(horsepower~mpg01,Auto1,main="Horsepower~mpg01")
boxplot(displacement~mpg01,Auto1,main="Displacement~mpg01")
boxplot(acceleration~mpg01,Auto1,main="Acceleration~mpg01")
boxplot(origin~mpg01,Auto1,main="Origin~mpg01")
boxplot(cylinders~mpg01,Auto1,main="Cylinders~mpg01")
boxplot(year~mpg01,Auto1,main="Year~mpg01")
Based on the correlation matrix and box plots, there is clear evidence that there is some relationship between horsepower, displacement, cylinders and weight with mpg01.
The data is equally split with 196 observations with mpg01 as 0 and the rest of 196 with mpg01 as 1. So, just going with random sampling based on model year.
train <- (Auto1$year %% 2 == 0)
Auto1.train <- Auto1[train, ]
Auto1.test <- Auto1[!train, ]
mpg01.test <- mpg01[!train]
# training data set observations count
nrow(Auto1.train)
## [1] 210
# test data set observations count
nrow(Auto1.test)
## [1] 182
fit.lda <- lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = Auto1, subset = train)
fit.lda
## Call:
## lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = Auto1,
## subset = train)
##
## Prior probabilities of groups:
## 0 1
## 0.4571429 0.5428571
##
## Group means:
## cylinders weight displacement horsepower
## 0 6.812500 3604.823 271.7396 133.14583
## 1 4.070175 2314.763 111.6623 77.92105
##
## Coefficients of linear discriminants:
## LD1
## cylinders -0.6741402638
## weight -0.0011465750
## displacement 0.0004481325
## horsepower 0.0059035377
pred.lda <- predict(fit.lda, Auto1.test)
table(pred.lda$class, mpg01.test)
## mpg01.test
## 0 1
## 0 86 9
## 1 14 73
#LDA test error rate
mean(pred.lda$class != mpg01.test)
## [1] 0.1263736
The LDA test error rate is 12.63736%.
#perform QDA
fit.qda <- qda(mpg01 ~ cylinders + weight + displacement + horsepower, data = Auto1, subset = train)
fit.qda
## Call:
## qda(mpg01 ~ cylinders + weight + displacement + horsepower, data = Auto1,
## subset = train)
##
## Prior probabilities of groups:
## 0 1
## 0.4571429 0.5428571
##
## Group means:
## cylinders weight displacement horsepower
## 0 6.812500 3604.823 271.7396 133.14583
## 1 4.070175 2314.763 111.6623 77.92105
pred.qda <- predict(fit.qda, Auto1.test)
table(pred.qda$class, mpg01.test)
## mpg01.test
## 0 1
## 0 89 13
## 1 11 69
#QDA est error rate
mean(pred.qda$class != mpg01.test)
## [1] 0.1318681
The LDA test error rate is 13.18681%.
fit.glm <- glm(mpg01 ~ cylinders + weight + displacement + horsepower, data = Auto1, family = binomial, subset = train)
summary(fit.glm)
##
## Call:
## glm(formula = mpg01 ~ cylinders + weight + displacement + horsepower,
## family = binomial, data = Auto1, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.48027 -0.03413 0.10583 0.29634 2.57584
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 17.658730 3.409012 5.180 2.22e-07 ***
## cylinders -1.028032 0.653607 -1.573 0.1158
## weight -0.002922 0.001137 -2.569 0.0102 *
## displacement 0.002462 0.015030 0.164 0.8699
## horsepower -0.050611 0.025209 -2.008 0.0447 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 289.58 on 209 degrees of freedom
## Residual deviance: 83.24 on 205 degrees of freedom
## AIC: 93.24
##
## Number of Fisher Scoring iterations: 7
probs.glm <- predict(fit.glm, Auto1.test, type = "response")
pred.glm <- rep(0, length(probs.glm))
pred.glm[probs.glm > 0.5] <- 1
table(pred.glm, mpg01.test)
## mpg01.test
## pred.glm 0 1
## 0 89 11
## 1 11 71
#Logistic Regression test error rate
mean(pred.glm != mpg01.test)
## [1] 0.1208791
The Logistic Regression test error rate is 12.08791%.
# Perform KNN test with k=1
train.X <- cbind(Auto1$cylinders, Auto1$weight, Auto1$displacement, Auto1$horsepower)[train, ]
test.X <- cbind(Auto1$cylinders, Auto1$weight, Auto1$displacement, Auto1$horsepower)[!train, ]
train.mpg01 <- Auto1$mpg01[train]
set.seed(1)
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.mpg01, k = 1)
table(pred.knn, mpg01.test)
## mpg01.test
## pred.knn 0 1
## 0 83 11
## 1 17 71
# KNN with k=1, error rate
mean(pred.knn != mpg01.test)
## [1] 0.1538462
The KNN test error rate is 15.38462% with K=1.
# KNN with k= 3 - Test error rate
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.mpg01, k = 3)
mean(pred.knn != mpg01.test)
## [1] 0.1373626
The KNN test error rate is 13.73626% with K=3.
# KNN with k= 5 - test error rate
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.mpg01, k = 5)
mean(pred.knn != mpg01.test)
## [1] 0.1483516
The KNN test error rate is 14.83516% with K=5.
# KNN with k= 10 - test error rate
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.mpg01, k = 10)
mean(pred.knn != mpg01.test)
## [1] 0.1538462
The KNN test error rate is 15.38462% with K=10.
# KNN with k= 100 - test error rate
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.mpg01, k = 100)
mean(pred.knn != mpg01.test)
## [1] 0.1428571
The KNN test error rate is 14.28571% with K=100.
KNN prediction error rate seems to vary quite a lot with various k values. Out of trials i did, K=3 seems to be better, even though K value is small and could potentially cause noise. Square root (training data set observations count) i.e. sqrt(210) = 14. So for the sake of easiness and as the sample size is too small, i would like to pick k=3.
# Boston data set
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
cor(Boston)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## black lstat medv
## crim -0.38506394 0.4556215 -0.3883046
## zn 0.17552032 -0.4129946 0.3604453
## indus -0.35697654 0.6037997 -0.4837252
## chas 0.04878848 -0.0539293 0.1752602
## nox -0.38005064 0.5908789 -0.4273208
## rm 0.12806864 -0.6138083 0.6953599
## age -0.27353398 0.6023385 -0.3769546
## dis 0.29151167 -0.4969958 0.2499287
## rad -0.44441282 0.4886763 -0.3816262
## tax -0.44180801 0.5439934 -0.4685359
## ptratio -0.17738330 0.3740443 -0.5077867
## black 1.00000000 -0.3660869 0.3334608
## lstat -0.36608690 1.0000000 -0.7376627
## medv 0.33346082 -0.7376627 1.0000000
Lets create a binary response variable for crimerate and populate as 1 if crim value is greater than its median.
# create a binary crime rate variable similar to mpg01 in Q11 - crim01
crim01 <- rep(0, nrow(Boston))
crim01[Boston$crim > median(Boston$crim)] <- 1
Boston1 <- data.frame(Boston, crim01)
str(Boston1)
## 'data.frame': 506 obs. of 15 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## $ crim01 : num 0 0 0 0 0 0 0 0 0 0 ...
table(crim01)
## crim01
## 0 1
## 253 253
Lets create a training and test data samples
#training and test dataset creation.
train <- 1:(nrow(Boston1) / 2)
test <- (nrow(Boston1) / 2 + 1):nrow(Boston1)
Boston1.train <- Boston1[train, ]
Boston1.test <- Boston1[test, ]
crim01.test <- crim01[test]
nrow(Boston1.test)
## [1] 253
Lets fit Logistic regression first with all predictors.
# Fit logistic regression
fit.glm <- glm(crim01 ~ . - crim01 - crim, data = Boston1, family = binomial, subset = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit.glm)
##
## Call:
## glm(formula = crim01 ~ . - crim01 - crim, family = binomial,
## data = Boston1, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.83229 -0.06593 0.00000 0.06181 2.61513
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -91.319906 19.490273 -4.685 2.79e-06 ***
## zn -0.815573 0.193373 -4.218 2.47e-05 ***
## indus 0.354172 0.173862 2.037 0.04164 *
## chas 0.167396 0.991922 0.169 0.86599
## nox 93.706326 21.202008 4.420 9.88e-06 ***
## rm -4.719108 1.788765 -2.638 0.00833 **
## age 0.048634 0.024199 2.010 0.04446 *
## dis 4.301493 0.979996 4.389 1.14e-05 ***
## rad 3.039983 0.719592 4.225 2.39e-05 ***
## tax -0.006546 0.007855 -0.833 0.40461
## ptratio 1.430877 0.359572 3.979 6.91e-05 ***
## black -0.017552 0.006734 -2.606 0.00915 **
## lstat 0.190439 0.086722 2.196 0.02809 *
## medv 0.598533 0.185514 3.226 0.00125 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 329.367 on 252 degrees of freedom
## Residual deviance: 69.568 on 239 degrees of freedom
## AIC: 97.568
##
## Number of Fisher Scoring iterations: 10
Several of predictors seem to be significant. Lets get the confusion matrix and test error rate.
probs.glm <- predict(fit.glm, Boston1.test, type = "response")
pred.glm <- rep(0, length(probs.glm))
pred.glm[probs.glm > 0.5] <- 1
table(pred.glm, crim01.test)
## crim01.test
## pred.glm 0 1
## 0 68 24
## 1 22 139
mean(pred.glm != crim01.test)
## [1] 0.1818182
The Logistic regression model above has 18.18% error rate.
Now, lets fit th logistic regression model with only few of the predictors.
#Logistic regression with few predictors. Removing insignificant predictirs chas and tax
fit2.glm <- glm(crim01 ~ . - crim01 - crim - chas - tax, data = Boston1, family = binomial, subset = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
probs2.glm <- predict(fit2.glm, Boston1.test, type = "response")
pred2.glm <- rep(0, length(probs2.glm))
pred2.glm[probs2.glm > 0.5] <- 1
table(pred2.glm, crim01.test)
## crim01.test
## pred2.glm 0 1
## 0 67 24
## 1 23 139
mean(pred2.glm != crim01.test)
## [1] 0.1857708
Removing the insignifcant predictors chas and tax didn’t help in improving the error rate.
Lets try removing a different predictor. Lets check crime rate and nox association.
boxplot(Boston1$crim~Boston1$nox)
The association is not clearly evident and even the correlation matrix confirms the same even though logistic regression shows significance.Lets try to fit a logistic regression model without nox.
#Logistic regression with few predictors. Removing insignificant predictor chas and nox - which has high coefficient
fit3.glm <- glm(crim01 ~ . - crim01 - crim - chas - nox, data = Boston1, family = binomial, subset = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
probs3.glm <- predict(fit3.glm, Boston1.test, type = "response")
pred3.glm <- rep(0, length(probs3.glm))
pred3.glm[probs3.glm > 0.5] <- 1
table(pred3.glm, crim01.test)
## crim01.test
## pred3.glm 0 1
## 0 78 28
## 1 12 135
mean(pred3.glm != crim01.test)
## [1] 0.1581028
This new model’s error rate (15.8%) improved than the previous ones (~18.2 and 18.6%). we can conclude that the logistic regression without the chas and nox predictors performs well than the logistic regression with all predictors.
Lets fit LDA with all predictors.
# LDA with all predictors
fit.lda <- lda(crim01 ~ . - crim01 - crim, data = Boston1, subset = train)
pred.lda <- predict(fit.lda, Boston1.test)
table(pred.lda$class, crim01.test)
## crim01.test
## 0 1
## 0 80 24
## 1 10 139
mean(pred.lda$class != crim01.test)
## [1] 0.1343874
LDA with all predictors has error rate of 13.44%
Lets fit LDA again without the chas and nox predictors.
# LDA without chas and tax
fit2.lda <- lda(crim01 ~ . - crim01 - crim -chas -tax, data = Boston1, subset = train)
pred2.lda <- predict(fit2.lda, Boston1.test)
mean(pred2.lda$class != crim01.test)
## [1] 0.1225296
# LDA without chas and nox
fit3.lda <- lda(crim01 ~ . - crim01 - crim -chas -nox, data = Boston1, subset = train)
pred3.lda <- predict(fit3.lda, Boston1.test)
mean(pred3.lda$class != crim01.test)
## [1] 0.1501976
LDA without chas and tax (insignificant predictors from Logistic regression) has better error rate of 12.25%.
Lets build KNN.
#KNN with k=1
train.X <- cbind(Boston1$zn, Boston1$indus, Boston1$chas, Boston1$nox, Boston1$rm, Boston1$age, Boston1$dis, Boston1$rad, Boston1$tax, Boston1$ptratio, Boston1$black, Boston1$lstat, Boston1$medv)[train, ]
test.X <- cbind(Boston1$zn, Boston1$indus, Boston1$chas, Boston1$nox, Boston1$rm, Boston1$age, Boston1$dis, Boston1$rad, Boston1$tax, Boston1$ptratio, Boston1$black, Boston1$lstat, Boston1$medv)[test, ]
train.crim01 <- Boston1$crim01[train]
set.seed(1)
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.crim01, k = 1)
table(pred.knn, crim01.test)
## crim01.test
## pred.knn 0 1
## 0 85 111
## 1 5 52
mean(pred.knn != crim01.test)
## [1] 0.458498
for k=1, test error rate is 45.85%
#KNN with k=9
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.crim01, k = 9)
mean(pred.knn != crim01.test)
## [1] 0.1106719
for k=9, test error rate is 11.07%.
#KNN with k=10 which is approximately sqrt(training set length-253)
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.crim01, k = 10)
mean(pred.knn != crim01.test)
## [1] 0.1185771
for k=10, test error rate is 11.86%.
#KNN with k=15 which is approximately sqrt(training set length-253)
pred.knn <- knn(data.frame(train.X), data.frame(test.X), train.crim01, k = 15)
mean(pred.knn != crim01.test)
## [1] 0.1185771
for k=15, test error rate is 11.86%.
K=9 has better performance compared to other k values.
Overall KNN performed well when compared to Logistic regression or LDA.