Suppose that we wish to predict whether a given stock will issue a dividend this year (“Yes” or “No”) based on X, last year’s percentage profit. We examine a large number of companies and discover that the mean value of X for companies that issued a dividend was \(\bar{X} = 10\), while the mean for those that did not was \(\bar{X} = 0\). In addition, the variance of X for these two sets of companies was \(\hat{\sigma}^2 = 36\). Finally, 80% of companies issued dividends. Assuming that X follows a normal distribution, predict the probability that a company will issue a dividend this year given that its percentage profit was X = 4 last year.
In order to solve this problem, we will use Bayes’ theorem. Let \(D_y\) denote that a company issues a dividend and \(D_n\) that it does not. The prior probabilities are \(P(D_y) = \pi_y = 0.8\) and \(P(D_n) = \pi_n = 0.2\), and we can compute the density of \(X\) within each group using the density function of a normal random variable, given below:
\[ f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(x-\mu)^2}{2\sigma^2}} \]
Using this, we can calculate \(P(X=4|D_y) = f_y(X=4) \approx 0.04\) and \(P(X=4|D_n) = f_n(X=4) \approx 0.05\) (strictly speaking these are density values, since \(X\) is continuous). Now that we have all of the pieces, we can calculate the probability that a company will issue a dividend given that its percentage profit was \(X=4\) last year:
\[ P(D_y|X=4)=\frac{P(X=4|D_y)P(D_y)}{P(X=4|D_y)P(D_y)+P(X=4|D_n)P(D_n)} \]
\[ P(D_y|X=4)=\frac{0.04\times0.8}{0.04\times0.8 + 0.05\times0.2} \]
\[ P(D_y|X=4)\approx0.76 \]
Thus, the probability is approximately 76% (about 75% if the unrounded density values are used).
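As a quick numerical check, the same posterior can be computed in R with dnorm(); below is a minimal sketch using the parameters stated in the problem (not part of the original solution code).
# posterior P(dividend | X = 4) via Bayes' theorem, using the stated parameters
pi.y <- 0.8; pi.n <- 0.2 # prior probabilities of issuing / not issuing a dividend
f.y <- dnorm(4, mean = 10, sd = 6) # density of X = 4 given a dividend (sigma^2 = 36)
f.n <- dnorm(4, mean = 0, sd = 6) # density of X = 4 given no dividend
(f.y * pi.y) / (f.y * pi.y + f.n * pi.n) # approximately 0.752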
This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
# there are 21 years of data; 1990 has 47 weeks and 1996 has 53 weeks
# other packages (e.g. MASS, class, e1071, caret) are assumed to be loaded in a setup chunk for the later models
library(ISLR2)
weekly <- Weekly
# head(Weekly)
summary(Weekly[-1]) # drop the Year column
## Lag1 Lag2 Lag3 Lag4
## Min. :-18.1950 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580 1st Qu.: -1.1580
## Median : 0.2410 Median : 0.2410 Median : 0.2410 Median : 0.2380
## Mean : 0.1506 Mean : 0.1511 Mean : 0.1472 Mean : 0.1458
## 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. : 12.0260 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag5 Volume Today Direction
## Min. :-18.1950 Min. :0.08747 Min. :-18.1950 Down:484
## 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540 Up :605
## Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. :9.32821 Max. : 12.0260
pairs(Weekly, col=Weekly$Direction)
# corrplot(cor(Weekly[-c(1,9)]))
Numerically, the variables Lag1 through Lag5 are very similar, with nearly identical summary statistics. There are more weeks with an upward direction, which is consistent with the positive mean and median of Today.
Looking at the pairwise plots, the clearest relationship is between Direction and Today: weeks with negative values of Today are Down and weeks with positive values are Up. This is expected, since Direction is derived from Today. Another clear relationship is that Volume tends to increase as Year increases. Beyond these, there are no obvious relationships.
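The Year–Volume relationship can also be checked numerically; a one-line sketch (output not shown here):
# correlation between Year and Volume; a strongly positive value confirms the upward trend
cor(Weekly$Year, Weekly$Volume)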
weekly.predictors <- Weekly[-c(1,8)]
# up is 1, down is 0
weekly.predictors$Direction <- (ifelse(weekly.predictors$Direction == 'Up', 1, 0))
# weekly.predictors['Direction'] <- lapply(weekly.predictors['Direction'] , factor)
logistic.2 <- glm(Direction ~ ., data=weekly.predictors, family = 'binomial')
summary(logistic.2)
##
## Call:
## glm(formula = Direction ~ ., family = "binomial", data = weekly.predictors)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
After performing logistic regression, the only predictor that appears to be statistically significant (at the 5% level) is Lag2, with a p-value of about 0.03.
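For reference, the significant terms can be pulled straight from the fitted model; a minimal sketch that filters the coefficient table by p-value:
# coefficient rows with p-values below 0.05 (note this includes the intercept)
coefs <- summary(logistic.2)$coefficients
coefs[coefs[, "Pr(>|z|)"] < 0.05, ]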
# confusion matrix
predicted.2c <- predict(logistic.2, weekly, type='response')
predicted.2c <- ifelse(predicted.2c >= 0.5, 1, 0)
confusionMatrix(as.factor(predicted.2c), as.factor(weekly.predictors$Direction), mode='everything',positive='1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 54 48
## 1 430 557
##
## Accuracy : 0.5611
## 95% CI : (0.531, 0.5908)
## No Information Rate : 0.5556
## P-Value [Acc > NIR] : 0.369
##
## Kappa : 0.035
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9207
## Specificity : 0.1116
## Pos Pred Value : 0.5643
## Neg Pred Value : 0.5294
## Precision : 0.5643
## Recall : 0.9207
## F1 : 0.6997
## Prevalence : 0.5556
## Detection Rate : 0.5115
## Detection Prevalence : 0.9063
## Balanced Accuracy : 0.5161
##
## 'Positive' Class : 1
##
There is a sensitivity of 92% and a specificity of 11%: the model detects true positives (a direction of Up) 92% of the time and true negatives (a direction of Down) only 11% of the time. The confusion matrix shows a positive predictive value of 56% and a negative predictive value of 53%. The F1 score is 0.70, and the overall accuracy is 0.56.
From this we can tell that our model is much better at predicting when the market is up than when it is down. However, this is slightly deceptive; the positive predictive value of 56% means that only 56% of its “Up” predictions are correct. Going back to the confusion matrix, this is because around 90% of its predictions are that the market will go up, whereas the market actually goes up only about 56% of the time. The model predicts an upward market most of the time, giving it a high sensitivity but mediocre scores on the other metrics.
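These metrics can be reproduced by hand from the counts in the confusion matrix above; a minimal sketch:
# counts from the confusion matrix (rows are predictions, columns are the reference)
TN <- 54; FN <- 48; FP <- 430; TP <- 557
(TP + TN) / (TP + TN + FP + FN) # accuracy, about 0.56
TP / (TP + FP) # positive predictive value / precision, about 0.56
TP / (TP + FN) # sensitivity / recall, about 0.92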
weekly.before.2008 = weekly[weekly$Year <= 2008,]
weekly.after.2008 = weekly[weekly$Year > 2008,]
logistic.2d <- glm(Direction ~ Lag2, data=weekly.before.2008, family = 'binomial')
predicted.2d <- predict(logistic.2d, newdata=weekly.after.2008,type='response')
predicted.2d <- ifelse(predicted.2d >= 0.5, 'Up', 'Down')
# compute confusion matrix
cm.lr <- confusionMatrix(as.factor(predicted.2d), as.factor(weekly.after.2008$Direction), mode='everything',positive='Up')
print(cm.lr)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.9180
## Specificity : 0.2093
## Pos Pred Value : 0.6222
## Neg Pred Value : 0.6429
## Precision : 0.6222
## Recall : 0.9180
## F1 : 0.7417
## Prevalence : 0.5865
## Detection Rate : 0.5385
## Detection Prevalence : 0.8654
## Balanced Accuracy : 0.5637
##
## 'Positive' Class : Up
##
After training on data from 2008 and earlier and testing on 2009 and 2010, we obtain an accuracy (the fraction of correct predictions) of 62.5%, an improvement over what we found in part (c).
lda.fit <- lda(Direction ~ Lag2, data=weekly.before.2008)
predicted.lda <- predict(lda.fit, newdata=weekly.after.2008)
# predicted.lda <- ifelse(predicted.lda >= 0.5, 'Up', 'Down') # not needed: predict() for LDA already returns the predicted class
# compute confusion matrix
cm.lda <- confusionMatrix(as.factor(predicted.lda$class), as.factor(weekly.after.2008$Direction), mode='everything',positive='Up')
print(cm.lda)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.9180
## Specificity : 0.2093
## Pos Pred Value : 0.6222
## Neg Pred Value : 0.6429
## Precision : 0.6222
## Recall : 0.9180
## F1 : 0.7417
## Prevalence : 0.5865
## Detection Rate : 0.5385
## Detection Prevalence : 0.8654
## Balanced Accuracy : 0.5637
##
## 'Positive' Class : Up
##
Using LDA to predict for 2009 and 2010 gives us an accuracy of 62.5%. This performance is identical to logistic regression using Lag2.
qda.fit <- qda(Direction ~ Lag2, data=weekly.before.2008)
predicted.qda <- predict(qda.fit, newdata=weekly.after.2008)
# predicted.qda <- ifelse(predicted.qda >= 0.5, 'Up', 'Down') # also not needed
# compute confusion matrix
cm.qda <- confusionMatrix(as.factor(predicted.qda$class), as.factor(weekly.after.2008$Direction), mode='everything',positive='Up')
print(cm.qda)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 0
## Up 43 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.5419
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.5865
## Neg Pred Value : NaN
## Precision : 0.5865
## Recall : 1.0000
## F1 : 0.7394
## Prevalence : 0.5865
## Detection Rate : 0.5865
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Up
##
Using QDA we get an accuracy of 58.65%. The sensitivity of 100% is misleading, however, because the model predicts ‘Up’ for every observation; a model that always predicts the positive class will trivially capture every true positive.
knn.predicted <- knn(train=weekly.before.2008[3], test=weekly.after.2008[3], cl = weekly.before.2008[,9], k=1)
# index for lag2 is 3, index for direction is 9
# compute confusion matrix
cm.knn <- confusionMatrix(as.factor(knn.predicted), as.factor(weekly.after.2008[,9]), mode='everything',positive='Up')
print(cm.knn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 21 29
## Up 22 32
##
## Accuracy : 0.5096
## 95% CI : (0.4097, 0.609)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.9540
##
## Kappa : 0.0127
##
## Mcnemar's Test P-Value : 0.4008
##
## Sensitivity : 0.5246
## Specificity : 0.4884
## Pos Pred Value : 0.5926
## Neg Pred Value : 0.4200
## Precision : 0.5926
## Recall : 0.5246
## F1 : 0.5565
## Prevalence : 0.5865
## Detection Rate : 0.3077
## Detection Prevalence : 0.5192
## Balanced Accuracy : 0.5065
##
## 'Positive' Class : Up
##
Using KNN with k=1, we find an accuracy of about 51%, which is the worst performance so far.
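One caveat: knn() breaks voting ties at random, so results for small k can vary slightly from run to run. A minimal sketch that fixes the seed for reproducibility (the seed value itself is arbitrary):
set.seed(1) # arbitrary seed; makes the random tie-breaking in knn() reproducible
knn.predicted <- knn(train=weekly.before.2008[3], test=weekly.after.2008[3], cl=weekly.before.2008[,9], k=1)
mean(knn.predicted == weekly.after.2008[,9]) # test-set accuracy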
nb.fit <- naiveBayes(Direction ~ Lag2, data=weekly.before.2008)
pred.nb <- predict(nb.fit, newdata = weekly.after.2008, type = "raw")
pred.nb <- ifelse(pred.nb[,2] >= 0.5, 'Up', 'Down') # column 2 represents likelihood of 'Up'
cm.nb <- confusionMatrix(as.factor(pred.nb), as.factor(weekly.after.2008$Direction),mode='everything',positive='Up')
print(cm.nb)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 0
## Up 43 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.5419
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.5865
## Neg Pred Value : NaN
## Precision : 0.5865
## Recall : 1.0000
## F1 : 0.7394
## Prevalence : 0.5865
## Detection Rate : 0.5865
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Up
##
We see an accuracy of 58.65% using naive Bayes, the same as QDA. We also see the same issue as with QDA: the model predicts ‘Up’ for every observation, giving a sensitivity of 100%.
The best performing models using Lag2 were LDA and logistic regression, both of which produced exactly the same predictions and an overall accuracy of 62.5%.
We will experiment with different predictor subsets for each model: Lag1 through Lag3, all five lag variables, and all five lag variables plus Volume. We will not investigate logistic regression further, since Lag2 was the only term shown to be significant.
# LDA
# lag 1,2,3
lda.fit <- lda(Direction ~ Lag1 + Lag2 + Lag3, data=weekly.before.2008)
predicted.lda <- predict(lda.fit, newdata = weekly.after.2008, type = "raw")
cm <- confusionMatrix(as.factor(predicted.lda$class), as.factor(weekly.after.2008$Direction), mode='everything', positive='Up')
print('LDA with lag 1, 2, and 3:', quote=F)
## [1] LDA with lag 1, 2, and 3:
print(cm$table)
## Reference
## Prediction Down Up
## Down 8 9
## Up 35 52
print(cm$overall[1])
## Accuracy
## 0.5769231
print(cm$byClass)
## Sensitivity Specificity Pos Pred Value
## 0.8524590 0.1860465 0.5977011
## Neg Pred Value Precision Recall
## 0.4705882 0.5977011 0.8524590
## F1 Prevalence Detection Rate
## 0.7027027 0.5865385 0.5000000
## Detection Prevalence Balanced Accuracy
## 0.8365385 0.5192528
# all lag
lda.fit <- lda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, data=weekly.before.2008)
predicted.lda <- predict(lda.fit, newdata = weekly.after.2008, type = "raw")
cm <- confusionMatrix(as.factor(predicted.lda$class), as.factor(weekly.after.2008$Direction), mode='everything', positive='Up')
print('LDA with all lag:', quote=F)
## [1] LDA with all lag:
print(cm$table)
## Reference
## Prediction Down Up
## Down 9 13
## Up 34 48
print(cm$overall[1])
## Accuracy
## 0.5480769
print(cm$byClass)
## Sensitivity Specificity Pos Pred Value
## 0.7868852 0.2093023 0.5853659
## Neg Pred Value Precision Recall
## 0.4090909 0.5853659 0.7868852
## F1 Prevalence Detection Rate
## 0.6713287 0.5865385 0.4615385
## Detection Prevalence Balanced Accuracy
## 0.7884615 0.4980938
# all lag, volume
lda.fit <- lda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data=weekly.before.2008)
predicted.lda <- predict(lda.fit, newdata = weekly.after.2008, type = "raw")
cm <- confusionMatrix(as.factor(predicted.lda$class), as.factor(weekly.after.2008$Direction), mode='everything', positive='Up')
print('LDA with all lag and volume:', quote=F)
## [1] LDA with all lag and volume:
print(cm$table)
## Reference
## Prediction Down Up
## Down 31 44
## Up 12 17
print(cm$overall[1])
## Accuracy
## 0.4615385
print(cm$byClass)
## Sensitivity Specificity Pos Pred Value
## 0.2786885 0.7209302 0.5862069
## Neg Pred Value Precision Recall
## 0.4133333 0.5862069 0.2786885
## F1 Prevalence Detection Rate
## 0.3777778 0.5865385 0.1634615
## Detection Prevalence Balanced Accuracy
## 0.2788462 0.4998094
We see that the best LDA performance comes from using only Lag2, with an accuracy of 62.5%, higher than any of the other LDA models shown. Performance tends to worsen as more variables are added; the one exception is specificity, which rises to 72% with all lag variables plus Volume compared to 21% with Lag2 alone.
# QDA
# lag 1,2,3
qda.fit <- qda(Direction ~ Lag1 + Lag2 + Lag3, data=weekly.before.2008)
predicted.qda <- predict(qda.fit, newdata = weekly.after.2008, type = "raw")
cm <- confusionMatrix(as.factor(predicted.qda$class), as.factor(weekly.after.2008$Direction), mode='everything', positive='Up')
print('QDA with lag 1, 2, and 3:', quote=F)
## [1] QDA with lag 1, 2, and 3:
print(cm$table)
## Reference
## Prediction Down Up
## Down 6 10
## Up 37 51
print(cm$overall[1])
## Accuracy
## 0.5480769
print(cm$byClass)
## Sensitivity Specificity Pos Pred Value
## 0.8360656 0.1395349 0.5795455
## Neg Pred Value Precision Recall
## 0.3750000 0.5795455 0.8360656
## F1 Prevalence Detection Rate
## 0.6845638 0.5865385 0.4903846
## Detection Prevalence Balanced Accuracy
## 0.8461538 0.4878002
# all lag
qda.fit <- qda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, data=weekly.before.2008)
predicted.qda <- predict(qda.fit, newdata = weekly.after.2008, type = "raw")
cm <- confusionMatrix(as.factor(predicted.qda$class), as.factor(weekly.after.2008$Direction), mode='everything', positive='Up')
print('QDA with all lag:', quote=F)
## [1] QDA with all lag:
print(cm$table)
## Reference
## Prediction Down Up
## Down 10 23
## Up 33 38
print(cm$overall[1])
## Accuracy
## 0.4615385
print(cm$byClass)
## Sensitivity Specificity Pos Pred Value
## 0.6229508 0.2325581 0.5352113
## Neg Pred Value Precision Recall
## 0.3030303 0.5352113 0.6229508
## F1 Prevalence Detection Rate
## 0.5757576 0.5865385 0.3653846
## Detection Prevalence Balanced Accuracy
## 0.6826923 0.4277545
# all lag and volume
qda.fit <- qda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data=weekly.before.2008)
predicted.qda <- predict(qda.fit, newdata = weekly.after.2008, type = "raw")
cm <- confusionMatrix(as.factor(predicted.qda$class), as.factor(weekly.after.2008$Direction), mode='everything', positive='Up')
print('QDA with all lag and volume:', quote=F)
## [1] QDA with all lag and volume:
print(cm$table)
## Reference
## Prediction Down Up
## Down 33 49
## Up 10 12
print(cm$overall[1])
## Accuracy
## 0.4326923
print(cm$byClass)
## Sensitivity Specificity Pos Pred Value
## 0.1967213 0.7674419 0.5454545
## Neg Pred Value Precision Recall
## 0.4024390 0.5454545 0.1967213
## F1 Prevalence Detection Rate
## 0.2891566 0.5865385 0.1153846
## Detection Prevalence Balanced Accuracy
## 0.2115385 0.4820816
Similarly, for QDA the overall accuracy decreases as terms are added, starting at 59% and falling to a low of 43%. The sensitivity of 100% with Lag2 alone also drops to about 20%, while, interestingly, the specificity increases from 0% to 77%.
Next we will perform KNN for the various subsets for k values of 1, 3, 5, 10, 15, 20, 25, and 30. To make things easier to read we will only compare the overall accuracy of each.
k.vals <- c(1,3,5,10,15,20,25,30)
# lag 2
print('KNN with lag2:', quote=F)
## [1] KNN with lag2:
index <- 3
for (k in k.vals){
print(paste('For k=',k, sep=''),quote=F)
knn.fit <- knn(train=weekly.before.2008[index], test=weekly.after.2008[index], cl=weekly.before.2008[,9], k=k)
cm <- confusionMatrix(as.factor(knn.fit), as.factor(weekly.after.2008[,9]), mode='everything',positive='Up')
# print(cm$table)
print(cm$overall[1])
# print(cm$byClass[1])
# print(cm$byClass[2])
}
## [1] For k=1
## Accuracy
## 0.5
## [1] For k=3
## Accuracy
## 0.5384615
## [1] For k=5
## Accuracy
## 0.5384615
## [1] For k=10
## Accuracy
## 0.5865385
## [1] For k=15
## Accuracy
## 0.5865385
## [1] For k=20
## Accuracy
## 0.5576923
## [1] For k=25
## Accuracy
## 0.5480769
## [1] For k=30
## Accuracy
## 0.5288462
# lag1,2,3
print('KNN with lag 1,2,3:', quote=F)
## [1] KNN with lag 1,2,3:
index <- c(2,3,4)
for (k in k.vals){
print(paste('For k=',k, sep=''),quote=F)
knn.fit <- knn(train=weekly.before.2008[index], test=weekly.after.2008[index], cl=weekly.before.2008[,9], k=k)
cm <- confusionMatrix(as.factor(knn.fit), as.factor(weekly.after.2008[,9]), mode='everything',positive='Up')
# print(cm$table)
print(cm$overall[1])
# print(cm$byClass[1])
# print(cm$byClass[2])
}
## [1] For k=1
## Accuracy
## 0.4903846
## [1] For k=3
## Accuracy
## 0.5096154
## [1] For k=5
## Accuracy
## 0.5288462
## [1] For k=10
## Accuracy
## 0.5769231
## [1] For k=15
## Accuracy
## 0.5865385
## [1] For k=20
## Accuracy
## 0.6153846
## [1] For k=25
## Accuracy
## 0.6153846
## [1] For k=30
## Accuracy
## 0.6153846
# all lag
print('KNN with all lag:', quote=F)
## [1] KNN with all lag:
index <- c(2,3,4,5,6)
for (k in k.vals){
print(paste('For k=',k, sep=''),quote=F)
knn.fit <- knn(train=weekly.before.2008[index], test=weekly.after.2008[index], cl=weekly.before.2008[,9], k=k)
cm <- confusionMatrix(as.factor(knn.fit), as.factor(weekly.after.2008[,9]), mode='everything',positive='Up')
# print(cm$table)
print(cm$overall[1])
# print(cm$byClass[1])
# print(cm$byClass[2])
}
## [1] For k=1
## Accuracy
## 0.5192308
## [1] For k=3
## Accuracy
## 0.5576923
## [1] For k=5
## Accuracy
## 0.5480769
## [1] For k=10
## Accuracy
## 0.6153846
## [1] For k=15
## Accuracy
## 0.5384615
## [1] For k=20
## Accuracy
## 0.5673077
## [1] For k=25
## Accuracy
## 0.5
## [1] For k=30
## Accuracy
## 0.5096154
# all lag, volume
print('KNN with all lag, volume:', quote=F)
## [1] KNN with all lag, volume:
index <- c(2,3,4,5,6,7)
for (k in k.vals){
print(paste('For k=',k, sep=''),quote=F)
knn.fit <- knn(train=weekly.before.2008[index], test=weekly.after.2008[index], cl=weekly.before.2008[,9], k=k)
cm <- confusionMatrix(as.factor(knn.fit), as.factor(weekly.after.2008[,9]), mode='everything',positive='Up')
# print(cm$table)
print(cm$overall[1])
# print(cm$byClass[1])
# print(cm$byClass[2])
}
## [1] For k=1
## Accuracy
## 0.4807692
## [1] For k=3
## Accuracy
## 0.5096154
## [1] For k=5
## Accuracy
## 0.5
## [1] For k=10
## Accuracy
## 0.5192308
## [1] For k=15
## Accuracy
## 0.5576923
## [1] For k=20
## Accuracy
## 0.5192308
## [1] For k=25
## Accuracy
## 0.5480769
## [1] For k=30
## Accuracy
## 0.5480769
print('KNN with lag 1, 2, and 3, and k=25:', quote=F)
## [1] KNN with lag 1, 2, and 3, and k=25:
index=c(2,3,4)
knn.fit <- knn(train=weekly.before.2008[index], test=weekly.after.2008[index], cl=weekly.before.2008[,9], k=25)
cm <- confusionMatrix(as.factor(knn.fit), as.factor(weekly.after.2008[,9]), mode='everything',positive='Up')
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 22 19
## Up 21 42
##
## Accuracy : 0.6154
## 95% CI : (0.5149, 0.7091)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.3110
##
## Kappa : 0.2015
##
## Mcnemar's Test P-Value : 0.8744
##
## Sensitivity : 0.6885
## Specificity : 0.5116
## Pos Pred Value : 0.6667
## Neg Pred Value : 0.5366
## Precision : 0.6667
## Recall : 0.6885
## F1 : 0.6774
## Prevalence : 0.5865
## Detection Rate : 0.4038
## Detection Prevalence : 0.6058
## Balanced Accuracy : 0.6001
##
## 'Positive' Class : Up
##
k=25 appears to give the best KNN performance overall, reaching a peak accuracy of about 61.5% when using Lag1 through Lag3 as predictors. Adding predictors helps KNN up to a point, but performance declines once Volume is added; the worst performance comes from all lag variables plus Volume. KNN is the only model that appears to improve with additional variables, yet even KNN suffers a drop in performance when all predictors are used.
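Rather than reading the accuracies out of separate print statements, the same comparison could be collected into one table; a minimal sketch for the Lag1 through Lag3 subset, assuming the objects defined above:
index <- c(2,3,4) # Lag1, Lag2, Lag3
acc <- sapply(k.vals, function(k){
  pred <- knn(train=weekly.before.2008[index], test=weekly.after.2008[index], cl=weekly.before.2008[,9], k=k)
  mean(pred == weekly.after.2008[,9]) # test-set accuracy for this k
})
data.frame(k=k.vals, accuracy=round(acc, 4))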
# NB
# lag 1,2,3
nb.fit <- naiveBayes(Direction ~ Lag1 + Lag2 + Lag3, data=weekly.before.2008)
pred.nb <- predict(nb.fit, newdata = weekly.after.2008, type = "raw")
pred.nb <- ifelse(pred.nb[,2] >= 0.5, 'Up', 'Down') # column 2 represents likelihood of 'Up'
cm <- confusionMatrix(as.factor(pred.nb), as.factor(weekly.after.2008$Direction),mode='everything',positive='Up')
print('NB with lag 1,2,3:', quote=F)
## [1] NB with lag 1,2,3:
print(cm$table)
## Reference
## Prediction Down Up
## Down 5 10
## Up 38 51
print(cm$overall[1])
## Accuracy
## 0.5384615
print(cm$byClass)
## Sensitivity Specificity Pos Pred Value
## 0.8360656 0.1162791 0.5730337
## Neg Pred Value Precision Recall
## 0.3333333 0.5730337 0.8360656
## F1 Prevalence Detection Rate
## 0.6800000 0.5865385 0.4903846
## Detection Prevalence Balanced Accuracy
## 0.8557692 0.4761723
# all lag
nb.fit <- naiveBayes(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, data=weekly.before.2008)
pred.nb <- predict(nb.fit, newdata = weekly.after.2008, type = "raw")
pred.nb <- ifelse(pred.nb[,2] >= 0.5, 'Up', 'Down') # column 2 represents likelihood of 'Up'
cm <- confusionMatrix(as.factor(pred.nb), as.factor(weekly.after.2008$Direction),mode='everything',positive='Up')
print('NB with all lag:', quote=F)
## [1] NB with all lag:
print(cm$table)
## Reference
## Prediction Down Up
## Down 10 21
## Up 33 40
print(cm$overall[1])
## Accuracy
## 0.4807692
print(cm$byClass)
## Sensitivity Specificity Pos Pred Value
## 0.6557377 0.2325581 0.5479452
## Neg Pred Value Precision Recall
## 0.3225806 0.5479452 0.6557377
## F1 Prevalence Detection Rate
## 0.5970149 0.5865385 0.3846154
## Detection Prevalence Balanced Accuracy
## 0.7019231 0.4441479
# all lag, volume
nb.fit <- naiveBayes(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data=weekly.before.2008)
pred.nb <- predict(nb.fit, newdata = weekly.after.2008, type = "raw")
pred.nb <- ifelse(pred.nb[,2] >= 0.5, 'Up', 'Down') # column 2 represents likelihood of 'Up'
cm <- confusionMatrix(as.factor(pred.nb), as.factor(weekly.after.2008$Direction),mode='everything',positive='Up')
print('NB with all lag, volume:', quote=F)
## [1] NB with all lag, volume:
print(cm$table)
## Reference
## Prediction Down Up
## Down 42 56
## Up 1 5
print(cm$overall[1])
## Accuracy
## 0.4519231
print(cm$byClass)
## Sensitivity Specificity Pos Pred Value
## 0.08196721 0.97674419 0.83333333
## Neg Pred Value Precision Recall
## 0.42857143 0.83333333 0.08196721
## F1 Prevalence Detection Rate
## 0.14925373 0.58653846 0.04807692
## Detection Prevalence Balanced Accuracy
## 0.05769231 0.52935570
We see a similar pattern for naive Bayes to what occurred with LDA and QDA. The accuracy steadily decreases from 59% with just Lag2 down to 45% with all lag variables plus Volume. As with QDA, the initial sensitivity of 100% drops to 8% once the model stops predicting only positive responses, with a corresponding increase in specificity from 0% to 98%.
print('Logistic regression:', quote=F)
## [1] Logistic regression:
print(cm.lr)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.9180
## Specificity : 0.2093
## Pos Pred Value : 0.6222
## Neg Pred Value : 0.6429
## Precision : 0.6222
## Recall : 0.9180
## F1 : 0.7417
## Prevalence : 0.5865
## Detection Rate : 0.5385
## Detection Prevalence : 0.8654
## Balanced Accuracy : 0.5637
##
## 'Positive' Class : Up
##
print('LDA:', quote=F)
## [1] LDA:
print(cm.lda)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.9180
## Specificity : 0.2093
## Pos Pred Value : 0.6222
## Neg Pred Value : 0.6429
## Precision : 0.6222
## Recall : 0.9180
## F1 : 0.7417
## Prevalence : 0.5865
## Detection Rate : 0.5385
## Detection Prevalence : 0.8654
## Balanced Accuracy : 0.5637
##
## 'Positive' Class : Up
##
After conducting this further analysis, the best performance came from logistic regression and LDA with Lag2 as the only predictor, both achieving an accuracy of 62.5%. Adding extra predictors generally only decreased model performance.
KNN came closest to benefiting from extra variables: with k=25 and Lag1 through Lag3 it showed some improvement, but it only reached an accuracy of about 61.5%.
Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.
Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.
# creating the response variable
med.crim <- median(Boston$crim)
# 0 for lower crim than median, 1 for higher
is.high.crime <- apply(Boston, 1, function(x) {
ifelse(x['crim'] > med.crim, 1, 0) # a tract exactly at the median counts as low crime
})
boston.crime <- Boston[-1] # remove crim from dataset
boston.crime$high.crime <- is.high.crime
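An equivalent and somewhat more direct way to build the response (a sketch; it produces the same 0/1 vector as the apply() call above):
# 1 if the tract's crime rate is above the median, 0 otherwise
boston.crime <- Boston[-1] # drop crim itself
boston.crime$high.crime <- as.integer(Boston$crim > median(Boston$crim))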
After creating the response variable, we next decide which subsets of predictors to use. First, we will use stepAIC with stepwise selection to find the optimal logistic regression model.
# logistic regression
full.model <- glm(high.crime~. , data=boston.crime, family='binomial')
null.model <- glm(high.crime~1, data=boston.crime, family='binomial')
model.selection <- stepAIC(null.model, direction = 'both', scope=list(lower=null.model, upper=full.model), trace = F)
summary(model.selection)
##
## Call:
## glm(formula = high.crime ~ nox + rad + tax + zn + black + dis +
## ptratio + medv + age, family = "binomial", data = boston.crime)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -31.441272 6.048989 -5.198 2.02e-07 ***
## nox 43.195824 6.452812 6.694 2.17e-11 ***
## rad 0.718773 0.142066 5.059 4.21e-07 ***
## tax -0.007676 0.002503 -3.066 0.00217 **
## zn -0.082567 0.031424 -2.628 0.00860 **
## black -0.012866 0.006334 -2.031 0.04224 *
## dis 0.634380 0.207634 3.055 0.00225 **
## ptratio 0.303502 0.109255 2.778 0.00547 **
## medv 0.112882 0.034362 3.285 0.00102 **
## age 0.022851 0.009894 2.310 0.02091 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 701.46 on 505 degrees of freedom
## Residual deviance: 216.22 on 496 degrees of freedom
## AIC: 236.22
##
## Number of Fisher Scoring iterations: 9
As we can see, the optimal logistic regression model uses the following predictors: nox + rad + tax + zn + black + dis + ptratio + medv + age.
We will build the predictor subsets for this problem in that order: first nox alone, then nox + rad, then nox + rad + tax, and so on. We will do this for LR, LDA, NB, and KNN (with k values of 1, 3, 5, 10, 15, 20, 25, and 30). The metrics used to compare the models will be accuracy, sensitivity, specificity, and F1 score, which we will compare visually at the end.
# creating test and training dataset
tr.size = floor(0.7*nrow(boston.crime))
tr.ind = sample(seq_len(nrow(boston.crime)),tr.size)
boston.tr = boston.crime[tr.ind,]
# table(boston.tr$high.crime)
boston.te = boston.crime[-tr.ind,] # all rows not sampled for training
# table(boston.te$high.crime)
Now that we have the desired subsets and have split the data into the testing and training sets, we can proceed with fitting the models.
Note: For the sake of readability the code for the model fit and graph generation is hidden. The code can be found in the .rmd file submitted alongside the report. Additionally, the graphs for all KNN models will be found separately to avoid crowding; the model for k=1 will be kept with the other models as a comparison.
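For concreteness, here is a minimal sketch of what the logistic regression portion of that hidden loop might look like, using the stepwise predictor order found above and the same confusionMatrix-based metrics (the object names are illustrative, not the exact hidden code):
# nested predictor subsets, in the order chosen by stepAIC
ordered.preds <- c('nox','rad','tax','zn','black','dis','ptratio','medv','age')
lr.results <- t(sapply(seq_along(ordered.preds), function(p){
  fmla <- reformulate(ordered.preds[1:p], response='high.crime')
  fit <- glm(fmla, data=boston.tr, family='binomial')
  pred <- ifelse(predict(fit, newdata=boston.te, type='response') >= 0.5, 1, 0)
  cm <- confusionMatrix(factor(pred, levels=c(0,1)), factor(boston.te$high.crime, levels=c(0,1)), mode='everything', positive='1')
  c(n.predictors=p, accuracy=unname(cm$overall['Accuracy']), sensitivity=unname(cm$byClass['Sensitivity']), specificity=unname(cm$byClass['Specificity']))
}))
lr.results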
In terms of sensitivity, the KNN models performed significantly better than the other models; the closest non-KNN performer was logistic regression, which still did worse than most of the KNN models. Among the KNN models, the best performer was k=1 with 6 and 7 predictors, which reached a sensitivity of around 93%.
For specificity we see logistic regression has the best overall performance, particularly at 5 predictors with an estimated specificity of 98%. The best KNN model is again k=1 which appears to perform only slightly worse than LR at 4 predictors.
For accuracy, we again see that KNN models perform better with the best model of k=1 achieving an accuracy of 96% at 4 predictors. LR also performs well and approaches k=1 at 9 predictors with an accuracy just below 90%, but is ultimately outperformed.
We see a very similar picture with the F1 score as we saw in the accuracy. KNN with k=1 performs the best at 4 predictors with an F1 of around 97%, but LR gets close at 9 predictors.
Overall, the best performing model was KNN with k=1 and 4 predictors (nox + rad + tax + zn), as it outperformed every other model on all metrics except specificity, where LR did better. There is a concern about overfitting, however, since no separate validation set was held out beyond the single train/test split.
While LR was outperformed by KNN with k=1, the optimal 9-predictor LR model still performed relatively well on all metrics, roughly 5 to 7 percentage points behind the best k=1 model. Given the likelihood that the k=1 KNN model overfits, LR would be a solid contender for the best model.
This churn data set comes from the IBM Sample Data Sets. Customer churn, also known as customer attrition, occurs when customers stop doing business with a company. The data set contains 5,000 rows (customers) and 20 columns (features). The “churn” column is our target, which indicates whether the customer churned (left the company) or not.
## 'data.frame': 5000 obs. of 20 variables:
## $ state : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
## $ area.code : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
## $ account.length: int 128 107 137 84 75 118 121 147 117 141 ...
## $ voice.plan : Factor w/ 2 levels "yes","no": 1 1 2 2 2 2 1 2 2 1 ...
## $ voice.messages: int 25 26 0 0 0 0 24 0 0 37 ...
## $ intl.plan : Factor w/ 2 levels "yes","no": 2 2 2 1 1 1 2 1 2 1 ...
## $ intl.mins : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
## $ intl.calls : int 3 3 5 7 3 6 7 6 4 5 ...
## $ intl.charge : num 2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
## $ day.mins : num 265 162 243 299 167 ...
## $ day.calls : int 110 123 114 71 113 98 88 79 97 84 ...
## $ day.charge : num 45.1 27.5 41.4 50.9 28.3 ...
## $ eve.mins : num 197.4 195.5 121.2 61.9 148.3 ...
## $ eve.calls : int 99 103 110 88 122 101 108 94 80 111 ...
## $ eve.charge : num 16.78 16.62 10.3 5.26 12.61 ...
## $ night.mins : num 245 254 163 197 187 ...
## $ night.calls : int 91 103 104 89 121 118 118 96 90 97 ...
## $ night.charge : num 11.01 11.45 7.32 8.86 8.41 ...
## $ customer.calls: int 1 1 0 2 3 0 3 0 1 0 ...
## $ churn : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
We can see the imbalance if we separate the customers by whether they have churned or not:
##
## yes no
## 707 4293
As we can see, the data is significantly imbalanced: only about 14% of the customers have churned.
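The imbalance can also be expressed as a proportion directly; a one-line check:
prop.table(table(churn$churn)) # roughly 0.14 'yes' vs 0.86 'no'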
This presents a problem for resampling methods. For example, in K fold cross validation with k=10, the folds may contain very different numbers of churned customers, which can have a significant impact on the model fit. While this risk is always present in K fold CV, it is much more pronounced when one class is far rarer than the other.
To compensate, we can use stratified K fold cross validation, which preserves the class distribution of the data set within each fold and lessens this impact on the model fit.
We will analyze the sensitivity, specificity, accuracy, F1 score, and precision, but we will focus on optimizing precision. Precision is the positive predictive value: it describes how reliable a positive prediction is. For a business analyzing churn, reliably predicting who will churn is much more important than predicting who will not, since it allows the company to focus its attention on the customers it is at risk of losing.
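For reference, precision (the positive predictive value) and sensitivity are defined from the confusion-matrix counts as:
\[ \text{Precision} = \frac{TP}{TP+FP}, \qquad \text{Sensitivity} = \frac{TP}{TP+FN} \]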
We will also use createDataPartition when constructing the folds for the K fold cross validation. This performs stratified sampling, keeping the proportion of positive cases consistent across folds and allowing for a fairer comparison between them.
Before doing anything we must preprocess the data. We fortunately do not have to worry about missing data as there is none in the dataset.
# sum of total missing values in each column:
for (col in colnames(churn)) {
print(col, quote=F)
print(sum(is.na(churn[,col])), quote=F)
}
## [1] state
## [1] 0
## [1] area.code
## [1] 0
## [1] account.length
## [1] 0
## [1] voice.plan
## [1] 0
## [1] voice.messages
## [1] 0
## [1] intl.plan
## [1] 0
## [1] intl.mins
## [1] 0
## [1] intl.calls
## [1] 0
## [1] intl.charge
## [1] 0
## [1] day.mins
## [1] 0
## [1] day.calls
## [1] 0
## [1] day.charge
## [1] 0
## [1] eve.mins
## [1] 0
## [1] eve.calls
## [1] 0
## [1] eve.charge
## [1] 0
## [1] night.mins
## [1] 0
## [1] night.calls
## [1] 0
## [1] night.charge
## [1] 0
## [1] customer.calls
## [1] 0
## [1] churn
## [1] 0
Our next step is to look for degenerate distributions with near-zero variance. We see that voice.messages meets this criterion, so we remove it from the data set.
# degenerate distributions
degen_ind <- nearZeroVar(churn) # we see voice.messages is degenerate
for (col in colnames(churn[degen_ind])) {
tab1(churn[col], main=col)
}
# remove zero var
seg.churn <- churn[, -degen_ind]
Next we can look at correlations between the variables, and remove any with a correlation higher than 0.75.
# correlation
numeric.churn <- seg.churn %>% select_if(is.numeric) # select(where(is.numeric))
corr.churn <- cor(numeric.churn)
corrplot(corr.churn) # some perfect correlation here. charges and minutes
# finding the variables to remove
highCorr <- findCorrelation(corr.churn, 0.75)
# removing highly correlated predictors
cols.to.remove <- colnames(numeric.churn)[highCorr]
nocorr.churn <- seg.churn[, !colnames(seg.churn) %in% cols.to.remove]
# much less correlated
corrplot(cor(numeric.churn[-highCorr]))
We saw a perfect correlation between each pair of mins and charge variables, which is expected since the charge is computed directly from the minutes.
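This can be verified directly for one of the pairs; a one-line check (output not shown):
# charge is computed from minutes, so the correlation should be essentially 1
cor(churn$day.mins, churn$day.charge)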
Next we check for multicollinearity among the remaining predictors, which we see is no longer an issue:
multicolinearity <- vif(glm(churn~ ., data=nocorr.churn, family='binomial'))
print(multicolinearity)
## GVIF Df GVIF^(1/(2*Df))
## state 1.290222 50 1.002551
## area.code 1.040925 2 1.010078
## account.length 1.018134 1 1.009026
## voice.plan 1.034209 1 1.016961
## intl.plan 1.127323 1 1.061755
## intl.mins 1.026628 1 1.013226
## intl.calls 1.016393 1 1.008163
## day.mins 1.077634 1 1.038091
## day.calls 1.016450 1 1.008192
## eve.mins 1.041854 1 1.020712
## eve.calls 1.021654 1 1.010769
## night.calls 1.021810 1 1.010846
## night.charge 1.033708 1 1.016714
## customer.calls 1.110768 1 1.053930
nomulticolinearity.churn <- nocorr.churn
The last step in preprocessing is to check for skewness.
# Function to check skewness of all numeric columns in a dataframe
check_skewness <- function(data) {
# Filter numeric columns
numeric_columns <- data[, sapply(data, is.numeric)]
# Calculate skewness for each numeric column
skewness_values <- sapply(numeric_columns, skewness, na.rm = TRUE)
# Display the skewness values
skewness_values
}
# Example usage
skewness_results <- check_skewness(nomulticolinearity.churn)
# print(skewness_results)
print(skewness_results[abs(skewness_results)>0.75]) # will transform predictors with |skewness| > 0.75
## intl.calls customer.calls
## 1.359876 1.041837
par(mfrow = c(1,2))
hist(churn$intl.calls)
hist(churn$customer.calls)
We see that two predictors have significant skew: intl.calls and customer.calls. Because these variables are not strictly positive (there are several zero values, as seen above), we cannot apply the Box-Cox transformation. Instead we will use the Yeo-Johnson transformation, an extension of Box-Cox that accommodates zero and negative values.
# Yeo Johnson transformation:
intl.calls.yj <- yeojohnson(nomulticolinearity.churn$intl.calls)
nomulticolinearity.churn$intl.calls <- intl.calls.yj$x.t # Transformed intl.calls
customer.calls.yj <- yeojohnson(nomulticolinearity.churn$customer.calls)
nomulticolinearity.churn$customer.calls <- customer.calls.yj$x.t # Transformed customer.calls
par(mfrow = c(1,2))
hist(nomulticolinearity.churn$intl.calls)
hist(nomulticolinearity.churn$customer.calls)
churn.processed <- nomulticolinearity.churn
As we can see, the data is significantly less skewed.
We can now proceed towards the cross validation. We will use 10 fold cross validation in this problem.
# set seed for reproducibility
set.seed(155)
# 10 fold cross validation
k <- 10
fold.indices <- createDataPartition(churn.processed$churn,p = 1/k, times=k, list=T)
folds.list <- list()
par(mfrow = c(2,5)) # 4,5 is too large
for(i in 1:k) {
# Get indices for the i-th fold
test.indices <- fold.indices[[i]]
# Split data into training and testing based on indices
test.data <- churn.processed[test.indices, ] # use i-th fold as test data
train.data <- churn.processed[-test.indices, ] # use rest as training data
# Store the train and test data in the folds list
folds.list[[i]] <- list("train.data" = train.data, "test.data" = test.data)
# Plot distribution of churn in training and testing sets
barplot(table(train.data$churn), main = paste('Training Fold', i))
barplot(table(test.data$churn), main = paste('Testing Fold', i))
}
We can see from the graphs that the distribution of positive and negative churn values has been preserved. Now that we have prepared our folds, we can fit the model for each fold and get a confusion matrix to measure the performance.
For our models we will use logistic regression with stepwise selection (via stepAIC) in each fold to find the best model, LDA with all predictors, NB with all predictors, and KNN with all numeric predictors for the following k values: 1, 3, 5, 10, 15, 20, 25, and 30.
# will return a cm for each fold for each model. will make comparisons based off that
model.performances <- lapply(folds.list, function(fold){
# print(str(fold$train.data))
train <- fold$train.data
test <- fold$test.data
# LR
full.model <- glm(churn~. , data=train, family='binomial')
null.model <- glm(churn~1, data=train, family='binomial')
best.lr <- stepAIC(null.model, direction = 'both', scope=list(lower=null.model, upper=full.model), trace = F)
# summary(model.selection)
# print(model.selection$call)
lr.pred <- predict(best.lr, newdata=test,type='response')
lr.pred <- ifelse(lr.pred >= 0.5, 'yes', 'no')
cm.lr <- confusionMatrix(as.factor(lr.pred), as.factor(test$churn), mode='everything',positive='yes')
# LDA
lda <- lda(churn~., data=train)
# lda <- lda(train[,c(-4,-15)], grouping = train[,15])
lda.pred <- predict(lda, newdata=test,type='response')
# lda.pred <- predict(lda, newdata=test[,c(-4,-15)],type='response')
cm.lda <- confusionMatrix(as.factor(lda.pred$class), as.factor(test$churn), mode='everything',positive='yes')
# NB
nb <- naiveBayes(churn~., data=train) # note: naiveBayes() takes no family argument
nb.pred <- predict(nb, newdata=test,type='raw')
nb.pred <- ifelse(nb.pred[,2] >= 0.5, 'yes','no')
cm.nb <- confusionMatrix(as.factor(nb.pred), as.factor(test$churn), mode='everything',positive='yes')
# KNN
k.vals <- c(1,3,5,10,15,20,25,30)
get.knn.cms <- function(k){
# given a value of k, get the cm
# Only use numeric columns
train_knn <- train[, sapply(train, is.numeric)]
test_knn <- test[, sapply(test, is.numeric)]
knn.fit <- knn(train = train_knn, test = test_knn, cl = train$churn, k = k)
cm <- confusionMatrix(as.factor(knn.fit), as.factor(test$churn), mode = 'everything', positive = 'yes')
return(list(k_val = as.character(k), cm.knn = cm))
}
cm.knns <- lapply(k.vals, get.knn.cms)
return(list(cm.lr=cm.lr, cm.lda=cm.lda, cm.nb=cm.nb, cm.knns=cm.knns))
})
Now that we have the models, we can see how their performance compares. We will collect the sensitivity, specificity, accuracy, F1 score, and precision of each model in each fold, and then compare their mean values.
model_names <- c("lr", "lda", "nb")
k.vals <- c(1,3,5,10,15,20,25,30)
# Initialize models list including KNN
models <- list(
lr = vector("list", 5),
lda = vector("list", 5),
nb = vector("list", 5),
knn = vector("list", length(k.vals))
)
# Name each sublist for the metrics
metric_names <- c("Sensitivity", "Specificity", "Accuracy", "F1_Score", "Precision")
names(models$lr) <- metric_names
names(models$nb) <- metric_names
names(models$lda) <- metric_names
# Initialize KNN metrics for each k value
for (k in seq_along(k.vals)) {
models$knn[[k]] <- vector("list", 5)
names(models$knn[[k]]) <- metric_names
}
# look at data, print sens, spec, acc, f1
for (i in seq_along(folds.list)){
fold.data <- model.performances[[i]]
# non KNN data
for (j in 1:3){
cm <- fold.data[[j]]
model_name <- model_names[j]
# enter each metric into the list
models[[model_name]]$Sensitivity[[i]] <- cm$byClass[1]
models[[model_name]]$Specificity[[i]] <- cm$byClass[2]
models[[model_name]]$Accuracy[[i]] <- cm$overall[1]
models[[model_name]]$F1_Score[[i]] <- cm$byClass[7]
models[[model_name]]$Precision[[i]] <- cm$byClass[5]
}
# KNN data
knn.data <- fold.data[[4]]
for (x in seq_along(k.vals)) {
k <- k.vals[x]
cm <- knn.data[[x]][[2]]
models$knn[[x]]$Sensitivity[[i]] <- cm$byClass[1]
models$knn[[x]]$Specificity[[i]] <- cm$byClass[2]
models$knn[[x]]$Accuracy[[i]] <- cm$overall[1]
models$knn[[x]]$F1_Score[[i]] <- cm$byClass[7]
models$knn[[x]]$Precision[[i]] <- cm$byClass[5]
# print(k)
# print(cm$table)
}
}
# non-KNN models
for (model_name in model_names) {
cat("Model:", model_name, "\n")
# get means for each metric
for (metric in metric_names) {
metric_values <- unlist(models[[model_name]][[metric]])
mean_value <- mean(metric_values)
cat(paste(metric, "mean:", mean_value), "\n")
}
cat("\n")
}
## Model: lr
## Sensitivity mean: 0.801408450704225
## Specificity mean: 0.0213953488372093
## Accuracy mean: 0.131936127744511
## F1_Score mean: 0.207345601042792
## Precision mean: 0.119078932526484
##
## Model: lda
## Sensitivity mean: 0.23943661971831
## Specificity mean: 0.961627906976744
## Accuracy mean: 0.859281437125748
## F1_Score mean: 0.323734799082412
## Precision mean: 0.512067218516668
##
## Model: nb
## Sensitivity mean: 0.795774647887324
## Specificity mean: 0.0116279069767442
## Accuracy mean: 0.122754491017964
## F1_Score mean: 0.204451163726151
## Precision mean: 0.117294063214245
# KNN models
cat("KNN Results:\n")
## KNN Results:
for (x in seq_along(k.vals)) {
k <- k.vals[x]
cat("K value:", k, "\n")
# get means for each metric
for (metric in metric_names) {
metric_values <- unlist(models$knn[[x]][[metric]])
mean_value <- mean(metric_values)
cat(paste(metric, "mean:", mean_value), "\n")
}
cat("\n")
}
## K value: 1
## Sensitivity mean: 0.311267605633803
## Specificity mean: 0.877906976744186
## Accuracy mean: 0.797604790419162
## F1_Score mean: 0.302567858649983
## Precision mean: 0.295867851407769
##
## K value: 3
## Sensitivity mean: 0.257746478873239
## Specificity mean: 0.950232558139535
## Accuracy mean: 0.852095808383234
## F1_Score mean: 0.329752101901635
## Precision mean: 0.460627173963411
##
## K value: 5
## Sensitivity mean: 0.242253521126761
## Specificity mean: 0.974651162790698
## Accuracy mean: 0.870858283433134
## F1_Score mean: 0.346218838705662
## Precision mean: 0.613222822263248
##
## K value: 10
## Sensitivity mean: 0.22112676056338
## Specificity mean: 0.983023255813953
## Accuracy mean: 0.875049900199601
## F1_Score mean: 0.332709803522071
## Precision mean: 0.680609580435667
##
## K value: 15
## Sensitivity mean: 0.216901408450704
## Specificity mean: 0.983720930232558
## Accuracy mean: 0.875049900199601
## F1_Score mean: 0.327811982985929
## Precision mean: 0.689641432910035
##
## K value: 20
## Sensitivity mean: 0.208450704225352
## Specificity mean: 0.985116279069767
## Accuracy mean: 0.875049900199601
## F1_Score mean: 0.318939860246865
## Precision mean: 0.692614955731963
##
## K value: 25
## Sensitivity mean: 0.194366197183099
## Specificity mean: 0.987209302325581
## Accuracy mean: 0.874850299401198
## F1_Score mean: 0.303679665495442
## Precision mean: 0.716749338179208
##
## K value: 30
## Sensitivity mean: 0.198591549295775
## Specificity mean: 0.987906976744186
## Accuracy mean: 0.876047904191617
## F1_Score mean: 0.310814801850237
## Precision mean: 0.733378544400216
Comparing the performance of each model, we see the best precision with KNN at k=30, which has a precision of 73% and a relatively high accuracy of about 88%. Since we chose to optimize precision in part b), this is our optimal model.
The other KNN models vary in performance, but precision increases with k, going from 30% at k=1 to the high of 73% at k=30. Accuracy also increases with k but begins to plateau around 87-88% from k=5 onward.
For the non-KNN models, LDA has the best precision at 51%, while LR and NB perform similarly poorly, each with a precision of about 12% and an accuracy of about 12-13%.
While we did not choose to optimize it in part b), it is important to take note of the models’ sensitivities; a particularly risk-averse company may prioritize detecting every customer at risk of churning and accept some loss of precision. In that case, the highest-sensitivity models are LR and NB, both with a sensitivity of about 80%. This sensitivity should not be taken at face value, however; as we saw in problem 2, sensitivity and specificity can be inflated when a model predicts one class far more often than it actually occurs. We can examine the confusion matrices of these models:
cms <- model.performances[[1]]
lr.cm <- cms$cm.lr
nb.cm <- cms$cm.nb
print('LR:', quote=F)
## [1] LR:
print(lr.cm$table)
## Reference
## Prediction yes no
## yes 58 422
## no 13 8
print('NB:',quote=F)
## [1] NB:
print(nb.cm$table)
## Reference
## Prediction yes no
## yes 50 423
## no 21 7
We can see that sensitivity is an unreliable metric here, because the models over-predict positive values by a large margin. This is what produces a high sensitivity alongside low precision and accuracy. With imbalanced data, where positive values are much rarer than negative ones, this behavior hurts overall performance, as reflected in the low accuracy and precision for both logistic regression and naive Bayes.
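Working from the logistic regression table above (TP = 58, FP = 422, FN = 13, TN = 8), the point is easy to quantify:
TP <- 58; FP <- 422; FN <- 13; TN <- 8 # counts from the LR confusion matrix above
TP / (TP + FN) # sensitivity, about 0.82: most churners are flagged...
TP / (TP + FP) # precision, about 0.12: ...but few 'yes' predictions are correct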
Overall, if we consider the best model to be the one that optimizes the classification statistic chosen in part b), KNN with k=30 is the best model. It provides a high precision of 73%, which matters to companies looking to identify which customers are at risk of churning.