Problem 1: Exercise 4.8.7

Suppose that we wish to predict whether a given stock will issue a dividend this year (“Yes” or “No”) based on X, last year’s percent profit. We examine a large number of companies and discover that the mean value of X for companies that issued a dividend was \(\bar{X} = 10\), while the mean for those that didn’t was \(\bar{X} = 0\). In addition, the variance of X for these two sets of companies was \(\hat{\sigma}^2 = 36\). Finally, 80% of companies issued dividends. Assuming that X follows a normal distribution, predict the probability that a company will issue a dividend this year given that its percentage profit was X = 4 last year.

In order to solve this problem, we will use Bayes’ theorem. We will use \(D_y\) to signify issuing dividends and \(D_n\) to signify not issuing dividends. The prior probabilities are \(P(D_y) = \pi_y = 0.8\) and \(P(D_n) = \pi_n = 0.2\), and we can calculate the density of a given \(X\) within each class using the density function for a normal random variable, shown below:

\[ f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(x-\mu)^2}{2\sigma^2}} \]

Using this, we can calculate \(P(X=4|D_y) = f_y(X=4) \approx 0.04\) and \(P(X=4|D_n) = f_n(X=4) \approx 0.05\). Now that we have all of the pieces, we can calculate the probability that a company will issue a dividend given that its percentage profit last year was \(X=4\):

\[ P(D_y|X=4)=\frac{P(X=4|D_y)P(D_y)}{P(X=4|D_y)P(D_y)+P(X=4|D_n)P(D_n)} \]

\[ P(D_y|X=4)=\frac{0.04\times0.8}{0.04\times0.8 + 0.05\times0.2} \]

\[ P(D_y|X=4)\approx0.76 \]

Thus, we can say that the probability is approximately 76% (about 75% if the densities are not rounded to two decimal places).
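The same calculation can be reproduced numerically with dnorm; a minimal sketch (the comments show rounded values):

# posterior probability of a dividend at X = 4, via Bayes' theorem
pi.y <- 0.8; pi.n <- 0.2
f.y <- dnorm(4, mean = 10, sd = 6)   # approximately 0.040
f.n <- dnorm(4, mean = 0,  sd = 6)   # approximately 0.053
(pi.y * f.y) / (pi.y * f.y + pi.n * f.n)   # approximately 0.752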

Problem 2: Exercise 4.8.13

This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

  (a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
# there are 21 years of data; 1990 has 47 weeks instead of 52, and 1996 has 53 weeks
weekly <- Weekly
# head(Weekly)
summary(Weekly[-1]) # gets rid of year
##       Lag1               Lag2               Lag3               Lag4         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580   1st Qu.: -1.1580  
##  Median :  0.2410   Median :  0.2410   Median :  0.2410   Median :  0.2380  
##  Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472   Mean   :  0.1458  
##  3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag5              Volume            Today          Direction 
##  Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950   Down:484  
##  1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540   Up  :605  
##  Median :  0.2340   Median :1.00268   Median :  0.2410             
##  Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499             
##  3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050             
##  Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260
pairs(Weekly, col=Weekly$Direction)

# corrplot(cor(Weekly[-c(1,9)])) 

Numerically, the variables Lag1 through Lag5 are very similar, with nearly identical summary statistics. We see there are more weeks with an upward direction, which corresponds with the positive mean and median of Today.

Looking at the pairwise plots, the clearest relationship is the distinction between Direction and Today: weeks with negative values of Today have direction Down and weeks with positive values have direction Up. This is to be expected, as Direction is derived from Today. Another clear relationship is that Volume tends to increase as Year increases. Other than this, there are no clear relationships.
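To make the Volume trend easier to see on its own, one could also plot Volume against Year directly; a minimal sketch (plot not shown here):

# Volume rises over time; a log scale on the y-axis makes the trend clearer
plot(Weekly$Year, Weekly$Volume, xlab = 'Year', ylab = 'Volume', log = 'y')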

  (b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
weekly.predictors <- Weekly[-c(1,8)]

# up is 1, down is 0
weekly.predictors$Direction <- (ifelse(weekly.predictors$Direction == 'Up', 1, 0))

# weekly.predictors['Direction'] <- lapply(weekly.predictors['Direction'] , factor)

logistic.2 <- glm(Direction ~ ., data=weekly.predictors, family = 'binomial')
summary(logistic.2)
## 
## Call:
## glm(formula = Direction ~ ., family = "binomial", data = weekly.predictors)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

After performing logistic regression, the only variable that appears to be statistically significant is Lag2.

  (c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
# confusion matrix
predicted.2c <- predict(logistic.2, weekly, type='response')
predicted.2c <- ifelse(predicted.2c >= 0.5, 1, 0)

confusionMatrix(as.factor(predicted.2c), as.factor(weekly.predictors$Direction), mode='everything',positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  54  48
##          1 430 557
##                                          
##                Accuracy : 0.5611         
##                  95% CI : (0.531, 0.5908)
##     No Information Rate : 0.5556         
##     P-Value [Acc > NIR] : 0.369          
##                                          
##                   Kappa : 0.035          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.9207         
##             Specificity : 0.1116         
##          Pos Pred Value : 0.5643         
##          Neg Pred Value : 0.5294         
##               Precision : 0.5643         
##                  Recall : 0.9207         
##                      F1 : 0.6997         
##              Prevalence : 0.5556         
##          Detection Rate : 0.5115         
##    Detection Prevalence : 0.9063         
##       Balanced Accuracy : 0.5161         
##                                          
##        'Positive' Class : 1              
## 

There is a sensitivity of 92% and a specificity of 11%. This means the model detects true positives (a direction of Up) 92% of the time and true negatives (a direction of Down) only 11% of the time. The positive predictive value is 56% and the negative predictive value is 53%. The F1 score is 0.70, and the overall accuracy is 0.56.

From this we can tell that our model is much better at predicting when the market is up rather than when it is down. However, this is slightly deceptive: the positive predictive value is a meager 56%, meaning only 56% of its predictions of an upward market are correct. Going back to the confusion matrix, we can see this is because around 90% of its predictions are Up, while the market is actually up only about 56% of the time. The model mistakenly predicts the market will be up a majority of the time, giving it a high sensitivity but mediocre scores on the other metrics.
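The overall fraction of correct predictions can be verified directly from the confusion matrix:

# correct predictions are on the diagonal of the confusion matrix
(54 + 557) / (54 + 48 + 430 + 557)   # = 611 / 1089, approximately 0.561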

  (d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
weekly.before.2008 = weekly[weekly$Year <= 2008,]
weekly.after.2008 = weekly[weekly$Year > 2008,]

logistic.2d <- glm(Direction ~ Lag2, data=weekly.before.2008, family = 'binomial')

predicted.2d <- predict(logistic.2d, newdata=weekly.after.2008,type='response')
predicted.2d <- ifelse(predicted.2d >= 0.5, 'Up', 'Down')

# compute confusion matrix
cm.lr <- confusionMatrix(as.factor(predicted.2d), as.factor(weekly.after.2008$Direction), mode='everything',positive='Up')

print(cm.lr)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##               Precision : 0.6222         
##                  Recall : 0.9180         
##                      F1 : 0.7417         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
## 

After training on the data from 1990 through 2008 and then testing on 2009 and 2010, we calculate an accuracy (the fraction of correct predictions) of 62.5%, an improvement over what we found in part (c).

  (e) Repeat (d) using LDA.
lda.fit <- lda(Direction ~ Lag2, data=weekly.before.2008)


predicted.lda <- predict(lda.fit, newdata=weekly.after.2008)
# predicted.lda <- ifelse(predicted.lda >= 0.5, 'Up', 'Down') # not necessary for LDA: already in 0 and 1

# compute confusion matrix
cm.lda <- confusionMatrix(as.factor(predicted.lda$class), as.factor(weekly.after.2008$Direction), mode='everything',positive='Up')

print(cm.lda)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##               Precision : 0.6222         
##                  Recall : 0.9180         
##                      F1 : 0.7417         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
## 

Using LDA to predict for 2009 and 2010 gives us an accuracy of 62.5%. This performance is identical to logistic regression using Lag2.

  (f) Repeat (d) using QDA.
qda.fit <- qda(Direction ~ Lag2, data=weekly.before.2008)

predicted.qda <- predict(qda.fit, newdata=weekly.after.2008)
# predicted.qda <- ifelse(predicted.qda >= 0.5, 'Up', 'Down') # also not needed

# compute confusion matrix
cm.qda <- confusionMatrix(as.factor(predicted.qda$class), as.factor(weekly.after.2008$Direction), mode='everything',positive='Up')

print(cm.qda)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    0  0
##       Up     43 61
##                                           
##                Accuracy : 0.5865          
##                  95% CI : (0.4858, 0.6823)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.5419          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.504e-10       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.5865          
##          Neg Pred Value :    NaN          
##               Precision : 0.5865          
##                  Recall : 1.0000          
##                      F1 : 0.7394          
##              Prevalence : 0.5865          
##          Detection Rate : 0.5865          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Up              
## 

Using QDA we get an accuracy of 58.65%. It is important to note that the sensitivity of 100% is somewhat misleading, as the model predicts Up for every observation in the test set. This inflates the sensitivity, since a model that always predicts the positive class trivially captures every true positive.
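Indeed, always predicting Up reproduces the no-information rate exactly:

# the 2009-2010 test set contains 61 Up weeks and 43 Down weeks
61 / (61 + 43)   # approximately 0.5865, the accuracy of always predicting Up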

  (g) Repeat (d) using KNN with K = 1.
knn.predicted <- knn(train=weekly.before.2008[3], test=weekly.after.2008[3], cl = weekly.before.2008[,9], k=1)

# index for lag2 is 3, index for direction is 9


# compute confusion matrix
cm.knn <- confusionMatrix(as.factor(knn.predicted), as.factor(weekly.after.2008[,9]), mode='everything',positive='Up')
print(cm.knn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   21 29
##       Up     22 32
##                                          
##                Accuracy : 0.5096         
##                  95% CI : (0.4097, 0.609)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.9540         
##                                          
##                   Kappa : 0.0127         
##                                          
##  Mcnemar's Test P-Value : 0.4008         
##                                          
##             Sensitivity : 0.5246         
##             Specificity : 0.4884         
##          Pos Pred Value : 0.5926         
##          Neg Pred Value : 0.4200         
##               Precision : 0.5926         
##                  Recall : 0.5246         
##                      F1 : 0.5565         
##              Prevalence : 0.5865         
##          Detection Rate : 0.3077         
##    Detection Prevalence : 0.5192         
##       Balanced Accuracy : 0.5065         
##                                          
##        'Positive' Class : Up             
## 

Using KNN with k=1 we find an accuracy of about 51%, making it the worst performing model so far.

  (h) Repeat (d) using naive Bayes.
nb.fit <- naiveBayes(Direction ~ Lag2, data=weekly.before.2008)

pred.nb <- predict(nb.fit, newdata = weekly.after.2008, type = "raw")
pred.nb <- ifelse(pred.nb[,2] >= 0.5, 'Up', 'Down') # column 2 represents likelihood of 'Up'

cm.nb <- confusionMatrix(as.factor(pred.nb), as.factor(weekly.after.2008$Direction),mode='everything',positive='Up')
print(cm.nb)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    0  0
##       Up     43 61
##                                           
##                Accuracy : 0.5865          
##                  95% CI : (0.4858, 0.6823)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.5419          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.504e-10       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.5865          
##          Neg Pred Value :    NaN          
##               Precision : 0.5865          
##                  Recall : 1.0000          
##                      F1 : 0.7394          
##              Prevalence : 0.5865          
##          Detection Rate : 0.5865          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Up              
## 

We see an accuracy of 58.65% using naive Bayes, the same as QDA. We also see the same issue as with QDA: the model predicts Up for every observation, giving it a sensitivity of 100%.

  (i) Which of these methods appears to provide the best results on this data?

The best performing models using Lag2 were LDA and logistic regression, both of which provided the exact same results with an overall accuracy of 62.5%.

  (j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

We will experiment with different subsets for each model; these subsets are Lag1 through Lag3, all five lag variables, and all five lag variables plus Volume (for KNN we also include Lag2 alone). However, we will not conduct further investigation with logistic regression, as Lag2 was the only term shown to be significant.
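Although we do not pursue logistic regression further, a model with a transformation or an interaction would be fit in the same way as in part (d); a minimal, illustrative sketch using the same train/test split:

# example only: Lag2 plus a squared term and a Lag1:Lag2 interaction
logistic.int <- glm(Direction ~ Lag2 + I(Lag2^2) + Lag1:Lag2,
                    data = weekly.before.2008, family = 'binomial')
prob.int <- predict(logistic.int, newdata = weekly.after.2008, type = 'response')
pred.int <- ifelse(prob.int >= 0.5, 'Up', 'Down')
confusionMatrix(as.factor(pred.int), as.factor(weekly.after.2008$Direction),
                mode = 'everything', positive = 'Up')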

# LDA
# lag 1,2,3
lda.fit <- lda(Direction ~ Lag1 + Lag2 + Lag3, data=weekly.before.2008)

predicted.lda <- predict(lda.fit, newdata = weekly.after.2008, type = "raw")

cm <- confusionMatrix(as.factor(predicted.lda$class), as.factor(weekly.after.2008$Direction), mode='everything', positive='Up')

print('LDA with lag 1, 2, and 3:', quote=F)
## [1] LDA with lag 1, 2, and 3:
print(cm$table)
##           Reference
## Prediction Down Up
##       Down    8  9
##       Up     35 52
print(cm$overall[1])
##  Accuracy 
## 0.5769231
print(cm$byClass)
##          Sensitivity          Specificity       Pos Pred Value 
##            0.8524590            0.1860465            0.5977011 
##       Neg Pred Value            Precision               Recall 
##            0.4705882            0.5977011            0.8524590 
##                   F1           Prevalence       Detection Rate 
##            0.7027027            0.5865385            0.5000000 
## Detection Prevalence    Balanced Accuracy 
##            0.8365385            0.5192528
# all lag
lda.fit <- lda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, data=weekly.before.2008)

predicted.lda <- predict(lda.fit, newdata = weekly.after.2008, type = "raw")

cm <- confusionMatrix(as.factor(predicted.lda$class), as.factor(weekly.after.2008$Direction), mode='everything', positive='Up')

print('LDA with all lag:', quote=F)
## [1] LDA with all lag:
print(cm$table)
##           Reference
## Prediction Down Up
##       Down    9 13
##       Up     34 48
print(cm$overall[1])
##  Accuracy 
## 0.5480769
print(cm$byClass)
##          Sensitivity          Specificity       Pos Pred Value 
##            0.7868852            0.2093023            0.5853659 
##       Neg Pred Value            Precision               Recall 
##            0.4090909            0.5853659            0.7868852 
##                   F1           Prevalence       Detection Rate 
##            0.6713287            0.5865385            0.4615385 
## Detection Prevalence    Balanced Accuracy 
##            0.7884615            0.4980938
# all lag, volume
lda.fit <- lda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data=weekly.before.2008)

predicted.lda <- predict(lda.fit, newdata = weekly.after.2008, type = "raw")

cm <- confusionMatrix(as.factor(predicted.lda$class), as.factor(weekly.after.2008$Direction), mode='everything', positive='Up')

print('LDA with all lag and volume:', quote=F)
## [1] LDA with all lag and volume:
print(cm$table)
##           Reference
## Prediction Down Up
##       Down   31 44
##       Up     12 17
print(cm$overall[1])
##  Accuracy 
## 0.4615385
print(cm$byClass)
##          Sensitivity          Specificity       Pos Pred Value 
##            0.2786885            0.7209302            0.5862069 
##       Neg Pred Value            Precision               Recall 
##            0.4133333            0.5862069            0.2786885 
##                   F1           Prevalence       Detection Rate 
##            0.3777778            0.5865385            0.1634615 
## Detection Prevalence    Balanced Accuracy 
##            0.2788462            0.4998094

We see that the best performance for LDA comes from using only Lag2, with an accuracy of 62.5%, higher than all other LDA models shown. Performance gets worse as more variables are added; the one exception is specificity, which increases to 72% with all lag variables plus Volume compared to 21% with only Lag2.

# QDA
# lag 1,2,3
qda.fit <- qda(Direction ~ Lag1 + Lag2 + Lag3, data=weekly.before.2008)

predicted.qda <- predict(qda.fit, newdata = weekly.after.2008, type = "raw")

cm <- confusionMatrix(as.factor(predicted.qda$class), as.factor(weekly.after.2008$Direction), mode='everything', positive='Up')

print('QDA with lag 1, 2, and 3:', quote=F)
## [1] QDA with lag 1, 2, and 3:
print(cm$table)
##           Reference
## Prediction Down Up
##       Down    6 10
##       Up     37 51
print(cm$overall[1])
##  Accuracy 
## 0.5480769
print(cm$byClass)
##          Sensitivity          Specificity       Pos Pred Value 
##            0.8360656            0.1395349            0.5795455 
##       Neg Pred Value            Precision               Recall 
##            0.3750000            0.5795455            0.8360656 
##                   F1           Prevalence       Detection Rate 
##            0.6845638            0.5865385            0.4903846 
## Detection Prevalence    Balanced Accuracy 
##            0.8461538            0.4878002
# all lag
qda.fit <- qda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, data=weekly.before.2008)

predicted.qda <- predict(qda.fit, newdata = weekly.after.2008, type = "raw")

cm <- confusionMatrix(as.factor(predicted.qda$class), as.factor(weekly.after.2008$Direction), mode='everything', positive='Up')

print('QDA with all lag:', quote=F)
## [1] QDA with all lag:
print(cm$table)
##           Reference
## Prediction Down Up
##       Down   10 23
##       Up     33 38
print(cm$overall[1])
##  Accuracy 
## 0.4615385
print(cm$byClass)
##          Sensitivity          Specificity       Pos Pred Value 
##            0.6229508            0.2325581            0.5352113 
##       Neg Pred Value            Precision               Recall 
##            0.3030303            0.5352113            0.6229508 
##                   F1           Prevalence       Detection Rate 
##            0.5757576            0.5865385            0.3653846 
## Detection Prevalence    Balanced Accuracy 
##            0.6826923            0.4277545
# all lag and volume
qda.fit <- qda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data=weekly.before.2008)

predicted.qda <- predict(qda.fit, newdata = weekly.after.2008, type = "raw")

cm <- confusionMatrix(as.factor(predicted.qda$class), as.factor(weekly.after.2008$Direction), mode='everything', positive='Up')

print('QDA with all lag and volume:', quote=F)
## [1] QDA with all lag and volume:
print(cm$table)
##           Reference
## Prediction Down Up
##       Down   33 49
##       Up     10 12
print(cm$overall[1])
##  Accuracy 
## 0.4326923
print(cm$byClass)
##          Sensitivity          Specificity       Pos Pred Value 
##            0.1967213            0.7674419            0.5454545 
##       Neg Pred Value            Precision               Recall 
##            0.4024390            0.5454545            0.1967213 
##                   F1           Prevalence       Detection Rate 
##            0.2891566            0.5865385            0.1153846 
## Detection Prevalence    Balanced Accuracy 
##            0.2115385            0.4820816

Similarly, for QDA the overall accuracy decreases as terms are added, falling from 59% with Lag2 alone to a low of 43% with all lag variables plus Volume. The sensitivity of 100% drops to just 20%, while, interestingly, the specificity increases from 0% to 77%.

Next we will perform KNN for the various subsets for k values of 1, 3, 5, 10, 15, 20, 25, and 30. To make things easier to read we will only compare the overall accuracy of each.
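To cut down on the repetition in the loops that follow, the fit-and-score step could also be wrapped in a small helper; a minimal sketch using the same objects:

# helper: test-set accuracy of KNN for a given set of predictor columns and value of k
knn.accuracy <- function(index, k) {
  fit <- knn(train = weekly.before.2008[index], test = weekly.after.2008[index],
             cl = weekly.before.2008[, 9], k = k)
  mean(fit == weekly.after.2008[, 9])
}
# e.g. knn.accuracy(c(2, 3, 4), 25)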

k.vals <- c(1,3,5,10,15,20,25,30)

# lag 2
print('KNN with lag2:', quote=F)
## [1] KNN with lag2:
index <- 3
for (k in k.vals){
  print(paste('For k=',k, sep=''),quote=F)
  knn.fit <- knn(train=weekly.before.2008[index], test=weekly.after.2008[index], cl=weekly.before.2008[,9], k=k)
  cm <- confusionMatrix(as.factor(knn.fit), as.factor(weekly.after.2008[,9]), mode='everything',positive='Up')
  # print(cm$table)
  print(cm$overall[1])
  # print(cm$byClass[1])
  # print(cm$byClass[2])
}
## [1] For k=1
## Accuracy 
##      0.5 
## [1] For k=3
##  Accuracy 
## 0.5384615 
## [1] For k=5
##  Accuracy 
## 0.5384615 
## [1] For k=10
##  Accuracy 
## 0.5865385 
## [1] For k=15
##  Accuracy 
## 0.5865385 
## [1] For k=20
##  Accuracy 
## 0.5576923 
## [1] For k=25
##  Accuracy 
## 0.5480769 
## [1] For k=30
##  Accuracy 
## 0.5288462
# lag1,2,3
print('KNN with lag 1,2,3:', quote=F)
## [1] KNN with lag 1,2,3:
index <- c(2,3,4)
for (k in k.vals){
  print(paste('For k=',k, sep=''),quote=F)
  knn.fit <- knn(train=weekly.before.2008[index], test=weekly.after.2008[index], cl=weekly.before.2008[,9], k=k)
  cm <- confusionMatrix(as.factor(knn.fit), as.factor(weekly.after.2008[,9]), mode='everything',positive='Up')
  # print(cm$table)
  print(cm$overall[1])
  # print(cm$byClass[1])
  # print(cm$byClass[2])
}
## [1] For k=1
##  Accuracy 
## 0.4903846 
## [1] For k=3
##  Accuracy 
## 0.5096154 
## [1] For k=5
##  Accuracy 
## 0.5288462 
## [1] For k=10
##  Accuracy 
## 0.5769231 
## [1] For k=15
##  Accuracy 
## 0.5865385 
## [1] For k=20
##  Accuracy 
## 0.6153846 
## [1] For k=25
##  Accuracy 
## 0.6153846 
## [1] For k=30
##  Accuracy 
## 0.6153846
# all lag
print('KNN with all lag:', quote=F)
## [1] KNN with all lag:
index <- c(2,3,4,5,6)
for (k in k.vals){
  print(paste('For k=',k, sep=''),quote=F)
  knn.fit <- knn(train=weekly.before.2008[index], test=weekly.after.2008[index], cl=weekly.before.2008[,9], k=k)
  cm <- confusionMatrix(as.factor(knn.fit), as.factor(weekly.after.2008[,9]), mode='everything',positive='Up')
  # print(cm$table)
  print(cm$overall[1])
  # print(cm$byClass[1])
  # print(cm$byClass[2])
}
## [1] For k=1
##  Accuracy 
## 0.5192308 
## [1] For k=3
##  Accuracy 
## 0.5576923 
## [1] For k=5
##  Accuracy 
## 0.5480769 
## [1] For k=10
##  Accuracy 
## 0.6153846 
## [1] For k=15
##  Accuracy 
## 0.5384615 
## [1] For k=20
##  Accuracy 
## 0.5673077 
## [1] For k=25
## Accuracy 
##      0.5 
## [1] For k=30
##  Accuracy 
## 0.5096154
# all lag, volume
print('KNN with all lag, volume:', quote=F)
## [1] KNN with all lag, volume:
index <- c(2,3,4,5,6,7)
for (k in k.vals){
  print(paste('For k=',k, sep=''),quote=F)
  knn.fit <- knn(train=weekly.before.2008[index], test=weekly.after.2008[index], cl=weekly.before.2008[,9], k=k)
  cm <- confusionMatrix(as.factor(knn.fit), as.factor(weekly.after.2008[,9]), mode='everything',positive='Up')
  # print(cm$table)
  print(cm$overall[1])
  # print(cm$byClass[1])
  # print(cm$byClass[2])
}
## [1] For k=1
##  Accuracy 
## 0.4807692 
## [1] For k=3
##  Accuracy 
## 0.5096154 
## [1] For k=5
## Accuracy 
##      0.5 
## [1] For k=10
##  Accuracy 
## 0.5192308 
## [1] For k=15
##  Accuracy 
## 0.5576923 
## [1] For k=20
##  Accuracy 
## 0.5192308 
## [1] For k=25
##  Accuracy 
## 0.5480769 
## [1] For k=30
##  Accuracy 
## 0.5480769
print('KNN with lag 1, 2, and 3, and k=25:', quote=F)
## [1] KNN with lag 1, 2, and 3, and k=25:
index=c(2,3,4)
knn.fit <- knn(train=weekly.before.2008[index], test=weekly.after.2008[index], cl=weekly.before.2008[,9], k=25)
  cm <- confusionMatrix(as.factor(knn.fit), as.factor(weekly.after.2008[,9]), mode='everything',positive='Up')
  print(cm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   22 19
##       Up     21 42
##                                           
##                Accuracy : 0.6154          
##                  95% CI : (0.5149, 0.7091)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.3110          
##                                           
##                   Kappa : 0.2015          
##                                           
##  Mcnemar's Test P-Value : 0.8744          
##                                           
##             Sensitivity : 0.6885          
##             Specificity : 0.5116          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value : 0.5366          
##               Precision : 0.6667          
##                  Recall : 0.6885          
##                      F1 : 0.6774          
##              Prevalence : 0.5865          
##          Detection Rate : 0.4038          
##    Detection Prevalence : 0.6058          
##       Balanced Accuracy : 0.6001          
##                                           
##        'Positive' Class : Up              
## 

KNN performs best at larger values of k; k=25 with Lag1, Lag2, and Lag3 (shown above) reaches the peak accuracy of about 61.5%. Adding predictors appears to help up to the three-lag model, after which performance starts to decline; the worst results come from using all lag variables plus Volume. KNN is the only method that appears to improve with the addition of variables, but even KNN suffers a decrease in performance when all predictors are used.

# NB
# lag 1,2,3
nb.fit <- naiveBayes(Direction ~ Lag1 + Lag2 + Lag3, data=weekly.before.2008)

pred.nb <- predict(nb.fit, newdata = weekly.after.2008, type = "raw")
pred.nb <- ifelse(pred.nb[,2] >= 0.5, 'Up', 'Down') # column 2 represents likelihood of 'Up'

cm <- confusionMatrix(as.factor(pred.nb), as.factor(weekly.after.2008$Direction),mode='everything',positive='Up')

print('NB with lag 1,2,3:', quote=F)
## [1] NB with lag 1,2,3:
print(cm$table)
##           Reference
## Prediction Down Up
##       Down    5 10
##       Up     38 51
print(cm$overall[1])
##  Accuracy 
## 0.5384615
print(cm$byClass)
##          Sensitivity          Specificity       Pos Pred Value 
##            0.8360656            0.1162791            0.5730337 
##       Neg Pred Value            Precision               Recall 
##            0.3333333            0.5730337            0.8360656 
##                   F1           Prevalence       Detection Rate 
##            0.6800000            0.5865385            0.4903846 
## Detection Prevalence    Balanced Accuracy 
##            0.8557692            0.4761723
# all lag
nb.fit <- naiveBayes(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, data=weekly.before.2008)

pred.nb <- predict(nb.fit, newdata = weekly.after.2008, type = "raw")
pred.nb <- ifelse(pred.nb[,2] >= 0.5, 'Up', 'Down') # column 2 represents likelihood of 'Up'

cm <- confusionMatrix(as.factor(pred.nb), as.factor(weekly.after.2008$Direction),mode='everything',positive='Up')

print('NB with all lag:', quote=F)
## [1] NB with all lag:
print(cm$table)
##           Reference
## Prediction Down Up
##       Down   10 21
##       Up     33 40
print(cm$overall[1])
##  Accuracy 
## 0.4807692
print(cm$byClass)
##          Sensitivity          Specificity       Pos Pred Value 
##            0.6557377            0.2325581            0.5479452 
##       Neg Pred Value            Precision               Recall 
##            0.3225806            0.5479452            0.6557377 
##                   F1           Prevalence       Detection Rate 
##            0.5970149            0.5865385            0.3846154 
## Detection Prevalence    Balanced Accuracy 
##            0.7019231            0.4441479
# all lag, volume
nb.fit <- naiveBayes(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data=weekly.before.2008)

pred.nb <- predict(nb.fit, newdata = weekly.after.2008, type = "raw")
pred.nb <- ifelse(pred.nb[,2] >= 0.5, 'Up', 'Down') # column 2 represents likelihood of 'Up'

cm <- confusionMatrix(as.factor(pred.nb), as.factor(weekly.after.2008$Direction),mode='everything',positive='Up')

print('NB with all lag, volume:', quote=F)
## [1] NB with all lag, volume:
print(cm$table)
##           Reference
## Prediction Down Up
##       Down   42 56
##       Up      1  5
print(cm$overall[1])
##  Accuracy 
## 0.4519231
print(cm$byClass)
##          Sensitivity          Specificity       Pos Pred Value 
##           0.08196721           0.97674419           0.83333333 
##       Neg Pred Value            Precision               Recall 
##           0.42857143           0.83333333           0.08196721 
##                   F1           Prevalence       Detection Rate 
##           0.14925373           0.58653846           0.04807692 
## Detection Prevalence    Balanced Accuracy 
##           0.05769231           0.52935570

Naive Bayes shows a similar pattern to LDA and QDA. The accuracy steadily decreases from 59% with just Lag2 down to 45% with all lag variables plus Volume. As in QDA, the initial sensitivity of 100% drops to 8% as the model stops predicting purely positive responses, with a corresponding increase in specificity from 0% to 98%.

print('Logistic regression:', quote=F)
## [1] Logistic regression:
print(cm.lr)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##               Precision : 0.6222         
##                  Recall : 0.9180         
##                      F1 : 0.7417         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
## 
print('LDA:', quote=F)
## [1] LDA:
print(cm.lda)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##               Precision : 0.6222         
##                  Recall : 0.9180         
##                      F1 : 0.7417         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
## 

After conducting this further analysis, the best performance still comes from logistic regression and LDA with Lag2 as the only predictor, both achieving an accuracy of 62.5%. Adding extra predictors only served to decrease model performance.

KNN with k=25 using Lag1, Lag2, and Lag3 came closest; it did improve somewhat as variables were added, but only reached an accuracy of about 61.5%.

Problem 3: Exercise 4.8.16

Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.

Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.

# creating the response variable
med.crim <- median(Boston$crim)

# 0 for lower crim than median, 1 for higher
is.high.crime <- apply(Boston, 1, function(x) {
  ifelse(x['crim'] > med.crim, 1, 0) # the median itself counts as low crime, if it appears in the data
})

boston.crime <- Boston[-1] # remove crim from dataset
boston.crime$high.crime <- is.high.crime
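For reference, the same response variable can be built without apply; a one-line vectorized equivalent:

# vectorized equivalent of the apply() call above (TRUE/FALSE coerced to 1/0)
boston.crime$high.crime <- as.integer(Boston$crim > med.crim)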

After creating the response variable, we next decide which subsets of predictors to use. For the first model, we use stepAIC with stepwise selection to find the optimal logistic regression model.

# logistic regression
full.model <- glm(high.crime~. , data=boston.crime, family='binomial')
null.model <- glm(high.crime~1, data=boston.crime, family='binomial')

model.selection <- stepAIC(null.model, direction = 'both', scope=list(lower=null.model, upper=full.model), trace = F)
summary(model.selection)
## 
## Call:
## glm(formula = high.crime ~ nox + rad + tax + zn + black + dis + 
##     ptratio + medv + age, family = "binomial", data = boston.crime)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -31.441272   6.048989  -5.198 2.02e-07 ***
## nox          43.195824   6.452812   6.694 2.17e-11 ***
## rad           0.718773   0.142066   5.059 4.21e-07 ***
## tax          -0.007676   0.002503  -3.066  0.00217 ** 
## zn           -0.082567   0.031424  -2.628  0.00860 ** 
## black        -0.012866   0.006334  -2.031  0.04224 *  
## dis           0.634380   0.207634   3.055  0.00225 ** 
## ptratio       0.303502   0.109255   2.778  0.00547 ** 
## medv          0.112882   0.034362   3.285  0.00102 ** 
## age           0.022851   0.009894   2.310  0.02091 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 701.46  on 505  degrees of freedom
## Residual deviance: 216.22  on 496  degrees of freedom
## AIC: 236.22
## 
## Number of Fisher Scoring iterations: 9

As we can see, the optimal logistic regression model uses the following predictors: nox + rad + tax + zn + black + dis + ptratio + medv + age

To build the different subsets used to predict whether a census tract has an above-median crime rate, we add these predictors in that order. For example, we will first use LR with nox, then nox + rad, then nox + rad + tax, and so on. We will do this for LR, LDA, NB, and KNN (with k values of 1, 3, 5, 10, 15, 20, 25, and 30). The metrics we will use to compare will be accuracy, sensitivity, specificity, and F1 score. We will then compare these metrics visually at the end.

# creating test and training dataset
tr.size = floor(0.7*nrow(boston.crime))
tr.ind = sample(seq_len(nrow(boston.crime)),tr.size)
boston.tr = boston.crime[tr.ind,]
# table(boston.tr$high.crime)
boston.te = boston.crime[-tr.ind,]  # rows not sampled for training
# table(boston.te$high.crime)

Now that we have the desired subsets and have split the data into the testing and training sets, we can proceed with fitting the models.

Note: For the sake of readability the code for the model fit and graph generation is hidden. The code can be found in the .rmd file submitted alongside the report. Additionally, the graphs for all KNN models will be found separately to avoid crowding; the model for k=1 will be kept with the other models as a comparison.
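The hidden code follows the same pattern used in Problem 2; as an illustrative sketch only (not the exact code in the .rmd), fitting one LDA subset and extracting the plotted metrics might look like this:

# illustrative sketch: LDA on the first four stepwise-selected predictors,
# evaluated on the held-out set with the four comparison metrics
lda.sub <- lda(high.crime ~ nox + rad + tax + zn, data = boston.tr)
lda.pred <- predict(lda.sub, newdata = boston.te)
cm <- confusionMatrix(as.factor(lda.pred$class), as.factor(boston.te$high.crime),
                      mode = 'everything', positive = '1')
c(cm$overall['Accuracy'], cm$byClass[c('Sensitivity', 'Specificity', 'F1')])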

In terms of sensitivity, the KNN models performed significantly better than the other models; the closest non-KNN performer was logistic regression, which still appeared to perform worse than most KNN models. Of the KNN models, the best performer was k=1 with 6 and 7 predictors, which had a sensitivity of around 93%.

For specificity we see logistic regression has the best overall performance, particularly at 5 predictors with an estimated specificity of 98%. The best KNN model is again k=1 which appears to perform only slightly worse than LR at 4 predictors.

For accuracy, we again see that KNN models perform better with the best model of k=1 achieving an accuracy of 96% at 4 predictors. LR also performs well and approaches k=1 at 9 predictors with an accuracy just below 90%, but is ultimately outperformed.

We see a very similar picture with the F1 score as we saw in the accuracy. KNN with k=1 performs the best at 4 predictors with an F1 of around 97%, but LR gets close at 9 predictors.

Overall, the best performing model was KNN with k=1 and 4 predictors (nox + rad + tax + zn), as it outperformed every model in all metrics except for specificity, where LR performed better. There is some concern about overfitting, however, since model tuning was carried out on the same held-out set used to report performance, with no separate validation set.

While LR was outperformed by KNN with k=1, the optimal LR model with 9 predictors still performed relatively well on all metrics, around 5 to 7 percentage points behind the best k=1 model. Given the likelihood that the KNN model is overfit, LR would be a solid contender for the best overall model.

Problem 4

This churn dataset comes from the IBM Sample Data Sets. Customer churn, also known as customer attrition, occurs when customers stop doing business with a company. The data set contains 5,000 rows (customers) and 20 columns (features). The “churn” column is our target, which indicates whether the customer churned (left the company) or not.

## 'data.frame':    5000 obs. of  20 variables:
##  $ state         : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
##  $ area.code     : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
##  $ account.length: int  128 107 137 84 75 118 121 147 117 141 ...
##  $ voice.plan    : Factor w/ 2 levels "yes","no": 1 1 2 2 2 2 1 2 2 1 ...
##  $ voice.messages: int  25 26 0 0 0 0 24 0 0 37 ...
##  $ intl.plan     : Factor w/ 2 levels "yes","no": 2 2 2 1 1 1 2 1 2 1 ...
##  $ intl.mins     : num  10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
##  $ intl.calls    : int  3 3 5 7 3 6 7 6 4 5 ...
##  $ intl.charge   : num  2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
##  $ day.mins      : num  265 162 243 299 167 ...
##  $ day.calls     : int  110 123 114 71 113 98 88 79 97 84 ...
##  $ day.charge    : num  45.1 27.5 41.4 50.9 28.3 ...
##  $ eve.mins      : num  197.4 195.5 121.2 61.9 148.3 ...
##  $ eve.calls     : int  99 103 110 88 122 101 108 94 80 111 ...
##  $ eve.charge    : num  16.78 16.62 10.3 5.26 12.61 ...
##  $ night.mins    : num  245 254 163 197 187 ...
##  $ night.calls   : int  91 103 104 89 121 118 118 96 90 97 ...
##  $ night.charge  : num  11.01 11.45 7.32 8.86 8.41 ...
##  $ customer.calls: int  1 1 0 2 3 0 3 0 1 0 ...
##  $ churn         : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
  (a) Given the classification imbalance in churn status, describe how you would create a training and testing set.

We can see the imbalance if we separate the customers by whether they have churned or not:

## 
##  yes   no 
##  707 4293

As we can see, the data is significantly imbalanced: only about 14% of the customers have churned.

This presents a problem for resampling methods. For example, in 10-fold cross-validation the folds may contain very different numbers of churned customers, which can have a significant impact on the model fit. This risk is always present in k-fold CV, but it becomes much more pronounced when one class is far rarer than the other.

In order to compensate we can use stratified K fold cross validation, which will preserve the distribution of the classes in the dataset in each fold. This will help to lessen the overall impact on the model fit.

  (b) Which classification statistic would you choose to optimize for this problem (i.e., simple random or createDataPartition) and why?

We will analyze the sensitivity, specificity, accuracy, F1 score, and precision, but we will focus on optimizing precision. Precision (the positive predictive value) describes how reliable a positive prediction is. For a business analyzing churn, reliably identifying who will churn is much more important than predicting who will not, since the company can then focus its attention on the customers it is at risk of losing in order to keep as many as possible.

We will also use createDataPartition when constructing the folds for the k-fold cross-validation. This stratifies the sampling, keeping the proportion of positive values consistent across the folds and allowing for a fairer comparison between them.
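As an illustration of stratification, caret's createFolds (an alternative to the createDataPartition call used later) builds mutually exclusive folds that preserve the class ratio; a minimal sketch:

# stratified folds: the yes/no proportions are preserved within each fold
folds <- createFolds(churn$churn, k = 10)
sapply(folds, function(idx) prop.table(table(churn$churn[idx])))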

  (c) Split the data into a training and a testing set, pre-process the data, and build classification models described in this chapter for the classification of churn. Of the models that you considered, which performs best on these data? and what is the optimal performance?

Before doing anything we must preprocess the data. We fortunately do not have to worry about missing data as there is none in the dataset.

# sum of total missing values in each column:
for (col in colnames(churn)) {
  print(col, quote=F)
  print(sum(is.na(churn[,col])), quote=F)
}
## [1] state
## [1] 0
## [1] area.code
## [1] 0
## [1] account.length
## [1] 0
## [1] voice.plan
## [1] 0
## [1] voice.messages
## [1] 0
## [1] intl.plan
## [1] 0
## [1] intl.mins
## [1] 0
## [1] intl.calls
## [1] 0
## [1] intl.charge
## [1] 0
## [1] day.mins
## [1] 0
## [1] day.calls
## [1] 0
## [1] day.charge
## [1] 0
## [1] eve.mins
## [1] 0
## [1] eve.calls
## [1] 0
## [1] eve.charge
## [1] 0
## [1] night.mins
## [1] 0
## [1] night.calls
## [1] 0
## [1] night.charge
## [1] 0
## [1] customer.calls
## [1] 0
## [1] churn
## [1] 0

Our next step is to look for degenerate distributions with near-zero variance. We see that voice.messages fits this criterion, and thus remove it from the dataset.

# degenerate distributions
degen_ind <- nearZeroVar(churn) # we see voice.messages is degenerate

for (col in colnames(churn[degen_ind])) {
  tab1(churn[col], main=col)
}

# remove zero var
seg.churn <- churn[, -degen_ind]

Next we can look at correlations between the variables, and remove any with a correlation higher than 0.75.

# correlation

numeric.churn <- seg.churn %>% select_if(is.numeric)  # select(where(is.numeric))
corr.churn <- cor(numeric.churn)
corrplot(corr.churn) # some perfect correlation here. charges and minutes

# finding the variables to remove
highCorr <- findCorrelation(corr.churn, 0.75)

# removing highly correlated predictors
cols.to.remove <- colnames(numeric.churn)[highCorr]
nocorr.churn <- seg.churn[, !colnames(seg.churn) %in% cols.to.remove]

# much less correlated
corrplot(cor(numeric.churn[-highCorr]))

We saw a perfect correlation between the corresponding mins and charge variables, which is to be expected, since the charges are presumably computed directly from the minutes.
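To confirm one of these pairs directly (the exact value is not printed here, but the corrplot above indicates it is essentially 1):

cor(churn$day.mins, churn$day.charge)   # expected to be approximately 1 per the corrplot above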

Next we look at multicollinearity between the remaining predictors, which we see is no longer an issue:

multicolinearity <- vif(glm(churn~ ., data=nocorr.churn, family='binomial')) 

print(multicolinearity)
##                    GVIF Df GVIF^(1/(2*Df))
## state          1.290222 50        1.002551
## area.code      1.040925  2        1.010078
## account.length 1.018134  1        1.009026
## voice.plan     1.034209  1        1.016961
## intl.plan      1.127323  1        1.061755
## intl.mins      1.026628  1        1.013226
## intl.calls     1.016393  1        1.008163
## day.mins       1.077634  1        1.038091
## day.calls      1.016450  1        1.008192
## eve.mins       1.041854  1        1.020712
## eve.calls      1.021654  1        1.010769
## night.calls    1.021810  1        1.010846
## night.charge   1.033708  1        1.016714
## customer.calls 1.110768  1        1.053930
nomulticolinearity.churn <- nocorr.churn

The last step in preprocessing is to check for skewness.

# Function to check skewness of all numeric columns in a dataframe
check_skewness <- function(data) {
  # Filter numeric columns
  numeric_columns <- data[, sapply(data, is.numeric)]
  
  # Calculate skewness for each numeric column
  skewness_values <- sapply(numeric_columns, skewness, na.rm = TRUE)
  
  # Display the skewness values
  skewness_values
}

# Example usage
skewness_results <- check_skewness(nomulticolinearity.churn)
# print(skewness_results)
print(skewness_results[abs(skewness_results)>0.75]) # will remove skewness above 0.75
##     intl.calls customer.calls 
##       1.359876       1.041837
par(mfrow = c(1,2))
hist(churn$intl.calls)
hist(churn$customer.calls)

We see two predictors have significant skew, intl.calls and customer.calls. It is important to note that because these variables are not strictly positive (there are several zero values, as can be seen above), we are not able to use the Box-Cox transformation. Instead we will use the Yeo-Johnson transformation, an extension of Box-Cox that can handle zero and negative values.

# Yeo Johnson transformation:
intl.calls.yj <- yeojohnson(nomulticolinearity.churn$intl.calls)
nomulticolinearity.churn$intl.calls <- intl.calls.yj$x.t  # Transformed intl.calls

customer.calls.yj <- yeojohnson(nomulticolinearity.churn$customer.calls)
nomulticolinearity.churn$customer.calls <- customer.calls.yj$x.t  # Transformed customer.calls

par(mfrow = c(1,2))
hist(nomulticolinearity.churn$intl.calls)
hist(nomulticolinearity.churn$customer.calls)

churn.processed <- nomulticolinearity.churn

As we can see, the data is significantly less skewed.

We can now proceed towards the cross validation. We will use 10 fold cross validation in this problem.

# set seed for reproducibility
set.seed(155)


# 10 fold cross validation
k <- 10

fold.indices <- createDataPartition(churn.processed$churn,p = 1/k, times=k, list=T)

folds.list <- list()
par(mfrow = c(2,5)) # 4,5 is too large

for(i in 1:k) {
  # Get indices for the i-th fold
  test.indices <- fold.indices[[i]]
  
  # Split data into training and testing based on indices
  test.data <- churn.processed[test.indices, ]  # use i-th fold as test data
  train.data <- churn.processed[-test.indices, ]  # use rest as training data
  
  # Store the train and test data in the folds list
  folds.list[[i]] <- list("train.data" = train.data, "test.data" = test.data)
  
  # Plot distribution of churn in training and testing sets
  barplot(table(train.data$churn), main = paste('Training Fold', i))
  barplot(table(test.data$churn), main = paste('Testing Fold', i))
}

We can see from the graphs that the distribution of positive and negative churn values has been preserved. Now that we have prepared our folds, we can fit the model for each fold and get a confusion matrix to measure the performance.

For our models we will use logistic regression with stepwise selection (via stepAIC) in each fold to find the best model, LDA with all predictors, NB with all predictors, and KNN with all numeric predictors for the following k values: 1, 3, 5, 10, 15, 20, 25, and 30.

# will return a cm for each fold for each model. will make comparisons based off that

model.performances <- lapply(folds.list, function(fold){
  # print(str(fold$train.data))
  train <- fold$train.data
  test <- fold$test.data
  
  # LR
  full.model <- glm(churn~. , data=train, family='binomial')
  null.model <- glm(churn~1, data=train, family='binomial')

  best.lr <- stepAIC(null.model, direction = 'both', scope=list(lower=null.model, upper=full.model), trace = F)
  # summary(model.selection)
  # print(model.selection$call)
  
  lr.pred <- predict(best.lr, newdata=test,type='response')
  lr.pred <- ifelse(lr.pred >= 0.5, 'yes', 'no')
  
  cm.lr <- confusionMatrix(as.factor(lr.pred), as.factor(test$churn), mode='everything',positive='yes')
  
  # LDA

  lda <- lda(churn~., data=train)
  # lda <- lda(train[,c(-4,-15)], grouping = train[,15])
  lda.pred <- predict(lda, newdata=test,type='response')
  # lda.pred <- predict(lda, newdata=test[,c(-4,-15)],type='response')
  cm.lda <- confusionMatrix(as.factor(lda.pred$class), as.factor(test$churn), mode='everything',positive='yes')

  # NB

  nb <- naiveBayes(churn~., data=train, family='binomial')
  nb.pred <- predict(nb, newdata=test,type='raw')
  nb.pred <- ifelse(nb.pred[,2] >= 0.5, 'yes','no')
  cm.nb <- confusionMatrix(as.factor(nb.pred), as.factor(test$churn), mode='everything',positive='yes')
  

  # KNN
  k.vals <- c(1,3,5,10,15,20,25,30)

  get.knn.cms <- function(k){
    # given a value of k, get the cm
    
    # Only use numeric columns
    train_knn <- train[, sapply(train, is.numeric)]
    test_knn <- test[, sapply(test, is.numeric)]
  
    knn.fit <- knn(train = train_knn, test = test_knn, cl = train$churn, k = k)
    cm <- confusionMatrix(as.factor(knn.fit), as.factor(test$churn), mode = 'everything', positive = 'yes')
    return(list(k_val = as.character(k), cm.knn = cm))
  }
  cm.knns <- lapply(k.vals, get.knn.cms)
  
  
  return(list(cm.lr=cm.lr, cm.lda=cm.lda, cm.nb=cm.nb, cm.knns=cm.knns))
})

Now that we have the models, we can compare their performance. We will collect the sensitivity, specificity, accuracy, F1 score, and precision of each model in each fold, and then compare the mean values of each.

model_names <- c("lr", "lda", "nb")  
k.vals <- c(1,3,5,10,15,20,25,30)

# Initialize models list including KNN
models <- list(
  lr = vector("list", 5),
  lda = vector("list", 5),
  nb = vector("list", 5),
  knn = vector("list", length(k.vals))
)

# Name each sublist for the metrics
metric_names <- c("Sensitivity", "Specificity", "Accuracy", "F1_Score", "Precision")  
names(models$lr) <- metric_names
names(models$nb) <- metric_names
names(models$lda) <- metric_names

# Initialize KNN metrics for each k value
for (k in seq_along(k.vals)) {
  models$knn[[k]] <- vector("list", 5)  
  names(models$knn[[k]]) <- metric_names  
}


# look at data, print sens, spec, acc, f1
for (i in seq_along(folds.list)){
  fold.data <- model.performances[[i]]

  # non KNN data
  for (j in 1:3){
    cm <- fold.data[[j]]
    model_name <- model_names[j]
    # enter each metric into the list
    models[[model_name]]$Sensitivity[[i]] <- cm$byClass[1]
    models[[model_name]]$Specificity[[i]] <- cm$byClass[2]
    models[[model_name]]$Accuracy[[i]] <- cm$overall[1]
    models[[model_name]]$F1_Score[[i]] <- cm$byClass[7]
    models[[model_name]]$Precision[[i]] <- cm$byClass[5]
  }
  # KNN data
  knn.data <- fold.data[[4]]
  for (x in seq_along(k.vals)) {
    k <- k.vals[x]
    cm <- knn.data[[x]][[2]]
    models$knn[[x]]$Sensitivity[[i]] <- cm$byClass[1]
    models$knn[[x]]$Specificity[[i]] <- cm$byClass[2]
    models$knn[[x]]$Accuracy[[i]] <- cm$overall[1]
    models$knn[[x]]$F1_Score[[i]] <- cm$byClass[7]
    models$knn[[x]]$Precision[[i]] <- cm$byClass[5] 
    # print(k)
    # print(cm$table)
  }
}
# non-KNN models
for (model_name in model_names) {
  cat("Model:", model_name, "\n")
  
  # get means for each metric
  for (metric in metric_names) {
    metric_values <- unlist(models[[model_name]][[metric]])
    mean_value <- mean(metric_values)
    cat(paste(metric, "mean:", mean_value), "\n")
  }
  cat("\n") 
}
## Model: lr 
## Sensitivity mean: 0.801408450704225 
## Specificity mean: 0.0213953488372093 
## Accuracy mean: 0.131936127744511 
## F1_Score mean: 0.207345601042792 
## Precision mean: 0.119078932526484 
## 
## Model: lda 
## Sensitivity mean: 0.23943661971831 
## Specificity mean: 0.961627906976744 
## Accuracy mean: 0.859281437125748 
## F1_Score mean: 0.323734799082412 
## Precision mean: 0.512067218516668 
## 
## Model: nb 
## Sensitivity mean: 0.795774647887324 
## Specificity mean: 0.0116279069767442 
## Accuracy mean: 0.122754491017964 
## F1_Score mean: 0.204451163726151 
## Precision mean: 0.117294063214245
# KNN models
cat("KNN Results:\n")
## KNN Results:
for (x in seq_along(k.vals)) {
  k <- k.vals[x]
  cat("K value:", k, "\n")
  
  # get means for each metric
  for (metric in metric_names) {
    metric_values <- unlist(models$knn[[x]][[metric]])
    mean_value <- mean(metric_values)
    cat(paste(metric, "mean:", mean_value), "\n")
  }
  cat("\n")  
}
## K value: 1 
## Sensitivity mean: 0.311267605633803 
## Specificity mean: 0.877906976744186 
## Accuracy mean: 0.797604790419162 
## F1_Score mean: 0.302567858649983 
## Precision mean: 0.295867851407769 
## 
## K value: 3 
## Sensitivity mean: 0.257746478873239 
## Specificity mean: 0.950232558139535 
## Accuracy mean: 0.852095808383234 
## F1_Score mean: 0.329752101901635 
## Precision mean: 0.460627173963411 
## 
## K value: 5 
## Sensitivity mean: 0.242253521126761 
## Specificity mean: 0.974651162790698 
## Accuracy mean: 0.870858283433134 
## F1_Score mean: 0.346218838705662 
## Precision mean: 0.613222822263248 
## 
## K value: 10 
## Sensitivity mean: 0.22112676056338 
## Specificity mean: 0.983023255813953 
## Accuracy mean: 0.875049900199601 
## F1_Score mean: 0.332709803522071 
## Precision mean: 0.680609580435667 
## 
## K value: 15 
## Sensitivity mean: 0.216901408450704 
## Specificity mean: 0.983720930232558 
## Accuracy mean: 0.875049900199601 
## F1_Score mean: 0.327811982985929 
## Precision mean: 0.689641432910035 
## 
## K value: 20 
## Sensitivity mean: 0.208450704225352 
## Specificity mean: 0.985116279069767 
## Accuracy mean: 0.875049900199601 
## F1_Score mean: 0.318939860246865 
## Precision mean: 0.692614955731963 
## 
## K value: 25 
## Sensitivity mean: 0.194366197183099 
## Specificity mean: 0.987209302325581 
## Accuracy mean: 0.874850299401198 
## F1_Score mean: 0.303679665495442 
## Precision mean: 0.716749338179208 
## 
## K value: 30 
## Sensitivity mean: 0.198591549295775 
## Specificity mean: 0.987906976744186 
## Accuracy mean: 0.876047904191617 
## F1_Score mean: 0.310814801850237 
## Precision mean: 0.733378544400216

Comparing the performance of each model, we see the best precision with KNN at k=30, which has a precision of 73% and a relatively high accuracy of 88%. As we chose to optimize precision in part (b), this is our optimal model.

The other KNN models have variable performance, but precision increases with k, rising from 30% at k=1 to a high of 73% at k=30. Accuracy also increases with precision, but begins to plateau around k=5 at roughly 87%.

For the non-KNN models, LDA has the best precision at 51%, while LR and NB perform similarly poorly, with precisions of about 12% and accuracies of 12-13%.

While we did not choose to optimize it in part (b), it is important to take note of the sensitivities of the models; a company may choose to be particularly risk averse and prioritize detecting all customers at risk of churning, even at the cost of some precision. If that were the case, the highest-sensitivity models are LR and NB, both with a sensitivity of about 80%. This sensitivity should not be taken at face value, however: as we saw in Problem 2, sensitivity and specificity can be inflated when a model predicts one class far more often than it actually occurs. We can examine the confusion matrices of the two models from the first fold:

cms <- model.performances[[1]]
lr.cm <- cms$cm.lr
nb.cm <- cms$cm.nb
print('LR:', quote=F)
## [1] LR:
print(lr.cm$table)
##           Reference
## Prediction yes  no
##        yes  58 422
##        no   13   8
print('NB:',quote=F)
## [1] NB:
print(nb.cm$table)
##           Reference
## Prediction yes  no
##        yes  50 423
##        no   21   7

We can see that sensitivity is an unreliable metric in this case, as both models predict positive values far more often than they actually occur. This is what produces a high sensitivity alongside low precision and accuracy. With imbalanced data, where positive values are much less common than negative values, this behavior hurts overall performance, as reflected in the low accuracy and precision of both logistic regression and naive Bayes.
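As a quick check, these rates can be recomputed by hand from the fold-1 LR table above:

# fold-1 logistic regression: 58 true positives, 422 false positives, 13 false negatives
58 / (58 + 13)    # sensitivity, roughly 0.82
58 / (58 + 422)   # precision, roughly 0.12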

Overall, if we consider the best model to be the one that optimizes our chosen classification statistic from part (b), then KNN with k=30 is the best model. It provides a relatively high precision of 73%, which is important to a company looking to identify which customers are at risk of churning.