STA 6923 Fall 2024 Homework 3

HW 3

1. Exercise 4.8.7, Suppose that we wish to predict whether a given stock will issue a dividend this year (“Yes” or “No”) based on $X$, last year’s percent profit. We examine a large number of companies and discover that the mean value of $X$ for companies that issued a dividend was $\bar{X} = 10$, while the mean for those that didn’t was $\bar{X} = 0$. In addition, the variance of $X$ for these two sets of companies was $\hat{\sigma}^2 = 36$. Finally, 80% of companies issued dividends. Assuming that $X$ follows a normal distribution, predict the probability that a company will issue a dividend this year given that its percentage profit was $X = 4$ last year.

Hint: Recall the density function for a normal random variable: $f(x) =\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/2\sigma^2}$ You will need to use Bayes’ theorem.

Let $Y$ represent whether a given stock will issue a dividend this year (“Yes” or “No”) and $X$ represent last year’s percent profit.

$X_{Yes} \sim \mathcal{N}(10, 36)$ and $X_{No} \sim \mathcal{N}(0, 36)$

$\pi_k$ represents the prior probability of class $k$.

$f_k$ represents the probability density function for $X$ in class $k$.

$p_k$ represents the posterior probability of class $k$.

\[\begin{align} p_k(Y=k|X=x) &= \frac{P(X=x|Y=k)P(Y=k)}{P(X=x)} &\text{Bayes' Theorem} \\ &= \frac{P(X=x|Y=k)P(Y=k)}{\sum_{i=1}^{k} P(X=x|Y=l)P(Y=l)} \\ &= \frac{\pi_k f_k(x)}{\sum_{l=1}^{k} \pi_l f_l(x)} \\ &= \frac{\pi_k f_k(x)}{\pi_{Yes} f_{Yes}(x) + \pi_{No} f_{No}(x)} \\ &= \frac{\pi_k \frac{1}{\sqrt{2\pi\sigma_k^2}} e^{-(x-\mu_k)^2/2\sigma_k^2}} {\pi_{Yes}\frac{1}{\sqrt{2\pi\sigma_{Yes}^2}} e^{-(x-\mu_{Yes})^2/2\sigma_{Yes}^2}+ \pi_{No}\frac{1}{\sqrt{2\pi\sigma_{No}^2}} e^{-(x-\mu_{No})^2/2\sigma_{No}^2}} &\text{Substitute the equation for $f_k(x)$} \\ &= \frac{\pi_k e^{-(x-\mu_k)^2/2\sigma^2}}{\pi_{Yes}e^{-(x-\mu_{Yes})^2/2\sigma^2}+\pi_{No}e^{-(x-\mu_{No})^2/2\sigma^2}} &\text{Cross out terms given $\sigma_{Yes}=\sigma_{No}$} \\ \\ p_{Yes}(Y=Yes|X=4) &= \frac{0.8e^{-(4-10)^2/(2\times 6^2)}}{0.8e^{-(4-10)^2/(2\times 6^2)} + 0.2e^{-(4-0)^2/(2\times 6^2)}} \\ &= \frac{0.8e^{-1/2}}{0.8e^{-1/2}+0.2e^{-2/9}} \\ &= 0.752 \end{align}\]

2. Exercise 4.8.13, This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

library(ISLR2)
library(corrplot)

## corrplot 0.94 loaded

data(Weekly)
summary(Weekly)

##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
##

cor(Weekly[,-9])

##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

pairs(Weekly)

corrplot(cor(Weekly[, -9]), method = "color")

plot(Weekly$Year, Weekly$Volume, xlab = "Year", ylab = "Volume")

The Weekly data set from ISLR2 shows the weekly percentage returns for the S&P 500 stock index between 1990 and 2010. Year refers to the year that the observation was recorded. Lag1-5 refers to the percentage return for the 1-5 weeks prior. Volume refers to the average number of daily shares traded in billions. Today refers to the percentage return for this week. Direction is a factor with levels Down and Up, indicating whether the market had a positive or negative return on a given week. There appears to be a strong positive correlation between Year and Volume.

b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(glm.fit)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Lag2 appears statistically significant assuming significant level of 0.05, with p value of 0.0296<0.05. The other predictors all have p values >0.05, but Lag1 is the second best.

c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

glm.probs <- predict(glm.fit, type = "response")  # Get predicted probabilities
glm.pred <- ifelse(glm.probs > 0.5, "Up", "Down")  # Classify as "Up" or "Down"
conf_matrix <- table(glm.pred, Weekly$Direction)  # Compare predictions to actual values

# Extract values from the confusion matrix
TN <- conf_matrix[1, 1]  # True Negatives
FP <- conf_matrix[2, 1]  # False Positives
FN <- conf_matrix[1, 2]  # False Negatives
TP <- conf_matrix[2, 2]  # True Positives

# Calculate metrics
accuracy <- mean(glm.pred == Weekly$Direction)  # Calculate accuracy
sensitivity <- TP / (TP + FN)  # Sensitivity (True Positive Rate)
specificity <- TN / (TN + FP)  # Specificity (True Negative Rate)
type_I_error <- FP / (FP + TN)  # Type I Error (False Positive Rate)
type_II_error <- FN / (FN + TP)  # Type II Error (False Negative Rate)

cat("Accuracy:", accuracy, "\n")

## Accuracy: 0.5610652

cat("Sensitivity:", sensitivity, "\n")

## Sensitivity: 0.9206612

cat("Specificity:", specificity, "\n")

## Specificity: 0.1115702

cat("Type I Error:", type_I_error, "\n")

## Type I Error: 0.8884298

cat("Type II Error:", type_II_error, "\n")

## Type II Error: 0.07933884

The accuracy is only 56.1% with a high sensitivity of 0.92 and a low specificity of 0.11. The type 1 error is high at 0.888 and type 2 error is low at 0.079. The model overwhelmingly predicts weekly direction of “Up” even when the market actually went “Down”.

d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

train <- Weekly[Weekly$Year < 2009, ]  # Data from 1990 to 2008
test <- Weekly[Weekly$Year >= 2009, ]  # Data from 2009 and 2010
logit.fit <- glm(Direction ~ Lag2, data = train, family = binomial)
logit.probs <- predict(logit.fit, newdata = test, type = "response")
logit.pred <- ifelse(logit.probs > 0.5, "Up", "Down")
conf_matrix <- table(logit.pred, test$Direction)
print(conf_matrix)

##           
## logit.pred Down Up
##       Down    9  5
##       Up     34 56

accuracy <- mean(logit.pred == test$Direction)
print(accuracy)

## [1] 0.625

e) Repeat d) using LDA

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:ISLR2':
## 
##     Boston

lda.fit <- lda(Direction ~ Lag2, data = train)
lda.pred <- predict(lda.fit, newdata = test)
conf_matrix <- table(lda.pred$class, test$Direction)  
print(conf_matrix)

##       
##        Down Up
##   Down    9  5
##   Up     34 56

accuracy <- mean(lda.pred$class == test$Direction)  
print(accuracy)

## [1] 0.625

f) QDA

qda.fit <- qda(Direction ~ Lag2, data = train)
qda.pred <- predict(qda.fit, newdata = test)
conf_matrix <- table(qda.pred$class, test$Direction)  
print(conf_matrix)

##       
##        Down Up
##   Down    0  0
##   Up     43 61

accuracy <- mean(qda.pred$class == test$Direction)  
print(accuracy)

## [1] 0.5865385

g) KNN with K=1

library(class)
train_X <- as.matrix(train$Lag2) # predictor for training data
test_X <- as.matrix(test$Lag2)   # predictor for test data
train_Y <- train$Direction      # response for training data
test_Y <- test$Direction        # response for test data
knn.pred <- knn(train_X, test_X, train_Y, k = 1)
conf_matrix <- table(knn.pred, test_Y)
print(conf_matrix)

##         test_Y
## knn.pred Down Up
##     Down   21 30
##     Up     22 31

accuracy <- mean(knn.pred == test_Y)
print(accuracy)

## [1] 0.5

h) Naive Bayes

library(e1071)
nb.fit <- naiveBayes(Direction ~ Lag2, data = train)
nb.pred <- predict(nb.fit, newdata = test)
conf_matrix <- table(nb.pred, test$Direction)
print(conf_matrix)

##        
## nb.pred Down Up
##    Down    0  0
##    Up     43 61

accuracy <- mean(nb.pred == test$Direction)
print(accuracy)

## [1] 0.5865385

i) Which of these methods appears to provide the best results on this data?

Logistic regression and LDA had accuracy of 62.5%, QDA and naive Bayes 58.65% (only predicting market “up”), and KNN with K=1 50%. Both logistic regression and LDA appear to provide the best results on this data, assuming the criteria is accuracy. They even have the same confusion matrix since they both involve linear decision boundaries.

j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

library(leaps)
mod.sel <- regsubsets(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly)
reg.summary <- summary(mod.sel)
names(reg.summary)

## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

reg.summary$outmat

##          Lag1 Lag2 Lag3 Lag4 Lag5 Volume
## 1  ( 1 ) " "  "*"  " "  " "  " "  " "   
## 2  ( 1 ) "*"  "*"  " "  " "  " "  " "   
## 3  ( 1 ) "*"  "*"  " "  "*"  " "  " "   
## 4  ( 1 ) "*"  "*"  "*"  "*"  " "  " "   
## 5  ( 1 ) "*"  "*"  "*"  "*"  " "  "*"   
## 6  ( 1 ) "*"  "*"  "*"  "*"  "*"  "*"

cat(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)]) # 1

## 1 8.215667

cat(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)]) # 2

## 2 0.8082199

cat(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)]) # 2

## 2 0.005453532

logit.fit <- glm(Direction ~ Lag1 + Lag2, data = train, family = binomial)
logit.probs <- predict(logit.fit, newdata = test, type = "response")
logit.pred <- ifelse(logit.probs > 0.5, "Up", "Down")
conf_matrix <- table(logit.pred, test$Direction)
print(conf_matrix)

##           
## logit.pred Down Up
##       Down    7  8
##       Up     36 53

accuracy <- mean(logit.pred == test$Direction)
print(accuracy)  # 0.577

## [1] 0.5769231

logit.fit <- glm(Direction ~ Lag1 + Lag2 + Lag1:Lag2, data = train, family = binomial)
logit.probs <- predict(logit.fit, newdata = test, type = "response")
logit.pred <- ifelse(logit.probs > 0.5, "Up", "Down")
conf_matrix <- table(logit.pred, test$Direction)
print(conf_matrix)

##           
## logit.pred Down Up
##       Down    7  8
##       Up     36 53

accuracy <- mean(logit.pred == test$Direction)
print(accuracy)  # 0.577

## [1] 0.5769231

lda.fit <- lda(Direction ~ Lag1 + Lag2, data = train)
lda.pred <- predict(lda.fit, newdata = test)
conf_matrix <- table(lda.pred$class, test$Direction)  
print(conf_matrix)

##       
##        Down Up
##   Down    7  8
##   Up     36 53

accuracy <- mean(lda.pred$class == test$Direction)  
print(accuracy) # 0.577

## [1] 0.5769231

lda.fit <- lda(Direction ~ Lag1 + Lag2 + Lag1:Lag2, data = train)
lda.pred <- predict(lda.fit, newdata = test)
conf_matrix <- table(lda.pred$class, test$Direction)  
print(conf_matrix)

##       
##        Down Up
##   Down    7  8
##   Up     36 53

accuracy <- mean(lda.pred$class == test$Direction)  
print(accuracy) # 0.577

## [1] 0.5769231

qda.fit <- qda(Direction ~ Lag1 + Lag2, data = train)
qda.pred <- predict(qda.fit, newdata = test)
conf_matrix <- table(qda.pred$class, test$Direction)  
print(conf_matrix)

##       
##        Down Up
##   Down    7 10
##   Up     36 51

accuracy <- mean(qda.pred$class == test$Direction)  
print(accuracy) # 0.558

## [1] 0.5576923

qda.fit <- qda(Direction ~ Lag1 + Lag2 + Lag1:Lag2, data = train)
qda.pred <- predict(qda.fit, newdata = test)
conf_matrix <- table(qda.pred$class, test$Direction)  
print(conf_matrix)

##       
##        Down Up
##   Down   23 36
##   Up     20 25

accuracy <- mean(qda.pred$class == test$Direction)  
print(accuracy) # 0.462

## [1] 0.4615385

nb.fit <- naiveBayes(Direction ~ Lag1 + Lag2, data = train)
nb.pred <- predict(nb.fit, newdata = test)
conf_matrix <- table(nb.pred, test$Direction)
print(conf_matrix)

##        
## nb.pred Down Up
##    Down    3  8
##    Up     40 53

accuracy <- mean(nb.pred == test$Direction)
print(accuracy) # 0.538

## [1] 0.5384615

for (k in 2:10) {
  knn.pred <- knn(train_X, test_X, train_Y, k = k)
  conf_matrix <- table(knn.pred, test_Y)
  print(paste("Confusion Matrix for k =", k))
  print(conf_matrix)
  accuracy <- mean(knn.pred == test_Y)
  print(paste("Accuracy for k =", k, ":", accuracy))
}

## [1] "Confusion Matrix for k = 2"
##         test_Y
## knn.pred Down Up
##     Down   26 23
##     Up     17 38
## [1] "Accuracy for k = 2 : 0.615384615384615"
## [1] "Confusion Matrix for k = 3"
##         test_Y
## knn.pred Down Up
##     Down   16 20
##     Up     27 41
## [1] "Accuracy for k = 3 : 0.548076923076923"
## [1] "Confusion Matrix for k = 4"
##         test_Y
## knn.pred Down Up
##     Down   22 21
##     Up     21 40
## [1] "Accuracy for k = 4 : 0.596153846153846"
## [1] "Confusion Matrix for k = 5"
##         test_Y
## knn.pred Down Up
##     Down   16 21
##     Up     27 40
## [1] "Accuracy for k = 5 : 0.538461538461538"
## [1] "Confusion Matrix for k = 6"
##         test_Y
## knn.pred Down Up
##     Down   18 21
##     Up     25 40
## [1] "Accuracy for k = 6 : 0.557692307692308"
## [1] "Confusion Matrix for k = 7"
##         test_Y
## knn.pred Down Up
##     Down   15 19
##     Up     28 42
## [1] "Accuracy for k = 7 : 0.548076923076923"
## [1] "Confusion Matrix for k = 8"
##         test_Y
## knn.pred Down Up
##     Down   15 19
##     Up     28 42
## [1] "Accuracy for k = 8 : 0.548076923076923"
## [1] "Confusion Matrix for k = 9"
##         test_Y
## knn.pred Down Up
##     Down   17 21
##     Up     26 40
## [1] "Accuracy for k = 9 : 0.548076923076923"
## [1] "Confusion Matrix for k = 10"
##         test_Y
## knn.pred Down Up
##     Down   17 22
##     Up     26 39
## [1] "Accuracy for k = 10 : 0.538461538461538"

Logistic regression and LDA had accuracy of 62.5%, QDA and naive Bayes 58.65% (only predicting market “up”), and KNN with K=1 50%. With the same confusion matrix, both logistic regression and LDA appear to provide the best results on this data.

There are 5 methods to experiment with: logistic regression, LDA, QDA, naive Bayes, and KNN. There are no predictors with degenerate distributions. We do not include year, so there are no highly correlated predictors. Moreover, there is no multicollinearity. Based on BIC, just Lag2 alone is ideal. However, Cp and AdjR2 criterion include Lag1, which makes sense considering it has the second lowest p-value. Hence, we tried these combination of predictors: Lag1 + Lag2 and Lag1 + Lag2 + Lag1:Lag2. Moreover, we assessed KNN with k=2 to 10. Based on the results above, adding predictors only worsened the accuracy. However, increasing k to >1 led to improved testing accuracy, although not much significant improvements past k=2.

3. Exercise 4.8.16, Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.

Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

data(Boston)
summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

# Create a binary response variable indicating whether the crime rate is above or below the median
Boston$high_crim <- ifelse(Boston$crim > median(Boston$crim), 1, 0)
glm.fit <- glm(high_crim ~ . - crim, data = Boston, family = binomial)
summary(glm.fit)

## 
## Call:
## glm(formula = high_crim ~ . - crim, family = binomial, data = Boston)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -34.103704   6.530014  -5.223 1.76e-07 ***
## zn           -0.079918   0.033731  -2.369  0.01782 *  
## indus        -0.059389   0.043722  -1.358  0.17436    
## chas          0.785327   0.728930   1.077  0.28132    
## nox          48.523782   7.396497   6.560 5.37e-11 ***
## rm           -0.425596   0.701104  -0.607  0.54383    
## age           0.022172   0.012221   1.814  0.06963 .  
## dis           0.691400   0.218308   3.167  0.00154 ** 
## rad           0.656465   0.152452   4.306 1.66e-05 ***
## tax          -0.006412   0.002689  -2.385  0.01709 *  
## ptratio       0.368716   0.122136   3.019  0.00254 ** 
## black        -0.013524   0.006536  -2.069  0.03853 *  
## lstat         0.043862   0.048981   0.895  0.37052    
## medv          0.167130   0.066940   2.497  0.01254 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 701.46  on 505  degrees of freedom
## Residual deviance: 211.93  on 492  degrees of freedom
## AIC: 239.93
## 
## Number of Fisher Scoring iterations: 9

glm.probs <- predict(glm.fit, type = "response") 
glm.pred <- ifelse(glm.probs > 0.5, 1, 0)  # Classify crime rate as "High" or "Low"
conf_matrix <- table(glm.pred, Boston$high_crim)
print(conf_matrix)

##         
## glm.pred   0   1
##        0 234  24
##        1  19 229

accuracy <- mean(glm.pred == Boston$high_crim)
print(accuracy) #0.915

## [1] 0.9150198

# Train test split
set.seed(1234567)
train_size <- floor(0.70 * nrow(Boston))
train_index <- sample(seq_len(nrow(Boston)), size = train_size)
train <- Boston[train_index, ]
test <- Boston[-train_index, ]

logit.fit <- glm(high_crim ~ . - crim, data = train, family = binomial)
logit.probs <- predict(logit.fit, newdata = test, type = "response")
logit.pred <- ifelse(logit.probs > 0.5, 1, 0)
conf_matrix <- table(logit.pred, test$high_crim)
print(conf_matrix)

##           
## logit.pred  0  1
##          0 66  8
##          1  3 75

accuracy <- mean(logit.pred == test$high_crim)
print(accuracy)  # 0.928

## [1] 0.9276316

lda.fit <- lda(high_crim ~ . - crim, data = train)
lda.pred <- predict(lda.fit, newdata = test)
conf_matrix <- table(lda.pred$class, test$high_crim)  
print(conf_matrix)

##    
##      0  1
##   0 68 21
##   1  1 62

accuracy <- mean(lda.pred$class == test$high_crim)  
print(accuracy) # 0.855

## [1] 0.8552632

nb.fit <- naiveBayes(high_crim ~ . - crim, data = train)
nb.pred <- predict(nb.fit, newdata = test)
conf_matrix <- table(nb.pred, test$high_crim)
print(conf_matrix)

##        
## nb.pred  0  1
##       0 62 20
##       1  7 63

accuracy <- mean(nb.pred == test$high_crim)
print(accuracy) # 0.822

## [1] 0.8223684

train_X <- as.matrix(train[,2:15]) # excludes crim
test_X <- as.matrix(test[,2:15])  
train_Y <- train$high_crim     
test_Y <- test$high_crim       
knn.pred <- knn(train_X, test_X, train_Y, k = 1)
conf_matrix <- table(knn.pred, test_Y)
print(conf_matrix)

##         test_Y
## knn.pred  0  1
##        0 65  4
##        1  4 79

accuracy <- mean(knn.pred == test_Y)
print(accuracy) # 0.947

## [1] 0.9473684

nearZeroVar(Boston) # none

## integer(0)

library(car)

## Loading required package: carData

vif(glm.fit) # none >10

##       zn    indus     chas      nox       rm      age      dis      rad 
## 1.872010 2.640170 1.233983 4.111013 5.734988 2.455666 3.876479 1.983825 
##      tax  ptratio    black    lstat     medv 
## 1.932622 2.191848 1.107171 2.670299 8.088729

mod.sel <- regsubsets(high_crim ~ . -crim, data = Boston)
reg.summary <- summary(mod.sel)
reg.summary$outmat

##          zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 ) " " " "   " "  "*" " " " " " " " " " " " "     " "   " "   " " 
## 2  ( 1 ) " " " "   " "  "*" " " " " " " "*" " " " "     " "   " "   " " 
## 3  ( 1 ) " " " "   " "  "*" " " "*" " " "*" " " " "     " "   " "   " " 
## 4  ( 1 ) " " " "   " "  "*" " " "*" " " "*" " " " "     " "   " "   "*" 
## 5  ( 1 ) "*" " "   " "  "*" " " "*" " " "*" " " " "     " "   " "   "*" 
## 6  ( 1 ) "*" " "   " "  "*" " " "*" " " "*" " " " "     "*"   " "   "*" 
## 7  ( 1 ) "*" " "   " "  "*" " " "*" " " "*" " " "*"     "*"   " "   "*" 
## 8  ( 1 ) "*" " "   " "  "*" " " "*" " " "*" " " "*"     "*"   "*"   "*"

cat(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)]) # 4

## 4 -430.8029

cat(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)]) # 6

## 6 4.123717

cat(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)]) # 7

## 7 0.598924

cor(Boston)

##                  crim          zn       indus         chas         nox
## crim       1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn        -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus      0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas      -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox        0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm        -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age        0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis       -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad        0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax        0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio    0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black     -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat      0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv      -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
## high_crim  0.40939545 -0.43615103  0.60326017  0.070096774  0.72323480
##                    rm         age         dis          rad         tax
## crim      -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431
## zn         0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332
## indus     -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018
## chas       0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox       -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320
## rm         1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783
## age       -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559
## dis        0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158
## rad       -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819
## tax       -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000
## ptratio   -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304
## black      0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801
## lstat     -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341
## medv       0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593
## high_crim -0.15637178  0.61393992 -0.61634164  0.619786249  0.60874128
##              ptratio       black      lstat       medv   high_crim
## crim       0.2899456 -0.38506394  0.4556215 -0.3883046  0.40939545
## zn        -0.3916785  0.17552032 -0.4129946  0.3604453 -0.43615103
## indus      0.3832476 -0.35697654  0.6037997 -0.4837252  0.60326017
## chas      -0.1215152  0.04878848 -0.0539293  0.1752602  0.07009677
## nox        0.1889327 -0.38005064  0.5908789 -0.4273208  0.72323480
## rm        -0.3555015  0.12806864 -0.6138083  0.6953599 -0.15637178
## age        0.2615150 -0.27353398  0.6023385 -0.3769546  0.61393992
## dis       -0.2324705  0.29151167 -0.4969958  0.2499287 -0.61634164
## rad        0.4647412 -0.44441282  0.4886763 -0.3816262  0.61978625
## tax        0.4608530 -0.44180801  0.5439934 -0.4685359  0.60874128
## ptratio    1.0000000 -0.17738330  0.3740443 -0.5077867  0.25356836
## black     -0.1773833  1.00000000 -0.3660869  0.3334608 -0.35121093
## lstat      0.3740443 -0.36608690  1.0000000 -0.7376627  0.45326273
## medv      -0.5077867  0.33346082 -0.7376627  1.0000000 -0.26301673
## high_crim  0.2535684 -0.35121093  0.4532627 -0.2630167  1.00000000

pairs(Boston)

corrplot(cor(Boston), method = "color")

highCorr <- findCorrelation(cor(Boston), cutoff = .75)
length(highCorr)

## [1] 3

colnames(Boston)[highCorr] # indus, tax, nox

## [1] "indus" "nox"   "tax"

logit.fit <- glm(high_crim ~ zn + nox + dis + rad + tax + ptratio + black + medv, data = train, family = binomial)
logit.probs <- predict(logit.fit, newdata = test, type = "response")
logit.pred <- ifelse(logit.probs > 0.5, 1, 0)
conf_matrix <- table(logit.pred, test$high_crim)
print(conf_matrix)

##           
## logit.pred  0  1
##          0 65 12
##          1  4 71

accuracy <- mean(logit.pred == test$high_crim)
print(accuracy)  # 0.895

## [1] 0.8947368

logit.fit <- glm(high_crim ~ nox + rad + age + medv, data = train, family = binomial)
logit.probs <- predict(logit.fit, newdata = test, type = "response")
logit.pred <- ifelse(logit.probs > 0.5, 1, 0)
conf_matrix <- table(logit.pred, test$high_crim)
print(conf_matrix)

##           
## logit.pred  0  1
##          0 63 18
##          1  6 65

accuracy <- mean(logit.pred == test$high_crim)
print(accuracy)  # 0.842

## [1] 0.8421053

lda.fit <- lda(high_crim ~ zn + nox + dis + rad + tax + ptratio + black + medv, data = train)
lda.pred <- predict(lda.fit, newdata = test)
conf_matrix <- table(lda.pred$class, test$high_crim)  
print(conf_matrix)

##    
##      0  1
##   0 68 21
##   1  1 62

accuracy <- mean(lda.pred$class == test$high_crim)  
print(accuracy) # 0.855

## [1] 0.8552632

lda.fit <- lda(high_crim ~ nox + rad + age + medv, data = train)
lda.pred <- predict(lda.fit, newdata = test)
conf_matrix <- table(lda.pred$class, test$high_crim)  
print(conf_matrix)

##    
##      0  1
##   0 68 21
##   1  1 62

accuracy <- mean(lda.pred$class == test$high_crim)  
print(accuracy) # 0.855

## [1] 0.8552632

nb.fit <- naiveBayes(high_crim ~ zn + nox + dis + rad + tax + ptratio + black + medv, data = train)
nb.pred <- predict(nb.fit, newdata = test)
conf_matrix <- table(nb.pred, test$high_crim)
print(conf_matrix)

##        
## nb.pred  0  1
##       0 65 20
##       1  4 63

accuracy <- mean(nb.pred == test$high_crim)
print(accuracy) # 0.842

## [1] 0.8421053

nb.fit <- naiveBayes(high_crim ~ nox + rad + age + medv, data = train)
nb.pred <- predict(nb.fit, newdata = test)
conf_matrix <- table(nb.pred, test$high_crim)
print(conf_matrix)

##        
## nb.pred  0  1
##       0 65 21
##       1  4 62

accuracy <- mean(nb.pred == test$high_crim)
print(accuracy) # 0.836

## [1] 0.8355263

for (k in 2:10) {
  knn.pred <- knn(train_X, test_X, train_Y, k = k)
  conf_matrix <- table(knn.pred, test_Y)
  print(paste("Confusion Matrix for k =", k))
  print(conf_matrix)
  accuracy <- mean(knn.pred == test_Y)
  print(paste("Accuracy for k =", k, ":", accuracy))
}

## [1] "Confusion Matrix for k = 2"
##         test_Y
## knn.pred  0  1
##        0 61  5
##        1  8 78
## [1] "Accuracy for k = 2 : 0.914473684210526"
## [1] "Confusion Matrix for k = 3"
##         test_Y
## knn.pred  0  1
##        0 64  4
##        1  5 79
## [1] "Accuracy for k = 3 : 0.940789473684211"
## [1] "Confusion Matrix for k = 4"
##         test_Y
## knn.pred  0  1
##        0 64  6
##        1  5 77
## [1] "Accuracy for k = 4 : 0.927631578947368"
## [1] "Confusion Matrix for k = 5"
##         test_Y
## knn.pred  0  1
##        0 64  6
##        1  5 77
## [1] "Accuracy for k = 5 : 0.927631578947368"
## [1] "Confusion Matrix for k = 6"
##         test_Y
## knn.pred  0  1
##        0 64  6
##        1  5 77
## [1] "Accuracy for k = 6 : 0.927631578947368"
## [1] "Confusion Matrix for k = 7"
##         test_Y
## knn.pred  0  1
##        0 62  6
##        1  7 77
## [1] "Accuracy for k = 7 : 0.914473684210526"
## [1] "Confusion Matrix for k = 8"
##         test_Y
## knn.pred  0  1
##        0 62  7
##        1  7 76
## [1] "Accuracy for k = 8 : 0.907894736842105"
## [1] "Confusion Matrix for k = 9"
##         test_Y
## knn.pred  0  1
##        0 61  7
##        1  8 76
## [1] "Accuracy for k = 9 : 0.901315789473684"
## [1] "Confusion Matrix for k = 10"
##         test_Y
## knn.pred  0  1
##        0 60  7
##        1  9 76
## [1] "Accuracy for k = 10 : 0.894736842105263"

train_X <- as.matrix(train[, c("zn", "nox", "dis", "rad", "tax", "ptratio", "black", "medv")])
test_X <- as.matrix(test[, c("zn", "nox", "dis", "rad", "tax", "ptratio", "black", "medv")])
for (k in 1:10) {
  knn.pred <- knn(train_X, test_X, train_Y, k = k)
  conf_matrix <- table(knn.pred, test_Y)
  print(paste("Confusion Matrix for k =", k))
  print(conf_matrix)
  accuracy <- mean(knn.pred == test_Y)
  print(paste("Accuracy for k =", k, ":", accuracy))
}

## [1] "Confusion Matrix for k = 1"
##         test_Y
## knn.pred  0  1
##        0 65  5
##        1  4 78
## [1] "Accuracy for k = 1 : 0.940789473684211"
## [1] "Confusion Matrix for k = 2"
##         test_Y
## knn.pred  0  1
##        0 65  4
##        1  4 79
## [1] "Accuracy for k = 2 : 0.947368421052632"
## [1] "Confusion Matrix for k = 3"
##         test_Y
## knn.pred  0  1
##        0 65  4
##        1  4 79
## [1] "Accuracy for k = 3 : 0.947368421052632"
## [1] "Confusion Matrix for k = 4"
##         test_Y
## knn.pred  0  1
##        0 66  5
##        1  3 78
## [1] "Accuracy for k = 4 : 0.947368421052632"
## [1] "Confusion Matrix for k = 5"
##         test_Y
## knn.pred  0  1
##        0 65  5
##        1  4 78
## [1] "Accuracy for k = 5 : 0.940789473684211"
## [1] "Confusion Matrix for k = 6"
##         test_Y
## knn.pred  0  1
##        0 63  6
##        1  6 77
## [1] "Accuracy for k = 6 : 0.921052631578947"
## [1] "Confusion Matrix for k = 7"
##         test_Y
## knn.pred  0  1
##        0 62  4
##        1  7 79
## [1] "Accuracy for k = 7 : 0.927631578947368"
## [1] "Confusion Matrix for k = 8"
##         test_Y
## knn.pred  0  1
##        0 60  4
##        1  9 79
## [1] "Accuracy for k = 8 : 0.914473684210526"
## [1] "Confusion Matrix for k = 9"
##         test_Y
## knn.pred  0  1
##        0 60  4
##        1  9 79
## [1] "Accuracy for k = 9 : 0.914473684210526"
## [1] "Confusion Matrix for k = 10"
##         test_Y
## knn.pred  0  1
##        0 60  7
##        1  9 76
## [1] "Accuracy for k = 10 : 0.894736842105263"

train_X <- as.matrix(train[, c("nox", "rad", "age", "medv")])
test_X <- as.matrix(test[, c("nox", "rad", "age", "medv")])
for (k in 1:10) {
  knn.pred <- knn(train_X, test_X, train_Y, k = k)
  conf_matrix <- table(knn.pred, test_Y)
  print(paste("Confusion Matrix for k =", k))
  print(conf_matrix)
  accuracy <- mean(knn.pred == test_Y)
  print(paste("Accuracy for k =", k, ":", accuracy))
}

## [1] "Confusion Matrix for k = 1"
##         test_Y
## knn.pred  0  1
##        0 56 18
##        1 13 65
## [1] "Accuracy for k = 1 : 0.796052631578947"
## [1] "Confusion Matrix for k = 2"
##         test_Y
## knn.pred  0  1
##        0 60 16
##        1  9 67
## [1] "Accuracy for k = 2 : 0.835526315789474"
## [1] "Confusion Matrix for k = 3"
##         test_Y
## knn.pred  0  1
##        0 57 17
##        1 12 66
## [1] "Accuracy for k = 3 : 0.809210526315789"
## [1] "Confusion Matrix for k = 4"
##         test_Y
## knn.pred  0  1
##        0 58 16
##        1 11 67
## [1] "Accuracy for k = 4 : 0.822368421052632"
## [1] "Confusion Matrix for k = 5"
##         test_Y
## knn.pred  0  1
##        0 59 15
##        1 10 68
## [1] "Accuracy for k = 5 : 0.835526315789474"
## [1] "Confusion Matrix for k = 6"
##         test_Y
## knn.pred  0  1
##        0 60 17
##        1  9 66
## [1] "Accuracy for k = 6 : 0.828947368421053"
## [1] "Confusion Matrix for k = 7"
##         test_Y
## knn.pred  0  1
##        0 61 15
##        1  8 68
## [1] "Accuracy for k = 7 : 0.848684210526316"
## [1] "Confusion Matrix for k = 8"
##         test_Y
## knn.pred  0  1
##        0 61 17
##        1  8 66
## [1] "Accuracy for k = 8 : 0.835526315789474"
## [1] "Confusion Matrix for k = 9"
##         test_Y
## knn.pred  0  1
##        0 61 17
##        1  8 66
## [1] "Accuracy for k = 9 : 0.835526315789474"
## [1] "Confusion Matrix for k = 10"
##         test_Y
## knn.pred  0  1
##        0 61 15
##        1  8 68
## [1] "Accuracy for k = 10 : 0.848684210526316"

There are 4 methods to experiment with: logistic regression, LDA, naive Bayes, and KNN. There are no predictors with degenerate distributions and no multicollinearity. Based on the logistic regression summary, the significant (p<0.05) predictors are zn, nox, dis, rad, tax, ptratio, black, and medv. 3 variables are highly correlated: indus, tax, and nox. Since nox is significant, we can exclude indus and tax with confidence. Based on BIC criterion, I will select nox, rad, age, and medv as a set of predictors. These are the final combinations: Set 1: zn + nox + dis + rad + tax + ptratio + black + medv Set 2: nox + rad + age + medv We will also assess KNN with k=2 to 10.

Based on the results above, using Set 1 predictors improved accuracy for naive Bayes and KNN, but Set 2 worsened accuracy for all models. Increasing k worsened results when using all predictors, improved up to a certain point for Set 1, and improved for Set 2.

4. This churn dataset comes from IBM Sample Data Sets. Customer churn occurs when customers stop doing business with a company, also known as customer attrition. The data set contains 5000 rows (customers) and 20 columns (features). The “Churn” column is our target which indicate whether customer churned (left the company) or not.

library(liver)

## 
## Attaching package: 'liver'

## The following object is masked from 'package:e1071':
## 
##     skewness

## The following object is masked from 'package:base':
## 
##     transform

data(churn)
str(churn)

## 'data.frame':    5000 obs. of  20 variables:
##  $ state         : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
##  $ area.code     : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
##  $ account.length: int  128 107 137 84 75 118 121 147 117 141 ...
##  $ voice.plan    : Factor w/ 2 levels "yes","no": 1 1 2 2 2 2 1 2 2 1 ...
##  $ voice.messages: int  25 26 0 0 0 0 24 0 0 37 ...
##  $ intl.plan     : Factor w/ 2 levels "yes","no": 2 2 2 1 1 1 2 1 2 1 ...
##  $ intl.mins     : num  10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
##  $ intl.calls    : int  3 3 5 7 3 6 7 6 4 5 ...
##  $ intl.charge   : num  2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
##  $ day.mins      : num  265 162 243 299 167 ...
##  $ day.calls     : int  110 123 114 71 113 98 88 79 97 84 ...
##  $ day.charge    : num  45.1 27.5 41.4 50.9 28.3 ...
##  $ eve.mins      : num  197.4 195.5 121.2 61.9 148.3 ...
##  $ eve.calls     : int  99 103 110 88 122 101 108 94 80 111 ...
##  $ eve.charge    : num  16.78 16.62 10.3 5.26 12.61 ...
##  $ night.mins    : num  245 254 163 197 187 ...
##  $ night.calls   : int  91 103 104 89 121 118 118 96 90 97 ...
##  $ night.charge  : num  11.01 11.45 7.32 8.86 8.41 ...
##  $ customer.calls: int  1 1 0 2 3 0 3 0 1 0 ...
##  $ churn         : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...

summary(churn)

##      state              area.code    account.length  voice.plan
##  WV     : 158   area_code_408:1259   Min.   :  1.0   yes:1323  
##  MN     : 125   area_code_415:2495   1st Qu.: 73.0   no :3677  
##  AL     : 124   area_code_510:1246   Median :100.0             
##  ID     : 119                        Mean   :100.3             
##  VA     : 118                        3rd Qu.:127.0             
##  OH     : 116                        Max.   :243.0             
##  (Other):4240                                                  
##  voice.messages   intl.plan    intl.mins       intl.calls      intl.charge   
##  Min.   : 0.000   yes: 473   Min.   : 0.00   Min.   : 0.000   Min.   :0.000  
##  1st Qu.: 0.000   no :4527   1st Qu.: 8.50   1st Qu.: 3.000   1st Qu.:2.300  
##  Median : 0.000              Median :10.30   Median : 4.000   Median :2.780  
##  Mean   : 7.755              Mean   :10.26   Mean   : 4.435   Mean   :2.771  
##  3rd Qu.:17.000              3rd Qu.:12.00   3rd Qu.: 6.000   3rd Qu.:3.240  
##  Max.   :52.000              Max.   :20.00   Max.   :20.000   Max.   :5.400  
##                                                                              
##     day.mins       day.calls     day.charge       eve.mins       eve.calls    
##  Min.   :  0.0   Min.   :  0   Min.   : 0.00   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:143.7   1st Qu.: 87   1st Qu.:24.43   1st Qu.:166.4   1st Qu.: 87.0  
##  Median :180.1   Median :100   Median :30.62   Median :201.0   Median :100.0  
##  Mean   :180.3   Mean   :100   Mean   :30.65   Mean   :200.6   Mean   :100.2  
##  3rd Qu.:216.2   3rd Qu.:113   3rd Qu.:36.75   3rd Qu.:234.1   3rd Qu.:114.0  
##  Max.   :351.5   Max.   :165   Max.   :59.76   Max.   :363.7   Max.   :170.0  
##                                                                               
##    eve.charge      night.mins     night.calls      night.charge   
##  Min.   : 0.00   Min.   :  0.0   Min.   :  0.00   Min.   : 0.000  
##  1st Qu.:14.14   1st Qu.:166.9   1st Qu.: 87.00   1st Qu.: 7.510  
##  Median :17.09   Median :200.4   Median :100.00   Median : 9.020  
##  Mean   :17.05   Mean   :200.4   Mean   : 99.92   Mean   : 9.018  
##  3rd Qu.:19.90   3rd Qu.:234.7   3rd Qu.:113.00   3rd Qu.:10.560  
##  Max.   :30.91   Max.   :395.0   Max.   :175.00   Max.   :17.770  
##                                                                   
##  customer.calls churn     
##  Min.   :0.00   yes: 707  
##  1st Qu.:1.00   no :4293  
##  Median :1.00             
##  Mean   :1.57             
##  3rd Qu.:2.00             
##  Max.   :9.00             
##

a) Given the classification imbalance in churn status, describe how you would create a training and testing set.

table(churn$churn) # 707 yes, 4293 no

## 
##  yes   no 
##  707 4293

plot(churn$churn, main = "Churn Distribution", ylab = "Frequency")

# Separate the 'yes' and 'no' classes
churn_yes <- churn[churn$churn == "yes", ]
churn_no <- churn[churn$churn == "no", ]

# Undersample the 'no' class to match the size of the 'yes' class
set.seed(123456)
num_yes <- nrow(churn_yes)
undersampled_indices <- sample(nrow(churn_no), size = num_yes)
churn_no_undersampled <- churn_no[undersampled_indices, ]

# Combine the undersampled 'no' class with the 'yes' class
churn_balanced <- rbind(churn_yes, churn_no_undersampled)

# Shuffle the balanced dataset
shuffled_indices <- sample(nrow(churn_balanced))
churn_balanced <- churn_balanced[shuffled_indices, ]

plot(churn_balanced$churn, main = "Churn Distribution", ylab = "Frequency")

In this case, I would undersample the “no” class to match the size of the “yes” class since the imbalance is too severe. Moreover, the data is large enough to not worry too much about data loss. After the undersampling, you can see both classes of churn have the same frequency.

b) Which classification statistic would you choose to optimize for this problem (i.e., simple random or createDataPartition) and why?

In general, I would use createDataPartition to ensure that the proportion of classes in the training and testing sets is similar to the original dataset and reduce bias towards the majority “no” class. However, in my code above, I already undersampled such that I can use simple random.

c) Split the data into a training and a testing set, pre-process the data, and build classification models described in this chapter for the classification of churn. Of the models that you considered, which performs best on these data? and what is the optimal performance?

summary(churn_balanced)

##      state              area.code   account.length  voice.plan voice.messages  
##  NJ     :  44   area_code_408:346   Min.   :  1.0   yes: 310   Min.   : 0.000  
##  WV     :  40   area_code_415:698   1st Qu.: 74.0   no :1104   1st Qu.: 0.000  
##  MN     :  39   area_code_510:370   Median :101.0              Median : 0.000  
##  MD     :  38                       Mean   :101.3              Mean   : 6.562  
##  ME     :  38                       3rd Qu.:127.0              3rd Qu.: 0.000  
##  MI     :  38                       Max.   :225.0              Max.   :50.000  
##  (Other):1177                                                                  
##  intl.plan    intl.mins        intl.calls      intl.charge       day.mins    
##  yes: 244   Min.   : 0.000   Min.   : 0.000   Min.   :0.000   Min.   :  0.0  
##  no :1170   1st Qu.: 8.625   1st Qu.: 3.000   1st Qu.:2.328   1st Qu.:144.3  
##             Median :10.500   Median : 4.000   Median :2.840   Median :189.3  
##             Mean   :10.482   Mean   : 4.318   Mean   :2.830   Mean   :191.0  
##             3rd Qu.:12.300   3rd Qu.: 6.000   3rd Qu.:3.320   3rd Qu.:236.7  
##             Max.   :20.000   Max.   :20.000   Max.   :5.400   Max.   :351.5  
##                                                                              
##    day.calls       day.charge       eve.mins       eve.calls     eve.charge   
##  Min.   :  0.0   Min.   : 0.00   Min.   : 31.2   Min.   : 12   Min.   : 2.65  
##  1st Qu.: 87.0   1st Qu.:24.54   1st Qu.:169.1   1st Qu.: 87   1st Qu.:14.37  
##  Median :101.0   Median :32.19   Median :205.7   Median :100   Median :17.48  
##  Mean   :100.4   Mean   :32.48   Mean   :205.2   Mean   :100   Mean   :17.44  
##  3rd Qu.:114.0   3rd Qu.:40.23   3rd Qu.:241.6   3rd Qu.:114   3rd Qu.:20.54  
##  Max.   :165.0   Max.   :59.76   Max.   :363.7   Max.   :170   Max.   :30.91  
##                                                                               
##    night.mins     night.calls     night.charge   customer.calls  churn    
##  Min.   : 46.7   Min.   : 42.0   Min.   : 2.10   Min.   :0.000   yes:707  
##  1st Qu.:168.6   1st Qu.: 86.0   1st Qu.: 7.59   1st Qu.:1.000   no :707  
##  Median :203.3   Median :100.0   Median : 9.15   Median :1.000            
##  Mean   :202.4   Mean   : 99.6   Mean   : 9.11   Mean   :1.852            
##  3rd Qu.:236.3   3rd Qu.:114.0   3rd Qu.:10.63   3rd Qu.:3.000            
##  Max.   :395.0   Max.   :158.0   Max.   :17.77   Max.   :9.000            
##

nearZeroVar(churn_balanced) # Column 5 voice.messages is near zero variance

## [1] 5

names(churn_balanced[nearZeroVar(churn_balanced)])

## [1] "voice.messages"

table(churn_balanced$voice.messages) # Mainly 0 voice messages

## 
##    0    9   12   14   15   16   17   18   19   20   21   22   23   24   25   26 
## 1104    1    2    1    6    2    5    5    5    7    8   13   14   11    8   18 
##   27   28   29   30   31   32   33   34   35   36   37   38   39   40   41   42 
##   15   14   19   10   18   13   16   12   14   12    6    8    8    6    7   10 
##   43   44   45   46   47   48   49   50 
##    5    2    2    2    1    2    1    1

glm.fit <- glm(churn ~ . - voice.messages, data = churn_balanced, family = binomial)
summary(glm.fit)

## 
## Call:
## glm(formula = churn ~ . - voice.messages, family = binomial, 
##     data = churn_balanced)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             5.106e+00  1.089e+00   4.687 2.78e-06 ***
## stateAL                 3.228e-01  8.315e-01   0.388   0.6978    
## stateAR                 3.501e-02  8.486e-01   0.041   0.9671    
## stateAZ                -9.691e-01  9.231e-01  -1.050   0.2938    
## stateCA                -1.321e+00  8.914e-01  -1.482   0.1383    
## stateCO                 1.576e-01  8.737e-01   0.180   0.8569    
## stateCT                -1.026e+00  8.665e-01  -1.184   0.2363    
## stateDC                -6.236e-02  8.938e-01  -0.070   0.9444    
## stateDE                -2.496e-01  8.345e-01  -0.299   0.7649    
## stateFL                -6.766e-01  8.445e-01  -0.801   0.4230    
## stateGA                 5.318e-02  8.932e-01   0.060   0.9525    
## stateHI                 7.064e-01  9.743e-01   0.725   0.4684    
## stateIA                -3.151e-01  9.111e-01  -0.346   0.7295    
## stateID                -3.717e-01  8.462e-01  -0.439   0.6605    
## stateIL                 6.453e-01  9.187e-01   0.702   0.4824    
## stateIN                -2.118e-01  8.478e-01  -0.250   0.8027    
## stateKS                -4.036e-01  8.292e-01  -0.487   0.6264    
## stateKY                -8.291e-01  8.451e-01  -0.981   0.3266    
## stateLA                -2.876e-02  8.705e-01  -0.033   0.9736    
## stateMA                -5.716e-01  8.431e-01  -0.678   0.4978    
## stateMD                -2.117e-01  8.152e-01  -0.260   0.7951    
## stateME                -2.887e-01  8.070e-01  -0.358   0.7206    
## stateMI                -3.663e-01  8.172e-01  -0.448   0.6540    
## stateMN                -5.303e-01  8.096e-01  -0.655   0.5125    
## stateMO                 8.919e-02  8.547e-01   0.104   0.9169    
## stateMS                -4.522e-01  8.253e-01  -0.548   0.5837    
## stateMT                -1.550e+00  8.233e-01  -1.882   0.0598 .  
## stateNC                 6.821e-03  8.786e-01   0.008   0.9938    
## stateND                -8.695e-01  9.918e-01  -0.877   0.3806    
## stateNE                 2.954e-01  9.041e-01   0.327   0.7438    
## stateNH                -7.718e-02  8.688e-01  -0.089   0.9292    
## stateNJ                -7.022e-01  7.986e-01  -0.879   0.3792    
## stateNM                -4.304e-01  8.828e-01  -0.487   0.6259    
## stateNV                -4.383e-01  8.428e-01  -0.520   0.6031    
## stateNY                -6.847e-01  8.310e-01  -0.824   0.4100    
## stateOH                -2.341e-01  8.299e-01  -0.282   0.7779    
## stateOK                -6.087e-01  8.399e-01  -0.725   0.4686    
## stateOR                -6.172e-01  8.163e-01  -0.756   0.4496    
## statePA                 7.735e-02  9.040e-01   0.086   0.9318    
## stateRI                 5.280e-01  9.432e-01   0.560   0.5756    
## stateSC                -1.459e+00  8.957e-01  -1.629   0.1033    
## stateSD                -8.426e-01  8.801e-01  -0.957   0.3383    
## stateTN                -6.378e-01  8.662e-01  -0.736   0.4615    
## stateTX                -1.416e+00  8.397e-01  -1.686   0.0918 .  
## stateUT                -1.102e+00  8.460e-01  -1.303   0.1927    
## stateVA                 1.161e+00  9.083e-01   1.279   0.2010    
## stateVT                 2.139e-01  8.881e-01   0.241   0.8097    
## stateWA                -8.977e-01  8.253e-01  -1.088   0.2767    
## stateWI                -6.768e-01  9.020e-01  -0.750   0.4530    
## stateWV                -9.425e-01  8.041e-01  -1.172   0.2412    
## stateWY                 5.969e-01  8.537e-01   0.699   0.4844    
## area.codearea_code_415  2.544e-01  1.689e-01   1.506   0.1320    
## area.codearea_code_510  2.122e-01  1.923e-01   1.104   0.2698    
## account.length         -2.066e-03  1.716e-03  -1.204   0.2287    
## voice.planno           -1.029e+00  1.735e-01  -5.930 3.02e-09 ***
## intl.planno             2.608e+00  2.209e-01  11.808  < 2e-16 ***
## intl.mins               4.943e+00  6.362e+00   0.777   0.4371    
## intl.calls              5.172e-02  2.765e-02   1.870   0.0614 .  
## intl.charge            -1.850e+01  2.356e+01  -0.785   0.4323    
## day.mins               -6.352e+00  4.042e+00  -1.571   0.1161    
## day.calls              -1.327e-03  3.360e-03  -0.395   0.6929    
## day.charge              3.729e+01  2.378e+01   1.568   0.1168    
## eve.mins                8.947e-01  2.033e+00   0.440   0.6599    
## eve.calls               7.300e-04  3.375e-03   0.216   0.8287    
## eve.charge             -1.061e+01  2.392e+01  -0.444   0.6573    
## night.mins              8.127e-01  1.046e+00   0.777   0.4371    
## night.calls             9.492e-04  3.506e-03   0.271   0.7866    
## night.charge           -1.813e+01  2.324e+01  -0.780   0.4352    
## customer.calls         -6.023e-01  4.935e-02 -12.204  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1960.2  on 1413  degrees of freedom
## Residual deviance: 1386.0  on 1345  degrees of freedom
## AIC: 1524
## 
## Number of Fisher Scoring iterations: 5

glm.probs <- predict(glm.fit, type = "response")  
glm.pred <- ifelse(glm.probs > 0.5, "yes", "no")
conf_matrix <- table(glm.pred, churn_balanced$churn)
print(conf_matrix)

##         
## glm.pred yes  no
##      no  553 157
##      yes 154 550

accuracy <- mean(glm.pred == churn_balanced$churn)
print(accuracy) # 0.220

## [1] 0.2199434

glm.fit <- glm(churn ~ voice.plan + intl.plan + customer.calls, data = churn_balanced, family = binomial)
summary(glm.fit)

## 
## Call:
## glm(formula = churn ~ voice.plan + intl.plan + customer.calls, 
##     family = binomial, data = churn_balanced)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.10656    0.20700  -0.515    0.607    
## voice.planno   -1.13659    0.15295  -7.431 1.07e-13 ***
## intl.planno     2.09259    0.18863  11.093  < 2e-16 ***
## customer.calls -0.42161    0.04089 -10.311  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1960.2  on 1413  degrees of freedom
## Residual deviance: 1657.4  on 1410  degrees of freedom
## AIC: 1665.4
## 
## Number of Fisher Scoring iterations: 4

glm.probs <- predict(glm.fit, type = "response")  
glm.pred <- ifelse(glm.probs > 0.5, "yes", "no")
conf_matrix <- table(glm.pred, churn_balanced$churn)
print(conf_matrix)

##         
## glm.pred yes  no
##      no  401 145
##      yes 306 562

accuracy <- mean(glm.pred == churn_balanced$churn)
print(accuracy) # 0.319

## [1] 0.3189533

mod.sel <- regsubsets(churn ~ . -voice.messages, data = churn_balanced, really.big=60)
reg.summary <- summary(mod.sel)
reg.summary$outmat

##          stateAL stateAR stateAZ stateCA stateCO stateCT stateDC stateDE
## 1  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 2  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 3  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 4  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 5  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 6  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 7  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 8  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
##          stateFL stateGA stateHI stateIA stateID stateIL stateIN stateKS
## 1  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 2  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 3  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 4  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 5  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 6  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 7  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 8  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
##          stateKY stateLA stateMA stateMD stateME stateMI stateMN stateMO
## 1  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 2  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 3  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 4  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 5  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 6  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 7  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 8  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
##          stateMS stateMT stateNC stateND stateNE stateNH stateNJ stateNM
## 1  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 2  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 3  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 4  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 5  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 6  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 7  ( 1 ) " "     "*"     " "     " "     " "     " "     " "     " "    
## 8  ( 1 ) " "     "*"     " "     " "     " "     " "     " "     " "    
##          stateNV stateNY stateOH stateOK stateOR statePA stateRI stateSC
## 1  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 2  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 3  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 4  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 5  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 6  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 7  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 8  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
##          stateSD stateTN stateTX stateUT stateVA stateVT stateWA stateWI
## 1  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 2  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 3  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 4  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 5  ( 1 ) " "     " "     " "     " "     " "     " "     " "     " "    
## 6  ( 1 ) " "     " "     " "     " "     "*"     " "     " "     " "    
## 7  ( 1 ) " "     " "     " "     " "     "*"     " "     " "     " "    
## 8  ( 1 ) " "     " "     " "     " "     "*"     " "     " "     " "    
##          stateWV stateWY area.codearea_code_415 area.codearea_code_510
## 1  ( 1 ) " "     " "     " "                    " "                   
## 2  ( 1 ) " "     " "     " "                    " "                   
## 3  ( 1 ) " "     " "     " "                    " "                   
## 4  ( 1 ) " "     " "     " "                    " "                   
## 5  ( 1 ) " "     " "     " "                    " "                   
## 6  ( 1 ) " "     " "     " "                    " "                   
## 7  ( 1 ) " "     " "     " "                    " "                   
## 8  ( 1 ) " "     " "     " "                    " "                   
##          account.length voice.planno intl.planno intl.mins intl.calls
## 1  ( 1 ) " "            " "          "*"         " "       " "       
## 2  ( 1 ) " "            " "          " "         " "       " "       
## 3  ( 1 ) " "            " "          "*"         " "       " "       
## 4  ( 1 ) " "            "*"          "*"         " "       " "       
## 5  ( 1 ) " "            "*"          "*"         " "       " "       
## 6  ( 1 ) " "            "*"          "*"         " "       " "       
## 7  ( 1 ) " "            "*"          "*"         " "       " "       
## 8  ( 1 ) " "            "*"          "*"         " "       " "       
##          intl.charge day.mins day.calls day.charge eve.mins eve.calls
## 1  ( 1 ) " "         " "      " "       " "        " "      " "      
## 2  ( 1 ) " "         "*"      " "       " "        " "      " "      
## 3  ( 1 ) " "         "*"      " "       " "        " "      " "      
## 4  ( 1 ) " "         "*"      " "       " "        " "      " "      
## 5  ( 1 ) " "         "*"      " "       " "        " "      " "      
## 6  ( 1 ) " "         "*"      " "       " "        " "      " "      
## 7  ( 1 ) " "         "*"      " "       " "        " "      " "      
## 8  ( 1 ) " "         "*"      " "       " "        " "      " "      
##          eve.charge night.mins night.calls night.charge customer.calls
## 1  ( 1 ) " "        " "        " "         " "          " "           
## 2  ( 1 ) " "        " "        " "         " "          "*"           
## 3  ( 1 ) " "        " "        " "         " "          "*"           
## 4  ( 1 ) " "        " "        " "         " "          "*"           
## 5  ( 1 ) "*"        " "        " "         " "          "*"           
## 6  ( 1 ) "*"        " "        " "         " "          "*"           
## 7  ( 1 ) "*"        " "        " "         " "          "*"           
## 8  ( 1 ) "*"        " "        " "         "*"          "*"

cat(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)]) # 8

## 8 -433.6823

cat(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)]) # 8

## 8 18.0739

cat(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)]) # 8

## 8 0.2933364

# voice.plan + intl.plan + day.mins + eve.charge + night.charge + customer.calls

glm.fit <- glm(churn ~ voice.plan + intl.plan + day.mins + eve.charge + night.charge + customer.calls, data = churn_balanced, family = binomial)
summary(glm.fit)

## 
## Call:
## glm(formula = churn ~ voice.plan + intl.plan + day.mins + eve.charge + 
##     night.charge + customer.calls, family = binomial, data = churn_balanced)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     4.207976   0.478986   8.785  < 2e-16 ***
## voice.planno   -0.970971   0.164085  -5.917 3.27e-09 ***
## intl.planno     2.419194   0.203904  11.864  < 2e-16 ***
## day.mins       -0.012459   0.001142 -10.912  < 2e-16 ***
## eve.charge     -0.080446   0.014826  -5.426 5.76e-08 ***
## night.charge   -0.066164   0.028932  -2.287   0.0222 *  
## customer.calls -0.590062   0.046846 -12.596  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1960.2  on 1413  degrees of freedom
## Residual deviance: 1471.3  on 1407  degrees of freedom
## AIC: 1485.3
## 
## Number of Fisher Scoring iterations: 4

glm.probs <- predict(glm.fit, type = "response")  
glm.pred <- ifelse(glm.probs > 0.5, "yes", "no")
conf_matrix <- table(glm.pred, churn_balanced$churn)
print(conf_matrix)

##         
## glm.pred yes  no
##      no  548 169
##      yes 159 538

accuracy <- mean(glm.pred == churn_balanced$churn)
print(accuracy) # 0.23 without state and 0.22 with state

## [1] 0.2319661

train_size <- floor(0.70 * nrow(churn_balanced))
train_index <- sample(seq_len(nrow(churn_balanced)), size = train_size)
train <- churn_balanced[train_index, ]
test <- churn_balanced[-train_index, ]
table(train$churn) # 497 yes, 492 no --> balanced

## 
## yes  no 
## 497 492

table(test$churn) # 210 yes, 215 no --> balanced

## 
## yes  no 
## 210 215

plot(train$churn, main = "Train Churn Distribution", ylab = "Frequency")

plot(test$churn, main = "Test Churn Distribution", ylab = "Frequency")

logit.fit <- glm(churn ~ voice.plan + intl.plan + customer.calls, data = train, family = binomial)
logit.probs <- predict(logit.fit, newdata = test, type = "response")
logit.pred <- ifelse(logit.probs > 0.5, "yes", "no")
conf_matrix <- table(logit.pred, test$churn)
print(conf_matrix)

##           
## logit.pred yes  no
##        no  111  37
##        yes  99 178

accuracy <- mean(logit.pred == test$churn)
print(accuracy)  # 0.32

## [1] 0.32

lda.fit <- lda(churn ~ voice.plan + intl.plan + customer.calls, data = train)
lda.pred <- predict(lda.fit, newdata = test)
conf_matrix <- table(lda.pred$class, test$churn)  
print(conf_matrix)

##      
##       yes  no
##   yes 111  37
##   no   99 178

accuracy <- mean(lda.pred$class == test$churn)  
print(accuracy) # 0.68

## [1] 0.68

nb.fit <- naiveBayes(churn ~ voice.plan + intl.plan + customer.calls, data = train)
nb.pred <- predict(nb.fit, newdata = test)
conf_matrix <- table(nb.pred, test$churn)
print(conf_matrix)

##        
## nb.pred yes  no
##     yes 113  38
##     no   97 177

accuracy <- mean(nb.pred == test$churn)
print(accuracy) # 0.682

## [1] 0.6823529

train$voice.plan <- as.numeric(as.factor(train$voice.plan))
train$intl.plan <- as.numeric(as.factor(train$intl.plan))
test$voice.plan <- as.numeric(as.factor(test$voice.plan))
test$intl.plan <- as.numeric(as.factor(test$intl.plan))
train_X <- as.matrix(train[, c("voice.plan", "intl.plan", "customer.calls")])
test_X <- as.matrix(test[, c("voice.plan", "intl.plan", "customer.calls")])
train_Y <- train$churn
test_Y <- test$churn
for (k in 1:10) {
  knn.pred <- knn(train_X, test_X, train_Y, k = k)
  conf_matrix <- table(knn.pred, test_Y)
  print(paste("Confusion Matrix for k =", k))
  print(conf_matrix)
  accuracy <- mean(knn.pred == test_Y)
  print(paste("Accuracy for k =", k, ":", accuracy)) # 0.708
}

## [1] "Confusion Matrix for k = 1"
##         test_Y
## knn.pred yes  no
##      yes 104  18
##      no  106 197
## [1] "Accuracy for k = 1 : 0.708235294117647"
## [1] "Confusion Matrix for k = 2"
##         test_Y
## knn.pred yes  no
##      yes 104  18
##      no  106 197
## [1] "Accuracy for k = 2 : 0.708235294117647"
## [1] "Confusion Matrix for k = 3"
##         test_Y
## knn.pred yes  no
##      yes 104  18
##      no  106 197
## [1] "Accuracy for k = 3 : 0.708235294117647"
## [1] "Confusion Matrix for k = 4"
##         test_Y
## knn.pred yes  no
##      yes 104  18
##      no  106 197
## [1] "Accuracy for k = 4 : 0.708235294117647"
## [1] "Confusion Matrix for k = 5"
##         test_Y
## knn.pred yes  no
##      yes 104  18
##      no  106 197
## [1] "Accuracy for k = 5 : 0.708235294117647"
## [1] "Confusion Matrix for k = 6"
##         test_Y
## knn.pred yes  no
##      yes 104  18
##      no  106 197
## [1] "Accuracy for k = 6 : 0.708235294117647"
## [1] "Confusion Matrix for k = 7"
##         test_Y
## knn.pred yes  no
##      yes 104  18
##      no  106 197
## [1] "Accuracy for k = 7 : 0.708235294117647"
## [1] "Confusion Matrix for k = 8"
##         test_Y
## knn.pred yes  no
##      yes 104  18
##      no  106 197
## [1] "Accuracy for k = 8 : 0.708235294117647"
## [1] "Confusion Matrix for k = 9"
##         test_Y
## knn.pred yes  no
##      yes 104  18
##      no  106 197
## [1] "Accuracy for k = 9 : 0.708235294117647"
## [1] "Confusion Matrix for k = 10"
##         test_Y
## knn.pred yes  no
##      yes 104  18
##      no  106 197
## [1] "Accuracy for k = 10 : 0.708235294117647"

I preprocessed the data undersampling the “no” class in churn. After removing a near zero variance predictors, I used predictor selection techniques to land on using these three predictors: voice.plan, intl.plan, and customer.calls. My train/test split was a simple random 70/30 split due to the undersampling procedure. As shown by the plots, each class is equally represented in the train and test splits. I assessed logistic regression, LDA, naive Bayes’, and KNN. For all k’s, KNN performed the best with accuracy of 0.708.