Let \(Y\) represent whether a given stock will issue a dividend this year (“Yes” or “No”) and \(X\) represent last year’s percent profit.
\(X_{Yes} \sim \mathcal{N}(10, 36)\) and \(X_{No} \sim \mathcal{N}(0, 36)\)
\(\pi_k\) represents the prior probability of class \(k\).
\(f_k\) represents the probability density function for \(X\) in class \(k\).
\(p_k\) represents the posterior probability of class \(k\).
\[\begin{align} p_k(Y=k|X=x) &= \frac{P(X=x|Y=k)P(Y=k)}{P(X=x)} &\text{Bayes' Theorem} \\ &= \frac{P(X=x|Y=k)P(Y=k)}{\sum_{i=1}^{k} P(X=x|Y=l)P(Y=l)} \\ &= \frac{\pi_k f_k(x)}{\sum_{l=1}^{k} \pi_l f_l(x)} \\ &= \frac{\pi_k f_k(x)}{\pi_{Yes} f_{Yes}(x) + \pi_{No} f_{No}(x)} \\ &= \frac{\pi_k \frac{1}{\sqrt{2\pi\sigma_k^2}} e^{-(x-\mu_k)^2/2\sigma_k^2}} {\pi_{Yes}\frac{1}{\sqrt{2\pi\sigma_{Yes}^2}} e^{-(x-\mu_{Yes})^2/2\sigma_{Yes}^2}+ \pi_{No}\frac{1}{\sqrt{2\pi\sigma_{No}^2}} e^{-(x-\mu_{No})^2/2\sigma_{No}^2}} &\text{Substitute the equation for $f_k(x)$} \\ &= \frac{\pi_k e^{-(x-\mu_k)^2/2\sigma^2}}{\pi_{Yes}e^{-(x-\mu_{Yes})^2/2\sigma^2}+\pi_{No}e^{-(x-\mu_{No})^2/2\sigma^2}} &\text{Cross out terms given $\sigma_{Yes}=\sigma_{No}$} \\ \\ p_{Yes}(Y=Yes|X=4) &= \frac{0.8e^{-(4-10)^2/(2\times 6^2)}}{0.8e^{-(4-10)^2/(2\times 6^2)} + 0.2e^{-(4-0)^2/(2\times 6^2)}} \\ &= \frac{0.8e^{-1/2}}{0.8e^{-1/2}+0.2e^{-2/9}} \\ &= 0.752 \end{align}\]
library(ISLR2)
library(corrplot)
## corrplot 0.94 loaded
data(Weekly)
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
cor(Weekly[,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
pairs(Weekly)
corrplot(cor(Weekly[, -9]), method = "color")
plot(Weekly$Year, Weekly$Volume, xlab = "Year", ylab = "Volume")
The Weekly data set from ISLR2 shows the weekly percentage returns for the S&P 500 stock index between 1990 and 2010. Year refers to the year that the observation was recorded. Lag1-5 refers to the percentage return for the 1-5 weeks prior. Volume refers to the average number of daily shares traded in billions. Today refers to the percentage return for this week. Direction is a factor with levels Down and Up, indicating whether the market had a positive or negative return on a given week. There appears to be a strong positive correlation between Year and Volume.
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Lag2 appears statistically significant assuming significant level of 0.05, with p value of 0.0296<0.05. The other predictors all have p values >0.05, but Lag1 is the second best.
glm.probs <- predict(glm.fit, type = "response") # Get predicted probabilities
glm.pred <- ifelse(glm.probs > 0.5, "Up", "Down") # Classify as "Up" or "Down"
conf_matrix <- table(glm.pred, Weekly$Direction) # Compare predictions to actual values
# Extract values from the confusion matrix
TN <- conf_matrix[1, 1] # True Negatives
FP <- conf_matrix[2, 1] # False Positives
FN <- conf_matrix[1, 2] # False Negatives
TP <- conf_matrix[2, 2] # True Positives
# Calculate metrics
accuracy <- mean(glm.pred == Weekly$Direction) # Calculate accuracy
sensitivity <- TP / (TP + FN) # Sensitivity (True Positive Rate)
specificity <- TN / (TN + FP) # Specificity (True Negative Rate)
type_I_error <- FP / (FP + TN) # Type I Error (False Positive Rate)
type_II_error <- FN / (FN + TP) # Type II Error (False Negative Rate)
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.5610652
cat("Sensitivity:", sensitivity, "\n")
## Sensitivity: 0.9206612
cat("Specificity:", specificity, "\n")
## Specificity: 0.1115702
cat("Type I Error:", type_I_error, "\n")
## Type I Error: 0.8884298
cat("Type II Error:", type_II_error, "\n")
## Type II Error: 0.07933884
The accuracy is only 56.1% with a high sensitivity of 0.92 and a low specificity of 0.11. The type 1 error is high at 0.888 and type 2 error is low at 0.079. The model overwhelmingly predicts weekly direction of “Up” even when the market actually went “Down”.
train <- Weekly[Weekly$Year < 2009, ] # Data from 1990 to 2008
test <- Weekly[Weekly$Year >= 2009, ] # Data from 2009 and 2010
logit.fit <- glm(Direction ~ Lag2, data = train, family = binomial)
logit.probs <- predict(logit.fit, newdata = test, type = "response")
logit.pred <- ifelse(logit.probs > 0.5, "Up", "Down")
conf_matrix <- table(logit.pred, test$Direction)
print(conf_matrix)
##
## logit.pred Down Up
## Down 9 5
## Up 34 56
accuracy <- mean(logit.pred == test$Direction)
print(accuracy)
## [1] 0.625
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
lda.fit <- lda(Direction ~ Lag2, data = train)
lda.pred <- predict(lda.fit, newdata = test)
conf_matrix <- table(lda.pred$class, test$Direction)
print(conf_matrix)
##
## Down Up
## Down 9 5
## Up 34 56
accuracy <- mean(lda.pred$class == test$Direction)
print(accuracy)
## [1] 0.625
qda.fit <- qda(Direction ~ Lag2, data = train)
qda.pred <- predict(qda.fit, newdata = test)
conf_matrix <- table(qda.pred$class, test$Direction)
print(conf_matrix)
##
## Down Up
## Down 0 0
## Up 43 61
accuracy <- mean(qda.pred$class == test$Direction)
print(accuracy)
## [1] 0.5865385
library(class)
train_X <- as.matrix(train$Lag2) # predictor for training data
test_X <- as.matrix(test$Lag2) # predictor for test data
train_Y <- train$Direction # response for training data
test_Y <- test$Direction # response for test data
knn.pred <- knn(train_X, test_X, train_Y, k = 1)
conf_matrix <- table(knn.pred, test_Y)
print(conf_matrix)
## test_Y
## knn.pred Down Up
## Down 21 30
## Up 22 31
accuracy <- mean(knn.pred == test_Y)
print(accuracy)
## [1] 0.5
library(e1071)
nb.fit <- naiveBayes(Direction ~ Lag2, data = train)
nb.pred <- predict(nb.fit, newdata = test)
conf_matrix <- table(nb.pred, test$Direction)
print(conf_matrix)
##
## nb.pred Down Up
## Down 0 0
## Up 43 61
accuracy <- mean(nb.pred == test$Direction)
print(accuracy)
## [1] 0.5865385
Logistic regression and LDA had accuracy of 62.5%, QDA and naive Bayes 58.65% (only predicting market “up”), and KNN with K=1 50%. Both logistic regression and LDA appear to provide the best results on this data, assuming the criteria is accuracy. They even have the same confusion matrix since they both involve linear decision boundaries.
library(leaps)
mod.sel <- regsubsets(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly)
reg.summary <- summary(mod.sel)
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
reg.summary$outmat
## Lag1 Lag2 Lag3 Lag4 Lag5 Volume
## 1 ( 1 ) " " "*" " " " " " " " "
## 2 ( 1 ) "*" "*" " " " " " " " "
## 3 ( 1 ) "*" "*" " " "*" " " " "
## 4 ( 1 ) "*" "*" "*" "*" " " " "
## 5 ( 1 ) "*" "*" "*" "*" " " "*"
## 6 ( 1 ) "*" "*" "*" "*" "*" "*"
cat(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)]) # 1
## 1 8.215667
cat(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)]) # 2
## 2 0.8082199
cat(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)]) # 2
## 2 0.005453532
logit.fit <- glm(Direction ~ Lag1 + Lag2, data = train, family = binomial)
logit.probs <- predict(logit.fit, newdata = test, type = "response")
logit.pred <- ifelse(logit.probs > 0.5, "Up", "Down")
conf_matrix <- table(logit.pred, test$Direction)
print(conf_matrix)
##
## logit.pred Down Up
## Down 7 8
## Up 36 53
accuracy <- mean(logit.pred == test$Direction)
print(accuracy) # 0.577
## [1] 0.5769231
logit.fit <- glm(Direction ~ Lag1 + Lag2 + Lag1:Lag2, data = train, family = binomial)
logit.probs <- predict(logit.fit, newdata = test, type = "response")
logit.pred <- ifelse(logit.probs > 0.5, "Up", "Down")
conf_matrix <- table(logit.pred, test$Direction)
print(conf_matrix)
##
## logit.pred Down Up
## Down 7 8
## Up 36 53
accuracy <- mean(logit.pred == test$Direction)
print(accuracy) # 0.577
## [1] 0.5769231
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = train)
lda.pred <- predict(lda.fit, newdata = test)
conf_matrix <- table(lda.pred$class, test$Direction)
print(conf_matrix)
##
## Down Up
## Down 7 8
## Up 36 53
accuracy <- mean(lda.pred$class == test$Direction)
print(accuracy) # 0.577
## [1] 0.5769231
lda.fit <- lda(Direction ~ Lag1 + Lag2 + Lag1:Lag2, data = train)
lda.pred <- predict(lda.fit, newdata = test)
conf_matrix <- table(lda.pred$class, test$Direction)
print(conf_matrix)
##
## Down Up
## Down 7 8
## Up 36 53
accuracy <- mean(lda.pred$class == test$Direction)
print(accuracy) # 0.577
## [1] 0.5769231
qda.fit <- qda(Direction ~ Lag1 + Lag2, data = train)
qda.pred <- predict(qda.fit, newdata = test)
conf_matrix <- table(qda.pred$class, test$Direction)
print(conf_matrix)
##
## Down Up
## Down 7 10
## Up 36 51
accuracy <- mean(qda.pred$class == test$Direction)
print(accuracy) # 0.558
## [1] 0.5576923
qda.fit <- qda(Direction ~ Lag1 + Lag2 + Lag1:Lag2, data = train)
qda.pred <- predict(qda.fit, newdata = test)
conf_matrix <- table(qda.pred$class, test$Direction)
print(conf_matrix)
##
## Down Up
## Down 23 36
## Up 20 25
accuracy <- mean(qda.pred$class == test$Direction)
print(accuracy) # 0.462
## [1] 0.4615385
nb.fit <- naiveBayes(Direction ~ Lag1 + Lag2, data = train)
nb.pred <- predict(nb.fit, newdata = test)
conf_matrix <- table(nb.pred, test$Direction)
print(conf_matrix)
##
## nb.pred Down Up
## Down 3 8
## Up 40 53
accuracy <- mean(nb.pred == test$Direction)
print(accuracy) # 0.538
## [1] 0.5384615
for (k in 2:10) {
knn.pred <- knn(train_X, test_X, train_Y, k = k)
conf_matrix <- table(knn.pred, test_Y)
print(paste("Confusion Matrix for k =", k))
print(conf_matrix)
accuracy <- mean(knn.pred == test_Y)
print(paste("Accuracy for k =", k, ":", accuracy))
}
## [1] "Confusion Matrix for k = 2"
## test_Y
## knn.pred Down Up
## Down 26 23
## Up 17 38
## [1] "Accuracy for k = 2 : 0.615384615384615"
## [1] "Confusion Matrix for k = 3"
## test_Y
## knn.pred Down Up
## Down 16 20
## Up 27 41
## [1] "Accuracy for k = 3 : 0.548076923076923"
## [1] "Confusion Matrix for k = 4"
## test_Y
## knn.pred Down Up
## Down 22 21
## Up 21 40
## [1] "Accuracy for k = 4 : 0.596153846153846"
## [1] "Confusion Matrix for k = 5"
## test_Y
## knn.pred Down Up
## Down 16 21
## Up 27 40
## [1] "Accuracy for k = 5 : 0.538461538461538"
## [1] "Confusion Matrix for k = 6"
## test_Y
## knn.pred Down Up
## Down 18 21
## Up 25 40
## [1] "Accuracy for k = 6 : 0.557692307692308"
## [1] "Confusion Matrix for k = 7"
## test_Y
## knn.pred Down Up
## Down 15 19
## Up 28 42
## [1] "Accuracy for k = 7 : 0.548076923076923"
## [1] "Confusion Matrix for k = 8"
## test_Y
## knn.pred Down Up
## Down 15 19
## Up 28 42
## [1] "Accuracy for k = 8 : 0.548076923076923"
## [1] "Confusion Matrix for k = 9"
## test_Y
## knn.pred Down Up
## Down 17 21
## Up 26 40
## [1] "Accuracy for k = 9 : 0.548076923076923"
## [1] "Confusion Matrix for k = 10"
## test_Y
## knn.pred Down Up
## Down 17 22
## Up 26 39
## [1] "Accuracy for k = 10 : 0.538461538461538"
Logistic regression and LDA had accuracy of 62.5%, QDA and naive Bayes 58.65% (only predicting market “up”), and KNN with K=1 50%. With the same confusion matrix, both logistic regression and LDA appear to provide the best results on this data.
There are 5 methods to experiment with: logistic regression, LDA, QDA, naive Bayes, and KNN. There are no predictors with degenerate distributions. We do not include year, so there are no highly correlated predictors. Moreover, there is no multicollinearity. Based on BIC, just Lag2 alone is ideal. However, Cp and AdjR2 criterion include Lag1, which makes sense considering it has the second lowest p-value. Hence, we tried these combination of predictors: Lag1 + Lag2 and Lag1 + Lag2 + Lag1:Lag2. Moreover, we assessed KNN with k=2 to 10. Based on the results above, adding predictors only worsened the accuracy. However, increasing k to >1 led to improved testing accuracy, although not much significant improvements past k=2.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
data(Boston)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# Create a binary response variable indicating whether the crime rate is above or below the median
Boston$high_crim <- ifelse(Boston$crim > median(Boston$crim), 1, 0)
glm.fit <- glm(high_crim ~ . - crim, data = Boston, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = high_crim ~ . - crim, family = binomial, data = Boston)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -34.103704 6.530014 -5.223 1.76e-07 ***
## zn -0.079918 0.033731 -2.369 0.01782 *
## indus -0.059389 0.043722 -1.358 0.17436
## chas 0.785327 0.728930 1.077 0.28132
## nox 48.523782 7.396497 6.560 5.37e-11 ***
## rm -0.425596 0.701104 -0.607 0.54383
## age 0.022172 0.012221 1.814 0.06963 .
## dis 0.691400 0.218308 3.167 0.00154 **
## rad 0.656465 0.152452 4.306 1.66e-05 ***
## tax -0.006412 0.002689 -2.385 0.01709 *
## ptratio 0.368716 0.122136 3.019 0.00254 **
## black -0.013524 0.006536 -2.069 0.03853 *
## lstat 0.043862 0.048981 0.895 0.37052
## medv 0.167130 0.066940 2.497 0.01254 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 701.46 on 505 degrees of freedom
## Residual deviance: 211.93 on 492 degrees of freedom
## AIC: 239.93
##
## Number of Fisher Scoring iterations: 9
glm.probs <- predict(glm.fit, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, 1, 0) # Classify crime rate as "High" or "Low"
conf_matrix <- table(glm.pred, Boston$high_crim)
print(conf_matrix)
##
## glm.pred 0 1
## 0 234 24
## 1 19 229
accuracy <- mean(glm.pred == Boston$high_crim)
print(accuracy) #0.915
## [1] 0.9150198
# Train test split
set.seed(1234567)
train_size <- floor(0.70 * nrow(Boston))
train_index <- sample(seq_len(nrow(Boston)), size = train_size)
train <- Boston[train_index, ]
test <- Boston[-train_index, ]
logit.fit <- glm(high_crim ~ . - crim, data = train, family = binomial)
logit.probs <- predict(logit.fit, newdata = test, type = "response")
logit.pred <- ifelse(logit.probs > 0.5, 1, 0)
conf_matrix <- table(logit.pred, test$high_crim)
print(conf_matrix)
##
## logit.pred 0 1
## 0 66 8
## 1 3 75
accuracy <- mean(logit.pred == test$high_crim)
print(accuracy) # 0.928
## [1] 0.9276316
lda.fit <- lda(high_crim ~ . - crim, data = train)
lda.pred <- predict(lda.fit, newdata = test)
conf_matrix <- table(lda.pred$class, test$high_crim)
print(conf_matrix)
##
## 0 1
## 0 68 21
## 1 1 62
accuracy <- mean(lda.pred$class == test$high_crim)
print(accuracy) # 0.855
## [1] 0.8552632
nb.fit <- naiveBayes(high_crim ~ . - crim, data = train)
nb.pred <- predict(nb.fit, newdata = test)
conf_matrix <- table(nb.pred, test$high_crim)
print(conf_matrix)
##
## nb.pred 0 1
## 0 62 20
## 1 7 63
accuracy <- mean(nb.pred == test$high_crim)
print(accuracy) # 0.822
## [1] 0.8223684
train_X <- as.matrix(train[,2:15]) # excludes crim
test_X <- as.matrix(test[,2:15])
train_Y <- train$high_crim
test_Y <- test$high_crim
knn.pred <- knn(train_X, test_X, train_Y, k = 1)
conf_matrix <- table(knn.pred, test_Y)
print(conf_matrix)
## test_Y
## knn.pred 0 1
## 0 65 4
## 1 4 79
accuracy <- mean(knn.pred == test_Y)
print(accuracy) # 0.947
## [1] 0.9473684
nearZeroVar(Boston) # none
## integer(0)
library(car)
## Loading required package: carData
vif(glm.fit) # none >10
## zn indus chas nox rm age dis rad
## 1.872010 2.640170 1.233983 4.111013 5.734988 2.455666 3.876479 1.983825
## tax ptratio black lstat medv
## 1.932622 2.191848 1.107171 2.670299 8.088729
mod.sel <- regsubsets(high_crim ~ . -crim, data = Boston)
reg.summary <- summary(mod.sel)
reg.summary$outmat
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " "*" " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " "*" " " " " " " "*" " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " "*" " " "*" " " " " " " " " " "
## 4 ( 1 ) " " " " " " "*" " " "*" " " "*" " " " " " " " " "*"
## 5 ( 1 ) "*" " " " " "*" " " "*" " " "*" " " " " " " " " "*"
## 6 ( 1 ) "*" " " " " "*" " " "*" " " "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " "*" " " "*" " " "*" "*" " " "*"
## 8 ( 1 ) "*" " " " " "*" " " "*" " " "*" " " "*" "*" "*" "*"
cat(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)]) # 4
## 4 -430.8029
cat(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)]) # 6
## 6 4.123717
cat(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)]) # 7
## 7 0.598924
cor(Boston)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## high_crim 0.40939545 -0.43615103 0.60326017 0.070096774 0.72323480
## rm age dis rad tax
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593
## high_crim -0.15637178 0.61393992 -0.61634164 0.619786249 0.60874128
## ptratio black lstat medv high_crim
## crim 0.2899456 -0.38506394 0.4556215 -0.3883046 0.40939545
## zn -0.3916785 0.17552032 -0.4129946 0.3604453 -0.43615103
## indus 0.3832476 -0.35697654 0.6037997 -0.4837252 0.60326017
## chas -0.1215152 0.04878848 -0.0539293 0.1752602 0.07009677
## nox 0.1889327 -0.38005064 0.5908789 -0.4273208 0.72323480
## rm -0.3555015 0.12806864 -0.6138083 0.6953599 -0.15637178
## age 0.2615150 -0.27353398 0.6023385 -0.3769546 0.61393992
## dis -0.2324705 0.29151167 -0.4969958 0.2499287 -0.61634164
## rad 0.4647412 -0.44441282 0.4886763 -0.3816262 0.61978625
## tax 0.4608530 -0.44180801 0.5439934 -0.4685359 0.60874128
## ptratio 1.0000000 -0.17738330 0.3740443 -0.5077867 0.25356836
## black -0.1773833 1.00000000 -0.3660869 0.3334608 -0.35121093
## lstat 0.3740443 -0.36608690 1.0000000 -0.7376627 0.45326273
## medv -0.5077867 0.33346082 -0.7376627 1.0000000 -0.26301673
## high_crim 0.2535684 -0.35121093 0.4532627 -0.2630167 1.00000000
pairs(Boston)
corrplot(cor(Boston), method = "color")
highCorr <- findCorrelation(cor(Boston), cutoff = .75)
length(highCorr)
## [1] 3
colnames(Boston)[highCorr] # indus, tax, nox
## [1] "indus" "nox" "tax"
logit.fit <- glm(high_crim ~ zn + nox + dis + rad + tax + ptratio + black + medv, data = train, family = binomial)
logit.probs <- predict(logit.fit, newdata = test, type = "response")
logit.pred <- ifelse(logit.probs > 0.5, 1, 0)
conf_matrix <- table(logit.pred, test$high_crim)
print(conf_matrix)
##
## logit.pred 0 1
## 0 65 12
## 1 4 71
accuracy <- mean(logit.pred == test$high_crim)
print(accuracy) # 0.895
## [1] 0.8947368
logit.fit <- glm(high_crim ~ nox + rad + age + medv, data = train, family = binomial)
logit.probs <- predict(logit.fit, newdata = test, type = "response")
logit.pred <- ifelse(logit.probs > 0.5, 1, 0)
conf_matrix <- table(logit.pred, test$high_crim)
print(conf_matrix)
##
## logit.pred 0 1
## 0 63 18
## 1 6 65
accuracy <- mean(logit.pred == test$high_crim)
print(accuracy) # 0.842
## [1] 0.8421053
lda.fit <- lda(high_crim ~ zn + nox + dis + rad + tax + ptratio + black + medv, data = train)
lda.pred <- predict(lda.fit, newdata = test)
conf_matrix <- table(lda.pred$class, test$high_crim)
print(conf_matrix)
##
## 0 1
## 0 68 21
## 1 1 62
accuracy <- mean(lda.pred$class == test$high_crim)
print(accuracy) # 0.855
## [1] 0.8552632
lda.fit <- lda(high_crim ~ nox + rad + age + medv, data = train)
lda.pred <- predict(lda.fit, newdata = test)
conf_matrix <- table(lda.pred$class, test$high_crim)
print(conf_matrix)
##
## 0 1
## 0 68 21
## 1 1 62
accuracy <- mean(lda.pred$class == test$high_crim)
print(accuracy) # 0.855
## [1] 0.8552632
nb.fit <- naiveBayes(high_crim ~ zn + nox + dis + rad + tax + ptratio + black + medv, data = train)
nb.pred <- predict(nb.fit, newdata = test)
conf_matrix <- table(nb.pred, test$high_crim)
print(conf_matrix)
##
## nb.pred 0 1
## 0 65 20
## 1 4 63
accuracy <- mean(nb.pred == test$high_crim)
print(accuracy) # 0.842
## [1] 0.8421053
nb.fit <- naiveBayes(high_crim ~ nox + rad + age + medv, data = train)
nb.pred <- predict(nb.fit, newdata = test)
conf_matrix <- table(nb.pred, test$high_crim)
print(conf_matrix)
##
## nb.pred 0 1
## 0 65 21
## 1 4 62
accuracy <- mean(nb.pred == test$high_crim)
print(accuracy) # 0.836
## [1] 0.8355263
for (k in 2:10) {
knn.pred <- knn(train_X, test_X, train_Y, k = k)
conf_matrix <- table(knn.pred, test_Y)
print(paste("Confusion Matrix for k =", k))
print(conf_matrix)
accuracy <- mean(knn.pred == test_Y)
print(paste("Accuracy for k =", k, ":", accuracy))
}
## [1] "Confusion Matrix for k = 2"
## test_Y
## knn.pred 0 1
## 0 61 5
## 1 8 78
## [1] "Accuracy for k = 2 : 0.914473684210526"
## [1] "Confusion Matrix for k = 3"
## test_Y
## knn.pred 0 1
## 0 64 4
## 1 5 79
## [1] "Accuracy for k = 3 : 0.940789473684211"
## [1] "Confusion Matrix for k = 4"
## test_Y
## knn.pred 0 1
## 0 64 6
## 1 5 77
## [1] "Accuracy for k = 4 : 0.927631578947368"
## [1] "Confusion Matrix for k = 5"
## test_Y
## knn.pred 0 1
## 0 64 6
## 1 5 77
## [1] "Accuracy for k = 5 : 0.927631578947368"
## [1] "Confusion Matrix for k = 6"
## test_Y
## knn.pred 0 1
## 0 64 6
## 1 5 77
## [1] "Accuracy for k = 6 : 0.927631578947368"
## [1] "Confusion Matrix for k = 7"
## test_Y
## knn.pred 0 1
## 0 62 6
## 1 7 77
## [1] "Accuracy for k = 7 : 0.914473684210526"
## [1] "Confusion Matrix for k = 8"
## test_Y
## knn.pred 0 1
## 0 62 7
## 1 7 76
## [1] "Accuracy for k = 8 : 0.907894736842105"
## [1] "Confusion Matrix for k = 9"
## test_Y
## knn.pred 0 1
## 0 61 7
## 1 8 76
## [1] "Accuracy for k = 9 : 0.901315789473684"
## [1] "Confusion Matrix for k = 10"
## test_Y
## knn.pred 0 1
## 0 60 7
## 1 9 76
## [1] "Accuracy for k = 10 : 0.894736842105263"
train_X <- as.matrix(train[, c("zn", "nox", "dis", "rad", "tax", "ptratio", "black", "medv")])
test_X <- as.matrix(test[, c("zn", "nox", "dis", "rad", "tax", "ptratio", "black", "medv")])
for (k in 1:10) {
knn.pred <- knn(train_X, test_X, train_Y, k = k)
conf_matrix <- table(knn.pred, test_Y)
print(paste("Confusion Matrix for k =", k))
print(conf_matrix)
accuracy <- mean(knn.pred == test_Y)
print(paste("Accuracy for k =", k, ":", accuracy))
}
## [1] "Confusion Matrix for k = 1"
## test_Y
## knn.pred 0 1
## 0 65 5
## 1 4 78
## [1] "Accuracy for k = 1 : 0.940789473684211"
## [1] "Confusion Matrix for k = 2"
## test_Y
## knn.pred 0 1
## 0 65 4
## 1 4 79
## [1] "Accuracy for k = 2 : 0.947368421052632"
## [1] "Confusion Matrix for k = 3"
## test_Y
## knn.pred 0 1
## 0 65 4
## 1 4 79
## [1] "Accuracy for k = 3 : 0.947368421052632"
## [1] "Confusion Matrix for k = 4"
## test_Y
## knn.pred 0 1
## 0 66 5
## 1 3 78
## [1] "Accuracy for k = 4 : 0.947368421052632"
## [1] "Confusion Matrix for k = 5"
## test_Y
## knn.pred 0 1
## 0 65 5
## 1 4 78
## [1] "Accuracy for k = 5 : 0.940789473684211"
## [1] "Confusion Matrix for k = 6"
## test_Y
## knn.pred 0 1
## 0 63 6
## 1 6 77
## [1] "Accuracy for k = 6 : 0.921052631578947"
## [1] "Confusion Matrix for k = 7"
## test_Y
## knn.pred 0 1
## 0 62 4
## 1 7 79
## [1] "Accuracy for k = 7 : 0.927631578947368"
## [1] "Confusion Matrix for k = 8"
## test_Y
## knn.pred 0 1
## 0 60 4
## 1 9 79
## [1] "Accuracy for k = 8 : 0.914473684210526"
## [1] "Confusion Matrix for k = 9"
## test_Y
## knn.pred 0 1
## 0 60 4
## 1 9 79
## [1] "Accuracy for k = 9 : 0.914473684210526"
## [1] "Confusion Matrix for k = 10"
## test_Y
## knn.pred 0 1
## 0 60 7
## 1 9 76
## [1] "Accuracy for k = 10 : 0.894736842105263"
train_X <- as.matrix(train[, c("nox", "rad", "age", "medv")])
test_X <- as.matrix(test[, c("nox", "rad", "age", "medv")])
for (k in 1:10) {
knn.pred <- knn(train_X, test_X, train_Y, k = k)
conf_matrix <- table(knn.pred, test_Y)
print(paste("Confusion Matrix for k =", k))
print(conf_matrix)
accuracy <- mean(knn.pred == test_Y)
print(paste("Accuracy for k =", k, ":", accuracy))
}
## [1] "Confusion Matrix for k = 1"
## test_Y
## knn.pred 0 1
## 0 56 18
## 1 13 65
## [1] "Accuracy for k = 1 : 0.796052631578947"
## [1] "Confusion Matrix for k = 2"
## test_Y
## knn.pred 0 1
## 0 60 16
## 1 9 67
## [1] "Accuracy for k = 2 : 0.835526315789474"
## [1] "Confusion Matrix for k = 3"
## test_Y
## knn.pred 0 1
## 0 57 17
## 1 12 66
## [1] "Accuracy for k = 3 : 0.809210526315789"
## [1] "Confusion Matrix for k = 4"
## test_Y
## knn.pred 0 1
## 0 58 16
## 1 11 67
## [1] "Accuracy for k = 4 : 0.822368421052632"
## [1] "Confusion Matrix for k = 5"
## test_Y
## knn.pred 0 1
## 0 59 15
## 1 10 68
## [1] "Accuracy for k = 5 : 0.835526315789474"
## [1] "Confusion Matrix for k = 6"
## test_Y
## knn.pred 0 1
## 0 60 17
## 1 9 66
## [1] "Accuracy for k = 6 : 0.828947368421053"
## [1] "Confusion Matrix for k = 7"
## test_Y
## knn.pred 0 1
## 0 61 15
## 1 8 68
## [1] "Accuracy for k = 7 : 0.848684210526316"
## [1] "Confusion Matrix for k = 8"
## test_Y
## knn.pred 0 1
## 0 61 17
## 1 8 66
## [1] "Accuracy for k = 8 : 0.835526315789474"
## [1] "Confusion Matrix for k = 9"
## test_Y
## knn.pred 0 1
## 0 61 17
## 1 8 66
## [1] "Accuracy for k = 9 : 0.835526315789474"
## [1] "Confusion Matrix for k = 10"
## test_Y
## knn.pred 0 1
## 0 61 15
## 1 8 68
## [1] "Accuracy for k = 10 : 0.848684210526316"
There are 4 methods to experiment with: logistic regression, LDA, naive Bayes, and KNN. There are no predictors with degenerate distributions and no multicollinearity. Based on the logistic regression summary, the significant (p<0.05) predictors are zn, nox, dis, rad, tax, ptratio, black, and medv. 3 variables are highly correlated: indus, tax, and nox. Since nox is significant, we can exclude indus and tax with confidence. Based on BIC criterion, I will select nox, rad, age, and medv as a set of predictors. These are the final combinations: Set 1: zn + nox + dis + rad + tax + ptratio + black + medv Set 2: nox + rad + age + medv We will also assess KNN with k=2 to 10.
Based on the results above, using Set 1 predictors improved accuracy for naive Bayes and KNN, but Set 2 worsened accuracy for all models. Increasing k worsened results when using all predictors, improved up to a certain point for Set 1, and improved for Set 2.
library(liver)
##
## Attaching package: 'liver'
## The following object is masked from 'package:e1071':
##
## skewness
## The following object is masked from 'package:base':
##
## transform
data(churn)
str(churn)
## 'data.frame': 5000 obs. of 20 variables:
## $ state : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
## $ area.code : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
## $ account.length: int 128 107 137 84 75 118 121 147 117 141 ...
## $ voice.plan : Factor w/ 2 levels "yes","no": 1 1 2 2 2 2 1 2 2 1 ...
## $ voice.messages: int 25 26 0 0 0 0 24 0 0 37 ...
## $ intl.plan : Factor w/ 2 levels "yes","no": 2 2 2 1 1 1 2 1 2 1 ...
## $ intl.mins : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
## $ intl.calls : int 3 3 5 7 3 6 7 6 4 5 ...
## $ intl.charge : num 2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
## $ day.mins : num 265 162 243 299 167 ...
## $ day.calls : int 110 123 114 71 113 98 88 79 97 84 ...
## $ day.charge : num 45.1 27.5 41.4 50.9 28.3 ...
## $ eve.mins : num 197.4 195.5 121.2 61.9 148.3 ...
## $ eve.calls : int 99 103 110 88 122 101 108 94 80 111 ...
## $ eve.charge : num 16.78 16.62 10.3 5.26 12.61 ...
## $ night.mins : num 245 254 163 197 187 ...
## $ night.calls : int 91 103 104 89 121 118 118 96 90 97 ...
## $ night.charge : num 11.01 11.45 7.32 8.86 8.41 ...
## $ customer.calls: int 1 1 0 2 3 0 3 0 1 0 ...
## $ churn : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
summary(churn)
## state area.code account.length voice.plan
## WV : 158 area_code_408:1259 Min. : 1.0 yes:1323
## MN : 125 area_code_415:2495 1st Qu.: 73.0 no :3677
## AL : 124 area_code_510:1246 Median :100.0
## ID : 119 Mean :100.3
## VA : 118 3rd Qu.:127.0
## OH : 116 Max. :243.0
## (Other):4240
## voice.messages intl.plan intl.mins intl.calls intl.charge
## Min. : 0.000 yes: 473 Min. : 0.00 Min. : 0.000 Min. :0.000
## 1st Qu.: 0.000 no :4527 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300
## Median : 0.000 Median :10.30 Median : 4.000 Median :2.780
## Mean : 7.755 Mean :10.26 Mean : 4.435 Mean :2.771
## 3rd Qu.:17.000 3rd Qu.:12.00 3rd Qu.: 6.000 3rd Qu.:3.240
## Max. :52.000 Max. :20.00 Max. :20.000 Max. :5.400
##
## day.mins day.calls day.charge eve.mins eve.calls
## Min. : 0.0 Min. : 0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.:143.7 1st Qu.: 87 1st Qu.:24.43 1st Qu.:166.4 1st Qu.: 87.0
## Median :180.1 Median :100 Median :30.62 Median :201.0 Median :100.0
## Mean :180.3 Mean :100 Mean :30.65 Mean :200.6 Mean :100.2
## 3rd Qu.:216.2 3rd Qu.:113 3rd Qu.:36.75 3rd Qu.:234.1 3rd Qu.:114.0
## Max. :351.5 Max. :165 Max. :59.76 Max. :363.7 Max. :170.0
##
## eve.charge night.mins night.calls night.charge
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.000
## 1st Qu.:14.14 1st Qu.:166.9 1st Qu.: 87.00 1st Qu.: 7.510
## Median :17.09 Median :200.4 Median :100.00 Median : 9.020
## Mean :17.05 Mean :200.4 Mean : 99.92 Mean : 9.018
## 3rd Qu.:19.90 3rd Qu.:234.7 3rd Qu.:113.00 3rd Qu.:10.560
## Max. :30.91 Max. :395.0 Max. :175.00 Max. :17.770
##
## customer.calls churn
## Min. :0.00 yes: 707
## 1st Qu.:1.00 no :4293
## Median :1.00
## Mean :1.57
## 3rd Qu.:2.00
## Max. :9.00
##
table(churn$churn) # 707 yes, 4293 no
##
## yes no
## 707 4293
plot(churn$churn, main = "Churn Distribution", ylab = "Frequency")
# Separate the 'yes' and 'no' classes
churn_yes <- churn[churn$churn == "yes", ]
churn_no <- churn[churn$churn == "no", ]
# Undersample the 'no' class to match the size of the 'yes' class
set.seed(123456)
num_yes <- nrow(churn_yes)
undersampled_indices <- sample(nrow(churn_no), size = num_yes)
churn_no_undersampled <- churn_no[undersampled_indices, ]
# Combine the undersampled 'no' class with the 'yes' class
churn_balanced <- rbind(churn_yes, churn_no_undersampled)
# Shuffle the balanced dataset
shuffled_indices <- sample(nrow(churn_balanced))
churn_balanced <- churn_balanced[shuffled_indices, ]
plot(churn_balanced$churn, main = "Churn Distribution", ylab = "Frequency")
In this case, I would undersample the “no” class to match the size of the “yes” class since the imbalance is too severe. Moreover, the data is large enough to not worry too much about data loss. After the undersampling, you can see both classes of churn have the same frequency.
In general, I would use createDataPartition to ensure that the proportion of classes in the training and testing sets is similar to the original dataset and reduce bias towards the majority “no” class. However, in my code above, I already undersampled such that I can use simple random.
summary(churn_balanced)
## state area.code account.length voice.plan voice.messages
## NJ : 44 area_code_408:346 Min. : 1.0 yes: 310 Min. : 0.000
## WV : 40 area_code_415:698 1st Qu.: 74.0 no :1104 1st Qu.: 0.000
## MN : 39 area_code_510:370 Median :101.0 Median : 0.000
## MD : 38 Mean :101.3 Mean : 6.562
## ME : 38 3rd Qu.:127.0 3rd Qu.: 0.000
## MI : 38 Max. :225.0 Max. :50.000
## (Other):1177
## intl.plan intl.mins intl.calls intl.charge day.mins
## yes: 244 Min. : 0.000 Min. : 0.000 Min. :0.000 Min. : 0.0
## no :1170 1st Qu.: 8.625 1st Qu.: 3.000 1st Qu.:2.328 1st Qu.:144.3
## Median :10.500 Median : 4.000 Median :2.840 Median :189.3
## Mean :10.482 Mean : 4.318 Mean :2.830 Mean :191.0
## 3rd Qu.:12.300 3rd Qu.: 6.000 3rd Qu.:3.320 3rd Qu.:236.7
## Max. :20.000 Max. :20.000 Max. :5.400 Max. :351.5
##
## day.calls day.charge eve.mins eve.calls eve.charge
## Min. : 0.0 Min. : 0.00 Min. : 31.2 Min. : 12 Min. : 2.65
## 1st Qu.: 87.0 1st Qu.:24.54 1st Qu.:169.1 1st Qu.: 87 1st Qu.:14.37
## Median :101.0 Median :32.19 Median :205.7 Median :100 Median :17.48
## Mean :100.4 Mean :32.48 Mean :205.2 Mean :100 Mean :17.44
## 3rd Qu.:114.0 3rd Qu.:40.23 3rd Qu.:241.6 3rd Qu.:114 3rd Qu.:20.54
## Max. :165.0 Max. :59.76 Max. :363.7 Max. :170 Max. :30.91
##
## night.mins night.calls night.charge customer.calls churn
## Min. : 46.7 Min. : 42.0 Min. : 2.10 Min. :0.000 yes:707
## 1st Qu.:168.6 1st Qu.: 86.0 1st Qu.: 7.59 1st Qu.:1.000 no :707
## Median :203.3 Median :100.0 Median : 9.15 Median :1.000
## Mean :202.4 Mean : 99.6 Mean : 9.11 Mean :1.852
## 3rd Qu.:236.3 3rd Qu.:114.0 3rd Qu.:10.63 3rd Qu.:3.000
## Max. :395.0 Max. :158.0 Max. :17.77 Max. :9.000
##
nearZeroVar(churn_balanced) # Column 5 voice.messages is near zero variance
## [1] 5
names(churn_balanced[nearZeroVar(churn_balanced)])
## [1] "voice.messages"
table(churn_balanced$voice.messages) # Mainly 0 voice messages
##
## 0 9 12 14 15 16 17 18 19 20 21 22 23 24 25 26
## 1104 1 2 1 6 2 5 5 5 7 8 13 14 11 8 18
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
## 15 14 19 10 18 13 16 12 14 12 6 8 8 6 7 10
## 43 44 45 46 47 48 49 50
## 5 2 2 2 1 2 1 1
glm.fit <- glm(churn ~ . - voice.messages, data = churn_balanced, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = churn ~ . - voice.messages, family = binomial,
## data = churn_balanced)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.106e+00 1.089e+00 4.687 2.78e-06 ***
## stateAL 3.228e-01 8.315e-01 0.388 0.6978
## stateAR 3.501e-02 8.486e-01 0.041 0.9671
## stateAZ -9.691e-01 9.231e-01 -1.050 0.2938
## stateCA -1.321e+00 8.914e-01 -1.482 0.1383
## stateCO 1.576e-01 8.737e-01 0.180 0.8569
## stateCT -1.026e+00 8.665e-01 -1.184 0.2363
## stateDC -6.236e-02 8.938e-01 -0.070 0.9444
## stateDE -2.496e-01 8.345e-01 -0.299 0.7649
## stateFL -6.766e-01 8.445e-01 -0.801 0.4230
## stateGA 5.318e-02 8.932e-01 0.060 0.9525
## stateHI 7.064e-01 9.743e-01 0.725 0.4684
## stateIA -3.151e-01 9.111e-01 -0.346 0.7295
## stateID -3.717e-01 8.462e-01 -0.439 0.6605
## stateIL 6.453e-01 9.187e-01 0.702 0.4824
## stateIN -2.118e-01 8.478e-01 -0.250 0.8027
## stateKS -4.036e-01 8.292e-01 -0.487 0.6264
## stateKY -8.291e-01 8.451e-01 -0.981 0.3266
## stateLA -2.876e-02 8.705e-01 -0.033 0.9736
## stateMA -5.716e-01 8.431e-01 -0.678 0.4978
## stateMD -2.117e-01 8.152e-01 -0.260 0.7951
## stateME -2.887e-01 8.070e-01 -0.358 0.7206
## stateMI -3.663e-01 8.172e-01 -0.448 0.6540
## stateMN -5.303e-01 8.096e-01 -0.655 0.5125
## stateMO 8.919e-02 8.547e-01 0.104 0.9169
## stateMS -4.522e-01 8.253e-01 -0.548 0.5837
## stateMT -1.550e+00 8.233e-01 -1.882 0.0598 .
## stateNC 6.821e-03 8.786e-01 0.008 0.9938
## stateND -8.695e-01 9.918e-01 -0.877 0.3806
## stateNE 2.954e-01 9.041e-01 0.327 0.7438
## stateNH -7.718e-02 8.688e-01 -0.089 0.9292
## stateNJ -7.022e-01 7.986e-01 -0.879 0.3792
## stateNM -4.304e-01 8.828e-01 -0.487 0.6259
## stateNV -4.383e-01 8.428e-01 -0.520 0.6031
## stateNY -6.847e-01 8.310e-01 -0.824 0.4100
## stateOH -2.341e-01 8.299e-01 -0.282 0.7779
## stateOK -6.087e-01 8.399e-01 -0.725 0.4686
## stateOR -6.172e-01 8.163e-01 -0.756 0.4496
## statePA 7.735e-02 9.040e-01 0.086 0.9318
## stateRI 5.280e-01 9.432e-01 0.560 0.5756
## stateSC -1.459e+00 8.957e-01 -1.629 0.1033
## stateSD -8.426e-01 8.801e-01 -0.957 0.3383
## stateTN -6.378e-01 8.662e-01 -0.736 0.4615
## stateTX -1.416e+00 8.397e-01 -1.686 0.0918 .
## stateUT -1.102e+00 8.460e-01 -1.303 0.1927
## stateVA 1.161e+00 9.083e-01 1.279 0.2010
## stateVT 2.139e-01 8.881e-01 0.241 0.8097
## stateWA -8.977e-01 8.253e-01 -1.088 0.2767
## stateWI -6.768e-01 9.020e-01 -0.750 0.4530
## stateWV -9.425e-01 8.041e-01 -1.172 0.2412
## stateWY 5.969e-01 8.537e-01 0.699 0.4844
## area.codearea_code_415 2.544e-01 1.689e-01 1.506 0.1320
## area.codearea_code_510 2.122e-01 1.923e-01 1.104 0.2698
## account.length -2.066e-03 1.716e-03 -1.204 0.2287
## voice.planno -1.029e+00 1.735e-01 -5.930 3.02e-09 ***
## intl.planno 2.608e+00 2.209e-01 11.808 < 2e-16 ***
## intl.mins 4.943e+00 6.362e+00 0.777 0.4371
## intl.calls 5.172e-02 2.765e-02 1.870 0.0614 .
## intl.charge -1.850e+01 2.356e+01 -0.785 0.4323
## day.mins -6.352e+00 4.042e+00 -1.571 0.1161
## day.calls -1.327e-03 3.360e-03 -0.395 0.6929
## day.charge 3.729e+01 2.378e+01 1.568 0.1168
## eve.mins 8.947e-01 2.033e+00 0.440 0.6599
## eve.calls 7.300e-04 3.375e-03 0.216 0.8287
## eve.charge -1.061e+01 2.392e+01 -0.444 0.6573
## night.mins 8.127e-01 1.046e+00 0.777 0.4371
## night.calls 9.492e-04 3.506e-03 0.271 0.7866
## night.charge -1.813e+01 2.324e+01 -0.780 0.4352
## customer.calls -6.023e-01 4.935e-02 -12.204 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1960.2 on 1413 degrees of freedom
## Residual deviance: 1386.0 on 1345 degrees of freedom
## AIC: 1524
##
## Number of Fisher Scoring iterations: 5
glm.probs <- predict(glm.fit, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "yes", "no")
conf_matrix <- table(glm.pred, churn_balanced$churn)
print(conf_matrix)
##
## glm.pred yes no
## no 553 157
## yes 154 550
accuracy <- mean(glm.pred == churn_balanced$churn)
print(accuracy) # 0.220
## [1] 0.2199434
glm.fit <- glm(churn ~ voice.plan + intl.plan + customer.calls, data = churn_balanced, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = churn ~ voice.plan + intl.plan + customer.calls,
## family = binomial, data = churn_balanced)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.10656 0.20700 -0.515 0.607
## voice.planno -1.13659 0.15295 -7.431 1.07e-13 ***
## intl.planno 2.09259 0.18863 11.093 < 2e-16 ***
## customer.calls -0.42161 0.04089 -10.311 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1960.2 on 1413 degrees of freedom
## Residual deviance: 1657.4 on 1410 degrees of freedom
## AIC: 1665.4
##
## Number of Fisher Scoring iterations: 4
glm.probs <- predict(glm.fit, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "yes", "no")
conf_matrix <- table(glm.pred, churn_balanced$churn)
print(conf_matrix)
##
## glm.pred yes no
## no 401 145
## yes 306 562
accuracy <- mean(glm.pred == churn_balanced$churn)
print(accuracy) # 0.319
## [1] 0.3189533
mod.sel <- regsubsets(churn ~ . -voice.messages, data = churn_balanced, really.big=60)
reg.summary <- summary(mod.sel)
reg.summary$outmat
## stateAL stateAR stateAZ stateCA stateCO stateCT stateDC stateDE
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " " " " " " " "
## stateFL stateGA stateHI stateIA stateID stateIL stateIN stateKS
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " " " " " " " "
## stateKY stateLA stateMA stateMD stateME stateMI stateMN stateMO
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " " " " " " " "
## stateMS stateMT stateNC stateND stateNE stateNH stateNJ stateNM
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " " " " " "
## 7 ( 1 ) " " "*" " " " " " " " " " " " "
## 8 ( 1 ) " " "*" " " " " " " " " " " " "
## stateNV stateNY stateOH stateOK stateOR statePA stateRI stateSC
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " " " " " " " "
## stateSD stateTN stateTX stateUT stateVA stateVT stateWA stateWI
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " "*" " " " " " "
## 7 ( 1 ) " " " " " " " " "*" " " " " " "
## 8 ( 1 ) " " " " " " " " "*" " " " " " "
## stateWV stateWY area.codearea_code_415 area.codearea_code_510
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " " " " "
## 6 ( 1 ) " " " " " " " "
## 7 ( 1 ) " " " " " " " "
## 8 ( 1 ) " " " " " " " "
## account.length voice.planno intl.planno intl.mins intl.calls
## 1 ( 1 ) " " " " "*" " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " "*" " " " "
## 4 ( 1 ) " " "*" "*" " " " "
## 5 ( 1 ) " " "*" "*" " " " "
## 6 ( 1 ) " " "*" "*" " " " "
## 7 ( 1 ) " " "*" "*" " " " "
## 8 ( 1 ) " " "*" "*" " " " "
## intl.charge day.mins day.calls day.charge eve.mins eve.calls
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " "
## 5 ( 1 ) " " "*" " " " " " " " "
## 6 ( 1 ) " " "*" " " " " " " " "
## 7 ( 1 ) " " "*" " " " " " " " "
## 8 ( 1 ) " " "*" " " " " " " " "
## eve.charge night.mins night.calls night.charge customer.calls
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " "*"
## 4 ( 1 ) " " " " " " " " "*"
## 5 ( 1 ) "*" " " " " " " "*"
## 6 ( 1 ) "*" " " " " " " "*"
## 7 ( 1 ) "*" " " " " " " "*"
## 8 ( 1 ) "*" " " " " "*" "*"
cat(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)]) # 8
## 8 -433.6823
cat(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)]) # 8
## 8 18.0739
cat(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)]) # 8
## 8 0.2933364
# voice.plan + intl.plan + day.mins + eve.charge + night.charge + customer.calls
glm.fit <- glm(churn ~ voice.plan + intl.plan + day.mins + eve.charge + night.charge + customer.calls, data = churn_balanced, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = churn ~ voice.plan + intl.plan + day.mins + eve.charge +
## night.charge + customer.calls, family = binomial, data = churn_balanced)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.207976 0.478986 8.785 < 2e-16 ***
## voice.planno -0.970971 0.164085 -5.917 3.27e-09 ***
## intl.planno 2.419194 0.203904 11.864 < 2e-16 ***
## day.mins -0.012459 0.001142 -10.912 < 2e-16 ***
## eve.charge -0.080446 0.014826 -5.426 5.76e-08 ***
## night.charge -0.066164 0.028932 -2.287 0.0222 *
## customer.calls -0.590062 0.046846 -12.596 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1960.2 on 1413 degrees of freedom
## Residual deviance: 1471.3 on 1407 degrees of freedom
## AIC: 1485.3
##
## Number of Fisher Scoring iterations: 4
glm.probs <- predict(glm.fit, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "yes", "no")
conf_matrix <- table(glm.pred, churn_balanced$churn)
print(conf_matrix)
##
## glm.pred yes no
## no 548 169
## yes 159 538
accuracy <- mean(glm.pred == churn_balanced$churn)
print(accuracy) # 0.23 without state and 0.22 with state
## [1] 0.2319661
train_size <- floor(0.70 * nrow(churn_balanced))
train_index <- sample(seq_len(nrow(churn_balanced)), size = train_size)
train <- churn_balanced[train_index, ]
test <- churn_balanced[-train_index, ]
table(train$churn) # 497 yes, 492 no --> balanced
##
## yes no
## 497 492
table(test$churn) # 210 yes, 215 no --> balanced
##
## yes no
## 210 215
plot(train$churn, main = "Train Churn Distribution", ylab = "Frequency")
plot(test$churn, main = "Test Churn Distribution", ylab = "Frequency")
logit.fit <- glm(churn ~ voice.plan + intl.plan + customer.calls, data = train, family = binomial)
logit.probs <- predict(logit.fit, newdata = test, type = "response")
logit.pred <- ifelse(logit.probs > 0.5, "yes", "no")
conf_matrix <- table(logit.pred, test$churn)
print(conf_matrix)
##
## logit.pred yes no
## no 111 37
## yes 99 178
accuracy <- mean(logit.pred == test$churn)
print(accuracy) # 0.32
## [1] 0.32
lda.fit <- lda(churn ~ voice.plan + intl.plan + customer.calls, data = train)
lda.pred <- predict(lda.fit, newdata = test)
conf_matrix <- table(lda.pred$class, test$churn)
print(conf_matrix)
##
## yes no
## yes 111 37
## no 99 178
accuracy <- mean(lda.pred$class == test$churn)
print(accuracy) # 0.68
## [1] 0.68
nb.fit <- naiveBayes(churn ~ voice.plan + intl.plan + customer.calls, data = train)
nb.pred <- predict(nb.fit, newdata = test)
conf_matrix <- table(nb.pred, test$churn)
print(conf_matrix)
##
## nb.pred yes no
## yes 113 38
## no 97 177
accuracy <- mean(nb.pred == test$churn)
print(accuracy) # 0.682
## [1] 0.6823529
train$voice.plan <- as.numeric(as.factor(train$voice.plan))
train$intl.plan <- as.numeric(as.factor(train$intl.plan))
test$voice.plan <- as.numeric(as.factor(test$voice.plan))
test$intl.plan <- as.numeric(as.factor(test$intl.plan))
train_X <- as.matrix(train[, c("voice.plan", "intl.plan", "customer.calls")])
test_X <- as.matrix(test[, c("voice.plan", "intl.plan", "customer.calls")])
train_Y <- train$churn
test_Y <- test$churn
for (k in 1:10) {
knn.pred <- knn(train_X, test_X, train_Y, k = k)
conf_matrix <- table(knn.pred, test_Y)
print(paste("Confusion Matrix for k =", k))
print(conf_matrix)
accuracy <- mean(knn.pred == test_Y)
print(paste("Accuracy for k =", k, ":", accuracy)) # 0.708
}
## [1] "Confusion Matrix for k = 1"
## test_Y
## knn.pred yes no
## yes 104 18
## no 106 197
## [1] "Accuracy for k = 1 : 0.708235294117647"
## [1] "Confusion Matrix for k = 2"
## test_Y
## knn.pred yes no
## yes 104 18
## no 106 197
## [1] "Accuracy for k = 2 : 0.708235294117647"
## [1] "Confusion Matrix for k = 3"
## test_Y
## knn.pred yes no
## yes 104 18
## no 106 197
## [1] "Accuracy for k = 3 : 0.708235294117647"
## [1] "Confusion Matrix for k = 4"
## test_Y
## knn.pred yes no
## yes 104 18
## no 106 197
## [1] "Accuracy for k = 4 : 0.708235294117647"
## [1] "Confusion Matrix for k = 5"
## test_Y
## knn.pred yes no
## yes 104 18
## no 106 197
## [1] "Accuracy for k = 5 : 0.708235294117647"
## [1] "Confusion Matrix for k = 6"
## test_Y
## knn.pred yes no
## yes 104 18
## no 106 197
## [1] "Accuracy for k = 6 : 0.708235294117647"
## [1] "Confusion Matrix for k = 7"
## test_Y
## knn.pred yes no
## yes 104 18
## no 106 197
## [1] "Accuracy for k = 7 : 0.708235294117647"
## [1] "Confusion Matrix for k = 8"
## test_Y
## knn.pred yes no
## yes 104 18
## no 106 197
## [1] "Accuracy for k = 8 : 0.708235294117647"
## [1] "Confusion Matrix for k = 9"
## test_Y
## knn.pred yes no
## yes 104 18
## no 106 197
## [1] "Accuracy for k = 9 : 0.708235294117647"
## [1] "Confusion Matrix for k = 10"
## test_Y
## knn.pred yes no
## yes 104 18
## no 106 197
## [1] "Accuracy for k = 10 : 0.708235294117647"
I preprocessed the data undersampling the “no” class in churn. After removing a near zero variance predictors, I used predictor selection techniques to land on using these three predictors: voice.plan, intl.plan, and customer.calls. My train/test split was a simple random 70/30 split due to the undersampling procedure. As shown by the plots, each class is equally represented in the train and test splits. I assessed logistic regression, LDA, naive Bayes’, and KNN. For all k’s, KNN performed the best with accuracy of 0.708.