Data

Consider the data set OJ, which is included in the R library ISLR. The following code chunk loads the library and the packages needed for this assignment. If you haven't already installed these packages, please use the install.packages() command. You can use head(OJ) or ?OJ to get more information about the data set. We would like to predict the variable Purchase, which equals CH or MM for Citrus Hill or Minute Maid orange juice.

library(ISLR)
library(MASS)
library(class)
library(kableExtra)
data(OJ)

head(OJ)
##   Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## 1       CH            237       1    1.75    1.99   0.00    0.0         0
## 2       CH            239       1    1.75    1.99   0.00    0.3         0
## 3       CH            245       1    1.86    2.09   0.17    0.0         0
## 4       MM            227       1    1.69    1.69   0.00    0.0         0
## 5       CH            228       7    1.69    1.69   0.00    0.0         0
## 6       CH            230       7    1.69    1.99   0.00    0.0         0
##   SpecialMM  LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
## 1         0 0.500000        1.99        1.75      0.24     No  0.000000
## 2         1 0.600000        1.69        1.75     -0.06     No  0.150754
## 3         0 0.680000        2.09        1.69      0.40     No  0.000000
## 4         0 0.400000        1.69        1.69      0.00     No  0.000000
## 5         0 0.956535        1.69        1.69      0.00    Yes  0.000000
## 6         1 0.965228        1.99        1.69      0.30    Yes  0.000000
##   PctDiscCH ListPriceDiff STORE
## 1  0.000000          0.24     1
## 2  0.000000          0.24     1
## 3  0.091398          0.23     1
## 4  0.000000          0.00     1
## 5  0.000000          0.00     0
## 6  0.000000          0.30     0

Questions

  1. (5 points) Separate 20% of your data into a testing set and the remainder into a training set.
# Set seed for reproducibility
set.seed(1)

# Store the total number of rows in n1
n1 <- nrow(OJ)

# Calculate the number of rows for testing (20%)
n2 <- round(n1 * 0.2)

# Randomly select n2 row indices from 1 to n1 for the testing set
testing <- sample(n1, n2)

# Split OJ into training and testing sets
train_OJ <- OJ[-testing,]
test_OJ <- OJ[testing,]

# Note: the train_OJ and test_OJ row counts sum to the total number of OJ rows, confirming the data frame was split successfully.
nrow(OJ)
## [1] 1070
nrow(train_OJ)
## [1] 856
nrow(test_OJ)
## [1] 214
head(train_OJ)
##   Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## 2       CH            239       1    1.75    1.99   0.00    0.3         0
## 3       CH            245       1    1.86    2.09   0.17    0.0         0
## 4       MM            227       1    1.69    1.69   0.00    0.0         0
## 5       CH            228       7    1.69    1.69   0.00    0.0         0
## 6       CH            230       7    1.69    1.99   0.00    0.0         0
## 7       CH            232       7    1.69    1.99   0.00    0.4         1
##   SpecialMM  LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
## 2         1 0.600000        1.69        1.75     -0.06     No  0.150754
## 3         0 0.680000        2.09        1.69      0.40     No  0.000000
## 4         0 0.400000        1.69        1.69      0.00     No  0.000000
## 5         0 0.956535        1.69        1.69      0.00    Yes  0.000000
## 6         1 0.965228        1.99        1.69      0.30    Yes  0.000000
## 7         1 0.972182        1.59        1.69     -0.10    Yes  0.201005
##   PctDiscCH ListPriceDiff STORE
## 2  0.000000          0.24     1
## 3  0.091398          0.23     1
## 4  0.000000          0.00     1
## 5  0.000000          0.00     0
## 6  0.000000          0.30     0
## 7  0.000000          0.30     0
head(test_OJ)
##      Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## 1017       CH            236       7    1.75    1.99   0.00   0.40         0
## 679        CH            277       1    1.99    2.13   0.24   0.00         0
## 129        CH            267       2    1.86    2.18   0.00   0.40         0
## 930        MM            236       3    1.79    2.09   0.00   0.00         0
## 471        CH            274       7    1.86    2.13   0.47   0.54         1
## 299        MM            231       2    1.69    1.69   0.30   0.00         1
##      SpecialMM  LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
## 1017         0 0.908732        1.59        1.75     -0.16    Yes  0.201005
## 679          0 0.585435        2.13        1.75      0.38     No  0.000000
## 129          1 0.890772        1.78        1.86     -0.08     No  0.183486
## 930          0 0.163840        2.09        1.79      0.30     No  0.000000
## 471          0 0.500000        1.59        1.39      0.20    Yes  0.253521
## 299          0 0.467200        1.69        1.39      0.30     No  0.000000
##      PctDiscCH ListPriceDiff STORE
## 1017  0.000000          0.24     0
## 679   0.120603          0.14     1
## 129   0.000000          0.32     2
## 930   0.000000          0.30     3
## 471   0.252688          0.27     0
## 299   0.177515          0.00     2
  2. (15 points) Develop a classification model where the variable Purchase is the dependent variable using the Logistic Regression method. (Independent variables are PriceCH, PriceMM, STORE, and PriceDiff)
#Check structure of the dependent variable Purchase
str(train_OJ$Purchase)
##  Factor w/ 2 levels "CH","MM": 1 1 2 1 1 1 1 1 1 1 ...
#Check structure of 4 independent variables
str(train_OJ$PriceCH)
##  num [1:856] 1.75 1.86 1.69 1.69 1.69 1.69 1.75 1.75 1.75 1.86 ...
str(train_OJ$PriceMM)
##  num [1:856] 1.99 2.09 1.69 1.69 1.99 1.99 1.99 1.99 1.99 2.09 ...
str(train_OJ$STORE)
##  num [1:856] 1 1 1 0 0 0 0 0 0 0 ...
str(train_OJ$PriceDiff)
##  num [1:856] -0.06 0.4 0 0 0.3 -0.1 -0.16 -0.16 -0.16 0.23 ...
# Check the current levels: "CH" is level 1 and "MM" is level 2, so glm() will
# model the probability of the second level, MM
levels(train_OJ$Purchase)
## [1] "CH" "MM"

#Check structure of the dependent variable Purchase
str(test_OJ$Purchase)
##  Factor w/ 2 levels "CH","MM": 1 1 1 2 1 2 2 1 1 1 ...
#Check structure of 4 independent variables
str(test_OJ$PriceCH)
##  num [1:214] 1.75 1.99 1.86 1.79 1.86 1.69 1.86 1.86 1.86 1.76 ...
str(test_OJ$PriceMM)
##  num [1:214] 1.99 2.13 2.18 2.09 2.13 1.69 2.18 2.18 2.13 2.18 ...
str(test_OJ$STORE)
##  num [1:214] 0 1 2 3 0 2 2 0 0 1 ...
str(test_OJ$PriceDiff)
##  num [1:214] -0.16 0.38 -0.08 0.3 0.2 0.3 0.32 0.32 0.27 0.42 ...
#Fit and summarize the logistic regression model
logfit_OJ <- glm(Purchase ~ PriceCH + PriceMM + STORE + PriceDiff, family = binomial, data=train_OJ)

summary(logfit_OJ)
## 
## Call:
## glm(formula = Purchase ~ PriceCH + PriceMM + STORE + PriceDiff, 
##     family = binomial, data = train_OJ)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.25729    1.41835   3.707  0.00021 ***
## PriceCH     -2.40462    1.08912  -2.208  0.02725 *  
## PriceMM     -0.60896    0.77973  -0.781  0.43481    
## STORE        0.17646    0.05894   2.994  0.00275 ** 
## PriceDiff   -2.04732    0.31564  -6.486  8.8e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1145.9  on 855  degrees of freedom
## Residual deviance: 1065.8  on 851  degrees of freedom
## AIC: 1075.8
## 
## Number of Fisher Scoring iterations: 4
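Before forming predictions, it helps to confirm which class the fitted probabilities correspond to; since the levels of Purchase are c("CH", "MM"), glm() codes MM as 1:

contrasts(train_OJ$Purchase)
##    MM
## CH  0
## MM  1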
logprob <- predict(logfit_OJ, newdata = test_OJ, type = "response")

# contrasts() above shows CH = 0 and MM = 1, so these are predicted
# probabilities of an MM purchase
logpred <- rep("CH", n2)
logpred[logprob > .5] <- "MM"

head(logprob)
##      1017       679       129       930       471       299 
## 0.5411449 0.1936440 0.4934687 0.4002267 0.2845822 0.4757856
#If the probability is > 0.5, the observation is classified as an MM purchase
head(logpred)
## [1] "MM" "CH" "CH" "CH" "CH" "CH"
  3. (10 points) Obtain the confusion matrix and compute the training error rate based on the logistic regression classification. (Make sure to check what the high probabilities correspond to: MM or CH, hint: contrasts(train$Purchase))
#Create confusion matrix (rows: predicted class, columns: actual class)
table(logpred, test_OJ[,"Purchase"])
##        
## logpred  CH  MM
##      CH 113  61
##      MM  19  21
#Testing error rate
error_rate <- 1 - mean(logpred==test_OJ[,"Purchase"])
error_rate
## [1] 0.3738318
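As a cross-check, the same error rate follows directly from the off-diagonal counts of the confusion matrix above:

# Misclassified counts (61 + 19) over the number of test observations
(61 + 19) / nrow(test_OJ)
## [1] 0.3738318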
  4. (15 points) Develop a classification model where the variable Purchase is the dependent variable using the LDA method. (Independent variables are PriceCH, PriceMM, STORE, and PriceDiff)
#Set train and test data sets for LDA method 
set.seed(123)

sample <- sample(c(TRUE, FALSE), nrow(OJ), replace = T, prob = c(0.8,0.2))

train_LDA <- OJ[sample,]
test_LDA <- OJ[!sample, ]

nrow(OJ)
## [1] 1070
nrow(train_LDA)
## [1] 863
nrow(test_LDA)
## [1] 207
lda_OJ <- lda(Purchase ~ PriceCH + PriceMM + STORE + PriceDiff, data = train_LDA)
lda_OJ
## Call:
## lda(Purchase ~ PriceCH + PriceMM + STORE + PriceDiff, data = train_LDA)
## 
## Prior probabilities of groups:
##        CH        MM 
## 0.6071842 0.3928158 
## 
## Group means:
##     PriceCH  PriceMM    STORE  PriceDiff
## CH 1.867481 2.099046 1.511450 0.20215649
## MM 1.869676 2.067640 1.843658 0.05132743
## 
## Coefficients of linear discriminants:
##                  LD1
## PriceCH   -1.2822666
## PriceMM   -0.9924425
## STORE      0.2745856
## PriceDiff -3.2852832
plot(lda_OJ)
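The prior probabilities reported above are simply the class proportions in the training data, which can be verified directly:

# Class proportions in train_LDA should match the reported priors
prop.table(table(train_LDA$Purchase))
## 
##        CH        MM 
## 0.6071842 0.3928158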

  5. (10 points) Obtain the confusion matrix and compute the training error rate based on the LDA classification.
#Confirm which class each column of lda.pred$posterior corresponds to (the response variable, Purchase, is a factor with 2 levels: CH, column 1, and MM, column 2):
levels(train_OJ$Purchase)
## [1] "CH" "MM"
lda.pred <- predict(lda_OJ, newdata=test_OJ)
# Count the observations predicted as CH: those with posterior probability of the first level, CH [,1], above 0.5
sum(lda.pred$posterior[, 1] > .5)
## [1] 169
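This count matches the default class assignment: with two classes, predict() assigns CH exactly when its posterior probability exceeds 0.5, which we can sanity-check (this should return TRUE):

# The 0.5 threshold on the CH posterior reproduces lda.pred$class
all((lda.pred$posterior[, 1] > .5) == (lda.pred$class == "CH"))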
#Confusion matrix (rows: actual class, columns: predicted class)
table(test_OJ$Purchase, lda.pred$class)
##     
##       CH  MM
##   CH 114  18
##   MM  55  27
#Testing error rate
1 - mean(test_OJ$Purchase==lda.pred$class)
## [1] 0.3411215
  6. (15 points) Develop a classification model for the variable Purchase using the KNN method when k=3. (Use only PriceCH, PriceMM, STORE, and PriceDiff, hint: cbind())
#Remember to create testing and training sets without the variable `Purchase`. You do not need to create a new testing and training set as in question 1; you could simply remove the response variable from the data sets already created.
set.seed(1234)

ind <- sample(2, nrow(OJ), replace=TRUE, prob=c(0.80, 0.20)) 
OJ.training <- OJ[ind==1, 2:4] 
OJ.test <- OJ[ind==2, 2:4]
OJ.trainLabels <- OJ[ind==1, 1] 
OJ.testLabels <- OJ[ind==2, 1]
str(OJ.training)
## 'data.frame':    847 obs. of  3 variables:
##  $ WeekofPurchase: num  237 239 245 227 230 232 234 235 238 240 ...
##  $ StoreID       : num  1 1 1 1 7 7 7 7 7 7 ...
##  $ PriceCH       : num  1.75 1.75 1.86 1.69 1.69 1.69 1.75 1.75 1.75 1.86 ...
str(OJ.test)
## 'data.frame':    223 obs. of  3 variables:
##  $ WeekofPurchase: num  228 268 278 269 272 274 271 227 259 260 ...
##  $ StoreID       : num  7 7 7 7 7 7 4 4 4 4 ...
##  $ PriceCH       : num  1.69 1.86 2.06 1.86 1.86 1.86 1.99 1.79 1.99 1.99 ...
#KNN model with k=3
OJ_pred <- knn(train = OJ.training, test = OJ.test, cl = OJ.trainLabels, k=3)
OJ_pred
##   [1] CH CH CH CH CH CH CH MM CH MM CH CH CH CH CH CH MM CH MM MM MM CH CH MM CH
##  [26] CH MM MM MM CH CH CH MM CH CH CH CH CH CH CH CH CH CH MM CH MM CH MM MM MM
##  [51] CH CH CH MM MM CH CH MM CH CH CH CH CH CH CH CH CH MM CH CH MM CH MM CH MM
##  [76] MM MM MM MM CH MM MM CH CH CH CH MM CH CH CH CH CH MM MM MM MM CH MM MM CH
## [101] MM CH CH CH CH MM CH CH CH CH CH MM MM MM CH CH MM MM CH CH CH CH CH CH CH
## [126] CH MM CH MM MM MM MM MM CH MM CH CH CH CH CH CH CH MM CH CH CH MM CH MM CH
## [151] CH CH MM MM CH CH MM MM CH CH MM MM MM CH CH MM MM MM MM MM CH CH MM MM MM
## [176] MM MM MM CH CH CH MM CH MM MM MM CH CH CH CH CH MM MM CH CH CH CH MM CH MM
## [201] MM CH MM CH CH CH CH CH MM CH CH CH CH MM CH CH CH CH MM MM MM CH CH
## Levels: CH MM
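Note that the question names PriceCH, PriceMM, STORE, and PriceDiff as the predictors. A minimal sketch of the hinted cbind() approach with those four variables, reusing the split from question 1 (not run here, so no output or error rate is shown), would be:

# Build predictor matrices from the four variables named in the question
knn.train <- cbind(train_OJ$PriceCH, train_OJ$PriceMM, train_OJ$STORE, train_OJ$PriceDiff)
knn.test <- cbind(test_OJ$PriceCH, test_OJ$PriceMM, test_OJ$STORE, test_OJ$PriceDiff)

# KNN with k = 3 on those predictors
OJ_pred_alt <- knn(train = knn.train, test = knn.test, cl = train_OJ$Purchase, k = 3)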
  7. (10 points) Obtain the confusion matrix and compute the training error rate based on the KNN classification.
table(OJ_pred, OJ.testLabels)
##        OJ.testLabels
## OJ_pred  CH  MM
##      CH 105  32
##      MM  39  47
1-mean(OJ_pred==OJ.testLabels)
## [1] 0.3183857
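As before, the error rate can be cross-checked from the off-diagonal counts of the confusion matrix:

# Misclassified counts (32 + 39) over the number of test observations
(32 + 39) / length(OJ.testLabels)
## [1] 0.3183857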
  8. (10 points) Based on the 3 classification models for Purchase, which model would you prefer and why? Are you satisfied with the predictions? (Hint: testing error rate)

Based on the three classification models, I would prefer the KNN model because it has the lowest testing error rate (0.318) and the highest precision (0.729); it is the model least likely to classify a purchase as CH when the purchase was in fact MM (a false positive). The next best model is LDA, with an error rate of 0.341 and a precision of 0.675. The worst performing model, from an accuracy standpoint, is the logistic regression model, with an error rate of 0.374 and a precision of 0.649. That said, I am not fully satisfied with the predictions: even the best model, KNN, still misclassifies 0.318 (almost 1/3) of the observations as the wrong type of purchase (CH when MM, or MM when CH). The KNN model could be improved by finding an optimal value of k for this data set, for example via cross-validation (see the sketch below), and additional independent variables might further reduce the error; more analysis is needed. Furthermore, there are 653 CH purchases and only 417 MM purchases (see the table directly below), suggesting that the underlying OJ data set is somewhat imbalanced. Resampling techniques such as SMOTE (Synthetic Minority Over-sampling Technique), or a weighted KNN model, may be needed. SMOTE refers to the practice of artificially creating new minority-class data points from the attributes of the original minority-class observations, so that predictions have a greater chance of matching the true classes given the underlying attributes.
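As an illustration of how k might be tuned, the same split can be evaluated over a small grid of candidate values (a sketch only; the resulting error rates are not shown here):

# Compare testing error rates across candidate values of k
ks <- c(1, 3, 5, 7, 9, 11)
knn_err <- sapply(ks, function(k) {
  pred <- knn(train = OJ.training, test = OJ.test, cl = OJ.trainLabels, k = k)
  mean(pred != OJ.testLabels)
})
data.frame(k = ks, test_error = knn_err)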

#Count the number of CH and MM purchases
library(tidyverse)
Purchase_Counts_OJ <- OJ %>%
  group_by(Purchase) %>%
  summarize(count = n())

Purchase_Counts_OJ
## # A tibble: 2 × 2
##   Purchase count
##   <fct>    <int>
## 1 CH         653
## 2 MM         417
Models = c("Logistic Reg", "LDA", "KNN k=3")
Error.Rate = c("0.3738318","0.3411215","0.3183857")
Precision = c(113/(113+61), 114/(114+55), 105/(105+39)) #(Correctly predicted CH outcomes) / (total predicted CH outcomes) 
Precision = round(Precision, 3) 
text_tbl = data.frame(Models, Error.Rate, Precision)

kable(text_tbl) %>%
  kable_styling(bootstrap_options = "striped", full_width = F) 
Models Error.Rate Precision
Logistic Reg 0.3738318 0.649
LDA 0.3411215 0.675
KNN k=3 0.3183857 0.729
  9. (10 points) Please summarize your learning outcome for this assignment both in terms of R and classification methods.

After completing this assignment, I have a better understanding of classification methods for predicting the value of a categorical response variable, such as whether CH or MM was purchased, given a variety of independent variables contained in the underlying OJ dataset (including PriceCH and PriceMM). Upon further research, I learned how the logistic regression model assigns each observation a probability between 0 and 1 (with > 0.5 suggesting one category) of belonging to one category (a CH purchase) versus the other (an MM purchase), and how the log-odds transformation converts the logistic model to a linear equivalent. I would like to find more resources that explain in greater detail how the log-odds and logistic regression equations are derived, as this would help me better understand the statistical background behind the classification models explored in module 3.
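For example, because the fitted model is linear in the log-odds, exponentiating its coefficients yields odds ratios (a quick illustration using the model from question 2; output not shown):

# exp(beta) gives the multiplicative change in the odds of an MM purchase
# for a one-unit increase in each predictor
exp(coef(logfit_OJ))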

Using R, we learned how to do basic exploratory analysis to first examine the independent variables, in chart and graph form, before running any classification model. This allowed us to understand what to look for (normal distributions, low variance, no outliers, etc.) in the independent variables when constructing classification models. This is an important step that cannot be overlooked: a statistical model created in a software program is only as good as the underlying data. To this end, we also learned not just how to interpret the results of the logistic regression, LDA, and KNN models, but also how to check the accuracy of each model. Accuracy can be improved through SMOTE, weighted neighbors, and other techniques.