Consider the data set OJ, which is included in the R library
ISLR. The following code chunk loads the library and the
other packages needed for this assignment. If you haven't already installed
these packages, please use the install.packages() command. You
can use head(OJ) or ?OJ to get more
information about the data set. We would like to predict the variable
Purchase, which is equal to CH or
MM for Citrus Hill or Minute Maid Orange Juice.
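For example, a one-time install of everything loaded below could look like this (a sketch; only needed if the packages are not yet installed):
install.packages(c("ISLR", "MASS", "class", "kableExtra"))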
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.4.2
library(MASS)
## Warning: package 'MASS' was built under R version 4.4.2
library(class)
## Warning: package 'class' was built under R version 4.4.2
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.4.2
data(OJ)
head(OJ)
## Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## 1 CH 237 1 1.75 1.99 0.00 0.0 0
## 2 CH 239 1 1.75 1.99 0.00 0.3 0
## 3 CH 245 1 1.86 2.09 0.17 0.0 0
## 4 MM 227 1 1.69 1.69 0.00 0.0 0
## 5 CH 228 7 1.69 1.69 0.00 0.0 0
## 6 CH 230 7 1.69 1.99 0.00 0.0 0
## SpecialMM LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
## 1 0 0.500000 1.99 1.75 0.24 No 0.000000
## 2 1 0.600000 1.69 1.75 -0.06 No 0.150754
## 3 0 0.680000 2.09 1.69 0.40 No 0.000000
## 4 0 0.400000 1.69 1.69 0.00 No 0.000000
## 5 0 0.956535 1.69 1.69 0.00 Yes 0.000000
## 6 1 0.965228 1.99 1.69 0.30 Yes 0.000000
## PctDiscCH ListPriceDiff STORE
## 1 0.000000 0.24 1
## 2 0.000000 0.24 1
## 3 0.091398 0.23 1
## 4 0.000000 0.00 1
## 5 0.000000 0.00 0
## 6 0.000000 0.30 0
#Set seed for reproducibility
set.seed(1)
#Calculate the number of rows to hold out for testing (20%)
n2 <- round(nrow(OJ)*.2)
#Store the total number of rows in n1:
n1 <- nrow(OJ)
#Randomly select n2 indices (the 20% of rows reserved for testing) from 1 to n1, the total number of rows in the OJ data frame:
testing <- sample(n1, n2)
#Split the OJ data frame into training and testing sets
train_OJ <- OJ[-testing,]
test_OJ <- OJ[testing,]
#Note: the train_OJ and test_OJ row counts sum to the total number of rows in OJ, confirming that the data frame was split successfully into training and testing sets.
nrow(OJ)
## [1] 1070
nrow(train_OJ)
## [1] 856
nrow(test_OJ)
## [1] 214
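As an additional programmatic check (a small sketch, not part of the original chunk), the split sizes can be verified directly:
#Confirm that the training and testing rows together account for every row of OJ
stopifnot(nrow(train_OJ) + nrow(test_OJ) == nrow(OJ))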
head(train_OJ)
## Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## 2 CH 239 1 1.75 1.99 0.00 0.3 0
## 3 CH 245 1 1.86 2.09 0.17 0.0 0
## 4 MM 227 1 1.69 1.69 0.00 0.0 0
## 5 CH 228 7 1.69 1.69 0.00 0.0 0
## 6 CH 230 7 1.69 1.99 0.00 0.0 0
## 7 CH 232 7 1.69 1.99 0.00 0.4 1
## SpecialMM LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
## 2 1 0.600000 1.69 1.75 -0.06 No 0.150754
## 3 0 0.680000 2.09 1.69 0.40 No 0.000000
## 4 0 0.400000 1.69 1.69 0.00 No 0.000000
## 5 0 0.956535 1.69 1.69 0.00 Yes 0.000000
## 6 1 0.965228 1.99 1.69 0.30 Yes 0.000000
## 7 1 0.972182 1.59 1.69 -0.10 Yes 0.201005
## PctDiscCH ListPriceDiff STORE
## 2 0.000000 0.24 1
## 3 0.091398 0.23 1
## 4 0.000000 0.00 1
## 5 0.000000 0.00 0
## 6 0.000000 0.30 0
## 7 0.000000 0.30 0
head(test_OJ)
## Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## 1017 CH 236 7 1.75 1.99 0.00 0.40 0
## 679 CH 277 1 1.99 2.13 0.24 0.00 0
## 129 CH 267 2 1.86 2.18 0.00 0.40 0
## 930 MM 236 3 1.79 2.09 0.00 0.00 0
## 471 CH 274 7 1.86 2.13 0.47 0.54 1
## 299 MM 231 2 1.69 1.69 0.30 0.00 1
## SpecialMM LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
## 1017 0 0.908732 1.59 1.75 -0.16 Yes 0.201005
## 679 0 0.585435 2.13 1.75 0.38 No 0.000000
## 129 1 0.890772 1.78 1.86 -0.08 No 0.183486
## 930 0 0.163840 2.09 1.79 0.30 No 0.000000
## 471 0 0.500000 1.59 1.39 0.20 Yes 0.253521
## 299 0 0.467200 1.69 1.39 0.30 No 0.000000
## PctDiscCH ListPriceDiff STORE
## 1017 0.000000 0.24 0
## 679 0.120603 0.14 1
## 129 0.000000 0.32 2
## 930 0.000000 0.30 3
## 471 0.252688 0.27 0
## 299 0.177515 0.00 2
Predict the dependent variable Purchase using the Logistic
Regression method. (Independent variables are PriceCH,
PriceMM, STORE, and PriceDiff.)

#Check the structure of the dependent variable, Purchase
str(train_OJ$Purchase)
## Factor w/ 2 levels "CH","MM": 1 1 2 1 1 1 1 1 1 1 ...
#Check structure of 4 independent variables
str(train_OJ$PriceCH)
## num [1:856] 1.75 1.86 1.69 1.69 1.69 1.69 1.75 1.75 1.75 1.86 ...
str(train_OJ$PriceMM)
## num [1:856] 1.99 2.09 1.69 1.69 1.99 1.99 1.99 1.99 1.99 2.09 ...
str(train_OJ$STORE)
## num [1:856] 1 1 1 0 0 0 0 0 0 0 ...
str(train_OJ$PriceDiff)
## num [1:856] -0.06 0.4 0 0 0.3 -0.1 -0.16 -0.16 -0.16 0.23 ...
# The Purchase factor levels are already c("CH", "MM") (see str() above), so this assignment leaves the coding unchanged, with CH as level 1 and MM as level 2
levels(train_OJ$Purchase) <- c("CH", "MM")
head(train_OJ)
## Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## 2 CH 239 1 1.75 1.99 0.00 0.3 0
## 3 CH 245 1 1.86 2.09 0.17 0.0 0
## 4 MM 227 1 1.69 1.69 0.00 0.0 0
## 5 CH 228 7 1.69 1.69 0.00 0.0 0
## 6 CH 230 7 1.69 1.99 0.00 0.0 0
## 7 CH 232 7 1.69 1.99 0.00 0.4 1
## SpecialMM LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
## 2 1 0.600000 1.69 1.75 -0.06 No 0.150754
## 3 0 0.680000 2.09 1.69 0.40 No 0.000000
## 4 0 0.400000 1.69 1.69 0.00 No 0.000000
## 5 0 0.956535 1.69 1.69 0.00 Yes 0.000000
## 6 1 0.965228 1.99 1.69 0.30 Yes 0.000000
## 7 1 0.972182 1.59 1.69 -0.10 Yes 0.201005
## PctDiscCH ListPriceDiff STORE
## 2 0.000000 0.24 1
## 3 0.091398 0.23 1
## 4 0.000000 0.00 1
## 5 0.000000 0.00 0
## 6 0.000000 0.30 0
## 7 0.000000 0.30 0
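One hedged aside on the levels() assignment above: levels(x) <- c(...) renames the existing levels in place rather than reordering them, so it only confirms the CH/MM coding here because the levels already happen to be in that order. If the goal is to guarantee that CH is the first level, re-declaring the factor is the safer pattern (a sketch, not run above):
#Safer alternative: set the level order explicitly without renaming any values
train_OJ$Purchase <- factor(train_OJ$Purchase, levels = c("CH", "MM"))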
#Check structure of Purchase column dependent variable
str(test_OJ$Purchase)
## Factor w/ 2 levels "CH","MM": 1 1 1 2 1 2 2 1 1 1 ...
#Check structure of 4 independent variables
str(test_OJ$PriceCH)
## num [1:214] 1.75 1.99 1.86 1.79 1.86 1.69 1.86 1.86 1.86 1.76 ...
str(test_OJ$PriceMM)
## num [1:214] 1.99 2.13 2.18 2.09 2.13 1.69 2.18 2.18 2.13 2.18 ...
str(test_OJ$STORE)
## num [1:214] 0 1 2 3 0 2 2 0 0 1 ...
str(test_OJ$PriceDiff)
## num [1:214] -0.16 0.38 -0.08 0.3 0.2 0.3 0.32 0.32 0.27 0.42 ...
#Fit and view regression model
logfit_OJ <- glm(Purchase ~ PriceCH + PriceMM + STORE + PriceDiff, family = binomial, data=train_OJ)
summary(logfit_OJ)
##
## Call:
## glm(formula = Purchase ~ PriceCH + PriceMM + STORE + PriceDiff,
## family = binomial, data = train_OJ)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.25729 1.41835 3.707 0.00021 ***
## PriceCH -2.40462 1.08912 -2.208 0.02725 *
## PriceMM -0.60896 0.77973 -0.781 0.43481
## STORE 0.17646 0.05894 2.994 0.00275 **
## PriceDiff -2.04732 0.31564 -6.486 8.8e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1145.9 on 855 degrees of freedom
## Residual deviance: 1065.8 on 851 degrees of freedom
## AIC: 1075.8
##
## Number of Fisher Scoring iterations: 4
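The coefficients above are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to read (a quick sketch, output omitted):
#Multiplicative change in the odds of an MM purchase per one-unit increase in each predictor
exp(coef(logfit_OJ))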
logprob <- predict(logfit_OJ, newdata=test_OJ, type="response")
logpred <- rep("0", n2)
logpred[logprob > .5] <- "1"
head(logprob)
## 1017 679 129 930 471 299
## 0.5411449 0.1936440 0.4934687 0.4002267 0.2845822 0.4757856
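To confirm which class these fitted probabilities refer to, the factor's contrasts can be inspected; with R's default treatment coding, CH (the first level) is the baseline and MM is coded 1, so logprob estimates P(Purchase = "MM") (a small check, output omitted):
#Show how R codes the two Purchase levels inside the glm; MM should appear as the column coded 1
contrasts(train_OJ$Purchase)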
#Because MM is the second level of the Purchase factor, glm models P(Purchase = "MM"); a fitted probability > 0.5 is therefore coded "1", i.e., a predicted MM purchase
head(logpred)
## [1] "1" "0" "0" "0" "0" "0"
(Check whether the fitted probabilities correspond to MM or
CH; hint: contrasts(train$Purchase).)

#create confusion matrix (MM==1, CH==0)
table (logpred, test_OJ[,"Purchase"])
##
## logpred CH MM
## 0 113 61
## 1 19 21
# Convert logpred from 0/1 back to the factor labels ("0" = CH, "1" = MM, matching the glm coding of MM as the second level) so it can be compared with the test data
logpred <- factor(logpred, levels = c("0", "1"), labels = c("CH", "MM"))
#Testing error rate ((19 + 61)/214 observations are misclassified in the confusion matrix above)
error_rate <- 1 - mean(logpred==test_OJ[,"Purchase"])
error_rate
## [1] 0.3738318
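Equivalently, the testing error rate can be read straight off a confusion matrix as one minus the share of observations on the diagonal (a sketch of the same calculation):
#Error rate = 1 - (correctly classified) / (total), using the diagonal of the confusion matrix
cm <- table(Predicted = logpred, Actual = test_OJ$Purchase)
1 - sum(diag(cm)) / sum(cm)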
Predict the dependent variable Purchase using the LDA method.
(Independent variables are PriceCH, PriceMM,
STORE, and PriceDiff.)

#Set up train and test data sets for the LDA method
set.seed(123)
sample <- sample(c(TRUE, FALSE), nrow(OJ), replace = T, prob = c(0.8,0.2))
train_LDA <- OJ[sample,]
test_LDA <- OJ[!sample, ]
nrow(OJ)
## [1] 1070
nrow(train_LDA)
## [1] 863
nrow(test_LDA)
## [1] 207
lda_OJ <- lda(Purchase ~ PriceCH + PriceMM + STORE + PriceDiff, data = train_LDA)
lda_OJ
## Call:
## lda(Purchase ~ PriceCH + PriceMM + STORE + PriceDiff, data = train_LDA)
##
## Prior probabilities of groups:
## CH MM
## 0.6071842 0.3928158
##
## Group means:
## PriceCH PriceMM STORE PriceDiff
## CH 1.867481 2.099046 1.511450 0.20215649
## MM 1.869676 2.067640 1.843658 0.05132743
##
## Coefficients of linear discriminants:
## LD1
## PriceCH -1.2822666
## PriceMM -0.9924425
## STORE 0.2745856
## PriceDiff -3.2852832
plot(lda_OJ)
#Confirm the order of the classes in lda.pred$posterior (the response variable, Purchase, is a factor with 2 levels, CH and MM):
levels(train_OJ$Purchase)
## [1] "CH" "MM"
lda.pred <- predict(lda_OJ, newdata=test_OJ)
# number of test observations predicted as CH, i.e., where the posterior probability of the first level (CH, column 1) exceeds 0.5
sum(lda.pred$posterior[, 1] > .5)
## [1] 169
#confusion matrix (rows = actual Purchase, columns = predicted class)
table(test_OJ$Purchase, lda.pred$class)
##
## CH MM
## CH 114 18
## MM 55 27
#error rate
1 - mean(test_OJ$Purchase==lda.pred$class)
## [1] 0.3411215
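As a sanity check on the posterior probabilities (a sketch, not run above): with only two classes, predict()'s class output is simply whichever level has the larger posterior, so thresholding the CH column at 0.5 should reproduce lda.pred$class:
#Rebuild the class predictions from the posterior and compare with lda.pred$class
lda_class_check <- ifelse(lda.pred$posterior[, "CH"] > 0.5, "CH", "MM")
table(lda_class_check, lda.pred$class)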
Predict Purchase using the KNN method when k=3. (Use only
PriceCH, PriceMM, STORE, and
PriceDiff; hint: cbind().)

#Remember to create testing and training sets without the variable `Purchase`. You do not need to create a new testing and training set as in question 1; you could just remove one variable from your already created data set.
set.seed(1234)
ind <- sample(2, nrow(OJ), replace=TRUE, prob=c(0.80, 0.20))
OJ.training <- OJ[ind==1, 2:4]
OJ.test <- OJ[ind==2, 2:4]
OJ.trainLabels <- OJ[ind==1, 1]
OJ.testLabels <- OJ[ind==2, 1]
str(OJ.training)
## 'data.frame': 847 obs. of 3 variables:
## $ WeekofPurchase: num 237 239 245 227 230 232 234 235 238 240 ...
## $ StoreID : num 1 1 1 1 7 7 7 7 7 7 ...
## $ PriceCH : num 1.75 1.75 1.86 1.69 1.69 1.69 1.75 1.75 1.75 1.86 ...
str(OJ.test)
## 'data.frame': 223 obs. of 3 variables:
## $ WeekofPurchase: num 228 268 278 269 272 274 271 227 259 260 ...
## $ StoreID : num 7 7 7 7 7 7 4 4 4 4 ...
## $ PriceCH : num 1.69 1.86 2.06 1.86 1.86 1.86 1.99 1.79 1.99 1.99 ...
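For reference, the hint in the prompt suggests building the predictor matrices with cbind() on the four named variables; a minimal sketch of that alternative construction is below (note that the split above instead kept columns 2:4, i.e., WeekofPurchase, StoreID, and PriceCH):
#Alternative predictor matrices using only the four variables named in the prompt
X <- cbind(OJ$PriceCH, OJ$PriceMM, OJ$STORE, OJ$PriceDiff)
OJ.training4 <- X[ind == 1, ]
OJ.test4 <- X[ind == 2, ]
#knn(train = OJ.training4, test = OJ.test4, cl = OJ.trainLabels, k = 3) would then use these predictors instead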
#KNN model with k=3
OJ_pred <- knn(train = OJ.training, test = OJ.test, cl = OJ.trainLabels, k=3)
OJ_pred
## [1] CH CH CH CH CH CH CH MM CH MM CH CH CH CH CH CH MM CH MM MM MM CH CH MM CH
## [26] CH MM MM MM CH CH CH MM CH CH CH CH CH CH CH CH CH CH MM CH MM CH MM MM MM
## [51] CH CH CH MM MM CH CH MM CH CH CH CH CH CH CH CH CH MM CH CH MM CH MM CH MM
## [76] MM MM MM MM CH MM MM CH CH CH CH MM CH CH CH CH CH MM MM MM MM CH MM MM CH
## [101] MM CH CH CH CH MM CH CH CH CH CH MM MM MM CH CH MM MM CH CH CH CH CH CH CH
## [126] CH MM CH MM MM MM MM MM CH MM CH CH CH CH CH CH CH MM CH CH CH MM CH MM CH
## [151] CH CH MM MM CH CH MM MM CH CH MM MM MM CH CH MM MM MM MM MM CH CH MM MM MM
## [176] MM MM MM CH CH CH MM CH MM MM MM CH CH CH CH CH MM MM CH CH CH CH MM CH MM
## [201] MM CH MM CH CH CH CH CH MM CH CH CH CH MM CH CH CH CH MM MM MM CH CH
## Levels: CH MM
table(OJ_pred, OJ.testLabels)
## OJ.testLabels
## OJ_pred CH MM
## CH 105 32
## MM 39 47
1-mean(OJ_pred==OJ.testLabels)
## [1] 0.3183857
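One possible refinement (a sketch under the assumption that the same predictors are kept): KNN is distance-based, so variables on very different scales (WeekofPurchase is in the hundreds while PriceCH is around 2) can dominate the distance calculation. Standardizing the training columns and applying the same centering and scaling to the test columns may improve the fit:
#Standardize the training predictors, then apply the identical transformation to the test predictors
train_scaled <- scale(OJ.training)
test_scaled <- scale(OJ.test,
                     center = attr(train_scaled, "scaled:center"),
                     scale = attr(train_scaled, "scaled:scale"))
OJ_pred_scaled <- knn(train = train_scaled, test = test_scaled, cl = OJ.trainLabels, k = 3)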
Comparing your three predictions of Purchase, which model would you prefer and why? Are you
satisfied with the predictions? (Hint: testing error rate)

Based on the 3 classification models, I would prefer the KNN model because it has the lowest testing error rate (0.318) and the highest precision (0.766); of the three, it is the least likely to classify a purchase as CH when the purchase was in fact MM (a false positive). The next best model is LDA, with an error rate of 0.341 and a precision of 0.675. The weakest performer, from an accuracy standpoint, is the logistic regression model, with an error rate of 0.374 and a precision of 0.649. While better than LDA and logistic regression, the KNN model still misclassifies 0.318 (almost 1/3) of the observations as the wrong type of purchase (CH when MM, or MM when CH). The KNN model could be improved by finding the optimal value of k for this data set (a larger k gives a smoother, less overfit decision boundary), and revisiting which independent variables are included may also help; more analysis is needed. Furthermore, there are 653 CH purchases and only 417 MM purchases (see the table directly below), suggesting that the underlying OJ data set is slightly imbalanced. Resampling techniques such as SMOTE (Synthetic Minority Over-sampling Technique), or a weighted KNN model, may be needed. SMOTE refers to the practice of artificially creating new minority-class data points from the attributes of the original minority-class observations in the data set, so that predicted values have a greater chance of being classified according to their true values given the underlying attributes.
#Count the number of CH and MM purchases
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.2
## Warning: package 'readr' was built under R version 4.4.2
## Warning: package 'dplyr' was built under R version 4.4.2
## Warning: package 'forcats' was built under R version 4.4.2
## Warning: package 'lubridate' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Purchase_Counts_OJ <- OJ %>%
group_by(Purchase) %>%
summarize(count = n())
Purchase_Counts_OJ
## # A tibble: 2 × 2
## Purchase count
## <fct> <int>
## 1 CH 653
## 2 MM 417
Models = c("Logistic Reg", "LDA", "KNN k=3")
Error.Rate = c("0.3738318","0.3411215","0.3183857")
Precision = c(113/(113+61), 114/(114+55), 105/(105+32)) #(correctly predicted CH) / (total predicted CH), taken from each confusion matrix above
Precision = round(Precision, 3)
text_tbl = data.frame(Models, Error.Rate, Precision)
kable(text_tbl) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| Models | Error.Rate | Precision |
|---|---|---|
| Logistic Reg | 0.3738318 | 0.649 |
| LDA | 0.3411215 | 0.675 |
| KNN k=3 | 0.3183857 | 0.766 |
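As noted in the discussion above, one way to improve the KNN model is to search over k rather than fixing k = 3. A minimal sketch of that idea (an illustration only, not run here):
#Compute the testing error rate for a range of k values and pick the smallest
err_by_k <- sapply(1:20, function(k) {
  pred_k <- knn(train = OJ.training, test = OJ.test, cl = OJ.trainLabels, k = k)
  mean(pred_k != OJ.testLabels)
})
which.min(err_by_k)
In practice this search would ideally use a separate validation split or cross-validation rather than the test set itself.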
After completing this assignment, I have a better understanding of classification methods for predicting the value of a categorical response variable, such as whether CH or MM was purchased, given a variety of independent variables contained in the underlying OJ dataset (including PriceCH and PriceMM). Upon further research, I learned how the logistic regression equation is derived to assign each observation a probability (between 0 and 1, with > 0.5 suggesting one category) of belonging to one category (e.g., an MM purchase) rather than the other (a CH purchase), as well as how the log-odds transformation converts the logistic model into a linear equivalent. I would like to find more resources that explain in greater detail how the log-odds and logistic regression equations are derived; this would help me better understand the statistical background behind the classification models explored in module 3.
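For reference, a short sketch of the formula discussed above, using the predictors from this assignment: writing p = P(Purchase = "MM"), the fitted logistic model has the form log(p / (1 - p)) = b0 + b1*PriceCH + b2*PriceMM + b3*STORE + b4*PriceDiff, so the log-odds are linear in the predictors; solving for p gives p = 1 / (1 + exp(-(b0 + b1*PriceCH + b2*PriceMM + b3*STORE + b4*PriceDiff))).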
Using R code, we learned how to do basic exploratory analysis of the independent variables, in chart and graph form, prior to running any classification model. This allowed us to understand what to look for (normal distribution, low variance, no outliers, etc.) in the independent variables when constructing a classification model. This is an important step that cannot be overlooked: a statistical model created in a software program is only as good as the underlying data. To this end, we also learned not just how to interpret the results of the logistic regression, LDA, and KNN models, but also how to check the accuracy of each model. Accuracy can be improved through SMOTE, weighted neighbors, and other techniques.