Course: MBA563 Module: 06 Author: PT
In this video, let’s get some hands-on experience using logistic regression. Logistic regression is a valuable classification-type machine learning algorithm. In this video, we will use the algorithm to predict whether a transaction is from a loyalty customer or not. We will use the TECA dataset.
TECA wants to increase its number of loyalty customers and the number of products each loyalty customer buys, because loyalty customers generate more business, and more profitable business, for TECA.
As always, we encourage you to resist the urge to just watch these videos. Rather, please follow along, doing each of the steps we are doing, so that you can get the critical hands-on experience necessary to learn this valuable tool. First, let’s bring in the data and the needed packages.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df1_even_kp <- read_rds('logistic1.rds')
Next, run the following lines of code. In these lines I have created a very basic function that will help us evaluate the quality of our logistic regression model later on.
We will use this code a few times, so it made sense to put it into a function that can be reused without having to copy and paste all of this code every time we want to use it.
# Make reusable Confusion Matrix function
my_confusion_matrix <- function(cf_table) {
  true_positive <- cf_table[4]
  true_negative <- cf_table[1]
  false_positive <- cf_table[2]
  false_negative <- cf_table[3]
  accuracy <- (true_positive + true_negative) / (true_positive + true_negative + false_positive + false_negative)
  sensitivity_recall <- true_positive / (true_positive + false_negative)
  specificity_selectivity <- true_negative / (true_negative + false_positive)
  precision <- true_positive / (true_positive + false_positive)
  neg_pred_value <- true_negative / (true_negative + false_negative)
  print(cf_table)
  my_list <- list(sprintf("%1.0f = True Positive (TP), Hit", true_positive),
                  sprintf("%1.0f = True Negative (TN), Rejection", true_negative),
                  sprintf("%1.0f = False Positive (FP), Type 1 Error", false_positive),
                  sprintf("%1.0f = False Negative (FN), Type 2 Error", false_negative),
                  sprintf("%1.4f = Accuracy (TP+TN/(TP+TN+FP+FN))", accuracy),
                  sprintf("%1.4f = Sensitivity, Recall, Hit Rate, True Positive Rate (How many positives did the model get right? TP/(TP+FN))", sensitivity_recall),
                  sprintf("%1.4f = Specificity, Selectivity, True Negative Rate (How many negatives did the model get right? TN/(TN+FP))", specificity_selectivity),
                  sprintf("%1.4f = Precision, Positive Predictive Value (How good are the model's positive predictions? TP/(TP+FP))", precision),
                  sprintf("%1.4f = Negative Predictive Value (How good are the model's negative predictions? TN/(TN+FN))", neg_pred_value)
  )
  return(my_list)
}
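As a quick aside, here is a toy example, with made-up prediction and truth vectors just for illustration, showing why the function indexes the table this way. R stores a 2x2 table column by column, so position 1 is the top-left cell and position 4 is the bottom-right cell.
# Toy example with made-up data: how cf_table[1] through cf_table[4] map to cells
pred <- c(FALSE, TRUE, TRUE, FALSE, TRUE)
truth <- factor(c('not loyal', 'loyal', 'not loyal', 'not loyal', 'loyal'),
                levels = c('not loyal', 'loyal'))
toy_table <- table(pred, truth)
toy_table     # rows are predictions, columns are the truth
toy_table[1]  # top-left: true negative
toy_table[2]  # bottom-left: false positive
toy_table[3]  # top-right: false negative
toy_table[4]  # bottom-right: true positive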
Now, let’s go through our six steps for using a machine learning algorithm. Recall that these steps are the following:
1. Data scrubbing
2. Algorithm selection
3. Model training or fitting
4. Model evaluation
5. Model improvement
6. Deploy the model
We have done all of the data scrubbing, so number 1 is already complete. For number 2, we have selected logistic regression as our classification algorithm. So, let’s move to number 3.
We will break this step down into multiple parts. Our goal is to train a univariate logistic regression model using our data. That is, we will predict whether a particular transaction is from a loyalty customer or not.
First, we need to select which independent/input variable we are going to use to predict our dependent variable. Let’s look at the data to remind ourselves what it looks like.
slice_sample(df1_even_kp, n=10)
## loyalty category quarter state
## 9459 loyal Candy/Gum 2 Nebraska
## 3395481 not loyal Fuel 4 Alabama
## 1247649 not loyal Hot Dispensed Beverage 4 Oklahoma
## 1348527 not loyal Fuel 1 Nebraska
## 345779 loyal Fuel 4 Nebraska
## 346049 loyal Fuel 1 Oklahoma
## 619162 not loyal Lottery 3 Missouri
## 511317 loyal Cold Dispensed Beverage 2 Missouri
## 554068 not loyal Juice/tonics 4 Arkansas
## 12906 loyal Lottery 2 Missouri
loyalty is our dependent/output/target variable. We need
to pick one of the other three to be our independent/input variable.
Since TECA is concerned about predicting which transactions will be
loyalty transactions, it probably makes the most sense to use
category as our independent variable.
Before we train our model, we have a few more things to do to get ready. First, let’s look at the baseline percentage of the occurrence of loyalty. To do that, we will use the function table(). This function uses the levels of categorical variables to build a contingency table of the counts at each combination of variable levels. We would like to build a table of how many loyal and not loyal transactions we have and then calculate the percentage of loyal transactions.
Let’s describe what is done here. First, the table() function is used to create a table from the loyalty variable. Second, we printed that table. Next, we printed the percentage of loyal transactions divided by the total transactions. This is done by indexing the table with square brackets [] to select its individual cells. Thus, for example, loyal_table[2] = 519744.
loyal_table <- table(df1_even_kp$loyalty)
print(loyal_table)
##
## not loyal loyal
## 520000 519744
print(loyal_table[2]/(loyal_table[1]+loyal_table[2]))
## loyal
## 0.4998769
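As a side note, base R can compute the same baseline in one step with prop.table(), which converts a table of counts into a table of proportions; this is just an alternative to the manual indexing above.
# Alternative: turn the counts into proportions in one step
prop.table(table(df1_even_kp$loyalty))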
Next, I want to double check that the levels of loyalty
are in the order I think they are. Logistic regression and my confusion
matrix function are going to use factors in the order they are in, so I
need to understand what that order is so that I can interpret the output
of the model correctly. The contrasts() function will check
that for me. I would like loyal to be coded as 1.
contrasts(df1_even_kp$loyalty)
## loyal
## not loyal 0
## loyal 1
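If the order had been wrong, say with loyal coded as 0, we could have reordered the factor before training. A minimal sketch using base R’s relevel(), which moves the chosen level into the reference (0) position; here it changes nothing, because not loyal is already the reference.
# Reorder the factor if needed; a no-op here since 'not loyal' is already the reference
df1_even_kp$loyalty <- relevel(df1_even_kp$loyalty, ref = 'not loyal')
contrasts(df1_even_kp$loyalty)  # confirm loyal is still coded as 1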
This is correct, so we can move on to our next step. The next step in training our data is to split it into training and testing sets. Recall that we want to train our model on one set of data, but then make sure it works on other data, so that we are not overfitting our model. I will use the caret package to split the data, though there are myriad ways of doing this. However, as you continue to explore and work on classification models, it is likely you will use the caret package.
The second line sets a seed so we can get the same split again. The next line uses the caret package to select 75% of the entries in the column loyalty. We could have used a different percentage, like 80% or 70%. This creates a matrix of numbers that we can use to select the rows from our dataset that will become the training data. That is what we have done in the fourth line. Here, we create a dataset called data_train by taking the original dataset and telling it to only keep the rows we randomly selected in partition. The square brackets here are again using indexing to tell R which rows to take. We put nothing after the comma, which tells R to take all of the columns. We could have instead selected only some subset of the columns. Next, we select the remaining rows, using the - symbol in front of partition, to make the testing dataset, data_test. Finally, we print the percentage of rows that are in the training data to make sure it is close to 75%.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
set.seed(77)
partition <- caret::createDataPartition(y=df1_even_kp$loyalty, p=.75, list=FALSE)
data_train <- df1_even_kp[partition,]
data_test <- df1_even_kp[-partition,]
print(nrow(data_train)/(nrow(data_test)+nrow(data_train)))
## [1] 0.75
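For reference, here is one of those other ways to split: a plain base-R approach with sample(). Note this sketch is not stratified on loyalty the way createDataPartition() is, so the class balance can drift slightly between the two pieces.
# Base-R alternative split (not stratified by loyalty)
set.seed(77)
train_rows <- sample(nrow(df1_even_kp), size = floor(0.75 * nrow(df1_even_kp)))
alt_train <- df1_even_kp[train_rows, ]
alt_test <- df1_even_kp[-train_rows, ]
nrow(alt_train) / nrow(df1_even_kp)  # should be very close to 0.75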
With our data split, we are ready to train our model. Recall that an algorithm is a set of rules for how to make a model. A model takes our data and the options we select to create new data/information and additional rules for how to use the model. Below, we create our model, called model_train, and then summarize it to view the output. We use the glm function with the family=binomial argument to select logistic regression. The syntax is really simple here. Our dependent variable, loyalty, goes first. Then we use ~ to act like an equals sign, followed by our independent variable, category in this case. Finally, we list the dataset. We are, of course, using the training data, since we are training the model.
There is a lot of output here, so let’s pick out only the important stuff. Let’s focus on the Coefficients area. There we have coefficients listed for our variable under the Estimate column. Further to the right we have the z value and the p-value (Pr(>|z|)). What does all of this mean? The coefficients tell us the effect of that variable on the dependent variable, loyalty. The glm function is smart enough to know that category is a categorical variable and not a continuous variable. Thus, glm automatically applied something called “dummy coding” (a close cousin of “one-hot encoding”) to our variable. That is, it took the variable and essentially created individual “dummy variables” for the levels of category. Thus, instead of having one category variable, we now have 19 dummy variables, one per level. Thus, categoryBeer is a new variable that has a 1 if that row is the level Beer and a 0 otherwise. That is an example of a dummy variable. Likewise, categoryPizza has a 1 if that row has Pizza in the category column and 0 otherwise. Hopefully you notice an issue, however. We have 20 levels of category but only 19 dummy variable coefficients. That is because glm put one of our levels into the intercept term. Specifically, Bakery is in the intercept. This is done to avoid over-specifying the model. It is important to know which level is not in the model because it affects how we interpret the results. Specifically, the effect of the other coefficients is in reference to the level in the intercept.
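You can see this dummy coding for yourself with model.matrix(), which builds the same design matrix glm uses internally. Each row gets a 1 in the intercept column and a 1 in at most one category column; a Bakery row has 0s in all 19 category columns.
# Peek at the dummy-coded design matrix that glm builds from category
head(model.matrix(~ category, data = data_train), n = 5)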
We can now interpret. The intercept captures the effect of bakery items on whether a transaction is from a loyalty customer or not. The coefficient is positive, which means that selling a bakery item increases the likelihood that a transaction is a loyalty transaction. More precisely, the log odds of a loyalty purchase for a bakery item are 0.30575, which works out to a probability above 50%. We won’t go deep into log odds here; it suffices to say that buying a bakery item increases the likelihood of the transaction being from a loyalty customer.

Let’s look at some more categories. Fuel actually decreases the chance of the transaction being from a loyalty customer relative to a bakery item, as can be seen by its negative coefficient. More specifically, Fuel’s coefficient plus the intercept is still negative, meaning a fuel purchase is less likely than a coin flip to be a loyalty transaction. This makes sense, since a bakery item is one you go into the store for, which is more likely to happen for a loyalty customer, whereas anyone can drive by and purchase gas.

Next, fountain soda, Cold Dispensed Beverage, increases the chance the purchase was from a loyalty customer, even more than purchasing a bakery item. Thus, fountain soda seems to be a big draw for loyalty customers and could be a way for TECA to increase loyalty purchases.

Finally, what effect does Bread & Cakes have? It has a small negative coefficient, so it is somewhat less associated with loyalty than Bakery. If we add the coefficient to the intercept, though, the sum is still positive, so a Bread & Cakes purchase is still more likely than not to be a loyalty purchase. We have to compare to the intercept, since all of the coefficients are relative to Bakery.
Next, it is not enough to see what the sign of the coefficient is; we also need to see whether the coefficient has a significant effect on loyalty. We see this in at least two ways. First, we can look at the p-value (Pr(>|z|)). Roughly, this is the probability that the coefficient is actually 0. This is not the technical definition, but the point to remember is that a low p-value means the coefficient is unlikely to be 0 and is statistically significant. Additionally, the stars (***) indicate significance, with more stars meaning it is less likely the coefficient is 0. We have a lot of data here, so even small coefficients are likely to be significant. Sure enough, only Energy and Hot Dispensed Beverage are not significant relative to Bakery. So, overall, what do we learn here? We know now that several categories, such as fountain sodas, bakery items, and breakfast sandwiches, help increase the chance a customer will be a loyalty customer. TECA could promote these products to try to draw in more loyalty customers.
model_train <- glm(loyalty ~ category, family=binomial, data=data_train)
summary(model_train)
##
## Call:
## glm(formula = loyalty ~ category, family = binomial, data = data_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.30575 0.01539 19.870 < 2e-16 ***
## categoryBeer -1.15590 0.02101 -55.009 < 2e-16 ***
## categoryBread & Cakes -0.11000 0.02173 -5.063 4.14e-07 ***
## categoryBreakfast Sandwiches 0.22179 0.02156 10.288 < 2e-16 ***
## categoryCandy/Gum -0.33491 0.01816 -18.442 < 2e-16 ***
## categoryChips -0.20758 0.02392 -8.677 < 2e-16 ***
## categoryCigarettes -0.48665 0.01769 -27.517 < 2e-16 ***
## categoryCigars -0.95134 0.02544 -37.395 < 2e-16 ***
## categoryCold Dispensed Beverage 0.46045 0.01706 26.991 < 2e-16 ***
## categoryEnergy 0.02773 0.01797 1.544 0.12269
## categoryFuel -0.63600 0.01622 -39.211 < 2e-16 ***
## categoryHot Dispensed Beverage -0.01130 0.01866 -0.605 0.54487
## categoryHot Sandwiches & Chicken -0.06145 0.02298 -2.674 0.00749 **
## categoryJuice/tonics -0.40630 0.01752 -23.196 < 2e-16 ***
## categoryLottery -0.91991 0.01865 -49.313 < 2e-16 ***
## categoryPizza 0.10643 0.02130 4.996 5.84e-07 ***
## categoryPop (587) -0.31478 0.01706 -18.448 < 2e-16 ***
## categoryRoller Grill -0.18406 0.02183 -8.430 < 2e-16 ***
## categorySalty Snacks -0.28467 0.01997 -14.252 < 2e-16 ***
## categorySmokeless (951) -0.67115 0.02315 -28.989 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1081043 on 779807 degrees of freedom
## Residual deviance: 1050668 on 779788 degrees of freedom
## AIC: 1050708
##
## Number of Fisher Scoring iterations: 4
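To make the log odds more concrete, plogis() converts log odds into probabilities (it is the inverse of the log-odds transformation). Plugging in the estimates above reproduces the kinds of probabilities we will see from predict() in a moment.
# Convert log odds into probabilities with plogis()
plogis(0.30575)            # Bakery (intercept alone): about 0.58
plogis(0.30575 - 0.63600)  # Fuel: about 0.42, below a coin flip
plogis(0.30575 + 0.46045)  # Cold Dispensed Beverage: about 0.68
# Exponentiating the coefficients gives odds ratios, another common way to read them
round(exp(coef(model_train)), 3)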
Our next step is to evaluate the model we have created by predicting the probabilities of each row and examining how accurate the model is at predicting a loyalty purchase. Recall that logistic regression uses the independent variables to create a probability for each row that maximizes the likelihood that that row is close to its true category, loyal or not loyal. We should predict those probabilities and examine their accuracy, since we have labeled data and know the true category for each row.
First, let’s predict the probabilities. We really want to do this with the testing data, but we will do this on the training data first, and then compare to the held-out testing data next.
Let’s examine this code line by line. The first line is the prediction of the probabilities. We are predicting using the model we just created, model_train. Next, we are using the training data, data_train. The argument type='response' gives the predicted probabilities rather than the log odds, which are more difficult to interpret. Next, we print a summary of the probabilities, just so we can get a sense of what they look like. To further evaluate the model, we take the probabilities, predict_train, and put them into the training data as a new column, data_train$prediction. Finally, we take a look at a snapshot of the data. You can see that the probabilities here generally make sense. All of these first 10 rows are from loyalty customers, and most of their probabilities are relatively high, considering our highest probability is only about 68%; the Beer transaction, at about 30%, is one the model gets wrong.
predict_train <- predict(model_train, newdata=data_train, type='response')
print(summary(predict_train))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2994 0.4182 0.4927 0.4999 0.5731 0.6827
data_train$prediction <- predict_train
head(data_train, n=10)
## loyalty category quarter state prediction
## 1 loyal Bread & Cakes 3 Colorado 0.5487819
## 2 loyal Candy/Gum 1 Wyoming 0.4927105
## 3 loyal Pop (587) 3 Wyoming 0.4977430
## 4 loyal Pop (587) 4 Colorado 0.4977430
## 5 loyal Chips 1 Oklahoma 0.5245229
## 6 loyal Juice/tonics 2 Colorado 0.4748839
## 7 loyal Juice/tonics 3 Wyoming 0.4748839
## 9 loyal Beer 3 Alabama 0.2994030
## 10 loyal Chips 2 Oklahoma 0.5245229
## 11 loyal Cigarettes 1 Nebraska 0.4548983
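Another quick sanity check, shown here as a sketch, is to compare the average predicted probability within each true class; if the model has learned anything useful, the loyal rows should average higher than the not loyal rows.
# Average predicted probability within each true class
tapply(data_train$prediction, data_train$loyalty, mean)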
Next, let’s evaluate our model in another way, by looking at how accurate the model is at making correct predictions. There are a lot of ways to think about the accuracy of this model. You can think of how accurate the model was at detecting loyalty, but you can also think about how accurate the model was at detecting not loyalty. Both of these types of accuracy matter to us, since we want an overall accurate model. A typical way to visualize accuracy is by using a confusion or classification matrix. Again, we are using the training data just so we can compare to the testing data next.
Let’s again look at this code line by line. The first line creates a table using the table function. This table takes the probability predictions from above, predict_train, and writes “FALSE” if the probability is less than 0.5 and “TRUE” if it is more than 0.5. These are our predictions using the training data. You can see the FALSE and TRUE categories on the left of the table. It then compares these to the truth, which is the loyalty column from the data_train dataset. The second line then runs the function we created above. This function just takes the four cells of this table, labels them, and manipulates them to create the various types of accuracy displayed below.
Let’s explore all of this. The table shows the following output:
* When a transaction is actually/truthfully a loyalty transaction, the model correctly classifies it as such, by saying “TRUE”, 185131 times. This is called a True Positive (TP), Hit.
* When a transaction is truthfully a non-loyalty transaction, the model correctly classifies it as such, by saying “FALSE”, 265829 times. This is a True Negative (TN), Rejection.
* On the other hand, the model makes two kinds of errors. When a transaction is not a loyalty transaction but the model incorrectly says it is, this is called a False Positive (FP), Type 1 Error. It happens 124171 times.
* Finally, when a transaction is a loyalty transaction and the model says it is not, which here happens 204677 times, it is called a False Negative (FN), Type 2 Error.
These numbers can then be manipulated to create different measures of accuracy, as follows:
* Overall Accuracy (TP+TN/(TP+TN+FP+FN)) is 0.578296.
* Sensitivity, Recall, Hit Rate, True Positive Rate (How many positives did the model get right? TP/(TP+FN)) is 0.474929.
* Specificity, Selectivity, True Negative Rate (How many negatives did the model get right? TN/(TN+FP)) is 0.681613.
* Precision, Positive Predictive Value (How good are the model’s positive predictions? TP/(TP+FP)) is 0.598544.
* Negative Predictive Value (How good are the model’s negative predictions? TN/(TN+FN)) is 0.564985.
Thus, overall, our model predicts better than chance. Recall that loyalty transactions happen about 50% of the time. Thus, if we had no model at all and just flipped a coin every time to guess whether a transaction was going to be a loyalty transaction, we would get it right 50% of the time. Our model helps us get it right about 58% of the time, which is not super high, but is better than chance. Where the model really fails right now is in sensitivity. That is, the model is pretty good at predicting not loyal, at 68%, but bad at predicting loyal, getting only 47% correct: worse than flipping a coin. So, our next step is to improve the model.
table1 <- table(predict_train>0.5, data_train$loyalty) #prediction on left and truth on top
my_confusion_matrix(table1)
##
## not loyal loyal
## FALSE 265829 204677
## TRUE 124171 185131
## [[1]]
## [1] "185131 = True Positive (TP), Hit"
##
## [[2]]
## [1] "265829 = True Negative (TN), Rejection"
##
## [[3]]
## [1] "124171 = False Positive (FP), Type 1 Error"
##
## [[4]]
## [1] "204677 = False Negative (FN), Type 2 Error"
##
## [[5]]
## [1] "0.5783 = Accuracy (TP+TN/(TP+TN+FP+FN))"
##
## [[6]]
## [1] "0.4749 = Sensitivity, Recall, Hit Rate, True Positive Rate (How many positives did the model get right? TP/(TP+FN))"
##
## [[7]]
## [1] "0.6816 = Specificity, Selectivity, True Negative Rate (How many negatives did the model get right? TN/(TN+FP))"
##
## [[8]]
## [1] "0.5985 = Precision, Positive Predictive Value (How good are the model's positive predictions? TP/(TP+FP))"
##
## [[9]]
## [1] "0.5650 = Negative Predictive Value (How good are the model's negative predictions? TN/(TN+FN))"
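As an aside, one simple lever for improving sensitivity without changing the model at all is the 0.5 cutoff itself. Lowering it makes the model say “TRUE” more often, trading specificity for sensitivity. A sketch with a hypothetical 0.45 cutoff:
# Lower the cutoff from 0.5 to 0.45 to trade specificity for sensitivity
table_lower <- table(predict_train > 0.45, data_train$loyalty)
my_confusion_matrix(table_lower)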
However, first, let’s take our trained model and predict/run it on the testing data. Recall that we train on the training data and test on the testing data. We want to test on new data so that the model isn’t just memorizing the training data, but rather can work on any data.
This, of course, looks very similar to the code above with the
key exception that the data being used is the test
data, data_test.
How does our model perform on the new data? It actually does a little bit better than on the training data. While accuracy is very similar, our problem from before, the low sensitivity, is a tiny bit improved.
predict_test <- predict(model_train, newdata=data_test, type='response')
print(summary(predict_test))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2994 0.4182 0.4927 0.5003 0.5731 0.6827
data_test$prediction <- predict_test
head(data_test, n=10)
## loyalty category quarter state prediction
## 8 loyal Bread & Cakes 2 Alabama 0.5487819
## 13 loyal Pop (587) 4 Iowa 0.4977430
## 16 loyal Salty Snacks 3 Colorado 0.5052708
## 20 loyal Cold Dispensed Beverage 1 Missouri 0.6826999
## 21 loyal Fuel 2 Alabama 0.4181812
## 24 loyal Smokeless (951) 1 Wyoming 0.4096534
## 26 loyal Juice/tonics 4 Wyoming 0.4748839
## 34 loyal Candy/Gum 1 Oklahoma 0.4927105
## 36 loyal Juice/tonics 1 Alabama 0.4748839
## 37 loyal Energy 4 Oklahoma 0.5826080
table2 <- table(predict_test>.5, data_test$loyalty) #prediction on left and truth on top
my_confusion_matrix(table2)
##
## not loyal loyal
## FALSE 88248 67847
## TRUE 41752 62089
## [[1]]
## [1] "62089 = True Positive (TP), Hit"
##
## [[2]]
## [1] "88248 = True Negative (TN), Rejection"
##
## [[3]]
## [1] "41752 = False Positive (FP), Type 1 Error"
##
## [[4]]
## [1] "67847 = False Negative (FN), Type 2 Error"
##
## [[5]]
## [1] "0.5784 = Accuracy (TP+TN/(TP+TN+FP+FN))"
##
## [[6]]
## [1] "0.4778 = Sensitivity, Recall, Hit Rate, True Positive Rate (How many positives did the model get right? TP/(TP+FN))"
##
## [[7]]
## [1] "0.6788 = Specificity, Selectivity, True Negative Rate (How many negatives did the model get right? TN/(TN+FP))"
##
## [[8]]
## [1] "0.5979 = Precision, Positive Predictive Value (How good are the model's positive predictions? TP/(TP+FP))"
##
## [[9]]
## [1] "0.5653 = Negative Predictive Value (How good are the model's negative predictions? TN/(TN+FN))"
Last time, we predicted loyalty using a single variable, category. This time we will use all of our variables. We will see if this improves the prediction accuracy of our model. Before we get to the model statement, the code is identical to the prior video, where we used only one variable.
Of our six steps for using a machine learning algorithm, we are now focused on step 5.
1. Data scrubbing
2. Algorithm selection
3. Model training or fitting
4. Model evaluation
5. Model improvement
6. Deploy the model
With our data split up above, we are ready to train our new model.
Recall that an algorithm is a set of rules for how to make a model. A
model takes our data and options we select to create new
data/information and additional rules for how to use the model. Below,
we create our model, called model_train and then summarize
it to view the output. We use the glm function with the
family=binomial argument to select logistic regression. Our
dependent variable will continue to be loyalty, but this
time we will include all of our variables in the model. We are, of
course, using the training data since we are training the model.
Last time we examined the impact of category on loyalty. Let’s look at quarter. The quarter is just the quarter of the year. I make it a factor so that R knows it is a category. There is some debate about whether we could leave it as a continuous variable, since the difference between one quarter of the year and another does have meaning, but we will leave it as a factor. The first quarter of the year is the reference level of these multiple dummy variables. Thus, when we evaluate the effects of these quarters, it is in relationship to the effect of quarter 1. Looking at the signs of the coefficients and the p-values, we see that each of quarters 2 through 4 significantly increases the probability that a transaction will be a loyalty transaction. For example, the third quarter of the year has a positive coefficient of 0.216750 and a very low p-value. This means that relative to quarter 1, quarter 3 has more transactions that are from loyalty customers. With this knowledge, TECA could consider focusing on the beginning of the year in an effort to increase the number of their loyalty customers. Additionally, TECA could focus on summer and fall as key times when loyalty customers are active to promote to them and increase their purchases even more.
Next, let’s look at how each state affects the occurrence of loyalty transactions. The reference state is Alabama. Relative to Alabama, the states that are most likely to generate loyalty transactions are Colorado, Missouri, and Wyoming. Missouri, in particular, has a strong effect, as can be seen by its relatively large coefficient. TECA should investigate key stores in this state to see what is going right.
model_train <- glm(loyalty ~ category + factor(quarter) + state, family=binomial, data=data_train)
summary(model_train)
##
## Call:
## glm(formula = loyalty ~ category + factor(quarter) + state, family = binomial,
## data = data_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.254291 0.016633 15.289 < 2e-16 ***
## categoryBeer -1.096064 0.021205 -51.689 < 2e-16 ***
## categoryBread & Cakes -0.114865 0.021934 -5.237 1.63e-07 ***
## categoryBreakfast Sandwiches 0.220034 0.021724 10.129 < 2e-16 ***
## categoryCandy/Gum -0.341969 0.018316 -18.670 < 2e-16 ***
## categoryChips -0.204701 0.024131 -8.483 < 2e-16 ***
## categoryCigarettes -0.511199 0.017849 -28.641 < 2e-16 ***
## categoryCigars -0.966302 0.025690 -37.614 < 2e-16 ***
## categoryCold Dispensed Beverage 0.447609 0.017227 25.982 < 2e-16 ***
## categoryEnergy 0.015697 0.018120 0.866 0.38633
## categoryFuel -0.638023 0.016359 -39.002 < 2e-16 ***
## categoryHot Dispensed Beverage -0.009300 0.018821 -0.494 0.62122
## categoryHot Sandwiches & Chicken -0.059721 0.023156 -2.579 0.00991 **
## categoryJuice/tonics -0.433122 0.017665 -24.519 < 2e-16 ***
## categoryLottery -0.991808 0.018851 -52.612 < 2e-16 ***
## categoryPizza 0.096907 0.021468 4.514 6.36e-06 ***
## categoryPop (587) -0.279395 0.017215 -16.229 < 2e-16 ***
## categoryRoller Grill -0.192345 0.022021 -8.735 < 2e-16 ***
## categorySalty Snacks -0.281628 0.020151 -13.976 < 2e-16 ***
## categorySmokeless (951) -0.691339 0.023350 -29.608 < 2e-16 ***
## factor(quarter)2 0.194020 0.006887 28.170 < 2e-16 ***
## factor(quarter)3 0.216750 0.006653 32.580 < 2e-16 ***
## factor(quarter)4 0.228426 0.006624 34.486 < 2e-16 ***
## stateArkansas -0.341681 0.008364 -40.851 < 2e-16 ***
## stateColorado 0.116604 0.007972 14.627 < 2e-16 ***
## stateIowa -0.285021 0.007312 -38.980 < 2e-16 ***
## stateMinnesota -0.285307 0.030645 -9.310 < 2e-16 ***
## stateMissouri 0.276047 0.008295 33.281 < 2e-16 ***
## stateNebraska -0.182782 0.010315 -17.720 < 2e-16 ***
## stateOklahoma -0.592161 0.009407 -62.947 < 2e-16 ***
## stateSouth Dakota -0.165378 0.021925 -7.543 4.60e-14 ***
## stateWyoming 0.085997 0.013440 6.399 1.57e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1081043 on 779807 degrees of freedom
## Residual deviance: 1037616 on 779776 degrees of freedom
## AIC: 1037680
##
## Number of Fisher Scoring iterations: 4
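Because the category-only model is nested inside this larger one, we could also compare the two formally with a likelihood-ratio test. This is only a sketch: it re-fits the earlier model under a hypothetical name, model_simple, since we overwrote model_train above. The drop in AIC (1050708 down to 1037680) points in the same direction.
# Likelihood-ratio test comparing the nested models (category-only model re-fit here)
model_simple <- glm(loyalty ~ category, family=binomial, data=data_train)
anova(model_simple, model_train, test = 'Chisq')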
Finally, let’s examine our larger, more complicated model on our test data. Recall that our single-variable model had a relatively modest accuracy rate of about 58%. The more complex model has increased that accuracy slightly, to about 60%. Perhaps more importantly, the sensitivity of the model is much improved. Recall that before, the sensitivity, or the accuracy of the model at predicting loyal, was below 50%. This new model has increased that to 57%.
Overall, this is not a great model. We could add additional variables to make it more complex. Additionally, recall that this data is just a subset of the available data. Adding data could improve the model as well.
predict_test <- predict(model_train, newdata=data_test, type='response')
summary(predict_test)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1925 0.4148 0.4903 0.5000 0.5788 0.7697
data_test$prediction <- predict_test
head(data_test, n=10)
## loyalty category quarter state prediction
## 8 loyal Bread & Cakes 2 Alabama 0.5825976
## 13 loyal Pop (587) 4 Iowa 0.4795867
## 16 loyal Salty Snacks 3 Colorado 0.5759126
## 20 loyal Cold Dispensed Beverage 1 Missouri 0.7267008
## 21 loyal Fuel 2 Alabama 0.4527136
## 24 loyal Smokeless (951) 1 Wyoming 0.4131277
## 26 loyal Juice/tonics 4 Wyoming 0.5338463
## 34 loyal Candy/Gum 1 Oklahoma 0.3362973
## 36 loyal Juice/tonics 1 Alabama 0.4554110
## 37 loyal Energy 4 Oklahoma 0.4765806
table2 <- table(predict_test>.5, data_test$loyalty)
my_confusion_matrix(table2)
##
## not loyal loyal
## FALSE 80501 55670
## TRUE 49499 74266
## [[1]]
## [1] "74266 = True Positive (TP), Hit"
##
## [[2]]
## [1] "80501 = True Negative (TN), Rejection"
##
## [[3]]
## [1] "49499 = False Positive (FP), Type 1 Error"
##
## [[4]]
## [1] "55670 = False Negative (FN), Type 2 Error"
##
## [[5]]
## [1] "0.5954 = Accuracy (TP+TN/(TP+TN+FP+FN))"
##
## [[6]]
## [1] "0.5716 = Sensitivity, Recall, Hit Rate, True Positive Rate (How many positives did the model get right? TP/(TP+FN))"
##
## [[7]]
## [1] "0.6192 = Specificity, Selectivity, True Negative Rate (How many negatives did the model get right? TN/(TN+FP))"
##
## [[8]]
## [1] "0.6001 = Precision, Positive Predictive Value (How good are the model's positive predictions? TP/(TP+FP))"
##
## [[9]]
## [1] "0.5912 = Negative Predictive Value (How good are the model's negative predictions? TN/(TN+FN))"
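Finally, if you want a single threshold-free summary of the model before deploying it (step 6), the area under the ROC curve (AUC) is a common choice. A sketch using the pROC package, assuming it is installed:
# Threshold-free evaluation with AUC; assumes the pROC package is installed
library(pROC)
roc_test <- roc(response = data_test$loyalty, predictor = data_test$prediction)
auc(roc_test)  # 0.5 is a coin flip; closer to 1 is better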