Linear regression assumes a linear relationship between the independent variables (features) and the dependent variable (target). This means that the change in the dependent variable is proportional to the change in the independent variables. Mathematically, it can be expressed as:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon \]
Where: - \(y\) is the dependent variable. - \(x_1, x_2, \dots, x_n\) are the independent variables. - \(\beta_0, \beta_1, \dots, \beta_n\) are the coefficients. - \(\epsilon\) is the error term.
Other key assumptions include: - Homoscedasticity: The variance of the errors is constant across all levels of the independent variables. - Independence: Observations are independent of each other. - Normality: The errors are normally distributed.
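As a quick illustration (a minimal sketch using R's built-in mtcars data rather than any dataset used later in this discussion), residual plots are a common way to check the homoscedasticity and normality assumptions:

```r
# Illustrative sketch: inspect the residuals of a simple linear model for
# roughly constant variance (homoscedasticity) and approximate normality.
fit <- lm(mpg ~ wt + hp, data = mtcars)

par(mfrow = c(1, 2))
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")  # look for a roughly constant spread
abline(h = 0, lty = 2)
qqnorm(resid(fit)); qqline(resid(fit))            # points near the line suggest normal errors
```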
Binary Classification: Logistic regression predicts the probability of an instance belonging to one of two classes (e.g., 0 or 1). The output is a probability value, which is then thresholded (e.g., 0.5) to classify the instance.
Example: Predicting whether an email is spam (1) or not spam (0).
Multiclass Classification: Logistic regression can be extended to handle more than two classes using techniques like One-vs-Rest (OvR) or Multinomial Logistic Regression. In OvR, a separate binary classifier is trained for each class, while in multinomial logistic regression, a single model is trained to handle all classes.
Example: Classifying an image as a cat, dog, or bird.
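As a small, hedged sketch of the multinomial approach in R (using the built-in iris data and the multinom() function from the nnet package, neither of which appears in the examples below):

```r
# Illustrative sketch of multinomial logistic regression with three classes.
library(nnet)

multi_fit <- multinom(Species ~ Sepal.Length + Petal.Length, data = iris)
head(predict(multi_fit, type = "probs"))   # one predicted probability per class, per row
head(predict(multi_fit))                   # predicted class labels
```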
These metrics are used to evaluate the performance of classification models:
Accuracy: The ratio of correctly predicted instances to the total instances. \[ \text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{TP} + \text{TN} + \text{False Positives (FP)} + \text{False Negatives (FN)}} \]
Precision: The ratio of correctly predicted positive instances to the total predicted positive instances. \[ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \]
Recall (Sensitivity): The ratio of correctly predicted positive instances to the total actual positive instances. \[ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \]
F1-Score: The harmonic mean of precision and recall. \[ \text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
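To make these formulas concrete, here is a minimal sketch in R that computes each metric from made-up confusion-matrix counts (the counts are purely illustrative):

```r
# Made-up confusion-matrix counts for illustration only.
TP <- 90; TN <- 80; FP <- 20; FN <- 10

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
```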
In the field of machine learning, regression and classification are two fundamental techniques used for prediction and decision-making. Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables, assuming a linear relationship. Logistic regression, on the other hand, is primarily used for classification tasks, where the goal is to predict the probability of a binary or multiclass outcome. Understanding the assumptions behind these models and how to evaluate their performance is crucial for building effective machine learning systems. In this discussion, we will explore the assumptions of linear regression, the differences between binary and multiclass classification in logistic regression, and the key metrics used to evaluate machine learning models, such as accuracy, precision, recall, and F1-score.
Install the tidyverse collection of packages and the jtools package. Functions from the jtools package depend on at least two other packages, ggstance and huxtable, so you will also need to install those packages. You should also install the corrplot package.
You only need to install these packages once on the machine that you’re using. If you have not already done so, then you can do so by uncommenting the code chunk below and running it. If you have already done so, then you should not run the next code chunk.
# install.packages('tidyverse')
# install.packages('jtools')
# install.packages('ggstance')
# install.packages('huxtable')
# install.packages('corrplot')
Load the tidyverse collection of packages.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(jtools) # For tabulating and visualizing results from multiple regression models
library(corrplot)
## corrplot 0.95 loaded
library(broom.mixed)
## Registered S3 methods overwritten by 'broom':
## method from
## tidy.glht jtools
## tidy.summary.glht jtools
Make sure that you have also downloaded the tecaRegressionData.rds file into the same folder in which this file is saved. Use the next code chunk to read in the data and load it as a dataframe object.
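The read-in chunk itself is not reproduced in this excerpt; a minimal sketch, assuming tecaRegressionData.rds sits in your working directory and using the object name trd that the later code expects, would be:

```r
# Read the TECA regression data and store it as a dataframe object named trd.
trd <- readRDS("tecaRegressionData.rds")
glimpse(trd)  # quick structural look; glimpse() comes from the tidyverse
```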
Creating a linear model using more than one explanatory variable is easy to do in R, but we also want to illustrate that it’s not simply a combination of coefficients from many simple regressions that include only one predictor variable. (By the way, a simple regression is a regression in which there is only one predictor variable.)
As a benchmark, let’s first calculate simple regression models of totalRevenue on Fuel_py1 and on Juicetonics_py1, and then compare their coefficients and R-squared values.
lm1 <- lm(totalRevenue ~ Fuel_py1, data = trd)
lm2 <- lm(totalRevenue ~ Juicetonics_py1, data = trd)
export_summs(lm1, lm2) # Create a nice looking table for comparing the regression results
| | Model 1 | Model 2 |
|---|---|---|
| (Intercept) | -11509.68 *** | 14295.79 *** |
| | (1216.79) | (683.78) |
| Fuel_py1 | 35096.84 *** | |
| | (1815.14) | |
| Juicetonics_py1 | | -150275.47 *** |
| | | (37960.10) |
| N | 564 | 564 |
| R2 | 0.40 | 0.03 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | | |
The table indicates the key takeaways from both linear models:

* Coefficient estimates
  * For model 1, the coefficient estimate is -11,509.68 for the intercept and is 35,096.84 for Fuel_py1.
  * For model 2, the coefficient estimate is 14,295.79 for the intercept and is -150,275.47 for Juicetonics_py1.
* The standard errors are in parentheses below these coefficient estimates.
* N stands for number of observations, and both models are based on 564 observations.
* R2 is the R-squared, which is much larger for model 1, 40%, than for model 2, 3%. This means that model 1 explains more variation in totalRevenue than model 2, and is better for making predictions.
Let’s use the jtools package to plot the coefficients and standard errors to help visualize the results.
plot_summs(lm1, lm2)
I think this visualization is excellent. It clearly shows that the coefficient on Fuel_py1 from model 1 is positive and has a much smaller standard error than the negative coefficient on Juicetonics_py1 from model 2. As you may have guessed, the long orange whiskers and the short, barely visible blue whiskers represent the uncertainty around each estimate, that is, the range in which the actual value is likely to fall.
Now, let’s run a multiple regression that contains both predictor variables in the same model, and evaluate the models. It’s called a multiple regression because it has multiple predictor variables.
lm3 <- lm(totalRevenue ~ Fuel_py1 + Juicetonics_py1, data = trd)
export_summs(lm1, lm2, lm3)
| | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| (Intercept) | -11509.68 *** | 14295.79 *** | -16855.87 *** |
| | (1216.79) | (683.78) | (1680.95) |
| Fuel_py1 | 35096.84 *** | | 39333.35 *** |
| | (1815.14) | | (2014.95) |
| Juicetonics_py1 | | -150275.47 *** | 149875.80 *** |
| | | (37960.10) | (33106.72) |
| N | 564 | 564 | 564 |
| R2 | 0.40 | 0.03 | 0.42 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | | | |
First, notice how the R-squared for model 3 is .42, which is higher than the R-squared of .40 in model 1 and .03 in model 2. This means that we can have a little more confidence in the accuracy of these predictions than in the predictions from model 1, and a lot more confidence in the predictions compared to model 2.
Also, the R-squared is not simply a sum of the R-squared from models 1 and 2. This is because Fuel_py1 and Juicetonics_py1 are correlated with each other. Let’s review the correlation matrix. It looks like they have a fairly strong correlation of -.46 with each other. This also has implications for helping us understand the relationships among these variables.
ctrd <- cor(trd %>% select(where(is.numeric)))
corrplot(ctrd
, method = 'color' # I also like pie and ellipse
, order = 'hclust' # Orders the variables so that ones that behave similarly are placed next to each other
, addCoef.col = 'black'
, number.cex = .6 # Lower values decrease the size of the numbers in the cells
)
Notice that the coefficient estimates in model 3 are different than in
the other two models, and are all statistically significant.
The coefficient estimates for the intercept and Fuel_py1 are similar to, but more extreme than, those in model 1. However, the coefficient estimate for Juicetonics_py1 is drastically different: it changed from about -150,000 to a positive value of about 150,000. Once again, this is because Fuel_py1 and Juicetonics_py1 are correlated with each other, and the coefficient estimates represent the unique effect of each predictor variable after considering the effect(s) of the other predictor variables.
This change is effectively communicated by visualizing the coefficients from all three models.
plot_summs(lm1, lm2, lm3)
Conceptually, this change in the Juicetonics_py1 coefficient makes sense as well. Specifically, the negative correlation between Fuel_py1 and Juicetonics_py1 means that stores that have a higher percentage of fuel sales have a smaller percentage of juice and tonic sales. People spend a lot more on fuel than on juice and tonics, so stores that depend more on sales of juice and tonics are likely not to generate as much total revenue.
So, the really great thing for explanatory purposes is that we can evaluate the effect of an increase of percentage sales from juice and tonics after controlling for the amount of sales from fuel. In other words, stores that have the same amount of fuel sales can expect to have higher total revenue if they increase their sales from juice and tonics.
In the absence of being able to run an experiment, this is the next best thing. In some ways it’s better because running an experiment would be really hard.
Making predictions using a multiple regression model is a fairly straightforward extension of making predictions from a simple regression model. In our case, if we want to estimate the total revenue for a store in which during the previous year 70% of sales came from fuel and 3% came from juice and tonics, then we would take the intercept of -16,856, then add .7 times 39,333, and finally add .03 times 149,876, which would give us \(15,173.86 = -16,856 + (.7 \times 39,333) + (.03 \times 149,876)\). Of course, we can simply use the predict() function to calculate those predictions.
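For example, a minimal sketch of that prediction with predict(), using the lm3 model from above (new_store is just a hypothetical one-row dataframe holding the values described in the text):

```r
# The column names match the predictors in lm3.
new_store <- data.frame(Fuel_py1 = 0.70, Juicetonics_py1 = 0.03)
predict(lm3, newdata = new_store)                              # point prediction of totalRevenue
# predict(lm3, newdata = new_store, interval = "prediction")   # optionally, a prediction interval
```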
You may have noticed the adjusted R-squared in the output, and that it’s always lower than the R-squared. The purpose of the adjusted R-squared is to penalize regression models for adding in predictor variables that do not add much explanatory power, or that are insignificant.
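For reference, the adjustment is typically computed as

\[ R^2_{\text{adj}} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1} \]

where \(n\) is the number of observations and \(p\) is the number of predictor variables, so an added predictor only raises the adjusted R-squared if it improves \(R^2\) by more than the penalty for the extra term.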
Let’s create a larger multiple regression model by adding ColdDispensedBeverage_py1 and Lottery_py1, and look at the summary.
lm4 = lm(totalRevenue ~ Fuel_py1 + Juicetonics_py1 + ColdDispensedBeverage_py1 + Lottery_py1, data = trd)
summary(lm4)
##
## Call:
## lm(formula = totalRevenue ~ Fuel_py1 + Juicetonics_py1 + ColdDispensedBeverage_py1 +
## Lottery_py1, data = trd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11128.2 -2566.3 -312.4 1855.2 25791.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14330 2434 -5.886 6.82e-09 ***
## Fuel_py1 36903 2610 14.140 < 2e-16 ***
## Juicetonics_py1 140170 37908 3.698 0.000239 ***
## ColdDispensedBeverage_py1 -60632 29829 -2.033 0.042558 *
## Lottery_py1 1582 5963 0.265 0.790877
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4261 on 559 degrees of freedom
## Multiple R-squared: 0.4251, Adjusted R-squared: 0.421
## F-statistic: 103.4 on 4 and 559 DF, p-value: < 2.2e-16
Notice that the Lottery_py1 variable in this model has a statistically insignificant coefficient, which means we cannot be confident that this coefficient is different from zero. This insignificant variable contributes to the adjusted R-squared being lower than the R-squared.
We don’t want to make things more difficult than they need to be, so to keep things simple, the final model typically includes only the variables that are statistically significant. When an insignificant variable is dropped, the adjusted R-squared should remain about the same as when that variable was left in the model. Let’s try it out.
lm5 = lm(totalRevenue ~ Fuel_py1 + Juicetonics_py1 + ColdDispensedBeverage_py1, data = trd)
summary(lm5)
##
## Call:
## lm(formula = totalRevenue ~ Fuel_py1 + Juicetonics_py1 + ColdDispensedBeverage_py1,
## data = trd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11085.6 -2541.9 -333.1 1844.3 25796.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14033 2159 -6.499 1.79e-10 ***
## Fuel_py1 36629 2395 15.293 < 2e-16 ***
## Juicetonics_py1 135591 33721 4.021 6.59e-05 ***
## ColdDispensedBeverage_py1 -57848 27898 -2.074 0.0386 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4257 on 560 degrees of freedom
## Multiple R-squared: 0.4251, Adjusted R-squared: 0.422
## F-statistic: 138 on 3 and 560 DF, p-value: < 2.2e-16
In this case, the adjusted R-squared even increased slightly.
plot_summs(lm1, lm2, lm3, lm4, lm5)
In conclusion, multiple regression is a very powerful tool for both predictive and explanatory purposes. It improves the predictive power by using the effect of multiple variables. It improves the explanatory power by reporting the unique effect of each predictor variable.
The adjusted R-squared should be the main version of R-squared that you focus on when using multiple regressions because it is a more conservative estimate of the expected explanatory power of a model.
Now let’s get some hands-on experience using logistic regression. Logistic regression is a valuable classification-type machine learning algorithm. We will use it to predict whether a transaction is from a loyalty customer or not, using the TECA dataset.
TECA wants to increase its number of loyalty customers and the number of products each loyalty customer buys, because loyalty customers create more business, and more profitable business, for TECA.
As always, we encourage you to resist the urge to just watch these videos. Rather, please follow along, doing each of the steps we are doing, so that you can get the critical hands-on experience necessary to learn this valuable tool. First, let’s bring in the data and the needed packages.
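The exact file name is not reproduced in this excerpt; a minimal sketch of this step, with a placeholder file name, might look like:

```r
library(tidyverse)

# Placeholder file name; substitute the .rds file provided with this exercise.
df1_even_kp <- readRDS("df1_even_kp.rds")
glimpse(df1_even_kp)
```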
Next, run the following lines of code. In these lines I have created a very basic function that will help us evaluate the quality of our logistic regression model, later on.
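The helper function itself is not reproduced in this excerpt; a minimal sketch that would produce the labelled output shown later (it assumes the table's rows are the FALSE/TRUE predictions and its columns are ordered not loyal, then loyal) is:

```r
# A basic confusion-matrix helper: prints the table and returns labelled counts
# and accuracy measures as a list of strings.
my_confusion_matrix <- function(cm) {
  print(cm)
  TP <- cm[2, 2]; TN <- cm[1, 1]; FP <- cm[2, 1]; FN <- cm[1, 2]
  acc  <- (TP + TN) / (TP + TN + FP + FN)
  sens <- TP / (TP + FN)   # sensitivity / recall
  spec <- TN / (TN + FP)   # specificity
  prec <- TP / (TP + FP)   # precision
  npv  <- TN / (TN + FN)   # negative predictive value
  list(
    paste(TP, "= True Positive (TP), Hit"),
    paste(TN, "= True Negative (TN), Rejection"),
    paste(FP, "= False Positive (FP), Type 1 Error"),
    paste(FN, "= False Negative (FN), Type 2 Error"),
    paste(round(acc, 4),  "= Accuracy ((TP+TN)/(TP+TN+FP+FN))"),
    paste(round(sens, 4), "= Sensitivity, Recall, Hit Rate, True Positive Rate (TP/(TP+FN))"),
    paste(round(spec, 4), "= Specificity, Selectivity, True Negative Rate (TN/(TN+FP))"),
    paste(round(prec, 4), "= Precision, Positive Predictive Value (TP/(TP+FP))"),
    paste(round(npv, 4),  "= Negative Predictive Value (TN/(TN+FN))")
  )
}
```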
Now, let’s go through our six steps for using a machine learning algorithm. Recall that these steps are the following:

1. Data scrubbing
2. Algorithm selection
3. Model training or fitting
4. Model evaluation
5. Model improvement
6. Deploy the model
We have done all of the data scrubbing, so number 1 is already complete. For number 2, we have selected logistic regression as our classification algorithm. So, let’s move to number 3.
We will break this step down into multiple parts. Our goal is to train a univariate logistic regression model using our data. That is, we will predict whether a particular transaction is from a loyalty customer or not.
First, we need to select which independent/input variable we are going to use to predict our dependent variable. Let’s look at the data to remind ourselves what it looks like.
slice_sample(df1_even_kp, n=10)
## loyalty category quarter state
## 897644 not loyal Lottery 4 Colorado
## 191946 loyal Pizza 2 Missouri
## 381765 loyal Fuel 4 Missouri
## 2352691 not loyal Cigars 4 Iowa
## 260251 loyal Fuel 4 Colorado
## 4390681 not loyal Energy 3 Arkansas
## 888609 not loyal Salty Snacks 4 Alabama
## 402489 loyal Pop (587) 4 Iowa
## 4020821 not loyal Smokeless (951) 3 Nebraska
## 500533 loyal Cold Dispensed Beverage 1 Nebraska
loyalty is our dependent/output/target variable. We need
to pick one of the other three to be our independent/input variable.
Since TECA is concerned about predicting which transactions will be
loyalty transactions, it probably makes the most sense to use
category as our independent variable.
Before we train our model, we have a few more things to do to get ready.
First, let’s look at the baseline percentage of the occurrence of
loyalty. To do this, we will use the function
table(). This function uses the levels of categorical
variables to build a contingency table of the counts at each combination
of variable levels. We would like to build a table of how many loyal and
not loyal transactions we have and then calculate the percentage of loyal
transactions.
Let’s describe what is done here. First, the table()
function is used to create a table from the loyalty
variable. Second, we printed that table. Next, we printed the percentage
of loyal transactions divided by the total transactions. This is done by indexing the table with square brackets [] to select individual parts of it. Thus, for example, loyal_table[2] = 519744.
loyal_table <- table(df1_even_kp$loyalty)
print(loyal_table)
##
## not loyal loyal
## 520000 519744
print(loyal_table[2]/(loyal_table[1]+loyal_table[2]))
## loyal
## 0.4998769
Next, I want to double check that the levels of loyalty
are in the order I think they are. Logistic regression and my confusion
matrix function are going to use factors in the order they are in, so I
need to understand what that order is so that I can interpret the output
of the model correctly. The contrasts() function will check
that for me. I would like loyal to be coded as 1.
contrasts(df1_even_kp$loyalty)
## loyal
## not loyal 0
## loyal 1
This is correct, so we can move on to our next step. The next step in
training our data is to split it into training and testing proportions.
Recall that we want to train our model on one set of data, but then
make sure it works on other data, so that we are not overfitting our
model. I will use the caret package to split the data,
though there are myriad ways of doing this. However, as you continue to
explore and work on classification models it is likely you will use the
caret package.
The second line sets a seed so we can get the same split again. The
next line uses the caret package to select 75% of the
entries in the column loyalty. We could have used a
different percentage–like 80% or 70%. This creates a matrix of numbers
that we can use to select the rows from our dataset that will become
the training data. That is what we have done in the fourth line. Here,
we create a dataset called data_train by taking the
original dataset and telling it to only keep the rows we randomly
selected in partition. The square brackets here are again
using indexing to tell R which rows to take. We put nothing after the
comma, which tells R to take all of the columns. We could have instead
selected only some subset of the columns. Next, we select the opposite rows, using the - symbol in front of partition to make the testing dataset,
data_test. Finally, we print the percentage of rows that
are in the training data to make sure it is close to 75%.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
set.seed(77)
partition <- caret::createDataPartition(y=df1_even_kp$loyalty, p=.75, list=FALSE)
data_train <- df1_even_kp[partition,]
data_test <- df1_even_kp[-partition,]
print(nrow(data_train)/(nrow(data_test)+nrow(data_train)))
## [1] 0.75
With our data split, we are ready to train our model. Recall that an
algorithm is a set of rules for how to make a model. A model takes our
data and options we select to create new data/information and additional
rules for how to use the model. Below, we create our model, called
model_train and then summarize it to view the output. We
use the glm function with the family=binomial
argument to select logistic regression. The syntax is really simple
here. Our dependent variable, loyalty goes first. Then we
use ~ to act like an equals sign followed by our
independent variables, category in this case. Finally, we
list the data set. We are, of course, using the training data, since we
are training the model.
There is a lot of output here, so let’s pick out only the important
stuff. Let’s focus on the Coefficients area. There we have coefficients
listed for our variable under the Estimate column. A ways over we have
the z value and the p value (Pr(>|z|)). What does all of this mean?
The coefficients tell us the effect of that variable on the dependent
variable, loyalty. The glm function is smart
enough to know that category is a categorical variable and
not a continuous variable. Thus, glm automatically did
something called “one-hot encoding” or, equivalently, “dummy coding”, to
our variable. That is, it took the variable and essentially created
individual “dummy variables” for each level of category.
Thus, instead of having one category variable, we now have 19 separate dummy variables. Thus, categoryBeer is a
new variable that has a 1 if that row is the level Beer and a 0
otherwise. That is an example of a dummy variable. Likewise,
categoryPizza has a 1 if that level row has pizza in the
category column and 0 otherwise. Hopefully you notice an
issue, however. We have 20 levels of category but only 19
dummy variable coefficients. That is because glm put one of
our levels into the intercept term. Specifically, Bakery is in the
intercept. This is done to avoid over-specifying the model. It is
important to know which level is not in the model because it affects how
we interpret the results. Specifically, the effect of the other
coefficients is in reference to the variable in the intercept.
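To see this dummy coding directly, you can inspect the model matrix that glm builds behind the scenes; a small sketch (run once data_train exists; X is just a temporary name used here):

```r
# The model matrix has an (Intercept) column plus one 0/1 dummy column per
# non-reference level of category; the reference level (Bakery) is absorbed
# into the intercept.
X <- model.matrix(~ category, data = data_train)
colnames(X)
head(X[, 1:4])
```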
We can now interpret. The intercept lists the effect of bakery items on whether a transaction is from a loyalty customer or not. The coefficient is positive, which means that selling a bakery item increases the likelihood that a transaction is a loyalty transaction. More precisely, purchasing this item increases the log odds of a loyalty purchase by 0.30575. We won’t go into log odds here; it suffices to say that buying a bakery item increases the likelihood that the transaction is from a loyalty customer.

Let’s look at some more categories. Fuel actually decreases the chance that the transaction is from a loyalty customer relative to a bakery item, as can be seen by its negative coefficient. More specifically, Fuel’s coefficient plus the intercept (which, remember, represents bakery items) is still negative, meaning a fuel purchase is less likely than not to come from a loyalty customer. This makes sense, since a bakery item is something you go into the store for, which is more likely for a loyalty customer, whereas anyone can drive by and purchase gas.

Next, fountain soda, Cold Dispensed Beverage, increases the chance that the purchase was from a loyalty customer, even more than purchasing a bakery item does. Thus, fountain soda seems to be a big draw for loyalty customers and could be a way for TECA to increase loyalty purchases.

Finally, what effect do Bread & Cakes have? The coefficient is a small negative number, but if we add it to the intercept, the sum is still positive. Thus, a Bread & Cakes purchase is still somewhat more likely than not to be a loyalty purchase, even though it lowers that likelihood relative to bakery items. We have to compare to the intercept, though, since all of the coefficients are relative to Bakery.
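To translate these log odds into something more intuitive, a small sketch converts them into probabilities and odds ratios, using coefficient values taken from the model summary shown below:

```r
# plogis() is the logistic function, which maps log odds to probabilities.
plogis(0.30575)             # ~0.58: predicted probability that a Bakery transaction is loyal
plogis(0.30575 + 0.46045)   # ~0.68: predicted probability for Cold Dispensed Beverage
exp(0.46045)                # ~1.58: odds ratio for Cold Dispensed Beverage relative to Bakery
```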
Next, it is not enough to see what the sign of the coefficient is,
but we need to see whether the coefficient has a significant effect on
loyalty. We see this in at least two ways. First, we can
look at the p-level (Pr(>|z|)). This is the probability that the
coefficient is equal to 0. This is not the technical definition, but the
point to remember is that a low p-value means the coefficient is not 0
and is statistically significant. Additionally, ***
indicate significance, with more stars meaning it is less likely the
coefficient is 0. We have a lot of data here so even small coefficients
are likely to be significant. Sure enough, only energy drinks and Hot
Sandwiches & Chicken are not significant relative to bakery. So,
overall, what do we learn here? We know now that several categories,
such as fountain sodas, bakery items, and breakfast sandwiches help
increase the chance a customer will be a loyalty customer. TECA could
promote these products to try to draw in more loyalty customers.
model_train <- glm(loyalty ~ category, family=binomial, data=data_train)
summary(model_train)
##
## Call:
## glm(formula = loyalty ~ category, family = binomial, data = data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5152 -1.1350 -0.8436 1.1812 1.5530
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.30575 0.01539 19.870 < 2e-16 ***
## categoryBeer -1.15590 0.02101 -55.009 < 2e-16 ***
## categoryBread & Cakes -0.11000 0.02173 -5.063 4.14e-07 ***
## categoryBreakfast Sandwiches 0.22179 0.02156 10.288 < 2e-16 ***
## categoryCandy/Gum -0.33491 0.01816 -18.442 < 2e-16 ***
## categoryChips -0.20758 0.02392 -8.677 < 2e-16 ***
## categoryCigarettes -0.48665 0.01769 -27.517 < 2e-16 ***
## categoryCigars -0.95134 0.02544 -37.395 < 2e-16 ***
## categoryCold Dispensed Beverage 0.46045 0.01706 26.991 < 2e-16 ***
## categoryEnergy 0.02773 0.01797 1.544 0.12269
## categoryFuel -0.63600 0.01622 -39.211 < 2e-16 ***
## categoryHot Dispensed Beverage -0.01130 0.01866 -0.605 0.54487
## categoryHot Sandwiches & Chicken -0.06145 0.02298 -2.674 0.00749 **
## categoryJuice/tonics -0.40630 0.01752 -23.196 < 2e-16 ***
## categoryLottery -0.91991 0.01865 -49.313 < 2e-16 ***
## categoryPizza 0.10643 0.02130 4.996 5.84e-07 ***
## categoryPop (587) -0.31478 0.01706 -18.448 < 2e-16 ***
## categoryRoller Grill -0.18406 0.02183 -8.430 < 2e-16 ***
## categorySalty Snacks -0.28467 0.01997 -14.252 < 2e-16 ***
## categorySmokeless (951) -0.67115 0.02315 -28.989 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1081043 on 779807 degrees of freedom
## Residual deviance: 1050668 on 779788 degrees of freedom
## AIC: 1050708
##
## Number of Fisher Scoring iterations: 4
Our next step is to evaluate the model we have created by predicting
the probabilities of each row and examining how accurate the model is at
predicting a loyalty purchase. Recall that logistic regression uses the
independent variables to create a probability for each row, chosen to maximize the likelihood that each row matches its true category–loyal or not loyal. We should predict
those probabilities and examine their accuracy, since we have labeled
data and know the true category for each row.
First, let’s predict the probabilities. We really want to do this with the testing data, but we will do it on the training data first, and then compare against the held-out testing data next.
Let’s examine this code line by line. The first line is the
prediction of the probabilities. We are predicting using the model we
just created, model_train. Next, we are using the training
data, data_train. The argument type='response'
gives the predicted probabilities rather than the log odds, which are
more difficult to interpret. Next, we print a summary of the
probabilities, just so we can get a sense of what they look like. To
further evaluate the model, we take the probabilities,
predict_train, and put them into the training data as a new
column, data_train$prediction. Finally, we take a look at a
snapshot of the data. You can see that the probabilities here make
sense, generally. All of these first 10 rows are from loyalty customers, and most of their predicted probabilities are near or above 0.5, keeping in mind that our highest probability overall is only about 68%.
predict_train <- predict(model_train, newdata=data_train, type='response')
print(summary(predict_train))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2994 0.4182 0.4927 0.4999 0.5731 0.6827
data_train$prediction <- predict_train
head(data_train, n=10)
## loyalty category quarter state prediction
## 1 loyal Bread & Cakes 3 Colorado 0.5487819
## 2 loyal Candy/Gum 1 Wyoming 0.4927105
## 3 loyal Pop (587) 3 Wyoming 0.4977430
## 4 loyal Pop (587) 4 Colorado 0.4977430
## 5 loyal Chips 1 Oklahoma 0.5245229
## 6 loyal Juice/tonics 2 Colorado 0.4748839
## 7 loyal Juice/tonics 3 Wyoming 0.4748839
## 9 loyal Beer 3 Alabama 0.2994030
## 10 loyal Chips 2 Oklahoma 0.5245229
## 11 loyal Cigarettes 1 Nebraska 0.4548983
Next, let’s evaluate our model in another way, by looking at how often it makes correct predictions. There are a lot of ways to think about the accuracy of this model. You can think of how accurate the model was at detecting loyalty, but you can also think about how accurate it was at detecting not loyalty. Both of these types of accuracy matter to us, since we want an overall accurate model. A typical way to visualize accuracy is by using a confusion or classification matrix. Again, we are using the training data just so we can compare to the testing data next.
Let’s again look at this code line by line. The first line creates a table using the table() function. It takes the probability predictions from above, predict_train, and records “FALSE” if the probability is less than 0.5 and “TRUE” if it is more than 0.5. These are our predictions using the training data; you can see the FALSE and TRUE categories on the left of the table. It then compares these to the truth, which is the loyalty column from the data_train dataframe. The second line then runs the function we created above, which takes the four cells of this table, labels them, and combines them into the various measures of accuracy displayed below.
Let’s explore all of this. The table shows the following output:

* When a transaction is actually/truthfully a loyalty transaction, the model correctly classifies it as such, by saying “TRUE”, 185131 times. This is called a True Positive (TP), or Hit.
* When a transaction is truthfully a non-loyalty transaction, the model correctly classifies it as such, by saying “FALSE”, 265829 times. This is a True Negative (TN), or Rejection.
* On the other hand, the model makes two kinds of errors. When a transaction is not a loyalty transaction but the model incorrectly says it is, this is called a False Positive (FP), or Type 1 Error. It happens 124171 times.
* Finally, when a transaction is a loyalty transaction and the model says it is not, which here happens 204677 times, it is called a False Negative (FN), or Type 2 Error.

These numbers can then be manipulated to create different measures of accuracy, as follows:

* Overall accuracy ((TP+TN)/(TP+TN+FP+FN)) is 0.578296.
* Sensitivity, Recall, Hit Rate, True Positive Rate (How many positives did the model get right? TP/(TP+FN)) is 0.474929.
* Specificity, Selectivity, True Negative Rate (How many negatives did the model get right? TN/(TN+FP)) is 0.681613.
* Precision, Positive Predictive Value (How good are the model’s positive predictions? TP/(TP+FP)) is 0.598544.
* Negative Predictive Value (How good are the model’s negative predictions? TN/(TN+FN)) is 0.564985.
Thus, overall, our model predicts better than chance. Recall that
loyalty transactions happen about 50% of the time. Thus, if we had no
model at all and just flipped a coin every time to guess if a
transaction was going to be a loyalty transaction, we would get it right
50% of the time. Our model helps us to get it right about 58% of the
time, which is not super high, but is better than chance. Where the
model really fails right now is in sensitivity. That is, the model is
pretty good at predicting not loyal, 68%, but bad at
predicting the loyal, getting only 47% correct: worse than flipping a
coin. So, our next step is to improve the model.
table1 <- table(predict_train>0.5, data_train$loyalty) #prediction on left and truth on top
my_confusion_matrix(table1)
##
## not loyal loyal
## FALSE 265829 204677
## TRUE 124171 185131
## [[1]]
## [1] "185131 = True Positive (TP), Hit"
##
## [[2]]
## [1] "265829 = True Negative (TN), Rejection"
##
## [[3]]
## [1] "124171 = False Positive (FP), Type 1 Error"
##
## [[4]]
## [1] "204677 = False Negative (FN), Type 2 Error"
##
## [[5]]
## [1] "0.5783 = Accuracy (TP+TN/(TP+TN+FP+FN))"
##
## [[6]]
## [1] "0.4749 = Sensitivity, Recall, Hit Rate, True Positive Rate (How many positives did the model get right? TP/(TP+FN))"
##
## [[7]]
## [1] "0.6816 = Specificity, Selectivity, True Negative Rate (How many negatives did the model get right? TN/(TN+FP))"
##
## [[8]]
## [1] "0.5985 = Precision, Positive Predictive Value (How good are the model's positive predictions? TP/(TP+FP))"
##
## [[9]]
## [1] "0.5650 = Negative Predictive Value (How good are the model's negative predictions? TN/(TN+FN)"
However, first, let’s take our trained model and predict/run it on the testing data. Recall, that we train on the training data and test on the testing data. We want to use new data to test on so that we aren’t just trying to memorize the training data, but rather are creating a model that can work on any data.
This, of course, looks very similar to the code above with the
key exception that the data being used is the test
data, data_test.
How does our model perform on the new data? It actually does a little bit better than the training data. While accuracy is very similar, our problem from before, the low sensitivity, is a tiny bit improved.
predict_test <- predict(model_train, newdata=data_test, type='response')
print(summary(predict_test))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2994 0.4182 0.4927 0.5003 0.5731 0.6827
data_test$prediction <- predict_test
head(data_test, n=10)
## loyalty category quarter state prediction
## 8 loyal Bread & Cakes 2 Alabama 0.5487819
## 13 loyal Pop (587) 4 Iowa 0.4977430
## 16 loyal Salty Snacks 3 Colorado 0.5052708
## 20 loyal Cold Dispensed Beverage 1 Missouri 0.6826999
## 21 loyal Fuel 2 Alabama 0.4181812
## 24 loyal Smokeless (951) 1 Wyoming 0.4096534
## 26 loyal Juice/tonics 4 Wyoming 0.4748839
## 34 loyal Candy/Gum 1 Oklahoma 0.4927105
## 36 loyal Juice/tonics 1 Alabama 0.4748839
## 37 loyal Energy 4 Oklahoma 0.5826080
table2 <- table(predict_test>.5, data_test$loyalty) #prediction on left and truth on top
my_confusion_matrix(table2)
##
## not loyal loyal
## FALSE 88248 67847
## TRUE 41752 62089
## [[1]]
## [1] "62089 = True Positive (TP), Hit"
##
## [[2]]
## [1] "88248 = True Negative (TN), Rejection"
##
## [[3]]
## [1] "41752 = False Positive (FP), Type 1 Error"
##
## [[4]]
## [1] "67847 = False Negative (FN), Type 2 Error"
##
## [[5]]
## [1] "0.5784 = Accuracy (TP+TN/(TP+TN+FP+FN))"
##
## [[6]]
## [1] "0.4778 = Sensitivity, Recall, Hit Rate, True Positive Rate (How many positives did the model get right? TP/(TP+FN))"
##
## [[7]]
## [1] "0.6788 = Specificity, Selectivity, True Negative Rate (How many negatives did the model get right? TN/(TN+FP))"
##
## [[8]]
## [1] "0.5979 = Precision, Positive Predictive Value (How good are the model's positive predictions? TP/(TP+FP))"
##
## [[9]]
## [1] "0.5653 = Negative Predictive Value (How good are the model's negative predictions? TN/(TN+FN)"
Last time, we predicted loyalty using a single variable,
category. This time we will use all of our variables. We
will see if this improves the prediction accuracy of our model. Before
we get to the model statement, the code will be identical to the prior
video when we used only one variable.
Of our six steps for using a machine learning algorithm, we are now focused on step 5.

1. Data scrubbing
2. Algorithm selection
3. Model training or fitting
4. Model evaluation
5. Model improvement
6. Deploy the model
With our data split up above, we are ready to train our new model.
Recall that an algorithm is a set of rules for how to make a model. A
model takes our data and options we select to create new
data/information and additional rules for how to use the model. Below,
we create our model, called model_train and then summarize
it to view the output. We use the glm function with the
family=binomial argument to select logistic regression. Our
dependent variable will continue to be loyalty, but this
time we will include all of our variables in the model. We are, of
course, using the training data since we are training the model.
Last time we examined the impact of category on
loyalty. Let’s look at quarter. The quarter is
just the quarter of the year. I make it a factor so that R knows it is a
category. There is some debate about whether we could leave that as a
continuous variable, since the difference between one quarter of the
year and another does have meaning, but we will leave it as a factor.
The first quarter of the year is the reference level of these multiple
dummy variables. Thus, when we evaluate the effects of these quarters it
is in relationship to the effect of quarter 1. Looking at the signs of
the coefficients and the p-values, we see that each quarter
significantly increases the probability that a transaction will be a
loyalty transaction. For example, the third quarter of the year has a
positive coefficient of 0.216750 and a very low p-value. This means that
relative to quarter 1, quarter 3 has more transactions that are from
loyalty customers. With this knowledge, TECA could consider focusing on
the beginning of the year in an effort to increase the number of their
loyalty customers. Additionally, TECA could focus on summer and fall as key times when loyalty customers are active, promoting to them to increase their purchases even more.
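To put those quarter effects on a more interpretable scale, a small sketch exponentiates the coefficients (taken from the summary below) into odds ratios:

```r
# Odds ratios for the quarter dummies relative to quarter 1.
exp(c(quarter2 = 0.194020, quarter3 = 0.216750, quarter4 = 0.228426))
# For example, the odds that a quarter-3 transaction is a loyalty transaction are
# roughly 24% higher than in quarter 1, holding category and state constant.
```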
Next, let’s look at how each state affects the occurrence of loyalty transactions. The reference state is Alabama. Relative to Alabama, the states that are most likely to generate loyalty transactions are Colorado, Missouri, and Wyoming. Missouri, in particular, has a strong effect, as can be seen by its relatively large coefficient. TECA should investigate key stores in this state to see what is going right.
model_train <- glm(loyalty ~ category + factor(quarter) + state, family=binomial, data=data_train)
summary(model_train)
##
## Call:
## glm(formula = loyalty ~ category + factor(quarter) + state, family = binomial,
## data = data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7136 -1.1285 -0.6539 1.1315 1.8153
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.254291 0.016633 15.289 < 2e-16 ***
## categoryBeer -1.096064 0.021205 -51.689 < 2e-16 ***
## categoryBread & Cakes -0.114865 0.021934 -5.237 1.63e-07 ***
## categoryBreakfast Sandwiches 0.220034 0.021724 10.129 < 2e-16 ***
## categoryCandy/Gum -0.341969 0.018316 -18.670 < 2e-16 ***
## categoryChips -0.204701 0.024131 -8.483 < 2e-16 ***
## categoryCigarettes -0.511199 0.017849 -28.641 < 2e-16 ***
## categoryCigars -0.966302 0.025690 -37.614 < 2e-16 ***
## categoryCold Dispensed Beverage 0.447609 0.017227 25.982 < 2e-16 ***
## categoryEnergy 0.015697 0.018120 0.866 0.38633
## categoryFuel -0.638023 0.016359 -39.002 < 2e-16 ***
## categoryHot Dispensed Beverage -0.009300 0.018821 -0.494 0.62122
## categoryHot Sandwiches & Chicken -0.059721 0.023156 -2.579 0.00991 **
## categoryJuice/tonics -0.433122 0.017665 -24.519 < 2e-16 ***
## categoryLottery -0.991808 0.018851 -52.612 < 2e-16 ***
## categoryPizza 0.096907 0.021468 4.514 6.36e-06 ***
## categoryPop (587) -0.279395 0.017215 -16.229 < 2e-16 ***
## categoryRoller Grill -0.192345 0.022021 -8.735 < 2e-16 ***
## categorySalty Snacks -0.281628 0.020151 -13.976 < 2e-16 ***
## categorySmokeless (951) -0.691339 0.023350 -29.608 < 2e-16 ***
## factor(quarter)2 0.194020 0.006887 28.170 < 2e-16 ***
## factor(quarter)3 0.216750 0.006653 32.580 < 2e-16 ***
## factor(quarter)4 0.228426 0.006624 34.486 < 2e-16 ***
## stateArkansas -0.341681 0.008364 -40.851 < 2e-16 ***
## stateColorado 0.116604 0.007972 14.627 < 2e-16 ***
## stateIowa -0.285021 0.007312 -38.980 < 2e-16 ***
## stateMinnesota -0.285307 0.030645 -9.310 < 2e-16 ***
## stateMissouri 0.276047 0.008295 33.281 < 2e-16 ***
## stateNebraska -0.182782 0.010315 -17.720 < 2e-16 ***
## stateOklahoma -0.592161 0.009407 -62.947 < 2e-16 ***
## stateSouth Dakota -0.165378 0.021925 -7.543 4.60e-14 ***
## stateWyoming 0.085997 0.013440 6.399 1.57e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1081043 on 779807 degrees of freedom
## Residual deviance: 1037616 on 779776 degrees of freedom
## AIC: 1037680
##
## Number of Fisher Scoring iterations: 4
Finally, let’s examine our larger, more complicated model on our test
data. Recall that our single-variable model had a relatively modest accuracy rate of about 58%. The more complex model has increased that accuracy slightly to about 60%. Perhaps more importantly, the sensitivity of the model is much improved. Recall that before, the sensitivity, or the accuracy of the model at predicting loyal, was below 50%. This new model has increased that to 57%.
Overall, this is not a great model. We could add additional variables to make it more complex. Additionally, recall that this data is just a subset of the available data. Adding data could improve the model as well.
predict_test <- predict(model_train, newdata=data_test, type='response')
summary(predict_test)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1925 0.4148 0.4903 0.5000 0.5788 0.7697
data_test$prediction <- predict_test
head(data_test, n=10)
## loyalty category quarter state prediction
## 8 loyal Bread & Cakes 2 Alabama 0.5825976
## 13 loyal Pop (587) 4 Iowa 0.4795867
## 16 loyal Salty Snacks 3 Colorado 0.5759126
## 20 loyal Cold Dispensed Beverage 1 Missouri 0.7267008
## 21 loyal Fuel 2 Alabama 0.4527136
## 24 loyal Smokeless (951) 1 Wyoming 0.4131277
## 26 loyal Juice/tonics 4 Wyoming 0.5338463
## 34 loyal Candy/Gum 1 Oklahoma 0.3362973
## 36 loyal Juice/tonics 1 Alabama 0.4554110
## 37 loyal Energy 4 Oklahoma 0.4765806
table2 <- table(predict_test>.5, data_test$loyalty)
my_confusion_matrix(table2)
##
## not loyal loyal
## FALSE 80501 55670
## TRUE 49499 74266
## [[1]]
## [1] "74266 = True Positive (TP), Hit"
##
## [[2]]
## [1] "80501 = True Negative (TN), Rejection"
##
## [[3]]
## [1] "49499 = False Positive (FP), Type 1 Error"
##
## [[4]]
## [1] "55670 = False Negative (FN), Type 2 Error"
##
## [[5]]
## [1] "0.5954 = Accuracy (TP+TN/(TP+TN+FP+FN))"
##
## [[6]]
## [1] "0.5716 = Sensitivity, Recall, Hit Rate, True Positive Rate (How many positives did the model get right? TP/(TP+FN))"
##
## [[7]]
## [1] "0.6192 = Specificity, Selectivity, True Negative Rate (How many negatives did the model get right? TN/(TN+FP))"
##
## [[8]]
## [1] "0.6001 = Precision, Positive Predictive Value (How good are the model's positive predictions? TP/(TP+FP))"
##
## [[9]]
## [1] "0.5912 = Negative Predictive Value (How good are the model's negative predictions? TN/(TN+FN)"
In this discussion, we explored the foundational concepts of linear and logistic regression, their assumptions, and their applications in prediction and classification tasks. Linear regression assumes a linear relationship between the independent and dependent variables, along with other key assumptions such as homoscedasticity, independence, and normality of errors. These assumptions ensure that the model provides reliable and interpretable results for predictive purposes.
Logistic regression, on the other hand, is a powerful tool for classification tasks. We distinguished between binary classification, where the outcome has two possible classes (e.g., spam or not spam), and multiclass classification, where the outcome has three or more classes (e.g., classifying fruits as apples, oranges, or bananas). Techniques like One-vs-Rest (OvR) and Multinomial Logistic Regression extend logistic regression to handle multiclass problems effectively.
To evaluate the performance of classification models, we used metrics such as accuracy, precision, recall, and F1-score. These metrics provide a comprehensive understanding of a model’s performance, balancing the trade-offs between correctly identifying positive cases (precision and recall) and overall correctness (accuracy). The F1-score, in particular, is useful when dealing with imbalanced datasets, as it harmonizes precision and recall into a single metric.
Through hands-on examples, we demonstrated how to implement and evaluate regression and classification models using real-world data. We highlighted the importance of model evaluation using training and testing datasets to avoid overfitting and ensure generalizability. Additionally, we discussed the significance of adjusted R-squared in multiple regression models, which penalizes the inclusion of irrelevant variables and provides a more conservative estimate of model performance.
Finally, we explored how logistic regression can be used to predict customer loyalty, emphasizing the importance of feature selection and model improvement to enhance predictive accuracy. By incorporating additional variables and refining the model, we achieved modest improvements in accuracy and sensitivity, underscoring the iterative nature of machine learning.
In conclusion, linear and logistic regression are versatile and powerful tools for prediction and classification. Understanding their assumptions, applications, and evaluation metrics is essential for building effective machine learning models. As we continue to refine these models and explore more advanced techniques, we can unlock even greater insights and predictive capabilities in the field of data science.