Multiple Logistic Regression
Logistic regression also supports multiple explanatory variables. In this section, we will look at the case of two numeric explanatory variables, and for visualization, we will use color to denote the response.
For logistic regression, the response can take only two values (zero and one), and the predicted responses should all lie between zero and one.
From the modelling results and the plot, the most important thing to determine is whether the predictions are close to zero or close to one. A color gradient is useful in this scenario.
Load Packages and Dataset
The packages used are fst (for reading .fst files), dplyr (data manipulation), ggplot2 (visualization), and broom (tidying model output). The dataset is the Bank Churn dataset.
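A minimal setup sketch (the file name churn.fst is an assumption; tidyr and yardstick are also loaded here because expand_grid() and conf_mat() appear later in this section):

library(fst)        # read .fst files
library(dplyr)      # data manipulation
library(ggplot2)    # visualization
library(broom)      # tidy model output
library(tidyr)      # expand_grid(), used later
library(yardstick)  # conf_mat(), used later

# Read the Bank Churn dataset; the file name is an assumption
churn <- read_fst("churn.fst")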
Visualizing Multiple Numeric Variables in a Logistic Model
In total, four steps are taken to draw the plot below:
1. Using the churn dataset, plot the recency of purchase, time_since_last_purchase, versus the length of the customer relationship, time_since_first_purchase, colored by whether or not the customer churned, has_churned.
2. Add a point layer, with transparency set to 0.5.
3. Use a 2-color gradient, with midpoint 0.5 (useful because responses above 0.5 get one color and responses below 0.5 get another).
4. Use the black and white theme.
Gradient Visualization
As we can see below, the points are colored along a gradient. A two-color gradient is excellent for distinguishing the two cases of a positive and a negative response.
ggplot(churn, aes(time_since_first_purchase, time_since_last_purchase, color = has_churned)) +
geom_point(alpha = 0.5) +
scale_color_gradient2(midpoint = 0.5) +
theme_bw()

Logistic Regression with Two Explanatory Variables
One important note here (a key difference from the linear regression model) is that we run a generalized linear model with a binomial error family.
Fit a logistic regression of churn status versus length of relationship, recency, and their interaction. The model is named mdl_churn_vs_both_inter.
mdl_churn_vs_both_inter <- glm(
  has_churned ~ time_since_first_purchase * time_since_last_purchase,
  data = churn,
  family = binomial
)

Make a grid of explanatory data.
explanatory_data <- expand_grid(
  time_since_first_purchase = seq(-2, 4, 0.1),
  time_since_last_purchase = seq(-1, 6, 0.1)
)

Add a column of predictions using the model mdl_churn_vs_both_inter and explanatory_data, with type = "response". A response-type prediction gives the probability directly.
prediction_data <- explanatory_data %>%
  mutate(has_churned = predict(mdl_churn_vs_both_inter, explanatory_data, type = "response"))

Time to visualize: we update the plot seen earlier with the prediction data.
ggplot(churn, aes(time_since_first_purchase, time_since_last_purchase, color = has_churned)) +
  geom_point(alpha = 0.5) +
  scale_color_gradient2(midpoint = 0.5) +
  theme_bw() +
  geom_point(data = prediction_data, size = 3, shape = 15)
Presenting and Interpreting the Results
| | actual false | actual true |
|---|---|---|
| predicted false | correct | false negative |
| predicted true | false positive | correct |
Making a Confusion Matrix
Get the actual response from the dataset churn.
actual_response <- churn$has_churned

Get the predicted values from the model (set up earlier). Calling fitted() on the logistic regression model gives the fitted churn probabilities as decimals, and we round them to the nearest whole number (zero or one).
predicted_response <- round(fitted(mdl_churn_vs_both_inter))

Get a table of counts from the actual and predicted responses (steps 1 and 2), and call it outcomes.
outcomes <- table(predicted_response, actual_response)

Check out the table. It corresponds to the four-outcome table we presented earlier, and it tells us the results: in how many cases where the customer churned did the model predict correctly, and in how many cases where the customer didn't churn did the model predict correctly.
outcomes

##                   actual_response
## predicted_response   0   1
##                  0 102  53
##                  1  98 147

The results are converted into a confusion matrix so that summary statistics can be drawn from them. Note that this uses the conf_mat() function from the yardstick package.
confusion <- conf_mat(outcomes)

See the result.
confusion

##                   actual_response
## predicted_response   0   1
##                  0 102  53
##                  1  98 147

Autoplot (automatically plot) the confusion matrix.
autoplot(confusion)

Get the summary metrics from the confusion matrix. We want the statistics for the "churn" event, which is the second factor level (the second row and second column in our confusion matrix).
summary(confusion, event_level = "second")

## # A tibble: 13 × 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.622
##  2 kap                  binary         0.245
##  3 sens                 binary         0.735
##  4 spec                 binary         0.51
##  5 ppv                  binary         0.6
##  6 npv                  binary         0.658
##  7 mcc                  binary         0.251
##  8 j_index              binary         0.245
##  9 bal_accuracy         binary         0.622
## 10 detection_prevalence binary         0.612
## 11 precision            binary         0.6
## 12 recall               binary         0.735
## 13 f_meas               binary         0.661
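To connect these numbers back to the outcomes table, the three headline metrics can be reproduced by hand from the four cell counts (the TN/FN/FP/TP names are introduced here for illustration; "churn" is the positive event):

# Reproduce the headline metrics by hand from the outcomes table
TN <- 102  # predicted 0, actual 0
FN <- 53   # predicted 0, actual 1
FP <- 98   # predicted 1, actual 0
TP <- 147  # predicted 1, actual 1

(TP + TN) / (TP + TN + FP + FN)  # accuracy: 0.6225
TP / (TP + FN)                   # sensitivity (recall): 0.735
TN / (TN + FP)                   # specificity: 0.51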
Conclusion
Generating a confusion matrix and calculating metrics like accuracy, sensitivity, and specificity is the standard way to measure how well a logistic model fits.
The Logistic Distribution
Distribution function names - a general rule
| curve | prefix | normal | logistic | mnemonic |
|---|---|---|---|---|
| PDF | d | dnorm() | dlogis() | “d” for differentiate: we differentiate the CDF to get the PDF |
| CDF | p | pnorm() | plogis() | “p” is a backwards “q”, so it is the inverse of the inverse CDF |
| Inv. CDF | q | qnorm() | qlogis() | “q” is for quantile |
The logistic distribution CDF is also called the logistic function:

- \(cdf(x) = 1/(1+\exp(-x))\)

The inverse CDF is also called the logit function:

- \(cdf^{-1}(p) = \log(p/(1-p))\)
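As a quick illustration tying the table to the formulas, each function can be evaluated at its natural reference point:

# Evaluate the three logistic distribution functions at reference points
dlogis(0)    # PDF at x = 0: exp(0) / (1 + exp(0))^2 = 0.25
plogis(0)    # CDF at x = 0: 1 / (1 + exp(0)) = 0.5
qlogis(0.5)  # inverse CDF (logit) at p = 0.5: log(0.5 / 0.5) = 0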
Cumulative Distribution Function
Understanding the logistic distribution is key to understanding logistic regression. It is a probability distribution of a single continuous variable.
Create x values as a sequence from minus ten to ten in steps of 0.1.

x <- seq(-10, 10, 0.1)

Make logistic_x from x, transformed with the logistic distribution CDF.

logistic_x <- plogis(x)

Make logistic_x_man from x, transformed with a manually calculated logistic function using the equation \(cdf(x) = 1/(1+\exp(-x))\).

logistic_x_man <- 1 / (1 + exp(-x))

Check that both logistic transformations (logistic_x and logistic_x_man) have the same values with all.equal().

all.equal(logistic_x, logistic_x_man)

## [1] TRUE
Visualize the cumulative distribution function (CDF) for the logistic distribution. The plot has an S-shape, known as a sigmoid curve. It takes an input that can be any number from minus infinity to infinity, and returns a value between zero and one.
We first make a tibble from the previous outputs, then plot the CDF below.
logistic_distn_cdf <- tibble(x, logistic_x, logistic_x_man)
ggplot(logistic_distn_cdf, aes(x, logistic_x)) +
geom_line()

Inverse Cumulative Distribution Function
Same process as before, but now we are dealing with the inverse logistic function.
logistic_distn_inv_cdf <- tibble(
# Make a seq from 0.001 to 0.999 in steps of 0.001
p = seq(0.001, 0.999, 0.001),
# Transform with built-in logistic inverse CDF
logit_p = qlogis(p),
# Transform with manual logit
logit_p_man = log(p/(1-p))
)
# Check that each logistic function gives the same results
all.equal(logistic_distn_inv_cdf$logit_p, logistic_distn_inv_cdf$logit_p_man)

## [1] TRUE
And plot as before.
ggplot(logistic_distn_inv_cdf, aes(p, logit_p)) +
# Make it a line plot
geom_line()

Note that:
The logistic function (logistic distribution CDF) has another important property: each x input value is transformed to a unique value. That means that the transformation can be reversed.
The logit function is the name for the inverse logistic function, which is also the logistic distribution inverse cumulative distribution function. (All three terms mean exactly the same thing)
The logit function takes values between zero and one, and returns values between minus infinity and infinity.
The inverse CDF is the “opposite” transformation to the CDF.
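These properties can be verified numerically; a quick sketch using the built-in plogis() and qlogis() (x_check is a name introduced here):

# Verify the round trip: the logit undoes the logistic function
x_check <- seq(-10, 10, 0.1)
all.equal(qlogis(plogis(x_check)), x_check)  # expected: TRUE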
Binomial family argument
The difference between running a linear regression with lm() and running a logistic regression with glm() is that we have to set glm()'s family argument to binomial.
binomial() is a function that returns a list of other functions that tell glm() how to perform calculations in the regression.
The two most important functions are linkinv and linkfun, which are used for transforming variables from the whole number line (\(-\infty\), \(+\infty\)) to probabilities (zero to one) and back again.
Check the structure of the binomial family object, and assign the p values used in the calculations below.
str(binomial())

## List of 12
## $ family : chr "binomial"
## $ link : chr "logit"
## $ linkfun :function (mu)
## $ linkinv :function (eta)
## $ variance :function (mu)
## $ dev.resids:function (y, mu, wt)
## $ aic :function (y, n, mu, wt, dev)
## $ mu.eta :function (eta)
## $ initialize: language { if (NCOL(y) == 1) { ...
## $ validmu :function (mu)
## $ valideta :function (eta)
## $ simulate :function (object, nsim)
## - attr(*, "class")= chr "family"
p <- seq(0.01, 0.99, 0.01)

Learn the functions linkinv and linkfun. They are accessed as binomial()$linkinv and binomial()$linkfun.
# Call the link inverse on x
linkinv_x <- binomial()$linkinv(x)
# Check linkinv_x and plogis() of x give same results
all.equal(linkinv_x, plogis(x))

## [1] TRUE
# Call the link fun on p
linkfun_p <- binomial()$linkfun(p)
# Check linkfun_p and qlogis() of p give same results
all.equal(linkfun_p, qlogis(p))

## [1] TRUE
The binomial() family object contains linkinv and linkfun functions that correspond to the logistic distribution CDF and inverse CDF, respectively. They are used to translate (or "convert") between numbers and probabilities.
The logistic distribution consists of a whole family of curves specified by location and scale parameters, which allows logistic model prediction curves to have different positions and steepnesses.
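A minimal sketch of this, using base R's plogis(), which accepts location and scale arguments (location = 2 and scale = 1.5 are arbitrary example values):

# Shift and stretch the logistic CDF via plogis()'s location and scale arguments
# (reusing x = seq(-10, 10, 0.1) from earlier)
shifted <- plogis(x, location = 2, scale = 1.5)
# Manual equivalent: subtract the location, then divide by the scale
manual <- 1 / (1 + exp(-(x - 2) / 1.5))
all.equal(shifted, manual)  # expected: TRUE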
The Logistic Regression
Unlike in linear regression, in logistic regression the actual response is always either zero or one, and the predicted response lies between these two values. It turns out that the sum of squares metric optimizes poorly under these restrictions, and that there is a better metric.
The better metric here is likelihood: for each observation it is y_pred * y_actual + (1 - y_pred) * (1 - y_actual), which gives a higher score when the predicted response is close to the actual response. For example, if y_actual is 1 and y_pred is 0.9, the likelihood contribution is 0.9; if y_pred is only 0.1, the contribution is 0.1.

It is more efficient to compute the log-likelihood, log(y_pred) * y_actual + log(1 - y_pred) * (1 - y_actual), which gives the same coefficients during optimization. Since optim() minimizes rather than maximizes, the metric we compute is minus the sum of the log-likelihoods:

-sum(log_likelihoods)
x_actual <- churn$time_since_last_purchase
y_actual <- churn$has_churned

Logistic regression algorithm
# Set the intercept to 1
intercept <- 1
# Set the slope to 0.5
slope <- 0.5
# Calculate the predicted y values
y_pred <- plogis(intercept + slope * x_actual)
# Calculate the log-likelihood for each term
log_likelihoods <- log(y_pred) * y_actual + log(1-y_pred) * (1-y_actual)
# Calculate minus the sum of the log-likelihoods for each term
-sum(log_likelihoods)

## [1] 326.2599
Create the Function
calc_neg_log_likelihood <- function(coeffs) {
# Get the intercept coeff
intercept <- coeffs[1]
# Get the slope coeff
slope <- coeffs[2]
# Calculate the predicted y values
y_pred <- plogis(intercept + slope * x_actual)
# Calculate the log-likelihood for each term
log_likelihoods <- log(y_pred) * y_actual + log(1-y_pred) * (1-y_actual)
# Calculate minus the sum of the log-likelihoods for each term
-sum(log_likelihoods)
}

Optimize the negative log-likelihood metric.
# Optimize the metric
optim(
# Initially guess 0 intercept and 1 slope
par = c(intercept = 0, slope = 1),
# Use calc_neg_log_likelihood as the optimization fn
fn = calc_neg_log_likelihood
)

## $par
## intercept slope
## -0.03478255 0.26890041
##
## $value
## [1] 273.2002
##
## $counts
## function gradient
## 51 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
# Compare the coefficients to those calculated by glm()
glm(has_churned ~ time_since_last_purchase, data = churn, family = binomial)

##
## Call: glm(formula = has_churned ~ time_since_last_purchase, family = binomial,
## data = churn)
##
## Coefficients:
## (Intercept) time_since_last_purchase
## -0.03502 0.26921
##
## Degrees of Freedom: 399 Total (i.e. Null); 398 Residual
## Null Deviance: 554.5
## Residual Deviance: 546.4 AIC: 550.4
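The intercept and slope from optim() (-0.0348, 0.2689) agree with glm()'s (-0.0350, 0.2692) to about three decimal places; small differences are expected because optim()'s default Nelder-Mead tolerance is looser than glm()'s fitting procedure. A quick numerical check (a minimal sketch; mdl_glm and optim_coeffs are names introduced here):

# Refit with glm() and compare the coefficient estimates numerically
mdl_glm <- glm(has_churned ~ time_since_last_purchase, data = churn, family = binomial)
optim_coeffs <- c(-0.03478255, 0.26890041)  # $par values from optim() above
round(coef(mdl_glm) - optim_coeffs, 3)      # both differences round to 0.000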