Multiple Logistic Regression
Logistic regression also supports multiple explanatory variables. In this section, we will look at the case of two numeric explanatory variables, and for visualization, we will use color to denote the response.
For logistic regression, the response can take only two values (zero and one), and the predicted responses should all lie between zero and one.
From the modelling results and the plot, the most important thing to determine is whether the predictions are close to zero or close to one. A color gradient is useful in this scenario.
Load Packages and Dataset
The packages used are fst (for reading .fst files), dplyr (data manipulation), ggplot2 (visualization), and broom (tidying model output). The dataset is the Bank Churn dataset.
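A minimal setup sketch (the file name churn.fst is an assumption; tidyr and yardstick are also loaded here because expand_grid() and conf_mat() appear later in this section):

library(fst)        # read .fst files
library(dplyr)      # data manipulation
library(ggplot2)    # visualization
library(broom)      # tidy model output
library(tidyr)      # expand_grid(), used later
library(yardstick)  # conf_mat(), used later

# Read the Bank Churn dataset; the file name is an assumption
churn <- read_fst("churn.fst")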
Visualizing Multiple Numeric Variables in a Logistic Model
In total, four steps are taken to draw the plot below:
1. Using the churn dataset, plot the recency of purchase, time_since_last_purchase, versus the length of the customer relationship, time_since_first_purchase, colored by whether or not the customer churned, has_churned.
2. Add a point layer, with transparency set to 0.5.
3. Use a 2-color gradient, with midpoint 0.5 (useful because responses above 0.5 get one color and responses below 0.5 get another).
4. Use the black and white theme.
Gradient Visualization
As we can see below, the points are colored along a gradient. A two-color gradient is excellent for distinguishing the two cases of a positive and a negative response.
ggplot(churn, aes(time_since_first_purchase, time_since_last_purchase, color = has_churned)) +
geom_point(alpha = 0.5) +
scale_color_gradient2(midpoint = 0.5) +
theme_bw()

Logistic Regression with Two Explanatory Variables
One important note here (a key difference from the linear regression model) is that we run a generalized linear model with a binomial error family.
Fit a logistic regression of churn status versus length of relationship, recency, and their interaction. The model is named mdl_churn_vs_both_inter.
mdl_churn_vs_both_inter <- glm(
  has_churned ~ time_since_first_purchase * time_since_last_purchase,
  data = churn,
  family = binomial
)

Make a grid of explanatory data.
explanatory_data <- expand_grid(
  time_since_first_purchase = seq(-2, 4, 0.1),
  time_since_last_purchase = seq(-1, 6, 0.1)
)

Add a column of predictions using the model mdl_churn_vs_both_inter and explanatory_data, with type = "response". A response-type prediction gives the probability directly.
prediction_data <- explanatory_data %>%
  mutate(has_churned = predict(mdl_churn_vs_both_inter, explanatory_data, type = "response"))

Time to visualize: we update the plot seen earlier with the prediction data.
ggplot(churn, aes(time_since_first_purchase, time_since_last_purchase, color = has_churned)) +
  geom_point(alpha = 0.5) +
  scale_color_gradient2(midpoint = 0.5) +
  theme_bw() +
  geom_point(data = prediction_data, size = 3, shape = 15)
Presenting and Interpreting the Results
| | actual false | actual true |
|---|---|---|
| predicted false | correct | false negative |
| predicted true | false positive | correct |
Making a Confusion Matrix
Get the actual response from the dataset churn.
actual_response <- churn$has_churned

Get the predicted values from the model (set up earlier). Calling fitted() on the logistic regression model gives the fitted churn probabilities as decimals, and we round them to the nearest whole number (zero or one).
predicted_response <- round(fitted(mdl_churn_vs_both_inter))

Get a table of counts from the actual and predicted responses (steps 1 and 2), and call it outcomes.
outcomes <- table(predicted_response, actual_response)

Check out the table. It corresponds to the four-outcome table we presented earlier, and it tells us the results: in how many cases where the customer churned did the model predict correctly, and in how many cases where the customer didn't churn did the model predict correctly.
outcomes

##                   actual_response
## predicted_response   0   1
##                  0 102  53
##                  1  98 147

The results are converted into a confusion matrix so that summary statistics can be drawn from them. Note that this uses the conf_mat() function from the yardstick package.
confusion <- conf_mat(outcomes)

See the result.
confusion

##                   actual_response
## predicted_response   0   1
##                  0 102  53
##                  1  98 147

Autoplot (automatically plot) the confusion matrix.
autoplot(confusion)

Get the summary metrics from the confusion matrix. We want the statistics for the "churn" event, which is the second factor level (the second row and second column in our confusion matrix).
summary(confusion, event_level = "second")

## # A tibble: 13 × 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.622
##  2 kap                  binary         0.245
##  3 sens                 binary         0.735
##  4 spec                 binary         0.51
##  5 ppv                  binary         0.6
##  6 npv                  binary         0.658
##  7 mcc                  binary         0.251
##  8 j_index              binary         0.245
##  9 bal_accuracy         binary         0.622
## 10 detection_prevalence binary         0.612
## 11 precision            binary         0.6
## 12 recall               binary         0.735
## 13 f_meas               binary         0.661
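To connect these numbers back to the outcomes table, the three headline metrics can be reproduced by hand from the four cell counts (the TN/FN/FP/TP names are introduced here for illustration; "churn" is the positive event):

# Reproduce the headline metrics by hand from the outcomes table
TN <- 102  # predicted 0, actual 0
FN <- 53   # predicted 0, actual 1
FP <- 98   # predicted 1, actual 0
TP <- 147  # predicted 1, actual 1

(TP + TN) / (TP + TN + FP + FN)  # accuracy: 0.6225
TP / (TP + FN)                   # sensitivity (recall): 0.735
TN / (TN + FP)                   # specificity: 0.51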
Conclusion
Generating a confusion matrix and calculating metrics like accuracy, sensitivity, and specificity is the standard way to measure how well a logistic model fits.
The Logistic Distribution
Distribution function names - a general rule
| curve | prefix | normal | logistic | mnemonic |
|---|---|---|---|---|
| PDF | d | dnorm() | dlogis() | “d” for differentiate: we differentiate the CDF to get the PDF |
| CDF | p | pnorm() | plogis() | “p” is a backwards “q”, so it is the inverse of the inverse CDF |
| Inv. CDF | q | qnorm() | qlogis() | “q” is for quantile |
The logistic distribution CDF is also called the logistic function:

- \(cdf(x) = 1/(1+\exp(-x))\)

The inverse CDF is also called the logit function:

- \(cdf^{-1}(p) = \log(p/(1-p))\)
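As a quick illustration tying the table to the formulas, each function can be evaluated at its natural reference point:

# Evaluate the three logistic distribution functions at reference points
dlogis(0)    # PDF at x = 0: exp(0) / (1 + exp(0))^2 = 0.25
plogis(0)    # CDF at x = 0: 1 / (1 + exp(0)) = 0.5
qlogis(0.5)  # inverse CDF (logit) at p = 0.5: log(0.5 / 0.5) = 0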
Cumulative Distribution Function
Understanding the logistic distribution is key to understanding logistic regression. It is a probability distribution of a single continuous variable.
Create x values as a sequence from minus ten to ten in steps of 0.1.

x <- seq(-10, 10, 0.1)

Make logistic_x from x, transformed with the logistic distribution CDF.

logistic_x <- plogis(x)

Make logistic_x_man from x, transformed with a manually calculated logistic function using the equation \(cdf(x) = 1/(1+\exp(-x))\).

logistic_x_man <- 1 / (1 + exp(-x))

Check that both logistic transformations (logistic_x and logistic_x_man) have the same values with all.equal().

all.equal(logistic_x, logistic_x_man)

## [1] TRUE
Visualize the cumulative distribution function (CDF) for the logistic distribution. The plot has an S-shape, known as a sigmoid curve. It takes an input that can be any number from minus infinity to infinity, and returns a value between zero and one.
We first make a tibble from the previous outputs, then plot the CDF below.
logistic_distn_cdf <- tibble(x, logistic_x, logistic_x_man)
ggplot(logistic_distn_cdf, aes(x, logistic_x)) +
geom_line()

Inverse Cumulative Distribution Function
Same process as before, but now we are dealing with the inverse logistic function.
logistic_distn_inv_cdf <- tibble(
# Make a seq from 0.001 to 0.999 in steps of 0.001
p = seq(0.001, 0.999, 0.001),
# Transform with built-in logistic inverse CDF
logit_p = qlogis(p),
# Transform with manual logit
logit_p_man = log(p/(1-p))
)
# Check that each logistic function gives the same results
all.equal(logistic_distn_inv_cdf$logit_p, logistic_distn_inv_cdf$logit_p_man)

## [1] TRUE
And plot as before.
ggplot(logistic_distn_inv_cdf, aes(p, logit_p)) +
# Make it a line plot
geom_line()

Note that:
The logistic function (logistic distribution CDF) has another important property: each x input value is transformed to a unique value. That means that the transformation can be reversed.
The logit function is the name for the inverse logistic function, which is also the logistic distribution inverse cumulative distribution function. (All three terms mean exactly the same thing)
The logit function takes values between zero and one, and returns values between minus infinity and infinity.
The inverse CDF is the “opposite” transformation to the CDF.
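These properties can be verified numerically; a quick sketch using the built-in plogis() and qlogis() (x_check is a name introduced here):

# Verify the round trip: the logit undoes the logistic function
x_check <- seq(-10, 10, 0.1)
all.equal(qlogis(plogis(x_check)), x_check)  # expected: TRUE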
Binomial family argument
The difference between running a linear regression with lm() and running a logistic regression with glm() is that we have to set glm()'s family argument to binomial.
binomial() is a function that returns a list of other functions that tell glm() how to perform calculations in the regression.
The two most important functions are linkinv and linkfun, which are used for transforming variables from the whole number line (\(-\infty\), \(+\infty\)) to probabilities (zero to one) and back again.
Check the structure of the binomial family object, and assign the p values used in the calculations below.
str(binomial())

## List of 12
## $ family : chr "binomial"
## $ link : chr "logit"
## $ linkfun :function (mu)
## $ linkinv :function (eta)
## $ variance :function (mu)
## $ dev.resids:function (y, mu, wt)
## $ aic :function (y, n, mu, wt, dev)
## $ mu.eta :function (eta)
## $ initialize: language { if (NCOL(y) == 1) { ...
## $ validmu :function (mu)
## $ valideta :function (eta)
## $ simulate :function (object, nsim)
## - attr(*, "class")= chr "family"
p <- seq(0.01, 0.99, 0.01)

Learn the functions linkinv and linkfun. They are accessed as binomial()$linkinv and binomial()$linkfun.
# Call the link inverse on x
linkinv_x <- binomial()$linkinv(x)
# Check linkinv_x and plogis() of x give same results
all.equal(linkinv_x, plogis(x))

## [1] TRUE
# Call the link fun on p
linkfun_p <- binomial()$linkfun(p)
# Check linkfun_p and qlogis() of p give same results
all.equal(linkfun_p, qlogis(p))

## [1] TRUE
The binomial() family object contains linkinv and linkfun functions that correspond to the logistic distribution CDF and inverse CDF, respectively. They are used to translate (or "convert") between numbers and probabilities.
The logistic distribution consists of a whole family of curves specified by location and scale parameters, which allows logistic model prediction curves to have different positions and steepnesses.
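A minimal sketch of this, using base R's plogis(), which accepts location and scale arguments (location = 2 and scale = 1.5 are arbitrary example values):

# Shift and stretch the logistic CDF via plogis()'s location and scale arguments
# (reusing x = seq(-10, 10, 0.1) from earlier)
shifted <- plogis(x, location = 2, scale = 1.5)
# Manual equivalent: subtract the location, then divide by the scale
manual <- 1 / (1 + exp(-(x - 2) / 1.5))
all.equal(shifted, manual)  # expected: TRUE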
The Logistic Regression
Unlike in linear regression, in logistic regression the actual response is always either zero or one, and the predicted response lies between these two values. It turns out that the sum of squares metric optimizes poorly under these restrictions, and that there is a better metric.
The better metric here is likelihood: for each observation it is y_pred * y_actual + (1 - y_pred) * (1 - y_actual), which gives a higher score when the predicted response is close to the actual response. For example, if y_actual is 1 and y_pred is 0.9, the likelihood contribution is 0.9; if y_pred is only 0.1, the contribution is 0.1.

It is more efficient to compute the log-likelihood, log(y_pred) * y_actual + log(1 - y_pred) * (1 - y_actual), which gives the same coefficients during optimization. Since optim() minimizes rather than maximizes, the metric we compute is minus the sum of the log-likelihoods:

-sum(log_likelihoods)
x_actual <- churn$time_since_last_purchase
y_actual <- churn$has_churned

Logistic regression algorithm
# Set the intercept to 1
intercept <- 1
# Set the slope to 0.5
slope <- 0.5
# Calculate the predicted y values
y_pred <- plogis(intercept + slope * x_actual)
# Calculate the log-likelihood for each term
log_likelihoods <- log(y_pred) * y_actual + log(1-y_pred) * (1-y_actual)
# Calculate minus the sum of the log-likelihoods for each term
-sum(log_likelihoods)

## [1] 326.2599
Create the Function
calc_neg_log_likelihood <- function(coeffs) {
# Get the intercept coeff
intercept <- coeffs[1]
# Get the slope coeff
slope <- coeffs[2]
# Calculate the predicted y values
y_pred <- plogis(intercept + slope * x_actual)
# Calculate the log-likelihood for each term
log_likelihoods <- log(y_pred) * y_actual + log(1-y_pred) * (1-y_actual)
# Calculate minus the sum of the log-likelihoods for each term
-sum(log_likelihoods)
}

Optimize the negative log-likelihood metric.
# Optimize the metric
optim(
# Initially guess 0 intercept and 1 slope
par = c(intercept = 0, slope = 1),
# Use calc_neg_log_likelihood as the optimization fn
fn = calc_neg_log_likelihood
)

## $par
## intercept slope
## -0.03478255 0.26890041
##
## $value
## [1] 273.2002
##
## $counts
## function gradient
## 51 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
# Compare the coefficients to those calculated by glm()
glm(has_churned ~ time_since_last_purchase, data = churn, family = binomial)

##
## Call: glm(formula = has_churned ~ time_since_last_purchase, family = binomial,
## data = churn)
##
## Coefficients:
## (Intercept) time_since_last_purchase
## -0.03502 0.26921
##
## Degrees of Freedom: 399 Total (i.e. Null); 398 Residual
## Null Deviance: 554.5
## Residual Deviance: 546.4 AIC: 550.4
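The intercept and slope from optim() (-0.0348, 0.2689) agree with glm()'s (-0.0350, 0.2692) to about three decimal places; small differences are expected because optim()'s default Nelder-Mead tolerance is looser than glm()'s fitting procedure. A quick numerical check (a minimal sketch; mdl_glm and optim_coeffs are names introduced here):

# Refit with glm() and compare the coefficient estimates numerically
mdl_glm <- glm(has_churned ~ time_since_last_purchase, data = churn, family = binomial)
optim_coeffs <- c(-0.03478255, 0.26890041)  # $par values from optim() above
round(coef(mdl_glm) - optim_coeffs, 3)      # both differences round to 0.000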