This document was composed from Dr. Snopkowski’s ANTH 504 Week 13 lecture and the University of Cincinnati UC Business Analytics R Programming Guide.
§ When and why do we use logistic regression?
§ Binary (categorical) output
§ Theory behind logistic regression
§ Assessing the model
§ Assessing predictors
§ Things that can go wrong
§ Interpreting logistic regression
§ To predict an outcome variable that is categorical from one or more categorical or continuous predictor variables.
§ Logistic regression is also referred to as: logit regression or logit model.
§ We will be estimating the probability of a categorical response based on one or more predictor variables (X).
§ Why not use linear regression?
§ Categorical outcome variable violates the assumption of linearity in normal regression.
§ We would like a model that ranges from 0 to 1 and doesn’t predict outcomes less than 0 or greater than 1.
§ Outcome
§ We predict the probability of the outcome occurring
§ b0 and b1
§ Can be thought of in much the same way as multiple regression
§ Note the normal regression equation forms part of the logistic regression equation
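For reference, a standard way to write the simple logistic regression model — note that b0 + b1X1, the ordinary regression equation, sits inside the exponent — is:

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$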
§ Outcome
§ We still predict the probability of the outcome occurring
§ Differences
§ Note the multiple regression equation forms part of the logistic regression equation
§ This part of the equation expands to accommodate additional predictors
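With additional predictors, that part of the equation expands. A standard way to write the multiple-predictor form is:

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n)}}$$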
§ The model is assessed through a method called: Maximum Likelihood
§ We seek estimates of the betas such that the predicted probability corresponds as closely as possible to the individual’s observed value.
§ This is formalized as a mathematical equation called the likelihood function, which the statistical software tries to “maximize”:
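For binary data, a standard way to write this likelihood function, where Y_i is person i’s observed outcome (0 or 1) and P(Y_i) is the model’s predicted probability for that person, is:

$$L(b_0, b_1, \ldots) = \prod_{i=1}^{N} P(Y_i)^{\,Y_i}\,\bigl(1 - P(Y_i)\bigr)^{\,1 - Y_i}$$

In practice the software maximizes the log of this quantity (the log-likelihood), which turns the product into a sum.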
§ We are going to use data from the Default dataset. This is simulated data with information on 10,000 customers such as whether the customer defaulted, is a student, the average balance carried by the customer, and the income of the customer.
§ We will test whether balance influences the likelihood of default. What are our independent and dependent variables?
#install.packages(c("ISLR", "pscl"))
library(pscl)
library(tidyverse)
default <- as_tibble(ISLR::Default)
head(default)
## # A tibble: 6 × 4
## default student balance income
## <fct> <fct> <dbl> <dbl>
## 1 No No 730. 44362.
## 2 No Yes 817. 12106.
## 3 No No 1074. 31767.
## 4 No No 529. 35704.
## 5 No No 786. 38463.
## 6 No Yes 920. 7492.
§ We should always try to visualize our data before analysis. How might you try to visualize this data?
plot(default$balance, as.numeric(default$default))
default %>% group_by(default) %>%
summarize(mean_balance = mean(balance))
## # A tibble: 2 × 2
## default mean_balance
## <fct> <dbl>
## 1 No 804.
## 2 Yes 1748.
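Another option, since the tidyverse is already loaded, is to plot the raw 0/1 outcomes against balance and overlay a fitted logistic curve with ggplot2. This is just a sketch of one possible visualization (the default_num column is created only for plotting):
# plot default (recoded 0/1) against balance and overlay a fitted logistic curve
default %>%
  mutate(default_num = ifelse(default == "Yes", 1, 0)) %>%
  ggplot(aes(x = balance, y = default_num)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(x = "Balance", y = "Probability of default")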
Notice that balance is continuous and therefore must be the X, or independent, variable; default is binary and must be the dependent variable in a logistic regression.
To run logistic regression in R, we’ll use the glm()
function (which stands for generalized linear model) and tell it which type
of model to run with family = "binomial".
glm() works similarly to lm().
We are looking to see whether balance predicts default.
model <- glm(default ~ balance, family = "binomial", data = default)
print(model)
##
## Call: glm(formula = default ~ balance, family = "binomial", data = default)
##
## Coefficients:
## (Intercept) balance
## -10.651331 0.005499
##
## Degrees of Freedom: 9999 Total (i.e. Null); 9998 Residual
## Null Deviance: 2921
## Residual Deviance: 1596 AIC: 1600
summary(model)
##
## Call:
## glm(formula = default ~ balance, family = "binomial", data = default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2697 -0.1465 -0.0589 -0.0221 3.7589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
## balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1596.5 on 9998 degrees of freedom
## AIC: 1600.5
##
## Number of Fisher Scoring iterations: 8
default is both the name of a column and the name of a
dataframe. The first default refers to the column; the
second default refers to the dataframe.
default ~ balance has default column listed
first because it is the dependent variable. This follows the
dependent ~ independent format.
In this analysis, we used logistic regression to examine the relationship between a customer’s balance (independent variable) and the likelihood of defaulting (dependent variable).
Here’s an interpretation of the output:
Model coefficients: The logistic regression model produces two coefficients: one for the intercept (-10.651331) and one for the balance (0.005499). These coefficients indicate the estimated effect of the balance on the log-odds of defaulting. Specifically, for each one-unit increase in balance, the log-odds of defaulting increase by 0.005499. Since these values are in log-odds, it’s often more intuitive to convert them to odds ratios.
Odds Ratio: To obtain the odds ratio, we need to
exponentiate the coefficient for the balance. In this case,
exp(0.005499) = 1.00551. This means that, on average, for
each one-unit increase in balance, the odds of defaulting are 1.00551
times higher, keeping all other factors constant.
§ Indicates the change in odds resulting from a unit change in the predictor.
§ OR > 1: Predictor increases, Probability of outcome occurring increases.
§ OR < 1: Predictor increases, Probability of outcome occurring decreases.
exp(coef(model))
## (Intercept) balance
## 2.366933e-05 1.005514e+00
For every dollar increase in the monthly balance carried, the odds of the customer defaulting are multiplied by about 1.0055.
Significance: The p-values for both the intercept and balance are less than 0.001 (indicated by ***), suggesting that the balance is a statistically significant predictor of defaulting.
Model fit: The null deviance (2920.6) represents the goodness-of-fit of the intercept-only model (i.e., without any predictor variables), while the residual deviance (1596.5) represents the goodness-of-fit of the current model. Comparing these values indicates that the current model with balance as a predictor variable provides a better fit to the data than the intercept-only model.
AIC: The Akaike Information Criterion (AIC) is a measure of the relative quality of a statistical model. Lower values indicate better models. In this case, the AIC is 1600.5. If we were to compare multiple models, we could use the AIC to help determine which model is the best fit for the data.
§ AIC score: smaller = better fit
In conclusion, the logistic regression analysis suggests that there is a significant relationship between the balance and the likelihood of defaulting. As the balance increases, the odds of defaulting also increase.
The coefficient on balance means that an increase in balance is associated with an increase in the probability of default: a one-unit increase in balance is associated with an increase in the log odds of default of 0.0055.
§ While you can calculate model predictions by hand, it’s easier to do it using R.
§ For instance, let’s predict the default rate for a person with a balance of $1000.
Plug and chug method:
1/(1+exp(-(-10.651+(0.005499*1000))))
## [1] 0.005754511
OR
predict(model, data.frame(balance=1000), type="response")
## 1
## 0.005752145
The data.frame(balance = 1000) argument needs to be included, or else predict() will create a prediction for each row of the original data.
type = "response" asks for predictions on the scale of the response variable, i.e., probabilities rather than log-odds.
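The same approach works for several balances at once. For example (a quick sketch; the probabilities quoted below are approximate values worked out from the printed coefficients):
# predicted probability of default at a few different balances
predict(model, data.frame(balance = c(1000, 1500, 2000)), type = "response")
This gives roughly 0.006, 0.08, and 0.59, illustrating how the predicted probability of default climbs steeply, rather than linearly, as balance grows.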
§ There is no R2 like there is in regression, but a variety of pseudo-R2 values have been developed.
§ One of the most frequently used is McFadden’s R2
§ Its measure ranges from 0 to just under 1.
§ McFadden states that models with a McFadden’s pseudo R2 of 0.4 represent a very good fit.
pR2(model)["McFadden"]
## fitting null model for pseudo-r2
## McFadden
## 0.4533916
This number is really good, but that is because this is simulated data. In the real world, you probably won’t get a value this high, especially in anthropology.
The deviance (−2 × the log-likelihood) is an indicator of how much unexplained information still exists in the model after it has been fitted. Large values indicate poorly fitting models.
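If you want to see these quantities for the fitted model, they can be pulled out directly in R (a minimal sketch; for ungrouped binary data like ours the residual deviance equals −2 × the log-likelihood):
logLik(model)                   # log-likelihood of the fitted model
deviance(model)                 # residual deviance (1596.5 in the output above)
-2 * as.numeric(logLik(model))  # same value, computed from the log-likelihood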
§ Forced entry: all variables entered simultaneously.
§ Hierarchical: variables entered in blocks.
§ Blocks should be based on past research, or theory being tested. Good method.
§ Stepwise: variables entered on the basis of statistical criteria (i.e. relative contribution to predicting outcome).
§ Should be used only for exploratory analysis.
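For completeness, here is a minimal sketch of what stepwise selection looks like in R using the built-in step() function, which adds or drops predictors based on AIC (the object name full_model is just for illustration). Again, this approach should only be used for exploratory analysis:
# fit a model with all candidate predictors, then let step() drop terms by AIC
full_model <- glm(default ~ balance + income + student, family = "binomial", data = default)
step(full_model, direction = "backward")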
Let’s do another model where we add in “income” as another variable that may predict someone’s likelihood of defaulting.
model2 <- glm(default ~ balance + income, family = "binomial", data=default)
summary(model2)
##
## Call:
## glm(formula = default ~ balance + income, family = "binomial",
## data = default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
The logistic regression model produces three coefficients: one for the intercept (-11.54), one for the balance (0.005647), and one for the income (0.00002081). These coefficients indicate the estimated effect of the balance and income on the log-odds of defaulting. Specifically, for each one-unit increase in balance, the log-odds of defaulting increase by 0.005647, and for each one-unit increase in income, the log-odds of defaulting increase by 0.00002081.
To obtain the odds ratios, we exponentiate the coefficients for
balance and income. In this case, exp(0.005647) = 1.00566
and exp(0.00002081) = 1.00002081. This means that, on
average, for each one-unit increase in balance, the odds of defaulting
are 1.00566 times higher, keeping all other factors constant. Similarly,
for each one-unit increase in income, the odds of defaulting are
1.00002081 times higher, keeping all other factors constant.
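As with the first model, these odds ratios can be pulled straight from the fitted model by exponentiating its coefficients:
exp(coef(model2))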
The p-values for the intercept, balance, and income are all less than 0.001 (indicated by ***), suggesting that both balance and income are statistically significant predictors of defaulting.
The logistic regression analysis suggests that there is a significant relationship between both balance and income and the likelihood of defaulting. As the balance and income increase, the odds of defaulting also increase. However, it’s essential to remember that correlation does not imply causation, and there might be other factors that contribute to this relationship.
Then to test if this model is “significantly” better, we run:
anova(model, model2, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: default ~ balance
## Model 2: default ~ balance + income
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 9998 1596.5
## 2 9997 1579.0 1 17.485 2.895e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
test="Chisq" tells R that that we are using a binary
outcome. This is just another way to compare our models.
Here’s an interpretation of the output:
Resid. Df: The residual degrees of freedom for Model 1 and Model 2 are 9998 and 9997, respectively.
Resid. Dev: The residual deviance for Model 1 (1596.5) and Model 2 (1579.0) measures the goodness-of-fit of each model. Lower values indicate a better fit.
Df: The difference in degrees of freedom between the two models (9998 - 9997) is 1. This is because Model 2 has one additional predictor variable (income) compared to Model 1.
Deviance: The difference in residual deviance between the two models is 17.485. This value represents the reduction in residual deviance when moving from Model 1 to Model 2.
Pr(>Chi): The p-value for the test is 2.895e-05, which is very small (less than 0.001, indicated by ***). This suggests that there is a statistically significant difference between the two models, and Model 2 (with income as an additional predictor variable) provides a significantly better fit to the data than Model 1.
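As a sanity check, the same p-value can be computed by hand from the numbers in the table: it is the upper tail of a chi-squared distribution with 1 degree of freedom evaluated at the drop in deviance (this reproduces, up to rounding, the value shown above):
pchisq(1596.5 - 1579.0, df = 1, lower.tail = FALSE)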
In conclusion, the Analysis of Deviance test indicates that adding income to the model significantly improves its ability to predict the likelihood of defaulting. This suggests that both balance and income are important factors to consider when predicting the probability of defaulting.
§ Assumptions from linear regression:
§ Independence of errors - if violated, this can lead to overdispersion
§ Multicollinearity
§ Unique problems
§ Incomplete information
§ Complete separation
§ Categorical predictors:
§ Predicting cancer from smoking and eating tomatoes.
§ We don’t know what happens when non-smokers eat tomatoes because we have no data in this cell of the design.
§ Continuous variables
§ Will your sample include an 80-year-old, highly anxious, Buddhist, left-handed lesbian?
§ When the outcome variable can be perfectly predicted.
§ E.g. predicting whether someone is a burglar, your teenage son or your cat based on weight.
§ Weight is a perfect predictor of cat/burglar unless you have a very fat cat indeed
If there is no overlapping data in the middle of the chart, R can’t determine what the fitted line should look like. This is usually not an issue.
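To see what complete separation looks like in practice, here is a toy example with made-up weights for cats and burglars (purely for illustration); glm() will typically warn that fitted probabilities numerically 0 or 1 occurred, and the coefficient estimates blow up:
# made-up data: every cat weighs less than every burglar,
# so weight separates the outcome perfectly
toy <- data.frame(
  weight  = c(3, 4, 5, 6, 70, 80, 90, 100),
  burglar = c(0, 0, 0, 0, 1, 1, 1, 1)
)
glm(burglar ~ weight, family = "binomial", data = toy)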
§ Assumptions to Test
Check standardized residuals – are there any large outliers? Values greater than 3 might be problematic.
Multicollinearity – Why is this a problem? If you have two highly
correlated predictors, the model struggles to determine which one is
associated with the outcome because the two are so strongly associated
with each other. How do we test for it? vif() is the function we use to
check it.
Check Cook’s Distance – influential points
• Values should be less than 1
plot(model2, which=4, id.n=5)
max(cooks.distance(model2))
## [1] 0.0167895
max gives the largest Cook’s distance. If it is greater
than one, it suggests that at least one data point has a large influence
on the estimated parameters.
Standardized Residuals – are there any outliers?
standard_res <- rstandard(model2)
plot(standard_res)
You can take a look at these larger values to try to determine whether any
are outside of what we would expect. We have a few standardized residuals
greater than 3, but out of 10,000 data points that is well under 1% of the
data, which is fine.
You can also look at the observations with the largest values to figure
out why they are so extreme. Presumably these are people who had a really
high balance but didn’t default, or who defaulted despite a really low
balance.
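To see exactly which observations those are, you can pick out the rows whose standardized residuals exceed 3 in absolute value (a quick sketch):
which(abs(standard_res) > 3)      # row numbers of the large residuals
default[abs(standard_res) > 3, ]  # the corresponding customers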
Multicollinearity exists if predictors are highly correlated. We will use variance inflation factors (VIF) to determine whether our predictors are too highly correlated. A VIF is essentially calculated by taking each predictor in turn, treating it as the outcome variable, and using the other predictors in the model to predict it. This indicates how ‘correlated’ the predictors are.
library(car)
vif(model2)
## balance income
## 1.045605 1.045605
This shows the VIF values for our two predictors. In this case they are very low. Less than 2.5 is great; you have a real problem if a value is over 10.
§ Check the Estimates to see the regression parameters for any predictors in the model.
§ Use the odds ratio, exp(coef(model)) for
interpretation.
§ OR > 1, then as the predictor increases, the odds of the outcome occurring increase.
§ OR < 1, then as the predictor increases, the odds of the outcome occurring decrease.
§ Use the anova() function to test hierarchical models
§ Check assumptions:
§ Cook’s distance
§ Standardized Residuals
§ Multicollinearity