This document was composed from Dr. Snopkowski’s ANTH 504 Week 13 lecture and the University of Cincinnati UC Business Analytics R Programming Guide.
§ When and why do we use logistic regression?
§ Binary (categorical) output
§ Theory behind logistic regression
§ Assessing the model
§ Assessing predictors
§ Things that can go wrong
§ Interpreting logistic regression
§ To predict an outcome variable that is categorical from one or more categorical or continuous predictor variables.
§ Logistic regression is also referred to as: logit regression or logit model.
§ We will be estimating the probability of a categorical response based on one or more predictor variables (X).
§ Why not use linear regression?
§ Categorical outcome variable violates the assumption of linearity in normal regression.
§ We would like a model that ranges from 0 to 1 and doesn’t predict outcomes less than 0 or greater than 1.
§ Outcome
§ We predict the probability of the outcome occurring
§ b0 and b1
§ Can be thought of in much the same way as multiple regression
§ Note the normal regression equation forms part of the logistic regression equation
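For reference, a standard way to write the simple logistic regression model — note that b0 + b1X1, the ordinary regression equation, sits inside the exponent — is:

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$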
§ Outcome
§ We still predict the probability of the outcome occurring
§ Differences
§ Note the multiple regression equation forms part of the logistic regression equation
§ This part of the equation expands to accommodate additional predictors
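With additional predictors, that part of the equation expands. A standard way to write the multiple-predictor form is:

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n)}}$$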
§ The model is assessed through a method called: Maximum Likelihood
§ We seek estimates of the betas such that the predicted probability corresponds as closely as possible to the individual’s observed value.
§ This is formalized as a mathematical equation called the likelihood function, which the statistical software tries to “maximize”:
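For binary data, a standard way to write this likelihood function, where Y_i is person i’s observed outcome (0 or 1) and P(Y_i) is the model’s predicted probability for that person, is:

$$L(b_0, b_1, \ldots) = \prod_{i=1}^{N} P(Y_i)^{\,Y_i}\,\bigl(1 - P(Y_i)\bigr)^{\,1 - Y_i}$$

In practice the software maximizes the log of this quantity (the log-likelihood), which turns the product into a sum.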
§ We are going to use data from the Default dataset. This is simulated data with information on 10,000 customers such as whether the customer defaulted, is a student, the average balance carried by the customer, and the income of the customer.
§ We will test whether balance influences the likelihood of default. What are our independent and dependent variables?
#install.packages(c("ISLR", "pscl"))
library(pscl)
library(tidyverse)
default <- as_tibble(ISLR::Default)
head(default)
## # A tibble: 6 × 4
## default student balance income
## <fct> <fct> <dbl> <dbl>
## 1 No No 730. 44362.
## 2 No Yes 817. 12106.
## 3 No No 1074. 31767.
## 4 No No 529. 35704.
## 5 No No 786. 38463.
## 6 No Yes 920. 7492.
§ We should always try to visualize our data before analysis. How might you try to visualize this data?
plot(default$balance, as.numeric(default$default))
default %>% group_by(default) %>%
summarize(mean_balance = mean(balance))
## # A tibble: 2 × 2
## default mean_balance
## <fct> <dbl>
## 1 No 804.
## 2 Yes 1748.
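Another option, since the tidyverse is already loaded, is to plot the raw 0/1 outcomes against balance and overlay a fitted logistic curve with ggplot2. This is just a sketch of one possible visualization (the default_num column is created only for plotting):
# plot default (recoded 0/1) against balance and overlay a fitted logistic curve
default %>%
  mutate(default_num = ifelse(default == "Yes", 1, 0)) %>%
  ggplot(aes(x = balance, y = default_num)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(x = "Balance", y = "Probability of default")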
Notice that balance is continuous and therefore must be the X, or independent, variable; default is binary and must be the dependent variable in a logistic regression.
To run logistic regression in R, we’ll use the glm()
function (which stands for generalized linear model) and tell it which type
of model to run with family = "binomial".
glm() works similarly to lm().
We are looking to see whether balance predicts default.
model <- glm(default ~ balance, family = "binomial", data = default)
print(model)
##
## Call: glm(formula = default ~ balance, family = "binomial", data = default)
##
## Coefficients:
## (Intercept) balance
## -10.651331 0.005499
##
## Degrees of Freedom: 9999 Total (i.e. Null); 9998 Residual
## Null Deviance: 2921
## Residual Deviance: 1596 AIC: 1600
summary(model)
##
## Call:
## glm(formula = default ~ balance, family = "binomial", data = default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2697 -0.1465 -0.0589 -0.0221 3.7589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
## balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1596.5 on 9998 degrees of freedom
## AIC: 1600.5
##
## Number of Fisher Scoring iterations: 8
default is both the name of a column and the name of a
dataframe. The first default refers to the column; the
second default refers to the dataframe.
default ~ balance has default column listed
first because it is the dependent variable. This follows the
dependent ~ independent format.
In this analysis, we used logistic regression to examine the relationship between a customer’s balance (independent variable) and the likelihood of defaulting (dependent variable).
Here’s an interpretation of the output:
Model coefficients: The logistic regression model produces two coefficients: one for the intercept (-10.651331) and one for the balance (0.005499). These coefficients indicate the estimated effect of the balance on the log-odds of defaulting. Specifically, for each one-unit increase in balance, the log-odds of defaulting increase by 0.005499. Since these values are in log-odds, it’s often more intuitive to convert them to odds ratios.
Odds Ratio: To obtain the odds ratio, we need to
exponentiate the coefficient for the balance. In this case,
exp(0.005499) = 1.00551. This means that, on average, for
each one-unit increase in balance, the odds of defaulting are 1.00551
times higher, keeping all other factors constant.
§ Indicates the change in odds resulting from a unit change in the predictor.
§ OR > 1: Predictor increases, Probability of outcome occurring increases.
§ OR < 1: Predictor increases, Probability of outcome occurring decreases.
exp(coef(model))
## (Intercept) balance
## 2.366933e-05 1.005514e+00
For every dollar increase in the monthly balance carried, the odds of the customer defaulting are multiplied by about 1.0055.
Significance: The p-values for both the intercept and balance are less than 0.001 (indicated by ***), suggesting that the balance is a statistically significant predictor of defaulting.
Model fit: The null deviance (2920.6) represents the goodness-of-fit of the intercept-only model (i.e., without any predictor variables), while the residual deviance (1596.5) represents the goodness-of-fit of the current model. Comparing these values indicates that the current model with balance as a predictor variable provides a better fit to the data than the intercept-only model.
AIC: The Akaike Information Criterion (AIC) is a measure of the relative quality of a statistical model. Lower values indicate better models. In this case, the AIC is 1600.5. If we were to compare multiple models, we could use the AIC to help determine which model is the best fit for the data.
§ AIC score: smaller = better fit
In conclusion, the logistic regression analysis suggests that there is a significant relationship between the balance and the likelihood of defaulting. As the balance increases, the odds of defaulting also increase.
The coefficient on balance means that an increase in balance is associated with an increase in the probability of default: a one-unit increase in balance is associated with an increase in the log odds of default of 0.0055.
§ While you can calculate model predictions by hand, it’s easier to do it using R.
§ For instance, let’s predict the default rate for a person with a balance of $1000.
Plug and chug method:
1/(1+exp(-(-10.651+(0.005499*1000))))
## [1] 0.005754511
OR
predict(model, data.frame(balance=1000), type="response")
## 1
## 0.005752145
The data.frame(balance = 1000) argument needs to be included, or else predict() will create a prediction for each row of the original data.
type = "response" asks for predictions on the scale of the response variable, i.e., probabilities rather than log-odds.
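The same approach works for several balances at once. For example (a quick sketch; the probabilities quoted below are approximate values worked out from the printed coefficients):
# predicted probability of default at a few different balances
predict(model, data.frame(balance = c(1000, 1500, 2000)), type = "response")
This gives roughly 0.006, 0.08, and 0.59, illustrating how the predicted probability of default climbs steeply, rather than linearly, as balance grows.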
§ There is no R2 like there is in regression, but a variety of pseudo-R2 values have been developed.
§ One of the most frequently used is McFadden’s R2
§ Its measure ranges from 0 to just under 1.
§ McFadden states that models with a McFadden’s pseudo R2 of 0.4 represent a very good fit.
pR2(model)["McFadden"]
## fitting null model for pseudo-r2
## McFadden
## 0.4533916
This number is really good, but that is because this is simulated data. In the real world, you probably won’t get a value this high, especially in anthropology.
The deviance (−2 × the log-likelihood) is an indicator of how much unexplained information still exists in the model after it has been fitted. Large values indicate poorly fitting models.
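If you want to see these quantities for the fitted model, they can be pulled out directly in R (a minimal sketch; for ungrouped binary data like ours the residual deviance equals −2 × the log-likelihood):
logLik(model)                   # log-likelihood of the fitted model
deviance(model)                 # residual deviance (1596.5 in the output above)
-2 * as.numeric(logLik(model))  # same value, computed from the log-likelihood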
§ Forced entry: all variables entered simultaneously.
§ Hierarchical: variables entered in blocks.
§ Blocks should be based on past research, or theory being tested. Good method.
§ Stepwise: variables entered on the basis of statistical criteria (i.e. relative contribution to predicting outcome).
§ Should be used only for exploratory analysis.
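For completeness, here is a minimal sketch of what stepwise selection looks like in R using the built-in step() function, which adds or drops predictors based on AIC (the object name full_model is just for illustration). Again, this approach should only be used for exploratory analysis:
# fit a model with all candidate predictors, then let step() drop terms by AIC
full_model <- glm(default ~ balance + income + student, family = "binomial", data = default)
step(full_model, direction = "backward")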
Let’s do another model where we add in “income” as another variable that may predict someone’s likelihood of defaulting.
model2 <- glm(default ~ balance + income, family = "binomial", data=default)
summary(model2)
##
## Call:
## glm(formula = default ~ balance + income, family = "binomial",
## data = default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
The logistic regression model produces three coefficients: one for the intercept (-11.54), one for the balance (0.005647), and one for the income (0.00002081). These coefficients indicate the estimated effect of the balance and income on the log-odds of defaulting. Specifically, for each one-unit increase in balance, the log-odds of defaulting increase by 0.005647, and for each one-unit increase in income, the log-odds of defaulting increase by 0.00002081.
To obtain the odds ratios, we exponentiate the coefficients for
balance and income. In this case, exp(0.005647) = 1.00566
and exp(0.00002081) = 1.00002081. This means that, on
average, for each one-unit increase in balance, the odds of defaulting
are 1.00566 times higher, keeping all other factors constant. Similarly,
for each one-unit increase in income, the odds of defaulting are
1.00002081 times higher, keeping all other factors constant.
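As with the first model, these odds ratios can be pulled straight from the fitted model by exponentiating its coefficients:
exp(coef(model2))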
The p-values for the intercept, balance, and income are all less than 0.001 (indicated by ***), suggesting that both balance and income are statistically significant predictors of defaulting.
The logistic regression analysis suggests that there is a significant relationship between both balance and income and the likelihood of defaulting. As the balance and income increase, the odds of defaulting also increase. However, it’s essential to remember that correlation does not imply causation, and there might be other factors that contribute to this relationship.
Then to test if this model is “significantly” better, we run:
anova(model, model2, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: default ~ balance
## Model 2: default ~ balance + income
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 9998 1596.5
## 2 9997 1579.0 1 17.485 2.895e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
test="Chisq" tells R that that we are using a binary
outcome. This is just another way to compare our models.
Here’s an interpretation of the output:
Resid. Df: The residual degrees of freedom for Model 1 and Model 2 are 9998 and 9997, respectively.
Resid. Dev: The residual deviance for Model 1 (1596.5) and Model 2 (1579.0) measures the goodness-of-fit of each model. Lower values indicate a better fit.
Df: The difference in degrees of freedom between the two models (9998 - 9997) is 1. This is because Model 2 has one additional predictor variable (income) compared to Model 1.
Deviance: The difference in residual deviance between the two models is 17.485. This value represents the reduction in residual deviance when moving from Model 1 to Model 2.
Pr(>Chi): The p-value for the test is 2.895e-05, which is very small (less than 0.001, indicated by ***). This suggests that there is a statistically significant difference between the two models, and Model 2 (with income as an additional predictor variable) provides a significantly better fit to the data than Model 1.
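As a sanity check, the same p-value can be computed by hand from the numbers in the table: it is the upper tail of a chi-squared distribution with 1 degree of freedom evaluated at the drop in deviance (this reproduces, up to rounding, the value shown above):
pchisq(1596.5 - 1579.0, df = 1, lower.tail = FALSE)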
In conclusion, the Analysis of Deviance test indicates that adding income to the model significantly improves its ability to predict the likelihood of defaulting. This suggests that both balance and income are important factors to consider when predicting the probability of defaulting.
§ Assumptions from linear regression:
§ Independence of errors - if violated, this can lead to overdispersion
§ Multicollinearity
§ Unique problems
§ Incomplete information
§ Complete separation
§ Categorical predictors:
§ Predicting cancer from smoking and eating tomatoes.
§ We don’t know what happens when non-smokers eat tomatoes because we have no data in this cell of the design.
§ Continuous variables
§ Will your sample include an 80-year-old, highly anxious, Buddhist, left-handed lesbian?
§ When the outcome variable can be perfectly predicted.
§ E.g. predicting whether someone is a burglar, your teenage son or your cat based on weight.
§ Weight is a perfect predictor of cat/burglar unless you have a very fat cat indeed
If there is no overlapping data in the middle of the chart, R can’t determine what the fitted line should look like. This is usually not an issue.
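To see what complete separation looks like in practice, here is a toy example with made-up weights for cats and burglars (purely for illustration); glm() will typically warn that fitted probabilities numerically 0 or 1 occurred, and the coefficient estimates blow up:
# made-up data: every cat weighs less than every burglar,
# so weight separates the outcome perfectly
toy <- data.frame(
  weight  = c(3, 4, 5, 6, 70, 80, 90, 100),
  burglar = c(0, 0, 0, 0, 1, 1, 1, 1)
)
glm(burglar ~ weight, family = "binomial", data = toy)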
§ Assumptions to Test
Check standardized residuals – are there any large outliers? Values greater than 3 might be problematic.
Multicollinearity – Why is this a problem? If you have two highly
correlated predictors, the model struggles to determine which one is
associated with the outcome because the two are so strongly associated
with each other. How do we test for it? vif() is the function we use to
check it.
Check Cook’s Distance – influential points
• Values should be less than 1
plot(model2, which=4, id.n=5)
max(cooks.distance(model2))
## [1] 0.0167895
max gives the largest Cook’s distance. If it is greater
than one, it suggests that at least one data point has a large influence
on the estimated parameters.
Standardized Residuals – are there any outliers?
standard_res <- rstandard(model2)
plot(standard_res)
You can take a look at these larger values to try to determine whether any
are outside of what we would expect. We have a few standardized residuals
greater than 3, but out of 10,000 data points that is well under 1% of the
data, which is fine.
You can also look at the observations with the largest values to figure
out why they are so extreme. Presumably these are people who had a really
high balance but didn’t default, or who defaulted despite a really low
balance.
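To see exactly which observations those are, you can pick out the rows whose standardized residuals exceed 3 in absolute value (a quick sketch):
which(abs(standard_res) > 3)      # row numbers of the large residuals
default[abs(standard_res) > 3, ]  # the corresponding customers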
Multicollinearity exists if predictors are highly correlated. We will use variance inflation factors (VIF) to determine whether our predictors are too highly correlated. A VIF is essentially calculated by taking each predictor in turn, treating it as the outcome variable, and using the other predictors in the model to predict it. This indicates how ‘correlated’ the predictors are.
library(car)
vif(model2)
## balance income
## 1.045605 1.045605
This shows the VIF values for our two predictors. In this case they are very low. Less than 2.5 is great; you have a real problem if a value is over 10.
§ Check the Estimates to see the regression parameters for any predictors in the model.
§ Use the odds ratio, exp(coef(model)) for
interpretation.
§ OR > 1, then as the predictor increases, the odds of the outcome occurring increase.
§ OR < 1, then as the predictor increases, the odds of the outcome occurring decrease.
§ Use the anova() function to test hierarchical models
§ Check assumptions:
§ Cook’s distance
§ Standardized Residuals
§ Multicollinearity