The logistic regression method assumes a linear relationship between each continuous predictor variable and the logit of the outcome, where logit(p) = log(p/(1-p)) and p is the probability of the outcome (see Chapter @ref(logistic-regression)). It also assumes that there are no influential observations and no high intercorrelations (multicollinearity) among the predictors.
Loading required R packages:
tidyverse for easy data manipulation and visualization
broom for creating a tidy data frame from statistical test results
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.4.9000
## v tibble 3.0.0 v dplyr 0.8.5
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(broom)
theme_set(theme_classic())
We start by fitting an example logistic regression model using the PimaIndiansDiabetes2 data set [mlbench package] to predict the probability of testing positive for diabetes based on clinical variables.
# Load the data
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Fit the logistic regression model
model <- glm(diabetes ~., data = PimaIndiansDiabetes2,
family = binomial)
# Predict the probability (p) of diabetes positivity
probabilities <- predict(model, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)
## 4 5 7 9 14 15
## "neg" "pos" "neg" "pos" "pos" "pos"
Here, we’ll check the linear relationship between continuous predictor variables and the logit of the outcome. This can be done by visually inspecting the scatter plot between each predictor and the logit values.
# Select only numeric predictors
mydata <- PimaIndiansDiabetes2 %>%
dplyr::select_if(is.numeric)
predictors <- colnames(mydata)
# Bind the logit and tidy the data for plotting
mydata <- mydata %>%
mutate(logit = log(probabilities/(1-probabilities))) %>%
gather(key = "predictors", value = "predictor.value", -logit)
ggplot(mydata, aes(logit, predictor.value))+
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess") +
theme_bw() +
facet_wrap(~predictors, scales = "free_y")
## `geom_smooth()` using formula 'y ~ x'
The smoothed scatter plots show that the variables glucose, mass, pregnant, pressure and triceps are all quite linearly associated with the diabetes outcome on the logit scale.
The variables age and pedigree are not linear and might need some transformation. If a scatter plot shows non-linearity, you need other methods to build the model, such as including second- or third-power terms, fractional polynomials or spline functions.
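For illustration only, the sketch below shows one possible way to add such terms, assuming a quadratic term for age and a natural spline for pedigree; the model name model.nl and the choice of df = 3 are arbitrary and not part of the original chapter.
# Sketch: refit the model with a quadratic term for age and a natural
# spline for pedigree (illustrative choices, not from the original chapter)
library(splines)
model.nl <- glm(diabetes ~ . - age - pedigree + poly(age, 2) + ns(pedigree, df = 3),
                data = PimaIndiansDiabetes2, family = binomial)
summary(model.nl)
The scatter plots between these transformed terms and the logit could then be re-inspected in the same way as above.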
Influential values are extreme individual data points that can alter the quality of the logistic regression model.
The most extreme values in the data can be examined by visualizing the Cook’s distance values. Here we label the top 3 largest values:
plot(model, which = 4, id.n = 3)
Note that not all outliers are influential observations. To check whether the data contain potential influential observations, the standardized residuals can be inspected. Data points with an absolute standardized residual above 3 represent possible outliers and may deserve closer attention.
The following R code computes the standardized residuals .std.resid and the Cook’s distance .cooksd using the R function augment() [broom package].
# Extract model results
model.data <- augment(model) %>%
mutate(index = 1:n())
The data for the top 3 largest values, according to the Cook’s distance, can be displayed as follows:
model.data %>% top_n(3, .cooksd)
## # A tibble: 3 x 18
## .rownames diabetes pregnant glucose pressure triceps insulin mass pedigree
## <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 229 neg 4 197 70 39 744 36.7 2.33
## 2 460 neg 9 134 74 33 60 25.9 0.46
## 3 488 neg 0 173 78 32 265 46.5 1.16
## # ... with 9 more variables: age <dbl>, .fitted <dbl>, .se.fit <dbl>,
## # .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>,
## # index <int>
Plot the Standardized Residuals:
ggplot(model.data, aes(index, .std.resid)) +
geom_point(aes(color = diabetes), alpha = .5) +
theme_bw()
Filter potential influential data points with abs(.std.resid) > 3:
model.data %>%
filter(abs(.std.resid) > 3)
## # A tibble: 0 x 18
## # ... with 18 variables: .rownames <chr>, diabetes <fct>, pregnant <dbl>,
## # glucose <dbl>, pressure <dbl>, triceps <dbl>, insulin <dbl>, mass <dbl>,
## # pedigree <dbl>, age <dbl>, .fitted <dbl>, .se.fit <dbl>, .resid <dbl>,
## # .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>, index <int>
There are no influential observations in our data. When you have outliers in a continuous predictor, potential solutions include removing the concerned records, transforming the data into a log scale, or using non-parametric methods.
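As a minimal sketch of the first option (the object names influential and model.clean are illustrative, not part of the original chapter), flagged rows could be dropped by index and the model refitted:
# Sketch: remove rows flagged by the standardized-residual rule and refit
influential <- model.data %>%
  filter(abs(.std.resid) > 3) %>%
  pull(index)
if (length(influential) > 0) {
  model.clean <- glm(diabetes ~ .,
                     data = PimaIndiansDiabetes2[-influential, ],
                     family = binomial)
}
Here the filter returns zero rows, so the refit is skipped; with real outliers, a log transformation of the affected predictor would be an alternative to deletion.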
Multicollinearity corresponds to a situation where the data contain highly correlated predictor variables.
Multicollinearity is an important issue in regression analysis and should be fixed by removing the concerned variables. It can be assessed using the R function vif() [car package], which computes the variance inflation factors:
car::vif(model)
## pregnant glucose pressure triceps insulin mass pedigree age
## 1.892387 1.378937 1.191287 1.638865 1.384038 1.832416 1.031715 1.974053
As a rule of thumb, a VIF value that exceeds 4 or 10 indicates a problematic amount of collinearity. In our example, there is no collinearity: all variables have a value of VIF well below 4.
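If a predictor had shown a problematic VIF (none does here), one simple remedy would be to drop it from the formula and re-check the VIF values. The sketch below is purely illustrative: the choice of triceps and the name model.novif are assumptions, not part of the original chapter.
# Sketch: drop a (hypothetically) collinear variable and re-check the VIFs
model.novif <- glm(diabetes ~ . - triceps, data = PimaIndiansDiabetes2,
                   family = binomial)
car::vif(model.novif)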
This chapter describes the main assumptions of the logistic regression model and provides examples of R code to diagnose potential problems in the data, including non-linearity between the predictor variables and the logit of the outcome, the presence of influential observations, and multicollinearity among predictors.
Fixing these potential problems may considerably improve the model’s goodness of fit.