The logistic regression method assumes a linear relationship between each continuous predictor variable and the logit of the outcome, where logit(p) = log(p/(1-p)) and p is the probability of the outcome (see Chapter @ref(logistic-regression)). It also assumes that there are no influential observations and no high intercorrelations (multicollinearity) among the predictors.
Loading required R packages:
tidyverse for easy data manipulation and visualization
broom for creating a tidy data frame from statistical test results
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.4.9000
## v tibble 3.0.0 v dplyr 0.8.5
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(broom)
theme_set(theme_classic())
We start by fitting an example logistic regression model using the PimaIndiansDiabetes2 data set [mlbench package] to predict the probability of testing positive for diabetes based on clinical variables.
# Load the data
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Fit the logistic regression model
model <- glm(diabetes ~., data = PimaIndiansDiabetes2,
family = binomial)
# Predict the probability (p) of diabetes positivity
probabilities <- predict(model, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)
## 4 5 7 9 14 15
## "neg" "pos" "neg" "pos" "pos" "pos"
Here, we’ll check the linear relationship between continuous predictor variables and the logit of the outcome. This can be done by visually inspecting the scatter plot between each predictor and the logit values.
# Select only numeric predictors
mydata <- PimaIndiansDiabetes2 %>%
dplyr::select_if(is.numeric)
predictors <- colnames(mydata)
# Bind the logit and tidy the data for plotting
mydata <- mydata %>%
mutate(logit = log(probabilities/(1-probabilities))) %>%
gather(key = "predictors", value = "predictor.value", -logit)
ggplot(mydata, aes(logit, predictor.value))+
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess") +
theme_bw() +
facet_wrap(~predictors, scales = "free_y")
## `geom_smooth()` using formula 'y ~ x'
The smoothed scatter plots show that the variables glucose, mass, pregnant, pressure and triceps are all quite linearly associated with the diabetes outcome on the logit scale.
The variables age and pedigree are not linear and might need some transformation. If a scatter plot shows non-linearity, you need other methods to build the model, such as including second- or third-power terms, fractional polynomials or spline functions.
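For illustration only, the sketch below shows one possible way to add such terms, assuming a quadratic term for age and a natural spline for pedigree; the model name model.nl and the choice of df = 3 are arbitrary and not part of the original chapter.
# Sketch: refit the model with a quadratic term for age and a natural
# spline for pedigree (illustrative choices, not from the original chapter)
library(splines)
model.nl <- glm(diabetes ~ . - age - pedigree + poly(age, 2) + ns(pedigree, df = 3),
                data = PimaIndiansDiabetes2, family = binomial)
summary(model.nl)
The scatter plots between these transformed terms and the logit could then be re-inspected in the same way as above.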
Influential values are extreme individual data points that can alter the quality of the logistic regression model.
The most extreme values in the data can be examined by visualizing the Cook’s distance values. Here we label the top 3 largest values:
plot(model, which = 4, id.n = 3)
Note that not all outliers are influential observations. To check whether the data contain potential influential observations, the standardized residuals can be inspected. Data points with an absolute standardized residual above 3 represent possible outliers and may deserve closer attention.
The following R code computes the standardized residuals .std.resid and the Cook’s distance .cooksd using the R function augment() [broom package].
# Extract model results
model.data <- augment(model) %>%
mutate(index = 1:n())
The data for the top 3 largest values, according to the Cook’s distance, can be displayed as follows:
model.data %>% top_n(3, .cooksd)
## # A tibble: 3 x 18
## .rownames diabetes pregnant glucose pressure triceps insulin mass pedigree
## <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 229 neg 4 197 70 39 744 36.7 2.33
## 2 460 neg 9 134 74 33 60 25.9 0.46
## 3 488 neg 0 173 78 32 265 46.5 1.16
## # ... with 9 more variables: age <dbl>, .fitted <dbl>, .se.fit <dbl>,
## # .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>,
## # index <int>
Plot the Standardized Residuals:
ggplot(model.data, aes(index, .std.resid)) +
geom_point(aes(color = diabetes), alpha = .5) +
theme_bw()
Filter potential influential data points with abs(.std.resid) > 3:
model.data %>%
filter(abs(.std.resid) > 3)
## # A tibble: 0 x 18
## # ... with 18 variables: .rownames <chr>, diabetes <fct>, pregnant <dbl>,
## # glucose <dbl>, pressure <dbl>, triceps <dbl>, insulin <dbl>, mass <dbl>,
## # pedigree <dbl>, age <dbl>, .fitted <dbl>, .se.fit <dbl>, .resid <dbl>,
## # .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>, index <int>
There are no influential observations in our data. When you have outliers in a continuous predictor, potential solutions include removing the concerned records, transforming the data into a log scale, or using non-parametric methods.
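As a minimal sketch of the first option (the object names influential and model.clean are illustrative, not part of the original chapter), flagged rows could be dropped by index and the model refitted:
# Sketch: remove rows flagged by the standardized-residual rule and refit
influential <- model.data %>%
  filter(abs(.std.resid) > 3) %>%
  pull(index)
if (length(influential) > 0) {
  model.clean <- glm(diabetes ~ .,
                     data = PimaIndiansDiabetes2[-influential, ],
                     family = binomial)
}
Here the filter returns zero rows, so the refit is skipped; with real outliers, a log transformation of the affected predictor would be an alternative to deletion.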
Multicollinearity corresponds to a situation where the data contain highly correlated predictor variables.
Multicollinearity is an important issue in regression analysis and should be fixed by removing the concerned variables. It can be assessed using the R function vif() [car package], which computes the variance inflation factors:
car::vif(model)
## pregnant glucose pressure triceps insulin mass pedigree age
## 1.892387 1.378937 1.191287 1.638865 1.384038 1.832416 1.031715 1.974053
As a rule of thumb, a VIF value that exceeds 4 or 10 indicates a problematic amount of collinearity. In our example, there is no collinearity: all variables have a value of VIF well below 4.
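If a predictor had shown a problematic VIF (none does here), one simple remedy would be to drop it from the formula and re-check the VIF values. The sketch below is purely illustrative: the choice of triceps and the name model.novif are assumptions, not part of the original chapter.
# Sketch: drop a (hypothetically) collinear variable and re-check the VIFs
model.novif <- glm(diabetes ~ . - triceps, data = PimaIndiansDiabetes2,
                   family = binomial)
car::vif(model.novif)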
This chapter describes the main assumptions of the logistic regression model and provides examples of R code to diagnose potential problems in the data, including non-linearity between the predictor variables and the logit of the outcome, the presence of influential observations, and multicollinearity among predictors.
Fixing these potential problems may considerably improve the model’s goodness of fit.