Visualizing multivariate logistic regression

Aim

I am going to show how to graphically visualize results from a multivariate logistic model in the case of two quantitative predictors.
I will consider three different scenarios:
1. the simpler case, where the log odds of the outcome linearly depends on the two predictors;
2. as 1. but with an interaction term between the predictors;
3. both the associations between the predictors and the log odds of the outcome are quadratic.
The analyses are based on simulated data.

Data simulation

I am going to consider two predictors, \(X\) and \(Z\), with the following distribution: \(X \sim \textrm{Unif}(-5, 5)\) and \(Z \sim \textrm{N}(0, 1.5)\).
I simulate \(n = 1000\) record.

library(dplyr)
n <- 1000
set.seed(1234)
dat <- data_frame(
  x = runif(n, -5, 5),
  z = rnorm(n, 0, 1.5)
)

1. Simple multivariate model

The real model is the defined as
\[log(odds(Y|X, Z)) = -2 + .5X + Z\]
or equivalently \[P(Y = 1|X, Z) = \frac{e^{-2 + .5X + Z}}{1 + e^{-2 + .5X + Z}}\]

invlogit <- function(x) exp(x)/(1 + exp(x))
dat <- dat %>%
  mutate(
    p = invlogit(-2 + .5*x + 1*z),
    y = rbinom(n, 1, p)
  )

2. Multivariate model with interaction

The real model is the defined as
\[log(odds(Y|X, Z)) = -2 + X + Z - 0.5XZ\]
or equivalently \[P(Y = 1|X, Z) = \frac{e^{-2 + X + Z - 0.5XZ}}{1 + e^{-2 + X + Z - 0.5XZ}}\]

dat <- dat %>%
  mutate(
    p_int = invlogit(-2 + 1*x + 1*z - 0.5*x*z),
    y_int = rbinom(n, 1, p_int)
  )

3. Non-linear relations

The real model is the defined as
\[log(odds(Y|X, Z)) = -1 + X - .1X^2 + 2Z - .2Z^2\]
or equivalently \[P(Y = 1|X, Z) = \frac{e^{-1 + X - .1X^2 + 2Z - .2Z^2}}{1 + e^{-1 + X - .1X^2 + 2Z - .2Z^2}}\]

dat <- dat %>%
  mutate(
    p_sq = invlogit(-1 + 1*x - .1*x^2 + 2*z - .2*z^2),
    y_sq = rbinom(n, 1, p_sq)
  )

Analyses

1. Multivariable model

Based on the simulated data, I estimate the following model
\[\textrm{logit}(P(Y|X, Z) = \beta_0 + \beta_1X + \beta_2Z\]

mod <- glm(y ~ x + z, data = dat, family = "binomial")
summary(mod)

## 
## Call:
## glm(formula = y ~ x + z, family = "binomial", data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2861  -0.5341  -0.2516  -0.0744   2.9356  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.15890    0.13731  -15.72   <2e-16 ***
## x            0.49660    0.04287   11.58   <2e-16 ***
## z            1.11399    0.09218   12.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1048.73  on 999  degrees of freedom
## Residual deviance:  672.11  on 997  degrees of freedom
## AIC: 678.11
## 
## Number of Fisher Scoring iterations: 6

I am going to graphically present both the predicted log odds and probability based on the fitted model
\[\widehat {\textrm{logit}}(Y|X, Z) = \hat \beta_0 + \hat \beta_1X + \hat \beta_2Z\] and \[\hat P(Y = 1|X, Z) = \frac{e^{\hat \beta_0 + \hat \beta_1X + \hat \beta_2Z}}{1 + e^{\hat \beta_0 + \hat \beta_1X + \hat \beta_2Z}}\]

The predictions will be based on data.fit, a fine grid of values for \(X\) and \(Z\)

xgrid <- zgrid <- seq(-5, 5, length.out = 50)
data.fit <- expand.grid(x = xgrid, z = zgrid)
logodds <- matrix(predict(mod, data.fit), ncol = 50, nrow = 50, byrow = T)
p <- matrix(predict(mod, data.fit, type = "response"), ncol = 50, nrow = 50, byrow = T)

To compare the predicted probability with the observed data, I am going to divide both \(X\) and \(Z\) in 10 categories each, and calculate the proportion of \(Y\) (mean of \(Y\) in each cell of a 2 by 2 table)

dat <- dat %>%
  mutate(
    xcat = cut(x, 10),
    zcat = cut(z, 10)
  ) %>%
  group_by(xcat, zcat) %>%
  mutate(
    risk = sum(y)/length(y)
  ) %>%
  ungroup(xcat, zcat)

Predicted Log odds

library(plotly)
plot_ly(x = xgrid, y = zgrid, z = logodds) %>% 
  add_surface() %>%
  layout(title = "Multivariable model",
         scene = list(
           xaxis = list(title = "x"), 
           yaxis = list(title = "z"), 
           zaxis = list(title = "log(odds)")))

Predicted and observed probabilities

plot_ly(dat, x = ~x, y = ~z, z = ~risk, type = "scatter3d", 
        mode = "markers", marker = list(size = 2)) %>%
  add_surface(x = xgrid, y = zgrid, z = p) %>%
  layout(title = "Multivariable model",
         scene = list(
           xaxis = list(title = "x"), 
           yaxis = list(title = "z"), 
           zaxis = list(title = "probability")))

Analysis 2. Interaction model

I am going to consider both the model with and without the interaction term, based on the data simulated in the second scenario.

The model with interaction is defined as \[\textrm{logit}(P(Y|X, Z) = \beta_0 + \beta_1X + \beta_2Z + \beta_2XZ\]

mod2 <- glm(y_int ~ x + z, data = dat, family = "binomial")
mod_int <- glm(y_int ~ x*z, data = dat, family = "binomial")
summary(mod2)

## 
## Call:
## glm(formula = y_int ~ x + z, family = "binomial", data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1673  -0.7395  -0.3850   0.8119   2.4617  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.83550    0.08524  -9.801  < 2e-16 ***
## x            0.49573    0.03323  14.916  < 2e-16 ***
## z            0.25176    0.05656   4.452 8.53e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1313.58  on 999  degrees of freedom
## Residual deviance:  978.27  on 997  degrees of freedom
## AIC: 984.27
## 
## Number of Fisher Scoring iterations: 4

summary(mod_int)

## 
## Call:
## glm(formula = y_int ~ x * z, family = "binomial", data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7234  -0.6031  -0.1024   0.5586   3.1721  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.63443    0.14297  -11.43   <2e-16 ***
## x            0.88582    0.06401   13.84   <2e-16 ***
## z            0.91369    0.09699    9.42   <2e-16 ***
## x:z         -0.47399    0.04167  -11.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1313.6  on 999  degrees of freedom
## Residual deviance:  743.2  on 996  degrees of freedom
## AIC: 751.2
## 
## Number of Fisher Scoring iterations: 6

The predictions for the log odds and probabilities, and the observed probabilities are calculated as before based on the new models and data

logodds2 <- matrix(predict(mod2, data.fit), ncol = 50, nrow = 50, byrow = T)
logodds_int <- matrix(predict(mod_int, data.fit), ncol = 50, nrow = 50, byrow = T)
p2 <- matrix(predict(mod2, data.fit, type = "response"), ncol = 50, nrow = 50, byrow = T)
p_int <- matrix(predict(mod_int, data.fit, type = "response"), ncol = 50, nrow = 50, byrow = T)
dat <- dat  %>%
  group_by(xcat, zcat) %>%
  mutate(
    risk_int = sum(y_int)/length(y_int)
  ) %>%
  ungroup(xcat, zcat)

Predicted Log odds

plot_ly(x = xgrid, y = zgrid, z = logodds2) %>% 
  add_surface() %>%
  layout(title = "Misspecified model (without interaction)",
         scene = list(
           xaxis = list(title = "x"), 
           yaxis = list(title = "z"), 
           zaxis = list(title = "log(odds)")))

plot_ly(x = xgrid, y = zgrid, z = logodds_int) %>% 
  add_surface() %>%
  layout(title = "Model with interaction)",
         scene = list(
           xaxis = list(title = "x"), 
           yaxis = list(title = "z"), 
           zaxis = list(title = "log(odds)")))

Predicted and observed probabilities

plot_ly(dat, x = ~x, y = ~z, z = ~risk_int, type = "scatter3d", 
        mode = "markers", marker = list(size = 2)) %>%
  add_surface(x = xgrid, y = zgrid, z = p2) %>%
  layout(title = "Misspecified model (without interaction)",
         scene = list(
           xaxis = list(title = "x"), 
           yaxis = list(title = "z"), 
           zaxis = list(title = "probability")))

plot_ly(dat, x = ~x, y = ~z, z = ~risk_int, type = "scatter3d", 
        mode = "markers", marker = list(size = 2)) %>%
  add_surface(x = xgrid, y = zgrid, z = p_int) %>%
  layout(title = "Model with interaction)",
         scene = list(
           xaxis = list(title = "x"), 
           yaxis = list(title = "z"), 
           zaxis = list(title = "probability")))

Analysis 3. Non-linear relationships

In the last setting I am going to fit both the models assuming a linear and a quadratic relationship between the two predictors and the odds of the outcome

The model assuming a (2) quadratic relationship(s) is defined as \[\textrm{logit}(P(Y|X, Z) = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3Z + \beta_4Z^2\]

mod3 <- glm(y_int ~ x + z, data = dat, family = "binomial")
mod_sq <- glm(y_int ~ x + I(x^2) + z + I(z^2), data = dat, family = "binomial")
summary(mod3)

## 
## Call:
## glm(formula = y_int ~ x + z, family = "binomial", data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1673  -0.7395  -0.3850   0.8119   2.4617  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.83550    0.08524  -9.801  < 2e-16 ***
## x            0.49573    0.03323  14.916  < 2e-16 ***
## z            0.25176    0.05656   4.452 8.53e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1313.58  on 999  degrees of freedom
## Residual deviance:  978.27  on 997  degrees of freedom
## AIC: 984.27
## 
## Number of Fisher Scoring iterations: 4

summary(mod_sq)

## 
## Call:
## glm(formula = y_int ~ x + I(x^2) + z + I(z^2), family = "binomial", 
##     data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7923  -0.6299  -0.4367   0.6884   2.2197  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.46493    0.13209 -11.090  < 2e-16 ***
## x            0.46548    0.03232  14.402  < 2e-16 ***
## I(x^2)       0.05997    0.01179   5.087 3.63e-07 ***
## z            0.28398    0.05737   4.950 7.41e-07 ***
## I(z^2)       0.11609    0.02728   4.256 2.08e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1313.58  on 999  degrees of freedom
## Residual deviance:  932.94  on 995  degrees of freedom
## AIC: 942.94
## 
## Number of Fisher Scoring iterations: 4

The predictions for the log odds and probabilities, and the observed probabilities are calculated as before based on the new models and data

logodds3 <- matrix(predict(mod3, data.fit), ncol = 50, nrow = 50, byrow = T)
logodds_sq <- matrix(predict(mod_sq, data.fit), ncol = 50, nrow = 50, byrow = T)
p3 <- matrix(predict(mod3, data.fit, type = "response"), ncol = 50, nrow = 50, byrow = T)
p_sq <- matrix(predict(mod_sq, data.fit, type = "response"), ncol = 50, nrow = 50, byrow = T)
dat <- dat  %>%
  group_by(xcat, zcat) %>%
  mutate(
    risk_sq = sum(y_sq)/length(y_sq)
  ) %>%
  ungroup(xcat, zcat)

Predicted Log odds

plot_ly(x = xgrid, y = zgrid, z = logodds3) %>% 
  add_surface() %>%
  layout(title = "Linear model",
         scene = list(
           xaxis = list(title = "x"), 
           yaxis = list(title = "z"), 
           zaxis = list(title = "log(odds)")))

plot_ly(x = xgrid, y = zgrid, z = logodds_sq) %>% 
  add_surface() %>%
  layout(title = "Quadratic model)",
         scene = list(
           xaxis = list(title = "x"), 
           yaxis = list(title = "z"), 
           zaxis = list(title = "log(odds)")))

Predicted and observed probabilities

plot_ly(dat, x = ~x, y = ~z, z = ~risk_sq, type = "scatter3d", 
        mode = "markers", marker = list(size = 2)) %>%
  add_surface(x = xgrid, y = zgrid, z = p3) %>%
  layout(title = "Linear model",
         scene = list(
           xaxis = list(title = "x"), 
           yaxis = list(title = "z"), 
           zaxis = list(title = "probability")))

plot_ly(dat, x = ~x, y = ~z, z = ~risk_sq, type = "scatter3d", 
        mode = "markers", marker = list(size = 2)) %>%
  add_surface(x = xgrid, y = zgrid, z = p_sq) %>%
  layout(title = "Quadratic model",
         scene = list(
           xaxis = list(title = "x"), 
           yaxis = list(title = "z"), 
           zaxis = list(title = "probability")))

Visualizing multivariate logistic regression

Crippa A.

January 20, 2017

Aim

Data simulation

1. Simple multivariate model

2. Multivariate model with interaction

3. Non-linear relations

Analyses

1. Multivariable model

Analysis 2. Interaction model

Analysis 3. Non-linear relationships