I am going to show how to graphically visualize results from a multivariate logistic model in the case of two quantitative predictors.
I will consider three different scenarios:
1. the simpler case, where the log odds of the outcome linearly depends on the two predictors;
2. as 1. but with an interaction term between the predictors;
3. both the associations between the predictors and the log odds of the outcome are quadratic.
The analyses are based on simulated data.
I am going to consider two predictors, \(X\) and \(Z\), with the following distribution: \(X \sim \textrm{Unif}(-5, 5)\) and \(Z \sim \textrm{N}(0, 1.5)\).
I simulate \(n = 1000\) record.
library(dplyr)
n <- 1000
set.seed(1234)
dat <- data_frame(
x = runif(n, -5, 5),
z = rnorm(n, 0, 1.5)
)
Â
The real model is the defined as
\[log(odds(Y|X, Z)) = -2 + .5X + Z\]
or equivalently \[P(Y = 1|X, Z) = \frac{e^{-2 + .5X + Z}}{1 + e^{-2 + .5X + Z}}\]
invlogit <- function(x) exp(x)/(1 + exp(x))
dat <- dat %>%
mutate(
p = invlogit(-2 + .5*x + 1*z),
y = rbinom(n, 1, p)
)
Â
The real model is the defined as
\[log(odds(Y|X, Z)) = -2 + X + Z - 0.5XZ\]
or equivalently \[P(Y = 1|X, Z) = \frac{e^{-2 + X + Z - 0.5XZ}}{1 + e^{-2 + X + Z - 0.5XZ}}\]
dat <- dat %>%
mutate(
p_int = invlogit(-2 + 1*x + 1*z - 0.5*x*z),
y_int = rbinom(n, 1, p_int)
)
Â
The real model is the defined as
\[log(odds(Y|X, Z)) = -1 + X - .1X^2 + 2Z - .2Z^2\]
or equivalently \[P(Y = 1|X, Z) = \frac{e^{-1 + X - .1X^2 + 2Z - .2Z^2}}{1 + e^{-1 + X - .1X^2 + 2Z - .2Z^2}}\]
dat <- dat %>%
mutate(
p_sq = invlogit(-1 + 1*x - .1*x^2 + 2*z - .2*z^2),
y_sq = rbinom(n, 1, p_sq)
)
Â
Based on the simulated data, I estimate the following model
\[\textrm{logit}(P(Y|X, Z) = \beta_0 + \beta_1X + \beta_2Z\]
mod <- glm(y ~ x + z, data = dat, family = "binomial")
summary(mod)
##
## Call:
## glm(formula = y ~ x + z, family = "binomial", data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2861 -0.5341 -0.2516 -0.0744 2.9356
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.15890 0.13731 -15.72 <2e-16 ***
## x 0.49660 0.04287 11.58 <2e-16 ***
## z 1.11399 0.09218 12.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1048.73 on 999 degrees of freedom
## Residual deviance: 672.11 on 997 degrees of freedom
## AIC: 678.11
##
## Number of Fisher Scoring iterations: 6
I am going to graphically present both the predicted log odds and probability based on the fitted model
\[\widehat {\textrm{logit}}(Y|X, Z) = \hat \beta_0 + \hat \beta_1X + \hat \beta_2Z\] and \[\hat P(Y = 1|X, Z) = \frac{e^{\hat \beta_0 + \hat \beta_1X + \hat \beta_2Z}}{1 + e^{\hat \beta_0 + \hat \beta_1X + \hat \beta_2Z}}\]
The predictions will be based on data.fit, a fine grid of values for \(X\) and \(Z\)
xgrid <- zgrid <- seq(-5, 5, length.out = 50)
data.fit <- expand.grid(x = xgrid, z = zgrid)
logodds <- matrix(predict(mod, data.fit), ncol = 50, nrow = 50, byrow = T)
p <- matrix(predict(mod, data.fit, type = "response"), ncol = 50, nrow = 50, byrow = T)
To compare the predicted probability with the observed data, I am going to divide both \(X\) and \(Z\) in 10 categories each, and calculate the proportion of \(Y\) (mean of \(Y\) in each cell of a 2 by 2 table)
dat <- dat %>%
mutate(
xcat = cut(x, 10),
zcat = cut(z, 10)
) %>%
group_by(xcat, zcat) %>%
mutate(
risk = sum(y)/length(y)
) %>%
ungroup(xcat, zcat)
Predicted Log odds
library(plotly)
plot_ly(x = xgrid, y = zgrid, z = logodds) %>%
add_surface() %>%
layout(title = "Multivariable model",
scene = list(
xaxis = list(title = "x"),
yaxis = list(title = "z"),
zaxis = list(title = "log(odds)")))
Predicted and observed probabilities
plot_ly(dat, x = ~x, y = ~z, z = ~risk, type = "scatter3d",
mode = "markers", marker = list(size = 2)) %>%
add_surface(x = xgrid, y = zgrid, z = p) %>%
layout(title = "Multivariable model",
scene = list(
xaxis = list(title = "x"),
yaxis = list(title = "z"),
zaxis = list(title = "probability")))
Â
I am going to consider both the model with and without the interaction term, based on the data simulated in the second scenario.
The model with interaction is defined as \[\textrm{logit}(P(Y|X, Z) = \beta_0 + \beta_1X + \beta_2Z + \beta_2XZ\]
mod2 <- glm(y_int ~ x + z, data = dat, family = "binomial")
mod_int <- glm(y_int ~ x*z, data = dat, family = "binomial")
summary(mod2)
##
## Call:
## glm(formula = y_int ~ x + z, family = "binomial", data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1673 -0.7395 -0.3850 0.8119 2.4617
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.83550 0.08524 -9.801 < 2e-16 ***
## x 0.49573 0.03323 14.916 < 2e-16 ***
## z 0.25176 0.05656 4.452 8.53e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1313.58 on 999 degrees of freedom
## Residual deviance: 978.27 on 997 degrees of freedom
## AIC: 984.27
##
## Number of Fisher Scoring iterations: 4
summary(mod_int)
##
## Call:
## glm(formula = y_int ~ x * z, family = "binomial", data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7234 -0.6031 -0.1024 0.5586 3.1721
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.63443 0.14297 -11.43 <2e-16 ***
## x 0.88582 0.06401 13.84 <2e-16 ***
## z 0.91369 0.09699 9.42 <2e-16 ***
## x:z -0.47399 0.04167 -11.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1313.6 on 999 degrees of freedom
## Residual deviance: 743.2 on 996 degrees of freedom
## AIC: 751.2
##
## Number of Fisher Scoring iterations: 6
The predictions for the log odds and probabilities, and the observed probabilities are calculated as before based on the new models and data
logodds2 <- matrix(predict(mod2, data.fit), ncol = 50, nrow = 50, byrow = T)
logodds_int <- matrix(predict(mod_int, data.fit), ncol = 50, nrow = 50, byrow = T)
p2 <- matrix(predict(mod2, data.fit, type = "response"), ncol = 50, nrow = 50, byrow = T)
p_int <- matrix(predict(mod_int, data.fit, type = "response"), ncol = 50, nrow = 50, byrow = T)
dat <- dat %>%
group_by(xcat, zcat) %>%
mutate(
risk_int = sum(y_int)/length(y_int)
) %>%
ungroup(xcat, zcat)
Predicted Log odds
plot_ly(x = xgrid, y = zgrid, z = logodds2) %>%
add_surface() %>%
layout(title = "Misspecified model (without interaction)",
scene = list(
xaxis = list(title = "x"),
yaxis = list(title = "z"),
zaxis = list(title = "log(odds)")))
plot_ly(x = xgrid, y = zgrid, z = logodds_int) %>%
add_surface() %>%
layout(title = "Model with interaction)",
scene = list(
xaxis = list(title = "x"),
yaxis = list(title = "z"),
zaxis = list(title = "log(odds)")))
Predicted and observed probabilities
plot_ly(dat, x = ~x, y = ~z, z = ~risk_int, type = "scatter3d",
mode = "markers", marker = list(size = 2)) %>%
add_surface(x = xgrid, y = zgrid, z = p2) %>%
layout(title = "Misspecified model (without interaction)",
scene = list(
xaxis = list(title = "x"),
yaxis = list(title = "z"),
zaxis = list(title = "probability")))
plot_ly(dat, x = ~x, y = ~z, z = ~risk_int, type = "scatter3d",
mode = "markers", marker = list(size = 2)) %>%
add_surface(x = xgrid, y = zgrid, z = p_int) %>%
layout(title = "Model with interaction)",
scene = list(
xaxis = list(title = "x"),
yaxis = list(title = "z"),
zaxis = list(title = "probability")))
Â
In the last setting I am going to fit both the models assuming a linear and a quadratic relationship between the two predictors and the odds of the outcome
The model assuming a (2) quadratic relationship(s) is defined as \[\textrm{logit}(P(Y|X, Z) = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3Z + \beta_4Z^2\]
mod3 <- glm(y_int ~ x + z, data = dat, family = "binomial")
mod_sq <- glm(y_int ~ x + I(x^2) + z + I(z^2), data = dat, family = "binomial")
summary(mod3)
##
## Call:
## glm(formula = y_int ~ x + z, family = "binomial", data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1673 -0.7395 -0.3850 0.8119 2.4617
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.83550 0.08524 -9.801 < 2e-16 ***
## x 0.49573 0.03323 14.916 < 2e-16 ***
## z 0.25176 0.05656 4.452 8.53e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1313.58 on 999 degrees of freedom
## Residual deviance: 978.27 on 997 degrees of freedom
## AIC: 984.27
##
## Number of Fisher Scoring iterations: 4
summary(mod_sq)
##
## Call:
## glm(formula = y_int ~ x + I(x^2) + z + I(z^2), family = "binomial",
## data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7923 -0.6299 -0.4367 0.6884 2.2197
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.46493 0.13209 -11.090 < 2e-16 ***
## x 0.46548 0.03232 14.402 < 2e-16 ***
## I(x^2) 0.05997 0.01179 5.087 3.63e-07 ***
## z 0.28398 0.05737 4.950 7.41e-07 ***
## I(z^2) 0.11609 0.02728 4.256 2.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1313.58 on 999 degrees of freedom
## Residual deviance: 932.94 on 995 degrees of freedom
## AIC: 942.94
##
## Number of Fisher Scoring iterations: 4
The predictions for the log odds and probabilities, and the observed probabilities are calculated as before based on the new models and data
logodds3 <- matrix(predict(mod3, data.fit), ncol = 50, nrow = 50, byrow = T)
logodds_sq <- matrix(predict(mod_sq, data.fit), ncol = 50, nrow = 50, byrow = T)
p3 <- matrix(predict(mod3, data.fit, type = "response"), ncol = 50, nrow = 50, byrow = T)
p_sq <- matrix(predict(mod_sq, data.fit, type = "response"), ncol = 50, nrow = 50, byrow = T)
dat <- dat %>%
group_by(xcat, zcat) %>%
mutate(
risk_sq = sum(y_sq)/length(y_sq)
) %>%
ungroup(xcat, zcat)
Predicted Log odds
plot_ly(x = xgrid, y = zgrid, z = logodds3) %>%
add_surface() %>%
layout(title = "Linear model",
scene = list(
xaxis = list(title = "x"),
yaxis = list(title = "z"),
zaxis = list(title = "log(odds)")))
plot_ly(x = xgrid, y = zgrid, z = logodds_sq) %>%
add_surface() %>%
layout(title = "Quadratic model)",
scene = list(
xaxis = list(title = "x"),
yaxis = list(title = "z"),
zaxis = list(title = "log(odds)")))
Predicted and observed probabilities
plot_ly(dat, x = ~x, y = ~z, z = ~risk_sq, type = "scatter3d",
mode = "markers", marker = list(size = 2)) %>%
add_surface(x = xgrid, y = zgrid, z = p3) %>%
layout(title = "Linear model",
scene = list(
xaxis = list(title = "x"),
yaxis = list(title = "z"),
zaxis = list(title = "probability")))
plot_ly(dat, x = ~x, y = ~z, z = ~risk_sq, type = "scatter3d",
mode = "markers", marker = list(size = 2)) %>%
add_surface(x = xgrid, y = zgrid, z = p_sq) %>%
layout(title = "Quadratic model",
scene = list(
xaxis = list(title = "x"),
yaxis = list(title = "z"),
zaxis = list(title = "probability")))