When you have a binary outcome (yes/no), you can use a chi-square test to compare the differences in proportions across \(n\) groups. For instance, if you had two groups (exposed and unexposed) and a binary outcome (event and no event), you could create a 2 x 2 contingency table and use a chi-square test to test whether the frequency or proportion of the outcome differs across the two groups. However, this will not give you the magnitude of the differences, the direction of the differences, or the uncertainty around them.
Figure caption: 2 x 2 contingency table
Alternative measures of association include the risk ratio and the odds ratio. I wrote a tutorial on how to estimate these in R, which you can find at this link. However, in this tutorial, we'll go over how we can leverage the odds ratio, particularly when it comes to constructing a logistic regression model to make predictions.
The logistic regression model has several assumptions; however, they do not necessarily follow those of the linear regression model: the outcome variable is binary, the observations are independent of one another, continuous predictors are linearly related to the logit of the outcome, and there is little or no multicollinearity among the predictors.
The structural form of the logistic regression model:

\( \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 \)

Typical notation of the logistic regression model includes: \( p \) is the probability that the outcome occurs (\( Y = 1 \)); \( \frac{p}{1-p} \) is the odds of the outcome; \( \beta_0 \) is the intercept, or the log odds when \( X_1 = 0 \); and \( \beta_1 \) is the change in the log odds for a one-unit increase in the predictor \( X_1 \).
The logistic regression model is a predictive model for binary data, also known as a classification model. The logistic regression model can generate the probability that a sample will have the discrete outcome given one or more input variables. The model coefficients are estimated using maximum likelihood estimation (MLE), and the predicted conditional probability can be used to classify the outcome when a certain threshold is met (e.g., > 0.50). Hence, the probability range of a logistic regression model is between 0 and 1. The figure below provides an example of a logistic function. The model uses a logit function, which is the natural log of the odds, to model a binary outcome: if the probability is > 0.5, the logit is positive; if the probability is < 0.5, the logit is negative.
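As a quick numeric check, here is a minimal sketch using base R (qlogis() is base R's logit function):

p <- c(0.25, 0.50, 0.75)
log(p / (1 - p)) # same as qlogis(p); returns -1.099, 0.000, 1.099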
Unlike the 2 x 2 contingency table setup, logistic regression allows continuous and categorical variables as the independent or predictor variables. This allows for easy interpretation, particularly with continuous data. Additionally, logistic regression can include multiple predictors, whose effects can be controlled or adjusted for in a multivariable logistic regression model.
library(LaplacesDemon)
x <- -10:10 # values of the linear predictor (log odds)
prob <- invlogit(x) # convert the log odds to probabilities
plot(x, prob, type = "l", main = "Logistic regression function plot", ylab = "Probability", xlab = "Values of X")
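If you prefer not to install LaplacesDemon, base R's plogis() is the same inverse-logit function (the logistic distribution function), so the plot can be reproduced without the extra dependency:

plot(x, plogis(x), type = "l", main = "Logistic regression function plot", ylab = "Probability", xlab = "Values of X")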
We'll use the mtcars data to build our logistic regression model.
#### Load the libraries
library("ggplot2")
library("gmodels")
library("epitools")
library("tidyverse")
### Create factors
data1 <- within(mtcars, {
vs <- factor(vs, labels = c("V", "S"))
am <- factor(am, labels = c("automatic", "manual"))
})
head(data1)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 V manual 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 V manual 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 S manual 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 S automatic 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 V automatic 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 S automatic 3 1
Let's look at the 2 x 2 contingency table between the transmission type am (0 = automatic and 1 = manual) and the engine type vs (0 = V-shaped and 1 = straight).
Among vehicles with a V-shaped engine, 33% (N = 6) had a manual transmission and 67% (N = 12) had an automatic transmission. According to Pearson's chi-square test (p = 0.34), this difference was not statistically significant.
CrossTable(data1$vs, data1$am, chisq = TRUE, missing.include = TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 32
##
##
## | data1$am
## data1$vs | automatic | manual | Row Total |
## -------------|-----------|-----------|-----------|
## V | 12 | 6 | 18 |
## | 0.161 | 0.236 | |
## | 0.667 | 0.333 | 0.562 |
## | 0.632 | 0.462 | |
## | 0.375 | 0.188 | |
## -------------|-----------|-----------|-----------|
## S | 7 | 7 | 14 |
## | 0.207 | 0.303 | |
## | 0.500 | 0.500 | 0.438 |
## | 0.368 | 0.538 | |
## | 0.219 | 0.219 | |
## -------------|-----------|-----------|-----------|
## Column Total | 19 | 13 | 32 |
## | 0.594 | 0.406 | |
## -------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 0.9068826 d.f. = 1 p = 0.3409429
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 0.3475355 d.f. = 1 p = 0.5555115
##
##
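As a cross-check, the uncorrected Pearson test can also be run with base R's chisq.test(); setting correct = FALSE should reproduce the chi-square statistic and p-value above:

chisq.test(table(data1$vs, data1$am), correct = FALSE)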
We can calculate the crude odds ratio. Using the odds ratio, vehicles with a V-shaped engine had 2 times higher odds of having an automatic transmission (95% CI: 0.48, 8.40) compared to vehicles with a straight engine.
oddsratio(data1$vs, data1$am, conf.level = 0.95, method = "wald")
## $data
## Outcome
## Predictor automatic manual Total
## V 12 6 18
## S 7 7 14
## Total 19 13 32
##
## $measure
## odds ratio with 95% C.I.
## Predictor estimate lower upper
## V 1 NA NA
## S 2 0.4764466 8.395484
##
## $p.value
## two-sided
## Predictor midp.exact fisher.exact chi.square
## V NA NA NA
## S 0.3718521 0.4726974 0.3409429
##
## $correction
## [1] FALSE
##
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"
We can create a crude logistic regression model to estimate the odds ratio. We set the transmission type am as the dependent variable and the engine type vs as the independent variable. The glm() command generates coefficients that are interpreted as the log odds of the event occurring. We need to exponentiate these with the exp() command to get the odds ratio.
According to the logistic regression model, vehicles with a V-shaped engine had 2.0 times higher odds of having an automatic transmission (95% CI: 0.48, 8.76) compared to vehicles with a straight engine; this is not statistically significant since the confidence interval crosses the null value (OR = 1).
logit1 <- glm(formula = am ~ vs, data = data1, family = binomial(link = "logit"))
summary(logit1)
##
## Call:
## glm(formula = am ~ vs, family = binomial(link = "logit"), data = data1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1774 -0.9005 -0.9005 1.1774 1.4823
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6931 0.5000 -1.386 0.166
## vsS 0.6931 0.7319 0.947 0.344
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 42.323 on 30 degrees of freedom
## AIC: 46.323
##
## Number of Fisher Scoring iterations: 4
### Generate the 95% CI
confint(logit1)
## 2.5 % 97.5 %
## (Intercept) -1.7483158 0.2526876
## vsS -0.7334126 2.1696774
### Exponentiate the coefficients
exp(coef(logit1)) ### Odds ratio
## (Intercept) vsS
## 0.5 2.0
exp(confint(logit1)) ### 95% CI (odds ratio)
## 2.5 % 97.5 %
## (Intercept) 0.1740669 1.287481
## vsS 0.4802672 8.755459
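As a sanity check, the fitted probabilities from this crude model should match the observed proportions of manual transmissions in the 2 x 2 table (6/18 for V-shaped engines and 7/14 for straight engines):

predict(logit1, newdata = data.frame(vs = factor(c("V", "S"), levels = c("V", "S"))), type = "response") # returns 0.333 and 0.500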
The structural form of the multivariable logistic regression model (this example uses two X variables):

\( \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \)

Since the logistic regression model can include both continuous and categorical predictors, we can add the engine type vs (V-shaped versus straight engine) and the vehicle weight wt:

\( \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 (vs) + \beta_2 (wt) \)

where vs is the engine type and wt is the vehicle weight.
There is a total of 32 observations; N = 19 vehicles have an automatic transmission and N = 13 have a manual transmission. There are 18 (56.2%) vehicles with V-shaped engines and 14 (43.8%) vehicles with straight engines. The average weight of vehicles with an automatic transmission is 3.77 units (SD, 0.78), and for vehicles with a manual transmission the average weight is 2.41 units (SD, 0.62); this difference was statistically significant (p < 0.001).
library("dplyr")
### Get the n for transmission and engine types
CrossTable(data1$am)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 32
##
##
## | automatic | manual |
## |-----------|-----------|
## | 19 | 13 |
## | 0.594 | 0.406 |
## |-----------|-----------|
##
##
##
##
CrossTable(data1$vs)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 32
##
##
## | V | S |
## |-----------|-----------|
## | 18 | 14 |
## | 0.562 | 0.438 |
## |-----------|-----------|
##
##
##
##
### Get the n, mean, and sd of weight by transmission type.
group_by(data1, am) %>%
summarise(
count = n(),
mean = mean(wt, na.rm = TRUE),
sd = sd(wt, na.rm = TRUE)
)
## # A tibble: 2 x 4
## am count mean sd
## <fct> <int> <dbl> <dbl>
## 1 automatic 19 3.77 0.777
## 2 manual 13 2.41 0.617
### Compare the average vehicle weight by transmission type.
t.test(data1$wt ~ data1$am, alternative = "two.sided", paired = FALSE, var.equal = FALSE, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: data1$wt by data1$am
## t = 5.4939, df = 29.234, p-value = 6.272e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.8525632 1.8632262
## sample estimates:
## mean in group automatic mean in group manual
## 3.768895 2.411000
We add the vs and wt variables as predictors in the logistic regression model. The output will be in log odds, but we can exponentiate it using the exp() command.

We get very different results from the first logistic regression model. In this multivariable logistic regression model, the association between the engine type vs and the transmission type am is much lower (OR = 0.01; 95% CI: 0.0000055, 0.49) when controlling for the vehicle weight wt. In the previous logistic regression model, the crude odds ratio for the engine type was 2.0; adjusting for the vehicle's weight changed both the magnitude and the direction of the estimate, which suggests that vehicle weight confounds the association between engine type and transmission type.
logit2 <- glm(formula = am ~ vs + wt, data = data1, family = binomial(link = "logit"))
summary(logit2)
##
## Call:
## glm(formula = am ~ vs + wt, family = binomial(link = "logit"),
## data = data1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.72025 -0.25387 -0.04841 0.13220 1.90889
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 22.143 9.134 2.424 0.0153 *
## vsS -4.496 2.641 -1.703 0.0887 .
## wt -6.664 2.640 -2.524 0.0116 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 13.095 on 29 degrees of freedom
## AIC: 19.095
##
## Number of Fisher Scoring iterations: 7
### Generate the 95% CI
confint(logit2)
## 2.5 % 97.5 %
## (Intercept) 9.516615 48.7186931
## vsS -12.115746 -0.7135884
## wt -14.321591 -2.9934338
### Exponentiate the coefficients
exp(coef(logit2)) ### Odds ratio
## (Intercept) vsS wt
## 4.136804e+09 1.114894e-02 1.276657e-03
exp(confint(logit2)) ### 95% CI (odds ratio)
## 2.5 % 97.5 %
## (Intercept) 1.358356e+04 1.439659e+21
## vsS 5.472656e-06 4.898832e-01
## wt 6.028539e-07 5.011505e-02
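With the adjusted model in hand, we can also generate predicted probabilities of a manual transmission for new observations. Here is a minimal sketch with two hypothetical vehicles (the weights are made up for illustration):

newcars <- data.frame(vs = factor(c("V", "S"), levels = c("V", "S")), wt = c(2.5, 3.5)) # hypothetical vehicles
predict(logit2, newdata = newcars, type = "response") # roughly 0.996 and 0.003 for these inputs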
We compare the odds ratios between the crude and adjusted logistic regression models.

Figure caption: Comparison between crude and adjusted logistic regression models
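A small sketch that reproduces this comparison numerically from the two fitted models:

round(c(crude = exp(coef(logit1))["vsS"], adjusted = exp(coef(logit2))["vsS"]), 3) # 2.000 vs. 0.011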
Let's use the diabetes.csv data set to construct a logistic regression model. You can download the data here. There are 768 observations in the diabetes.data data frame.
library("psych")
## Import file
diabetes.data <- read.csv("diabetes.csv", header = TRUE)
head(diabetes.data)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
### Descriptive statistics of diabetes.data
describeBy(diabetes.data)
## vars n mean sd median trimmed mad min
## Pregnancies 1 768 3.85 3.37 3.00 3.46 2.97 0.00
## Glucose 2 768 120.89 31.97 117.00 119.38 29.65 0.00
## BloodPressure 3 768 69.11 19.36 72.00 71.36 11.86 0.00
## SkinThickness 4 768 20.54 15.95 23.00 19.94 17.79 0.00
## Insulin 5 768 79.80 115.24 30.50 56.75 45.22 0.00
## BMI 6 768 31.99 7.88 32.00 31.96 6.82 0.00
## DiabetesPedigreeFunction 7 768 0.47 0.33 0.37 0.42 0.25 0.08
## Age 8 768 33.24 11.76 29.00 31.54 10.38 21.00
## Outcome 9 768 0.35 0.48 0.00 0.31 0.00 0.00
## max range skew kurtosis se
## Pregnancies 17.00 17.00 0.90 0.14 0.12
## Glucose 199.00 199.00 0.17 0.62 1.15
## BloodPressure 122.00 122.00 -1.84 5.12 0.70
## SkinThickness 99.00 99.00 0.11 -0.53 0.58
## Insulin 846.00 846.00 2.26 7.13 4.16
## BMI 67.10 67.10 -0.43 3.24 0.28
## DiabetesPedigreeFunction 2.42 2.34 1.91 5.53 0.01
## Age 81.00 60.00 1.13 0.62 0.42
## Outcome 1.00 1.00 0.63 -1.60 0.02
Currently, the data frame has a variable called Pregnancies that contains the number of pregnancies each subject has had. We are only interested in whether or not they have ever been pregnant, so we'll create a binary variable called group. We add labels to the 0s and 1s and call this new data frame data2. There were a total of 657 (85.5%) subjects with a history of pregnancy and 111 (14.5%) subjects with no history of pregnancy.
#### Generate a binary group based on pregnancies (0 = no pregnancies, 1 = one or more pregnancies)
diabetes.data$group[diabetes.data$Pregnancies == 0] = 0
diabetes.data$group[diabetes.data$Pregnancies > 0] = 1
CrossTable(diabetes.data$group)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 768
##
##
## | 0 | 1 |
## |-----------|-----------|
## | 111 | 657 |
## | 0.145 | 0.855 |
## |-----------|-----------|
##
##
##
##
### Create factors
data2 <- within(diabetes.data, {
group <- factor(group, labels = c("No history of pregnancies", "History of pregnancies"))
})
CrossTable(data2$group)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 768
##
##
## | No history of pregnancies | History of pregnancies |
## |---------------------------|---------------------------|
## | 111 | 657 |
## | 0.145 | 0.855 |
## |---------------------------|---------------------------|
##
##
##
##
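As an aside, the same recode can be done in a single step with dplyr (already loaded via tidyverse); this is equivalent to the two-step approach above:

data2 <- diabetes.data %>%
  mutate(group = factor(Pregnancies > 0, labels = c("No history of pregnancies", "History of pregnancies")))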
We create a crude logistic regression model to evaluate the association of the subject's age Age with history of pregnancy group. Based on the crude logistic regression model, a 1-year increase in age was associated with a 7% increase in the odds of having a history of pregnancy (OR = 1.07; 95% CI: 1.05, 1.10), which is statistically significant.
logit3 <- glm(group ~ Age, data = data2, family = binomial(link = "logit"))
summary(logit3)
##
## Call:
## glm(formula = group ~ Age, family = binomial(link = "logit"),
## data = data2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9478 0.2902 0.4966 0.6630 0.7500
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.33970 0.39173 -0.867 0.386
## Age 0.06972 0.01343 5.193 2.06e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 634.53 on 767 degrees of freedom
## Residual deviance: 597.14 on 766 degrees of freedom
## AIC: 601.14
##
## Number of Fisher Scoring iterations: 5
confint(logit3)
## 2.5 % 97.5 %
## (Intercept) -1.13328100 0.40600173
## Age 0.04484305 0.09761406
### Exponentiate the coefficients
exp(coef(logit3))
## (Intercept) Age
## 0.711984 1.072211
exp(confint(logit3))
## 2.5 % 97.5 %
## (Intercept) 0.3219751 1.500805
## Age 1.0458637 1.102537
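Because the coefficient is on the log odds scale, the odds ratio for a larger age increment is obtained by scaling the coefficient before exponentiating. For example, for a 10-year increase in age:

exp(10 * coef(logit3)["Age"]) # roughly 2.0, i.e., about double the odds per 10 years of age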
We can add potential confounders such as BMI, glucose, and skin thickness to the logistic regression model. This will generate model estimates that are adjusted for these other predictors. We assign the adjusted model to a new object, logit4, so we can compare it with the crude model later.
logit4 <- glm(formula = group ~ Age + BMI + Glucose + SkinThickness, data = data2, family = binomial(link = "logit"))
summary(logit4)
##
## Call:
## glm(formula = group ~ Age + BMI + Glucose + SkinThickness, family = binomial(link = "logit"),
## data = data2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3549 0.2632 0.4737 0.6239 1.2282
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.516255 0.626257 2.421 0.015472 *
## Age 0.083301 0.014808 5.625 1.85e-08 ***
## BMI -0.053471 0.015487 -3.453 0.000555 ***
## Glucose -0.005492 0.003615 -1.519 0.128700
## SkinThickness 0.007768 0.007413 1.048 0.294694
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 634.53 on 767 degrees of freedom
## Residual deviance: 578.30 on 763 degrees of freedom
## AIC: 588.3
##
## Number of Fisher Scoring iterations: 6
confint(logit4)
## 2.5 % 97.5 %
## (Intercept) 0.296789712 2.755963522
## Age 0.055821246 0.114006339
## BMI -0.084232809 -0.023425961
## Glucose -0.012607513 0.001577382
## SkinThickness -0.006861665 0.022262463
### Exponentiate the coefficients
exp(coef(logit4))
## (Intercept) Age BMI Glucose SkinThickness
## 4.5551360 1.0868687 0.9479330 0.9945226 1.0077981
exp(confint(logit4))
## 2.5 % 97.5 %
## (Intercept) 1.3455323 15.7361958
## Age 1.0574087 1.1207592
## BMI 0.9192172 0.9768463
## Glucose 0.9874716 1.0015786
## SkinThickness 0.9931618 1.0225121
The odds ratio describing the association between age and history of pregnancy did not change much between the crude and adjusted logistic regression models.
Figure caption: Comparison between crude and adjusted logistic regression models
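Since the crude model (logit3) and the adjusted model (logit4) are separate objects, a small sketch reproduces this comparison:

round(c(crude = exp(coef(logit3))["Age"], adjusted = exp(coef(logit4))["Age"]), 3) # 1.072 vs. 1.087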
Logistic regression models allow us to estimate the association between a binary outcome and predictor variables that can be continuous or categorical. Additionally, the logistic regression model allows us to include more than one predictor variable, thereby controlling for their effects. This is useful when potential confounders are present and the researcher wants to adjust for them.
I found the LaplacesDemon package useful for plotting the logistic regression function. I also found the website of Alboukadel Kassambara very helpful in learning more about logistic regression. He has a lot of helpful resources for performing other types of models in R; his website can be found here. Additionally, I found the site by Selva Prabhakaran helpful for understanding the equations used in logistic regression models; here is the link.
These tutorials are a work in progress, so they may contain errors. I highly recommend you seek out statistical consult if you plan on running any analysis meant for publication. Any comments and suggestions can be sent to: internal.validity.blog@gmail.com