Introduction

When you have a binary outcome (Yes/No), you can use a chi-square test to compare proportions across \(n\) groups. For instance, if you had two groups (exposed and unexposed) and a binary outcome (event and no event), you could create a 2 x 2 contingency table and use a chi-square test to test whether the frequency or proportion of the outcome differs across the two groups. However, this will not give you the magnitude of the difference, the direction of the difference, or the uncertainty around the difference.
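
To make this concrete, here is a minimal sketch of a chi-square test on a 2 x 2 table using base R's chisq.test() (the counts are made up for illustration):

### Hypothetical 2 x 2 table: rows = exposure, columns = outcome
tab <- matrix(c(20, 30,
                10, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(Exposure = c("Exposed", "Unexposed"),
                              Outcome = c("Event", "No event")))
chisq.test(tab)   ### tests whether the outcome proportions differ between the groups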

Figure caption: 2 x 2 contingency table

Alternative measures of association include the risk ratio and the odds ratio. I wrote a tutorial on how to estimate these in R, which you can find at this link. In this tutorial, however, we’ll go over how we can leverage the odds ratio, particularly when constructing a logistic regression model to make predictions.

Logistic regression model

The logistic regression model has several assumptions; however, they do not necessarily follow those of the linear regression model.

The structural form of the logistic regression model:

\(\text{logit}(E[Y_i | X_i]) = \text{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_{1i}\)


Typical notation of the logistic regression model includes: \(p_i\), the probability that observation \(i\) experiences the outcome, \(P(Y_i = 1 | X_i)\); \(\frac{p_i}{1 - p_i}\), the odds of the outcome; \(\beta_0\), the intercept; and \(\beta_1\), the coefficient (log odds ratio) associated with the predictor \(X_{1i}\).

The logistic regression model is a predictive model for binary data; it is also known as a classification model. The logistic regression model can generate the probability that a sample has the discrete outcome given the input variable(s). Its parameters are estimated using maximum likelihood estimation (MLE), and the predicted conditional probability can be used to classify the outcome when a certain threshold is met (e.g., > 0.50). Hence, the predicted probabilities of a logistic regression model range between 0 and 1. The figure below provides an example of a logistic function. The model uses the logit function, the natural log of the odds, to model a binary outcome. For instance, if the probability is > 0.5, the logit is positive; if the probability is < 0.5, the logit is negative.
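
To see this numerically, a quick base R sketch computes the logit at a few probabilities:

### The logit is the natural log of the odds
p <- c(0.25, 0.50, 0.75)
log(p / (1 - p))   ### negative below 0.5, zero at 0.5, positive above
## [1] -1.098612  0.000000  1.098612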

Unlike the 2 x 2 contingency table setup, logistic regression allows for continuous as well as categorical independent (predictor) variables. This allows for easy interpretation, particularly with continuous data. Additionally, logistic regression can include multiple predictors, which can be controlled for (adjusted) in a multivariable logistic regression model.

library(LaplacesDemon)   ### provides the invlogit() function
x <- -10:10              ### values on the logit (log odds) scale
prob <- invlogit(x)      ### inverse logit maps these to probabilities between 0 and 1
plot(x, prob, type = "l", main = "Logistic regression function plot", ylab = "Probability", xlab = "Values of X")

Motivating Example - Logistic regression

Data and libraries

We’ll use the mtcars data to build our logistic regression model.

#### Load the libraries
library("ggplot2")
library("gmodels")
library("epitools")
library("tidyverse")

### Create factors
data1 <- within(mtcars, {
           vs <- factor(vs, labels = c("V", "S"))
           am <- factor(am, labels = c("automatic", "manual"))
})
head(data1)
##                    mpg cyl disp  hp drat    wt  qsec vs        am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  V    manual    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  V    manual    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  S    manual    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  S automatic    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  V automatic    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  S automatic    3    1

2 x 2 contingency table

Let’s look at the 2 x 2 contingency table of transmission type am (0 = automatic and 1 = manual) by engine type vs (0 = V-shaped and 1 = straight).

Among vehicles with a V-shaped engine, 33% (N = 6) had a manual transmission and 67% (N = 12) had an automatic transmission. According to Pearson’s chi-square test (p = 0.34), this difference was not statistically significant.

CrossTable(data1$vs, data1$am, chisq = TRUE, missing.include = TRUE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  32 
## 
##  
##              | data1$am 
##     data1$vs | automatic |    manual | Row Total | 
## -------------|-----------|-----------|-----------|
##            V |        12 |         6 |        18 | 
##              |     0.161 |     0.236 |           | 
##              |     0.667 |     0.333 |     0.562 | 
##              |     0.632 |     0.462 |           | 
##              |     0.375 |     0.188 |           | 
## -------------|-----------|-----------|-----------|
##            S |         7 |         7 |        14 | 
##              |     0.207 |     0.303 |           | 
##              |     0.500 |     0.500 |     0.438 | 
##              |     0.368 |     0.538 |           | 
##              |     0.219 |     0.219 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        19 |        13 |        32 | 
##              |     0.594 |     0.406 |           | 
## -------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  0.9068826     d.f. =  1     p =  0.3409429 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 =  0.3475355     d.f. =  1     p =  0.5555115 
## 
## 

Odds ratio calculation

We can calculate the crude odds ratio. Based on the output below, vehicles with a straight engine had 2 times the odds of having a manual transmission (95% CI: 0.48, 8.40) compared to vehicles with a V-shaped engine.

oddsratio(data1$vs, data1$am, conf.level = 0.95, method = "wald")
## $data
##          Outcome
## Predictor automatic manual Total
##     V            12      6    18
##     S             7      7    14
##     Total        19     13    32
## 
## $measure
##          odds ratio with 95% C.I.
## Predictor estimate     lower    upper
##         V        1        NA       NA
##         S        2 0.4764466 8.395484
## 
## $p.value
##          two-sided
## Predictor midp.exact fisher.exact chi.square
##         V         NA           NA         NA
##         S  0.3718521    0.4726974  0.3409429
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"

Logistic regression in R

We can create a crude logistic regression model to estimate the odds ratio. We set the transmission type am as the dependent variable and the engine type vs as the independent variable. The glm() command generates coefficients on the log odds scale: the intercept is the log odds of a manual transmission in the reference group (V-shaped engines), and the vsS coefficient is the log odds ratio. We need to exponentiate these using the exp() command to get the odds ratio.

According to the logistic regression model, vehicles with a straight engine had 2.0 times the odds of having a manual transmission (95% CI: 0.48, 8.76) compared to vehicles with a V-shaped engine; this is not statistically significant since the 95% confidence interval crosses the null value (OR = 1).

logit1 <- glm(formula = am ~ vs, data = data1, family = binomial(link = "logit"))
summary(logit1)
## 
## Call:
## glm(formula = am ~ vs, family = binomial(link = "logit"), data = data1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1774  -0.9005  -0.9005   1.1774   1.4823  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.6931     0.5000  -1.386    0.166
## vsS           0.6931     0.7319   0.947    0.344
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 42.323  on 30  degrees of freedom
## AIC: 46.323
## 
## Number of Fisher Scoring iterations: 4
### Generate the 95% CI (profile likelihood, so it differs slightly from the Wald CI above)
confint(logit1)
##                  2.5 %    97.5 %
## (Intercept) -1.7483158 0.2526876
## vsS         -0.7334126 2.1696774
### Exponentiate the coefficients
exp(coef(logit1))     ### Odds ratio
## (Intercept)         vsS 
##         0.5         2.0
exp(confint(logit1))  ### 95% CI (odds ratio)
##                 2.5 %   97.5 %
## (Intercept) 0.1740669 1.287481
## vsS         0.4802672 8.755459
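
If you prefer a single tidy table, the broom package (an extra package, not loaded above) can exponentiate the coefficients and profile confidence intervals in one step; the estimate, conf.low, and conf.high columns should match the exp(coef()) and exp(confint()) results shown above:

library("broom")
tidy(logit1, exponentiate = TRUE, conf.int = TRUE)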

Multivariable logistic regression model

The structural form of the multivariable logistic regression model (this example uses two X variables):

\(\text{logit}(E[Y_i | X_i]) = \text{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}\)


Since the logistic regression model can include both continuous and categorical predictors, we can add the engine type vs (V versus straight engine) and vehicle weight wt.

\(\text{logit}(E[Y_i | X_i]) = \text{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 (vs)_{i} + \beta_2 (wt)_{i}\)


where vs is the engine type and wt is the vehicle weight.

Descriptives

There is a total of 32 observations; N = 19 vehicles have an automatic transmission and N = 13 have a manual transmission. There are 18 (56.2%) vehicles with V-shaped engines and 14 (43.8%) vehicles with straight engines. The average weight of vehicles with an automatic transmission is 3.77 units (SD, 0.78), and for vehicles with a manual transmission the average weight is 2.41 units (SD, 0.62); in mtcars, weight is recorded in units of 1,000 lbs. This difference was statistically significant (p < 0.001).

library("dplyr")

### Get the n for transmission and engine types
CrossTable(data1$am)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  32 
## 
##  
##           | automatic |    manual | 
##           |-----------|-----------|
##           |        19 |        13 | 
##           |     0.594 |     0.406 | 
##           |-----------|-----------|
## 
## 
## 
## 
CrossTable(data1$vs)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  32 
## 
##  
##           |         V |         S | 
##           |-----------|-----------|
##           |        18 |        14 | 
##           |     0.562 |     0.438 | 
##           |-----------|-----------|
## 
## 
## 
## 
### Get the n, mean, and sd of weight by transmission type.
group_by(data1, am) %>%
  summarise(
    count = n(),
    mean = mean(wt, na.rm = TRUE),
    sd = sd(wt, na.rm = TRUE)
  )
## # A tibble: 2 x 4
##   am        count  mean    sd
##   <fct>     <int> <dbl> <dbl>
## 1 automatic    19  3.77 0.777
## 2 manual       13  2.41 0.617
### Compare the average vehicle weight by transmission type. 
t.test(data1$wt ~ data1$am, alternative = "two.sided", paired = FALSE, var.equal = FALSE, conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  data1$wt by data1$am
## t = 5.4939, df = 29.234, p-value = 6.272e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.8525632 1.8632262
## sample estimates:
## mean in group automatic    mean in group manual 
##                3.768895                2.411000

Multivariable logistic regression model

We add the vs and wt variables as predictors in the logistic regression model. The output will be on the log odds scale, but we can exponentiate it using the exp() command.

We get very different results from the first logistic regression model. In this multivariable logistic regression model, the association between engine type vs and transmission type am reverses direction (OR = 0.01; 95% CI: 0.0000055, 0.49) when controlling for vehicle weight wt. In the crude logistic regression model, vehicles with a straight engine had 2.0 times the odds of having a manual transmission; after adjusting for the vehicle’s weight, they had substantially lower odds of a manual transmission, which suggests that vehicle weight confounds the crude association between engine type and transmission type.

logit2 <- glm(formula = am ~ vs + wt, data = data1, family = binomial(link = "logit"))
summary(logit2)
## 
## Call:
## glm(formula = am ~ vs + wt, family = binomial(link = "logit"), 
##     data = data1)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.72025  -0.25387  -0.04841   0.13220   1.90889  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   22.143      9.134   2.424   0.0153 *
## vsS           -4.496      2.641  -1.703   0.0887 .
## wt            -6.664      2.640  -2.524   0.0116 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 13.095  on 29  degrees of freedom
## AIC: 19.095
## 
## Number of Fisher Scoring iterations: 7
### Generate the 95% CI
confint(logit2)
##                  2.5 %     97.5 %
## (Intercept)   9.516615 48.7186931
## vsS         -12.115746 -0.7135884
## wt          -14.321591 -2.9934338
### Exponentiate the coefficients
exp(coef(logit2))     ### Odds ratio
##  (Intercept)          vsS           wt 
## 4.136804e+09 1.114894e-02 1.276657e-03
exp(confint(logit2))  ### 95% CI (odds ratio)
##                    2.5 %       97.5 %
## (Intercept) 1.358356e+04 1.439659e+21
## vsS         5.472656e-06 4.898832e-01
## wt          6.028539e-07 5.011505e-02
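
Since the logistic regression model is ultimately a predictive model, we can also convert the fitted log odds into probabilities. A short sketch using the fitted logit2 (the hand calculation below uses a hypothetical vehicle, and the final value is approximate):

### Predicted probability of a manual transmission for each vehicle
head(round(predict(logit2, type = "response"), 3))
### By hand for a hypothetical vehicle: straight engine (vsS = 1), wt = 3 (3,000 lbs)
lp <- coef(logit2)["(Intercept)"] + coef(logit2)["vsS"] * 1 + coef(logit2)["wt"] * 3
unname(1 / (1 + exp(-lp)))   ### inverse logit of the linear predictor; roughly 0.09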

Comparison between models

We compare the odds ratio between the crude and adjusted logistic regression models.
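
The comparison figure below could be generated with a sketch like the following (assuming ggplot2 is loaded, as above; the estimates and CIs are copied from the model outputs):

### Crude and adjusted odds ratios for vsS with 95% CIs
ors <- data.frame(model = c("Crude", "Adjusted"),
                  or    = c(2.0, 0.011),
                  lower = c(0.48, 0.0000055),
                  upper = c(8.76, 0.49))
ggplot(ors, aes(x = model, y = or, ymin = lower, ymax = upper)) +
  geom_pointrange() +
  geom_hline(yintercept = 1, linetype = "dashed") +   ### null value (OR = 1)
  scale_y_log10() +                                   ### log scale keeps both CIs readable
  labs(x = "", y = "Odds ratio (log scale)")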

Figure caption: Comparison between crude and adjusted logistic regression models

Motivating example – diabetes dataset

Let’s use the diabetes.csv data set to construct a logistic regression model. You can download the data here.

Descriptives

There are 768 observations in the diabetes.data data frame.

library("psych")

## Import file
diabetes.data <- read.csv("diabetes.csv", header = TRUE)
head(diabetes.data)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0
### Descriptive of diabetes.data
describeBy(diabetes.data)
##                          vars   n   mean     sd median trimmed   mad   min
## Pregnancies                 1 768   3.85   3.37   3.00    3.46  2.97  0.00
## Glucose                     2 768 120.89  31.97 117.00  119.38 29.65  0.00
## BloodPressure               3 768  69.11  19.36  72.00   71.36 11.86  0.00
## SkinThickness               4 768  20.54  15.95  23.00   19.94 17.79  0.00
## Insulin                     5 768  79.80 115.24  30.50   56.75 45.22  0.00
## BMI                         6 768  31.99   7.88  32.00   31.96  6.82  0.00
## DiabetesPedigreeFunction    7 768   0.47   0.33   0.37    0.42  0.25  0.08
## Age                         8 768  33.24  11.76  29.00   31.54 10.38 21.00
## Outcome                     9 768   0.35   0.48   0.00    0.31  0.00  0.00
##                             max  range  skew kurtosis   se
## Pregnancies               17.00  17.00  0.90     0.14 0.12
## Glucose                  199.00 199.00  0.17     0.62 1.15
## BloodPressure            122.00 122.00 -1.84     5.12 0.70
## SkinThickness             99.00  99.00  0.11    -0.53 0.58
## Insulin                  846.00 846.00  2.26     7.13 4.16
## BMI                       67.10  67.10 -0.43     3.24 0.28
## DiabetesPedigreeFunction   2.42   2.34  1.91     5.53 0.01
## Age                       81.00  60.00  1.13     0.62 0.42
## Outcome                    1.00   1.00  0.63    -1.60 0.02

Create categorical variables

Currently, the data frame has a variable called Pregnancies that contains the number of pregnancies each subject has had. We are only interested in whether or not they have had a pregnancy, so we’ll create a binary variable called group. We then add labels to the 0s and 1s and call this new data frame data2. There was a total of 657 (85.5%) subjects with a history of pregnancy and 111 (14.5%) subjects with no history of pregnancy.

#### Generate a binary group based on pregnancies (0 = no history of pregnancy, 1 = at least one pregnancy)
diabetes.data$group[diabetes.data$Pregnancies == 0] = 0 
diabetes.data$group[diabetes.data$Pregnancies > 0] = 1 
CrossTable(diabetes.data$group)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  768 
## 
##  
##           |         0 |         1 | 
##           |-----------|-----------|
##           |       111 |       657 | 
##           |     0.145 |     0.855 | 
##           |-----------|-----------|
## 
## 
## 
## 
### Create factors
data2 <- within(diabetes.data, {
           group <- factor(group, labels = c("No history of pregnancies", "History of pregnancies"))
})

CrossTable(data2$group)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  768 
## 
##  
##                           | No history of pregnancies |    History of pregnancies | 
##                           |---------------------------|---------------------------|
##                           |                       111 |                       657 | 
##                           |                     0.145 |                     0.855 | 
##                           |---------------------------|---------------------------|
## 
## 
## 
## 

Crude Logistic Regression Model

We create a crude logistic regression model to evaluate the association between the subject’s age Age and history of pregnancy group. Based on the crude logistic regression model, a one-year increase in age was associated with a 7% increase in the odds of having a history of pregnancy (OR = 1.07; 95% CI: 1.05, 1.10), which is statistically significant.

logit3 <- glm(group ~ Age, data = data2, family = binomial(link = "logit"))
summary(logit3)
## 
## Call:
## glm(formula = group ~ Age, family = binomial(link = "logit"), 
##     data = data2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9478   0.2902   0.4966   0.6630   0.7500  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.33970    0.39173  -0.867    0.386    
## Age          0.06972    0.01343   5.193 2.06e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 634.53  on 767  degrees of freedom
## Residual deviance: 597.14  on 766  degrees of freedom
## AIC: 601.14
## 
## Number of Fisher Scoring iterations: 5
confint(logit3)
##                   2.5 %     97.5 %
## (Intercept) -1.13328100 0.40600173
## Age          0.04484305 0.09761406
### Exponentiate the coefficients
exp(coef(logit3))
## (Intercept)         Age 
##    0.711984    1.072211
exp(confint(logit3))
##                 2.5 %   97.5 %
## (Intercept) 0.3219751 1.500805
## Age         1.0458637 1.102537
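
Because Age is continuous, this odds ratio applies per one-year increase. For a k-year increase, exponentiate k times the coefficient; a quick sketch for a 10-year increase:

### Odds ratio for a 10-year increase in age
exp(10 * coef(logit3)["Age"])   ### roughly 2.0: the odds about double per decade of age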

Multivariable logistic regression model

We can add potential confounders such as BMI, glucose, and skin thickness to the logistic regression model. This will generate model estimates that adjust for these other predictors. We store this adjusted model as logit4 so that it does not overwrite the crude model logit3.

logit4 <- glm(group ~ Age + BMI + Glucose + SkinThickness, data = data2, family = binomial(link = "logit"))
summary(logit4)
## 
## Call:
## glm(formula = group ~ Age + BMI + Glucose + SkinThickness, family = binomial(link = "logit"), 
##     data = data2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3549   0.2632   0.4737   0.6239   1.2282  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.516255   0.626257   2.421 0.015472 *  
## Age            0.083301   0.014808   5.625 1.85e-08 ***
## BMI           -0.053471   0.015487  -3.453 0.000555 ***
## Glucose       -0.005492   0.003615  -1.519 0.128700    
## SkinThickness  0.007768   0.007413   1.048 0.294694    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 634.53  on 767  degrees of freedom
## Residual deviance: 578.30  on 763  degrees of freedom
## AIC: 588.3
## 
## Number of Fisher Scoring iterations: 6
confint(logit4)
##                      2.5 %       97.5 %
## (Intercept)    0.296789712  2.755963522
## Age            0.055821246  0.114006339
## BMI           -0.084232809 -0.023425961
## Glucose       -0.012607513  0.001577382
## SkinThickness -0.006861665  0.022262463
### Exponentiate the coefficients
exp(coef(logit4))
##   (Intercept)           Age           BMI       Glucose SkinThickness 
##     4.5551360     1.0868687     0.9479330     0.9945226     1.0077981
exp(confint(logit4))
##                   2.5 %     97.5 %
## (Intercept)   1.3455323 15.7361958
## Age           1.0574087  1.1207592
## BMI           0.9192172  0.9768463
## Glucose       0.9874716  1.0015786
## SkinThickness 0.9931618  1.0225121
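
Recalling the classification idea from the introduction, we can also turn the fitted probabilities into predicted classes at the 0.50 threshold; a minimal sketch using the adjusted model:

### Predicted probabilities of a history of pregnancy from the adjusted model
pred <- predict(logit4, type = "response")
### Classify at the 0.50 threshold and cross-tabulate against the observed groups
table(predicted = pred > 0.50, observed = data2$group)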

Comparison between models

The odds ratio describing the association between Age and history of pregnancy did not change much between the crude and adjusted logistic regression models.
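
A quick way to see this side by side (a sketch, assuming both models are still in memory):

### Age odds ratio, crude (logit3) versus adjusted (logit4)
round(c(crude = exp(coef(logit3)["Age"]), adjusted = exp(coef(logit4)["Age"])), 3)
## crude.Age adjusted.Age 
##     1.072        1.087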

Figure caption: Comparison between crude and adjusted logistic regression models

Conclusions

Logistic regression models allow us to estimate the association between a binary outcome and predictor variables that can be continuous or categorical. Additionally, the logistic regression model allows us to include more than one predictor variable, thereby controlling for their effects. This is useful when potential confounders are present and the researcher wants to adjust for them.

Acknowledgements

I found the LaplacesDemon package useful for plotting the logistic regression function. I also found the website of Alboukadel Kassambara very helpful for learning more about logistic regression; he has a lot of helpful resources for performing other types of models using R. His website can be found here. Additionally, I found the site by Selva Prabhakaran helpful in understanding the equations used for logistic regression models; here is the link.

Contact

These tutorials are a work in progress, so they may contain errors. I highly recommend that you seek statistical consultation if you plan on running any analysis meant for publication. Any comments and suggestions can be sent to: