library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)
setwd("E:/Biostat and Study Design/204/Lectures/Data")
PIMA_df <- openxlsx::read.xlsx('PIMA.xlsx')
The odds are defined as the probability that the event will occur divided by the probability that the event will not occur. To better under the concept of odds, consider the following example. If you toss a fair coin, the probability of getting heads is 50% (Pr). The probability of getting tails is 1-Pr. What are the odds of getting heads? To answer this question, we start by defining odds, which is defined as the ratio of the number of ways the event can occur to the number of ways the event cannot occur.
\[Odds\:of\:heads=\frac{probability\:of\:heads}{probability\:of\:tails}=\frac{Pr}{1-Pr}=\frac{50\%}{50\%}=1\] Keep in mind the distinction between probability and odds. In our example, the probability of getting heads when tossing a fair coin is 50% and the odds of getting heads when tossing a fair coin is 1.
Probability value ranges between 0 to 1, odds value ranges between 0 and \(\infty\), and log-odds ranges between from \(− \infty\) to \(+ \infty\). Below is a table demonstrating the relationship between probability, odds, and log-odds.
| Probability = Odds/(1+Odds) | Odds = Pr/(1-Pr) | Log-odds = ln(Odds) |
|---|---|---|
| 0.01 | 0.01 | -4.60 |
| 0.1 | 0.11 | -2.21 |
| 0.15 | 0.18 | -1.71 |
| 0.2 | 0.25 | -1.39 |
| 0.3 | 0.43 | -0.84 |
| 0.5 | 1 | 0 |
| 0.6 | 1.5 | 0.41 |
| 0.7 | 2.3 | 0.83 |
| 0.8 | 4 | 1.39 |
| 0.9 | 9 | 2.19 |
| 0.95 | 19 | 2.94 |
Linear regression model is based on an assumption that the outcome Y is continuous, with errors that are normally distributed. If the outcome variable is binary this assumption is clearly violated. Additionally, there is no assurance that the fitted values fall between 0 and 1.
Motivating example: Diabetes mellitus type 2 (T2DM) is a prevalent disease with significant morbidity and mortality. Because multiple risk factors may be involved, several risk factors must be controlled when analyzing variables associated with T2DM.
A model of the following form might be considered.
\[Pr = \alpha + \beta_1 x_1 +...+ \beta_k x_k\]
where Pr = probability of disease. However, because the right-hand side of the equation could be less than 0 or greater than 1 for certain values of \(x_1, . . . , x_k\), predicted probabilities that are either less than 0 or greater than 1 could be obtained, which is impossible. Instead, the logit (logistic) transformation of Pr is often used as the dependent variable.
\[ln(\frac{Pr}{1-Pr})=logit(Pr) \] The logit transformation can take on any value from \(− \infty\) to \(+ \infty\).
\[ln(\frac{Pr}{1-Pr})=logit(Pr)= \alpha + \beta_1 x_1 +...+ \beta_k x_k\]
If we solve for Pr, then the model can be expressed in the form.
\[Pr(Y_i=1|X) = {\frac{1}{1 + e^{-(\alpha + \beta_1 x_1 +...+ \beta_k x_k)}}} \]
The following assumptions must be met for the binary logistic regression model to return valid results:
Example: A researcher is interested in investigating the relationship between diabetes mellitus type 2 (T2DM) and the number of pregnancies. Using the PIMA dataset, construct a binary logistic regression model with diabetes as the dependent variable and the number of pregnancies, blood pressure (confounder), and glucose level (confounder), age > 50 (confounder) as the independent variables.
PIMA_df %>% head()
## Diabetes Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 0 1 0 20 0 24.7
## 2 0 1 0 74 20 23 27.7
## 3 0 1 0 68 35 0 32.0
## 4 1 5 0 80 32 0 41.0
## 5 1 6 0 68 41 0 39.0
## 6 0 5 44 62 0 0 25.0
## DiabetesPedigreeFunction Age Age_50
## 1 0.140 22 No
## 2 0.299 21 No
## 3 0.389 22 No
## 4 0.346 37 No
## 5 0.727 41 No
## 6 0.587 36 No
PIMA_df_lg <- PIMA_df %>% select(Diabetes,Pregnancies,BloodPressure,Glucose,Age_50) #select variables of interest
Since ‘Diabetes’ is a categorical variable, we need to make sure it is has a class of factor.
class(PIMA_df_lg$Diabetes) #evaluate class of diabetes variable
## [1] "numeric"
class(PIMA_df_lg$Age_50) #evaluate class of age variable
## [1] "character"
class(PIMA_df_lg$BloodPressure) #evaluate class of blood pressure variable
## [1] "character"
class(PIMA_df_lg$Glucose) ##evaluate class of glucose variable
## [1] "numeric"
PIMA_df_lg$Diabetes <- as.factor(PIMA_df_lg$Diabetes) #convert diabetes to factor
PIMA_df_lg$Age_50 <- as.factor(PIMA_df_lg$Age_50) #convert age_50 to factor
PIMA_df_lg$BloodPressure <- as.numeric(PIMA_df_lg$BloodPressure) #convert blood pressure to numeric
class(PIMA_df_lg$Diabetes)
## [1] "factor"
levels(PIMA_df_lg$Diabetes)
## [1] "0" "1"
class(PIMA_df_lg$Age_50)
## [1] "factor"
levels(PIMA_df_lg$Age_50)
## [1] "No" "Yes"
class(PIMA_df_lg$BloodPressure)
## [1] "numeric"
Let’s investigate the data and check for outliers.
summary(PIMA_df_lg)
## Diabetes Pregnancies BloodPressure Glucose Age_50
## 0:500 Min. : 0.000 Min. : 0.00 Min. : 0.0 No :687
## 1:268 1st Qu.: 1.000 1st Qu.: 63.00 1st Qu.: 99.0 Yes: 81
## Median : 3.000 Median : 72.00 Median :117.0
## Mean : 3.845 Mean : 69.13 Mean :120.9
## 3rd Qu.: 6.000 3rd Qu.: 80.00 3rd Qu.:140.2
## Max. :17.000 Max. :122.00 Max. :199.0
## NA's :1
Since blood pressure and glucose values of zero are not physiologically possible, we will remove these observations.
PIMA_df_lg <- PIMA_df_lg %>% filter(BloodPressure!=0 & Glucose!=0) #remove observations if glucose is 0 or blood pressure is 0
summary(PIMA_df_lg)
## Diabetes Pregnancies BloodPressure Glucose Age_50
## 0:478 Min. : 0.000 Min. : 24.00 Min. : 44.0 No :648
## 1:250 1st Qu.: 1.000 1st Qu.: 64.00 1st Qu.:100.0 Yes: 80
## Median : 3.000 Median : 72.00 Median :117.0
## Mean : 3.863 Mean : 72.44 Mean :121.9
## 3rd Qu.: 6.000 3rd Qu.: 80.00 3rd Qu.:141.2
## Max. :17.000 Max. :122.00 Max. :199.0
Next, let’s construct the logistic regression model.
mylogit <- glm(Diabetes ~ Pregnancies+BloodPressure+Glucose+Age_50, data = PIMA_df_lg, family = "binomial") #create logistic regression model
summary(mylogit) # summary of model
##
## Call:
## glm(formula = Diabetes ~ Pregnancies + BloodPressure + Glucose +
## Age_50, family = "binomial", data = PIMA_df_lg)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.677295 0.687792 -9.708 < 2e-16 ***
## Pregnancies 0.128703 0.027804 4.629 3.68e-06 ***
## BloodPressure 0.009384 0.007874 1.192 0.233
## Glucose 0.038701 0.003513 11.016 < 2e-16 ***
## Age_50Yes -0.393626 0.290992 -1.353 0.176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 936.60 on 727 degrees of freedom
## Residual deviance: 726.39 on 723 degrees of freedom
## AIC: 736.39
##
## Number of Fisher Scoring iterations: 4
The number of pregnancies and glucose levels are significant variables in our model. The results can be interpreted as such:
We can use the confint() function to obtain confidence intervals for the coefficient estimates using the log-likelihood function
confint(mylogit) #generate confidence intervals
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -8.063655689 -5.36442272
## Pregnancies 0.074623908 0.18377076
## BloodPressure -0.006036974 0.02487003
## Glucose 0.032009132 0.04580068
## Age_50Yes -0.971165484 0.17209178
You can also exponentiate the coefficients and interpret them as odds-ratios.
round(exp(coef(mylogit)),3) #odds ratios only
## (Intercept) Pregnancies BloodPressure Glucose Age_50Yes
## 0.001 1.137 1.009 1.039 0.675
round(exp(cbind(OR = coef(mylogit), confint(mylogit))),3) #odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.001 0.000 0.005
## Pregnancies 1.137 1.077 1.202
## BloodPressure 1.009 0.994 1.025
## Glucose 1.039 1.033 1.047
## Age_50Yes 0.675 0.379 1.188
The results can be interpreted as such: