RPubs

https://rpubs.com/zaidyousif/862660

Sources

Learning Objectives

Load Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)

Load data

setwd("E:/Biostat and Study Design/204/Lectures/Data")
PIMA_df <- openxlsx::read.xlsx('PIMA.xlsx')

Review

The odds are defined as the probability that an event will occur divided by the probability that it will not occur. To better understand the concept of odds, consider the following example. If you toss a fair coin, the probability of getting heads is 50% (Pr), and the probability of getting tails is 1 − Pr. What are the odds of getting heads? Applying the definition, the odds of heads are the ratio of the probability of heads to the probability of tails.

\[\text{Odds of heads}=\frac{\text{probability of heads}}{\text{probability of tails}}=\frac{Pr}{1-Pr}=\frac{50\%}{50\%}=1\]
Keep in mind the distinction between probability and odds: in our example, the probability of getting heads when tossing a fair coin is 50%, while the odds of getting heads are 1.

Probability ranges from 0 to 1, odds range from 0 to \(\infty\), and log-odds range from \(− \infty\) to \(+ \infty\). Below is a table demonstrating the relationship between probability, odds, and log-odds.

Probability = Odds/(1+Odds)   Odds = Pr/(1-Pr)   Log-odds = ln(Odds)
0.01                          0.01               -4.60
0.1                           0.11               -2.21
0.15                          0.18               -1.71
0.2                           0.25               -1.39
0.3                           0.43               -0.84
0.5                           1                   0
0.6                           1.5                 0.41
0.7                           2.3                 0.83
0.8                           4                   1.39
0.9                           9                   2.19
0.95                          19                  2.94
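
The table above can be reproduced in base R. Below is a minimal sketch using the same probabilities as the table (the last decimal place may differ slightly from the table, which rounds the odds before taking the log):

pr <- c(0.01, 0.1, 0.15, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95) #probabilities from the table
odds <- pr / (1 - pr) #odds = Pr/(1-Pr)
log_odds <- log(odds) #log-odds = ln(odds)
data.frame(Probability = pr, Odds = round(odds, 2), Log_odds = round(log_odds, 2)) #assemble the table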

Logistic Regression

The linear regression model is based on the assumption that the outcome Y is continuous, with errors that are normally distributed. If the outcome variable is binary, this assumption is clearly violated. Additionally, there is no assurance that the fitted values will fall between 0 and 1.

Motivating example: Diabetes mellitus type 2 (T2DM) is a prevalent disease with significant morbidity and mortality. Because multiple risk factors may be involved, those risk factors must be controlled for when analyzing variables associated with T2DM.

A model of the following form might be considered.

\[Pr = \alpha + \beta_1 x_1 +...+ \beta_k x_k\]

where Pr = probability of disease. However, because the right-hand side of the equation could be less than 0 or greater than 1 for certain values of \(x_1, . . . , x_k\), predicted probabilities that are either less than 0 or greater than 1 could be obtained, which is impossible. Instead, the logit (logistic) transformation of Pr is often used as the dependent variable.

\[ln(\frac{Pr}{1-Pr})=logit(Pr) \] The logit transformation can take on any value from \(− \infty\) to \(+ \infty\).

\[ln(\frac{Pr}{1-Pr})=logit(Pr)= \alpha + \beta_1 x_1 +...+ \beta_k x_k\]

If we solve for Pr, the model can be expressed in the following form:

\[Pr(Y_i=1|X) = {\frac{1}{1 + e^{-(\alpha + \beta_1 x_1 +...+ \beta_k x_k)}}} \]
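
As a quick check on this expression, the logit and inverse-logit (logistic) transformations are available in base R as qlogis() and plogis(); the short sketch below simply confirms that plogis() matches the formula above:

p <- 0.7 #an example probability
qlogis(p) #logit: ln(p/(1-p))
plogis(qlogis(p)) #inverse logit recovers the original probability
1 / (1 + exp(-qlogis(p))) #same value computed directly from the formula above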


The following assumptions must be met for the binary logistic regression model to return valid results:

- The dependent variable is binary.
- The observations are independent of one another.
- There is little or no multicollinearity among the independent variables.
- The continuous independent variables are linearly related to the log-odds of the outcome.
- The sample size is sufficiently large.

Example: A researcher is interested in investigating the relationship between diabetes mellitus type 2 (T2DM) and the number of pregnancies. Using the PIMA dataset, construct a binary logistic regression model with diabetes as the dependent variable and the number of pregnancies, blood pressure (confounder), and glucose level (confounder), age > 50 (confounder) as the independent variables.

PIMA_df %>% head()
##   Diabetes Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1        0           1       0                          20       0 24.7
## 2        0           1       0            74            20      23 27.7
## 3        0           1       0            68            35       0 32.0
## 4        1           5       0            80            32       0 41.0
## 5        1           6       0            68            41       0 39.0
## 6        0           5      44            62             0       0 25.0
##   DiabetesPedigreeFunction Age Age_50
## 1                    0.140  22     No
## 2                    0.299  21     No
## 3                    0.389  22     No
## 4                    0.346  37     No
## 5                    0.727  41     No
## 6                    0.587  36     No
PIMA_df_lg <- PIMA_df %>% select(Diabetes,Pregnancies,BloodPressure,Glucose,Age_50) #select variables of interest

Since ‘Diabetes’ and ‘Age_50’ are categorical variables, we need to make sure they have a class of factor. Likewise, ‘BloodPressure’ should have a class of numeric.

class(PIMA_df_lg$Diabetes) #evaluate class of diabetes variable
## [1] "numeric"
class(PIMA_df_lg$Age_50) #evaluate class of age variable
## [1] "character"
class(PIMA_df_lg$BloodPressure) #evaluate class of blood pressure variable
## [1] "character"
class(PIMA_df_lg$Glucose) #evaluate class of glucose variable
## [1] "numeric"
PIMA_df_lg$Diabetes <- as.factor(PIMA_df_lg$Diabetes) #convert diabetes to factor
PIMA_df_lg$Age_50 <- as.factor(PIMA_df_lg$Age_50) #convert age_50 to factor
PIMA_df_lg$BloodPressure <- as.numeric(PIMA_df_lg$BloodPressure) #convert blood pressure to numeric

class(PIMA_df_lg$Diabetes)
## [1] "factor"
levels(PIMA_df_lg$Diabetes)
## [1] "0" "1"
class(PIMA_df_lg$Age_50)
## [1] "factor"
levels(PIMA_df_lg$Age_50)
## [1] "No"  "Yes"
class(PIMA_df_lg$BloodPressure)
## [1] "numeric"

Let’s investigate the data and check for outliers and implausible values.

summary(PIMA_df_lg)
##  Diabetes  Pregnancies     BloodPressure       Glucose      Age_50   
##  0:500    Min.   : 0.000   Min.   :  0.00   Min.   :  0.0   No :687  
##  1:268    1st Qu.: 1.000   1st Qu.: 63.00   1st Qu.: 99.0   Yes: 81  
##           Median : 3.000   Median : 72.00   Median :117.0            
##           Mean   : 3.845   Mean   : 69.13   Mean   :120.9            
##           3rd Qu.: 6.000   3rd Qu.: 80.00   3rd Qu.:140.2            
##           Max.   :17.000   Max.   :122.00   Max.   :199.0            
##                            NA's   :1
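
A boxplot gives a quick visual screen for outliers. The sketch below uses tidyr and ggplot2 (both loaded with the tidyverse) to draw one boxplot per numeric predictor:

PIMA_df_lg %>%
  pivot_longer(c(Pregnancies, BloodPressure, Glucose), names_to = "variable", values_to = "value") %>% #reshape to long format
  ggplot(aes(x = "", y = value)) +
  geom_boxplot() + #one boxplot per variable
  facet_wrap(~variable, scales = "free_y") + #separate panel and scale for each variable
  labs(x = NULL, y = NULL)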

Since blood pressure and glucose values of zero are not physiologically possible, we will remove these observations. (The single missing blood pressure value is also dropped, because filter() removes rows where the condition evaluates to NA.)

PIMA_df_lg <- PIMA_df_lg %>% filter(BloodPressure!=0 & Glucose!=0) #remove observations if glucose is 0 or blood pressure is 0
summary(PIMA_df_lg)
##  Diabetes  Pregnancies     BloodPressure       Glucose      Age_50   
##  0:478    Min.   : 0.000   Min.   : 24.00   Min.   : 44.0   No :648  
##  1:250    1st Qu.: 1.000   1st Qu.: 64.00   1st Qu.:100.0   Yes: 80  
##           Median : 3.000   Median : 72.00   Median :117.0            
##           Mean   : 3.863   Mean   : 72.44   Mean   :121.9            
##           3rd Qu.: 6.000   3rd Qu.: 80.00   3rd Qu.:141.2            
##           Max.   :17.000   Max.   :122.00   Max.   :199.0

Next, let’s construct the logistic regression model.

mylogit <- glm(Diabetes ~ Pregnancies+BloodPressure+Glucose+Age_50, data = PIMA_df_lg, family = "binomial") #create logistic regression model
summary(mylogit) # summary of model
## 
## Call:
## glm(formula = Diabetes ~ Pregnancies + BloodPressure + Glucose + 
##     Age_50, family = "binomial", data = PIMA_df_lg)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -6.677295   0.687792  -9.708  < 2e-16 ***
## Pregnancies    0.128703   0.027804   4.629 3.68e-06 ***
## BloodPressure  0.009384   0.007874   1.192    0.233    
## Glucose        0.038701   0.003513  11.016  < 2e-16 ***
## Age_50Yes     -0.393626   0.290992  -1.353    0.176    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 936.60  on 727  degrees of freedom
## Residual deviance: 726.39  on 723  degrees of freedom
## AIC: 736.39
## 
## Number of Fisher Scoring iterations: 4

The number of pregnancies and glucose level are significant variables in our model. The results can be interpreted as such:

- For each additional pregnancy, the log-odds of diabetes increase by 0.129 (p < 0.001), holding the other variables constant.
- For each one-unit increase in glucose, the log-odds of diabetes increase by 0.039 (p < 0.001), holding the other variables constant.
- Blood pressure (p = 0.233) and age > 50 (p = 0.176) are not statistically significant after adjusting for the other variables.

We can use the confint() function to obtain confidence intervals for the coefficient estimates based on the profiled log-likelihood.

confint(mylogit) #generate confidence intervals
## Waiting for profiling to be done...
##                      2.5 %      97.5 %
## (Intercept)   -8.063655689 -5.36442272
## Pregnancies    0.074623908  0.18377076
## BloodPressure -0.006036974  0.02487003
## Glucose        0.032009132  0.04580068
## Age_50Yes     -0.971165484  0.17209178
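
For comparison, Wald confidence intervals (estimate ± 1.96 × standard error, using the standard errors from the model summary) can be obtained with confint.default(). The profile-likelihood intervals from confint() are generally preferred for logistic regression, but the two are usually close:

confint.default(mylogit) #Wald confidence intervals for the coefficients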

You can also exponentiate the coefficients and interpret them as odds-ratios.

round(exp(coef(mylogit)),3) #odds ratios only
##   (Intercept)   Pregnancies BloodPressure       Glucose     Age_50Yes 
##         0.001         1.137         1.009         1.039         0.675
round(exp(cbind(OR = coef(mylogit), confint(mylogit))),3) #odds ratios and 95% CI
## Waiting for profiling to be done...
##                  OR 2.5 % 97.5 %
## (Intercept)   0.001 0.000  0.005
## Pregnancies   1.137 1.077  1.202
## BloodPressure 1.009 0.994  1.025
## Glucose       1.039 1.033  1.047
## Age_50Yes     0.675 0.379  1.188

The results can be interpreted as such:

- Each additional pregnancy is associated with 1.14 times the odds of diabetes (OR = 1.137, 95% CI 1.077 to 1.202), holding the other variables constant.
- Each one-unit increase in glucose is associated with 1.04 times the odds of diabetes (OR = 1.039, 95% CI 1.033 to 1.047), holding the other variables constant.
- The 95% confidence intervals for blood pressure (OR = 1.009, 95% CI 0.994 to 1.025) and age > 50 (OR = 0.675, 95% CI 0.379 to 1.188) include 1, so neither is statistically significant.
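
To connect the fitted model back to the probability formula given earlier, predicted probabilities can be obtained with predict(). The sketch below uses a hypothetical covariate profile (3 pregnancies, blood pressure of 72, glucose of 120, age 50 or younger) chosen purely for illustration:

new_obs <- data.frame(Pregnancies = 3, BloodPressure = 72, Glucose = 120, Age_50 = "No") #hypothetical patient
predict(mylogit, newdata = new_obs, type = "response") #predicted probability of diabetes (roughly 0.27 given the coefficients above)

lp <- sum(coef(mylogit) * c(1, 3, 72, 120, 0)) #linear predictor computed by hand: intercept plus sum of coefficient*value
1 / (1 + exp(-lp)) #same predicted probability via the inverse-logit formula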