Overview of method

Today, we will be focusing on logistic regression. This type of regression is used to to assess the effects of independent variables on a binary outcome. The independent variables have no restrictions in terms of the type -they can be continuous or categorical- and the regression has fewer assumptions compared to linear regression.

Objectives

Examine data
Interpret the results of a logistic regression
Be able to identify confounding in a multivariable logistic regression
Be able to identify effect modification in a multivariable logistic regression

#Don't forget to install the packages if it is the first time you use them with the command install.packages("name_of_package")
library(RCurl)
library(readr) #To call data
library(dplyr) #To manipulate data
library(stats) #For logistic regression
library(gtsummary) #For nicer outputs
library(ggplot2)
library(tidymv)
library(sjPlot)

0. Data examination

The data for today’s practical is a nationally representative survey about disability of Chilean adult population (II Encuesta Nacional de Discapacidad, in spanish). We will be assessing the effect of some demographic and socioeconomic determinants on disability in this population.

Link for the database (optional):

https://www.senadis.gob.cl/pag/356/1625/base_de_datos

Links for RStudio download and tutorial (optional):

https://www.rstudio.com/products/rstudio/download/

https://www.youtube.com/watch?v=FIrsOBy5k58&ab_channel=HRanalytics101.com

0.1 Data preparation

#Loading the data. Can be called directly because I uploaded to Github, an online platform for working with data
data <- read_csv("https://raw.githubusercontent.com/eloluna/to_uch_dic21/main/endisc_lite.csv")

#The data has too many variables. I will only select the variables of interest. The RAM of my computer will thank me
data <- data %>% 
  select(sexo, edad, ytot, esc, disc_grado_adulto)

#The names are in spanish, so I will change them
data <- data %>% 
  rename(sex = sexo) %>% 
  rename(age = edad) %>% 
  rename(income = ytot) %>% 
  rename(education = esc) %>% 
  rename(cat_disability = disc_grado_adulto)

0.1 Data transformation

#Inspect the data
head(data)

#Lots of missing. This is because interviewees gave information for the whole household, yet the disability variable was only measure for the interviewee

#For that and to avoid missing data, I will only work with complete cases
data <- subset(data, complete.cases(data))

#Again, I examine the data to see the changes
head(data)

What type of variable is cat_disablity? Is it appropriate for logistic regression? Is it suitable for logistic regression?

table(data$cat_disability)

## 
## Discapacidad Leve a Moderada          Discapacidad Severa 
##                         1527                         1089 
##             Sin Discapacidad 
##                         9633

#I will create a binary variable of disability
data$bin_disability <- NA

data$bin_disability[data$cat_disability == "Discapacidad Leve a Moderada"] <- 1 #Small to moderate disability coded as 1

data$bin_disability[data$cat_disability == "Discapacidad Severa"] <- 1 #Severe disability coded as 1

data$bin_disability[data$cat_disability == "Sin Discapacidad"] <- 0 #No disability coded as 0

data$bin_disability <- factor(data$bin_disability, levels = c(0:1),
                              labels = c("Without disability",
                                        "With disability"))

typeof(data$sex) #sex is actually a character variable. I need to change this

## [1] "character"

data$sex <- as.factor(data$sex) #Now is categorical

#However, the labels of the levels are in spanish. I'll change them
levels(data$sex) <- c("Male", "Female")

#R can be used as calculator
1089 + 1527

## [1] 2616

table(data$bin_disability)

## 
## Without disability    With disability 
##               9633               2616

0.3 Data examination

How many people are in the sample?

nrow(data)

## [1] 12249

How is disability distributed in the sample?

table(data$bin_disability)

## 
## Without disability    With disability 
##               9633               2616

round(prop.table(table(data$bin_disability))*100,2)

## 
## Without disability    With disability 
##              78.64              21.36

How is age distributed?

#Age
summary(data$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   33.00   48.00   48.36   62.00  107.00

hist(data$age)

How is sex distributed?

table(data$sex)

## 
##   Male Female 
##   5303   6946

round(prop.table(table(data$sex))*100,2)

## 
##   Male Female 
##  43.29  56.71

How is the monthly income distributed?

hist(data$income)

summary(data$income) #Maximum data is an outlier. This is in Chilean pesos. 1 US Dollar = 925 Chilean pesos. Which is the same as 1 Chilean peso = 0.0011 US dollars. Let's transform it!

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0    80000   200000   317923   384000 26500000

data$income <- round(data$income*0.0011, 2)
hist(data$income) #same distribution but different scale

#To avoid outliers messing with my data. I will put a maximum of income of US5,000
data$income[data$income > 5000] <- 5000
summary(data$income)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    88.0   220.0   344.5   422.4  5000.0

hist(data$income) #In general, income does not have a normal distribution

How is education distributed?

hist(data$education)

round(prop.table(table(as.numeric(data$education)))*100,2)

## 
##     0     1     2     3     4     5     6     7     8     9    10    11    12 
##  2.75  0.86  1.52  3.03  3.00  2.40  6.35  2.42  9.19  3.47  5.14  3.13 27.68 
##    13    14    15    16    17    18    19    20    21    22    23    24 
##  4.04  4.46  6.52  4.56  6.43  1.35  1.05  0.33  0.08  0.19  0.04  0.02

1. Interpret a logistic model

1.1 Univariate regression with age

Interpret the results of the logistic regression table of the association between age and disability.

#On the right, my outcome, on the left explanatory variables
m1 <- glm(bin_disability ~ age, data = data, family = binomial(link = "logit"))

summary(m1)

## 
## Call:
## glm(formula = bin_disability ~ age, family = binomial(link = "logit"), 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4402  -0.7155  -0.5059  -0.3519   2.4070  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.644002   0.079133  -46.05   <2e-16 ***
## age          0.044668   0.001351   33.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12706  on 12248  degrees of freedom
## Residual deviance: 11456  on 12247  degrees of freedom
## AIC: 11460
## 
## Number of Fisher Scoring iterations: 4

#The results are showed in logodds. I need to exponentiate them
exp(0.044668)

## [1] 1.045681

#I can calculate the 95% CI
confint(m1)

##                   2.5 %      97.5 %
## (Intercept) -3.80035367 -3.49011266
## age          0.04203306  0.04733069

paste0("95% CI ",round(exp(0.04203306), 2), "-", round(exp(0.04733069), 2))

## [1] "95% CI 1.04-1.05"

tbl_regression(m1, exponentiate = T, include = everything()) %>% 
  add_n()

Characteristic	N	OR¹	95% CI¹	p-value
age	12249	1.05	1.04, 1.05	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

1.2 Univariate regression with sex

Interpret the logistic regression output of the association between sex and disability.

m2 <- glm(bin_disability ~ sex, data = data, family = binomial(link = "logit"))

m2

## 
## Call:  glm(formula = bin_disability ~ sex, family = binomial(link = "logit"), 
##     data = data)
## 
## Coefficients:
## (Intercept)    sexFemale  
##     -1.7235       0.6828  
## 
## Degrees of Freedom: 12248 Total (i.e. Null);  12247 Residual
## Null Deviance:       12710 
## Residual Deviance: 12480     AIC: 12490

tbl_regression(m2, exp = T)

Characteristic	OR¹	95% CI¹	p-value
sex
Male	—	—
Female	1.98	1.81, 2.17	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

1.3 Univariate regression with income

Interpret the logistic regression output of the association between income and disability.

m3 <- glm(bin_disability ~ income, data = data, family = binomial(link = "logit"))
m3

## 
## Call:  glm(formula = bin_disability ~ income, family = binomial(link = "logit"), 
##     data = data)
## 
## Coefficients:
## (Intercept)       income  
##   -0.982081    -0.001136  
## 
## Degrees of Freedom: 12248 Total (i.e. Null);  12247 Residual
## Null Deviance:       12710 
## Residual Deviance: 12430     AIC: 12430

summary(m3)

## 
## Call:
## glm(formula = bin_disability ~ income, family = binomial(link = "logit"), 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7976  -0.7539  -0.6763  -0.4081   3.6503  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.821e-01  2.976e-02  -33.00   <2e-16 ***
## income      -1.136e-03  8.215e-05  -13.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12706  on 12248  degrees of freedom
## Residual deviance: 12430  on 12247  degrees of freedom
## AIC: 12434
## 
## Number of Fisher Scoring iterations: 5

tbl_regression(m3, exp = T)

Characteristic	OR¹	95% CI¹	p-value
income	1.00	1.00, 1.00	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

#To check the effect per every 100-dollar increase
exp(-0.001136)^100

## [1] 0.8926149

confint(m3)

##                   2.5 %        97.5 %
## (Intercept) -1.04045711 -0.9238103501
## income      -0.00129967 -0.0009777109

#95% CI per 100-dollar increase
paste0("95% CI ",round(exp(-0.00129967)^100, 3), "-", round(exp(-0.0009777109)^100, 3))

## [1] "95% CI 0.878-0.907"

data <- within(data, {   
  cat_income <- NA # need to initialize variable
  cat_income[income < 300] <- "Lower than US$300"
  cat_income[income >= 300 & income < 600] <- "US$300-599"
  cat_income[income >= 600 & income < 900 ] <- "US$600-899"
  cat_income[income >= 900] <- "US$900+"
   } )

table(data$cat_income)

## 
## Lower than US$300        US$300-599        US$600-899           US$900+ 
##              7644              2863               859               883

round(prop.table(table(data$cat_income))*100,2)

## 
## Lower than US$300        US$300-599        US$600-899           US$900+ 
##             62.41             23.37              7.01              7.21

m3_ <- glm(bin_disability ~ cat_income, data = data, family = binomial(link = "logit"))
m3_

## 
## Call:  glm(formula = bin_disability ~ cat_income, family = binomial(link = "logit"), 
##     data = data)
## 
## Coefficients:
##          (Intercept)  cat_incomeUS$300-599  cat_incomeUS$600-899  
##               -1.016                -0.731                -1.093  
##    cat_incomeUS$900+  
##               -1.500  
## 
## Degrees of Freedom: 12248 Total (i.e. Null);  12245 Residual
## Null Deviance:       12710 
## Residual Deviance: 12320     AIC: 12320

summary(m3_)

## 
## Call:
## glm(formula = bin_disability ~ cat_income, family = binomial(link = "logit"), 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7862  -0.7862  -0.5669  -0.3942   2.2776  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -1.01589    0.02589 -39.238   <2e-16 ***
## cat_incomeUS$300-599 -0.73096    0.05860 -12.475   <2e-16 ***
## cat_incomeUS$600-899 -1.09270    0.11282  -9.685   <2e-16 ***
## cat_incomeUS$900+    -1.50010    0.13056 -11.490   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12706  on 12248  degrees of freedom
## Residual deviance: 12316  on 12245  degrees of freedom
## AIC: 12324
## 
## Number of Fisher Scoring iterations: 5

tbl_regression(m3_, exp = T)

Characteristic	OR¹	95% CI¹	p-value
cat_income
Lower than US$300	—	—
US$300-599	0.48	0.43, 0.54	<0.001
US$600-899	0.34	0.27, 0.42	<0.001
US$900+	0.22	0.17, 0.29	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

1.3 Univariate regression with education

Interpret the logistic regression output of the association between education and disability.

m4 <- glm(bin_disability ~ education, data = data, family = binomial(link = "logit"))

tbl_regression(m4, exp = T)

Characteristic	OR¹	95% CI¹	p-value
education	0.86	0.85, 0.87	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

2. Multivariable logistic regression

Interpret the output of the mutually adjusted association between age, sex, categorical income and education on disability. Think about the differences between the previous estimates and the ones mutually adjusted

m5 <- glm(bin_disability ~ age + sex + cat_income + education, data = data, family = binomial(link = "logit"))

tbl_regression(m5, exp = T)

Characteristic	OR¹	95% CI¹	p-value
age	1.04	1.03, 1.04	<0.001
sex
Male	—	—
Female	1.69	1.53, 1.87	<0.001
cat_income
Lower than US$300	—	—
US$300-599	0.70	0.61, 0.79	<0.001
US$600-899	0.55	0.44, 0.70	<0.001
US$900+	0.46	0.35, 0.60	<0.001
education	0.94	0.93, 0.95	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

3. Testing interaction

Interpret the interaction between sex and income in the logistic regression model. Focus on the variables of the interaction and not on age and education

3.1 Effect of income

3.2 Effect of sex

m6 <- glm(bin_disability ~ age + education + sex*cat_income, data = data, family = binomial(link = "logit"))

tbl_regression(m6, exp = T)

Characteristic	OR¹	95% CI¹	p-value
age	1.04	1.03, 1.04	<0.001
education	0.94	0.93, 0.95	<0.001
sex
Male	—	—
Female	1.54	1.37, 1.74	<0.001
cat_income
Lower than US$300	—	—
US$300-599	0.58	0.48, 0.70	<0.001
US$600-899	0.44	0.31, 0.61	<0.001
US$900+	0.45	0.31, 0.63	<0.001
sex * cat_income
Female * US$300-599	1.39	1.08, 1.78	0.010
Female * US$600-899	1.57	0.99, 2.50	0.054
Female * US$900+	1.02	0.59, 1.72	>0.9
¹ OR = Odds Ratio, CI = Confidence Interval

Practical 4: Logistic Regression