Overview of method

Today, we will be focusing on logistic regression. This type of regression is used to to assess the effects of independent variables on a binary outcome. The independent variables have no restrictions in terms of the type -they can be continuous or categorical- and the regression has fewer assumptions compared to linear regression.

Objectives

Examine data
Interpret the results of a logistic regression
Be able to identify confounding a in multivariable logistic regression.
Be able to identify effect modification in multivariable logistic regression.

#Don't forget to install the packages if it is the first time you use them with the command install.packages("name_of_package")
library(RCurl)
library(readr) #To call data
library(dplyr) #To manipulate data
library(stats) #For logistic regression
library(gtsummary) #For nicer outputs
library(ggplot2)
library(tidymv)
library(sjPlot)

0. Data examination

The data for today’s practical is a nationally representative survey about disability of Chilean adult population (II Encuesta Nacional de Discapacidad, in spanish). We will be assessing the effect of some demographic and socioeconomic determinants on disability in this population.

Link for the database (optional):

https://www.senadis.gob.cl/pag/356/1625/base_de_datos

Links for RStudio download and tutorial (optional):

https://www.rstudio.com/products/rstudio/download/

https://www.youtube.com/watch?v=FIrsOBy5k58&ab_channel=HRanalytics101.com

0.1 Data preparation

#Loading the data. Can be called directly because I uploaded to Github, an online platform for working with data
data <- read_csv("https://raw.githubusercontent.com/eloluna/to_uch_dic21/main/endisc_lite.csv")

#The data has too many variables. I will only select the variables of interest. The RAM of my computer will thank me
data <- data %>% 
  select(sexo, edad, ytot, esc, disc_grado_adulto)

#The names are in spanish, so I will change them
data <- data %>% 
  rename(sex = sexo) %>% 
  rename(age = edad) %>% 
  rename(income = ytot) %>% 
  rename(education = esc) %>% 
  rename(cat_disability = disc_grado_adulto)

0.1 Data transformation

#Inspect the data
head(data)

#Lots of missing. This is because interviewees gave information for the whole household, yet the disability variable was only measure for the interviewee

#For that and to avoid missing data, I will only work with complete cases
data <- subset(data, complete.cases(data))

#Again, I examine the data to see the changes
head(data)

What type of variable is cat_disablity? Is it appropriate for logistic regression? Is it suitable for logistic regression?

table(data$cat_disability)

## 
## Discapacidad Leve a Moderada          Discapacidad Severa 
##                         1527                         1089 
##             Sin Discapacidad 
##                         9633

#I will create a binary variable of disability
data$bin_disability <- NA

data$bin_disability[data$cat_disability == "Discapacidad Leve a Moderada"] <- 1 #Small to moderate disability coded as 1

data$bin_disability[data$cat_disability == "Discapacidad Severa"] <- 1 #Severe disability coded as 1

data$bin_disability[data$cat_disability == "Sin Discapacidad"] <- 0 #No disability coded as 0

data$bin_disability <- factor(data$bin_disability, levels = c(0:1),
                              labels = c("Without disability",
                                        "With disability"))

typeof(data$sex) #sex is actually a character variable. I need to change this

## [1] "character"

data$sex <- as.factor(data$sex) #Now is categorical

#However, the labels of the levels are in spanish. I'll change them
levels(data$sex) <- c("Male", "Female")

#R can be used as calculator
1089 + 1527

## [1] 2616

table(data$bin_disability)

## 
## Without disability    With disability 
##               9633               2616

0.3 Data examination

How many people are in the sample?

nrow(data)

## [1] 12249

A: There are 12,249 people in this sample. There are no missing data so all those people have information on our outcome -disability- and our explanatory variables.

How is disability distributed in the sample?

table(data$bin_disability)

## 
## Without disability    With disability 
##               9633               2616

round(prop.table(table(data$bin_disability))*100,2)

## 
## Without disability    With disability 
##              78.64              21.36

A: There are 2,616 people with disability in this sample. This represents a 21.36% of the sample who have some degree of disability.

How is age distributed?

#Age
summary(data$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   33.00   48.00   48.36   62.00  107.00

hist(data$age)

A: Sex seems to be normally distributed in this sample. The left tail of the histogram does not give that impression, however, the sample is exclusively of adult population, which explains why we don’t see a tail that abruptly ends. When we focused on the summary measures, we can see that the median and the mean are very similar. A good indicator of normal distribution.

How is sex distributed?

table(data$sex)

## 
##   Male Female 
##   5303   6946

round(prop.table(table(data$sex))*100,2)

## 
##   Male Female 
##  43.29  56.71

A: There are 5,303 and 6,946 males and females in this sample, respectively. The proportion of females in the sample is 56.7%.

How is the monthly income distributed?

hist(data$income)

summary(data$income) #Maximum data is an outlier. This is in Chilean pesos. 1 US Dollar = 925 Chilean pesos. Which is the same as 1 Chilean peso = 0.0011 US dollars. Let's transform it!

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0    80000   200000   317923   384000 26500000

data$income <- round(data$income*0.0011, 2)
hist(data$income) #same distribution but different scale

#To avoid outliers messing with my data. I will put a maximum of income of US5,000
data$income[data$income > 5000] <- 5000
summary(data$income)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    88.0   220.0   344.5   422.4  5000.0

hist(data$income) #In general, income does not have a normal distribution

A: Income is not normally distributed in this sample. Most people earn a small amount of monthly income and a small proportion of participants have a high monthly income. If we look at the summary measures, we can see that the median is very different from the mean, even after introducing an income cap in our data. This is -statistically- normal because this variable tends to have this type of distribution.

How is education distributed?

hist(data$education)

round(prop.table(table(as.numeric(data$education)))*100,2)

## 
##     0     1     2     3     4     5     6     7     8     9    10    11    12 
##  2.75  0.86  1.52  3.03  3.00  2.40  6.35  2.42  9.19  3.47  5.14  3.13 27.68 
##    13    14    15    16    17    18    19    20    21    22    23    24 
##  4.04  4.46  6.52  4.56  6.43  1.35  1.05  0.33  0.08  0.19  0.04  0.02

A: Most people in this sample have 12 years of education. This makes sense as this is the number of mandatory years of education in the country. People who have more are mostly people who went to university, and people who have less are people who haven’t finished their mandatory education or dropped out.

1. Interpret a logistic model

1.1 Univariate regression with age

Interpret the results of the logistic regression table of the association between age and disability.

#On the right, my outcome, on the left explanatory variables
m1 <- glm(bin_disability ~ age, data = data, family = binomial(link = "logit"))

summary(m1)

## 
## Call:
## glm(formula = bin_disability ~ age, family = binomial(link = "logit"), 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4402  -0.7155  -0.5059  -0.3519   2.4070  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.644002   0.079133  -46.05   <2e-16 ***
## age          0.044668   0.001351   33.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12706  on 12248  degrees of freedom
## Residual deviance: 11456  on 12247  degrees of freedom
## AIC: 11460
## 
## Number of Fisher Scoring iterations: 4

#The results are showed in logodds. I need to exponentiate them
exp(0.044668)

## [1] 1.045681

#I can calculate the 95% CI
confint(m1)

##                   2.5 %      97.5 %
## (Intercept) -3.80035367 -3.49011266
## age          0.04203306  0.04733069

paste0("95% CI ",round(exp(0.04203306), 2), "-", round(exp(0.04733069), 2))

## [1] "95% CI 1.04-1.05"

tbl_regression(m1, exponentiate = T, include = everything()) %>% 
  add_n()

Characteristic	N	OR¹	95% CI¹	p-value
age	12249	1.05	1.04, 1.05	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

A: 1-year increase in age is associated with a increase of the odds of disability by a factor of 1.05 (95% CI 1.04-1.05). The CI and the p-value suggest strong evidence to say that this is a relevant explanatory variable.

1.2 Univariate regression with sex

Interpret the logistic regression output of the association between sex and disability.

m2 <- glm(bin_disability ~ sex, data = data, family = binomial(link = "logit"))

m2

## 
## Call:  glm(formula = bin_disability ~ sex, family = binomial(link = "logit"), 
##     data = data)
## 
## Coefficients:
## (Intercept)    sexFemale  
##     -1.7235       0.6828  
## 
## Degrees of Freedom: 12248 Total (i.e. Null);  12247 Residual
## Null Deviance:       12710 
## Residual Deviance: 12480     AIC: 12490

tbl_regression(m2, exp = T)

Characteristic	OR¹	95% CI¹	p-value
sex
Male	—	—
Female	1.98	1.81, 2.17	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

A: Compared to males, females have 1.98 times the odds (95% CI 1.81-2.17) of having a disability. Again, the CI and p-value suggests that sex is a strong predictor of disability in this POPULATION.

1.3 Univariate regression with income

Interpret the logistic regression output of the association between income and disability.

m3 <- glm(bin_disability ~ income, data = data, family = binomial(link = "logit"))
m3

## 
## Call:  glm(formula = bin_disability ~ income, family = binomial(link = "logit"), 
##     data = data)
## 
## Coefficients:
## (Intercept)       income  
##   -0.982081    -0.001136  
## 
## Degrees of Freedom: 12248 Total (i.e. Null);  12247 Residual
## Null Deviance:       12710 
## Residual Deviance: 12430     AIC: 12430

summary(m3)

## 
## Call:
## glm(formula = bin_disability ~ income, family = binomial(link = "logit"), 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7976  -0.7539  -0.6763  -0.4081   3.6503  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.821e-01  2.976e-02  -33.00   <2e-16 ***
## income      -1.136e-03  8.215e-05  -13.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12706  on 12248  degrees of freedom
## Residual deviance: 12430  on 12247  degrees of freedom
## AIC: 12434
## 
## Number of Fisher Scoring iterations: 5

tbl_regression(m3, exp = T)

Characteristic	OR¹	95% CI¹	p-value
income	1.00	1.00, 1.00	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

#To check the effect per every 100-dollar increase
exp(-0.001136)^100

## [1] 0.8926149

confint(m3)

##                   2.5 %        97.5 %
## (Intercept) -1.04045711 -0.9238103501
## income      -0.00129967 -0.0009777109

#95% CI per 100-dollar increase
paste0("95% CI ",round(exp(-0.00129967)^100, 3), "-", round(exp(-0.0009777109)^100, 3))

## [1] "95% CI 0.878-0.907"

In this output I don’t see an effect in the estimate, even if I have a very low p-value. Why is this? Because the results show the effect per 1-dollar increase in income and this is rounded to 2 decimals. The effect is there but is very small so we can’t see it. There are two ways to check the actual effect, one is to to obtain the effect per 10-dollar increase or 100-dollar increase, which will show us a larger effect. Another option is to create a categorical variable of income.

data <- within(data, {   
  cat_income <- NA # need to initialize variable
  cat_income[income < 300] <- "Lower than US$300"
  cat_income[income >= 300 & income < 600] <- "US$300-599"
  cat_income[income >= 600 & income < 900 ] <- "US$600-899"
  cat_income[income >= 900] <- "US$900+"
   } )

table(data$cat_income)

## 
## Lower than US$300        US$300-599        US$600-899           US$900+ 
##              7644              2863               859               883

round(prop.table(table(data$cat_income))*100,2)

## 
## Lower than US$300        US$300-599        US$600-899           US$900+ 
##             62.41             23.37              7.01              7.21

m3_ <- glm(bin_disability ~ cat_income, data = data, family = binomial(link = "logit"))
m3_

## 
## Call:  glm(formula = bin_disability ~ cat_income, family = binomial(link = "logit"), 
##     data = data)
## 
## Coefficients:
##          (Intercept)  cat_incomeUS$300-599  cat_incomeUS$600-899  
##               -1.016                -0.731                -1.093  
##    cat_incomeUS$900+  
##               -1.500  
## 
## Degrees of Freedom: 12248 Total (i.e. Null);  12245 Residual
## Null Deviance:       12710 
## Residual Deviance: 12320     AIC: 12320

summary(m3_)

## 
## Call:
## glm(formula = bin_disability ~ cat_income, family = binomial(link = "logit"), 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7862  -0.7862  -0.5669  -0.3942   2.2776  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -1.01589    0.02589 -39.238   <2e-16 ***
## cat_incomeUS$300-599 -0.73096    0.05860 -12.475   <2e-16 ***
## cat_incomeUS$600-899 -1.09270    0.11282  -9.685   <2e-16 ***
## cat_incomeUS$900+    -1.50010    0.13056 -11.490   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12706  on 12248  degrees of freedom
## Residual deviance: 12316  on 12245  degrees of freedom
## AIC: 12324
## 
## Number of Fisher Scoring iterations: 5

tbl_regression(m3_, exp = T)

Characteristic	OR¹	95% CI¹	p-value
cat_income
Lower than US$300	—	—
US$300-599	0.48	0.43, 0.54	<0.001
US$600-899	0.34	0.27, 0.42	<0.001
US$900+	0.22	0.17, 0.29	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

A: The categorical variable of income allows us to see larger effects. Compared to the reference group -those earning less than US$300 monthly-, all higher categories of monthly income show lower odds of having a disability.

1.3 Univariate regression with education

Interpret the logistic regression output of the association between education and disability.

m4 <- glm(bin_disability ~ education, data = data, family = binomial(link = "logit"))

tbl_regression(m4, exp = T)

Characteristic	OR¹	95% CI¹	p-value
education	0.86	0.85, 0.87	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

A: 1-year increase in education is associated with lower odds of disability by a factor of 0.86 (95% CI 0.85-0.87). Both the CI and p-value provides strong evidence to say that education is a relevant explanatory variable of that decreases the odds of disability in the chilean adult population.

2. Multivariable logistic regression

Interpret the output of the mutually adjusted association between age, sex, categorical income and education on disability. Think about the differences between the previous estimates and the ones mutually adjusted

A: Adjusting for sex, income and education, 1-year increase in age is associated with 1.04 times the odds (95% CI 1.03-1.04) of having a disability.

As for sex, holding all variables constant, females have 1.69 times the odds (95% CI 1.53-1.87) of having disability compared to males.

In the case of income, those earning between US$300-599, US$600-899 and more than US$900 have 0.70 (95% CI 0.61-0.79), 0.55 (95% CI 0.44, 0.70) and 0.46 (95% CI 0.35, 0.60) times the odds, respectively, to have a disability compared to the reference category, those who earn less than US$300, adjusted by sex, age, and education. We can see a gradient between the categories of income and disability, a dose-response, the higher the income, the lower the odds of having a disability.

Lastly, 1-year increase in education is associated with lower odds of having a disability by a factor of 0.94 (95% CI 0.93, 0.95), holding all other variables constant.

We can see that the estimated ORs in this model were closer to one -a smaller effect- than the previous models. This is because we adjusted for confounding that explained some of the effect of the different variables on disability.

m5 <- glm(bin_disability ~ age + sex + cat_income + education, data = data, family = binomial(link = "logit"))

tbl_regression(m5, exp = T)

Characteristic	OR¹	95% CI¹	p-value
age	1.04	1.03, 1.04	<0.001
sex
Male	—	—
Female	1.69	1.53, 1.87	<0.001
cat_income
Lower than US$300	—	—
US$300-599	0.70	0.61, 0.79	<0.001
US$600-899	0.55	0.44, 0.70	<0.001
US$900+	0.46	0.35, 0.60	<0.001
education	0.94	0.93, 0.95	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

3.Testing interaction

Interpret the interaction between sex and income in the logistic regression model. Focus on the variables of the interaction and not on age and education

3.1 Effect of income

Among males -the baseline of sex- the OR of the effect of income on disability is 0.58, 0.44 and 0.45 for those earning US$300-599, US$600-899 and more than US$900, respectively, compared to the reference category of income, adjusting for age and education.

Among females, on the other hand, the effect of income on disability is different because of the interaction. the OR of the effect of income on disability is 0.81 (0.58x1.39), 0.69 (0.44x1.57) and 0.46 (0.45x1.02) (although this interaction was not relevant) for those earning US$300-599, US$600-899 and more than US$900, respectively, compared to the reference category of income, adjusting for age and education.

3.2 Effect of sex

Among those in the baseline category of income -people who earn less than US$300-, being female is associated with 1.54 times the odds (95% CI 1.37-1.74) of having a disability compared to males, adjusting for age and education.

Among those in the second category of income -those earning US$300-599-, being female is associated with 2.14 (1.54*1.39)times the odds of having a disability compared to males, adjusting for age and education.

Among those in the third category of income -those earning US$600-899-, being female is associated with 2.41 (1.54*1.57) times the odds of having a disability compared to males, adjusting for age and education.

Among those in the highest category of income -those earning more than US$900-, being female is associated with 1.57 (1.54*1.02) times the odds of having disability compared to males, adjusting for age and education.

m6 <- glm(bin_disability ~ age + education + sex*cat_income, data = data, family = binomial(link = "logit"))

tbl_regression(m6, exp = T)

Characteristic	OR¹	95% CI¹	p-value
age	1.04	1.03, 1.04	<0.001
education	0.94	0.93, 0.95	<0.001
sex
Male	—	—
Female	1.54	1.37, 1.74	<0.001
cat_income
Lower than US$300	—	—
US$300-599	0.58	0.48, 0.70	<0.001
US$600-899	0.44	0.31, 0.61	<0.001
US$900+	0.45	0.31, 0.63	<0.001
sex * cat_income
Female * US$300-599	1.39	1.08, 1.78	0.010
Female * US$600-899	1.57	0.99, 2.50	0.054
Female * US$900+	1.02	0.59, 1.72	>0.9
¹ OR = Odds Ratio, CI = Confidence Interval

Practical 4: Logistic Regression with answers