Today, we will be focusing on logistic regression. This type of regression is used to to assess the effects of independent variables on a binary outcome. The independent variables have no restrictions in terms of the type -they can be continuous or categorical- and the regression has fewer assumptions compared to linear regression.
#Don't forget to install the packages if it is the first time you use them with the command install.packages("name_of_package")
library(RCurl)
library(readr) #To call data
library(dplyr) #To manipulate data
library(stats) #For logistic regression
library(gtsummary) #For nicer outputs
library(ggplot2)
library(tidymv)
library(sjPlot)
The data for today’s practical is a nationally representative survey about disability of Chilean adult population (II Encuesta Nacional de Discapacidad, in spanish). We will be assessing the effect of some demographic and socioeconomic determinants on disability in this population.
Link for the database (optional):
https://www.senadis.gob.cl/pag/356/1625/base_de_datos
Links for RStudio download and tutorial (optional):
https://www.rstudio.com/products/rstudio/download/
https://www.youtube.com/watch?v=FIrsOBy5k58&ab_channel=HRanalytics101.com
#Loading the data. Can be called directly because I uploaded to Github, an online platform for working with data
data <- read_csv("https://raw.githubusercontent.com/eloluna/to_uch_dic21/main/endisc_lite.csv")
#The data has too many variables. I will only select the variables of interest. The RAM of my computer will thank me
data <- data %>%
select(sexo, edad, ytot, esc, disc_grado_adulto)
#The names are in spanish, so I will change them
data <- data %>%
rename(sex = sexo) %>%
rename(age = edad) %>%
rename(income = ytot) %>%
rename(education = esc) %>%
rename(cat_disability = disc_grado_adulto)
#Inspect the data
head(data)
#Lots of missing. This is because interviewees gave information for the whole household, yet the disability variable was only measure for the interviewee
#For that and to avoid missing data, I will only work with complete cases
data <- subset(data, complete.cases(data))
#Again, I examine the data to see the changes
head(data)
What type of variable is cat_disablity? Is it appropriate for logistic regression? Is it suitable for logistic regression?
table(data$cat_disability)
##
## Discapacidad Leve a Moderada Discapacidad Severa
## 1527 1089
## Sin Discapacidad
## 9633
#I will create a binary variable of disability
data$bin_disability <- NA
data$bin_disability[data$cat_disability == "Discapacidad Leve a Moderada"] <- 1 #Small to moderate disability coded as 1
data$bin_disability[data$cat_disability == "Discapacidad Severa"] <- 1 #Severe disability coded as 1
data$bin_disability[data$cat_disability == "Sin Discapacidad"] <- 0 #No disability coded as 0
data$bin_disability <- factor(data$bin_disability, levels = c(0:1),
labels = c("Without disability",
"With disability"))
typeof(data$sex) #sex is actually a character variable. I need to change this
## [1] "character"
data$sex <- as.factor(data$sex) #Now is categorical
#However, the labels of the levels are in spanish. I'll change them
levels(data$sex) <- c("Male", "Female")
#R can be used as calculator
1089 + 1527
## [1] 2616
table(data$bin_disability)
##
## Without disability With disability
## 9633 2616
How many people are in the sample?
nrow(data)
## [1] 12249
How is disability distributed in the sample?
table(data$bin_disability)
##
## Without disability With disability
## 9633 2616
round(prop.table(table(data$bin_disability))*100,2)
##
## Without disability With disability
## 78.64 21.36
How is age distributed?
#Age
summary(data$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 33.00 48.00 48.36 62.00 107.00
hist(data$age)
How is sex distributed?
table(data$sex)
##
## Male Female
## 5303 6946
round(prop.table(table(data$sex))*100,2)
##
## Male Female
## 43.29 56.71
How is the monthly income distributed?
hist(data$income)
summary(data$income) #Maximum data is an outlier. This is in Chilean pesos. 1 US Dollar = 925 Chilean pesos. Which is the same as 1 Chilean peso = 0.0011 US dollars. Let's transform it!
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 80000 200000 317923 384000 26500000
data$income <- round(data$income*0.0011, 2)
hist(data$income) #same distribution but different scale
#To avoid outliers messing with my data. I will put a maximum of income of US5,000
data$income[data$income > 5000] <- 5000
summary(data$income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 88.0 220.0 344.5 422.4 5000.0
hist(data$income) #In general, income does not have a normal distribution
How is education distributed?
hist(data$education)
round(prop.table(table(as.numeric(data$education)))*100,2)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 2.75 0.86 1.52 3.03 3.00 2.40 6.35 2.42 9.19 3.47 5.14 3.13 27.68
## 13 14 15 16 17 18 19 20 21 22 23 24
## 4.04 4.46 6.52 4.56 6.43 1.35 1.05 0.33 0.08 0.19 0.04 0.02
Interpret the results of the logistic regression table of the association between age and disability.
#On the right, my outcome, on the left explanatory variables
m1 <- glm(bin_disability ~ age, data = data, family = binomial(link = "logit"))
summary(m1)
##
## Call:
## glm(formula = bin_disability ~ age, family = binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4402 -0.7155 -0.5059 -0.3519 2.4070
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.644002 0.079133 -46.05 <2e-16 ***
## age 0.044668 0.001351 33.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12706 on 12248 degrees of freedom
## Residual deviance: 11456 on 12247 degrees of freedom
## AIC: 11460
##
## Number of Fisher Scoring iterations: 4
#The results are showed in logodds. I need to exponentiate them
exp(0.044668)
## [1] 1.045681
#I can calculate the 95% CI
confint(m1)
## 2.5 % 97.5 %
## (Intercept) -3.80035367 -3.49011266
## age 0.04203306 0.04733069
paste0("95% CI ",round(exp(0.04203306), 2), "-", round(exp(0.04733069), 2))
## [1] "95% CI 1.04-1.05"
tbl_regression(m1, exponentiate = T, include = everything()) %>%
add_n()
| Characteristic | N | OR1 | 95% CI1 | p-value |
|---|---|---|---|---|
| age | 12249 | 1.05 | 1.04, 1.05 | <0.001 |
|
1
OR = Odds Ratio, CI = Confidence Interval
|
||||
Interpret the logistic regression output of the association between sex and disability.
m2 <- glm(bin_disability ~ sex, data = data, family = binomial(link = "logit"))
m2
##
## Call: glm(formula = bin_disability ~ sex, family = binomial(link = "logit"),
## data = data)
##
## Coefficients:
## (Intercept) sexFemale
## -1.7235 0.6828
##
## Degrees of Freedom: 12248 Total (i.e. Null); 12247 Residual
## Null Deviance: 12710
## Residual Deviance: 12480 AIC: 12490
tbl_regression(m2, exp = T)
| Characteristic | OR1 | 95% CI1 | p-value |
|---|---|---|---|
| sex | |||
| Male | — | — | |
| Female | 1.98 | 1.81, 2.17 | <0.001 |
|
1
OR = Odds Ratio, CI = Confidence Interval
|
|||
Interpret the logistic regression output of the association between income and disability.
m3 <- glm(bin_disability ~ income, data = data, family = binomial(link = "logit"))
m3
##
## Call: glm(formula = bin_disability ~ income, family = binomial(link = "logit"),
## data = data)
##
## Coefficients:
## (Intercept) income
## -0.982081 -0.001136
##
## Degrees of Freedom: 12248 Total (i.e. Null); 12247 Residual
## Null Deviance: 12710
## Residual Deviance: 12430 AIC: 12430
summary(m3)
##
## Call:
## glm(formula = bin_disability ~ income, family = binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7976 -0.7539 -0.6763 -0.4081 3.6503
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.821e-01 2.976e-02 -33.00 <2e-16 ***
## income -1.136e-03 8.215e-05 -13.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12706 on 12248 degrees of freedom
## Residual deviance: 12430 on 12247 degrees of freedom
## AIC: 12434
##
## Number of Fisher Scoring iterations: 5
tbl_regression(m3, exp = T)
| Characteristic | OR1 | 95% CI1 | p-value |
|---|---|---|---|
| income | 1.00 | 1.00, 1.00 | <0.001 |
|
1
OR = Odds Ratio, CI = Confidence Interval
|
|||
#To check the effect per every 100-dollar increase
exp(-0.001136)^100
## [1] 0.8926149
confint(m3)
## 2.5 % 97.5 %
## (Intercept) -1.04045711 -0.9238103501
## income -0.00129967 -0.0009777109
#95% CI per 100-dollar increase
paste0("95% CI ",round(exp(-0.00129967)^100, 3), "-", round(exp(-0.0009777109)^100, 3))
## [1] "95% CI 0.878-0.907"
data <- within(data, {
cat_income <- NA # need to initialize variable
cat_income[income < 300] <- "Lower than US$300"
cat_income[income >= 300 & income < 600] <- "US$300-599"
cat_income[income >= 600 & income < 900 ] <- "US$600-899"
cat_income[income >= 900] <- "US$900+"
} )
table(data$cat_income)
##
## Lower than US$300 US$300-599 US$600-899 US$900+
## 7644 2863 859 883
round(prop.table(table(data$cat_income))*100,2)
##
## Lower than US$300 US$300-599 US$600-899 US$900+
## 62.41 23.37 7.01 7.21
m3_ <- glm(bin_disability ~ cat_income, data = data, family = binomial(link = "logit"))
m3_
##
## Call: glm(formula = bin_disability ~ cat_income, family = binomial(link = "logit"),
## data = data)
##
## Coefficients:
## (Intercept) cat_incomeUS$300-599 cat_incomeUS$600-899
## -1.016 -0.731 -1.093
## cat_incomeUS$900+
## -1.500
##
## Degrees of Freedom: 12248 Total (i.e. Null); 12245 Residual
## Null Deviance: 12710
## Residual Deviance: 12320 AIC: 12320
summary(m3_)
##
## Call:
## glm(formula = bin_disability ~ cat_income, family = binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7862 -0.7862 -0.5669 -0.3942 2.2776
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.01589 0.02589 -39.238 <2e-16 ***
## cat_incomeUS$300-599 -0.73096 0.05860 -12.475 <2e-16 ***
## cat_incomeUS$600-899 -1.09270 0.11282 -9.685 <2e-16 ***
## cat_incomeUS$900+ -1.50010 0.13056 -11.490 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12706 on 12248 degrees of freedom
## Residual deviance: 12316 on 12245 degrees of freedom
## AIC: 12324
##
## Number of Fisher Scoring iterations: 5
tbl_regression(m3_, exp = T)
| Characteristic | OR1 | 95% CI1 | p-value |
|---|---|---|---|
| cat_income | |||
| Lower than US$300 | — | — | |
| US$300-599 | 0.48 | 0.43, 0.54 | <0.001 |
| US$600-899 | 0.34 | 0.27, 0.42 | <0.001 |
| US$900+ | 0.22 | 0.17, 0.29 | <0.001 |
|
1
OR = Odds Ratio, CI = Confidence Interval
|
|||
Interpret the logistic regression output of the association between education and disability.
m4 <- glm(bin_disability ~ education, data = data, family = binomial(link = "logit"))
tbl_regression(m4, exp = T)
| Characteristic | OR1 | 95% CI1 | p-value |
|---|---|---|---|
| education | 0.86 | 0.85, 0.87 | <0.001 |
|
1
OR = Odds Ratio, CI = Confidence Interval
|
|||
Interpret the output of the mutually adjusted association between age, sex, categorical income and education on disability. Think about the differences between the previous estimates and the ones mutually adjusted
m5 <- glm(bin_disability ~ age + sex + cat_income + education, data = data, family = binomial(link = "logit"))
tbl_regression(m5, exp = T)
| Characteristic | OR1 | 95% CI1 | p-value |
|---|---|---|---|
| age | 1.04 | 1.03, 1.04 | <0.001 |
| sex | |||
| Male | — | — | |
| Female | 1.69 | 1.53, 1.87 | <0.001 |
| cat_income | |||
| Lower than US$300 | — | — | |
| US$300-599 | 0.70 | 0.61, 0.79 | <0.001 |
| US$600-899 | 0.55 | 0.44, 0.70 | <0.001 |
| US$900+ | 0.46 | 0.35, 0.60 | <0.001 |
| education | 0.94 | 0.93, 0.95 | <0.001 |
|
1
OR = Odds Ratio, CI = Confidence Interval
|
|||
Interpret the interaction between sex and income in the logistic regression model. Focus on the variables of the interaction and not on age and education
m6 <- glm(bin_disability ~ age + education + sex*cat_income, data = data, family = binomial(link = "logit"))
tbl_regression(m6, exp = T)
| Characteristic | OR1 | 95% CI1 | p-value |
|---|---|---|---|
| age | 1.04 | 1.03, 1.04 | <0.001 |
| education | 0.94 | 0.93, 0.95 | <0.001 |
| sex | |||
| Male | — | — | |
| Female | 1.54 | 1.37, 1.74 | <0.001 |
| cat_income | |||
| Lower than US$300 | — | — | |
| US$300-599 | 0.58 | 0.48, 0.70 | <0.001 |
| US$600-899 | 0.44 | 0.31, 0.61 | <0.001 |
| US$900+ | 0.45 | 0.31, 0.63 | <0.001 |
| sex * cat_income | |||
| Female * US$300-599 | 1.39 | 1.08, 1.78 | 0.010 |
| Female * US$600-899 | 1.57 | 0.99, 2.50 | 0.054 |
| Female * US$900+ | 1.02 | 0.59, 1.72 | >0.9 |
|
1
OR = Odds Ratio, CI = Confidence Interval
|
|||