Logistic regression

Background

The data: NSDUH (2019)

The teaching dataset nsduh includes a random subset of 1000 observations of adults, and variables that have been renamed for clarity. Sampling was done with replacement using sampling weights in order to approximate a nationally representative distribution. This sampling method is solely for the purpose of creating a teaching dataset to illustrate regression methods. The National Survey on Drug Use and Health (NSDUH), a product of the Substance Abuse and Mental Health Services Administration (SAMHSA) under the U.S. Department of Health and Human Services, measures the use of illegal substances, the use and misuse of prescribed substances, substance use disorder and treatment, and mental health outcomes ( Substance Abuse and Mental Health Services Administration 2022). Downloadable data and documentation are freely available from SAMHSA (U.S. Department of Health and Human Services, Substance Abuse and Mental Health Services Administration, Center for Behavioral Health Statistics and Quality 2019) for research and statistical purposes. Documentation for the 2019 data can be found in the 2019 NSDUH Public Use File Codebook. See also Policies.

Exercise 1

Create a table1 and use GGally::ggpairs to explore your data.

# Write your code here
table1(~ demog_sex + alc_agefirst + demog_age_cat6 + demog_income | mj_lifetime, data=nsduh, overall=F, caption = "Demographics of marijuana use")

Demographics of marijuana use
	No (N=491)	Yes (N=509)
demog_sex
Female	285 (58.0%)	249 (48.9%)
Male	206 (42.0%)	260 (51.1%)
alc_agefirst
Mean (SD)	19.7 (5.09)	16.0 (3.08)
Median [Min, Max]	19.0 [3.00, 45.0]	16.0 [3.00, 29.0]
Missing	145 (29.5%)	12 (2.4%)
demog_age_cat6
18-25	53 (10.8%)	80 (15.7%)
26-34	52 (10.6%)	87 (17.1%)
35-49	129 (26.3%)	136 (26.7%)
50-64	109 (22.2%)	132 (25.9%)
65+	148 (30.1%)	74 (14.5%)
demog_income
Less than $20,000	76 (15.5%)	85 (16.7%)
$20,000 - $49,999	166 (33.8%)	127 (25.0%)
$50,000 - $74,999	64 (13.0%)	81 (15.9%)
$75,000 or more	185 (37.7%)	216 (42.4%)

# Write your code here
GGally::ggpairs(nsduh, columns = c(1,3:6), title = "Marijuana use", 
        upper = list(combo = "box_no_facet", discrete = "facetbar",  
                na = "na"),
        lower = list(combo = "dot_no_facet", discrete = "colbar",
                  na = "na"),
        mapping = ggplot2::aes(colour = mj_lifetime), 
        columnLabels = c("Marijuana use (lifetime)", "Age of first alcohol use", "Age", "Sex", "Income"),
        legend = 1,
        )

## Warning: Removed 157 rows containing non-finite values (`stat_boxplot()`).

## Warning: Removed 157 rows containing missing values (`geom_point()`).

## Warning: Removed 157 rows containing non-finite values (`stat_density()`).

## Warning: Removed 157 rows containing non-finite values (`stat_boxplot()`).
## Removed 157 rows containing non-finite values (`stat_boxplot()`).
## Removed 157 rows containing non-finite values (`stat_boxplot()`).

## Warning: Removed 157 rows containing missing values (`geom_point()`).
## Removed 157 rows containing missing values (`geom_point()`).
## Removed 157 rows containing missing values (`geom_point()`).

Exercise 2

Use the contingency table below to calculate (a) probability of lifetime marijuana among sample subjects (independent of sex), among sample females and among sample males, (b) Odds of lifetime marijuana use among sample subjects, among sample females and among sample males. (c) What is the risk difference, the risk ratio and the odds ratio when comparing lifetime marijuana use among males and females in our sample?

# Showing absolute numbers
nsduh %>% 
  tabyl(demog_sex, mj_lifetime) %>% 
  adorn_title()

# Showing percentages
nsduh %>% 
  tabyl(demog_sex, mj_lifetime) %>% 
  adorn_percentages() %>% 
  adorn_pct_formatting() %>% 
  adorn_title()

# Replace the blanks below with appropriate values
tribble(
  ~Prediction, ~General, ~Female, ~Male, 
  "Probability", 0.509, 0.466, 0.558, 
  "Odds",        1.037, 0.874, 1.262, 
  "Log Odds",    0.268, -0.135, 0.233
) %>% gt()

Prediction	General	Female	Male
Probability	0.509	0.466	0.558
Odds	1.037	0.874	1.262
Log Odds	0.268	-0.135	0.233

# Replace the blanks below with appropriate values
# Values for females are subtracted from values for males each time
tribble(
  ~Contrast, ~`Males vs. Females`, 
  "Risk difference", 0.093, 
  "Risk Ratio",      1.251,
  "Odds ratio",      1.446, 
  "Log odds ratio",  0.369
)  %>%  gt()

Contrast	Males vs. Females
Risk difference	0.093
Risk Ratio	1.251
Odds ratio	1.446
Log odds ratio	0.369

Exercise 3

You will now estimate the same measures from the previous question, but this time using two types of regressions: a logistic regression and a regular least squares linear regression. Show your code and present your results in a table adding the confidence intervals (see below). Explain your findings: what are the differences between the three approaches (using a contingency table as in the question above, using logistic regression and using linear regression)? What explains these differences (or lack thereof)? What are the advantages/drawbacks of using one approach over the other?

Note: Use the two null models to calculate probability and odds at the general population level. Use the regression models with a single predictor to calculate the probabilities and odds for males and females as well as risk difference, risk ratio and odds ratio. Create your tables using the following pattern.

#################
# The null model 
#################

# Using logistic regression
mdl.glm.null <- glm(
  mj_lifetime ~ 1, 
  data = nsduh, 
  family = binomial
  )

summary(mdl.glm.null)

## 
## Call:
## glm(formula = mj_lifetime ~ 1, family = binomial, data = nsduh)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.03600    0.06326   0.569    0.569
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1386  on 999  degrees of freedom
## Residual deviance: 1386  on 999  degrees of freedom
## AIC: 1388
## 
## Number of Fisher Scoring iterations: 3

# Log - odds for lifetime marijuana (everyone)
tidy(
  mdl.glm.null, 
    exponentiate = TRUE,
  conf.int = TRUE
  )

# Odds of lifetime marijuana (everyone)
tidy(
  mdl.glm.null, 
  conf.int = TRUE
  )

# Probability of lifetime marijuana (everyone)
tidy(
  mdl.glm.null, 
  conf.int = TRUE
  )  %>%  
  select(estimate, conf.low, conf.high) %>% 
  mutate_all(plogis)

#################
# Now let us use a linear regression
# Notice: we will be violating assumptions. Oh no! 😱
mdl.lm.null <- lm(mj_lifetime.numeric ~ 1, data = nsduh )

# Probability of lifetime marijuana (everyone) using linear regression
tidy(
  mdl.lm.null, 
  conf.int = TRUE
  )

The unadjusted model

# Now use the logistic and linear model, but this time 
# Using the formula `mj_lifetime.numeric ~ demog_sex` 

# The unadjusted model   | logistic 

mdl.glm <-  glm(
 mj_lifetime.numeric ~ demog_sex, 
  data = nsduh, 
  family = binomial
  )

summary(mdl.glm)

## 
## Call:
## glm(formula = mj_lifetime.numeric ~ demog_sex, family = binomial, 
##     data = nsduh)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -0.13504    0.08675  -1.557  0.11954   
## demog_sexMale  0.36784    0.12738   2.888  0.00388 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1386.0  on 999  degrees of freedom
## Residual deviance: 1377.6  on 998  degrees of freedom
## AIC: 1381.6
## 
## Number of Fisher Scoring iterations: 3

# Log - odds for lifetime marijuana (everyone)
tidy(
  mdl.glm, 
  conf.int = TRUE
  )

# Odds of lifetime marijuana (everyone)
tidy(
  mdl.glm, 
  exponentiate = TRUE,
  conf.int = TRUE
  )

# Probability of lifetime marijuana (everyone)
tidy(
  mdl.glm, 
  conf.int = TRUE
  ) %>% 
  select(estimate, conf.low, conf.high)  %>%  
  mutate_all(plogis)

# The unadjusted model | linear regression 
mdl.lm  <-  lm(mj_lifetime.numeric ~ demog_sex, data = nsduh )

# Probability of lifetime marijuana (everyone) using linear regression
tidy(
  mdl.lm, 
  conf.int = TRUE
  )

The confidence intervals

models <-  c("mdl.glm.null", "mdl.lm.null", "mdl.glm", "mdl.lm")

CI_models <- function(CI_results) {
  lapply(CI_results,  function(x) {
    broom::tidy(x, conf.int = TRUE, exponentiate = TRUE) 
  })
}

(CI_models(list(mdl.glm.null, mdl.lm.null, mdl.glm, mdl.lm)))

## [[1]]
## # A tibble: 1 × 7
##   term        estimate std.error statistic p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)     1.04    0.0633     0.569   0.569    0.916      1.17
## 
## [[2]]
## # A tibble: 1 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)     1.66    0.0158      32.2 1.75e-156     1.61      1.72
## 
## [[3]]
## # A tibble: 2 × 7
##   term          estimate std.error statistic p.value conf.low conf.high
##   <chr>            <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)      0.874    0.0867     -1.56 0.120      0.737      1.04
## 2 demog_sexMale    1.44     0.127       2.89 0.00388    1.13       1.86
## 
## [[4]]
## # A tibble: 2 × 7
##   term          estimate std.error statistic  p.value conf.low conf.high
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)       1.59    0.0216     21.6  2.40e-85     1.53      1.66
## 2 demog_sexMale     1.10    0.0316      2.90 3.80e- 3     1.03      1.17

#Can also add comparison = lnor to get the log odds ratios

# Replace the blanks below with appropriate values
tribble(
  ~Prediction, ~General, ~Female, ~Male, 
  "Probability", "0.509 CI [0.478, 0.540]", "0.466 CI [0.423, 0.509]", "0.558 CI [0.495, 0.621]", 
  "Odds",        "1.036 CI [0.916, 1.174]", "0.874 CI [0.735, 1.039]", "1.262 CI [0.978, 1.629]", 
  "Log Odds",    "0.036 CI [-0.087, 0.160]", "-0.135 CI [-0.309, 0.038]", "0.233 CI [-0.022, 0.488]"
) %>%  gt()

Prediction	General	Female	Male
Probability	0.509 CI [0.478, 0.540]	0.466 CI [0.423, 0.509]	0.558 CI [0.495, 0.621]
Odds	1.036 CI [0.916, 1.174]	0.874 CI [0.735, 1.039]	1.262 CI [0.978, 1.629]
Log Odds	0.036 CI [-0.087, 0.160]	-0.135 CI [-0.309, 0.038]	0.233 CI [-0.022, 0.488]

# Replace the blanks below with appropriate values; 
# values for females are subtracted from values for males each time
tribble(
  ~Contrast, ~`Males vs. Females`, 
  "Risk difference", "0.092 CI [0.030, 0.154]",
  "Risk Ratio",      "1.2 CI [1.05, 1.34]", 
  "Odds ratio",      "1.445 CI [1.126, 1.855]", 
  "Log odds ratio",  "0.367 CI [0.119, 0.618]"
) %>%  gt()

Contrast	Males vs. Females
Risk difference	0.092 CI [0.030, 0.154]
Risk Ratio	1.2 CI [1.05, 1.34]
Odds ratio	1.445 CI [1.126, 1.855]
Log odds ratio	0.367 CI [0.119, 0.618]

marginaleffects::avg_comparisons(
  model = mdl.glm , 
  variables = "demog_sex", # This is the exposure variable
  comparison = "lnor" )

marginaleffects::avg_comparisons(
  model = mdl.glm , 
  variables = "demog_sex", # This is the exposure variable
  comparison = "ratio" )

marginaleffects::avg_comparisons(
  model = mdl.glm , 
  variables = "demog_sex", # This is the exposure variable
  comparison = "difference")

Explanation: The logistic regression finds the (log) odds and odds ratio values that we also calculated from the contingency table. The plogis of the log odds and the linear model find the probablities we calculated from the contingency table.

Exercise 4

Write up your results using the template shown below.

Conclusion: Males have significantly higher odds of lifetime marijuana use than females (OR = 1.445 ; 95% CI = 1.126 , 1.855 p = .004). Males have 44.5% higher odds of lifetime marijuana use than females.

Exercise 5

Whereas in the previous questions we explored the relationship between lifetime marijuana use and sex, we will now explore the association between lifetime marijuana use (mj_lifetime) and age at first use of alcohol (alc_agefirst)? Fit a logistic regression and interpret the results, using the template shown below.

#Fitting the logistic regression
mdl.glm <- glm( mj_lifetime ~ alc_agefirst, 
  data = nsduh, 
  family = binomial)

#Summarizing results
summary(mdl.glm)

## 
## Call:
## glm(formula = mj_lifetime ~ alc_agefirst, family = binomial, 
##     data = nsduh)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    5.3407     0.4747   11.25   <2e-16 ***
## alc_agefirst  -0.2835     0.0267  -10.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1141.45  on 842  degrees of freedom
## Residual deviance:  968.44  on 841  degrees of freedom
##   (157 observations deleted due to missingness)
## AIC: 972.44
## 
## Number of Fisher Scoring iterations: 5

#Using marginaleffects for the CI
mdl.glm %>%  
  avg_comparisons(
    variables = "alc_agefirst", 
    type = "link", 
    transform = "exp")

Conclusion: Age at first alcohol use is significantly negatively associated with lifetime marijuana use (OR = 0.753 ; 95% CI = 0.715 , 0.794 ; p < .001). Individuals who first used alcohol at age 19 years have 24.7% lower odds of having ever used marijuana than those who first used alcohol at age 18 years. In contrast, the reduction in odds associated with starting drinking alcohol 3 years later is 57.3% lower odds.

Exercise 6

What is the association between lifetime marijuana use (mj_lifetime) and age at first use of alcohol (alc_agefirst), adjusted for age (demog_age_cat6), sex (demog_sex), and income (demog_income)? Fit a model with and without an interaction term (alc_agefirst:demog_sex) and compute the AOR, its 95% CI, and the p-value that tests the significance of age at first use of alcohol. Report also their AORs of the other predictors, their 95% CIs and p-values. Since there are some categorical predictors with more than two levels, use car::Anova(model, type = 3) function to compute p-values for multiple degrees of freedom. In your answer, fill in the template below.

# Unadjasted logistic regression model without interaction term
M0_unadj <- glm(mj_lifetime ~ alc_agefirst, data = nsduh, family = "binomial")
(summary(M0_unadj))

## 
## Call:
## glm(formula = mj_lifetime ~ alc_agefirst, family = "binomial", 
##     data = nsduh)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    5.3407     0.4747   11.25   <2e-16 ***
## alc_agefirst  -0.2835     0.0267  -10.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1141.45  on 842  degrees of freedom
## Residual deviance:  968.44  on 841  degrees of freedom
##   (157 observations deleted due to missingness)
## AIC: 972.44
## 
## Number of Fisher Scoring iterations: 5

# Fit logistic regression model without interaction term
M1 <- glm(mj_lifetime ~ alc_agefirst + demog_age_cat6 + demog_sex + demog_income,
                                 data = nsduh, family = "binomial")

(summary(M1))

## 
## Call:
## glm(formula = mj_lifetime ~ alc_agefirst + demog_age_cat6 + demog_sex + 
##     demog_income, family = "binomial", data = nsduh)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    6.25417    0.59136  10.576  < 2e-16 ***
## alc_agefirst                  -0.27541    0.02756  -9.992  < 2e-16 ***
## demog_age_cat626-34           -0.29619    0.32865  -0.901  0.36746    
## demog_age_cat635-49           -0.80427    0.29656  -2.712  0.00669 ** 
## demog_age_cat650-64           -0.68991    0.29854  -2.311  0.02084 *  
## demog_age_cat665+             -1.27475    0.30429  -4.189  2.8e-05 ***
## demog_sexMale                 -0.06089    0.16181  -0.376  0.70669    
## demog_income$20,000 - $49,999 -0.53088    0.26641  -1.993  0.04629 *  
## demog_income$50,000 - $74,999 -0.07931    0.30488  -0.260  0.79476    
## demog_income$75,000 or more   -0.36120    0.25323  -1.426  0.15376    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1141.45  on 842  degrees of freedom
## Residual deviance:  935.55  on 833  degrees of freedom
##   (157 observations deleted due to missingness)
## AIC: 955.55
## 
## Number of Fisher Scoring iterations: 5

# Fit logistic regression model with interaction term
M2_AgexSex <- glm(mj_lifetime ~ alc_agefirst * demog_sex + demog_age_cat6 + demog_income,
                              data = nsduh, family = "binomial")

(summary(M2_AgexSex))

## 
## Call:
## glm(formula = mj_lifetime ~ alc_agefirst * demog_sex + demog_age_cat6 + 
##     demog_income, family = "binomial", data = nsduh)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    7.37354    0.82921   8.892  < 2e-16 ***
## alc_agefirst                  -0.33782    0.04229  -7.988 1.38e-15 ***
## demog_sexMale                 -2.13813    0.98473  -2.171   0.0299 *  
## demog_age_cat626-34           -0.29506    0.33145  -0.890   0.3733    
## demog_age_cat635-49           -0.81603    0.29870  -2.732   0.0063 ** 
## demog_age_cat650-64           -0.68657    0.30071  -2.283   0.0224 *  
## demog_age_cat665+             -1.26004    0.30636  -4.113 3.91e-05 ***
## demog_income$20,000 - $49,999 -0.53035    0.26822  -1.977   0.0480 *  
## demog_income$50,000 - $74,999 -0.09306    0.30727  -0.303   0.7620    
## demog_income$75,000 or more   -0.36288    0.25466  -1.425   0.1542    
## alc_agefirst:demog_sexMale     0.11893    0.05552   2.142   0.0322 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1141.5  on 842  degrees of freedom
## Residual deviance:  930.9  on 832  degrees of freedom
##   (157 observations deleted due to missingness)
## AIC: 952.9
## 
## Number of Fisher Scoring iterations: 5

#Calculating AORs and percentages of use from the model with interaction
aor_alc_agefirst <- exp(-0.33782)
aor_alc_agefirst

## [1] 0.7133237

100-100*aor_alc_agefirst

## [1] 28.66763

aor_agecat_35_49 <- exp(-0.81603)
aor_agecat_35_49

## [1] 0.4421836

100-100*aor_agecat_35_49

## [1] 55.78164

#Calculating 95% CI for age categories from the model with interaction
M1 %>% 
  avg_comparisons(
    variables = "demog_age_cat6", 
    type = "link", 
    transform = "exp")

#Calculating 95% CI for age of first alcohol use from the model with interaction
M2_AgexSex %>%   avg_comparisons(
    variables = "alc_agefirst", 
    type = "link", 
    transform = "exp")

# Use Anova function from car package to compute p-values with type 3 test for multiple degrees of freedom
(M0_anova <- Anova(M0_unadj, type = "III"))

(M1_anova <- Anova(M1, type = "III"))

(M2_AgexSex_anova <- Anova(M2_AgexSex, type = "III"))

# Performing likelihood ratio test to compare the two models

lrtest(M0_unadj, M1,M2_AgexSex)

Interpreting the output: The AOR for our primary predictor alc_agefirst is 0.756. This represents the OR for lifetime marijuana use comparing those with a one-year difference in age at first use of alcohol, adjusted for age, sex, and income.

The remaining AORs compare levels of categorical predictors to their reference level, adjusted for the other predictors in the model. For example, comparing individuals with the same age of first alcohol use, sex, and income, 35-49 year-olds have 55.8 % lower odds of lifetime marijuana use than 18-25 year-olds (OR = 0.442 ; 95% CI = 0.246 , 0.794 ; p = .006). An overall 4 df p-value for age can be read from the Type III Test table. It is significant at p < .001 . An overall, 3 df p-value for income is not significant at p = 0.151.

Conclusion: After adjusting for age, sex, and income, age at first alcohol use is significantly negatively associated with lifetime marijuana use (AOR = 0.756 ; 95% CI = 0.715 , 0.798 ; p < .001). Individuals who first used alcohol at a given age have 28.7 % lower odds of having ever used marijuana than those who first used alcohol one year earlier. The association between age of first alcohol use and lifetime marijuana use differs significantly between males and females (p = .032). A likelihood ratio test shows that the adjusted model with the interaction is superior to the one without the interaction (p = .031)

Exercise 7

Replicate the model summary shown below, and discuss its main features. Use the likelihood ratio test to determine which model is best, and discuss the association between age of first alcohol and lifetime marijuana use, when comparing males to females.

#Replicate model summary
modelsummary(
  list(
    unadjusted = M0_unadj,
    adjusted = M1,
    interaction = M2_AgexSex
    ), 
  stars = TRUE
  )

	unadjusted	adjusted	interaction
(Intercept)	5.341***	6.254***	7.374***
	(0.475)	(0.591)	(0.829)
alc_agefirst	−0.284***	−0.275***	−0.338***
	(0.027)	(0.028)	(0.042)
demog_age_cat626-34		−0.296	−0.295
		(0.329)	(0.331)
demog_age_cat635-49		−0.804**	−0.816**
		(0.297)	(0.299)
demog_age_cat650-64		−0.690*	−0.687*
		(0.299)	(0.301)
demog_age_cat665+		−1.275***	−1.260***
		(0.304)	(0.306)
demog_sexMale		−0.061	−2.138*
		(0.162)	(0.985)
demog_income$20,000 - $49,999		−0.531*	−0.530*
		(0.266)	(0.268)
demog_income$50,000 - $74,999		−0.079	−0.093
		(0.305)	(0.307)
demog_income$75,000 or more		−0.361	−0.363
		(0.253)	(0.255)
alc_agefirst × demog_sexMale			0.119*
			(0.056)
Num.Obs.	843	843	843
AIC	972.4	955.6	952.9
BIC	981.9	1002.9	1005.0
Log.Lik.	−484.219	−467.776	−465.448
F	112.743	14.804	13.139
RMSE	0.44	0.43	0.43
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

#Performing likelihood ratio test to compare the two models. They are nested so this is allowed
lrtest(M0_unadj, M1,M2_AgexSex)

Conclusion: Compare with female, if the first age increased by one year, male will have the 53% lower risk than female.

Exercise 8

Replicate the forest plot shown below, using centralized variables scaled such that standard deviations are 0.5 (you can use the function datawizard::standardize(two_sd = TRUE). What is the advantage of this plot compared to the model summary tables shown above?

#Standardising the data
nsduh_ctr <- datawizard::standardize(nsduh, two_sd = TRUE)


#Interaction model on the standardised data

M2_AgexSex_ctr <- glm(mj_lifetime ~ alc_agefirst * demog_sex + demog_age_cat6 + demog_income,data = nsduh_ctr, family = "binomial")

(summary(M2_AgexSex_ctr))

## 
## Call:
## glm(formula = mj_lifetime ~ alc_agefirst * demog_sex + demog_age_cat6 + 
##     demog_income, family = "binomial", data = nsduh_ctr)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    1.46357    0.31860   4.594 4.35e-06 ***
## alc_agefirst                  -2.98057    0.37315  -7.988 1.38e-15 ***
## demog_sexMale                 -0.05751    0.16240  -0.354   0.7232    
## demog_age_cat626-34           -0.29506    0.33145  -0.890   0.3733    
## demog_age_cat635-49           -0.81603    0.29870  -2.732   0.0063 ** 
## demog_age_cat650-64           -0.68657    0.30071  -2.283   0.0224 *  
## demog_age_cat665+             -1.26004    0.30636  -4.113 3.91e-05 ***
## demog_income$20,000 - $49,999 -0.53035    0.26822  -1.977   0.0480 *  
## demog_income$50,000 - $74,999 -0.09306    0.30727  -0.303   0.7620    
## demog_income$75,000 or more   -0.36288    0.25466  -1.425   0.1542    
## alc_agefirst:demog_sexMale     1.04932    0.48987   2.142   0.0322 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1141.5  on 842  degrees of freedom
## Residual deviance:  930.9  on 832  degrees of freedom
##   (157 observations deleted due to missingness)
## AIC: 952.9
## 
## Number of Fisher Scoring iterations: 5

#sjPlot::plot_model(M2_AgexSex_ctr, transform = NULL)+ labs(title = "") 

M2_AgexSex_ctr %>%  
  model_parameters() %>% 
  plot() +
  labs( x= "Log-Odds (standarized)")

Answer: Both contain the table and the graphs contain the model information, but visually it is more important to visualize which parameters are different from 0. In addition, graphs help to visually illustrate trends, patterns and relationships between variables.

Exercise 9

We want to continue exploring the association between age of first alcohol use and lifetime marijuana use, and how this association varies between males and females. Try to replicate the following graphs, and interpret what you see. Note also that something interesting happens at the age of 18. What is special about that age, according to the two graphs? Is the difference between the predicted outcomes of males and females significant at any age bracket? If so, what is the age bracket?

mdl.int <- glm(mj_lifetime ~ alc_agefirst * demog_sex, family = "binomial", data = nsduh)


plot_predictions(
  mdl.int, 
  condition = c("alc_agefirst", "demog_sex"))+
  labs(title = "Predicted probability",
           x= "Alcohol (age first used)",
           y= "Probability of lifetime marijuana use")+ 
  theme(legend.position = "bottom")

# Probability difference
 plot_comparisons(
  mdl.int, 
  variables = list(
    demog_sex = c("Male", "Female")
    ), 
  comparison = "difference",
  condition = "alc_agefirst")+ geom_hline(yintercept = 0, linetype="longdash")+ labs(title = "Predicted probability",
           x= "Alcohol (age first used)",
           y= "Pr(Male) - Pr(Female)")+ 
  theme(legend.position = "bottom")

# Odds ratio


plot_comparisons(
  mdl.int, 
  variables = list(
    demog_sex = c("Male", "Female")
    ), 
  transform = exp,
  comparison = "lnor",
  condition = "alc_agefirst")+ geom_hline(yintercept = 1,linetype="longdash")+
  ylim(-1,15)+labs(title = "Odds ratio",
           x= "Alcohol (Age first used )",
           y= "Odds(Male)/Odds(Female)")+ 
  theme(legend.position = "bottom")

# Log odds ratio
plot_comparisons(
  mdl.int, 
  variables = list(
    demog_sex = c("Male", "Female")
    ), 
  comparison = "lnor",
  condition = "alc_agefirst")+
  labs(title = "Log odds ratio",
           x= "Risk ratio: alcoho (Age first used )",
           y= "Pr(Male) - Pr(Female)")+ 
  theme(legend.position = "bottom")

# Risk ratio
 plot_comparisons(
  mdl.int, 
  variables = list(
    demog_sex = c("Male", "Female")
    ), 
  comparison = "ratio",
  condition = "alc_agefirst")+ geom_hline(yintercept = 0, linetype="longdash")+ labs(title = "Risk ratio",
           x= "Risk ratio: alcoho (Age first used )",
           y= "Pr(Male) - Pr(Female)")+ 
  theme(legend.position = "bottom")

Conclusion: If males and females have their first alcohol use at age 18, there is a ~0.50 probability that they will use marijuana. Approximately 15 years of age, there is a significant risk that both males and females may initiate marijuana use.