Progress CheckIn 1

Section 1: Data Loading and Analytical Sample

# Load required packages
library(tinytex)
library(tidyverse)
library(kableExtra)
library(broom)
library(haven)
library(janitor)
library(knitr)
library(plotly)
library(ggeffects)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)

# Import the dataset
brfss_full <- read_xpt("C:/Users/joshm/Documents/UAlbany/Spring 2026/EPI 553/Project/BRFSS.XPT") %>%
  clean_names()

# keep only key variables
brfss_sub <- brfss_full %>%
  select(seqno, menthlth, drnk3ge5, avedrnk4, maxdrnks, sex, ageg5yr, incomg1, marijan1)

#remove missing data from key variables
brfss_sub <- brfss_sub %>%
  filter(
    !is.na(seqno),
    !is.na(menthlth),
    !is.na(drnk3ge5),
    !is.na(avedrnk4),
    !is.na(maxdrnks),
    !is.na(sex),
    !is.na(ageg5yr),
    !is.na(incomg1),
    !is.na(marijan1)
    
  )
#Remove invalid data, recode key variables, add labels
brfss_sub2 <- brfss_sub %>%
  filter(
    ageg5yr != 14,
    incomg1 != 9,
    
    menthlth != 77, menthlth != 99,
    drnk3ge5 != 77, drnk3ge5 != 99,
    marijan1 != 77, marijan1 != 99,
    
    avedrnk4 != 77, avedrnk4 != 99, avedrnk4 != 88,   
    
    maxdrnks != 77, maxdrnks != 99, maxdrnks != 88, maxdrnks != 0)

Data from the 2024 Behavioral Risk Factor Surveillance System (BRFSS) was obtained from the Centers for Disease Control and Prevention (CDC) website (Centers for Disease Control and Prevention, 2024). Data was loaded into R using the read.xpt function from the Tidyverse library. The starting sample size was N = 457,670.

The first data cleaning step taken was the creation of a new dataset by selecting and keeping only the variables important for this analysis. Those variables were seqno, menthlth, drnk3ge5, avedrnk4, maxdrnks, sex, ageg5yr, incomg1, and marijan1. After selecting and keeping only the important variables, the next cleaning step taken was the removal of all missing data. In this initial removal, only the observations explicitly listed as NA were removed.

After this initial removal, another step was taken to remove other missing and invalid values that were not explicitly listed as NA. For the variable ageg5yr, a value of 14 was invalid (don’t know, refused, missing). For the variable incomg1, a value of 9 was invalid (don’t know, not sure, missing). For the variables menthlth, drnk3ge5, and marijan1, values of 77 (don’t know, not sure) and 99 (refused) were invalid. For the variable avedrnk, values of 77 (don’t know, not sure), 99 (refused), and 88 (none) were considered invalid. 88 was considered an invalid response because participants were only asked this question if they indicated on a previous question that they had at least one alcoholic beverage in the last 30 days. Thus, indicating that they had zero average drinks on the days they drank does not make sense. For the variable maxdrnks, values of 77 (don’t know, not sure), 88 (invalid response), 99 (refused), and 0 were considered invalid. 0 was considered an invalid response because participants were only asked this question if they indicated on a previous question that they had at least one alcoholic beverage in the last 30 days. Thus, indicating 0 for the maximum number of drinks they had in the past 30 days does not make sense.

Observations that contained an invalid value for one or more of the aforementioned variables were removed. After removal of other missing and invalid values, a total of 399,644 observations were excluded, and the final analytic sample size was N = 58,026. This analysis is also investigating the relationship between drinking and mental health within binge drinkers, so a separate “binge drinkers” dataset was created in addition to the main analytic dataset. The dataset selected all observations in the analytic dataset wherein drnk3ge5 > 0; that is, individuals who reported at least one instance of binge drinking in the past 30 days. Binge drinking was defined as 5 or more drinks for men or 4 or more drinks for women during a single occasion. Additionally, females who reported less than 4 for maxdrnks and males who reported less than 5 for maxdrnks were excluded. This is because it does not make sense to report at least one instance of binge drinking while also reporting the maximum drinks in one session to be less than the binge drinking threshold. A total of 44,009 observations were excluded from the analytic dataset. The binge drinkers dataset had a sample size of N = 14,017.

The Binge drinkers dataset was created after the recoding of variables so that recoding did not have to be done to two separate datasets.

Start: N = 457,670 -> exclude observations with a missing and/or invalid value for at least one of the key analytic variables: N = 58,026 [full analytic dataset] -> exclude individuals who did not report any past-month binge drinking: N = 14,017 [binge drinkers dataset]

It is possible that missing or invalid values might be non-random for the key variables in this analysis. In particular, the proportion of missing or invalid values for the variable incomg1 (income categories), may not be equally distributed amongst all the income categories. Those from lower income categories may be more likely than those from higher income categories to answer “don’t know”, “not sure”, or to leave the question blank. Individuals with higher income are likely to have salaried positions in which annual income is clear, but individuals from lower income categories may be more likely to have jobs in which the annual income is unknown or inconsistent (e.g. gig jobs). This might mean that lower income categories are underrepresented in this analysis.

Section 2: Variable Selection, Recoding, and Data Cleaning

# keep only key variables
brfss_sub2 <- brfss_sub2 %>%
  mutate(
      menthlth_new = case_when(menthlth == 88 ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                        ~ NA_real_),
      
      drnk3ge5_new = case_when(drnk3ge5 == 88 ~ 0,
      drnk3ge5 >= 1 & drnk3ge5 <= 76 ~ as.numeric(drnk3ge5),
      TRUE                        ~ NA_real_),
      
      marijan1_new = case_when(marijan1 == 88 ~ 0,
      marijan1 >= 1 & marijan1 <= 30 ~ as.numeric(marijan1),
      TRUE                        ~ NA_real_),
 
      sex_new = factor(sex, levels = c(1, 2), labels = c("Male", "Female")),
      
      age_new = factor(ageg5yr, levels = 1:13,
      labels = c("18-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80+") ),
      
      income_new = factor(incomg1, levels = 1:7,
      labels = c("<$15k", "$15-25k", "$25-35k", "$35-50k",
                 "$50-100k", "$100-200k", ">$200k"))
  
       ) 

# binge drinking dataset
brfss_binge <- brfss_sub2 %>%
  filter(drnk3ge5_new > 0, 
         sex_new == "Female" & maxdrnks>=4 | sex_new == "Male" & maxdrnks>=5)

The outcome variable in this analysis is menthlth, the number of days within the past 30 days in which individuals reported poor mental health. Poor mental health was defined as stress, depression, and problems with emotions. This variable was treated as a continuous variable in this analysis. Prior to analysis, menthtlh was recoded such that values of 88 were converted to 0, as 88 indicated an answer of “None” (i.e. no poor mental health days in the past 30 days). In the analytic dataset, the mean of menthlth was 4.1, with a standard deviation of 7.7 (Table 1a). In the binge drinkers dataset, the mean of menthlth was 5.4, with a standard deviation of 8.6 (Table 1b).

The exposure variables in this analysis are drnk3ge5, the frequency of binge drinking within the last 30 days; avedrnk4, the average number of drinks in one occasion; and maxdrnks, the maximum number of drinks in one occasion. All three of these variables were treated as continuous variables. The variable drnk3ge5 was recoded such that values of 88 were converted to 0, as 88 indicated an answer of “None” for this variable (i.e. no binge drinking within the past 30 days).

Covariates in this analysis are sex (sex), ageg5yr (age in 5 year age categories), incomg1 (income category), and marijan1 (number of days in which marijuana or cannabis was used in the past 30 days). The variable marijan1 was recoded so that a value of 88 was converted to 0, as a response of 88 indicated no marijuana use in the past 30 days. Sex is a binary variable, income and age are categorical variables, and marijuana use is a continuous variable.

Section 3: Descriptive Statistics (Table 1)

brfss_sub2 %>%
  select(menthlth_new, #outcome
         drnk3ge5_new, #exposure
         marijan1_new, sex_new, age_new, income_new #covariates
         ) %>% 
  tbl_summary(
    label = list(
      menthlth_new ~ "Mentally unhealthy days (past 30)",
      drnk3ge5_new ~ "Binge drinking frequency (past 30 days)",
      marijan1_new     ~ "Marijuana use days (past 30)",
      sex_new           ~ "Sex",
      age_new      ~ "Age category",
      income_new           ~ "Income category"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1a. Descriptive Statistics — BRFSS 2024 Analytic Sample (n = 58026)**")

**Table 1a. Descriptive Statistics — BRFSS 2024 Analytic Sample (n = 58026)**
Characteristic	N	N = 58,026¹
Mentally unhealthy days (past 30)	58,026	4.1 (7.7)
Binge drinking frequency (past 30 days)	58,026	1.2 (3.9)
Marijuana use days (past 30)	58,026	3.0 (8.0)
Sex	58,026
Male		30,449 (52%)
Female		27,577 (48%)
Age category	58,026
18-24		2,730 (4.7%)
25-29		3,200 (5.5%)
30-34		3,759 (6.5%)
35-39		4,348 (7.5%)
40-44		4,618 (8.0%)
45-49		4,112 (7.1%)
50-54		4,520 (7.8%)
55-59		4,987 (8.6%)
60-64		6,008 (10%)
65-69		6,219 (11%)
70-74		5,626 (9.7%)
75-79		4,259 (7.3%)
80+		3,640 (6.3%)
Income category	58,026
<$15k		1,529 (2.6%)
$15-25k		2,901 (5.0%)
$25-35k		4,463 (7.7%)
$35-50k		6,472 (11%)
$50-100k		18,863 (33%)
$100-200k		17,049 (29%)
>$200k		6,749 (12%)
¹ Mean (SD); n (%)

#binge drinkers subset

brfss_binge %>%
  select(menthlth_new, #outcome
         avedrnk4, #exposure
         maxdrnks,
         marijan1_new, sex_new, age_new, income_new #covariates
         ) %>% 
  tbl_summary(
    label = list(
      menthlth_new ~ "Mentally unhealthy days (past 30)",
      avedrnk4 ~ "Average number of drinks per session (past 30 days)",
      maxdrnks ~ "Maximum number of drinks in one session (past 30 days)",
      marijan1_new     ~ "Marijuana use days (past 30)",
      sex_new           ~ "Sex",
      age_new      ~ "Age category",
      income_new           ~ "Income category"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1b. Descriptive Statistics — BRFSS 2024 Binge Drinkers Subset (n = 14017)**")

**Table 1b. Descriptive Statistics — BRFSS 2024 Binge Drinkers Subset (n = 14017)**
Characteristic	N	N = 14,017¹
Mentally unhealthy days (past 30)	14,017	5.4 (8.6)
Average number of drinks per session (past 30 days)	14,017	4.0 (3.3)
Maximum number of drinks in one session (past 30 days)	14,017	7.3 (4.5)
Marijuana use days (past 30)	14,017	5.3 (10.0)
Sex	14,017
Male		8,644 (62%)
Female		5,373 (38%)
Age category	14,017
18-24		1,235 (8.8%)
25-29		1,308 (9.3%)
30-34		1,427 (10%)
35-39		1,448 (10%)
40-44		1,564 (11%)
45-49		1,264 (9.0%)
50-54		1,274 (9.1%)
55-59		1,262 (9.0%)
60-64		1,235 (8.8%)
65-69		967 (6.9%)
70-74		613 (4.4%)
75-79		287 (2.0%)
80+		133 (0.9%)
Income category	14,017
<$15k		488 (3.5%)
$15-25k		774 (5.5%)
$25-35k		1,111 (7.9%)
$35-50k		1,501 (11%)
$50-100k		4,147 (30%)
$100-200k		4,217 (30%)
>$200k		1,779 (13%)
¹ Mean (SD); n (%)

Section 4: Exploratory Data Analysis

#whole analytic sample
p_hist <- ggplot(brfss_sub2, aes(x = menthlth_new)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
  labs(
    title    = "Figure 1A. Mentally Unhealthy Days (Past 30)",
    subtitle = "BRFSS 2024 Analytic Sample (n = 58,137)",
    x        = "Number of Mentally Unhealthy Days",
    y        = "Count"
  ) +
  theme_minimal(base_size = 10)

ggplotly(p_hist)

#binge drinker subset
p_hist <- ggplot(brfss_binge, aes(x = menthlth_new)) +
  geom_histogram(binwidth = 1, fill = "red", color = "white", alpha = 0.85) +
  labs(
    title    = "Figure 1B. Mentally Unhealthy Days (Past 30)
    Binge Drinkers",
    subtitle = "BRFSS 2024 Analytic Sample (n = 58,137)",
    x        = "Number of Mentally Unhealthy Days",
    y        = "Count"
  ) +
  theme_minimal(base_size = 10)

ggplotly(p_hist)

#analytic sample

ggplot(brfss_sub2, aes(x = drnk3ge5_new, y = menthlth_new)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Figure 2A. Frequency of Binge Drinking and Mental Health",
    y       = "Mentally Unhealthy days (Past 30)",
    x       = "Binge drinking frequency (past 30 days)"
  ) +
  theme_minimal()

#binge drinker subset

ggplot(brfss_binge, aes(x = avedrnk4, y = menthlth_new)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title   = "Figure 2B. Average Drinks per Session and Mental Health",
    y       = "Mentally Unhealthy days (Past 30)",
    x       = "Average number of drinks per session (past 30 days)"
  ) +
  theme_minimal()

ggplot(brfss_binge, aes(x = maxdrnks, y = menthlth_new)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title   = "Figure 2C. Max Drinks per Session and Mental Health",
    y       = "Mentally Unhealthy days (Past 30)",
    x       = "Maximum number of drinks in one session (past 30 days)"
  ) +
  theme_minimal()

Figure 1 shows the distribution of the primary outcome variable menthlth in both the full analytic dataset (A) and the binge drinkers dataset (B). In both cases, the distribution is positively skewed, with most participants indicating they had zero poor mental health days in the past 30 days. This suggests that the assumption of normality required for multiple linear regression is unlikely to hold, but given the large sample size in both cases (n = 58,026 and n=14,017 for the full and binge drinkers dataset, respectively), the potential bias should be minimal. Some differences can be observed between the full and binge drinkers dataset. Just by looking at the figures, it can be seen that binge drinkers tend to have more mentally unhealthy days; the bars in the tail of the distribution are taller than those in the full analytic dataset.

Figure 2 shows preliminary associations between the outcome variable menthlth and the exposure variables drnk3ge5 (A), avedrnk4 (B), and maxdrnks (C). In this analysis, the relationship between menthlth and drnk3ge5 will be assessed in the full analytic dataset, while the relationships between menthlth and avedrnk4 and menthlth and maxdrnks will be assessed only in binge drinkers. Thus, figure A represents the full analytic dataset, while figures B and C represent the binge drinkers dataset. All three preliminary associations appear to go in the expected direction based on the fitted regression lines. In the full analytic sample, as binge drinking frequency increases, the number of poor mental health days that month also increases. In the binge drinkers dataset, as the average number of drinks per session and the maximum number of drinks in a single session increase, the number of poor mental health days that month also increases. These positive correlations were expected.

#analytic sample

brfss_sub2 %>%
  select(menthlth_new, drnk3ge5_new, marijan1_new) %>%

  rename(
    `Poor Mental Days`  = menthlth_new,
    `Freq Binge` = drnk3ge5_new,
    `Marijuana Days` = marijan1_new
  ) %>%
  
  
  ggpairs(
    lower  = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
    diag   = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
    upper  = list(continuous = wrap("cor", size = 4)),
    title  = "Figure 3A. Marijuana Use vs. Key Variables"
  ) +
  theme_minimal(base_size = 11)

#binge drinkers
brfss_binge %>%
  select(menthlth_new, avedrnk4, maxdrnks, marijan1_new) %>%

  rename(
    `Poor Mental Days`  = menthlth_new,
    `Ave Drinks` = avedrnk4,
    `Max Drinks`= maxdrnks,
    `Marijuana Days` = marijan1_new
  ) %>%
    
  ggpairs(
    lower  = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
    diag   = list(continuous = wrap("densityDiag", fill = "red", alpha = 0.5)),
    upper  = list(continuous = wrap("cor", size = 4)),
    title  = "Figure 3B. Marijuana Use vs. Key Variables - Binge Drinkers"
  ) +
  theme_minimal(base_size = 11)

Figure 3 is a scatterplot matrix that shows the associations between menthlth, drnk3ge5, and marijan1 in the full analytic dataset (A) as well as the associations between menthlth, avedrnk4, maxdrnks, and marijan1 in the binge drinkers dataset (B). This plot was chosen to assess any confounding on the part of the covariate marijan1; to see if it was significantly associated with both the exposure and outcome variables in this study. Figure 3a reveals that in our full dataset, the number of days in which marijuana was used in the past 30 days was significantly positively correlated with both the frequency of binge drinking and the number of poor mental health days. Figure 3b reveals that in our binge drinkers dataset, the number of days in which marijuana was used in the past 30 days was significantly positively correlated with the average number of drinks per drinking session, the maximum number of drinks in a single session, and the number of poor mental health days. Altogether, this confirms that marijuana use is a confounder as it was significantly correlated with the exposure and outcome variables, and all correlations were in the same direction. However, although correlations were statistically significant, the magnitude of the correlations were small (<0.3), indicating weak correlations between marijuana use and key variables.

References

Centers for Disease Control and Prevention. (2024). Behavioral Risk Factor Surveillance System. https://www.cdc.gov/brfss/annual_data/annual_2024.html

Progress CheckIn 1

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Josh Macera

March 10, 2026

Section 1: Data Loading and Analytical Sample

Section 2: Variable Selection, Recoding, and Data Cleaning

Section 3: Descriptive Statistics (Table 1)

Section 4: Exploratory Data Analysis

References