PUBH 422 Final Project

Project Initiation & Data Acquisition

The Health Information National Trends Survey (HINTS) is a nationally representative cross-sectional survey conducted by the National Cancer Institute (NCI) to monitor health behaviors and trends among U.S. adults. The research question for this study was: Is there a significant association between household income and BMI among U.S. adults after controlling for age and sex? The null hypothesis (H0) stated that mean BMI does not differ between income levels after controlling for age and sex, while the alternative hypothesis (HA) stated that mean BMI differs for at least one income level.

Data Cleaning & Management in R

# Set working directory to the folder containing the HINTS 2024 dataset
setwd("C:/Users/kevinroldan13/OneDrive - Cal State Fullerton/Desktop/PUBH 422/PUBH 422 Final Project")

# Loaded the HINTS 2024 R data file
load("hints7_public.rda")

# Viewed objects loaded into the R environment
ls()

## [1] "public"

# Renamed the dataset to an appropriate name
hints_2024 <- public

# Checked the size and structure of the original dataset
dim(hints_2024)

## [1] 7278  515

# Selected variables relevant to the research question and descriptive analysis. BMI is the outcome variable. HHInc is the main predictor variable. AgeGrpB and BirthSex are control variables. AvgDrinksPerWeek, WeeklyMinutesModerateExercise, and smokeStat are included to describe health behaviors in the sample.

hints_vars <- hints_2024[, c(
  "BMI",
  "AvgDrinksPerWeek",
  "BirthSex",
  "WeeklyMinutesModerateExercise",
  "HHInc",
  "smokeStat",
  "AgeGrpB"
)]

# Inspected variable types before cleaning
str(hints_vars)

## 'data.frame':    7278 obs. of  7 variables:
##  $ BMI                          : Factor w/ 404 levels "Missing data (Not Ascertained)",..: 125 112 102 214 152 135 169 165 123 128 ...
##  $ AvgDrinksPerWeek             : Factor w/ 90 levels "Missing data (Not Ascertained)",..: 43 49 31 13 42 7 9 19 5 1 ...
##  $ BirthSex                     : Factor w/ 6 levels "Missing data (Not Ascertained)",..: 5 4 4 5 4 5 4 5 4 6 ...
##  $ WeeklyMinutesModerateExercise: Factor w/ 166 levels "Missing data (Not Ascertained)",..: 72 64 73 51 5 57 44 45 92 5 ...
##  $ HHInc                        : Factor w/ 9 levels "Missing data (Not Ascertained)",..: 9 9 8 7 4 9 7 6 8 4 ...
##  $ smokeStat                    : Factor w/ 7 levels "Missing data (Not Ascertained)",..: 6 6 7 6 6 7 7 7 5 7 ...
##  $ AgeGrpB                      : Factor w/ 8 levels "Missing data (Not Ascertained)",..: 7 6 4 7 6 6 4 8 4 1 ...

# Reviewed factor levels using levels() for each variable (included in exploration/cleaning script) to identify missing, unreadable, or invalid response category names


# Removed observations with missing, unreadable, incomplete, or invalid responses on any of the selected variables. The result is stored as hints_clean, the dataset used for all later analysis.

hints_clean <- subset(
  hints_vars,
  
  AgeGrpB != "Missing data (Not Ascertained)" &
    AgeGrpB != "Missing data (Web partial - Question Never Seen)" &
    AgeGrpB != "Unreadable or Nonconforming Numeric Response"  &
    
    BMI != "Missing data (Not Ascertained)" &
      BMI != "Missing data (Web partial - Question Never Seen)" &
      BMI != "Unreadable or Nonconforming Numeric Response" &
    
    AvgDrinksPerWeek != "Missing data (Not Ascertained)"  &
      AvgDrinksPerWeek != "Missing data (Web partial - Question Never Seen)" &
      AvgDrinksPerWeek != "Multiple Responses Selected in Error" &
      AvgDrinksPerWeek != "Unreadable or Nonconforming Numeric Response" &
    
    BirthSex != "Missing data (Not Ascertained)"  &
      BirthSex != "Missing data (Web partial - Question Never Seen)" &
      BirthSex != "Multiple responses selected in error" &
      BirthSex != "Don't know" &
    
    WeeklyMinutesModerateExercise != "Missing data (Not Ascertained)"  &
      WeeklyMinutesModerateExercise != "Missing data (Web partial - Question Never Seen)" &
      WeeklyMinutesModerateExercise != "Multiple Responses Selected in Error" &
      WeeklyMinutesModerateExercise != "Unreadable or Nonconforming Numeric Response" &
    
    HHInc != "Missing data (Not Ascertained)" &
      HHInc != "Missing data (Web partial - Question Never Seen)" &
    
    smokeStat != "Missing data (Not Ascertained)" &
      smokeStat != "Missing data (Web partial - Question Never Seen)" &
      smokeStat != "Missing data (Filter Missing), coded -9 in Smoke100" &
      smokeStat != "Unreadable or Nonconforming Numeric Response"
)

# Checked the cleaned dataset to ensure no missing or invalid values
dim(hints_clean)

## [1] 5811    7

str(hints_clean)

## 'data.frame':    5811 obs. of  7 variables:
##  $ BMI                          : Factor w/ 404 levels "Missing data (Not Ascertained)",..: 125 112 102 214 152 135 169 165 123 92 ...
##  $ AvgDrinksPerWeek             : Factor w/ 90 levels "Missing data (Not Ascertained)",..: 43 49 31 13 42 7 9 19 5 5 ...
##  $ BirthSex                     : Factor w/ 6 levels "Missing data (Not Ascertained)",..: 5 4 4 5 4 5 4 5 4 4 ...
##  $ WeeklyMinutesModerateExercise: Factor w/ 166 levels "Missing data (Not Ascertained)",..: 72 64 73 51 5 57 44 45 92 57 ...
##  $ HHInc                        : Factor w/ 9 levels "Missing data (Not Ascertained)",..: 9 9 8 7 4 9 7 6 8 9 ...
##  $ smokeStat                    : Factor w/ 7 levels "Missing data (Not Ascertained)",..: 6 6 7 6 6 7 7 7 5 6 ...
##  $ AgeGrpB                      : Factor w/ 8 levels "Missing data (Not Ascertained)",..: 7 6 4 7 6 6 4 8 4 6 ...

summary(hints_clean)

##       BMI       AvgDrinksPerWeek
##  25.8   : 144   0      :2827    
##  26.6   : 117   0.5    : 287    
##  25.1   : 105   0.25   : 277    
##  28.3   :  97   1      : 255    
##  24.4   :  95   0.75   : 167    
##  27.4   :  86   2.5    : 165    
##  (Other):5167   (Other):1833    
##                                              BirthSex   
##  Missing data (Not Ascertained)                  :   0  
##  Missing data (Web partial - Question Never Seen):   0  
##  Multiple responses selected in error            :   0  
##  Female                                          :3447  
##  Male                                            :2364  
##  Don't know                                      :   0  
##                                                         
##  WeeklyMinutesModerateExercise                   HHInc     
##  0      :1331                  $100,000 or greater  :1792  
##  60     : 395                  $50,000 to < $75,000 : 965  
##  90     : 393                  Less than $20,000    : 904  
##  180    : 386                  $75,000 to < $100,000: 739  
##  120    : 315                  $35,000 to < $50,000 : 712  
##  30     : 254                  $20,000 to < $35,000 : 699  
##  (Other):2737                  (Other)              :   0  
##                                                smokeStat   
##  Missing data (Not Ascertained)                     :   0  
##  Missing data (Web partial - Question Never Seen)   :   0  
##  Missing data (Filter Missing), coded -9 in Smoke100:   0  
##  Unreadable or Nonconforming Numeric Response       :   0  
##  Current                                            : 581  
##  Former                                             :1491  
##  Never                                              :3739  
##                            AgeGrpB    
##  50-64                         :1523  
##  35-49                         :1254  
##  65-74                         :1236  
##  18-34                         : 999  
##  75+                           : 799  
##  Missing data (Not Ascertained):   0  
##  (Other)                       :   0

# Converted quantitative variables to numeric for proper descriptive statistics and visualization 
hints_clean$BMI <- as.numeric(as.character(hints_clean$BMI))

hints_clean$AvgDrinksPerWeek <- as.numeric(as.character(hints_clean$AvgDrinksPerWeek))

hints_clean$WeeklyMinutesModerateExercise <- as.numeric(as.character(hints_clean$WeeklyMinutesModerateExercise))


# Removed unused factor levels left over from removed missing/invalid categories
hints_clean <- droplevels(hints_clean)


# Verified the final cleaned dataset
str(hints_clean)

## 'data.frame':    5811 obs. of  7 variables:
##  $ BMI                          : num  26.3 25 24 35.2 29 27.3 30.7 30.3 26.1 23 ...
##  $ AvgDrinksPerWeek             : num  12.5 15 7.5 2 12 0.5 1 4 0 0 ...
##  $ BirthSex                     : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 2 1 2 1 1 ...
##  $ WeeklyMinutesModerateExercise: num  225 180 240 120 0 150 80 90 360 150 ...
##  $ HHInc                        : Factor w/ 6 levels "Less than $20,000",..: 6 6 5 4 1 6 4 3 5 6 ...
##  $ smokeStat                    : Factor w/ 3 levels "Current","Former",..: 2 2 3 2 2 3 3 3 1 2 ...
##  $ AgeGrpB                      : Factor w/ 5 levels "18-34","35-49",..: 4 3 1 4 3 3 1 5 1 3 ...

summary(hints_clean)

##       BMI        AvgDrinksPerWeek   BirthSex    WeeklyMinutesModerateExercise
##  Min.   :10.20   Min.   : 0.000   Female:3447   Min.   :   0                 
##  1st Qu.:24.00   1st Qu.: 0.000   Male  :2364   1st Qu.:  15                 
##  Median :27.50   Median : 0.250                 Median :  90                 
##  Mean   :28.88   Mean   : 2.848                 Mean   : 183                 
##  3rd Qu.:32.40   3rd Qu.: 2.500                 3rd Qu.: 225                 
##  Max.   :66.60   Max.   :75.000                 Max.   :5040                 
##                    HHInc        smokeStat     AgeGrpB    
##  Less than $20,000    : 904   Current: 581   18-34: 999  
##  $20,000 to < $35,000 : 699   Former :1491   35-49:1254  
##  $35,000 to < $50,000 : 712   Never  :3739   50-64:1523  
##  $50,000 to < $75,000 : 965                  65-74:1236  
##  $75,000 to < $100,000: 739                  75+  : 799  
##  $100,000 or greater  :1792

dim(hints_clean)

## [1] 5811    7

The original HINTS 2024 dataset included 7,278 observations and 515 variables. After selecting variables relevant to the research question and removing missing, unreadable, incomplete, or invalid responses, the final cleaned dataset included 5,811 observations and 7 variables. BMI, average drinks per week, and weekly minutes of moderate exercise were converted from factor to numeric variables to allow for descriptive statistics, correlation analysis, visualization, and regression modeling. Unused factor levels were removed to ensure that only valid response categories appeared in the final analysis.
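The filtering and conversion steps described above can be illustrated on a small scale. The following is a minimal sketch, not the code used for this project: the toy data frame is invented, and `invalid_pattern` is a hypothetical regular expression that would need to be checked against the actual factor levels before use.

```r
# Toy data mimicking HINTS-style factors; values are invented for illustration.
toy <- data.frame(
  BMI = factor(c("26.3", "Missing data (Not Ascertained)", "31.0", "24.5")),
  BirthSex = factor(c("Male", "Female", "Don't know", "Female"))
)

# Hypothetical pattern covering the non-substantive response labels.
invalid_pattern <- "^Missing data|^Unreadable|^Multiple|^Don't know"

# Flag rows where any column holds a non-substantive label.
is_invalid <- sapply(toy, function(col) grepl(invalid_pattern, as.character(col)))
toy_clean <- droplevels(toy[rowSums(is_invalid) == 0, , drop = FALSE])

# as.numeric() on a factor returns internal level codes, not the labels,
# so the conversion has to go through character first.
codes  <- as.numeric(toy_clean$BMI)
values <- as.numeric(as.character(toy_clean$BMI))
```

Comparing `codes` with `values` shows why the report converts with `as.numeric(as.character(...))` rather than calling `as.numeric()` on the factor directly.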

Descriptive & Frequency Analysis

Descriptive Statistics

library(summarytools)

## Warning: package 'summarytools' was built under R version 4.5.3

# Created a subset containing only quantitative variables for descriptive statistics and correlation analysis
quant_var <- hints_clean[, c(
  "BMI",
  "AvgDrinksPerWeek",
  "WeeklyMinutesModerateExercise"
)]

# Generated descriptive statistics for quantitative variables. Statistics include: n.valid = number of valid observations, mean = average value, med = median value, sd = standard deviation, min = minimum observed value, and max = maximum observed value
descr(
  quant_var,
  stats = c(
    "n.valid",
    "mean",
    "med",
    "sd",
    "min",
    "max"
  )
)

## Descriptive Statistics  
## quant_var  
## N: 5811  
## 
##                 AvgDrinksPerWeek       BMI   WeeklyMinutesModerateExercise
## ------------- ------------------ --------- -------------------------------
##       N.Valid            5811.00   5811.00                         5811.00
##          Mean               2.85     28.88                          182.96
##        Median               0.25     27.50                           90.00
##       Std.Dev               6.55      6.95                          316.89
##           Min               0.00     10.20                            0.00
##           Max              75.00     66.60                         5040.00

The descriptive statistics demonstrated variability across the quantitative variables included in the analysis. The mean BMI was 28.88 (SD = 6.95), indicating that the average respondent fell within the overweight BMI category (25.0 to 29.9). Average drinks consumed per week had a mean of 2.85 (SD = 6.55), but the median of 0.25 and the range of 0 to 75 indicated a strongly right-skewed distribution in which most respondents drank little or nothing. Weekly minutes of moderate exercise averaged 182.96 minutes, exceeding the recommended 150 minutes of moderate physical activity per week; however, the median was only 90 minutes and the maximum was 5,040 minutes, indicating that the mean was pulled upward by a small number of very high values and that at least half of respondents fell below the recommendation.
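The same summary statistics can be computed in base R without an additional package. The sketch below uses an invented toy vector, and `describe_num` is a hypothetical helper name, not part of summarytools; it is meant only to show how a gap between the mean and the median signals skew.

```r
# Toy right-skewed vector; values are invented for illustration only.
minutes <- c(0, 0, 30, 60, 90, 120, 180, 600)

# Hypothetical helper returning the six statistics reported above.
describe_num <- function(x) {
  c(
    n      = sum(!is.na(x)),
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE)
  )
}

stats <- describe_num(minutes)
round(stats, 2)
```

In the toy vector a single large value pulls the mean well above the median, the same pattern seen for weekly exercise minutes in the HINTS sample.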

Frequency Tables

# Generated frequency tables for categorical variables. valid.col = TRUE displays valid percentages excluding missing data.

# Age Group Distribution
freq(hints_clean$AgeGrpB, valid.col = TRUE)

## Frequencies  
## hints_clean$AgeGrpB  
## Type: Factor  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##       18-34    999     17.19          17.19     17.19          17.19
##       35-49   1254     21.58          38.77     21.58          38.77
##       50-64   1523     26.21          64.98     26.21          64.98
##       65-74   1236     21.27          86.25     21.27          86.25
##         75+    799     13.75         100.00     13.75         100.00
##        <NA>      0                               0.00         100.00
##       Total   5811    100.00         100.00    100.00         100.00

# Birth Sex Distribution
freq(hints_clean$BirthSex, valid.col = TRUE)

## Frequencies  
## hints_clean$BirthSex  
## Type: Factor  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------ ------ --------- -------------- --------- --------------
##       Female   3447     59.32          59.32     59.32          59.32
##         Male   2364     40.68         100.00     40.68         100.00
##         <NA>      0                               0.00         100.00
##        Total   5811    100.00         100.00    100.00         100.00

# Household Income Distribution
freq(hints_clean$HHInc, valid.col = TRUE)

## Frequencies  
## hints_clean$HHInc  
## Type: Factor  
## 
##                               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## --------------------------- ------ --------- -------------- --------- --------------
##           Less than $20,000    904     15.56          15.56     15.56          15.56
##        $20,000 to < $35,000    699     12.03          27.59     12.03          27.59
##        $35,000 to < $50,000    712     12.25          39.84     12.25          39.84
##        $50,000 to < $75,000    965     16.61          56.44     16.61          56.44
##       $75,000 to < $100,000    739     12.72          69.16     12.72          69.16
##         $100,000 or greater   1792     30.84         100.00     30.84         100.00
##                        <NA>      0                               0.00         100.00
##                       Total   5811    100.00         100.00    100.00         100.00

# Smoking Status Distribution
freq(hints_clean$smokeStat, valid.col = TRUE)

## Frequencies  
## hints_clean$smokeStat  
## Type: Factor  
## 
##                 Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------- ------ --------- -------------- --------- --------------
##       Current    581     10.00          10.00     10.00          10.00
##        Former   1491     25.66          35.66     25.66          35.66
##         Never   3739     64.34         100.00     64.34         100.00
##          <NA>      0                               0.00         100.00
##         Total   5811    100.00         100.00    100.00         100.00

Frequency table results showed that adults aged 50–64 years represented the largest age group in the sample (26.2%). The majority of respondents were female (59.3%), while the largest household income category was respondents earning $100,000 or greater annually (30.8%). Most respondents reported never smoking (64.3%), followed by former smokers (25.7%) and current smokers (10.0%). These distributions provide important demographic and behavioral context.
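The percentages in a frequency table follow directly from the counts. As a brief check using base R only, the sketch below recomputes the smoking status percentages from the counts reported in the table above; the object names are illustrative.

```r
# Counts copied from the smoking status frequency table above.
smoke_counts <- c(Current = 581, Former = 1491, Never = 3739)

pct     <- 100 * prop.table(smoke_counts)  # share of each category
cum_pct <- cumsum(pct)                     # running total across categories

round(cbind(Freq = smoke_counts, Pct = pct, CumPct = cum_pct), 2)
```

The same pattern applies to any of the categorical variables once the counts are known.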

Univariate & Bivariate Visualization

Univariate Quantitative Plot: BMI Histogram

library(ggplot2)

# Created a histogram to visualize the distribution of BMI values. binwidth = 2 groups BMI values into intervals of 2 units for clearer interpretation.

ggplot(hints_clean, aes(x = BMI)) +
  geom_histogram(binwidth = 2) +
  labs(
    title = "Distribution of BMI",
    x = "BMI",
    y = "Count"
  )

The BMI histogram shows a slightly right‑skewed distribution, with most respondents falling between BMI values of approximately 20 and 35. A small number of individuals showed very high BMI values above 50, indicating there are extreme cases. The overall spread of the distribution suggests variability in BMI among U.S. adults in the HINTS 2024 dataset.
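The visual impression of right skew is consistent with the summary statistics, where mean BMI (28.88) exceeded median BMI (27.50). Skew can also be quantified. The sketch below is one possible approach: `skewness` is a hypothetical helper implementing the simple moment-based formula, and the toy values are invented rather than drawn from HINTS.

```r
# Moment-based skewness: mean cubed deviation divided by the
# mean squared deviation raised to the power 1.5.
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy BMI-like values with a long upper tail (invented for illustration).
bmi_toy <- c(21, 23, 24, 25, 26, 27, 28, 30, 35, 52)

skewness(bmi_toy)                # positive values indicate right skew
mean(bmi_toy) > median(bmi_toy)  # mean above median is another sign of right skew
```

Applying the helper to `hints_clean$BMI` would give a single number to report alongside the histogram.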

Univariate Categorical Plot: Income Bar Chart

# Created a bar chart to visualize the distribution of household income categories. Rotated x-axis labels to improve readability.

ggplot(hints_clean, aes(x = HHInc)) +
  geom_bar() +
  labs(
    title = "Household Income Distribution",
    x = "Household Income",
    y = "Count"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The household income bar chart shows that the highest income category ($100,000 or greater) was the largest group, with nearly twice as many respondents as the next-largest category. The five lower-income categories had smaller and broadly similar counts (699 to 965 respondents each). The sample therefore included respondents from a wide range of socioeconomic backgrounds, which is important when examining associations between income and BMI.

Bivariate Plot: BMI by Income

# Created a boxplot to examine differences in BMI across household income groups. Boxplots display the median, interquartile range, and potential outliers within each income category.

ggplot(hints_clean, aes(x = HHInc, y = BMI)) +
  geom_boxplot() +
  labs(
    title = "BMI Across Household Income Levels",
    x = "Household Income",
    y = "BMI"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The boxplot shows that BMI distributions are highly similar across all household income categories. Median BMI values remain consistent, and the interquartile ranges show comparable variability across groups. Each income category also contains multiple high‑BMI outliers, indicating that elevated BMI values occur throughout the income levels. Overall, the visual patterns do not suggest a strong association between household income and BMI.

Bivariate Plot: BMI by Weekly Moderate Exercise

# Created a scatterplot to examine the relationship between BMI and weekly minutes of moderate exercise. alpha = 0.5 improves visibility of overlapping points. x-axis is limited to 0–1000 minutes to reduce distortion from the extreme outliers.

ggplot(hints_clean, aes(x = WeeklyMinutesModerateExercise, y = BMI)) +
  geom_point(alpha = 0.5) +
  coord_cartesian(xlim = c(0, 1000)) +
  labs(
    title = "BMI and Weekly Minutes of Moderate Exercise",
    x = "Weekly Minutes of Moderate Exercise",
    y = "BMI"
  )

The scatterplot shows substantial variability in BMI across different levels of weekly moderate exercise. Although respondents with higher exercise levels occasionally demonstrated lower BMI values, the relationship appeared weak overall. Several clusters of respondents reported low to moderate exercise levels, while a smaller number of respondents reported very high levels of weekly exercise. The wide spread of BMI values suggests that other factors besides physical activity may contribute to BMI variation among U.S. adults.

Multivariate Visualization & Correlation

Multivariate Plot: BMI and Exercise by Birth Sex

# Created a multivariate scatterplot examining the relationship between BMI and weekly exercise by birth sex. Color distinguishes male and female respondents. alpha = 0.5 improves visibility of overlapping observations.
ggplot(hints_clean, aes(
  x = WeeklyMinutesModerateExercise,
  y = BMI,
  color = BirthSex
)) +
  geom_point(alpha = 0.5) +
  coord_cartesian(xlim = c(0, 1000)) +
  labs(
    title = "BMI and Weekly Moderate Exercise by Birth Sex",
    x = "Weekly Minutes of Moderate Exercise",
    y = "BMI",
    color = "Birth Sex"
  )

The multivariate scatterplot shows substantial overlap in BMI and weekly moderate exercise patterns between male and female respondents. Both groups showed considerable variability in BMI across exercise levels, and no strong linear relationship between exercise and BMI was apparent. However, the visualization suggests that BMI variation may differ slightly by birth sex, supporting the inclusion of birth sex as a control variable in the regression analysis.

Multivariate Plot: BMI by Household Income, Faceted by Age Group

# Created faceted boxplots to examine BMI differences across household income groups within each age category. facet_wrap() separates the visualization by age group to allow comparison of BMI patterns across both income and age simultaneously.

ggplot(hints_clean, aes(x = HHInc, y = BMI)) +
  geom_boxplot() +
  facet_wrap(~ AgeGrpB) +
  labs(
    title = "BMI Across Household Income Levels by Age Group",
    x = "Household Income",
    y = "BMI"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The faceted boxplots show variation in BMI across household income groups within different age categories. Middle-aged adults (respondents aged 35–64 years) generally exhibited higher BMI distributions than their younger and older counterparts. Across several age groups, higher income categories appeared to have slightly lower median BMI values than lower-income categories. These findings suggest that both income and age may be associated with BMI patterns among U.S. adults.
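Medians read from faceted boxplots can also be tabulated directly, which makes small differences between groups easier to compare. The following is a minimal sketch on invented toy data; the column names and group labels are illustrative stand-ins for `HHInc`, `AgeGrpB`, and `BMI`.

```r
# Toy data; group labels and BMI values are invented for illustration.
toy <- data.frame(
  income = factor(c("Low", "Low", "High", "High", "Low", "Low", "High", "High")),
  age    = factor(c("18-34", "18-34", "18-34", "18-34",
                    "50-64", "50-64", "50-64", "50-64")),
  bmi    = c(27, 29, 24, 26, 30, 32, 27, 29)
)

# Rows = income group, columns = age group, cells = median BMI.
median_table <- with(toy, tapply(bmi, list(income, age), median))
median_table
```

With the project data, the equivalent call would pass `BMI` with `list(HHInc, AgeGrpB)` inside `with(hints_clean, ...)`.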

Correlation Matrix

# Created subset of only quantitative variables for correlation analysis
corr_vars <- hints_clean[, c(
  "BMI",
  "AvgDrinksPerWeek",
  "WeeklyMinutesModerateExercise"
)]

# Generated a correlation matrix using complete observations only. Correlation coefficients range from -1 to 1 and indicate the strength and direction of relationships between variables.
cor_matrix <- cor(corr_vars, use = "complete.obs")

# Viewed correlation matrix
cor_matrix

##                                       BMI AvgDrinksPerWeek
## BMI                            1.00000000      -0.06796993
## AvgDrinksPerWeek              -0.06796993       1.00000000
## WeeklyMinutesModerateExercise -0.09687284       0.04725002
##                               WeeklyMinutesModerateExercise
## BMI                                             -0.09687284
## AvgDrinksPerWeek                                 0.04725002
## WeeklyMinutesModerateExercise                    1.00000000

The correlation matrix shows weak relationships among the quantitative variables included in the analysis. BMI was weakly and negatively correlated with both average drinks consumed per week (r = -0.07) and weekly minutes of moderate exercise (r = -0.10), suggesting these variables alone explain very little of the variability in BMI. The relationship between alcohol consumption and exercise levels was also weak (r = 0.05). These findings indicate that BMI is likely influenced by multiple additional behavioral, biological, and socioeconomic factors not included in the correlation analysis.
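Squaring a Pearson correlation gives the proportion of variance two variables share, which puts the size of these coefficients in perspective. The short sketch below uses the coefficients reported in the matrix above; the vector names are illustrative.

```r
# Pearson correlations copied from the correlation matrix above.
r <- c(
  bmi_drinks      = -0.06796993,
  bmi_exercise    = -0.09687284,
  drinks_exercise =  0.04725002
)

# Squaring r gives the proportion of variance shared by each pair.
shared_variance_pct <- 100 * r^2
round(shared_variance_pct, 2)
```

Each pair shares less than 1% of its variance, which is why the correlations are described as weak.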

Corrplot

# Loaded corrplot package for visualizing correlations
library(corrplot)

## Warning: package 'corrplot' was built under R version 4.5.3

## corrplot 0.95 loaded

# Generated a visual correlation matrix. method = "color" displays correlation strength using color intensity. type = "upper" shows only the upper half of the matrix to reduce redundancy. addCoef.col adds correlation coefficient values to the plot.

corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  addCoef.col = "black",
  tl.col = "black",
  tl.srt = 45
)

The correlation plot visually confirmed that the relationships among BMI, average drinks consumed per week, and weekly minutes of moderate exercise were relatively weak. The color intensity and correlation coefficients showed small negative associations between BMI and each behavior and a small positive association between alcohol consumption and exercise. Overall, the visualization supported the interpretation that BMI is influenced by a complex combination of factors beyond exercise and alcohol consumption alone.

Inferential Statistics

# Multiple Linear Regression: Evaluated the association between household income and BMI while controlling for age group and birth sex. BMI is the dependent (outcome) variable. HHInc, AgeGrpB, and BirthSex are independent (predictor) variables.

model <- lm(
  BMI ~ HHInc + AgeGrpB + BirthSex,
  data = hints_clean
)

# Displayed regression coefficients, p-values, R-squared, and overall model fit statistics.
summary(model)

## 
## Call:
## lm(formula = BMI ~ HHInc + AgeGrpB + BirthSex, data = hints_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.144  -4.736  -1.103   3.437  35.910 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                29.00635    0.30196  96.059  < 2e-16 ***
## HHInc$20,000 to < $35,000   0.30578    0.34576   0.884  0.37654    
## HHInc$35,000 to < $50,000  -0.05394    0.34381  -0.157  0.87534    
## HHInc$50,000 to < $75,000  -0.03024    0.31795  -0.095  0.92423    
## HHInc$75,000 to < $100,000 -0.50009    0.34121  -1.466  0.14280    
## HHInc$100,000 or greater   -1.77685    0.28366  -6.264 4.02e-10 ***
## AgeGrpB35-49                1.30678    0.29284   4.462 8.26e-06 ***
## AgeGrpB50-64                1.73776    0.28003   6.206 5.82e-10 ***
## AgeGrpB65-74                0.16397    0.29275   0.560  0.57544    
## AgeGrpB75+                 -0.89755    0.32671  -2.747  0.00603 ** 
## BirthSexMale               -0.46726    0.18514  -2.524  0.01163 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.854 on 5800 degrees of freedom
## Multiple R-squared:  0.02843,    Adjusted R-squared:  0.02675 
## F-statistic: 16.97 on 10 and 5800 DF,  p-value: < 2.2e-16

# Generated ANOVA table for the regression model. anova() on a single lm fit uses sequential (Type I) sums of squares, so each predictor is tested after accounting only for the predictors listed before it in the formula.
anova(model)

## Analysis of Variance Table
## 
## Response: BMI
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## HHInc        5   2736  547.15 11.6456 3.217e-11 ***
## AgeGrpB      4   4939 1234.67 26.2789 < 2.2e-16 ***
## BirthSex     1    299  299.28  6.3698   0.01163 *  
## Residuals 5800 272504   46.98                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
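Because `anova()` on a single fitted model tests terms in formula order, the income row above is not adjusted for age group or birth sex. One way to obtain an adjusted test is to compare nested models. The sketch below is illustrative only: the data are simulated, and the variable names and effect size are assumptions rather than HINTS values.

```r
set.seed(422)  # arbitrary seed so the simulated example is reproducible

# Simulated stand-in data; none of these values come from HINTS.
n <- 300
sim <- data.frame(
  income = factor(sample(c("Low", "Middle", "High"), n, replace = TRUE),
                  levels = c("Low", "Middle", "High")),
  age    = factor(sample(c("18-34", "35-64", "65+"), n, replace = TRUE)),
  sex    = factor(sample(c("Female", "Male"), n, replace = TRUE))
)
sim$bmi <- 28 - 1.5 * (sim$income == "High") + rnorm(n, sd = 6)

# Reduced model holds only the control variables; full model adds income.
reduced <- lm(bmi ~ age + sex, data = sim)
full    <- lm(bmi ~ age + sex + income, data = sim)

# Partial F-test: does income improve fit beyond age and sex?
partial_f <- anova(reduced, full)
partial_f

# drop1() reports the same kind of adjusted test for every term at once.
drop1(full, test = "F")
```

For the project model, the analogous comparison would fit `BMI ~ AgeGrpB + BirthSex` and `BMI ~ AgeGrpB + BirthSex + HHInc` on `hints_clean`.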

The multiple linear regression model was statistically significant, F(10, 5800) = 16.97, p < .001, indicating that household income, age group, and birth sex collectively explained a significant portion of BMI variability. After controlling for age group and birth sex, respondents with household incomes of $100,000 or greater had significantly lower BMI than respondents earning less than $20,000 annually (b = -1.78, p < .001); none of the other income categories differed significantly from this reference group. Respondents aged 35–49 (b = 1.31, p < .001) and 50–64 (b = 1.74, p < .001) had significantly higher BMI than respondents aged 18–34, while respondents aged 75 and older had significantly lower BMI (b = -0.90, p = .006). Male respondents had slightly lower BMI than female respondents (b = -0.47, p = .012). These results support rejecting the null hypothesis, with the income difference driven by the highest income category. Although statistically significant relationships were identified, the model explained only about 2.8% of BMI variability (R-squared = 0.028, adjusted R-squared = 0.027), suggesting additional factors not included in the analysis may influence BMI among U.S. adults.
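The reported fit statistics can be cross-checked by hand from the printed output. The sketch below recomputes R-squared and the overall F statistic from the ANOVA sums of squares and adds a 95% confidence interval for the highest-income coefficient. All inputs are copied from the output above, so small differences from the printed values reflect rounding; the object names are illustrative.

```r
# Values copied from the summary(model) and anova(model) output above.
ss_terms <- c(HHInc = 2736, AgeGrpB = 4939, BirthSex = 299)
ss_resid <- 272504
df_model <- 10
df_resid <- 5800

# R-squared: share of the total sum of squares explained by the model.
r_squared <- sum(ss_terms) / (sum(ss_terms) + ss_resid)

# Overall F: model mean square divided by residual mean square.
f_stat <- (sum(ss_terms) / df_model) / (ss_resid / df_resid)

# 95% confidence interval for the $100,000-or-greater coefficient.
est <- -1.77685
se  <-  0.28366
ci  <- est + c(-1, 1) * qt(0.975, df_resid) * se

round(c(r_squared = r_squared, f_stat = f_stat), 4)
round(ci, 2)
```

The interval lies entirely below zero, consistent with the significant negative coefficient, and calling `confint(model)` on the fitted object would return the corresponding intervals for every term.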

Kevin Roldan

2026-05-09