An Analysis of the Relationship between physical activity, age, and strength training to predict BMI

Author

Javier Mantilla

https://www.cdc.gov/yrbs/files/2017/pdf/2017_YRBS_Data_Users_Guide.pdf

Introduction

This project looks at the data from the yrbss dataset from openintro.com, and sourced from the CDC. The yrbss dataset, or the Youth Risk Behavior Surveillance System dataset collected its data from 1991-2013, and it collects data on teen’s and preteen’s safety habits. The variables included in the dataset are : age, gender, hispanic, race, height, weight, helmet_12m, text_while_driving_30d, physically_active_7d, hours_tv_per_school_day, strength_training_7d, and school_night_hours_sleep. For my project, I will be focusing on age (one’s age), physically_active_7d (days of the week that the person is physically active), strength_training_7d (how many times the person worked out one week before the survey), and BMI. For the BMI variable, I mutated it by dividing height by the weight squared. BMI is used to calculate one’s body fat and their percentile. For my cleaning, I used colSums to look at my na values, and I used filter to remove NA values in the four of my variables that I am using. Lastly, I mutated the physically_active_7d variable, and I turned it into a factor (categorical), and I forced r to see the values as numbers 0-8 so the graph would show up properly. I chose this topic as I believe that using this dataset can give insights on how active many middle and high school students are, and this allows us to see the reasons behind the trend of obesity within our country. These kids develop these habits young, and they influence them in the future as they start their adult life.

Research Question

To what extent do physical activity, strength training, and age predict BMI middle and high school students?

Load the Libraries and Dataset

library(tidyverse)

Warning: package 'ggplot2' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(highcharter)

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

setwd("~/Documents/Data 110")
yrbss <- read_csv("yrbss.csv")

Rows: 13583 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): gender, grade, hispanic, race, helmet_12m, text_while_driving_30d, ...
dbl (5): age, height, weight, physically_active_7d, strength_training_7d

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean and Explore the Data

names(yrbss) <- tolower(names(yrbss))
names(yrbss) <- gsub(" ","_",names(yrbss))
head(yrbss)

# A tibble: 6 × 13
    age gender grade hispanic race                      height weight helmet_12m
  <dbl> <chr>  <chr> <chr>    <chr>                      <dbl>  <dbl> <chr>     
1    14 female 9     not      Black or African American  NA      NA   never     
2    14 female 9     not      Black or African American  NA      NA   never     
3    15 female 9     hispanic Native Hawaiian or Other…   1.73   84.4 never     
4    15 female 9     not      Black or African American   1.6    55.8 never     
5    15 female 9     not      Black or African American   1.5    46.7 did not r…
6    15 female 9     not      Black or African American   1.57   67.1 did not r…
# ℹ 5 more variables: text_while_driving_30d <chr>, physically_active_7d <dbl>,
#   hours_tv_per_school_day <chr>, strength_training_7d <dbl>,
#   school_night_hours_sleep <chr>

This chunk makes all of my column names lowercase, and it replaces spaces in the names with underscores.

colSums(is.na(yrbss))

                     age                   gender                    grade 
                      77                       12                       79 
                hispanic                     race                   height 
                     231                     2805                     1004 
                  weight               helmet_12m   text_while_driving_30d 
                    1004                      311                      918 
    physically_active_7d  hours_tv_per_school_day     strength_training_7d 
                     273                      338                     1176 
school_night_hours_sleep 
                    1248

This checks the NA counts for each of my variables, and we will need to filter those out in the following cleaning.

yrbss_clean <- yrbss |>
  mutate(BMI = weight / (height^2))|>  
  filter(!is.na(BMI),
         !is.na(physically_active_7d),
         !is.na(age),
         !is.na(strength_training_7d))

yrbss_clean <- yrbss_clean |>
  mutate(physically_active_7d = factor(physically_active_7d, levels = 0:7))

The first chunk creates a new variable called BMI, which calculates the students’ BMI by having their height divided by their weight squared. BMI is ones body mass index, and it is used to calculate body fat and percentile. Then I used filter to remove all of the NA values in the 4 variables I will be using, BMI, physically_active_7d, age, and strength_training_7d.

The second chunk mutates the physically_active_7d variable, changing it so the categorical variable is now a factor. Before this, r would put the numbers in alphabetical order, so the x-axis would not be in order and many of my visualization’s would be hard to follow along with. So, I specifically set the level of the factor so it would ensure that my x-axis is in order in all of my graphs.

Histogram of the Distribution of BMI

ggplot(yrbss_clean, aes(x = BMI,)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  labs(title = "Distribution of BMI",
       x = "BMI",
       y = "Count") +
  theme_minimal()

This histogram shows the distribution of BMI and its counts. We see that most people have a BMI of 25-26, and it is very skewed right. This means that as the BMI goes higher, the count of the people with high BMI goes down significiantly.

ggplot(yrbss_clean, aes(x = physically_active_7d)) +
  geom_bar( fill = "limegreen", color = "black") +
  labs(
    title = "Distribution of Physical Activity (7 Days)",
    x = "Days Physically Active",
    y = "Count"
  ) +
  theme_minimal()

This barplot shows the counts of the days students are physically active. We see that most students exercise 7 days a week, with the lowest amount being students who exercise 6 days a week. Something that surprised me about this plot was the amount of students who exercise 0 days a week.

Backwards Elimination Multiple Linear Regression Model

full_model <- lm(BMI ~ age + strength_training_7d + physically_active_7d, data = yrbss_clean)
summary(full_model)


Call:
lm(formula = BMI ~ age + strength_training_7d + physically_active_7d, 
    data = yrbss_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.114  -3.333  -1.122   2.167  29.426 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           15.58947    0.61984  25.151  < 2e-16 ***
age                    0.51503    0.03717  13.855  < 2e-16 ***
strength_training_7d  -0.02623    0.02305  -1.138 0.255211    
physically_active_7d1  0.02600    0.20917   0.124 0.901077    
physically_active_7d2  0.03968    0.19085   0.208 0.835324    
physically_active_7d3  0.23761    0.18378   1.293 0.196066    
physically_active_7d4 -0.04864    0.19377  -0.251 0.801821    
physically_active_7d5 -0.20116    0.18361  -1.096 0.273293    
physically_active_7d6 -0.37357    0.23064  -1.620 0.105317    
physically_active_7d7 -0.63897    0.17702  -3.610 0.000308 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.973 on 11512 degrees of freedom
Multiple R-squared:  0.02208,   Adjusted R-squared:  0.02131 
F-statistic: 28.88 on 9 and 11512 DF,  p-value: < 2.2e-16

Full Model Equation

BMI = 15.589 + 0.515(Age) − 0.026(Strength Training) + 0.026(Activity₁) + 0.040(Activity₂) + 0.238(Activity₃) − 0.049(Activity₄) − 0.201(Activity₅) − 0.374(Activity₆) − 0.639(Activity₇)

Full Model Explanation

The full model shows that age is a statistically significant predictor, with its p-value (2e-16) < 0.05, meaning that one’s BMI increases as they get older. For our strength training variable, our p-value is 0.255211, which is greater than our alpha level of 0.05. This means that strength training is not statistically significant in predicting one’s BMI. For physical activity, we see the levels from 0-6 days all have a p-value over 0.05, which means that if the middle and high school students only exercise 0-6 days out of the week, it doesn’t significantly affect BMI. However, if students exercise all 7 days of the week, we see that the p-value is 0.000308, which means that exercising all 7 days of the week will show a statistically significant relationship with BMI. Our adjusted R-squared is 0.02131, which means that the full model only explains 2.1% of the variation in BMI. Even though some of our predictors are statistically significant, the full model is very weak and there is definitely more factors that influence BMI.

backwards_model <- lm(BMI ~ age + physically_active_7d, data = yrbss_clean)
summary(backwards_model)


Call:
lm(formula = BMI ~ age + physically_active_7d, data = yrbss_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.983  -3.340  -1.117   2.166  29.445 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           15.54945    0.61885  25.126  < 2e-16 ***
age                    0.51623    0.03716  13.893  < 2e-16 ***
physically_active_7d1  0.01605    0.20899   0.077   0.9388    
physically_active_7d2  0.01674    0.18979   0.088   0.9297    
physically_active_7d3  0.20129    0.18099   1.112   0.2661    
physically_active_7d4 -0.10023    0.18839  -0.532   0.5947    
physically_active_7d5 -0.27147    0.17290  -1.570   0.1164    
physically_active_7d6 -0.46007    0.21775  -2.113   0.0346 *  
physically_active_7d7 -0.75208    0.14648  -5.134 2.88e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.973 on 11513 degrees of freedom
Multiple R-squared:  0.02197,   Adjusted R-squared:  0.02129 
F-statistic: 32.32 on 8 and 11513 DF,  p-value: < 2.2e-16

Backwards Elimination Model Equation

BMI = 15.549 + 0.516(Age) + 0.016(Activity₁) + 0.017(Activity₂) + 0.201(Activity₃) − 0.100(Activity₄) − 0.271(Activity₅) − 0.460(Activity₆) − 0.752(Activity₇)

Backwards Elimination Model Explanation

The backwards elimination model shows that age is still a significant predictor, with it’s p-value being 2e-16, which is less than the alpha level of 0.05. We removed the strength_training_7d variable, as it had the highest p-value in the model. For our physically active variable, we see that exercising for 0-5 days in the week has no significant relationship with BMI, as all of the p-values from 0-5 are greater than 0.05. However, we now have exercising 6 days a week being a significant predictor of BMI. With its p-value of 0.03, and exercising 7 days a week having a p-value of 2.88e-07, we can determine that exercising 6 or more days a week will show a significantly lower BMI than students who exercise for 0-5 days a week. Our adjusted R-squared is 0.02129, which means that the backward elimination model only explains for 2.1 % of variation in BMI.

The adjusted R² values for the two models are very similar, with the full model (including strength training) having an adjusted R² of 0.02131 and the reduced model (excluding strength training) having an adjusted R² of 0.02129. This extremely small difference indicates that removing strength training does not meaningfully reduce the model’s explanatory power. In other words, strength training does not contribute additional useful information for predicting BMI in this context. Therefore, the simpler model is preferred because it achieves nearly the same level of explanation while being more parsimonious and easier to interpret.

Highcharter Boxplot

hcboxplot(x = yrbss_clean$BMI, var = yrbss_clean$physically_active_7d) |>
  hc_chart(type = "column") |>
  hc_title(text = " The Relationship Between BMI and Days out of the Week Exercised") |>
  hc_xAxis(title = list( text = "Days out of the Week Exercised")) |>
  hc_yAxis(title = list(text = "BMI")) |>
  hc_caption(title = list(text = "Source: CDC"))

Warning in hcboxplot(x = yrbss_clean$BMI, var = yrbss_clean$physically_active_7d): 'hcboxplot' is deprecated.
Use 'data_to_boxplot' instead.
See help("Deprecated")

Warning: `unite_()` was deprecated in tidyr 1.2.0.
ℹ Please use `unite()` instead.
ℹ The deprecated feature was likely used in the highcharter package.
  Please report the issue at <https://github.com/jbkunst/highcharter/issues>.

As the days of the week that people exercised go up, we see that the BMI goes down. For 0 days, it has a high variation, which means that the values are more spread out from 20-27. We see that the 0 BMI has the maximum BMI out of all of the days. For 1-3 days, it looks the same as 0 days, which shows that only exercising 1-3 days does not influence BMI significantly. However, as we go further in the week from 4-7 days, we see the maximum BMI drop significantly. The maximum BMI from exercising 0 days is 36, but exercising 7 days gives the maximum BMI of 32. Overall, this highcharter boxplot shows that as the days of exercise increase, the BMI decreases.

GGplot Boxplot (With Colors)

ggplot(yrbss_clean, aes(x = physically_active_7d, y = BMI, fill = physically_active_7d)) +
  geom_boxplot() +
  labs(
    title = "BMI Distribution by Days of Physical Activity",
    x = "Days Physically Active (Past 7 Days)",
    y = "BMI",
    caption = "Source : CDC",
    fill = "Days Active"
  ) +
  
  scale_fill_brewer(palette = "Set2") +
  
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    legend.position = "right"
  )

This ggplot is the same highcharter as above, just with outliers and colors. We see that 7 days physically active have a lot of outliers condensed around 32-42, which means that there is a high concentration around that BMI. We see that the maximum BMI collected was 55, which comes from the group that exercises 0 days a week. As the days of the week go on, the variation in the counts goes down as well, which means that there is a more condensed counts of people in those groups.

Conclusion

Findings Conclusion

In conclusion, this analysis examined how age, strength training, and physical activity relate to BMI among high school students using statistical inference. The results showed that age is a statistically significant predictor of BMI, with BMI increasing slightly as age increases. Physical activity had limited significance overall, although being active 6–7 days per week was associated with lower BMI. Strength training was not statistically significant and did not improve the model’s explanatory power, which was confirmed by the nearly identical adjusted R² values between the full and reduced models. Overall, the model explains only a small portion of the variation in BMI, suggesting that other factors not included in this analysis likely play a larger role in influencing BMI.

Future Implications

These findings suggest that while age and higher levels of physical activity (especially being active most days of the week) have some relationship with BMI, they explain only a small portion of the overall variation. This implies that focusing solely on activity levels may not be enough to fully understand or address differences in BMI among students. Other factors—such as diet, genetics, and lifestyle habits—are likely more influential and should be considered in future research or interventions.

Works Cited

Dataset : YRBSS Dataset

Image: YRBSS Image