This project looks at the data from the yrbss dataset from openintro.com, and sourced from the CDC. The yrbss dataset, or the Youth Risk Behavior Surveillance System dataset collected its data from 1991-2013, and it collects data on teen’s and preteen’s safety habits. The variables included in the dataset are : age, gender, hispanic, race, height, weight, helmet_12m, text_while_driving_30d, physically_active_7d, hours_tv_per_school_day, strength_training_7d, and school_night_hours_sleep. For my project, I will be focusing on age (one’s age), physically_active_7d (days of the week that the person is physically active), strength_training_7d (how many times the person worked out one week before the survey), and BMI. For the BMI variable, I mutated it by dividing height by the weight squared. BMI is used to calculate one’s body fat and their percentile. For my cleaning, I used colSums to look at my na values, and I used filter to remove NA values in the four of my variables that I am using. Lastly, I mutated the physically_active_7d variable, and I turned it into a factor (categorical), and I forced r to see the values as numbers 0-8 so the graph would show up properly. I chose this topic as I believe that using this dataset can give insights on how active many middle and high school students are, and this allows us to see the reasons behind the trend of obesity within our country. These kids develop these habits young, and they influence them in the future as they start their adult life.
Research Question
To what extent do physical activity, strength training, and age predict BMI middle and high school students?
Load the Libraries and Dataset
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(highcharter)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Rows: 13583 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): gender, grade, hispanic, race, helmet_12m, text_while_driving_30d, ...
dbl (5): age, height, weight, physically_active_7d, strength_training_7d
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 13
age gender grade hispanic race height weight helmet_12m
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
1 14 female 9 not Black or African American NA NA never
2 14 female 9 not Black or African American NA NA never
3 15 female 9 hispanic Native Hawaiian or Other… 1.73 84.4 never
4 15 female 9 not Black or African American 1.6 55.8 never
5 15 female 9 not Black or African American 1.5 46.7 did not r…
6 15 female 9 not Black or African American 1.57 67.1 did not r…
# ℹ 5 more variables: text_while_driving_30d <chr>, physically_active_7d <dbl>,
# hours_tv_per_school_day <chr>, strength_training_7d <dbl>,
# school_night_hours_sleep <chr>
This chunk makes all of my column names lowercase, and it replaces spaces in the names with underscores.
The first chunk creates a new variable called BMI, which calculates the students’ BMI by having their height divided by their weight squared. BMI is ones body mass index, and it is used to calculate body fat and percentile. Then I used filter to remove all of the NA values in the 4 variables I will be using, BMI, physically_active_7d, age, and strength_training_7d.
The second chunk mutates the physically_active_7d variable, changing it so the categorical variable is now a factor. Before this, r would put the numbers in alphabetical order, so the x-axis would not be in order and many of my visualization’s would be hard to follow along with. So, I specifically set the level of the factor so it would ensure that my x-axis is in order in all of my graphs.
Histogram of the Distribution of BMI
ggplot(yrbss_clean, aes(x = BMI,)) +geom_histogram(bins =30, fill ="steelblue", color ="black") +labs(title ="Distribution of BMI",x ="BMI",y ="Count") +theme_minimal()
This histogram shows the distribution of BMI and its counts. We see that most people have a BMI of 25-26, and it is very skewed right. This means that as the BMI goes higher, the count of the people with high BMI goes down significiantly.
ggplot(yrbss_clean, aes(x = physically_active_7d)) +geom_bar( fill ="limegreen", color ="black") +labs(title ="Distribution of Physical Activity (7 Days)",x ="Days Physically Active",y ="Count" ) +theme_minimal()
This barplot shows the counts of the days students are physically active. We see that most students exercise 7 days a week, with the lowest amount being students who exercise 6 days a week. Something that surprised me about this plot was the amount of students who exercise 0 days a week.
Backwards Elimination Multiple Linear Regression Model
full_model <-lm(BMI ~ age + strength_training_7d + physically_active_7d, data = yrbss_clean)summary(full_model)
Call:
lm(formula = BMI ~ age + strength_training_7d + physically_active_7d,
data = yrbss_clean)
Residuals:
Min 1Q Median 3Q Max
-11.114 -3.333 -1.122 2.167 29.426
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.58947 0.61984 25.151 < 2e-16 ***
age 0.51503 0.03717 13.855 < 2e-16 ***
strength_training_7d -0.02623 0.02305 -1.138 0.255211
physically_active_7d1 0.02600 0.20917 0.124 0.901077
physically_active_7d2 0.03968 0.19085 0.208 0.835324
physically_active_7d3 0.23761 0.18378 1.293 0.196066
physically_active_7d4 -0.04864 0.19377 -0.251 0.801821
physically_active_7d5 -0.20116 0.18361 -1.096 0.273293
physically_active_7d6 -0.37357 0.23064 -1.620 0.105317
physically_active_7d7 -0.63897 0.17702 -3.610 0.000308 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.973 on 11512 degrees of freedom
Multiple R-squared: 0.02208, Adjusted R-squared: 0.02131
F-statistic: 28.88 on 9 and 11512 DF, p-value: < 2.2e-16
The full model shows that age is a statistically significant predictor, with its p-value (2e-16) < 0.05, meaning that one’s BMI increases as they get older. For our strength training variable, our p-value is 0.255211, which is greater than our alpha level of 0.05. This means that strength training is not statistically significant in predicting one’s BMI. For physical activity, we see the levels from 0-6 days all have a p-value over 0.05, which means that if the middle and high school students only exercise 0-6 days out of the week, it doesn’t significantly affect BMI. However, if students exercise all 7 days of the week, we see that the p-value is 0.000308, which means that exercising all 7 days of the week will show a statistically significant relationship with BMI. Our adjusted R-squared is 0.02131, which means that the full model only explains 2.1% of the variation in BMI. Even though some of our predictors are statistically significant, the full model is very weak and there is definitely more factors that influence BMI.
backwards_model <-lm(BMI ~ age + physically_active_7d, data = yrbss_clean)summary(backwards_model)
Call:
lm(formula = BMI ~ age + physically_active_7d, data = yrbss_clean)
Residuals:
Min 1Q Median 3Q Max
-10.983 -3.340 -1.117 2.166 29.445
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.54945 0.61885 25.126 < 2e-16 ***
age 0.51623 0.03716 13.893 < 2e-16 ***
physically_active_7d1 0.01605 0.20899 0.077 0.9388
physically_active_7d2 0.01674 0.18979 0.088 0.9297
physically_active_7d3 0.20129 0.18099 1.112 0.2661
physically_active_7d4 -0.10023 0.18839 -0.532 0.5947
physically_active_7d5 -0.27147 0.17290 -1.570 0.1164
physically_active_7d6 -0.46007 0.21775 -2.113 0.0346 *
physically_active_7d7 -0.75208 0.14648 -5.134 2.88e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.973 on 11513 degrees of freedom
Multiple R-squared: 0.02197, Adjusted R-squared: 0.02129
F-statistic: 32.32 on 8 and 11513 DF, p-value: < 2.2e-16
The backwards elimination model shows that age is still a significant predictor, with it’s p-value being 2e-16, which is less than the alpha level of 0.05. We removed the strength_training_7d variable, as it had the highest p-value in the model. For our physically active variable, we see that exercising for 0-5 days in the week has no significant relationship with BMI, as all of the p-values from 0-5 are greater than 0.05. However, we now have exercising 6 days a week being a significant predictor of BMI. With its p-value of 0.03, and exercising 7 days a week having a p-value of 2.88e-07, we can determine that exercising 6 or more days a week will show a significantly lower BMI than students who exercise for 0-5 days a week. Our adjusted R-squared is 0.02129, which means that the backward elimination model only explains for 2.1 % of variation in BMI.
The adjusted R² values for the two models are very similar, with the full model (including strength training) having an adjusted R² of 0.02131 and the reduced model (excluding strength training) having an adjusted R² of 0.02129. This extremely small difference indicates that removing strength training does not meaningfully reduce the model’s explanatory power. In other words, strength training does not contribute additional useful information for predicting BMI in this context. Therefore, the simpler model is preferred because it achieves nearly the same level of explanation while being more parsimonious and easier to interpret.
Highcharter Boxplot
hcboxplot(x = yrbss_clean$BMI, var = yrbss_clean$physically_active_7d) |>hc_chart(type ="column") |>hc_title(text =" The Relationship Between BMI and Days out of the Week Exercised") |>hc_xAxis(title =list( text ="Days out of the Week Exercised")) |>hc_yAxis(title =list(text ="BMI")) |>hc_caption(title =list(text ="Source: CDC"))
Warning in hcboxplot(x = yrbss_clean$BMI, var = yrbss_clean$physically_active_7d): 'hcboxplot' is deprecated.
Use 'data_to_boxplot' instead.
See help("Deprecated")
Warning: `unite_()` was deprecated in tidyr 1.2.0.
ℹ Please use `unite()` instead.
ℹ The deprecated feature was likely used in the highcharter package.
Please report the issue at <https://github.com/jbkunst/highcharter/issues>.
As the days of the week that people exercised go up, we see that the BMI goes down. For 0 days, it has a high variation, which means that the values are more spread out from 20-27. We see that the 0 BMI has the maximum BMI out of all of the days. For 1-3 days, it looks the same as 0 days, which shows that only exercising 1-3 days does not influence BMI significantly. However, as we go further in the week from 4-7 days, we see the maximum BMI drop significantly. The maximum BMI from exercising 0 days is 36, but exercising 7 days gives the maximum BMI of 32. Overall, this highcharter boxplot shows that as the days of exercise increase, the BMI decreases.
GGplot Boxplot (With Colors)
ggplot(yrbss_clean, aes(x = physically_active_7d, y = BMI, fill = physically_active_7d)) +geom_boxplot() +labs(title ="BMI Distribution by Days of Physical Activity",x ="Days Physically Active (Past 7 Days)",y ="BMI",caption ="Source : CDC",fill ="Days Active" ) +scale_fill_brewer(palette ="Set2") +theme_minimal() +theme(plot.title =element_text(size =16, face ="bold"),legend.position ="right" )
This ggplot is the same highcharter as above, just with outliers and colors. We see that 7 days physically active have a lot of outliers condensed around 32-42, which means that there is a high concentration around that BMI. We see that the maximum BMI collected was 55, which comes from the group that exercises 0 days a week. As the days of the week go on, the variation in the counts goes down as well, which means that there is a more condensed counts of people in those groups.
Conclusion
Findings Conclusion
In conclusion, this analysis examined how age, strength training, and physical activity relate to BMI among high school students using statistical inference. The results showed that age is a statistically significant predictor of BMI, with BMI increasing slightly as age increases. Physical activity had limited significance overall, although being active 6–7 days per week was associated with lower BMI. Strength training was not statistically significant and did not improve the model’s explanatory power, which was confirmed by the nearly identical adjusted R² values between the full and reduced models. Overall, the model explains only a small portion of the variation in BMI, suggesting that other factors not included in this analysis likely play a larger role in influencing BMI.
Future Implications
These findings suggest that while age and higher levels of physical activity (especially being active most days of the week) have some relationship with BMI, they explain only a small portion of the overall variation. This implies that focusing solely on activity levels may not be enough to fully understand or address differences in BMI among students. Other factors—such as diet, genetics, and lifestyle habits—are likely more influential and should be considered in future research or interventions.