Predicting the Weights of Students Based on Other Variables

Image Source: pngtree.com

I. Introduction

What variables are most significant at predicting the weight of a participant? The Youth Risk Behavior Surveillance System (YRBSS) contains surveys on students from 1991-2013 in the United States. This data is collected and provided by the Centers for Disease Control and Prevention (CDC) and is available to download on Openintro.

In this project, I will use a multiple linear regression model that inputs every variable in this dataset to answer my research question of what variables are most significant at predicting the weight of a participant. From this question stems many others, for example, how much does one’s weight change for every increase in the number of days they exercise? Or, is texting while driving significant at predicting weight? This data was collected through conducting surveys of students from 1991-2013 in the United States.

I chose to do this topic because it can be insightful if all assumptions of the model pass and the model explains the variability of weight well. As stated before, it can answer numerous questions regarding weight. I first thought of doing height, but height is usually genetic. So, I decided to predict weight because weight can change depending on a variety of factors. In this project, I will see which of those factors, in this dataset, are most important to weight–if any.

A. Load the Libraries and Dataset

library(tidyverse)
library(highcharter)
library(RColorBrewer) # for color palettes

setwd("C:/Users/sarah/Downloads") 
yrb <- read_csv("yrbss.csv")

## Rows: 13583 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): gender, grade, hispanic, race, helmet_12m, text_while_driving_30d, ...
## dbl (5): age, height, weight, physically_active_7d, strength_training_7d
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(yrb)

## # A tibble: 6 × 13
##     age gender grade hispanic race                      height weight helmet_12m
##   <dbl> <chr>  <chr> <chr>    <chr>                      <dbl>  <dbl> <chr>     
## 1    14 female 9     not      Black or African American  NA      NA   never     
## 2    14 female 9     not      Black or African American  NA      NA   never     
## 3    15 female 9     hispanic Native Hawaiian or Other…   1.73   84.4 never     
## 4    15 female 9     not      Black or African American   1.6    55.8 never     
## 5    15 female 9     not      Black or African American   1.5    46.7 did not r…
## 6    15 female 9     not      Black or African American   1.57   67.1 did not r…
## # ℹ 5 more variables: text_while_driving_30d <chr>, physically_active_7d <dbl>,
## #   hours_tv_per_school_day <chr>, strength_training_7d <dbl>,
## #   school_night_hours_sleep <chr>

II. Cleaning and EDA

In this section, I will clean the dataset, change variable names, remove N/A values for the weight variable, and create a new data frame for my plot. For my EDA, I will convert categorical variables and discrete variables to factors, check numerical and categorical variables, and check unique values for categorical variables.

A. Cleaning

Checking for N/A values.

colSums(is.na(yrb))

##                      age                   gender                    grade 
##                       77                       12                       79 
##                 hispanic                     race                   height 
##                      231                     2805                     1004 
##                   weight               helmet_12m   text_while_driving_30d 
##                     1004                      311                      918 
##     physically_active_7d  hours_tv_per_school_day     strength_training_7d 
##                      273                      338                     1176 
## school_night_hours_sleep 
##                     1248

A lot of N/A values. Since my main variable is weight, I will be removing N/A values for that. For the other variables, if N/A values are conflicting in a certain place, I will remove them in that place.

yrb_clean <- yrb |>
  filter(!is.na(weight)) # remove N/A's
head(yrb_clean)

## # A tibble: 6 × 13
##     age gender grade hispanic race                      height weight helmet_12m
##   <dbl> <chr>  <chr> <chr>    <chr>                      <dbl>  <dbl> <chr>     
## 1    15 female 9     hispanic Native Hawaiian or Other…   1.73   84.4 never     
## 2    15 female 9     not      Black or African American   1.6    55.8 never     
## 3    15 female 9     not      Black or African American   1.5    46.7 did not r…
## 4    15 female 9     not      Black or African American   1.57   67.1 did not r…
## 5    15 female 9     not      Black or African American   1.65  132.  did not r…
## 6    14 male   9     not      Black or African American   1.88   71.2 never     
## # ℹ 5 more variables: text_while_driving_30d <chr>, physically_active_7d <dbl>,
## #   hours_tv_per_school_day <chr>, strength_training_7d <dbl>,
## #   school_night_hours_sleep <chr>

Create the plot for a boxplot by selecting the two necessary variables and filtering out extra N/A values.

plot_one <- yrb_clean |>
  select(weight, race) |> # select the two variables I want/ 
  filter(!is.na(race)) |> # filter out n/a's
  arrange(desc(weight)) # arrange weight from highest to lowest
head(plot_one)

## # A tibble: 6 × 2
##   weight race                            
##    <dbl> <chr>                           
## 1   181. American Indian or Alaska Native
## 2   163. Black or African American       
## 3   163. Black or African American       
## 4   160. Black or African American       
## 5   159. White                           
## 6   159. White

B. EDA

Checking the class of the main variable.

class(yrb_clean$weight)

## [1] "numeric"

Checking the structure of the dataset

str(yrb_clean)

## spc_tbl_ [12,579 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age                     : num [1:12579] 15 15 15 15 15 14 15 15 15 15 ...
##  $ gender                  : chr [1:12579] "female" "female" "female" "female" ...
##  $ grade                   : chr [1:12579] "9" "9" "9" "9" ...
##  $ hispanic                : chr [1:12579] "hispanic" "not" "not" "not" ...
##  $ race                    : chr [1:12579] "Native Hawaiian or Other Pacific Islander" "Black or African American" "Black or African American" "Black or African American" ...
##  $ height                  : num [1:12579] 1.73 1.6 1.5 1.57 1.65 1.88 1.75 1.37 1.68 1.65 ...
##  $ weight                  : num [1:12579] 84.4 55.8 46.7 67.1 131.5 ...
##  $ helmet_12m              : chr [1:12579] "never" "never" "did not ride" "did not ride" ...
##  $ text_while_driving_30d  : chr [1:12579] "30" "0" "did not drive" "did not drive" ...
##  $ physically_active_7d    : num [1:12579] 7 0 2 1 4 4 5 0 0 0 ...
##  $ hours_tv_per_school_day : chr [1:12579] "5+" "2" "3" "5+" ...
##  $ strength_training_7d    : num [1:12579] 0 0 1 0 2 0 3 0 3 0 ...
##  $ school_night_hours_sleep: chr [1:12579] "<5" "6" "9" "8" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   gender = col_character(),
##   ..   grade = col_character(),
##   ..   hispanic = col_character(),
##   ..   race = col_character(),
##   ..   height = col_double(),
##   ..   weight = col_double(),
##   ..   helmet_12m = col_character(),
##   ..   text_while_driving_30d = col_character(),
##   ..   physically_active_7d = col_double(),
##   ..   hours_tv_per_school_day = col_character(),
##   ..   strength_training_7d = col_double(),
##   ..   school_night_hours_sleep = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

I can convert a lot of these variables to factors.

yrb_clean$helmet_12m <- as.factor(yrb_clean$helmet_12m)
yrb_clean$race <- as.factor(yrb_clean$race)
yrb_clean$text_while_driving_30d <- as.factor(yrb_clean$text_while_driving_30d)
yrb_clean$hispanic <- as.factor(yrb_clean$hispanic)
yrb_clean$hours_tv_per_school_day <- as.factor(yrb_clean$hours_tv_per_school_day)

Calculate summary statistics of the entire dataset

summary(yrb_clean)

##       age           gender             grade               hispanic   
##  Min.   :12.00   Length:12579       Length:12579       hispanic:3166  
##  1st Qu.:15.00   Class :character   Class :character   not     :9282  
##  Median :16.00   Mode  :character   Mode  :character   NA's    : 131  
##  Mean   :16.17                                                        
##  3rd Qu.:17.00                                                        
##  Max.   :18.00                                                        
##                                                                       
##                                         race          height     
##  American Indian or Alaska Native         : 295   Min.   :1.270  
##  Asian                                    : 499   1st Qu.:1.600  
##  Black or African American                :2904   Median :1.680  
##  Native Hawaiian or Other Pacific Islander: 242   Mean   :1.691  
##  White                                    :6112   3rd Qu.:1.780  
##  NA's                                     :2527   Max.   :2.110  
##                                                                  
##      weight              helmet_12m     text_while_driving_30d
##  Min.   : 29.94   always      : 359   0            :4462      
##  1st Qu.: 56.25   did not ride:4243   did not drive:4293      
##  Median : 64.41   most of time: 274   1-2          : 887      
##  Mean   : 67.91   never       :6555   30           : 756      
##  3rd Qu.: 76.20   rarely      : 660   3-5          : 457      
##  Max.   :180.99   sometimes   : 316   (Other)      : 928      
##                   NA's        : 172   NA's         : 796      
##  physically_active_7d hours_tv_per_school_day strength_training_7d
##  Min.   :0.000        2           :2548       Min.   :0.000       
##  1st Qu.:2.000        <1          :2021       1st Qu.:0.000       
##  Median :4.000        3           :1995       Median :3.000       
##  Mean   :3.924        do not watch:1671       Mean   :2.969       
##  3rd Qu.:7.000        1           :1667       3rd Qu.:5.000       
##  Max.   :7.000        (Other)     :2406       Max.   :7.000       
##  NA's   :215          NA's        : 271       NA's   :1044        
##  school_night_hours_sleep
##  Length:12579            
##  Class :character        
##  Mode  :character        
##                          
##                          
##                          
##

The mean weight is 67.91.

III. Multiple Linear Regression

In this section, I will input every variable into the model to predict weight. I will then use backwards elimination to figure out what variables are most significant at predicting weight.

model1 <- lm(weight ~ ., data = yrb_clean) # input every variable to predict weight

summary(model1)

## 
## Call:
## lm(formula = weight ~ ., data = yrb_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.381  -9.423  -2.944   6.076  88.205 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                   -83.62740    5.45119 -15.341
## age                                             1.40507    0.26808   5.241
## gendermale                                      1.75048    0.42681   4.101
## grade11                                         0.33440    0.52668   0.635
## grade12                                        -0.09721    0.67564  -0.144
## grade9                                         -0.21348    0.52871  -0.404
## gradeother                                     -6.80660    4.33703  -1.569
## hispanicnot                                    -0.62765    0.50033  -1.254
## raceAsian                                      -5.22575    1.19801  -4.362
## raceBlack or African American                  -0.14622    1.02588  -0.143
## raceNative Hawaiian or Other Pacific Islander  -1.49259    1.44674  -1.032
## raceWhite                                      -1.08876    0.98878  -1.101
## height                                         77.02921    2.03855  37.786
## helmet_12mdid not ride                          2.31071    0.92008   2.511
## helmet_12mmost of time                          1.42472    1.31819   1.081
## helmet_12mnever                                 1.53801    0.90390   1.702
## helmet_12mrarely                                1.43069    1.11502   1.283
## helmet_12msometimes                             0.87814    1.27691   0.688
## text_while_driving_30d1-2                      -0.79007    0.61996  -1.274
## text_while_driving_30d10-19                    -1.08359    0.90642  -1.195
## text_while_driving_30d20-29                     0.23360    1.00806   0.232
## text_while_driving_30d3-5                       0.12379    0.81799   0.151
## text_while_driving_30d30                        1.03782    0.66647   1.557
## text_while_driving_30d6-9                       0.98795    0.98953   0.998
## text_while_driving_30ddid not drive             0.44461    0.38483   1.155
## physically_active_7d                           -0.29350    0.08036  -3.652
## hours_tv_per_school_day1                        0.08785    0.56731   0.155
## hours_tv_per_school_day2                        0.63437    0.51483   1.232
## hours_tv_per_school_day3                        1.64046    0.55461   2.958
## hours_tv_per_school_day4                        1.34191    0.69672   1.926
## hours_tv_per_school_day5+                       2.07755    0.62478   3.325
## hours_tv_per_school_daydo not watch            -0.94316    0.57523  -1.640
## strength_training_7d                           -0.07259    0.07879  -0.921
## school_night_hours_sleep10+                    -0.45980    1.23830  -0.371
## school_night_hours_sleep5                      -1.28867    0.73895  -1.744
## school_night_hours_sleep6                      -1.67981    0.67786  -2.478
## school_night_hours_sleep7                      -2.73447    0.66251  -4.127
## school_night_hours_sleep8                      -2.74442    0.68675  -3.996
## school_night_hours_sleep9                      -3.65133    0.87988  -4.150
##                                               Pr(>|t|)    
## (Intercept)                                    < 2e-16 ***
## age                                           1.63e-07 ***
## gendermale                                    4.15e-05 ***
## grade11                                       0.525504    
## grade12                                       0.885596    
## grade9                                        0.686382    
## gradeother                                    0.116589    
## hispanicnot                                   0.209703    
## raceAsian                                     1.30e-05 ***
## raceBlack or African American                 0.886665    
## raceNative Hawaiian or Other Pacific Islander 0.302247    
## raceWhite                                     0.270879    
## height                                         < 2e-16 ***
## helmet_12mdid not ride                        0.012043 *  
## helmet_12mmost of time                        0.279811    
## helmet_12mnever                               0.088881 .  
## helmet_12mrarely                              0.199491    
## helmet_12msometimes                           0.491659    
## text_while_driving_30d1-2                     0.202565    
## text_while_driving_30d10-19                   0.231942    
## text_while_driving_30d20-29                   0.816752    
## text_while_driving_30d3-5                     0.879716    
## text_while_driving_30d30                      0.119463    
## text_while_driving_30d6-9                     0.318116    
## text_while_driving_30ddid not drive           0.247987    
## physically_active_7d                          0.000261 ***
## hours_tv_per_school_day1                      0.876948    
## hours_tv_per_school_day2                      0.217908    
## hours_tv_per_school_day3                      0.003106 ** 
## hours_tv_per_school_day4                      0.054133 .  
## hours_tv_per_school_day5+                     0.000887 ***
## hours_tv_per_school_daydo not watch           0.101119    
## strength_training_7d                          0.356888    
## school_night_hours_sleep10+                   0.710415    
## school_night_hours_sleep5                     0.081210 .  
## school_night_hours_sleep6                     0.013228 *  
## school_night_hours_sleep7                     3.70e-05 ***
## school_night_hours_sleep8                     6.49e-05 ***
## school_night_hours_sleep9                     3.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.29 on 8312 degrees of freedom
##   (4228 observations deleted due to missingness)
## Multiple R-squared:  0.2917, Adjusted R-squared:  0.2885 
## F-statistic: 90.09 on 38 and 8312 DF,  p-value: < 2.2e-16

Since it’s hard to tell which variables are significant because some variables have multiple categories, I will use an anova test of the model to figure out what p-value is the highest for the backwards elimination.

Source: https://www.statalist.org/forums/forum/general-stata-discussion/general/956382-categorical-variables-in-backward-regression-model

Also, my Data 101 professor said I can use anova to figure out what p-value is the highest if there’s a variable with multiple categories.

anova(model1)

## Analysis of Variance Table
## 
## Response: weight
##                            Df  Sum Sq Mean Sq   F value    Pr(>F)    
## age                         1   80986   80986  396.3298 < 2.2e-16 ***
## gender                      1  275781  275781 1349.6261 < 2.2e-16 ***
## grade                       4    2343     586    2.8670  0.021840 *  
## hispanic                    1    1141    1141    5.5833  0.018155 *  
## race                        4   23764    5941   29.0745 < 2.2e-16 ***
## height                      1  289308  289308 1415.8245 < 2.2e-16 ***
## helmet_12m                  5    3475     695    3.4010  0.004517 ** 
## text_while_driving_30d      7    2501     357    1.7485  0.093116 .  
## physically_active_7d        1    7062    7062   34.5623 4.288e-09 ***
## hours_tv_per_school_day     6    6347    1058    5.1766 2.525e-05 ***
## strength_training_7d        1     198     198    0.9670  0.325452    
## school_night_hours_sleep    6    6638    1106    5.4144 1.349e-05 ***
## Residuals                8312 1698466     204                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using the ANOVA test, we can see that the highest p-value is strength_training_7d, with a p-value of 0.325.

yrb_clean2 <- yrb_clean |>
  select(!strength_training_7d)

model2 <- lm(weight ~ ., data = yrb_clean2)

summary(model2)

## 
## Call:
## lm(formula = weight ~ ., data = yrb_clean2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.016  -9.440  -2.956   6.093  88.351 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                   -83.55656    5.44236 -15.353
## age                                             1.39352    0.26786   5.202
## gendermale                                      1.69731    0.42489   3.995
## grade11                                         0.35777    0.52601   0.680
## grade12                                        -0.06388    0.67457  -0.095
## grade9                                         -0.25662    0.52792  -0.486
## gradeother                                     -6.81447    4.33653  -1.571
## hispanicnot                                    -0.61661    0.49976  -1.234
## raceAsian                                      -5.23699    1.19784  -4.372
## raceBlack or African American                  -0.15226    1.02559  -0.148
## raceNative Hawaiian or Other Pacific Islander  -1.49344    1.44658  -1.032
## raceWhite                                      -1.08648    0.98861  -1.099
## height                                         77.01842    2.03583  37.831
## helmet_12mdid not ride                          2.36751    0.91849   2.578
## helmet_12mmost of time                          1.46237    1.31703   1.110
## helmet_12mnever                                 1.57397    0.90232   1.744
## helmet_12mrarely                                1.47655    1.11377   1.326
## helmet_12msometimes                             0.93170    1.27575   0.730
## text_while_driving_30d1-2                      -0.77384    0.61982  -1.248
## text_while_driving_30d10-19                    -1.08240    0.90612  -1.195
## text_while_driving_30d20-29                     0.22794    1.00757   0.226
## text_while_driving_30d3-5                       0.13662    0.81664   0.167
## text_while_driving_30d30                        1.02773    0.66558   1.544
## text_while_driving_30d6-9                       0.99659    0.98933   1.007
## text_while_driving_30ddid not drive             0.47952    0.38435   1.248
## physically_active_7d                           -0.33308    0.06515  -5.112
## hours_tv_per_school_day1                        0.09317    0.56707   0.164
## hours_tv_per_school_day2                        0.64331    0.51458   1.250
## hours_tv_per_school_day3                        1.63507    0.55402   2.951
## hours_tv_per_school_day4                        1.35090    0.69589   1.941
## hours_tv_per_school_day5+                       2.08207    0.62418   3.336
## hours_tv_per_school_daydo not watch            -0.92711    0.57478  -1.613
## school_night_hours_sleep10+                    -0.43891    1.23788  -0.355
## school_night_hours_sleep5                      -1.26052    0.73849  -1.707
## school_night_hours_sleep6                      -1.66063    0.67730  -2.452
## school_night_hours_sleep7                      -2.69921    0.66192  -4.078
## school_night_hours_sleep8                      -2.73418    0.68614  -3.985
## school_night_hours_sleep9                      -3.63372    0.87871  -4.135
##                                               Pr(>|t|)    
## (Intercept)                                    < 2e-16 ***
## age                                           2.01e-07 ***
## gendermale                                    6.53e-05 ***
## grade11                                       0.496416    
## grade12                                       0.924554    
## grade9                                        0.626913    
## gradeother                                    0.116125    
## hispanicnot                                   0.217313    
## raceAsian                                     1.25e-05 ***
## raceBlack or African American                 0.881983    
## raceNative Hawaiian or Other Pacific Islander 0.301918    
## raceWhite                                     0.271805    
## height                                         < 2e-16 ***
## helmet_12mdid not ride                        0.009965 ** 
## helmet_12mmost of time                        0.266879    
## helmet_12mnever                               0.081135 .  
## helmet_12mrarely                              0.184966    
## helmet_12msometimes                           0.465219    
## text_while_driving_30d1-2                     0.211888    
## text_while_driving_30d10-19                   0.232298    
## text_while_driving_30d20-29                   0.821029    
## text_while_driving_30d3-5                     0.867142    
## text_while_driving_30d30                      0.122601    
## text_while_driving_30d6-9                     0.313806    
## text_while_driving_30ddid not drive           0.212204    
## physically_active_7d                          3.26e-07 ***
## hours_tv_per_school_day1                      0.869493    
## hours_tv_per_school_day2                      0.211272    
## hours_tv_per_school_day3                      0.003173 ** 
## hours_tv_per_school_day4                      0.052260 .  
## hours_tv_per_school_day5+                     0.000855 ***
## hours_tv_per_school_daydo not watch           0.106784    
## school_night_hours_sleep10+                   0.722924    
## school_night_hours_sleep5                     0.087879 .  
## school_night_hours_sleep6                     0.014233 *  
## school_night_hours_sleep7                     4.59e-05 ***
## school_night_hours_sleep8                     6.81e-05 ***
## school_night_hours_sleep9                     3.58e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.29 on 8321 degrees of freedom
##   (4220 observations deleted due to missingness)
## Multiple R-squared:  0.2916, Adjusted R-squared:  0.2884 
## F-statistic: 92.56 on 37 and 8321 DF,  p-value: < 2.2e-16

anova(model2)

## Analysis of Variance Table
## 
## Response: weight
##                            Df  Sum Sq Mean Sq   F value    Pr(>F)    
## age                         1   80892   80892  395.9522 < 2.2e-16 ***
## gender                      1  275402  275402 1348.0502 < 2.2e-16 ***
## grade                       4    2406     601    2.9437  0.019164 *  
## hispanic                    1    1133    1133    5.5473  0.018532 *  
## race                        4   23789    5947   29.1112 < 2.2e-16 ***
## height                      1  290136  290136 1420.1745 < 2.2e-16 ***
## helmet_12m                  5    3558     712    3.4827  0.003803 ** 
## text_while_driving_30d      7    2528     361    1.7674  0.089129 .  
## physically_active_7d        1    6931    6931   33.9250 5.941e-09 ***
## hours_tv_per_school_day     6    6267    1044    5.1126 2.987e-05 ***
## school_night_hours_sleep    6    6594    1099    5.3793 1.480e-05 ***
## Residuals                8321 1699950     204                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Highest p-value: text_while_driving_30d with a p-value of 0.089

yrb_clean3 <- yrb_clean2 |>
  select(!text_while_driving_30d)

model3 <- lm(weight ~ ., data = yrb_clean3)

summary(model3)

## 
## Call:
## lm(formula = weight ~ ., data = yrb_clean3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.303  -9.482  -3.038   6.190  88.970 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                   -82.95478    5.20861 -15.926
## age                                             1.39749    0.25726   5.432
## gendermale                                      1.72007    0.41195   4.175
## grade11                                         0.04698    0.51068   0.092
## grade12                                        -0.29540    0.64864  -0.455
## grade9                                         -0.26288    0.51060  -0.515
## gradeother                                     -6.84569    4.35187  -1.573
## hispanicnot                                    -0.48381    0.48444  -0.999
## raceAsian                                      -5.71575    1.15430  -4.952
## raceBlack or African American                  -0.59287    0.98766  -0.600
## raceNative Hawaiian or Other Pacific Islander  -1.56347    1.38736  -1.127
## raceWhite                                      -1.64140    0.95265  -1.723
## height                                         77.07612    1.96867  39.151
## helmet_12mdid not ride                          2.50629    0.90412   2.772
## helmet_12mmost of time                          1.71639    1.30198   1.318
## helmet_12mnever                                 1.46975    0.88780   1.655
## helmet_12mrarely                                1.50894    1.09487   1.378
## helmet_12msometimes                             0.70919    1.25368   0.566
## physically_active_7d                           -0.33122    0.06315  -5.245
## hours_tv_per_school_day1                       -0.17305    0.55378  -0.312
## hours_tv_per_school_day2                        0.47960    0.50352   0.952
## hours_tv_per_school_day3                        1.47911    0.54021   2.738
## hours_tv_per_school_day4                        1.17861    0.67858   1.737
## hours_tv_per_school_day5+                       2.02263    0.60412   3.348
## hours_tv_per_school_daydo not watch            -1.01900    0.56150  -1.815
## school_night_hours_sleep10+                    -0.63820    1.18196  -0.540
## school_night_hours_sleep5                      -1.18917    0.71190  -1.670
## school_night_hours_sleep6                      -1.63159    0.65205  -2.502
## school_night_hours_sleep7                      -2.56890    0.63783  -4.028
## school_night_hours_sleep8                      -2.55080    0.66034  -3.863
## school_night_hours_sleep9                      -3.70786    0.84715  -4.377
##                                               Pr(>|t|)    
## (Intercept)                                    < 2e-16 ***
## age                                           5.71e-08 ***
## gendermale                                    3.00e-05 ***
## grade11                                       0.926700    
## grade12                                       0.648825    
## grade9                                        0.606667    
## gradeother                                    0.115744    
## hispanicnot                                   0.317962    
## raceAsian                                     7.49e-07 ***
## raceBlack or African American                 0.548337    
## raceNative Hawaiian or Other Pacific Islander 0.259798    
## raceWhite                                     0.084927 .  
## height                                         < 2e-16 ***
## helmet_12mdid not ride                        0.005582 ** 
## helmet_12mmost of time                        0.187440    
## helmet_12mnever                               0.097861 .  
## helmet_12mrarely                              0.168179    
## helmet_12msometimes                           0.571619    
## physically_active_7d                          1.60e-07 ***
## hours_tv_per_school_day1                      0.754675    
## hours_tv_per_school_day2                      0.340873    
## hours_tv_per_school_day3                      0.006193 ** 
## hours_tv_per_school_day4                      0.082441 .  
## hours_tv_per_school_day5+                     0.000817 ***
## hours_tv_per_school_daydo not watch           0.069589 .  
## school_night_hours_sleep10+                   0.589247    
## school_night_hours_sleep5                     0.094873 .  
## school_night_hours_sleep6                     0.012359 *  
## school_night_hours_sleep7                     5.68e-05 ***
## school_night_hours_sleep8                     0.000113 ***
## school_night_hours_sleep9                     1.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.35 on 8849 degrees of freedom
##   (3699 observations deleted due to missingness)
## Multiple R-squared:  0.2886, Adjusted R-squared:  0.2862 
## F-statistic: 119.7 on 30 and 8849 DF,  p-value: < 2.2e-16

anova(model3)

## Analysis of Variance Table
## 
## Response: weight
##                            Df  Sum Sq Mean Sq   F value    Pr(>F)    
## age                         1   83896   83896  407.1824 < 2.2e-16 ***
## gender                      1  289610  289610 1405.6016 < 2.2e-16 ***
## grade                       4    2386     597    2.8951 0.0208149 *  
## hispanic                    1    1699    1699    8.2483 0.0040888 ** 
## race                        4   25739    6435   31.2302 < 2.2e-16 ***
## height                      1  311465  311465 1511.6747 < 2.2e-16 ***
## helmet_12m                  5    4464     893    4.3335 0.0006117 ***
## physically_active_7d        1    7217    7217   35.0262 3.374e-09 ***
## hours_tv_per_school_day     6    6700    1117    5.4196 1.329e-05 ***
## school_night_hours_sleep    6    6583    1097    5.3249 1.706e-05 ***
## Residuals                8849 1823247     206                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We have reached all significant (p < 0.05) variables. However, we want the most significant variables. Therefore, I’ll be removing all variables that don’t have a *** significance code. The *** means the p-value is very very close to zero, for example, 1.706e-05 (0.00007).

So, I’ll be removing grade from the model.

yrb_clean4 <- yrb_clean3 |>
  select(!grade)

model4 <- lm(weight ~ ., data = yrb_clean4)

summary(model4)

## 
## Call:
## lm(formula = weight ~ ., data = yrb_clean4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.472  -9.476  -3.027   6.206  88.807 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                   -83.21151    3.74973 -22.191
## age                                             1.39454    0.12626  11.045
## gendermale                                      1.69782    0.41035   4.138
## hispanicnot                                    -0.47751    0.48407  -0.986
## raceAsian                                      -5.76241    1.15025  -5.010
## raceBlack or African American                  -0.63309    0.98482  -0.643
## raceNative Hawaiian or Other Pacific Islander  -1.66870    1.38263  -1.207
## raceWhite                                      -1.69841    0.94965  -1.788
## height                                         77.20636    1.96392  39.312
## helmet_12mdid not ride                          2.50846    0.90405   2.775
## helmet_12mmost of time                          1.66705    1.30117   1.281
## helmet_12mnever                                 1.45871    0.88767   1.643
## helmet_12mrarely                                1.50745    1.09474   1.377
## helmet_12msometimes                             0.69173    1.25332   0.552
## physically_active_7d                           -0.33280    0.06308  -5.276
## hours_tv_per_school_day1                       -0.18322    0.55360  -0.331
## hours_tv_per_school_day2                        0.45109    0.50320   0.896
## hours_tv_per_school_day3                        1.45299    0.53997   2.691
## hours_tv_per_school_day4                        1.15976    0.67826   1.710
## hours_tv_per_school_day5+                       1.98826    0.60316   3.296
## hours_tv_per_school_daydo not watch            -1.01857    0.56108  -1.815
## school_night_hours_sleep10+                    -0.57431    1.17950  -0.487
## school_night_hours_sleep5                      -1.18684    0.71134  -1.668
## school_night_hours_sleep6                      -1.61006    0.65182  -2.470
## school_night_hours_sleep7                      -2.53823    0.63749  -3.982
## school_night_hours_sleep8                      -2.50243    0.65964  -3.794
## school_night_hours_sleep9                      -3.71038    0.84615  -4.385
##                                               Pr(>|t|)    
## (Intercept)                                    < 2e-16 ***
## age                                            < 2e-16 ***
## gendermale                                    3.54e-05 ***
## hispanicnot                                   0.323941    
## raceAsian                                     5.56e-07 ***
## raceBlack or African American                 0.520334    
## raceNative Hawaiian or Other Pacific Islander 0.227502    
## raceWhite                                     0.073735 .  
## height                                         < 2e-16 ***
## helmet_12mdid not ride                        0.005537 ** 
## helmet_12mmost of time                        0.200159    
## helmet_12mnever                               0.100355    
## helmet_12mrarely                              0.168550    
## helmet_12msometimes                           0.581020    
## physically_active_7d                          1.35e-07 ***
## hours_tv_per_school_day1                      0.740676    
## hours_tv_per_school_day2                      0.370043    
## hours_tv_per_school_day3                      0.007139 ** 
## hours_tv_per_school_day4                      0.087318 .  
## hours_tv_per_school_day5+                     0.000983 ***
## hours_tv_per_school_daydo not watch           0.069500 .  
## school_night_hours_sleep10+                   0.626334    
## school_night_hours_sleep5                     0.095259 .  
## school_night_hours_sleep6                     0.013526 *  
## school_night_hours_sleep7                     6.90e-05 ***
## school_night_hours_sleep8                     0.000149 ***
## school_night_hours_sleep9                     1.17e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.35 on 8860 degrees of freedom
##   (3692 observations deleted due to missingness)
## Multiple R-squared:  0.2882, Adjusted R-squared:  0.2861 
## F-statistic:   138 on 26 and 8860 DF,  p-value: < 2.2e-16

anova(model4)

## Analysis of Variance Table
## 
## Response: weight
##                            Df  Sum Sq Mean Sq   F value    Pr(>F)    
## age                         1   83825   83825  406.8560 < 2.2e-16 ***
## gender                      1  289453  289453 1404.9057 < 2.2e-16 ***
## hispanic                    1    1752    1752    8.5049 0.0035508 ** 
## race                        4   25319    6330   30.7230 < 2.2e-16 ***
## height                      1  313991  313991 1524.0044 < 2.2e-16 ***
## helmet_12m                  5    4520     904    4.3877 0.0005434 ***
## physically_active_7d        1    7250    7250   35.1912 3.101e-09 ***
## hours_tv_per_school_day     6    6560    1093    5.3066 1.790e-05 ***
## school_night_hours_sleep    6    6496    1083    5.2548 2.052e-05 ***
## Residuals                8860 1825426     206                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The hispanic variable has the highest p-value with 0.00355.

A. Final Model

yrb_clean5 <- yrb_clean4 |>
  select(!hispanic)

model5 <- lm(weight ~ ., data = yrb_clean5)

summary(model5)

## 
## Call:
## lm(formula = weight ~ ., data = yrb_clean5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.434  -9.462  -3.017   6.201  88.797 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                   -82.94431    3.72680 -22.256
## age                                             1.38848    0.12544  11.069
## gendermale                                      1.70063    0.40704   4.178
## raceAsian                                      -6.04473    1.11210  -5.435
## raceBlack or African American                  -0.98447    0.93629  -1.051
## raceNative Hawaiian or Other Pacific Islander  -1.74866    1.37340  -1.273
## raceWhite                                      -1.97130    0.91214  -2.161
## height                                         77.09613    1.94266  39.686
## helmet_12mdid not ride                          2.52714    0.89801   2.814
## helmet_12mmost of time                          1.67672    1.29643   1.293
## helmet_12mnever                                 1.46826    0.88160   1.665
## helmet_12mrarely                                1.43193    1.08590   1.319
## helmet_12msometimes                             0.70160    1.24709   0.563
## physically_active_7d                           -0.32707    0.06273  -5.214
## hours_tv_per_school_day1                       -0.20467    0.54990  -0.372
## hours_tv_per_school_day2                        0.49485    0.50014   0.989
## hours_tv_per_school_day3                        1.45271    0.53706   2.705
## hours_tv_per_school_day4                        1.29749    0.67234   1.930
## hours_tv_per_school_day5+                       2.03500    0.59954   3.394
## hours_tv_per_school_daydo not watch            -1.02773    0.55805  -1.842
## school_night_hours_sleep10+                    -0.57974    1.16463  -0.498
## school_night_hours_sleep5                      -1.39983    0.70659  -1.981
## school_night_hours_sleep6                      -1.82148    0.64772  -2.812
## school_night_hours_sleep7                      -2.75707    0.63311  -4.355
## school_night_hours_sleep8                      -2.69745    0.65529  -4.116
## school_night_hours_sleep9                      -3.91753    0.84083  -4.659
##                                               Pr(>|t|)    
## (Intercept)                                    < 2e-16 ***
## age                                            < 2e-16 ***
## gendermale                                    2.97e-05 ***
## raceAsian                                     5.61e-08 ***
## raceBlack or African American                 0.293076    
## raceNative Hawaiian or Other Pacific Islander 0.202965    
## raceWhite                                     0.030707 *  
## height                                         < 2e-16 ***
## helmet_12mdid not ride                        0.004901 ** 
## helmet_12mmost of time                        0.195927    
## helmet_12mnever                               0.095858 .  
## helmet_12mrarely                              0.187315    
## helmet_12msometimes                           0.573729    
## physically_active_7d                          1.89e-07 ***
## hours_tv_per_school_day1                      0.709754    
## hours_tv_per_school_day2                      0.322482    
## hours_tv_per_school_day3                      0.006845 ** 
## hours_tv_per_school_day4                      0.053664 .  
## hours_tv_per_school_day5+                     0.000691 ***
## hours_tv_per_school_daydo not watch           0.065562 .  
## school_night_hours_sleep10+                   0.618646    
## school_night_hours_sleep5                     0.047608 *  
## school_night_hours_sleep6                     0.004932 ** 
## school_night_hours_sleep7                     1.35e-05 ***
## school_night_hours_sleep8                     3.88e-05 ***
## school_night_hours_sleep9                     3.22e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.33 on 8939 degrees of freedom
##   (3614 observations deleted due to missingness)
## Multiple R-squared:  0.289,  Adjusted R-squared:  0.287 
## F-statistic: 145.4 on 25 and 8939 DF,  p-value: < 2.2e-16

anova(model5)

## Analysis of Variance Table
## 
## Response: weight
##                            Df  Sum Sq Mean Sq   F value    Pr(>F)    
## age                         1   83540   83540  406.5640 < 2.2e-16 ***
## gender                      1  291583  291583 1419.0553 < 2.2e-16 ***
## race                        4   26438    6610   32.1666 < 2.2e-16 ***
## height                      1  319188  319188 1553.3975 < 2.2e-16 ***
## helmet_12m                  5    4657     931    4.5327 0.0003957 ***
## physically_active_7d        1    7140    7140   34.7490 3.887e-09 ***
## hours_tv_per_school_day     6    6919    1153    5.6124 7.970e-06 ***
## school_night_hours_sleep    6    7229    1205    5.8634 4.087e-06 ***
## Residuals                8939 1836760     205                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now, I have every variable that is significant or has a category that is significant to predicting weight. These variables are:

age
gender
race
height
helmet_12m
physically_active_7d
hours_tv_per_school_day
school_night_hours_sleep

Adjusted R-Squared: 0.287 or 28.7%. This means that this model explains 28.7% of the variability in weight. This isn’t the best.

Significant Variable P-Values:

Predictor	p-value
age	< 2.2e-16
gender	< 2.2e-16
race	< 2.2e-16
height	< 2.2e-16
helmet_12m	0.0003957
physically_active_7d	3.887e-09
hours_tv_per_school_day	7.970e-06
school_night_hours_sleep	4.087e-06

Equation: y = −82.94431 + 1.38848(age) + 1.70063(gendermale) − 6.04473(raceAsian) − 0.98447(raceBlack or African American) − 1.74866(raceNative Hawaiian or Other Pacific Islander)−1.97130(raceWhite)+77.09613(height)+2.52714(helmet_12mid did not ride)+1.67672(helmet_12most of time)+1.46826(helmet_12never)+1.43193(helmet_12rarely)+0.70160(helmet_12sometimes)−0.32707(physically_active_7d)−0.20467(hours_tv_per_school_day1)+0.49485(hours_tv_per_school_day2)+1.45271(hours_tv_per_school_day3)+1.29749(hours_tv_per_school_day4)+2.03000(hours_tv_per_school_day5+)−1.02773(hours_tv_per_school_daydo not watch)−0.57974(school_night_hours_sleep10+)−1.39983(school_night_hours_sleep5)−1.82148(school_night_hours_sleep6)−2.75707(school_night_hours_sleep7)−2.69745(school_night_hours_sleep8)−3.91753(school_night_hours_sleep9)

Even though I do know how to create an equation of a multiple linear model (y = coefficient(predictor) + intercept), I used AI to create the equation because there are so many predictors in the model.

Analysis of Coefficients: Since there are so many coefficients, I’ll explain what they mean in general terms that should cover every type of coefficient in a multiple linear model.

A positive coefficient of a numerical variable means that for every one increase in that variable, weight increases by that number. On the other hand, a negative coefficient of a numerical variable means that for every one increase in that variable, weight decreases by that number.

A positive coefficient of a categorical variable means that the category within that variable has that much greater weight than the baseline of the variable. A negative coefficient of a categorical variable means that the category within that variable has that much less weight than the baseline of the variable.

For example, the category of “did not ride” in the helmet_12m variable has a coefficient of 2.52714. The baseline of the variable is “always”, which means that those who “did not ride” have, on average, 2.52714 more weight than those who always had a helmet when riding–this is when holding every other variable constant.

P-value of Entire Model: p-value: < 2.2e-16. Very significant.

B. Assumptions

Diagnostic Plots:

par(mfrow=c(2,2)); plot(model5); par(mfrow=c(1,1)) # Code from Data 101

Residuals vs. Fitted: Most residuals are clustered in the middle. Most points are about the y=0 line, although residuals seem to be moving in a slightly downward pattern. Linearity and homoscedasticity are likely violated.

Q-Q Residuals: Points do not follow a straight, diagonal line. The left tail slightly deviates from the line, while the right tail heavily deviates, earlier than usual, from the line. This indicates a violation of normality.

Scale-Location: No equal spread and isn’t horizontal. The residuals seem to be increasing. This again indicates a violation of homoscedasticity.

Residuals vs Leverage: There seems to be one point all the way to the right which may be influential on the model, but it doesn’t seem to be much of a concern.

Check for Multicollinearity:

Here, I check for multicollinearity of the numerical variables by using the cor() command that I learned from Data 101.

cor(yrb_clean5[, c("age", "height", "physically_active_7d")], use = "complete.obs") # Code from Data 101

##                              age    height physically_active_7d
## age                   1.00000000 0.1331966          -0.06990648
## height                0.13319663 1.0000000           0.20427737
## physically_active_7d -0.06990648 0.2042774           1.00000000

Numerical variables don’t seem to be too correlated. This model likely has no multicollinearity. Only slightly concerning correlation is between height and physically active for 7 days. This correlation is 20%; still, this correlation is quite low for anything too concerning.

Check for Independence

plot(resid(model5), type="b",
     main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2) # code from Data 101

The residuals in this diagnostic plot show points scattered randomly about the y = 0 line, although there seems to be more positive residuals shooting up more than negatives, which might be a concern. This suggests that we likely pass independence.

IV. Visualizations

A. Highcharters

cols <- brewer.pal(6, "Pastel1") # Set the colors

box_plot <- data_to_boxplot(data = plot_one, weight, race, group_var = race) # Create the mapping

highchart() |> # High charters
  hc_xAxis(type = "category") |> # Categorical Plot
  hc_add_series_list(box_plot) |>
  hc_title(text = "Box Plots of Race by Weight") |> # Title
  hc_xAxis(title = list(text = "Race")) |> # X-label
  hc_yAxis(title = list(text = "Weight")) |> # Y-label
  hc_colors(cols) |> # Add colors
  hc_caption(text= "Data provided by CDC") |> # Caption
  hc_add_theme(hc_theme_darkunica()) # Theme. Check references for theme

Interpretation: All of these races reside around the same median weights (50~70 kilograms). However, the highest median is the Black or African American race, with a median of 66.23 kilograms. The race with the lowest median are Asians, with a median of 58.97 kilograms.

B. Tableau

Link: https://public.tableau.com/views/YRBSSAnalysisonWeight/Dashboard1?:language=en-US&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link

Note: Use full screen on Tableau to view all elements.

Note for Professor: I’d like you to grade the highcharters plot above and the bar graph at the bottom right that shows sleep hours by average weight for my two graphs that you will grade.

Interpretation:

For this final project, I decided to go beyond the minimum requirement and create more than two graphs. This tableau dashboard includes 6 graphs on the YRBSS dataset. Each graph uses a significant variable from the multiple linear model as the x-axis and weight as the y-axis.

Graph	Interpretation
Box Plot of Gender by Weight	Males have a higher box plot than females, signifying a higher median, higher middle 50%, and higher spread. Males have a median of 79.4 kilograms, while females have a median of 69.9 kilograms. This box plot is colored by the ages, and shows that older people weigh more.
Scatterplot of Height by Weight	As height increases, weight also increases. Furthermore, the colored dots represent age. We see that older teens have a higher height and weight than younger teens.
The Three Bar Graphs, Bottom Left	These show helmet 12m, physically active for 7 days, and hours of TV watched on school days by average weight. They all relatively show no large differences.
Bar Graph of Hours of Sleep on School Night by Average Weight	This one is especially interesting; for hours 5-9, the average weight sits around 66~68 kilograms. However, for hours 10+ and <5, the weight is higher (possibly even significant; a Post HOC test would be used to figure out which ones are significantly different). For 10+ hours, it is 69 kilograms. For <5 hours, it is 70 kilograms. Research suggests that short sleep durations are associated with obesity (Beccuti & Pannain, 2011). This research finding is evident in the bar graph; those who sleep less than 5 hours have a higher weight than those who sleep around 5-9 hours, on average.

C. Is the Difference Significant?

Here, I do a Tukey Post HOC test to figure out if the differences between those who sleep less than 5 hours is actually significant than other sleep durations.

First, I create the ANOVA test.

anova_result <- aov(weight ~ school_night_hours_sleep, data = yrb_clean) # Code from Data 101


summary(anova_result)

##                             Df  Sum Sq Mean Sq F value  Pr(>F)    
## school_night_hours_sleep     6   11333    1889   6.581 5.9e-07 ***
## Residuals                11474 3293032     287                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1098 observations deleted due to missingness

P-Value < 0.05. This means that at least one of the group’s mean is significantly different from the rest.

Perform Tukey Test

TukeyHSD(anova_result) # Code from Data 101

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weight ~ school_night_hours_sleep, data = yrb_clean)
## 
## $school_night_hours_sleep
##               diff       lwr         upr     p adj
## 10+-<5 -1.00448670 -4.567098  2.55812442 0.9817326
## 5-<5   -1.87893410 -4.050656  0.29278824 0.1414215
## 6-<5   -1.96381942 -3.939969  0.01233048 0.0527071
## 7-<5   -2.86243056 -4.776975 -0.94788582 0.0002120
## 8-<5   -2.83954341 -4.814784 -0.86430263 0.0004517
## 9-<5   -4.73801778 -7.276767 -2.19926899 0.0000008
## 5-10+  -0.87444739 -4.280026  2.53113106 0.9887944
## 6-10+  -0.95933272 -4.243652  2.32498687 0.9781071
## 7-10+  -1.85794386 -5.105569  1.38968117 0.6248265
## 8-10+  -1.83505671 -5.118829  1.44871595 0.6509751
## 9-10+  -3.73353108 -7.384124 -0.08293768 0.0411149
## 6-5    -0.08488533 -1.761470  1.59169902 0.9999991
## 7-5    -0.98349646 -2.587008  0.62001474 0.5421408
## 8-5    -0.96060932 -2.636122  0.71490336 0.6223861
## 9-5    -2.85908369 -5.172309 -0.54585822 0.0050099
## 7-6    -0.89861114 -2.225276  0.42805337 0.4162754
## 8-6    -0.87572399 -2.288570  0.53712182 0.5290992
## 9-6    -2.77419836 -4.904880 -0.64351711 0.0023770
## 8-7     0.02288715 -1.302423  1.34819708 1.0000000
## 9-7    -1.87558722 -3.949259  0.19808482 0.1067654
## 9-8    -1.89847437 -4.028312  0.23136372 0.1175002

Most significant differences: 7-<5, 8-<5, 9-<5, 9-10+, 9-5, 9-6

We see that the differences between those who get greater than 6 hours of sleep and those who get less than 5 hours of sleep is significant and the difference is negative. This statistically proves that those who sleep less than 5 hours do have a significantly higher weight than those sleep around 7-9 hours.

V. Conclusion

In conclusion, we answer our research question of what variables are most significant at predicting weight. The most significant variables are age, gender, race, height, helmet_12m, physically_active_7d, hours_tv_per_school_day, and school_night_hours_sleep. The multiple linear model explains about 29% of the variability in weight, and fails numerous assumptions of a linear model.

The visualizations prove worthy of valuable insight; the box plot of race by weight shows Black and African Americans having the highest median weight, whereas Asians have the lowest median weight. The bar graph–of average sleep hours on a school night by average weight–in Tableau shows the highest average weight belonging to those who sleep less than 5 hours and more than 10 hours, and the difference is significant. This finding is backed by research (Beccuti & Pannain, 2011).

For this project, I wish I could have seen the time series of this dataset. Since this dataset spans from 1991-2013, it would’ve been useful to see how weight has changed over the years. If the CDC does these surveys in the future (if they’re not already), they should include other variables such as smoking or not smoking, in a relationship or not, their GPA, etc. Such variables would allow for a more sophisticated and advanced analysis.

VI. References

Beccuti, G., & Pannain, S. (2011). Sleep and obesity. Current opinion in clinical nutrition and metabolic care, 14(4), 402–412. https://doi.org/10.1097/MCO.0b013e3283479109 Dataset Source: http://www.cdc.gov/healthyyouth/data/yrbs/index.htm.

Dataset repository: https://www.openintro.org/data/index.php?data=yrbss

YAML Code Customization: https://stackoverflow.com/questions/48261379/rmarkdown-floating-toc-and-toc-at-beginning

ANOVA-test to check significance in multi-linear model Source: https://www.statalist.org/forums/forum/general-stata-discussion/general/956382-categorical-variables-in-backward-regression-model

Project 3 - Data 110

Muhammad Ahtisham

2026-05-15