Image Source: pngtree.com
What variables are most significant at predicting the weight of a participant? The Youth Risk Behavior Surveillance System (YRBSS) contains surveys on students from 1991-2013 in the United States. This data is collected and provided by the Centers for Disease Control and Prevention (CDC) and is available to download on Openintro.
In this project, I will use a multiple linear regression model that inputs every variable in this dataset to answer my research question of what variables are most significant at predicting the weight of a participant. From this question stems many others, for example, how much does one’s weight change for every increase in the number of days they exercise? Or, is texting while driving significant at predicting weight? This data was collected through conducting surveys of students from 1991-2013 in the United States.
I chose to do this topic because it can be insightful if all assumptions of the model pass and the model explains the variability of weight well. As stated before, it can answer numerous questions regarding weight. I first thought of doing height, but height is usually genetic. So, I decided to predict weight because weight can change depending on a variety of factors. In this project, I will see which of those factors, in this dataset, are most important to weight–if any.
## Rows: 13583 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): gender, grade, hispanic, race, helmet_12m, text_while_driving_30d, ...
## dbl (5): age, height, weight, physically_active_7d, strength_training_7d
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 13
## age gender grade hispanic race height weight helmet_12m
## <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 14 female 9 not Black or African American NA NA never
## 2 14 female 9 not Black or African American NA NA never
## 3 15 female 9 hispanic Native Hawaiian or Other… 1.73 84.4 never
## 4 15 female 9 not Black or African American 1.6 55.8 never
## 5 15 female 9 not Black or African American 1.5 46.7 did not r…
## 6 15 female 9 not Black or African American 1.57 67.1 did not r…
## # ℹ 5 more variables: text_while_driving_30d <chr>, physically_active_7d <dbl>,
## # hours_tv_per_school_day <chr>, strength_training_7d <dbl>,
## # school_night_hours_sleep <chr>
In this section, I will clean the dataset, change variable names, remove N/A values for the weight variable, and create a new data frame for my plot. For my EDA, I will convert categorical variables and discrete variables to factors, check numerical and categorical variables, and check unique values for categorical variables.
Checking for N/A values.
## age gender grade
## 77 12 79
## hispanic race height
## 231 2805 1004
## weight helmet_12m text_while_driving_30d
## 1004 311 918
## physically_active_7d hours_tv_per_school_day strength_training_7d
## 273 338 1176
## school_night_hours_sleep
## 1248
A lot of N/A values. Since my main variable is weight, I will be removing N/A values for that. For the other variables, if N/A values are conflicting in a certain place, I will remove them in that place.
## # A tibble: 6 × 13
## age gender grade hispanic race height weight helmet_12m
## <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 15 female 9 hispanic Native Hawaiian or Other… 1.73 84.4 never
## 2 15 female 9 not Black or African American 1.6 55.8 never
## 3 15 female 9 not Black or African American 1.5 46.7 did not r…
## 4 15 female 9 not Black or African American 1.57 67.1 did not r…
## 5 15 female 9 not Black or African American 1.65 132. did not r…
## 6 14 male 9 not Black or African American 1.88 71.2 never
## # ℹ 5 more variables: text_while_driving_30d <chr>, physically_active_7d <dbl>,
## # hours_tv_per_school_day <chr>, strength_training_7d <dbl>,
## # school_night_hours_sleep <chr>
Create the plot for a boxplot by selecting the two necessary variables and filtering out extra N/A values.
plot_one <- yrb_clean |>
select(weight, race) |> # select the two variables I want/
filter(!is.na(race)) |> # filter out n/a's
arrange(desc(weight)) # arrange weight from highest to lowest
head(plot_one)## # A tibble: 6 × 2
## weight race
## <dbl> <chr>
## 1 181. American Indian or Alaska Native
## 2 163. Black or African American
## 3 163. Black or African American
## 4 160. Black or African American
## 5 159. White
## 6 159. White
Checking the class of the main variable.
## [1] "numeric"
Checking the structure of the dataset
## spc_tbl_ [12,579 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:12579] 15 15 15 15 15 14 15 15 15 15 ...
## $ gender : chr [1:12579] "female" "female" "female" "female" ...
## $ grade : chr [1:12579] "9" "9" "9" "9" ...
## $ hispanic : chr [1:12579] "hispanic" "not" "not" "not" ...
## $ race : chr [1:12579] "Native Hawaiian or Other Pacific Islander" "Black or African American" "Black or African American" "Black or African American" ...
## $ height : num [1:12579] 1.73 1.6 1.5 1.57 1.65 1.88 1.75 1.37 1.68 1.65 ...
## $ weight : num [1:12579] 84.4 55.8 46.7 67.1 131.5 ...
## $ helmet_12m : chr [1:12579] "never" "never" "did not ride" "did not ride" ...
## $ text_while_driving_30d : chr [1:12579] "30" "0" "did not drive" "did not drive" ...
## $ physically_active_7d : num [1:12579] 7 0 2 1 4 4 5 0 0 0 ...
## $ hours_tv_per_school_day : chr [1:12579] "5+" "2" "3" "5+" ...
## $ strength_training_7d : num [1:12579] 0 0 1 0 2 0 3 0 3 0 ...
## $ school_night_hours_sleep: chr [1:12579] "<5" "6" "9" "8" ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. gender = col_character(),
## .. grade = col_character(),
## .. hispanic = col_character(),
## .. race = col_character(),
## .. height = col_double(),
## .. weight = col_double(),
## .. helmet_12m = col_character(),
## .. text_while_driving_30d = col_character(),
## .. physically_active_7d = col_double(),
## .. hours_tv_per_school_day = col_character(),
## .. strength_training_7d = col_double(),
## .. school_night_hours_sleep = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
I can convert a lot of these variables to factors.
yrb_clean$helmet_12m <- as.factor(yrb_clean$helmet_12m)
yrb_clean$race <- as.factor(yrb_clean$race)
yrb_clean$text_while_driving_30d <- as.factor(yrb_clean$text_while_driving_30d)
yrb_clean$hispanic <- as.factor(yrb_clean$hispanic)
yrb_clean$hours_tv_per_school_day <- as.factor(yrb_clean$hours_tv_per_school_day)Calculate summary statistics of the entire dataset
## age gender grade hispanic
## Min. :12.00 Length:12579 Length:12579 hispanic:3166
## 1st Qu.:15.00 Class :character Class :character not :9282
## Median :16.00 Mode :character Mode :character NA's : 131
## Mean :16.17
## 3rd Qu.:17.00
## Max. :18.00
##
## race height
## American Indian or Alaska Native : 295 Min. :1.270
## Asian : 499 1st Qu.:1.600
## Black or African American :2904 Median :1.680
## Native Hawaiian or Other Pacific Islander: 242 Mean :1.691
## White :6112 3rd Qu.:1.780
## NA's :2527 Max. :2.110
##
## weight helmet_12m text_while_driving_30d
## Min. : 29.94 always : 359 0 :4462
## 1st Qu.: 56.25 did not ride:4243 did not drive:4293
## Median : 64.41 most of time: 274 1-2 : 887
## Mean : 67.91 never :6555 30 : 756
## 3rd Qu.: 76.20 rarely : 660 3-5 : 457
## Max. :180.99 sometimes : 316 (Other) : 928
## NA's : 172 NA's : 796
## physically_active_7d hours_tv_per_school_day strength_training_7d
## Min. :0.000 2 :2548 Min. :0.000
## 1st Qu.:2.000 <1 :2021 1st Qu.:0.000
## Median :4.000 3 :1995 Median :3.000
## Mean :3.924 do not watch:1671 Mean :2.969
## 3rd Qu.:7.000 1 :1667 3rd Qu.:5.000
## Max. :7.000 (Other) :2406 Max. :7.000
## NA's :215 NA's : 271 NA's :1044
## school_night_hours_sleep
## Length:12579
## Class :character
## Mode :character
##
##
##
##
The mean weight is 67.91.
In this section, I will input every variable into the model to predict weight. I will then use backwards elimination to figure out what variables are most significant at predicting weight.
##
## Call:
## lm(formula = weight ~ ., data = yrb_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.381 -9.423 -2.944 6.076 88.205
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -83.62740 5.45119 -15.341
## age 1.40507 0.26808 5.241
## gendermale 1.75048 0.42681 4.101
## grade11 0.33440 0.52668 0.635
## grade12 -0.09721 0.67564 -0.144
## grade9 -0.21348 0.52871 -0.404
## gradeother -6.80660 4.33703 -1.569
## hispanicnot -0.62765 0.50033 -1.254
## raceAsian -5.22575 1.19801 -4.362
## raceBlack or African American -0.14622 1.02588 -0.143
## raceNative Hawaiian or Other Pacific Islander -1.49259 1.44674 -1.032
## raceWhite -1.08876 0.98878 -1.101
## height 77.02921 2.03855 37.786
## helmet_12mdid not ride 2.31071 0.92008 2.511
## helmet_12mmost of time 1.42472 1.31819 1.081
## helmet_12mnever 1.53801 0.90390 1.702
## helmet_12mrarely 1.43069 1.11502 1.283
## helmet_12msometimes 0.87814 1.27691 0.688
## text_while_driving_30d1-2 -0.79007 0.61996 -1.274
## text_while_driving_30d10-19 -1.08359 0.90642 -1.195
## text_while_driving_30d20-29 0.23360 1.00806 0.232
## text_while_driving_30d3-5 0.12379 0.81799 0.151
## text_while_driving_30d30 1.03782 0.66647 1.557
## text_while_driving_30d6-9 0.98795 0.98953 0.998
## text_while_driving_30ddid not drive 0.44461 0.38483 1.155
## physically_active_7d -0.29350 0.08036 -3.652
## hours_tv_per_school_day1 0.08785 0.56731 0.155
## hours_tv_per_school_day2 0.63437 0.51483 1.232
## hours_tv_per_school_day3 1.64046 0.55461 2.958
## hours_tv_per_school_day4 1.34191 0.69672 1.926
## hours_tv_per_school_day5+ 2.07755 0.62478 3.325
## hours_tv_per_school_daydo not watch -0.94316 0.57523 -1.640
## strength_training_7d -0.07259 0.07879 -0.921
## school_night_hours_sleep10+ -0.45980 1.23830 -0.371
## school_night_hours_sleep5 -1.28867 0.73895 -1.744
## school_night_hours_sleep6 -1.67981 0.67786 -2.478
## school_night_hours_sleep7 -2.73447 0.66251 -4.127
## school_night_hours_sleep8 -2.74442 0.68675 -3.996
## school_night_hours_sleep9 -3.65133 0.87988 -4.150
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## age 1.63e-07 ***
## gendermale 4.15e-05 ***
## grade11 0.525504
## grade12 0.885596
## grade9 0.686382
## gradeother 0.116589
## hispanicnot 0.209703
## raceAsian 1.30e-05 ***
## raceBlack or African American 0.886665
## raceNative Hawaiian or Other Pacific Islander 0.302247
## raceWhite 0.270879
## height < 2e-16 ***
## helmet_12mdid not ride 0.012043 *
## helmet_12mmost of time 0.279811
## helmet_12mnever 0.088881 .
## helmet_12mrarely 0.199491
## helmet_12msometimes 0.491659
## text_while_driving_30d1-2 0.202565
## text_while_driving_30d10-19 0.231942
## text_while_driving_30d20-29 0.816752
## text_while_driving_30d3-5 0.879716
## text_while_driving_30d30 0.119463
## text_while_driving_30d6-9 0.318116
## text_while_driving_30ddid not drive 0.247987
## physically_active_7d 0.000261 ***
## hours_tv_per_school_day1 0.876948
## hours_tv_per_school_day2 0.217908
## hours_tv_per_school_day3 0.003106 **
## hours_tv_per_school_day4 0.054133 .
## hours_tv_per_school_day5+ 0.000887 ***
## hours_tv_per_school_daydo not watch 0.101119
## strength_training_7d 0.356888
## school_night_hours_sleep10+ 0.710415
## school_night_hours_sleep5 0.081210 .
## school_night_hours_sleep6 0.013228 *
## school_night_hours_sleep7 3.70e-05 ***
## school_night_hours_sleep8 6.49e-05 ***
## school_night_hours_sleep9 3.36e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.29 on 8312 degrees of freedom
## (4228 observations deleted due to missingness)
## Multiple R-squared: 0.2917, Adjusted R-squared: 0.2885
## F-statistic: 90.09 on 38 and 8312 DF, p-value: < 2.2e-16
Since it’s hard to tell which variables are significant because some variables have multiple categories, I will use an anova test of the model to figure out what p-value is the highest for the backwards elimination.
Also, my Data 101 professor said I can use anova to figure out what p-value is the highest if there’s a variable with multiple categories.
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 80986 80986 396.3298 < 2.2e-16 ***
## gender 1 275781 275781 1349.6261 < 2.2e-16 ***
## grade 4 2343 586 2.8670 0.021840 *
## hispanic 1 1141 1141 5.5833 0.018155 *
## race 4 23764 5941 29.0745 < 2.2e-16 ***
## height 1 289308 289308 1415.8245 < 2.2e-16 ***
## helmet_12m 5 3475 695 3.4010 0.004517 **
## text_while_driving_30d 7 2501 357 1.7485 0.093116 .
## physically_active_7d 1 7062 7062 34.5623 4.288e-09 ***
## hours_tv_per_school_day 6 6347 1058 5.1766 2.525e-05 ***
## strength_training_7d 1 198 198 0.9670 0.325452
## school_night_hours_sleep 6 6638 1106 5.4144 1.349e-05 ***
## Residuals 8312 1698466 204
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using the ANOVA test, we can see that the highest p-value is strength_training_7d, with a p-value of 0.325.
yrb_clean2 <- yrb_clean |>
select(!strength_training_7d)
model2 <- lm(weight ~ ., data = yrb_clean2)
summary(model2)##
## Call:
## lm(formula = weight ~ ., data = yrb_clean2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.016 -9.440 -2.956 6.093 88.351
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -83.55656 5.44236 -15.353
## age 1.39352 0.26786 5.202
## gendermale 1.69731 0.42489 3.995
## grade11 0.35777 0.52601 0.680
## grade12 -0.06388 0.67457 -0.095
## grade9 -0.25662 0.52792 -0.486
## gradeother -6.81447 4.33653 -1.571
## hispanicnot -0.61661 0.49976 -1.234
## raceAsian -5.23699 1.19784 -4.372
## raceBlack or African American -0.15226 1.02559 -0.148
## raceNative Hawaiian or Other Pacific Islander -1.49344 1.44658 -1.032
## raceWhite -1.08648 0.98861 -1.099
## height 77.01842 2.03583 37.831
## helmet_12mdid not ride 2.36751 0.91849 2.578
## helmet_12mmost of time 1.46237 1.31703 1.110
## helmet_12mnever 1.57397 0.90232 1.744
## helmet_12mrarely 1.47655 1.11377 1.326
## helmet_12msometimes 0.93170 1.27575 0.730
## text_while_driving_30d1-2 -0.77384 0.61982 -1.248
## text_while_driving_30d10-19 -1.08240 0.90612 -1.195
## text_while_driving_30d20-29 0.22794 1.00757 0.226
## text_while_driving_30d3-5 0.13662 0.81664 0.167
## text_while_driving_30d30 1.02773 0.66558 1.544
## text_while_driving_30d6-9 0.99659 0.98933 1.007
## text_while_driving_30ddid not drive 0.47952 0.38435 1.248
## physically_active_7d -0.33308 0.06515 -5.112
## hours_tv_per_school_day1 0.09317 0.56707 0.164
## hours_tv_per_school_day2 0.64331 0.51458 1.250
## hours_tv_per_school_day3 1.63507 0.55402 2.951
## hours_tv_per_school_day4 1.35090 0.69589 1.941
## hours_tv_per_school_day5+ 2.08207 0.62418 3.336
## hours_tv_per_school_daydo not watch -0.92711 0.57478 -1.613
## school_night_hours_sleep10+ -0.43891 1.23788 -0.355
## school_night_hours_sleep5 -1.26052 0.73849 -1.707
## school_night_hours_sleep6 -1.66063 0.67730 -2.452
## school_night_hours_sleep7 -2.69921 0.66192 -4.078
## school_night_hours_sleep8 -2.73418 0.68614 -3.985
## school_night_hours_sleep9 -3.63372 0.87871 -4.135
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## age 2.01e-07 ***
## gendermale 6.53e-05 ***
## grade11 0.496416
## grade12 0.924554
## grade9 0.626913
## gradeother 0.116125
## hispanicnot 0.217313
## raceAsian 1.25e-05 ***
## raceBlack or African American 0.881983
## raceNative Hawaiian or Other Pacific Islander 0.301918
## raceWhite 0.271805
## height < 2e-16 ***
## helmet_12mdid not ride 0.009965 **
## helmet_12mmost of time 0.266879
## helmet_12mnever 0.081135 .
## helmet_12mrarely 0.184966
## helmet_12msometimes 0.465219
## text_while_driving_30d1-2 0.211888
## text_while_driving_30d10-19 0.232298
## text_while_driving_30d20-29 0.821029
## text_while_driving_30d3-5 0.867142
## text_while_driving_30d30 0.122601
## text_while_driving_30d6-9 0.313806
## text_while_driving_30ddid not drive 0.212204
## physically_active_7d 3.26e-07 ***
## hours_tv_per_school_day1 0.869493
## hours_tv_per_school_day2 0.211272
## hours_tv_per_school_day3 0.003173 **
## hours_tv_per_school_day4 0.052260 .
## hours_tv_per_school_day5+ 0.000855 ***
## hours_tv_per_school_daydo not watch 0.106784
## school_night_hours_sleep10+ 0.722924
## school_night_hours_sleep5 0.087879 .
## school_night_hours_sleep6 0.014233 *
## school_night_hours_sleep7 4.59e-05 ***
## school_night_hours_sleep8 6.81e-05 ***
## school_night_hours_sleep9 3.58e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.29 on 8321 degrees of freedom
## (4220 observations deleted due to missingness)
## Multiple R-squared: 0.2916, Adjusted R-squared: 0.2884
## F-statistic: 92.56 on 37 and 8321 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 80892 80892 395.9522 < 2.2e-16 ***
## gender 1 275402 275402 1348.0502 < 2.2e-16 ***
## grade 4 2406 601 2.9437 0.019164 *
## hispanic 1 1133 1133 5.5473 0.018532 *
## race 4 23789 5947 29.1112 < 2.2e-16 ***
## height 1 290136 290136 1420.1745 < 2.2e-16 ***
## helmet_12m 5 3558 712 3.4827 0.003803 **
## text_while_driving_30d 7 2528 361 1.7674 0.089129 .
## physically_active_7d 1 6931 6931 33.9250 5.941e-09 ***
## hours_tv_per_school_day 6 6267 1044 5.1126 2.987e-05 ***
## school_night_hours_sleep 6 6594 1099 5.3793 1.480e-05 ***
## Residuals 8321 1699950 204
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Highest p-value: text_while_driving_30d with a p-value of 0.089
yrb_clean3 <- yrb_clean2 |>
select(!text_while_driving_30d)
model3 <- lm(weight ~ ., data = yrb_clean3)
summary(model3)##
## Call:
## lm(formula = weight ~ ., data = yrb_clean3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.303 -9.482 -3.038 6.190 88.970
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -82.95478 5.20861 -15.926
## age 1.39749 0.25726 5.432
## gendermale 1.72007 0.41195 4.175
## grade11 0.04698 0.51068 0.092
## grade12 -0.29540 0.64864 -0.455
## grade9 -0.26288 0.51060 -0.515
## gradeother -6.84569 4.35187 -1.573
## hispanicnot -0.48381 0.48444 -0.999
## raceAsian -5.71575 1.15430 -4.952
## raceBlack or African American -0.59287 0.98766 -0.600
## raceNative Hawaiian or Other Pacific Islander -1.56347 1.38736 -1.127
## raceWhite -1.64140 0.95265 -1.723
## height 77.07612 1.96867 39.151
## helmet_12mdid not ride 2.50629 0.90412 2.772
## helmet_12mmost of time 1.71639 1.30198 1.318
## helmet_12mnever 1.46975 0.88780 1.655
## helmet_12mrarely 1.50894 1.09487 1.378
## helmet_12msometimes 0.70919 1.25368 0.566
## physically_active_7d -0.33122 0.06315 -5.245
## hours_tv_per_school_day1 -0.17305 0.55378 -0.312
## hours_tv_per_school_day2 0.47960 0.50352 0.952
## hours_tv_per_school_day3 1.47911 0.54021 2.738
## hours_tv_per_school_day4 1.17861 0.67858 1.737
## hours_tv_per_school_day5+ 2.02263 0.60412 3.348
## hours_tv_per_school_daydo not watch -1.01900 0.56150 -1.815
## school_night_hours_sleep10+ -0.63820 1.18196 -0.540
## school_night_hours_sleep5 -1.18917 0.71190 -1.670
## school_night_hours_sleep6 -1.63159 0.65205 -2.502
## school_night_hours_sleep7 -2.56890 0.63783 -4.028
## school_night_hours_sleep8 -2.55080 0.66034 -3.863
## school_night_hours_sleep9 -3.70786 0.84715 -4.377
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## age 5.71e-08 ***
## gendermale 3.00e-05 ***
## grade11 0.926700
## grade12 0.648825
## grade9 0.606667
## gradeother 0.115744
## hispanicnot 0.317962
## raceAsian 7.49e-07 ***
## raceBlack or African American 0.548337
## raceNative Hawaiian or Other Pacific Islander 0.259798
## raceWhite 0.084927 .
## height < 2e-16 ***
## helmet_12mdid not ride 0.005582 **
## helmet_12mmost of time 0.187440
## helmet_12mnever 0.097861 .
## helmet_12mrarely 0.168179
## helmet_12msometimes 0.571619
## physically_active_7d 1.60e-07 ***
## hours_tv_per_school_day1 0.754675
## hours_tv_per_school_day2 0.340873
## hours_tv_per_school_day3 0.006193 **
## hours_tv_per_school_day4 0.082441 .
## hours_tv_per_school_day5+ 0.000817 ***
## hours_tv_per_school_daydo not watch 0.069589 .
## school_night_hours_sleep10+ 0.589247
## school_night_hours_sleep5 0.094873 .
## school_night_hours_sleep6 0.012359 *
## school_night_hours_sleep7 5.68e-05 ***
## school_night_hours_sleep8 0.000113 ***
## school_night_hours_sleep9 1.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.35 on 8849 degrees of freedom
## (3699 observations deleted due to missingness)
## Multiple R-squared: 0.2886, Adjusted R-squared: 0.2862
## F-statistic: 119.7 on 30 and 8849 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 83896 83896 407.1824 < 2.2e-16 ***
## gender 1 289610 289610 1405.6016 < 2.2e-16 ***
## grade 4 2386 597 2.8951 0.0208149 *
## hispanic 1 1699 1699 8.2483 0.0040888 **
## race 4 25739 6435 31.2302 < 2.2e-16 ***
## height 1 311465 311465 1511.6747 < 2.2e-16 ***
## helmet_12m 5 4464 893 4.3335 0.0006117 ***
## physically_active_7d 1 7217 7217 35.0262 3.374e-09 ***
## hours_tv_per_school_day 6 6700 1117 5.4196 1.329e-05 ***
## school_night_hours_sleep 6 6583 1097 5.3249 1.706e-05 ***
## Residuals 8849 1823247 206
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We have reached all significant (p < 0.05) variables. However, we want the most significant variables. Therefore, I’ll be removing all variables that don’t have a *** significance code. The *** means the p-value is very very close to zero, for example, 1.706e-05 (0.00007).
So, I’ll be removing grade from the model.
yrb_clean4 <- yrb_clean3 |>
select(!grade)
model4 <- lm(weight ~ ., data = yrb_clean4)
summary(model4)##
## Call:
## lm(formula = weight ~ ., data = yrb_clean4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.472 -9.476 -3.027 6.206 88.807
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -83.21151 3.74973 -22.191
## age 1.39454 0.12626 11.045
## gendermale 1.69782 0.41035 4.138
## hispanicnot -0.47751 0.48407 -0.986
## raceAsian -5.76241 1.15025 -5.010
## raceBlack or African American -0.63309 0.98482 -0.643
## raceNative Hawaiian or Other Pacific Islander -1.66870 1.38263 -1.207
## raceWhite -1.69841 0.94965 -1.788
## height 77.20636 1.96392 39.312
## helmet_12mdid not ride 2.50846 0.90405 2.775
## helmet_12mmost of time 1.66705 1.30117 1.281
## helmet_12mnever 1.45871 0.88767 1.643
## helmet_12mrarely 1.50745 1.09474 1.377
## helmet_12msometimes 0.69173 1.25332 0.552
## physically_active_7d -0.33280 0.06308 -5.276
## hours_tv_per_school_day1 -0.18322 0.55360 -0.331
## hours_tv_per_school_day2 0.45109 0.50320 0.896
## hours_tv_per_school_day3 1.45299 0.53997 2.691
## hours_tv_per_school_day4 1.15976 0.67826 1.710
## hours_tv_per_school_day5+ 1.98826 0.60316 3.296
## hours_tv_per_school_daydo not watch -1.01857 0.56108 -1.815
## school_night_hours_sleep10+ -0.57431 1.17950 -0.487
## school_night_hours_sleep5 -1.18684 0.71134 -1.668
## school_night_hours_sleep6 -1.61006 0.65182 -2.470
## school_night_hours_sleep7 -2.53823 0.63749 -3.982
## school_night_hours_sleep8 -2.50243 0.65964 -3.794
## school_night_hours_sleep9 -3.71038 0.84615 -4.385
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## age < 2e-16 ***
## gendermale 3.54e-05 ***
## hispanicnot 0.323941
## raceAsian 5.56e-07 ***
## raceBlack or African American 0.520334
## raceNative Hawaiian or Other Pacific Islander 0.227502
## raceWhite 0.073735 .
## height < 2e-16 ***
## helmet_12mdid not ride 0.005537 **
## helmet_12mmost of time 0.200159
## helmet_12mnever 0.100355
## helmet_12mrarely 0.168550
## helmet_12msometimes 0.581020
## physically_active_7d 1.35e-07 ***
## hours_tv_per_school_day1 0.740676
## hours_tv_per_school_day2 0.370043
## hours_tv_per_school_day3 0.007139 **
## hours_tv_per_school_day4 0.087318 .
## hours_tv_per_school_day5+ 0.000983 ***
## hours_tv_per_school_daydo not watch 0.069500 .
## school_night_hours_sleep10+ 0.626334
## school_night_hours_sleep5 0.095259 .
## school_night_hours_sleep6 0.013526 *
## school_night_hours_sleep7 6.90e-05 ***
## school_night_hours_sleep8 0.000149 ***
## school_night_hours_sleep9 1.17e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.35 on 8860 degrees of freedom
## (3692 observations deleted due to missingness)
## Multiple R-squared: 0.2882, Adjusted R-squared: 0.2861
## F-statistic: 138 on 26 and 8860 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 83825 83825 406.8560 < 2.2e-16 ***
## gender 1 289453 289453 1404.9057 < 2.2e-16 ***
## hispanic 1 1752 1752 8.5049 0.0035508 **
## race 4 25319 6330 30.7230 < 2.2e-16 ***
## height 1 313991 313991 1524.0044 < 2.2e-16 ***
## helmet_12m 5 4520 904 4.3877 0.0005434 ***
## physically_active_7d 1 7250 7250 35.1912 3.101e-09 ***
## hours_tv_per_school_day 6 6560 1093 5.3066 1.790e-05 ***
## school_night_hours_sleep 6 6496 1083 5.2548 2.052e-05 ***
## Residuals 8860 1825426 206
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The hispanic variable has the highest p-value with 0.00355.
yrb_clean5 <- yrb_clean4 |>
select(!hispanic)
model5 <- lm(weight ~ ., data = yrb_clean5)
summary(model5)##
## Call:
## lm(formula = weight ~ ., data = yrb_clean5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.434 -9.462 -3.017 6.201 88.797
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -82.94431 3.72680 -22.256
## age 1.38848 0.12544 11.069
## gendermale 1.70063 0.40704 4.178
## raceAsian -6.04473 1.11210 -5.435
## raceBlack or African American -0.98447 0.93629 -1.051
## raceNative Hawaiian or Other Pacific Islander -1.74866 1.37340 -1.273
## raceWhite -1.97130 0.91214 -2.161
## height 77.09613 1.94266 39.686
## helmet_12mdid not ride 2.52714 0.89801 2.814
## helmet_12mmost of time 1.67672 1.29643 1.293
## helmet_12mnever 1.46826 0.88160 1.665
## helmet_12mrarely 1.43193 1.08590 1.319
## helmet_12msometimes 0.70160 1.24709 0.563
## physically_active_7d -0.32707 0.06273 -5.214
## hours_tv_per_school_day1 -0.20467 0.54990 -0.372
## hours_tv_per_school_day2 0.49485 0.50014 0.989
## hours_tv_per_school_day3 1.45271 0.53706 2.705
## hours_tv_per_school_day4 1.29749 0.67234 1.930
## hours_tv_per_school_day5+ 2.03500 0.59954 3.394
## hours_tv_per_school_daydo not watch -1.02773 0.55805 -1.842
## school_night_hours_sleep10+ -0.57974 1.16463 -0.498
## school_night_hours_sleep5 -1.39983 0.70659 -1.981
## school_night_hours_sleep6 -1.82148 0.64772 -2.812
## school_night_hours_sleep7 -2.75707 0.63311 -4.355
## school_night_hours_sleep8 -2.69745 0.65529 -4.116
## school_night_hours_sleep9 -3.91753 0.84083 -4.659
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## age < 2e-16 ***
## gendermale 2.97e-05 ***
## raceAsian 5.61e-08 ***
## raceBlack or African American 0.293076
## raceNative Hawaiian or Other Pacific Islander 0.202965
## raceWhite 0.030707 *
## height < 2e-16 ***
## helmet_12mdid not ride 0.004901 **
## helmet_12mmost of time 0.195927
## helmet_12mnever 0.095858 .
## helmet_12mrarely 0.187315
## helmet_12msometimes 0.573729
## physically_active_7d 1.89e-07 ***
## hours_tv_per_school_day1 0.709754
## hours_tv_per_school_day2 0.322482
## hours_tv_per_school_day3 0.006845 **
## hours_tv_per_school_day4 0.053664 .
## hours_tv_per_school_day5+ 0.000691 ***
## hours_tv_per_school_daydo not watch 0.065562 .
## school_night_hours_sleep10+ 0.618646
## school_night_hours_sleep5 0.047608 *
## school_night_hours_sleep6 0.004932 **
## school_night_hours_sleep7 1.35e-05 ***
## school_night_hours_sleep8 3.88e-05 ***
## school_night_hours_sleep9 3.22e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.33 on 8939 degrees of freedom
## (3614 observations deleted due to missingness)
## Multiple R-squared: 0.289, Adjusted R-squared: 0.287
## F-statistic: 145.4 on 25 and 8939 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 83540 83540 406.5640 < 2.2e-16 ***
## gender 1 291583 291583 1419.0553 < 2.2e-16 ***
## race 4 26438 6610 32.1666 < 2.2e-16 ***
## height 1 319188 319188 1553.3975 < 2.2e-16 ***
## helmet_12m 5 4657 931 4.5327 0.0003957 ***
## physically_active_7d 1 7140 7140 34.7490 3.887e-09 ***
## hours_tv_per_school_day 6 6919 1153 5.6124 7.970e-06 ***
## school_night_hours_sleep 6 7229 1205 5.8634 4.087e-06 ***
## Residuals 8939 1836760 205
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now, I have every variable that is significant or has a category that is significant to predicting weight. These variables are:
Adjusted R-Squared: 0.287 or 28.7%. This means that this model explains 28.7% of the variability in weight. This isn’t the best.
Significant Variable P-Values:
| Predictor | p-value |
|---|---|
| age | < 2.2e-16 |
| gender | < 2.2e-16 |
| race | < 2.2e-16 |
| height | < 2.2e-16 |
| helmet_12m | 0.0003957 |
| physically_active_7d | 3.887e-09 |
| hours_tv_per_school_day | 7.970e-06 |
| school_night_hours_sleep | 4.087e-06 |
Equation: y = −82.94431 + 1.38848(age) + 1.70063(gendermale) − 6.04473(raceAsian) − 0.98447(raceBlack or African American) − 1.74866(raceNative Hawaiian or Other Pacific Islander)−1.97130(raceWhite)+77.09613(height)+2.52714(helmet_12mid did not ride)+1.67672(helmet_12most of time)+1.46826(helmet_12never)+1.43193(helmet_12rarely)+0.70160(helmet_12sometimes)−0.32707(physically_active_7d)−0.20467(hours_tv_per_school_day1)+0.49485(hours_tv_per_school_day2)+1.45271(hours_tv_per_school_day3)+1.29749(hours_tv_per_school_day4)+2.03000(hours_tv_per_school_day5+)−1.02773(hours_tv_per_school_daydo not watch)−0.57974(school_night_hours_sleep10+)−1.39983(school_night_hours_sleep5)−1.82148(school_night_hours_sleep6)−2.75707(school_night_hours_sleep7)−2.69745(school_night_hours_sleep8)−3.91753(school_night_hours_sleep9)
Even though I do know how to create an equation of a multiple linear model (y = coefficient(predictor) + intercept), I used AI to create the equation because there are so many predictors in the model.
Analysis of Coefficients: Since there are so many coefficients, I’ll explain what they mean in general terms that should cover every type of coefficient in a multiple linear model.
A positive coefficient of a numerical variable means that for every one increase in that variable, weight increases by that number. On the other hand, a negative coefficient of a numerical variable means that for every one increase in that variable, weight decreases by that number.
A positive coefficient of a categorical variable means that the category within that variable has that much greater weight than the baseline of the variable. A negative coefficient of a categorical variable means that the category within that variable has that much less weight than the baseline of the variable.
For example, the category of “did not ride” in the helmet_12m variable has a coefficient of 2.52714. The baseline of the variable is “always”, which means that those who “did not ride” have, on average, 2.52714 more weight than those who always had a helmet when riding–this is when holding every other variable constant.
P-value of Entire Model: p-value: < 2.2e-16. Very significant.
Diagnostic Plots:
Residuals vs. Fitted: Most residuals are clustered in the middle. Most points are about the y=0 line, although residuals seem to be moving in a slightly downward pattern. Linearity and homoscedasticity are likely violated.
Q-Q Residuals: Points do not follow a straight, diagonal line. The left tail slightly deviates from the line, while the right tail heavily deviates, earlier than usual, from the line. This indicates a violation of normality.
Scale-Location: No equal spread and isn’t horizontal. The residuals seem to be increasing. This again indicates a violation of homoscedasticity.
Residuals vs Leverage: There seems to be one point all the way to the right which may be influential on the model, but it doesn’t seem to be much of a concern.
Check for Multicollinearity:
Here, I check for multicollinearity of the numerical variables by using the cor() command that I learned from Data 101.
cor(yrb_clean5[, c("age", "height", "physically_active_7d")], use = "complete.obs") # Code from Data 101## age height physically_active_7d
## age 1.00000000 0.1331966 -0.06990648
## height 0.13319663 1.0000000 0.20427737
## physically_active_7d -0.06990648 0.2042774 1.00000000
Numerical variables don’t seem to be too correlated. This model likely has no multicollinearity. Only slightly concerning correlation is between height and physically active for 7 days. This correlation is 20%; still, this correlation is quite low for anything too concerning.
Check for Independence
plot(resid(model5), type="b",
main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2) # code from Data 101The residuals in this diagnostic plot show points scattered randomly about the y = 0 line, although there seems to be more positive residuals shooting up more than negatives, which might be a concern. This suggests that we likely pass independence.
cols <- brewer.pal(6, "Pastel1") # Set the colors
box_plot <- data_to_boxplot(data = plot_one, weight, race, group_var = race) # Create the mapping
highchart() |> # High charters
hc_xAxis(type = "category") |> # Categorical Plot
hc_add_series_list(box_plot) |>
hc_title(text = "Box Plots of Race by Weight") |> # Title
hc_xAxis(title = list(text = "Race")) |> # X-label
hc_yAxis(title = list(text = "Weight")) |> # Y-label
hc_colors(cols) |> # Add colors
hc_caption(text= "Data provided by CDC") |> # Caption
hc_add_theme(hc_theme_darkunica()) # Theme. Check references for themeInterpretation: All of these races reside around the same median weights (50~70 kilograms). However, the highest median is the Black or African American race, with a median of 66.23 kilograms. The race with the lowest median are Asians, with a median of 58.97 kilograms.
Note: Use full screen on Tableau to view all elements.
Note for Professor: I’d like you to grade the highcharters plot above and the bar graph at the bottom right that shows sleep hours by average weight for my two graphs that you will grade.
Interpretation:
For this final project, I decided to go beyond the minimum requirement and create more than two graphs. This tableau dashboard includes 6 graphs on the YRBSS dataset. Each graph uses a significant variable from the multiple linear model as the x-axis and weight as the y-axis.
| Graph | Interpretation |
|---|---|
| Box Plot of Gender by Weight | Males have a higher box plot than females, signifying a higher median, higher middle 50%, and higher spread. Males have a median of 79.4 kilograms, while females have a median of 69.9 kilograms. This box plot is colored by the ages, and shows that older people weigh more. |
| Scatterplot of Height by Weight | As height increases, weight also increases. Furthermore, the colored dots represent age. We see that older teens have a higher height and weight than younger teens. |
| The Three Bar Graphs, Bottom Left | These show helmet 12m, physically active for 7 days, and hours of TV watched on school days by average weight. They all relatively show no large differences. |
| Bar Graph of Hours of Sleep on School Night by Average Weight | This one is especially interesting; for hours 5-9, the average weight sits around 66~68 kilograms. However, for hours 10+ and <5, the weight is higher (possibly even significant; a Post HOC test would be used to figure out which ones are significantly different). For 10+ hours, it is 69 kilograms. For <5 hours, it is 70 kilograms. Research suggests that short sleep durations are associated with obesity (Beccuti & Pannain, 2011). This research finding is evident in the bar graph; those who sleep less than 5 hours have a higher weight than those who sleep around 5-9 hours, on average. |
Here, I do a Tukey Post HOC test to figure out if the differences between those who sleep less than 5 hours is actually significant than other sleep durations.
First, I create the ANOVA test.
anova_result <- aov(weight ~ school_night_hours_sleep, data = yrb_clean) # Code from Data 101
summary(anova_result)## Df Sum Sq Mean Sq F value Pr(>F)
## school_night_hours_sleep 6 11333 1889 6.581 5.9e-07 ***
## Residuals 11474 3293032 287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1098 observations deleted due to missingness
P-Value < 0.05. This means that at least one of the group’s mean is significantly different from the rest.
Perform Tukey Test
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weight ~ school_night_hours_sleep, data = yrb_clean)
##
## $school_night_hours_sleep
## diff lwr upr p adj
## 10+-<5 -1.00448670 -4.567098 2.55812442 0.9817326
## 5-<5 -1.87893410 -4.050656 0.29278824 0.1414215
## 6-<5 -1.96381942 -3.939969 0.01233048 0.0527071
## 7-<5 -2.86243056 -4.776975 -0.94788582 0.0002120
## 8-<5 -2.83954341 -4.814784 -0.86430263 0.0004517
## 9-<5 -4.73801778 -7.276767 -2.19926899 0.0000008
## 5-10+ -0.87444739 -4.280026 2.53113106 0.9887944
## 6-10+ -0.95933272 -4.243652 2.32498687 0.9781071
## 7-10+ -1.85794386 -5.105569 1.38968117 0.6248265
## 8-10+ -1.83505671 -5.118829 1.44871595 0.6509751
## 9-10+ -3.73353108 -7.384124 -0.08293768 0.0411149
## 6-5 -0.08488533 -1.761470 1.59169902 0.9999991
## 7-5 -0.98349646 -2.587008 0.62001474 0.5421408
## 8-5 -0.96060932 -2.636122 0.71490336 0.6223861
## 9-5 -2.85908369 -5.172309 -0.54585822 0.0050099
## 7-6 -0.89861114 -2.225276 0.42805337 0.4162754
## 8-6 -0.87572399 -2.288570 0.53712182 0.5290992
## 9-6 -2.77419836 -4.904880 -0.64351711 0.0023770
## 8-7 0.02288715 -1.302423 1.34819708 1.0000000
## 9-7 -1.87558722 -3.949259 0.19808482 0.1067654
## 9-8 -1.89847437 -4.028312 0.23136372 0.1175002
Most significant differences: 7-<5, 8-<5, 9-<5, 9-10+, 9-5, 9-6
We see that the differences between those who get greater than 6 hours of sleep and those who get less than 5 hours of sleep is significant and the difference is negative. This statistically proves that those who sleep less than 5 hours do have a significantly higher weight than those sleep around 7-9 hours.
In conclusion, we answer our research question of what variables are most significant at predicting weight. The most significant variables are age, gender, race, height, helmet_12m, physically_active_7d, hours_tv_per_school_day, and school_night_hours_sleep. The multiple linear model explains about 29% of the variability in weight, and fails numerous assumptions of a linear model.
The visualizations prove worthy of valuable insight; the box plot of race by weight shows Black and African Americans having the highest median weight, whereas Asians have the lowest median weight. The bar graph–of average sleep hours on a school night by average weight–in Tableau shows the highest average weight belonging to those who sleep less than 5 hours and more than 10 hours, and the difference is significant. This finding is backed by research (Beccuti & Pannain, 2011).
For this project, I wish I could have seen the time series of this dataset. Since this dataset spans from 1991-2013, it would’ve been useful to see how weight has changed over the years. If the CDC does these surveys in the future (if they’re not already), they should include other variables such as smoking or not smoking, in a relationship or not, their GPA, etc. Such variables would allow for a more sophisticated and advanced analysis.
Beccuti, G., & Pannain, S. (2011). Sleep and obesity. Current opinion in clinical nutrition and metabolic care, 14(4), 402–412. https://doi.org/10.1097/MCO.0b013e3283479109 Dataset Source: http://www.cdc.gov/healthyyouth/data/yrbs/index.htm.
Dataset repository: https://www.openintro.org/data/index.php?data=yrbss
YAML Code Customization: https://stackoverflow.com/questions/48261379/rmarkdown-floating-toc-and-toc-at-beginning
ANOVA-test to check significance in multi-linear model Source: https://www.statalist.org/forums/forum/general-stata-discussion/general/956382-categorical-variables-in-backward-regression-model