Project 1: Unhealthy Individuals BMI(2009-2010)

Author

J Gazmen

Introduction:

For this data visualization, I decided to use the data-set NHANES which stands for the “National Health and Nutrition Examination Study”. Which this I was very overwhelmed with what I could do with this data-set. Originally I wanted to analyze Individuals with High Cholesterol and connect it with the leading causes which include obesity(being overweight), active smokers, active alcohol drinkers, people with diabetes, and those who are physically inactive. With that I was not able to do, but is something I would chose to look more into. What I did end of doing was trying to connect peoples weight and height, and identify peoples BMI, whether it being unhealthy or healthy BMI ranges.

Packages installed:

#install.packages("NHANES")
#install.packages("scatterplot3d"
#install.packages("plotly")

Library’s used:

library(NHANES)
Warning: package 'NHANES' was built under R version 4.4.3
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
Warning: package 'ggthemes' was built under R version 4.4.3
library(ggrepel)
Warning: package 'ggrepel' was built under R version 4.4.3
library(ggplot2)
library(scatterplot3d)
library(plotly)
Warning: package 'plotly' was built under R version 4.4.3

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
head(NHANES)
# A tibble: 6 × 76
     ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education   
  <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>       
1 51624 2009_10  male      34 " 30-39"        409 White <NA>  High School 
2 51624 2009_10  male      34 " 30-39"        409 White <NA>  High School 
3 51624 2009_10  male      34 " 30-39"        409 White <NA>  High School 
4 51625 2009_10  male       4 " 0-9"           49 Other <NA>  <NA>        
5 51630 2009_10  female    49 " 40-49"        596 White <NA>  Some College
6 51638 2009_10  male       9 " 0-9"          115 White <NA>  <NA>        
# ℹ 67 more variables: MaritalStatus <fct>, HHIncome <fct>, HHIncomeMid <int>,
#   Poverty <dbl>, HomeRooms <int>, HomeOwn <fct>, Work <fct>, Weight <dbl>,
#   Length <dbl>, HeadCirc <dbl>, Height <dbl>, BMI <dbl>,
#   BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>, BPSysAve <int>,
#   BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>, BPSys2 <int>, BPDia2 <int>,
#   BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>, DirectChol <dbl>,
#   TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>, UrineVol2 <int>, …
nhanes_df <- NHANES

Filtering/Mutating unhealthy individuals

In here you can see I mutated Race to be in a factored order based on “Race1” variable. This was needed because I found that when I was trying to add Race1 as color for my plots, the code was not reading it since it was not numerical/in a group. For example Hispanics should all be grouped into one group and counted per how many times there is Hispanic, White, Black, etc

unhealthy_Individuals <- nhanes_df |>
  filter(Diabetes %in% c("Yes")) |>
  filter(SurveyYr %in% c("2009_10")) |> 
  filter(PhysActive %in% c("No")) |>
  arrange(BMI) |>
  mutate( Race = recode_factor(Race1, 
                         `1` = "Hispanic", 
                         `2` = "Hispanic", 
                         `3` = "White", 
                         `4` = "Black", 
                         `5` = "Hispanic"))
  
unhealthy_Individuals  
# A tibble: 242 × 77
      ID SurveyYr Gender   Age AgeDecade AgeMonths Race1   Race3 Education   
   <int> <fct>    <fct>  <int> <fct>         <int> <fct>   <fct> <fct>       
 1 57980 2009_10  female    80  <NA>            NA Black   <NA>  High School 
 2 55909 2009_10  female    54 " 50-59"        653 Black   <NA>  Some College
 3 53444 2009_10  female    75 " 70+"          907 Mexican <NA>  8th Grade   
 4 58614 2009_10  male      56 " 50-59"        678 Black   <NA>  8th Grade   
 5 52414 2009_10  male      30 " 30-39"        366 White   <NA>  Some College
 6 52585 2009_10  female    36 " 30-39"        435 White   <NA>  High School 
 7 52585 2009_10  female    36 " 30-39"        435 White   <NA>  High School 
 8 52795 2009_10  male      61 " 60-69"        740 Other   <NA>  Some College
 9 59578 2009_10  female    80  <NA>            NA White   <NA>  Some College
10 54021 2009_10  female    46 " 40-49"        558 White   <NA>  8th Grade   
# ℹ 232 more rows
# ℹ 68 more variables: MaritalStatus <fct>, HHIncome <fct>, HHIncomeMid <int>,
#   Poverty <dbl>, HomeRooms <int>, HomeOwn <fct>, Work <fct>, Weight <dbl>,
#   Length <dbl>, HeadCirc <dbl>, Height <dbl>, BMI <dbl>,
#   BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>, BPSysAve <int>,
#   BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>, BPSys2 <int>, BPDia2 <int>,
#   BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>, DirectChol <dbl>, …

Filtering data for each Race for Linear Model

just_Hispanic <- unhealthy_Individuals |>
  filter(Race %in% c("Hispanic"))

just_White <- unhealthy_Individuals |>
  filter(Race %in% c("White"))

just_Black <- unhealthy_Individuals |>
  filter(Race %in% c("Black"))

just_Mexican <- unhealthy_Individuals |>
  filter(Race %in% c("Mexican"))

just_Other <- unhealthy_Individuals |>
  filter(Race %in% c("Other"))

Creating Linear Models

lm_model_Hispanic <- lm(Weight~ Height, data = just_Hispanic)
lm_model_White <- lm(Weight~ Height, data = just_White)
lm_model_Black <-lm(Weight~ Height, data = just_Black)
lm_model_Mexican <-lm(Weight~ Height, data = just_Mexican)
lm_model_Other <-lm(Weight~ Height, data = just_Other)

lm_model_Hispanic

Call:
lm(formula = Weight ~ Height, data = just_Hispanic)

Coefficients:
(Intercept)       Height  
   -109.201        1.267  
lm_model_White

Call:
lm(formula = Weight ~ Height, data = just_White)

Coefficients:
(Intercept)       Height  
    -98.743        1.141  
lm_model_Black

Call:
lm(formula = Weight ~ Height, data = just_Black)

Coefficients:
(Intercept)       Height  
   -154.418        1.545  
lm_model_Mexican

Call:
lm(formula = Weight ~ Height, data = just_Mexican)

Coefficients:
(Intercept)       Height  
   -101.711        1.144  
lm_model_Other

Call:
lm(formula = Weight ~ Height, data = just_Other)

Coefficients:
(Intercept)       Height  
   -71.3413       0.9437  

Statistical Analysis for Hispanics:

Based on the p-value = 0.05856 this data is statistically insignificant, as for the Adjusted R-Squared = 0.1777(17.77%) it can be concluded that 18% of the data on Hispanics can be explained by the Height.

formulas_Hispanic <- summary(lm_model_Hispanic)
formulas_Hispanic

Call:
lm(formula = Weight ~ Height, data = just_Hispanic)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.106 -15.798  -3.984   2.456  87.967 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -109.2007   101.3262  -1.078   0.2994  
Height         1.2666     0.6151   2.059   0.0586 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27.24 on 14 degrees of freedom
Multiple R-squared:  0.2325,    Adjusted R-squared:  0.1777 
F-statistic: 4.241 on 1 and 14 DF,  p-value: 0.05856

Statistical Analysis for Whites:

Based on the p-value = 5.529e-08 this data is statistically insignificant, as for the Adjusted R-Squared = 0.2019(20.19%) it can be concluded that 20% of the data on Whites can be explained by the Height.

formulas_White <- summary(lm_model_White)
formulas_White

Call:
lm(formula = Weight ~ Height, data = just_White)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.994 -14.942  -3.269  12.993  88.252 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -98.7431    33.1808  -2.976   0.0035 ** 
Height        1.1407     0.1974   5.777 5.53e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.33 on 127 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.2081,    Adjusted R-squared:  0.2019 
F-statistic: 33.38 on 1 and 127 DF,  p-value: 5.529e-08

Statistical Analysis for Blacks:

Based on the p-value = 0.008603 this data is statistically insignificant, as for the Adjusted R-Squared = 0.1196(11.96%) it can be concluded that 12% of the data on Blacks can be explained by the Height.

formulas_Black <- summary(lm_model_Black)
formulas_Black

Call:
lm(formula = Weight ~ Height, data = just_Black)

Residuals:
   Min     1Q Median     3Q    Max 
-57.18 -24.83 -13.68  17.51 124.77 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) -154.4184    94.4638  -1.635   0.1088   
Height         1.5451     0.5634   2.742   0.0086 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42.12 on 47 degrees of freedom
Multiple R-squared:  0.1379,    Adjusted R-squared:  0.1196 
F-statistic: 7.521 on 1 and 47 DF,  p-value: 0.008603

Statistical Analysis for Mexicans:

Based on the p-value = 0.0002877 this data is statistically significant, as for the Adjusted R-Squared = 0.3577 (35.77%) it can be concluded that 35% of the data on Mexicans can be explained by the Height.

formulas_Mexican <- summary(lm_model_Mexican)
formulas_Mexican

Call:
lm(formula = Weight ~ Height, data = just_Mexican)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.113 -10.311  -2.341   3.697  51.855 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -101.7111    45.5355  -2.234 0.033673 *  
Height         1.1445     0.2764   4.141 0.000288 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.97 on 28 degrees of freedom
Multiple R-squared:  0.3798,    Adjusted R-squared:  0.3577 
F-statistic: 17.15 on 1 and 28 DF,  p-value: 0.0002877

Statistical Analysis for Others:

Based on the p-value = 0.0582 this data is statistically significant, as for the Adjusted R-Squared = 0.2065(20.65%) it can be concluded that 21% of the data on Others can be explained by the Height.

formulas_Other <- summary(lm_model_Other)
formulas_Other

Call:
lm(formula = Weight ~ Height, data = just_Other)

Residuals:
    Min      1Q  Median      3Q     Max 
-24.218 -11.293   5.205  11.536  17.182 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -71.3413    72.6014  -0.983   0.3452  
Height        0.9437     0.4508   2.094   0.0582 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.48 on 12 degrees of freedom
Multiple R-squared:  0.2675,    Adjusted R-squared:  0.2065 
F-statistic: 4.383 on 1 and 12 DF,  p-value: 0.0582

This first plot was to get a basis of the individuals Weight over Height…this is a facet scatter plot separated by race. After looking at the adjusted R-squared values, as well as the p-value I can conclude the graph and calulations seems accurate due to the data being self reported. I also noted that there was an uneven amount of data collected per race, which is why the p-values can be insignificant in the data report.

plot1 <-
  ggplot(unhealthy_Individuals, aes(x=Height, y = Weight))+
  geom_point(color = "blue")+
  geom_smooth() +
  facet_wrap(~ Race)
plot1
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 4 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).

This is a 3D Scatter plot made to connect the 242 unhealthy individuals weight, height, and BMI

#variable cols is a vector of hand-picked colors that will be used to indicated different races
cols <- c("lightgreen", "lightblue", "yellow","pink","purple")

#3D scatter plot aesthetics 
with(unhealthy_Individuals, 
     scatterplot3d(Height,
                   BMI, 
                   Weight, 
                   main="Unhealthy Indiviuals Body measurements(2009-2010)",
                   xlab = "Height (cm)",
                   ylab = "BMI (kg)",
                   zlab = "Weight (kg)",
                   pch = 16, color=cols[as.numeric(unhealthy_Individuals$Race)]))
                   mtext("Source: http://www.cdc.gov/nchs/nhanes.htm")
                   
#adding legend for the races (not pch stands for plot character size)
legend("topright", legend = levels(unhealthy_Individuals$Race),
      col =  c("lightgreen", "lightblue", "yellow","pink","purple"), pch = 16)

This is an extra interactive visualization using library plotly to re-demonstrate the body measurements of unhealthy individuals by race.

unhealthy_Individuals |>
  plot_ly(x = ~Height,y = ~BMI, z = ~Weight,color = ~Race, colors = cols)
No trace type specified:
  Based on info supplied, a 'scatter3d' trace seems appropriate.
  Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
No scatter3d mode specifed:
  Setting the mode to markers
  Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Warning: Ignoring 4 observations

Sources:

How I made a 3D scatter plot:

https://www.sthda.com/english/wiki/scatterplot3d-3d-graphics-r-software-and-data-visualization#google_vignette(this site taught me about scatter3DPlot)

How to change the colors of the legends:

https://r-charts.com/correlation/scatter-plot-group/ (this is where I found color=cols[as.numeric(unhealthy_Individuals$Race)]))…this code itself is how you change the Races by color

How to use use plotly as a scatter plot:

https://www.youtube.com/watch?v=z_Im3ME28k8

https://plotly.com/r/3d-scatter-plots/