#install.packages("NHANES")
#install.packages("scatterplot3d"
#install.packages("plotly")
Project 1: Unhealthy Individuals BMI(2009-2010)
Introduction:
For this data visualization, I decided to use the data-set NHANES which stands for the “National Health and Nutrition Examination Study”. Which this I was very overwhelmed with what I could do with this data-set. Originally I wanted to analyze Individuals with High Cholesterol and connect it with the leading causes which include obesity(being overweight), active smokers, active alcohol drinkers, people with diabetes, and those who are physically inactive. With that I was not able to do, but is something I would chose to look more into. What I did end of doing was trying to connect peoples weight and height, and identify peoples BMI, whether it being unhealthy or healthy BMI ranges.
Packages installed:
Library’s used:
library(NHANES)
Warning: package 'NHANES' was built under R version 4.4.3
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
Warning: package 'ggthemes' was built under R version 4.4.3
library(ggrepel)
Warning: package 'ggrepel' was built under R version 4.4.3
library(ggplot2)
library(scatterplot3d)
library(plotly)
Warning: package 'plotly' was built under R version 4.4.3
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
head(NHANES)
# A tibble: 6 × 76
ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education
<int> <fct> <fct> <int> <fct> <int> <fct> <fct> <fct>
1 51624 2009_10 male 34 " 30-39" 409 White <NA> High School
2 51624 2009_10 male 34 " 30-39" 409 White <NA> High School
3 51624 2009_10 male 34 " 30-39" 409 White <NA> High School
4 51625 2009_10 male 4 " 0-9" 49 Other <NA> <NA>
5 51630 2009_10 female 49 " 40-49" 596 White <NA> Some College
6 51638 2009_10 male 9 " 0-9" 115 White <NA> <NA>
# ℹ 67 more variables: MaritalStatus <fct>, HHIncome <fct>, HHIncomeMid <int>,
# Poverty <dbl>, HomeRooms <int>, HomeOwn <fct>, Work <fct>, Weight <dbl>,
# Length <dbl>, HeadCirc <dbl>, Height <dbl>, BMI <dbl>,
# BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>, BPSysAve <int>,
# BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>, BPSys2 <int>, BPDia2 <int>,
# BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>, DirectChol <dbl>,
# TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>, UrineVol2 <int>, …
<- NHANES nhanes_df
Filtering/Mutating unhealthy individuals
In here you can see I mutated Race to be in a factored order based on “Race1” variable. This was needed because I found that when I was trying to add Race1 as color for my plots, the code was not reading it since it was not numerical/in a group. For example Hispanics should all be grouped into one group and counted per how many times there is Hispanic, White, Black, etc
<- nhanes_df |>
unhealthy_Individuals filter(Diabetes %in% c("Yes")) |>
filter(SurveyYr %in% c("2009_10")) |>
filter(PhysActive %in% c("No")) |>
arrange(BMI) |>
mutate( Race = recode_factor(Race1,
`1` = "Hispanic",
`2` = "Hispanic",
`3` = "White",
`4` = "Black",
`5` = "Hispanic"))
unhealthy_Individuals
# A tibble: 242 × 77
ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education
<int> <fct> <fct> <int> <fct> <int> <fct> <fct> <fct>
1 57980 2009_10 female 80 <NA> NA Black <NA> High School
2 55909 2009_10 female 54 " 50-59" 653 Black <NA> Some College
3 53444 2009_10 female 75 " 70+" 907 Mexican <NA> 8th Grade
4 58614 2009_10 male 56 " 50-59" 678 Black <NA> 8th Grade
5 52414 2009_10 male 30 " 30-39" 366 White <NA> Some College
6 52585 2009_10 female 36 " 30-39" 435 White <NA> High School
7 52585 2009_10 female 36 " 30-39" 435 White <NA> High School
8 52795 2009_10 male 61 " 60-69" 740 Other <NA> Some College
9 59578 2009_10 female 80 <NA> NA White <NA> Some College
10 54021 2009_10 female 46 " 40-49" 558 White <NA> 8th Grade
# ℹ 232 more rows
# ℹ 68 more variables: MaritalStatus <fct>, HHIncome <fct>, HHIncomeMid <int>,
# Poverty <dbl>, HomeRooms <int>, HomeOwn <fct>, Work <fct>, Weight <dbl>,
# Length <dbl>, HeadCirc <dbl>, Height <dbl>, BMI <dbl>,
# BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>, BPSysAve <int>,
# BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>, BPSys2 <int>, BPDia2 <int>,
# BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>, DirectChol <dbl>, …
Filtering data for each Race for Linear Model
<- unhealthy_Individuals |>
just_Hispanic filter(Race %in% c("Hispanic"))
<- unhealthy_Individuals |>
just_White filter(Race %in% c("White"))
<- unhealthy_Individuals |>
just_Black filter(Race %in% c("Black"))
<- unhealthy_Individuals |>
just_Mexican filter(Race %in% c("Mexican"))
<- unhealthy_Individuals |>
just_Other filter(Race %in% c("Other"))
Creating Linear Models
<- lm(Weight~ Height, data = just_Hispanic)
lm_model_Hispanic <- lm(Weight~ Height, data = just_White)
lm_model_White <-lm(Weight~ Height, data = just_Black)
lm_model_Black <-lm(Weight~ Height, data = just_Mexican)
lm_model_Mexican <-lm(Weight~ Height, data = just_Other)
lm_model_Other
lm_model_Hispanic
Call:
lm(formula = Weight ~ Height, data = just_Hispanic)
Coefficients:
(Intercept) Height
-109.201 1.267
lm_model_White
Call:
lm(formula = Weight ~ Height, data = just_White)
Coefficients:
(Intercept) Height
-98.743 1.141
lm_model_Black
Call:
lm(formula = Weight ~ Height, data = just_Black)
Coefficients:
(Intercept) Height
-154.418 1.545
lm_model_Mexican
Call:
lm(formula = Weight ~ Height, data = just_Mexican)
Coefficients:
(Intercept) Height
-101.711 1.144
lm_model_Other
Call:
lm(formula = Weight ~ Height, data = just_Other)
Coefficients:
(Intercept) Height
-71.3413 0.9437
Statistical Analysis for Hispanics:
Based on the p-value = 0.05856 this data is statistically insignificant, as for the Adjusted R-Squared = 0.1777(17.77%) it can be concluded that 18% of the data on Hispanics can be explained by the Height.
<- summary(lm_model_Hispanic)
formulas_Hispanic formulas_Hispanic
Call:
lm(formula = Weight ~ Height, data = just_Hispanic)
Residuals:
Min 1Q Median 3Q Max
-25.106 -15.798 -3.984 2.456 87.967
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -109.2007 101.3262 -1.078 0.2994
Height 1.2666 0.6151 2.059 0.0586 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 27.24 on 14 degrees of freedom
Multiple R-squared: 0.2325, Adjusted R-squared: 0.1777
F-statistic: 4.241 on 1 and 14 DF, p-value: 0.05856
Statistical Analysis for Whites:
Based on the p-value = 5.529e-08 this data is statistically insignificant, as for the Adjusted R-Squared = 0.2019(20.19%) it can be concluded that 20% of the data on Whites can be explained by the Height.
<- summary(lm_model_White)
formulas_White formulas_White
Call:
lm(formula = Weight ~ Height, data = just_White)
Residuals:
Min 1Q Median 3Q Max
-37.994 -14.942 -3.269 12.993 88.252
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -98.7431 33.1808 -2.976 0.0035 **
Height 1.1407 0.1974 5.777 5.53e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22.33 on 127 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.2081, Adjusted R-squared: 0.2019
F-statistic: 33.38 on 1 and 127 DF, p-value: 5.529e-08
Statistical Analysis for Blacks:
Based on the p-value = 0.008603 this data is statistically insignificant, as for the Adjusted R-Squared = 0.1196(11.96%) it can be concluded that 12% of the data on Blacks can be explained by the Height.
<- summary(lm_model_Black)
formulas_Black formulas_Black
Call:
lm(formula = Weight ~ Height, data = just_Black)
Residuals:
Min 1Q Median 3Q Max
-57.18 -24.83 -13.68 17.51 124.77
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -154.4184 94.4638 -1.635 0.1088
Height 1.5451 0.5634 2.742 0.0086 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 42.12 on 47 degrees of freedom
Multiple R-squared: 0.1379, Adjusted R-squared: 0.1196
F-statistic: 7.521 on 1 and 47 DF, p-value: 0.008603
Statistical Analysis for Mexicans:
Based on the p-value = 0.0002877 this data is statistically significant, as for the Adjusted R-Squared = 0.3577 (35.77%) it can be concluded that 35% of the data on Mexicans can be explained by the Height.
<- summary(lm_model_Mexican)
formulas_Mexican formulas_Mexican
Call:
lm(formula = Weight ~ Height, data = just_Mexican)
Residuals:
Min 1Q Median 3Q Max
-22.113 -10.311 -2.341 3.697 51.855
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -101.7111 45.5355 -2.234 0.033673 *
Height 1.1445 0.2764 4.141 0.000288 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.97 on 28 degrees of freedom
Multiple R-squared: 0.3798, Adjusted R-squared: 0.3577
F-statistic: 17.15 on 1 and 28 DF, p-value: 0.0002877
Statistical Analysis for Others:
Based on the p-value = 0.0582 this data is statistically significant, as for the Adjusted R-Squared = 0.2065(20.65%) it can be concluded that 21% of the data on Others can be explained by the Height.
<- summary(lm_model_Other)
formulas_Other formulas_Other
Call:
lm(formula = Weight ~ Height, data = just_Other)
Residuals:
Min 1Q Median 3Q Max
-24.218 -11.293 5.205 11.536 17.182
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -71.3413 72.6014 -0.983 0.3452
Height 0.9437 0.4508 2.094 0.0582 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.48 on 12 degrees of freedom
Multiple R-squared: 0.2675, Adjusted R-squared: 0.2065
F-statistic: 4.383 on 1 and 12 DF, p-value: 0.0582
This first plot was to get a basis of the individuals Weight over Height…this is a facet scatter plot separated by race. After looking at the adjusted R-squared values, as well as the p-value I can conclude the graph and calulations seems accurate due to the data being self reported. I also noted that there was an uneven amount of data collected per race, which is why the p-values can be insignificant in the data report.
<-
plot1 ggplot(unhealthy_Individuals, aes(x=Height, y = Weight))+
geom_point(color = "blue")+
geom_smooth() +
facet_wrap(~ Race)
plot1
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 4 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
This is a 3D Scatter plot made to connect the 242 unhealthy individuals weight, height, and BMI
#variable cols is a vector of hand-picked colors that will be used to indicated different races
<- c("lightgreen", "lightblue", "yellow","pink","purple")
cols
#3D scatter plot aesthetics
with(unhealthy_Individuals,
scatterplot3d(Height,
BMI,
Weight, main="Unhealthy Indiviuals Body measurements(2009-2010)",
xlab = "Height (cm)",
ylab = "BMI (kg)",
zlab = "Weight (kg)",
pch = 16, color=cols[as.numeric(unhealthy_Individuals$Race)]))
mtext("Source: http://www.cdc.gov/nchs/nhanes.htm")
#adding legend for the races (not pch stands for plot character size)
legend("topright", legend = levels(unhealthy_Individuals$Race),
col = c("lightgreen", "lightblue", "yellow","pink","purple"), pch = 16)
This is an extra interactive visualization using library plotly to re-demonstrate the body measurements of unhealthy individuals by race.
|>
unhealthy_Individuals plot_ly(x = ~Height,y = ~BMI, z = ~Weight,color = ~Race, colors = cols)
No trace type specified:
Based on info supplied, a 'scatter3d' trace seems appropriate.
Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
No scatter3d mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Warning: Ignoring 4 observations
Sources:
How I made a 3D scatter plot:
https://www.sthda.com/english/wiki/scatterplot3d-3d-graphics-r-software-and-data-visualization#google_vignette(this site taught me about scatter3DPlot)
How to change the colors of the legends:
https://r-charts.com/correlation/scatter-plot-group/ (this is where I found color=cols[as.numeric(unhealthy_Individuals$Race)]))…this code itself is how you change the Races by color
How to use use plotly as a scatter plot:
https://www.youtube.com/watch?v=z_Im3ME28k8
https://plotly.com/r/3d-scatter-plots/