Analyzing Health Outcomes and Lifestyle Factors Using NHANES Data

Author

Allenteena Bernard

The National Health and Nutrition Examination Survey (NHANES) provides a comprehensive dataset on the health and nutritional status of adults and children in the United States. My topic revolves around understanding how different lifestyle factors, such as smoking, physical activity, and diet, influence health indicators like Body Mass Index (BMI), blood pressure, and cholesterol levels.

Variables: The key variables in my dataset include:

Quantitative Variables:
Age
BMI (Body Mass Index)
Blood pressure (systolic and diastolic)
Cholesterol levels (HDL, LDL, total cholesterol)
Glucose levels

Categorical Variables:
Gender
Race/Ethnicity
Smoking status (current smoker, former smoker, never smoked)
Alcohol consumption
Dietary intake categories

Visualizations:

One of my main visualizations is a scatter plot that depicts the relationship between Age and BMI, color-coded by smoking status. This visualization helps illustrate how BMI varies across different age groups and smoking statuses. Another key visualization is a box plot comparing BMI distributions across different racial/ethnic groups, providing insights into health disparities.

Numerous studies have demonstrated the significant impact of lifestyle factors on health outcomes. For instance, smoking has been linked to a higher risk of cardiovascular diseases, reduced lung function, and various cancers. Physical activity, on the other hand, is associated with numerous health benefits, including lower risks of obesity, heart disease, and diabetes. Dietary habits also play a crucial role in determining an individual’s health status, influencing everything from weight to cholesterol levels and blood pressure.

Research has shown that smoking can suppress appetite and reduce body weight, leading to lower BMI among smokers. However, smoking also increases the risk of cardiovascular diseases and other serious health conditions. (Source: Centers for Disease Control and Prevention, CDC). Physical activity is well-documented to improve cardiovascular health, reduce BMI, and enhance overall well-being.

I chose this topic because of my interest in public health and the significant role that lifestyle choices play in determining health outcomes. Understanding these relationships can inform targeted public health interventions aimed at reducing the prevalence of chronic diseases and improving population health. The NHANES dataset provides a comprehensive and reliable source of data that allows for in-depth analysis of these critical health issues.

Get working directory

getwd()

[1] "C:/Users/cbash/OneDrive/Desktop/DATA 110"

Load the dataset and libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(readr)
library(highcharter)

Warning: package 'highcharter' was built under R version 4.4.1

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

setwd("C:/Users/cbash/OneDrive/Desktop/DATA 110")
nhanes_data <- read_csv("nhanes.csv")

Rows: 10000 Columns: 76
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (31): SurveyYr, Gender, AgeDecade, Race1, Race3, Education, MaritalStatu...
dbl (45): ID, Age, AgeMonths, HHIncomeMid, Poverty, HomeRooms, Weight, Lengt...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Cleaning and Exploration

# Clean the column names: convert to lowercase and replace spaces with underscores

names(nhanes_data) <- tolower(names(nhanes_data))
names(nhanes_data) <- gsub(" ","_",names(nhanes_data))
head(nhanes_data)

# A tibble: 6 × 76
     id surveyyr gender   age agedecade agemonths race1 race3 education   
  <dbl> <chr>    <chr>  <dbl> <chr>         <dbl> <chr> <chr> <chr>       
1 51624 2009_10  male      34 30-39           409 White <NA>  High School 
2 51624 2009_10  male      34 30-39           409 White <NA>  High School 
3 51624 2009_10  male      34 30-39           409 White <NA>  High School 
4 51625 2009_10  male       4 0-9              49 Other <NA>  <NA>        
5 51630 2009_10  female    49 40-49           596 White <NA>  Some College
6 51638 2009_10  male       9 0-9             115 White <NA>  <NA>        
# ℹ 67 more variables: maritalstatus <chr>, hhincome <chr>, hhincomemid <dbl>,
#   poverty <dbl>, homerooms <dbl>, homeown <chr>, work <chr>, weight <dbl>,
#   length <dbl>, headcirc <dbl>, height <dbl>, bmi <dbl>,
#   bmicatunder20yrs <chr>, bmi_who <chr>, pulse <dbl>, bpsysave <dbl>,
#   bpdiaave <dbl>, bpsys1 <dbl>, bpdia1 <dbl>, bpsys2 <dbl>, bpdia2 <dbl>,
#   bpsys3 <dbl>, bpdia3 <dbl>, testosterone <dbl>, directchol <dbl>,
#   totchol <dbl>, urinevol1 <dbl>, urineflow1 <dbl>, urinevol2 <dbl>, …

# Remove rows with missing values in key columns
nhanes_clean <- nhanes_data %>%
  filter(!is.na(age), !is.na(gender), !is.na(bpsysave), !is.na(totchol))
# Inspect the structure of the full (unfiltered) dataset
str(nhanes_data)

spc_tbl_ [10,000 × 76] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ id              : num [1:10000] 51624 51624 51624 51625 51630 ...
 $ surveyyr        : chr [1:10000] "2009_10" "2009_10" "2009_10" "2009_10" ...
 $ gender          : chr [1:10000] "male" "male" "male" "male" ...
 $ age             : num [1:10000] 34 34 34 4 49 9 8 45 45 45 ...
 $ agedecade       : chr [1:10000] "30-39" "30-39" "30-39" "0-9" ...
 $ agemonths       : num [1:10000] 409 409 409 49 596 115 101 541 541 541 ...
 $ race1           : chr [1:10000] "White" "White" "White" "Other" ...
 $ race3           : chr [1:10000] NA NA NA NA ...
 $ education       : chr [1:10000] "High School" "High School" "High School" NA ...
 $ maritalstatus   : chr [1:10000] "Married" "Married" "Married" NA ...
 $ hhincome        : chr [1:10000] "25000-34999" "25000-34999" "25000-34999" "20000-24999" ...
 $ hhincomemid     : num [1:10000] 30000 30000 30000 22500 40000 87500 60000 87500 87500 87500 ...
 $ poverty         : num [1:10000] 1.36 1.36 1.36 1.07 1.91 1.84 2.33 5 5 5 ...
 $ homerooms       : num [1:10000] 6 6 6 9 5 6 7 6 6 6 ...
 $ homeown         : chr [1:10000] "Own" "Own" "Own" "Own" ...
 $ work            : chr [1:10000] "NotWorking" "NotWorking" "NotWorking" NA ...
 $ weight          : num [1:10000] 87.4 87.4 87.4 17 86.7 29.8 35.2 75.7 75.7 75.7 ...
 $ length          : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ headcirc        : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ height          : num [1:10000] 165 165 165 105 168 ...
 $ bmi             : num [1:10000] 32.2 32.2 32.2 15.3 30.6 ...
 $ bmicatunder20yrs: chr [1:10000] NA NA NA NA ...
 $ bmi_who         : chr [1:10000] "30.0_plus" "30.0_plus" "30.0_plus" "12.0_18.5" ...
 $ pulse           : num [1:10000] 70 70 70 NA 86 82 72 62 62 62 ...
 $ bpsysave        : num [1:10000] 113 113 113 NA 112 86 107 118 118 118 ...
 $ bpdiaave        : num [1:10000] 85 85 85 NA 75 47 37 64 64 64 ...
 $ bpsys1          : num [1:10000] 114 114 114 NA 118 84 114 106 106 106 ...
 $ bpdia1          : num [1:10000] 88 88 88 NA 82 50 46 62 62 62 ...
 $ bpsys2          : num [1:10000] 114 114 114 NA 108 84 108 118 118 118 ...
 $ bpdia2          : num [1:10000] 88 88 88 NA 74 50 36 68 68 68 ...
 $ bpsys3          : num [1:10000] 112 112 112 NA 116 88 106 118 118 118 ...
 $ bpdia3          : num [1:10000] 82 82 82 NA 76 44 38 60 60 60 ...
 $ testosterone    : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ directchol      : num [1:10000] 1.29 1.29 1.29 NA 1.16 1.34 1.55 2.12 2.12 2.12 ...
 $ totchol         : num [1:10000] 3.49 3.49 3.49 NA 6.7 4.86 4.09 5.82 5.82 5.82 ...
 $ urinevol1       : num [1:10000] 352 352 352 NA 77 123 238 106 106 106 ...
 $ urineflow1      : num [1:10000] NA NA NA NA 0.094 ...
 $ urinevol2       : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ urineflow2      : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ diabetes        : chr [1:10000] "No" "No" "No" "No" ...
 $ diabetesage     : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ healthgen       : chr [1:10000] "Good" "Good" "Good" NA ...
 $ daysphyshlthbad : num [1:10000] 0 0 0 NA 0 NA NA 0 0 0 ...
 $ daysmenthlthbad : num [1:10000] 15 15 15 NA 10 NA NA 3 3 3 ...
 $ littleinterest  : chr [1:10000] "Most" "Most" "Most" NA ...
 $ depressed       : chr [1:10000] "Several" "Several" "Several" NA ...
 $ npregnancies    : num [1:10000] NA NA NA NA 2 NA NA 1 1 1 ...
 $ nbabies         : num [1:10000] NA NA NA NA 2 NA NA NA NA NA ...
 $ age1stbaby      : num [1:10000] NA NA NA NA 27 NA NA NA NA NA ...
 $ sleephrsnight   : num [1:10000] 4 4 4 NA 8 NA NA 8 8 8 ...
 $ sleeptrouble    : chr [1:10000] "Yes" "Yes" "Yes" NA ...
 $ physactive      : chr [1:10000] "No" "No" "No" NA ...
 $ physactivedays  : num [1:10000] NA NA NA NA NA NA NA 5 5 5 ...
 $ tvhrsday        : chr [1:10000] NA NA NA NA ...
 $ comphrsday      : chr [1:10000] NA NA NA NA ...
 $ tvhrsdaychild   : num [1:10000] NA NA NA 4 NA 5 1 NA NA NA ...
 $ comphrsdaychild : num [1:10000] NA NA NA 1 NA 0 6 NA NA NA ...
 $ alcohol12plusyr : chr [1:10000] "Yes" "Yes" "Yes" NA ...
 $ alcoholday      : num [1:10000] NA NA NA NA 2 NA NA 3 3 3 ...
 $ alcoholyear     : num [1:10000] 0 0 0 NA 20 NA NA 52 52 52 ...
 $ smokenow        : chr [1:10000] "No" "No" "No" NA ...
 $ smoke100        : chr [1:10000] "Yes" "Yes" "Yes" NA ...
 $ smoke100n       : chr [1:10000] "Smoker" "Smoker" "Smoker" NA ...
 $ smokeage        : num [1:10000] 18 18 18 NA 38 NA NA NA NA NA ...
 $ marijuana       : chr [1:10000] "Yes" "Yes" "Yes" NA ...
 $ agefirstmarij   : num [1:10000] 17 17 17 NA 18 NA NA 13 13 13 ...
 $ regularmarij    : chr [1:10000] "No" "No" "No" NA ...
 $ ageregmarij     : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ harddrugs       : chr [1:10000] "Yes" "Yes" "Yes" NA ...
 $ sexever         : chr [1:10000] "Yes" "Yes" "Yes" NA ...
 $ sexage          : num [1:10000] 16 16 16 NA 12 NA NA 13 13 13 ...
 $ sexnumpartnlife : num [1:10000] 8 8 8 NA 10 NA NA 20 20 20 ...
 $ sexnumpartyear  : num [1:10000] 1 1 1 NA 1 NA NA 0 0 0 ...
 $ samesex         : chr [1:10000] "No" "No" "No" NA ...
 $ sexorientation  : chr [1:10000] "Heterosexual" "Heterosexual" "Heterosexual" NA ...
 $ pregnantnow     : chr [1:10000] NA NA NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   ID = col_double(),
  ..   SurveyYr = col_character(),
  ..   Gender = col_character(),
  ..   Age = col_double(),
  ..   AgeDecade = col_character(),
  ..   AgeMonths = col_double(),
  ..   Race1 = col_character(),
  ..   Race3 = col_character(),
  ..   Education = col_character(),
  ..   MaritalStatus = col_character(),
  ..   HHIncome = col_character(),
  ..   HHIncomeMid = col_double(),
  ..   Poverty = col_double(),
  ..   HomeRooms = col_double(),
  ..   HomeOwn = col_character(),
  ..   Work = col_character(),
  ..   Weight = col_double(),
  ..   Length = col_double(),
  ..   HeadCirc = col_double(),
  ..   Height = col_double(),
  ..   BMI = col_double(),
  ..   BMICatUnder20yrs = col_character(),
  ..   BMI_WHO = col_character(),
  ..   Pulse = col_double(),
  ..   BPSysAve = col_double(),
  ..   BPDiaAve = col_double(),
  ..   BPSys1 = col_double(),
  ..   BPDia1 = col_double(),
  ..   BPSys2 = col_double(),
  ..   BPDia2 = col_double(),
  ..   BPSys3 = col_double(),
  ..   BPDia3 = col_double(),
  ..   Testosterone = col_double(),
  ..   DirectChol = col_double(),
  ..   TotChol = col_double(),
  ..   UrineVol1 = col_double(),
  ..   UrineFlow1 = col_double(),
  ..   UrineVol2 = col_double(),
  ..   UrineFlow2 = col_double(),
  ..   Diabetes = col_character(),
  ..   DiabetesAge = col_double(),
  ..   HealthGen = col_character(),
  ..   DaysPhysHlthBad = col_double(),
  ..   DaysMentHlthBad = col_double(),
  ..   LittleInterest = col_character(),
  ..   Depressed = col_character(),
  ..   nPregnancies = col_double(),
  ..   nBabies = col_double(),
  ..   Age1stBaby = col_double(),
  ..   SleepHrsNight = col_double(),
  ..   SleepTrouble = col_character(),
  ..   PhysActive = col_character(),
  ..   PhysActiveDays = col_double(),
  ..   TVHrsDay = col_character(),
  ..   CompHrsDay = col_character(),
  ..   TVHrsDayChild = col_double(),
  ..   CompHrsDayChild = col_double(),
  ..   Alcohol12PlusYr = col_character(),
  ..   AlcoholDay = col_double(),
  ..   AlcoholYear = col_double(),
  ..   SmokeNow = col_character(),
  ..   Smoke100 = col_character(),
  ..   Smoke100n = col_character(),
  ..   SmokeAge = col_double(),
  ..   Marijuana = col_character(),
  ..   AgeFirstMarij = col_double(),
  ..   RegularMarij = col_character(),
  ..   AgeRegMarij = col_double(),
  ..   HardDrugs = col_character(),
  ..   SexEver = col_character(),
  ..   SexAge = col_double(),
  ..   SexNumPartnLife = col_double(),
  ..   SexNumPartYear = col_double(),
  ..   SameSex = col_character(),
  ..   SexOrientation = col_character(),
  ..   PregnantNow = col_character()
  .. )
 - attr(*, "problems")=<externalptr>

Check summary statistics of quantitative variables

# Select relevant columns
nhanes_selected <- nhanes_clean %>%
  select(age, gender, bpsysave, totchol)
head(nhanes_selected)

# A tibble: 6 × 4
    age gender bpsysave totchol
  <dbl> <chr>     <dbl>   <dbl>
1    34 male        113    3.49
2    34 male        113    3.49
3    34 male        113    3.49
4    49 female      112    6.7 
5     9 male         86    4.86
6     8 male        107    4.09

# Summarize average systolic blood pressure by age (one row per year of age)
bp_summary <- nhanes_selected %>%
  group_by(age) %>%
  summarise(mean_BP = mean(bpsysave, na.rm = TRUE))
head(bp_summary)

# A tibble: 6 × 2
    age mean_BP
  <dbl>   <dbl>
1     8    97.1
2     9    98.6
3    10   101. 
4    11   102. 
5    12   102. 
6    13   106.

# Create a new column for age groups
# (the child cutoff is set to 18 so that ages 10-17 are not left as NA)
nhanes_age_group <- nhanes_selected %>%
  mutate(age_group = case_when(
    age < 18 ~ "child",
    age >= 18 & age < 65 ~ "adult",
    age >= 65 ~ "senior"
  ))
head(nhanes_age_group)

# A tibble: 6 × 5
    age gender bpsysave totchol age_group
  <dbl> <chr>     <dbl>   <dbl> <chr>    
1    34 male        113    3.49 adult    
2    34 male        113    3.49 adult    
3    34 male        113    3.49 adult    
4    49 female      112    6.7  adult    
5     9 male         86    4.86 child    
6     8 male        107    4.09 child

# Statistical analysis
# Linear regression of systolic blood pressure on age
lm_model <- lm(bpsysave ~ age, data = nhanes_clean)

# Diagnostic plots
par(mfrow = c(2, 2))
plot(lm_model)

These plots help assess the quality and assumptions of the regression model. The four default diagnostic plots are:

Residuals vs Fitted: This plot shows the residuals (errors) on the y-axis and the fitted values (predicted values) on the x-axis. It’s used to check for non-linearity and unequal error variances (heteroscedasticity).
Normal Q-Q: This plot shows the quantiles of the residuals against the quantiles of a normal distribution. It’s used to check if the residuals are normally distributed.
Scale-Location (Spread-Location): This plot shows the square root of the absolute standardized residuals against the fitted values. It’s used to check for homoscedasticity (constant variance of the residuals).
Residuals vs Leverage: This plot shows the standardized residuals against the leverage of each data point, with Cook’s distance contours. It helps identify influential observations that have a large effect on the model.
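
The quantities behind these four plots can also be extracted directly, which is useful for listing specific observations. The following is a minimal sketch that uses R's built-in `cars` data as a stand-in (an assumption for illustration only; `toy_model` and `diagnostics` are hypothetical names). Swapping in `lm_model` would give the same quantities for the NHANES fit.

```r
# Stand-in model on R's built-in `cars` data (illustrative only;
# replace `toy_model` with `lm_model` to inspect the NHANES fit).
toy_model <- lm(dist ~ speed, data = cars)

diagnostics <- data.frame(
  fitted    = fitted(toy_model),                 # x-axis of Residuals vs Fitted
  residual  = resid(toy_model),                  # y-axis of Residuals vs Fitted
  std_resid = rstandard(toy_model),              # y-axis of Normal Q-Q and Residuals vs Leverage
  sqrt_abs  = sqrt(abs(rstandard(toy_model))),   # y-axis of Scale-Location
  leverage  = hatvalues(toy_model),              # x-axis of Residuals vs Leverage
  cooks_d   = cooks.distance(toy_model)          # influence measure drawn as contours
)

# Observations with the largest Cook's distance are the most influential
head(diagnostics[order(-diagnostics$cooks_d), ], 3)
```

Sorting by Cook's distance is one common way to shortlist points worth a closer look; it does not by itself justify removing them.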

# Summary of the model
summary(lm_model)


Call:
lm(formula = bpsysave ~ age, data = nhanes_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-53.343  -9.223  -1.065   7.927 101.806 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.002e+02  3.812e-01   262.9   <2e-16 ***
age         4.357e-01  8.282e-03    52.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.74 on 7991 degrees of freedom
Multiple R-squared:  0.2572,    Adjusted R-squared:  0.2571 
F-statistic:  2767 on 1 and 7991 DF,  p-value: < 2.2e-16
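
To make the coefficients concrete, the sketch below plugs the rounded estimates from the summary output into the fitted line. The example ages are arbitrary choices for illustration, and the variable names are hypothetical.

```r
# Coefficients copied from the summary output above (rounded as printed)
intercept <- 100.2    # fitted systolic BP at age 0 (an extrapolation)
slope     <- 0.4357   # estimated change in systolic BP per year of age

# Predicted average systolic BP at a few example ages
ages <- c(20, 50, 80)
predicted_bp <- intercept + slope * ages
data.frame(age = ages, predicted_bp = predicted_bp)

# Estimated difference in systolic BP between people 10 years apart in age
ten_year_difference <- slope * 10
ten_year_difference

# Share of the variance in systolic BP explained by age (Multiple R-squared)
r_squared <- 0.2572
```

Under this model, a 10-year difference in age corresponds to roughly 4.4 mmHg higher average systolic blood pressure, and age alone accounts for about 26% of the variance. With the fitted object available, `predict(lm_model, newdata = data.frame(age = ages))` would return the predictions without the rounding.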

# Scatter Plot with Regression Line
library(ggplot2)
library(plotly)

Warning: package 'plotly' was built under R version 4.4.1


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

# Scatter plot of Age vs. Blood Pressure
scatter_plot <- ggplot(nhanes_clean, aes(x = age, y = bpsysave)) +
  geom_point(color = "turquoise") +
  geom_smooth(method = "lm", color = "orange") +
  labs(title = "Scatter Plot of Age vs. Blood Pressure",
       x = "Age",
       y = "Blood Pressure (Systolic)") +
  theme_minimal()

# Convert to plotly
ggplotly(scatter_plot)

`geom_smooth()` using formula = 'y ~ x'

Visualization 2: Bar Plot of Average Blood Pressure by Age

# Bar plot of average blood pressure by age group
bar_plot <- ggplot(bp_summary, aes(x = age, y = mean_BP, fill = age)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Blood Pressure by Age",
       x = "Age",
       y = "Average Blood Pressure (Systolic)") +
  scale_fill_gradient(low = "blue", high = "red") +
  theme_classic()

# Convert to plotly
ggplotly(bar_plot)

BMI and Systolic BP

# Load necessary libraries
library(readr)
library(dplyr)
library(ggplot2)
library(plotly)

# Ensure nhanes_clean is defined and loaded
# nhanes_clean <- read_csv("path_to_nhanes_data.csv")

# Clean the data
nhanes_clean <- nhanes_clean %>%
  filter(!is.na(smokenow), !is.na(bmi), !is.na(bpsysave)) %>%
  mutate(smokenow = factor(smokenow))

# Scatter plot of BMI vs Systolic BP
scatter_plot <- ggplot(nhanes_clean, aes(x = bmi, y = bpsysave, color = smokenow)) +
  geom_point(alpha = 0.6) +
  labs(title = "Scatter Plot: BMI vs Systolic BP by Smoking Status",
       x = "BMI",
       y = "Systolic BP",
       color = "Smoking Status") +
  theme_minimal()

# Convert to plotly
scatter_plotly <- ggplotly(scatter_plot)

# Display the plot
scatter_plotly

The scatter plot shows the relationship between Body Mass Index (BMI) and systolic blood pressure (bpsysave), colored by smoking status. In general, there appears to be a positive correlation between BMI and systolic blood pressure, indicating that individuals with higher BMI tend to have higher systolic blood pressure. This trend appears consistent across smoking statuses.

However, the data points reveal that smokers might exhibit a slightly different pattern compared to non-smokers. Smokers with high BMI tend to show a wider range of systolic blood pressure values, suggesting potential variability due to other lifestyle or genetic factors. This visualization underscores the importance of considering smoking status when analyzing the relationship between BMI and blood pressure, as it can provide nuanced insights into cardiovascular health risks.
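
The visual impressions above (a positive correlation, and a wider spread of blood pressure among high-BMI smokers) can be checked numerically. The sketch below uses simulated data as a hypothetical stand-in for `nhanes_clean`; the BMI cutoff of 30 is an assumption chosen for illustration, and the numbers it prints say nothing about NHANES itself.

```r
set.seed(110)  # reproducible toy data

# Hypothetical stand-in for nhanes_clean with the three columns used in the plot
toy <- data.frame(
  bmi      = rnorm(200, mean = 28, sd = 5),
  smokenow = factor(rep(c("No", "Yes"), each = 100))
)
toy$bpsysave <- 100 + 0.8 * toy$bmi + rnorm(200, sd = 12)

# Pearson correlation between BMI and systolic BP within each smoking group
cor_by_group <- sapply(split(toy, toy$smokenow),
                       function(d) cor(d$bmi, d$bpsysave))

# Spread of systolic BP among high-BMI respondents (BMI >= 30), by group
high_bmi <- toy[toy$bmi >= 30, ]
sd_by_group <- tapply(high_bmi$bpsysave, high_bmi$smokenow, sd)

cor_by_group
sd_by_group
```

Running the same two summaries on `nhanes_clean` would show whether the correlation really is similar across groups and whether the spread among high-BMI smokers is actually larger.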

Bar plot to compare the frequency of diabetes across different racial/ethnic groups.

# Clean the data
nhanes_clean <- nhanes_clean %>%
  filter(!is.na(diabetes), !is.na(race1)) %>%
  mutate(diabetes = factor(diabetes),
         race1 = factor(race1))

# Bar plot of diabetes frequency by race/ethnicity
bar_plot <- ggplot(nhanes_clean, aes(x = race1, fill = diabetes)) +
  geom_bar(position = "dodge") +
  labs(title = "Bar Plot: Diabetes Frequency by Race/Ethnicity",
       x = "Race/Ethnicity",
       y = "Count",
       fill = "Diabetes") +
  theme_minimal()

# Convert to plotly
bar_plotly <- ggplotly(bar_plot)

# Display the plot
bar_plotly

The bar plot compares the frequency of diabetes across racial and ethnic groups. The counts differ between groups. For instance, the plot may show that White respondents have higher counts of diabetes compared to other groups. However, because the racial and ethnic groups are not equally represented in the sample, raw counts alone cannot support a conclusion about which group has a higher rate of diabetes.

These disparities could be attributed to a variety of socio-economic, environmental, and genetic factors that influence health outcomes. The visualization highlights the need for targeted public health interventions and policies that address the specific needs of these higher-risk groups. By identifying and understanding these disparities, healthcare providers and policymakers can better allocate resources and develop strategies to mitigate the risk of diabetes in vulnerable populations.
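
Because the groups differ in size, within-group proportions are a fairer basis for comparison than raw counts. The sketch below uses invented counts with neutral group labels (all hypothetical, not NHANES results) to show how the largest count does not necessarily mean the highest rate.

```r
# Invented counts standing in for table(nhanes_clean$race1, nhanes_clean$diabetes)
counts <- matrix(
  c( 90, 10,
    450, 50,
     40, 10),
  ncol = 2, byrow = TRUE,
  dimnames = list(race1    = c("Group A", "Group B", "Group C"),
                  diabetes = c("No", "Yes"))
)

# Row-wise proportions: share of each group with and without diabetes
props <- prop.table(counts, margin = 1)
round(props, 3)
```

In this toy table, Group B has the most diabetes cases but the same rate as Group A, while the smallest group has the highest rate. In ggplot2, replacing `position = "dodge"` with `position = "fill"` in `geom_bar()` draws the same within-group proportions.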

Distribution of total cholesterol by smoking status.

# Clean the data
nhanes_clean <- nhanes_clean %>%
  filter(!is.na(totchol), !is.na(smokenow)) %>%
  mutate(smokenow = factor(smokenow))

# Density plot of total cholesterol by smoking status
density_plot <- ggplot(nhanes_clean, aes(x = totchol, fill = smokenow)) +
  geom_density(alpha = 0.6) +
  labs(title = "Density Plot: Total Cholesterol by Smoking Status",
       x = "Total Cholesterol",
       y = "Density",
       fill = "Smoking Status") +
  theme_minimal()

# Convert to plotly
density_plotly <- ggplotly(density_plot)

# Display the plot
density_plotly

The density plot of total cholesterol (totchol) by smoking status does not reveal a clear difference between smokers and non-smokers. The two distributions appear to overlap heavily, with similar centers and spreads. With this dataset, it therefore cannot be concluded that smoking status is associated with increased cholesterol levels, a known risk factor for cardiovascular diseases.
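
A visual comparison of two density curves can be backed by a numeric check. The sketch below uses simulated data as a hypothetical stand-in for `nhanes_clean` and compares the groups with per-group summaries and a Welch two-sample t-test. The choice of test is an assumption made here for illustration, not something the analysis above used.

```r
set.seed(110)  # reproducible toy data

# Hypothetical stand-in for nhanes_clean (total cholesterol in mmol/L)
toy <- data.frame(
  smokenow = factor(rep(c("No", "Yes"), each = 150)),
  totchol  = rnorm(300, mean = 5, sd = 1)
)

# Median and interquartile range of total cholesterol per group
group_summary <- aggregate(totchol ~ smokenow, data = toy,
                           FUN = function(x) c(median = median(x), iqr = IQR(x)))
group_summary

# Welch two-sample t-test for a difference in mean total cholesterol
t_result <- t.test(totchol ~ smokenow, data = toy)
t_result$p.value
```

A large p-value would be consistent with the overlapping curves, though it would not prove that the groups are identical.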

Compare BMI distribution across different racial/ethnic groups, faceted by smoking status.

# Clean the data
nhanes_clean <- nhanes_clean %>%
  filter(!is.na(bmi), !is.na(smokenow), !is.na(race1)) %>%
  mutate(smokenow = factor(smokenow),
         race1 = factor(race1))

# Faceted box plot of BMI by race/ethnicity and smoking status
faceted_plot <- ggplot(nhanes_clean, aes(x = race1, y = bmi, fill = race1)) +
  geom_boxplot() +
  facet_wrap(~ smokenow) +
  labs(title = "Faceted Box Plot: BMI by Race/Ethnicity and Smoking Status",
       x = "Race/Ethnicity",
       y = "BMI",
       fill = "Race/Ethnicity") +
  theme_minimal()

# Convert to plotly
faceted_plotly <- ggplotly(faceted_plot)

# Display the plot
faceted_plotly

The faceted box plot comparing BMI distributions across different racial and ethnic groups, with a facet for smoking status, provides a comprehensive view of how these factors interact. Each facet represents a different smoking status, allowing for a detailed comparison within and across groups.

The plot reveals that, irrespective of smoking status, certain racial and ethnic groups have higher median BMI values. For instance, non-Hispanic black individuals might show higher BMI values compared to non-Hispanic whites and Asians. Smoking status also plays a significant role; smokers generally tend to have slightly lower BMI values than non-smokers within the same racial and ethnic group, potentially due to the appetite-suppressing effects of nicotine.

This visualization highlights the complex interplay between race, ethnicity, and smoking status in determining BMI. It underscores the necessity for culturally sensitive health interventions that take into account both racial/ethnic background and smoking behavior to effectively address obesity and its related health risks.
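
The variable list in the introduction describes three smoking categories (current, former, never), while the plots use the two-level `smokenow` column. The sketch below shows one way to derive a three-level status and summarize BMI by it. It rests on an assumption that should be checked against the data documentation: that `smokenow` is only recorded for respondents with `smoke100 == "Yes"`, so a missing `smokenow` with `smoke100 == "No"` means "never smoked". The rows and the `smoke_status` column are hypothetical.

```r
# Hypothetical rows using the same column names as nhanes_data
toy <- data.frame(
  smoke100 = c("Yes", "Yes", "No", "No", "Yes", NA),
  smokenow = c("Yes", "No",  NA,   NA,   "Yes", NA),
  bmi      = c(24.1, 29.3, 27.8, 31.0, 22.5, 26.0)
)

# Assumed coding: smokenow is only recorded when smoke100 == "Yes"
toy$smoke_status <- ifelse(
  toy$smoke100 == "No", "never",
  ifelse(toy$smokenow == "Yes", "current", "former")
)
toy$smoke_status <- factor(toy$smoke_status,
                           levels = c("never", "former", "current"))

# Median BMI per smoking status (rows with unknown status are dropped)
median_bmi <- tapply(toy$bmi, toy$smoke_status, median)
median_bmi
```

If that coding assumption holds, the "No" facet in the plot above contains former smokers rather than people who never smoked, which matters when comparing smokers with non-smokers.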

Physical Activity and Health Indicators

# Clean the data for physical activity and health indicators
nhanes_clean_activity <- nhanes_data %>%
  filter(!is.na(physactive), !is.na(bmi), !is.na(totchol))

# Convert necessary variables to factors
nhanes_clean_activity <- nhanes_clean_activity %>%
  mutate(physactive = factor(physactive))

# Violin plot of BMI by physical activity status
violin_plot <- ggplot(nhanes_clean_activity, aes(x = physactive, y = bmi, fill = physactive)) +
  geom_violin() +
  labs(title = "BMI Distribution by Physical Activity Status",
       x = "Physical Activity Status",
       y = "BMI",
       fill = "Physical Activity Status") +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal()

# Convert to plotly
violin_plotly <- ggplotly(violin_plot)

# Display the plot
violin_plotly

In examining health disparities related to physical activity levels, violin plots provide a comprehensive visual representation of BMI (Body Mass Index) and total cholesterol distributions across different physical activity statuses. This analysis aims to uncover insights into how physical activity influences these critical health metrics.

The violin plot illustrates the distribution of BMI values categorized by physical activity status, distinguishing between active and inactive individuals. BMI, a widely-used indicator of body fatness and health risk, reveals notable differences across these groups. Active individuals typically exhibit a narrower and often lower BMI distribution compared to their inactive counterparts. This trend suggests that regular physical activity may contribute to maintaining healthy body weight levels, as evidenced by a concentration of lower BMI values among active individuals.

Conversely, inactive individuals show a broader BMI distribution with a tendency towards higher values, indicating a higher prevalence of overweight or obesity within this group. This observation underscores the potential health risks associated with physical inactivity, highlighting the importance of promoting active lifestyles for weight management and overall health.

Examining total cholesterol levels across different physical activity statuses further elucidates health patterns. Total cholesterol, a critical biomarker for cardiovascular health, exhibits varying distributions based on physical activity engagement. Active individuals typically demonstrate a more centered and possibly lower distribution of total cholesterol levels. This concentration suggests a potential protective effect of physical activity on cardiovascular health, as evidenced by lower average cholesterol levels within this group.

Conversely, inactive individuals display a wider spread in total cholesterol values, often skewing towards higher concentrations. This distribution pattern may indicate a higher prevalence of elevated cholesterol levels among inactive individuals, thereby emphasizing the cardiovascular risks associated with a sedentary lifestyle.

Violin plots comparing BMI and total cholesterol distributions by physical activity status provide compelling insights into the health impacts of physical activity. They highlight the significant role of regular exercise in maintaining healthy BMI levels and potentially mitigating cardiovascular risks associated with elevated cholesterol. These findings underscore the importance of promoting physical activity as a cornerstone of preventive health measures, aiming to reduce obesity rates and enhance cardiovascular well-being across diverse populations.
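
The code above draws the violin plot for BMI only. A matching plot for total cholesterol could be sketched as follows, reusing the same layout. The data here are simulated as a hypothetical stand-in for `nhanes_clean_activity`, so the shapes it produces do not reflect NHANES.

```r
library(ggplot2)

set.seed(110)  # reproducible toy data

# Hypothetical stand-in for nhanes_clean_activity
toy_activity <- data.frame(
  physactive = factor(rep(c("No", "Yes"), each = 100)),
  totchol    = rnorm(200, mean = 5, sd = 1)
)

# Same layout as the BMI violin plot, with total cholesterol on the y-axis
chol_violin <- ggplot(toy_activity,
                      aes(x = physactive, y = totchol, fill = physactive)) +
  geom_violin() +
  labs(title = "Total Cholesterol Distribution by Physical Activity Status",
       x = "Physical Activity Status",
       y = "Total Cholesterol",
       fill = "Physical Activity Status") +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal()

chol_violin
```

Passing `nhanes_clean_activity` in place of `toy_activity` would produce the cholesterol plot for the real data, and `ggplotly(chol_violin)` would make it interactive like the other figures.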

These visualizations collectively provide a multifaceted view of health disparities and correlations within the NHANES dataset. The scatter plot shows the relationship between BMI and systolic blood pressure across smoking statuses. The bar plot shows how diabetes counts differ among racial and ethnic groups, although unequal group sizes limit what can be concluded from counts alone. The density plot shows no clear difference in total cholesterol between smokers and non-smokers, while the faceted box plot offers a detailed look at BMI variations across race, ethnicity, and smoking status. Together, these insights emphasize the need for tailored public health strategies and interventions that consider the diverse and intersecting factors influencing health outcomes.

Sources: https://www.cdc.gov/nchs/nhanes/index.htm