The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of people in the United States. The survey is unique in that it combines interviews with physical examinations. I chose this data set after my original data set turned out to be unusable.

The variables used in this analysis are Sex, Age, Education, BMI, and Diabetes. Sex, Education, and Diabetes are categorical variables, while Age and BMI are quantitative variables.
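
A quick way to confirm which columns are categorical and which are quantitative is to inspect their classes. The sketch below uses a small made-up data frame (`toy`, not the real NHANES rows), and the education labels in `edu_levels` are an assumption about how the levels are spelled in the file. It also stores Education as an ordered factor so that its levels sort from least to most schooling.

```r
# Toy rows standing in for the NHANES columns (values are made up)
toy <- data.frame(
  Sex = c("female", "male", "female"),
  Age = c(34, 51, 27),
  Education = c("High School", "College Grad", "Some College"),
  BMI = c(22.1, 31.4, 26.8),
  Diabetes = c("No", "Yes", "No")
)

# Assumed education labels, ordered from least to most schooling
edu_levels <- c("8th Grade", "9 - 11th Grade", "High School",
                "Some College", "College Grad")
toy$Education <- factor(toy$Education, levels = edu_levels, ordered = TRUE)

# Character and factor columns are categorical; numeric columns are quantitative
col_types <- sapply(toy, function(x) class(x)[1])
print(col_types)
```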

Luckily, this data set did not need much cleaning. The only change I made was renaming the “Gender” variable to “Sex” for readability. I then selected the variables relevant to my project: Sex, Age, Education, BMI, and Diabetes.

Load required libraries

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(tidyverse)
## Warning: package 'tibble' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ✔ readr     2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ plotly::filter() masks dplyr::filter(), stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the data set

nhanes <- read_csv("C:/Users/maddi/Downloads/nhanes.csv")
## Rows: 10000 Columns: 76
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (31): SurveyYr, Gender, AgeDecade, Race1, Race3, Education, MaritalStatu...
## dbl (45): ID, Age, AgeMonths, HHIncomeMid, Poverty, HomeRooms, Weight, Lengt...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean and explore the data variables

Rename variables for readability

names(nhanes)[names(nhanes) == "Gender"] <- "Sex"

Selecting variables

nhanes_subset <- nhanes %>% 
  select(Sex, Age, Education, BMI, Diabetes)

Summarizing variables

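nhanes_summary <- nhanes_subset %>% 
  group_by(Sex) %>% 
  summarize(
    # BMI has missing values (see the plot warnings), so drop NAs in each summary
    mean_age = mean(Age, na.rm = TRUE),
    mean_bmi = mean(BMI, na.rm = TRUE),
    # Proportion with diabetes among rows that have a recorded status
    prop_diabetes = mean(Diabetes == "Yes", na.rm = TRUE)
  )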

Statistical analysis: correlation between Age and BMI

cor(nhanes$Age, nhanes$BMI)
## [1] NA
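
The `NA` above is what `cor()` returns by default when either vector contains a missing value, and the warnings printed with the plots in this report indicate that BMI is missing for some rows. A minimal sketch of one fix, using made-up values in a hypothetical `toy` data frame rather than the real data, is to restrict the calculation to complete pairs with `use = "complete.obs"`.

```r
# Made-up values; BMI has one missing entry to mimic the real data
toy <- data.frame(
  Age = c(20, 35, 50, 65, 80),
  BMI = c(22.0, NA, 27.5, 29.0, 26.0)
)

# Default behaviour: any NA in either vector makes the result NA
r_default <- cor(toy$Age, toy$BMI)

# Drop incomplete pairs before computing the correlation
r_complete <- cor(toy$Age, toy$BMI, use = "complete.obs")

print(r_default)
print(r_complete)
```

On the real data, the equivalent call would be `cor(nhanes$Age, nhanes$BMI, use = "complete.obs")`.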

Explore quantitative variables with histogram and boxplot

ggplot(nhanes_subset, aes(x = BMI, fill = Sex)) +
  geom_histogram(binwidth = 2) +
  labs(title = "BMI Distribution by Sex",
       x = "BMI",
       y = "Count") +
  scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
  theme_minimal()
## Warning: Removed 366 rows containing non-finite values (`stat_bin()`).
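
The warning means that 366 rows have no usable BMI value and were dropped from the plot. One way to see where the missing values are is to count them per column. The sketch below does this on a made-up `toy` data frame, so the counts it prints are illustrative only.

```r
# Made-up rows with a few missing values
toy <- data.frame(
  Sex = c("female", "male", "male", "female"),
  BMI = c(24.3, NA, 30.1, NA),
  Diabetes = c("No", "Yes", NA, "No")
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows with a recorded BMI before plotting
toy_bmi <- toy[!is.na(toy$BMI), ]
print(nrow(toy_bmi))
```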

ggplot(nhanes_subset, aes(x = Sex, y = BMI, fill = Diabetes)) +
  geom_boxplot() +
  labs(title = "BMI by Sex and Diabetes Status",
       x = "Sex",
       y = "BMI",
       fill = "Diabetes") +
  scale_fill_manual(values = c("#00BFC4", "#F8766D")) +
  theme_bw()
## Warning: Removed 366 rows containing non-finite values (`stat_boxplot()`).

Final visualization

nhanes_interactive <- nhanes_subset %>% 
  mutate(Diabetes = ifelse(Diabetes == "Yes", "Diabetes", "No Diabetes"))
plotly_histogram <- ggplot(nhanes_interactive, aes(x = BMI, fill = Diabetes)) +
  geom_histogram(binwidth = 2) +
  labs(title = "BMI Distribution by Diabetes Status",
       x = "BMI",
       y = "Count",
       fill = "Diabetes") +
  scale_fill_manual(values = c("#00BFC4", "#F8766D")) +
  theme_minimal()

ggplotly(plotly_histogram)
## Warning: Removed 366 rows containing non-finite values (`stat_bin()`).

According to the Centers for Disease Control and Prevention (CDC), more than 100 million adults in the United States are now living with diabetes or prediabetes. Diabetes is a chronic (long-lasting) health condition that affects how your body turns food into energy. Most people’s bodies naturally produce the hormone insulin, which helps convert sugars from the food we eat into energy that the body can use or store for later. When you have diabetes, your body either doesn’t make insulin or doesn’t use its insulin well, causing your blood sugar to rise. High blood sugar levels can cause serious health problems over time.

The final visualization is a histogram of BMI by diabetes status, with each status shown in a different color. The plot suggests that individuals with diabetes tend to have higher BMI values than those without diabetes. Their BMI values also appear to be more spread out, which points to greater variability in BMI among individuals with diabetes.
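
A histogram gives only a visual impression of center and spread, so group summaries can make the comparison explicit. The sketch below computes the mean and standard deviation of BMI by diabetes status on a made-up `toy` tibble; the numbers are illustrative and are not results from NHANES.

```r
library(dplyr)

# Made-up values for illustration only
toy <- tibble(
  Diabetes = c("Yes", "Yes", "Yes", "No", "No", "No"),
  BMI = c(31.0, 36.5, NA, 24.0, 27.5, 22.5)
)

bmi_by_status <- toy %>%
  group_by(Diabetes) %>%
  summarize(
    n = sum(!is.na(BMI)),                # rows with a recorded BMI
    mean_bmi = mean(BMI, na.rm = TRUE),  # center of the group
    sd_bmi = sd(BMI, na.rm = TRUE)       # spread within the group
  )

print(bmi_by_status)
```

The same pipeline applied to `nhanes_subset` would show whether the difference in mean and spread seen in the plot holds up numerically.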

One interesting pattern in the plot is that individuals with diabetes make up a larger share of the higher BMI range, which is consistent with an association between BMI and diabetes. The plot also highlights the need for public health interventions aimed at preventing and managing diabetes, especially among people with a high BMI.
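
Because the histogram stacks counts, the share of people with diabetes in each BMI range is hard to read off directly. A minimal sketch of one way to compute it is shown below, using a made-up `toy` data frame and the usual adult BMI cut points (18.5, 25, and 30); the group labels are my own naming.

```r
# Made-up values for illustration only
toy <- data.frame(
  BMI = c(17.9, 21.0, 23.4, 26.1, 28.7, 31.2, 33.8, 40.5),
  Diabetes = c("No", "No", "No", "No", "Yes", "No", "Yes", "Yes")
)

# Bin BMI using the usual adult cut points (18.5, 25, 30)
toy$bmi_group <- cut(
  toy$BMI,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  right = FALSE  # intervals are [lower, upper)
)

# Share of each BMI group that has diabetes
prop_diabetes <- tapply(toy$Diabetes == "Yes", toy$bmi_group, mean)
print(prop_diabetes)
```

In the plot itself, `geom_histogram(position = "fill")` would show a similar picture by scaling each bin to proportions instead of counts.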

One limitation of the analysis is that it does not take into account other risk factors for diabetes, such as physical activity, family history, and diet. Future analyses could include these variables to gain a more complete understanding of the relationship between BMI and diabetes.