I have chosen data from the National Health and Nutrition Examination Survey (NHANES), a nationally representative survey conducted by the Centers for Disease Control and Prevention (CDC) through the National Center for Health Statistics (NCHS). NHANES is an ongoing program designed to assess the health and nutritional status of adults and children in the United States. The data is publicly available and can be accessed through the official NHANES website.
NHANES data is collected through in-home interviews, standardized physical examinations, and laboratory tests conducted in Mobile Examination Centers (MECs) rather than traditional healthcare facilities. These mobile units are specially equipped to ensure standardized data collection. Additionally, NHANES integrates supplementary health data from:
-> Hospitals, clinics, and laboratories that contribute to follow-up studies and validation efforts. -> Federal agencies, such as the National Institutes of Health (NIH) and the Food and Drug Administration (FDA), which collaborate on health-related research. -> State and local health departments, which provide data for specific public health studies.
The survey follows a complex, multi-stage probability sampling design, ensuring that the findings represent the U.S. population. Each year, about 5,000 individuals are surveyed from 15 different locations across the country. My research question is: How have racial, gender, and educational disparities in diabetes rates changed over time, and how does education influence these differences? To explore this question, I focus on the following key variables: Dependent Variable: Diabetes Status (Binary: Present = 1, Absent = 0) Independent Variables: Race/Ethnicity (Non-Hispanic White, Non-Hispanic Black, Hispanic, Other) Education Level (Less than High School, High School Graduate, Some College, College Graduate, Advanced Degree) Gender (Male = 1, Female = 2) Age (Control variable to account for diabetes risk differences across age groups) For analysis, I use survey-weighted logistic regression, which accounts for the complex NHANES sampling design. This model allows me to: 1. Assess how race, education, and gender impact diabetes prevalence while controlling for age. 2. Identify trends in disparities by comparing odds ratios across NHANES survey cycles from 2003 to 2023. 3. Examine whether education level affects racial differences in diabetes risk.
# Clear environment
rm(list = ls())
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 540029 28.9 1199710 64.1 686460 36.7
## Vcells 985292 7.6 8388608 64.0 1875941 14.4
# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)
library(haven)
library(ggplot2)
# Read datasets
diabetes <- read_xpt("C:/Users/Shamp/Downloads/DIQ_L (1).xpt")
demographics <- read_xpt("C:/Users/Shamp/Downloads/DEMO_L (1).xpt")
# Checking dataset structure
head(diabetes)
## # A tibble: 6 × 9
## SEQN DIQ010 DID040 DIQ160 DIQ180 DIQ050 DID060 DIQ060U DIQ070
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 130378 2 NA 2 2 NA NA NA NA
## 2 130379 2 NA 2 1 NA NA NA NA
## 3 130380 1 35 NA NA 2 NA NA 1
## 4 130381 2 NA NA NA NA NA NA NA
## 5 130382 2 NA NA NA NA NA NA NA
## 6 130383 2 NA NA NA NA NA NA NA
str(diabetes)
## tibble [11,744 × 9] (S3: tbl_df/tbl/data.frame)
## $ SEQN : num [1:11744] 130378 130379 130380 130381 130382 ...
## ..- attr(*, "label")= chr "Respondent sequence number"
## $ DIQ010 : num [1:11744] 2 2 1 2 2 2 2 2 2 2 ...
## ..- attr(*, "label")= chr "Doctor told you have diabetes"
## $ DID040 : num [1:11744] NA NA 35 NA NA NA NA NA NA NA ...
## ..- attr(*, "label")= chr "Age when first told you had diabetes"
## $ DIQ160 : num [1:11744] 2 2 NA NA NA NA 2 1 2 2 ...
## ..- attr(*, "label")= chr "Ever told you have prediabetes"
## $ DIQ180 : num [1:11744] 2 1 NA NA NA NA 2 1 2 2 ...
## ..- attr(*, "label")= chr "Had blood tested past three years"
## $ DIQ050 : num [1:11744] NA NA 2 NA NA NA NA NA NA NA ...
## ..- attr(*, "label")= chr "Taking insulin now"
## $ DID060 : num [1:11744] NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "label")= chr "How long taking insulin"
## $ DIQ060U: num [1:11744] NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "label")= chr "Unit of measure (month/year)"
## $ DIQ070 : num [1:11744] NA NA 1 NA NA NA NA 2 NA NA ...
## ..- attr(*, "label")= chr "Take diabetic pills to lower blood sugar"
## - attr(*, "label")= chr "Diabetes"
colnames(diabetes)
## [1] "SEQN" "DIQ010" "DID040" "DIQ160" "DIQ180" "DIQ050" "DID060"
## [8] "DIQ060U" "DIQ070"
# Remove leading/trailing spaces from column names
colnames(diabetes) <- trimws(colnames(diabetes))
# Check column names again
colnames(diabetes)
## [1] "SEQN" "DIQ010" "DID040" "DIQ160" "DIQ180" "DIQ050" "DID060"
## [8] "DIQ060U" "DIQ070"
# Inspect dataset structure
str(diabetes)
## tibble [11,744 × 9] (S3: tbl_df/tbl/data.frame)
## $ SEQN : num [1:11744] 130378 130379 130380 130381 130382 ...
## ..- attr(*, "label")= chr "Respondent sequence number"
## $ DIQ010 : num [1:11744] 2 2 1 2 2 2 2 2 2 2 ...
## ..- attr(*, "label")= chr "Doctor told you have diabetes"
## $ DID040 : num [1:11744] NA NA 35 NA NA NA NA NA NA NA ...
## ..- attr(*, "label")= chr "Age when first told you had diabetes"
## $ DIQ160 : num [1:11744] 2 2 NA NA NA NA 2 1 2 2 ...
## ..- attr(*, "label")= chr "Ever told you have prediabetes"
## $ DIQ180 : num [1:11744] 2 1 NA NA NA NA 2 1 2 2 ...
## ..- attr(*, "label")= chr "Had blood tested past three years"
## $ DIQ050 : num [1:11744] NA NA 2 NA NA NA NA NA NA NA ...
## ..- attr(*, "label")= chr "Taking insulin now"
## $ DID060 : num [1:11744] NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "label")= chr "How long taking insulin"
## $ DIQ060U: num [1:11744] NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "label")= chr "Unit of measure (month/year)"
## $ DIQ070 : num [1:11744] NA NA 1 NA NA NA NA 2 NA NA ...
## ..- attr(*, "label")= chr "Take diabetic pills to lower blood sugar"
## - attr(*, "label")= chr "Diabetes"
# Merge datasets by SEQN
merged_data <- merge(demographics, diabetes, by = "SEQN")
# Select relevant variables
diabetes <- merged_data %>%
select(SEQN, RIAGENDR, RIDRETH1, DMDEDUC2, RIDAGEYR, DIQ010)
# Rename columns (ensure correct names)
diabetes <- diabetes %>%
rename(
gender = RIAGENDR, # Gender
race = RIDRETH1, # Race/Ethnicity
education = DMDEDUC2, # Education level
age = RIDAGEYR, # Age
Diabetes = DIQ010 # Diabetes status
)
# Check renamed columns
head(diabetes)
## SEQN gender race education age Diabetes
## 1 130378 1 5 5 43 2
## 2 130379 1 3 5 66 2
## 3 130380 2 2 3 44 1
## 4 130381 2 5 NA 5 2
## 5 130382 1 3 NA 2 2
## 6 130383 2 2 NA 3 2
# Pivot longer to combine gender, race, and education into a single "variable" column
long_data <- diabetes %>%
pivot_longer(
cols = c(gender, race, education), # Columns to pivot
names_to = "variable", # New column for variable names
values_to = "value" # New column for variable values
)
# View the transformed data
head(long_data)
## # A tibble: 6 × 5
## SEQN age Diabetes variable value
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 130378 43 2 gender 1
## 2 130378 43 2 race 5
## 3 130378 43 2 education 5
## 4 130379 66 2 gender 1
## 5 130379 66 2 race 3
## 6 130379 66 2 education 5
# Calculate diabetes prevalence by race
diabetes_by_race <- diabetes %>%
group_by(race) %>%
summarize(
Diabetes_Prevalence = mean(Diabetes, na.rm = TRUE) # Calculate mean diabetes status
)
# Pivot wider to create separate columns for each race
wide_data <- diabetes_by_race %>%
pivot_wider(
names_from = race, # Spread race categories into columns
values_from = Diabetes_Prevalence # Fill with diabetes prevalence values
)
# View the transformed data
print(wide_data)
## # A tibble: 1 × 5
## `1` `2` `3` `4` `5`
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.94 1.94 1.94 1.90 1.95
# Recode the race variable with descriptive labels
diabetes$race <- factor(diabetes$race,
levels = c(1, 2, 3, 4, 5),
labels = c("Mexican American", "Other Hispanic",
"Non-Hispanic White", "Non-Hispanic Black",
"Other Race"))
# Verify the recoding
table(diabetes$race)
##
## Mexican American Other Hispanic Non-Hispanic White Non-Hispanic Black
## 1093 1345 6137 1577
## Other Race
## 1592
# Plot diabetes prevalence by race
ggplot(diabetes, aes(x = race, y = Diabetes, fill = race)) +
geom_bar(stat = "summary", fun = "mean") +
labs(
title = "Diabetes Prevalence by Race/Ethnicity",
x = "Race/Ethnicity",
y = "Mean Diabetes Prevalence"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_summary()`).
I started by cleaning up any existing data or variables in the R session using rm(list = ls()) and gc(). This ensures a fresh start by removing all objects from the environment and freeing up memory.
Next, I loaded two datasets, diabetes and demographics, from XPT files using read_xpt(). These datasets contain information related to diabetes status and demographic details of participants from the NHANES survey. To prepare for data manipulation and analysis, I loaded essential R libraries such as tidyverse, dplyr, readr, haven, and ggplot2. These libraries provide functions for data wrangling, reading files, and visualizing data efficiently.
After loading the datasets, I examined the structure of the diabetes dataset using head(diabetes), str(diabetes), and colnames(diabetes). These functions allow me to check the first few rows, data types of variables, and column names to understand the dataset better.
To ensure clean column names, I used trimws(colnames(diabetes)) to remove any leading or trailing spaces, making the column names more manageable. Next, I merged the demographics and diabetes datasets using the SEQN (sequence number) column, which uniquely identifies each participant. The merge() function helps combine these datasets into a single dataframe, allowing for a more comprehensive analysis.
After merging, I selected relevant variables such as gender, race, education, age, and diabetes status using the select() function. Then, I renamed these columns using rename() to make them more descriptive. To make the dataset more flexible for analysis, I transformed it into a long format using the pivot_longer() function. This combined the gender, race, and education columns into a single “variable” column, which makes it easier to analyze relationships between different demographic factors and diabetes prevalence.
For the analysis, I calculated the prevalence of diabetes by race using the group_by(race) %>% summarize(mean(Diabetes, na.rm = TRUE)) function. This computes the average diabetes status for each racial group, treating missing values appropriately.
Next, I transformed the diabetes prevalence data into a wide format using pivot_wider(), where each racial category became a separate column. This format makes it easier to compare diabetes prevalence across different racial groups.
To improve readability, I recoded the race variable with descriptive labels using the factor() function. Instead of numeric values such as 1,2,3, race categories are labeled as Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, and Other Race.
Finally, I visualized the diabetes prevalence by race using ggplot2. The geom_bar(stat = “summary”, fun = “mean”) function creates a bar plot, where the x-axis represents racial groups and the y-axis represents the mean diabetes prevalence. The bars are colored based on race, making the comparison visually clear. The theme_minimal() function enhances the appearance of the plot, and element_text(angle = 45, hjust = 1) ensures that race labels are readable.
In conclusion, the code follows a structured approach to clean, merge, transform, analyze, and visualize NHANES diabetes data. It prepares the data by renaming columns, reshaping it for analysis, and calculating diabetes prevalence across racial groups. The final visualization provides insights into how diabetes rates vary across different racial/ethnic populations.
# Clear environment
rm(list = ls())
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1247181 66.7 2039401 109 2039401 109.0
## Vcells 2278691 17.4 8388608 64 8237476 62.9
# Load libraries
library(tidyverse)
library(dplyr)
library(readr)
library(haven)
# Read dataset
Weight_History <- read_xpt("C:/Users/Shamp/Downloads/WHQ_L.xpt")
# Checking dataset structure
head(Weight_History)
## # A tibble: 6 × 5
## SEQN WHD010 WHD020 WHD050 WHQ070
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 130378 71 190 200 1
## 2 130379 70 220 220 2
## 3 130380 60 150 165 1
## 4 130384 68 204 212 1
## 5 130385 70 240 240 2
## 6 130386 68 200 180 2
str(Weight_History)
## tibble [8,501 × 5] (S3: tbl_df/tbl/data.frame)
## $ SEQN : num [1:8501] 130378 130379 130380 130384 130385 ...
## ..- attr(*, "label")= chr "Respondent sequence number"
## $ WHD010: num [1:8501] 71 70 60 68 70 68 67 66 67 64 ...
## ..- attr(*, "label")= chr "Current self-reported height (inches)"
## $ WHD020: num [1:8501] 190 220 150 204 240 200 215 270 175 277 ...
## ..- attr(*, "label")= chr "Current self-reported weight (pounds)"
## $ WHD050: num [1:8501] 200 220 165 212 240 ...
## ..- attr(*, "label")= chr "Self-reported weight - 1 yr ago (pounds)"
## $ WHQ070: num [1:8501] 1 2 1 1 2 2 2 1 1 1 ...
## ..- attr(*, "label")= chr "Tried to lose weight in past year"
## - attr(*, "label")= chr "Weight History"
colnames(Weight_History)
## [1] "SEQN" "WHD010" "WHD020" "WHD050" "WHQ070"
# Summary statistics of the dataset
summary(Weight_History)
## SEQN WHD010 WHD020 WHD050
## Min. :130378 Min. : 40.0 Min. : 63.0 Min. : 67.0
## 1st Qu.:133378 1st Qu.: 63.0 1st Qu.: 145.0 1st Qu.: 145.0
## Median :136402 Median : 66.0 Median : 174.0 Median : 175.0
## Mean :136369 Mean : 150.1 Mean : 317.1 Mean : 420.5
## 3rd Qu.:139342 3rd Qu.: 70.0 3rd Qu.: 207.0 3rd Qu.: 214.0
## Max. :142310 Max. :9999.0 Max. :9999.0 Max. :9999.0
## NA's :15 NA's :15 NA's :15
## WHQ070
## Min. :1.000
## 1st Qu.:1.000
## Median :2.000
## Mean :1.549
## 3rd Qu.:2.000
## Max. :9.000
## NA's :15
# Ensuring numeric value and removing NA values
data <- na.omit(as.numeric(Weight_History$WHD020))
# Define the log-likelihood function for a normal distribution
log_likelihood <- function(params, data) {
mu <- params[1]
sigma <- abs(params[2]) # Ensure sigma is positive to prevent errors
# Calculate the log-likelihood for a normal distribution
ll <- sum(dnorm(data, mean = mu, sd = sigma, log = TRUE))
return(-ll) # Return negative log-likelihood for minimization
}
# Initial parameter estimates
initial_params <- c(mu = mean(data, na.rm = TRUE), sigma = sd(data, na.rm = TRUE))
# Ensure parameters are finite
if (any(!is.finite(initial_params))) {
stop("Initial parameters contain non-finite values. Check your data.")
}
# Optimize using the `optim` function
mle_results <- optim(
par = initial_params,
fn = function(params) log_likelihood(params, data),
method = "BFGS",
control = list(fnscale = 1, maxit = 1000, reltol = 1e-8)
)
# Display estimated parameters
estimated_params <- mle_results$par
names(estimated_params) <- c("Estimated_Mu", "Estimated_Sigma")
print(estimated_params)
## Estimated_Mu Estimated_Sigma
## 317.1426 1120.4757
# Display the final log-likelihood value
cat("Final Negative Log-Likelihood:", mle_results$value, "\n")
## Final Negative Log-Likelihood: 71625.14