Part I

I have chosen data from the National Health and Nutrition Examination Survey (NHANES), a nationally representative survey conducted by the Centers for Disease Control and Prevention (CDC) through the National Center for Health Statistics (NCHS). NHANES is an ongoing program designed to assess the health and nutritional status of adults and children in the United States. The data is publicly available and can be accessed through the official NHANES website.

NHANES data is collected through in-home interviews, standardized physical examinations, and laboratory tests conducted in Mobile Examination Centers (MECs) rather than traditional healthcare facilities. These mobile units are specially equipped to ensure standardized data collection. Additionally, NHANES integrates supplementary health data from:

-> Hospitals, clinics, and laboratories that contribute to follow-up studies and validation efforts.
-> Federal agencies, such as the National Institutes of Health (NIH) and the Food and Drug Administration (FDA), which collaborate on health-related research.
-> State and local health departments, which provide data for specific public health studies.

The survey follows a complex, multi-stage probability sampling design, ensuring that the findings represent the U.S. population. Each year, about 5,000 individuals are surveyed from 15 different locations across the country.

My research question is: How have racial, gender, and educational disparities in diabetes rates changed over time, and how does education influence these differences? To explore this question, I focus on the following key variables:

Dependent Variable:
-> Diabetes Status (Binary: Present = 1, Absent = 0)

Independent Variables:
-> Race/Ethnicity (Non-Hispanic White, Non-Hispanic Black, Hispanic, Other)
-> Education Level (Less than High School, High School Graduate, Some College, College Graduate, Advanced Degree)
-> Gender (Male = 1, Female = 2)
-> Age (control variable to account for diabetes risk differences across age groups)

For analysis, I use survey-weighted logistic regression, which accounts for the complex NHANES sampling design; a brief sketch of this model appears below. This model allows me to:
1. Assess how race, education, and gender impact diabetes prevalence while controlling for age.
2. Identify trends in disparities by comparing odds ratios across NHANES survey cycles from 2003 to 2023.
3. Examine whether education level affects racial differences in diabetes risk.
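To make the planned model concrete, the following is a minimal sketch (not run in this report) of a survey-weighted logistic regression using the survey package. The data frame name nhanes_df, the indicator diabetes_bin, and the choice to drop the borderline/refused/don't-know codes are illustrative assumptions; SDMVPSU, SDMVSTRA, and WTMEC2YR are the standard design variables supplied in the NHANES demographics file.

# Sketch: survey-weighted logistic regression (illustrative only).
# Assumes a merged data frame `nhanes_df` containing SDMVPSU, SDMVSTRA, and WTMEC2YR
# from the demographics file, plus DIQ010, RIDRETH1, DMDEDUC2, RIAGENDR, and RIDAGEYR.
library(survey)
library(dplyr)

model_df <- nhanes_df %>%
  filter(DIQ010 %in% c(1, 2)) %>%                  # keep yes/no responses only
  mutate(
    diabetes_bin = if_else(DIQ010 == 1, 1, 0),     # 1 = diabetes present, 0 = absent
    race = factor(RIDRETH1),
    education = factor(DMDEDUC2),
    gender = factor(RIAGENDR)
  )

nhanes_design <- svydesign(
  id = ~SDMVPSU,        # masked variance pseudo-PSUs
  strata = ~SDMVSTRA,   # masked variance pseudo-strata
  weights = ~WTMEC2YR,  # exam weights; WTINT2YR would apply to interview-only variables
  nest = TRUE,
  data = model_df
)

fit <- svyglm(diabetes_bin ~ race + education + gender + RIDAGEYR,
              design = nhanes_design, family = quasibinomial())
summary(fit)
exp(coef(fit))  # odds ratios

Pooling several survey cycles (2003 to 2023) would also require appending the cycle-specific files and rescaling the weights, which is not shown here.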

Part II

# Clear environment 

rm(list = ls())
gc()
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 540029 28.9    1199710 64.1   686460 36.7
## Vcells 985292  7.6    8388608 64.0  1875941 14.4
# Load libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)
library(haven)
library(ggplot2)

# Read datasets

diabetes <- read_xpt("C:/Users/Shamp/Downloads/DIQ_L (1).xpt")
demographics <- read_xpt("C:/Users/Shamp/Downloads/DEMO_L (1).xpt")
# Checking dataset structure

head(diabetes)
## # A tibble: 6 × 9
##     SEQN DIQ010 DID040 DIQ160 DIQ180 DIQ050 DID060 DIQ060U DIQ070
##    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
## 1 130378      2     NA      2      2     NA     NA      NA     NA
## 2 130379      2     NA      2      1     NA     NA      NA     NA
## 3 130380      1     35     NA     NA      2     NA      NA      1
## 4 130381      2     NA     NA     NA     NA     NA      NA     NA
## 5 130382      2     NA     NA     NA     NA     NA      NA     NA
## 6 130383      2     NA     NA     NA     NA     NA      NA     NA
str(diabetes)
## tibble [11,744 × 9] (S3: tbl_df/tbl/data.frame)
##  $ SEQN   : num [1:11744] 130378 130379 130380 130381 130382 ...
##   ..- attr(*, "label")= chr "Respondent sequence number"
##  $ DIQ010 : num [1:11744] 2 2 1 2 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "Doctor told you have diabetes"
##  $ DID040 : num [1:11744] NA NA 35 NA NA NA NA NA NA NA ...
##   ..- attr(*, "label")= chr "Age when first told you had diabetes"
##  $ DIQ160 : num [1:11744] 2 2 NA NA NA NA 2 1 2 2 ...
##   ..- attr(*, "label")= chr "Ever told you have prediabetes"
##  $ DIQ180 : num [1:11744] 2 1 NA NA NA NA 2 1 2 2 ...
##   ..- attr(*, "label")= chr "Had blood tested past three years"
##  $ DIQ050 : num [1:11744] NA NA 2 NA NA NA NA NA NA NA ...
##   ..- attr(*, "label")= chr "Taking insulin now"
##  $ DID060 : num [1:11744] NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "label")= chr "How long taking insulin"
##  $ DIQ060U: num [1:11744] NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "label")= chr "Unit of measure (month/year)"
##  $ DIQ070 : num [1:11744] NA NA 1 NA NA NA NA 2 NA NA ...
##   ..- attr(*, "label")= chr "Take diabetic pills to lower blood sugar"
##  - attr(*, "label")= chr "Diabetes"
colnames(diabetes)
## [1] "SEQN"    "DIQ010"  "DID040"  "DIQ160"  "DIQ180"  "DIQ050"  "DID060" 
## [8] "DIQ060U" "DIQ070"
# Remove leading/trailing spaces from column names

colnames(diabetes) <- trimws(colnames(diabetes))

# Check column names again

colnames(diabetes)
## [1] "SEQN"    "DIQ010"  "DID040"  "DIQ160"  "DIQ180"  "DIQ050"  "DID060" 
## [8] "DIQ060U" "DIQ070"
# Inspect dataset structure
str(diabetes)
## tibble [11,744 × 9] (S3: tbl_df/tbl/data.frame)
##  $ SEQN   : num [1:11744] 130378 130379 130380 130381 130382 ...
##   ..- attr(*, "label")= chr "Respondent sequence number"
##  $ DIQ010 : num [1:11744] 2 2 1 2 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "Doctor told you have diabetes"
##  $ DID040 : num [1:11744] NA NA 35 NA NA NA NA NA NA NA ...
##   ..- attr(*, "label")= chr "Age when first told you had diabetes"
##  $ DIQ160 : num [1:11744] 2 2 NA NA NA NA 2 1 2 2 ...
##   ..- attr(*, "label")= chr "Ever told you have prediabetes"
##  $ DIQ180 : num [1:11744] 2 1 NA NA NA NA 2 1 2 2 ...
##   ..- attr(*, "label")= chr "Had blood tested past three years"
##  $ DIQ050 : num [1:11744] NA NA 2 NA NA NA NA NA NA NA ...
##   ..- attr(*, "label")= chr "Taking insulin now"
##  $ DID060 : num [1:11744] NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "label")= chr "How long taking insulin"
##  $ DIQ060U: num [1:11744] NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "label")= chr "Unit of measure (month/year)"
##  $ DIQ070 : num [1:11744] NA NA 1 NA NA NA NA 2 NA NA ...
##   ..- attr(*, "label")= chr "Take diabetic pills to lower blood sugar"
##  - attr(*, "label")= chr "Diabetes"
# Merge datasets by SEQN

merged_data <- merge(demographics, diabetes, by = "SEQN")

# Select relevant variables

diabetes <- merged_data %>%
  select(SEQN, RIAGENDR, RIDRETH1, DMDEDUC2, RIDAGEYR, DIQ010)

# Rename columns (ensure correct names)

diabetes <- diabetes %>%
  rename(
    gender = RIAGENDR,    # Gender
    race = RIDRETH1,      # Race/Ethnicity
    education = DMDEDUC2, # Education level
    age = RIDAGEYR,       # Age
    Diabetes = DIQ010     # Diabetes status
  )

# Check renamed columns
head(diabetes)
##     SEQN gender race education age Diabetes
## 1 130378      1    5         5  43        2
## 2 130379      1    3         5  66        2
## 3 130380      2    2         3  44        1
## 4 130381      2    5        NA   5        2
## 5 130382      1    3        NA   2        2
## 6 130383      2    2        NA   3        2
# Pivot longer to combine gender, race, and education into a single "variable" column
long_data <- diabetes %>%
  pivot_longer(
    cols = c(gender, race, education), # Columns to pivot
    names_to = "variable",             # New column for variable names
    values_to = "value"                # New column for variable values
  )

# View the transformed data
head(long_data)
## # A tibble: 6 × 5
##     SEQN   age Diabetes variable  value
##    <dbl> <dbl>    <dbl> <chr>     <dbl>
## 1 130378    43        2 gender        1
## 2 130378    43        2 race          5
## 3 130378    43        2 education     5
## 4 130379    66        2 gender        1
## 5 130379    66        2 race          3
## 6 130379    66        2 education     5
# Calculate diabetes prevalence by race
diabetes_by_race <- diabetes %>%
  group_by(race) %>%
  summarize(
    Diabetes_Prevalence = mean(Diabetes, na.rm = TRUE) # Calculate mean diabetes status
  )

# Pivot wider to create separate columns for each race
wide_data <- diabetes_by_race %>%
  pivot_wider(
    names_from = race,              # Spread race categories into columns
    values_from = Diabetes_Prevalence # Fill with diabetes prevalence values
  )

# View the transformed data
print(wide_data)
## # A tibble: 1 × 5
##     `1`   `2`   `3`   `4`   `5`
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  1.94  1.94  1.94  1.90  1.95
# Recode the race variable with descriptive labels
diabetes$race <- factor(diabetes$race, 
                        levels = c(1, 2, 3, 4, 5), 
                        labels = c("Mexican American", "Other Hispanic", 
                                   "Non-Hispanic White", "Non-Hispanic Black", 
                                   "Other Race"))

# Verify the recoding
table(diabetes$race)
## 
##   Mexican American     Other Hispanic Non-Hispanic White Non-Hispanic Black 
##               1093               1345               6137               1577 
##         Other Race 
##               1592
# Plot diabetes prevalence by race
ggplot(diabetes, aes(x = race, y = Diabetes, fill = race)) +
  geom_bar(stat = "summary", fun = "mean") +
  labs(
    title = "Diabetes Prevalence by Race/Ethnicity",
    x = "Race/Ethnicity",
    y = "Mean Diabetes Prevalence"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_summary()`).

I started by cleaning up any existing data or variables in the R session using rm(list = ls()) and gc(). This ensures a fresh start by removing all objects from the environment and freeing up memory.

Before reading the data, I loaded the R libraries that support the workflow: tidyverse, dplyr, readr, haven, and ggplot2 (tidyverse already attaches dplyr, readr, and ggplot2, so those separate calls are redundant but harmless). These libraries provide functions for data wrangling, reading files, and visualizing data. I then read two datasets, diabetes and demographics, from XPT files using haven's read_xpt(); these contain the diabetes questionnaire responses and the demographic details of NHANES participants.

After loading the datasets, I examined the structure of the diabetes dataset using head(diabetes), str(diabetes), and colnames(diabetes). These functions allow me to check the first few rows, data types of variables, and column names to understand the dataset better.

To ensure clean column names, I used trimws(colnames(diabetes)) to remove any leading or trailing spaces, making the column names more manageable. Next, I merged the demographics and diabetes datasets using the SEQN (sequence number) column, which uniquely identifies each participant. The merge() function helps combine these datasets into a single dataframe, allowing for a more comprehensive analysis.

After merging, I selected relevant variables such as gender, race, education, age, and diabetes status using the select() function. Then, I renamed these columns using rename() to make them more descriptive. To make the dataset more flexible for analysis, I transformed it into a long format using the pivot_longer() function. This combined the gender, race, and education columns into a single “variable” column, which makes it easier to analyze relationships between different demographic factors and diabetes prevalence.

For the analysis, I summarized diabetes status by race with group_by(race) %>% summarize(mean(Diabetes, na.rm = TRUE)), which computes the average of the DIQ010 codes for each racial group while dropping missing values. Because DIQ010 is coded 1 = yes and 2 = no, these raw means (around 1.9 in the table above) are not yet true prevalences; the 0/1 recode sketched below is needed before interpreting them that way.
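Since Part I defines the dependent variable as a binary indicator (present = 1, absent = 0), a minimal sketch of that recode follows. The column name diabetes_bin and the decision to drop the borderline/refused/don't-know codes (3, 7, 9) are assumptions, not part of the code above.

# Sketch: recode DIQ010 to 0/1 before averaging so the group mean is a prevalence
diabetes_prev <- diabetes %>%
  filter(Diabetes %in% c(1, 2)) %>%                        # drop assumed codes 3, 7, 9
  mutate(diabetes_bin = if_else(Diabetes == 1, 1, 0)) %>%  # 1 = yes, 0 = no
  group_by(race) %>%
  summarize(prevalence = mean(diabetes_bin))

diabetes_prev  # proportion reporting diagnosed diabetes in each race category

These unweighted proportions would still need the survey weights from the Part I model to serve as population-level estimates.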

Next, I transformed the diabetes prevalence data into a wide format using pivot_wider(), where each racial category became a separate column. This format makes it easier to compare diabetes prevalence across different racial groups.

To improve readability, I recoded the race variable with descriptive labels using the factor() function. Instead of numeric values such as 1, 2, and 3, the race categories are labeled as Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, and Other Race.

Finally, I visualized diabetes status by race using ggplot2. The geom_bar(stat = "summary", fun = "mean") layer creates a bar plot in which the x-axis represents racial groups and the y-axis represents the mean diabetes status. The bars are colored by race, making the comparison visually clear. The theme_minimal() function cleans up the plot's appearance, and element_text(angle = 45, hjust = 1) keeps the race labels readable.

In conclusion, the code follows a structured approach to clean, merge, transform, analyze, and visualize NHANES diabetes data. It prepares the data by renaming columns, reshaping it for analysis, and calculating diabetes prevalence across racial groups. The final visualization provides insights into how diabetes rates vary across different racial/ethnic populations.

Part III

# Clear environment

rm(list = ls())
gc()
##           used (Mb) gc trigger (Mb) max used  (Mb)
## Ncells 1247181 66.7    2039401  109  2039401 109.0
## Vcells 2278691 17.4    8388608   64  8237476  62.9
# Load libraries

library(tidyverse)
library(dplyr)
library(readr)
library(haven)

# Read dataset

Weight_History <- read_xpt("C:/Users/Shamp/Downloads/WHQ_L.xpt")

# Checking dataset structure

head(Weight_History)
## # A tibble: 6 × 5
##     SEQN WHD010 WHD020 WHD050 WHQ070
##    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 130378     71    190    200      1
## 2 130379     70    220    220      2
## 3 130380     60    150    165      1
## 4 130384     68    204    212      1
## 5 130385     70    240    240      2
## 6 130386     68    200    180      2
str(Weight_History)
## tibble [8,501 × 5] (S3: tbl_df/tbl/data.frame)
##  $ SEQN  : num [1:8501] 130378 130379 130380 130384 130385 ...
##   ..- attr(*, "label")= chr "Respondent sequence number"
##  $ WHD010: num [1:8501] 71 70 60 68 70 68 67 66 67 64 ...
##   ..- attr(*, "label")= chr "Current self-reported height (inches)"
##  $ WHD020: num [1:8501] 190 220 150 204 240 200 215 270 175 277 ...
##   ..- attr(*, "label")= chr "Current self-reported weight (pounds)"
##  $ WHD050: num [1:8501] 200 220 165 212 240 ...
##   ..- attr(*, "label")= chr "Self-reported weight - 1 yr ago (pounds)"
##  $ WHQ070: num [1:8501] 1 2 1 1 2 2 2 1 1 1 ...
##   ..- attr(*, "label")= chr "Tried to lose weight in past year"
##  - attr(*, "label")= chr "Weight History"
colnames(Weight_History)
## [1] "SEQN"   "WHD010" "WHD020" "WHD050" "WHQ070"
# Summary statistics of the dataset

summary(Weight_History)
##       SEQN            WHD010           WHD020           WHD050      
##  Min.   :130378   Min.   :  40.0   Min.   :  63.0   Min.   :  67.0  
##  1st Qu.:133378   1st Qu.:  63.0   1st Qu.: 145.0   1st Qu.: 145.0  
##  Median :136402   Median :  66.0   Median : 174.0   Median : 175.0  
##  Mean   :136369   Mean   : 150.1   Mean   : 317.1   Mean   : 420.5  
##  3rd Qu.:139342   3rd Qu.:  70.0   3rd Qu.: 207.0   3rd Qu.: 214.0  
##  Max.   :142310   Max.   :9999.0   Max.   :9999.0   Max.   :9999.0  
##                   NA's   :15       NA's   :15       NA's   :15      
##      WHQ070     
##  Min.   :1.000  
##  1st Qu.:1.000  
##  Median :2.000  
##  Mean   :1.549  
##  3rd Qu.:2.000  
##  Max.   :9.000  
##  NA's   :15
# Ensure values are numeric and remove NA values

data <- na.omit(as.numeric(Weight_History$WHD020))  

# Define the log-likelihood function for a normal distribution

log_likelihood <- function(params, data) {
  mu <- params[1]
  sigma <- abs(params[2])  # Ensure sigma is positive to prevent errors
  
  # Calculate the log-likelihood for a normal distribution
  ll <- sum(dnorm(data, mean = mu, sd = sigma, log = TRUE))
  
  return(-ll)  # Return negative log-likelihood for minimization
}

# Initial parameter estimates

initial_params <- c(mu = mean(data, na.rm = TRUE), sigma = sd(data, na.rm = TRUE))  

# Ensure parameters are finite

if (any(!is.finite(initial_params))) {
  stop("Initial parameters contain non-finite values. Check your data.")
}

# Optimize using the `optim` function

mle_results <- optim(
  par = initial_params,
  fn = function(params) log_likelihood(params, data),  
  method = "BFGS",
  control = list(fnscale = 1, maxit = 1000, reltol = 1e-8)
)

# Display estimated parameters

estimated_params <- mle_results$par
names(estimated_params) <- c("Estimated_Mu", "Estimated_Sigma")
print(estimated_params)
##    Estimated_Mu Estimated_Sigma 
##        317.1426       1120.4757
# Display the final log-likelihood value

cat("Final Negative Log-Likelihood:", mle_results$value, "\n")
## Final Negative Log-Likelihood: 71625.14
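
The very large sigma estimate (about 1120 pounds) reflects the placeholder values visible in the summary above, where WHD020 has a maximum of 9999; NHANES questionnaires typically code "refused" as 7777 and "don't know" as 9999 for these items. The following is a brief sketch, under that assumption, of refitting on cleaned values and cross-checking optim() against the closed-form normal MLE (the sample mean, and the square root of the mean squared deviation).

# Sketch: drop assumed placeholder codes (7777 = refused, 9999 = don't know) before fitting
weights_clean <- data[data < 7777]

mle_clean <- optim(
  par = c(mu = mean(weights_clean), sigma = sd(weights_clean)),
  fn = function(params) log_likelihood(params, weights_clean),
  method = "BFGS"
)

# Closed-form MLE for a normal distribution:
# mu_hat = sample mean; sigma_hat = sqrt(mean squared deviation) (divides by n, not n - 1)
mu_hat <- mean(weights_clean)
sigma_hat <- sqrt(mean((weights_clean - mu_hat)^2))

rbind(optim = mle_clean$par, closed_form = c(mu_hat, sigma_hat))

If the optimization converges, the two rows should agree to several decimal places, which is a useful sanity check on the hand-coded likelihood.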