For this analysis, I am working with the NHANES dataset from the mice package in R. NHANES (National Health and Nutrition Examination Survey) is a large, ongoing project conducted by the Centers for Disease Control and Prevention (CDC) that collects extensive information on the health, nutrition, and demographics of the U.S. population through interviews and physical examinations.
The subset of NHANES used in this analysis is designed for practicing multiple imputation techniques and includes four variables: age (grouped into 20–39, 40–59, and 60+ years), bmi (Body Mass Index), hyp (hypertension status, coded 1 = no hypertension, 2 = hypertension), and chl (serum cholesterol level). Several of these variables contain missing values, which presents an opportunity to apply more principled methods for handling missing data.
Rather than relying on traditional approaches like listwise deletion, which can lead to biased results and loss of valuable information, multiple imputation allows missing values to be addressed in a way that preserves the overall structure and uncertainty of the dataset. This technique generates multiple complete datasets, incorporating natural variability and reducing the risk of underestimating uncertainty in subsequent analyses.
I chose to work with this dataset because earlier datasets I analyzed this semester had no missing values, offering limited opportunities to practice handling incomplete data. By working with a dataset that reflects the real-world challenges of missingness, I aim to strengthen my understanding of more robust and modern analytic techniques.
First, I loaded the necessary packages: mice, ggplot2, and tidyr. I then called data(nhanes) to load the NHANES dataset that ships with the mice package into the R environment.
To check for missing values, I created a summary table (nhanes_summary) that calculated the number of missing and non-missing entries for each variable using colSums() combined with is.na() and !is.na(). I then reshaped the summary table into a longer format using pivot_longer() from the tidyr package to prepare it for plotting.
Finally, I created a stacked bar chart with ggplot2, using geom_bar() with stat = "identity" and position = "stack" to visually compare the amount of missing and non-missing data for each variable. The resulting plot makes it easy to see which variables contain missing values.
library(mice)
library(ggplot2)
library(tidyr)
data(nhanes)
# Count missing and non-missing values per column
nhanes_summary <- data.frame(
  Variable = colnames(nhanes),
  Missing = colSums(is.na(nhanes)),
  Not_Missing = colSums(!is.na(nhanes))
)
# Reshape to long format for plotting
nhanes_long <- pivot_longer(nhanes_summary, cols = c(Missing, Not_Missing),
                            names_to = "Status", values_to = "Count")
# Stacked bar chart of missing vs. non-missing counts per variable
ggplot(nhanes_long, aes(x = Variable, y = Count, fill = Status)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Missing vs Not Missing Data in NHANES Dataset",
       x = "Variable",
       y = "Count") +
  theme_minimal()
The stacked bar chart shows the distribution of missing and non-missing values across each variable in the NHANES dataset. From the plot, it is clear that the age variable has no missing values, as the entire bar for age is composed of non-missing observations.
In contrast, the variables bmi, hyp, and chl all have missing values. Among these, chl (serum cholesterol level) has the most missing observations (10 of 25), followed by bmi (body mass index, 9) and hyp (hypertension status, 8). With between 32% and 40% of the values missing in each of these variables, dropping incomplete rows would discard much of an already small sample and could bias the analysis, which motivates applying multiple imputation in the next steps of the project.
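The counts read off the chart can also be checked numerically. The sketch below, which assumes the mice package is installed, tabulates the number and share of missing values per variable and prints the missing-data patterns with md.pattern(). The name miss_tab is introduced here only for illustration.

```r
library(mice)   # provides the nhanes data and md.pattern()
data(nhanes)

# Number and percentage of missing values in each column
miss_tab <- data.frame(
  variable    = names(nhanes),
  n_missing   = unname(colSums(is.na(nhanes))),
  pct_missing = unname(round(100 * colMeans(is.na(nhanes)), 1)),
  row.names   = NULL
)
print(miss_tab)

# Which combinations of variables are missing together
# (plot = FALSE returns the pattern matrix without drawing it)
md.pattern(nhanes, plot = FALSE)
```

The pattern matrix is useful here because it shows whether missing values tend to occur in the same rows, which determines how many rows listwise deletion removes.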
I conducted a logistic regression analysis to examine the relationship between hypertension status (hyp) and two predictor variables: Body Mass Index (bmi) and age group (age). In the original dataset, hypertension status was coded as 1 for no hypertension and 2 for the presence of hypertension. To properly fit the logistic regression model, I recoded the hyp variable so that 0 represents no hypertension and 1 represents the presence of hypertension. The age variable is grouped into three categories: 20–39 years old (coded as 1), 40–59 years old (coded as 2), and 60 years and older (coded as 3). The goal of this analysis is to assess how an individual’s BMI and age group are associated with their likelihood of having hypertension.
# Load necessary library
library(mice) # (only for loading the data)
# Load the NHANES dataset
data(nhanes)
# Recode hyp: 1 = no hypertension → 0
#             2 = hypertension    → 1
nhanes$hyp_recode <- ifelse(nhanes$hyp == 2, 1,
                            ifelse(nhanes$hyp == 1, 0, NA))
# Now run the logistic regression using the recoded variable
model_pre <- glm(hyp_recode ~ bmi + factor(age), data = nhanes, family = binomial)
# View the model summary
summary(model_pre)
##
## Call:
## glm(formula = hyp_recode ~ bmi + factor(age), family = binomial,
## data = nhanes)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -58.4845 4743.2120 -0.012 0.990
## bmi 1.1463 0.8895 1.289 0.197
## factor(age)2 27.8760 4743.1149 0.006 0.995
## factor(age)3 29.7238 4743.1174 0.006 0.995
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 17.9947 on 15 degrees of freedom
## Residual deviance: 6.8269 on 12 degrees of freedom
## (9 observations deleted due to missingness)
## AIC: 14.827
##
## Number of Fisher Scoring iterations: 19
The logistic regression model examined the association between hypertension status (hyp_recode) and the predictors Body Mass Index (bmi) and age group (age). A total of 9 observations were omitted from the analysis due to missing values in either bmi or hyp, reducing the sample size from 25 to 16 observations. This highlights a key limitation of conducting regression analysis without addressing missing data, as it can reduce the sample size and potentially bias the results.
The regression output shows that none of the predictor variables were statistically significant at the conventional 0.05 level. Specifically, the p-value for bmi was 0.197, and the p-values for the age group indicators (factor(age)2 and factor(age)3) were both 0.995. The very large coefficients and standard errors for the age indicators (estimates near 28 to 30 with standard errors near 4,743), together with the 19 Fisher scoring iterations, point to quasi-complete separation rather than ordinary sampling noise: among the observations used in the model, no one in the youngest age group has hypertension, so the likelihood keeps improving as the age coefficients grow and the estimates do not settle on finite values.
A warning message (“fitted probabilities numerically 0 or 1 occurred”) is consistent with this separation. The small sample left after listwise deletion makes such sparse cells more likely, which illustrates how hard it is to build reliable models when missing data are not properly addressed.
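One way to look for separation directly is to cross-tabulate the outcome against the categorical predictor on the rows the model actually used. The following is a minimal sketch, assuming the mice package is available for the data; used and tab are illustrative names. A zero cell in the table marks an age group in which the outcome never varies.

```r
library(mice)   # only for the nhanes data
data(nhanes)

# Rows that glm() keeps: both bmi and hyp observed
used <- nhanes[complete.cases(nhanes[, c("bmi", "hyp")]), ]

# Rows = age group, columns = hyp (1 = no hypertension, 2 = hypertension)
tab <- table(age = used$age, hyp = used$hyp)
print(tab)

# Age groups in which every observation has the same outcome
names(which(apply(tab, 1, function(r) any(r == 0))))
```

When a group like this exists, its coefficient (or, for the reference group, the intercept and the other group coefficients) is pushed toward infinity, which is why the standard errors become so large.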
Given that 9 observations were omitted from the original logistic regression analysis due to missing data, I will next apply multiple imputation using the Amelia package. This approach will allow me to impute the missing values in the dataset and create several completed versions of the data. By doing so, I will be able to rerun the logistic regression model on the imputed datasets, preserving the full sample size and improving the stability and reliability of the model estimates. Multiple imputation will help address the limitations caused by listwise deletion and provide a more accurate understanding of the relationship between hypertension status, BMI, and age group.
# Load necessary packages
library(Amelia)
library(mice)
data(nhanes)
# Recode hyp so it's 0 (no hypertension) and 1 (has hypertension)
nhanes$hyp_recode <- ifelse(nhanes$hyp == 2, 1,
ifelse(nhanes$hyp == 1, 0, NA))
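# Run Amelia imputations (20 datasets).
# idvars keeps hyp_recode in the output but excludes it from the imputation
# model, so its missing values are NOT filled in.
a.out <- amelia(x = nhanes, m = 20, idvars = "hyp_recode")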
## -- Imputation 1 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 41 42 43 44 45 46 47 48 49 50 51
##
## -- Imputation 2 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##
##
## -- Imputation 3 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12
##
## -- Imputation 4 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 41 42
##
## -- Imputation 5 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22
##
## -- Imputation 6 --
##
## 1 2 3 4 5 6 7 8 9 10 11
##
## -- Imputation 7 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12
##
## -- Imputation 8 --
##
## 1 2 3 4 5 6 7 8
##
## -- Imputation 9 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21
##
## -- Imputation 10 --
##
## 1 2 3 4 5 6 7 8 9
##
## -- Imputation 11 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54
##
## -- Imputation 12 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
##
## -- Imputation 13 --
##
## 1 2 3 4 5 6 7
##
## -- Imputation 14 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13
##
## -- Imputation 15 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12
##
## -- Imputation 16 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22 23 24
##
## -- Imputation 17 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12
##
## -- Imputation 18 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22 23 24 25 26 27 28 29 30 31
##
## -- Imputation 19 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
##
## -- Imputation 20 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22 23 24 25 26 27 28
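To address the missing values in the NHANES dataset, I performed multiple imputation using the Amelia package in R. I used the amelia() function to generate 20 imputed datasets, specifying hyp_recode as an ID variable. ID variables are carried through to the imputed datasets but are left out of the imputation model, so bmi, chl, and the original hyp column are imputed while hyp_recode keeps its missing values. Note also that amelia() treats every variable as continuous unless told otherwise (for example through its noms or ords arguments), so the imputed hyp values are not restricted to 1 and 2.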
I chose to run 20 imputations so that the pooled results depend less on the random draws in any single imputed dataset. Adding imputations does not remove the uncertainty caused by the missing data itself, but it reduces the simulation (Monte Carlo) error in the pooled estimates, which makes them more stable from one run to the next.
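Because the outcome was passed through idvars, it is worth confirming what the imputed datasets actually contain before fitting models on them. The sketch below assumes the Amelia and mice packages are installed. The seed value, the smaller m = 5, p2s = 0 (which silences the iteration log), and the names imp, na_left, and fit1 are choices made for this illustration, not part of the original analysis.

```r
library(Amelia)
library(mice)   # only for the nhanes data
data(nhanes)

nhanes$hyp_recode <- ifelse(nhanes$hyp == 2, 1,
                            ifelse(nhanes$hyp == 1, 0, NA))

set.seed(123)   # arbitrary seed so the run can be repeated
imp <- amelia(x = nhanes, m = 5, idvars = "hyp_recode", p2s = 0)

# Missing values remaining in each imputed dataset, by column
na_left <- sapply(imp$imputations, function(d) colSums(is.na(d)))
print(na_left)

# Number of rows glm() actually uses in the first imputed dataset
fit1 <- glm(hyp_recode ~ bmi + factor(age), family = binomial,
            data = imp$imputations[[1]])
nobs(fit1)
```

If the hyp_recode row of na_left is non-zero, those rows are still removed by glm(), and nobs() will be smaller than the 25 rows in the data.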
# Run logistic regression on each of the imputed datasets
models <- lapply(a.out$imputations, function(data) {
  glm(hyp_recode ~ bmi + factor(age), family = binomial, data = data)
})
# Extract the coefficients and average them across imputations.
# Note: this pools point estimates only; standard errors are not combined.
coef_list <- lapply(models, coef)
coef_matrix <- do.call(rbind, coef_list)
pooled_estimates <- colMeans(coef_matrix)
# View the pooled (averaged) estimates
pooled_estimates
## (Intercept) bmi factor(age)2 factor(age)3
## -58.378110 1.146333 27.769673 29.617479
After applying multiple imputation with the Amelia package and averaging the coefficients across the 20 imputed datasets, the logistic regression again showed positive associations between hypertension status and both BMI and age group. The pooled coefficient for BMI was approximately 1.15, and the coefficients for age group 2 (ages 40–59) and age group 3 (ages 60 and older) were approximately 27.77 and 29.62 relative to the reference group (ages 20–39). These values are almost identical to the pre-imputation estimates (1.15, 27.88, and 29.72), so the imputation changed the fitted model very little.
The very large age coefficients should not be read as effect sizes. They reflect the same separation seen before imputation: no one in the youngest age group with an observed hypertension status has hypertension. In addition, because hyp_recode was declared an ID variable, it was not imputed, and glm() still drops the 8 rows where it is missing. The post-imputation models therefore use 17 of the 25 observations rather than the full dataset, one more than the 16 used in the pre-imputation model, and only the imputed bmi values contribute new information.
Comparing the logistic regression models before and after multiple imputation shows how much the handling of missing data matters, and also how easily an imputation setup can fall short of its goal. In the pre-imputation model, 9 observations were omitted through listwise deletion, leaving 16 of 25. BMI and age group showed positive associations with hypertension, but the estimates were unstable, with very large standard errors and a separation warning.
After imputation with Amelia and averaging across 20 imputed datasets, BMI and the older age groups continued to show positive relationships with hypertension, with nearly unchanged coefficients. Two limitations keep this from being a full multiple-imputation analysis. First, the outcome hyp_recode was excluded from imputation, so rows with missing hypertension status were still dropped when the models were fit. Second, averaging coefficients with colMeans() yields pooled point estimates but no pooled standard errors, so the uncertainty due to imputation is not yet reflected in the results; Rubin's rules, which combine within- and between-imputation variance, would be needed for that. With those caveats, and given the small sample and the separation in the age groups, the results are best treated as exploratory.
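Pooling with Rubin's rules can be sketched in a few lines of base R. The function below is an illustration rather than the method used in this analysis: pool_rubin and its arguments are hypothetical names, and the toy numbers exist only to show the arithmetic. It takes one row of coefficient estimates and one row of standard errors per imputed dataset and returns the pooled estimate together with a standard error that adds the between-imputation variance to the average within-imputation variance.

```r
# Pool estimates from m imputed datasets using Rubin's rules.
# est, se: m x p matrices (one row per imputed dataset, one column per term)
pool_rubin <- function(est, se) {
  m       <- nrow(est)
  q_bar   <- colMeans(est)          # pooled point estimate
  within  <- colMeans(se^2)         # average within-imputation variance
  between <- apply(est, 2, var)     # between-imputation variance
  total   <- within + (1 + 1 / m) * between
  data.frame(
    term      = colnames(est),
    estimate  = unname(q_bar),
    std.error = unname(sqrt(total)),
    row.names = NULL
  )
}

# Toy input: 3 imputations, 2 coefficients (made-up numbers)
toy_est <- rbind(c(1.0, 2.0), c(1.2, 2.0), c(0.8, 2.0))
toy_se  <- matrix(0.5, nrow = 3, ncol = 2)
colnames(toy_est) <- colnames(toy_se) <- c("a", "b")

pooled <- pool_rubin(toy_est, toy_se)
print(pooled)
```

In the toy input, coefficient b is identical across imputations, so its pooled standard error equals the within-imputation value of 0.5, while coefficient a picks up extra variance from the spread between imputations. For the models fitted above, the inputs could be built with do.call(rbind, lapply(models, coef)) for the estimates and do.call(rbind, lapply(models, function(m) sqrt(diag(vcov(m))))) for the standard errors. The Amelia package also provides mi.meld() for this purpose.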