Missing Data/Multiple Imputation

2025-04-29

Data Overview

The data set I will be working with comes from the National Health and Nutrition Examination Survey (NHANES) and it provides annual data from 2002 to 2022 for the North American countries including Canada, Mexico, and the United States. It focuses on male health behaviors, including the percentage of males surviving to age of 65 (male_survival_rate), the percentage of tobacco use among adult males (male_tobacco_percentage), and the amount of Alcohol consumed among males measured in Liters(Alcohol_Consumption). The data offers a valuable basis for examining public health trends and potential links between lifestyle factors and male longevity.

Data Preparation

# Clear Environment
rm(list = ls()) 
# Import Data 
data <- read.csv("WDI_na.csv")
# Renaming Variables
library(dplyr) # Data manipulation
data <- data %>% rename("Life_Expectancy_65" = "SP.DYN.TO65.MA.ZS", "Smoking_Prevalence" = "SH.PRV.SMOK.MA", "Alcohol_Consumption" = "SH.ALC.PCAP.MA.LI")
# Removing unnecessary columns (country abbreviations)
data <- data %>% select(-c(iso2c, iso3c,))
# Checking to see how many observations and variables are in the data set
glimpse(data)
## Rows: 63
## Columns: 5
## $ country             <chr> "Canada", "Canada", "Canada", "Canada", "Canada", …
## $ year                <int> 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 20…
## $ Life_Expectancy_65  <dbl> 84.43242, 84.54817, 84.88524, 85.10808, 85.52173, …
## $ Smoking_Prevalence  <dbl> NA, NA, NA, 25.8, NA, 23.9, NA, NA, 21.6, NA, NA, …
## $ Alcohol_Consumption <dbl> 14.94, 15.01, 15.17, 15.40, 15.70, 15.92, 16.07, 1…
# Convert my DV (Life_Expectancy_65) from a percentage to a proportion
data <- data %>% mutate(Life_Expectancy_65 = Life_Expectancy_65/100)
# Data Table
library(DT) # Interactive Data Table
datatable(data)

Research Question/Hypothesis

Research Question 1 - Is there a relationship between the percentage of males who smoke and the proportion of males surviving to age 65 in North American countries between 2002 and 2022?

Hypotheis - Higher smoking prevalence among males is significantly associated with a lower proportion of males surviving to age 65.

Research Question 2 - Is there a relationship between alcohol consumption among males and the proportion of males surviving to age 65 in North American countries between 2002 and 2022?

Hypotheis - Higher alcohol consumption is significantly associated with a lower proportion of males surviving to age 65.

Statistical Analysis

To examine the relationship between male smoking prevalence, alcohol consumption, and the proportion of males surviving to age 65, I will perform a Beta Regression. A beta regression is appropriate to use because the dependent variable(Life_Expectancy_65) is a proportion bounded between 0 and 1.

library(betareg) # Beta Regression
library(modelsummary) # Table for model results
beta_model <- betareg(formula = Life_Expectancy_65 ~ Smoking_Prevalence + Alcohol_Consumption, data = data)
modelsummary(beta_model, output = "gt")
(1)
(Intercept) 0.415
(0.277)
Smoking_Prevalence -0.020
(0.008)
Alcohol_Consumption 0.111
(0.013)
Num.Obs. 15
R2 0.858
AIC -58.2
BIC -55.3
RMSE 0.03

Checking to see how many observations and variables are in the data set. There are a total of 63 observations in the data set and 5 variables.

dim(data)
## [1] 63  5

When you look at my beta regression model, you’ll notice that only 15 observations were used, however my data set contains 63 observations. This is because rows with missing values were excluded from the regression analysis. Missing observations can potentially reduce the statistical power of my regression and lead to bias results. There are several methods to handle missing data, including listwise deletion, pairwise deletion, mean substitution, simple imputation and multiple imputation. Every approach has its own advantages and disadvantages and is used depending on the context of the data. However for this analysis, I will be using multiple imputation. Multiple imputation is a statistical method used to manage missing data by generating multiple complete data sets. Beta regressions are then conducted on each dataset, and the results are combined together. Multiple imputaion is particularly useful to use when the missing data is missing at random (MAR) or when the data is completely missing at random (MCAR).Lastly multiple imputation can help reduce bias and increase the accuracy of estimates.


I will now interpret the Average Marginal Effect of my beta regression using using Clarify

# Average Marginal Effects(Smoking_Prevalence and Life_Expectancy_65)
library(clarify) # Average Marginal Effects
set.seed(123) #Repeatable Results
beta_sim <- sim(beta_model)
ame_beta <- sim_ame(beta_sim, var = "Smoking_Prevalence", contrast = "rd")
knitr::kable(summary(ame_beta)) # Create a table to display results
Estimate 2.5 % 97.5 %
E[dY/d(Smoking_Prevalence)] -0.0032491 -0.0060105 -0.0007991
# Average Marginal Effects(Alcohol_Consumption and Life_Expectancy_65)
set.seed(123) #Repeatable Results
beta_sim1 <- sim(beta_model)
ame_beta1 <- sim_ame(beta_sim1, var = "Alcohol_Consumption", contrast = "rd")
knitr::kable(summary(ame_beta1)) # Create a table to display results
Estimate 2.5 % 97.5 %
E[dY/d(Alcohol_Consumption)] 0.0182557 0.0140891 0.0224961

The results of my beta regression reveal a significant negative relationship between the percentage of men who smoke and the proportion of men surviving to age 65. As the percentage of men who smoke increases, the proportion of men surviving to age 65 decreases. In contrast, there is a significant positive relationship between alcohol consumption among men and the proportion of men surviving to age 65. As alcohol consumption increases, the proportion of men surviving to age 65 also increases. I will run the beta regression again but this time I will use the multiple imputation approach and I will compare my results to my previous beta regression and look for any differences or similarities.

Multiple Imputation

First, I will create multiple complete datasets, each containing different imputed values for the missing data. There are several packages available for multiple imputation, such as mice, norm, and pan, but for this analysis, I will be using the Amelia package. Typically, 10 to 20 datasets are used for multiple imputation, but for this analysis, I will use 10 datasets. It’s important to note that the imputed values aren’t just random guesses; they are based on the observed data and the relationships between the variables in the dataset.

library(Amelia) # Multiple Imputation
imputation <- amelia(x = data, 
                     m = 10, # Number of Imputed Data Sets
                     ts = "year", # Time Series Variable
                     cs = "country") # Cross Sectional Variable


Next, I will perform beta regressions for each of the 10 imputed datasets. To do this, I will use the with() function to apply the beta regression model to each dataset.

imputation1 <- with(imputation, betareg(formula = Life_Expectancy_65 ~ Smoking_Prevalence + Alcohol_Consumption))


Now I will show the results of the beta regression for each of the 10 imputed datasets by using the package modelsummary.

models <- list(
  "Imputation 1" = betareg(Life_Expectancy_65 ~ Smoking_Prevalence + Alcohol_Consumption,
                           data = imputation$imputations[[1]]),
  "Imputation 2" = betareg(Life_Expectancy_65 ~ Smoking_Prevalence + Alcohol_Consumption, 
                           data = imputation$imputations[[2]]),
  "Imputation 3" = betareg(Life_Expectancy_65 ~ Smoking_Prevalence + Alcohol_Consumption, 
                           data = imputation$imputations[[3]]),
  "Imputation 4" = betareg(Life_Expectancy_65 ~ Smoking_Prevalence + Alcohol_Consumption, 
                           data = imputation$imputations[[4]]),
  "Imputation 5" = betareg(Life_Expectancy_65 ~ Smoking_Prevalence + Alcohol_Consumption, 
                           data = imputation$imputations[[5]]),
  "Imputation 6" = betareg(Life_Expectancy_65 ~ Smoking_Prevalence + Alcohol_Consumption, 
                           data = imputation$imputations[[6]]),
  "Imputation 7" = betareg(Life_Expectancy_65 ~ Smoking_Prevalence + Alcohol_Consumption, 
                           data = imputation$imputations[[7]]),
  "Imputation 8" = betareg(Life_Expectancy_65 ~ Smoking_Prevalence + Alcohol_Consumption, 
                           data = imputation$imputations[[8]]),
  "Imputation 9" = betareg(Life_Expectancy_65 ~ Smoking_Prevalence + Alcohol_Consumption, 
                           data = imputation$imputations[[9]]),
  "Imputation 10" = betareg(Life_Expectancy_65 ~ Smoking_Prevalence + Alcohol_Consumption, 
                           data = imputation$imputations[[10]]))

modelsummary(models, output = "gt") # Display the results of all 10 beta regressions
Imputation 1 Imputation 2 Imputation 3 Imputation 4 Imputation 5 Imputation 6 Imputation 7 Imputation 8 Imputation 9 Imputation 10
(Intercept) 0.664 -0.309 0.335 0.302 0.526 0.212 0.580 0.359 0.286 0.356
(0.089) (0.073) (0.110) (0.136) (0.108) (0.104) (0.059) (0.127) (0.117) (0.089)
Smoking_Prevalence -0.017 -0.011 -0.015 -0.013 -0.021 -0.008 -0.023 -0.016 -0.012 -0.019
(0.002) (0.001) (0.003) (0.004) (0.004) (0.003) (0.002) (0.004) (0.003) (0.003)
Alcohol_Consumption 0.088 0.144 0.108 0.107 0.109 0.104 0.108 0.107 0.108 0.113
(0.004) (0.007) (0.006) (0.006) (0.005) (0.006) (0.003) (0.006) (0.006) (0.005)
Num.Obs. 63 63 63 63 63 63 63 63 63 63
R2 0.907 0.892 0.862 0.807 0.865 0.825 0.954 0.840 0.845 0.894
AIC -309.9 -299.5 -288.9 -269.5 -289.2 -274.8 -348.1 -279.0 -280.7 -304.2
BIC -301.3 -291.0 -280.3 -260.9 -280.7 -266.2 -339.5 -270.4 -272.1 -295.6
RMSE 0.02 0.02 0.02 0.03 0.02 0.03 0.02 0.03 0.03 0.02


Finally, I will combine the results from all 10 beta regressions into a single set of estimates and then I will interpret the results using Clarify.

results <- mi.combine(imputation1) # Combine the results from all 10 beta regressions
knitr::kable(results) # Create a table to display results
component term estimate std.error statistic p.value df r miss.info
mean (Intercept) 0.3311225 0.2982495 1.110220 0.2892882 11.65321 7.251989 0.8953572
mean Smoking_Prevalence -0.0154472 0.0057194 -2.700846 1.9854235 18.10011 2.391538 0.7330964
mean Alcohol_Consumption 0.1095393 0.0154267 7.100611 0.0000135 11.82890 6.828776 0.8894938
precision (phi) 351.3534892 181.7293798 1.933388 0.0767467 12.19091 6.103158 0.8777526

Now I will obtain the Average Marginal Effects of (Smoking_ Prevalence and Life_Expectancy_65) based on the single set of estimates shown above

set.seed(1234) #Repeatable Results
s <- misim(imputation1)
est <- sim_ame(s, var = "Smoking_Prevalence")
knitr::kable(summary(est)) # Create a table to display results
Estimate 2.5 % 97.5 %
E[dY/d(Smoking_Prevalence)] -0.0025118 -0.0040606 -0.0008858

Obtaining the Average Marginal Effects of (Alcohol_Consumption and Life_Expectancy_65)

set.seed(1234) #Repeatable Results
c <- misim(imputation1)
est1 <- sim_ame(c, var = "Alcohol_Consumption")
knitr::kable(summary(est1)) # Create a table to display results
Estimate 2.5 % 97.5 %
E[dY/d(Alcohol_Consumption)] 0.0178117 0.0138172 0.0241124

Comparing Results

AME (Smoking_Prevalence and Life_Expectancy_65) # Normal Linear Regression

Estimate 2.5 % 97.5 %
E[dY/d(Smoking_Prevalence)] -0.0032491 -0.0060105 -0.0007991

AME (Smoking_Prevalence and Life_Expectancy_65) # Multiple Imputation Linear Regression

Estimate 2.5 % 97.5 %
E[dY/d(Smoking_Prevalence)] -0.0025118 -0.0040606 -0.0008858

AME (Alcohol_Consumption and Life_Expectancy_65) # Normal Linear Regression

Estimate 2.5 % 97.5 %
E[dY/d(Alcohol_Consumption)] 0.0182557 0.0140891 0.0224961

AME (Alcohol_Consumption and Life_Expectancy_65) # Multiple Imputation Linear Regression

Estimate 2.5 % 97.5 %
E[dY/d(Alcohol_Consumption)] 0.0178117 0.0138172 0.0241124

The results show that the relationship between smoking prevalence and life expectancy at age 65 is negative in both the normal linear regression and the multiple imputation models, however the esitmates are smaller after imputation (−0.0032 vs. −0.0025). Similarly, alcohol consumption is positively associated with life expectancy at age 65 in both models, with very similar estimates (0.0183 vs. 0.0178). Overall, the use of multiple imputation slightly reduces the magnitude of the estimates but does not change the overall direction or significance of the relationships. Because multiple imputation accounts for the uncertainty of missing data by generating and combining several plausible datasets, it leads to more accurate and less biased results than simply ignoring or filling in missing values. The similarity of the results before and after imputation suggests that the findings are reliable and not heavily influenced by missing data.


References

Acock, A. C. (2005). Working with missing values. Journal of Marriage and Family, 67(4), 1012–1032.

Honaker, J., King, G., & Blackwell, M. (2011). Amelia II: A program for missing data. Journal of Statistical Software, 45(7), 1–47.

King, G., Honaker, J., Joseph, A., & Scheve, K. (2001). Analyzing incomplete political science data: An alternative algorithm for multiple imputation. American Political Science Review, 95(1), 49–69.

Honaker, J., & King, G. (2010). What to do about missing values in time‐series cross‐section data. American journal of political science, 54(2), 561-581.