For this week’s homework, I revisited my analysis of the Infant Mortality dataset to explore how missing data impacts the results of statistical analyses. Missing data is a common issue in many datasets and can significantly bias the results of a study if not handled appropriately. The objective of this assignment was twofold: first, to replicate the original analysis of the dataset, and second, to handle the missing data using multiple imputation techniques. By comparing the results from both approaches, I aim to better understand how missing data can influence statistical estimates.
In the original analysis, missing values were simply excluded from the dataset, which led to a loss of valuable observations. In contrast, multiple imputation fills in missing values by creating multiple plausible sets of values based on the observed data, providing a more reliable approach to handling missing data. In this report, I will present the results of both approaches and compare the findings.
The dataset consists of 87 observations and 9 variables, including:
Year: The year of data collection.
Maternal Race or Ethnicity: The ethnic background of the mother.
Infant Mortality Rate: The number of infant deaths per 1,000 live births.
Neonatal Mortality Rate: The number of neonatal deaths (deaths in the first 28 days of life) per 1,000 live births.
Postneonatal Mortality Rate: The number of postneonatal deaths (deaths after the first 28 days but before the first birthday) per 1,000 live births.
Infant Deaths: The total number of infant deaths.
Neonatal and Postneonatal Deaths: The number of neonatal and postneonatal deaths.
Number of Live Births: The total number of live births during the year.
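The rate definitions above can be checked against the first row printed by head(data). This is a minimal sketch, assuming each rate is deaths divided by live births, multiplied by 1,000 and rounded to one decimal place; the rounding is inferred from the printed values rather than documented in the source.

```r
# Values copied from the first row of the dataset (2007, Black Non-Hispanic)
infant_deaths       <- 287
neonatal_deaths     <- 177
postneonatal_deaths <- 110
live_births         <- 29268

# Assumed definition: deaths per 1,000 live births, one decimal place
rate_per_1000 <- function(deaths, births) round(deaths / births * 1000, 1)

imr  <- rate_per_1000(infant_deaths, live_births)        # dataset reports 9.8
nmr  <- rate_per_1000(neonatal_deaths, live_births)      # dataset reports 6.0
pnmr <- rate_per_1000(postneonatal_deaths, live_births)  # dataset reports 3.8

# In this row, infant deaths are the sum of neonatal and postneonatal deaths
deaths_add_up <- infant_deaths == neonatal_deaths + postneonatal_deaths

c(imr = imr, nmr = nmr, pnmr = pnmr)
```

Because infant deaths split into neonatal and postneonatal deaths, the infant mortality rate is closely tied to the neonatal rate by construction, which is worth keeping in mind when reading the regression fit.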
# clean up
rm(list=ls())
library(betareg)
library(modelsummary)
library(tidyverse)
library(clarify)
library(Amelia)
library(tinytable)
library(broom)
# Load the data
data <- read.csv("C:/Users/susha/OneDrive/Documents/Infant_Mortality_20250331 (1).csv")
# Check the data
head(data)
## Year Materal.Race.or.Ethnicity Infant.Mortality.Rate Neonatal.Mortality.Rate
## 1 2007 Black Non-Hispanic 9.8 6.0
## 2 2013 Other Hispanic 4.3 2.6
## 3 2013 Black Non-Hispanic 8.3 5.5
## 4 2008 White Non-Hispanic 3.3 2.1
## 5 2009 Black Non-Hispanic 9.5 5.8
## 6 2010 Black Non-Hispanic 8.6 5.6
## Postneonatal.Mortality.Rate Infant.Deaths Neonatal.Infant.Deaths
## 1 3.8 287 177
## 2 1.7 120 72
## 3 2.9 201 132
## 4 1.1 125 82
## 5 3.7 259 158
## 6 3.1 230 148
## Postneonatal.Infant.Deaths Number.of.Live.Births
## 1 110 29268
## 2 48 27621
## 3 69 24108
## 4 43 38383
## 5 101 27405
## 6 82 26635
# Check the structure of the data
str(data)
## 'data.frame': 87 obs. of 9 variables:
## $ Year : int 2007 2013 2013 2008 2009 2010 2010 2011 2008 2007 ...
## $ Materal.Race.or.Ethnicity : chr "Black Non-Hispanic" "Other Hispanic" "Black Non-Hispanic" "White Non-Hispanic" ...
## $ Infant.Mortality.Rate : num 9.8 4.3 8.3 3.3 9.5 8.6 2.8 8.1 NA NA ...
## $ Neonatal.Mortality.Rate : num 6 2.6 5.5 2.1 5.8 5.6 2 5.3 NA NA ...
## $ Postneonatal.Mortality.Rate: num 3.8 1.7 2.9 1.1 3.7 3.1 0.8 2.9 NA NA ...
## $ Infant.Deaths : int 287 120 201 125 259 230 104 210 NA NA ...
## $ Neonatal.Infant.Deaths : int 177 72 132 82 158 148 75 136 NA NA ...
## $ Postneonatal.Infant.Deaths : int 110 48 69 43 101 82 29 74 NA NA ...
## $ Number.of.Live.Births : int 29268 27621 24108 38383 27405 26635 37780 25825 2548 230 ...
colnames(data)
## [1] "Year" "Materal.Race.or.Ethnicity"
## [3] "Infant.Mortality.Rate" "Neonatal.Mortality.Rate"
## [5] "Postneonatal.Mortality.Rate" "Infant.Deaths"
## [7] "Neonatal.Infant.Deaths" "Postneonatal.Infant.Deaths"
## [9] "Number.of.Live.Births"
colSums(is.na(data))
## Year Materal.Race.or.Ethnicity
## 0 0
## Infant.Mortality.Rate Neonatal.Mortality.Rate
## 15 15
## Postneonatal.Mortality.Rate Infant.Deaths
## 17 13
## Neonatal.Infant.Deaths Postneonatal.Infant.Deaths
## 13 13
## Number.of.Live.Births
## 0
# Perform Listwise Deletion (Complete Case Analysis)
data_complete <- na.omit(data)
# Fit the regression model using Listwise Deletion
lm_complete <- lm(Infant.Mortality.Rate ~ `Materal.Race.or.Ethnicity` + Neonatal.Mortality.Rate, data = data_complete)
# Summarize the model from Listwise Deletion
summary(lm_complete)
##
## Call:
## lm(formula = Infant.Mortality.Rate ~ Materal.Race.or.Ethnicity +
## Neonatal.Mortality.Rate, data = data_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8109 -0.1190 0.0000 0.1848 0.6108
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 0.826064 0.334866
## Materal.Race.or.EthnicityAsian and Pacific Islander 0.009697 0.307337
## Materal.Race.or.EthnicityBlack NH 2.126064 0.446889
## Materal.Race.or.EthnicityBlack Non-Hispanic 2.207479 0.435642
## Materal.Race.or.EthnicityNon-Hispanic Black 2.070807 0.424360
## Materal.Race.or.EthnicityNon-Hispanic White -0.068326 0.325118
## Materal.Race.or.EthnicityOther Hispanic 0.580097 0.314772
## Materal.Race.or.EthnicityOther/Two or More -0.319457 0.420535
## Materal.Race.or.EthnicityPuerto Rican 1.123048 0.357145
## Materal.Race.or.EthnicityTotal 0.576652 0.421424
## Materal.Race.or.EthnicityWhite NH -0.184435 0.419806
## Materal.Race.or.EthnicityWhite Non-Hispanic 0.092217 0.314314
## Neonatal.Mortality.Rate 1.038914 0.082484
## t value Pr(>|t|)
## (Intercept) 2.467 0.01666 *
## Materal.Race.or.EthnicityAsian and Pacific Islander 0.032 0.97494
## Materal.Race.or.EthnicityBlack NH 4.757 1.38e-05 ***
## Materal.Race.or.EthnicityBlack Non-Hispanic 5.067 4.55e-06 ***
## Materal.Race.or.EthnicityNon-Hispanic Black 4.880 8.92e-06 ***
## Materal.Race.or.EthnicityNon-Hispanic White -0.210 0.83429
## Materal.Race.or.EthnicityOther Hispanic 1.843 0.07054 .
## Materal.Race.or.EthnicityOther/Two or More -0.760 0.45060
## Materal.Race.or.EthnicityPuerto Rican 3.145 0.00264 **
## Materal.Race.or.EthnicityTotal 1.368 0.17657
## Materal.Race.or.EthnicityWhite NH -0.439 0.66208
## Materal.Race.or.EthnicityWhite Non-Hispanic 0.293 0.77029
## Neonatal.Mortality.Rate 12.595 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2959 on 57 degrees of freedom
## Multiple R-squared: 0.9858, Adjusted R-squared: 0.9828
## F-statistic: 329.1 on 12 and 57 DF, p-value: < 2.2e-16
To begin, the regression model was estimated using only complete cases, following the listwise deletion approach. Because na.omit() was applied to the full data frame, any observation with a missing value in any of the nine variables was excluded, not only those missing a variable used in the model. As a result, the sample was reduced from 87 to 70 observations (57 residual degrees of freedom plus 13 estimated coefficients), with 17 observations dropped due to missingness.
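The difference between dropping rows with na.omit() on the whole data frame and letting lm() drop only rows missing a model variable can be seen on a small example. The data below are hypothetical and only illustrate the mechanics; the last step shows the same sample-size arithmetic used above (residual degrees of freedom plus number of coefficients).

```r
# Toy data frame (hypothetical values) with NAs in different columns
toy <- data.frame(
  y  = c(1.1, 2.0, NA, 3.9, 5.2, 6.0),
  x1 = c(0.5, NA, 1.5, 2.0, 2.5, 3.0),
  x2 = c(10, 20, 30, 40, NA, 60)   # not used in the model below
)

# na.omit() drops every row with an NA in ANY column, including x2
toy_complete <- na.omit(toy)
n_complete <- nrow(toy_complete)

# lm() on the full data only drops rows missing y or x1
fit_full <- lm(y ~ x1, data = toy)
n_model <- nobs(fit_full)

# Sample size behind a fitted lm: residual df + number of coefficients
fit_cc <- lm(y ~ x1, data = toy_complete)
n_from_df <- df.residual(fit_cc) + length(coef(fit_cc))

c(n_complete = n_complete, n_model = n_model, n_from_df = n_from_df)
```

In the toy data the two approaches keep 3 and 4 rows respectively, so the choice of where to apply na.omit() can change the estimation sample.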
The linear regression examined the relationship between Infant Mortality Rate (IMR) and two predictors: Maternal Race or Ethnicity and Neonatal Mortality Rate (NMR). The model showed a very strong fit (Adjusted R² = 0.9828), and the coefficient for NMR was both large (1.04) and highly statistically significant (p < 0.001), indicating a strong positive association with IMR. Several racial/ethnic categories also showed statistically significant differences in IMR, particularly for Black Non-Hispanic and Puerto Rican groups, compared to the reference group.
# Perform multiple imputation using Amelia
a.out <- amelia(data, m = 5, idvars = c("Year", "Materal.Race.or.Ethnicity"))
## -- Imputation 1 --
##
## 1 2 3 4 5 6 7 8 9 10
##
## -- Imputation 2 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12
##
## -- Imputation 3 --
##
## 1 2 3 4 5 6 7 8 9 10 11
##
## -- Imputation 4 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
##
## -- Imputation 5 --
##
## 1 2 3 4 5 6 7 8 9
# Check the imputed data
a.out
##
## Amelia output with 5 imputed datasets.
## Return code: 1
## Message: Normal EM convergence.
##
## Chain Lengths:
## --------------
## Imputation 1: 10
## Imputation 2: 12
## Imputation 3: 11
## Imputation 4: 16
## Imputation 5: 9
# Fit the same regression on each of the five imputed datasets
z.out <- with(a.out, lm(Infant.Mortality.Rate ~ `Materal.Race.or.Ethnicity` + Neonatal.Mortality.Rate))
summary(z.out)
## Length Class Mode
## [1,] 13 lm list
## [2,] 13 lm list
## [3,] 13 lm list
## [4,] 13 lm list
## [5,] 13 lm list
To address missing data more rigorously, a multiple imputation approach was implemented using the Amelia package in R. Five imputed datasets (m = 5) were created from the original data. Year and Maternal Race or Ethnicity were declared as ID variables (idvars), so they were carried through to the imputed datasets but left out of the imputation model. The imputation model itself used the remaining seven numeric variables: the three mortality rates, the three death counts, and the number of live births.
Each of the five imputed datasets was analyzed separately by fitting the same linear regression model as in the complete-case analysis. The regression results were then combined across imputations with mi.combine(), which applies Rubin’s rules to produce pooled estimates. This method accounts for the additional uncertainty introduced by the missing data.
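Rubin’s rules can be written out by hand for a single coefficient. The function below is a minimal sketch using the standard formulas; pool_rubin and its input values are hypothetical and are not part of the Amelia package.

```r
# Pool one coefficient across m imputed-data fits using Rubin's rules
pool_rubin <- function(est, se) {
  m     <- length(est)
  q_bar <- mean(est)                  # pooled point estimate
  w     <- mean(se^2)                 # within-imputation variance
  b     <- var(est)                   # between-imputation variance
  t_var <- w + (1 + 1 / m) * b        # total variance
  r     <- (1 + 1 / m) * b / w        # relative increase in variance
  df    <- (m - 1) * (1 + 1 / r)^2    # Rubin's degrees of freedom
  data.frame(estimate = q_bar, std.error = sqrt(t_var), r = r, df = df)
}

# Hypothetical estimates and standard errors from m = 5 fits
pooled <- pool_rubin(est = c(1.0, 1.2, 1.1, 0.9, 1.3), se = rep(0.1, 5))
pooled
```

The r and df columns in the pooled table of this report satisfy the same relation. For Neonatal.Mortality.Rate, 4 × (1 + 1/0.3449)² ≈ 60.8, which matches the reported df of 60.81.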
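The results from multiple imputation agreed with listwise deletion on the main predictor but not on the group contrasts. Neonatal Mortality Rate remained a highly significant predictor of Infant Mortality Rate, with a larger pooled coefficient (1.42 versus 1.04) and a slightly smaller standard error (0.070 versus 0.082). The racial/ethnic coefficients, however, shrank considerably: Black Non-Hispanic fell from 2.21 to 0.81 (p = 0.079) and Puerto Rican from 1.12 to 0.22 (p = 0.58), and only Black NH remained significant at the 5% level (1.40, p = 0.008). Standard errors for the group terms were somewhat larger under imputation, and the imputed models include an Unknown category that does not appear in the complete-case model. This suggests that the handling of missing data mattered more for the group comparisons than for the neonatal mortality effect.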
The results from the two approaches — listwise deletion and multiple imputation — are summarized in the tables below.
Listwise Deletion Results: The coefficients obtained using complete cases are printed from the coef_complete object.
Multiple Imputation Results: The final combined model after multiple imputations is presented using a formatted table (final_results).
# Extract the coefficient table from the complete-case model
summary_lm_complete <- summary(lm_complete)
coef_complete <- summary_lm_complete$coefficients
print(coef_complete)
## Estimate Std. Error
## (Intercept) 0.826064106 0.33486619
## Materal.Race.or.EthnicityAsian and Pacific Islander 0.009697239 0.30733733
## Materal.Race.or.EthnicityBlack NH 2.126064106 0.44688899
## Materal.Race.or.EthnicityBlack Non-Hispanic 2.207478836 0.43564195
## Materal.Race.or.EthnicityNon-Hispanic Black 2.070806754 0.42435971
## Materal.Race.or.EthnicityNon-Hispanic White -0.068325911 0.32511784
## Materal.Race.or.EthnicityOther Hispanic 0.580097312 0.31477187
## Materal.Race.or.EthnicityOther/Two or More -0.319456814 0.42053503
## Materal.Race.or.EthnicityPuerto Rican 1.123047631 0.35714514
## Materal.Race.or.EthnicityTotal 0.576651823 0.42142390
## Materal.Race.or.EthnicityWhite NH -0.184434549 0.41980637
## Materal.Race.or.EthnicityWhite Non-Hispanic 0.092217274 0.31431410
## Neonatal.Mortality.Rate 1.038913628 0.08248387
## t value Pr(>|t|)
## (Intercept) 2.46684836 1.665990e-02
## Materal.Race.or.EthnicityAsian and Pacific Islander 0.03155243 9.749392e-01
## Materal.Race.or.EthnicityBlack NH 4.75747702 1.378880e-05
## Materal.Race.or.EthnicityBlack Non-Hispanic 5.06718605 4.550021e-06
## Materal.Race.or.EthnicityNon-Hispanic Black 4.87983822 8.923181e-06
## Materal.Race.or.EthnicityNon-Hispanic White -0.21015737 8.342943e-01
## Materal.Race.or.EthnicityOther Hispanic 1.84291346 7.054377e-02
## Materal.Race.or.EthnicityOther/Two or More -0.75964377 4.505980e-01
## Materal.Race.or.EthnicityPuerto Rican 3.14451325 2.641937e-03
## Materal.Race.or.EthnicityTotal 1.36834151 1.765748e-01
## Materal.Race.or.EthnicityWhite NH -0.43933242 6.620818e-01
## Materal.Race.or.EthnicityWhite Non-Hispanic 0.29339210 7.702880e-01
## Neonatal.Mortality.Rate 12.59535493 4.141193e-18
# Pool the five fits using Rubin's rules
dzout <- mi.combine(z.out)
tinytable::tt(dzout)
| term | estimate | std.error | statistic | p.value | df | r | miss.info |
|---|---|---|---|---|---|---|---|
| (Intercept) | 0.10273906 | 0.38493018 | 0.26690310 | 7.895569e-01 | 4.190727e+03 | 0.0318797130 | 0.0313569648 |
| Materal.Race.or.EthnicityAsian and Pacific Islander | -0.04594315 | 0.37476986 | -0.12259030 | 1.097568e+00 | 1.075458e+08 | 0.0001928932 | 0.0001928746 |
| Materal.Race.or.EthnicityBlack NH | 1.40273906 | 0.52772465 | 2.65808896 | 7.866959e-03 | 1.480441e+04 | 0.0167121575 | 0.0165702996 |
| Materal.Race.or.EthnicityBlack Non-Hispanic | 0.81317463 | 0.46144954 | 1.76221790 | 7.851982e-02 | 6.268479e+02 | 0.0868171186 | 0.0828037088 |
| Materal.Race.or.EthnicityNon-Hispanic Black | 0.80689141 | 0.45922253 | 1.75708150 | 7.923997e-02 | 9.105762e+02 | 0.0709830148 | 0.0683224783 |
| Materal.Race.or.EthnicityNon-Hispanic White | 0.04588331 | 0.39601952 | 0.11586122 | 9.077625e-01 | 7.553657e+06 | 0.0007282283 | 0.0007279629 |
| Materal.Race.or.EthnicityOther Hispanic | 0.24562746 | 0.37874434 | 0.64853103 | 5.166433e-01 | 8.591168e+04 | 0.0068703275 | 0.0068465683 |
| Materal.Race.or.EthnicityOther/Two or More | 0.06717014 | 0.43946646 | 0.15284475 | 8.789009e-01 | 8.098579e+01 | 0.2857465231 | 0.2407629117 |
| Materal.Race.or.EthnicityPuerto Rican | 0.22021764 | 0.39997172 | 0.55058303 | 5.819523e-01 | 3.759515e+03 | 0.0337183506 | 0.0331327290 |
| Materal.Race.or.EthnicityTotal | 0.34823339 | 0.51227345 | 0.67978027 | 4.966437e-01 | 1.321838e+06 | 0.0017425972 | 0.0017410762 |
| Materal.Race.or.EthnicityUnknown | 0.15898049 | 0.44312892 | 0.35876803 | 7.198036e-01 | 2.166575e+03 | 0.0448969183 | 0.0438500266 |
| Materal.Race.or.EthnicityWhite NH | -0.03215559 | 0.51130736 | -0.06288897 | 1.050145e+00 | 6.641468e+06 | 0.0007766675 | 0.0007763656 |
| Materal.Race.or.EthnicityWhite Non-Hispanic | 0.01607780 | 0.38315805 | 0.04196126 | 9.665296e-01 | 3.350948e+07 | 0.0003456179 | 0.0003455581 |
| Neonatal.Mortality.Rate | 1.41961102 | 0.07031624 | 20.18895028 | 1.196863e-28 | 6.081327e+01 | 0.3449294896 | 0.2797700065 |
# Final combined model
final_results <- data.frame(
variable = dzout$term,
Estimate = dzout$estimate,
Std.Error = dzout$std.error,
p.value = dzout$p.value
)
knitr::kable(final_results, digits = 3, caption = "Final Model Results")
| variable | Estimate | Std.Error | p.value |
|---|---|---|---|
| (Intercept) | 0.103 | 0.385 | 0.790 |
| Materal.Race.or.EthnicityAsian and Pacific Islander | -0.046 | 0.375 | 1.098 |
| Materal.Race.or.EthnicityBlack NH | 1.403 | 0.528 | 0.008 |
| Materal.Race.or.EthnicityBlack Non-Hispanic | 0.813 | 0.461 | 0.079 |
| Materal.Race.or.EthnicityNon-Hispanic Black | 0.807 | 0.459 | 0.079 |
| Materal.Race.or.EthnicityNon-Hispanic White | 0.046 | 0.396 | 0.908 |
| Materal.Race.or.EthnicityOther Hispanic | 0.246 | 0.379 | 0.517 |
| Materal.Race.or.EthnicityOther/Two or More | 0.067 | 0.439 | 0.879 |
| Materal.Race.or.EthnicityPuerto Rican | 0.220 | 0.400 | 0.582 |
| Materal.Race.or.EthnicityTotal | 0.348 | 0.512 | 0.497 |
| Materal.Race.or.EthnicityUnknown | 0.159 | 0.443 | 0.720 |
| Materal.Race.or.EthnicityWhite NH | -0.032 | 0.511 | 1.050 |
| Materal.Race.or.EthnicityWhite Non-Hispanic | 0.016 | 0.383 | 0.967 |
| Neonatal.Mortality.Rate | 1.420 | 0.070 | 0.000 |
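Two p-values in the pooled table are above 1 (1.098 and 1.050), which cannot be valid probabilities. Both belong to negative t statistics. The sketch below uses the statistic and df values copied from the tinytable output, reproduces the reported values, and then computes the usual two-sided p-values. The explanation it tests (a two-sided calculation that skips the absolute value of the statistic) is an inference from the numbers, not something documented in the source.

```r
# statistic and df copied from the pooled table for the two affected rows
rows <- data.frame(
  term      = c("Asian and Pacific Islander", "White NH"),
  statistic = c(-0.12259030, -0.06288897),
  df        = c(1.075458e+08, 6.641468e+06)
)

# Formula that appears to reproduce the reported values (assumption)
rows$p_reported_form <- 2 * (1 - pt(rows$statistic, df = rows$df))

# Standard two-sided p-value, symmetric in the sign of the statistic
rows$p_two_sided <- 2 * pt(-abs(rows$statistic), df = rows$df)

rows
```

Under this reading the corrected p-values are about 0.902 and 0.950, so neither coefficient is close to statistical significance and the substantive interpretation is unchanged.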
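# Simulate coefficients from the multiply imputed fits (clarify package)
s <- misim(z.out)
# Average marginal effect of neonatal mortality on infant mortality
est <- sim_ame(s, var = "Neonatal.Mortality.Rate")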
summary(est)
## Estimate 2.5 % 97.5 %
## E[dY/d(Neonatal.Mortality.Rate)] 1.42 1.29 1.55
Using sim_ame() after multiple imputation, the estimated average marginal effect of Neonatal.Mortality.Rate on Infant.Mortality.Rate was 1.42 (95% CI: 1.29 to 1.55). The interval excludes zero, indicating a strong and statistically significant relationship.
This provides a clear and intuitive measure of impact: an increase of one neonatal death per 1,000 live births is associated with an average increase of 1.42 infant deaths per 1,000 live births, holding maternal race or ethnicity constant.
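In a linear model with no interactions or transformations, the average marginal effect of a continuous predictor equals its regression coefficient, which is why the 1.42 reported here matches the pooled coefficient for Neonatal.Mortality.Rate. The sketch below illustrates this on simulated, hypothetical data using a finite-difference calculation in base R. It does not use the clarify package and is not the report's own computation.

```r
set.seed(1)

# Hypothetical data: y depends linearly on x and on a grouping factor g
n <- 50
toy <- data.frame(
  x = runif(n, 1, 6),
  g = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
toy$y <- 0.5 + 1.4 * toy$x + ifelse(toy$g == "b", 0.8, 0) + rnorm(n, sd = 0.3)

fit <- lm(y ~ g + x, data = toy)

# Average marginal effect by finite differences: nudge x for every row,
# compare predictions, and average the resulting slopes
eps <- 1e-4
shifted <- transform(toy, x = x + eps)
ame <- mean((predict(fit, shifted) - predict(fit, toy)) / eps)

c(ame = ame, coefficient = unname(coef(fit)["x"]))
```

The two numbers agree up to floating-point error. Simulation-based tools such as sim_ame() add value mainly by supplying the uncertainty interval and by handling models where the effect is not constant across observations.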
In this analysis, I compared results from handling missing data using listwise deletion versus multiple imputation. Listwise deletion reduced the dataset from 87 to 70 observations, potentially biasing results by excluding incomplete cases, whereas multiple imputation retained all 87 observations. Both approaches identified neonatal mortality as a strong, highly significant predictor of infant mortality, but the estimates were not identical: the pooled coefficient was larger (1.42 versus 1.04), while most racial/ethnic coefficients were smaller and, apart from Black NH, no longer significant at the 5% level. Estimating the average marginal effect after multiple imputation showed that a one-unit increase in neonatal mortality is associated with an average 1.42-unit increase in infant mortality, reinforcing the strong predictive relationship. Overall, multiple imputation used more of the available information, and the differences between the two sets of estimates show that conclusions about group differences in this dataset are sensitive to how missing data are handled.