Using bigLC for Edit and Imputation

Evaluating Various Imputation Methods for Household Data

Emanuel Ben-David
Joint work with Tom Mule and Joe Schafer

2026-03-23

The Nested Household Dataset

First 10 Rows of dataMissMAR.csv
Hhindex pernum sex race hisp age relate ownership headsex headrace headhisp headage householdsize
86 2 2 2 1 13 7 3 2 2 1 61 2
86 3 2 2 NA NA 9 3 2 2 1 61 2
354 2 1 3 1 30 2 NA NA NA 1 57 3
354 3 1 3 1 24 2 NA NA NA 1 57 3
354 4 2 NA 1 NA 10 NA NA NA 1 57 3
1191 2 1 1 1 42 1 1 2 1 1 41 3
1191 3 1 1 1 16 2 1 2 1 1 41 3
1191 4 1 1 1 10 2 1 2 1 1 41 3
1246 2 1 1 1 45 1 1 2 1 1 41 3
1246 3 1 1 1 11 2 1 2 1 1 41 3

15,361 individuals in 7,627 households • 13 variables • 23,265 total missing values • NA = missing value (missing at random, MAR)
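As a quick check on the layout shown above, the sketch below tallies missing values in the ten displayed rows. It assumes the file is comma-delimited and uses the literal string NA for missing entries; the inline SAMPLE string stands in for dataMissMAR.csv, and all variable names are illustrative.

```python
import csv
import io

# Assumption: comma-delimited file with "NA" marking missing entries.
# SAMPLE reproduces the ten rows displayed above, not the full file.
SAMPLE = """\
Hhindex,pernum,sex,race,hisp,age,relate,ownership,headsex,headrace,headhisp,headage,householdsize
86,2,2,2,1,13,7,3,2,2,1,61,2
86,3,2,2,NA,NA,9,3,2,2,1,61,2
354,2,1,3,1,30,2,NA,NA,NA,1,57,3
354,3,1,3,1,24,2,NA,NA,NA,1,57,3
354,4,2,NA,1,NA,10,NA,NA,NA,1,57,3
1191,2,1,1,1,42,1,1,2,1,1,41,3
1191,3,1,1,1,16,2,1,2,1,1,41,3
1191,4,1,1,1,10,2,1,2,1,1,41,3
1246,2,1,1,1,45,1,1,2,1,1,41,3
1246,3,1,1,1,11,2,1,2,1,1,41,3
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Count NA entries per variable, then overall.
missing_by_var = {name: sum(r[name] == "NA" for r in rows) for name in rows[0]}
total_missing = sum(missing_by_var.values())

# Distinct households are identified by Hhindex.
households = {r["Hhindex"] for r in rows}

print(total_missing)    # 13 missing values in these ten rows
print(len(households))  # 4 households
```

To run the same tally on the full file, replace `io.StringIO(SAMPLE)` with an opened file handle.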

Missing Data Overview

Study Overview

Objective: Compare four imputation methods applied to nested household survey data

Data Source: 2012 American Community Survey (ACS) Public Use Microdata Sample

Evaluation Scope: 33 distinct patterns across 5 categories

Metrics: root mean squared error (RMSE) and 95% confidence interval (CI) coverage

Method        Approach
Hot Deck      Traditional donor-based imputation
By HHSIZE     Latent class model with flattened household structure
Nested        Latent class model that preserves household clustering
Two-Level LC  Hierarchical latent class model with household and individual levels

Imputation Methods
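To make the donor-based idea concrete, here is a minimal sketch of a random hot deck: each missing value is copied from a randomly chosen record in the same imputation cell. It is an illustration only, not the hot deck procedure used in the study; the function name `hot_deck_impute`, the choice of cell variables, and the toy records are all hypothetical.

```python
import random

def hot_deck_impute(records, target, class_vars, rng):
    """Fill missing `target` values by copying from a random donor in the same cell."""
    # Build donor pools: observed target values grouped by the class variables.
    pools = {}
    for rec in records:
        if rec[target] is not None and all(rec[v] is not None for v in class_vars):
            key = tuple(rec[v] for v in class_vars)
            pools.setdefault(key, []).append(rec[target])
    # Fallback pool: every observed target value, regardless of cell.
    all_donors = [rec[target] for rec in records if rec[target] is not None]

    completed = []
    for rec in records:
        new = dict(rec)  # copy so the input records are left untouched
        if new[target] is None:
            key = tuple(new[v] for v in class_vars)
            donors = pools.get(key) or all_donors
            new[target] = rng.choice(donors)
        completed.append(new)
    return completed

# Toy records (hypothetical); None marks a missing age.
people = [
    {"sex": 1, "age": 42},
    {"sex": 1, "age": 16},
    {"sex": 2, "age": 13},
    {"sex": 2, "age": None},
    {"sex": 1, "age": None},
]
completed = hot_deck_impute(people, "age", ["sex"], random.Random(0))
print([p["age"] for p in completed])
```

A hot deck like this treats each person independently, so nothing ties the imputed value to other members of the same household.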

Edit Rules for Imputed Data

Imputed records must satisfy household-level structural constraints. Key edit rules include:

Household Head & Spouse

Parent–Child & Adoption

Siblings & Grandparents

Source: Ben-David, Mule & Schafer (2024), Parts I and II.

Two-Level Latent Class Model

Two-Level LC: How It Works

RMSE Summary Statistics

Method        Mean    Median  SD      Min     Max     Rank
Nested        0.0369  0.0170  0.0490  0.0010  0.1660  1st
By HHSIZE     0.0373  0.0160  0.0486  0.0010  0.1650  2nd
Two-Level LC  0.0401  0.0180  0.0509  0.0010  0.1660  3rd
Hot Deck      0.0481  0.0180  0.0617  0.0010  0.2260  4th

RMSE computed across all 33 patterns. Lower RMSE indicates estimates closer to the population parameter.
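The summary rows can be cross-checked against the per-pattern RMSE values listed in Appendices A through E. The sketch below does this for two of the four methods. It assumes the reported SD is the sample standard deviation (n − 1 denominator); because the appendix values are rounded to three decimals, small discrepancies in the last digit are possible.

```python
import statistics

# Per-pattern RMSE values copied from Appendices A-E (33 patterns per method).
rmse = {
    "Nested": [
        0.018, 0.069, 0.089, 0.008, 0.047, 0.019, 0.007,                # A
        0.017, 0.019, 0.125, 0.006, 0.009,                              # B
        0.166, 0.019, 0.055, 0.107, 0.015, 0.012, 0.008, 0.004, 0.003,  # C
        0.013, 0.020, 0.161, 0.003, 0.017,                              # D
        0.026, 0.139, 0.011, 0.002, 0.001, 0.003, 0.001,                # E
    ],
    "Hot Deck": [
        0.051, 0.158, 0.226, 0.024, 0.076, 0.049, 0.012,                # A
        0.018, 0.010, 0.134, 0.007, 0.049,                              # B
        0.163, 0.018, 0.063, 0.107, 0.013, 0.006, 0.007, 0.004, 0.001,  # C
        0.006, 0.019, 0.162, 0.002, 0.016,                              # D
        0.021, 0.140, 0.015, 0.003, 0.002, 0.004, 0.001,                # E
    ],
}

summary = {
    method: {
        "n": len(values),
        "mean": round(statistics.mean(values), 4),
        "median": statistics.median(values),
        "sd": round(statistics.stdev(values), 4),  # sample SD (assumption)
        "min": min(values),
        "max": max(values),
    }
    for method, values in rmse.items()
}

for method, stats in summary.items():
    print(method, stats)
```

The means and medians reproduce the table (0.0369 and 0.0170 for Nested, 0.0481 and 0.0180 for Hot Deck).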

RMSE Distribution

Mean RMSE Comparison

RMSE by Pattern Category

CI Coverage Summary

Method        Patterns Covered (of 33)  Coverage Rate  Mean CI Width
Nested        9                         27.3%          0.0158
Hot Deck      8                         24.2%          0.0166
By HHSIZE     7                         21.2%          0.0158
Two-Level LC  6                         18.2%          0.0161

CI coverage = proportion of 33 patterns where the 95% CI contains the population parameter. A well-calibrated method should approach 95%.
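The coverage definition above amounts to a simple count. In the sketch below, the three population values are taken from Appendix A, but the intervals are made up for illustration and are not results from the study; `coverage_rate` is a hypothetical helper name.

```python
def coverage_rate(intervals, truths):
    """Return (number covered, proportion covered) for paired CIs and true values."""
    covered = sum(lo <= truth <= hi for (lo, hi), truth in zip(intervals, truths))
    return covered, covered / len(truths)

# Population values from Appendix A; the intervals are hypothetical.
truths = [0.941, 0.578, 0.034]
intervals = [(0.930, 0.946), (0.580, 0.596), (0.030, 0.046)]

covered, rate = coverage_rate(intervals, truths)
print(covered, round(rate, 3))  # 2 of 3 intervals contain the true value

# The table's rates follow from the counts: 9 of 33 patterns is 27.3%.
nested_rate = round(100 * 9 / 33, 1)
print(nested_rate)
```

A narrow interval centered on a biased estimate misses the population value, which is how low coverage can coexist with small CI widths.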

CI Coverage by Method

CI Coverage Heatmap

Combined Performance Overview

Method        Mean RMSE  RMSE Rank  CI Coverage   CI Rank
Nested        0.0369     1st        27.3% (9/33)  1st
By HHSIZE     0.0373     2nd        21.2% (7/33)  3rd
Two-Level LC  0.0401     3rd        18.2% (6/33)  4th
Hot Deck      0.0481     4th        24.2% (8/33)  2nd

All methods show CI coverage rates substantially below the 95% nominal level.

Pattern-Level Best/Worst Count

Method        Highest RMSE  Lowest RMSE  Net
By HHSIZE     4             10           6
Nested        6             6            0
Two-Level LC  4             3            -1
Hot Deck      19            14           -5

For each of the 33 patterns, the method with the lowest and highest RMSE is identified. Net = Lowest count − Highest count.
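The tally can be sketched as follows, using five Appendix A rows that have a unique lowest and highest RMSE at the reported precision. How the study resolves ties between rounded values is not stated, so this example sidesteps the question by excluding tied rows; the variable names are illustrative.

```python
from collections import Counter

METHODS = ("Hot Deck", "By HHSIZE", "Nested", "Two-Level LC")

# RMSE by pattern, in METHODS order (five Appendix A rows without ties).
rmse_rows = {
    "All same race HH size = 2":   (0.051, 0.022, 0.018, 0.023),
    "All same race HH size = 3":   (0.158, 0.067, 0.069, 0.088),
    "All same race HH size = 4":   (0.226, 0.077, 0.089, 0.132),
    "White-nonwhite couple":       (0.049, 0.016, 0.019, 0.020),
    "Non-White couple, homeowner": (0.012, 0.009, 0.007, 0.008),
}

lowest, highest = Counter(), Counter()
for values in rmse_rows.values():
    # index() would pick the first-listed method on a tie; none occur here.
    lowest[METHODS[values.index(min(values))]] += 1
    highest[METHODS[values.index(max(values))]] += 1

# Net = Lowest count - Highest count, as defined above.
net = {m: lowest[m] - highest[m] for m in METHODS}
print(net)
```

On this subset Hot Deck has the highest RMSE in every row, giving it a net of -5, while By HHSIZE and Nested split the lowest-RMSE counts 3 to 2.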

Summary of Results

A positive net score means the method had the lowest RMSE more often than the highest; a negative score means the reverse.

Appendix

Detailed Comparison Tables by Category

Appendix A: Household Racial Composition

Pattern                      Pop    Hot Deck  By HHSIZE  Nested  Two-Level LC
All same race HH size = 2    0.941  0.0510    0.0220     0.0180  0.0230
All same race HH size = 3    0.907  0.1580    0.0670     0.0690  0.0880
All same race HH size = 4    0.900  0.2260    0.0770     0.0890  0.1320
White couple                 0.578  0.0240    0.0090     0.0080  0.0080
Same race couple             0.694  0.0760    0.0470     0.0470  0.0510
White-nonwhite couple        0.034  0.0490    0.0160     0.0190  0.0200
Non-White couple, homeowner  0.072  0.0120    0.0090     0.0070  0.0080

🟢 Green = Lowest RMSE in row    🔴 Red = Highest RMSE in row

Appendix B: Spouse / Partner Presence & Race

Pattern                                          Pop    Hot Deck  By HHSIZE  Nested  Two-Level LC
Spouse present                                   0.694  0.0180    0.0140     0.0170  0.0180
Spouse present, HH is White                      0.609  0.0100    0.0190     0.0190  0.0200
Spouse present, HH is Black                      0.152  0.1340    0.1290     0.1250  0.1250
HH older than spouse, White HH                   0.327  0.0070    0.0060     0.0060  0.0070
Couple with age difference less than five years  0.486  0.0490    0.0220     0.0090  0.0330


Appendix C: Children & Parental Structure

Pattern                                             Pop    Hot Deck  By HHSIZE  Nested  Two-Level LC
At least one biological child present               0.438  0.1630    0.1650     0.1660  0.1660
Only one parent                                     0.171  0.0180    0.0180     0.0190  0.0190
Adult female w/ at least one child under 5          0.327  0.0630    0.0620     0.0550  0.0540
Adult Black female w/ at least one child under 18   0.149  0.1070    0.1110     0.1070  0.1080
Adult Hisp male w/ at least one child under 10      0.027  0.0130    0.0160     0.0150  0.0170
Hisp couple with at least one biological child      0.025  0.0060    0.0160     0.0120  0.0170
At least one stepchild                              0.019  0.0070    0.0080     0.0080  0.0070
At least one adopted child, White couple            0.008  0.0040    0.0050     0.0040  0.0050
Black couple with at least two biological children  0.006  0.0010    0.0030     0.0030  0.0030


Appendix D: Multigenerational Households

Pattern                                        Pop    Hot Deck  By HHSIZE  Nested  Two-Level LC
At least two generations present, Hisp couple  0.026  0.0060    0.0160     0.0130  0.0170
Two generations present, Black HH              0.030  0.0190    0.0190     0.0200  0.0200
At least three generations present             0.183  0.1620    0.1600     0.1610  0.1610
Three generations present, White couple        0.005  0.0020    0.0020     0.0030  0.0030
One grandchild present                         0.034  0.0160    0.0160     0.0170  0.0160


Appendix E: Demographic & Tenure Characteristics

Pattern                       Pop    Hot Deck  By HHSIZE  Nested  Two-Level LC
Male HH, homeowner            0.299  0.0210    0.0210     0.0260  0.0180
HH over 35, no child present  0.402  0.1400    0.1380     0.1390  0.1380
White HH with Hisp origin     0.066  0.0150    0.0120     0.0110  0.0140
Black HH, homeowner           0.035  0.0030    0.0020     0.0020  0.0020
Black HH under 40, homeowner  0.006  0.0020    0.0010     0.0010  0.0020
Hisp HH over 50, homeowner    0.017  0.0040    0.0030     0.0030  0.0020
White HH under 25, homeowner  0.006  0.0010    0.0010     0.0010  0.0010


References

Software

Key Publication

Data Source