Project History and Acknowledgments

This project began in 2013 as the computational backbone of my advanced doctoral research at Ohio University, rooted in psychometrics and educational measurement theory. What started as a structured academic endeavor gradually transformed into an ongoing journal of my learning journey—documenting not just solutions, but the iterative process of discovery, setbacks, and breakthroughs in statistical methodology.

The initial framework was built meticulously from the ground up, prioritizing foundational rigor over convenience. Early implementations balanced theoretical precision with practical execution, often revisiting core assumptions as my understanding deepened. Over time, the work evolved through multiple phases of refinement, each reflecting new insights from applied research and computational experimentation.

In recent years, the original R codebase and documentation have been systematically modernized using AI-assisted tools. These enhancements optimized clarity and computational efficiency while carefully preserving the analytical intent of the original work. The AI collaboration served not as a replacement for human reasoning but as an accelerator—helping bridge gaps between legacy implementations and contemporary best practices without obscuring the learning process that remains central to this project.

More than a technical archive, this project stands as a testament to the nonlinear nature of mastery in statistical computing. Annotations explicitly trace the evolution of my thinking, from early approximations to mature implementations, making it a resource for others navigating similar learning curves.

1 Introduction to Missing Data Methodology

Modern missing data analysis, rooted in seminal work from the 1970s, employs principled approaches such as maximum likelihood estimation (e.g., the EM algorithm) and Bayesian methods (e.g., MCMC) (Dempster, A. P. et al., 1977; Rubin, 1976). These approaches offer several advantages over traditional methods:

  • Statistical Superiority: Yield less biased estimates and greater statistical power under weaker assumptions about the missingness mechanism (Enders, 2010).
  • Generalizability: Provide unified solutions for a wide range of missing data patterns (Schafer, 1997).
  • Proper Uncertainty Accounting: Preserve valid standard errors and Type I error rates (Rubin, 1987).

Despite these benefits, practical challenges remain. These methods were initially limited by computational demands and a lack of accessible software until the late 1990s (Enders, 2010). Successful implementation also requires non-trivial statistical expertise (Bodner, 2006; Peugh, J. L. & Enders, C. K., 2004).


1.1 Limitations of Traditional Missing Data Methods

Conventional approaches suffer from methodological flaws that can compromise inference:

Method Preserved Characteristics Compromised Aspects
Mean Imputation Sample means Underestimates variance, distorts covariances and p-values
Regression Imputation Linear relationships Inflates correlations, reduces residual variance
Complete-Case Analysis Model assumptions Loss of power, biased estimates if data not MCAR

These methods treat imputed values as known, failing to account for uncertainty and often underestimating standard errors (Schafer, 1997). This limitation is particularly problematic in multivariate analysis, where preserving the joint distribution \(P(X_1, X_2, \ldots, X_n)\) is critical (Buuren, 2018, p. 42).

Even with optimal study design, missing data is common. Principled methods are needed for:

  • Survey Nonresponse — e.g., item non-compliance (Rubin, 1976)
  • Experimental Attrition — e.g., longitudinal dropout (Enders, 2010)
  • Planned Missingness — e.g., matrix sampling for cost efficiency (Graham, 2009)

1.2 Missing Data Mechanisms

(Rubin, 1976) established a foundational classification system for missing data mechanisms that has become the standard framework for analyzing incomplete data. These mechanisms describe the probabilistic relationship between observed variables and missingness patterns, without implying causal explanations for why data are missing (Enders, 2010, p. 3). For instance, survey nonresponse might systematically vary with income level without this relationship being explicitly modeled.

1.2.1 Missing Completely at Random (MCAR)

A variable \(X\) satisfies the MCAR condition when the probability of missingness is independent of both:

  • The observed values of other variables (\(X_{obs}\))
  • The missing values of \(X\) itself (\(X_{mis}\))

In simpler terms, the missingness in \(X\) is completely random and doesn’t depend on any other observed or unobserved data.

Formally, the missingness indicator \(R\) follows:

\[ P(R \mid X_{obs}, X_{mis}, \phi) = P(R \mid \phi) \]

where:

  • \(R\) is the missingness indicator matrix,

  • \(\phi\) represents parameters governing the missingness process

Key Implications:

  • No Systematic Pattern: The missingness is like flipping a fair coin for each data point—whether it’s missing or not is purely by chance.
  • Ignorable Missingness: Under MCAR, the complete cases (rows with no missing data) are essentially a random sample of the full dataset. This means analyses using only complete cases aren’t systematically biased (though you may lose power due to reduced sample size).
  • MCAR is the most restrictive and often unrealistic assumption in practice

1.2.2 Missing at Random (MAR)

A variable \(X\) satisfies the MAR condition when the probability of missingness depends only on observed data (either \(X_{obs}\) or other fully observed variables), but not on the missing values themselves (\(X_{mis}\)). For example, Income is more likely to be missing for younger people (Age is observed), but not based on the actual (missing) Income values.

Formally:

\[ P(R = 1 \mid X_{obs}, X_{mis}, \phi) = P(R = 1 \mid X_{obs}, \phi) \]

where:

  • \(R\) is the missingness indicator (1 = missing, 0 = observed),

  • \(X_{obs}\) represents observed values of \(X\),

  • \(X_{mis}\) represents missing values of \(X\),

  • \(\phi\) denotes parameters governing the missingness process

Key Properties:

  • The missingness mechanism is ignorable for likelihood-based inference
  • Valid inference can be obtained using:
    • Full information maximum likelihood (FIML)
    • Multiple imputation methods
    • Other modern missing data techniques

Practical Example: In a clinical study where missingness in depression scores (\(X\)) depends on observed age groups but not on unobserved depression levels, the data would be MAR. This allows unbiased estimation if age is included in the analysis model.

1.2.3 Missing Not at Random (MNAR)

A variable \(X\) is MNAR (also called “non-ignorable missingness”) when the probability of missingness depends on:

  • The unobserved values of \(X\) itself (\(X_{mis}\)), and/or
  • Unmeasured variables related to both \(X\) and the missingness process

In simpler terms, the missingness depends on the missing values themselves. For example, people with higher Income are less likely to report it, even after accounting for observed variables like Age.

The formal probability model is:

\[ P(R = 0 \mid X_{obs}, X_{mis}, \phi) = f(X_{mis}, \phi) \] where:

  • \(R = 0\) indicates missing values (consistent with Rubin’s notation),

  • \(X_{obs}\) = observed values of \(X\),

  • \(X_{mis}\) = missing values of \(X\),

  • \(\phi\) = parameters governing the missingness mechanism,

  • \(f(\cdot)\) represents a functional relationship (often logistic in practice)


Key Characteristics:

  • Non-ignorable: The missingness mechanism must be explicitly modeled
  • Identifiability Challenges: Requires untestable assumptions about \(X_{mis}\)
  • Analysis Approaches:
    • Selection models (e.g., Heckman’s model)
    • Pattern mixture models
    • Shared parameter models
    • Sensitivity analyses

Example Cases:

  • Income Reporting: Higher-income individuals may systematically refuse to report earnings
  • Clinical Trials: Patients with worse symptoms may drop out more frequently
  • Educational Testing: Students performing poorly may skip difficult questions

The joint distribution factors as:

\[ P(X, R \mid \theta, \phi) = P(X \mid \theta) \cdot P(R \mid X, \phi) \]

requiring simultaneous estimation of both the data model (\(\theta\)) and missingness model (\(\phi\)).



The following code simulates two datasets — one MAR and one MNAR — and creates side-by-side scatterplots showing how missingness in income is distributed depending on either observed age (MAR) or unobserved income (MNAR), with color-coded points.

# Load libraries
library(ggplot2)
library(dplyr)
library(patchwork)

set.seed(123)

# Simulate base data
n <- 20
age <- round(runif(n, 18, 65))
income <- round(rnorm(n, mean = 50000, sd = 15000))

# ---- MAR: Missingness depends on Age (observed) ----
mar_data <- data.frame(Age = age, Income = income) %>%
    mutate(Missing = ifelse(Age < 40, rbinom(n, 1, 0.9), rbinom(n, 1, 0.1)), Observed = ifelse(Missing ==
        1, NA, Income), Status = ifelse(Missing == 1, "Missing", "Observed"))

# ---- MNAR: Missingness depends on Income (unobserved) ----
mnar_data <- data.frame(Age = age, Income = income) %>%
    mutate(Missing = ifelse(Income > 45000, rbinom(n, 1, 0.9), rbinom(n, 1, 0.1)),
        Observed = ifelse(Missing == 1, NA, Income), Status = ifelse(Missing == 1,
            "Missing", "Observed"))


# ---- Plotting function with subtitle ----
plot_missing <- function(df, title, subtitle, fill_color, xlab) {
    ggplot(df, aes(x = Age, y = Income)) + geom_point(aes(fill = Status), color = "black",
        shape = 21, size = 5, stroke = 1.2) + scale_fill_manual(values = c(Observed = fill_color,
        Missing = "white")) + theme_minimal(base_size = 14) + labs(title = title,
        subtitle = subtitle, x = xlab, y = "Income") + theme(legend.position = "none",
        plot.title = element_text(size = 15, hjust = 0.5), plot.subtitle = element_text(size = 11,
            margin = margin(b = 10), hjust = 0.5))
}

# Create the two plots with unique subtitles
p1 <- plot_missing(mar_data, title = "MAR — Missing at Random", subtitle = "Missingness depends on an\nobserved variable: Age",
    fill_color = "#1f77b4", xlab = "Age")

p2 <- plot_missing(mnar_data, title = "MNAR — Missing Not at Random", subtitle = "Missingness depends on the\nunobserved value: Income",
    fill_color = "#d62728", xlab = "Age")

# Combine side-by-side with a spacer in between
p1 + plot_spacer() + p2 + plot_layout(widths = c(1, 0.2, 1))


Visual Intuition

MCAR looks like overlapping groups with only small differences—any variation is just noise, not systematic. MAR shows partial separation, where the missingness pattern aligns with observed variables. MNAR appears as clear divergence, with groups pulling apart in ways that can’t be explained by observed data alone, signaling strong evidence against randomness.


Mechanism Definition Example Impact
MCAR Missingness is unrelated to both observed and unobserved data Random data loss due to a technical glitch, unrelated to any participant characteristics Analyses like listwise deletion or simple imputation yield unbiased point estimates, but standard errors are underestimated and statistical power is reduced.
MAR Missingness depends only on observed data, not on the missing values themselves Younger participants are less likely to report income Requires model-based methods such as Multiple Imputation (MI) or Full Information Maximum Likelihood (FIML) to avoid bias.
MNAR Missingness depends on the unobserved value itself or on other unobserved variables High earners omit income because they find it too sensitive Requires specialized models (e.g., selection models, pattern-mixture models). Since MNAR is untestable from data alone, sensitivity analyses are essential to gauge robustness.

1.3 Critical Assumptions

Modern approaches like MI and FIML typically assume:

Violations of MAR (i.e., MNAR) require more advanced techniques (Enders, 2010). When missingness exceeds 5%, assumptions must be justified, and sensitivity analyses are recommended.

1.3.1 Missingness Thresholds

% Missing Bias Potential Recommended Approach
< 5% Negligible Direct maximum likelihood
5–15% Moderate Multiple imputation
> 15% Substantial Sensitivity analyses

Adapted from (Schafer, 1997) and (Graham, 2012)


2 Strategies for Handling Missing Data

2.1 1. By Proportion of Missingness

  • Less than 5%
    • Typically low risk of bias (Graham, 2009)
    • Acceptable approaches:
      • Listwise deletion (if sample size permits)
      • Simple imputation (e.g., regression imputation)
  • 5–15%
    • Requires more robust treatment
    • Preferred approaches:
      • Multiple Imputation (MI; Rubin, 1987)
      • Full Information Maximum Likelihood (FIML; Enders, 2010)
    • Checks:
      • Identify mechanism (MCAR / MAR / MNAR)
      • Assess estimate stability with sensitivity analyses (Enders, 2010)
  • More than 15%
    • High potential for bias and loss of power
    • Advanced approaches:
      • Multivariate MI (Rubin, 1987)
      • Pattern-mixture or selection models (Little, 1993)
    • If considering listwise deletion:
      • Must establish MCAR (Little, 1993)
      • Provide clear justification
      • Supplement with sensitivity analysis (Graham, 2009)

2.2 2. By Variable Type (mice Methods)

The mice package provides flexible imputation methods tailored to variable type (Buuren, 2018):

Variable Type Method Description
Continuous "norm" Linear regression imputation
Binary "logreg" Logistic regression imputation
Ordinal "polr" Proportional odds model
Nominal "polyreg" Multinomial logistic regression

3 The Data

This demonstration uses a dataset from Enders (2010) Applied Missing Data Analysis to illustrate techniques for handling missing values. The synthetic data (X, Y, Z) contains intentional missingness (coded as NA) across three variables, mimicking real-world incomplete data scenarios. Below, Table 1 displays the raw data, followed by analyses exploring patterns and solutions for missing data.

# ------------------------------------------
# Data Preparation (From Enders, 2010)
# Applied Missing Data Analysis (p. XYZ)
# ------------------------------------------

# Create the data matrix (X, Y, Z with intentional missing values)
A <- data.matrix(
  cbind(
    X = c(78, 84, 84, 85, 87, 91, 92, 94, 94, 96, 99, 105, 105, 106, 108, 112, 113, 115, 118, 134, NA),
    Y = c(13, 9, 10, 10, NA, 3, 12, 3, 13, NA, 6, 12, 14, 10, NA, 10, 14, 14, 12, 11, NA),
    Z = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 7, 10, 11, 15, 10, 10, 12, 14, 16, 12, NA)
  )
)

# Save for future use (optional)
save(A, file = "A.Rdata")

# Display as a clean table
knitr::kable(
  A,
  format = "html",
  align = "r",
  col.names = c("Variable X", "Variable Y", "Variable Z"),  # Rename columns
  caption = "**Table 1.** Data Matrix from Enders (2010) with Missing Values (NA)",
  table.attr = 'style="width: 50%; margin-left: auto; margin-right: auto;"'  # Center-align table
) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover"),  # Styling
    full_width = FALSE
  )
Table 1. Data Matrix from Enders (2010) with Missing Values (NA)
Variable X Variable Y Variable Z
78 13 NA
84 9 NA
84 10 NA
85 10 NA
87 NA NA
91 3 NA
92 12 NA
94 3 NA
94 13 NA
96 NA NA
99 6 7
105 12 10
105 14 11
106 10 15
108 NA 10
112 10 10
113 14 12
115 14 14
118 12 16
134 11 12
NA NA NA

3.1 Data Cleaning Steps

3.1.1 Remove Trivial Cases (All Values Missing)

# Remove rows where all values are NA (non-informative cases)
A1 <- A[!apply(A, 1, function(x) all(is.na(x))), ]

# Display cleaned data with caption as footnote
knitr::kable(A1, format = "html", align = "r", table.attr = "style=\"width: 30%;\"",
    caption = "Trivial cases removed:")
Trivial cases removed:
X Y Z
78 13 NA
84 9 NA
84 10 NA
85 10 NA
87 NA NA
91 3 NA
92 12 NA
94 3 NA
94 13 NA
96 NA NA
99 6 7
105 12 10
105 14 11
106 10 15
108 NA 10
112 10 10
113 14 12
115 14 14
118 12 16
134 11 12

Note: Data after removing completely missing cases. These cases contribute nothing to the observed-data likelihood and would only slow EM convergence by increasing missing information fractions.


3.1.2 Listwise Deletion (Any Value Missing)

# Remove rows with any missing values (complete-case analysis)
A2 <- A[complete.cases(A), , drop = FALSE]

# Alternative implementation: A2 <- A[!apply(A, 1, function(x) any(is.na(x))),
# ]

# Display complete cases
knitr::kable(A2, format = "html", align = "r", table.attr = "style=\"width: 30%;\"",
    caption = "Data after listwise deletion.")
Data after listwise deletion.
X Y Z
99 6 7
105 12 10
105 14 11
106 10 15
112 10 10
113 14 12
115 14 14
118 12 16
134 11 12

Note: This traditional method excludes entire records if any value is missing, potentially reducing sample size and statistical power.


3.2 Missing Data Patterns

Missing data patterns refer to the structured ways in which observed and missing values occur in a dataset. Recognizing these patterns is important because they influence both the missingness mechanism and the choice of analytic strategy. The six core patterns are:

  • Univariate Pattern: Missing values occur only in a single variable.
    Example: Nonresponse on income in a demographic survey.

  • General Pattern: Missing values are scattered across variables, often with underlying relationships between observed data and missingness.
    Example: Sporadic missing lab values in a medical study.

  • Unit Nonresponse: Entire cases (e.g., survey respondents) are missing because of failed contact, refusal, or data loss.

  • Monotone Pattern: Common in longitudinal studies, where dropout results in participants missing all later measurements after a given point.

  • Planned Missingness: Deliberately introduced to reduce burden or assess reliability, as in split-questionnaire designs, instrument calibration studies, or adaptive testing.

  • Latent Variable Pattern: Specific to structural equation modeling, where latent constructs are unobserved by definition for all cases.

Historically, methods for handling missing data were tailored to each pattern. Today, modern approaches such as multiple imputation and full information maximum likelihood address all patterns in a unified framework without requiring pattern-specific solutions (Enders, 2010, p. 5).

# Calculate missing data statistics
missing_stats <- list(incomplete_cases = sum(!complete.cases(A)), total_missing = sum(is.na(A)),
    incomplete_rows = which(!complete.cases(A)))

# Display results Prepare summary as a named data frame
missing_summary <- data.frame(Metric = c("Number of incomplete cases:", "Total missing values:",
    "Rows with missing data:"), Value = c(missing_stats$incomplete_cases, missing_stats$total_missing,
    paste(missing_stats$incomplete_rows, collapse = ", ")))

# Display results in a clean HTML table
knitr::kable(missing_summary, format = "html", caption = "Missing Data Summary")
Missing Data Summary
Metric Value
Number of incomplete cases: 12
Total missing values: 16
Rows with missing data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 21

Missingness Pattern

Examining missingness patterns is a critical first step in multiple imputation because it reveals where and how data are missing. Different patterns may suggest different missingness mechanisms (MCAR, MAR, MNAR), highlight structural missingness (e.g., skip patterns in surveys), or indicate whether monotone methods could be applied. By mapping these patterns, researchers can select suitable imputation models, ensure compatibility between variables, and reduce the risk of bias introduced by overlooking systematic gaps.

This function identifies the distinct patterns of missing data in a matrix or data frame. Within each row, variables are coded as 1 when observed and NA when missing, and the function extracts the unique combinations of these codes.

#' Identify Unique Missingness Patterns in a Dataset
#' 
#' This function analyzes a matrix or data frame to extract all unique patterns of missing data.
#' Each variable is coded as 1 (observed) or NA (missing), and the function returns only the
#' distinct missingness patterns present in the data.
#'
#' @param data A matrix or data frame containing the data to analyze
#' @return A matrix where each row represents a unique missingness pattern, 
#'         with 1 indicating observed values and NA indicating missing values
#' @examples
#' data <- data.frame(
#'   X = c(1, 2, NA, 4),
#'   Y = c(NA, 2, 3, 4),
#'   Z = c(1, NA, 3, 4)
#' )
#' get_missingness_pattern(data)
#' @export
get_missingness_pattern <- function(data) {
    # Input validation
    if (missing(data)) {
        stop("No input data provided")
    }
    if (!is.matrix(data) && !is.data.frame(data)) {
        stop("Input must be a matrix or data frame")
    }
    if (nrow(data) == 0) {
        stop("Data has no observations")
    }

    # Convert to binary matrix (1 = observed, NA = missing)
    pattern_matrix <- ifelse(!is.na(data), 1, NA)

    # Get unique patterns (preserving matrix structure)
    unique_patterns <- unique(pattern_matrix)

    # Add informative row names
    if (nrow(unique_patterns) > 0) {
        rownames(unique_patterns) <- paste0("Pattern_", seq_len(nrow(unique_patterns)))
    }

    return(unique_patterns)
}

# Save function to file
dump("get_missingness_pattern", file = "get_missingness_pattern.R")

Analyze missingness patterns in our dataset:

# =============================== Helpers ===============================
if (!exists("get_missingness_pattern")) {
    get_missingness_pattern <- function(A) {
        stopifnot(is.matrix(A) || is.data.frame(A))
        obs_mat <- !is.na(A)  # TRUE = observed, FALSE = missing
        # Map to 1 / NA for display (1 = observed, NA = missing)
        disp <- matrix(NA, nrow = nrow(A), ncol = ncol(A), dimnames = list(NULL,
            colnames(A)))
        disp[obs_mat] <- 1

        # Collapse per-row pattern using a compact code (e.g., '110NA1')
        pat_code <- apply(disp, 1, function(x) paste(ifelse(is.na(x), "NA", "1"),
            collapse = ""))
        uniq_codes <- unique(pat_code)
        # Build unique pattern matrix (rows = patterns)
        pat_df <- do.call(rbind, lapply(uniq_codes, function(code) {
            s <- strsplit(code, "")[[1]]
            # Convert back to 1 / NA
            out <- ifelse(s == "NA", NA, 1)
            names(out) <- colnames(A)
            out
        }))
        rownames(pat_df) <- paste0("Pattern_", seq_len(nrow(pat_df)))
        as.data.frame(pat_df, check.names = FALSE)
    }
}

is_monotone_missing <- function(A) {
    # Check if (after reordering columns) missingness is monotone: once a
    # variable is missing in a row, all subsequent variables are also missing
    M <- is.na(A) * 1L  # 1 = missing, 0 = observed
    # Order columns by increasing proportion missing
    miss_rate <- colMeans(is.na(A))
    M2 <- M[, order(miss_rate, decreasing = FALSE), drop = FALSE]
    # For each row, the vector of 0s then 1s must be non-decreasing
    all(apply(M2, 1, function(r) all(diff(r) >= 0)))
}

# =============================== Main Analysis ===============================
if (exists("A")) {
    A <- as.data.frame(A)
    n <- nrow(A)
    p <- ncol(A)

    # Patterns (1 = observed, NA = missing)
    missing_patterns <- get_missingness_pattern(A)

    # Pattern frequencies (compact, readable O/M string)
    pat_readable <- apply(!is.na(A), 1, function(x) paste(ifelse(x, "O", "M"), collapse = ""))
    pat_tab <- sort(table(pat_readable), decreasing = TRUE)
    freq_df <- data.frame(PatternCode = names(pat_tab), Cases = as.integer(pat_tab),
        Percent = round(as.integer(pat_tab)/n * 100, 1), row.names = NULL)
    freq_df$PatternID <- paste0("Pattern_", match(freq_df$PatternCode, unique(pat_readable)))
    freq_df <- freq_df[, c("PatternID", "PatternCode", "Cases", "Percent")]

    # Variable- and row-wise summaries
    var_missing <- data.frame(Variable = colnames(A), Missing = colSums(is.na(A)),
        Percent = round(colMeans(is.na(A)) * 100, 1), AllMissing = colSums(is.na(A)) ==
            n)
    row_missing <- data.frame(RowsWithAnyMissing = sum(rowSums(is.na(A)) > 0), CompleteCases = sum(rowSums(is.na(A)) ==
        0), RowsAllMissing = sum(rowSums(is.na(A)) == p))

    # Monotone check (useful for MI with monotone methods)
    has_monotone <- is_monotone_missing(A)

    # Simple diagnostics
    most_common <- freq_df$PatternID[1]
    complete_case_pat <- paste(rep("O", p), collapse = "")
    has_complete_pat <- any(freq_df$PatternCode == complete_case_pat)

    # =============================== Reporting ===============================
    cat("=== Missing Data Analysis ===\n\n")
    cat("Dataset Dimensions: ", n, " rows × ", p, " columns\n", sep = "")
    cat("Total Missing Values: ", sum(is.na(A)), " (", round(mean(is.na(A)) * 100,
        1), "% of all cells)\n\n", sep = "")

    cat("--- Unique Missingness Patterns (", nrow(missing_patterns), " found) ---\n",
        sep = "")
    print(missing_patterns)

    cat("\n--- Pattern Frequencies (sorted) ---\n")
    print(freq_df, row.names = FALSE)

    cat("\n--- Variable-wise Missingness ---\n")
    print(var_missing[order(-var_missing$Percent), ], row.names = FALSE)

    cat("\n--- Row-wise Summary ---\n")
    print(row_missing, row.names = FALSE)

    cat("\n--- Interpretation Aids ---\n")
    cat("* O = Observed, M = Missing (PatternCode columns correspond to: ", paste(colnames(A),
        collapse = ", "), ")\n", sep = "")
    cat("* Most common pattern: ", most_common, "\n", sep = "")
    cat("* Complete-case pattern present: ", ifelse(has_complete_pat, "Yes", "No"),
        "\n", sep = "")
    cat("* Monotone missingness (after column reordering): ", ifelse(has_monotone,
        "Yes (monotone) → monotone MI is feasible", "No (general) → use FIML or fully conditional MI"),
        "\n", sep = "")

    # Optional: quick cues for MI planning
    cat("\n--- MI Planning Cues ---\n")
    cat("- Variables with ≥ 30% missing may need predictive auxiliaries or sensitivity checks.\n")
    cat("- If a single variable dominates missingness → univariate pattern; consider simpler models.\n")
    cat("- Rows with all-missing suggest unit nonresponse; consider weighting/nonresponse adjustments.\n")

} else {
    warning("Dataset 'A' not found. Please load the data first.")
}
## === Missing Data Analysis ===
## 
## Dataset Dimensions: 21 rows × 3 columns
## Total Missing Values: 16 (25.4% of all cells)
## 
## --- Unique Missingness Patterns (5 found) ---
##            X  Y  Z
## Pattern_1  1  1 NA
## Pattern_2  1 NA NA
## Pattern_3  1  1  1
## Pattern_4  1 NA  1
## Pattern_5 NA NA NA
## 
## --- Pattern Frequencies (sorted) ---
##  PatternID PatternCode Cases Percent
##  Pattern_3         OOO     9    42.9
##  Pattern_1         OOM     8    38.1
##  Pattern_2         OMM     2     9.5
##  Pattern_5         MMM     1     4.8
##  Pattern_4         OMO     1     4.8
## 
## --- Variable-wise Missingness ---
##  Variable Missing Percent AllMissing
##         Z      11    52.4      FALSE
##         Y       4    19.0      FALSE
##         X       1     4.8      FALSE
## 
## --- Row-wise Summary ---
##  RowsWithAnyMissing CompleteCases RowsAllMissing
##                  12             9              1
## 
## --- Interpretation Aids ---
## * O = Observed, M = Missing (PatternCode columns correspond to: X, Y, Z)
## * Most common pattern: Pattern_3
## * Complete-case pattern present: Yes
## * Monotone missingness (after column reordering): No (general) → use FIML or fully conditional MI
## 
## --- MI Planning Cues ---
## - Variables with ≥ 30% missing may need predictive auxiliaries or sensitivity checks.
## - If a single variable dominates missingness → univariate pattern; consider simpler models.
## - Rows with all-missing suggest unit nonresponse; consider weighting/nonresponse adjustments.

Extracting Rows by Missingness Pattern

Grouping cases by missingness patterns is an important step because many imputation algorithms operate conditional on the observed data within a specific pattern. Extracting rows by pattern helps to:

  • Diagnose structure: Detect whether the data follow a monotone pattern or a general (non-monotone) structure.
  • Apply targeted models: Fit different imputation models depending on the pattern of missingness.
  • Inspect systematic gaps: Identify whether specific variables or subgroups drive missingness.

This process supports both diagnostics (understanding how data are missing) and model selection (choosing suitable imputat


This function selects all cases from a data matrix or data frame that match a given missingness pattern.
A pattern is defined as a logical vector, where:

  • TRUE = Missing value
  • FALSE = Observed value

Each row of the dataset is compared against the specified pattern, and all rows with the same arrangement of missing and observed values are returned.

#' Extract Rows Matching a Specific Missingness Pattern
#'
#' Filters a dataset to return rows that exactly match a specified pattern of missing values.
#' Pattern encodings:
#'   - Logical vector: TRUE = missing, FALSE = observed, NA = wildcard (ignore column)
#'   - Numeric/character vector: NA = missing, anything else = observed
#'   - Single character string of '0'/'1': '1' = missing, '0' = observed
#'
#' @param data A matrix or data frame to filter.
#' @param pattern Missingness pattern (see encodings above).
#' @param return_indices Logical; if TRUE, return matching row indices instead of data.
#'
#' @return Either:
#'   - A vector of row indices (if `return_indices = TRUE`), or
#'   - A data frame of matching rows with a leading `'index #'` column showing row numbers.
#'   Returns NULL (with warning) if no matches are found.
#' @export
extract_by_pattern <- function(data, pattern, return_indices = FALSE) {
    if (!is.matrix(data) && !is.data.frame(data)) {
        stop("`data` must be a matrix or data frame.")
    }
    data <- as.data.frame(data)
    p <- ncol(data)

    # Align by names if provided
    if (!is.null(names(pattern))) {
        missing_in_data <- setdiff(names(pattern), colnames(data))
        if (length(missing_in_data)) {
            stop("Pattern names not found in data: ", paste(missing_in_data, collapse = ", "))
        }
        pattern <- pattern[colnames(data)]
    }

    # Pattern coercion helper
    to_missing_logical <- function(pat) {
        if (is.character(pat) && length(pat) == 1L) {
            if (nchar(pat) != p || !grepl("^[01]+$", pat)) {
                stop("String pattern must be a single '0'/'1' string of length ncol(data).")
            }
            out <- as.logical(as.integer(strsplit(pat, "")[[1]]))
            attr(out, "wildcard") <- rep(FALSE, p)
            return(out)
        }
        if (is.logical(pat)) {
            if (length(pat) != p)
                stop("Logical pattern length must equal ncol(data).")
            wildcard <- is.na(pat)
            out <- pat
            out[wildcard] <- FALSE
            attr(out, "wildcard") <- wildcard
            return(out)
        }
        if (length(pat) == p) {
            out <- rep(FALSE, p)
            out[is.na(pat)] <- TRUE
            attr(out, "wildcard") <- rep(FALSE, p)
            return(out)
        }
        stop("Invalid `pattern`. See documentation for accepted encodings.")
    }

    miss_pat <- to_missing_logical(pattern)
    if (length(miss_pat) != p) {
        stop("Pattern length (", length(miss_pat), ") must match number of columns (",
            p, ").")
    }

    M <- is.na(data)  # missingness matrix
    wildcard <- attr(miss_pat, "wildcard")
    if (is.null(wildcard))
        wildcard <- rep(FALSE, p)

    if (all(wildcard)) {
        match_idx <- seq_len(nrow(data))
    } else {
        keep_cols <- which(!wildcard)
        diff_mat <- xor(M[, keep_cols, drop = FALSE], matrix(miss_pat[keep_cols],
            nrow = nrow(M), ncol = length(keep_cols), byrow = TRUE))
        match_idx <- which(rowSums(diff_mat) == 0L)
    }

    if (!length(match_idx)) {
        warning("No rows matched the specified pattern.")
        return(NULL)
    }

    if (return_indices) {
        match_idx
    } else {
        out <- data[match_idx, , drop = FALSE]
        out <- cbind(`index #` = match_idx, out)
        rownames(out) <- NULL  # <-- drop row names so kable won't show them
        out
    }

}

# Save function (optional)
dump("extract_by_pattern", file = "extract_by_pattern.R")

This code iterates through all unique missingness patterns in a dataset A, and for each pattern:

  • Prints the pattern itself (as a row of missing_patterns).
  • Extracts all rows from A that match that missingness pattern using our extract_by_pattern() function.
# Display data partitioned by missingness pattern

if (exists("missing_patterns") && exists("A")) {
    # helper to show pattern as O/M for readability
    show_code <- function(v) paste(ifelse(is.na(v), "M", "O"), collapse = "")

    for (i in seq_len(nrow(missing_patterns))) {
        cat("\n--- Pattern", i, "---\n")
        pattern_i <- missing_patterns[i, , drop = TRUE]
        cat("Columns:", paste(colnames(missing_patterns), collapse = ", "), "\n")
        cat("Code   :", show_code(pattern_i), " (O=observed, M=missing)\n")

        extracted_data <- extract_by_pattern(A, pattern_i)

        if (is.null(extracted_data) || nrow(extracted_data) == 0) {
            cat("No cases match this pattern\n\n")
        } else {
            print(knitr::kable(extracted_data, align = "c", caption = paste("Cases matching pattern",
                i), row.names = FALSE)  # <-- don't print row names
)

            cat("\n")
        }
    }
} else {
    warning("Required objects not available for pattern extraction")
}
## 
## --- Pattern 1 ---
## Columns: X, Y, Z 
## Code   : OOM  (O=observed, M=missing)
## 
## 
## Table: Cases matching pattern 1
## 
## | index # | X  | Y  | Z  |
## |:-------:|:--:|:--:|:--:|
## |    1    | 78 | 13 | NA |
## |    2    | 84 | 9  | NA |
## |    3    | 84 | 10 | NA |
## |    4    | 85 | 10 | NA |
## |    6    | 91 | 3  | NA |
## |    7    | 92 | 12 | NA |
## |    8    | 94 | 3  | NA |
## |    9    | 94 | 13 | NA |
## 
## 
## --- Pattern 2 ---
## Columns: X, Y, Z 
## Code   : OMM  (O=observed, M=missing)
## 
## 
## Table: Cases matching pattern 2
## 
## | index # | X  | Y  | Z  |
## |:-------:|:--:|:--:|:--:|
## |    5    | 87 | NA | NA |
## |   10    | 96 | NA | NA |
## 
## 
## --- Pattern 3 ---
## Columns: X, Y, Z 
## Code   : OOO  (O=observed, M=missing)
## 
## 
## Table: Cases matching pattern 3
## 
## | index # |  X  | Y  | Z  |
## |:-------:|:---:|:--:|:--:|
## |   11    | 99  | 6  | 7  |
## |   12    | 105 | 12 | 10 |
## |   13    | 105 | 14 | 11 |
## |   14    | 106 | 10 | 15 |
## |   16    | 112 | 10 | 10 |
## |   17    | 113 | 14 | 12 |
## |   18    | 115 | 14 | 14 |
## |   19    | 118 | 12 | 16 |
## |   20    | 134 | 11 | 12 |
## 
## 
## --- Pattern 4 ---
## Columns: X, Y, Z 
## Code   : OMO  (O=observed, M=missing)
## 
## 
## Table: Cases matching pattern 4
## 
## | index # |  X  | Y  | Z  |
## |:-------:|:---:|:--:|:--:|
## |   15    | 108 | NA | 10 |
## 
## 
## --- Pattern 5 ---
## Columns: X, Y, Z 
## Code   : MMM  (O=observed, M=missing)
## 
## 
## Table: Cases matching pattern 5
## 
## | index # | X  | Y  | Z  |
## |:-------:|:--:|:--:|:--:|
## |   21    | NA | NA | NA |

3.3 Little’s MCAR Test

Many statistical analyses require the missing completely at random (MCAR) assumption, where missingness is independent of both observed and unobserved data (Little, 1988). While maximum likelihood estimation (MLE) only requires the weaker missing at random (MAR) assumption under multivariate normality (Orchard, T. & Woodbury, M. A., 1972), verifying MCAR remains crucial for complete-case analyses and certain modeling approaches.

Test Statistic

Little’s test evaluates MCAR by comparing subgroup means across missing data patterns. The test statistic is:

\[ d^2 = \sum_{j=1}^J n_j \left( \hat{\mu}_j - \hat{\mu}_j^{(ML)} \right)^T \hat{\Sigma}_j^{-1} \left( \hat{\mu}_j - \hat{\mu}_j^{(ML)} \right) \] where:

  • \(J\) = number of distinct missing data patterns,

  • \(n_j\) = sample size for pattern \(j\),

  • \(\hat{\mu}_j\) = observed mean vector for pattern \(j\),

  • \(\hat{\mu}_j^{(ML)}\) = ML estimate of grand mean vector,

  • \(\hat{\Sigma}_j\) = ML estimate of covariance matrix for pattern \(j\)


Distributional Properties

Under the null hypothesis (MCAR holds):

  • \(d^2 \sim \chi^2(\text{df})\)
  • Degrees of freedom: \(\text{df} = \sum_{j=1}^J k_j - k\)
    • \(k_j\) = number of observed variables in pattern \(j\)
    • \(k\) = total number of variables

Interpretation

  • Significant result (\(p < \alpha\)): Evidence against MCAR
  • Non-significant result: Fails to reject MCAR (but doesn’t prove it)

Practical Considerations

  • Power: Requires adequate sample size in each missingness pattern
  • Normality: Assumes multivariate normal data
  • Limitations:
    • Cannot distinguish MAR from MNAR
    • May lack power with many variables or small samples

Example Application: When testing MCAR in a longitudinal study with dropout, significant results would suggest dropout relates to measured variables, violating MCAR but potentially satisfying MAR.

R Implementation of MCAR Test: This function implements Little’s MCAR (Missing Completely At Random) test on a numeric data matrix or data frame with missing values. Little’s MCAR Test checks whether missing data in your dataset can reasonably be assumed Missing Completely at Random (MCAR).

The test output includes both a numerical summary and an optional visual plot against the chi-square reference distribution. Computes the test statistic based on differences between observed pattern means and the overall means, adjusted by the covariance matrix.

# ======================================================================
# Little's MCAR Test (with EM ML, Robust MCD, and diagnostics)
# ----------------------------------------------------------------------
# Provides: - MCAR_Test(): main function - print_mcar_test(): pretty printer -
# .plot_mcar_test(): chi-square visual (with domain & zoom) - Internal helpers:
# input validation, pattern handling, safe inverse
# ======================================================================

#' Coerce Non-Numeric Columns with Warning
#'
#' Converts non-numeric columns to numeric via factor coding (levels → 1..K).
#' This preserves shape but changes meaning, so a clear warning is issued.
#'
#' @param df A data.frame or matrix with potential non-numeric columns.
#' @return A data.frame with all columns numeric.
#' @keywords internal
.coerce_numeric_warn <- function(df) {
    df <- as.data.frame(df)
    # Identify non-numeric columns using a fast vapply
    non_num <- !vapply(df, is.numeric, logical(1))
    if (any(non_num)) {
        bad <- names(df)[non_num]
        warning(sprintf("Coercing non-numeric columns to numeric via factor codes: %s",
            paste(bad, collapse = ", ")), call. = FALSE)
        # Factor → integer codes (1..K). This is *not* label encoding of any
        # ordinal meaning.
        df[bad] <- lapply(df[bad], function(col) as.numeric(as.factor(col)))
    }
    df
}

#' Missingness Pattern Indexer
#'
#' Constructs a compact index of row-wise missingness patterns:
#'  - a logical matrix of NAs
#'  - unique patterns and their row indices
#'  - frequency tables for diagnostics
#'
#' @param df A numeric data.frame or matrix containing missing values.
#' @param max_patterns Integer; number of most frequent patterns to keep for summary.
#'
#' @return A list with elements:
#' \describe{
#'   \item{na_mat}{Logical (n x p) matrix of `is.na(df)`.}
#'   \item{unique_patterns}{Logical (k x p) matrix of unique row patterns.}
#'   \item{row_index_by_pattern}{List of integer vectors: rows for each pattern.}
#'   \item{pattern_strings}{Character vector, one per row, listing NA positions (e.g., '1,3').}
#'   \item{pattern_counts}{Table of pattern string frequencies.}
#'   \item{top_patterns}{Head of `pattern_counts` (most frequent).}
#' }
#' @keywords internal
.missingness_index <- function(df, max_patterns = 10) {
    # Logical NA map (n x p)
    na_mat <- is.na(df)

    # String signature for each row's NA positions; '' = no missing
    patt_str <- apply(na_mat, 1, function(r) paste(which(r), collapse = ","))

    # Frequency of each unique row pattern (most common first)
    patt_counts <- sort(table(patt_str), decreasing = TRUE)
    top_patterns <- head(patt_counts, max_patterns)

    # Unique patterns as a logical matrix (k x p)
    unique_patterns <- unique(na_mat, MARGIN = 1)

    # For each unique pattern, which rows match it?
    row_index_by_pattern <- lapply(seq_len(nrow(unique_patterns)), function(i) {
        patt <- unique_patterns[i, ]
        # Match rows by exact logical equality across p variables
        which(apply(na_mat, 1, function(r) all(r == patt)))
    })

    list(na_mat = na_mat, unique_patterns = unique_patterns, row_index_by_pattern = row_index_by_pattern,
        pattern_strings = patt_str, pattern_counts = patt_counts, top_patterns = top_patterns)
}

#' Safe Matrix Inverse with Pseudoinverse Fallback
#'
#' Attempts a standard inverse via \code{solve()} and falls back to \code{MASS::ginv()}
#' if inversion fails (e.g., singular or near-singular covariance submatrices).
#'
#' @param S A square numeric matrix to invert.
#' @return The inverse (or pseudoinverse) of \code{S}.
#' @keywords internal
.safe_inverse <- function(S) {
    # Try a fast, standard inverse
    out <- tryCatch(solve(S), error = function(e) NULL)
    if (!is.null(out))
        return(out)

    # If that fails, require MASS and use Moore–Penrose generalized inverse
    if (!requireNamespace("MASS", quietly = TRUE)) {
        stop("Matrix not invertible. Install 'MASS' for ginv() fallback.", call. = FALSE)
    }
    MASS::ginv(S)
}

#' Grand Mean and Covariance under Different Estimation Schemes
#'
#' Computes the global (pooled) mean vector and covariance matrix using:
#' \itemize{
#'   \item \code{'em'}: ML estimates under MVN via \pkg{norm} (EM algorithm).
#'   \item \code{'pairwise'}: pairwise-complete covariance (pragmatic; not ML).
#'   \item \code{'robust'}: robust location/scatter via MCD from \pkg{robustbase}.
#' }
#'
#' The MCAR test uses the grand covariance submatrices for pattern-specific
#' quadratic forms (Little, 1988).
#'
#' @param df Numeric data.frame or matrix (may contain NAs).
#' @param method One of \code{c('em', 'pairwise', 'robust')}.
#'
#' @return A list with elements:
#' \describe{
#'   \item{mu}{Numeric vector of length p (grand means).}
#'   \item{Sigma}{(p x p) covariance matrix.}
#'   \item{label}{Character description of the method used.}
#' }
#' @references
#' Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. \emph{JASA}, 83(404), 1198–1202. \cr
#' Schafer, J. L. (1997). \emph{Analysis of Incomplete Multivariate Data}. Chapman & Hall. \cr
#' Rousseeuw, P. J., & Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. \emph{Technometrics}, 41(3), 212–223.
#' @keywords internal
.grand_params <- function(df, method = c("em", "pairwise", "robust")) {
    method <- match.arg(method)

    if (method == "em") {
        if (!requireNamespace("norm", quietly = TRUE)) {
            stop("EM requested but package 'norm' is not available.", call. = FALSE)
        }
        # EM under MVN using {norm}
        s <- norm::prelim.norm(as.matrix(df))  # set up sufficient statistics
        ml <- norm::em.norm(s, showits = FALSE)  # run EM to convergence
        par <- norm::getparam.norm(s, ml)  # extract ML parameters
        return(list(mu = as.numeric(par$mu), Sigma = par$sigma, label = "EM-based ML"))
    }

    if (method == "robust") {
        if (!requireNamespace("robustbase", quietly = TRUE)) {
            stop("Robust requested but package 'robustbase' is not available.", call. = FALSE)
        }
        # MCD needs complete input; we create a *rough* single-imputed copy
        # ONLY to estimate robust center/cov. This imputed data is not used
        # elsewhere.
        mu_imp <- colMeans(df, na.rm = TRUE)
        df_imp <- df
        for (j in seq_along(mu_imp)) {
            nas <- is.na(df_imp[, j])
            if (any(nas))
                df_imp[nas, j] <- mu_imp[j]
        }
        rc <- robustbase::covMcd(df_imp, alpha = 0.75)
        return(list(mu = as.numeric(rc$center), Sigma = rc$cov, label = "Robust (MCD)"))
    }

    # Pairwise-complete (not ML; can be indefinite in rare cases)
    mu <- colMeans(df, na.rm = TRUE)
    Sigma <- stats::cov(df, use = "pairwise.complete.obs")
    list(mu = as.numeric(mu), Sigma = Sigma, label = "Pairwise-complete (not ML)")
}

#' Pretty Printer for MCAR Test Results
#'
#' Formats the MCAR test output with diagnostics on missingness patterns
#' and a heuristic effect-size summary \eqn{\chi^2/\mathrm{df}}.
#'
#' @param x An object returned by [MCAR_Test()].
#' @param ... Unused.
#' @return Invisibly returns \code{x}; primarily used for its side effects.
#' @export
print_mcar_test <- function(x, ...) {
    effect_size <- ifelse(x$df > 0, x$statistic/x$df, NA_real_)
    base_decision <- ifelse(grepl("Reject", x$decision), "Reject MCAR", "Fail to reject MCAR")

    # Textual interpretation that couples significance with magnitude
    if (!is.na(effect_size)) {
        if (base_decision == "Reject MCAR" && effect_size >= 2) {
            decision_text <- "Strong evidence against MCAR (large effect)"
        } else if (base_decision == "Reject MCAR") {
            decision_text <- "Statistically significant deviation from MCAR"
        } else if (effect_size >= 2) {
            decision_text <- "Substantial effect despite non-significant p-value"
        } else if (effect_size >= 1.5) {
            decision_text <- "Moderate effect size - MCAR may not hold"
        } else {
            decision_text <- "No substantial evidence against MCAR"
        }
    } else {
        decision_text <- base_decision
    }

    cat("\nLittle's MCAR Test with Missing Pattern Analysis\n")
    cat(strrep("=", 69), "\n")
    cat(sprintf("%-25s %s\n", "Estimation method:", x$estimation_method))
    cat(sprintf("%-25s %s\n", "Small-sample correction:", x$correction))
    cat(strrep("-", 69), "\n")

    # Missingness diagnostics
    cat("Missing Data Diagnostics:\n")
    cat(sprintf(" - Total missing values: %d (%.1f%%)\n", x$diagnostics$total_missing,
        x$diagnostics$pct_missing))
    worst_var <- x$diagnostics$worst_variable
    if (is.na(worst_var)) {
        cat(" - Variable with most NAs: none (no missing values)\n")
    } else {
        cat(sprintf(" - Variable with most NAs: %s (%d missing)\n", worst_var, sum(is.na(x$data[,
            worst_var]))))
    }

    # Top patterns for quick human review
    cat("\nTop Missing Patterns:\n")
    tp <- x$diagnostics$top_patterns
    if (length(tp) == 0) {
        cat(" (no missingness patterns)\n")
    } else {
        for (i in seq_along(tp)) {
            patt <- names(tp)[i]
            vars <- if (patt == "")
                character(0) else colnames(x$data)[as.numeric(strsplit(patt, ",")[[1]])]
            patt_desc <- if (length(vars) > 0)
                paste(vars, collapse = ", ") else "Complete cases"
            cat(sprintf("%2d. %-40s (n = %d)\n", i, patt_desc, tp[i]))
        }
    }

    cat(strrep("-", 69), "\n")
    if (!is.na(effect_size)) {
        cat(sprintf("%-25s %.3f\n", "Effect size (χ²/df):", effect_size))
    } else {
        cat("- Effect size undefined (df = 0)\n")
    }
    cat(strrep("-", 69), "\n")
    cat(sprintf("%-25s %.3f (df = %d)\n", "Test statistic:", x$statistic, x$df))
    cat(sprintf("%-25s %.4f\n", "P-value:", x$p.value))
    cat(sprintf("%-25s %s\n", "Decision:", decision_text))
    cat(strrep("=", 69), "\n")

    invisible(x)
}

#' Chi-square Reference Plot for Little's MCAR Test
#'
#' Draws the chi-square density for the test's degrees of freedom,
#' shading accept/reject regions and marking the observed statistic and
#' critical value. Supports manual domain control and auto-centered zoom.
#'
#' @param obj An object returned by [MCAR_Test()].
#' @param xlim_min Numeric; lower x-axis bound. If > 0 (or \code{xlim_max} not \code{NULL}),
#'   the domain is treated as manual and strictly respected.
#' @param xlim_max Numeric or \code{NULL}; upper x-axis bound. When \code{NULL},
#'   the upper bound is auto-selected unless a zoom is requested.
#' @param zoom Logical; if \code{TRUE}, x-axis auto-centers on the observed \eqn{\chi^2}.
#' @param zoom_factor Positive numeric; half-width of zoom window in units of
#'   \eqn{\sqrt{2\,\mathrm{df}}} (the SD of \eqn{\chi^2(\mathrm{df})}).
#' @param force_show_refs Logical; if \code{TRUE} and no manual/zoom is set,
#'   expands the domain just enough to include the observed and critical values.
#'
#' @return Invisibly returns \code{NULL}; called for its plot side effect.
#' @keywords internal
#' @examples
#' \dontrun{
#'   res <- MCAR_Test(X, plot = FALSE)
#'   .plot_mcar_test(res, xlim_min = 0, xlim_max = 50)         # manual limits
#'   .plot_mcar_test(res, zoom = TRUE, zoom_factor = 1.5)      # zoomed view
#' }
.plot_mcar_test <- function(obj, xlim_min = 0, xlim_max = NULL, zoom = FALSE, zoom_factor = 2,
    force_show_refs = FALSE) {
    if (!is.list(obj) || is.null(obj$df) || obj$df <= 0 || !is.finite(obj$statistic))
        return(invisible(NULL))
    `%||%` <- function(x, y) if (is.null(x))
        y else x

    # Unpack core test quantities
    df <- obj$df
    alpha <- obj$alpha %||% 0.05
    chi2_obs <- as.numeric(obj$statistic)
    pval <- obj$p.value
    decision <- obj$decision %||% if (!is.na(pval) && pval < alpha)
        "Reject MCAR" else "Fail to reject MCAR"
    p_txt <- if (is.na(pval))
        "NA" else if (pval < 1e-04)
        "< 1e-4" else sprintf("= %.4f", pval)

    # N for subtitle; stored in res, otherwise compute from data
    N_val <- obj$N %||% obj$n %||% if (!is.null(obj$data))
        nrow(obj$data) else NA_real_
    critical <- qchisq(1 - alpha, df)

    # Reference full-range upper bound (used when no manual bounds/zoom)
    full_max <- max(critical, chi2_obs, qchisq(0.999, df)) * 1.05

    # --- Domain selection (NO implicit pull to 0) ---
    manual_bounds <- (!is.null(xlim_max)) || (xlim_min > 0)

    if (isTRUE(zoom)) {
        # Auto-centered window: half-width = zoom_factor * sd(chi-square)
        buf <- zoom_factor * sqrt(2 * df)
        # Clip lower bound at 0 to remain valid for chi-square support
        x_min <- max(0, chi2_obs - buf)
        x_max <- chi2_obs + buf
    } else if (manual_bounds) {
        # Strictly respect manual limits if provided
        x_min <- xlim_min
        x_max <- if (is.null(xlim_max))
            full_max else xlim_max
    } else {
        # Default full-range view
        x_min <- 0
        x_max <- full_max
    }

    # Optional expansion to ensure refs are visible (when not manual or zoom)
    if (isTRUE(force_show_refs) && !manual_bounds && !isTRUE(zoom)) {
        x_min <- min(x_min, chi2_obs, critical)
        x_max <- max(x_max, chi2_obs, critical)
    }

    # Protect against inverted / zero-width ranges
    if (x_max <= x_min)
        x_max <- x_min + 1

    # Grid for density curve
    x_vals <- seq(x_min, x_max, length.out = 800)
    y_vals <- dchisq(x_vals, df)

    # Colors
    col_accept <- adjustcolor("#4CAF50", 0.25)
    col_reject <- adjustcolor("#F44336", 0.25)
    col_density <- "#303F9F"
    col_stat <- "#1976D2"
    col_crit <- "#D32F2F"
    col_text <- "#333333"
    col_decision <- if (grepl("Reject", decision))
        "#8B0000" else "#006400"

    # Subtitle with current view and numeric bounds
    .fmt_num <- function(z, d = 3) formatC(z, digits = d, format = "f", big.mark = ",")
    subtitle <- sprintf("N = %s   |   α = %s   |   view: %s   |   x ∈ [%s, %s]",
        .fmt_num(N_val, 0), .fmt_num(alpha, 3), if (isTRUE(zoom))
            sprintf("zoomed (±%g·SD)", zoom_factor) else if (manual_bounds)
            "manual" else "full", .fmt_num(x_min, 0), .fmt_num(x_max, 0))

    # Plot canvas
    old_par <- par(no.readonly = TRUE)
    on.exit(par(old_par), add = TRUE)
    par(mar = c(5, 6, 6, 2.5))

    plot(NA, NA, xlim = c(x_min, x_max), ylim = c(0, max(y_vals) * 1.28), xlab = expression(chi^2),
        ylab = "Density", xaxt = "n", yaxt = "n", xaxs = "i", yaxs = "i", main = "Little's MCAR Test — Chi-square Reference")
    mtext(subtitle, side = 3, line = 1, cex = 0.95, col = col_stat)

    # Axes & grid (always use c(x_min, x_max), never c(0, x_max))
    x_at <- pretty(c(x_min, x_max), n = 6)
    axis(1, at = x_at, labels = .fmt_num(x_at, 0))
    y_at <- pretty(c(0, max(y_vals)), n = 5)
    axis(2, at = y_at, las = 1, labels = .fmt_num(y_at, 3))
    abline(v = x_at, col = adjustcolor("#000000", 0.06), lwd = 1)
    abline(h = y_at, col = adjustcolor("#000000", 0.06), lwd = 1)

    # Shade accept/reject regions (they may be empty if critical is off-canvas)
    xa <- x_vals[x_vals <= critical]
    if (length(xa))
        polygon(c(min(xa), xa, max(xa)), c(0, dchisq(xa, df), 0), col = col_accept,
            border = NA)
    xr <- x_vals[x_vals >= critical]
    if (length(xr))
        polygon(c(min(xr), xr, max(xr)), c(0, dchisq(xr, df), 0), col = col_reject,
            border = NA)

    # Draw density and reference lines (only if in current domain)
    lines(x_vals, y_vals, lwd = 2.2, col = col_density)
    if (critical >= x_min && critical <= x_max)
        abline(v = critical, col = col_crit, lty = 2, lwd = 2)
    if (chi2_obs >= x_min && chi2_obs <= x_max)
        abline(v = chi2_obs, col = col_stat, lwd = 2.4)

    # Annotate observed χ² with p and decision (only if visible)
    if (chi2_obs >= x_min && chi2_obs <= x_max) {
        mid_y <- max(y_vals) * 0.5
        x_off <- 0.02 * (x_max - x_min)
        text(chi2_obs - x_off, mid_y, labels = bquote(chi^2 * "(" * .(df) * ")" ==
            .(round(chi2_obs, 2)) ~ "," ~ p ~ .(p_txt)), col = col_decision, font = 2,
            cex = 1.2, srt = 90, adj = 0.5, xpd = NA)
        text(chi2_obs + x_off, mid_y, labels = paste("Decision:", decision), col = col_decision,
            font = 2, cex = 1.2, srt = 90, adj = 0.5, xpd = NA)
    }

    # Ticks at observed and critical (only if visible)
    col_tick <- if (grepl("Reject", decision))
        "#8B0000" else "#006400"
    if (chi2_obs >= x_min && chi2_obs <= x_max)
        axis(1, at = chi2_obs, labels = round(chi2_obs, 2), col.axis = col_tick,
            tck = -0.02, lwd = 1.2)
    if (critical >= x_min && critical <= x_max)
        axis(1, at = critical, labels = round(critical, 2), col.axis = col_crit,
            col.ticks = col_crit, tck = -0.02, lwd = 1.2)

    # Compact legend
    legend("topright", inset = c(0.01, 0.02), bty = "n", cex = 0.9, legend = c("Accept region",
        "Reject region", "Chi-square density", "Critical value", "Observed χ² (line)"),
        fill = c(col_accept, col_reject, NA, NA, NA), border = c(NA, NA, NA, NA,
            NA), lty = c(NA, NA, 1, 2, 1), col = c(NA, NA, col_density, col_crit,
            "#1976D2"), lwd = c(NA, NA, 2.2, 2, 2.4), pch = c(NA, NA, NA, NA, NA))

    invisible(NULL)
}

#' Little's MCAR Test with Pattern Diagnostics
#'
#' Implements Little's test of \emph{Missing Completely At Random} (MCAR)
#' for multivariate data with missing values. The statistic compares
#' pattern-specific means to the grand mean, weighted by the inverse of
#' the corresponding grand covariance submatrices.
#'
#' The function also reports diagnostics on the structure and frequency
#' of missingness patterns, and optionally draws a chi-square reference plot
#' with manual domain control or auto-centered zoom.
#'
#' @param x Numeric matrix or data.frame (may contain \code{NA}s).
#' @param alpha Significance level, default 0.05.
#' @param plot Logical; draw chi-square density with accept/reject shading and refs.
#' @param verbose Logical; print a formatted summary via [print_mcar_test()].
#' @param use_EM Logical; if \code{TRUE}, compute grand (mu, Sigma) via EM ML (requires \pkg{norm}).
#' @param robust Logical; if \code{TRUE}, compute grand (mu, Sigma) via robust MCD (requires \pkg{robustbase}).
#'   Overrides \code{use_EM} if both are \code{TRUE}.
#' @param bartlett Logical; apply a small-sample Bartlett-style correction to grand covariance
#'   submatrices when using the pairwise-complete estimator (not for EM/robust).
#' @param max_patterns Integer; number of most frequent missingness patterns to list in the summary.
#' @param xlim_min Lower x-axis bound for the plot (manual domain if > 0 or if \code{xlim_max} is set).
#' @param xlim_max Upper x-axis bound for the plot (manual domain if not \code{NULL}).
#' @param zoom Logical; auto-center x-axis around observed \eqn{\chi^2} using a buffer of \code{zoom_factor} SDs.
#' @param zoom_factor Positive numeric; half-window in units of \eqn{\sqrt{2\,\mathrm{df}}}.
#'
#' @return An object of class \code{'mcar_test'} with elements:
#' \describe{
#'   \item{statistic}{Observed \eqn{\chi^2} statistic.}
#'   \item{df}{Degrees of freedom.}
#'   \item{p.value}{Upper-tail p-value under \eqn{\chi^2(\mathrm{df})}.}
#'   \item{alpha}{Significance level used for the plot/reference line.}
#'   \item{decision}{Text decision string ('Reject MCAR' / 'Fail to reject MCAR').}
#'   \item{missing_patterns}{Number of unique row-wise missingness patterns.}
#'   \item{estimation_method}{Label describing how (mu, Sigma) were estimated.}
#'   \item{correction}{If Bartlett correction was applied ('Bartlett' or 'None').}
#'   \item{diagnostics}{List with missingness diagnostics (see source).}
#'   \item{data}{The processed numeric data used for testing.}
#'   \item{N}{Number of rows used in the test (after removing all-missing rows).}
#' }
#'
#' @details
#' The test statistic is
#' \deqn{d^2 = \sum_{j=1}^J n_j (\bar{x}_j - \hat{\mu})^\top \hat{\Sigma}_{j}^{-1} (\bar{x}_j - \hat{\mu}),}
#' where \eqn{\hat{\mu}} and \eqn{\hat{\Sigma}} are the grand mean and covariance
#' (estimated by EM, robust MCD, or pairwise), and \eqn{\hat{\Sigma}_j} denotes the submatrix
#' of \eqn{\hat{\Sigma}} corresponding to variables observed in pattern \eqn{j}.
#' Under MCAR, \eqn{d^2 \sim \chi^2(\mathrm{df})}, with \eqn{\mathrm{df} = \sum_j k_j - k}.
#'
#' @references
#' Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. \emph{JASA}, 83(404), 1198–1202. \cr
#' Enders, C. K. (2010). \emph{Applied Missing Data Analysis}. Guilford Press. \cr
#' Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. \emph{Psychological Methods}, 7(2), 147–177.
#'
#' @examples
#' \dontrun{
#' set.seed(123)
#' N <- 300
#' Sigma <- matrix(c(1, .5, .3, .5, 1, .4, .3, .4, 1), 3, 3)
#' X <- MASS::mvrnorm(N, mu = c(0, 1, -0.5), Sigma = Sigma)
#' colnames(X) <- c('X1','X2','X3')
#' mask <- matrix(runif(length(X)) < 0.2, nrow(X), ncol(X)); X[mask] <- NA
#'
#' # Pairwise (pragmatic)
#' out_pw <- MCAR_Test(X, use_EM = FALSE, robust = FALSE)
#'
#' # EM ML (requires {norm})
#' # out_em <- MCAR_Test(X, use_EM = TRUE)
#'
#' # Robust MCD (requires {robustbase})
#' # out_rb <- MCAR_Test(X, robust = TRUE)
#'
#' # Manual domain and zoom:
#' # MCAR_Test(X, xlim_min = 0, xlim_max = 40)
#' # MCAR_Test(X, zoom = TRUE, zoom_factor = 1.5)
#' }
#'
#' @export
MCAR_Test <- function(x, alpha = 0.05, plot = TRUE, verbose = TRUE, use_EM = FALSE,
    robust = FALSE, bartlett = TRUE, max_patterns = 10, xlim_min = 0, xlim_max = NULL,
    zoom = FALSE, zoom_factor = 2) {

    # ---- Validate / preprocess ----
    if (!is.matrix(x) && !is.data.frame(x))
        stop("Input must be a matrix or data frame.")
    x_df <- .coerce_numeric_warn(x)  # ensure numeric (warn on coercion)
    # Drop rows that are entirely missing to avoid degenerate pattern handling
    x_df <- x_df[rowSums(!is.na(x_df)) > 0, , drop = FALSE]
    if (nrow(x_df) == 0)
        stop("No non-missing rows found.")

    p <- ncol(x_df)

    # ---- Missingness bookkeeping (once) ----
    miss <- .missingness_index(x_df, max_patterns = max_patterns)
    unique_patterns <- miss$unique_patterns
    row_index_by_pattern <- miss$row_index_by_pattern
    n_patterns <- nrow(unique_patterns)

    # ---- Choose estimation method for (mu, Sigma) ---- Priority: robust > EM
    # > pairwise (robust overrides EM if both TRUE)
    if (robust) {
        gp <- .grand_params(x_df, method = "robust")
    } else if (use_EM) {
        gp <- .grand_params(x_df, method = "em")
    } else {
        gp <- .grand_params(x_df, method = "pairwise")
    }
    grand_means <- gp$mu
    grand_cov <- gp$Sigma
    estimation_label <- gp$label

    # ---- Compute Little's d^2 statistic ---- Pattern loop: for each unique NA
    # pattern, compute quadratic form on the observed coordinates using the
    # corresponding grand covariance submatrix.
    d2_total <- 0
    sum_k <- 0

    for (i in seq_len(n_patterns)) {
        patt <- unique_patterns[i, ]  # logical row: TRUE where variable is NA
        idx <- row_index_by_pattern[[i]]  # indices of rows with this pattern
        obs_vars <- which(!patt)  # observed variables for this pattern
        n_j <- length(idx)

        # Skip degenerate patterns (no observed variables) or empty membership
        if (length(obs_vars) == 0 || n_j == 0)
            next

        # Pattern means using only observed variables
        patt_means <- colMeans(x_df[idx, obs_vars, drop = FALSE], na.rm = TRUE)

        # Grand covariance submatrix for observed variables
        Sigma_sub <- grand_cov[obs_vars, obs_vars, drop = FALSE]

        # Optional Bartlett-style tweak for pairwise covariances (not
        # EM/robust)
        if (bartlett && estimation_label == "Pairwise-complete (not ML)" && n_j >
            1) {
            # Scale by (n_j - 1) / n_j to reduce small-sample bias in sample
            # covariance
            Sigma_sub <- ((n_j - 1)/n_j) * Sigma_sub
        }

        # Invert (or pseudoinvert) submatrix safely
        inv_Sigma_sub <- .safe_inverse(Sigma_sub)

        # Quadratic form for this pattern
        mean_diff <- patt_means - grand_means[obs_vars]
        quad <- as.numeric(t(mean_diff) %*% inv_Sigma_sub %*% mean_diff)

        # Accumulate weighted by pattern size
        d2_total <- d2_total + n_j * quad

        # Track degrees-of-freedom components: total observed vars across
        # patterns
        sum_k <- sum_k + length(obs_vars)
    }

    # ---- Degrees of freedom & p-value ----
    df <- sum_k - p
    chi2 <- as.numeric(d2_total)
    p_value <- if (df > 0)
        pchisq(chi2, df, lower.tail = FALSE) else NA_real_
    decision <- if (!is.na(p_value) && p_value < alpha)
        "Reject MCAR" else "Fail to reject MCAR"

    # ---- Diagnostics ----
    na_counts <- colSums(is.na(x_df))
    worst_variable <- if (sum(na_counts) == 0)
        NA_character_ else names(which.max(na_counts))

    diagnostics <- list(total_missing = sum(is.na(x_df)), pct_missing = mean(is.na(x_df)) *
        100, worst_variable = worst_variable, top_patterns = miss$top_patterns, pattern_matrix = unique_patterns)

    # ---- Result object ----
    res <- structure(list(statistic = chi2, df = df, p.value = p_value, alpha = alpha,
        decision = decision, missing_patterns = n_patterns, estimation_method = estimation_label,
        correction = ifelse(bartlett && estimation_label == "Pairwise-complete (not ML)",
            "Bartlett", "None"), diagnostics = diagnostics, data = x_df, N = nrow(x_df)),
        class = "mcar_test")

    # Plot if requested (pass through domain/zoom parameters)
    if (plot && df > 0 && is.finite(chi2)) {
        .plot_mcar_test(res, xlim_min = xlim_min, xlim_max = xlim_max, zoom = zoom,
            zoom_factor = zoom_factor)
    }

    # Pretty print if requested
    if (verbose)
        print(res)
    invisible(res)
}

#' Dump All MCAR-Related Functions to a File
#'
#' Convenience helper to write the current function definitions to
#' \code{'MCAR_Test.R'} for reuse. Useful in exploratory work.
#'
#' @note This call is optional and typically not exported in a package build.
#' @keywords internal
dump(c("MCAR_Test", "print_mcar_test", ".plot_mcar_test", ".coerce_numeric_warn",
    ".missingness_index", ".safe_inverse", ".grand_params"), file = "MCAR_Test.R")

Little’s MCAR Test Interpretation Guide

  • Step 1. Statistical significance (p-value):

    • Hypotheses for Little’s MCAR Test
      • Null hypothesis \((H_0)\): The data are Missing Completely at Random (MCAR).
        • The probability of missingness does not depend on either observed or unobserved data.
      • Alternative hypothesis \((H_1)\): The data are not MCAR.
        • The probability of missingness does depend on the data.
          This means the mechanism could be MAR (Missing At Random) or MNAR (Missing Not At Random), but the test cannot distinguish between them.
    • Decision Rule
      • If \(p > \alpha\): Fail to reject \(H_0\) → evidence is consistent with MCAR.
      • If \(p \leq \alpha\): Reject \(H_0\) → data are not MCAR (at least MAR or MNAR).
  • Step 2. Effect size check (χ² / df):


Decision Guidelines

  • Fail to Reject MCAR
    (p ≥ α and χ² / df < 2)
    • Interpretation: The data are consistent with MCAR. Missingness can reasonably be treated as random.
    • Practical meaning: Missingness looks random enough; methods assuming MCAR (listwise deletion, simple imputation) may be defensible.
    • Caveat: Always check descriptives (are certain items missing more often?) before assuming MCAR is safe.
  • ⚠️ Non-significant but Substantial Effect
    (p ≥ α but χ² / df ≥ 2)
    • Interpretation: Test is not significant, but χ²/df suggests practical deviation.
    • Why: Could be a power issue (small N), or the missingness mechanism produces subtle but real bias.
    • Practical meaning: Even if “not significant,” treat MCAR with skepticism; do sensitivity analyses (pattern-mixture, imputation with missingness predictors).
    • Example: N = 80, p = .07, χ²/df = 2.5 → small sample hides a non-MCAR signal.
  • ⚠️ Statistically Significant but Modest Effect
    (p < α and χ² / df < 2)
    • Interpretation: MCAR formally rejected, but the deviation is weak in magnitude.
    • Why: With large N, the test detects trivial departures.
    • Practical meaning: Missingness may not be MCAR, but MAR methods (FIML, multiple imputation with auxiliary predictors) will likely be robust.
    • Example: N = 2000, p < .01, χ²/df = 1.1 → statistically “bad,” but in practice close to MCAR.
  • Strong Evidence Against MCAR
    (p < α and χ² / df ≥ 2)
    • Interpretation: Both statistically significant and large effect.
    • Practical meaning: Missingness is clearly not random — MCAR-based methods (listwise deletion, mean imputation) are invalid.
    • Remedy: Must assume MAR (use imputation, FIML, EM) or consider MNAR models.
    • Example: N = 500, p < .001, χ²/df = 3.4 → strong case against MCAR.

Interpreting Beyond MCAR

Rejecting MCAR does not imply that the data are necessarily Missing Not At Random (MNAR). Instead, the next most common and practical assumption is Missing At Random (MAR) — that is, missingness may depend on observed variables but not directly on the unobserved values themselves. Under MAR, modern methods such as full information maximum likelihood (FIML) or multiple imputation typically yield unbiased estimates.

By contrast, data are Missing Not At Random (MNAR) if the probability of missingness depends on the unobserved value itself (e.g., individuals with higher depression scores are less likely to report their score). MNAR is difficult to verify from data alone and requires strong modeling assumptions, such as selection models, pattern-mixture models, or shared-parameter models.

For this reason, applied analyses generally assume MAR while conducting sensitivity analyses to examine how robust results are if the missingness mechanism were closer to MNAR.


Why a Large Effect Size Counts Against MCAR

missing-data patterns show systematic differences in their distributions. If the data are truly MCAR, missingness is completely random, meaning that each group should resemble a random subsample and their means should differ only by chance.

When the χ² / df ratio (effect size) is large; however, the group means diverge more than random variation would predict, which constitutes strong evidence against MCAR.

Interpreting the χ²/df Ratio

The χ²/df ratio (sometimes called the normed chi-square in SEM and occasionally borrowed as an effect-size proxy for Little’s test) provides a heuristic gauge of how strongly the data deviate from MCAR:

χ²/df Ratio Interpretation
≈ 1.0 Data fit MCAR closely (differences no bigger than random sampling variation).
1.0 – 2.0 Small to moderate deviation; often considered acceptable.
2.0 – 3.0 Noticeable misfit; potential concern.
> 3.0 Strong evidence that group means differ beyond chance → violation of MCAR.

Note. These thresholds aren’t absolute rules; they come mainly from SEM practice (Carmines & McIver, 1981; Kline, 2016; Wheaton et al., 1977), and should be treated as heuristic cutoffs given that the \(\chi^2\) test is highly sensitive to sample size (Marsh & Hocevar, 1985).


Critical Lines Legend:

Line Style Meaning
Critical value (default α = 0.05 threshold)
Observed test statistic, χ²

Color-Coded Regions:

Region Interpretation
Fail-to-reject Data may be MCAR if observed test statistic () falls here
Rejection Data likely not MCAR if observed test statistic () falls here

# Example usage
MCAR_Test(A, alpha = 0.05, plot = TRUE)

## $statistic
## [1] 17.189
## 
## $df
## [1] 5
## 
## $p.value
## [1] 0.004155
## 
## $alpha
## [1] 0.05
## 
## $decision
## [1] "Reject MCAR"
## 
## $missing_patterns
## [1] 4
## 
## $estimation_method
## [1] "Pairwise-complete (not ML)"
## 
## $correction
## [1] "Bartlett"
## 
## $diagnostics
## $diagnostics$total_missing
## [1] 13
## 
## $diagnostics$pct_missing
## [1] 21.667
## 
## $diagnostics$worst_variable
## [1] "Z"
## 
## $diagnostics$top_patterns
## patt_str
##       3 2,3   2 
##   9   8   2   1 
## 
## $diagnostics$pattern_matrix
##        X     Y     Z
## 1  FALSE FALSE  TRUE
## 5  FALSE  TRUE  TRUE
## 11 FALSE FALSE FALSE
## 15 FALSE  TRUE FALSE
## 
## 
## $data
##      X  Y  Z
## 1   78 13 NA
## 2   84  9 NA
## 3   84 10 NA
## 4   85 10 NA
## 5   87 NA NA
## 6   91  3 NA
## 7   92 12 NA
## 8   94  3 NA
## 9   94 13 NA
## 10  96 NA NA
## 11  99  6  7
## 12 105 12 10
## 13 105 14 11
## 14 106 10 15
## 15 108 NA 10
## 16 112 10 10
## 17 113 14 12
## 18 115 14 14
## 19 118 12 16
## 20 134 11 12
## 
## $N
## [1] 20
## 
## attr(,"class")
## [1] "mcar_test"

The results of Little’s test for Missing Completely at Random (MCAR) revealed statistically significant evidence against the MCAR assumption (χ²(5) = 17.189, p = .0042). This significant finding, coupled with a substantial effect size (χ²/df = 3.44), strongly indicates that the missing data pattern deviates meaningfully from randomness. The nature of the test cannot distinguish whether this non-random missingness depends on observed variables (consistent with Missing at Random, MAR) or unobserved variables (consistent with Missing Not at Random, MNAR), but clearly demonstrates the data violate the MCAR assumption.

The dataset contains considerable missingness, with 21.7% of values missing overall (13 out of 60 potential values). This exceeds the 15% threshold where missing data typically becomes problematic for analysis. Examination of the missingness patterns reveals that variable Z is the primary source of missing data, accounting for 77% of all missing values (10 out of 13). The most common missingness pattern involves only Z being absent (occurring in 8 cases), followed by both Y and Z missing (2 cases). Notably, only 45% of observations (9 out of 20) represent complete cases with no missing values.

These findings collectively suggest that standard complete-case analysis would be inappropriate and potentially biased. The concentration of missingness in variable Z warrants particular investigation into potential measurement or data collection issues specific to this variable. Researchers should employ appropriate missing data techniques such as multiple imputation or full information maximum likelihood estimation that can properly account for this non-random missingness pattern. The substantial effect size (3.44) reinforces that this is not merely a statistically significant but trivial finding, but rather indicates meaningful deviations from MCAR that could substantively impact analysis results if not properly addressed.


Second Example: (Little, R. J. A. & Rubin, D. B., 2002). Statistical Analysis with Missing Data.

# Create data matrix X from Wood's (1966) cement hardening data Columns
# represent: X1 - % tricalcium aluminate X2 - % tricalcium silicate X3 - %
# tetracalcium alumino ferrite X4 - % dicalcium silicate Y - Heat evolved in
# calories/gram of cement

B1 <- matrix(c(7, 26, 6, 60, 78.5, 1, 29, 15, 52, 74.3, 11, 56, 8, 20, 104.3, 11,
    31, 8, 47, 87.6, 7, 52, 6, 33, 95.9, 11, 55, 9, 22, 109.2, 3, 71, 17, 6, 102.7,
    1, 31, 22, 44, 72.5, 2, 54, 18, 22, 93.1, 21, 47, 4, 26, 115.9, 1, 40, 23, 34,
    83.8, 11, 66, 9, 12, 113.3, 10, 68, 8, 12, 109.4), nrow = 13, ncol = 5, byrow = TRUE)

# Set descriptive row and column names
rownames(B1) <- paste0("obs.", 1:13)
colnames(B1) <- c("X1", "X2", "X3", "X4", "Y")

# Display with kable and proper headers
kable(B1, digits = 1, align = "c", caption = "Cement Hardening Data (Woods, 1966)") %>%
    kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
    add_header_above(c(` ` = 1, `Chemical Composition (%)` = 4, `Heat Evolved` = 1)) %>%
    footnote(general = "X1: Tricalcium Aluminate\nX2: Tricalcium Silicate\nX3: Tetracalcium Alumino Ferrite\nX4: Dicalcium Silicate\nY: calories/gram",
        general_title = "Variable Definitions:") %>%
    column_spec(1, bold = TRUE) %>%
    column_spec(5, background = "#f7f7f7")
Cement Hardening Data (Woods, 1966)
Chemical Composition (%)
Heat Evolved
X1 X2 X3 X4 Y
obs.1 7 26 6 60 78.5
obs.2 1 29 15 52 74.3
obs.3 11 56 8 20 104.3
obs.4 11 31 8 47 87.6
obs.5 7 52 6 33 95.9
obs.6 11 55 9 22 109.2
obs.7 3 71 17 6 102.7
obs.8 1 31 22 44 72.5
obs.9 2 54 18 22 93.1
obs.10 21 47 4 26 115.9
obs.11 1 40 23 34 83.8
obs.12 11 66 9 12 113.3
obs.13 10 68 8 12 109.4
Variable Definitions:
X1: Tricalcium Aluminate
X2: Tricalcium Silicate
X3: Tetracalcium Alumino Ferrite
X4: Dicalcium Silicate
Y: calories/gram
# Save data to RData file
save(B1, file = "B1.RData")

# Optional: Also save as CSV for interoperability
write.csv(B1, file = "B1.csv", row.names = TRUE)

Add a row of NA’s:

B2 <- matrix(c(7, 26, 6, 60, 78.5, 1, 29, 15, 52, 74.3, 11, 56, 8, 20, 104.3, 11,
    31, 8, 47, 87.6, 7, 52, 6, 33, 95.9, 11, 55, 9, 22, 109.2, 3, 71, 17, NA, 102.7,
    1, 31, 22, NA, 72.5, 2, 54, 18, NA, 93.1, NA, NA, 4, NA, 115.9, NA, NA, 23, NA,
    83.8, NA, NA, 9, NA, 113.3, NA, NA, 8, NA, 109.4, NA, NA, NA, NA, NA), 14, 5,
    byrow = TRUE)

B2
##       [,1] [,2] [,3] [,4]  [,5]
##  [1,]    7   26    6   60  78.5
##  [2,]    1   29   15   52  74.3
##  [3,]   11   56    8   20 104.3
##  [4,]   11   31    8   47  87.6
##  [5,]    7   52    6   33  95.9
##  [6,]   11   55    9   22 109.2
##  [7,]    3   71   17   NA 102.7
##  [8,]    1   31   22   NA  72.5
##  [9,]    2   54   18   NA  93.1
## [10,]   NA   NA    4   NA 115.9
## [11,]   NA   NA   23   NA  83.8
## [12,]   NA   NA    9   NA 113.3
## [13,]   NA   NA    8   NA 109.4
## [14,]   NA   NA   NA   NA    NA
# save data
save("B2", file = "B2.Rdata")

Remove trivial cases that have no observed values

The trivial patern with all variables missing ommited from consideration because it contributes nothing to the observed-data likelihood and would only slow the convergence of EM by increasing the fractions of missing information

B3 <- data.matrix(B2[apply(B2, 1, function(x) {
    !all(is.na(x))
}), ])
B3
##       [,1] [,2] [,3] [,4]  [,5]
##  [1,]    7   26    6   60  78.5
##  [2,]    1   29   15   52  74.3
##  [3,]   11   56    8   20 104.3
##  [4,]   11   31    8   47  87.6
##  [5,]    7   52    6   33  95.9
##  [6,]   11   55    9   22 109.2
##  [7,]    3   71   17   NA 102.7
##  [8,]    1   31   22   NA  72.5
##  [9,]    2   54   18   NA  93.1
## [10,]   NA   NA    4   NA 115.9
## [11,]   NA   NA   23   NA  83.8
## [12,]   NA   NA    9   NA 113.3
## [13,]   NA   NA    8   NA 109.4

This correctly removes the completely empty row 14.


Listwise deletion

Omit cases with any missing values

B4 <- B2[apply(B2, 1, function(x) !any(is.na(x))), , drop = FALSE]
B4
##      [,1] [,2] [,3] [,4]  [,5]
## [1,]    7   26    6   60  78.5
## [2,]    1   29   15   52  74.3
## [3,]   11   56    8   20 104.3
## [4,]   11   31    8   47  87.6
## [5,]    7   52    6   33  95.9
## [6,]   11   55    9   22 109.2

This creates a complete-case dataset (only rows with no missing values), leaving just 6 complete cases.

MCAR_Test(B2[, 1:4], plot = TRUE)

## $statistic
## [1] 8.6721
## 
## $df
## [1] 4
## 
## $p.value
## [1] 0.06984
## 
## $alpha
## [1] 0.05
## 
## $decision
## [1] "Fail to reject MCAR"
## 
## $missing_patterns
## [1] 3
## 
## $estimation_method
## [1] "Pairwise-complete (not ML)"
## 
## $correction
## [1] "Bartlett"
## 
## $diagnostics
## $diagnostics$total_missing
## [1] 15
## 
## $diagnostics$pct_missing
## [1] 28.846
## 
## $diagnostics$worst_variable
## [1] "V4"
## 
## $diagnostics$top_patterns
## patt_str
##       1,2,4     4 
##     6     4     3 
## 
## $diagnostics$pattern_matrix
##       V1    V2    V3    V4
## 1  FALSE FALSE FALSE FALSE
## 7  FALSE FALSE FALSE  TRUE
## 10  TRUE  TRUE FALSE  TRUE
## 
## 
## $data
##    V1 V2 V3 V4
## 1   7 26  6 60
## 2   1 29 15 52
## 3  11 56  8 20
## 4  11 31  8 47
## 5   7 52  6 33
## 6  11 55  9 22
## 7   3 71 17 NA
## 8   1 31 22 NA
## 9   2 54 18 NA
## 10 NA NA  4 NA
## 11 NA NA 23 NA
## 12 NA NA  9 NA
## 13 NA NA  8 NA
## 
## $N
## [1] 13
## 
## attr(,"class")
## [1] "mcar_test"

The results (χ²(4) = 8.672, p = 0.0698) show borderline non-significance at α = 0.05, technically failing to reject the MCAR hypothesis. However, the substantial effect size (χ²/df = 2.168) suggests meaningful deviation from completely random missingness patterns. This apparent contradiction likely reflects limited statistical power rather than true MCAR compliance, particularly given the considerable proportion of missing data (28.8% overall, with variable V4 accounting for nearly half of all missing values).

The missingness pattern analysis reveals potentially systematic behavior, with 40% of missing cases showing concurrent absence of V1, V2 and V4 values. While the p-value doesn’t reach conventional significance thresholds, the combination of substantial effect size and these observable patterns strongly suggests the missingness mechanism should be treated as Missing at Random (MAR) rather than MCAR for analytical purposes.

In practice, these results warrant using multiple imputation or other appropriate missing data methods rather than complete-case analysis. The concentration of missingness in V4 particularly merits investigation into possible data collection or measurement issues. Researchers should document this borderline MCAR test outcome and justify their chosen approach for handling missing data in light of both the statistical results and the observable patterns of missingness.

3.4 Single Imputation Techniques

Single imputation methods replace missing values (NA) with one estimated value per missing observation, creating a single complete dataset. While simple to implement, these techniques introduce specific biases that researchers must acknowledge.

3.4.1 Mean Imputation: Properties and Limitations

Mean imputation replaces missing values (NA) with the arithmetic mean of observed values for each variable.

\[ x_{i,\text{imputed}} = \bar{x}_{\text{observed}} = \frac{1}{n_{\text{obs}}} \sum_{j=1}^{n_{\text{obs}}} x_j \] where:

  • \(x_{i,\text{imputed}}\): The imputed value for missing case \(i\),

  • \(\bar{x}_{\text{observed}}\): Mean of observed values,

  • \(n_{\text{obs}}\): Number of observed cases,

  • \(\displaystyle{\sum_{j=1}^{n_{\text{obs}}} x_j}\): Summation over all observed values


Effects on Data Structure

Aspect Effects of Mean Imputation Mathematical Representation
Preserved Properties • Maintains sample mean
• Retains original sample size
\(\displaystyle E[X]_{\text{imp}} = E[X]_{\text{orig}}\)
Variance • Artificially reduces variance
• Creates spikes at mean values
\(\displaystyle \text{Var}(X)_{\text{imp}} = \left(\frac{n_{\text{obs}}}{n_{\text{total}}}\right)\text{Var}(X)_{\text{orig}}\)
Covariance • Underestimates covariances
• Biases correlations toward zero
\(\displaystyle \text{Cov}(X,Y)_{\text{imp}} \approx \left(\frac{n_{\text{obs}}}{n_{\text{total}}}\right)\text{Cov}(X,Y)_{\text{orig}}\)
Distribution Shape • Flattens distribution
• Creates artificial peaks
\(\displaystyle \kappa_{\text{imp}} < \kappa_{\text{orig}}\)

Mean imputation may be appropriate for preliminary exploratory analyses when examining only means (Allison, 2002, p. 4), particularly when less than 3% of values are missing completely at random (MCAR; (Enders, 2010, Chapter 3). However, extensive research demonstrates it artificially reduces variances and distorts covariances (Little & Rubin, 2019, Theorem 3.2), with Schafer and Graham (2002) noting it “systematically distorts correlations” (p. 154). For analyses examining relationships between variables, modern methods like multiple imputation (Buuren, 2018) or maximum likelihood estimation (Enders, 2010) are strongly preferred.


This R code defines and implements a mean imputation function that replaces missing values (NAs) in a dataset with the mean of each variable’s available values.

Limitations:

  • Statistical Impact: Reduces variance and biases correlations downward
  • Best For: Quick fixes with <5% MCAR missingness
  • Not Recommended: Final analyses or datasets with MNAR/MAR missingness
#' Mean Imputation for Missing Data
#' 
#' Replaces missing values with variable means
#' 
#' @param Y A matrix or data frame with missing values (NA)
#' @return A completed matrix/data frame with same dimensions as input
#' @examples
#' data <- data.frame(x = c(1, 2, NA), y = c(NA, 5, 6))
#' mean_imputation(data)
mean_imputation <- function(Y) {
    # Convert to data frame if not already
    Y <- as.data.frame(Y)

    # Perform mean imputation
    imputed <- Y
    for (col in colnames(Y)) {
        imputed[[col]][is.na(Y[[col]])] <- mean(Y[[col]], na.rm = TRUE)
    }

    # Return same type as input
    if (is.matrix(Y)) {
        return(as.matrix(imputed))
    }
    return(imputed)
}

# Save function to file for reuse
dump("mean_imputation", file = "mean_imputation.R")

This demonstrates how to apply the mean_imputation() function to a dataset B2 and format the results.

# Example usage:

# Apply mean imputation
B2_imputed <- mean_imputation(B2)
round(B2_imputed, 2)
##    V1 V2    V3 V4     V5
## 1   7 26  6.00 60  78.50
## 2   1 29 15.00 52  74.30
## 3  11 56  8.00 20 104.30
## 4  11 31  8.00 47  87.60
## 5   7 52  6.00 33  95.90
## 6  11 55  9.00 22 109.20
## 7   3 71 17.00 39 102.70
## 8   1 31 22.00 39  72.50
## 9   2 54 18.00 39  93.10
## 10  6 45  4.00 39 115.90
## 11  6 45 23.00 39  83.80
## 12  6 45  9.00 39 113.30
## 13  6 45  8.00 39 109.40
## 14  6 45 11.77 39  95.42

3.4.2 Random Hot-deck Imputation from the Sample of Respondents Using Bootstrapping

Hot-deck imputation addresses missing data by replacing each missing value with an observed response from a demographically or statistically similar unit in the dataset (Andridge, R. R. & Little, R. J. A., 2010). While this method sees widespread application across survey research, its theoretical foundations remain less developed compared to alternative imputation approaches.

3.4.2.1 Approach 1: Random sampling from respondents

  • Replaces NAs with random observed values from same column
  • Preserves original value distribution (non-parametric)
  • Best for categorical or non-normal continuous data
#' Hot-Deck Imputation Methods
#' 
#' Implementation of two hot-deck imputation approaches:
#' 1. Random sampling from observed values
#' 2. Sampling from estimated normal distribution
#' 
#' @references Andridge, R. R., & Little, R. J. (2010). A review of hot deck imputation for survey non-response.
#' International Statistical Review, 78(1), 40-64.

# Approach 1: Random sampling from respondents

#' Random Hot-deck Imputation
#' 
#' Replaces missing values with random draws from observed cases
#' 
#' @param x A matrix or data frame with missing values
#' @return A completed matrix of same dimensions
#' @examples 
#' data <- matrix(c(1,2,NA,4,NA,6), ncol=2)
#' hot_deck_random(data)
hot_deck_random <- function(x) {
    # Input validation
    if (!is.matrix(x) && !is.data.frame(x)) {
        stop("Input must be matrix or data frame")
    }

    imputed <- as.matrix(x)

    for (i in 1:ncol(x)) {
        missing <- is.na(x[, i])
        x_obs <- na.omit(x[, i])
        imputed[missing, i] <- sample(x_obs, sum(missing), replace = TRUE)
    }

    return(imputed)
}

# Save function to file for reuse
dump("hot_deck_random", file = "hot_deck_random.R")

3.4.2.2 Approach 2: Predictive distribution sampling

  • Draws imputations from estimated normal distribution
  • Better for continuous variables meeting normality assumptions
  • Includes n_samples parameter for distribution precision
# Approach 2: Predictive distribution sampling

#' Normal Distribution Hot-deck Imputation
#' 
#' Replaces missing values with draws from estimated normal distribution
#' 
#' @param x A matrix or data frame with missing values
#' @param n_samples Number of samples for distribution estimation (default=1000)
#' @return A completed matrix of same dimensions
#' @examples 
#' data <- matrix(c(1,2,NA,4,NA,6), ncol=2)
#' hot_deck_normal(data)
hot_deck_normal <- function(x, n_samples = 1000) {
    # Input validation
    if (!is.matrix(x) && !is.data.frame(x)) {
        stop("Input must be matrix or data frame")
    }

    imputed <- as.matrix(x)

    for (i in 1:ncol(x)) {
        missing <- is.na(x[, i])
        if (sum(missing) > 0) {
            x_obs <- na.omit(x[, i])
            dist <- rnorm(n_samples, mean = mean(x_obs), sd = sd(x_obs))
            imputed[missing, i] <- sample(dist, sum(missing), replace = TRUE)
        }
    }

    return(imputed)
}

# Save function to file for reuse
dump("hot_deck_normal", file = "hot_deck_normal.R")
# Example usage

# Create sample data with missing values
set.seed(123)

# Apply both methods
imputed_random <- hot_deck_random(B2)
imputed_normal <- hot_deck_normal(B2)

# Compare results
list(original = B2, random_imputation = imputed_random, normal_imputation = round(imputed_normal,
    1))
## $original
##       [,1] [,2] [,3] [,4]  [,5]
##  [1,]    7   26    6   60  78.5
##  [2,]    1   29   15   52  74.3
##  [3,]   11   56    8   20 104.3
##  [4,]   11   31    8   47  87.6
##  [5,]    7   52    6   33  95.9
##  [6,]   11   55    9   22 109.2
##  [7,]    3   71   17   NA 102.7
##  [8,]    1   31   22   NA  72.5
##  [9,]    2   54   18   NA  93.1
## [10,]   NA   NA    4   NA 115.9
## [11,]   NA   NA   23   NA  83.8
## [12,]   NA   NA    9   NA 113.3
## [13,]   NA   NA    8   NA 109.4
## [14,]   NA   NA   NA   NA    NA
## 
## $random_imputation
##       [,1] [,2] [,3] [,4]  [,5]
##  [1,]    7   26    6   60  78.5
##  [2,]    1   29   15   52  74.3
##  [3,]   11   56    8   20 104.3
##  [4,]   11   31    8   47  87.6
##  [5,]    7   52    6   33  95.9
##  [6,]   11   55    9   22 109.2
##  [7,]    3   71   17   60 102.7
##  [8,]    1   31   22   47  72.5
##  [9,]    2   54   18   60  93.1
## [10,]   11   31    4   60 115.9
## [11,]   11   55   23   33  83.8
## [12,]    1   54    9   20 113.3
## [13,]   11   52    8   52 109.4
## [14,]    7   56   23   52  93.1
## 
## $normal_imputation
##       [,1] [,2] [,3] [,4]  [,5]
##  [1,]  7.0 26.0  6.0 60.0  78.5
##  [2,]  1.0 29.0 15.0 52.0  74.3
##  [3,] 11.0 56.0  8.0 20.0 104.3
##  [4,] 11.0 31.0  8.0 47.0  87.6
##  [5,]  7.0 52.0  6.0 33.0  95.9
##  [6,] 11.0 55.0  9.0 22.0 109.2
##  [7,]  3.0 71.0 17.0 29.7 102.7
##  [8,]  1.0 31.0 22.0 46.7  72.5
##  [9,]  2.0 54.0 18.0 49.4  93.1
## [10,]  7.2 57.5  4.0 28.5 115.9
## [11,]  4.8 46.7 23.0 63.4  83.8
## [12,] -3.1 36.2  9.0 47.2 113.3
## [13,]  1.2 50.7  8.0 24.9 109.4
## [14,]  5.9 15.3 15.2 22.1  88.5

Statistical Comparison (compare_stats):

  • Calculates pre/post-imputation descriptive statistics
  • Tracks mean/SD differences to assess distortion
  • Reports missing count per variable
#' Compare Descriptive Statistics Before/After Imputation
#'
#' Provides comprehensive comparison of distributional characteristics
#' between original (with NAs) and imputed datasets.
#'
#' @param original Original data with NAs (matrix/data.frame)
#' @param imputed Imputed dataset (matrix/data.frame)
#' @param digits Rounding digits for output (default=3)
#' @return A data frame with comparison statistics
#' @examples
#' data <- data.frame(x = c(1, 2, NA, 4), y = c(NA, 2, 3, 4))
#' imp_data <- hot_deck_random(data)
#' compare_stats(data, imp_data)
compare_stats <- function(original, imputed, digits = 3) {
    # Input validation
    if (!identical(dim(original), dim(imputed))) {
        stop("Original and imputed datasets must have identical dimensions")
    }

    # Convert to matrix and handle column names
    original <- as.matrix(original)
    imputed <- as.matrix(imputed)
    colnames(original) <- colnames(original) %||% paste0("V", seq_len(ncol(original)))

    # Define self-contained statistics calculator
    calc_stats <- function(x) {
        x <- x[!is.na(x)]
        n <- length(x)
        if (n == 0)
            return(rep(NA, 7))

        # Basic stats
        m <- mean(x)
        s <- sd(x)
        med <- median(x)
        mad <- median(abs(x - med))

        # Skewness and kurtosis calculations
        if (n > 2) {
            z <- (x - m)/s
            skew <- mean(z^3)
            kurt <- mean(z^4) - 3
        } else {
            skew <- kurt <- NA
        }

        c(mean = m, sd = s, median = med, mad = mad, skew = skew, kurtosis = kurt,
            min = min(x), max = max(x))
    }

    # Calculate statistics
    orig_stats <- t(apply(original, 2, calc_stats))
    imp_stats <- t(apply(imputed, 2, calc_stats))

    # Create comparison dataframe
    stats <- data.frame(Variable = colnames(original), N_Missing = colSums(is.na(original)),
        orig_stats, imp_stats, row.names = NULL)

    # Name columns systematically
    colnames(stats)[3:10] <- paste0("Orig_", colnames(orig_stats))
    colnames(stats)[11:18] <- paste0("Imp_", colnames(imp_stats))

    # Calculate absolute differences
    diffs <- imp_stats - orig_stats
    colnames(diffs) <- paste0("Diff_", colnames(diffs))
    stats <- cbind(stats, diffs)

    # Add percentage change columns
    pct_change <- 100 * (imp_stats - orig_stats)/abs(orig_stats)
    pct_change[!is.finite(pct_change)] <- NA
    colnames(pct_change) <- paste0("PctChg_", colnames(orig_stats))
    stats <- cbind(stats, pct_change)

    # Round numeric columns
    num_cols <- sapply(stats, is.numeric)
    stats[num_cols] <- round(stats[num_cols], digits)

    # Add significance indicators
    stats$Sig_Mean <- ifelse(abs(stats$Diff_mean) > stats$Orig_sd/2, "*", "")
    stats$Sig_SD <- ifelse(abs(stats$Diff_sd) > stats$Orig_sd/3, "*", "")

    return(stats)
}

# Helper for NULL coalescing
`%||%` <- function(a, b) if (!is.null(a)) a else b

# Save function
dump(c("compare_stats", "%||%"), file = "compare_stats.R")

Visualization Code:

library(knitr)
library(ggplot2)
library(patchwork)  # For combining plots

# --- Apply Both Imputation Methods ---
set.seed(123)  # Ensure reproducibility
imputed_random <- hot_deck_random(B2)
imputed_normal <- hot_deck_normal(B2)

# --- Enhanced Comparison Statistics ---
compare_imputations <- function(original, imp1, imp2, method_names = c("Method 1", "Method 2")) {
  stats1 <- compare_stats(original, imp1)
  stats2 <- compare_stats(original, imp2)
  
  # Combine results with method identifiers
  stats1$Method <- method_names[1]
  stats2$Method <- method_names[2]
  combined <- rbind(stats1, stats2)
  
  # Calculate absolute differences
  combined$Abs_Mean_Diff <- abs(combined$Diff_mean)
  combined$Abs_SD_Diff <- abs(combined$Diff_sd)
  
  return(combined)
}

# Get comprehensive comparison
imputation_stats <- compare_imputations(
  B2, 
  imputed_random, 
  imputed_normal,
  c("Random Hot-Deck", "Normal Distribution")
)

# --- Improved Table Display ---
create_comparison_table <- function(stats_df) {
  stats_df %>%
    select(Variable, Method, 
           Orig_mean, Imp_mean, Diff_mean, PctChg_mean,
           Orig_sd, Imp_sd, Diff_sd, N_Missing) %>%
    group_by(Variable) %>%
    mutate(across(where(is.numeric), round, 3)) %>%
    kable(
      caption = "Imputation Method Comparison",
      digits = 3,
      col.names = c("Variable", "Method", 
                    "Original Mean", "Imputed Mean", "Difference", "% Change",
                    "Original SD", "Imputed SD", "SD Difference", "N Missing"),
      align = c("l", "l", rep("r", 8))
    ) %>%
    kable_styling(bootstrap_options = c("striped", "hover")) %>%
    add_header_above(c(" " = 2, "Mean Comparison" = 4, "SD Comparison" = 3, " " = 1))
}

# Display formatted table
create_comparison_table(imputation_stats)
Imputation Method Comparison
Mean Comparison
SD Comparison
Variable Method Original Mean Imputed Mean Difference % Change Original SD Imputed SD SD Difference N Missing
V1 Random Hot-Deck 6.000 6.786 0.786 13.095 4.359 4.336 -0.023 5
V2 Random Hot-Deck 45.000 46.643 1.643 3.651 15.953 13.992 -1.961 5
V3 Random Hot-Deck 11.769 12.571 0.802 6.816 6.405 6.847 0.442 1
V4 Random Hot-Deck 39.000 44.143 5.143 13.187 16.492 15.471 -1.021 8
V5 Random Hot-Deck 95.423 95.257 -0.166 -0.174 15.044 14.467 -0.577 1
V1 Normal Distribution 6.000 5.002 -0.998 -16.634 4.359 4.356 -0.003 5
V2 Normal Distribution 45.000 43.678 -1.322 -2.938 15.953 15.602 -0.351 5
V3 Normal Distribution 11.769 12.018 0.249 2.112 6.405 6.224 -0.181 1
V4 Normal Distribution 39.000 38.998 -0.002 -0.006 16.492 14.856 -1.636 8
V5 Normal Distribution 95.423 94.926 -0.497 -0.521 15.044 14.573 -0.471 1
# --- Enhanced Visualizations ---

# 1. Boxplot Comparison
plot_boxplots <- function(data, original, imp1, imp2, 
                         names = c("Original", "Random", "Normal")) {
  plots <- list()
  
  for (i in seq_len(ncol(original))) {
    # Create data frame for plotting
    df <- data.frame(
      value = c(original[, i], imp1[, i], imp2[, i]),
      group = rep(names, each = nrow(original))
    )
    
    # Create ggplot object
    plots[[i]] <- ggplot(df, aes(x = group, y = value, fill = group)) +
      geom_boxplot(alpha = 0.7) +
      scale_fill_manual(values = c("#E69F00", "#56B4E9", "#009E73")) +
      labs(title = colnames(original)[i],
           x = "", 
           y = "Value") +
      theme_minimal() +
      theme(legend.position = "none")
  }
  
  # Return combined plots
  wrap_plots(plots, ncol = 2)
}

# 2. Distribution Comparison
plot_densities <- function(original, imp1, imp2) {
  plots <- list()
  
  for (i in seq_len(ncol(original))) {
    # Create data frame with complete cases only
    df <- data.frame(
      Original = original[, i],
      Random = imp1[, i],
      Normal = imp2[, i]
    ) %>%
      tidyr::pivot_longer(
        everything(),
        names_to = "Method",
        values_to = "Value",
        values_drop_na = TRUE  # This removes NAs before plotting
      )
    
    # Only plot if we have at least 2 values per method
    if (nrow(df) >= 2) {
      plots[[i]] <- ggplot(df, aes(x = Value, fill = Method)) +
        geom_density(alpha = 0.5, na.rm = TRUE) +  # Added na.rm for extra safety
        scale_fill_manual(values = c("#E69F00", "#56B4E9", "#009E73")) +
        labs(title = colnames(original)[i]) +
        theme_minimal()
    } else {
      plots[[i]] <- ggplot() +
        annotate("text", x = 0.5, y = 0.5, label = "Insufficient Data") +
        labs(title = colnames(original)[i]) +
        theme_void()
    }
  }
  
  # Return combined plots
  patchwork::wrap_plots(plots, ncol = 2)
}

# Generate and display plots
plot_boxplots(B2, B2, imputed_random, imputed_normal)

plot_densities(B2, imputed_random, imputed_normal)


Boxplot Components to Examine

  • Central Tendency (Median Line)
    • The horizontal line in each box shows the median
    • Compare positions across Original, Random, and Normal methods
      • Similar median lines → Central tendency preserved
      • Raised/lowered medians → Potential bias introduced
  • Spread (Box & Whiskers)
    • Box shows IQR (25th-75th percentile)
      • Matching box heights → Variability maintained
      • Shrunk boxes → Variance underestimated
    • Whiskers show range (excluding outliers)
      • Equal whisker lengths → Tails unaffected by imputation
    • Dots indicate outliers
      • Missing outliers → Extreme values smoothed out
  • Distribution Shape
    • Box symmetry indicates skewness
    • Whisker length shows tail behavior

Density Plot Interpretation

  • Overlapping curves → Good distributional match
  • Shifts in peak location → Mean bias introduced
  • Changes in curve width → Variance differences
  • Asymmetry changes → Skewness alterations

This function compares the distribution of original data against two imputed datasets using Kolmogorov-Smirnov (KS) tests. It helps evaluate how well each imputation method preserves the original data distribution. This code is particularly useful for evaluating imputation quality when you want to ensure the imputed values maintain the original data’s statistical properties.

# --- Statistical Test Comparison ---
compare_distributions <- function(original, imp1, imp2) {
    # Initialize empty results dataframe with correct structure
    results <- data.frame(Variable = character(), KS_Random = numeric(), KS_Normal = numeric(),
        stringsAsFactors = FALSE)

    # Check if input data has columns
    if (ncol(original) == 0) {
        warning("Input data has no columns")
        return(results)
    }

    for (i in seq_len(ncol(original))) {
        # Skip if column name is missing
        if (is.null(colnames(original))) {
            current_var <- paste0("Variable", i)
        } else {
            current_var <- colnames(original)[i]
        }

        # Get complete cases
        orig_complete <- na.omit(original[, i])
        imp1_complete <- imp1[, i]  # Imputed data shouldn't have NAs
        imp2_complete <- imp2[, i]

        # Skip if insufficient data
        if (length(orig_complete) < 2)
            next

        # Calculate KS tests with error handling
        ks_random <- tryCatch({
            ks.test(orig_complete, imp1_complete)$p.value
        }, error = function(e) NA_real_)

        ks_normal <- tryCatch({
            ks.test(orig_complete, imp2_complete)$p.value
        }, error = function(e) NA_real_)

        # Only add if we got valid results
        if (!any(is.na(c(ks_random, ks_normal)))) {
            results <- rbind(results, data.frame(Variable = current_var, KS_Random = ks_random,
                KS_Normal = ks_normal, stringsAsFactors = FALSE))
        }
    }

    if (nrow(results) == 0) {
        warning("No valid comparisons could be made - check your input data")
    }

    return(results)
}

Kolmogorov-Smirnov (KS) tests comparing the distributions of the original data versus two imputation methods (Random Hot-Deck and Normal Distribution). The KS test compares the entire shape of distributions (not just means/variances).

dist_test_results <- compare_distributions(B2, imputed_random, imputed_normal)
kable(dist_test_results, digits = 4, caption = "Kolmogorov-Smirnov Test p-values (vs Original)")
Kolmogorov-Smirnov Test p-values (vs Original)
Variable KS_Random KS_Normal
Variable1 0.9692 0.9897
Variable2 1.0000 0.9977
Variable3 1.0000 1.0000
Variable4 0.8700 0.9833
Variable5 1.0000 1.0000

Interpretation Guide:

  • High p-values (all >0.87) suggest:
    • Both imputation methods produced distributions that are statistically indistinguishable from the original data
    • The null hypothesis (that distributions are equal) cannot be rejected -Perfect 1.000 values for Variables 2,3,5 indicate:
    • The imputed distributions match the original data perfectly
    • Common when variables have:
      • No missing values
      • Simple distributions (e.g., binary/categorical)
      • Small sample sizes
  • Variable 4 shows slight variation:
    • Random hot-deck (0.87) vs Normal (0.98)
    • Both methods worked well, but normal imputation matched slightly better

For analysis: These results suggest both imputation methods successfully preserved the original data structure.

For method selection:

  • Normal imputation performed marginally better where differences exist
  • Random hot-deck is also acceptable (all p-values >0.05)
p-value Interpretation Recommended Action
≥ 0.10 Excellent match Either method acceptable
0.05-0.10 Good match Prefer normal imputation
< 0.05 Significant difference Investigate further

Note: All tests conducted at α = 0.05 level. Higher p-values indicate better distributional matches between original and imputed data.


3.4.3 Regression Imputation

Regression imputation is a statistical method for replacing missing values in a dataset by predicting them using regression models. It leverages relationships between variables to estimate missing data more accurately than simple methods like mean imputation.

Missing values in a target variable are predicted using other predictor variables (columns) in the dataset and a regression model (e.g., linear regression, logistic regression). Only observed (non-missing) data is used to train the model.

For a target variable \(Y\) with missing values and predictors \(X_1, X_2, \dots, X_p\), the model is:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon \]

  • The model is fitted on complete cases (rows without missing \(Y\)).
  • Missing \(Y\) values are then predicted using:

\[ \hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \]

where:

  • \(Y\) : Target variable with missing values,

  • \(X_1...X_p\): Predictor variables,

  • \(\beta_0\): Intercept term ,

  • \(\beta_1...\beta_p\): Regression coefficients,

  • \(\epsilon\): Error term,

  • \(\hat{Y}\): Predicted value for missing \(Y\)


Types of Regression Imputation

Method Use Case Model Type
Linear Regression Continuous missing values lm()
Logistic Regression Binary missing values glm(family = binomial)
Multinomial Regression Categorical missing values nnet::multinom()
Robust Regression Outlier-resistant imputation MASS::rlm()

Advantages of Regression Imputation:

  • More accurate than mean/median imputation (preserves relationships between variables)
  • Uses available data efficiently (only complete cases are used for modeling)
  • Flexible (works for continuous, binary, and categorical data)
  • Can incorporate uncertainty (e.g., by adding random residuals)

Disadvantages & Challenges

  • Assumes linearity (may fail if relationships are nonlinear)
  • Requires sufficient complete cases (fails if predictors are missing)
  • May underestimate variance (treats imputed values as certain)
  • Risk of overfitting (if too many predictors are used)

When to Use Regression Imputation

  • Missing at Random (MAR) data (missingness depends on observed data)
  • Moderate missingness (e.g., <30% missing in a column)
  • When predictors are strongly correlated with missing values

This R code defines and demonstrates a regression-based imputation function for handling missing values. The function imputes missing values in a specified column of a dataset using regression (linear or generalized linear models). This function is particularly useful when missing values are assumed to be related to other observed variables in the dataset.

Advantages:

  • Context-Aware: Uses relationships between variables for accurate imputation.
  • Flexible: Works with different regression models.

Limitations:

  • Requires at least one complete predictor per missing value.
  • May produce biased estimates if the regression model is misspecified.
#' Regression Imputation for Missing Values in a Specified Column
#'
#' @param data A data frame or matrix with at least two columns
#' @param target The column name or index to impute (default is last column)
#' @param method Regression method ('lm' or 'glm')
#' @param family For GLM (e.g., 'binomial')
#' @param return_imputed_only Logical - return only imputed values or full dataset?
#' @param original_data (Optional) Original data matrix or data frame for computing RMSE
#' @return Either the full imputed data frame (or vector if return_imputed_only = TRUE), invisibly prints RMSE if original_data is supplied
#' @examples
#' imputed_data <- regression_imputation(B2, target = 4, original_data = B1)
regression_imputation <- function(data, target = NULL, method = "lm", family = NULL,
    return_imputed_only = FALSE, original_data = NULL) {
    if (is.matrix(data))
        data <- as.data.frame(data)
    if (ncol(data) < 2)
        stop("Data must contain at least two columns for regression.")
    if (is.null(colnames(data)))
        colnames(data) <- paste0("V", seq_len(ncol(data)))
    if (!is.null(original_data) && is.matrix(original_data))
        original_data <- as.data.frame(original_data)
    if (!is.null(original_data) && is.null(colnames(original_data)))
        colnames(original_data) <- paste0("V", seq_len(ncol(original_data)))

    # Identify target column
    if (is.null(target)) {
        target_col <- colnames(data)[ncol(data)]
    } else if (is.numeric(target)) {
        if (target < 1 || target > ncol(data))
            stop("Target column index out of bounds.")
        target_col <- colnames(data)[target]
    } else if (is.character(target)) {
        if (!target %in% colnames(data))
            stop("Target column not found in data.")
        target_col <- target
    } else {
        stop("Invalid 'target' value.")
    }

    if (!any(is.na(data[[target_col]]))) {
        warning("No missing values found in the target column.")
        return(if (return_imputed_only) numeric(0) else data)
    }

    imputed_values <- rep(NA, sum(is.na(data[[target_col]])))
    missing_rows <- which(is.na(data[[target_col]]))
    valid_rows <- integer(0)

    for (i in seq_along(missing_rows)) {
        row_idx <- missing_rows[i]
        row_data <- data[row_idx, , drop = FALSE]

        predictors <- setdiff(colnames(data), target_col)
        available_predictors <- predictors[!is.na(row_data[predictors])]

        if (length(available_predictors) == 0)
            next

        formula <- as.formula(paste(target_col, "~", paste(available_predictors,
            collapse = " + ")))
        subset_complete <- complete.cases(data[, c(available_predictors, target_col)])

        if (sum(subset_complete) < 2)
            next

        model_data <- data[subset_complete, ]
        model <- switch(method, lm = lm(formula, data = model_data), glm = {
            if (is.null(family)) stop("Specify 'family' for GLM.")
            glm(formula, data = model_data, family = family)
        }, stop("Unsupported method: use 'lm' or 'glm'"))

        prediction <- predict(model, newdata = row_data[, available_predictors, drop = FALSE])
        imputed_values[i] <- prediction
        data[row_idx, target_col] <- prediction
        valid_rows <- c(valid_rows, row_idx)
    }

    if (return_imputed_only) {
        return(imputed_values)
    } else {
        return(data)
    }
}

# Save function to file for reuse
dump("regression_imputation", file = "regression_imputation.R")
B1  # for comparison
##        X1 X2 X3 X4     Y
## obs.1   7 26  6 60  78.5
## obs.2   1 29 15 52  74.3
## obs.3  11 56  8 20 104.3
## obs.4  11 31  8 47  87.6
## obs.5   7 52  6 33  95.9
## obs.6  11 55  9 22 109.2
## obs.7   3 71 17  6 102.7
## obs.8   1 31 22 44  72.5
## obs.9   2 54 18 22  93.1
## obs.10 21 47  4 26 115.9
## obs.11  1 40 23 34  83.8
## obs.12 11 66  9 12 113.3
## obs.13 10 68  8 12 109.4
imputed <- round(regression_imputation(B2, target = 4), digits = 2)

3.5 Root Mean Square Error (RMSE)

RMSE measures the deviation between imputed values and true values, expressed in the original units of the target variable (e.g., dollars, test scores). A lower RMSE indicates better imputation accuracy.

  • RMSE quantifies imputation error in absolute terms.
  • % of Range shows how big the error is compared to the data spread.
  • % of SD shows whether the imputation error is within typical variability.

3.6 Root Mean Square Error (RMSE)

RMSE measures the deviation between imputed values and true values, expressed in the original units of the target variable (e.g., dollars, test scores). A lower RMSE indicates better imputation accuracy.


Key Interpretations of RMSE

  1. Absolute Error (RMSE)
    • Directly quantifies average imputation error magnitude
    • Example: An RMSE of 5 for test scores means imputed values are on average 5 points off from true values
  2. Relative to Data Range (% of Range)
    • Assesses error size compared to data spread: \[ % of Range = (RMSE / (Max - Min)) × 100% \]
    • Guidelines:
      • < 10% → Excellent
      • 10 - 20% → Acceptable
      • > 20% → Concerning
  3. Relative to Standard Deviation (% of SD)
    • Compares error to natural variability: \[ % of SD = (RMSE / SD) × 100% \]
    • Guidelines:
      • < 10% of SD → Highly accurate
      • 10 - 20% of SD → Moderate error
      • > 20% of SD → Poor imputation

This code evaluates the accuracy of imputed values by comparing them to known true values, calculating RMSE (Root Mean Square Error) and related metrics.

# Step 1: Identify rows with missing target values
missing_rows <- which(is.na(B2[, 4]))

# Step 2: Filter rows where imputation succeeded
valid_rows <- missing_rows[!is.na(imputed[missing_rows, 4])]

# Step 3: Extract true and imputed values for valid rows only
true_values <- B1[valid_rows, 4]
imputed_values <- imputed[valid_rows, 4]

# Step 4: Compute RMSE
rmse <- sqrt(mean((true_values - imputed_values)^2))

# Step 5: Calculate range of the true target values (column 4)
range_val <- max(B1[, 4], na.rm = TRUE) - min(B1[, 4], na.rm = TRUE)

# Step 6: RMSE Comparisons
percent_error_range <- (rmse/range_val) * 100
sd_true <- sd(B1[valid_rows, 4])
percent_error_sd <- (rmse/sd_true) * 100

# Output results
cat(sprintf("RMSE: %.4f\n", rmse))
## RMSE: 7.5686
cat(sprintf("RMSE as percentage of target range (%.2f): %.2f%%\n", range_val, percent_error_range))
## RMSE as percentage of target range (54.00): 14.02%
cat(sprintf("RMSE as percentage of SD (%.4f): %.2f%%\n", sd_true, percent_error_sd))
## RMSE as percentage of SD (13.5365): 55.91%
# Report skipped rows
skipped_rows <- missing_rows[is.na(imputed[missing_rows, 4])]
cat("Imputation skipped for rows:", if (length(skipped_rows) > 0) paste(skipped_rows,
    collapse = " ") else "None", "\n")
## Imputation skipped for rows: 14

Imputation Error Summary:

  • RMSE: 7.57
    The average difference between imputed and true values is about 7.57 units.

  • Target Range: 54.00
    The RMSE corresponds to approximately 14.0% of the total range of the target variable, indicating a moderate imputation error relative to the full spread of the data.

  • Standard Deviation of True Values: 13.54
    The RMSE is about 56% of the standard deviation, meaning the imputation error is a bit more than half the natural variability in the true data.

  • Imputation skipped for rows: 14
    Some rows could not be imputed due to insufficient predictor information.

Interpretation

The imputation performs reasonably well, capturing much of the data’s underlying structure, but there remains room for improvement. Consider adding more predictors or applying advanced imputation methods to reduce error further.


This code compares two imputation methods (mean imputation vs. regression imputation) for missing values in column V4 of dataset B2, visualizing how each method affects the data distribution compared to the original complete data (B1).

library(ggplot2)

B2 <- as.data.frame(B2)

# Step 1: Mean (for comparison) and regression imputations
B2_mean <- B2
B2_mean$V4[is.na(B2_mean$V4)] <- mean(B2_mean$V4, na.rm = TRUE)
B2_reg <- regression_imputation(B2, target = "V4", method = "lm")

# Step 2: Combine data into long format for ggplot
plot_df <- rbind(data.frame(V4 = B1[, 4], method = "Original (without NAs)"), data.frame(V4 = B2_mean$V4,
    method = "Mean Imputed"), data.frame(V4 = B2_reg$V4, method = "Regression Imputed"))

# Step 3: Plot density comparison
ggplot(plot_df, aes(x = V4, color = method)) + geom_density(na.rm = TRUE, size = 1.2) +
    labs(title = "Comparison of Imputation Methods for Column V4", x = "V4 Values",
        y = "Density", color = "Method") + theme_minimal() + theme(legend.position = "bottom")

The plot compares the distributions juxtaposed to see how imputation changes data shape. Good imputation should make the blue line (regression imputed) overlap as much as possible with the original (green) while the mean-imputed (red) will likely show visible distortions.


3.6.1 Estimating Parameters by Bootstrapping

Implements hot deck imputation to estimate unbiased parameters from incomplete data using stochastic resampling.

Key Features

Preserves:

  • Original distribution shape
  • Natural variance-covariance structure
  • Variable relationships

Statistical Estimation

Output Formula Interpretation
Mean \(\bar{x} = \frac{1}{m}\sum_{j=1}^m \bar{x}_j\) Average of imputed means
SD \(\sqrt{\frac{1}{m}\sum_{j=1}^m s_j^2}\) Pooled standard deviation

Where \(m\) = number of iterations (default=1000)

Missing Data Handling

Mechanism Support:

  • MCAR (Missing Completely at Random)
  • MAR (Missing at Random)
  • MNAR (Requires caution)

Technical Notes

  • Each iteration performs independent sampling
  • Final estimates are averages across all imputations -️ Progress bar shows real-time execution status

Core algorithm pseudocode

for each iteration:
   for each variable:
      1. Identify missing cases
      2. Sample from observed values (with replacement)
      3. Impute missing values
   Calculate statistics on completed dataset
Average results across all iterations

This code defines and implements a Hot Deck Imputation Parameter Estimation function in R, which estimates means and standard deviations of variables with missing data through multiple imputations. It performs multiple sampling iterations but aggregates results into a single set of estimates (averaged means/SDs). For each variable with missing values (NA), it randomly samples from observed values to fill in missing data. Preserves the original data distribution by using actual observed values. This is particularly useful for estimating summary statistics when dealing with incomplete datasets while maintaining the original data distribution.

#' Hot Deck Imputation Parameter Estimation
#'
#' Estimates means and standard deviations through multiple hot deck imputations. This resampling approach handles missing data by randomly sampling from observed values, preserving the original distributional properties of the data.
#'
#' @param x A numeric matrix or data frame containing missing values (NA)
#' @param iterations Number of imputation replicates (default: 1000)
#' @return A matrix with two rows (mean, sd) and columns matching input
#'    variables, showing the averaged parameters across all imputations
#'
#' @examples
#' # Create dataset with missing values
#' set.seed(123)
#' dat <- data.frame(
#'   age = c(25, 30, NA, 40, NA, 50),
#'   income = c(50, NA, 75, NA, 100, 120)
#' )
#' 
#' # Estimate parameters with 200 imputations
#' hot_deck_parameters(dat, iterations = 200)
#'
#' @export
#' @importFrom stats sd
hot_deck_parameters <- function(x, iterations = 1000) {

    # --- Input Validation --- Ensure input is either matrix or data frame
    if (!is.matrix(x) && !is.data.frame(x)) {
        stop("Input must be a matrix or data frame")
    }

    # Check for completely missing columns
    if (any(apply(x, 2, function(col) all(is.na(col))))) {
        warning("Some variables contain only missing values - results may be unreliable")
    }

    # Convert to matrix for consistent handling
    x <- as.matrix(x)
    n_vars <- ncol(x)

    # --- Initialize Storage --- Matrices to store results from each iteration
    standard_deviations <- matrix(NA, nrow = iterations, ncol = n_vars)
    means <- matrix(NA, nrow = iterations, ncol = n_vars)

    # --- Imputation Process ---

    for (j in 1:iterations) {
        imputed <- x  # Create fresh copy for each iteration

        # --- Variable-wise Imputation ---
        for (i in 1:n_vars) {
            missing <- is.na(x[, i])  # Identify missing cases

            if (any(missing)) {
                x_obs <- x[!missing, i]  # Extract observed values
                n_missing <- sum(missing)

                # Hot deck imputation by random sampling
                imputed[missing, i] <- sample(x_obs, size = n_missing, replace = TRUE  # Allows resampling of same value
)
            }
        }

        # --- Calculate Statistics ---
        means[j, ] <- colMeans(imputed)  # Column means
        standard_deviations[j, ] <- apply(imputed, 2, sd)  # Column SDs
    }

    # --- Aggregate Results --- Average across all iterations
    parameters <- rbind(mean = colMeans(means), sd = colMeans(standard_deviations))

    # Preserve original column names
    colnames(parameters) <- colnames(x)

    return(parameters)
}

# Save function to file for reuse
dump("hot_deck_parameters", file = "hot_deck_parameters.R")
hot_deck_parameters(B2, 1000)
##          V1     V2      V3     V4     V5
## mean 5.9902 45.094 11.7692 39.076 95.474
## sd   4.2190 15.348  6.3625 15.248 14.950

3.6.2 Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) provides statistically efficient parameter estimation under Missing at Random (MAR) conditions, yielding unbiased estimates while maximizing information usage from incomplete datasets. The method demonstrates superior properties compared to traditional approaches: under MAR and Missing Completely at Random (MCAR) conditions, MLE produces unbiased estimates with minimum variance, as quantified by the Cramér-Rao lower bound, while under Missing Not at Random (MNAR) scenarios, the bias is typically constrained to parameters directly involved in the missingness mechanism. The theoretical foundation derives from the likelihood function

\[ L\left(θ|x\right) = \prod_{i=1}^n f\left(x_i|θ\right)^{1-o_i}\left[P(x_i \text{ missing})\right]^{o_i} \]

where \(o_i\) indicates missingness. For normal data \(X \sim N(μ,σ^2)\), the observed-data log-likelihood simplifies to

\[ \ell(μ,σ^2) = -\frac{n_o}{2}\log\left(2πσ^2\right) - \frac{1}{2σ^2}\sum_{i \in \text{obs}}\left(x_i-μ\right)^2 \] under MAR, with the EM algorithm providing iterative solutions through its E-step (computing \(\mathbb{E}[T(x)|x_{obs}]\)) and M-step (updating \(θ^{(t+1)} = \arg\max_θ Q(θ|θ^{(t)})\)).

3.6.2.1 Maximum Likelihood Estimation with Univariate Data

For a continuous random variable X following a normal distribution, the probability density function characterizes the likelihood of observing a particular value x given parameters \(\theta = (\mu,\: \sigma^2)\):

\[ f\left(x \mid \mu, \sigma^2 \right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]

When dealing with incomplete univariate data containing missing values, the observed-data likelihood function combines:

  • Observed Components: Product of density evaluations for observed cases
  • Missing Components: Integral over possible values for missing cases (under MAR)

The complete observed-data likelihood takes the form:

\[ L\left(\mu,\sigma^2 \mid X_{\text{obs}}\right) = \prod_{i=1}^{n_o} f\left(x_i \mid \mu,\sigma^2 \right) \times \prod_{j=1}^{n_m} \int_{-\infty}^{\infty} f\left(x_j \mid \mu,\sigma^2 \right) \, dx_j \] where:

  • \(x_i\) is a score value,

  • \(\mu\) is the population mean,

  • \(\sigma^2\) s the population variance, and

  • \(L_i\) is a likelihood value that describes the height of the normal curve at a particular score value.


The univariate_likelihood() function computes the probability density of observations under a Gaussian distribution, where the parameters μ (mean) and σ (standard deviation) are estimated directly from the input data. The function is essentially a reimplementation of R’s dnorm() with additional NA handling, useful when you need full control over the likelihood computation process.

#' Calculate Univariate Normal Likelihood
#'
#' Computes likelihood values for observations under a Gaussian distribution,
#' with automatic parameter estimation and robust missing data handling.
#'
#' @param x Numeric vector or matrix of observations (NAs allowed)
#' @return Matrix of likelihood values for non-missing cases
#' @details 
#' Implements the Gaussian probability density function:
#' \deqn{
#' p(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
#' }
#' where \eqn{\mu} and \eqn{\sigma} are the sample mean and standard deviation.
#' 
#' Key features:
#' \itemize{
#'   \item Automatic parameter estimation from available data
#'   \item Comprehensive NA handling (preserves partial cases)
#'   \item Vectorized for efficient computation
#'   \item Consistent output format (always returns a matrix)
#' }
#' @examples
#' # Basic usage with missing data
#' data <- c(1.2, 2.5, NA, 3.8, 4.1)
#' likelihood_values <- univariate_likelihood(data)
#' 
#' # Comparison with dnorm()
#' clean_data <- na.omit(data)
#' all.equal(
#'   univariate_likelihood(clean_data),
#'   matrix(dnorm(clean_data, mean(clean_data), sd(clean_data)), ncol = 1)
#' )
#' @export
univariate_likelihood <- function(x) {
    # Convert to matrix and remove completely NA rows
    x <- as.matrix(x)
    x <- x[apply(x, 1, function(row) !all(is.na(row))), , drop = FALSE]

    # Estimate distribution parameters
    mu <- mean(x, na.rm = TRUE)
    sigma <- sd(x, na.rm = TRUE)

    # Vectorized likelihood calculation using normal PDF
    L <- (1/sqrt(2 * pi * sigma^2)) * exp(-0.5 * (x[, 1] - mu)^2/sigma^2)

    return(matrix(L, ncol = 1))
}

# save to the working directory getwd(), ls()
dump("univariate_likelihood", file = "univariate_likelihood.R")

This code performs verification and comparison of the custom univariate_likelihood() function against R’s built-in dnorm() function.

# Verification example
if (FALSE) {
    test_data <- as.matrix(na.omit(A[, 1]))
    results <- data.frame(X = test_data, Custom_Likelihood = round(univariate_likelihood(test_data),
        5), DNorm = round(dnorm(test_data, mean(test_data), sd(test_data)), 5), Log_Likelihood = round(log(univariate_likelihood(test_data)),
        5))
    print(results)
}

# compare
X1 <- as.matrix(A[, 1][!is.na(A[, 1])])

data.frame(X = X1, L1 = round(univariate_likelihood(X1), 5), L2 = round(dnorm(X1,
    mean(X1), sd(X1)), 5), logL = round(log(univariate_likelihood(X1)), 5))
##      X      L1      L2    logL
## 1   78 0.00840 0.00840 -4.7796
## 2   84 0.01487 0.01487 -4.2084
## 3   84 0.01487 0.01487 -4.2084
## 4   85 0.01607 0.01607 -4.1307
## 5   87 0.01849 0.01849 -3.9904
## 6   91 0.02305 0.02305 -3.7700
## 7   92 0.02406 0.02406 -3.7274
## 8   94 0.02580 0.02580 -3.6572
## 9   94 0.02580 0.02580 -3.6572
## 10  96 0.02713 0.02713 -3.6071
## 11  99 0.02817 0.02817 -3.5696
## 12 105 0.02652 0.02652 -3.6297
## 13 105 0.02652 0.02652 -3.6297
## 14 106 0.02580 0.02580 -3.6572
## 15 108 0.02406 0.02406 -3.7274
## 16 112 0.01969 0.01969 -3.9278
## 17 113 0.01849 0.01849 -3.9904
## 18 115 0.01607 0.01607 -4.1307
## 19 118 0.01254 0.01254 -4.3788
## 20 134 0.00156 0.00156 -6.4631

3.6.2.2 Probability Density Function (pdf): Normal Distribution

This code defines a sophisticated R function likelihood_plot() that creates a professional visualization of the normal distribution’s likelihood function.

#' @title Plot Likelihood Function for Normal Distribution
#' @description Visualizes the likelihood function of a normal distribution with robust error handling.
#' @param y Numeric vector of observed data (NAs allowed)
#' @param var.name Optional character string for x-axis label (default: 'X')
#' @param FUN Optional custom likelihood function (default: normal density)
#' @param title Optional custom plot title (default: shows parameters)
#' @return Produces a plot (no return value)
#' @details
#' Creates a likelihood plot showing the normal density curve with key parameters.
#' Handles missing data automatically and includes proper error checking.
#' @examples
#' # Basic usage
#' likelihood_plot(rnorm(100))
#' 
#' # Customized plot
#' likelihood_plot(rnorm(100, mean=5, sd=2), 
#'                var.name='Test Scores',
#'                title='Exam Score Distribution')
#' @export
#' @importFrom graphics plot axis mtext segments title box par points
#' @importFrom stats dnorm var
likelihood_plot <- function(y, var.name = "X", FUN = NULL, title = NULL) {

    # Install shape package if not already installed
    if (!requireNamespace("shape", quietly = TRUE)) {
        install.packages("shape")
    }
    library(shape)

    # Error checking for input
    if (!is.numeric(y))
        stop("Input must be numeric")
    if (all(is.na(y)))
        stop("All values are NA")

    # Calculate distribution parameters (removing NAs)
    y_clean <- y[!is.na(y)]
    mu <- mean(y_clean)
    var <- var(y_clean)
    sd <- sqrt(var)

    # Set plot range (μ ± 3.5σ)
    x <- seq(from = mu - 3.5 * sd, to = mu + 3.5 * sd, length.out = 1000)

    # Default to normal likelihood if no function provided
    if (is.null(FUN)) {
        FUN <- function(x, mu, var) {
            # Fully vectorized calculation
            (1/sqrt(2 * pi * var)) * exp(-(x - mu)^2/(2 * var))
        }
    }

    # Set plot margins and orientation
    old_par <- par(no.readonly = TRUE)
    on.exit(par(old_par))
    par(mar = c(5, 4, 4, 2) + 0.1, mgp = c(3, 1, 0), las = 1)

    # Create base plot
    plot(x, FUN(x, mu, var), axes = FALSE, type = "l", lwd = 2, col = "darkblue",
        xlab = "", ylab = "", main = if (is.null(title)) {
            bquote(N(.(round(mu, 2)) ~ "," ~ .(round(sd, 2))))
        } else {
            title
        })

    box()

    # Calculate key points
    key_x <- c(mu, mu - 2 * sd, mu + 2 * sd)
    key_y <- FUN(key_x, mu, var)

    # Add axes
    axis(1, at = key_x, labels = format(key_x, digits = 3))
    mtext(var.name, side = 1, line = 2.5, cex = 1.2)

    axis(2, at = pretty(range(key_y)))
    mtext("Likelihood", side = 2, line = 3, las = 0, cex = 1.2)

    # Add reference lines
    abline(v = key_x, lty = 2, col = "gray50")
    segments(x0 = key_x, y0 = 0, y1 = key_y, lty = 3, col = "gray50")

    # Add points at key locations
    points(key_x, key_y, pch = 21, bg = "red", cex = 1.5)

    # Add parameter annotations
    text(x = key_x, y = key_y * 1.1, labels = c(expression(mu), expression(mu - 2 *
        sigma), expression(mu + 2 * sigma)), cex = 1.2)
}

# Save to file
dump("likelihood_plot", file = "likelihood_plot.R")

3.6.2.3 Likelihood of Standard Normal Distribution

# Standard normal
likelihood_plot(rnorm(1000))

# Custom distribution
likelihood_plot(rnorm(1000, 5, 2), var.name = "Values", title = "Custom Distribution")


3.6.2.4 The Sample Likelihood

The goal of maximum likelihood estimation (MLE) is to identify the population parameter values that have the highest probability of producing a particular sample of data. Identifying the most likely parameter values requires a summary fit measure for the entire sample, not just a single score.

In probability theory, the joint probability for a set of independent events is the product of \(N\) individual probabilities. The sample likelihood quantifies the joint probability of drawing this collection of \(N\) scores from a normal distribution with a mean \(\mu\) and a variance \(\sigma^2\).

The sample likelihood for a normal distribution is given by:

\[ L = \prod_{i=1}^{N} \left\{ \frac{1}{\sqrt{2\pi\sigma^2}} \times\exp \left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) \right\} \] where:

  • \(L\) is the sample likelihood,

  • \(x_i\) is the \(i\)-th observed value,

  • \(\mu\) is the population mean,

  • \(\sigma^2\) is the population variance,

  • \(N\) is the number of observations in the sample.

This likelihood function represents the probability of observing the entire sample, given specific values of \(\mu\) and \(\sigma^2\). Maximizing this likelihood function allows us to estimate the most likely values for these parameters, which is the core of MLE.

prod(univariate_likelihood(X1))
## [1] 7.785e-36

3.6.2.5 The Log Likelihood

Because the sample likelihood is such a small number, it is difficult to work with and is prone to rounding error. Computing the natural logarithm of the individual likelihood values solves this problem and converts the likelihood to a more tractable metric. The sample log-likelihood is the sum of the individual log-likelihood values.

The sample log-likelihood is given by:

\[ \log(L) = \sum_{i=1}^{N} \log \left\{ \frac{1}{\sqrt{2\pi\sigma^2}} \times \exp \left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) \right\} \] where:

  • \(\log(L)\) is the log-likelihood,

  • \(x_i\) is the \(i\)-th observed value,

  • \(\mu\) is the population mean,

  • \(\sigma^2\) is the population variance,

  • \(N\) is the number of observations in the sample.

Taking the logarithm transforms the product of likelihoods into a sum of log-likelihoods, which makes the computation easier and numerically stable.

# method 1:
sum(log(univariate_likelihood(X1)))
## [1] -80.841

Log Identity:

The logarithmic identity is:

\[ \log_b (xy) = \log_b (x) + \log_b (y) \]

This identity is derived from the property that:

\[ b^c \times b^d = b^{c+d} \] This means that the logarithm of a product is equal to the sum of the logarithms of the factors, which is a useful property when working with log-likelihoods.


# method 2:
log(prod(univariate_likelihood(X1)))
## [1] -80.841

3.6.2.6 Estimating Unknown Parameters

The estimation process is iterative, repeatedly calculating the log-likelihood with different values of the population parameters. This continues until it identifies the estimates that maximize the log-likelihood—those that best fit the data by minimizing the standardized distances between the observed scores and the mean (Enders, 2010).


This code defines a function logLiM() which creates a visualization of the log-likelihood function for estimating the mean parameter (μ) of a normal distribution. By computing and plotting the log-likelihood curve across a range of potential mean values, it provides an intuitive demonstration of maximum likelihood estimation (MLE) principles. The implementation assumes known variance (σ²), which is typically estimated from the sample data, offering both theoretical clarity and practical utility for data analysis. The invisible output contains all calculated μ values and their corresponding log-likelihoods for further analysis.

This approach effectively bridges the gap between statistical theory and applied work, making it particularly valuable for understanding estimation concepts while working with real-world datasets.

#' Plot Log-Likelihood Function for the Mean
#'
#' This function computes and plots the log-likelihood of the population mean 
#' for a univariate numeric vector under the assumption of a known variance 
#' (estimated from the data). It visually illustrates the maximum likelihood 
#' estimate (MLE) of the mean.
#'
#' @param x A numeric vector of observed data.
#' @param title Optional character string. Custom title for the plot. 
#'   Defaults to 'Log-Likelihood of the Mean'.
#'
#' @return A plot showing the log-likelihood values over a range of candidate means.
#'   The function invisibly returns a data frame with candidate means and corresponding log-likelihoods.
#'
#' @examples
#' x <- rnorm(100, mean = 5, sd = 2)
#' logLiM(x)
#'
#' @importFrom graphics plot segments points text
#' @importFrom shape Arrowhead
#' @export
logLiM <- function(x, title = NULL) {
    require(graphics, quietly = TRUE)
    require(shape, quietly = TRUE)

    # Estimate sample mean and variance
    mu_hat <- mean(x, na.rm = TRUE)
    var_hat <- as.numeric(var(x, na.rm = TRUE))  # Ensures scalar

    # Define a sequence of candidate means around MLE (±6 SD)
    mu_seq <- seq(mu_hat - 6 * sqrt(var_hat), mu_hat + 6 * sqrt(var_hat), by = 0.01)

    # Compute log-likelihoods for each candidate mean
    log_lik <- sapply(mu_seq, function(mu_i) {
        sum(dnorm(x, mean = mu_i, sd = sqrt(var_hat), log = TRUE))
    })

    # Plot the log-likelihood curve
    plot(mu_seq, log_lik, type = "l", lwd = 4, col = "darkgreen", xlab = "Mean",
        ylab = "Log-Likelihood", ylim = c(min(log_lik), max(log_lik) + 0.2 * diff(range(log_lik))),
        main = ifelse(is.null(title), "Log-Likelihood of the Mean", title))

    # Find the maximum log-likelihood and corresponding mean
    max_idx <- which.max(log_lik)
    max_mu <- mu_seq[max_idx]
    max_ll <- log_lik[max_idx]

    # Draw vertical line from x-axis to max point
    segments(x0 = max_mu, y0 = min(log_lik), x1 = max_mu, y1 = max_ll, col = "black",
        lty = 2)

    # Draw horizontal line from left to max point
    segments(x0 = min(mu_seq) * 0.6, y0 = max_ll, x1 = max_mu, y1 = max_ll, col = "black",
        lty = 2)

    # Arrow to highlight vertical MLE line
    Arrowhead(x0 = max_mu, y0 = min(log_lik), angle = 270, arr.lwd = 0.3, arr.length = 0.4,
        arr.col = "black", arr.type = "curved")

    # Highlight the MLE point
    points(max_mu, max_ll, pch = 21, bg = "darkgreen", col = "white", cex = 0.9)

    # Annotate with the estimated mean
    text(max_mu, max_ll, labels = round(max_mu, 4), pos = 3)

    # Return invisible data for further inspection or reuse
    invisible(data.frame(mu = mu_seq, logLik = log_lik))
}

# Save to file
dump("logLiM", file = "logLiM.R")

This code visualizes how the log-likelihood of a normal distribution’s mean parameter (μ) changes across different candidate values, using the observed data to estimate the most probable mean (MLE) and show estimation certainty through likelihood curvature

logLiM(X1)

Interpretation Guide for logLiM() Function

Green Likelihood Curve:

The parabolic curve shows how the log-likelihood changes as we test different μ values. The peak represents the maximum likelihood estimate (MLE).

Reference Lines:

  • Vertical dashed line: Marks the exact MLE position on the x-axis
  • Horizontal dashed line: Shows the maximum log-likelihood value

The x-position of the peak indicates the best estimate for the population mean.

Estimation Certainty:

  • A narrow, steep peak → High certainty about μ
  • A wide, shallow curve → More uncertainty

Curvature Interpretation:

The steepness of the parabola’s sides reflects how quickly likelihood decreases as we move away from the MLE, indicating estimation precision.


The sample mean is an unbiased estimator of the population mean.

The logLiV() function creates a visualization of how the log-likelihood varies for different values of the variance parameter (σ²) in a normal distribution, while holding the mean fixed at its maximum likelihood estimate. By computing the log-likelihood across a carefully chosen range of candidate variances centered around the sample variance, it produces an asymmetric curve that reveals important properties of variance estimation.

#' Plot Log-Likelihood Function for the Variance
#'
#' This function computes and plots the log-likelihood of the variance parameter 
#' for a univariate normal distribution, assuming a fixed mean (MLE from the data). 
#' It visually illustrates the maximum likelihood estimate (MLE) of the variance.
#'
#' @param x A numeric vector of observed data.
#' @param title Optional character string. Custom title for the plot.
#'
#' @return A plot showing log-likelihood values over a range of candidate variances.
#'   Invisibly returns a data frame of variance values and corresponding log-likelihoods.
#'
#' @examples
#' x <- rnorm(100, mean = 0, sd = 1)
#' logLiV(x)
#'
#' @importFrom graphics plot segments points text
#' @importFrom shape Arrowhead
#' @export
logLiV <- function(x, title = NULL) {
    require(graphics, quietly = TRUE)
    require(shape, quietly = TRUE)

    # Sample mean and variance
    mu_hat <- mean(x, na.rm = TRUE)
    var_hat <- as.numeric(var(x, na.rm = TRUE))  # Ensure scalar

    # Create a range of candidate variance values around estimated variance
    var_seq <- seq(var_hat * 0.5, var_hat * 1.5, by = 0.1)

    # Compute log-likelihood for each variance candidate
    log_lik <- sapply(var_seq, function(v) {
        sum(dnorm(x, mean = mu_hat, sd = sqrt(v), log = TRUE))
    })

    # Plot log-likelihood
    plot(var_seq, log_lik, type = "l", lwd = 5, col = "darkgreen", xlab = "Variance",
        ylab = "Log-Likelihood", ylim = c(min(log_lik), max(log_lik) + 0.2 * diff(range(log_lik))),
        main = ifelse(is.null(title), "Log-Likelihood of the Variance", title))

    # Locate maximum
    max_idx <- which.max(log_lik)
    max_var <- var_seq[max_idx]
    max_ll <- log_lik[max_idx]

    # Reference lines: Draws a vertical dashed line from the bottom of the plot
    # up to the peak likelihood point
    segments(x0 = max_var, y0 = min(log_lik), x1 = max_var, y1 = max_ll, col = "black",
        lty = 2)

    # Draws a horizontal dashed line from the left edge to the peak point
    segments(x0 = min(var_seq) * 0.6, y0 = max_ll, x1 = max_var, y1 = max_ll, col = "black",
        lty = 2)

    # Arrowhead to highlight vertical line
    Arrowhead(x0 = max_var, y0 = min(log_lik), angle = 270, arr.lwd = 0.3, arr.length = 0.4,
        arr.col = "black", arr.type = "curved")

    # Highlight MLE point
    points(max_var, max_ll, pch = 21, bg = "darkgreen", col = "white", cex = 0.9)
    text(max_var, max_ll, labels = round(max_var, 4), pos = 3)

    # Return invisible data for reproducibility or inspection
    invisible(data.frame(variance = var_seq, logLik = log_lik))
}

# Save to file
dump("logLiV", file = "logLiV.R")

This implementation includes clear visual markers of the MLE point and returns the complete set of calculated values for further analysis.

logLiV(X1, title = "Log-likelihood of the variance")


Interpretation Guide

Dark Green Curve: Shows how log-likelihood varies with different variance values

  • Higher values = more plausible variances
  • Peak = maximum likelihood estimate (MLE)

The peak of this curve identifies the maximum likelihood estimate of the variance, while the shape of the curve - with its characteristically steeper descent on the left (underestimation side) and more gradual decline on the right (overestimation side) - provides visual intuition about the precision and potential bias in variance estimation. The steep left side shows likelihood drops rapidly when underestimating variance where as the gradual right side reveals more tolerance when overestimating. This motivates the traditional n-1 correction in sample variance calculations and explains why sample variance (with n-1 denominator) naturally corrects for bias.

Reference Lines:

  • Vertical dashed line: Marks the exact MLE position on x-axis
  • Horizontal dashed line: Shows maximum log-likelihood value
  • Curved arrow: Directs attention to the MLE location

The x-position of peak indicates the best variance estimate


Likelihood Estimation of Means

# 2 x 2 pictures on one plot
op <- par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))

# variable names / plot titles
names <- colnames(A)

# produce likelihood plots for each variable in the data frame
for (i in 1:ncol(A)) {
    logLiM(A[, i][!is.na(A[, i])], names[i])
    title(main = names[i], col.main = "black")
}

## At end of plotting, reset to previous settings:
par(op)

# check: The sample mean is an unbiased estimator of the population mean.
round(colMeans(A, na.rm = TRUE), 2)
##      X      Y      Z 
## 100.00  10.35  11.70

Likelihood Estimation of Variances

# 2 x 2 pictures on one plot
op <- par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))

# variable names / plot titles
names <- colnames(A)

# produce likelihood plots for each variable in the data frame
for (i in 1:ncol(A)) {
    logLiV(A[, i][!is.na(A[, i])], names[i])
    title(main = names[i], col.main = "black")
}

## At end of plotting, reset to previous settings:
par(op)

# check: method 1 denominator (N-1) is used:
round(apply(A2, 2, FUN = function(x) {
    x <- na.omit(x)
    sum((x - mean(x))^2)/(length(x))
}), 2)
##     X     Y     Z 
## 92.54  6.02  6.99

3.6.2.7 Maximum Likelihood Estimation with Multivariate Data

The probability density function (PDF) of the multivariate normal distribution defines the shape of the likelihood surface for multivariate data. For an observation \(x_i\), the likelihood is given by:

\[ L_i = \left(2\pi \right)^{-\frac{k}{2}} \, \left|\Sigma \right|^{-\frac{1}{2}} \, \exp\left[-\frac{1}{2} \left(x_i - \mu \right)^{\top} \Sigma^{-1} \left(x_i - \mu \right)\right] \]

where:

  • \(k\) is the number of variables (i.e., the dimensionality of \(X\)); each individual has a vector of \(k\) scores,

  • \(\mu\) is the mean vector (length \(k\)),

  • \(\Sigma\) is the \(k \times k\) covariance matrix, and

  • \(|\Sigma|\) is the determinant of the covariance matrix \(\Sigma\).

This expression shows how the likelihood for a single multivariate observation depends on the distance between that observation and the mean, scaled by the covariance structure of the data.


If you want to compute the residual matrix (centered data matrix) by matrix operations, the typical interpretation is:

\[ \text{Residual matrix } R = A - \mathbf{1} \mu^\top \]

where:

  • \(A\) is your data matrix (size \(n \times p\)),

  • \(\mu\) is the vector of column means (length \(p\)),

  • \(\mathbf{1}\) is a column vector of ones (length \(n\)),

  • \(\mu^\top\) is the transpose of the row vector \(\mu\) (i.e., a \(1 \times p\) row vector)

What’s happening?

  • \(\mathbf{1} \times \mu^\top\) produces an \(n \times p\) matrix where each row is the mean vector \(\mu\).
  • Subtracting this from \(A\) subtracts the column means from each element in the respective columns, i.e., centers the data.

This code implements a multivariate normal log-likelihood calculator that evaluates how probable each observation is under a multivariate normal distribution. The core function computes sample statistics (mean vector and covariance matrix), centers the data, and then calculates per-observation likelihood scores using the Mahalanobis distance within the multivariate normal density formula. The companion code applies this to mean-imputed data, first replacing missing values with column means before computing likelihood scores. This implementation relies on custom matrix operation functions for pedagogical clarity rather than using R’s built-ins, providing full visibility into the underlying linear algebra. The resulting likelihood scores can identify unusual observations or assess distributional assumptions, though the mean imputation approach may distort results if missingness isn’t completely at random. The code outputs both imputed values and their corresponding log-likelihoods for diagnostic purposes.

#' Compute Log-Likelihood for Multivariate Normal Data
#'
#' This function computes the log-likelihood for each observation in a multivariate 
#' data matrix under the assumption of a multivariate normal distribution.
#'
#' It assumes that custom utility functions `MEANS`, `COV`, `INV`, and `DET` are 
#' sourced from local R scripts.
#'
#' @param A A numeric matrix or data frame with rows as cases and columns as variables.
#'
#' @return A numeric matrix of log-likelihood values (one per row of `A`).
#'
#' @examples
#' data <- matrix(rnorm(300), nrow = 100, ncol = 3)
#' multivariate_likelihood(data)
#'
#' @export
multivariate_likelihood <- function(A) {
    # Source custom matrix operations: see Computational Linear Algebra with
    # Applications in Statistics:
    # https://rpubs.com/castro/Computational_Linear_Algebra_for_Statistics

    source("MEANS.R")
    source("COV.R")
    source("INV.R")
    source("DET.R")
    source("CSSCP.R")

    A <- as.matrix(A)
    n <- nrow(A)  # number of observations
    k <- ncol(A)  # number of variables

    S <- COV(A)  # sample covariance matrix
    mu <- MEANS(A)  # mean vector

    # center each column of A by subtracting its mean Create an n x 1 vector of
    # ones
    ones <- matrix(1, nrow = n, ncol = 1)

    # Compute residual matrix
    R <- A - ones %*% t(mu)  # residual matrix (X - mu), vectorized 

    # Precompute constants
    S_inv <- INV(S)
    log_det_S <- log(DET(S))

    # Compute log-likelihood for each observation
    L <- numeric(n)
    for (i in 1:n) {
        # Mahalanobis distance term
        dist <- t(R[i, ]) %*% S_inv %*% R[i, ]

        # Full log-likelihood formula
        L[i] <- -0.5 * (k * log(2 * pi) + log_det_S + dist)
    }

    return(matrix(L, ncol = 1))
}

# Save function to a script file for later sourcing
dump("multivariate_likelihood", file = "multivariate_likelihood.R")

Compute a multivariate likelihood score for each observation from the mean-imputed data:

Q <- data.frame(apply(mean_imputation(A2), 2, function(x) {
    replace(x, is.na(x), mean(x, na.rm = TRUE))
}), round(multivariate_likelihood(mean_imputation(A2)), 6))

colnames(Q) <- c("X", "Y", "Z", "log.likelihood")
Q
##     X  Y  Z log.likelihood
## 1  99  6  7        -9.4182
## 2 105 12 10        -7.3676
## 3 105 14 11        -8.0015
## 4 106 10 15        -8.8112
## 5 112 10 10        -7.1365
## 6 113 14 12        -7.4360
## 7 115 14 14        -7.3452
## 8 118 12 16        -8.0760
## 9 134 11 12        -9.7446

The Missing Data Log-Likelihood

When data are missing, the log-likelihood for observation \(i\) is given by:

\[ \log L_i = - \frac{k_i}{2} \log \left(2\pi \right) - \frac{1}{2} \log \left| \Sigma_i \right| - \frac{1}{2} \left(Y_i - \mu_i \right)^\top \Sigma_i^{-1}(Y_i - \mu_i) \]

where:

  • \(k_i\) is the number of observed (non-missing) variables for case \(i\),
  • \(Y_i\) is the observed portion of the score vector for case \(i\),
  • \(\mu_i\) is the corresponding subset of the population mean vector, and
  • \(\Sigma_i\) is the submatrix of the population covariance matrix corresponding to the observed variables.

The final term, \(\left(Y_i - \mu_i \right)^\top \Sigma_i^{-1}(Y_i - \mu_i)\), in the expression is the Mahalanobis distance, which quantifies the standardized distance between \(Y_i\) and the center of the multivariate distribution. Smaller Mahalanobis distances correspond to higher log-likelihood values, while larger distances indicate a poorer fit and thus lower likelihood values (Enders, 2010).


Example 1:

\[ \small \begin{aligned} \log L_{1} &= -\frac{k_1}{2} \log(2 \pi) - \frac{1}{2} \log \begin{vmatrix} \hat{\sigma}^2_1 & \hat{\sigma}_{1,2} \\ \hat{\sigma}_{2,1} & \hat{\sigma}^2_2 \end{vmatrix} -\frac{1}{2} \left( \begin{bmatrix} y_{(1, 1)} \\ y_{(1, 2)} \end{bmatrix} - \begin{bmatrix} \hat{\mu}_1 \\ \hat{\mu}_2 \end{bmatrix} \right)^T \begin{bmatrix} \hat{\sigma}^2_1 & \hat{\sigma}_{1,2} \\ \hat{\sigma}_{2,1} & \hat{\sigma}^2_2 \end{bmatrix}^{-1} \left( \begin{bmatrix} y_{(1, 1)} \\ y_{(1, 2)} \end{bmatrix} - \begin{bmatrix} \hat{\mu}_1 \\ \hat{\mu}_2 \end{bmatrix}\right) \\ \\ &= -\frac{2}{2} \log(2 \pi) -\frac{1}{2} \log \begin{vmatrix} 189.60 & 11.69 \\ 11.69 & 9.59 \end{vmatrix} -\frac{1}{2} \left( \begin{bmatrix} 78 \\ 13 \end{bmatrix} - \begin{bmatrix} 100.00 \\ 10.35 \end{bmatrix} \right)^T \begin{bmatrix} 189.60 & 11.69 \\ 11.69 & 9.59 \end{bmatrix}^{-1} \left( \begin{bmatrix} 78 \\ 13 \end{bmatrix} - \begin{bmatrix} 100.00 \\ 10.35 \end{bmatrix}\right) \\ \\ &= -6.64 \end{aligned} \]


Example 2:

\[ \begin{aligned} \log(L_5) &= -\frac{k_5}{2} \log(2\pi) - \frac{1}{2} \log \begin{vmatrix}\hat{\sigma}^2_1\end{vmatrix} - \frac{1}{2} (y_{(5, 1)}-\hat{\mu}_1)^T (\hat{\sigma}^2_1)^{-1} (y_{(5, 1)}-\hat{\mu}_1) \\ \\ &= -\frac{1}{2} \log(2\pi) - \frac{1}{2} \log \begin{vmatrix}189.6\end{vmatrix} - \frac{1}{2} (87.0-100.0)^T (189.6)^{-1} (87.0-100) \\ \\ &= -3.99 \end{aligned} \]


This code implements a specialized log-likelihood calculator for multivariate normal data with missing values. The observed_likelihood() function evaluates how probable each observation is under a multivariate normal distribution while properly handling missing data through pairwise deletion. For each case, it computes the log-likelihood using only the observed variables by dynamically subsetting the data to match each observation’s available variable pattern. The implementation carefully handles different missingness patterns across cases while maintaining proper matrix operations for each unique combination of observed variables.

#' Calculate Log-Likelihood for Multivariate Normal Data with Missing Values
#'
#' This function computes the log-likelihood contribution of each observation 
#' under the assumption of a multivariate normal distribution, handling 
#' missing values by using available data per case. The Mahalanobis distance 
#' is used to quantify the deviation of each observed case from the mean.
#'
#' @param x A data matrix or data frame with numeric values. Missing values (NA) 
#'   are allowed and will be handled by pairwise deletion 
#'   for each observation.
#'
#' @return A column matrix (n x 1) of log-likelihood values,
#'  one per row of `x`.
#'
#' @details
#' For each case `i`, the log-likelihood is computed using
#'  only the observed values. The mean vector and
#'  covariance matrix are computed from available data on 
#'  the same set of variables across cases. The function 
#'  uses helper functions 
#' \code{MEANS()}, \code{SIGMA()}, \code{INV()}, and
#'  \code{DET()}, which must be available in the working 
#'  directory as separate R scripts.
#'
#' @references
#' Enders, C. K. (2010). *Applied Missing Data Analysis*. 
#' New York: Guilford Press.
#'
#' @examples
#' # Assume A is a data frame with missing values
#' log_liks <- observed.likelihood(A)
#' head(log_liks)
#'
#' @export
observed_likelihood <- function(x) {
    # Load required helper functions from external files
    source("MEANS.R")  # Calculates column means
    source("COV.R")  # Calculates covariance matrix
    source("INV.R")  # Calculates matrix inverse
    source("DET.R")  # Calculates matrix determinant

    # Remove rows that are entirely NA
    x <- as.matrix(x[apply(x, 1, function(x) {
        !all(is.na(x))
    }), ])
    n <- nrow(x)

    # Initialize log-likelihood vector
    L <- matrix(NA, n, 1)
    index <- 1

    for (i in 1:n) {
        # Extract non-missing values for the i-th case
        y <- as.matrix(x[i, ][!is.na(x[i, ])])  # observed vector
        z <- subset(x, select = !is.na(x[i, ]))  # cases with same observed variables
        k <- nrow(y)  # number of variables observed for this case
        p <- ncol(y)  # number of cases for those variables

        # Compute population parameters from available data
        M <- MEANS(z)  # mean vector
        S <- COV(z)  # covariance matrix
        D <- log(DET(S))  # log determinant
        L2P <- log(2 * pi)  # constant term
        I <- INV(S)  # inverse of covariance matrix

        for (j in 1:p) {
            R <- as.matrix(y[, j] - M)  # residual vector
            # Log-likelihood for this case and variable pattern
            L[index] <- -(k/2) * L2P - (1/2) * D - (1/2) * t(R) %*% I %*% R
            index <- index + 1
        }
    }

    return(L)
}

# save to the working directory getwd(), ls()
dump("observed_likelihood", file = "observed_likelihood.R")
W <- data.frame(A1, round(observed_likelihood(A), 5))
colnames(W) <- c("X", "Y", "Z", "log.likelihood")
W
##      X  Y  Z log.likelihood
## 1   78 13 NA        -7.7687
## 2   84  9 NA        -6.2575
## 3   84 10 NA        -6.2794
## 4   85 10 NA        -6.1915
## 5   87 NA NA        -3.9904
## 6   91  3 NA        -8.2431
## 7   92 12 NA        -5.9926
## 8   94  3 NA        -8.2686
## 9   94 13 NA        -6.1614
## 10  96 NA NA        -3.6071
## 11  99  6  7       -11.1184
## 12 105 12 10        -7.7029
## 13 105 14 11        -9.2564
## 14 106 10 15        -8.5748
## 15 108 NA 10        -5.7473
## 16 112 10 10        -7.6978
## 17 113 14 12        -8.5098
## 18 115 14 14        -8.1823
## 19 118 12 16        -9.6132
## 20 134 11 12       -12.3028

Smaller Mahalanobis distance values — that is, smaller squared z-scores — correspond to larger log-likelihood values. This is because the Mahalanobis distance quantifies how far an observed vector lies from the population mean, taking into account the variance-covariance structure of the data:

\[ D^2 = (x - \mu)^T \Sigma^{-1} (x - \mu) \]

As this distance increases, the exponent in the multivariate normal density function becomes more negative, thereby decreasing the likelihood:

\[ \log L = -\frac{k}{2} \log(2\pi) - \frac{1}{2} \log |\Sigma| - \frac{1}{2} D^2 \]

Thus, a high likelihood value implies that the observation lies close to the mean, given the shape of the distribution, and is considered a better fit under the assumed multivariate normal model.

Conversely, observations far from the mean (larger \(D^2\)) result in lower likelihoods, signaling potential outliers or poor fit.


3.7 The Expectation Maximization (EM) Algorithm

The EM algorithm is an iterative procedure for finding maximum likelihood estimates of parameters in statistical models that depend on unobserved latent variables. Each iteration consists of two main steps:

  • E-step (Expectation): Compute the expected value of the log-likelihood, with respect to the current estimate of the distribution of the latent variables.
  • M-step (Maximization): Maximize the expected log-likelihood found in the E-step to update parameter estimates.

The process begins with initial estimates for the sufficient statistics (mean vector and covariance matrix). These estimates are refined by regressing the incomplete variables on the observed variables, then iteratively updating the parameters.


3.7.1 Bivariate EM

In a bivariate case with missing values in variable \(Y\), the E-step fills in the missing components of:

  • \(\sum Y\)
  • \(\sum Y^2\)
  • \(\sum XY\)

The predicted values fill in \(\sum Y\) and \(\sum XY\), while missing parts of \(\sum Y^2\) are replaced using:

\[ \hat{Y}_i^2 + \hat{\sigma}^2_{Y|X} \]

This allows the M-step to update the estimates using:

\[ \hat{\mu}_Y = \frac{1}{N} \sum Y \]

\[ \hat{\sigma}^2_Y = \frac{1}{N} \left( \sum Y^2 - \frac{(\sum Y)^2}{N} \right) \]

\[ \hat{\sigma}_{XY} = \frac{1}{N} \left( \sum XY - \frac{\sum X \sum Y}{N} \right) \]


More precisely, the E-step computes each case’s expected contribution to the sufficient statistics (Dempster, A. P. et al., 1977). It uses the current estimates of the mean vector and covariance matrix to construct regression equations that predict the missing values.

For a bivariate dataset with \(Y\) missing, the regression equations are:

3.7.1.1 Regression Coefficients and Variance

\[ \hat{\beta}_1 = \frac{\hat{\sigma}_{X, Y}}{\hat{\sigma}^2_X} \]

\[ \hat{\beta}_0 = \hat{\mu}_Y - \hat{\beta}_1 \hat{\mu}_X \]

\[ \hat{\sigma}^2_{Y|X} = \hat{\sigma}^2_Y - \hat{\beta}_1^2 \hat{\sigma}^2_X \]

3.7.1.2 Predicted Value for Missing Y

\[ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i \]


This function implements a specialized EM algorithm for bivariate data where only the second variable (Y) contains missing values. It alternates between imputing missing Y values through regression on the complete X variable (E-step) and updating distribution parameters using both observed and imputed data (M-step), iterating until the covariance matrix stabilizes or reaching maximum iterations.

#' Bivariate Expectation-Maximization (EM) Algorithm for Missing Data Imputation
#'
#' Performs EM imputation for bivariate data where missing values occur only in the second variable (Y).
#' Uses regression-based imputation in the E-step and updates parameters (mean, variance, covariance) in the M-step.
#' 
#' @param x A two-column matrix or data frame. Missing values (`NA`) must occur only in the second column (Y).
#' @param max_iter Maximum number of iterations (default: 500).
#' @param sig_digits Number of significant digits for convergence check (default: 4).
#' @param tol Tolerance level for convergence (if `cov_diff < tol`, stop early).
#' 
#' @return A list containing:
#' \describe{
#'   \item{convergence}{Number of iterations and convergence status.}
#'   \item{imputations}{Imputed dataset (rounded to 3 decimal places).}
#'   \item{means}{Final mean estimates.}
#'   \item{covariance}{Final covariance matrix.}
#' }
#'
#' @details 
#' - **Initialization**: Mean imputation for missing Y values.
#' - **E-step**: Regression of Y on X to predict missing values.
#' - **M-step**: Recompute means, variances, and covariance.
#' - **Convergence**: Stops when the covariance matrix stabilizes (to `sig_digits`) or `max_iter` is reached.
#'
#' @references Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). 
#' *Maximum likelihood from incomplete data via the EM algorithm.*
#' 
#' @export
EM <- function(x, max_iter = 500, sig_digits = 4, tol = 1e-06) {
    # Input validation
    if (ncol(x) != 2)
        stop("x must have exactly 2 columns.")
    if (any(is.na(x[, 1])))
        stop("Missing values are only allowed in the second column (Y).")

    # Remove rows where both columns are NA
    x <- data.matrix(na.omit(x))

    # Initialize missing Y with column means
    data <- x
    na_mask <- is.na(data[, 2])
    data[na_mask, 2] <- mean(data[, 2], na.rm = TRUE)

    # Track convergence
    Sigma_prev <- matrix(0, 2, 2)
    Sigma_current <- round(cov(data), sig_digits)
    iter <- 0
    converged <- FALSE

    while (iter < max_iter && !converged) {
        iter <- iter + 1

        # Sufficient statistics
        sumX <- sum(data[, 1])
        sumY <- sum(data[, 2])
        sumX2 <- sum(data[, 1]^2)
        sumY2 <- sum(data[, 2]^2)
        sumXY <- sum(data[, 1] * data[, 2])
        n <- nrow(data)

        # M-step: Update parameters
        meanX <- sumX/n
        meanY <- sumY/n
        varX <- (sumX2 - sumX^2/n)/n
        varY <- (sumY2 - sumY^2/n)/n
        covXY <- (sumXY - sumX * sumY/n)/n

        # E-step: Impute missing Y
        B1 <- covXY/varX
        B0 <- meanY - B1 * meanX
        SigmaYX <- varY - B1^2 * varX

        # Update missing Y and Y²
        data[na_mask, 2] <- B0 + B1 * data[na_mask, 1]
        Y2_imputed <- data[na_mask, 2]^2 + SigmaYX

        # Check convergence
        Sigma_prev <- Sigma_current
        Sigma_current <- round(cov(data), sig_digits)
        cov_diff <- max(abs(Sigma_current - Sigma_prev))
        converged <- cov_diff < tol
    }

    # Prepare output
    list(convergence = paste("Converged in", iter, "iterations (tol =", tol, ")"),
        imputations = round(data, 3), means = round(c(meanX, meanY), 3), covariance = round(Sigma_current,
            3))
}

# save to the working directory getwd(), ls()
dump("EM", file = "EM.R")

This code demonstrates how to use the EM function for bivariate missing data imputation and presents the results in a clean, organized format. It runs the EM algorithm twice on different column pairs from dataset A1 (columns 1-2 and 1-3), then uses a custom print_EM_results function to display the outputs in a standardized way.

# Example usage
EM1 <- EM(A1[, 1:2])
EM2 <- EM(A1[, c(1, 3)])

# Nicely formatted output with better structure
print_EM_results <- function(EM_result, cols) {
    cat("\n", rep("=", 40), "\n", sep = "")
    cat("EM RESULTS FOR COLUMNS", paste(cols, collapse = " & "), "\n")
    cat(rep("=", 40), "\n\n", sep = "")

    cat("Convergence:\n")
    cat(EM_result$convergence, "\n\n")

    cat("Estimated Means:\n")
    print(matrix(EM_result$means, nrow = 1, dimnames = list("", paste0("X", cols))))

    cat("\nEstimated Covariance Matrix:\n")
    print(EM_result$covariance)
}

# Print results
print_EM_results(EM1, c(1, 2))
## 
## ========================================
## EM RESULTS FOR COLUMNS 1 & 2 
## ========================================
## 
## Convergence:
## Converged in 1 iterations (tol = 1e-06 ) 
## 
## Estimated Means:
##      X1     X2
##  100.53 10.353
## 
## Estimated Covariance Matrix:
##         X      Y
## X 221.140 14.614
## Y  14.614 11.993
print_EM_results(EM2, c(1, 3))
## 
## ========================================
## EM RESULTS FOR COLUMNS 1 & 3 
## ========================================
## 
## Convergence:
## Converged in 1 iterations (tol = 1e-06 ) 
## 
## Estimated Means:
##     X1   X3
##  111.5 11.7
## 
## Estimated Covariance Matrix:
##        X      Z
## X 94.056 11.611
## Z 11.611  7.344

3.7.2 Applying EM to Multivariate Data

Complete-data Sufficient Statistics

A sufficient statistic with respect to the parameter \(\Theta\) is a statistic

\[ T(X_1, X_2, X_3, \dots, X_n) \]

that contains all the information useful for estimating \(\Theta\). Formally, a statistic \(T\) is sufficient for \(\Theta\) if the conditional probability

\[ p(x_1, x_2, x_3, \dots, x_n \mid T(x_1, x_2, x_3, \dots, x_n)) \]

does not depend on \(\Theta\).


The complete-data sufficient statistic matrix can be written as:

\[ T = \begin{bmatrix} n & X^T \\ \\ X & X^T X \end{bmatrix} = \begin{bmatrix} n & \sum{x_{i1}} & \sum{x_{i2}} & \dots & \sum{x_{ip}} \\ \\ & \sum{x^{2}_{i1}} & \sum{x_{i1} x_{i2}} & \dots & \sum{x_{i1} x_{ip}} \\ \\ & & \sum{x^{2}_{i2}} & \dots & \sum{x_{i2} x_{ip}} \\ \\ & & & \ddots & \vdots \\ \\ & & & & \sum{x^{2}_{ip}} \end{bmatrix} \]


This code implements a function called sufficient_statistics_matrix() that calculates a key matrix for multivariate normal parameter estimation. The function computes what statisticians call the “sufficient statistics matrix” (denoted as T) - a compact representation containing all information needed to estimate parameters (means and covariances) for multivariate normal data. It handles missing values through either mean imputation or complete-case removal. The saved matrix enables computation of means, variances, and covariances through simple matrix operations, providing the foundation for many multivariate statistical procedures.

#' Compute Complete-Data Sufficient Statistics Matrix
#'
#' Calculates the sufficient statistics matrix T for a dataset, which contains all
#' information needed to estimate parameters in multivariate normal models.
#' Missing values are handled via mean imputation.
#'
#' @param A A numeric matrix or data frame (rows: observations, columns: variables)
#' @param na_action How to handle NAs ('mean' for mean imputation, 'omit' to remove rows)
#' @return A symmetric (p+1)×(p+1) matrix structured as:
#' \deqn{
#' T = \begin{bmatrix}
#' n & \sum X_i \\
#' \sum X_i & \sum X_i X_i^T
#' \end{bmatrix}
#' }
#' where the first row/column corresponds to the intercept term.
#'
#' @examples
#' # With missing data
#' dat <- matrix(c(1, 2, NA, 4, 5, 6), ncol=2)
#' sufficient_stats <- T(dat)
#' print(sufficient_stats)
#'
#'#' @references
#' For theoretical foundations:
#' \itemize{
#'   \item Little, R. J. A., & Rubin, D. B. (2019). \emph{Statistical Analysis with Missing Data} (3rd ed.). Wiley. (Chapter 7 covers sufficient statistics for missing data problems)
#'   
#'   \item Schafer, J. L. (1997). \emph{Analysis of Incomplete Multivariate Data}. Chapman & Hall. (Section 5.2 discusses sufficient statistics in EM algorithms)
#' }
#' 
#' For computational implementation:
#' \itemize{
#'   \item Van Buuren, S. (2018). \emph{Flexible Imputation of Missing Data} (2nd ed.). CRC Press. (Section 3.3 covers sufficient statistics approaches)
#'   
#'   \item Enders, C. K. (2022). \emph{Applied Missing Data Analysis} (2nd ed.). Guilford Press. (Chapter 4 discusses sufficient statistics for normal models)
#' }
#'
#' @export
sufficient_statistics_matrix <- function(A, na_action = c("mean", "omit")) {
    # Input validation
    if (!is.matrix(A) && !is.data.frame(A)) {
        stop("Input must be a matrix or data frame")
    }
    na_action <- match.arg(na_action)

    # Convert to matrix and handle NAs
    X <- data.matrix(A)

    if (na_action == "omit") {
        X <- X[complete.cases(X), , drop = FALSE]
        if (nrow(X) == 0)
            stop("No complete cases after NA removal")
    } else {
        # Mean imputation with warning if NAs exist
        na_count <- colSums(is.na(X))
        if (any(na_count > 0)) {
            warning(paste("Imputed NAs in columns:", paste(which(na_count > 0), collapse = ", ")))
            X <- apply(X, 2, function(x) {
                x[is.na(x)] <- mean(x, na.rm = TRUE)
                x
            })
        }
    }

    # Add intercept and compute cross-product
    ONE_X <- cbind(1, X)
    stats_matrix <- t(ONE_X) %*% ONE_X

    # Add informative dimnames
    varnames <- if (!is.null(colnames(A)))
        colnames(A) else paste0("X", 1:ncol(A))
    dimnames(stats_matrix) <- list(c("Intercept", varnames), c("Intercept", varnames))

    return(stats_matrix)
}

# Save function
dump("sufficient_statistics_matrix", file = "sufficient_statistics_matrix.R")
sufficient_statistics <- sufficient_statistics_matrix(B1)
print(sufficient_statistics)
##           Intercept    X1    X2    X3    X4        Y
## Intercept      13.0    97   626   153   390   1240.5
## X1             97.0  1139  4922   769  2620  10032.0
## X2            626.0  4922 33050  7201 15739  62027.8
## X3            153.0   769  7201  2293  4628  13981.5
## X4            390.0  2620 15739  4628 15062  34733.3
## Y            1240.5 10032 62028 13982 34733 121088.1

Components of Sufficient Statistics Matrix:

## Sufficient Statistics Matrix Components (HTML Knit Ready)

# 1. Sample size
cat("1. Sample Size (n): ", sufficient_statistics[1, 1], "\n\n")
## 1. Sample Size (n):  13
# 2. Variable sums
cat("2. Variable (column) Sums: ", colSums(B1), "\n\n")
## 2. Variable (column) Sums:  97 626 153 390 1240.5
# 3. Sums of Squares and Cross Products
cat("3. Sums of Squares and Cross Products (SSCP): \n")
## 3. Sums of Squares and Cross Products (SSCP):
print(crossprod(B1))
##       X1    X2    X3    X4      Y
## X1  1139  4922   769  2620  10032
## X2  4922 33050  7201 15739  62028
## X3   769  7201  2293  4628  13982
## X4  2620 15739  4628 15062  34733
## Y  10032 62028 13982 34733 121088
cat("\n\n")

3.7.2.1 The Expectation Step, \(\Theta\)

Calculate the expected value of the log-likelihood function with respect to the conditional distribution of the missing values given the observed values, under the current estimate of the parameters, \(\Theta\).

\[ \Theta = \begin{bmatrix} -1 & \mu^T \\ \\ \mu & \Sigma \end{bmatrix} = \begin{bmatrix} -1 & \mu_1^T & \mu_2^T & \dots & \mu_p^T \\ \\ \mu_1 & \Sigma_{11} & \Sigma_{12} & \dots & \Sigma_{1p} \\ \\ \mu_2 & \Sigma_{21} & \Sigma_{22} & \dots & \Sigma_{2p} \\ \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \\ \mu_p & \Sigma_{p1} & \Sigma_{p2} & \dots & \Sigma_{pp} \\ \end{bmatrix} \]

The SWEEP Operator

The SWEEP operator is a matrix transformation used extensively in multivariate statistics and regression analysis. It provides an efficient way to update matrix inverses and related computations, particularly when fitting or adjusting linear models.

Given a symmetric matrix \(A\), the SWEEP operator transforms it relative to a pivot element (e.g., diagonal element \(A_{kk}\)) to produce a new matrix. This operation is especially useful in solving normal equations in regression without explicitly computing matrix inverses.


How It Works

Let \(A\) be a symmetric \(p \times p\) matrix. The SWEEP operator sweeps on index \(k\) as follows:

For all \(i, j = 1, \dots, p\):

  1. If \(i = k\) and \(j = k\): \[ A_{kk} \leftarrow -1 / A_{kk} \]
  2. If \(i = k\) and \(j \ne k\) (or vice versa): \[ A_{ik} \leftarrow A_{ik} / A_{kk} \]
  3. If \(i \ne k\) and \(j \ne k\): \[ A_{ij} \leftarrow A_{ij} - A_{ik} \cdot A_{kj} / A_{kk} \]

This operation is self-inverting: sweeping on the same index twice restores the original matrix.


We can form the augmented matrix:

\[ \begin{bmatrix} \mathbf{X}^\top \mathbf{X} & \mathbf{X}^\top \mathbf{Y} \\ \mathbf{Y}^\top \mathbf{X} & \mathbf{Y}^\top \mathbf{Y} \end{bmatrix} \]

By applying the SWEEP operator to the predictor indices, we can:

  • Compute the residual sum of squares
  • Evaluate model diagnostics

— all without directly inverting \(\mathbf{X}^\top \mathbf{X}\).


This code implements the SWEEP operator, a fundamental matrix transformation used in statistical computing. The SWEEP function performs an in-place matrix transformation that selectively inverts portions of a symmetric matrix by operating on specified diagonal elements (pivots) and updates all other elements through rank-1 adjustments. It avoids full matrix inversion while achieving equivalent results for key operations.

#' “Belsley–Kuh–Welsch–style” symmetric SWEEP operator
#'
#' Applies the SWEEP operator to one or more indices of a symmetric matrix.
#' Commonly used in regression, multivariate analysis, and matrix algebra
#' to avoid direct matrix inversion.
#'
#' @param M A square symmetric numeric matrix.
#' @param indices A vector of integer indices (1-based) indicating which
#'   diagonal elements (pivots) to sweep.
#'
#' @return A swept matrix of the same dimensions as \code{M}, with the
#'   specified pivot elements transformed.
#'
#' @details The SWEEP operator is a matrix transformation that inverts
#'   the pivot elements and updates the rest of the matrix accordingly.
#'   It is particularly useful for solving least squares problems using
#'   the augmented cross-product matrix.
#'
#'   If any pivot element is approximately zero (within machine precision),
#'   it will be skipped to avoid division by zero.
#'
#' @examples
#' # Simple regression example
#' X <- cbind(1, c(2, 3, 4))  # Intercept + one predictor
#' y <- c(1, 2, 3)
#' Z <- cbind(X, y)
#' A <- t(Z) %*% Z
#' A_swept <- SWEEP(A, 1:2)  # Sweep on predictors
#' coef <- A_swept[1:2, 3]   # Regression coefficients
#' print(coef)
#'
#' @export
SWEEP <- function(M, indices) {
    # Make a copy of the input matrix to avoid modifying in place
    B <- M

    # Iterate over each index to apply the sweep
    for (k in indices) {
        d <- B[k, k]  # Pivot element

        # Skip if pivot is (almost) zero to avoid division by zero
        if (abs(d) < .Machine$double.eps)
            next

        # Update the k-th row and column (excluding pivot)
        B[k, -k] <- B[k, -k]/d
        B[-k, k] <- B[-k, k]/d

        # Update the rest of the matrix with a rank-1 adjustment
        B[-k, -k] <- B[-k, -k] - outer(B[-k, k], B[k, -k]) * d

        # Replace the pivot with its negative reciprocal
        B[k, k] <- -1/d
    }

    return(B)
}


# Save function
dump("SWEEP", file = "SWEEP.R")

This function computes the augmented parameter matrix \(\theta\), which encapsulates both the mean vector and the variance-covariance matrix of a multivariate dataset in a unified form. This structure is especially useful in multivariate statistical modeling, such as the EM algorithm.

# Main function to build augmented matrix and apply SWEEP
augmented_parameter_matrix <- function(data, scaled = TRUE) {
    # Clean all-NA rows
    data <- as.matrix(data)
    data <- data[rowSums(!is.na(data)) > 0, ]

    # Mean imputation
    for (j in seq_len(ncol(data))) {
        data[is.na(data[, j]), j] <- mean(data[, j], na.rm = TRUE)
    }

    # Add intercept and compute SSCP
    X <- cbind(1, data)
    XtX <- t(X) %*% X

    # SWEEP on intercept to get parameter matrix (means and covariance)
    swept <- SWEEP(XtX, 1)

    # Optional scaling (i.e., divide SSCP part by n to get covariance)
    if (scaled) {
        n <- nrow(data)
        swept[-1, -1] <- swept[-1, -1]/n
        swept[1, 1] <- -1/n
    }

    # Label
    labels <- c("(Intercept)", colnames(data))
    dimnames(swept) <- list(labels, labels)

    return(swept)
}

# Save to file
dump("augmented_parameter_matrix", file = "augmented_parameter_matrix.R")
# Define B2 as before
B2 <- matrix(c(7, 26, 6, 60, 78.5, 1, 29, 15, 52, 74.3, 11, 56, 8, 20, 104.3, 11,
    31, 8, 47, 87.6, 7, 52, 6, 33, 95.9, 11, 55, 9, 22, 109.2, 3, 71, 17, NA, 102.7,
    1, 31, 22, NA, 72.5, 2, 54, 18, NA, 93.1, NA, NA, 4, NA, 115.9, NA, NA, 23, NA,
    83.8, NA, NA, 9, NA, 113.3, NA, NA, 8, NA, 109.4, NA, NA, NA, NA, NA), ncol = 5,
    byrow = TRUE)
colnames(B2) <- paste0("V", 1:5)

# Compute augmented matrix (scaled = TRUE for covariance)
aug_matrix <- augmented_parameter_matrix(B2, scaled = TRUE)

# Print result (rounded for readability)
print(round(aug_matrix, 2))
##             (Intercept)     V1     V2     V3     V4     V5
## (Intercept)       -0.08   6.00  45.00  11.77  39.00  95.42
## V1                 6.00  11.69   4.54 -13.15 -14.62  20.42
## V2                45.00   4.54 156.62   3.85 -87.69 115.15
## V3                11.77 -13.15   3.85  37.87   3.38 -47.56
## V4                39.00 -14.62 -87.69   3.38 104.62 -84.48
## V5                95.42  20.42 115.15 -47.56 -84.48 208.90

3.7.2.2 Maximization Step (M-step)

Update the parameter estimates Θ (means μ, covariance matrix Σ, etc.) to maximize the expected log-likelihood computed in the E-step. Expected sufficient statistics (filled in or “completed” data from the E-step) are used to compute new parameter estimates. In practice, this means calculating updated estimates such as:

The updated mean vector is given by: \[ \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \hat{x}^{(i)} \]

where \(\hat{x}^{(i)}\) are the observed or imputed data points.


The updated covariance matrix is: \[ \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{x}^{(i)} - \hat{\mu} \right) \left( \hat{x}^{(i)} - \hat{\mu} \right)^T \]

These updated parameters are then fed back into the next E-step to recompute the expected values based on improved parameter estimates.


EM algorithm cycles:

1. E-step: Calculate expected values of missing data and sufficient statistics using current parameter estimates: \[ \Theta = \begin{bmatrix} -1 & \hat{\mu}^T \\ \hat{\mu} & \hat{\Sigma} \end{bmatrix} \] 2. M-step: Maximize expected log-likelihood to update parameter estimates, the mean vector \(\hat{\mu}\) and the covariance matrix \(\hat{\Sigma}\). 3. Repeat until convergence.


This function implements the Expectation-Maximization algorithm to estimate mean and covariance for incomplete multivariate normal data with arbitrary missing patterns.

#' EM Algorithm for Multivariate Normal Data with Missing Values
#'
#' Estimates mean vector and covariance matrix using the 
#'     Expectation-Maximization (EM) algorithm.
#'     
#' Uses the user-defined `augmented_parameter_matrix()` function for parameter estimation in the M-step.
#'
#' @param data A numeric matrix or data frame with missing values (NA).
#' @param tol Convergence threshold for change in parameters (default: 1e-6).
#' @param max_iter Maximum number of EM iterations (default: 500).
#' @param verbose Whether to print iteration info and log-likelihood (default: TRUE).
#' @param init_method Initialization strategy: 'mean' (default) or 'random'.
#'
#' @return A list with class 'em_result' containing:
#' \describe{
#'   \item{mu}{Estimated mean vector}
#'   \item{sigma}{Estimated covariance matrix}
#'   \item{iterations}{Number of iterations performed}
#'   \item{converged}{Logical; whether the algorithm converged}
#'   \item{loglik}{Log-likelihood values across iterations}
#'   \item{imputed_data}{Final imputed dataset}
#' }
#' @export
EM_algorithm <- function(data, tol = 1e-06, max_iter = 500, verbose = TRUE, init_method = "mean") {

    source("augmented_parameter_matrix.R")

    # Ensure matrix format
    if (!is.matrix(data))
        data <- as.matrix(data)
    n <- nrow(data)
    p <- ncol(data)

    # Remove rows that are entirely missing
    all_na_rows <- apply(data, 1, function(x) all(is.na(x)))
    data <- data[!all_na_rows, , drop = FALSE]

    # ---- Initialization ----
    if (init_method == "mean") {
        # Impute each variable with its observed mean
        X_imp <- apply(data, 2, function(x) {
            x[is.na(x)] <- mean(x, na.rm = TRUE)
            x
        })

        # Use augmented_param_matrix to estimate mean and covariance
        aug <- augmented_parameter_matrix(X_imp, scaled = TRUE)
        mu <- aug[1, -1]  # First row (excluding intercept column) = mean vector
        sigma <- aug[-1, -1]  # Lower-right block = covariance matrix

    } else if (init_method == "random") {
        # Random initialization of parameters
        mu <- rnorm(p)
        sigma <- diag(p)

        # Random normal imputation for missing values
        X_imp <- apply(data, 2, function(x) {
            x[is.na(x)] <- rnorm(sum(is.na(x)), mean = 0, sd = 1)
            x
        })

    } else {
        stop("init_method must be either 'mean' or 'random'")
    }

    # Store log-likelihood values across iterations
    loglik_trace <- numeric(max_iter)
    converged <- FALSE

    # ---- EM Iterations ----
    for (iter in 1:max_iter) {
        # --- E-step: Impute missing values ---
        for (i in 1:nrow(data)) {
            missing <- is.na(data[i, ])
            if (any(missing)) {
                obs <- !missing
                # Extract submatrices for observed/missing partitions
                sigma_oo <- sigma[obs, obs, drop = FALSE]
                sigma_mo <- sigma[missing, obs, drop = FALSE]

                # Conditional expectation for missing values
                X_imp[i, missing] <- mu[missing] + sigma_mo %*% solve(sigma_oo) %*%
                  (X_imp[i, obs] - mu[obs])
            }
        }

        # --- M-step: Update parameters using imputed data ---
        aug <- augmented_parameter_matrix(X_imp, scaled = TRUE)
        mu_new <- aug[1, -1]  # Updated mean vector
        sigma_new <- aug[-1, -1]  # Updated covariance matrix

        # Compute maximum absolute parameter changes
        mu_diff <- max(abs(mu_new - mu))
        sigma_diff <- max(abs(sigma_new - sigma))

        # Warn if covariance matrix becomes ill-conditioned
        if (any(eigen(sigma_new, only.values = TRUE)$values <= 0)) {
            warning("Covariance matrix not positive-definite at iteration ", iter)
        }

        # Optional: Print progress and log-likelihood
        if (verbose) {
            loglik_trace[iter] <- sum(mvtnorm::dmvnorm(X_imp, mean = mu_new, sigma = sigma_new,
                log = TRUE))
            cat(sprintf("Iter %4d | μ_diff = %8.2e | Σ_diff = %8.2e | LogLik = %10.4f\n",
                iter, mu_diff, sigma_diff, loglik_trace[iter]))
        }

        # Update parameters for next iteration
        mu <- mu_new
        sigma <- sigma_new

        # Check convergence
        if (mu_diff < tol && sigma_diff < tol) {
            converged <- TRUE
            break
        }
    }

    # ---- Return results ----
    structure(list(mu = mu, sigma = sigma, iterations = iter, converged = converged,
        loglik = if (verbose) loglik_trace[1:iter] else NULL, imputed_data = round(X_imp,
            2)), class = "em_result")
}

# Save to file
dump("EM_algorithm", file = "EM_algorithm.R")
imputed_B2 <- EM_algorithm(B2, verbose = FALSE)
imputed_B2
## $mu
##      V1      V2      V3      V4      V5 
##  6.6552 49.9653 11.7692 27.0471 95.4231 
## 
## $sigma
##         V1       V2       V3        V4       V5
## V1  20.655   23.329 -24.9004  -12.5564   46.953
## V2  23.329  230.510 -15.8174 -246.7323  195.604
## V3 -24.900  -15.817  37.8698   -9.5992  -47.556
## V4 -12.556 -246.732  -9.5992  289.1063 -190.599
## V5  46.953  195.604 -47.5562 -190.5985  208.905
## 
## $iterations
## [1] 332
## 
## $converged
## [1] TRUE
## 
## $loglik
## NULL
## 
## $imputed_data
##          V1    V2 V3    V4    V5
##  [1,]  7.00 26.00  6 60.00  78.5
##  [2,]  1.00 29.00 15 52.00  74.3
##  [3,] 11.00 56.00  8 20.00 104.3
##  [4,] 11.00 31.00  8 47.00  87.6
##  [5,]  7.00 52.00  6 33.00  95.9
##  [6,] 11.00 55.00  9 22.00 109.2
##  [7,]  3.00 71.00 17  0.88 102.7
##  [8,]  1.00 31.00 22 37.91  72.5
##  [9,]  2.00 54.00 18 19.90  93.1
## [10,] 12.89 65.84  4 14.45 115.9
## [11,] -0.47 48.20 23 20.83  83.8
## [12,]  9.99 68.08  9  8.19 113.3
## [13,] 10.11 62.43  8 15.45 109.4
## 
## attr(,"class")
## [1] "em_result"

Second Example

Source: (Enders, 2022) Applied Missing Data Analysis (2nd ed.).

# DATA

# Source: Enders, C. K. (2022). Applied Missing Data Analysis (2nd ed.).

employee <- read.table("employee.dat", header = FALSE)
employeecomplete <- read.table("employeecomplete.dat", header = FALSE)

# Add meaningful column names V1 = Case ID V2 = Department (1-5) V3 = Gender
# (0=male, 1=female) V4 = Married (0=no, 1=yes) V5 = Age in years V6 = Job
# tenure in months V7 = Job level (1-7) V8 = Salary in thousands V9 = Job
# satisfaction (1-7 scale)

colnames(employee) <- c("ID", "Dept", "Gender", "Married", "Age", "Tenure", "JobLevel",
    "Salary", "Satisfaction")

# Add meaningful column names
colnames(employeecomplete) <- c("ID", "Dept", "Gender", "Married", "Age", "Tenure",
    "JobLevel", "Salary", "Satisfaction")
MCAR_Test(employee)

## $statistic
## [1] 340.49
## 
## $df
## [1] 106
## 
## $p.value
## [1] 2.1329e-26
## 
## $alpha
## [1] 0.05
## 
## $decision
## [1] "Reject MCAR"
## 
## $missing_patterns
## [1] 17
## 
## $estimation_method
## [1] "Pairwise-complete (not ML)"
## 
## $correction
## [1] "Bartlett"
## 
## $diagnostics
## $diagnostics$total_missing
## [1] 286
## 
## $diagnostics$pct_missing
## [1] 5.0441
## 
## $diagnostics$worst_variable
## [1] "Age"
## 
## $diagnostics$top_patterns
## patt_str
##               5       8       6   3,5,7       9     8,9     5,9     3,7 3,5,7,8 
##     432      66      39      22      19      17       9       6       5       4 
## 
## $diagnostics$pattern_matrix
##        ID  Dept Gender Married   Age Tenure JobLevel Salary Satisfaction
## 1   FALSE FALSE  FALSE   FALSE FALSE  FALSE    FALSE  FALSE        FALSE
## 2   FALSE FALSE  FALSE   FALSE  TRUE  FALSE    FALSE  FALSE        FALSE
## 6   FALSE FALSE   TRUE   FALSE  TRUE  FALSE     TRUE  FALSE        FALSE
## 9   FALSE FALSE  FALSE   FALSE FALSE   TRUE    FALSE  FALSE        FALSE
## 13  FALSE FALSE  FALSE   FALSE FALSE  FALSE    FALSE   TRUE        FALSE
## 18  FALSE FALSE  FALSE   FALSE FALSE   TRUE    FALSE   TRUE        FALSE
## 31  FALSE FALSE  FALSE   FALSE FALSE  FALSE    FALSE   TRUE         TRUE
## 33  FALSE FALSE  FALSE   FALSE  TRUE  FALSE    FALSE   TRUE         TRUE
## 35  FALSE FALSE   TRUE   FALSE  TRUE  FALSE     TRUE   TRUE         TRUE
## 36  FALSE FALSE  FALSE   FALSE FALSE   TRUE    FALSE   TRUE         TRUE
## 133 FALSE FALSE  FALSE   FALSE FALSE  FALSE    FALSE  FALSE         TRUE
## 135 FALSE FALSE   TRUE   FALSE  TRUE  FALSE     TRUE  FALSE         TRUE
## 138 FALSE FALSE  FALSE   FALSE  TRUE  FALSE    FALSE  FALSE         TRUE
## 145 FALSE FALSE   TRUE   FALSE  TRUE  FALSE     TRUE   TRUE        FALSE
## 172 FALSE FALSE   TRUE   FALSE FALSE  FALSE     TRUE  FALSE        FALSE
## 315 FALSE FALSE   TRUE   FALSE  TRUE  FALSE    FALSE  FALSE        FALSE
## 399 FALSE FALSE  FALSE   FALSE  TRUE  FALSE    FALSE   TRUE        FALSE
## 
## 
## $data
##      ID Dept Gender Married Age Tenure JobLevel Salary Satisfaction
## 1     1    1      0       1  32     11        3     18          3.5
## 2     2    1      1       1  NA     13        4     18          3.5
## 3     3    1      1       1  30      9        4     18          3.5
## 4     4    1      1       1  29      8        3     18          3.5
## 5     5    1      1       0  26      7        4     18          3.5
## 6     6    1     NA       0  NA     10       NA     18          3.5
## 7     7    2      0       0  NA     11        5     17          4.0
## 8     8    2      0       1  22      9        3     17          4.0
## 9     9    2      0       0  23     NA        2     17          4.0
## 10   10    2      0       0  32      9        4     17          4.0
## 11   11    2      0       1  NA     12        5     17          4.0
## 12   12    2      1       0  36     13        4     17          4.0
## 13   13    3      0       0  16      0        3     NA          6.0
## 14   14    3      1       0  25      8        3     NA          6.0
## 15   15    3      0       1  31     15        3     NA          6.0
## 16   16    3      0       1  27      9        6     NA          6.0
## 17   17    3      0       1  30      9        4     NA          6.0
## 18   18    3      0       1  23     NA        3     NA          6.0
## 19   19    4      0       0  NA     13        3     20          4.5
## 20   20    4      1       0  26     11        5     20          4.5
## 21   21    4      0       1  27      8        4     20          4.5
## 22   22    4      1       1  42     10        4     20          4.5
## 23   23    4      0       1  24      8        5     20          4.5
## 24   24    4      0       0  37     16        6     20          4.5
## 25   25    5      0       0  33     12        4     23          6.5
## 26   26    5      0       1  37     13        4     23          6.5
## 27   27    5      0       1  29     NA        4     23          6.5
## 28   28    5      0       1  32      9        5     23          6.5
## 29   29    5      0       0  24      6        4     23          6.5
## 30   30    5      1       0  31      6        3     23          6.5
## 31   31    6      1       0  21      2        3     NA           NA
## 32   32    6      1       0  29      8        3     NA           NA
## 33   33    6      0       1  NA     12        5     NA           NA
## 34   34    6      0       1  25      9        5     NA           NA
## 35   35    6     NA       0  NA     12       NA     NA           NA
## 36   36    6      1       1  36     NA        5     NA           NA
## 37   37    7      1       1  29      4        2     12          2.0
## 38   38    7      0       1  26     NA        4     12          2.0
## 39   39    7      1       0  26      6        2     12          2.0
## 40   40    7      0       1  26     10        4     12          2.0
## 41   41    7      1       1  31      9        4     12          2.0
## 42   42    7      0       0  33      8        4     12          2.0
## 43   43    8     NA       0  NA      8       NA     16          3.5
## 44   44    8      0       0  26     12        5     16          3.5
## 45   45    8      0       1  25      5        5     16          3.5
## 46   46    8     NA       1  NA      9       NA     16          3.5
## 47   47    8      0       1  29      5        3     16          3.5
## 48   48    8      0       0  27     NA        4     16          3.5
## 49   49    9      0       0  27      7        5     23          4.0
## 50   50    9      0       0  28      9        5     23          4.0
## 51   51    9      0       0  26      7        4     23          4.0
## 52   52    9      0       1  29     12        4     23          4.0
## 53   53    9      1       1  28     10        4     23          4.0
## 54   54    9      0       0  25     NA        5     23          4.0
## 55   55   10      1       1  33      8        4     24          6.5
## 56   56   10      1       0  27      9        3     24          6.5
## 57   57   10      0       1  29      6        4     24          6.5
## 58   58   10      0       1  36     14        4     24          6.5
## 59   59   10      0       0  24     NA        4     24          6.5
## 60   60   10      0       1  22      5        3     24          6.5
## 61   61   11      1       0  28      8        3     15          5.0
## 62   62   11      0       1  26      7        6     15          5.0
## 63   63   11      0       0  30     11        4     15          5.0
## 64   64   11      0       1  28      8        4     15          5.0
## 65   65   11      0       1  28      8        3     15          5.0
## 66   66   11      1       0  25      5        3     15          5.0
## 67   67   12      0       1  36     10        5     18          6.0
## 68   68   12      0       1  32     NA        3     18          6.0
## 69   69   12      0       0  30     11        4     18          6.0
## 70   70   12      0       0  30     12        7     18          6.0
## 71   71   12      1       0  34      9        3     18          6.0
## 72   72   12     NA       1  NA      7       NA     18          6.0
## 73   73   13      0       1  24      5        5     19          4.5
## 74   74   13      0       1  23      2        4     19          4.5
## 75   75   13      0       0  25     10        4     19          4.5
## 76   76   13      1       0  27     11        4     19          4.5
## 77   77   13      1       1  24      6        3     19          4.5
## 78   78   13     NA       1  NA      8       NA     19          4.5
## 79   79   14      1       1  24      3        2     20          5.0
## 80   80   14      0       1  30     NA        3     20          5.0
## 81   81   14      1       0  22      7        2     20          5.0
## 82   82   14      1       0  27      8        4     20          5.0
## 83   83   14      0       0  38     10        4     20          5.0
## 84   84   14      1       1  30      4        4     20          5.0
## 85   85   15      0       1  34      9        3     15          4.5
## 86   86   15      0       1  24     13        5     15          4.5
## 87   87   15      1       1  21      5        3     15          4.5
## 88   88   15      0       1  33     12        3     15          4.5
## 89   89   15      0       1  35     10        5     15          4.5
## 90   90   15      1       1  23     NA        1     15          4.5
## 91   91   16      0       1  28      9        5     24          4.5
## 92   92   16     NA       1  NA      9       NA     24          4.5
## 93   93   16      1       0  24      4        3     24          4.5
## 94   94   16      0       1  34     12        5     24          4.5
## 95   95   16      0       0  34      8        5     24          4.5
## 96   96   16      1       0  NA     12        6     24          4.5
## 97   97   17      1       0  29      4        3     15          5.5
## 98   98   17     NA       0  NA     11       NA     15          5.5
## 99   99   17      0       0  26     10        5     15          5.5
## 100 100   17      0       1  33     10        4     15          5.5
## 101 101   17      1       1  29     NA        4     15          5.5
## 102 102   17      0       0  NA     12        6     15          5.5
## 103 103   18      0       1  NA     10        1     16          4.0
## 104 104   18      1       0  22     10        4     16          4.0
## 105 105   18      0       1  28      9        3     16          4.0
## 106 106   18      1       1  32     13        4     16          4.0
## 107 107   18      0       0  27     10        2     16          4.0
## 108 108   18     NA       1  NA     17       NA     16          4.0
## 109 109   19      0       0  28      8        5     22          5.0
## 110 110   19      0       0  33     13        4     22          5.0
## 111 111   19      0       1  NA      9        4     22          5.0
## 112 112   19      0       0  26      4        7     22          5.0
## 113 113   19      0       1  33     NA        3     22          5.0
## 114 114   19      0       0  NA     11        4     22          5.0
## 115 115   20      0       0  23      9        3     14          3.0
## 116 116   20      1       0  22      6        4     14          3.0
## 117 117   20      0       0  29     12        4     14          3.0
## 118 118   20      0       0  26     15        4     14          3.0
## 119 119   20      1       0  29     10        3     14          3.0
## 120 120   20      0       0  24      9        4     14          3.0
## 121 121   21      0       1  25      7        3     19          5.0
## 122 122   21      1       0  NA     17        4     19          5.0
## 123 123   21      1       0  17      5        2     19          5.0
## 124 124   21      0       0  35     NA        5     19          5.0
## 125 125   21      0       0  26      9        5     19          5.0
## 126 126   21      0       1  37     10        3     19          5.0
## 127 127   22      0       1  39     10        5     NA          7.0
## 128 128   22      0       1  31      7        4     NA          7.0
## 129 129   22      0       1  32      6        3     NA          7.0
## 130 130   22      0       1  33      9        6     NA          7.0
## 131 131   22      0       0  32      5        3     NA          7.0
## 132 132   22      0       1  31     NA        3     NA          7.0
## 133 133   23      0       1  28     11        4     24           NA
## 134 134   23      0       0  22     10        4     24           NA
## 135 135   23     NA       0  NA     14       NA     24           NA
## 136 136   23      0       1  33     11        5     24           NA
## 137 137   23      0       1  30      9        4     24           NA
## 138 138   23      0       1  NA     13        6     24           NA
## 139 139   24      0       0  28     10        3     19          8.0
## 140 140   24      0       1  36     14        4     19          8.0
## 141 141   24      1       0  27     NA        3     19          8.0
## 142 142   24      0       0  26     11        2     19          8.0
## 143 143   24      0       0  24      9        1     19          8.0
## 144 144   24      0       1  NA     13        4     19          8.0
## 145 145   25     NA       1  NA     10       NA     NA          7.0
## 146 146   25      1       0  26      8        1     NA          7.0
## 147 147   25      0       0  34      6        4     NA          7.0
## 148 148   25      1       0  26      6        2     NA          7.0
## 149 149   25     NA       1  NA     10       NA     NA          7.0
## 150 150   25      0       0  28     NA        4     NA          7.0
## 151 151   26      0       1  29      9        4     22          5.0
## 152 152   26      0       0  NA     12        4     22          5.0
## 153 153   26      1       0  23     10        3     22          5.0
## 154 154   26      0       0  22      9        5     22          5.0
## 155 155   26      0       1  31      7        3     22          5.0
## 156 156   26      0       1  26     10        4     22          5.0
## 157 157   27      1       1  34     NA        3     21          4.5
## 158 158   27      0       1  32     12        4     21          4.5
## 159 159   27      0       1  NA     13        6     21          4.5
## 160 160   27      1       1  29      7        5     21          4.5
## 161 161   27      0       0  20      4        5     21          4.5
## 162 162   27      0       0  29     12        5     21          4.5
## 163 163   28      1       1  25      3        5     26          6.0
## 164 164   28      0       1  30     NA        5     26          6.0
## 165 165   28      0       1  29      8        7     26          6.0
## 166 166   28      0       0  30      8        5     26          6.0
## 167 167   28      1       0  26      9        3     26          6.0
## 168 168   28      0       1  NA     11        7     26          6.0
## 169 169   29      0       1  30     12        4     19          3.5
## 170 170   29      0       1  30     12        3     19          3.5
## 171 171   29      1       1  32     NA        3     19          3.5
## 172 172   29     NA       0  23      5       NA     19          3.5
## 173 173   29     NA       0  NA     14       NA     19          3.5
## 174 174   29      0       1  35     13        5     19          3.5
## 175 175   30      1       0  26      9        4     21          4.0
## 176 176   30      1       0  22     14        7     21          4.0
## 177 177   30      1       0  30     NA        2     21          4.0
## 178 178   30      0       0  30     14        6     21          4.0
## 179 179   30      0       1  29     14        5     21          4.0
## 180 180   30      1       1  32     NA        3     21          4.0
## 181 181   31      1       0  28     10        3     21           NA
## 182 182   31      0       0  NA     14        4     21           NA
## 183 183   31      1       1  22      8        5     21           NA
## 184 184   31      0       1  35     11        3     21           NA
## 185 185   31      0       0  NA     15        5     21           NA
## 186 186   31      0       1  28     12        7     21           NA
## 187 187   32      1       1  23      9        3     15          4.5
## 188 188   32      0       0  28      9        3     15          4.5
## 189 189   32      0       0  29     12        3     15          4.5
## 190 190   32      1       0  25      6        4     15          4.5
## 191 191   32      1       0  29     11        3     15          4.5
## 192 192   32      0       0  25     NA        4     15          4.5
## 193 193   33      1       1  32      8        3     19          5.0
## 194 194   33      0       1  31     12        6     19          5.0
## 195 195   33      0       0  NA     14        4     19          5.0
## 196 196   33      0       0  33     13        5     19          5.0
## 197 197   33      0       1  NA     10        3     19          5.0
## 198 198   33      0       1  32     11        3     19          5.0
## 199 199   34      1       1  27      9        4     24          5.5
## 200 200   34      0       1  30     11        4     24          5.5
## 201 201   34      1       0  NA     11        5     24          5.5
## 202 202   34      0       1  30     NA        6     24          5.5
## 203 203   34      0       1  31     12        4     24          5.5
## 204 204   34      0       1  NA     13        3     24          5.5
## 205 205   35      0       0  NA     15        7     33          3.0
## 206 206   35      0       1  38     12        6     33          3.0
## 207 207   35      0       1  32      8        5     33          3.0
## 208 208   35      0       1  30      7        3     33          3.0
## 209 209   35      0       1  NA     15        7     33          3.0
## 210 210   35      0       0  18      4        2     33          3.0
## 211 211   36      0       0  22      8        5     26          6.0
## 212 212   36      0       0  23      9        5     26          6.0
## 213 213   36      0       1  31     11        5     26          6.0
## 214 214   36      0       0  29     10        3     26          6.0
## 215 215   36      0       0  20     10        5     26          6.0
## 216 216   36      0       1  20      9        5     26          6.0
## 217 217   37      1       0  36      6        4     17          6.0
## 218 218   37      0       1  32      9        3     17          6.0
## 219 219   37      1       0  26      7        2     17          6.0
## 220 220   37      0       0  26     11        2     17          6.0
## 221 221   37      1       0  23     10        4     17          6.0
## 222 222   37      0       0  31     11        5     17          6.0
## 223 223   38      1       1  30      7        3     27         10.0
## 224 224   38      0       0  36     11        5     27         10.0
## 225 225   38      0       1  28     11        3     27         10.0
## 226 226   38      0       1  26     10        4     27         10.0
## 227 227   38      0       0  31      9        5     27         10.0
## 228 228   38      0       0  25      8        4     27         10.0
## 229 229   39      0       0  30     12        6     19          4.0
## 230 230   39      0       1  23      8        4     19          4.0
## 231 231   39      0       1  40     15        5     19          4.0
## 232 232   39      1       1  27     12        3     19          4.0
## 233 233   39      1       0  22      9        5     19          4.0
## 234 234   39      1       1  NA     15        4     19          4.0
## 235 235   40      0       1  30     10        4     15          3.5
## 236 236   40     NA       1  NA      8       NA     15          3.5
## 237 237   40      1       1  26     10        2     15          3.5
## 238 238   40      0       0  25     14        6     15          3.5
## 239 239   40      0       1  NA     15        5     15          3.5
## 240 240   40      0       0  NA     16        6     15          3.5
## 241 241   41      0       1  NA     13        5     23          5.5
## 242 242   41      0       0  30     10        4     23          5.5
## 243 243   41      0       1  NA     13        4     23          5.5
## 244 244   41      1       0  24      8        2     23          5.5
## 245 245   41      0       1  41     17        5     23          5.5
## 246 246   41      0       0  28      8        6     23          5.5
## 247 247   42      1       1  28     11        5     18          5.0
## 248 248   42      1       1  32      7        4     18          5.0
## 249 249   42      0       0  NA     12        4     18          5.0
## 250 250   42      1       0  NA     14        3     18          5.0
## 251 251   42      0       1  27      9        2     18          5.0
## 252 252   42      0       0  NA     13        5     18          5.0
## 253 253   43      0       0  NA     12        4     15          4.5
## 254 254   43      0       1  27      9        5     15          4.5
## 255 255   43      0       1  36     11        5     15          4.5
## 256 256   43      0       0  23     14        4     15          4.5
## 257 257   43      1       0  29     13        4     15          4.5
## 258 258   43      1       1  23     13        3     15          4.5
## 259 259   44      1       0  28     10        2     21          6.0
## 260 260   44      0       0  NA     15        6     21          6.0
## 261 261   44      0       1  26     10        4     21          6.0
## 262 262   44      0       1  31     10        4     21          6.0
## 263 263   44      1       1  NA     15        7     21          6.0
## 264 264   44      0       0  25      8        6     21          6.0
## 265 265   45      0       0  25     11        4     12          3.0
## 266 266   45      1       1  25      9        1     12          3.0
## 267 267   45      1       1  29      7        4     12          3.0
## 268 268   45      0       0  22      9        4     12          3.0
## 269 269   45      1       0  19      5        1     12          3.0
## 270 270   45      1       1  28     11        4     12          3.0
## 271 271   46      1       0  21      6        2     17          4.5
## 272 272   46      1       0  35     12        3     17          4.5
## 273 273   46      0       0  28      8        3     17          4.5
## 274 274   46      0       1  33     14        5     17          4.5
## 275 275   46     NA       0  NA      9       NA     17          4.5
## 276 276   46     NA       0  NA     10       NA     17          4.5
## 277 277   47     NA       0  NA      8       NA     NA          4.5
## 278 278   47      1       1  37      5        3     NA          4.5
## 279 279   47      1       0  32      7        2     NA          4.5
## 280 280   47      0       0  27     10        3     NA          4.5
## 281 281   47      0       1  33      9        2     NA          4.5
## 282 282   47      1       1  32      7        6     NA          4.5
## 283 283   48      0       0  24      5        4     24          3.5
## 284 284   48      0       1  28     11        6     24          3.5
## 285 285   48      0       0  23      7        2     24          3.5
## 286 286   48      0       0  30     11        5     24          3.5
## 287 287   48      0       0  22      5        6     24          3.5
## 288 288   48      0       0  27      8        3     24          3.5
## 289 289   49      0       0  31     10        6     21          4.5
## 290 290   49      1       0  24      9        4     21          4.5
## 291 291   49      1       0  36     14        4     21          4.5
## 292 292   49      0       1  33      9        5     21          4.5
## 293 293   49      0       1  31     15        7     21          4.5
## 294 294   49      0       1  27      6        5     21          4.5
## 295 295   50      1       0  27      9        5     22          3.0
## 296 296   50     NA       0  28      8       NA     22          3.0
## 297 297   50      0       0  NA     13        4     22          3.0
## 298 298   50      0       0  28     14        4     22          3.0
## 299 299   50      0       1  34     13        6     22          3.0
## 300 300   50      1       0  27      8        3     22          3.0
## 301 301   51      0       0  28     12        6     26           NA
## 302 302   51      0       1  NA     17        6     26           NA
## 303 303   51      0       0  35     13        6     26           NA
## 304 304   51      0       1  NA     17        6     26           NA
## 305 305   51      0       1  23      5        5     26           NA
## 306 306   51      0       1  NA     11        4     26           NA
## 307 307   52      0       1  32      8        5     19          5.5
## 308 308   52      0       0  NA     10        4     19          5.5
## 309 309   52      1       1  NA     13        6     19          5.5
## 310 310   52      0       1  28      8        4     19          5.5
## 311 311   52      0       0  31      9        4     19          5.5
## 312 312   52      1       1  37      9        4     19          5.5
## 313 313   53      0       1  27      7        3     21          3.5
## 314 314   53      0       0  NA     10        4     21          3.5
## 315 315   53     NA       0  NA      8        1     21          3.5
## 316 316   53      0       0  32      9        4     21          3.5
## 317 317   53      0       0  29     10        5     21          3.5
## 318 318   53      0       0  29      8        4     21          3.5
## 319 319   54      0       1  30     16        5     21          6.5
## 320 320   54      1       1  NA     13        3     21          6.5
## 321 321   54      0       1  24     10        5     21          6.5
## 322 322   54      0       1  33     12        5     21          6.5
## 323 323   54      0       1  NA     17        5     21          6.5
## 324 324   54      0       1  27     15        4     21          6.5
## 325 325   55      0       0  28      7        3     NA          6.0
## 326 326   55      0       0  26      6        2     NA          6.0
## 327 327   55      0       0  27      5        4     NA          6.0
## 328 328   55      0       0  28     10        4     NA          6.0
## 329 329   55      0       0  24      8        4     NA          6.0
## 330 330   55      0       0  27      6        4     NA          6.0
## 331 331   56      0       1  26     10        5     14          5.0
## 332 332   56      1       0  23      8        2     14          5.0
## 333 333   56      1       0  19      8        2     14          5.0
## 334 334   56     NA       1  NA     11        1     14          5.0
## 335 335   56      1       0  23      9        2     14          5.0
## 336 336   56     NA       0  NA     15       NA     14          5.0
## 337 337   57      0       0  29     12        5     12          3.5
## 338 338   57      0       0  26      9        5     12          3.5
## 339 339   57      1       1  25     16        3     12          3.5
## 340 340   57      0       0  25     11        3     12          3.5
## 341 341   57      1       1  27     12        3     12          3.5
## 342 342   57      0       0  31     10        2     12          3.5
## 343 343   58      0       0  20      8        3     18          5.0
## 344 344   58      0       1  29      8        5     18          5.0
## 345 345   58      0       0  33      9        5     18          5.0
## 346 346   58      0       1  32     13        5     18          5.0
## 347 347   58      1       0  28      7        2     18          5.0
## 348 348   58      0       0  29     10        4     18          5.0
## 349 349   59      0       1  30     11        5     13          3.5
## 350 350   59      0       0  28      7        4     13          3.5
## 351 351   59      0       0  29     13        2     13          3.5
## 352 352   59      0       0  21      9        3     13          3.5
## 353 353   59      1       0  26     10        4     13          3.5
## 354 354   59     NA       0  NA      6       NA     13          3.5
## 355 355   60      1       1  27      6        4     22          6.0
## 356 356   60      0       1  28     11        6     22          6.0
## 357 357   60      1       0  27      5        2     22          6.0
## 358 358   60      0       0  24      8        3     22          6.0
## 359 359   60      0       1  25      4        3     22          6.0
## 360 360   60      0       0  31      9        3     22          6.0
## 361 361   61      0       1  32     11        2     25          5.0
## 362 362   61      1       1  NA     13        3     25          5.0
## 363 363   61      1       0  33     10        4     25          5.0
## 364 364   61      0       0  23      7        3     25          5.0
## 365 365   61      0       1  30     11        4     25          5.0
## 366 366   61     NA       1  30      6       NA     25          5.0
## 367 367   62      0       0  22      4        2     15          2.0
## 368 368   62      0       0  29     11        6     15          2.0
## 369 369   62      1       1  28     12        4     15          2.0
## 370 370   62      1       1  29     12        4     15          2.0
## 371 371   62      1       0  22      6        4     15          2.0
## 372 372   62      0       0  30     14        5     15          2.0
## 373 373   63     NA       0  NA      6       NA     24          5.0
## 374 374   63      0       0  32      8        6     24          5.0
## 375 375   63      0       0  25      9        4     24          5.0
## 376 376   63      0       1  31      8        4     24          5.0
## 377 377   63      0       1  28     12        6     24          5.0
## 378 378   63      0       0  30      5        4     24          5.0
## 379 379   64      0       0  31     13        5     17          5.0
## 380 380   64      1       1  28     13        5     17          5.0
## 381 381   64      0       0  26      9        5     17          5.0
## 382 382   64      1       0  27      6        2     17          5.0
## 383 383   64      1       1  NA     14        4     17          5.0
## 384 384   64     NA       0  28      6       NA     17          5.0
## 385 385   65      0       0  26      8        4     19          4.5
## 386 386   65      1       0  27      8        4     19          4.5
## 387 387   65      0       0  22     10        4     19          4.5
## 388 388   65      0       1  NA     11        5     19          4.5
## 389 389   65      0       1  24      8        2     19          4.5
## 390 390   65      0       0  32     10        6     19          4.5
## 391 391   66      0       1  29      6        4     25          6.0
## 392 392   66      0       1  29      8        4     25          6.0
## 393 393   66      0       0  NA     12        6     25          6.0
## 394 394   66      0       0  37     10        5     25          6.0
## 395 395   66      0       0  38     13        5     25          6.0
## 396 396   66      0       1  29     11        4     25          6.0
## 397 397   67      0       0  24     11        2     NA          3.0
## 398 398   67      1       0  24      9        4     NA          3.0
## 399 399   67      1       0  NA      9        4     NA          3.0
## 400 400   67      1       1  24     10        6     NA          3.0
## 401 401   67      0       0  24      5        2     NA          3.0
## 402 402   67      0       0  28     11        4     NA          3.0
## 403 403   68      0       0  23      6        3     NA          7.0
## 404 404   68      0       1  34      7        3     NA          7.0
## 405 405   68      0       0  33      7        5     NA          7.0
## 406 406   68      0       0  NA     14        3     NA          7.0
## 407 407   68     NA       1  NA     10       NA     NA          7.0
## 408 408   68      1       1  19      6        3     NA          7.0
## 409 409   69      0       1  NA     13        5     24          7.0
## 410 410   69      0       0  NA     10        6     24          7.0
## 411 411   69      0       1  33     11        3     24          7.0
## 412 412   69      0       0  27      9        4     24          7.0
## 413 413   69      0       1  37      9        4     24          7.0
## 414 414   69      0       1  30     14        6     24          7.0
## 415 415   70      0       1  29      8        4     18          3.0
## 416 416   70      0       1  28      6        3     18          3.0
## 417 417   70      1       0  29     13        5     18          3.0
## 418 418   70      1       1  NA     13        4     18          3.0
## 419 419   70      1       1  35      7        3     18          3.0
## 420 420   70      1       1  28      3        2     18          3.0
## 421 421   71      0       1  35     11        4     26          7.0
## 422 422   71      0       1  31      9        7     26          7.0
## 423 423   71      0       1  29     10        5     26          7.0
## 424 424   71      0       1  32      9        5     26          7.0
## 425 425   71      0       0  30      4        1     26          7.0
## 426 426   71      1       0  28      7        4     26          7.0
## 427 427   72      0       0  25      9        4     22          4.0
## 428 428   72      0       1  27     12        4     22          4.0
## 429 429   72      1       0  NA     14        3     22          4.0
## 430 430   72      1       1  26      7        3     22          4.0
## 431 431   72      0       0  NA     11        5     22          4.0
## 432 432   72      1       0  28      9        3     22          4.0
## 433 433   73      0       1  33      9        5     23          6.0
## 434 434   73      0       0  25      6        5     23          6.0
## 435 435   73      0       0  27      8        3     23          6.0
## 436 436   73     NA       0  NA      9       NA     23          6.0
## 437 437   73      0       1  38      9        3     23          6.0
## 438 438   73      1       0  28      7        4     23          6.0
## 439 439   74     NA       0  NA     10       NA     21          3.5
## 440 440   74      1       0  23     11        4     21          3.5
## 441 441   74      1       1  30      9        4     21          3.5
## 442 442   74      0       1  31      9        3     21          3.5
## 443 443   74      0       1  31     12        3     21          3.5
## 444 444   74      1       0  26      5        3     21          3.5
## 445 445   75     NA       0  NA     10       NA     20          4.0
## 446 446   75      0       0  32      8        5     20          4.0
## 447 447   75      0       1  28     11        5     20          4.0
## 448 448   75      1       0  23      7        2     20          4.0
## 449 449   75      0       1  27      8        5     20          4.0
## 450 450   75      0       1  25      8        3     20          4.0
## 451 451   76      0       1  27      4        4     16          4.0
## 452 452   76      1       0  20      6        2     16          4.0
## 453 453   76      0       0  NA     13        4     16          4.0
## 454 454   76      0       1  NA     13        4     16          4.0
## 455 455   76      1       0  14      4        3     16          4.0
## 456 456   76      0       0  23     11        3     16          4.0
## 457 457   77      1       1  31      5        3     23          7.0
## 458 458   77      0       0  28     10        7     23          7.0
## 459 459   77      1       0  28      6        2     23          7.0
## 460 460   77      0       0  30     14        5     23          7.0
## 461 461   77      0       1  27      7        4     23          7.0
## 462 462   77      0       0  33     11        4     23          7.0
## 463 463   78      1       0  28     12        4     19          4.0
## 464 464   78      1       1  32     10        5     19          4.0
## 465 465   78      0       0  30     11        3     19          4.0
## 466 466   78      0       1  27     17        7     19          4.0
## 467 467   78      0       1  35     11        4     19          4.0
## 468 468   78      0       1  30     14        6     19          4.0
## 469 469   79      0       0  27      6        3     21          6.0
## 470 470   79      1       0  23      4        2     21          6.0
## 471 471   79      0       0  29     12        2     21          6.0
## 472 472   79      0       1  27     11        4     21          6.0
## 473 473   79      1       1  24      4        3     21          6.0
## 474 474   79      0       1  19      9        2     21          6.0
## 475 475   80      0       1  28      6        4     NA           NA
## 476 476   80      1       1  15      8        2     NA           NA
## 477 477   80      0       0  28     11        4     NA           NA
## 478 478   80      1       1  27     11        4     NA           NA
## 479 479   80      0       0  29      8        4     NA           NA
## 480 480   80      0       1  23      7        3     NA           NA
## 481 481   81      0       0  25     11        4     21          5.5
## 482 482   81      0       0  29     12        5     21          5.5
## 483 483   81      0       1  NA     15        7     21          5.5
## 484 484   81      0       0  36     13        6     21          5.5
## 485 485   81      0       1  32     10        5     21          5.5
## 486 486   81      0       0  24     12        6     21          5.5
## 487 487   82      1       1  31      9        4     18          3.0
## 488 488   82      0       0  31     12        4     18          3.0
## 489 489   82      0       0  27      8        6     18          3.0
## 490 490   82      0       0  NA     13        4     18          3.0
## 491 491   82      1       0  NA     16        6     18          3.0
## 492 492   82      0       0  27     13        4     18          3.0
## 493 493   83      1       0  28      7        3     17          5.0
## 494 494   83      0       1  31      9        5     17          5.0
## 495 495   83      0       1  28      8        4     17          5.0
## 496 496   83      1       1  29      8        3     17          5.0
## 497 497   83      0       0  33     11        5     17          5.0
## 498 498   83      1       0  26      6        4     17          5.0
## 499 499   84      1       1  31      6        3     19          3.0
## 500 500   84      0       1  28     10        4     19          3.0
## 501 501   84      1       1  35     14        4     19          3.0
## 502 502   84      0       1  26     11        4     19          3.0
## 503 503   84      0       0  28     10        3     19          3.0
## 504 504   84      1       0  28     15        4     19          3.0
## 505 505   85      0       1  28     10        6     25           NA
## 506 506   85      0       1  30     12        4     25           NA
## 507 507   85      0       0  29     12        3     25           NA
## 508 508   85      0       0  24      6        3     25           NA
## 509 509   85      0       1  34     14        7     25           NA
## 510 510   85      1       1  26      6        4     25           NA
## 511 511   86      0       0  32      9        4     21          5.0
## 512 512   86      0       0  NA     14        4     21          5.0
## 513 513   86      0       1  28      8        4     21          5.0
## 514 514   86      1       1  38      9        5     21          5.0
## 515 515   86      0       0  29      8        4     21          5.0
## 516 516   86      1       1  31     NA        5     21          5.0
## 517 517   87      1       0  24      8        3     24          3.0
## 518 518   87      0       1  40     13        3     24          3.0
## 519 519   87      0       0  26     NA        5     24          3.0
## 520 520   87      0       1  NA     12        5     24          3.0
## 521 521   87      0       1  28     10        6     24          3.0
## 522 522   87      1       1  22      5        2     24          3.0
## 523 523   88      0       0  29     12        3     27          5.5
## 524 524   88      0       0  NA     13        6     27          5.5
## 525 525   88      0       1  30      9        3     27          5.5
## 526 526   88      0       0  NA     13        5     27          5.5
## 527 527   88      0       1  38     14        5     27          5.5
## 528 528   88      0       1  31      8        3     27          5.5
## 529 529   89      1       1  28      9        3     15          4.0
## 530 530   89      0       0  26      4        2     15          4.0
## 531 531   89      1       0  30      6        4     15          4.0
## 532 532   89      1       1  25      5        3     15          4.0
## 533 533   89      1       1  26      6        3     15          4.0
## 534 534   89      0       0  28     10        3     15          4.0
## 535 535   90      0       1  30      7        5     14          5.0
## 536 536   90      1       0  19      6        2     14          5.0
## 537 537   90      1       1  34     14        4     14          5.0
## 538 538   90      1       0  25      8        3     14          5.0
## 539 539   90      1       0  18      6        2     14          5.0
## 540 540   90      0       0  30     11        4     14          5.0
## 541 541   91      0       1  33     14        5     24          5.0
## 542 542   91      0       1  25     10        5     24          5.0
## 543 543   91      0       1  38     13        4     24          5.0
## 544 544   91      0       1  30     11        4     24          5.0
## 545 545   91      0       1  32     11        4     24          5.0
## 546 546   91      0       1  36     12        6     24          5.0
## 547 547   92      1       0  23     10        4     19          5.0
## 548 548   92      0       0  23      9        4     19          5.0
## 549 549   92      0       0  28     13        4     19          5.0
## 550 550   92      1       0  29     11        5     19          5.0
## 551 551   92      0       1  22      2        3     19          5.0
## 552 552   92      0       0  24     13        3     19          5.0
## 553 553   93      0       1  28      9        4     23          5.5
## 554 554   93     NA       0  NA     11       NA     23          5.5
## 555 555   93      0       0  32     10        4     23          5.5
## 556 556   93      0       1  NA     11        5     23          5.5
## 557 557   93      1       1  34      9        4     23          5.5
## 558 558   93      0       1  34     13        7     23          5.5
## 559 559   94      1       1  NA     14        4     23          4.0
## 560 560   94      0       1  21     17        6     23          4.0
## 561 561   94      0       1  27     10        4     23          4.0
## 562 562   94      1       0  28     12        3     23          4.0
## 563 563   94      0       0  32     10        5     23          4.0
## 564 564   94      0       0  31     11        4     23          4.0
## 565 565   95      0       1  25      7        4     25          6.0
## 566 566   95      0       1  30     14        7     25          6.0
## 567 567   95      0       0  28     10        5     25          6.0
## 568 568   95      0       1  31      8        4     25          6.0
## 569 569   95      1       0  33      4        3     25          6.0
## 570 570   95      0       1  29      9        5     25          6.0
## 571 571   96      0       0  39     16        4     24          6.5
## 572 572   96      0       1  26      7        4     24          6.5
## 573 573   96      0       1  NA     17        6     24          6.5
## 574 574   96      0       1  30      8        3     24          6.5
## 575 575   96      0       1  NA     14        6     24          6.5
## 576 576   96      1       0  26     11        2     24          6.5
## 577 577   97      0       1  NA     11        6     21          7.0
## 578 578   97      0       0  25      7        6     21          7.0
## 579 579   97      0       1  NA     13        5     21          7.0
## 580 580   97      0       0  35     13        4     21          7.0
## 581 581   97      0       0  NA     15        5     21          7.0
## 582 582   97      1       0  31     15        6     21          7.0
## 583 583   98      1       1  31      8        3     16          5.0
## 584 584   98      1       0  24      7        1     16          5.0
## 585 585   98      1       0  NA     16        2     16          5.0
## 586 586   98      1       0  27      6        1     16          5.0
## 587 587   98      0       0  34      7        4     16          5.0
## 588 588   98      0       0  32      7        4     16          5.0
## 589 589   99      1       0  30      7        3     16          5.0
## 590 590   99      0       1  25      7        3     16          5.0
## 591 591   99      0       1  29      8        4     16          5.0
## 592 592   99      1       1  27      9        3     16          5.0
## 593 593   99      1       1  28     12        6     16          5.0
## 594 594   99      0       1  28      7        4     16          5.0
## 595 595  100      1       1  24      9        2     NA          3.0
## 596 596  100      0       0  23      9        3     NA          3.0
## 597 597  100      1       0  24      6        3     NA          3.0
## 598 598  100      1       0  21      5        2     NA          3.0
## 599 599  100      0       0  21      8        2     NA          3.0
## 600 600  100      1       0  27      9        3     NA          3.0
## 601 601  101      0       1  29      7        4     22          7.0
## 602 602  101      1       1  28      8        3     22          7.0
## 603 603  101      0       1  33     10        5     22          7.0
## 604 604  101      1       0  24      7        2     22          7.0
## 605 605  101      0       1  35      9        5     22          7.0
## 606 606  101      0       0  28      9        3     22          7.0
## 607 607  102     NA       1  25      1       NA     27          5.5
## 608 608  102      0       0  34     12        5     27          5.5
## 609 609  102      0       0  30      7        3     27          5.5
## 610 610  102      0       0  25      5        4     27          5.5
## 611 611  102      0       0  36      8        2     27          5.5
## 612 612  102      0       1  31      7        3     27          5.5
## 613 613  103      0       0  20      7        2     18          5.5
## 614 614  103      0       0  26      4        1     18          5.5
## 615 615  103      0       1  28     10        4     18          5.5
## 616 616  103      0       0  28     10        4     18          5.5
## 617 617  103      1       1  31      7        5     18          5.5
## 618 618  103      0       0  24      9        3     18          5.5
## 619 619  104      0       1  NA     13        5     17          4.5
## 620 620  104      1       0  28     10        3     17          4.5
## 621 621  104      0       1  28      9        3     17          4.5
## 622 622  104      0       1  26     12        3     17          4.5
## 623 623  104      1       0  23      5        3     17          4.5
## 624 624  104      1       1  27      9        4     17          4.5
## 625 625  105      1       0  NA     11        6     21          5.0
## 626 626  105      1       0  28      5        4     21          5.0
## 627 627  105      1       0  17      5        3     21          5.0
## 628 628  105      1       1  28     10        4     21          5.0
## 629 629  105      0       0  NA      9        4     21          5.0
## 630 630  105      1       1  32      5        3     21          5.0
## 
## $N
## [1] 630
## 
## attr(,"class")
## [1] "mcar_test"

To evaluate the nature of the missing data, Little’s MCAR test was performed using pairwise-complete estimation along with Bartlett’s small-sample correction. The test yielded a chi-square statistic of 340.49 with 106 degrees of freedom, resulting in a p-value less than 0.0001. This provides strong evidence against the assumption that the data are missing completely at random (MCAR). The effect size, calculated as the ratio of the test statistic to its degrees of freedom (χ²/df = 3.21), indicates a substantial deviation from MCAR. While large sample sizes can make the test sensitive to small deviations, the observed effect size suggests that the violation is meaningful rather than trivial.

The analysis also revealed 17 distinct patterns of missing data across the dataset, with a total of 286 missing values, corresponding to 5.0% of all data points. The variable with the highest number of missing values was Age, which had 102 missing entries. These findings suggest that the missingness is not random and should be addressed using methods that assume a less restrictive missing data mechanism, such as Missing at Random (MAR).

Given the strong evidence against MCAR and the presence of structured missingness, single imputation methods would be inappropriate, as they tend to underestimate variability and distort relationships among variables. Instead, multiple imputation (MI) is recommended. MI accounts for uncertainty in the imputed values by generating several plausible datasets and combining results across them, leading to more accurate estimates and valid statistical inference under the MAR assumption. This approach is particularly well-suited to the current dataset, given the moderate amount of missing data and the identifiable patterns of missingness.


imputed_employee <- EM_algorithm(employee, verbose = FALSE)


# Columns to force as integers 
integer_cols <- c("Gender", "Married",  "Age", "JobLevel", "Dept", "Salary")  

for (col in integer_cols) {
  if (col %in% colnames(imputed_employee$imputed_data)) {
    # For binary variables (e.g., Gender, Married), threshold at 0.5
    if (col %in% c("Gender", "Married")) {
      imputed_employee$imputed_data[, col] <- as.integer(imputed_employee$imputed_data[, col] > 0.5)
    } 
    # For ordinal/numeric integers (e.g., JobLevel, Dept), round and convert
    else {
      imputed_employee$imputed_data[, col] <- as.integer(round(imputed_employee$imputed_data[, col]))
    }
  }
}

This function computes RMSE, MAE, bias, normalized errors, and correlation for imputed missing values compared to true values, for each variable.

#' Compare Imputation Quality on Missing Data
#'
#' Computes RMSE, MAE, bias, normalized errors, and correlation for imputed missing values
#' compared to true values, for each variable.
#'
#' @param original Dataset with missing values (NA).
#' @param imputed Completed dataset after imputation.
#' @param true_data (Optional) True complete dataset with no missing values.
#'
#' @return A data frame summarizing imputation quality statistics for each variable.
#' @export
compare_stats_imputation <- function(original, imputed, true_data = NULL) {
    if (!identical(dim(original), dim(imputed)))
        stop("Datasets must have identical dimensions")

    original <- as.matrix(original)
    imputed <- as.matrix(imputed)
    if (!is.null(true_data)) {
        true_data <- as.matrix(true_data)
        if (!identical(dim(original), dim(true_data)))
            stop("true_data must match dimensions")
    }

    colnames(original) <- colnames(original) %||% paste0("V", seq_len(ncol(original)))
    colnames(imputed) <- colnames(imputed) %||% colnames(original)
    if (!is.null(true_data))
        colnames(true_data) <- colnames(true_data) %||% colnames(original)

    stats <- lapply(seq_len(ncol(original)), function(i) {
        orig <- original[, i]
        imp <- imputed[, i]
        missing_mask <- is.na(orig)
        n_missing <- sum(missing_mask)

        if (!is.null(true_data) && n_missing > 0) {
            true_vals <- true_data[missing_mask, i]
            imp_vals <- imp[missing_mask]
            sd_true <- sd(true_vals)

            rmse_miss <- sqrt(mean((imp_vals - true_vals)^2))
            mae_miss <- mean(abs(imp_vals - true_vals))
            bias_miss <- mean(imp_vals - true_vals)
            corr_miss <- if (length(true_vals) > 1)
                cor(imp_vals, true_vals) else NA
            rmse_miss_norm <- if (sd_true > 0)
                rmse_miss/sd_true else NA
            mae_miss_norm <- if (sd_true > 0)
                mae_miss/sd_true else NA
        } else {
            rmse_miss <- mae_miss <- bias_miss <- corr_miss <- rmse_miss_norm <- mae_miss_norm <- NA
        }

        data.frame(Variable = colnames(original)[i], N_Missing = n_missing, RMSE_Missing = rmse_miss,
            MAE_Missing = mae_miss, Bias_Missing = bias_miss, RMSE_Missing_Norm = rmse_miss_norm,
            MAE_Missing_Norm = mae_miss_norm, Correlation_Missing = corr_miss, stringsAsFactors = FALSE)
    })

    do.call(rbind, stats)
}

This function runs the EM algorithm on data with missing values, fills imputed values back, and computes imputation diagnostics.

#' Analyze EM Algorithm on Dataset
#'
#' Runs the EM algorithm on data with missing values, fills imputed values back,
#' and computes imputation diagnostics.
#'
#' @param data Dataset with missing values (NA).
#' @param true_data (Optional) True complete dataset (no missing values).
#' @param verbose Logical to show EM progress messages.
#'
#' @return A list with convergence info, estimated parameters, diagnostics, and completed data.
#' @export
analyze_em_results <- function(data, true_data = NULL, verbose = FALSE) {
    if (!is.matrix(data))
        data <- as.matrix(data)
    all_na_rows <- apply(data, 1, function(x) all(is.na(x)))
    data_clean <- data[!all_na_rows, , drop = FALSE]

    if (verbose) {
        cat("Removed", sum(all_na_rows), "completely empty rows\n")
        cat("Processing", nrow(data_clean), "rows with partial data\n")
    }

    result <- tryCatch({
        EM_algorithm(data_clean, verbose = verbose)
    }, error = function(e) {
        message("EM Algorithm failed: ", e$message)
        return(NULL)
    })

    if (is.null(result))
        return(NULL)

    full_imputed <- data
    missing_in_clean <- apply(data_clean, 1, function(row) any(is.na(row)))
    valid_rows <- which(!all_na_rows)
    rows_to_impute <- valid_rows[missing_in_clean]

    full_imputed[rows_to_impute, ] <- result$imputed_data[missing_in_clean, ]

    diagnostics <- compare_stats_imputation(data, full_imputed, true_data)

    list(convergence = list(converged = result$converged, iterations = result$iterations),
        parameters = list(means = result$mu, covariance = result$sigma), diagnostics = diagnostics,
        imputed_data = full_imputed)
}

This code provides a comprehensive visualization and reporting system for evaluating the results of a missing data imputation procedure (likely EM imputation).This system provides researchers with both high-level overviews and detailed metrics to thoroughly evaluate imputation performance across multiple dimensions.

It consists of four main components that work together to assess imputation quality:

  1. Text Summary Report (present_imputation_results)
library(knitr)
library(kableExtra)
library(dplyr)

present_imputation_results <- function(analysis) {

    # 1. Display basic convergence info
    cat("========== EM Algorithm Summary ==========\n\n")
    status <- if (isTRUE(analysis$convergence$converged))
        "Converged" else "Not Converged"
    cat(">> Convergence:\n")
    cat(sprintf("  Status:     %s\n", status))
    cat(sprintf("  Iterations: %d\n\n", analysis$convergence$iterations))

    # 2. Display parameter estimates
    if (!is.null(analysis$parameters$means)) {
        cat(">> Estimated Means:\n")
        print(round(analysis$parameters$means, 2))
        cat("\n")
    }

    if (!is.null(analysis$parameters$covariance)) {
        cat(">> Estimated Covariance Matrix:\n")
        print(round(analysis$parameters$covariance, 2))
        cat("\n")
    }

    # 3. Display diagnostics with color-coding

    # 4. Display first/last rows of imputed data
    if (!is.null(analysis$imputed_data)) {
        cat("\n>> Imputed Data Preview:\n")
        cat("First 6 rows:\n")
        print(head(analysis$imputed_data))
        cat("\nLast 6 rows:\n")
        print(tail(analysis$imputed_data))
    }
}


analysis <- analyze_em_results(employee[, -1], true_data = employeecomplete[, -1])
present_imputation_results(analysis)
## ========== EM Algorithm Summary ==========
## 
## >> Convergence:
##   Status:     Converged
##   Iterations: 16
## 
## >> Estimated Means:
##         Dept       Gender      Married          Age       Tenure     JobLevel 
##        53.00         0.32         0.48        28.60         9.61         3.98 
##       Salary Satisfaction 
##        20.15         4.92 
## 
## >> Estimated Covariance Matrix:
##                Dept Gender Married   Age Tenure JobLevel Salary Satisfaction
## Dept         918.67   0.40   -0.42 -2.45  -0.14    -1.14  11.89         1.43
## Gender         0.40   0.21   -0.01 -0.34  -0.28    -0.18  -0.46        -0.10
## Married       -0.42  -0.01    0.25  0.57   0.11     0.09   0.20         0.04
## Age           -2.45  -0.34    0.57 17.32   5.74     1.71   3.84         0.95
## Tenure        -0.14  -0.28    0.11  5.74   8.86     1.60   0.30        -0.10
## JobLevel      -1.14  -0.18    0.09  1.71   1.60     1.54   1.32         0.11
## Salary        11.89  -0.46    0.20  3.84   0.30     1.32  15.23         2.33
## Satisfaction   1.43  -0.10    0.04  0.95  -0.10     0.11   2.33         1.74
## 
## 
## >> Imputed Data Preview:
## First 6 rows:
##      Dept Gender Married   Age Tenure JobLevel Salary Satisfaction
## [1,]    1   0.00       1 32.00     11     3.00     18          3.5
## [2,]    1   1.00       1 30.86     13     4.00     18          3.5
## [3,]    1   1.00       1 30.00      9     4.00     18          3.5
## [4,]    1   1.00       1 29.00      8     3.00     18          3.5
## [5,]    1   1.00       0 26.00      7     4.00     18          3.5
## [6,]    1   0.36       0 27.39     10     3.95     18          3.5
## 
## Last 6 rows:
##        Dept Gender Married   Age Tenure JobLevel Salary Satisfaction
## [625,]  105      1       0 29.01     11        6     21            5
## [626,]  105      1       0 28.00      5        4     21            5
## [627,]  105      1       0 17.00      5        3     21            5
## [628,]  105      1       1 28.00     10        4     21            5
## [629,]  105      0       0 27.37      9        4     21            5
## [630,]  105      1       1 32.00      5        3     21            5

  1. Density Plot Comparison
# Prepare data - exclude first 4 columns
imputed_df <- as.data.frame(analysis$imputed_data)[, -(1:4)]
complete_df <- as.data.frame(employeecomplete)[, -(1:4)]
variables <- names(complete_df)

# Function to safely convert to numeric and check validity
get_plottable_vars <- function(df) {
  plottable <- character(0)
  for (var in names(df)) {
    vals <- tryCatch({
      x <- as.numeric(as.character(df[[var]]))
      if (sum(!is.na(x)) >= 2) var else NULL
    }, error = function(e) NULL)
    if (!is.null(vals)) plottable <- c(plottable, var)
  }
  plottable
}

# Identify plottable variables in both datasets
plottable_vars <- intersect(
  get_plottable_vars(complete_df),
  get_plottable_vars(imputed_df)
)

# Only proceed if we have plottable variables
if (length(plottable_vars) > 0) {
  # Set up grid layout
  n_cols <- 2
  n_rows <- ceiling(length(plottable_vars) / n_cols)
  
  # Set graphical parameters with larger fonts
  par(mfrow = c(n_rows, n_cols),
      mar = c(5, 6, 4, 2),  # Increased margins for larger text
      oma = c(4, 4, 5, 2),
      cex.main = 1.8,  # Larger variable names
      cex.lab = 1.5)   # Larger axis labels
  
  # Create density plots
  for (var in plottable_vars) {
    # Convert to numeric
    true_vals <- as.numeric(as.character(complete_df[[var]]))
    imp_vals <- as.numeric(as.character(imputed_df[[var]]))
    
    # Calculate densities
    d_true <- density(true_vals[!is.na(true_vals)], adjust = 1.5)
    d_imp <- density(imp_vals[!is.na(imp_vals)], adjust = 1.5)
    
    # Create plot with larger text
    plot(range(c(d_true$x, d_imp$x)), range(c(d_true$y, d_imp$y)),
         type = "n", 
         main = var, 
         xlab = "", 
         ylab = "",
         cex.axis = 1.3)  # Larger axis numbers
    
    # Add grid
    grid(col = "gray90", lty = "dotted")
    
    # Add densities with transparency
    polygon(d_true, col = adjustcolor("#1E88E5", alpha.f = 0.1), border = NA)
    polygon(d_imp, col = adjustcolor("#FF0D57", alpha.f = 0.1), border = NA)
    
    # Add thicker density lines
    lines(d_true, col = adjustcolor("#1E88E5", alpha.f = 0.5), lwd = 4)
    lines(d_imp, col = adjustcolor("#FF0D57", alpha.f = 0.5), lwd = 4)
  }
  
  # Add global titles with larger fonts
  mtext("Variable Value", side = 1, outer = TRUE, line = 2, cex = 1.8)
  mtext("Density", side = 2, outer = TRUE, line = 2, cex = 1.8)
  mtext("Comparison of True (Blue) vs Imputed (Red) Distributions", 
        side = 3, outer = TRUE, line = 1, font = 2, cex = 2.2)
  
  # Reset parameters
  par(mfrow = c(1, 1))
  
} else {
  message("No variables with sufficient numeric data for density plots.")
  message("Variables attempted: ", paste(variables, collapse = ", "))
}


  1. Boxplot Comparison
# Prepare data - exclude first 4 columns
true_data <- as.data.frame(employeecomplete)[, -(1:4)]
imputed_data <- as.data.frame(analysis$imputed_data)[, -(1:4)]
variables <- names(true_data)

# Identify numeric variables with sufficient data
numeric_vars <- variables[sapply(variables, function(var) {
  is.numeric(true_data[[var]]) && 
    sum(!is.na(true_data[[var]])) >= 2 &&
    sum(!is.na(imputed_data[[var]])) >= 2
})]

# Only proceed if we have numeric variables
if (length(numeric_vars) > 0) {
  # Calculate layout
  n_cols <- 2
  n_rows <- ceiling(length(numeric_vars)/n_cols)
  
  # Adjusted graphical parameters
  par(mfrow = c(n_rows, n_cols),
      mar = c(4, 4.5, 3, 2),    # Margins for individual plots
      oma = c(4, 0, 5, 0))      # Outer margins
  
  # Create plots with lighter colors
  for (var in numeric_vars) {
    # Prepare data
    true_vals <- true_data[[var]]
    imp_vals <- imputed_data[[var]]
    
    # Create boxplot with pastel colors
    boxplot(list(True = true_vals, Imputed = imp_vals),
            main = "",
            col = c("#A6CEE3", "#FDBF6F"),  # Light blue and light orange
            border = c("#1F78B4", "#FF7F00"),  # Slightly darker borders
            boxwex = 0.6,
            las = 1,
            ylab = "",
            cex.axis = 0.9,
            outline = FALSE)
    
    # Add variable title as y-axis label
    mtext(var, side = 2, line = 3, cex = 0.9, font = 2, col = "gray30")
    
    # Add light grid
    grid(nx = NA, ny = NULL, col = "gray95", lty = "solid")
    
    # Add mean points with soft green
    points(1:2, 
           c(mean(true_vals, na.rm = TRUE),
           mean(imp_vals, na.rm = TRUE)),
           pch = 18, col = "#B2DF8A", cex = 2, lwd = 2)  # Soft green
    
    # Add subtle border
    box("plot", col = "gray80", lwd = 0.5)
  }
  
  # Add global titles with softer colors
  mtext("Comparison of True vs. Imputed Values", 
        outer = TRUE, side = 3, line = 1.5, cex = 1.4, font = 2, col = "gray20")
  mtext("Boxes show median/IQR | Green diamonds show means",
        outer = TRUE, side = 3, line = -0.5, cex = 1, col = "gray40")
  mtext("Variable Values", outer = TRUE, side = 1, line = 2, cex = 1.1, col = "gray20")
  
} else {
  cat("<div class='alert alert-warning'>No numeric variables with sufficient data for comparison.</div>")
  cat(paste("<div>Variables attempted:", paste(variables, collapse = ", "), "</div>"))
}

# Reset graphical parameters
par(mfrow = c(1, 1))

  1. Interactive Diagnostics Table
library(formattable)
library(dplyr)

# Prepare the diagnostics data
diagnostics_table <- analysis$diagnostics %>%
  dplyr::mutate(
    Missing = paste0(N_Missing, " (", round(N_Missing / nrow(analysis$imputed_data) * 100, 1), "%)"),
    Bias = Bias_Missing,
    RMSE_Std = ifelse(is.na(RMSE_Missing), NA, RMSE_Missing / sd(analysis$imputed_data[, Variable], na.rm = TRUE)),
    MAE_Std = ifelse(is.na(MAE_Missing), NA, MAE_Missing / sd(analysis$imputed_data[, Variable], na.rm = TRUE)),
    Correlation = Correlation_Missing
  ) %>%
  dplyr::select(Variable, Missing, Bias, RMSE_Std, MAE_Std, Correlation)

# Color coding functions
missing_color <- function(x) {
  pct <- as.numeric(gsub(".*\\((.*)%\\)", "\\1", x))
  ifelse(is.na(pct), "grey",
         ifelse(pct < 5, "#4dac26",  # Green
                ifelse(pct <= 15, "#fdae61",  # Orange
                       "#d7191c")))  # Red
}

bias_color <- function(x) {
  ifelse(is.na(x), "grey",
         ifelse(abs(x) < 0.1, "#4dac26",  # Green (near zero)
                ifelse(abs(x) < 0.3, "#fdae61",  # Orange
                       "#d7191c")))  # Red (large bias)
}

error_color <- function(x) {
  ifelse(is.na(x), "grey",
         ifelse(x < 0.1, "#4dac26",  # Green
                ifelse(x <= 0.3, "#fdae61",  # Orange
                       "#d7191c")))  # Red
}

corr_color <- function(x) {
  ifelse(is.na(x), "grey",
         ifelse(x > 0.6, "#4dac26",  # Green (excellent)
                ifelse(x > 0.3, "#fdae61",  # Orange (good)
                       "#d7191c")))  # Red (poor)
}

# Create the formatted table
formattable(
  diagnostics_table,
  align = c("l", rep("c", ncol(diagnostics_table)-1)),
  list(
    Variable = formatter("span", style = ~ style(font.weight = "bold")),
    Missing = formatter("span", 
                      style = x ~ style(display = "block",
                                      padding = "0 4px",
                                      `border-radius` = "4px",
                                      `background-color` = missing_color(x))),
    Bias = formatter("span",
                   style = x ~ style(display = "block",
                                   padding = "0 4px",
                                   `border-radius` = "4px",
                                   `background-color` = bias_color(x)),
                   x ~ sprintf("%.3f", x)),
    RMSE_Std = formatter("span",
                        style = x ~ style(display = "block",
                                        padding = "0 4px",
                                        `border-radius` = "4px",
                                        `background-color` = error_color(x)),
                        x ~ sprintf("%.3f", x)),
    MAE_Std = formatter("span",
                       style = x ~ style(display = "block",
                                       padding = "0 4px",
                                       `border-radius` = "4px",
                                       `background-color` = error_color(x)),
                       x ~ sprintf("%.3f", x)),
    Correlation = formatter("span",
                          style = x ~ style(display = "block",
                                          padding = "0 4px",
                                          `border-radius` = "4px",
                                          `background-color` = corr_color(x)),
                          x ~ ifelse(is.na(x), "N/A", sprintf("%.3f", x)))
  )
) %>%
  as.htmlwidget() %>%
  htmltools::tagList() %>%
  htmltools::browsable()

Imputation Quality Metrics for Each Variable

  • N_Missing:
    The count and percentage of missing values in the variable before imputation.
    • Color coding:
      • Green: Missingness less than 5% — indicates very little missing data, so imputation is typically easier and more reliable.
      • Orange: Missingness between 5% and 15% — moderate missing data, imputation quality may vary and should be checked carefully.
      • Red: Missingness above 15% — high missingness; imputation may be less reliable and should be interpreted with caution.
  • Bias_Missing:
    The average difference between the imputed values and the true (observed) values, calculated only on the originally missing entries.
    • Computed as: imputed value − true value.
    • Interpretation:
      • Values close to 0 indicate little systematic error (no consistent over- or under-estimation).
      • Large positive or negative bias signals systematic over- or under-imputation, respectively, which can distort subsequent analyses.
  • RMSE (standardized) (Root Mean Squared Error, standardized):
    Measures the average magnitude of the imputation errors on missing values, accounting for variability in the variable by dividing RMSE by the standard deviation of the true values.
    • Color coding:
      • Green: RMSE < 0.1
      • Orange: 0.1 < RMSE < 0.3
      • Red: RMSE > 0.3
    • Why standardize? Different variables can have different scales and ranges; standardizing by the variable’s true standard deviation makes errors comparable across variables.
    • Lower values indicate more accurate imputation.
    • For example, a standardized RMSE of 0.2 means the average imputation error is about 20% of the natural variability of that variable.
  • MAE (standardized) (Mean Absolute Error, standardized):
    Similar to RMSE but uses the average absolute error instead of squared error, also standardized by the variable’s true standard deviation.
    • Less sensitive to large outliers compared to RMSE.
    • Like standardized RMSE, smaller values indicate better imputation accuracy.
    • Color coding:
      • Green: RMSE < 0.1
      • Orange: 0.1 < RMSE < 0.3
      • Red: RMSE > 0.3
  • Correlation_Missing:
    Pearson correlation between imputed and true values for the missing entries only.
    • Reflects how well the imputed values preserve the relative ordering and relationships in the original data.
    • Color coding:
      • Green: RMSE < 0.3 (Near zero or negative correlations indicate poor or even inverse relationships, which is a red flag for imputation quality.)
      • Orange: 0.3 < RMSE < 0.6 (indicates good recovery, but imputation quality may be moderate. )
      • Red: RMSE > 0.6 (typically considered excellent recovery.)

RMSE vs MAE

Case What it Means
RMSE ≈ MAE Errors are generally consistent and small. No large outliers.
RMSE > MAE (by a lot) Some large errors are inflating RMSE. Imputation might be good overall but bad for some cases.
Both RMSE and MAE are large Imputation is generally poor — revisit your model or assumptions.
Both are small (near 0) Imputation is high quality — close to true values on average.

How to Interpret Standardized RMSE / MAE

Value Interpretation
< 0.1 Excellent (errors are < 10% of the variable’s SD)
0.1 – 0.3 Acceptable
> 0.5 Poor — errors are large relative to the variable’s scale

These metrics allow cross-variable comparison, even if variables are on different scales.


To evaluate the quality of the multiple imputation results, several metrics were examined for each variable with missing values, including bias (difference between imputed and true values), standardized RMSE and MAE, and the correlation between imputed and true values (when available, typically in simulation settings).

For Gender, which had 5.1% missingness, the bias was modest at 0.139, with very small standardized errors (RMSE = 0.022, MAE = 0.020). However, the correlation between true and imputed values was relatively low (r = 0.159), suggesting the imputations captured general trends but not fine-grained individual accuracy.

Age had the highest proportion of missingness (16.2%). The bias was small (-0.064), and while the RMSE and MAE were higher than other variables (0.216 and 0.165, respectively), they remain reasonable given the level of missingness. The correlation of 0.434 indicates moderate accuracy of the imputed values and suggests that imputation preserved some, but not all, of the true variance in Age.

Tenure, with 4.1% missingness, showed higher bias (0.575) and a moderate RMSE of 0.138. The correlation (r = 0.232) was low, which may reflect instability in the imputation model for this variable or high individual-level variability not well captured by the predictors.

For Job Level, which had 4.8% missingness, the bias was relatively small (-0.213), and error metrics were low (RMSE = 0.043, MAE = 0.037). The correlation was fairly strong (r = 0.655), indicating that the imputation model performed well in preserving the true values’ structure.

Salary, missing 9.5% of values, showed the highest absolute bias (1.223) among all variables. While standardized errors (RMSE = 0.166, MAE = 0.129) were moderate, the correlation was decent (r = 0.561). This suggests that while overall patterns in Salary were reasonably well captured, there may have been some systematic overestimation or skew.

Satisfaction, with 5.7% missingness, had the most concerning result. The bias was substantial (-1.038), and though error metrics were low (RMSE = 0.076, MAE = 0.065), the negative correlation (r = –0.577) indicates that the imputed values may be systematically misaligned with the actual values, possibly reversing trends. This warrants closer examination of model assumptions, potential nonlinearity, or unaccounted-for predictors affecting Satisfaction.

For variables like Department and Marrital Status, which had no missing values, diagnostics were not applicable.

Summary

The multiple imputation performed reasonably well for most variables, with low-to-moderate bias and acceptable error metrics. However, the imputation quality varied across variables:

  • Strong performance: Job Level, Age (moderate)
  • Moderate concerns: Gender, Tenure, Salary
  • Red flag: Satisfaction, due to high bias and inverse correlation

These results suggest that while MI is appropriate, model refinement (e.g., adding interactions, transforming variables, or checking predictive relationships) may be necessary especially for variables like Satisfaction to improve accuracy and reduce bias.


3.8 Multiple Imputation via Stochastic EM Algorithm

Rubin (Rubin, 1987) developed a framework for combining results from \(m\) imputed datasets. For correlation analysis between variables \(X\) and \(Y\)

To combine results from \(m\) imputed datasets, we proceed through these steps:

Suppose we have performed multiple imputation generating \(m\) complete datasets. For each imputed dataset \(i = 1, \dots, m\), we estimate a parameter vector \(\hat{\theta}_i\) and its associated variance-covariance matrix \(U_i\).

3.8.1 Pooled Point Estimate

The combined estimate is the average of the \(m\) estimates:

\[ \bar{\theta} = \frac{1}{m} \sum_{i=1}^m \hat{\theta}_i \]

3.8.2 Within-Imputation Variance

This is the average of the estimated variances within each imputation:

\[ \bar{U} = \frac{1}{m} \sum_{i=1}^m U_i \]

3.8.3 Between-Imputation Variance

This captures the variability between the imputed estimates:

\[ B = \frac{1}{m-1} \sum_{i=1}^m \left(\hat{\theta}_i - \bar{\theta}\right) \left(\hat{\theta}_i - \bar{\theta}\right)^\top \]

3.8.4 Total Variance

Rubin’s total variance accounts for within- and between-imputation variance, plus an adjustment for the finite number of imputations:

\[ T = \bar{U} + \left(1 + \frac{1}{m}\right) B \]

3.8.5 Inference

Using \(\bar{\theta}\) and \(T\), inference (e.g., confidence intervals, hypothesis tests) proceeds as usual, often relying on a \(t\)-distribution with degrees of freedom estimated by Rubin’s formula.


miviaem() performs multiple imputation on datasets with missing values using a stochastic Expectation–Maximization (EM) algorithm. It generates several imputed datasets, analyzes each one, and pools the results across imputations using Rubin’s (1987) rules.

#' Multiple Imputation via Stochastic EM Algorithm with Rubin's Rules Pooling
#'
#' Performs multiple imputation on datasets with missing values using a 
#' stochastic Expectation–Maximization (EM) algorithm. The procedure generates 
#' several imputed datasets by adding stochastic variation during the 
#' Expectation step to reflect uncertainty in the missing values. Each imputed 
#' dataset is analyzed separately, and the results are pooled using Rubin's 
#' (1987) rules to produce overall estimates and standard errors.
#'
#' @details
#' The algorithm iterates between:
#' \itemize{
#'   \item \strong{E-step:} Calculates expected values of missing data and 
#'         sufficient statistics based on the current parameter estimates.
#'   \item \strong{M-step:} Updates the parameter estimates (mean vector and 
#'         covariance matrix) using the completed data.
#' }
#' Stochasticity is introduced in the E-step by sampling from the conditional
#' distributions of the missing values. This allows variability across 
#' imputations and ensures valid inference via Rubin's rules.
#'
#' @param data A data frame or matrix containing missing values (\code{NA}).
#' @param m Integer. Number of imputations to perform (default = 5).
#' @param max_iter Integer. Maximum number of EM iterations for each imputation (default = 100).
#' @param tol Numeric. Convergence tolerance for the EM iterations (default = 1e-6).
#' @param analysis_fun Function to apply to each imputed dataset for analysis. 
#'   Must return a list or object containing parameter estimates and their 
#'   variances.
#' @param seed Optional integer seed for reproducibility.
#'
#' @return A list containing:
#' \item{imputations}{List of \code{m} completed datasets.}
#' \item{analyses}{List of results from applying \code{analysis_fun} to each dataset.}
#' \item{pooled}{Pooled estimates and variances according to Rubin's rules.}
#'
#' @references
#' Rubin, D. B. (1987). \emph{Multiple Imputation for Nonresponse in Surveys}. Wiley.
#'
#' @examples
#' \dontrun{
#'   # Example usage
#'   result <- MIVIAEM(
#'     data = mydata,
#'     m = 5,
#'     analysis_fun = function(df) lm(y ~ x1 + x2, data = df),
#'     seed = 123
#'   )
#'   result$pooled
#' }
#'
#' @export

miviaem <- function(data, m = 5, formula = NULL, max_em_iter = 200, em_tol = 1e-06,
    verbose = FALSE) {

    # ----------------------------------------------------------- Basic package
    # checks -----------------------------------------------------------
    if (!requireNamespace("MASS", quietly = TRUE))
        stop("Package 'MASS' required. Install with install.packages('MASS').")
    if (!requireNamespace("mvtnorm", quietly = TRUE))
        stop("Package 'mvtnorm' required. Install with install.packages('mvtnorm').")

    # ----------------------------------------------------------- Internal
    # helper: SWEEP operator (classic Beaton (1964) / Goodnight (1979))
    # -----------------------------------------------------------
    SWEEP <- function(A, x) {
        a <- as.matrix(A)
        n <- nrow(a)

        for (k in x) {
            # step 1
            D <- a[k, k]

            # step 2
            a[k, ] <- a[k, ]/D

            # step 3
            for (i in 1:n) {
                if (i != k)
                  {
                    B <- a[i, k]
                    a[i, ] <- a[i, ] - B * a[k, ]
                    a[i, k] <- -B/D
                  }  # if
            }  # for i

            # step 4
            a[k, k] <- 1/D
        }
        return(a)
    }  # SWEEP

    # ----------------------------------------------------------- Internal
    # helper: augmented_parameter_matrix Returns list(mean, cov, augmented)
    # -----------------------------------------------------------
    augmented_parameter_matrix <- function(data_mat, scaled = TRUE) {
        Xdata <- as.matrix(data_mat)
        # remove rows that are entirely NA (defensive)
        Xdata <- Xdata[rowSums(!is.na(Xdata)) > 0, , drop = FALSE]
        # Add intercept column
        X <- cbind(1, Xdata)
        XtX <- t(X) %*% X
        swept <- SWEEP(XtX, 1)
        if (scaled) {
            n <- nrow(Xdata)
            swept[-1, -1] <- swept[-1, -1]/n
            swept[1, 1] <- -1/n
        }
        var_names <- colnames(Xdata)
        mean_vec <- as.numeric(swept[1, -1])
        cov_mat <- swept[-1, -1]
        names(mean_vec) <- var_names
        dimnames(cov_mat) <- list(var_names, var_names)
        list(mean = mean_vec, cov = cov_mat, augmented = swept)
    }

    # ----------------------------------------------------------- Internal
    # helper: EM for MVN with missing data - E-step: conditional expectations
    # fill missing entries - M-step: augmented_parameter_matrix on
    # expected-completed data Returns list(mu, sigma, iterations, loglik)
    # -----------------------------------------------------------
    em_mvnorm <- function(data_mat, max_iter = 200, tol = 1e-06, verbose = FALSE) {
        X <- as.matrix(data_mat)
        n <- nrow(X)
        p <- ncol(X)
        colnames(X) <- colnames(data_mat)
        # initialization: column means and pairwise cov
        mu <- colMeans(X, na.rm = TRUE)
        sigma <- stats::cov(X, use = "pairwise.complete.obs")
        # small fallback if sigma has NAs
        if (any(is.na(sigma)))
            sigma[is.na(sigma)] <- 0
        loglik_old <- -Inf
        for (iter in seq_len(max_iter)) {
            # E-step: build expected-completed data matrix X_exp
            X_exp <- X
            for (i in seq_len(n)) {
                miss <- is.na(X[i, ])
                if (any(miss)) {
                  obs <- !miss
                  # if all missing or all observed, skip conditionals
                  # appropriately
                  if (all(obs))
                    next
                  sigma_oo <- sigma[obs, obs, drop = FALSE]
                  sigma_mo <- sigma[miss, obs, drop = FALSE]
                  mu_m <- mu[miss]
                  mu_o <- mu[obs]
                  x_o <- X[i, obs]
                  # if sigma_oo is singular, regularize
                  if (ncol(sigma_oo) == 0)
                    next
                  # Solve robustly
                  sigma_oo_inv <- tryCatch(solve(sigma_oo), error = function(e) {
                    # add small ridge
                    solve(sigma_oo + diag(1e-08, nrow(sigma_oo)))
                  })
                  cond_mean <- as.numeric(mu_m + sigma_mo %*% sigma_oo_inv %*% (x_o -
                    mu_o))
                  X_exp[i, miss] <- cond_mean
                }
            }
            # M-step using SWEEP-based augmented matrix
            apm <- augmented_parameter_matrix(X_exp, scaled = TRUE)
            mu_new <- apm$mean
            sigma_new <- apm$cov
            # safeguard: symmetrize
            sigma_new <- (sigma_new + t(sigma_new))/2
            # approximate observed-data log-likelihood (sum over observed
            # parts)
            loglik_new <- 0
            for (i in seq_len(n)) {
                obs <- !is.na(X[i, ])
                if (all(!obs))
                  next
                xi_obs <- X[i, obs, drop = FALSE]
                mu_obs <- mu_new[obs]
                sigma_obs <- sigma_new[obs, obs, drop = FALSE]
                # numerical safety
                if (any(is.na(sigma_obs)) || det(sigma_obs) <= 0)
                  sigma_obs <- sigma_obs + diag(1e-08, nrow(sigma_obs))
                loglik_new <- loglik_new + mvtnorm::dmvnorm(as.numeric(xi_obs), mean = as.numeric(mu_obs),
                  sigma = sigma_obs, log = TRUE)
            }
            if (verbose)
                message(sprintf("EM iter=%d loglik=%.6f", iter, loglik_new))
            if (!is.finite(loglik_new))
                loglik_new <- -1e+10
            # check convergence
            if (abs(loglik_new - loglik_old) < tol) {
                return(list(mu = as.numeric(mu_new), sigma = as.matrix(sigma_new),
                  iterations = iter, loglik = loglik_new))
            }
            mu <- as.numeric(mu_new)
            sigma <- as.matrix(sigma_new)
            loglik_old <- loglik_new
        }
        warning("EM did not converge in max_iter")
        list(mu = as.numeric(mu), sigma = as.matrix(sigma), iterations = max_iter,
            loglik = loglik_old)
    }

    # ----------------------------------------------------------- Internal
    # helper: stochastic imputation for a single row given mu & sigma draws
    # missing block from conditional MVN
    # -----------------------------------------------------------
    stochastic_impute_row <- function(row, mu, sigma) {
        miss <- is.na(row)
        if (!any(miss))
            return(row)
        obs <- !miss
        sigma_oo <- sigma[obs, obs, drop = FALSE]
        sigma_mo <- sigma[miss, obs, drop = FALSE]
        mu_m <- mu[miss]
        mu_o <- mu[obs]
        x_o <- row[obs]
        sigma_oo_inv <- tryCatch(solve(sigma_oo), error = function(e) solve(sigma_oo +
            diag(1e-08, nrow(sigma_oo))))
        cond_mean <- as.numeric(mu_m + sigma_mo %*% sigma_oo_inv %*% (x_o - mu_o))
        cond_cov <- sigma[miss, miss, drop = FALSE] - sigma_mo %*% sigma_oo_inv %*%
            t(sigma_mo)
        cond_cov <- (cond_cov + t(cond_cov))/2
        # eigen safeguard
        eigs <- eigen(cond_cov, symmetric = TRUE, only.values = TRUE)$values
        if (any(eigs < 0))
            cond_cov <- cond_cov + diag(abs(min(eigs)) + 1e-08, nrow(cond_cov))
        draw <- as.numeric(MASS::mvrnorm(1, mu = cond_mean, Sigma = cond_cov))
        row[miss] <- draw
        row
    }

    # ----------------------------------------------------------- Prepare input
    # data -----------------------------------------------------------
    if (!is.data.frame(data))
        stop("`data` must be a data.frame.")
    df <- data
    # convert character to factor
    df[] <- lapply(df, function(col) if (is.character(col))
        as.factor(col) else col)
    # classify numeric vs factor
    is_num <- sapply(df, is.numeric)
    num_names <- names(df)[is_num]
    fac_names <- names(df)[!is_num]

    # Precompute factor sampling probabilities (marginal)
    fac_probs <- list()
    for (f in fac_names) {
        if (any(!is.na(df[[f]]))) {
            tab <- table(df[[f]])
            fac_probs[[f]] <- prop.table(tab)
        } else {
            fac_probs[[f]] <- NULL
        }
    }

    # If there are numeric columns, run EM to estimate mu & sigma
    mu_hat <- sigma_hat <- NULL
    em_info <- NULL
    if (length(num_names) > 0) {
        numeric_data <- as.matrix(df[, num_names, drop = FALSE])
        # initial quick fill for EM: column means used inside em_mvnorm
        for (j in seq_len(ncol(numeric_data))) {
            if (any(is.na(numeric_data[, j])))
                numeric_data[is.na(numeric_data[, j]), j] <- mean(numeric_data[,
                  j], na.rm = TRUE)
        }
        em_info <- em_mvnorm(numeric_data, max_iter = max_em_iter, tol = em_tol,
            verbose = verbose)
        mu_hat <- em_info$mu
        sigma_hat <- em_info$sigma
    } else {
        if (verbose)
            message("No numeric columns found — EM skipped.")
    }

    # ----------------------------------------------------------- Generate m
    # stochastic imputations
    # -----------------------------------------------------------
    imputations <- vector("list", m)
    for (imp in seq_len(m)) {
        working <- df
        # numeric imputations via conditional MVN draws
        if (length(num_names) > 0) {
            for (i in seq_len(nrow(working))) {
                row_vals <- as.numeric(working[i, num_names])
                if (any(is.na(row_vals))) {
                  row_filled <- stochastic_impute_row(row_vals, mu_hat, sigma_hat)
                  working[i, num_names] <- row_filled
                }
            }
        }
        # factor imputations: sample marginally from observed category
        # proportions
        if (length(fac_names) > 0) {
            for (f in fac_names) {
                miss_idx <- which(is.na(working[[f]]))
                if (length(miss_idx) > 0) {
                  probs <- fac_probs[[f]]
                  if (is.null(probs)) {
                    # all missing: cannot sample; leave NA (user should avoid)
                    next
                  }
                  cats <- names(probs)
                  draws <- sample(cats, length(miss_idx), replace = TRUE, prob = as.numeric(probs))
                  # maintain factor levels
                  if (is.factor(df[[f]])) {
                    working[miss_idx, f] <- factor(draws, levels = levels(df[[f]]))
                  } else {
                    working[miss_idx, f] <- draws
                  }
                }
            }
        }
        imputations[[imp]] <- working
    }

    # ----------------------------------------------------------- If formula
    # provided: fit on each imputed dataset and pool using Rubin's rules Manual
    # pooling: Q_bar, U_bar, B, T
    # -----------------------------------------------------------
    pooled_results <- NULL
    diagnostics <- list()
    if (!is.null(formula)) {
        models <- lapply(imputations, function(d) lm(formula, data = d))
        # Collect coefficient vectors (may vary in length if singular; we
        # assume same names)
        coef_list <- lapply(models, function(mo) {
            co <- tryCatch(coef(mo), error = function(e) rep(NA, length(coef(mo))))
            co
        })
        coef_names <- names(coef_list[[1]])
        Q <- sapply(coef_list, function(x) x[coef_names])
        # Collected within-imputation variances (diagonals)
        U <- sapply(models, function(mod) {
            vc <- tryCatch(vcov(mod), error = function(e) matrix(NA, nrow = length(coef(mod)),
                ncol = length(coef(mod))))
            diag(vc)[coef_names]
        })
        # Pool
        Q_bar <- rowMeans(Q, na.rm = TRUE)
        U_bar <- rowMeans(U, na.rm = TRUE)
        B <- apply(Q, 1, stats::var, na.rm = TRUE)
        # total variance
        T_var <- U_bar + (1 + 1/m) * B
        SE <- sqrt(T_var)
        pooled_results <- data.frame(Estimate = Q_bar, StdError = SE, t.value = Q_bar/SE,
            row.names = coef_names, stringsAsFactors = FALSE)

        # diagnostics: for variables with missing values, compute imputed
        # summaries across imputations
        diagnostics <- lapply(names(df), function(varname) {
            miss_idx <- which(is.na(data[[varname]]))
            if (length(miss_idx) == 0)
                return(NULL)
            values <- unlist(lapply(imputations, function(d) d[[varname]][miss_idx]))
            if (is.numeric(df[[varname]])) {
                list(n_missing = length(miss_idx), imputed_mean = mean(values, na.rm = TRUE),
                  imputed_sd = sd(values, na.rm = TRUE))
            } else {
                freqs <- table(values)
                list(n_missing = length(miss_idx), imputed_category_freqs = freqs)
            }
        })
        names(diagnostics) <- names(df)
    }

    # ----------------------------------------------------------- Print summary
    # tables (plain formatting)
    # -----------------------------------------------------------
    if (!is.null(pooled_results)) {
        cat("\nPooled Results (Rubin pooling):\n")
        print(round(pooled_results, 4))
    }
    # Numeric diagnostics table
    numeric_diag <- do.call(rbind, lapply(names(diagnostics), function(v) {
        diag <- diagnostics[[v]]
        if (is.null(diag))
            return(NULL)
        if (!is.null(diag$imputed_mean)) {
            data.frame(Variable = v, n_missing = diag$n_missing, Imputed_Mean = round(diag$imputed_mean,
                3), Imputed_SD = round(diag$imputed_sd, 4), stringsAsFactors = FALSE)
        }
    }))
    if (!is.null(numeric_diag) && nrow(numeric_diag) > 0) {
        cat("\nImputation Diagnostics (Numeric variables):\n")
        print(numeric_diag, row.names = FALSE)
    }
    # Factor diagnostics
    factor_diag <- do.call(rbind, lapply(names(diagnostics), function(v) {
        diag <- diagnostics[[v]]
        if (is.null(diag))
            return(NULL)
        if (!is.null(diag$imputed_category_freqs)) {
            dfreq <- as.data.frame(diag$imputed_category_freqs)
            colnames(dfreq) <- c("Category", "Frequency")
            dfreq$Variable <- v
            dfreq
        }
    }))
    if (!is.null(factor_diag) && nrow(factor_diag) > 0) {
        cat("\nImputation Diagnostics (Factor variables):\n")
        print(factor_diag, row.names = FALSE)
    }

    # ----------------------------------------------------------- Return list
    # invisibly -----------------------------------------------------------
    out <- list(imputations = imputations, pooled_results = pooled_results, missing_diagnostics = diagnostics,
        mu_hat = mu_hat, sigma_hat = sigma_hat, em_iterations = if (!is.null(em_info)) em_info$iterations else NULL,
        em_loglik = if (!is.null(em_info)) em_info$loglik else NULL)
    invisible(out)
}


# Save
dump("miviaem", file = "miviaem.R")
# Example data & emviami call
A1 <- data.frame(X = c(78, 84, 84, 85, 87, 91, 92, 94, 94, 96, 99, 105, 105, 106,
    108, 112, 113, 115, 118, 134), Y = c(13, 9, 10, 10, NA, 3, 12, 3, 13, NA, 6,
    12, 14, 10, NA, 10, 14, 14, 12, 11), Z = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
    NA, 7, 10, 11, 15, 10, 10, 12, 14, 16, 12))

# Print pooled results and diagnostics
set.seed(123)  # for reproducible results
results <- miviaem(A1, m = 5, formula = Z ~ X + Y)
## 
## Pooled Results (Rubin pooling):
##             Estimate StdError t.value
## (Intercept)   6.7292   4.4442  1.5142
## X             0.0242   0.0392  0.6159
## Y             0.2300   0.1568  1.4669
## 
## Imputation Diagnostics (Numeric variables):
##  Variable n_missing Imputed_Mean Imputed_SD
##         Y         3        9.493     3.2119
##         Z        10       11.287     1.5444
# one of the 5 imputed datasets
cat("\nOne of the 5 Imputed Datasets:\n")
## 
## One of the 5 Imputed Datasets:
print(round(results$imputations[[1]], 2))
##      X     Y     Z
## 1   78 13.00 10.69
## 2   84  9.00 10.81
## 3   84 10.00 14.04
## 4   85 10.00 11.49
## 5   87  9.76  8.40
## 6   91  3.00 11.30
## 7   92 12.00  9.59
## 8   94  3.00  9.37
## 9   94 13.00 11.19
## 10  96  6.61 10.24
## 11  99  6.00  7.00
## 12 105 12.00 10.00
## 13 105 14.00 11.00
## 14 106 10.00 15.00
## 15 108 11.25 10.00
## 16 112 10.00 10.00
## 17 113 14.00 12.00
## 18 115 14.00 14.00
## 19 118 12.00 16.00
## 20 134 11.00 12.00

3.8.6 Pooled Regression Output

The pooled_regression output from the EM_MI() function is derived using Rubin’s rules, applied to the results of fitting a linear model on each imputed dataset.

After performing multiple imputations (e.g., m = 5), we obtain m slightly different “complete” datasets. Instead of analyzing just one, we:

  • Fit the same regression model (e.g., X ~ Y + Z) on each imputed dataset.
  • Extract the estimated coefficients and standard errors.
  • Pool the estimates using Rubin’s rules, which combine:
    • Within-imputation variance – average uncertainty from each model.
    • Between-imputation variance – variability in estimates across datasets.

Why This Matters

Analyzing a single imputed dataset underestimates uncertainty because it ignores the variation introduced by missing data. The pooled_regression output corrects for this by:

  • Combining information across all imputations,
  • Reflecting both model uncertainty and missing data uncertainty,
  • Producing valid standard errors and confidence intervals.

This approach ensures more accurate and defensible statistical inference.


3.8.7 Assumptions of miviaem()

  1. Multivariate Normality (for numeric data)
    • The joint distribution of all numeric variables is assumed to be (approximately) multivariate normal.
    • This assumption is critical because the EM algorithm estimates a mean vector and covariance matrix under this model, and imputes missing data from conditional MVN distributions.
  2. Missing at Random (MAR)
    • The mechanism causing missingness depends only on observed data, not on unobserved (missing) values themselves.
    • MAR allows the EM algorithm to produce unbiased parameter estimates using only observed data.
    • Violations (e.g., Missing Not At Random - MNAR) require more complex models.
  3. Ignorable Missingness Mechanism
    • Missingness does not depend on the missing values after conditioning on observed data (related to MAR).
    • This allows ignoring the missingness mechanism in likelihood calculations.
  4. Sufficient Sample Size
    • The number of complete or partially observed cases should be large enough to estimate covariance structure reliably.
    • Small samples can yield unstable covariance estimates.
  5. Linearity & Homoscedasticity in Regression (if model used)
    • When a formula is provided for regression modeling, the usual assumptions of linear regression apply for valid inference:
      linearity, constant variance, independence, and normality of errors.
  6. Variables are Numeric or Coded Appropriately
    • The EM MVN approach directly applies to continuous numeric data.
    • Factor or categorical variables require special treatment or separate imputation methods.
  7. Convergence of EM Algorithm
    • The EM algorithm must converge to stable estimates within the maximum number of iterations.

Summary Table

Assumption Explanation
Multivariate Normality Data follow or approximate MVN distribution
Missing at Random (MAR) Missingness depends only on observed data
Ignorable Missingness Missingness mechanism can be ignored in model
Sufficient Sample Size Enough data to estimate covariance accurately
Regression assumptions (if used) Validity of linear model for pooled analysis
Data types Numeric data; categorical require separate methods
EM Convergence Algorithm reaches stable estimates

Note: Violations of these assumptions can lead to biased or invalid imputations and inference.


References

Allison, P. D. (2002). Missing data (1st ed.). Sage. https://doi.org/10.4135/9781412985079
Andridge, R. R., & Little, R. J. A. (2010). A review of hot deck imputation for survey non-response. International Statistical Review, 78(1), 40–64.
Bodner, T. E. (2006). Missing data: Prevalence and reporting practices. Psychological Reports, 99(3), 675–680.
Buuren, S. van. (2018). Flexible imputation of missing data (2nd ed.). Chapman & Hall/CRC. https://doi.org/10.1201/9780429492259
Carmines, E. G., & McIver, J. P. (1981). Analyzing models with unobserved variables. In G. W. Bohrnstedt & E. F. Borgatta (Eds.), Social measurement: Current issues (pp. 65–115). Sage.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1–38.
Enders, C. K. (2010). Applied missing data analysis. The Guilford Press.
Enders, C. K. (2022). Applied missing data analysis (2nd ed.). The Guilford Press.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549–576. https://doi.org/10.1146/annurev.psych.58.110405.085530
Graham, J. W. (2012). Missing data: Analysis and design. Annual Review of Psychology, 63, 539–560. https://doi.org/10.1146/annurev-psych-120710-100433
Kline, R. B. (2016). Principles and practice of structural equation modeling (4th ed.). Guilford Press.
Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83(404), 1198–1202.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data. Wiley.
Little, R. J. A., & Rubin, D. B. (2019). Statistical analysis with missing data (3rd ed.). Wiley.
Marsh, H. W., & Hocevar, D. (1985). Application of confirmatory factor analysis to the study of self-concept: First- and higher-order factor models and their invariance across groups. Psychological Bulletin, 97(3), 562–582. https://doi.org/10.1037/0033-2909.97.3.562
Orchard, T., & Woodbury, M. A. (1972). A missing information principle: Theory and applications. In Theory of Statistics: Vols. Volume 1 (pp. 697–715). University of California Press; Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics; Probability.
Peugh, J. L., & Enders, C. K. (2004). Missing data in educational research: A review of reporting practices and suggestions for improvement. Review of Educational Research, 74(4), 525–556.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Wiley.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. Chapman; Hall/CRC.
Wheaton, B., Muthén, B., Alwin, D. F., & Summers, G. F. (1977). Assessing reliability and stability in panel models. Sociological Methodology, 8, 84–136. https://doi.org/10.2307/270754