Project History and Acknowledgments

This project is both a functional implementation of multivariate statistical methods (including regression and ANOVA) and a pedagogical resource. Developed as a learning journey, it begins with foundational concepts—such as basic column-wise computations—and systematically progresses to more advanced analyses. The accompanying code and commentary document not only the final solutions but also the iterative process of discovery, including challenges, revisions, and insights.

To emphasize algorithmic understanding over black-box solutions, the implementation deliberately minimizes reliance on built-in R packages, opting instead for transparent, step-by-step calculations.

The work originated in a graduate Multivariate Statistical Methods course at Ohio University in 2011, with initial implementations in SPSS and R. Over time, it evolved into a fully R-based project to leverage the language’s open-source flexibility and advanced statistical capabilities.

The original code, methodology, and documentation have since been refined—partially with AI-assisted tools—to enhance computational efficiency, analytical accuracy, and pedagogical clarity.

1 Computational Linear Algebra with Applications in Statistics

Linear algebra is a branch of mathematics that studies vectors, vector spaces, linear transformations, and matrix operations. It provides fundamental tools for statistical computing, where many problems can be formulated and solved using linear algebraic approaches. Key applications include:

  • Multivariate statistical analysis
  • Inference from linear models
  • Dimension reduction techniques
  • Optimization in statistical learning

Numerical linear algebra forms the computational backbone of modern statistical methods, enabling efficient solutions to large-scale problems in data analysis.


1.1 Important Matrices for Multivariate Analysis

1.1.1 The Data Matrix

At the core of multivariate analysis lies the data matrix, a structured representation of numerical data with:

  • Rows (n): Represent individual observations (cases, samples, or instances)
  • Columns (p): Contain measured variables (features, attributes, or dimensions)

This structure provides the framework for organizing, summarizing, and analyzing complex datasets involving multiple interrelated variables.

1.2 Key Requirements of the Data Matrix

  1. Statistical Independence of Observations
    For valid inference, the rows of the data matrix must represent statistically independent observations. Formally, this is expressed as

\[ \Pr(\text{row}_i \cap \text{row}_j) = \Pr(\text{row}_i)\,\Pr(\text{row}_j), \quad \forall i \neq j. \]

Independence is usually secured through experimental design (e.g., random assignment of treatments) or observational sampling methods such as simple random sampling, stratified sampling, or cluster sampling. Well-constructed designs and randomization procedures reduce unwanted dependencies among units.

  2. Structural Integrity
    Beyond independence, the data matrix must preserve its structural soundness to ensure meaningful statistical analysis:
  • Complete cases or appropriate missing-data handling (e.g., imputation, listwise/pairwise deletion).
  • Consistent units of measurement across all variables.
  • Well-defined variables with clear operational definitions and coding.

Violations of these conditions can distort results, invalidate statistical assumptions, and compromise matrix operations such as covariance computation or factorization.
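
To make these checks concrete, here is a minimal sketch of routine integrity checks in R; the small data frame dat is hypothetical, and any rectangular dataset would do:

# Hypothetical illustration: basic structural-integrity checks before analysis
dat <- data.frame(x1 = c(1, 2, NA, 4),
                  x2 = c(10, 20, 30, 40))
colSums(is.na(dat))        # missing values per variable
sum(complete.cases(dat))   # rows available after listwise deletion
sapply(dat, is.numeric)    # variables are numeric and consistently coded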


1.2.1 Mathematical Representation

A data matrix with \(n\) observations and \(p\) variables is denoted

\[ X \in \mathbb{R}^{n \times p} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\[6px] x_{21} & x_{22} & \cdots & x_{2p} \\[6px] \vdots & \vdots & \ddots & \vdots \\[6px] x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}, \]

where \(x_{ij}\) is the value of the \(j\)-th variable for the \(i\)-th observation.

# Create data matrix X from Woods (1966) cement hardening data
# Columns represent:
# X1 - % tricalcium aluminate
# X2 - % tricalcium silicate
# X3 - % tetracalcium alumino ferrite
# X4 - % dicalcium silicate
# Y - Heat evolved in calories/gram of cement

X <- matrix(c(7, 26, 6, 60, 78.5,
               1, 29, 15, 52, 74.3,
               11, 56, 8, 20, 104.3,
               11, 31, 8, 47, 87.6, 
               7, 52, 6, 33, 95.9,
               11, 55, 9, 22, 109.2,
               3, 71, 17, 6, 102.7,
               1, 31, 22, 44, 72.5,
               2, 54, 18, 22, 93.1,
               21, 47, 4, 26, 115.9,
               1, 40, 23, 34, 83.8,
               11, 66, 9, 12, 113.3,
               10, 68, 8, 12, 109.4), 
             nrow = 13, ncol = 5, byrow = TRUE)

# Set descriptive row and column names
rownames(X) <- paste0("obs.", 1:13)
colnames(X) <- c("X1", "X2", "X3", "X4", "Y")

# Display the matrix structure
str(X)
##  num [1:13, 1:5] 7 1 11 11 7 11 3 1 2 21 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:13] "obs.1" "obs.2" "obs.3" "obs.4" ...
##   ..$ : chr [1:5] "X1" "X2" "X3" "X4" ...
# Save data to RData file
save(X, file = "X.RData")

# Optional: Also save as CSV for interoperability
write.csv(X, file = "X.csv", row.names = TRUE)
# Display with kable and proper headers
kable(X, digits = 1, align = 'c', caption = "Cement Hardening Data (Woods, 1966)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  add_header_above(c(" " = 1, 
                     "Chemical Composition (%)" = 4, 
                     "Heat Evolved" = 1)) %>%
  footnote(general = "X1: Tricalcium Aluminate\nX2: Tricalcium Silicate\nX3: Tetracalcium Alumino Ferrite\nX4: Dicalcium Silicate\nY: calories/gram",
           general_title = "Variable Definitions:") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(5, background = "#f7f7f7")
Cement Hardening Data (Woods, 1966)

              Chemical Composition (%)      Heat Evolved
            X1      X2      X3      X4         Y
  obs.1      7      26       6      60        78.5
  obs.2      1      29      15      52        74.3
  obs.3     11      56       8      20       104.3
  obs.4     11      31       8      47        87.6
  obs.5      7      52       6      33        95.9
  obs.6     11      55       9      22       109.2
  obs.7      3      71      17       6       102.7
  obs.8      1      31      22      44        72.5
  obs.9      2      54      18      22        93.1
  obs.10    21      47       4      26       115.9
  obs.11     1      40      23      34        83.8
  obs.12    11      66       9      12       113.3
  obs.13    10      68       8      12       109.4

Variable Definitions:
X1: Tricalcium Aluminate
X2: Tricalcium Silicate
X3: Tetracalcium Alumino Ferrite
X4: Dicalcium Silicate
Y: calories/gram

1.2.2 Column Sums

Column sums \(\left(\sum_{i=1}^{n} x_{ij}\right)\) for a matrix \(X \in \mathbb{R}^{n \times p}\) serve as a fundamental operation in statistical computing with three primary applications:

  1. Data Aggregation
    For each column \(j = 1, \dots, p\), the column sum is defined as

    \[ S_j = \sum_{i=1}^{n} x_{ij}. \]

    These totals represent the overall magnitude of each variable, forming a cornerstone of descriptive statistics.

  2. Normalization and Scaling
    Column sums provide the basis for normalization procedures, where individual elements are scaled relative to their column sum:

    \[ \tilde{x}_{ij} = \frac{x_{ij}}{S_j}. \]

    This transformation enables proportional analysis and supports compositional data techniques (a short sketch follows this list).

  3. Algorithmic Foundations
    Column sums underpin key computational algorithms in multivariate analysis, such as:

    • Matrix–vector products:
      \[ X^\top y \]
    • Cross-product matrices that feed covariance computation:
      \[ \mathbf{X}^\top \mathbf{X} \quad\text{and, after centering,}\quad \mathbf{S} = \tfrac{1}{n-1}\,\mathbf{X}_c^\top \mathbf{X}_c \]

    These operations are central to regression, principal component analysis, and other statistical methods.
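
As a concrete illustration of the normalization step (item 2 above), here is a short sketch using the cement data matrix X defined in Section 1.2.1 and base R only; sweep() divides each column by its total:

S_j   <- colSums(X)              # column totals S_j
X_til <- sweep(X, 2, S_j, "/")   # x_ij / S_j, applied column-wise
colSums(X_til)                   # each normalized column now sums to 1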


Column Sums via Matrix Multiplication

Let \(X \in \mathbb{R}^{n \times p}\) be a data matrix.
Define a row vector of ones \(\mathbf{1}_n^\top \in \mathbb{R}^{1 \times n}\):

\[ \mathbf{1}_n^\top = \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix}. \]

The vector of column sums \(S \in \mathbb{R}^{p}\) is then obtained by a single matrix multiplication:

\[ S^\top \;=\; \mathbf{1}_n^\top X \quad \Longleftrightarrow \quad S_j \;=\; \sum_{i=1}^n x_{ij}, \; j=1,\dots,p. \]

#' Calculate Column Sums Identical to colSums()
#'
#' Computes column sums using matrix multiplication while perfectly replicating
#' base R's colSums() output format (named numeric vector with same spacing).
#'
#' @param X A matrix or data frame
#' @param na.rm Logical; should NA values be removed? Default TRUE (note that base colSums() defaults to na.rm = FALSE)
#' @return A named numeric vector identical to colSums() output
#' @export
#' @examples
#' X <- matrix(c(97,626,153,390,1241), nrow=1,
#'             dimnames=list(NULL, c("X1","X2","X3","X4","Y")))
#' SUMS(X)    # Returns c(X1=97, X2=626, X3=153, X4=390, Y=1241)
#' colSums(X) # Identical output

SUMS <- function(X, na.rm = TRUE) {
  # Convert to matrix while preserving names
  original_names <- colnames(X)
  X <- data.matrix(X)
  
  # NA handling
  if (na.rm) {
    X[is.na(X)] <- 0
  } else if (anyNA(X)) {
    has_na <- colSums(is.na(X)) > 0
  }
  
  # Compute sums via matrix multiplication
  n <- nrow(X)
  if (n == 0) {
    sums <- rep(NA_real_, ncol(X))
  } else {
    J <- matrix(1, nrow = 1, ncol = n) # 1×n ones row vector
    sums <- as.vector(J %*% X)  # (1×n) %*% (n×p) -> (1×p), then drop to length-p vector
  }
  
  # Apply NA propagation if needed
  if (!na.rm && exists("has_na", inherits = FALSE)) {
    sums[has_na] <- NA
  }
  
  # Format output to match colSums(): a named numeric vector
  names(sums) <- original_names
  
  return(sums)
}

# Save function to file for reuse
dump("SUMS", file = "SUMS.R")
# Compare custom SUMS() with base R colSums()
data.frame(
  Custom = SUMS(X),
  Base   = colSums(X)
)
##    Custom   Base
## X1   97.0   97.0
## X2  626.0  626.0
## X3  153.0  153.0
## X4  390.0  390.0
## Y  1240.5 1240.5

1.2.3 Column Means

In multivariate data, columns (variables) represent measured attributes or features. The column mean—the average value of each variable—is a fundamental statistic in both theory and practice.

For a data matrix \(X \in \mathbb{R}^{n \times p}\), the mean of column \(j\) is defined as

\[ \bar{x}_{\cdot j} = \frac{1}{n} \sum_{i=1}^n x_{ij}, \quad j = 1, \dots, p. \]


1.2.4 Importance of Column Means

  1. Data Reduction and Analysis
    Column means appear in core methods such as dimensionality reduction, covariance matrix computation, and principal component analysis (PCA).

  2. Data Preprocessing
    They are essential for centering (subtracting means to re-express data around zero) and normalization (scaling variables for comparability).

  3. Interpretation of Central Tendency
    Column means summarize the average behavior of each variable, serving as a baseline reference point for detecting deviations, outliers, or structural patterns.


1.2.5 Matrix Representation

Column means can be expressed compactly using matrix operations:

\[ \bar{x}^\top = \frac{1}{n} \mathbf{1}_n^\top X, \]

where \(\mathbf{1}_n\) is an \(n \times 1\) vector of ones.

This shows that column means are just a rescaled version of column sums:

\[ \bar{x}_j = \frac{S_j}{n}. \]

#' Compute Column Means Identical to colMeans()
#'
#' Calculates column means using matrix multiplication, mirroring base R's
#' colMeans() output, NA handling, and input expectations.
#'
#' @param X A numeric matrix or data frame (not a bare vector; matches colMeans())
#' @param na.rm Logical; remove NAs if TRUE. Default: TRUE
#' @return A named numeric vector matching colMeans() output
#' @export
#' @examples
#' X <- matrix(c(7.46, 48.2, 11.8, 30, 95.4), nrow = 1,
#'             dimnames = list(NULL, c("X1", "X2", "X3", "X4", "Y")))
#' MEANS(X)    # c(X1 = 7.46, X2 = 48.2, X3 = 11.8, X4 = 30, Y = 95.4)
#' colMeans(X) # identical
MEANS <- function(X, na.rm = TRUE) {
  # --- Input validation (match colMeans semantics) ---
  if (is.null(dim(X))) {
    stop("dim(X) must have a positive length")  # colMeans() error on bare vector
  }
  # Coerce data frames -> matrix, but keep names
  original_names <- colnames(X)
  X <- as.matrix(X)

  # Ensure numeric mode (colMeans errors if not)
  if (!is.numeric(X)) {
    stop("'x' must be numeric")  # matches colMeans() message
  }

  # --- NA handling plan ---
  n <- nrow(X)
  p <- ncol(X)

  if (na.rm) {
    # Count non-missing per column; replace NA with 0 for summation
    n_valid <- colSums(!is.na(X))
    X[is.na(X)] <- 0
  } else {
    # If any NA in a column, that column's mean should be NA
    has_na <- colSums(is.na(X)) > 0
    n_valid <- rep(n, p)  # all n (unused when na.rm=FALSE except for clarity)
  }

  # --- Core computation via matrix multiplication: 1_n^T X / n ---
  # sums = [1 ... 1] %*% X
  if (n == 0L) {
    # colMeans on a 0×p matrix returns NaN for each column
    means <- rep(NaN, p)
  } else {
    J <- matrix(1, nrow = 1, ncol = n)     # 1×n row of ones (1_n^T)
    means <- (J %*% X) / n                 # (1×n)(n×p) -> (1×p); divide by n
    means <- as.vector(means)              # drop to length-p vector
  }

  # --- Adjust for na.rm=TRUE (divide by non-missing counts) ---
  if (na.rm) {
    # means currently equals (sum over observed)/n; rescale by n/n_valid
    # (This yields sum / n_valid; columns with n_valid==0 become NaN)
    scale <- n / n_valid
    means <- means * scale
  } else if (exists("has_na", inherits = FALSE) && any(has_na)) {
    means[has_na] <- NA_real_
  }

  # --- Restore names exactly like colMeans() ---
  names(means) <- original_names

  means
}

# Save function
dump("MEANS", file = "MEANS.R")
# Compare MEANS() with base colMeans()
data.frame(
  MEANS = MEANS(X),
  Base  = colMeans(X)
)
##        MEANS      Base
## X1  7.461538  7.461538
## X2 48.153846 48.153846
## X3 11.769231 11.769231
## X4 30.000000 30.000000
## Y  95.423077 95.423077

1.3 Deviation Scores Matrix

The deviation scores matrix \(\mathbf{D} \in \mathbb{R}^{n \times p}\) represents mean-centered data, defined by the transformation

\[ \mathbf{D} = \mathbf{X} - \mathbf{1}\,\bar{\mathbf{x}}^\top, \]

where:

  • \(\mathbf{X} \in \mathbb{R}^{n \times p}\): the original data matrix
  • \(\mathbf{1} \in \mathbb{R}^{n \times 1}\): a vector of ones
  • \(\bar{\mathbf{x}} = [\bar{x}_1, \ldots, \bar{x}_p]^\top\): the column means vector

Each element is computed as

\[ d_{ij} = x_{ij} - \bar{x}_j, \]

which measures the departure of observation \(i\) on variable \(j\) from its mean.


Key Properties

  1. Zero Column Sums
    By construction, each column of \(\mathbf{D}\) sums to zero:

    \[ \mathbf{1}^\top \mathbf{D} = \mathbf{0}^\top. \]

  2. Connection to SSCP Matrix
    The product \(\mathbf{D}^\top \mathbf{D}\) yields the sum-of-squares-and-cross-products (SSCP) matrix, which is the foundation for computing covariance and correlation matrices.

  3. Dimensionality Preservation
    Centering changes the rank of the data matrix by at most one (it leaves the rank unchanged unless the vector of ones lies in the column space of \(\mathbf{X}\), as with a constant column). Geometrically, it re-centers the coordinate system at the multivariate mean, preserving the intrinsic dimensionality of the data cloud.
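
A quick numeric check of properties 1 and 3 on the cement data (a sketch; because the deviation_scores() helper is only defined below, sweep() is used for centering here):

D0 <- sweep(X, 2, colMeans(X))   # column-centered copy of X
round(colSums(D0), 10)           # property 1: every column sums to (numerically) zero
c(qr(X)$rank, qr(D0)$rank)       # property 3: rank 5 both before and after centering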


1.3.1 Matrix Representation

\[ \mathbf{D} = \mathbf{X} - \mathbf{1}\boldsymbol{\mu}^\top = \begin{pmatrix} x_{11} - \mu_1 & x_{12} - \mu_2 & \cdots & x_{1p} - \mu_p \\[12px] x_{21} - \mu_1 & x_{22} - \mu_2 & \cdots & x_{2p} - \mu_p \\[12px] \vdots & \vdots & \ddots & \vdots \\[12px] x_{n1} - \mu_1 & x_{n2} - \mu_2 & \cdots & x_{np} - \mu_p \end{pmatrix} \]

where:

\[ \begin{aligned} \mathbf{X} &= (x_{ij}) \in \mathbb{R}^{n \times p} \quad &\text{(data matrix)} \\[12px] \boldsymbol{\mu} &= (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p)^\top \quad &\text{(mean vector)} \\[12px] \mathbf{1} &= (1, 1, \ldots, 1)^\top \in \mathbb{R}^n \quad &\text{(ones vector)} \end{aligned} \]

#' Deviation Scores Matrix (short form)
#'
#' Mean-centers each column using the custom MEANS() function. The column
#' means are expanded to a full n × p matrix before subtraction, so the result
#' does not depend on R's vector-recycling rules.
#'
#' @param X A numeric matrix (n × p) or coercible object
#' @return A numeric matrix with the same dimnames as X, columns centered to ~0
#' @examples
#' X <- matrix(1:12, ncol = 3, dimnames = list(NULL, c("V1","V2","V3")))
#' D <- deviation_scores(X)
#' MEANS(D)                      # ~ 0, 0, 0
#' all.equal(D, sweep(X, 2, colMeans(X)))   # TRUE
#' @export
deviation_scores <- function(X) {
  
  source("MEANS.R")
  
  X <- as.matrix(X)
  D <- X - matrix(MEANS(X), nrow = nrow(X), ncol = ncol(X), byrow = TRUE)
  dimnames(D) <- dimnames(X)    # keep row/column names
  D
}

# Save function
dump("deviation_scores", file = "deviation_scores.R")

This code verifies that the deviation_scores() function is working correctly by checking two key mathematical properties of properly centered data:

  • Mean of deviation scores should be 0 for each column
  • Sum of deviation scores should be 0 for each column
cat("--- Deviation Score Checks ---\n")
## --- Deviation Score Checks ---
cat("Mean of deviation scores: ", MEANS(deviation_scores(X)), "\n")
## Mean of deviation scores:  -1.366428e-16 5.465713e-16 -8.19857e-16 0 -1.093143e-14
cat("Sum of deviation scores:  ",SUMS(deviation_scores(X)), "\n\n")
## Sum of deviation scores:   -1.776357e-15 7.105427e-15 -1.065814e-14 0 -1.421085e-13
cat("Note: Values should be very close to 0 (floating-point precision limits exact zeros)\n")
## Note: Values should be very close to 0 (floating-point precision limits exact zeros)
# Using the function
(D <- deviation_scores(X))
##                X1         X2        X3  X4           Y
## obs.1  -0.4615385 -22.153846 -5.769231  30 -16.9230769
## obs.2  -6.4615385 -19.153846  3.230769  22 -21.1230769
## obs.3   3.5384615   7.846154 -3.769231 -10   8.8769231
## obs.4   3.5384615 -17.153846 -3.769231  17  -7.8230769
## obs.5  -0.4615385   3.846154 -5.769231   3   0.4769231
## obs.6   3.5384615   6.846154 -2.769231  -8  13.7769231
## obs.7  -4.4615385  22.846154  5.230769 -24   7.2769231
## obs.8  -6.4615385 -17.153846 10.230769  14 -22.9230769
## obs.9  -5.4615385   5.846154  6.230769  -8  -2.3230769
## obs.10 13.5384615  -1.153846 -7.769231  -4  20.4769231
## obs.11 -6.4615385  -8.153846 11.230769   4 -11.6230769
## obs.12  3.5384615  17.846154 -2.769231 -18  17.8769231
## obs.13  2.5384615  19.846154 -3.769231 -18  13.9769231

1.4 Sums of Squares and Cross Products Matrix (SSCP)

The Sums of Squares and Cross Products (SSCP) matrix is a fundamental quadratic form in multivariate statistics that captures both the magnitude and interdependence of variables. For a data matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\) containing \(n\) observations of \(p\) variables, the SSCP matrix is given by:

\[ \mathbf{S} = \mathbf{X}^\top\mathbf{X} = \sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^\top \]

where \(\mathbf{x}_i^\top\) represents the \(i\)-th row of \(\mathbf{X}\) (a \(1 \times p\) vector).

1.4.1 Explicit Matrix Formulation

\[ \mathbf{X}^\top\mathbf{X} = \begin{pmatrix} \|\mathbf{x}_{(1)}\|^2 & \langle\mathbf{x}_{(1)}, \mathbf{x}_{(2)}\rangle & \cdots & \langle\mathbf{x}_{(1)}, \mathbf{x}_{(p)}\rangle \\[12px] \langle\mathbf{x}_{(2)}, \mathbf{x}_{(1)}\rangle & \|\mathbf{x}_{(2)}\|^2 & \cdots & \langle\mathbf{x}_{(2)}, \mathbf{x}_{(p)}\rangle \\[12px] \vdots & \vdots & \ddots & \vdots \\[12px] \langle\mathbf{x}_{(p)}, \mathbf{x}_{(1)}\rangle & \langle\mathbf{x}_{(p)}, \mathbf{x}_{(2)}\rangle & \cdots & \|\mathbf{x}_{(p)}\|^2 \end{pmatrix} \]

where:

  • \(\|\mathbf{x}_{(j)}\|^2 = \sum_{i=1}^n x_{ij}^2\) (squared Euclidean norm)
  • \(\langle\mathbf{x}_{(j)}, \mathbf{x}_{(k)}\rangle = \sum_{i=1}^n x_{ij}x_{ik}\) (inner product)
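
A quick entry-wise check of this interpretation with the cement data X from Section 1.2.1 (base R only; the SSCP() helper is defined later in this section):

G <- t(X) %*% X               # raw SSCP matrix X'X
c(G["X1", "X1"], sum(X[, "X1"]^2))            # diagonal: squared column norm
c(G["X1", "X2"], sum(X[, "X1"] * X[, "X2"]))  # off-diagonal: column inner product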

Key Properties

  • Symmetric Positive Semi-definite:
    • \(\mathbf{S}^\top = \mathbf{S}\)
    • \(\mathbf{v}^\top\mathbf{S}\mathbf{v} \geq 0\) for all \(\mathbf{v} \in \mathbb{R}^p\)
  • Rank Properties:
    • \(\text{rank}(\mathbf{S}) = \text{rank}(\mathbf{X}) \leq \min(n,p)\)
    • Full rank when columns are linearly independent
  • Relation to Sample Covariance:
    • For centered data, the covariance matrix is a rescaled (corrected) SSCP:

\[ \begin{align} \operatorname{Cov}(\mathbf{X}) &= \frac{1}{n-1}\,\mathbf{D}^\top \mathbf{D} \\[12px] &= \frac{1}{n-1}\,\mathbf{X}_c^\top \mathbf{X}_c \\[12px] &= \frac{1}{n-1}\,\text{SSCP}(\mathbf{X}_c) \end{align} \]

where:

  • \(\mathbf{X}_c\) is the column-centered data matrix (\(\mathbf{X}_c = \mathbf{X} - \mathbf{1}\boldsymbol{\mu}^\top\))
  • \(\operatorname{Cov}(\mathbf{X})\) is the sample covariance matrix
  • \(n\) is the number of observations
#' Compute Sums of Squares and Cross Products (SSCP) Matrix
#'
#' Calculates the SSCP matrix for a numeric data matrix with optional centering.
#'
#' @param X A numeric matrix or data frame (n × p).
#' @param center Logical; center columns before SSCP (default = FALSE).
#' @return A symmetric p × p SSCP matrix with dimnames preserved.
#' @export
SSCP <- function(X, center = FALSE) {

  source("MEANS.R")

  if (!is.matrix(X) && !is.data.frame(X)) stop("Input must be a matrix or data frame")
  X <- as.matrix(X)
  if (!is.numeric(X)) stop("'X' must be numeric")

  # Listwise deletion (complete cases)
  if (anyNA(X)) {
    X <- X[complete.cases(X), , drop = FALSE]
  }

  # Optional centering
  if (center && nrow(X) > 0L) {
    X <- X - matrix(MEANS(X), nrow = nrow(X), ncol = ncol(X), byrow = TRUE)
  }

  # SSCP = X^T X
  sscp_matrix <- t(X) %*% X 

  colnames(sscp_matrix) <- rownames(sscp_matrix) <- colnames(X)
  sscp_matrix
}

# Save function with improved version
dump("SSCP", file = "SSCP.R")
SSCP(X)
##       X1      X2      X3      X4        Y
## X1  1139  4922.0   769.0  2620.0  10032.0
## X2  4922 33050.0  7201.0 15739.0  62027.8
## X3   769  7201.0  2293.0  4628.0  13981.5
## X4  2620 15739.0  4628.0 15062.0  34733.3
## Y  10032 62027.8 13981.5 34733.3 121088.1

1.5 Key Differences Between SSCP and Covariance Matrices

  • Scale: SSCP is sum-based (sums of squares and cross-products); covariance is average-based (sums divided by a divisor).
  • Units: SSCP entries are in products of the original units (diagonal = \(u_j^2\)); covariance entries have the same units, merely averaged.
  • Centering: SSCP may be raw (\(\mathbf{X}^\top\mathbf{X}\)) or centered (\(\mathbf{X}_c^\top\mathbf{X}_c\)); covariance is always centered (defined from deviations).
  • Divisor: SSCP uses none (pure sums); covariance divides by \(n-1\) (sample) or \(n\) (MLE).
  • Matrix form: SSCP is \(\mathbf{X}^\top\mathbf{X}\) (raw) or \(\mathbf{X}_c^\top\mathbf{X}_c\) (centered); covariance is \(\mathbf{S}=\frac{1}{n-1}\mathbf{X}_c^\top\mathbf{X}_c\) (sample) or \(\frac{1}{n}\mathbf{X}_c^\top\mathbf{X}_c\) (MLE).

Note. “Standardized units” apply to the correlation matrix, not the covariance matrix. Covariance retains the original measurement units (products thereof); correlations are unitless.


Mathematical Relationships

Let \(\mathbf{X}_c\) be mean-centered data and \(\text{SSCP}_c = \mathbf{X}_c^\top\mathbf{X}_c\).

  1. Matrix relationship \[ \text{SSCP}_c = (n-1)\,\mathbf{S} \quad \text{(sample covariance)}, \qquad \text{SSCP}_c = n\,\mathbf{\Sigma}_{\text{MLE}} \quad \text{(MLE)}. \]

  2. Variance (diagonal entries) \[ \operatorname{Var}(X_j) = s_{jj} = \frac{(\text{SSCP}_c)_{jj}}{n-1}. \]

  3. Covariance (off-diagonal entries) \[ \operatorname{Cov}(X_j, X_k) = s_{jk} = \frac{(\text{SSCP}_c)_{jk}}{n-1}. \]

If you use the raw (uncentered) SSCP, \(\mathbf{X}^\top\mathbf{X}\), then \[ \mathbf{X}^\top\mathbf{X} = \mathbf{X}_c^\top\mathbf{X}_c \;+\; n\,\bar{\mathbf{x}}\,\bar{\mathbf{x}}^\top, \] so centering is required to recover covariance from SSCP.
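
These identities can be verified numerically with the cement data (a sketch assuming the SSCP() and MEANS() functions defined above; base cov() is used only for the comparison):

n <- nrow(X)
SSCP_raw <- SSCP(X)                  # X'X
SSCP_c   <- SSCP(X, center = TRUE)   # Xc'Xc
all.equal(SSCP_c, (n - 1) * cov(X))  # centered SSCP = (n-1) * sample covariance
xbar <- MEANS(X)
all.equal(SSCP_raw, SSCP_c + n * tcrossprod(xbar),   # raw = centered + n * xbar xbar'
          check.attributes = FALSE)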


1.6 The Corrected Sums of Squares and Cross Products Matrix (CSSCP)

The Corrected Sums of Squares and Cross Products (CSSCP) matrix, sometimes called the deviation scores sums of squares and cross products matrix, is constructed from mean-centered data, where each observation is expressed as a deviation from its column mean.

For a data matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\), let

\[ \mathbf{D} = \mathbf{X} - \mathbf{1}\bar{\mathbf{x}}^\top \]

be the deviation scores matrix, with \(\bar{\mathbf{x}}\) the vector of column means.

The CSSCP matrix is then

\[ \text{CSSCP}_{(p \times p)} = \mathbf{D}^\top \mathbf{D} = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}^\top)^\top (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}^\top). \]

In element form:

\[ \text{CSSCP}_{jk} = \sum_{i=1}^n (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k). \]


Matrix Layout

\[ \text{CSSCP} = \begin{pmatrix} \sum d_1^2 & \sum d_1 d_2 & \sum d_1 d_3 & \cdots & \sum d_1 d_p \\[12px] \sum d_2 d_1 & \sum d_2^2 & \sum d_2 d_3 & \cdots & \sum d_2 d_p \\[12px] \sum d_3 d_1 & \sum d_3 d_2 & \sum d_3^2 & \cdots & \sum d_3 d_p \\[12px] \vdots & \vdots & \vdots & \ddots & \vdots \\[12px] \sum d_p d_1 & \sum d_p d_2 & \sum d_p d_3 & \cdots & \sum d_p^2 \end{pmatrix} \]

where \(d_j = x_j - \bar{x}_j\) denotes the deviation scores for variable \(j\).


Interpretation

  • Diagonal entries (\(\sum d_j^2\)) represent the corrected sums of squares, i.e., the total variation of each variable around its mean.
  • Off-diagonal entries (\(\sum d_j d_k\)) represent the corrected sums of cross-products, i.e., how two variables covary after centering.

Relationship to Covariance

The sample covariance matrix \(\mathbf{S}\) is a rescaled version of the CSSCP:

\[ \mathbf{S} = \frac{1}{n-1} \, \text{CSSCP}. \]

Thus, CSSCP provides the raw sums of variation and co-variation, while covariance standardizes these by dividing by the sample degrees of freedom.

#' Compute Centered Sums of Squares and Cross Products (CSSCP) Matrix
#'
#' Calculates the CSSCP (centered SSCP) matrix with listwise deletion
#' and column centering via the custom MEANS() function.
#'
#' @param X A numeric matrix or data frame (n × p)
#' @return A symmetric p × p CSSCP matrix
#' @examples
#' # Basic usage
#' CSSCP(X)
#'
#' # Relationship to covariance
#' n_eff <- nrow(na.omit(as.data.frame(X)))  # effective sample size
#' CSSCP(X) / (n_eff - 1)
#' all.equal(CSSCP(X)/(n_eff - 1), cov(X, use = "complete.obs"))
#' @export
CSSCP <- function(X) {

  source("MEANS.R") # (imported at script/package level, not inside function) 

  if (!is.matrix(X) && !is.data.frame(X)) {
    stop("Input must be a matrix or data frame")
  }
  X <- as.matrix(X)
  if (!is.numeric(X)) stop("'X' must be numeric")

  # Listwise deletion
  if (anyNA(X)) {
    X <- X[complete.cases(X), , drop = FALSE]
  }
  if (nrow(X) == 0L) {
    return(matrix(NA_real_, ncol(X), ncol(X),
                  dimnames = list(colnames(X), colnames(X))))
  }

  # Center with custom MEANS()
  mu <- MEANS(X)
  Xc <- X - matrix(mu, nrow = nrow(X), ncol = ncol(X), byrow = TRUE)

  # CSSCP = D^T D (matrix multiplication form)
  csscp <- t(Xc) %*% Xc

  dimnames(csscp) <- list(colnames(X), colnames(X))
  csscp
}

# Save function
dump("CSSCP", file = "CSSCP.R")
# --- Basic usage: CSSCP matrix ---
cat("Raw sums of squares & cross-products: \n")
## Raw sums of squares & cross-products:
CSSCP(X)
##           X1         X2        X3      X4          Y
## X1  415.2308   251.0769 -372.6154  -290.0   775.9615
## X2  251.0769  2905.6923 -166.5385 -3041.0  2292.9538
## X3 -372.6154  -166.5385  492.3077    38.0  -618.2308
## X4 -290.0000 -3041.0000   38.0000  3362.0 -2481.7000
## Y   775.9615  2292.9538 -618.2308 -2481.7  2715.7631
# Output: A p × p symmetric matrix
# - Diagonal entries = corrected sums of squares (variation of each variable)
# - Off-diagonal entries = corrected sums of cross-products (co-variation)

# --- Relationship to covariance matrix ---
cat("\n\nSample covariance matrix: \n")
## 
## 
## Sample covariance matrix:
n_eff <- nrow(na.omit(as.data.frame(X)))   # effective sample size after listwise deletion
CSSCP(X) / (n_eff - 1)
##           X1         X2         X3          X4          Y
## X1  34.60256   20.92308 -31.051282  -24.166667   64.66346
## X2  20.92308  242.14103 -13.878205 -253.416667  191.07949
## X3 -31.05128  -13.87821  41.025641    3.166667  -51.51923
## X4 -24.16667 -253.41667   3.166667  280.166667 -206.80833
## Y   64.66346  191.07949 -51.519231 -206.808333  226.31359
# Output: The sample covariance matrix
# - Each diagonal element = variance of a variable
# - Each off-diagonal element = covariance between two variables

# --- Compare with R's built-in cov() ---
cat("\n\nR built-in covariance: \n")
## 
## 
## R built-in covariance:
cov(X, use = "complete.obs")
##           X1         X2         X3          X4          Y
## X1  34.60256   20.92308 -31.051282  -24.166667   64.66346
## X2  20.92308  242.14103 -13.878205 -253.416667  191.07949
## X3 -31.05128  -13.87821  41.025641    3.166667  -51.51923
## X4 -24.16667 -253.41667   3.166667  280.166667 -206.80833
## Y   64.66346  191.07949 -51.519231 -206.808333  226.31359
# Output: Matches CSSCP(X)/(n_eff - 1)
# Confirms that CSSCP provides the numerator of the covariance computation

1.7 Variance–Covariance Matrix

Variance measures the dispersion of a single variable; covariance measures how two variables co-vary, i.e., whether they tend to increase or decrease together.

In a variance–covariance matrix \(\operatorname{Cov}(\mathbf{X})\in\mathbb{R}^{p\times p}\):

  • Diagonal entries \(s_{jj}=\operatorname{Var}(X_j)\) are variances.
  • Off-diagonal entries \(s_{jk}=\operatorname{Cov}(X_j,X_k)\) are covariances.
  • The matrix is symmetric and positive semi-definite.

Correlation vs. Independence

  • If two variables are independent, then their covariance is 0.
  • The converse need not hold: variables can be uncorrelated (zero covariance) yet still dependent (covariance only captures linear association).
  • Under a multivariate normal model, zero covariance implies independence.
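
A small simulated sketch of the second point (zero covariance without independence); the variables here are hypothetical:

set.seed(123)
x <- rnorm(1e5)    # symmetric about zero
y <- x^2           # fully determined by x, hence dependent
cov(x, y)          # ~ 0 (the theoretical covariance is exactly zero)
cor(x, y)          # ~ 0 despite the perfect nonlinear relationship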

Sample computation (from centered data)

Let \(\mathbf{X}_c\) be mean-centered data and \(\text{CSSCP}=\mathbf{X}_c^\top\mathbf{X}_c\).

  • Sample covariance: \(\displaystyle \mathbf{S}=\frac{1}{n-1}\text{CSSCP}\).
  • MLE covariance: \(\displaystyle \hat{\Sigma}_{\text{MLE}}=\frac{1}{n}\text{CSSCP}\).

Precision (Concentration) Matrix

The inverse covariance \(\mathbf{\Omega}=\Sigma^{-1}\) is the precision (or concentration) matrix.

  • In the multivariate normal case, \(\Omega_{jk}=0\) \(\iff\) variables \(X_j\) and \(X_k\) are conditionally independent given all others.
  • Partial variances: \(1/\Omega_{jj}\).
  • Partial correlations:
    \[ \rho_{jk\cdot -jk} \;=\; -\,\frac{\Omega_{jk}}{\sqrt{\Omega_{jj}\,\Omega_{kk}}}. \]

These properties make the precision matrix central to graphical models, regularization (e.g., GLASSO), and algorithms such as EM, where conditional relationships among variables are exploited.
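
A sketch of these quantities on the cement data, using base cov() and solve() since the custom COV() function appears later in this section; only the four composition variables are used to keep the matrix well conditioned:

S4    <- cov(X[, c("X1", "X2", "X3", "X4")])
Omega <- solve(S4)                                   # precision (concentration) matrix
1 / diag(Omega)                                      # partial variances
pcor  <- -Omega / sqrt(diag(Omega) %o% diag(Omega))  # partial correlations given the rest
diag(pcor) <- 1
round(pcor, 3)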

The covariance matrix is defined as:

\[ \begin{align} \operatorname{Cov}(\mathbf{X}) &= \mathbb{E}\!\left[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^\top\right] \\[12px] &= \begin{pmatrix} \sigma^2(X_1) & \sigma(X_1, X_2) & \sigma(X_1, X_3) & \cdots & \sigma(X_1, X_p) \\[12px] \sigma(X_2, X_1) & \sigma^2(X_2) & \sigma(X_2, X_3) & \cdots & \sigma(X_2, X_p) \\[12px] \sigma(X_3, X_1) & \sigma(X_3, X_2) & \sigma^2(X_3) & \cdots & \sigma(X_3, X_p) \\[12px] \vdots & \vdots & \vdots & \ddots & \vdots \\[12px] \sigma(X_p, X_1) & \sigma(X_p, X_2) & \sigma(X_p, X_3) & \cdots & \sigma^2(X_p) \\ \end{pmatrix} \end{align} \]

where:

\(\qquad \sigma^2(X_i)\) is the variance of variable \(X_i\),

\(\qquad \sigma(X_i, X_j)\) is the covariance between \(X_i\) and \(X_j\), and

\(\qquad p\) is the number of variables (columns in the data matrix).

#' Compute Covariance Matrix Using CSSCP (Listwise, Matrix Ops)
#'
#' Computes the covariance matrix via the identity:
#'   cov(X) = CSSCP(X) / (n_eff - 1)     [unbiased]
#'   cov_ML = CSSCP(X) /  n_eff          [MLE]
#'
#' Assumes CSSCP() is available and implements:
#'   - listwise deletion (complete cases)
#'   - centering with MEANS()
#'   - CSSCP = t(D) %*% D  
#'
#' @param X A numeric matrix or data frame (n × p)
#' @param unbiased Logical; TRUE uses n_eff - 1 (sample), FALSE uses n_eff (MLE)
#' @return A symmetric p × p covariance matrix with dimnames preserved
#' @examples
#' # Standard sample covariance (matches cov(X, use = "complete.obs"))
#' C <- COV(X)
#' all.equal(C, cov(X, use = "complete.obs"))
#'
#' # MLE version (divide by n_eff)
#' C_ml <- COV(X, unbiased = FALSE)
#' @export
#' 
COV <- function(X, unbiased = TRUE) {
  
  source("CSSCP.R")
  
  # Preconditions
  if (!is.matrix(X) && !is.data.frame(X)) stop("Input must be a matrix or data frame")
  X <- as.matrix(X)
  if (!is.numeric(X)) stop("'X' must be numeric")

  # Effective n after listwise deletion (must match CSSCP's behavior)
  n_eff <- nrow(na.omit(as.data.frame(X)))
  if (n_eff < 2L) {
    warning("Insufficient complete observations (n_eff < 2)")
    return(matrix(NA_real_, ncol(X), ncol(X),
                  dimnames = list(colnames(X), colnames(X))))
  }

  # Numerator: centered sums of squares & cross-products (listwise)
  csscp <- CSSCP(X)  # expects listwise deletion + MEANS-based centering + t(D) %*% D

  # Denominator
  denom <- if (unbiased) n_eff - 1 else n_eff

  # Covariance
  cov_mat <- csscp / denom

  # Symmetrize defensively (handles tiny FP asymmetries)
  cov_mat <- (cov_mat + t(cov_mat)) / 2

  dimnames(cov_mat) <- list(colnames(X), colnames(X))
  cov_mat
}

# Save function
dump("COV", file = "COV.R")
cat("Custom covariance via CSSCP:\n")
## Custom covariance via CSSCP:
C_custom <- COV(X)
C_custom
##           X1         X2         X3          X4          Y
## X1  34.60256   20.92308 -31.051282  -24.166667   64.66346
## X2  20.92308  242.14103 -13.878205 -253.416667  191.07949
## X3 -31.05128  -13.87821  41.025641    3.166667  -51.51923
## X4 -24.16667 -253.41667   3.166667  280.166667 -206.80833
## Y   64.66346  191.07949 -51.519231 -206.808333  226.31359
# Output: Sample covariance matrix computed using your custom pipeline
# (listwise deletion + MEANS() + CSSCP + division by n-1)

cat("\n\nBuilt-in R covariance for comparison:\n")
## 
## 
## Built-in R covariance for comparison:
C_builtin <- cov(X, use = "complete.obs")
C_builtin
##           X1         X2         X3          X4          Y
## X1  34.60256   20.92308 -31.051282  -24.166667   64.66346
## X2  20.92308  242.14103 -13.878205 -253.416667  191.07949
## X3 -31.05128  -13.87821  41.025641    3.166667  -51.51923
## X4 -24.16667 -253.41667   3.166667  280.166667 -206.80833
## Y   64.66346  191.07949 -51.519231 -206.808333  226.31359
# Output: Sample covariance matrix from base R

cat("\n\nCheck equivalence:\n")
## 
## 
## Check equivalence:
all.equal(C_custom, C_builtin)
## [1] TRUE
# Output: TRUE (up to floating-point tolerance)

1.8 Basic Linear Algebra for Statistics

1.8.1 Trace of a Square Matrix

The trace of a square matrix is the sum of the elements on its main diagonal.
For a square matrix \(A \in \mathbb{R}^{n \times n}\):

\[ \operatorname{tr}(A) = \sum_{i=1}^{n} a_{ii}. \]


Key Properties

  • Eigenvalue connection:
    The trace equals the sum of the eigenvalues of \(A\) (counted with algebraic multiplicity), even if some eigenvalues are complex.

  • Invariance under similarity transformations:
    For any invertible matrix \(P\),
    \[ \operatorname{tr}(A) = \operatorname{tr}(P^{-1}AP). \]
    Thus, the trace is independent of the choice of basis.
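
Both properties are easy to check numerically (a sketch with generic random matrices, separate from the cement example):

set.seed(42)
M <- matrix(rnorm(16), 4, 4)       # generic square matrix
P <- matrix(rnorm(16), 4, 4)       # generic matrix; invertible with probability 1
sum(diag(M))                       # tr(M)
Re(sum(eigen(M)$values))           # sum of eigenvalues (imaginary parts cancel)
sum(diag(solve(P) %*% M %*% P))    # tr(P^{-1} M P), equal up to rounding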


Statistical Relevance

  • In multivariate statistics, the trace often appears as a scalar summary of variance–covariance matrices (e.g., total variance = trace of the covariance matrix).
  • In linear regression, it shows up in degrees of freedom (e.g., \(\operatorname{tr}(H)\), where \(H\) is the hat matrix).
  • More generally, it provides a compact way to summarize the “size” of a matrix without requiring its full spectrum.
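
A brief illustration of the first two points, assuming the cement data X and the COV() function from Section 1.7; lm() and hatvalues() are used only for the hat-matrix check:

sum(diag(COV(X)))            # total variance = trace of the covariance matrix
fit <- lm(Y ~ X1 + X2 + X3 + X4, data = as.data.frame(X))
sum(hatvalues(fit))          # tr(H) = 5 = number of estimated coefficients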

#' TRACE: Matrix Trace via Custom Ops (SUMS/MEANS) and Matrix Algebra
#'
#' Computes the trace of a square numeric matrix (sum of diagonal elements),
#' with flexible NA handling and optional normalization. Designed to fit your
#' custom pipeline (SUMS/MEANS) and matrix-operation style.
#'
#' NA handling:
#' - "fail"   : error if any diagonal entry is NA
#' - "omit"   : drop NA diagonal entries from the sum
#' - "ignore" : propagate NA (i.e., return NA if any diagonal is NA)
#' - "mean"   : impute NA diagonal entries with their column means via MEANS()
#'
#' @param X A square numeric matrix or data frame
#' @param na_action One of c("omit","fail","ignore","mean"); default "omit"
#' @param normalized Logical; if TRUE, divide by the effective diagonal length
#'                   (n for fail/ignore/mean; number of non-NA diagonals for omit)
#' @return Numeric scalar: trace(X) (normalized if requested)
#' @examples
#' A <- matrix(c(4, -2,  4,  2,
#'               -2, 10, -2, -7,
#'                4, -2,  8,  4,
#'                2, -7,  4,  7), nrow = 4, byrow = TRUE)
#' TRACE(A)                 # 29
#' psych::tr(A)             # 29 (verification)
#'
#' B <- A; diag(B)[2] <- NA
#' TRACE(B, "omit")         # sum of non-NA diagonals
#' # if MEANS() is loaded:
#' # TRACE(B, "mean")       # imputes NA diag via column mean
#'
#' # Normalized (average diagonal):
#' TRACE(A, normalized = TRUE)  # 29/4 = 7.25
#' @export
# -----------------------------
# Implementation note: uses a Hadamard mask plus SUMS()
#   tr(X) = sum( X ⊙ I ) = 1^T (X ⊙ I) 1
# Requires SUMS(); NA handling follows the na_action options documented above.
# -----------------------------
TRACE <- function(X, na_action = c("omit","fail","ignore","mean"),
                             normalized = FALSE) {
  
  source("SUMS.R")
  source("MEANS.R")
  
  if (!is.matrix(X) && !is.data.frame(X)) stop("Input must be a matrix or data frame")
  X <- as.matrix(X)
  if (!is.numeric(X)) stop("'X' must be numeric")
  if (nrow(X) != ncol(X)) stop("Matrix must be square")
  if (!exists("SUMS")) stop("SUMS() not found")

  na_action <- match.arg(na_action)

  n <- nrow(X)
  I <- diag(1, n)
  D <- X * I                 # Hadamard product isolates the diagonal
  d <- diag(X)               # used only for NA logic below

  rm_flag <- TRUE
  if (anyNA(d)) {
    if (na_action == "fail") {
      stop("NA values found on the diagonal")
    } else if (na_action == "omit") {
      D[cbind(which(is.na(d)), which(is.na(d)))] <- 0
    } else if (na_action == "mean") {
      if (!exists("MEANS")) stop("MEANS() not found")
      mu <- MEANS(X)
      idx <- which(is.na(d))
      if (length(idx)) D[cbind(idx, idx)] <- mu[idx]
    } else if (na_action == "ignore") {
      rm_flag <- FALSE
    }
  }

  # Column sums then total sum -> scalar trace
  tr_val <- sum(SUMS(D, na.rm = rm_flag))

  if (normalized) {
    denom <- if (na_action == "omit") sum(!is.na(d)) else n
    if (!rm_flag && anyNA(d)) return(NA_real_)
    tr_val <- tr_val / denom
  }

  tr_val
}

# Save function
dump("TRACE", file = "TRACE.R")
# -----------------------------
# Quick tests with annotated display
# -----------------------------

# Base matrix
A <- matrix(c(4, -2,  4,  2,
              -2, 10, -2, -7,
               4, -2,  8,  4,
               2, -7,  4,  7), nrow = 4, byrow = TRUE)

cat("Trace of A (custom TRACE):\n")
## Trace of A (custom TRACE):
print(TRACE(A))    # Expected 29
## [1] 29
# Cross-checked against psych::tr() below
cat("\nTrace of A (psych::tr):\n")
## 
## Trace of A (psych::tr):
print(psych::tr(A)) # Expected 29
## [1] 29
# With NA on the diagonal
B <- A; diag(B)[2] <- NA

cat("\nTrace of B with na_action='omit' (drop NA):\n")
## 
## Trace of B with na_action='omit' (drop NA):
print(TRACE(B, "omit"))  # Expected 19
## [1] 19
# Explanation: NA is dropped → 4 + 8 + 7 = 19

cat("\nTrace of B with na_action='ignore' (propagate NA):\n")
## 
## Trace of B with na_action='ignore' (propagate NA):
print(TRACE(B, "ignore")) # Expected NA
## [1] NA
# Explanation: NA propagates because diagonal contains NA

cat("\nTrace of B with na_action='mean' (impute with column mean):\n")
## 
## Trace of B with na_action='mean' (impute with column mean):
print(TRACE(B, "mean"))   # Expected 18.75
## [1] 15.33333
# Expected output: value with NA imputed by column mean
# Column 2 mean = ( -2 + 10 + -2 + -7 ) / 4 = -0.25
# Trace = 4 + (-0.25) + 8 + 7 = 18.75

# Normalized trace
cat("\nNormalized trace of A (average of diagonal):\n")
## 
## Normalized trace of A (average of diagonal):
print(TRACE(A, normalized = TRUE)) # Expected 7.25
## [1] 7.25
# Explanation: 29 / 4 = 7.25 (average of diagonal entries)

2 The SWEEP Operator

The SWEEP operator (Goodnight, 1979) is a fundamental matrix transformation closely related to:

  • Gauss–Jordan elimination
  • Doolittle’s forward procedure
  • Generalized inverse computation

It is widely used in statistical computing (e.g., regression, multivariate analysis) because it performs in-place updates of a symmetric matrix with minimal storage requirements.

For a symmetric matrix \(\mathbf{G}_{p \times p}\), sweeping on pivot \(k\) produces a new matrix \(\mathbf{H}\) with entries:

\[ \begin{aligned} h_{kk} &= -\frac{1}{g_{kk}}, \\[12pt] h_{jk} = h_{kj} &= \frac{g_{jk}}{g_{kk}}, \quad j \neq k, \\[12pt] h_{jl} &= g_{jl} - \frac{g_{jk}g_{kl}}{g_{kk}}, \quad j \neq k, \; l \neq k. \end{aligned} \]


Example: 3×3 Matrix

\[ \mathbf{G} = \begin{pmatrix} g_{11} & g_{12} & g_{13} \\[6px] g_{21} & g_{22} & g_{23} \\[6px] g_{31} & g_{32} & g_{33} \end{pmatrix} \]

After sweeping on pivot 1:

\[ \mathrm{SWEEP}_1(\mathbf{G}) = \begin{pmatrix} -\tfrac{1}{g_{11}} & \tfrac{g_{12}}{g_{11}} & \tfrac{g_{13}}{g_{11}} \\[12pt] \tfrac{g_{21}}{g_{11}} & g_{22} - \tfrac{g_{12}^2}{g_{11}} & g_{23} - \tfrac{g_{12}g_{13}}{g_{11}} \\[12pt] \tfrac{g_{31}}{g_{11}} & g_{32} - \tfrac{g_{31}g_{12}}{g_{11}} & g_{33} - \tfrac{g_{13}^2}{g_{11}} \end{pmatrix} \]


Statistical Applications

  • Regression analysis: Efficiently computes regression coefficients, residual sum of squares, and partial correlations without explicit matrix inversion.
  • Multivariate analysis: Forms the computational backbone of algorithms for MANOVA, discriminant analysis, and canonical correlations.
  • Generalized inverses: Provides a numerically stable route to Moore–Penrose inverses in certain contexts.

#' SWEEP Operator for Symmetric Matrices
#'
#' Performs SWEEP operations on selected pivots of a symmetric matrix.
#' Implements both forward sweep and reverse (unsweep) with numerical checks.
#'
#' @param A A symmetric numeric matrix (p × p).
#' @param k Integer vector of pivot indices. If NULL, sweep all: 1:p.
#' @param reverse Logical; if TRUE, performs reverse sweep (unsweep). Default FALSE.
#' @param tol Numeric tolerance for pivot magnitude. Default 1e-7.
#' @return A symmetric matrix of the same dimension with pivots swept/unswept.
#' @references Goodnight, J. H. (1979). A Tutorial on the SWEEP Operator.
#'             The American Statistician, 33(3), 149–158.
#' @examples
#' set.seed(1)
#' A <- crossprod(matrix(rnorm(16), 4, 4))   # SPD symmetric matrix
#' S  <- SWEEP(A, 1:2)                       # forward sweep on 1 and 2
#' A2 <- SWEEP(S, 2:1, reverse = TRUE)       # unsweep to recover A
#' all.equal(A, A2)
#' @export
SWEEP <- function(A, k = NULL, reverse = FALSE, tol = 1e-7) {
  a <- as.matrix(A)
  if (!is.numeric(a)) stop("'A' must be numeric")
  if (nrow(a) != ncol(a)) stop("'A' must be square")
  p <- nrow(a)
  
  # (Soft) symmetry check
  if (max(abs(a - t(a))) > sqrt(.Machine$double.eps)) {
    warning("Input matrix is not exactly symmetric; proceeding with symmetric updates.")
  }
  
  if (is.null(k)) k <- seq_len(p)
  cn <- colnames(a); rn <- rownames(a)
  
  for (kk in k) {
    if (kk < 1 || kk > p) stop("Pivot index out of range")
    pivot <- a[kk, kk]
    
    if (!reverse) {
      # ---------- Forward SWEEP on pivot kk ----------
      if (abs(pivot) < tol) stop(sprintf("Pivot a[%d,%d] too small (|%g| < %g)", kk, kk, pivot, tol))
      
      # Update off-pivot block: a_ij <- a_ij - a_ik * a_kj / pivot
      for (i in seq_len(p)) if (i != kk) {
        for (j in seq_len(p)) if (j != kk) {
          a[i, j] <- a[i, j] - a[i, kk] * a[kk, j] / pivot
        }
      }
      # Update pivot row/col (excluding pivot): a_ik <- a_ik / pivot
      for (i in seq_len(p)) if (i != kk) {
        a[i, kk] <- a[i, kk] / pivot
        a[kk, i] <- a[i, kk]   # keep symmetry
      }
      # Update pivot element
      a[kk, kk] <- -1 / pivot
      
    } else {
      # ---------- Reverse SWEEP (unsweep) on pivot kk ----------
      # Here a[kk,kk] should be -1/pivot_original; recover original pivot:
      if (abs(a[kk, kk]) < tol) stop(sprintf("Pivot element too small to reverse-sweep at [%d,%d]", kk, kk))
      pivot_orig <- -1 / a[kk, kk]
      
      # Rebuild off-pivot block: a_ij <- a_ij + a_ik * a_kj * pivot_orig
      for (i in seq_len(p)) if (i != kk) {
        for (j in seq_len(p)) if (j != kk) {
          a[i, j] <- a[i, j] + a[i, kk] * a[kk, j] * pivot_orig
        }
      }
      # Rebuild pivot row/col: a_ik <- a_ik * pivot_orig (recovers the original g_ik)
      for (i in seq_len(p)) if (i != kk) {
        val <- a[i, kk] * pivot_orig
        a[i, kk] <- val
        a[kk, i] <- val
      }
      # Restore pivot element
      a[kk, kk] <- pivot_orig
    }
  }
  
  colnames(a) <- cn; rownames(a) <- rn
  a
}

# Save the function to the working directory (check with getwd())
dump("SWEEP", file = "SWEEP.R")
### 1. Create Test Matrix ###
cat("Original symmetric 2×2 matrix:\n")
## Original symmetric 2×2 matrix:
(test_matrix <- matrix(c(4, 2,
                         2, 3),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(c("row1","row2"),
                                       c("col1","col2"))))
##      col1 col2
## row1    4    2
## row2    2    3
### 2. Forward Sweep on Pivot 1 ###
cat("\nAfter sweeping pivot 1:\n")
## 
## After sweeping pivot 1:
(swept <- SWEEP(test_matrix, 1))
##       col1 col2
## row1 -0.25  0.5
## row2  0.50  2.0
### 3. Reverse Sweep on Pivot 1 (recover original) ###
cat("\nReverse sweep on pivot 1 (recover original):\n")
## 
## Reverse sweep on pivot 1 (recover original):
(recovered <- SWEEP(swept, 1, reverse = TRUE))
##      col1 col2
## row1    4    2
## row2    2    3
cat("\nall.equal(original, recovered):\n")
## 
## all.equal(original, recovered):
print(all.equal(test_matrix, recovered))
## [1] TRUE
# ----- Demo & annotated checks -----
A <- matrix(c(4, -2, 4, 2,
              -2, 10, -2, -7,
              4, -2, 8, 4,
              2, -7, 4, 7),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("row",1:4), paste0("col",1:4)))

cat("Original matrix A:\n"); print(A)
## Original matrix A:
##      col1 col2 col3 col4
## row1    4   -2    4    2
## row2   -2   10   -2   -7
## row3    4   -2    8    4
## row4    2   -7    4    7
k <- 3
cat("\nAfter sweeping pivot", k, ":\n")
## 
## After sweeping pivot 3 :
A_swp3 <- SWEEP(A, k); print(round(A_swp3, 2))
##      col1  col2  col3 col4
## row1  2.0 -1.00  0.50  0.0
## row2 -1.0  9.50 -0.25 -6.0
## row3  0.5 -0.25 -0.12  0.5
## row4  0.0 -6.00  0.50  5.0
# Expected values under Goodnight (1979) for pivot = A[k,k]
pivot <- A[k,k]
exp_hkk  <- -1 / pivot
exp_hk1  <- A[k,1] / pivot
exp_h1k  <- A[1,k] / pivot  # should equal exp_hk1

cat("\nChecks under Goodnight convention:\n")
## 
## Checks under Goodnight convention:
cat(sprintf("  Expected h[%d,%d] = -1/g[%d,%d] = %.3f\n", k,k,k,k, exp_hkk))
##   Expected h[3,3] = -1/g[3,3] = -0.125
cat(sprintf("  Observed h[%d,%d] = %.3f\n", k,k, A_swp3[k,k]))
##   Observed h[3,3] = -0.125
cat(sprintf("  Expected h[%d,1] = g[%d,1]/g[%d,%d] = %.3f\n", k,k,k,k, exp_hk1))
##   Expected h[3,1] = g[3,1]/g[3,3] = 0.500
cat(sprintf("  Observed h[%d,1] = %.3f\n", k, A_swp3[k,1]))
##   Observed h[3,1] = 0.500
cat(sprintf("  Expected h[1,%d] = g[1,%d]/g[%d,%d] = %.3f\n", k,k,k,k, exp_h1k))
##   Expected h[1,3] = g[1,3]/g[3,3] = 0.500
cat(sprintf("  Observed h[1,%d] = %.3f\n", k, A_swp3[1,k]))
##   Observed h[1,3] = 0.500
cat("\nSymmetry check (should be ~0): max |H - H^T| = ",
    max(abs(A_swp3 - t(A_swp3))), "\n", sep = "")
## 
## Symmetry check (should be ~0): max |H - H^T| = 0
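
To connect the operator back to the regression application listed above, here is a sketch that fits the cement-data regression of Y on X1-X4 by sweeping the predictor pivots of the CSSCP matrix (assuming the SWEEP(), CSSCP(), and MEANS() functions defined earlier; lm() is used only as a cross-check):

G <- CSSCP(X)                      # 5 x 5 corrected SSCP with Y in the last position
S_reg <- SWEEP(G, 1:4)             # sweep the four predictor pivots

b   <- S_reg[1:4, 5]               # slopes: (Xc'Xc)^{-1} Xc'y
mu  <- MEANS(X)
b0  <- mu["Y"] - sum(b * mu[c("X1", "X2", "X3", "X4")])   # intercept
rss <- S_reg[5, 5]                 # residual sum of squares

fit <- lm(Y ~ X1 + X2 + X3 + X4, data = as.data.frame(X))
all.equal(unname(b), unname(coef(fit)[-1]))    # slopes match lm()
c(sweep_intercept = unname(b0), lm_intercept = unname(coef(fit)[1]))
all.equal(rss, sum(resid(fit)^2))              # RSS matches lm()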

2.1 Reverse SWEEP Operator (RSW)

The reverse SWEEP undoes a prior SWEEP on a symmetric matrix. It is useful when you need to revert selected pivots (e.g., in EM algorithms or Gaussian graphical models), restoring the pre-sweep values while maintaining symmetry.

Setup and Notation

Let \(\mathbf{S}\) be the matrix after sweeping pivot \(k\) on an original symmetric matrix \(\mathbf{G}\). The reverse SWEEP at pivot \(k\) reconstructs \(\mathbf{G}\) from \(\mathbf{S}\).

Define the index set \(I = \{1,\dots,p\}\setminus\{k\}\). Then the reverse SWEEP is:

\[ \begin{align} g_{kk} &= -\frac{1}{\,s_{kk}\,}, \\[6pt] g_{ik} = g_{ki} &= s_{ik}\, g_{kk}, \quad i \in I, \\[6pt] g_{ij} &= s_{ij} + s_{ik}\,s_{kj}\, g_{kk}, \quad i,j \in I. \end{align} \]

These updates are the exact inverse of the forward SWEEP identities: \[ \begin{aligned} \text{(forward)}\quad & h_{kk} = -\frac{1}{g_{kk}}, \qquad h_{ik} = h_{ki} = \frac{g_{ik}}{g_{kk}}, \\ & h_{ij} = g_{ij} - \frac{g_{ik}g_{kj}}{g_{kk}} \quad (i,j \neq k). \end{aligned} \] Notably, the sign and operations flip: division \(\leftrightarrow\) multiplication on row/col, and subtraction \(\leftrightarrow\) addition in the Schur-complement update.

Properties

  • Symmetry preserved: Each step maintains \(g_{ij}=g_{ji}\).
  • Exact inverse locally: Applying reverse SWEEP at \(k\) after a forward SWEEP at \(k\) returns the original entries.
  • Numerical care: Require \(|s_{kk}|>0\) (or a tolerance) to avoid instability when inverting the pivot.

#' Reverse SWEEP Operator 
#'
#' Undoes prior SWEEP operations at the specified pivots on a symmetric matrix.
#' If multiple pivots were swept, reverse-sweeping should usually be done
#' in the reverse order of application (hence the default \code{rev(k)}).
#'
#' @param S A symmetric numeric matrix (p × p) that has been SWEEPed.
#' @param k Integer vector of pivot indices to reverse-sweep.
#' @param tol Numeric tolerance for pivot magnitude (default 1e-7).
#' @param reverse_order Logical; if TRUE (default), process \code{rev(k)}.
#' @param sign_convention Either "goodnight" (swept row/col stored as +g_ik/g_kk)
#'   or "negated" (stored as -g_ik/g_kk).
#' @return The matrix after reverse SWEEP at the specified pivots.
#' @references
#'   Little, R.J.A., & Rubin, D.B. (2020). Statistical Analysis with Missing Data (3rd ed.).
#' @examples
#' set.seed(1)
#' A <- crossprod(matrix(rnorm(16), 4, 4))  # SPD matrix
#' S <- SWEEP(A, k = 2)                     # forward sweep at pivot 2
#' A_recovered <- RSW(S, k = 2)             # reverse sweep at pivot 2
#' all.equal(A, A_recovered)
#' @export
# Reverse SWEEP (unsweep) for one or more pivots (Goodnight/Little–Rubin convention).
# Supports two sign conventions for the swept row/col:
#   sign_convention = "goodnight" : s_{ik} =  g_{ik}/g_{kk}
#   sign_convention = "negated"   : s_{ik} = -g_{ik}/g_{kk}  (the A_swept3 example below)
RSW <- function(S, k, tol = 1e-7, reverse_order = TRUE,
                sign_convention = c("goodnight","negated")) {
  H <- as.matrix(S)
  if (!is.numeric(H)) stop("'S' must be numeric")
  if (nrow(H) != ncol(H)) stop("'S' must be square")
  p <- nrow(H)
  if (missing(k) || length(k) == 0L) stop("Provide pivot index/indices in 'k'")
  if (reverse_order) k <- rev(k)
  sign_convention <- match.arg(sign_convention)

  rn <- rownames(H); cn <- colnames(H)

  for (kk in k) {
    if (kk < 1 || kk > p) stop("Pivot index out of range: ", kk)
    s_kk <- H[kk, kk]
    if (abs(s_kk) < tol) stop(sprintf("Pivot too small at [%d,%d]: %g", kk, kk, s_kk))

    idx <- setdiff(seq_len(p), kk)

    # Recover original pivot
    g_kk <- -1 / s_kk

    # Save swept row/col BEFORE any scaling
    s_col <- H[idx, kk, drop = FALSE]   # (p-1)×1  = s_{ik}
    s_row <- H[kk, idx, drop = FALSE]   # 1×(p-1)  = s_{kj}

    # Off-pivot block: g_{ij} = s_{ij} + s_{ik} s_{kj} * g_kk
    H[idx, idx] <- H[idx, idx] + (s_col %*% s_row) * g_kk

    # Row/col recovery depends on convention
    sgn <- if (sign_convention == "goodnight") +1 else -1
    H[idx, kk] <- sgn * s_col * g_kk
    H[kk, idx] <- t(sgn * s_row * g_kk)

    # Pivot
    H[kk, kk] <- g_kk
  }

  rownames(H) <- rn; colnames(H) <- cn
  H
}

# Save to file
dump("RSW", file = "RSW.R")
# -----------------------------
# Reverse SWEEP demonstration
# -----------------------------

# Swept matrix (pivot 3 was swept; note s[3,3] = -0.125)
A_swept3 <- matrix(c(2, -1, -0.5, 0,
                     -1, 9.5, 0.25, -6,
                     -0.5, 0.25, -0.125, -0.5,
                      0, -6, -0.5, 5),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(paste0("row",1:4), paste0("col",1:4)))

cat("Swept matrix (after forward SWEEP at pivot 3):\n")
## Swept matrix (after forward SWEEP at pivot 3):
print(A_swept3)
##      col1  col2   col3 col4
## row1  2.0 -1.00 -0.500  0.0
## row2 -1.0  9.50  0.250 -6.0
## row3 -0.5  0.25 -0.125 -0.5
## row4  0.0 -6.00 -0.500  5.0
# Reverse SWEEP at pivot 3 (negated convention)
unswept_neg <- RSW(A_swept3, 3, sign_convention = "negated")

cat("\nMatrix after reverse SWEEP at pivot 3:\n")
## 
## Matrix after reverse SWEEP at pivot 3:
print(unswept_neg)
##      col1 col2 col3 col4
## row1    4   -2    4    2
## row2   -2   10   -2   -7
## row3    4   -2    8    4
## row4    2   -7    4    7
cat("\nCheck: pivot [3,3] recovered as", unswept_neg[3,3],
    " (expected ~8)\n")
## 
## Check: pivot [3,3] recovered as 8  (expected ~8)
# Original pre-sweep matrix (for comparison)
cat("\nOriginal pre-sweep matrix:\n")
## 
## Original pre-sweep matrix:
print(A)
##      col1 col2 col3 col4
## row1    4   -2    4    2
## row2   -2   10   -2   -7
## row3    4   -2    8    4
## row4    2   -7    4    7
# Verification summary
cat("\nall.equal(original, unswept) →", all.equal(A, unswept_neg), "\n")
## 
## all.equal(original, unswept) → TRUE
cat("max |difference| =", max(abs(A - unswept_neg)), "\n")
## max |difference| = 0

cat("Original matrix A:\n"); print(A)
## Original matrix A:
##      col1 col2 col3 col4
## row1    4   -2    4    2
## row2   -2   10   -2   -7
## row3    4   -2    8    4
## row4    2   -7    4    7
cat("\nSweep pivot 3 once (forward SWEEP):\n")
## 
## Sweep pivot 3 once (forward SWEEP):
S1 <- SWEEP(A, 3); print(S1)
##      col1  col2   col3 col4
## row1  2.0 -1.00  0.500  0.0
## row2 -1.0  9.50 -0.250 -6.0
## row3  0.5 -0.25 -0.125  0.5
## row4  0.0 -6.00  0.500  5.0
cat("\nSweep pivot 3 again (forward SWEEP again):\n")
## 
## Sweep pivot 3 again (forward SWEEP again):
S2 <- SWEEP(S1, 3); print(S2)
##      col1 col2 col3 col4
## row1    4   -2   -4    2
## row2   -2   10    2   -7
## row3   -4    2    8   -4
## row4    2   -7   -4    7
cat("\nObservation: row/col 3 are sign-flipped vs A\n")
## 
## Observation: row/col 3 are sign-flipped vs A
cat("max |off-pivot block difference| =", 
    max(abs(S2[-3,-3] - A[-3,-3])), "\n")  # should be ~0
## max |off-pivot block difference| = 0
cat("Row 3 difference (S2 vs A):\n"); print(S2[3,] - A[3,])
## Row 3 difference (S2 vs A):
## col1 col2 col3 col4 
##   -8    4    0   -8
cat("Col 3 difference (S2 vs A):\n"); print(S2[,3] - A[,3])
## Col 3 difference (S2 vs A):
## row1 row2 row3 row4 
##   -8    4    0   -8
cat("\nNow reverse-sweep pivot 3 (unsweep):\n")
## 
## Now reverse-sweep pivot 3 (unsweep):
A_rec <- RSW(S1, 3)   # uses your RSW()
print(A_rec)
##      col1 col2 col3 col4
## row1    4   -2    4    2
## row2   -2   10   -2   -7
## row3    4   -2    8    4
## row4    2   -7    4    7
cat("\nall.equal(A, A_rec) -> ", all.equal(A, A_rec), "\n")
## 
## all.equal(A, A_rec) ->  TRUE

Sweeping is (Order-)Commutative for Distinct Pivots

Let \(G\in\mathbb{R}^{p\times p}\) be symmetric and let \(S\subset\{1,\dots,p\}\) be an index set with \(G_{SS}\) nonsingular. Then \[ \mathrm{SWEEP}_{[r,s]}(G)=\mathrm{SWEEP}_{[s,r]}(G)\quad\text{for } r\neq s, \] and, more generally, \(\mathrm{SWEEP}_{S}(G)\) is independent of the order of indices in \(S\).

A block identity (showing order-independence): if we partition \[ G=\begin{pmatrix}G_{SS} & G_{S\bar S}\\[6pt] G_{\bar S S} & G_{\bar S\bar S}\end{pmatrix}, \] then sweeping all of \(S\) yields \[ \mathrm{SWEEP}_{S}(G)= \begin{pmatrix} -\,G_{SS}^{-1} & \;\;G_{SS}^{-1}G_{S\bar S}\\[12pt] G_{\bar S S}G_{SS}^{-1} & \;G_{\bar S\bar S}-G_{\bar S S}G_{SS}^{-1}G_{S\bar S} \end{pmatrix}, \] which depends only on \(G_{SS}^{-1}\), not the order of sweeps.

# ------------------------------------------------------------------
# Order-independence (commutativity) demo with annotated display
# ------------------------------------------------------------------

cat("Sweeping on pivots 1→4 (ascending):\n")
## Sweeping on pivots 1→4 (ascending):
swp_asc <- SWEEP(A, 1:4)
print(round(swp_asc, 4))
##         col1    col2    col3    col4
## row1 -0.6389 -0.2778  0.4167 -0.3333
## row2 -0.2778 -0.5556  0.3333 -0.6667
## row3  0.4167  0.3333 -0.5000  0.5000
## row4 -0.3333 -0.6667  0.5000 -1.0000
cat("\nSweeping on pivots 4→1 (descending):\n")
## 
## Sweeping on pivots 4→1 (descending):
swp_desc <- SWEEP(A, 4:1)
print(round(swp_desc, 4))
##         col1    col2    col3    col4
## row1 -0.6389 -0.2778  0.4167 -0.3333
## row2 -0.2778 -0.5556  0.3333 -0.6667
## row3  0.4167  0.3333 -0.5000  0.5000
## row4 -0.3333 -0.6667  0.5000 -1.0000
cat("\nAre the two results equal (up to FP tolerance)?\n")
## 
## Are the two results equal (up to FP tolerance)?
ae <- all.equal(swp_asc, swp_desc)
print(ae)
## [1] TRUE
cat("max |difference| =", max(abs(swp_asc - swp_desc)), "\n")
## max |difference| = 1.110223e-16
# Extra sanity checks (useful for readers)
cat("\nSanity check: sweeping ALL pivots should yield -solve(A)\n")
## 
## Sanity check: sweeping ALL pivots should yield -solve(A)
neg_inv <- -solve(A)
cat("all.equal(SWEEP(A, 1:4), -solve(A)) → ",
    all.equal(swp_asc, neg_inv), "\n")
## all.equal(SWEEP(A, 1:4), -solve(A)) →  Attributes: < Component "dimnames": Component 1: 4 string mismatches > Attributes: < Component "dimnames": Component 2: 4 string mismatches >
cat("max |SWEEP(A,1:4) + solve(A)| = ",
    max(abs(swp_asc + solve(A))), "\n")
## max |SWEEP(A,1:4) + solve(A)| =  0
# Optional: show partial-order commutativity on a subset (e.g., {1,3})
cat("\nSubset commutativity example on pivots {1,3}:\n")
## 
## Subset commutativity example on pivots {1,3}:
H_13 <- SWEEP(A, c(1,3))
H_31 <- SWEEP(A, c(3,1))
cat("all.equal(SWEEP(A, c(1,3)), SWEEP(A, c(3,1))) → ",
    all.equal(H_13, H_31), "\n")
## all.equal(SWEEP(A, c(1,3)), SWEEP(A, c(3,1))) →  TRUE
# Numerical note for readers
cat("\nNote: If pivots are near singular, tiny differences may appear.\n",
    "A tolerance inside SWEEP (e.g., tol=1e-7) helps catch unstable pivots.\n", sep = "")
## 
## Note: If pivots are near singular, tiny differences may appear.
## A tolerance inside SWEEP (e.g., tol=1e-7) helps catch unstable pivots.

2.2 Matrix Inverse

For a square matrix \(A \in \mathbb{R}^{n\times n}\), an inverse \(A^{-1}\) satisfies \[ A\,A^{-1} = A^{-1}A = I_n. \] A matrix is invertible (nonsingular) iff \(\det(A)\neq 0\). If \(\det(A)=0\), \(A\) is singular (non-invertible). Non-square matrices do not have inverses (but see pseudoinverse below).

Core properties

  • Uniqueness: The inverse, if it exists, is unique.
  • \((A^{-1})^{-1}=A,\qquad (AB)^{-1}=B^{-1}A^{-1},\qquad (A^\top)^{-1}=(A^{-1})^\top.\)
  • \(\det(A^{-1}) = 1/\det(A)\).
  • If \(A\) is SPD (symmetric positive definite), then \(A^{-1}\) is SPD.
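A quick numeric spot-check of these properties, reusing the symmetric positive definite matrix \(A\) from the SWEEP demos (the companion matrix \(B\) below is an arbitrary invertible matrix introduced only for illustration):

# Spot-checks of the inverse properties (sketch)
B <- diag(4) + 1                                   # arbitrary invertible 4x4 matrix
max(abs(solve(solve(A)) - A))                      # (A^-1)^-1 = A        -> ~0
max(abs(solve(A %*% B) - solve(B) %*% solve(A)))   # (AB)^-1 = B^-1 A^-1  -> ~0
max(abs(solve(t(A)) - t(solve(A))))                # (A^T)^-1 = (A^-1)^T  -> ~0
abs(det(solve(A)) - 1 / det(A))                    # det(A^-1) = 1/det(A) -> ~0
min(eigen(solve(A), symmetric = TRUE)$values) > 0  # A SPD => A^-1 SPD    -> TRUE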

Links to regression

For linear regression \(y=X\beta+\varepsilon\) (full column rank \(X\)): \[ \hat\beta = (X^\top X)^{-1}X^\top y. \]

Relation to the SWEEP operator (symmetric \(A\))

For symmetric \(A\) with all pivots nonsingular, \[ \mathrm{SWEEP}(A,\;1{:}n)\;=\; -\,A^{-1}. \] This provides an efficient route to \(A^{-1}\) (up to a sign) using in-place updates.


2.2.1 R demo (annotated)

# Symmetric positive-definite test matrix
set.seed(1)
M <- crossprod(matrix(rnorm(16), 4, 4))

# Inverse via base R (factorization under the hood)
Ainv <- solve(M)

# Inverse via SWEEP: sweep all pivots -> -A^{-1}
S <- SWEEP(M, 1:4)
all.equal(S, -Ainv)         # should be TRUE (within FP tolerance)
## [1] TRUE
# Sanity checks
all.equal(M %*% Ainv, diag(4))  # ~ I
## [1] TRUE
max(abs(M %*% Ainv - diag(4)))  # tiny FP error
## [1] 2.164273e-15
# Regression tip (avoid explicit inverse):
# beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # or better: qr.solve(X, y)

#' Matrix Inversion via SWEEP Operator
#'
#' Computes the inverse of a square matrix using sequential SWEEP operations.
#' Uses the Goodnight convention, for which SWEEP(A, 1:n) = -A^{-1}.
#'
#' @param A A square numeric matrix.
#' @param tol Tolerance for near-zero pivot detection (default = 1e-10).
#' @return The inverse matrix A^{-1}.
#' @details
#' - Assumes a SWEEP() implementation with pivot update h_kk = -1/g_kk.
#' - Works best for symmetric (SPD) matrices; a symmetry warning is issued otherwise.
#' - Avoids using det() for singularity tests; relies on pivot checks instead.
#' @examples
#' A <- matrix(c(4,-2,4,2, -2,10,-2,-7, 4,-2,8,4, 2,-7,4,7), nrow=4, byrow=TRUE)
#' INV(A)              # Matrix inverse via SWEEP
#' all.equal(INV(A), solve(A))  # should be TRUE (within FP tolerance)
#' @export
INV <- function(A, tol = 1e-10) {
  if (!is.matrix(A) || !is.numeric(A)) stop("Input must be a numeric matrix")
  if (nrow(A) != ncol(A)) stop("Matrix must be square")

  # Ensure SWEEP is available
  if (!exists("SWEEP")) {
    if (file.exists("SWEEP.R")) source("SWEEP.R") else stop("SWEEP() not found")
  }

  M <- as.matrix(A)
  n <- nrow(M)
  dn <- dimnames(M)

  # Soft symmetry check (helpful since SWEEP is usually defined for symmetric inputs)
  if (max(abs(M - t(M))) > sqrt(.Machine$double.eps)) {
    warning("Input is not exactly symmetric; proceeding, but ensure SWEEP convention matches.")
  }

  # Sequential sweeps with pivot checks
  for (k in seq_len(n)) {
    piv <- M[k, k]
    if (!is.finite(piv) || abs(piv) < tol) {
      stop(sprintf("Near-zero or invalid pivot (%.3e) at position %d", piv, k))
    }
    M <- SWEEP(M, k)  # Goodnight convention: sets M[k,k] <- -1/piv, etc.
  }

  # Under Goodnight convention: swept-all = -A^{-1}
  invA <- -M
  dimnames(invA) <- dn
  invA
}

# Save function
dump("INV", file = "INV.R")
options(digits = 2)

cat("Original matrix A:\n")
## Original matrix A:
print(A)
##      col1 col2 col3 col4
## row1    4   -2    4    2
## row2   -2   10   -2   -7
## row3    4   -2    8    4
## row4    2   -7    4    7
cat("\nInverse via SWEEP (INV function):\n")
## 
## Inverse via SWEEP (INV function):
(A_inv <- INV(A))
##       col1  col2  col3  col4
## row1  0.64  0.28 -0.42  0.33
## row2  0.28  0.56 -0.33  0.67
## row3 -0.42 -0.33  0.50 -0.50
## row4  0.33  0.67 -0.50  1.00
#       col1  col2  col3  col4
# row1  0.64  0.28 -0.42  0.33
# row2  0.28  0.56 -0.33  0.67
# row3 -0.42 -0.33  0.50 -0.50
# row4  0.33  0.67 -0.50  1.00

cat("\nVerification (A %*% A_inv should be identity):\n")
## 
## Verification (A %*% A_inv should be identity):
round(A %*% A_inv, 10)
##      col1 col2 col3 col4
## row1    1    0    0    0
## row2    0    1    0    0
## row3    0    0    1    0
## row4    0    0    0    1
#      col1 col2 col3 col4
# row1    1    0    0    0
# row2    0    1    0    0
# row3    0    0    1    0
# row4    0    0    0    1

2.3 Correlation Matrix

A correlation matrix is a covariance matrix calculated on variables that are standardized to have a mean of 0 and a standard deviation of 1.

The determinant of the correlation matrix is 1.0 only if all correlations are 0; otherwise, the determinant is less than 1.
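A small illustration of this determinant property (the 3 x 3 correlation matrix below is an arbitrary example chosen for this sketch):

# Determinant of a correlation matrix: 1 only when all correlations are 0
R_id  <- diag(3)                           # identity = no correlation
R_cor <- matrix(c(1.0, 0.6, 0.3,
                  0.6, 1.0, 0.5,
                  0.3, 0.5, 1.0), nrow = 3)
det(R_id)    # 1
det(R_cor)   # 0.48 < 1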

The correlation between variables \(X_i\) and \(X_j\) is \(\rho_{ij} = \dfrac{\sigma(X_i, X_j)}{\sigma_i \sigma_j}\), and the full correlation matrix is:

\[ \begin{aligned} \mathbf{P} &= \begin{pmatrix} 1 & \dfrac{\sigma(X_1,X_2)}{\sigma_1\sigma_2} & \dfrac{\sigma(X_1,X_3)}{\sigma_1\sigma_3} & \cdots & \dfrac{\sigma(X_1,X_p)}{\sigma_1\sigma_p} \\[12pt] \dfrac{\sigma(X_2,X_1)}{\sigma_2\sigma_1} & 1 & \dfrac{\sigma(X_2,X_3)}{\sigma_2\sigma_3} & \cdots & \dfrac{\sigma(X_2,X_p)}{\sigma_2\sigma_p} \\[12pt] \dfrac{\sigma(X_3,X_1)}{\sigma_3\sigma_1} & \dfrac{\sigma(X_3,X_2)}{\sigma_3\sigma_2} & 1 & \cdots & \dfrac{\sigma(X_3,X_p)}{\sigma_3\sigma_p} \\[12pt] \vdots & \vdots & \vdots & \ddots & \vdots \\[12pt] \dfrac{\sigma(X_p,X_1)}{\sigma_p\sigma_1} & \dfrac{\sigma(X_p,X_2)}{\sigma_p\sigma_2} & \dfrac{\sigma(X_p,X_3)}{\sigma_p\sigma_3} & \cdots & 1 \end{pmatrix}, \\[12pt] &= \mathbf{D}^{-1/2}\,\boldsymbol{\Sigma}\,\mathbf{D}^{-1/2}, \end{aligned} \]

where:

\(\qquad \sigma(X_i, X_j)\) is the covariance between \(X_i\) and \(X_j\),

\(\qquad \sigma_i\) is the standard deviation of \(X_i\),

\(\qquad\)The diagonal elements are always 1 since they represent the correlation of each variable with itself.
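The identity \(\mathbf{P} = \mathbf{D}^{-1/2}\,\boldsymbol{\Sigma}\,\mathbf{D}^{-1/2}\) can be checked directly; the sketch below uses simulated data and base R's cov(), cor(), and cov2cor() purely for comparison:

# Correlation matrix from a covariance matrix via D^{-1/2} %*% Sigma %*% D^{-1/2} (sketch)
set.seed(42)
Ydat  <- matrix(rnorm(60), nrow = 20, ncol = 3)    # illustrative data
Sigma <- cov(Ydat)                                 # sample covariance matrix
D_inv_sqrt <- diag(1 / sqrt(diag(Sigma)))          # D^{-1/2}
R_manual   <- D_inv_sqrt %*% Sigma %*% D_inv_sqrt
all.equal(R_manual, unname(cov2cor(Sigma)))        # TRUE: matches base R's cov2cor()
all.equal(R_manual, unname(cor(Ydat)))             # TRUE: matches cor()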

#' Compute Correlation Matrix with Multiple NA Handling Options
#'
#' Calculates Pearson correlation matrix using matrix operations, with options for
#' different missing data handling methods.
#'
#' @param Y A numeric matrix or data frame (n × p)
#' @param na_action Method for handling NAs: 
#'        "listwise" (complete cases, default), 
#'        "pairwise" (available cases for each pair),
#'        "complete" (only complete cases, alias for listwise)
#' @return A symmetric p × p correlation matrix
#' @export
#' @examples
#' data <- matrix(rnorm(100), ncol=5)
#' data[sample(100, 10)] <- NA  # Add missing values
#' 
#' COR(data)  # Listwise deletion
#' COR(data, na_action="pairwise")  # Pairwise deletion
#' stats::cor(data, use="complete.obs")  # Base R equivalent
COR <- function(Y, na_action = c("listwise", "pairwise", "complete")) {
  # Load dependencies if needed
  required_funs <- c("CSSCP", "COV", "INV")
  for (fun in required_funs) {
    if (!exists(fun)) {
      if (file.exists(paste0(fun, ".R"))) {
        source(paste0(fun, ".R"))
      } else {
        stop(paste(fun, "function not found in working directory"))
      }
    }
  }
  
  # Input validation
  if (!is.matrix(Y) && !is.data.frame(Y)) {
    stop("Input must be a matrix or data frame")
  }
  na_action <- match.arg(na_action)
  if (na_action == "complete") na_action <- "listwise"
  
  # Handle missing data
  Y <- as.matrix(Y)
  if (na_action == "listwise") {
    Y <- Y[complete.cases(Y), , drop = FALSE]
    n <- nrow(Y)
    if (n < 2) {
      warning("Insufficient complete cases (n < 2)")
      return(matrix(NA, ncol(Y), ncol(Y)))
    }
    
    # Standardize using listwise means/SDs
    V <- diag(COV(Y))
    S <- diag(sqrt(V), ncol(Y), ncol(Y))
    C <- CSSCP(Y)/(n - 1)
    corr <- INV(S) %*% C %*% INV(S)
    
  } else { # pairwise
    # Pairwise covariance computation
    corr <- matrix(NA, ncol(Y), ncol(Y))
    for (i in 1:ncol(Y)) {
      for (j in 1:i) {
        complete_ij <- complete.cases(Y[, c(i,j)])
        if (sum(complete_ij) >= 2) {
          cov_ij <- cov(Y[complete_ij, i], Y[complete_ij, j])
          sd_i <- sd(Y[complete_ij, i])
          sd_j <- sd(Y[complete_ij, j])
          corr[i,j] <- corr[j,i] <- cov_ij/(sd_i * sd_j)
        }
      }
    }
    diag(corr) <- 1
  }
  
  # Preserve dimension names
  if (is.null(colnames(Y))) {
    dimnames(corr) <- list(paste0("X", 1:ncol(Y)), paste0("X", 1:ncol(Y)))
  } else {
    dimnames(corr) <- list(colnames(Y), colnames(Y))
  }
  
  return(corr)
}

# Save function
dump("COR", file = "COR.R")
cat("Custom COR (listwise):\n")
## Custom COR (listwise):
R_list <- COR(X, "listwise")
print(round(R_list, 4))
##       X1    X2    X3    X4     Y
## X1  1.00  0.23 -0.82 -0.25  0.73
## X2  0.23  1.00 -0.14 -0.97  0.82
## X3 -0.82 -0.14  1.00  0.03 -0.53
## X4 -0.25 -0.97  0.03  1.00 -0.82
## Y   0.73  0.82 -0.53 -0.82  1.00
cat("\nBase cor() with complete.obs:\n")
## 
## Base cor() with complete.obs:
R_list_base <- cor(X, use = "complete.obs")
print(round(R_list_base, 4))
##       X1    X2    X3    X4     Y
## X1  1.00  0.23 -0.82 -0.25  0.73
## X2  0.23  1.00 -0.14 -0.97  0.82
## X3 -0.82 -0.14  1.00  0.03 -0.53
## X4 -0.25 -0.97  0.03  1.00 -0.82
## Y   0.73  0.82 -0.53 -0.82  1.00
cat("\nall.equal(custom, base) → ", all.equal(R_list, R_list_base), "\n")
## 
## all.equal(custom, base) →  TRUE
cat("\nCustom COR (pairwise):\n")
## 
## Custom COR (pairwise):
R_pair <- COR(X, "pairwise")
print(round(R_pair, 4))
##       X1    X2    X3    X4     Y
## X1  1.00  0.23 -0.82 -0.25  0.73
## X2  0.23  1.00 -0.14 -0.97  0.82
## X3 -0.82 -0.14  1.00  0.03 -0.53
## X4 -0.25 -0.97  0.03  1.00 -0.82
## Y   0.73  0.82 -0.53 -0.82  1.00
cat("\nBase cor() with pairwise.complete.obs:\n")
## 
## Base cor() with pairwise.complete.obs:
R_pair_base <- cor(X, use = "pairwise.complete.obs")
print(round(R_pair_base, 4))
##       X1    X2    X3    X4     Y
## X1  1.00  0.23 -0.82 -0.25  0.73
## X2  0.23  1.00 -0.14 -0.97  0.82
## X3 -0.82 -0.14  1.00  0.03 -0.53
## X4 -0.25 -0.97  0.03  1.00 -0.82
## Y   0.73  0.82 -0.53 -0.82  1.00
cat("\nall.equal(custom, base) → ", all.equal(R_pair, R_pair_base), "\n")
## 
## all.equal(custom, base) →  TRUE

2.4 Determinant Calculation via SWEEP Operator

The determinant of a matrix \(A\) can be computed using the SWEEP operator through the following mathematical relationship:

\[ \det(A) = \prod_{k=1}^{n} d_k \]

where:

\(\qquad A\) is an \(n \times n\) square matrix

\(\qquad d_k\) is the \(k\)-th pivot, i.e., the \((k,k)\) diagonal element observed after pivots \(1, \dots, k-1\) have been swept (for \(k = 1\), simply \(a_{11}\))

\(\qquad\)The product is taken over all \(n\) pivots

Key Properties:

  1. Triangular Matrices: \[ \text{If } A \text{ is triangular, } \det(A) = \prod_{i=1}^{n} a_{ii} \]

  2. Singularity Test: \[ \det(A) = 0 \iff A \text{ is singular} \]

  3. Multiplicative Property: \[ \det(AB) = \det(A)\det(B) \]

Computational Notes: The SWEEP-based calculation is mathematically equivalent to: \[ \det(A) = (-1)^s \prod_{k=1}^{n} p_k \] where:

\(\qquad p_k\) are the pivot elements, and

\(\qquad s\) is the number of row exchanges (the SWEEP-based routine used here performs no row exchanges, so \(s = 0\))

#' Matrix Determinant via SWEEP Operator
#'
#' Computes the determinant of a square matrix using sequential SWEEP operations,
#' with improved numerical stability and error handling.
#'
#' @param S A square numeric matrix
#' @param tol Tolerance for singularity detection (default = 1e-10)
#' @return The determinant as a numeric scalar
#' @export
#' @examples
#' A <- matrix(c(4,-2,4,2, -2,10,-2,-7, 4,-2,8,4, 2,-7,4,7), nrow=4)
#' DET(A)  # Returns 144
#' det(A)  # Base R equivalent
DET <- function(S, tol = 1e-10) {
  # Input validation
  if (!is.matrix(S) || !is.numeric(S)) {
    stop("Input must be a numeric matrix")
  }
  if (nrow(S) != ncol(S)) {
    stop("Matrix must be square")
  }
  
  # Load SWEEP function if needed
  if (!exists("SWEEP")) {
    if (file.exists("SWEEP.R")) {
      source("SWEEP.R")
    } else {
      stop("SWEEP.R not found in working directory")
    }
  }
  
  A <- as.matrix(S)
  r <- ncol(A)
  
  # Handle 1×1 matrix case
  if (r == 1) {
    return(as.numeric(A[1,1]))
  }
  
  # Initialize determinant
  det_value <- A[1,1]
  if (abs(det_value) < tol) {
    warning("Initial pivot is near zero - determinant may be unstable")
  }
  
  # Perform sequential sweeps
  if (r > 1) {
    A_swept <- SWEEP(A, 1)
    for (k in 2:r) {
      pivot <- A_swept[k,k]
      if (abs(pivot) < tol) {
        warning(sprintf("Near-zero pivot (%.2e) at position %d", pivot, k))
      }
      det_value <- det_value * pivot
      A_swept <- SWEEP(A_swept, k)
    }
  }
  
  # Return as numeric (not matrix)
  return(as.numeric(det_value))
}

# Save function
dump("DET", file = "DET.R")
# Demonstration with matrix A
cat("Matrix A:\n")
## Matrix A:
print(A)
##      col1 col2 col3 col4
## row1    4   -2    4    2
## row2   -2   10   -2   -7
## row3    4   -2    8    4
## row4    2   -7    4    7
cat("\nDeterminant via SWEEP (DET function):\n")
## 
## Determinant via SWEEP (DET function):
(DET_result <- DET(A))
## [1] 144
# [1] 144

cat("\nBase R determinant (det function):\n")
## 
## Base R determinant (det function):
(det_result <- det(A))
## [1] 144
# [1] 144

cat("\nVerification (difference):", DET_result - det_result, "\n")
## 
## Verification (difference): 0
# Verification (difference): 0 

# Edge case testing
cat("\n1×1 matrix test:\n")
## 
## 1×1 matrix test:
(mat_1x1 <- matrix(189.6))
##      [,1]
## [1,]  190
DET(mat_1x1)  # Returns 189.6
## [1] 190

2.5 Cholesky Decomposition

Let \(A\) be a symmetric positive definite matrix. Then \(A\) can be uniquely decomposed as \(A = U^T U\),

where:

\(\qquad U\) is an upper triangular matrix with positive diagonal entries

\(\qquad U^T\) is the conjugate transpose of \(U\) (simply the transpose for real matrices)

Matrix Formulation

For a 3×3 matrix \(A\):

\[ \begin{aligned} A &=\begin{pmatrix} A_{11} & A_{12} & A_{13}\\[6px] A_{21} & A_{22} & A_{23}\\[6px] A_{31} & A_{32} & A_{33} \end{pmatrix} \\[12pt] &= U^\top U \\[12pt] &=\begin{pmatrix} U_{11} & 0 & 0\\[6px] U_{12} & U_{22} & 0\\[6px] U_{13} & U_{23} & U_{33} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} & U_{13}\\[6px] 0 & U_{22} & U_{23}\\[6px] 0 & 0 & U_{33} \end{pmatrix}. \end{aligned} \]

#' Robust Cholesky Decomposition (upper triangular)
#'
#' Computes U such that A = t(U) %*% U for symmetric positive definite A.
#' Includes symmetry/finiteness checks, scale-aware pivot tolerance, and preserves dimnames.
#'
#' @param A Numeric square matrix (intended SPD).
#' @param tol Base tolerance for pivot positivity (relative; default 1e-12).
#' @param check_symmetry Logical; error if A not symmetric within tol. Default TRUE.
#' @return Upper-triangular matrix U with positive diagonal (same dimnames as A).
#' @references Watkins, D. S. (2010). *Fundamentals of Matrix Computations*
#' @examples
#' A1 <- matrix(4)  # 1x1 positive
#' A3 <- matrix(c(4,-2,4,2, -2,10,-2,-7, 4,-2,8,4, 2,-7,4,7), nrow=4, byrow=TRUE)
#' U1 <- cholesky(A1); all.equal(t(U1) %*% U1, A1)
#' U3 <- cholesky(A3); all.equal(t(U3) %*% U3, A3)
cholesky <- function(A, tol = 1e-12, check_symmetry = TRUE) {
  # --- validation ---
  if (!is.matrix(A) || !is.numeric(A)) stop("A must be a numeric matrix.")
  n <- nrow(A)
  if (n != ncol(A)) stop("A must be square.")
  if (n == 0) {
    U0 <- matrix(numeric(0), 0, 0)
    dimnames(U0) <- dimnames(A)
    return(U0)
  }
  if (any(!is.finite(A))) stop("A contains non-finite values (NA/NaN/Inf).")

  # symmetry check (soft)
  if (check_symmetry) {
    if (max(abs(A - t(A))) > tol * max(1, max(abs(A)))) {
      stop("A is not symmetric within tolerance.")
    }
  }

  # --- factorization (upper-triangular) ---
  U <- matrix(0, n, n)
  dimnames(U) <- dimnames(A)

  # scale-aware pivot floor
  scale <- max(1, max(diag(A)), na.rm = TRUE)
  piv_tol <- tol * scale

  for (j in seq_len(n)) {
    # diagonal update
    if (j == 1) {
      s <- 0
    } else {
      s <- sum(U[1:(j-1), j]^2)
    }
    dj <- A[j, j] - s
    if (!is.finite(dj) || dj <= piv_tol) {
      stop(sprintf("Matrix is not SPD: non-positive pivot at j=%d (%.3e).", j, dj))
    }
    U[j, j] <- sqrt(dj)

    # off-diagonals in row j (fill upper triangle)
    if (j < n) {
      for (i in (j + 1):n) {
        if (j == 1) {
          s <- 0
        } else {
          s <- sum(U[1:(j-1), j] * U[1:(j-1), i])
        }
        U[j, i] <- (A[j, i] - s) / U[j, j]
      }
    }
  }
  U
}

# save
dump("cholesky", "cholesky.R")
# -----------------------------
# Test cases (annotated)
# -----------------------------
A1 <- matrix(4)  # 1×1 positive
A2 <- matrix(-1) # 1×1 negative
A3 <- matrix(c(4,-2,4,2,
               -2,10,-2,-7,
               4,-2,8,4,
               2,-7,4,7),
             nrow = 4, byrow = TRUE)
A4 <- matrix(0, 0, 0) # Empty matrix

cat("A1 (1x1, positive): U and reconstruction\n")
## A1 (1x1, positive): U and reconstruction
U1 <- cholesky(A1)
print(U1)
##      [,1]
## [1,]    2
cat("t(U1) %*% U1:\n"); print(t(U1) %*% U1)  # should equal A1
## t(U1) %*% U1:
##      [,1]
## [1,]    4
cat("U1 diagonal > 0? ", U1[1,1] > 0, "\n")
## U1 diagonal > 0?  TRUE
cat("Reconstruction equals A1? ",
    isTRUE(all.equal(t(U1) %*% U1, A1)), "\n")
## Reconstruction equals A1?  TRUE
cat("\nA3 (4x4, SPD): U and reconstruction check\n")
## 
## A3 (4x4, SPD): U and reconstruction check
cat("Matrix A3:\n"); print(A3)
## Matrix A3:
##      [,1] [,2] [,3] [,4]
## [1,]    4   -2    4    2
## [2,]   -2   10   -2   -7
## [3,]    4   -2    8    4
## [4,]    2   -7    4    7
U3 <- cholesky(A3)
cat("\nCholesky factor U (upper triangular):\n")
## 
## Cholesky factor U (upper triangular):
print(round(U3, 6))
##      [,1] [,2] [,3] [,4]
## [1,]    2   -1    2    1
## [2,]    0    3    0   -2
## [3,]    0    0    2    1
## [4,]    0    0    0    1
cat("Check t(U)%*%U == A3:\n")
## Check t(U)%*%U == A3:
recon_err <- max(abs(t(U3) %*% U3 - A3))
cat("max |t(U)%*%U - A3| =", recon_err, "\n")
## max |t(U)%*%U - A3| = 0
cat("\nA4 (0x0): returns empty matrix\n")
## 
## A4 (0x0): returns empty matrix
print(cholesky(A4))
## <0 x 0 matrix>
cat("\nA2 (1x1, negative): should error\n")
## 
## A2 (1x1, negative): should error
try(print(cholesky(A2)))
## Error in cholesky(A2) : 
##   Matrix is not SPD: non-positive pivot at j=1 (-1.000e+00).
# Compare to base chol()
U_base <- chol(A3)                   # base R is upper-triangular
cat("\nCompare custom U to base chol(A3):\n")
## 
## Compare custom U to base chol(A3):
cat("all.equal(U_custom, U_base):", all.equal(U3, U_base), "\n")
## all.equal(U_custom, U_base): TRUE
cat("max |U_custom - U_base| =", max(abs(U3 - U_base)), "\n")
## max |U_custom - U_base| = 0

2.6 Forward Substitution

Forward substitution is an iterative method used to solve matrix equations of the form \(Lx=b\), where \(L\) is a lower triangular matrix. Similarly, back substitution is applied to solve equations involving upper triangular matrices, \(Ux=b\).

#' Forward Substitution for Lower Triangular Systems
#'
#' Solves the system Ly = b where L is lower triangular using forward substitution.
#'
#' @param L A lower triangular numeric matrix
#' @param b A numeric vector or column matrix
#' @return The solution vector y
#' @references Watkins, D.S. (2002). Fundamentals of Matrix Computations
#' @examples
#' L <- matrix(c(2,0,0,0, -1,3,0,0, 2,0,2,0, 1,-2,1,1), nrow=4, byrow=TRUE)
#' b <- c(8,2,16,6)
#' forward_solve(L, b)
forward_solve <- function(L, b) {
  # Input validation
  if (!is.matrix(L)) stop("L must be a matrix")
  if (!is.numeric(b)) stop("b must be numeric")
  if (nrow(L) != ncol(L)) stop("L must be square")
  if (nrow(L) != length(b)) stop("Dimensions of L and b must match")
  
  n <- nrow(L)
  b <- as.matrix(b)
  
  for (i in 1:n) {
    # Check for zero pivot
    if (abs(L[i,i]) < .Machine$double.eps) {
      stop(sprintf("Zero pivot encountered at position %d", i))
    }
    
    # Forward substitution
    if (i > 1) {
      for (j in 1:(i-1)) {
        b[i] <- b[i] - L[i,j] * b[j]
      }
    }
    b[i] <- b[i] / L[i,i]
  }
  
  # Return as column vector with names if available
  if (!is.null(rownames(L))) {
    rownames(b) <- rownames(L)
  }
  return(b)
}

# Save function
dump("forward_solve", file = "forward_solve.R")
# --------------------------------------------------
# Demonstration with Cholesky Decomposition
# --------------------------------------------------

### 1. Input Setup ###
cat("Lower triangular matrix L:\n")
## Lower triangular matrix L:
(L <- matrix(c(2,0,0,0, -1,3,0,0, 2,0,2,0, 1,-2,1,1), 
             nrow=4, byrow=TRUE, dimnames=list(paste0("Eq",1:4),NULL)))
##     [,1] [,2] [,3] [,4]
## Eq1    2    0    0    0
## Eq2   -1    3    0    0
## Eq3    2    0    2    0
## Eq4    1   -2    1    1
cat("\nRight-hand side vector b:\n")
## 
## Right-hand side vector b:
(b <- matrix(c(8,2,16,6), ncol=1, dimnames=list(paste0("Eq",1:4),NULL)))
##     [,1]
## Eq1    8
## Eq2    2
## Eq3   16
## Eq4    6
### 2. Forward Substitution ###
cat("\nSolving Ly = b via forward substitution:\n")
## 
## Solving Ly = b via forward substitution:
(y <- forward_solve(L, b))
##     [,1]
## Eq1    4
## Eq2    2
## Eq3    4
## Eq4    2
#       [,1]
# Eq1     4
# Eq2     2
# Eq3     4
# Eq4     2

### 3. Verification ###
cat("\nVerification (L %*% y should equal b):\n")
## 
## Verification (L %*% y should equal b):
(L %*% y)
##     [,1]
## Eq1    8
## Eq2    2
## Eq3   16
## Eq4    6
#       [,1]
# Eq1     8
# Eq2     2
# Eq3    16
# Eq4     6

### 4. Comparison with Base R ###
cat("\nBase R forwardsolve() result:\n")
## 
## Base R forwardsolve() result:
(base_y <- forwardsolve(L, b))
##      [,1]
## [1,]    4
## [2,]    2
## [3,]    4
## [4,]    2
cat("Maximum difference:", max(abs(y - base_y)), "\n")
## Maximum difference: 0
# Maximum difference: 0
#' Backward Substitution for Upper Triangular Systems
#'
#' Solves the system Ux = y where U is upper triangular using backward substitution.
#'
#' @param U An upper triangular numeric matrix
#' @param y A numeric vector or column matrix
#' @return The solution vector x
#' @references Watkins, D.S. (2002). Fundamentals of Matrix Computations
#' @examples
#' U <- matrix(c(2,-1,2,1, 0,3,0,-2, 0,0,2,1, 0,0,0,1), nrow=4, byrow=TRUE)
#' y <- c(4,2,4,2)
#' backward_solve(U, y)
backward_solve <- function(U, y) {
  # Input validation
  if (!is.matrix(U)) stop("U must be a matrix")
  if (!is.numeric(y)) stop("y must be numeric")
  if (nrow(U) != ncol(U)) stop("U must be square")
  if (nrow(U) != length(y)) stop("Dimensions of U and y must match")
  
  n <- nrow(U)
  y <- as.matrix(y)
  
  for (i in seq(n, 1, by = -1)) {  # Backward iteration
    # Check for zero pivot
    if (abs(U[i,i]) < .Machine$double.eps) {
      stop(sprintf("Zero pivot encountered at position %d", i))
    }
    
    # Backward substitution
    if (i < n) {
      for (j in (i+1):n) {
        y[i] <- y[i] - U[i,j] * y[j]
      }
    }
    y[i] <- y[i] / U[i,i]
  }
  
  # Return as named vector
  if (!is.null(rownames(U))) {
    names(y) <- rownames(U)
  }
  return(as.vector(y))
}

# Save function
dump("backward_solve", file = "backward_solve.R")
# --------------------------------------------------
# Complete Linear System Solver Demonstration 
# --------------------------------------------------

### Utility Functions ###
is_upper_triangular <- function(M) all(M[lower.tri(M)] == 0)
is_lower_triangular <- function(M) all(M[upper.tri(M)] == 0)

### 1. Matrix Setup ###
cat("Original matrix A:\n")
## Original matrix A:
(A <- matrix(c(
  4, -2,  4,  2,
  -2, 10, -2, -7,
  4, -2,  8,  4,
  2, -7,  4,  7
), nrow = 4, byrow = TRUE))
##      [,1] [,2] [,3] [,4]
## [1,]    4   -2    4    2
## [2,]   -2   10   -2   -7
## [3,]    4   -2    8    4
## [4,]    2   -7    4    7
cat("\nRight-hand side vector b:\n")
## 
## Right-hand side vector b:
(b <- matrix(c(8, 2, 16, 6), ncol = 1))
##      [,1]
## [1,]    8
## [2,]    2
## [3,]   16
## [4,]    6
### 2. Cholesky Decomposition ###
cat("\nCholesky decomposition: A = L %*% t(L)\n")
## 
## Cholesky decomposition: A = L %*% t(L)
L <- t(chol(A))  # Lower triangular
cat("Lower triangular matrix L:\n")
## Lower triangular matrix L:
print(L)
##      [,1] [,2] [,3] [,4]
## [1,]    2    0    0    0
## [2,]   -1    3    0    0
## [3,]    2    0    2    0
## [4,]    1   -2    1    1
U <- t(L)  # Upper triangular

cat("\nIs L lower triangular? ", is_lower_triangular(L), "\n")
## 
## Is L lower triangular?  TRUE
cat("Is U upper triangular? ", is_upper_triangular(U), "\n")
## Is U upper triangular?  TRUE
### 3. Forward and Backward Substitution ###
cat("\nStep 1: Solving Ly = b via forward substitution:\n")
## 
## Step 1: Solving Ly = b via forward substitution:
y <- forward_solve(L, b)
print(y)
##      [,1]
## [1,]    4
## [2,]    2
## [3,]    4
## [4,]    2
cat("\nStep 2: Solving Ux = y via backward substitution:\n")
## 
## Step 2: Solving Ux = y via backward substitution:
x <- backward_solve(U, y)
print(x)
## [1] 1 2 1 2
### 4. Verification ###
cat("\nVerification: Does A %*% x equal b?\n")
## 
## Verification: Does A %*% x equal b?
print(A %*% x)
##      [,1]
## [1,]    8
## [2,]    2
## [3,]   16
## [4,]    6
cat("Residual norm (||Ax - b||):", norm(A %*% x - b, "2"), "\n")
## Residual norm (||Ax - b||): 0
#' Solve Linear System Using Cholesky Decomposition
#'
#' Solves the system Ax = b for symmetric positive definite matrices using:
#' 1. Cholesky decomposition A = U'U
#' 2. Forward substitution to solve U'y = b
#' 3. Back substitution to solve Ux = y
#'
#' @param A A symmetric positive definite matrix
#' @param b Right-hand side vector or matrix
#' @return Solution vector x
#' @examples
#' A <- matrix(c(4,-2,4,2, -2,10,-2,-7, 4,-2,8,4, 2,-7,4,7), nrow=4)
#' b <- matrix(c(8,2,16,6), ncol=1)
#' solve_cholesky(A, b)
solve_cholesky <- function(A, b) {
  # Input validation
  if (!is.matrix(A) || !is.numeric(A)) stop("A must be numeric matrix")
  if (!is.matrix(b) && !is.vector(b)) stop("b must be vector or matrix")
  if (nrow(A) != ncol(A)) stop("A must be square")
  if (nrow(A) != length(b)) stop("A and b dimension mismatch")
  
  # Compute Cholesky decomposition
  U <- cholesky(A)
  
  # Forward substitution (solve U'y = b)
  y <- forwardsolve(t(U), b)
  
  # Back substitution (solve Ux = y)
  x <- backsolve(U, y)
  
  # Preserve names if available
  if (!is.null(rownames(A))) {
    names(x) <- rownames(A)
  }
  
  return(x)
}
# Demonstration with verification
A <- matrix(c(4,-2,4,2, -2,10,-2,-7, 4,-2,8,4, 2,-7,4,7), nrow=4)
b <- matrix(c(8,2,16,6), ncol=1)

cat("Solving Ax = b for:\n")
## Solving Ax = b for:
cat("A:\n"); print(A)
## A:
##      [,1] [,2] [,3] [,4]
## [1,]    4   -2    4    2
## [2,]   -2   10   -2   -7
## [3,]    4   -2    8    4
## [4,]    2   -7    4    7
cat("\nb:\n"); print(b)
## 
## b:
##      [,1]
## [1,]    8
## [2,]    2
## [3,]   16
## [4,]    6
x <- solve_cholesky(A, b)

cat("\nSolution x:\n")
## 
## Solution x:
print(x)
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    1
## [4,]    2
cat("\nVerification (Ax should equal b):\n")
## 
## Verification (Ax should equal b):
cat("Computed Ax:\n"); computed_b <- A %*% x
## Computed Ax:
print(computed_b)
##      [,1]
## [1,]    8
## [2,]    2
## [3,]   16
## [4,]    6
cat("\nOriginal b:\n"); print(b)
## 
## Original b:
##      [,1]
## [1,]    8
## [2,]    2
## [3,]   16
## [4,]    6
cat("\nMaximum residual:", max(abs(computed_b - b)), "\n")
## 
## Maximum residual: 0
# Compare with base R
cat("\nBase R solve() solution:\n")
## 
## Base R solve() solution:
base_x <- solve(A, b)
print(base_x)
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    1
## [4,]    2
cat("\nMaximum difference:", max(abs(x - base_x)), "\n")
## 
## Maximum difference: 0

2.6.1 The Inverse of a Matrix by Cholesky Decomposition

The Cholesky decomposition can also be used for matrix inversion, since computing the inverse (and the determinant) of a triangular factor is computationally cheaper than doing so for a general matrix.

Given the Cholesky decomposition of a positive definite matrix \(A\):

\[ A = U^T U \]

Then the inverse of \(A\) is:

\[ A^{-1} = (U^T U)^{-1} = U^{-1} (U^T)^{-1} = U^{-1} (U^{-1})^T \]

Similarly, if \(A = L L^T\), then:

\[ A^{-1} = (L L^T)^{-1} = (L^T)^{-1} L^{-1} = (L^{-1})^T L^{-1} \]
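A brief sketch of this route to the inverse, using base R's chol(), backsolve(), and chol2inv() for illustration (the custom cholesky() defined above could be substituted for chol()):

# Inverse via the Cholesky factor: A^{-1} = U^{-1} (U^{-1})^T (sketch; reuses A)
U    <- chol(A)                       # upper triangular, A = t(U) %*% U
Uinv <- backsolve(U, diag(nrow(A)))   # U^{-1} by back substitution
A_inv_chol <- Uinv %*% t(Uinv)        # A^{-1}
max(abs(A_inv_chol - solve(A)))       # ~0
all.equal(A_inv_chol, chol2inv(U), check.attributes = FALSE)  # base R shortcut, TRUE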


2.6.2 The Determinant of a Matrix by Cholesky Decomposition

Given a positive definite matrix \(A\) with Cholesky decomposition

\[ A = U^T U, \]

the determinant of \(A\) is

\[ |A| = |U^T U| = |U^T| \cdot |U| = \prod_{i=1}^n u_{ii} \times \prod_{i=1}^n u_{ii} = \left( \prod_{i=1}^n u_{ii} \right)^2, \]

where \(u_{ii}\) are the diagonal elements of the upper triangular matrix \(U\).

#' Determinant via Cholesky Decomposition
#'
#' Computes the determinant of a symmetric positive definite matrix using its
#' Cholesky decomposition: det(A) = det(U)^2 = (∏u_ii)^2
#' 
#' @param A A symmetric positive definite matrix
#' @return The determinant of A
#' @examples
#' A <- matrix(c(4,-2,4,2, -2,10,-2,-7, 4,-2,8,4, 2,-7,4,7), nrow=4)
#' DET2(A)  # 144
#' det(A)   # Base R equivalent
DET2 <- function(A) {
  # Input validation
  if (!is.matrix(A) || !is.numeric(A)) stop("Input must be numeric matrix")
  if (nrow(A) != ncol(A)) stop("Matrix must be square")
  
  # Compute Cholesky decomposition
  U <- cholesky(A)
  
  # Calculate determinant as product of squared diagonals
  det_value <- prod(diag(U))^2
  
  return(det_value)
}

# Save function
dump("DET2", file = "DET2.R")
# Example matrix
A <- matrix(c(
  4, -2,  4,  2,
  -2, 10, -2, -7,
  4, -2,  8,  4,
  2, -7,  4,  7
), nrow = 4, byrow = TRUE)

# Compute determinant using DET2
det_chol <- DET2(A)
print(det_chol)  # Expected: 144
## [1] 144
# Verify with base R's determinant function
det_base <- det(A)
print(det_base)  # Should also be 144
## [1] 144

Moreover, in the Monte Carlo method for simulating systems with multiple correlated variables, the Cholesky decomposition is used to factorize the correlation matrix into a lower-triangular matrix \(L\). This Cholesky factor is then applied to a vector of uncorrelated samples to generate a sample vector that preserves the covariance structure of the system being modeled.

# --------------------------------------------------
# Monte Carlo Simulation Example Using Cholesky
# --------------------------------------------------

#' Generate Correlated Normal Samples
#'
#' @param n Number of samples
#' @param mu Mean vector
#' @param Sigma Covariance matrix
#' @return Matrix of correlated samples
#' @examples
#' Sigma <- matrix(c(1,0.8,0.8,1), nrow=2)
#' correlated_samples(100, c(0,0), Sigma)
correlated_samples <- function(n, mu, Sigma) {
  p <- length(mu)
  
  # Cholesky decomposition
  L <- t(cholesky(Sigma))
  
  # Generate uncorrelated samples
  Z <- matrix(rnorm(n*p), nrow=p)
  
  # Transform to correlated samples
  X <- L %*% Z
  
  # Add means
  X <- sweep(X, 1, mu, "+")
  
  return(t(X))
}
### Demonstration ###

# 1. Define covariance structure
Sigma <- matrix(c(4, 3, 3, 9), nrow=2)

# 2. Generate correlated samples
set.seed(123)
samples <- correlated_samples(1000, c(1,2), Sigma)

# 3. Verify covariance structure
empirical_cov <- cov(samples)
cat("Target covariance:\n")
## Target covariance:
print(Sigma)
##      [,1] [,2]
## [1,]    4    3
## [2,]    3    9
cat("\nEmpirical covariance:\n")
## 
## Empirical covariance:
print(empirical_cov)
##      [,1] [,2]
## [1,]  3.8  2.5
## [2,]  2.5  8.7
cat("\nMaximum difference:", max(abs(Sigma - empirical_cov)), "\n")
## 
## Maximum difference: 0.5
# 4. Plot results
plot(samples, main="Correlated Normal Samples",
     xlab="Variable 1", ylab="Variable 2",
     col=adjustcolor("blue", alpha=0.5), pch=1)


2.6.3 Row Reduction (Gaussian Elimination) and Echelon Forms

Row reduction is an algorithm for solving systems of linear equations in which a sequence of elementary row operations is used to modify the matrix until the lower left-hand corner of the matrix is filled with zeros, as far as possible.

If for every non-zero row, the leading coefficient is to the right of the leading coefficient of the row above, then the matrix is said to be in row echelon form.

A rectangular matrix is in echelon form (or row echelon form) if it satisfies the following three properties:

  1. All nonzero rows are above any rows of all zeros.
  2. Each leading entry of a row is in a column to the right of the leading entry of the row above it.
  3. All entries in a column below a leading entry are zeros.

If a matrix in echelon form also satisfies the following additional conditions, then it is in reduced echelon form (or reduced row echelon form):

  1. The leading entry in each nonzero row is 1.
  2. Each leading 1 is the only nonzero entry in its column.

Using row operations to convert a matrix into reduced row echelon form is called Gauss–Jordan elimination.

If the matrix \(D\) represents a system of linear equations, then the reduced row echelon form allows one to write the solutions to the system.

As described in (Kincaid, D. R. & Cheney, E. W., 2008, p. 267):

#' Reduced Row Echelon Form (RREF) of a Matrix
#'
#' Computes the reduced row echelon form of a numeric matrix using
#' elementary row operations (Gauss-Jordan elimination).
#'
#' @param A A numeric matrix or an object coercible to a matrix.
#'
#' @return A matrix in reduced row echelon form.
#'
#' @details
#' The function applies a sequence of elementary row operations to
#' transform the matrix into its reduced row echelon form (RREF).
#' This form is useful for solving linear systems and analyzing
#' linear independence.
#'
#' @examples
#' A <- matrix(c(1, 2, -1, -4,
#'               2, 3, -1, -11,
#'               -2, 0, -3, 22), nrow = 3, byrow = TRUE)
#' RREF(A)
#'
#' @export
RREF <- function(A) {
  a <- as.matrix(A)
  n <- nrow(a)
  k <- ncol(a)
  l <- 1  # pivot column index
  
  for (r in 1:n) {
    if (l > k) break
    
    i <- r
    while (a[i, l] == 0) {
      i <- i + 1
      if (i > n) {
        i <- r
        l <- l + 1
        if (l > k) break
      }
    }
    if (l > k) break
    
    if (i != r) {
      temp <- a[i, ]
      a[i, ] <- a[r, ]
      a[r, ] <- temp
    }
    
    if (a[r, l] != 0) {
      a[r, ] <- a[r, ] / a[r, l]
    }
    
    for (i in 1:n) {
      if (i != r) {
        a[i, ] <- a[i, ] - a[i, l] * a[r, ]
      }
    }
    
    l <- l + 1
  }
  
  return(a)
}

# save to the working directory getwd(), ls()
dump("RREF", file = "RREF.R")

Solving a System of Linear Equations

Given the system \(A\mathbf{x} = \mathbf{b}\), where:

\[ A = \begin{pmatrix} 4 & -2 & 4 & 2 \\[6px] -2 & 10 & -2 & -7 \\[6px] 4 & -2 & 8 & 4 \\[6px] 2 & -7 & 4 & 7 \end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix} 8 \\[8px] 2 \\[8px] 16 \\[8px] 6 \end{pmatrix} \]

We construct the augmented matrix \(D = [A \mid \mathbf{b}]\):

\[ D = \left[\begin{array}{cccc|c} 4 & -2 & 4 & 2 & 8 \\[6px] -2 & 10 & -2 & -7 & 2 \\[6px] 4 & -2 & 8 & 4 & 16 \\[6px] 2 & -7 & 4 & 7 & 6 \end{array}\right] \]

Key Features:

  • Matrix \(A\) is the \(4 \times 4\) coefficient matrix
  • Vector \(\mathbf{b}\) is the right-hand side
  • Augmented matrix \(D\) combines both for row operations
  • The vertical bar separates the coefficients from the constants
# Define coefficient matrix A
A <- matrix(c(
  4, -2,  4,  2,
  -2, 10, -2, -7,
  4, -2,  8,  4,
  2, -7,  4,  7
), nrow = 4, byrow = TRUE)

# Define constant vector b
b <- c(8, 2, 16, 6)

# Form augmented matrix D = [A | b]
D <- cbind(A, b)
print(D)
##                   b
## [1,]  4 -2  4  2  8
## [2,] -2 10 -2 -7  2
## [3,]  4 -2  8  4 16
## [4,]  2 -7  4  7  6

# Compute Reduced Row Echelon Form
E <- RREF(D)
print(E)
##              b
## [1,] 1 0 0 0 1
## [2,] 0 1 0 0 2
## [3,] 0 0 1 0 1
## [4,] 0 0 0 1 2

# Interpretation:
# Solution to system: x1 = 1, x2 = 2, x3 = 1, x4 = 2

Interpretation

The RREF of \(D\) gives the solution:

\[ \mathbf{x} = \begin{pmatrix} x_1 \\[6px] x_2 \\[6px] x_3 \\[6px] x_4 \end{pmatrix} = \begin{pmatrix} 1 \\[6px] 2 \\[6px] 1 \\[6px] 2 \end{pmatrix} \]


Verification:

Substitute \(\mathbf{x}\) into \(A\mathbf{x}\):

\[ A\mathbf{x} = \begin{bmatrix} 4(1) - 2(2) + 4(1) + 2(2) \\[6px] -2(1) + 10(2) - 2(1) - 7(2) \\[6px] 4(1) - 2(2) + 8(1) + 4(2) \\[6px] 2(1) - 7(2) + 4(1) + 7(2) \end{bmatrix} = \begin{bmatrix} 8 \\[8px] 2 \\[8px] 16 \\[8px] 6 \end{bmatrix} = \mathbf{b} \]


2.6.4 Solution Verification

The proposed solution \(\mathbf{x} = \begin{bmatrix}1 & 2 & 1 & 2\end{bmatrix}^\top\) is correct.
Below is the step-by-step verification:

2.6.4.1 Matrix Multiplication: Compute \(A\mathbf{x}\)

Given:

\[ A = \begin{bmatrix} 4 & -2 & 4 & 2 \\[6px] -2 & 10 & -2 & -7 \\[6px] 4 & -2 & 8 & 4 \\[6px] 2 & -7 & 4 & 7 \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix}1 \\[8px] 2 \\[8px] 1 \\[8px] 2\end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix}8 \\[8px] 2 \\[8px] 16 \\[8px] 6\end{bmatrix} \]

2.6.4.2 Row-wise Arithmetic Verification

  • Row 1:
    • \(4(1) - 2(2) + 4(1) + 2(2) = 4 - 4 + 4 + 4 = 8\)
    • \(\Rightarrow\) Matches \(b_1 = 8\)
  • Row 2:
    • \(-2(1) + 10(2) - 2(1) - 7(2) = -2 + 20 - 2 - 14 = 2\)
    • \(\Rightarrow\) Matches \(b_2 = 2\)
  • Row 3:
    • \(4(1) - 2(2) + 8(1) + 4(2) = 4 - 4 + 8 + 8 = 16\)
    • \(\Rightarrow\) Matches \(b_3 = 16\)
  • Row 4:
    • \(2(1) - 7(2) + 4(1) + 7(2) = 2 - 14 + 4 + 14 = 6\)
    • \(\Rightarrow\) Matches \(b_4 = 6\)

2.6.4.3 Conclusion

Since \(A\mathbf{x} = \mathbf{b}\) holds exactly, the solution is validated.


2.6.5 Rank of a Matrix

The rank of a matrix \(A\) is the dimension of the vector space spanned by its columns, also known as the column space of \(A\). Equivalently, it is the maximum number of linearly independent column vectors in the matrix.

A matrix is said to have full rank if its rank is equal to the maximum possible for its dimensions — that is, the lesser of the number of rows and columns:

\[ \text{rank}(A) = \min(m, n) \]

A matrix is rank deficient if its rank is less than this maximum.

A common method for determining the rank of a matrix is to reduce it to row echelon form (REF) using Gaussian elimination. In this form, the rank corresponds to the number of leading 1s or pivots, which is also the number of non-zero rows.

Note: The same number of pivots is obtained whether you reduce rows or columns, since row rank and column rank are always equal.

#' Rank of a Matrix
#'
#' Computes the rank of a matrix using its reduced row echelon form (RREF).
#' The rank is defined as the number of non-zero rows in RREF.
#'
#' @param D A numeric matrix.
#'
#' @return An integer representing the rank of the matrix.
#'
#' @details
#' A matrix is said to have full rank if its rank equals the minimum of the number
#' of rows and columns. It is rank deficient if its rank is less than this value.
#' This function uses the custom \code{RREF} function, so ensure that \code{RREF.R}
#' is in your working directory.
#'
#' @examples
#' A <- matrix(c(4, -2, 4, 2,
#'               -2, 10, -2, -7,
#'               4, -2, 8, 4,
#'               2, -7, 4, 7), nrow = 4, byrow = TRUE)
#' b <- c(8, 2, 16, 6)
#' D <- cbind(A, b)
#' RANK(D)  # Returns 4
#'
#' # Rank deficient example:
#' C <- matrix(c(1, 2, 3,
#'               3, 4, 6,
#'               4, 6, 9), nrow = 3, byrow = FALSE)
#' RANK(C)  # Returns 2
#'
#' @export
RANK <- function(D) {
  # Load RREF function if needed
  if (!exists("RREF")) {
    if (file.exists("RREF.R")) source("RREF.R") else stop("RREF() not found")
  }
  
  # Compute Reduced Row Echelon Form
  E <- RREF(D)
  
  # Count non-zero rows (tolerance guards against floating-point residue)
  rank <- sum(apply(E, 1, function(x) any(abs(x) > 1e-10)))
  
  return(rank)
}

# Save function to file
dump("RANK", file = "RANK.R")
A <- matrix(c(4, -2, 4, 2,
              -2, 10, -2, -7,
              4, -2, 8, 4,
              2, -7, 4, 7), nrow = 4, byrow = TRUE)
b <- c(8, 2, 16, 6)
D <- cbind(A, b)

RANK(D)  # Output: 4 (full rank)
## [1] 4
C <- matrix(c(1, 2, 3,
              3, 4, 6,
              4, 6, 9), nrow = 3, byrow = FALSE)

C
##      [,1] [,2] [,3]
## [1,]    1    3    4
## [2,]    2    4    6
## [3,]    3    6    9
#      [,1] [,2] [,3]
# [1,]    1    3    4
# [2,]    2    4    6
# [3,]    3    6    9

RANK(C)  # Output: 2
## [1] 2
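As an independent cross-check (outside the RANK() workflow), base R's QR decomposition reports the same ranks:

qr(D)$rank  # 4, agrees with RANK(D)
qr(C)$rank  # 2, agrees with RANK(C)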
DET <- function(A) prod(diag(chol(A)))^2
# Note: Will throw an error if A is not positive definite

# This matrix is not full-rank; determinant should be zero (or fail via chol)
det(C)   # Base R
## [1] 3.3e-16
# [1] 0
#' Determinant of a Matrix with Fallback for Non-Positive-Definite Cases
#'
#' Computes the determinant of a square matrix. Attempts Cholesky decomposition
#' first (for speed), and falls back to LU decomposition if Cholesky fails.
#'
#' @param A A square numeric matrix.
#'
#' @return A numeric value representing the determinant of the matrix.
#'
#' @examples
#' A <- matrix(c(4, -2, 4, 2,
#'               -2, 10, -2, -7,
#'               4, -2, 8, 4,
#'               2, -7, 4, 7), nrow = 4, byrow = TRUE)
#' DET(A)  # 144
#'
#' C <- matrix(c(1, 2, 3,
#'               3, 4, 6,
#'               4, 6, 9), nrow = 3, byrow = FALSE)
#' DET(C)  # 0
#'
#' @export
DET <- function(A) {
  if (!is.matrix(A) || nrow(A) != ncol(A)) {
    stop("Input must be a square matrix.")
  }
  
  # Try Cholesky decomposition first (works only for symmetric positive-definite)
  det_val <- tryCatch({
    U <- chol(A)
    prod(diag(U))^2
  }, error = function(e) {
    # Fall back to LU decomposition (via base R's determinant function)
    det_base <- determinant(A, logarithm = FALSE)
    as.numeric(det_base$modulus) * ifelse(det_base$sign < 0, -1, 1)
  })
  
  return(det_val)
}
# Full-rank matrix
A <- matrix(c(4, -2, 4, 2,
              -2, 10, -2, -7,
              4, -2, 8, 4,
              2, -7, 4, 7), nrow = 4, byrow = TRUE)
DET(A)  # 144
## [1] 144
# Rank-deficient matrix
C <- matrix(c(1, 2, 3,
              3, 4, 6,
              4, 6, 9), nrow = 3, byrow = FALSE)
DET(C)  # 0
## [1] 3.3e-16

3 Multiple Linear Regression

In statistics, multiple explanatory factors often contribute to a single dependent variable. Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables (also known as the dependent and independent variables). The case of one explanatory variable is called simple linear regression; with more than one, the process is called multiple linear regression.

Multiple linear regression uses least squares to estimate a linear model relating the response variable \(Y\) to the explanatory variables \(X\) through the regression coefficients \(\beta\):

\[ Y_i = \beta_0 + \sum_{j=1}^{k-1} \beta_j X_{[i,j]} + \varepsilon_i, \quad \text{for } i = 1, 2, \dots, n. \] Written out observation by observation, the model is the system of \(n\) equations:

\[ \begin{aligned} Y_1 &= \beta_0 + \beta_1 X_{[1,1]} + \beta_2 X_{[1,2]} + \cdots + \beta_{(k-1)} X_{[1,(k-1)]} + \varepsilon_1 \\[12pt] Y_2 &= \beta_0 + \beta_1 X_{[2,1]} + \beta_2 X_{[2,2]} + \cdots + \beta_{(k-1)} X_{[2,(k-1)]} + \varepsilon_2 \\[12pt] Y_3 &= \beta_0 + \beta_1 X_{[3,1]} + \beta_2 X_{[3,2]} + \cdots + \beta_{(k-1)} X_{[3,(k-1)]} + \varepsilon_3 \\[12pt] &\vdots \quad + \quad \vdots \quad + \quad \vdots \quad + \quad \ddots \quad + \quad \vdots \quad + \quad \vdots \\[12pt] Y_n &= \beta_0 + \beta_1 X_{[n,1]} + \beta_2 X_{[n,2]} + \cdots + \beta_{(k-1)} X_{[n,(k-1)]} + \varepsilon_n \end{aligned} \]

In matrix notation, this system is written as:

\[ \begin{pmatrix} Y_1 \\[8pt] Y_2 \\[8pt] Y_3 \\[8pt] \vdots \\[8pt] Y_n \end{pmatrix} = \begin{pmatrix} \beta_0 & + & \beta_1 X_{[1,1]} & + & \beta_2 X_{[1,2]} & + & \ldots & + & \beta_{(k-1)} X_{[1,(k-1)]} \\[6pt] \beta_0 & + & \beta_1 X_{[2,1]} & + & \beta_2 X_{[2,2]} & + & \ldots & + & \beta_{(k-1)} X_{[2,(k-1)]} \\[6pt] \beta_0 & + & \beta_1 X_{[3,1]} & + & \beta_2 X_{[3,2]} & + & \ldots & + & \beta_{(k-1)} X_{[3,(k-1)]} \\[6pt] \vdots & + & \vdots & + & \vdots & + & \ddots & + & \vdots \\[6pt] \beta_0 & + & \beta_1 X_{[n,1]} & + & \beta_2 X_{[n,2]} & + & \ldots & + & \beta_{(k-1)} X_{[n,(k-1)]} \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\[8pt] \varepsilon_2 \\[8pt] \varepsilon_3 \\[8pt] \vdots \\[8pt] \varepsilon_n \end{pmatrix} \]

\[ = \begin{pmatrix} 1 & X_{[1,1]} & X_{[1,2]} & \ldots & X_{[1,(k-1)]} \\[6pt] 1 & X_{[2,1]} & X_{[2,2]} & \ldots & X_{[2,(k-1)]} \\[6pt] 1 & X_{[3,1]} & X_{[3,2]} & \ldots & X_{[3,(k-1)]} \\[6pt] \vdots & \vdots & \vdots & \ddots & \vdots \\[6pt] 1 & X_{[n,1]} & X_{[n,2]} & \ldots & X_{[n,(k-1)]} \end{pmatrix} \begin{pmatrix} \beta_0 \\[6pt] \beta_1 \\[6pt] \beta_2 \\[6pt] \vdots \\[6pt] \beta_{(k-1)} \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\[6pt] \varepsilon_2 \\[6pt] \varepsilon_3 \\[6pt] \vdots \\[6pt] \varepsilon_n \end{pmatrix} \]

where:

\(\qquad k\) is the number of regression parameters (the intercept plus \(k-1\) slopes)

\(\qquad n\) is the number of observations

\(\qquad X_{n \times k}\) is the design matrix (its first column of ones carries the intercept)

\(\qquad \beta_0\) is the intercept

\(\qquad \beta_{k \times 1}\) is the vector of regression coefficients

\(\qquad \varepsilon_i \sim \text{iid } N(0, \sigma^2)\) are the error terms

\(\qquad Y_{n \times 1}\) is the response vector

3.1 Regression Coefficients

The goal of ordinary least squares regression is to minimize the differences (residuals) between the observed responses and the responses predicted by the linear approximation (fitted values) of the data.

\[ \begin{aligned} Y &= X\beta + \varepsilon \\[12pt] \varepsilon &= Y - X\beta \end{aligned} \]

This method obtains parameter estimates that minimize the sum of squared residuals (also residual sum of squares or sum of squared errors of prediction).

\[ \begin{aligned} \varepsilon^T \varepsilon &= (Y - X\beta)^T (Y - X\beta) \\[12pt] &= (Y^T - \beta^T X^T)(Y - X\beta) \\[12pt] &= Y^T Y - Y^T X\beta - \beta^T X^T Y + \beta^T X^T X \beta \\[12pt] &= Y^T Y - 2\,\beta^T X^T Y + \beta^T X^T X \beta, \end{aligned} \] since \(Y^T X\beta\) is a scalar and therefore equals its own transpose \(\beta^T X^T Y\).

To minimize the sum of squared residuals, set its first-order partial derivative with respect to \(\beta\) to zero:

\[ \begin{aligned} \frac{\partial}{\partial \beta}\left[ Y^T Y - 2\,\beta^T X^T Y + \beta^T X^T X \beta \right] &= 0 \\[12pt] 0 - 2X^T Y + 2X^T X \beta &= 0 \\[12pt] 2X^T X \beta &= 2X^T Y \\[12pt] X^T X \beta &= X^T Y \end{aligned} \]

To solve for \(\beta\), premultiply both sides by \((X^T X)^{-1}\), which exists when \(X\) has full column rank:

\[ \begin{aligned} (X^T X)^{-1}(X^T X) \beta &= (X^T X)^{-1} X^T Y \\[6pt] I \beta &= (X^T X)^{-1} X^T Y \\[6pt] \beta &= (X^T X)^{-1} X^T Y \end{aligned} \]

############################################################
## Multiple Linear Regression via Normal Equations (X'X)  ##
## - Robust, self-contained, and annotated                ##
############################################################

## Helper: safe numeric coercion (kills factor-to-numeric bugs)
as_num <- function(x) {
  if (is.factor(x))     return(as.numeric(as.character(x)))
  if (is.character(x))  return(as.numeric(x))
  as.numeric(x)
}

## Helper: safe inverse
## - Tries user-defined INV(), then base solve()
## - Gives informative error if XtX is singular
safe_inverse <- function(M) {
  # If a custom INV() is defined in your environment, prefer it
  if (exists("INV", mode = "function", inherits = TRUE)) {
    return(INV(M))
  }
  # Fallback to base R
  solve(M)
}

## Main: regression coefficients via normal equations
##  beta_hat = (D' D)^(-1) D' Y, where D is the design matrix (w/wo intercept)
##
## Features:
## - If Y is NULL, uses LAST column of X as response (your workflow).
## - Drops rows with any NA across X or Y (listwise deletion).
## - Adds intercept unless intercept = FALSE.
## - Auto-fixes "(Intercept)" vs "Intercept" naming.
## - Guards against all-ones predictor (double-intercept).
regression_coefficients <- function(X, Y = NULL, intercept = TRUE) {
  # Ensure matrices (and numeric storage)
  X <- as.matrix(X)
  storage.mode(X) <- "double"

  # Split X and Y if Y is NULL (last column == response)
  if (is.null(Y)) {
    p <- ncol(X)
    if (is.null(p) || p < 1L) stop("X must have at least 1 column.")
    Y <- as.matrix(X[, p, drop = FALSE])
    X <- as.matrix(X[, 1L:(p - 1L), drop = FALSE])
  } else {
    Y <- as.matrix(Y)
    storage.mode(Y) <- "double"
  }

  # Ensure sane names for X and Y
  if (is.null(colnames(X)) || anyNA(colnames(X)) || any(colnames(X) == "")) {
    colnames(X) <- paste0("X", seq_len(ncol(X)))
  }
  if (is.null(colnames(Y))) {
    colnames(Y) <- if (ncol(Y) == 1L) "Y" else paste0("Y", seq_len(ncol(Y)))
  }

  # Listwise deletion (complete cases across X and Y)
  keep <- complete.cases(cbind(X, Y))
  X <- X[keep, , drop = FALSE]
  Y <- Y[keep, , drop = FALSE]
  if (nrow(X) == 0L) stop("No complete cases after listwise deletion.")

  # Add intercept if requested (avoid double-intercept from all-ones column)
  if (intercept) {
    ones <- apply(X, 2L, function(col) all(col == 1))
    if (any(ones)) X <- X[, !ones, drop = FALSE]
    D <- cbind(Intercept = 1, X)
  } else {
    D <- X
  }

  # Normal equations
  XtX <- t(D) %*% D
  XtY <- t(D) %*% Y

  # Solve for beta_hat with a sturdy inverse (or your INV2)
  beta_hat <- tryCatch(
    safe_inverse(XtX) %*% XtY,
    error = function(e) {
      # If available, sweep the symmetric augmented moment matrix as a last resort
      if (exists("SWEEP", mode = "function", inherits = TRUE)) {
        Aug <- rbind(cbind(XtX, XtY), cbind(t(XtY), crossprod(Y)))
        SW  <- SWEEP(Aug, seq_len(ncol(XtX)))   # sweep all predictor pivots
        return(SW[seq_len(ncol(XtX)), ncol(XtX) + seq_len(ncol(XtY)), drop = FALSE])
      }
      stop("Failed to invert XtX (near-singular?). Original error: ", conditionMessage(e))
    }
  )

  # Name rows = terms; cols = response(s)
  rownames(beta_hat) <- colnames(D)
  colnames(beta_hat) <- colnames(Y)

  # For univariate Y, return a named vector; otherwise a matrix
  if (ncol(beta_hat) == 1L) {
    out <- as.numeric(beta_hat[, 1L])
    names(out) <- rownames(beta_hat)
    # Standardize intercept naming to match lm()/glm()
    names(out) <- sub("^Intercept$", "(Intercept)", names(out))
    return(out)
  } else {
    rownames(beta_hat) <- sub("^Intercept$", "(Intercept)", rownames(beta_hat))
    return(beta_hat)
  }
}
############################################################
## Verification: compare to lm() and glm(family=gaussian) ##
############################################################

## Example data workflow 1: You already have X with last column as Y
## (If your X lacks names, we create stable names so formulas behave.)
##
## Uncomment and adapt this to your object X, or run the mtcars demo below.
# if (is.null(colnames(X)) || anyNA(colnames(X)) || any(colnames(X) == "")) {
#   colnames(X) <- paste0("X", seq_len(ncol(X)))
# }
# p    <- ncol(X)
# df   <- as.data.frame(X, stringsAsFactors = FALSE)
# form <- as.formula(
#   paste(colnames(df)[p], "~", paste(colnames(df)[1:(p - 1)], collapse = " + "))
# )
# b_ours <- regression_coefficients(X)
# b_lm   <- coef(lm(form, data = df))
# b_glm  <- coef(glm(form, data = df, family = gaussian()))

## Example data workflow 2 (fully reproducible): mtcars demo
data(mtcars)
X_demo <- as.matrix(mtcars[, c("hp","wt","qsec","mpg")])  # last col is response
colnames(X_demo) <- c("hp","wt","qsec","mpg")             # ensure clean names

# Build matching formula for lm/glm
p    <- ncol(X_demo)
df   <- as.data.frame(X_demo, stringsAsFactors = FALSE)
form <- as.formula(
  paste(colnames(df)[p], "~", paste(colnames(df)[1:(p - 1)], collapse = " + "))
)

# Compute coefficients (our implementation vs lm/glm)
b_ours <- regression_coefficients(X_demo)                  # returns a named vector
b_lm   <- coef(lm(form, data = df))
b_glm  <- coef(glm(form, data = df, family = gaussian()))

# Align by canonical names from lm (handles intercept naming and order)
std_names <- names(b_lm)
b_ours_aligned <- b_ours[std_names]
b_glm_aligned  <- b_glm[std_names]

# Coerce EVERYTHING to plain numeric (avoids factor/character code leaks)
b_ours_num <- as_num(b_ours_aligned)
b_lm_num   <- as_num(b_lm)
b_glm_num  <- as_num(b_glm_aligned)

# Sanity checks
stopifnot(length(b_ours_num) == length(std_names),
          length(b_lm_num)   == length(std_names),
          length(b_glm_num)  == length(std_names),
          all(is.finite(b_ours_num)),
          all(is.finite(b_lm_num)),
          all(is.finite(b_glm_num)))

# Comparison table (no accidental rowname inference)
comp <- data.frame(
  Term = std_names,
  Beta = round(b_ours_num, 6),
  LM   = round(b_lm_num, 6),
  GLM  = round(b_glm_num, 6),
  check.names = FALSE,
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(comp)
##          Term   Beta     LM    GLM
## 1 (Intercept) 27.611 27.611 27.611
## 2          hp -0.018 -0.018 -0.018
## 3          wt -4.359 -4.359 -4.359
## 4        qsec  0.511  0.511  0.511
# Numeric agreement diagnostics
cat(sprintf("\nMax |Beta - LM|  = %.3e\n", max(abs(b_ours_num - b_lm_num))))
## 
## Max |Beta - LM|  = 4.906e-12
cat(sprintf("Max |Beta - GLM| = %.3e\n", max(abs(b_ours_num - b_glm_num))))
## Max |Beta - GLM| = 4.906e-12
############################################################
## Notes & Troubleshooting                                ##
## - If your X includes an all-ones column and intercept=TRUE,
##   we drop the all-ones column to avoid a double-intercept.
## - If XtX is near-singular (multicollinearity), safe_inverse()
##   will error; consider dropping collinear columns or use QR:
##     coef(lm(Y ~ X))  # uses QR by default and is numerically stable
## - If you maintain custom linear algebra:
##   * Define INV() (the SWEEP-based inverse above) and/or SWEEP() in your env.
##   * regression_coefficients() will automatically use them if found.
############################################################

3.1.1 The Covariance Matrix and the Regression Coefficients Using the Sweep Operator

Excerpt From: Roderick J. A. Little & Donald B. Rubin, Statistical Analysis with Missing Data (Little & Rubin, 2020)

The sweep operator is closely related to linear regression. For example, suppose that G is a \(2 \times 2\) covariance matrix of two variables \(Y_1\) and \(Y_2\), and let:

\[ H = \text{SWEEP}[1]G \]

Then:

\(\qquad h_{12}\) is the coefficient of \(Y_1\) from the regression of \(Y_2\) on \(Y_1\).

\(\qquad h_{22}\) is the residual variance of \(Y_2\) given \(Y_1\).

Furthermore, when \(G\) is a sample covariance matrix from \(n\) independent units, the estimated sampling variance of the regression coefficient \(h_{12}\) is:

\[ -\frac{h_{11}h_{22}}{n} \]
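A numerical sketch of this relationship, assuming the SWEEP() function defined earlier is available; the simulated data below are illustrative, and because the sample covariance uses the \(n-1\) denominator, the swept residual-variance entry equals RSS\(/(n-1)\) rather than lm()'s RSS\(/(n-2)\):

# Sweep pivot 1 of a 2x2 covariance matrix and read off the regression of Y2 on Y1 (sketch)
set.seed(7)
y1 <- rnorm(200)
y2 <- 1.5 * y1 + rnorm(200, sd = 0.5)
G2 <- cov(cbind(y1, y2))               # 2 x 2 sample covariance matrix
H  <- SWEEP(G2, 1)                     # sweep on row/column 1

fit <- lm(y2 ~ y1)
H[1, 2]                                # h12: slope of Y2 on Y1 ...
unname(coef(fit)["y1"])                # ... matches the lm() coefficient
H[2, 2]                                # h22: residual variance of Y2 given Y1
sum(resid(fit)^2) / (length(y1) - 1)   # same quantity from the lm() residuals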


4 Regression via the SWEEP Operator

4.1 1. Augmented Data Matrix

For sample size n and K predictors:

\[ Z_i = \begin{bmatrix} 1 & y_{i1} & y_{i2} & \cdots & y_{iK} \end{bmatrix}, \quad i=1,\dots,n \]

Stack into matrix: \[ Z = \begin{pmatrix} 1 & y_{11} & \cdots & y_{1K} \\[6px] 1 & y_{21} & \cdots & y_{2K} \\[6px] \vdots & \vdots & \ddots & \vdots \\[6px] 1 & y_{n1} & \cdots & y_{nK} \end{pmatrix} \]

4.2 2. Moment (Gram) Matrix

\[ G = \frac{1}{n} Z^\top Z = \begin{pmatrix} 1 & \bar{y}_1 & \bar{y}_2 & \cdots & \bar{y}_K \\[12px] \bar{y}_1 & \frac{1}{n}\sum y_1^2 & \frac{1}{n}\sum y_1 y_2 & \cdots & \frac{1}{n}\sum y_1 y_K \\[12px] \bar{y}_2 & \frac{1}{n}\sum y_2 y_1 & \frac{1}{n}\sum y_2^2 & \cdots & \frac{1}{n}\sum y_2 y_K \\[12px] \vdots & \vdots & \vdots & \ddots & \vdots \\[12px] \bar{y}_K & \frac{1}{n}\sum y_K y_1 & \frac{1}{n}\sum y_K y_2 & \cdots & \frac{1}{n}\sum y_K^2 \end{pmatrix} \]

This matrix incorporates both the means and covariances of the variables \(y_1, \dots, y_K\), and is used to perform regression computations via the sweep operator.

This is the input to the SWEEP algorithm.

4.3 3. Regression Setup

Suppose the last column is the response \(Y\) and the remaining columns are the predictors \(X\). Then the augmented Gram matrix is:

\[ G^* = \begin{pmatrix} G_{XX} & G_{XY} \\[6px] G_{YX} & G_{YY} \end{pmatrix} \]

  • \(G_{XX}\): predictor cross-products
  • \(G_{XY}\): predictor–response products
  • \(G_{YY}\): response variance term

4.4 4. Apply SWEEP

Perform SWEEP on indices corresponding to predictors (X).

Result: - The swept matrix contains regression coefficients in the XY block. - Specifically: \[ \hat\beta = G_{XX}^{-1} G_{XY} \]

  • Residual variance estimate comes from the swept YY element.

4.5 5. Connection to OLS

The SWEEP operator provides the same result as ordinary least squares:

\[ \hat\beta = (X^\top X)^{-1} X^\top Y \]

but computed directly through transformations of the moment matrix G, avoiding explicit matrix inversion.


Summary:

  1. Build G = (1/n) ZᵀZ.
  2. SWEEP on predictor indices.
  3. Extract regression coefficients and residual variance directly.
#' Augmented Parameter Matrix [@Little:Rubin:2002]
#'
#' Computes the (K+1) x (K+1) augmented parameter matrix G used in regression 
#' via the sweep operator. Missing values are imputed with column means.
#'
#' @param A A data frame or matrix of numeric variables (possibly with missing values).
#' @return A numeric matrix G = (t(D) %*% D)/n, where D is the design matrix with intercept.
#' @examples
#' G <- augmented_parameter_matrix(X)

augmented_parameter_matrix <- function(A){
  # Remove rows that are entirely NA
  X <- data.matrix(A[apply(A, 1, function(x) !all(is.na(x))), ])
  
  # Impute missing values with column means
  X <- apply(X, 2, function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))
  
  # Design matrix with intercept column
  D <- cbind(1, X)
  n <- nrow(D)
  
  # Return augmented parameter matrix
  return((t(D) %*% D) / n)
}

# Save to file in working directory
dump("augmented_parameter_matrix", file = "augmented_parameter_matrix.R")
# Example use
G <- augmented_parameter_matrix(X)
print(G)
##            X1   X2   X3   X4    Y
##     1.0   7.5   48   12   30   95
## X1  7.5  87.6  379   59  202  772
## X2 48.2 378.6 2542  554 1211 4771
## X3 11.8  59.2  554  176  356 1076
## X4 30.0 201.5 1211  356 1159 2672
## Y  95.4 771.7 4771 1076 2672 9314

Notes:

  • Header Components:
    • Top Row: Contains column means (\(\bar{Y}_j\) for each variable \(j\))
    • First Column: Contains intercept terms (for regression constant)
  • Core Matrix Components:
    • Scaled Cross-Products \(= \left(\frac{1}{n}\right) \sum_{i=1}^{n} (Y_i \times Y_j)\)

Where:

\(\qquad n =\) sample size

\(\qquad Y_i\), \(Y_j\) = paired observations

\(\qquad \sum\) runs across all observations

Purpose:

  • Designed specifically for efficient regression computation
  • Structure optimized for:
    • Sweep operator applications
    • Sequential partial regression analyses
    • Variance-covariance decomposition

Key Properties:

  • Symmetric positive-definite when no perfect multicollinearity exists
  • Diagonal elements represent \(\frac{1}{n}\sum Y_i^2\) (scaled sum of squares)
  • Off-diagonals represent \(\frac{1}{n}\sum Y_i Y_j\) (scaled cross-products)
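
A quick numerical check of these properties, assuming the matrix G computed above is still in the workspace:

isSymmetric(unname(G))                                       # TRUE: G is symmetric
min(eigen(G, symmetric = TRUE, only.values = TRUE)$values)   # > 0 when there is no exact collinearity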

4.5.1 The Variance-Covariance Matrix Using the Sweep Operator

Sweeping on row and column 1 of the augmented parameter matrix \(G\) yields the matrix:

\[ \text{SWEEP}[1]G = \Theta = \begin{pmatrix} -1 & \bar{y}_1 & \bar{y}_2 & \cdots & \bar{y}_K \\[6px] \bar{y}_1 & \hat{\sigma}_{11} & \hat{\sigma}_{21} & \cdots & \hat{\sigma}_{K1} \\[6px] \bar{y}_2 & \hat{\sigma}_{12} & \hat{\sigma}_{22} & \cdots & \hat{\sigma}_{K2} \\[6px] \vdots & \vdots & \vdots & \ddots & \vdots \\[6px] \bar{y}_K & \hat{\sigma}_{1K} & \hat{\sigma}_{2K} & \cdots & \hat{\sigma}_{KK} \end{pmatrix} \]

Where:

\(\qquad \bar{y}_j\) is the sample mean of variable \(Y_j\)

\(\qquad \hat{\sigma}_{jk}\) is the sample covariance between variables \(Y_j\) and \(Y_k\), computed with denominator \(n\) (not \(n-1\))


4.5.2 Interpretation

This sweep operation performs a correction for the means, transforming the scaled cross-products matrix \(G\) into the covariance matrix of the variables \(Y_1, \dots, Y_K\). This is why it is referred to as:

Sweeping on the constant term (the leading 1 that precedes \(Y_1, \dots, Y_K\) in the augmented matrix)

In regression terms:

  • The first row and column of \(\Theta\) (excluding the \(-1\)) contain the regression coefficients of \(Y_1, \dots, Y_K\) on the constant term, i.e., the sample means.
  • The remaining block (lower-right) of the matrix contains the residual covariance matrix of \(Y_1, \dots, Y_K\) after regressing out the intercept.

Summary

  • Before sweep: \(G\) contains means and cross-products.
  • After sweep on 1: \(\Theta = \text{SWEEP}[1]G\) gives:
    • Sample means \((\bar{y}_j)\)
    • Covariances \((\hat{\sigma}_{jk})\) with divisor \(n\)
    • Regression coefficients on intercept in first row/column

This symmetric matrix \(\Theta\) is called the augmented covariance matrix.

(Theta <- SWEEP(G, 1))
##            X1   X2    X3     X4    Y
##    -1.0   7.5   48  11.8   30.0   95
## X1  7.5  31.9   19 -28.7  -22.3   60
## X2 48.2  19.3  224 -12.8 -233.9  176
## X3 11.8 -28.7  -13  37.9    2.9  -48
## X4 30.0 -22.3 -234   2.9  258.6 -191
## Y  95.4  59.7  176 -47.6 -190.9  209

Sweeping on row and column 1 gives Theta (i.e., SWEEP(G, 1)) — an updated matrix where:

  • The first row (and column) now contains the regression coefficients of the variables on the intercept, i.e., their sample means.
  • The remaining submatrix is the adjusted covariance matrix.
# ------------------------------------------------------------
# Structure of Θ = SWEEP[1] G (dimension K+1)
# ------------------------------------------------------------
# G = (1/n) ZᵀZ, where Z = [1, X]; sweeping on the first
# row/column corrects for the means, so Θ partitions as:
#   Θ[ 1,  1 ]  = -1                       (scalar)
#   Θ[-1,  1 ]  = column means of X        (K×1 vector)
#   Θ[ 1, -1 ]  = column means of X        (1×K vector)
#   Θ[-1, -1 ]  = central second moments   (K×K matrix)
#
# The central second moments are exactly the covariance matrix
# computed with denominator n instead of n-1.
# ------------------------------------------------------------

# Biased covariance matrix (divide by n)
# Biased covariance (denominator n):
cov_biased <- cov(X) * (nrow(X) - 1) / nrow(X)

cat("=== Biased Covariance (denominator n) ===\n")
## === Biased Covariance (denominator n) ===
print(round(cov_biased, 4))
##     X1   X2    X3     X4    Y
## X1  32   19 -28.7  -22.3   60
## X2  19  224 -12.8 -233.9  176
## X3 -29  -13  37.9    2.9  -48
## X4 -22 -234   2.9  258.6 -191
## Y   60  176 -47.6 -190.9  209
# Compare with Θ block
cat("\n=== Gram Matrix Block Θ[-1, -1] ===\n")
## 
## === Gram Matrix Block Θ[-1, -1] ===
print(round(Theta[-1, -1], 4))
##     X1   X2    X3     X4    Y
## X1  32   19 -28.7  -22.3   60
## X2  19  224 -12.8 -233.9  176
## X3 -29  -13  37.9    2.9  -48
## X4 -22 -234   2.9  258.6 -191
## Y   60  176 -47.6 -190.9  209
cat("-> Matches biased covariance above.\n")
## -> Matches biased covariance above.
# ------------------------------------------------------------
# Means appear in Θ[1, -1] and Θ[-1, 1]:
# ------------------------------------------------------------
cat("\n=== Sample Means ===\n")
## 
## === Sample Means ===
print(round(colMeans(X), 4))
##   X1   X2   X3   X4    Y 
##  7.5 48.2 11.8 30.0 95.4
cat("\n=== Gram Matrix Means Blocks ===\n")
## 
## === Gram Matrix Means Blocks ===
cat("Row vector:\n")
## Row vector:
print(round(Theta[1, -1], 4))   # row vector
##   X1   X2   X3   X4    Y 
##  7.5 48.2 11.8 30.0 95.4
cat("\nColumn vector:\n")
## 
## Column vector:
print(round(Theta[-1, 1], 4))   # column vector
##   X1   X2   X3   X4    Y 
##  7.5 48.2 11.8 30.0 95.4
cat("-> Both match the sample means above.\n")
## -> Both match the sample means above.
cat("\n=== Constant Term ===\n")
## 
## === Constant Term ===
print(Theta[1, 1])
## [1] -1
cat("-> After sweeping the intercept, this equals -1.\n")
## -> After sweeping the intercept, this equals -1.

The lower-right block of Theta is the variance-covariance matrix (the biased, maximum-likelihood estimator with denominator \(n\)).

Summary of Matrix Structure

After the sweep, Theta is structured as:

\[ \Theta = \begin{pmatrix} -1 & \mu^{\top} \\[6px] \mu & \Sigma \end{pmatrix} \]

where \(\mu\) is the vector of variable means and \(\Sigma\) is the covariance matrix.

This structure is essential in multivariate regression or MANOVA, and the sweep operator makes inversion and updates efficient without recalculating from scratch.
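
The block structure can be verified directly. A small sketch, assuming Theta, X, and cov_biased from the code above are in the workspace (and X has no missing values):

# Rebuild Theta from its blocks and compare with the swept matrix
m <- colMeans(X)
Theta_blocks <- rbind(c(-1, m), cbind(m, cov_biased))
all.equal(unname(Theta_blocks), unname(Theta))   # TRUE up to floating-point error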

4.6 The Regression Coefficients Using the Sweep

If matrix G is swept on rows and columns 1 and 2 (i.e., SWEEP[1,2]G), the result is a block matrix of the form:

\[ \text{SWEEP}_{[1,2]}(G) = \begin{pmatrix} -\left(1 + \frac{\bar{y}_1^2}{\sigma_{11}}\right) & \frac{\bar{y}_2 - \bar{y}_1 \sigma_{12} / \sigma_{11}}{\sigma_{11}} & \cdots & \frac{\bar{y}_K - \bar{y}_1 \sigma_{1K} / \sigma_{11}}{\sigma_{11}} \\[12px] \frac{\bar{y}_2 - \bar{y}_1 \sigma_{12} / \sigma_{11}}{\sigma_{11}} & \frac{\sigma_{22} - \sigma_{12}^2 / \sigma_{11}}{\sigma_{11}} & \cdots & \frac{\sigma_{2K} - \sigma_{1K} \sigma_{12} / \sigma_{11}}{\sigma_{11}} \\[12px] \vdots & \vdots & \ddots & \vdots \\[12px] \frac{\bar{y}_K - \bar{y}_1 \sigma_{1K} / \sigma_{11}}{\sigma_{11}} & \frac{\sigma_{2K} - \sigma_{1K} \sigma_{12} / \sigma_{11}}{\sigma_{11}} & \cdots & \frac{\sigma_{KK} - \sigma_{1K}^2 / \sigma_{11}}{\sigma_{11}} \end{pmatrix} = \begin{pmatrix} -A & B^\top \\[6px] B & C \end{pmatrix} \]

Matrix Interpretation:

  • A is a 2×2 matrix representing intercept–intercept and slope–slope structure.
  • B is a 2×(K−1) matrix:
  • Each column of B gives the intercept and slope for the regression of Y_{j+1} on Y₁, for j = 1, ..., K−1.
  • C is a (K−1)×(K−1) matrix:
  • Represents the residual covariance matrix of Y₂, ..., Y_K after regressing them on Y₁.

Notes:

  • The first row gives intercepts.
  • The second row gives slopes (regression coefficients).
  • The C block (lower-right) contains residual covariances.
  • The elements of A, when multiplied by the corresponding elements in C and divided by n, give asymptotic variances and covariances of the regression coefficients in B.
SWEEP(G, 1:2)
##              X1     X2    X3     X4      Y
##    -2.74  0.234   43.6  18.5   35.2   81.5
## X1  0.23 -0.031    0.6  -0.9   -0.7    1.9
## X2 43.64  0.605  211.8   4.5 -220.4  140.3
## X3 18.46 -0.897    4.5  12.1  -17.1    6.0
## X4 35.21 -0.698 -220.4 -17.1  243.0 -149.2
## Y  81.48  1.869  140.3   6.0 -149.2   97.4

4.7 Multivariate Regression Coefficients Using the Sweep

Sweeping the constant term and the first \(q\) elements yields results for the multivariate regression of \(Y_{q+1}, \dots, Y_K\) on \(Y_1, \dots, Y_q\).

After sweeping, the resulting matrix takes the form:

\[ \text{SWEEP}_{[1, \dots, q]}(G) = \begin{pmatrix} -D & E \\[6px] E^T & F \end{pmatrix} \]


Matrix Components

  • D: A q × q matrix containing information related to variances and covariances of the estimated regression coefficients.
  • E: A (K − q) × q matrix of regression coefficients, where:
  • Each row corresponds to a dependent variable (Y_{q+1}, ..., Y_K)
  • Each column corresponds to a predictor variable (Y_1, ..., Y_q)
  • F: A (K − q) × (K − q) matrix representing the residual covariance matrix of the dependent variables after regressing on the predictors.

Interpretation

  • The E matrix provides the multivariate regression coefficients.
  • The F matrix contains the residual covariances after accounting for the effect of the predictors.
  • The elements of D, combined with F, can be used to compute asymptotic standard errors and covariances of the estimates in E.
Phi <- SWEEP(G, 1:5)

This corresponds to computing the regression of the fifth variable (\(Y\), the last column) on \(X1\) to \(X4\), where:

  • \(Y\) is the dependent variable.
  • \(X1\) to \(X4\) are the predictors.
  • The last column (equivalently, the last row) of \(\Phi\) gives the intercept and the regression coefficients.
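
As a quick check, assuming Phi and X from above (with columns named X1 to X4 and Y as in the printouts), the coefficient column can be read off directly and compared with lm():

round(Phi[1:5, 6], 3)                                                 # constant, X1, X2, X3, X4
round(coef(lm(Y ~ X1 + X2 + X3 + X4, data = as.data.frame(X))), 3)    # reference fit
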
#' Calculate Regression Coefficients Using Sweep Operator
#'
#' Computes multiple linear regression coefficients using the sweep operator method.
#' More numerically stable than normal equations for certain problems.
#'
#' @param x A numeric matrix where the last column is the response variable
#'          and other columns are predictors.
#'
#' @return A matrix of regression coefficients with proper dimension names
#' @export
#' @examples
#' data(mtcars)
#' x <- cbind(mtcars$wt, mtcars$hp, mtcars$mpg)
#' beta(x)
beta <- function(x){
    
    # Load required functions 
    if (!exists("SWEEP") || !exists("augmented_parameter_matrix")) {
      source("SWEEP.R")
      source("augmented_parameter_matrix.R")
    }
    
    x <- as.matrix(x)
    k <- ncol(x)
    
    G <- augmented_parameter_matrix(x)
    P <- SWEEP(G, 1:k)
    
    # Validate sweep operation results before extracting coefficients
    if (is.null(P) || any(dim(P) != c(k + 1, k + 1))) {
      stop("Sweep operation failed - check for linear dependencies")
    }
    
    # Extract and format coefficients (last column, rows 1..k)
    B <- matrix(P[1:k, k + 1], dimnames = list(paste0("B", 0:(k - 1)), "coeffs"))
    
    return(B)
  }

# save
dump("beta", file = "beta.R")
# Apply to dataset X (last column = response)
coeff <- beta(X)  # SWEEP-based coefficients (vector)

# Ensure X has tidy names for predictors + response
if (is.null(colnames(X))) {
  colnames(X) <- c(paste0("X", seq_len(ncol(X)-1)), "Y")
}

# If beta() returned an unnamed vector, name it now
if (is.null(names(coeff))) {
  names(coeff) <- c("Intercept", colnames(X)[1:(ncol(X)-1)])
}

# Reference model using lm()
df   <- as.data.frame(X)
form <- as.formula(
  paste(colnames(X)[ncol(X)], "~",
        paste(colnames(X)[1:(ncol(X)-1)], collapse = " + "))
)
b_lm <- coef(lm(form, data = df))

# Map our "Intercept" name to lm's "(Intercept)"
lm_index <- sub("^Intercept$", "(Intercept)", names(coeff))

# Compare side-by-side
comp <- data.frame(
  Term        = names(coeff),
  Beta_SWEEP  = round(as.numeric(coeff), 6),
  Beta_lm     = round(as.numeric(b_lm[lm_index]), 6),
  row.names = NULL
)

comp  # just print; don't 'return()' at top level
##        Term Beta_SWEEP Beta_lm
## 1 Intercept      62.41   62.41
## 2        X1       1.55    1.55
## 3        X2       0.51    0.51
## 4        X3       0.10    0.10
## 5        X4      -0.14   -0.14

In summary, maximum likelihood (ML) estimates for the multivariate linear regression of
Y_{q+1}, ..., Y_K on Y_1, ..., Y_q can be obtained by sweeping the rows and columns
corresponding to the constant term and the predictor variables (Y_1, ..., Y_q)
from the scaled cross-products matrix G.


Interpretation of the Sweep Operation

The operation of sweeping on a variable effectively transforms that variable from an outcome (dependent) variable into a predictor (independent) variable. This reparameterization allows the regression relationships and residual covariances to be computed directly from the augmented matrix without performing separate matrix inversions.
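
The effect is easy to see with the objects from above (a sketch assuming G and SWEEP() are available): sweeping one additional index moves that variable to the predictor side, and the remaining coefficients adjust accordingly.

SWEEP(G, 1:2)[2, 6]   # coefficient of X1 when Y is regressed on X1 alone
SWEEP(G, 1:3)[2, 6]   # coefficient of X1 once X2 is also swept into the predictor set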

4.7.1 The Hat Matrix

The hat matrix (also known as the projection matrix or influence matrix) is a linear operator that maps the vector of observed response values (dependent variable) to the vector of fitted values (predicted responses). It effectively “puts a hat” on \(y\), producing \(\hat{y}\).

\[ \begin{align*} Y &= X\beta + \varepsilon \\[12px] \hat{Y} &= X\hat{\beta} \\[12px] \hat{\beta} &= (X^TX)^{-1}X^TY \\[12px] \hat{Y} &= X(X^TX)^{-1}X^TY \\[12px] &= HY \end{align*} \]

So the hat matrix is defined as:

\[ H = X(X^TX)^{-1}X^T \] where \(X\) is the design matrix of predictors.

Key properties: - \(H\) is symmetric: \(H^\top = H\). - \(H\) is idempotent: \(H^2 = H\). - The diagonal entries of \(H\) (\(h_{ii}\)) measure the leverage of observation \(i\).

The element in the ith row and jth column of H is equal to the covariance between the jth response value and the ith fitted value, divided by the variance of the former:

\[ h_{ij} = \frac{\text{cov}(\hat{Y}_i, Y_j)}{\text{var}(Y_j)} \]

Properties of H and (I-H) matrices:

  • Symmetric: \(H = H^T\) and \((I-H)^T = (I-H)\)
  • Idempotent: \(H^2 = H\) and \((I-H)(I-H) = (I-H)\)
  • X is invariant under H: \(HX = X\) hence \((I-H)X = 0\)
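
These properties can be checked numerically on a small design matrix. An illustrative sketch using the built-in mtcars data (the names Xd and Hm are used only for this example):

Xd <- cbind(1, mtcars$wt, mtcars$hp)         # intercept plus two predictors
Hm <- Xd %*% solve(t(Xd) %*% Xd) %*% t(Xd)   # hat matrix for this design
all.equal(Hm, t(Hm))                         # symmetric
all.equal(Hm %*% Hm, Hm)                     # idempotent
all.equal(Hm %*% Xd, Xd)                     # X is invariant under H
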
#' Compute the Hat Matrix
#' 
#' Calculates the hat matrix (projection matrix) for a given data matrix after 
#' handling missing values. The hat matrix maps observed response values to 
#' predicted values in linear regression.
#'
#' @param Y A numeric matrix or data frame containing the response variables. 
#'           Rows with any missing values will be removed.
#'
#' @return The hat matrix \eqn{H = X(X'X)^{-1}X'}, where X is the design matrix 
#'         (the input matrix Y with an added column of 1s for the intercept).
#'
#' @details
#' The function:
#' \enumerate{
#'   \item Performs list-wise deletion of missing data
#'   \item Creates a design matrix by adding an intercept column
#'   \item Computes the hat matrix using \eqn{H = D(D'D)^{-1}D'}
#' }
#' 
#' Requires the `INV()` function for matrix inversion (assumed to be in "INV.R").
#'
#' @examples
#' \dontrun{
#' data <- matrix(rnorm(100), ncol = 2)
#' H <- HAT(data)
#' }
#'
#' @seealso \code{\link{lm}} for linear model fitting which uses the hat matrix
#' @export
HAT <- function(Y) {
    # Load matrix inversion function
    source("INV.R")
    
    # List-wise missing data deletion
    X <- Y[apply(Y, 1, function(x) !any(is.na(x))), , drop = FALSE]
    
    # Create design matrix (add intercept)
    D <- cbind(1, X)  
    
    # Compute and return hat matrix
    D %*% INV(t(D) %*% D) %*% t(D)
}

# save to the working directory getwd(), ls()
dump("HAT", file = "HAT.R")
round(HAT(X), 1)
##        obs.1 obs.2 obs.3 obs.4 obs.5 obs.6 obs.7 obs.8 obs.9 obs.10 obs.11
## obs.1    0.6   0.2  -0.1   0.2   0.3   0.0  -0.1   0.1  -0.1    0.0    0.0
## obs.2    0.2   0.4   0.1   0.1   0.1   0.1  -0.1   0.1   0.2   -0.2    0.1
## obs.3   -0.1   0.1   0.6   0.3   0.0   0.0   0.1   0.0   0.2    0.1   -0.2
## obs.4    0.2   0.1   0.3   0.4   0.0   0.0  -0.1   0.2   0.0    0.2   -0.1
## obs.5    0.3   0.1   0.0   0.0   0.4   0.1   0.1  -0.1   0.0   -0.1   -0.1
## obs.6    0.0   0.1   0.0   0.0   0.1   0.4   0.0  -0.3   0.2    0.2    0.2
## obs.7   -0.1  -0.1   0.1  -0.1   0.1   0.0   0.4   0.1   0.2   -0.1    0.1
## obs.8    0.1   0.1   0.0   0.2  -0.1  -0.3   0.1   0.6   0.0    0.1    0.3
## obs.9   -0.1   0.2   0.2   0.0   0.0   0.2   0.2   0.0   0.3   -0.1    0.2
## obs.10   0.0  -0.2   0.1   0.2  -0.1   0.2  -0.1   0.1  -0.1    0.7    0.1
## obs.11   0.0   0.1  -0.2  -0.1  -0.1   0.2   0.1   0.3   0.2    0.1    0.5
## obs.12   0.0  -0.1  -0.1  -0.1   0.1   0.2   0.2  -0.1   0.0    0.2    0.1
## obs.13   0.1  -0.2   0.1   0.0   0.2  -0.1   0.3   0.1  -0.1    0.1   -0.1
##        obs.12 obs.13
## obs.1     0.0    0.1
## obs.2    -0.1   -0.2
## obs.3    -0.1    0.1
## obs.4    -0.1    0.0
## obs.5     0.1    0.2
## obs.6     0.2   -0.1
## obs.7     0.2    0.3
## obs.8    -0.1    0.1
## obs.9     0.0   -0.1
## obs.10    0.2    0.1
## obs.11    0.1   -0.1
## obs.12    0.3    0.2
## obs.13    0.2    0.4
# check
X1 <- cbind(X[,1:4])
Y <- as.matrix(X[,5])
(D <- cbind(1, X1)) # the design matrix
##          X1 X2 X3 X4
## obs.1  1  7 26  6 60
## obs.2  1  1 29 15 52
## obs.3  1 11 56  8 20
## obs.4  1 11 31  8 47
## obs.5  1  7 52  6 33
## obs.6  1 11 55  9 22
## obs.7  1  3 71 17  6
## obs.8  1  1 31 22 44
## obs.9  1  2 54 18 22
## obs.10 1 21 47  4 26
## obs.11 1  1 40 23 34
## obs.12 1 11 66  9 12
## obs.13 1 10 68  8 12
# Calculate predicted values and residuals
# First ensure all required objects exist
if (!exists("X") || !exists("Y")) stop("X or Y not found in environment")
if (!exists("beta")) stop("beta() function not defined")
if (!exists("HAT")) stop("HAT() function not defined")

# Method 1: Using design matrix and beta coefficients
Y_hat1 <- D %*% beta(X)

# Method 2: Using hat matrix built from the predictors only (X1);
# passing the full matrix including Y would make H reproduce Y exactly
Y_hat2 <- HAT(X1) %*% Y

# Verify dimensions match
if (!all(dim(Y_hat1) == dim(Y_hat2))) {
  warning("Prediction methods yield different dimensions")
}

# Calculate residuals
residuals <- Y - Y_hat1

# Create comparison table with proper formatting
results <- data.frame(
  Observed = Y,
  Predicted_Method1 = Y_hat1,
  Predicted_Method2 = Y_hat2,
  Residuals = residuals
)

# Format table output
knitr::kable(
  results,
  digits = 3,
  align = "r",
  caption = "Comparison of Prediction Methods",
  col.names = c("$Y$", "$\\hat{Y}_1$ (Dβ)", "$\\hat{Y}_2$ (HY)", "$\\varepsilon$"),
  format.args = list(scientific = FALSE)
) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )
Comparison of Prediction Methods
\(Y\) \(\hat{Y}_1\) (Dβ) \(\hat{Y}_2\) (HY) \(\varepsilon\)
obs.1 78 78 78 0.005
obs.2 74 73 74 1.511
obs.3 104 106 104 -1.671
obs.4 88 89 88 -1.727
obs.5 96 96 96 0.251
obs.6 109 105 109 3.925
obs.7 103 104 103 -1.449
obs.8 72 76 72 -3.175
obs.9 93 92 93 1.378
obs.10 116 116 116 0.282
obs.11 84 82 84 1.991
obs.12 113 112 113 0.973
obs.13 109 112 109 -2.294

4.8 Leverage Points Analysis

The diagonal elements of the hat matrix (\(h_{ii}\)) are called leverages, which indicate the influence of observations on their own predicted values.

Key Properties

  • Influence Measurement: Each element \(h_{ij}\) represents the influence of the \(j^{th}\) observation on the \(i^{th}\) predicted value \(\hat{Y}_i\)
  • Trace Property: The sum of the diagonal elements equals the number of parameters being estimated (\(p\)): \[ \sum_{i=1}^n h_{ii} = tr(H) = p \]
  • Average Leverage: The average leverage value is \(\frac{p}{n}\) (where \(n\) is the number of observations): \[ \bar{h} = \frac{p}{n} \]

Leverage Point Identification

  • Threshold Rule: An observation is considered influential if: \[ h_{ii} > 2 \times \bar{h} = \frac{2p}{n} \]
  • Example Detection:
    In this example:
    • Observation 10 shows high leverage (\(h_{10,10} = 0.7\))
    • If \(2 \times \bar{h} \approx 0.5\) (typical for many models), this would be a leverage point

Visualization Tip:
A leverage-residual plot (plotting \(h_{ii}\) against residuals) helps identify both high-leverage points and outliers.

# First fit your linear model (replace with your actual model)
# Example using mtcars dataset:
model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Calculate diagnostic measures
h <- hatvalues(model)
res <- rstudent(model)  # Using studentized residuals
cook <- cooks.distance(model)
n <- length(res)
p <- length(coef(model))

# Create base plot
plot(h, res, 
     xlab = "Leverage (hᵢᵢ)", ylab = "Studentized Residuals",
     main = "Leverage-Residual Diagnostic Plot",
     pch = 21, bg = ifelse(cook > 4/n, "red", "gray80"),
     cex = 1.5*sqrt(cook/max(cook)))

# Add reference lines
abline(h = c(-2, 0, 2), v = c(1, 2, 3)*p/n, lty = c(2,1,2), col = "gray50")

# Highlight influential points
influential <- which(cook > 4/n)
if(length(influential) > 0) {
  text(h[influential], res[influential], 
       labels = influential, pos = 4, col = "red", cex = 0.8)
}

# Add legend
legend("topright", 
       legend = c(paste0("Influential (Cook's D > ", round(4/n, 3), ")"), "Normal"),
       pch = 21, pt.bg = c("red", "gray80"), pt.cex = 1.2,
       title = "Observation Influence")
Figure: Leverage-Residual Diagnostic Plot

# diagonals
H <- HAT(X)
matrix(diag(H))
##       [,1]
##  [1,] 0.55
##  [2,] 0.38
##  [3,] 0.64
##  [4,] 0.36
##  [5,] 0.36
##  [6,] 0.45
##  [7,] 0.41
##  [8,] 0.62
##  [9,] 0.33
## [10,] 0.70
## [11,] 0.51
## [12,] 0.28
## [13,] 0.41
# There are no leverage points
any(diag(H) > 2*mean(diag(H))) # FALSE
## [1] FALSE

Number of parameters to be estimated: the trace (sum of the diagonal elements) of the hat matrix equals the number of columns in its design matrix, i.e., the number of parameters \(p\) estimated by the projection:

# There are 6 parameters to be estimated
sum(diag(H))
## [1] 6

Here \(H\) was built from the intercept plus all five columns of X (including the response), so \(p = 5 + 1 = 6\); for the regression design with only the four predictors plus the intercept, the trace would be 5.


4.9 Residuals

Residuals and errors are fundamental concepts in regression analysis that measure deviations between observed and predicted values:

Definitions

  • Statistical Errors (\(e_i\)): The difference between observed values and their true (but unobservable) expected values: \[ e_i = Y_i - \mu_Y \] where \(\mu_Y\) represents the true population mean.

  • Residuals (\(\varepsilon_i\) or \(r_i\)): The difference between observed values and their estimated values from the model: \[ \varepsilon_i = Y_i - \hat{Y}_i \] where \(\hat{Y} = HY\) (the hat matrix projection).

Matrix Formulation

The vector of residuals can be expressed as: \[ \varepsilon = Y - \hat{Y} = Y - HY = (I - H)Y \]


Key Properties

  1. Mean: \[ \sum \varepsilon_i = 0 \quad \text{(for models with intercept)} \]

  2. Variance: \[ \text{Var}(\varepsilon) = \sigma^2(I - H) \]

  3. Orthogonality: \[ X^T\varepsilon = 0 \quad \text{(residuals are orthogonal to predictors)} \]


Types of Residuals

  • Raw residuals: \(\varepsilon_i = Y_i - \hat{Y}_i\) (basic diagnostics)
  • Standardized residuals: \(\frac{\varepsilon_i}{\sigma}\) (comparing across observations)
  • Studentized residuals: \(\frac{\varepsilon_i}{\sigma\sqrt{1-h_{ii}}}\) (outlier detection)
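
For reference, all three types can be obtained from a standard lm() fit. An illustration using the built-in mtcars data; rstandard() returns the studentized form listed above (the internally studentized version):

fit  <- lm(mpg ~ wt + hp, data = mtcars)
raw  <- resid(fit)                   # raw residuals
std  <- raw / sigma(fit)             # scaled by the residual standard deviation
stud <- rstandard(fit)               # divides by sigma * sqrt(1 - h_ii)
head(round(cbind(raw, std, stud), 3))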

#' Calculate Regression Residuals
#' 
#' Computes the residuals from a linear regression model using matrix operations.
#' The residuals represent the difference between observed and predicted values.
#'
#' @param X A numeric matrix or data frame containing predictor variables.
#'          If Y is NULL, the last column is treated as the response variable.
#' @param Y Optional numeric vector/matrix of response values. If NULL, 
#'          the last column of X is used as the response.
#'
#' @return A numeric vector of residuals calculated as ε = (I - H)Y,
#'         where H is the hat matrix and I is the identity matrix.
#'
#' @details
#' The function:
#' \enumerate{
#'   \item Automatically handles input matrix conversion
#'   \item Separates predictors and response if Y is not provided
#'   \item Computes residuals using ε = Y - Ŷ = (I - H)Y
#' }
#' 
#' Requires the `HAT()` function to compute the hat matrix.
#'
#' @examples
#' \dontrun{
#' # Using separate X and Y
#' X <- matrix(rnorm(100), ncol = 4)
#' Y <- rnorm(25)
#' res <- epsilon(X, Y)
#' 
#' # Using combined matrix (last column as Y)
#' data <- cbind(X, Y)
#' res <- epsilon(data)
#' }
#'
#' @seealso \code{\link{residuals}} for the standard R residual function
#' @export
epsilon <- function(X, Y = NULL) {
    # Load the hat matrix helper if it is not already defined
    if (!exists("HAT")) source("HAT.R")
    
    # Input validation and conversion
    X <- as.matrix(X)
    
    # Handle response variable
    if (is.null(Y)) {
        p <- ncol(X)
        Y <- as.matrix(X[, p])
        X <- as.matrix(X[, 1:(p - 1)])
    } else {
        Y <- as.matrix(Y)
    }
    
    # Check dimension compatibility
    if (nrow(X) != nrow(Y)) {
        stop("X and Y must have the same number of observations")
    }
    
    # Compute residuals
    n <- nrow(X)
    I <- diag(n) # identity matrix
    H <- HAT(X)  # hat matrix
    residuals <- (I - H) %*% Y
    
    return(as.vector(residuals))
}

# Save function to file
dump("epsilon", file = "epsilon.R")

Residual Comparison

Comparing residuals calculated by our epsilon() function versus R’s built-in residuals() function:

# Calculate residuals using our custom function
e1 <- round(epsilon(X), 4)

# Calculate residuals using glm()
k <- ncol(X)
e2 <- as.matrix(round(residuals(glm(X[, k] ~ X[, 1:(k - 1)])), 4))

# Create comparison dataframe
residual_comparison <- data.frame(
  epsilon = e1, 
  glm = e2
)

# Display results
knitr::kable(residual_comparison, 
             align = 'c',
             caption = "Comparison of Residual Calculation Methods")
Comparison of Residual Calculation Methods
epsilon glm
obs.1 0.00 0.00
obs.2 1.51 1.51
obs.3 -1.67 -1.67
obs.4 -1.73 -1.73
obs.5 0.25 0.25
obs.6 3.93 3.93
obs.7 -1.45 -1.45
obs.8 -3.17 -3.17
obs.9 1.38 1.38
obs.10 0.28 0.28
obs.11 1.99 1.99
obs.12 0.97 0.97
obs.13 -2.29 -2.29

4.10 The Covariance Matrix of the Residuals

The variance-covariance matrix of the residuals is derived as follows:

Derivation

\[ \begin{aligned} \text{cov}(\varepsilon) &= \text{cov}(Y - \hat{Y}) \\[12px] &= \text{cov}(Y - HY) \\[12px] &= \text{cov}[(I-H)Y] \\[12px] &= (I-H)\text{cov}(Y)(I-H)^T \quad \text{(since cov(AX) = A cov(X) A}^T\text{)} \\[12px] &= (I-H)\sigma^2_Y(I-H) \quad \text{(since } I-H \text{ is symmetric)} \\[12px] &= \sigma^2_Y(I-H)(I-H) \\[12px] &= \sigma^2_Y(I-H) \quad \text{(since } I-H \text{ is idempotent)} \end{aligned} \]


Key Properties

  • Relationship Between Y and ε Covariance: \[ \text{var}(Y_i) = \text{var}(X_i\beta + \varepsilon_i) = \text{var}(X_i\beta) + \text{var}(\varepsilon_i) + 2\text{cov}(X_i\beta, \varepsilon_i) \] where:

  • \(\text{var}(X_i\beta) = 0\) (the design \(X_i\) is fixed and \(\beta\) is a constant)

  • \(\text{cov}(X_i\beta, \varepsilon_i) = 0\) (orthogonality condition)

Therefore: \[ \text{var}(Y) = \text{var}(\varepsilon) \]

  • Diagonal Elements: The variance of individual residuals: \[ \text{var}(\varepsilon_i) = \sigma^2(1 - h_{ii}) \] where \(h_{ii}\) are the diagonal elements of the hat matrix.
#' Compute Residual Covariance Matrix
#' 
#' Calculates the variance-covariance matrix of residuals for a linear regression model,
#' following the theoretical form: Σ = σ²(I - H)
#'
#' @param data A matrix or data frame where the last column contains the response variable
#'            and preceding columns contain predictors
#'
#' @return A n×n matrix where n is the number of observations, with:
#'         - Diagonal elements: var(ε_i) = σ²(1 - h_ii)
#'         - Off-diagonal elements: cov(ε_i,ε_j) = -σ²h_ij
#'
#' @details
#' The implementation:
#' \enumerate{
#'   \item Separates predictors (X) from response (y)
#'   \item Computes hat matrix H = X(X'X)⁻¹X'
#'   \item Estimates residual variance σ²
#'   \item Returns σ²(I - H)
#' }
#'
#' @examples
#' \dontrun{
#' data <- cbind(matrix(rnorm(100), ncol=4), rnorm(25))
#' cov_mat <- cov.matrix(data)
#' round(cov_mat, 3)
#' }
#' 
#' @seealso \code{\link{vcov}} for the parameter covariance matrix
#' @export
cov.matrix <- function(data) {
    # Input validation
    if(!is.matrix(data) && !is.data.frame(data)) {
        stop("Input must be a matrix or data frame")
    }
    if(ncol(data) < 2) {
        stop("Data must contain at least one predictor and one response")
    }
    
    # Separate predictors and response
    X <- as.matrix(data[, -ncol(data), drop=FALSE])
    y <- as.matrix(data[, ncol(data), drop=FALSE])
    
    # Check for NA values
    if(any(is.na(cbind(X, y)))) {
        warning("Data contains NA values - results may be unreliable")
    }
    
    # Compute components
    n <- nrow(X)
    I <- diag(n)
    H <- HAT(X)
    residuals <- epsilon(X, y)
    sigma2 <- sum(residuals^2)/(n - ncol(X) - 1)  # MSE estimator
    
    # Return covariance matrix
    sigma2 * (I - H)
}
# Example usage with X dataset
if(exists("X")) {
    round(cov.matrix(X), 1)
}
##        obs.1 obs.2 obs.3 obs.4 obs.5 obs.6 obs.7 obs.8 obs.9 obs.10 obs.11
## obs.1    2.7  -1.4   0.9  -1.0  -1.9   0.0   0.8  -0.4   0.8    0.0    0.1
## obs.2   -1.4   4.0  -0.8  -1.1  -0.7  -0.1   0.1  -0.9  -1.0    1.0   -0.5
## obs.3    0.9  -0.8   2.5  -1.3   0.0  -1.1  -0.3   0.8  -1.6   -0.4    0.9
## obs.4   -1.0  -1.1  -1.3   4.2  -0.1  -0.6   0.9  -0.4  -0.2   -1.3    0.0
## obs.5   -1.9  -0.7   0.0  -0.1   3.8  -0.3  -0.6   0.6   0.0    0.7    0.7
## obs.6    0.0  -0.1  -1.1  -0.6  -0.3   5.2  -0.5   0.0  -0.5   -1.0    0.0
## obs.7    0.8   0.1  -0.3   0.9  -0.6  -0.5   3.8  -0.3  -1.3    0.8   -0.8
## obs.8   -0.4  -0.9   0.8  -0.4   0.6   0.0  -0.3   3.5  -0.6   -0.4   -2.4
## obs.9    0.8  -1.0  -1.6  -0.2   0.0  -0.5  -1.3  -0.6   4.2    0.8   -0.7
## obs.10   0.0   1.0  -0.4  -1.3   0.7  -1.0   0.8  -0.4   0.8    1.8   -0.4
## obs.11   0.1  -0.5   0.9   0.0   0.7   0.0  -0.8  -2.4  -0.7   -0.4    3.4
## obs.12  -0.1   0.7   0.2   0.4  -0.8  -0.7  -1.2   0.1   0.0   -1.0   -0.3
## obs.13  -0.4   0.5   0.2   0.5  -1.3  -0.6  -1.3   0.4   0.0   -0.5    0.0
##        obs.12 obs.13
## obs.1    -0.1   -0.4
## obs.2     0.7    0.5
## obs.3     0.2    0.2
## obs.4     0.4    0.5
## obs.5    -0.8   -1.3
## obs.6    -0.7   -0.6
## obs.7    -1.2   -1.3
## obs.8     0.1    0.4
## obs.9     0.0    0.0
## obs.10   -1.0   -0.5
## obs.11   -0.3    0.0
## obs.12    4.4   -1.6
## obs.13   -1.6    4.2

Properties of the Residual Covariance Matrix

  1. No Autocorrelation (Zero Off-Diagonal Elements)

Under the classical OLS assumptions the error terms are uncorrelated: \[ \text{cov}(\varepsilon_i, \varepsilon_j) = 0 \quad \text{for all } i \neq j \]

The fitted residuals have off-diagonal covariances \(-\sigma^2 h_{ij}\), which are typically small, so the residual covariance matrix is approximately diagonal:

\[ \text{cov}(\varepsilon) = \sigma^2 \times \begin{pmatrix} 1-h_{11} & 0 & \cdots & 0 \\[6px] 0 & 1-h_{22} & \cdots & 0 \\[6px] \vdots & \vdots & \ddots & \vdots \\[6px] 0 & 0 & \cdots & 1-h_{nn} \end{pmatrix} \]

  2. Homoskedasticity (Constant Diagonal Elements)

Under homoskedasticity:

\[ \text{var}(\varepsilon_i) = \sigma^2(1-h_{ii}) \approx \sigma^2 \quad \text{(since } h_{ii} \text{ typically small)} \]

Resulting in approximately equal diagonal elements: \[ \text{cov}(\varepsilon) \approx \sigma^2I_n \]

4.10.1 Practical Implications

In practice, we often want to verify that the residual covariance matrix is close to diagonal (no autocorrelation) and that the diagonal entries are roughly constant (homoskedasticity).

# Compute residual covariance matrix (using the hat matrix or helper function)
cov_resid <- cov.matrix(X)

# 1. Check diagonality: off-diagonal elements should be near zero
off_diag <- cov_resid[lower.tri(cov_resid)] 
cat("Max off-diagonal:", max(abs(off_diag)), "\n")
## Max off-diagonal: 2.4
# 2. Check homoskedasticity: diagonal elements should be similar
diag_elements <- diag(cov_resid)
cat("Diagonal range:", range(diag_elements), "\n")
## Diagonal range: 1.8 5.2
cat("Ratio max/min:", max(diag_elements) / min(diag_elements), "\n")
## Ratio max/min: 2.9

Interpretation of Diagnostics

  1. Max off-diagonal: 2.4
    • Ideally, off-diagonal elements of the residual covariance matrix should be very close to zero.
    • A value of 2.4 is substantially different from zero, suggesting notable autocorrelation (dependence across residuals).
  2. Diagonal range: 1.8 to 5.2
    • Under homoskedasticity, diagonal elements should be approximately equal.
    • Here, the variance of residuals ranges from 1.8 to 5.2, showing unequal spread across observations.
    • This indicates heteroskedasticity, where some observations have much larger residual variance than others.
  3. Ratio max/min: 2.9
    • A ratio close to 1 would mean nearly equal residual variances.
    • A ratio of 2.9 means the largest variance is almost 3 times the smallest, reinforcing evidence of heteroskedasticity.

Conclusion:

The diagnostics indicate that the residual covariance matrix is not diagonal and exhibits heteroskedasticity. This violates the classical OLS assumptions, implying:

  • OLS standard errors may be biased.
  • Inference (t-tests, F-tests) may be invalid without correction.

5 Analysis of Variance (ANOVA)

ANOVA is a statistical method used to compare means across two or more groups. It is considered a special case of linear regression and is based on the law of total variance, which partitions the observed variance in a variable into components attributable to different sources of variation.

At its core, ANOVA tests the null hypothesis that all group means are equal. It generalizes the t-test to more than two groups and allows researchers to determine whether any observed differences between sample means are statistically significant.


5.1 Sources of Variation

ANOVA partitions total variance into two main components:

  • Between-group variation (variation due to the model)
  • Within-group variation (residual or error variation)

5.1.1 Between-Group Variation (Model Sum of Squares)

The between-group variation, also called the model sum of squares \(SS_{\text{Between}}\), reflects differences among the group means. It measures how much the group means deviate from the grand mean (the mean of all observations combined).

This variation is large when group means are far from the grand mean, and small when group means are similar.

\[ SS_{\text{Between}} = \sum_{k=1}^{g} n_k(\bar{x}_k - \bar{x}_{\text{grand}})^2 \]

where:

  • \(g\) is the number of groups
  • \(n_k\) is the number of observations in group \(k\)
  • \(\bar{x}_k\) is the mean of group \(k\)
  • \(\bar{x}_{\text{grand}}\) is the grand mean (overall mean across all groups)
    This component captures the variation explained by the model, i.e., how much variation is due to group membership.
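
As a quick illustration of this formula in the grouped-data setting, the snippet below computes \(SS_{\text{Between}}\) for the built-in iris data (Sepal.Length grouped by Species); the SSB() function that follows computes the analogous quantity for the regression setting via the hat matrix.

# Group-means form of SS_Between, illustrated with iris
x_obs  <- iris$Sepal.Length
grp    <- iris$Species
n_k    <- tapply(x_obs, grp, length)
xbar_k <- tapply(x_obs, grp, mean)
SS_between <- sum(n_k * (xbar_k - mean(x_obs))^2)
SS_between   # equals the Species Sum Sq in anova(lm(Sepal.Length ~ Species, data = iris))
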
# Sum of Squares Between Groups (SSB) Function Documentation
#' Calculate Between-Group Sum of Squares
#'
#' Computes the sum of squares between groups (SSB) for a linear model,
#' which measures variation between group means.
#'
#' @param X A numeric matrix or data frame containing predictor variables.
#'          If Y is NULL, the last column is treated as the response variable.
#' @param Y Optional response variable vector/matrix. If NULL, 
#'          the last column of X is used as the response.
#'
#' @return The between-group sum of squares (SSB) as a scalar value.
#'
#' @details
#' The calculation uses the hat matrix to project deviation scores:
#' \[ SSB = (HY)^T(HY) \]
#' where:
#' \itemize{
#'   \item \eqn{H} is the hat matrix \eqn{X(X^TX)^{-1}X^T}
#'   \item \eqn{Y} contains deviation scores (centered around the mean)
#' }
#'
#' @examples
#' \dontrun{
#' # Using built-in dataset
#' data(mtcars)
#' SSB(cbind(wt, hp), mtcars$mpg)  # Explicit X and Y
#' 
#' # Using matrix with response as last column
#' mat <- cbind(mtcars$wt, mtcars$hp, mtcars$mpg)
#' SSB(mat)  # Automatically detects last column as Y
#' }
#'
#' @seealso \code{\link{SST}} for total sum of squares, 
#'          \code{\link{SSW}} for within-group variation
#' @export
SSB <- function(X, Y = NULL){
    # Load required functions
    source("deviation_scores.R")  # For centering the response
    source("HAT.R")               # For hat matrix calculation
  
    # Input validation and processing
    X <- as.matrix(X)
    p <- ncol(X)
    
    if (is.null(Y)) {
        Y <- as.matrix(X[, p])
        X <- as.matrix(X[, 1:(p - 1)])
    } else {
        Y <- as.matrix(Y)
    }
    
    # Verify dimensions
    if (nrow(X) != nrow(Y)) {
        stop("X and Y must have the same number of observations")
    }
    
    # Calculate SSB
    Y_c <- deviation_scores(Y)  # Center Y around its mean
    Y_p <- HAT(X) %*% Y_c # projected Y
    t(Y_p) %*% Y_p    # Sum of squared projections
}

# Save function to file
dump("SSB", file = "SSB.R")
# Calculate SSB
(SSb <- SSB(X))
##      [,1]
## [1,] 2668

Interpretation:

  • Large SSB: strong between-group differences (high group separation quality)
  • Small SSB: minimal between-group variability (potentially weak treatment effects)

Diagnostic Guidance:

  1. For Large SSB: - Indicates significant group differences - Treatments/groupings are effective

  2. For Small SSB: - Suggests possible issues:

    • Weak or non-existent treatment effects
    • Poor group categorization
    • Measurement errors in grouping variable

5.1.2 Within-Group Variation (SSW)

The within-group sum of squares (SSW), also known as the residual sum of squares or error sum of squares, quantifies the remaining variation in the data after accounting for the model’s explanatory variables.

Interpretation: - Measures unexplained variability in the model - Lower values indicate better model fit - Forms the denominator in ANOVA’s F-test

Mathematical Definition

\[ SSW = \sum_{k=1}^{K} \sum_{i=1}^{n_k} (x_{ik} - \bar{x}_k)^2 = \sum_{k=1}^{K} s_k^2(n_k - 1) \]

where:

  • \(K\) = number of groups
  • \(n_k\) = number of observations in group \(k\)
  • \(\bar{x}_k\) = mean of group \(k\)
  • \(s_k^2\) = variance of group \(k\)
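
Continuing the iris illustration from the previous subsection, the within-group sum of squares follows directly from the group deviations:

# Within-group sum of squares for iris (Sepal.Length by Species)
SS_within <- sum(tapply(iris$Sepal.Length, iris$Species,
                        function(v) sum((v - mean(v))^2)))
SS_within
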
#' Calculate Within-Group Sum of Squares (SSW)
#' 
#' Computes the residual sum of squares, representing variation within groups.
#' This is the unexplained variation after accounting for group differences.
#'
#' @param X A numeric matrix or data frame containing predictor variables.
#'          If Y is NULL, the last column is treated as the response variable.
#' @param Y Optional response variable vector/matrix. If NULL, 
#'          the last column of X is used as the response.
#'
#' @return The within-group sum of squares (SSW) as a scalar value.
#'
#' @details
#' The calculation uses residuals from the linear model:
#' \[ SSW = \varepsilon^T\varepsilon = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]
#' where \(\varepsilon\) are the residuals from the model \(Y = X\beta + \varepsilon\).
#'
#' @examples
#' \dontrun{
#' # Using built-in dataset
#' data(mtcars)
#' SSW(cbind(wt, hp), mtcars$mpg)
#' 
#' # Using matrix with response as last column
#' mat <- cbind(mtcars$wt, mtcars$hp, mtcars$mpg)
#' SSW(mat)
#' }
#'
#' @seealso \code{\link{SSB}} for between-group variation, 
#'          \code{\link{SST}} for total variation
#' @export
SSW <- function(X, Y = NULL){
    # Load required functions
    source("epsilon.R")  # For residual calculation
    
    # Input validation and processing
    X <- as.matrix(X)
    p <- ncol(X)
    
    if (is.null(Y)) {
        Y <- as.matrix(X[, p])
        X <- as.matrix(X[, 1:(p - 1)])
    } else {
        Y <- as.matrix(Y)
    }
    
    # Verify dimensions
    if (nrow(X) != nrow(Y)) {
        stop("X and Y must have the same number of observations")
    }
    
    # Calculate residuals and SSW
    e <- epsilon(X, Y)
    return(t(e) %*% e)
}

# Save function to file
dump("SSW", file = "SSW.R")
# Calculate SSW for X dataset
(SSw <- SSW(X))
##      [,1]
## [1,]   48

Interpretation:

  • Small SSW: groups are internally homogeneous (high model fit quality)
  • Large SSW: substantial within-group variability (potential model misspecification)

Diagnostic Guidance:

  1. For Small SSW: - Indicates good model fit - Groups are well-separated

  2. For Large SSW: - Suggests possible issues:

    • Missing important predictors
    • Non-linear relationships
    • Heteroscedasticity

SSW/SST ratio guide:

  • Below 20% (small SSW): excellent group separation; verify there is no overfitting
  • 20-50% (moderate SSW): acceptable but could be improved; check for additional predictors
  • Above 50% (large SSW): poor group differentiation; re-examine the model specification

5.1.3 Total Variation (SST)

The Total Sum of Squares (SST) measures the total variation in the data by summing the squared differences between each observation and the grand mean.

  • Composition: \[ SST = SSB + SSW \]
    • SSB: Variation between group means
    • SSW: Variation within groups
  • Interpretation:
    • Large SSB relative to SSW → Significant group differences
    • Small SSB relative to SSW → Minimal group differences

Mathematical Definition

\[ SST = \sum_{i=1}^n (x_i - \bar{x}_{grand})^2 \]

where:

  • \(n\) = total number of observations
  • \(x_i\) = individual observation
  • \(\bar{x}_{grand}\) = grand mean of all data
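
Continuing the same iris illustration, SST can be computed from the grand mean and checked against the decomposition \(SST = SSB + SSW\) (this snippet reuses SS_between and SS_within from the two snippets above):

# Total sum of squares and the decomposition check (iris illustration)
SS_total <- sum((iris$Sepal.Length - mean(iris$Sepal.Length))^2)
SS_total
all.equal(SS_total, SS_between + SS_within)   # TRUE
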
#' Calculate Total Sum of Squares (SST)
#' 
#' Computes the total variation in the data as the sum of squared deviations
#' from the grand mean.
#'
#' @param Y A numeric vector or matrix of response values
#' @return A scalar value representing the total sum of squares
#'
#' @details
#' The calculation uses matrix algebra:
#' \[ SST = Y^T(I - \frac{1}{n}J)Y \]
#' where:
#' \itemize{
#'   \item \( I \) is the identity matrix
#'   \item \( J \) is a matrix of ones
#'   \item \( n \) is the number of observations
#' }
#'
#' @examples
#' \dontrun{
#' # Using built-in dataset
#' SST(mtcars$mpg)
#' 
#' # Using matrix input
#' mat <- matrix(rnorm(100), ncol=4)
#' SST(mat)
#' }
#'
#' @seealso \code{\link{SSB}} for between-group variation, 
#'          \code{\link{SSW}} for within-group variation
#' @export
SST <- function(Y){
    # Input validation
    if(!is.numeric(Y)) stop("Input must be numeric")
    Y <- as.matrix(Y)
    n <- nrow(Y)
    
    # Create matrices
    J <- matrix(1, n, n)  # Matrix of ones
    I <- diag(n)          # Identity matrix
    H <- (1/n) * J        # Averaging matrix; (I - H) is the centering matrix
    
    # Calculate SST
    t(Y) %*% (I - H) %*% Y
}

# Save function to file
dump("SST", file = "SST.R")
(SSt <- SST(X[,5]))
##      [,1]
## [1,] 2716
# check
SSb + SSw
##      [,1]
## [1,] 2716

Interpretation of Results

  • Large SST: high overall variability in the data (diverse populations, wide measurement range)
  • Small SST: low overall variability (homogeneous populations, tight measurement clusters)

5.2 Degrees of Freedom in ANOVA

Degrees of freedom (df) represent the number of independent pieces of information available for estimating population parameters. In ANOVA, they partition the total variation into between-group and within-group components.

\[ df_{Total} = df_{Between} + df_{Within} \]

p <- ncol(X)  # Number of variables  
n <- nrow(X)  # Number of cases  

dfB <- p - 1  # Between-group degrees of freedom  
dfB 
## [1] 4
dfW <- n - p  # Within-group degrees of freedom  
dfW  
## [1] 8
dfT <- n - 1  # Total degrees of freedom  
dfT 
## [1] 12
  • Total: \(df_T = n - 1\) (total variability estimation); example value 12
  • Between: \(df_B = p - 1\) (group mean differences estimation); example value 4
  • Within: \(df_W = n - p\) (error variance estimation); example value 8

5.3 Mean Squares

Mean Square Between Groups \((MS_{Between})\)

This represents the variance attributed to the differences between the sample groups. It is calculated by dividing the between-group variation \((SS_{B})\) by its corresponding degrees of freedom.

MSB <- SSb/dfB
MSB # Mean square between
##      [,1]
## [1,]  667

Mean Square Within Groups \((MS_{Within})\)

This quantifies the variance due to differences within individual groups (i.e., random error). It is computed as the within-group variation \((SS_{W})\) divided by its degrees of freedom, effectively a weighted average of group variances.

MSW <- SSw / dfW  
MSW  # Mean square within 
##      [,1]
## [1,]    6

The Mean Squared Error (MSE) is equivalent to MSWithin, as both estimate the within-group variability. Additionally, the Root Mean Squared Error (RMSE) is derived as the square root of MSE:

MSE <- (t(e1) %*% e1) / dfW  # Mean squared error (same as MSW)  

RMSE <- sqrt(MSE)  

data.frame(
  Metric = c("Mean Squared Error (MSE)", "Root Mean Squared Error (RMSE)"),
  Value = c(MSE, RMSE))
##                           Metric Value
## 1       Mean Squared Error (MSE)   6.0
## 2 Root Mean Squared Error (RMSE)   2.4

5.4 The F-Statistic

The F-statistic is a ratio of two variances (mean squares). Variance measures data dispersion - how far points are scattered from the mean, with larger values indicating greater dispersion.

The F-statistic is calculated as:

\[ F = \frac{MS_{Between}}{MS_{Within}} \]


Hypothesis Test

  • Null Hypothesis: \(H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0\) (All regression coefficients are zero)
  • Alternative Hypothesis: \(H_a: \text{At least one } \beta_j \neq 0 \text{ for } j=1,\ldots,p-1\) (At least one coefficient is non-zero)

Decision Rule

Reject \(H_0\) if:

  • F > F-critical (the critical value from F-distribution), or
  • p-value \(< \alpha\) (typically \(\alpha = 0.05\))
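
For the degrees of freedom obtained in Section 5.2 (dfB = 4, dfW = 8), the critical value at \(\alpha = 0.05\) can be obtained with qf():

qf(0.95, df1 = 4, df2 = 8)   # F-critical at alpha = 0.05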

The F-statistic compares between-group variance to within-group variance:

F_value <- MSB / MSW
F_value  
##      [,1]
## [1,]  111

The p-value is calculated from the F-distribution with (dfB, dfW) degrees of freedom:

p_value <- pf(F_value, df1 = dfB, df2 = dfW, lower.tail = FALSE)
p_value 
##         [,1]
## [1,] 4.8e-07

Interpretation:

  • \(F = 111\) indicates strong evidence against the null hypothesis
  • \(p = 4.8 \times 10^{-7}\) (highly significant at \(\alpha = 0.05\))

5.4.1 F-Distribution

The F-distribution (also known as Snedecor’s F distribution or the Fisher-Snedecor distribution) is a continuous probability distribution that arises frequently as the null distribution of test statistics, particularly in analysis of variance (ANOVA). It is named after Ronald Fisher and George W. Snedecor.

Parameters - Degrees of freedom: df₁ > 0, df₂ > 0

The PDF of the F-distribution is given by:

\[ f(x; \nu_1, \nu_2) = \frac{ \sqrt{\displaystyle\frac{(\nu_1x)^{\nu_1}\nu_2^{\nu_2}}{(\nu_1x + \nu_2)^{\nu_1+\nu_2}}} }{ xB\left(\displaystyle\frac{\nu_1}{2}, \displaystyle\frac{\nu_2}{2}\right) } \]

where:

  • \(x > 0\)
  • \(B\) is the beta function

Key Properties:

  • Expected Value: \[ E(X) = \frac{df_2}{df_2 - 2} \quad \text{for } df_2 > 2 \]
  • Variance: \[ Var(X) = \frac{2df_2^2(df_1 + df_2 - 2)}{df_1(df_2 - 2)^2(df_2 - 4)} \quad \text{for } df_2 > 4 \]
  • Skewness: \[ \text{Skewness} = \frac{(2df_1 + df_2 - 2)\sqrt{8(df_2 - 4)}}{(df_2 - 6)\sqrt{df_1(df_1 + df_2 - 2)}} \quad \text{for } df_2 > 6 \]
  • Excess Kurtosis: (Computed as kurtosis minus 3)
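
A quick simulation check of the expected-value formula (illustrative; the simulated mean varies slightly from run to run):

set.seed(42)
mean(rf(1e5, df1 = 5, df2 = 10))   # simulated mean (approximate)
10 / (10 - 2)                      # theoretical E(X) = 1.25
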
#' F-Distribution Probability Density Function
#'
#' Calculates the probability density function (PDF) of the F-distribution
#' for given quantiles and degrees of freedom.
#'
#' @param x Numeric vector of quantiles. If NULL (default), generates a sequence 
#'           from 0 to 10.
#' @param v1 Numerator degrees of freedom (must be positive).
#' @param v2 Denominator degrees of freedom (must be positive).
#'
#' @return A numeric vector of PDF values if x is provided, or a matrix with 
#'         x values and corresponding PDF values if x is NULL.
#'
#' @examples
#' # Single value
#' f_pdf(2, v1 = 5, v2 = 2)
#' 
#' # Vector of values
#' f_pdf(c(1, 2, 3), v1 = 5, v2 = 2)
#' 
#' # Default sequence
#' plot(f_pdf(v1 = 5, v2 = 2), type = "l")
#' 
#' @export
f_pdf <-
  function(x=NULL, v1, v2){
    # probability density function
    f <- function(x, v1, v2){
      sqrt((v1*x)^v1*v2^v2/(v1*x + v2)^(v1+v2))/(x*base::beta(v1/2, v2/2))
    }
    
    if(is.null(x)){
      x <- seq(from=0.001, to=10.1, by=0.001)
      y <- cbind(x, mapply(f, x, v1, v2))
    } else {
      y <- mapply(f, x, v1, v2)
    }
    
    return(y)
  }

# Save function
dump("f_pdf", file = "f_pdf.R")
# --------------------------------------------
# F-Distribution PDF Function Test Results
# --------------------------------------------

# 1. Single Value Test
#    Calculate PDF at x=2 with df1=5, df2=2
cat("Single value test:\n")
## Single value test:
f_pdf(2, v1 = 5, v2 = 2)
## [1] 0.13
# Output: [1] 0.13
# Interpretation: density at x = 2 is approximately 0.13

# 2. Vector Input Test
#    Calculate PDF for x=[1,2,3] with df1=5, df2=2
cat("\nVector input test:\n")
## 
## Vector input test:
f_pdf(c(1, 2, 3), v1 = 5, v2 = 2)
## [1] 0.308 0.132 0.072
# Output: [1] 0.308 0.132 0.072
# Interpretation: density values at x = 1, 2, 3

# 3. Edge Case Tests
cat("\nEdge case tests:\n")
## 
## Edge case tests:
# a) Empty vector
empty_result <- f_pdf(numeric(0), v1 = 5, v2 = 2)
cat("Empty input returns:", length(empty_result), "elements\n")
## Empty input returns: 0 elements
# Output: Empty input returns: 0 elements
# Interpretation: Handles empty input gracefully

# b) Missing value
na_result <- f_pdf(NA, v1 = 5, v2 = 2)
cat("NA input returns:", na_result, "\n")
## NA input returns: NA
# Output: NA input returns: NA
# Interpretation: Proper NA propagation

# c) Mixed values (negative, zero, positive)
mixed_result <- f_pdf(-1:3, v1 = 5, v2 = 2)
cat("Mixed input (-1:3) returns:\n")
## Mixed input (-1:3) returns:
print(mixed_result)
## [1] -5.977    NaN  0.308  0.132  0.072
# Output: [1] -5.977    NaN  0.308  0.132  0.072
# Interpretation: the density formula is only valid for x > 0; for x <= 0 it
# returns meaningless values (-5.977, NaN), so inputs should be restricted to positive x

# --------------------------------------------
# Note: All tests complete without errors
# --------------------------------------------
#' F-Distribution Visualization
#'
#' Creates a professional plot of the F-distribution with shaded critical regions
#' and test statistic annotation.
#'
#' @param fstat Calculated F-statistic value
#' @param df1 Numerator degrees of freedom
#' @param df2 Denominator degrees of freedom
#' @param p Calculated p-value
#' @param alpha Significance level (default = 0.05)
#' @param n Number of points for smooth curve (default = 1000)
#'
#' @return A ggplot object showing the F-distribution with critical regions
#' @export
#'
#' @examples
#' f_plot(fstat = 5, df1 = 10, df2 = 5, p = 0.05)
f_plot <- function(fstat, df1, df2, p, alpha = 0.05, n = 1000) {
  # ggplot2 is required for the plot
  library(ggplot2)
  
  # Calculate critical value and prepare data
  critical <- qf(1 - alpha, df1, df2)
  x <- seq(from = 0, to = max(critical + 2, fstat * 1.2), length.out = n)
  y <- df(x, df1, df2)
  plot_data <- data.frame(x = x, y = y)
  
  # Modern color palette
  palette <- list(
    fill_accept = "#2CA58DFF",       # Vibrant teal
    fill_reject = "#F46197FF",       # Elegant pink
    curve_color = "#3A3A3AFF",       # Dark gray
    stat_line = "#FF7F0EFF",         # Orange
    crit_line = "#9467BDFF",         # Purple
    text_dark = "#1A1A1AFF",         # Near-black
    text_light = "#FFFFFFFF",        # White
    bg_color = "#F5F5F5FF"           # Light gray
  )
  
  # Create plot
  ggplot(plot_data, aes(x = x, y = y)) +
    # F-distribution curve
    geom_line(color = palette$curve_color, linewidth = 1.2) +
    
    # Critical regions with gradient fills
    geom_area(data = subset(plot_data, x <= critical),
              aes(fill = "Acceptance"), alpha = 0.7) +
    geom_area(data = subset(plot_data, x >= critical),
              aes(fill = "Rejection"), alpha = 0.7) +
    
    # Critical value line
    geom_vline(xintercept = critical, 
               color = palette$crit_line, 
               linewidth = 1, 
               linetype = "dashed") +
    
    # F-statistic line
    geom_vline(xintercept = fstat, 
               color = palette$stat_line, 
               linewidth = 1.5) +
    
    # Annotations
    annotate("label", x = fstat, y = max(y) * 0.9,
             label = sprintf("F = %.2f\np = %.3f", fstat, p),
             hjust = ifelse(fstat > critical, 1.1, -0.1),
             fill = palette$text_light,
             color = palette$text_dark,
             size = 4) +
    
    annotate("label", x = critical, y = max(y) * 0.7,
             label = sprintf("Critical = %.2f", critical),
             hjust = 1.1,
             fill = palette$text_light,
             color = palette$crit_line,
             size = 4) +
    
    # Color scaling
    scale_fill_manual(values = c(
      "Acceptance" = palette$fill_accept,
      "Rejection" = palette$fill_reject
    )) +
    
    # Theme and labels
    labs(title = bquote(bold(F[.(df1)*","*.(df2)]~Distribution)),
         x = "F Value",
         y = "Density",
         fill = "Region") +
    
    theme_minimal(base_size = 13) +
    theme(
      plot.title = element_text(size = 16, face = "bold", hjust = 0.5, 
                               margin = margin(b = 15)),
      panel.background = element_rect(fill = palette$bg_color, color = NA),
      plot.background = element_rect(fill = "white", color = NA),
      panel.grid.major = element_line(color = "white", linewidth = 0.3),
      panel.grid.minor = element_blank(),
      legend.position = "top",
      legend.justification = "right",
      axis.title = element_text(face = "bold"),
      axis.text = element_text(color = palette$text_dark)
    )
}

# Save function
dump("f_plot", file = "f_plot.R")
# Test the function
f_plot(fstat = 5, df1 = 10, df2 = 5, p = 0.05)

f_plot(fstat = 1.5, df1=60, df2=60, p=0.05)


5.5 F-Test

An F-test is a statistical test in which the test statistic follows an F-distribution under the null hypothesis. It is commonly used to compare statistical models fitted to the same dataset to determine which model best explains the variability in the population. Exact F-tests typically arise when models are fitted using least squares estimation.

#' Perform Analysis of Variance (ANOVA)
#'
#' Conducts ANOVA testing and returns results with optional visualization.
#'
#' @param X A matrix or data frame containing the predictor variables. If \code{Y} is NULL, the last column will be used as the response variable.
#' @param Y Optional response variable vector/matrix. If provided, \code{X} should contain only predictor variables.
#' @param plot Logical indicating whether to generate an F-distribution plot (default = TRUE).
#' @param alpha Significance level for critical value calculation (default = 0.05).
#'
#' @return Invisibly returns a list with:
#' \itemize{
#'   \item df_between: Degrees of freedom between groups
#'   \item df_within: Degrees of freedom within groups
#'   \item ss_between: Sum of squares between groups
#'   \item ss_within: Sum of squares within groups
#'   \item ms_between: Mean square between groups
#'   \item ms_within: Mean square within groups
#'   \item f_statistic: Calculated F-statistic
#'   \item p_value: Associated p-value
#' }
#'
#' @examples
#' \dontrun{
#' data(mtcars)
#' ANOVA(mtcars[, c("cyl", "mpg")])       # last col as response
#'
#' X <- matrix(rnorm(100), ncol = 2)
#' Y <- rnorm(50)
#' ANOVA(X, Y)    # separate X, Y
#' }
#' @export
#' 


# Pretty printer (returns a kable object)
.print_anova_table <- function(df1, ssb, msb, f_stat, p_value,
                               df2, ssw, msw,
                               show_total = TRUE, digits = 3) {
  stopifnot(requireNamespace("knitr", quietly = TRUE),
            requireNamespace("kableExtra", quietly = TRUE))
  
  fmt_fixed <- function(x, d = 3) ifelse(is.na(x), "", formatC(x, format = "f", digits = d))
  fmt_p <- function(p) if (is.na(p)) "" else if (p < 1e-3) "< .001" else formatC(p, format = "f", digits = digits)
  
  tab <- data.frame(
    Source    = c("X", "Residuals"),
    Df        = c(df1, df2),
    `Sum Sq`  = c(ssb, ssw),
    `Mean Sq` = c(msb, msw),
    `F value` = c(f_stat, NA_real_),
    `Pr(>F)`  = c(p_value, NA_real_),
    check.names = FALSE
  )
  
  if (isTRUE(show_total)) {
    tab <- rbind(tab, data.frame(
      Source = "Total",
      Df = df1 + df2,
      `Sum Sq` = ssb + ssw,
      `Mean Sq` = NA_real_,
      `F value` = NA_real_,
      `Pr(>F)`  = NA_real_,
      check.names = FALSE
    ))
  }
  
  tab$`Sum Sq`  <- fmt_fixed(tab$`Sum Sq`, 0)
  tab$`Mean Sq` <- fmt_fixed(tab$`Mean Sq`, 0)
  tab$`F value` <- fmt_fixed(tab$`F value`, 2)
  tab$`Pr(>F)` <- ifelse(is.na(tab$`Pr(>F)`), "", fmt_p(p_value))
  
  knitr::kable(tab, align = "lrrrrr", booktabs = TRUE,
               caption = "Analysis of Variance Table") |>
    kableExtra::kable_styling(full_width = FALSE, position = "center") |>
    kableExtra::column_spec(2:6, width = "2cm")
}

# ---------- Main ANOVA ----------
# Accepts:
#  - ANOVA(X_with_Y_last_col)
#  - ANOVA(X, Y)
# --------------------------------
ANOVA <- function(X, Y = NULL, plot = TRUE, alpha = 0.05) {
  if (!is.matrix(X) && !is.data.frame(X)) stop("X must be a matrix or data frame.")
  if (alpha <= 0 || alpha >= 1) stop("alpha must be in (0,1).")
  
  # Load helper functions needed by ANOVA
  
  source("HAT.R")              # Hat matrix: H = X (X'X)^{-1} X'
  source("deviation_scores.R") # Centers Y around its mean (deviation scores)
  source("SST.R")              # Total sum of squares: SST = Σ(y - ȳ)^2
  source("SSB.R")              # Between-group sum of squares (projection via H)
  source("SSW.R")              # Within-group sum of squares: SST - SSB
  source("f_plot.R")           # Optional F-distribution plotting (ggplot2)
  
  library(knitr)
  
  
  X <- as.matrix(X); if (!is.numeric(X)) stop("X must be numeric.")
  n <- nrow(X)
  
  if (is.null(Y)) {
    if (ncol(X) < 2) stop("When Y is NULL, X must have at least 2 columns (predictors + response).")
    Y <- as.matrix(X[, ncol(X)])
    X <- as.matrix(X[, 1:(ncol(X) - 1), drop = FALSE])
  } else {
    Y <- as.matrix(Y)
    if (nrow(Y) != n) stop("X and Y must have the same number of rows.")
  }
  
  # Complete cases
  cc <- stats::complete.cases(X, Y)
  if (any(!cc)) {
    warning(sum(!cc), " incomplete cases removed")
    X <- X[cc, , drop = FALSE]
    Y <- Y[cc, , drop = FALSE]
    n <- nrow(X)
  }
  
  # Degrees of freedom
  p <- ncol(X) + 1          # intercept + predictors
  df_between <- p - 1
  df_within  <- n - p
  if (df_within < 1) stop("Not enough residual df (need n > number of parameters).")
  
  # Sums of squares
  ss_between <- SSB(X, Y)
  ss_within  <- SSW(X, Y)
  
  # Mean squares
  ms_between <- ss_between / df_between
  ms_within  <- ss_within  / df_within
  
  # F and p
  f_stat <- ms_between / ms_within
  p_value <- stats::pf(f_stat, df_between, df_within, lower.tail = FALSE)
  
  # Print table (must print the kable object from inside a function)
  tbl <- .print_anova_table(df_between, 
                            ss_between, 
                            ms_between, 
                            f_stat, 
                            p_value, 
                            df_within,  
                            ss_within,  
                            ms_within)
  print(tbl)
  
  # Plot (optional)
  if (isTRUE(plot)) {
    p_obj <- f_plot(f_stat, df_between, df_within, p_value, alpha)
    if (!is.null(p_obj)) print(p_obj)
  }
  
  invisible(list(
    df_between = df_between,
    df_within  = df_within,
    ss_between = ss_between,
    ss_within  = ss_within,
    ms_between = ms_between,
    ms_within  = ms_within,
    f_statistic = f_stat,
    p_value     = p_value
  ))
}

# Save function to file
dump("ANOVA", file = "ANOVA.R")
ANOVA(X, plot = FALSE)
## Analysis of Variance Table
## 
##             Df   Sum Sq   Mean Sq   F value   Pr(>F)
## X            4     2668       667    111.48   < .001
## Residuals    8       48         6
## Total       12     2716
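
As a quick cross-check of the hand-built table (a sketch that assumes X is the matrix used above, with the response in its last column), base R's anova() on an equivalent least-squares fit should reproduce the same degrees of freedom, sums of squares, and F value:

# Sketch: cross-check against base R's anova() (assumes the X used above)
y_resp <- X[, ncol(X)]                    # response: last column of X
x_pred <- X[, -ncol(X), drop = FALSE]     # predictors: remaining columns
anova(lm(y_resp ~ x_pred))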

5.6 The Coefficient of Determination \((R^2)\)

The coefficient of multiple correlation, denoted as R, is defined as the Pearson correlation coefficient between the predicted and actual values of the dependent variable in a linear regression model that includes an intercept.

The coefficient of determination, \((R^2)\), represents the proportion of variance in the dependent variable that is explained by the independent variables. It measures how well the statistical model (whether a line, curve, or more complex function) fits the observed data. An \((R^2)\) value of 1 indicates a perfect fit, meaning the regression model explains all the variability in the dependent variable.

The unexplained proportion of variance, also known as the coefficient of alienation, is \((1-R^2)\). It can be computed as the ratio of the sum of squared residuals (the differences between observed and predicted values) to the total sum of squares.

Y.hat <- HAT(X) %*% Y  # Predicted Y values using the hat matrix
R2 <- cor(Y, Y.hat)^2  # Coefficient of determination (R²)
R2
##      [,1]
## [1,]    1

\(R^2\) can also be computed using sums of squares:

SSb / SSt  # Explained variance / Total variance
##      [,1]
## [1,] 0.98

Or equivalently:

1 - (SSw / SSt)  # 1 - (Unexplained variance / Total variance)
##      [,1]
## [1,] 0.98

The squared multiple correlation (SMC) of each variable with the remaining variables in a matrix can be computed from the inverse of the correlation matrix:

1 - (1 / (diag(INV(COR(X)))))  # Manual SMC calculation
##   X1   X2   X3   X4    Y 
## 0.98 1.00 0.98 1.00 0.98
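
For comparison (a sketch that assumes X is the same numeric matrix used above, with base R's cor() and solve() standing in for the COR() and INV() helpers), the squared multiple correlations can be cross-checked as:

# Sketch: SMC cross-check with base R (assumes the X used above)
R_mat <- cor(X)                      # correlation matrix of the columns of X
smc   <- 1 - 1 / diag(solve(R_mat))  # SMC of each column with the remaining ones
round(smc, 2)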

5.7 Adjusted \(R^2\)

The coefficient of determination (\(R^2\)) never decreases when additional explanatory variables are added to a least squares regression model, even when those variables carry no real information about the response. This spurious inflation makes \(R^2\) unsuitable for comparing models with different numbers of independent variables.

To address this, the adjusted \(R^2\) is calculated to penalize the inclusion of unnecessary predictors. It adjusts for the number of explanatory terms (p) relative to the sample size (n). Unlike \(R^2\), the adjusted \(R^2\) increases only if the new variable improves the model more than expected by random chance.
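
A small simulation makes this concrete (a sketch with hypothetical data, using base R's lm() rather than the helpers developed here): adding a pure-noise predictor cannot lower \(R^2\), but it typically lowers the adjusted \(R^2\).

# Sketch: R^2 vs. adjusted R^2 when a noise predictor is added (hypothetical data)
set.seed(42)
n     <- 30
x1    <- rnorm(n)
y     <- 1 + 0.8 * x1 + rnorm(n)
noise <- rnorm(n)                         # unrelated to y by construction

fit1 <- summary(lm(y ~ x1))
fit2 <- summary(lm(y ~ x1 + noise))

c(R2_x1 = fit1$r.squared,      R2_x1_noise = fit2$r.squared)       # R^2 increases
c(adj_x1 = fit1$adj.r.squared, adj_x1_noise = fit2$adj.r.squared)  # adjusted R^2 usually drops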

McNemar’s Formula for Adjusted \(R^2\)

The adjusted \(R^2\) is computed as:

\[ \bar{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1} \]

Where:

  • \(\bar{R}^2\) = adjusted \(R^2\)
  • \(R^2\) = original coefficient of determination
  • \(n\) = sample size
  • \(p\) = number of explanatory variables

Key Properties:

  • Penalizes Complexity: Reduces overfitting by accounting for model degrees of freedom.
  • Comparison: Enables meaningful comparison across models with different numbers of predictors.
  • Interpretation: Always ≤ \(R^2\); can be negative if the model fits worse than a horizontal line.
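
Plugging hypothetical values into the formula above (\(R^2 = 0.75\), \(n = 100\), \(p = 5\), the same values used in the function example below) gives

\[ \bar{R}^2 = 1 - (1 - 0.75)\,\frac{100 - 1}{100 - 5 - 1} = 1 - 0.25 \cdot \frac{99}{94} \approx 0.737, \]

so the adjustment imposes only a modest penalty in this case.
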
#' Calculate Adjusted R-squared Using Different Formulas
#'
#' Computes four common variants of adjusted R-squared with recommendations
#' for which formula to use based on sample size and number of predictors.
#'
#' @param R2 Original R-squared value (0 to 1)
#' @param n Sample size (positive integer > p+1)
#' @param p Number of predictors (positive integer)
#' @return A list containing:
#' \itemize{
#'   \item Original_R2 - Input R-squared value
#'   \item McNemar - Adjusted R-squared (McNemar's formula)
#'   \item Wherry - Adjusted R-squared (Wherry's formula)
#'   \item Lord - Adjusted R-squared (Lord's formula)
#'   \item Stein - Adjusted R-squared (Stein's formula)
#'   \item Recommendation - Suggested formula based on n/p ratio
#' }
#' @examples
#' # Typical usage
#' calculate_adjusted_R2(R2 = 0.75, n = 100, p = 5)
#'
#' # Edge case (small sample)
#' calculate_adjusted_R2(R2 = 0.6, n = 15, p = 3)
calculate_adjusted_R2 <- function(R2, n, p) {
  # Input validation
  if (R2 < 0 || R2 > 1) stop("R2 must be between 0 and 1")
  if (n <= p + 1) stop("Sample size n must be greater than p + 1")
  if (p < 1) stop("Number of predictors p must be at least 1")
  
  # McNemar's formula (most common)
  mcnemar <- 1 - (1 - R2) * (n - 1) / (n - p - 1)
  
  # Wherry's formula
  wherry <- 1 - (1 - R2) * (n - 1) / (n - p)
  
  # Lord's formula
  lord <- 1 - (1 - R2) * (n + p - 1) / (n - p - 1)
  
  # Stein's formula (most conservative)
  # Only calculate if sample size is sufficient
  stein <- if (n > p + 2) {
    1 - ((n - 1)/(n - p - 1)) * 
        ((n - 2)/(n - p - 2)) * 
        ((n + 1)/n) * 
        (1 - R2)
  } else {
    warning("Stein's formula requires n > p + 2, returning NA")
    NA_real_
  }
  
  # Determine recommendation
  recommendation <- if (n/p >= 40) {
    "All formulas should give similar results (large sample)"
  } else if (n/p >= 10) {
    "McNemar or Wherry recommended (moderate sample)"
  } else {
    "Stein recommended (small sample or many predictors)"
  }
  
  return(list(
    Original_R2 = R2,
    McNemar = mcnemar,
    Wherry = wherry,
    Lord = lord,
    Stein = stein,
    Recommendation = recommendation
  ))
}
calculate_adjusted_R2(R2, n, p)
## $Original_R2
##      [,1]
## [1,]    1
## 
## $McNemar
##      [,1]
## [1,]    1
## 
## $Wherry
##      [,1]
## [1,]    1
## 
## $Lord
##      [,1]
## [1,]    1
## 
## $Stein
##      [,1]
## [1,]    1
## 
## $Recommendation
## [1] "Stein recommended (small sample or many predictors)"

5.7.1 The Relationship Between F and \(R^2\)

The F-statistic in ANOVA can be expressed in terms of sums of squares or the coefficient of determination \((R^2)\):

\[ F = \frac{SST - SSW}{SSW} \cdot \frac{n - p}{p - 1} = \frac{R^2}{p - 1} \cdot \frac{n - p}{1 - R^2} \]

Where:

  • \(F\) = F-statistic
  • \(SST\) = Total sum of squares
  • \(SSW\) = Within-group sum of squares (residual)
  • \(n\) = Total sample size
  • \(p\) = Number of model parameters (intercept plus predictors)
  • \(R^2\) = Coefficient of determination

Key Observations:

  • The F-statistic increases with higher \(R^2\) values
  • For a given \(R^2\), F decreases as more predictors are added (higher \(p\))
  • The relationship shows how goodness-of-fit \((R^2)\) translates to significance testing (F-test)

Alternative Forms: The relationship can also be expressed as:

\[ F = \frac{MS_{Model}}{MS_{Error}} = \frac{R^2/(p - 1)}{(1 - R^2)/(n - p)} \]

Where \(MS\) represents mean squares.

F <- (R2/(p - 1))/((1 - R2)/(n - p)); F
##      [,1]
## [1,]  Inf
R2 <- 1 - (1/(1 + (F*(p - 1)/(n - p)))); R2
##      [,1]
## [1,]    1
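
Because the example above has \(R^2 = 1\), the F-statistic is degenerate (infinite), so the identity is easier to see with the rounded figures from the ANOVA table in Section 5.5 (SST = 2716, SSW = 48, n = 13, p = 5, counting the intercept); the small difference from the printed F of 111.48 only reflects the rounding of the sums of squares.

# Sketch: both forms of the F-statistic from the (rounded) ANOVA table above
SST <- 2716; SSW <- 48; n <- 13; p <- 5    # p counts the intercept

F_from_ss <- ((SST - SSW) / SSW) * ((n - p) / (p - 1))

R2        <- (SST - SSW) / SST
F_from_R2 <- (R2 / (p - 1)) / ((1 - R2) / (n - p))

c(F_from_ss, F_from_R2)                    # identical, up to the rounding in SST and SSW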

6 Appendix A: Helper Functions

# First define all helper functions outside the main ANOVA function

#' Calculate Between-Group (Model) Sum of Squares
#' Hat-matrix (projection) form, matching the SSB(X, Y) call inside ANOVA()
SSB <- function(X, Y) {
  X1    <- cbind(Intercept = 1, as.matrix(X))     # design matrix with intercept
  H     <- X1 %*% solve(t(X1) %*% X1) %*% t(X1)   # hat matrix
  Y     <- as.matrix(Y)
  Y.hat <- H %*% Y                                # fitted values
  sum((Y.hat - mean(Y))^2)                        # variation explained by the model
}

#' Calculate Within-Group (Residual) Sum of Squares
#' Matching the SSW(X, Y) call inside ANOVA()
SSW <- function(X, Y) {
  X1 <- cbind(Intercept = 1, as.matrix(X))        # design matrix with intercept
  H  <- X1 %*% solve(t(X1) %*% X1) %*% t(X1)      # hat matrix
  Y  <- as.matrix(Y)
  sum((Y - H %*% Y)^2)                            # residual sum of squares
}

#' Create F-distribution plot
f_plot <- function(f_stat, df1, df2, p_value, alpha) {
  if (!requireNamespace("ggplot2", quietly = TRUE)) {
    stop("ggplot2 package required for plotting")
  }
  
  x <- seq(0, max(5, f_stat * 1.5), length.out = 200)
  y <- df(x, df1, df2)
  crit_val <- qf(1 - alpha, df1, df2)
  
  ggplot2::ggplot(data.frame(x = x, y = y), ggplot2::aes(x, y)) +
    ggplot2::geom_line() +
    ggplot2::geom_vline(xintercept = f_stat, color = "red", linetype = "dashed") +
    ggplot2::geom_vline(xintercept = crit_val, color = "blue", linetype = "dotted") +
    ggplot2::annotate("text", x = f_stat, y = max(y)/2, 
                     label = paste0("F = ", round(f_stat, 2)), hjust = -0.2) +
    ggplot2::annotate("text", x = crit_val, y = max(y)/2, 
                     label = paste0("Critical value = ", round(crit_val, 2)), hjust = 1.2) +
    ggplot2::labs(title = "F-Distribution", 
                 subtitle = paste0("p-value = ", format.pval(p_value)),
                 x = "F-value", y = "Density") +
    ggplot2::theme_minimal()
}

#' Create ANOVA table
anova_table <- function(df1, ssb, msb, f_stat, p_value, df2, ssw, msw) {
  # Significance symbol, following the standard codes printed below
  if (p_value <= 0.001) {symb <- "***"}
  else if (p_value <= 0.01) {symb <- "**"}
  else if (p_value <= 0.05) {symb <- "*"}
  else if (p_value <= 0.1) {symb <- "."}
  else {symb <- " "}
  
  cat("           Df   Sum Sq   Mean Sq   F value   Pr(>F)\n")
  cat("X         ", format(df1, width=2), " ", format(ssb, width=6, nsmall=1, justify='right'), 
      "   ", format(msb, width=6, nsmall=1, justify='right'), "   ", 
      format(f_stat, width=6, nsmall=1, justify='right'), "", 
      format(p_value, width=6, nsmall=4, justify='right'), symb, "\n")
  cat("Residuals ", format(df2, width=2), "   ", format(ssw, width=6, nsmall=1, justify='right'), 
      "    ", format(msw, width=6, nsmall=1, justify='right'), "\n") 
  cat("---\n")
  cat("Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n")
}


# Save all functions to file
dump(c("SSB", "SSW", "f_plot", "anova_table", "ANOVA"), file = "ANOVA_functions.R")
