Project History and Acknowledgments

This project is both a functional implementation of multivariate statistical methods (including regression and ANOVA) and a pedagogical resource. Developed as a learning journey, it begins with foundational concepts—such as basic column-wise computations—and systematically progresses to more advanced analyses. The accompanying code and commentary document not only the final solutions but also the iterative process of discovery, including challenges, revisions, and insights.

To emphasize algorithmic understanding over black-box solutions, the implementation deliberately minimizes reliance on built-in R packages, opting instead for transparent, step-by-step calculations.

The work originated in a graduate Multivariate Statistical Methods course at Ohio University in 2011, with initial implementations in SPSS and R. Over time, it evolved into a fully R-based project to leverage the language’s open-source flexibility and advanced statistical capabilities.

The original code, methodology, and documentation have since been refined—partially with AI-assisted tools—to enhance computational efficiency, analytical accuracy, and pedagogical clarity.

1 Computational Linear Algebra with Applications in Statistics

Linear algebra is a branch of mathematics that studies vectors, vector spaces, linear transformations, and matrix operations. It provides fundamental tools for statistical computing, where many problems can be formulated and solved using linear algebraic approaches. Key applications include:

  • Multivariate statistical analysis
  • Inference from linear models
  • Dimension reduction techniques
  • Optimization in statistical learning

Numerical linear algebra forms the computational backbone of modern statistical methods, enabling efficient solutions to large-scale problems in data analysis.


1.1 Important Matrices for Multivariate Analysis

1.1.1 The Data Matrix

At the core of multivariate analysis lies the data matrix, a structured representation of numerical data with:

  • Rows (n): Represent individual observations (cases, samples, or instances)
  • Columns (p): Contain measured variables (features, attributes, or dimensions)

Key Requirements:

  1. Statistical Independence of Observations

The data matrix requires rows to represent statistically independent observations; formally, the joint distribution of any two distinct rows factorizes into the product of their marginals, \[ \Pr(\text{row}_i, \text{row}_j) = \Pr(\text{row}_i) \times \Pr(\text{row}_j) \quad \forall i \neq j. \] This property is typically achieved through controlled experimental design, proper sampling techniques, and randomization procedures. In experimental settings, independence is enforced by random assignment of treatments, while observational studies rely on probability sampling methods (e.g., simple random sampling, stratified sampling) and careful study design to minimize dependencies between units.

  2. Structural Integrity: The data matrix must also maintain structural integrity through:
  • Complete cases (or proper missing data handling)
  • Consistent units of measurement
  • Clear variable definitions

Violations compromise matrix operations and the validity of statistical inference.

Mathematical Representation:

Let X ∈ ℝⁿˣᵖ denote a data matrix:

\[ X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \]

# Load required packages
# List of required packages
required_packages <- c("tidyverse", "knitr", "kableExtra", "GGally", "broom", "car")

# Function to install and load packages safely
install_and_load <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
  }
  library(pkg, character.only = TRUE)
}

# Install and load all packages with progress feedback
invisible(suppressMessages(sapply(required_packages, install_and_load)))
# Create data matrix X from Woods (1966) cement hardening data
# Columns represent:
# X1 - % tricalcium aluminate
# X2 - % tricalcium silicate
# X3 - % tetracalcium alumino ferrite
# X4 - % dicalcium silicate
# Y - Heat evolved in calories/gram of cement

X <- matrix(c(7, 26, 6, 60, 78.5,
               1, 29, 15, 52, 74.3,
               11, 56, 8, 20, 104.3,
               11, 31, 8, 47, 87.6, 
               7, 52, 6, 33, 95.9,
               11, 55, 9, 22, 109.2,
               3, 71, 17, 6, 102.7,
               1, 31, 22, 44, 72.5,
               2, 54, 18, 22, 93.1,
               21, 47, 4, 26, 115.9,
               1, 40, 23, 34, 83.8,
               11, 66, 9, 12, 113.3,
               10, 68, 8, 12, 109.4), 
             nrow = 13, ncol = 5, byrow = TRUE)

# Set descriptive row and column names
rownames(X) <- paste0("obs.", 1:13)
colnames(X) <- c("X1", "X2", "X3", "X4", "Y")

# Display the matrix structure
str(X)
##  num [1:13, 1:5] 7 1 11 11 7 11 3 1 2 21 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:13] "obs.1" "obs.2" "obs.3" "obs.4" ...
##   ..$ : chr [1:5] "X1" "X2" "X3" "X4" ...
# Save data to RData file
save(X, file = "X.RData")

# Optional: Also save as CSV for interoperability
write.csv(X, file = "X.csv", row.names = TRUE)
# Display with kable and proper headers
kable(X, digits = 1, align = 'c', caption = "Cement Hardening Data (Woods, 1966)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  add_header_above(c(" " = 1, 
                     "Chemical Composition (%)" = 4, 
                     "Heat Evolved" = 1)) %>%
  footnote(general = "X1: Tricalcium Aluminate\nX2: Tricalcium Silicate\nX3: Tetracalcium Alumino Ferrite\nX4: Dicalcium Silicate\nY: calories/gram",
           general_title = "Variable Definitions:") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(5, background = "#f7f7f7")
Cement Hardening Data (Woods, 1966)

            Chemical Composition (%)        Heat Evolved
            X1    X2    X3    X4            Y
  obs.1      7    26     6    60            78.5
  obs.2      1    29    15    52            74.3
  obs.3     11    56     8    20           104.3
  obs.4     11    31     8    47            87.6
  obs.5      7    52     6    33            95.9
  obs.6     11    55     9    22           109.2
  obs.7      3    71    17     6           102.7
  obs.8      1    31    22    44            72.5
  obs.9      2    54    18    22            93.1
  obs.10    21    47     4    26           115.9
  obs.11     1    40    23    34            83.8
  obs.12    11    66     9    12           113.3
  obs.13    10    68     8    12           109.4

Variable Definitions:
X1: Tricalcium Aluminate
X2: Tricalcium Silicate
X3: Tetracalcium Alumino Ferrite
X4: Dicalcium Silicate
Y: calories/gram

1.1.2 Column Sums in Statistical Analysis

Column sums (\(\sum_{i=1}^{n} x_{ij}\) for matrix \(X \in \mathbb{R}^{n \times p}\)) serve as a fundamental operation in statistical computing with three primary applications. First, they enable data aggregation through the computation \(S_j = \sum_{i=1}^{n} x_{ij}\) (for each column \(j = 1,\dots,p\)), which gives the total magnitude of each variable, a cornerstone of descriptive statistics. Second, these sums form the basis for normalization procedures, where individual elements are scaled by their column sum via \(\tilde{x}_{ij} = x_{ij}/S_j\), enabling proportional analysis and compositional data transformations. Finally, column sums are building blocks for core algorithms, particularly matrix-vector products such as \(X^\top y\) and the cross-product matrix \(X^\top X\) that underlies covariance computation (after centering and scaling by \(n-1\)).
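As a brief illustration of the scaling and cross-product uses just described, the sketch below relies only on base R and the cement data matrix X defined above (base R's sweep() here is unrelated to the SWEEP operator discussed later):

# Column totals S_j
S_j <- colSums(X)

# Normalization: divide each element by its column sum, so every column sums to 1
X_prop <- sweep(X, 2, S_j, FUN = "/")
colSums(X_prop)   # all equal to 1

# Matrix-vector product X^T y using the four composition variables and the response
y <- X[, "Y"]
crossprod(X[, c("X1", "X2", "X3", "X4")], y)   # same as t(X[, 1:4]) %*% y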

#' Calculate Column Sums Identical to colSums()
#'
#' Computes column sums using matrix multiplication while perfectly replicating
#' base R's colSums() output format (named numeric vector with same spacing).
#'
#' @param X A matrix or data frame
#' @param na.rm Logical; should NA values be removed? (Default: TRUE)
#' @return A named numeric vector identical to colSums() output
#' @export
#' @examples
#' X <- matrix(c(97,626,153,390,1241), nrow=1,
#'             dimnames=list(NULL, c("X1","X2","X3","X4","Y")))
#' SUMS(X)    # Returns c(X1=97, X2=626, X3=153, X4=390, Y=1241)
#' colSums(X) # Identical output

SUMS <- function(X, na.rm = TRUE) {
  # Convert to matrix while preserving names
  original_names <- colnames(X)
  X <- data.matrix(X)
  
  # NA handling
  if (na.rm) {
    X[is.na(X)] <- 0
  } else if (anyNA(X)) {
    has_na <- colSums(is.na(X)) > 0
  }
  
  # Compute sums via matrix multiplication
  n <- nrow(X)
  if (n == 0) {
    sums <- rep(NA_real_, ncol(X))
  } else {
    J <- matrix(1, nrow = 1, ncol = n)
    sums <- as.vector(J %*% X)  # Convert to vector
  }
  
  # Apply NA propagation if needed
  if (!na.rm && exists("has_na")) {
    sums[has_na] <- NA
  }
  
  # Name the result to match colSums() output
  names(sums) <- original_names
  
  return(sums)
}

# Save function to file for reuse
dump("SUMS", file = "SUMS.R")
SUMS(X)
##     X1     X2     X3     X4      Y 
##   97.0  626.0  153.0  390.0 1240.5
# for comparison
colSums(X)
##     X1     X2     X3     X4      Y 
##   97.0  626.0  153.0  390.0 1240.5

1.1.3 Column (variable) Means

In data analysis, columns (also called variables) represent the measured attributes or features in a dataset. The column mean—the average value of each variable—serves as a fundamental building block in statistical linear algebra. These means play a pivotal role in both theoretical analysis (such as dimensionality reduction and covariance computation) and practical applications (like data centering and normalization). By summarizing the central tendency of each variable, column means provide a critical reference point for further statistical modeling.

#' Compute Column Means Identical to colMeans()
#'
#' Calculates column means using matrix multiplication while perfectly replicating
#' base R's colMeans() output format and NA handling behavior.
#'
#' @param X A matrix, data frame, or numeric vector
#' @param na.rm Logical; should NA values be removed? (Default: TRUE)
#' @return A named numeric vector matching colMeans() output
#' @export
#' @examples
#' X <- matrix(c(7.46, 48.2, 11.8, 30, 95.4), nrow = 1,
#'             dimnames = list(NULL, c("X1", "X2", "X3", "X4", "Y")))
#' MEANS(X)    # Returns c(X1 = 7.46, X2 = 48.2, X3 = 11.8, X4 = 30, Y = 95.4)
#' colMeans(X) # Identical output

MEANS <- function(X, na.rm = TRUE) {
  # [Input Conversion]
  # Store original column names before conversion
  original_names <- colnames(X)
  
  # Convert to matrix (preserves numeric values and handles data frames)
  # Note: as.matrix() maintains dimension names unlike data.matrix()
  X <- as.matrix(X)
  
  # [NA Handling Preparation]
  if (na.rm) {
    # Count valid (non-NA) observations per column
    n_valid <- colSums(!is.na(X))
    # Temporarily replace NAs with 0 for correct summation
    X[is.na(X)] <- 0
  } else {
    # Identify columns with any NAs
    has_na <- colSums(is.na(X)) > 0
    # Count valid obs, set to NA if any NAs exist when na.rm=FALSE
    n_valid <- nrow(X) - colSums(is.na(X))
    n_valid[has_na] <- NA
  }
  
  # [Core Computation]
  # Create summation vector (1×n matrix of 1s)
  n <- nrow(X)
  J <- matrix(1, nrow = 1, ncol = n)
  
  # Matrix multiplication equivalent to:
  # sum = [1,1,...,1] %*% X
  # mean = sum/n
  means <- (J %*% X) / n
  
  # [NA Adjustment]
  # When na.rm=TRUE, adjust means for columns with NAs:
  # mean = sum/(original n) * (original n)/n_valid
  if (na.rm) {
    # Create adjustment matrix (1×p)
    adj <- matrix(n/n_valid, nrow = 1)
    means <- means * adj
  }
  
  # [Output Formatting]
  # Convert to named numeric vector (identical to colMeans() output)
  result <- as.numeric(means)  # Drops matrix class
  names(result) <- original_names  # Restore column names
  
  return(result)
}

# Save function
dump("MEANS", file = "MEANS.R")
MEANS(X)
##        X1        X2        X3        X4         Y 
##  7.461538 48.153846 11.769231 30.000000 95.423077
# for comparison
colMeans(X)
##        X1        X2        X3        X4         Y 
##  7.461538 48.153846 11.769231 30.000000 95.423077

1.2 Deviation Scores Matrix

The deviation scores matrix \(\mathbf{D} \in \mathbb{R}^{n \times p}\) represents mean-centered data through the transformation \(\mathbf{D} = \mathbf{X} - \mathbf{1}\bar{\mathbf{x}}^\top\), where \(\mathbf{X}\) is the original \(n \times p\) data matrix, \(\mathbf{1}\) denotes an \(n \times 1\) vector of ones, and \(\bar{\mathbf{x}} = [\bar{x}_1 \cdots \bar{x}_p]^\top\) contains the column means. Each element \(d_{ij} = x_{ij} - \bar{x}_j\) quantifies the departure of observation \(i\) from variable \(j\)’s mean value.

This centering operation generates several important mathematical consequences. First, the column sums of \(\mathbf{D}\) become exactly zero (\(\mathbf{1}^\top\mathbf{D} = \mathbf{0}^\top\)), reflecting the elimination of mean differences. Second, the matrix product \(\mathbf{D}^\top\mathbf{D}\) produces the sum-of-squares-and-cross-products (SSCP) matrix, which forms the basis for computing covariance and correlation matrices. Notably, this transformation preserves the intrinsic dimensionality of the data, maintaining the rank of the original matrix \(\mathbf{X}\) while recentering the coordinate system around the multivariate mean.

\[ \begin{equation} \mathbf{D} = \mathbf{X} - \mathbf{1}\boldsymbol{\mu}^\top = \begin{pmatrix} x_{11} - \mu_1 & x_{12} - \mu_2 & \cdots & x_{1p} - \mu_p \\ x_{21} - \mu_1 & x_{22} - \mu_2 & \cdots & x_{2p} - \mu_p \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} - \mu_1 & x_{n2} - \mu_2 & \cdots & x_{np} - \mu_p \end{pmatrix} \end{equation} \] where: \[ \begin{align*} \mathbf{X} &= (x_{ij}) \in \mathbb{R}^{n \times p} \text{ is the data matrix} \\ \boldsymbol{\mu} &= (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p)^\top \text{ is the mean vector} \\ \mathbf{1} &= (1, 1, \ldots, 1)^\top \in \mathbb{R}^n \text{ is the ones vector} \end{align*} \]

#' Compute Deviation Scores Using MEANS and SUMS
#'
#' Centers a matrix by subtracting column means using pre-defined functions,
#' demonstrating how statistical operations compose together.
#'
#' @param X A numeric matrix (n × p)
#' @return The centered matrix (n × p) where each column has mean 0
#' @export
#' @examples
#' X <- matrix(1:12, ncol=3)
#' deviation_scores(X)

deviation_scores <- function(X) {
  # Convert to matrix if needed
  X <- as.matrix(X)
  n <- nrow(X)
  
  # [1] Get column means using MEANS()
  # (which itself uses matrix multiplication)
  mu <- MEANS(X)
  
  # [2] Create mean matrix by replicating mu across rows
  M <- matrix(mu, nrow = n, ncol = ncol(X), byrow = TRUE)
  
  # [3] Compute deviations
  D <- X - M
  
  # [Alternative Approach Using SUMS]
  # mu_alt <- SUMS(X)/nrow(X)  # Demonstrates functional composition
  
  dimnames(D) <- dimnames(X)
  return(D)
}

# Save function
dump("deviation_scores", file = "deviation_scores.R")
# Using the function
deviation_scores(X)
##                X1         X2        X3  X4           Y
## obs.1  -0.4615385 -22.153846 -5.769231  30 -16.9230769
## obs.2  -6.4615385 -19.153846  3.230769  22 -21.1230769
## obs.3   3.5384615   7.846154 -3.769231 -10   8.8769231
## obs.4   3.5384615 -17.153846 -3.769231  17  -7.8230769
## obs.5  -0.4615385   3.846154 -5.769231   3   0.4769231
## obs.6   3.5384615   6.846154 -2.769231  -8  13.7769231
## obs.7  -4.4615385  22.846154  5.230769 -24   7.2769231
## obs.8  -6.4615385 -17.153846 10.230769  14 -22.9230769
## obs.9  -5.4615385   5.846154  6.230769  -8  -2.3230769
## obs.10 13.5384615  -1.153846 -7.769231  -4  20.4769231
## obs.11 -6.4615385  -8.153846 11.230769   4 -11.6230769
## obs.12  3.5384615  17.846154 -2.769231 -18  17.8769231
## obs.13  2.5384615  19.846154 -3.769231 -18  13.9769231
# Manual comparison
(residuals <- apply(na.omit(X), 2, function(x) x - mean(x)))
##                X1         X2        X3  X4           Y
## obs.1  -0.4615385 -22.153846 -5.769231  30 -16.9230769
## obs.2  -6.4615385 -19.153846  3.230769  22 -21.1230769
## obs.3   3.5384615   7.846154 -3.769231 -10   8.8769231
## obs.4   3.5384615 -17.153846 -3.769231  17  -7.8230769
## obs.5  -0.4615385   3.846154 -5.769231   3   0.4769231
## obs.6   3.5384615   6.846154 -2.769231  -8  13.7769231
## obs.7  -4.4615385  22.846154  5.230769 -24   7.2769231
## obs.8  -6.4615385 -17.153846 10.230769  14 -22.9230769
## obs.9  -5.4615385   5.846154  6.230769  -8  -2.3230769
## obs.10 13.5384615  -1.153846 -7.769231  -4  20.4769231
## obs.11 -6.4615385  -8.153846 11.230769   4 -11.6230769
## obs.12  3.5384615  17.846154 -2.769231 -18  17.8769231
## obs.13  2.5384615  19.846154 -3.769231 -18  13.9769231

This code verifies that the deviation_scores() function is working correctly by checking two key mathematical properties of properly centered data:

  • Mean of deviation scores should be 0 for each column
  • Sum of deviation scores should be 0 for each column
source("MEANS.R")
source("SUMS.R")

cat("--- Deviation Score Checks ---\n")
## --- Deviation Score Checks ---
cat("Mean of deviation scores: ", MEANS(deviation_scores(X)), "\n")
## Mean of deviation scores:  -1.366428e-16 5.465713e-16 -8.19857e-16 0 -1.093143e-14
cat("Sum of deviation scores:    ",SUMS(deviation_scores(X)), "\n\n")
## Sum of deviation scores:     -1.776357e-15 7.105427e-15 -1.065814e-14 0 -1.421085e-13
cat("Note: Values should be very close to 0 (floating-point precision limits exact zeros)\n")
## Note: Values should be very close to 0 (floating-point precision limits exact zeros)

1.3 Sums of Squares and Cross Products Matrix (SSCP)

The Sums of Squares and Cross Products (SSCP) matrix is a fundamental quadratic form in multivariate statistics that captures both the magnitude and interdependence of variables. For a data matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\) containing \(n\) observations of \(p\) variables, the SSCP matrix is given by:

\[ \mathbf{S} = \mathbf{X}^\top\mathbf{X} = \sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^\top \]

where \(\mathbf{x}_i^\top\) represents the \(i\)-th row of \(\mathbf{X}\) (a \(1 \times p\) vector).

1.3.1 Explicit Matrix Formulation

\[ \mathbf{X}^\top\mathbf{X} = \begin{pmatrix} \|\mathbf{x}_{(1)}\|^2 & \langle\mathbf{x}_{(1)}, \mathbf{x}_{(2)}\rangle & \cdots & \langle\mathbf{x}_{(1)}, \mathbf{x}_{(p)}\rangle \\ \langle\mathbf{x}_{(2)}, \mathbf{x}_{(1)}\rangle & \|\mathbf{x}_{(2)}\|^2 & \cdots & \langle\mathbf{x}_{(2)}, \mathbf{x}_{(p)}\rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle\mathbf{x}_{(p)}, \mathbf{x}_{(1)}\rangle & \langle\mathbf{x}_{(p)}, \mathbf{x}_{(2)}\rangle & \cdots & \|\mathbf{x}_{(p)}\|^2 \end{pmatrix} \]

where:

  • \(\|\mathbf{x}_{(j)}\|^2 = \sum_{i=1}^n x_{ij}^2\) (squared Euclidean norm)
  • \(\langle\mathbf{x}_{(j)}, \mathbf{x}_{(k)}\rangle = \sum_{i=1}^n x_{ij}x_{ik}\) (inner product)

1.3.2 Key Properties

  • Symmetric Positive Semi-definite:
    • \(\mathbf{S}^\top = \mathbf{S}\)
    • \(\mathbf{v}^\top\mathbf{S}\mathbf{v} \geq 0\) for all \(\mathbf{v} \in \mathbb{R}^p\)
  • Rank Properties:
    • \(\text{rank}(\mathbf{S}) = \text{rank}(\mathbf{X}) \leq \min(n,p)\)
    • Full rank when columns are linearly independent
  • Relation to Sample Covariance:
    • For centered data \(\mathbf{D}\):

\[ \begin{equation} \text{Cov}(\mathbf{X}) = \frac{1}{n-1}\mathbf{D}^\top\mathbf{D} = \frac{1}{n-1}\mathbf{X}_c^\top\mathbf{X}_c = \frac{1}{n-1}\text{SSCP}(\mathbf{X}_c) \end{equation} \]

where:

  • \(\mathbf{X}_c\) is the column-centered data matrix (\(\mathbf{X}_c = \mathbf{X} - \mathbf{1}\boldsymbol{\mu}^\top\))
  • \(\text{Cov}(\mathbf{X})\) is the sample covariance matrix
  • \(n\) is the number of observations

These properties are checked numerically in the short sketch below.
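A minimal check, using base R only and the cement data matrix X defined earlier (exact zeros may appear as tiny floating-point values):

S_sscp <- crossprod(X)                      # X^T X, the (uncentered) SSCP matrix
all.equal(S_sscp, t(S_sscp))                # symmetric
min(eigen(S_sscp, symmetric = TRUE)$values) # non-negative: positive semi-definite
qr(S_sscp)$rank == qr(X)$rank               # the SSCP matrix inherits the rank of X
Xc <- scale(X, center = TRUE, scale = FALSE)
all.equal(crossprod(Xc) / (nrow(X) - 1), cov(X), check.attributes = FALSE)  # covariance relation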
#' Compute Sums of Squares and Cross Products (SSCP) Matrix
#'
#' Calculates the SSCP matrix for a numeric data matrix with options for NA handling.
#'
#' @param Y A numeric matrix or data frame (n × p)
#' @param na_action Method for handling NAs: 
#'        "listwise" (complete cases, default) or 
#'        "pairwise" (available cases for each product)
#' @param center Logical indicating whether to center columns first (default = FALSE)
#' @return A symmetric p × p SSCP matrix
#' @export

SSCP <- function(Y, na_action = c("listwise", "pairwise"), center = FALSE) {
  # Validate input
  if (!is.matrix(Y) && !is.data.frame(Y)) {
    stop("Input must be a matrix or data frame")
  }
  na_action <- match.arg(na_action)
  Y <- as.matrix(Y)
  
  # Handle missing data
  if (na_action == "listwise") {
    comp_cases <- complete.cases(Y)
    Y <- Y[comp_cases, , drop = FALSE]
  }
  
  # Center data if requested
  if (center) {
    Y <- scale(Y, center = TRUE, scale = FALSE)
  }
  
  # Compute SSCP matrix
  if (na_action == "pairwise" && anyNA(Y)) {
    # Use all available (pairwise complete) observations for each cross product
    p <- ncol(Y)
    sscp_matrix <- matrix(NA_real_, p, p)
    for (i in seq_len(p)) {
      for (j in seq_len(i)) {
        ok <- complete.cases(Y[, c(i, j), drop = FALSE])
        sscp_matrix[i, j] <- sscp_matrix[j, i] <- sum(Y[ok, i] * Y[ok, j])
      }
    }
  } else {
    sscp_matrix <- t(Y) %*% Y
  }
  
  # Preserve column names
  colnames(sscp_matrix) <- rownames(sscp_matrix) <- colnames(Y)
  
  return(sscp_matrix)
}

# Save function with improved version
dump("SSCP", file = "SSCP.R")
SSCP(X)
##       X1      X2      X3      X4        Y
## X1  1139  4922.0   769.0  2620.0  10032.0
## X2  4922 33050.0  7201.0 15739.0  62027.8
## X3   769  7201.0  2293.0  4628.0  13981.5
## X4  2620 15739.0  4628.0 15062.0  34733.3
## Y  10032 62027.8 13981.5 34733.3 121088.1

Key Differences Between SSCP and Covariance Matrices

  • Scale: the SSCP matrix is sum-based, while the covariance matrix is average-based (per degree of freedom).
  • Units: both are expressed in products of the original variable units; standardization to unit-free values is what the correlation matrix provides.
  • Centering: the SSCP matrix can be computed on raw or centered data; the covariance matrix is always based on centered data.
  • Divisor: none for the SSCP matrix; \(n-1\) for the sample covariance matrix.
  • Matrix form: \(\mathbf{X}^\top\mathbf{X}\) versus \(\frac{1}{n-1}\mathbf{X}_c^\top\mathbf{X}_c\).

Mathematical Relationships

  1. Variance: \[ \text{Var}(X_j) = \frac{\text{CSSCP}_{jj}}{n-1} = s_{jj} \]

  2. Covariance: \[ \text{Cov}(X_j,X_k) = \frac{\text{CSSCP}_{jk}}{n-1} = s_{jk} \]

Here \(\text{CSSCP}\) denotes the SSCP matrix of the column-centered data, introduced in the next section; the raw SSCP matrix yields these quantities only after centering.

1.4 The Corrected Sums of Squares and Cross Products Matrix (CSSCP)

The Corrected Sums of Squares and Cross Products (CSSCP) matrix, also known as the deviation scores sums of squares and cross products matrix, is derived from a data matrix in which each score is expressed as a deviation from its mean.

The CSSCP matrix is defined as:

\[ \text{CSSCP}_{(p \times p)} = (X - \bar{X})^\top (X - \bar{X}) = \begin{bmatrix} \sum d_1^2 & \sum d_1 d_2 & \sum d_1 d_3 & \cdots & \sum d_1 d_p \\ \sum d_2 d_1 & \sum d_2^2 & \sum d_2 d_3 & \cdots & \sum d_2 d_p \\ \sum d_3 d_1 & \sum d_3 d_2 & \sum d_3^2 & \cdots & \sum d_3 d_p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \sum d_p d_1 & \sum d_p d_2 & \sum d_p d_3 & \cdots & \sum d_p^2 \\ \end{bmatrix} \]

where \(d_j\) denotes the vector of deviations of variable \(j\) from its mean, with elements

\[ d_{ij} = x_{ij} - \bar{x}_j, \]

and each sum runs over the \(n\) observations. That is, each entry is a sum of products of deviations from the mean, capturing variance (on the diagonal) and covariance (off the diagonal).

#' Compute Centered Sums of Squares and Cross Products (CSSCP) Matrix
#'
#' Calculates the centered SSCP matrix, which is directly related to the covariance matrix.
#' This is the matrix operation behind cov() and var() functions.
#'
#' @param X A numeric matrix or data frame (n × p)
#' @param na_action Method for handling NAs: "listwise" (default) or "pairwise"
#' @return A symmetric p × p CSSCP matrix (numerator of covariance matrix)
#' @export
#' @examples
#' # Basic usage
#' CSSCP(X)
#' 
#' # Relationship to covariance
#' n <- nrow(X)
#' CSSCP(X)/(n-1)  # Exactly matches cov(X)
#' 
#' # For comparison
#' cov(X)

CSSCP <- function(X, na_action = c("listwise", "pairwise")) {
  # Input validation
  if (!is.numeric(X) && !is.data.frame(X)) {
    stop("Input must be numeric matrix or data frame")
  }
  na_action <- match.arg(na_action)
  
  # Convert to matrix
  X <- as.matrix(X)
  
  # Handle NAs first so the column means are computed from the retained rows
  if (na_action == "listwise") {
    comp_cases <- complete.cases(X)
    X <- X[comp_cases, , drop = FALSE]
  }
  
  # Center each column at its mean
  X_centered <- scale(X, center = TRUE, scale = FALSE)
  
  # Compute CSSCP matrix (crossprod is more efficient than t(X) %*% X)
  csscp <- crossprod(X_centered)
  
  # Preserve dimension names
  dimnames(csscp) <- list(colnames(X), colnames(X))
  
  return(csscp)
}

# Save function
dump("CSSCP", file = "CSSCP.R")
# Basic usage
CSSCP(X)
##           X1         X2        X3      X4          Y
## X1  415.2308   251.0769 -372.6154  -290.0   775.9615
## X2  251.0769  2905.6923 -166.5385 -3041.0  2292.9538
## X3 -372.6154  -166.5385  492.3077    38.0  -618.2308
## X4 -290.0000 -3041.0000   38.0000  3362.0 -2481.7000
## Y   775.9615  2292.9538 -618.2308 -2481.7  2715.7631
# Relationship to covariance
n <- nrow(X)
CSSCP(X)/(n-1)
##           X1         X2         X3          X4          Y
## X1  34.60256   20.92308 -31.051282  -24.166667   64.66346
## X2  20.92308  242.14103 -13.878205 -253.416667  191.07949
## X3 -31.05128  -13.87821  41.025641    3.166667  -51.51923
## X4 -24.16667 -253.41667   3.166667  280.166667 -206.80833
## Y   64.66346  191.07949 -51.519231 -206.808333  226.31359
# Compare with R's cov()
cov(X)
##           X1         X2         X3          X4          Y
## X1  34.60256   20.92308 -31.051282  -24.166667   64.66346
## X2  20.92308  242.14103 -13.878205 -253.416667  191.07949
## X3 -31.05128  -13.87821  41.025641    3.166667  -51.51923
## X4 -24.16667 -253.41667   3.166667  280.166667 -206.80833
## Y   64.66346  191.07949 -51.519231 -206.808333  226.31359

1.5 Variance-Covariance Matrix

Variance measures how far a set of numbers is spread out, while covariance quantifies the extent to which corresponding elements from two sets of ordered data move together — that is, whether they increase or decrease simultaneously.

In a variance-covariance matrix:

  • Variances appear along the diagonal.
  • Covariances appear in the off-diagonal elements.

If two variables are independent, their covariance is zero. However, the converse is not always true: variables can be uncorrelated (zero covariance) yet still dependent, because covariance and correlation only capture linear relationships. Notably, if the variables are jointly normally distributed, zero correlation does imply independence.
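A minimal simulated illustration (not part of the cement data): Y = X² is completely determined by X, yet for a symmetric X the two have essentially zero covariance and correlation.

set.seed(1)
x_sim <- rnorm(1e4)      # symmetric around 0
y_sim <- x_sim^2         # fully dependent on x_sim
cov(x_sim, y_sim)        # near 0, since E[X^3] = 0 for a symmetric distribution
cor(x_sim, y_sim)        # likewise near 0 despite the exact functional dependence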

The inverse covariance matrix, also known as the concentration matrix or precision matrix, reflects partial correlations and partial variances — it shows the strength of association between two variables after removing the linear influence of other variables. Partial correlations are especially useful in statistical methods such as the Expectation-Maximization (EM) algorithm.
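As a sketch of this idea (assuming the covariance matrix is invertible; base R's solve() is used here, though the SWEEP-based INV() developed later would serve the same purpose), the partial correlation between variables \(i\) and \(j\), holding the remaining variables fixed, can be read off the precision matrix \(P\) as \(-p_{ij}/\sqrt{p_{ii}\,p_{jj}}\):

P <- solve(cov(X))                              # precision (concentration) matrix
partial_cor <- -P / sqrt(diag(P) %o% diag(P))   # -p_ij / sqrt(p_ii * p_jj)
diag(partial_cor) <- 1                          # conventionally 1 on the diagonal
round(partial_cor, 2)                           # partial correlations given all other variables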

The covariance matrix is defined as:

\[ \sigma(X_i, X_j) = \operatorname{Cov}(X_i, X_j) = \mathbb{E}\big[(X_i - \mathbb{E}[X_i])(X_j - \mathbb{E}[X_j])\big] \\ \boldsymbol{\Sigma} = \begin{bmatrix} \sigma^2(X_1) & \sigma(X_1, X_2) & \sigma(X_1, X_3) & \cdots & \sigma(X_1, X_p) \\ \sigma(X_2, X_1) & \sigma^2(X_2) & \sigma(X_2, X_3) & \cdots & \sigma(X_2, X_p) \\ \sigma(X_3, X_1) & \sigma(X_3, X_2) & \sigma^2(X_3) & \cdots & \sigma(X_3, X_p) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \sigma(X_p, X_1) & \sigma(X_p, X_2) & \sigma(X_p, X_3) & \cdots & \sigma^2(X_p) \\ \end{bmatrix} \]

where:

\(\qquad \sigma^2(X_i)\) is the variance of variable \(X_i\),

\(\qquad \sigma(X_i, X_j)\) is the covariance between \(X_i\) and \(X_j\), and

\(\qquad p\) is the number of variables (columns in the data matrix).

#' Compute Covariance Matrix Using CSSCP
#'
#' Efficiently calculates the covariance matrix using the relationship:
#' cov(X) = CSSCP(X)/(n-1)
#' 
#' @param X A numeric matrix or data frame (n × p)
#' @param na_action Method for NA handling: "listwise" (default) or "pairwise"
#' @param unbiased Use unbiased estimator (n-1 denominator)? Default TRUE
#' @return A symmetric p × p covariance matrix
#' @export
#' @examples
#' data <- matrix(rnorm(100), ncol=4)
#' COV(data)  # Standard sample covariance
COV <- function(X, na_action = c("listwise", "pairwise"), unbiased = TRUE) {
  # Input validation
  if (!is.numeric(X) && !is.data.frame(X)) {
    stop("Input must be numeric matrix or data frame")
  }
  na_action <- match.arg(na_action)
  
  # Ensure CSSCP is available
  if (!exists("CSSCP")) {
    if (file.exists("CSSCP.R")) source("CSSCP.R")
    else stop("CSSCP function not found")
  }
  
  X <- as.matrix(X)
  n <- nrow(X)
  
  # Handle insufficient data
  if (n < 2) {
    warning("Insufficient observations (n < 2)")
    return(matrix(NA, ncol(X), ncol(X), dimnames = list(colnames(X), colnames(X))))
  }
  
  # Compute denominator
  denom <- if (unbiased) {
    if (na_action == "pairwise") {
      # Approximate pairwise n_ij by min(n_i, n_j) non-missing values per column
      n_complete <- colSums(!is.na(X))
      outer(n_complete, n_complete, function(a,b) pmin(a,b)) - 1
    } else {
      n - 1  # Standard unbiased estimator
    }
  } else {
    if (na_action == "pairwise") {
      # Maximum-likelihood version of the same approximation
      n_complete <- colSums(!is.na(X))
      outer(n_complete, n_complete, function(a,b) pmin(a,b))
    } else {
      n  # ML estimator
    }
  }
  
  # Compute covariance using CSSCP relationship
  cov_mat <- CSSCP(X, na_action) / denom
  
  # Ensure symmetric (handles floating point precision)
  cov_mat <- (cov_mat + t(cov_mat)) / 2
  
  return(cov_mat)
}

# Save function
dump("COV", file = "COV.R")
# Standard sample covariance
COV(X)
##           X1         X2         X3          X4          Y
## X1  34.60256   20.92308 -31.051282  -24.166667   64.66346
## X2  20.92308  242.14103 -13.878205 -253.416667  191.07949
## X3 -31.05128  -13.87821  41.025641    3.166667  -51.51923
## X4 -24.16667 -253.41667   3.166667  280.166667 -206.80833
## Y   64.66346  191.07949 -51.519231 -206.808333  226.31359
# Compare with R's cov()
cov(X)
##           X1         X2         X3          X4          Y
## X1  34.60256   20.92308 -31.051282  -24.166667   64.66346
## X2  20.92308  242.14103 -13.878205 -253.416667  191.07949
## X3 -31.05128  -13.87821  41.025641    3.166667  -51.51923
## X4 -24.16667 -253.41667   3.166667  280.166667 -206.80833
## Y   64.66346  191.07949 -51.519231 -206.808333  226.31359

1.6 Basic Linear Algebra for Statistics

1.6.1 Trace of a Square Matrix

The trace of a square matrix is defined as the sum of the elements on its main diagonal. Mathematically, for a square matrix \(A \in \mathbb{R}^{n \times n}\), the trace is given by:

\[ \text{tr}(A) = \sum_{i=1}^{n} a_{ii} \]

The trace has several important properties:

  • It equals the sum of the eigenvalues of the matrix, counted with multiplicity (including complex eigenvalues).
  • It is invariant under a change of basis, meaning that for any invertible matrix \(P\), \(\text{tr}(A) = \text{tr}(P^{-1}AP)\).

These properties make the trace a useful scalar summary of a matrix in many statistical and linear algebra applications.
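Both properties can be checked quickly in base R (a small illustrative matrix, not taken from the text):

A_demo <- matrix(c(2, 1, 0,
                   1, 3, 1,
                   0, 1, 4), nrow = 3, byrow = TRUE)
sum(diag(A_demo))                        # trace = 2 + 3 + 4 = 9
sum(eigen(A_demo)$values)                # sum of eigenvalues equals the trace
set.seed(42)
P_basis <- matrix(rnorm(9), 3, 3)        # a random (almost surely invertible) change of basis
sum(diag(solve(P_basis) %*% A_demo %*% P_basis))   # still 9, up to rounding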

#' Compute Matrix Trace with Comprehensive Options
#'
#' Calculates the trace of a square matrix (sum of diagonal elements) with NA handling
#' and validation. Can utilize MEANS() for normalized trace calculations.
#'
#' @param X A square numeric matrix or data frame
#' @param na_action Method for handling NAs: 
#'        "omit" (default: removes NA diagonals), 
#'        "fail" (throws error if NAs exist),
#'        "ignore" (includes NAs in sum),
#'        "mean" (replaces NAs with column means)
#' @param normalized Logical; divide by matrix dimension? (Default: FALSE)
#' @return The sum of diagonal elements (normalized if requested)
#' @export
#' @examples
#' A <- matrix(c(4,-2,4,2, -2,10,-2,-7, 4,-2,8,4, 2,-7,4,7), nrow=4)
#' TRACE(A)  # Returns 29
#' 
#' # With NA values:
#' B <- A; diag(B)[2] <- NA
#' TRACE(B, na_action = "mean")  # Uses MEANS() for imputation
#' 
#' # Normalized trace:
#' TRACE(A, normalized = TRUE)  # Returns 29/4 = 7.25

TRACE <- function(X, na_action = c("omit", "fail", "ignore", "mean"), 
                 normalized = FALSE) {
  # Input validation
  if (!is.matrix(X) && !is.data.frame(X)) {
    stop("Input must be a matrix or data frame")
  }
  
  na_action <- match.arg(na_action)
  X <- as.matrix(X)
  
  # Check if square matrix
  if (nrow(X) != ncol(X)) {
    stop("Matrix must be square to compute trace")
  }
  
  # Get diagonal elements
  diag_elems <- diag(X)
  n <- length(diag_elems)
  
  # Handle NA values
  if (anyNA(diag_elems)) {
    switch(na_action,
           "fail" = stop("Matrix contains NA diagonal elements"),
           "omit" = {
             diag_elems <- na.omit(diag_elems)
             n <- length(diag_elems)  # Update effective dimension
           },
           "mean" = {
             if (!exists("MEANS")) {
               if (file.exists("MEANS.R")) source("MEANS.R") 
               else stop("MEANS function required for na_action='mean'")
             }
             diag_elems[is.na(diag_elems)] <- MEANS(X)[is.na(diag(X))]
           },
           "ignore" = {}
    )
  }
  
  # Calculate trace
  trace <- sum(diag_elems)
  
  # Normalize if requested
  if (normalized) trace <- trace/n
  
  return(trace)
}

# Save function
dump("TRACE", file = "TRACE.R")
# Test cases
A <- matrix(c(4, -2, 4, 2,
              -2, 10, -2, -7,
              4, -2, 8, 4,
              2, -7, 4, 7), 
            nrow = 4, byrow = TRUE)

TRACE(A)  # Returns 29
## [1] 29
psych::tr(A)  # Verify with psych package
## [1] 29

2 The SWEEP Operator

The SWEEP operator (Goodnight, 1979) is a fundamental matrix operation related to:

  • Gauss-Jordan elimination
  • Forward Doolittle procedure
  • Generalized inverse computation

It performs in-place matrix transformations with minimal storage requirements. For a symmetric matrix \(\mathbf{G}_{p \times p}\), sweeping on pivot \(k\) produces matrix \(\mathbf{H}\) with elements:

\[ \begin{aligned} h_{kk} &= -\frac{1}{g_{kk}} \\ h_{jk} = h_{kj} &= \frac{g_{jk}}{g_{kk}}, \quad j \neq k \\ h_{jl} &= g_{jl} - \frac{g_{jk}g_{kl}}{g_{kk}}, \quad j \neq k, \ l \neq k \end{aligned} \]

Example: 3×3 Matrix

\[ \mathbf{G} = \begin{bmatrix} g_{11} & g_{12} & g_{13} \\ g_{21} & g_{22} & g_{23} \\ g_{31} & g_{32} & g_{33} \end{bmatrix} \]

After sweeping on pivot 1:

\[ \text{SWEEP}[1]\mathbf{G} = \begin{bmatrix} \displaystyle{-\frac{1}{g_{11}}} & \displaystyle{\frac{g_{12}}{g_{11}}} & \displaystyle{\frac{g_{13}}{g_{11}}} \\ \\ \displaystyle{\frac{g_{21}}{g_{11}}} & \displaystyle{g_{22}-\frac{g_{12}^2}{g_{11}}} & \displaystyle{g_{23}-\frac{g_{12}g_{13}}{g_{11}}} \\ \\ \displaystyle{\frac{g_{31}}{g_{11}}} & \displaystyle{g_{32}-\frac{g_{12}g_{31}}{g_{11}}} & \displaystyle{g_{33}-\frac{g_{13}^2}{g_{11}}} \end{bmatrix} \]

#' SWEEP Operator for Symmetric Matrices
#'
#' Performs SWEEP operations on the specified pivot indices of a symmetric matrix.
#'
#' Convention note: this implementation leaves the swept pivot as +1/g[k,k],
#' divides the pivot row by the pivot, and divides and negates the pivot
#' column. With this convention the operation is its own inverse, so sweeping
#' the same pivot twice restores the original matrix. The formulas displayed
#' above follow the Little & Rubin convention (swept pivot equal to -1/g[k,k]),
#' which is instead undone by the reverse sweep RSW() defined later.
#'
#' @param A A symmetric numeric matrix
#' @param x Vector of pivot indices to sweep
#' @return The transformed matrix
#' @export
#' @references 
#' Goodnight, J.H. (1979). "A tutorial on the SWEEP operator". *The American Statistician*, 33(3), 149–158.
#'
#' @examples
#' A <- crossprod(matrix(rnorm(16), 4, 4))  # Positive definite matrix
#' SWEEP(A, 1:2)          # Sweep pivots 1 and 2
#' SWEEP(SWEEP(A, 1), 1)  # Sweeping the same pivot twice restores A
SWEEP <- 
  function(A, x){
    a <- as.matrix(A)
    n <- nrow(a)
    
    for(k in x){
      # Step 1: save the pivot element
      D <- a[k,k]
      
      # Step 2: divide the pivot row by the pivot
      a[k,] <- a[k,]/D
      
      # Step 3: reduce every other row and set its pivot-column entry to -B/D
      for (i in 1:n) {
        if (i != k) {
          B <- a[i, k]
          a[i,] <- a[i,] - B*a[k,]
          a[i,k] <- -B/D
        } # if
      } # for i
      
      # Step 4: replace the pivot with its reciprocal
      a[k,k] <- 1/D
    }
    return(a)
  } # SWEEP

# save to the working directory getwd(), ls()
dump("SWEEP", file = "SWEEP.R")
### 1. Create Test Matrix ###
cat("Original symmetric 2×2 matrix:\n")
## Original symmetric 2×2 matrix:
(test_matrix <- matrix(c(4, 2, 
                         2, 3), 
                       nrow = 2, ncol = 2,
                       dimnames = list(c("row1", "row2"),
                                       c("col1", "col2"))))
##      col1 col2
## row1    4    2
## row2    2    3
### 2. Forward Sweep on Pivot 1 ###
cat("\nAfter sweeping pivot 1:\n")
## 
## After sweeping pivot 1:
(swept <- SWEEP(test_matrix, 1))
##       col1 col2
## row1  0.25  0.5
## row2 -0.50  2.0
# Demonstration with matrix A
A <- matrix(c(4, -2, 4, 2,
              -2, 10, -2, -7,
              4, -2, 8, 4,
              2, -7, 4, 7), 
            nrow = 4, ncol = 4, 
            byrow = TRUE,
            dimnames = list(paste0("row",1:4), paste0("col",1:4)))

cat("Original matrix A:\n")
## Original matrix A:
print(A)
##      col1 col2 col3 col4
## row1    4   -2    4    2
## row2   -2   10   -2   -7
## row3    4   -2    8    4
## row4    2   -7    4    7
cat("\nAfter sweeping pivot 3:\n")
## 
## After sweeping pivot 3:
(A_swp3 <- SWEEP(A, 3))
##      col1  col2   col3 col4
## row1  2.0 -1.00 -0.500  0.0
## row2 -1.0  9.50  0.250 -6.0
## row3  0.5 -0.25  0.125  0.5
## row4  0.0 -6.00 -0.500  5.0
#       col1   col2  col3 col4
# row1  2.0 -1.00 -0.50  0.0
# row2 -1.0  9.50  0.25 -6.0
# row3  0.5 -0.25  0.12  0.5
# row4  0.0 -6.00 -0.50  5.0

cat("\nAfter re-sweeping pivot 3 (should recover original):\n")
## 
## After re-sweeping pivot 3 (should recover original):
(A_swp3_inv <- SWEEP(A_swp3, 3))
##      col1 col2 col3 col4
## row1    4   -2    4    2
## row2   -2   10   -2   -7
## row3    4   -2    8    4
## row4    2   -7    4    7
#      col1 col2 col3 col4
# row1    4   -2    4    2
# row2   -2   10   -2   -7
# row3    4   -2    8    4
# row4    2   -7    4    7

cat("\nVerification (max difference):", 
    max(abs(A_swp3_inv - A)), "\n")
## 
## Verification (max difference): 0
# Verification (max difference): 1.776357e-15

2.1 Reverse SWEEP operator (RSW)

The Reverse SWEEP operator undoes the transformation applied by the SWEEP operator on a symmetric matrix. It is particularly useful in EM algorithms and Gaussian graphical models when selectively reverting pivot operations.

The reverse SWEEP is defined (following Little & Rubin, 2020) as follows:

  1. Negate and invert the pivot element: \[ H_{kk} = -\frac{1}{G_{kk}} \]

  2. Update the \(k\)-th row and column: \[ H_{kj} = H_{jk} = -\frac{G_{jk}}{G_{kk}} \quad \text{for } j \ne k \]

  3. Update the rest of the matrix: \[ H_{jl} = G_{jl} - \frac{G_{jk} \cdot G_{kl}}{G_{kk}} \quad \text{for } j, l \ne k \]

Like the forward sweep, the reverse sweep maintains symmetry at each step.

#' Reverse SWEEP Operator (Little & Rubin Style)
#'
#' Performs a reverse SWEEP operation on a symmetric matrix,
#' undoing a previous sweep at the specified pivot index.
#'
#' @param G A symmetric numeric matrix.
#' @param i Indices of pivot elements to reverse sweep.
#' @return The matrix after applying reverse SWEEP to the specified pivots.
#' @export
#' @references 
#' Little, R.J.A., & Rubin, D.B. (2020). *Statistical Analysis with Missing Data* (3rd ed.).
#'
#' @examples
#' A <- matrix(c(2, -1, -0.5, 0,
#'              -1, 9.5, 0.25, -6,
#'              -0.5, 0.25, -0.125, -0.5,
#'               0, -6, -0.5, 5), nrow = 4, byrow = TRUE)
#' RSW(A, 3)
RSW <- function(G, i) {
  H <- as.matrix(G)
  n <- nrow(H)
  
  for (k in i) {
    # Step 1: Restore pivot
    H[k, k] <- -1 / G[k, k]
    
    # Step 2: Update row and column
    for (j in 1:n) {
      if (j != k) {
        H[k, j] <- H[j, k] <- -G[j, k] / G[k, k]
      }
    }
    
    # Step 3: Update remaining submatrix
    for (j in 1:n) {
      for (l in 1:n) {
        if (j != k && l != k) {
          H[j, l] <- G[j, l] - (G[j, k] * G[k, l]) / G[k, k]
        }
      }
    }
    
    # Prepare G for next pivot
    G <- H
  }
  
  return(H)
}

# Save to file
dump("RSW", file = "RSW.R")
RSW(A, 3)
##      col1  col2   col3 col4
## row1  2.0 -1.00 -0.500  0.0
## row2 -1.0  9.50  0.250 -6.0
## row3 -0.5  0.25 -0.125 -0.5
## row4  0.0 -6.00 -0.500  5.0
# Result:
##      [,1]  [,2]  [,3] [,4]
## [1,]  2.0 -1.00 -0.50  0.0
## [2,] -1.0  9.50  0.25 -6.0
## [3,] -0.5  0.25 -0.12 -0.5
## [4,]  0.0 -6.00 -0.50  5.0

2.1.1 Properties of the SWEEP Operator

Sweeping is Reversible

One of the key properties of the SWEEP operator is that sweeping twice on the same row (or pivot index) restores the original matrix. That is, the SWEEP operation is involutory with respect to a given pivot.

Formally:

\[ \text{SWEEP}(\text{SWEEP}(A, r), r) = A \]

This means that applying the SWEEP operator on row (and column) \(r\) twice is equivalent to no sweep at all on that row. This property is crucial in iterative algorithms such as the EM algorithm, where reversible transformations are desirable for selective updates of matrix partitions.

Example:

If you start with a symmetric matrix \(A\), then:

cat("Original Matrix A\n")
## Original Matrix A
A
##      col1 col2 col3 col4
## row1    4   -2    4    2
## row2   -2   10   -2   -7
## row3    4   -2    8    4
## row4    2   -7    4    7
cat("\nStep 1: Sweep pivot 3 once\n")
## 
## Step 1: Sweep pivot 3 once
(swept_once <- SWEEP(A, 3))
##      col1  col2   col3 col4
## row1  2.0 -1.00 -0.500  0.0
## row2 -1.0  9.50  0.250 -6.0
## row3  0.5 -0.25  0.125  0.5
## row4  0.0 -6.00 -0.500  5.0
cat("\nStep 2: Sweep pivot 3 again (reverse the sweep)\n")
## 
## Step 2: Sweep pivot 3 again (reverse the sweep)
(swept_twice <- SWEEP(swept_once, 3))
##      col1 col2 col3 col4
## row1    4   -2    4    2
## row2   -2   10   -2   -7
## row3    4   -2    8    4
## row4    2   -7    4    7
cat("\nVerification: Is the matrix restored?\n")
## 
## Verification: Is the matrix restored?
identical_check <- all.equal(A, swept_twice)
print(identical_check)  # Should return TRUE
## [1] TRUE
cat("Original Matrix A\n")
## Original Matrix A
A
##      col1 col2 col3 col4
## row1    4   -2    4    2
## row2   -2   10   -2   -7
## row3    4   -2    8    4
## row4    2   -7    4    7
cat("\nAfter sweeping twice\n")
## 
## After sweeping twice
SWEEP(SWEEP(A, 2), 2) 
##      col1 col2 col3 col4
## row1    4   -2    4    2
## row2   -2   10   -2   -7
## row3    4   -2    8    4
## row4    2   -7    4    7

Little and Rubin version. Note that the SWEEP() implementation above follows the self-inverse convention, in which the swept pivot becomes \(+1/g_{kk}\), whereas RSW() assumes the Little & Rubin definition, in which a swept pivot equals \(-1/g_{kk}\). Because of this sign difference, applying RSW() after SWEEP() restores everything except the sign of the swept diagonal element, as the check below shows.

cat("Original matrix A:\n")
## Original matrix A:
print(A)
##      col1 col2 col3 col4
## row1    4   -2    4    2
## row2   -2   10   -2   -7
## row3    4   -2    8    4
## row4    2   -7    4    7
cat("\nAfter applying SWEEP on pivot 3:\n")
## 
## After applying SWEEP on pivot 3:
swp_result <- SWEEP(A, 3)
print(swp_result)
##      col1  col2   col3 col4
## row1  2.0 -1.00 -0.500  0.0
## row2 -1.0  9.50  0.250 -6.0
## row3  0.5 -0.25  0.125  0.5
## row4  0.0 -6.00 -0.500  5.0
cat("\nAfter applying RSW on pivot 3 (reverse sweep):\n")
## 
## After applying RSW on pivot 3 (reverse sweep):
rsw_result <- RSW(swp_result, 3)
print(rsw_result)
##      col1 col2 col3 col4
## row1    4   -2    4    2
## row2   -2   10   -2   -7
## row3    4   -2   -8    4
## row4    2   -7    4    7
cat("\nVerification that RSW(SWEEP(A, 3), 3) returns the original matrix:\n")
## 
## Verification that RSW(SWEEP(A, 3), 3) returns the original matrix:
print(all.equal(A, rsw_result))
## [1] "Mean relative difference: 2"

Sweeping is Commutative

Sweeping on row \(r\) and then on row \(s\) is equivalent to sweeping on \(s\) and then on \(r\). Formally,

\[ \text{SWEEP}_{[r,s]}(G) = \text{SWEEP}_{[s,r]}(G) \]

This property shows that the order of sweeping different pivots does not affect the final result.

cat("Sweeping on pivots 1 to 4 in ascending order:\n")
## Sweeping on pivots 1 to 4 in ascending order:
swp_asc <- SWEEP(A, 1:4)
print(round(swp_asc, 2))
##       col1  col2  col3  col4
## row1  0.64  0.28 -0.42  0.33
## row2  0.28  0.56 -0.33  0.67
## row3 -0.42 -0.33  0.50 -0.50
## row4  0.33  0.67 -0.50  1.00
cat("\nSweeping on pivots 4 to 1 in descending order:\n")
## 
## Sweeping on pivots 4 to 1 in descending order:
swp_desc <- SWEEP(A, 4:1)
print(round(swp_desc, 2))
##       col1  col2  col3  col4
## row1  0.64  0.28 -0.42  0.33
## row2  0.28  0.56 -0.33  0.67
## row3 -0.42 -0.33  0.50 -0.50
## row4  0.33  0.67 -0.50  1.00
cat("\nCheck if results are identical:\n")
## 
## Check if results are identical:
print(all.equal(swp_asc, swp_desc))
## [1] TRUE

2.2 Matrix Inverse

Definition:
Let \(A_{n \times n}\) be a square matrix. If there exists a matrix \(A^{-1}\) such that

\[ A A^{-1} = A^{-1} A = I \]

where \(I\) is the identity matrix, then \(A^{-1}\) is called the inverse of \(A\).

  • A square matrix \(A\) has an inverse if and only if its determinant \(|A| \neq 0\).
  • Such a matrix is called nonsingular or invertible.
  • If a matrix is not invertible, it is called singular or degenerate.
  • Randomly chosen square matrices from continuous distributions are almost always nonsingular.
  • Non-square matrices (\(m \times n\) with \(m \neq n\)) do not have inverses.
  • Matrix inversion can be computed using methods such as Gauss-Jordan elimination, Gaussian elimination, or LU decomposition.

Matrix inversion is central to regression: the coefficient estimates that minimize the sum of squared errors are obtained from the normal equations, \(\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top y\).
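As a sketch of that use (base R's solve() on the cement data from Section 1.1.1; the SWEEP-based INV() defined below could be substituted once it is available), the least-squares coefficients solve the normal equations:

Xd <- cbind(Intercept = 1, X[, c("X1", "X2", "X3", "X4")])   # design matrix with intercept
y  <- X[, "Y"]
beta_hat <- solve(t(Xd) %*% Xd) %*% (t(Xd) %*% y)            # (X'X)^{-1} X'y
beta_hat
# Cross-check against lm(); the estimates agree, although this data set is
# famously ill-conditioned, so small numerical differences are expected.
coef(lm(Y ~ X1 + X2 + X3 + X4, data = as.data.frame(X)))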


#' Matrix Inversion via SWEEP Operator
#'
#' Computes the inverse of a square matrix using sequential SWEEP operations.
#' Provides numerical stability checks and preserves matrix properties.
#'
#' @param A A square numeric matrix
#' @param tol Tolerance for singularity detection (default = 1e-10)
#' @return The inverse matrix
#' @export
#' @examples
#' A <- matrix(c(4,-2,4,2, -2,10,-2,-7, 4,-2,8,4, 2,-7,4,7), nrow=4)
#' INV(A)  # Matrix inverse
#' solve(A)  # Base R equivalent
INV <- function(A, tol = 1e-10) {
  # Input validation
  if (!is.matrix(A) || !is.numeric(A)) {
    stop("Input must be a numeric matrix")
  }
  if (nrow(A) != ncol(A)) {
    stop("Matrix must be square")
  }
  
  # Load SWEEP function if needed
  if (!exists("SWEEP")) {
    if (file.exists("SWEEP.R")) {
      source("SWEEP.R")
    } else {
      stop("SWEEP.R not found in working directory")
    }
  }
  
  A0 <- as.matrix(A)
  r <- ncol(A0)
  orig_names <- dimnames(A0)
  
  # Check for near-singularity (use the absolute determinant so the sign is irrelevant)
  if (abs(det(A0)) < tol) {
    warning("Matrix is nearly singular - results may be numerically unstable")
  }
  
  # Perform sequential sweeps
  for (k in 1:r) {
    pivot <- A0[k,k]
    if (abs(pivot) < tol) {
      stop(sprintf("Near-zero pivot (%.2e) at position %d", pivot, k))
    }
    A0 <- SWEEP(A0, k)
  }
  
  # Restore names and return
  dimnames(A0) <- orig_names
  return(A0)
}

# Save function
dump("INV", file = "INV.R")
options(digits = 2)

cat("Original matrix A:\n")
## Original matrix A:
print(A)
##      col1 col2 col3 col4
## row1    4   -2    4    2
## row2   -2   10   -2   -7
## row3    4   -2    8    4
## row4    2   -7    4    7
cat("\nInverse via SWEEP (INV function):\n")
## 
## Inverse via SWEEP (INV function):
(A_inv <- INV(A))
##       col1  col2  col3  col4
## row1  0.64  0.28 -0.42  0.33
## row2  0.28  0.56 -0.33  0.67
## row3 -0.42 -0.33  0.50 -0.50
## row4  0.33  0.67 -0.50  1.00
#       col1  col2  col3  col4
# row1  0.64  0.28 -0.42  0.33
# row2  0.28  0.56 -0.33  0.67
# row3 -0.42 -0.33  0.50 -0.50
# row4  0.33  0.67 -0.50  1.00

cat("\nVerification (A %*% A_inv should be identity):\n")
## 
## Verification (A %*% A_inv should be identity):
round(A %*% A_inv, 10)
##      col1 col2 col3 col4
## row1    1    0    0    0
## row2    0    1    0    0
## row3    0    0    1    0
## row4    0    0    0    1
#      col1 col2 col3 col4
# row1    1    0    0    0
# row2    0    1    0    0
# row3    0    0    1    0
# row4    0    0    0    1

2.3 Correlation Matrix

A correlation matrix is a covariance matrix calculated on variables that are standardized to have a mean of 0 and a standard deviation of 1.

The determinant of the correlation matrix is 1.0 only if all correlations are 0; otherwise, the determinant is less than 1.

The correlation between variables \(X_i\) and \(X_j\) is defined as:

\[ \operatorname{corr}(X_i, X_j) = \frac{\operatorname{cov}(X_i, X_j)}{\sigma_{X_i} \sigma_{X_j}} \\ \\ = \begin{bmatrix} 1 & \displaystyle{\frac{\sigma(X_1, X_2)}{\sigma_{X_1}\sigma_{X_2}}} & \displaystyle{\frac{\sigma(X_1, X_3)}{\sigma_{X_1}\sigma_{X_3}}} & \cdots \\ \\ \displaystyle{\frac{\sigma(X_2, X_1)}{\sigma_{X_2}\sigma_{X_1}}} & 1 & \displaystyle{\frac{\sigma(X_2, X_3)}{\sigma_{X_2}\sigma_{X_3}}} & \cdots \\ \\ \displaystyle{\frac{\sigma(X_3, X_1)}{\sigma_{X_3}\sigma_{X_1}}} & \displaystyle{\frac{\sigma(X_3, X_2)}{\sigma_{X_3}\sigma_{X_2}}} & 1 & \cdots \\ \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix} \]

where:

\(\qquad \sigma(X_i, X_j)\) is the covariance between \(X_i\) and \(X_j\),

\(\qquad \sigma_{X_i}\) is the standard deviation of \(X_i\),

\(\qquad\)The diagonal elements are always 1 since they represent the correlation of each variable with itself.
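Both points can be checked directly in base R with the cement data (a small sketch; scale() standardizes each column to mean 0 and standard deviation 1):

Z <- scale(X)                                          # standardized columns
all.equal(cov(Z), cor(X), check.attributes = FALSE)    # covariance of standardized data = correlation matrix
det(cor(X))                                            # well below 1, since the variables are correlated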

#' Compute Correlation Matrix with Multiple NA Handling Options
#'
#' Calculates Pearson correlation matrix using matrix operations, with options for
#' different missing data handling methods.
#'
#' @param Y A numeric matrix or data frame (n × p)
#' @param na_action Method for handling NAs: 
#'        "listwise" (complete cases, default), 
#'        "pairwise" (available cases for each pair),
#'        "complete" (only complete cases, alias for listwise)
#' @return A symmetric p × p correlation matrix
#' @export
#' @examples
#' data <- matrix(rnorm(100), ncol=5)
#' data[sample(100, 10)] <- NA  # Add missing values
#' 
#' COR(data)  # Listwise deletion
#' COR(data, na_action="pairwise")  # Pairwise deletion
#' stats::cor(data, use="complete.obs")  # Base R equivalent
COR <- function(Y, na_action = c("listwise", "pairwise", "complete")) {
  # Input validation
  if (!is.matrix(Y) && !is.data.frame(Y)) {
    stop("Input must be a matrix or data frame")
  }
  na_action <- match.arg(na_action)
  if (na_action == "complete") na_action <- "listwise"
  
  # Load dependencies if needed
  required_funs <- c("CSSCP", "COV", "INV")
  for (fun in required_funs) {
    if (!exists(fun)) {
      if (file.exists(paste0(fun, ".R"))) {
        source(paste0(fun, ".R"))
      } else {
        stop(paste(fun, "function not found in working directory"))
      }
    }
  }
  
  # Handle missing data
  Y <- as.matrix(Y)
  if (na_action == "listwise") {
    Y <- Y[complete.cases(Y), , drop = FALSE]
    n <- nrow(Y)
    if (n < 2) {
      warning("Insufficient complete cases (n < 2)")
      return(matrix(NA, ncol(Y), ncol(Y)))
    }
    
    # Standardize using listwise means/SDs
    V <- diag(COV(Y))
    S <- diag(sqrt(V), ncol(Y), ncol(Y))
    C <- CSSCP(Y)/(n - 1)
    corr <- INV(S) %*% C %*% INV(S)
    
  } else { # pairwise
    # Pairwise covariance computation
    corr <- matrix(NA, ncol(Y), ncol(Y))
    for (i in 1:ncol(Y)) {
      for (j in 1:i) {
        complete_ij <- complete.cases(Y[, c(i,j)])
        if (sum(complete_ij) >= 2) {
          cov_ij <- cov(Y[complete_ij, i], Y[complete_ij, j])
          sd_i <- sd(Y[complete_ij, i])
          sd_j <- sd(Y[complete_ij, j])
          corr[i,j] <- corr[j,i] <- cov_ij/(sd_i * sd_j)
        }
      }
    }
    diag(corr) <- 1
  }
  
  # Preserve dimension names
  if (is.null(colnames(Y))) {
    dimnames(corr) <- list(paste0("X", 1:ncol(Y)), paste0("X", 1:ncol(Y)))
  } else {
    dimnames(corr) <- list(colnames(Y), colnames(Y))
  }
  
  return(corr)
}

# Save function
dump("COR", file = "COR.R")
# Demonstration with X matrix
options(digits = 2)
cat("Listwise deletion:\n")
## Listwise deletion:
COR(X)
##       X1    X2    X3    X4     Y
## X1  1.00  0.23 -0.82 -0.25  0.73
## X2  0.23  1.00 -0.14 -0.97  0.82
## X3 -0.82 -0.14  1.00  0.03 -0.53
## X4 -0.25 -0.97  0.03  1.00 -0.82
## Y   0.73  0.82 -0.53 -0.82  1.00
cat("\nPairwise deletion:\n")
## 
## Pairwise deletion:
COR(X, na_action = "pairwise")
##       X1    X2    X3    X4     Y
## X1  1.00  0.23 -0.82 -0.25  0.73
## X2  0.23  1.00 -0.14 -0.97  0.82
## X3 -0.82 -0.14  1.00  0.03 -0.53
## X4 -0.25 -0.97  0.03  1.00 -0.82
## Y   0.73  0.82 -0.53 -0.82  1.00
cat("\nBase R comparison (complete cases):\n")
## 
## Base R comparison (complete cases):
stats::cor(X, use = "complete.obs")
##       X1    X2    X3    X4     Y
## X1  1.00  0.23 -0.82 -0.25  0.73
## X2  0.23  1.00 -0.14 -0.97  0.82
## X3 -0.82 -0.14  1.00  0.03 -0.53
## X4 -0.25 -0.97  0.03  1.00 -0.82
## Y   0.73  0.82 -0.53 -0.82  1.00
cat("\nBase R comparison (pairwise):\n")
## 
## Base R comparison (pairwise):
stats::cor(X, use = "pairwise.complete.obs")
##       X1    X2    X3    X4     Y
## X1  1.00  0.23 -0.82 -0.25  0.73
## X2  0.23  1.00 -0.14 -0.97  0.82
## X3 -0.82 -0.14  1.00  0.03 -0.53
## X4 -0.25 -0.97  0.03  1.00 -0.82
## Y   0.73  0.82 -0.53 -0.82  1.00

2.4 Determinant Calculation via SWEEP Operator

The determinant of a matrix \(A\) can be computed using the SWEEP operator through the following mathematical relationship:

\[ \det(A) = \prod_{k=1}^{n} d_{kk}^{(k-1)} \]

where:

\(\qquad A\) is an \(n \times n\) square matrix

\(\qquad d_{kk}^{(k-1)}\) is the \(k\)-th pivot, i.e. the \((k,k)\) element of the matrix obtained after sweeping on pivots \(1, \ldots, k-1\) (for \(k = 1\) this is simply \(a_{11}\))

\(\qquad\)The product is taken over all \(n\) pivots

Key Properties:

  1. Triangular Matrices: \[ \text{If } A \text{ is triangular, } \det(A) = \prod_{i=1}^{n} a_{ii} \]

  2. Singularity Test: \[ \det(A) = 0 \iff A \text{ is singular} \]

  3. Multiplicative Property: \[ \det(AB) = \det(A)\det(B) \]
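A few quick base R checks of these properties, using small illustrative matrices (names here are arbitrary):

T1 <- matrix(c(2, 0, 5, 3), 2, 2)        # upper triangular: det = 2 * 3
det(T1)                                  # 6
det(matrix(c(1, 2, 2, 4), 2, 2))         # 0: the columns are linearly dependent (singular)
B1 <- matrix(c(1, 2, 3, 4), 2, 2)
det(T1 %*% B1) - det(T1) * det(B1)       # multiplicative property: difference is ~0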

Computational Notes: The SWEEP-based calculation is mathematically equivalent to the pivot-product formula from Gaussian elimination: \[ \det(A) = (-1)^s \prod_{k=1}^{n} p_k \] where:

\(\qquad p_k\) are the pivot elements, and

\(\qquad s\) is the number of row exchanges (zero here, since the sweeps proceed in order without exchanging rows)

#' Matrix Determinant via SWEEP Operator
#'
#' Computes the determinant of a square matrix using sequential SWEEP operations,
#' with improved numerical stability and error handling.
#'
#' @param S A square numeric matrix
#' @param tol Tolerance for singularity detection (default = 1e-10)
#' @return The determinant as a numeric scalar
#' @export
#' @examples
#' A <- matrix(c(4,-2,4,2, -2,10,-2,-7, 4,-2,8,4, 2,-7,4,7), nrow=4)
#' DET(A)  # Returns 144
#' det(A)  # Base R equivalent
DET <- function(S, tol = 1e-10) {
  # Input validation
  if (!is.matrix(S) || !is.numeric(S)) {
    stop("Input must be a numeric matrix")
  }
  if (nrow(S) != ncol(S)) {
    stop("Matrix must be square")
  }
  
  # Load SWEEP function if needed
  if (!exists("SWEEP")) {
    if (file.exists("SWEEP.R")) {
      source("SWEEP.R")
    } else {
      stop("SWEEP.R not found in working directory")
    }
  }
  
  A <- as.matrix(S)
  r <- ncol(A)
  
  # Handle 1×1 matrix case
  if (r == 1) {
    return(as.numeric(A[1,1]))
  }
  
  # Initialize determinant
  det_value <- A[1,1]
  if (abs(det_value) < tol) {
    warning("Initial pivot is near zero - determinant may be unstable")
  }
  
  # Perform sequential sweeps
  if (r > 1) {
    A_swept <- SWEEP(A, 1)
    for (k in 2:r) {
      pivot <- A_swept[k,k]
      if (abs(pivot) < tol) {
        warning(sprintf("Near-zero pivot (%.2e) at position %d", pivot, k))
      }
      det_value <- det_value * pivot
      A_swept <- SWEEP(A_swept, k)
    }
  }
  
  # Return as numeric (not matrix)
  return(as.numeric(det_value))
}

# Save function
dump("DET", file = "DET.R")
# Demonstration with matrix A
cat("Matrix A:\n")
## Matrix A:
print(A)
##      col1 col2 col3 col4
## row1    4   -2    4    2
## row2   -2   10   -2   -7
## row3    4   -2    8    4
## row4    2   -7    4    7
cat("\nDeterminant via SWEEP (DET function):\n")
## 
## Determinant via SWEEP (DET function):
(DET_result <- DET(A))
## [1] 144
# [1] 144

cat("\nBase R determinant (det function):\n")
## 
## Base R determinant (det function):
(det_result <- det(A))
## [1] 144
# [1] 144

cat("\nVerification (difference):", DET_result - det_result, "\n")
## 
## Verification (difference): 0
# Verification (difference): 0 

# Edge case testing
cat("\n1×1 matrix test:\n")
## 
## 1×1 matrix test:
(mat_1x1 <- matrix(189.6))
##      [,1]
## [1,]  190
DET(mat_1x1)  # Returns 189.6
## [1] 190

2.5 Cholesky Decomposition

2.5.1 Definition

Let \(A\) be a symmetric positive definite matrix. Then \(A\) can be uniquely decomposed as \(A = U^T U\)

where:

\(\qquad U\) is an upper triangular matrix with positive diagonal entries

\(\qquad U^T\) is the conjugate transpose of \(U\) (simply the transpose for real matrices)

Matrix Formulation

For a 3×3 matrix \(A\):

\[ A = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} = U^T U = \begin{bmatrix} U_{11} & 0 & 0 \\ U_{12} & U_{22} & 0 \\ U_{13} & U_{23} & U_{33} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} & U_{13} \\ 0 & U_{22} & U_{23} \\ 0 & 0 & U_{33} \end{bmatrix} \]

#' Robust Cholesky Decomposition
#'
#' Computes the upper triangular Cholesky factor U such that A = U'U
#' with improved error handling for edge cases.
#'
#' @param C A symmetric positive definite square matrix
#' @return Upper triangular matrix U satisfying A = U'U
#' @references Watkins, D.S. (2004). Fundamentals of Matrix Computations
#' @examples
#' # Positive definite matrix
#' A <- matrix(c(4,-2,4,2, -2,10,-2,-7, 4,-2,8,4, 2,-7,4,7), nrow=4)
#' U <- cholesky(A)
cholesky <- function(C) {
  # Input validation
  if (!is.matrix(C)) stop("Input must be a matrix")
  n <- nrow(C)
  if (n != ncol(C)) stop("Matrix must be square")
  if (n == 0) stop("Empty matrix provided")
  
  c <- as.matrix(C)
  
  # Special case for 1×1 matrix
  if (n == 1) {
    if (c[1,1] <= 0) stop("1×1 matrix must have positive entry")
    return(matrix(sqrt(c[1,1])))
  }
  
  for (i in 1:n) {
    # Diagonal elements
    if (i > 1) {  # Only subtract if i > 1
      for (k in 1:(i-1)) {
        c[i,i] <- c[i,i] - c[k,i]^2
      }
    }
    
    # Check positive definiteness
    if (c[i,i] <= sqrt(.Machine$double.eps)) {
      stop(sprintf("Matrix is not positive definite at pivot %d", i))
    }
    
    c[i,i] <- sqrt(c[i,i])
    
    # Off-diagonal elements (only if not last row)
    if (i < n) {
      for (j in (i+1):n) {
        if (i > 1) {  # Only subtract if i > 1
          for (k in 1:(i-1)) {
            c[i,j] <- c[i,j] - c[k,i]*c[k,j]
          }
        }
        c[i,j] <- c[i,j]/c[i,i]
        c[j,i] <- 0  # Explicitly zero lower triangle
      }
    }
  }
  return(c)
}
# Test cases
A1 <- matrix(4)  # 1×1 positive
A2 <- matrix(-1) # 1×1 negative
A3 <- matrix(c(4,-2,4,2, -2,10,-2,-7, 4,-2,8,4, 2,-7,4,7), nrow=4) # 4×4
A4 <- matrix(0, 0, 0) # Empty matrix

cholesky(A1)  # Works
##      [,1]
## [1,]    2
cholesky(A3)  # Works
##      [,1] [,2] [,3] [,4]
## [1,]    2   -1    2    1
## [2,]    0    3    0   -2
## [3,]    0    0    2    1
## [4,]    0    0    0    1
# --------------------------------------------------
# Cholesky Decomposition Demonstration
# --------------------------------------------------

### 1. Input Matrix (Positive Definite) ###
cat("Original matrix A:\n")
## Original matrix A:
(A <- matrix(c(4, -2, 4, 2,
               -2, 10, -2, -7,
               4, -2, 8, 4,
               2, -7, 4, 7), 
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("Row",1:4), paste0("Col",1:4))))
##      Col1 Col2 Col3 Col4
## Row1    4   -2    4    2
## Row2   -2   10   -2   -7
## Row3    4   -2    8    4
## Row4    2   -7    4    7
### 2. Cholesky Decomposition ###
cat("\nComputing Cholesky decomposition...\n")
## 
## Computing Cholesky decomposition...
(U <- cholesky(A))
##      Col1 Col2 Col3 Col4
## Row1    2   -1    2    1
## Row2    0    3    0   -2
## Row3    0    0    2    1
## Row4    0    0    0    1
# Upper triangular factor U:
#       Col1  Col2 Col3 Col4
# Row1  2.00 -1.00 2.00 1.00
# Row2  0.00  3.00 0.00 -2.00
# Row3  0.00  0.00 2.00 1.00
# Row4  0.00  0.00 0.00 1.00

### 3. Lower Triangular Factor ###
cat("\nLower triangular factor L (t(U)):\n")
## 
## Lower triangular factor L (t(U)):
(L <- t(U))
##      Row1 Row2 Row3 Row4
## Col1    2    0    0    0
## Col2   -1    3    0    0
## Col3    2    0    2    0
## Col4    1   -2    1    1
#       Col1 Col2 Col3 Col4
# Row1  2.00 0.00 0.00 0.00
# Row2 -1.00 3.00 0.00 0.00
# Row3  2.00 0.00 2.00 0.00
# Row4  1.00 -2.00 1.00 1.00

### 4. Verification ###
cat("\nVerifying decomposition A = L %*% U:\n")
## 
## Verifying decomposition A = L %*% U:
(reconstructed_A <- L %*% U)
##      Col1 Col2 Col3 Col4
## Col1    4   -2    4    2
## Col2   -2   10   -2   -7
## Col3    4   -2    8    4
## Col4    2   -7    4    7
cat("Maximum reconstruction error:", 
    max(abs(A - reconstructed_A)), "\n")
## Maximum reconstruction error: 0
# [1] Maximum reconstruction error: 0

### 5. Comparison with Base R ###
cat("\nBase R chol() result:\n")
## 
## Base R chol() result:
(base_U <- chol(A))
##      Col1 Col2 Col3 Col4
## Row1    2   -1    2    1
## Row2    0    3    0   -2
## Row3    0    0    2    1
## Row4    0    0    0    1
cat("Maximum difference with our implementation:", 
    max(abs(U - base_U)), "\n")
## Maximum difference with our implementation: 0
# [1] Maximum difference with our implementation: 0

2.6 Forward Substitution

Forward substitution is an iterative method used to solve matrix equations of the form \(Lx=b\), where \(L\) is a lower triangular matrix. Similarly, back substitution is applied to solve equations involving upper triangular matrices, \(Ux=b\).
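Written element-wise, the two substitutions reduce to simple recurrences, which are exactly what the functions below implement:

\[ y_i = \frac{b_i - \sum_{j=1}^{i-1} L_{ij}\, y_j}{L_{ii}}, \quad i = 1, \dots, n, \qquad x_i = \frac{y_i - \sum_{j=i+1}^{n} U_{ij}\, x_j}{U_{ii}}, \quad i = n, \dots, 1. \]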

#' Forward Substitution for Lower Triangular Systems
#'
#' Solves the system Ly = b where L is lower triangular using forward substitution.
#'
#' @param L A lower triangular numeric matrix
#' @param b A numeric vector or column matrix
#' @return The solution vector y
#' @references Watkins, D.S. (2002). Fundamentals of Matrix Computations
#' @examples
#' L <- matrix(c(2,0,0,0, -1,3,0,0, 2,0,2,0, 1,-2,1,1), nrow=4, byrow=TRUE)
#' b <- c(8,2,16,6)
#' forward_solve(L, b)
forward_solve <- function(L, b) {
  # Input validation
  if (!is.matrix(L)) stop("L must be a matrix")
  if (!is.numeric(b)) stop("b must be numeric")
  if (nrow(L) != ncol(L)) stop("L must be square")
  if (nrow(L) != length(b)) stop("Dimensions of L and b must match")
  
  n <- nrow(L)
  b <- as.matrix(b)
  
  for (i in 1:n) {
    # Check for zero pivot
    if (abs(L[i,i]) < .Machine$double.eps) {
      stop(sprintf("Zero pivot encountered at position %d", i))
    }
    
    # Forward substitution
    if (i > 1) {
      for (j in 1:(i-1)) {
        b[i] <- b[i] - L[i,j] * b[j]
      }
    }
    b[i] <- b[i] / L[i,i]
  }
  
  # Return as column vector with names if available
  if (!is.null(rownames(L))) {
    rownames(b) <- rownames(L)
  }
  return(b)
}

# Save function
dump("forward_solve", file = "forward_solve.R")
# --------------------------------------------------
# Demonstration with Cholesky Decomposition
# --------------------------------------------------

### 1. Input Setup ###
cat("Lower triangular matrix L:\n")
## Lower triangular matrix L:
(L <- matrix(c(2,0,0,0, -1,3,0,0, 2,0,2,0, 1,-2,1,1), 
             nrow=4, byrow=TRUE, dimnames=list(paste0("Eq",1:4),NULL)))
##     [,1] [,2] [,3] [,4]
## Eq1    2    0    0    0
## Eq2   -1    3    0    0
## Eq3    2    0    2    0
## Eq4    1   -2    1    1
cat("\nRight-hand side vector b:\n")
## 
## Right-hand side vector b:
(b <- matrix(c(8,2,16,6), ncol=1, dimnames=list(paste0("Eq",1:4),NULL)))
##     [,1]
## Eq1    8
## Eq2    2
## Eq3   16
## Eq4    6
### 2. Forward Substitution ###
cat("\nSolving Ly = b via forward substitution:\n")
## 
## Solving Ly = b via forward substitution:
(y <- forward_solve(L, b))
##     [,1]
## Eq1    4
## Eq2    2
## Eq3    4
## Eq4    2

### 3. Verification ###
cat("\nVerification (L %*% y should equal b):\n")
## 
## Verification (L %*% y should equal b):
(L %*% y)
##     [,1]
## Eq1    8
## Eq2    2
## Eq3   16
## Eq4    6

### 4. Comparison with Base R ###
cat("\nBase R forwardsolve() result:\n")
## 
## Base R forwardsolve() result:
(base_y <- forwardsolve(L, b))
##      [,1]
## [1,]    4
## [2,]    2
## [3,]    4
## [4,]    2
cat("Maximum difference:", max(abs(y - base_y)), "\n")
## Maximum difference: 0
#' Backward Substitution for Upper Triangular Systems
#'
#' Solves the system Ux = y where U is upper triangular using backward substitution.
#'
#' @param U An upper triangular numeric matrix
#' @param y A numeric vector or column matrix
#' @return The solution vector x
#' @references Watkins, D.S. (2002). Fundamentals of Matrix Computations
#' @examples
#' U <- matrix(c(2,-1,2,1, 0,3,0,-2, 0,0,2,1, 0,0,0,1), nrow=4, byrow=TRUE)
#' y <- c(4,2,4,2)
#' backward_solve(U, y)
backward_solve <- function(U, y) {
  # Input validation
  if (!is.matrix(U)) stop("U must be a matrix")
  if (!is.numeric(y)) stop("y must be numeric")
  if (nrow(U) != ncol(U)) stop("U must be square")
  if (nrow(U) != length(y)) stop("Dimensions of U and y must match")
  
  n <- nrow(U)
  y <- as.matrix(y)
  
  for (i in seq(n, 1, by = -1)) {  # Backward iteration
    # Check for zero pivot
    if (abs(U[i,i]) < .Machine$double.eps) {
      stop(sprintf("Zero pivot encountered at position %d", i))
    }
    
    # Backward substitution
    if (i < n) {
      for (j in (i+1):n) {
        y[i] <- y[i] - U[i,j] * y[j]
      }
    }
    y[i] <- y[i] / U[i,i]
  }
  
  # Return as named vector
  if (!is.null(rownames(U))) {
    names(y) <- rownames(U)
  }
  return(as.vector(y))
}

# Save function
dump("backward_solve", file = "backward_solve.R")
# --------------------------------------------------
# Complete Linear System Solver Demonstration 
# --------------------------------------------------

### Utility Functions ###
is_upper_triangular <- function(M) all(M[lower.tri(M)] == 0)
is_lower_triangular <- function(M) all(M[upper.tri(M)] == 0)

### 1. Matrix Setup ###
cat("Original matrix A:\n")
## Original matrix A:
(A <- matrix(c(
  4, -2,  4,  2,
  -2, 10, -2, -7,
  4, -2,  8,  4,
  2, -7,  4,  7
), nrow = 4, byrow = TRUE))
##      [,1] [,2] [,3] [,4]
## [1,]    4   -2    4    2
## [2,]   -2   10   -2   -7
## [3,]    4   -2    8    4
## [4,]    2   -7    4    7
cat("\nRight-hand side vector b:\n")
## 
## Right-hand side vector b:
(b <- matrix(c(8, 2, 16, 6), ncol = 1))
##      [,1]
## [1,]    8
## [2,]    2
## [3,]   16
## [4,]    6
### 2. Cholesky Decomposition ###
cat("\nCholesky decomposition: A = L %*% t(L)\n")
## 
## Cholesky decomposition: A = L %*% t(L)
L <- t(chol(A))  # Lower triangular
cat("Lower triangular matrix L:\n")
## Lower triangular matrix L:
print(L)
##      [,1] [,2] [,3] [,4]
## [1,]    2    0    0    0
## [2,]   -1    3    0    0
## [3,]    2    0    2    0
## [4,]    1   -2    1    1
U <- t(L)  # Upper triangular

cat("\nIs L lower triangular? ", is_lower_triangular(L), "\n")
## 
## Is L lower triangular?  TRUE
cat("Is U upper triangular? ", is_upper_triangular(U), "\n")
## Is U upper triangular?  TRUE
### 3. Forward and Backward Substitution ###
cat("\nStep 1: Solving Ly = b via forward substitution:\n")
## 
## Step 1: Solving Ly = b via forward substitution:
y <- forward_solve(L, b)
print(y)
##      [,1]
## [1,]    4
## [2,]    2
## [3,]    4
## [4,]    2
cat("\nStep 2: Solving Ux = y via backward substitution:\n")
## 
## Step 2: Solving Ux = y via backward substitution:
x <- backward_solve(U, y)
print(x)
## [1] 1 2 1 2
### 4. Verification ###
cat("\nVerification: Does A %*% x equal b?\n")
## 
## Verification: Does A %*% x equal b?
print(A %*% x)
##      [,1]
## [1,]    8
## [2,]    2
## [3,]   16
## [4,]    6
cat("Residual norm (||Ax - b||):", norm(A %*% x - b, "2"), "\n")
## Residual norm (||Ax - b||): 0
#' Solve Linear System Using Cholesky Decomposition
#'
#' Solves the system Ax = b for symmetric positive definite matrices using:
#' 1. Cholesky decomposition A = U'U
#' 2. Forward substitution to solve U'y = b
#' 3. Back substitution to solve Ux = y
#'
#' @param A A symmetric positive definite matrix
#' @param b Right-hand side vector or matrix
#' @return Solution vector x
#' @examples
#' A <- matrix(c(4,-2,4,2, -2,10,-2,-7, 4,-2,8,4, 2,-7,4,7), nrow=4)
#' b <- matrix(c(8,2,16,6), ncol=1)
#' solve_cholesky(A, b)
solve_cholesky <- function(A, b) {
  # Input validation
  if (!is.matrix(A) || !is.numeric(A)) stop("A must be numeric matrix")
  if (!is.matrix(b) && !is.vector(b)) stop("b must be vector or matrix")
  if (nrow(A) != ncol(A)) stop("A must be square")
  if (nrow(A) != length(b)) stop("A and b dimension mismatch")
  
  # Compute Cholesky decomposition
  U <- cholesky(A)
  
  # Forward substitution (solve U'y = b)
  y <- forwardsolve(t(U), b)
  
  # Back substitution (solve Ux = y)
  x <- backsolve(U, y)
  
  # Preserve names if available
  if (!is.null(rownames(A))) {
    names(x) <- rownames(A)
  }
  
  return(x)
}
# Demonstration with verification
A <- matrix(c(4,-2,4,2, -2,10,-2,-7, 4,-2,8,4, 2,-7,4,7), nrow=4)
b <- matrix(c(8,2,16,6), ncol=1)

cat("Solving Ax = b for:\n")
## Solving Ax = b for:
cat("A:\n"); print(A)
## A:
##      [,1] [,2] [,3] [,4]
## [1,]    4   -2    4    2
## [2,]   -2   10   -2   -7
## [3,]    4   -2    8    4
## [4,]    2   -7    4    7
cat("\nb:\n"); print(b)
## 
## b:
##      [,1]
## [1,]    8
## [2,]    2
## [3,]   16
## [4,]    6
x <- solve_cholesky(A, b)

cat("\nSolution x:\n")
## 
## Solution x:
print(x)
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    1
## [4,]    2
cat("\nVerification (Ax should equal b):\n")
## 
## Verification (Ax should equal b):
cat("Computed Ax:\n"); computed_b <- A %*% x
## Computed Ax:
print(computed_b)
##      [,1]
## [1,]    8
## [2,]    2
## [3,]   16
## [4,]    6
cat("\nOriginal b:\n"); print(b)
## 
## Original b:
##      [,1]
## [1,]    8
## [2,]    2
## [3,]   16
## [4,]    6
cat("\nMaximum residual:", max(abs(computed_b - b)), "\n")
## 
## Maximum residual: 0
# Compare with base R
cat("\nBase R solve() solution:\n")
## 
## Base R solve() solution:
base_x <- solve(A, b)
print(base_x)
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    1
## [4,]    2
cat("\nMaximum difference:", max(abs(x - base_x)), "\n")
## 
## Maximum difference: 0

2.6.1 3.10 The Inverse of a Matrix by Cholesky Decomposition

The Cholesky decomposition can also be used for matrix inversion: once \(A\) is factored, only triangular matrices need to be inverted, and computing the inverse (or the determinant) of a triangular matrix is comparatively cheap.

Given the Cholesky decomposition of a positive definite matrix \(A\):

\[ A = U^T U \]

Then the inverse of \(A\) is:

\[ A^{-1} = (U^T U)^{-1} = U^{-1} (U^T)^{-1} = U^{-1} (U^{-1})^T \]

Similarly, if \(A = L L^T\), then:

\[ A^{-1} = (L L^T)^{-1} = (L^T)^{-1} L^{-1} = (L^{-1})^T L^{-1} \]
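Before looking at the custom implementation below, the identity can be sketched with base R's chol() and backsolve(); this is only an illustration of the idea, using the 4×4 example matrix from the previous sections:

A <- matrix(c(4,-2,4,2, -2,10,-2,-7, 4,-2,8,4, 2,-7,4,7), nrow = 4)

U     <- chol(A)                       # upper triangular factor, A = U'U
U_inv <- backsolve(U, diag(nrow(A)))   # U^{-1} via triangular back substitution
A_inv <- U_inv %*% t(U_inv)            # A^{-1} = U^{-1} (U^{-1})'

all.equal(A_inv, solve(A))             # should be TRUE (up to rounding)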

#' Matrix Inversion via Cholesky Decomposition
#'
#' Computes the inverse of a symmetric positive definite matrix using Cholesky decomposition.
#' More efficient and numerically stable than general inversion methods for SPD matrices.
#'
#' @param A A symmetric positive definite square matrix
#' @return The inverse matrix A⁻¹
#' @references
#' Watkins, D.S. (2002). Fundamentals of Matrix Computations
#' Golub & Van Loan (2013). Matrix Computations
#' @examples
#' A <- matrix(c(4,-2,4,2, -2,10,-2,-7, 4,-2,8,4, 2,-7,4,7), nrow=4)
#' INV2(A)  # Cholesky-based inverse
#' solve(A) # Base R equivalent
INV2 <- function(A) {
  # Input validation
  if (!is.matrix(A) || !is.numeric(A)) stop("Input must be numeric matrix")
  if (nrow(A) != ncol(A)) stop("Matrix must be square")
  # unname() so that mismatched row/column names do not trigger a spurious warning
  if (!isSymmetric(unname(A))) warning("Matrix not symmetric - using upper triangle only")
  
  # Load dependencies safely
  required_funs <- c("cholesky", "INV")
  for (fun in required_funs) {
    if (!exists(fun)) {
      if (file.exists(paste0(fun, ".R"))) {
        source(paste0(fun, ".R"))
      } else {
        stop(paste(fun, "function not found"))
      }
    }
  }
  
  # Compute Cholesky decomposition
  U <- cholesky(A)
  
  # Invert triangular matrix
  U_inv <- INV(U)  # Using triangular inversion
  
  # Compute inverse via Cholesky factors
  A_inv <- U_inv %*% t(U_inv)
  
  # Ensure symmetry (correct floating point errors)
  A_inv <- (A_inv + t(A_inv))/2
  
  # Preserve dimension names
  dimnames(A_inv) <- dimnames(A)
  return(A_inv)
}

# Save function
dump("INV2", file = "INV2.R")
# --------------------------------------------------
# Demonstration and Verification
# --------------------------------------------------

### 1. Input Matrix ###
cat("Original matrix A:\n")
## Original matrix A:
(A <- matrix(c(4,-2,4,2, -2,10,-2,-7, 4,-2,8,4, 2,-7,4,7), 
             nrow=4, dimnames=list(paste0("R",1:4), paste0("C",1:4))))
##    C1 C2 C3 C4
## R1  4 -2  4  2
## R2 -2 10 -2 -7
## R3  4 -2  8  4
## R4  2 -7  4  7
### 2. Cholesky-Based Inverse ###
cat("\nInverse via Cholesky (INV2):\n")
## 
## Inverse via Cholesky (INV2):
(A_inv <- INV2(A))
##       C1    C2    C3    C4
## R1  0.64  0.28 -0.42  0.33
## R2  0.28  0.56 -0.33  0.67
## R3 -0.42 -0.33  0.50 -0.50
## R4  0.33  0.67 -0.50  1.00

### 3. Verification ###
cat("\nVerification (A %*% A_inv should be identity):\n")
## 
## Verification (A %*% A_inv should be identity):
round(A %*% A_inv, 10)
##    C1 C2 C3 C4
## R1  1  0  0  0
## R2  0  1  0  0
## R3  0  0  1  0
## R4  0  0  0  1
#    C1 C2 C3 C4
# R1  1  0  0  0
# R2  0  1  0  0
# R3  0  0  1  0
# R4  0  0  0  1

### 4. Comparison with Other Methods ###
cat("\nBase R solve() result:\n")
## 
## Base R solve() result:
round(solve(A), 2)
##       R1    R2    R3    R4
## C1  0.64  0.28 -0.42  0.33
## C2  0.28  0.56 -0.33  0.67
## C3 -0.42 -0.33  0.50 -0.50
## C4  0.33  0.67 -0.50  1.00
cat("\nMaximum difference:", max(abs(A_inv - solve(A))), "\n")
## 
## Maximum difference: 1.1e-16
# Maximum difference: 1.110223e-16

cat("\nBase R chol2inv() result:\n")
## 
## Base R chol2inv() result:
round(chol2inv(chol(A)), 2)
##       [,1]  [,2]  [,3]  [,4]
## [1,]  0.64  0.28 -0.42  0.33
## [2,]  0.28  0.56 -0.33  0.67
## [3,] -0.42 -0.33  0.50 -0.50
## [4,]  0.33  0.67 -0.50  1.00
cat("\nMaximum difference:", max(abs(A_inv - chol2inv(chol(A)))), "\n")
## 
## Maximum difference: 0

2.6.2 3.11 The Determinant of a Matrix by Cholesky Decomposition

Given a positive definite matrix \(A\) with Cholesky decomposition

\[ A = U^T U, \]

the determinant of \(A\) is

\[ |A| = |U^T U| = |U^T| \cdot |U| = \prod_{i=1}^n u_{ii} \times \prod_{i=1}^n u_{ii} = \left( \prod_{i=1}^n u_{ii} \right)^2, \]

where \(u_{ii}\) are the diagonal elements of the upper triangular matrix \(U\).

#' Determinant via Cholesky Decomposition
#'
#' Computes the determinant of a symmetric positive definite matrix using its
#' Cholesky decomposition: det(A) = det(U)^2 = (∏u_ii)^2
#' 
#' @param A A symmetric positive definite matrix
#' @return The determinant of A
#' @examples
#' A <- matrix(c(4,-2,4,2, -2,10,-2,-7, 4,-2,8,4, 2,-7,4,7), nrow=4)
#' DET2(A)  # 144
#' det(A)   # Base R equivalent
DET2 <- function(A) {
  # Input validation
  if (!is.matrix(A) || !is.numeric(A)) stop("Input must be numeric matrix")
  if (nrow(A) != ncol(A)) stop("Matrix must be square")
  
  # Compute Cholesky decomposition
  U <- cholesky(A)
  
  # Calculate determinant as product of squared diagonals
  det_value <- prod(diag(U))^2
  
  return(det_value)
}

# Save function
dump("DET2", file = "DET2.R")
# Example matrix
A <- matrix(c(
  4, -2,  4,  2,
  -2, 10, -2, -7,
  4, -2,  8,  4,
  2, -7,  4,  7
), nrow = 4, byrow = TRUE)

# Compute determinant using DET2
det_chol <- DET2(A)
print(det_chol)  # Expected: 144
## [1] 144
# Verify with base R's determinant function
det_base <- det(A)
print(det_base)  # Should also be 144
## [1] 144

Moreover, in the Monte Carlo method for simulating systems with multiple correlated variables, the Cholesky decomposition is used to factorize the correlation matrix into a lower-triangular matrix \(L\). This Cholesky factor is then applied to a vector of uncorrelated samples to generate a sample vector that preserves the covariance structure of the system being modeled.

# --------------------------------------------------
# Monte Carlo Simulation Example Using Cholesky
# --------------------------------------------------

#' Generate Correlated Normal Samples
#'
#' @param n Number of samples
#' @param mu Mean vector
#' @param Sigma Covariance matrix
#' @return Matrix of correlated samples
#' @examples
#' Sigma <- matrix(c(1,0.8,0.8,1), nrow=2)
#' correlated_samples(100, c(0,0), Sigma)
correlated_samples <- function(n, mu, Sigma) {
  p <- length(mu)
  
  # Cholesky decomposition
  L <- t(cholesky(Sigma))
  
  # Generate uncorrelated samples
  Z <- matrix(rnorm(n*p), nrow=p)
  
  # Transform to correlated samples
  X <- L %*% Z
  
  # Add means
  X <- sweep(X, 1, mu, "+")
  
  return(t(X))
}
### Demonstration ###

# 1. Define covariance structure
Sigma <- matrix(c(4, 3, 3, 9), nrow=2)

# 2. Generate correlated samples
set.seed(123)
samples <- correlated_samples(1000, c(1,2), Sigma)

# 3. Verify covariance structure
empirical_cov <- cov(samples)
cat("Target covariance:\n")
## Target covariance:
print(Sigma)
##      [,1] [,2]
## [1,]    4    3
## [2,]    3    9
cat("\nEmpirical covariance:\n")
## 
## Empirical covariance:
print(empirical_cov)
##      [,1] [,2]
## [1,]  3.8  2.5
## [2,]  2.5  8.7
cat("\nMaximum difference:", max(abs(Sigma - empirical_cov)), "\n")
## 
## Maximum difference: 0.5
# 4. Plot results
plot(samples, main="Correlated Normal Samples",
     xlab="Variable 1", ylab="Variable 2",
     col=adjustcolor("blue", alpha=0.5), pch=1)

3.12 Row Reduction (Gaussian Elimination) and Echelon Forms

Row reduction is an algorithm for solving systems of linear equations in which a sequence of elementary row operations is used to modify the matrix until the lower left-hand corner of the matrix is filled with as many zeros as possible.

If for every non-zero row, the leading coefficient is to the right of the leading coefficient of the row above, then the matrix is said to be in row echelon form.

A rectangular matrix is in echelon form (or row echelon form) if it satisfies the following three properties:

  1. All nonzero rows are above any rows of all zeros.
  2. Each leading entry of a row is in a column to the right of the leading entry of the row above it.
  3. All entries in a column below a leading entry are zeros.

If a matrix in echelon form also satisfies the following additional conditions, then it is in reduced echelon form (or reduced row echelon form):

  1. The leading entry in each nonzero row is 1.
  2. Each leading 1 is the only nonzero entry in its column.

Using row operations to convert a matrix into reduced row echelon form is called Gauss–Jordan elimination.

If the matrix \(D\) represents a system of linear equations, then the reduced row echelon form allows one to write the solutions to the system.
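For example (a small illustration that is independent of the worked system below), a matrix in row echelon form and its reduced row echelon form, from which the solution \(x = 2\), \(y = 3\), \(z = -1\) can be read off directly:

\[ \begin{bmatrix} 2 & 1 & -1 & 8 \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} & 1 \\ 0 & 0 & -1 & 1 \end{bmatrix} \quad\longrightarrow\quad \begin{bmatrix} 1 & 0 & 0 & 2 \\ 0 & 1 & 0 & 3 \\ 0 & 0 & 1 & -1 \end{bmatrix} \]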

As described in (Kincaid, D. R. & Cheney, E. W., 2008, p. 267):

#' Reduced Row Echelon Form (RREF) of a Matrix
#'
#' Computes the reduced row echelon form of a numeric matrix using
#' elementary row operations (Gauss-Jordan elimination).
#'
#' @param A A numeric matrix or an object coercible to a matrix.
#'
#' @return A matrix in reduced row echelon form.
#'
#' @details
#' The function applies a sequence of elementary row operations to
#' transform the matrix into its reduced row echelon form (RREF).
#' This form is useful for solving linear systems and analyzing
#' linear independence.
#'
#' @examples
#' A <- matrix(c(1, 2, -1, -4,
#'               2, 3, -1, -11,
#'               -2, 0, -3, 22), nrow = 3, byrow = TRUE)
#' RREF(A)
#'
#' @export
RREF <- function(A) {
  a <- as.matrix(A)
  n <- nrow(a)
  k <- ncol(a)
  l <- 1  # pivot column index
  
  for (r in 1:n) {
    if (l > k) break
    
    i <- r
    while (a[i, l] == 0) {
      i <- i + 1
      if (i > n) {
        i <- r
        l <- l + 1
        if (l > k) break
      }
    }
    if (l > k) break
    
    if (i != r) {
      temp <- a[i, ]
      a[i, ] <- a[r, ]
      a[r, ] <- temp
    }
    
    if (a[r, l] != 0) {
      a[r, ] <- a[r, ] / a[r, l]
    }
    
    for (i in 1:n) {
      if (i != r) {
        a[i, ] <- a[i, ] - a[i, l] * a[r, ]
      }
    }
    
    l <- l + 1
  }
  
  return(a)
}

# save to the working directory getwd(), ls()
dump("RREF", file = "RREF.R")

Solving a System of Linear Equations

Given the system \(A\mathbf{x} = \mathbf{b}\), where:

\[ A = \begin{bmatrix} 4 & -2 & 4 & 2 \\ -2 & 10 & -2 & -7 \\ 4 & -2 & 8 & 4 \\ 2 & -7 & 4 & 7 \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} 8 \\ 2 \\ 16 \\ 6 \end{bmatrix} \]

We construct the augmented matrix \(D = [A \mid \mathbf{b}]\):

\[ D = \left[\begin{array}{cccc|c} 4 & -2 & 4 & 2 & 8 \\ -2 & 10 & -2 & -7 & 2 \\ 4 & -2 & 8 & 4 & 16 \\ 2 & -7 & 4 & 7 & 6 \end{array}\right] \]

Key Features:

  • Matrix \(A\) is a \(4×4\) coefficient matrix
  • Vector \(\mathbf{b}\) is the right-hand side
  • Augmented matrix \(D\) combines both for row operations
  • Vertical bar separates coefficients from constants
# Define coefficient matrix A
A <- matrix(c(
  4, -2,  4,  2,
  -2, 10, -2, -7,
  4, -2,  8,  4,
  2, -7,  4,  7
), nrow = 4, byrow = TRUE)

# Define constant vector b
b <- c(8, 2, 16, 6)

# Form augmented matrix D = [A | b]
D <- cbind(A, b)
print(D)
##                   b
## [1,]  4 -2  4  2  8
## [2,] -2 10 -2 -7  2
## [3,]  4 -2  8  4 16
## [4,]  2 -7  4  7  6

# Compute Reduced Row Echelon Form
E <- RREF(D)
print(E)
##              b
## [1,] 1 0 0 0 1
## [2,] 0 1 0 0 2
## [3,] 0 0 1 0 1
## [4,] 0 0 0 1 2

# Interpretation:
# Solution to system: x1 = 1, x2 = 2, x3 = 1, x4 = 2

Interpretation

The RREF of \(D\) gives the solution:

\[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 1 \\ 2 \end{bmatrix} \]

Verification:

Substitute \(\mathbf{x}\) into \(A\mathbf{x}\):

\[ A\mathbf{x} = \begin{bmatrix} 4(1) - 2(2) + 4(1) + 2(2) \\ -2(1) + 10(2) - 2(1) - 7(2) \\ 4(1) - 2(2) + 8(1) + 4(2) \\ 2(1) - 7(2) + 4(1) + 7(2) \end{bmatrix} = \begin{bmatrix} 8 \\ 2 \\ 16 \\ 6 \end{bmatrix} = \mathbf{b} \]

2.6.3 Solution Verification

The proposed solution \(\mathbf{x} = \begin{bmatrix}1 & 2 & 1 & 2\end{bmatrix}^\top\) is correct.
Below is the step-by-step verification:

2.6.3.1 1. Matrix Multiplication: Compute \(A\mathbf{x}\)

Given:

\[ A = \begin{bmatrix} 4 & -2 & 4 & 2 \\ -2 & 10 & -2 & -7 \\ 4 & -2 & 8 & 4 \\ 2 & -7 & 4 & 7 \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix}1 \\ 2 \\ 1 \\ 2\end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix}8 \\ 2 \\ 16 \\ 6\end{bmatrix} \]

2.6.3.2 2. Row-wise Arithmetic Verification

  • Row 1:
    • \(4(1) - 2(2) + 4(1) + 2(2) = 4 - 4 + 4 + 4 = 8\)
    • \(\Rightarrow\) Matches \(b_1 = 8\)
  • Row 2:
    • \(-2(1) + 10(2) - 2(1) - 7(2) = -2 + 20 - 2 - 14 = 2\)
    • \(\Rightarrow\) Matches \(b_2 = 2\)
  • Row 3:
    • \(4(1) - 2(2) + 8(1) + 4(2) = 4 - 4 + 8 + 8 = 16\)
    • \(\Rightarrow\) Matches \(b_3 = 16\)
  • Row 4:
    • \(2(1) - 7(2) + 4(1) + 7(2) = 2 - 14 + 4 + 14 = 6\)
    • \(\Rightarrow\) Matches \(b_4 = 6\)

2.6.3.3 3. Conclusion

Since \(A\mathbf{x} = \mathbf{b}\) holds exactly, the solution is validated.


2.6.4 3.13 Rank of a Matrix

The rank of a matrix \(A\) is the dimension of the vector space spanned by its columns, also known as the column space of \(A\). Equivalently, it is the maximum number of linearly independent column vectors in the matrix.

A matrix is said to have full rank if its rank is equal to the maximum possible for its dimensions — that is, the lesser of the number of rows and columns:

\[ \text{rank}(A) = \min(m, n) \]

A matrix is rank deficient if its rank is less than this maximum.

A common method for determining the rank of a matrix is to reduce it to row echelon form (REF) using Gaussian elimination. In this form, the rank corresponds to the number of leading 1s or pivots, which is also the number of non-zero rows.

Note: The same number of pivots is obtained whether you reduce rows or columns, since row rank and column rank are always equal.

#' Rank of a Matrix
#'
#' Computes the rank of a matrix using its reduced row echelon form (RREF).
#' The rank is defined as the number of non-zero rows in RREF.
#'
#' @param D A numeric matrix.
#'
#' @return An integer representing the rank of the matrix.
#'
#' @details
#' A matrix is said to have full rank if its rank equals the minimum of the number
#' of rows and columns. It is rank deficient if its rank is less than this value.
#' This function uses the custom \code{RREF} function, so ensure that \code{RREF.R}
#' is in your working directory.
#'
#' @examples
#' A <- matrix(c(4, -2, 4, 2,
#'               -2, 10, -2, -7,
#'               4, -2, 8, 4,
#'               2, -7, 4, 7), nrow = 4, byrow = TRUE)
#' b <- c(8, 2, 16, 6)
#' D <- cbind(A, b)
#' RANK(D)  # Returns 4
#'
#' # Rank deficient example:
#' C <- matrix(c(1, 2, 3,
#'               3, 4, 6,
#'               4, 6, 9), nrow = 3, byrow = FALSE)
#' RANK(C)  # Returns 2
#'
#' @export
RANK <- function(D) {
  # Load RREF function from file if it is not already defined
  if (!exists("RREF")) source("RREF.R")
  
  # Compute Reduced Row Echelon Form
  E <- RREF(D)
  
  # Count non-zero rows (with a small tolerance for floating-point noise)
  rank <- sum(apply(E, 1, function(x) any(abs(x) > 1e-10)))
  
  return(rank)
}

# Save function to file
dump("RANK", file = "RANK.R")
A <- matrix(c(4, -2, 4, 2,
              -2, 10, -2, -7,
              4, -2, 8, 4,
              2, -7, 4, 7), nrow = 4, byrow = TRUE)
b <- c(8, 2, 16, 6)
D <- cbind(A, b)

RANK(D)  # Output: 4 (full rank)
## [1] 4
C <- matrix(c(1, 2, 3,
              3, 4, 6,
              4, 6, 9), nrow = 3, byrow = FALSE)

C
##      [,1] [,2] [,3]
## [1,]    1    3    4
## [2,]    2    4    6
## [3,]    3    6    9
#      [,1] [,2] [,3]
# [1,]    1    3    4
# [2,]    2    4    6
# [3,]    3    6    9

RANK(C)  # Output: 2
## [1] 2
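As a cross-check (a quick sketch; the matrices A and C from the chunks above are assumed to still be in the workspace), base R reports the same ranks through its QR decomposition:

qr(A)$rank   # 4: full rank
qr(C)$rank   # 2: rank deficient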
DET <- function(A) prod(diag(chol(A)))^2
# Note: Will throw an error if A is not positive definite

# This matrix is not full rank, so its determinant is zero;
# base R returns a value on the order of machine precision
det(C)   # Base R
## [1] 3.3e-16
#' Determinant of a Matrix with Fallback for Non-Positive-Definite Cases
#'
#' Computes the determinant of a square matrix. Attempts Cholesky decomposition
#' first (for speed), and falls back to LU decomposition if Cholesky fails.
#'
#' @param A A square numeric matrix.
#'
#' @return A numeric value representing the determinant of the matrix.
#'
#' @examples
#' A <- matrix(c(4, -2, 4, 2,
#'               -2, 10, -2, -7,
#'               4, -2, 8, 4,
#'               2, -7, 4, 7), nrow = 4, byrow = TRUE)
#' DET(A)  # 144
#'
#' C <- matrix(c(1, 2, 3,
#'               3, 4, 6,
#'               4, 6, 9), nrow = 3, byrow = FALSE)
#' DET(C)  # 0
#'
#' @export
DET <- function(A) {
  if (!is.matrix(A) || nrow(A) != ncol(A)) {
    stop("Input must be a square matrix.")
  }
  
  # Try Cholesky decomposition first (works only for symmetric positive-definite)
  det_val <- tryCatch({
    U <- chol(A)
    prod(diag(U))^2
  }, error = function(e) {
    # Fall back to LU decomposition (via base R's determinant function)
    det_base <- determinant(A, logarithm = FALSE)
    as.numeric(det_base$modulus) * ifelse(det_base$sign < 0, -1, 1)
  })
  
  return(det_val)
}
# Full-rank matrix
A <- matrix(c(4, -2, 4, 2,
              -2, 10, -2, -7,
              4, -2, 8, 4,
              2, -7, 4, 7), nrow = 4, byrow = TRUE)
DET(A)  # 144
## [1] 144
# Rank-deficient matrix
C <- matrix(c(1, 2, 3,
              3, 4, 6,
              4, 6, 9), nrow = 3, byrow = FALSE)
DET(C)  # effectively 0 (returned value is on the order of machine precision)
## [1] 3.3e-16

3 4 Multiple Linear Regression

In statistics, often multiple independent factors contribute to the dependent variable. Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression.

Multiple linear regression applies the least squares estimator to a linear model that relates the response variable \(Y\) to several explanatory variables \(X\) through regression coefficients \(\beta\):

\[ Y_i = \beta_0 + \sum_{j=1}^{k-1} \beta_j X_{[i,j]} + \varepsilon_i, \quad \text{for } i = 1, 2, \dots, n. \] Written out observation by observation, the model is the following system of \(n\) equations:

\[ \begin{aligned} Y_1 &= \beta_0 + \beta_1 X_{[1,1]} + \beta_2 X_{[1,2]} + \cdots + \beta_{(k-1)} X_{[1,(k-1)]} + \varepsilon_1 \\ Y_2 &= \beta_0 + \beta_1 X_{[2,1]} + \beta_2 X_{[2,2]} + \cdots + \beta_{(k-1)} X_{[2,(k-1)]} + \varepsilon_2 \\ Y_3 &= \beta_0 + \beta_1 X_{[3,1]} + \beta_2 X_{[3,2]} + \cdots + \beta_{(k-1)} X_{[3,(k-1)]} + \varepsilon_3 \\ &\vdots \quad + \quad \vdots \quad + \quad \vdots \quad + \quad \ddots \quad + \quad \vdots \quad + \quad \vdots \\ Y_n &= \beta_0 + \beta_1 X_{[n,1]} + \beta_2 X_{[n,2]} + \cdots + \beta_{(k-1)} X_{[n,(k-1)]} + \varepsilon_n \end{aligned} \]

In matrix notation, the model is written as:

\[ \begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} \beta_0 & + & \beta_1 X_{[1,1]} & + & \beta_2 X_{[1,2]} & + & \ldots & + & \beta_{(k-1)} X_{[1,(k-1)]} \\ \beta_0 & + & \beta_1 X_{[2,1]} & + & \beta_2 X_{[2,2]} & + & \ldots & + & \beta_{(k-1)} X_{[2,(k-1)]} \\ \beta_0 & + & \beta_1 X_{[3,1]} & + & \beta_2 X_{[3,2]} & + & \ldots & + & \beta_{(k-1)} X_{[3,(k-1)]} \\ \vdots & + & \vdots & + & \vdots & + & \ddots & + & \vdots \\ \beta_0 & + & \beta_1 X_{[n,1]} & + & \beta_2 X_{[n,2]} & + & \ldots & + & \beta_{(k-1)} X_{[n,(k-1)]} \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{bmatrix} \]

\[ = \begin{bmatrix} 1 & X_{[1,1]} & X_{[1,2]} & \ldots & X_{[1,(k-1)]} \\ 1 & X_{[2,1]} & X_{[2,2]} & \ldots & X_{[2,(k-1)]} \\ 1 & X_{[3,1]} & X_{[3,2]} & \ldots & X_{[3,(k-1)]} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{[n,1]} & X_{[n,2]} & \ldots & X_{[n,(k-1)]} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{(k-1)} \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{bmatrix} \]

where:

\(\qquad k-1\) is the number of explanatory variables

\(\qquad n\) is the number of observations

\(\qquad X_{n \times k}\) is the design matrix (a leading column of ones followed by the \(k-1\) explanatory variables)

\(\qquad \beta_0\) is the intercept

\(\qquad \beta_{k \times 1}\) is the vector of regression coefficients (including the intercept)

\(\qquad \varepsilon_{n \times 1}\) is the vector of error terms, with \(\varepsilon_i \sim \text{iid } N(0, \sigma^2)\)

\(\qquad Y_{n \times 1}\) is the response vector
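To see this layout concretely, base R's model.matrix() constructs exactly such a design matrix (intercept column plus predictors). A brief illustration using the built-in mtcars data, which is separate from the worked example used later in this chapter:

# First rows of the design matrix for a model with an intercept and two predictors
head(model.matrix(mpg ~ wt + hp, data = mtcars), 3)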

3.1 4.1 Regression Coefficients

The goal of ordinary least squares regression is to minimize the differences (residuals) between the observed responses and the responses predicted by the linear approximation (fitted values) of the data.

\[ \begin{aligned} Y &= X\beta + \varepsilon \\ \varepsilon &= Y - X\beta \end{aligned} \]

This method obtains parameter estimates that minimize the sum of squared residuals (also residual sum of squares or sum of squared errors of prediction).

\[ \begin{aligned} \varepsilon^T \varepsilon &= (Y - X\beta)^T (Y - X\beta) \\ &= (Y^T - \beta^T X^T)(Y - X\beta) \\ &= Y^T Y - Y^T X\beta - \beta^T X^T Y + \beta^T X^T X \beta \\ &= Y^T Y - 2\beta^T X^T Y + \beta^T X^T X \beta \end{aligned} \]

The last step uses the fact that \(Y^T X\beta\) is a scalar, so it equals its own transpose \(\beta^T X^T Y\).

To minimize the sum of squared residuals, set its first-order partial derivative with respect to \(\beta\) to zero:

\[ \begin{aligned} \frac{\partial}{\partial \beta}\left[ Y^T Y - 2\beta^T X^T Y + \beta^T X^T X \beta \right] &= 0 \\ -2X^T Y + 2X^T X \beta &= 0 \\ 2X^T X \beta &= 2X^T Y \\ X^T X \beta &= X^T Y \end{aligned} \]

The last line is the system of normal equations.

To solve for \(\beta\), multiply both sides by \((X^T X)^{-1}\), the inverse of the coefficient matrix of \(\beta\):

\[ \begin{aligned} (X^T X)^{-1}(X^T X) \beta &= (X^T X)^{-1} X^T Y \\ I \beta &= (X^T X)^{-1} X^T Y \\ \beta &= (X^T X)^{-1} X^T Y \end{aligned} \]

#' Calculate Multiple Linear Regression Coefficients
#'
#' Computes regression coefficients using the normal equations approach.
#' Handles both single-matrix input (last column as response) and separate
#' predictor/response matrices.
#'
#' @param X A numeric matrix of predictor variables. If Y is NULL, the last column 
#'          will be used as the response variable.
#' @param Y Optional numeric vector/matrix of response variables. If provided, 
#'          X should contain only predictors.
#'
#' @return A matrix of regression coefficients (including intercept)
#'
#' @details
#' This function implements the closed-form solution for multiple linear regression:
#' \deqn{\beta = (X'X)^{-1}X'Y}
#' where:
#' \itemize{
#'   \item X is the design matrix (with intercept column)
#'   \item Y is the response vector
#'   \item \eqn{\beta} is the coefficient vector
#' }
#' The function automatically adds an intercept term and performs list-wise deletion
#' for missing data.
#'
#' @examples
#' # Using single matrix input (last column as response)
#' data(mtcars)
#' X <- as.matrix(mtcars[, c("hp", "wt", "qsec", "mpg")])
#' beta(X)
#'
#' # Using separate X and Y
#' X <- as.matrix(mtcars[, c("hp", "wt", "qsec")])
#' Y <- mtcars$mpg
#' beta(X, Y)
#'
#' # Compare with lm() results
#' coef(lm(mpg ~ hp + wt + qsec, data = mtcars))
#'
#' @seealso \code{\link{lm}} for the standard linear modeling function
#' @export
beta <- function(X, Y = NULL) {
  # List-wise missing data deletion
  X <- as.matrix(X[apply(X, 1, function(x) !any(is.na(x))), , drop = FALSE])
  
  if (is.null(Y)) {
    # If Y not provided, assume last column is response
    p <- ncol(X)
    Y <- as.matrix(X[, p])
    X <- as.matrix(X[, 1:(p - 1)])
  } else {
    X <- as.matrix(X)
    Y <- as.matrix(Y)
  }
  
  D <- cbind(1, X)  # Create design matrix with intercept
  
  # Calculate coefficients using normal equations
  INV2(t(D) %*% D) %*% t(D) %*% Y
}

# beta() uses the Cholesky-based inverse INV2() defined earlier in this document;
# source it if it is not already in the workspace.
if (!exists("INV2")) {
  source("INV2.R")
}

# Save function to file
dump("beta", file = "beta.R")
## Example Usage

### Applying to Dataset X
# Calculate regression coefficients using our beta() function
beta_results <- beta(X)
beta_results
##     [,1]
##    62.41
## X1  1.55
## X2  0.51
## X3  0.10
## X4 -0.14

# Calculate coefficients using glm() for verification
glm_results <- as.matrix(coefficients(
  glm(X[, ncol(X)] ~ X[, 1:(ncol(X) - 1)])
))

glm_results
##                         [,1]
## (Intercept)            62.41
## X[, 1:(ncol(X) - 1)]X1  1.55
## X[, 1:(ncol(X) - 1)]X2  0.51
## X[, 1:(ncol(X) - 1)]X3  0.10
## X[, 1:(ncol(X) - 1)]X4 -0.14

3.1.1 The Covariance Matrix and the Regression Coefficients Using the Sweep Operator

Excerpt From: Roderick J. A. Little & Donald B. Rubin, Statistical Analysis with Missing Data (Little & Rubin, 2020)

The sweep operator is closely related to linear regression. For example, suppose that G is a \(2 \times 2\) covariance matrix of two variables \(Y_1\) and \(Y_2\), and let:

\[ H = \text{SWEEP}[1]G \]

Then:

\(\qquad h_{12}\) is the coefficient of \(Y_1\) from the regression of \(Y_2\) on \(Y_1\).

\(\qquad h_{22}\) is the residual variance of \(Y_2\) given \(Y_1\).

Furthermore, when \(G\) is a sample covariance matrix from \(n\) independent units, the estimated sampling variance of the regression coefficient \(h_{12}\) is:

\[ -\frac{h_{11}h_{22}}{n} \]
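These statements are easy to check numerically. The following is a minimal sketch that assumes the SWEEP() function defined earlier in this document is available; the data are simulated rather than taken from the worked example:

set.seed(123)
y1 <- rnorm(200)
y2 <- 1 + 2 * y1 + rnorm(200)

G2 <- cov(cbind(y1, y2))   # 2 x 2 covariance matrix of Y1 and Y2
H2 <- SWEEP(G2, 1)         # sweep on the row and column of Y1

H2[1, 2]                   # coefficient of Y1 from regressing Y2 on Y1
coef(lm(y2 ~ y1))["y1"]    # least squares slope, for comparison
H2[2, 2]                   # residual variance of Y2 given Y1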


The Augmented Parameter Matrix

More generally, suppose we have a sample of \(n\) units on \(K\) variables \(Y_1, \dots, Y_K\). Let G denote the \((K+1) \times (K+1)\) augmented parameter matrix:

\[ G = \begin{bmatrix} 1 & \bar{y}_1 & \bar{y}_2 & \cdots & \bar{y}_K \\ \bar{y}_1 & \frac{1}{n}\sum y_1^2 & \frac{1}{n}\sum y_1 y_2 & \cdots & \frac{1}{n}\sum y_1 y_K \\ \bar{y}_2 & \frac{1}{n}\sum y_2 y_1 & \frac{1}{n}\sum y_2^2 & \cdots & \frac{1}{n}\sum y_2 y_K \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \bar{y}_K & \frac{1}{n}\sum y_K y_1 & \frac{1}{n}\sum y_K y_2 & \cdots & \frac{1}{n}\sum y_K^2 \end{bmatrix} \]

This matrix incorporates both the means and covariances of the variables \(Y_1, \dots, Y_K\), and is used to perform regression computations via the sweep operator.

#' Augmented Parameter Matrix [@Little:Rubin:2002]
#'
#' Computes the (K+1) x (K+1) augmented parameter matrix G used in regression 
#' via the sweep operator. Missing values are imputed with column means.
#'
#' @param A A data frame or matrix of numeric variables (possibly with missing values).
#' @return A numeric matrix G = (t(D) %*% D)/n, where D is the design matrix with intercept.
#' @examples
#' G <- augmented_parameter_matrix(X)

augmented_parameter_matrix <- function(A){
  # Remove rows that are entirely NA
  X <- data.matrix(A[apply(A, 1, function(x) !all(is.na(x))), ])
  
  # Impute missing values with column means
  X <- apply(X, 2, function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))
  
  # Design matrix with intercept column
  D <- cbind(1, X)
  n <- nrow(D)
  
  # Return augmented parameter matrix
  return((t(D) %*% D) / n)
}

# Save to file in working directory
dump("augmented_parameter_matrix", file = "augmented_parameter_matrix.R")
# Example use
G <- augmented_parameter_matrix(X)
print(G)
##            X1   X2   X3   X4    Y
##     1.0   7.5   48   12   30   95
## X1  7.5  87.6  379   59  202  772
## X2 48.2 378.6 2542  554 1211 4771
## X3 11.8  59.2  554  176  356 1076
## X4 30.0 201.5 1211  356 1159 2672
## Y  95.4 771.7 4771 1076 2672 9314

Notes:

  • Header Components:
    • Top Row: Contains column means (\(\bar{Y}_j\) for each variable \(j\))
    • First Column: Contains intercept terms (for regression constant)
  • Core Matrix Components:
    • Scaled Cross-Products \(= \left(\frac{1}{n}\right) \sum_{i=1}^{n} (Y_i \times Y_j)\)

Where:

\(\qquad n =\) sample size

\(\qquad Y_i\), \(Y_j\) = paired observations

\(\qquad \sum\) runs across all observations

Purpose:

  • Designed specifically for efficient regression computation
  • Structure optimized for:
    • Sweep operator applications
    • Sequential partial regression analyses
    • Variance-covariance decomposition

Key Properties:

  • Symmetric positive-definite when no perfect multicollinearity exists
  • Diagonal elements represent \(\frac{1}{n}\sum Y_i^2\) (scaled sum of squares)
  • Off-diagonals represent \(\frac{1}{n}\sum Y_i Y_j\) (scaled cross-products)

3.1.2 The Variance-Covariance Matrix Using the Sweep Operator

Sweeping on row and column 1 of the augmented parameter matrix \(G\) yields the matrix:

\[ \text{SWEEP}[1]G = \Theta = \begin{bmatrix} -1 & \bar{y}_1 & \bar{y}_2 & \cdots & \bar{y}_K \\ \bar{y}_1 & \hat{\sigma}_{11} & \hat{\sigma}_{21} & \cdots & \hat{\sigma}_{K1} \\ \bar{y}_2 & \hat{\sigma}_{12} & \hat{\sigma}_{22} & \cdots & \hat{\sigma}_{K2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \bar{y}_K & \hat{\sigma}_{1K} & \hat{\sigma}_{2K} & \cdots & \hat{\sigma}_{KK} \end{bmatrix} \]

Where:

\(\qquad \bar{y}_j\) is the sample mean of variable \(Y_j\)

\(\qquad \hat{\sigma}_{jk}\) is the sample covariance between variables \(Y_j\) and \(Y_k\), computed with denominator \(n\) (not \(n-1\))


3.1.3 Interpretation

This sweep operation performs a correction for the means, transforming the scaled cross-products matrix \(G\) into the covariance matrix of the variables \(Y_1, \dots, Y_K\). This is why it is referred to as:

Sweeping on the constant term (the constant occupies the first row and column of the augmented matrix)

In regression terms:

  • The first row and column of \(\Theta\) (excluding the \(-1\)) contain the regression coefficients of \(Y_1, \dots, Y_K\) on the constant term, i.e., the sample means.
  • The remaining block (lower-right) of the matrix contains the covariance matrix of \(Y_1, \dots, Y_K\) after regressing out the intercept (i.e., after correcting for the means).

Summary

  • Before sweep: \(G\) contains means and scaled cross-products.
  • After sweep on 1: \(\Theta = \text{SWEEP}[1]G\) gives:
    • Sample means \((\bar{y}_j)\)
    • Covariances \((\hat{\sigma}_{jk})\) with divisor \(n\)
    • Regression coefficients on intercept in first row/column

This symmetric matrix \(\Theta\) is called the augmented covariance matrix.

(Theta <- SWEEP(G, 1))
##             X1   X2    X3     X4    Y
##      1.0   7.5   48  11.8   30.0   95
## X1  -7.5  31.9   19 -28.7  -22.3   60
## X2 -48.2  19.3  224 -12.8 -233.9  176
## X3 -11.8 -28.7  -13  37.9    2.9  -48
## X4 -30.0 -22.3 -234   2.9  258.6 -191
## Y  -95.4  59.7  176 -47.6 -190.9  209

Sweeping on row and column 1 gives Theta (i.e., SWEEP(G, 1)) — an updated matrix where:

  • The first row (and column) now contains regression coefficients of the mean-centered variables on the intercept.
  • The remaining submatrix is the adjusted covariance matrix.
colMeans(X)
##   X1   X2   X3   X4    Y 
##  7.5 48.2 11.8 30.0 95.4

# These are matched exactly in:
Theta[1, -1]
##   X1   X2   X3   X4    Y 
##  7.5 48.2 11.8 30.0 95.4

The first row of Theta (excluding the first element) contains the means.

The sample covariance matrix rescaled by \(\frac{n-1}{n}\) (i.e., with divisor \(n\)) is:

cov(X)*(nrow(X)-1)/nrow(X)
##     X1   X2    X3     X4    Y
## X1  32   19 -28.7  -22.3   60
## X2  19  224 -12.8 -233.9  176
## X3 -29  -13  37.9    2.9  -48
## X4 -22 -234   2.9  258.6 -191
## Y   60  176 -47.6 -190.9  209
#Which matches:
Theta[-1, -1]
##     X1   X2    X3     X4    Y
## X1  32   19 -28.7  -22.3   60
## X2  19  224 -12.8 -233.9  176
## X3 -29  -13  37.9    2.9  -48
## X4 -22 -234   2.9  258.6 -191
## Y   60  176 -47.6 -190.9  209

The lower-right block of Theta is the variance-covariance matrix (the maximum likelihood estimator, with divisor \(n\)).

Summary of Matrix Structure

After the sweep, \(\Theta\) has the block structure:

\[ \Theta = \begin{bmatrix} -1 & \bar{y}^T \\ \bar{y} & \hat{\Sigma} \end{bmatrix} \]

where \(\bar{y}\) is the vector of variable means and \(\hat{\Sigma}\) is their covariance matrix.

This structure is essential in multivariate regression or MANOVA, and the sweep operator makes inversion and updates efficient without recalculating from scratch.

3.2 The Regression Coefficients Using the Sweep

If matrix G is swept on rows and columns 1 and 2 (i.e., SWEEP[1,2]G), the result is a block matrix of the form:

\[ \text{SWEEP}_{[1,2]}(G) = \begin{bmatrix} -\left(1 + \frac{\bar{y}_1^2}{\sigma_{11}}\right) & \frac{\bar{y}_2 - \bar{y}_1 \sigma_{12} / \sigma_{11}}{\sigma_{11}} & \cdots & \frac{\bar{y}_K - \bar{y}_1 \sigma_{1K} / \sigma_{11}}{\sigma_{11}} \\ \frac{\bar{y}_2 - \bar{y}_1 \sigma_{12} / \sigma_{11}}{\sigma_{11}} & \frac{\sigma_{22} - \sigma_{12}^2 / \sigma_{11}}{\sigma_{11}} & \cdots & \frac{\sigma_{2K} - \sigma_{1K} \sigma_{12} / \sigma_{11}}{\sigma_{11}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\bar{y}_K - \bar{y}_1 \sigma_{1K} / \sigma_{11}}{\sigma_{11}} & \frac{\sigma_{2K} - \sigma_{1K} \sigma_{12} / \sigma_{11}}{\sigma_{11}} & \cdots & \frac{\sigma_{KK} - \sigma_{1K}^2 / \sigma_{11}}{\sigma_{11}} \end{bmatrix} = \begin{bmatrix} -A & B^\top \\ B & C \end{bmatrix} \]

Matrix Interpretation:

  • A is a 2×2 matrix representing intercept–intercept and slope–slope structure.
  • B is a 2×(K−1) matrix:
  • Each column of B gives the intercept and slope for the regression of Y_{j+1} on Y₁, for j = 1, ..., K−1.
  • C is a (K−1)×(K−1) matrix:
  • Represents the residual covariance matrix of Y₂, ..., Y_K after regressing them on Y₁.

Notes:

  • The first row gives intercepts.
  • The second row gives slopes (regression coefficients).
  • The C block (lower-right) contains residual covariances.
  • The elements of A, when multiplied by the corresponding elements in C and divided by n, give asymptotic variances and covariances of the regression coefficients in B.
SWEEP(G, 1:2)
##               X1     X2    X3     X4      Y
##      2.74 -0.234   43.6  18.5   35.2   81.5
## X1  -0.23  0.031    0.6  -0.9   -0.7    1.9
## X2 -43.64 -0.605  211.8   4.5 -220.4  140.3
## X3 -18.46  0.897    4.5  12.1  -17.1    6.0
## X4 -35.21  0.698 -220.4 -17.1  243.0 -149.2
## Y  -81.48 -1.869  140.3   6.0 -149.2   97.4

3.3 Multivariate Regression Coefficients Using the Sweep

Sweeping the constant term and the first q elements yields results for the multivariate regression of \(Y_{q+1}, \dots, Y_K\) on \(Y_1, \dots, Y_q\).

After sweeping, the resulting matrix takes the form:

\[ \text{SWEEP}_{[1, \dots, q]}(G) = \begin{bmatrix} -D & E \\ E^T & F \end{bmatrix} \]

Matrix Components

  • D: The block for the constant term and the predictors, containing information related to the variances and covariances of the estimated regression coefficients.
  • E: The block of regression coefficients, where:
  • Each column corresponds to a dependent variable (Y_{q+1}, ..., Y_K)
  • Each row corresponds to the intercept or one of the predictor variables (Y_1, ..., Y_q)
  • F: A (K − q) × (K − q) block representing the residual covariance matrix of the dependent variables after regressing on the predictors.

Interpretation

  • The E matrix provides the multivariate regression coefficients.
  • The F matrix contains the residual covariances after accounting for the effect of the predictors.
  • The elements of D, combined with F, can be used to compute asymptotic standard errors and covariances of the estimates in E.
Phi <- SWEEP(G, 1:5)

This corresponds to computing the regression of Y (the fifth variable in the data matrix) on X1 to X4, where:

  • Y is the dependent variable.
  • X1 to X4 are the predictors.
  • The first five entries of the last column of Phi contain the intercept and the regression coefficients (exactly what beta2() below extracts).
#' Calculate Regression Coefficients Using Sweep Operator
#'
#' Computes multiple linear regression coefficients using the sweep operator method.
#' More numerically stable than normal equations for certain problems.
#'
#' @param x A numeric matrix where the last column is the response variable
#'          and other columns are predictors.
#'
#' @return A matrix of regression coefficients with proper dimension names
#' @export
#' @examples
#' data(mtcars)
#' x <- cbind(mtcars$wt, mtcars$hp, mtcars$mpg)
#' beta2(x)
beta2 <-
  function(x){
    
    # Load required functions 
    if (!exists("SWEEP") || !exists("augmented_parameter_matrix")) {
      source("SWEEP.R")
      source("augmented_parameter_matrix.R")
    }
    
    x <- as.matrix(x)
    k <- ncol(x)
    
    G <- augmented_parameter_matrix(x)
    P <- SWEEP(G, 1:k)
    
    # Validate sweep operation results before extracting coefficients
    if (is.null(P) || any(dim(P) != c(k+1, k+1))) {
      stop("Sweep operation failed - check for linear dependencies")
    }
    
    # Extract and format coefficients (first k entries of the last column)
    B <- matrix(P[1:k, k+1], dimnames=list(paste("B", 0:(k-1), sep = ""), "coeffs"))
    
    return(B)
  }

dump("beta2", file = "beta2.R")
# Apply to dataset X
beta2(X)
##    coeffs
## B0  62.41
## B1   1.55
## B2   0.51
## B3   0.10
## B4  -0.14

In summary, maximum likelihood (ML) estimates for the multivariate linear regression of \(Y_{q+1}, \dots, Y_K\) on \(Y_1, \dots, Y_q\) can be obtained by sweeping the rows and columns corresponding to the constant term and the predictor variables (\(Y_1, \dots, Y_q\)) from the scaled cross-products matrix \(G\).


Interpretation of the Sweep Operation

The operation of sweeping on a variable effectively transforms that variable from an outcome (dependent) variable into a predictor (independent) variable. This reparameterization allows the regression relationships and residual covariances to be computed directly from the augmented matrix without performing separate matrix inversions.
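One practical consequence is sketched below, under the assumption that SWEEP() implements the standard (composable) sweep operator and that the augmented matrix G from the previous sections is still in the workspace: sweeping one more variable updates an existing fit instead of recomputing everything from scratch.

P12  <- SWEEP(G, 1:2)            # constant term and X1 swept
P123 <- SWEEP(P12, 3)            # additionally sweep X2, reusing the previous result
all.equal(P123, SWEEP(G, 1:3))   # should be TRUE: sweeps on distinct indices compose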

3.3.1 The Hat Matrix

The hat matrix (also called the influence matrix or projection matrix), maps the vector of response values (dependent variable values) to the vector of fitted values (or predicted values).

\[ \begin{align*} Y &= X\beta + \varepsilon \\ \hat{Y} &= X\hat{\beta} \\ \hat{\beta} &= (X^TX)^{-1}X^TY \\ \hat{Y} &= X(X^TX)^{-1}X^TY \\ &= HY \end{align*} \]

So the hat matrix is defined as:

\[ H = X(X^TX)^{-1}X^T \]

The element in the ith row and jth column of H is equal to the covariance between the jth response value and the ith fitted value, divided by the variance of the former:

\[ h_{ij} = \frac{\text{cov}(\hat{Y}_i, Y_j)}{\text{var}(Y_j)} \]

Properties of H and (I-H) matrices (verified numerically after this list):

  • Symmetric: \(H = H^T\) and \((I-H)^T = (I-H)\)
  • Idempotent: \(H^2 = H\) and \((I-H)(I-H) = (I-H)\)
  • X is invariant under H: \(HX = X\) hence \((I-H)X = 0\)
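These properties can be checked numerically. A minimal sketch using only base R (the design matrix here is simulated and built directly, rather than through the HAT() function defined next):

set.seed(42)
X0 <- matrix(rnorm(30), ncol = 3)            # 10 observations, 3 predictors
D0 <- cbind(1, X0)                           # design matrix with intercept
H0 <- D0 %*% solve(t(D0) %*% D0) %*% t(D0)   # hat matrix

all.equal(H0, t(H0))                         # symmetric
all.equal(H0 %*% H0, H0)                     # idempotent
all.equal(H0 %*% D0, D0)                     # design matrix is invariant under H
max(abs((diag(10) - H0) %*% D0))             # (I - H) D is numerically zero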
#' Compute the Hat Matrix
#' 
#' Calculates the hat matrix (projection matrix) for a given data matrix after 
#' handling missing values. The hat matrix maps observed response values to 
#' predicted values in linear regression.
#'
#' @param Y A numeric matrix or data frame whose columns are entered into the 
#'           design matrix (an intercept column is added). To obtain the hat 
#'           matrix of a particular regression, pass only the predictor columns. 
#'           Rows with any missing values will be removed.
#'
#' @return The hat matrix \eqn{H = X(X'X)^{-1}X'}, where X is the design matrix 
#'         (the input matrix Y with an added column of 1s for the intercept).
#'
#' @details
#' The function:
#' \enumerate{
#'   \item Performs list-wise deletion of missing data
#'   \item Creates a design matrix by adding an intercept column
#'   \item Computes the hat matrix using \eqn{H = D(D'D)^{-1}D'}
#' }
#' 
#' Requires the `INV()` function for matrix inversion (assumed to be in "INV.R").
#'
#' @examples
#' \dontrun{
#' data <- matrix(rnorm(100), ncol = 2)
#' H <- HAT(data)
#' }
#'
#' @seealso \code{\link{lm}} for linear model fitting which uses the hat matrix
#' @export
HAT <- function(Y) {
    # Load matrix inversion function
    source("INV.R")
    
    # List-wise missing data deletion
    X <- Y[apply(Y, 1, function(x) !any(is.na(x))), , drop = FALSE]
    
    # Create design matrix (add intercept)
    D <- cbind(1, X)  
    
    # Compute and return hat matrix
    D %*% INV(t(D) %*% D) %*% t(D)
}

# save to the working directory getwd(), ls()
dump("HAT", file = "HAT.R")
# Hat matrix built from all five columns of X (X1-X4 and Y all enter the design matrix)
round(HAT(X), 1)
##        obs.1 obs.2 obs.3 obs.4 obs.5 obs.6 obs.7 obs.8 obs.9 obs.10 obs.11
## obs.1    0.6   0.2  -0.1   0.2   0.3   0.0  -0.1   0.1  -0.1    0.0    0.0
## obs.2    0.2   0.4   0.1   0.1   0.1   0.1  -0.1   0.1   0.2   -0.2    0.1
## obs.3   -0.1   0.1   0.6   0.3   0.0   0.0   0.1   0.0   0.2    0.1   -0.2
## obs.4    0.2   0.1   0.3   0.4   0.0   0.0  -0.1   0.2   0.0    0.2   -0.1
## obs.5    0.3   0.1   0.0   0.0   0.4   0.1   0.1  -0.1   0.0   -0.1   -0.1
## obs.6    0.0   0.1   0.0   0.0   0.1   0.4   0.0  -0.3   0.2    0.2    0.2
## obs.7   -0.1  -0.1   0.1  -0.1   0.1   0.0   0.4   0.1   0.2   -0.1    0.1
## obs.8    0.1   0.1   0.0   0.2  -0.1  -0.3   0.1   0.6   0.0    0.1    0.3
## obs.9   -0.1   0.2   0.2   0.0   0.0   0.2   0.2   0.0   0.3   -0.1    0.2
## obs.10   0.0  -0.2   0.1   0.2  -0.1   0.2  -0.1   0.1  -0.1    0.7    0.1
## obs.11   0.0   0.1  -0.2  -0.1  -0.1   0.2   0.1   0.3   0.2    0.1    0.5
## obs.12   0.0  -0.1  -0.1  -0.1   0.1   0.2   0.2  -0.1   0.0    0.2    0.1
## obs.13   0.1  -0.2   0.1   0.0   0.2  -0.1   0.3   0.1  -0.1    0.1   -0.1
##        obs.12 obs.13
## obs.1     0.0    0.1
## obs.2    -0.1   -0.2
## obs.3    -0.1    0.1
## obs.4    -0.1    0.0
## obs.5     0.1    0.2
## obs.6     0.2   -0.1
## obs.7     0.2    0.3
## obs.8    -0.1    0.1
## obs.9     0.0   -0.1
## obs.10    0.2    0.1
## obs.11    0.1   -0.1
## obs.12    0.3    0.2
## obs.13    0.2    0.4
# check
X1 <- cbind(X[,1:4])
Y <- as.matrix(X[,5])
(D <- cbind(1, X1)) # the design matrix
##          X1 X2 X3 X4
## obs.1  1  7 26  6 60
## obs.2  1  1 29 15 52
## obs.3  1 11 56  8 20
## obs.4  1 11 31  8 47
## obs.5  1  7 52  6 33
## obs.6  1 11 55  9 22
## obs.7  1  3 71 17  6
## obs.8  1  1 31 22 44
## obs.9  1  2 54 18 22
## obs.10 1 21 47  4 26
## obs.11 1  1 40 23 34
## obs.12 1 11 66  9 12
## obs.13 1 10 68  8 12
# Calculate predicted values and residuals
# First ensure all required objects exist
if (!exists("X") || !exists("Y")) stop("X or Y not found in environment")
if (!exists("beta")) stop("beta() function not defined")
if (!exists("HAT")) stop("HAT() function not defined")

# Method 1: Using design matrix and beta coefficients
Y_hat1 <- D %*% beta(X)

# Method 2: Using the hat matrix built from the predictor columns only
# (passing the full X, which includes Y, would make H reproduce Y exactly)
Y_hat2 <- HAT(X1) %*% Y

# Verify dimensions match
if (!all(dim(Y_hat1) == dim(Y_hat2))) {
  warning("Prediction methods yield different dimensions")
}

# Calculate residuals
residuals <- Y - Y_hat1

# Create comparison table with proper formatting
results <- data.frame(
  Observed = Y,
  Predicted_Method1 = Y_hat1,
  Predicted_Method2 = Y_hat2,
  Residuals = residuals
)

# Format table output
knitr::kable(
  results,
  digits = 3,
  align = "r",
  caption = "Comparison of Prediction Methods",
  col.names = c("$Y$", "$\\hat{Y}_1$ (Dβ)", "$\\hat{Y}_2$ (HY)", "$\\varepsilon$"),
  format.args = list(scientific = FALSE)
) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )
Comparison of Prediction Methods
\(Y\) \(\hat{Y}_1\) (Dβ) \(\hat{Y}_2\) (HY) \(\varepsilon\)
obs.1 78 78 78 0.005
obs.2 74 73 73 1.511
obs.3 104 106 106 -1.671
obs.4 88 89 89 -1.727
obs.5 96 96 96 0.251
obs.6 109 105 105 3.925
obs.7 103 104 104 -1.449
obs.8 72 76 76 -3.175
obs.9 93 92 92 1.378
obs.10 116 116 116 0.282
obs.11 84 82 82 1.991
obs.12 113 112 112 0.973
obs.13 109 112 112 -2.294

3.4 Leverage Points Analysis

The diagonal elements of the hat matrix (\(h_{ii}\)) are called leverages, which indicate the influence of observations on their own predicted values.

3.4.1 Key Properties

  • Influence Measurement: Each element \(h_{ij}\) represents the influence of the \(j^{th}\) observation on the \(i^{th}\) predicted value \(\hat{Y}_i\)
  • Trace Property: The sum of the diagonal elements equals the number of parameters being estimated (\(p\)): \[ \sum_{i=1}^n h_{ii} = tr(H) = p \]
  • Average Leverage: The average leverage value is \(\frac{p}{n}\) (where \(n\) is the number of observations): \[ \bar{h} = \frac{p}{n} \]
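
A quick numerical check of the trace and average-leverage properties, sketched under the assumption that the HAT() function and the data matrix X from the earlier sections are still in the workspace:

# Verify the trace and average-leverage properties of the hat matrix
H_chk <- HAT(X)                   # hat matrix from the HAT() function defined earlier
n_chk <- nrow(H_chk)              # number of observations
p_chk <- sum(diag(H_chk))         # trace of H = number of fitted parameters
h_bar <- p_chk / n_chk            # average leverage p/n
c(trace = p_chk, average_leverage = h_bar, threshold = 2 * h_bar)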

Leverage Point Identification

  • Threshold Rule: An observation is considered influential if: \[ h_{ii} > 2 \times \bar{h} = \frac{2p}{n} \]
  • Example Detection:
    In this example:
    • Observation 10 has the largest leverage (\(h_{10,10} = 0.70\))
    • With \(tr(H) = 6\) and \(n = 13\), the threshold is \(2\bar{h} = 2 \times 6/13 \approx 0.92\), so no observation exceeds it (confirmed in the check below)

Visualization Tip:
A leverage-residual plot (plotting \(h_{ii}\) against residuals) helps identify both high-leverage points and outliers.

# First fit your linear model (replace with your actual model)
# Example using mtcars dataset:
model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Calculate diagnostic measures
h <- hatvalues(model)
res <- rstudent(model)  # Using studentized residuals
cook <- cooks.distance(model)
n <- length(res)
p <- length(coef(model))

# Create base plot
plot(h, res, 
     xlab = "Leverage (hᵢᵢ)", ylab = "Studentized Residuals",
     main = "Leverage-Residual Diagnostic Plot",
     pch = 21, bg = ifelse(cook > 4/n, "red", "gray80"),
     cex = 1.5*sqrt(cook/max(cook)))

# Add reference lines
abline(h = c(-2, 0, 2), v = c(1, 2, 3)*p/n, lty = c(2,1,2), col = "gray50")

# Highlight influential points
influential <- which(cook > 4/n)
if(length(influential) > 0) {
  text(h[influential], res[influential], 
       labels = influential, pos = 4, col = "red", cex = 0.8)
}

# Add legend
legend("topright", 
       legend = c(paste("Influential (Cook's D >", round(4/n, 3)), "Normal"),
       pch = 21, pt.bg = c("red", "gray80"), pt.cex = 1.2,
       title = "Observation Influence")
[Figure: Leverage-Residual Diagnostic Plot]

# diagonals
H <- HAT(X)
matrix(diag(H))
##       [,1]
##  [1,] 0.55
##  [2,] 0.38
##  [3,] 0.64
##  [4,] 0.36
##  [5,] 0.36
##  [6,] 0.45
##  [7,] 0.41
##  [8,] 0.62
##  [9,] 0.33
## [10,] 0.70
## [11,] 0.51
## [12,] 0.28
## [13,] 0.41
# There are no leverage points
any(diag(H) > 2*mean(diag(H))) # FALSE
## [1] FALSE

Number of parameters to be estimated: the trace (the sum of the diagonal elements) of the hat matrix equals the number of parameters (p) in the linear model:

# There are 6 parameters to be estimated
sum(diag(H))
## [1] 6

p = number of slope coefficients + 1 (for the intercept)

3.5 Residuals

Residuals and errors are fundamental concepts in regression analysis that measure deviations between observed and predicted values:

Definitions

  • Statistical Errors (\(e_i\)): The difference between an observed value and its true (but unobservable) expected value: \[ e_i = Y_i - \operatorname{E}[Y_i] \] where \(\operatorname{E}[Y_i]\) is the true population mean of \(Y_i\) (in the linear model, \(X_i\beta\)).

  • Residuals (\(\varepsilon_i\) or \(r_i\)): The difference between observed values and their estimated values from the model: \[ \varepsilon_i = Y_i - \hat{Y}_i \] where \(\hat{Y} = HY\) (the hat matrix projection).

Matrix Formulation

The vector of residuals can be expressed as: \[ \varepsilon = Y - \hat{Y} = Y - HY = (I - H)Y \]

Key Properties

  1. Mean: \[ \sum \varepsilon_i = 0 \quad \text{(for models with intercept)} \]

  2. Variance: \[ \text{Var}(\varepsilon) = \sigma^2(I - H) \]

  3. Orthogonality: \[ X^T\varepsilon = 0 \quad \text{(residuals are orthogonal to predictors)} \]
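
These properties are easy to verify numerically; the following sketch uses an illustrative lm() fit on the mtcars data (the fit_ex, e_ex, and X_ex names are placeholders, not objects used elsewhere in this project):

# Check the residual properties on a small example model
fit_ex <- lm(mpg ~ wt + hp, data = mtcars)
e_ex   <- residuals(fit_ex)          # raw residuals
X_ex   <- model.matrix(fit_ex)       # design matrix (includes the intercept)
sum(e_ex)                            # ~ 0 for a model with an intercept
crossprod(X_ex, e_ex)                # X'e ~ 0: residuals orthogonal to the predictors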

Types of Residuals

| Type | Formula | Use Case |
|------|---------|----------|
| Raw Residuals | \(\varepsilon_i = Y_i - \hat{Y}_i\) | Basic diagnostics |
| Standardized | \(\frac{\varepsilon_i}{\sigma}\) | Comparing across observations |
| Studentized | \(\frac{\varepsilon_i}{\sigma\sqrt{1-h_{ii}}}\) | Outlier detection |
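
The residual types in the table can also be obtained with base R's rstandard() and rstudent(); note that rstandard() already divides by \(\sqrt{1-h_{ii}}\), so R's naming differs slightly from the table. A minimal sketch on an illustrative mtcars model:

# Raw, scaled, and leave-one-out scaled residuals for an example model
fit_ex   <- lm(mpg ~ wt + hp, data = mtcars)
res_raw  <- residuals(fit_ex)   # raw: y - y_hat
res_std  <- rstandard(fit_ex)   # divided by sigma_hat * sqrt(1 - h_ii)
res_stud <- rstudent(fit_ex)    # uses a leave-one-out estimate of sigma
head(cbind(raw = res_raw, rstandard = res_std, rstudent = res_stud))
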
#' Calculate Regression Residuals
#' 
#' Computes the residuals from a linear regression model using matrix operations.
#' The residuals represent the difference between observed and predicted values.
#'
#' @param X A numeric matrix or data frame containing predictor variables.
#'          If Y is NULL, the last column is treated as the response variable.
#' @param Y Optional numeric vector/matrix of response values. If NULL, 
#'          the last column of X is used as the response.
#'
#' @return A numeric vector of residuals calculated as ε = (I - H)Y,
#'         where H is the hat matrix and I is the identity matrix.
#'
#' @details
#' The function:
#' \enumerate{
#'   \item Automatically handles input matrix conversion
#'   \item Separates predictors and response if Y is not provided
#'   \item Computes residuals using ε = Y - Ŷ = (I - H)Y
#' }
#' 
#' Requires the `HAT()` function to compute the hat matrix.
#'
#' @examples
#' \dontrun{
#' # Using separate X and Y
#' X <- matrix(rnorm(100), ncol = 4)
#' Y <- rnorm(25)
#' res <- epsilon(X, Y)
#' 
#' # Using combined matrix (last column as Y)
#' data <- cbind(X, Y)
#' res <- epsilon(data)
#' }
#'
#' @seealso \code{\link{residuals}} for the standard R residual function
#' @export
epsilon <- function(X, Y = NULL) {
    # Input validation and conversion
    X <- as.matrix(X)
    
    # Handle response variable
    if (is.null(Y)) {
        p <- ncol(X)
        Y <- as.matrix(X[, p])
        X <- as.matrix(X[, 1:(p - 1)])
    } else {
        Y <- as.matrix(Y)
    }
    
    # Check dimension compatibility
    if (nrow(X) != nrow(Y)) {
        stop("X and Y must have the same number of observations")
    }
    
    # Compute residuals
    n <- nrow(X)
    I <- diag(n) # identity matrix
    H <- HAT(X)  # hat matrix
    residuals <- (I - H) %*% Y
    
    return(as.vector(residuals))
}

# Save function to file
dump("epsilon", file = "epsilon.R")

Residual Comparison

Comparing residuals calculated by our epsilon() function with those from R’s built-in residuals() function applied to an equivalent glm() fit:

# Calculate residuals using our custom function
e1 <- round(epsilon(X), 4)

# Calculate residuals using glm()
k <- ncol(X)
e2 <- as.matrix(round(residuals(glm(X[, k] ~ X[, 1:(k - 1)])), 4))

# Create comparison dataframe
residual_comparison <- data.frame(
  epsilon_function = e1, 
  glm_residuals = e2,
  difference = round(e1 - e2, 4)
)

# Display results
knitr::kable(residual_comparison, 
             align = 'c',
             caption = "Comparison of Residual Calculation Methods")
Comparison of Residual Calculation Methods
|        | epsilon_function | glm_residuals | difference |
|--------|-----------------:|--------------:|-----------:|
| obs.1  |  0.00 |  0.00 | 0 |
| obs.2  |  1.51 |  1.51 | 0 |
| obs.3  | -1.67 | -1.67 | 0 |
| obs.4  | -1.73 | -1.73 | 0 |
| obs.5  |  0.25 |  0.25 | 0 |
| obs.6  |  3.93 |  3.93 | 0 |
| obs.7  | -1.45 | -1.45 | 0 |
| obs.8  | -3.17 | -3.17 | 0 |
| obs.9  |  1.38 |  1.38 | 0 |
| obs.10 |  0.28 |  0.28 | 0 |
| obs.11 |  1.99 |  1.99 | 0 |
| obs.12 |  0.97 |  0.97 | 0 |
| obs.13 | -2.29 | -2.29 | 0 |

3.6 The Covariance Matrix of the Residuals

The variance-covariance matrix of the residuals is derived as follows:

Derivation

\[ \begin{aligned} \text{cov}(\varepsilon) &= \text{cov}(Y - \hat{Y}) \\ &= \text{cov}(Y - HY) \\ &= \text{cov}[(I-H)Y] \\ &= (I-H)\text{cov}(Y)(I-H)^T \quad \text{(since cov(AX) = A cov(X) A}^T\text{)} \\ &= (I-H)\sigma^2_Y(I-H) \quad \text{(since } I-H \text{ is symmetric)} \\ &= \sigma^2_Y(I-H)(I-H) \\ &= \sigma^2_Y(I-H) \quad \text{(since } I-H \text{ is idempotent)} \end{aligned} \]

Key Properties

  • Relationship Between Y and ε Covariance: \[ \text{var}(Y_i) = \text{var}(X_i\beta + \varepsilon_i) = \text{var}(X_i\beta) + \text{var}(\varepsilon_i) + 2\text{cov}(X_i\beta, \varepsilon_i) \] where:

  • \(\text{var}(X_i\beta) = 0\) (\(X_i\) and \(\beta\) are fixed, so \(X_i\beta\) is a constant)

  • \(\text{cov}(X_i\beta, \varepsilon_i) = 0\) (orthogonality condition)

Therefore: \[ \text{var}(Y) = \text{var}(\varepsilon) \]

  • Diagonal Elements: The variance of individual residuals: \[ \text{var}(\varepsilon_i) = \sigma^2(1 - h_{ii}) \] where \(h_{ii}\) are the diagonal elements of the hat matrix.
#' Compute Residual Covariance Matrix
#' 
#' Calculates the variance-covariance matrix of residuals for a linear regression model,
#' following the theoretical form: Σ = σ²(I - H)
#'
#' @param data A matrix or data frame where the last column contains the response variable
#'            and preceding columns contain predictors
#'
#' @return A n×n matrix where n is the number of observations, with:
#'         - Diagonal elements: var(ε_i) = σ²(1 - h_ii)
#'         - Off-diagonal elements: cov(ε_i,ε_j) = -σ²h_ij
#'
#' @details
#' The implementation:
#' \enumerate{
#'   \item Separates predictors (X) from response (y)
#'   \item Computes hat matrix H = X(X'X)⁻¹X'
#'   \item Estimates residual variance σ²
#'   \item Returns σ²(I - H)
#' }
#'
#' @examples
#' \dontrun{
#' data <- cbind(matrix(rnorm(100), ncol=4), rnorm(25))
#' cov_mat <- cov.matrix(data)
#' round(cov_mat, 3)
#' }
#' 
#' @seealso \code{\link{vcov}} for the parameter covariance matrix
#' @export
cov.matrix <- function(data) {
    # Input validation
    if(!is.matrix(data) && !is.data.frame(data)) {
        stop("Input must be a matrix or data frame")
    }
    if(ncol(data) < 2) {
        stop("Data must contain at least one predictor and one response")
    }
    
    # Separate predictors and response
    X <- as.matrix(data[, -ncol(data), drop=FALSE])
    y <- as.matrix(data[, ncol(data), drop=FALSE])
    
    # Check for NA values
    if(any(is.na(cbind(X, y)))) {
        warning("Data contains NA values - results may be unreliable")
    }
    
    # Compute components
    n <- nrow(X)
    I <- diag(n)
    H <- HAT(X)
    residuals <- epsilon(X, y)
    sigma2 <- sum(residuals^2)/(n - ncol(X) - 1)  # MSE estimator
    
    # Return covariance matrix
    sigma2 * (I - H)
}
# Example usage with X dataset
if(exists("X")) {
    round(cov.matrix(X), 1)
}
##        obs.1 obs.2 obs.3 obs.4 obs.5 obs.6 obs.7 obs.8 obs.9 obs.10 obs.11
## obs.1    2.7  -1.4   0.9  -1.0  -1.9   0.0   0.8  -0.4   0.8    0.0    0.1
## obs.2   -1.4   4.0  -0.8  -1.1  -0.7  -0.1   0.1  -0.9  -1.0    1.0   -0.5
## obs.3    0.9  -0.8   2.5  -1.3   0.0  -1.1  -0.3   0.8  -1.6   -0.4    0.9
## obs.4   -1.0  -1.1  -1.3   4.2  -0.1  -0.6   0.9  -0.4  -0.2   -1.3    0.0
## obs.5   -1.9  -0.7   0.0  -0.1   3.8  -0.3  -0.6   0.6   0.0    0.7    0.7
## obs.6    0.0  -0.1  -1.1  -0.6  -0.3   5.2  -0.5   0.0  -0.5   -1.0    0.0
## obs.7    0.8   0.1  -0.3   0.9  -0.6  -0.5   3.8  -0.3  -1.3    0.8   -0.8
## obs.8   -0.4  -0.9   0.8  -0.4   0.6   0.0  -0.3   3.5  -0.6   -0.4   -2.4
## obs.9    0.8  -1.0  -1.6  -0.2   0.0  -0.5  -1.3  -0.6   4.2    0.8   -0.7
## obs.10   0.0   1.0  -0.4  -1.3   0.7  -1.0   0.8  -0.4   0.8    1.8   -0.4
## obs.11   0.1  -0.5   0.9   0.0   0.7   0.0  -0.8  -2.4  -0.7   -0.4    3.4
## obs.12  -0.1   0.7   0.2   0.4  -0.8  -0.7  -1.2   0.1   0.0   -1.0   -0.3
## obs.13  -0.4   0.5   0.2   0.5  -1.3  -0.6  -1.3   0.4   0.0   -0.5    0.0
##        obs.12 obs.13
## obs.1    -0.1   -0.4
## obs.2     0.7    0.5
## obs.3     0.2    0.2
## obs.4     0.4    0.5
## obs.5    -0.8   -1.3
## obs.6    -0.7   -0.6
## obs.7    -1.2   -1.3
## obs.8     0.1    0.4
## obs.9     0.0    0.0
## obs.10   -1.0   -0.5
## obs.11   -0.3    0.0
## obs.12    4.4   -1.6
## obs.13   -1.6    4.2

Properties of the Residual Covariance Matrix

  1. No Autocorrelation (Zero Off-Diagonal Elements)

For the underlying errors in the OLS model we assume: \[ \text{cov}(\varepsilon_i, \varepsilon_j) = 0 \quad \text{for all } i \neq j \]

For the fitted residuals the off-diagonal covariances are \(-\sigma^2 h_{ij}\), which are typically small, so the covariance matrix is approximately diagonal: \[ \text{cov}(\varepsilon) \approx \sigma^2 \begin{pmatrix} 1-h_{11} & 0 & \cdots & 0 \\ 0 & 1-h_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1-h_{nn} \end{pmatrix} \]

  2. Homoskedasticity (Constant Diagonal Elements)

Under homoskedasticity: \[ \text{var}(\varepsilon_i) = \sigma^2(1-h_{ii}) \approx \sigma^2 \quad \text{(since } h_{ii} \text{ typically small)} \]

Resulting in approximately equal diagonal elements: \[ \text{cov}(\varepsilon) \approx \sigma^2I_n \]

Practical Implications

# Check diagonality (off-diagonal elements near zero)
cov_resid <- cov.matrix(X)
off_diag <- cov_resid[lower.tri(cov_resid)] 
cat("Max off-diagonal:", max(abs(off_diag)), "\n")
## Max off-diagonal: 2.4
# Check homoskedasticity (equal diagonal elements)
diag_elements <- diag(cov_resid)
cat("Diagonal range:", range(diag_elements), "\n")
## Diagonal range: 1.8 5.2
cat("Ratio max/min:", max(diag_elements)/min(diag_elements), "\n")
## Ratio max/min: 2.9

4 Analysis of Variance (ANOVA)

ANOVA is a statistical method used to compare means across two or more groups. It is considered a special case of linear regression and is based on the law of total variance, which partitions the observed variance in a variable into components attributable to different sources of variation.

At its core, ANOVA tests the null hypothesis that all group means are equal. It generalizes the t-test to more than two groups and allows researchers to determine whether any observed differences between sample means are statistically significant.


4.1 Sources of Variation

ANOVA partitions total variance into two main components:

  • Between-group variation (variation due to the model)
  • Within-group variation (residual or error variation)

4.1.1 Between-Group Variation (Model Sum of Squares)

The between-group variation, also called the model sum of squares \(SS_{\text{Between}}\), reflects differences among the group means. It measures how much the group means deviate from the grand mean (the mean of all observations combined).

This variation is large when group means are far from the grand mean, and small when group means are similar.

\[ SS_{\text{Between}} = \sum_{k=1}^{g} n_k(\bar{x}_k - \bar{x}_{\text{grand}})^2 \]

where:

  • \(g\) is the number of groups
  • \(n_k\) is the number of observations in group \(k\)
  • \(\bar{x}_k\) is the mean of group \(k\)
  • \(\bar{x}_{\text{grand}}\) is the grand mean (overall mean across all groups)
This component captures the variation explained by the model, i.e., how much of the total variation is due to group membership; a direct numerical illustration of the formula follows.
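
As a sketch of the formula on an illustrative one-way layout (mpg grouped by cyl from the mtcars data, not the X matrix analysed above):

# Between-group sum of squares for mpg with cyl as the grouping factor
y_ex   <- mtcars$mpg
grp_ex <- factor(mtcars$cyl)
nk_ex  <- tapply(y_ex, grp_ex, length)            # group sizes n_k
mk_ex  <- tapply(y_ex, grp_ex, mean)              # group means x_bar_k
SSB_ex <- sum(nk_ex * (mk_ex - mean(y_ex))^2)
SSB_ex
anova(lm(y_ex ~ grp_ex))["grp_ex", "Sum Sq"]      # same value from R's anova()
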
# Sum of Squares Between Groups (SSB) Function Documentation
#' Calculate Between-Group Sum of Squares
#'
#' Computes the sum of squares between groups (SSB) for a linear model,
#' which measures variation between group means.
#'
#' @param X A numeric matrix or data frame containing predictor variables.
#'          If Y is NULL, the last column is treated as the response variable.
#' @param Y Optional response variable vector/matrix. If NULL, 
#'          the last column of X is used as the response.
#'
#' @return The between-group sum of squares (SSB) as a scalar value.
#'
#' @details
#' The calculation uses the hat matrix to project deviation scores:
#' \[ SSB = (HY)^T(HY) \]
#' where:
#' \itemize{
#'   \item \eqn{H} is the hat matrix \eqn{X(X^TX)^{-1}X^T}
#'   \item \eqn{Y} contains deviation scores (centered around the mean)
#' }
#'
#' @examples
#' \dontrun{
#' # Using built-in dataset
#' data(mtcars)
#' SSB(cbind(mtcars$wt, mtcars$hp), mtcars$mpg)  # Explicit X and Y
#' 
#' # Using matrix with response as last column
#' mat <- cbind(mtcars$wt, mtcars$hp, mtcars$mpg)
#' SSB(mat)  # Automatically detects last column as Y
#' }
#'
#' @seealso \code{\link{SST}} for total sum of squares, 
#'          \code{\link{SSW}} for within-group variation
#' @export
SSB <- function(X, Y = NULL){
    # Load required functions
    source("deviation_scores.R")  # For centering the response
    source("HAT.R")               # For hat matrix calculation
  
    # Input validation and processing
    X <- as.matrix(X)
    p <- ncol(X)
    
    if (is.null(Y)) {
        Y <- as.matrix(X[, p])
        X <- as.matrix(X[, 1:(p - 1)])
    } else {
        Y <- as.matrix(Y)
    }
    
    # Verify dimensions
    if (nrow(X) != nrow(Y)) {
        stop("X and Y must have the same number of observations")
    }
    
    # Calculate SSB
    centered_Y <- deviation_scores(Y)  # Center Y around its mean
    projected_Y <- HAT(X) %*% centered_Y
    t(projected_Y) %*% projected_Y    # Sum of squared projections
}

# Save function to file
dump("SSB", file = "SSB.R")
# Calculate SSB
(SSb <- SSB(X))
##      [,1]
## [1,] 2668

Interpretation:

| SSB Value | Interpretation | Statistical Meaning |
|-----------|----------------|---------------------|
| Large SSB | Strong between-group differences | High group separation quality |
| Small SSB | Minimal between-group variability | Potential weak treatment effects |

Diagnostic Guidance:

  1. For Large SSB:
    • Indicates significant group differences
    • Treatments/groupings are effective

  2. For Small SSB: suggests possible issues:

    • Weak or non-existent treatment effects
    • Poor group categorization
    • Measurement errors in grouping variable

4.1.2 Within-Group Variation (SSW)

The within-group sum of squares (SSW), also known as the residual sum of squares or error sum of squares, quantifies the remaining variation in the data after accounting for the model’s explanatory variables.

Interpretation:

  • Measures unexplained variability in the model
  • Lower values indicate better model fit
  • Forms the denominator in ANOVA’s F-test

Mathematical Definition

\[ SSW = \sum_{k=1}^{K} \sum_{i=1}^{n_k} (x_{ik} - \bar{x}_k)^2 = \sum_{k=1}^{K} s_k^2(n_k - 1) \]

where:

  • \(K\) = number of groups
  • \(n_k\) = number of observations in group \(k\)
  • \(\bar{x}_k\) = mean of group \(k\)
  • \(s_k^2\) = variance of group \(k\)
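
A matching sketch of the within-group formula, again using mpg grouped by cyl from mtcars as an illustrative one-way layout:

# Within-group sum of squares: sum over groups of (n_k - 1) * s_k^2
y_ex   <- mtcars$mpg
grp_ex <- factor(mtcars$cyl)
nk_ex  <- tapply(y_ex, grp_ex, length)            # group sizes
s2_ex  <- tapply(y_ex, grp_ex, var)               # group variances s_k^2
SSW_ex <- sum((nk_ex - 1) * s2_ex)
SSW_ex
deviance(lm(y_ex ~ grp_ex))                       # residual SS from the one-way ANOVA fit
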
#' Calculate Within-Group Sum of Squares (SSW)
#' 
#' Computes the residual sum of squares, representing variation within groups.
#' This is the unexplained variation after accounting for group differences.
#'
#' @param X A numeric matrix or data frame containing predictor variables.
#'          If Y is NULL, the last column is treated as the response variable.
#' @param Y Optional response variable vector/matrix. If NULL, 
#'          the last column of X is used as the response.
#'
#' @return The within-group sum of squares (SSW) as a scalar value.
#'
#' @details
#' The calculation uses residuals from the linear model:
#' \[ SSW = \varepsilon^T\varepsilon = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]
#' where \(\varepsilon\) are the residuals from the model \(Y = X\beta + \varepsilon\).
#'
#' @examples
#' \dontrun{
#' # Using built-in dataset
#' data(mtcars)
#' SSW(cbind(mtcars$wt, mtcars$hp), mtcars$mpg)
#' 
#' # Using matrix with response as last column
#' mat <- cbind(mtcars$wt, mtcars$hp, mtcars$mpg)
#' SSW(mat)
#' }
#'
#' @seealso \code{\link{SSB}} for between-group variation, 
#'          \code{\link{SST}} for total variation
#' @export
SSW <- function(X, Y = NULL){
    # Load required functions
    source("epsilon.R")  # For residual calculation
    
    # Input validation and processing
    X <- as.matrix(X)
    p <- ncol(X)
    
    if (is.null(Y)) {
        Y <- as.matrix(X[, p])
        X <- as.matrix(X[, 1:(p - 1)])
    } else {
        Y <- as.matrix(Y)
    }
    
    # Verify dimensions
    if (nrow(X) != nrow(Y)) {
        stop("X and Y must have the same number of observations")
    }
    
    # Calculate residuals and SSW
    e <- epsilon(X, Y)
    return(t(e) %*% e)
}

# Save function to file
dump("SSW", file = "SSW.R")
# Calculate SSW for X dataset
(SSw <- SSW(X))
##      [,1]
## [1,]   48

Interpretation:

| SSW Value | Interpretation | Statistical Meaning |
|-----------|----------------|---------------------|
| Small SSW | Groups are internally homogeneous | High model fit quality |
| Large SSW | Significant within-group variability | Potential model misspecification |

Diagnostic Guidance:

  1. For Small SSW:
    • Indicates good model fit
    • Groups are well-separated

  2. For Large SSW: suggests possible issues:

    • Missing important predictors
    • Non-linear relationships
    • Heteroscedasticity

| SSW/SST Ratio | Magnitude | Interpretation | Recommended Actions |
|---------------|-----------|----------------|---------------------|
| <20% | Small SSW | Excellent group separation | Verify no overfitting |
| 20-50% | Moderate SSW | Acceptable but could be improved | Check for additional predictors |
| >50% | Large SSW | Poor group differentiation | Re-examine model specification |

4.1.3 Total Variation (SST)

The Total Sum of Squares (SST) measures the total variation in the data by summing the squared differences between each observation and the grand mean.

  • Composition: \[ SST = SSB + SSW \]
    • SSB: Variation between group means
    • SSW: Variation within groups
  • Interpretation:
    • Large SSB relative to SSW → Significant group differences
    • Small SSB relative to SSW → Minimal group differences

Mathematical Definition

\[ SST = \sum_{i=1}^n (x_i - \bar{x}_{grand})^2 \]

where:

  • \(n\) = total number of observations
  • \(x_i\) = individual observation
  • \(\bar{x}_{grand}\) = grand mean of all data
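
The decomposition \(SST = SSB + SSW\) can be checked numerically with the same illustrative grouping (mpg by cyl in mtcars); a minimal sketch:

# Verify SST = SSB + SSW for the illustrative one-way layout
y_ex   <- mtcars$mpg
grp_ex <- factor(mtcars$cyl)
mk_ex  <- tapply(y_ex, grp_ex, mean)
nk_ex  <- tapply(y_ex, grp_ex, length)
SST_ex <- sum((y_ex - mean(y_ex))^2)
SSB_ex <- sum(nk_ex * (mk_ex - mean(y_ex))^2)
SSW_ex <- sum((y_ex - mk_ex[grp_ex])^2)           # each value minus its own group mean
all.equal(SST_ex, SSB_ex + SSW_ex)                # TRUE
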
#' Calculate Total Sum of Squares (SST)
#' 
#' Computes the total variation in the data as the sum of squared deviations
#' from the grand mean.
#'
#' @param Y A numeric vector or matrix of response values
#' @return A scalar value representing the total sum of squares
#'
#' @details
#' The calculation uses matrix algebra:
#' \[ SST = Y^T(I - \frac{1}{n}J)Y \]
#' where:
#' \itemize{
#'   \item \( I \) is the identity matrix
#'   \item \( J \) is a matrix of ones
#'   \item \( n \) is the number of observations
#' }
#'
#' @examples
#' \dontrun{
#' # Using built-in dataset
#' SST(mtcars$mpg)
#' 
#' # Using matrix input
#' mat <- matrix(rnorm(100), ncol=4)
#' SST(mat)
#' }
#'
#' @seealso \code{\link{SSB}} for between-group variation, 
#'          \code{\link{SSW}} for within-group variation
#' @export
SST <- function(Y){
    # Input validation
    if(!is.numeric(Y)) stop("Input must be numeric")
    Y <- as.matrix(Y)
    n <- nrow(Y)
    
    # Create matrices
    J <- matrix(1, n, n)  # Matrix of ones
    I <- diag(n)          # Identity matrix
    H <- (1/n) * J        # Centering matrix
    
    # Calculate SST
    t(Y) %*% (I - H) %*% Y
}

# Save function to file
dump("SST", file = "SST.R")
(SSt <- SST(X[,5]))
##      [,1]
## [1,] 2716
# check
SSb + SSw
##      [,1]
## [1,] 2716

Interpretation of Results

| SST Value | Data Characteristics | Typical Context |
|-----------|----------------------|-----------------|
| Large | High overall variability in data | Diverse populations; wide measurement range |
| Small | Low overall variability | Homogeneous populations; tight measurement clusters |

4.2 Degrees of Freedom in ANOVA

Degrees of freedom (df) represent the number of independent pieces of information available for estimating population parameters. In ANOVA, they partition the total variation into between-group and within-group components.

\[ df_{Total} = df_{Between} + df_{Within} \]

p <- ncol(X)  # Number of variables  
n <- nrow(X)  # Number of cases  

dfB <- p - 1  # Between-group degrees of freedom  
dfB 
## [1] 4
dfW <- n - p  # Within-group degrees of freedom  
dfW  
## [1] 8
dfT <- n - 1  # Total degrees of freedom  
dfT 
## [1] 12

| Component | Formula | Interpretation | Example Value |
|-----------|---------|----------------|--------------:|
| Total | \(n - 1\) | Total variability estimation | 12 |
| Between | \(p - 1\) | Group mean differences estimation | 4 |
| Within | \(n - p\) | Error variance estimation | 8 |

4.3 Mean Squares

Mean Square Between Groups \((MS_{Between})\)

This represents the variance attributed to the differences between the sample groups. It is calculated by dividing the between-group variation \((SS_{B})\) by its corresponding degrees of freedom.

MSB <- SSb/dfB
MSB # Mean square between
##      [,1]
## [1,]  667

Mean Square Within Groups \((MS_{Within})\)

This quantifies the variance due to differences within individual groups (i.e., random error). It is computed as the within-group variation \((SS_{W})\) divided by its degrees of freedom, effectively a weighted average of group variances.

MSW <- SSw / dfW  
MSW  # Mean square within 
##      [,1]
## [1,]    6

The Mean Squared Error (MSE) is equivalent to \(MS_{\text{Within}}\), as both estimate the within-group variability. The Root Mean Squared Error (RMSE) is the square root of the MSE:

MSE <- (t(e1) %*% e1) / dfW  # Mean squared error (same as MSW)  

RMSE <- sqrt(MSE)  

data.frame(
  Metric = c("Mean Squared Error (MSE)", "Root Mean Squared Error (RMSE)"),
  Value = c(MSE, RMSE))
##                           Metric Value
## 1       Mean Squared Error (MSE)   6.0
## 2 Root Mean Squared Error (RMSE)   2.4

4.4 The F-Statistic

The F-statistic is a ratio of two variances (mean squares). Variance measures data dispersion - how far points are scattered from the mean, with larger values indicating greater dispersion.

The F-statistic is calculated as:

\[ F = \frac{MS_{Between}}{MS_{Within}} \]

Hypothesis Test

  • Null Hypothesis: \(H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0\) (All regression coefficients are zero)
  • Alternative Hypothesis: \(H_a: \text{At least one } \beta_j \neq 0 \text{ for } j=1,\ldots,p-1\) (At least one coefficient is non-zero)

Decision Rule

Reject \(H_0\) if:

  • F > F-critical (the critical value from F-distribution), or
  • p-value \(< \alpha\) (typically \(\alpha = 0.05\))
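
A sketch of this decision rule in R, assuming the mean squares and degrees of freedom computed above (MSB, MSW, dfB, dfW) are available; F_ex and F_crit are just illustrative names:

# Decision rule at alpha = 0.05, using the mean squares computed above
alpha_ex <- 0.05
F_ex     <- MSB / MSW                              # F statistic (same as F_value below)
F_crit   <- qf(1 - alpha_ex, df1 = dfB, df2 = dfW) # upper critical value of the F distribution
F_crit
F_ex > F_crit                                      # TRUE -> reject H0
pf(F_ex, dfB, dfW, lower.tail = FALSE) < alpha_ex  # the same decision via the p-value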

The F-statistic compares between-group variance to within-group variance:

F_value <- MSB / MSW
F_value  
##      [,1]
## [1,]  111

The p-value is calculated from the F-distribution with (dfB, dfW) degrees of freedom:

p_value <- pf(F_value, df1 = dfB, df2 = dfW, lower.tail = FALSE)
p_value 
##         [,1]
## [1,] 4.8e-07

Interpretation:

  • \(F = 111\) indicates strong evidence against the null hypothesis
  • \(p = 4.8 × 10^{-7}\) (highly significant at \(\alpha = 0.05\))

4.4.1 F-Distribution

The F-distribution (also known as Snedecor’s F distribution or the Fisher-Snedecor distribution) is a continuous probability distribution that arises frequently as the null distribution of test statistics, particularly in analysis of variance (ANOVA). It is named after Ronald Fisher and George W. Snedecor.

Parameters - Degrees of freedom: df₁ > 0, df₂ > 0

The PDF of the F-distribution is given by:

\[ f(x; \nu_1, \nu_2) = \frac{ \sqrt{\frac{(\nu_1x)^{\nu_1}\nu_2^{\nu_2}}{(\nu_1x + \nu_2)^{\nu_1+\nu_2}}} }{ xB\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right) } \]

where:

  • \(x > 0\)
  • \(B\) is the beta function

Key Properties:

  • Expected Value: \[ E(X) = \frac{df_2}{df_2 - 2} \quad \text{for } df_2 > 2 \]
  • Variance: \[ Var(X) = \frac{2df_2^2(df_1 + df_2 - 2)}{df_1(df_2 - 2)^2(df_2 - 4)} \quad \text{for } df_2 > 4 \]
  • Skewness: \[ \text{Skewness} = \frac{(2df_1 + df_2 - 2)\sqrt{8(df_2 - 4)}}{(df_2 - 6)\sqrt{df_1(df_1 + df_2 - 2)}} \quad \text{for } df_2 > 6 \]
  • Excess Kurtosis: (Computed as kurtosis minus 3)
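
These moment formulas can be checked against a simulated F sample; a minimal sketch with illustrative degrees of freedom (df1 = 4, df2 = 10, chosen so that both moments exist):

# Theoretical mean and variance of the F distribution vs. Monte Carlo estimates
df1_ex <- 4; df2_ex <- 10
mean_theory <- df2_ex / (df2_ex - 2)
var_theory  <- 2 * df2_ex^2 * (df1_ex + df2_ex - 2) /
               (df1_ex * (df2_ex - 2)^2 * (df2_ex - 4))
set.seed(1)
f_sim <- rf(1e5, df1_ex, df2_ex)
c(theoretical = mean_theory, simulated = mean(f_sim))
c(theoretical = var_theory,  simulated = var(f_sim))
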
#' F-Distribution Probability Density Function
#'
#' Calculates the probability density function (PDF) of the F-distribution
#' for given quantiles and degrees of freedom.
#'
#' @param x Numeric vector of quantiles. If NULL (default), a fine grid from
#'           0.001 to 10.1 is used and a two-column matrix of x and PDF values is returned.
#' @param v1 Numerator degrees of freedom (must be positive).
#' @param v2 Denominator degrees of freedom (must be positive).
#'
#' @return A numeric vector of PDF values if x is provided, or a matrix with 
#'         x values and corresponding PDF values if x is NULL.
#'
#' @examples
#' # Single value
#' f_pdf(2, v1 = 5, v2 = 2)
#' 
#' # Vector of values
#' f_pdf(c(1, 2, 3), v1 = 5, v2 = 2)
#' 
#' # Default sequence
#' plot(f_pdf(v1 = 5, v2 = 2), type = "l")
#' 
#' @export
f_pdf <-
  function(x=NULL, v1, v2){
    # probability density function
    f <- function(x, v1, v2){
      sqrt((v1*x)^v1*v2^v2/(v1*x + v2)^(v1+v2))/(x*base::beta(v1/2, v2/2))
    }
    
    if(is.null(x)){
      x <- seq(from=0.001, to=10.1, by=0.001)
      y <- cbind(x, mapply(f, x, v1, v2))
    } else {
      y <- mapply(f, x, v1, v2)
    }
    
    return(y)
  }

# Save function
dump("f_pdf", file = "f_pdf.R")
# --------------------------------------------
# F-Distribution PDF Function Test Results
# --------------------------------------------

# 1. Single Value Test
#    Calculate PDF at x=2 with df1=5, df2=2
cat("Single value test:\n")
## Single value test:
f_pdf(2, v1 = 5, v2 = 2)
## [1] 0.13
# Output: [1] 0.13  (≈ 0.132 at full precision)
# Interpretation: the density at x = 2 is f(2) ≈ 0.13 (a density, not a probability)

# 2. Vector Input Test
#    Calculate PDF for x=[1,2,3] with df1=5, df2=2
cat("\nVector input test:\n")
## 
## Vector input test:
f_pdf(c(1, 2, 3), v1 = 5, v2 = 2)
## [1] 0.308 0.132 0.072
# Output: [1] 0.308 0.132 0.072
# Interpretation: density values at several points

# 3. Edge Case Tests
cat("\nEdge case tests:\n")
## 
## Edge case tests:
# a) Empty vector
empty_result <- f_pdf(numeric(0), v1 = 5, v2 = 2)
cat("Empty input returns:", length(empty_result), "elements\n")
## Empty input returns: 0 elements
# Output: Empty input returns: 0 elements
# Interpretation: Handles empty input gracefully

# b) Missing value
na_result <- f_pdf(NA, v1 = 5, v2 = 2)
cat("NA input returns:", na_result, "\n")
## NA input returns: NA
# Output: NA input returns: NA
# Interpretation: Proper NA propagation

# c) Mixed values (negative, zero, positive)
mixed_result <- f_pdf(-1:3, v1 = 5, v2 = 2)
cat("Mixed input (-1:3) returns:\n")
## Mixed input (-1:3) returns:
print(mixed_result)
## [1] -5.977    NaN  0.308  0.132  0.072
# Output: [1] -5.977    NaN  0.308  0.132  0.072
# Interpretation: the PDF is only defined for x > 0; non-positive inputs give
#                 meaningless values, so a guard such as ifelse(x > 0, ..., 0) could be added

# --------------------------------------------
# Note: All tests complete without errors
# --------------------------------------------
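
As an additional cross-check (not part of the tests above), the custom density can be compared with R's built-in df() on a few positive quantiles, assuming f_pdf() is defined as above:

# Cross-check the custom F PDF against stats::df() for positive x
x_ex <- c(0.5, 1, 2, 5, 10)
all.equal(as.numeric(f_pdf(x_ex, v1 = 5, v2 = 2)),
          df(x_ex, df1 = 5, df2 = 2))              # TRUE up to numerical error
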
#' F-Distribution Visualization
#'
#' Creates a professional plot of the F-distribution with shaded critical regions
#' and test statistic annotation.
#'
#' @param fstat Calculated F-statistic value
#' @param df1 Numerator degrees of freedom
#' @param df2 Denominator degrees of freedom
#' @param p Calculated p-value
#' @param alpha Significance level (default = 0.05)
#' @param n Number of points for smooth curve (default = 1000)
#'
#' @return A ggplot object showing the F-distribution with critical regions
#' @export
#'
#' @examples
#' f_plot(fstat = 5, df1 = 10, df2 = 5, p = 0.05)
f_plot <- function(fstat, df1, df2, p, alpha = 0.05, n = 1000) {
  
  # Calculate critical value and prepare data
  critical <- qf(1 - alpha, df1, df2)
  x <- seq(from = 0, to = max(critical + 2, fstat * 1.2), length.out = n)
  y <- df(x, df1, df2)
  plot_data <- data.frame(x = x, y = y)
  
  # Modern color palette
  palette <- list(
    fill_accept = "#2CA58DFF",       # Vibrant teal
    fill_reject = "#F46197FF",       # Elegant pink
    curve_color = "#3A3A3AFF",       # Dark gray
    stat_line = "#FF7F0EFF",         # Orange
    crit_line = "#9467BDFF",         # Purple
    text_dark = "#1A1A1AFF",         # Near-black
    text_light = "#FFFFFFFF",        # White
    bg_color = "#F5F5F5FF"           # Light gray
  )
  
  # Create plot
  ggplot(plot_data, aes(x = x, y = y)) +
    # F-distribution curve
    geom_line(color = palette$curve_color, linewidth = 1.2) +
    
    # Critical regions with gradient fills
    geom_area(data = subset(plot_data, x <= critical),
              aes(fill = "Acceptance"), alpha = 0.7) +
    geom_area(data = subset(plot_data, x >= critical),
              aes(fill = "Rejection"), alpha = 0.7) +
    
    # Critical value line
    geom_vline(xintercept = critical, 
               color = palette$crit_line, 
               linewidth = 1, 
               linetype = "dashed") +
    
    # F-statistic line
    geom_vline(xintercept = fstat, 
               color = palette$stat_line, 
               linewidth = 1.5) +
    
    # Annotations
    annotate("label", x = fstat, y = max(y) * 0.9,
             label = sprintf("F = %.2f\np = %.3f", fstat, p),
             hjust = ifelse(fstat > critical, 1.1, -0.1),
             fill = palette$text_light,
             color = palette$text_dark,
             size = 4) +
    
    annotate("label", x = critical, y = max(y) * 0.7,
             label = sprintf("Critical = %.2f", critical),
             hjust = 1.1,
             fill = palette$text_light,
             color = palette$crit_line,
             size = 4) +
    
    # Color scaling
    scale_fill_manual(values = c(
      "Acceptance" = palette$fill_accept,
      "Rejection" = palette$fill_reject
    )) +
    
    # Theme and labels
    labs(title = bquote(bold(F[.(df1)*","*.(df2)]~Distribution)),
         x = "F Value",
         y = "Density",
         fill = "Region") +
    
    theme_minimal(base_size = 13) +
    theme(
      plot.title = element_text(size = 16, face = "bold", hjust = 0.5, 
                               margin = margin(b = 15)),
      panel.background = element_rect(fill = palette$bg_color, color = NA),
      plot.background = element_rect(fill = "white", color = NA),
      panel.grid.major = element_line(color = "white", linewidth = 0.3),
      panel.grid.minor = element_blank(),
      legend.position = "top",
      legend.justification = "right",
      axis.title = element_text(face = "bold"),
      axis.text = element_text(color = palette$text_dark)
    )
}

# Save function
dump("f_plot", file = "f_plot.R")
# Test the function
f_plot(fstat = 5, df1 = 10, df2 = 5, p = 0.05)

f_plot(fstat = 1.5, df1=60, df2=60, p=0.05)

4.4.2 F-Test

An F-test is a statistical test in which the test statistic follows an F-distribution under the null hypothesis. It is commonly used to compare statistical models fitted to the same dataset to determine which model best explains the variability in the population. Exact F-tests typically arise when models are fitted using least squares estimation.

#' Perform Analysis of Variance (ANOVA)
#'
#' Conducts ANOVA testing and returns results with optional visualization.
#'
#' @param X A matrix or data frame containing the predictor variables. If Y is NULL,
#'          the last column will be used as the response variable.
#' @param Y Optional response variable vector/matrix. If provided, X should contain
#'          only predictor variables.
#' @param plot Logical indicating whether to generate an F-distribution plot 
#'            (default = TRUE).
#' @param alpha Significance level for critical value calculation (default = 0.05).
#'
#' @return Invisibly returns a list containing:
#' \itemize{
#'   \item df_between: Degrees of freedom between groups
#'   \item df_within: Degrees of freedom within groups
#'   \item ss_between: Sum of squares between groups
#'   \item ss_within: Sum of squares within groups
#'   \item ms_between: Mean square between groups
#'   \item ms_within: Mean square within groups
#'   \item f_statistic: Calculated F-statistic
#'   \item p_value: Associated p-value
#' }
#' 
#' @examples
#' \dontrun{
#' # Using built-in dataset
#' data(mtcars)
#' ANOVA(mtcars[, c("cyl", "mpg")])
#' 
#' # With separate X and Y
#' X <- matrix(rnorm(100), ncol = 2)
#' Y <- rnorm(50)
#' ANOVA(X, Y)
#' }
#' 
#' @export
ANOVA <- function(X, Y = NULL, plot = TRUE, alpha = 0.05) {
    source("ANOVA_functions.R")
  
   # Input validation
  if (!is.matrix(X) && !is.data.frame(X)) {
    stop("X must be a matrix or data frame")
  }
  if (alpha <= 0 || alpha >= 1) {
    stop("Alpha must be between 0 and 1")
  }
  
  # Convert to matrix format
  X <- as.matrix(X)
  n <- nrow(X)
  
  # Handle Y if not provided
  if (is.null(Y)) {
    p <- ncol(X)
    if (p < 2) stop("X must have at least 2 columns when Y is not provided")
    Y <- as.matrix(X[, p])
    X <- as.matrix(X[, 1:(p-1)])
  } else {
    Y <- as.matrix(Y)
    if (nrow(Y) != n) stop("X and Y must have the same number of observations")
  }
  
  # Check for complete cases
  complete_cases <- stats::complete.cases(X, Y)
  if (sum(!complete_cases) > 0) {
    warning(paste(sum(!complete_cases), "incomplete cases removed"))
    X <- X[complete_cases, , drop = FALSE]
    Y <- Y[complete_cases, , drop = FALSE]
    n <- nrow(X)
  }
  
  # Combine data for SS calculations
  data <- cbind(X, Y)
  
  # Degrees of freedom
  p <- ncol(X) + 1
  df1 <- p - 1     # Between groups df
  df2 <- n - p     # Within groups df
  
  # Calculate sums of squares
  ssb <- SSB(data)  # Between-group variation
  ssw <- SSW(data)  # Within-group variation
  
  # Mean squares
  msb <- ssb / df1
  msw <- ssw / df2
  
  # F-statistic and p-value
  f_stat <- msb / msw
  p_value <- stats::pf(f_stat, df1, df2, lower.tail = FALSE)
  
  # Generate ANOVA table
  anova_table(df1, ssb, msb, f_stat, p_value, df2, ssw, msw)
  
  # Plot if requested
  if (plot) {
    if (requireNamespace("ggplot2", quietly = TRUE)) {
      print(f_plot(f_stat, df1, df2, p_value, alpha))
    } else {
      warning("ggplot2 not available - skipping plot")
    }
  }
  
  # Return invisible list with all results
  invisible(list(
    df_between = df1,
    df_within = df2,
    ss_between = ssb,
    ss_within = ssw,
    ms_between = msb,
    ms_within = msw,
    f_statistic = f_stat,
    p_value = p_value
  ))
  
}

# Save function to file
dump("ANOVA", file = "ANOVA.R")
ANOVA(X, plot = FALSE)
## Warning in data[, ncol(data)] - group_means: longer object length is not a
## multiple of shorter object length
##            Df   Sum Sq   Mean Sq   F value   Pr(>F)
## X           4   276246.8     69061.7        7.5  0.0082 * 
## Residuals   8     73674.7      9209.3 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.5 The Coefficient of Determination \((R^2)\)

The coefficient of multiple correlation, denoted as R, is defined as the Pearson correlation coefficient between the predicted and actual values of the dependent variable in a linear regression model that includes an intercept.

The coefficient of determination, \((R^2)\), represents the proportion of variance in the dependent variable that is explained by the independent variables. It measures how well the statistical model (whether a line, curve, or more complex function) fits the observed data. An \((R^2)\) value of 1 indicates a perfect fit, meaning the regression model explains all the variability in the dependent variable.

The unexplained proportion of variance, also known as the coefficient of alienation, is given by \((1-R^2)\). It equals the ratio of the sum of squared residuals (the differences between observed and predicted values) to the total sum of squares, i.e., \(SSW/SST\).

Y.hat <- HAT(X) %*% Y  # Predicted Y values using the hat matrix
R2 <- cor(Y, Y.hat)^2  # Coefficient of determination (R²)
R2
##      [,1]
## [1,]    1

The value printed above is exactly 1 because HAT(X) here is built from all five columns of X, including the response, so \(\hat{Y} = HY\) reproduces \(Y\) exactly (as already seen in the prediction-comparison table). Using the sums of squares from the regression on the four predictors gives the model’s \(R^2\):

SSb / SSt  # Explained variance / Total variance
##      [,1]
## [1,] 0.98

Or equivalently:

1 - (SSw / SSt)  # 1 - (Unexplained variance / Total variance)
##      [,1]
## [1,] 0.98

The squared multiple correlation of a variable with other variables in a matrix can be computed as:

1 - (1 / (diag(INV(COR(X)))))  # Manual SMC calculation
##   X1   X2   X3   X4    Y 
## 0.98 1.00 0.98 1.00 0.98

4.6 Adjusted \(R^2\)

The coefficient of determination (\(R^2\)) tends to be inflated when additional explanatory variables are added to a least squares regression model. This makes \(R^2\) unsuitable for comparing models with significantly different numbers of independent variables.

To address this, the adjusted \(R^2\) is calculated to penalize the inclusion of unnecessary predictors. It adjusts for the number of explanatory terms (p) relative to the sample size (n). Unlike \(R^2\), the adjusted \(R^2\) increases only if the new variable improves the model more than expected by random chance.

McNemar’s Formula for Adjusted \(R^2\)

The adjusted \(R^2\) is computed as:

\[ \bar{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1} \]

Where:

  • \(\bar{R}^2\) = adjusted \(R^2\)
  • \(R^2\) = original coefficient of determination
  • \(n\) = sample size
  • \(p\) = number of explanatory variables

Key Properties:

  • Penalizes Complexity: Reduces overfitting by accounting for model degrees of freedom.
  • Comparison: Enables meaningful comparison across models with different numbers of predictors.
  • Interpretation: Always ≤ \(R^2\); can be negative if the model fits worse than a horizontal line.
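
A small worked example of the formula; the values r2_ex = 0.98, n_ex = 13, and p_ex = 4 are illustrative, roughly matching the fit in this chapter:

# Adjusted R-squared by McNemar's formula with illustrative values
r2_ex <- 0.98; n_ex <- 13; p_ex <- 4
1 - (1 - r2_ex) * (n_ex - 1) / (n_ex - p_ex - 1)   # ~ 0.97: slightly below R2, reflecting the penalty
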
#' Calculate Adjusted R-squared Using Different Formulas
#'
#' Computes four common variants of adjusted R-squared with recommendations
#' for which formula to use based on sample size and number of predictors.
#'
#' @param R2 Original R-squared value (0 to 1)
#' @param n Sample size (positive integer > p+1)
#' @param p Number of predictors (positive integer)
#' @return A list containing:
#' \itemize{
#'   \item Original_R2 - Input R-squared value
#'   \item McNemar - Adjusted R-squared (McNemar's formula)
#'   \item Wherry - Adjusted R-squared (Wherry's formula)
#'   \item Lord - Adjusted R-squared (Lord's formula)
#'   \item Stein - Adjusted R-squared (Stein's formula)
#'   \item Recommendation - Suggested formula based on n/p ratio
#' }
#' @examples
#' # Typical usage
#' calculate_adjusted_R2(R2 = 0.75, n = 100, p = 5)
#'
#' # Edge case (small sample)
#' calculate_adjusted_R2(R2 = 0.6, n = 15, p = 3)
calculate_adjusted_R2 <- function(R2, n, p) {
  # Input validation
  if (R2 < 0 || R2 > 1) stop("R2 must be between 0 and 1")
  if (n <= p + 1) stop("Sample size n must be greater than p + 1")
  if (p < 1) stop("Number of predictors p must be at least 1")
  
  # McNemar's formula (most common)
  mcnemar <- 1 - (1 - R2) * (n - 1) / (n - p - 1)
  
  # Wherry's formula
  wherry <- 1 - (1 - R2) * (n - 1) / (n - p)
  
  # Lord's formula
  lord <- 1 - (1 - R2) * (n + p - 1) / (n - p - 1)
  
  # Stein's formula (most conservative)
  # Only calculate if sample size is sufficient
  stein <- if (n > p + 2) {
    1 - ((n - 1)/(n - p - 1)) * 
        ((n - 2)/(n - p - 2)) * 
        ((n + 1)/n) * 
        (1 - R2)
  } else {
    warning("Stein's formula requires n > p + 2, returning NA")
    NA_real_
  }
  
  # Determine recommendation
  recommendation <- if (n/p >= 40) {
    "All formulas should give similar results (large sample)"
  } else if (n/p >= 10) {
    "McNemar or Wherry recommended (moderate sample)"
  } else {
    "Stein recommended (small sample or many predictors)"
  }
  
  return(list(
    Original_R2 = R2,
    McNemar = mcnemar,
    Wherry = wherry,
    Lord = lord,
    Stein = stein,
    Recommendation = recommendation
  ))
}
calculate_adjusted_R2(R2, n, p)
## $Original_R2
##      [,1]
## [1,]    1
## 
## $McNemar
##      [,1]
## [1,]    1
## 
## $Wherry
##      [,1]
## [1,]    1
## 
## $Lord
##      [,1]
## [1,]    1
## 
## $Stein
##      [,1]
## [1,]    1
## 
## $Recommendation
## [1] "Stein recommended (small sample or many predictors)"

### The Relationship Between F and \(R^2\)

The F-statistic in ANOVA can be expressed in terms of sums of squares or the coefficient of determination \((R^2)\):

\[ F = \frac{SST - SSW}{SSW} \cdot \frac{n - p}{p - 1} = \frac{R^2}{p - 1} \cdot \frac{n - p}{1 - R^2} \]

Where:

  • \(F\) = F-statistic
  • \(SST\) = Total sum of squares
  • \(SSW\) = Within-group sum of squares (residual)
  • \(n\) = Total sample size
  • \(p\) = Number of model parameters (predictor variables plus the intercept)
  • \(R^2\) = Coefficient of determination

Key Observations:

  • The F-statistic increases with higher \(R^2\) values
  • For a given \(R^2\), F decreases as more predictors are added (higher \(p\))
  • The relationship shows how goodness-of-fit \((R^2)\) translates to significance testing (F-test)

Alternative Forms: The relationship can also be expressed as:

\[ F = \frac{MS_{Model}}{MS_{Error}} = \frac{R^2/(p - 1)}{(1 - R^2)/(n - p)} \]

Where \(MS\) represents mean squares.

F <- (R2/(p - 1))/((1 - R2)/(n - p)); F
##         [,1]
## [1,] 4.5e+15
R2 <- 1 - (1/(1 + (F*(p - 1)/(n - p)))); R2
##      [,1]
## [1,]    1
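
Using the sums-of-squares-based \(R^2\) from Section 4.5 instead of the value of 1 printed above recovers the F-statistic from Section 4.4; a sketch assuming SSb, SSt, n, and p are still in the workspace:

# Recover the F statistic from the sums-of-squares-based R^2
R2_ss <- as.numeric(SSb / SSt)                     # ~ 0.98
(R2_ss / (p - 1)) / ((1 - R2_ss) / (n - p))        # ~ 111, matching MSB / MSW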

5 Appendix A: Helper Functions

# First define all helper functions outside the main ANOVA function

#' Calculate Between-Group Sum of Squares
SSB <- function(data) {
  grand_mean <- mean(data[, ncol(data)])
  group_means <- apply(data[, -ncol(data), drop = FALSE], 2, mean)
  sum((group_means - grand_mean)^2) * nrow(data)
}

#' Calculate Within-Group Sum of Squares
SSW <- function(data) {
  group_means <- apply(data[, -ncol(data), drop = FALSE], 2, mean)
  sum((data[, ncol(data)] - group_means)^2)
}

#' Create F-distribution plot
f_plot <- function(f_stat, df1, df2, p_value, alpha) {
  if (!requireNamespace("ggplot2", quietly = TRUE)) {
    stop("ggplot2 package required for plotting")
  }
  
  x <- seq(0, max(5, f_stat * 1.5), length.out = 200)
  y <- df(x, df1, df2)
  crit_val <- qf(1 - alpha, df1, df2)
  
  ggplot2::ggplot(data.frame(x = x, y = y), ggplot2::aes(x, y)) +
    ggplot2::geom_line() +
    ggplot2::geom_vline(xintercept = f_stat, color = "red", linetype = "dashed") +
    ggplot2::geom_vline(xintercept = crit_val, color = "blue", linetype = "dotted") +
    ggplot2::annotate("text", x = f_stat, y = max(y)/2, 
                     label = paste0("F = ", round(f_stat, 2)), hjust = -0.2) +
    ggplot2::annotate("text", x = crit_val, y = max(y)/2, 
                     label = paste0("Critical value = ", round(crit_val, 2)), hjust = 1.2) +
    ggplot2::labs(title = "F-Distribution", 
                 subtitle = paste0("p-value = ", format.pval(p_value)),
                 x = "F-value", y = "Density") +
    ggplot2::theme_minimal()
}

#' Create ANOVA table
anova_table <- function(df1, ssb, msb, f_stat, p_value, df2, ssw, msw) {
  # Signif. codes:
  sig <- round(p_value, 3)
  if(sig == 0){symb <- "***"}
  else if (sig <= 0.001){symb <- "**"}
  else if (sig <= 0.01){symb <- "*"}
  else if (sig <= 0.1){symb <- "."}
  else if (sig <= 0.5){symb <- "  "}
  else{symb <- ""}
  
  cat("           Df   Sum Sq   Mean Sq   F value   Pr(>F)\n")
  cat("X         ", format(df1, width=2), " ", format(ssb, width=6, nsmall=1, justify='right'), 
      "   ", format(msb, width=6, nsmall=1, justify='right'), "   ", 
      format(f_stat, width=6, nsmall=1, justify='right'), "", 
      format(p_value, width=6, nsmall=4, justify='right'), symb, "\n")
  cat("Residuals ", format(df2, width=2), "   ", format(ssw, width=6, nsmall=1, justify='right'), 
      "    ", format(msw, width=6, nsmall=1, justify='right'), "\n") 
  cat("---\n")
  cat("Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n")
}


# Save all functions to file
dump(c("SSB", "SSW", "f_plot", "anova_table", "ANOVA"), file = "ANOVA_functions.R")
