Multivariate Statistical Process Control (MSPC)

Hotelling T², MEWMA, and Multivariate Process Monitoring

Author

David Guy, PhD, Quality Control Department

Published

December 21, 2025

Introduction to Multivariate SPC

Note: This analysis uses the qcr package as the quality control framework. The multivariate methods (Hotelling T² and MEWMA) are implemented using standard statistical formulas to ensure compatibility across different package versions and to provide educational transparency. All calculations follow established statistical theory and produce results equivalent to specialized multivariate SPC packages.

What is Multivariate SPC?

Multivariate Statistical Process Control (MSPC) monitors multiple related quality characteristics simultaneously, rather than tracking each variable independently.

Why Use Multivariate SPC?

Traditional (Univariate) Approach:

  • Monitor each variable with separate control charts
  • Example: Track temperature, pressure, flow rate separately
  • Problem: Misses interactions and correlations between variables

Multivariate Approach:

  • Monitor all variables together as a system
  • Detects patterns that univariate charts miss
  • Accounts for correlations between variables
  • More powerful for detecting process changes

Real-World Example

Chemical Reactor Process:

  • Temperature: 150°C +/- 5°C
  • Pressure: 200 psi +/- 10 psi
  • pH: 7.0 +/- 0.5

Scenario (each variable individually in control):

  • Temperature: 152°C [OK] - within limits
  • Pressure: 195 psi [OK] - within limits
  • But: temperature is above target while pressure is below it; given their strong positive correlation, this combination is unusual and indicates a problem!

Multivariate SPC detects this because it knows the normal relationship between temperature and pressure.
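
To make this concrete, here is a minimal sketch with illustrative parameters (standard deviations of 2°C and 5 psi and a correlation of 0.9, chosen for this example only, not estimates from this report's data). It shows that the temperature/pressure pair above is flagged by the multivariate distance even though each reading looks unremarkable on its own.

Code
# Illustrative in-control parameters (hypothetical, for this example only)
mu    <- c(temperature = 150, pressure = 200)
sds   <- c(2, 5)
rho   <- 0.9
Sigma <- diag(sds) %*% matrix(c(1, rho, rho, 1), 2, 2) %*% diag(sds)

x <- c(temperature = 152, pressure = 195)   # each value individually within limits

# Univariate view: modest standardized deviations (+1 and -1 sigma)
round((x - mu) / sds, 2)

# Multivariate view: the combination violates the positive correlation
d  <- x - mu
T2 <- drop(t(d) %*% solve(Sigma) %*% d)   # Hotelling T² for a single observation (n = 1)
c(T2 = round(T2, 2), UCL = round(qchisq(0.9973, df = 2), 2))   # about 20.0 vs about 11.8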

Key Multivariate Methods

1. Hotelling T² Chart

  • Multivariate extension of X-bar chart
  • Monitors mean vector of multiple variables
  • One statistic for all variables combined
  • Most common multivariate control chart

2. MEWMA (Multivariate EWMA)

  • Multivariate exponentially weighted moving average
  • More sensitive to small sustained shifts
  • Better for detecting gradual process drift
  • Incorporates historical information

3. MCUSUM (Multivariate CUSUM)

  • Multivariate cumulative sum
  • Excellent for detecting small persistent shifts
  • Directional information for diagnosis (a minimal implementation sketch follows this overview)

4. Principal Component Analysis (PCA)

  • Reduces dimensionality
  • Transforms correlated variables to uncorrelated components
  • T² and Q statistics for monitoring
  • Useful for high-dimensional data (10+ variables)
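
MCUSUM is described above but not demonstrated later in this report. The following is a minimal, self-contained sketch of Crosier's multivariate CUSUM on simulated data; the reference value k and decision limit h are illustrative placeholders, not values calibrated to a target in-control ARL.

Code
# Minimal Crosier MCUSUM sketch on simulated data (illustrative parameters)
library(MASS)  # mvrnorm

set.seed(1)
p     <- 3
mu0   <- rep(0, p)            # in-control target (standardized units)
Sigma <- diag(p)
Sinv  <- solve(Sigma)

# 60 in-control observations followed by 40 with a small sustained shift
X_sim <- rbind(mvrnorm(60, mu0, Sigma),
               mvrnorm(40, mu0 + c(0.8, 0.8, 0), Sigma))

k_ref <- 0.5    # reference value (illustrative)
h_lim <- 5.5    # decision limit (illustrative)

S <- rep(0, p)
Y <- numeric(nrow(X_sim))
for (i in seq_len(nrow(X_sim))) {
  d  <- S + X_sim[i, ] - mu0
  Ci <- sqrt(drop(t(d) %*% Sinv %*% d))
  S  <- if (Ci <= k_ref) rep(0, p) else d * (1 - k_ref / Ci)   # shrink toward zero
  Y[i] <- sqrt(drop(t(S) %*% Sinv %*% S))                      # plotted MCUSUM statistic
}

which(Y > h_lim)[1]   # first sample at which the MCUSUM signals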

Setup and Data Import

Code
# Install packages if needed
packages <- c("qcr", "plotly", "gt", "dplyr", "tidyr", "ggplot2", 
              "readr", "lubridate", "DT", "knitr", "MASS", "psych")
missing_pkgs <- setdiff(packages, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

# Load libraries
library(qcr)       # Unified quality control (replaces qcc + MSQC)
library(plotly)    # Interactive plots
library(gt)        # Tables
library(dplyr)     # Data manipulation
library(tidyr)     # Data tidying
library(ggplot2)   # Graphics
library(readr)     # Data import
library(lubridate) # Dates
library(DT)        # Interactive tables
library(knitr)     # Reports
library(MASS)      # Multivariate normal
library(psych)     # Correlation matrices

theme_set(theme_minimal())

cat("Package versions:\n")
Package versions:
Code
cat("qcr version:", as.character(packageVersion("qcr")), "\n\n")
qcr version: 1.4 

Import Multivariate Data

Code
# Import multivariate quality data
if (file.exists("multivariate_data.csv")) {
  mv_data <- read_csv("multivariate_data.csv", show_col_types = FALSE)
  cat("[OK] Multivariate data loaded from multivariate_data.csv\n\n")
} else {
  cat("[INFO] No multivariate_data.csv found. Generating example data.\n")
  cat("For your own data, create multivariate_data.csv with multiple quality variables.\n\n")
  
  # Generate realistic multivariate process data
  set.seed(42)
  n_samples <- 100
  
  # Phase I: In-control data (samples 1-80)
  # Create correlated variables (temperature, pressure, flow_rate)
  
  # Correlation structure (realistic for a process)
  correlation_matrix <- matrix(c(
    1.0,  0.7,  0.5,   # Temperature correlations
    0.7,  1.0,  0.6,   # Pressure correlations
    0.5,  0.6,  1.0    # Flow rate correlations
  ), nrow = 3, byrow = TRUE)
  
  # Mean vector
  mu_in_control <- c(150, 200, 50)  # Temp, Pressure, Flow
  
  # Standard deviations
  sigma_in_control <- c(2, 5, 1)
  
  # Create covariance matrix
  D <- diag(sigma_in_control)
  Sigma_in_control <- D %*% correlation_matrix %*% D
  
  # Generate in-control data (Phase I)
  phase1_data <- mvrnorm(n = 80, mu = mu_in_control, Sigma = Sigma_in_control)
  
  # Phase II: Out-of-control data (samples 81-100)
  # Introduce a mean shift
  mu_out_control <- c(152, 205, 51)  # Small shifts in all variables
  phase2_data <- mvrnorm(n = 20, mu = mu_out_control, Sigma = Sigma_in_control)
  
  # Combine data
  all_data <- rbind(phase1_data, phase2_data)
  
  # Create dataframe
  mv_data <- tibble(
    sample = 1:n_samples,
    date = seq.Date(from = Sys.Date() - n_samples, to = Sys.Date() - 1, by = "day"),
    temperature = all_data[, 1],
    pressure = all_data[, 2],
    flow_rate = all_data[, 3],
    shift = sample(c("Day", "Night"), n_samples, replace = TRUE),
    operator = sample(paste0("OP", 1:3), n_samples, replace = TRUE)
  )
}
[OK] Multivariate data loaded from multivariate_data.csv
Code
# Display data summary
cat("Multivariate Process Data:\n")
Multivariate Process Data:
Code
cat("Samples:", nrow(mv_data), "\n")
Samples: 100 
Code
cat("Variables:", sum(sapply(mv_data, is.numeric)), "numeric variables\n\n")
Variables: 4 numeric variables
Code
# Show first few observations
mv_data %>%
  head(10) %>%
  gt() %>%
  tab_header(
    title = "Multivariate Quality Data - Sample Preview",
    subtitle = "First 10 observations"
  ) %>%
  fmt_number(
    columns = where(is.numeric),
    decimals = 2
  ) %>%
  tab_style(
    style = cell_fill(color = "lightblue"),
    locations = cells_column_labels()
  )
Multivariate Quality Data - Sample Preview
First 10 observations
sample date temperature pressure flow_rate shift operator
1.00 2024-09-11 149.82 199.45 49.89 Day OP1
2.00 2024-09-12 150.24 201.32 50.15 Night OP2
3.00 2024-09-13 149.95 198.76 49.95 Day OP3
4.00 2024-09-14 150.18 200.88 50.08 Day OP1
5.00 2024-09-15 149.73 199.12 49.82 Night OP2
6.00 2024-09-16 150.42 202.15 50.28 Day OP1
7.00 2024-09-17 149.88 199.67 49.92 Day OP3
8.00 2024-09-18 150.15 200.45 50.12 Night OP2
9.00 2024-09-19 149.92 199.88 49.98 Day OP1
10.00 2024-09-20 150.05 200.12 50.02 Day OP3

Data Summary and Correlation Analysis

Code
# Select quality variables
quality_vars <- c("temperature", "pressure", "flow_rate")
X <- as.matrix(mv_data[, quality_vars])

# Summary statistics
summary_stats <- mv_data %>%
  select(all_of(quality_vars)) %>%
  summarise(across(everything(),
                   list(Mean = ~mean(., na.rm = TRUE),
                        SD = ~sd(., na.rm = TRUE),
                        Min = ~min(., na.rm = TRUE),
                        Max = ~max(., na.rm = TRUE)),
                   .names = "{.col}__{.fn}")) %>%
  pivot_longer(everything(),
               names_to = c("Variable", "Statistic"),
               names_sep = "__") %>%
  pivot_wider(names_from = Statistic, values_from = value)

summary_stats %>%
  gt() %>%
  tab_header(
    title = "Summary Statistics",
    subtitle = "Process variables descriptive statistics"
  ) %>%
  fmt_number(
    columns = c(Mean, SD, Min, Max),
    decimals = 3
  ) %>%
  tab_style(
    style = list(
      cell_fill(color = "lightyellow"),
      cell_text(weight = "bold")
    ),
    locations = cells_body(columns = Variable)
  )
Summary Statistics
Process variables descriptive statistics
Variable Mean SD Min Max
temperature 150.444 0.857 149.670 152.380
pressure 201.321 2.219 198.760 206.220
flow_rate 50.250 0.485 49.750 51.350
Code
# Correlation matrix
cor_matrix <- cor(X)

cat("\nCorrelation Matrix:\n")

Correlation Matrix:
Code
cor_df <- as.data.frame(cor_matrix)
cor_df$Variable <- rownames(cor_matrix)
cor_df <- cor_df[, c("Variable", quality_vars)]

cor_df %>%
  gt() %>%
  tab_header(
    title = "Correlation Matrix",
    subtitle = "Relationships between process variables"
  ) %>%
  fmt_number(
    columns = all_of(quality_vars),
    decimals = 3
  )
Correlation Matrix
Relationships between process variables
Variable temperature pressure flow_rate
temperature 1.000 0.980 0.998
pressure 0.980 1.000 0.989
flow_rate 0.998 0.989 1.000
Code
# Note: Strong correlations (>0.7) are important for multivariate SPC
cat("\nStrong correlations (>0.7) found:\n")

Strong correlations (>0.7) found:
Code
for (i in 1:(length(quality_vars)-1)) {
  for (j in (i+1):length(quality_vars)) {
    if (abs(cor_matrix[i,j]) > 0.7) {
      cat(sprintf("%s <-> %s: %.3f\n", 
                  quality_vars[i], quality_vars[j], cor_matrix[i,j]))
    }
  }
}
temperature <-> pressure: 0.980
temperature <-> flow_rate: 0.998
pressure <-> flow_rate: 0.989
Code
cat("\n")
Code
# Correlation heatmap
cor_long <- cor_matrix %>%
  as.data.frame() %>%
  mutate(Var1 = rownames(.)) %>%
  pivot_longer(cols = -Var1, names_to = "Var2", values_to = "Correlation")

cor_plot <- plot_ly(
  data = cor_long,
  x = ~Var1,
  y = ~Var2,
  z = ~Correlation,
  type = "heatmap",
  colorscale = list(
    c(0, "blue"),
    c(0.5, "white"),
    c(1, "red")
  ),
  zmin = -1,
  zmax = 1,
  text = ~round(Correlation, 3),
  hovertemplate = "<b>%{x} vs %{y}</b><br>Correlation: %{z:.3f}<extra></extra>"
) %>%
  layout(
    title = list(
      text = "Correlation Heatmap<br><sub>Strong correlations indicate variables move together</sub>",
      font = list(size = 16)
    ),
    xaxis = list(title = ""),
    yaxis = list(title = "")
  )

cor_plot

1. Hotelling T² Control Chart

Theory

The Hotelling T² statistic is a multivariate generalization of the t-statistic:

Formula: \[T^2 = n(\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0)\]

Where: - \(\bar{x}\) = sample mean vector - \(\mu_0\) = target mean vector (from Phase I) - \(S\) = sample covariance matrix - \(n\) = sample size

Control Limit: For Phase II monitoring: \[UCL = \frac{p(m+1)(m-1)}{m^2-mp} F_{\alpha}(p, m-p)\]

Where: - \(p\) = number of variables - \(m\) = number of Phase I samples - \(F_{\alpha}\) = F-distribution critical value

Interpretation:

  • T² combines all variables into a single statistic
  • Large T² indicates multivariate out-of-control
  • Accounts for correlations between variables

Code
# Phase I data (first 80 samples for establishing control limits)
phase1_indices <- 1:80
X_phase1 <- X[phase1_indices, ]

# Calculate Phase I statistics
mu_hat <- colMeans(X_phase1)
Sigma_hat <- cov(X_phase1)

cat("\nPhase I Statistics:\n")

Phase I Statistics:
Code
cat("Mean vector:\n")
Mean vector:
Code
print(round(mu_hat, 3))
temperature    pressure   flow_rate 
    150.029     200.309      50.019 
Code
cat("\nCovariance matrix:\n")

Covariance matrix:
Code
print(round(Sigma_hat, 3))
            temperature pressure flow_rate
temperature       0.040    0.190     0.029
pressure          0.190    0.942     0.139
flow_rate         0.029    0.139     0.021
Code
cat("\n")
Code
# Phase II data (monitoring new samples)
X_phase2 <- X

# Calculate T² statistics using standard formula
m <- nrow(X_phase1)  # Number of Phase I samples
p <- ncol(X)          # Number of variables
n <- 1                # Individual observations

# Calculate T² for each observation
T2_values <- numeric(nrow(X_phase2))
for (i in 1:nrow(X_phase2)) {
  x_i <- X_phase2[i, ]
  # Hotelling T² formula: T² = n(x - μ)'Σ⁻¹(x - μ)
  T2_values[i] <- n * t(x_i - mu_hat) %*% solve(Sigma_hat) %*% (x_i - mu_hat)
}

# Calculate Phase II control limit
# UCL = [(p(m+1)(m-1)) / (m²-mp)] * F(α, p, m-p)
alpha <- 0.0027  # For 3-sigma equivalent (0.27%)
F_critical <- qf(1 - alpha, p, m - p)
UCL_T2 <- (p * (m + 1) * (m - 1)) / (m^2 - m * p) * F_critical

cat("Hotelling T² Chart Parameters:\n")
Hotelling T² Chart Parameters:
Code
cat("Number of variables (p):", p, "\n")
Number of variables (p): 3 
Code
cat("Phase I samples (m):", m, "\n")
Phase I samples (m): 80 
Code
cat("Significance level (α):", alpha, "\n")
Significance level (α): 0.0027 
Code
cat("F critical value:", round(F_critical, 3), "\n")
F critical value: 5.145 
Code
cat("Control limit (UCL):", round(UCL_T2, 3), "\n")
Control limit (UCL): 16.032 
Code
cat("Number of out-of-control points:", sum(T2_values > UCL_T2), "\n\n")
Number of out-of-control points: 22 
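
As a quick cross-check of the loop above (a sketch reusing the objects already computed: X, mu_hat, Sigma_hat, UCL_T2), base R's mahalanobis() evaluates the same quadratic form for every row in one vectorized call.

Code
# Vectorized equivalent of the T² loop: (x - mu)' S^{-1} (x - mu) for each row
T2_check <- mahalanobis(X, center = mu_hat, cov = Sigma_hat)
all.equal(as.numeric(T2_check), as.numeric(T2_values))   # expected TRUE (n = 1 per observation)
sum(T2_check > UCL_T2)                                    # same out-of-control count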
Code
# Create results table
t2_summary <- tibble(
  Metric = c("Number of Variables", "Phase I Samples", "UCL", 
             "Mean T²", "Max T²", "Out-of-Control Points"),
  Value = c(p, m, UCL_T2, mean(T2_values), max(T2_values), 
            sum(T2_values > UCL_T2))
)

t2_summary %>%
  gt() %>%
  tab_header(
    title = "Hotelling T² Chart Summary",
    subtitle = "Multivariate control chart statistics"
  ) %>%
  fmt_number(
    columns = Value,
    rows = 3:5,
    decimals = 3
  ) %>%
  fmt_number(
    columns = Value,
    rows = c(1, 2, 6),
    decimals = 0
  ) %>%
  tab_style(
    style = cell_fill(color = "lightcoral"),
    locations = cells_body(rows = 6)
  )
Hotelling T² Chart Summary
Multivariate control chart statistics
Metric Value
Number of Variables 3
Phase I Samples 80
UCL 16.032
Mean T² 312.980
Max T² 1,841.013
Out-of-Control Points 22
Code
# Create T² control chart
t2_data <- tibble(
  Sample = 1:length(T2_values),
  T2 = T2_values,
  UCL = UCL_T2,
  Phase = ifelse(Sample <= 80, "Phase I (Training)", "Phase II (Monitoring)"),
  Status = ifelse(T2 > UCL, "Out of Control", "In Control")
)

t2_plot <- plot_ly(t2_data) %>%
  add_trace(
    x = ~Sample,
    y = ~T2,
    type = "scatter",
    mode = "lines+markers",
    name = "T² Statistic",
    line = list(color = "blue"),
    marker = list(
      size = 6,
      color = ~ifelse(Status == "Out of Control", "red", "blue")
    ),
    text = ~paste0(
      "Sample: ", Sample, "<br>",
      "T²: ", round(T2, 3), "<br>",
      "Phase: ", Phase, "<br>",
      "Status: ", Status
    ),
    hovertemplate = "%{text}<extra></extra>"
  ) %>%
  add_trace(
    x = ~Sample,
    y = ~UCL,
    type = "scatter",
    mode = "lines",
    name = "UCL",
    line = list(color = "red", dash = "dash", width = 3),
    hovertemplate = "<b>UCL:</b> %{y:.3f}<extra></extra>"
  ) %>%
  add_segments(
    x = 80.5, xend = 80.5,
    y = 0, yend = max(t2_data$T2) * 1.1,
    line = list(color = "gray", dash = "dot", width = 2),
    showlegend = FALSE,
    hoverinfo = "skip"
  ) %>%
  layout(
    title = list(
      text = "Hotelling T² Control Chart<br><sub>Monitoring multiple correlated variables simultaneously</sub>",
      font = list(size = 16, weight = "bold")
    ),
    xaxis = list(title = "Sample Number"),
    yaxis = list(title = "T² Statistic"),
    hovermode = "closest",
    annotations = list(
      list(
        x = 80.5, y = max(t2_data$T2) * 1.05,
        text = "Phase I | Phase II",
        showarrow = FALSE,
        xanchor = "center",
        font = list(size = 10, color = "gray")
      )
    )
  )

t2_plot

Interpreting T² Results

Understanding the T² Statistic

What T² Measures:

  • Overall deviation from the target mean vector
  • Accounts for variable correlations
  • Larger values indicate the process is farther from target

When to Investigate:

  • Any point above the UCL indicates out-of-control
  • Patterns (runs, trends) suggest special causes
  • Compare Phase I vs Phase II performance

Decomposition Analysis

When T² signals out-of-control, we need to identify which variable(s) caused the signal. The simple decomposition used below splits T² into one term per variable; the terms sum to T², but when variables are strongly correlated individual terms can be negative or exceed 100% of T², so treat them as directional hints rather than exact shares.

Code
# For out-of-control points, decompose T²
ooc_indices <- which(T2_values > UCL_T2)

if (length(ooc_indices) > 0) {
  cat("\nDecomposition of Out-of-Control Points:\n\n")
  
  # Analyze first few out-of-control points
  analyze_points <- head(ooc_indices, 3)
  
  decomp_results <- list()
  
  for (idx in analyze_points) {
    x_i <- X[idx, ]
    deviation <- x_i - mu_hat
    
    # Calculate contribution of each variable
    S_inv <- solve(Sigma_hat)
    contributions <- numeric(p)
    
    for (j in 1:p) {
      e_j <- rep(0, p)
      e_j[j] <- 1
      contributions[j] <- n * t(deviation) %*% S_inv %*% e_j * deviation[j]
    }
    
    decomp_results[[length(decomp_results) + 1]] <- tibble(
      Sample = idx,
      Variable = quality_vars,
      Value = x_i,
      Target = mu_hat,
      Deviation = deviation,
      Contribution = contributions,
      Pct_of_T2 = (contributions / T2_values[idx]) * 100
    )
  }
  
  # Display decomposition table
  bind_rows(decomp_results) %>%
    gt() %>%
    tab_header(
      title = "T² Decomposition Analysis",
      subtitle = "Identifying which variables caused out-of-control signals"
    ) %>%
    fmt_number(
      columns = c(Value, Target, Deviation, Contribution),
      decimals = 3
    ) %>%
    fmt_number(
      columns = Pct_of_T2,
      decimals = 1
    ) %>%
    tab_style(
      style = cell_fill(color = "lightcoral"),
      locations = cells_body(
        columns = Pct_of_T2,
        rows = Pct_of_T2 > 40
      )
    ) %>%
    tab_style(
      style = list(
        cell_fill(color = "lightyellow"),
        cell_text(weight = "bold")
      ),
      locations = cells_body(columns = Sample)
    )
} else {
  cat("\nNo out-of-control points detected in T² chart.\n")
}

Decomposition of Out-of-Control Points:
T² Decomposition Analysis
Identifying which variables caused out-of-control signals
Sample Variable Value Target Deviation Contribution Pct_of_T2
3 temperature 149.950 150.029 −0.079 −12.126 −32.6
3 pressure 198.760 200.309 −1.549 48.773 131.2
3 flow_rate 49.950 50.019 −0.069 0.523 1.4
4 temperature 150.180 150.029 0.151 42.757 223.3
4 pressure 200.880 200.309 0.571 −0.450 −2.3
4 flow_rate 50.080 50.019 0.061 −23.158 −120.9
81 temperature 152.150 150.029 2.121 5,489.128 313.7
81 pressure 205.320 200.309 5.011 −569.587 −32.5
81 flow_rate 51.180 50.019 1.161 −3,169.505 −181.1

2. MEWMA (Multivariate EWMA) Chart

Theory

MEWMA (Multivariate Exponentially Weighted Moving Average) is more sensitive to small sustained shifts than T².

Formula: \[Z_i = \Lambda (X_i - \mu_0) + (I - \Lambda) Z_{i-1}, \qquad Z_0 = 0\]

Where: - \(Z_i\) = EWMA vector at time \(i\) - \(\Lambda\) = diagonal matrix of smoothing constants (typically 0.1-0.3) - \(X_i\) = observation vector at time \(i\) - \(\mu_0\) = target mean vector - \(I\) = identity matrix

MEWMA Statistic: \[T_i^2 = Z_i' \Sigma_Z^{-1} Z_i\]

Where \(\Sigma_Z\) is the covariance matrix of \(Z_i\); this report uses the asymptotic form \(\Sigma_Z = \frac{\lambda}{2-\lambda}\Sigma\), matching the code below.

Control Limit: Typically set using simulation to achieve desired false alarm rate.
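
A sketch of how such a simulation could look, assuming known in-control parameters and a target in-control ARL (recall ARL0 is roughly 1/α). The helper functions mewma_run_length and mewma_arl, the candidate limits, and the small n_reps are illustrative choices for this sketch, not a calibrated procedure.

Code
library(MASS)  # mvrnorm

# Run length of one simulated in-control MEWMA sequence for a candidate limit h
mewma_run_length <- function(h, lambda, mu0, Sigma, Sz_inv, max_run = 5000) {
  X_sim <- mvrnorm(max_run, mu0, Sigma)   # one simulated in-control sequence
  z <- rep(0, length(mu0))
  for (i in 1:max_run) {
    z <- lambda * (X_sim[i, ] - mu0) + (1 - lambda) * z
    if (drop(t(z) %*% Sz_inv %*% z) > h) return(i)
  }
  max_run  # censored run (no signal within max_run samples)
}

# Average run length over n_reps simulated sequences (estimated in-control ARL)
mewma_arl <- function(h, lambda = 0.2, mu0 = rep(0, 3), Sigma = diag(3), n_reps = 200) {
  Sz_inv <- solve((lambda / (2 - lambda)) * Sigma)
  mean(replicate(n_reps, mewma_run_length(h, lambda, mu0, Sigma, Sz_inv)))
}

# Pick the smallest h whose estimated in-control ARL meets the target (e.g., around 200)
set.seed(123)
sapply(c(10, 11, 12, 13), mewma_arl)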

Advantages:

  • More sensitive to small shifts
  • Smooths out random noise
  • Better for detecting gradual process drift

Code
# MEWMA parameters
lambda <- 0.2  # Smoothing constant (0 < lambda <= 1)
               # Smaller = more smoothing, more sensitive to small shifts

cat("MEWMA Chart Setup:\n")
MEWMA Chart Setup:
Code
cat("Smoothing constant (lambda):", lambda, "\n\n")
Smoothing constant (lambda): 0.2 
Code
# Initialize MEWMA vectors (Z_0 = 0, so the first update is Z_1 = lambda * (X_1 - mu))
Z <- matrix(0, nrow = nrow(X), ncol = p)
Z[1, ] <- lambda * (X[1, ] - mu_hat)  # first update from Z_0 = 0

# Calculate MEWMA vectors
# Z_i = λ(X_i - μ) + (1-λ)Z_{i-1}
for (i in 2:nrow(X)) {
  Z[i, ] <- lambda * (X[i, ] - mu_hat) + (1 - lambda) * Z[i-1, ]
}

# Calculate MEWMA T² statistic
# Covariance of Z_i: Σ_Z = [λ/(2-λ)]Σ
Sigma_Z <- (lambda / (2 - lambda)) * Sigma_hat

MEWMA_T2 <- numeric(nrow(X))
for (i in 1:nrow(X)) {
  MEWMA_T2[i] <- t(Z[i, ]) %*% solve(Sigma_Z) %*% Z[i, ]
}

# Control limit
# Using h parameter (typically 9-13 for good performance)
h <- 11.5  # Control limit parameter
UCL_MEWMA <- h

cat("MEWMA Chart Parameters:\n")
MEWMA Chart Parameters:
Code
cat("Smoothing constant (λ):", lambda, "\n")
Smoothing constant (λ): 0.2 
Code
cat("Control limit parameter (h):", h, "\n")
Control limit parameter (h): 11.5 
Code
cat("UCL:", round(UCL_MEWMA, 3), "\n")
UCL: 11.5 
Code
cat("Number of out-of-control points:", sum(MEWMA_T2 > UCL_MEWMA), "\n\n")
Number of out-of-control points: 23 
Code
# Create MEWMA results table
mewma_summary <- tibble(
  Metric = c("Smoothing Constant (λ)", "Control Limit (h)", 
             "Mean MEWMA T²", "Max MEWMA T²", "Out-of-Control Points"),
  Value = c(lambda, UCL_MEWMA, mean(MEWMA_T2), max(MEWMA_T2), 
            sum(MEWMA_T2 > UCL_MEWMA))
)

mewma_summary %>%
  gt() %>%
  tab_header(
    title = "MEWMA Chart Summary",
    subtitle = "Multivariate exponentially weighted moving average"
  ) %>%
  fmt_number(
    columns = Value,
    rows = 1:4,
    decimals = 3
  ) %>%
  fmt_number(
    columns = Value,
    rows = 5,
    decimals = 0
  ) %>%
  tab_style(
    style = cell_fill(color = "lightcoral"),
    locations = cells_body(rows = 5)
  )
MEWMA Chart Summary
Multivariate exponentially weighted moving average
Metric Value
Smoothing Constant (λ) 0.200
Control Limit (h) 11.500
Mean MEWMA T² 1,915.799
Max MEWMA T² 13,632.016
Out-of-Control Points 23
Code
# Create MEWMA control chart
mewma_data <- tibble(
  Sample = 1:length(MEWMA_T2),
  MEWMA = MEWMA_T2,
  UCL = UCL_MEWMA,
  Phase = ifelse(Sample <= 80, "Phase I", "Phase II"),
  Status = ifelse(MEWMA > UCL, "Out of Control", "In Control")
)

mewma_plot <- plot_ly(mewma_data) %>%
  add_trace(
    x = ~Sample,
    y = ~MEWMA,
    type = "scatter",
    mode = "lines+markers",
    name = "MEWMA T²",
    line = list(color = "darkgreen"),
    marker = list(
      size = 6,
      color = ~ifelse(Status == "Out of Control", "red", "darkgreen")
    ),
    text = ~paste0(
      "Sample: ", Sample, "<br>",
      "MEWMA T²: ", round(MEWMA, 3), "<br>",
      "Phase: ", Phase, "<br>",
      "Status: ", Status
    ),
    hovertemplate = "%{text}<extra></extra>"
  ) %>%
  add_trace(
    x = ~Sample,
    y = ~UCL,
    type = "scatter",
    mode = "lines",
    name = "UCL",
    line = list(color = "red", dash = "dash", width = 3),
    hovertemplate = "<b>UCL:</b> %{y:.3f}<extra></extra>"
  ) %>%
  add_segments(
    x = 80.5, xend = 80.5,
    y = 0, yend = max(mewma_data$MEWMA) * 1.1,
    line = list(color = "gray", dash = "dot", width = 2),
    showlegend = FALSE,
    hoverinfo = "skip"
  ) %>%
  layout(
    title = list(
      text = "MEWMA Control Chart<br><sub>More sensitive to small sustained shifts</sub>",
      font = list(size = 16, weight = "bold")
    ),
    xaxis = list(title = "Sample Number"),
    yaxis = list(title = "MEWMA T² Statistic"),
    hovermode = "closest"
  )

mewma_plot

Comparing T² vs MEWMA

Code
# Create comparison dataframe
comparison_data <- tibble(
  Sample = 1:length(T2_values),
  `T² Statistic` = T2_values / UCL_T2,  # Normalize by UCL
  `MEWMA Statistic` = MEWMA_T2 / UCL_MEWMA,
  Phase = ifelse(Sample <= 80, "Phase I", "Phase II")
)

# Create comparison plot
comp_plot <- plot_ly(comparison_data) %>%
  add_trace(
    x = ~Sample,
    y = ~`T² Statistic`,
    type = "scatter",
    mode = "lines",
    name = "T² (normalized)",
    line = list(color = "blue", width = 2),
    hovertemplate = "<b>T²:</b> %{y:.3f}<extra></extra>"
  ) %>%
  add_trace(
    x = ~Sample,
    y = ~`MEWMA Statistic`,
    type = "scatter",
    mode = "lines",
    name = "MEWMA (normalized)",
    line = list(color = "darkgreen", width = 2),
    hovertemplate = "<b>MEWMA:</b> %{y:.3f}<extra></extra>"
  ) %>%
  add_segments(
    x = 0, xend = max(comparison_data$Sample),
    y = 1, yend = 1,
    line = list(color = "red", dash = "dash", width = 2),
    showlegend = FALSE,
    hoverinfo = "skip"
  ) %>%
  add_segments(
    x = 80.5, xend = 80.5,
    y = 0, yend = max(c(comparison_data$`T² Statistic`, 
                       comparison_data$`MEWMA Statistic`)),
    line = list(color = "gray", dash = "dot", width = 2),
    showlegend = FALSE,
    hoverinfo = "skip"
  ) %>%
  layout(
    title = list(
      text = "T² vs MEWMA Comparison<br><sub>Both statistics normalized by their control limits</sub>",
      font = list(size = 16, weight = "bold")
    ),
    xaxis = list(title = "Sample Number"),
    yaxis = list(title = "Normalized Statistic (UCL = 1.0)"),
    hovermode = "x unified",
    annotations = list(
      list(
        x = max(comparison_data$Sample) / 2,
        y = 1.05,
        text = "Control Limit",
        showarrow = FALSE,
        font = list(size = 10, color = "red")
      )
    )
  )

comp_plot

Key Observations:

  • T² Chart: Better for detecting large, sudden shifts
  • MEWMA Chart: Better for detecting small, gradual shifts
  • MEWMA smooths the data, making trends more visible
  • MEWMA typically responds earlier than T² to the sustained shift introduced at sample 81

3. Principal Component Analysis (PCA) Based SPC

Theory

PCA-based SPC transforms correlated variables into uncorrelated principal components, then monitors these components.

Why Use PCA for SPC?

Advantages:

  • Reduces dimensionality (10+ variables → 2-3 components)
  • Eliminates redundancy from correlated variables
  • Two complementary statistics: T² and Q (SPE)
  • Better visualization of process behavior
  • More interpretable for high-dimensional data

The Two Statistics

1. Hotelling T² (in PC space)

  • Monitors systematic variation (in-model)
  • Captures variation in retained principal components
  • Detects shifts in the process mean structure

2. Q Statistic (Squared Prediction Error - SPE)

  • Monitors random variation (residual)
  • Captures variation NOT in retained components
  • Detects unusual patterns or anomalies
  • Also called DModX or residual distance

Perform PCA

Code
# Standardize data (important for PCA with different scales)
X_scaled <- scale(X_phase1)
X_scaled_all <- scale(X, center = colMeans(X_phase1), 
                      scale = apply(X_phase1, 2, sd))

# Perform PCA on Phase I data
pca_result <- prcomp(X_scaled, center = FALSE, scale. = FALSE)

# Variance explained
variance_explained <- pca_result$sdev^2
prop_variance <- variance_explained / sum(variance_explained)
cum_variance <- cumsum(prop_variance)

cat("\nPCA Results:\n")

PCA Results:
Code
cat("Principal Component Analysis on", p, "variables\n\n")
Principal Component Analysis on 3 variables
Code
# Variance table
pca_variance <- tibble(
  PC = paste0("PC", 1:p),
  `Std Dev` = pca_result$sdev,
  Variance = variance_explained,
  `Prop. Variance` = prop_variance * 100,
  `Cumulative %` = cum_variance * 100
)

pca_variance %>%
  gt() %>%
  tab_header(
    title = "Principal Components Summary",
    subtitle = "Variance explained by each component"
  ) %>%
  fmt_number(
    columns = c(`Std Dev`, Variance),
    decimals = 3
  ) %>%
  fmt_number(
    columns = c(`Prop. Variance`, `Cumulative %`),
    decimals = 2
  )
Principal Components Summary
Variance explained by each component
PC Std Dev Variance Prop. Variance Cumulative %
PC1 1.724 2.971 99.03 99.03
PC2 0.161 0.026 0.87 99.90
PC3 0.055 0.003 0.10 100.00
Code
# Decide on number of components (retain 95% variance)
k <- min(which(cum_variance >= 0.95))
cat("\nRetaining", k, "components (capturing", 
    round(cum_variance[k] * 100, 1), "% of variance)\n\n")

Retaining 1 components (capturing 99 % of variance)
Code
# Scree plot
scree_data <- tibble(
  PC = 1:p,
  Variance = variance_explained,
  Type = "Eigenvalue"
)

scree_plot <- plot_ly(scree_data) %>%
  add_trace(
    x = ~PC,
    y = ~Variance,
    type = "scatter",
    mode = "lines+markers",
    name = "Eigenvalue",
    line = list(color = "blue", width = 2),
    marker = list(size = 10, color = "blue"),
    hovertemplate = "<b>PC%{x}</b><br>Variance: %{y:.3f}<extra></extra>"
  ) %>%
  add_segments(
    x = k + 0.5, xend = k + 0.5,
    y = 0, yend = max(variance_explained),
    line = list(color = "red", dash = "dash", width = 2),
    showlegend = FALSE,
    hoverinfo = "skip"
  ) %>%
  layout(
    title = list(
      text = "Scree Plot<br><sub>Selecting number of principal components</sub>",
      font = list(size = 16, weight = "bold")
    ),
    xaxis = list(title = "Principal Component", dtick = 1),
    yaxis = list(title = "Variance (Eigenvalue)"),
    annotations = list(
      list(
        x = k + 0.5,
        y = max(variance_explained) * 0.9,
        text = paste0("Retain ", k, " PCs"),
        showarrow = FALSE,
        xanchor = "left",
        font = list(color = "red")
      )
    )
  )

scree_plot
Code
# Loadings plot
if (k >= 1) {
  # Extract loadings and ensure proper column names
  loadings_matrix <- pca_result$rotation[, 1:min(2, k), drop = FALSE]
  loadings_data <- as.data.frame(loadings_matrix)
  
  # Ensure column names are PC1, PC2
  if (ncol(loadings_data) == 1) {
    colnames(loadings_data) <- "PC1"
    loadings_data$PC2 <- 0
  } else {
    colnames(loadings_data) <- c("PC1", "PC2")
  }
  
  loadings_data$Variable <- quality_vars
  
  loadings_plot <- plot_ly(loadings_data) %>%
    add_trace(
      x = ~PC1,
      y = ~PC2,
      type = "scatter",
      mode = "markers+text",
      text = ~Variable,
      textposition = "top center",
      marker = list(size = 12, color = "darkblue"),
      hovertemplate = "<b>%{text}</b><br>PC1: %{x:.3f}<br>PC2: %{y:.3f}<extra></extra>"
    ) %>%
    add_segments(
      x = 0, xend = ~PC1,
      y = 0, yend = ~PC2,
      line = list(color = "gray", width = 1),
      showlegend = FALSE,
      hoverinfo = "skip"
    ) %>%
    layout(
      title = list(
        text = "PCA Loadings Plot<br><sub>How variables contribute to principal components</sub>",
        font = list(size = 16, weight = "bold")
      ),
      xaxis = list(title = paste0("PC1 (", round(prop_variance[1]*100, 1), "%)"),
                   zeroline = TRUE),
      yaxis = list(title = if(k >= 2) 
                     paste0("PC2 (", round(prop_variance[2]*100, 1), "%)") 
                     else "PC2",
                   zeroline = TRUE)
    )
  
  loadings_plot
}

PCA-based T² Control Chart

Code
# Project data onto principal components
scores_phase1 <- X_scaled %*% pca_result$rotation[, 1:k, drop = FALSE]
scores_all <- X_scaled_all %*% pca_result$rotation[, 1:k, drop = FALSE]

# Calculate T² in PC space
# Lambda is diagonal matrix of eigenvalues
Lambda <- diag(variance_explained[1:k], nrow = k, ncol = k)

T2_pca <- numeric(nrow(scores_all))
for (i in 1:nrow(scores_all)) {
  t_i <- as.numeric(scores_all[i, ])  # Ensure it's a numeric vector
  # T² = t'Λ⁻¹t
  if (k == 1) {
    # Special case for single component
    T2_pca[i] <- (t_i^2) / variance_explained[1]
  } else {
    T2_pca[i] <- t(t_i) %*% solve(Lambda) %*% t_i
  }
}

# Control limit for PCA T²
m <- nrow(X_phase1)
alpha <- 0.0027
F_crit_pca <- qf(1 - alpha, k, m - k)
UCL_T2_pca <- (k * (m^2 - 1)) / (m * (m - k)) * F_crit_pca

cat("\nPCA T² Chart Parameters:\n")

PCA T² Chart Parameters:
Code
cat("Number of PCs retained (k):", k, "\n")
Number of PCs retained (k): 1 
Code
cat("Control limit (UCL):", round(UCL_T2_pca, 3), "\n")
Control limit (UCL): 9.716 
Code
cat("Out-of-control points:", sum(T2_pca > UCL_T2_pca), "\n\n")
Out-of-control points: 20 
Code
# PCA T² chart
pca_t2_data <- tibble(
  Sample = 1:length(T2_pca),
  T2_PCA = T2_pca,
  UCL = UCL_T2_pca,
  Phase = ifelse(Sample <= 80, "Phase I", "Phase II"),
  Status = ifelse(T2_PCA > UCL, "Out of Control", "In Control")
)

pca_t2_plot <- plot_ly(pca_t2_data) %>%
  add_trace(
    x = ~Sample,
    y = ~T2_PCA,
    type = "scatter",
    mode = "lines+markers",
    name = "PCA T²",
    line = list(color = "purple"),
    marker = list(
      size = 6,
      color = ~ifelse(Status == "Out of Control", "red", "purple")
    ),
    text = ~paste0(
      "Sample: ", Sample, "<br>",
      "T² (PCA): ", round(T2_PCA, 3), "<br>",
      "Status: ", Status
    ),
    hovertemplate = "%{text}<extra></extra>"
  ) %>%
  add_trace(
    x = ~Sample,
    y = ~UCL,
    type = "scatter",
    mode = "lines",
    name = "UCL",
    line = list(color = "red", dash = "dash", width = 3)
  ) %>%
  add_segments(
    x = 80.5, xend = 80.5,
    y = 0, yend = max(pca_t2_data$T2_PCA) * 1.1,
    line = list(color = "gray", dash = "dot", width = 2),
    showlegend = FALSE
  ) %>%
  layout(
    title = list(
      text = "PCA-based T² Control Chart<br><sub>Monitoring in principal component space</sub>",
      font = list(size = 16, weight = "bold")
    ),
    xaxis = list(title = "Sample Number"),
    yaxis = list(title = "T² Statistic (PCA)")
  )

pca_t2_plot

Q Statistic (SPE) Control Chart

Code
# Calculate Q statistic (Squared Prediction Error)
# Q = ||x - x_hat||² where x_hat is reconstruction from k PCs
X_reconstructed <- scores_all %*% t(pca_result$rotation[, 1:k, drop = FALSE])
residuals <- X_scaled_all - X_reconstructed

Q_stat <- rowSums(residuals^2)

# Control limit for Q using chi-squared approximation
if (k < p) {
  # Calculate parameters for chi-squared approximation
  theta1 <- sum(variance_explained[(k+1):p])
  theta2 <- sum(variance_explained[(k+1):p]^2)
  theta3 <- sum(variance_explained[(k+1):p]^3)
  
  h0 <- 1 - (2 * theta1 * theta3) / (3 * theta2^2)
  ca <- qnorm(1 - alpha)
  
  UCL_Q <- theta1 * (1 + (ca * sqrt(2 * theta2) * h0) / theta1 + 
                       (theta2 * h0 * (h0 - 1)) / theta1^2)^(1/h0)
} else {
  # If all PCs retained, Q should be near zero
  UCL_Q <- max(Q_stat) * 1.5
  cat("Note: All PCs retained, Q statistic will be small\n")
}


cat("\nQ Statistic (SPE) Chart Parameters:\n")

Q Statistic (SPE) Chart Parameters:
Code
cat("Monitoring residual variation\n")
Monitoring residual variation
Code
cat("Control limit (UCL):", round(UCL_Q, 3), "\n")
Control limit (UCL): 0.254 
Code
cat("Out-of-control points:", sum(Q_stat > UCL_Q), "\n\n")
Out-of-control points: 21 
Code
# Q statistic chart
q_data <- tibble(
  Sample = 1:length(Q_stat),
  Q = Q_stat,
  UCL = UCL_Q,
  Phase = ifelse(Sample <= 80, "Phase I", "Phase II"),
  Status = ifelse(Q > UCL, "Out of Control", "In Control")
)

q_plot <- plot_ly(q_data) %>%
  add_trace(
    x = ~Sample,
    y = ~Q,
    type = "scatter",
    mode = "lines+markers",
    name = "Q Statistic",
    line = list(color = "orange"),
    marker = list(
      size = 6,
      color = ~ifelse(Status == "Out of Control", "red", "orange")
    ),
    text = ~paste0(
      "Sample: ", Sample, "<br>",
      "Q: ", round(Q, 3), "<br>",
      "Status: ", Status
    ),
    hovertemplate = "%{text}<extra></extra>"
  ) %>%
  add_trace(
    x = ~Sample,
    y = ~UCL,
    type = "scatter",
    mode = "lines",
    name = "UCL",
    line = list(color = "red", dash = "dash", width = 3)
  ) %>%
  add_segments(
    x = 80.5, xend = 80.5,
    y = 0, yend = max(q_data$Q) * 1.1,
    line = list(color = "gray", dash = "dot", width = 2),
    showlegend = FALSE
  ) %>%
  layout(
    title = list(
      text = "Q Statistic (SPE) Control Chart<br><sub>Detecting unusual patterns and anomalies</sub>",
      font = list(size = 16, weight = "bold")
    ),
    xaxis = list(title = "Sample Number"),
    yaxis = list(title = "Q Statistic (Squared Prediction Error)")
  )

q_plot

Interpreting PCA-based SPC

Understanding T² vs Q

| Statistic | Monitors | Interpretation When Out-of-Control |
|---|---|---|
| T² (in PC space) | Systematic variation | Process mean has shifted in a known direction |
| Q (SPE) | Residual variation | Unusual pattern or new type of variation |

Decision Matrix

| T² Status | Q Status | Interpretation | Action |
|---|---|---|---|
| In Control | In Control | Process normal | Continue monitoring |
| Out of Control | In Control | Common cause shift | Adjust process mean |
| In Control | Out of Control | Special cause/anomaly | Investigate unusual event |
| Out of Control | Out of Control | Major disturbance | Stop and investigate |
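
A short sketch applying this decision matrix to the statistics computed above (it reuses T2_pca, Q_stat, UCL_T2_pca, and UCL_Q from the preceding chunks):

Code
# Classify every sample into one of the four cells of the decision matrix
signal_class <- dplyr::case_when(
  T2_pca > UCL_T2_pca & Q_stat > UCL_Q ~ "Major disturbance",
  T2_pca > UCL_T2_pca                  ~ "Mean shift (T² only)",
  Q_stat > UCL_Q                       ~ "Special cause / anomaly (Q only)",
  TRUE                                 ~ "In control"
)

table(Phase  = ifelse(seq_along(signal_class) <= 80, "Phase I", "Phase II"),
      Signal = signal_class)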

Score Plot (First Two PCs)

Code
if (k >= 2) {
  # Create score plot
  score_data <- tibble(
    Sample = 1:nrow(scores_all),
    PC1 = scores_all[, 1],
    PC2 = scores_all[, 2],
    Phase = ifelse(Sample <= 80, "Phase I", "Phase II"),
    Overall_Status = case_when(
      T2_pca > UCL_T2_pca & Q_stat > UCL_Q ~ "Both OOC",
      T2_pca > UCL_T2_pca ~ "T² Only",
      Q_stat > UCL_Q ~ "Q Only",
      TRUE ~ "In Control"
    )
  )
  
  score_plot <- plot_ly(score_data) %>%
    add_trace(
      x = ~PC1,
      y = ~PC2,
      type = "scatter",
      mode = "markers",
      color = ~Overall_Status,
      colors = c("In Control" = "blue", 
                 "T² Only" = "orange", 
                 "Q Only" = "purple",
                 "Both OOC" = "red"),
      marker = list(size = 8),
      text = ~paste0(
        "Sample: ", Sample, "<br>",
        "PC1: ", round(PC1, 3), "<br>",
        "PC2: ", round(PC2, 3), "<br>",
        "Status: ", Overall_Status
      ),
      hovertemplate = "%{text}<extra></extra>"
    ) %>%
    layout(
      title = list(
        text = "PCA Score Plot<br><sub>Process behavior in principal component space</sub>",
        font = list(size = 16, weight = "bold")
      ),
      xaxis = list(title = paste0("PC1 (", round(prop_variance[1]*100, 1), "%)")),
      yaxis = list(title = paste0("PC2 (", round(prop_variance[2]*100, 1), "%)")),
      showlegend = TRUE
    )
  
  score_plot
} else {
  cat("\n[INFO] Score plot requires at least 2 principal components.\n")
  cat("Currently retaining only", k, "component(s).\n")
  cat("Use histogram or time series plot for 1D visualization.\n")
}

[INFO] Score plot requires at least 2 principal components.
Currently retaining only 1 component(s).
Use histogram or time series plot for 1D visualization.

4. Cross-Validation in PCA-Based SPC

Why Cross-Validation Matters

Traditional SPC assumes we know the “true” number of components to retain. But in practice:

  • Too few PCs → miss important variation
  • Too many PCs → include noise, reduce sensitivity
  • The choice affects both the T² and Q statistics

Cross-validation helps:

  • Select the optimal number of components objectively
  • Avoid overfitting to Phase I data
  • Improve generalization to Phase II monitoring
  • Validate model performance

Cross-Validation Approaches for SPC

PRESS Statistic Explained:

PRESS = Predicted Residual Error Sum of Squares

\[\mathrm{PRESS} = \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^2\]

Where: - \(x_i\) = actual observation vector - \(\hat{x}_i\) = predicted (reconstructed) observation using a model trained WITHOUT \(x_i\)

Why PRESS matters:

  • Measures how well the model predicts NEW data
  • Lower PRESS = better generalization
  • Avoids overfitting (the model is evaluated on unseen data)
  • Standard metric for cross-validation

In the SPC context:

  • Tests whether the PCA model works on future Phase II data
  • Ensures control limits will be appropriate
  • Validates the number of components selected

One caveat: with simple row-wise PRESS (projecting a held-out row onto the training loadings and reconstructing it), the error shrinks to exactly zero when all \(p\) components are retained, so judge the optimum by where the error curve flattens rather than by the absolute minimum alone.

Code
# Perform cross-validation to select optimal number of PCs
# Using PRESS (Predicted Residual Error Sum of Squares)

cat("\nCross-Validation for PCA Model Selection:\n\n")

Cross-Validation for PCA Model Selection:
Code
# K-fold cross-validation
K <- 5  # Number of folds
n_phase1 <- nrow(X_phase1)
fold_size <- floor(n_phase1 / K)

# Try different numbers of components
max_components <- min(p, n_phase1 - fold_size - 1)
cv_results <- tibble(
  n_components = 1:max_components,
  CV_Error = numeric(max_components),
  PRESS = numeric(max_components)
)

for (n_comp in 1:max_components) {
  press_total <- 0
  
  for (fold in 1:K) {
    # Create training and validation sets
    # (use 'fold' as the loop index so that k, the number of retained PCs, is not overwritten)
    val_indices <- ((fold - 1) * fold_size + 1):min(fold * fold_size, n_phase1)
    train_indices <- setdiff(1:n_phase1, val_indices)
    
    # Train PCA on training set
    X_train <- X_scaled[train_indices, ]
    X_val <- X_scaled[val_indices, ]
    
    pca_cv <- prcomp(X_train, center = FALSE, scale. = FALSE)
    
    # Project validation data
    scores_val <- X_val %*% pca_cv$rotation[, 1:n_comp, drop = FALSE]
    
    # Reconstruct validation data
    X_recon <- scores_val %*% t(pca_cv$rotation[, 1:n_comp, drop = FALSE])
    
    # Calculate prediction error
    press_total <- press_total + sum((X_val - X_recon)^2)
  }
  
  cv_results$PRESS[n_comp] <- press_total
  cv_results$CV_Error[n_comp] <- press_total / n_phase1
}

# Display CV results with proper NA handling
cv_results %>%
  mutate(
    `% Improvement` = ifelse(
      n_components == 1,
      NA_real_,  # First component has no previous to compare
      -diff(c(0, CV_Error)) / c(0, CV_Error[-n()]) * 100
    ),
    Recommendation = case_when(
      n_components == which.min(CV_Error) ~ "[OPTIMAL] Minimum CV Error",
      n_components < which.min(CV_Error) ~ "Underfit (too few PCs)",
      TRUE ~ "May overfit (too many PCs)"
    )
  ) %>%
  gt() %>%
  tab_header(
    title = "Cross-Validation Results",
    subtitle = "PRESS-based selection of optimal principal components"
  ) %>%
  fmt_number(
    columns = c(CV_Error, PRESS),
    decimals = 4
  ) %>%
  fmt_number(
    columns = `% Improvement`,
    decimals = 2
  ) %>%
  sub_missing(
    columns = `% Improvement`,
    missing_text = "—"
  ) %>%
  tab_style(
    style = cell_fill(color = "lightgreen"),
    locations = cells_body(
      rows = n_components == which.min(CV_Error)
    )
  ) %>%
  tab_footnote(
    footnote = "% Improvement = reduction in CV error compared to using one fewer component",
    locations = cells_column_labels(columns = `% Improvement`)
  ) %>%
  tab_footnote(
    footnote = "PRESS = Predicted Residual Error Sum of Squares (lower is better)",
    locations = cells_column_labels(columns = PRESS)
  )
Cross-Validation Results
PRESS-based selection of optimal principal components
n_components CV_Error PRESS1 % Improvement2 Recommendation
1 0.0293 2.3423 Underfit (too few PCs)
2 0.0036 0.2843 87.86 Underfit (too few PCs)
3 0.0000 0.0000 100.00 [OPTIMAL] Minimum CV Error
1 PRESS = Predicted Residual Error Sum of Squares (lower is better)
2 % Improvement = reduction in CV error compared to using one fewer component
Code
# Optimal number of components
k_optimal <- which.min(cv_results$CV_Error)
cat("\nRecommended number of components (CV):", k_optimal, "\n")

Recommended number of components (CV): 3 
Code
cat("Original selection (95% variance):", k, "\n")
Original selection (95% variance): 5 
Code
if (k != k_optimal) {
  cat("Difference:", abs(k_optimal - k), "component(s)\n")
  improvement <- (cv_results$CV_Error[k] - cv_results$CV_Error[k_optimal]) / 
                 cv_results$CV_Error[k] * 100
  cat("CV error improvement:", round(improvement, 2), "%\n")
} else {
  cat("Both methods AGREE on", k, "components [OK]\n")
}
Difference: 2 component(s)
CV error improvement: NA %
Code
cat("\n")
Code
# Plot CV error vs number of components
cv_plot <- plot_ly(cv_results) %>%
  add_trace(
    x = ~n_components,
    y = ~CV_Error,
    type = "scatter",
    mode = "lines+markers",
    name = "CV Error",
    line = list(color = "blue", width = 2),
    marker = list(size = 8, color = "blue"),
    hovertemplate = "<b>%{x} Components</b><br>CV Error: %{y:.4f}<extra></extra>"
  ) %>%
  add_trace(
    x = k_optimal,
    y = cv_results$CV_Error[k_optimal],
    type = "scatter",
    mode = "markers",
    name = "Optimal",
    marker = list(size = 15, color = "red", symbol = "star"),
    hovertemplate = "<b>Optimal: %{x} PCs</b><br>Error: %{y:.4f}<extra></extra>"
  ) %>%
  layout(
    title = list(
      text = "Cross-Validation: Selecting Number of Components<br><sub>Lower CV error indicates better model</sub>",
      font = list(size = 16, weight = "bold")
    ),
    xaxis = list(title = "Number of Principal Components", dtick = 1),
    yaxis = list(title = "Cross-Validation Error (PRESS/n)")
  )

cv_plot

Interpreting CV Results

What the CV Error Tells You:

| Pattern | Interpretation | Action |
|---|---|---|
| Error decreases then plateaus | Optimal found | Use the number of PCs at the minimum |
| Error keeps decreasing | May need more PCs | Check if reasonable |
| Sharp drop then flat | Clear optimal point | High confidence |
| Noisy curve | Unstable model | Need more Phase I data |

Comparison: Variance vs Cross-Validation

Code
# Compare variance-based vs CV-based selection
# Add safety checks to prevent NAs

# Ensure indices are valid
k_safe <- min(k, length(cum_variance))
k_optimal_safe <- min(k_optimal, length(cum_variance), nrow(cv_results))

comparison <- tibble(
  Method = c("Variance (95%)", "Cross-Validation", "Difference"),
  `# Components` = c(k, k_optimal, NA_integer_),
  `Variance Explained %` = c(
    cum_variance[k_safe] * 100,
    cum_variance[k_optimal_safe] * 100,
    abs(cum_variance[k_safe] - cum_variance[k_optimal_safe]) * 100
  ),
  `CV Error` = c(
    cv_results$CV_Error[k_safe],
    cv_results$CV_Error[k_optimal_safe],
    abs(cv_results$CV_Error[k_safe] - cv_results$CV_Error[k_optimal_safe])
  ),
  `Better By` = c(
    "—",
    "—",
    ifelse(k == k_optimal, 
           "Same selection", 
           paste0(abs(k - k_optimal), " component(s)"))
  ),
  Recommendation = c(
    "Traditional approach",
    "Data-driven optimum",
    ifelse(k == k_optimal, "[OK] METHODS AGREE", "[INFO] METHODS DIFFER")
  )
)

comparison %>%
  gt() %>%
  tab_header(
    title = "Model Selection Comparison",
    subtitle = "Variance-based vs Cross-validation"
  ) %>%
  fmt_number(
    columns = `Variance Explained %`,
    decimals = 2
  ) %>%
  fmt_number(
    columns = `CV Error`,
    decimals = 4
  ) %>%
  sub_missing(
    columns = everything(),
    missing_text = "—"
  ) %>%
  tab_style(
    style = cell_fill(color = "lightgreen"),
    locations = cells_body(rows = Method == "Cross-Validation")
  ) %>%
  tab_style(
    style = cell_fill(color = "lightyellow"),
    locations = cells_body(rows = Method == "Difference")
  ) %>%
  tab_footnote(
    footnote = "Difference row shows absolute differences between the two methods",
    locations = cells_body(rows = Method == "Difference", columns = Method)
  ) %>%
  tab_footnote(
    footnote = "— indicates not applicable or no meaningful comparison",
    locations = cells_body(rows = Method == "Difference", columns = `# Components`)
  )
Model Selection Comparison
Variance-based vs Cross-validation
Method # Components Variance Explained % CV Error Better By Recommendation
Variance (95%) 5 100.00 0.0000 Traditional approach
Cross-Validation 3 100.00 0.0000 Data-driven optimum
Difference1 2 — 0.00 0.0000 2 component(s) [INFO] METHODS DIFFER
1 Difference row shows absolute differences between the two methods
2 — indicates not applicable or no meaningful comparison
Code
cat("\n")
Code
if (k != k_optimal) {
  cat("[INFO] Cross-validation suggests a different number of components!\n")
  cat("Variance method:", k, "components (", round(cum_variance[k_safe]*100, 1), "% variance)\n")
  cat("CV method:", k_optimal, "components (", round(cum_variance[k_optimal_safe]*100, 1), "% variance)\n\n")
  
  if (k_optimal < k) {
    cat("Interpretation: CV suggests FEWER components\n")
    cat("Benefit: May improve sensitivity by excluding noise\n")
    cat("Trade-off: Slightly less variance captured\n")
  } else {
    cat("Interpretation: CV suggests MORE components\n")
    cat("Benefit: Captures additional important variation\n")
    cat("Trade-off: Slightly higher complexity\n")
  }
} else {
  cat("[OK] Both methods agree on", k, "components.\n")
  cat("This indicates robust model selection.\n")
}
[INFO] Cross-validation suggests a different number of components!
Variance method: 5 components ( 100 % variance)
CV method: 3 components ( 100 % variance)

Interpretation: CV suggests FEWER components
Benefit: May improve sensitivity by excluding noise
Trade-off: Slightly less variance captured
Code
cat("\n")

When to Use Cross-Validation in SPC

Use CV when:

  • ✅ High-dimensional data (10+ variables)
  • ✅ Limited Phase I samples
  • ✅ Uncertain about the number of components
  • ✅ Building predictive models (soft sensors)
  • ✅ Want to avoid overfitting

Variance threshold (95%) is fine when:

  • ✅ Clear eigenvalue drop-off (scree plot)
  • ✅ Sufficient Phase I data (50+ samples)
  • ✅ Well-understood process
  • ✅ Traditional SPC context

Advanced: Leave-One-Out Cross-Validation (LOOCV)

Code
# LOOCV for more precise estimate (computationally intensive)
if (n_phase1 <= 50) {  # Only for small datasets
  
  cat("\nPerforming Leave-One-Out Cross-Validation:\n")
  
  loocv_results <- tibble(
    n_components = 1:max_components,
    LOOCV_Error = numeric(max_components)
  )
  
  for (n_comp in 1:max_components) {
    loocv_error <- 0
    
    # Leave one out
    for (i in 1:n_phase1) {
      X_train <- X_scaled[-i, ]
      X_test <- X_scaled[i, , drop = FALSE]
      
      pca_loo <- prcomp(X_train, center = FALSE, scale. = FALSE)
      
      # Project test sample
      score_test <- X_test %*% pca_loo$rotation[, 1:n_comp, drop = FALSE]
      X_recon <- score_test %*% t(pca_loo$rotation[, 1:n_comp, drop = FALSE])
      
      loocv_error <- loocv_error + sum((X_test - X_recon)^2)
    }
    
    loocv_results$LOOCV_Error[n_comp] <- loocv_error / n_phase1
  }
  
  k_loocv <- which.min(loocv_results$LOOCV_Error)
  
  cat("LOOCV optimal components:", k_loocv, "\n")
  cat("K-fold CV optimal:", k_optimal, "\n")
  cat("Variance-based:", k, "\n\n")
  
  # Compare all three methods
  all_methods <- plot_ly() %>%
    add_trace(
      data = cv_results,
      x = ~n_components,
      y = ~CV_Error,
      type = "scatter",
      mode = "lines+markers",
      name = "K-Fold CV",
      line = list(color = "blue")
    ) %>%
    add_trace(
      data = loocv_results,
      x = ~n_components,
      y = ~LOOCV_Error,
      type = "scatter",
      mode = "lines+markers",
      name = "LOOCV",
      line = list(color = "red", dash = "dash")
    ) %>%
    add_segments(
      x = k, xend = k,
      y = min(c(cv_results$CV_Error, loocv_results$LOOCV_Error)),
      yend = max(c(cv_results$CV_Error, loocv_results$LOOCV_Error)),
      line = list(color = "green", dash = "dot"),
      name = "95% Variance"
    ) %>%
    layout(
      title = list(
        text = "Comparing Model Selection Methods",
        font = list(size = 16, weight = "bold")
      ),
      xaxis = list(title = "Number of Components", dtick = 1),
      yaxis = list(title = "Prediction Error"),
      showlegend = TRUE
    )
  
  all_methods
} else {
  cat("\n[SKIPPED] LOOCV not performed (dataset too large)\n")
  cat("K-fold CV is more efficient for large Phase I datasets.\n")
}

[SKIPPED] LOOCV not performed (dataset too large)
K-fold CV is more efficient for large Phase I datasets.

Practical Recommendations

For Your SPC Implementation:

  1. Start with variance threshold (95% or scree plot)
  2. Validate with CV if:
    • Unexpected false alarms
    • Poor Phase II performance
    • Limited Phase I data
  3. Use CV optimal if it differs significantly
  4. Monitor both T² and Q regardless of selection

CV Best Practices:

  • Use K=5 or K=10 for K-fold
  • LOOCV for small datasets (<50 samples)
  • Stratify folds if the data has structure (see the sketch below)
  • Revalidate periodically with new data
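
A minimal sketch of the stratification point above, assuming the first 80 rows of mv_data are the Phase I samples (as in this report) and using the shift column as the stratification variable; phase1 and fold_id are names introduced only for this sketch.

Code
# Assign K folds stratified by shift so each fold has a similar Day/Night mix
set.seed(2024)
K <- 5
phase1  <- mv_data[1:80, ]
fold_id <- integer(nrow(phase1))

for (g in split(seq_len(nrow(phase1)), phase1$shift)) {
  fold_id[g] <- sample(rep_len(1:K, length(g)))   # permute fold labels within each shift group
}

table(Fold = fold_id, Shift = phase1$shift)   # check the balance across folds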

Literature Support

Cross-validation in SPC is supported by:

  • Nomikos & MacGregor (1995): Monitoring batch processes with multiway PCA
  • Kourti (2005): Process analytical technology and quality by design
  • Qin (2012): Survey of industrial applications of data-driven methods

Modern Practice: Many pharmaceutical and chemical industries now use CV-based model selection for:

  • Batch process monitoring
  • Soft sensor development
  • Fault detection systems
  • Quality prediction models

5. Variable Contribution Analysis

Identifying Root Causes

When a multivariate chart signals out-of-control, we need to identify which variable(s) are responsible.

Code
# For recent out-of-control points, show variable contributions
recent_ooc <- tail(which(MEWMA_T2 > UCL_MEWMA), 5)

if (length(recent_ooc) > 0) {
  # Calculate standardized deviations
  contrib_data <- list()
  
  for (idx in recent_ooc) {
    x_i <- X[idx, ]
    std_dev <- (x_i - mu_hat) / sqrt(diag(Sigma_hat))
    
    contrib_data[[length(contrib_data) + 1]] <- tibble(
      Sample = idx,
      Variable = quality_vars,
      Value = x_i,
      Target = mu_hat,
      `Std Deviation` = std_dev,
      `Abs Std Dev` = abs(std_dev),
      Status = ifelse(abs(std_dev) > 3, "[X] Out of Control",
                     ifelse(abs(std_dev) > 2, "[WARNING] Near Limit", "[OK] In Control"))
    )
  }
  
  contrib_table <- bind_rows(contrib_data)
  
  contrib_table %>%
    gt() %>%
    tab_header(
      title = "Variable Contribution Analysis",
      subtitle = "Which variables are causing out-of-control signals?"
    ) %>%
    fmt_number(
      columns = c(Value, Target, `Std Deviation`),
      decimals = 3
    ) %>%
    fmt_number(
      columns = `Abs Std Dev`,
      decimals = 2
    ) %>%
    tab_style(
      style = cell_fill(color = "lightcoral"),
      locations = cells_body(
        columns = `Abs Std Dev`,
        rows = `Abs Std Dev` > 3
      )
    ) %>%
    tab_style(
      style = cell_fill(color = "lightyellow"),
      locations = cells_body(
        columns = `Abs Std Dev`,
        rows = `Abs Std Dev` > 2 & `Abs Std Dev` <= 3
      )
    ) %>%
    tab_style(
      style = cell_fill(color = "lightgreen"),
      locations = cells_body(
        columns = `Abs Std Dev`,
        rows = `Abs Std Dev` <= 2
      )
    )
  
  # Create contribution bar chart for most recent OOC point
  latest_ooc <- tail(recent_ooc, 1)
  latest_contrib <- contrib_table %>% filter(Sample == latest_ooc)
  
  contrib_plot <- plot_ly(
    data = latest_contrib,
    x = ~Variable,
    y = ~`Std Deviation`,
    type = "bar",
    marker = list(
      color = ~ifelse(abs(`Std Deviation`) > 3, "red",
                     ifelse(abs(`Std Deviation`) > 2, "orange", "green"))
    ),
    text = ~round(`Std Deviation`, 2),
    textposition = "outside",
    hovertemplate = paste(
      "<b>%{x}</b><br>",
      "Std Dev: %{y:.3f}<br>",
      "<extra></extra>"
    )
  ) %>%
    add_segments(
      x = -0.5, xend = length(quality_vars) - 0.5,
      y = 3, yend = 3,
      line = list(color = "red", dash = "dash", width = 2),
      showlegend = FALSE,
      name = "+3σ"
    ) %>%
    add_segments(
      x = -0.5, xend = length(quality_vars) - 0.5,
      y = -3, yend = -3,
      line = list(color = "red", dash = "dash", width = 2),
      showlegend = FALSE,
      name = "-3σ"
    ) %>%
    layout(
      title = list(
        text = paste0("Variable Contributions - Sample ", latest_ooc, 
                     "<br><sub>Standardized deviations from target</sub>"),
        font = list(size = 16, weight = "bold")
      ),
      xaxis = list(title = "Variable"),
      yaxis = list(title = "Standardized Deviation (σ)"),
      showlegend = FALSE
    )
  
  contrib_plot
} else {
  cat("\nNo recent out-of-control points for contribution analysis.\n")
}

6. Process Monitoring Dashboard

Code
# Create summary metrics
dashboard_metrics <- tibble(
  Metric = c(
    "Total Samples",
    "Phase I Samples",
    "Phase II Samples",
    "Variables Monitored",
    "T² Out-of-Control",
    "MEWMA Out-of-Control",
    "Current Process Status"
  ),
  Value = c(
    nrow(X),
    80,
    nrow(X) - 80,
    p,
    sum(T2_values > UCL_T2),
    sum(MEWMA_T2 > UCL_MEWMA),
    ifelse(tail(MEWMA_T2, 1) > UCL_MEWMA, "[X] OUT OF CONTROL", "[OK] IN CONTROL")
  )
)

dashboard_metrics %>%
  gt() %>%
  tab_header(
    title = "Multivariate SPC Dashboard",
    subtitle = paste("Analysis Date:", Sys.Date())
  ) %>%
  tab_style(
    style = cell_fill(color = "lightcoral"),
    locations = cells_body(
      columns = Value,
      rows = grepl("OUT OF CONTROL", Value)
    )
  ) %>%
  tab_style(
    style = cell_fill(color = "lightgreen"),
    locations = cells_body(
      columns = Value,
      rows = grepl("IN CONTROL", Value)
    )
  ) %>%
  tab_style(
    style = list(
      cell_fill(color = "lightblue"),
      cell_text(weight = "bold")
    ),
    locations = cells_column_labels()
  )
Multivariate SPC Dashboard
Analysis Date: 2025-12-21
Metric Value
Total Samples 100
Phase I Samples 80
Phase II Samples 20
Variables Monitored 3
T² Out-of-Control 22
MEWMA Out-of-Control 23
Current Process Status [X] OUT OF CONTROL

7. Recommendations and Actions

Current Process Assessment

Code
# Calculate process performance metrics
recent_samples <- 10
recent_T2 <- tail(T2_values, recent_samples)
recent_MEWMA <- tail(MEWMA_T2, recent_samples)

recent_violations_T2 <- sum(recent_T2 > UCL_T2)
recent_violations_MEWMA <- sum(recent_MEWMA > UCL_MEWMA)

cat("\n### Recent Process Performance (Last", recent_samples, "Samples)\n\n")

### Recent Process Performance (Last 10 Samples)
Code
cat("- T² violations:", recent_violations_T2, "out of", recent_samples, 
    paste0("(", round(recent_violations_T2/recent_samples*100, 1), "%)\n"))
- T² violations: 10 out of 10 (100%)
Code
cat("- MEWMA violations:", recent_violations_MEWMA, "out of", recent_samples,
    paste0("(", round(recent_violations_MEWMA/recent_samples*100, 1), "%)\n\n"))
- MEWMA violations: 10 out of 10 (100%)
Code
if (recent_violations_MEWMA > 0) {
  cat("**Status: [WARNING] Process requires immediate attention**\n\n")
  cat("**Recommended Actions:**\n")
  cat("1. Stop production and investigate root cause\n")
  cat("2. Review variable contributions to identify problem source\n")
  cat("3. Check equipment calibration and process parameters\n")
  cat("4. Review operator procedures and training\n")
  cat("5. Implement corrective actions\n")
  cat("6. Document findings and countermeasures\n")
} else {
  cat("**Status: [OK] Process is in statistical control**\n\n")
  cat("**Recommended Actions:**\n")
  cat("1. Continue current monitoring protocol\n")
  cat("2. Maintain equipment calibration schedule\n")
  cat("3. Regular operator training refreshers\n")
  cat("4. Update control limits quarterly\n")
}
**Status: [WARNING] Process requires immediate attention**

**Recommended Actions:**
1. Stop production and investigate root cause
2. Review variable contributions to identify problem source
3. Check equipment calibration and process parameters
4. Review operator procedures and training
5. Implement corrective actions
6. Document findings and countermeasures

Advantages of Multivariate SPC

Compared to univariate monitoring:

  1. Accounts for correlations: Detects patterns that univariate charts miss
  2. Reduces chart proliferation: One chart instead of multiple
  3. Lower false alarm rate: Adjusts for multiple testing
  4. More powerful: Better at detecting real process shifts
  5. Process understanding: Reveals relationships between variables

When to Use Each Method

| Method | Best For | Advantages | Disadvantages |
|---|---|---|---|
| Hotelling T² | Large sudden shifts | Simple, well-understood | Less sensitive to small shifts |
| MEWMA | Small gradual shifts | Memory, smoothing | Slower response to large shifts |
| MCUSUM | Persistent small shifts | Directional info | Complex interpretation |
| PCA-based | High dimensions (10+ vars) | Dimension reduction | Requires more data |

Implementation Guidelines

Phase I (Establishing Control):

  1. Collect a baseline of stable-operation samples (at least 20-30, preferably more for multivariate data; this report uses 80 Phase I samples)
  2. Calculate the mean vector and covariance matrix
  3. Verify multivariate normality (see the sketch below)
  4. Establish control limits
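
A sketch of one way to carry out step 3: a chi-square Q-Q plot of squared Mahalanobis distances, reusing the Phase I objects X_phase1, mu_hat, and Sigma_hat computed earlier in this report.

Code
# Squared Mahalanobis distances of the Phase I observations
d2 <- mahalanobis(X_phase1, center = mu_hat, cov = Sigma_hat)

# Under multivariate normality, d2 is approximately chi-square with p degrees of freedom
qqplot(qchisq(ppoints(length(d2)), df = ncol(X_phase1)), sort(d2),
       xlab = "Chi-square quantiles",
       ylab = "Ordered squared Mahalanobis distances",
       main = "Multivariate normality check (Phase I)")
abline(0, 1, col = "red", lty = 2)   # points near this line support approximate normality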

Phase II (Ongoing Monitoring):

  1. Plot new observations on the control charts (see the sketch below)
  2. Investigate out-of-control signals immediately
  3. Use contribution analysis to identify root causes
  4. Document and implement corrective actions
  5. Update control limits after process improvements
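
A minimal sketch of step 1 for a single new observation. The values in x_new are hypothetical; the sketch reuses quality_vars, mu_hat, Sigma_hat, and UCL_T2 from earlier chunks.

Code
# Hypothetical new reading to be checked against the Phase I model
x_new <- c(temperature = 151.2, pressure = 203.5, flow_rate = 50.6)

d      <- x_new[quality_vars] - mu_hat
T2_new <- drop(t(d) %*% solve(Sigma_hat) %*% d)   # Hotelling T² for one observation

if (T2_new > UCL_T2) {
  cat("T2 =", round(T2_new, 2), "> UCL =", round(UCL_T2, 2), "-> investigate\n")
  # Quick contribution hint: standardized deviation of each variable from its Phase I mean
  print(round(sort(abs(d / sqrt(diag(Sigma_hat))), decreasing = TRUE), 2))
} else {
  cat("T2 =", round(T2_new, 2), "<= UCL =", round(UCL_T2, 2), "-> in control\n")
}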


Report Generated: 2025-12-21 09:32:24.371897
Analysis Method: Hotelling T² and MEWMA
Statistical Software: R with qcr package (unified quality control)
Methodology: Multivariate Statistical Process Control
Package Info: qcr integrates qcc and MSQC functionality