Multivariate Statistical Process Control (MSPC)

Hotelling T², MEWMA, and Multivariate Process Monitoring

Author

David Guy, PhD, Quality Control Department

Published

December 21, 2025

Introduction to Multivariate SPC

Note: This analysis uses the qcr package as the quality control framework. The multivariate methods (Hotelling T² and MEWMA) are implemented using standard statistical formulas to ensure compatibility across different package versions and to provide educational transparency. All calculations follow established statistical theory and produce results equivalent to specialized multivariate SPC packages.

What is Multivariate SPC?

Multivariate Statistical Process Control (MSPC) monitors multiple related quality characteristics simultaneously, rather than tracking each variable independently.

Why Use Multivariate SPC?

Traditional (Univariate) Approach:

  • Monitor each variable with separate control charts
  • Example: Track temperature, pressure, flow rate separately
  • Problem: Misses interactions and correlations between variables

Multivariate Approach:

  • Monitor all variables together as a system
  • Detects patterns that univariate charts miss
  • Accounts for correlations between variables
  • More powerful for detecting process changes

Real-World Example

Chemical Reactor Process:

  • Temperature: 150°C +/- 5°C
  • Pressure: 200 psi +/- 10 psi
  • pH: 7.0 +/- 0.5

Scenario (each variable individually in control):

  • Temperature: 152°C [OK] - within limits
  • Pressure: 195 psi [OK] - within limits
  • But: temperature is above target while pressure is below it; given their strong positive correlation, this combination is unusual and indicates a problem!

Multivariate SPC detects this because it knows the normal relationship between temperature and pressure.
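
To make this concrete, here is a minimal sketch with illustrative parameters (standard deviations of 2°C and 5 psi and a correlation of 0.9, chosen for this example only, not estimates from this report's data). It shows that the temperature/pressure pair above is flagged by the multivariate distance even though each reading looks unremarkable on its own.

Code
# Illustrative in-control parameters (hypothetical, for this example only)
mu    <- c(temperature = 150, pressure = 200)
sds   <- c(2, 5)
rho   <- 0.9
Sigma <- diag(sds) %*% matrix(c(1, rho, rho, 1), 2, 2) %*% diag(sds)

x <- c(temperature = 152, pressure = 195)   # each value individually within limits

# Univariate view: modest standardized deviations (+1 and -1 sigma)
round((x - mu) / sds, 2)

# Multivariate view: the combination violates the positive correlation
d  <- x - mu
T2 <- drop(t(d) %*% solve(Sigma) %*% d)   # Hotelling T² for a single observation (n = 1)
c(T2 = round(T2, 2), UCL = round(qchisq(0.9973, df = 2), 2))   # about 20.0 vs about 11.8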

Key Multivariate Methods

1. Hotelling T² Chart

  • Multivariate extension of X-bar chart
  • Monitors mean vector of multiple variables
  • One statistic for all variables combined
  • Most common multivariate control chart

2. MEWMA (Multivariate EWMA)

  • Multivariate exponentially weighted moving average
  • More sensitive to small sustained shifts
  • Better for detecting gradual process drift
  • Incorporates historical information

3. MCUSUM (Multivariate CUSUM)

  • Multivariate cumulative sum
  • Excellent for detecting small persistent shifts
  • Directional information for diagnosis (a minimal implementation sketch follows this overview)

4. Principal Component Analysis (PCA)

  • Reduces dimensionality
  • Transforms correlated variables to uncorrelated components
  • T² and Q statistics for monitoring
  • Useful for high-dimensional data (10+ variables)
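
MCUSUM is described above but not demonstrated later in this report. The following is a minimal, self-contained sketch of Crosier's multivariate CUSUM on simulated data; the reference value k and decision limit h are illustrative placeholders, not values calibrated to a target in-control ARL.

Code
# Minimal Crosier MCUSUM sketch on simulated data (illustrative parameters)
library(MASS)  # mvrnorm

set.seed(1)
p     <- 3
mu0   <- rep(0, p)            # in-control target (standardized units)
Sigma <- diag(p)
Sinv  <- solve(Sigma)

# 60 in-control observations followed by 40 with a small sustained shift
X_sim <- rbind(mvrnorm(60, mu0, Sigma),
               mvrnorm(40, mu0 + c(0.8, 0.8, 0), Sigma))

k_ref <- 0.5    # reference value (illustrative)
h_lim <- 5.5    # decision limit (illustrative)

S <- rep(0, p)
Y <- numeric(nrow(X_sim))
for (i in seq_len(nrow(X_sim))) {
  d  <- S + X_sim[i, ] - mu0
  Ci <- sqrt(drop(t(d) %*% Sinv %*% d))
  S  <- if (Ci <= k_ref) rep(0, p) else d * (1 - k_ref / Ci)   # shrink toward zero
  Y[i] <- sqrt(drop(t(S) %*% Sinv %*% S))                      # plotted MCUSUM statistic
}

which(Y > h_lim)[1]   # first sample at which the MCUSUM signals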

Setup and Data Import

Code
# Install packages if needed
packages <- c("qcr", "plotly", "gt", "dplyr", "tidyr", "ggplot2", 
              "readr", "lubridate", "DT", "knitr", "MASS", "psych")
missing_pkgs <- setdiff(packages, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

# Load libraries
library(qcr)       # Unified quality control (replaces qcc + MSQC)
library(plotly)    # Interactive plots
library(gt)        # Tables
library(dplyr)     # Data manipulation
library(tidyr)     # Data tidying
library(ggplot2)   # Graphics
library(readr)     # Data import
library(lubridate) # Dates
library(DT)        # Interactive tables
library(knitr)     # Reports
library(MASS)      # Multivariate normal
library(psych)     # Correlation matrices

theme_set(theme_minimal())

cat("Package versions:\n")
Package versions:
Code
cat("qcr version:", as.character(packageVersion("qcr")), "\n\n")
qcr version: 1.4 

Import Multivariate Data

Code
# Import multivariate quality data
if (file.exists("multivariate_data.csv")) {
  mv_data <- read_csv("multivariate_data.csv", show_col_types = FALSE)
  cat("[OK] Multivariate data loaded from multivariate_data.csv\n\n")
} else {
  cat("[INFO] No multivariate_data.csv found. Generating example data.\n")
  cat("For your own data, create multivariate_data.csv with multiple quality variables.\n\n")
  
  # Generate realistic multivariate process data
  set.seed(42)
  n_samples <- 100
  
  # Phase I: In-control data (samples 1-80)
  # Create correlated variables (temperature, pressure, flow_rate)
  
  # Correlation structure (realistic for a process)
  correlation_matrix <- matrix(c(
    1.0,  0.7,  0.5,   # Temperature correlations
    0.7,  1.0,  0.6,   # Pressure correlations
    0.5,  0.6,  1.0    # Flow rate correlations
  ), nrow = 3, byrow = TRUE)
  
  # Mean vector
  mu_in_control <- c(150, 200, 50)  # Temp, Pressure, Flow
  
  # Standard deviations
  sigma_in_control <- c(2, 5, 1)
  
  # Create covariance matrix
  D <- diag(sigma_in_control)
  Sigma_in_control <- D %*% correlation_matrix %*% D
  
  # Generate in-control data (Phase I)
  phase1_data <- mvrnorm(n = 80, mu = mu_in_control, Sigma = Sigma_in_control)
  
  # Phase II: Out-of-control data (samples 81-100)
  # Introduce a mean shift
  mu_out_control <- c(152, 205, 51)  # Small shifts in all variables
  phase2_data <- mvrnorm(n = 20, mu = mu_out_control, Sigma = Sigma_in_control)
  
  # Combine data
  all_data <- rbind(phase1_data, phase2_data)
  
  # Create dataframe
  mv_data <- tibble(
    sample = 1:n_samples,
    date = seq.Date(from = Sys.Date() - n_samples, to = Sys.Date() - 1, by = "day"),
    temperature = all_data[, 1],
    pressure = all_data[, 2],
    flow_rate = all_data[, 3],
    shift = sample(c("Day", "Night"), n_samples, replace = TRUE),
    operator = sample(paste0("OP", 1:3), n_samples, replace = TRUE)
  )
}
[OK] Multivariate data loaded from multivariate_data.csv
Code
# Display data summary
cat("Multivariate Process Data:\n")
Multivariate Process Data:
Code
cat("Samples:", nrow(mv_data), "\n")
Samples: 100 
Code
cat("Variables:", sum(sapply(mv_data, is.numeric)), "numeric variables\n\n")
Variables: 4 numeric variables
Code
# Show first few observations
mv_data %>%
  head(10) %>%
  gt() %>%
  tab_header(
    title = "Multivariate Quality Data - Sample Preview",
    subtitle = "First 10 observations"
  ) %>%
  fmt_number(
    columns = where(is.numeric),
    decimals = 2
  ) %>%
  tab_style(
    style = cell_fill(color = "lightblue"),
    locations = cells_column_labels()
  )
Multivariate Quality Data - Sample Preview
First 10 observations
sample date temperature pressure flow_rate shift operator
1.00 2024-09-11 149.82 199.45 49.89 Day OP1
2.00 2024-09-12 150.24 201.32 50.15 Night OP2
3.00 2024-09-13 149.95 198.76 49.95 Day OP3
4.00 2024-09-14 150.18 200.88 50.08 Day OP1
5.00 2024-09-15 149.73 199.12 49.82 Night OP2
6.00 2024-09-16 150.42 202.15 50.28 Day OP1
7.00 2024-09-17 149.88 199.67 49.92 Day OP3
8.00 2024-09-18 150.15 200.45 50.12 Night OP2
9.00 2024-09-19 149.92 199.88 49.98 Day OP1
10.00 2024-09-20 150.05 200.12 50.02 Day OP3

Data Summary and Correlation Analysis

Code
# Select quality variables
quality_vars <- c("temperature", "pressure", "flow_rate")
X <- as.matrix(mv_data[, quality_vars])

# Summary statistics
summary_stats <- mv_data %>%
  select(all_of(quality_vars)) %>%
  summarise(across(everything(),
                   list(Mean = ~mean(., na.rm = TRUE),
                        SD = ~sd(., na.rm = TRUE),
                        Min = ~min(., na.rm = TRUE),
                        Max = ~max(., na.rm = TRUE)),
                   .names = "{.col}__{.fn}")) %>%
  pivot_longer(everything(),
               names_to = c("Variable", "Statistic"),
               names_sep = "__") %>%
  pivot_wider(names_from = Statistic, values_from = value)

summary_stats %>%
  gt() %>%
  tab_header(
    title = "Summary Statistics",
    subtitle = "Process variables descriptive statistics"
  ) %>%
  fmt_number(
    columns = c(Mean, SD, Min, Max),
    decimals = 3
  ) %>%
  tab_style(
    style = list(
      cell_fill(color = "lightyellow"),
      cell_text(weight = "bold")
    ),
    locations = cells_body(columns = Variable)
  )
Summary Statistics
Process variables descriptive statistics
Variable Mean SD Min Max
temperature 150.444 0.857 149.670 152.380
pressure 201.321 2.219 198.760 206.220
flow_rate 50.250 0.485 49.750 51.350
Code
# Correlation matrix
cor_matrix <- cor(X)

cat("\nCorrelation Matrix:\n")

Correlation Matrix:
Code
cor_df <- as.data.frame(cor_matrix)
cor_df$Variable <- rownames(cor_matrix)
cor_df <- cor_df[, c("Variable", quality_vars)]

cor_df %>%
  gt() %>%
  tab_header(
    title = "Correlation Matrix",
    subtitle = "Relationships between process variables"
  ) %>%
  fmt_number(
    columns = all_of(quality_vars),
    decimals = 3
  )
Correlation Matrix
Relationships between process variables
Variable temperature pressure flow_rate
temperature 1.000 0.980 0.998
pressure 0.980 1.000 0.989
flow_rate 0.998 0.989 1.000
Code
# Note: Strong correlations (>0.7) are important for multivariate SPC
cat("\nStrong correlations (>0.7) found:\n")

Strong correlations (>0.7) found:
Code
for (i in 1:(length(quality_vars)-1)) {
  for (j in (i+1):length(quality_vars)) {
    if (abs(cor_matrix[i,j]) > 0.7) {
      cat(sprintf("%s <-> %s: %.3f\n", 
                  quality_vars[i], quality_vars[j], cor_matrix[i,j]))
    }
  }
}
temperature <-> pressure: 0.980
temperature <-> flow_rate: 0.998
pressure <-> flow_rate: 0.989
Code
cat("\n")
Code
# Correlation heatmap
cor_long <- cor_matrix %>%
  as.data.frame() %>%
  mutate(Var1 = rownames(.)) %>%
  pivot_longer(cols = -Var1, names_to = "Var2", values_to = "Correlation")

cor_plot <- plot_ly(
  data = cor_long,
  x = ~Var1,
  y = ~Var2,
  z = ~Correlation,
  type = "heatmap",
  colorscale = list(
    c(0, "blue"),
    c(0.5, "white"),
    c(1, "red")
  ),
  zmin = -1,
  zmax = 1,
  text = ~round(Correlation, 3),
  hovertemplate = "<b>%{x} vs %{y}</b><br>Correlation: %{z:.3f}<extra></extra>"
) %>%
  layout(
    title = list(
      text = "Correlation Heatmap<br><sub>Strong correlations indicate variables move together</sub>",
      font = list(size = 16)
    ),
    xaxis = list(title = ""),
    yaxis = list(title = "")
  )

cor_plot

1. Hotelling T² Control Chart

Theory

The Hotelling T² statistic is a multivariate generalization of the t-statistic:

Formula: \[T^2 = n(\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0)\]

Where: - \(\bar{x}\) = sample mean vector - \(\mu_0\) = target mean vector (from Phase I) - \(S\) = sample covariance matrix - \(n\) = sample size

Control Limit: For Phase II monitoring: \[UCL = \frac{p(m+1)(m-1)}{m^2-mp} F_{\alpha}(p, m-p)\]

Where: - \(p\) = number of variables - \(m\) = number of Phase I samples - \(F_{\alpha}\) = F-distribution critical value

Interpretation:

  • T² combines all variables into a single statistic
  • Large T² indicates multivariate out-of-control
  • Accounts for correlations between variables

Code
# Phase I data (first 80 samples for establishing control limits)
phase1_indices <- 1:80
X_phase1 <- X[phase1_indices, ]

# Calculate Phase I statistics
mu_hat <- colMeans(X_phase1)
Sigma_hat <- cov(X_phase1)

cat("\nPhase I Statistics:\n")

Phase I Statistics:
Code
cat("Mean vector:\n")
Mean vector:
Code
print(round(mu_hat, 3))
temperature    pressure   flow_rate 
    150.029     200.309      50.019 
Code
cat("\nCovariance matrix:\n")

Covariance matrix:
Code
print(round(Sigma_hat, 3))
            temperature pressure flow_rate
temperature       0.040    0.190     0.029
pressure          0.190    0.942     0.139
flow_rate         0.029    0.139     0.021
Code
cat("\n")
Code
# Phase II data (monitoring new samples)
X_phase2 <- X

# Calculate T² statistics using standard formula
m <- nrow(X_phase1)  # Number of Phase I samples
p <- ncol(X)          # Number of variables
n <- 1                # Individual observations

# Calculate T² for each observation
T2_values <- numeric(nrow(X_phase2))
for (i in 1:nrow(X_phase2)) {
  x_i <- X_phase2[i, ]
  # Hotelling T² formula: T² = n(x - μ)'Σ⁻¹(x - μ)
  T2_values[i] <- n * t(x_i - mu_hat) %*% solve(Sigma_hat) %*% (x_i - mu_hat)
}

# Calculate Phase II control limit
# UCL = [(p(m+1)(m-1)) / (m²-mp)] * F(α, p, m-p)
alpha <- 0.0027  # For 3-sigma equivalent (0.27%)
F_critical <- qf(1 - alpha, p, m - p)
UCL_T2 <- (p * (m + 1) * (m - 1)) / (m^2 - m * p) * F_critical

cat("Hotelling T² Chart Parameters:\n")
Hotelling T² Chart Parameters:
Code
cat("Number of variables (p):", p, "\n")
Number of variables (p): 3 
Code
cat("Phase I samples (m):", m, "\n")
Phase I samples (m): 80 
Code
cat("Significance level (α):", alpha, "\n")
Significance level (α): 0.0027 
Code
cat("F critical value:", round(F_critical, 3), "\n")
F critical value: 5.145 
Code
cat("Control limit (UCL):", round(UCL_T2, 3), "\n")
Control limit (UCL): 16.032 
Code
cat("Number of out-of-control points:", sum(T2_values > UCL_T2), "\n\n")
Number of out-of-control points: 22 
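
As a quick cross-check of the loop above (a sketch reusing the objects already computed: X, mu_hat, Sigma_hat, UCL_T2), base R's mahalanobis() evaluates the same quadratic form for every row in one vectorized call.

Code
# Vectorized equivalent of the T² loop: (x - mu)' S^{-1} (x - mu) for each row
T2_check <- mahalanobis(X, center = mu_hat, cov = Sigma_hat)
all.equal(as.numeric(T2_check), as.numeric(T2_values))   # expected TRUE (n = 1 per observation)
sum(T2_check > UCL_T2)                                    # same out-of-control count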
Code
# Create results table
t2_summary <- tibble(
  Metric = c("Number of Variables", "Phase I Samples", "UCL", 
             "Mean T²", "Max T²", "Out-of-Control Points"),
  Value = c(p, m, UCL_T2, mean(T2_values), max(T2_values), 
            sum(T2_values > UCL_T2))
)

t2_summary %>%
  gt() %>%
  tab_header(
    title = "Hotelling T² Chart Summary",
    subtitle = "Multivariate control chart statistics"
  ) %>%
  fmt_number(
    columns = Value,
    rows = 3:5,
    decimals = 3
  ) %>%
  fmt_number(
    columns = Value,
    rows = c(1, 2, 6),
    decimals = 0
  ) %>%
  tab_style(
    style = cell_fill(color = "lightcoral"),
    locations = cells_body(rows = 6)
  )
Hotelling T² Chart Summary
Multivariate control chart statistics
Metric Value
Number of Variables 3
Phase I Samples 80
UCL 16.032
Mean T² 312.980
Max T² 1,841.013
Out-of-Control Points 22
Code
# Create T² control chart
t2_data <- tibble(
  Sample = 1:length(T2_values),
  T2 = T2_values,
  UCL = UCL_T2,
  Phase = ifelse(Sample <= 80, "Phase I (Training)", "Phase II (Monitoring)"),
  Status = ifelse(T2 > UCL, "Out of Control", "In Control")
)

t2_plot <- plot_ly(t2_data) %>%
  add_trace(
    x = ~Sample,
    y = ~T2,
    type = "scatter",
    mode = "lines+markers",
    name = "T² Statistic",
    line = list(color = "blue"),
    marker = list(
      size = 6,
      color = ~ifelse(Status == "Out of Control", "red", "blue")
    ),
    text = ~paste0(
      "Sample: ", Sample, "<br>",
      "T²: ", round(T2, 3), "<br>",
      "Phase: ", Phase, "<br>",
      "Status: ", Status
    ),
    hovertemplate = "%{text}<extra></extra>"
  ) %>%
  add_trace(
    x = ~Sample,
    y = ~UCL,
    type = "scatter",
    mode = "lines",
    name = "UCL",
    line = list(color = "red", dash = "dash", width = 3),
    hovertemplate = "<b>UCL:</b> %{y:.3f}<extra></extra>"
  ) %>%
  add_segments(
    x = 80.5, xend = 80.5,
    y = 0, yend = max(t2_data$T2) * 1.1,
    line = list(color = "gray", dash = "dot", width = 2),
    showlegend = FALSE,
    hoverinfo = "skip"
  ) %>%
  layout(
    title = list(
      text = "Hotelling T² Control Chart<br><sub>Monitoring multiple correlated variables simultaneously</sub>",
      font = list(size = 16, weight = "bold")
    ),
    xaxis = list(title = "Sample Number"),
    yaxis = list(title = "T² Statistic"),
    hovermode = "closest",
    annotations = list(
      list(
        x = 80.5, y = max(t2_data$T2) * 1.05,
        text = "Phase I | Phase II",
        showarrow = FALSE,
        xanchor = "center",
        font = list(size = 10, color = "gray")
      )
    )
  )

t2_plot

Interpreting T² Results

Understanding the T² Statistic

What T² Measures:

  • Overall deviation from the target mean vector
  • Accounts for variable correlations
  • Larger values indicate the process is farther from target

When to Investigate:

  • Any point above the UCL indicates out-of-control
  • Patterns (runs, trends) suggest special causes
  • Compare Phase I vs Phase II performance

Decomposition Analysis

When T² signals out-of-control, we need to identify which variable(s) caused the signal. The simple decomposition used below splits T² into one term per variable; the terms sum to T², but when variables are strongly correlated individual terms can be negative or exceed 100% of T², so treat them as directional hints rather than exact shares.

Code
# For out-of-control points, decompose T²
ooc_indices <- which(T2_values > UCL_T2)

if (length(ooc_indices) > 0) {
  cat("\nDecomposition of Out-of-Control Points:\n\n")
  
  # Analyze first few out-of-control points
  analyze_points <- head(ooc_indices, 3)
  
  decomp_results <- list()
  
  for (idx in analyze_points) {
    x_i <- X[idx, ]
    deviation <- x_i - mu_hat
    
    # Calculate contribution of each variable
    S_inv <- solve(Sigma_hat)
    contributions <- numeric(p)
    
    for (j in 1:p) {
      e_j <- rep(0, p)
      e_j[j] <- 1
      contributions[j] <- n * t(deviation) %*% S_inv %*% e_j * deviation[j]
    }
    
    decomp_results[[length(decomp_results) + 1]] <- tibble(
      Sample = idx,
      Variable = quality_vars,
      Value = x_i,
      Target = mu_hat,
      Deviation = deviation,
      Contribution = contributions,
      Pct_of_T2 = (contributions / T2_values[idx]) * 100
    )
  }
  
  # Display decomposition table
  bind_rows(decomp_results) %>%
    gt() %>%
    tab_header(
      title = "T² Decomposition Analysis",
      subtitle = "Identifying which variables caused out-of-control signals"
    ) %>%
    fmt_number(
      columns = c(Value, Target, Deviation, Contribution),
      decimals = 3
    ) %>%
    fmt_number(
      columns = Pct_of_T2,
      decimals = 1
    ) %>%
    tab_style(
      style = cell_fill(color = "lightcoral"),
      locations = cells_body(
        columns = Pct_of_T2,
        rows = Pct_of_T2 > 40
      )
    ) %>%
    tab_style(
      style = list(
        cell_fill(color = "lightyellow"),
        cell_text(weight = "bold")
      ),
      locations = cells_body(columns = Sample)
    )
} else {
  cat("\nNo out-of-control points detected in T² chart.\n")
}

Decomposition of Out-of-Control Points:
T² Decomposition Analysis
Identifying which variables caused out-of-control signals
Sample Variable Value Target Deviation Contribution Pct_of_T2
3 temperature 149.950 150.029 −0.079 −12.126 −32.6
3 pressure 198.760 200.309 −1.549 48.773 131.2
3 flow_rate 49.950 50.019 −0.069 0.523 1.4
4 temperature 150.180 150.029 0.151 42.757 223.3
4 pressure 200.880 200.309 0.571 −0.450 −2.3
4 flow_rate 50.080 50.019 0.061 −23.158 −120.9
81 temperature 152.150 150.029 2.121 5,489.128 313.7
81 pressure 205.320 200.309 5.011 −569.587 −32.5
81 flow_rate 51.180 50.019 1.161 −3,169.505 −181.1

2. MEWMA (Multivariate EWMA) Chart

Theory

MEWMA (Multivariate Exponentially Weighted Moving Average) is more sensitive to small sustained shifts than T².

Formula: \[Z_i = \Lambda (X_i - \mu_0) + (I - \Lambda) Z_{i-1}, \qquad Z_0 = 0\]

Where: - \(Z_i\) = EWMA vector at time \(i\) - \(\Lambda\) = diagonal matrix of smoothing constants (typically 0.1-0.3) - \(X_i\) = observation vector at time \(i\) - \(\mu_0\) = target mean vector - \(I\) = identity matrix

MEWMA Statistic: \[T_i^2 = Z_i' \Sigma_Z^{-1} Z_i\]

Where \(\Sigma_Z\) is the covariance matrix of \(Z_i\); this report uses the asymptotic form \(\Sigma_Z = \frac{\lambda}{2-\lambda}\Sigma\), matching the code below.

Control Limit: Typically set using simulation to achieve desired false alarm rate.
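
A sketch of how such a simulation could look, assuming known in-control parameters and a target in-control ARL (recall ARL0 is roughly 1/α). The helper functions mewma_run_length and mewma_arl, the candidate limits, and the small n_reps are illustrative choices for this sketch, not a calibrated procedure.

Code
library(MASS)  # mvrnorm

# Run length of one simulated in-control MEWMA sequence for a candidate limit h
mewma_run_length <- function(h, lambda, mu0, Sigma, Sz_inv, max_run = 5000) {
  X_sim <- mvrnorm(max_run, mu0, Sigma)   # one simulated in-control sequence
  z <- rep(0, length(mu0))
  for (i in 1:max_run) {
    z <- lambda * (X_sim[i, ] - mu0) + (1 - lambda) * z
    if (drop(t(z) %*% Sz_inv %*% z) > h) return(i)
  }
  max_run  # censored run (no signal within max_run samples)
}

# Average run length over n_reps simulated sequences (estimated in-control ARL)
mewma_arl <- function(h, lambda = 0.2, mu0 = rep(0, 3), Sigma = diag(3), n_reps = 200) {
  Sz_inv <- solve((lambda / (2 - lambda)) * Sigma)
  mean(replicate(n_reps, mewma_run_length(h, lambda, mu0, Sigma, Sz_inv)))
}

# Pick the smallest h whose estimated in-control ARL meets the target (e.g., around 200)
set.seed(123)
sapply(c(10, 11, 12, 13), mewma_arl)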

Advantages:

  • More sensitive to small shifts
  • Smooths out random noise
  • Better for detecting gradual process drift

Code
# MEWMA parameters
lambda <- 0.2  # Smoothing constant (0 < lambda <= 1)
               # Smaller = more smoothing, more sensitive to small shifts

cat("MEWMA Chart Setup:\n")
MEWMA Chart Setup:
Code
cat("Smoothing constant (lambda):", lambda, "\n\n")
Smoothing constant (lambda): 0.2 
Code
# Initialize MEWMA vectors (Z_0 = 0, so the first update is Z_1 = lambda * (X_1 - mu))
Z <- matrix(0, nrow = nrow(X), ncol = p)
Z[1, ] <- lambda * (X[1, ] - mu_hat)  # first update from Z_0 = 0

# Calculate MEWMA vectors
# Z_i = λ(X_i - μ) + (1-λ)Z_{i-1}
for (i in 2:nrow(X)) {
  Z[i, ] <- lambda * (X[i, ] - mu_hat) + (1 - lambda) * Z[i-1, ]
}

# Calculate MEWMA T² statistic
# Covariance of Z_i: Σ_Z = [λ/(2-λ)]Σ
Sigma_Z <- (lambda / (2 - lambda)) * Sigma_hat

MEWMA_T2 <- numeric(nrow(X))
for (i in 1:nrow(X)) {
  MEWMA_T2[i] <- t(Z[i, ]) %*% solve(Sigma_Z) %*% Z[i, ]
}

# Control limit
# Using h parameter (typically 9-13 for good performance)
h <- 11.5  # Control limit parameter
UCL_MEWMA <- h

cat("MEWMA Chart Parameters:\n")
MEWMA Chart Parameters:
Code
cat("Smoothing constant (λ):", lambda, "\n")
Smoothing constant (λ): 0.2 
Code
cat("Control limit parameter (h):", h, "\n")
Control limit parameter (h): 11.5 
Code
cat("UCL:", round(UCL_MEWMA, 3), "\n")
UCL: 11.5 
Code
cat("Number of out-of-control points:", sum(MEWMA_T2 > UCL_MEWMA), "\n\n")
Number of out-of-control points: 23 
Code
# Create MEWMA results table
mewma_summary <- tibble(
  Metric = c("Smoothing Constant (λ)", "Control Limit (h)", 
             "Mean MEWMA T²", "Max MEWMA T²", "Out-of-Control Points"),
  Value = c(lambda, UCL_MEWMA, mean(MEWMA_T2), max(MEWMA_T2), 
            sum(MEWMA_T2 > UCL_MEWMA))
)

mewma_summary %>%
  gt() %>%
  tab_header(
    title = "MEWMA Chart Summary",
    subtitle = "Multivariate exponentially weighted moving average"
  ) %>%
  fmt_number(
    columns = Value,
    rows = 1:4,
    decimals = 3
  ) %>%
  fmt_number(
    columns = Value,
    rows = 5,
    decimals = 0
  ) %>%
  tab_style(
    style = cell_fill(color = "lightcoral"),
    locations = cells_body(rows = 5)
  )
MEWMA Chart Summary
Multivariate exponentially weighted moving average
Metric Value
Smoothing Constant (λ) 0.200
Control Limit (h) 11.500
Mean MEWMA T² 1,915.799
Max MEWMA T² 13,632.016
Out-of-Control Points 23
Code
# Create MEWMA control chart
mewma_data <- tibble(
  Sample = 1:length(MEWMA_T2),
  MEWMA = MEWMA_T2,
  UCL = UCL_MEWMA,
  Phase = ifelse(Sample <= 80, "Phase I", "Phase II"),
  Status = ifelse(MEWMA > UCL, "Out of Control", "In Control")
)

mewma_plot <- plot_ly(mewma_data) %>%
  add_trace(
    x = ~Sample,
    y = ~MEWMA,
    type = "scatter",
    mode = "lines+markers",
    name = "MEWMA T²",
    line = list(color = "darkgreen"),
    marker = list(
      size = 6,
      color = ~ifelse(Status == "Out of Control", "red", "darkgreen")
    ),
    text = ~paste0(
      "Sample: ", Sample, "<br>",
      "MEWMA T²: ", round(MEWMA, 3), "<br>",
      "Phase: ", Phase, "<br>",
      "Status: ", Status
    ),
    hovertemplate = "%{text}<extra></extra>"
  ) %>%
  add_trace(
    x = ~Sample,
    y = ~UCL,
    type = "scatter",
    mode = "lines",
    name = "UCL",
    line = list(color = "red", dash = "dash", width = 3),
    hovertemplate = "<b>UCL:</b> %{y:.3f}<extra></extra>"
  ) %>%
  add_segments(
    x = 80.5, xend = 80.5,
    y = 0, yend = max(mewma_data$MEWMA) * 1.1,
    line = list(color = "gray", dash = "dot", width = 2),
    showlegend = FALSE,
    hoverinfo = "skip"
  ) %>%
  layout(
    title = list(
      text = "MEWMA Control Chart<br><sub>More sensitive to small sustained shifts</sub>",
      font = list(size = 16, weight = "bold")
    ),
    xaxis = list(title = "Sample Number"),
    yaxis = list(title = "MEWMA T² Statistic"),
    hovermode = "closest"
  )

mewma_plot

Comparing T² vs MEWMA

Code
# Create comparison dataframe
comparison_data <- tibble(
  Sample = 1:length(T2_values),
  `T² Statistic` = T2_values / UCL_T2,  # Normalize by UCL
  `MEWMA Statistic` = MEWMA_T2 / UCL_MEWMA,
  Phase = ifelse(Sample <= 80, "Phase I", "Phase II")
)

# Create comparison plot
comp_plot <- plot_ly(comparison_data) %>%
  add_trace(
    x = ~Sample,
    y = ~`T² Statistic`,
    type = "scatter",
    mode = "lines",
    name = "T² (normalized)",
    line = list(color = "blue", width = 2),
    hovertemplate = "<b>T²:</b> %{y:.3f}<extra></extra>"
  ) %>%
  add_trace(
    x = ~Sample,
    y = ~`MEWMA Statistic`,
    type = "scatter",
    mode = "lines",
    name = "MEWMA (normalized)",
    line = list(color = "darkgreen", width = 2),
    hovertemplate = "<b>MEWMA:</b> %{y:.3f}<extra></extra>"
  ) %>%
  add_segments(
    x = 0, xend = max(comparison_data$Sample),
    y = 1, yend = 1,
    line = list(color = "red", dash = "dash", width = 2),
    showlegend = FALSE,
    hoverinfo = "skip"
  ) %>%
  add_segments(
    x = 80.5, xend = 80.5,
    y = 0, yend = max(c(comparison_data$`T² Statistic`, 
                       comparison_data$`MEWMA Statistic`)),
    line = list(color = "gray", dash = "dot", width = 2),
    showlegend = FALSE,
    hoverinfo = "skip"
  ) %>%
  layout(
    title = list(
      text = "T² vs MEWMA Comparison<br><sub>Both statistics normalized by their control limits</sub>",
      font = list(size = 16, weight = "bold")
    ),
    xaxis = list(title = "Sample Number"),
    yaxis = list(title = "Normalized Statistic (UCL = 1.0)"),
    hovermode = "x unified",
    annotations = list(
      list(
        x = max(comparison_data$Sample) / 2,
        y = 1.05,
        text = "Control Limit",
        showarrow = FALSE,
        font = list(size = 10, color = "red")
      )
    )
  )

comp_plot

Key Observations:

  • T² Chart: Better for detecting large, sudden shifts
  • MEWMA Chart: Better for detecting small, gradual shifts
  • MEWMA smooths the data, making trends more visible
  • MEWMA typically responds earlier than T² to the sustained shift introduced at sample 81

3. Principal Component Analysis (PCA) Based SPC

Theory

PCA-based SPC transforms correlated variables into uncorrelated principal components, then monitors these components.

Why Use PCA for SPC?

Advantages:

  • Reduces dimensionality (10+ variables → 2-3 components)
  • Eliminates redundancy from correlated variables
  • Two complementary statistics: T² and Q (SPE)
  • Better visualization of process behavior
  • More interpretable for high-dimensional data

The Two Statistics

1. Hotelling T² (in PC space)

  • Monitors systematic variation (in-model)
  • Captures variation in retained principal components
  • Detects shifts in the process mean structure

2. Q Statistic (Squared Prediction Error - SPE)

  • Monitors random variation (residual)
  • Captures variation NOT in retained components
  • Detects unusual patterns or anomalies
  • Also called DModX or residual distance

Perform PCA

Code
# Standardize data (important for PCA with different scales)
X_scaled <- scale(X_phase1)
X_scaled_all <- scale(X, center = colMeans(X_phase1), 
                      scale = apply(X_phase1, 2, sd))

# Perform PCA on Phase I data
pca_result <- prcomp(X_scaled, center = FALSE, scale. = FALSE)

# Variance explained
variance_explained <- pca_result$sdev^2
prop_variance <- variance_explained / sum(variance_explained)
cum_variance <- cumsum(prop_variance)

cat("\nPCA Results:\n")

PCA Results:
Code
cat("Principal Component Analysis on", p, "variables\n\n")
Principal Component Analysis on 3 variables
Code
# Variance table
pca_variance <- tibble(
  PC = paste0("PC", 1:p),
  `Std Dev` = pca_result$sdev,
  Variance = variance_explained,
  `Prop. Variance` = prop_variance * 100,
  `Cumulative %` = cum_variance * 100
)

pca_variance %>%
  gt() %>%
  tab_header(
    title = "Principal Components Summary",
    subtitle = "Variance explained by each component"
  ) %>%
  fmt_number(
    columns = c(`Std Dev`, Variance),
    decimals = 3
  ) %>%
  fmt_number(
    columns = c(`Prop. Variance`, `Cumulative %`),
    decimals = 2
  )
Principal Components Summary
Variance explained by each component
PC Std Dev Variance Prop. Variance Cumulative %
PC1 1.724 2.971 99.03 99.03
PC2 0.161 0.026 0.87 99.90
PC3 0.055 0.003 0.10 100.00
Code
# Decide on number of components (retain 95% variance)
k <- min(which(cum_variance >= 0.95))
cat("\nRetaining", k, "components (capturing", 
    round(cum_variance[k] * 100, 1), "% of variance)\n\n")

Retaining 1 components (capturing 99 % of variance)
Code
# Scree plot
scree_data <- tibble(
  PC = 1:p,
  Variance = variance_explained,
  Type = "Eigenvalue"
)

scree_plot <- plot_ly(scree_data) %>%
  add_trace(
    x = ~PC,
    y = ~Variance,
    type = "scatter",
    mode = "lines+markers",
    name = "Eigenvalue",
    line = list(color = "blue", width = 2),
    marker = list(size = 10, color = "blue"),
    hovertemplate = "<b>PC%{x}</b><br>Variance: %{y:.3f}<extra></extra>"
  ) %>%
  add_segments(
    x = k + 0.5, xend = k + 0.5,
    y = 0, yend = max(variance_explained),
    line = list(color = "red", dash = "dash", width = 2),
    showlegend = FALSE,
    hoverinfo = "skip"
  ) %>%
  layout(
    title = list(
      text = "Scree Plot<br><sub>Selecting number of principal components</sub>",
      font = list(size = 16, weight = "bold")
    ),
    xaxis = list(title = "Principal Component", dtick = 1),
    yaxis = list(title = "Variance (Eigenvalue)"),
    annotations = list(
      list(
        x = k + 0.5,
        y = max(variance_explained) * 0.9,
        text = paste0("Retain ", k, " PCs"),
        showarrow = FALSE,
        xanchor = "left",
        font = list(color = "red")
      )
    )
  )

scree_plot
Code
# Loadings plot
if (k >= 1) {
  # Extract loadings and ensure proper column names
  loadings_matrix <- pca_result$rotation[, 1:min(2, k), drop = FALSE]
  loadings_data <- as.data.frame(loadings_matrix)
  
  # Ensure column names are PC1, PC2
  if (ncol(loadings_data) == 1) {
    colnames(loadings_data) <- "PC1"
    loadings_data$PC2 <- 0
  } else {
    colnames(loadings_data) <- c("PC1", "PC2")
  }
  
  loadings_data$Variable <- quality_vars
  
  loadings_plot <- plot_ly(loadings_data) %>%
    add_trace(
      x = ~PC1,
      y = ~PC2,
      type = "scatter",
      mode = "markers+text",
      text = ~Variable,
      textposition = "top center",
      marker = list(size = 12, color = "darkblue"),
      hovertemplate = "<b>%{text}</b><br>PC1: %{x:.3f}<br>PC2: %{y:.3f}<extra></extra>"
    ) %>%
    add_segments(
      x = 0, xend = ~PC1,
      y = 0, yend = ~PC2,
      line = list(color = "gray", width = 1),
      showlegend = FALSE,
      hoverinfo = "skip"
    ) %>%
    layout(
      title = list(
        text = "PCA Loadings Plot<br><sub>How variables contribute to principal components</sub>",
        font = list(size = 16, weight = "bold")
      ),
      xaxis = list(title = paste0("PC1 (", round(prop_variance[1]*100, 1), "%)"),
                   zeroline = TRUE),
      yaxis = list(title = if(k >= 2) 
                     paste0("PC2 (", round(prop_variance[2]*100, 1), "%)") 
                     else "PC2",
                   zeroline = TRUE)
    )
  
  loadings_plot
}

PCA-based T² Control Chart

Code
# Project data onto principal components
scores_phase1 <- X_scaled %*% pca_result$rotation[, 1:k, drop = FALSE]
scores_all <- X_scaled_all %*% pca_result$rotation[, 1:k, drop = FALSE]

# Calculate T² in PC space
# Lambda is diagonal matrix of eigenvalues
Lambda <- diag(variance_explained[1:k], nrow = k, ncol = k)

T2_pca <- numeric(nrow(scores_all))
for (i in 1:nrow(scores_all)) {
  t_i <- as.numeric(scores_all[i, ])  # Ensure it's a numeric vector
  # T² = t'Λ⁻¹t
  if (k == 1) {
    # Special case for single component
    T2_pca[i] <- (t_i^2) / variance_explained[1]
  } else {
    T2_pca[i] <- t(t_i) %*% solve(Lambda) %*% t_i
  }
}

# Control limit for PCA T²
m <- nrow(X_phase1)
alpha <- 0.0027
F_crit_pca <- qf(1 - alpha, k, m - k)
UCL_T2_pca <- (k * (m^2 - 1)) / (m * (m - k)) * F_crit_pca

cat("\nPCA T² Chart Parameters:\n")

PCA T² Chart Parameters:
Code
cat("Number of PCs retained (k):", k, "\n")
Number of PCs retained (k): 1 
Code
cat("Control limit (UCL):", round(UCL_T2_pca, 3), "\n")
Control limit (UCL): 9.716 
Code
cat("Out-of-control points:", sum(T2_pca > UCL_T2_pca), "\n\n")
Out-of-control points: 20 
Code
# PCA T² chart
pca_t2_data <- tibble(
  Sample = 1:length(T2_pca),
  T2_PCA = T2_pca,
  UCL = UCL_T2_pca,
  Phase = ifelse(Sample <= 80, "Phase I", "Phase II"),
  Status = ifelse(T2_PCA > UCL, "Out of Control", "In Control")
)

pca_t2_plot <- plot_ly(pca_t2_data) %>%
  add_trace(
    x = ~Sample,
    y = ~T2_PCA,
    type = "scatter",
    mode = "lines+markers",
    name = "PCA T²",
    line = list(color = "purple"),
    marker = list(
      size = 6,
      color = ~ifelse(Status == "Out of Control", "red", "purple")
    ),
    text = ~paste0(
      "Sample: ", Sample, "<br>",
      "T² (PCA): ", round(T2_PCA, 3), "<br>",
      "Status: ", Status
    ),
    hovertemplate = "%{text}<extra></extra>"
  ) %>%
  add_trace(
    x = ~Sample,
    y = ~UCL,
    type = "scatter",
    mode = "lines",
    name = "UCL",
    line = list(color = "red", dash = "dash", width = 3)
  ) %>%
  add_segments(
    x = 80.5, xend = 80.5,
    y = 0, yend = max(pca_t2_data$T2_PCA) * 1.1,
    line = list(color = "gray", dash = "dot", width = 2),
    showlegend = FALSE
  ) %>%
  layout(
    title = list(
      text = "PCA-based T² Control Chart<br><sub>Monitoring in principal component space</sub>",
      font = list(size = 16, weight = "bold")
    ),
    xaxis = list(title = "Sample Number"),
    yaxis = list(title = "T² Statistic (PCA)")
  )

pca_t2_plot

Q Statistic (SPE) Control Chart

Code
# Calculate Q statistic (Squared Prediction Error)
# Q = ||x - x_hat||² where x_hat is reconstruction from k PCs
X_reconstructed <- scores_all %*% t(pca_result$rotation[, 1:k, drop = FALSE])
residuals <- X_scaled_all - X_reconstructed

Q_stat <- rowSums(residuals^2)

# Control limit for Q using chi-squared approximation
if (k < p) {
  # Calculate parameters for chi-squared approximation
  theta1 <- sum(variance_explained[(k+1):p])
  theta2 <- sum(variance_explained[(k+1):p]^2)
  theta3 <- sum(variance_explained[(k+1):p]^3)
  
  h0 <- 1 - (2 * theta1 * theta3) / (3 * theta2^2)
  ca <- qnorm(1 - alpha)
  
  UCL_Q <- theta1 * (1 + (ca * sqrt(2 * theta2) * h0) / theta1 + 
                       (theta2 * h0 * (h0 - 1)) / theta1^2)^(1/h0)
} else {
  # If all PCs retained, Q should be near zero
  UCL_Q <- max(Q_stat) * 1.5
  cat("Note: All PCs retained, Q statistic will be small\n")
}


cat("\nQ Statistic (SPE) Chart Parameters:\n")

Q Statistic (SPE) Chart Parameters:
Code
cat("Monitoring residual variation\n")
Monitoring residual variation
Code
cat("Control limit (UCL):", round(UCL_Q, 3), "\n")
Control limit (UCL): 0.254 
Code
cat("Out-of-control points:", sum(Q_stat > UCL_Q), "\n\n")
Out-of-control points: 21 
Code
# Q statistic chart
q_data <- tibble(
  Sample = 1:length(Q_stat),
  Q = Q_stat,
  UCL = UCL_Q,
  Phase = ifelse(Sample <= 80, "Phase I", "Phase II"),
  Status = ifelse(Q > UCL, "Out of Control", "In Control")
)

q_plot <- plot_ly(q_data) %>%
  add_trace(
    x = ~Sample,
    y = ~Q,
    type = "scatter",
    mode = "lines+markers",
    name = "Q Statistic",
    line = list(color = "orange"),
    marker = list(
      size = 6,
      color = ~ifelse(Status == "Out of Control", "red", "orange")
    ),
    text = ~paste0(
      "Sample: ", Sample, "<br>",
      "Q: ", round(Q, 3), "<br>",
      "Status: ", Status
    ),
    hovertemplate = "%{text}<extra></extra>"
  ) %>%
  add_trace(
    x = ~Sample,
    y = ~UCL,
    type = "scatter",
    mode = "lines",
    name = "UCL",
    line = list(color = "red", dash = "dash", width = 3)
  ) %>%
  add_segments(
    x = 80.5, xend = 80.5,
    y = 0, yend = max(q_data$Q) * 1.1,
    line = list(color = "gray", dash = "dot", width = 2),
    showlegend = FALSE
  ) %>%
  layout(
    title = list(
      text = "Q Statistic (SPE) Control Chart<br><sub>Detecting unusual patterns and anomalies</sub>",
      font = list(size = 16, weight = "bold")
    ),
    xaxis = list(title = "Sample Number"),
    yaxis = list(title = "Q Statistic (Squared Prediction Error)")
  )

q_plot

Interpreting PCA-based SPC

Understanding T² vs Q

| Statistic | Monitors | Interpretation When Out-of-Control |
|---|---|---|
| T² (in PC space) | Systematic variation | Process mean has shifted in a known direction |
| Q (SPE) | Residual variation | Unusual pattern or new type of variation |

Decision Matrix

| T² Status | Q Status | Interpretation | Action |
|---|---|---|---|
| In Control | In Control | Process normal | Continue monitoring |
| Out of Control | In Control | Common cause shift | Adjust process mean |
| In Control | Out of Control | Special cause/anomaly | Investigate unusual event |
| Out of Control | Out of Control | Major disturbance | Stop and investigate |
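
A short sketch applying this decision matrix to the statistics computed above (it reuses T2_pca, Q_stat, UCL_T2_pca, and UCL_Q from the preceding chunks):

Code
# Classify every sample into one of the four cells of the decision matrix
signal_class <- dplyr::case_when(
  T2_pca > UCL_T2_pca & Q_stat > UCL_Q ~ "Major disturbance",
  T2_pca > UCL_T2_pca                  ~ "Mean shift (T² only)",
  Q_stat > UCL_Q                       ~ "Special cause / anomaly (Q only)",
  TRUE                                 ~ "In control"
)

table(Phase  = ifelse(seq_along(signal_class) <= 80, "Phase I", "Phase II"),
      Signal = signal_class)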

Score Plot (First Two PCs)

Code
if (k >= 2) {
  # Create score plot
  score_data <- tibble(
    Sample = 1:nrow(scores_all),
    PC1 = scores_all[, 1],
    PC2 = scores_all[, 2],
    Phase = ifelse(Sample <= 80, "Phase I", "Phase II"),
    Overall_Status = case_when(
      T2_pca > UCL_T2_pca & Q_stat > UCL_Q ~ "Both OOC",
      T2_pca > UCL_T2_pca ~ "T² Only",
      Q_stat > UCL_Q ~ "Q Only",
      TRUE ~ "In Control"
    )
  )
  
  score_plot <- plot_ly(score_data) %>%
    add_trace(
      x = ~PC1,
      y = ~PC2,
      type = "scatter",
      mode = "markers",
      color = ~Overall_Status,
      colors = c("In Control" = "blue", 
                 "T² Only" = "orange", 
                 "Q Only" = "purple",
                 "Both OOC" = "red"),
      marker = list(size = 8),
      text = ~paste0(
        "Sample: ", Sample, "<br>",
        "PC1: ", round(PC1, 3), "<br>",
        "PC2: ", round(PC2, 3), "<br>",
        "Status: ", Overall_Status
      ),
      hovertemplate = "%{text}<extra></extra>"
    ) %>%
    layout(
      title = list(
        text = "PCA Score Plot<br><sub>Process behavior in principal component space</sub>",
        font = list(size = 16, weight = "bold")
      ),
      xaxis = list(title = paste0("PC1 (", round(prop_variance[1]*100, 1), "%)")),
      yaxis = list(title = paste0("PC2 (", round(prop_variance[2]*100, 1), "%)")),
      showlegend = TRUE
    )
  
  score_plot
} else {
  cat("\n[INFO] Score plot requires at least 2 principal components.\n")
  cat("Currently retaining only", k, "component(s).\n")
  cat("Use histogram or time series plot for 1D visualization.\n")
}

[INFO] Score plot requires at least 2 principal components.
Currently retaining only 1 component(s).
Use histogram or time series plot for 1D visualization.

4. Cross-Validation in PCA-Based SPC

Why Cross-Validation Matters

Traditional SPC assumes we know the “true” number of components to retain. But in practice:

  • Too few PCs → miss important variation
  • Too many PCs → include noise, reduce sensitivity
  • The choice affects both the T² and Q statistics

Cross-validation helps:

  • Select the optimal number of components objectively
  • Avoid overfitting to Phase I data
  • Improve generalization to Phase II monitoring
  • Validate model performance

Cross-Validation Approaches for SPC

PRESS Statistic Explained:

PRESS = Predicted Residual Error Sum of Squares

\[\mathrm{PRESS} = \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^2\]

Where: - \(x_i\) = actual observation vector - \(\hat{x}_i\) = predicted (reconstructed) observation using a model trained WITHOUT \(x_i\)

Why PRESS matters:

  • Measures how well the model predicts NEW data
  • Lower PRESS = better generalization
  • Avoids overfitting (the model is evaluated on unseen data)
  • Standard metric for cross-validation

In the SPC context:

  • Tests whether the PCA model works on future Phase II data
  • Ensures control limits will be appropriate
  • Validates the number of components selected

One caveat: with simple row-wise PRESS (projecting a held-out row onto the training loadings and reconstructing it), the error shrinks to exactly zero when all \(p\) components are retained, so judge the optimum by where the error curve flattens rather than by the absolute minimum alone.

Code
# Perform cross-validation to select optimal number of PCs
# Using PRESS (Predicted Residual Error Sum of Squares)

cat("\nCross-Validation for PCA Model Selection:\n\n")

Cross-Validation for PCA Model Selection:
Code
# K-fold cross-validation
K <- 5  # Number of folds
n_phase1 <- nrow(X_phase1)
fold_size <- floor(n_phase1 / K)

# Try different numbers of components
max_components <- min(p, n_phase1 - fold_size - 1)
cv_results <- tibble(
  n_components = 1:max_components,
  CV_Error = numeric(max_components),
  PRESS = numeric(max_components)
)

for (n_comp in 1:max_components) {
  press_total <- 0
  
  for (fold in 1:K) {
    # Create training and validation sets
    # (use 'fold' as the loop index so that k, the number of retained PCs, is not overwritten)
    val_indices <- ((fold - 1) * fold_size + 1):min(fold * fold_size, n_phase1)
    train_indices <- setdiff(1:n_phase1, val_indices)
    
    # Train PCA on training set
    X_train <- X_scaled[train_indices, ]
    X_val <- X_scaled[val_indices, ]
    
    pca_cv <- prcomp(X_train, center = FALSE, scale. = FALSE)
    
    # Project validation data
    scores_val <- X_val %*% pca_cv$rotation[, 1:n_comp, drop = FALSE]
    
    # Reconstruct validation data
    X_recon <- scores_val %*% t(pca_cv$rotation[, 1:n_comp, drop = FALSE])
    
    # Calculate prediction error
    press_total <- press_total + sum((X_val - X_recon)^2)
  }
  
  cv_results$PRESS[n_comp] <- press_total
  cv_results$CV_Error[n_comp] <- press_total / n_phase1
}

# Display CV results with proper NA handling
cv_results %>%
  mutate(
    `% Improvement` = ifelse(
      n_components == 1,
      NA_real_,  # First component has no previous to compare
      -diff(c(0, CV_Error)) / c(0, CV_Error[-n()]) * 100
    ),
    Recommendation = case_when(
      n_components == which.min(CV_Error) ~ "[OPTIMAL] Minimum CV Error",
      n_components < which.min(CV_Error) ~ "Underfit (too few PCs)",
      TRUE ~ "May overfit (too many PCs)"
    )
  ) %>%
  gt() %>%
  tab_header(
    title = "Cross-Validation Results",
    subtitle = "PRESS-based selection of optimal principal components"
  ) %>%
  fmt_number(
    columns = c(CV_Error, PRESS),
    decimals = 4
  ) %>%
  fmt_number(
    columns = `% Improvement`,
    decimals = 2
  ) %>%
  sub_missing(
    columns = `% Improvement`,
    missing_text = "—"
  ) %>%
  tab_style(
    style = cell_fill(color = "lightgreen"),
    locations = cells_body(
      rows = n_components == which.min(CV_Error)
    )
  ) %>%
  tab_footnote(
    footnote = "% Improvement = reduction in CV error compared to using one fewer component",
    locations = cells_column_labels(columns = `% Improvement`)
  ) %>%
  tab_footnote(
    footnote = "PRESS = Predicted Residual Error Sum of Squares (lower is better)",
    locations = cells_column_labels(columns = PRESS)
  )
Cross-Validation Results
PRESS-based selection of optimal principal components
n_components CV_Error PRESS1 % Improvement2 Recommendation
1 0.0293 2.3423 Underfit (too few PCs)
2 0.0036 0.2843 87.86 Underfit (too few PCs)
3 0.0000 0.0000 100.00 [OPTIMAL] Minimum CV Error
1 PRESS = Predicted Residual Error Sum of Squares (lower is better)
2 % Improvement = reduction in CV error compared to using one fewer component
Code
# Optimal number of components
k_optimal <- which.min(cv_results$CV_Error)
cat("\nRecommended number of components (CV):", k_optimal, "\n")

Recommended number of components (CV): 3 
Code
cat("Original selection (95% variance):", k, "\n")
Original selection (95% variance): 5 
Code
if (k != k_optimal) {
  cat("Difference:", abs(k_optimal - k), "component(s)\n")
  improvement <- (cv_results$CV_Error[k] - cv_results$CV_Error[k_optimal]) / 
                 cv_results$CV_Error[k] * 100
  cat("CV error improvement:", round(improvement, 2), "%\n")
} else {
  cat("Both methods AGREE on", k, "components [OK]\n")
}
Difference: 2 component(s)
CV error improvement: NA %
Code
cat("\n")
Code
# Plot CV error vs number of components
cv_plot <- plot_ly(cv_results) %>%
  add_trace(
    x = ~n_components,
    y = ~CV_Error,
    type = "scatter",
    mode = "lines+markers",
    name = "CV Error",
    line = list(color = "blue", width = 2),
    marker = list(size = 8, color = "blue"),
    hovertemplate = "<b>%{x} Components</b><br>CV Error: %{y:.4f}<extra></extra>"
  ) %>%
  add_trace(
    x = k_optimal,
    y = cv_results$CV_Error[k_optimal],
    type = "scatter",
    mode = "markers",
    name = "Optimal",
    marker = list(size = 15, color = "red", symbol = "star"),
    hovertemplate = "<b>Optimal: %{x} PCs</b><br>Error: %{y:.4f}<extra></extra>"
  ) %>%
  layout(
    title = list(
      text = "Cross-Validation: Selecting Number of Components<br><sub>Lower CV error indicates better model</sub>",
      font = list(size = 16, weight = "bold")
    ),
    xaxis = list(title = "Number of Principal Components", dtick = 1),
    yaxis = list(title = "Cross-Validation Error (PRESS/n)")
  )

cv_plot

Interpreting CV Results

What the CV Error Tells You:

| Pattern | Interpretation | Action |
|---|---|---|
| Error decreases then plateaus | Optimal found | Use the number of PCs at the minimum |
| Error keeps decreasing | May need more PCs | Check if reasonable |
| Sharp drop then flat | Clear optimal point | High confidence |
| Noisy curve | Unstable model | Need more Phase I data |

Comparison: Variance vs Cross-Validation

Code
# Compare variance-based vs CV-based selection
# Add safety checks to prevent NAs

# Ensure indices are valid
k_safe <- min(k, length(cum_variance))
k_optimal_safe <- min(k_optimal, length(cum_variance), nrow(cv_results))

comparison <- tibble(
  Method = c("Variance (95%)", "Cross-Validation", "Difference"),
  `# Components` = c(k, k_optimal, NA_integer_),
  `Variance Explained %` = c(
    cum_variance[k_safe] * 100,
    cum_variance[k_optimal_safe] * 100,
    abs(cum_variance[k_safe] - cum_variance[k_optimal_safe]) * 100
  ),
  `CV Error` = c(
    cv_results$CV_Error[k_safe],
    cv_results$CV_Error[k_optimal_safe],
    abs(cv_results$CV_Error[k_safe] - cv_results$CV_Error[k_optimal_safe])
  ),
  `Better By` = c(
    "—",
    "—",
    ifelse(k == k_optimal, 
           "Same selection", 
           paste0(abs(k - k_optimal), " component(s)"))
  ),
  Recommendation = c(
    "Traditional approach",
    "Data-driven optimum",
    ifelse(k == k_optimal, "[OK] METHODS AGREE", "[INFO] METHODS DIFFER")
  )
)

comparison %>%
  gt() %>%
  tab_header(
    title = "Model Selection Comparison",
    subtitle = "Variance-based vs Cross-validation"
  ) %>%
  fmt_number(
    columns = `Variance Explained %`,
    decimals = 2
  ) %>%
  fmt_number(
    columns = `CV Error`,
    decimals = 4
  ) %>%
  sub_missing(
    columns = everything(),
    missing_text = "—"
  ) %>%
  tab_style(
    style = cell_fill(color = "lightgreen"),
    locations = cells_body(rows = Method == "Cross-Validation")
  ) %>%
  tab_style(
    style = cell_fill(color = "lightyellow"),
    locations = cells_body(rows = Method == "Difference")
  ) %>%
  tab_footnote(
    footnote = "Difference row shows absolute differences between the two methods",
    locations = cells_body(rows = Method == "Difference", columns = Method)
  ) %>%
  tab_footnote(
    footnote = "— indicates not applicable or no meaningful comparison",
    locations = cells_body(rows = Method == "Difference", columns = `# Components`)
  )
Model Selection Comparison
Variance-based vs Cross-validation
Method # Components Variance Explained % CV Error Better By Recommendation
Variance (95%) 5 100.00 0.0000 Traditional approach
Cross-Validation 3 100.00 0.0000 Data-driven optimum
Difference1 2 — 0.00 0.0000 2 component(s) [INFO] METHODS DIFFER
1 Difference row shows absolute differences between the two methods
2 — indicates not applicable or no meaningful comparison
Code
cat("\n")
Code
if (k != k_optimal) {
  cat("[INFO] Cross-validation suggests a different number of components!\n")
  cat("Variance method:", k, "components (", round(cum_variance[k_safe]*100, 1), "% variance)\n")
  cat("CV method:", k_optimal, "components (", round(cum_variance[k_optimal_safe]*100, 1), "% variance)\n\n")
  
  if (k_optimal < k) {
    cat("Interpretation: CV suggests FEWER components\n")
    cat("Benefit: May improve sensitivity by excluding noise\n")
    cat("Trade-off: Slightly less variance captured\n")
  } else {
    cat("Interpretation: CV suggests MORE components\n")
    cat("Benefit: Captures additional important variation\n")
    cat("Trade-off: Slightly higher complexity\n")
  }
} else {
  cat("[OK] Both methods agree on", k, "components.\n")
  cat("This indicates robust model selection.\n")
}
[INFO] Cross-validation suggests a different number of components!
Variance method: 5 components ( 100 % variance)
CV method: 3 components ( 100 % variance)

Interpretation: CV suggests FEWER components
Benefit: May improve sensitivity by excluding noise
Trade-off: Slightly less variance captured
Code
cat("\n")

When to Use Cross-Validation in SPC

Use CV when:

  • ✅ High-dimensional data (10+ variables)
  • ✅ Limited Phase I samples
  • ✅ Uncertain about the number of components
  • ✅ Building predictive models (soft sensors)
  • ✅ Want to avoid overfitting

Variance threshold (95%) is fine when:

  • ✅ Clear eigenvalue drop-off (scree plot)
  • ✅ Sufficient Phase I data (50+ samples)
  • ✅ Well-understood process
  • ✅ Traditional SPC context

Advanced: Leave-One-Out Cross-Validation (LOOCV)

Code
# LOOCV for more precise estimate (computationally intensive)
if (n_phase1 <= 50) {  # Only for small datasets
  
  cat("\nPerforming Leave-One-Out Cross-Validation:\n")
  
  loocv_results <- tibble(
    n_components = 1:max_components,
    LOOCV_Error = numeric(max_components)
  )
  
  for (n_comp in 1:max_components) {
    loocv_error <- 0
    
    # Leave one out
    for (i in 1:n_phase1) {
      X_train <- X_scaled[-i, ]
      X_test <- X_scaled[i, , drop = FALSE]
      
      pca_loo <- prcomp(X_train, center = FALSE, scale. = FALSE)
      
      # Project test sample
      score_test <- X_test %*% pca_loo$rotation[, 1:n_comp, drop = FALSE]
      X_recon <- score_test %*% t(pca_loo$rotation[, 1:n_comp, drop = FALSE])
      
      loocv_error <- loocv_error + sum((X_test - X_recon)^2)
    }
    
    loocv_results$LOOCV_Error[n_comp] <- loocv_error / n_phase1
  }
  
  k_loocv <- which.min(loocv_results$LOOCV_Error)
  
  cat("LOOCV optimal components:", k_loocv, "\n")
  cat("K-fold CV optimal:", k_optimal, "\n")
  cat("Variance-based:", k, "\n\n")
  
  # Compare all three methods
  all_methods <- plot_ly() %>%
    add_trace(
      data = cv_results,
      x = ~n_components,
      y = ~CV_Error,
      type = "scatter",
      mode = "lines+markers",
      name = "K-Fold CV",
      line = list(color = "blue")
    ) %>%
    add_trace(
      data = loocv_results,
      x = ~n_components,
      y = ~LOOCV_Error,
      type = "scatter",
      mode = "lines+markers",
      name = "LOOCV",
      line = list(color = "red", dash = "dash")
    ) %>%
    add_segments(
      x = k, xend = k,
      y = min(c(cv_results$CV_Error, loocv_results$LOOCV_Error)),
      yend = max(c(cv_results$CV_Error, loocv_results$LOOCV_Error)),
      line = list(color = "green", dash = "dot"),
      name = "95% Variance"
    ) %>%
    layout(
      title = list(
        text = "Comparing Model Selection Methods",
        font = list(size = 16, weight = "bold")
      ),
      xaxis = list(title = "Number of Components", dtick = 1),
      yaxis = list(title = "Prediction Error"),
      showlegend = TRUE
    )
  
  all_methods
} else {
  cat("\n[SKIPPED] LOOCV not performed (dataset too large)\n")
  cat("K-fold CV is more efficient for large Phase I datasets.\n")
}

[SKIPPED] LOOCV not performed (dataset too large)
K-fold CV is more efficient for large Phase I datasets.

Practical Recommendations

For Your SPC Implementation:

  1. Start with variance threshold (95% or scree plot)
  2. Validate with CV if:
    • Unexpected false alarms
    • Poor Phase II performance
    • Limited Phase I data
  3. Use CV optimal if it differs significantly
  4. Monitor both T² and Q regardless of selection

CV Best Practices:

  • Use K=5 or K=10 for K-fold
  • LOOCV for small datasets (<50 samples)
  • Stratify folds if the data has structure (see the sketch below)
  • Revalidate periodically with new data
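
A minimal sketch of the stratification point above, assuming the first 80 rows of mv_data are the Phase I samples (as in this report) and using the shift column as the stratification variable; phase1 and fold_id are names introduced only for this sketch.

Code
# Assign K folds stratified by shift so each fold has a similar Day/Night mix
set.seed(2024)
K <- 5
phase1  <- mv_data[1:80, ]
fold_id <- integer(nrow(phase1))

for (g in split(seq_len(nrow(phase1)), phase1$shift)) {
  fold_id[g] <- sample(rep_len(1:K, length(g)))   # permute fold labels within each shift group
}

table(Fold = fold_id, Shift = phase1$shift)   # check the balance across folds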

Literature Support

Cross-validation in SPC is supported by:

  • Nomikos & MacGregor (1995): Monitoring batch processes with multiway PCA
  • Kourti (2005): Process analytical technology and quality by design
  • Qin (2012): Survey of industrial applications of data-driven methods

Modern Practice: Many pharmaceutical and chemical industries now use CV-based model selection for:

  • Batch process monitoring
  • Soft sensor development
  • Fault detection systems
  • Quality prediction models

5. Variable Contribution Analysis

Identifying Root Causes

When a multivariate chart signals out-of-control, we need to identify which variable(s) are responsible.

Code
# For recent out-of-control points, show variable contributions
recent_ooc <- tail(which(MEWMA_T2 > UCL_MEWMA), 5)

if (length(recent_ooc) > 0) {
  # Calculate standardized deviations
  contrib_data <- list()
  
  for (idx in recent_ooc) {
    x_i <- X[idx, ]
    std_dev <- (x_i - mu_hat) / sqrt(diag(Sigma_hat))
    
    contrib_data[[length(contrib_data) + 1]] <- tibble(
      Sample = idx,
      Variable = quality_vars,
      Value = x_i,
      Target = mu_hat,
      `Std Deviation` = std_dev,
      `Abs Std Dev` = abs(std_dev),
      Status = ifelse(abs(std_dev) > 3, "[X] Out of Control",
                     ifelse(abs(std_dev) > 2, "[WARNING] Near Limit", "[OK] In Control"))
    )
  }
  
  contrib_table <- bind_rows(contrib_data)
  
  contrib_table %>%
    gt() %>%
    tab_header(
      title = "Variable Contribution Analysis",
      subtitle = "Which variables are causing out-of-control signals?"
    ) %>%
    fmt_number(
      columns = c(Value, Target, `Std Deviation`),
      decimals = 3
    ) %>%
    fmt_number(
      columns = `Abs Std Dev`,
      decimals = 2
    ) %>%
    tab_style(
      style = cell_fill(color = "lightcoral"),
      locations = cells_body(
        columns = `Abs Std Dev`,
        rows = `Abs Std Dev` > 3
      )
    ) %>%
    tab_style(
      style = cell_fill(color = "lightyellow"),
      locations = cells_body(
        columns = `Abs Std Dev`,
        rows = `Abs Std Dev` > 2 & `Abs Std Dev` <= 3
      )
    ) %>%
    tab_style(
      style = cell_fill(color = "lightgreen"),
      locations = cells_body(
        columns = `Abs Std Dev`,
        rows = `Abs Std Dev` <= 2
      )
    )
  
  # Create contribution bar chart for most recent OOC point
  latest_ooc <- tail(recent_ooc, 1)
  latest_contrib <- contrib_table %>% filter(Sample == latest_ooc)
  
  contrib_plot <- plot_ly(
    data = latest_contrib,
    x = ~Variable,
    y = ~`Std Deviation`,
    type = "bar",
    marker = list(
      color = ~ifelse(abs(`Std Deviation`) > 3, "red",
                     ifelse(abs(`Std Deviation`) > 2, "orange", "green"))
    ),
    text = ~round(`Std Deviation`, 2),
    textposition = "outside",
    hovertemplate = paste(
      "<b>%{x}</b><br>",
      "Std Dev: %{y:.3f}<br>",
      "<extra></extra>"
    )
  ) %>%
    add_segments(
      x = -0.5, xend = length(quality_vars) - 0.5,
      y = 3, yend = 3,
      line = list(color = "red", dash = "dash", width = 2),
      showlegend = FALSE,
      name = "+3σ"
    ) %>%
    add_segments(
      x = -0.5, xend = length(quality_vars) - 0.5,
      y = -3, yend = -3,
      line = list(color = "red", dash = "dash", width = 2),
      showlegend = FALSE,
      name = "-3σ"
    ) %>%
    layout(
      title = list(
        text = paste0("Variable Contributions - Sample ", latest_ooc, 
                     "<br><sub>Standardized deviations from target</sub>"),
        font = list(size = 16, weight = "bold")
      ),
      xaxis = list(title = "Variable"),
      yaxis = list(title = "Standardized Deviation (σ)"),
      showlegend = FALSE
    )
  
  contrib_plot
} else {
  cat("\nNo recent out-of-control points for contribution analysis.\n")
}

6. Process Monitoring Dashboard

Code
# Create summary metrics
dashboard_metrics <- tibble(
  Metric = c(
    "Total Samples",
    "Phase I Samples",
    "Phase II Samples",
    "Variables Monitored",
    "T² Out-of-Control",
    "MEWMA Out-of-Control",
    "Current Process Status"
  ),
  Value = c(
    nrow(X),
    80,
    nrow(X) - 80,
    p,
    sum(T2_values > UCL_T2),
    sum(MEWMA_T2 > UCL_MEWMA),
    ifelse(tail(MEWMA_T2, 1) > UCL_MEWMA, "[X] OUT OF CONTROL", "[OK] IN CONTROL")
  )
)

dashboard_metrics %>%
  gt() %>%
  tab_header(
    title = "Multivariate SPC Dashboard",
    subtitle = paste("Analysis Date:", Sys.Date())
  ) %>%
  tab_style(
    style = cell_fill(color = "lightcoral"),
    locations = cells_body(
      columns = Value,
      rows = grepl("OUT OF CONTROL", Value)
    )
  ) %>%
  tab_style(
    style = cell_fill(color = "lightgreen"),
    locations = cells_body(
      columns = Value,
      rows = grepl("IN CONTROL", Value)
    )
  ) %>%
  tab_style(
    style = list(
      cell_fill(color = "lightblue"),
      cell_text(weight = "bold")
    ),
    locations = cells_column_labels()
  )
Multivariate SPC Dashboard
Analysis Date: 2025-12-21
Metric Value
Total Samples 100
Phase I Samples 80
Phase II Samples 20
Variables Monitored 3
T² Out-of-Control 22
MEWMA Out-of-Control 23
Current Process Status [X] OUT OF CONTROL

7. Recommendations and Actions

Current Process Assessment

Code
# Calculate process performance metrics
recent_samples <- 10
recent_T2 <- tail(T2_values, recent_samples)
recent_MEWMA <- tail(MEWMA_T2, recent_samples)

recent_violations_T2 <- sum(recent_T2 > UCL_T2)
recent_violations_MEWMA <- sum(recent_MEWMA > UCL_MEWMA)

cat("\n### Recent Process Performance (Last", recent_samples, "Samples)\n\n")

### Recent Process Performance (Last 10 Samples)
Code
cat("- T² violations:", recent_violations_T2, "out of", recent_samples, 
    paste0("(", round(recent_violations_T2/recent_samples*100, 1), "%)\n"))
- T² violations: 10 out of 10 (100%)
Code
cat("- MEWMA violations:", recent_violations_MEWMA, "out of", recent_samples,
    paste0("(", round(recent_violations_MEWMA/recent_samples*100, 1), "%)\n\n"))
- MEWMA violations: 10 out of 10 (100%)
Code
if (recent_violations_MEWMA > 0) {
  cat("**Status: [WARNING] Process requires immediate attention**\n\n")
  cat("**Recommended Actions:**\n")
  cat("1. Stop production and investigate root cause\n")
  cat("2. Review variable contributions to identify problem source\n")
  cat("3. Check equipment calibration and process parameters\n")
  cat("4. Review operator procedures and training\n")
  cat("5. Implement corrective actions\n")
  cat("6. Document findings and countermeasures\n")
} else {
  cat("**Status: [OK] Process is in statistical control**\n\n")
  cat("**Recommended Actions:**\n")
  cat("1. Continue current monitoring protocol\n")
  cat("2. Maintain equipment calibration schedule\n")
  cat("3. Regular operator training refreshers\n")
  cat("4. Update control limits quarterly\n")
}
**Status: [WARNING] Process requires immediate attention**

**Recommended Actions:**
1. Stop production and investigate root cause
2. Review variable contributions to identify problem source
3. Check equipment calibration and process parameters
4. Review operator procedures and training
5. Implement corrective actions
6. Document findings and countermeasures

Advantages of Multivariate SPC

Compared to univariate monitoring:

  1. Accounts for correlations: Detects patterns that univariate charts miss
  2. Reduces chart proliferation: One chart instead of multiple
  3. Lower false alarm rate: Adjusts for multiple testing
  4. More powerful: Better at detecting real process shifts
  5. Process understanding: Reveals relationships between variables

When to Use Each Method

| Method | Best For | Advantages | Disadvantages |
|---|---|---|---|
| Hotelling T² | Large sudden shifts | Simple, well-understood | Less sensitive to small shifts |
| MEWMA | Small gradual shifts | Memory, smoothing | Slower response to large shifts |
| MCUSUM | Persistent small shifts | Directional info | Complex interpretation |
| PCA-based | High dimensions (10+ vars) | Dimension reduction | Requires more data |

Implementation Guidelines

Phase I (Establishing Control):

  1. Collect a baseline of stable-operation samples (at least 20-30, preferably more for multivariate data; this report uses 80 Phase I samples)
  2. Calculate the mean vector and covariance matrix
  3. Verify multivariate normality (see the sketch below)
  4. Establish control limits
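
A sketch of one way to carry out step 3: a chi-square Q-Q plot of squared Mahalanobis distances, reusing the Phase I objects X_phase1, mu_hat, and Sigma_hat computed earlier in this report.

Code
# Squared Mahalanobis distances of the Phase I observations
d2 <- mahalanobis(X_phase1, center = mu_hat, cov = Sigma_hat)

# Under multivariate normality, d2 is approximately chi-square with p degrees of freedom
qqplot(qchisq(ppoints(length(d2)), df = ncol(X_phase1)), sort(d2),
       xlab = "Chi-square quantiles",
       ylab = "Ordered squared Mahalanobis distances",
       main = "Multivariate normality check (Phase I)")
abline(0, 1, col = "red", lty = 2)   # points near this line support approximate normality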

Phase II (Ongoing Monitoring):

  1. Plot new observations on the control charts (see the sketch below)
  2. Investigate out-of-control signals immediately
  3. Use contribution analysis to identify root causes
  4. Document and implement corrective actions
  5. Update control limits after process improvements
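
A minimal sketch of step 1 for a single new observation. The values in x_new are hypothetical; the sketch reuses quality_vars, mu_hat, Sigma_hat, and UCL_T2 from earlier chunks.

Code
# Hypothetical new reading to be checked against the Phase I model
x_new <- c(temperature = 151.2, pressure = 203.5, flow_rate = 50.6)

d      <- x_new[quality_vars] - mu_hat
T2_new <- drop(t(d) %*% solve(Sigma_hat) %*% d)   # Hotelling T² for one observation

if (T2_new > UCL_T2) {
  cat("T2 =", round(T2_new, 2), "> UCL =", round(UCL_T2, 2), "-> investigate\n")
  # Quick contribution hint: standardized deviation of each variable from its Phase I mean
  print(round(sort(abs(d / sqrt(diag(Sigma_hat))), decreasing = TRUE), 2))
} else {
  cat("T2 =", round(T2_new, 2), "<= UCL =", round(UCL_T2, 2), "-> in control\n")
}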


Report Generated: 2025-12-21 09:32:24.371897
Analysis Method: Hotelling T² and MEWMA
Statistical Software: R with qcr package (unified quality control)
Methodology: Multivariate Statistical Process Control
Package Info: qcr integrates qcc and MSQC functionality