Hotelling T², MEWMA, and Multivariate Process Monitoring
Author
David Guy PhD: Quality Control Department
Published
December 21, 2025
Introduction to Multivariate SPC
Note: This analysis uses the qcr package as the quality control framework. The multivariate methods (Hotelling T² and MEWMA) are implemented using standard statistical formulas to ensure compatibility across different package versions and to provide educational transparency. All calculations follow established statistical theory and produce results equivalent to specialized multivariate SPC packages.
What is Multivariate SPC?
Multivariate Statistical Process Control (MSPC) monitors multiple related quality characteristics simultaneously, rather than tracking each variable independently.
Why Use Multivariate SPC?
Traditional (Univariate) Approach: - Monitor each variable with separate control charts - Example: Track temperature, pressure, flow rate separately - Problem: Misses interactions and correlations between variables
Multivariate Approach: - Monitor all variables together as a system - Detects patterns that univariate charts miss - Accounts for correlations between variables - More powerful for detecting process changes
Scenario 1 (Both in control individually): - Temperature: 152°C [OK] - within limits - Pressure: 195 psi [OK] - within limits - But: This combination is unusual and indicates a problem!
Multivariate SPC detects this because it knows the normal relationship between temperature and pressure.
Key Multivariate Methods
1. Hotelling T² Chart
Multivariate extension of X-bar chart
Monitors mean vector of multiple variables
One statistic for all variables combined
Most common multivariate control chart
2. MEWMA (Multivariate EWMA)
Multivariate exponentially weighted moving average
More sensitive to small sustained shifts
Better for detecting gradual process drift
Incorporates historical information
3. MCUSUM (Multivariate CUSUM)
Multivariate cumulative sum
Excellent for detecting small persistent shifts
Directional information for diagnosis
4. Principal Component Analysis (PCA)
Reduces dimensionality
Transforms correlated variables to uncorrelated components
Control Limit: For Phase II monitoring: \[UCL = \frac{p(m+1)(m-1)}{m^2-mp} F_{\alpha}(p, m-p)\]
Where: - \(p\) = number of variables - \(m\) = number of Phase I samples - \(F_{\alpha}\) = F-distribution critical value
Interpretation: - T² combines all variables into single statistic - Large T² indicates multivariate out-of-control - Accounts for correlations between variables
Code
# Phase I data (first 80 samples for establishing control limits)phase1_indices <-1:80X_phase1 <- X[phase1_indices, ]# Calculate Phase I statisticsmu_hat <-colMeans(X_phase1)Sigma_hat <-cov(X_phase1)cat("\nPhase I Statistics:\n")
Phase I Statistics:
Code
cat("Mean vector:\n")
Mean vector:
Code
print(round(mu_hat, 3))
temperature pressure flow_rate
150.029 200.309 50.019
Code
cat("\nCovariance matrix:\n")
Covariance matrix:
Code
print(round(Sigma_hat, 3))
temperature pressure flow_rate
temperature 0.040 0.190 0.029
pressure 0.190 0.942 0.139
flow_rate 0.029 0.139 0.021
Code
cat("\n")
Code
# Phase II data (monitoring new samples)X_phase2 <- X# Calculate T² statistics using standard formulam <-nrow(X_phase1) # Number of Phase I samplesp <-ncol(X) # Number of variablesn <-1# Individual observations# Calculate T² for each observationT2_values <-numeric(nrow(X_phase2))for (i in1:nrow(X_phase2)) { x_i <- X_phase2[i, ]# Hotelling T² formula: T² = n(x - μ)'Σ⁻¹(x - μ) T2_values[i] <- n *t(x_i - mu_hat) %*%solve(Sigma_hat) %*% (x_i - mu_hat)}# Calculate Phase II control limit# UCL = [(p(m+1)(m-1)) / (m²-mp)] * F(α, p, m-p)alpha <-0.0027# For 3-sigma equivalent (0.27%)F_critical <-qf(1- alpha, p, m - p)UCL_T2 <- (p * (m +1) * (m -1)) / (m^2- m * p) * F_criticalcat("Hotelling T² Chart Parameters:\n")
What T² Measures: - Overall deviation from the target mean vector - Accounts for variable correlations - Larger values indicate process is farther from target
When to Investigate: - Any point above UCL indicates out-of-control - Patterns (runs, trends) suggest special causes - Compare Phase I vs Phase II performance
Decomposition Analysis
When T² signals out-of-control, we need to identify which variable(s) caused the signal.
T² Chart: Better for detecting large, sudden shifts
MEWMA Chart: Better for detecting small, gradual shifts
MEWMA smooths the data, making trends more visible
Notice how MEWMA responds earlier to the shift at sample 81
3. Principal Component Analysis (PCA) Based SPC
Theory
PCA-based SPC transforms correlated variables into uncorrelated principal components, then monitors these components.
Why Use PCA for SPC?
Advantages: - Reduces dimensionality (10+ variables → 2-3 components) - Eliminates redundancy from correlated variables - Two complementary statistics: T² and Q (SPE) - Better visualization of process behavior - More interpretable for high-dimensional data
The Two Statistics
1. Hotelling T² (in PC space) - Monitors systematic variation (in-model) - Captures variation in retained principal components - Detects shifts in the process mean structure
2. Q Statistic (Squared Prediction Error - SPE) - Monitors random variation (residual) - Captures variation NOT in retained components - Detects unusual patterns or anomalies - Also called DModX or residual distance
Perform PCA
Code
# Standardize data (important for PCA with different scales)X_scaled <-scale(X_phase1)X_scaled_all <-scale(X, center =colMeans(X_phase1), scale =apply(X_phase1, 2, sd))# Perform PCA on Phase I datapca_result <-prcomp(X_scaled, center =FALSE, scale. =FALSE)# Variance explainedvariance_explained <- pca_result$sdev^2prop_variance <- variance_explained /sum(variance_explained)cum_variance <-cumsum(prop_variance)cat("\nPCA Results:\n")
# Decide on number of components (retain 95% variance)k <-min(which(cum_variance >=0.95))cat("\nRetaining", k, "components (capturing", round(cum_variance[k] *100, 1), "% of variance)\n\n")
Retaining 1 components (capturing 99 % of variance)
[INFO] Score plot requires at least 2 principal components.
Currently retaining only 1 component(s).
Use histogram or time series plot for 1D visualization.
Cross-Validation in PCA-Based SPC
Why Cross-Validation Matters
Traditional SPC assumes we know the “true” number of components to retain. But in practice: - Too few PCs → Miss important variation - Too many PCs → Include noise, reduce sensitivity - Affects both T² and Q statistics
Cross-validation helps: - Select optimal number of components objectively - Avoid overfitting to Phase I data - Improve generalization to Phase II monitoring - Validate model performance
Cross-Validation Approaches for SPC
PRESS Statistic Explained:
PRESS = Predicted Residual Error Sum of Squares
\[PRESS = \sum_{i=1}^{n} (x_i - \hat{x}_i)^2\]
Where: - \(x_i\) = actual observation - \(\hat{x}_i\) = predicted (reconstructed) observation using model trained WITHOUT \(x_i\)
Why PRESS matters: - Measures how well model predicts NEW data - Lower PRESS = better generalization - Avoids overfitting (model evaluated on unseen data) - Standard metric for cross-validation
In SPC context: - Tests if PCA model works on future Phase II data - Ensures control limits will be appropriate - Validates number of components selected
Code
# Perform cross-validation to select optimal number of PCs# Using PRESS (Predicted Residual Error Sum of Squares)cat("\nCross-Validation for PCA Model Selection:\n\n")
Cross-Validation for PCA Model Selection:
Code
# K-fold cross-validationK <-5# Number of foldsn_phase1 <-nrow(X_phase1)fold_size <-floor(n_phase1 / K)# Try different numbers of componentsmax_components <-min(p, n_phase1 - fold_size -1)cv_results <-tibble(n_components =1:max_components,CV_Error =numeric(max_components),PRESS =numeric(max_components))for (n_comp in1:max_components) { press_total <-0for (k in1:K) {# Create training and validation sets val_indices <- ((k-1) * fold_size +1):min(k * fold_size, n_phase1) train_indices <-setdiff(1:n_phase1, val_indices)# Train PCA on training set X_train <- X_scaled[train_indices, ] X_val <- X_scaled[val_indices, ] pca_cv <-prcomp(X_train, center =FALSE, scale. =FALSE)# Project validation data scores_val <- X_val %*% pca_cv$rotation[, 1:n_comp, drop =FALSE]# Reconstruct validation data X_recon <- scores_val %*%t(pca_cv$rotation[, 1:n_comp, drop =FALSE])# Calculate prediction error press_total <- press_total +sum((X_val - X_recon)^2) } cv_results$PRESS[n_comp] <- press_total cv_results$CV_Error[n_comp] <- press_total / n_phase1}# Display CV results with proper NA handlingcv_results %>%mutate(`% Improvement`=ifelse( n_components ==1,NA_real_, # First component has no previous to compare-diff(c(0, CV_Error)) /c(0, CV_Error[-n()]) *100 ),Recommendation =case_when( n_components ==which.min(CV_Error) ~"[OPTIMAL] Minimum CV Error", n_components <which.min(CV_Error) ~"Underfit (too few PCs)",TRUE~"May overfit (too many PCs)" ) ) %>%gt() %>%tab_header(title ="Cross-Validation Results",subtitle ="PRESS-based selection of optimal principal components" ) %>%fmt_number(columns =c(CV_Error, PRESS),decimals =4 ) %>%fmt_number(columns =`% Improvement`,decimals =2 ) %>%sub_missing(columns =`% Improvement`,missing_text ="—" ) %>%tab_style(style =cell_fill(color ="lightgreen"),locations =cells_body(rows = n_components ==which.min(CV_Error) ) ) %>%tab_footnote(footnote ="% Improvement = reduction in CV error compared to using one fewer component",locations =cells_column_labels(columns =`% Improvement`) ) %>%tab_footnote(footnote ="PRESS = Predicted Residual Error Sum of Squares (lower is better)",locations =cells_column_labels(columns = PRESS) )
Cross-Validation Results
PRESS-based selection of optimal principal components
n_components
CV_Error
PRESS1
% Improvement2
Recommendation
1
0.0293
2.3423
—
Underfit (too few PCs)
2
0.0036
0.2843
87.86
Underfit (too few PCs)
3
0.0000
0.0000
100.00
[OPTIMAL] Minimum CV Error
1 PRESS = Predicted Residual Error Sum of Squares (lower is better)
2 % Improvement = reduction in CV error compared to using one fewer component
Code
# Optimal number of componentsk_optimal <-which.min(cv_results$CV_Error)cat("\nRecommended number of components (CV):", k_optimal, "\n")
1 Difference row shows absolute differences between the two methods
2 — indicates not applicable or no meaningful comparison
Code
cat("\n")
Code
if (k != k_optimal) {cat("[INFO] Cross-validation suggests a different number of components!\n")cat("Variance method:", k, "components (", round(cum_variance[k_safe]*100, 1), "% variance)\n")cat("CV method:", k_optimal, "components (", round(cum_variance[k_optimal_safe]*100, 1), "% variance)\n\n")if (k_optimal < k) {cat("Interpretation: CV suggests FEWER components\n")cat("Benefit: May improve sensitivity by excluding noise\n")cat("Trade-off: Slightly less variance captured\n") } else {cat("Interpretation: CV suggests MORE components\n")cat("Benefit: Captures additional important variation\n")cat("Trade-off: Slightly higher complexity\n") }} else {cat("[OK] Both methods agree on", k, "components.\n")cat("This indicates robust model selection.\n")}
[INFO] Cross-validation suggests a different number of components!
Variance method: 5 components ( 100 % variance)
CV method: 3 components ( 100 % variance)
Interpretation: CV suggests FEWER components
Benefit: May improve sensitivity by excluding noise
Trade-off: Slightly less variance captured
Code
cat("\n")
When to Use Cross-Validation in SPC
Use CV when: - ✅ High-dimensional data (10+ variables) - ✅ Limited Phase I samples - ✅ Uncertain about number of components - ✅ Building predictive models (soft sensors) - ✅ Want to avoid overfitting
Variance threshold (95%) is fine when: - ✅ Clear eigenvalue drop-off (scree plot) - ✅ Sufficient Phase I data (50+ samples) - ✅ Well-understood process - ✅ Traditional SPC context
Advanced: Leave-One-Out Cross-Validation (LOOCV)
Code
# LOOCV for more precise estimate (computationally intensive)if (n_phase1 <=50) { # Only for small datasetscat("\nPerforming Leave-One-Out Cross-Validation:\n") loocv_results <-tibble(n_components =1:max_components,LOOCV_Error =numeric(max_components) )for (n_comp in1:max_components) { loocv_error <-0# Leave one outfor (i in1:n_phase1) { X_train <- X_scaled[-i, ] X_test <- X_scaled[i, , drop =FALSE] pca_loo <-prcomp(X_train, center =FALSE, scale. =FALSE)# Project test sample score_test <- X_test %*% pca_loo$rotation[, 1:n_comp, drop =FALSE] X_recon <- score_test %*%t(pca_loo$rotation[, 1:n_comp, drop =FALSE]) loocv_error <- loocv_error +sum((X_test - X_recon)^2) } loocv_results$LOOCV_Error[n_comp] <- loocv_error / n_phase1 } k_loocv <-which.min(loocv_results$LOOCV_Error)cat("LOOCV optimal components:", k_loocv, "\n")cat("K-fold CV optimal:", k_optimal, "\n")cat("Variance-based:", k, "\n\n")# Compare all three methods all_methods <-plot_ly() %>%add_trace(data = cv_results,x =~n_components,y =~CV_Error,type ="scatter",mode ="lines+markers",name ="K-Fold CV",line =list(color ="blue") ) %>%add_trace(data = loocv_results,x =~n_components,y =~LOOCV_Error,type ="scatter",mode ="lines+markers",name ="LOOCV",line =list(color ="red", dash ="dash") ) %>%add_segments(x = k, xend = k,y =min(c(cv_results$CV_Error, loocv_results$LOOCV_Error)),yend =max(c(cv_results$CV_Error, loocv_results$LOOCV_Error)),line =list(color ="green", dash ="dot"),name ="95% Variance" ) %>%layout(title =list(text ="Comparing Model Selection Methods",font =list(size =16, weight ="bold") ),xaxis =list(title ="Number of Components", dtick =1),yaxis =list(title ="Prediction Error"),showlegend =TRUE ) all_methods} else {cat("\n[SKIPPED] LOOCV not performed (dataset too large)\n")cat("K-fold CV is more efficient for large Phase I datasets.\n")}
[SKIPPED] LOOCV not performed (dataset too large)
K-fold CV is more efficient for large Phase I datasets.
Practical Recommendations
For Your SPC Implementation:
Start with variance threshold (95% or scree plot)
Validate with CV if:
Unexpected false alarms
Poor Phase II performance
Limited Phase I data
Use CV optimal if it differs significantly
Monitor both T² and Q regardless of selection
CV Best Practices: - Use K=5 or K=10 for K-fold - LOOCV for small datasets (<50 samples) - Stratify folds if data has structure - Revalidate periodically with new data
Literature Support
Cross-validation in SPC is supported by: - Nomikos & MacGregor (1995): Monitoring batch processes with multiway PCA - Kourti (2005): Process analytical technology and quality by design - Qin (2012): Survey of industrial applications of data-driven methods
Modern Practice: Many pharmaceutical and chemical industries now use CV-based model selection for: - Batch process monitoring - Soft sensor development - Fault detection systems - Quality prediction models
5. Variable Contribution Analysis
Identifying Root Causes
When a multivariate chart signals out-of-control, we need to identify which variable(s) are responsible.
if (recent_violations_MEWMA >0) {cat("**Status: [WARNING] Process requires immediate attention**\n\n")cat("**Recommended Actions:**\n")cat("1. Stop production and investigate root cause\n")cat("2. Review variable contributions to identify problem source\n")cat("3. Check equipment calibration and process parameters\n")cat("4. Review operator procedures and training\n")cat("5. Implement corrective actions\n")cat("6. Document findings and countermeasures\n")} else {cat("**Status: [OK] Process is in statistical control**\n\n")cat("**Recommended Actions:**\n")cat("1. Continue current monitoring protocol\n")cat("2. Maintain equipment calibration schedule\n")cat("3. Regular operator training refreshers\n")cat("4. Update control limits quarterly\n")}
**Status: [WARNING] Process requires immediate attention**
**Recommended Actions:**
1. Stop production and investigate root cause
2. Review variable contributions to identify problem source
3. Check equipment calibration and process parameters
4. Review operator procedures and training
5. Implement corrective actions
6. Document findings and countermeasures
Advantages of Multivariate SPC
Compared to univariate monitoring:
Accounts for correlations: Detects patterns that univariate charts miss
Reduces chart proliferation: One chart instead of multiple
Lower false alarm rate: Adjusts for multiple testing
More powerful: Better at detecting real process shifts
Process understanding: Reveals relationships between variables
When to Use Each Method
Method
Best For
Advantages
Disadvantages
T²
Large sudden shifts
Simple, well-understood
Less sensitive to small shifts
MEWMA
Small gradual shifts
Memory, smoothing
Slower response to large shifts
MCUSUM
Persistent small shifts
Directional info
Complex interpretation
PCA-based
High dimensions (10+ vars)
Dimension reduction
Requires more data
Implementation Guidelines
Phase I (Establishing Control): 1. Collect 20-30 samples during stable operation 2. Calculate mean vector and covariance matrix 3. Verify multivariate normality 4. Establish control limits
Phase II (Ongoing Monitoring): 1. Plot new observations on control charts 2. Investigate out-of-control signals immediately 3. Use contribution analysis to identify root causes 4. Document and implement corrective actions 5. Update control limits after process improvements
Report Generated: 2025-12-21 09:32:24.371897 Analysis Method: Hotelling T² and MEWMA Statistical Software: R with qcr package (unified quality control) Methodology: Multivariate Statistical Process Control Package Info: qcr integrates qcc and MSQC functionality
Source Code
---title: "Multivariate Statistical Process Control (MSPC)"subtitle: "Hotelling T², MEWMA, and Multivariate Process Monitoring"author: "David Guy PhD: Quality Control Department"date: "`r Sys.Date()`"format: html: toc: true toc-depth: 3 toc-location: left code-fold: show code-tools: true theme: cosmo embed-resources: true fig-width: 12 fig-height: 8editor: visualexecute: warning: false message: false---# Introduction to Multivariate SPC**Note:** This analysis uses the `qcr` package as the quality control framework. The multivariate methods (Hotelling T² and MEWMA) are implemented using standard statistical formulas to ensure compatibility across different package versions and to provide educational transparency. All calculations follow established statistical theory and produce results equivalent to specialized multivariate SPC packages.## What is Multivariate SPC?**Multivariate Statistical Process Control (MSPC)** monitors multiple related quality characteristics simultaneously, rather than tracking each variable independently.### Why Use Multivariate SPC?**Traditional (Univariate) Approach:** - Monitor each variable with separate control charts - Example: Track temperature, pressure, flow rate separately - **Problem**: Misses interactions and correlations between variables**Multivariate Approach:** - Monitor all variables together as a system - Detects patterns that univariate charts miss - Accounts for correlations between variables - More powerful for detecting process changes### Real-World Example**Chemical Reactor Process:** - Temperature: 150°C +/- 5°C - Pressure: 200 psi +/- 10 psi\- pH: 7.0 +/- 0.5**Scenario 1 (Both in control individually):** - Temperature: 152°C \[OK\] - within limits - Pressure: 195 psi \[OK\] - within limits - **But**: This combination is unusual and indicates a problem!**Multivariate SPC detects this** because it knows the normal relationship between temperature and pressure.## Key Multivariate Methods### 1. Hotelling T² Chart- Multivariate extension of X-bar chart- Monitors mean vector of multiple variables- One statistic for all variables combined- Most common multivariate control chart### 2. MEWMA (Multivariate EWMA)- Multivariate exponentially weighted moving average- More sensitive to small sustained shifts- Better for detecting gradual process drift- Incorporates historical information### 3. MCUSUM (Multivariate CUSUM)- Multivariate cumulative sum- Excellent for detecting small persistent shifts- Directional information for diagnosis### 4. Principal Component Analysis (PCA)- Reduces dimensionality- Transforms correlated variables to uncorrelated components- T² and Q statistics for monitoring- Useful for high-dimensional data (10+ variables)## Setup and Data Import```{r setup}# Install packages if neededpackages <-c("qcr", "plotly", "gt", "dplyr", "tidyr", "ggplot2", "readr", "lubridate", "DT", "knitr", "MASS", "psych")# Load librarieslibrary(qcr) # Unified quality control (replaces qcc + MSQC)library(plotly) # Interactive plotslibrary(gt) # Tableslibrary(dplyr) # Data manipulationlibrary(tidyr) # Data tidyinglibrary(ggplot2) # Graphicslibrary(readr) # Data importlibrary(lubridate) # Dateslibrary(DT) # Interactive tableslibrary(knitr) # Reportslibrary(MASS) # Multivariate normallibrary(psych) # Correlation matricestheme_set(theme_minimal())cat("Package versions:\n")cat("qcr version:", as.character(packageVersion("qcr")), "\n\n")```## Import Multivariate Data```{r import-data}# Import multivariate quality dataif (file.exists("multivariate_data.csv")) { mv_data <-read_csv("multivariate_data.csv", show_col_types =FALSE)cat("[OK] Multivariate data loaded from multivariate_data.csv\n\n")} else {cat("[INFO] No multivariate_data.csv found. Generating example data.\n")cat("For your own data, create multivariate_data.csv with multiple quality variables.\n\n")# Generate realistic multivariate process dataset.seed(42) n_samples <-100# Phase I: In-control data (samples 1-80)# Create correlated variables (temperature, pressure, flow_rate)# Correlation structure (realistic for a process) correlation_matrix <-matrix(c(1.0, 0.7, 0.5, # Temperature correlations0.7, 1.0, 0.6, # Pressure correlations0.5, 0.6, 1.0# Flow rate correlations ), nrow =3, byrow =TRUE)# Mean vector mu_in_control <-c(150, 200, 50) # Temp, Pressure, Flow# Standard deviations sigma_in_control <-c(2, 5, 1)# Create covariance matrix D <-diag(sigma_in_control) Sigma_in_control <- D %*% correlation_matrix %*% D# Generate in-control data (Phase I) phase1_data <-mvrnorm(n =80, mu = mu_in_control, Sigma = Sigma_in_control)# Phase II: Out-of-control data (samples 81-100)# Introduce a mean shift mu_out_control <-c(152, 205, 51) # Small shifts in all variables phase2_data <-mvrnorm(n =20, mu = mu_out_control, Sigma = Sigma_in_control)# Combine data all_data <-rbind(phase1_data, phase2_data)# Create dataframe mv_data <-tibble(sample =1:n_samples,date =seq.Date(from =Sys.Date() - n_samples, to =Sys.Date() -1, by ="day"),temperature = all_data[, 1],pressure = all_data[, 2],flow_rate = all_data[, 3],shift =sample(c("Day", "Night"), n_samples, replace =TRUE),operator =sample(paste0("OP", 1:3), n_samples, replace =TRUE) )}# Display data summarycat("Multivariate Process Data:\n")cat("Samples:", nrow(mv_data), "\n")cat("Variables:", sum(sapply(mv_data, is.numeric)), "numeric variables\n\n")# Show first few observationsmv_data %>%head(10) %>%gt() %>%tab_header(title ="Multivariate Quality Data - Sample Preview",subtitle ="First 10 observations" ) %>%fmt_number(columns =where(is.numeric),decimals =2 ) %>%tab_style(style =cell_fill(color ="lightblue"),locations =cells_column_labels() )```## Data Summary and Correlation Analysis```{r data-summary}# Select quality variablesquality_vars <-c("temperature", "pressure", "flow_rate")X <-as.matrix(mv_data[, quality_vars])# Summary statisticssummary_stats <- mv_data %>%select(all_of(quality_vars)) %>%summarise(across(everything(),list(Mean =~mean(., na.rm =TRUE),SD =~sd(., na.rm =TRUE),Min =~min(., na.rm =TRUE),Max =~max(., na.rm =TRUE)),.names ="{.col}__{.fn}")) %>%pivot_longer(everything(),names_to =c("Variable", "Statistic"),names_sep ="__") %>%pivot_wider(names_from = Statistic, values_from = value)summary_stats %>%gt() %>%tab_header(title ="Summary Statistics",subtitle ="Process variables descriptive statistics" ) %>%fmt_number(columns =c(Mean, SD, Min, Max),decimals =3 ) %>%tab_style(style =list(cell_fill(color ="lightyellow"),cell_text(weight ="bold") ),locations =cells_body(columns = Variable) )# Correlation matrixcor_matrix <-cor(X)cat("\nCorrelation Matrix:\n")cor_df <-as.data.frame(cor_matrix)cor_df$Variable <-rownames(cor_matrix)cor_df <- cor_df[, c("Variable", quality_vars)]cor_df %>%gt() %>%tab_header(title ="Correlation Matrix",subtitle ="Relationships between process variables" ) %>%fmt_number(columns =all_of(quality_vars),decimals =3 )# Note: Strong correlations (>0.7) are important for multivariate SPCcat("\nStrong correlations (>0.7) found:\n")for (i in1:(length(quality_vars)-1)) {for (j in (i+1):length(quality_vars)) {if (abs(cor_matrix[i,j]) >0.7) {cat(sprintf("%s <-> %s: %.3f\n", quality_vars[i], quality_vars[j], cor_matrix[i,j])) } }}cat("\n")# Correlation heatmapcor_long <- cor_matrix %>%as.data.frame() %>%mutate(Var1 =rownames(.)) %>%pivot_longer(cols =-Var1, names_to ="Var2", values_to ="Correlation")cor_plot <-plot_ly(data = cor_long,x =~Var1,y =~Var2,z =~Correlation,type ="heatmap",colorscale =list(c(0, "blue"),c(0.5, "white"),c(1, "red") ),zmin =-1,zmax =1,text =~round(Correlation, 3),hovertemplate ="<b>%{x} vs %{y}</b><br>Correlation: %{z:.3f}<extra></extra>") %>%layout(title =list(text ="Correlation Heatmap<br><sub>Strong correlations indicate variables move together</sub>",font =list(size =16) ),xaxis =list(title =""),yaxis =list(title ="") )cor_plot```# 1. Hotelling T² Control Chart## TheoryThe **Hotelling T²** statistic is a multivariate generalization of the t-statistic:**Formula:** $$T^2 = n(\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0)$$Where: - $\bar{x}$ = sample mean vector - $\mu_0$ = target mean vector (from Phase I) - $S$ = sample covariance matrix - $n$ = sample size**Control Limit:** For Phase II monitoring: $$UCL = \frac{p(m+1)(m-1)}{m^2-mp} F_{\alpha}(p, m-p)$$Where: - $p$ = number of variables - $m$ = number of Phase I samples - $F_{\alpha}$ = F-distribution critical value**Interpretation:** - T² combines all variables into single statistic - Large T² indicates multivariate out-of-control - Accounts for correlations between variables```{r tsquared-chart}# Phase I data (first 80 samples for establishing control limits)phase1_indices <-1:80X_phase1 <- X[phase1_indices, ]# Calculate Phase I statisticsmu_hat <-colMeans(X_phase1)Sigma_hat <-cov(X_phase1)cat("\nPhase I Statistics:\n")cat("Mean vector:\n")print(round(mu_hat, 3))cat("\nCovariance matrix:\n")print(round(Sigma_hat, 3))cat("\n")# Phase II data (monitoring new samples)X_phase2 <- X# Calculate T² statistics using standard formulam <-nrow(X_phase1) # Number of Phase I samplesp <-ncol(X) # Number of variablesn <-1# Individual observations# Calculate T² for each observationT2_values <-numeric(nrow(X_phase2))for (i in1:nrow(X_phase2)) { x_i <- X_phase2[i, ]# Hotelling T² formula: T² = n(x - μ)'Σ⁻¹(x - μ) T2_values[i] <- n *t(x_i - mu_hat) %*%solve(Sigma_hat) %*% (x_i - mu_hat)}# Calculate Phase II control limit# UCL = [(p(m+1)(m-1)) / (m²-mp)] * F(α, p, m-p)alpha <-0.0027# For 3-sigma equivalent (0.27%)F_critical <-qf(1- alpha, p, m - p)UCL_T2 <- (p * (m +1) * (m -1)) / (m^2- m * p) * F_criticalcat("Hotelling T² Chart Parameters:\n")cat("Number of variables (p):", p, "\n")cat("Phase I samples (m):", m, "\n")cat("Significance level (α):", alpha, "\n")cat("F critical value:", round(F_critical, 3), "\n")cat("Control limit (UCL):", round(UCL_T2, 3), "\n")cat("Number of out-of-control points:", sum(T2_values > UCL_T2), "\n\n")# Create results tablet2_summary <-tibble(Metric =c("Number of Variables", "Phase I Samples", "UCL", "Mean T²", "Max T²", "Out-of-Control Points"),Value =c(p, m, UCL_T2, mean(T2_values), max(T2_values), sum(T2_values > UCL_T2)))t2_summary %>%gt() %>%tab_header(title ="Hotelling T² Chart Summary",subtitle ="Multivariate control chart statistics" ) %>%fmt_number(columns = Value,rows =3:5,decimals =3 ) %>%fmt_number(columns = Value,rows =c(1, 2, 6),decimals =0 ) %>%tab_style(style =cell_fill(color ="lightcoral"),locations =cells_body(rows =6) )# Create T² control chartt2_data <-tibble(Sample =1:length(T2_values),T2 = T2_values,UCL = UCL_T2,Phase =ifelse(Sample <=80, "Phase I (Training)", "Phase II (Monitoring)"),Status =ifelse(T2 > UCL, "Out of Control", "In Control"))t2_plot <-plot_ly(t2_data) %>%add_trace(x =~Sample,y =~T2,type ="scatter",mode ="lines+markers",name ="T² Statistic",line =list(color ="blue"),marker =list(size =6,color =~ifelse(Status =="Out of Control", "red", "blue") ),text =~paste0("Sample: ", Sample, "<br>","T²: ", round(T2, 3), "<br>","Phase: ", Phase, "<br>","Status: ", Status ),hovertemplate ="%{text}<extra></extra>" ) %>%add_trace(x =~Sample,y =~UCL,type ="scatter",mode ="lines",name ="UCL",line =list(color ="red", dash ="dash", width =3),hovertemplate ="<b>UCL:</b> %{y:.3f}<extra></extra>" ) %>%add_segments(x =80.5, xend =80.5,y =0, yend =max(t2_data$T2) *1.1,line =list(color ="gray", dash ="dot", width =2),showlegend =FALSE,hoverinfo ="skip" ) %>%layout(title =list(text ="Hotelling T² Control Chart<br><sub>Monitoring multiple correlated variables simultaneously</sub>",font =list(size =16, weight ="bold") ),xaxis =list(title ="Sample Number"),yaxis =list(title ="T² Statistic"),hovermode ="closest",annotations =list(list(x =80.5, y =max(t2_data$T2) *1.05,text ="Phase I | Phase II",showarrow =FALSE,xanchor ="center",font =list(size =10, color ="gray") ) ) )t2_plot```## Interpreting T² Results### Understanding the T² Statistic**What T² Measures:** - Overall deviation from the target mean vector - Accounts for variable correlations - Larger values indicate process is farther from target**When to Investigate:** - Any point above UCL indicates out-of-control - Patterns (runs, trends) suggest special causes - Compare Phase I vs Phase II performance### Decomposition AnalysisWhen T² signals out-of-control, we need to identify which variable(s) caused the signal.```{r tsquared-decomposition}# For out-of-control points, decompose T²ooc_indices <-which(T2_values > UCL_T2)if (length(ooc_indices) >0) {cat("\nDecomposition of Out-of-Control Points:\n\n")# Analyze first few out-of-control points analyze_points <-head(ooc_indices, 3) decomp_results <-list()for (idx in analyze_points) { x_i <- X[idx, ] deviation <- x_i - mu_hat# Calculate contribution of each variable S_inv <-solve(Sigma_hat) contributions <-numeric(p)for (j in1:p) { e_j <-rep(0, p) e_j[j] <-1 contributions[j] <- n *t(deviation) %*% S_inv %*% e_j * deviation[j] } decomp_results[[length(decomp_results) +1]] <-tibble(Sample = idx,Variable = quality_vars,Value = x_i,Target = mu_hat,Deviation = deviation,Contribution = contributions,Pct_of_T2 = (contributions / T2_values[idx]) *100 ) }# Display decomposition tablebind_rows(decomp_results) %>%gt() %>%tab_header(title ="T² Decomposition Analysis",subtitle ="Identifying which variables caused out-of-control signals" ) %>%fmt_number(columns =c(Value, Target, Deviation, Contribution),decimals =3 ) %>%fmt_number(columns = Pct_of_T2,decimals =1 ) %>%tab_style(style =cell_fill(color ="lightcoral"),locations =cells_body(columns = Pct_of_T2,rows = Pct_of_T2 >40 ) ) %>%tab_style(style =list(cell_fill(color ="lightyellow"),cell_text(weight ="bold") ),locations =cells_body(columns = Sample) )} else {cat("\nNo out-of-control points detected in T² chart.\n")}```# 2. MEWMA (Multivariate EWMA) Chart## Theory**MEWMA** (Multivariate Exponentially Weighted Moving Average) is more sensitive to small sustained shifts than T².**Formula:** $$Z_i = \Lambda X_i + (I - \Lambda) Z_{i-1}$$Where: - $Z_i$ = EWMA vector at time $i$ - $\Lambda$ = diagonal matrix of smoothing constants (typically 0.1-0.3) - $X_i$ = observation vector at time $i$ - $I$ = identity matrix**MEWMA Statistic:** $$T_i^2 = Z_i' \Sigma_Z^{-1} Z_i$$Where $\Sigma_Z$ is the covariance matrix of $Z_i$.**Control Limit:** Typically set using simulation to achieve desired false alarm rate.**Advantages:** - More sensitive to small shifts - Smooths out random noise - Better for detecting gradual process drift```{r mewma-chart}# MEWMA parameterslambda <-0.2# Smoothing constant (0 < lambda <= 1)# Smaller = more smoothing, more sensitive to small shiftscat("MEWMA Chart Setup:\n")cat("Smoothing constant (lambda):", lambda, "\n\n")# Initialize MEWMA vectorsZ <-matrix(0, nrow =nrow(X), ncol = p)Z[1, ] <- X[1, ] - mu_hat # Initialize with first observation# Calculate MEWMA vectors# Z_i = λ(X_i - μ) + (1-λ)Z_{i-1}for (i in2:nrow(X)) { Z[i, ] <- lambda * (X[i, ] - mu_hat) + (1- lambda) * Z[i-1, ]}# Calculate MEWMA T² statistic# Covariance of Z_i: Σ_Z = [λ/(2-λ)]ΣSigma_Z <- (lambda / (2- lambda)) * Sigma_hatMEWMA_T2 <-numeric(nrow(X))for (i in1:nrow(X)) { MEWMA_T2[i] <-t(Z[i, ]) %*%solve(Sigma_Z) %*% Z[i, ]}# Control limit# Using h parameter (typically 9-13 for good performance)h <-11.5# Control limit parameterUCL_MEWMA <- hcat("MEWMA Chart Parameters:\n")cat("Smoothing constant (λ):", lambda, "\n")cat("Control limit parameter (h):", h, "\n")cat("UCL:", round(UCL_MEWMA, 3), "\n")cat("Number of out-of-control points:", sum(MEWMA_T2 > UCL_MEWMA), "\n\n")# Create MEWMA results tablemewma_summary <-tibble(Metric =c("Smoothing Constant (λ)", "Control Limit (h)", "Mean MEWMA T²", "Max MEWMA T²", "Out-of-Control Points"),Value =c(lambda, UCL_MEWMA, mean(MEWMA_T2), max(MEWMA_T2), sum(MEWMA_T2 > UCL_MEWMA)))mewma_summary %>%gt() %>%tab_header(title ="MEWMA Chart Summary",subtitle ="Multivariate exponentially weighted moving average" ) %>%fmt_number(columns = Value,rows =1:4,decimals =3 ) %>%fmt_number(columns = Value,rows =5,decimals =0 ) %>%tab_style(style =cell_fill(color ="lightcoral"),locations =cells_body(rows =5) )# Create MEWMA control chartmewma_data <-tibble(Sample =1:length(MEWMA_T2),MEWMA = MEWMA_T2,UCL = UCL_MEWMA,Phase =ifelse(Sample <=80, "Phase I", "Phase II"),Status =ifelse(MEWMA > UCL, "Out of Control", "In Control"))mewma_plot <-plot_ly(mewma_data) %>%add_trace(x =~Sample,y =~MEWMA,type ="scatter",mode ="lines+markers",name ="MEWMA T²",line =list(color ="darkgreen"),marker =list(size =6,color =~ifelse(Status =="Out of Control", "red", "darkgreen") ),text =~paste0("Sample: ", Sample, "<br>","MEWMA T²: ", round(MEWMA, 3), "<br>","Phase: ", Phase, "<br>","Status: ", Status ),hovertemplate ="%{text}<extra></extra>" ) %>%add_trace(x =~Sample,y =~UCL,type ="scatter",mode ="lines",name ="UCL",line =list(color ="red", dash ="dash", width =3),hovertemplate ="<b>UCL:</b> %{y:.3f}<extra></extra>" ) %>%add_segments(x =80.5, xend =80.5,y =0, yend =max(mewma_data$MEWMA) *1.1,line =list(color ="gray", dash ="dot", width =2),showlegend =FALSE,hoverinfo ="skip" ) %>%layout(title =list(text ="MEWMA Control Chart<br><sub>More sensitive to small sustained shifts</sub>",font =list(size =16, weight ="bold") ),xaxis =list(title ="Sample Number"),yaxis =list(title ="MEWMA T² Statistic"),hovermode ="closest" )mewma_plot```## Comparing T² vs MEWMA```{r comparison-chart}# Create comparison dataframecomparison_data <-tibble(Sample =1:length(T2_values),`T² Statistic`= T2_values / UCL_T2, # Normalize by UCL`MEWMA Statistic`= MEWMA_T2 / UCL_MEWMA,Phase =ifelse(Sample <=80, "Phase I", "Phase II"))# Create comparison plotcomp_plot <-plot_ly(comparison_data) %>%add_trace(x =~Sample,y =~`T² Statistic`,type ="scatter",mode ="lines",name ="T² (normalized)",line =list(color ="blue", width =2),hovertemplate ="<b>T²:</b> %{y:.3f}<extra></extra>" ) %>%add_trace(x =~Sample,y =~`MEWMA Statistic`,type ="scatter",mode ="lines",name ="MEWMA (normalized)",line =list(color ="darkgreen", width =2),hovertemplate ="<b>MEWMA:</b> %{y:.3f}<extra></extra>" ) %>%add_segments(x =0, xend =max(comparison_data$Sample),y =1, yend =1,line =list(color ="red", dash ="dash", width =2),showlegend =FALSE,hoverinfo ="skip" ) %>%add_segments(x =80.5, xend =80.5,y =0, yend =max(c(comparison_data$`T² Statistic`, comparison_data$`MEWMA Statistic`)),line =list(color ="gray", dash ="dot", width =2),showlegend =FALSE,hoverinfo ="skip" ) %>%layout(title =list(text ="T² vs MEWMA Comparison<br><sub>Both statistics normalized by their control limits</sub>",font =list(size =16, weight ="bold") ),xaxis =list(title ="Sample Number"),yaxis =list(title ="Normalized Statistic (UCL = 1.0)"),hovermode ="x unified",annotations =list(list(x =max(comparison_data$Sample) /2,y =1.05,text ="Control Limit",showarrow =FALSE,font =list(size =10, color ="red") ) ) )comp_plot```**Key Observations:**- **T² Chart**: Better for detecting large, sudden shifts- **MEWMA Chart**: Better for detecting small, gradual shifts- **MEWMA** smooths the data, making trends more visible- Notice how MEWMA responds earlier to the shift at sample 81# 3. Principal Component Analysis (PCA) Based SPC## Theory**PCA-based SPC** transforms correlated variables into uncorrelated principal components, then monitors these components.### Why Use PCA for SPC?**Advantages:** - Reduces dimensionality (10+ variables → 2-3 components) - Eliminates redundancy from correlated variables - Two complementary statistics: T² and Q (SPE) - Better visualization of process behavior - More interpretable for high-dimensional data### The Two Statistics**1. Hotelling T² (in PC space)** - Monitors systematic variation (in-model) - Captures variation in retained principal components - Detects shifts in the process mean structure**2. Q Statistic (Squared Prediction Error - SPE)** - Monitors random variation (residual) - Captures variation NOT in retained components - Detects unusual patterns or anomalies - Also called DModX or residual distance## Perform PCA```{r pca-analysis}# Standardize data (important for PCA with different scales)X_scaled <-scale(X_phase1)X_scaled_all <-scale(X, center =colMeans(X_phase1), scale =apply(X_phase1, 2, sd))# Perform PCA on Phase I datapca_result <-prcomp(X_scaled, center =FALSE, scale. =FALSE)# Variance explainedvariance_explained <- pca_result$sdev^2prop_variance <- variance_explained /sum(variance_explained)cum_variance <-cumsum(prop_variance)cat("\nPCA Results:\n")cat("Principal Component Analysis on", p, "variables\n\n")# Variance tablepca_variance <-tibble(PC =paste0("PC", 1:p),`Std Dev`= pca_result$sdev,Variance = variance_explained,`Prop. Variance`= prop_variance *100,`Cumulative %`= cum_variance *100)pca_variance %>%gt() %>%tab_header(title ="Principal Components Summary",subtitle ="Variance explained by each component" ) %>%fmt_number(columns =c(`Std Dev`, Variance),decimals =3 ) %>%fmt_number(columns =c(`Prop. Variance`, `Cumulative %`),decimals =2 )# Decide on number of components (retain 95% variance)k <-min(which(cum_variance >=0.95))cat("\nRetaining", k, "components (capturing", round(cum_variance[k] *100, 1), "% of variance)\n\n")# Scree plotscree_data <-tibble(PC =1:p,Variance = variance_explained,Type ="Eigenvalue")scree_plot <-plot_ly(scree_data) %>%add_trace(x =~PC,y =~Variance,type ="scatter",mode ="lines+markers",name ="Eigenvalue",line =list(color ="blue", width =2),marker =list(size =10, color ="blue"),hovertemplate ="<b>PC%{x}</b><br>Variance: %{y:.3f}<extra></extra>" ) %>%add_segments(x = k +0.5, xend = k +0.5,y =0, yend =max(variance_explained),line =list(color ="red", dash ="dash", width =2),showlegend =FALSE,hoverinfo ="skip" ) %>%layout(title =list(text ="Scree Plot<br><sub>Selecting number of principal components</sub>",font =list(size =16, weight ="bold") ),xaxis =list(title ="Principal Component", dtick =1),yaxis =list(title ="Variance (Eigenvalue)"),annotations =list(list(x = k +0.5,y =max(variance_explained) *0.9,text =paste0("Retain ", k, " PCs"),showarrow =FALSE,xanchor ="left",font =list(color ="red") ) ) )scree_plot# Loadings plotif (k >=1) {# Extract loadings and ensure proper column names loadings_matrix <- pca_result$rotation[, 1:min(2, k), drop =FALSE] loadings_data <-as.data.frame(loadings_matrix)# Ensure column names are PC1, PC2if (ncol(loadings_data) ==1) {colnames(loadings_data) <-"PC1" loadings_data$PC2 <-0 } else {colnames(loadings_data) <-c("PC1", "PC2") } loadings_data$Variable <- quality_vars loadings_plot <-plot_ly(loadings_data) %>%add_trace(x =~PC1,y =~PC2,type ="scatter",mode ="markers+text",text =~Variable,textposition ="top center",marker =list(size =12, color ="darkblue"),hovertemplate ="<b>%{text}</b><br>PC1: %{x:.3f}<br>PC2: %{y:.3f}<extra></extra>" ) %>%add_segments(x =0, xend =~PC1,y =0, yend =~PC2,line =list(color ="gray", width =1),showlegend =FALSE,hoverinfo ="skip" ) %>%layout(title =list(text ="PCA Loadings Plot<br><sub>How variables contribute to principal components</sub>",font =list(size =16, weight ="bold") ),xaxis =list(title =paste0("PC1 (", round(prop_variance[1]*100, 1), "%)"),zeroline =TRUE),yaxis =list(title =if(k >=2) paste0("PC2 (", round(prop_variance[2]*100, 1), "%)") else"PC2",zeroline =TRUE) ) loadings_plot}```## PCA-based T² Control Chart```{r pca-t2-chart}# Project data onto principal componentsscores_phase1 <- X_scaled %*% pca_result$rotation[, 1:k, drop =FALSE]scores_all <- X_scaled_all %*% pca_result$rotation[, 1:k, drop =FALSE]# Calculate T² in PC space# Lambda is diagonal matrix of eigenvaluesLambda <-diag(variance_explained[1:k], nrow = k, ncol = k)T2_pca <-numeric(nrow(scores_all))for (i in1:nrow(scores_all)) { t_i <-as.numeric(scores_all[i, ]) # Ensure it's a numeric vector# T² = t'Λ⁻¹tif (k ==1) {# Special case for single component T2_pca[i] <- (t_i^2) / variance_explained[1] } else { T2_pca[i] <-t(t_i) %*%solve(Lambda) %*% t_i }}# Control limit for PCA T²m <-nrow(X_phase1)alpha <-0.0027F_crit_pca <-qf(1- alpha, k, m - k)UCL_T2_pca <- (k * (m^2-1)) / (m * (m - k)) * F_crit_pcacat("\nPCA T² Chart Parameters:\n")cat("Number of PCs retained (k):", k, "\n")cat("Control limit (UCL):", round(UCL_T2_pca, 3), "\n")cat("Out-of-control points:", sum(T2_pca > UCL_T2_pca), "\n\n")# PCA T² chartpca_t2_data <-tibble(Sample =1:length(T2_pca),T2_PCA = T2_pca,UCL = UCL_T2_pca,Phase =ifelse(Sample <=80, "Phase I", "Phase II"),Status =ifelse(T2_PCA > UCL, "Out of Control", "In Control"))pca_t2_plot <-plot_ly(pca_t2_data) %>%add_trace(x =~Sample,y =~T2_PCA,type ="scatter",mode ="lines+markers",name ="PCA T²",line =list(color ="purple"),marker =list(size =6,color =~ifelse(Status =="Out of Control", "red", "purple") ),text =~paste0("Sample: ", Sample, "<br>","T² (PCA): ", round(T2_PCA, 3), "<br>","Status: ", Status ),hovertemplate ="%{text}<extra></extra>" ) %>%add_trace(x =~Sample,y =~UCL,type ="scatter",mode ="lines",name ="UCL",line =list(color ="red", dash ="dash", width =3) ) %>%add_segments(x =80.5, xend =80.5,y =0, yend =max(pca_t2_data$T2_PCA) *1.1,line =list(color ="gray", dash ="dot", width =2),showlegend =FALSE ) %>%layout(title =list(text ="PCA-based T² Control Chart<br><sub>Monitoring in principal component space</sub>",font =list(size =16, weight ="bold") ),xaxis =list(title ="Sample Number"),yaxis =list(title ="T² Statistic (PCA)") )pca_t2_plot```## Q Statistic (SPE) Control Chart```{r pca-q-chart}# Calculate Q statistic (Squared Prediction Error)# Q = ||x - x_hat||² where x_hat is reconstruction from k PCsX_reconstructed <- scores_all %*%t(pca_result$rotation[, 1:k, drop =FALSE])residuals <- X_scaled_all - X_reconstructedQ_stat <-rowSums(residuals^2)# Control limit for Q using chi-squared approximationif (k < p) {# Calculate parameters for chi-squared approximation theta1 <-sum(variance_explained[(k+1):p]) theta2 <-sum(variance_explained[(k+1):p]^2) theta3 <-sum(variance_explained[(k+1):p]^3) h0 <-1- (2* theta1 * theta3) / (3* theta2^2) ca <-qnorm(1- alpha) UCL_Q <- theta1 * (1+ (ca *sqrt(2* theta2) * h0) / theta1 + (theta2 * h0 * (h0 -1)) / theta1^2)^(1/h0)} else {# If all PCs retained, Q should be near zero UCL_Q <-max(Q_stat) *1.5cat("Note: All PCs retained, Q statistic will be small\n")}cat("\nQ Statistic (SPE) Chart Parameters:\n")cat("Monitoring residual variation\n")cat("Control limit (UCL):", round(UCL_Q, 3), "\n")cat("Out-of-control points:", sum(Q_stat > UCL_Q), "\n\n")# Q statistic chartq_data <-tibble(Sample =1:length(Q_stat),Q = Q_stat,UCL = UCL_Q,Phase =ifelse(Sample <=80, "Phase I", "Phase II"),Status =ifelse(Q > UCL, "Out of Control", "In Control"))q_plot <-plot_ly(q_data) %>%add_trace(x =~Sample,y =~Q,type ="scatter",mode ="lines+markers",name ="Q Statistic",line =list(color ="orange"),marker =list(size =6,color =~ifelse(Status =="Out of Control", "red", "orange") ),text =~paste0("Sample: ", Sample, "<br>","Q: ", round(Q, 3), "<br>","Status: ", Status ),hovertemplate ="%{text}<extra></extra>" ) %>%add_trace(x =~Sample,y =~UCL,type ="scatter",mode ="lines",name ="UCL",line =list(color ="red", dash ="dash", width =3) ) %>%add_segments(x =80.5, xend =80.5,y =0, yend =max(q_data$Q) *1.1,line =list(color ="gray", dash ="dot", width =2),showlegend =FALSE ) %>%layout(title =list(text ="Q Statistic (SPE) Control Chart<br><sub>Detecting unusual patterns and anomalies</sub>",font =list(size =16, weight ="bold") ),xaxis =list(title ="Sample Number"),yaxis =list(title ="Q Statistic (Squared Prediction Error)") )q_plot```## Interpreting PCA-based SPC### Understanding T² vs Q| Statistic | Monitors | Interpretation When Out-of-Control ||----------------|----------------|-----------------------------------------|| **T² (in PC space)** | Systematic variation | Process mean has shifted in a known direction || **Q (SPE)** | Residual variation | Unusual pattern or new type of variation |### Decision Matrix| T² Status | Q Status | Interpretation | Action ||-----------------|-----------------|-----------------------|-----------------|| In Control | In Control | Process normal | Continue monitoring || Out of Control | In Control | **Common cause shift** | Adjust process mean || In Control | Out of Control | **Special cause/anomaly** | Investigate unusual event || Out of Control | Out of Control | **Major disturbance** | Stop and investigate |## Score Plot (First Two PCs)```{r pca-score-plot}if (k >=2) {# Create score plot score_data <-tibble(Sample =1:nrow(scores_all),PC1 = scores_all[, 1],PC2 = scores_all[, 2],Phase =ifelse(Sample <=80, "Phase I", "Phase II"),Overall_Status =case_when( T2_pca > UCL_T2_pca & Q_stat > UCL_Q ~"Both OOC", T2_pca > UCL_T2_pca ~"T² Only", Q_stat > UCL_Q ~"Q Only",TRUE~"In Control" ) ) score_plot <-plot_ly(score_data) %>%add_trace(x =~PC1,y =~PC2,type ="scatter",mode ="markers",color =~Overall_Status,colors =c("In Control"="blue", "T² Only"="orange", "Q Only"="purple","Both OOC"="red"),marker =list(size =8),text =~paste0("Sample: ", Sample, "<br>","PC1: ", round(PC1, 3), "<br>","PC2: ", round(PC2, 3), "<br>","Status: ", Overall_Status ),hovertemplate ="%{text}<extra></extra>" ) %>%layout(title =list(text ="PCA Score Plot<br><sub>Process behavior in principal component space</sub>",font =list(size =16, weight ="bold") ),xaxis =list(title =paste0("PC1 (", round(prop_variance[1]*100, 1), "%)")),yaxis =list(title =paste0("PC2 (", round(prop_variance[2]*100, 1), "%)")),showlegend =TRUE ) score_plot} else {cat("\n[INFO] Score plot requires at least 2 principal components.\n")cat("Currently retaining only", k, "component(s).\n")cat("Use histogram or time series plot for 1D visualization.\n")}```## Cross-Validation in PCA-Based SPC### Why Cross-Validation MattersTraditional SPC assumes we know the "true" number of components to retain. But in practice: - Too few PCs → Miss important variation - Too many PCs → Include noise, reduce sensitivity - Affects both T² and Q statistics**Cross-validation helps:** - Select optimal number of components objectively - Avoid overfitting to Phase I data - Improve generalization to Phase II monitoring - Validate model performance### Cross-Validation Approaches for SPC**PRESS Statistic Explained:****PRESS** = Predicted Residual Error Sum of Squares$$PRESS = \sum_{i=1}^{n} (x_i - \hat{x}_i)^2$$Where: - $x_i$ = actual observation - $\hat{x}_i$ = predicted (reconstructed) observation using model trained WITHOUT $x_i$**Why PRESS matters:** - Measures how well model predicts NEW data - Lower PRESS = better generalization - Avoids overfitting (model evaluated on unseen data) - Standard metric for cross-validation**In SPC context:** - Tests if PCA model works on future Phase II data - Ensures control limits will be appropriate - Validates number of components selected```{r pca-crossvalidation}# Perform cross-validation to select optimal number of PCs# Using PRESS (Predicted Residual Error Sum of Squares)cat("\nCross-Validation for PCA Model Selection:\n\n")# K-fold cross-validationK <-5# Number of foldsn_phase1 <-nrow(X_phase1)fold_size <-floor(n_phase1 / K)# Try different numbers of componentsmax_components <-min(p, n_phase1 - fold_size -1)cv_results <-tibble(n_components =1:max_components,CV_Error =numeric(max_components),PRESS =numeric(max_components))for (n_comp in1:max_components) { press_total <-0for (k in1:K) {# Create training and validation sets val_indices <- ((k-1) * fold_size +1):min(k * fold_size, n_phase1) train_indices <-setdiff(1:n_phase1, val_indices)# Train PCA on training set X_train <- X_scaled[train_indices, ] X_val <- X_scaled[val_indices, ] pca_cv <-prcomp(X_train, center =FALSE, scale. =FALSE)# Project validation data scores_val <- X_val %*% pca_cv$rotation[, 1:n_comp, drop =FALSE]# Reconstruct validation data X_recon <- scores_val %*%t(pca_cv$rotation[, 1:n_comp, drop =FALSE])# Calculate prediction error press_total <- press_total +sum((X_val - X_recon)^2) } cv_results$PRESS[n_comp] <- press_total cv_results$CV_Error[n_comp] <- press_total / n_phase1}# Display CV results with proper NA handlingcv_results %>%mutate(`% Improvement`=ifelse( n_components ==1,NA_real_, # First component has no previous to compare-diff(c(0, CV_Error)) /c(0, CV_Error[-n()]) *100 ),Recommendation =case_when( n_components ==which.min(CV_Error) ~"[OPTIMAL] Minimum CV Error", n_components <which.min(CV_Error) ~"Underfit (too few PCs)",TRUE~"May overfit (too many PCs)" ) ) %>%gt() %>%tab_header(title ="Cross-Validation Results",subtitle ="PRESS-based selection of optimal principal components" ) %>%fmt_number(columns =c(CV_Error, PRESS),decimals =4 ) %>%fmt_number(columns =`% Improvement`,decimals =2 ) %>%sub_missing(columns =`% Improvement`,missing_text ="—" ) %>%tab_style(style =cell_fill(color ="lightgreen"),locations =cells_body(rows = n_components ==which.min(CV_Error) ) ) %>%tab_footnote(footnote ="% Improvement = reduction in CV error compared to using one fewer component",locations =cells_column_labels(columns =`% Improvement`) ) %>%tab_footnote(footnote ="PRESS = Predicted Residual Error Sum of Squares (lower is better)",locations =cells_column_labels(columns = PRESS) )# Optimal number of componentsk_optimal <-which.min(cv_results$CV_Error)cat("\nRecommended number of components (CV):", k_optimal, "\n")cat("Original selection (95% variance):", k, "\n")if (k != k_optimal) {cat("Difference:", abs(k_optimal - k), "component(s)\n") improvement <- (cv_results$CV_Error[k] - cv_results$CV_Error[k_optimal]) / cv_results$CV_Error[k] *100cat("CV error improvement:", round(improvement, 2), "%\n")} else {cat("Both methods AGREE on", k, "components [OK]\n")}cat("\n")# Plot CV error vs number of componentscv_plot <-plot_ly(cv_results) %>%add_trace(x =~n_components,y =~CV_Error,type ="scatter",mode ="lines+markers",name ="CV Error",line =list(color ="blue", width =2),marker =list(size =8, color ="blue"),hovertemplate ="<b>%{x} Components</b><br>CV Error: %{y:.4f}<extra></extra>" ) %>%add_trace(x = k_optimal,y = cv_results$CV_Error[k_optimal],type ="scatter",mode ="markers",name ="Optimal",marker =list(size =15, color ="red", symbol ="star"),hovertemplate ="<b>Optimal: %{x} PCs</b><br>Error: %{y:.4f}<extra></extra>" ) %>%layout(title =list(text ="Cross-Validation: Selecting Number of Components<br><sub>Lower CV error indicates better model</sub>",font =list(size =16, weight ="bold") ),xaxis =list(title ="Number of Principal Components", dtick =1),yaxis =list(title ="Cross-Validation Error (PRESS/n)") )cv_plot```### Interpreting CV Results**What the CV Error Tells You:**| Pattern | Interpretation | Action ||---------------------|---------------------------------|------------------|| Error decreases then plateaus | Optimal found | Use number at minimum || Error keeps decreasing | May need more PCs | Check if reasonable || Sharp drop then flat | Clear optimal point | High confidence || Noisy curve | Unstable model | Need more Phase I data |### Comparison: Variance vs Cross-Validation```{r cv-comparison}# Compare variance-based vs CV-based selection# Add safety checks to prevent NAs# Ensure indices are validk_safe <-min(k, length(cum_variance))k_optimal_safe <-min(k_optimal, length(cum_variance), nrow(cv_results))comparison <-tibble(Method =c("Variance (95%)", "Cross-Validation", "Difference"),`# Components`=c(k, k_optimal, NA_integer_),`Variance Explained %`=c( cum_variance[k_safe] *100, cum_variance[k_optimal_safe] *100,abs(cum_variance[k_safe] - cum_variance[k_optimal_safe]) *100 ),`CV Error`=c( cv_results$CV_Error[k_safe], cv_results$CV_Error[k_optimal_safe],abs(cv_results$CV_Error[k_safe] - cv_results$CV_Error[k_optimal_safe]) ),`Better By`=c("—","—",ifelse(k == k_optimal, "Same selection", paste0(abs(k - k_optimal), " component(s)")) ),Recommendation =c("Traditional approach","Data-driven optimum",ifelse(k == k_optimal, "[OK] METHODS AGREE", "[INFO] METHODS DIFFER") ))comparison %>%gt() %>%tab_header(title ="Model Selection Comparison",subtitle ="Variance-based vs Cross-validation" ) %>%fmt_number(columns =`Variance Explained %`,decimals =2 ) %>%fmt_number(columns =`CV Error`,decimals =4 ) %>%sub_missing(columns =everything(),missing_text ="—" ) %>%tab_style(style =cell_fill(color ="lightgreen"),locations =cells_body(rows = Method =="Cross-Validation") ) %>%tab_style(style =cell_fill(color ="lightyellow"),locations =cells_body(rows = Method =="Difference") ) %>%tab_footnote(footnote ="Difference row shows absolute differences between the two methods",locations =cells_body(rows = Method =="Difference", columns = Method) ) %>%tab_footnote(footnote ="— indicates not applicable or no meaningful comparison",locations =cells_body(rows = Method =="Difference", columns =`# Components`) )cat("\n")if (k != k_optimal) {cat("[INFO] Cross-validation suggests a different number of components!\n")cat("Variance method:", k, "components (", round(cum_variance[k_safe]*100, 1), "% variance)\n")cat("CV method:", k_optimal, "components (", round(cum_variance[k_optimal_safe]*100, 1), "% variance)\n\n")if (k_optimal < k) {cat("Interpretation: CV suggests FEWER components\n")cat("Benefit: May improve sensitivity by excluding noise\n")cat("Trade-off: Slightly less variance captured\n") } else {cat("Interpretation: CV suggests MORE components\n")cat("Benefit: Captures additional important variation\n")cat("Trade-off: Slightly higher complexity\n") }} else {cat("[OK] Both methods agree on", k, "components.\n")cat("This indicates robust model selection.\n")}cat("\n")```### When to Use Cross-Validation in SPC**Use CV when:** - ✅ High-dimensional data (10+ variables) - ✅ Limited Phase I samples - ✅ Uncertain about number of components - ✅ Building predictive models (soft sensors) - ✅ Want to avoid overfitting**Variance threshold (95%) is fine when:** - ✅ Clear eigenvalue drop-off (scree plot) - ✅ Sufficient Phase I data (50+ samples) - ✅ Well-understood process - ✅ Traditional SPC context### Advanced: Leave-One-Out Cross-Validation (LOOCV)```{r loocv}# LOOCV for more precise estimate (computationally intensive)if (n_phase1 <=50) { # Only for small datasetscat("\nPerforming Leave-One-Out Cross-Validation:\n") loocv_results <-tibble(n_components =1:max_components,LOOCV_Error =numeric(max_components) )for (n_comp in1:max_components) { loocv_error <-0# Leave one outfor (i in1:n_phase1) { X_train <- X_scaled[-i, ] X_test <- X_scaled[i, , drop =FALSE] pca_loo <-prcomp(X_train, center =FALSE, scale. =FALSE)# Project test sample score_test <- X_test %*% pca_loo$rotation[, 1:n_comp, drop =FALSE] X_recon <- score_test %*%t(pca_loo$rotation[, 1:n_comp, drop =FALSE]) loocv_error <- loocv_error +sum((X_test - X_recon)^2) } loocv_results$LOOCV_Error[n_comp] <- loocv_error / n_phase1 } k_loocv <-which.min(loocv_results$LOOCV_Error)cat("LOOCV optimal components:", k_loocv, "\n")cat("K-fold CV optimal:", k_optimal, "\n")cat("Variance-based:", k, "\n\n")# Compare all three methods all_methods <-plot_ly() %>%add_trace(data = cv_results,x =~n_components,y =~CV_Error,type ="scatter",mode ="lines+markers",name ="K-Fold CV",line =list(color ="blue") ) %>%add_trace(data = loocv_results,x =~n_components,y =~LOOCV_Error,type ="scatter",mode ="lines+markers",name ="LOOCV",line =list(color ="red", dash ="dash") ) %>%add_segments(x = k, xend = k,y =min(c(cv_results$CV_Error, loocv_results$LOOCV_Error)),yend =max(c(cv_results$CV_Error, loocv_results$LOOCV_Error)),line =list(color ="green", dash ="dot"),name ="95% Variance" ) %>%layout(title =list(text ="Comparing Model Selection Methods",font =list(size =16, weight ="bold") ),xaxis =list(title ="Number of Components", dtick =1),yaxis =list(title ="Prediction Error"),showlegend =TRUE ) all_methods} else {cat("\n[SKIPPED] LOOCV not performed (dataset too large)\n")cat("K-fold CV is more efficient for large Phase I datasets.\n")}```### Practical Recommendations**For Your SPC Implementation:**1. **Start with variance threshold** (95% or scree plot)2. **Validate with CV** if: - Unexpected false alarms - Poor Phase II performance - Limited Phase I data3. **Use CV optimal** if it differs significantly4. **Monitor both** T² and Q regardless of selection**CV Best Practices:** - Use K=5 or K=10 for K-fold - LOOCV for small datasets (\<50 samples) - Stratify folds if data has structure - Revalidate periodically with new data### Literature SupportCross-validation in SPC is supported by: - **Nomikos & MacGregor (1995)**: Monitoring batch processes with multiway PCA - **Kourti (2005)**: Process analytical technology and quality by design - **Qin (2012)**: Survey of industrial applications of data-driven methods**Modern Practice:** Many pharmaceutical and chemical industries now use CV-based model selection for: - Batch process monitoring - Soft sensor development - Fault detection systems - Quality prediction models# 5. Variable Contribution Analysis## Identifying Root CausesWhen a multivariate chart signals out-of-control, we need to identify which variable(s) are responsible.```{r contribution-analysis}# For recent out-of-control points, show variable contributionsrecent_ooc <-tail(which(MEWMA_T2 > UCL_MEWMA), 5)if (length(recent_ooc) >0) {# Calculate standardized deviations contrib_data <-list()for (idx in recent_ooc) { x_i <- X[idx, ] std_dev <- (x_i - mu_hat) /sqrt(diag(Sigma_hat)) contrib_data[[length(contrib_data) +1]] <-tibble(Sample = idx,Variable = quality_vars,Value = x_i,Target = mu_hat,`Std Deviation`= std_dev,`Abs Std Dev`=abs(std_dev),Status =ifelse(abs(std_dev) >3, "[X] Out of Control",ifelse(abs(std_dev) >2, "[WARNING] Near Limit", "[OK] In Control")) ) } contrib_table <-bind_rows(contrib_data) contrib_table %>%gt() %>%tab_header(title ="Variable Contribution Analysis",subtitle ="Which variables are causing out-of-control signals?" ) %>%fmt_number(columns =c(Value, Target, `Std Deviation`),decimals =3 ) %>%fmt_number(columns =`Abs Std Dev`,decimals =2 ) %>%tab_style(style =cell_fill(color ="lightcoral"),locations =cells_body(columns =`Abs Std Dev`,rows =`Abs Std Dev`>3 ) ) %>%tab_style(style =cell_fill(color ="lightyellow"),locations =cells_body(columns =`Abs Std Dev`,rows =`Abs Std Dev`>2&`Abs Std Dev`<=3 ) ) %>%tab_style(style =cell_fill(color ="lightgreen"),locations =cells_body(columns =`Abs Std Dev`,rows =`Abs Std Dev`<=2 ) )# Create contribution bar chart for most recent OOC point latest_ooc <-tail(recent_ooc, 1) latest_contrib <- contrib_table %>%filter(Sample == latest_ooc) contrib_plot <-plot_ly(data = latest_contrib,x =~Variable,y =~`Std Deviation`,type ="bar",marker =list(color =~ifelse(abs(`Std Deviation`) >3, "red",ifelse(abs(`Std Deviation`) >2, "orange", "green")) ),text =~round(`Std Deviation`, 2),textposition ="outside",hovertemplate =paste("<b>%{x}</b><br>","Std Dev: %{y:.3f}<br>","<extra></extra>" ) ) %>%add_segments(x =-0.5, xend =length(quality_vars) -0.5,y =3, yend =3,line =list(color ="red", dash ="dash", width =2),showlegend =FALSE,name ="+3σ" ) %>%add_segments(x =-0.5, xend =length(quality_vars) -0.5,y =-3, yend =-3,line =list(color ="red", dash ="dash", width =2),showlegend =FALSE,name ="-3σ" ) %>%layout(title =list(text =paste0("Variable Contributions - Sample ", latest_ooc, "<br><sub>Standardized deviations from target</sub>"),font =list(size =16, weight ="bold") ),xaxis =list(title ="Variable"),yaxis =list(title ="Standardized Deviation (σ)"),showlegend =FALSE ) contrib_plot} else {cat("\nNo recent out-of-control points for contribution analysis.\n")}```# 6. Process Monitoring Dashboard```{r dashboard}# Create summary metricsdashboard_metrics <-tibble(Metric =c("Total Samples","Phase I Samples","Phase II Samples","Variables Monitored","T² Out-of-Control","MEWMA Out-of-Control","Current Process Status" ),Value =c(nrow(X),80,nrow(X) -80, p,sum(T2_values > UCL_T2),sum(MEWMA_T2 > UCL_MEWMA),ifelse(tail(MEWMA_T2, 1) > UCL_MEWMA, "[X] OUT OF CONTROL", "[OK] IN CONTROL") ))dashboard_metrics %>%gt() %>%tab_header(title ="Multivariate SPC Dashboard",subtitle =paste("Analysis Date:", Sys.Date()) ) %>%tab_style(style =cell_fill(color ="lightcoral"),locations =cells_body(columns = Value,rows =grepl("OUT OF CONTROL", Value) ) ) %>%tab_style(style =cell_fill(color ="lightgreen"),locations =cells_body(columns = Value,rows =grepl("IN CONTROL", Value) ) ) %>%tab_style(style =list(cell_fill(color ="lightblue"),cell_text(weight ="bold") ),locations =cells_column_labels() )```# 7. Recommendations and Actions## Current Process Assessment```{r assessment}# Calculate process performance metricsrecent_samples <-10recent_T2 <-tail(T2_values, recent_samples)recent_MEWMA <-tail(MEWMA_T2, recent_samples)recent_violations_T2 <-sum(recent_T2 > UCL_T2)recent_violations_MEWMA <-sum(recent_MEWMA > UCL_MEWMA)cat("\n### Recent Process Performance (Last", recent_samples, "Samples)\n\n")cat("- T² violations:", recent_violations_T2, "out of", recent_samples, paste0("(", round(recent_violations_T2/recent_samples*100, 1), "%)\n"))cat("- MEWMA violations:", recent_violations_MEWMA, "out of", recent_samples,paste0("(", round(recent_violations_MEWMA/recent_samples*100, 1), "%)\n\n"))if (recent_violations_MEWMA >0) {cat("**Status: [WARNING] Process requires immediate attention**\n\n")cat("**Recommended Actions:**\n")cat("1. Stop production and investigate root cause\n")cat("2. Review variable contributions to identify problem source\n")cat("3. Check equipment calibration and process parameters\n")cat("4. Review operator procedures and training\n")cat("5. Implement corrective actions\n")cat("6. Document findings and countermeasures\n")} else {cat("**Status: [OK] Process is in statistical control**\n\n")cat("**Recommended Actions:**\n")cat("1. Continue current monitoring protocol\n")cat("2. Maintain equipment calibration schedule\n")cat("3. Regular operator training refreshers\n")cat("4. Update control limits quarterly\n")}```## Advantages of Multivariate SPC**Compared to univariate monitoring:**1. **Accounts for correlations**: Detects patterns that univariate charts miss2. **Reduces chart proliferation**: One chart instead of multiple3. **Lower false alarm rate**: Adjusts for multiple testing4. **More powerful**: Better at detecting real process shifts5. **Process understanding**: Reveals relationships between variables## When to Use Each Method| Method | Best For | Advantages | Disadvantages ||-----------------|-----------------|------------------|----------------------|| **T²** | Large sudden shifts | Simple, well-understood | Less sensitive to small shifts || **MEWMA** | Small gradual shifts | Memory, smoothing | Slower response to large shifts || **MCUSUM** | Persistent small shifts | Directional info | Complex interpretation || **PCA-based** | High dimensions (10+ vars) | Dimension reduction | Requires more data |## Implementation Guidelines**Phase I (Establishing Control):** 1. Collect 20-30 samples during stable operation 2. Calculate mean vector and covariance matrix 3. Verify multivariate normality 4. Establish control limits**Phase II (Ongoing Monitoring):** 1. Plot new observations on control charts 2. Investigate out-of-control signals immediately 3. Use contribution analysis to identify root causes 4. Document and implement corrective actions 5. Update control limits after process improvements------------------------------------------------------------------------**Report Generated:** `r Sys.time()`\**Analysis Method:** Hotelling T² and MEWMA\**Statistical Software:** R with qcr package (unified quality control)\**Methodology:** Multivariate Statistical Process Control\**Package Info:** qcr integrates qcc and MSQC functionality