Fuzzy Cluster-Adaptive p'-Chart for Overdispersed Data
Developed by
Jean Paul MorĂ¡n-Zabala, Juan Miguel Cogollo-FlĂ³rez & Alexander Alberto Correa-Espinal
2026
Analytical route of the document. The workflow of this executable code is: (1) load and clean the proportional variable, (2) diagnose overdispersion, (3) test different numbers of clusters, (4) estimate Fuzzy C-Means by window, (5) compute regime-specific parameters, (6) build FCA p'-chart limits, (7) flag out-of-control signals, (8) plot the results, and (9) export files for traceability.
How to read this document. Each section contains three elements: decision context, equations used, and expected output. The tables make it possible to verify the calculations numerically; the plots support interpretation of process stability, dominant clusters, and the coherence of OOC signals.
1. Methodological foundation
The method monitors a nonconformity or discard proportion by subgroup. For each observation \(i\), the following proportion is defined:
\[
p_i = \frac{X_i}{n_i}
\]
where \(X_i\) is the number of non-viable, discarded, or nonconforming units, and \(n_i\) is the subgroup size. In heterogeneous processes, a single global parameter can induce false signals because it mixes different operational regimes. Therefore, Fuzzy C-Means is used to represent latent states with partial memberships.
1.1. Fuzzy C-Means
FCM partitions the data into \(c\) clusters by minimizing:
\[
J_m(U,V)=\sum_{i=1}^{N}\sum_{j=1}^{c} u_{ij}^{m}\,\lVert x_i-v_j\rVert^2
\]
subject to:
\[
\sum_{j=1}^{c} u_{ij}=1, \qquad 0\leq u_{ij}\leq 1
\]
where \(u_{ij}\) is the membership degree of observation \(i\) in cluster \(j\), \(v_j\) is the centroid, and \(m>1\) is the fuzzification parameter. This code uses \(m=2\) by default.
1.2. Cluster-conditioned central line
The central line of cluster \(j\) is estimated through a fuzzy-membership-weighted mean:
\[
\hat{\mu}_j = \frac{\sum_{i=1}^{N} u_{ij}^{m}p_i}{\sum_{i=1}^{N} u_{ij}^{m}}
\]
Additionally, to summarize the global performance of a window, the following quantity can be computed:
\[
\hat{p}_{F} = \sum_{j=1}^{c} w_j\hat{\mu}_j, \qquad
w_j=\frac{n_j^{\ast}}{\sum_{j=1}^{c}n_j^{\ast}}
\]
where \(n_j^{\ast}\) corresponds to the number of observations dominantly assigned to cluster \(j\).
1.3. Regime-specific variability
La varianza total dentro del régimen \(j\) se estima como:
\[
\hat{\sigma}_{T,j}^{2}=\frac{\sum_{i=1}^{N}u_{ij}^{m}(p_i-\hat{\mu}_j)^2}{\sum_{i=1}^{N}u_{ij}^{m}}
\]
The effective sample size of the cluster is:
\[
\bar{n}_{j}=\frac{\sum_{i=1}^{N}u_{ij}^{m}n_i}{\sum_{i=1}^{N}u_{ij}^{m}}
\]
The extra-binomial or between-subgroup component is obtained by subtracting the expected binomial variability:
\[
\hat{\sigma}_{B,j}^{2}=\max\left\{0,\frac{\hat{\sigma}_{T,j}^{2}-\frac{\hat{\mu}_j(1-\hat{\mu}_j)}{\bar{n}_j}}{1-\frac{1}{\bar{n}_j}}\right\}
\]
1.4. FCA p'-chart adaptive control limits
For each observation, the dominant cluster is identified:
\[
j_i^{\ast}=\arg\max_j u_{ij}
\]
La varianza condicional de \(p_i\) se calcula como:
\[
\widehat{Var}(p_i\mid j_i^{\ast})=\frac{\hat{\mu}_{j_i^{\ast}}(1-\hat{\mu}_{j_i^{\ast}})}{n_i}+\hat{\sigma}_{B,j_i^{\ast}}^{2}\left(1-\frac{1}{n_i}\right)
\]
The control limits are defined as:
\[
UCL_i=\min\{1,\hat{\mu}_{j_i^{\ast}}+k\sqrt{\widehat{Var}(p_i\mid j_i^{\ast})}\}
\]
\[
LCL_i=\max\{0,\hat{\mu}_{j_i^{\ast}}-k\sqrt{\widehat{Var}(p_i\mid j_i^{\ast})}\}
\]
An out-of-control signal is flagged when:
\[Signal_i=\mathbb{I}(p_i>UCL_i\ \lor\ p_i < LCL_i)\]
2. Included validation indices
The file computes and plots the following indices for \(K=2,\ldots,10\):
| Index | Interpretation | Usual criterion |
| PC: Partition Coefficient | Measures compact fuzzy partitioning. | Higher is better. |
| PE: Partition Entropy | Measures partition uncertainty. | Lower is better. |
| MPC: Modified Partition Coefficient | PC adjusted by number of clusters. | Higher is better. |
| SIL.F aproximado | Silhouette based on dominant assignment. | Higher is better. |
| DB: Davies-Bouldin | Dispersion/separation ratio. | Lower is better. |
| CH: Calinski-Harabasz | Between-group separation relative to within-group compactness. | Higher is better. |
| Gap aproximado | Comparison against a uniform reference distribution. | Higher value or elbow criterion. |
2.1. Equations of the validation indices
Context. These indices do not directly control the process; they validate whether the fuzzy partition is sufficiently compact, separated, and interpretable. Therefore, the final number of clusters should not be chosen from a single isolated index, but from convergence among statistical evidence, operational interpretability, and signal stability.
Partition Coefficient (PC). Evaluates how well-defined the fuzzy assignment is. Higher values indicate more concentrated memberships.
\[
PC=\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{c}u_{ij}^{2}
\]
Partition Entropy (PE). Measures partition uncertainty. Lower values indicate less ambiguity.
\[
PE=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{c}u_{ij}\ln(u_{ij})
\]
Modified Partition Coefficient (MPC). Corrects PC by the number of clusters.
\[
MPC=\frac{PC-1/c}{1-1/c}
\]
Davies-Bouldin (DB). Contrasts internal cluster dispersion with centroid separation.
\[
DB=\frac{1}{c}\sum_{j=1}^{c}\max_{l\neq j}\left(\frac{S_j+S_l}{\lVert v_j-v_l\rVert}\right)
\]
Calinski-Harabasz (CH). Compares between-cluster variability against within-cluster variability.
\[
CH=\frac{B/(c-1)}{W/(N-c)}
\]
Signal rate by window. Summarizes the proportion of out-of-control observations in a window.
\[
SignalRate_w=\frac{\sum_{i\in w}Signal_i}{N_w}
\]
3. Executable RHTML code
Before running: place the .xlsx file in the working directory or change the DATA_PATH variable. The Excel file must contain the columns: Semana, NĂºmero de muestras tomadas, and NĂºmero de muestras descartadas (no viables).
# ==========================================================
# 0. InstalaciĂ³n/carga de paquetes
# ==========================================================
required_pkgs <- c(
"readxl", "dplyr", "ggplot2", "e1071", "scales", "writexl",
"tidyr", "cluster", "knitr"
)
install_if_missing <- function(pkgs){
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if(length(missing) > 0){
install.packages(missing, dependencies = TRUE)
}
}
install_if_missing(required_pkgs)
invisible(lapply(required_pkgs, library, character.only = TRUE))
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE,
fig.width = 8.5,
fig.height = 4.8,
dpi = 150,
out.width = '100%'
)
# ==========================================================
# 1. General parameters
# ==========================================================
DATA_PATH <- "EX_DATA.xlsx" # Change if the file is in another path
base_out_dir <- "outputs_pcharts_rhtml"
dir.create(base_out_dir, showWarnings = FALSE, recursive = TRUE)
exploratory <- 200 # initial reference window
window_size <- 1000 # sequential window size
m_fuzz <- 2 # FCM fuzzification parameter
k_sigma <- 3 # Shewhart-type limit width
k_min <- 2
k_max <- 10
K_MAX <- 10
# K_ELEGIDO define la soluciĂ³n final que se desea visualizar de forma general.
# This study recommends starting with 4 clusters, but it can be changed
# to 5, 6, ..., 10 if the validation indices and operational interpretation
# justify a more granular partition.
K_ELEGIDO <- 4
# Output sizes adjusted to Letter/A4 paper.
# These values prevent oversized figures when exporting
# o al compilar/imprimir el RHTML.
FIG_W_STD <- 8.5 # standard width for individual plots
FIG_H_STD <- 4.8 # standard height for individual plots
FIG_W_LAND <- 10.5 # maximum recommended width in landscape page
FIG_H_LAND <- 7.2 # maximum recommended height in landscape page
FIG_W_PORT <- 7.5 # ancho recomendado en hoja vertical
FIG_H_PORT <- 9.8 # alto recomendado en hoja vertical
DPI_EXPORT <- 180 # sufficient report resolution without excessive file sizes
PLOT_BASE_SIZE <- 10 # compact base size so legends and facets fit on the page
if(K_ELEGIDO < 4){
stop("K_ELEGIDO must be 4 or greater according to the analysis request.")
}
if(K_ELEGIDO < k_min || K_ELEGIDO > k_max){
stop("K_ELEGIDO must be within the k_min:k_max range.")
}
CTRL_LTY <- c("UCL" = "longdash", "LCL" = "dotdash", "CCL" = "solid")
CTRL_LWD <- c("UCL" = 1.40, "LCL" = 1.20, "CCL" = 0.85)
CTRL_ALPHA <- c("UCL" = 0.95, "LCL" = 0.95, "CCL" = 0.55)
4. Auxiliary functions
What is computed here. This block defines reusable functions to create sequential windows, run FCM, order clusters by their central line, estimate regime-specific variability, compute adaptive limits, and obtain validation indices. Separating these functions makes the main loop more transparent and auditable.
Expected output. This block does not produce a plot; it prepares the mathematical tools that will later be used to build tables, limits, signals, and reports.
# ==========================================================
# 2. Auxiliary functions for plots, windows, FCM, and indices
# ==========================================================
get_cluster_palette <- function(k){
cols <- scales::hue_pal()(k)
names(cols) <- as.character(seq_len(k))
cols
}
limits_long <- function(out_df){
dplyr::bind_rows(
data.frame(week = out_df$week, limit = "UCL", value = out_df$UCL),
data.frame(week = out_df$week, limit = "LCL", value = out_df$LCL),
data.frame(week = out_df$week, limit = "CCL", value = out_df$CCL)
) %>%
dplyr::mutate(
limit = factor(limit, levels = c("UCL", "LCL", "CCL")),
linewidth = unname(CTRL_LWD[as.character(limit)]),
alpha = unname(CTRL_ALPHA[as.character(limit)])
)
}
make_windows <- function(N, exploratory = 200, window_size = 1000){
windows <- list()
windows[[1]] <- list(id = 0, start = 1, end = min(exploratory, N))
idx <- exploratory + 1
wid <- 1
while(idx <= N){
endi <- min(idx + window_size - 1, N)
windows[[length(windows) + 1]] <- list(id = wid, start = idx, end = endi)
idx <- endi + 1
wid <- wid + 1
}
windows
}
run_fcm_1d <- function(x, centers = 3, m = 2, seed = 1){
set.seed(seed)
fit <- e1071::cmeans(
as.matrix(x), centers = centers, m = m,
iter.max = 300, verbose = FALSE, method = "cmeans"
)
list(centers = as.numeric(fit$centers), U = fit$membership, fit = fit)
}
cluster_stats_pchart <- function(x, n_i, U, m = 2){
Um <- U^m
wsum <- colSums(Um) + 1e-12
mu <- colSums(Um * x) / wsum
diff2 <- (matrix(x, nrow = length(x), ncol = ncol(U)) -
matrix(mu, nrow = length(x), ncol = ncol(U), byrow = TRUE))^2
sig2_total <- colSums(Um * diff2) / wsum
nbar <- colSums(Um * n_i) / wsum
sig2_between <- pmax(
0,
(sig2_total - (mu * (1 - mu)) / nbar) / (1 - 1 / nbar)
)
list(mu = mu, sig2_total = sig2_total, nbar = nbar, sig2_between = sig2_between)
}
reorder_by_mu <- function(mu, U, centers, sig2_total, nbar, sig2_between){
ord <- order(mu)
list(
centers = centers[ord],
U = U[, ord, drop = FALSE],
mu = mu[ord],
sig2_total = sig2_total[ord],
nbar = nbar[ord],
sig2_between = sig2_between[ord]
)
}
limits_regime_adaptive <- function(x, n_i, U, mu, sig2_between, k = 3){
j_star <- max.col(U, ties.method = "first")
mu_star <- mu[j_star]
sig2b_star <- sig2_between[j_star]
var_hat <- (mu_star * (1 - mu_star)) / n_i + sig2b_star * (1 - 1 / n_i)
sd_hat <- sqrt(pmax(var_hat, 0))
ucl <- pmin(1, pmax(0, mu_star + k * sd_hat))
lcl <- pmin(1, pmax(0, mu_star - k * sd_hat))
list(cluster = j_star, lcl = lcl, ucl = ucl, cl = mu_star, var_hat = var_hat)
}
cluster_limits_rep <- function(mu, nbar, sig2_between, k = 3){
var_rep <- (mu * (1 - mu)) / nbar + sig2_between * (1 - 1 / nbar)
sd_rep <- sqrt(pmax(var_rep, 0))
data.frame(
cluster = seq_along(mu),
CL_rep = mu,
LCL_rep = pmin(1, pmax(0, mu - k * sd_rep)),
UCL_rep = pmin(1, pmax(0, mu + k * sd_rep))
)
}
p_fuzzy_global <- function(mu, dom_count){
if(sum(dom_count) == 0) return(NA_real_)
w <- dom_count / sum(dom_count)
sum(w * mu)
}
enforce_u_cols <- function(df_in, K_MAX = 10, after_col = "signal"){
u_cols <- paste0("u_", 1:K_MAX)
for(uc in u_cols){
if(!uc %in% names(df_in)) df_in[[uc]] <- NA_real_
}
df_in[u_cols] <- lapply(df_in[u_cols], function(v) as.numeric(v))
if(after_col %in% names(df_in)){
df_in <- df_in %>% dplyr::relocate(dplyr::all_of(u_cols), .after = dplyr::all_of(after_col))
} else {
df_in <- df_in %>% dplyr::relocate(dplyr::all_of(u_cols), .after = dplyr::last_col())
}
df_in
}
# ---------- Validation indices ----------
partition_coefficient <- function(U){
mean(rowSums(U^2))
}
partition_entropy <- function(U){
U_safe <- pmax(U, 1e-12)
-mean(rowSums(U_safe * log(U_safe)))
}
modified_partition_coefficient <- function(U){
c <- ncol(U)
pc <- partition_coefficient(U)
(pc - 1 / c) / (1 - 1 / c)
}
davies_bouldin_1d <- function(x, clusters){
labs <- sort(unique(clusters))
centers <- sapply(labs, function(g) mean(x[clusters == g]))
S <- sapply(labs, function(g) mean(abs(x[clusters == g] - mean(x[clusters == g]))))
R <- matrix(NA_real_, length(labs), length(labs))
for(a in seq_along(labs)){
for(b in seq_along(labs)){
if(a != b){
R[a, b] <- (S[a] + S[b]) / (abs(centers[a] - centers[b]) + 1e-12)
}
}
}
mean(apply(R, 1, max, na.rm = TRUE))
}
calinski_harabasz_1d <- function(x, clusters){
N <- length(x)
labs <- sort(unique(clusters))
K <- length(labs)
grand_mean <- mean(x)
B <- sum(sapply(labs, function(g) sum(clusters == g) * (mean(x[clusters == g]) - grand_mean)^2))
W <- sum(sapply(labs, function(g) sum((x[clusters == g] - mean(x[clusters == g]))^2)))
(B / (K - 1)) / (W / (N - K) + 1e-12)
}
silhouette_hard_1d <- function(x, clusters){
if(length(unique(clusters)) < 2) return(NA_real_)
sil <- cluster::silhouette(clusters, dist(as.matrix(x)))
mean(sil[, "sil_width"])
}
gap_approx_1d <- function(x, K, B = 20, m = 2){
# Gap-type approximation using FCM: compares observed log(Wk) against uniform references.
W_obs <- local({
fcm <- run_fcm_1d(x, centers = K, m = m, seed = 100 + K)
U <- fcm$U; centers <- fcm$centers
sum((U^m) * (as.numeric(outer(x, centers, "-"))^2))
})
refs <- replicate(B, {
xr <- runif(length(x), min(x), max(x))
fcmr <- run_fcm_1d(xr, centers = K, m = m, seed = sample.int(1e6, 1))
sum((fcmr$U^m) * (as.numeric(outer(xr, fcmr$centers, "-"))^2))
})
mean(log(refs + 1e-12)) - log(W_obs + 1e-12)
}
5. Data loading and preparation
Calculation context. The FCA p'-chart works with attribute data expressed as proportions. Therefore, before applying FCM and control limits, the code transforms the original Excel columns into four minimum variables: week, sample size, number of discarded units, and observed proportion.
\[
p_i=\frac{X_i}{n_i},\qquad 0\leq X_i\leq n_i,\qquad 0\leq p_i\leq 1
\]
Displayed output. The initial table confirms that the cleaning step was correct: there should be no sample sizes below 1, negative discard counts, or proportions outside the interval [0,1].
# ==========================================================
# 3. Dataset loading and preparation
# ==========================================================
if(!file.exists(DATA_PATH)){
stop(paste0(
"File not found: ", DATA_PATH,
". Place EX_DATA.xlsx in the working directory or update DATA_PATH."
))
}
EX_DATA <- readxl::read_excel(DATA_PATH)
required_cols <- c(
"Semana",
"NĂºmero de muestras tomadas",
"NĂºmero de muestras descartadas (no viables)"
)
missing_cols <- setdiff(required_cols, names(EX_DATA))
if(length(missing_cols) > 0){
stop(paste("Missing columns in the Excel file:", paste(missing_cols, collapse = ", ")))
}
df <- EX_DATA %>%
dplyr::rename(
week = `Semana`,
n_taken = `NĂºmero de muestras tomadas`,
x_discarded = `NĂºmero de muestras descartadas (no viables)`
) %>%
dplyr::mutate(
week = as.integer(round(week)),
n = as.integer(round(n_taken)),
X = as.integer(round(x_discarded)),
n = pmax(n, 1L),
X = pmin(pmax(X, 0L), n),
p = X / n
) %>%
dplyr::select(week, n, X, p)
N <- nrow(df)
windows <- make_windows(N, exploratory, window_size)
knitr::kable(head(df, 10), caption = "First observations prepared for the FCA p'-chart")
First observations prepared for the FCA p'-chart
| week |
n |
X |
p |
| 1 |
13477 |
3850 |
0.2856719 |
| 2 |
12857 |
5330 |
0.4145602 |
| 3 |
12559 |
2690 |
0.2141890 |
| 4 |
10223 |
10223 |
1.0000000 |
| 5 |
11894 |
3950 |
0.3321002 |
| 6 |
13240 |
4340 |
0.3277946 |
| 7 |
12980 |
7830 |
0.6032357 |
| 8 |
10439 |
4590 |
0.4396973 |
| 9 |
8742 |
4720 |
0.5399222 |
| 10 |
11331 |
3010 |
0.2656429 |
data_quality <- data.frame(
observations = N,
min_week = min(df$week, na.rm = TRUE),
max_week = max(df$week, na.rm = TRUE),
min_n = min(df$n, na.rm = TRUE),
max_n = max(df$n, na.rm = TRUE),
min_p = min(df$p, na.rm = TRUE),
mean_p = mean(df$p, na.rm = TRUE),
max_p = max(df$p, na.rm = TRUE),
windows_created = length(windows)
)
knitr::kable(data_quality, digits = 4, caption = "Data quality and dataset structure summary")
Data quality and dataset structure summary
| observations |
min_week |
max_week |
min_n |
max_n |
min_p |
mean_p |
max_p |
windows_created |
| 11539 |
1 |
11539 |
152 |
29670 |
0.0016 |
0.5912 |
1 |
13 |
How to interpret this output. If min_p and max_p are within [0,1], the proportional variable was correctly constructed. The field windows_created indicates how many windows will be analyzed by the sequential procedure.
6. Overdispersion diagnosis
Calculation context. The classical p-chart assumes that proportion variability comes mainly from a binomial distribution. In waste-collection or service systems with heterogeneous regimes, observed variability is often greater than expected. This block quantifies that difference through a variance ratio.
A global variance ratio is computed. If the observed variance of the proportions substantially exceeds the expected binomial variance, there is evidence of overdispersion.
\[
\hat{p}=\frac{\sum_i X_i}{\sum_i n_i},\qquad
S_{bin}^{2}=\frac{1}{N}\sum_{i=1}^{N}\frac{\hat{p}(1-\hat{p})}{n_i},
\qquad
S_T^2=\frac{1}{N-1}\sum_{i=1}^{N}(p_i-\bar{p})^2
\]
\[
VR=\frac{S_T^2}{S_{bin}^{2}}
\]
# ==========================================================
# 4. Analytical and graphical overdispersion diagnosis
# ==========================================================
p_pool <- sum(df$X) / sum(df$n)
sampling_var <- mean(p_pool * (1 - p_pool) / df$n)
total_var <- stats::var(df$p)
variance_ratio <- total_var / sampling_var
od_table <- data.frame(
N = N,
p_pool = p_pool,
sampling_variance = sampling_var,
total_variance = total_var,
variance_ratio = variance_ratio,
threshold_Heiman = 1.357,
diagnosis = ifelse(variance_ratio > 1.357, "Overdispersion detected", "No severe overdispersion")
)
knitr::kable(od_table, digits = 6, caption = "Analytical overdispersion diagnosis")
Analytical overdispersion diagnosis
| N |
p_pool |
sampling_variance |
total_variance |
variance_ratio |
threshold_Heiman |
diagnosis |
| 11539 |
0.556322 |
3.2e-05 |
0.087079 |
2716.212 |
1.357 |
Overdispersion detected |
# Arcsine-square-root transformation for visual diagnosis
od_plot_df <- df %>%
dplyr::mutate(z = asin(sqrt(p)))
ggplot(od_plot_df, aes(sample = z)) +
stat_qq(size = 1.2, alpha = 0.65) +
stat_qq_line(linewidth = 0.55) +
labs(
title = "Graphical overdispersion diagnosis",
subtitle = "Normal Q-Q plot of the arcsine-square-root transformation",
x = "Theoretical quantiles", y = "Transformed sample quantiles"
) +
theme_minimal(base_size = PLOT_BASE_SIZE)

Result and interpretation. If variance_ratio is greater than 1.357, the process shows overdispersion under the analytical criterion used. In that case, a conventional p-chart may produce overly narrow limits and false signals. The Q-Q plot complements this diagnosis: marked deviations from the reference line suggest that empirical variability does not fit the ideal binomial pattern well.
7. Validation of the number of clusters
Calculation context. Before constructing adaptive limits, the analysis evaluates how many latent regimes can adequately represent the behavior of the proportion. The code tests \(K=2,\ldots,10\) and computes complementary indices.
Generated plot. The faceted plot shows the evolution of each index against \(K\). The recommended decision is based on consistency across indices, not on a single rule. For example, a reasonable \(K\) usually shows relatively high PC/MPC/SIL/CH and relatively low DB/PE.
# ==========================================================
# 5. Validation indices for K = 2,...,10
# ==========================================================
set.seed(123)
validity_rows <- list()
x_all <- df$p
for(K in k_min:k_max){
fcm <- run_fcm_1d(x_all, centers = K, m = m_fuzz, seed = 1000 + K)
U <- fcm$U
clusters <- max.col(U, ties.method = "first")
validity_rows[[length(validity_rows) + 1]] <- data.frame(
K = K,
PC = partition_coefficient(U),
PE = partition_entropy(U),
MPC = modified_partition_coefficient(U),
SIL_F_approx = silhouette_hard_1d(x_all, clusters),
DB = davies_bouldin_1d(x_all, clusters),
CH = calinski_harabasz_1d(x_all, clusters),
GAP_approx = gap_approx_1d(x_all, K, B = 20, m = m_fuzz)
)
}
validity_indices <- dplyr::bind_rows(validity_rows)
knitr::kable(validity_indices, digits = 4, caption = "Cluster validation indices")
Cluster validation indices
| K |
PC |
PE |
MPC |
SIL_F_approx |
DB |
CH |
GAP_approx |
| 2 |
0.8726 |
0.2177 |
0.7451 |
0.6673 |
0.4502 |
40716.30 |
0.0835 |
| 3 |
0.8236 |
0.3220 |
0.7355 |
0.6302 |
0.4904 |
55698.78 |
0.1353 |
| 4 |
0.7910 |
0.3972 |
0.7213 |
0.6067 |
0.5070 |
68179.34 |
0.1367 |
| 5 |
0.7880 |
0.4189 |
0.7350 |
0.6161 |
0.4882 |
86885.24 |
0.1985 |
| 6 |
0.7745 |
0.4568 |
0.7294 |
0.6112 |
0.4922 |
101313.90 |
0.1973 |
| 7 |
0.7676 |
0.4801 |
0.7289 |
0.6074 |
0.4936 |
115481.84 |
0.2040 |
| 8 |
0.7682 |
0.4876 |
0.7351 |
0.6155 |
0.4899 |
135049.97 |
0.2311 |
| 9 |
0.7644 |
0.5025 |
0.7349 |
0.6153 |
0.4850 |
151716.67 |
0.2427 |
| 10 |
0.7585 |
0.5198 |
0.7317 |
0.6105 |
0.4899 |
165648.00 |
0.2369 |
validity_long <- validity_indices %>%
tidyr::pivot_longer(cols = -K, names_to = "Index", values_to = "Value")
ggplot(validity_long, aes(x = K, y = Value)) +
geom_line(linewidth = 0.8) +
geom_point(size = 2) +
facet_wrap(~ Index, scales = "free_y", ncol = 2) +
scale_x_continuous(breaks = k_min:k_max) +
labs(
title = "Validation of the number of clusters",
subtitle = "PC, MPC, SIL_F, and CH are maximized; PE and DB are minimized; GAP is analyzed by maximum value or elbow criterion",
x = "Number of clusters K", y = "Index value"
) +
theme_minimal(base_size = PLOT_BASE_SIZE)

best_summary <- data.frame(
criterion = c("max_PC", "min_PE", "max_MPC", "max_SIL_F", "min_DB", "max_CH", "max_GAP"),
selected_K = c(
validity_indices$K[which.max(validity_indices$PC)],
validity_indices$K[which.min(validity_indices$PE)],
validity_indices$K[which.max(validity_indices$MPC)],
validity_indices$K[which.max(validity_indices$SIL_F_approx)],
validity_indices$K[which.min(validity_indices$DB)],
validity_indices$K[which.max(validity_indices$CH)],
validity_indices$K[which.max(validity_indices$GAP_approx)]
)
)
knitr::kable(best_summary, caption = "K suggested by each criterion")
K suggested by each criterion
| criterion |
selected_K |
| max_PC |
2 |
| min_PE |
2 |
| max_MPC |
2 |
| max_SIL_F |
2 |
| min_DB |
2 |
| max_CH |
10 |
| max_GAP |
9 |
write_xlsx(list(Cluster_Validity = validity_indices, Best_K_By_Criterion = best_summary),
path = file.path(base_out_dir, "cluster_validity_indices.xlsx"))
Result and interpretation. The table K suggested by each criterion summarizes which number of clusters is favored by each index. If several indices converge on the same value of \(K\), there is evidence of a stable latent structure. If the indices disagree, the solution should be assessed using operational criteria: regime interpretability, membership stability, and number of OOC signals.
8. Main execution of the FCA p'-chart
Calculation context. This is the central stage of the document. For each number of clusters, the procedure recalibrates the model by window. This allows the chart to avoid depending on a single global mean and instead use central lines and variances specific to each latent operational regime.
The routine loops over \(K=2,\ldots,10\), divides the series into an initial exploratory window and sequential windows, computes FCM, estimates cluster-level parameters, constructs adaptive limits, and exports the results.
8.1. Calculations performed within each window
- FCM: estimates the membership matrix \(U=[u_{ij}]\).
- Dominant cluster: assigns each observation to the regime with the highest membership.
- CCL: central line conditioned on the dominant cluster.
- LCL/UCL: adaptive limits considering sample size and residual overdispersion.
- OOC signal: identifies observations outside the limits of the corresponding regime.
- Memberships: keeps columns \(u_1,\ldots,u_{10}\) for traceability and subsequent analysis.
\[
CCL_i=\hat{\mu}_{j_i^*},\qquad
LCL_i=\max\{0,CCL_i-k\hat{\sigma}_{i|j_i^*}\},\qquad
UCL_i=\min\{1,CCL_i+k\hat{\sigma}_{i|j_i^*}\}
\]
Plots generated by window. For each window, two plots are created: (1) the FCA p'-chart with points, limits, and OOC signals; and (2) the dominant-membership diagnostic, which verifies whether observations are clearly assigned to a regime or whether transition zones exist.
# ==========================================================
# 6. Main loop: FCA p'-chart for K = 2,...,10
# ==========================================================
master_detail_rows <- list()
master_summary_cluster_rows <- list()
master_summary_window_pfuzzy_rows <- list()
master_summary_window_rows <- list()
master_cluster_dist_rows <- list()
for(c_clusters in k_min:k_max){
message("Running c_clusters = ", c_clusters)
out_dir <- file.path(base_out_dir, sprintf("K_%02d", c_clusters))
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
out_pdf <- file.path(out_dir, sprintf("pcharts_fuzzy_windowing_report_%dclusters.pdf", c_clusters))
cluster_cols <- get_cluster_palette(c_clusters)
detail_rows <- list()
summary_rows <- list()
summary_window_rows <- list()
grDevices::pdf(out_pdf, width = FIG_W_LAND, height = FIG_H_LAND, onefile = TRUE)
for(w in windows){
wid <- w$id
s <- w$start
e <- w$end
ww <- df[s:e, , drop = FALSE]
x <- ww$p
n_i <- ww$n
fcm <- run_fcm_1d(x, centers = c_clusters, m = m_fuzz, seed = wid + 1)
stats <- cluster_stats_pchart(x, n_i, fcm$U, m = m_fuzz)
re <- reorder_by_mu(
stats$mu, fcm$U, fcm$centers,
stats$sig2_total, stats$nbar, stats$sig2_between
)
U <- re$U
mu <- re$mu
sig2_total <- re$sig2_total
nbar <- re$nbar
sig2_between <- re$sig2_between
lim <- limits_regime_adaptive(x, n_i, U, mu, sig2_between, k = k_sigma)
out <- ww %>%
dplyr::mutate(
window_id = wid,
idx_global = s:e,
k_clusters = c_clusters,
cluster = lim$cluster,
CCL = lim$cl,
LCL = lim$lcl,
UCL = lim$ucl,
var_hat = lim$var_hat,
signal = (p > UCL) | (p < LCL),
cluster_f = factor(cluster, levels = 1:c_clusters),
ooc_flag = signal,
ooc_lab = factor(ifelse(ooc_flag, "OOC", NA_character_), levels = "OOC")
)
for(j in 1:K_MAX){
out[[paste0("u_", j)]] <- if(j <= c_clusters) U[, j] else NA_real_
}
out <- enforce_u_cols(out, K_MAX = K_MAX, after_col = "signal")
detail_rows[[length(detail_rows) + 1]] <- out
rep_limits <- cluster_limits_rep(mu, nbar, sig2_between, k = k_sigma)
dom_counts <- integer(c_clusters)
for(j in 1:c_clusters){
dom_count <- sum(out$cluster == j)
dom_counts[j] <- dom_count
avg_max_mem <- if(dom_count > 0) mean(out[[paste0("u_", j)]][out$cluster == j]) else NA_real_
signals_j <- sum(out$signal & out$cluster == j)
rep_row <- rep_limits %>% dplyr::filter(cluster == j)
summary_rows[[length(summary_rows) + 1]] <- data.frame(
k_clusters = c_clusters,
window_id = wid,
start_idx = s,
end_idx = e,
n_obs = nrow(out),
cluster = j,
mu_hat = mu[j],
sig2_total = sig2_total[j],
nbar = nbar[j],
sig2_between_hat = sig2_between[j],
CL_rep = rep_row$CL_rep,
LCL_rep = rep_row$LCL_rep,
UCL_rep = rep_row$UCL_rep,
dominant_count = dom_count,
avg_max_membership = avg_max_mem,
signals = signals_j
)
}
p_fg <- p_fuzzy_global(mu, dom_counts)
summary_window_rows[[length(summary_window_rows) + 1]] <- data.frame(
k_clusters = c_clusters,
window_id = wid,
start_idx = s,
end_idx = e,
n_obs = nrow(out),
p_fuzzy_global = p_fg
)
# ---------- Plot 1: FCA p-chart by window ----------
lim_df <- limits_long(out)
x_start <- min(out$week)
x_end <- max(out$week)
g1 <- ggplot(out, aes(x = week)) +
geom_line(
data = lim_df,
aes(x = week, y = value, linetype = limit, group = limit,
linewidth = linewidth, alpha = alpha),
color = "black"
) +
scale_linetype_manual(name = "Control limits", values = CTRL_LTY) +
scale_linewidth_identity(guide = "none") +
scale_alpha_identity(guide = "none") +
geom_line(aes(y = p, color = cluster_f), linewidth = 0.35, alpha = 0.9) +
geom_point(aes(y = p, color = cluster_f), size = 0.75, alpha = 0.9) +
geom_point(
data = out %>% dplyr::filter(ooc_flag),
aes(y = p, shape = ooc_lab),
color = "red", size = 1.5, stroke = 0.8
) +
geom_hline(yintercept = -0.02, color = "black", linewidth = 0.55) +
scale_color_manual(values = cluster_cols, limits = as.character(1:c_clusters),
name = "Cluster", drop = FALSE) +
scale_shape_manual(name = NULL, values = c("OOC" = 4), drop = FALSE) +
coord_cartesian(ylim = c(-0.02, 1.02)) +
labs(
title = sprintf("Window %d (rows %d-%d, n=%d): FCA p-Chart - %d clusters",
wid, s, e, nrow(out), c_clusters),
x = "Weeks", y = expression(p[i])
) +
scale_x_continuous(breaks = c(x_start, x_end), labels = c(x_start, x_end),
minor_breaks = NULL, expand = c(0, 0)) +
theme_classic(base_size = PLOT_BASE_SIZE) +
theme(plot.title = element_text(face = "bold", size = 12), legend.position = "right",
axis.text.x = element_text(size = 8), axis.ticks.x = element_blank(),
axis.line.y = element_line(color = "black", linewidth = 0.55),
axis.line.x = element_blank())
print(g1)
ggsave(file.path(out_dir, sprintf("window_%02d_pchart_%dclusters.png", wid, c_clusters)),
g1, width = FIG_W_LAND, height = FIG_H_LAND, units = "in", dpi = DPI_EXPORT)
# ---------- Plot 2: membership diagnostic ----------
dom_u <- sapply(1:nrow(out), function(i) U[i, out$cluster[i]])
out2 <- out %>% dplyr::mutate(dom_membership = dom_u)
g2 <- ggplot(out2, aes(x = p, y = dom_membership, color = cluster_f)) +
geom_point(size = 1.3, alpha = 0.75) +
scale_color_manual(values = cluster_cols, limits = as.character(1:c_clusters),
name = "Cluster", drop = FALSE) +
coord_cartesian(xlim = c(-0.02, 1.02), ylim = c(-0.02, 1.02)) +
labs(
title = sprintf("Window %d: Membership diagnostics - %d clusters", wid, c_clusters),
subtitle = "Dominant membership degree vs observed proportion p",
x = expression(p[i]), y = "Dominant membership degree"
) +
theme_classic(base_size = PLOT_BASE_SIZE) +
theme(plot.title = element_text(face = "bold", size = 12), legend.position = "right")
print(g2)
ggsave(file.path(out_dir, sprintf("window_%02d_membership_%dclusters.png", wid, c_clusters)),
g2, width = FIG_W_LAND, height = FIG_H_LAND, units = "in", dpi = DPI_EXPORT)
}
grDevices::dev.off()
# ---------- Export by K ----------
detail_all <- dplyr::bind_rows(detail_rows) %>%
dplyr::mutate(cluster_f = factor(cluster, levels = 1:c_clusters))
detail_all <- enforce_u_cols(detail_all, K_MAX = K_MAX, after_col = "signal")
summary_cluster_all <- dplyr::bind_rows(summary_rows)
summary_window_pfuzzy <- dplyr::bind_rows(summary_window_rows)
summary_window_all <- detail_all %>%
dplyr::group_by(k_clusters, window_id) %>%
dplyr::summarise(
n_obs = dplyr::n(),
signals = sum(signal),
signal_rate = mean(signal),
p_mean = mean(p),
p_sd = sd(p),
.groups = "drop"
)
cluster_dist <- detail_all %>%
dplyr::group_by(k_clusters, window_id, cluster_f) %>%
dplyr::summarise(count = dplyr::n(), signals = sum(signal), .groups = "drop") %>%
dplyr::group_by(k_clusters, window_id) %>%
dplyr::mutate(pct_within_window = count / sum(count)) %>%
dplyr::ungroup()
writexl::write_xlsx(
list(
All_Observations = detail_all,
Window_Cluster_Summary = summary_cluster_all,
Window_pFuzzy_Global = summary_window_pfuzzy,
Window_Summary = summary_window_all,
Window_Cluster_Distribution = cluster_dist
),
path = file.path(out_dir, sprintf("FCA_p_results_%dclusters.xlsx", c_clusters))
)
# ---------- Global plots by K ----------
lim_df_all <- dplyr::bind_rows(
data.frame(idx_global = detail_all$idx_global, window_id = detail_all$window_id,
limit = "UCL", value = detail_all$UCL),
data.frame(idx_global = detail_all$idx_global, window_id = detail_all$window_id,
limit = "LCL", value = detail_all$LCL),
data.frame(idx_global = detail_all$idx_global, window_id = detail_all$window_id,
limit = "CCL", value = detail_all$CCL)
) %>%
dplyr::mutate(limit = factor(limit, levels = c("UCL", "LCL", "CCL")),
linewidth = unname(CTRL_LWD[as.character(limit)]),
alpha = unname(CTRL_ALPHA[as.character(limit)]))
g_global_facet <- ggplot(detail_all, aes(x = idx_global)) +
geom_line(data = lim_df_all,
aes(x = idx_global, y = value, linetype = limit,
group = interaction(window_id, limit), linewidth = linewidth, alpha = alpha),
color = "black") +
scale_linetype_manual(name = "Control limits", values = CTRL_LTY) +
scale_linewidth_identity(guide = "none") +
scale_alpha_identity(guide = "none") +
geom_point(aes(y = p, color = cluster_f), size = 1.0, alpha = 0.85) +
geom_point(data = detail_all %>% dplyr::filter(signal),
aes(y = p, shape = "OOC"), color = "red", size = 2.0, stroke = 1.0) +
scale_color_manual(values = cluster_cols, limits = as.character(1:c_clusters),
name = "Cluster", drop = FALSE) +
scale_shape_manual(name = NULL, values = c("OOC" = 4)) +
facet_wrap(~ window_id, scales = "free_x", ncol = 2) +
coord_cartesian(ylim = c(-0.02, 1.02)) +
labs(title = sprintf("Global FCA p-Chart faceted by window - %d clusters", c_clusters),
subtitle = "Limits computed window-by-window", x = "Global index", y = expression(p[i])) +
theme_minimal(base_size = PLOT_BASE_SIZE)
ggsave(file.path(out_dir, sprintf("GLOBAL_pchart_facet_by_window_%dclusters.png", c_clusters)),
g_global_facet, width = FIG_W_LAND, height = FIG_H_LAND, units = "in", dpi = DPI_EXPORT)
g_global_single <- ggplot(detail_all, aes(x = idx_global)) +
geom_line(aes(y = UCL), color = "black", linewidth = 0.35,
alpha = CTRL_ALPHA["UCL"], linetype = CTRL_LTY["UCL"]) +
geom_line(aes(y = LCL), color = "black", linewidth = 0.35,
alpha = CTRL_ALPHA["LCL"], linetype = CTRL_LTY["LCL"]) +
geom_line(aes(y = CCL), color = "black", linewidth = 0.25,
alpha = CTRL_ALPHA["CCL"], linetype = CTRL_LTY["CCL"]) +
geom_point(aes(y = p, color = cluster_f), size = 0.6, alpha = 0.6) +
geom_point(data = detail_all %>% dplyr::filter(signal),
aes(y = p, shape = "OOC"), color = "red", size = 1.6, stroke = 0.9) +
scale_color_manual(values = cluster_cols, limits = as.character(1:c_clusters),
name = "Cluster", drop = FALSE) +
scale_shape_manual(name = NULL, values = c("OOC" = 4)) +
coord_cartesian(ylim = c(-0.02, 1.02)) +
labs(title = sprintf("Global FCA p-Chart single view - %d clusters", c_clusters),
subtitle = "All observations in one panel", x = "Global index", y = expression(p[i])) +
theme_minimal(base_size = PLOT_BASE_SIZE)
ggsave(file.path(out_dir, sprintf("GLOBAL_pchart_single_%dclusters.png", c_clusters)),
g_global_single, width = FIG_W_LAND, height = FIG_H_LAND, units = "in", dpi = DPI_EXPORT)
master_detail_rows[[length(master_detail_rows) + 1]] <- detail_all
master_summary_cluster_rows[[length(master_summary_cluster_rows) + 1]] <- summary_cluster_all
master_summary_window_pfuzzy_rows[[length(master_summary_window_pfuzzy_rows) + 1]] <- summary_window_pfuzzy
master_summary_window_rows[[length(master_summary_window_rows) + 1]] <- summary_window_all
master_cluster_dist_rows[[length(master_cluster_dist_rows) + 1]] <- cluster_dist
}
Result and interpretation. At the end of this block, folders from K_02 to K_10 will have been created. Each folder contains: window PDFs, PNG images, Excel files with observations, memberships, limits, signals, and cluster summaries. An OOC signal should not automatically be interpreted as an error; it should be reviewed together with its dominant cluster, membership degree, and time window.
9. Master summary and final comparative plots
Calculation context. After running all \(K\) solutions, this block consolidates the results to compare sensitivity and robustness. The objective is not to automatically choose the \(K\) with the fewest signals, but to observe how the OOC rate changes when the model represents more or fewer latent regimes.
\[
TotalSignals_K=\sum_{w}\sum_{i\in w}Signal_i,\qquad
MeanSignalRate_K=\frac{1}{W}\sum_{w=1}^{W}SignalRate_w
\]
Generated plot. The plot of total signals by \(K\) compares monitoring sensitivity. A sharp drop in signals as \(K\) increases may indicate that regimes were previously mixed; an excessive drop may suggest over-segmentation if not supported by validity indices.
General plot for the selected K. In addition to the window-level plots, the document generates a general view for K_ELEGIDO. This plot shows the full monitored series in a single panel, colors each observation according to its dominant cluster, overlays the CCL, LCL, and UCL lines, and highlights OOC signals. This is the main plot for reporting how the process behaves under the selected solution, for example, 4 clusters or more.
# ==========================================================
# 7. Master export and summary plots
# ==========================================================
master_detail_all <- dplyr::bind_rows(master_detail_rows)
master_summary_cluster_all <- dplyr::bind_rows(master_summary_cluster_rows)
master_summary_window_pfuzzy_all <- dplyr::bind_rows(master_summary_window_pfuzzy_rows)
master_summary_window_all <- dplyr::bind_rows(master_summary_window_rows)
master_cluster_dist_all <- dplyr::bind_rows(master_cluster_dist_rows)
master_detail_all <- enforce_u_cols(master_detail_all, K_MAX = K_MAX, after_col = "signal")
master_xlsx <- file.path(base_out_dir, sprintf("MASTER_FCA_p_results_K%02d_to_K%02d.xlsx", k_min, k_max))
writexl::write_xlsx(
list(
All_Observations_AllK = master_detail_all,
Window_Cluster_Summary_AllK = master_summary_cluster_all,
Window_pFuzzy_Global_AllK = master_summary_window_pfuzzy_all,
Window_Summary_AllK = master_summary_window_all,
Window_Cluster_Distribution_AllK = master_cluster_dist_all,
Cluster_Validity = validity_indices
),
path = master_xlsx
)
signal_by_K <- master_summary_window_all %>%
dplyr::group_by(k_clusters) %>%
dplyr::summarise(
total_signals = sum(signals),
mean_signal_rate = mean(signal_rate),
.groups = "drop"
)
knitr::kable(signal_by_K, digits = 4, caption = "OOC signal summary by number of clusters")
OOC signal summary by number of clusters
| k_clusters |
total_signals |
mean_signal_rate |
| 2 |
0 |
0.0000 |
| 3 |
6 |
0.0005 |
| 4 |
99 |
0.0093 |
| 5 |
159 |
0.0139 |
| 6 |
156 |
0.0131 |
| 7 |
171 |
0.0142 |
| 8 |
154 |
0.0129 |
| 9 |
145 |
0.0125 |
| 10 |
150 |
0.0129 |
ggplot(signal_by_K, aes(x = k_clusters, y = total_signals)) +
geom_line(linewidth = 0.55) +
geom_point(size = 2.5) +
scale_x_continuous(breaks = k_min:k_max) +
labs(
title = "Total out-of-control signals by K",
subtitle = "Compares the sensitivity of the FCA p'-chart against the number of clusters",
x = "Number of clusters K", y = "Total OOC signals"
) +
theme_minimal(base_size = PLOT_BASE_SIZE)

# Simple automatic recommendation: lowest number of signals with validity balance.
recommended_K_by_signals <- signal_by_K$k_clusters[which.min(signal_by_K$total_signals)]
cat("Output directory:", normalizePath(base_out_dir), "\n")
## Output directory: /Users/jeanpaul/Documents/Master's/Software_registration/FCA_p-chart/outputs_pcharts_rhtml
cat("Archivo maestro:", normalizePath(master_xlsx), "\n")
## Archivo maestro: /Users/jeanpaul/Documents/Master's/Software_registration/FCA_p-chart/outputs_pcharts_rhtml/MASTER_FCA_p_results_K02_to_K10.xlsx
cat("K selected for the general plot:", K_ELEGIDO, "\n")
## K selected for the general plot: 4
cat("K with the lowest number of OOC signals:", recommended_K_by_signals, "\n")
## K with the lowest number of OOC signals: 2
10. General FCA p'-chart plot for the selected number of clusters
Calculation context. Although the algorithm generates charts by window, academic and managerial communication often requires a single view of process behavior under the finally selected number of clusters. This section filters the consolidated results for K_ELEGIDO and builds a general chart of the complete process.
The general plot makes it possible to observe four elements simultaneously: the observed proportion \(p_i\), the dominant cluster of each observation, the adaptive limits \(LCL_i, CCL_i, UCL_i\), and the out-of-control signals. Mathematically, for the selected number of clusters \(K^*\), each observation is represented by:
\[
\left(p_i, j_i^*, CCL_i, LCL_i, UCL_i, Signal_i\right)\quad\text{con}\quad K^*=\texttt{K\_ELEGIDO}
\]
Where \(j_i^*\) is the dominant cluster. The main reading is: if a red point appears outside the limits, the deviation should be interpreted relative to the operational regime to which it belongs, not relative to a single global mean.
# ==========================================================
# 8. General plot for the selected number of clusters
# ==========================================================
selected_detail <- master_detail_all %>%
dplyr::filter(k_clusters == K_ELEGIDO) %>%
dplyr::mutate(
cluster_f = factor(cluster, levels = 1:K_ELEGIDO),
signal_label = factor(ifelse(signal, "OOC", NA_character_), levels = "OOC")
)
if(nrow(selected_detail) == 0){
stop("No hay resultados disponibles para K_ELEGIDO. Verifique que el loop principal haya incluido ese K.")
}
selected_detail$dom_membership <- mapply(
FUN = function(row_id, cl){
selected_detail[[paste0("u_", cl)]][row_id]
},
row_id = seq_len(nrow(selected_detail)),
cl = selected_detail$cluster
)
selected_cols <- get_cluster_palette(K_ELEGIDO)
selected_summary <- selected_detail %>%
dplyr::summarise(
selected_K = unique(k_clusters),
observations = dplyr::n(),
windows = dplyr::n_distinct(window_id),
total_signals = sum(signal),
signal_rate = mean(signal),
p_mean = mean(p),
p_sd = sd(p),
.groups = "drop"
)
knitr::kable(selected_summary, digits = 4,
caption = paste("General summary of the FCA p'-chart for selected K =", K_ELEGIDO))
General summary of the FCA p'-chart for selected K = 4
| selected_K |
observations |
windows |
total_signals |
signal_rate |
p_mean |
p_sd |
| 4 |
11539 |
13 |
99 |
0.0086 |
0.5912 |
0.2951 |
selected_cluster_summary <- selected_detail %>%
dplyr::group_by(cluster) %>%
dplyr::summarise(
observations = dplyr::n(),
porcentaje = observations / nrow(selected_detail),
signals = sum(signal),
tasa_signals = mean(signal),
p_promedio = mean(p),
CCL_promedio = mean(CCL),
LCL_promedio = mean(LCL),
UCL_promedio = mean(UCL),
membresia_promedio = mean(dom_membership, na.rm = TRUE),
.groups = "drop"
)
knitr::kable(selected_cluster_summary, digits = 4,
caption = paste("General characterization by cluster for selected K =", K_ELEGIDO))
General characterization by cluster for selected K = 4
| cluster |
observations |
porcentaje |
signals |
tasa_signals |
p_promedio |
CCL_promedio |
LCL_promedio |
UCL_promedio |
membresia_promedio |
| 1 |
2759 |
0.2391 |
4 |
0.0014 |
0.2232 |
0.2198 |
0.0133 |
0.4372 |
0.8615 |
| 2 |
3192 |
0.2766 |
0 |
0.0000 |
0.4429 |
0.4411 |
0.2382 |
0.6439 |
0.8089 |
| 3 |
2323 |
0.2013 |
0 |
0.0000 |
0.6932 |
0.6923 |
0.4694 |
0.9153 |
0.8070 |
| 4 |
3265 |
0.2830 |
95 |
0.0291 |
0.9746 |
0.9811 |
0.8539 |
1.0000 |
0.9449 |
selected_limits_long <- dplyr::bind_rows(
data.frame(idx_global = selected_detail$idx_global, window_id = selected_detail$window_id,
limit = "UCL", value = selected_detail$UCL),
data.frame(idx_global = selected_detail$idx_global, window_id = selected_detail$window_id,
limit = "LCL", value = selected_detail$LCL),
data.frame(idx_global = selected_detail$idx_global, window_id = selected_detail$window_id,
limit = "CCL", value = selected_detail$CCL)
) %>%
dplyr::mutate(
limit = factor(limit, levels = c("UCL", "LCL", "CCL")),
linewidth = unname(CTRL_LWD[as.character(limit)]),
alpha = unname(CTRL_ALPHA[as.character(limit)])
)
g_selected_global <- ggplot(selected_detail, aes(x = idx_global)) +
geom_line(
data = selected_limits_long,
aes(x = idx_global, y = value, linetype = limit, group = limit,
linewidth = linewidth, alpha = alpha),
color = "black"
) +
scale_linetype_manual(name = "Adaptive limits", values = CTRL_LTY) +
scale_linewidth_identity(guide = "none") +
scale_alpha_identity(guide = "none") +
geom_point(aes(y = p, color = cluster_f), size = 0.8, alpha = 0.72) +
geom_point(
data = selected_detail %>% dplyr::filter(signal),
aes(y = p, shape = signal_label),
color = "red", size = 2.1, stroke = 1.0
) +
scale_color_manual(values = selected_cols, limits = as.character(1:K_ELEGIDO),
name = "Dominant cluster", drop = FALSE) +
scale_shape_manual(name = NULL, values = c("OOC" = 4), drop = FALSE) +
coord_cartesian(ylim = c(-0.02, 1.02)) +
labs(
title = paste0("General FCA p'-chart plot for selected K = ", K_ELEGIDO),
subtitle = "Full series: observed proportion, dominant cluster, adaptive limits, and OOC signals",
x = "Global observation index",
y = expression(p[i])
) +
theme_minimal(base_size = PLOT_BASE_SIZE) +
theme(
plot.title = element_text(face = "bold", size = 12),
legend.position = "right",
panel.grid.minor = element_blank()
)
print(g_selected_global)

ggsave(
filename = file.path(base_out_dir, sprintf("GENERAL_FCA_pchart_K_elegido_%02d.png", K_ELEGIDO)),
plot = g_selected_global, width = FIG_W_LAND, height = FIG_H_LAND, units = "in", dpi = DPI_EXPORT
)
g_selected_cluster_bars <- selected_cluster_summary %>%
tidyr::pivot_longer(cols = c(observations, signals), names_to = "type", values_to = "value") %>%
ggplot(aes(x = factor(cluster), y = value, fill = type)) +
geom_col(position = "dodge", alpha = 0.85) +
labs(
title = paste0("General distribution of observations and signals by cluster - K = ", K_ELEGIDO),
subtitle = "Identifies which regimes concentrate more data and which concentrate more OOC signals",
x = "Dominant cluster",
y = "Count"
) +
theme_minimal(base_size = PLOT_BASE_SIZE) +
theme(plot.title = element_text(face = "bold", size = 12))
print(g_selected_cluster_bars)

ggsave(
filename = file.path(base_out_dir, sprintf("GENERAL_cluster_distribution_K_elegido_%02d.png", K_ELEGIDO)),
plot = g_selected_cluster_bars, width = FIG_W_STD, height = FIG_H_STD, units = "in", dpi = DPI_EXPORT
)
Result and interpretation. The first plot in this section is the requested general view for the selected number of clusters. If K_ELEGIDO = 4, the plot shows the four latent regimes in a single panel. If changed to 5 or more, the same routine updates colors, legend, cluster summary, and PNG files. Red points correspond to OOC signals and should be reviewed together with the dominant cluster and its signal rate.
Result and interpretation. The table OOC signal summary by number of clusters allows comparison among solutions. The signal-based recommended value should be contrasted with the validation-index table. For an academic report, the final \(K\) should be justified using both criteria: cluster structure and monitoring impact.
11. Sequential windowing strategy for the selected number of clusters
Calculation context. This section explicitly incorporates the windowing strategy for the solution selected in K_ELEGIDO, by default K_ELEGIDO = 4. The purpose is to show local process behavior in successive windows, as in the windowing figure of the Word document: each panel corresponds to a time window, with its own FCA adaptive limits, dominant cluster, and out-of-control signals.
The strategy separates the monitoring horizon into an initial exploratory window and subsequent sequential windows. The first window makes it possible to observe the initial process behavior; the following windows support analysis of local stability, regime changes, and temporal concentration of OOC signals.
\[
W_0=\{1,2,\ldots,n_0\},\qquad n_0=\texttt{exploratory}
\]
\[
W_r=\{n_0+(r-1)L+1,\ldots,n_0+rL\},\qquad L=\texttt{window\_size},\quad r=1,2,\ldots,R
\]
For each observation within a window, the model identifies the dominant cluster and evaluates the observed proportion against regime-conditioned limits:
\[
j_i^*=\arg\max_{j\in\{1,\ldots,K^*\}}u_{ij},\qquad K^*=\texttt{K\_ELEGIDO}
\]
\[
Signal_i=
\begin{cases}
1, & p_i>UCL_i\;\text{o}\;p_i
Generated plot. The faceted figure shows one FCA p'-chart per window. Each panel displays the monitored proportion \(p_i\), the limits \(LCL_i\), \(CCL_i\), \(UCL_i\), the dominant cluster through color, and OOC signals through red markers. This visualization is the operational equivalent of the windowing plot in the Word document.
# ==========================================================
# 9. Sequential windowing strategy for K_ELEGIDO
# ==========================================================
# This section uses the results already computed in selected_detail.
# Por defecto K_ELEGIDO = 4, pero funciona para K_ELEGIDO >= 4.
selected_detail_windowed <- selected_detail %>%
dplyr::mutate(
window_label = ifelse(
window_id == 0,
paste0("Initial window (", exploratory, " obs.)"),
paste0("Window ", window_id, ":(" window_size, " obs.)")
),
window_label = factor(window_label, levels = unique(window_label))
)
selected_window_summary <- selected_detail_windowed %>%
dplyr::group_by(window_id, window_label) %>%
dplyr::summarise(
start_idx = min(idx_global),
end_idx = max(idx_global),
n_obs = dplyr::n(),
clusters_presentes = paste(sort(unique(cluster)), collapse = ", "),
total_signals = sum(signal),
signal_rate = mean(signal),
p_mean = mean(p),
p_sd = sd(p),
CCL_mean = mean(CCL),
LCL_mean = mean(LCL),
UCL_mean = mean(UCL),
.groups = "drop"
)
knitr::kable(
selected_window_summary,
digits = 4,
caption = paste("Summary of the sequential windowing strategy for selected K =", K_ELEGIDO)
)
selected_limits_window_long <- dplyr::bind_rows(
data.frame(idx_global = selected_detail_windowed$idx_global,
window_label = selected_detail_windowed$window_label,
limit = "UCL", value = selected_detail_windowed$UCL),
data.frame(idx_global = selected_detail_windowed$idx_global,
window_label = selected_detail_windowed$window_label,
limit = "LCL", value = selected_detail_windowed$LCL),
data.frame(idx_global = selected_detail_windowed$idx_global,
window_label = selected_detail_windowed$window_label,
limit = "CCL", value = selected_detail_windowed$CCL)
) %>%
dplyr::mutate(
limit = factor(limit, levels = c("UCL", "LCL", "CCL")),
linewidth = unname(CTRL_LWD[as.character(limit)]),
alpha = unname(CTRL_ALPHA[as.character(limit)])
)
# Main windowing plot, similar to the figure in the Word document.
g_selected_window_facets <- ggplot(selected_detail_windowed, aes(x = idx_global)) +
geom_line(
data = selected_limits_window_long,
aes(x = idx_global, y = value, linetype = limit,
group = interaction(window_label, limit),
linewidth = linewidth, alpha = alpha),
color = "black"
) +
scale_linetype_manual(name = "Adaptive limits", values = CTRL_LTY) +
scale_linewidth_identity(guide = "none") +
scale_alpha_identity(guide = "none") +
geom_line(aes(y = p, color = cluster_f, group = window_label), linewidth = 0.25, alpha = 0.75) +
geom_point(aes(y = p, color = cluster_f), size = 0.45, alpha = 0.85) +
geom_point(
data = selected_detail_windowed %>% dplyr::filter(signal),
aes(y = p, shape = signal_label),
color = "red", size = 1.4, stroke = 0.8
) +
scale_color_manual(values = selected_cols,
limits = as.character(1:K_ELEGIDO),
name = "Dominant cluster",
drop = FALSE) +
scale_shape_manual(name = NULL, values = c("OOC" = 4), drop = FALSE) +
facet_wrap(~ window_label, scales = "free_x", ncol = 2) +
coord_cartesian(ylim = c(-0.02, 1.02)) +
labs(
title = paste0("Sequential windowing strategy FCA p'-chart - selected K = ", K_ELEGIDO),
subtitle = "Each panel recalculates and shows local behavior: p_i, dominant cluster, CCL, LCL, UCL, and OOC signals",
x = "Global observation index within each window",
y = expression(p[i])
) +
theme_minimal(base_size = PLOT_BASE_SIZE) +
theme(
plot.title = element_text(face = "bold", size = 12),
strip.text = element_text(face = "bold", size = 8.5),
legend.position = "right",
panel.grid.minor = element_blank()
)
print(g_selected_window_facets)
ggsave(
filename = file.path(base_out_dir, sprintf("VENTANEO_FCA_pchart_K_elegido_%02d.png", K_ELEGIDO)),
plot = g_selected_window_facets,
width = FIG_W_LAND, height = FIG_H_LAND, units = "in", dpi = DPI_EXPORT
)
# Auxiliary plot: OOC signals by window.
g_selected_signals_by_window <- selected_window_summary %>%
ggplot(aes(x = factor(window_id), y = total_signals)) +
geom_col(alpha = 0.85) +
geom_text(aes(label = total_signals), vjust = -0.25, size = 3.5) +
labs(
title = paste0("OOC signals by window - selected K = ", K_ELEGIDO),
subtitle = "Identifies critical windows and compares local process stability",
x = "Ventana secuencial",
y = "Number of OOC signals"
) +
theme_minimal(base_size = PLOT_BASE_SIZE) +
theme(plot.title = element_text(face = "bold", size = 12))
print(g_selected_signals_by_window)
ggsave(
filename = file.path(base_out_dir, sprintf("VENTANEO_signals_by_window_K_elegido_%02d.png", K_ELEGIDO)),
plot = g_selected_signals_by_window,
width = FIG_W_STD, height = FIG_H_STD, units = "in", dpi = DPI_EXPORT
)
## Error in parse(text = input): <text>:12:41: unexpected symbol
## 11: paste0("Initial window (", exploratory, " obs.)"),
## 12: paste0("Window ", window_id, ":(" window_size
## ^
Result and interpretation. The faceted plot makes it possible to read the process by windows, not only globally. If OOC signals concentrate in one or a few windows, the behavior suggests local episodes of instability. If signals are persistently distributed in the same cluster across several windows, the result suggests a structural operational condition associated with that regime. For K_ELEGIDO = 4, this figure shows the four latent clusters as dominant regimes within each window.
How to report it. The report may state that, after selecting the four-cluster solution, a sequential windowing strategy was applied, consisting of an initial exploratory window of 200 observations and subsequent windows of 1,000 observations. In each window, FCA limits conditioned by the dominant cluster were recalculated, making it possible to differentiate local OOC signals from fluctuations produced by regime mixing.
12. Inventory of generated outputs
When the document is executed, the following elements are generated:
- Overdispersion plots: Q-Q plot with arcsine-square-root transformation.
- Validation index table: PC, PE, MPC, approximate SIL.F, DB, CH, and approximate GAP.
- Faceted index plot: sensitivity of each index against \(K\).
- PDF for each K: report with p-charts by window and membership diagnostics.
- PNG by window: FCA p-chart and dominant-membership plot.
- Global PNG files: faceted view by window and single global view.
- General PNG for selected K:
GENERAL_FCA_pchart_K_elegido_XX.png and signal distribution by cluster.
- Windowing PNG for selected K:
VENTANEO_FCA_pchart_K_elegido_XX.png and VENTANEO_signals_by_window_K_elegido_XX.png.
- Excel for each K: observations, memberships \(u_1,\ldots,u_{10}\), limits, signals, window summary, and cluster distribution.
- Master Excel file: consolidated results for \(K=2,\ldots,10\).
13. Plot interpretation
| Plot | What it shows | How to interpret it |
| Q-Q arcsine-square-root plot | Visual overdispersion diagnosis. | Strong deviations from the line indicate variability greater than binomial variability. |
| Validation indices | Relative partition quality for each K. | Look for convergence: high PC/MPC/SIL/CH and low PE/DB. |
| FCA p-chart by window | Series \(p_i\), dominant cluster, CCL, LCL, UCL, and OOC signals. | Red signals indicate observations outside the adaptive limit of their regime. |
| Membership diagnostics | Relationship between \(p_i\) and dominant membership. | High memberships imply clear assignment; medium/low memberships suggest transition between regimes. |
| Global faceted chart | Comparison of successive windows. | Useful for detecting local stability, temporal changes, and critical windows. |
| Sequential windowing for selected K | FCA p'-chart by window for the selected solution, default K=4. | Allows each window to be interpreted as a local context, with adaptive limits, dominant cluster, and OOC signals. |
| General plot for selected K | Full series with dominant cluster, CCL, LCL, UCL, and OOC signals. | This is the main figure for reporting process behavior with the selected solution, for example K=4 or more. |
| Total OOC by K | Method sensitivity to the number of clusters. | It should be analyzed together with validity indices to avoid choosing K only by minimizing signals. |
14. Guide for reporting the results
Integrated reading of results. The methodology should be reported at four levels. First, confirm that the proportional data show overdispersion, which justifies not relying only on a classical p-chart. Second, validate the existence of latent regimes through clustering indices. Third, construct and show that FCA p'-chart limits adapt to each regime and reduce spurious signals. Fourth, analyze the remaining OOC signals as contextual events associated with specific windows and clusters.
| Element to report | Result to extract from the RHTML | Technical interpretation |
| Overdispersion | Variance table and Q-Q plot | Justifies the use of adjusted limits and a fuzzy-regime approach. |
| Number of clusters | Index table and plot by K | Supports selection of an interpretable latent structure. |
| Parameters by cluster | Window_Cluster_Summary | Describes CCL, variability, effective size, and signals by regime. |
| OOC signals | All_Observations and Window_Summary | Identifies critical observations, window, cluster, and limits used. |
| Dominant membership | Membership diagnostics and u_j columns | Evaluates whether the signal clearly belongs to a regime or to a transitional zone. |
| Comparison across K | MASTER_FCA_p_results | Allows analysis of robustness, sensitivity, and possible over-segmentation. |