Group Member: Fio Ulaa’ Octriyanti (030) Muhammad Raffi Fahrezi (100)

Introduction

Our objective is to uncover hidden operational patterns within gas turbine data collected over five years and to uncover hidden operational patterns within gas turbine data collected from 2015. By analyzing 11 different sensors including ambient conditions and emission levels we aim to group similar states of turbine behavior.To achieve a robust understanding, we will employ five distinct clustering methodologies: 1. K-Means: To find spherical clusters based on centroids. 2. K-Medians: To ensure our clusters are resistant to the outliers found in our emissions data. 3. DBSCAN: To identify clusters of arbitrary shapes based on data density (to be completed by teammate). 4. MeanShift: To find clusters by shifting points toward the highest density of neighbors (to be completed by teammate). 5. Fuzzy C-Means: To allow data points to belong to multiple clusters with varying degrees of membership (to be completed by teammate).

Environment Setup & Data Integration

We initialize all required libraries for the five methods and merge the data specifically for the years 2015.

if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
## Warning: package 'pacman' was built under R version 4.5.3
pacman::p_load(
  dplyr, tidyr, ggplot2, reshape2, knitr, 
  factoextra, cluster, 
  dbscan,               
  meanShiftR,           
  e1071, fclust, flexclust
)
file_name <- "gt_2015.csv"
if (file.exists(file_name)) {
  data <- read.csv(file_name)
} else if (file.exists(paste0("Dataset/", file_name))) {
  data <- read.csv(paste0("Dataset/", file_name))
} else {
  stop("Error: gt_2015.csv not found! Please place the file in your working directory.")
}

data <- data %>% distinct()

Preprocessing

Before we apply our five clustering algorithms, we must ensure our data is “clean” and “fair.” Clustering relies on distance calculations; therefore, any skewed values or mismatched scales can lead to misleading groups.

Missing Values

data.frame(
  Column   = names(data),
  Missing = colSums(is.na(data)),
  Percentage  = round(colSums(is.na(data)) / nrow(data) * 100, 2)
)
##      Column Missing Percentage
## AT       AT       0          0
## AP       AP       0          0
## AH       AH       0          0
## AFDP   AFDP       0          0
## GTEP   GTEP       0          0
## TIT     TIT       0          0
## TAT     TAT       0          0
## TEY     TEY       0          0
## CDP     CDP       0          0
## CO       CO       0          0
## NOX     NOX       0          0

Outlier Detection & Scaling

As per Clustering principles, we must handle outliers and standardize the data. Clustering uses distance (Euclidean/Manhattan), so we need all sensors on the same scale.

data_long <- data %>% pivot_longer(cols = everything())

ggplot(data_long, aes(x = name, y = value)) +
  geom_boxplot(fill = "skyblue",           
               outlier.colour = "red",    
               outlier.size = 0.5) +
  coord_flip() + 
  labs(title = "Sensor Outlier Check (2015)",
       x = "Sensor Name", 
       y = "Value Range")

cap_outliers <- function(x) {
  Q1 <- quantile(x, 0.25)
  Q3 <- quantile(x, 0.75)
  IQR_val <- Q3 - Q1
  x <- pmax(pmin(x, Q3 + 1.5 * IQR_val), Q1 - 1.5 * IQR_val)
  return(x)
}

data_capped <- as.data.frame(lapply(data, cap_outliers))
data_scaled <- as.data.frame(scale(data_capped))

data_scaled <- as.data.frame(scale(data_capped))

data_scaled_plot <- data_scaled %>% pivot_longer(cols = everything())

ggplot(data_scaled_plot, aes(x = name, y = value)) +
  geom_boxplot(fill = "skyblue",           
               outlier.colour = "red",    
               outlier.size = 0.5) +
  coord_flip() + 
  labs(title = "Sensor Outlier Check After Scaling",
       x = "Sensor Name", 
       y = "Value Range")

After applying IQR-based capping and z-score standardization, Figure 2 shows a significant improvement over Figure 1. All 11 sensors now share a common scale centered around 0, making distance-based clustering calculations fair and unbiased. The previously extreme outliers in CO and NOX have been successfully suppressed, with only a few mild residual points remaining in TIT, TAT, and GTEP indicating genuinely rare operating conditions rather than data errors. The data is now ready for clustering.

Results and Discussion

Optimal Clusters: The Elbow Method

We determine k by calculating the Total Within-Cluster Sum of Squares (WSS). We look for the “elbow” where the improvement levels off. This is useful for obtaining the most optimal value of K for K means and K medians

set.seed(123)
wss <- sapply(1:10, function(k){
  kmeans(data_scaled, centers = k, nstart = 10)$tot.withinss
})

plot_data <- data.frame(k = 1:10, wss = wss)
ggplot(plot_data, aes(x = k, y = wss)) +
  geom_line() + geom_point(color = "red") +
  geom_vline(xintercept = 3, linetype = "dashed") +
  theme_minimal() + labs(title = "Elbow Method (Full 2015 Data)")

K-Means

We partition the 2015 data into 3 clusters based on the result from the elbow method. We then labeled them by their Average Turbine Energy Yield (TEY).

set.seed(123)
km_res <- kmeans(data_scaled, centers = 3, nstart = 25)

cluster_labels <- data %>% 
  mutate(Cluster = km_res$cluster) %>%
  group_by(Cluster) %>%
  summarise(Avg_TEY = mean(TEY)) %>%
  arrange(Avg_TEY) %>%
  mutate(Label = c("Low Load", "Medium Load", "High Load"))

fviz_cluster(km_res, data = data_scaled, geom = "point", 
             ellipse.type = "convex", palette = "jco",
             main = "Figure 3. K-Means: Operational Load Groups")

print(cluster_labels)
## # A tibble: 3 × 3
##   Cluster Avg_TEY Label      
##     <int>   <dbl> <chr>      
## 1       1    111. Low Load   
## 2       3    132. Medium Load
## 3       2    152. High Load

K-Medians

set.seed(123)
kmed_result <- kcca(data_scaled, k = 3,
                    family  = kccaFamily("kmedians"),
                    control = list(iter.max = 100))

kmed_clusters <- clusters(kmed_result)

data.frame(
  Cluster = names(table(kmed_clusters)),
  Size  = as.integer(table(kmed_clusters))
)
##   Cluster Size
## 1       1 2577
## 2       2 3287
## 3       3 1520
set.seed(123)
kmed_result <- kcca(data_scaled, k = 3,
                    family  = kccaFamily("kmedians"),
                    control = list(iter.max = 100))

kmed_clusters <- clusters(kmed_result)

pca_res <- prcomp(data_scaled)
data_plot <- as.data.frame(pca_res$x[, 1:2])
data_plot$Cluster <- as.factor(kmed_clusters)

ggplot(data_plot, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(alpha = 0.4, size = 0.8) +
  stat_ellipse(type = "norm", linewidth = 0.8) +
  scale_color_manual(values = c("#E64B35", "#4DBBD5", "#00A087"),
                     labels = c("Low Load", "Medium Load", "High Load")) +
  theme_minimal() +
  labs(title = "Figure 4. K-Medians: Robust Operational Groups",
       color = "Cluster")

data.frame(
  Cluster = names(table(kmed_clusters)),
  Size  = as.integer(table(kmed_clusters))
)
##   Cluster Size
## 1       1 2577
## 2       2 3287
## 3       3 1520


DBSCAN

To determine the optimal eps value for DBSCAN, we used a combination of the k-NN distance plot and brute-force parameter search. The k-NN plot suggested an elbow around 0.5–0.6, however testing at this range produced either too many clusters or excessive noise points. We systematically tested increasing eps values while fixing minPts = 5 (the standard recommendation for high-dimensional data) until we achieved a result consistent with our elbow method finding of k=3. The final parameters eps = 1.2, minPts = 5 yielded 3 clusters with only 110 noise points, balancing cluster consistency with statistical rigor.

pacman::p_load(dbscan)

kNNdistplot(data_scaled, k = 5)
abline(h = 1.5, lty = 2, col = "red")

dbscan_res <- dbscan(data_scaled, eps = 1.2, minPts = 5)

cat("Clusters found:", max(dbscan_res$cluster), "\n")
## Clusters found: 3
cat("Noise points:", sum(dbscan_res$cluster == 0), "\n")
## Noise points: 110
print(table(dbscan_res$cluster))
## 
##    0    1    2    3 
##  110 7261    6    7
pca_res <- prcomp(data_scaled)
data_plot <- as.data.frame(pca_res$x[, 1:2])
data_plot$Cluster <- as.factor(dbscan_res$cluster)

ggplot(data_plot, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(alpha = 0.4, size = 0.8) +
  scale_color_manual(values = c("black", "#E64B35", "#4DBBD5", "#00A087"),
                     labels = c("Noise", "Cluster 1", "Cluster 2", "Cluster 3")) +
  theme_minimal() +
  labs(title = "Figure 5. DBSCAN: Density-Based Operational Groups",
       color = "Cluster")

Cluster 1 contains 7,261 points while Clusters 2 and 3 contain only 6 and 7 points respectively. This reveals that the gas turbine sensor data forms one large, continuous dense region rather than distinct density-separated groups. The two minority clusters are better interpreted as boundary outliers rather than meaningful operational states. And thus DBScan are not well suited for this database

Meanshift

Due to MeanShift’s O(n²) computational complexity, running it on all 7,384 observations was not feasible. Instead, we applied it to a representative random sample of 1,000 points, which is standard practice for large datasets. A brute-force bandwidth search was conducted, revealing that values below 2.7 produced an excessive number of clusters while values above 2.8 collapsed into just 2 clusters. A bandwidth of 2.7 was selected as the optimal value, yielding 3 clusters with a balanced distribution of 463, 239, and 298 points respectively, consistent with the 3-cluster finding from the Elbow Method.

set.seed(123)
sample_idx <- sample(nrow(data_scaled), 1000)
data_sample <- as.matrix(data_scaled[sample_idx, ])

for (bw in c(2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9)) {
  ms_res <- meanShift(
    trainData = data_sample,
    queryData = data_sample,
    bandwidth = rep(bw, ncol(data_scaled))
  )
  cat("Bandwidth:", bw, "| Clusters:", length(unique(ms_res$assignment)), "\n")
}
## Bandwidth: 2.1 | Clusters: 24 
## Bandwidth: 2.2 | Clusters: 18 
## Bandwidth: 2.3 | Clusters: 13 
## Bandwidth: 2.4 | Clusters: 10 
## Bandwidth: 2.5 | Clusters: 6 
## Bandwidth: 2.6 | Clusters: 5 
## Bandwidth: 2.7 | Clusters: 3 
## Bandwidth: 2.8 | Clusters: 3 
## Bandwidth: 2.9 | Clusters: 2
set.seed(123)
ms_res <- meanShift(
  trainData = data_sample,
  queryData = data_sample,
  bandwidth = rep(2.7, ncol(data_scaled))
)
ms_clusters <- ms_res$assignment
cat("Clusters found:", length(unique(ms_clusters)), "\n")
## Clusters found: 3
print(table(ms_clusters))
## ms_clusters
##   1   2   3 
## 463 239 298
pca_res <- prcomp(data_sample)
data_plot <- as.data.frame(pca_res$x[, 1:2])
data_plot$Cluster <- as.factor(ms_clusters)

ggplot(data_plot, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(alpha = 0.4, size = 0.8) +
  stat_ellipse(type = "norm", linewidth = 0.8) +
  scale_color_manual(values = c("#E64B35", "#4DBBD5", "#00A087"),
                     labels = c("Low Load", "Medium Load", "High Load")) +
  theme_minimal() +
  labs(title = "Figure 6. MeanShift: Density Peak Groups",
       color = "Cluster")

Fuzzy C-Means

set.seed(123)
fcm_res <- cmeans(data_scaled, centers = 3, iter.max = 100, 
                  dist = "euclidean", method = "cmeans", m = 2)

cat("Cluster sizes:\n")
## Cluster sizes:
print(table(fcm_res$cluster))
## 
##    1    2    3 
## 2477 2918 1989
membership_df <- as.data.frame(fcm_res$membership)
colnames(membership_df) <- c("Low Load", "Medium Load", "High Load")
head(round(membership_df, 3))
##   Low Load Medium Load High Load
## 1    0.101       0.198     0.702
## 2    0.097       0.197     0.706
## 3    0.118       0.280     0.603
## 4    0.147       0.337     0.515
## 5    0.161       0.374     0.465
## 6    0.164       0.392     0.443
pca_res <- prcomp(data_scaled)
data_plot <- as.data.frame(pca_res$x[, 1:2])
data_plot$Cluster <- as.factor(fcm_res$cluster)

ggplot(data_plot, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(alpha = 0.4, size = 0.8) +
  stat_ellipse(type = "norm", linewidth = 0.8) +
  scale_color_manual(values = c("#E64B35", "#4DBBD5", "#00A087"),
                     labels = c("Low Load", "Medium Load", "High Load")) +
  theme_minimal() +
  labs(title = "Figure 7. Fuzzy C-Means: Soft Operational Groups",
       color = "Cluster")

Visualization

The Power Generation Engine (Relationships with TEY)

In a gas turbine, energy yield (TEY) is heavily influenced by ambient pressure (AP) and compressor discharge pressure (CDP). We use a scatter plot matrix to see if these relationships are linear. BTo conclude our preprocessing, we investigate how variables influence one another. Does an increase in turbine temperature always lead to higher NOx emissions? By creating a correlation heatmap, we visualize the strength and direction of these relationships. This map serves as our final guide before we proceed to advanced multivariate modeling.

ggplot(data, aes(x = CDP, y = TEY, color = AT)) +
  geom_point(alpha = 0.5) +
  scale_color_gradient(low = "blue", high = "red") +
  theme_minimal() +
  labs(title = "Figure 1.5. Energy Yield vs Compressor Pressure",
       subtitle = "Color intensity represents Ambient Temperature (AT)",
       x = "Compressor Discharge Pressure (CDP)", y = "Turbine Energy Yield (TEY)")

As shown in Figure 1.5, there is a very strong positive linear relationship between CDP and TEY. Interestingly, when Ambient Temperature (AT) is lower (blue points), the turbine tends to achieve higher energy yield for the same pressure level, suggesting that cold air density improves efficiency.

The Emission Trade-off (NOx vs CO)

One of the biggest challenges in turbine operation is the trade-off between NOx and CO. High-temperature combustion reduces CO but increases NOx. We will visualize this “thermal dilemma.”

ggplot(data, aes(x = CO, y = NOX)) +
  geom_point(aes(color = TIT), alpha = 0.4) +
  scale_x_log10() + # Using Log Scale because CO has extreme outliers
  scale_color_gradient(low = "green", high = "purple") +
  theme_minimal() +
  labs(title = "Figure 1.6. The Emission Dilemma: NOx vs CO",
       subtitle = "Purple indicates high Turbine Inlet Temperature (TIT)",
       x = "Carbon Monoxide (CO) - Log Scale", y = "Nitrogen Oxides (NOx)")

Figure 1.6 reveals a non-linear relationship. High NOx levels are associated with high Turbine Inlet Temperatures (TIT), appearing at the top-left (low CO). Conversely, when the temperature drops, NOx decreases but CO spikes significantly (bottom-right). This confirms that our clusters should ideally capture these different “combustion regimes.”

Density Distribution of Operational States

We use a 2D Density plot to see where the turbine “spends most of its time” during the year 2015.

ggplot(data, aes(x = TIT, y = TEY)) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon") +
  scale_fill_viridis_c() +
  theme_minimal() +
  labs(title = "Figure 1.7. Operational Hotspots (TIT vs TEY)",
       subtitle = "Bright areas represent the most frequent operating conditions",
       x = "Turbine Inlet Temperature (TIT)", y = "Turbine Energy Yield (TEY)")
## Warning: The dot-dot notation (`..level..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(level)` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The density map shows two or three distinct “islands” or hotspots. This visual evidence strongly supports our choice of k=3 from the Elbow Method, as the turbine clearly operates in three primary modes throughout 2015.