library(tidyr) # used to manipulate and clean data
library(dplyr) # used for data manipulation
library(matrixStats) # used to extract statistics per sensor
library(cluster) # used for cluster analysis
library(ggplot2) # used for data visualization
library(gridExtra) # used for visualization arrangement
library(factoextra) # used for cluster visualization
#library(kableExtra) # used to create tables that accompany visualizations
library(corrplot) # used for visualization of correlation
library(randomForest) # used for Random Forest Classification
library(caret) # used to evaluate prediction models

Introduction

The data set used in this project is a record of 17 sensors on a hydraulic test rig that monitor the operational parameters of various hydraulic system components. Each row in a sensor’s data file represents one operational cycle, and the columns are that sensor’s readings within the cycle. One cycle lasts 60 seconds, and readings are collected at one of three sampling rates: 100 Hz, 10 Hz, or 1 Hz. Below are the sensors’ names, units of measurement, and sampling rates.

Pressure sensor 1 (PS1) - Measured in bars at 100 Hz.

Pressure sensor 2 (PS2) - Measured in bars at 100 Hz.

Pressure sensor 3 (PS3) - Measured in bars at 100 Hz.

Pressure sensor 4 (PS4) - Measured in bars at 100 Hz.

Pressure sensor 5 (PS5) - Measured in bars at 100 Hz.

Pressure sensor 6 (PS6) - Measured in bars at 100 Hz.

Motor Power sensor 1 (EPS1) - Measured in watts at 100 Hz.

Volume Flow sensor 1 (FS1) - Measured in liters per minute at 10 Hz.

Volume Flow sensor 2 (FS2) - Measured in liters per minute at 10 Hz.

Temperature sensor 1 (TS1) - Measured in degrees Celsius at 1 Hz.

Temperature sensor 2 (TS2) - Measured in degrees Celsius at 1 Hz.

Temperature sensor 3 (TS3) - Measured in degrees Celsius at 1 Hz.

Temperature sensor 4 (TS4) - Measured in degrees Celsius at 1 Hz.

Vibration sensor 1 (VS1) - Measured in millimeters per second at 1 Hz.

Cooling Efficiency sensor 1 (CE) - Measured as a percent of optimal conditions at 1 Hz.

Cooling Power sensor 1 (CP) - Measured in kilowatts at 1 Hz.

Efficiency factor sensor 1 (SE) - Measured as percent of optimal conditions at 1 Hz.

Each sensor’s output is a single text file with 2,205 rows, one per operational cycle. Depending on the sampling rate, a file has 6,000, 600, or 60 columns.
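The column counts follow from the cycle length and the sampling rate. A minimal sketch of that arithmetic is shown here; the variable names are illustrative and are not part of the original analysis.

```r
# Each cycle lasts 60 seconds; columns per file = seconds * sampling rate (Hz)
cycle_seconds <- 60
sampling_rates_hz <- c(high = 100, medium = 10, low = 1)

# Number of readings (columns) recorded per cycle at each rate
columns_per_cycle <- cycle_seconds * sampling_rates_hz
print(columns_per_cycle)
```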

The last data file downloaded for this project contains diagnostic outputs that describe the system’s operational health. Like the sensor files, the diagnostic file comprises 2,205 rows, one per cycle, but it holds only one value per column for each cycle. Each column therefore represents the condition of one component or function. The columns of the diagnostic file are the following:

Cooler - Measured as a percent with values of 3 meaning close to total failure, 20 representing reduced efficiency, and 100 representing full efficiency.

Valve - Measured as a percent of valve switching behavior with values of 100 meaning optimal, 90 meaning there is a small lag in the valves, 80 representing a severe lag in the valve, and 73 meaning close to total failure of valve behavior.

Leakage - Internal pump leakage, with values of 0 meaning no leakage is detected, 1 meaning weak leakage is present, and 2 representing severe leakage in the system.

Hydraulic Accumulation - Measured in bars describing the hydraulic pressure accumulation during the cycle with values of 130 meaning optimal pressure, 115 meaning slightly reduced pressure, 100 describing severely reduced pressure, and 90 representing close to total failure of hydraulic pressure.

Stable - The final column in the diagnostic data file is a status reading describing if the system is in a stable condition with a value of 0 meaning conditions were stable throughout the cycle and a value of 1 meaning static conditions have not been reached.

This project builds upon previous exploratory analysis that identified a distinct pattern across multiple sensors. It executes a clustering algorithm to determine whether the visually identified patterns in the data are repeatable and truly represent significant operational requirements or conditions. With this information, future effort will aim to understand which of the clusters best represents peak operational efficiency. The clustering output in this project will serve as the benchmark for answering overarching research questions aimed at improving operational systems, processes, and decision-making.

# Import Data Set

The data set used in this project is the cleaned and merged data file used in previous work.

data <- read.csv("Hydraulic Test Rig Data.csv")

Pattern Identification

As previously noted, a pattern emerged during exploratory analysis that hinted at three distinct phases across the 2,205 operational cycles. Across multiple sensors, there appear to be breaks roughly every 750 cycles, and the cycles that follow a break frequently produce significantly different sensor outputs. This could result from maintenance breaks and fine-tuning of the machine, or from a different operational load placed on the machine. Both the diagnostic output variables and the sensors follow the three-phase pattern. The charts below illustrate what was learned during the exploratory process.
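One way to make the suspected phases explicit is to label each cycle by the approximate break points. The sketch below assumes breaks at cycles 750 and 1,500, which are rough visual estimates rather than documented boundaries, and it uses a hypothetical `Phase` column on a stand-in data frame instead of the real data.

```r
# Stand-in for the real data: only the Cycle column is needed here
cycles <- data.frame(Cycle = 1:2205)

# Assumed break points (approximate, read off the scatter plots)
phase_breaks <- c(0, 750, 1500, 2205)

# Label each cycle with the assumed phase it falls into
cycles$Phase <- cut(cycles$Cycle,
                    breaks = phase_breaks,
                    labels = c("Phase 1", "Phase 2", "Phase 3"))

# Number of cycles per assumed phase
phase_counts <- table(cycles$Phase)
print(phase_counts)
```

A label of this kind could then be mapped to color in any of the scatter plots to check whether the breaks line up across sensors.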

# create scatter plot of the hydraulic assessment
data %>%
  ggplot(aes(x = Cycle, y = Hydraulic)) +
  geom_point() +
  labs(title = "Hydraulic Assessment",
       subtitle = "Measured in Bars",
       caption = "data source: UCI Machine Learning Repository",
       x = "Cycle",
       y = "Diagnostic Output (Bars)") +
  theme_minimal ()

# create scatter plot of the cooler condition values
data %>%
  ggplot(aes(x = Cycle, y = Cooloer)) +
  geom_point() +
  labs(title = "Cooler Condition",
       subtitle = "Measured as percent of efficiency",
       caption = "data source: UCI Machine Learning Repository",
       x = "Cycle",
       y = "Efficiency Percent") +
  theme_minimal ()

# create scatter plot of the mean value of the CE sensor across all cycles
data %>%
  ggplot(aes(x = Cycle, y = Mean_CE)) +
  geom_point() +
  labs(title = "Mean Cooling Efficiency",
       subtitle = "Measured by Sensor CE (%)",
       caption = "data source: UCI Machine Learning Repository",
       x = "Cycle",
       y = "Percent") +
  theme_minimal()

# create scatter plot of cooling power values
data %>%
  ggplot(aes(x = Cycle, y = Mean_CP)) +
  geom_point() +
  labs(title = "Mean Cooling Power per Cycle",
       subtitle = "Measured by Sensor CP",
       caption = "data source: UCI Machine Learning Repository",
       x = "Cycle",
       y = "kW") +
  theme_minimal ()

# create a scatterplot of each temperature sensor across all cycles
ggplot(data, aes(x = Cycle)) +
  geom_point(aes(y = Mean_TS1, color = "TS1")) +
  geom_point(aes(y = Mean_TS2, color = "TS2")) +
  geom_point(aes(y = Mean_TS3, color = "TS3")) +
  geom_point(aes(y = Mean_TS4, color = "TS4")) +
  scale_color_manual(values = c("TS1" = "red", "TS2" = "blue", "TS3" = "purple",
                                "TS4" = "orange"),
                       labels = c("TS1", "TS2", "TS3", "TS4")) +
  labs(title = "Mean Temperature Readings",
       subtitle = "Sensor TS1 through TS4",
       caption = "data source: UCI Machine Learning Repository",
       x = "Cycle",
       y = "Degrees Celsius") +
  theme_minimal()

# create a scatterplot of pressure sensors PS1 and PS2 across all cycles
ggplot(data, aes(x = Cycle)) +
  geom_point(aes(y = Mean_PS1, color = "PS1")) +
  geom_point(aes(y = Mean_PS2, color = "PS2")) +
  scale_color_manual(values = c("PS1" = "firebrick3", "PS2" = "dodgerblue3"),
                       labels = c("PS1", "PS2")) +
  labs(title = "Mean Pressure Readings",
       subtitle = "Sensor PS1 & PS2",
       caption = "data source: UCI Machine Learning Repository",
       x = "Cycle",
       y = "Bars") +
  theme_minimal()

# create a scatterplot of pressure sensors PS3 through PS6 across all cycles
ggplot(data, aes(x = Cycle)) +
  geom_point(aes(y = Mean_PS3, color = "PS3")) +
  geom_point(aes(y = Mean_PS4, color = "PS4")) +
  geom_point(aes(y = Mean_PS5, color = "PS5")) +
  geom_point(aes(y = Mean_PS6, color = "PS6")) +
  scale_color_manual(values = c("PS3" = "darkorchid1",
                                "PS4" = "seagreen2", 
                                "PS5" = "royalblue1", 
                                "PS6" = "darkorange1"),
                       labels = c("PS3", "PS4", "PS5", "PS6")) +
  labs(title = "Mean Pressure Readings",
       subtitle = "Sensor PS3 - PS6",
       caption = "data source: UCI Machine Learning Repository",
       x = "Cycle",
       y = "Bars") +
  theme_minimal()

The scatterplot shows frequent 0 readings from sensor PS4, followed by a close relationship with PS5 and PS6 during the third operational phase, so the analysis faces a decision. Future modeling depends on quality sensor readings. The study can either impute PS4 values so that the sensor closely follows PS5 and PS6, or remove the sensor entirely. If the sensor is removed, any future model applied to real-world data would not account for variation in PS4, even though the sensor could be highly predictive of system health and operational efficiency. However, the data set also contains four diagnostic outputs that are assumed to monitor the system and reflect its readings, and there could be instances between cycle 1 and cycle 1,700 where the PS4 reading strongly affected those outputs. Ultimately, because the study aims to produce results from a combination of sensor readings and diagnostic outputs, it is valuable to retain the cycles where a sensor is failing and to understand the impact of that failure. If values were imputed for PS4 to create ideal conditions, conclusions drawn from the diagnostic variables could be less applicable to real-world data.
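Before choosing between imputation and removal, it may help to quantify how often PS4 reads zero in each part of the record. The sketch below uses simulated values, not the rig data; the split at cycle 1,700 is taken from the paragraph above, and the `Period` column is a hypothetical helper.

```r
# Simulated stand-in: PS4 reads 0 for many early cycles and is nonzero later
set.seed(1)
sim <- data.frame(Cycle = 1:2205)
sim$Mean_PS4 <- ifelse(sim$Cycle <= 1700 & runif(2205) < 0.5, 0, 10)

# Split the record at cycle 1700
sim$Period <- ifelse(sim$Cycle <= 1700, "Cycles 1-1700", "Cycles 1701-2205")

# Share of cycles with a zero mean reading in each period
zero_share <- tapply(sim$Mean_PS4 == 0, sim$Period, mean)
print(round(zero_share, 2))
```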

Correlation Analysis

# build correlation data set
cors <- subset(data, select = c(Mean_CE, Mean_CP, Mean_EPS1, Mean_FS1, Mean_FS2, 
                                Mean_PS1, Mean_PS2, Mean_PS3,Mean_PS4, Mean_PS5, 
                                Mean_PS6, Mean_SE, Mean_TS1, Mean_TS2, Mean_TS3, 
                                Mean_TS4,Mean_VS1, Cooloer, Valve, Leakage,
                                Hydraulic, Stable))

# format values to numeric
cors[] <- lapply(cors, as.numeric)

# Compute the correlation matrix, then visualize it
corrs <- cor(cors, use = "pairwise.complete.obs")
corrplot(corrs,
         order = "hclust",
         tl.col = "black",
         tl.srt = 45)

Performance Clustering

Because it is currently unknown which sensors, or which combinations of sensor values, produce each of the diagnostic outputs, clustering is performed only on the sensor values, and the cluster assignment is then merged into the data frame as an additional output variable. The clusters are then plotted over the 2,205 operational cycles.

Performance Cluster Number 1 Data Preparation

# Remove the output variables and the cycle number. The malfunction counts must be
# removed because they cause NaN results when scaling, and the per-cycle minimum
# values are removed as well. The remaining clustering variables are the mean, max,
# standard deviation, and range of values for each sensor per cycle.
clust <- subset(data, select = -c(Cooloer, Valve, Leakage, Hydraulic, Stable, Cycle, 
                                  CE_Malfunctions, CP_Malfunctions, EPS1_Malfunctions,
                                  FS1_Malfunctions, FS2_Malfunctions, PS1_Malfunctions, 
                                  PS2_Malfunctions, PS3_Malfunctions, PS4_Malfunctions,
                                  PS5_Malfunctions, PS6_Malfunctions, SE_Malfunctions,
                                  TS1_Malfunctions, TS2_Malfunctions, TS3_Malfunctions,
                                  TS4_Malfunctions, VS1_Malfunctions, Min_CE, Min_CP, Min_EPS1,
                                  Min_FS1, Min_FS2, Min_PS1, Min_PS2, Min_PS3, Min_PS4, Min_PS5,
                                  Min_PS6, Min_SE, Min_TS1, Min_TS2, Min_TS3, Min_TS4, Min_VS1))

# scale the data in each of the variables
clust_scaled <- scale(clust)
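The NaN results mentioned in the comment above are what `scale()` produces for a column with zero variance, which is the most likely cause here if a malfunction count never changes. The sketch below demonstrates this on a toy data frame; the column names are hypothetical and the values are not from the rig data.

```r
# Toy data: one ordinary column and one column that never varies
toy <- data.frame(Mean_X = c(1, 2, 3, 4),
                  X_Malfunctions = c(0, 0, 0, 0))

# scale() divides by the column standard deviation; sd = 0 yields NaN
scaled_toy <- scale(toy)
print(scaled_toy)

# Identify zero-variance columns programmatically before scaling
zero_var <- names(toy)[sapply(toy, function(col) sd(col) == 0)]
print(zero_var)
```

Checking for zero-variance columns this way would avoid listing every column to drop by hand.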

Build Performance Cluster Number 1

# Set seed for reproducible continuity
set.seed(4321)

# Function to determine within cluster sum of squares
wcss <- function(k) {
  kmeans_result <- kmeans(clust_scaled, centers = k)
  return(kmeans_result$tot.withinss)}

# Determine number of clusters by using within cluster sum of squares function
k_values <- 1:25
wcss_values <- sapply(k_values, wcss)

# Create elbow plot to determine optimal number of clusters
elbow_plot <- data.frame(k = k_values, WCSS = wcss_values)

ggplot(elbow_plot, aes(x = k, y = WCSS)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Plot for Optimal k",
       x = "Number of Clusters (k)",
       y = "Within-Cluster Sum of Squares (WCSS)")

Based on the elbow plot, there is an argument for either four or five clusters. The steps below visualize five clusters and attach the cluster assignment to the original data frame for further inspection. Later analysis in this paper tests whether three or four clusters better segment what occurred during the 2,205 operational cycles.
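When the elbow is ambiguous, the average silhouette width offers a second opinion on the number of clusters. This is an alternative technique, not the method used in this study. The sketch below applies `silhouette()` from the cluster package to simulated data with three groups; the `avg_sil` helper and the toy matrix are illustrative assumptions.

```r
library(cluster)  # provides silhouette()

# Simulated stand-in for the scaled sensor matrix: three separated groups
set.seed(42)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))

# Average silhouette width for a given k (higher means cleaner separation)
avg_sil <- function(k) {
  fit <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, dist(toy))[, "sil_width"])
}

# Compare candidate values of k and keep the one with the widest silhouette
k_candidates <- 2:6
sil_values <- sapply(k_candidates, avg_sil)
best_k <- k_candidates[which.max(sil_values)]
print(data.frame(k = k_candidates, avg_silhouette = round(sil_values, 3)))
```

Using `nstart = 25` reruns k-means from several random starts and keeps the best fit, which makes the comparison across k less sensitive to initialization.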

# Set seed for reproducible continuity
set.seed(5432)

# Use optimal k to assign Group and add the group result to the data frame
k <- 5
result <- kmeans(clust_scaled, centers = k)
data$Cluster <- result$cluster

# Visualize k-means clustering
fviz_cluster(result, data = clust_scaled, stand = TRUE,
             geom = "point", #palette = "jco",
             ggtheme = theme_minimal(),
             main = "K-means Clustering")

# plot clusters over the operational cycle
data %>%
  ggplot(aes(x = Cycle, y = Cluster)) +
  geom_point() +
  labs(title = "Performance Cluster per Cycle",
       subtitle = "From all summary statistics",
       caption = "data source: UCI Machine Learning Repository",
       x = "Cycle",
       y = "Cluster") +
  theme_minimal ()

Performance Cluster Number 1 Conclusions

Although the clustering results from cycle 1 through roughly cycle 1,490 are not entirely satisfactory, the final cycles are distinct enough that their clusters do not overlap, which agrees with the earlier scatter plots. The number of variables used to construct the clusters is likely overcomplicating the separation. For example, the range of values may not be very important in determining overall working conditions. There are likely cycles where the range of a sensor’s values is very small while the machine’s output is optimal in some cases and suboptimal in others, and cycles where the range is very wide with both good and bad outputs. Therefore, further clustering steps refine which summary statistics from each sensor are valuable for separating the cycles from one another.

Performance Cluster Number 2 Data Preparation

# keep only the mean reading of each sensor (drop the other summary statistics)
clust_2_scaled <- subset(clust_scaled, select = c(Mean_CE, Mean_CP, Mean_EPS1, Mean_FS1,
                                                  Mean_FS2, Mean_PS1, Mean_PS2, Mean_PS3,
                                                  Mean_PS4, Mean_PS5, Mean_PS6, Mean_SE,
                                                  Mean_TS1, Mean_TS2, Mean_TS3, Mean_TS4,
                                                  Mean_VS1))

Build Performance Cluster Number 2

# Set seed for reproducible continuity
set.seed(6543)

# Function to determine within cluster sum of squares
wcss <- function(k) {
  kmeans_result <- kmeans(clust_2_scaled, centers = k)
  return(kmeans_result$tot.withinss)}

# Determine number of clusters by using within cluster sum of squares function
k_values <- 1:25
wcss_values <- sapply(k_values, wcss)

# Create elbow plot to determine optimal number of clusters
elbow_plot <- data.frame(k = k_values, WCSS = wcss_values)

ggplot(elbow_plot, aes(x = k, y = WCSS)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Plot for Optimal k",
       x = "Number of Clusters (k)",
       y = "Within-Cluster Sum of Squares (WCSS)")

The elbow plot created after removing many of the summary statistics shows promising results in determining a clear separation between instances. The reduction in the within-cluster sum of squares between four and five clusters is minimal and does not account for a large amount of variability. To avoid overfitting the data set, I will apply four clusters. It is possible that even though multiple sensors show three distinct ranges of values, the combination of those sensors results in similar outputs. Below is a visualization of the overlap and separation of the clusters.
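The judgment that the gain from four to five clusters is minimal can be put in numbers by computing the percent reduction in WCSS for each additional cluster. The sketch below does this on simulated data with four groups, not on the rig data; `pct_drop` and the toy matrix are illustrative names.

```r
# Simulated stand-in for the scaled mean-sensor matrix (four groups)
set.seed(7)
toy <- do.call(rbind, lapply(c(0, 6, 12, 18), function(m) {
  matrix(rnorm(120, mean = m), ncol = 2)
}))

# Total within-cluster sum of squares for k = 1..8
k_values <- 1:8
wcss_values <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Percent reduction in WCSS gained by each additional cluster
pct_drop <- -diff(wcss_values) / head(wcss_values, -1) * 100
print(data.frame(from_k = head(k_values, -1),
                 to_k = k_values[-1],
                 pct_drop = round(pct_drop, 1)))
```

A sharp fall in `pct_drop` after a given k is the numeric counterpart of the elbow seen in the plot.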

# Set seed for reproducible continuity
set.seed(7654)

# Use optimal k to assign Group and add the group result to the data frame
k <- 4
result <- kmeans(clust_2_scaled, centers = k)
data$Cluster_2 <- result$cluster

# Visualize Kmeans Cluster
fviz_cluster(result, data = clust_2_scaled, stand = TRUE,
             geom = "point", #palette = "jco",
             ggtheme = theme_minimal(),
             main = "K-means Clustering")

# plot clusters over the operational cycle
data %>%
  ggplot(aes(x = Cycle, y = Cluster_2)) +
  geom_point() +
  labs(title = "Performance Cluster Number 2 per Cycle",
       subtitle = "From mean sensor reading",
       caption = "data source: UCI Machine Learning Repository",
       x = "Cycle",
       y = "Cluster") +
  theme_minimal ()

Performance Cluster Number 2 Conclusions

By removing summary statistics such as the standard deviation, the maximum, and the range of each sensor’s readings, the k-means clustering closely describes what was visually apparent during exploratory analysis. Although there are interesting occurrences during the first 750 cycles, where the machine’s output is spread across all four clusters, cycles 750 to 2,205 follow a clear pattern. However, it is currently unknown which cluster best represents peak operational effectiveness. Because each of the four clusters occurs frequently along the previously identified patterns, several sensors and diagnostic outputs were briefly inspected by plotting their readings against the cluster number. Cluster number four and cluster number three provide very clear separation between the operational phases.

# convert cluster to factor for visualizations
data$Cluster_2 <- as.factor(data$Cluster_2)

Performance Cluster 2 Visualizations By Sensor

# extract measures of central tendency for each sensor cluster
means_by_cluster_2 <- data %>%
  group_by(Cluster_2) %>%
  summarize(mean_CE = mean(Mean_CE),
            mean_CP = mean(Mean_CP),
            mean_EPS1 = mean(Mean_EPS1),
            mean_FS1 = mean(Mean_FS1),
            mean_FS2 = mean(Mean_FS2),
            mean_PS1 = mean(Mean_PS1),
            mean_PS2 = mean(Mean_PS2),
            mean_PS3 = mean(Mean_PS3),
            mean_PS4 = mean(Mean_PS4),
            mean_PS5 = mean(Mean_PS5),
            mean_PS6 = mean(Mean_PS6),
            mean_SE = mean(Mean_SE),
            mean_TS1 = mean(Mean_TS1),
            mean_TS2 = mean(Mean_TS2),
            mean_TS3 = mean(Mean_TS3),
            mean_TS4 = mean(Mean_TS4),
            mean_VS1 = mean(Mean_VS1))

# visualize results
p1 <- ggplot(means_by_cluster_2, aes(x = Cluster_2, y = mean_CE, fill = Cluster_2)) +
  geom_bar(stat = "identity") +
  labs(title = "Sensor CE",
       x = "Sensor Cluster Number") +
  theme_minimal() +
  theme(legend.position = "none")

p2 <- ggplot(means_by_cluster_2, aes(x = Cluster_2, y = mean_CP, fill = Cluster_2)) +
  geom_bar(stat = "identity") +
  labs(title = "Sensor CP",
       x = "Sensor Cluster Number") +
  theme_minimal() +
  theme(legend.position = "none")

p3 <- ggplot(means_by_cluster_2, aes(x = Cluster_2, y = mean_SE, fill = Cluster_2)) +
  geom_bar(stat = "identity") +
  labs(title = "Sensor SE",
       x = "Sensor Cluster Number") +
  theme_minimal() +
  theme(legend.position = "none")

p4 <- ggplot(means_by_cluster_2, aes(x = Cluster_2, y = mean_EPS1, fill = Cluster_2)) +
  geom_bar(stat = "identity") +
  labs(title = "Sensor EPS1",
       x = "Sensor Cluster Number") +
  theme_minimal() +
  theme(legend.position = "none")

grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)

# visualize results
p5 <- ggplot(means_by_cluster_2, aes(x = Cluster_2, y = mean_TS1, fill = Cluster_2)) +
  geom_bar(stat = "identity") +
  labs(title = "Sensor TS1",
       x = "Sensor Cluster Number") +
  theme_minimal() +
  theme(legend.position = "none")

p6 <- ggplot(means_by_cluster_2, aes(x = Cluster_2, y = mean_TS2, fill = Cluster_2)) +
  geom_bar(stat = "identity") +
  labs(title = "Sensor TS2",
       x = "Sensor Cluster Number") +
  theme_minimal() +
  theme(legend.position = "none")

p7 <- ggplot(means_by_cluster_2, aes(x = Cluster_2, y = mean_TS3, fill = Cluster_2)) +
  geom_bar(stat = "identity") +
  labs(title = "Sensor TS3",
       x = "Sensor Cluster Number") +
  theme_minimal() +
  theme(legend.position = "none")

p8 <- ggplot(means_by_cluster_2, aes(x = Cluster_2, y = mean_TS4, fill = Cluster_2)) +
  geom_bar(stat = "identity") +
  labs(title = "Sensor TS4",
       x = "Sensor Cluster Number") +
  theme_minimal() +
  theme(legend.position = "none")

grid.arrange(p5, p6, p7, p8, nrow = 2, ncol = 2)

# visualize results
p9 <- ggplot(means_by_cluster_2, aes(x = Cluster_2, y = mean_FS1, fill = Cluster_2)) +
  geom_bar(stat = "identity") +
  labs(title = "Sensor FS1",
       x = "Sensor Cluster Number") +
  theme_minimal() +
  theme(legend.position = "none")

p10 <- ggplot(means_by_cluster_2, aes(x = Cluster_2, y = mean_FS2, fill = Cluster_2)) +
  geom_bar(stat = "identity") +
  labs(title = "Sensor FS2",
       x = "Sensor Cluster Number") +
  theme_minimal() +
  theme(legend.position = "none")

p11 <- ggplot(means_by_cluster_2, aes(x = Cluster_2, y = mean_VS1, fill = Cluster_2)) +
  geom_bar(stat = "identity") +
  labs(title = "Sensor VS1",
       x = "Sensor Cluster Number") +
  theme_minimal() +
  theme(legend.position = "none")

grid.arrange(p9, p10, p11, nrow = 2, ncol = 2)

# visualize results
p12 <- ggplot(means_by_cluster_2, aes(x = Cluster_2, y = mean_PS1, fill = Cluster_2)) +
  geom_bar(stat = "identity") +
  labs(title = "Sensor PS1",
       x = "Sensor Cluster Number") +
  theme_minimal() +
  theme(legend.position = "none")

p13 <- ggplot(means_by_cluster_2, aes(x = Cluster_2, y = mean_PS2, fill = Cluster_2)) +
  geom_bar(stat = "identity") +
  labs(title = "Sensor PS2",
       x = "Sensor Cluster Number") +
  theme_minimal() +
  theme(legend.position = "none")

grid.arrange(p12, p13, nrow = 1, ncol = 2)

# visualize results
p14 <- ggplot(means_by_cluster_2, aes(x = Cluster_2, y = mean_PS3, fill = Cluster_2)) +
  geom_bar(stat = "identity") +
  labs(title = "Sensor PS3",
       x = "Sensor Cluster Number") +
  theme_minimal() +
  theme(legend.position = "none")

p15 <- ggplot(means_by_cluster_2, aes(x = Cluster_2, y = mean_PS4, fill = Cluster_2)) +
  geom_bar(stat = "identity") +
  labs(title = "Sensor PS4",
       x = "Sensor Cluster Number") +
  theme_minimal() +
  theme(legend.position = "none")

p16 <- ggplot(means_by_cluster_2, aes(x = Cluster_2, y = mean_PS5, fill = Cluster_2)) +
  geom_bar(stat = "identity") +
  labs(title = "Sensor PS5",
       x = "Sensor Cluster Number") +
  theme_minimal() +
  theme(legend.position = "none")

p17 <- ggplot(means_by_cluster_2, aes(x = Cluster_2, y = mean_PS6, fill = Cluster_2)) +
  geom_bar(stat = "identity") +
  labs(title = "Sensor PS6",
       x = "Sensor Cluster Number") +
  theme_minimal() +
  theme(legend.position = "none")

grid.arrange(p14, p15, p16, p17, nrow = 2, ncol = 2)

Performance Cluster 2 Visualizations with Sensors and Diagnostic Output

# create scatterplot of sensor with cluster separation
data %>%
  ggplot(aes(x = Cycle, y = Cluster_2, color = Cluster_2)) +
  geom_point() +
  labs(title = "Sensor Clusters",
       subtitle = "Colored by Sensor Cluster",
       caption = "data source: UCI Machine Learning Repository",
       x = "Cycle",
       y = "Sensor Cluster") +
  theme_minimal ()

Performance Clustering Conclusion

Additional work is needed to extract the meaning behind the clusters. It is currently known that the machine runs 2,205 cycles with varying levels of efficiency, based on varying readings from the 17 sensors. Even though there appear to be three operational phases, the important finding is the quality of the rig’s output clusters, not the existence of operational phases. It is essential to state that the clusters are based on the sensor readings and not on the cycle number. The clustering algorithm identified differences in the readings at regular enough intervals to indicate that readings differ between operational phases and create a statistical separation in some phases. However, there are some instances where multiple clusters are observed inside one operational phase.
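The observation that some phases contain several clusters can be checked with a cross-tabulation of phase against cluster. The sketch below uses simulated cluster labels and assumed phase boundaries at cycles 750 and 1,500; neither comes from the actual k-means output, and the `Phase` column is hypothetical.

```r
# Simulated stand-in: a cluster label per cycle (not the real k-means output)
set.seed(11)
sim <- data.frame(Cycle = 1:2205)
sim$Cluster_2 <- factor(ifelse(sim$Cycle > 1500, 4,
                               sample(1:3, 2205, replace = TRUE)),
                        levels = 1:4)

# Assumed phase boundaries at cycles 750 and 1500 (approximate)
sim$Phase <- cut(sim$Cycle, breaks = c(0, 750, 1500, 2205),
                 labels = c("Phase 1", "Phase 2", "Phase 3"))

# Rows: phase; columns: cluster; cells: share of the phase's cycles
phase_by_cluster <- prop.table(table(sim$Phase, sim$Cluster_2), margin = 1)
print(round(phase_by_cluster, 2))
```

A row with one cell near 1 indicates a phase dominated by a single cluster, while a row spread across several cells indicates a mixed phase.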

Based on the knowledge gained during exploratory analysis and the k-means clustering, the study is one step closer to providing meaningful conclusions to the hydraulic rig operator by isolating the machine’s performance by cluster. Future efforts will identify which of the clusters best represents peak operational efficiency. Additionally, future research will create benchmarks for each sensor, model the benchmarks through predictive analysis, and create meaningful guidance for the operator of the hydraulic rig to assist the company in reaching its organizational goals.

Diagnostic Clustering

# subset diagnostic variables
diag <- subset(data, select = c(Cooloer, Valve, Leakage, Hydraulic, Stable))

Build Diagnostic Cluster

# Set seed for reproducible continuity
set.seed(8765)

# Function to determine within cluster sum of squares
wcss <- function(k) {
  kmeans_result <- kmeans(diag, centers = k)
  return(kmeans_result$tot.withinss)}

# Determine number of clusters by using within cluster sum of squares function
k_values <- 1:25
wcss_values <- sapply(k_values, wcss)

# Create elbow plot to determine optimal number of clusters
elbow_plot <- data.frame(k = k_values, WCSS = wcss_values)

ggplot(elbow_plot, aes(x = k, y = WCSS)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Plot for Optimal k",
       x = "Number of Clusters (k)",
       y = "Within-Cluster Sum of Squares (WCSS)")

# Set seed for reproducible continuity
set.seed(987)

# Use optimal k to assign Group and add the group result to the data frame
k <- 4
result <- kmeans(diag, centers = k)
data$DIAG_Cluster <- result$cluster

# Visualize k-means clustering
fviz_cluster(result, data = diag, stand = TRUE,
             geom = "point", #palette = "jco",
             ggtheme = theme_minimal(),
             main = "K-means Clustering")

Evaluate Diagnostic Cluster

data$DIAG_Cluster <- as.factor(data$DIAG_Cluster)

data %>% 
  ggplot(aes(x = Cycle, y = DIAG_Cluster, color = DIAG_Cluster)) +
  geom_point() +
  labs(title = "Diagnostic Cluster Across all Cycles",
       subtitle = "Colored by Diagnostic Cluster",
       caption = "data source: UCI Machine Learning Repository",
       x = "Cycle",
       y = "Diagnostic Cluster") +
  theme_minimal()

# Construct data frame of each sensor, the diagnostic outputs, the performance cluster, and diagnostic cluster.
data_1 <- subset(data, select = c(Cycle, Mean_CE, Mean_CP, Mean_EPS1, Mean_FS1, Mean_FS2, Mean_PS1,
                                  Mean_PS2, Mean_PS3,Mean_PS4, Mean_PS5, Mean_PS6, Mean_SE,
                                  Mean_TS1, Mean_TS2, Mean_TS3, Mean_TS4, Mean_VS1, Cooloer, 
                                  Valve, Leakage, Hydraulic, Stable, Cluster_2, DIAG_Cluster))

# gather diagnostic variables
freq_DIAG_long <- data_1 %>%
  gather(key = "variable", value = "value", Cooloer, Valve, Leakage, Hydraulic, Stable) %>%
  group_by(DIAG_Cluster, variable, value)

# count the occurrences of each variable in each cluster
counts <- freq_DIAG_long %>%
  summarize(count = n()) %>%
  ungroup()

# create cooler variable data frame
Cooloer_counts <- counts %>%
  filter(variable == "Cooloer")

# create leakage variable data frame
Leakage_counts <- counts %>%
  filter(variable == "Leakage")

# create valve variable data frame
Valve_counts <- counts %>%
  filter(variable == "Valve")

# create hydraulic variable data frame
Hydraulic_counts <- counts %>%
  filter(variable == "Hydraulic")

# create stable variable data frame
Stable_counts <- counts %>%
  filter(variable == "Stable")

# convert values to factor
Cooloer_counts$value <- as.factor(Cooloer_counts$value)
Hydraulic_counts$value <- as.factor(Hydraulic_counts$value)
Leakage_counts$value <- as.factor(Leakage_counts$value)
Valve_counts$value <- as.factor(Valve_counts$value)
Stable_counts$value <- as.factor(Stable_counts$value)

# create plot of Cooler Condition
Cooler_plot <- ggplot(Cooloer_counts, aes(x = DIAG_Cluster, y = count, fill = value))+
  geom_bar(stat = "identity") +
  labs(title = "Cooler Readings per Diagnostic Cluster") +
  theme_minimal()

# create plot of Hydraulic Condition
Hydraulic_plot <- ggplot(Hydraulic_counts, aes(x = DIAG_Cluster, y = count, fill = value))+
  geom_bar(stat = "identity") +
  labs(title = "Hydraulic Readings per Diagnostic Cluster") +
  theme_minimal()

# create plot of Leakage Conditions
Leakage_plot <- ggplot(Leakage_counts, aes(x = DIAG_Cluster, y = count, fill = value))+
  geom_bar(stat = "identity") +
  labs(title = "Leakage Readings per Diagnostic Cluster") +
  theme_minimal()

# create plot of Valve Conditions
Valve_plot <- ggplot(Valve_counts, aes(x = DIAG_Cluster, y = count, fill = value))+
  geom_bar(stat = "identity") +
  labs(title = "Valve Function per Diagnostic Cluster") +
  theme_minimal()

grid.arrange(Cooler_plot, Hydraulic_plot, Leakage_plot, Valve_plot, nrow = 2, ncol = 2)

# create plot of Stable Conditions
Stable_plot <- ggplot(Stable_counts, aes(x = DIAG_Cluster, y = count, fill = value))+
  geom_bar(stat = "identity") +
  labs(title = "System Status per Diagnostic Cluster") +
  theme_minimal()

print(Stable_plot)

Diagnostic Clustering Conclusions

Clustering the diagnostic outputs identified a meaningful separation based on the diagnostic variables. This is an important finding because it could allow operators of the rig to assess overall system conditions from the diagnostic outputs alone, without special attention to the 17 individual sensors. Inspecting the diagnostic clusters and when they occur across the 2,205 cycles yields interesting results. The third operational phase shows near-perfect separation of the diagnostic cluster. Phases one and two are much messier, each containing multiple diagnostic clusters. However, the connection between the diagnostic cluster and the performance cluster has yet to be established. At the conclusion of this phase of research, the study can conclude that the sensor readings result in performance clusters and the diagnostic variables result in diagnostic clusters. The next step is to establish a meaningful connection between the clusters and identify which performance and diagnostic clusters best represent ideal outputs and optimal system health.

Diagnostic and Sensor Cluster Pairing

# create scatterplot of diagnostic cluster per cycle, colored by sensor cluster
data %>%
  ggplot(aes(x = Cycle, y = DIAG_Cluster, color = Cluster_2)) +
  geom_point() +
  labs(title = "Diagnostic Cluster by Sensor Cluster",
       subtitle = "Colored by Sensor Cluster",
       caption = "data source: UCI Machine Learning Repository",
       x = "Cycle",
       y = "Diagnostic Cluster") +
  theme_minimal()

The scatterplot above shows how each sensor cluster combines with each diagnostic cluster. However, the research still needs to determine which sensor cluster and diagnostic cluster correspond to optimal performance of the machine and the optimal conditions that enable that performance.

A review of knowledge gained from previous analysis:

Sensor Cluster 1:

Sensor Cluster 2:

Sensor Cluster 3:

Sensor Cluster 4:

Diagnostic Cluster 1:

Diagnostic Cluster 2:

Diagnostic Cluster 3:

Diagnostic Cluster 4:

Ideal Diagnostic Cluster

Taken from the data set documentation, the ideal data set is extracted based on the instances where each of the diagnostic variables result in optimal conditions. The Cooler, Valve, Leakage, Hydraulic, and Stable ideal levels are the conditional values under which the ideal data frame is constructed.

# extract all instances of ideal outputs from all data
# (column names are referenced directly inside filter(); "Cooloer" is the column's spelling in the data)
ideal <- data %>% filter(Cooloer == "100",
                         Valve == "100",
                         Leakage == "0",
                         Hydraulic == "130",
                         Stable == "0")

# gather ideal performance cluster number
print(ideal$Cluster_2)

##  [1] 3 3 3 3 3 3 3 3 3 3
## Levels: 1 2 3 4

unique(ideal$Cluster_2)

## [1] 3
## Levels: 1 2 3 4

Within the ideal data frame, performance cluster 3 is the only performance cluster that occurs alongside optimal diagnostic outputs.

# gather diagnostic cluster number from ideal readings
print(ideal$DIAG_Cluster)

##  [1] 2 2 2 2 2 2 2 2 2 2
## Levels: 1 2 3 4

unique(ideal$DIAG_Cluster)

## [1] 2
## Levels: 1 2 3 4

Within the ideal data frame, diagnostic cluster 2 is the only diagnostic cluster that occurs alongside optimal diagnostic outputs.

Performance and Diagnostic Cluster Pairs

# extract performance cluster and diagnostic clusters
grouped_data <- subset(data, select = c(Cluster_2, DIAG_Cluster))

# create table of pairs of performance and diagnostic clusters
counts <- table(grouped_data$Cluster_2, grouped_data$DIAG_Cluster)

# print combination table
print(counts)

##    
##       1   2   3   4
##   1  59   0 365  61
##   2  46   0  75 119
##   3   0 736   3   0
##   4 327   5 289 120

When evaluating the pairing of clusters in the entire data set, performance cluster 3 and diagnostic cluster 2 are paired together at an extremely high frequency. Performance cluster 3 is paired with diagnostic cluster 2 in 736 cycles and with diagnostic cluster 3 in only 3 cycles. Outside of its pairing with performance cluster 3, diagnostic cluster 2 occurs only 5 times, all with performance cluster 4. Therefore, the intersection of performance cluster 3 and diagnostic cluster 2 represents not only the conditions under which the machine can operate while creating optimal system conditions, but also the sensor values that create an ideal operating environment. The next phase of research will evaluate how accurately the sensor values can predict the performance cluster and how accurately the diagnostic values can predict the diagnostic cluster. If high levels of accuracy are found, the study will move forward by constructing a data frame from the intersection of performance cluster 3 and diagnostic cluster 2 to build ranges of sensor values.
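The pairing frequencies discussed above can also be expressed as proportions. The sketch below re-enters the printed table by hand (the name `pair_counts` is an assumption for illustration, not an object from the analysis) and uses base R's `prop.table` to show the share of each performance cluster that falls in each diagnostic cluster, and the reverse.

```r
# Re-create the printed pairing table by hand (rows: performance cluster,
# columns: diagnostic cluster). `pair_counts` is a hypothetical name.
pair_counts <- matrix(
  c( 59,   0, 365,  61,
     46,   0,  75, 119,
      0, 736,   3,   0,
    327,   5, 289, 120),
  nrow = 4, byrow = TRUE,
  dimnames = list(performance = as.character(1:4),
                  diagnostic  = as.character(1:4))
)

# Share of each performance cluster (row) falling in each diagnostic cluster
row_share <- prop.table(pair_counts, margin = 1)

# Share of each diagnostic cluster (column) falling in each performance cluster
col_share <- prop.table(pair_counts, margin = 2)

print(round(row_share, 4))
print(round(col_share, 4))
```

Under these counts, 736 of the 739 cycles in performance cluster 3 (about 99.6%) fall in diagnostic cluster 2, and 736 of the 741 cycles in diagnostic cluster 2 (about 99.3%) fall in performance cluster 3.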

Predictions and Validation

Partition Data

# Construct data frame of each sensor, the diagnostic outputs, the performance cluster, and diagnostic cluster.
data_1 <- subset(data, select = c(Cycle, Mean_CE, Mean_CP, Mean_EPS1, Mean_FS1, Mean_FS2, Mean_PS1,
                                  Mean_PS2, Mean_PS3,Mean_PS4, Mean_PS5, Mean_PS6, Mean_SE,
                                  Mean_TS1, Mean_TS2, Mean_TS3, Mean_TS4, Mean_VS1, Cooloer, 
                                  Valve, Leakage, Hydraulic, Stable, Cluster_2, DIAG_Cluster))

# create data set of 10% of all data for testing
set.seed(1098)
total_rows <- nrow(data_1)
holdout_size <- round(0.1 * total_rows) 

# create hold out data set, drawing the row indices once so they can be reused
holdout_rows <- sample(1:total_rows, holdout_size)
holdout <- data_1[holdout_rows, ]

# create data frame of remaining 90% by excluding the same hold out rows
model_90 <- data_1[-holdout_rows, ]

# create training data set excluding the hold out
set.seed(2109)
total_rows_90 <- nrow(model_90)
split_70 <- round(0.7 * total_rows_90)

# create training data set, drawing the row indices once so they can be reused
training_rows <- sample(1:total_rows_90, split_70)
training_data <- model_90[training_rows, ]

# create validation data set from the rows not used for training
validate_data <- model_90[-training_rows, ]
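One way to confirm that a three-way partition like this has no overlapping rows is to draw each set of indices once and reuse it for both the selected and the excluded rows. The sketch below does this on a hypothetical toy data frame (`toy`, `holdout_idx`, and `train_idx` are illustrative names, not objects from the analysis) and checks that the three pieces are disjoint and together cover every row.

```r
# Toy stand-in for data_1; the real analysis uses the sensor data frame.
toy <- data.frame(Cycle = 1:200, value = seq(0.5, 100, by = 0.5))

set.seed(1098)
n <- nrow(toy)

# Draw the 10% holdout indices once and reuse them for both subsets
holdout_idx <- sample(seq_len(n), round(0.1 * n))
toy_holdout <- toy[holdout_idx, ]
toy_rest    <- toy[-holdout_idx, ]

# Split the remaining 90% into 70% training and 30% validation
train_idx    <- sample(seq_len(nrow(toy_rest)), round(0.7 * nrow(toy_rest)))
toy_train    <- toy_rest[train_idx, ]
toy_validate <- toy_rest[-train_idx, ]

# No cycle should appear in more than one partition, and none should be lost
all_cycles <- c(toy_holdout$Cycle, toy_train$Cycle, toy_validate$Cycle)
stopifnot(!anyDuplicated(all_cycles), length(all_cycles) == n)
```

Calling `sample()` a second time to build the complement would draw a different set of indices, so reusing the stored vector is what keeps the partitions disjoint.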

Build Classification Model Predicting Performance Cluster

Build a Random Forest classification model to evaluate how well the sensors identify performance cluster levels.

# set seed for reproducibility
set.seed(3210)

# define target variable as performance cluster and features because of syntax requirements in the Random Forest package.
target <- "Cluster_2"
features <- c("Mean_CE", "Mean_CP", "Mean_EPS1", "Mean_FS1", "Mean_FS2", "Mean_PS1",
              "Mean_PS2", "Mean_PS3", "Mean_PS4", "Mean_PS5", "Mean_PS6", "Mean_SE",
              "Mean_TS1", "Mean_TS2", "Mean_TS3", "Mean_TS4", "Mean_VS1")

# create random forest model
RF_model <- randomForest(training_data[, target] ~ ., 
                      data = training_data[, features], ntree = 100)

Training Data Evaluation

# make predictions on the training data
RF_model_training <- predict(RF_model, newdata = training_data[, features])

# create confusion matrix of the model evaluated against the training data
confusionMatrix(RF_model_training, training_data[, target])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 296   0   0   0
##          2   0 161   0   0
##          3   0   0 474   0
##          4   0   0   0 459
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9973, 1)
##     No Information Rate : 0.341      
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000   1.0000    1.000   1.0000
## Specificity            1.0000   1.0000    1.000   1.0000
## Pos Pred Value         1.0000   1.0000    1.000   1.0000
## Neg Pred Value         1.0000   1.0000    1.000   1.0000
## Prevalence             0.2129   0.1158    0.341   0.3302
## Detection Rate         0.2129   0.1158    0.341   0.3302
## Detection Prevalence   0.2129   0.1158    0.341   0.3302
## Balanced Accuracy      1.0000   1.0000    1.000   1.0000

Validation Data Evaluation

# make predictions on the validation data
RF_model_validation <- predict(RF_model, newdata = validate_data[, features])

# create confusion matrix on the model against the validation data
confusionMatrix(RF_model_validation, validate_data[, target])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 134   0   0   0
##          2   0  62   0   0
##          3   0   0 193   1
##          4   0   0   0 205
## 
## Overall Statistics
##                                      
##                Accuracy : 0.9983     
##                  95% CI : (0.9907, 1)
##     No Information Rate : 0.3462     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 0.9976     
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000   1.0000   1.0000   0.9951
## Specificity            1.0000   1.0000   0.9975   1.0000
## Pos Pred Value         1.0000   1.0000   0.9948   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   0.9974
## Prevalence             0.2252   0.1042   0.3244   0.3462
## Detection Rate         0.2252   0.1042   0.3244   0.3445
## Detection Prevalence   0.2252   0.1042   0.3261   0.3445
## Balanced Accuracy      1.0000   1.0000   0.9988   0.9976
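
The summary statistics in the output above follow directly from the confusion matrix. As a rough illustration, the sketch below re-enters the validation matrix by hand (`cm` is a hypothetical name) and recomputes overall accuracy, per-class sensitivity, and positive predictive value in base R.

```r
# Validation confusion matrix copied from the output above
# (rows: predicted class, columns: reference class). `cm` is a hypothetical name.
cm <- matrix(
  c(134,  0,   0,   0,
      0, 62,   0,   0,
      0,  0, 193,   1,
      0,  0,   0, 205),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = as.character(1:4),
                  Reference  = as.character(1:4))
)

# Overall accuracy: correctly classified cycles over all cycles
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over reference (column) totals
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value per class: correct predictions over predicted (row) totals
pos_pred_value <- diag(cm) / rowSums(cm)

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(pos_pred_value, 4))
```

The single off-diagonal entry (a class 4 cycle predicted as class 3) is what lowers the class 4 sensitivity to 205/206 and the class 3 positive predictive value to 193/194.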

Holdout Data Evaluation

RF_model_holdout <- predict(RF_model, newdata = holdout[, features])

# create confusion matrix of the model's performance on the holdout data
confusionMatrix(RF_model_holdout, holdout[, target])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  4
##          1 46  0  0  0
##          2  0 33  0  0
##          3  0  0 70  0
##          4  0  0  0 71
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9834, 1)
##     No Information Rate : 0.3227     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000     1.00   1.0000   1.0000
## Specificity            1.0000     1.00   1.0000   1.0000
## Pos Pred Value         1.0000     1.00   1.0000   1.0000
## Neg Pred Value         1.0000     1.00   1.0000   1.0000
## Prevalence             0.2091     0.15   0.3182   0.3227
## Detection Rate         0.2091     0.15   0.3182   0.3227
## Detection Prevalence   0.2091     0.15   0.3182   0.3227
## Balanced Accuracy      1.0000     1.00   1.0000   1.0000

Variable Importance

# extract the variable importance plot from the Random Forest Model of performance cluster
varImpPlot(RF_model)

The Random Forest model shows that the original concern, that sensor PS4 would be highly predictive of the separation between performance clusters, is unwarranted: PS4 is the least informative variable. Therefore, even though the model and study conclusions are drawn from data that include one highly malfunctioning sensor, the results will likely remain valid when the model is exposed to real-world data.
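
A variable importance plot can be complemented by the underlying numbers. The sketch below is a minimal illustration on R's built-in `iris` data rather than the sensor data; it uses `importance()` from the randomForest package to extract the mean decrease in Gini impurity and sorts the features from most to least informative. The object names (`rf_demo`, `imp`, `imp_sorted`) are assumptions for illustration.

```r
library(randomForest)

set.seed(3210)

# Fit a small forest on the built-in iris data as a stand-in for the sensor data
rf_demo <- randomForest(Species ~ ., data = iris, ntree = 100)

# importance() returns a matrix; for a classification forest fitted without
# importance = TRUE it holds a single MeanDecreaseGini column
imp <- importance(rf_demo)

# Sort features from most to least informative
imp_sorted <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
print(imp_sorted)
```

Applied to the analysis, the same call on `RF_model` would list the sensors in the order shown in the plot, with the least informative sensor in the last row.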

Build Classification Model Predicting Diagnostic Cluster

Build a Random Forest classification model to evaluate how well the diagnostic variables identify diagnostic cluster levels.

# set seed for reproducibility
set.seed(4321)

# define target variable as diagnostic cluster and features because of syntax requirements in the Random Forest package.
target_1 <- "DIAG_Cluster"
features_1 <- c("Cooloer", "Valve", "Leakage", "Hydraulic", "Stable")

# create random forest model
RF_model_DIAG <- randomForest(training_data[, target_1] ~ ., 
                      data = training_data[, features_1], ntree = 100)

Training Data Evaluation

# make predictions on the training data
RF_model_training_DIAG <- predict(RF_model_DIAG, newdata = training_data[, features_1])

# create confusion matrix of the model evaluated against the training data
confusionMatrix(RF_model_training_DIAG, training_data[, target_1])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 267   0   0   0
##          2   0 477   0   0
##          3   0   0 463   0
##          4   0   0   0 183
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9973, 1)
##     No Information Rate : 0.3432     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000
## Prevalence             0.1921   0.3432   0.3331   0.1317
## Detection Rate         0.1921   0.3432   0.3331   0.1317
## Detection Prevalence   0.1921   0.3432   0.3331   0.1317
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000

Validation Data Evaluation

# make predictions on the validation data
RF_model_validation_DIAG <- predict(RF_model_DIAG, newdata = validate_data[, features_1])

# create confusion matrix on the model against the validation data
confusionMatrix(RF_model_validation_DIAG, validate_data[, target_1])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 121   0   0   0
##          2   0 196   0   0
##          3   0   0 206   0
##          4   0   0   0  72
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9938, 1)
##     No Information Rate : 0.3462     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000   1.0000   1.0000    1.000
## Specificity            1.0000   1.0000   1.0000    1.000
## Pos Pred Value         1.0000   1.0000   1.0000    1.000
## Neg Pred Value         1.0000   1.0000   1.0000    1.000
## Prevalence             0.2034   0.3294   0.3462    0.121
## Detection Rate         0.2034   0.3294   0.3462    0.121
## Detection Prevalence   0.2034   0.3294   0.3462    0.121
## Balanced Accuracy      1.0000   1.0000   1.0000    1.000

Holdout Data Evaluation

RF_model_holdout_DIAG <- predict(RF_model_DIAG, newdata = holdout[, features_1])

# create confusion matrix of the model's performance on the holdout data
confusionMatrix(RF_model_holdout_DIAG, holdout[, target_1])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  4
##          1 34  0  0  0
##          2  0 69  0  0
##          3  0  0 62  0
##          4  0  0  0 55
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9834, 1)
##     No Information Rate : 0.3136     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000   1.0000   1.0000     1.00
## Specificity            1.0000   1.0000   1.0000     1.00
## Pos Pred Value         1.0000   1.0000   1.0000     1.00
## Neg Pred Value         1.0000   1.0000   1.0000     1.00
## Prevalence             0.1545   0.3136   0.2818     0.25
## Detection Rate         0.1545   0.3136   0.2818     0.25
## Detection Prevalence   0.1545   0.3136   0.2818     0.25
## Balanced Accuracy      1.0000   1.0000   1.0000     1.00

# extract the variable importance plot from the Random Forest Model of diagnostic cluster
varImpPlot(RF_model_DIAG)

Classification Conclusions

The Random Forest models built from the preceding analytical conclusions fit extremely well when predicting performance cluster and diagnostic cluster levels. Given how isolated the pairing is, and given the 100% accuracy in predicting performance cluster 3 and the 100% accuracy in predicting diagnostic cluster 2, guidance that keeps the system functioning within the bounds that define a cycle as performance cluster 3 should also produce optimal output in the diagnostic variables.

The final step in the research process aims to identify ranges of values that result in optimal outputs. It is currently known that if sensor readings result in performance cluster number 3, then the hydraulic system is functioning at peak levels. Therefore, the next logical step will be to test the ranges of sensor levels to determine the bounds of values that result in creating a performance cluster number 3 classification.

# Extract all instances where the system is functioning in performance cluster 3 and diagnostic cluster 2
ideal <- data %>% filter(Cluster_2 == "3",
                         DIAG_Cluster == "2")

# visual inspection of sensor readings within performance cluster 3 and diagnostic cluster 2
ideal %>%
  ggplot(aes(x = Cycle, y = Mean_VS1)) +
  geom_point() +
  theme_minimal()
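
The bounds for every sensor can also be gathered in one table rather than inspected one plot at a time. The sketch below assumes a data frame shaped like `ideal`, with one `Mean_*` column per sensor; the `toy_ideal` values and the names `sensor_cols` and `sensor_ranges` are made up for illustration.

```r
# Hypothetical stand-in for the ideal data frame (values are illustrative only)
toy_ideal <- data.frame(
  Cycle    = 1:5,
  Mean_CE  = c(46.8, 47.1, 47.4, 46.5, 47.0),
  Mean_VS1 = c(0.55, 0.53, 0.56, 0.54, 0.58)
)

# Columns to summarise; in the analysis this would be the `features` vector
sensor_cols <- c("Mean_CE", "Mean_VS1")

# One row per sensor with the minimum and maximum observed value
sensor_ranges <- data.frame(
  sensor = sensor_cols,
  min    = sapply(toy_ideal[sensor_cols], min),
  max    = sapply(toy_ideal[sensor_cols], max),
  row.names = NULL
)

print(sensor_ranges)
```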

Validation from Artificial Data

Previous steps identified the pairing of performance cluster number 3 and diagnostic cluster number 2 as representing ideal operating function and system health. Every cycle with that pairing goes into the ideal data frame. Values drawn at random from the ranges of the ideal readings are then imputed into the training, validation, and holdout data to create the artificial data used to test previous conclusions. Finally, the model built from the real data is applied to this mix of real readings and artificially imputed ideal sensor values.

Build Artificial Data from Ideal Cycles

Sensor CE

# Extract ideal ranges
range_values_CE <- range(ideal$Mean_CE)

# Generate random values of ideal sensor readings
random_values_CE <- runif(n = 736, min = range_values_CE[1], max = range_values_CE[2])

# build artificial data frame
art_CE <- data.frame(Mean_CE = random_values_CE)

Sensor CP

# Extract ideal ranges
range_values_CP <- range(ideal$Mean_CP)

# Generate random values of ideal sensor readings
random_values_CP <- runif(n = 736, min = range_values_CP[1], max = range_values_CP[2])

# build artificial data frame
art_CP <- data.frame(Mean_CP = random_values_CP)

Sensor EPS1

# Extract ideal ranges
range_values_EPS1 <- range(ideal$Mean_EPS1)

# Generate random values of ideal sensor readings
random_values_EPS1 <- runif(n = 736, min = range_values_EPS1[1], max = range_values_EPS1[2])

# build artificial data frame
art_EPS1 <- data.frame(Mean_EPS1 = random_values_EPS1)

Sensor FS1

# Extract ideal ranges
range_values_FS1 <- range(ideal$Mean_FS1)

# Generate random values of ideal sensor readings
random_values_FS1 <- runif(n = 736, min = range_values_FS1[1], max = range_values_FS1[2])

# build artificial data frame
art_FS1 <- data.frame(Mean_FS1 = random_values_FS1)

Sensor FS2

# Extract ideal ranges
range_values_FS2 <- range(ideal$Mean_FS2)

# Generate random values of ideal sensor readings
random_values_FS2 <- runif(n = 736, min = range_values_FS2[1], max = range_values_FS2[2])

# build artificial data frame
art_FS2 <- data.frame(Mean_FS2 = random_values_FS2)

Sensor PS1

# Extract ideal ranges
range_values_PS1 <- range(ideal$Mean_PS1)

# Generate random values of ideal sensor readings
random_values_PS1 <- runif(n = 736, min = range_values_PS1[1], max = range_values_PS1[2])

# build artificial data frame
art_PS1 <- data.frame(Mean_PS1 = random_values_PS1)

Sensor PS2

# Extract ideal ranges
range_values_PS2 <- range(ideal$Mean_PS2)

# Generate random values of ideal sensor readings
random_values_PS2 <- runif(n = 736, min = range_values_PS2[1], max = range_values_PS2[2])

# build artificial data frame
art_PS2 <- data.frame(Mean_PS2 = random_values_PS2)

Sensor PS3

# Extract ideal ranges
range_values_PS3 <- range(ideal$Mean_PS3)

# Generate random values of ideal sensor readings
random_values_PS3 <- runif(n = 736, min = range_values_PS3[1], max = range_values_PS3[2])

# build artificial data frame
art_PS3 <- data.frame(Mean_PS3 = random_values_PS3)

Sensor PS4

# Extract ideal ranges
range_values_PS4 <- range(ideal$Mean_PS4)

# Generate random values of ideal sensor readings
random_values_PS4 <- runif(n = 736, min = range_values_PS4[1], max = range_values_PS4[2])

# build artificial data frame
art_PS4 <- data.frame(Mean_PS4 = random_values_PS4)

Sensor PS5

# Extract ideal ranges
range_values_PS5 <- range(ideal$Mean_PS5)

# Generate random values of ideal sensor readings
random_values_PS5 <- runif(n = 736, min = range_values_PS5[1], max = range_values_PS5[2])

# build artificial data frame
art_PS5 <- data.frame(Mean_PS5 = random_values_PS5)

Sensor PS6

# Extract ideal ranges
range_values_PS6 <- range(ideal$Mean_PS6)

# Generate random values of ideal sensor readings
random_values_PS6 <- runif(n = 736, min = range_values_PS6[1], max = range_values_PS6[2])

# build artificial data frame
art_PS6 <- data.frame(Mean_PS6 = random_values_PS6)

Sensor SE

# Extract ideal ranges
range_values_SE <- range(ideal$Mean_SE)

# Generate random values of ideal sensor readings
random_values_SE <- runif(n = 736, min = range_values_SE[1], max = range_values_SE[2])

# build artificial data frame
art_SE <- data.frame(Mean_SE = random_values_SE)

Sensor TS1

# Extract ideal ranges
range_values_TS1 <- range(ideal$Mean_TS1)

# Generate random values of ideal sensor readings
random_values_TS1 <- runif(n = 736, min = range_values_TS1[1], max = range_values_TS1[2])

# build artificial data frame
art_TS1 <- data.frame(Mean_TS1 = random_values_TS1)

Sensor TS2

# Extract ideal ranges
range_values_TS2 <- range(ideal$Mean_TS2)

# Generate random values of ideal sensor readings
random_values_TS2 <- runif(n = 736, min = range_values_TS2[1], max = range_values_TS2[2])

# build artificial data frame
art_TS2 <- data.frame(Mean_TS2 = random_values_TS2)

Sensor TS3

# Extract ideal ranges
range_values_TS3 <- range(ideal$Mean_TS3)

# Generate random values of ideal sensor readings
random_values_TS3 <- runif(n = 736, min = range_values_TS3[1], max = range_values_TS3[2])

# build artificial data frame
art_TS3 <- data.frame(Mean_TS3 = random_values_TS3)

Sensor TS4

# Extract ideal ranges
range_values_TS4 <- range(ideal$Mean_TS4)

# Generate random values of ideal sensor readings
random_values_TS4 <- runif(n = 736, min = range_values_TS4[1], max = range_values_TS4[2])

# build artificial data frame
art_TS4 <- data.frame(Mean_TS4 = random_values_TS4)

Sensor VS1

# Extract ideal ranges
range_values_VS1 <- range(ideal$Mean_VS1)

# Generate random values of ideal sensor readings
random_values_VS1 <- runif(n = 736, min = range_values_VS1[1], max = range_values_VS1[2])

# build artificial data frame
art_VS1 <- data.frame(Mean_VS1 = random_values_VS1)
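
The per-sensor blocks above all follow the same pattern: take the range of a column, then draw uniform random values within it. A more compact way to express that pattern is sketched below on a hypothetical toy data frame; `make_artificial` is an illustrative helper, not a function from the analysis, and a seed is set here only so the sketch is repeatable.

```r
# Draw n uniform random values within the observed range of each listed column.
# make_artificial() is a hypothetical helper for illustration.
make_artificial <- function(df, cols, n = nrow(df)) {
  draws <- lapply(cols, function(col) {
    bounds <- range(df[[col]])
    runif(n = n, min = bounds[1], max = bounds[2])
  })
  names(draws) <- cols
  as.data.frame(draws)
}

# Toy stand-in for the ideal data frame (values are illustrative only)
toy_ideal <- data.frame(
  Mean_CE = c(46.8, 47.1, 47.4, 46.5, 47.0),
  Mean_CP = c(2.15, 2.16, 2.14, 2.17, 2.16)
)

set.seed(123)
toy_art <- make_artificial(toy_ideal, c("Mean_CE", "Mean_CP"))

# Every artificial value stays inside the observed range of its column
stopifnot(all(toy_art$Mean_CE >= 46.5 & toy_art$Mean_CE <= 47.4))
print(dim(toy_art))
```

Because each column is sampled independently, any correlation between sensors in the ideal cycles is not preserved in the artificial rows; that holds for the per-sensor blocks as well as for this helper.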

Build Artificial Data Frame

# build data frame
art_data <- cbind(ideal[,1],art_CE[,1], art_CP[,1], art_EPS1[,1], art_FS1[,1], art_FS2[,1],
                  art_PS1[,1],art_PS2[,1], art_PS3[,1], art_PS4[,1], art_PS5[,1], art_PS6[,1],
                  art_SE[,1], art_TS1[,1], art_TS2[,1], art_TS3[,1], art_TS4[,1], art_VS1[,1], 
                  ideal[,104],ideal[,105], ideal[,106], ideal[, 107],ideal[,108],ideal[,110],
                  ideal[,111])

# rename variables in artificial data frame
colnames(art_data) <- c("Cycle", "Mean_CE", "Mean_CP", "Mean_EPS1", "Mean_FS1", "Mean_FS2",
                        "Mean_PS1","Mean_PS2", "Mean_PS3", "Mean_PS4", "Mean_PS5", "Mean_PS6",
                        "Mean_SE", "Mean_TS1", "Mean_TS2", "Mean_TS3", "Mean_TS4", "Mean_VS1",
                        "Cooloer", "Valve", "Leakage", "Hydraulic", "Stable", "Cluster_2",
                        "DIAG_Cluster")

art_data <- as.data.frame(art_data)

art_data$Cycle <- as.integer(art_data$Cycle)

Artificial Data Imputation

# create copies of previously partitioned data
art_training <- training_data
art_validate_data <- validate_data
art_holdout <- holdout

# inspect head of art_training data frame to view values where artificial data will be imputed
head(art_training)

##      Cycle  Mean_CE  Mean_CP Mean_EPS1 Mean_FS1  Mean_FS2 Mean_PS1 Mean_PS2
## 2044  2044 47.17933 2.163750  2538.769 6.696015 10.138720 161.0080 109.4216
## 2152  2152 46.84297 2.157933  2553.103 6.505655 10.169322 160.6722 109.3911
## 894    894 26.48470 1.722883  2446.987 6.647622  9.615157 158.3335 107.3595
## 1565  1565 46.81825 2.146350  2544.563 6.696937 10.175443 161.0075 109.5842
## 1727  1727 47.13560 2.157183  2568.774 6.513843 10.205983 160.7993 109.0645
## 217    217 19.84677 1.526833  2423.716 6.234048  9.202648 156.5521 105.3114
##      Mean_PS3  Mean_PS4 Mean_PS5 Mean_PS6  Mean_SE Mean_TS1 Mean_TS2 Mean_TS3
## 2044 1.998423 10.066951 9.840798 9.724740 2.163750 36.22537 41.85660 39.08988
## 2152 1.935089 10.159732 9.929748 9.810843 2.157933 35.54420 41.19658 38.48865
## 894  1.793494  0.000000 8.999407 8.922607 1.722883 46.47405 51.47053 48.68048
## 1565 2.004334  2.046086 9.953961 9.837103 2.146350 35.51632 41.11645 38.42338
## 1727 1.943957 10.151640 9.918286 9.798094 2.157183 35.86712 41.50285 38.74382
## 217  1.645498  0.000000 8.539969 8.482748 1.526833 54.27212 58.57088 55.69443
##      Mean_TS4  Mean_VS1 Cooloer Valve Leakage Hydraulic Stable Cluster_2
## 2044 31.17073 0.5485667     100    90       0       100      0         3
## 2152 30.61188 0.5500167     100   100       1        90      0         3
## 894  42.02797 0.5960667      20   100       0        90      1         4
## 1565 30.59072 0.5606833     100   100       0        90      1         3
## 1727 30.89527 0.5317333     100    80       1       130      0         3
## 217  49.54620 0.6820500       3    73       2       130      0         1
##      DIAG_Cluster
## 2044            2
## 2152            2
## 894             1
## 1565            2
## 1727            2
## 217             3

# Ensure consistent data types before merging
art_data$Cluster_2 <- as.factor(art_data$Cluster_2)
art_data$DIAG_Cluster <- as.factor(art_data$DIAG_Cluster)

# conduct a left join on Cycle and, where a cycle matches, replace the original training data with the artificial data
art_training <- art_training %>%
  left_join(art_data, by = "Cycle", suffix = c(".art_training", ".art_data")) %>%
  mutate(across(ends_with(".art_data"), ~coalesce(.x, get(sub(".art_data$", ".art_training",
                                                              cur_column()))))) %>%
  dplyr::select(-ends_with(".art_training")) %>%
  rename_with(~sub(".art_data$", "", .x))

# inspect head of art training to ensure accurate imputation
head(art_training)

##   Cycle  Mean_CE  Mean_CP Mean_EPS1 Mean_FS1  Mean_FS2 Mean_PS1 Mean_PS2
## 1  2044 47.44357 2.572285  2539.823 6.437802 10.082620 160.8398 109.0102
## 2  2152 46.39320 2.290994  2558.770 6.662722  9.888742 160.4907 109.2777
## 3   894 26.48470 1.722883  2446.987 6.647622  9.615157 158.3335 107.3595
## 4  1565 47.89630 2.551723  2526.149 6.716038 10.050091 160.9605 109.4162
## 5  1727 47.06645 2.627310  2580.483 6.522880  9.991097 160.8611 108.7234
## 6   217 19.84677 1.526833  2423.716 6.234048  9.202648 156.5521 105.3114
##   Mean_PS3  Mean_PS4 Mean_PS5 Mean_PS6  Mean_SE Mean_TS1 Mean_TS2 Mean_TS3
## 1 2.015768 8.2845969 9.851876 9.764331 2.673329 36.73490 41.03245 42.75297
## 2 1.922212 4.1596210 9.800329 9.574424 2.137297 39.34479 42.29169 42.09126
## 3 1.793494 0.0000000 8.999407 8.922607 1.722883 46.47405 51.47053 48.68048
## 4 1.929507 6.5309370 9.796850 9.605988 2.524816 35.65026 44.48567 39.91238
## 5 1.928483 0.9840546 9.756299 9.826330 2.123440 36.13523 44.43885 39.78657
## 6 1.645498 0.0000000 8.539969 8.482748 1.526833 54.27212 58.57088 55.69443
##   Mean_TS4  Mean_VS1 Cooloer Valve Leakage Hydraulic Stable Cluster_2
## 1 31.03144 0.5496930     100    90       0       100      0         3
## 2 31.75824 0.5837170     100   100       1        90      0         3
## 3 42.02797 0.5960667      20   100       0        90      1         4
## 4 30.98200 0.5909984     100   100       0        90      1         3
## 5 30.85816 0.5529344     100    80       1       130      0         3
## 6 49.54620 0.6820500       3    73       2       130      0         1
##   DIAG_Cluster
## 1            2
## 2            2
## 3            1
## 4            2
## 5            2
## 6            3
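
The join above replaces values in the partitioned data with artificial values wherever the cycle numbers match, and leaves unmatched rows unchanged. The same idea can be sketched in base R with `match()`; the toy data frames and names below are assumptions for illustration, and this is an alternative formulation rather than the code used in the analysis.

```r
# Toy stand-ins: an original partition and artificial values for some cycles
toy_training <- data.frame(Cycle   = c(2044, 894, 1565),
                           Mean_CE = c(47.18, 26.48, 46.82))
toy_art      <- data.frame(Cycle   = c(1565, 2044),
                           Mean_CE = c(47.90, 47.44))

# Position of each training cycle in the artificial data (NA when absent)
idx <- match(toy_training$Cycle, toy_art$Cycle)
hit <- !is.na(idx)

# Overwrite only the matched rows; unmatched rows keep their original values
cols <- "Mean_CE"
toy_imputed <- toy_training
toy_imputed[hit, cols] <- toy_art[idx[hit], cols]

print(toy_imputed)
```

In this toy example, cycles 2044 and 1565 take the artificial values, while cycle 894 keeps its original reading. This mirrors rows 3 and 6 of the output above, which are unchanged after imputation.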

# visual inspection of original data
training_data %>%
  ggplot(aes(x = Cycle, y = Mean_CE)) +
  geom_point() +
  labs(title = "Mean Cooling Efficiency",
       subtitle = "Measured by Sensor CE (%)",
       caption = "data source: UCI Machine Learning Repository",
       x = "Cycle",
       y = "%") +
  theme_minimal()

# visual inspection of artificial data
art_training %>% 
  ggplot(aes(x = Cycle, y = Mean_CE)) +
  labs(title = "Mean Cooling Efficiency",
       subtitle = "Created from artificial data",
       caption = "data source: UCI Machine Learning Repository",
       x = "Cycle",
       y = "%") +
  geom_point() +
  theme_minimal()

# replicate imputation on validate data
art_validate_data <- art_validate_data %>%
  left_join(art_data, by = "Cycle", suffix = c(".art_validate_data", ".art_data")) %>%
  mutate(across(ends_with(".art_data"), 
                ~coalesce(.x, get(sub(".art_data$", ".art_validate_data", cur_column()))))) %>%
  dplyr::select(-ends_with(".art_validate_data")) %>%
  rename_with(~sub(".art_data$", "", .x))

# replicate imputation on holdout data
art_holdout <- art_holdout %>%
  left_join(art_data, by = "Cycle", suffix = c(".art_holdout", ".art_data")) %>%
  mutate(across(ends_with(".art_data"), 
                ~coalesce(.x, get(sub(".art_data$", ".art_holdout", cur_column()))))) %>%
  dplyr::select(-ends_with(".art_holdout")) %>%
  rename_with(~sub(".art_data$", "", .x))

art_training <- subset(art_training, select = -c(Cluster_2))
art_validate_data <- subset(art_validate_data, select = -c(Cluster_2))
art_holdout <- subset(art_holdout, select = -c(Cluster_2))

Next, determine whether the sensor values can still accurately predict the “Cluster_2” level. If Cluster_2 level 3 is still easily identified, then the ranges are valid, and the user can confirm that the system is functioning optimally by ensuring the diagnostic values fall in diagnostic cluster 2, the diagnostic cluster paired with performance cluster 3 earlier in the study.

Data Range Testing

The model, target, and features have been established in previous research steps. Therefore, the study only needs to apply the old model to the new data.
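
The idea of range testing can be illustrated on a small example: fit a classifier, draw artificial rows uniformly within the feature ranges of one class, and check which class the fitted model assigns to them. The sketch below uses the built-in `iris` data and the randomForest package as a stand-in for the sensor data; all object names are hypothetical.

```r
library(randomForest)

set.seed(4321)

# Fit a forest on the real (here: iris) data
feature_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
rf_demo <- randomForest(x = iris[, feature_cols], y = iris$Species, ntree = 100)

# Take one class as the "ideal" group and draw uniform values within its ranges
ideal_demo <- iris[iris$Species == "setosa", feature_cols]
art_demo <- as.data.frame(lapply(ideal_demo, function(v) {
  runif(n = 50, min = min(v), max = max(v))
}))

# Apply the existing model to the artificial rows and tabulate the predictions
art_pred <- predict(rf_demo, newdata = art_demo)
print(table(art_pred))
```

If most artificial rows are assigned to the class whose ranges they were drawn from, the ranges are consistent with what the model learned for that class.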

Artificial Training Data Evaluation

# make predictions on the artificial training data using the original model
RF_model_training_art <- predict(RF_model, newdata = art_training[, features])

art_training$Cluster_2 <- RF_model_training_art

# create confusion matrix of the model evaluated against the training data
confusionMatrix(RF_model_training_art, training_data[, target])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 296   0   0   0
##          2   0 161   0   0
##          3   0   0 474   0
##          4   0   0   0 459
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9973, 1)
##     No Information Rate : 0.341      
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000   1.0000    1.000   1.0000
## Specificity            1.0000   1.0000    1.000   1.0000
## Pos Pred Value         1.0000   1.0000    1.000   1.0000
## Neg Pred Value         1.0000   1.0000    1.000   1.0000
## Prevalence             0.2129   0.1158    0.341   0.3302
## Detection Rate         0.2129   0.1158    0.341   0.3302
## Detection Prevalence   0.2129   0.1158    0.341   0.3302
## Balanced Accuracy      1.0000   1.0000    1.000   1.0000

Validation Data Evaluation

# make predictions on the validation data
RF_model_validation_art <- predict(RF_model, newdata = art_validate_data[, features])

art_validate_data$Cluster_2 <- RF_model_validation_art

# create confusion matrix on the model against the validation data
confusionMatrix(RF_model_validation_art, validate_data[, target])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 134   0   0   0
##          2   0  62   0   0
##          3   0   0 193   1
##          4   0   0   0 205
## 
## Overall Statistics
##                                      
##                Accuracy : 0.9983     
##                  95% CI : (0.9907, 1)
##     No Information Rate : 0.3462     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 0.9976     
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000   1.0000   1.0000   0.9951
## Specificity            1.0000   1.0000   0.9975   1.0000
## Pos Pred Value         1.0000   1.0000   0.9948   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   0.9974
## Prevalence             0.2252   0.1042   0.3244   0.3462
## Detection Rate         0.2252   0.1042   0.3244   0.3445
## Detection Prevalence   0.2252   0.1042   0.3261   0.3445
## Balanced Accuracy      1.0000   1.0000   0.9988   0.9976
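
The accuracy and Kappa values reported above can be checked by hand from the confusion matrix. The sketch below re-enters the validation counts in base R and recomputes both statistics. The object names (`cm`, `p_observed`, `p_expected`, `cohen_kappa`) are illustrative and are not part of the original analysis.

```r
# Validation confusion matrix copied from the caret output above
# (rows = predictions, columns = reference classes).
cm <- matrix(c(134,  0,   0,   0,
                 0, 62,   0,   0,
                 0,  0, 193,   1,
                 0,  0,   0, 205),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = as.character(1:4),
                             Reference  = as.character(1:4)))

n <- sum(cm)

# Observed agreement: share of cycles on the diagonal (accuracy).
p_observed <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's Kappa rescales accuracy by the chance agreement.
cohen_kappa <- (p_observed - p_expected) / (1 - p_expected)

round(c(Accuracy = p_observed, Kappa = cohen_kappa), 4)
```

Both values round to the figures in the caret output (0.9983 and 0.9976), which confirms that the single off-diagonal cycle accounts for the entire gap from a perfect score.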

Holdout Data Evaluation

RF_model_holdout_art <- predict(RF_model, newdata = art_holdout[, features])

art_holdout$Cluster_2 <- RF_model_holdout_art

# create confusion matrix on the model's performance on the holdout data
confusionMatrix(RF_model_holdout_art, art_holdout[, target])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  4
##          1 46  0  0  0
##          2  0 33  0  0
##          3  0  0 70  0
##          4  0  0  0 71
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9834, 1)
##     No Information Rate : 0.3227     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000     1.00   1.0000   1.0000
## Specificity            1.0000     1.00   1.0000   1.0000
## Pos Pred Value         1.0000     1.00   1.0000   1.0000
## Neg Pred Value         1.0000     1.00   1.0000   1.0000
## Prevalence             0.2091     0.15   0.3182   0.3227
## Detection Rate         0.2091     0.15   0.3182   0.3227
## Detection Prevalence   0.2091     0.15   0.3182   0.3227
## Balanced Accuracy      1.0000     1.00   1.0000   1.0000

Faulty Sensor Identification / Faulty Diagnostic Identification

This section highlights the effects of faulty sensors and the conclusions that can be drawn from them.

Test 1 - impute zeros in all values for TS1 in performance cluster 3, then run the random forest predictive model on the data.

I will start at the low end of the variable importance plot and impute 0 values for that variable. I then apply the original random forest model that predicts performance cluster. Changes are applied only to performance cluster number 3. If the number of cycles the model assigns to performance cluster number 3 changes, then the sensor has affected the assessment of the machine.
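
The procedure described above can be sketched in a few lines of base R. Everything in this example is an assumption made for illustration: `toy` is made-up data, `inject_zero_fault()` is a hypothetical helper, and `toy_predict()` is a stand-in rule that replaces the random forest so the snippet runs on its own. It is not the model used in this study.

```r
# Made-up stand-in for sensor_test: two sensors and a cluster label.
toy <- data.frame(
  Cluster_2 = factor(c(3, 3, 3, 4, 4, 1), levels = 1:4),
  Mean_TS1  = c(36.1, 37.4, 38.9, 41.2, 45.0, 55.3),
  Mean_PS1  = c(160, 161, 159, 158, 157, 150)
)

# Hypothetical helper: overwrite one sensor with zeros, but only in the
# rows that belong to the chosen cluster level.
inject_zero_fault <- function(data, sensor, level) {
  rows <- data$Cluster_2 == level
  data[rows, sensor] <- 0
  data
}

# Stand-in classifier (NOT the random forest): labels a cycle from the
# pressure reading alone, so it never looks at Mean_TS1.
toy_predict <- function(data) {
  label <- ifelse(data$Mean_PS1 >= 158.5, 3,
                  ifelse(data$Mean_PS1 >= 155, 4, 1))
  factor(label, levels = 1:4)
}

faulty <- inject_zero_fault(toy, sensor = "Mean_TS1", level = 3)

# Compare how often cluster 3 is predicted before and after the fault.
baseline_count <- sum(toy_predict(toy) == 3)
faulty_count   <- sum(toy_predict(faulty) == 3)

c(baseline = baseline_count, faulty = faulty_count)
```

Because the stand-in rule ignores `Mean_TS1`, the two counts are equal in this toy case. With a real model, a difference between the two counts would be the signal that the zeroed sensor influences the classification.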

# data selection for test 1
sensor_test <- subset(data_1, select =-c(Hydraulic, Cooloer, Stable, 
                                         Valve, Leakage, DIAG_Cluster))

# count instances of Cluster_2 = 3
sum(sensor_test$Cluster_2 == "3")

## [1] 739

# copy data frame for test 1
test_1 <- sensor_test

# impute zero values into a temperature sensor only in sensor cluster 3
test_1 <- test_1 %>%
  mutate(Mean_TS1 = ifelse(Cluster_2 == 3, 0, Mean_TS1))

# plot the altered sensor to confirm the zeros were imputed
test_1 %>%
  ggplot(aes(x = Cycle, y = Mean_TS1)) +
  geom_point() +
  labs(title = "Test Number 1",
       caption = "0 values for Sensor TS1 in Cluster_2 level 3") +
  theme_minimal()

Establish Baseline Predictions with Unaltered Data

# make predictions on the unaltered sensor data
baseline_predictions <- predict(RF_model, newdata = sensor_test[, features])

# create confusion matrix of the model evaluated against the unaltered data
confusionMatrix(baseline_predictions, sensor_test[, target])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 485   0   0   1
##          2   0 240   0   0
##          3   0   0 738   1
##          4   0   0   1 739
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9986         
##                  95% CI : (0.996, 0.9997)
##     No Information Rate : 0.3361         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9981         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000   1.0000   0.9986   0.9973
## Specificity            0.9994   1.0000   0.9993   0.9993
## Pos Pred Value         0.9979   1.0000   0.9986   0.9986
## Neg Pred Value         1.0000   1.0000   0.9993   0.9986
## Prevalence             0.2200   0.1088   0.3351   0.3361
## Detection Rate         0.2200   0.1088   0.3347   0.3351
## Detection Prevalence   0.2204   0.1088   0.3351   0.3356
## Balanced Accuracy      0.9997   1.0000   0.9990   0.9983

Apply Random Forest Model on Failing Sensor Data

# gather predictions of sensor cluster using original model
model_test_1 <- predict(RF_model, newdata = test_1[, features])

# create confusion matrix on the model's performance on the test data
confusionMatrix(model_test_1, test_1[, target])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 485   0   0   1
##          2   0 240   0   0
##          3   0   0 739   1
##          4   0   0   0 739
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9991          
##                  95% CI : (0.9967, 0.9999)
##     No Information Rate : 0.3361          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9987          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000   1.0000   1.0000   0.9973
## Specificity            0.9994   1.0000   0.9993   1.0000
## Pos Pred Value         0.9979   1.0000   0.9986   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   0.9986
## Prevalence             0.2200   0.1088   0.3351   0.3361
## Detection Rate         0.2200   0.1088   0.3351   0.3351
## Detection Prevalence   0.2204   0.1088   0.3356   0.3351
## Balanced Accuracy      0.9997   1.0000   0.9997   0.9987
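
One way to see exactly what changed between the baseline and Test 1 is to subtract the two confusion matrices. The sketch below re-enters the counts from the two outputs above in base R. The object names and the `as_cm()` helper are illustrative only.

```r
# Helper to build a labelled 4 x 4 matrix (rows = predictions,
# columns = reference classes).
as_cm <- function(counts) {
  matrix(counts, nrow = 4, byrow = TRUE,
         dimnames = list(Prediction = as.character(1:4),
                         Reference  = as.character(1:4)))
}

baseline_cm <- as_cm(c(485,   0,   0,   1,
                         0, 240,   0,   0,
                         0,   0, 738,   1,
                         0,   0,   1, 739))

test_1_cm <- as_cm(c(485,   0,   0,   1,
                       0, 240,   0,   0,
                       0,   0, 739,   1,
                       0,   0,   0, 739))

# Cell-by-cell change: positive cells gained cycles, negative cells lost them.
change <- test_1_cm - baseline_cm
change

# Number of cycles predicted as cluster 3 before and after the fault.
c(baseline = sum(baseline_cm["3", ]), test_1 = sum(test_1_cm["3", ]))

# Accuracy before and after.
round(c(baseline = sum(diag(baseline_cm)) / sum(baseline_cm),
        test_1   = sum(diag(test_1_cm)) / sum(test_1_cm)), 4)
```

Only one cycle moves: a reference-3 cycle that the baseline assigned to cluster 4 is assigned to cluster 3 in Test 1, which raises the cluster 3 prediction count from 739 to 740 and the accuracy from 0.9986 to 0.9991.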

Repeat Imputation and Evaluation on Second Sensor

# impute zero values in TS2
test_2 <- test_1

# impute zero values into a temperature sensor only in sensor cluster 3
test_2 <- test_2 %>%
  mutate(Mean_TS2 = ifelse(Cluster_2 == 3, 0, Mean_TS2))

# gather predictions of sensor cluster using original model on a data frame consisting of two failing temperature sensors
model_test_2 <- predict(RF_model, newdata = test_2[, features])

# create confusion matrix on the model's performance on the test data
confusionMatrix(model_test_2, test_2[, target])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 485   0   0   1
##          2   0 240   0   0
##          3   0   0 739   1
##          4   0   0   0 739
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9991          
##                  95% CI : (0.9967, 0.9999)
##     No Information Rate : 0.3361          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9987          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000   1.0000   1.0000   0.9973
## Specificity            0.9994   1.0000   0.9993   1.0000
## Pos Pred Value         0.9979   1.0000   0.9986   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   0.9986
## Prevalence             0.2200   0.1088   0.3351   0.3361
## Detection Rate         0.2200   0.1088   0.3351   0.3351
## Detection Prevalence   0.2204   0.1088   0.3356   0.3351
## Balanced Accuracy      0.9997   1.0000   0.9997   0.9987

Repeat Imputation and Evaluation on Third Sensor

# impute zero values in TS3
test_3 <- test_2

# impute zero values into a temperature sensor only in sensor cluster 3
test_3 <- test_3 %>%
  mutate(Mean_TS3 = ifelse(Cluster_2 == 3, 0, Mean_TS3))

# gather predictions of sensor cluster using original model on a data frame consisting of three failing temperature sensors
model_test_3 <- predict(RF_model, newdata = test_3[, features])

# create confusion matrix on the model's performance on the test data
confusionMatrix(model_test_3, test_3[, target])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 485   0   0   1
##          2   0 240   0   0
##          3   0   0 739   1
##          4   0   0   0 739
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9991          
##                  95% CI : (0.9967, 0.9999)
##     No Information Rate : 0.3361          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9987          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000   1.0000   1.0000   0.9973
## Specificity            0.9994   1.0000   0.9993   1.0000
## Pos Pred Value         0.9979   1.0000   0.9986   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   0.9986
## Prevalence             0.2200   0.1088   0.3351   0.3361
## Detection Rate         0.2200   0.1088   0.3351   0.3351
## Detection Prevalence   0.2204   0.1088   0.3356   0.3351
## Balanced Accuracy      0.9997   1.0000   0.9997   0.9987

Repeat Imputation and Evaluation on Fourth Sensor

# impute zero values in TS4
test_4 <- test_3

# impute zero values into a temperature sensor only in sensor cluster 3
test_4 <- test_4 %>%
  mutate(Mean_TS4 = ifelse(Cluster_2 == 3, 0, Mean_TS4))

# gather predictions of sensor cluster using original model on a data frame consisting of four failing temperature sensors
model_test_4 <- predict(RF_model, newdata = test_4[, features])

# create confusion matrix on the model's performance on the test data
confusionMatrix(model_test_4, test_4[, target])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 485   0   0   1
##          2   0 240   0   0
##          3   0   0 739   1
##          4   0   0   0 739
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9991          
##                  95% CI : (0.9967, 0.9999)
##     No Information Rate : 0.3361          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9987          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000   1.0000   1.0000   0.9973
## Specificity            0.9994   1.0000   0.9993   1.0000
## Pos Pred Value         0.9979   1.0000   0.9986   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   0.9986
## Prevalence             0.2200   0.1088   0.3351   0.3361
## Detection Rate         0.2200   0.1088   0.3351   0.3351
## Detection Prevalence   0.2204   0.1088   0.3356   0.3351
## Balanced Accuracy      0.9997   1.0000   0.9997   0.9987
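
The four tests above repeat the same step with one more sensor zeroed each time. As a sketch, the chain of data frames could be built in one pass with `Reduce()`. The toy data frame and the `zero_in_cluster_3()` helper are assumptions for illustration; the real analysis would start from `sensor_test`.

```r
# Made-up stand-in for sensor_test with the four temperature sensors.
toy <- data.frame(
  Cluster_2 = factor(c(3, 3, 4, 1), levels = 1:4),
  Mean_TS1  = c(36.2, 37.8, 42.5, 55.1),
  Mean_TS2  = c(41.0, 42.3, 47.1, 59.8),
  Mean_TS3  = c(38.4, 39.6, 44.9, 57.2),
  Mean_TS4  = c(31.5, 32.7, 40.2, 50.6)
)

sensors <- c("Mean_TS1", "Mean_TS2", "Mean_TS3", "Mean_TS4")

# Hypothetical helper: zero one sensor in cluster 3 rows only.
zero_in_cluster_3 <- function(data, sensor) {
  data[data$Cluster_2 == 3, sensor] <- 0
  data
}

# accumulate = TRUE keeps every intermediate data frame, so element k + 1
# has the first k sensors zeroed (the first element is the untouched input).
steps <- Reduce(zero_in_cluster_3, sensors, init = toy, accumulate = TRUE)
tests <- setNames(steps[-1], paste0("test_", seq_along(sensors)))

# Count of zeroed cells per test: 2 cluster-3 rows times k sensors.
zero_counts <- sapply(tests, function(d) sum(d[, sensors] == 0))
zero_counts
```

Each element of `tests` could then be passed to the same prediction and confusion matrix calls, which keeps the cumulative design of the four tests while removing the copied code.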

Test Number 5 / Random Sensor Reading from Multiple Clusters

# data selection for test 5
test_5 <- sensor_test

# extract minimum values for sensor TS1 in sensor cluster 3
sensor_test %>% filter(Cluster_2 == 3) %>% summarize(min(Mean_TS1))

##   min(Mean_TS1)
## 1      35.31378

# extract maximum values for sensor TS1 in sensor cluster number 3
sensor_test %>% filter(Cluster_2 == 3) %>% summarize(max(Mean_TS1))

##   max(Mean_TS1)
## 1       39.4616

# extract minimum values for sensor TS1 in sensor cluster number 4
sensor_test %>% filter(Cluster_2 == 4) %>% summarize(min(Mean_TS1))

##   min(Mean_TS1)
## 1      38.87905

# extract maximum values for sensor TS1 in sensor cluster number 4
sensor_test %>% filter(Cluster_2 == 4) %>% summarize(max(Mean_TS1))

##   max(Mean_TS1)
## 1      49.11003

# replace TS1 readings in sensor cluster 3 with random values drawn from the combined range of sensor cluster 3 and sensor cluster 4

test_5 <- test_5 %>%
  mutate(Mean_TS1 = case_when(
    Cluster_2 == 3 ~ runif(n(), min = 35.31378, max = 49.11003),
    TRUE ~ Mean_TS1))

# identify correct imputation of random numbers
test_5 %>%
  ggplot(aes(x = Cycle, y = Mean_TS1)) +
  geom_point() +
  theme_minimal()
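
The range passed to `runif()` above is typed in by hand from the four summaries. A minimal sketch of deriving it from the data instead, and of fixing the random seed so the noisy column is reproducible, is shown below. The toy data frame, the seed value, and the object names are assumptions for illustration.

```r
# Made-up stand-in for sensor_test.
toy <- data.frame(
  Cluster_2 = factor(c(3, 3, 3, 4, 4, 1), levels = 1:4),
  Mean_TS1  = c(35.3, 37.0, 39.5, 38.9, 49.1, 55.0)
)

# Combined TS1 range across sensor clusters 3 and 4.
in_3_or_4 <- toy$Cluster_2 %in% c("3", "4")
ts1_range <- range(toy$Mean_TS1[in_3_or_4])

set.seed(42)  # arbitrary seed, chosen only for reproducibility

# Replace cluster 3 readings with uniform noise over the combined range.
noisy  <- toy
rows_3 <- noisy$Cluster_2 == "3"
noisy$Mean_TS1[rows_3] <- runif(sum(rows_3),
                                min = ts1_range[1],
                                max = ts1_range[2])

# Check: cluster 3 readings stay inside the combined range and all
# other rows are unchanged.
range(noisy$Mean_TS1[rows_3])
all.equal(noisy$Mean_TS1[!rows_3], toy$Mean_TS1[!rows_3])
```

Computing the bounds with `range()` keeps them in step with the data if the clustering is rerun, and the seed makes the confusion matrix for the noisy data repeatable.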

Apply Model Against Noisy Data

# gather predictions of sensor cluster using original model
model_test_5 <- predict(RF_model, newdata = test_5[, features])

# create confusion matrix on the model's performance on the test data
confusionMatrix(model_test_5, test_5[, target])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 485   0   0   1
##          2   0 240   0   0
##          3   0   0 738   1
##          4   0   0   1 739
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9986         
##                  95% CI : (0.996, 0.9997)
##     No Information Rate : 0.3361         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9981         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000   1.0000   0.9986   0.9973
## Specificity            0.9994   1.0000   0.9993   0.9993
## Pos Pred Value         0.9979   1.0000   0.9986   0.9986
## Neg Pred Value         1.0000   1.0000   0.9993   0.9986
## Prevalence             0.2200   0.1088   0.3351   0.3361
## Detection Rate         0.2200   0.1088   0.3347   0.3351
## Detection Prevalence   0.2204   0.1088   0.3351   0.3356
## Balanced Accuracy      0.9997   1.0000   0.9990   0.9983

Classification and Predictive Analysis on Remote Sensor Data

David Curtis

2024-02-09
