One of the application of Machine Learning (ML) in industrial sector is for analytical purposes related to engine performance and maintenance. Therefore we are going to use one of ML method which is Classification to predict engine failure. The data set contains information about vibration levels, torque, process temperature, operational hours and engine failure type. By analyzing the data, it is expected that we can identify the operating condition for optimal engine performance and also perform prediction later on to prevent major engine failure.
Libraries and Read Data
library(dplyr) # for wrangling
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(inspectdf) # for EDA
## Warning: package 'inspectdf' was built under R version 4.4.1
library(gtools) # for ML model & assumption
library(caret) # for ML model & evaluation
## Warning: package 'caret' was built under R version 4.4.1
## Loading required package: ggplot2
## Loading required package: lattice
library(rsample)
## Warning: package 'rsample' was built under R version 4.4.1
##
## Attaching package: 'rsample'
## The following object is masked from 'package:gtools':
##
## permutations
#reading data
engine <- read.csv("data_input/CIA-1_Dataset_1.csv")
head(engine)
## UDI Product.ID Type Air.temperature..K. Process.temperature..K.
## 1 1 M14860 M 298.1 308.6
## 2 2 L47181 L 298.2 308.7
## 3 3 L47182 L 298.1 308.5
## 4 4 L47183 L 298.2 308.6
## 5 5 L47184 L 298.2 308.7
## 6 6 M14865 M 298.1 308.6
## Rotational.speed..rpm. Torque..Nm. Vibration.Levels Operational.Hours
## 1 1551 42.8 42 20
## 2 1408 46.3 52 21
## 3 1498 49.4 44 18
## 4 1433 39.5 52 10
## 5 1408 40.0 44 10
## 6 1425 41.9 35 24
## Failure.Type
## 1 No Failure
## 2 No Failure
## 3 No Failure
## 4 No Failure
## 5 No Failure
## 6 No Failure
To understand the data, we will use glimpse()
function
from dplyr
to explore the structure of the datase, we can
also use str()
from base R. Also we will check for any
missing values in the data set using anyNA()
function.
glimpse(engine)
## Rows: 500
## Columns: 10
## $ UDI <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ Product.ID <chr> "M14860", "L47181", "L47182", "L47183", "L4718…
## $ Type <chr> "M", "L", "L", "L", "L", "M", "L", "L", "M", "…
## $ Air.temperature..K. <dbl> 298.1, 298.2, 298.1, 298.2, 298.2, 298.1, 298.…
## $ Process.temperature..K. <dbl> 308.6, 308.7, 308.5, 308.6, 308.7, 308.6, 308.…
## $ Rotational.speed..rpm. <int> 1551, 1408, 1498, 1433, 1408, 1425, 1558, 1527…
## $ Torque..Nm. <dbl> 42.8, 46.3, 49.4, 39.5, 40.0, 41.9, 42.4, 40.2…
## $ Vibration.Levels <dbl> 42, 52, 44, 52, 44, 35, 36, 33, 23, 45, 55, 45…
## $ Operational.Hours <dbl> 20.00, 21.00, 18.00, 10.00, 10.00, 24.00, 20.0…
## $ Failure.Type <chr> "No Failure", "No Failure", "No Failure", "No …
anyNA(engine)
## [1] FALSE
💡Insights : - numeric data type already correct - some column can be
changed into factor such as : Type
, and
Failure.Type
- there are no missing values
As we know that Classification is a predictive method to predict a target variable of categorical type in supervised machine learning, therefore we will add failure column with the following categories : > 0 for no failure > 1 for any kind of failure
for this purpose, we will use if_else()
function while
also changing the data type of some columns mentioned above using
mutate()
function. We are also going to drop some
unecessary columns such as UDI
and
Product.ID
engine_clean <- engine %>%
mutate(failure = as.factor(if_else(Failure.Type == "No Failure", 0, 1)),
Type = as.factor(Type),
Failure.Type = as.factor(Failure.Type)) %>%
select(-c(UDI,Product.ID))
str(engine_clean)
## 'data.frame': 500 obs. of 9 variables:
## $ Type : Factor w/ 3 levels "H","L","M": 3 2 2 2 2 3 2 2 3 3 ...
## $ Air.temperature..K. : num 298 298 298 298 298 ...
## $ Process.temperature..K.: num 309 309 308 309 309 ...
## $ Rotational.speed..rpm. : int 1551 1408 1498 1433 1408 1425 1558 1527 1667 1741 ...
## $ Torque..Nm. : num 42.8 46.3 49.4 39.5 40 41.9 42.4 40.2 28.6 28 ...
## $ Vibration.Levels : num 42 52 44 52 44 35 36 33 23 45 ...
## $ Operational.Hours : num 20 21 18 10 10 24 20 14 22 22 ...
## $ Failure.Type : Factor w/ 4 levels "No Failure","Overstrain Failure",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ failure : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
In this section we will perform the workflow to make Logistic Regression model.
Cross validation is a method to split our data into 2 parts :
train data
to train the model and test data
to
test the model. We’re going to use initial_split()
function
from rsample
library.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
splitter <- initial_split(data=engine_clean, prop = 0.8) #prop is the proportion for train data
train_lr <- training(splitter)
test_lr <- testing(splitter)
A balanced class proportion is important for train
data because we will use it to train our model. This will make
sure our model can learn the characteristics of each class in a balanced
proportion so there will be no bias. We will use
prop.table()
on train data to check if any imbalance
class.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
table(train_lr$failure) %>% prop.table()
##
## 0 1
## 0.935 0.065
🔎 it is found that : > class 0 (engine good) –> 93.5% > class 1 (engine fail) –> 6.5%
Because the class are imbalance, therefore we need to use the
upsampling
or undersampling
method to handle
the imbalance class. In this section we will use upsampling where the
minority class is duplicated randomly.
#upsampling
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
train_lr_up <- upSample(x=train_lr %>% select(-failure),
y = train_lr$failure, yname = "failure")
#check proportion again
table(train_lr_up$failure) %>% prop.table()
##
## 0 1
## 0.5 0.5
Now that the data class already balanced, now we will create a logistic regression model where the result in the form of probability. We will create the following models : 1. Model with no predictor 2. Model with 2 numeric predictor –> Vibration & Process Temperature, based on technical perspective where the most visible parameter before engine failure happening is increase in vibration and process temperature. 3. Model with all predictor
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
model_lr_null <- glm(formula = failure ~ 1,
data = train_lr_up,
family = "binomial")
model_lr_vib <- glm(formula = failure ~ Vibration.Levels + Process.temperature..K.,
data = train_lr_up,
family = "binomial")
model_lr_all <- glm(formula = failure ~ .,
data = train_lr_up,
family = "binomial")
## Warning: glm.fit: algorithm did not converge
Now we will perform model selection by checking the
null & residual deviance
also AIC
from
each model.
lr_null_dev <- model_lr_null$deviance
lr_vib_dev <- model_lr_vib$deviance
lr_all_dev <- model_lr_all$deviance
lr_null_aic <- model_lr_null$aic
lr_vib_aic <- model_lr_vib$aic
lr_all_aic <- model_lr_all$aic
comparison <- data.frame(
Model = c("No Predictor", "2 Numerical Predictor", "All Predictor"),
Residual_Deviance = c(lr_null_dev, lr_vib_dev, lr_all_dev),
AIC = c(lr_null_aic,lr_vib_aic,lr_all_aic))
comparison
## Model Residual_Deviance AIC
## 1 No Predictor 1.036948e+03 1038.948
## 2 2 Numerical Predictor 1.032277e+03 1038.277
## 3 All Predictor 4.339583e-09 24.000
🔎 Based on the comparison, we will pick models with the lowest
residual deviance and lowest AIC which is model_lr_all
because : > low residual deviance means less error > low AIC means
less missing information
After selecting the model, we will use it to predict the probability of engine failure.
predict_lr <- predict(model_lr_all, newdata=test_lr, type="response")
test_lr$pred_failure <- ifelse(predict_lr > 0.9, yes = 1, no=0)
head(test_lr)
## Type Air.temperature..K. Process.temperature..K. Rotational.speed..rpm.
## 1 M 298.1 308.6 1551
## 2 L 298.2 308.7 1408
## 3 L 298.1 308.5 1498
## 4 L 298.1 308.6 1527
## 5 M 298.5 309.0 1741
## 6 H 298.6 309.1 1423
## Torque..Nm. Vibration.Levels Operational.Hours Failure.Type failure
## 1 42.8 42 20.00 No Failure 0
## 2 46.3 52 21.00 No Failure 0
## 3 49.4 44 18.00 No Failure 0
## 4 40.2 33 14.00 No Failure 0
## 5 28.0 45 22.00 No Failure 0
## 6 44.3 45 20.03 No Failure 0
## pred_failure
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
We will evaluate the model prediction by creating
confusion matrix
to check for accuracy,
sensitivity/recall, precision and sensitivity.
confusionMatrix(data = as.factor(test_lr$pred_failure),
reference = test_lr$failure,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 93 0
## 1 0 7
##
## Accuracy : 1
## 95% CI : (0.9638, 1)
## No Information Rate : 0.93
## P-Value [Acc > NIR] : 0.0007052
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.00
## Specificity : 1.00
## Pos Pred Value : 1.00
## Neg Pred Value : 1.00
## Prevalence : 0.07
## Detection Rate : 0.07
## Detection Prevalence : 0.07
## Balanced Accuracy : 1.00
##
## 'Positive' Class : 1
##
💡 Insights : - Looks like the model successfully predict
True Positive
(TP) and True Negative
(TN) from
the test data, with 0 False Positive
(FP) and
False Negative
(FN)
Let’s try to predict and make confusion matrix from the whole data to check the result.
predict_lr_clean <- predict(model_lr_all, newdata=engine_clean, type="response")
engine_clean$pred_failure <- ifelse(predict_lr_clean > 0.9, yes = 1, no=0)
confusionMatrix(data = as.factor(engine_clean$pred_failure),
reference = engine_clean$failure,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 467 0
## 1 0 33
##
## Accuracy : 1
## 95% CI : (0.9926, 1)
## No Information Rate : 0.934
## P-Value [Acc > NIR] : 1.491e-15
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.000
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 1.000
## Prevalence : 0.066
## Detection Rate : 0.066
## Detection Prevalence : 0.066
## Balanced Accuracy : 1.000
##
## 'Positive' Class : 1
##
💡 Insights : - again, the model successfully predict
True Positive
(TP) and True Negative
(TN) from
the test data, with 0 False Positive
(FP) and
False Negative
(FN) - Becuase FP and FN are zero, therefore
: Accuracy, Sensitivity/Recall, Precision and Specificity = 100% - If we
want to lower the FN (predicted not failed, but actually failed), we can
choose the model with higher Sensitivity/Recall - If we want to lower
the FP (predicted failed, but actually not failed), we can choose the
model with higher Precision
From the above, we already created logistic regression model to see the probability of engine failure, now we will try to use KNN method to predict the engine failure.
Because KNN using Euclidean Distance in the
modelling therefore we can only use numerical features. We will take out
Type
and Failure.Type
column from the
dataset.
engine_knn <- engine %>%
mutate(failure = as.factor(if_else(Failure.Type == "No Failure", 0, 1))) %>%
select(-c(UDI,Product.ID,Type, Failure.Type))
head(engine_knn)
## Air.temperature..K. Process.temperature..K. Rotational.speed..rpm.
## 1 298.1 308.6 1551
## 2 298.2 308.7 1408
## 3 298.1 308.5 1498
## 4 298.2 308.6 1433
## 5 298.2 308.7 1408
## 6 298.1 308.6 1425
## Torque..Nm. Vibration.Levels Operational.Hours failure
## 1 42.8 42 20 0
## 2 46.3 52 21 0
## 3 49.4 44 18 0
## 4 39.5 52 10 0
## 5 40.0 44 10 0
## 6 41.9 35 24 0
Because KNN has a slightly different steps/workflow, therefore we will make another train test splitting and will be followed by scaling.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
splitter <- initial_split(data = engine_knn, prop = 0.8 )
train_knn <- training(splitter)
test_knn <- testing(splitter)
#check proportion
table(train_knn$failure) %>% prop.table()
##
## 0 1
## 0.935 0.065
For KNN method, we need to separate the predictor/feature and target variables and after that use z-score standardization to scale train and test data.
#Predictor/Feature
engine_train_x <- train_knn %>% select(-failure)
engine_test_x <- test_knn %>% select(-failure)
#target
engine_train_y <- train_knn$failure
engine_test_y <- test_knn$failure
#scaling data
engine_train_xs <- scale(engine_train_x)
#test data is scaled using parameter from train data
engine_test_xs <- scale(engine_test_x,
center = attr(engine_train_xs, "scaled:center"),
scale = attr(engine_train_xs, "scaled:scale"))
To find optimum K values, we can use
sqrt(nrow(data))
sqrt(nrow(train_knn))
## [1] 20
💡Therefore we will use K = 21
library(class)
test_knn$pred_failure <- knn(train = engine_train_xs,
test = engine_test_xs,
cl = engine_train_y,
k = 19)
head(test_knn)
## Air.temperature..K. Process.temperature..K. Rotational.speed..rpm.
## 1 298.1 308.6 1551
## 2 298.2 308.7 1408
## 3 298.1 308.5 1498
## 4 298.1 308.6 1527
## 5 298.5 309.0 1741
## 6 298.6 309.1 1423
## Torque..Nm. Vibration.Levels Operational.Hours failure pred_failure
## 1 42.8 42 20.00 0 0
## 2 46.3 52 21.00 0 0
## 3 49.4 44 18.00 0 0
## 4 40.2 33 14.00 0 0
## 5 28.0 45 22.00 0 0
## 6 44.3 45 20.03 0 0
Let’s make confusion matrix and check the accuracy, recall and precision.
confusionMatrix(data = test_knn$pred_failure,
reference = test_knn$failure,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 93 7
## 1 0 0
##
## Accuracy : 0.93
## 95% CI : (0.8611, 0.9714)
## No Information Rate : 0.93
## P-Value [Acc > NIR] : 0.59878
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 0.02334
##
## Sensitivity : 0.00
## Specificity : 1.00
## Pos Pred Value : NaN
## Neg Pred Value : 0.93
## Prevalence : 0.07
## Detection Rate : 0.00
## Detection Prevalence : 0.00
## Balanced Accuracy : 0.50
##
## 'Positive' Class : 1
##
Based on the confusion matrix from KNN model, we can see that on the
KNN, the model failed to predict any true positive
or
successfully predict engine failure. While on the logistic regression
model, it successfully predict without FP and FN.
Therefore in this case, it is best to use Logistic Regression model to predict engine failure while incorporate all feature/predictor into the model.