1 Overview

One of the application of Machine Learning (ML) in industrial sector is for analytical purposes related to engine performance and maintenance. Therefore we are going to use one of ML method which is Classification to predict engine failure. The data set contains information about vibration levels, torque, process temperature, operational hours and engine failure type. By analyzing the data, it is expected that we can identify the operating condition for optimal engine performance and also perform prediction later on to prevent major engine failure.

2 Preparation

Libraries and Read Data

library(dplyr) # for wrangling
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(inspectdf) # for EDA
## Warning: package 'inspectdf' was built under R version 4.4.1
library(gtools) # for ML model & assumption 
library(caret) # for ML model & evaluation 
## Warning: package 'caret' was built under R version 4.4.1
## Loading required package: ggplot2
## Loading required package: lattice
library(rsample)
## Warning: package 'rsample' was built under R version 4.4.1
## 
## Attaching package: 'rsample'
## The following object is masked from 'package:gtools':
## 
##     permutations
#reading data
engine <- read.csv("data_input/CIA-1_Dataset_1.csv")
head(engine)
##   UDI Product.ID Type Air.temperature..K. Process.temperature..K.
## 1   1     M14860    M               298.1                   308.6
## 2   2     L47181    L               298.2                   308.7
## 3   3     L47182    L               298.1                   308.5
## 4   4     L47183    L               298.2                   308.6
## 5   5     L47184    L               298.2                   308.7
## 6   6     M14865    M               298.1                   308.6
##   Rotational.speed..rpm. Torque..Nm. Vibration.Levels Operational.Hours
## 1                   1551        42.8               42                20
## 2                   1408        46.3               52                21
## 3                   1498        49.4               44                18
## 4                   1433        39.5               52                10
## 5                   1408        40.0               44                10
## 6                   1425        41.9               35                24
##   Failure.Type
## 1   No Failure
## 2   No Failure
## 3   No Failure
## 4   No Failure
## 5   No Failure
## 6   No Failure

3 Data Wrangling & EDA

To understand the data, we will use glimpse() function from dplyr to explore the structure of the datase, we can also use str() from base R. Also we will check for any missing values in the data set using anyNA() function.

glimpse(engine)
## Rows: 500
## Columns: 10
## $ UDI                     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ Product.ID              <chr> "M14860", "L47181", "L47182", "L47183", "L4718…
## $ Type                    <chr> "M", "L", "L", "L", "L", "M", "L", "L", "M", "…
## $ Air.temperature..K.     <dbl> 298.1, 298.2, 298.1, 298.2, 298.2, 298.1, 298.…
## $ Process.temperature..K. <dbl> 308.6, 308.7, 308.5, 308.6, 308.7, 308.6, 308.…
## $ Rotational.speed..rpm.  <int> 1551, 1408, 1498, 1433, 1408, 1425, 1558, 1527…
## $ Torque..Nm.             <dbl> 42.8, 46.3, 49.4, 39.5, 40.0, 41.9, 42.4, 40.2…
## $ Vibration.Levels        <dbl> 42, 52, 44, 52, 44, 35, 36, 33, 23, 45, 55, 45…
## $ Operational.Hours       <dbl> 20.00, 21.00, 18.00, 10.00, 10.00, 24.00, 20.0…
## $ Failure.Type            <chr> "No Failure", "No Failure", "No Failure", "No …
anyNA(engine)
## [1] FALSE

💡Insights : - numeric data type already correct - some column can be changed into factor such as : Type, and Failure.Type - there are no missing values

As we know that Classification is a predictive method to predict a target variable of categorical type in supervised machine learning, therefore we will add failure column with the following categories : > 0 for no failure > 1 for any kind of failure

for this purpose, we will use if_else() function while also changing the data type of some columns mentioned above using mutate() function. We are also going to drop some unecessary columns such as UDI and Product.ID

engine_clean <- engine %>% 
          mutate(failure = as.factor(if_else(Failure.Type == "No Failure", 0, 1)),
                 Type = as.factor(Type),
                 Failure.Type = as.factor(Failure.Type)) %>% 
          select(-c(UDI,Product.ID))
str(engine_clean)
## 'data.frame':    500 obs. of  9 variables:
##  $ Type                   : Factor w/ 3 levels "H","L","M": 3 2 2 2 2 3 2 2 3 3 ...
##  $ Air.temperature..K.    : num  298 298 298 298 298 ...
##  $ Process.temperature..K.: num  309 309 308 309 309 ...
##  $ Rotational.speed..rpm. : int  1551 1408 1498 1433 1408 1425 1558 1527 1667 1741 ...
##  $ Torque..Nm.            : num  42.8 46.3 49.4 39.5 40 41.9 42.4 40.2 28.6 28 ...
##  $ Vibration.Levels       : num  42 52 44 52 44 35 36 33 23 45 ...
##  $ Operational.Hours      : num  20 21 18 10 10 24 20 14 22 22 ...
##  $ Failure.Type           : Factor w/ 4 levels "No Failure","Overstrain Failure",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ failure                : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

4 Logistic regression

In this section we will perform the workflow to make Logistic Regression model.

4.1 Cross Validation

Cross validation is a method to split our data into 2 parts : train data to train the model and test data to test the model. We’re going to use initial_split() function from rsample library.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

splitter <- initial_split(data=engine_clean, prop = 0.8) #prop is the proportion for train data

train_lr <- training(splitter)
test_lr <- testing(splitter)

A balanced class proportion is important for train data because we will use it to train our model. This will make sure our model can learn the characteristics of each class in a balanced proportion so there will be no bias. We will use prop.table() on train data to check if any imbalance class.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
table(train_lr$failure) %>% prop.table()
## 
##     0     1 
## 0.935 0.065

🔎 it is found that : > class 0 (engine good) –> 93.5% > class 1 (engine fail) –> 6.5%

Because the class are imbalance, therefore we need to use the upsampling or undersampling method to handle the imbalance class. In this section we will use upsampling where the minority class is duplicated randomly.

#upsampling
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

train_lr_up <- upSample(x=train_lr %>% select(-failure), 
                        y = train_lr$failure, yname = "failure")

#check proportion again
table(train_lr_up$failure) %>% prop.table()
## 
##   0   1 
## 0.5 0.5

4.2 Creating Model

Now that the data class already balanced, now we will create a logistic regression model where the result in the form of probability. We will create the following models : 1. Model with no predictor 2. Model with 2 numeric predictor –> Vibration & Process Temperature, based on technical perspective where the most visible parameter before engine failure happening is increase in vibration and process temperature. 3. Model with all predictor

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

model_lr_null <- glm(formula = failure ~ 1,
                     data = train_lr_up,
                     family = "binomial")

model_lr_vib <- glm(formula = failure ~ Vibration.Levels + Process.temperature..K.,
                     data = train_lr_up,
                     family = "binomial")
  
model_lr_all <- glm(formula = failure ~ .,
                     data = train_lr_up,
                     family = "binomial")
## Warning: glm.fit: algorithm did not converge

4.3 Model Selection

Now we will perform model selection by checking the null & residual deviance also AIC from each model.

lr_null_dev <- model_lr_null$deviance
lr_vib_dev <- model_lr_vib$deviance
lr_all_dev <- model_lr_all$deviance

lr_null_aic <- model_lr_null$aic
lr_vib_aic <- model_lr_vib$aic
lr_all_aic <- model_lr_all$aic

comparison <- data.frame(
  Model = c("No Predictor", "2 Numerical Predictor", "All Predictor"),
  Residual_Deviance = c(lr_null_dev, lr_vib_dev, lr_all_dev),
  AIC = c(lr_null_aic,lr_vib_aic,lr_all_aic))

comparison
##                   Model Residual_Deviance      AIC
## 1          No Predictor      1.036948e+03 1038.948
## 2 2 Numerical Predictor      1.032277e+03 1038.277
## 3         All Predictor      4.339583e-09   24.000

🔎 Based on the comparison, we will pick models with the lowest residual deviance and lowest AIC which is model_lr_all because : > low residual deviance means less error > low AIC means less missing information

4.4 Predict

After selecting the model, we will use it to predict the probability of engine failure.

predict_lr <- predict(model_lr_all, newdata=test_lr, type="response")

test_lr$pred_failure <- ifelse(predict_lr > 0.9, yes = 1, no=0)

head(test_lr)
##   Type Air.temperature..K. Process.temperature..K. Rotational.speed..rpm.
## 1    M               298.1                   308.6                   1551
## 2    L               298.2                   308.7                   1408
## 3    L               298.1                   308.5                   1498
## 4    L               298.1                   308.6                   1527
## 5    M               298.5                   309.0                   1741
## 6    H               298.6                   309.1                   1423
##   Torque..Nm. Vibration.Levels Operational.Hours Failure.Type failure
## 1        42.8               42             20.00   No Failure       0
## 2        46.3               52             21.00   No Failure       0
## 3        49.4               44             18.00   No Failure       0
## 4        40.2               33             14.00   No Failure       0
## 5        28.0               45             22.00   No Failure       0
## 6        44.3               45             20.03   No Failure       0
##   pred_failure
## 1            0
## 2            0
## 3            0
## 4            0
## 5            0
## 6            0

4.5 Model Evaluation

We will evaluate the model prediction by creating confusion matrix to check for accuracy, sensitivity/recall, precision and sensitivity.

confusionMatrix(data = as.factor(test_lr$pred_failure),
                reference = test_lr$failure,
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 93  0
##          1  0  7
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9638, 1)
##     No Information Rate : 0.93       
##     P-Value [Acc > NIR] : 0.0007052  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.00       
##             Specificity : 1.00       
##          Pos Pred Value : 1.00       
##          Neg Pred Value : 1.00       
##              Prevalence : 0.07       
##          Detection Rate : 0.07       
##    Detection Prevalence : 0.07       
##       Balanced Accuracy : 1.00       
##                                      
##        'Positive' Class : 1          
## 

💡 Insights : - Looks like the model successfully predict True Positive (TP) and True Negative (TN) from the test data, with 0 False Positive (FP) and False Negative (FN)

Let’s try to predict and make confusion matrix from the whole data to check the result.

predict_lr_clean <- predict(model_lr_all, newdata=engine_clean, type="response")

engine_clean$pred_failure <- ifelse(predict_lr_clean > 0.9, yes = 1, no=0)

confusionMatrix(data = as.factor(engine_clean$pred_failure),
                reference = engine_clean$failure,
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 467   0
##          1   0  33
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9926, 1)
##     No Information Rate : 0.934      
##     P-Value [Acc > NIR] : 1.491e-15  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.000      
##             Specificity : 1.000      
##          Pos Pred Value : 1.000      
##          Neg Pred Value : 1.000      
##              Prevalence : 0.066      
##          Detection Rate : 0.066      
##    Detection Prevalence : 0.066      
##       Balanced Accuracy : 1.000      
##                                      
##        'Positive' Class : 1          
## 

💡 Insights : - again, the model successfully predict True Positive (TP) and True Negative (TN) from the test data, with 0 False Positive (FP) and False Negative (FN) - Becuase FP and FN are zero, therefore : Accuracy, Sensitivity/Recall, Precision and Specificity = 100% - If we want to lower the FN (predicted not failed, but actually failed), we can choose the model with higher Sensitivity/Recall - If we want to lower the FP (predicted failed, but actually not failed), we can choose the model with higher Precision

5 KNN

From the above, we already created logistic regression model to see the probability of engine failure, now we will try to use KNN method to predict the engine failure.

5.1 Data Wrangling

Because KNN using Euclidean Distance in the modelling therefore we can only use numerical features. We will take out Type and Failure.Type column from the dataset.

engine_knn <- engine %>% 
          mutate(failure = as.factor(if_else(Failure.Type == "No Failure", 0, 1))) %>% 
          select(-c(UDI,Product.ID,Type, Failure.Type))

head(engine_knn)
##   Air.temperature..K. Process.temperature..K. Rotational.speed..rpm.
## 1               298.1                   308.6                   1551
## 2               298.2                   308.7                   1408
## 3               298.1                   308.5                   1498
## 4               298.2                   308.6                   1433
## 5               298.2                   308.7                   1408
## 6               298.1                   308.6                   1425
##   Torque..Nm. Vibration.Levels Operational.Hours failure
## 1        42.8               42                20       0
## 2        46.3               52                21       0
## 3        49.4               44                18       0
## 4        39.5               52                10       0
## 5        40.0               44                10       0
## 6        41.9               35                24       0

5.2 Cross Validation & Scaling

Because KNN has a slightly different steps/workflow, therefore we will make another train test splitting and will be followed by scaling.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

splitter <-  initial_split(data = engine_knn, prop = 0.8 )

train_knn <- training(splitter)
test_knn <- testing(splitter)

#check proportion
table(train_knn$failure) %>% prop.table()
## 
##     0     1 
## 0.935 0.065

For KNN method, we need to separate the predictor/feature and target variables and after that use z-score standardization to scale train and test data.

#Predictor/Feature
engine_train_x <- train_knn %>% select(-failure)
engine_test_x <- test_knn %>% select(-failure)

#target
engine_train_y <- train_knn$failure
engine_test_y <- test_knn$failure

#scaling data
engine_train_xs <- scale(engine_train_x)
#test data is scaled using parameter from train data
engine_test_xs <- scale(engine_test_x,
                        center = attr(engine_train_xs, "scaled:center"),
                        scale = attr(engine_train_xs, "scaled:scale"))

5.3 Predict

To find optimum K values, we can use sqrt(nrow(data))

sqrt(nrow(train_knn))
## [1] 20

💡Therefore we will use K = 21

library(class)
test_knn$pred_failure <- knn(train = engine_train_xs,
                  test = engine_test_xs,
                  cl = engine_train_y,
                  k = 19)
head(test_knn)
##   Air.temperature..K. Process.temperature..K. Rotational.speed..rpm.
## 1               298.1                   308.6                   1551
## 2               298.2                   308.7                   1408
## 3               298.1                   308.5                   1498
## 4               298.1                   308.6                   1527
## 5               298.5                   309.0                   1741
## 6               298.6                   309.1                   1423
##   Torque..Nm. Vibration.Levels Operational.Hours failure pred_failure
## 1        42.8               42             20.00       0            0
## 2        46.3               52             21.00       0            0
## 3        49.4               44             18.00       0            0
## 4        40.2               33             14.00       0            0
## 5        28.0               45             22.00       0            0
## 6        44.3               45             20.03       0            0

5.4 Model Evaluation

Let’s make confusion matrix and check the accuracy, recall and precision.

confusionMatrix(data = test_knn$pred_failure,
                reference = test_knn$failure,
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 93  7
##          1  0  0
##                                           
##                Accuracy : 0.93            
##                  95% CI : (0.8611, 0.9714)
##     No Information Rate : 0.93            
##     P-Value [Acc > NIR] : 0.59878         
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 0.02334         
##                                           
##             Sensitivity : 0.00            
##             Specificity : 1.00            
##          Pos Pred Value :  NaN            
##          Neg Pred Value : 0.93            
##              Prevalence : 0.07            
##          Detection Rate : 0.00            
##    Detection Prevalence : 0.00            
##       Balanced Accuracy : 0.50            
##                                           
##        'Positive' Class : 1               
## 

6 Comparison Log reg and KNN 💡

Based on the confusion matrix from KNN model, we can see that on the KNN, the model failed to predict any true positive or successfully predict engine failure. While on the logistic regression model, it successfully predict without FP and FN.

Therefore in this case, it is best to use Logistic Regression model to predict engine failure while incorporate all feature/predictor into the model.