knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  comment = "#>"
)

Introduction

“Predictive maintenance of machines” refers to the practice of using data and analytics techniques to predict when a machine is likely to fail, so that maintenance can be performed just in time to prevent the failure. In this Programming for Data Science with R project, we build a model that predicts machine failure from several supporting variables. The algorithms used are Naive Bayes, Decision Tree, and Random Forest. The data was obtained from Kaggle.

Data Preparation

Import Library

library(dplyr) # for data wrangling
library(ggplot2) # to visualize data
library(gridExtra) # to display multiple graph
library(inspectdf) # for EDA
library(tidymodels) # to build tidy models
library(caret) # to pre-process data
library(e1071) # Naive Bayes model
library(partykit) # decision tree (ctree)

Load Dataset

maintenance <- read.csv("datainput/predictive_maintenance.csv")
head(maintenance)
#>   UDI Product.ID Type Air.temperature..K. Process.temperature..K.
#> 1   1     M14860    M               298.1                   308.6
#> 2   2     L47181    L               298.2                   308.7
#> 3   3     L47182    L               298.1                   308.5
#> 4   4     L47183    L               298.2                   308.6
#> 5   5     L47184    L               298.2                   308.7
#> 6   6     M14865    M               298.1                   308.6
#>   Rotational.speed..rpm. Torque..Nm. Tool.wear..min. Failure.Type Target
#> 1                   1551        42.8               0   No Failure      0
#> 2                   1408        46.3               3   No Failure      0
#> 3                   1498        49.4               5   No Failure      0
#> 4                   1433        39.5               7   No Failure      0
#> 5                   1408        40.0               9   No Failure      0
#> 6                   1425        41.9              11   No Failure      0

Column Description:

Predictor Variables:

  • UDI : unique identifier ranging from 1 to 10000
  • Product ID : consisting of a letter L, M, or H for low (50% of all products), medium (30%), and high (20%) as product quality variants and a variant-specific serial number
  • Type : the product quality variant (L, M, or H)
  • air temperature [K] : generated using a random walk process later normalized to a standard deviation of 2 K around 300 K
  • process temperature [K] : generated using a random walk process normalized to a standard deviation of 1 K, added to the air temperature plus 10 K.
  • rotational speed [rpm] : calculated from a power of 2860 W, overlaid with normally distributed noise
  • torque [Nm] : torque values are normally distributed around 40 Nm with σ = 10 Nm and no negative values.
  • tool wear [min] : the quality variants H/M/L add 5/3/2 minutes of tool wear to the used tool in the process.
  • Failure Type : type of failure

Target Variable:

  • Target : the ‘machine failure’ label, which indicates whether the machine has failed at this particular data point in any of the failure modes (1) or not (0)

Data Wrangling

Check General Data Information

glimpse(maintenance)
#> Rows: 10,000
#> Columns: 10
#> $ UDI                     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ Product.ID              <chr> "M14860", "L47181", "L47182", "L47183", "L4718…
#> $ Type                    <chr> "M", "L", "L", "L", "L", "M", "L", "L", "M", "…
#> $ Air.temperature..K.     <dbl> 298.1, 298.2, 298.1, 298.2, 298.2, 298.1, 298.…
#> $ Process.temperature..K. <dbl> 308.6, 308.7, 308.5, 308.6, 308.7, 308.6, 308.…
#> $ Rotational.speed..rpm.  <int> 1551, 1408, 1498, 1433, 1408, 1425, 1558, 1527…
#> $ Torque..Nm.             <dbl> 42.8, 46.3, 49.4, 39.5, 40.0, 41.9, 42.4, 40.2…
#> $ Tool.wear..min.         <int> 0, 3, 5, 7, 9, 11, 14, 16, 18, 21, 24, 29, 34,…
#> $ Failure.Type            <chr> "No Failure", "No Failure", "No Failure", "No …
#> $ Target                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

The output above shows that the data has 10,000 rows and 10 columns, along with the data type of each column. Checking the data types is a crucial step because each column must have a type appropriate for analysis.

Missing Value

colSums(is.na(maintenance))
#>                     UDI              Product.ID                    Type 
#>                       0                       0                       0 
#>     Air.temperature..K. Process.temperature..K.  Rotational.speed..rpm. 
#>                       0                       0                       0 
#>             Torque..Nm.         Tool.wear..min.            Failure.Type 
#>                       0                       0                       0 
#>                  Target 
#>                       0

The dataset has no missing values in any column.

Data Cleaning

maintenance_clean <- maintenance %>% 
  mutate_if(is.character, as.factor) %>%  # convert character columns to factor
  mutate(Target = as.factor(Target)) %>% # convert the integer target to factor
  mutate(Product.ID = as.integer(Product.ID)) # Product.ID is already a factor here, so this yields its integer level codes
head(maintenance_clean)
#>   UDI Product.ID Type Air.temperature..K. Process.temperature..K.
#> 1   1       7004    M               298.1                   308.6
#> 2   2       1004    L               298.2                   308.7
#> 3   3       1005    L               298.1                   308.5
#> 4   4       1006    L               298.2                   308.6
#> 5   5       1007    L               298.2                   308.7
#> 6   6       7005    M               298.1                   308.6
#>   Rotational.speed..rpm. Torque..Nm. Tool.wear..min. Failure.Type Target
#> 1                   1551        42.8               0   No Failure      0
#> 2                   1408        46.3               3   No Failure      0
#> 3                   1498        49.4               5   No Failure      0
#> 4                   1433        39.5               7   No Failure      0
#> 5                   1408        40.0               9   No Failure      0
#> 6                   1425        41.9              11   No Failure      0
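One caveat about the conversion above: because `mutate_if()` first turns `Product.ID` into a factor, `as.integer()` returns the factor's internal level codes rather than the serial number embedded in the ID. The sketch below illustrates this with three IDs taken from the `head()` output. The `sub()`-based extraction is an assumed alternative for illustration, not part of the original pipeline.

```r
# Three product IDs copied from the head() output above
ids <- c("M14860", "L47181", "L47182")

# as.integer() on a factor returns level codes (rank among sorted levels),
# not the digits contained in the ID
codes <- as.integer(as.factor(ids))
print(codes)

# Assumed alternative: drop the leading quality letter, then parse the digits
serials <- as.integer(sub("^[A-Z]", "", ids))
print(serials)
```

With the level codes, the numeric distance between two products reflects only their alphabetical order, which is worth keeping in mind when the column is used as a predictor.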

Checking Data Summary

summary(maintenance_clean)
#>       UDI          Product.ID    Type     Air.temperature..K.
#>  Min.   :    1   Min.   :    1   H:1003   Min.   :295.3      
#>  1st Qu.: 2501   1st Qu.: 2501   L:6000   1st Qu.:298.3      
#>  Median : 5000   Median : 5000   M:2997   Median :300.1      
#>  Mean   : 5000   Mean   : 5000            Mean   :300.0      
#>  3rd Qu.: 7500   3rd Qu.: 7500            3rd Qu.:301.5      
#>  Max.   :10000   Max.   :10000            Max.   :304.5      
#>  Process.temperature..K. Rotational.speed..rpm.  Torque..Nm.    Tool.wear..min.
#>  Min.   :305.7           Min.   :1168           Min.   : 3.80   Min.   :  0    
#>  1st Qu.:308.8           1st Qu.:1423           1st Qu.:33.20   1st Qu.: 53    
#>  Median :310.1           Median :1503           Median :40.10   Median :108    
#>  Mean   :310.0           Mean   :1539           Mean   :39.99   Mean   :108    
#>  3rd Qu.:311.1           3rd Qu.:1612           3rd Qu.:46.80   3rd Qu.:162    
#>  Max.   :313.8           Max.   :2886           Max.   :76.60   Max.   :253    
#>                    Failure.Type  Target  
#>  Heat Dissipation Failure: 112   0:9661  
#>  No Failure              :9652   1: 339  
#>  Overstrain Failure      :  78           
#>  Power Failure           :  95           
#>  Random Failures         :  18           
#>  Tool Wear Failure       :  45

Checking Target Variable Class

prop.table(table(maintenance_clean$Target))*100
#> 
#>     0     1 
#> 96.61  3.39
table(maintenance_clean$Target)
#> 
#>    0    1 
#> 9661  339

Checking the proportions of the target variable shows that 96.61% of the observations belong to class (0) and 3.39% to class (1), so the classes are heavily imbalanced.
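To see why this imbalance matters for evaluation, the following sketch computes what a trivial classifier that always predicts the majority class would score. The counts are copied from the `table()` output above; the "always predict 0" classifier is a hypothetical baseline, not one of the models in this report.

```r
# Class counts copied from table(maintenance_clean$Target) above
counts <- c("0" = 9661, "1" = 339)

# A classifier that always predicts the majority class ("0")
baseline_accuracy <- max(counts) / sum(counts)

# It never flags a failure, so it finds 0 of the actual failures
baseline_recall <- 0 / counts[["1"]]

cat("Baseline accuracy:", baseline_accuracy, "\n")
cat("Baseline recall:  ", baseline_recall, "\n")
```

Accuracy alone therefore says little on this dataset, which is one reason to track recall on the failure class.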

Modelling

Naive Bayes Modelling

Naive Bayes algorithm is a probabilistic machine learning algorithm that is commonly used for classification tasks. It’s based on Bayes’ theorem with an assumption of independence among predictors. Despite its simplicity, Naive Bayes often performs surprisingly well in practice and is widely used in various applications such as spam filtering, sentiment analysis, and document classification.

Train-Test Split

Before building the prediction model, we split the data into a training set and a test set. The training set is used to fit the Naive Bayes model. The test set is used for comparison, to see whether the model overfits and fails to predict new data that it has not seen during training. We use 80% of the data for training and the remaining 20% for testing.

set.seed(100)
samplesize <- round(0.8 * nrow(maintenance_clean), 0)
index <- sample(seq_len(nrow(maintenance_clean)), size = samplesize)

data_train <- maintenance_clean[index, ]
data_test <- maintenance_clean[-index, ]    
#Check number of training data
dim(data_train)
#> [1] 8000   10
#Check number of test data
dim(data_test)
#> [1] 2000   10
# remove the target variable (10th column) from data_test
topredict_set <- data_test[1:9]

dim(topredict_set)
#> [1] 2000    9

Model Fitting

# fit Naive Bayes with every other column as a predictor
model_bayes <- naiveBayes(formula = Target ~.,
                          data = data_train)
  
model_bayes
#> 
#> Naive Bayes Classifier for Discrete Predictors
#> 
#> Call:
#> naiveBayes.default(x = X, y = Y, laplace = laplace)
#> 
#> A-priori probabilities:
#> Y
#>      0      1 
#> 0.9655 0.0345 
#> 
#> Conditional probabilities:
#>    UDI
#> Y       [,1]     [,2]
#>   0 4996.539 2901.439
#>   1 4637.156 2397.567
#> 
#>    Product.ID
#> Y       [,1]     [,2]
#>   0 4997.167 2902.533
#>   1 4660.033 2533.953
#> 
#>    Type
#> Y           H         L         M
#>   0 0.1027965 0.5950285 0.3021750
#>   1 0.0615942 0.7101449 0.2282609
#> 
#>    Air.temperature..K.
#> Y       [,1]     [,2]
#>   0 299.9726 1.987859
#>   1 300.8601 2.051360
#> 
#>    Process.temperature..K.
#> Y       [,1]     [,2]
#>   0 309.9921 1.487908
#>   1 310.2558 1.333213
#> 
#>    Rotational.speed..rpm.
#> Y       [,1]     [,2]
#>   0 1539.481 165.7596
#>   1 1483.772 362.3624
#> 
#>    Torque..Nm.
#> Y       [,1]      [,2]
#>   0 39.66137  9.450709
#>   1 50.27717 15.816492
#> 
#>    Tool.wear..min.
#> Y       [,1]     [,2]
#>   0 106.5498 62.88326
#>   1 146.1232 72.60743
#> 
#>    Failure.Type
#> Y   Heat Dissipation Failure  No Failure Overstrain Failure Power Failure
#>   0              0.000000000 0.998575867        0.000000000   0.000000000
#>   1              0.333333333 0.021739130        0.235507246   0.264492754
#>    Failure.Type
#> Y   Random Failures Tool Wear Failure
#>   0     0.001424133       0.000000000
#>   1     0.000000000       0.144927536
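For numeric predictors, the two columns in the output above are the per-class mean and standard deviation, which the classifier plugs into a Gaussian density. The sketch below shows how a posterior follows from the a-priori probabilities and the `Torque..Nm.` parameters alone. The observation value is hypothetical, and the real model multiplies the likelihoods of all predictors, so this is only an illustration of the mechanism.

```r
# Values copied from the model_bayes output above
prior <- c("0" = 0.9655, "1" = 0.0345)       # a-priori probabilities
mu    <- c("0" = 39.66137, "1" = 50.27717)   # Torque mean per class
sigma <- c("0" = 9.450709, "1" = 15.816492)  # Torque sd per class

torque_new <- 65  # hypothetical observation, chosen for illustration

# Gaussian likelihood of the observation under each class
likelihood <- dnorm(torque_new, mean = mu, sd = sigma)

# Bayes' theorem: posterior is proportional to prior * likelihood
unnormalised <- prior * likelihood
posterior <- unnormalised / sum(unnormalised)
print(round(posterior, 3))
```

A high torque value raises the failure probability well above its prior, because such a value is far more likely under the wider, higher-centred class (1) distribution.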

Predict

Predicting the target class on the test dataset (topredict_set)

preds_bayes <- predict(model_bayes, newdata = topredict_set) 

(conf_matrix_bayes <- table(preds_bayes, data_test$Target))   
#>            
#> preds_bayes    0    1
#>           0 1926    3
#>           1   11   60

Insight :

The confusion matrix shows that the Naive Bayes classifier correctly predicted 1926 products as “Not-Failure”, while 3 actual failures were wrongly predicted as “Not-Failure” (false negatives). Similarly, the model correctly predicted 60 products as “Failure”, while 11 products without failure were wrongly predicted as “Failure” (false positives).

Evaluation Model

Positive class : “failure”

Metrics used:

FN: Predicted as not-failure but actually failure. Recall decreases with every FN, so the expectation is to have a high recall value.

FP: Predicted as failure but actually not-failure. Precision decreases with every FP.
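As a worked check, the metrics can be computed by hand from the four cells of `conf_matrix_bayes`. The counts below are copied from that table (rows are predictions, columns are actual classes); the variable names are chosen here for illustration.

```r
# Counts copied from conf_matrix_bayes (positive class = "1", failure)
TP <- 60    # predicted failure, actually failure
FN <- 3     # predicted not-failure, actually failure
FP <- 11    # predicted failure, actually not-failure
TN <- 1926  # predicted not-failure, actually not-failure

recall    <- TP / (TP + FN)                    # sensitivity
precision <- TP / (TP + FP)                    # positive predictive value
accuracy  <- (TP + TN) / (TP + TN + FP + FN)

print(round(c(recall = recall, precision = precision, accuracy = accuracy), 4))
```

These values correspond to the Sensitivity, Pos Pred Value, and Accuracy lines reported by `confusionMatrix()`.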

confusionMatrix(data = preds_bayes, reference = data_test$Target, positive="1") 
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 1926    3
#>          1   11   60
#>                                           
#>                Accuracy : 0.993           
#>                  95% CI : (0.9883, 0.9962)
#>     No Information Rate : 0.9685          
#>     P-Value [Acc > NIR] : 5.348e-14       
#>                                           
#>                   Kappa : 0.8919          
#>                                           
#>  Mcnemar's Test P-Value : 0.06137         
#>                                           
#>             Sensitivity : 0.9524          
#>             Specificity : 0.9943          
#>          Pos Pred Value : 0.8451          
#>          Neg Pred Value : 0.9984          
#>              Prevalence : 0.0315          
#>          Detection Rate : 0.0300          
#>    Detection Prevalence : 0.0355          
#>       Balanced Accuracy : 0.9734          
#>                                           
#>        'Positive' Class : 1               
#> 

Insight :

The output of the Naive Bayes classification model shows an accuracy of 99.3% and a recall/sensitivity of 95.24%.

Decision Tree

The Decision Tree algorithm is a popular machine learning technique used for both classification and regression tasks. It creates a tree-like structure where each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents the outcome or class label. Decision trees are easy to understand and interpret, making them valuable for both analysis and prediction tasks.

Cross-Validation

We use 80% of the data for training and the remaining 20% for testing.

library(rsample)
RNGkind(sample.kind = "Rounding")
set.seed(100)

# draw 80% of the row indices for training
index_maintenance <- sample(nrow(maintenance_clean), nrow(maintenance_clean)*0.80)

maintenance_train <- maintenance_clean[index_maintenance,] # for training
maintenance_test <- maintenance_clean[-index_maintenance,] # for prediction
# Recheck target variable class proportions
prop.table(table(maintenance_train$Target))
#> 
#>       0       1 
#> 0.96475 0.03525

Handling Imbalanced Data

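# upsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)

# upsampled copy of the training data; note that the model below is still fitted on maintenance_train
maintenance_train_up <- upSample(x = maintenance_train %>% select(-Target),
                         y = maintenance_train$Target,
                         yname = "Target")
# Recheck target variable class proportions (this inspects maintenance_train, so they are unchanged)
prop.table(table(maintenance_train$Target))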
#> 
#>       0       1 
#> 0.96475 0.03525
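The idea behind upsampling can be sketched in base R: rows of the minority class are repeated at random until both classes have the same size. The toy data and names below are hypothetical, and this is an illustration of the technique rather than caret's `upSample()` implementation.

```r
set.seed(100)

# Toy imbalanced data (hypothetical values, for illustration only)
toy <- data.frame(
  torque = c(40, 42, 38, 41, 39, 43, 37, 44, 65, 70),
  Target = factor(c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1))
)

# Size of the largest class
n_max <- max(table(toy$Target))

# Keep every original row and, per class, add randomly repeated rows
# until the class reaches n_max rows
idx_by_class <- split(seq_len(nrow(toy)), toy$Target)
up_idx <- unlist(lapply(idx_by_class, function(i) {
  extra <- n_max - length(i)  # rows missing to reach n_max
  c(i, i[sample.int(length(i), extra, replace = TRUE)])
}))
toy_up <- toy[up_idx, ]

# Both classes now have the same number of rows
print(prop.table(table(toy_up$Target)))
```

Upsampling should be applied to the training data only, so that the test set keeps the original class proportions.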

Model Fitting

maintenance_tree <- ctree(formula = Target~., data = maintenance_train)
maintenance_tree
#> 
#> Model formula:
#> Target ~ UDI + Product.ID + Type + Air.temperature..K. + Process.temperature..K. + 
#>     Rotational.speed..rpm. + Torque..Nm. + Tool.wear..min. + 
#>     Failure.Type
#> 
#> Fitted party:
#> [1] root
#> |   [2] Failure.Type in Heat Dissipation Failure, Overstrain Failure, Power Failure, Tool Wear Failure: 1 (n = 274, err = 0.0%)
#> |   [3] Failure.Type in No Failure, Random Failures: 0 (n = 7726, err = 0.1%)
#> 
#> Number of inner nodes:    1
#> Number of terminal nodes: 2
# visualize the decision tree
plot(maintenance_tree, type="simple")

Insight :

We can observe the number of divisions/leaves (width) and the number of layers/levels (depth). Where:

[1] is the Root Node and the only inner node. [2] and [3] are terminal nodes (leaves): branches point to them, but no branches leave them.

The tree splits only on the variable “Failure.Type”. Heat Dissipation Failure, Overstrain Failure, Power Failure, and Tool Wear Failure fall into the category “1” (failure), while No Failure and Random Failures fall into the category “0” (Not-failure).

Predict

Let’s evaluate maintenance_tree using a confusion matrix based on the prediction results on the test data:

Parameter type:

type = “prob” returns the probabilities for each class. type = “response” returns the class labels.
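The relationship between the two output types can be sketched in base R: a class label is simply the class with the highest probability in each row. The probability matrix below is hypothetical, and the code does not call partykit; it only illustrates how labels relate to probabilities.

```r
# Hypothetical class probabilities for three observations
prob <- matrix(
  c(0.98, 0.02,
    0.30, 0.70,
    0.55, 0.45),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("0", "1"))
)

# A "response"-style label is the most probable class in each row
response <- factor(colnames(prob)[max.col(prob, ties.method = "first")],
                   levels = colnames(prob))
print(response)
```

Working with probabilities instead of labels leaves room to lower the decision threshold later if more recall is needed.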

# prediction class of testing data
pred_tree <- predict(maintenance_tree, newdata = maintenance_test, type="response")
(conf_matrix_tree <- table(pred_tree, maintenance_test$Target)) 
#>          
#> pred_tree    0    1
#>         0 1943    1
#>         1    0   56

Insight :

The confusion matrix shows that the Decision Tree algorithm correctly predicted 1943 products as “Not-Failure”, while 1 actual failure was wrongly predicted as “Not-Failure” (false negative). Similarly, the model correctly predicted 56 products as “Failure”, with 0 false positives.

Model Evaluation

Positive class : “failure”

Metrics used:

FN: Predicted as not-failure but actually failure -> Recall -> the expectation is to have a high recall value.

FP: Predicted as failure but actually not-failure.

# confusion matrix testing data
confusionMatrix(data = pred_tree, reference = maintenance_test$Target, positive="1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 1943    1
#>          1    0   56
#>                                      
#>                Accuracy : 0.9995     
#>                  95% CI : (0.9972, 1)
#>     No Information Rate : 0.9715     
#>     P-Value [Acc > NIR] : <2e-16     
#>                                      
#>                   Kappa : 0.9909     
#>                                      
#>  Mcnemar's Test P-Value : 1          
#>                                      
#>             Sensitivity : 0.9825     
#>             Specificity : 1.0000     
#>          Pos Pred Value : 1.0000     
#>          Neg Pred Value : 0.9995     
#>              Prevalence : 0.0285     
#>          Detection Rate : 0.0280     
#>    Detection Prevalence : 0.0280     
#>       Balanced Accuracy : 0.9912     
#>                                      
#>        'Positive' Class : 1          
#> 

Insight :

Based on the confusion matrix, the accuracy of this classification is 99.95%, which is higher than that of the Naive Bayes model (99.3%). The recall is also higher, at 98.25% compared with 95.24%.

Both methods above give very good results. However, we will build another model using the Random Forest method to see whether performance improves, and evaluate it with the same metrics.

Random Forest

Random Forest is an ensemble learning method commonly used for classification and regression tasks. It constructs multiple decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
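The “mode of the classes” step can be illustrated with a small base-R sketch. The vote matrix is hypothetical (three observations, five trees), and `majority_vote` is a helper name introduced here; this does not reproduce randomForest's internals.

```r
# Hypothetical votes: rows are observations, columns are individual trees
votes <- matrix(
  c("0", "0", "1", "0", "0",
    "1", "1", "0", "1", "1",
    "0", "1", "0", "0", "1"),
  nrow = 3, byrow = TRUE
)

# The forest's classification is the most frequent class in each row
majority_vote <- function(v) names(which.max(table(v)))
forest_pred <- apply(votes, 1, majority_vote)
print(forest_pred)
```

Because each tree sees a different bootstrap sample and a random subset of variables, individual trees can disagree, and the vote averages out their errors.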

Cross Validation

Splitting train test with proportion 80%:20%

RNGkind(sample.kind = "Rounding")
set.seed(100)

# splitting train test

ind_rb <- sample(nrow(maintenance_clean), nrow(maintenance_clean)*0.8)
rb_train <- maintenance_clean[ind_rb,]
rb_test <- maintenance_clean[-ind_rb,]

Check the target class proportions in the test data

# Recheck target variable class proportions
prop.table(table(rb_test$Target))
#> 
#>      0      1 
#> 0.9715 0.0285

Model Fitting

library(randomForest)
model_rf <- randomForest(Target~ ., data = rb_train, importance=TRUE, 
                         ntree = 500)

model_rf
#> 
#> Call:
#>  randomForest(formula = Target ~ ., data = rb_train, importance = TRUE,      ntree = 500) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 3
#> 
#>         OOB estimate of  error rate: 0.1%
#> Confusion matrix:
#>      0   1 class.error
#> 0 7718   0  0.00000000
#> 1    8 274  0.02836879
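The error figures in this output follow directly from the out-of-bag (OOB) confusion matrix, in which rows are the actual classes. The sketch below recomputes them from the counts copied above; the object names are chosen here for illustration.

```r
# OOB confusion matrix copied from the model_rf output (rows = actual class)
oob <- matrix(c(7718,   0,
                   8, 274),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Per-class error: misclassified share within each actual class
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: all misclassified observations over all observations
oob_error <- 1 - sum(diag(oob)) / sum(oob)

print(class_error)
print(oob_error)
```

The overall rate of 0.1% hides the fact that all 8 errors are missed failures, which is why the per-class error for class (1) is the more informative number here.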

Predict

preds_rf <- predict(model_rf, rb_test) 
head(preds_rf)
#>  1  7  8 19 25 26 
#>  0  0  0  0  0  0 
#> Levels: 0 1
plot(model_rf)

(conf_matrix_forestII <- table(preds_rf, rb_test$Target))
#>         
#> preds_rf    0    1
#>        0 1943    1
#>        1    0   56

Insight :

The confusion matrix shows that the Random Forest algorithm correctly predicted 1943 products as “Not-Failure”, while 1 actual failure was wrongly predicted as “Not-Failure” (false negative). Similarly, the model correctly predicted 56 products as “Failure”, with 0 false positives.

Model Evaluation

confusionMatrix(conf_matrix_forestII, positive="1")
#> Confusion Matrix and Statistics
#> 
#>         
#> preds_rf    0    1
#>        0 1943    1
#>        1    0   56
#>                                      
#>                Accuracy : 0.9995     
#>                  95% CI : (0.9972, 1)
#>     No Information Rate : 0.9715     
#>     P-Value [Acc > NIR] : <2e-16     
#>                                      
#>                   Kappa : 0.9909     
#>                                      
#>  Mcnemar's Test P-Value : 1          
#>                                      
#>             Sensitivity : 0.9825     
#>             Specificity : 1.0000     
#>          Pos Pred Value : 1.0000     
#>          Neg Pred Value : 0.9995     
#>              Prevalence : 0.0285     
#>          Detection Rate : 0.0280     
#>    Detection Prevalence : 0.0280     
#>       Balanced Accuracy : 0.9912     
#>                                      
#>        'Positive' Class : 1          
#> 

Insight :

Based on the confusion matrix, the accuracy of this classification is 99.95%, which is higher than that of the Naive Bayes model and the same as that of the Decision Tree model. The recall, at 98.25%, is likewise higher than that of the Naive Bayes model and the same as that of the Decision Tree model.

Conclusion

• The failure prediction uses 9 predictor variables: UDI, Product.ID, Type, Air.temperature..K., Process.temperature..K., Rotational.speed..rpm., Torque..Nm., Tool.wear..min., and Failure.Type.

• In the model evaluation stage for the three models above, we use the confusion matrix with sensitivity/recall as the priority metric. The focus is on how many of the actual positive cases (failures) the model predicts correctly.

• The risk if the model fails to predict accurately:

** FN: Predicted as not failure, but actually failure -> delayed maintenance

** FP: Predicted as failure, but actually not failure -> unnecessary maintenance

• Because the focus is on minimizing false negatives, we compare sensitivity/recall across the three models, and the differences are small. The Naive Bayes model has the lowest value at 95.24%, compared with 98.25% for both the Decision Tree and Random Forest models.

• In this case, the suitable models are the Decision Tree and Random Forest models, each with a sensitivity of 98.25%.

• The “Predictive Maintenance” model is created to assist companies in risk management for their machines. In the future, with this model, preventive maintenance can be focused on machines that fall into the “failure” category. Preventive maintenance helps companies optimize maintenance schedules, reduce downtime, and minimize the costs associated with unexpected failures.
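The recall comparison in the conclusion can be reproduced from the three test-set confusion matrices. The counts below are copied from those tables; the data frame and its column names are introduced here for illustration. Note that the Naive Bayes split was drawn before `RNGkind(sample.kind = "Rounding")` was set, so its test set is not identical to the one used for the other two models.

```r
# True positives and false negatives copied from the three confusion matrices
results <- data.frame(
  model = c("Naive Bayes", "Decision Tree", "Random Forest"),
  TP    = c(60, 56, 56),
  FN    = c(3, 1, 1)
)

# Recall (sensitivity) = TP / (TP + FN)
results$recall <- round(results$TP / (results$TP + results$FN), 4)
print(results)
```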