knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
fig.align = "center",
comment = "#>"
)
“Predictive maintenance of machines” refers to the practice of using data and analytics techniques to predict when a machine is likely to fail, so that maintenance can be performed just in time to prevent the failure. In this Programming for Data Science with R project, we build a model that predicts machine failure from several supporting variables. The algorithms used are Naive Bayes, Decision Tree, and Random Forest. The data was collected from Kaggle.
library(dplyr) # for data wrangling
library(ggplot2) # to visualize data
library(gridExtra) # to display multiple graph
library(inspectdf) # for EDA
library(tidymodels) # to build tidy models
library(caret) # to pre-process data
library(e1071) # model naive bayes
library(partykit) # decision tree (ctree)
maintenance <- read.csv("datainput/predictive_maintenance.csv")
head(maintenance)
#> UDI Product.ID Type Air.temperature..K. Process.temperature..K.
#> 1 1 M14860 M 298.1 308.6
#> 2 2 L47181 L 298.2 308.7
#> 3 3 L47182 L 298.1 308.5
#> 4 4 L47183 L 298.2 308.6
#> 5 5 L47184 L 298.2 308.7
#> 6 6 M14865 M 298.1 308.6
#> Rotational.speed..rpm. Torque..Nm. Tool.wear..min. Failure.Type Target
#> 1 1551 42.8 0 No Failure 0
#> 2 1408 46.3 3 No Failure 0
#> 3 1498 49.4 5 No Failure 0
#> 4 1433 39.5 7 No Failure 0
#> 5 1408 40.0 9 No Failure 0
#> 6 1425 41.9 11 No Failure 0
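Column Description:
Predictor Variables: UDI, Product.ID, Type, Air.temperature..K., Process.temperature..K., Rotational.speed..rpm., Torque..Nm., Tool.wear..min., Failure.Type
Target Variable: Target (0 = no failure, 1 = failure)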
glimpse(maintenance)
#> Rows: 10,000
#> Columns: 10
#> $ UDI <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ Product.ID <chr> "M14860", "L47181", "L47182", "L47183", "L4718…
#> $ Type <chr> "M", "L", "L", "L", "L", "M", "L", "L", "M", "…
#> $ Air.temperature..K. <dbl> 298.1, 298.2, 298.1, 298.2, 298.2, 298.1, 298.…
#> $ Process.temperature..K. <dbl> 308.6, 308.7, 308.5, 308.6, 308.7, 308.6, 308.…
#> $ Rotational.speed..rpm. <int> 1551, 1408, 1498, 1433, 1408, 1425, 1558, 1527…
#> $ Torque..Nm. <dbl> 42.8, 46.3, 49.4, 39.5, 40.0, 41.9, 42.4, 40.2…
#> $ Tool.wear..min. <int> 0, 3, 5, 7, 9, 11, 14, 16, 18, 21, 24, 29, 34,…
#> $ Failure.Type <chr> "No Failure", "No Failure", "No Failure", "No …
#> $ Target <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
The dataset has 10 columns and 10,000 rows, and the output above shows the data type of each column. Checking the data types is a crucial step because each column must have a type appropriate for the analysis.
colSums(is.na(maintenance))
#> UDI Product.ID Type
#> 0 0 0
#> Air.temperature..K. Process.temperature..K. Rotational.speed..rpm.
#> 0 0 0
#> Torque..Nm. Tool.wear..min. Failure.Type
#> 0 0 0
#> Target
#> 0
The dataset has no missing values in any column.
maintenance_clean <- maintenance %>%
mutate_if(is.character, as.factor) %>% # convert character columns to factor
mutate(Target = as.factor(Target)) %>% # convert the integer target to factor
mutate(Product.ID = as.integer(Product.ID)) # Product.ID is a factor by now, so this yields its level codes
head(maintenance_clean)
#> UDI Product.ID Type Air.temperature..K. Process.temperature..K.
#> 1 1 7004 M 298.1 308.6
#> 2 2 1004 L 298.2 308.7
#> 3 3 1005 L 298.1 308.5
#> 4 4 1006 L 298.2 308.6
#> 5 5 1007 L 298.2 308.7
#> 6 6 7005 M 298.1 308.6
#> Rotational.speed..rpm. Torque..Nm. Tool.wear..min. Failure.Type Target
#> 1 1551 42.8 0 No Failure 0
#> 2 1408 46.3 3 No Failure 0
#> 3 1498 49.4 5 No Failure 0
#> 4 1433 39.5 7 No Failure 0
#> 5 1408 40.0 9 No Failure 0
#> 6 1425 41.9 11 No Failure 0
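The Product.ID values in the output above (7004, 1004, ...) are factor level codes rather than digits from the original IDs, because `as.integer()` applied to a factor returns the position of each level. A minimal sketch using three IDs from the data; the digit-extraction alternative at the end is only an assumption about what might be wanted and is not part of the original analysis:

```r
# Three IDs taken from the head() output above
ids <- c("M14860", "L47181", "L47182")

# Converting to factor first (as mutate_if does) sorts the levels alphabetically
ids_factor <- as.factor(ids)
levels(ids_factor)

# as.integer() on a factor returns level positions, not the numeric part of the ID
codes <- as.integer(ids_factor)
codes

# Hypothetical alternative: keep only the digits of each ID
digits <- as.integer(gsub("[^0-9]", "", ids))
digits
```

Either way, Product.ID remains an identifier, so its numeric value carries no physical meaning for the models.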
summary(maintenance_clean)
#> UDI Product.ID Type Air.temperature..K.
#> Min. : 1 Min. : 1 H:1003 Min. :295.3
#> 1st Qu.: 2501 1st Qu.: 2501 L:6000 1st Qu.:298.3
#> Median : 5000 Median : 5000 M:2997 Median :300.1
#> Mean : 5000 Mean : 5000 Mean :300.0
#> 3rd Qu.: 7500 3rd Qu.: 7500 3rd Qu.:301.5
#> Max. :10000 Max. :10000 Max. :304.5
#> Process.temperature..K. Rotational.speed..rpm. Torque..Nm. Tool.wear..min.
#> Min. :305.7 Min. :1168 Min. : 3.80 Min. : 0
#> 1st Qu.:308.8 1st Qu.:1423 1st Qu.:33.20 1st Qu.: 53
#> Median :310.1 Median :1503 Median :40.10 Median :108
#> Mean :310.0 Mean :1539 Mean :39.99 Mean :108
#> 3rd Qu.:311.1 3rd Qu.:1612 3rd Qu.:46.80 3rd Qu.:162
#> Max. :313.8 Max. :2886 Max. :76.60 Max. :253
#> Failure.Type Target
#> Heat Dissipation Failure: 112 0:9661
#> No Failure :9652 1: 339
#> Overstrain Failure : 78
#> Power Failure : 95
#> Random Failures : 18
#> Tool Wear Failure : 45
prop.table(table(maintenance_clean$Target))*100
#>
#> 0 1
#> 96.61 3.39
table(maintenance_clean$Target)
#>
#> 0 1
#> 9661 339
Checking the class proportions of the target variable “Target” shows 96.61% for class 0 (no failure) and 3.39% for class 1 (failure), so the classes are heavily imbalanced.
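With classes this uneven, accuracy alone can be misleading. As a rough illustration using the class counts printed above, a trivial classifier that always predicts "no failure" would already score a high accuracy while detecting no failures at all:

```r
# Class counts taken from table(maintenance_clean$Target) above
n_no_failure <- 9661
n_failure <- 339

# A trivial classifier that always predicts "0" (no failure)
baseline_accuracy <- n_no_failure / (n_no_failure + n_failure)
baseline_recall <- 0 / n_failure  # it never detects a failure

baseline_accuracy
baseline_recall
```

This is why the evaluation later in the report looks at recall/sensitivity and not only at accuracy.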
Naive Bayes algorithm is a probabilistic machine learning algorithm that is commonly used for classification tasks. It’s based on Bayes’ theorem with an assumption of independence among predictors. Despite its simplicity, Naive Bayes often performs surprisingly well in practice and is widely used in various applications such as spam filtering, sentiment analysis, and document classification.
Before building the predictive model, we split the data into a train dataset and a test dataset. The train dataset is used to train the Naive Bayes model. The test dataset is used for comparison, to see whether the model overfits and fails to predict new data it has not seen during the training phase. We will use 80% of the data as the training data and the rest as the testing data.
set.seed(100)
samplesize <- round(0.8 * nrow(maintenance_clean), 0)
index <- sample(seq_len(nrow(maintenance_clean)), size = samplesize)
data_train <- maintenance_clean[index, ]
data_test <- maintenance_clean[-index, ]
#Check number of training data
dim(data_train)
#> [1] 8000 10
#Check number of test data
dim(data_test)
#> [1] 2000 10
# remove the target variable from data_test
topredict_set<-data_test[1:9]
dim(topredict_set)
#> [1] 2000 9
# fit the Naive Bayes model on the training data
model_bayes <- naiveBayes(formula = Target ~.,
data = data_train)
model_bayes
#>
#> Naive Bayes Classifier for Discrete Predictors
#>
#> Call:
#> naiveBayes.default(x = X, y = Y, laplace = laplace)
#>
#> A-priori probabilities:
#> Y
#> 0 1
#> 0.9655 0.0345
#>
#> Conditional probabilities:
#> UDI
#> Y [,1] [,2]
#> 0 4996.539 2901.439
#> 1 4637.156 2397.567
#>
#> Product.ID
#> Y [,1] [,2]
#> 0 4997.167 2902.533
#> 1 4660.033 2533.953
#>
#> Type
#> Y H L M
#> 0 0.1027965 0.5950285 0.3021750
#> 1 0.0615942 0.7101449 0.2282609
#>
#> Air.temperature..K.
#> Y [,1] [,2]
#> 0 299.9726 1.987859
#> 1 300.8601 2.051360
#>
#> Process.temperature..K.
#> Y [,1] [,2]
#> 0 309.9921 1.487908
#> 1 310.2558 1.333213
#>
#> Rotational.speed..rpm.
#> Y [,1] [,2]
#> 0 1539.481 165.7596
#> 1 1483.772 362.3624
#>
#> Torque..Nm.
#> Y [,1] [,2]
#> 0 39.66137 9.450709
#> 1 50.27717 15.816492
#>
#> Tool.wear..min.
#> Y [,1] [,2]
#> 0 106.5498 62.88326
#> 1 146.1232 72.60743
#>
#> Failure.Type
#> Y Heat Dissipation Failure No Failure Overstrain Failure Power Failure
#> 0 0.000000000 0.998575867 0.000000000 0.000000000
#> 1 0.333333333 0.021739130 0.235507246 0.264492754
#> Failure.Type
#> Y Random Failures Tool Wear Failure
#> 0 0.001424133 0.000000000
#> 1 0.000000000 0.144927536
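To make the output above more concrete: for numeric predictors, the two columns `[,1]` and `[,2]` are the per-class mean and standard deviation, which Naive Bayes uses as a Gaussian likelihood. The sketch below applies Bayes' theorem with a single predictor (Torque) only, which is a simplification of the full model, and the torque value of 60 Nm is a hypothetical observation:

```r
# Values copied from the model output above
prior <- c("0" = 0.9655, "1" = 0.0345)
torque_mean <- c("0" = 39.66137, "1" = 50.27717)
torque_sd <- c("0" = 9.450709, "1" = 15.816492)

# Hypothetical observation: a torque of 60 Nm
torque_new <- 60

# Gaussian likelihood of the observation under each class
likelihood <- dnorm(torque_new, mean = torque_mean, sd = torque_sd)

# Bayes' theorem: the posterior is proportional to prior times likelihood
unnormalized <- prior * likelihood
posterior <- unnormalized / sum(unnormalized)
posterior
```

The full model multiplies the likelihoods of all predictors together under the independence assumption before normalizing.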
Predicting the target class on the test dataset (topredict_set):
preds_bayes <- predict(model_bayes, newdata = topredict_set)
(conf_matrix_bayes <- table(preds_bayes, data_test$Target))
#>
#> preds_bayes 0 1
#> 0 1926 3
#> 1 11 60
Insight :
In the confusion matrix, rows are predictions and columns are actual classes. The Naive Bayes model correctly predicted 1926 products as “Not-Failure” (true negatives) and wrongly predicted 3 actual failures as “Not-Failure” (false negatives). It correctly predicted 60 products as “Failure” (true positives) and wrongly flagged 11 non-failing products as “Failure” (false positives).
Positive class : “failure”
Metrics used:
FN (false negative): predicted as not-failure but actually failure. Recall penalizes these cases, so we want a high recall value.
FP (false positive): predicted as failure but actually not-failure.
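The metrics named above can be computed directly from the four cells of the confusion matrix. A minimal sketch using the Naive Bayes counts from the table above, with class "1" (failure) as the positive class:

```r
# Counts taken from the confusion matrix above (positive class = "1", failure)
tp <- 60    # predicted failure, actually failure
fn <- 3     # predicted not-failure, actually failure
fp <- 11    # predicted failure, actually not-failure
tn <- 1926  # predicted not-failure, actually not-failure

recall <- tp / (tp + fn)       # sensitivity
precision <- tp / (tp + fp)    # positive predictive value
specificity <- tn / (tn + fp)
accuracy <- (tp + tn) / (tp + fn + fp + tn)

round(c(recall = recall, precision = precision,
        specificity = specificity, accuracy = accuracy), 4)
```

Recall depends only on TP and FN, so it is the metric that directly reflects missed failures.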
confusionMatrix(data = preds_bayes, reference = data_test$Target, positive="1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 1926 3
#> 1 11 60
#>
#> Accuracy : 0.993
#> 95% CI : (0.9883, 0.9962)
#> No Information Rate : 0.9685
#> P-Value [Acc > NIR] : 5.348e-14
#>
#> Kappa : 0.8919
#>
#> Mcnemar's Test P-Value : 0.06137
#>
#> Sensitivity : 0.9524
#> Specificity : 0.9943
#> Pos Pred Value : 0.8451
#> Neg Pred Value : 0.9984
#> Prevalence : 0.0315
#> Detection Rate : 0.0300
#> Detection Prevalence : 0.0355
#> Balanced Accuracy : 0.9734
#>
#> 'Positive' Class : 1
#>
Insight :
The Naive Bayes classification model reaches an accuracy of 99.3% and a recall/sensitivity of 95.24% on the test data.
The Decision Tree algorithm is a popular machine learning technique used for both classification and regression tasks. It creates a tree-like structure where each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents the outcome or class label. Decision trees are easy to understand and interpret, making them valuable for both analysis and prediction tasks.
We will use 80% of the data as the training data and the rest as the testing data.
library(rsample)
RNGkind(sample.kind = "Rounding")
set.seed(100)
# split the data 80:20 into training and test sets
index_maintenance <- sample(nrow(maintenance_clean), nrow(maintenance_clean)*0.80)
maintenance_train <- maintenance_clean[index_maintenance,] # for training
maintenance_test <- maintenance_clean[-index_maintenance,] # for prediction
# Recheck the target variable class proportions
prop.table(table(maintenance_train$Target))
#>
#> 0 1
#> 0.96475 0.03525
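# upsampling: the balanced data is stored in diab_train, which is not used in the steps below
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)
diab_train <- upSample(x = maintenance_train %>% select(-Target),
y = maintenance_train$Target,
yname = "Target")
# Recheck the target variable class proportions (this inspects maintenance_train, so they are unchanged)
prop.table(table(maintenance_train$Target))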
#>
#> 0 1
#> 0.96475 0.03525
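For intuition, upsampling resamples the minority class with replacement until both classes are the same size. The base R sketch below illustrates that idea on hypothetical toy data (the `toy` data frame and its values are invented for illustration); it is not the internals of `upSample()`:

```r
set.seed(100)

# Toy data (hypothetical): 95 non-failures and 5 failures
toy <- data.frame(
  torque = c(rnorm(95, mean = 40, sd = 9), rnorm(5, mean = 50, sd = 15)),
  Target = factor(c(rep("0", 95), rep("1", 5)))
)

# Resample minority rows with replacement until both classes have equal size
minority <- which(toy$Target == "1")
majority <- which(toy$Target == "0")
extra <- sample(minority, size = length(majority) - length(minority), replace = TRUE)
toy_up <- toy[c(majority, minority, extra), ]

table(toy$Target)
table(toy_up$Target)
```

Upsampling is normally applied to the training data only, so that the test data keeps the original class proportions.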
maintenance_tree <- ctree(formula = Target~., data = maintenance_train)
maintenance_tree
#>
#> Model formula:
#> Target ~ UDI + Product.ID + Type + Air.temperature..K. + Process.temperature..K. +
#> Rotational.speed..rpm. + Torque..Nm. + Tool.wear..min. +
#> Failure.Type
#>
#> Fitted party:
#> [1] root
#> | [2] Failure.Type in Heat Dissipation Failure, Overstrain Failure, Power Failure, Tool Wear Failure: 1 (n = 274, err = 0.0%)
#> | [3] Failure.Type in No Failure, Random Failures: 0 (n = 7726, err = 0.1%)
#>
#> Number of inner nodes: 1
#> Number of terminal nodes: 2
# visualize the decision tree
plot(maintenance_tree, type="simple")
Insight :
We can observe the number of leaves (width) and the number of levels (depth) of the tree. Here:
[1] is the root node and the only inner node. [2] and [3] are terminal (leaf) nodes: branches point to them, but none leave them.
The tree splits on the variable “Failure.Type” alone. Heat Dissipation Failure, Overstrain Failure, Power Failure, and Tool Wear Failure fall into category “1” (failure), while No Failure and Random Failures fall into category “0” (not-failure).
Let’s evaluate maintenance_tree using a confusion matrix based on the prediction results on the test data:
Parameter type:
type = “prob” returns the probabilities for each class. type = “response” returns the class labels.
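The relationship between the two prediction types can be sketched in base R: class labels correspond to picking the most probable class from the probability matrix. The probabilities and the lower threshold below are hypothetical and only illustrate how probabilities could be used to trade precision for recall; the analysis here uses `type = "response"` directly:

```r
# Hypothetical class probabilities for three observations
probs <- matrix(
  c(0.98, 0.02,
    0.30, 0.70,
    0.55, 0.45),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("0", "1"))
)

# Default labelling: pick the class with the highest probability
labels_default <- colnames(probs)[max.col(probs, ties.method = "first")]

# Hypothetical lower threshold on P(failure) to favor recall
threshold <- 0.4
labels_recall <- ifelse(probs[, "1"] >= threshold, "1", "0")

labels_default
labels_recall
```

With the lower threshold, the third observation is flagged as a failure even though "0" is its most probable class.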
# prediction class of testing data
pred_tree <- predict(maintenance_tree, newdata = maintenance_test, type="response")
(conf_matrix_tree <- table(pred_tree, maintenance_test$Target))
#>
#> pred_tree 0 1
#> 0 1943 1
#> 1 0 56
Insight :
The confusion matrix shows that the Decision Tree model correctly predicted 1943 products as “Not-Failure” (true negatives) and wrongly predicted 1 actual failure as “Not-Failure” (false negative). It correctly predicted 56 products as “Failure” (true positives) and made no false-positive predictions.
Positive class : “failure”
Metrics used:
FN (false negative): predicted as not-failure but actually failure. Recall penalizes these cases, so we want a high recall value.
FP (false positive): predicted as failure but actually not-failure.
# confusion matrix testing data
confusionMatrix(data = pred_tree, reference = maintenance_test$Target, positive="1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 1943 1
#> 1 0 56
#>
#> Accuracy : 0.9995
#> 95% CI : (0.9972, 1)
#> No Information Rate : 0.9715
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.9909
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9825
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9995
#> Prevalence : 0.0285
#> Detection Rate : 0.0280
#> Detection Prevalence : 0.0280
#> Balanced Accuracy : 0.9912
#>
#> 'Positive' Class : 1
#>
Insight :
Based on the confusion matrix, the accuracy of this classification is 99.95%, which is higher than the Naive Bayes model (99.3%). The recall is 98.25%, also higher than the Naive Bayes model (95.24%).
Both methods above give very good results. However, we will also build a model with the Random Forest method to see whether performance improves. It is evaluated in the same way, so the results remain comparable, with attention to false negatives as well as false positives.
Random Forest is an ensemble learning method commonly used for classification and regression tasks. It constructs multiple decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
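The aggregation step described above can be sketched in a few lines of base R. The votes and numeric predictions are hypothetical; a real forest would produce them from its individual trees:

```r
# Hypothetical votes: rows are observations, columns are individual trees
votes <- matrix(
  c("0", "0", "1", "0", "0",
    "1", "1", "0", "1", "1",
    "0", "1", "1", "0", "1"),
  nrow = 3, byrow = TRUE
)

# Classification: the forest outputs the most frequent class per observation
majority_vote <- apply(votes, 1, function(v) names(which.max(table(v))))
majority_vote

# Regression: the forest averages the numeric predictions of the trees
tree_predictions <- c(41.2, 39.8, 40.5)
mean(tree_predictions)
```

Because each tree sees a different bootstrap sample and a random subset of variables at each split, individual errors tend to average out in the vote.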
Splitting train test with proportion 80%:20%
RNGkind(sample.kind = "Rounding")
set.seed(100)
# splitting train test
ind_rb <- sample(nrow(maintenance_clean), nrow(maintenance_clean)*0.8)
rb_train <- maintenance_clean[ind_rb,]
rb_test <- maintenance_clean[-ind_rb,]
Check the target class proportions in the test data
# Recheck the target variable class proportions
prop.table(table(rb_test$Target))
#>
#> 0 1
#> 0.9715 0.0285
library(randomForest)
model_rf <- randomForest(Target~ ., data = rb_train, importance=TRUE,
ntree = 500)
model_rf
#>
#> Call:
#> randomForest(formula = Target ~ ., data = rb_train, importance = TRUE, ntree = 500)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 3
#>
#> OOB estimate of error rate: 0.1%
#> Confusion matrix:
#> 0 1 class.error
#> 0 7718 0 0.00000000
#> 1 8 274 0.02836879
preds_rf <- predict(model_rf, rb_test)
head(preds_rf)
#> 1 7 8 19 25 26
#> 0 0 0 0 0 0
#> Levels: 0 1
plot(model_rf)
(conf_matrix_forestII <- table(preds_rf, rb_test$Target))
#>
#> preds_rf 0 1
#> 0 1943 1
#> 1 0 56
Insight :
The confusion matrix shows that the Random Forest model correctly predicted 1943 products as “Not-Failure” (true negatives) and wrongly predicted 1 actual failure as “Not-Failure” (false negative). It correctly predicted 56 products as “Failure” (true positives) and made no false-positive predictions.
confusionMatrix(conf_matrix_forestII, positive="1")
#> Confusion Matrix and Statistics
#>
#>
#> preds_rf 0 1
#> 0 1943 1
#> 1 0 56
#>
#> Accuracy : 0.9995
#> 95% CI : (0.9972, 1)
#> No Information Rate : 0.9715
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.9909
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9825
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9995
#> Prevalence : 0.0285
#> Detection Rate : 0.0280
#> Detection Prevalence : 0.0280
#> Balanced Accuracy : 0.9912
#>
#> 'Positive' Class : 1
#>
Insight :
Based on the confusion matrix, the accuracy of this classification is 99.95%, which is higher than the Naive Bayes model (99.3%) and equal to the Decision Tree model. The recall is 98.25%, higher than the Naive Bayes model (95.24%) and the same as the Decision Tree model.
• The models predict whether a product fails using 9 predictor variables: UDI, Product.ID, Type, Air.temperature..K., Process.temperature..K., Rotational.speed..rpm., Torque..Nm., Tool.wear..min., and Failure.Type.
• In the model evaluation stage for the three models above, we use the confusion matrix with sensitivity/recall as the priority metric. The focus is that, of all actual positive cases (failures), the model correctly identifies as many as possible.
• The risk if the model fails to predict accurately:
** FN: Predicted as not failure, but actually failure -> delayed treatment
** FP: Predicted as failure, but actually not failure -> incorrect treatment
• Because the focus is on minimizing false negatives, we compare sensitivity/recall. The three models do not differ greatly: Naive Bayes has the lowest value at 95.24%, while the Decision Tree and Random Forest models each reach 98.25%.
• In this case, the suitable models are the Decision Tree or the Random Forest, each with a sensitivity of 98.25%.
• The “Predictive Maintenance” model is created to assist companies in risk management for their machines. In the future, preventive maintenance can be focused on machines that fall into the “failure” category. Preventive maintenance of this kind helps companies optimize maintenance schedules, reduce downtime, and minimize the costs associated with unexpected failures.
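The comparison in the bullets above can be reproduced from the three test-set confusion matrices. This is a small summary sketch, not part of the original analysis. Note that the Naive Bayes test set contains 63 failures and the other two contain 57, so the models were not all scored on the same split and the comparison is indicative only:

```r
# Counts taken from the three test-set confusion matrices above
results <- data.frame(
  model = c("Naive Bayes", "Decision Tree", "Random Forest"),
  tp = c(60, 56, 56),
  fn = c(3, 1, 1),
  fp = c(11, 0, 0),
  tn = c(1926, 1943, 1943)
)

results$recall <- round(results$tp / (results$tp + results$fn), 4)
results$precision <- round(results$tp / (results$tp + results$fp), 4)
results$accuracy <- round((results$tp + results$tn) /
                            (results$tp + results$fn + results$fp + results$tn), 4)

# Sort by the priority metric (recall), highest first
results[order(-results$recall), c("model", "recall", "precision", "accuracy")]
```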