Main stages of the project:
- Data gathering;
- Exploratory Data Analysis;
- Modeling;
The goal is to identify the factors that affect the lifespan of engine carbon brushes and to find an appropriate algorithm for predicting carbon brush replacement (predictive maintenance).
___________________________________________
a) retrieve and organize the raw data;
b) build datasets that are convenient for further work from the collected raw data;
c) supplement the datasets with data from the plant PI system.
To find “abnormal” starts, a method based on the Nelson rules for detecting “out-of-control” conditions was proposed. For this purpose, all 43961 engine starts of 10 mills, from 2010 to 2018, were analyzed.
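As an illustration of how such a check might look, here is a minimal Python sketch of two of the Nelson rules: rule 1 (a point more than three standard deviations from the mean) and rule 2 (nine or more consecutive points on the same side of the mean). The function names, the choice of start duration as the monitored quantity, and the sample values are all assumptions made for the example; the project's actual implementation and the set of rules it applied are not described here.

```python
from statistics import mean, pstdev


def nelson_rule_1(values, center, sigma):
    """Return indices of points more than 3 sigma away from the center line."""
    return [i for i, v in enumerate(values) if abs(v - center) > 3 * sigma]


def nelson_rule_2(values, center, run_length=9):
    """Return indices at which a run of at least `run_length` consecutive
    points on the same side of the center line is in progress."""
    flagged = []
    run = 0
    prev_side = 0
    for i, v in enumerate(values):
        side = (v > center) - (v < center)  # +1 above, -1 below, 0 on the line
        if side != 0 and side == prev_side:
            run += 1
        else:
            run = 1 if side != 0 else 0
        prev_side = side
        if run >= run_length:
            flagged.append(i)
    return flagged


# Hypothetical start durations in seconds (not project data).
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9]
center = mean(baseline)
sigma = pstdev(baseline)

new_starts = [10.1, 9.9, 14.0, 10.0]
abnormal = nelson_rule_1(new_starts, center, sigma)
print(abnormal)
```

In this sketch the control limits are estimated from a baseline window and then applied to later starts; whether the project estimated them per mill or over the whole period is not stated.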
While trying to cluster the start phases by quality, I concluded that it is important to cluster whole working periods, because the start phase is part of a working period and the difference between start phases is determined by the difference between working periods. A better approach is to cluster the whole working period, using the start phase as one of the classifying features.
CM1, CM2, CM3, CM4, CM10, CM11, CM12, RMA, RMB, RMC.
Modeling is based on a dataset that contains 47 variables and hundreds of observations for each carbon brush of each engine. Each observation represents the parameters measured or calculated from the beginning of a carbon brush's usage until the date of each revision of that brush by staff (the 0/1 variable isReplased indicates whether the carbon brush was replaced during the revision).
The variables “general duration of start phase”, “general duration of ‘long start’ phase”, and “longStart_unstableRun_Sum” (the sum of unstable engine runs and runs with a long start phase) are the product of a classification algorithm (description here).
Finally, the number of variables was reduced from 47 to 9 by using the “information gain” criterion and by replacing sets of variables with their principal components.
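To make the “information gain” criterion concrete, the following Python sketch computes it for a single discrete feature as the entropy of the target minus the weighted entropy of the target within each feature value. The function names and the toy data are hypothetical, and the sketch assumes discrete feature values; continuous variables would first need to be binned, and the principal component step is not shown.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting
    the observations by the value of one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy example (not project data): 1 = brush replaced, 0 = not replaced.
replaced = [1, 1, 0, 0]
informative = ["long", "long", "short", "short"]  # separates the classes
uninformative = ["a", "b", "a", "b"]              # tells nothing about them

gain_informative = information_gain(informative, replaced)
gain_uninformative = information_gain(uninformative, replaced)
```

Variables would then be ranked by this score and the lowest-scoring ones dropped; the threshold used in the project is not given.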
To compare different models, I chose to evaluate their accuracy and sensitivity. ROC curves are also compared. Because the number of observations is relatively small, I use cross-validation.
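For readers unfamiliar with cross-validation, here is a minimal Python sketch of how k-fold splits can be generated: the observations are shuffled once, divided into k folds, and each fold serves as the test set exactly once while the remaining folds form the training set. The function name, the number of folds, and the seed are assumptions for illustration; the fold count actually used in the project is not stated.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Example: 10 observations, 5 folds (both numbers are arbitrary).
splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print(sorted(train), sorted(test))
```

Each model would be fitted on the training indices and scored on the test indices, with accuracy and sensitivity averaged over the folds.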
Seven algorithms were compared against a “red line” baseline (noModel): a random decision on carbon brush replacement.
##                                 model sensitivity  accuracy
## 1                  Random Forest (RF)   0.8102190 0.7054313
## 2                  Decision Tree (DT)   0.7299270 0.6747604
## 3     Artificial Neural Network (ANN)   0.7080292 0.6594249
## 4                    Naive Bayes (NB)   0.6861314 0.6511182
## 5           Logistic Regression (GLM)   0.6715328 0.6351438
## 6 Mixture Discriminant Analysis (MDA)   0.7007299 0.5987220
## 7       Support Vector Machines (SVM)   0.9562044 0.2817891
## 8            K-Nearest Neighbor (KNN)   0.6277372 0.3169329
## 9                            No Model   0.1386861 0.8453674
Once it became clear that the Decision Tree gave the best performance, an ensemble method for DT, Random Forest, was tried. Finally, the RF model was used for prediction. To illustrate the prediction of the Random Forest model, the results of the carbon brush revision on December 13, 2011 were taken and compared with the prediction. The confusion matrix of the prediction:
## randomForest.Prediction
##          predicted 0 predicted 1
## actual 0          12           8
## actual 1           2           8
## [1] "prediction sensitivity is 80%"
## [1] "prediction accuracy is 67%"
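The two percentages can be reproduced directly from the confusion matrix. The short Python sketch below does the arithmetic, taking class 1 (brush replaced) as the positive class; the function names are illustrative and not taken from the project code.

```python
def sensitivity(tp, fn):
    """Share of actual positives that were predicted as positive."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Share of all observations that were predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts read from the confusion matrix (rows: actual, columns: predicted).
tn, fp = 12, 8   # actual 0: predicted 0, predicted 1
fn, tp = 2, 8    # actual 1: predicted 0, predicted 1

sens = sensitivity(tp, fn)      # 8 / 10
acc = accuracy(tp, tn, fp, fn)  # 20 / 30

print(f"prediction sensitivity is {sens:.0%}")
print(f"prediction accuracy is {acc:.0%}")
```

Of the 10 brushes actually replaced, 8 were predicted correctly, and 20 of the 30 revisions overall were classified correctly.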
Comments about prediction are here.
During the project, some intermediate conclusions were drawn and some facts were collected. These conclusions and facts need further research, so it may be useful to put them in a separate section for future consideration.