This module is part of the “Engine Carbon Brushes Replacement” project. Its purpose is to classify the start phases of the Cement Mill 4 engine.
An attempt to cluster the start phases by quality led to the conclusion that whole work periods should be clustered instead: the start phase is part of a work period, and the differences between start phases are determined by the differences between work periods. It is therefore better to cluster whole work periods, using the start phase as one of the classifiers.
Four parameters were taken as classifiers of the work periods:
a) average current;
b) standard deviation of current;
c) duration of the start phase;
d) time elapsed between the previous and the analyzed work period.
Before running the classification algorithms, let’s check whether the chosen classifiers are correlated.
There is no correlation between the classifiers.
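A correlation check of this kind can be sketched in R with `cor()`. The data frame below is synthetic and only stands in for the work-period table; the column names are borrowed from the weight table in this report, and the use of the Pearson coefficient is an assumption, since the report does not say which coefficient was used.

```r
# Synthetic stand-in for the work-period table (values are made up).
set.seed(1)
n <- 200
periods <- data.frame(
  avgCurrent               = rnorm(n, mean = 100, sd = 10),
  stdCurrent               = rexp(n, rate = 1),
  startPhase               = runif(n, min = 5, max = 60),
  hoursFromPreviousRunning = rexp(n, rate = 1 / 24)
)

# Pairwise Pearson correlations between the four classifiers.
corMatrix <- cor(periods)
print(round(corMatrix, 2))

# Largest absolute off-diagonal correlation as a one-number summary.
maxAbsCor <- max(abs(corMatrix[upper.tri(corMatrix)]))
print(maxAbsCor)
```

A value of `maxAbsCor` close to zero would support treating the classifiers as uncorrelated.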
Classification methods
The classification method will be chosen from the hierarchical clustering methods “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, and “centroid”, and the K-means clustering method.
The best clustering method and the optimal number of clusters are evaluated with the Calinski-Harabasz (CH) index.
For this purpose, all listed algorithms were applied to the dataset with the number of clusters from 2 to 20 (step 1). The results are shown in the chart.
(The R code is based on functions published at
https://github.com/ethen8181/machine-learning/blob/master/clustering_old/clustering/clustering_functions.R)
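The comparison described above can be sketched with base R alone. This is not the code from the linked repository: `chIndex` is a hypothetical helper written for illustration, the data are synthetic, and standardizing the features with `scale()` is an assumption.

```r
# Calinski-Harabasz index: (BSS / (k - 1)) / (WSS / (n - k)).
chIndex <- function(x, cluster) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- length(unique(cluster))
  # Cluster centroids, one row per cluster (rows sorted by cluster label).
  centers <- rowsum(x, cluster) / as.vector(table(cluster))
  wss <- sum((x - centers[as.character(cluster), , drop = FALSE])^2)
  tss <- sum(sweep(x, 2, colMeans(x))^2)
  ((tss - wss) / (k - 1)) / (wss / (n - k))
}

# Synthetic data: three groups of 50 observations with 4 features each.
set.seed(42)
x <- scale(rbind(
  matrix(rnorm(200, mean = 0), ncol = 4),
  matrix(rnorm(200, mean = 4), ncol = 4),
  matrix(rnorm(200, mean = 8), ncol = 4)
))

hMethods <- c("ward.D", "ward.D2", "single", "complete",
              "average", "mcquitty", "centroid")
ks <- 2:20
d <- dist(x)

# One column per method, one row per number of clusters.
chTable <- data.frame(k = ks)
for (m in hMethods) {
  tree <- hclust(d, method = m)
  chTable[[m]] <- sapply(ks, function(k) chIndex(x, cutree(tree, k = k)))
}
chTable$kmeans <- sapply(ks, function(k) {
  chIndex(x, kmeans(x, centers = k, nstart = 10, iter.max = 50)$cluster)
})
print(round(chTable, 2))
```

A higher CH index means more compact and better separated clusters, so each column can be read as one line of a CH-index chart.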
According to the chart, the best result is given by the K-means method dividing the data into 15 groups.
The highest CH index on the “CH-index” chart is at k=15 on the K-means line. Clustering all observations into 15 clusters is redundant; I intend to divide the data into no more than 5 clusters. On the “wss” chart, the K-means line has the smallest values, and the sharp decline of wss gives way to a relatively smooth decline at k=5. For k up to 5, the CH-index chart gives the best result at k=5 for the K-means method. The K-means method with k=5 will therefore be used for clustering.
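The “wss” reasoning above (a sharp decline followed by a smooth one) can be made explicit by looking at the relative drop in within-cluster sum of squares from one k to the next. The sketch below uses synthetic data with five groups; the data and the `relDrop` column are illustrative assumptions, not values from the mill dataset.

```r
# Synthetic data: five groups of 30 observations with 4 features each.
set.seed(7)
groupMeans <- c(0, 5, 10, 15, 20)
x <- do.call(rbind, lapply(groupMeans, function(m) {
  matrix(rnorm(120, mean = m), ncol = 4)
}))

ks <- 2:20
# Total within-cluster sum of squares of K-means for each k.
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# Relative decrease of wss compared with the previous k (NA for the first k).
relDrop <- c(NA, -diff(wss) / head(wss, -1))
elbow <- data.frame(k = ks, wss = round(wss, 1), relDrop = round(relDrop, 3))
print(elbow)
```

The k after which `relDrop` stays small is a candidate for the elbow.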
Let’s show the CH index data as a matrix:
##     k  ward.D ward.D2 single complete average mcquitty centroid  kmeans
##  1  2 1175.71 1212.25 358.82   358.82  358.82   358.82   358.82 1241.85
##  2  3 1058.60 1148.23 228.10   436.31  328.07   490.78   387.12 1174.09
##  3  4  938.59 1313.97 155.09   612.20  462.20   666.61   563.50 1368.45
##  4  5  747.28 1609.39 194.97   486.41  491.64   587.56   426.48 1757.77
##  5  6  981.95 1660.52 220.98   531.22  504.90   479.16   348.40 1756.30
##  6  7  873.16 1625.05 190.71   461.42  427.06   415.43   303.05 1756.55
##  7  8 1197.11 1647.67 188.45   451.35  369.06   480.83   264.48 1908.26
##  8  9 1115.73 1627.78 165.21   400.73  328.14   424.10   331.05 2026.20
##  9 10 1056.24 1645.04 154.16   488.40  293.83   396.17   296.49 2076.99
## 10 11 1320.03 1697.19 160.49   446.62  268.40   412.63   370.49 2074.22
## 11 12 1247.93 1731.66 148.27   409.33  307.92   377.81   337.30 2124.18
## 12 13 1244.54 1769.24 137.38   379.00  284.01   351.31   323.44 2127.35
## 13 14 1202.29 1801.93 129.96   353.48  268.71   335.77   302.83 2134.00
## 14 15 1136.65 1845.69 123.44   387.30  323.46   314.45   282.25 2167.62
## 15 16 1399.30 1871.77 145.44   363.66  336.84   350.73   306.76 2159.37
## 16 17 1331.65 1917.10 136.42   357.37  318.58   331.74   294.42 2124.79
## 17 18 1690.38 1971.46 128.42   341.50  302.04   320.36   277.95 2073.80
## 18 19 1637.82 1985.97 121.36   334.17  285.70   303.03   264.42 2059.83
## 19 20 1565.05 1987.56 128.45   317.66  271.22   292.94   251.22 2003.53
Let’s apply the K-means clustering method with k=5. The algorithm divided all engine runs into 5 clusters in the following way:
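A K-means run of this kind can be sketched with `kmeans()` from base R. The data are synthetic, and both the standardization with `scale()` and the `nstart` value are assumptions; the report does not state how the features were preprocessed.

```r
# Synthetic stand-in for the work-period table (values are made up).
set.seed(123)
n <- 300
periods <- data.frame(
  avgCurrent               = rnorm(n, mean = 100, sd = 10),
  stdCurrent               = rexp(n, rate = 1),
  startPhase               = runif(n, min = 5, max = 60),
  hoursFromPreviousRunning = rexp(n, rate = 1 / 24)
)

# Standardize so that no feature dominates the Euclidean distance.
scaled <- scale(periods)

# K-means with k = 5; several random starts reduce the risk of a poor
# local optimum.
fit <- kmeans(scaled, centers = 5, nstart = 25, iter.max = 50)
periods$cluster <- factor(fit$cluster)

# Cluster sizes as counts and as percentages of all runs.
sizes <- table(periods$cluster)
print(sizes)
print(round(100 * prop.table(sizes), 1))
```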
By comparing the information gain of each classifier against the clusters, we can see the relative weight of each classifier in the clustering:
##                    VarName     weight
## 1               avgCurrent 0.75534938
## 2               stdCurrent 0.11672805
## 3 hoursFromPreviousRunning 0.07418759
## 4               startPhase 0.04579100
Average current has the most significant influence on the clustering; the duration of the start phase has the least.
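The report does not say how the information gain was computed. One possible way, sketched below in base R, is to cut each numeric classifier into equal-width bins and measure how much knowing the bin reduces the entropy of the cluster label. The helpers `entropy` and `infoGain`, the binning scheme, and the synthetic data are all assumptions made for illustration, so the resulting numbers are not comparable with the table above.

```r
# Shannon entropy (in nats) of a vector of labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Information gain of the cluster label given a numeric feature that is
# first cut into equal-width bins (the binning scheme is an assumption).
infoGain <- function(feature, cluster, bins = 10) {
  binned <- cut(feature, breaks = bins)
  groups <- split(cluster, binned, drop = TRUE)
  hCond <- sum(sapply(groups, function(cl) {
    length(cl) / length(cluster) * entropy(cl)
  }))
  entropy(cluster) - hCond
}

# Synthetic stand-in for the work-period table (values are made up).
set.seed(5)
n <- 300
periods <- data.frame(
  avgCurrent               = rnorm(n, mean = 100, sd = 10),
  stdCurrent               = rexp(n, rate = 1),
  startPhase               = runif(n, min = 5, max = 60),
  hoursFromPreviousRunning = rexp(n, rate = 1 / 24)
)
fit <- kmeans(scale(periods), centers = 5, nstart = 25, iter.max = 50)

# Information gain of every classifier, sorted from largest to smallest.
gain <- sapply(periods, infoGain, cluster = fit$cluster)
weights <- data.frame(VarName = names(gain), weight = unname(gain))
weights <- weights[order(-weights$weight), ]
rownames(weights) <- NULL
print(weights)
```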
With the help of a radar chart, let’s look at the distribution of the chosen parameters between the groups.
In contrast to the other mills, here we have only two big clusters (98% together) that include the majority of the engine’s stable runs. Both clusters (“1” and “3”) have a narrow range of current fluctuation and a relatively short start phase. Cluster “1” has significantly higher current than cluster “3”.
All three small clusters, “2”, “4”, and “5”, have similarly significant current fluctuation.
Cluster “2” includes work periods with a relatively long start phase.
Cluster “4” includes work periods with relatively higher current than clusters “2” and “5”.
Cluster “5” differs from the others by a significantly longer period since the previous run.
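The numbers behind a radar chart of this kind are per-cluster means of each parameter, rescaled to a common 0 to 1 range so that the axes are comparable. The sketch below computes such a profile table in base R; the data are synthetic, `rescale01` is a hypothetical helper, and min-max rescaling is an assumption about how the chart axes were prepared.

```r
# Synthetic stand-in for the work-period table (values are made up).
set.seed(11)
n <- 300
periods <- data.frame(
  avgCurrent               = rnorm(n, mean = 100, sd = 10),
  stdCurrent               = rexp(n, rate = 1),
  startPhase               = runif(n, min = 5, max = 60),
  hoursFromPreviousRunning = rexp(n, rate = 1 / 24)
)
fit <- kmeans(scale(periods), centers = 5, nstart = 25, iter.max = 50)

# Mean of every parameter within each cluster, in original units.
centers <- aggregate(periods, by = list(cluster = fit$cluster), FUN = mean)

# Min-max rescaling so each parameter spans 0 (lowest cluster mean)
# to 1 (highest cluster mean).
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
profile <- centers
profile[-1] <- lapply(centers[-1], rescale01)
print(round(profile, 2))
```

Each row of `profile` corresponds to one polygon of the radar chart.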
Let’s compare the average current in the groups with the average current of the start phases of those groups.
The major groups 1 and 3 have a relatively narrow range of current fluctuation.
In contrast, the small groups 2, 4, and 5 have a wide range of current fluctuation.
Now we have a question:
Could some processes or events connected to the mill have had only a temporary influence, belonging to a specific time period and no longer being relevant?
Cement mill 4 produces only AM cement.
According to the chart, some event happened in 2014 that caused a change (a decrease) in the working current and brought some stability. In fact, it is possible to say that clusters 1 and 3 represent the stable work of the engine before and after that event.
The next chart shows the density of current values within the different clusters.
The next chart shows the density of start-phase durations within the different clusters.
Group “2” includes the runs with the longest start phases, and groups “4” and “5” include the most unstable runs.