This module is part of the “Engine Carbon Brushes Replacement” project. Its purpose is to classify the start phases of the Cement Mill 1 engine.

An attempt to cluster start phases by quality led to the conclusion that it is important to cluster whole work periods: the start phase is part of a work period, and the differences between start phases are determined by the differences between work periods. It is therefore better to cluster whole work periods, using the start phase as one of the classifiers.

Four parameters were taken as classifiers of the work periods:
a) average current;
b) standard deviation of the current;
c) duration of the start phase;
d) time elapsed between the previous and the analyzed work period.
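The four classifiers above can be derived from raw current readings roughly as follows. This is only a sketch: the input layout (`runId`, `time`, `current`, `isStartPhase`), the 60-second sampling step, and the synthetic values are assumptions made for illustration; only the resulting column names follow the ones used later in this report.

```r
# Hypothetical raw readings: one row per measurement of the engine current.
# Column names and the 60-second sampling step are assumptions of this sketch.
set.seed(1)
readings <- data.frame(
  runId        = rep(1:3, each = 100),
  time         = as.POSIXct("2020-01-01 00:00:00", tz = "UTC") +
                 rep(c(0, 36000, 90000), each = 100) + rep(0:99, times = 3) * 60,
  current      = rnorm(300, mean = 200, sd = 6),
  isStartPhase = rep(c(rep(TRUE, 5), rep(FALSE, 95)), times = 3)
)

# One row of classifiers per work period
features <- do.call(rbind, lapply(split(readings, readings$runId), function(r) {
  data.frame(
    runId      = r$runId[1],
    avgCurrent = mean(r$current),            # a) average current
    stdCurrent = sd(r$current),              # b) standard deviation of the current
    startPhase = sum(r$isStartPhase) * 60,   # c) start phase duration in seconds
    runStart   = min(r$time),
    runEnd     = max(r$time)
  )
}))
features <- features[order(features$runStart), ]

# d) hours between the end of the previous run and the start of this one
#    (undefined, hence NA, for the first run)
features$hoursFromPreviousRunning <- c(
  NA,
  as.numeric(difftime(features$runStart[-1],
                      features$runEnd[-nrow(features)],
                      units = "hours"))
)

print(features[, c("runId", "avgCurrent", "stdCurrent",
                   "startPhase", "hoursFromPreviousRunning")])
```

The first work period has no predecessor, so its time gap is left as `NA`; how such rows are handled before clustering is a separate decision.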

Before running the classification algorithms, let’s check whether there are any correlations between the chosen classifiers.

There is no correlation between the classifiers.
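A correlation check of this kind can be sketched with the base `cor()` function. The data frame below is synthetic and its values are assumptions; only the column names follow the classifiers of this report.

```r
# Synthetic stand-in for the table of work periods (values are illustrative only)
set.seed(42)
features <- data.frame(
  avgCurrent               = rnorm(200, mean = 200, sd = 10),
  stdCurrent               = rnorm(200, mean = 6, sd = 1),
  startPhase               = rnorm(200, mean = 120, sd = 30),
  hoursFromPreviousRunning = rexp(200, rate = 1 / 10)
)

# Pairwise Pearson correlations between the classifiers
corMatrix <- cor(features, method = "pearson")
print(round(corMatrix, 2))

# Largest absolute correlation between two different classifiers
maxAbsCor <- max(abs(corMatrix[upper.tri(corMatrix)]))
print(maxAbsCor)
```

Values of `maxAbsCor` close to 0 support treating the classifiers as uncorrelated; values close to 1 would suggest dropping or combining some of them.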

Classification methods

The classification method will be chosen from the hierarchical clustering methods “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, and “centroid”, and the K-means clustering method.

The best clustering method and the optimal number of clusters are evaluated with the Calinski-Harabasz (CH) index.
It is better to use as few groups as possible (no more than 4 or 5), but to see the general picture let’s examine from 2 to 20 clusters. The results are shown in the chart.
(The R code is based on functions published at
https://github.com/ethen8181/machine-learning/blob/master/clustering_old/clustering/clustering_functions.R)
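The comparison can be sketched in base R as follows. This is not the code from the repository above: `chIndex` is a helper written for this illustration, the data are synthetic, and only 2 to 6 clusters are scanned to keep the example short. The CH index is computed as (BSS / (k - 1)) / (WSS / (n - k)), where BSS and WSS are the between-cluster and within-cluster sums of squares.

```r
# Calinski-Harabasz index for a given partition (helper written for this sketch)
chIndex <- function(x, clusters) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- length(unique(clusters))
  totalSS <- sum(scale(x, scale = FALSE)^2)
  wss <- sum(sapply(split(as.data.frame(x), clusters), function(g) {
    sum(scale(as.matrix(g), scale = FALSE)^2)
  }))
  (totalSS - wss) / (k - 1) / (wss / (n - k))
}

# Synthetic data with three well separated groups (assumption for illustration)
set.seed(7)
x <- scale(rbind(matrix(rnorm(100, 0), ncol = 2),
                 matrix(rnorm(100, 5), ncol = 2),
                 matrix(rnorm(100, 10), ncol = 2)))

ks <- 2:6
hMethods <- c("ward.D", "ward.D2", "single", "complete",
              "average", "mcquitty", "centroid")
d <- dist(x)

# One row per k, one column per clustering method
CHIndex_Tab <- data.frame(k = ks)
for (m in hMethods) {
  hc <- hclust(d, method = m)
  CHIndex_Tab[[m]] <- sapply(ks, function(k) chIndex(x, cutree(hc, k)))
}
CHIndex_Tab$kmeans <- sapply(ks, function(k) {
  chIndex(x, kmeans(x, centers = k, nstart = 25)$cluster)
})

print(round(CHIndex_Tab, 2))
```

The resulting table has the same shape as the CH index matrix shown later in this report: the first column holds k and each remaining column holds one method.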

According to the chart, the best result is given by the K-means method, which divides the data into 5 groups.
On the “CH-index” chart, the largest CH index is at k=5 on the K-means line. On the “wss” (within-cluster sum of squares) chart, the K-means line has the smallest values, and the steep decline of wss gives way to a relatively smooth decline at k=5.
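The “elbow” reading of the wss chart can be illustrated with a small sketch. The data are synthetic (three groups, so the elbow is expected at k=3 rather than at k=5), and `wss` here is taken from the `tot.withinss` component returned by `kmeans()`.

```r
# Synthetic data with three well separated groups (assumption for illustration)
set.seed(7)
x <- rbind(matrix(rnorm(100, 0), ncol = 2),
           matrix(rnorm(100, 5), ncol = 2),
           matrix(rnorm(100, 10), ncol = 2))

# Total within-cluster sum of squares for each number of clusters
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# Decrease of wss when going from k to k + 1; the elbow is where it flattens
drops <- -diff(wss)
print(data.frame(k = ks, wss = round(wss, 1), dropToNext = round(c(drops, NA), 1)))
```

In this sketch the decrease of wss is large up to k=3 and small afterwards, which is the same pattern the report describes at k=5 for the mill data.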

Let’s represent the CH Index data in the matrix view:

##     k ward.D ward.D2 single complete average mcquitty centroid  kmeans
## 1   2 767.03  805.70 764.98   764.98  764.98   787.09   787.09  921.69
## 2   3 834.50  956.82 395.66   624.62  624.62   644.33   558.26 1002.36
## 3   4 766.26 1118.79 290.39   617.21  547.85   452.46   393.69 1278.37
## 4   5 647.41 1274.14 219.43   528.87  419.92   492.14   509.20 1499.80
## 5   6 589.38 1202.66 196.52   549.96  385.73   401.75   415.57 1432.73
## 6   7 521.96 1170.40 170.50   466.77  420.75   436.83   347.95 1391.47
## 7   8 659.92 1153.44 153.86   416.93  364.42   378.31   323.50 1369.58
## 8   9 672.13 1124.69 157.55   368.20  322.19   370.65   285.92 1374.68
## 9  10 620.04 1111.73 170.37   349.32  288.74   331.84   260.70 1350.52
## 10 11 578.80 1104.11 170.35   317.52  261.01   312.24   236.61 1336.72
## 11 12 988.94 1106.79 162.46   401.13  238.78   285.05   216.49 1344.33
## 12 13 966.85 1119.49 184.22   569.62  219.97   263.33   199.96 1356.29
## 13 14 912.56 1124.48 172.37   581.89  204.55   244.24   185.74 1386.50
## 14 15 944.93 1121.49 160.58   547.40  190.82   227.95   172.83 1366.29
## 15 16 909.05 1110.28 161.17   518.58  180.75   213.71   161.87 1364.49
## 16 17 875.59 1096.03 152.24   489.11  172.04   207.96   175.20 1351.78
## 17 18 859.13 1084.28 143.34   626.79  235.46   198.62   165.40 1351.20
## 18 19 845.39 1067.88 135.46   596.30  223.10   188.19   158.95 1328.14
## 19 20 837.99 1053.66 128.38   589.34  211.86   240.43   150.99 1322.46

The largest value appears to be at k=5 for the K-means method. Let’s scan the matrix to find the optimal clustering method and the optimal number of clusters (the largest value in the matrix):

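# Method (column) whose maximum CH index is the largest; the first column holds k
bestMethod <- names(which.max(apply(CHIndex_Tab[-1], 2, max)))
maxIndex <- max(CHIndex_Tab[bestMethod])

# Number of clusters at which that method reaches its maximum
k <- CHIndex_Tab[CHIndex_Tab[bestMethod] == maxIndex, 1]

print(paste("The optimal clustering method is ", bestMethod, ". The optimal number of clusters is ", k, ".", sep = ""))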
## [1] "The optimal clustering method is kmeans. The optimal number of clusters is 5."

Let’s apply the K-means clustering method with k=5. The algorithm divides all engine runs into 5 clusters.
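A minimal sketch of this step is given below. The data are synthetic, and scaling the classifiers before clustering and using `nstart = 25` are assumptions of this example; the report does not state how the original run was configured.

```r
# Synthetic stand-in for the table of work periods (values are illustrative only)
set.seed(123)
features <- data.frame(
  avgCurrent               = rnorm(200, mean = 200, sd = 10),
  stdCurrent               = rnorm(200, mean = 6, sd = 1),
  startPhase               = rnorm(200, mean = 120, sd = 30),
  hoursFromPreviousRunning = rexp(200, rate = 1 / 10)
)

# Put all classifiers on a comparable scale (assumption), then cluster
scaled <- scale(features)
fit <- kmeans(scaled, centers = 5, nstart = 25)

# Attach the cluster label to every work period
features$cluster <- factor(fit$cluster)

# Cluster sizes and per-cluster averages of the classifiers
print(table(features$cluster))
print(aggregate(. ~ cluster, data = features, FUN = mean))
```

The per-cluster averages give a first numeric impression of how the groups differ before any chart is drawn.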

Clusters research

Comparing the information gain of each classifier with respect to the clusters shows the relative weight of each classifier in the clustering:

##                    VarName     weight
## 1               avgCurrent 0.59602941
## 2               startPhase 0.07757490
## 3               stdCurrent 0.04458529
## 4 hoursFromPreviousRunning 0.03315004

Average current has the most significant influence on the clustering; the time since the previous run has the least.
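The idea behind such weights can be sketched in base R. The report does not say which tool produced the table above, so the helpers `entropy` and `infoGain`, the equal-frequency binning into 5 intervals, and the synthetic data are all assumptions of this illustration. Information gain is computed as the entropy of the cluster labels minus their entropy conditional on the binned classifier.

```r
# Shannon entropy of a vector of labels (natural logarithm)
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Information gain of a numeric classifier with respect to the cluster labels,
# after equal-frequency binning (helper written for this sketch)
infoGain <- function(feature, cluster, bins = 5) {
  breaks <- unique(quantile(feature, probs = seq(0, 1, length.out = bins + 1)))
  binned <- cut(feature, breaks = breaks, include.lowest = TRUE)
  condEntropy <- sum(sapply(split(cluster, binned, drop = TRUE), function(g) {
    length(g) / length(cluster) * entropy(g)
  }))
  entropy(cluster) - condEntropy
}

# Synthetic example: avgCurrent depends on the cluster, stdCurrent does not
set.seed(5)
n <- 300
cluster <- factor(sample(1:3, n, replace = TRUE))
features <- data.frame(
  avgCurrent = rnorm(n, mean = c(180, 200, 220)[cluster], sd = 4),
  stdCurrent = rnorm(n, mean = 6, sd = 1)
)

weights <- sapply(features, infoGain, cluster = cluster)
weightTab <- data.frame(VarName = names(weights), weight = unname(weights))
weightTab <- weightTab[order(-weightTab$weight), ]
print(weightTab)
```

A classifier that separates the clusters well gets a weight close to the entropy of the cluster labels, while an unrelated classifier gets a weight close to 0.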

With the help of a radar chart, let’s look at the distribution of the chosen parameters across the groups.

The three major groups, “1”, “3”, and “4”, differ only in average current. Group “4” has a relatively higher average current than groups “1” and “3”, and group “1” has a relatively lower average current than groups “3” and “4”. The average current of group “3” lies somewhere between those of groups “1” and “4”.
The main distinguishing feature of group “5” is a significantly longer start phase.
The smallest group, “2”, differs from the others by a large standard deviation of the current (both in the start phase and over the whole work period), a relatively higher current during the start phase, and a long time interval since the previous run.

Let’s compare the average current in each group with the average current of the start phases in that group.

The major groups 1, 3, and 4 have a relatively narrow range of current oscillation, and the average currents of their start phases (white points) are close to the overall averages.

In contrast, the small group 2 has a wide range of current oscillation, and the average current of its start phases differs significantly from the overall average.

Group “5” is an in-between type, between the stable major groups and the small unstable group “2”.

Two questions now arise:
a) Could some processes or events connected to the mill have had a temporary influence, confined to a specific time period and no longer relevant?
b) Could the material type influence the groups?

According to the chart, there is clearly no connection between the division into groups and any particular time period. All groups are represented throughout the whole analyzed period.
It appears that the material type has no significant influence on the group division, but let’s dig deeper into the issue.

Here is the table of the percentage share (by work hours) of material types in the groups:

##        
##             1     2     3     4     5
##   AM    21.17  0.33 22.62 18.26  3.23
##   AMLS   0.49  0.00  0.21  0.40  0.02
##   BLL    5.84  0.09  4.56  3.49  0.49
##   empty  1.23  0.07  0.91  0.58  0.02
##   NB     2.61  0.02  4.54  2.56  0.40
##   OPC    0.81  0.02  2.26  2.65  0.12

It now appears that there is no connection between material types and groups.
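A table of this kind can be built with `xtabs()` and `prop.table()`. The sketch below uses a tiny made-up data set; the column names `material`, `group`, and `workHours` are assumptions, and only the material codes are taken from the table above.

```r
# Hypothetical work periods with material type, group, and duration in hours
runs <- data.frame(
  material  = c("AM", "AM", "BLL", "OPC", "AM", "NB"),
  group     = factor(c(1, 3, 1, 4, 3, 5), levels = 1:5),
  workHours = c(10, 20, 5, 5, 8, 2)
)

# Work hours summed per material type and group
hoursTab <- xtabs(workHours ~ material + group, data = runs)

# Share of each cell in the total work hours, in percent
pctTab <- round(100 * prop.table(hoursTab), 2)
print(pctTab)
```

Each cell is a percentage of the total work hours, so all cells together sum to 100. Passing `margin = 2` to `prop.table()` would instead give the material mix within each group, which makes groups of very different sizes easier to compare.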

As already explained, the average current is the main factor in the group division. The next chart shows the density of current values for the different material types. The average current of each group is indicated by a vertical black line.

On the one hand, the distributions of the current differ between material types. On the other hand, the “problematic” groups “2” and “5” include all material types, so the type of milled material has no prominent connection with average current as a classifier.



According to the chart, different material types have different average start phase durations, but for group “5”, which includes the longest start phases (avg 262), it is impossible to identify a specific material type with a strong influence.



Work with all material types shows the same standard deviation of engine current (about 6 amperes). For this classifier it is clear that there is no connection between material types and the group division.



Clearly, for this classifier there is also no connection between material types and the group division.



These charts show that material types have some influence on the milling process. The influence of material types will be tested further.
For average current and start phase duration, there is no significant connection between the group division and material types.

Group “5” includes the runs with the longest start phases, and group “2” includes the most unstable runs.