EngineCarbonBrushesReplacement-EDA

This module is part of the “Engine Carbon Brushes Replacement” project. Here I look for correlations between variables based on the collected observations.

The input of the module is a prepared dataset. It contains data collected per engine run from 2010 until 2018.
Along with the mill name and the start and end times of each engine run, the dataset contains 35 more variables and 43961 observations.

Cement mill 1 - cross correlation

Reduction of redundant variables

First, let’s remove from the dataset:
a) non-numeric variables;
b) “NA” values;
c) short runs (less than 8 min).
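The three reductions above can be sketched in plain Python. This is only an illustration under assumptions: the records are a list of dicts, the run length is stored in a hypothetical `runMinutes` column, and “NA” is read as `None` or NaN, with any record containing one being dropped. The column names and values are made up for the example.

```python
import math

def is_missing(value):
    """Treat None and NaN as the "NA" marker (an assumption)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def is_numeric(value):
    """Ints and floats count as numeric; booleans and strings do not."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def clean_runs(rows, duration_key="runMinutes", min_minutes=8.0):
    """Apply reductions a), b) and c) to a list of dict records."""
    if not rows:
        return []
    # a) keep only columns whose non-missing values are all numeric
    #    (a column with no data at all also passes this check)
    numeric_cols = [c for c in rows[0]
                    if all(is_missing(r[c]) or is_numeric(r[c]) for r in rows)]
    cleaned = []
    for r in rows:
        record = {c: r[c] for c in numeric_cols}
        # b) drop records that still contain a missing value
        if any(is_missing(v) for v in record.values()):
            continue
        # c) drop runs shorter than the minimum duration
        if record[duration_key] < min_minutes:
            continue
        cleaned.append(record)
    return cleaned

# Toy records; "mill", "runMinutes" and "power_avg" are hypothetical names.
sample = [
    {"mill": "CM1", "runMinutes": 120.0, "power_avg": 3.1},
    {"mill": "CM1", "runMinutes": 5.0, "power_avg": 2.9},
    {"mill": "CM1", "runMinutes": 60.0, "power_avg": None},
    {"mill": "CM1", "runMinutes": 45.0, "power_avg": float("nan")},
]
cleaned_sample = clean_runs(sample)
```

Only the first toy record survives: the second is too short and the last two contain missing values.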

After the reductions, we have a dataset with 3739 measurements of 38 variables.
Among other variables, the dataset contains three pairs of coil temperature variables and three pairs of variables for the deviation of those temperatures. If the thermometer data are correlated, these twelve variables can be reduced to two using principal components.

All six variables (coil temperatures) are strongly correlated, and their principal component is strongly correlated with each of them. So we can replace the six variables by the principal component.
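To illustrate the idea, here is a minimal sketch of replacing several correlated columns by their first principal component. It is not the code behind this report: it uses power iteration on the correlation matrix instead of a library routine, assumes the variables are positively correlated and non-constant, and uses invented toy temperatures.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def standardize(column):
    """Centre a column and scale it to unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]

def first_pc_scores(columns, iterations=200):
    """Scores of the first principal component of the standardized columns."""
    z = [standardize(c) for c in columns]
    k = len(z)
    corr = [[pearson(z[i], z[j]) for j in range(k)] for i in range(k)]
    # Power iteration: repeated multiplication by the correlation matrix
    # converges to its dominant eigenvector, i.e. the PC1 loadings.
    w = [1.0] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    n = len(z[0])
    return [sum(w[i] * z[i][t] for i in range(k)) for t in range(n)]

# Toy coil temperatures (invented values, three sensors, six runs).
r1 = [60.0, 62.0, 65.0, 70.0, 68.0, 64.0]
s1 = [61.0, 63.5, 66.0, 71.0, 69.5, 64.5]
t1 = [59.0, 61.0, 64.5, 69.0, 67.0, 63.5]
pc1 = first_pc_scores([r1, s1, t1])
pc1_vs_inputs = [pearson(pc1, c) for c in (r1, s1, t1)]
```

If every value in `pc1_vs_inputs` is close to 1, the single component carries almost the same information as the original columns.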

All six variables (deviation of coil temperature) are strongly correlated, and their principal component is strongly correlated with each of them. So we can replace the six variables by the principal component.
So far, we have reduced the number of variables to 28.

Material type variables represent the production time of each material during the mill run. They naturally correlate with the engine running time variable, but no other correlations were found. These material type variables will be omitted.

Cement mill 1, Variables correlations, (after variables reduction).

Finding correlations in a large dataset is the first step of EDA. Deeper study of individual pairs can bring unexpected results. Let’s examine two pairs: the brushSTD variable and the vibration of the engine rear bearing (correlation -0.49), and dust emission via the stack and engine power (correlation -0.31).
Analysis of the correlations between the brushSTD variable and the vibration of the engine rear bearing, and between dust emission via the stack and engine power.
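Picking out pairs like these from a full correlation matrix can be sketched as follows. The threshold of 0.3 and the toy table are assumptions for illustration only; the pairs above were chosen by inspecting the correlation plots.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranked_pairs(table, threshold=0.3):
    """Pairs of columns with |r| >= threshold, strongest first."""
    pairs = []
    for a, b in combinations(table, 2):
        r = pearson(table[a], table[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

# Toy table with generic column names (not variables of the real dataset).
toy = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [5.0, 4.0, 3.0, 2.0, 1.0],
    "z": [2.0, 1.0, 2.0, 1.0, 2.0],
}
strong = ranked_pairs(toy)
```

Sorting by absolute value keeps strong negative correlations, such as the two pairs discussed above, at the top of the list.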

Cement mill 2 - cross correlation

Reduction of redundant variables

First, let’s remove from the dataset:
a) non-numeric variables;
b) “NA” values;
c) short runs (less than 8 min).

The following variables of CM2 do not contain any data and will be removed:
vibrationFrontBearing_avg, vibrationFrontBearing_std, vibrationRearBearing_avg, vibrationRearBearing_std, tempRearBearing_avg, tempRearBearing_std
After the reductions, we have a dataset with 2831 measurements of 32 variables.
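Finding variables that contain no data for a given mill can be sketched like this. It is a minimal illustration, assuming records stored as dicts and “no data” meaning `None` or NaN in every record; the toy values are invented.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing (an assumption about the data)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def empty_columns(rows):
    """Names of columns whose value is missing in every record."""
    if not rows:
        return []
    return [c for c in rows[0] if all(is_missing(r.get(c)) for r in rows)]

def drop_columns(rows, names):
    """Copy of the records without the given columns."""
    names = set(names)
    return [{k: v for k, v in r.items() if k not in names} for r in rows]

# Toy records: the vibration column is empty, the power column is not.
mill_rows = [
    {"power_avg": 3.1, "vibrationFrontBearing_avg": None},
    {"power_avg": 2.8, "vibrationFrontBearing_avg": float("nan")},
]
to_remove = empty_columns(mill_rows)
reduced = drop_columns(mill_rows, to_remove)
```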
Among other variables, the dataset contains three pairs of coil temperature variables and three pairs of variables for the deviation of those temperatures. If the thermometer data are correlated, these twelve variables can be reduced to two using principal components.

Five of the six variables (deviation of coil temperature) are strongly correlated, and their principal component is strongly correlated with them, so these variables can be replaced by the principal component. But the R2 data look corrupted, and it is better not to take them into account.

Here we can see that the R2 variable has data for the whole period, but sometimes its behaviour differs from that of the other coil temperatures.
Let’s see how the correlations between the coil temperature data from all sensors behave:

Obviously, the R2 variables (temperature and its deviation) must be removed before the principal components are calculated.
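One way to see where a sensor stops following the others is to compute the correlation window by window. This is a hedged sketch, not the method used for the plots: the non-overlapping windows, the window size and the toy readings are all assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation; NaN when one of the inputs is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def windowed_correlation(x, y, window):
    """Correlation of x and y over consecutive non-overlapping windows."""
    return [pearson(x[s:s + window], y[s:s + window])
            for s in range(0, len(x) - window + 1, window)]

# Toy readings: the second sensor follows the first one in the first
# window and drifts the opposite way in the second window.
reference = [60.0, 62.0, 65.0, 70.0, 61.0, 63.0, 66.0, 69.0]
suspect = [61.0, 63.0, 66.0, 71.0, 70.0, 66.0, 64.0, 60.0]
by_window = windowed_correlation(reference, suspect, window=4)
```

A window where the correlation drops sharply or changes sign marks a period in which the sensor data deserve a closer look.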

Cement mill 2, Variables correlations, (after variables reduction).

Cement mill 3 - cross correlation

Reduction of redundant variables

First, let’s remove from the dataset:
a) non-numeric variables;
b) “NA” values;
c) short runs (less than 8 min).

The following variables of CM3 do not contain any data and will be removed: none.

After the reductions, we have a dataset with 950 measurements of 38 variables.
Among other variables, the dataset contains three pairs of coil temperature variables and three pairs of variables for the deviation of those temperatures. If the thermometer data are correlated, these twelve variables can be reduced to two using principal components.
But before building the principal components, let’s check whether all six variables are present over the whole period.

Only variables R1, S1, and T1 cover the whole period; therefore variables R2, S2, and T2 should be removed, and the principal components will be built from the variables that cover the whole period.
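The coverage check described above can be sketched as the share of non-missing values per variable and year. The `year` column, the 0.5 threshold and the toy records are assumptions made for the example; the actual check in this report was done visually on plots.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing (an assumption about the data)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def coverage_by_year(rows, names, year_key="year"):
    """Fraction of non-missing values for each variable in each year."""
    totals = {}
    present = {}
    for row in rows:
        year = row[year_key]
        totals[year] = totals.get(year, 0) + 1
        for name in names:
            if not is_missing(row.get(name)):
                present[(name, year)] = present.get((name, year), 0) + 1
    return {
        name: {year: present.get((name, year), 0) / totals[year]
               for year in sorted(totals)}
        for name in names
    }

def covers_whole_period(coverage, min_fraction=0.5):
    """Variables with at least min_fraction of data in every year."""
    return [name for name, per_year in coverage.items()
            if all(f >= min_fraction for f in per_year.values())]

# Toy records: R2 has no readings in the first year.
runs = [
    {"year": 2016, "tempCoilR1_avg": 61.0, "tempCoilR2_avg": None},
    {"year": 2016, "tempCoilR1_avg": 63.0, "tempCoilR2_avg": None},
    {"year": 2017, "tempCoilR1_avg": 64.0, "tempCoilR2_avg": 62.5},
    {"year": 2017, "tempCoilR1_avg": 60.0, "tempCoilR2_avg": 61.0},
]
coverage = coverage_by_year(runs, ["tempCoilR1_avg", "tempCoilR2_avg"])
kept = covers_whole_period(coverage)
```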

The chosen variables (coil temperatures) are strongly correlated.

Cement mill 3, Variables correlations, (after variables reduction).

Cement mill 4 - cross correlation

Reduction of redundant variables

First, let’s remove from the dataset:
a) non-numeric variables;
b) “NA” values;
c) short runs (less than 8 min).

The following variables of CM4 do not contain any data and will be removed:
tempCoilR2_avg, tempCoilR2_std, tempCoilS2_avg, tempCoilS2_std, tempCoilT2_avg, tempCoilT2_std, vibrationFrontBearing_avg, vibrationFrontBearing_std, vibrationRearBearing_avg, vibrationRearBearing_std
After the reductions, we have a dataset with 2714 measurements of 28 variables.
Among other variables, the dataset contains three pairs of coil temperature variables and three pairs of variables for the deviation of those temperatures. If the thermometer data are correlated, these twelve variables can be reduced to two using principal components.
But before building the principal components, let’s check whether all six variables are present over the whole period.

Only variables R1, S1, and T1 cover the whole period; therefore variables R2, S2, and T2 should be removed, and the principal components will be built from the variables that cover the whole period.

The chosen variables (coil temperatures) are strongly correlated.

In effect, these deviation variables describe how sensitive the engine coils (or the temperature sensors of the coils) are to temperature changes. Let’s look at the maximums of the three variables: 18.54, 183.71, 116.65, 208.22

On the one hand, R1 appears to have less variance than S1 and T1, which makes the R1 variance variable less important than those of S1 and T1. On the other hand, the S1 and T1 variance variables contain a kind of useless noise.
Finally, I decide to keep the tempCoilR1_std variable and to replace only tempCoilS1_std and tempCoilT1_std by a principal component.
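A comparison like the one above can be sketched with a per-variable summary of maximum and variance. The values below are invented for illustration and are not the CM4 data; only the variable names come from the dataset.

```python
from statistics import variance

def summarize(table):
    """Maximum and sample variance of every column in the table."""
    return {name: {"max": max(values), "variance": variance(values)}
            for name, values in table.items()}

# Toy deviation columns (invented values).
deviations = {
    "tempCoilR1_std": [1.0, 2.0, 3.0],
    "tempCoilS1_std": [1.0, 11.0, 21.0],
    "tempCoilT1_std": [2.0, 12.0, 22.0],
}
summary = summarize(deviations)
quietest = min(summary, key=lambda name: summary[name]["variance"])
```

In this toy table `quietest` is `tempCoilR1_std`, which mirrors the situation described above where R1 varies much less than S1 and T1.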

Cement mill 4, Variables correlations, (after variables reduction).

Cement mill 10 - cross correlation

Reduction of redundant variables

First, let’s remove from the dataset:
a) non-numeric variables;
b) “NA” values;
c) short runs (less than 8 min).

The following variables of CM10 do not contain any data and will be removed:
tempFrontBearing_avg, tempFrontBearing_std, tempRearBearing_avg, tempRearBearing_std
After the reductions, we have a dataset with 4240 measurements of 34 variables.
Among other variables, the dataset contains three pairs of coil temperature variables and three pairs of variables for the deviation of those temperatures. If the thermometer data are correlated, these twelve variables can be reduced to two using principal components.
But before building the principal components, let’s check whether all six variables are present over the whole period.

All six variables cover the whole period; therefore they can be replaced by a principal component.

The chosen variables (coil temperatures) are strongly correlated.

Cement mill 10, Variables correlations, (after variables reduction).

Cement mill 11 - cross correlation

Reduction of redundant variables

First, let’s remove from the dataset:
a) non-numeric variables;
b) “NA” values;
c) short runs (less than 8 min).

The following variables of CM11 do not contain any data and will be removed: none.

After the reductions, we have a dataset with 4250 measurements of 38 variables.
Among other variables, the dataset contains three pairs of coil temperature variables and three pairs of variables for the deviation of those temperatures. If the thermometer data are correlated, these twelve variables can be reduced to two using principal components.
But before building the principal components, let’s check whether all six variables are present over the whole period.

All six variables cover the whole period; therefore they can be replaced by a principal component.

The chosen variables (coil temperatures) are strongly correlated.

Cement mill 11, Variables correlations, (after variables reduction).

Cement mill 12 - cross correlation

Reduction of redundant variables

First, let’s remove from the dataset:
a) non-numeric variables;
b) “NA” values;
c) short runs (less than 8 min).

The following variables of CM12 do not contain any data and will be removed: none.

After the reductions, we have a dataset with 1811 measurements of 38 variables.
Among other variables, the dataset contains three pairs of coil temperature variables and three pairs of variables for the deviation of those temperatures. If the thermometer data are correlated, these twelve variables can be reduced to two using principal components.
But before building the principal components, let’s check whether all six variables are present over the whole period.

All six variables cover the whole period; therefore they can be replaced by a principal component.

The chosen variables (coil temperatures) are strongly correlated.

Cement mill 12, Variables correlations, (after variables reduction).

Raw mill A - cross correlation

Reduction of redundant variables

First, let’s remove from the dataset:
a) non-numeric variables;
b) “NA” values;
c) short runs (less than 8 min).

The following variables of RMA do not contain any data and will be removed:
PM10stack_avg
After the reductions, we have a dataset with 1200 measurements of 37 variables.
Among other variables, the dataset contains three pairs of coil temperature variables and three pairs of variables for the deviation of those temperatures. If the thermometer data are correlated, these twelve variables can be reduced to two using principal components.
But before building the principal components, let’s check whether all six variables are present over the whole period.

All six variables cover the whole period; therefore they can be replaced by a principal component.

The chosen variables (coil temperatures) are strongly correlated.

Raw mill A, Variables correlations, (after variables reduction).

Raw mill B - cross correlation

Reduction of redundant variables

First, let’s remove from the dataset:
a) non-numeric variables;
b) “NA” values;
c) short runs (less than 8 min).

The following variables of RMB do not contain any data and will be removed:
tempRearBearing_avg, tempRearBearing_std, PM10stack_avg
After the reductions, we have a dataset with 1802 measurements of 35 variables.
Among other variables, the dataset contains three pairs of coil temperature variables and three pairs of variables for the deviation of those temperatures. If the thermometer data are correlated, these twelve variables can be reduced to two using principal components.
But before building the principal components, let’s check whether all six variables are present over the whole period.

All six variables cover the whole period; therefore they can be replaced by a principal component.

The chosen variables (coil temperatures) are strongly correlated.

Raw mill B, Variables correlations, (after variables reduction).

Raw mill C - cross correlation

Reduction of redundant variables

First, let’s remove from the dataset:
a) non-numeric variables;
b) “NA” values;
c) short runs (less than 8 min).

The following variables of RMC do not contain any data and will be removed:
tempCoilR2_avg, tempCoilR2_std, tempCoilS2_avg, tempCoilS2_std, tempCoilT2_avg, tempCoilT2_std, PM10stack_avg
After the reductions, we have a dataset with 3457 measurements of 31 variables.
Among other variables, the dataset contains three pairs of coil temperature variables and three pairs of variables for the deviation of those temperatures. If the thermometer data are correlated, these twelve variables can be reduced to two using principal components.
But before building the principal components, let’s check whether all six variables are present over the whole period.

Only variables R1, S1, and T1 cover the whole period; therefore variables R2, S2, and T2 should be removed, and the principal components will be built from the variables that cover the whole period.

The chosen variables (coil temperatures) are strongly correlated.


St.Dmitry

September 29, 2018
