Ship Propulsion Systems - Gas Turbine Compressor Decay State Coefficient Prediction

The behavior and interaction of the main components of Ship Propulsion Systems cannot be easily modeled with an a-priori physical knowledge, considering the large amount of variables influencing them.

Instead, data-driven Models (DDMs) in conjuntion with advanced statistical techniques are more useful to build models directly on the large amount of historical data collected by the modern on-board automation systems, without requiring any a-priori knowledge.

DDMs are extremely useful when it comes to continuously monitor the propulsion equipments to avoid Preventive or Corrective Maintenance and take decisions based on the actual condition of the propulsion plant, but unfortunately, this can only be done at the expense of a much larger amount of data to achieve satisfying performances.

Nowadays sensor’s data are cheap and easy to collect, however, label them with the actual state of decay of a component can be quite expensive and in some cases unfeasible or impractical given the huge amount of streaming data. This data stream is usually present at a very high frequency levels in real-time, in batches, or the nowadays less frequent offline mode.

The advent of more sophisticaded algorithms in deep neural networks, the progressive use of sound statistical techniques like Bayesian Neural Networks, Extreme Gradient Boosting, and Ensembles of Models, and the increasing gain in processing speed and high computing power is making the path for obtaining more insights on data at hand. Today is common practice in many industries to provide predictive and prescriptive knowledge for extending life-of-use and dealing with equipment failures in a more efficient manner.

# setwd("D:/OneDrive/Noah/XOM NIA Evaluation/RAnalysis/gasturbinedecay")
setwd("/home/drome/rprojects/gasturbinedecay")

Importing libraries

library(dplyr)
library(stringr)
library(stringi)
library(rattle)
library(ggplot2)
library(rpart)
library(rpart.plot)
library(knitr)
library(caret)
library(AppliedPredictiveModeling)
library(corrplot)

Reading data

ds = read.csv("./data/data_new.txt", header = F, sep = '\t', colClasses = "numeric")
headers = read.csv("./data/features_new.txt", header = T, sep = '\t', colClasses = "character")
headers <- c(names(headers))
headers <- sapply(headers, function(x) str_replace(str_replace_all(x, "[.]{1,}", "_"), "[_]$", ""))

## "Lever_lp", "Speed_knots", "Gas_Turbine_shaft_torque_GTT_kN_m","Gas_Turbine_Speed_GT_rpm_rpm",
## "Controllable_Pitch_Propeller_Thrust_stbd_N", "Controllable_Pitch_Propeller_Thrust_port_N", "Shaft_Torque_port_kN",
## "Shaft_rpm_port_rpm", "Shaft_Torque_stbd_Q_stdb_kN", "Shaft_rpm_stbd_rpm", "HP_Turbine_exit_temperature_T48_C",
## "Generator_of_Gas_speed_GG_rpm_rpm", "Fuel_flow_mf_kg_s", "ABB_Tic_control_signal",
## "GT_Compressor_outlet_air_pressure_P2_bar", "GT_Compressor_outlet_air_temperature_T2_C",
## "External_Pressure_Pext_bar", "HP_Turbine_exit_pressure_P48_bar", "TCS_tic_control_signal",
## "Thrust_coefficient_stbd", "Propeller_rps_stbd_rps", "Thrust_coefficient_port", "Propeller_rps_port_rps",
## "Propeller_Torque_port_Nm", "Propeller_Torque_stbd_Nm", "Propeller_Thrust_decay_state_coefficient_Kkt",
## "Propeller_Torque_decay_state_coefficient_Kkq", "Hull_decay_state_coefficient_Khull",
## "GT_Compressor_decay_state_coefficient_KMcompr" "GT_Turbine_decay_state_coefficient_KMturb"

bag_of_headers <- function(x) {
  full_bag <- c()
  for (i in seq_along(x)) full_bag <- c(full_bag, x[[i]])
  return(full_bag)
}

names(ds) <- bag_of_headers(headers)
names_short <- c("lever", "speed", "gt_shaft_tq", "gt_speed", "cpp_th", "cpp_tn", "shaft_tq_pt", "shaft_rpm_pt", "shaft_tq_Q", "shaft_rpm_stbd", "hp_turb_ex_T", "gg_speed", "ff_mf", "abb_Tic", "gt_cmpr_outP", "gt_cmpr_outT", "pext_bar", "hp_turb_outP", "tcs_signal", "th_coef_st", "prop_rps", "th_coef_pt", "prop_rps_pt", "prop_tq_pt", "prop_tq_st", "prop_th_dcy", "prop_tq_dcy", "hull_dcy", "gt_cmpr_dcy", "gt_turb_dcy")
names(ds) <- names_short

# Dropping dependent variable for calculating Multicollinearity on predictors
dep_var <- "gt_cmpr_dcy"
dssub <- subset(ds, select = -which(colnames(ds) %in% dep_var))

Saving new reformatted dataset to disk

# write.csv(x = ds, file = "./data/decay_state.csv", row.names = FALSE)

In using Linear Regression for solving a problem like this we need to meet certain conditions

Assumptions of Linear Regression Analysis

Linear Relationship: Linear regression needs a linear relationship between the dependent and independent variables.
Normality of Residual: Linear regression requires residuals should be normally distributed.
Homoscedasticity: Linear regression assumes that residuals are approximately equal for all predicted dependent variable values. In other words, it means constant variance of errors.
No Outlier Problem
Multicollinearity: It means there is a high correlation between independent variables. The linear regression model MUST NOT be faced with problem of multicollinearity.
Independence of error terms - No Autocorrelation: It states that the errors associated with one observation are not correlated with the errors of any other observation. It is a problem when you use time series data.

Analyzing and Filtering data with possibly Near Zero Variance

# Selecting columns with near zero variance
nzv_ds <- nearZeroVar(dssub)

ds_nzv <- dssub[, -nzv_ds]

Identifying and Reviewing predictors which are highly or perfectly correlated

# Calculating Correlation: is used to investigate the dependence between multiple variables at the same time
ds_Cor <- cor(ds_nzv)

# Print correlation matrix and look at max correlation
print(head(ds_Cor))

##                 lever     speed gt_shaft_tq  gt_speed    cpp_th    cpp_tn
## lever       1.0000000 0.9999155   0.9616845 0.9620651 0.9624643 0.9624643
## speed       0.9999155 1.0000000   0.9590689 0.9604286 0.9598244 0.9598244
## gt_shaft_tq 0.9616845 0.9590689   1.0000000 0.9882649 0.9948403 0.9948403
## gt_speed    0.9620651 0.9604286   0.9882649 1.0000000 0.9890975 0.9890975
## cpp_th      0.9624643 0.9598244   0.9948403 0.9890975 1.0000000 1.0000000
## cpp_tn      0.9624643 0.9598244   0.9948403 0.9890975 1.0000000 1.0000000
##             shaft_tq_pt shaft_rpm_pt shaft_tq_Q shaft_rpm_stbd  gg_speed
## lever         0.9625737    0.9620651  0.9625737      0.9620651 0.9853013
## speed         0.9599684    0.9604286  0.9599684      0.9604286 0.9858435
## gt_shaft_tq   0.9994057    0.9882650  0.9994057      0.9882650 0.9326338
## gt_speed      0.9887981    1.0000000  0.9887981      1.0000000 0.9373562
## cpp_th        0.9954398    0.9890976  0.9954398      0.9890976 0.9302012
## cpp_tn        0.9954398    0.9890976  0.9954398      0.9890976 0.9302012
##                 ff_mf   abb_Tic gt_cmpr_outP gt_cmpr_outT  pext_bar
## lever       0.9327579 0.9381444    0.9719123    0.9840957 0.9594589
## speed       0.9292910 0.9349200    0.9700393    0.9828304 0.9572781
## gt_shaft_tq 0.9945626 0.9940582    0.9965440    0.9886533 0.9951709
## gt_speed    0.9801702 0.9820010    0.9954188    0.9878662 0.9951293
## cpp_th      0.9904470 0.9913237    0.9932228    0.9847930 0.9927873
## cpp_tn      0.9904470 0.9913237    0.9932228    0.9847930 0.9927873
##             hp_turb_outP tcs_signal th_coef_st  prop_rps th_coef_pt
## lever          0.9662669  0.9211022  0.8572441 0.9620656  0.8572441
## speed          0.9640606  0.9176873  0.8592847 0.9604291  0.8592847
## gt_shaft_tq    0.9982652  0.9807074  0.7158873 0.9882651  0.7158873
## gt_speed       0.9951099  0.9653550  0.6928290 1.0000000  0.6928290
## cpp_th         0.9953786  0.9770904  0.7207593 0.9890978  0.7207593
## cpp_tn         0.9953786  0.9770904  0.7207593 0.9890978  0.7207593
##             prop_rps_pt prop_tq_pt prop_tq_st   prop_th_dcy   prop_tq_dcy
## lever         0.9620656  0.9600082  0.9600082 -0.0002253097  0.0002252369
## speed         0.9604291  0.9573558  0.9573558 -0.0002265453  0.0002264731
## gt_shaft_tq   0.9882651  0.9993507  0.9993507 -0.0490638293  0.0490634239
## gt_speed      1.0000000  0.9893827  0.9893827  0.0014227554 -0.0014229668
## cpp_th        0.9890978  0.9954279  0.9954279  0.0217102339 -0.0217104802
## cpp_tn        0.9890978  0.9954279  0.9954279  0.0217102339 -0.0217104802
##                  hull_dcy   gt_turb_dcy
## lever       -9.850476e-05 -1.195342e-04
## speed       -9.928168e-05 -1.200333e-04
## gt_shaft_tq  2.487490e-02  3.167911e-04
## gt_speed    -1.415625e-03  2.421862e-05
## cpp_th       3.375151e-02 -3.366170e-05
## cpp_tn       3.375151e-02 -3.366170e-05

# color maps
col1 <- colorRampPalette(c("#7F0000","red","#FF7F00","yellow","white", 
        "cyan", "#007FFF", "blue","#00007F"))
col2 <- colorRampPalette(c("#67001F", "#B2182B", "#D6604D", "#F4A582", "#FDDBC7",
            "#FFFFFF", "#D1E5F0", "#92C5DE", "#4393C3", "#2166AC", "#053061"))  
col3 <- colorRampPalette(c("red", "white", "blue")) 
col4 <- colorRampPalette(c("#7F0000","red","#FF7F00","yellow","#7FFF7F", 
            "cyan", "#007FFF", "blue","#00007F"))

# Combine with significance test
cor.mtest <- function(mat, conf.level = 0.95){
  mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
    diag(p.mat) <- 0
    diag(lowCI.mat) <- diag(uppCI.mat) <- 1
    for(i in 1:(n-1)){
        for(j in (i+1):n){
            tmp <- cor.test(mat[,i], mat[,j], conf.level = conf.level)
            p.mat[i,j] <- p.mat[j,i] <- tmp$p.value
            lowCI.mat[i,j] <- lowCI.mat[j,i] <- tmp$conf.int[1]
            uppCI.mat[i,j] <- uppCI.mat[j,i] <- tmp$conf.int[2]
        }
    }
    return(list(p.mat, lowCI.mat, uppCI.mat))
}

res1 <- cor.mtest(ds_Cor, 0.95)
res2 <- cor.mtest(ds_Cor, 0.99)

corrplot(ds_Cor, p.mat = res1[[1]], order="hclust", addrect=3, 
         insig = "pch", pch.cex = 1, pch.col = "red", 
         col=col4(16), cl.pos="b", cl.cex = 1, 
         tl.srt=60, tl.col = "blue")

Algorithm to flag predictors for removal

print(paste0("There are ", sum(abs(ds_Cor[upper.tri(ds_Cor)]) > .999), " predictors almost perfectly correlated."))

## [1] "There are 25 predictors almost perfectly correlated."

summary(ds_Cor[upper.tri(ds_Cor)])

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.00000  0.03973  0.96207  0.68437  0.98938  1.00000

# The code chunk below shows the effect of removing descriptors with absolute correlations above 0.70
dsHighlyCor <- findCorrelation(ds_Cor, cutoff = .70)
dsHighlyCorCol <- colnames(ds_Cor)[dsHighlyCor]

# Print highly correlated attributes
print(dsHighlyCorCol)

##  [1] "gt_cmpr_outP"   "gt_cmpr_outT"   "shaft_tq_pt"    "shaft_tq_Q"    
##  [5] "gt_shaft_tq"    "hp_turb_outP"   "prop_tq_pt"     "prop_tq_st"    
##  [9] "cpp_th"         "cpp_tn"         "pext_bar"       "prop_rps"      
## [13] "prop_rps_pt"    "gt_speed"       "shaft_rpm_pt"   "shaft_rpm_stbd"
## [17] "abb_Tic"        "ff_mf"          "lever"          "speed"         
## [21] "tcs_signal"     "gg_speed"       "th_coef_st"     "prop_th_dcy"

#Remove highly correlated variables and create a new dataset
ds_nhc = ds_nzv[, -which(colnames(ds_nzv) %in% dsHighlyCorCol)]
dim(ds_nhc)

## [1] 589223      4

ds_nhc_Cor <- cor(ds_nhc)
summary(ds_nhc[upper.tri(ds_nhc_Cor)])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01233 0.97500 1.00000 0.81907 1.06430 1.20000

# The following list is statistically significant:
bag_of_headers(headers)[which(names_short %in% colnames(ds_nhc))]

## [1] "Thrust_coefficient_port"                     
## [2] "Propeller_Torque_decay_state_coefficient_Kkq"
## [3] "Hull_decay_state_coefficient_Khull"          
## [4] "GT_Turbine_decay_state_coefficient_KMturb"

# For the dependent variable:
bag_of_headers(headers)[which(colnames(ds) %in% dep_var)]

## [1] "GT_Compressor_decay_state_coefficient_KMcompr"

Using the function findLinearCombos

# The function findLinearCombos uses the QR decomposition of a matrix to enumerate sets of linear combinations (if they exist)
comboInfo <- findLinearCombos(ds_nhc)

print(paste0("There are ", length(comboInfo$remove), " columns: [", comboInfo$remove, "] with linear combination"))

## [1] "There are 0 columns: [] with linear combination"

if (length(comboInfo$remove) != 0) ds_nhc[, -comboInfo$remove]

Transforming the dataset to a tibble

# as_tibble is a new S3 generic with more efficient methods for matrices and data frames.
ds_nhc <- as_tibble(ds_nhc)
head(ds_nhc) # first rows of the new compressor decay state non-highly correlated dataset

## # A tibble: 6 x 4
##   th_coef_pt prop_tq_dcy hull_dcy gt_turb_dcy
##        <dbl>       <dbl>    <dbl>       <dbl>
## 1   0.016701         1.1        1           1
## 2   0.054159         1.1        1           1
## 3   0.110630         1.1        1           1
## 4   0.167790         1.1        1           1
## 5   0.173170         1.1        1           1
## 6   0.173900         1.1        1           1

Now we know that only four columns out of 29 have non-near zero variance and are linearly independent between them. But after doing all that homework by hand and step by step I decided to try the function preProcess from the excellent package AppliedPredictiveModeling from Max Khun. In doing so I found myself with a sense of having done the porevious work fairly correct. The function preProcess takes care of many things at one shot. After applyting it to the original dataset I found that these results are telling me that those 6 variables are responsible for 95% of the variance. Then, those six variables [“gg_speed”, “tcs_signal”, “th_coef_pt”, “prop_tq_dcy”, “hull_dcy”, “gt_turb_dcy”] are the ones we will use for the rest of the analysis. We are going to step into Normalizing and establising the Neural Net for Regression Analysis using Tensorflow.

# Using preProcess without the target variable "Gas Turbine Compressor state Decay"
dsPCA <- preProcess(ds[,-29], method = c("corr", "nzv", "pca"))
dsPCA

## Created from 589223 samples and 29 variables
## 
## Pre-processing:
##   - centered (6)
##   - ignored (0)
##   - principal component signal extraction (6)
##   - removed (23)
##   - scaled (6)
## 
## PCA needed 5 components to capture 95 percent of the variance

names(dsPCA$mean)

## [1] "gg_speed"    "tcs_signal"  "th_coef_pt"  "prop_tq_dcy" "hull_dcy"   
## [6] "gt_turb_dcy"

mu <- dsPCA$mean
sd <- dsPCA$std
mu

##     gg_speed   tcs_signal   th_coef_pt  prop_tq_dcy     hull_dcy 
## 8263.9537807   35.6143099    0.1473745    1.0468836    1.0937488 
##  gt_turb_dcy 
##    0.9882810

sd

##     gg_speed   tcs_signal   th_coef_pt  prop_tq_dcy     hull_dcy 
## 1.103943e+03 2.782904e+01 6.331218e-02 3.224831e-02 6.446998e-02 
##  gt_turb_dcy 
## 8.060592e-03

Building Train, and Test sets

set.seed(123)
inTrainPercent <- 0.9
inTrain <- sample(1:length(ds[, which(colnames(ds) %in% dep_var)]))[1:floor(length(ds[, which(colnames(ds) %in% dep_var)])*inTrainPercent)]
Xtrain <- ds[inTrain, names(dsPCA$mean)]
Xtest <- ds[-inTrain, names(dsPCA$mean)]
ytrain <- ds[inTrain, which(colnames(ds) %in% dep_var)]
ytest <- ds[-inTrain, which(colnames(ds) %in% dep_var)]


XtrainTransformed <- predict(dsPCA, Xtrain)
XtestTransformed <- predict(dsPCA, Xtest)