The behavior and interaction of the main components of Ship Propulsion Systems cannot be easily modeled with an a-priori physical knowledge, considering the large amount of variables influencing them.
Instead, data-driven Models (DDMs) in conjuntion with advanced statistical techniques are more useful to build models directly on the large amount of historical data collected by the modern on-board automation systems, without requiring any a-priori knowledge.
DDMs are extremely useful when it comes to continuously monitor the propulsion equipments to avoid Preventive or Corrective Maintenance and take decisions based on the actual condition of the propulsion plant, but unfortunately, this can only be done at the expense of a much larger amount of data to achieve satisfying performances.
Nowadays sensor’s data are cheap and easy to collect, however, label them with the actual state of decay of a component can be quite expensive and in some cases unfeasible or impractical given the huge amount of streaming data. This data stream is usually present at a very high frequency levels in real-time, in batches, or the nowadays less frequent offline mode.
The advent of more sophisticaded algorithms in deep neural networks, the progressive use of sound statistical techniques like Bayesian Neural Networks, Extreme Gradient Boosting, and Ensembles of Models, and the increasing gain in processing speed and high computing power is making the path for obtaining more insights on data at hand. Today is common practice in many industries to provide predictive and prescriptive knowledge for extending life-of-use and dealing with equipment failures in a more efficient manner.
# setwd("D:/OneDrive/Noah/XOM NIA Evaluation/RAnalysis/gasturbinedecay")
setwd("/home/drome/rprojects/gasturbinedecay")
library(dplyr)
library(stringr)
library(stringi)
library(rattle)
library(ggplot2)
library(rpart)
library(rpart.plot)
library(knitr)
library(caret)
library(AppliedPredictiveModeling)
library(corrplot)
ds = read.csv("./data/data_new.txt", header = F, sep = '\t', colClasses = "numeric")
headers = read.csv("./data/features_new.txt", header = T, sep = '\t', colClasses = "character")
headers <- c(names(headers))
headers <- sapply(headers, function(x) str_replace(str_replace_all(x, "[.]{1,}", "_"), "[_]$", ""))
## "Lever_lp", "Speed_knots", "Gas_Turbine_shaft_torque_GTT_kN_m","Gas_Turbine_Speed_GT_rpm_rpm",
## "Controllable_Pitch_Propeller_Thrust_stbd_N", "Controllable_Pitch_Propeller_Thrust_port_N", "Shaft_Torque_port_kN",
## "Shaft_rpm_port_rpm", "Shaft_Torque_stbd_Q_stdb_kN", "Shaft_rpm_stbd_rpm", "HP_Turbine_exit_temperature_T48_C",
## "Generator_of_Gas_speed_GG_rpm_rpm", "Fuel_flow_mf_kg_s", "ABB_Tic_control_signal",
## "GT_Compressor_outlet_air_pressure_P2_bar", "GT_Compressor_outlet_air_temperature_T2_C",
## "External_Pressure_Pext_bar", "HP_Turbine_exit_pressure_P48_bar", "TCS_tic_control_signal",
## "Thrust_coefficient_stbd", "Propeller_rps_stbd_rps", "Thrust_coefficient_port", "Propeller_rps_port_rps",
## "Propeller_Torque_port_Nm", "Propeller_Torque_stbd_Nm", "Propeller_Thrust_decay_state_coefficient_Kkt",
## "Propeller_Torque_decay_state_coefficient_Kkq", "Hull_decay_state_coefficient_Khull",
## "GT_Compressor_decay_state_coefficient_KMcompr" "GT_Turbine_decay_state_coefficient_KMturb"
bag_of_headers <- function(x) {
full_bag <- c()
for (i in seq_along(x)) full_bag <- c(full_bag, x[[i]])
return(full_bag)
}
names(ds) <- bag_of_headers(headers)
names_short <- c("lever", "speed", "gt_shaft_tq", "gt_speed", "cpp_th", "cpp_tn", "shaft_tq_pt", "shaft_rpm_pt", "shaft_tq_Q", "shaft_rpm_stbd", "hp_turb_ex_T", "gg_speed", "ff_mf", "abb_Tic", "gt_cmpr_outP", "gt_cmpr_outT", "pext_bar", "hp_turb_outP", "tcs_signal", "th_coef_st", "prop_rps", "th_coef_pt", "prop_rps_pt", "prop_tq_pt", "prop_tq_st", "prop_th_dcy", "prop_tq_dcy", "hull_dcy", "gt_cmpr_dcy", "gt_turb_dcy")
names(ds) <- names_short
# Dropping dependent variable for calculating Multicollinearity on predictors
dep_var <- "gt_cmpr_dcy"
dssub <- subset(ds, select = -which(colnames(ds) %in% dep_var))
# write.csv(x = ds, file = "./data/decay_state.csv", row.names = FALSE)
In using Linear Regression for solving a problem like this we need to meet certain conditions
Assumptions of Linear Regression Analysis
Linear Relationship: Linear regression needs a linear relationship between the dependent and independent variables.
Normality of Residual: Linear regression requires residuals should be normally distributed.
Homoscedasticity: Linear regression assumes that residuals are approximately equal for all predicted dependent variable values. In other words, it means constant variance of errors.
No Outlier Problem
Multicollinearity: It means there is a high correlation between independent variables. The linear regression model MUST NOT be faced with problem of multicollinearity.
Independence of error terms - No Autocorrelation: It states that the errors associated with one observation are not correlated with the errors of any other observation. It is a problem when you use time series data.
# Selecting columns with near zero variance
nzv_ds <- nearZeroVar(dssub)
ds_nzv <- dssub[, -nzv_ds]
print(paste0("There are ", sum(abs(ds_Cor[upper.tri(ds_Cor)]) > .999), " predictors almost perfectly correlated."))
## [1] "There are 25 predictors almost perfectly correlated."
summary(ds_Cor[upper.tri(ds_Cor)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.00000 0.03973 0.96207 0.68437 0.98938 1.00000
# The code chunk below shows the effect of removing descriptors with absolute correlations above 0.70
dsHighlyCor <- findCorrelation(ds_Cor, cutoff = .70)
dsHighlyCorCol <- colnames(ds_Cor)[dsHighlyCor]
# Print highly correlated attributes
print(dsHighlyCorCol)
## [1] "gt_cmpr_outP" "gt_cmpr_outT" "shaft_tq_pt" "shaft_tq_Q"
## [5] "gt_shaft_tq" "hp_turb_outP" "prop_tq_pt" "prop_tq_st"
## [9] "cpp_th" "cpp_tn" "pext_bar" "prop_rps"
## [13] "prop_rps_pt" "gt_speed" "shaft_rpm_pt" "shaft_rpm_stbd"
## [17] "abb_Tic" "ff_mf" "lever" "speed"
## [21] "tcs_signal" "gg_speed" "th_coef_st" "prop_th_dcy"
#Remove highly correlated variables and create a new dataset
ds_nhc = ds_nzv[, -which(colnames(ds_nzv) %in% dsHighlyCorCol)]
dim(ds_nhc)
## [1] 589223 4
ds_nhc_Cor <- cor(ds_nhc)
summary(ds_nhc[upper.tri(ds_nhc_Cor)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01233 0.97500 1.00000 0.81907 1.06430 1.20000
# The following list is statistically significant:
bag_of_headers(headers)[which(names_short %in% colnames(ds_nhc))]
## [1] "Thrust_coefficient_port"
## [2] "Propeller_Torque_decay_state_coefficient_Kkq"
## [3] "Hull_decay_state_coefficient_Khull"
## [4] "GT_Turbine_decay_state_coefficient_KMturb"
# For the dependent variable:
bag_of_headers(headers)[which(colnames(ds) %in% dep_var)]
## [1] "GT_Compressor_decay_state_coefficient_KMcompr"
# The function findLinearCombos uses the QR decomposition of a matrix to enumerate sets of linear combinations (if they exist)
comboInfo <- findLinearCombos(ds_nhc)
print(paste0("There are ", length(comboInfo$remove), " columns: [", comboInfo$remove, "] with linear combination"))
## [1] "There are 0 columns: [] with linear combination"
if (length(comboInfo$remove) != 0) ds_nhc[, -comboInfo$remove]
# as_tibble is a new S3 generic with more efficient methods for matrices and data frames.
ds_nhc <- as_tibble(ds_nhc)
head(ds_nhc) # first rows of the new compressor decay state non-highly correlated dataset
## # A tibble: 6 x 4
## th_coef_pt prop_tq_dcy hull_dcy gt_turb_dcy
## <dbl> <dbl> <dbl> <dbl>
## 1 0.016701 1.1 1 1
## 2 0.054159 1.1 1 1
## 3 0.110630 1.1 1 1
## 4 0.167790 1.1 1 1
## 5 0.173170 1.1 1 1
## 6 0.173900 1.1 1 1
Now we know that only four columns out of 29 have non-near zero variance and are linearly independent between them. But after doing all that homework by hand and step by step I decided to try the function preProcess from the excellent package AppliedPredictiveModeling from Max Khun. In doing so I found myself with a sense of having done the porevious work fairly correct. The function preProcess takes care of many things at one shot. After applyting it to the original dataset I found that these results are telling me that those 6 variables are responsible for 95% of the variance. Then, those six variables [“gg_speed”, “tcs_signal”, “th_coef_pt”, “prop_tq_dcy”, “hull_dcy”, “gt_turb_dcy”] are the ones we will use for the rest of the analysis. We are going to step into Normalizing and establising the Neural Net for Regression Analysis using Tensorflow.
# Using preProcess without the target variable "Gas Turbine Compressor state Decay"
dsPCA <- preProcess(ds[,-29], method = c("corr", "nzv", "pca"))
dsPCA
## Created from 589223 samples and 29 variables
##
## Pre-processing:
## - centered (6)
## - ignored (0)
## - principal component signal extraction (6)
## - removed (23)
## - scaled (6)
##
## PCA needed 5 components to capture 95 percent of the variance
names(dsPCA$mean)
## [1] "gg_speed" "tcs_signal" "th_coef_pt" "prop_tq_dcy" "hull_dcy"
## [6] "gt_turb_dcy"
mu <- dsPCA$mean
sd <- dsPCA$std
mu
## gg_speed tcs_signal th_coef_pt prop_tq_dcy hull_dcy
## 8263.9537807 35.6143099 0.1473745 1.0468836 1.0937488
## gt_turb_dcy
## 0.9882810
sd
## gg_speed tcs_signal th_coef_pt prop_tq_dcy hull_dcy
## 1.103943e+03 2.782904e+01 6.331218e-02 3.224831e-02 6.446998e-02
## gt_turb_dcy
## 8.060592e-03
set.seed(123)
inTrainPercent <- 0.9
inTrain <- sample(1:length(ds[, which(colnames(ds) %in% dep_var)]))[1:floor(length(ds[, which(colnames(ds) %in% dep_var)])*inTrainPercent)]
Xtrain <- ds[inTrain, names(dsPCA$mean)]
Xtest <- ds[-inTrain, names(dsPCA$mean)]
ytrain <- ds[inTrain, which(colnames(ds) %in% dep_var)]
ytest <- ds[-inTrain, which(colnames(ds) %in% dep_var)]
XtrainTransformed <- predict(dsPCA, Xtrain)
XtestTransformed <- predict(dsPCA, Xtest)