knitr::include_graphics("energies-15-02061-g001-550.png")

1 Introduction

A modern oil refinery includes a unit called Fluid Catalytic Cracking (FCC), which plays a very important role. The unit converts heavy oil fractions into gasoline by cracking. In addition to gasoline, the unit also produces coke, which deposits on the catalyst. This coke must be burned off in the regenerator to restore the catalyst's activity so that the catalyst can be recirculated. In this case, an engineer wants to optimize the regenerator so that the coke yield always stays within the expected range. The goal is to find the optimum operating conditions, described by four parameters: Total Dry Air to Regenerator, HBW Flow, Regenerator Temperature, and Catalyst per Oil ratio.

library(dplyr)          # data wrangling
library(caret)          # preProcess, postResample
library(caTools)        # sample.split
library(neuralnet)      # neural network model
library(NeuralNetTools) # plotnet
require(devtools)       # source_gist
source_gist('6206737')  # gar.fun
library(lmtest)         # tests for linear models
library(olsrr)          # outlier and leverage diagnostics

2 Data Preparation

First, we import the actual historical data from the RCC unit.

fcc <- read.csv("FCC_Refinery_Data.csv")
head(fcc)
##         Date Total_Dry_Air_to_Regenerator HBW_Flow Regenerator_Temperature
## 1 2009-04-03                     261148.4   101.42                  617.68
## 2 2009-04-04                     438448.0   171.98                  713.93
## 3 2009-04-05                     448456.5   176.89                  717.09
## 4 2009-04-06                     460325.3   183.66                  717.95
## 5 2009-04-07                     482533.4   194.75                  724.77
## 6 2009-04-08                     484954.4   192.76                  723.33
##   Cat_per_Oil_ratio Percent_Coke_yield
## 1              6.53               9.00
## 2              7.76              10.96
## 3              7.85              11.05
## 4              7.63              10.98
## 5              6.93              10.81
## 6              6.87              10.77
colnames(fcc)
## [1] "Date"                         "Total_Dry_Air_to_Regenerator"
## [3] "HBW_Flow"                     "Regenerator_Temperature"     
## [5] "Cat_per_Oil_ratio"            "Percent_Coke_yield"

The data set contains the following variables. The first is the time stamp, the next four are the operating parameters expected to affect coke yield, and the last is the target:

  1. Date : The date when the data was taken.
  2. Total_Dry_Air_to_Regenerator : Amount of dry air flow to the regenerator (kg/h)
  3. HBW_Flow : Amount of Hot Boiler Water to the regenerator (Ton/h)
  4. Regenerator_Temperature : Regenerator temperature (degC)
  5. Cat_per_Oil_ratio : Ratio of catalyst to oil in the FCC unit (wt/wt)
  6. Percent_Coke_yield : Percent of coke present in the catalyst (%)

3 Data Wrangling

To avoid bias in the subsequent data processing and modelling, the data must be free of NA and negative values.

fcc %>% 
  is.na() %>% 
  colSums()
##                         Date Total_Dry_Air_to_Regenerator 
##                            0                           28 
##                     HBW_Flow      Regenerator_Temperature 
##                           26                           34 
##            Cat_per_Oil_ratio           Percent_Coke_yield 
##                          118                           56

There are 28 empty cells in Total_Dry_Air_to_Regenerator, 26 in HBW_Flow, 34 in Regenerator_Temperature, 118 in Cat_per_Oil_ratio, and 56 in Percent_Coke_yield. Next, negative values are converted into NA.

fcc_clean <- fcc              # Duplicate data frame
fcc_clean[fcc_clean < 0] <- NA       # Replace negative values by NA
fcc_clean %>% 
  is.na() %>% 
  colSums()
##                         Date Total_Dry_Air_to_Regenerator 
##                           26                           30 
##                     HBW_Flow      Regenerator_Temperature 
##                           26                           34 
##            Cat_per_Oil_ratio           Percent_Coke_yield 
##                          118                           58

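After converting the negative values, the number of NA values increased in some variables (Total_Dry_Air_to_Regenerator from 28 to 30 and Percent_Coke_yield from 56 to 58). The Date column now also reports 26 NA values, but it is dropped in the next step. Then na.omit is used to remove all rows that contain an NA value.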

fcc_clean <- fcc_clean %>% 
  select(-Date) %>% 
  na.omit()

Finally, we confirm that all NA values have been removed and the data is ready for further processing.

fcc_clean %>% 
  is.na() %>% 
  colSums()
## Total_Dry_Air_to_Regenerator                     HBW_Flow 
##                            0                            0 
##      Regenerator_Temperature            Cat_per_Oil_ratio 
##                            0                            0 
##           Percent_Coke_yield 
##                            0
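The cleaning step compares the whole data frame with 0, so the Date column is included in the comparison. A minimal base-R sketch of the same idea restricted to numeric columns is given here. The data frame `toy` and its values are made up for illustration and are not taken from FCC_Refinery_Data.csv.

```r
# Toy data standing in for the refinery file (values are made up)
toy <- data.frame(
  Date = c("2009-04-03", "2009-04-04", "2009-04-05", "2009-04-06", "2009-04-07"),
  HBW_Flow = c(101.42, -5.00, 176.89, NA, 194.75),
  Percent_Coke_yield = c(9.00, 10.96, -1.00, 10.98, 10.81)
)

# Restrict the negative-value check to numeric columns
num_cols <- names(toy)[sapply(toy, is.numeric)]

toy_clean <- toy
for (col in num_cols) {
  # TRUE only for values that are present and below zero
  negative <- !is.na(toy_clean[[col]]) & toy_clean[[col]] < 0
  toy_clean[[col]][negative] <- NA
}

# NA count per column after the conversion
na_counts <- colSums(is.na(toy_clean))

# Keep the numeric columns only and drop every row that still has an NA
toy_model <- na.omit(toy_clean[, num_cols, drop = FALSE])
```

With this variant the Date column is never touched by the numeric comparison, which keeps the NA counts easier to interpret.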

4 Cross Validation

The data is then split into two parts: training data and test data. The training set is 90% of the data and the test set is the remaining 10%. Before splitting, every variable is rescaled to the range 0 to 1 with preProcess (method "range").

process <- preProcess(as.data.frame(fcc_clean), method = c("range"))
fcc_norm <- predict(process, as.data.frame(fcc_clean))
sample <- sample.split(fcc_norm$Percent_Coke_yield, SplitRatio = 0.9)
train <- subset(fcc_norm, sample == TRUE)
test <- subset(fcc_norm, sample == FALSE)
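To make the two steps above concrete, here is a minimal base-R sketch of min-max scaling followed by a 90/10 split. It is a stand-in, not the code used in this report: `min_max` is a hypothetical helper, plain `sample()` replaces `sample.split` (which also tries to preserve the distribution of the target), and the toy values are made up. The seed is set before sampling so that the split itself is reproducible.

```r
set.seed(123)  # set before sampling so the same rows are drawn every run

# Toy data (made-up values)
toy <- data.frame(
  Cat_per_Oil_ratio = c(6.53, 7.76, 7.85, 7.63, 6.93, 6.87, 7.10, 7.40, 6.60, 7.00),
  Percent_Coke_yield = c(9.00, 10.96, 11.05, 10.98, 10.81, 10.77, 10.85, 10.90, 9.50, 10.80)
)

# Min-max scaling: (x - min) / (max - min), applied column by column
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
toy_norm <- as.data.frame(lapply(toy, min_max))

# 90/10 split by drawing row indices at random
n_train <- round(0.9 * nrow(toy_norm))
train_idx <- sample(seq_len(nrow(toy_norm)), size = n_train)
toy_train <- toy_norm[train_idx, ]
toy_test <- toy_norm[-train_idx, ]
```

In the report's code, set.seed(123) is called only before neuralnet, so the rows that land in the test set can differ from one run to the next.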

5 Neural Network

The model used is a neural network, because the number of hidden layers and the number of nodes per layer can be adjusted, which gives flexibility to increase model accuracy. Here, four hidden layers are used: the first layer consists of 10 nodes, the second of 8 nodes, the third of 4 nodes, and the fourth of 1 node.

set.seed(123)
model_nn <- neuralnet(Percent_Coke_yield ~ ., train, hidden = c(10, 8, 4, 1))
plotnet(model_nn)

# Predict on the held-out test set; column 5 is Percent_Coke_yield
result_train <- compute(model_nn, test[, -5])
postResample(result_train$net.result, test[, 5])
##       RMSE   Rsquared        MAE 
## 0.02566209 0.81625296 0.01754204

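In the run shown above, the NN model reaches an R-squared of about 0.82 on the test set. The train/test split is random and is drawn before set.seed() is called, so this figure changes between runs. The value originally reported for this model is 92%, against a maximum of 88% for the LM model.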
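For reference, the three figures printed by postResample can be reproduced by hand. The sketch below uses made-up vectors `obs` and `pred`. It assumes, as the caret documentation describes for numeric outcomes, that Rsquared is the squared correlation between predictions and observations.

```r
# Made-up observed and predicted values on the 0-1 scale
obs  <- c(0.10, 0.40, 0.35, 0.80, 0.65)
pred <- c(0.12, 0.38, 0.30, 0.75, 0.70)

rmse <- sqrt(mean((pred - obs)^2))  # root mean squared error
rsq  <- cor(pred, obs)^2            # squared Pearson correlation
mae  <- mean(abs(pred - obs))       # mean absolute error

metrics <- c(RMSE = rmse, Rsquared = rsq, MAE = mae)
print(metrics)
```

Because the inputs were min-max scaled, RMSE and MAE are expressed in scaled units, not in percent coke yield.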

6 Model improvement

6.1 Outliers and leverage removal

Removing outliers and high-leverage points is one way to improve model performance. Here, they are identified with the ols_plot_resid_lev function, applied to a linear model fitted on the cleaned data, and then removed.

model_linear <- lm(Percent_Coke_yield~., data=fcc_clean)
outlier <- ols_plot_resid_lev(model = model_linear)

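# txt is expected to hold the row number of each flagged observation
# and NA for the remaining points
eliminate <- outlier$data$txt
eliminate_col <- complete.cases(eliminate)   # TRUE where a row number is present
eliminate_col <- eliminate[eliminate_col]    # row numbers to drop
fcc_clean2 <- fcc_clean[-eliminate_col, ]    # remove the flagged rows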
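The idea behind this diagnostic can be sketched with base R alone, using hatvalues() for leverage and rstudent() for studentized residuals. The cutoffs below (leverage above 2p/n, absolute studentized residual above 2) are common rules of thumb and are an assumption here; they are not necessarily the thresholds that olsrr applies. The toy data is simulated.

```r
set.seed(1)

# Simulated data: one high-leverage point (x = 60) and one large residual (row 5)
toy <- data.frame(x = c(1:19, 60))
toy$y <- 2 * toy$x + rnorm(20)
toy$y[5] <- toy$y[5] + 15

fit <- lm(y ~ x, data = toy)
lev <- hatvalues(fit)    # leverage of each observation
rstud <- rstudent(fit)   # externally studentized residuals

p <- length(coef(fit))   # number of estimated coefficients
n <- nrow(toy)
lev_cut <- 2 * p / n     # rule-of-thumb leverage cutoff (assumption)

flagged <- which(lev > lev_cut | abs(rstud) > 2)

# Guard against an empty index: toy[-integer(0), ] would return zero rows
toy_clean <- if (length(flagged) > 0) toy[-flagged, ] else toy
```

The guard on the last line matters whenever no observation is flagged, because negative indexing with an empty vector drops every row.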

With a data frame that is free of outliers and high-leverage points, we repeat the same steps as above.

process2 <- preProcess(as.data.frame(fcc_clean2), method = c("range"))
fcc_norm2 <- predict(process2, as.data.frame(fcc_clean2))
sample2 <- sample.split(fcc_norm2$Percent_Coke_yield, SplitRatio = 0.9)
train2 <- subset(fcc_norm2, sample2 == TRUE)
test2 <- subset(fcc_norm2, sample2 == FALSE)
set.seed(123)
model_nn2 <- neuralnet(Percent_Coke_yield ~ ., train2, hidden = c(10, 8, 4, 1))
plotnet(model_nn2)

result_train2 <- compute(model_nn2, test2[, -5])
postResample(result_train2$net.result, test2[, 5])
##       RMSE   Rsquared        MAE 
## 0.03602461 0.92620488 0.02657743

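The model improved: in the run shown above, the R-squared of the NN model trained on data free of outliers and leverage points is about 0.93, against about 0.82 for the first model. The gain originally reported is about 2%; the exact figure depends on the random split. RMSE and MAE are computed on separately rescaled data, so they are not strictly comparable between the two models.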
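Since the model is trained on min-max-scaled data, predictions and error metrics are in scaled units. A minimal sketch of converting them back to percent coke yield follows. The vector `coke`, the helper `unscale`, and the value of `rmse_scaled` are all hypothetical; with the real data, the minimum and maximum would come from the column that was passed to preProcess.

```r
# Made-up coke yield values in percent
coke <- c(9.00, 10.96, 11.05, 10.98, 10.81, 10.77)
lo <- min(coke)
hi <- max(coke)

# Forward transform used by min-max scaling
scaled <- (coke - lo) / (hi - lo)

# Inverse transform: scaled value back to the original units
unscale <- function(z) z * (hi - lo) + lo
restored <- unscale(scaled)

# An error is a difference, so it is multiplied by the range only
rmse_scaled <- 0.036                      # hypothetical RMSE in scaled units
rmse_percent <- rmse_scaled * (hi - lo)   # same error in percent coke yield
```

Reporting the error in original units makes it easier to judge against the expected operating range for coke yield.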

7 Feature Importance

To check which variables have a significant effect in the NN model, the gar.fun function is used to visualize their relative importance.

gar.fun('Percent_Coke_yield',model_nn2)

As shown above, Total dry air to regenerator has a negative effect on coke yield. In contrast, the Cat/Oil ratio has the largest positive effect, followed by HBW Flow. Regenerator temperature has no significant effect on coke yield.

8 Conclusion

  1. The neural network model gives higher accuracy than the LM model.
  2. Removing outliers and leverage points improves the R-squared of the NN model (from about 0.82 to about 0.93 in the outputs shown; the exact gain depends on the random split).
  3. Total dry air to regenerator has a negative effect on the coke yield prediction, whereas the Cat/Oil ratio has the largest positive effect.