knitr::include_graphics("energies-15-02061-g001-550.png")At the current oil refinery, there is a unit called Fluid Catalytic Cracking. This unit has a very important role in an oil refinery. This unit converts crude oil into gasoline by cracking. in addition to producing gasoline, this unit also produces coke. This coke must be burned in the regenerator to restore the catalyst function so that it can be re-circulated. in this case, an Engineer wants to optimize the regenerator so that the amount of coke yield is always in the expected range. The expected target is the optimum operating conditions. Those parameters are Total Dry Air to Regenerator, HBW Flow, Regenerator Temperature, and Catalyst per Oil ratio.
library(dplyr) # data transformation
library(lmtest) # linear model
library(olsrr) # outlier and laverage
library(caret) # NearZeroVar
library(car) #VIFAt first, we have to import actual historical data from RCC unit.
fcc <- read.csv("FCC_Refinery_Data.csv")
head(fcc)## Date Total_Dry_Air_to_Regenerator HBW_Flow Regenerator_Temperature
## 1 2009-04-03 261148.4 101.42 617.68
## 2 2009-04-04 438448.0 171.98 713.93
## 3 2009-04-05 448456.5 176.89 717.09
## 4 2009-04-06 460325.3 183.66 717.95
## 5 2009-04-07 482533.4 194.75 724.77
## 6 2009-04-08 484954.4 192.76 723.33
## Cat_per_Oil_ratio Percent_Coke_yield
## 1 6.53 9.00
## 2 7.76 10.96
## 3 7.85 11.05
## 4 7.63 10.98
## 5 6.93 10.81
## 6 6.87 10.77
colnames(fcc)## [1] "Date" "Total_Dry_Air_to_Regenerator"
## [3] "HBW_Flow" "Regenerator_Temperature"
## [5] "Cat_per_Oil_ratio" "Percent_Coke_yield"
The variables that affect the amount of coke yield are :
In order to avoid bias in next data processing and modelling, the data shall be free of NA and negative value.
fcc %>%
is.na() %>%
colSums()## Date Total_Dry_Air_to_Regenerator
## 0 28
## HBW_Flow Regenerator_Temperature
## 26 34
## Cat_per_Oil_ratio Percent_Coke_yield
## 118 56
There are 28 empty cells in Total_Dry_Air_to_Regenerator, 26 empty cells in HBW_Flow, 34 empty cells in Regenerator_Temperature, 118 empty cells in Cat_per_Oil_ratio and 56 empty cells in Percent_Coke_yield. Then the negative value shall be converted into NA.
fcc_clean <- fcc # Duplicate data frame
fcc_clean[fcc_clean < 0] <- NA # Replace negative values by NAfcc_clean %>%
is.na() %>%
colSums()## Date Total_Dry_Air_to_Regenerator
## 26 30
## HBW_Flow Regenerator_Temperature
## 26 34
## Cat_per_Oil_ratio Percent_Coke_yield
## 118 58
After convert some negative value, the number of NA in some variables increased. Then na.omit used to remove all NA value.
fcc_clean <- fcc_clean %>%
select(-Date) %>%
na.omit()Then make sure all NA already removed and data is ready for further processing.
fcc_clean %>%
is.na() %>%
colSums()## Total_Dry_Air_to_Regenerator HBW_Flow
## 0 0
## Regenerator_Temperature Cat_per_Oil_ratio
## 0 0
## Percent_Coke_yield
## 0
The data distribution shall be checked in order to identify if any outlayer exist in data.
summary(fcc_clean)## Total_Dry_Air_to_Regenerator HBW_Flow Regenerator_Temperature
## Min. :146363 Min. : 25.53 Min. :477.3
## 1st Qu.:409693 1st Qu.:131.54 1st Qu.:712.1
## Median :435519 Median :145.47 Median :716.2
## Mean :430876 Mean :144.50 Mean :714.0
## 3rd Qu.:464812 3rd Qu.:162.59 3rd Qu.:719.6
## Max. :538074 Max. :228.39 Max. :733.7
## Cat_per_Oil_ratio Percent_Coke_yield
## Min. : 3.240 Min. : 3.710
## 1st Qu.: 7.305 1st Qu.: 9.440
## Median : 7.830 Median : 9.730
## Mean : 7.758 Mean : 9.835
## 3rd Qu.: 8.170 3rd Qu.:10.250
## Max. :10.320 Max. :13.420
boxplot(fcc_clean$Total_Dry_Air_to_Regenerator)boxplot(fcc_clean$HBW_Flow)boxplot(fcc_clean$Regenerator_Temperature)boxplot(fcc_clean$Cat_per_Oil_ratio)boxplot(fcc_clean$Percent_Coke_yield)As shown in boxplot above, the data contain some outlayer.
Simple Linear model used as initial observation. If the model fit above 80%, the LM model is considered sufficient to model the data. Here, all data considered important to create model.
model_linear <- lm(Percent_Coke_yield~., data=fcc_clean)
summary(model_linear)##
## Call:
## lm(formula = Percent_Coke_yield ~ ., data = fcc_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.50380 -0.19016 -0.02637 0.14909 1.98528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.504e+00 4.490e-01 -10.03 <2e-16 ***
## Total_Dry_Air_to_Regenerator -1.115e-05 3.404e-07 -32.75 <2e-16 ***
## HBW_Flow 3.403e-02 5.422e-04 62.77 <2e-16 ***
## Regenerator_Temperature 1.235e-02 6.885e-04 17.94 <2e-16 ***
## Cat_per_Oil_ratio 6.971e-01 1.451e-02 48.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3659 on 1534 degrees of freedom
## Multiple R-squared: 0.8172, Adjusted R-squared: 0.8167
## F-statistic: 1714 on 4 and 1534 DF, p-value: < 2.2e-16
Based on model summary above, the LM model fit about 81% of the data which is sufficient to model all data.
As mention above, the data still contain some outliers. Hence, the outliers tried to removed into improve the model fitting.
outlier <- ols_plot_resid_lev(model = model_linear)
As shown figure above, the some outliers and laverages in data and below
the index of outlier and laverage.
eliminate <- outlier$data$txt
eliminate_col <- complete.cases(eliminate)
eliminate_col <- eliminate[eliminate_col]
eliminate_col## [1] 1 44 139 140 241 242 269 270 327 328 329 330 331 332 333
## [16] 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348
## [31] 349 350 351 352 353 354 355 356 357 362 363 364 365 366 367
## [46] 368 369 377 378 379 380 381 382 383 384 385 386 387 388 404
## [61] 405 492 642 643 644 645 647 648 649 650 651 652 653 654 655
## [76] 667 668 728 1026 1027 1177 1213 1223 1224 1225 1226 1227 1228 1232 1233
## [91] 1234 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333
## [106] 1334 1335 1336 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351
## [121] 1352 1353 1354 1355 1356 1357 1380 1394 1395 1431 1432 1433 1434 1435 1436
The outlier and laverage index above used to remove outliers and laverages value from dataframe.
fcc_clean2 <- fcc_clean[-eliminate_col, ]
fcc_clean2 %>%
count()## n
## 1 1404
1657-1404## [1] 253
There are 253 value identified as outliers and laverages which already removed from dataframe. Then re-do the LM model by using data which already from outliers and laverages.
model_linear2 <- lm(Percent_Coke_yield~., data=fcc_clean2)
summary(model_linear2)##
## Call:
## lm(formula = Percent_Coke_yield ~ ., data = fcc_clean2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66451 -0.13129 -0.01967 0.11013 1.03273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.352e+00 6.866e-01 -9.252 <2e-16 ***
## Total_Dry_Air_to_Regenerator -1.122e-05 2.642e-07 -42.453 <2e-16 ***
## HBW_Flow 3.076e-02 3.947e-04 77.933 <2e-16 ***
## Regenerator_Temperature 1.802e-02 9.677e-04 18.618 <2e-16 ***
## Cat_per_Oil_ratio 4.764e-01 1.016e-02 46.903 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2082 on 1399 degrees of freedom
## Multiple R-squared: 0.881, Adjusted R-squared: 0.8806
## F-statistic: 2589 on 4 and 1399 DF, p-value: < 2.2e-16
The Adjusted R-squared from second LM model above improved to 88%.
After removing some outliers and laverages, the variance of data shall be checked to avoid near zero variance.
fcc_clean2 %>%
nearZeroVar()## integer(0)
There is no tend of data to zero variance.
vif(model_linear2)## Total_Dry_Air_to_Regenerator HBW_Flow
## 3.161156 3.423110
## Regenerator_Temperature Cat_per_Oil_ratio
## 1.376814 1.102831
The VIF value of all variables below 10, in other words thers is no Multicollinearity between variables.
shapiro.test(model_linear2$residuals)##
## Shapiro-Wilk normality test
##
## data: model_linear2$residuals
## W = 0.96802, p-value < 2.2e-16
plot(model_linear2$residuals)
Based on the Shapiro test above, p-value is below 0.05 which means that
the residual not distributed evenly or not normal.
bptest(model_linear2)##
## studentized Breusch-Pagan test
##
## data: model_linear2
## BP = 17.667, df = 4, p-value = 0.001433
Based on the bptest above, p-value is below 0.05 which means there are heteroscedasticity in data.
Percent_Coke_yield = 4.764e-01 * Cat_per_Oil_ratio + 1.802e-02 * Regenerator_Temperature + 3.076e-02 * HBW_Flow - 1.122e-05 * Total_Dry_Air_to_Regenerator - 6.352e+00
In actual case, even if some parameters have strong correlation to coke yield, some parameter could changed easly due to rating issues of pump and product distributions. Now,there are some constrains related to catalyst per oil ratio and HBW Flow.
Current catalyst per oil ratio already gave maximum gasoline production, and pump flow of HBW could not changed due to steam balance in the unit. Hence, catalyst per oil ratio shall be fixed at 8 and rated flow of HBW pump is 125 Ton/h. In order to maintain the heat balance in regenerator-reactor, the yeid of Coke shall be minimum at 9.7.
Hence, only regenerator temperature and Total dry air regenerator have possibility to adjust and the equation now became :
8.4 = 1.802e-02 * Regenerator_Temperature - 1.122e-05 * Total_Dry_Air_to_Regenerator
if normal operating temperature of regenerator 750 degC and the material limitation of regenerator is 850 degC, what is the total dry air to regenerator shall be set by the operator ? at 750 degC, the total dry air flow shall be 456256.68 kg/h and while at 850 degC the total dry air flow shall be maximum at 616862.75 kg/h.