Machine Learning Regression Model for Predicting Netpay

Author

Name : Debrina Silviana Dewi Sugiharto
Email : debrinasilviana@gmail.com
Date : 18 January 2021

Introduction

We will analyze what is the best model and what are variables to be used for predicting netpay using SOUTH EAST.xlsx dataset as train model. SOUTH EAST dataset contains of 17 columns and 60 rows. SOUTH EAST.xlsx is a dataset that extracted from 9 wells and 30 reservoirs contains of:

Seis_D : describe seismic depth of the gas reservoirs
Well Identifier : Well Name
Surface : Horizon 1 means ‘Top Negative Amplitude’ and Horizon 2 means ‘Bottom Positive Amplitude’
Reservoir Name : Name of the reservoir
FULL : Amplitude extracted from full stack amplitude with angle from 6-58 degrees
NEAR : Amplitude extracted from near stack amplitude with angle from 6-19 degrees
MID : Amplitude extracted from mid stack amplitude with angle from 19-32 degrees
FAR : Amplitude extracted from far stack amplitude with angle from 32-45 degrees
VFAR : Amplitude extracted from very far stack amplitude with angle from 45-58 degrees
RES : Resolution of the seismic
Netpay : Thickness of the gas reservoirs
Act_D : Actual Depth of the reservoirs

library(readxl)
library(GGally)
library(dplyr)
southeast <- read_xlsx("SOUTH EAST.xlsx")
southeast$`Well Identifier` <- factor(southeast$`Well Identifier`)
southeast

Well name in South East dataset:

levels(southeast$`Well Identifier`)

## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I"

Theory

From tucker plot above we can see that there is a relation between thickness of the reservoir vs amplitude. The thicker reservoirs, the stronger amplitude showed from the seismic. This relation only happened if the thickess of the reservoirs is below tuning thickness.

SOUTH EAST.xlsx dataset has maximum and minimum netpay as below:

summary(southeast$Netpay)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.170   2.160   3.795   4.656   7.290  10.540

SOUTH EAST DATASET.xlsx is taken from the seismic cube with resolution range as below:

summary(southeast$RES)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1181    1222    1304    1307    1383    1448

as we can see netpay range of SOUTH EAST.xlsx dataset is from 1.17m until 10.54m, where this thickness is still below tuning thickness. So the netpay in the SOUTH EAST.xlsx dataset should have a good relation with amplitude.

Data Conditioning

southeast$Seis_D <- southeast$Seis_D*-1 #make Depth to be positive constanta
southeast_top <- southeast %>% 
  filter(Surface=="Horizon 1") #separate top and bottom amplitude into different dataframe
southeast_bottom <- southeast %>% 
  filter(Surface=="Horizon 2")
southeast_top$RMSFull <- sqrt((southeast_top$FULL^2+southeast_bottom$FULL^2)/2) #calculate RMS from TOP FULL and BOTTOM FULL amplitude
southeast_top_amp <- southeast_top[,c(1,13,5,6,7,8,9,10,11,12)]
southeast_top_amp$velocity <- if_else(southeast_top_amp$Seis_D<1059,(-0.0002*southeast_top_amp$Seis_D^2)-(0.194*southeast_top_amp$Seis_D)+1556.7,(-2*10^-6*southeast_top_amp$Seis_D^3)+(0.0069*southeast_top_amp$Seis_D^2)-(8.236*southeast_top_amp$Seis_D)+4513.4) #calculate VP estimation. this law is just refer to previous petrophysics study about VP vs Depth
southeast_top_amp$RES<- southeast_top_amp$RES/100
ggcorr(southeast_top_amp,label=T)

From correlation plot above, we can conclude that Netpay has the strongest correlation with ‘RMS Full Amplitude’ and the correlation between Netpay and ‘RMS FULL Amplitude’ reaching 0.6.

Difference Seismic and Actual Depth

southeast$mistie <- southeast$Seis_D-southeast$Act_D
p0<- southeast %>% 
  ggplot(aes(x=mistie)) +
  geom_histogram(bins = 10) +
  labs(title = "Residuals from Model Backward") +
  theme_algoritma

ggplotly(p0)

Modelling

Next, we should analyze further to find the best models to be used for predicting netpay. In this case, I will use backward regression models method and find best models with smallest AIC “Akaike Information Criterion”.

Firstly, I try to use all variables to predict netpay and the model that created from all variables will be put in model_full. And then lets see how small the AIC as below:

model_full <- lm(Netpay ~ ., southeast_top_amp)
summary(model_full)

## 
## Call:
## lm(formula = Netpay ~ ., data = southeast_top_amp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4433 -1.3609 -0.5712  0.7827  3.9218 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -8.926e+01  7.055e+01  -1.265   0.2211  
## Seis_D       1.284e-01  6.877e-02   1.866   0.0775 .
## RMSFull      4.654e-04  4.119e-04   1.130   0.2727  
## FULL         6.537e-03  5.150e-03   1.269   0.2197  
## NEAR        -2.299e-03  1.725e-03  -1.333   0.1982  
## MID         -2.216e-03  1.928e-03  -1.150   0.2646  
## FAR         -2.455e-03  1.687e-03  -1.455   0.1620  
## UFAR         3.939e-04  3.929e-04   1.002   0.3287  
## RES          2.745e-01  6.619e-01   0.415   0.6830  
## Act_D       -9.911e-02  5.869e-02  -1.689   0.1077  
## velocity     4.923e-02  4.389e-02   1.122   0.2760  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.251 on 19 degrees of freedom
## Multiple R-squared:  0.6078, Adjusted R-squared:  0.4014 
## F-statistic: 2.944 on 10 and 19 DF,  p-value: 0.02058

Adjusted R squared from model_full is only 0.363. Using all variables/parameters in the dataset doesn’t mean that we can increase the R-squared.So next, I try to find the optimum models by trying to find best variables to be used in predicting netpay.

I will use backward regression model methodology and analyze which model has the smallest AIC.

model_backward <- step(model_full, direction = "backward")

## Start:  AIC=56.98
## Netpay ~ Seis_D + RMSFull + FULL + NEAR + MID + FAR + UFAR + 
##     RES + Act_D + velocity
## 
##            Df Sum of Sq     RSS    AIC
## - RES       1    0.8713  97.161 55.255
## - UFAR      1    5.0925 101.382 56.531
## - velocity  1    6.3759 102.665 56.908
## - RMSFull   1    6.4676 102.757 56.935
## <none>                   96.289 56.985
## - MID       1    6.6979 102.987 57.002
## - FULL      1    8.1647 104.454 57.426
## - NEAR      1    9.0072 105.297 57.668
## - FAR       1   10.7309 107.020 58.155
## - Act_D     1   14.4487 110.738 59.179
## - Seis_D    1   17.6522 113.942 60.035
## 
## Step:  AIC=55.26
## Netpay ~ Seis_D + RMSFull + FULL + NEAR + MID + FAR + UFAR + 
##     Act_D + velocity
## 
##            Df Sum of Sq     RSS    AIC
## - UFAR      1    4.3232 101.484 54.561
## <none>                   97.161 55.255
## - RMSFull   1    6.9823 104.143 55.337
## - MID       1    7.1192 104.280 55.376
## - velocity  1    7.9882 105.149 55.625
## - FULL      1    8.7617 105.922 55.845
## - NEAR      1    9.5572 106.718 56.070
## - FAR       1   10.6938 107.854 56.388
## - Act_D     1   17.0269 114.188 58.099
## - Seis_D    1   21.2918 118.453 59.199
## 
## Step:  AIC=54.56
## Netpay ~ Seis_D + RMSFull + FULL + NEAR + MID + FAR + Act_D + 
##     velocity
## 
##            Df Sum of Sq    RSS    AIC
## - MID       1    3.4686 104.95 53.569
## - FULL      1    4.8654 106.35 53.966
## - NEAR      1    5.5563 107.04 54.160
## - velocity  1    5.8143 107.30 54.232
## - FAR       1    6.4073 107.89 54.398
## <none>                  101.48 54.561
## - RMSFull   1   12.0267 113.51 55.921
## - Act_D     1   12.7074 114.19 56.100
## - Seis_D    1   17.1146 118.60 57.236
## 
## Step:  AIC=53.57
## Netpay ~ Seis_D + RMSFull + FULL + NEAR + FAR + Act_D + velocity
## 
##            Df Sum of Sq    RSS    AIC
## - NEAR      1    3.1741 108.13 52.463
## - FULL      1    3.5170 108.47 52.558
## - velocity  1    4.3112 109.26 52.777
## - FAR       1    4.9663 109.92 52.956
## <none>                  104.95 53.569
## - RMSFull   1   11.2102 116.16 54.614
## - Act_D     1   12.7503 117.70 55.009
## - Seis_D    1   16.1817 121.13 55.871
## 
## Step:  AIC=52.46
## Netpay ~ Seis_D + RMSFull + FULL + FAR + Act_D + velocity
## 
##            Df Sum of Sq    RSS    AIC
## - FULL      1    0.3907 108.52 50.571
## - FAR       1    1.8443 109.97 50.971
## - velocity  1    4.7001 112.83 51.740
## <none>                  108.13 52.463
## - RMSFull   1    9.8768 118.00 53.085
## - Act_D     1   12.0294 120.16 53.628
## - Seis_D    1   15.8220 123.95 54.560
## 
## Step:  AIC=50.57
## Netpay ~ Seis_D + RMSFull + FAR + Act_D + velocity
## 
##            Df Sum of Sq    RSS    AIC
## - FAR       1    1.5510 110.07 48.997
## - velocity  1    4.4671 112.98 49.782
## <none>                  108.52 50.571
## - Act_D     1   12.3037 120.82 51.793
## - RMSFull   1   15.4297 123.95 52.560
## - Seis_D    1   15.8546 124.37 52.662
## 
## Step:  AIC=49
## Netpay ~ Seis_D + RMSFull + Act_D + velocity
## 
##            Df Sum of Sq    RSS    AIC
## - velocity  1     5.570 115.64 48.478
## <none>                  110.07 48.997
## - Act_D     1    11.024 121.09 49.861
## - Seis_D    1    14.817 124.89 50.786
## - RMSFull   1    94.466 204.53 65.586
## 
## Step:  AIC=48.48
## Netpay ~ Seis_D + RMSFull + Act_D
## 
##           Df Sum of Sq    RSS    AIC
## - Act_D    1     7.765 123.40 48.428
## <none>                 115.64 48.478
## - Seis_D   1     9.312 124.95 48.801
## - RMSFull  1    89.886 205.52 63.731
## 
## Step:  AIC=48.43
## Netpay ~ Seis_D + RMSFull
## 
##           Df Sum of Sq    RSS    AIC
## <none>                 123.40 48.428
## - Seis_D   1    43.024 166.43 55.401
## - RMSFull  1    84.199 207.60 62.033

From calculation above, we can conclude the best methodology in predicting netpay is using variables Seismic Depth and RMS Full with smallest AIC=48.43.

##backward
model_backward <- lm(Netpay~Seis_D+RMSFull,southeast_top_amp)
summary(model_backward)

## 
## Call:
## lm(formula = Netpay ~ Seis_D + RMSFull, data = southeast_top_amp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8004 -1.3700 -0.0098  1.4091  4.7722 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.4952061  1.7879659  -1.955 0.061027 .  
## Seis_D       0.0055099  0.0017958   3.068 0.004859 ** 
## RMSFull      0.0005409  0.0001260   4.292 0.000204 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.138 on 27 degrees of freedom
## Multiple R-squared:  0.4974, Adjusted R-squared:  0.4601 
## F-statistic: 13.36 on 2 and 27 DF,  p-value: 9.269e-05

The adjusted R-squared is 0.4601 by using models backward, which actually the adjusted R-squared is increased from the past model_full. Model backward: \[ Netpay= (Seismic Depth*-0.0055)+(RMSFull*0.0005)-3.49 \]

Now, I want to see the residuals distribution of the models as below:

p1<- model_backward %>% 
  ggplot(aes(x=model_backward$residuals)) +
  geom_histogram(bins = 10) +
  labs(title = "Residuals from Model Backward") +
  theme_algoritma

ggplotly(p1)

And I would like to do shapiro test for calculating if the residuals is distributed normally or not as below:

shapiro.test(model_backward$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_backward$residuals
## W = 0.98108, p-value = 0.8535

p-value > 0.05: Accepting \(H_0\) or residual is distributed normally
p-value < 0.05: Rejecting \(H_0\) atau residual is not distributed normally

From Shapiro test calculation above, we can see that p-value from model_backward is 0.8535 which means it is above 0.05, this conclude that the residual is distributed normally.

Prediction with Model Train

Now I would like to calculate Netpay Prediction from model backward

southeast_top_amp$predict <- predict(model_backward, southeast_top_amp)

Here I try to see the relation between Netpay Actual with Netpay Prediction from Model Backward. I expect to see that the data cloud of prediction vs actual netpay is should be around the red line.

p2<- southeast_top_amp %>% 
  ggplot(aes(x=Netpay, y=predict, text=paste("Netpay:",Netpay,"| Predict:", round(predict,2), "| Abs Residuals:", abs(round(model_backward$residuals,2))))) +
  geom_point(aes(color=Netpay,size=abs(model_backward$residuals))) +
  labs(title = "Netpay Actual vs Prediction from Model") + 
  theme_algoritma+geom_abline(intercept = 0, # intercept = 0
              slope = 1,
              color = "red"
              )+xlim(0,10)+ylim(0,10)

ggplotly(p2,tooltip = "text")

Comparing with Available Old Model Law

Actually there are available previous old law that usually used for predicting netpay in this certain dummy area. The law is:

\[ (9.05289*10^{(-5)}*RMSFull+0.064972886)*(Resolution/100)*Velocity/2000 \] Now, I want to see performance of the old model comparing with new model backward.

#general old model => (9.05289*10^-5x+0.064972886)*(resolution/100)*velocity/2000
southeast_top_amp$predictoldlaw <- (((9.05289*10^-5)*southeast_top_amp$RMSFull)+0.064972886)*southeast_top_amp$RES*southeast_top_amp$velocity/2000
rss <- sum((southeast_top_amp$predictoldlaw-southeast_top_amp$Netpay)^2)
tss <- sum((southeast_top_amp$Netpay-mean(southeast_top_amp$Netpay))^2)
rsq <- 1-rss/tss
rsq

## [1] 0.08665801

R-squared from old model is only 0.086, but from new model backward is 0.4601 . So, model backward is better than the old law.

Now, let’s see residuals produced from the Old Model:

southeast_top_amp$residualsoldlaw <- southeast_top_amp$predictoldlaw-southeast_top_amp$Netpay

p3<- southeast_top_amp %>% 
  ggplot(aes(x=residualsoldlaw)) +
  geom_histogram(bins = 10) +
  labs(title = "Residuals from Old Model") +
  theme_algoritma

ggplotly(p3)

I would like to do shapiro test for calculating if the residuals of the old model is distributed normally or not as below:

shapiro.test(southeast_top_amp$residualsoldlaw)

## 
##  Shapiro-Wilk normality test
## 
## data:  southeast_top_amp$residualsoldlaw
## W = 0.95001, p-value = 0.1692

p-value > 0.05: Accepting \(H_0\) or residual is distributed normally
p-value < 0.05: Rejecting \(H_0\) atau residual is not distributed normally

From Shapiro test calculation above, we can see that p-value from old model is 0.1692 which means it is below 0.05, this conclude that the residual is not distributed normally. So, the new model backward is better than the old law.

Now, I would like to see, prediction vs netpay actual from the old law as below:

p4<- southeast_top_amp %>% 
  ggplot(aes(x=Netpay, y=predictoldlaw, text=paste("Netpay:",Netpay,"| Predict:",round(predictoldlaw,2),"| Residuals",abs(round(residualsoldlaw,2))))) +
  geom_point(aes(color=Netpay,size=abs(residualsoldlaw))) +
  labs(title = "Netpay Actual vs Old Law Prediction") + 
  theme_algoritma+geom_abline(intercept = 0, # intercept = 0
              slope = 1,
              color = "red"
              )+xlim(0,10)+ylim(0,10)

ggplotly(p4,tooltip = "text")

Netpay Prediction of New Well Proposal

I would like to predict Netpay from data test loaded from ‘Z.xlsx’. Actually Z.xlsx is a proposed wells with three main targets and we want to know how much netpay we can expect from each targets using model backward below: \[ Netpay= (Seismic Depth*-0.0055)+(RMSFull*0.0005)-3.49 \]

datatest <- read_xlsx("Z.xlsx")
datatest$Seis_D <- datatest$Seis_D*-1
datatest_top <- datatest %>% 
  filter(Horizon=="TOP") #separate top and bottom amplitude into different dataframe
datatest_bottom <- datatest %>% 
  filter(Horizon=="BOTTOM")
datatest_top$RMSFull <- sqrt((datatest_top$FULL^2+datatest_bottom$FULL^2)/2) #calculate RMS from TOP FULL and BOTTOM FULL amplitude
datatest_top

Now, we can predict netpay as below:

datatest_top$predict <- round(predict(model_backward, datatest_top),2)
datatest_top[,c(1,2,10,11)]

So the conclusion is proposal well Z expecting to get 17.2 m from 3 main targets. T1 is expecting to get 4.1m, T2 is expecting 6.6m, and T3 is expecting 6.5m

sum(datatest_top$predict)

## [1] 17.23

This is just a demo about how to predict netpay using regression models, in this dataset we only analyzed attribute from amplitude full, near, mid, far, ufar, resolution, velocity. And from this specific dataset, the optimum results is produced by just using two variables which are Depth and RMS Full Amplitude. Different dataset, different area of analysis, may result different model. If needed, we can also add more attribute to be analyzed that probably can improve the models