We will analyze what is the best model and what are variables to be used for predicting netpay using SOUTH EAST.xlsx dataset as train model. SOUTH EAST dataset contains of 17 columns and 60 rows. SOUTH EAST.xlsx is a dataset that extracted from 9 wells and 30 reservoirs contains of:
library(readxl)
library(GGally)
library(dplyr)
southeast <- read_xlsx("SOUTH EAST.xlsx")
southeast$`Well Identifier` <- factor(southeast$`Well Identifier`)
southeast
Well name in South East dataset:
levels(southeast$`Well Identifier`)
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I"
From tucker plot above we can see that there is a relation between thickness of the reservoir vs amplitude. The thicker reservoirs, the stronger amplitude showed from the seismic. This relation only happened if the thickess of the reservoirs is below tuning thickness.
SOUTH EAST.xlsx dataset has maximum and minimum netpay as below:
summary(southeast$Netpay)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.170 2.160 3.795 4.656 7.290 10.540
SOUTH EAST DATASET.xlsx is taken from the seismic cube with resolution range as below:
summary(southeast$RES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1181 1222 1304 1307 1383 1448
as we can see netpay range of SOUTH EAST.xlsx dataset is from 1.17m until 10.54m, where this thickness is still below tuning thickness. So the netpay in the SOUTH EAST.xlsx dataset should have a good relation with amplitude.
southeast$Seis_D <- southeast$Seis_D*-1 #make Depth to be positive constanta
southeast_top <- southeast %>%
filter(Surface=="Horizon 1") #separate top and bottom amplitude into different dataframe
southeast_bottom <- southeast %>%
filter(Surface=="Horizon 2")
southeast_top$RMSFull <- sqrt((southeast_top$FULL^2+southeast_bottom$FULL^2)/2) #calculate RMS from TOP FULL and BOTTOM FULL amplitude
southeast_top_amp <- southeast_top[,c(1,13,5,6,7,8,9,10,11,12)]
southeast_top_amp$velocity <- if_else(southeast_top_amp$Seis_D<1059,(-0.0002*southeast_top_amp$Seis_D^2)-(0.194*southeast_top_amp$Seis_D)+1556.7,(-2*10^-6*southeast_top_amp$Seis_D^3)+(0.0069*southeast_top_amp$Seis_D^2)-(8.236*southeast_top_amp$Seis_D)+4513.4) #calculate VP estimation. this law is just refer to previous petrophysics study about VP vs Depth
southeast_top_amp$RES<- southeast_top_amp$RES/100
ggcorr(southeast_top_amp,label=T)
From correlation plot above, we can conclude that Netpay has the strongest correlation with ‘RMS Full Amplitude’ and the correlation between Netpay and ‘RMS FULL Amplitude’ reaching 0.6.
southeast$mistie <- southeast$Seis_D-southeast$Act_D
p0<- southeast %>%
ggplot(aes(x=mistie)) +
geom_histogram(bins = 10) +
labs(title = "Residuals from Model Backward") +
theme_algoritma
ggplotly(p0)
Next, we should analyze further to find the best models to be used for predicting netpay. In this case, I will use backward regression models method and find best models with smallest AIC “Akaike Information Criterion”.
Firstly, I try to use all variables to predict netpay and the model that created from all variables will be put in model_full. And then lets see how small the AIC as below:
model_full <- lm(Netpay ~ ., southeast_top_amp)
summary(model_full)
##
## Call:
## lm(formula = Netpay ~ ., data = southeast_top_amp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4433 -1.3609 -0.5712 0.7827 3.9218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.926e+01 7.055e+01 -1.265 0.2211
## Seis_D 1.284e-01 6.877e-02 1.866 0.0775 .
## RMSFull 4.654e-04 4.119e-04 1.130 0.2727
## FULL 6.537e-03 5.150e-03 1.269 0.2197
## NEAR -2.299e-03 1.725e-03 -1.333 0.1982
## MID -2.216e-03 1.928e-03 -1.150 0.2646
## FAR -2.455e-03 1.687e-03 -1.455 0.1620
## UFAR 3.939e-04 3.929e-04 1.002 0.3287
## RES 2.745e-01 6.619e-01 0.415 0.6830
## Act_D -9.911e-02 5.869e-02 -1.689 0.1077
## velocity 4.923e-02 4.389e-02 1.122 0.2760
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.251 on 19 degrees of freedom
## Multiple R-squared: 0.6078, Adjusted R-squared: 0.4014
## F-statistic: 2.944 on 10 and 19 DF, p-value: 0.02058
Adjusted R squared from model_full is only 0.363. Using all variables/parameters in the dataset doesn’t mean that we can increase the R-squared.So next, I try to find the optimum models by trying to find best variables to be used in predicting netpay.
I will use backward regression model methodology and analyze which model has the smallest AIC.
model_backward <- step(model_full, direction = "backward")
## Start: AIC=56.98
## Netpay ~ Seis_D + RMSFull + FULL + NEAR + MID + FAR + UFAR +
## RES + Act_D + velocity
##
## Df Sum of Sq RSS AIC
## - RES 1 0.8713 97.161 55.255
## - UFAR 1 5.0925 101.382 56.531
## - velocity 1 6.3759 102.665 56.908
## - RMSFull 1 6.4676 102.757 56.935
## <none> 96.289 56.985
## - MID 1 6.6979 102.987 57.002
## - FULL 1 8.1647 104.454 57.426
## - NEAR 1 9.0072 105.297 57.668
## - FAR 1 10.7309 107.020 58.155
## - Act_D 1 14.4487 110.738 59.179
## - Seis_D 1 17.6522 113.942 60.035
##
## Step: AIC=55.26
## Netpay ~ Seis_D + RMSFull + FULL + NEAR + MID + FAR + UFAR +
## Act_D + velocity
##
## Df Sum of Sq RSS AIC
## - UFAR 1 4.3232 101.484 54.561
## <none> 97.161 55.255
## - RMSFull 1 6.9823 104.143 55.337
## - MID 1 7.1192 104.280 55.376
## - velocity 1 7.9882 105.149 55.625
## - FULL 1 8.7617 105.922 55.845
## - NEAR 1 9.5572 106.718 56.070
## - FAR 1 10.6938 107.854 56.388
## - Act_D 1 17.0269 114.188 58.099
## - Seis_D 1 21.2918 118.453 59.199
##
## Step: AIC=54.56
## Netpay ~ Seis_D + RMSFull + FULL + NEAR + MID + FAR + Act_D +
## velocity
##
## Df Sum of Sq RSS AIC
## - MID 1 3.4686 104.95 53.569
## - FULL 1 4.8654 106.35 53.966
## - NEAR 1 5.5563 107.04 54.160
## - velocity 1 5.8143 107.30 54.232
## - FAR 1 6.4073 107.89 54.398
## <none> 101.48 54.561
## - RMSFull 1 12.0267 113.51 55.921
## - Act_D 1 12.7074 114.19 56.100
## - Seis_D 1 17.1146 118.60 57.236
##
## Step: AIC=53.57
## Netpay ~ Seis_D + RMSFull + FULL + NEAR + FAR + Act_D + velocity
##
## Df Sum of Sq RSS AIC
## - NEAR 1 3.1741 108.13 52.463
## - FULL 1 3.5170 108.47 52.558
## - velocity 1 4.3112 109.26 52.777
## - FAR 1 4.9663 109.92 52.956
## <none> 104.95 53.569
## - RMSFull 1 11.2102 116.16 54.614
## - Act_D 1 12.7503 117.70 55.009
## - Seis_D 1 16.1817 121.13 55.871
##
## Step: AIC=52.46
## Netpay ~ Seis_D + RMSFull + FULL + FAR + Act_D + velocity
##
## Df Sum of Sq RSS AIC
## - FULL 1 0.3907 108.52 50.571
## - FAR 1 1.8443 109.97 50.971
## - velocity 1 4.7001 112.83 51.740
## <none> 108.13 52.463
## - RMSFull 1 9.8768 118.00 53.085
## - Act_D 1 12.0294 120.16 53.628
## - Seis_D 1 15.8220 123.95 54.560
##
## Step: AIC=50.57
## Netpay ~ Seis_D + RMSFull + FAR + Act_D + velocity
##
## Df Sum of Sq RSS AIC
## - FAR 1 1.5510 110.07 48.997
## - velocity 1 4.4671 112.98 49.782
## <none> 108.52 50.571
## - Act_D 1 12.3037 120.82 51.793
## - RMSFull 1 15.4297 123.95 52.560
## - Seis_D 1 15.8546 124.37 52.662
##
## Step: AIC=49
## Netpay ~ Seis_D + RMSFull + Act_D + velocity
##
## Df Sum of Sq RSS AIC
## - velocity 1 5.570 115.64 48.478
## <none> 110.07 48.997
## - Act_D 1 11.024 121.09 49.861
## - Seis_D 1 14.817 124.89 50.786
## - RMSFull 1 94.466 204.53 65.586
##
## Step: AIC=48.48
## Netpay ~ Seis_D + RMSFull + Act_D
##
## Df Sum of Sq RSS AIC
## - Act_D 1 7.765 123.40 48.428
## <none> 115.64 48.478
## - Seis_D 1 9.312 124.95 48.801
## - RMSFull 1 89.886 205.52 63.731
##
## Step: AIC=48.43
## Netpay ~ Seis_D + RMSFull
##
## Df Sum of Sq RSS AIC
## <none> 123.40 48.428
## - Seis_D 1 43.024 166.43 55.401
## - RMSFull 1 84.199 207.60 62.033
From calculation above, we can conclude the best methodology in predicting netpay is using variables Seismic Depth and RMS Full with smallest AIC=48.43.
##backward
model_backward <- lm(Netpay~Seis_D+RMSFull,southeast_top_amp)
summary(model_backward)
##
## Call:
## lm(formula = Netpay ~ Seis_D + RMSFull, data = southeast_top_amp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8004 -1.3700 -0.0098 1.4091 4.7722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4952061 1.7879659 -1.955 0.061027 .
## Seis_D 0.0055099 0.0017958 3.068 0.004859 **
## RMSFull 0.0005409 0.0001260 4.292 0.000204 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.138 on 27 degrees of freedom
## Multiple R-squared: 0.4974, Adjusted R-squared: 0.4601
## F-statistic: 13.36 on 2 and 27 DF, p-value: 9.269e-05
The adjusted R-squared is 0.4601 by using models backward, which actually the adjusted R-squared is increased from the past model_full. Model backward: \[ Netpay= (Seismic Depth*-0.0055)+(RMSFull*0.0005)-3.49 \]
Now, I want to see the residuals distribution of the models as below:
p1<- model_backward %>%
ggplot(aes(x=model_backward$residuals)) +
geom_histogram(bins = 10) +
labs(title = "Residuals from Model Backward") +
theme_algoritma
ggplotly(p1)
And I would like to do shapiro test for calculating if the residuals is distributed normally or not as below:
shapiro.test(model_backward$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_backward$residuals
## W = 0.98108, p-value = 0.8535
From Shapiro test calculation above, we can see that p-value from model_backward is 0.8535 which means it is above 0.05, this conclude that the residual is distributed normally.
Now I would like to calculate Netpay Prediction from model backward
southeast_top_amp$predict <- predict(model_backward, southeast_top_amp)
Here I try to see the relation between Netpay Actual with Netpay Prediction from Model Backward. I expect to see that the data cloud of prediction vs actual netpay is should be around the red line.
p2<- southeast_top_amp %>%
ggplot(aes(x=Netpay, y=predict, text=paste("Netpay:",Netpay,"| Predict:", round(predict,2), "| Abs Residuals:", abs(round(model_backward$residuals,2))))) +
geom_point(aes(color=Netpay,size=abs(model_backward$residuals))) +
labs(title = "Netpay Actual vs Prediction from Model") +
theme_algoritma+geom_abline(intercept = 0, # intercept = 0
slope = 1,
color = "red"
)+xlim(0,10)+ylim(0,10)
ggplotly(p2,tooltip = "text")
Actually there are available previous old law that usually used for predicting netpay in this certain dummy area. The law is:
\[ (9.05289*10^{(-5)}*RMSFull+0.064972886)*(Resolution/100)*Velocity/2000 \] Now, I want to see performance of the old model comparing with new model backward.
#general old model => (9.05289*10^-5x+0.064972886)*(resolution/100)*velocity/2000
southeast_top_amp$predictoldlaw <- (((9.05289*10^-5)*southeast_top_amp$RMSFull)+0.064972886)*southeast_top_amp$RES*southeast_top_amp$velocity/2000
rss <- sum((southeast_top_amp$predictoldlaw-southeast_top_amp$Netpay)^2)
tss <- sum((southeast_top_amp$Netpay-mean(southeast_top_amp$Netpay))^2)
rsq <- 1-rss/tss
rsq
## [1] 0.08665801
R-squared from old model is only 0.086, but from new model backward is 0.4601 . So, model backward is better than the old law.
Now, let’s see residuals produced from the Old Model:
southeast_top_amp$residualsoldlaw <- southeast_top_amp$predictoldlaw-southeast_top_amp$Netpay
p3<- southeast_top_amp %>%
ggplot(aes(x=residualsoldlaw)) +
geom_histogram(bins = 10) +
labs(title = "Residuals from Old Model") +
theme_algoritma
ggplotly(p3)
I would like to do shapiro test for calculating if the residuals of the old model is distributed normally or not as below:
shapiro.test(southeast_top_amp$residualsoldlaw)
##
## Shapiro-Wilk normality test
##
## data: southeast_top_amp$residualsoldlaw
## W = 0.95001, p-value = 0.1692
From Shapiro test calculation above, we can see that p-value from old model is 0.1692 which means it is below 0.05, this conclude that the residual is not distributed normally. So, the new model backward is better than the old law.
Now, I would like to see, prediction vs netpay actual from the old law as below:
p4<- southeast_top_amp %>%
ggplot(aes(x=Netpay, y=predictoldlaw, text=paste("Netpay:",Netpay,"| Predict:",round(predictoldlaw,2),"| Residuals",abs(round(residualsoldlaw,2))))) +
geom_point(aes(color=Netpay,size=abs(residualsoldlaw))) +
labs(title = "Netpay Actual vs Old Law Prediction") +
theme_algoritma+geom_abline(intercept = 0, # intercept = 0
slope = 1,
color = "red"
)+xlim(0,10)+ylim(0,10)
ggplotly(p4,tooltip = "text")
I would like to predict Netpay from data test loaded from ‘Z.xlsx’. Actually Z.xlsx is a proposed wells with three main targets and we want to know how much netpay we can expect from each targets using model backward below: \[ Netpay= (Seismic Depth*-0.0055)+(RMSFull*0.0005)-3.49 \]
datatest <- read_xlsx("Z.xlsx")
datatest$Seis_D <- datatest$Seis_D*-1
datatest_top <- datatest %>%
filter(Horizon=="TOP") #separate top and bottom amplitude into different dataframe
datatest_bottom <- datatest %>%
filter(Horizon=="BOTTOM")
datatest_top$RMSFull <- sqrt((datatest_top$FULL^2+datatest_bottom$FULL^2)/2) #calculate RMS from TOP FULL and BOTTOM FULL amplitude
datatest_top
Now, we can predict netpay as below:
datatest_top$predict <- round(predict(model_backward, datatest_top),2)
datatest_top[,c(1,2,10,11)]
So the conclusion is proposal well Z expecting to get 17.2 m from 3 main targets. T1 is expecting to get 4.1m, T2 is expecting 6.6m, and T3 is expecting 6.5m
sum(datatest_top$predict)
## [1] 17.23
This is just a demo about how to predict netpay using regression models, in this dataset we only analyzed attribute from amplitude full, near, mid, far, ufar, resolution, velocity. And from this specific dataset, the optimum results is produced by just using two variables which are Depth and RMS Full Amplitude. Different dataset, different area of analysis, may result different model. If needed, we can also add more attribute to be analyzed that probably can improve the models