Applied Geostatistics in R: 3. Linear Discriminant Analysis (LDA) for Multinomial Lithofacies Classification in a Sandstone Formation

Linear Discriminant Analysis (LDA) seeks linear combinations of the features that best separate two or more classes once the data are projected onto those directions. To validate an LDA model and preserve the integrity of the statistical inference, cross-validation is used to estimate the generalization error of each candidate model, and the model with the smallest estimated error is chosen. In k-fold cross-validation, the data are split into k subsets of approximately equal size; the model is trained on k-1 of the subsets and tested on the one left out, and this is repeated so that each subset serves once as the test set. Leave-one-out cross-validation, the special case in which each subset holds a single observation, is likewise used to estimate generalization error. The jackknife, by contrast, is used to estimate the bias or the standard error of a statistic: the average of the subset statistics is compared with the corresponding statistic computed from the entire sample in order to estimate the bias of the latter.
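For readers who want the classification rule behind LDA written out, the standard textbook form is sketched below. It assumes each class k is Gaussian with its own mean and a covariance matrix shared by all classes. The symbols (x for a feature vector, \mu_k for the class mean, \Sigma for the common covariance, \pi_k for the prior) are notation introduced here and do not appear in the original text.

```latex
\begin{align}
  \delta_k(x) &= x^{\top}\Sigma^{-1}\mu_k
                 - \tfrac{1}{2}\,\mu_k^{\top}\Sigma^{-1}\mu_k
                 + \log \pi_k, \\
  \hat{y}(x) &= \arg\max_k \, \delta_k(x), \\
  P(y = k \mid x) &= \frac{\exp\{\delta_k(x)\}}{\sum_{j}\exp\{\delta_j(x)\}}.
\end{align}
```

The second line gives the discrete prediction and the third gives the posterior probability of each class, the two outputs discussed throughout this tutorial.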

In this work, LDA is used to model lithofacies from core analysis and well log data, and to predict both the discrete facies and their posterior probability distributions in the Karpur dataset.

Load the packages required to implement the LDA algorithm.

# Load the required packages (install them first with install.packages() if needed).
library(caTools)
library(MASS)
library(lattice)

Load the dataset and show its first rows:

##    depth caliper ind.deep ind.med  gamma phi.N R.deep  R.med      SP
## 1 5667.0   8.685  618.005 569.781 98.823 0.410  1.618  1.755 -56.587
## 2 5667.5   8.686  497.547 419.494 90.640 0.307  2.010  2.384 -61.916
## 3 5668.0   8.686  384.935 300.155 78.087 0.203  2.598  3.332 -55.861
## 4 5668.5   8.686  278.324 205.224 66.232 0.119  3.593  4.873 -41.860
## 5 5669.0   8.686  183.743 131.155 59.807 0.069  5.442  7.625 -34.934
## 6 5669.5   8.686  109.512  75.633 57.109 0.048  9.131 13.222 -39.769
##   density.corr density phi.core   k.core Facies
## 1       -0.033   2.205  33.9000 2442.590     F1
## 2       -0.067   2.040  33.4131 3006.989     F1
## 3       -0.064   1.888  33.1000 3370.000     F1
## 4       -0.053   1.794  34.9000 2270.000     F1
## 5       -0.054   1.758  35.0644 2530.758     F1
## 6       -0.058   1.759  35.3152 2928.314     F1
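The chunk that read the data is not shown, so the file name and reader the author used are unknown. As a minimal sketch, the six rows printed above can be rebuilt into a small data frame to see how such a table is read and inspected. The object name `karpur_head` is hypothetical, and the full dataset would normally come from a file instead.

```r
# The six rows printed above, typed in as text (the full Karpur file is not reproduced here).
txt <- "depth caliper ind.deep ind.med gamma phi.N R.deep R.med SP density.corr density phi.core k.core Facies
5667.0 8.685 618.005 569.781 98.823 0.410 1.618 1.755 -56.587 -0.033 2.205 33.9000 2442.590 F1
5667.5 8.686 497.547 419.494 90.640 0.307 2.010 2.384 -61.916 -0.067 2.040 33.4131 3006.989 F1
5668.0 8.686 384.935 300.155 78.087 0.203 2.598 3.332 -55.861 -0.064 1.888 33.1000 3370.000 F1
5668.5 8.686 278.324 205.224 66.232 0.119 3.593 4.873 -41.860 -0.053 1.794 34.9000 2270.000 F1
5669.0 8.686 183.743 131.155 59.807 0.069 5.442 7.625 -34.934 -0.054 1.758 35.0644 2530.758 F1
5669.5 8.686 109.512 75.633 57.109 0.048 9.131 13.222 -39.769 -0.058 1.759 35.3152 2928.314 F1"

# Facies is read as a factor because it is the class label for LDA.
karpur_head <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)

head(karpur_head)  # first rows
str(karpur_head)   # column types: 13 numeric variables plus the Facies factor
```

Calling `summary()` on the full data frame would produce a table like the one in the next section.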

Summary of the dataset.

##      depth         caliper         ind.deep          ind.med       
##  Min.   :5667   Min.   :8.487   Min.   :  6.532   Min.   :  9.386  
##  1st Qu.:5769   1st Qu.:8.556   1st Qu.: 28.799   1st Qu.: 27.892  
##  Median :5872   Median :8.588   Median :217.849   Median :254.383  
##  Mean   :5873   Mean   :8.622   Mean   :275.357   Mean   :273.357  
##  3rd Qu.:5977   3rd Qu.:8.686   3rd Qu.:566.793   3rd Qu.:544.232  
##  Max.   :6083   Max.   :8.886   Max.   :769.484   Max.   :746.028  
##                                                                    
##      gamma            phi.N            R.deep            R.med        
##  Min.   : 16.74   Min.   :0.0150   Min.   :  1.300   Min.   :  1.340  
##  1st Qu.: 40.89   1st Qu.:0.2030   1st Qu.:  1.764   1st Qu.:  1.837  
##  Median : 51.37   Median :0.2450   Median :  4.590   Median :  3.931  
##  Mean   : 53.42   Mean   :0.2213   Mean   : 24.501   Mean   : 21.196  
##  3rd Qu.: 62.37   3rd Qu.:0.2640   3rd Qu.: 34.724   3rd Qu.: 35.853  
##  Max.   :112.40   Max.   :0.4100   Max.   :153.085   Max.   :106.542  
##                                                                       
##        SP          density.corr          density         phi.core    
##  Min.   :-73.95   Min.   :-0.067000   Min.   :1.758   Min.   :15.70  
##  1st Qu.:-42.01   1st Qu.:-0.016000   1st Qu.:2.023   1st Qu.:23.90  
##  Median :-32.25   Median :-0.007000   Median :2.099   Median :27.60  
##  Mean   :-30.98   Mean   :-0.008883   Mean   :2.102   Mean   :26.93  
##  3rd Qu.:-19.48   3rd Qu.: 0.002000   3rd Qu.:2.181   3rd Qu.:30.70  
##  Max.   : 25.13   Max.   : 0.089000   Max.   :2.387   Max.   :36.30  
##                                                                      
##      k.core             Facies   
##  Min.   :    0.42   F8     :184  
##  1st Qu.:  657.33   F9     :172  
##  Median : 1591.22   F10    :171  
##  Mean   : 2251.91   F1     :111  
##  3rd Qu.: 3046.82   F5     :109  
##  Max.   :15600.00   F3     : 55  
##                     (Other): 17

Visualize the dataset:

To use LDA, I first split the data into one part used to train the classifier and another part used to test it. For this problem, I used an approximately 80:20 split.
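The splitting code itself is not shown, so the following is only a sketch of one way to make an 80:20 split in base R. The built-in `iris` data stands in for the Karpur data, and the seed is arbitrary. Since `caTools` is loaded above, the author may well have used its `sample.split()` function instead, which returns a logical vector and keeps the class proportions similar in both parts.

```r
# Stand-in data: built-in iris replaces the Karpur logs (assumption for illustration).
data(iris)

set.seed(1)                       # arbitrary seed, only for reproducibility
n <- nrow(iris)
train_idx <- sample(n, size = floor(0.8 * n))  # row numbers of the training part

train <- iris[train_idx, ]        # about 80% of the rows
test  <- iris[-train_idx, ]       # the remaining rows

nrow(train)
nrow(test)
```

Applied to 819 rows, `floor(0.8 * 819)` gives 655 training rows and leaves 164 for testing, which agrees with the sizes reported below.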

The following models the lithofacies from the well log and core data with LDA. It also predicts the discrete class and the posterior distribution for the first 10 observations, and draws boxplots of the lithofacies predicted by the LDA algorithm.

Training data:

## [1] 655

Test data:

## [1] 164

##         Length Class  Mode     
## prior     8    -none- numeric  
## counts    8    -none- numeric  
## means   104    -none- numeric  
## scaling  91    -none- numeric  
## lev       8    -none- character
## svd       7    -none- numeric  
## N         1    -none- numeric  
## call      3    -none- call

## [1] "prior"   "counts"  "means"   "scaling" "lev"     "svd"     "N"      
## [8] "call"

Prior distribution of Facies:

##          F1         F10          F2          F3          F5          F7 
## 0.169465649 0.192366412 0.012213740 0.083969466 0.134351145 0.003053435 
##          F8          F9 
## 0.207633588 0.196946565

Means of the well log and core variables within each facies:

##        depth  caliper  ind.deep   ind.med    gamma      phi.N    R.deep
## F1  5694.581 8.685847  47.68522  46.57178 56.48741 0.09327027 37.563919
## F10 5776.806 8.769365 261.51079 287.18687 73.30020 0.24548413  4.518214
## F2  5725.750 8.823875 142.76850 165.96738 60.97350 0.25100000  7.658375
## F3  5772.318 8.771855 260.11147 262.34173 64.82567 0.21241818  4.391618
## F5  5838.341 8.573489  21.44975  25.82666 34.85292 0.16743182 76.979716
## F7  5855.750 8.588000 149.84850 110.46250 36.36050 0.23000000  6.796500
## F8  5939.551 8.524154 227.67058 216.68392 40.70689 0.26244118 38.182978
## F9  6016.128 8.568403 602.58357 564.45968 53.83036 0.25566667  1.661171
##         R.med        SP density.corr  density phi.core    k.core
## F1  34.091045 -47.00009 -0.037396396 1.908054 31.60519 2334.6776
## F10  3.907310 -26.03356  0.002880952 2.255119 20.57050  833.2559
## F2   6.576875 -39.66725  0.010125000 2.203375 17.85878  338.5453
## F3   4.204400 -27.99718 -0.005290909 2.182418 23.47783 1098.5071
## F5  62.922545 -40.03389 -0.015545455 2.055750 27.53975 4957.0287
## F7   9.283500   7.69650 -0.002500000 2.025500 27.65975 3908.6753
## F8  30.980191 -27.79474 -0.007897059 2.036934 30.85639 3192.2026
## F9   1.774806 -25.17478 -0.006000000 2.118899 27.17926  774.3533

##  F1 F10  F2  F3  F5  F7  F8  F9 
## 111 126   8  55  88   2 136 129

LDA model validation: the fraction of samples classified correctly within each facies.

##        F1       F10        F2        F3        F5        F7        F8 
## 0.9729730 0.8128655 0.7500000 0.6000000 0.7706422 0.7777778 0.9402174 
##        F9 
## 1.0000000

Overall proportion classified correctly:

## [1] 0.8815629
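Per-class and overall accuracies like those above can be read off a confusion table. The sketch below shows one common way to do this in R, again with `iris` as a stand-in dataset. It scores the model on its own training rows for brevity, which gives an optimistic estimate; held-out rows would be passed as `newdata` in practice.

```r
library(MASS)

# Stand-in data: built-in iris replaces the Karpur logs (assumption for illustration).
data(iris)
fit  <- lda(Species ~ ., data = iris)
pred <- predict(fit, newdata = iris)$class

# Confusion table: rows are observed classes, columns are predicted classes.
tab <- table(observed = iris$Species, predicted = pred)

# Fraction correct within each observed class (row-wise proportions, diagonal).
per_class <- diag(prop.table(tab, 1))

# Overall fraction correct (diagonal share of the whole table).
total <- sum(diag(prop.table(tab)))

per_class
total
```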

LDA cross-validation: rather than splitting the data into training and test sets, an alternative way to measure the performance of the model is to ask R to perform cross-validation. With the “lda” function, this is done by passing the argument CV=TRUE, which returns leave-one-out cross-validated classes and posterior probabilities:
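A minimal sketch of the `CV = TRUE` option follows, using `iris` as a stand-in for the Karpur data. With this argument, `lda()` does not return a fitted model. It returns a list whose `class` and `posterior` components hold the leave-one-out prediction for every row.

```r
library(MASS)

# Stand-in data: built-in iris replaces the Karpur logs (assumption for illustration).
data(iris)

# Leave-one-out cross-validation: each row is predicted by a model fitted without it.
cv <- lda(Species ~ ., data = iris, CV = TRUE)

# Confusion table of observed against cross-validated classes.
table(observed = iris$Species, predicted = cv$class)

# Cross-validated overall accuracy.
loo_acc <- mean(cv$class == iris$Species)
loo_acc

# Cross-validated posterior probabilities, one column per class.
head(round(cv$posterior, 3))
```

Because no row is scored by a model that saw it during fitting, this estimate is usually a little lower than the accuracy computed on the training rows.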

Facies Discrimination by LDA:

Visualizing the predicted posterior distribution of the eight facies.

Combining the posterior distributions of the eight lithofacies in one plot.

## Warning in bxp(structure(list(stats = structure(c(5667, 5680.75, 5694.5, :
## some notches went outside hinges ('box'): maybe set notch=FALSE

## Warning in bxp(structure(list(stats = structure(c(5667.5, 5681.75,
## 5696.25, : some notches went outside hinges ('box'): maybe set notch=FALSE
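The plotting code is not shown, so the sketch below is only one possible way to draw the posterior probabilities of all classes in a single figure. It uses `bwplot()` from `lattice` with `iris` as a stand-in dataset. The data frame `long` and its column names are hypothetical. The warnings above come from notched boxplots whose notches extend past the box; they are harmless and disappear with `notch = FALSE`.

```r
library(MASS)
library(lattice)

# Stand-in data: built-in iris replaces the Karpur logs (assumption for illustration).
data(iris)
fit  <- lda(Species ~ ., data = iris)
post <- predict(fit, newdata = iris)$posterior   # one column per class

# Reshape to long format: one row per (observation, class) pair.
# as.vector() stacks the matrix column by column, so labels repeat per column.
long <- data.frame(
  prob     = as.vector(post),
  class    = factor(rep(colnames(post), each = nrow(post)), levels = colnames(post)),
  observed = rep(iris$Species, times = ncol(post))
)

# One panel per class: posterior probability of that class, grouped by observed class.
p <- bwplot(prob ~ observed | class, data = long,
            xlab = "Observed class", ylab = "Posterior probability")
# Call print(p) to render the figure when running this as a script.
```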

References

Pires, A. M. and Branco, J. A. (2010). Projection-pursuit approach to robust linear discriminant analysis. Journal of Multivariate Analysis 101: 2464-2485.
Al-Mudhafar, W. J. (2015). Integrating Component Analysis & Classification Techniques for Comparative Prediction of Continuous & Discrete Lithofacies Distributions. Offshore Technology Conference. doi:10.4043/25806-MS.
Karpur, L., Lake, L., and Sepehrnoori, K. (2000). Probability Logs for Facies Classification. In Situ 24(1): 57.
Al-Mudhafar, W. J. (2014). Multinomial Logistic Regression for Bayesian Estimation of Vertical Facies Modeling in Heterogeneous Sandstone Reservoirs. Offshore Technology Conference. doi:10.4043/24732-MS.
Al-Mudhafar, W. J. (2015). Applied Geostatistics in R: 1. Naive Bayes Classifier for Lithofacies Modeling in a Sandstone Formation. RPubs.
Al-Mudhafar, W. J. (2015). Applied Geostatistics in R: 2. Logistic Boosting Regression (LogitBoost) for Multinomial Lithofacies Classification in a Sandstone Formation. RPubs.

Watheq J. Al-Mudhafar, Craft and Hawkins Department of Petroleum Engineering, Louisiana State University

August 4, 2015