CGR Calculation in Montney and Duvernay with Machine Learning

Prepared by: Raya Matoorian Updated: 2019/11/03

Synopsis

This report describes a procedure for predicting the condensate-gas ratio (CGR) in the Montney and Duvernay formations of the Western Canada Sedimentary Basin (WCSB).

Introduction

Traditionally, operators use a constant CGR to tie each well to the condensate production measured at the plant and then back-allocate that production to the well. This approach is unreliable because CGR changes over time and with location. Current methods of CGR estimation are either empirical equations derived from specific field data, which typically have only local applicability, or laboratory and field measurements, which are time-consuming and costly. A variety of thermochemical parameters control the behavior of condensate in the reservoir and in surface facilities.

In the current research, we integrate different datasets including well data, reservoir data, fluids (oil, gas, and condensate) compositional analysis and production data to estimate CGR.

Data Acquisition

For this research, five datasets were collected from wells drilled in the Montney and Duvernay formations across Alberta (AB) and British Columbia (BC). These datasets include:

  • Wellheader data
  • Monthly well production data
  • Gas analysis
  • Oil analysis
  • Condensate analysis

Loading and Processing the Raw Data

The data were obtained from the geoSCOUT software by geoLOGIC Systems, exported as CSV (comma-separated values) files, and then imported into R.

The read.csv() function in R reads large CSV files and handles compressed files automatically. Loading the data is time-consuming; to avoid repeating it on every run, we can set the cache=TRUE option in the code chunk.

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## corrplot 0.84 loaded
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
  prod <- read.csv("./data/prod.csv")
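As a minimal sketch of the loading behaviour described above, the following writes a small gzip-compressed CSV to a temporary file and reads it back with read.csv(). The toy table and its values are invented for illustration and only borrow column names from the report. The stringsAsFactors = TRUE argument is shown because R 4.0 and later no longer convert text columns to factors by default, whereas the factor columns in this report suggest the older default was in effect.

```r
# Toy stand-in for the geoSCOUT export (hypothetical values).
toy <- data.frame(
  UWI = c("100/05-04-069-21W5/00", "100/05-07-069-21W5/00"),
  Formation = c("TRmontney", "Dduvernay"),
  GAS = c(4251, 15250)
)

# Write a gzip-compressed CSV to a temporary file.
path <- tempfile(fileext = ".csv.gz")
con <- gzfile(path, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() decompresses transparently; stringsAsFactors = TRUE
# turns text columns into factors, as in the str() output of this report.
prod_toy <- read.csv(path, stringsAsFactors = TRUE)
str(prod_toy)
```

Setting the argument explicitly keeps the later factor-based summaries reproducible across R versions.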

After reading the required datasets, we can summarize them. The summary() function reports measures of central tendency and spread for continuous features (minimum, quartiles, median, mean, and maximum). For categorical (factor) features, it reports the counts of the most frequent levels.

  dim(prod)
## [1] 819940     15
  summary(prod)
##                     UWI                         Field       
##  100/05-04-069-21W5/00:   758   HERITAGE           :168105  
##  100/05-07-069-21W5/00:   732   KAYBOB SOUTH       :100238  
##  100/05-08-069-21W5/00:   729   NORTHERN MONTNEY   : 72013  
##  100/09-07-069-21W5/00:   718   POUCE COUPE SOUTH  : 64008  
##  100/07-07-069-21W5/00:   700   ANTE CREEK NORTH   : 48451  
##  100/14-29-069-19W5/00:   688   STURGEON LAKE SOUTH: 44953  
##  (Other)              :815615   (Other)            :322172  
##                   Pool                          Status      
##  MONTNEY A          :248415   Active Gas Production:217129  
##  TRIASSIC A         : 83269   Pumping Crude Oil    :146906  
##  COMMINGLED POOL 012: 46536   Pumping Gas          :102620  
##  MONTNEY C          : 39056   Flowing Gas          : 95657  
##  COMMINGLED MFP9529 : 34611   Suspended Crude Oil  : 44289  
##  COMMINGLED POOL 005: 31033   Suspended Gas        : 43804  
##  (Other)            :337020   (Other)              :169535  
##      Formation           Year          Month             GAS         
##           :     1   Min.   :1955   Min.   : 1.000   Min.   :      0  
##  Dduvernay: 28709   1st Qu.:2008   1st Qu.: 3.000   1st Qu.:   4251  
##  TRmontney:791230   Median :2014   Median : 6.000   Median :  15250  
##                     Mean   :2010   Mean   : 6.485   Mean   :  33020  
##                     3rd Qu.:2017   3rd Qu.:10.000   3rd Qu.:  41088  
##                     Max.   :2019   Max.   :12.000   Max.   :1189565  
##                     NA's   :1      NA's   :1        NA's   :1        
##       OIL              CND               WTR                 BOE         
##  Min.   :     0   Min.   :    0.0   Min.   :      0.0   Min.   :      0  
##  1st Qu.:     0   1st Qu.:    0.0   1st Qu.:    201.4   1st Qu.:  52622  
##  Median :     0   Median :    0.0   Median :   1154.5   Median : 133757  
##  Mean   :  9279   Mean   :  626.5   Mean   :   7693.9   Mean   : 256683  
##  3rd Qu.:  4031   3rd Qu.:   55.7   3rd Qu.:   4284.4   3rd Qu.: 300342  
##  Max.   :534420   Max.   :76863.5   Max.   :1540523.4   Max.   :7001594  
##  NA's   :1        NA's   :1         NA's   :1           NA's   :1        
##       FLD               HRS              CGR           
##  Min.   :      0   Min.   :     0   Min.   :      0.0  
##  1st Qu.:    623   1st Qu.:  9216   1st Qu.:      0.0  
##  Median :   3017   Median : 25098   Median :      0.0  
##  Mean   :  17600   Mean   : 39143   Mean   :     30.5  
##  3rd Qu.:  10690   3rd Qu.: 52509   3rd Qu.:      0.3  
##  Max.   :1734095   Max.   :411765   Max.   :1276141.0  
##  NA's   :1         NA's   :1        NA's   :1

If needed, the categorical features (columns 1 to 5) can be converted to the factor data type. The step is commented out here because these columns are already factors after import:

  #catList <- c(1:5)
  #prod[catList] <- lapply(prod[catList], as.factor)

Below is a summary of the data types in the table:

  str(prod)
## 'data.frame':    819940 obs. of  15 variables:
##  $ UWI      : Factor w/ 10780 levels "","100/01-01-059-21W5/04",..: 3482 3482 3482 3482 3482 3482 3482 3482 3482 3482 ...
##  $ Field    : Factor w/ 146 levels "","ANSELL","ANTE CREEK NORTH",..: 133 133 133 133 133 133 133 133 133 133 ...
##  $ Pool     : Factor w/ 389 levels "","2WS UND","BEAVERHILL LAKE D",..: 85 85 85 85 85 85 85 85 85 85 ...
##  $ Status   : Factor w/ 62 levels "","Abandon Gas Production",..: 31 31 31 31 31 31 31 31 31 31 ...
##  $ Formation: Factor w/ 3 levels "","Dduvernay",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Year     : int  2018 2018 2018 2018 2018 2018 2018 2019 2019 2019 ...
##  $ Month    : int  6 7 8 9 10 11 12 1 2 3 ...
##  $ GAS      : num  37.5 46.2 145.5 219.5 282.6 ...
##  $ OIL      : num  478 716 1854 3168 4309 ...
##  $ CND      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ WTR      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BOE      : num  3227 4776 12524 21228 28777 ...
##  $ FLD      : num  478 716 1854 3168 4309 ...
##  $ HRS      : num  264 379 964 1633 2356 ...
##  $ CGR      : num  0 0 0 0 0 0 0 0 0 0 ...

As the summary shows, a few records contain missing values (one NA in each numeric column), so we drop the incomplete records with na.omit(). We then remove the records with CGR = 0:

  prod <- na.omit(prod)
  prod <- prod %>% filter(CGR > 0)
  dim(prod)
## [1] 339119     15

Number of wells in each formation:

  ddply(prod,~Formation,summarise,Wells=length(unique(UWI)))
##   Formation Wells
## 1 Dduvernay   627
## 2 TRmontney  5122
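The package start-up messages earlier in the report show that plyr and dplyr both export summarise(), so the ddply() call depends on the order in which the packages were loaded. As a hedged alternative, the same count of distinct wells per group can be sketched in base R with aggregate(). The toy data frame below is invented for illustration.

```r
# Toy production records: two Montney wells (A appears twice) and one Duvernay well.
toy <- data.frame(
  Formation = c("TRmontney", "TRmontney", "TRmontney", "Dduvernay"),
  UWI = c("A", "A", "B", "C")
)

# Count distinct UWI values within each formation.
wells <- aggregate(UWI ~ Formation, data = toy,
                   FUN = function(x) length(unique(x)))
names(wells)[2] <- "Wells"
print(wells)
```

With dplyr already loaded, group_by(Formation) followed by summarise(Wells = n_distinct(UWI)) is an equivalent formulation.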

We take the base-10 logarithm of the CGR column:

  prod$Log_CGR <- log10(prod$CGR)
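The order of the cleaning steps matters: log10(0) is -Inf in R, so zero-CGR records have to be removed before the transform. The sketch below replays the three steps (drop NAs, keep CGR > 0, take log10) on an invented four-row table. The values are illustrative only.

```r
# Toy records: one zero CGR, one missing CGR, two valid values (hypothetical).
toy <- data.frame(
  GAS = c(100, 50, 80, 40),
  CGR = c(0, 0.3, NA, 12.5)
)

log10(0)   # -Inf: the reason zero-CGR rows are filtered first

clean <- na.omit(toy)                         # drop rows containing NA
clean <- clean[clean$CGR > 0, , drop = FALSE] # keep strictly positive CGR
clean$Log_CGR <- log10(clean$CGR)
print(clean)
```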

Number of wells in each Pool:

  ddply(prod,~Pool,summarise,Wells=length(unique(UWI)))
##                         Pool Wells
## 1                    2WS UND     8
## 2                  BH LK UND     2
## 3          BLUESKY-MONTNEY E     1
## 4                   BNFF UND     1
## 5         COMMINGLED MFP9516    26
## 6         COMMINGLED MFP9529   253
## 7        COMMINGLED POOL 001    22
## 8        COMMINGLED POOL 002    11
## 9        COMMINGLED POOL 003    78
## 10       COMMINGLED POOL 004     8
## 11       COMMINGLED POOL 005     6
## 12       COMMINGLED POOL 006     6
## 13       COMMINGLED POOL 007     1
## 14       COMMINGLED POOL 008     2
## 15       COMMINGLED POOL 009     6
## 16       COMMINGLED POOL 010     1
## 17       COMMINGLED POOL 012   246
## 18       COMMINGLED POOL 013     1
## 19       COMMINGLED POOL 014     4
## 20       COMMINGLED POOL 017     1
## 21       COMMINGLED POOL 018     3
## 22       COMMINGLED POOL 019     4
## 23       COMMINGLED POOL 020     1
## 24       COMMINGLED POOL 021     3
## 25       COMMINGLED POOL 022     1
## 26       COMMINGLED POOL 023     4
## 27       COMMINGLED POOL 024    25
## 28       COMMINGLED POOL 025     4
## 29       COMMINGLED POOL 027     1
## 30       COMMINGLED POOL 030     2
## 31       COMMINGLED POOL 036     7
## 32                    DOIG C     1
## 33  DOIG PHOSPHATE-MONTNEY A   224
## 34                  DOIG UND     1
## 35                  DUNV UND    12
## 36                DUVERNAY A     6
## 37               DUVERNAY AA     2
## 38              DUVERNAY AAA     1
## 39                DUVERNAY B     3
## 40               DUVERNAY BB     2
## 41              DUVERNAY BBB     1
## 42                DUVERNAY C     4
## 43               DUVERNAY CC     2
## 44              DUVERNAY CCC     1
## 45                DUVERNAY D     4
## 46               DUVERNAY DD     2
## 47              DUVERNAY DDD     1
## 48                DUVERNAY E     4
## 49               DUVERNAY EE     2
## 50              DUVERNAY EEE     1
## 51                DUVERNAY F     4
## 52               DUVERNAY FF     2
## 53                DUVERNAY G     4
## 54               DUVERNAY GG     3
## 55                DUVERNAY H     4
## 56               DUVERNAY HH     2
## 57                DUVERNAY I     3
## 58               DUVERNAY II     2
## 59                DUVERNAY J     3
## 60               DUVERNAY JJ     2
## 61                DUVERNAY K     3
## 62               DUVERNAY KK     2
## 63                DUVERNAY L     2
## 64               DUVERNAY LL     2
## 65                DUVERNAY M     2
## 66               DUVERNAY MM     2
## 67                DUVERNAY N     3
## 68               DUVERNAY NN     2
## 69                DUVERNAY O     3
## 70               DUVERNAY OO     2
## 71                DUVERNAY P     3
## 72               DUVERNAY PP     2
## 73                DUVERNAY Q     3
## 74               DUVERNAY QQ     2
## 75                DUVERNAY R     3
## 76               DUVERNAY RR     2
## 77                DUVERNAY S     2
## 78               DUVERNAY SS     2
## 79                DUVERNAY T     2
## 80               DUVERNAY TT     2
## 81                DUVERNAY U     2
## 82              DUVERNAY UND   104
## 83               DUVERNAY UU     1
## 84                DUVERNAY V     2
## 85               DUVERNAY VV     2
## 86                DUVERNAY W     2
## 87               DUVERNAY WW     2
## 88                DUVERNAY X     2
## 89               DUVERNAY XX     2
## 90                DUVERNAY Y     2
## 91               DUVERNAY YY     1
## 92                DUVERNAY Z     2
## 93               DUVERNAY ZZ     1
## 94                  DUVY UND   278
## 95                 HALFWAY U     1
## 96                  MONT UND   631
## 97                 MONTNEY A  2729
## 98               MONTNEY A2A     5
## 99               MONTNEY A3A     3
## 100              MONTNEY A5A     1
## 101              MONTNEY A6A     1
## 102              MONTNEY A7A     1
## 103               MONTNEY AA     8
## 104              MONTNEY AAA     6
## 105                MONTNEY B    48
## 106              MONTNEY B2B     6
## 107              MONTNEY B3B     1
## 108              MONTNEY B5B     1
## 109              MONTNEY B6B     1
## 110               MONTNEY BB     6
## 111              MONTNEY BBB     3
## 112                MONTNEY C    32
## 113              MONTNEY C2C     3
## 114              MONTNEY C5C     1
## 115              MONTNEY C6C     1
## 116              MONTNEY C7C     1
## 117               MONTNEY CC     6
## 118              MONTNEY CCC     3
## 119                MONTNEY D    27
## 120              MONTNEY D2D     3
## 121              MONTNEY D5D     1
## 122              MONTNEY D6D     1
## 123              MONTNEY D7D     1
## 124               MONTNEY DD     6
## 125              MONTNEY DDD     3
## 126                MONTNEY E    21
## 127              MONTNEY E2E     3
## 128              MONTNEY E3E     2
## 129              MONTNEY E5E     1
## 130              MONTNEY E6E     1
## 131              MONTNEY E7E     1
## 132               MONTNEY EE     7
## 133              MONTNEY EEE     2
## 134                MONTNEY F    52
## 135              MONTNEY F2F     3
## 136              MONTNEY F3F     1
## 137              MONTNEY F5F     1
## 138              MONTNEY F6F     1
## 139              MONTNEY F7F     1
## 140               MONTNEY FF     7
## 141              MONTNEY FFF     3
## 142                MONTNEY G    12
## 143              MONTNEY G2G     3
## 144              MONTNEY G5G     1
## 145              MONTNEY G6G     1
## 146              MONTNEY G7G     1
## 147               MONTNEY GG     5
## 148              MONTNEY GGG     3
## 149                MONTNEY H     8
## 150              MONTNEY H2H     3
## 151              MONTNEY H3H     2
## 152              MONTNEY H5H     1
## 153              MONTNEY H6H     1
## 154              MONTNEY H7H     1
## 155               MONTNEY HH     7
## 156              MONTNEY HHH     3
## 157                MONTNEY I    12
## 158              MONTNEY I2I     4
## 159              MONTNEY I3I     2
## 160              MONTNEY I4I     1
## 161              MONTNEY I5I     1
## 162              MONTNEY I6I     1
## 163              MONTNEY I7I     1
## 164               MONTNEY II     7
## 165              MONTNEY III     4
## 166                MONTNEY J    12
## 167              MONTNEY J2J     3
## 168              MONTNEY J5J     1
## 169              MONTNEY J6J     1
## 170              MONTNEY J7J     1
## 171               MONTNEY JJ     6
## 172              MONTNEY JJJ     6
## 173                MONTNEY K    13
## 174              MONTNEY K2K     2
## 175              MONTNEY K5K     1
## 176              MONTNEY K6K     1
## 177              MONTNEY K7K     1
## 178               MONTNEY KK     7
## 179              MONTNEY KKK     4
## 180                MONTNEY L    10
## 181              MONTNEY L2L     3
## 182              MONTNEY L5L     1
## 183              MONTNEY L6L     1
## 184               MONTNEY LL     7
## 185              MONTNEY LLL     4
## 186                MONTNEY M     6
## 187              MONTNEY M2M     2
## 188              MONTNEY M3M     1
## 189              MONTNEY M5M     1
## 190              MONTNEY M6M     1
## 191              MONTNEY M7M     1
## 192               MONTNEY MM     7
## 193              MONTNEY MMM     4
## 194                MONTNEY N    10
## 195              MONTNEY N2N     2
## 196              MONTNEY N3N     1
## 197              MONTNEY N5N     1
## 198              MONTNEY N6N     1
## 199               MONTNEY NN     6
## 200              MONTNEY NNN     5
## 201                MONTNEY O     7
## 202              MONTNEY O2O     1
## 203              MONTNEY O3O     1
## 204              MONTNEY O6O     1
## 205              MONTNEY O7O     1
## 206               MONTNEY OO     5
## 207              MONTNEY OOO     4
## 208                MONTNEY P     5
## 209              MONTNEY P2P     1
## 210              MONTNEY P4P     1
## 211              MONTNEY P5P     1
## 212              MONTNEY P6P     1
## 213               MONTNEY PP     4
## 214              MONTNEY PPP     6
## 215                MONTNEY Q     4
## 216              MONTNEY Q2Q     1
## 217              MONTNEY Q3Q     1
## 218              MONTNEY Q4Q     1
## 219              MONTNEY Q5Q     1
## 220              MONTNEY Q6Q     1
## 221               MONTNEY QQ     7
## 222              MONTNEY QQQ     3
## 223                MONTNEY R     7
## 224              MONTNEY R2R     1
## 225              MONTNEY R4R     1
## 226              MONTNEY R5R     1
## 227              MONTNEY R6R     1
## 228               MONTNEY RR     3
## 229              MONTNEY RRR     5
## 230                MONTNEY S    10
## 231              MONTNEY S3S     1
## 232              MONTNEY S4S     1
## 233              MONTNEY S5S     1
## 234              MONTNEY S6S     1
## 235               MONTNEY SS     5
## 236              MONTNEY SSS     4
## 237                MONTNEY T     6
## 238              MONTNEY T2T     1
## 239              MONTNEY T3T     1
## 240              MONTNEY T4T     1
## 241              MONTNEY T5T     1
## 242              MONTNEY T6T     1
## 243               MONTNEY TT     4
## 244              MONTNEY TTT     4
## 245                MONTNEY U     5
## 246              MONTNEY U2U     1
## 247              MONTNEY U3U     2
## 248              MONTNEY U4U     1
## 249              MONTNEY U5U     1
## 250              MONTNEY U6U     1
## 251               MONTNEY UU     5
## 252              MONTNEY UUU     4
## 253                MONTNEY V     4
## 254              MONTNEY V2V     1
## 255              MONTNEY V3V     2
## 256              MONTNEY V4V     1
## 257              MONTNEY V5V     1
## 258              MONTNEY V6V     1
## 259               MONTNEY VV     5
## 260              MONTNEY VVV     8
## 261                MONTNEY W    11
## 262              MONTNEY W2W     4
## 263              MONTNEY W4W     1
## 264              MONTNEY W5W     1
## 265              MONTNEY W6W     1
## 266               MONTNEY WW     5
## 267              MONTNEY WWW     3
## 268                MONTNEY X     5
## 269              MONTNEY X2X     1
## 270              MONTNEY X3X     2
## 271              MONTNEY X4X     1
## 272              MONTNEY X5X     1
## 273              MONTNEY X6X     1
## 274               MONTNEY XX     5
## 275              MONTNEY XXX     2
## 276                MONTNEY Y     4
## 277              MONTNEY Y2Y     1
## 278              MONTNEY Y3Y     2
## 279              MONTNEY Y4Y     1
## 280              MONTNEY Y5Y     1
## 281              MONTNEY Y6Y     1
## 282               MONTNEY YY     3
## 283              MONTNEY YYY     3
## 284                MONTNEY Z     7
## 285              MONTNEY Z2Z     3
## 286              MONTNEY Z3Z     1
## 287              MONTNEY Z4Z     1
## 288              MONTNEY Z5Z     1
## 289              MONTNEY Z6Z     1
## 290               MONTNEY ZZ     3
## 291              MONTNEY ZZZ     2
## 292                   TD UND     3
## 293     TEMP COMMINGLED CODE    97
## 294               TRIASSIC A     9
## 295               TRIASSIC B     2
## 296               TRIASSIC C     8
## 297               TRIASSIC D     2
## 298               TRIASSIC E     4
## 299               TRIASSIC H     3
## 300               TRIASSIC I     1
## 301               TRIASSIC K     3
## 302              TRIASSIC KK     1
## 303              TRIASSIC NN     1
## 304               TRIASSIC R     1
## 305              TRIASSIC VV     1
## 306              TRIASSIC WW     1
## 307              TRIASSIC XX    10
## 308               TRIASSIC Y     1
## 309                UNDEFINED     1
## 310           UNDEFINED POOL     1

Number of wells in each field:

  ddply(prod,~Field,summarise,Wells=length(unique(UWI)))
##                  Field Wells
## 1               ANSELL     2
## 2     ANTE CREEK NORTH    14
## 3              ANTHONY     2
## 4               BEATON     2
## 5               BELLOY    17
## 6        BERLAND RIVER     5
## 7   BERLAND RIVER WEST     3
## 8               BERWYN     4
## 9             BEZANSON     1
## 10            BIGSTONE    13
## 11           BLUEBERRY     6
## 12       BOUNDARY LAKE     1
## 13 BOUNDARY LAKE SOUTH     3
## 14       BRAZEAU RIVER     1
## 15               CECIL     3
## 16           CHICKADEE     7
## 17               CHIME     2
## 18               CINDY     5
## 19       CLEAR PRAIRIE     3
## 20                CULP     6
## 21                DAHL     4
## 22            DIMSDALE     2
## 23          DIXONVILLE     1
## 24            DUNVEGAN     9
## 25           EAGLESHAM     7
## 26     EAGLESHAM NORTH     4
## 27               EDSON     5
## 28            ELMWORTH   214
## 29             FERRIER    22
## 30       FLATROCK WEST     1
## 31           FOX CREEK    56
## 32          GARRINGTON     1
## 33              GEORGE     1
## 34               GILBY     2
## 35    GIROUXVILLE EAST     2
## 36          GORDONDALE     6
## 37      GRANDE PRAIRIE     7
## 38             GRIZZLY    17
## 39             HAMBURG     1
## 40            HERITAGE  1694
## 41                JACK     1
## 42               KAKWA   396
## 43                KARR   113
## 44              KAYBOB   151
## 45        KAYBOB SOUTH   460
## 46            LA GLACE    92
## 47               LATOR     5
## 48              LELAND     8
## 49            MCKINLEY     8
## 50      MEDICINE LODGE     1
## 51            MULLIGAN     2
## 52             MUSKRAT     1
## 53        NORMANDVILLE     2
## 54    NORTHERN MONTNEY  1212
## 55                 OAK     2
## 56            PARADISE     2
## 57            PARKLAND     1
## 58             PEMBINA    16
## 59              PEORIA     4
## 60                PICA     1
## 61          PINE CREEK    22
## 62              PLACID    69
## 63         POUCE COUPE    37
## 64   POUCE COUPE SOUTH   352
## 65            PROGRESS     1
## 66               REINE     1
## 67           RESTHAVEN    18
## 68              ROXANA     3
## 69             RYCROFT     3
## 70        SADDLE HILLS     1
## 71               SAXON    16
## 72           SIMONETTE    31
## 73            SINCLAIR    35
## 74               SMOKY    22
## 75             SOLOMON     2
## 76               STUMP     1
## 77       STURGEON LAKE     1
## 78 STURGEON LAKE SOUTH     5
## 79            SUNDANCE     8
## 80             TANGENT    10
## 81    TONY CREEK NORTH    42
## 82           UNDEFINED     2
## 83            VALHALLA    78
## 84              WAPITI    99
## 85          WASKAHIGAN   108
## 86             WEMBLEY   103
## 87            WHITELAW     3
## 88          WILD RIVER    11
## 89     WILLESDEN GREEN    28
## 90             WORSLEY     3

Break down records based on formations:

  montney <- prod %>% filter(Formation == "TRmontney")
  duvernay <- prod %>% filter(Formation == "Dduvernay")

To free memory, it is better to remove the production table:

  rm(prod)

Remove the UWI and Formation columns:

  montney <- within(montney, rm("UWI", "Formation"))
  duvernay <- within(duvernay, rm("UWI", "Formation"))

Feature Selection

High-dimensional data, in terms of the number of features, is increasingly common in machine learning problems. To extract useful information from such large volumes of data, you have to use statistical techniques to reduce noise and redundancy, because you often do not need every feature at your disposal to train a model. You can improve your model by feeding in only those features that are uncorrelated and non-redundant. This is where feature selection plays an important role. It not only helps train your model faster but also reduces the complexity of the model, makes it easier to interpret, and can improve accuracy, precision, recall, or whatever the performance metric may be.

Generally, whenever you want to reduce the dimensionality of the data, you come across methods such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD), etc. So it is natural to ask why other feature selection methods are needed at all. The limitation of these techniques is that they are unsupervised: PCA, for example, uses only the variance in the data to find the components. They do not take into account the relationship between the feature values and the target class or values. There are also certain assumptions, such as normality, associated with such methods, which may require transforming the data before applying them. These constraints do not suit all kinds of data.

There are three types of feature selection methods in general:

Filter Methods

Filter methods are generally used as a preprocessing step. The selection of features is independent of any machine learning algorithm. Instead, the features are selected on the basis of their scores in various statistical tests of their correlation with the outcome variable. Some common filter methods are correlation metrics (Pearson, Spearman, distance), the chi-squared test, ANOVA, Fisher's score, etc.

USING CORRELATION

Correlation gives us the degree of association between two numeric variables. You can use the cor() function to get the correlation values.

  n <- ncol(montney)
  rc <- cor(montney[4:n])
  print(rc)
##                 Year         Month          GAS           OIL          CND
## Year     1.000000000 -0.0638536156 -0.021355483  0.0041355183  0.188154738
## Month   -0.063853616  1.0000000000 -0.002036995 -0.0003492632 -0.003776291
## GAS     -0.021355483 -0.0020369952  1.000000000 -0.0179260251  0.011423265
## OIL      0.004135518 -0.0003492632 -0.017926025  1.0000000000 -0.004934274
## CND      0.188154738 -0.0037762911  0.011423265 -0.0049342745  1.000000000
## WTR      0.276122099 -0.0063835769  0.181353013 -0.0044083627  0.243248000
## BOE     -0.008180572 -0.0022946576  0.997570647 -0.0143809016  0.080947621
## FLD      0.292531967 -0.0063967202  0.117991061  0.0301090484  0.803341833
## HRS      0.031950768 -0.0014401142  0.461022766  0.0030347590 -0.119246877
## CGR      0.002176380  0.0015113088 -0.007882791  0.0005296408  0.001721925
## Log_CGR  0.111420725  0.0008074448 -0.282281108  0.0221322743  0.447774018
##                  WTR          BOE           FLD          HRS           CGR
## Year     0.276122099 -0.008180572  0.2925319666  0.031950768  0.0021763795
## Month   -0.006383577 -0.002294658 -0.0063967202 -0.001440114  0.0015113088
## GAS      0.181353013  0.997570647  0.1179910608  0.461022766 -0.0078827913
## OIL     -0.004408363 -0.014380902  0.0301090484  0.003034759  0.0005296408
## CND      0.243248000  0.080947621  0.8033418326 -0.119246877  0.0017219255
## WTR      1.000000000  0.197688892  0.7719838235 -0.069999221 -0.0031924093
## BOE      0.197688892  1.000000000  0.1736300860  0.451282529 -0.0077360773
## FLD      0.771983824  0.173630086  1.0000000000 -0.120834554 -0.0008108586
## HRS     -0.069999221  0.451282529 -0.1208345538  1.000000000 -0.0110151990
## CGR     -0.003192409 -0.007736077 -0.0008108586 -0.011015199  1.0000000000
## Log_CGR  0.097874404 -0.250149759  0.3538223296 -0.348229996  0.0485140763
##               Log_CGR
## Year     0.1114207254
## Month    0.0008074448
## GAS     -0.2822811083
## OIL      0.0221322743
## CND      0.4477740181
## WTR      0.0978744039
## BOE     -0.2501497590
## FLD      0.3538223296
## HRS     -0.3482299961
## CGR      0.0485140763
## Log_CGR  1.0000000000
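The matrix above shows that GAS and BOE are almost perfectly correlated (about 0.998), which makes one of them redundant as a predictor. A minimal sketch of how such pairs could be listed automatically is given below. It runs on synthetic data, and the 0.9 threshold and the linear GAS-to-BOE relation are assumptions chosen only for illustration.

```r
set.seed(42)
n <- 200
gas <- rexp(n, rate = 1 / 30000)

# Synthetic stand-in: BOE is built as a near-linear function of GAS (assumption).
toy <- data.frame(
  GAS = gas,
  BOE = gas * 6 + rnorm(n, sd = 100),
  HRS = runif(n, 0, 744)
)

rc_toy <- cor(toy)

# Look only at the upper triangle so each pair is reported once.
high <- which(abs(rc_toy) > 0.9 & upper.tri(rc_toy), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(rc_toy)[high[, "row"]],
  var2 = colnames(rc_toy)[high[, "col"]],
  r = rc_toy[high]
)
print(high_pairs)
```

Because CGR is heavily skewed, cor(..., method = "spearman") may also be worth comparing against the default Pearson coefficient.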

However, the cor() function is elementary, so we also use corrplot(), which presents the same information graphically and makes it easier to infer which variables are important.

  corrplot(rc, method="color", type = "lower", is.corr=FALSE, addCoef.col = "red", tl.col="black", tl.srt=45, number.cex = .6)

Two basic arguments of the corrplot() function that you should know are:

  1. method = selects the type of visualization: circle, square, ellipse, number, shade, color, or pie.

  2. type = selects whether to draw the full matrix, the upper triangle, or the lower triangle. The default is the full matrix; accepted options are full (default), upper, or lower.

USING HYPOTHESIS TESTING

You can use hypothesis testing to check whether an independent variable has a significant relationship with the dependent variable. For example, we want to check whether CGR is related to the gas/oil/condensate production rate.

  1. Null Hypothesis is that CGR has no relationship with the gas/oil/condensate production rate.
  2. Alternate Hypothesis is that CGR has a relationship with the gas/oil/condensate production rate.

As we have two variables here and both are continuous, we proceed with an independent two-sample t-test.

  t.test(montney$Log_CGR, montney$GAS)
## 
##  Welch Two Sample t-test
## 
## data:  montney$Log_CGR and montney$GAS
## t = -415.51, df = 316160, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -50109.90 -49639.37
## sample estimates:
##     mean of x     mean of y 
##    -0.1937566 49874.4425270

To make the decision, we compare the p-value with the chosen significance level alpha (α). As the p-value is less than α = 0.05, we reject the null hypothesis. Note, however, that a two-sample t-test compares the means of Log_CGR and GAS; a small p-value therefore shows only that the two means differ, not that CGR is related to gas production.
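A test that addresses the stated hypotheses more directly is a correlation test, whose null hypothesis is that the two variables are not associated. The sketch below uses cor.test() from base R with the rank-based Spearman method on synthetic data. The variable names mirror the report, but the generating relation is an assumption made purely for illustration.

```r
set.seed(1)
n <- 500

# Synthetic data: Log_CGR decreases weakly with gas rate (assumed relation).
gas <- rlnorm(n, meanlog = 9, sdlog = 1)
log_cgr <- -0.3 * log10(gas) + rnorm(n, sd = 0.5)

# H0: no monotonic association between the two variables.
res <- cor.test(log_cgr, gas, method = "spearman", exact = FALSE)
print(res$estimate)  # Spearman's rho
print(res$p.value)
```

On the real table the call would take the form cor.test(montney$Log_CGR, montney$GAS, method = "spearman", exact = FALSE). With several hundred thousand rows even very weak associations will be statistically significant, so the size of rho is more informative than the p-value.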

USING INFORMATION GAIN FOR VARIABLE SELECTION

Information gain tells us how much information an independent variable gives about the dependent variable. It is helpful for both categorical and numerical dependent variables. For numeric dependent variables, bins are created.

Although many functions are available, we use the information.gain() function from the FSelector package. The FSelector package provides two approaches to selecting the most influential features from the original feature set. The first is to rank features by some criterion and select the ones above a defined threshold. The second is to search for an optimum feature subset in the space of feature subsets. Let's see which features are most important in the production dataset.

  weights <- information.gain(Log_CGR~., montney)
  print(weights)
##        attr_importance
## Field       0.20648761
## Pool        0.18014855
## Status      0.01871979
## Year        0.02333425
## Month       0.00000000
## GAS         0.07693021
## OIL         0.00463516
## CND         0.88853153
## WTR         0.08692053
## BOE         0.05966118
## FLD         0.14662557
## HRS         0.10409393
## CGR         1.60943791
  subset <- cutoff.k(weights, 5)

  f <- as.simple.formula(subset, "Log_CGR")
  print(f)
## Log_CGR ~ CGR + CND + Field + Pool + FLD
## <environment: 0x0000000063044c30>
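One caution about the ranking above: CGR is included as a predictor of Log_CGR even though one is a direct transform of the other. Its score of 1.609 is numerically ln(5), which is what one would expect if the target were split into five equally populated bins and CGR identified the bin perfectly. It is therefore likely that CGR should be dropped before ranking. The base R sketch below computes information gain as H(X) + H(Y) - H(X, Y) on synthetic data to illustrate the effect. The equal-frequency binning with five bins is an assumption and is not claimed to be what FSelector does internally.

```r
# Shannon entropy (natural log) of a discrete vector.
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Information gain = H(feature) + H(target) - H(feature, target).
info_gain <- function(feature, target) {
  entropy(feature) + entropy(target) -
    entropy(paste(feature, target, sep = "|"))
}

# Equal-frequency binning via ranks (assumed discretization).
equal_freq_bins <- function(x, k = 5) {
  cut(rank(x, ties.method = "first"), breaks = k, labels = FALSE)
}

set.seed(7)
n <- 1000
cgr <- rlnorm(n)          # synthetic CGR values
log_cgr <- log10(cgr)     # target, a monotone transform of cgr
noise <- rnorm(n)         # unrelated feature

target <- equal_freq_bins(log_cgr)
ig_leak <- info_gain(equal_freq_bins(cgr), target)
ig_noise <- info_gain(equal_freq_bins(noise), target)
print(c(leak = ig_leak, noise = ig_noise, ln5 = log(5)))
```

The leaking feature reaches the full entropy of the binned target, while the unrelated feature stays close to zero.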

Wrapper Methods

In wrapper methods, you train a model on a subset of the features. Based on the inferences you draw from that model, you decide to add features to the subset or remove them from it. Forward selection and backward elimination are examples of wrapper methods.
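Forward selection can be sketched with the step() function from base R, which adds one term at a time to a linear model as long as the AIC improves. The data below are synthetic; the column names follow the report, but the coefficients are invented, and a linear model is used only as a convenient stand-in for whatever learner is wrapped.

```r
set.seed(123)
n <- 300

# Synthetic data: Log_CGR depends on CND and GAS, not on Month (assumption).
toy <- data.frame(
  CND = rexp(n),
  GAS = rexp(n),
  Month = sample(1:12, n, replace = TRUE)
)
toy$Log_CGR <- 0.8 * toy$CND - 0.4 * toy$GAS + rnorm(n, sd = 0.3)

# Start from the intercept-only model and let step() add terms.
null_model <- lm(Log_CGR ~ 1, data = toy)
fwd <- step(null_model, scope = ~ CND + GAS + Month,
            direction = "forward", trace = 0)

selected <- attr(terms(fwd), "term.labels")
print(selected)
```

Setting direction = "backward" on a full model gives the corresponding backward elimination.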

Embedded Methods

These are algorithms that have their own built-in feature selection methods. LASSO regression is one such example.

USING RANDOMFOREST FOR VARIABLE SELECTION

To get the list of essential variables, we first need to build the model and then extract the variables ranked by importance using the importance() function.

  rfModel <-randomForest(Log_CGR ~ OIL+GAS+CND+WTR+BOE+FLD, data = montney[1:10000,])
  importance(rfModel)
##     IncNodePurity
## OIL      20.40343
## GAS    3245.54886
## CND    8419.09077
## WTR    1345.05823
## BOE    2508.03640
## FLD    1051.81772
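The model above is trained on the first 10,000 rows of the table. If the table is ordered by well, as the str() output earlier in the report suggests, those rows may cover only a small number of wells. A hedged alternative is to draw the training rows at random. The sketch below contrasts the two choices on an invented table with four wells.

```r
set.seed(2019)

# Toy table ordered by well, 250 records per well (hypothetical).
toy <- data.frame(
  well = rep(c("A", "B", "C", "D"), each = 250),
  Log_CGR = rnorm(1000)
)

head_rows <- toy[1:250, ]            # first rows: a single well
idx <- sample(nrow(toy), 250)        # random row indices
rand_rows <- toy[idx, ]              # random rows: spread across wells

length(unique(head_rows$well))
length(unique(rand_rows$well))
```

Applied to the report's data, this would amount to passing montney[sample(nrow(montney), 10000), ] as the data argument.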

The Boruta Algorithm

The Boruta algorithm is a wrapper built around the random forest classification algorithm. It tries to capture all the important, interesting features you might have in your dataset with respect to an outcome variable.

  • First, it duplicates the dataset and shuffles the values in each column. These shuffled copies are called shadow features. Then, it trains a classifier, such as a random forest classifier, on the extended dataset. By doing this, you get an idea of the importance of each feature in your dataset, via the Mean Decrease Accuracy or Mean Decrease Impurity. The higher the score, the more important the feature.
  • Then, the algorithm checks, for each real feature, whether it has a higher importance than the shadow features; that is, whether its Z-score is higher than the maximum Z-score among the shadow features. If it is, this is recorded in a vector; such events are called hits. The algorithm then continues with another iteration. After a predefined number of iterations, you end up with a table of these hits. Remember: a Z-score is the number of standard deviations a data point lies from the mean.
  • At every iteration, the algorithm compares the Z-scores of the shuffled copies of the features and the original features to see if the latter performed better than the former. If they did, the algorithm marks the feature as important. In essence, the algorithm validates the importance of a feature by comparing it with randomly shuffled copies, which increases robustness. This is done by comparing the number of times a feature did better than the shadow features against a binomial distribution.

  • If a feature has not been recorded as a hit in, say, 15 iterations, you reject it and remove it from the original matrix. You stop after a set number of iterations, or earlier if all the features have been either confirmed or rejected.
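Two of the steps above can be sketched in base R: building shadow features by shuffling each column, and judging a hit count against a binomial distribution. This is a simplified illustration on invented data, not the implementation in the Boruta package; in particular, hit_p_value() is a hypothetical helper that uses a one-sided test and omits any multiple-testing correction.

```r
set.seed(10)
n <- 200

# Toy feature table (hypothetical values).
X <- data.frame(
  CND = rexp(n),
  Month = sample(1:12, n, replace = TRUE)
)

# Shadow features: each column shuffled independently, which keeps its
# distribution but breaks any link to the outcome.
shadow <- as.data.frame(lapply(X, sample))
names(shadow) <- paste0("shadow_", names(X))
extended <- cbind(X, shadow)

# P(at least `hits` successes in `runs` iterations) if a feature beats the
# best shadow feature only by chance (probability 0.5 per run).
hit_p_value <- function(hits, runs) {
  pbinom(hits - 1, runs, 0.5, lower.tail = FALSE)
}

hit_p_value(18, 20)  # small: the feature would be confirmed as important
hit_p_value(11, 20)  # large: no evidence that it beats the shadows
```

In practice the full procedure is run through the Boruta package rather than assembled by hand.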