Prepared by: Raya Matoorian Updated: 2019/11/03
This report describes a procedure for predicting the condensate-gas ratio (CGR) in the Montney and Duvernay formations of the Western Canada Sedimentary Basin (WCSB).
Traditionally, operators assume a constant CGR to tie each well to the condensate production measured at the plant, and then back-allocate that production to the wells. This approach is unreliable because CGR changes over time and with location. Current CGR estimation methods are either empirical equations derived from specific field data, which typically have only local applicability, or laboratory and field measurements, which are time-consuming and costly. A variety of thermo-chemical parameters control condensate behavior in the reservoir and in surface facilities.
In this research, we integrate several datasets, including well data, reservoir data, fluid (oil, gas, and condensate) compositional analyses, and production data, to estimate CGR.
For this research, five datasets were collected from wells drilled in the Montney and Duvernay formations across AB and BC. These datasets include:
The data were obtained from the geoSCOUT software by geoLOGIC Systems. They were exported as CSV (comma-separated values) files and then imported into R.
The read.csv() function in R reads CSV files and can handle compressed files automatically. Loading large files is time-consuming; to avoid reloading the data on every run, we can set the cache=TRUE option on the code chunk.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## corrplot 0.84 loaded
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
prod <- read.csv("./data/prod.csv")
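As a minimal sketch of the compressed-file behavior mentioned earlier, the example below writes a small gzip-compressed CSV to a temporary file and reads it back with read.csv(). The toy table and its values are made up for illustration and are not taken from the geoSCOUT export.

```r
# Toy data standing in for the exported production table (values are made up).
toy <- data.frame(
  UWI = c("well-A", "well-A", "well-B"),
  Month = c(1L, 2L, 1L),
  GAS = c(37.5, 46.2, 145.5)
)

# Write a gzip-compressed CSV to a temporary file.
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path through file(), which detects the compression.
# stringsAsFactors = TRUE reproduces the factor columns seen in this report
# regardless of the R version's default.
toy_back <- read.csv(gz_path, stringsAsFactors = TRUE)
str(toy_back)
```

Setting stringsAsFactors explicitly matters because its default changed from TRUE to FALSE in R 4.0.0, so the factor columns shown in this report would otherwise come back as character vectors on a newer installation.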
After reading the required datasets, we can summarize them. For continuous features, the summary() function reports measures such as the minimum, quartiles, median, mean and maximum, along with the number of missing values. For categorical (factor) features, it reports the number of records in the most frequent levels.
dim(prod)
## [1] 819940 15
summary(prod)
## UWI Field
## 100/05-04-069-21W5/00: 758 HERITAGE :168105
## 100/05-07-069-21W5/00: 732 KAYBOB SOUTH :100238
## 100/05-08-069-21W5/00: 729 NORTHERN MONTNEY : 72013
## 100/09-07-069-21W5/00: 718 POUCE COUPE SOUTH : 64008
## 100/07-07-069-21W5/00: 700 ANTE CREEK NORTH : 48451
## 100/14-29-069-19W5/00: 688 STURGEON LAKE SOUTH: 44953
## (Other) :815615 (Other) :322172
## Pool Status
## MONTNEY A :248415 Active Gas Production:217129
## TRIASSIC A : 83269 Pumping Crude Oil :146906
## COMMINGLED POOL 012: 46536 Pumping Gas :102620
## MONTNEY C : 39056 Flowing Gas : 95657
## COMMINGLED MFP9529 : 34611 Suspended Crude Oil : 44289
## COMMINGLED POOL 005: 31033 Suspended Gas : 43804
## (Other) :337020 (Other) :169535
## Formation Year Month GAS
## : 1 Min. :1955 Min. : 1.000 Min. : 0
## Dduvernay: 28709 1st Qu.:2008 1st Qu.: 3.000 1st Qu.: 4251
## TRmontney:791230 Median :2014 Median : 6.000 Median : 15250
## Mean :2010 Mean : 6.485 Mean : 33020
## 3rd Qu.:2017 3rd Qu.:10.000 3rd Qu.: 41088
## Max. :2019 Max. :12.000 Max. :1189565
## NA's :1 NA's :1 NA's :1
## OIL CND WTR BOE
## Min. : 0 Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 201.4 1st Qu.: 52622
## Median : 0 Median : 0.0 Median : 1154.5 Median : 133757
## Mean : 9279 Mean : 626.5 Mean : 7693.9 Mean : 256683
## 3rd Qu.: 4031 3rd Qu.: 55.7 3rd Qu.: 4284.4 3rd Qu.: 300342
## Max. :534420 Max. :76863.5 Max. :1540523.4 Max. :7001594
## NA's :1 NA's :1 NA's :1 NA's :1
## FLD HRS CGR
## Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.: 623 1st Qu.: 9216 1st Qu.: 0.0
## Median : 3017 Median : 25098 Median : 0.0
## Mean : 17600 Mean : 39143 Mean : 30.5
## 3rd Qu.: 10690 3rd Qu.: 52509 3rd Qu.: 0.3
## Max. :1734095 Max. :411765 Max. :1276141.0
## NA's :1 NA's :1 NA's :1
If needed, the categorical features can be converted to the factor data type:
#catList <- c(1:5)
#prod[,catList] <- data.frame(apply(prod[catList], 2, as.factor))
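A possible alternative for the conversion, shown here as a sketch on a made-up table: apply() first coerces its input to a character matrix, so the result only ends up as factors when data.frame() converts strings to factors by default (R versions before 4.0.0). Using lapply() keeps each column separate and works the same way on any R version. The column names and values below are illustrative.

```r
# Toy table with two categorical columns and one numeric column.
toy <- data.frame(
  Field = c("KAKWA", "KARR", "KAKWA"),
  Pool = c("MONTNEY A", "MONTNEY A", "DUVERNAY UND"),
  GAS = c(10.5, 20.1, 30.7),
  stringsAsFactors = FALSE
)

# Convert only the categorical columns; lapply() returns a list of factors
# that is assigned back column by column.
cat_cols <- c("Field", "Pool")
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

str(toy)
```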
Below is a summary of the data types in the table:
str(prod)
## 'data.frame': 819940 obs. of 15 variables:
## $ UWI : Factor w/ 10780 levels "","100/01-01-059-21W5/04",..: 3482 3482 3482 3482 3482 3482 3482 3482 3482 3482 ...
## $ Field : Factor w/ 146 levels "","ANSELL","ANTE CREEK NORTH",..: 133 133 133 133 133 133 133 133 133 133 ...
## $ Pool : Factor w/ 389 levels "","2WS UND","BEAVERHILL LAKE D",..: 85 85 85 85 85 85 85 85 85 85 ...
## $ Status : Factor w/ 62 levels "","Abandon Gas Production",..: 31 31 31 31 31 31 31 31 31 31 ...
## $ Formation: Factor w/ 3 levels "","Dduvernay",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Year : int 2018 2018 2018 2018 2018 2018 2018 2019 2019 2019 ...
## $ Month : int 6 7 8 9 10 11 12 1 2 3 ...
## $ GAS : num 37.5 46.2 145.5 219.5 282.6 ...
## $ OIL : num 478 716 1854 3168 4309 ...
## $ CND : num 0 0 0 0 0 0 0 0 0 0 ...
## $ WTR : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BOE : num 3227 4776 12524 21228 28777 ...
## $ FLD : num 478 716 1854 3168 4309 ...
## $ HRS : num 264 379 964 1633 2356 ...
## $ CGR : num 0 0 0 0 0 0 0 0 0 0 ...
The summary shows that a few records contain missing values. Because there are so few, we drop those records with na.omit() rather than impute them. We then remove the records with CGR = 0:
prod <- na.omit(prod)
prod <- prod %>% filter(CGR > 0)
dim(prod)
## [1] 339119 15
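The effect of the two cleaning steps can be checked on a small made-up table. The sketch below uses base R subsetting instead of dplyr's filter() so that it runs without extra packages; it also shows one reason zero-CGR records cannot stay in if a logarithm is taken afterwards.

```r
# Toy records: one row with missing values and one row with CGR = 0.
toy <- data.frame(
  UWI = c("w1", "w2", "w3", "w4"),
  GAS = c(100, 250, NA, 80),
  CGR = c(0, 12.5, NA, 3.2)
)

n_start <- nrow(toy)
toy <- na.omit(toy)            # drops the row containing NA
n_after_na <- nrow(toy)
toy <- toy[toy$CGR > 0, ]      # drops the zero-CGR row
n_after_zero <- nrow(toy)

cat("rows:", n_start, "->", n_after_na, "->", n_after_zero, "\n")

# log10(0) is -Inf, which would break a later log transform of CGR.
zero_log <- log10(0)
```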
Number of wells in each formation:
ddply(prod,~Formation,summarise,Wells=length(unique(UWI)))
## Formation Wells
## 1 Dduvernay 627
## 2 TRmontney 5122
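The same kind of count can be sketched in base R with tapply(), which may be useful when plyr is not available. The table below is made up; only the column names follow the production table.

```r
# Toy records: well w1 and w3 appear several times (one row per month).
toy <- data.frame(
  UWI = c("w1", "w1", "w2", "w3", "w3", "w3"),
  Formation = c("TRmontney", "TRmontney", "TRmontney",
                "Dduvernay", "Dduvernay", "Dduvernay")
)

# Count distinct wells per formation.
wells <- tapply(toy$UWI, toy$Formation, function(x) length(unique(x)))
print(wells)
```

With dplyr loaded, grouping with group_by(Formation) and summarising with n_distinct(UWI) should give the same counts.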
We take the base-10 logarithm of the CGR column:
prod$Log_CGR <- log10(prod$CGR)
Number of wells in each Pool:
ddply(prod,~Pool,summarise,Wells=length(unique(UWI)))
## Pool Wells
## 1 2WS UND 8
## 2 BH LK UND 2
## 3 BLUESKY-MONTNEY E 1
## 4 BNFF UND 1
## 5 COMMINGLED MFP9516 26
## 6 COMMINGLED MFP9529 253
## 7 COMMINGLED POOL 001 22
## 8 COMMINGLED POOL 002 11
## 9 COMMINGLED POOL 003 78
## 10 COMMINGLED POOL 004 8
## 11 COMMINGLED POOL 005 6
## 12 COMMINGLED POOL 006 6
## 13 COMMINGLED POOL 007 1
## 14 COMMINGLED POOL 008 2
## 15 COMMINGLED POOL 009 6
## 16 COMMINGLED POOL 010 1
## 17 COMMINGLED POOL 012 246
## 18 COMMINGLED POOL 013 1
## 19 COMMINGLED POOL 014 4
## 20 COMMINGLED POOL 017 1
## 21 COMMINGLED POOL 018 3
## 22 COMMINGLED POOL 019 4
## 23 COMMINGLED POOL 020 1
## 24 COMMINGLED POOL 021 3
## 25 COMMINGLED POOL 022 1
## 26 COMMINGLED POOL 023 4
## 27 COMMINGLED POOL 024 25
## 28 COMMINGLED POOL 025 4
## 29 COMMINGLED POOL 027 1
## 30 COMMINGLED POOL 030 2
## 31 COMMINGLED POOL 036 7
## 32 DOIG C 1
## 33 DOIG PHOSPHATE-MONTNEY A 224
## 34 DOIG UND 1
## 35 DUNV UND 12
## 36 DUVERNAY A 6
## 37 DUVERNAY AA 2
## 38 DUVERNAY AAA 1
## 39 DUVERNAY B 3
## 40 DUVERNAY BB 2
## 41 DUVERNAY BBB 1
## 42 DUVERNAY C 4
## 43 DUVERNAY CC 2
## 44 DUVERNAY CCC 1
## 45 DUVERNAY D 4
## 46 DUVERNAY DD 2
## 47 DUVERNAY DDD 1
## 48 DUVERNAY E 4
## 49 DUVERNAY EE 2
## 50 DUVERNAY EEE 1
## 51 DUVERNAY F 4
## 52 DUVERNAY FF 2
## 53 DUVERNAY G 4
## 54 DUVERNAY GG 3
## 55 DUVERNAY H 4
## 56 DUVERNAY HH 2
## 57 DUVERNAY I 3
## 58 DUVERNAY II 2
## 59 DUVERNAY J 3
## 60 DUVERNAY JJ 2
## 61 DUVERNAY K 3
## 62 DUVERNAY KK 2
## 63 DUVERNAY L 2
## 64 DUVERNAY LL 2
## 65 DUVERNAY M 2
## 66 DUVERNAY MM 2
## 67 DUVERNAY N 3
## 68 DUVERNAY NN 2
## 69 DUVERNAY O 3
## 70 DUVERNAY OO 2
## 71 DUVERNAY P 3
## 72 DUVERNAY PP 2
## 73 DUVERNAY Q 3
## 74 DUVERNAY QQ 2
## 75 DUVERNAY R 3
## 76 DUVERNAY RR 2
## 77 DUVERNAY S 2
## 78 DUVERNAY SS 2
## 79 DUVERNAY T 2
## 80 DUVERNAY TT 2
## 81 DUVERNAY U 2
## 82 DUVERNAY UND 104
## 83 DUVERNAY UU 1
## 84 DUVERNAY V 2
## 85 DUVERNAY VV 2
## 86 DUVERNAY W 2
## 87 DUVERNAY WW 2
## 88 DUVERNAY X 2
## 89 DUVERNAY XX 2
## 90 DUVERNAY Y 2
## 91 DUVERNAY YY 1
## 92 DUVERNAY Z 2
## 93 DUVERNAY ZZ 1
## 94 DUVY UND 278
## 95 HALFWAY U 1
## 96 MONT UND 631
## 97 MONTNEY A 2729
## 98 MONTNEY A2A 5
## 99 MONTNEY A3A 3
## 100 MONTNEY A5A 1
## 101 MONTNEY A6A 1
## 102 MONTNEY A7A 1
## 103 MONTNEY AA 8
## 104 MONTNEY AAA 6
## 105 MONTNEY B 48
## 106 MONTNEY B2B 6
## 107 MONTNEY B3B 1
## 108 MONTNEY B5B 1
## 109 MONTNEY B6B 1
## 110 MONTNEY BB 6
## 111 MONTNEY BBB 3
## 112 MONTNEY C 32
## 113 MONTNEY C2C 3
## 114 MONTNEY C5C 1
## 115 MONTNEY C6C 1
## 116 MONTNEY C7C 1
## 117 MONTNEY CC 6
## 118 MONTNEY CCC 3
## 119 MONTNEY D 27
## 120 MONTNEY D2D 3
## 121 MONTNEY D5D 1
## 122 MONTNEY D6D 1
## 123 MONTNEY D7D 1
## 124 MONTNEY DD 6
## 125 MONTNEY DDD 3
## 126 MONTNEY E 21
## 127 MONTNEY E2E 3
## 128 MONTNEY E3E 2
## 129 MONTNEY E5E 1
## 130 MONTNEY E6E 1
## 131 MONTNEY E7E 1
## 132 MONTNEY EE 7
## 133 MONTNEY EEE 2
## 134 MONTNEY F 52
## 135 MONTNEY F2F 3
## 136 MONTNEY F3F 1
## 137 MONTNEY F5F 1
## 138 MONTNEY F6F 1
## 139 MONTNEY F7F 1
## 140 MONTNEY FF 7
## 141 MONTNEY FFF 3
## 142 MONTNEY G 12
## 143 MONTNEY G2G 3
## 144 MONTNEY G5G 1
## 145 MONTNEY G6G 1
## 146 MONTNEY G7G 1
## 147 MONTNEY GG 5
## 148 MONTNEY GGG 3
## 149 MONTNEY H 8
## 150 MONTNEY H2H 3
## 151 MONTNEY H3H 2
## 152 MONTNEY H5H 1
## 153 MONTNEY H6H 1
## 154 MONTNEY H7H 1
## 155 MONTNEY HH 7
## 156 MONTNEY HHH 3
## 157 MONTNEY I 12
## 158 MONTNEY I2I 4
## 159 MONTNEY I3I 2
## 160 MONTNEY I4I 1
## 161 MONTNEY I5I 1
## 162 MONTNEY I6I 1
## 163 MONTNEY I7I 1
## 164 MONTNEY II 7
## 165 MONTNEY III 4
## 166 MONTNEY J 12
## 167 MONTNEY J2J 3
## 168 MONTNEY J5J 1
## 169 MONTNEY J6J 1
## 170 MONTNEY J7J 1
## 171 MONTNEY JJ 6
## 172 MONTNEY JJJ 6
## 173 MONTNEY K 13
## 174 MONTNEY K2K 2
## 175 MONTNEY K5K 1
## 176 MONTNEY K6K 1
## 177 MONTNEY K7K 1
## 178 MONTNEY KK 7
## 179 MONTNEY KKK 4
## 180 MONTNEY L 10
## 181 MONTNEY L2L 3
## 182 MONTNEY L5L 1
## 183 MONTNEY L6L 1
## 184 MONTNEY LL 7
## 185 MONTNEY LLL 4
## 186 MONTNEY M 6
## 187 MONTNEY M2M 2
## 188 MONTNEY M3M 1
## 189 MONTNEY M5M 1
## 190 MONTNEY M6M 1
## 191 MONTNEY M7M 1
## 192 MONTNEY MM 7
## 193 MONTNEY MMM 4
## 194 MONTNEY N 10
## 195 MONTNEY N2N 2
## 196 MONTNEY N3N 1
## 197 MONTNEY N5N 1
## 198 MONTNEY N6N 1
## 199 MONTNEY NN 6
## 200 MONTNEY NNN 5
## 201 MONTNEY O 7
## 202 MONTNEY O2O 1
## 203 MONTNEY O3O 1
## 204 MONTNEY O6O 1
## 205 MONTNEY O7O 1
## 206 MONTNEY OO 5
## 207 MONTNEY OOO 4
## 208 MONTNEY P 5
## 209 MONTNEY P2P 1
## 210 MONTNEY P4P 1
## 211 MONTNEY P5P 1
## 212 MONTNEY P6P 1
## 213 MONTNEY PP 4
## 214 MONTNEY PPP 6
## 215 MONTNEY Q 4
## 216 MONTNEY Q2Q 1
## 217 MONTNEY Q3Q 1
## 218 MONTNEY Q4Q 1
## 219 MONTNEY Q5Q 1
## 220 MONTNEY Q6Q 1
## 221 MONTNEY QQ 7
## 222 MONTNEY QQQ 3
## 223 MONTNEY R 7
## 224 MONTNEY R2R 1
## 225 MONTNEY R4R 1
## 226 MONTNEY R5R 1
## 227 MONTNEY R6R 1
## 228 MONTNEY RR 3
## 229 MONTNEY RRR 5
## 230 MONTNEY S 10
## 231 MONTNEY S3S 1
## 232 MONTNEY S4S 1
## 233 MONTNEY S5S 1
## 234 MONTNEY S6S 1
## 235 MONTNEY SS 5
## 236 MONTNEY SSS 4
## 237 MONTNEY T 6
## 238 MONTNEY T2T 1
## 239 MONTNEY T3T 1
## 240 MONTNEY T4T 1
## 241 MONTNEY T5T 1
## 242 MONTNEY T6T 1
## 243 MONTNEY TT 4
## 244 MONTNEY TTT 4
## 245 MONTNEY U 5
## 246 MONTNEY U2U 1
## 247 MONTNEY U3U 2
## 248 MONTNEY U4U 1
## 249 MONTNEY U5U 1
## 250 MONTNEY U6U 1
## 251 MONTNEY UU 5
## 252 MONTNEY UUU 4
## 253 MONTNEY V 4
## 254 MONTNEY V2V 1
## 255 MONTNEY V3V 2
## 256 MONTNEY V4V 1
## 257 MONTNEY V5V 1
## 258 MONTNEY V6V 1
## 259 MONTNEY VV 5
## 260 MONTNEY VVV 8
## 261 MONTNEY W 11
## 262 MONTNEY W2W 4
## 263 MONTNEY W4W 1
## 264 MONTNEY W5W 1
## 265 MONTNEY W6W 1
## 266 MONTNEY WW 5
## 267 MONTNEY WWW 3
## 268 MONTNEY X 5
## 269 MONTNEY X2X 1
## 270 MONTNEY X3X 2
## 271 MONTNEY X4X 1
## 272 MONTNEY X5X 1
## 273 MONTNEY X6X 1
## 274 MONTNEY XX 5
## 275 MONTNEY XXX 2
## 276 MONTNEY Y 4
## 277 MONTNEY Y2Y 1
## 278 MONTNEY Y3Y 2
## 279 MONTNEY Y4Y 1
## 280 MONTNEY Y5Y 1
## 281 MONTNEY Y6Y 1
## 282 MONTNEY YY 3
## 283 MONTNEY YYY 3
## 284 MONTNEY Z 7
## 285 MONTNEY Z2Z 3
## 286 MONTNEY Z3Z 1
## 287 MONTNEY Z4Z 1
## 288 MONTNEY Z5Z 1
## 289 MONTNEY Z6Z 1
## 290 MONTNEY ZZ 3
## 291 MONTNEY ZZZ 2
## 292 TD UND 3
## 293 TEMP COMMINGLED CODE 97
## 294 TRIASSIC A 9
## 295 TRIASSIC B 2
## 296 TRIASSIC C 8
## 297 TRIASSIC D 2
## 298 TRIASSIC E 4
## 299 TRIASSIC H 3
## 300 TRIASSIC I 1
## 301 TRIASSIC K 3
## 302 TRIASSIC KK 1
## 303 TRIASSIC NN 1
## 304 TRIASSIC R 1
## 305 TRIASSIC VV 1
## 306 TRIASSIC WW 1
## 307 TRIASSIC XX 10
## 308 TRIASSIC Y 1
## 309 UNDEFINED 1
## 310 UNDEFINED POOL 1
Number of wells in each field:
ddply(prod,~Field,summarise,Wells=length(unique(UWI)))
## Field Wells
## 1 ANSELL 2
## 2 ANTE CREEK NORTH 14
## 3 ANTHONY 2
## 4 BEATON 2
## 5 BELLOY 17
## 6 BERLAND RIVER 5
## 7 BERLAND RIVER WEST 3
## 8 BERWYN 4
## 9 BEZANSON 1
## 10 BIGSTONE 13
## 11 BLUEBERRY 6
## 12 BOUNDARY LAKE 1
## 13 BOUNDARY LAKE SOUTH 3
## 14 BRAZEAU RIVER 1
## 15 CECIL 3
## 16 CHICKADEE 7
## 17 CHIME 2
## 18 CINDY 5
## 19 CLEAR PRAIRIE 3
## 20 CULP 6
## 21 DAHL 4
## 22 DIMSDALE 2
## 23 DIXONVILLE 1
## 24 DUNVEGAN 9
## 25 EAGLESHAM 7
## 26 EAGLESHAM NORTH 4
## 27 EDSON 5
## 28 ELMWORTH 214
## 29 FERRIER 22
## 30 FLATROCK WEST 1
## 31 FOX CREEK 56
## 32 GARRINGTON 1
## 33 GEORGE 1
## 34 GILBY 2
## 35 GIROUXVILLE EAST 2
## 36 GORDONDALE 6
## 37 GRANDE PRAIRIE 7
## 38 GRIZZLY 17
## 39 HAMBURG 1
## 40 HERITAGE 1694
## 41 JACK 1
## 42 KAKWA 396
## 43 KARR 113
## 44 KAYBOB 151
## 45 KAYBOB SOUTH 460
## 46 LA GLACE 92
## 47 LATOR 5
## 48 LELAND 8
## 49 MCKINLEY 8
## 50 MEDICINE LODGE 1
## 51 MULLIGAN 2
## 52 MUSKRAT 1
## 53 NORMANDVILLE 2
## 54 NORTHERN MONTNEY 1212
## 55 OAK 2
## 56 PARADISE 2
## 57 PARKLAND 1
## 58 PEMBINA 16
## 59 PEORIA 4
## 60 PICA 1
## 61 PINE CREEK 22
## 62 PLACID 69
## 63 POUCE COUPE 37
## 64 POUCE COUPE SOUTH 352
## 65 PROGRESS 1
## 66 REINE 1
## 67 RESTHAVEN 18
## 68 ROXANA 3
## 69 RYCROFT 3
## 70 SADDLE HILLS 1
## 71 SAXON 16
## 72 SIMONETTE 31
## 73 SINCLAIR 35
## 74 SMOKY 22
## 75 SOLOMON 2
## 76 STUMP 1
## 77 STURGEON LAKE 1
## 78 STURGEON LAKE SOUTH 5
## 79 SUNDANCE 8
## 80 TANGENT 10
## 81 TONY CREEK NORTH 42
## 82 UNDEFINED 2
## 83 VALHALLA 78
## 84 WAPITI 99
## 85 WASKAHIGAN 108
## 86 WEMBLEY 103
## 87 WHITELAW 3
## 88 WILD RIVER 11
## 89 WILLESDEN GREEN 28
## 90 WORSLEY 3
Break down records based on formations:
montney <- prod %>% filter(Formation == "TRmontney")
duvernay <- prod %>% filter(Formation == "Dduvernay")
To free memory, it is better to remove the production table:
rm(prod)
Remove the UWI and Formation columns:
montney <- within(montney, rm("UWI", "Formation"))
duvernay <- within(duvernay, rm("UWI", "Formation"))
High-dimensional data, in terms of number of features, is increasingly common in machine learning problems. To extract useful information from these high volumes of data, you have to use statistical techniques to reduce noise and redundancy, because you often do not need every feature at your disposal to train a model. You can improve your model by feeding in only those features that are uncorrelated and non-redundant. This is where feature selection plays an important role. Not only does it help train your model faster, it also reduces the complexity of the model, makes it easier to interpret, and can improve accuracy, precision, recall, or whatever the performance metric may be.
Generally, whenever you want to reduce the dimensionality of the data, you come across methods like Principal Component Analysis (PCA), Singular Value Decomposition, etc. So it is natural to ask why you need other feature selection methods at all. The issue with these techniques is that they are unsupervised: PCA, for example, uses the variance in the data to find the components. They do not take into account the relationship between the feature values and the target class or values. Also, there are certain assumptions, such as normality, associated with such methods, which require some kind of transformation before applying them. These assumptions do not hold for all kinds of data.
There are three types of feature selection methods in general:
Filter methods are generally used as a preprocessing step. The selection of features is independent of any machine learning algorithm. Instead, features are selected on the basis of their scores in various statistical tests of their correlation with the outcome variable. Some common filter methods are correlation metrics (Pearson, Spearman, distance), the Chi-squared test, ANOVA, Fisher's score, etc.
USING CORRELATION
Correlation gives us the degree of association between two numeric variables. You can use the cor() function to get the correlation values.
n <- ncol(montney)
rc <- cor(montney[4:n])
print(rc)
## Year Month GAS OIL CND
## Year 1.000000000 -0.0638536156 -0.021355483 0.0041355183 0.188154738
## Month -0.063853616 1.0000000000 -0.002036995 -0.0003492632 -0.003776291
## GAS -0.021355483 -0.0020369952 1.000000000 -0.0179260251 0.011423265
## OIL 0.004135518 -0.0003492632 -0.017926025 1.0000000000 -0.004934274
## CND 0.188154738 -0.0037762911 0.011423265 -0.0049342745 1.000000000
## WTR 0.276122099 -0.0063835769 0.181353013 -0.0044083627 0.243248000
## BOE -0.008180572 -0.0022946576 0.997570647 -0.0143809016 0.080947621
## FLD 0.292531967 -0.0063967202 0.117991061 0.0301090484 0.803341833
## HRS 0.031950768 -0.0014401142 0.461022766 0.0030347590 -0.119246877
## CGR 0.002176380 0.0015113088 -0.007882791 0.0005296408 0.001721925
## Log_CGR 0.111420725 0.0008074448 -0.282281108 0.0221322743 0.447774018
## WTR BOE FLD HRS CGR
## Year 0.276122099 -0.008180572 0.2925319666 0.031950768 0.0021763795
## Month -0.006383577 -0.002294658 -0.0063967202 -0.001440114 0.0015113088
## GAS 0.181353013 0.997570647 0.1179910608 0.461022766 -0.0078827913
## OIL -0.004408363 -0.014380902 0.0301090484 0.003034759 0.0005296408
## CND 0.243248000 0.080947621 0.8033418326 -0.119246877 0.0017219255
## WTR 1.000000000 0.197688892 0.7719838235 -0.069999221 -0.0031924093
## BOE 0.197688892 1.000000000 0.1736300860 0.451282529 -0.0077360773
## FLD 0.771983824 0.173630086 1.0000000000 -0.120834554 -0.0008108586
## HRS -0.069999221 0.451282529 -0.1208345538 1.000000000 -0.0110151990
## CGR -0.003192409 -0.007736077 -0.0008108586 -0.011015199 1.0000000000
## Log_CGR 0.097874404 -0.250149759 0.3538223296 -0.348229996 0.0485140763
## Log_CGR
## Year 0.1114207254
## Month 0.0008074448
## GAS -0.2822811083
## OIL 0.0221322743
## CND 0.4477740181
## WTR 0.0978744039
## BOE -0.2501497590
## FLD 0.3538223296
## HRS -0.3482299961
## CGR 0.0485140763
## Log_CGR 1.0000000000
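In the matrix, CGR and Log_CGR have a correlation of only about 0.05, even though Log_CGR is computed directly from CGR. This is because cor() defaults to the Pearson coefficient, which measures linear association. The sketch below uses synthetic, heavily skewed values (not the production data) to compare Pearson with the rank-based Spearman coefficient, which is one of the filter metrics listed earlier.

```r
set.seed(1)

# Synthetic CGR-like values spanning several orders of magnitude.
cgr <- 10^runif(500, min = -3, max = 5)
log_cgr <- log10(cgr)

# Pearson looks for a straight-line relationship; Spearman only needs
# the relationship to be monotone because it works on ranks.
pearson <- cor(cgr, log_cgr, method = "pearson")
spearman <- cor(cgr, log_cgr, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

For skewed production variables, passing method = "spearman" to cor() may therefore give a more informative picture than the default.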
However, the cor() function is elementary, so we cover a couple of other functions that can generate similar output for inferring which variables are important.
corrplot(rc, method="color", type = "lower", is.corr=FALSE, addCoef.col = "red", tl.col="black", tl.srt=45, number.cex = .6)
The basic arguments of the corrplot() function that you should know include:
method = sets the type of visualization. You can draw circle, square, ellipse, number, shade, color or pie.
type = sets whether you want the full matrix, the upper triangle or the lower triangle. By default the complete matrix is shown; accepted options are full (default), upper or lower.
USING HYPOTHESIS TESTING
You can use hypothesis testing to check whether an independent variable has a significant relationship with the dependent variable. For example, we want to check whether the CGR is related to the gas/oil/condensate production rate.
As both variables here are continuous, we proceed with an independent two-sample t-test.
t.test(montney$Log_CGR, montney$GAS)
##
## Welch Two Sample t-test
##
## data: montney$Log_CGR and montney$GAS
## t = -415.51, df = 316160, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -50109.90 -49639.37
## sample estimates:
## mean of x mean of y
## -0.1937566 49874.4425270
To make the decision, we compare the p-value with the chosen significance level alpha (α = 0.05). As the p-value is less than α, we reject the null hypothesis that Log_CGR and GAS have equal means. Note that this test compares the means of the two columns; it does not by itself measure how strongly CGR is related to gas production.
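If the question is whether two continuous variables are associated, a correlation test is one possible alternative to the t-test. The sketch below applies cor.test() from base R to synthetic data; the variable names mirror the report, but the values and the assumed negative trend are made up for illustration.

```r
set.seed(42)
n <- 200

# Synthetic gas volumes (log-normal) and a Log_CGR that decreases with gas.
gas <- rlnorm(n, meanlog = 9, sdlog = 1)
log_cgr <- -0.5 * log10(gas) + rnorm(n, sd = 0.3)

# Null hypothesis: no monotone association between the two variables.
res <- cor.test(log_cgr, gas, method = "spearman", exact = FALSE)

print(res$estimate)  # Spearman's rho
print(res$p.value)
```

A p-value below α here would indicate an association between the two variables, which is a different statement from the two columns having different means.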
USING INFORMATION GAIN FOR VARIABLE SELECTION
Information gain tells us how much information the independent variable gives about the dependent variable. It is helpful for both categorical and numerical dependent variables. For numeric dependent variables, bins are created.
Although there are many functions, we use the information.gain() function from the FSelector package. The FSelector package provides two approaches to select the most influential features from the original feature set. The first is to rank features by some criterion and select the ones above a defined threshold. The second is to search for an optimum feature subset in the space of feature subsets. Let's see which features are most important in the production dataset.
weights <- information.gain(Log_CGR~., montney)
print(weights)
## attr_importance
## Field 0.20648761
## Pool 0.18014855
## Status 0.01871979
## Year 0.02333425
## Month 0.00000000
## GAS 0.07693021
## OIL 0.00463516
## CND 0.88853153
## WTR 0.08692053
## BOE 0.05966118
## FLD 0.14662557
## HRS 0.10409393
## CGR 1.60943791
subset <- cutoff.k(weights, 5)
f <- as.simple.formula(subset, "Log_CGR")
print(f)
## Log_CGR ~ CGR + CND + Field + Pool + FLD
## <environment: 0x0000000063044c30>
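Note that CGR receives the highest weight because Log_CGR is a deterministic transformation of it, so it would normally be excluded from the predictors. The calculation behind information gain can be sketched in base R as the entropy of the binned target minus its entropy within each level of a feature. This is an illustrative implementation on synthetic data, not FSelector's code; the helper names entropy() and info_gain(), the equal-width binning and the number of bins are assumptions.

```r
# Shannon entropy (natural log) of a categorical vector.
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Information gain of a categorical feature about a categorical target:
# H(target) minus the weighted entropy of the target within each level.
info_gain <- function(feature, target) {
  h_cond <- sum(sapply(split(target, feature), function(t) {
    length(t) / length(target) * entropy(t)
  }))
  entropy(target) - h_cond
}

set.seed(7)
n <- 1000
field <- sample(c("A", "B", "C"), n, replace = TRUE)   # informative feature
noise <- sample(c("x", "y"), n, replace = TRUE)        # unrelated feature
log_cgr <- ifelse(field == "A", 2, 0) + rnorm(n, sd = 0.5)

# Numeric target is binned before computing entropies.
target_bin <- cut(log_cgr, breaks = 5)

ig_field <- info_gain(field, target_bin)
ig_noise <- info_gain(noise, target_bin)
print(c(field = ig_field, noise = ig_noise))
```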
In wrapper methods, you train a model on a subset of features. Based on the inferences you draw from that model, you decide to add features to or remove features from the subset. Forward selection and backward elimination are examples of wrapper methods.
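As a minimal sketch of forward selection, the example below starts from an intercept-only linear model and lets step() from base R add one feature at a time, keeping an addition only when it lowers the AIC. The data are synthetic and the choice of a linear model with AIC is an assumption made for illustration; the report does not state which wrapper setup was used.

```r
set.seed(123)
n <- 300

# x1 and x2 drive the response; x3 is unrelated.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 * d$x1 + 0.5 * d$x2 + rnorm(n)

# Start with no predictors and search upward through the candidate set.
null_model <- lm(y ~ 1, data = d)
fwd <- step(null_model, scope = ~ x1 + x2 + x3,
            direction = "forward", trace = 0)

selected <- attr(terms(fwd), "term.labels")
print(selected)
```

Backward elimination follows the same pattern with a full starting model and direction = "backward".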
Embedded methods are algorithms that have their own built-in feature selection. LASSO regression is one such example.
USING RANDOMFOREST FOR VARIABLE SELECTION
To get the list of important variables, we first build the model and then extract the variable importances using the importance() function.
rfModel <-randomForest(Log_CGR ~ OIL+GAS+CND+WTR+BOE+FLD, data = montney[1:10000,])
importance(rfModel)
## IncNodePurity
## OIL 20.40343
## GAS 3245.54886
## CND 8419.09077
## WTR 1345.05823
## BOE 2508.03640
## FLD 1051.81772
The Boruta Algorithm
The Boruta algorithm is a wrapper built around the random forest classification algorithm. It tries to capture all the important, interesting features you might have in your dataset with respect to an outcome variable.
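The core idea, comparing each real feature against randomly permuted "shadow" copies, can be sketched in base R. This is a simplified illustration, not the Boruta package: the absolute Spearman correlation is used as a stand-in importance score where Boruta uses random forest importance, and the data, feature names and iteration count are made up.

```r
set.seed(2019)
n <- 200
X <- data.frame(signal = rnorm(n), junk1 = rnorm(n), junk2 = rnorm(n))
y <- 2 * X$signal + rnorm(n)

# Stand-in importance score (assumption): |Spearman correlation| with y.
importance_score <- function(features, target) {
  sapply(features, function(col) abs(cor(col, target, method = "spearman")))
}

n_iter <- 50
hits <- setNames(integer(ncol(X)), names(X))

for (i in seq_len(n_iter)) {
  # Shadow features: each column shuffled, which breaks any link with y.
  shadows <- as.data.frame(lapply(X, sample))
  names(shadows) <- paste0("shadow_", names(X))

  imp <- importance_score(cbind(X, shadows), y)

  # A real feature scores a hit when it beats the best shadow feature.
  threshold <- max(imp[names(shadows)])
  hits <- hits + as.integer(imp[names(X)] > threshold)
}

print(hits)
```

Features that beat the best shadow in nearly every iteration are candidates to keep. The Boruta package turns these hit counts into a statistical decision for each feature rather than relying on a raw count.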