Background

Oil production is a capital-intensive business. A global estimate puts the cost of drilling an onshore well at USD 1-2 million. Not every well that is drilled finds oil, but when a reserve is found the payoff is large: a productive well can produce oil for years.

A newly producing oil well has enough natural pressure to push oil to the surface. As oil is extracted, the well pressure drops and the oil has to be lifted artificially. Artificial lifting covers the methods used to extract oil from wells with little or no natural flow. Installing an artificial lift requires investment and increases operating cost.

Artificial lifts and related costs

The sucker-rod pump (SRP)

The standard method of artificial lifting is the beam pump, also known as the sucker-rod pump (SRP). The sucker-rod pump transfers its vertical bobbing motion to a reciprocating downhole pump inside the well. It acts like a bike-pump in reverse, pumping oil out of a reservoir.

Sucker-rod pump

Artificial lifting adds overhead and operating cost, so it is imperative that the pump runs efficiently. Pumping efficiency is monitored through the downhole dynamometer graph, or dynocard, which shows how the pump inside the well is functioning. A good dynocard has a nearly rectangular shape.

Full pump dynocard

A pump that is not working efficiently produces a dynocard that deviates from this shape.

Standing valve leak. Pump is not efficient as it returns some of the oil back into the reservoir, instead of pumping it to the surface.

Pumped-off well. This indicates that the downhole pump is not well submerged in the fluid.

Fluid pound. This indicates that the pumping rate is higher than the reservoir production.

Inefficiencies lower production and increase costs. The dynocard can also reveal unfavorable operating conditions such as fluid pound or a pumped-off well. These conditions increase wear and tear and, in the long run, may damage the pump, which means downtime for repairs and replacement.

Problem statement

The downhole card is usually reviewed by petroleum engineers once a day, typically in the morning. However, production conditions may change throughout the day. Oil production runs 24 hours a day, 7 days a week, and it is impossible for a petroleum engineer to be on standby for that whole time.

A 10% drop in efficiency lasting 12 hours (during off-hours) in a modest Indonesian well (300 BOPD) reduces production by 15 barrels. At the current price of USD 42 (Rp 630,000) per barrel, the loss to the company amounts to roughly Rp 10 million per day. This may seem small, but to put things in perspective, Pertamina produced an average of 414,400 barrels of oil per day (BOPD) in 2019 [1]. Remedying inefficiencies on the scale of the scenario above could add up to Rp 13 billion in revenue per day.
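
The estimate above can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted in the text; the helper name `barrels_lost` is made up for this illustration, and the calculation assumes the efficiency drop translates one-to-one into lost production.

```r
# Worked version of the loss estimate; all inputs are figures quoted in the text.
barrels_lost <- function(bopd, efficiency_drop, hours) {
  # Lost production = daily rate * fraction of the day * efficiency drop
  bopd * (hours / 24) * efficiency_drop
}

price_idr <- 630000  # Rp per barrel (USD 42)

# Single 300 BOPD well, 10% less efficient for 12 hours
well_loss_bbl <- barrels_lost(300, 0.10, 12)
well_loss_idr <- well_loss_bbl * price_idr

# Same scenario applied to Pertamina's 2019 average output
fleet_loss_bbl <- barrels_lost(414400, 0.10, 12)
fleet_loss_idr <- fleet_loss_bbl * price_idr

cat(sprintf("Single well:  %.0f barrels, Rp %.2f million\n",
            well_loss_bbl, well_loss_idr / 1e6))
cat(sprintf("Company-wide: %.0f barrels, Rp %.2f billion\n",
            fleet_loss_bbl, fleet_loss_idr / 1e9))
```

The single-well case comes to 15 barrels, or Rp 9.45 million, which the text rounds to Rp 10 million; the company-wide case comes to 20,720 barrels, or about Rp 13.05 billion.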

Proposed solution

A model that classifies the downhole card automatically and in real time can monitor the performance of oil wells continuously, 24 hours a day, 7 days a week. The classification result can be used to diagnose problems and suggest how to improve production efficiency. This can improve efficiency and avoid costly downtime and repairs.

Project scope

Goal

The goal of this project is to train and deploy a machine learning model that can classify downhole dynocards in near real-time with 95% accuracy. This project is a small part of a larger effort, which is to provide actionable recommendations powered by the model.

To reach the stated goal, the project is divided into milestones:

  1. Data acquisition
  2. Data wrangling and exploratory analysis
  3. Training and evaluating models
  4. Model trial
  5. Model deployment

Prior art and methodology

Several published papers have utilized machine learning for the analysis of downhole dynocards.

Sharaf [2] compared the performance of the trained image recognition models VGG, ResNeXt34 and ResNeXt50 on a dataset of 80 training and 16 test dynocard images. The dynocard images consist of 8 classes: normal, gas interference, severe gas interference, severe fluid pound, standing valve leak, worn pump, stuck piston and sand production. The best model was ResNet34 with 100% accuracy, followed by VGG with 87.5% accuracy and ResNeXt50 with 56.25% accuracy.

In 2019, Sharaf et al. [3] used several methods to classify downhole card data obtained in Bahrain. The data is represented as text, with each downhole card described by 100 pairs of scaled position and load values. The models were trained on 22,298 manually labelled images. The resulting models achieved 99.99% validation accuracy and 100% testing accuracy.

Methodology

The methodology that is used in this project is similar to the second method employed by Sharaf. However, before processing, each card is scaled uniformly and its dimensionality reduced using PCA. The diagram below summarizes the steps.

methodology

Data Preprocessing

library(tidyverse) # Wrangling and visualization
library(FactoMineR) # PCA
library(umap) # Visual clustering using UMAP
library(dbscan) # Clustering using HDBSCAN
library(factoextra) # Clustering functions
library(caret) # Confusion matrix
library(kableExtra) # Pretty printing tables

The data is obtained from records of several oil wells over the course of July 2020.

data <- read_csv('downholecard_data.csv')
glimpse(data)
## Rows: 36,647
## Columns: 2
## $ rownum          <dbl> 1, 11, 21, 31, 41, 51, 61, 71, 81, 91, 101, 111, 121,…
## $ CurveCoordinate <chr> "-121.55,62438;-121.7,62422;-121.52,62408;-121.01,623…

The data contains 36,647 rows and 2 columns:

  • rownum : number of row in the dataset
  • CurveCoordinate : (x,y)-paired coordinates to make the downhole card

To get a sense of the data, let's look at the first row:

data[1, 'CurveCoordinate']
## # A tibble: 1 x 1
##   CurveCoordinate                                                               
##   <chr>                                                                         
## 1 -121.55,62438;-121.7,62422;-121.52,62408;-121.01,62398;-120.17,62391;-119.02,…

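Each row contains an id and a CurveCoordinate column, which holds the 100 pairs of position and load values as a single string. The pairs are separated by semicolons, and within each pair position and load are separated by a comma. The original rownum values are not consecutive (1, 11, 21, ...), so we replace them with a running id to make rows easier to reference.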

# Replace rownum with a consecutive id
data$rownum <- seq_len(nrow(data))
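
To make the CurveCoordinate format concrete, here is a minimal base R sketch that unpacks one such string into numeric points. The helper name `parseCurve` is hypothetical and not part of the analysis; the input is the first three pairs of the first card shown above.

```r
# Parse one "x,y;x,y;..." string into a data frame of numeric points.
# parseCurve is a hypothetical helper used only for this illustration.
parseCurve <- function(curve) {
  pairs <- strsplit(curve, ";", fixed = TRUE)[[1]]          # one element per point
  xy <- do.call(rbind, strsplit(pairs, ",", fixed = TRUE))  # character matrix, 2 columns
  data.frame(x = as.numeric(xy[, 1]), y = as.numeric(xy[, 2]))
}

# First three points of the first card in the dataset
card <- parseCurve("-121.55,62438;-121.7,62422;-121.52,62408")
print(card)
```

Each row of the result is one point on the card, with position in `x` and load in `y`.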

Next, I introduce the function drawDynocard below to make it easier to draw dynocards.

drawDynocard <- function(d, is_clean) {

  if (!is_clean) {

    # Raw data: split "x,y;x,y;..." into one row per point,
    # then split each point into numeric x and y columns
    row <- d %>%
      separate_rows(CurveCoordinate, sep = ";") %>%
      separate(CurveCoordinate, into = c('x', 'y'), sep = ",", convert = TRUE)

  } else {

    # Reshaped data: column 1 is the id, columns 2, 4, ..., 200 hold x
    # and columns 3, 5, ..., 201 hold y
    row <- tibble(
      rownum = d[[1]][1],
      x = unlist(d[1, seq(2, 200, 2)], use.names = FALSE),
      y = unlist(d[1, seq(3, 201, 2)], use.names = FALSE)
    )

  }

  # Draw dynocard
  row %>%
    add_row(row[1, ]) %>% # Repeat the first point to close the curve
    ggplot(aes(x = x, y = y)) +
    geom_path() +
    labs(title = paste('Downhole dynocard id:', d[[1]][1]), x = 'Position (in)', y = 'Load (lbs)')

}

Let's plot some cards

idx <- sample(nrow(data), 5) 

for(i in idx) {
  print(drawDynocard(data[i,], is_clean = F))
}

Reshaping data

The data in its current form cannot be used for analysis. CurveCoordinate has to be expanded so that each value sits in its own column.

# Create column names x1, y1, x2, y2, ..., x100, y100
columns <- paste0(c('x', 'y'), rep(1:100, each = 2))

# Separating values into columns
data.col <- data %>% 
  separate(CurveCoordinate, into = columns, sep = "[,;]", convert = TRUE)

head(data.col) %>% 
  kable() %>% 
  kable_styling()
rownum x1 y1 x2 y2 x3 y3 x4 y4 x5 y5 x6 y6 x7 y7 x8 y8 x9 y9 x10 y10 x11 y11 x12 y12 x13 y13 x14 y14 x15 y15 x16 y16 x17 y17 x18 y18 x19 y19 x20 y20 x21 y21 x22 y22 x23 y23 x24 y24 x25 y25 x26 y26 x27 y27 x28 y28 x29 y29 x30 y30 x31 y31 x32 y32 x33 y33 x34 y34 x35 y35 x36 y36 x37 y37 x38 y38 x39 y39 x40 y40 x41 y41 x42 y42 x43 y43 x44 y44 x45 y45 x46 y46 x47 y47 x48 y48 x49 y49 x50 y50 x51 y51 x52 y52 x53 y53 x54 y54 x55 y55 x56 y56 x57 y57 x58 y58 x59 y59 x60 y60 x61 y61 x62 y62 x63 y63 x64 y64 x65 y65 x66 y66 x67 y67 x68 y68 x69 y69 x70 y70 x71 y71 x72 y72 x73 y73 x74 y74 x75 y75 x76 y76 x77 y77 x78 y78 x79 y79 x80 y80 x81 y81 x82 y82 x83 y83 x84 y84 x85 y85 x86 y86 x87 y87 x88 y88 x89 y89 x90 y90 x91 y91 x92 y92 x93 y93 x94 y94 x95 y95 x96 y96 x97 y97 x98 y98 x99 y99 x100 y100
1 -121.55 62438 -121.70 62422 -121.52 62408 -121.01 62398 -120.17 62391 -119.02 62389 -117.54 62390 -115.76 62396 -113.68 62406 -111.30 62421 -108.64 62441 -105.70 62464 -102.50 62491 -99.04 62521 -95.35 62553 -91.42 62587 -87.29 62623 -82.96 62660 -78.45 62698 -73.78 62737 -68.97 62778 -64.04 62820 -59.01 62864 -53.90 62909 -48.73 62955 -43.53 63003 -38.30 63052 -33.08 63103 -27.87 63154 -22.71 63206 -17.61 63258 -12.60 63310 -7.68 63362 -2.89 63415 1.77 63466 6.26 63517 10.58 63566 14.71 63614 18.63 63660 22.32 63704 25.78 63745 28.98 63782 31.92 63817 34.58 63849 36.95 63879 39.03 63906 40.79 63932 42.24 63956 43.36 63980 44.16 64002 44.63 64023 44.77 64042 44.58 64060 44.06 64074 43.22 64084 42.06 64089 40.58 64090 38.80 64085 36.72 64074 34.35 64059 31.69 64039 28.75 64015 25.55 63989 22.08 63960 18.38 63930 14.45 63899 10.30 63867 5.96 63834 1.44 63799 -3.23 63761 -8.04 63720 -12.96 63675 -17.98 63627 -23.08 63576 -28.23 63521 -33.41 63463 -38.62 63405 -43.82 63345 -49.01 63287 -54.16 63230 -59.25 63175 -64.27 63123 -69.20 63074 -74.00 63027 -78.67 62983 -83.19 62939 -87.52 62898 -91.66 62856 -95.59 62816 -99.29 62776 -102.74 62736 -105.94 62697 -108.87 62660 -111.52 62624 -113.88 62590 -115.95 62558 -117.70 62529 -119.15 62502 -120.28 62478 -121.08 62457
2 4.34 -3650 3.99 -3517 3.83 -3350 3.86 -3156 4.10 -2944 4.58 -2727 5.30 -2519 6.30 -2331 7.58 -2175 9.16 -2058 11.03 -1983 13.18 -1951 15.60 -1955 18.26 -1990 21.13 -2044 24.19 -2109 27.40 -2173 30.74 -2229 34.18 -2271 37.71 -2297 41.32 -2306 44.99 -2301 48.72 -2284 52.50 -2260 56.32 -2233 60.18 -2205 64.05 -2178 67.94 -2154 71.81 -2132 75.66 -2112 79.46 -2093 83.19 -2073 86.85 -2054 90.42 -2035 93.89 -2018 97.24 -2003 100.47 -1990 103.55 -1980 106.49 -1973 109.25 -1965 111.84 -1956 114.23 -1943 116.41 -1924 118.37 -1899 120.11 -1869 121.61 -1836 122.89 -1805 123.94 -1780 124.77 -1769 125.39 -1776 125.81 -1808 126.03 -1866 126.06 -1952 125.88 -2062 125.49 -2191 124.88 -2332 124.05 -2478 122.96 -2618 121.63 -2745 120.04 -2854 118.19 -2940 116.09 -3002 113.75 -3042 111.18 -3064 108.42 -3073 105.48 -3075 102.38 -3075 99.14 -3078 95.78 -3087 92.32 -3103 88.76 -3126 85.13 -3155 81.42 -3188 77.65 -3222 73.83 -3257 69.98 -3290 66.11 -3322 62.24 -3353 58.38 -3383 54.55 -3413 50.77 -3444 47.05 -3475 43.41 -3506 39.86 -3535 36.41 -3562 33.08 -3586 29.86 -3606 26.79 -3624 23.86 -3640 21.11 -3657 18.54 -3676 16.17 -3699 14.01 -3727 12.06 -3759 10.34 -3791 8.83 -3819 7.54 -3836 6.46 -3834 5.57 -3806 4.86 -3746
3 -121.54 62434 -121.70 62421 -121.52 62413 -121.03 62407 -120.20 62405 -119.05 62405 -117.59 62409 -115.81 62416 -113.72 62426 -111.34 62439 -108.67 62456 -105.73 62476 -102.51 62499 -99.05 62524 -95.34 62553 -91.42 62584 -87.28 62617 -82.94 62652 -78.43 62689 -73.76 62728 -68.95 62768 -64.02 62810 -58.99 62853 -53.88 62898 -48.71 62944 -43.50 62992 -38.28 63042 -33.06 63093 -27.86 63145 -22.70 63198 -17.60 63251 -12.59 63305 -7.67 63358 -2.88 63410 1.77 63461 6.27 63512 10.59 63561 14.72 63609 18.64 63656 22.33 63701 25.78 63745 28.98 63786 31.90 63826 34.55 63863 36.92 63897 38.98 63928 40.74 63956 42.18 63982 43.31 64003 44.12 64022 44.60 64038 44.75 64050 44.58 64060 44.08 64066 43.25 64069 42.10 64069 40.64 64066 38.86 64059 36.78 64048 34.40 64033 31.73 64015 28.79 63993 25.58 63968 22.12 63940 18.42 63909 14.49 63876 10.35 63841 6.02 63804 1.51 63766 -3.16 63727 -7.97 63687 -12.90 63645 -17.93 63602 -23.04 63557 -28.20 63511 -33.41 63464 -38.64 63415 -43.86 63366 -49.07 63315 -54.23 63265 -59.34 63214 -64.36 63164 -69.28 63115 -74.08 63067 -78.75 63019 -83.25 62973 -87.58 62927 -91.71 62881 -95.63 62837 -99.32 62792 -102.77 62749 -105.96 62706 -108.88 62664 -111.53 62625 -113.88 62587 -115.94 62553 -117.69 62522 -119.13 62494 -120.25 62470 -121.06 62450
4 0.83 -1741 0.45 -1643 0.29 -1532 0.35 -1415 0.64 -1298 1.19 -1185 2.00 -1083 3.07 -994 4.42 -923 6.03 -869 7.89 -834 10.00 -814 12.34 -808 14.88 -812 17.61 -822 20.51 -835 23.56 -848 26.73 -859 30.03 -866 33.43 -871 36.93 -872 40.51 -871 44.16 -869 47.87 -866 51.64 -863 55.43 -860 59.25 -858 63.07 -854 66.86 -848 70.62 -839 74.33 -826 77.96 -807 81.50 -782 84.94 -752 88.26 -717 91.45 -679 94.51 -638 97.42 -595 100.19 -553 102.79 -512 105.23 -474 107.50 -437 109.57 -402 111.46 -370 113.14 -340 114.60 -311 115.85 -285 116.87 -261 117.67 -242 118.24 -229 118.59 -224 118.71 -229 118.63 -247 118.34 -279 117.84 -327 117.15 -391 116.27 -470 115.20 -562 113.94 -664 112.47 -772 110.81 -883 108.94 -991 106.86 -1094 104.57 -1186 102.07 -1266 99.37 -1331 96.47 -1383 93.39 -1420 90.15 -1446 86.78 -1464 83.28 -1476 79.69 -1485 76.04 -1496 72.33 -1510 68.60 -1529 64.86 -1553 61.12 -1582 57.39 -1615 53.69 -1650 50.00 -1685 46.36 -1717 42.76 -1746 39.22 -1770 35.74 -1788 32.36 -1803 29.08 -1814 25.92 -1824 22.91 -1835 20.07 -1848 17.40 -1864 14.92 -1884 12.64 -1907 10.56 -1930 8.68 -1951 7.00 -1966 5.51 -1970 4.21 -1961 3.09 -1934 2.15 -1889 1.39 -1825
5 -12.79 11011 -12.99 10965 -13.00 10907 -12.85 10855 -12.54 10822 -12.11 10823 -11.56 10861 -10.90 10931 -10.10 11016 -9.13 11091 -7.98 11129 -6.59 11101 -4.96 10988 -3.07 10784 -0.93 10495 1.41 10147 3.90 9772 6.49 9415 9.11 9116 11.71 8908 14.25 8812 16.72 8829 19.13 8943 21.50 9121 23.88 9322 26.31 9502 28.85 9624 31.49 9666 34.25 9625 37.10 9516 39.97 9373 42.80 9241 45.53 9164 48.08 9180 50.43 9308 52.55 9544 54.49 9858 56.29 10196 58.04 10487 59.82 10657 61.72 10637 63.79 10378 66.06 9859 68.50 9093 71.06 8129 73.63 7042 76.08 5926 78.29 4883 80.14 4001 81.53 3349 82.42 2963 82.80 2841 82.72 2949 82.25 3225 81.51 3589 80.60 3962 79.60 4274 78.59 4481 77.58 4569 76.56 4555 75.46 4488 74.21 4435 72.71 4471 70.87 4666 68.64 5071 65.98 5711 62.92 6577 59.50 7630 55.83 8805 52.01 10020 48.17 11189 44.42 12231 40.84 13088 37.50 13722 34.40 14124 31.53 14313 28.87 14326 26.34 14211 23.89 14022 21.47 13804 19.06 13592 16.62 13404 14.19 13244 11.78 13104 9.42 12966 7.16 12814 5.02 12633 3.02 12419 1.17 12175 -0.55 11916 -2.14 11660 -3.65 11428 -5.07 11240 -6.43 11105 -7.72 11027 -8.93 10997 -10.03 11001 -10.99 11021 -11.78 11037 -12.39 11036
6 -9.67 1728 -10.16 1831 -10.60 1967 -11.02 2135 -11.41 2328 -11.76 2539 -12.04 2755 -12.23 2965 -12.26 3155 -12.09 3315 -11.64 3434 -10.87 3508 -9.74 3534 -8.22 3515 -6.34 3457 -4.14 3371 -1.69 3270 0.92 3166 3.57 3072 6.18 2999 8.66 2955 10.94 2943 13.02 2961 14.91 3006 16.66 3069 18.35 3141 20.07 3212 21.91 3272 23.94 3315 26.20 3336 28.68 3334 31.35 3314 34.13 3281 36.93 3242 39.63 3206 42.14 3181 44.39 3172 46.32 3182 47.94 3211 49.28 3257 50.39 3314 51.37 3376 52.29 3434 53.24 3482 54.26 3514 55.37 3527 56.55 3519 57.75 3490 58.91 3443 59.95 3380 60.79 3306 61.39 3222 61.72 3131 61.78 3033 61.62 2928 61.28 2812 60.85 2685 60.37 2546 59.91 2393 59.49 2229 59.11 2056 58.72 1881 58.26 1710 57.67 1550 56.86 1409 55.77 1294 54.34 1208 52.58 1154 50.49 1131 48.14 1135 45.60 1161 42.96 1200 40.29 1244 37.69 1287 35.19 1321 32.84 1341 30.61 1347 28.50 1338 26.45 1317 24.42 1290 22.35 1261 20.21 1237 17.99 1222 15.68 1221 13.31 1236 10.92 1266 8.55 1309 6.26 1362 4.08 1418 2.07 1474 0.23 1523 -1.42 1561 -2.89 1587 -4.17 1600 -5.30 1603 -6.28 1599 -7.14 1594 -7.89 1598 -8.56 1616 -9.15 1658

Now that we have reshaped the data, we can check for missing values, and remove them.

colSums(is.na(data.col))
## rownum     x1     y1     x2     y2     x3     y3     x4     y4     x5     y5 
##      0      0      0      1      1      1      1      1      1      1      1 
##     x6     y6     x7     y7     x8     y8     x9     y9    x10    y10    x11 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y11    x12    y12    x13    y13    x14    y14    x15    y15    x16    y16 
##      1      1      1      1      1      1      1      1      1      1      1 
##    x17    y17    x18    y18    x19    y19    x20    y20    x21    y21    x22 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y22    x23    y23    x24    y24    x25    y25    x26    y26    x27    y27 
##      1      1      1      1      1      1      1      1      1      1      1 
##    x28    y28    x29    y29    x30    y30    x31    y31    x32    y32    x33 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y33    x34    y34    x35    y35    x36    y36    x37    y37    x38    y38 
##      1      1      1      1      1      1      1      1      1      1      1 
##    x39    y39    x40    y40    x41    y41    x42    y42    x43    y43    x44 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y44    x45    y45    x46    y46    x47    y47    x48    y48    x49    y49 
##      1      1      1      1      1      1      1      1      1      1      1 
##    x50    y50    x51    y51    x52    y52    x53    y53    x54    y54    x55 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y55    x56    y56    x57    y57    x58    y58    x59    y59    x60    y60 
##      1      1      1      1      1      1      1      1      1      1      1 
##    x61    y61    x62    y62    x63    y63    x64    y64    x65    y65    x66 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y66    x67    y67    x68    y68    x69    y69    x70    y70    x71    y71 
##      1      1      1      1      1      1      1      1      1      1      1 
##    x72    y72    x73    y73    x74    y74    x75    y75    x76    y76    x77 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y77    x78    y78    x79    y79    x80    y80    x81    y81    x82    y82 
##      1      1      1      1      1      1      1      1      1      1      1 
##    x83    y83    x84    y84    x85    y85    x86    y86    x87    y87    x88 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y88    x89    y89    x90    y90    x91    y91    x92    y92    x93    y93 
##      1      1      1      1      1      1      1      1      1      1      1 
##    x94    y94    x95    y95    x96    y96    x97    y97    x98    y98    x99 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y99   x100   y100 
##      1      1      1
# Dropping missing values
data.col <- data.col %>% 
  drop_na()
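
The NA counts above (none in x1 and y1, exactly one in every later column) are consistent with a single row that carries only one coordinate pair. As a sketch, the same check could be done on the raw strings before reshaping; `countPairs` is a hypothetical helper, and the toy input below uses 3-point cards instead of the real 100-point ones.

```r
# Count the (x, y) pairs in each raw string without reshaping anything.
# countPairs is a hypothetical helper used only for this illustration.
countPairs <- function(curves) {
  lengths(strsplit(curves, ";", fixed = TRUE))
}

# Toy input: two complete 3-point cards and one truncated card
curves <- c("1,2;3,4;5,6", "0,0", "7,8;9,10;11,12")
n_pairs <- countPairs(curves)

# Rows with fewer pairs than expected (the real data would use 100 here)
short_rows <- which(n_pairs < 3)
print(n_pairs)
print(short_rows)
```

Checking the pair count directly makes it easy to see which cards are incomplete, rather than only how many values are missing per column.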

Because the dataset is large, we'll work with a subset first. Once the steps are settled, they can be repeated on the full data. Let's take roughly a quarter of the rows (9,000).

idx <- sample(nrow(data.col), 9000)
data.sample <- data.col[idx, ]

Now we are ready to wrangle and explore our data.

Data Wrangling & Exploration

Scaling

The surface dynocard is measured as load versus position of the polished rod above ground, and is then transformed into the downhole dynocard using the equipment parameters. What matters for interpretation is the shape and relative dimensions of the card, not the absolute readings. To make the data independent of equipment size, we can min-max scale it so that position and load both fall between 0 and 1. We can use the following function:

# Min-max scale inputs into a 1x1 box
scaleBox <- function(d) {
  
  # Column 1 is the id; x values sit in columns 2, 4, ..., 200
  # and y values in columns 3, 5, ..., 201
  x_cols <- seq(2, 200, 2)
  y_cols <- seq(3, 201, 2)
  
  # Range of x and y, taken over every row passed in
  range_x <- range(d[, x_cols])
  range_y <- range(d[, y_cols])
  
  # Min-max scale all values of x and y
  d[, x_cols] <- (d[, x_cols] - range_x[1]) / (range_x[2] - range_x[1])
  d[, y_cols] <- (d[, y_cols] - range_y[1]) / (range_y[2] - range_y[1])
  
  return(d)
  
}
data.scaled <- scaleBox(data.sample)
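
As written, scaleBox takes range() over every row it is given, so all cards in the sample share one scale. If the intent is for each card to fill its own unit box, which is an assumption on my part and not what the code above does, a per-card variant could look like the sketch below. The name `scaleBoxPerCard` is hypothetical, and the function assumes the same column layout (id first, then x1, y1, x2, y2, ...).

```r
# Scale every card (row) into its own unit box.
# Hypothetical variant; assumes column 1 = id, then x1, y1, x2, y2, ...
scaleBoxPerCard <- function(d) {
  x_cols <- seq(2, ncol(d), 2)
  y_cols <- seq(3, ncol(d), 2)

  # Min-max scale each row of a numeric matrix independently.
  # A completely flat row (max == min) would produce NaN here.
  rescale_rows <- function(m) {
    lo <- apply(m, 1, min)
    hi <- apply(m, 1, max)
    (m - lo) / (hi - lo)   # lo and hi have one entry per row and recycle down the rows
  }

  d[x_cols] <- as.data.frame(rescale_rows(as.matrix(d[, x_cols])))
  d[y_cols] <- as.data.frame(rescale_rows(as.matrix(d[, y_cols])))
  d
}

# Toy input: two 3-point cards recorded on very different scales
toy <- data.frame(id = 1:2,
                  x1 = c(0, 100), y1 = c(10, 1000),
                  x2 = c(5, 300), y2 = c(20, 3000),
                  x3 = c(10, 200), y3 = c(30, 2000))
scaled <- scaleBoxPerCard(toy)
print(scaled)
```

With per-card scaling both toy cards span 0 to 1 on each axis, whereas a shared scale would squeeze the smaller card into a corner of the box. Which behavior is preferable depends on whether the relative size of cards carries information for the classifier.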

Let's now view one of our transformed dynocards.

drawDynocard(data.scaled[3, ], is_clean = T)

Let's look at the principal components.

Principal component analysis

sample.pca <- PCA(data.scaled, scale.unit = F, quali.sup = c(1), graph = F, ncp = 10)
plot.PCA(x = sample.pca, choix = 'varcor', habillage = 10, select = 'contrib 10')

sample.pca$eig[1:10, ]
##          eigenvalue percentage of variance cumulative percentage of variance
## comp 1  0.273619131             47.1980194                          47.19802
## comp 2  0.207574392             35.8056111                          83.00363
## comp 3  0.026866964              4.6344255                          87.63806
## comp 4  0.017630684              3.0412104                          90.67927
## comp 5  0.002495220              0.4304138                          91.10968
## comp 6  0.002327396              0.4014648                          91.51115
## comp 7  0.002038564              0.3516428                          91.86279
## comp 8  0.001731051              0.2985983                          92.16139
## comp 9  0.001579043              0.2723775                          92.43376
## comp 10 0.001494278              0.2577560                          92.69152

From the analysis, we can see that the first 10 PCs cover 92.69% of the total variance, and that the first two alone account for 83.00%.
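
The cumulative percentages in the table come directly from the eigenvalues. The sketch below shows the same calculation on toy data using base R's `prcomp` rather than FactoMineR, so it illustrates the idea and does not reproduce the numbers above; the toy data, the 90% target and all variable names are assumptions for illustration.

```r
set.seed(1)

# Toy data: 200 observations of 6 correlated features built from 2 latent factors
latent <- matrix(rnorm(200 * 2), ncol = 2)
loadings <- matrix(runif(2 * 6, -1, 1), nrow = 2)
toy <- latent %*% loadings + matrix(rnorm(200 * 6, sd = 0.05), ncol = 6)

# Unscaled PCA, in the spirit of scale.unit = F in the call above
pca <- prcomp(toy, center = TRUE, scale. = FALSE)

eig <- pca$sdev^2                         # eigenvalues (variance per component)
cum_pct <- cumsum(eig) / sum(eig) * 100   # cumulative percentage of variance

# Smallest number of components that reaches a target share of variance
n_keep <- which(cum_pct >= 90)[1]
print(round(cum_pct, 2))
print(n_keep)
```

Picking the number of components from a variance target like this is one common convention; the cut-off itself is a judgment call.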

Clustering

Because our data is unlabeled, we have to label it ourselves. Labeling every card by hand would take a lot of time, so we use clustering to group similar dynocards first. We will compare two methods for embedding the cards in two dimensions, t-SNE and UMAP, and cluster each embedding with HDBSCAN.

set.seed(888)

# Drop the id column and remove duplicate cards
# (Rtsne checks for duplicate rows by default and stops if it finds any)
data.scaled <- data.scaled[,-1] %>% 
              distinct()

TSNE

library(Rtsne)
data.tsne <- Rtsne(data.scaled)
plot(data.tsne$Y)

The t-SNE embedding shows a large number of small groups. We'll cluster the embedded points using a density-based clustering algorithm, HDBSCAN.

data.tsne.cluster <- hdbscan(data.tsne$Y, minPts = 150)
data.tsne.cluster
## HDBSCAN clustering for 8719 objects.
## Parameters: minPts = 150
## The clustering contains 5 cluster(s) and 6230 noise points.
## 
##    0    1    2    3    4    5 
## 6230  218  443  160 1510  158 
## 
## Available fields: cluster, minPts, cluster_scores, membership_prob,
##                   outlier_scores, hc

Now that we have the labels let's see how well the points are separated.

data.tsne.labelled <- as_tibble(data.tsne$Y) %>% 
  rename( x = V1, y = V2) %>% 
  mutate(label = as.factor(data.tsne.cluster$cluster))

data.tsne.labelled %>% 
ggplot(aes(x = x, y = y , col = label)) +
  geom_point()

Let's profile the clusters found on the t-SNE embedding, to see if the groups are really different.

# Let's take a cluster and see how the dots perform
# data.tsne.labelled <- data.scaled %>% 
#   mutate(label = data.tsne.cluster$cluster) %>% 
#   relocate(label, .before = x1)
# 
# # Let's take a cluster and see how the dots perform
# temp <- data.tsne.labelled[1,] %>% 
#   group_by(label) %>% 
#   apply()
# 
# drawDynocard(as_tibble(temp), T)
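
The profiling step above is left commented out. One possible way to profile clusters, sketched here on toy data in base R, is to average the coordinates of all cards that share a label, giving one "mean card" per cluster. The helper name `profileClusters` and the toy cards are assumptions for illustration, not the author's method.

```r
# Average card per cluster: one row per label, one column per coordinate.
# profileClusters is a hypothetical helper used only for this illustration.
profileClusters <- function(cards, labels) {
  aggregate(cards, by = list(label = labels), FUN = mean)
}

# Toy input: four 2-point cards (x1, y1, x2, y2) in two clusters
cards <- data.frame(x1 = c(0, 0.2, 1, 0.8), y1 = c(0, 0.2, 1, 0.8),
                    x2 = c(1, 0.8, 0, 0.2), y2 = c(1, 0.8, 0, 0.2))
labels <- c(1, 1, 2, 2)

profile <- profileClusters(cards, labels)
print(profile)
```

With HDBSCAN output, label 0 marks noise points, so those rows would normally be filtered out before averaging.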

UMAP clustering

As a comparison, we'll cluster the points using the UMAP algorithm.

# Clustering and plotting
data.umap <- umap(data.scaled)
plot(data.umap$layout, type = 'p')

The UMAP results in finer clusters with a lot more separation.

data.umap.cluster <- hdbscan(data.umap$layout, minPts = 100)
data.umap.cluster
## HDBSCAN clustering for 8719 objects.
## Parameters: minPts = 100
## The clustering contains 35 cluster(s) and 1776 noise points.
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 1776  107  119  206  201  119  167  200  291  131  218  137  262  112  154  444 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31 
##  325  114  153  137  105  219  221  316  487  171  116  124  177  204  192  238 
##   32   33   34   35 
##  124  232  189  231 
## 
## Available fields: cluster, minPts, cluster_scores, membership_prob,
##                   outlier_scores, hc
# Labeling data.sample with the assigned clusters
data.umap.labelled <- as_tibble(data.umap$layout) %>% 
  rename(x = V1, y = V2) %>% 
  mutate(label = as.factor(data.umap.cluster$cluster))

data.umap.labelled %>% 
  ggplot(aes(x = x, y = y, col = as.factor(label))) +
  geom_jitter()

Visually inspecting the result, it seems that HDBSCAN has failed to find meaningful clusters in the UMAP embedding. No well-defined clusters can be seen.

To see if this is the case, let's profile the clusters to see how effective this approach is compared to t-SNE.

# data.umap.labelled <- data.scaled %>% 
#   mutate(label = as.factor(data.umap.cluster$cluster))
# 
# # Let's take a cluster and see how the dots perform
# temp <- data.umap.labelled %>% 
#   filter(label == 23) %>% 
#   select_if(is.numeric) %>% 
#   map(mean)
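
Alongside the visual check, the two HDBSCAN runs can be compared by how many points each one leaves unassigned. The sketch below only re-uses the counts printed in the outputs above (label 0 is noise); the variable names are made up for this illustration.

```r
# Cluster counts and noise points as printed by hdbscan above (label 0 = noise)
n_total <- 8719
noise <- c(tsne = 6230, umap = 1776)
clusters <- c(tsne = 5, umap = 35)

summary_tbl <- data.frame(
  embedding = names(noise),
  clusters = clusters,
  noise_pct = round(noise / n_total * 100, 1)
)
print(summary_tbl)
```

This gives about 71.5% noise for the t-SNE run and 20.4% for the UMAP run. A lower noise share does not by itself mean better clusters, since the two runs also use different minPts values, so it is best read together with the cluster profiles.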

From the clustering

Clustering

Neural Network Model

Model Evaluation

References


  1. Reuters. 2019. Indonesia's Pertamina targets higher crude oil output in 2020. Accessed 26 Aug 2020 from https://www.reuters.com/article/indonesia-pertamina-idUSL4N2881BP

  2. Sharaf, S. A. 2018. Beam Pump Dynamometer Card Prediction Using Artificial Neural Networks. In Sustainability and Resilience Conference.

  3. Sharaf, S. A., Bangert, P., Fardan, M., Khalil, A., Abubakr, M. and Ahmed, M. 2019. Beam Pump Dynamometer Card Classification Using Machine Learning. In SPE Middle East Oil and Gas Show and Conference.