Improving Oil Well Production

Background

Oil production is a capital-intensive business. A global estimate puts the cost of drilling an onshore well at USD 1-2 million. Drilling does not always find oil, but when a reserve is found, the driller has hit payday: productive wells can produce oil for years.

A newly producing oil well has natural pressure that pushes oil to the surface. However, as oil is extracted, the well pressure drops and the oil has to be lifted artificially. Artificial lifting allows oil to be extracted from wells with no natural flow, but installing it requires investment and increases operating costs.

Artificial lifts and related costs

The sucker-rod pump (SRP)

The standard method of artificial lifting is the beam pump, also known as the sucker-rod pump (SRP). The sucker-rod pump transfers its vertical bobbing motion to a reciprocating downhole pump inside the well. It acts like a bike-pump in reverse, pumping oil out of a reservoir.

Sucker-rod pump

Artificial lifting adds overhead and operating costs, so it is imperative that pumping operations run efficiently. Pumping efficiency is monitored through the downhole dynamometer graph, or dynocard. A dynocard shows how the pump inside the well functions. A good dynocard has an almost square shape.

Full pump dynocard

Pumps that do not function efficiently deviate from this shape.

Standing valve leak. The pump is inefficient because it returns some of the oil to the reservoir instead of pumping it to the surface.

Pumped-off well. This indicates that the downhole pump is not well submerged in the fluid.

Fluid pound. This indicates that the pumping rate is higher than the reservoir production.

Inefficiencies lower production and increase costs. The dynocard can also reveal unfavorable operating conditions such as fluid pound or a pumped-off well. These conditions increase wear and tear, which in the long run may damage the pump and necessitate downtime for repairs and replacement.

Problem statement

The downhole card is usually reviewed by petroleum engineers once a day, typically in the morning. However, production conditions may change throughout the day. Oil production runs 24 hours a day, 7 days a week, and it is impossible for a petroleum engineer to be on standby for that whole duration.

A 10% decrease in efficiency for 12 hours (outside working hours) in a conservative well in Indonesia (300 BOPD) results in a production loss of 15 barrels of oil. At the current price of USD 42 (Rp 630,000) per barrel, the loss to the company amounts to roughly Rp 10 million per day. This may seem small, but to put things in perspective, Pertamina produced an average of 414,400 barrels of oil per day (BOPD) in 2019 ¹. Remedying inefficiencies similar to our scenario across that output could add up to Rp 13 billion in revenue per day.
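The arithmetic behind these figures can be checked with a short sketch. It uses only the numbers quoted in the paragraph; the helper name lostBarrels is hypothetical, and applying the same 10% loss over 12 hours to the whole of Pertamina's output is the scenario assumed by the text, not a measured figure.

```r
# Figures quoted in the text
well_rate_bopd  <- 300      # example well, barrels of oil per day
efficiency_loss <- 0.10     # 10% drop in efficiency
hours_unwatched <- 12       # hours without an engineer watching
price_idr       <- 630000   # Rp per barrel (about USD 42)
pertamina_bopd  <- 414400   # 2019 average production

# Hypothetical helper: barrels lost when a fraction of production
# is lost for a given number of hours in a day
lostBarrels <- function(rate_bopd, loss, hours) {
  rate_bopd * loss * hours / 24
}

well_barrels   <- lostBarrels(well_rate_bopd, efficiency_loss, hours_unwatched)
well_loss_idr  <- well_barrels * price_idr
fleet_loss_idr <- lostBarrels(pertamina_bopd, efficiency_loss, hours_unwatched) * price_idr

format(c(well = well_loss_idr, fleet = fleet_loss_idr), big.mark = ",")
```

The single-well loss comes to Rp 9.45 million, which the text rounds to Rp 10 million, and the scaled-up figure comes to about Rp 13.05 billion.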

Proposed solution

A model that can classify the downhole card automatically in real time can monitor the performance of oil wells continuously, 24 hours a day, 7 days a week. The classification results can be used to diagnose problems and suggest how to improve production efficiency. This can improve efficiency and avoid costly downtime and repairs.

Project scope

Goal

The goal of this project is to train and deploy a machine learning model that can classify downhole dynocards in near real-time with 95% accuracy. This project is one part of a larger effort to provide actionable recommendations powered by the model.

To reach the stated goal, the project is divided into the following milestones:

Data acquisition
Data wrangling and exploratory analysis
Training and evaluating models
Model trial
Model deployment

Prior art and methodology

Several published papers have utilized machine learning for the analysis of downhole dynocards.

Sharaf² compared the performance of trained image recognition models (VGG, ResNeXt34 and ResNeXt50) on a dataset of 80 training and 16 test dynocard images. The dynocard images consist of 8 classes: normal, gas interference, gas interference severe, fluid pound severe, standing valve leak, worn pump, stuck piston and sand production. The best model is ResNet34 with 100% accuracy, followed by VGG with 87.5% accuracy and ResNeXt50 with 56.25% accuracy.

In 2019, Sharaf³ used various methods to classify downhole card data obtained in Bahrain. The data is represented as text, with each downhole card described by 100 pairs of scaled position vs load values. The models were trained on 22,298 manually-labelled images. The resulting models achieved 99.99% validation and 100% testing accuracies.

Methodology

The methodology that is used in this project is similar to the second method employed by Sharaf. However, before processing, each card is scaled uniformly and its dimensionality reduced using PCA. The diagram below summarizes the steps.

Methodology diagram

Data Preprocessing

library(tidyverse) # Wrangling and visualization
library(FactoMineR) # PCA
library(umap) # Visual clustering using UMAP
library(dbscan) # Clustering using HDBSCAN
library(factoextra) # Clustering functions
library(caret) # Confusion matrix
library(kableExtra) # Pretty printing tables

The data is obtained from records of several oil wells over the course of July 2020.

data <- read_csv('downholecard_data.csv')
glimpse(data)

## Rows: 36,647
## Columns: 2
## $ rownum          <dbl> 1, 11, 21, 31, 41, 51, 61, 71, 81, 91, 101, 111, 121,…
## $ CurveCoordinate <chr> "-121.55,62438;-121.7,62422;-121.52,62408;-121.01,623…

The data contains 36,647 rows and 2 columns:

rownum: row number in the dataset
CurveCoordinate: (x,y) coordinate pairs that make up the downhole card

To get a sense of the data, let's look at the first row.

data[1, 'CurveCoordinate']

## # A tibble: 1 x 1
##   CurveCoordinate                                                               
##   <chr>                                                                         
## 1 -121.55,62438;-121.7,62422;-121.52,62408;-121.01,62398;-120.17,62391;-119.02,…
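The string above packs a whole card into one field: points are separated by semicolons, and the position and load of each point by a comma. As a minimal base R sketch of that format, the hypothetical helper parseCard below (not part of the original analysis) splits a shortened string, taken from the first three points of the card shown, into a data frame of points.

```r
# Hypothetical helper: turn one "x,y;x,y;..." string into a data frame of points
parseCard <- function(s) {
  pairs <- strsplit(s, ";", fixed = TRUE)[[1]]             # one element per point
  xy <- do.call(rbind, strsplit(pairs, ",", fixed = TRUE)) # two-column character matrix
  data.frame(x = as.numeric(xy[, 1]), y = as.numeric(xy[, 2]))
}

# First three points of the card printed above
card <- parseCard("-121.55,62438;-121.7,62422;-121.52,62408")
card
```

Here x is the position and y the load, matching the axis labels used when the cards are drawn.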

Each row contains an id and a CurveCoordinate column holding the 100 pairs of position vs load data. The existing rownum values are not consecutive (1, 11, 21, ...), so we replace them with a sequential id to make referencing rows easier.

# Replace rownum with a sequential id
data$rownum <- seq_len(nrow(data))

Next, I introduce the function drawDynocard below to make it easier to draw dynocards.

drawDynocard <- function(d, is_clean) {

  if (!is_clean) {

    # Raw card: split the "x,y;x,y;..." string into one row per point
    row <- d %>%
      separate_rows(CurveCoordinate, sep = ";") %>%
      separate(CurveCoordinate, into = c('x', 'y'), sep = ",", convert = TRUE)

  } else {

    # Reshaped card: x values sit in the even columns, y values in the odd columns
    row <- tibble(
      rownum = d[[1]][1],
      x = as.numeric(unlist(d[1, seq(2, 200, 2)])),
      y = as.numeric(unlist(d[1, seq(3, 201, 2)]))
    )

  }

  # Draw dynocard
  row %>%
    add_row(row[1, ]) %>% # Repeat the first point to close the curve
    ggplot(aes(x = x, y = y)) +
    geom_path() +
    labs(title = paste('Downhole dynocard id:', d[[1]][1]), x = 'Position (in)', y = 'Load (lbs)')

}

Let's plot some cards.

# Draw five randomly chosen raw cards
idx <- sample(nrow(data), 5)

for (i in idx) {
  print(drawDynocard(data[i, ], is_clean = FALSE))
}

Reshaping data

The data in its current form cannot be used for analysis. CurveCoordinate has to be expanded so that each value sits in its own column.

# Create column names x1, y1, x2, y2, ..., x100, y100
columns <- paste0(c('x', 'y'), rep(1:100, each = 2))

# Separating values into columns
data.col <- data %>% 
  separate(CurveCoordinate, into = columns, sep = "[,;]", convert = TRUE)

head(data.col) %>% 
  kable() %>% 
  kable_styling()

rownum	x1	y1	x2	y2	x3	y3	x4	y4	x5	y5	x6	y6	x7	y7	x8	y8	x9	y9	x10	y10	x11	y11	x12	y12	x13	y13	x14	y14	x15	y15	x16	y16	x17	y17	x18	y18	x19	y19	x20	y20	x21	y21	x22	y22	x23	y23	x24	y24	x25	y25	x26	y26	x27	y27	x28	y28	x29	y29	x30	y30	x31	y31	x32	y32	x33	y33	x34	y34	x35	y35	x36	y36	x37	y37	x38	y38	x39	y39	x40	y40	x41	y41	x42	y42	x43	y43	x44	y44	x45	y45	x46	y46	x47	y47	x48	y48	x49	y49	x50	y50	x51	y51	x52	y52	x53	y53	x54	y54	x55	y55	x56	y56	x57	y57	x58	y58	x59	y59	x60	y60	x61	y61	x62	y62	x63	y63	x64	y64	x65	y65	x66	y66	x67	y67	x68	y68	x69	y69	x70	y70	x71	y71	x72	y72	x73	y73	x74	y74	x75	y75	x76	y76	x77	y77	x78	y78	x79	y79	x80	y80	x81	y81	x82	y82	x83	y83	x84	y84	x85	y85	x86	y86	x87	y87	x88	y88	x89	y89	x90	y90	x91	y91	x92	y92	x93	y93	x94	y94	x95	y95	x96	y96	x97	y97	x98	y98	x99	y99	x100	y100
1	-121.55	62438	-121.70	62422	-121.52	62408	-121.01	62398	-120.17	62391	-119.02	62389	-117.54	62390	-115.76	62396	-113.68	62406	-111.30	62421	-108.64	62441	-105.70	62464	-102.50	62491	-99.04	62521	-95.35	62553	-91.42	62587	-87.29	62623	-82.96	62660	-78.45	62698	-73.78	62737	-68.97	62778	-64.04	62820	-59.01	62864	-53.90	62909	-48.73	62955	-43.53	63003	-38.30	63052	-33.08	63103	-27.87	63154	-22.71	63206	-17.61	63258	-12.60	63310	-7.68	63362	-2.89	63415	1.77	63466	6.26	63517	10.58	63566	14.71	63614	18.63	63660	22.32	63704	25.78	63745	28.98	63782	31.92	63817	34.58	63849	36.95	63879	39.03	63906	40.79	63932	42.24	63956	43.36	63980	44.16	64002	44.63	64023	44.77	64042	44.58	64060	44.06	64074	43.22	64084	42.06	64089	40.58	64090	38.80	64085	36.72	64074	34.35	64059	31.69	64039	28.75	64015	25.55	63989	22.08	63960	18.38	63930	14.45	63899	10.30	63867	5.96	63834	1.44	63799	-3.23	63761	-8.04	63720	-12.96	63675	-17.98	63627	-23.08	63576	-28.23	63521	-33.41	63463	-38.62	63405	-43.82	63345	-49.01	63287	-54.16	63230	-59.25	63175	-64.27	63123	-69.20	63074	-74.00	63027	-78.67	62983	-83.19	62939	-87.52	62898	-91.66	62856	-95.59	62816	-99.29	62776	-102.74	62736	-105.94	62697	-108.87	62660	-111.52	62624	-113.88	62590	-115.95	62558	-117.70	62529	-119.15	62502	-120.28	62478	-121.08	62457
2	4.34	-3650	3.99	-3517	3.83	-3350	3.86	-3156	4.10	-2944	4.58	-2727	5.30	-2519	6.30	-2331	7.58	-2175	9.16	-2058	11.03	-1983	13.18	-1951	15.60	-1955	18.26	-1990	21.13	-2044	24.19	-2109	27.40	-2173	30.74	-2229	34.18	-2271	37.71	-2297	41.32	-2306	44.99	-2301	48.72	-2284	52.50	-2260	56.32	-2233	60.18	-2205	64.05	-2178	67.94	-2154	71.81	-2132	75.66	-2112	79.46	-2093	83.19	-2073	86.85	-2054	90.42	-2035	93.89	-2018	97.24	-2003	100.47	-1990	103.55	-1980	106.49	-1973	109.25	-1965	111.84	-1956	114.23	-1943	116.41	-1924	118.37	-1899	120.11	-1869	121.61	-1836	122.89	-1805	123.94	-1780	124.77	-1769	125.39	-1776	125.81	-1808	126.03	-1866	126.06	-1952	125.88	-2062	125.49	-2191	124.88	-2332	124.05	-2478	122.96	-2618	121.63	-2745	120.04	-2854	118.19	-2940	116.09	-3002	113.75	-3042	111.18	-3064	108.42	-3073	105.48	-3075	102.38	-3075	99.14	-3078	95.78	-3087	92.32	-3103	88.76	-3126	85.13	-3155	81.42	-3188	77.65	-3222	73.83	-3257	69.98	-3290	66.11	-3322	62.24	-3353	58.38	-3383	54.55	-3413	50.77	-3444	47.05	-3475	43.41	-3506	39.86	-3535	36.41	-3562	33.08	-3586	29.86	-3606	26.79	-3624	23.86	-3640	21.11	-3657	18.54	-3676	16.17	-3699	14.01	-3727	12.06	-3759	10.34	-3791	8.83	-3819	7.54	-3836	6.46	-3834	5.57	-3806	4.86	-3746
3	-121.54	62434	-121.70	62421	-121.52	62413	-121.03	62407	-120.20	62405	-119.05	62405	-117.59	62409	-115.81	62416	-113.72	62426	-111.34	62439	-108.67	62456	-105.73	62476	-102.51	62499	-99.05	62524	-95.34	62553	-91.42	62584	-87.28	62617	-82.94	62652	-78.43	62689	-73.76	62728	-68.95	62768	-64.02	62810	-58.99	62853	-53.88	62898	-48.71	62944	-43.50	62992	-38.28	63042	-33.06	63093	-27.86	63145	-22.70	63198	-17.60	63251	-12.59	63305	-7.67	63358	-2.88	63410	1.77	63461	6.27	63512	10.59	63561	14.72	63609	18.64	63656	22.33	63701	25.78	63745	28.98	63786	31.90	63826	34.55	63863	36.92	63897	38.98	63928	40.74	63956	42.18	63982	43.31	64003	44.12	64022	44.60	64038	44.75	64050	44.58	64060	44.08	64066	43.25	64069	42.10	64069	40.64	64066	38.86	64059	36.78	64048	34.40	64033	31.73	64015	28.79	63993	25.58	63968	22.12	63940	18.42	63909	14.49	63876	10.35	63841	6.02	63804	1.51	63766	-3.16	63727	-7.97	63687	-12.90	63645	-17.93	63602	-23.04	63557	-28.20	63511	-33.41	63464	-38.64	63415	-43.86	63366	-49.07	63315	-54.23	63265	-59.34	63214	-64.36	63164	-69.28	63115	-74.08	63067	-78.75	63019	-83.25	62973	-87.58	62927	-91.71	62881	-95.63	62837	-99.32	62792	-102.77	62749	-105.96	62706	-108.88	62664	-111.53	62625	-113.88	62587	-115.94	62553	-117.69	62522	-119.13	62494	-120.25	62470	-121.06	62450
4	0.83	-1741	0.45	-1643	0.29	-1532	0.35	-1415	0.64	-1298	1.19	-1185	2.00	-1083	3.07	-994	4.42	-923	6.03	-869	7.89	-834	10.00	-814	12.34	-808	14.88	-812	17.61	-822	20.51	-835	23.56	-848	26.73	-859	30.03	-866	33.43	-871	36.93	-872	40.51	-871	44.16	-869	47.87	-866	51.64	-863	55.43	-860	59.25	-858	63.07	-854	66.86	-848	70.62	-839	74.33	-826	77.96	-807	81.50	-782	84.94	-752	88.26	-717	91.45	-679	94.51	-638	97.42	-595	100.19	-553	102.79	-512	105.23	-474	107.50	-437	109.57	-402	111.46	-370	113.14	-340	114.60	-311	115.85	-285	116.87	-261	117.67	-242	118.24	-229	118.59	-224	118.71	-229	118.63	-247	118.34	-279	117.84	-327	117.15	-391	116.27	-470	115.20	-562	113.94	-664	112.47	-772	110.81	-883	108.94	-991	106.86	-1094	104.57	-1186	102.07	-1266	99.37	-1331	96.47	-1383	93.39	-1420	90.15	-1446	86.78	-1464	83.28	-1476	79.69	-1485	76.04	-1496	72.33	-1510	68.60	-1529	64.86	-1553	61.12	-1582	57.39	-1615	53.69	-1650	50.00	-1685	46.36	-1717	42.76	-1746	39.22	-1770	35.74	-1788	32.36	-1803	29.08	-1814	25.92	-1824	22.91	-1835	20.07	-1848	17.40	-1864	14.92	-1884	12.64	-1907	10.56	-1930	8.68	-1951	7.00	-1966	5.51	-1970	4.21	-1961	3.09	-1934	2.15	-1889	1.39	-1825
5	-12.79	11011	-12.99	10965	-13.00	10907	-12.85	10855	-12.54	10822	-12.11	10823	-11.56	10861	-10.90	10931	-10.10	11016	-9.13	11091	-7.98	11129	-6.59	11101	-4.96	10988	-3.07	10784	-0.93	10495	1.41	10147	3.90	9772	6.49	9415	9.11	9116	11.71	8908	14.25	8812	16.72	8829	19.13	8943	21.50	9121	23.88	9322	26.31	9502	28.85	9624	31.49	9666	34.25	9625	37.10	9516	39.97	9373	42.80	9241	45.53	9164	48.08	9180	50.43	9308	52.55	9544	54.49	9858	56.29	10196	58.04	10487	59.82	10657	61.72	10637	63.79	10378	66.06	9859	68.50	9093	71.06	8129	73.63	7042	76.08	5926	78.29	4883	80.14	4001	81.53	3349	82.42	2963	82.80	2841	82.72	2949	82.25	3225	81.51	3589	80.60	3962	79.60	4274	78.59	4481	77.58	4569	76.56	4555	75.46	4488	74.21	4435	72.71	4471	70.87	4666	68.64	5071	65.98	5711	62.92	6577	59.50	7630	55.83	8805	52.01	10020	48.17	11189	44.42	12231	40.84	13088	37.50	13722	34.40	14124	31.53	14313	28.87	14326	26.34	14211	23.89	14022	21.47	13804	19.06	13592	16.62	13404	14.19	13244	11.78	13104	9.42	12966	7.16	12814	5.02	12633	3.02	12419	1.17	12175	-0.55	11916	-2.14	11660	-3.65	11428	-5.07	11240	-6.43	11105	-7.72	11027	-8.93	10997	-10.03	11001	-10.99	11021	-11.78	11037	-12.39	11036
6	-9.67	1728	-10.16	1831	-10.60	1967	-11.02	2135	-11.41	2328	-11.76	2539	-12.04	2755	-12.23	2965	-12.26	3155	-12.09	3315	-11.64	3434	-10.87	3508	-9.74	3534	-8.22	3515	-6.34	3457	-4.14	3371	-1.69	3270	0.92	3166	3.57	3072	6.18	2999	8.66	2955	10.94	2943	13.02	2961	14.91	3006	16.66	3069	18.35	3141	20.07	3212	21.91	3272	23.94	3315	26.20	3336	28.68	3334	31.35	3314	34.13	3281	36.93	3242	39.63	3206	42.14	3181	44.39	3172	46.32	3182	47.94	3211	49.28	3257	50.39	3314	51.37	3376	52.29	3434	53.24	3482	54.26	3514	55.37	3527	56.55	3519	57.75	3490	58.91	3443	59.95	3380	60.79	3306	61.39	3222	61.72	3131	61.78	3033	61.62	2928	61.28	2812	60.85	2685	60.37	2546	59.91	2393	59.49	2229	59.11	2056	58.72	1881	58.26	1710	57.67	1550	56.86	1409	55.77	1294	54.34	1208	52.58	1154	50.49	1131	48.14	1135	45.60	1161	42.96	1200	40.29	1244	37.69	1287	35.19	1321	32.84	1341	30.61	1347	28.50	1338	26.45	1317	24.42	1290	22.35	1261	20.21	1237	17.99	1222	15.68	1221	13.31	1236	10.92	1266	8.55	1309	6.26	1362	4.08	1418	2.07	1474	0.23	1523	-1.42	1561	-2.89	1587	-4.17	1600	-5.30	1603	-6.28	1599	-7.14	1594	-7.89	1598	-8.56	1616	-9.15	1658

Now that we have reshaped the data, we can check for missing values, and remove them.

colSums(is.na(data.col))

## rownum     x1     y1     x2     y2     x3     y3     x4     y4     x5     y5 
##      0      0      0      1      1      1      1      1      1      1      1 
##     x6     y6     x7     y7     x8     y8     x9     y9    x10    y10    x11 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y11    x12    y12    x13    y13    x14    y14    x15    y15    x16    y16 
##      1      1      1      1      1      1      1      1      1      1      1 
##    x17    y17    x18    y18    x19    y19    x20    y20    x21    y21    x22 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y22    x23    y23    x24    y24    x25    y25    x26    y26    x27    y27 
##      1      1      1      1      1      1      1      1      1      1      1 
##    x28    y28    x29    y29    x30    y30    x31    y31    x32    y32    x33 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y33    x34    y34    x35    y35    x36    y36    x37    y37    x38    y38 
##      1      1      1      1      1      1      1      1      1      1      1 
##    x39    y39    x40    y40    x41    y41    x42    y42    x43    y43    x44 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y44    x45    y45    x46    y46    x47    y47    x48    y48    x49    y49 
##      1      1      1      1      1      1      1      1      1      1      1 
##    x50    y50    x51    y51    x52    y52    x53    y53    x54    y54    x55 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y55    x56    y56    x57    y57    x58    y58    x59    y59    x60    y60 
##      1      1      1      1      1      1      1      1      1      1      1 
##    x61    y61    x62    y62    x63    y63    x64    y64    x65    y65    x66 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y66    x67    y67    x68    y68    x69    y69    x70    y70    x71    y71 
##      1      1      1      1      1      1      1      1      1      1      1 
##    x72    y72    x73    y73    x74    y74    x75    y75    x76    y76    x77 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y77    x78    y78    x79    y79    x80    y80    x81    y81    x82    y82 
##      1      1      1      1      1      1      1      1      1      1      1 
##    x83    y83    x84    y84    x85    y85    x86    y86    x87    y87    x88 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y88    x89    y89    x90    y90    x91    y91    x92    y92    x93    y93 
##      1      1      1      1      1      1      1      1      1      1      1 
##    x94    y94    x95    y95    x96    y96    x97    y97    x98    y98    x99 
##      1      1      1      1      1      1      1      1      1      1      1 
##    y99   x100   y100 
##      1      1      1

# Dropping missing values
data.col <- data.col %>% 
  drop_na()
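Every column from x2 onward reports exactly one missing value, which suggests (but does not prove) that a single malformed card is responsible. A quick way to check is to count incomplete rows directly. The sketch below uses a small made-up data frame rather than the real data, and only base R functions.

```r
# Toy frame: one malformed card leaves NAs in every column after the first pair
toy <- data.frame(
  rownum = 1:3,
  x1 = c(0.1, 0.4, 0.7), y1 = c(10, 20, 30),
  x2 = c(0.2, NA, 0.8),  y2 = c(11, NA, 31)
)

na_per_column   <- colSums(is.na(toy))        # same view as the output above
incomplete_rows <- which(!complete.cases(toy)) # which rows carry the NAs
clean           <- toy[complete.cases(toy), ]  # base R counterpart of drop_na()

incomplete_rows
nrow(clean)
```

Applied to data.col before the drop, which(!complete.cases(data.col)) would show whether the missing values all sit in one row.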

Because the dataset is large, we'll work with a subset first. Once the steps are validated, we can repeat them on all the data. Let's sample roughly a quarter of the cards (9,000).

idx <- sample(nrow(data.col), 9000)
data.sample <- data.col[idx, ]

Now we are ready to wrangle and explore our data.

Data Wrangling & Exploration

Scaling

The dynocards are derived from load vs position measurements of the polished rod above ground, which are transformed into the downhole dynocard using the equipment parameters. The main features read from a dynocard are its shape and relative dimensions, not the absolute readings. To make the data device-independent, we can normalize it onto a fixed scale, which removes the dependence on equipment size. Let's scale the data so it falls between 0 and 1 using the following function.

# Min-max scale x and y into a 1x1 box, using the ranges across all cards in d
scaleBox <- function(d) {

  # Get range in x (even columns)
  range_x <- range(d[, seq(2, 200, 2)])
  min_x <- range_x[1]
  max_x <- range_x[2]

  # Get range in y (odd columns)
  range_y <- range(d[, seq(3, 201, 2)])
  min_y <- range_y[1]
  max_y <- range_y[2]

  # Min-max scale all values of x and y
  d[, seq(2, 200, 2)] <- (d[, seq(2, 200, 2)] - min_x) / (max_x - min_x)
  d[, seq(3, 201, 2)] <- (d[, seq(3, 201, 2)] - min_y) / (max_y - min_y)

  return(d)

}

data.scaled <- scaleBox(data.sample)
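Note that scaleBox takes the minimum and maximum over every card passed to it, so the call above scales cards relative to the whole sample rather than individually. If the intent is for each card to fill its own unit box regardless of the well it came from, a per-card variant could look like the sketch below. The function scaleCard and its x_cols and y_cols arguments are hypothetical names, not part of the original analysis, and the results reported in this document were not produced with it.

```r
# Hypothetical per-card variant: min-max scale every row on its own
scaleCard <- function(d, x_cols = seq(2, 200, 2), y_cols = seq(3, 201, 2)) {

  rescale_rows <- function(m) {
    lo <- apply(m, 1, min)
    hi <- apply(m, 1, max)
    span <- hi - lo
    span[span == 0] <- 1   # flat cards: avoid dividing by zero
    (m - lo) / span        # vectors recycle down columns, so row i uses lo[i]
  }

  d[x_cols] <- as.data.frame(rescale_rows(as.matrix(d[, x_cols])))
  d[y_cols] <- as.data.frame(rescale_rows(as.matrix(d[, y_cols])))
  d
}

# Two toy cards with three points each, on very different scales
cards <- data.frame(
  rownum = 1:2,
  x1 = c(0, 10),  y1 = c(100, -5),
  x2 = c(5, 20),  y2 = c(200, 0),
  x3 = c(10, 40), y3 = c(150, 5)
)
scaled <- scaleCard(cards, x_cols = c(2, 4, 6), y_cols = c(3, 5, 7))
scaled
```

With per-card scaling both toy cards span 0 to 1 on each axis, whereas a global range would squeeze the second card's loads into a narrow band. Which behaviour is preferable depends on whether absolute size differences between wells should influence the clusters.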

Let's now view one of our transformed dynocards.

drawDynocard(data.scaled[3, ], is_clean = TRUE)

Next, let's look at the principal components.

Principal component analysis

sample.pca <- PCA(data.scaled, scale.unit = F, quali.sup = c(1), graph = F, ncp = 10)
plot.PCA(x = sample.pca, choix = 'varcor', habillage = 10, select = 'contrib 10')

sample.pca$eig[1:10, ]

##          eigenvalue percentage of variance cumulative percentage of variance
## comp 1  0.273619131             47.1980194                          47.19802
## comp 2  0.207574392             35.8056111                          83.00363
## comp 3  0.026866964              4.6344255                          87.63806
## comp 4  0.017630684              3.0412104                          90.67927
## comp 5  0.002495220              0.4304138                          91.10968
## comp 6  0.002327396              0.4014648                          91.51115
## comp 7  0.002038564              0.3516428                          91.86279
## comp 8  0.001731051              0.2985983                          92.16139
## comp 9  0.001579043              0.2723775                          92.43376
## comp 10 0.001494278              0.2577560                          92.69152

From the analysis, we can see that the first 10 PCs cover 92.69% of the total variance, and the first four alone already cover 90.68%.
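The percentage columns in the table follow directly from the eigenvalues: each eigenvalue divided by their sum, then accumulated. The sketch below reproduces that calculation on synthetic data, using base R's prcomp instead of FactoMineR (the two use different variance divisors, but the percentages agree). The toy matrix and the 90% threshold are assumptions for illustration only.

```r
set.seed(42)

# Synthetic stand-in for the scaled cards: 6 columns driven by 2 latent factors
latent  <- matrix(rnorm(200 * 2), ncol = 2)
loading <- matrix(runif(2 * 6, -1, 1), nrow = 2)
toy     <- latent %*% loading + matrix(rnorm(200 * 6, sd = 0.05), ncol = 6)

# Unscaled PCA, mirroring scale.unit = F in the call above
pca <- prcomp(toy, center = TRUE, scale. = FALSE)

eig     <- pca$sdev^2             # eigenvalues of the covariance matrix
pct     <- 100 * eig / sum(eig)   # percentage of variance per component
cum_pct <- cumsum(pct)            # cumulative percentage of variance

# Smallest number of components that reaches the (assumed) 90% threshold
n_keep <- which(cum_pct >= 90)[1]

data.frame(eigenvalue = eig, pct = pct, cum_pct = cum_pct)
```

Applied to the table above, the same rule with a 90% threshold would keep four components.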

Clustering

Because our data is unlabeled, we have to label it ourselves. Labeling every card manually would take a lot of time; clustering reduces the work by grouping similar dynocards. We will embed the cards in two dimensions with two methods, t-SNE and UMAP, cluster each embedding, and compare the results.

set.seed(888)

# Drop the id column and remove duplicate cards (Rtsne rejects duplicated rows by default)
data.scaled <- data.scaled[, -1] %>%
  distinct()

TSNE

library(Rtsne)
data.tsne <- Rtsne(data.scaled)
plot(data.tsne$Y)

The t-SNE embedding shows many small groups. We'll cluster the embedded points using a density-based clustering algorithm, HDBSCAN.

data.tsne.cluster <- hdbscan(data.tsne$Y, minPts = 150)
data.tsne.cluster

## HDBSCAN clustering for 8719 objects.
## Parameters: minPts = 150
## The clustering contains 5 cluster(s) and 6230 noise points.
## 
##    0    1    2    3    4    5 
## 6230  218  443  160 1510  158 
## 
## Available fields: cluster, minPts, cluster_scores, membership_prob,
##                   outlier_scores, hc

Now that we have the labels, let's see how well the points are separated.

data.tsne.labelled <- as_tibble(data.tsne$Y) %>% 
  rename( x = V1, y = V2) %>% 
  mutate(label = as.factor(data.tsne.cluster$cluster))

data.tsne.labelled %>% 
ggplot(aes(x = x, y = y , col = label)) +
  geom_point()

Let's profile the clusters found on the t-SNE embedding to see if the groups are really different.

# Let's take a cluster and see how the dots perform
# data.tsne.labelled <- data.scaled %>% 
#   mutate(label = data.tsne.cluster$cluster) %>% 
#   relocate(label, .before = x1)
# 
# # Let's take a cluster and see how the dots perform
# temp <- data.tsne.labelled[1,] %>% 
#   group_by(label) %>% 
#   apply()
# 
# drawDynocard(as_tibble(temp), T)
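The profiling step is left commented out above. One way it could be done, sketched here under the assumption that the cluster labels line up row for row with the scaled cards, is to average every coordinate within each cluster and then draw the resulting mean card. The helper profileClusters is a hypothetical name and uses only base R; the toy cards have a single point each to keep the example short.

```r
# Hypothetical helper: mean card per cluster label
profileClusters <- function(cards, labels) {
  stopifnot(nrow(cards) == length(labels))
  aggregate(as.data.frame(cards), by = list(label = labels), FUN = mean)
}

# Toy example: four one-point "cards" in two clusters
cards  <- data.frame(x1 = c(0, 0.2, 1, 0.8), y1 = c(0, 0.2, 1, 0.6))
labels <- c(1, 1, 2, 2)

profile <- profileClusters(cards, labels)
profile
```

On the real data the call would be along the lines of profileClusters(data.scaled, data.tsne.cluster$cluster). Cluster 0 holds the HDBSCAN noise points, so its mean card is not a meaningful profile.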

UMAP clustering

As a comparison, we'll cluster the points using the UMAP algorithm.

# Clustering and plotting
data.umap <- umap(data.scaled)
plot(data.umap$layout, type = 'p')

The UMAP embedding shows finer groups with much more separation.

data.umap.cluster <- hdbscan(data.umap$layout, minPts = 100)
data.umap.cluster

## HDBSCAN clustering for 8719 objects.
## Parameters: minPts = 100
## The clustering contains 35 cluster(s) and 1776 noise points.
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 1776  107  119  206  201  119  167  200  291  131  218  137  262  112  154  444 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31 
##  325  114  153  137  105  219  221  316  487  171  116  124  177  204  192  238 
##   32   33   34   35 
##  124  232  189  231 
## 
## Available fields: cluster, minPts, cluster_scores, membership_prob,
##                   outlier_scores, hc
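One simple number for comparing the two runs is the share of cards that HDBSCAN left unlabelled (cluster 0). The sketch below only re-uses the counts printed in the two HDBSCAN summaries. Note that the runs used different minPts values (150 and 100), so this is a rough comparison rather than a like-for-like one.

```r
# Counts taken from the two HDBSCAN summaries
total_cards  <- 8719
noise_points <- c(tsne = 6230, umap = 1776)
n_clusters   <- c(tsne = 5, umap = 35)

noise_share    <- noise_points / total_cards   # fraction left as noise
labelled_cards <- total_cards - noise_points   # cards assigned to some cluster

data.frame(
  clusters  = n_clusters,
  labelled  = labelled_cards,
  noise_pct = round(100 * noise_share, 1)
)
```

Roughly 71% of the cards are noise on the t-SNE embedding against roughly 20% on the UMAP embedding, so the UMAP run labels far more cards, but spreads them over 35 clusters that still need to be checked for whether they are meaningfully different.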

# Label the UMAP embedding with the assigned clusters
data.umap.labelled <- as_tibble(data.umap$layout) %>% 
  rename(x = V1, y = V2) %>% 
  mutate(label = as.factor(data.umap.cluster$cluster))

data.umap.labelled %>% 
  ggplot(aes(x = x, y = y, col = as.factor(label))) +
  geom_jitter()

Visually inspecting the result, it seems that clustering on the UMAP embedding has failed: no well-defined clusters can be seen.

To see if this is the case, let's profile the clusters to see how effective this approach is compared to t-SNE.

# data.umap.labelled <- data.scaled %>% 
#   mutate(label = as.factor(data.umap.cluster$cluster))
# 
# # Let's take a cluster and see how the dots perform
# temp <- data.umap.labelled %>% 
#   filter(label == 23) %>% 
#   select_if(is.numeric) %>% 
#   map(mean)

Improving Oil Well Production

By Automatic Dynocards Interpretation

Anugerah Erlaut

8/21/2020