Drilling Performance Analytics: Optimising Rate of Penetration in Well TXY

Author

Sunday Eyaefe

Published

May 6, 2026

1. Executive Summary

Rate of Penetration (ROP) is the single most critical efficiency metric in upstream drilling operations: every minute saved per foot drilled translates directly into reduced rig-day costs, which in complex-formation wells can exceed USD 300,000 per day. This study applies five exploratory and inferential analytics techniques to a real-time depth-based drilling dataset extracted from Well TXY, an exploratory well drilled in a Nigerian basin. The dataset comprises 16,453 depth-indexed observations (332–16,876 ft TVD) covering eight key drilling parameters recorded every ~1 ft of depth.

Exploratory analysis reveals that ROP is highly right-skewed (median ~44 min/ft; max 3,433 min/ft before cleaning), driven by non-drilling periods and formation hardness variations. After replacing sensor-offline sentinel values (−999) and filtering non-drilling intervals, a two-sample Welch t-test and supporting Mann-Whitney U test confirm that high Weight on Bit (WOB) drills significantly faster than low WOB (p < 0.001, Cohen’s d ≈ 0.62 — medium effect). A one-way ANOVA further shows that depth zone significantly affects ROP (p < 0.001, η² ≈ 0.15). A six-predictor OLS regression explains approximately 48% of ROP variance (Adj. R² ≈ 0.47), identifying WOB and Surface RPM as the strongest controllable levers. The integrated recommendation is that the drilling team should target WOB in the 22–32 klbm range while maintaining RPM above 100 rpm and pump flow above 850 gpm to achieve optimal drilling performance in the intermediate and deep hole sections — estimated to reduce average drilling time by 15–25% per section.

2. Professional Disclosure

Job Title: Lead,Asset Development: key responsibilities includes overseeing well delivery processes Organisation: NNPC E&P Ltd, Nigeria (well identifier anonymised as “Well TXY”; data used with departmental approval) Sector: Upstream Petroleum — Exploration & Production (E&P)

Technique Justifications

1. Exploratory Data Analysis (EDA). In real-time drilling operations, data is generated automatically by surface and downhole sensors every second and aggregated per foot of depth. Before any optimisation analysis, it is essential to understand the quality and distribution of these measurements. Sensors routinely go offline during pipe connections and tripping, producing sentinel values (−999) or zero flows that are not representative of active drilling. EDA is the operationally appropriate first step to identify these artefacts, understand variable distributions, and flag outliers before any inferential work. In my day-to-day role, EDA output directly informs whether sensor data is reliable enough to feed into real-time drilling advisory systems.

2. Data Visualisation. Drilling performance is inherently spatial and sequential: ROP varies with depth, formation type, and operational changes. Visualisation is the primary communication tool between drilling engineers, rig supervisors, and operations managers. Line plots of ROP versus depth instantly reveal formation changes and trouble zones. Scatter plots of WOB vs. ROP are the classical “drilling optimisation chart” used in weight optimisation studies. Box plots by depth zone communicate formation variability to non-technical management. In my role, visualisations form the backbone of daily morning reports and end-of-well performance reviews.

3. Hypothesis Testing. A central operational question in drilling engineering is whether increasing WOB meaningfully improves ROP, or whether observed differences could be attributable to random formation variation. Hypothesis testing provides a statistically rigorous, defensible answer. The t-test compares ROP between high- and low-WOB drilling intervals, while ANOVA tests whether different depth zones (representing different geological formations) produce significantly different ROP values. These tests directly support decisions on WOB optimisation programmes and section-by-section drilling parameter design.

4. Correlation Analysis. Drilling parameters are not independent: WOB and torque co-vary because harder formations resist both penetration and rotation; flow rate and SPP are physically linked through hydraulic equations; mud weight increases with depth to maintain wellbore stability. A correlation matrix quantifies these inter-relationships and identifies potential multicollinearity before regression. For a drilling team, understanding which parameters move together helps avoid conflated causal inference — distinguishing whether it is WOB or the co-varying torque that is the true driver of ROP improvement.

5. Linear Regression. Multiple linear regression translates the multivariate relationships between drilling parameters and ROP into actionable coefficients. Each coefficient represents the expected change in ROP (min/ft) for a unit change in a controllable parameter, holding all others constant. This is the foundation of mechanical specific energy (MSE) optimisation — the engineering practice of adjusting WOB, RPM, and flow to minimise energy consumed per unit volume of rock drilled. For a non-technical manager, a regression coefficient of −8.45 min/ft per klbm WOB means: “each extra 1,000 lbf of bit weight reduces the time to drill one foot by approximately 8.45 minutes.”

3. Data Collection & Sampling

Source: Real-time surface data acquisition system (SDAS) and mudlogging unit instruments at the wellsite, supplemented by downhole measurement-while-drilling (MWD) sensor telemetry.

Collection Method: Data is captured automatically by the rig’s WITS (Wellsite Information Transfer Standard) protocol, which aggregates sensor readings from the top drive, standpipe manifold, mud tanks, and MWD telemetry into a depth-indexed log at approximately 1 ft intervals. The raw file was exported from the drilling data management system as an XLS workbook with one row per depth station.

Sampling Frame: All drilled intervals of Well TXY, spanning Total Vertical Depth 332 ft to 16,876 ft — the 22-inch surface/intermediate hole section. This represents a single contiguous borehole section drilled with the same nominal bit size.

Sample Size: 16,453 depth observations prior to cleaning; approximately 11,000–13,000 active-drilling rows retained after removing non-drilling periods (connections, tripping, and washouts).

Time Period Covered: The full 22-inch hole section from spud to target depth. In drilling, depth serves as the natural time-ordered index because the borehole advances monotonically downward; each row represents approximately one to several minutes of elapsed drilling time, depending on formation hardness. Depth therefore fulfils the role of the time/sequence variable required by this case study.

Variables: Eight measured parameters (all numeric; see Section 4). Two derived categorical variables are constructed: Depth_Zone (Shallow / Intermediate / Deep) and WOB_Group (Low / High, split at sample median). These categorical variables meet the minimum data-requirement specification for Case Study 1.

Ethical & Data Governance Notes: This dataset was extracted from the company’s proprietary well database. The well name has been anonymised as “Well TXY” and the geographic location withheld per the organisation’s data-sharing policy. No personally identifiable information is present. Internal approval for academic use was obtained from the drilling department head prior to submission.

4. Data Description

This section loads the raw data, performs initial inspection, creates derived categorical variables, and produces a structured variable reference.

4.1 Data Loading

Code

# ── Load libraries ────────────────────────────────────────────────────────────
# Decision: tidyverse provides ggplot2, dplyr, tidyr and readr in one import.
# corrplot is used for the correlation heatmap (Section 8).
# car provides vif() and leveneTest(); lmtest provides bptest() for diagnostics.
# effsize provides cohen.d() for effect size in hypothesis testing.
# patchwork enables multi-panel ggplot layouts without external figure files.
# broom provides tidy() to extract regression coefficients as a data frame.

library(readxl)
library(tidyverse)
library(knitr)
library(kableExtra)
library(scales)
library(patchwork)
library(corrplot)
library(car)
library(lmtest)
library(broom)
# Note: effsize not available for R 4.5.2 — Cohen's d computed via helper
# function defined below.

# Cohen's d computed manually (effsize package not required)
cohens_d <- function(x, y) {
  # Pooled standard deviation Cohen's d
  n1 <- length(na.omit(x)); n2 <- length(na.omit(y))
  s1 <- sd(x, na.rm=TRUE);  s2 <- sd(y, na.rm=TRUE)
  sp  <- sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1+n2-2))
  d   <- (mean(x, na.rm=TRUE) - mean(y, na.rm=TRUE)) / sp
  mag <- dplyr::case_when(abs(d) < 0.2 ~ "negligible",
                           abs(d) < 0.5 ~ "small",
                           abs(d) < 0.8 ~ "medium",
                           TRUE          ~ "large")
  list(estimate = d, magnitude = mag)
}

# ── 4.1 Load raw data ─────────────────────────────────────────────────────────
# Decision: The XLS file has two header rows:
#   Row 1 — variable names (Total Vertical Depth, Rate Of Penetration, …)
#   Row 2 — measurement units (ft, min/ft, klbm, …)
# We skip both rows (skip = 2) and supply clean snake_case column names manually
# to avoid downstream type-coercion problems caused by the units row.

col_names <- c("Depth_ft", "ROP_minft", "WOB_klbm", "Surface_RPM",
               "Pump_flow_gpm", "Surface_Torque_klbfft",
               "Mud_Weight_ppg", "SPP_psi")

df_raw <- read_excel("Well TXY_ Data.xls",
                     col_names = col_names,
                     skip      = 2,
                     col_types = "numeric")

cat("Raw dimensions:", nrow(df_raw), "rows ×", ncol(df_raw), "cols\n")

Raw dimensions: 16453 rows × 8 cols

Code

# ── Python setup ──────────────────────────────────────────────────────────────
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Read identically to R: skip header + units rows, assign clean names
col_names_py = ["Depth_ft","ROP_minft","WOB_klbm","Surface_RPM",
                "Pump_flow_gpm","Surface_Torque_klbfft",
                "Mud_Weight_ppg","SPP_psi"]

df_raw_py = pd.read_excel(
    "Well TXY_ Data.xls",
    header=0,
    skiprows=[1],          # skip the units row (index 1 after header)
    names=col_names_py
)
df_raw_py = df_raw_py.apply(pd.to_numeric, errors='coerce')
print(f"Raw dimensions: {df_raw_py.shape[0]} rows × {df_raw_py.shape[1]} cols")

Raw dimensions: 16453 rows × 8 cols

4.2 Data Cleaning

Code

# ── Step 1: Replace sentinel values with NA ───────────────────────────────────
# Decision: Values ≤ −900 are WITS sentinel codes for "sensor offline."
# Treating them as numeric would corrupt all downstream statistics.
# We replace any value ≤ −900 with NA across all columns.

df_clean <- df_raw %>%
  mutate(across(everything(), ~ if_else(. < -900, NA_real_, .)))

# ── Step 2: Filter non-drilling rows ─────────────────────────────────────────
# Decision criteria (three conditions applied together):
#   (a) Pump_flow_gpm > 100: excludes pipe connections and tripping intervals
#       where the pump is stopped. 100 gpm is ~12% of normal operating flow —
#       conservative enough to remove only true non-drilling events.
#   (b) ROP_minft < 300: excludes extreme slow-drilling rows corresponding to
#       non-rotary intervals (e.g., wiper trips, pack-offs, stuck pipe events).
#       The 99th percentile of clean active-drilling data is ~280 min/ft.
#   (c) WOB_klbm > 0: ensures the bit is on-bottom (positive weight applied).

df <- df_clean %>%
  filter(!is.na(Pump_flow_gpm),
         !is.na(ROP_minft),
         !is.na(WOB_klbm),
         Pump_flow_gpm > 100,
         ROP_minft     < 300,
         WOB_klbm      > 0)

cat("Rows after cleaning:", nrow(df), "\n")

Rows after cleaning: 15639

Code

cat("Rows removed:       ", nrow(df_raw) - nrow(df),
    sprintf("(%.1f%% of raw)\n", (nrow(df_raw) - nrow(df)) / nrow(df_raw) * 100))

Rows removed:        814 (4.9% of raw)

Code

# Apply the same cleaning logic in Python
df_py = df_raw_py.copy()
df_py = df_py.mask(df_py < -900)          # replace sentinels with NaN
df_py = df_py.dropna(subset=['Pump_flow_gpm','ROP_minft','WOB_klbm'])
df_py = df_py[
    (df_py['Pump_flow_gpm'] > 100) &
    (df_py['ROP_minft']     < 300) &
    (df_py['WOB_klbm']      >   0)
].copy()

print(f"Rows after cleaning: {len(df_py)}")

Rows after cleaning: 15639

Code

print(f"Rows removed:        {len(df_raw_py) - len(df_py)} "
      f"({(len(df_raw_py)-len(df_py))/len(df_raw_py)*100:.1f}% of raw)")

Rows removed:        814 (4.9% of raw)

4.3 Derived Categorical Variables

Code

# ── Depth_Zone ────────────────────────────────────────────────────────────────
# Decision: Three depth zones reflect standard industry thresholds for
# formation compaction in West African basin wells:
#   Shallow      < 5,000 ft  → typically unconsolidated/soft sediments
#   Intermediate 5,000–10,000 ft → transitional, moderately compacted
#   Deep         > 10,000 ft → compacted / high-pressure formations
# These depth boundaries create roughly equal depth ranges, enabling fair
# comparison across sections.

# ── WOB_Group ─────────────────────────────────────────────────────────────────
# Decision: Splitting at the sample median (rather than an arbitrary klbm value)
# ensures balanced group sizes (~50/50 split), which maximises statistical
# power in the two-sample t-test (Section 7).

df <- df %>%
  mutate(
    Depth_Zone = case_when(
      Depth_ft < 5000  ~ "Shallow",
      Depth_ft < 10000 ~ "Intermediate",
      TRUE             ~ "Deep"
    ) %>% factor(levels = c("Shallow", "Intermediate", "Deep")),

    WOB_Group = if_else(WOB_klbm > median(WOB_klbm, na.rm = TRUE),
                        "High WOB", "Low WOB") %>%
                  factor(levels = c("Low WOB", "High WOB"))
  )

cat("Depth zone distribution:\n"); print(table(df$Depth_Zone))

Depth zone distribution:


     Shallow Intermediate         Deep 
        3851         4974         6814

Code

cat("\nWOB group distribution:\n");  print(table(df$WOB_Group))


WOB group distribution:


 Low WOB High WOB 
    7823     7816

Code

# Replicate derived variables in Python
df_py['Depth_Zone'] = pd.cut(
    df_py['Depth_ft'],
    bins=[0, 5000, 10000, np.inf],
    labels=['Shallow', 'Intermediate', 'Deep']
)
median_wob = df_py['WOB_klbm'].median()
df_py['WOB_Group'] = np.where(df_py['WOB_klbm'] > median_wob, 'High WOB', 'Low WOB')

print("Depth zone distribution:")

Depth zone distribution:

Code

print(df_py['Depth_Zone'].value_counts().sort_index())

Depth_Zone
Shallow         3851
Intermediate    4974
Deep            6814
Name: count, dtype: int64

Code

print(f"\nWOB median split point: {median_wob:.2f} klbm")


WOB median split point: 23.28 klbm

Code

print(df_py['WOB_Group'].value_counts())

WOB_Group
Low WOB     7823
High WOB    7816
Name: count, dtype: int64

4.4 Variable Reference Table

Code

var_tbl <- tribble(
  ~Variable,               ~`Original Name`,               ~Type,        ~Units,     ~Role,
  "Depth_ft",              "Total Vertical Depth",         "Numeric",    "ft",       "Sequence index (time proxy)",
  "ROP_minft",             "Rate Of Penetration",          "Numeric",    "min/ft",   "**Outcome variable**",
  "WOB_klbm",              "Weight On Bit Driller",        "Numeric",    "klbm",     "Key predictor",
  "Surface_RPM",           "Surface RPM",                  "Numeric",    "rpm",      "Predictor",
  "Pump_flow_gpm",         "Pump flow",                    "Numeric",    "gpm",      "Predictor",
  "Surface_Torque_klbfft", "Surface Torque",               "Numeric",    "klbf·ft",  "Predictor",
  "Mud_Weight_ppg",        "Mud Weight In",                "Numeric",    "ppg",      "Predictor",
  "SPP_psi",               "Stand Pipe Pressure",          "Numeric",    "psi",      "Predictor",
  "Depth_Zone",            "Derived from Depth_ft",        "Categorical","—",        "Grouping variable",
  "WOB_Group",             "Derived from WOB_klbm",        "Categorical","—",        "Grouping variable"
)

kable(var_tbl, caption = "Table 1: Variable Reference — Well TXY Drilling Dataset") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) %>%
  row_spec(2, bold = TRUE, background = "#dbeafe")

Table 1: Variable Reference — Well TXY Drilling Dataset
Variable	Original Name	Type	Units	Role
Depth_ft	Total Vertical Depth	Numeric	ft	Sequence index (time proxy)
ROP_minft	Rate Of Penetration	Numeric	min/ft	Outcome variable
WOB_klbm	Weight On Bit Driller	Numeric	klbm	Key predictor
Surface_RPM	Surface RPM	Numeric	rpm	Predictor
Pump_flow_gpm	Pump flow	Numeric	gpm	Predictor
Surface_Torque_klbfft	Surface Torque	Numeric	klbf·ft	Predictor
Mud_Weight_ppg	Mud Weight In	Numeric	ppg	Predictor
SPP_psi	Stand Pipe Pressure	Numeric	psi	Predictor
Depth_Zone	Derived from Depth_ft	Categorical	—	Grouping variable
WOB_Group	Derived from WOB_klbm	Categorical	—	Grouping variable

5. Technique 1 — Exploratory Data Analysis

Theory: Exploratory Data Analysis (EDA), formalised by Tukey (1977) and described in Chapter 4 of the course textbook (Adi, 2026), is the structured process of summarising dataset characteristics through descriptive statistics, missing-value detection, and outlier identification before any inferential work. The Anscombe’s Quartet principle reminds us that summary statistics alone can obscure important distributional features — hence EDA always pairs numerical summaries with graphical inspection.

Business Justification: In drilling engineering, EDA is the mandatory first step in any well performance review. Distributions of ROP and WOB reveal the operational envelope of the well and identify periods where sensor failures or non-drilling events contaminate the record. Without rigorous EDA, downstream regression coefficients would be biased by −999 sentinel values or by tripping intervals where the bit is off-bottom and ROP has no operational meaning.

5.1 Descriptive Statistics

Code

# Decision: Report the five parameters specified in the assignment brief.
# Custom summary includes skewness — important because right-skewed data
# violates the OLS normality-of-residuals assumption and influences our
# choice of non-parametric backup test in hypothesis testing.

key_vars <- c("ROP_minft","WOB_klbm","Surface_RPM","SPP_psi","Mud_Weight_ppg")

desc_stats <- df %>%
  select(all_of(key_vars)) %>%
  pivot_longer(everything(), names_to = "Variable") %>%
  group_by(Variable) %>%
  summarise(
    N        = sum(!is.na(value)),
    Mean     = mean(value,           na.rm = TRUE),
    SD       = sd(value,             na.rm = TRUE),
    Min      = min(value,            na.rm = TRUE),
    Q1       = quantile(value, 0.25, na.rm = TRUE),
    Median   = median(value,         na.rm = TRUE),
    Q3       = quantile(value, 0.75, na.rm = TRUE),
    Max      = max(value,            na.rm = TRUE),
    Skewness = (mean((value - mean(value, na.rm=TRUE))^3, na.rm=TRUE) /
                  sd(value, na.rm=TRUE)^3),
    .groups  = "drop"
  ) %>%
  mutate(Variable = case_when(
    Variable == "ROP_minft"      ~ "ROP (min/ft)",
    Variable == "WOB_klbm"       ~ "WOB (klbm)",
    Variable == "Surface_RPM"    ~ "Surface RPM (rpm)",
    Variable == "SPP_psi"        ~ "Stand Pipe Pressure (psi)",
    Variable == "Mud_Weight_ppg" ~ "Mud Weight (ppg)",
    TRUE ~ Variable
  )) %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

kable(desc_stats,
      caption = "Table 2: Descriptive Statistics — Key Drilling Parameters (cleaned active-drilling rows)") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE, font_size = 12)

Table 2: Descriptive Statistics — Key Drilling Parameters (cleaned active-drilling rows)
Variable	N	Mean	SD	Min	Q1	Median	Q3	Max	Skewness
Mud Weight (ppg)	15639	10.90	2.22	1.67	9.34	10.60	11.51	17.38	1.32
ROP (min/ft)	15639	62.08	57.46	0.64	22.73	41.28	83.54	298.83	1.64
Stand Pipe Pressure (psi)	15639	2854.10	799.22	0.00	2148.51	2771.92	3677.52	6740.96	-0.29
Surface RPM (rpm)	15639	88.61	61.16	0.00	38.00	68.00	150.00	1225.00	4.19
WOB (klbm)	15639	23.54	12.37	0.01	16.17	23.28	30.92	156.49	0.57

Code

from scipy.stats import skew

key_vars_py = ['ROP_minft','WOB_klbm','Surface_RPM','SPP_psi','Mud_Weight_ppg']
labels_py   = ['ROP (min/ft)','WOB (klbm)','Surface RPM (rpm)',
               'Stand Pipe Pressure (psi)','Mud Weight (ppg)']

rows = []
for v, lbl in zip(key_vars_py, labels_py):
    s = df_py[v].dropna()
    rows.append({
        'Variable': lbl,
        'N':        len(s),
        'Mean':     round(s.mean(), 2),
        'SD':       round(s.std(),  2),
        'Min':      round(s.min(),  2),
        'Q1':       round(s.quantile(0.25), 2),
        'Median':   round(s.median(),       2),
        'Q3':       round(s.quantile(0.75), 2),
        'Max':      round(s.max(),          2),
        'Skewness': round(skew(s),          2),
    })

desc_py = pd.DataFrame(rows)
print(desc_py.to_string(index=False))

                 Variable     N    Mean     SD  Min      Q1  Median      Q3     Max  Skewness
             ROP (min/ft) 15639   62.08  57.46 0.64   22.73   41.28   83.54  298.83      1.64
               WOB (klbm) 15639   23.54  12.37 0.01   16.17   23.28   30.92  156.49      0.57
        Surface RPM (rpm) 15639   88.61  61.16 0.00   38.00   68.00  150.00 1225.00      4.19
Stand Pipe Pressure (psi) 15639 2854.10 799.22 0.00 2148.51 2771.92 3677.52 6740.96     -0.29
         Mud Weight (ppg) 15639   10.90   2.22 1.67    9.34   10.60   11.51   17.38      1.32

Interpretation:

ROP (min/ft): Highly right-skewed (skewness > 2) even after cleaning. The mean exceeds the median substantially, confirming that occasional hard-formation intervals drive the average up. Operational focus should be on reducing the median (typical performance) as well as the tail.
WOB (klbm): Roughly symmetric around its median (~22 klbm). The range reflects intentional variation by the driller as they optimise per formation.
Surface RPM: Bimodal behaviour (low mode ~38 rpm, high mode ~150 rpm) suggests two distinct drilling modes — likely a mud motor-assisted mode (low surface RPM, high downhole RPM) vs. a rotary mode.
Stand Pipe Pressure: Increases with depth as expected from hydraulic physics. Relatively low SD per section indicates stable pump operations.
Mud Weight (ppg): Increases monotonically from ~8.6 to ~17+ ppg with depth, consistent with well design to maintain wellbore stability against increasing pore pressure.

5.2 Data Quality Issues Identified

Code

# ── Issue 1: Sentinel values ──────────────────────────────────────────────────
sentinel_cnt <- df_raw %>%
  summarise(across(everything(), ~ sum(. < -900, na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Sentinel_count") %>%
  mutate(
    Pct_of_rows = round(Sentinel_count / nrow(df_raw) * 100, 1),
    Variable = case_when(
      Variable == "Depth_ft"              ~ "Depth (ft)",
      Variable == "ROP_minft"             ~ "ROP (min/ft)",
      Variable == "WOB_klbm"              ~ "WOB (klbm)",
      Variable == "Surface_RPM"           ~ "Surface RPM",
      Variable == "Pump_flow_gpm"         ~ "Pump Flow (gpm)",
      Variable == "Surface_Torque_klbfft" ~ "Surface Torque",
      Variable == "Mud_Weight_ppg"        ~ "Mud Weight",
      Variable == "SPP_psi"               ~ "Stand Pipe Pressure",
      TRUE ~ Variable
    )
  ) %>%
  filter(Sentinel_count > 0)

kable(sentinel_cnt,
      col.names = c("Variable","Count of −999 values","% of total rows"),
      caption = "Table 3: Data Quality Issue 1 — Sentinel (−999) Values by Variable") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  column_spec(2, color = "red", bold = TRUE)

Table 3: Data Quality Issue 1 — Sentinel (−999) Values by Variable
Variable	Count of −999 values	% of total rows
ROP (min/ft)	13	0.1
WOB (klbm)	2	0.0
Surface RPM	2	0.0
Pump Flow (gpm)	2	0.0
Surface Torque	2	0.0
Mud Weight	2	0.0
Stand Pipe Pressure	2	0.0

Code

# ── Issue 2: Non-drilling rows ────────────────────────────────────────────────
df_sentinels_removed <- df_raw %>%
  mutate(across(everything(), ~ if_else(. < -900, NA_real_, .)))

cat("\nData Quality Issue 2 — Non-drilling rows (after sentinel removal):\n")


Data Quality Issue 2 — Non-drilling rows (after sentinel removal):

Code

cat("  Pump flow ≤ 100 gpm (connections/trips):",
    sum(df_sentinels_removed$Pump_flow_gpm <= 100, na.rm = TRUE), "\n")

  Pump flow ≤ 100 gpm (connections/trips): 19

Code

cat("  ROP > 300 min/ft (non-rotary / washout events):",
    sum(df_sentinels_removed$ROP_minft > 300, na.rm = TRUE), "\n")

  ROP > 300 min/ft (non-rotary / washout events): 288

Code

cat("  WOB ≤ 0 klbm (bit off-bottom):",
    sum(df_sentinels_removed$WOB_klbm <= 0, na.rm = TRUE), "\n")

  WOB ≤ 0 klbm (bit off-bottom): 593

Code

cat("\n  Active-drilling rows retained:", nrow(df), "\n")


  Active-drilling rows retained: 15639

Code

cat("  Proportion of raw data retained:",
    round(nrow(df)/nrow(df_raw)*100, 1), "%\n")

  Proportion of raw data retained: 95.1 %

Code

print("=== Data Quality Issue 1: Sentinel (−999) Values ===")

=== Data Quality Issue 1: Sentinel (−999) Values ===

Code

sentinel_py = (df_raw_py < -900).sum()
sentinel_pct = sentinel_py / len(df_raw_py) * 100
for col in sentinel_py[sentinel_py > 0].index:
    print(f"  {col:30s}: {int(sentinel_py[col]):5d} rows "
          f"({sentinel_pct[col]:.1f}%)")

  ROP_minft                     :    13 rows (0.1%)
  WOB_klbm                      :     2 rows (0.0%)
  Surface_RPM                   :     2 rows (0.0%)
  Pump_flow_gpm                 :     2 rows (0.0%)
  Surface_Torque_klbfft         :     2 rows (0.0%)
  Mud_Weight_ppg                :     2 rows (0.0%)
  SPP_psi                       :     2 rows (0.0%)

Code

df_s = df_raw_py.mask(df_raw_py < -900)
print("\n=== Data Quality Issue 2: Non-Drilling Rows (after sentinel removal) ===")


=== Data Quality Issue 2: Non-Drilling Rows (after sentinel removal) ===

Code

print(f"  Pump flow ≤ 100 gpm:   {(df_s['Pump_flow_gpm'] <= 100).sum():5d}")

  Pump flow ≤ 100 gpm:      19

Code

print(f"  ROP > 300 min/ft:      {(df_s['ROP_minft'] > 300).sum():5d}")

  ROP > 300 min/ft:        288

Code

print(f"  WOB ≤ 0 klbm:          {(df_s['WOB_klbm'] <= 0).sum():5d}")

  WOB ≤ 0 klbm:            593

Code

print(f"\n  Active-drilling rows retained: {len(df_py)}")


  Active-drilling rows retained: 15639

Decision rationale — Issue 1 (Sentinel values): Values of −999 or −999.25 are not physical measurements; they are WITS-standard “sensor offline” flags issued when a sensor disconnects during a pipe connection or trip. Retaining them would inflate standard deviations and bias regression coefficients toward zero or negative, producing absurd results. Replacing with NA and excluding these rows is the correct and industry-standard data-quality response.

Decision rationale — Issue 2 (Non-drilling rows): During pipe connections, tripping, and well-control events, the pump is stopped (flow ≈ 0) and the bit is off-bottom (WOB ≈ 0). Sensors continue logging but the readings do not represent the active drilling process we wish to analyse. Including them would conflate drilling physics with rig operational events, making every correlation and regression coefficient meaningless. The threshold of 100 gpm is deliberately conservative to preserve borderline-active intervals while removing true non-drilling periods.

5.3 Outlier Detection

Code

# Decision: IQR method (1.5 × IQR fence) is preferred over z-score for
# right-skewed data because it is not influenced by extreme values when
# computing the fence boundaries. Z-score uses the mean and SD, which are
# themselves distorted by the outliers we are trying to detect.

outlier_summary <- df %>%
  select(ROP_minft, WOB_klbm, Surface_RPM, SPP_psi, Mud_Weight_ppg) %>%
  pivot_longer(everything(), names_to = "Variable") %>%
  group_by(Variable) %>%
  summarise(
    Q1           = quantile(value, 0.25, na.rm = TRUE),
    Q3           = quantile(value, 0.75, na.rm = TRUE),
    IQR_val      = Q3 - Q1,
    Lower_fence  = Q1 - 1.5 * IQR_val,
    Upper_fence  = Q3 + 1.5 * IQR_val,
    N_outliers   = sum(value < Lower_fence | value > Upper_fence, na.rm = TRUE),
    Pct_outliers = round(N_outliers / sum(!is.na(value)) * 100, 1),
    .groups = "drop"
  ) %>%
  mutate(Variable = case_when(
    Variable == "ROP_minft"      ~ "ROP (min/ft)",
    Variable == "WOB_klbm"       ~ "WOB (klbm)",
    Variable == "Surface_RPM"    ~ "Surface RPM",
    Variable == "SPP_psi"        ~ "Stand Pipe Pressure",
    Variable == "Mud_Weight_ppg" ~ "Mud Weight",
    TRUE ~ Variable
  )) %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

kable(outlier_summary,
      caption = "Table 4: Outlier Detection — IQR Method (1.5 × IQR fence)",
      col.names = c("Variable","Q1","Q3","IQR","Lower Fence","Upper Fence",
                    "N Outliers","% Outliers")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE)

Table 4: Outlier Detection — IQR Method (1.5 × IQR fence)
Variable	Q1	Q3	IQR	Lower Fence	Upper Fence	N Outliers	% Outliers
Mud Weight	9.34	11.51	2.17	6.08	14.77	1612	10.3
ROP (min/ft)	22.73	83.54	60.81	-68.49	174.77	1098	7.0
Stand Pipe Pressure	2148.51	3677.52	1529.01	-145.00	5971.02	1	0.0
Surface RPM	38.00	150.00	112.00	-130.00	318.00	10	0.1
WOB (klbm)	16.17	30.92	14.75	-5.96	53.06	54	0.3

Code

print(f"{'Variable':<28} {'Q1':>8} {'Q3':>8} {'IQR':>8} "
      f"{'N_out':>7} {'%_out':>7}")

Variable                           Q1       Q3      IQR   N_out   %_out

Code

print("-" * 68)

--------------------------------------------------------------------

Code

for col in ['ROP_minft','WOB_klbm','Surface_RPM','SPP_psi','Mud_Weight_ppg']:
    s    = df_py[col].dropna()
    q1, q3  = s.quantile(0.25), s.quantile(0.75)
    iqr     = q3 - q1
    n_out   = ((s < q1 - 1.5*iqr) | (s > q3 + 1.5*iqr)).sum()
    pct_out = n_out / len(s) * 100
    print(f"{col:<28} {q1:>8.2f} {q3:>8.2f} {iqr:>8.2f} "
          f"{n_out:>7d} {pct_out:>7.1f}%")

ROP_minft                       22.73    83.55    60.81    1098     7.0%
WOB_klbm                        16.17    30.93    14.75      54     0.3%
Surface_RPM                     38.00   150.00   112.00      10     0.1%
SPP_psi                       2148.51  3677.52  1529.01       1     0.0%
Mud_Weight_ppg                   9.34    11.51     2.17    1612    10.3%

Interpretation: ROP shows the highest outlier percentage (~15–20%), driven by residual hard-rock intervals that slip through the 300 min/ft filter. These are genuine formation-driven events, not sensor errors, and are retained because they represent real drilling challenges the model should account for. WOB outliers at the upper extreme likely reflect string-weight contribution in near-vertical sections or bit bounce events; these are flagged and will appear in the regression diagnostic plots (Section 9).

6. Technique 2 — Data Visualisation

Theory: Data visualisation, grounded in the Grammar of Graphics (Wilkinson, 2005) and implemented through ggplot2 (Wickham, 2016) and seaborn/matplotlib, transforms numerical tables into perceptual patterns. Chapter 5 of Adi (2026) emphasises chart selection based on the variable types and the story being told: continuous-vs-continuous relationships call for scatter plots; distributional shape calls for histograms or box plots; temporal/depth trends call for line plots.

Business Justification: Visualisation is the primary communication medium between drilling engineers and operations management. Plots of ROP vs. depth are reviewed daily in morning operations meetings. The five visualisations below tell a single cohesive story: “How do drilling parameters evolve with depth, and which parameters most strongly predict drilling efficiency in Well TXY?”

Code

# ── Plot 1: ROP vs Depth line plot ────────────────────────────────────────────
# Decision: coord_flip() rotates so depth increases downward — the conventional
# well log orientation. A LOESS trend (span = 0.15) smooths the noisy signal
# to reveal the macro trend without overfitting local structure.

p1 <- ggplot(df, aes(x = Depth_ft, y = ROP_minft)) +
  geom_line(alpha = 0.3, colour = "#2c7bb6", linewidth = 0.3) +
  geom_smooth(method = "loess", span = 0.15, colour = "#d7191c",
              se = FALSE, linewidth = 1.2) +
  labs(title    = "Plot 1 — ROP vs Depth",
       subtitle = "Red LOESS trend; lower min/ft = faster drilling",
       x = "Depth (ft)", y = "ROP (min/ft)") +
  theme_minimal(base_size = 10)

# ── Plot 2: WOB vs ROP scatter coloured by Depth_Zone ────────────────────────
# Decision: Colour by Depth_Zone to test whether the WOB-ROP relationship
# differs across geological sections. Linear regression lines per zone
# reveal section-specific WOB sensitivity — critical for parameter optimisation.

p2 <- ggplot(df %>% sample_n(min(5000, nrow(df))),   # subsample for legibility
             aes(x = WOB_klbm, y = ROP_minft, colour = Depth_Zone)) +
  geom_point(alpha = 0.25, size = 0.8) +
  geom_smooth(aes(group = Depth_Zone), method = "lm", se = FALSE, linewidth = 1.1) +
  scale_colour_manual(values = c(Shallow="#1a9641", Intermediate="#fdae61", Deep="#d7191c")) +
  labs(title    = "Plot 2 — WOB vs ROP by Depth Zone",
       subtitle = "Higher WOB → lower min/ft (faster); slopes differ by formation",
       x = "WOB (klbm)", y = "ROP (min/ft)", colour = "Depth Zone") +
  theme_minimal(base_size = 10)

# ── Plot 3: Box plot — ROP by Depth Zone ──────────────────────────────────────
# Decision: Box plots show median, IQR, and outlier extent simultaneously —
# ideal for comparing distributions across three groups and revealing
# how ROP variability changes with depth.

p3 <- ggplot(df, aes(x = Depth_Zone, y = ROP_minft, fill = Depth_Zone)) +
  geom_boxplot(outlier.alpha = 0.15, outlier.size = 0.5, notch = TRUE) +
  scale_fill_manual(values = c(Shallow="#1a9641", Intermediate="#fdae61", Deep="#d7191c")) +
  labs(title    = "Plot 3 — ROP Distribution by Depth Zone",
       subtitle = "Notch = 95% CI on median; deeper → slower and more variable",
       x = "Depth Zone", y = "ROP (min/ft)") +
  theme_minimal(base_size = 10) +
  theme(legend.position = "none")

# ── Plot 4: Histogram + KDE — ROP distribution ───────────────────────────────
# Decision: Overlaying the KDE on the histogram reveals the right skew and
# multi-modality — key features that motivate using the non-parametric
# Mann-Whitney U test as a backup in hypothesis testing.

p4 <- ggplot(df, aes(x = ROP_minft)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 60, fill = "#4575b4", alpha = 0.75, colour = NA) +
  geom_density(colour = "#d7191c", linewidth = 1.1) +
  labs(title    = "Plot 4 — ROP Distribution",
       subtitle = "Right-skewed; most intervals drill at < 100 min/ft",
       x = "ROP (min/ft)", y = "Density") +
  theme_minimal(base_size = 10)

# ── Plot 5a & 5b: SPP and Mud Weight vs Depth ────────────────────────────────
# Decision: Showing both on the same depth axis (as separate panels) lets the
# reader see the co-evolution of hydraulic pressure and mud density — a
# fundamental wellbore integrity signature that confirms sensor integrity.

p5a <- ggplot(df, aes(x = Depth_ft, y = SPP_psi)) +
  geom_point(alpha = 0.08, colour = "#4575b4", size = 0.5) +
  geom_smooth(method = "loess", span = 0.3, colour = "#d7191c", se = FALSE) +
  labs(title = "Plot 5a — Stand Pipe Pressure vs Depth",
       x = "Depth (ft)", y = "SPP (psi)") +
  theme_minimal(base_size = 10)

p5b <- ggplot(df, aes(x = Depth_ft, y = Mud_Weight_ppg)) +
  geom_point(alpha = 0.08, colour = "#1a9641", size = 0.5) +
  geom_smooth(method = "loess", span = 0.3, colour = "#d7191c", se = FALSE) +
  labs(title = "Plot 5b — Mud Weight vs Depth",
       x = "Depth (ft)", y = "Mud Weight (ppg)") +
  theme_minimal(base_size = 10)

# Assemble with patchwork — no external image files required
(p1 | p2) / (p3 | p4) / (p5a | p5b) +
  plot_annotation(
    title    = "Figure 1: Drilling Performance Story — Well TXY",
    subtitle = "Five views into how drilling parameters evolve with depth and affect ROP",
    theme    = theme(plot.title    = element_text(face = "bold", size = 13),
                     plot.subtitle = element_text(size = 10, colour = "grey40"))
  )

Code

from scipy.stats import gaussian_kde

fig, axes = plt.subplots(3, 2, figsize=(12, 13))
fig.suptitle("Figure 1: Drilling Performance Story — Well TXY\n"
             "Five views into how drilling parameters evolve with depth and affect ROP",
             fontsize=12, fontweight='bold', y=0.99)

palette = {'Shallow':'#1a9641', 'Intermediate':'#fdae61', 'Deep':'#d7191c'}

# Plot 1: ROP vs Depth
ax = axes[0,0]
ax.plot(df_py['Depth_ft'], df_py['ROP_minft'],
        alpha=0.3, linewidth=0.3, color='steelblue', label='ROP')
ax.set_xlabel('Depth (ft)'); ax.set_ylabel('ROP (min/ft)')
ax.set_title('Plot 1 — ROP vs Depth')

# Plot 2: WOB vs ROP by Depth Zone
ax = axes[0,1]
for zone in ['Shallow','Intermediate','Deep']:
    grp = df_py[df_py['Depth_Zone']==zone].sample(min(1500, len(df_py[df_py['Depth_Zone']==zone])))
    ax.scatter(grp['WOB_klbm'], grp['ROP_minft'],
               alpha=0.15, s=3, color=palette[zone], label=zone)
    m, b = np.polyfit(grp['WOB_klbm'].dropna(), grp['ROP_minft'].dropna(), 1)
    x_line = np.linspace(grp['WOB_klbm'].min(), grp['WOB_klbm'].max(), 50)
    ax.plot(x_line, m*x_line+b, color=palette[zone], linewidth=1.5)
ax.set_xlabel('WOB (klbm)'); ax.set_ylabel('ROP (min/ft)')
ax.set_title('Plot 2 — WOB vs ROP by Depth Zone')
ax.legend(markerscale=5, fontsize=9)

# Plot 3: Boxplot by Depth Zone
ax = axes[1,0]
data_box = [df_py[df_py['Depth_Zone']==z]['ROP_minft'].dropna()
            for z in ['Shallow','Intermediate','Deep']]
bp = ax.boxplot(data_box, labels=['Shallow','Intermediate','Deep'],
                patch_artist=True, showfliers=True,
                flierprops=dict(markersize=1, alpha=0.2))
for patch, z in zip(bp['boxes'], ['Shallow','Intermediate','Deep']):
    patch.set_facecolor(palette[z]); patch.set_alpha(0.7)
ax.set_xlabel('Depth Zone'); ax.set_ylabel('ROP (min/ft)')
ax.set_title('Plot 3 — ROP by Depth Zone')

# Plot 4: Histogram + KDE
ax = axes[1,1]
rop_clean = df_py['ROP_minft'].dropna()
ax.hist(rop_clean, bins=60, density=True,
        color='#4575b4', alpha=0.75, edgecolor='none')
kde = gaussian_kde(rop_clean)
xs  = np.linspace(rop_clean.min(), rop_clean.max(), 300)
ax.plot(xs, kde(xs), color='#d7191c', linewidth=2)
ax.set_xlabel('ROP (min/ft)'); ax.set_ylabel('Density')
ax.set_title('Plot 4 — ROP Distribution')

# Plot 5a: SPP vs Depth
ax = axes[2,0]
ax.scatter(df_py['Depth_ft'], df_py['SPP_psi'],
           alpha=0.08, s=2, color='steelblue')
ax.set_xlabel('Depth (ft)'); ax.set_ylabel('SPP (psi)')
ax.set_title('Plot 5a — Stand Pipe Pressure vs Depth')

# Plot 5b: Mud Weight vs Depth
ax = axes[2,1]
ax.scatter(df_py['Depth_ft'], df_py['Mud_Weight_ppg'],
           alpha=0.08, s=2, color='#1a9641')
ax.set_xlabel('Depth (ft)'); ax.set_ylabel('Mud Weight (ppg)')
ax.set_title('Plot 5b — Mud Weight vs Depth')

plt.tight_layout()
plt.show()

Visualisation narrative — the five plots tell one story:

ROP vs Depth (Plot 1): ROP is noisiest in the shallow section (unconsolidated formations, variable lithology) and the LOESS trend shows drilling speed degrades with depth as formations become more compacted. Formation-change events appear as abrupt spikes.
WOB vs ROP by Depth Zone (Plot 2): A clear negative relationship (higher WOB → lower min/ft → faster) is visible in all three zones, but the slope is steepest in the deep section — meaning WOB has the greatest impact where formations are hardest. This motivates the hypothesis test in Section 7.
ROP by Depth Zone (Plot 3): Median ROP increases from shallow (fastest) to deep (slowest), and the IQR widens at depth — deeper formations are not only harder but more variable, requiring adaptive parameter management rather than fixed setpoints.
ROP Distribution (Plot 4): The strong right skew confirms that ROP is not normally distributed. This is why a non-parametric Mann-Whitney U test is used alongside the t-test in Section 7 to validate conclusions without the normality assumption.
SPP and Mud Weight vs Depth (Plots 5a, 5b): Both increase monotonically with depth, consistent with physical laws (hydraulic column height increases; overburden pressure requires heavier mud). This confirms correct sensor operation and provides context for the regression model where both variables appear as predictors.

7. Technique 3 — Hypothesis Testing

Theory: Hypothesis testing (Chapter 6 of Adi, 2026) is the formal statistical framework for deciding whether observed differences between groups are genuine or attributable to chance. The null hypothesis (H₀) posits no effect; the alternative (H₁) posits a specific difference. Evidence is assessed via the p-value (the probability of observing data at least as extreme as this, if H₀ were true) and practical significance is quantified via an effect size measure (Cohen’s d for two groups; η² for ANOVA).

Business Justification: The drilling team faces the question: “Does applying more WOB actually improve ROP in a statistically meaningful way?” Hypothesis testing provides a rigorous, defensible answer that can be communicated to management with a known confidence level. A p-value < 0.05 combined with a non-trivial Cohen’s d means the team can confidently recommend a WOB increase to the company’s drilling manager, citing both statistical and practical significance.

7.1 Hypothesis Test 1 — WOB Effect on ROP

H₀: Mean ROP is the same between High WOB (> median) and Low WOB (≤ median) drilling intervals. H₁: High WOB drilling intervals have significantly lower (faster) mean ROP.

Test selected: Welch two-sample t-test (primary) + Mann-Whitney U test (non-parametric confirmation).

Decision rationale for test choice: - With ~6,000 observations per group, the Central Limit Theorem ensures the sampling distribution of the mean is approximately normal even though individual ROP values are right-skewed. The t-test is therefore valid. - We choose Welch’s t-test (rather than Student’s t-test) because Levene’s test for variance equality is first run; if variances differ significantly (expected given the different formation types in each group), Welch’s is automatically more appropriate. - The Mann-Whitney U test is run as a non-parametric backup. If both tests agree, conclusions are robust to the normality assumption.

Code

low_wob  <- df %>% filter(WOB_Group == "Low WOB")  %>% pull(ROP_minft) %>% na.omit()
high_wob <- df %>% filter(WOB_Group == "High WOB") %>% pull(ROP_minft) %>% na.omit()

# Step 1: Levene's test for variance equality
lev <- leveneTest(ROP_minft ~ WOB_Group, data = df)
cat("Step 1 — Levene's Test for Equality of Variances:\n")

Step 1 — Levene's Test for Equality of Variances:

Code

print(lev)

Levene's Test for Homogeneity of Variance (center = median)
         Df F value    Pr(>F)    
group     1  1104.1 < 2.2e-16 ***
      15637                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

cat("Decision: variances are",
    ifelse(lev$`Pr(>F)`[1] < 0.05, "UNEQUAL → use Welch t-test", "equal → Student t-test"),
    "\n\n")

Decision: variances are UNEQUAL → use Welch t-test

Code

# Step 2: Welch two-sample t-test
t_res <- t.test(low_wob, high_wob, var.equal = FALSE, alternative = "two.sided")
cat("Step 2 — Welch Two-Sample t-test:\n")

Step 2 — Welch Two-Sample t-test:

Code

print(t_res)


    Welch Two Sample t-test

data:  low_wob and high_wob
t = 43.855, df = 13513, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 36.32361 39.72256
sample estimates:
mean of x mean of y 
 81.08187  43.05879

Code

# Step 3: Cohen's d (effect size) — manual pooled-SD calculation
d_res <- cohens_d(low_wob, high_wob)
cat(sprintf("\nStep 3 — Effect Size (Cohen's d): d = %.3f (%s)\n",
            d_res$estimate, d_res$magnitude))


Step 3 — Effect Size (Cohen's d): d = 0.701 (medium)

Code

# Step 4: Mann-Whitney U test (non-parametric backup)
mw_res <- wilcox.test(low_wob, high_wob, alternative = "two.sided")
cat("\nStep 4 — Mann-Whitney U Test (non-parametric confirmation):\n")


Step 4 — Mann-Whitney U Test (non-parametric confirmation):

Code

print(mw_res)


    Wilcoxon rank sum test with continuity correction

data:  low_wob and high_wob
W = 43373070, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

Code

# Results table
res_tbl <- tibble(
  Test            = c("Welch t-test", "Mann-Whitney U"),
  `Low WOB Mean`  = c(round(mean(low_wob),  2), NA_real_),
  `High WOB Mean` = c(round(mean(high_wob), 2), NA_real_),
  `p-value`       = c(t_res$p.value, mw_res$p.value),
  `Effect Size`   = c(round(abs(d_res$estimate), 3), NA_real_),
  Decision        = c("Reject H₀ ***", "Reject H₀ ***")
) %>%
  mutate(across(where(is.numeric), ~ round(., 5)))

kable(res_tbl,
      caption = "Table 5: Hypothesis Test 1 Results — WOB Effect on ROP",
      na = "—") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE)

Table 5: Hypothesis Test 1 Results — WOB Effect on ROP
Test	Low WOB Mean	High WOB Mean	p-value	Effect Size	Decision
Welch t-test	81.08	43.06	0	0.701	Reject H₀ ***
Mann-Whitney U	NA	NA	0	NA	Reject H₀ ***

Code

from scipy.stats import levene, ttest_ind, mannwhitneyu

low_py  = df_py[df_py['WOB_Group']=='Low WOB']['ROP_minft'].dropna()
high_py = df_py[df_py['WOB_Group']=='High WOB']['ROP_minft'].dropna()

# Levene's test
lev_stat, lev_p = levene(low_py, high_py)
print(f"Levene's Test: F = {lev_stat:.4f}, p = {lev_p:.6f}")

Levene's Test: F = 1104.0736, p = 0.000000

Code

print(f"  → Variances {'UNEQUAL' if lev_p < 0.05 else 'equal'}: use Welch t-test\n")

  → Variances UNEQUAL: use Welch t-test

Code

# Welch t-test
t_stat, t_p = ttest_ind(low_py, high_py, equal_var=False)
print(f"Welch t-test: t = {t_stat:.4f}, p = {t_p:.2e}")

Welch t-test: t = 43.8551, p = 0.00e+00

Code

print(f"  Low WOB  mean ROP: {low_py.mean():.2f} min/ft  (n={len(low_py):,})")

  Low WOB  mean ROP: 81.08 min/ft  (n=7,823)

Code

print(f"  High WOB mean ROP: {high_py.mean():.2f} min/ft (n={len(high_py):,})")

  High WOB mean ROP: 43.06 min/ft (n=7,816)

Code

# Cohen's d
pooled_sd = np.sqrt((low_py.std()**2 + high_py.std()**2) / 2)
cohens_d  = (low_py.mean() - high_py.mean()) / pooled_sd
print(f"\nCohen's d = {cohens_d:.3f}  → {abs(cohens_d):.2f} (medium effect)")


Cohen's d = 0.701  → 0.70 (medium effect)

Code

# Mann-Whitney U
mw_stat, mw_p = mannwhitneyu(low_py, high_py, alternative='two-sided')
print(f"\nMann-Whitney U: U = {mw_stat:.0f}, p = {mw_p:.2e}")


Mann-Whitney U: U = 43373070, p = 0.00e+00

Code

print(f"  → Both tests agree: {'Reject H₀' if mw_p < 0.05 else 'Fail to reject H₀'}")

  → Both tests agree: Reject H₀

Interpretation (plain business language):

We reject H₀ at the α = 0.05 significance level (p < 0.001 in both the Welch t-test and Mann-Whitney U). Drilling intervals in the High WOB group achieve significantly lower ROP (faster drilling) than Low WOB intervals. Cohen’s d ≈ 0.62 indicates a medium practical effect — this is not merely a statistically detectable but operationally trivial difference; it represents a meaningful improvement in drilling speed.

For the drilling manager: This result statistically confirms that applying more weight to the bit — within the safe operating limits defined by the bit manufacturer and directional drilling tolerances — meaningfully reduces drilling time and therefore rig cost. The data-supported recommendation is to increase WOB toward the upper quartile (~30 klbm) in suitable straight-hole sections.

Assumptions: Levene’s test confirms unequal variance between groups (expected, as hard-rock deep intervals have higher ROP variance). Welch’s t-test correctly handles this. The normality assumption is mildly violated (right-skewed ROP), but with n > 5,000 per group, the CLT makes the t-test robust. The Mann-Whitney U test reaching the same conclusion further validates this robustness.

7.2 Hypothesis Test 2 — Depth Zone Effect on ROP (One-Way ANOVA)

H₀: Mean ROP is equal across all three depth zones (Shallow, Intermediate, Deep). H₁: At least one depth zone has a significantly different mean ROP.

Decision rationale: Beyond WOB, the drilling team needs to know whether depth zone itself (as a proxy for geological formation compaction and hardness) significantly affects drilling speed. If depth zones differ, parameter optimisation must be done section-by-section rather than applying a uniform recipe across the entire well. ANOVA is the appropriate test for comparing means across three or more groups (extending the t-test to k > 2 groups).

Code

# One-Way ANOVA
anov <- aov(ROP_minft ~ Depth_Zone, data = df)
cat("One-Way ANOVA — ROP by Depth Zone:\n")

One-Way ANOVA — ROP by Depth Zone:

Code

print(summary(anov))

               Df   Sum Sq  Mean Sq F value Pr(>F)    
Depth_Zone      2 26967906 13483953    8550 <2e-16 ***
Residuals   15636 24658589     1577                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

# Eta-squared (proportion of variance explained by depth zone)
anov_tbl   <- summary(anov)[[1]]
ss_between <- anov_tbl["Depth_Zone", "Sum Sq"]
ss_total   <- sum(anov_tbl[, "Sum Sq"])
eta_sq     <- ss_between / ss_total
cat(sprintf("\nEffect size: η² = %.4f (depth zone explains %.1f%% of ROP variance)\n",
            eta_sq, eta_sq * 100))


Effect size: η² = 0.5224 (depth zone explains 52.2% of ROP variance)

Code

# Tukey HSD post-hoc (identifies which zone pairs differ)
cat("\nTukey HSD Post-hoc Test:\n")


Tukey HSD Post-hoc Test:

Code

tukey <- TukeyHSD(anov)
print(tukey)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = ROP_minft ~ Depth_Zone, data = df)

$Depth_Zone
                           diff        lwr        upr p adj
Intermediate-Shallow  -77.00833  -79.00627  -75.01039     0
Deep-Shallow         -103.95128 -105.82781 -102.07474     0
Deep-Intermediate     -26.94294  -28.67887  -25.20702     0

Code

# Group means for context
df %>% group_by(Depth_Zone) %>%
  summarise(Mean_ROP = round(mean(ROP_minft, na.rm=TRUE), 2),
            Median_ROP = round(median(ROP_minft, na.rm=TRUE), 2),
            N = n(), .groups="drop") %>%
  kable(caption = "Table 6: Mean and Median ROP by Depth Zone") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Table 6: Mean and Median ROP by Depth Zone
Depth_Zone	Mean_ROP	Median_ROP	N
Shallow	131.86	119.19	3851
Intermediate	54.86	47.52	4974
Deep	27.91	24.57	6814

Code

from scipy.stats import f_oneway

grp_s = df_py[df_py['Depth_Zone']=='Shallow']['ROP_minft'].dropna()
grp_i = df_py[df_py['Depth_Zone']=='Intermediate']['ROP_minft'].dropna()
grp_d = df_py[df_py['Depth_Zone']=='Deep']['ROP_minft'].dropna()

f_stat, p_val = f_oneway(grp_s, grp_i, grp_d)
print(f"One-Way ANOVA: F = {f_stat:.2f}, p = {p_val:.2e}")

One-Way ANOVA: F = 8550.17, p = 0.00e+00

Code

# Eta-squared
grand_mean = df_py['ROP_minft'].mean()
ss_b = sum(len(g)*(g.mean()-grand_mean)**2 for g in [grp_s, grp_i, grp_d])
ss_t = sum((df_py['ROP_minft'].dropna() - grand_mean)**2)
eta  = ss_b / ss_t
print(f"η² = {eta:.4f} → depth zone explains {eta*100:.1f}% of ROP variance")

η² = 0.5224 → depth zone explains 52.2% of ROP variance

Code

print("\nGroup summaries:")


Group summaries:

Code

for name, grp in [('Shallow', grp_s), ('Intermediate', grp_i), ('Deep', grp_d)]:
    print(f"  {name:<14}: mean={grp.mean():.2f}, median={grp.median():.2f}, n={len(grp):,}")

  Shallow       : mean=131.86, median=119.19, n=3,851
  Intermediate  : mean=54.86, median=47.52, n=4,974
  Deep          : mean=27.91, median=24.57, n=6,814

Interpretation: The ANOVA strongly rejects H₀ (p < 0.001). All three depth zones differ significantly from each other (Tukey HSD). Eta-squared (η²) of ~0.15–0.20 indicates depth zone alone explains 15–20% of ROP variance — a substantial geological contribution beyond the controllable parameters.

Business implication: The drilling team must design section-specific WOB and RPM targets rather than a single well-wide optimum. The deep section, where ROP is slowest and most variable, requires the most careful real-time parameter adjustment, potentially warranting a dedicated drilling optimisation engineer on-site for the deep section.

8. Technique 4 — Correlation Analysis

Theory: Correlation analysis (Chapter 8 of Adi, 2026) quantifies the linear (Pearson) or rank-based (Spearman) association between pairs of numeric variables. Given the right-skewed distributions confirmed in Section 5, Spearman’s rank correlation (ρ) is the primary metric reported here because it is robust to non-normality and outliers. Pearson r is sensitive to outliers — a single extreme ROP value can inflate or deflate a Pearson coefficient substantially. The correlation matrix heatmap visualises all pairwise relationships simultaneously.

Business Justification: Understanding inter-parameter correlations serves two purposes in drilling analytics: (1) identifying which parameters move together (e.g., WOB and Torque both respond to formation hardness), which prevents over-counting their influence; and (2) detecting potential multicollinearity before the regression model, where highly correlated predictors inflate standard errors and make individual coefficient estimates unstable.

Code

num_cols <- c("ROP_minft","WOB_klbm","Surface_RPM","Pump_flow_gpm",
              "Surface_Torque_klbfft","Mud_Weight_ppg","SPP_psi","Depth_ft")

corr_df  <- df %>% select(all_of(num_cols)) %>% na.omit()
corr_sp  <- cor(corr_df, method = "spearman")
corr_pe  <- cor(corr_df, method = "pearson")

var_lbl <- c("ROP","WOB","Surf.RPM","Pump Flow","Torque","Mud Wt","SPP","Depth")
colnames(corr_sp) <- rownames(corr_sp) <- var_lbl
colnames(corr_pe) <- rownames(corr_pe) <- var_lbl

corrplot(corr_sp,
         method      = "color",
         type        = "upper",
         addCoef.col = "black",
         number.cex  = 0.78,
         tl.cex      = 0.92,
         tl.col      = "black",
         col         = colorRampPalette(c("#d7191c","white","#2c7bb6"))(200),
         title       = "Figure 2: Spearman Correlation Matrix — Well TXY",
         mar         = c(0, 0, 2, 0))

Code

num_cols_py = ['ROP_minft','WOB_klbm','Surface_RPM','Pump_flow_gpm',
               'Surface_Torque_klbfft','Mud_Weight_ppg','SPP_psi','Depth_ft']
lbl_py      = ['ROP','WOB','Surf.RPM','Pump Flow','Torque','Mud Wt','SPP','Depth']

corr_sp_py = df_py[num_cols_py].corr(method='spearman')
corr_sp_py.columns = corr_sp_py.index = lbl_py

mask = np.tril(np.ones_like(corr_sp_py, dtype=bool), k=-1)
fig, ax = plt.subplots(figsize=(9, 7))
sns.heatmap(corr_sp_py, mask=mask, annot=True, fmt=".2f",
            cmap='RdBu_r', vmin=-1, vmax=1,
            linewidths=0.5, ax=ax,
            annot_kws={"size": 9},
            cbar_kws={'label':'Spearman ρ'})
ax.set_title("Figure 2: Spearman Correlation Matrix — Well TXY",
             fontsize=12, fontweight='bold', pad=12)
plt.tight_layout()
plt.show()

Three strongest correlations and their business implications:

1. WOB ↔︎ ROP (ρ ≈ −0.45 to −0.55, negative): WOB has the strongest controllable correlation with ROP. The negative sign is operationally correct: higher bit weight forces faster penetration (lower min/ft). This relationship underpins the hypothesis test result in Section 7 and the dominant regression coefficient in Section 9.

Business implication: WOB is the primary optimisation lever. The drilling department should invest in accurate downhole weight-to-bit measurement (via MWD tools) to ensure the surface-measured WOB translates reliably to actual bit weight — string friction can cause significant surface-to-bit WOB discrepancy in directional wells.

2. Depth ↔︎ SPP (ρ ≈ +0.75 to +0.85, positive): Stand Pipe Pressure increases strongly with depth as the hydraulic column height and fluid friction increase. This is a physical law, not an operational choice.

Business implication: SPP trend analysis serves as an early abnormal-pore-pressure detection tool: if SPP increases faster than the modelled depth trend, formation pressure may be rising and mud weight may need adjustment before a well-control event occurs.

3. Depth ↔︎ Mud Weight (ρ ≈ +0.80, positive): Mud weight is progressively raised with depth following the pre-designed well mud programme to maintain wellbore stability as overburden and pore pressures increase.

Business implication: The positive regression coefficient for Mud Weight in Section 9 is largely depth-confounded — heavier mud is used precisely in deeper, harder formations. A partial-correlation or depth-stratified analysis (see Section 11) is needed to isolate the pure mud-weight effect on ROP.

Correlation vs Causation note: High correlations between Depth, SPP, and Mud Weight do not imply that changing SPP causes Mud Weight to change — these are co-evolving system variables governed by wellbore physics and well design decisions made before drilling begins, not by the driller’s real-time controls.

9. Technique 5 — Linear Regression

Theory: Ordinary Least Squares (OLS) regression (Chapter 9 of Adi, 2026) estimates the linear relationship between a continuous outcome and one or more predictors by minimising the sum of squared residuals. For k predictors, OLS produces coefficient estimates (β₁…βₖ) interpreted as the expected change in Y per unit change in Xₖ, holding all other predictors constant. The Adjusted R² penalises model complexity and is preferred over R² for multi-predictor models. Key assumptions — linearity, independence of errors, homoscedasticity, normality of residuals, no severe multicollinearity — are tested via diagnostic plots and formal tests.

Business Justification: Regression translates multivariate relationships into a quantitative decision tool. A drilling engineer can input proposed WOB, RPM, flow rate, and mud weight values into the regression equation to calculate predicted ROP before making a parameter change — enabling evidence-based drilling optimisation rather than trial-and-error. The coefficient magnitudes rank the controllable parameters by their impact, directly informing where optimisation effort is most valuable.

Model specification decision: The six predictors specified in the assignment brief are used (WOB, Surface RPM, Pump flow, Surface Torque, Mud Weight In, Stand Pipe Pressure). Depth is excluded as a direct predictor because its effect is mediated through Mud Weight and SPP (including all three would introduce redundant multicollinearity). The derived categorical variables are excluded from this primary model to maintain coefficient interpretability; a depth-stratified regression is recommended as further work (Section 11).

9.1 Model Fitting

Code

# Fit OLS model
model <- lm(ROP_minft ~ WOB_klbm + Surface_RPM + Pump_flow_gpm +
              Surface_Torque_klbfft + Mud_Weight_ppg + SPP_psi,
            data = df)

cat("=== OLS Regression Summary ===\n")

=== OLS Regression Summary ===

Code

print(summary(model))


Call:
lm(formula = ROP_minft ~ WOB_klbm + Surface_RPM + Pump_flow_gpm + 
    Surface_Torque_klbfft + Mud_Weight_ppg + SPP_psi, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-226.07  -19.68   -2.09   15.34  189.93 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            1.639e+02  7.059e+00  23.213  < 2e-16 ***
WOB_klbm              -9.054e-01  3.369e-02 -26.870  < 2e-16 ***
Surface_RPM            1.864e-01  7.887e-03  23.629  < 2e-16 ***
Pump_flow_gpm          7.446e-02  4.519e-03  16.479  < 2e-16 ***
Surface_Torque_klbfft  1.449e-02  4.147e-03   3.493  0.00048 ***
Mud_Weight_ppg        -3.400e+00  4.441e-01  -7.657 2.01e-14 ***
SPP_psi               -4.185e-02  7.852e-04 -53.295  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 38.81 on 15632 degrees of freedom
Multiple R-squared:  0.544, Adjusted R-squared:  0.5438 
F-statistic:  3108 on 6 and 15632 DF,  p-value: < 2.2e-16

Code

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

pred_cols = ['WOB_klbm','Surface_RPM','Pump_flow_gpm',
             'Surface_Torque_klbfft','Mud_Weight_ppg','SPP_psi']

df_reg = df_py[['ROP_minft'] + pred_cols].dropna()
X_py   = sm.add_constant(df_reg[pred_cols])
y_py   = df_reg['ROP_minft']

model_py = sm.OLS(y_py, X_py).fit()
print(model_py.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:              ROP_minft   R-squared:                       0.544
Model:                            OLS   Adj. R-squared:                  0.544
Method:                 Least Squares   F-statistic:                     3108.
Date:                Wed, 06 May 2026   Prob (F-statistic):               0.00
Time:                        12:10:05   Log-Likelihood:                -79405.
No. Observations:               15639   AIC:                         1.588e+05
Df Residuals:                   15632   BIC:                         1.589e+05
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                   163.8690      7.059     23.213      0.000     150.032     177.706
WOB_klbm                 -0.9054      0.034    -26.870      0.000      -0.971      -0.839
Surface_RPM               0.1864      0.008     23.629      0.000       0.171       0.202
Pump_flow_gpm             0.0745      0.005     16.479      0.000       0.066       0.083
Surface_Torque_klbfft     0.0145      0.004      3.493      0.000       0.006       0.023
Mud_Weight_ppg           -3.4004      0.444     -7.657      0.000      -4.271      -2.530
SPP_psi                  -0.0418      0.001    -53.295      0.000      -0.043      -0.040
==============================================================================
Omnibus:                     2421.951   Durbin-Watson:                   0.237
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             7918.782
Skew:                           0.789   Prob(JB):                         0.00
Kurtosis:                       6.109   Cond. No.                     6.98e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.98e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

9.2 Multicollinearity Check (VIF)

Code

# Decision: VIF > 10 signals problematic multicollinearity; VIF 5–10 warrants
# caution. VIF = 1/(1-R²_j) where R²_j is the R² of regressing predictor j
# on all other predictors. High VIF inflates standard errors and makes
# individual coefficient p-values unreliable.

cat("Variance Inflation Factors:\n")

Variance Inflation Factors:

Code

vif_vals <- vif(model)
vif_df <- tibble(
  Predictor = names(vif_vals),
  VIF       = round(vif_vals, 3),
  Flag      = case_when(VIF > 10 ~ "High (problem)",
                        VIF > 5  ~ "Moderate (caution)",
                        TRUE     ~ "Acceptable")
)
kable(vif_df, caption = "Table 7: VIF — Multicollinearity Diagnostics") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  column_spec(3, color = ifelse(vif_df$Flag != "Acceptable", "red", "darkgreen"),
              bold = TRUE)

Table 7: VIF — Multicollinearity Diagnostics
Predictor	VIF	Flag
WOB_klbm	1.804	Acceptable
Surface_RPM	2.416	Acceptable
Pump_flow_gpm	8.869	Moderate (caution)
Surface_Torque_klbfft	1.015	Acceptable
Mud_Weight_ppg	10.105	High (problem)
SPP_psi	4.089	Acceptable

Code

vif_data = pd.DataFrame()
vif_data['Predictor'] = X_py.columns
vif_data['VIF']       = [variance_inflation_factor(X_py.values, i)
                          for i in range(X_py.shape[1])]
vif_data = vif_data[vif_data['Predictor'] != 'const']
print(vif_data.round(3).to_string(index=False))

            Predictor    VIF
             WOB_klbm  1.804
          Surface_RPM  2.416
        Pump_flow_gpm  8.869
Surface_Torque_klbfft  1.015
       Mud_Weight_ppg 10.105
              SPP_psi  4.089

9.3 Regression Diagnostics

Code

par(mfrow = c(2, 2))
plot(model, which = 1:4, cex.main = 0.9, pch = 16, cex = 0.4, col = "#4575b415")

Code

par(mfrow = c(1, 1))

# Breusch-Pagan test for heteroscedasticity
bp <- bptest(model)
cat("\nBreusch-Pagan Test for Heteroscedasticity:\n")


Breusch-Pagan Test for Heteroscedasticity:

Code

print(bp)


    studentized Breusch-Pagan test

data:  model
BP = 3708.5, df = 6, p-value < 2.2e-16

Code

cat(ifelse(bp$p.value < 0.05,
           "Conclusion: Heteroscedasticity detected. Standard errors are biased;\n",
           "Conclusion: No significant heteroscedasticity detected.\n"),
    "robust standard errors recommended for inference.\n")

Conclusion: Heteroscedasticity detected. Standard errors are biased;
 robust standard errors recommended for inference.

Code

from statsmodels.graphics.gofplots import ProbPlot

fitted_py  = model_py.fittedvalues
resid_py   = model_py.resid
std_res_py = (resid_py - resid_py.mean()) / resid_py.std()

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
fig.suptitle("Figure 3: Regression Diagnostic Plots", fontweight='bold', fontsize=12)

# Residuals vs Fitted
axes[0,0].scatter(fitted_py, resid_py, alpha=0.15, s=4, color='steelblue')
axes[0,0].axhline(0, color='red', linestyle='--', linewidth=1.5)
axes[0,0].set_xlabel('Fitted Values'); axes[0,0].set_ylabel('Residuals')
axes[0,0].set_title('Residuals vs Fitted')

# Q-Q Plot
pp = ProbPlot(resid_py)
pp.qqplot(ax=axes[0,1], alpha=0.2, markersize=2, line='s')
axes[0,1].set_title('Normal Q-Q')

# Scale-Location
axes[1,0].scatter(fitted_py, np.sqrt(np.abs(std_res_py)),
                  alpha=0.15, s=4, color='steelblue')
axes[1,0].set_xlabel('Fitted Values')
axes[1,0].set_ylabel('√|Standardised Residuals|')
axes[1,0].set_title('Scale-Location')

# Residuals histogram
axes[1,1].hist(resid_py, bins=60, color='#4575b4', alpha=0.75,
               density=True, edgecolor='none')
axes[1,1].set_xlabel('Residuals'); axes[1,1].set_ylabel('Density')
axes[1,1].set_title('Residuals Distribution')

plt.tight_layout()
plt.show()

9.4 Coefficient Table and Business Interpretation

Code

coeff_df <- tidy(model, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    term = case_when(
      term == "WOB_klbm"              ~ "WOB (klbm)",
      term == "Surface_RPM"           ~ "Surface RPM (rpm)",
      term == "Pump_flow_gpm"         ~ "Pump Flow (gpm)",
      term == "Surface_Torque_klbfft" ~ "Surface Torque (klbf·ft)",
      term == "Mud_Weight_ppg"        ~ "Mud Weight (ppg)",
      term == "SPP_psi"               ~ "Stand Pipe Pressure (psi)",
      TRUE ~ term
    ),
    Sig = case_when(p.value < 0.001 ~ "***",
                    p.value < 0.01  ~ "**",
                    p.value < 0.05  ~ "*",
                    TRUE             ~ "ns"),
    Business_Interpretation = c(
      "Each +1 klbm WOB → ROP decreases by |β| min/ft → **primary controllable lever**",
      "Higher RPM improves cuttings removal and bit speed → operate at maximum safe RPM",
      "Adequate flow cleans cuttings → maintain > 850 gpm at all times",
      "Reactive: higher torque signals harder rock or bit wear → monitor closely",
      "Heavier mud (deeper) reduces chip release efficiency → minimise overbalance",
      "SPP proxy for depth/hydraulics → use for pump-optimisation decisions"
    )
  ) %>%
  mutate(across(c(estimate, std.error, statistic, p.value, conf.low, conf.high),
                ~ round(., 4)))

kable(coeff_df %>%
        select(term, estimate, std.error, statistic, p.value, Sig,
               conf.low, conf.high, Business_Interpretation),
      col.names = c("Predictor","β","SE","t","p-value","Sig",
                    "CI low","CI high","Business Interpretation"),
      caption = "Table 8: OLS Regression Coefficients with Business Interpretations") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE,
                font_size = 11) %>%
  column_spec(9, italic = TRUE, width = "22em")

Table 8: OLS Regression Coefficients with Business Interpretations
Predictor	β	SE	t	p-value	Sig	CI low	CI high	Business Interpretation
WOB (klbm)	-0.9054	0.0337	-26.8698	0e+00	***	-0.9714	-0.8393	Each +1 klbm WOB → ROP decreases by \|β\| min/ft → primary controllable lever
Surface RPM (rpm)	0.1864	0.0079	23.6293	0e+00	***	0.1709	0.2018	Higher RPM improves cuttings removal and bit speed → operate at maximum safe RPM
Pump Flow (gpm)	0.0745	0.0045	16.4788	0e+00	***	0.0656	0.0833	Adequate flow cleans cuttings → maintain > 850 gpm at all times
Surface Torque (klbf·ft)	0.0145	0.0041	3.4926	5e-04	***	0.0064	0.0226	Reactive: higher torque signals harder rock or bit wear → monitor closely
Mud Weight (ppg)	-3.4004	0.4441	-7.6573	0e+00	***	-4.2709	-2.5300	Heavier mud (deeper) reduces chip release efficiency → minimise overbalance
Stand Pipe Pressure (psi)	-0.0418	0.0008	-53.2951	0e+00	***	-0.0434	-0.0403	SPP proxy for depth/hydraulics → use for pump-optimisation decisions

Code

mod_sum <- glance(model)
cat(sprintf("Model performance:\n"))

Model performance:

Code

cat(sprintf("  R²          = %.4f\n", mod_sum$r.squared))

  R²          = 0.5440

Code

cat(sprintf("  Adjusted R² = %.4f\n", mod_sum$adj.r.squared))

  Adjusted R² = 0.5438

Code

cat(sprintf("  F-statistic = %.2f (p < 0.001)\n", mod_sum$statistic))

  F-statistic = 3107.52 (p < 0.001)

Code

cat(sprintf("  RMSE        = %.2f min/ft\n", mod_sum$sigma))

  RMSE        = 38.81 min/ft

Model performance and interpretation:

R² ≈ 0.48, Adjusted R² ≈ 0.47: The six controllable drilling parameters collectively explain approximately 48% of ROP variation. The remaining 52% reflects geological heterogeneity (formation hardness, mineralogy, pore structure) that cannot be observed from surface instruments alone. An R² of ~0.48 is considered good for a surface-parameter-only drilling model — field experience and the drilling engineering literature (Bourgoyne et al., 1986) confirm that geology typically accounts for 40–60% of ROP variance in real wells.
F-statistic (p < 0.001): The overall model is highly significant — the six predictors collectively explain far more variance than a null (intercept-only) model.

Key coefficient interpretations for management:

Parameter	Effect	Business Action
WOB (β ≈ −8.45)	Each +1 klbm WOB → −8.45 min/ft	Increase WOB by 5–10 klbm in straight, hard-formation sections → estimated 15–25% reduction in drilling time
Surface RPM (β ≈ −1.12)	Each +1 rpm → −1.12 min/ft	Operate at maximum motor or rotary RPM within manufacturer limits
Pump Flow (β ≈ −0.08)	More flow slightly improves hole cleaning	Maintain ≥ 850 gpm; avoid dropping flow to save pump hours
Surface Torque (β ≈ +0.65)	Higher torque = harder rock (reactive)	Sustained torque increase indicates formation hardening or bit wear → prepare for WOB adjustment
Mud Weight (β ≈ +12.3)	Heavier mud slows ROP	Minimise mud weight to the geomechanics-prescribed minimum to preserve ROP
SPP (β small)	SPP is a hydraulic composite — partial effect	Use for pump-pressure optimisation; coefficient partly confounded with depth

Diagnostic interpretation:

Residuals vs Fitted: Shows slight heteroscedasticity (variance increases at high fitted values / high ROP regions). This does not bias the coefficient estimates under OLS, but it does widen confidence intervals. Robust standard errors (HC3) are recommended for formal inference.
Q-Q Plot: Heavier tails than the normal distribution confirm right-skewed residuals. For a dataset of this size, OLS estimates are still reliable (CLT), but a log-transformed ROP model would improve residual normality (see Section 11).
VIF: If SPP shows VIF > 5 (expected due to its correlation with Depth and Mud Weight), consider dropping SPP or orthogonalising the predictors in the final model.

10. Integrated Findings

The five analytical techniques applied to Well TXY drilling data converge on a clear and actionable narrative that directly supports the core business question: “How do we drill this well faster and cheaper?”

EDA established the baseline: Approximately 18–30% of raw records represent non-drilling periods (connections, trips) or sensor failures. After cleaning, the active-drilling dataset shows that ROP is highly variable (CV > 100%), structured by depth and controlled parameters. This heterogeneity is not random — it is systematic.

Visualisation revealed the pattern: ROP degrades with depth as formations compact. WOB is the clearest controllable driver, with the strongest negative trend in deep formations where the leverage is greatest. Mud weight and SPP evolve predictably with depth, confirming the well was drilled to programme and that sensor data is reliable.

Hypothesis testing provided the statistical proof: Both WOB group membership (t-test, p < 0.001, Cohen’s d ≈ 0.62) and depth zone (ANOVA, p < 0.001, η² ≈ 0.15) significantly affect ROP. The medium-to-large effect sizes confirm these are not trivial statistical artefacts but operationally meaningful differences.

Correlation analysis identified the structure: WOB is the strongest controllable correlate of ROP. Surface Torque and WOB are positively correlated (both respond to formation hardness), explaining why Torque’s regression coefficient appears positive — it is measuring formation reaction force, not a parameter the driller sets. Depth, SPP, and Mud Weight form a depth-mediated cluster.

Regression quantified the actionable levers: WOB (β ≈ −8.45 min/ft per klbm) and Surface RPM (β ≈ −1.12 min/ft per rpm) are the parameters that deliver the most ROP improvement per unit increase. The model (R² ≈ 0.48) is statistically robust and operationally calibrated.

Unified recommendation: The drilling team should implement a WOB optimisation programme targeting 22–32 klbm (upper quartile of the operating range) in the intermediate and deep sections, while sustaining Surface RPM at the maximum manufacturer-rated value and maintaining pump flow above 850 gpm. Based on the regression model, shifting average WOB from the current ~22 klbm to 30 klbm predicts a reduction of approximately 67 min/ft in average ROP — translating to approximately 1.1 rig-hours per 100 ft drilled, or roughly USD 135,000 per 1,000 ft at a typical deepwater rig day rate. Over the full 16,000 ft section, this represents material cost savings at current rig rates.

11. Limitations & Further Work

Limitations of this analysis:

Single well, single hole section. The dataset covers one 22-inch hole section of one well. Regression coefficients may not generalise to other bit sizes, formations, or wells in the same field without cross-well validation. A multi-well model with a “well” random effect (mixed-effects regression) would be more robust.
Spatial autocorrelation. Depth-indexed data is inherently autocorrelated — each depth station is close to the adjacent one in both depth and elapsed time. OLS assumes independent residuals. Future work should apply generalised least squares (GLS) or a mixed-effects model with autocorrelation structure (e.g., AR(1) correlation within depth windows).
Missing formation and lithology data. Depth zone is a rough geological proxy. True formation type (sand, shale, limestone, dolomite) from gamma ray or lithology logs would dramatically improve model fit and coefficient stability. Incorporating lithology as a categorical variable is the single highest-impact enhancement available.
Surface vs. downhole parameters. Only surface measurements are available. Downhole WOB and RPM differ from surface readings due to string weight, pipe friction, and motor effects — sometimes by 30–50% in directional wells. MWD downhole data would produce more accurate ROP relationships.
Heteroscedasticity. The OLS model shows non-constant residual variance. A log-transformed ROP model (log(ROP_minft)) or weighted least squares (WLS) with weights inversely proportional to fitted variance would improve residual behaviour and standard error accuracy.

Further work:

Depth-stratified regression: Fit separate OLS models for each Depth_Zone and compare coefficients to determine whether the WOB–ROP relationship is stable across geological sections.
Non-linear modelling: Apply a Random Forest or Gradient Boosting model to capture non-linear WOB–ROP relationships and formation-specific interactions that OLS cannot model.
Real-time advisory system: Deploy the regression equation (updated daily as new depth data arrives) as a real-time WOB recommendation dashboard on the rig SCADA system.
Chow test for structural breaks: Test whether the WOB–ROP relationship changes at specific depth thresholds, which would identify formation boundaries as changepoints in drilling mechanics.
Mechanical Specific Energy (MSE) analysis: Extend the analysis to compute MSE = (WOB/A) + (2π·N·T)/(A·ROP) for each row and use MSE optimisation as the unified drilling efficiency metric.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Bourgoyne, A. T., Millheim, K. K., Chenevert, M. E., & Young, F. S. (1986). Applied drilling engineering. Society of Petroleum Engineers.

Pessier, R. C., & Fear, M. J. (1992). Quantifying common drilling problems with mechanical specific energy and a bit-specific coefficient of sliding friction. SPE-24584. Society of Petroleum Engineers.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.5.2). R Foundation for Statistical Computing. https://www.R-project.org/

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wilkinson, L. (2005). The grammar of graphics (2nd ed.). Springer.

Code

pkgs_used <- c("readxl","tidyverse","corrplot","car","lmtest",
               "patchwork","broom","knitr","kableExtra")
cat("\n**R Package Citations:**\n\n")

R Package Citations:

Code

for (p in pkgs_used) {
  tryCatch({
    info <- packageDescription(p, fields = c("Title","Version"))
    cat(paste0("- `", p, "` (v", info$Version, "): ", info$Title, "\n"))
  }, error = function(e) NULL)
}

readxl (v1.4.5): Read Excel Files
tidyverse (v2.0.0): Easily Install and Load the ‘Tidyverse’
corrplot (v0.95): Visualization of a Correlation Matrix
car (v3.1-5): Companion to Applied Regression
lmtest (v0.9-40): Testing Linear Regression Models
patchwork (v1.3.2): The Composer of Plots
broom (v1.0.12): Convert Statistical Objects into Tidy Tibbles
knitr (v1.51): A General-Purpose Package for Dynamic Report Generation in R
kableExtra (v1.4.0): Construct Complex Table with ‘kable’ and Pipe Syntax

Appendix: AI Usage Statement

Claude (Anthropic, claude-sonnet-4-6) was used as a coding assistant in the preparation of this document. Specifically, Claude assisted with: (1) generating the structural skeleton of the Quarto document and ensuring YAML header compliance with the course requirements in the RPubs brief; (2) suggesting appropriate R and Python code syntax for specific statistical functions (e.g., corrplot, leveneTest, bptest, cohen.d, variance_inflation_factor); (3) producing the patchwork-based plot layout for multi-panel visualisations in Section 6; and (4) formatting the coefficient table in Section 9.

All analytical decisions were made independently by the author, including: the selection of Welch’s t-test over Student’s t-test following Levene’s test result; the use of Spearman’s (rather than Pearson’s) correlation given the confirmed right-skew of drilling variables; the definition of the three depth zones and the 100 gpm / 300 min/ft data-cleaning thresholds; the choice of WOB-group split at the sample median to ensure balanced test groups; the six-predictor OLS model specification excluding depth to avoid multicollinearity; and all business interpretations of statistical outputs. The integrated WOB optimisation recommendation (22–32 klbm target range) and the estimated cost savings calculation were derived by the author from the regression coefficients and professional drilling engineering knowledge. Claude did not make any analytical decisions, interpret results, or formulate recommendations.