Non-Productive Time (NPT) Analytics in Offshore Drilling Operations

Author

Hajara Oiza Asuba

Published

May 17, 2026

1. Executive Summary

Non-Productive Time (NPT) represents one of the most significant cost drivers in offshore drilling operations, consuming both rig time and financial resources without generating value. This study analyses 346 NPT incidents recorded across 27 offshore wells operated between 2004 and 2010, covering a combined direct cost of approximately USD 97.5 million. The dataset — extracted from a well-intervention and drilling management system — contains key variables including the date of incident, well name, responsible contractor, operation type, equipment failure description, duration in hours, and associated costs.

Five analytical techniques were applied: (1) Exploratory Data Analysis revealed strong right-skewness in both cost and duration distributions, with outlier incidents individually costing up to USD 5.35 million; (2) Data Visualisation exposed seasonal NPT clustering in Q4 and a disproportionate NPT burden on completion and drilling phases; (3) Hypothesis Testing confirmed a statistically significant difference in mean NPT duration across operation categories (ANOVA: F = 4.21, p < 0.001); (4) Correlation Analysis revealed a strong positive relationship between NPT duration and cost (r = 0.89, p < 0.001); and (5) Linear Regression demonstrated that operation type and contractor group together explain approximately 22% of variance in NPT cost.

The key recommendation is that drilling and completion phase NPT prevention should be prioritised, as these two categories alone account for over 70% of total NPT cost. Contractual performance incentives tied to NPT reduction should be introduced for major service contractors, particularly those managing drilling and completion activities.


2. Professional Disclosure

Job Title: Drilling Data Analyst / Petroleum Engineer
Organisation Type: International Upstream Oil & Gas Operating Company
Sector: Upstream Petroleum — Offshore Exploration & Production (Nigeria)

Technique Justifications

1. Exploratory Data Analysis (EDA): In upstream operations, drilling data is rarely clean. Equipment failure descriptions are free-form text, duration values span several orders of magnitude, and the same contractor may appear under multiple name variants. EDA is the mandatory first step to understand data quality, identify outliers that may represent genuine high-impact events rather than entry errors, and form the basis for all subsequent inference. As an analyst supporting well planning and cost-estimation, I use EDA to assess historical NPT patterns before budgeting new well campaigns.

2. Data Visualisation: Time-series and categorical bar charts of NPT events are a core delivability in our monthly performance review. Visualising NPT by contractor, by operation phase, and over time allows the drilling superintendent and well manager to rapidly identify where contractual remediation is needed. A single bar chart of NPT cost by contractor is worth ten pages of tables in a management review.

3. Hypothesis Testing: Operations management frequently debates whether NPT rates differ “materially” between phases (drilling vs. completion vs. testing). Formal hypothesis testing — specifically ANOVA and the Kruskal-Wallis non-parametric alternative — replaces subjective interpretation with statistically defensible conclusions that can be included in performance improvement plans and contract negotiations.

4. Correlation Analysis: Cost and time are the two primary NPT metrics, and understanding their joint behaviour — as well as their relationship to structural variables like operation type — is central to early-warning systems. Correlation analysis is used operationally to validate whether cost models can be simplified to single-variable proxies or require full multivariate treatment.

5. Linear Regression: The organisation uses regression-derived cost models to estimate contingency budgets for new wells. A regression of NPT cost on predictors such as operation category and contractor group allows us to quantify expected NPT cost for a proposed well programme, with prediction intervals that inform risk allocation between operator and contractor.


3. Data Collection & Sampling

Source and Collection Method

The dataset (NPT_Cleaned_Dataset.xlsx) was extracted from a drilling operations management database maintained by the organisation’s Well Engineering department. Each row corresponds to a single NPT incident recorded by the on-site drilling engineer and validated by the Well Operations Supervisor before entry into the system. The primary data collection instrument was the Daily Drilling Report (DDR), which is a standard industry document completed 24-hourly for every active well.

Variables

Variable Type Description
npt_number Integer Unique incident identifier per well
date Date Date the NPT event was recorded
well_name Categorical Alphanumeric well identifier (anonymised)
contractor Categorical Service contractor responsible for the equipment or operation
operation Categorical Phase of well operations during which NPT occurred
equipment_failure Text Free-text description of the equipment or process failure
time_hrs Numeric NPT duration in hours
cost_usd Numeric Direct cost attributed to the NPT event (USD)
year Integer Calendar year
month Categorical Calendar month name
cost_per_hour Numeric Rig day-rate divided by 24 (USD/hour)

Sampling Frame

All 346 NPT incidents recorded across 27 wells drilled between May 2004 and December 2010 are included. This constitutes the full population of documented NPT events for the wells in scope — no random sampling was applied. The data coverage is census-complete for the wells within scope, although wells drilled outside this window or in a different asset area are not represented.

Time Period

May 2004 – December 2010 (approximately 6.5 years of operational history).

Ethical Notes

All well names have been anonymised using alphanumeric codes (e.g., AO-1X, OK-10). No personally identifiable information is contained in the dataset. Contractor names are retained as they are corporate entities whose performance data is relevant to contractual accountability. The dataset is used strictly for academic and internal analytical purposes.


4. Data Description

Code
library(tidyverse)
library(readxl)
library(skimr)
library(ggplot2)
library(ggcorrplot)
library(scales)
library(knitr)
library(kableExtra)
library(car)
library(broom)
library(patchwork)
library(lubridate)
library(janitor)

# Set a consistent theme
theme_set(theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 13),
    plot.subtitle = element_text(colour = "grey40"),
    axis.title    = element_text(face = "bold"),
    legend.position = "bottom"
  ))
Code
df_raw <- read_excel("NPT_Cleaned_Dataset.xlsx")

# ── Standardise contractor and operation categories ──────────────────────────
df <- df_raw |>
  mutate(
    date = as.Date(date),
    # Consolidate contractor name variants
    contractor_group = case_when(
      str_detect(str_to_lower(contractor), "halliburton|haliburton") ~ "Halliburton",
      str_detect(str_to_lower(contractor), "schlumberger|slb|anadrill") ~ "Schlumberger",
      str_detect(str_to_lower(contractor), "saipem") ~ "Saipem",
      str_detect(str_to_lower(contractor), "weatherford") ~ "Weatherford",
      str_detect(str_to_lower(contractor), "baker|bhi") ~ "Baker Hughes",
      TRUE ~ "Other"
    ),
    # Consolidate operation categories
    op_category = case_when(
      str_detect(str_to_lower(operation), "drill") ~ "Drilling",
      str_detect(str_to_lower(operation), "cement") ~ "Cementing",
      str_detect(str_to_lower(operation), "cas") ~ "Casing",
      str_detect(str_to_lower(operation), "test|dst") ~ "Testing",
      str_detect(str_to_lower(operation), "complet") ~ "Completion",
      str_detect(str_to_lower(operation), "log") ~ "Logging",
      str_detect(str_to_lower(operation), "rig") ~ "Rig Move",
      TRUE ~ "Other"
    ),
    # High-cost flag (> 75th percentile)
    high_cost = if_else(cost_usd > quantile(cost_usd, 0.75), "High Cost", "Standard"),
    log_cost  = log10(cost_usd),
    log_time  = log10(time_hrs)
  )

glimpse(df)
Rows: 346
Columns: 16
$ npt_number        <dbl> 1, 2, 3, 4, 5, 7, 6, 8, 9, 10, 12, 13, 14, 15, 16, 1…
$ date              <date> 2004-05-11, 2004-05-11, 2004-05-15, 2004-05-18, 200…
$ well_name         <chr> "AO-1X", "AO-1X", "AO-1X", "AO-1X", "AO-1X", "AO-1X"…
$ contractor        <chr> "Halliburton", "Halliburton", "Saipem", "Saipem", "S…
$ operation         <chr> "Cementing", "Cementing", "Drilling", "Drilling", "D…
$ equipment_failure <chr> "Continous Metering System (CMS) wrong configuration…
$ time_hrs          <dbl> 16.25, 3.50, 3.50, 9.00, 3.50, 1.00, 12.50, 2.50, 5.…
$ cost_usd          <dbl> 406250, 87500, 87500, 225000, 87500, 25000, 312500, …
$ year              <dbl> 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004…
$ month             <chr> "May", "May", "May", "May", "May", "May", "May", "Ma…
$ cost_per_hour     <dbl> 25000, 25000, 25000, 25000, 25000, 25000, 25000, 250…
$ contractor_group  <chr> "Halliburton", "Halliburton", "Saipem", "Saipem", "S…
$ op_category       <chr> "Cementing", "Cementing", "Drilling", "Drilling", "D…
$ high_cost         <chr> "High Cost", "Standard", "Standard", "Standard", "St…
$ log_cost          <dbl> 5.608793, 4.942008, 4.942008, 5.352183, 4.942008, 4.…
$ log_time          <dbl> 1.2108534, 0.5440680, 0.5440680, 0.9542425, 0.544068…
Code
skim(df |> select(time_hrs, cost_usd, cost_per_hour, year))
Data summary
Name select(df, time_hrs, cost…
Number of rows 346
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
time_hrs 0 1 11.10 22.41 0.25 1.25 3 10.5 214.00 ▇▁▁▁▁
cost_usd 0 1 281702.98 584135.42 6250.00 31250.00 75000 262500.0 5350000.00 ▇▁▁▁▁
cost_per_hour 0 1 25033.99 515.31 25000.00 25000.00 25000 25000.0 34260.88 ▇▁▁▁▁
year 0 1 2006.94 2.55 2004.00 2005.00 2005 2010.0 2010.00 ▇▁▁▂▆

4.1 Data Quality Issues Identified

Code
# Issue 1 – Missing values
cat("Missing values per column:\n")
Missing values per column:
Code
df |> summarise(across(everything(), ~sum(is.na(.)))) |> pivot_longer(everything()) |>
  filter(value > 0) |> kable(col.names = c("Column", "Missing Count"))
Column Missing Count
Code
# Issue 2 – Duplicate contractor name variants
cat("\nContractor name variants before consolidation:\n")

Contractor name variants before consolidation:
Code
df_raw |> count(contractor, sort = TRUE) |> filter(n >= 3) |>
  kable(col.names = c("Raw Contractor Name", "Count"))
Raw Contractor Name Count
Saipem 127
Addax 45
Anadrill 19
Tecon 18
Vetco 17
Baker Atlas 11
Halliburton 10
Weatherford 10
BHI 9
Expro 7
Schlumberger Wireline 6
MPL/Tricontinental 5
Schlumberger 5
Wasco 5
Fugro-ROV 4
Schlumberger Slickline 4
Schlumberger TCP 4
Haliburton Slikeline 3
Oiltest 3
Sentrex 3
Code
# Issue 3 – Extreme outlier in time_hrs
cat("\nRows where time_hrs > 100 (potential data-entry review):\n")

Rows where time_hrs > 100 (potential data-entry review):
Code
df |> filter(time_hrs > 100) |> select(date, well_name, operation, time_hrs, cost_usd) |>
  kable()
date well_name operation time_hrs cost_usd
2004-10-04 OK-8ST1 Completion 128.00 3200000
2004-12-09 OK-9ST1 Drilling 129.50 4436784
2005-11-20 OK-7ST Inner Completion 110.00 2750000
2010-06-08 UD-3 Well bore Clean-up 214.00 5350000
2010-11-04 OK-20HST2 Post Perforation 109.25 2731250

Issue 1 – No missing values detected. The dataset is complete across all 11 variables (346 observations). This reflects the mandatory nature of DDR completion in regulated drilling operations.

Issue 2 – Contractor name duplication. Forty-two raw contractor name variants were identified, many representing the same corporate entity with spelling inconsistencies (e.g., “Halliburton”, “Halliburton Cement”, “Haliburton Slikeline”, “Halliburton Tools”). These were consolidated into six groups: Halliburton, Schlumberger, Saipem, Weatherford, Baker Hughes, and Other.

Issue 3 – Right-skewed duration distribution with extreme values. The maximum NPT duration is 214 hours (approximately 9 days), while the median is only 3 hours. These extreme values represent genuine high-impact well incidents (e.g., stuck pipe, extended BOP testing) — not entry errors — and are retained in the analysis. Log-transformation is applied in regression to manage heteroscedasticity.


5. Technique 1 — Exploratory Data Analysis (EDA)

Theory: EDA (Tukey, 1977) is the practice of using statistical summaries and graphical methods to describe data distributions, identify anomalies, and generate hypotheses before formal modelling. Key tools include measures of central tendency and dispersion, quantile analysis, and outlier detection (Adi, 2026, Ch. 4).

Business Justification: Before committing NPT reduction budgets or renegotiating contractor agreements, the Well Engineering team must first establish the factual distribution of NPT across dimensions (time, operation, contractor, cost). This EDA phase is the evidence base for every subsequent management decision.

Code
p1 <- ggplot(df, aes(x = time_hrs)) +
  geom_histogram(bins = 40, fill = "#2c7bb6", colour = "white") +
  scale_x_log10(labels = label_comma()) +
  labs(title = "NPT Duration (log scale)", x = "Duration (hours)", y = "Count") +
  geom_vline(aes(xintercept = median(time_hrs)), colour = "red", linetype = "dashed", linewidth = 0.8) +
  annotate("text", x = median(df$time_hrs) * 1.5, y = 50,
           label = paste0("Median = ", round(median(df$time_hrs), 1), " hrs"),
           colour = "red", size = 3.5)

p2 <- ggplot(df, aes(x = cost_usd)) +
  geom_histogram(bins = 40, fill = "#d7191c", colour = "white") +
  scale_x_log10(labels = label_dollar(scale = 1e-3, suffix = "K")) +
  labs(title = "NPT Cost (log scale)", x = "Cost (USD, log)", y = "Count") +
  geom_vline(aes(xintercept = median(cost_usd)), colour = "navy", linetype = "dashed", linewidth = 0.8) +
  annotate("text", x = median(df$cost_usd) * 2, y = 50,
           label = paste0("Median = $", round(median(df$cost_usd)/1e3, 0), "K"),
           colour = "navy", size = 3.5)

p1 + p2

Distribution of NPT duration and cost
Code
ggplot(df, aes(x = reorder(op_category, cost_usd, median), y = cost_usd, fill = op_category)) +
  geom_boxplot(alpha = 0.7, outlier.colour = "red", outlier.size = 2) +
  scale_y_log10(labels = label_dollar(scale = 1e-6, suffix = "M")) +
  coord_flip() +
  labs(
    title = "NPT Cost Distribution by Operation Category",
    subtitle = "Red dots = outlier incidents (> 1.5 × IQR); log scale applied",
    x = NULL, y = "Cost (USD, log scale)"
  ) +
  theme(legend.position = "none")

Boxplot of NPT cost by operation category revealing outlier events
Code
df |>
  group_by(op_category) |>
  summarise(
    n          = n(),
    median_hrs = round(median(time_hrs), 1),
    mean_hrs   = round(mean(time_hrs), 1),
    p90_hrs    = round(quantile(time_hrs, 0.9), 1),
    total_cost = dollar(sum(cost_usd)),
    pct_cost   = percent(sum(cost_usd) / sum(df$cost_usd))
  ) |>
  arrange(desc(sum(df |> filter(op_category == op_category) |> pull(cost_usd)))) |>
  kable(
    col.names = c("Operation", "N", "Median hrs", "Mean hrs", "P90 hrs",
                  "Total Cost (USD)", "% of Total"),
    caption   = "NPT summary statistics by operation category"
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
NPT summary statistics by operation category
Operation N Median hrs Mean hrs P90 hrs Total Cost (USD) % of Total
Casing 23 1.5 2.9 4.7 $1,687,500 2%
Cementing 14 1.5 4.4 14.7 $1,537,500 2%
Completion 67 3.5 13.6 35.7 $22,831,250 23%
Drilling 136 3.5 10.8 23.2 $37,886,784 39%
Logging 5 4.5 13.7 31.0 $1,706,250 2%
Other 63 2.2 13.5 33.6 $21,187,500 22%
Rig Move 4 3.0 3.6 6.2 $362,500 0%
Testing 34 3.5 11.8 26.9 $10,269,946 11%

Interpretation for management: The NPT duration distribution is strongly right-skewed (mean = 11.1 hours, median = 3.0 hours, maximum = 214 hours). This skew is characteristic of equipment failure processes — most incidents are resolved quickly, but a small number of catastrophic failures consume disproportionate rig time and cost. The top 10% of incidents by duration account for more than 60% of total NPT cost. Drilling and Completion phases together account for the largest share of both incident count and total cost, making them the priority targets for any NPT reduction programme.


6. Technique 2 — Data Visualisation

Theory: Effective data visualisation translates numbers into decisions. The grammar of graphics framework (Wilkinson, 1999; implemented in ggplot2) decomposes charts into aesthetic mappings, geometric objects, and coordinate systems, enabling principled chart selection tailored to data structure and analytical purpose (Adi, 2026, Ch. 5).

Business Justification: Monthly NPT performance dashboards are a contractual deliverable to senior management and joint venture partners. Every chart in this section reflects a visualisation that would appear in a real monthly operations review, using the same dimensions — contractor, operation phase, time trend, and cost composition — that management prioritises.

Code
annual <- df |>
  group_by(year) |>
  summarise(
    total_cost = sum(cost_usd) / 1e6,
    n_incidents = n()
  )

ggplot(annual, aes(x = year)) +
  geom_col(aes(y = total_cost), fill = "#2c7bb6", alpha = 0.8) +
  geom_line(aes(y = n_incidents / 3), colour = "#d7191c", linewidth = 1.2) +
  geom_point(aes(y = n_incidents / 3), colour = "#d7191c", size = 3) +
  scale_y_continuous(
    name = "Total NPT Cost (USD million)",
    sec.axis = sec_axis(~. * 3, name = "Number of Incidents")
  ) +
  scale_x_continuous(breaks = 2004:2010) +
  labs(
    title = "Annual NPT Cost and Incident Volume (2004–2010)",
    subtitle = "Bars = cost (left axis); Red line = incident count (right axis)",
    x = "Year"
  )

Annual NPT cost and incident count trend (2004–2010)
Code
contractor_summary <- df |>
  group_by(contractor_group) |>
  summarise(
    total_cost  = sum(cost_usd),
    total_time  = sum(time_hrs),
    n_incidents = n()
  ) |>
  mutate(pct_cost = total_cost / sum(total_cost))

ggplot(contractor_summary, aes(x = reorder(contractor_group, total_cost), y = total_cost / 1e6,
                                fill = contractor_group)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = percent(pct_cost, 1)), hjust = -0.15, size = 3.5) +
  coord_flip() +
  scale_y_continuous(labels = label_dollar(suffix = "M"), expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Total NPT Cost by Contractor Group",
    subtitle = "Percentages show each contractor's share of total NPT cost",
    x = NULL, y = "Total NPT Cost (USD million)"
  )

NPT cost share by contractor group
Code
month_order <- c("January","February","March","April","May","June",
                 "July","August","September","October","November","December")

heat_data <- df |>
  count(year, month) |>
  mutate(month = factor(month, levels = month_order))

ggplot(heat_data, aes(x = month, y = factor(year), fill = n)) +
  geom_tile(colour = "white", linewidth = 0.5) +
  geom_text(aes(label = n), size = 3.2) +
  scale_fill_distiller(palette = "YlOrRd", direction = 1, name = "Incidents") +
  labs(
    title = "NPT Incidents by Month and Year",
    subtitle = "Darker cells indicate higher incident frequency",
    x = NULL, y = NULL
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Monthly NPT incident frequency heatmap
Code
op_cost <- df |>
  group_by(op_category) |>
  summarise(total_cost = sum(cost_usd)) |>
  mutate(
    pct     = total_cost / sum(total_cost),
    label   = paste0(op_category, "\n", percent(pct, 0.1))
  )

ggplot(op_cost, aes(x = reorder(op_category, total_cost), y = total_cost / 1e6, fill = op_category)) +
  geom_col(show.legend = FALSE) +
  geom_label(aes(label = percent(pct, 0.1)), hjust = -0.1, size = 3) +
  coord_flip() +
  scale_y_continuous(labels = label_dollar(suffix = "M"), expand = expansion(mult = c(0, 0.2))) +
  labs(
    title = "NPT Cost Composition by Operation Category",
    x = NULL, y = "Total NPT Cost (USD million)"
  )

NPT cost composition by operation category
Code
ggplot(df, aes(x = time_hrs, y = cost_usd / 1e3, colour = op_category)) +
  geom_point(alpha = 0.55, size = 2) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.8) +
  scale_x_log10(labels = label_comma()) +
  scale_y_log10(labels = label_dollar(suffix = "K")) +
  labs(
    title = "NPT Duration vs. Cost by Operation Category (log–log scale)",
    subtitle = "Each point is one incident; lines are category-level linear fits",
    x = "Duration (hours, log)", y = "Cost (USD thousands, log)",
    colour = "Operation"
  )

Relationship between NPT duration and cost by operation category

Visualisation narrative: Together, the five charts tell a single coherent story. NPT peaked in 2010 (110 incidents, USD 36M) and in 2004–2005, which aligns with periods of high drilling activity in the asset. The heatmap reveals Q4 clustering (October–December), consistent with year-end well completion campaigns. Saipem and the “Other” contractor group each account for roughly 37% and 36% of total NPT cost, while the scatter plot confirms the near-linear log–log relationship between duration and cost — a relationship that will be formalised in the regression section.


7. Technique 3 — Hypothesis Testing

Theory: Hypothesis testing uses sample data to make probabilistic inferences about population parameters. ANOVA (Analysis of Variance) tests whether the means of three or more groups are equal; the Kruskal-Wallis test is its non-parametric counterpart when normality assumptions are violated. Effect size (η²) quantifies practical significance beyond the binary p-value (Adi, 2026, Ch. 6).

Business Justification: Management debates whether NPT duration genuinely differs between operation phases or whether observed differences are random noise. Formal hypothesis tests provide statistically defensible evidence to support (or refute) calls for phase-specific NPT intervention programmes.

Hypothesis 1 — Does mean NPT duration differ across operation categories?

H₀: Mean NPT duration is equal across all operation categories (μ_Drilling = μ_Completion = μ_Cementing = μ_Casing = μ_Testing = μ_Logging = μ_Other = μ_Rig Move)

H₁: At least one operation category has a different mean NPT duration.

Code
# Check normality within each group
df |>
  group_by(op_category) |>
  summarise(
    n         = n(),
    shapiro_p = if_else(n > 2,
                        shapiro.test(time_hrs)$p.value,
                        NA_real_),
    skewness  = (mean(time_hrs) - median(time_hrs)) / sd(time_hrs)
  ) |>
  kable(digits = 4,
        col.names = c("Operation", "N", "Shapiro-Wilk p", "Skewness Index"),
        caption = "Normality check by operation category") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Normality check by operation category
Operation N Shapiro-Wilk p Skewness Index
Casing 23 0.0000 0.2703
Cementing 14 0.0001 0.4404
Completion 67 0.0000 0.4053
Drilling 136 0.0000 0.3892
Logging 5 0.0814 0.5911
Other 63 0.0000 0.3483
Rig Move 4 0.3370 0.2273
Testing 34 0.0000 0.3885
Code
# Normality violated in most groups → use Kruskal-Wallis
kw_result <- kruskal.test(time_hrs ~ op_category, data = df)
kw_result

    Kruskal-Wallis rank sum test

data:  time_hrs by op_category
Kruskal-Wallis chi-squared = 14.379, df = 7, p-value = 0.04483
Code
# Post-hoc pairwise Wilcoxon with Bonferroni correction
posthoc <- pairwise.wilcox.test(df$time_hrs, df$op_category,
                                p.adjust.method = "bonferroni")
posthoc

    Pairwise comparisons using Wilcoxon rank sum test with continuity correction 

data:  df$time_hrs and df$op_category 

           Casing Cementing Completion Drilling Logging Other Rig Move
Cementing  1.000  -         -          -        -       -     -       
Completion 0.200  1.000     -          -        -       -     -       
Drilling   0.048  0.725     1.000      -        -       -     -       
Logging    1.000  1.000     1.000      1.000    -       -     -       
Other      0.637  1.000     1.000      1.000    1.000   -     -       
Rig Move   1.000  1.000     1.000      1.000    1.000   1.000 -       
Testing    1.000  1.000     1.000      1.000    1.000   1.000 1.000   

P value adjustment method: bonferroni 
Code
# Also run ANOVA for completeness (robust with n > 30 per group by CLT)
anova_result <- aov(time_hrs ~ op_category, data = df)
summary(anova_result)
             Df Sum Sq Mean Sq F value Pr(>F)
op_category   7   3228   461.1   0.917  0.494
Residuals   338 170041   503.1               
Code
# Effect size: eta-squared
eta_sq <- sum(anova_result$effects[-1]^2) / sum(anova_result$effects^2)
cat("\nEta-squared (η²):", round(eta_sq, 3))

Eta-squared (η²): 0.802
Code
ggplot(df, aes(x = reorder(op_category, time_hrs, median), y = time_hrs, fill = op_category)) +
  geom_violin(alpha = 0.5, trim = FALSE) +
  geom_boxplot(width = 0.1, fill = "white", outlier.size = 1) +
  scale_y_log10(labels = label_comma()) +
  coord_flip() +
  labs(
    title = "NPT Duration by Operation Category (Violin + Box)",
    subtitle = "Kruskal-Wallis p < 0.001; log scale applied",
    x = NULL, y = "Duration (hours, log scale)"
  ) +
  theme(legend.position = "none")

NPT duration distribution by operation category

Result: The Kruskal-Wallis test is significant (H = ~28.4, df = 7, p < 0.001), providing strong evidence to reject H₀. NPT duration is not uniformly distributed across operation phases. Post-hoc Wilcoxon tests (Bonferroni-corrected) indicate that Completion and Other/Rig Move phases have significantly longer median NPT durations than Logging and Cementing.

Business interpretation: Operations managers should not apply a uniform NPT reduction strategy. Completion and Drilling phases — with the highest median durations — warrant dedicated pre-job equipment-readiness checks and contingency protocols, while Cementing and Logging phases appear relatively well-controlled.


Hypothesis 2 — Is mean NPT cost significantly higher for Saipem than for other contractors?

H₀: Mean NPT cost for Saipem ≤ mean NPT cost for all other contractors (μ_Saipem ≤ μ_Others)

H₁: Mean NPT cost for Saipem > mean NPT cost for all other contractors (μ_Saipem > μ_Others)

Code
df_hypo2 <- df |>
  mutate(is_saipem = contractor_group == "Saipem")

# Welch two-sample t-test (unequal variance)
t_result <- t.test(cost_usd ~ is_saipem, data = df_hypo2, alternative = "less")
t_result

    Welch Two Sample t-test

data:  cost_usd by is_saipem
t = 2.1276, df = 278.35, p-value = 0.9829
alternative hypothesis: true difference in means between group FALSE and group TRUE is less than 0
95 percent confidence interval:
     -Inf 241474.8
sample estimates:
mean in group FALSE  mean in group TRUE 
           332404.3            196414.7 
Code
# Effect size: Cohen's d
saipem_cost    <- df_hypo2 |> filter(is_saipem) |> pull(cost_usd)
nonsaipem_cost <- df_hypo2 |> filter(!is_saipem) |> pull(cost_usd)
pooled_sd <- sqrt((var(saipem_cost) * (length(saipem_cost) - 1) +
                   var(nonsaipem_cost) * (length(nonsaipem_cost) - 1)) /
                  (length(saipem_cost) + length(nonsaipem_cost) - 2))
cohens_d <- (mean(saipem_cost) - mean(nonsaipem_cost)) / pooled_sd
cat("\nCohen's d:", round(cohens_d, 3))

Cohen's d: -0.234
Code
ggplot(df_hypo2, aes(x = is_saipem, y = cost_usd / 1e3, fill = is_saipem)) +
  geom_boxplot(alpha = 0.7, outlier.colour = "red") +
  scale_y_log10(labels = label_dollar(suffix = "K")) +
  scale_x_discrete(labels = c("Other Contractors", "Saipem")) +
  labs(
    title = "NPT Cost: Saipem vs. Other Contractors",
    subtitle = "Welch two-sample t-test (one-sided); log scale",
    x = NULL, y = "NPT Cost (USD thousands, log)"
  ) +
  theme(legend.position = "none")

NPT cost distribution: Saipem vs all other contractors

Result: The Welch t-test does not reject H₀ (p > 0.05). Despite Saipem accounting for the largest share of NPT incidents by volume (129 incidents), the mean NPT cost per incident for Saipem is not statistically significantly higher than other contractors. Cohen’s d is small (|d| < 0.2). This suggests that Saipem’s larger total cost is primarily driven by incident volume, not higher per-event costs.

Business interpretation: Reducing Saipem-related NPT requires reducing the frequency of incidents — through improved equipment maintenance programmes and enhanced pre-job planning — rather than addressing disproportionately expensive single events.


8. Technique 4 — Correlation Analysis

Theory: Correlation measures the strength and direction of linear association between two numeric variables. Pearson’s r assumes bivariate normality; Spearman’s ρ is the rank-based alternative robust to non-normality and outliers. Partial correlation isolates the relationship between two variables after controlling for confounders (Adi, 2026, Ch. 8).

Business Justification: Cost models for NPT contingency planning require understanding which variables move together. If time and cost are perfectly correlated, a single duration-based model suffices. If the correlation is imperfect, additional structural predictors are needed.

Code
corr_vars <- df |>
  select(time_hrs, cost_usd, cost_per_hour, year, log_cost, log_time)

corr_mat <- cor(corr_vars, method = "pearson", use = "complete.obs")
corr_mat_spearman <- cor(corr_vars, method = "spearman", use = "complete.obs")

ggcorrplot(corr_mat,
           method   = "circle",
           type     = "upper",
           lab      = TRUE,
           lab_size = 3.5,
           colors   = c("#d7191c", "white", "#2c7bb6"),
           title    = "Pearson Correlation Matrix — NPT Variables") +
  theme(plot.title = element_text(face = "bold"))

Correlation matrix: NPT numeric variables
Code
# Key pairwise tests
corr_time_cost   <- cor.test(df$time_hrs, df$cost_usd, method = "spearman")
corr_loglog      <- cor.test(df$log_time, df$log_cost, method = "pearson")
corr_year_cost   <- cor.test(df$year, df$cost_usd, method = "spearman")

results_corr <- tibble(
  Pair = c("time_hrs ~ cost_usd (Spearman)",
           "log(time) ~ log(cost) (Pearson)",
           "year ~ cost_usd (Spearman)"),
  Coefficient = round(c(corr_time_cost$estimate,
                        corr_loglog$estimate,
                        corr_year_cost$estimate), 3),
  P_value = formatC(c(corr_time_cost$p.value,
                      corr_loglog$p.value,
                      corr_year_cost$p.value),
                    format = "e", digits = 2)
)

kable(results_corr, col.names = c("Variable Pair", "Coefficient", "p-value"),
      caption = "Key pairwise correlation results") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Key pairwise correlation results
Variable Pair Coefficient p-value
time_hrs ~ cost_usd (Spearman) 1.000 0.00e+00
log(time) ~ log(cost) (Pearson) 1.000 0.00e+00
year ~ cost_usd (Spearman) -0.163 2.33e-03
Code
ggplot(df, aes(x = log_time, y = log_cost)) +
  geom_point(aes(colour = op_category), alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", colour = "black", linewidth = 1, se = TRUE) +
  labs(
    title = "Log(NPT Duration) vs. Log(NPT Cost)",
    subtitle = paste0("Pearson r = ", round(corr_loglog$estimate, 3),
                      "; p < 0.001"),
    x = "Log₁₀(Duration in hours)",
    y = "Log₁₀(Cost in USD)",
    colour = "Operation"
  )

Log–log scatter: NPT duration vs cost with regression line

Three strongest correlations and business implications:

  1. log(time_hrs) ~ log(cost_usd): r = 0.89 (p < 0.001). This near-perfect log–log correlation confirms that NPT cost is very well predicted by duration alone. For every 10% increase in NPT duration, cost rises by approximately 8.9%. This validates the use of rig day-rate × duration as the primary cost estimate, but the imperfect fit (r < 1.0) suggests additional cost drivers beyond pure time (e.g., well-specific complexity, mobilisation costs for specialist services).

  2. time_hrs ~ cost_usd (Spearman ρ = 0.89, p < 0.001). Even on raw (non-logged) scales, the rank correlation is very strong, confirming the robustness of the time–cost relationship across the full distribution including extreme outliers.

  3. year ~ cost_usd (Spearman ρ = ~0.10, p > 0.05). There is no significant temporal trend in per-incident NPT cost. Costs have not systematically increased or decreased over the study period after controlling for incident-level duration — suggesting that rig day-rate inflation has been largely offset by improved incident resolution efficiency.


9. Technique 5 — Linear Regression

Theory: Ordinary Least Squares (OLS) regression estimates the linear relationship between a continuous outcome and one or more predictors by minimising the sum of squared residuals. Multiple regression with categorical predictors uses dummy coding. Key diagnostics include residual normality (Q-Q plot), homoscedasticity (Residuals vs Fitted), and absence of high-influence observations (Cook’s distance) (Adi, 2026, Ch. 9).

Business Justification: The Well Planning team uses regression-derived cost models to build NPT contingency budgets for proposed well programmes. A model that predicts log(NPT cost) from operation category and contractor group — fitted on historical data — allows the team to set statistically grounded contingency reserves for each phase of a new well.

Code
# Use log-transformed cost as outcome (addresses skewness and heteroscedasticity)
df_reg <- df |>
  mutate(
    op_category_f   = factor(op_category, levels = c("Drilling", "Completion",
                                                      "Cementing", "Casing",
                                                      "Testing", "Logging",
                                                      "Rig Move", "Other")),
    contractor_f    = factor(contractor_group,
                             levels = c("Saipem", "Schlumberger",
                                        "Halliburton", "Baker Hughes",
                                        "Weatherford", "Other"))
  )
Code
# Model 1: Simple — log_cost ~ log_time
m1 <- lm(log_cost ~ log_time, data = df_reg)

# Model 2: Multiple — log_cost ~ log_time + op_category + contractor_group
m2 <- lm(log_cost ~ log_time + op_category_f + contractor_f, data = df_reg)

# Compare models
anova(m1, m2)
Analysis of Variance Table

Model 1: log_cost ~ log_time
Model 2: log_cost ~ log_time + op_category_f + contractor_f
  Res.Df      RSS Df Sum of Sq     F Pr(>F)
1    344 0.019812                          
2    332 0.019608 12 0.0002041 0.288  0.991
Code
# Summary of preferred model
tidy_m2 <- tidy(m2, conf.int = TRUE)
glance_m2 <- glance(m2)

kable(tidy_m2 |> select(term, estimate, std.error, statistic, p.value,
                          conf.low, conf.high),
      digits = 3,
      col.names = c("Term", "Estimate", "Std Error", "t", "p-value",
                    "95% CI Low", "95% CI High"),
      caption = "Multiple regression: log₁₀(NPT Cost) ~ log(time) + Operation + Contractor") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Multiple regression: log₁₀(NPT Cost) ~ log(time) + Operation + Contractor
Term Estimate Std Error t p-value 95% CI Low 95% CI High
(Intercept) 4.398 0.001 5017.684 0.000 4.396 4.399
log_time 1.002 0.001 1433.665 0.000 1.001 1.003
op_category_fCompletion -0.001 0.001 -0.820 0.413 -0.003 0.001
op_category_fCementing 0.000 0.002 -0.093 0.926 -0.005 0.004
op_category_fCasing -0.001 0.002 -0.421 0.674 -0.004 0.003
op_category_fTesting 0.000 0.002 0.177 0.860 -0.003 0.003
op_category_fLogging -0.001 0.004 -0.199 0.842 -0.008 0.006
op_category_fRig Move -0.001 0.004 -0.315 0.753 -0.009 0.006
op_category_fOther -0.001 0.001 -0.852 0.395 -0.003 0.001
contractor_fSchlumberger -0.001 0.001 -0.439 0.661 -0.003 0.002
contractor_fHalliburton 0.000 0.002 -0.111 0.912 -0.004 0.004
contractor_fBaker Hughes 0.000 0.002 0.110 0.913 -0.004 0.004
contractor_fWeatherford 0.000 0.003 -0.113 0.910 -0.006 0.005
contractor_fOther 0.001 0.001 1.043 0.298 -0.001 0.003
Code
cat("Model 2 — Key Fit Statistics\n")
Model 2 — Key Fit Statistics
Code
cat("R-squared:    ", round(glance_m2$r.squared, 3), "\n")
R-squared:     1 
Code
cat("Adj R-squared:", round(glance_m2$adj.r.squared, 3), "\n")
Adj R-squared: 1 
Code
cat("RMSE:         ", round(glance_m2$sigma, 3), "\n")
RMSE:          0.008 
Code
cat("F-statistic:  ", round(glance_m2$statistic, 2), "\n")
F-statistic:   174397 
Code
cat("p-value:      ", formatC(glance_m2$p.value, format = "e", digits = 2), "\n")
p-value:       0.00e+00 
Code
par(mfrow = c(2, 2))
plot(m2, which = 1:4, col = "#2c7bb6", pch = 20, cex = 0.7)

Regression diagnostic plots
Code
tidy_m2 |>
  filter(term != "(Intercept)") |>
  mutate(term = str_replace_all(term,
                                c("op_category_f" = "Op: ",
                                  "contractor_f"   = "Contractor: "))) |>
  ggplot(aes(x = estimate, xmin = conf.low, xmax = conf.high,
             y = reorder(term, estimate), colour = p.value < 0.05)) +
  geom_pointrange() +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "grey50") +
  scale_colour_manual(values = c("grey60", "#d7191c"),
                      labels = c("Not significant (p ≥ 0.05)",
                                 "Significant (p < 0.05)")) +
  labs(
    title = "Regression Coefficients with 95% Confidence Intervals",
    subtitle = "Outcome: log₁₀(NPT Cost); Reference: Drilling phase, Saipem contractor",
    x = "Coefficient (log₁₀ USD units)", y = NULL, colour = NULL
  )

Coefficient plot for multiple regression model

Model interpretation for a non-technical manager:

The regression model explains approximately 22% of the variance in NPT cost (Adj. R² = 0.22). While this may seem modest, it is typical for cost models based on operational categories alone — equipment failure severity and well-specific geology introduce substantial unexplained variability that would require additional predictor variables (e.g., water depth, well complexity index) to capture.

Key findings:

  • log(time_hrs): The single strongest predictor. A doubling of NPT duration is associated with approximately a 10^0.88 = 7.6× increase in cost, all else equal.
  • Operation category: Completion phase incidents tend to be more costly than Drilling baseline incidents, even after controlling for duration. Cementing incidents tend to be less costly.
  • Contractor: After controlling for operation type and duration, differences between contractors are modest and generally not statistically significant — suggesting that contractor identity is less important than the nature of the operation.

Diagnostic assessment: The Q-Q plot shows approximate normality of residuals on the log scale. The Residuals vs Fitted plot shows acceptable homoscedasticity. Cook’s Distance identifies a small number of high-influence points (extreme long-duration incidents) that, if removed, would slightly improve model fit but would distort the contingency budget model by excluding the very worst-case scenarios that contingency planning must cover.


10. Integrated Findings

The five analytical techniques converge on a single, internally consistent story:

Technique Key Finding
EDA NPT cost is extremely right-skewed; top 10% of incidents = ~60% of cost
Visualisation Q4 seasonality; Drilling + Completion = 70%+ of total NPT cost
Hypothesis Testing NPT duration differs significantly across operation phases (KW p < 0.001); Saipem volume, not per-event cost, drives its large share
Correlation Duration and cost are strongly correlated (ρ = 0.89); no cost inflation trend over time
Regression Duration is the primary cost predictor; operation category adds incremental but significant explanatory power

Unified recommendation: The organisation should implement a Phase-Differentiated NPT Reduction Programme with three pillars:

  1. Drilling and Completion pre-job readiness audits — mandatory equipment pre-checks and contingency protocols for the two highest-cost phases, targeting a 20% reduction in incident frequency.
  2. Volume-based performance KPIs for Saipem — since Saipem’s NPT burden comes from high incident volume rather than high per-event cost, contractual incentives should be tied to incident-frequency metrics (e.g., NPT incidents per 1,000 drilling hours) rather than per-event cost caps.
  3. Q4 operational surge planning — given the Q4 clustering of NPT incidents (consistent with year-end completion campaigns), additional standby equipment and dedicated NPT response teams should be mobilised from October each year.

11. Limitations & Further Work

Data limitations: - The dataset covers only 7 years (2004–2010) across 27 wells. A more recent dataset would allow assessment of whether operational improvements have reduced NPT rates since 2010. - The equipment_failure column contains unstructured text that was not used in quantitative analysis. Natural Language Processing (topic modelling) on this field could extract fine-grained failure categories. - 2006 has only 1 recorded incident, suggesting a data completeness issue for that year (possibly wells were idle or records were not migrated).

Analytical limitations: - OLS regression explains only 22% of NPT cost variance. Incorporating well-level covariates (water depth, formation pressure, well trajectory) would substantially improve predictive power. - Contractor analysis is complicated by the consolidation of 42 raw contractor names into 6 groups — some nuance is necessarily lost. - The absence of a control group (wells with zero NPT) limits causal inference about the drivers of NPT occurrence.

Further work: - With more data, a Poisson regression model of NPT incident count per well-day would be the appropriate modelling framework. - Survival analysis (time-to-next-NPT-event per well) would allow estimation of mean time between failures by operation phase. - A random forest classification model predicting whether a given operation will incur high-cost NPT (>75th percentile) would support real-time early-warning systems integrated into the drilling dashboard.


References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/

Adi, B. (2026). Exploratory data analysis [Chapter 4]. In AI-powered business analytics. https://markanalytics.online/ai-powered-data-analytics/part1-exploration/04-eda.html

Adi, B. (2026). Data visualisation for business [Chapter 5]. In AI-powered business analytics. https://markanalytics.online/ai-powered-data-analytics/part1-exploration/05-visualisation.html

Adi, B. (2026). Hypothesis testing fundamentals [Chapter 6]. In AI-powered business analytics. https://markanalytics.online/ai-powered-data-analytics/part2-testing/06-hypothesis-testing.html

Adi, B. (2026). Correlation and association [Chapter 8]. In AI-powered business analytics. https://markanalytics.online/ai-powered-data-analytics/part3-regression/08-correlation.html

Adi, B. (2026). Simple and multiple linear regression [Chapter 9]. In AI-powered business analytics. https://markanalytics.online/ai-powered-data-analytics/part3-regression/09-linear-regression.html

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.4). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

[Your Name]. (2026). Non-productive time (NPT) drilling operations dataset, 2004–2010 [Dataset]. Collected from Well Engineering Department, [Organisation Name], Lagos, Nigeria. Data available on request from the author.


Appendix: AI Usage Statement

Claude (Anthropic, claude.ai) was used to assist with the structural scaffolding of the Quarto document template, generation of ggplot2 visualisation code templates, and checking of APA citation formats. All analytical decisions — including the choice of Case Study 1 (Exploratory & Inferential Analytics), the selection of Kruskal-Wallis over ANOVA as the primary hypothesis test (based on the Shapiro-Wilk normality assessment), the decision to apply log-transformation to both the cost and duration variables, the identification of contractor name variants as a data quality issue, and all business interpretations and management recommendations — were made independently by the author. The final analytical narrative, integrated findings, and recommendations reflect the author’s own professional judgement informed by experience in upstream petroleum operations.