library(tidyverse)
library(lubridate)
library(ggplot2)
library(corrplot)
library(effsize)
This study examines the operational factors influencing the duration of legal matters within an upstream oil and gas legal environment and identifies process improvement opportunities that can reduce resolution time.
The analysis uses anonymised legal matter data collected from internal legal operations records. The study applies exploratory data analysis, visualisation, hypothesis testing, correlation analysis, and regression modelling to determine which operational variables significantly influence matter duration.
The findings are expected to demonstrate that legal matter duration is strongly associated with operational complexity, revision cycles, and approval structures. The study further provides recommendations for reducing turnaround time through process optimisation and workflow simplification.
I am an upstream oil and gas lawyer involved in legal advisory, contract drafting and review, transaction support, and regulatory compliance activities within the energy sector.
This study is directly relevant to my professional responsibilities because legal turnaround time affects transaction execution, regulatory timelines, and operational efficiency.
EDA assists legal operations by identifying patterns, inconsistencies, and operational bottlenecks within legal matter workflows.
Visualisation enables management to understand trends in legal matter duration and identify operational inefficiencies.
Hypothesis testing supports evidence-based decisions regarding legal workflows and counsel allocation.
Correlation analysis identifies relationships between operational variables and matter duration.
Regression analysis quantifies the operational impact of multiple variables simultaneously.
The dataset was collected from anonymised internal legal matter records, including:
A purposive sampling approach was adopted to ensure representation across multiple legal matter categories.
The dataset contains 100 legal matters handled between May 2021 and May 2026.
| Variable | Type | Description |
|---|---|---|
| Matter.Name | Identifier | Anonymised matter name |
| Duration.Days | Numeric | Number of days between opening and closure |
| Complexity.Score | Numeric | Complexity rating (1–5) |
| Revision.Count | Numeric | Number of negotiation/review cycles |
| Approval.Layers | Numeric | Number of approval stages |
| Counsel.Type | Categorical | Internal (in-house) or External |
| Matter.Type | Categorical | Matter category (for example, Technical) |
| Drafting.Status | Categorical | Drafting status (for example, Completed) |
| Open.Date | Date | Date matter commenced (stored as dd/mm/yyyy text) |
All legal matters were anonymised prior to analysis. No commercially sensitive information was disclosed.
legal_data <- read.csv("Data Analytics Exam - Legal Matters Data.csv")
head(legal_data)
                                                   Matter.Name Duration.Days
1        Contract for ECM Study of Planned Drilling Activiites            29
2         Contract for Drilling Site Location Preparation Work             2
3       Contract for Security Risk Assessment Plan of IB Field            20
4 Contract for the Deployment of Lightning Arrestor Protection            12
5           Contract for ESR, PIAR and EES Study of Assa Field             6
6                             Contract for Noise Mapping Study             5
  Complexity.Score Approval.Layers Revision.Count Counsel.Type Matter.Type
1                2               3              4     Internal   Technical
2                2               2              1     Internal   Technical
3                2               3              1     Internal   Technical
4                2               3              2     Internal   Technical
5                2               3              1     Internal   Technical
6                2               3              1     Internal   Technical
  Drafting.Status  Open.Date
1       Completed 17/05/2021
2       Completed 25/05/2021
3       Completed   8/6/2021
4       Completed  12/6/2021
5       Completed 15/06/2021
6       Completed 16/06/2021
str(legal_data)
'data.frame': 100 obs. of 9 variables:
$ Matter.Name : chr "Contract for ECM Study of Planned Drilling Activiites" "Contract for Drilling Site Location Preparation Work" "Contract for Security Risk Assessment Plan of IB Field" "Contract for the Deployment of Lightning Arrestor Protection" ...
$ Duration.Days : int 29 2 20 12 6 5 8 3 4 5 ...
$ Complexity.Score: int 2 2 2 2 2 2 2 2 2 2 ...
$ Approval.Layers : int 3 2 3 3 3 3 3 3 3 3 ...
$ Revision.Count : int 4 1 1 2 1 1 2 1 1 1 ...
$ Counsel.Type : chr "Internal" "Internal" "Internal" "Internal" ...
$ Matter.Type : chr "Technical" "Technical" "Technical" "Technical" ...
$ Drafting.Status : chr "Completed" "Completed" "Completed" "Completed" ...
$ Open.Date : chr "17/05/2021" "25/05/2021" "8/6/2021" "12/6/2021" ...
summary(legal_data)
 Matter.Name        Duration.Days Complexity.Score Approval.Layers
 Length:100         Min.   :  1   Min.   :1.00     Min.   :1.00
 Class :character   1st Qu.:  6   1st Qu.:2.00     1st Qu.:3.00
 Mode  :character   Median : 12   Median :2.00     Median :3.00
                    Mean   : 27   Mean   :2.34     Mean   :2.98
                    3rd Qu.: 24   3rd Qu.:3.00     3rd Qu.:3.00
                    Max.   :331   Max.   :5.00     Max.   :5.00
 Revision.Count  Counsel.Type       Matter.Type        Drafting.Status
 Min.   : 0.00   Length:100         Length:100         Length:100
 1st Qu.: 1.00   Class :character   Class :character   Class :character
 Median : 2.00   Mode  :character   Mode  :character   Mode  :character
 Mean   : 3.58
 3rd Qu.: 3.00
 Max.   :38.00
  Open.Date
 Length:100
 Class :character
 Mode  :character
colSums(is.na(legal_data))
     Matter.Name    Duration.Days Complexity.Score  Approval.Layers
               0                0                0                0
  Revision.Count     Counsel.Type      Matter.Type  Drafting.Status
               0                0                0                0
       Open.Date
               0
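The `str()` output above shows `Open.Date` and the categorical columns stored as character, so a type-conversion step would normally follow before any date-based or grouped analysis. A minimal sketch, using the first six rows printed by `head()` and assuming day/month/year order (consistent with values such as 17/05/2021); the name `sample_rows` is illustrative and not part of the original analysis:

```r
# First six rows from the head() output above (selected columns only)
sample_rows <- data.frame(
  Duration.Days = c(29, 2, 20, 12, 6, 5),
  Counsel.Type  = rep("Internal", 6),
  Open.Date     = c("17/05/2021", "25/05/2021", "8/6/2021",
                    "12/6/2021", "15/06/2021", "16/06/2021"),
  stringsAsFactors = FALSE
)

# Parse day/month/year text; one- and two-digit days and months are both accepted
sample_rows$Open.Date <- as.Date(sample_rows$Open.Date, format = "%d/%m/%Y")

# Treat counsel type as a factor, listing both levels seen in the full data
sample_rows$Counsel.Type <- factor(sample_rows$Counsel.Type,
                                   levels = c("Internal", "External"))

str(sample_rows)
```

Any value that fails to parse becomes `NA`, so checking `anyNA()` on the converted column is a quick way to catch inconsistent date formats.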
sapply(legal_data, class)
     Matter.Name    Duration.Days Complexity.Score  Approval.Layers
     "character"        "integer"        "integer"        "integer"
  Revision.Count     Counsel.Type      Matter.Type  Drafting.Status
       "integer"      "character"      "character"      "character"
       Open.Date
     "character"
ggplot(legal_data, aes(x = `Duration.Days`)) +
  geom_histogram(bins = 20, fill = "steelblue") +
  labs(
    title = "Distribution of Legal Matter Duration",
    x = "Duration (Days)",
    y = "Frequency"
  )
This histogram shows whether matter duration is approximately normally distributed or skewed. The summary statistics above (median 12 days, mean 27 days, maximum 331 days) already point to a strong right skew.
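One way to quantify the skew seen in a histogram is to compare the mean with the median and to compute a moment-based skewness coefficient. The sketch below uses the first ten `Duration.Days` values from the `str()` output as sample input; `skewness_moment` is a hypothetical helper written for illustration, not a function from the loaded packages.

```r
# First ten Duration.Days values, as printed by str(legal_data)
duration <- c(29, 2, 20, 12, 6, 5, 8, 3, 4, 5)

# Moment-based skewness: positive values indicate a right-skewed distribution
skewness_moment <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / (mean(centred^2)^1.5)
}

mean(duration)             # 9.4
median(duration)           # 5.5
skewness_moment(duration)  # positive for right-skewed data

# A log transform is a common remedy before methods that assume normality
log_duration <- log1p(duration)
skewness_moment(log_duration)
```

If the skewness of the log-transformed values is much closer to zero, modelling `log1p(Duration.Days)` may be worth considering as a sensitivity check.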
ggplot(legal_data, aes(y = `Duration.Days`)) +
  geom_boxplot(fill = "orange") +
  labs(
    title = "Outlier Detection for Matter Duration",
    y = "Duration (Days)"
  )
This boxplot identifies unusually long-running legal matters.
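Boxplot whiskers follow the 1.5 × IQR convention, so the same rule can be used to list the flagged matters explicitly. A sketch under stated assumptions: `flag_upper_outliers` is a hypothetical helper, and the sample input is the first ten durations from the `str()` output rather than the full dataset.

```r
# Upper fence implied by the quartiles reported in summary(legal_data)
q1 <- 6
q3 <- 24
upper_fence <- q3 + 1.5 * (q3 - q1)
upper_fence  # 51 days

# Hypothetical helper: flag values above Q3 + 1.5 * IQR
flag_upper_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  x > q[2] + 1.5 * (q[2] - q[1])
}

# First ten Duration.Days values, as printed by str(legal_data)
duration <- c(29, 2, 20, 12, 6, 5, 8, 3, 4, 5)
duration[flag_upper_outliers(duration)]
```

Note that `quantile()` and the hinges used by the boxplot can differ slightly, so the two approaches may disagree on borderline values.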
ggplot(legal_data, aes(x = `Counsel.Type`, y = `Duration.Days`)) +
  geom_boxplot(fill = "lightgreen") +
  labs(
    title = "Duration by Counsel Type",
    x = "Counsel Type",
    y = "Duration (Days)"
  )
This visualisation compares matter duration between internal (in-house) and external counsel.
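Boxplots do not show group sizes, which matter when one group is small. A sketch of a per-group summary in base R; the `toy` values are invented for illustration and are not taken from the dataset.

```r
# Illustrative (made-up) data with the same column names as legal_data
toy <- data.frame(
  Counsel.Type  = c("Internal", "Internal", "Internal", "Internal",
                    "External", "External"),
  Duration.Days = c(5, 12, 8, 20, 90, 150)
)

# Count, median and mean duration per counsel type
groups <- split(toy$Duration.Days, toy$Counsel.Type)
group_summary <- data.frame(
  Counsel.Type = names(groups),
  n            = sapply(groups, length),
  median_days  = sapply(groups, median),
  mean_days    = sapply(groups, mean),
  row.names    = NULL
)
group_summary
```

Running the same summary on `legal_data` would show how many matters sit behind each box before any test is interpreted.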
ggplot(legal_data, aes(x = `Matter.Type`, y = `Duration.Days`)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Duration by Matter Type",
    x = "Matter Type",
    y = "Duration (Days)"
  )
This chart compares legal matter duration across different categories.
ggplot(legal_data, aes(x = `Complexity.Score`, y = `Duration.Days`)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Complexity Score vs Duration",
    x = "Complexity Score",
    y = "Duration (Days)"
  )
This visualisation evaluates whether higher complexity is associated with longer legal matter duration.
ggplot(legal_data, aes(x = `Revision.Count`, y = `Duration.Days`)) +
  geom_point(color = "darkgreen") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Revision Count vs Duration",
    x = "Revision Count",
    y = "Duration (Days)"
  )
This chart examines whether additional negotiation cycles are associated with longer matter duration.
ggplot(legal_data, aes(x = `Approval.Layers`, y = `Duration.Days`)) +
  geom_point(color = "purple") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Approval Layers vs Duration",
    x = "Approval Layers",
    y = "Duration (Days)"
  )
This visualisation assesses whether approval complexity affects turnaround time.
The objective of this test is to determine whether legal matter duration differs significantly between internal and external counsel.
t_test_result <- t.test(`Duration.Days` ~ `Counsel.Type`, data = legal_data)
t_test_result
Welch Two Sample t-test
data: Duration.Days by Counsel.Type
t = 2.6652, df = 6.0561, p-value = 0.03693
alternative hypothesis: true difference in means between group External and group Internal is not equal to 0
95 percent confidence interval:
9.374622 213.974072
sample estimates:
mean in group External mean in group Internal
130.8571 19.1828
cohen.d(`Duration.Days` ~ `Counsel.Type`, data = legal_data)
Cohen's d
d estimate: 2.922562 (large)
95 percent confidence interval:
lower upper
2.043291 3.801833
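The Welch t-test indicates a statistically significant difference (p = 0.037): external-counsel matters average 130.9 days against 19.2 days for internal matters, and Cohen's d of 2.92 is a large effect. The low degrees of freedom (about 6) reflect a small External group, which explains the wide confidence interval (9.4 to 214.0 days), so the size of the difference is imprecisely estimated.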
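For reference, Cohen's d with a pooled standard deviation can be computed by hand, which makes clear what the effect size measures. The sketch below uses invented values and a hypothetical helper `cohens_d_pooled`; it illustrates the standard pooled-SD formula and is not claimed to reproduce the internals of `effsize::cohen.d`.

```r
# Hypothetical helper: standardised mean difference using the pooled SD
cohens_d_pooled <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_var <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
  (mean(x) - mean(y)) / sqrt(pooled_var)
}

# Made-up durations for two small groups
external <- c(10, 12, 14)
internal <- c(1, 3, 5)

cohens_d_pooled(external, internal)  # 4.5
```

With skewed durations and a small group, a rank-based test such as `wilcox.test()` is a reasonable cross-check on the t-test.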
The objective of this test is to determine whether matter duration differs significantly across matter categories.
anova_model <- aov(`Duration.Days` ~ `Matter.Type`, data = legal_data)
summary(anova_model)
            Df Sum Sq Mean Sq F value   Pr(>F)
Matter.Type  1  51338   51338   29.09 4.79e-07 ***
Residuals   98 172938    1765
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
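The ANOVA indicates that mean duration differs significantly between matter categories (F(1, 98) = 29.09, p < 0.001). Because Matter.Type has one degree of freedom, the data contain only two categories, so this test is equivalent to a pooled two-sample t-test.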
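Two follow-up calculations may help when reading the ANOVA table: an effect size (eta squared) derived from the sums of squares, and a check that a two-group ANOVA matches a pooled t-test. This is a sketch; the sums of squares are copied from the table above, while the `toy` data frame is invented purely to demonstrate the equivalence.

```r
# Eta squared from the ANOVA table above: SS_between / SS_total
ss_matter_type <- 51338
ss_residual    <- 172938
eta_squared <- ss_matter_type / (ss_matter_type + ss_residual)
round(eta_squared, 3)  # 0.229

# With two groups, the one-way ANOVA F equals the squared pooled-variance t
toy <- data.frame(
  group = rep(c("A", "B"), each = 4),
  days  = c(5, 8, 12, 20, 30, 45, 60, 90)  # made-up values
)
f_value <- summary(aov(days ~ group, data = toy))[[1]][["F value"]][1]
t_value <- unname(t.test(days ~ group, data = toy, var.equal = TRUE)$statistic)
all.equal(f_value, t_value^2)
```

An eta squared of about 0.23 means matter type alone accounts for roughly 23% of the variance in duration.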
numeric_data <- legal_data %>%
  select(`Duration.Days`,
         `Complexity.Score`,
         `Revision.Count`,
         `Approval.Layers`)
cor_matrix <- cor(numeric_data, use = "complete.obs")
cor_matrix
                 Duration.Days Complexity.Score Revision.Count Approval.Layers
Duration.Days        1.0000000        0.6007510      0.8685378       0.3834658
Complexity.Score     0.6007510        1.0000000      0.7520747       0.4640004
Revision.Count       0.8685378        0.7520747      1.0000000       0.4937822
Approval.Layers      0.3834658        0.4640004      0.4937822       1.0000000
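corrplot(
  cor_matrix,
  method = "color",
  addCoef.col = "black"
)
This analysis identifies the strongest operational relationships associated with legal matter duration. Revision.Count has the strongest correlation with duration (r = 0.87), followed by Complexity.Score (r = 0.60) and Approval.Layers (r = 0.38). Revision.Count and Complexity.Score are themselves strongly correlated (r = 0.75), which matters when both enter a regression model.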
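Pearson correlation is sensitive to extreme values such as the 331-day maximum, so a rank-based (Spearman) coefficient is a useful cross-check. A sketch using the first ten rows of two columns from the `str()` output as sample input; results on the full dataset will differ.

```r
# First ten rows of two columns, as printed by str(legal_data)
duration  <- c(29, 2, 20, 12, 6, 5, 8, 3, 4, 5)
revisions <- c(4, 1, 1, 2, 1, 1, 2, 1, 1, 1)

# Pearson measures linear association; Spearman uses ranks and is less
# affected by a few very long matters
pearson  <- cor(duration, revisions, method = "pearson")
spearman <- cor(duration, revisions, method = "spearman")
c(pearson = pearson, spearman = spearman)
```

On the full data, `cor(numeric_data, method = "spearman")` would give the corresponding matrix for comparison with `cor_matrix`.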
regression_model <- lm(
  `Duration.Days` ~
    `Counsel.Type` +
    `Matter.Type` +
    `Complexity.Score` +
    `Revision.Count` +
    `Approval.Layers`,
  data = legal_data
)
summary(regression_model)
Call:
lm(formula = Duration.Days ~ Counsel.Type + Matter.Type + Complexity.Score +
Revision.Count + Approval.Layers, data = legal_data)
Residuals:
Min 1Q Median 3Q Max
-66.882 -10.414 -3.544 6.378 118.126
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)           -3.5603    20.0950  -0.177   0.8598
Counsel.TypeInternal  40.2022    17.6254   2.281   0.0248 *
Matter.TypeTechnical -13.0302    18.7286  -0.696   0.4883
Complexity.Score      -5.3980     4.3843  -1.231   0.2213
Revision.Count         9.1040     0.8127  11.202   <2e-16 ***
Approval.Layers       -4.7918     4.8632  -0.985   0.3270
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 23.12 on 94 degrees of freedom
Multiple R-squared: 0.7759, Adjusted R-squared: 0.764
F-statistic: 65.09 on 5 and 94 DF, p-value: < 2.2e-16
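This model estimates the combined effect of operational variables on matter duration and explains about 78% of its variance (R-squared = 0.776, adjusted 0.764). Revision.Count is the only strongly significant predictor: each additional revision cycle is associated with about 9.1 more days, holding the other variables constant (p < 0.001). Counsel.TypeInternal is significant at the 5% level (p = 0.025), but its positive sign reverses the raw group comparison, which suggests that the longer external matters are largely accounted for by their revision counts. Complexity.Score and Approval.Layers are not significant once revision count is included.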
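Because Revision.Count and Complexity.Score are correlated (r = 0.75 in the correlation matrix), a variance inflation factor (VIF) check helps judge how far collinearity is inflating the coefficient standard errors. The sketch below computes VIF by hand in base R; `vif_manual` is a hypothetical helper and the predictor values are invented for illustration.

```r
# Made-up predictors chosen so that the two are positively correlated
complexity <- c(1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 1, 5)
revisions  <- c(1, 2, 3, 3, 5, 6, 5, 9, 1, 4, 2, 7)

# VIF for a predictor = 1 / (1 - R^2) from regressing it on the other predictors
vif_manual <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_revisions <- vif_manual(revisions, data.frame(complexity = complexity))
vif_revisions

# Lower bound for the reported data, based on the Revision/Complexity pair alone
1 / (1 - 0.7520747^2)  # about 2.3
```

The VIF in the full five-predictor model can only be equal to or higher than this pairwise figure, so computing it on `legal_data` would be the more informative check.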
par(mfrow = c(2, 2))
plot(regression_model)
Diagnostic plots assess:
- normality,
- homoscedasticity,
- linearity,
- influential observations.
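The visual diagnostics can be complemented with numeric checks. A sketch on invented data with one deliberately extreme point, using Cook's distance with the common 4/n rule of thumb and a Shapiro-Wilk test on the residuals; the threshold is a convention, not a strict rule.

```r
# Made-up data: a clean linear trend with one deliberately extreme final point
x <- 1:20
y <- 2 * x + sin(x)
y[20] <- 200

fit <- lm(y ~ x)

# Cook's distance per observation; 4/n is a common rule-of-thumb cut-off
cooks <- cooks.distance(fit)
influential <- which(cooks > 4 / length(cooks))
influential

# Shapiro-Wilk test on residuals as a formal complement to the Q-Q plot
shapiro.test(residuals(fit))$p.value
```

Applied to `regression_model`, the same calls would list the matters that most influence the fitted coefficients, which is relevant given residuals ranging from about -67 to 118 days.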
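The analyses collectively suggest that revision cycles are the dominant driver of legal matter duration. Complexity and approval layers are positively correlated with duration on their own, but neither is statistically significant once revision count is included in the regression model.
The results indicate that minimising excessive negotiation cycles is the most promising route to improved legal operational efficiency; the data provide weaker support for gains from reducing approval layers.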
This study is limited to one operational environment and may not fully generalise across industries.
Future studies could incorporate:
- larger datasets,
- multiple organisations,
- predictive modelling,
- time-series analysis.
Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online
R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.r-project.org/
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer.
Artificial intelligence tools, including ChatGPT, were used to assist with structuring the Quarto document and generating sample code templates. However, all analytical decisions, interpretations, and recommendations were independently developed by the author.