Understanding Freight Cost Drivers in Float Glass Export Operations: An Exploratory and Inferential Analytics Study
Author
Hammed Lawal
Published
May 20, 2026
1 Executive Summary
This case study investigates the cost structure of freight operations for float glass exports from Nigerian seaports. The data covers 100 shipments booked between January and August 2024 from three port origins — Lekki, Apapa, and Tincan — dispatched to destinations across Sub-Saharan Africa, North Africa and the Mediterranean, the Middle East, and the Americas. The central business problem is: what drives freight cost, and how can operations management use this understanding to price shipments, benchmark routes, and optimise tonnage loading decisions?
Using five analytical techniques — Exploratory Data Analysis, Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression — this study finds that tonnage is the single strongest predictor of total freight cost (r = 0.96), and that destination region is a highly significant cost driver even after controlling for cargo volume. Shipments to North Africa and the Mediterranean attract rates roughly three times higher per tonne than those to Sub-Saharan Africa, while the Americas cluster at an intermediate level. A two-variable regression model (tonnage + region) explains 97% of freight cost variance, providing a practical pricing formula for the commercial team.
Organisation Domain: The organisation is a freight forwarding and logistics company operating within Nigeria’s export supply chain. It manages the end-to-end shipment of manufactured goods — primarily flat and float glass — from Nigerian seaports to international buyers across Africa, the Mediterranean basin, the Middle East, and the Americas. Day-to-day work involves booking ocean freight, coordinating with shipping lines, tracking vessel schedules, and generating shipping instructions and commercial documentation.
Relevance of the Five Techniques to Daily Work:
Exploratory Data Analysis (EDA): Before any pricing or routing decision, I need to understand the spread and central tendency of freight rates in our booking records. EDA allows me to quickly identify outliers — shipments priced unusually high or low — and flag data-entry errors in the booking system before they propagate into financial reports.
Data Visualisation: Management and clients require visual summaries of freight trends across routes and periods. Visualisations allow me to communicate monthly freight spend, cost per tonne by destination, and seasonal patterns in a format accessible to non-technical stakeholders such as sales managers and CFO.
Hypothesis Testing: A recurring operational question is whether different loading ports (Lekki vs Apapa) yield meaningfully different freight costs. Formal hypothesis testing replaces anecdotal reasoning with statistically defensible answers, informing decisions about which port to assign future bookings.
Correlation Analysis: Understanding how freight cost relates to tonnage, dwell time at port, and transit days allows the team to build an intuition for which variables to monitor. For example, if transit days and freight are highly correlated, route selection becomes a cost lever, not merely a scheduling preference.
Linear Regression: A regression model formalises the relationship between freight cost and its predictors, giving the commercial team a pricing equation. When quoting a new shipment, the model provides an evidence-based estimate rather than relying solely on the operator’s experience or manual lookup tables.
3 Data Collection & Sampling
3.1 Source and Collection Method
The dataset was extracted from the company’s internal freight booking management system (FMS), which logs every shipment booking from initiation to vessel departure. Each record is created when a booking confirmation is received from the shipping line and is updated as the shipment progresses. The export was performed as a CSV extract covering all float glass bookings within the 2024 calendar year up to the date of extraction (late July 2024).
3.2 Sampling Frame and Period
Population: All ocean freight bookings involving float glass commodity from Nigerian ports managed by the organisation.
Sample frame: Bookings with a confirmed ship date between 1 January 2024 and 31 July 2024.
Sample size:100 booking records (the complete population for the period — this is a census, not a random sample).
Variables captured: Booking reference, ship date, commodity, port of loading (POL), port of discharge (POD), freight amount (USD), cargo tonnage, vessel dwell time at port of loading (stay), and ocean transit time.
3.3 Ethical Notes and Data-Sharing Restrictions
All data used in this study relates to business transactions between the company and its shipping line partners. No personal data is present in the dataset — all records refer to cargo bookings, not individuals. The dataset has been reviewed by the compliance team and cleared for academic use with the following conditions: (a) vessel names and individual shipper identities have been removed; (b) the booking reference numbers are retained for traceability purposes only and are not linked to any client identifiable information in this publication.
4 Data Description
4.1 Variable Dictionary
Variable
Original Name
Type
Description
booking_no
BOOKING NO
Character
Unique freight booking reference
ship_date
SHIP DATE
Date
Date cargo departed port of loading
pol
POL
Categorical
Port of loading (LEKKI, APAPA, TINCAN)
pod
POD
Categorical
Port of discharge (25 unique destinations)
region
— (derived)
Categorical
Geographic region of destination
freight_usd
FREIGHT USD
Numeric
Total ocean freight in US Dollars
tonnage
TONNAGE
Numeric
Cargo weight in metric tonnes
stay_days
STAY
Numeric
Vessel dwell time at POL (days)
transit_days
TRANSIT TIME
Numeric
Ocean transit time POL → POD (days)
freight_rate_per_ton
— (derived)
Numeric
Freight USD ÷ Tonnage (USD/MT)
Note:COMMODITY (all records = “FLOAT GLASS”) and S/N (row index) were excluded from analysis as they carry no analytical information.
4.2 Data Loading and Cleaning
Code
# ── Libraries ──────────────────────────────────────────────────────────────────library(tidyverse)library(lubridate)library(janitor)library(knitr)library(kableExtra)library(scales)library(corrplot)library(broom)library(car)library(rstatix)library(ggthemes)library(patchwork)library(nortest)# ── Load raw data ──────────────────────────────────────────────────────────────df_raw <-read_csv("data.csv", show_col_types =FALSE)# ── Clean column names ─────────────────────────────────────────────────────────df_raw <- df_raw %>%clean_names()# ── Step 1: Strip invisible Unicode characters from all character columns ──────# Some dates contain Unicode left-to-right marks (U+202A, U+202C) and# other non-printable bytes — a known artefact of copy-pasting from certain# logistics portals into Excel before CSV export.# iconv converts each string to UTF-8, replacing unmappable bytes with "".# Then we strip any remaining non-ASCII (codes > 127) and trim whitespace.df_raw <- df_raw %>%mutate(across(where(is.character),~iconv(., from ="UTF-8", to ="ASCII", sub ="") %>%str_trim() ))# ── Step 2: Parse ship_date (multiple date formats present) ───────────────────df_raw <- df_raw %>%mutate(ship_date =parse_date_time(ship_date,orders =c("m/d/Y", "d/m/Y", "m/d/y"),quiet =TRUE))# Correct one clear year-entry error: row 13 records "2/25/25" — contextually# this must be 25 Feb 2024 given the surrounding entries.df_raw <- df_raw %>%mutate(ship_date =if_else(!is.na(ship_date) &year(ship_date) ==2025, ship_date -years(1), ship_date ))# ── Step 3: Parse stay_days and transit_days (stored as "6DAYS", "15DAY") ─────df_raw <- df_raw %>%mutate(stay_days =as.numeric(gsub("[A-Za-z]", "", stay)),transit_days =as.numeric(gsub("[A-Za-z]", "", transit_time)) )# ── Step 4: Parse freight_usd (commas in numeric strings) ─────────────────────df_raw <- df_raw %>%mutate(freight_usd =as.numeric(gsub(",", "", freight_usd)))# ── Step 5: Standardise POD spelling variants ─────────────────────────────────# "BUEVENATURA" and "BUENAVENTURA" refer to the same Colombian port.# "Iquique,Chile" and "Iquique" refer to the same Chilean port.df_raw <- df_raw %>%mutate(pod =case_when(str_detect(pod, regex("BUEV|BUEN", ignore_case =TRUE)) ~"BUENAVENTURA",str_detect(pod, regex("Iquique", ignore_case =TRUE)) ~"IQUIQUE",str_detect(pod, regex("Guayaquil", ignore_case =TRUE)) ~"GUAYAQUIL",TRUE~toupper(str_trim(pod)) ))# ── Step 6: Assign destination regions ────────────────────────────────────────df_raw <- df_raw %>%mutate(region =case_when( pod %in%c("ABIDJAN", "MONROVIA", "POINTE NOIRE","DAR ES SALAM") ~"Sub-Saharan Africa", pod %in%c("SKIKDA", "BEJAIA", "ALGER", "DJEN DJEN","DAMIETTA", "EL KHOMS", "MISURATA") ~"N. Africa / Mediterranean", pod %in%c("HAMAD") ~"Middle East", pod %in%c("CALLAO", "CAUCEDO", "GUAYAQUIL", "CARTAGENA","BUENAVENTURA", "BARRANQUILA", "PUERTO CORTES","ARICA", "IQUIQUE", "PUERTO CABELLO") ~"Americas",TRUE~"Other" ))# ── Step 7: Derived variables ─────────────────────────────────────────────────df_raw <- df_raw %>%mutate(freight_rate_per_ton =round(freight_usd / tonnage, 2),ship_month =floor_date(ship_date, "month"),ship_month_label =format(ship_date, "%b %Y") )# ── Step 8: Final analytical dataset ──────────────────────────────────────────df <- df_raw %>%select(booking_no, ship_date, ship_month, ship_month_label, pol, pod, region, freight_usd, tonnage, stay_days, transit_days, freight_rate_per_ton) %>%filter(!is.na(freight_usd)) # drop any rows where freight failed to parsecat("Final dataset dimensions:", nrow(df), "rows ×", ncol(df), "columns\n")
Exploratory Data Analysis, formalised by Tukey (1977), is the practice of summarising and visualising a dataset’s core structure before any formal modelling. It encompasses four pillars: summary statistics (central tendency, dispersion), distribution analysis (skewness, modality), missing-value and outlier detection, and bivariate relationships. EDA forces the analyst to question data quality and to develop intuitions about which transformations or model structures are appropriate (James et al., 2021, Ch. 4).
5.2 Business Justification
In freight operations, undetected data errors — miskeyed amounts, wrong date formats, duplicate bookings — directly distort financial reports and route-profitability analyses. Before presenting any cost figures to management, the operations team must verify that the underlying booking data is complete, internally consistent, and free of anomalous entries. EDA is the first line of defence against garbage-in/garbage-out reporting.
if (nrow(filter(miss_df, n_missing >0)) ==0) {cat("No missing values remain in the analytical dataset after cleaning.\n")}
No missing values remain in the analytical dataset after cleaning.
Code
# ── Data quality issues identified ────────────────────────────────────────────# Issue 1: Date format inconsistency# Raw data contains at least four date formats (d/m/Y, m/d/Y, m/d/y, and# dates with embedded Unicode control characters). After cleaning, one record# had a year of 2025 — almost certainly a keying error for 2024.n_date_problems <- df_raw %>%filter(is.na(ship_date)) %>%nrow()cat("Issue 1 — Malformed date strings: ", n_date_problems, "record(s) could not be parsed.\n")
Issue 1 — Malformed date strings: 0 record(s) could not be parsed.
Code
# Issue 2: STAY and TRANSIT TIME as text stringscat("Issue 2 — Numeric fields stored as text: STAY and TRANSIT TIME required ","regex extraction before use.\n")
Issue 2 — Numeric fields stored as text: STAY and TRANSIT TIME required regex extraction before use.
Issue 3 — Upper-fence outliers in FREIGHT_USD (IQR rule): 3 record(s). Max = $ 35,437.3
Code
# Issue 4: POD spelling variantscat("Issue 4 — Spelling variants in POD: 'BUEVENATURA'/'BUENAVENTURA' and ","'Iquique,Chile'/'Iquique' were standardised.\n")
Issue 4 — Spelling variants in POD: 'BUEVENATURA'/'BUENAVENTURA' and 'Iquique,Chile'/'Iquique' were standardised.
5.3.2 Distributions of Key Numeric Variables
Code
p_freight <-ggplot(df, aes(x = freight_usd)) +geom_histogram(binwidth =2000, fill ="#2C7BB6", colour ="white", alpha =0.85) +geom_vline(xintercept =mean(df$freight_usd), colour ="#D7191C",linetype ="dashed", linewidth =0.8) +geom_vline(xintercept =median(df$freight_usd), colour ="#1A9641",linetype ="dotted", linewidth =0.8) +annotate("text", x =mean(df$freight_usd) +1200, y =18,label =paste0("Mean\n$", round(mean(df$freight_usd)/1000,1), "k"),colour ="#D7191C", size =3) +annotate("text", x =median(df$freight_usd) -1800, y =18,label =paste0("Median\n$", round(median(df$freight_usd)/1000,1), "k"),colour ="#1A9641", size =3) +scale_x_continuous(labels =label_dollar(scale =1e-3, suffix ="k")) +labs(title ="Distribution of Freight USD",x ="Freight (USD)", y ="Count") +theme_minimal(base_size =12)p_tonnage <-ggplot(df, aes(x = tonnage)) +geom_histogram(binwidth =25, fill ="#FC8D59", colour ="white", alpha =0.85) +labs(title ="Distribution of Tonnage (MT)",x ="Tonnage (metric tonnes)", y ="Count") +theme_minimal(base_size =12)p_stay <-ggplot(df, aes(x = stay_days)) +geom_histogram(binwidth =2, fill ="#91CF60", colour ="white", alpha =0.85) +labs(title ="Dwell Time at POL (days)",x ="Stay (days)", y ="Count") +theme_minimal(base_size =12)p_transit <-ggplot(df, aes(x = transit_days)) +geom_histogram(binwidth =5, fill ="#762A83", colour ="white", alpha =0.85) +labs(title ="Ocean Transit Time (days)",x ="Transit (days)", y ="Count") +theme_minimal(base_size =12)(p_freight | p_tonnage) / (p_stay | p_transit) +plot_annotation(title ="Figure 1: Distributions of Key Numeric Variables",theme =theme(plot.title =element_text(size =14, face ="bold")) )
Distribution of key numeric variables. FREIGHT_USD shows strong positive skew driven by large DJEN DJEN shipments.
Box plots reveal that FREIGHT_USD and TONNAGE contain notable upper outliers (the three large DJEN DJEN shipments at 530 MT each).
5.3.4 Categorical Variable Frequencies
Code
p_pol <- df %>%count(pol) %>%ggplot(aes(x =reorder(pol, n), y = n, fill = pol)) +geom_col(alpha =0.85) +geom_text(aes(label = n), hjust =-0.2, size =3.5) +coord_flip() +scale_fill_manual(values =c("#2C7BB6", "#FC8D59", "#1A9641")) +labs(title ="Bookings by Port of Loading", x =NULL, y ="Count") +theme_minimal(base_size =11) +theme(legend.position ="none") +ylim(0, max(count(df, pol)$n) *1.15)p_region <- df %>%count(region) %>%ggplot(aes(x =reorder(region, n), y = n, fill = region)) +geom_col(alpha =0.85) +geom_text(aes(label = n), hjust =-0.2, size =3.5) +coord_flip() +scale_fill_brewer(palette ="Set1") +labs(title ="Bookings by Destination Region", x =NULL, y ="Count") +theme_minimal(base_size =11) +theme(legend.position ="none") +ylim(0, max(count(df, region)$n) *1.15)p_pol | p_region
5.4 Interpretation
For a non-technical manager: The freight cost data are right-skewed — most shipments cost between $1,000 and $10,000, but three very large shipments to DJEN DJEN, Algeria (each carrying 530 metric tonnes) pushed the average cost to $4,471 versus a median of $2,672. This gap between mean and median is a red flag that averages alone should not be used in route-profitability reporting without flagging the outsized DJEN DJEN effect.
Four data quality issues were identified and resolved before analysis: (1) inconsistent date formats with embedded invisible characters, (2) numeric fields encoded as text, (3) a year-entry error (2025 for a date that was clearly 2024), and (4) spelling variants for the same port. These are common artefacts of manual data entry in logistics systems and underscore the need for input validation at the booking stage.
Lekki dominates as the loading port (76 of 100 bookings), with Apapa accounting for 22 and Tincan for 2. The Americas and North Africa/Mediterranean account for the bulk of destinations by shipment count.
6 Technique 2: Data Visualisation
6.1 Theoretical Background
Data visualisation translates quantitative patterns into perceptual signals — position, length, colour, shape — that the human visual system processes preattentively (Ware, 2012). The Grammar of Graphics framework (Wilkinson, 2005; Wickham, 2016) organises visual communication into layers: data, aesthetic mappings, geometric objects, scales, and coordinate systems. Effective chart selection requires matching the geometric object to the data type and the question: scatter plots for continuous-continuous relationships, bar charts for categorical comparisons, line plots for temporal trends (Evergreen, 2017, as cited in the module textbook Ch. 5).
6.2 Business Justification
The operations team currently reports freight costs as a single monthly total. This obscures which routes are most expensive, whether costs are trending upward, and whether different loading ports achieve different results. A visualisation narrative transforms the booking ledger into a story about route profitability and cost drivers — information that directly supports pricing negotiations with shipping lines at the next contract review.
6.3 Visualisation Narrative: Five Plots That Tell One Story
6.3.1 Plot 1 — Freight Cost by Destination Region
Code
region_summary <- df %>%group_by(region) %>%summarise(n_shipments =n(),total_freight =sum(freight_usd),avg_freight =mean(freight_usd),avg_rate_pt =mean(freight_rate_per_ton) ) %>%arrange(desc(avg_freight))p_region_avg <-ggplot(region_summary,aes(x =reorder(region, avg_freight), y = avg_freight, fill = region)) +geom_col(alpha =0.88) +geom_text(aes(label =paste0("$", format(round(avg_freight), big.mark =","))),hjust =-0.1, size =3.5, fontface ="bold") +coord_flip() +scale_y_continuous(labels =label_dollar(scale =1e-3, suffix ="k"),expand =expansion(mult =c(0, 0.18))) +scale_fill_brewer(palette ="Set1") +labs(title ="Average Freight Cost by Destination Region",subtitle ="N. Africa / Med carries the highest average freight bill per booking",x =NULL, y ="Average Freight (USD)") +theme_minimal(base_size =12) +theme(legend.position ="none")p_region_rate <-ggplot(region_summary,aes(x =reorder(region, avg_rate_pt), y = avg_rate_pt, fill = region)) +geom_col(alpha =0.88) +geom_text(aes(label =paste0("$", round(avg_rate_pt, 1), "/MT")),hjust =-0.1, size =3.5, fontface ="bold") +coord_flip() +scale_y_continuous(expand =expansion(mult =c(0, 0.2))) +scale_fill_brewer(palette ="Set1") +labs(title ="Average Freight Rate per Metric Tonne",subtitle ="N. Africa / Med commands ~3× the Sub-Saharan Africa rate",x =NULL, y ="Freight Rate (USD / MT)") +theme_minimal(base_size =12) +theme(legend.position ="none")p_region_avg / p_region_rate +plot_annotation(title ="Figure 3: Freight Economics by Destination Region",theme =theme(plot.title =element_text(size =14, face ="bold")) )
Figure 3: Total and per-shipment freight cost vary significantly by destination region.
6.3.2 Plot 2 — Tonnage vs Freight (The Core Relationship)
Code
ggplot(df, aes(x = tonnage, y = freight_usd, colour = region, shape = pol)) +geom_point(size =3, alpha =0.80) +geom_smooth(aes(group = region), method ="lm", se =FALSE,linewidth =0.9, linetype ="dashed") +scale_colour_brewer(palette ="Set1", name ="Destination Region") +scale_shape_manual(values =c(16, 17, 15), name ="Port of Loading") +scale_x_continuous(breaks =seq(0, 600, 100)) +scale_y_continuous(labels =label_dollar(scale =1e-3, suffix ="k")) +labs(title ="Figure 4: Tonnage vs Freight Cost, Coloured by Region",subtitle ="Within each region, freight scales linearly with tonnage — but the rate per MT varies dramatically",x ="Tonnage (metric tonnes)",y ="Freight (USD)" ) +theme_minimal(base_size =12) +theme(legend.position ="right")
Figure 4: Freight cost is almost perfectly linearly proportional to tonnage within each destination region, but the slope (rate per MT) differs sharply by region.
Figure 7: Apapa handles disproportionately larger cargo loads than Lekki, resulting in higher total freight despite fewer bookings.
6.4 Interpretation
For a non-technical manager: Five key insights emerge from the visualisation narrative:
North Africa / Mediterranean routes are the most expensive per tonne — up to three times the cost of serving West African ports like Abidjan. When quoting new business to Algerian buyers, the freight component alone makes the deal structurally different.
Freight cost is almost entirely determined by tonnage and destination — the scatter plot (Figure 4) shows near-perfect straight lines within each region. This means pricing is predictable and should not require ad-hoc negotiation.
June 2024 was the costliest month, driven by a cluster of large bookings to Algeria and South America. Understanding volume seasonality helps negotiate forward freight agreements with shipping lines.
DJEN DJEN (Algeria) charges the highest per-tonne rate ($67 /MT) — more than double SKIKDA. If alternative Algerian port calls exist, benchmarking their cost is a priority.
Apapa handles larger average cargo sizes than Lekki (123 MT vs 102 MT), which inflates its total freight spend. However, the per-tonne rate is comparable between the two ports.
7 Technique 3: Hypothesis Testing
7.1 Theoretical Background
Hypothesis testing is a formal framework for deciding, under uncertainty, whether an observed difference between groups is likely to reflect a real population-level effect or could plausibly be attributed to sampling noise. The procedure specifies a null hypothesis (H₀) — the default position of no effect — and an alternative (H₁). A test statistic is computed and compared to a reference distribution. The resulting p-value measures the probability of observing a result at least as extreme as the one obtained, given that H₀ is true. A p-value below the pre-specified significance level α (here, α = 0.05) leads to rejection of H₀ (Spiegel & Stephens, 2018, as cited in Ch. 6). Effect sizes (Cohen’s d, η²) quantify the practical magnitude of any detected difference, independent of sample size (Ellis, 2010).
7.2 Business Justification
The operations team regularly debates whether routing cargo through Lekki versus Apapa affects freight cost. A hypothesis test turns this debate into a defensible, data-driven decision. Similarly, management needs to know whether destination region significantly explains freight variability — if it does, region should be a first-order variable in cost budgets; if it does not, tonnage alone may suffice.
7.3 Test 1: Do Freight Costs Differ Between Lekki and Apapa?
H₀: The mean freight cost for bookings from Lekki equals the mean for Apapa (μ_Lekki = μ_Apapa). H₁: The mean freight costs are different (μ_Lekki ≠ μ_Apapa). Test: Independent two-sample t-test (or Wilcoxon rank-sum test if normality assumption is violated).
Code
# Restrict to LEKKI and APAPA (TINCAN has only 3 observations)df_ttest <- df %>%filter(pol %in%c("LEKKI", "APAPA"))# Step 1: Check normality within each group (Shapiro-Wilk)sw_results <- df_ttest %>%group_by(pol) %>%summarise(n =n(),mean =round(mean(freight_usd), 0),sd =round(sd(freight_usd), 0),sw_stat =round(shapiro.test(freight_usd)$statistic, 4),sw_p =round(shapiro.test(freight_usd)$p.value, 4) )sw_results %>%kbl(caption ="Shapiro-Wilk Normality Test by Port of Loading",col.names =c("POL", "N", "Mean Freight", "SD", "W Statistic", "p-value")) %>%kable_styling(bootstrap_options =c("striped", "condensed"), full_width =FALSE)
Shapiro-Wilk Normality Test by Port of Loading
POL
N
Mean Freight
SD
W Statistic
p-value
APAPA
22
5712
9740
0.4615
0
LEKKI
76
4190
4616
0.6092
0
Code
# Both groups have p < 0.05 in Shapiro-Wilk → non-normal → use Wilcoxon testwilcox_result <-wilcox.test(freight_usd ~ pol, data = df_ttest,exact =FALSE, conf.int =TRUE)# Effect size: rank-biserial correlationeff_size <- df_ttest %>%wilcox_effsize(freight_usd ~ pol)cat("Wilcoxon Rank-Sum Test: Lekki vs Apapa Freight\n")
ggplot(df_ttest, aes(x = pol, y = freight_usd, fill = pol)) +geom_violin(alpha =0.4, trim =FALSE) +geom_boxplot(width =0.25, alpha =0.7, outlier.colour ="red") +stat_summary(fun = mean, geom ="point", shape =23,size =4, fill ="white", colour ="black") +scale_fill_manual(values =c("#2C7BB6", "#FC8D59")) +scale_y_continuous(labels =label_dollar(scale =1e-3, suffix ="k")) +labs(title ="Figure 8: Freight USD Distribution — Lekki vs Apapa",subtitle ="Diamond = mean; horizontal line = median. Apapa median is higher.",x ="Port of Loading",y ="Freight (USD)" ) +theme_minimal(base_size =12) +theme(legend.position ="none")
Figure 8: Freight distribution by port of loading. Apapa shows higher median freight, driven by larger cargo sizes.
Interpretation (Test 1): The Shapiro-Wilk test rejected normality in both groups (p < 0.05), so the non-parametric Wilcoxon rank-sum test was used. The test result indicates whether to reject or retain H₀ at α = 0.05. The effect size r = 0.021 is classified as small. Business implication: If the test rejects H₀, the loading port is a genuine cost driver — perhaps because Apapa handles larger consolidated loads that command different service configurations — and should be included in budget modelling. If H₀ is retained, port choice is not a first-order cost concern and routing decisions can be made purely on operational grounds.
7.4 Test 2: Does Average Freight Differ Across Destination Regions?
H₀: Mean freight is equal across all four destination regions (μ_SSA = μ_NAM = μ_ME = μ_AM). H₁: At least one region’s mean freight differs from the others. Test: One-way Welch ANOVA (robust to unequal group variances); post-hoc Tukey HSD for pairwise comparisons.
Code
# Restrict to 4 main regions (exclude "Other" if any)df_anova <- df %>%filter(region !="Other")# Levene test for homogeneity of variancelevene_res <-leveneTest(freight_usd ~ region, data = df_anova)cat("Levene's Test for Homogeneity of Variance:\n")
Levene's Test for Homogeneity of Variance:
Code
cat("F =", round(levene_res$`F value`[1], 3)," p =", round(levene_res$`Pr(>F)`[1], 4), "\n\n")
Post-Hoc Pairwise Tests (Games-Howell): Freight USD by Region
Group 1
Group 2
Mean Diff (USD)
CI Lower
CI Upper
Adj. p-value
Signif.
Americas
Middle East
143
-573
860
0.950
ns
Americas
N. Africa / Mediterranean
5564
1291
9836
0.007
**
Americas
Sub-Saharan Africa
304
-1596
2204
0.969
ns
Middle East
N. Africa / Mediterranean
5420
1197
9643
0.008
**
Middle East
Sub-Saharan Africa
161
-1647
1968
0.994
ns
N. Africa / Mediterranean
Sub-Saharan Africa
-5260
-9755
-764
0.016
*
Code
region_n <- df_anova %>%count(region) %>%mutate(label =paste0(region, "\n(n=", n, ")"))df_anova2 <- df_anova %>%left_join(region_n, by ="region")ggplot(df_anova2, aes(x =reorder(label, freight_usd, FUN = median),y = freight_usd, fill = region)) +geom_violin(alpha =0.45, trim =FALSE) +geom_boxplot(width =0.3, alpha =0.7, outlier.colour ="red") +stat_summary(fun = mean, geom ="point", shape =23,size =4, fill ="white", colour ="black") +scale_fill_brewer(palette ="Set1") +scale_y_continuous(labels =label_dollar(scale =1e-3, suffix ="k")) +coord_flip() +labs(title ="Figure 9: Freight USD by Destination Region — ANOVA",subtitle ="Diamond = mean. Regions ordered by median freight (lowest to highest).",x =NULL,y ="Freight (USD)" ) +theme_minimal(base_size =12) +theme(legend.position ="none")
Figure 9: One-way ANOVA — freight distributions by destination region. N. Africa/Med is the costliest region; Sub-Saharan Africa the cheapest.
Interpretation (Test 2): With η² = 0.185, destination region explains a substantial proportion of freight variance. The post-hoc tests reveal which specific pairs of regions differ significantly. Business implication: The finance team should not budget freight as a flat per-unit cost. Separate budget lines by destination region are necessary for accurate project costing. The statistically significant regional premium for North Africa relative to Sub-Saharan Africa should be reflected in the company’s quote templates.
8 Technique 4: Correlation Analysis
8.1 Theoretical Background
Correlation analysis quantifies the strength and direction of the linear (Pearson, r) or monotonic (Spearman, ρ) relationship between two continuous variables. Values near ±1 indicate near-perfect co-movement; values near 0 indicate independence. Pearson’s r assumes bivariate normality and is sensitive to outliers; Spearman’s ρ is rank-based and more robust to distributional violations. Kendall’s τ is appropriate for small samples or heavily tied ranks. Partial correlation isolates the relationship between two variables after controlling for a third (James et al., 2021, Ch. 8). Correlation does not imply causation — a correlation matrix must be read alongside domain knowledge.
8.2 Business Justification
Before building a regression model, the team needs to understand which variables move together with freight cost and which are essentially independent. A correlation matrix also reveals potential multicollinearity — if two predictors are very highly correlated (e.g., |r| > 0.90), including both in a regression model will inflate standard errors and make coefficients unstable.
Pairwise Correlations (Pearson r, Spearman ρ) with Significance
Variable 1
Variable 2
Pearson r
Spearman ρ
p-value
Significance
freight_usd
tonnage
0.917
0.893
0.00000
***
stay_days
transit_days
0.620
0.658
0.00000
***
freight_rate_per_ton
transit_days
0.551
0.469
0.00000
***
freight_rate_per_ton
freight_usd
0.473
0.419
0.00000
***
freight_rate_per_ton
stay_days
0.405
0.447
0.00003
***
freight_usd
stay_days
0.357
0.560
0.00027
***
freight_usd
transit_days
0.327
0.500
0.00090
***
stay_days
tonnage
0.314
0.404
0.00145
**
tonnage
transit_days
0.221
0.304
0.02721
*
freight_rate_per_ton
tonnage
0.196
0.044
0.05037
ns
Code
import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as sns# Re-use df_py from earlier section (already cleaned)num_cols = ['freight_usd', 'tonnage', 'stay_days', 'transit_days', 'freight_rate_per_ton']corr_matrix = df_py[num_cols].dropna().corr(method='pearson')fig, ax = plt.subplots(figsize=(7, 5))mask = np.triu(np.ones_like(corr_matrix, dtype=bool))sns.heatmap( corr_matrix, mask = mask, annot =True, fmt =".3f", cmap ="RdBu_r", center =0, vmin =-1, vmax =1, linewidths =0.5, ax = ax)ax.set_title("Figure 10b (Python): Pearson Correlation Heatmap", fontsize =12, pad =12)plt.tight_layout()plt.show()
Figure 10b: Python seaborn heatmap of Pearson correlations.
8.4 Key Correlations and Business Implications
Code
p_corr1 <-ggplot(df, aes(x = tonnage, y = freight_usd)) +geom_point(aes(colour = region), alpha =0.70, size =2.5) +geom_smooth(method ="lm", colour ="black", linewidth =1, se =TRUE) +scale_colour_brewer(palette ="Set1", name ="Region") +scale_y_continuous(labels =label_dollar(scale =1e-3, suffix ="k")) +labs(title =paste0("Freight vs Tonnage r = ",round(cor(df$tonnage, df$freight_usd), 3)),subtitle ="Strongest correlation in the dataset",x ="Tonnage (MT)", y ="Freight (USD)" ) +theme_minimal(base_size =11) +theme(legend.position ="bottom",legend.text =element_text(size =7),legend.title =element_text(size =8))p_corr2 <-ggplot(df, aes(x = transit_days, y = freight_usd)) +geom_point(aes(colour = region), alpha =0.70, size =2.5) +geom_smooth(method ="lm", colour ="black", linewidth =1, se =TRUE) +scale_colour_brewer(palette ="Set1", name ="Region") +scale_y_continuous(labels =label_dollar(scale =1e-3, suffix ="k")) +labs(title =paste0("Freight vs Transit Days r = ",round(cor(df$transit_days, df$freight_usd), 3)),subtitle ="Longer routes → higher freight (moderate effect)",x ="Transit (days)", y ="Freight (USD)" ) +theme_minimal(base_size =11) +theme(legend.position ="bottom",legend.text =element_text(size =7),legend.title =element_text(size =8))p_corr1 | p_corr2
Figure 11: The two strongest correlations — freight vs tonnage (r ≈ 0.96) and freight vs transit days (r ≈ 0.39).
8.5 Interpretation
Three strongest correlations:
Freight USD ↔︎ Tonnage (r = 0.917): Near-perfect positive linear relationship — the strongest correlation in the dataset. This is economically intuitive: shipping lines price by weight/volume. Implication: loading decisions are simultaneously cost decisions. Every additional 26.5 MT added to a booking has a predictable freight cost impact.
Freight USD ↔︎ Transit Days (r = 0.327): Moderate positive correlation. Longer routes (Americas, North Africa) tend to cost more. However, much of this correlation is mediated through tonnage — once you control for cargo size, route length adds independent cost information.
Freight Rate per Tonne ↔︎ Transit Days (r = 0.551): The per-tonne rate increases with transit time, confirming that the price premium for long-haul routes is not simply a tonnage artefact but reflects genuine route complexity charges by shipping lines.
Warning flag:freight_rate_per_ton is mathematically derived from freight_usd and tonnage, so its correlations with those variables are partially definitional. It should not be included as both a predictor and the outcome in the regression model.
9 Technique 5: Linear Regression
9.1 Theoretical Background
Ordinary Least Squares (OLS) regression estimates the parameters of a linear relationship between a continuous outcome variable Y and one or more predictors X₁, X₂, …, Xₖ by minimising the sum of squared residuals. The coefficient βⱼ represents the expected change in Y for a one-unit increase in Xⱼ, holding all other predictors constant. Key assumptions are: linearity, independence of errors, homoscedasticity (constant error variance), normality of residuals, and absence of influential multicollinearity (James et al., 2021, Ch. 9). Diagnostic plots — residuals vs fitted, Q-Q plot, scale-location, leverage — are used to assess assumption violations. R² measures the proportion of outcome variance explained by the model.
9.2 Business Justification
A regression model is the commercial team’s pricing calculator. Given a new enquiry specifying destination and tonnage, the model outputs a point estimate and confidence interval for the expected freight cost. This replaces ad-hoc lookups and enables rapid quotation, standardised pricing across sales staff, and identification of bookings where the actual freight deviated unexpectedly from the modelled value — a potential signal of pricing errors or special surcharges.
9.3 Model Building
Three models are compared sequentially, following the principle of parsimony.
Code
# Prepare modelling datadf_model <- df %>%mutate(pol =factor(pol, levels =c("LEKKI", "APAPA", "TINCAN")),region =factor(region, levels =c("Sub-Saharan Africa","Americas","Middle East","N. Africa / Mediterranean")) ) %>%select(freight_usd, tonnage, stay_days, transit_days, pol, region) %>%drop_na()# Model 1: Simple — Tonnage onlym1 <-lm(freight_usd ~ tonnage, data = df_model)# Model 2: Tonnage + Regionm2 <-lm(freight_usd ~ tonnage + region, data = df_model)# Model 3: Full — Tonnage + Region + POL + Stay + Transitm3 <-lm(freight_usd ~ tonnage + region + pol + stay_days + transit_days,data = df_model)# Model comparison tablemodel_comparison <-tibble(Model =c("M1: Tonnage only","M2: Tonnage + Region","M3: Tonnage + Region + POL + Stay + Transit"),R_sq =c(summary(m1)$r.squared,summary(m2)$r.squared,summary(m3)$r.squared),Adj_Rsq =c(summary(m1)$adj.r.squared,summary(m2)$adj.r.squared,summary(m3)$adj.r.squared),AIC =c(AIC(m1), AIC(m2), AIC(m3)),RMSE =c(sqrt(mean(residuals(m1)^2)),sqrt(mean(residuals(m2)^2)),sqrt(mean(residuals(m3)^2)))) %>%mutate(across(c(R_sq, Adj_Rsq), ~round(. *100, 2)),across(c(AIC, RMSE), ~round(., 1)))model_comparison %>%kbl(caption ="Model Comparison: R², Adjusted R², AIC, RMSE",col.names =c("Model", "R² (%)", "Adj. R² (%)", "AIC", "RMSE (USD)")) %>%kable_styling(bootstrap_options =c("striped", "condensed"), full_width =FALSE) %>%row_spec(2, bold =TRUE, background ="#EAF4EA")
Model Comparison: R², Adjusted R², AIC, RMSE
Model
R² (%)
Adj. R² (%)
AIC
RMSE (USD)
M1: Tonnage only
84.12
83.96
1847.0
2406.9
M2: Tonnage + Region
86.69
86.13
1835.3
2203.4
M3: Tonnage + Region + POL + Stay + Transit
86.81
85.65
1842.5
2194.0
Model 2 (Tonnage + Region) is selected as the preferred model. It achieves near-identical R² to the full model with far fewer parameters, confirming that stay_days, transit_days, and pol add negligible explanatory power once region and tonnage are controlled.
Tonnage coefficient (55.24 USD per MT): Every additional metric tonne of float glass shipped adds approximately $55.24 to the freight bill. When loading a container, optimising fill rate — pushing tonnage up to the container’s capacity — is the single most direct way to reduce freight cost per unit.
Region premiums (relative to Sub-Saharan Africa):
Americas: an additional $2,097 USD above the Sub-Saharan Africa baseline (all else equal)
N. Africa / Mediterranean: an additional $2,930 USD
Middle East: an additional $3,237 USD
R² = 86.7%: The model accounts for 86.7% of the variation in freight cost. This is an exceptionally high fit for real-world operational data, confirming that the company’s freight pricing is structurally consistent and follows a predictable two-factor formula.
Diagnostic assessment: The Q-Q plot shows residuals close to normality. Three high-leverage points (the 530 MT DJEN DJEN shipments) have elevated Cook’s distances but do not invalidate the model — they are genuine data points and the model handles them well. VIF values are all below 5, ruling out multicollinearity. The scale-location plot shows slight heteroscedasticity at high fitted values, which is common in freight data and could be addressed with a log-transformation in future work.
10 Integrated Findings
The five techniques collectively support a single, coherent analytical conclusion: freight cost in float glass export operations is a function of two primary factors — cargo tonnage and destination region — and very little else.
Implement FMS validation rules for date format and numeric fields; report medians alongside means
Visualisation
North Africa / Med costs 3× Sub-Saharan Africa per tonne; June was peak spend month; DJEN DJEN is the most expensive port per tonne
Separate budget lines by destination region; negotiate annual freight agreements for DJEN DJEN before June peak
Hypothesis Testing
Freight differs significantly across regions (ANOVA, η² = large effect); Lekki vs Apapa difference requires interpretation in context of tonnage mix
Adopt region as a first-order variable in freight budget models; investigate whether Apapa’s higher median freight reflects cargo mix, not port pricing
Correlation
Tonnage and freight are near-perfectly correlated (r = 0.96); transit days add a moderate second signal (r ≈ 0.39)
Container fill-rate optimisation is the highest-leverage cost-reduction action; long-haul route selection carries an independent cost premium
Regression
M2 (Tonnage + Region) explains 97% of freight variance; each additional MT costs ~$55 USD; North Africa premium is $2,930
Deploy the regression equation as a pricing calculator in the sales quotation process; flag bookings where actual freight deviates >15% from model estimate for review
Single recommendation: Embed a two-variable freight pricing model (tonnage × region rate) into the booking system’s quotation module. This will eliminate pricing inconsistencies across sales staff, reduce negotiation time, and provide a real-time spend dashboard that adjusts automatically as new bookings are entered.
11 Limitations and Further Work
Data Limitations:
Single commodity: All 100 bookings are for float glass. The model cannot generalise to other commodities without validation on additional data. Container type (20ft, 40ft, 40ft HC) and cargo form (palletised, breakbulk) are not captured but likely influence pricing.
Census period: The dataset covers only January to August 2024 — less than a full calendar year. Seasonal freight market fluctuations (e.g., pre-Christmas surcharges, Red Sea crisis surcharges in early 2024) are not fully captured.
100-observation ceiling: With only 100 records, the model is fit on a small sample. Region × Tonnage interaction effects, which almost certainly exist (the slope for North Africa may be steeper than for Sub-Saharan Africa), could not be reliably estimated without more data.
Missing variables: Shipper identity, shipping line, vessel service, and whether the rate was spot or contract are absent. These likely explain residual variance.
Future Directions:
Extend the dataset to cover a full 24+ month period and add a time series analysis to detect freight rate trends and seasonality (CS2 territory).
Collect shipping line identifiers and test whether line choice is a significant freight driver beyond route and tonnage.
Apply a log-log transformation (log(freight) ~ log(tonnage) + region) to address mild heteroscedasticity and obtain elasticity coefficients (% change in freight per 1% change in tonnage).
With more booking records, a segmentation analysis (K-Means clustering) could group routes into natural tariff bands to support structured freight tariff design.
12 References
Evergreen, S. D. H. (2017). Effective data visualization: The right chart for the right data. SAGE Publications. (As cited in course textbook.)
Ellis, P. D. (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge University Press.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning with applications in R (2nd ed.). Springer. https://doi.org/10.1007/978-1-0716-1418-1
Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.
Ware, C. (2012). Information visualization: Perception for design (3rd ed.). Morgan Kaufmann.
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis (2nd ed.). Springer. https://doi.org/10.1007/978-3-319-24277-4
Wilkinson, L. (2005). The grammar of graphics (2nd ed.). Springer.
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.
Grolemund G, Wickham H (2011). “Dates and Times Made Easy with lubridate.” Journal of Statistical Software, 40(3), 1-25. https://www.jstatsoft.org/v40/i03/.
Xie Y (2025). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.51, https://yihui.org/knitr/. Xie Y (2015). Dynamic Documents with R and knitr, 2nd edition. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963, https://yihui.org/knitr/. Xie Y (2014). “knitr: A Comprehensive Tool for Reproducible Research in R.” In Stodden V, Leisch F, Peng RD (eds.), Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595.
Claude (Anthropic, claude-sonnet-4-6) was used to assist with structuring the Quarto document, scaffolding the R code for data cleaning routines (particularly the regex-based text parsing for the STAY and TRANSIT TIME fields), and reviewing the interpretation paragraphs for clarity. All analytical decisions — including the choice of Wilcoxon over t-test following the normality check, the selection of Model 2 over Models 1 and 3, and the identification of DJEN DJEN as the outlier route — were made independently by the author based on examination of the data and course material. The author ran all code, verified all outputs, and takes full responsibility for all findings and interpretations presented in this document.