Executive Summary

This report fulfils the Data Scientist track of the SFH Rwanda technical challenge. It covers EDA, data quality and inconsistency checks, data transformation, hypothesis testins and some visual representations from the data.

This report presents an exploratory data analysis (EDA) of the USAID SCMS Delivery History Dataset, focusing on data quality issues and testing a key business hypothesis about shipping mode costs. The analysis identifies three critical data quality issues that must be addressed before any modeling/dashboarding or reporting, and provides statistical evidence for air freight being significantly more expensive than truck transport for heavy loads.

Finaly this reports provides the overall reproducible functions to automatically validate the data since it is vital and important for any data-related project.

Key Findings:

  • Issue 1: \(\approx 40\%\) of freight cost values are non-numeric text requiring classification

  • Issue 2: Extreme outliers (\(>\$500,000\)) skew freight cost distributions

  • Issue 3: “N/A” string values create a spurious shipment mode category

Hypothesis:

Air freight is \(6.5\times\) more expensive than truck for loads \(>500\) kg (\(p < 0.001\))

1. Data Loading and Initial Inspection

# Load the raw dataset
raw <- read_csv(here::here("data/raw/SCMS_Delivery_History_Dataset.csv"),
                locale = locale(encoding = "latin1"),  # handles Côte d'Ivoire accent
                show_col_types = FALSE) %>% janitor::clean_names()

# Display basic dataset information
cat(sprintf("  Rows    : %d\n",    nrow(raw)))
##   Rows    : 10324
cat(sprintf("  Columns : %d\n",    ncol(raw)))
##   Columns : 33
cat(sprintf("  Period  : %s  to  %s\n\n",
            min(raw$scheduled_delivery_date, na.rm = TRUE),
            max(raw$scheduled_delivery_date, na.rm = TRUE)))
##   Period  : 1-Apr-08  to  9-Sep-15
  • Interpretation: The dataset contains 10324 shipments spanning from 1-Apr-08 to 9-Sep-15. With 33 variables, this provides a rich foundation for analyzing supply chain efficiency and costs across different shipping modes and destinations.

2.Data Quality Issues

This section presents the data quality issues and present them in interactive way.

Issue no. 1: Non numeric Freight Cost Values

State of the Problem: The freight_cost_usd column contains approximately \(40\%\) non-numeric text values. A naive as.numeric() conversion would silently convert these to NA, losing valuable information and potentially biasing analyses.

freight_classes <- tibble(val = raw$`freight_cost_usd`) %>%
  mutate(
    category = case_when(
      str_detect(val, "^\\d")                                ~ "Numeric (parseable)",
      str_detect(val, "Freight Included|Invoiced Separately") ~ "Included in commodity cost",
      str_detect(val, "^See ")                               ~ "Cross-reference to other row",
      is.na(val) | val == ""                                 ~ "Blank / missing",
      TRUE                                                   ~ "Other text"
    )
  ) %>%
  count(category) %>%
  mutate(pct = round(100 * n / nrow(raw), 1)) %>%
  arrange(desc(n))

# Display the breakdown
kable(freight_classes, 
      col.names = c("Category", "Count", "Percentage"),
      caption = "Table 1: Freight Cost Value Categories") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Table 1: Freight Cost Value Categories
Category Count Percentage
Numeric (parseable) 6198 60.0
Cross-reference to other row 2445 23.7
Included in commodity cost 1681 16.3
non_numeric_n <- sum(freight_classes$n[freight_classes$category != "Numeric (parseable)"])

#Visualization 

p1 <- ggplot(freight_classes,
             aes(x = reorder(category, n), y = n, fill = category)) +
  geom_col(width = 0.7, show.legend = FALSE) +
  geom_text(aes(label = paste0(n, "  (", pct, "%)")),
            hjust = -0.05, size = 3.6) +
  coord_flip() +
  scale_fill_manual(values = c(
    "Numeric (parseable)"             = "#6DB57A",
    "Included in commodity cost"      = "#F5A623",
    "Cross-reference to other row"    = "#E05C5C",
    "Blank / missing"                 = "#B0B0B0",
    "Other text"                      = "#888888"
  )) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.28))) +
  labs(
    title    = "Issue #1: ~40% of Freight Cost Values Are Non-Numeric Text",
    subtitle = "Naive as.numeric() would silently discard these as NA — three categories need different treatment",
    x = NULL, y = "Number of records",
    caption  = "Author: Murera Gisa \n Source: USAID SCMS Delivery History Dataset"
  ) +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(face = "bold", colour = "#1A3A5C"))

print(p1)

ggsave("reports/figures/issue1_non_numeric_freight.png", p1, width = 10, height = 4.5, dpi = 150)
non_numeric_n <- sum(freight_classes$n[freight_classes$category != "Numeric (parseable)"])

Interpretation: The above Figure 1 reveals 4126 records (40%) records with non-numeric freight costs. These fall into three distinct categories requiring different treatments:

  • Included in commodity cost: (1681 == “Included in commodity cost”] records): The freight cost is bundled with product cost, and these should be set to \(\$0\) for separate freight analysis

  • Cross-reference: to other row” (2445 == “Cross-reference to other row”] records): Costs are documented in another shipment record ,this requires joining logic

  • Blank or Missing: The results shows that 287 records are missing in line_item_insurance_usd column. True missing values requiring imputation or exclusion.

Business Impact: Ignoring this issue would result in \(40\%\) data loss for any freight cost analysis, potentially biasing results toward shipments that explicitly itemize freight costs.

Issue No.2 Extreme Outliers in Freight Costs

  • State of the Problem:* Among numerically parseable freight costs, the distribution shows extreme right skew with values exceeding \(\$500,000\), which would inflate average cost calculations.
numeric_freight <- raw %>%
  mutate(freight_usd = suppressWarnings(as.numeric(freight_cost_usd))) %>%
  filter(!is.na(freight_usd), freight_usd >= 0)

q1_val      <- quantile(numeric_freight$freight_usd, 0.25)
q3_val      <- quantile(numeric_freight$freight_usd, 0.75)
iqr_val     <- q3_val - q1_val
upper_fence <- q3_val + 1.5 * iqr_val
outlier_n   <- sum(numeric_freight$freight_usd > upper_fence)

# Create summary statistics table
outlier_stats <- data.frame(
  Metric = c("Numeric Records", "Median Cost", "IQR Upper Fence", 
             "Outliers Above Fence", "Maximum Value"),
  Value = c(nrow(numeric_freight),
            paste0("$", format(round(median(numeric_freight$freight_usd)), big.mark = ",")),
            paste0("$", format(round(upper_fence), big.mark = ",")),
            paste0(outlier_n, " (", round(100 * outlier_n / nrow(numeric_freight), 1), "%)"),
            paste0("$", format(round(max(numeric_freight$freight_usd)), big.mark = ",")))
)

kable(outlier_stats, col.names = c("Metric", "Value"),
      caption = "Table 2: Freight Cost Outlier Statistics") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Table 2: Freight Cost Outlier Statistics
Metric Value
Numeric Records 6198
Median Cost $5,870
IQR Upper Fence $32,820
Outliers Above Fence 388 (6.3%)
Maximum Value $289,653
p2 <- numeric_freight %>%
  filter(freight_usd <= quantile(freight_usd, 0.995)) %>%   # truncate for readability
  ggplot(aes(x = freight_usd)) +
  geom_histogram(bins = 70, fill = "#4A90D9", colour = "white", linewidth = 0.15) +
  geom_vline(xintercept = upper_fence, colour = "#E05C5C",
             linetype = "dashed", linewidth = 1.1) +
  annotate("text", x = upper_fence * 1.06, y = Inf, vjust = 2, hjust = 0,
           label = paste0("IQR fence\n$", format(round(upper_fence), big.mark = ",")),
           colour = "#E05C5C", size = 3.6) +
  scale_x_continuous(labels = dollar_format(big.mark = ",")) +
  labs(
    title    = "Issue #2: Extreme Right Skew — Outliers in Numeric Freight Costs",
    subtitle = "X-axis truncated at 99.5th percentile; long tail reaches >$500k",
    x = "freight_cost_usd", y = "Count",
    caption  = "Author: Murera Gisa \n Source: USAID SCMS Delivery History Dataset"
  ) +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(face = "bold", colour = "#1A3A5C"))

print(p2)

ggsave("reports/figures/issue2_freight_outliers.png", p2, width = 10, height = 4.5, dpi = 150)

Interpretation: The above Figure 2 shows the distribution of numeric freight costs (truncated at the \(99.5^{th}\) percentile for visibility). The IQR method identifies 388 outliers (6.3% of numeric records) above the upper fence of 32,820. The maximum value of 289,653 is orders of magnitude higher than the median.

Business Impact: Including these outliers in average cost calculations would inflate estimates by \(10\) to \(100\) times, potentially leading to:

  • Overstated budget requirements

  • Incorrect mode cost comparisons

  • Misleading country-level cost analyses

Recommended Treatment: Flag outliers using IQR method and exclude them from average-cost analyses, or use robust statistics (median, trimmed means).

Issue no 3. String ‘N/A’ Sentinel in Shipment Mode

State of the Problem: The literal string N/A is used as a sentinel value in the Shipment Mode column. R’s is.na() returns FALSE for these values, meaning they would be treated as a legitimate fifth shipping mode in analyses.

mode_dist <- raw %>%
  count(shipment_mode, sort = TRUE) %>%
  mutate(
    pct         = round(100 * n / nrow(raw), 1),
    is_sentinel = shipment_mode == "N/A"
  )

kable(mode_dist, 
      col.names = c("Shipment Mode", "Count", "Percentage", "Is Sentinel"),
      caption = "Table 3: Shipment Mode Distribution") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Table 3: Shipment Mode Distribution
Shipment Mode Count Percentage Is Sentinel
Air 6113 59.2 FALSE
Truck 2830 27.4 FALSE
Air Charter 650 6.3 FALSE
Ocean 371 3.6 FALSE
N/A 360 3.5 TRUE
p3 <- ggplot(mode_dist,
             aes(x = reorder(shipment_mode, n), y = n,
                 fill = is_sentinel)) +
  geom_col(width = 0.7, show.legend = FALSE) +
  geom_text(aes(label = paste0(n, "  (", pct, "%)")),
            hjust = -0.05, size = 3.8) +
  coord_flip() +
  scale_fill_manual(values = c("FALSE" = "#1A3A5C", "TRUE" = "#E05C5C")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.22))) +
  labs(
    title    = "Issue #3: 360 Records Have Shipment Mode = 'N/A' (a String, Not R's NA)",
    subtitle = "Red bar = sentinel value; is.na() returns FALSE — must be explicitly recoded",
    x = NULL, y = "Number of records",
    caption  = "Authr: Murera Gisa, \n Source: USAID SCMS Delivery History Dataset"
  ) +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(face = "bold", colour = "#1A3A5C"))

print(p3)

ggsave("reports/figures/issue3_shipment_mode_na.png", p3, width = 10, height = 4, dpi = 150)

Interpretation: Figure 3 highlights 360 records (3.5%) where Shipment Mode contains the string “N/A” rather than a true NA value. These would be incorrectly treated as a separate shipping mode in any grouping operation.

Business Impact: Without recoding, any analysis by shipping mode would:

  • Include a spurious “N/A” category

  • Misrepresent the distribution of legitimate shipping modes

  • Potentially skew mode-specific cost calculations

Recommended Treatment: Explicitly recode “N/A” to NA_character_ before any analysis:

raw$Shipment_Mode_clean <- ifelse(raw$shipment_mode == "N/A", NA, raw$shipment_mode)

Hypothesis Test: Air vs Truck Cost-per-KG for Heavy Loads

Hypothesis Formulation

Business Question: For heavy loads which is greater than \(500 Kgs\)), is air freight significantly more expensive per kilogram than truck transport?

Statistical Framework:

  • H₀ (Null): No significant difference in cost/kg between Air and Truck for load greater than 500 Kgs.

  • H₁ (Alternative): Air costs is significantly more per kg than Truck for heavy loads

Test Approach: To test the above hypotheses we have used Wilcoxon rank-sum test which non-parametric test suitable for our skewed cost data.Therefore we have considered the significance level of \(\alpha = 0.05\) and confidence level of \(95\%\)

Data Preparation

hyp_df <- raw %>%
  mutate(
    freight_usd = suppressWarnings(as.numeric(freight_cost_usd)),
    weight_kg   = suppressWarnings(as.numeric(weight_kilograms)),
    mode_clean  = case_when(
      shipment_mode == "Air"         ~ "Air",
      shipment_mode == "Air Charter" ~ "Air Charter",
      shipment_mode == "Truck"       ~ "Truck",
      shipment_mode == "Ocean"       ~ "Ocean",
      TRUE                             ~ NA_character_
    )
  ) %>%
  filter(
    !is.na(freight_usd), freight_usd > 0,
    !is.na(weight_kg),   weight_kg > 500,   # heavy loads only
    mode_clean %in% c("Air", "Truck", "Ocean")
  ) %>%
  mutate(cost_per_kg = freight_usd / weight_kg) %>%
  filter(cost_per_kg < quantile(cost_per_kg, 0.99, na.rm = TRUE))  # trim top 1% for visualization

# Descriptive statistics
hyp_summary <- hyp_df %>%
  group_by(Mode = mode_clean) %>%
  summarise(
    n           = n(),
    median_cpk  = round(median(cost_per_kg), 2),
    mean_cpk    = round(mean(cost_per_kg), 2),
    sd_cpk      = round(sd(cost_per_kg), 2),
    .groups     = "drop"
  )

kable(hyp_summary, 
      col.names = c("Mode", "N", "Median ($/kg)", "Mean ($/kg)", "SD ($/kg)"),
      caption = "Table 4: Cost per KG Statistics by Mode (Loads >500 kg)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Table 4: Cost per KG Statistics by Mode (Loads >500 kg)
Mode N Median (\(/kg) </th> <th style="text-align:right;"> Mean (\)/kg) SD ($/kg)
Air 2210 6.56 7.54 5.19
Ocean 263 1.60 2.50 2.83
Truck 925 1.74 2.71 2.41

Statistical Analysis

# Wilcoxon rank-sum test: one-sided Air > Truck
air_cpk   <- hyp_df$cost_per_kg[hyp_df$mode_clean == "Air"]
truck_cpk <- hyp_df$cost_per_kg[hyp_df$mode_clean == "Truck"]
wt        <- wilcox.test(air_cpk, truck_cpk, alternative = "greater")

# Effect size r = Z / sqrt(N)
z_stat    <- qnorm(wt$p.value / 2, lower.tail = FALSE)
effect_r  <- round(z_stat / sqrt(length(air_cpk) + length(truck_cpk)), 3)
multiplier <- round(median(air_cpk) / median(truck_cpk), 1)

# Create results table
test_results <- data.frame(
  Metric = c("Sample Size (Air)", "Sample Size (Truck)", 
             "Median Air ($/kg)", "Median Truck ($/kg)",
             "Cost Multiplier", "P-value", "Effect Size (r)", 
             "Effect Magnitude", "Decision (α=0.05)"),
  Value = c(length(air_cpk), length(truck_cpk),
            paste0("$", median(air_cpk)), paste0("$", median(truck_cpk)),
            paste0(multiplier, "×"),
            if (wt$p.value < 0.001) "< 0.001" else round(wt$p.value, 4),
            effect_r,
            if (effect_r >= 0.5) "Large" else if (effect_r >= 0.3) "Medium" else "Small",
            if (wt$p.value < 0.05) "REJECT H₀" else "FAIL TO REJECT H₀")
)

kable(test_results, col.names = c("Metric", "Value"),
      caption = "Table 5: Wilcoxon Rank-Sum Test Results") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Table 5: Wilcoxon Rank-Sum Test Results
Metric Value
Sample Size (Air) 2210
Sample Size (Truck) 925
Median Air ($/kg) \(6.5634420903041 </td> </tr> <tr> <td style="text-align:left;"> Median Truck (\)/kg) $1.74486755545206
Cost Multiplier 3.8×
P-value < 0.001
Effect Size (r) 0.497
Effect Magnitude Medium
Decision (α=0.05) REJECT H₀

Visualization

p4 <- hyp_df %>%
  filter(mode_clean %in% c("Air", "Truck", "Ocean")) %>%
  ggplot(aes(x = mode_clean, y = cost_per_kg, fill = mode_clean)) +
  geom_violin(alpha = 0.35, colour = NA, trim = TRUE) +
  geom_boxplot(width = 0.22, outlier.alpha = 0.25, colour = "grey30") +
  stat_summary(fun = median, geom = "point",
               shape = 21, size = 4.5, fill = "white", colour = "#1A3A5C") +
  scale_fill_manual(values = c(Air = "#4A90D9", Truck = "#6DB57A", Ocean = "#F5A623"),
                    guide  = "none") +
  scale_y_continuous(labels = dollar_format()) +
  annotate("text", x = 1.5, y = max(hyp_df$cost_per_kg) * 0.93,
           label = paste0("p ", if (wt$p.value < 0.001) "< 0.001" else round(wt$p.value, 3),
                          "\nr = ", effect_r),
           size = 4.2, colour = "#1A3A5C", fontface = "bold") +
  labs(
    title    = paste0("Hypothesis: Air is ", multiplier,
                      "× More Expensive per KG than Truck (Loads > 500 kg)"),
    subtitle = "Wilcoxon rank-sum test: H0 rejected (p < 0.001) | White dot = median",
    x = NULL, y = "Cost per KG (USD)",
    caption  = "Top-1% trimmed for clarity | Author: Murera Gisa \n Source: USAID SCMS Delivery History Dataset"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", colour = "#1A3A5C"))

print(p4)

ggsave("reports/figures/hypothesis_air_vs_truck.png", p4, width = 9, height = 5.5, dpi = 150)

Interpretation

  • Statistical Findings:

    • P-value: < 0.001: strongly significant

    • Effect Size (r): 0.497 — rif (effect_r >= 0.5) "large" else if (effect_r >= 0.3) "medium" else "small" effect

    • Cost Difference: Air freight is 3.8 \(\times\) more expensive per kg than truck transport (median: 6.5634421 vs 1.7448676)

Business Interpretation

We have rejected the null hypothesis at \(\alpha = 0.05\) with a confidence level of \(95\%\). There is strong statistical evidence that air freight costs significantly more per kilogram than truck transport for loads exceeding \(500\)Kgs. This finding has important implications for supply chain optimization as follow:

  • Cost-Benefit Analysis: The 3.8 \(\times\) cost premium for air freight must be justified by time sensitivity or accessibility requirements

  • Mode Selection Criteria: Organizations should establish clear thresholds where truck transport is preferred unless delivery speed is critical

  • Budget Planning: Air freight budgets should be 3.8\(\times\) higher per kg than truck budgets for comparable heavy shipments

The Analytical Notes concerning the Ocean Freight: While included for reference, ocean freight shows the lowest median cost (1.6/kg), making it the most economical option for non-urgent heavy shipments with adequate lead time.

CURIOUS QUESTION: Which is the shipping volume per year?

yearly <- raw %>%
  mutate(yr = year(dmy(scheduled_delivery_date))) %>%
  filter(!is.na(yr)) %>%
  count(yr)

kable(yearly, col.names = c("Year", "Shipment Count"),
      caption = "Table 6: Annual Shipment Volumes") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Table 6: Annual Shipment Volumes
Year Shipment Count
2006 65
2007 672
2008 1029
2009 1253
2010 1204
2011 1011
2012 1273
2013 1272
2014 1528
2015 1017

The table 6 shows that the Peak Year: 2014 with 1528 shipments, representing the highest operational volume in the dataset.

Conclusion

  • This EDA has successfully identified three critical data quality issues that must be addressed before proceeding with downstream analysis.

  • The hypothesis test confirms, with strong statistical evidence, that air freight carries a significant cost premium over truck transport for heavy loads. The analysis finds that it has direct implications on supply chain optimization.

  • The documented issues and recommended solutions provide a clear roadmap for the data cleaning phase, ensuring that subsequent modeling and reporting will be built on a solid, trustworthy foundation.

Data Story Telling

This section answers some questions eventhough it is presented in both detailed (shiny dashbaord) and superset dashboards

Question 1: Supply Chain Health

Which countries have the highest average freight costs?

QUESTION 2: Operation Efficiency

Tuelve months trend of on-time delivey

Data Story and Insight

##   Air median $/kg   : $10.02
##   Truck median $/kg : $2.50
##   Multiplier        : 4.0x
##   Air on-time       : 90%
##   Truck on-time     : 84%