Introduction

Volatilities in global energy markets introduce severe economic uncertainties, impacting consumer retail prices and national economic planning. Understanding the pass-through effect of crude oil price shocks onto localized retail markets requires a dual-pronged analytical framework.

This study leverages historical data across distinct temporal granularities to address two core research questions using machine learning paradigms:

Q1: Retail Price Prediction (Regression Model)

Objective: To predict future retail fuel prices based on the historical behavior of crude oil benchmarks.

Granularity: Analyzed according to country by month to capture medium-term macroeconomic shifts and regional dependencies.

Approach: Continuous numeric forecasting (Regression) to map upstream supply-side costs directly to downstream consumer impacts.

Q2: Global Price Impact Assessment (Classification Model)

Objective: To build a classification architecture that analyzes the specific impact level of global fuel price spikes on localized retail adjustments.

Granularity: Monitored by quarter to align with broader corporate and national budgetary cycles.

Approach: Categorical risk assessment (Classification) to segment country-specific fuel behavior into distinct shock-absorption or impact bands.

We will employ a standard machine learning framework which consist of:

  • Data Understanding: Overview and surface understanding of the data to ensure it meets our needs
  • Data Preprocessing: Cleaning, integration and transformation to ensure the data is robust
  • Exploratory Data Analysis: Exploring the data deeper to uncover relationships through visualization
  • Modeling: Compare various models along with hyperparameter tuning to find the best suited for both questions
  • Analysis: Analyze the result, as well as understand the limitation

Data Understanding

The data was gathered from Kaggle by the user BELBIN BENO R M titled Global Fuel Prices 2020–2026. It was then imported to R with the default function. A preview of the first 5 records are as below:

df <- read.csv("data_raw/dirty_fuel_data.csv")
Preview of the First Five Records in the Dataset
date country region income_level subsidy_level petrol_usd_liter diesel_usd_liter lpg_usd_liter brent_crude_usd tax_percentage
2020-01-06 United States North America High Low $1.465 999.990 1.093 65.75 59.6
2020-01-13 USA North America High Low $1.435 -5.000 1.077 65.54 27.6
2020-01-20 United States North America High Low $1.446 1.329 1.080 66.51 62.3
2020-01-27 United States North America High Low $1.488 1.359 1.125 68.79 44.8
2020-02-03 United States North America High Low $1.457 1.341 1.111 68.44 51.4

The data consists of nine features:

  • date: The time stamp of the record in the format of YYYY-MM-DD
  • country: The specific nation where the fuel prices were recorded
  • region: The broader geographic territory
  • income_level: The economic classification of the country
  • subsidy_level: Indicates the scale of government intervention and subsidies on national fuel prices
  • petrol_usd_liter: The local retail price of consumer gasoline, standardized to US Dollars per liter
  • diesel_usd_liter: The local retail price of automotive and industrial diesel fuel, standardized to US Dollars per liter
  • lpg_usd_liter: The local retail price of Liquefied Petroleum Gas, standardized to US Dollars per liter
  • brent_crude_usd: The global benchmark price for a barrel of Brent Crude oil in US Dollars
  • tax_percentage: The tax rate applied by the national government to fuel sales

Number of Unique Values:

## date: 327
## country: 85
## region: 7
## income_level: 3
## subsidy_level: 4
## petrol_usd_liter: 5517

Overview of the data:

## Rows: 27,483
## Columns: 10
## $ date             <chr> "2020-01-06", "2020-01-13", "2020-01-20", "2020-01-27", "2020-0~
## $ country          <chr> "United States", "USA", "United States", "United States", "Unit~
## $ region           <chr> "North America", "North America", "North America", "North Ameri~
## $ income_level     <chr> "High", "High", "High", "High", "High", "High", "High", "High",~
## $ subsidy_level    <chr> "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", ~
## $ petrol_usd_liter <chr> "$1.465", "$1.435", "$1.446", "$1.488", "$1.457", "1.461 USD", ~
## $ diesel_usd_liter <dbl> 999.990, -5.000, 1.329, 1.359, 1.341, 1.356, 1.390, 1.396, 1.41~
## $ lpg_usd_liter    <dbl> 1.093, 1.077, 1.080, 1.125, 1.111, 1.105, 1.111, 1.193, 1.156, ~
## $ brent_crude_usd  <dbl> 65.75, 65.54, 66.51, 68.79, 68.44, 68.09, 70.46, 71.61, 70.53, ~
## $ tax_percentage   <dbl> 59.6, 27.6, 62.3, 44.8, 51.4, 15.0, 58.8, 17.9, 55.2, 14.7, 14.~

Dataset Summary:

  • Total Observations: 27468
  • Total No. of Columns: 10
  • Total Countries: 84
  • Price Range (Petrol): 0.010, 6.779
  • Price Range (Diesel): 0.01, 6.24

This dataset is well suited for our questions because it captures both the primary global market drivers and the local economic buffers that dictate fuel costs. For Q1 (Regression), brent_crude_usd provides the baseline upstream cost, while the date and country variables allow the model to track the exact monthly time lag it takes for global price shocks to pass through to local retail pumps (petrol_usd_liter).

For Q2 (Classification), the data provides the structural macro-context needed to categorize quarterly impact risk bands. Features like subsidy_level and tax_percentage act as policy shields that alter how a nation absorbs financial shocks, while income_level and region help the classifier identify which clusters of countries are highly vulnerable versus those with the fiscal resilience to smooth out volatile energy spikes.


Data Cleaning

Executive Summary

This section details the reproducible data cleaning workflow applied to the dirty_fuel_data.csv dataset. The goal of this pipeline is to address duplicate rows, inconsistent categorical naming, character-to-numeric type mismatches, statistical outliers, and missing data imputation.

Part 1: Environment Setup

First, we load the tidyverse suite (which includes dplyr, readr, stringr, and ggplot2 for efficient data manipulation) and import the raw dataset.

## Rows: 27483 Columns: 10
## -- Column specification -------------------------------------------------------------------
## Delimiter: ","
## chr  (5): country, region, income_level, subsidy_level, petrol_usd_liter
## dbl  (4): diesel_usd_liter, lpg_usd_liter, brent_crude_usd, tax_percentage
## date (1): date
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 27,483
## Columns: 10
## $ date             <date> 2020-01-06, 2020-01-13, 2020-01-20, 2020-01-27, 2020-02-03, 20~
## $ country          <chr> "United States", "USA", "United States", "United States", "Unit~
## $ region           <chr> "North America", "North America", "North America", "North Ameri~
## $ income_level     <chr> "High", "High", "High", "High", "High", "High", "High", "High",~
## $ subsidy_level    <chr> "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", ~
## $ petrol_usd_liter <chr> "$1.465", "$1.435", "$1.446", "$1.488", "$1.457", "1.461 USD", ~
## $ diesel_usd_liter <dbl> 999.990, -5.000, 1.329, 1.359, 1.341, 1.356, 1.390, 1.396, 1.41~
## $ lpg_usd_liter    <dbl> 1.093, 1.077, 1.080, 1.125, 1.111, 1.105, 1.111, 1.193, 1.156, ~
## $ brent_crude_usd  <dbl> 65.75, 65.54, 66.51, 68.79, 68.44, 68.09, 70.46, 71.61, 70.53, ~
## $ tax_percentage   <dbl> 59.6, 27.6, 62.3, 44.8, 51.4, 15.0, 58.8, 17.9, 55.2, 14.7, 14.~

Part 2: Identifying “Dirty” Data

Before writing the cleaning pipeline, we must profile the dataset to uncover quality, structure, and validity issues.

A. Check for Missing Values (NAs)

print("Missing values per column:")
## [1] "Missing values per column:"
colSums(is.na(raw_data))
##             date          country           region     income_level    subsidy_level 
##                0                0                0                0                0 
## petrol_usd_liter diesel_usd_liter    lpg_usd_liter  brent_crude_usd   tax_percentage 
##             1373             1375                0                0                0

B. Check for Duplicates

print("Number of duplicate rows:")
## [1] "Number of duplicate rows:"
sum(duplicated(raw_data))
## [1] 15

C. Check for Data Type Mismatches

# Verifying if the petrol price column is recognized as numeric
class(raw_data$petrol_usd_liter)
## [1] "character"

D. Check for Outliers

# Summarizing diesel to identify mathematically impossible values
summary(raw_data$diesel_usd_liter)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##   -5.000    0.860    1.955    2.173    3.233  999.990     1375

E. Check for Inconsistent Naming

# Identifying naming variants for categorical grouping
unique(raw_data$country)
##  [1] "United States"  "USA"            "Canada"         "Mexico"         "Brazil"        
##  [6] "Argentina"      "Colombia"       "Chile"          "Peru"           "Venezuela"     
## [11] "Ecuador"        "United Kingdom" "Germany"        "France"         "Italy"         
## [16] "Spain"          "Netherlands"    "Belgium"        "Sweden"         "Norway"        
## [21] "Denmark"        "Finland"        "Poland"         "Portugal"       "Greece"        
## [26] "Switzerland"    "Austria"        "Ireland"        "Hungary"        "Czech Republic"
## [31] "Romania"        "Russia"         "Ukraine"        "Turkey"         "China"         
## [36] "Japan"          "South Korea"    "India"          "Indonesia"      "Malaysia"      
## [41] "Thailand"       "Vietnam"        "Philippines"    "Bangladesh"     "Pakistan"      
## [46] "Sri Lanka"      "Nepal"          "Myanmar"        "Singapore"      "Hong Kong"     
## [51] "Taiwan"         "Australia"      "New Zealand"    "Saudi Arabia"   "UAE"           
## [56] "Kuwait"         "Qatar"          "Iran"           "Iraq"           "Israel"        
## [61] "Jordan"         "Lebanon"        "Egypt"          "South Africa"   "Nigeria"       
## [66] "Kenya"          "Ethiopia"       "Ghana"          "Tanzania"       "Morocco"       
## [71] "Algeria"        "Libya"          "Tunisia"        "Angola"         "Mozambique"    
## [76] "Zambia"         "Zimbabwe"       "Uganda"         "Cameroon"       "Senegal"       
## [81] "Ivory Coast"    "Sudan"          "Botswana"       "Namibia"        "Rwanda"

Part 3: The Data Cleaning Pipeline

Using a single dplyr pipe sequence, we apply five specific data-cleaning steps to ensure the integrity of the data without breaking spatial correlations.

  1. Deduplication: Removed identical duplicate rows to prevent skewed statistical distributions.
  2. Structural Fixes: Standardized categorical country values (e.g., mapping “USA” to “United States”).
  3. Type Conversion: Stripped special characters ($, USD, spaces) from petrol_usd_liter and cast it to numeric.
  4. Outlier Mitigation: Filtered out unrealistic diesel prices (>50 or <0) and set them to NA.
  5. Missing Value Imputation: Imputed missing values using the grouped country mean, preserving localized geographic price variations.

Part 4: Verification & Export

Post-Cleaning Quality Assurance We run the data profiling checks again to verify that our dataset is now clean.

# 1. Check for remaining Missing Values
colSums(is.na(cleaned_data))
##             date          country           region     income_level    subsidy_level 
##                0                0                0                0                0 
## petrol_usd_liter diesel_usd_liter    lpg_usd_liter  brent_crude_usd   tax_percentage 
##                0                0                0                0                0
# 2. Check for remaining Duplicates
sum(duplicated(cleaned_data))
## [1] 0
# 3. Verify final data types
class(cleaned_data$petrol_usd_liter)
## [1] "numeric"
# 4. Verify distribution ranges without massive outliers
summary(cleaned_data$diesel_usd_liter)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.010   0.860   1.961   2.135   3.235   6.240

Save Final Dataset

The final, sparkling clean data is saved for analysis and downstream reporting.

write_csv(cleaned_data, "fuel_prices_cleaned_final.csv")
print("Cleaning complete! Final data successfully exported.")

Appendix: Data Dictionary

The clean dataset consists of 10 variables outlined below:

# Variable Name Data Type Description
1 date Date The recorded date of fuel prices.
2 country Character Name of the country (standardized).
3 region Character Geographic region (e.g., North America, Europe).
4 income_level Character Economic classification of the country.
5 subsidy_level Character Level of government fuel subsidies.
6 petrol_usd_liter Numeric Price of gasoline per liter in USD.
7 diesel_usd_liter Numeric Price of diesel per liter in USD.
8 lpg_usd_liter Numeric Price of LPG per liter in USD.
9 brent_crude_usd Numeric The global benchmark price for crude oil.
10 tax_percentage Numeric The percentage of fuel price consisting of tax.

Exploratory Data Analysis

We will explore the data to uncover relationships and patterns that can inform our modeling approach. This includes visualizations to understand the distribution of fuel prices, the relationship between global crude prices and local retail prices, the impact of each countries every quarter and how different countries and regions are affected by price changes throughout the quarters.

Initial Setup

These libraries will be used for data manipulation, visualization, and analysis throughout the EDA process. We will use ggplot2 for creating visualizations, dplyr for data manipulation, lubridate for handling date operations, zoo for time series analysis, and corrplot for visualizing correlations between numeric features.

clean_data$year_quarter <- as.yearqtr(clean_data$date)
head(clean_data[, c("date", "year_quarter")])
##         date year_quarter
## 1 2020-01-06      2020 Q1
## 2 2020-01-13      2020 Q1
## 3 2020-01-20      2020 Q1
## 4 2020-01-27      2020 Q1
## 5 2020-02-03      2020 Q1
## 6 2020-02-10      2020 Q1
class(clean_data$year_quarter) # yearqtr class
## [1] "yearqtr"

This code extracts the year and quarter from the date column and creates a new column called year_quarter. This will allow us to analyze the data on a quarterly basis, which is important for our classification question. The as.yearqtr function from the zoo package is used to convert the date into a year-quarter format.

Regional Analysis

Country Variability

This code calculates the mean, standard deviation, minimum, and maximum petrol price for each country by grouping the data by country and then summarizing it. The countries are then arranged in descending order based on the standard deviation of petrol prices to identify which countries have the most variability in their petrol prices. The top 10 most volatile countries are extracted for further analysis.

The variability analysis showed that several countries experienced significantly higher fluctuations in petrol prices compared to others. Countries with higher standard deviation values indicate less stable fuel pricing trends over time, potentially due to weaker subsidy controls, economic instability, or stronger sensitivity to global crude oil price movements. In contrast, countries with lower variability demonstrated more stable fuel prices, which may suggest stronger regulatory or subsidy mechanisms.

Relationship Analysis

Petrol Price Distribution by Income Level

Looking at boxplot, we can see that the the distribution for each level overlaps with another. Both low and middle income level have a distribution between 0 and 4 USD/L, though low income level has a low median than middle income level. low income level is shown to have a normal distribution, while middle income is positively skewed. On the other hand, high income level has a huge distribution, between approximately 0 and 7 USD/L. It is skewed to the upper percentile, and some outliers can be seen in the lower percentile.

The graph suggests that income level does play a role in the petrol prices, though a high income level is not always an indicator of a high petrol price. An example would be the UAE which has a high income level but low petrol price.

Petrol Price Distribution by Subsidy Level

The boxplot shows a clear inverse relationship between subsidy level and petrol price. It shows a decrease in the petrol price distribution from low to very high with each level tightening the distribution. Some outliers are present in the medium and high, though no outliers are on low despite the wide distribution - suggesting that the distribution is spread apart on the 25% and 100% quartile.

Correlation Matrix of Numeric Features

As expected, petrol, diesel, and LPG prices show extremely strong positive correlations with one another, with correlation coefficients close to 0.99. This indicates that the prices of different fuel types generally move together over time.

In contrast, Brent crude oil prices show only weak positive correlations with the fuel prices, with coefficients around 0.25 to 0.26. This suggests that while crude oil prices may influence fuel prices, they are not the sole determinant of retail fuel pricing. Other factors such as subsidies, taxation policies, and regional economic conditions may also contribute significantly.

Tax percentage shows almost no correlation with Brent crude oil prices but demonstrates moderate positive correlations of approximately 0.52 with fuel prices. This suggests that countries with higher fuel taxation tend to have higher retail fuel prices.

Quarterly Change Analysis

Quarter to Quarter Change in Petrol Price

quarterly_country <- clean_data %>%
  group_by(country, year_quarter) %>%
  summarise(
    mean_petrol = mean(petrol_usd_liter, na.rm = TRUE),
    .groups = "drop"
  )

quarterly_country <- quarterly_country %>%
  group_by(country) %>%
  arrange(year_quarter) %>%
  mutate(
    pct_change = (
      (mean_petrol - lag(mean_petrol)) /
        lag(mean_petrol)
    ) * 100
  )

This code calculates the average petrol price for each country by quarter and then computes the percentage change in petrol price from one quarter to the next for each country. The lag function is used to access the previous quarter’s mean petrol price for the percentage change calculation.

The distribution of quarterly petrol price changes is concentrated mainly between -10% and 10%, with the highest frequency occurring in the positive 0% to 10% range. This suggests that most quarterly petrol price movements are relatively moderate, with price increases occurring more frequently than decreases.

Although extreme changes are less common, several larger positive changes above 30% are present. Negative changes also appear frequently, particularly within the -20% to -10% range, indicating that substantial price drops occurred in certain periods. Overall, the distribution suggests that petrol prices generally experience moderate quarterly fluctuations with occasional periods of high volatility.

Impact Level Analysis by Quarter

This code categorizes the percentage change in petrol price into three impact levels: “Low” for changes less than 2%, “Medium” for changes between 2% and 5%, and “High” for changes greater than or equal to 5%. The case_when function is used to create the impact_level variable based on the absolute value of the percentage change. The resulting impact_level variable is then converted into an ordered factor with specified levels.

The impact level distribution shows that most observations fall within the high impact category, followed by medium and low impact levels. This is consistent with the previous histogram analysis, where the majority of quarterly petrol price changes were concentrated around the -10% to 10% range. High impact observations occur less frequently, indicating that extreme quarterly fuel price fluctuations are relatively common.

Summary

  • Fuel prices generally followed the trend of Brent crude oil prices over time, with petrol consistently having the highest average price.
  • Europe and Oceania recorded the highest average petrol prices, while the Middle East generally showed lower prices.
  • Country-level analysis showed that some countries experienced significantly higher fuel price volatility than others.
  • Petrol, diesel, and LPG prices showed extremely strong positive correlations with one another.
  • Brent crude oil prices showed only weak positive correlations with retail fuel prices, suggesting that additional factors influence fuel pricing.
  • Tax percentage showed moderate positive correlation with fuel prices.
  • Higher subsidy levels were generally associated with lower and more stable petrol prices.
  • Most quarterly petrol price changes were concentrated within the -10% to 10% range.
  • The majority of observations were classified as high impact, while low impact fuel price changes occurred less frequently.
  • The findings suggest that fuel prices are influenced by crude oil trends, taxation, subsidies, and regional economic conditions.

Data Preparation

Introduction to Data Preparation

This report documents the end-to-end data preparation workflow required to clean, transform, and engineer features from our raw fuel market data. The final output generates two highly specialized datasets: a monthly time-series file structured for regression modeling, and a quarterly file tailored for classification tasks.

Environment Setup and Data Ingestion

To build a reproducible data pipeline, we initialize our environment by loading the tidyverse library for core data manipulation and reshaping, alongside lubridate for specialized date-time parsing operations.

library(tidyverse) # Core data science packages (dplyr, tidyr, stringr, readr)
library(lubridate) # Specialized package for date conversion
library(readr)

df <- read_csv("data_clean/fuel_prices_cleaned_final.csv")

Data Preprocessing

The raw input dataset tracks retail asset classes across multiple adjacent wide columns. To make this data compatible with tidy data principles, we first parse the text dates into structured date values ​​and then collapse the multiple fuel columns into a single categorical key-value pair.

# 1. Date Conversion & Feature Extraction
df_clean <- df |>
  mutate(
    # Convert string to Date format. 'dmy' parses Day/Month/Year correctly.
    date = as.Date(date),

    # Extract temporal features for downstream grouping
    year = year(date),
    month = month(date),
    quarter = quarter(date)
  )

# 2. Data Pivoting
df_long <- df_clean |>
  pivot_longer(
    cols = c(petrol_usd_liter, diesel_usd_liter, lpg_usd_liter),
    names_to = "fuel_type", # New column holding the categorical fuel names
    values_to = "retail_price_usd" # New column holding the numerical prices
  )

Text Normalization and Formatting

Reshaping variables automatically preserves technical column tags (such as _usd_liter). We apply regular expression string adjustments to strip out technical labels and normalize the text into title case format to ensure clean reporting aesthetics.

df_long <- df_long %>%
  mutate(
    # Remove the redundant "_usd_liter" text from the fuel types
    fuel_type = str_replace(fuel_type, "_usd_liter", ""),
    fuel_type = str_to_title(fuel_type)
  )

Feature Discretization and Binning

To support multi-tier behavioral analysis, continuous numbers are grouped into distinct market classes. Retail rates and tax metrics are split into equal statistical thirds (tertiles) driven by data distributions, while crude oil indicators are segmented using deterministic macro-economic price blocks.

df_long <- df_long %>%
  mutate(
    # A. Retail Price Categories (Using equal statistical thirds)
    price_category = cut(
      retail_price_usd,
      breaks = quantile(retail_price_usd, probs = c(0, 1 / 3, 2 / 3, 1), na.rm = TRUE),
      labels = c("Low Retail", "Medium Retail", "High Retail"),
      include.lowest = TRUE
    ),

    # B. Tax Categories (Using equal statistical thirds)
    tax_category = cut(
      tax_percentage,
      breaks = quantile(tax_percentage, probs = c(0, 1 / 3, 2 / 3, 1), na.rm = TRUE),
      labels = c("Low Tax", "Medium Tax", "High Tax"),
      include.lowest = TRUE
    ),

    # C. Crude Oil Categories (Using specific rule-based thresholds)
    crude_category = case_when(
      brent_crude_usd < 50 ~ "Cheap Crude",
      brent_crude_usd >= 50 & brent_crude_usd <= 80 ~ "Normal Crude",
      brent_crude_usd > 80 ~ "Expensive Crude",
      TRUE ~ "Unknown"
    )
  )

Aggregation & Predictive Feature Engineering

With foundational adjustments finalized, we construct our two final target analytical arrays.

Dataset 1: Monthly Regression Pipeline

This pipeline groups observations into monthly country footprints, preserving structural local parameters like tax classifications. We also extract a sequential lag index to evaluate historical, trailing market dependencies.

q1_monthly_data <- df_long %>%
  # Included tax_category so it doesn't get deleted
  group_by(country, region, income_level, tax_category, year, month, fuel_type) %>%
  summarize(
    avg_retail_price = mean(retail_price_usd, na.rm = TRUE),
    avg_crude_price = mean(brent_crude_usd, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  # Create Historical Behavior (Lagged Variable)
  arrange(country, fuel_type, year, month) %>% # Sort chronologically first
  group_by(country, fuel_type) %>%
  mutate(
    # This grabs the crude price from 1 month ago
    historical_crude_price = lag(avg_crude_price, n = 1)
  ) %>%
  ungroup() %>%
  drop_na(historical_crude_price) # Drops the very first month which has no history

print("--- Q1 Monthly Data Ready ---")
## [1] "--- Q1 Monthly Data Ready ---"
head(q1_monthly_data)
## # A tibble: 6 x 10
##   country region income_level tax_category  year month fuel_type avg_retail_price
##   <chr>   <chr>  <chr>        <fct>        <dbl> <dbl> <chr>                <dbl>
## 1 Algeria Africa Middle       Medium Tax    2020     1 Diesel              0.045 
## 2 Algeria Africa Middle       Low Tax       2020     2 Diesel              0.0695
## 3 Algeria Africa Middle       Medium Tax    2020     2 Diesel              0.058 
## 4 Algeria Africa Middle       Low Tax       2020     3 Diesel              0.054 
## 5 Algeria Africa Middle       Medium Tax    2020     3 Diesel              0.0727
## 6 Algeria Africa Middle       Low Tax       2020     4 Diesel              0.074 
## # i 2 more variables: avg_crude_price <dbl>, historical_crude_price <dbl>

Dataset 2: Quarterly Classification Pipeline

This sequence maps records onto a wider quarterly footprint. It handles the mathematical computation of quarter-over-quarter growth rates for local retail structures and global crude oil benchmarks, using these rates to establish a hard boundary for our classification target impact_level.

q2_quarterly_data <- df_long %>%
  group_by(country, region, income_level, tax_category, year, quarter, fuel_type) %>%
  summarize(
    avg_retail_price = mean(retail_price_usd, na.rm = TRUE),
    avg_crude_price = mean(brent_crude_usd, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  # Calculate Percentage Changes and "Impact Level" Class
  arrange(country, fuel_type, year, quarter) %>% # Sort chronologically first
  group_by(country, fuel_type) %>%
  mutate(
    # Calculate Quarter-over-Quarter % Change
    retail_pct_change = (avg_retail_price - lag(avg_retail_price)) / lag(avg_retail_price) * 100,
    crude_pct_change = (avg_crude_price - lag(avg_crude_price)) / lag(avg_crude_price) * 100
  ) %>%
  ungroup() %>%
  mutate(
    # Create the Target Variable for Classification based on the price change
    impact_level = case_when(
      abs(retail_pct_change) > 5 ~ "High Impact",
      abs(retail_pct_change) >= 2 ~ "Medium Impact",
      abs(retail_pct_change) < 2 ~ "Low Impact",
      TRUE ~ NA_character_
    )
  ) %>%
  drop_na(impact_level)

print("--- Q2 Quarterly Data Ready ---")
## [1] "--- Q2 Quarterly Data Ready ---"
head(q2_quarterly_data)
## # A tibble: 6 x 12
##   country region income_level tax_category  year quarter fuel_type avg_retail_price
##   <chr>   <chr>  <chr>        <fct>        <dbl>   <int> <chr>                <dbl>
## 1 Algeria Africa Middle       Medium Tax    2020       1 Diesel              0.0632
## 2 Algeria Africa Middle       Low Tax       2020       2 Diesel              0.0623
## 3 Algeria Africa Middle       Medium Tax    2020       2 Diesel              0.0675
## 4 Algeria Africa Middle       Low Tax       2020       3 Diesel              0.0804
## 5 Algeria Africa Middle       Medium Tax    2020       3 Diesel              0.0865
## 6 Algeria Africa Middle       Low Tax       2020       4 Diesel              0.0989
## # i 4 more variables: avg_crude_price <dbl>, retail_pct_change <dbl>,
## #   crude_pct_change <dbl>, impact_level <chr>

End Results

Q1 Monthly Regression Dataset

country region income_level tax_category year month fuel_type avg_retail_price avg_crude_price historical_crude_price
Algeria Africa Middle Medium Tax 2020 1 Diesel 0.0450000 66.510 66.69333
Algeria Africa Middle Low Tax 2020 2 Diesel 0.0695000 71.035 66.51000
Algeria Africa Middle Medium Tax 2020 2 Diesel 0.0580000 68.265 71.03500
Algeria Africa Middle Low Tax 2020 3 Diesel 0.0540000 60.395 68.26500
Algeria Africa Middle Medium Tax 2020 3 Diesel 0.0726667 64.640 60.39500

The dataset features nine independent variables: - country - region - income_level - tax_category - year - month - fuel_type - avg_crude_price - historical_crude_price

The target variable is avg_retail_price.

Q2 Quarterly Classification Dataset

country region income_level tax_category year quarter fuel_type avg_retail_price avg_crude_price retail_pct_change crude_pct_change impact_level
Algeria Africa Middle Medium Tax 2020 1 Diesel 0.0631667 66.16000 -7.496513 0.0388819 High Impact
Algeria Africa Middle Low Tax 2020 2 Diesel 0.0622727 53.55273 -1.415207 -19.0557327 Low Impact
Algeria Africa Middle Medium Tax 2020 2 Diesel 0.0675000 53.95500 8.394161 0.7511713 High Impact
Algeria Africa Middle Low Tax 2020 3 Diesel 0.0804286 67.91714 19.153439 25.8773846 High Impact
Algeria Africa Middle Medium Tax 2020 3 Diesel 0.0865000 71.66667 7.548845 5.5207325 High Impact

The dataset features 11 independent variables: - country - region - income_level - tax_category - year - quarter - fuel_type - avg_retail_price - avg_crude_price - retail_pct_change - crude_pct_change

The target variable is impact_level.

Downstream Asset Exporting

To ensure data continuity and freeze state baselines for independent downstream model validation, both structures are written out to dedicated local storage nodes.

# Extracting datasets
write_csv(q1_monthly_data, "data_processed/Q1_Monthly_Regression_Data.csv")
write_csv(q2_quarterly_data, "data_processed/Q2_Quarterly_Classification_Data.csv")

print("Success! Both CSV files have been extracted and saved to your folder.")

Regression Modeling

Executive Summary

In this section of our analytics project, we developed an advanced machine learning pipeline to forecast global retail fuel prices. To capture complex, non-linear market behaviours such as regional policy variations and the lag effects of global oil shocks, we executed a comparative tournament between three different methodologies: Linear Regression (Elastic Net), Random Forest, and Extreme Gradient Boosting (XGBoost).

Import Libraries & Data

In this section, we load the required packages from the tidymodels framework alongside specialized libraries for model explanation (vip) and parallel computation processing engines.

library(readr)
library(tidyverse)
library(tidymodels)
library(vip) # For feature importance plots

# Load the regression dataset
df <- read_csv("data_processed/Q1_Monthly_Regression_Data.csv")

Data Preparation & Chronological Sorting

Because this dataset tracks multiple countries across time (panel/time-series structure), random slicing would result in data leakage. We establish an explicit chronological timeline sorting by dates, and encode ordinal classifications scale adjustments.

# Create a proper Date column and sort chronologically
df <- df %>%
  mutate(date = as.Date(paste(year, month, "01", sep = "-"))) %>%
  arrange(date)

# Define levels for ordinal categories
tax_levels <- c("Low Tax", "Medium Tax", "High Tax")
income_levels <- c("Low", "Middle", "High")

# Encode categorical data types
df$country <- as.factor(df$country)
df$region <- as.factor(df$region)
df$fuel_type <- as.factor(df$fuel_type)

# Convert ordered factors to numeric scales so ML algorithms can read their progression scale
df$tax_category <- as.numeric(factor(df$tax_category, levels = tax_levels, ordered = TRUE))
df$income_level <- as.numeric(factor(df$income_level, levels = income_levels, ordered = TRUE))

Time-Based Data Splitting & Resampling

We split the data sequentially: the oldest 80% is used to build the predictive engine, while the newest 20% remains untouched to simulate real-world forward-testing. Cross-validation uses a rolling-origin time framework to map sliding windows.

set.seed(123)
# Chronological Split: Oldest 80% for training, newest 20% for testing
data_split <- initial_time_split(df, prop = 0.80)
train_data <- training(data_split)
test_data <- testing(data_split)

# Calculate how many rows equal exactly "1 Month" of data across all countries
rows_per_month <- train_data %>%
  count(year, month) %>%
  summarise(avg_rows = mean(n)) %>%
  pull(avg_rows) %>%
  round()

# Set up the rolling origin cross-validation based on rows
# e.g., 24 months lookback window, 6 months evaluation window, sliding 3 months forward
folds <- rolling_origin(
  train_data,
  initial    = rows_per_month * 24,
  assess     = rows_per_month * 6,
  skip       = rows_per_month * 3,
  cumulative = TRUE
)

Recipe & Model Specifications

We define a workflow preprocessing standard to handle new factor variations seamlessly via dummy variable steps. We set up an engine testing matrix across Elastic Net Linear Regressions, XGBoost, and Random Forest specifications.

base_recipe <- recipe(avg_retail_price ~ ., data = train_data) %>%
  # Assign time/date trackers an "ID" role so they aren't used as direct linear predictors
  update_role(year, month, date, new_role = "ID") %>%
  step_novel(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal_predictors())

# Linear Model Spec
linear_spec <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("regression")

# XGBoost Spec
xgb_spec <- boost_tree(trees = tune(), tree_depth = tune(), learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# Random Forest Spec
rf_spec <- rand_forest(trees = tune(), min_n = tune()) %>% # Removed mtry = tune()
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

# Combine into a Workflow Set
model_workflows <- workflow_set(
  preproc = list(base_rec = base_recipe),
  models = list(linear = linear_spec, xgb = xgb_spec, rf = rf_spec),
  cross = TRUE
)

Hyperparameter Tuning

We run grid search matrices across all 3 structural workflow designs.

set.seed(123)

library(doParallel)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

parallel_control <- control_grid(
  save_pred = TRUE,
  parallel_over = "everything",
  verbose = FALSE,
)

tune_results <- model_workflows %>%
  workflow_map(
    fn = "tune_grid",
    seed = 123,
    resamples = folds,
    grid = 10,
    metrics = metric_set(rmse, mae, rsq, mape, huber_loss),
    control = parallel_control
  )

stopCluster(cl)

Resampling Validation Summary Plot

This chart details the cross-validation score summaries across each evaluated grid option.

autoplot(tune_results, metric = "rmse")

Automated Winning Model Extraction

The system reads across the finalised cross-validation profiles, filters down to select the architecture minimising RMSE, and assigns its hyperparameter choices onto the holdout performance verification test set.

# Programmatically find the absolute best model ID based on the lowest validation RMSE
best_model_id <- tune_results %>%
  rank_results(rank_metric = "rmse", select_best = TRUE) %>%
  filter(rank == 1) %>%
  slice(1) %>%
  pull(wflow_id)

# Extract best tuning results and parameters for that winning model
winning_tune_res <- tune_results %>% extract_workflow_set_result(best_model_id)
best_parameters <- winning_tune_res %>% select_best(metric = "rmse")

# Finalize the workflow with the winner's best hyperparameters
final_wf <- tune_results %>%
  extract_workflow(best_model_id) %>%
  finalize_workflow(best_parameters)

# Fit to the complete Training data and evaluate on the unseen holdout Test set
final_fit <- last_fit(
  final_wf,
  data_split,
  metrics = metric_set(rmse, mae, rsq, mape, huber_loss)
)

Losing Model Extraction

Linear Regression

# Extract best tuning results and parameters for that winning model
linear_tune_res <- tune_results %>% extract_workflow_set_result("base_rec_linear")
linear_best_parameters <- linear_tune_res %>% select_best(metric = "rmse")

# Finalize the workflow with the winner's best hyperparameters
linear_final_wf <- tune_results %>%
  extract_workflow("base_rec_linear") %>%
  finalize_workflow(linear_best_parameters)

# Fit to the complete Training data and evaluate on the unseen holdout Test set
linear_final_fit <- last_fit(
  linear_final_wf,
  data_split,
  metrics = metric_set(rmse, mae, rsq, mape, huber_loss)
)
collect_metrics(linear_final_fit)
## # A tibble: 5 x 4
##   .metric    .estimator .estimate .config        
##   <chr>      <chr>          <dbl> <chr>          
## 1 rmse       standard      0.271  pre0_mod0_post0
## 2 mae        standard      0.199  pre0_mod0_post0
## 3 rsq        standard      0.984  pre0_mod0_post0
## 4 mape       standard     68.4    pre0_mod0_post0
## 5 huber_loss standard      0.0367 pre0_mod0_post0

Random Forest Regression

# Extract best tuning results and parameters for that winning model
rf_tune_res <- tune_results %>% extract_workflow_set_result("base_rec_rf")
rf_best_parameters <- rf_tune_res %>% select_best(metric = "rmse")

# Finalize the workflow with the winner's best hyperparameters
rf_final_wf <- tune_results %>%
  extract_workflow("base_rec_rf") %>%
  finalize_workflow(rf_best_parameters)

# Fit to the complete Training data and evaluate on the unseen holdout Test set
rf_final_fit <- last_fit(
  rf_final_wf,
  data_split,
  metrics = metric_set(rmse, mae, rsq, mape, huber_loss)
)
collect_metrics(rf_final_fit)
## # A tibble: 5 x 4
##   .metric    .estimator .estimate .config        
##   <chr>      <chr>          <dbl> <chr>          
## 1 rmse       standard      0.427  pre0_mod0_post0
## 2 mae        standard      0.323  pre0_mod0_post0
## 3 rsq        standard      0.951  pre0_mod0_post0
## 4 mape       standard     75.3    pre0_mod0_post0
## 5 huber_loss standard      0.0890 pre0_mod0_post0

Best Model Performance Summary

The algorithm has dynamically selected base_rec_xgb as the winning model structure. Below are its optimal tuned parameters:

knitr::kable(best_parameters, caption = "Winning Model Hyperparameter Values")
Winning Model Hyperparameter Values
trees tree_depth learn_rate .config
2000 10 0.0464159 pre0_mod10_post0

Unseen Holdout Test Performance Metrics

The table below presents the model’s accuracy scores evaluated on the holdout test phase across all 5 requested evaluation metrics:

final_metrics <- collect_metrics(final_fit)
knitr::kable(final_metrics, caption = "Holdout Test Set Metric Diagnostics")
Holdout Test Set Metric Diagnostics
.metric .estimator .estimate .config
rmse standard 0.0676905 pre0_mod0_post0
mae standard 0.0298363 pre0_mod0_post0
rsq standard 0.9979924 pre0_mod0_post0
mape standard 5.3420236 pre0_mod0_post0
huber_loss standard 0.0022111 pre0_mod0_post0

Output Generation

Scatterplot: Actual vs. Predicted Prices

This chart verifies goodness-of-fit. Points centering tightly around the 45-degree diagonal path illustrate a robust baseline alignment.

predictions_df <- collect_predictions(final_fit)

scatterplot <- ggplot(predictions_df, aes(x = avg_retail_price, y = .pred)) +
  geom_point(alpha = 0.4, color = "midnightblue") +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed", linewidth = 1) +
  theme_minimal(base_size = 12) +
  labs(
    title = "Actual vs. Predicted Retail Prices",
    subtitle = paste("Winning Model Architecture:", best_model_id),
    x = "Actual Retail Price",
    y = "Predicted Retail Price"
  )
print(scatterplot)

Residuals Analysis Plot

Residual points scattered randomly around the center line indicate that the predictive errors remain unbiased across the scale.

predictions_df <- predictions_df %>% mutate(residual = avg_retail_price - .pred)

residuals_plot <- ggplot(predictions_df, aes(x = .pred, y = residual)) +
  geom_point(alpha = 0.4, color = "darkred") +
  geom_hline(yintercept = 0, color = "black", linetype = "dashed", linewidth = 1) +
  theme_minimal(base_size = 12) +
  labs(
    title = "Model Residuals Plot",
    subtitle = "Points should be randomly scattered around the 0 line",
    x = "Predicted Price (Fitted Values)",
    y = "Residual (Actual - Predicted)"
  )
print(residuals_plot)

Relative Feature Importance Plot

This chart highlights which specific macroeconomic variations, regional variables, or tax policies dominate the winning model’s internal prediction splits.

final_model_fit <- extract_fit_parsnip(final_fit)

vip_plot <- vip(final_model_fit, num_features = 10, geom = "point") +
  theme_minimal(base_size = 12) +
  labs(
    title = "Top 10 Most Important Predictors",
    subtitle = paste("Extracted dynamically from:", best_model_id),
    x = "Predictors/Features",
    y = "Relative Importance Weight"
  )
print(vip_plot)

Save Models and Test Data Snapshot

# Save the final fitted models and test data snapshot for later evaluation
saveRDS(linear_final_fit, "models/regression/final_linear_model.rds")
saveRDS(final_fit, "models/regression/final_xgboost_model.rds")
saveRDS(rf_final_fit, "models/regression/final_rf_model.rds")
saveRDS(test_data, "models/regression/test_data_snapshot.rds")

Regression Result

1. Environment Setup

Load the required packages and create the output folders.

library(tidyverse)
library(tidymodels)
library(kableExtra)
library(yardstick)

# Strict pathing constraints to neutralize working directory mismatches
dir_model_in <- "models/regression/"
dir_eval_out <- "models/evaluation_regression/"

if (!dir.exists(dir_eval_out)) {
  dir.create(dir_eval_out, recursive = TRUE)
}

2. Load Trained Models and Test Data

Import the saved models and the exact test dataset snapshot.

# Path-safe validation check before deserializing objects
paths <- c(
  paste0(dir_model_in, "final_linear_model.rds"),
  paste0(dir_model_in, "final_xgboost_model.rds"),
  paste0(dir_model_in, "final_rf_model.rds"),
  paste0(dir_model_in, "test_data_snapshot.rds")
)

for (p in paths) {
  if (!file.exists(p)) stop(paste("Critical file missing at destination path:", p))
}

final_fit_linear <- readRDS(paste0(dir_model_in, "final_linear_model.rds"))
final_fit_xgb <- readRDS(paste0(dir_model_in, "final_xgboost_model.rds"))
final_fit_rf <- readRDS(paste0(dir_model_in, "final_rf_model.rds"))
test_data_cleaned <- readRDS(paste0(dir_model_in, "test_data_snapshot.rds"))

3. Generate Predictions

Use augment() to calculate predictions and residuals for all models.

scored_data_long <- bind_rows(
  augment(extract_workflow(final_fit_linear), new_data = test_data_cleaned) %>% mutate(Model = "Elastic Net Linear"),
  augment(extract_workflow(final_fit_xgb), new_data = test_data_cleaned) %>% mutate(Model = "XGBoost"),
  augment(extract_workflow(final_fit_rf), new_data = test_data_cleaned) %>% mutate(Model = "Random Forest")
) %>%
  mutate(Residual = avg_retail_price - .pred)

4. Model Performance Comparison

Calculate standard error metrics to find the best performing model.

# Instantiate metric set tracking continuous error distances
regression_metrics <- metric_set(rmse, mae, rsq, mape, huber_loss)

comparison_df <- scored_data_long %>%
  group_by(Model) %>%
  regression_metrics(truth = avg_retail_price, estimate = .pred) %>%
  select(Model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  inner_join(
    scored_data_long %>%
      group_by(Model) %>%
      summarise(Max_Error = max(abs(Residual), na.rm = TRUE), .groups = "drop"),
    by = "Model"
  )

# Output styled interactive summary table
comparison_df %>%
  kbl(caption = "Predictive Accuracy Comparison Matrix", booktabs = TRUE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(which.min(comparison_df$rmse), bold = TRUE, background = "#e8f8f5")
Predictive Accuracy Comparison Matrix
Model rmse mae rsq mape huber_loss Max_Error
Elastic Net Linear 0.2708419 0.1990199 0.9841798 68.425086 0.0366700 1.207740
Random Forest 0.4273334 0.3226139 0.9510472 75.282361 0.0890235 2.350927
XGBoost 0.0676905 0.0298363 0.9979924 5.342024 0.0022111 1.800150

5. Prediction Speed Benchmark

Measure how fast each model runs predictions on the test set.

speed_tracking <- map_df(
  list("Elastic Net Linear" = final_fit_linear, "XGBoost" = final_fit_xgb, "Random Forest" = final_fit_rf),
  function(wf) {
    start_time <- Sys.time()

    # Extract the underlying trained workflow before running predict()
    invisible(predict(extract_workflow(wf), new_data = test_data_cleaned))

    duration <- as.numeric(difftime(Sys.time(), start_time, units = "secs"))
    return(data.frame(Execution_Time_Secs = round(duration, 5)))
  },
  .id = "Model"
)

speed_tracking %>%
  kbl(caption = "Operational Latency & Speed Benchmark Profiles") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Operational Latency & Speed Benchmark Profiles
Model Execution_Time_Secs
Elastic Net Linear 0.11637
XGBoost 0.33783
Random Forest 0.50267

6. Performance by Subgroup

Break down error metrics by fuel type and country to check for specific model biases.

# Product Categorization Group Breakdown
fuel_diagnostic_matrix <- scored_data_long %>%
  group_by(Model, fuel_type) %>%
  regression_metrics(truth = avg_retail_price, estimate = .pred) %>%
  filter(.metric %in% c("rmse", "mae")) %>%
  select(Model, fuel_type, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)

# Geographical Regional Breakdown
regional_diagnostic_matrix <- scored_data_long %>%
  group_by(Model, country) %>%
  regression_metrics(truth = avg_retail_price, estimate = .pred) %>%
  filter(.metric %in% c("rmse", "mae")) %>%
  select(Model, country, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)

fuel_diagnostic_matrix %>%
  kbl(caption = "Subgroup Error Distributions by Fuel Type Classification") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Subgroup Error Distributions by Fuel Type Classification
Model fuel_type rmse mae
Elastic Net Linear Diesel 0.2872869 0.2373000
Elastic Net Linear Lpg 0.0793382 0.0601667
Elastic Net Linear Petrol 0.3622725 0.2996082
Random Forest Diesel 0.4407890 0.3453756
Random Forest Lpg 0.2878705 0.2124926
Random Forest Petrol 0.5202707 0.4099826
XGBoost Diesel 0.0710461 0.0323077
XGBoost Lpg 0.0361329 0.0176621
XGBoost Petrol 0.0859829 0.0395400

7. Visual Dashboards

Diagram A: Error Comparison

Predictive Error Profile Layouts explicitly quantify how effectively each algorithm minimizes price prediction errors across MAE, RMSE, and R-squared metrics. XGBoost decisively dominates the comparison, dropping prediction errors to a mere fraction of the other models with a remarkably low MAE and RMSE, drastically outperforming Random Forest’s higher error rates. This massive reduction in error is paired with a near-perfect R-squared value of 0.998, confirming that XGBoost captures nearly 100% of the price variance and establishes itself as the most accurate model by every metric.

Diagram B: Actual vs. Predicted Values

The Goodness-of-Fit plots reveal that while Elastic Net and Random Forest systematically underpredict higher values and visibly bow below the 45-degree line once actual prices exceed 4, XGBoost demonstrates near-perfect calibration. Its predictions tightly hug the diagonal parity line across the entire pricing spectrum, confirming its ability to accurately capture fluctuations and peaks with minimal deviation.

Diagram C: Speed vs. Accuracy Trade-Off

The operational efficiency frontier evaluates model suitability for deployment by analyzing the multi-objective trade-off between predictive accuracy (RMSE) and computational latency. While the Elastic Net Linear model minimizes inference latency at the expense of precision, the Random Forest architecture proves suboptimal, exhibiting both elevated error rates and high computational overhead. Consequently, XGBoost establishes the optimal Pareto-efficient compromise, yielding the lowest overall error rate while maintaining a highly acceptable inference speed of approximately one second.


8. Export Results

Save all generated data frames to the evaluation folder for reporting.

saveRDS(comparison_df, paste0(dir_eval_out, "metrics_comparison.rds"))
saveRDS(speed_tracking, paste0(dir_eval_out, "operational_speed.rds"))
saveRDS(scored_data_long, paste0(dir_eval_out, "error_profiles_snapshot.rds"))
saveRDS(fuel_diagnostic_matrix, paste0(dir_eval_out, "fuel_deep_dive_metrics.rds"))
saveRDS(regional_diagnostic_matrix, paste0(dir_eval_out, "regional_deep_dive_metrics.rds"))

Conclusion

The regression modeling phase has conclusively identified XGBoost as the superior predictive engine for forecasting retail fuel prices, achieving a remarkable balance of accuracy and operational efficiency. Its ability to capture complex non-linear relationships and interactions within the data has resulted in significantly lower error rates compared to both Elastic Net Linear Regression and Random Forest models. The comprehensive evaluation across multiple metrics and subgroups further validates XGBoost’s robustness and suitability for deployment in real-world forecasting scenarios.


Classification Modeling

Same as regression modeling, we will be building 3 classification models, and compare their performance to select the best one for our analysis. The 3 models are: 1. Multinomial Logistic Regression 2. Random Forest 3. Extreme Gradient Boosting (XGBoost)

Multinomial Logistic Regression acts as a strong linear baseline, while Random Forest and XGBoost capture non-linear relationships and interactions. We will evaluate model performance using accuracy, precision, recall, F1-score, and AUC-ROC metrics to determine the best classification model for predicting impact_level.

The workflow will be similar to the regression modeling section. The data will be encoded according to the features, and then split into training and testing sets. As the data is time-series in nature, time-based split is used to avoid data leakage. Cross-validation will also be used with a rolling origin approach to evaluate model performance during hyperparameter tuning by utilizing the tidymodels framework.

Tidymodels is a collection of packages for modeling and statistical analysis in R. It provides a consistent interface for model specification, tuning, and evaluation across different algorithms, making it easier to compare the performance of the multinomial logistic regression, random forest, and XGBoost models in a structured manner. It also has recipe which provides a consistent interface for data preprocessing. However, the cross-validation uses the timetk package to create time-based resampling splits, which is more suitable for time-series data. This allows to evaluate the models’ performance in a way that respects the temporal order of the data, and ensuring that the model evaluation is realistic for forecasting tasks.

Data Preparation

This section will cover the data preparation steps for the classification modeling. It will be using the quarterly dataset that we prepared in the data preparation section, which contains the impact_level variable as our target for classification. This step includes splitting the data into training and testing sets, encoding categorical variables, and setting up the cross-validation strategy for time-series data.

Import Libraries & Data

library(tidymodels)
library(tidyverse)
library(rsample)
library(dplyr)
library(tsibble)
library(timetk)

Descriptions of Each Package: - tidymodels: A collection of packages for modeling and statistical analysis in R, providing a consistent interface for model specification, tuning, and evaluation across different algorithms. - tidyverse: A collection of R packages for data manipulation, visualization, and analysis - rsample: Provides functions for creating resamples of a dataset, such as cross-validation splits, which is essential for model evaluation. - dplyr: A package for data manipulation, providing functions for filtering, summarizing, and transforming data. - tsibble: A package for working with time-series data in a tidy format, allowing for easy manipulation and analysis of time-indexed data. - timetk: A package for working with time-series data, providing functions for creating time-based resampling splits.

Encode Data and Create Time-Based Splits

df <- read_csv("data_processed/Q2_Quarterly_Classification_Data.csv")

tax_levels <- c("Low Tax", "Medium Tax", "High Tax")
income_levels <- c("Low", "Middle", "High")
impact_levels <- c("Low Impact", "Medium Impact", "High Impact")

# encode data

# nominal data
df$country <- as.factor(df$country)
df$region <- as.factor(df$region)

# ordinal data
df$tax_category <- factor(
  df$tax_category,
  levels = tax_levels,
  ordered = TRUE
)
df$income_level <- factor(
  df$income_level,
  levels = income_levels,
  ordered = TRUE
)
df$impact_level <- factor(
  df$impact_level,
  levels = impact_levels,
  ordered = TRUE
)

A new feature engineering step is added to create a date column by combining the year and quarter columns. This allows for a proper time index for our time-series data to enable time-based splits and resampling strategies.

df |>
  mutate(time_index = yearquarter(paste(year, "Q", quarter, sep = ""))) |>
  group_by(country, fuel_type, time_index) |>
  summarise(unique_impacts = n_distinct(impact_level), .groups = "drop") |>
  filter(unique_impacts > 1)
## # A tibble: 4,462 x 4
##    country fuel_type time_index unique_impacts
##    <fct>   <chr>          <qtr>          <int>
##  1 Algeria Diesel       2021 Q1              2
##  2 Algeria Diesel       2021 Q4              2
##  3 Algeria Diesel       2022 Q3              2
##  4 Algeria Diesel       2023 Q4              2
##  5 Algeria Diesel       2024 Q3              2
##  6 Algeria Diesel       2024 Q4              2
##  7 Algeria Diesel       2025 Q1              2
##  8 Algeria Diesel       2025 Q4              2
##  9 Algeria Diesel       2026 Q1              2
## 10 Algeria Lpg          2021 Q1              2
## # i 4,452 more rows
df_ts <- df |>
  mutate(time_index = yearquarter(paste(year, "Q", quarter, sep = ""))) |>
  # Adding tax_category to the keys separates the data properly
  as_tsibble(index = time_index, key = c(country, fuel_type, tax_category))

glimpse(df_ts)
## Rows: 15,339
## Columns: 13
## Key: country, fuel_type, tax_category [696]
## $ country           <fct> Algeria, Algeria, Algeria, Algeria, Algeria, Algeria, Algeria,~
## $ region            <fct> Africa, Africa, Africa, Africa, Africa, Africa, Africa, Africa~
## $ income_level      <ord> Middle, Middle, Middle, Middle, Middle, Middle, Middle, Middle~
## $ tax_category      <ord> Low Tax, Low Tax, Low Tax, Low Tax, Low Tax, Low Tax, Low Tax,~
## $ year              <dbl> 2020, 2020, 2020, 2021, 2021, 2021, 2021, 2022, 2022, 2022, 20~
## $ quarter           <dbl> 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2,~
## $ fuel_type         <chr> "Diesel", "Diesel", "Diesel", "Diesel", "Diesel", "Diesel", "D~
## $ avg_retail_price  <dbl> 0.06227273, 0.08042857, 0.09885714, 0.11075000, 0.12280600, 0.~
## $ avg_crude_price   <dbl> 53.55273, 67.91714, 89.52857, 100.03250, 103.88833, 109.70000,~
## $ retail_pct_change <dbl> -15.626473, 19.153439, 14.285714, 27.298851, 16.514234, 17.305~
## $ crude_pct_change  <dbl> -19.05573266, 25.87738459, 24.92358804, 14.97107501, 3.2379343~
## $ impact_level      <ord> High Impact, High Impact, High Impact, High Impact, High Impac~
## $ time_index        <qtr> 2020 Q2, 2020 Q3, 2020 Q4, 2021 Q1, 2021 Q2, 2021 Q3, 2021 Q4,~
range(df_ts$time_index)
## <yearquarter[2]>
## [1] "2020 Q1" "2026 Q2"
## # Year starts on: January

Splitting the Data

# Train: 2020 Q1 to 2024 Q4
train_data_raw <- df_ts |>
  filter(time_index <= yearquarter("2024 Q4"))

# Test: 2025 Q1 to 2026 Q2 (The rest of the data)
test_data_raw <- df_ts |>
  filter(time_index >= yearquarter("2025 Q1"))

# Convert test data to standard tibble and turn yearquarter into a standard Date
test_data <- test_data_raw |>
  as_tibble() |>
  mutate(time_index = as.Date(time_index))

train_data <- train_data_raw |>
  as_tibble() |>
  mutate(time_index = as.Date(time_index)) # Converts "2020 Q1" to "2020-01-01"

Cross-Validation Setup

folds <- time_series_cv(
  data        = train_data,
  date_var    = time_index,
  initial     = "36 months", # Equivalent to 12 quarters
  assess      = "6 months", # Equivalent to 2 quarters
  skip        = "3 months", # Equivalent to 1 quarter
  cumulative  = TRUE
)

A rolling-origin time-series cross-validation was used to evaluate the model without data leakage. The initial training window was set to 36 months (12 quarters) to provide a sufficient baseline for capturing stable seasonal patterns. To ensure the evaluation mirrors the actual forecasting objective, a 6-month (2-quarter) assessment window was selected. The training origin advances using a 3-month skip, which aligns with quarterly reporting cycles and maintains computational efficiency. Finally, a cumulative window ensures the model retains all historical data as it moves forward, accurately replicating a production environment where the model continuously learns from the entire past.

classification_recipe <- recipe(impact_level ~ ., data = train_data) |>
  step_mutate(time_index = as.numeric(time_index)) |>
  step_novel(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors(), -all_outcomes()) |>
  step_zv(all_predictors())

A data preprocessing recipe was constructed to prepare the features for all models. The steps include: 1. The date variable time_index is converted into a numeric format, as the algorithms cannot natively process date objects. 2. To handle potential data drift or unseen categories in future deployment, step_novel() is applied to assign a dedicated category to new factor levels. 3. All nominal predictors are then converted into dummy variables using one-hot encoding. 4. Finally, a zero-variance filter step_zv() removes any constant predictors that could destabilize the training process to ensure a clean and mathematically stable dataset for modeling.

Multinomial Logistic Regression

Objective

Q2 Classification: To build classification model that analyse impact level of global fuel price to countries fuel price changes by quarter.**

The objective of this classification model is to classify the impact level of global fuel price changes on countries' fuel price changes by quarter. The target variable used in this model is impact_level, which consists of three classes: Low Impact, Medium Impact, and High Impact.

Multinomial Logistic Regression was selected because the classification problem has more than two outcome categories. Unlike binary Logistic Regression, which is used for two-class classification problems, Multinomial Logistic Regression can handle multi-class classification problems.

In this model, the predictors used are quarter, country, fuel_type, and crude_pct_change. These variables are used to examine whether quarterly crude oil price movements and country-level fuel information can

Building the Model

Load Required Libraries

library(dplyr)
library(caret)
library(nnet)
library(ggplot2)
library(pROC)

Load training and testing datasets

train_df <- read.csv("data_split/classification/train_df.csv")

test_df <- read.csv("data_split/classification/test_df.csv")

Data Preparation

# Remove extra spaces from the target variable
train_df$impact_level <- trimws(train_df$impact_level)
test_df$impact_level <- trimws(test_df$impact_level)

# Convert target variable into factor with fixed class order
train_df$impact_level <- factor(
  train_df$impact_level,
  levels = c("Low Impact", "Medium Impact", "High Impact")
)

test_df$impact_level <- factor(
  test_df$impact_level,
  levels = c("Low Impact", "Medium Impact", "High Impact")
)

# Convert target variable into factor with fixed class order

train_df$impact_level <- factor(
  train_df$impact_level,
  levels = c("Low Impact", "Medium Impact", "High Impact")
)

test_df$impact_level <- factor(
  test_df$impact_level,
  levels = c("Low Impact", "Medium Impact", "High Impact")
)

# Check class distribution

cat("Training class distribution:n")
## Training class distribution:n
print(table(train_df$impact_level))
## 
##    Low Impact Medium Impact   High Impact 
##          4496          3263          4241
cat("Testing class distribution:n")
## Testing class distribution:n
print(table(test_df$impact_level))
## 
##    Low Impact Medium Impact   High Impact 
##          1770           995           574
# Convert categorical predictors into factors

categorical_cols <- c("quarter", "country", "fuel_type")

for (col in categorical_cols) {
  train_df[[col]] <- factor(train_df[[col]])

  test_df[[col]] <- factor(test_df[[col]],
    levels =
      levels(train_df[[col]])
  )
}

# Scale numerical predictor using training data statistics

crude_mean <- mean(train_df$crude_pct_change, na.rm = TRUE)

crude_sd <- sd(train_df$crude_pct_change, na.rm = TRUE)

train_df$crude_pct_change_scaled <- (train_df$crude_pct_change - crude_mean) / crude_sd

test_df$crude_pct_change_scaled <- (test_df$crude_pct_change - crude_mean) / crude_sd

# Select variables used in the model

model_vars <- c(
  "impact_level",
  "quarter",
  "country",
  "fuel_type",
  "crude_pct_change_scaled"
)

# Remove missing values from selected variables

train_model <- train_df %>%
  select(all_of(model_vars)) %>%
  na.omit()

test_model <- test_df %>%
  select(all_of(model_vars)) %>%
  na.omit()

# Check final number of observations used

cat("Training rows used:", nrow(train_model), "n")
## Training rows used: 12000 n
cat("Testing rows used:", nrow(test_model), "n")
## Testing rows used: 3339 n

The variable impact_level was used as the target variable. The predictors selected were quarter, country, fuel_type, and crude_pct_change_scaled. The numerical variable crude_pct_change was scaled to improve model stability.

Model Training

set.seed(123)

multi_log_model <- multinom(
  impact_level ~ quarter + country + fuel_type + crude_pct_change_scaled,
  data = train_model,
  maxit = 1000,
  trace = FALSE
)

Summary of the Model:

## Call:
## multinom(formula = impact_level ~ quarter + country + fuel_type + 
##     crude_pct_change_scaled, data = train_model, maxit = 1000, 
##     trace = FALSE)
## 
## Coefficients:
##               (Intercept)   quarter2  quarter3   quarter4 countryAngola countryArgentina countryAustralia countryAustria countryBangladesh countryBelgium countryBotswana countryBrazil
## Medium Impact   0.3565812 -0.1010369 0.1903039  0.1612509    -0.0932418       -0.3748889        -1.048654     -0.3819582        -0.7197958     -0.5515683      -0.6622423    -0.9637362
## High Impact     1.7747465  0.2996437 0.2145259 -0.1776538    -1.5913488       -1.3685734        -2.268015     -2.2139965        -1.9917275     -2.2020716      -2.2054239    -1.9322170
##               countryCameroon countryCanada countryChile countryChina countryColombia countryCzech Republic countryDenmark countryEcuador countryEgypt countryEthiopia countryFinland
## Medium Impact      -0.7374166    -0.5703508   -0.9686056   -0.7463005      -0.8014645            -0.7003902     -0.6400727     -0.4685899    0.2310229       -1.088267      -0.633162
## High Impact        -1.7653029    -2.1944302   -2.5536785   -1.9521570      -1.9880609            -2.3232003     -1.8174039     -1.6420782   -0.3978379       -2.092936      -1.756164
##               countryFrance countryGermany countryGhana countryGreece countryHong Kong countryHungary countryIndia countryIndonesia countryIran countryIraq countryIreland countryIsrael
## Medium Impact    -0.9176191     -0.9317898   -0.7265697     -1.232572       -0.8387431     -0.9327127   -0.6194894       -0.9292943   0.5586567  -0.3026625      -1.146719     -0.707016
## High Impact      -1.9360412     -2.0505596   -1.9632654     -2.095332       -2.0369950     -2.1762742   -2.1654932       -1.9790108   1.2906808  -0.3703818      -2.243899     -2.234684
##               countryItaly countryIvory Coast countryJapan countryJordan countryKenya countryKuwait countryLebanon countryLibya countryMalaysia countryMexico countryMorocco
## Medium Impact    -0.866620         -0.6741443   -0.8014405    -0.8111955   -0.7570373    -0.3719660      -1.053642   -0.8908658      -0.4045026    -0.7517276     -0.4916238
## High Impact      -2.411633         -2.1740766   -1.8979428    -1.9219993   -2.0122292    -0.8566371      -2.125555    0.2536396      -1.6957435    -1.5136977     -1.7645405
##               countryMozambique countryMyanmar countryNamibia countryNepal countryNetherlands countryNew Zealand countryNigeria countryNorway countryPakistan countryPeru
## Medium Impact        -0.3844573     -0.2430471      -1.176882   -0.9197453          -1.668139         -0.9698057     -0.4051902    -0.8332426      -0.2658621  -0.8142522
## High Impact          -2.0569119     -1.5176162      -2.600538   -2.0346126          -2.367621         -2.1167300     -0.6640091    -2.3592310      -1.4252986  -1.8502627
##               countryPhilippines countryPoland countryPortugal countryQatar countryRomania countryRussia countryRwanda countrySaudi Arabia countrySenegal countrySingapore
## Medium Impact         -0.7564096    -0.7861756      -0.6703536   -0.3576162       -1.16698    -0.4900587     -1.007674          -0.2057177     -0.7791182       -0.9934818
## High Impact           -2.4723873    -2.1848362      -2.3884489   -0.5204218       -2.08973    -1.7630165     -2.352127          -1.0103801     -1.9220871       -2.2098708
##               countrySouth Africa countrySouth Korea countrySpain countrySri Lanka countrySudan countrySweden countrySwitzerland countryTaiwan countryTanzania countryThailand
## Medium Impact          -0.4840723          -1.283184   -0.6782123        -1.145807   -0.2243023    -0.9205812         -0.7917003    -0.6561474      -0.8258379      -0.5435062
## High Impact            -1.9448186          -2.053937   -1.9169582        -2.114259   -1.6338488    -1.9120195         -2.2760286    -2.2678240      -2.2920051      -1.8696288
##               countryTunisia countryTurkey countryUAE countryUganda countryUkraine countryUnited Kingdom countryUnited States countryVenezuela countryVietnam countryZambia
## Medium Impact      -0.387162    -0.7275985  -1.062420    -0.3345963     -0.5119267            -0.8958869           -0.7235675        0.5192959     -0.8236359    -0.3574211
## High Impact        -1.884182    -2.1315200  -1.967763    -1.8881102     -2.0618198            -2.2509312           -2.1995794        1.5301123     -2.1554180    -2.0536675
##               countryZimbabwe fuel_typeLpg fuel_typePetrol crude_pct_change_scaled
## Medium Impact       -0.591045  -0.01349872     0.003995098               0.0981385
## High Impact         -2.100456  -0.11243719    -0.050586132               0.4651139
## 
## Std. Errors:
##               (Intercept)   quarter2   quarter3   quarter4 countryAngola countryArgentina countryAustralia countryAustria countryBangladesh countryBelgium countryBotswana countryBrazil
## Medium Impact   0.3494690 0.06788649 0.06640441 0.06457246     0.4164665        0.4212320        0.3969290      0.3904511         0.4146942       0.388647       0.3918544     0.4216277
## High Impact     0.2969552 0.06297558 0.06442560 0.06581010     0.3782727        0.3677603        0.3470512      0.3584590         0.3686677       0.350116       0.3516089     0.3639583
##               countryCameroon countryCanada countryChile countryChina countryColombia countryCzech Republic countryDenmark countryEcuador countryEgypt countryEthiopia countryFinland
## Medium Impact       0.4183040     0.3903823    0.3938262    0.4141093       0.4147224             0.3936657      0.3989548      0.4183968    0.4577139       0.4213548      0.3970048
## High Impact         0.3638043     0.3522225    0.3562539    0.3656723       0.3649029             0.3557998      0.3493550      0.3696436    0.4014611       0.3635626      0.3462187
##               countryFrance countryGermany countryGhana countryGreece countryHong Kong countryHungary countryIndia countryIndonesia countryIran countryIraq countryIreland countryIsrael
## Medium Impact     0.3994028      0.3978298    0.3941783     0.4037384        0.3972333      0.4134422    0.4087065        0.4156488   0.6295772   0.4686872      0.4036189     0.3910143
## High Impact       0.3453583      0.3456705    0.3459940     0.3435984        0.3476631      0.3658079    0.3715702        0.3614906   0.5453321   0.3912622      0.3493171     0.3508787
##               countryItaly countryIvory Coast countryJapan countryJordan countryKenya countryKuwait countryLebanon countryLibya countryMalaysia countryMexico countryMorocco
## Medium Impact    0.3926456          0.4090346    0.3968465     0.4183383    0.3965275     0.4383774      0.3957141    0.5668244       0.4148330     0.4275726      0.4140817
## High Impact      0.3538510          0.3709007    0.3449138     0.3669096    0.3487989     0.3729795      0.3418824    0.4177945       0.3698357     0.3647689      0.3686078
##               countryMozambique countryMyanmar countryNamibia countryNepal countryNetherlands countryNew Zealand countryNigeria countryNorway countryPakistan countryPeru
## Medium Impact         0.3904531      0.4188051      0.3961274    0.4180834          0.4132246          0.3983279      0.4543325     0.3913142       0.4215953   0.4183718
## High Impact           0.3532103      0.3739287      0.3535479    0.3652493          0.3433860          0.3472324      0.3807082     0.3507848       0.3734614   0.3633515
##               countryPhilippines countryPoland countryPortugal countryQatar countryRomania countryRussia countryRwanda countrySaudi Arabia countrySenegal countrySingapore
## Medium Impact          0.3893619     0.3912577       0.3878740    0.4610189      0.4001713     0.4157520     0.3976122           0.4322061      0.4152585        0.3978640
## High Impact            0.3537183     0.3465684       0.3516748    0.3856085      0.3419681     0.3704335     0.3521537           0.3762243      0.3645028        0.3480133
##               countrySouth Africa countrySouth Korea countrySpain countrySri Lanka countrySudan countrySweden countrySwitzerland countryTaiwan countryTanzania countryThailand
## Medium Impact           0.3919504          0.4029120    0.3950954        0.4227167    0.4145176     0.4020051          0.3920587     0.3918601       0.3933717       0.4126510
## High Impact             0.3496374          0.3399677    0.3465887        0.3634147    0.3732310     0.3465796          0.3498820     0.3536774       0.3517940       0.3681444
##               countryTunisia countryTurkey countryUAE countryUganda countryUkraine countryUnited Kingdom countryUnited States countryVenezuela countryVietnam countryZambia
## Medium Impact      0.4140306     0.3941728  0.4218051     0.3922982      0.4103194             0.3961451            0.3933475        0.6852325      0.4119977     0.3911666
## High Impact        0.3759438     0.3503308  0.3608396     0.3522326      0.3733099             0.3509987            0.3515262        0.5895201      0.3670720     0.3549281
##               countryZimbabwe fuel_typeLpg fuel_typePetrol crude_pct_change_scaled
## Medium Impact       0.3959367   0.05682098      0.05701668              0.02639093
## High Impact         0.3552450   0.05521420      0.05511600              0.02474443
## 
## Residual Deviance: 24643.28 
## AIC: 25003.28

Model Prediction

# Predict class labels on the test data
pred_class <- predict(
  multi_log_model,
  newdata = test_model,
  type = "class"
)

pred_class <- factor(
  pred_class,
  levels = c("Low Impact", "Medium Impact", "High Impact")
)

# Predict class probabilities
pred_prob <- predict(
  multi_log_model,
  newdata = test_model,
  type = "probs"
)

View first few predicted probabilities:

Low Impact Medium Impact High Impact
0.1227371 0.1742381 0.7030248
0.1170634 0.1439332 0.7390034
0.1198538 0.1965270 0.6836192
0.1291797 0.2178769 0.6529434
0.1049234 0.1556110 0.7394656
0.0783168 0.1069949 0.8146883

Model Performance

Confusion Matrix

conf_matrix <- confusionMatrix(
  pred_class,
  test_model$impact_level
)

Accuracy, Precision, Recall and F1-Score

# Extract overall accuracy

accuracy <- conf_matrix$overall["Accuracy"]

# Extract per-class metrics

precision <- conf_matrix$byClass[, "Pos Pred Value"]

recall <- conf_matrix$byClass[, "Sensitivity"]

# Calculate F1-score

f1_score <- 2 * ((precision * recall) / (precision + recall))

# Create per-class performance table

metrics_df <- data.frame(
  Class = rownames(conf_matrix$byClass),
  Precision = round(precision, 4),
  Recall = round(recall, 4),
  F1_Score = round(f1_score, 4),
  Balanced_Accuracy = round(conf_matrix$byClass[, "Balanced Accuracy"], 4)
)

metrics_df
##                                     Class Precision Recall F1_Score Balanced_Accuracy
## Class: Low Impact       Class: Low Impact    0.6209 0.8006   0.6994            0.6246
## Class: Medium Impact Class: Medium Impact    0.3063 0.0683   0.1118            0.5013
## Class: High Impact     Class: High Impact    0.4527 0.6585   0.5366            0.7466
# Calculate macro-average metrics

macro_precision <- mean(precision, na.rm = TRUE)

macro_recall <- mean(recall, na.rm = TRUE)

macro_f1 <- mean(f1_score, na.rm = TRUE)

# Create summary table
macro_metrics <- data.frame(
  Metric = c("Accuracy", "Macro Precision", "Macro Recall", "Macro
F1-score"),
  Value = round(c(accuracy, macro_precision, macro_recall, macro_f1), 4)
)
macro_metrics
##            Metric  Value
## 1        Accuracy 0.5580
## 2 Macro Precision 0.4600
## 3    Macro Recall 0.5091
## 4 Macro\nF1-score 0.4492

The accuracy shows the overall proportion of correctly classified observations. Precision shows how reliable the model is when it predicts a specific impact class. Recall shows how well the model identifies actual cases from each class. The F1-score provides a balanced measure between precision and recall.

Multiclass ROC and AUC

multi_auc <- multiclass.roc(
  response = test_model$impact_level,
  predictor = pred_prob
)

multi_auc_value <- auc(multi_auc)

print(multi_auc_value)
## Multi-class area under the curve: 0.6838

The multiclass AUC measures how well the model separates the three impact levels overall. A higher AUC value indicates stronger classification performance.

One-vs-All ROC Curves

# One-vs-All ROC curve for Low Impact

roc_low <- roc(
  response = ifelse(test_model$impact_level == "Low Impact", 1, 0),
  predictor = pred_prob[, "Low Impact"]
)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# One-vs-All ROC curve for Medium Impact

roc_medium <- roc(
  response = ifelse(test_model$impact_level == "Medium Impact", 1, 0),
  predictor = pred_prob[, "Medium Impact"]
)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# One-vs-All ROC curve for High Impact

roc_high <- roc(
  response = ifelse(test_model$impact_level == "High Impact", 1, 0),
  predictor = pred_prob[, "High Impact"]
)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC curves

plot(
  roc_low,
  col = "blue",
  lwd = 2,
  main = "One-vs-All ROC Curves for Multinomial Logistic Regression"
)

plot(
  roc_medium,
  col = "orange",
  lwd = 2,
  add = TRUE
)

plot(
  roc_high,
  col = "darkgreen",
  lwd = 2,
  add = TRUE
)

abline(a = 0, b = 1, lty = 2, col = "red")

legend(
  "bottomright",
  legend = c(
    paste("Low Impact AUC =", round(auc(roc_low), 4)),
    paste("Medium Impact AUC =", round(auc(roc_medium), 4)),
    paste("High Impact AUC =", round(auc(roc_high), 4))
  ),
  col = c("blue", "orange", "darkgreen"),
  lwd = 2
)

Model Results and Interpretation

Multinomial Logistic Regression Performance Summary
Metric Value
Accuracy.Accuracy Accuracy.Accuracy 0.5580
Macro Precision Macro Precision 0.4600
Macro Recall Macro Recall 0.5091
Macro F1-score Macro F1-score 0.4492
Multiclass AUC Multiclass AUC 0.6838

The Multinomial Logistic Regression model was developed to classify the impact level of global fuel price changes on country-level fuel price changes by quarter. The target variable used was impact_level, which consists of three categories: Low Impact, Medium Impact, and High Impact. The predictors used in the model were quarter, country, fuel_type, and crude_pct_change.

Based on the test dataset, the model achieved an overall accuracy of 55.8%. This means that the model correctly classified 55.8% of the observations. The No Information Rate was 53.01%, which represents the accuracy that could be achieved by always predicting the majority class. Since the model accuracy is only slightly higher than the No Information Rate, this indicates that the model provides some improvement over a simple majority-class prediction, but the improvement is limited.

The confusion matrix shows that the model performed best in classifying Low Impact observations. Out of the actual Low Impact cases, the model correctly classified 1,417 observations, giving a recall of 80.06% for this class. This suggests that the model is reasonably good at detecting lower-impact fuel price changes.

For High Impact, the model showed moderate performance. It correctly classified 378 High Impact observations, with a recall of 65.85%. However, the precision for High Impact was 45.27%, meaning that not all observations predicted as High Impact were actually High Impact cases.

The weakest performance was observed for Medium Impact. The model correctly classified only 68 Medium Impact observations, while many Medium Impact cases were misclassified as Low Impact or High Impact. The recall for Medium Impact was only 6.83%, showing that the model had difficulty identifying this category. This may suggest that Medium Impact observations are less clearly separated from the other two classes using the selected predictors.

The macro-average results further support this interpretation. The model achieved a Macro Precision of 46.00%, Macro Recall of 50.91%, and Macro F1-score of 44.92%. Although the overall accuracy was 55.8%, the low Macro F1-score shows that the model performance was not balanced across all three impact levels. In particular, the poor performance for Medium Impact reduced the overall class-balanced performance.

The model also achieved a Multiclass AUC of 0.6838, indicating a moderate ability to distinguish between Low Impact, Medium Impact, and High Impact classes. However, this value also suggests that the model still has room for improvement.

Overall, the Multinomial Logistic Regression model provides a useful baseline for the classification problem. It is able to identify Low Impact cases relatively well and shows moderate ability in detecting High Impact cases. However, it struggles to classify Medium Impact observations accurately. This indicates that the relationship between global crude oil price changes and country-level fuel price impact may be more complex than what a linear classification model can fully capture. Therefore, more flexible machine learning models such as Random Forest or XGBoost may be better suited to improve classification performance.

Random Forest Classifier

Objective

The objective of this classification model is to predict the impact level of global fuel price changes on countries’ fuel price changes by quarter.
The target variable impact_level has three categories: Low Impact, Medium Impact, and High Impact.

Random Forest was selected because it is well‑suited for multi‑class classification problems and can handle complex interactions between categorical and continuous predictors.
It also provides variable importance measures, which help identify the most influential factors driving classification outcomes.

In this project, the predictors include:

  • Quarter (seasonal effects)
  • Country (policy and subsidy differences)
  • Fuel type (petrol, diesel, LPG)
  • Crude oil percentage change (global market fluctuations)

Methods

# Load libraries
library(randomForest)
library(caret)
library(ggplot2)
library(pROC)

# Read train/test datasets
train_df <- read.csv("data_split/classification/train_df.csv")
test_df <- read.csv("data_split/classification/test_df.csv")

# Ensure factor levels match
train_df$impact_level <- trimws(train_df$impact_level)
test_df$impact_level <- trimws(test_df$impact_level)

train_df$impact_level <- factor(train_df$impact_level,
  levels = c("Low Impact", "Medium Impact", "High Impact")
)
test_df$impact_level <- factor(test_df$impact_level,
  levels = c("Low Impact", "Medium Impact", "High Impact")
)

Results

# Check class distribution
cat("Train class counts:\n")
## Train class counts:
print(table(train_df$impact_level))
## 
##    Low Impact Medium Impact   High Impact 
##          4496          3263          4241
cat("Test class counts:\n")
## Test class counts:
print(table(test_df$impact_level))
## 
##    Low Impact Medium Impact   High Impact 
##          1770           995           574

Model Training

# Train Random Forest model
set.seed(123)
rf_model <- randomForest(
  impact_level ~ quarter + country + fuel_type +
    crude_pct_change,
  data = train_df,
  ntree = 300,
  mtry = floor(sqrt(5)),
  importance = TRUE
)

Model Evaluation

# Predict on test set
predictions <- predict(rf_model, newdata = test_df)

Confusion Matrix

cm <- confusionMatrix(predictions, test_df$impact_level)
print(cm)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Low Impact Medium Impact High Impact
##   Low Impact          1577           466         122
##   Medium Impact        164           442         119
##   High Impact           29            87         333
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7044          
##                  95% CI : (0.6886, 0.7198)
##     No Information Rate : 0.5301          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.48            
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: Low Impact Class: Medium Impact Class: High Impact
## Sensitivity                     0.8910               0.4442            0.58014
## Specificity                     0.6252               0.8793            0.95805
## Pos Pred Value                  0.7284               0.6097            0.74165
## Neg Pred Value                  0.8356               0.7884            0.91661
## Prevalence                      0.5301               0.2980            0.17191
## Detection Rate                  0.4723               0.1324            0.09973
## Detection Prevalence            0.6484               0.2171            0.13447
## Balanced Accuracy               0.7581               0.6617            0.76909

The model achieves an overall accuracy of 70.4%. It performs very well at identifying Low Impact cases, correctly predicting 1,577 of them. However, it experiences noticeable friction with Medium Impact cases, frequently misclassifying them as Low Impact.

Per class metrics

precision <- cm$byClass[, "Pos Pred Value"]
recall <- cm$byClass[, "Sensitivity"]
f1 <- 2 * ((precision * recall) / (precision + recall))

metrics_df <- data.frame(
  Class = rownames(cm$byClass),
  Precision = round(precision, 3),
  Recall = round(recall, 3),
  F1_Score = round(f1, 3),
  Balanced_Accuracy = round(cm$byClass[, "Balanced Accuracy"], 3)
)
print(metrics_df)
##                                     Class Precision Recall F1_Score Balanced_Accuracy
## Class: Low Impact       Class: Low Impact     0.728  0.891    0.802             0.758
## Class: Medium Impact Class: Medium Impact     0.610  0.444    0.514             0.662
## Class: High Impact     Class: High Impact     0.742  0.580    0.651             0.769

The per-class metrics confirm that the model is highly reliable for Low Impact predictions, yielding a strong F1-Score of 0.801 and a recall of 0.891. Conversely, Medium Impact is the weakest category with an F1-Score of 0.514, showing that the model requires sharper decision boundaries to separate minor baseline shifts from low-level activity.

Variable Importance Plot

imp <- importance(rf_model)
imp_df <- data.frame(
  Variable = rownames(imp),
  MeanDecreaseAccuracy = imp[, "MeanDecreaseAccuracy"]
)

ggplot(imp_df, aes(
  x = reorder(Variable, MeanDecreaseAccuracy),
  y = MeanDecreaseAccuracy
)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Variable Importance (Random Forest)",
    x = "Predictor",
    y = "Mean Decrease in Accuracy"
  ) +
  theme_minimal()

Changes in global crude oil prices are by far the most important factor driving the model’s predictions, completely outperforming everything else. On the other hand, the fuel type variable actually scores below zero. This indicates that knowing whether a fuel is petrol or diesel provides no real predictive value and might even confuse the model, making it a prime candidate to be removed in future versions.

Class distribution plot

ggplot(train_df, aes(x = impact_level)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Class Distribution (Train Set)", x = "Impact Level", y = "Count") +
  theme_minimal()

ggplot(test_df, aes(x = impact_level)) +
  geom_bar(fill = "darkred") +
  labs(title = "Class Distribution (Test Set)", x = "Impact Level", y = "Count") +
  theme_minimal()

The distribution plots reveal a critical shift between our datasets. While the training set is relatively well-balanced across all three tiers, the testing set is heavily skewed toward Low Impact observations (1,770 cases vs. only 574 High Impact cases). This imbalance in the test data heavily influences our final accuracy metrics.

Confusion matrix heatmap

cm_table <- as.data.frame(cm$table)
ggplot(cm_table, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), color = "white") +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Confusion Matrix Heatmap") +
  theme_minimal()

The heatmap visually isolates where the model succeeds and where it loses track. The deep blue square at the bottom left highlights the model’s core strength: correctly identifying Low Impact cases 1,577 times. However, the lighter tile right next to it clearly flags our primary weak point, showing that 468 actual Medium Impact cases were incorrectly swept into the Low Impact category.

Conclusion

The Random Forest model provides a solid baseline with an overall accuracy of 70.4%, proving highly effective at identifying low-impact market shifts. Because global crude oil trends completely dominate the model’s decision tree splits while product specifics like fuel type offer no actual value, future work should focus on simplifying the feature space. Dropping the fuel type variable and introducing specific country by crude interaction terms will be the most direct path to fixing the overlap between Low and Medium impact classes.

Extreme Gradient Boosting (XGBoost) Classifier

Extreme Gradient Boosting (XGBoost) is a powerful machine learning algorithm that has gained popularity for its performance and efficiency in classification tasks. It is an implementation of gradient boosting that uses decision trees as base learners. XGBoost is designed to optimize both speed and performance, making it a strong candidate for our classification problem.

Building the Model

xgboost_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(), # Tune this
  learn_rate = tune(), # Tune this
  min_n = tune() # Tune this
) |>
  set_engine("xgboost") |>
  set_mode("classification")

# Combine recipe and model into a workflow
xgboost_workflow <- workflow() |>
  add_model(xgboost_spec) |>
  add_recipe(classification_recipe)

The code above sets the specification of the XGBoost model. The number of trees is set to 1000 to allow the model to learn complex patterns, but not too much that it overfits; while tree_depth, learn_rate, and min_n are set to be tuned during the hyperparameter tuning process.

library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)

parallel_control <- control_grid(
  save_pred = TRUE,
  parallel_over = "everything",
  verbose = FALSE,
)

# Tune the model using cross-validation
xgboost_tune <- xgboost_workflow |>
  tune_grid(
    resamples = folds,
    grid = 20, # Number of random combinations to try
    control = parallel_control
  )

stopCluster(cl)

The training and cross-validation process for the XGBoost model is computationally intensive, especially when tuning multiple hyperparameters. To speed up this process, parallel processing is utilized by creating a cluster that uses all available CPU cores except one. The control_grid function is configured to save predictions and allow parallel processing across all resamples. The tune_grid function then performs the hyperparameter tuning using the specified cross-validation folds.

The best parameters for XGBoost can be grabbed with the code below.

# Select the best hyperparameters based on accuracy
best_xgboost <- xgboost_tune |>
  select_best(metric = "accuracy")

Parameters:

  • min_n: 24
    • This suggests that the model performs best when each leaf node has at least 24 observations.
  • tree_depth: 4
    • This indicates that the optimal tree depth for the model is 4. Indicating that the model is not too complex.
  • learn_rate: 0.001
    • This very low learning rate suggests that the model benefits from a slow and steady learning process.

Model Metrics

# 2. Explicitly force R to use the tidymodels (yardstick) functions
my_metrics <- yardstick::metric_set(
  yardstick::accuracy,
  yardstick::precision,
  yardstick::recall,
  yardstick::f_meas,
  yardstick::roc_auc
)

# 3. Generate predictions on the test data
xgboost_predictions <- predict(final_xgboost, test_data, type = "class") |>
  bind_cols(predict(final_xgboost, test_data, type = "prob")) |>
  bind_cols(test_data)

# 4. Calculate metrics safely
# 4. Calculate metrics safely
xgboost_metrics <- xgboost_predictions |>
  my_metrics(
    truth = impact_level,
    estimate = .pred_class,
    # Select all probability columns, but ignore the factor prediction column
    starts_with(".pred_") & !.pred_class,
    estimator = "macro"
  )

XGBoost Classification Performance Metrics
.metric .estimator .estimate
accuracy multiclass 0.9949087
precision macro 0.9949302
recall macro 0.9930726
f_meas macro 0.9939784
roc_auc macro 0.9998956

The model demonstrates outstanding performance across all evaluated classification metrics, achieving near-perfect predictive accuracy and balance:

  • Accuracy (99.5%): The model correctly classifies 99.5% of all instances in the dataset.

  • Precision (Macro: 99.5%): Out of all instances the model predicted as belonging to a specific class, 99.5% were correctly identified, indicating an exceptionally low false-positive rate.

  • Recall (Macro: 99.3%): The model successfully captures 99.3% of all actual positive instances across classes, demonstrating an exceptionally low false-negative rate.

  • F1-Score (Macro: 99.4%): The harmonic mean of precision and recall confirms that the model maintains a highly stable and balanced performance without favoring one metric over the other.

  • ROC AUC (Macro: 0.999): The model’s ability to distinguish between classes is nearly perfect, with an AUC close to 1.0.

The model performs exceptionally well across all individual classes as the macro-averaged metrics are virtually identical to the overall accuracy.

Confusion Matrix

##                Truth
## Prediction      Low Impact Medium Impact High Impact
##   Low Impact          1765             4           0
##   Medium Impact          5           991           8
##   High Impact            0             0         566

The confusion matrix reveals how further robust is the model. Only a total of 17 values were misclassified across all classes. The majority being actual Medium Impact is understandable as this class is often the most difficult to distinguish due to its middle position. However, no extreme misclassifications occurred - such as a high impact predicted as low impact - which shows that the model is reliable in differentiating between the high and low impact cases.


Classification Result

Introduction

The aim of this project is to evaluate three machine learning models — XGBoost, Multinomial Logistic Regression, and Random Forest — for predicting the impact level of global fuel price changes.

Our objectives are: - To compare predictive accuracy across models.
- To measure computational efficiency (latency).
- To analyze error distributions and misclassifications.
- To identify the most reliable framework for deployment.

Methodology

1. Loading Libraries

We begin by loading essential R packages for machine learning, evaluation, and visualization. These include caret, tidymodels, randomForest, and pROC.

# Load libraries
library(dplyr)
library(ggplot2)
library(tidyr)
library(caret)
library(pROC)
library(yardstick)
library(tidymodels)
library(bundle)
library(randomForest)

2. Loading Pre‑Trained Models

Models were trained previously and saved as .rds files, so we are restoring them here:

xgb_bundle <- readRDS("models/classification/final_xgboost_model.rds")
model_xgb <- bundle::unbundle(xgb_bundle)

model_logistic <- readRDS("models/classification/final_logistic_model.rds")
rf_model <- readRDS("models/classification/final_rf_model.rds")

3. Preparing Test Data

Each model requires slightly different preprocessing:

XGBoost: factor alignment and date handling. Logistic Regression: scaling of numeric predictors. Random Forest: factor level synchronization with training data.

Example: Logistic Regression preprocessing

crude_mean <- mean(train_df$crude_pct_change, na.rm = TRUE)
crude_sd <- sd(train_df$crude_pct_change, na.rm = TRUE)

test_data_logistic <- test_df %>%
  mutate(
    impact_level = factor(trimws(impact_level), levels = target_levels),
    quarter = factor(quarter),
    country = factor(country),
    fuel_type = factor(fuel_type),
    crude_pct_change_scaled = (crude_pct_change - crude_mean) / crude_sd
  ) %>%
  select(impact_level, quarter, country, fuel_type, crude_pct_change_scaled) %>%
  na.omit()

Results

4. Prediction Generation

Each model produces both class predictions and probability matrices:

XGBoost predictions

xgb_class <- predict(model_xgb, new_data = test_data_xgb, type = "class")$.pred_class
xgb_prob <- predict(model_xgb, new_data = test_data_xgb, type = "prob")

Logistic Regression predictions

logistic_class <- predict(model_logistic, newdata = test_data_logistic, type = "class")
logistic_prob <- predict(model_logistic, newdata = test_data_logistic, type = "probs")

Random Forest predictions

rf_class <- predict(rf_model, newdata = test_data_rf_clean, type = "response")
rf_prob <- predict(rf_model, newdata = test_data_rf_clean, type = "prob")

5. Core Evaluation Metrics

We compute Accuracy, Precision, Recall, F1‑Score, and AUC for each model.

print(comparison_df)
##                  Model Accuracy Precision Recall F1_Score    AUC
## 1              XGBoost   0.9949    0.9949 0.9931   0.9940 0.9999
## 2 Multinomial Logistic   0.5580    0.4600 0.5091   0.4492 0.6838
## 3        Random Forest   0.7041    0.6929 0.6379   0.6550 0.8117

Interpretation:

XGBoost consistently achieved the highest accuracy and F1‑score. Logistic Regression was moderate, with balanced but lower recall. Random Forest produced valid results after factor alignment, but lagged behind XGBoost in accuracy.

6. Computational Latency

Execution time was measured to assess efficiency.

print(speed_df)
##                  Model Execution_Time_Secs
## 1              XGBoost             0.44549
## 2 Multinomial Logistic             1.65848
## 3        Random Forest             0.50988

Interpretation:

Logistic Regression was fastest. XGBoost balanced speed with strong accuracy. Random Forest was slower, reflecting its ensemble complexity.

7. Error Distribution

Confusion matrices highlight misclassifications.

8. Visualization Dashboard

We generated three plots:

Performance Profile: Accuracy, F1, and AUC side‑by‑side.

Confusion Matrix Heatmaps: Misclassification grids for each model.

Efficiency Frontier: Trade‑off between accuracy and execution time.

9. Advanced Diagnostics

Balanced Accuracy per class and Log Loss were computed for deeper insights.

print(advanced_comparison)
##                  Model Overall_Accuracy Low_Impact_BalAcc Med_Impact_BalAcc
## 1              XGBoost           0.9949            0.9973            0.9952
## 2 Multinomial Logistic           0.5580            0.6246            0.5013
## 3        Random Forest           0.7041            0.7578            0.6617
##   High_Impact_BalAcc Log_Loss
## 1             0.9930   0.3406
## 2             0.7466   1.0061
## 3             0.7682   0.8212

Interpretation:

XGBoost maintained balanced accuracy across all classes and lower log loss, reinforcing its robustness.

Conclusion

This unified evaluation demonstrates that XGBoost is the most reliable framework for predicting the impact level of global fuel price changes. It achieved the best balance of accuracy, F1‑score, and computational efficiency.

Logistic Regression remains a lightweight option with acceptable performance.

Random Forest requires careful factor alignment and was less competitive in this dataset.

Final takeaway: For deployment in real‑world scenarios, XGBoost offers the strongest predictive performance and operational efficiency.


Limitations

1. Synthetic / secondary dataset limitation

The dataset may not fully capture real-world fuel pricing mechanisms such as exchange rates, government price controls, import costs, and geopolitical shocks.

2. Limited external predictors

The model mainly uses Brent crude price and country-level categories. It does not include exchange rate, inflation, domestic tax policy changes, transport cost, or demand/supply indicators.

3. Potential target leakage in classification

Some features may be too closely related to how impact_level was created. The model should be rerun after removing leakage variables.

4. Class imbalance

The testing set has more Low Impact cases than High Impact cases, which can make accuracy look better than actual class-level performance. The report itself notes this imbalance in the classification section.

5. Generalization risk

High performance on historical data does not guarantee performance on future fuel shocks or unseen countries.


Conclusion

This study aimed to predict global retail fuel prices and classify the impact level of fuel price changes using Brent crude oil prices and country-level characteristics. The analysis followed a complete machine learning workflow, including data understanding, data cleaning, exploratory data analysis, data preparation, regression modeling, classification modeling, and model evaluation.

For the regression task, the objective was to predict monthly average retail fuel prices. Three models were compared: Elastic Net Linear Regression, Random Forest Regression, and XGBoost Regression. Based on the evaluation results, XGBoost Regression produced the best performance across all key metrics. It achieved the lowest prediction errors and the highest explanatory power, indicating that it was able to capture complex non-linear relationships between global crude oil prices, historical crude prices, country characteristics, tax category, income level, region, and fuel type. Therefore, XGBoost Regression was selected as the final model for predicting retail fuel prices.

For the classification task, the objective was to predict quarterly fuel price impact levels, categorized as Low Impact, Medium Impact, and High Impact. Three models were also compared: Multinomial Logistic Regression, Random Forest, and XGBoost. The Multinomial Logistic Regression model served as a useful baseline but showed limited performance, especially in identifying Medium Impact cases. Random Forest improved the classification accuracy and performed well in detecting Low Impact observations, but it still struggled to separate Medium Impact from the other classes. XGBoost achieved the strongest overall classification performance, with very high accuracy, precision, recall, and F1-score. As a result, XGBoost was selected as the final classification model.

Overall, the findings show that XGBoost was the most suitable algorithm for both research questions. Its strong performance suggests that fuel price behaviour is influenced by complex and non-linear patterns rather than simple linear relationships alone. This is reasonable because local retail fuel prices are not only affected by Brent crude oil price movements, but also by country-specific factors such as tax structure, subsidy level, income level, region, and fuel type.

The results also highlight that global crude oil prices do not affect all countries equally. Some countries may experience stronger retail fuel price changes, while others may absorb price shocks more effectively due to policy intervention, subsidies, taxation systems, or economic structure. Therefore, the classification model provides useful insight by identifying which observations are more likely to fall into Low, Medium, or High Impact categories.

However, the results should be interpreted with caution. The very high performance of the XGBoost models indicates strong predictive ability within the dataset, but it does not guarantee perfect real-world forecasting performance. Future work should further validate the models using new unseen data, check for possible target leakage, and test whether the models remain stable under different time periods or external market conditions. Additional variables such as exchange rates, inflation, government policy changes, fuel import dependency, and geopolitical events could also be included to improve the real-world explanatory power of the models.

In conclusion, this project successfully demonstrated how machine learning can be used to analyze fuel price behaviour from both regression and classification perspectives. XGBoost was found to be the best-performing model for predicting retail fuel prices and classifying fuel price impact levels. The study provides a useful analytical framework for understanding how global crude oil price movements interact with country-level characteristics to influence local fuel markets.