Volatilities in global energy markets introduce severe economic uncertainties, impacting consumer retail prices and national economic planning. Understanding the pass-through effect of crude oil price shocks onto localized retail markets requires a dual-pronged analytical framework.
This study leverages historical data across distinct temporal granularities to address two core research questions using machine learning paradigms:
Objective: To predict future retail fuel prices based on the historical behavior of crude oil benchmarks.
Granularity: Analyzed according to country by month to capture medium-term macroeconomic shifts and regional dependencies.
Approach: Continuous numeric forecasting (Regression) to map upstream supply-side costs directly to downstream consumer impacts.
Objective: To build a classification architecture that analyzes the specific impact level of global fuel price spikes on localized retail adjustments.
Granularity: Monitored by quarter to align with broader corporate and national budgetary cycles.
Approach: Categorical risk assessment (Classification) to segment country-specific fuel behavior into distinct shock-absorption or impact bands.
We will employ a standard machine learning framework which consist of:
The data was gathered from Kaggle by the user BELBIN BENO R M titled Global Fuel Prices 2020–2026. It was then imported to R with the default function. A preview of the first 5 records are as below:
df <- read.csv("data_raw/dirty_fuel_data.csv")
| date | country | region | income_level | subsidy_level | petrol_usd_liter | diesel_usd_liter | lpg_usd_liter | brent_crude_usd | tax_percentage |
|---|---|---|---|---|---|---|---|---|---|
| 2020-01-06 | United States | North America | High | Low | $1.465 | 999.990 | 1.093 | 65.75 | 59.6 |
| 2020-01-13 | USA | North America | High | Low | $1.435 | -5.000 | 1.077 | 65.54 | 27.6 |
| 2020-01-20 | United States | North America | High | Low | $1.446 | 1.329 | 1.080 | 66.51 | 62.3 |
| 2020-01-27 | United States | North America | High | Low | $1.488 | 1.359 | 1.125 | 68.79 | 44.8 |
| 2020-02-03 | United States | North America | High | Low | $1.457 | 1.341 | 1.111 | 68.44 | 51.4 |
The data consists of nine features:
date: The time stamp of the record in the format of
YYYY-MM-DDcountry: The specific nation where the fuel prices were
recordedregion: The broader geographic territoryincome_level: The economic classification of the
countrysubsidy_level: Indicates the scale of government
intervention and subsidies on national fuel pricespetrol_usd_liter: The local retail price of consumer
gasoline, standardized to US Dollars per literdiesel_usd_liter: The local retail price of automotive
and industrial diesel fuel, standardized to US Dollars per literlpg_usd_liter: The local retail price of Liquefied
Petroleum Gas, standardized to US Dollars per literbrent_crude_usd: The global benchmark price for a
barrel of Brent Crude oil in US Dollarstax_percentage: The tax rate applied by the national
government to fuel salesNumber of Unique Values:
## date: 327
## country: 85
## region: 7
## income_level: 3
## subsidy_level: 4
## petrol_usd_liter: 5517
Overview of the data:
## Rows: 27,483
## Columns: 10
## $ date <chr> "2020-01-06", "2020-01-13", "2020-01-20", "2020-01-27", "2020-0~
## $ country <chr> "United States", "USA", "United States", "United States", "Unit~
## $ region <chr> "North America", "North America", "North America", "North Ameri~
## $ income_level <chr> "High", "High", "High", "High", "High", "High", "High", "High",~
## $ subsidy_level <chr> "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", ~
## $ petrol_usd_liter <chr> "$1.465", "$1.435", "$1.446", "$1.488", "$1.457", "1.461 USD", ~
## $ diesel_usd_liter <dbl> 999.990, -5.000, 1.329, 1.359, 1.341, 1.356, 1.390, 1.396, 1.41~
## $ lpg_usd_liter <dbl> 1.093, 1.077, 1.080, 1.125, 1.111, 1.105, 1.111, 1.193, 1.156, ~
## $ brent_crude_usd <dbl> 65.75, 65.54, 66.51, 68.79, 68.44, 68.09, 70.46, 71.61, 70.53, ~
## $ tax_percentage <dbl> 59.6, 27.6, 62.3, 44.8, 51.4, 15.0, 58.8, 17.9, 55.2, 14.7, 14.~
Dataset Summary:
This dataset is well suited for our questions because it captures
both the primary global market drivers and the local economic buffers
that dictate fuel costs. For Q1 (Regression),
brent_crude_usd provides the baseline upstream cost, while
the date and country variables allow the model
to track the exact monthly time lag it takes for global price shocks to
pass through to local retail pumps (petrol_usd_liter).
For Q2 (Classification), the data provides the structural
macro-context needed to categorize quarterly impact risk bands. Features
like subsidy_level and tax_percentage act as
policy shields that alter how a nation absorbs financial shocks, while
income_level and region help the classifier
identify which clusters of countries are highly vulnerable versus those
with the fiscal resilience to smooth out volatile energy spikes.
This section details the reproducible data cleaning workflow applied to the dirty_fuel_data.csv dataset. The goal of this pipeline is to address duplicate rows, inconsistent categorical naming, character-to-numeric type mismatches, statistical outliers, and missing data imputation.
First, we load the tidyverse suite (which includes dplyr, readr, stringr, and ggplot2 for efficient data manipulation) and import the raw dataset.
## Rows: 27483 Columns: 10
## -- Column specification -------------------------------------------------------------------
## Delimiter: ","
## chr (5): country, region, income_level, subsidy_level, petrol_usd_liter
## dbl (4): diesel_usd_liter, lpg_usd_liter, brent_crude_usd, tax_percentage
## date (1): date
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 27,483
## Columns: 10
## $ date <date> 2020-01-06, 2020-01-13, 2020-01-20, 2020-01-27, 2020-02-03, 20~
## $ country <chr> "United States", "USA", "United States", "United States", "Unit~
## $ region <chr> "North America", "North America", "North America", "North Ameri~
## $ income_level <chr> "High", "High", "High", "High", "High", "High", "High", "High",~
## $ subsidy_level <chr> "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", ~
## $ petrol_usd_liter <chr> "$1.465", "$1.435", "$1.446", "$1.488", "$1.457", "1.461 USD", ~
## $ diesel_usd_liter <dbl> 999.990, -5.000, 1.329, 1.359, 1.341, 1.356, 1.390, 1.396, 1.41~
## $ lpg_usd_liter <dbl> 1.093, 1.077, 1.080, 1.125, 1.111, 1.105, 1.111, 1.193, 1.156, ~
## $ brent_crude_usd <dbl> 65.75, 65.54, 66.51, 68.79, 68.44, 68.09, 70.46, 71.61, 70.53, ~
## $ tax_percentage <dbl> 59.6, 27.6, 62.3, 44.8, 51.4, 15.0, 58.8, 17.9, 55.2, 14.7, 14.~
Before writing the cleaning pipeline, we must profile the dataset to uncover quality, structure, and validity issues.
print("Missing values per column:")
## [1] "Missing values per column:"
colSums(is.na(raw_data))
## date country region income_level subsidy_level
## 0 0 0 0 0
## petrol_usd_liter diesel_usd_liter lpg_usd_liter brent_crude_usd tax_percentage
## 1373 1375 0 0 0
print("Number of duplicate rows:")
## [1] "Number of duplicate rows:"
sum(duplicated(raw_data))
## [1] 15
# Verifying if the petrol price column is recognized as numeric
class(raw_data$petrol_usd_liter)
## [1] "character"
# Summarizing diesel to identify mathematically impossible values
summary(raw_data$diesel_usd_liter)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -5.000 0.860 1.955 2.173 3.233 999.990 1375
# Identifying naming variants for categorical grouping
unique(raw_data$country)
## [1] "United States" "USA" "Canada" "Mexico" "Brazil"
## [6] "Argentina" "Colombia" "Chile" "Peru" "Venezuela"
## [11] "Ecuador" "United Kingdom" "Germany" "France" "Italy"
## [16] "Spain" "Netherlands" "Belgium" "Sweden" "Norway"
## [21] "Denmark" "Finland" "Poland" "Portugal" "Greece"
## [26] "Switzerland" "Austria" "Ireland" "Hungary" "Czech Republic"
## [31] "Romania" "Russia" "Ukraine" "Turkey" "China"
## [36] "Japan" "South Korea" "India" "Indonesia" "Malaysia"
## [41] "Thailand" "Vietnam" "Philippines" "Bangladesh" "Pakistan"
## [46] "Sri Lanka" "Nepal" "Myanmar" "Singapore" "Hong Kong"
## [51] "Taiwan" "Australia" "New Zealand" "Saudi Arabia" "UAE"
## [56] "Kuwait" "Qatar" "Iran" "Iraq" "Israel"
## [61] "Jordan" "Lebanon" "Egypt" "South Africa" "Nigeria"
## [66] "Kenya" "Ethiopia" "Ghana" "Tanzania" "Morocco"
## [71] "Algeria" "Libya" "Tunisia" "Angola" "Mozambique"
## [76] "Zambia" "Zimbabwe" "Uganda" "Cameroon" "Senegal"
## [81] "Ivory Coast" "Sudan" "Botswana" "Namibia" "Rwanda"
Using a single dplyr pipe sequence, we apply five specific data-cleaning steps to ensure the integrity of the data without breaking spatial correlations.
Post-Cleaning Quality Assurance We run the data profiling checks again to verify that our dataset is now clean.
# 1. Check for remaining Missing Values
colSums(is.na(cleaned_data))
## date country region income_level subsidy_level
## 0 0 0 0 0
## petrol_usd_liter diesel_usd_liter lpg_usd_liter brent_crude_usd tax_percentage
## 0 0 0 0 0
# 2. Check for remaining Duplicates
sum(duplicated(cleaned_data))
## [1] 0
# 3. Verify final data types
class(cleaned_data$petrol_usd_liter)
## [1] "numeric"
# 4. Verify distribution ranges without massive outliers
summary(cleaned_data$diesel_usd_liter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.010 0.860 1.961 2.135 3.235 6.240
The final, sparkling clean data is saved for analysis and downstream reporting.
write_csv(cleaned_data, "fuel_prices_cleaned_final.csv")
print("Cleaning complete! Final data successfully exported.")
The clean dataset consists of 10 variables outlined below:
| # | Variable Name | Data Type | Description |
|---|---|---|---|
| 1 | date | Date | The recorded date of fuel prices. |
| 2 | country | Character | Name of the country (standardized). |
| 3 | region | Character | Geographic region (e.g., North America, Europe). |
| 4 | income_level | Character | Economic classification of the country. |
| 5 | subsidy_level | Character | Level of government fuel subsidies. |
| 6 | petrol_usd_liter | Numeric | Price of gasoline per liter in USD. |
| 7 | diesel_usd_liter | Numeric | Price of diesel per liter in USD. |
| 8 | lpg_usd_liter | Numeric | Price of LPG per liter in USD. |
| 9 | brent_crude_usd | Numeric | The global benchmark price for crude oil. |
| 10 | tax_percentage | Numeric | The percentage of fuel price consisting of tax. |
We will explore the data to uncover relationships and patterns that can inform our modeling approach. This includes visualizations to understand the distribution of fuel prices, the relationship between global crude prices and local retail prices, the impact of each countries every quarter and how different countries and regions are affected by price changes throughout the quarters.
These libraries will be used for data manipulation, visualization,
and analysis throughout the EDA process. We will use
ggplot2 for creating visualizations, dplyr for
data manipulation, lubridate for handling date operations,
zoo for time series analysis, and corrplot for
visualizing correlations between numeric features.
clean_data$year_quarter <- as.yearqtr(clean_data$date)
head(clean_data[, c("date", "year_quarter")])
## date year_quarter
## 1 2020-01-06 2020 Q1
## 2 2020-01-13 2020 Q1
## 3 2020-01-20 2020 Q1
## 4 2020-01-27 2020 Q1
## 5 2020-02-03 2020 Q1
## 6 2020-02-10 2020 Q1
class(clean_data$year_quarter) # yearqtr class
## [1] "yearqtr"
This code extracts the year and quarter from the date
column and creates a new column called year_quarter. This
will allow us to analyze the data on a quarterly basis, which is
important for our classification question. The as.yearqtr
function from the zoo package is used to convert the date
into a year-quarter format.
quarterly_brent <- clean_data |>
group_by(year_quarter) |>
summarise(mean_brent = mean(brent_crude_usd, na.rm = TRUE))
This code calculates the average Brent crude price for each quarter
by grouping the data by year_quarter and then summarizing
it to get the mean of brent_crude_usd. This will help us
understand the global trends in crude oil prices over time.
The fuel price is extremely volatile between 2020 Q1 and 2021 Q1. 2020 Q2 is the lowest point in the data before spiking to 100 from ~55. The average brent crude price then stay above 100 except for 2023 Q1 with a value of ~96. 2022 Q2 and 2026 Q1. 2022 Q2 to 2023 Q1 shows a down slope in price before slowly raising up throughout 2023 and 2024.
quarterly_fuels <- clean_data %>%
group_by(year_quarter) %>%
summarise(
mean_petrol = mean(petrol_usd_liter, na.rm = TRUE),
mean_diesel = mean(diesel_usd_liter, na.rm = TRUE),
mean_lpg = mean(lpg_usd_liter, na.rm = TRUE)
)
This code calculates the average price for petrol, diesel, and LPG
for each quarter by grouping the data by year_quarter and
then summarizing it to get the mean of each fuel type.
The mean prices of all fuel types generally follow the Brent crude oil trend. Petrol maintained the highest average price across all quarters, followed by diesel and LPG. The similar movement patterns among fuel types suggest that their prices are positively correlated and influenced by similar market conditions.
This code calculates the average petrol price for each region by
quarter by grouping the data by year_quarter and
region, and then summarizing it to get the mean of
petrol_usd_liter.
Europe and Oceania showed to have the highest petrol price throughout the quarters, way above the other regions. They show no minimal difference with each other, but have a big leap from the next region.North America and Asia showed to have similar grouping like Europe and Ocenia. Then followed by Africa, South America and Middle East.
This has some logic when comparing with proven oil reserves. Middle East is logical due to the high reserves in comparison to Europe and Oceania. There could also however be a relationship between fuel prices, and the countries’ subsidy and income level. South America has high reserve countries such as Venuzuela and Brazil, though not as much as the Middle East. Therefore, a proper analysis of specific countries and its levels should be conducted.
This code calculates the mean, standard deviation, minimum, and
maximum petrol price for each country by grouping the data by
country and then summarizing it. The countries are then
arranged in descending order based on the standard deviation of petrol
prices to identify which countries have the most variability in their
petrol prices. The top 10 most volatile countries are extracted for
further analysis.
The variability analysis showed that several countries experienced significantly higher fluctuations in petrol prices compared to others. Countries with higher standard deviation values indicate less stable fuel pricing trends over time, potentially due to weaker subsidy controls, economic instability, or stronger sensitivity to global crude oil price movements. In contrast, countries with lower variability demonstrated more stable fuel prices, which may suggest stronger regulatory or subsidy mechanisms.
Looking at boxplot, we can see that the the distribution for each level overlaps with another. Both low and middle income level have a distribution between 0 and 4 USD/L, though low income level has a low median than middle income level. low income level is shown to have a normal distribution, while middle income is positively skewed. On the other hand, high income level has a huge distribution, between approximately 0 and 7 USD/L. It is skewed to the upper percentile, and some outliers can be seen in the lower percentile.
The graph suggests that income level does play a role in the petrol prices, though a high income level is not always an indicator of a high petrol price. An example would be the UAE which has a high income level but low petrol price.
The boxplot shows a clear inverse relationship between subsidy level and petrol price. It shows a decrease in the petrol price distribution from low to very high with each level tightening the distribution. Some outliers are present in the medium and high, though no outliers are on low despite the wide distribution - suggesting that the distribution is spread apart on the 25% and 100% quartile.
As expected, petrol, diesel, and LPG prices show extremely strong positive correlations with one another, with correlation coefficients close to 0.99. This indicates that the prices of different fuel types generally move together over time.
In contrast, Brent crude oil prices show only weak positive correlations with the fuel prices, with coefficients around 0.25 to 0.26. This suggests that while crude oil prices may influence fuel prices, they are not the sole determinant of retail fuel pricing. Other factors such as subsidies, taxation policies, and regional economic conditions may also contribute significantly.
Tax percentage shows almost no correlation with Brent crude oil prices but demonstrates moderate positive correlations of approximately 0.52 with fuel prices. This suggests that countries with higher fuel taxation tend to have higher retail fuel prices.
quarterly_country <- clean_data %>%
group_by(country, year_quarter) %>%
summarise(
mean_petrol = mean(petrol_usd_liter, na.rm = TRUE),
.groups = "drop"
)
quarterly_country <- quarterly_country %>%
group_by(country) %>%
arrange(year_quarter) %>%
mutate(
pct_change = (
(mean_petrol - lag(mean_petrol)) /
lag(mean_petrol)
) * 100
)
This code calculates the average petrol price for each country by
quarter and then computes the percentage change in petrol price from one
quarter to the next for each country. The lag function is
used to access the previous quarter’s mean petrol price for the
percentage change calculation.
The distribution of quarterly petrol price changes is concentrated mainly between -10% and 10%, with the highest frequency occurring in the positive 0% to 10% range. This suggests that most quarterly petrol price movements are relatively moderate, with price increases occurring more frequently than decreases.
Although extreme changes are less common, several larger positive changes above 30% are present. Negative changes also appear frequently, particularly within the -20% to -10% range, indicating that substantial price drops occurred in certain periods. Overall, the distribution suggests that petrol prices generally experience moderate quarterly fluctuations with occasional periods of high volatility.
This code categorizes the percentage change in petrol price into
three impact levels: “Low” for changes less than 2%, “Medium” for
changes between 2% and 5%, and “High” for changes greater than or equal
to 5%. The case_when function is used to create the
impact_level variable based on the absolute value of the
percentage change. The resulting impact_level variable is
then converted into an ordered factor with specified levels.
The impact level distribution shows that most observations fall within the high impact category, followed by medium and low impact levels. This is consistent with the previous histogram analysis, where the majority of quarterly petrol price changes were concentrated around the -10% to 10% range. High impact observations occur less frequently, indicating that extreme quarterly fuel price fluctuations are relatively common.
This report documents the end-to-end data preparation workflow required to clean, transform, and engineer features from our raw fuel market data. The final output generates two highly specialized datasets: a monthly time-series file structured for regression modeling, and a quarterly file tailored for classification tasks.
To build a reproducible data pipeline, we initialize our environment
by loading the tidyverse library for core data manipulation
and reshaping, alongside lubridate for specialized
date-time parsing operations.
library(tidyverse) # Core data science packages (dplyr, tidyr, stringr, readr)
library(lubridate) # Specialized package for date conversion
library(readr)
df <- read_csv("data_clean/fuel_prices_cleaned_final.csv")
The raw input dataset tracks retail asset classes across multiple adjacent wide columns. To make this data compatible with tidy data principles, we first parse the text dates into structured date values and then collapse the multiple fuel columns into a single categorical key-value pair.
# 1. Date Conversion & Feature Extraction
df_clean <- df |>
mutate(
# Convert string to Date format. 'dmy' parses Day/Month/Year correctly.
date = as.Date(date),
# Extract temporal features for downstream grouping
year = year(date),
month = month(date),
quarter = quarter(date)
)
# 2. Data Pivoting
df_long <- df_clean |>
pivot_longer(
cols = c(petrol_usd_liter, diesel_usd_liter, lpg_usd_liter),
names_to = "fuel_type", # New column holding the categorical fuel names
values_to = "retail_price_usd" # New column holding the numerical prices
)
Reshaping variables automatically preserves technical column tags (such as _usd_liter). We apply regular expression string adjustments to strip out technical labels and normalize the text into title case format to ensure clean reporting aesthetics.
df_long <- df_long %>%
mutate(
# Remove the redundant "_usd_liter" text from the fuel types
fuel_type = str_replace(fuel_type, "_usd_liter", ""),
fuel_type = str_to_title(fuel_type)
)
To support multi-tier behavioral analysis, continuous numbers are grouped into distinct market classes. Retail rates and tax metrics are split into equal statistical thirds (tertiles) driven by data distributions, while crude oil indicators are segmented using deterministic macro-economic price blocks.
df_long <- df_long %>%
mutate(
# A. Retail Price Categories (Using equal statistical thirds)
price_category = cut(
retail_price_usd,
breaks = quantile(retail_price_usd, probs = c(0, 1 / 3, 2 / 3, 1), na.rm = TRUE),
labels = c("Low Retail", "Medium Retail", "High Retail"),
include.lowest = TRUE
),
# B. Tax Categories (Using equal statistical thirds)
tax_category = cut(
tax_percentage,
breaks = quantile(tax_percentage, probs = c(0, 1 / 3, 2 / 3, 1), na.rm = TRUE),
labels = c("Low Tax", "Medium Tax", "High Tax"),
include.lowest = TRUE
),
# C. Crude Oil Categories (Using specific rule-based thresholds)
crude_category = case_when(
brent_crude_usd < 50 ~ "Cheap Crude",
brent_crude_usd >= 50 & brent_crude_usd <= 80 ~ "Normal Crude",
brent_crude_usd > 80 ~ "Expensive Crude",
TRUE ~ "Unknown"
)
)
With foundational adjustments finalized, we construct our two final target analytical arrays.
This pipeline groups observations into monthly country footprints, preserving structural local parameters like tax classifications. We also extract a sequential lag index to evaluate historical, trailing market dependencies.
q1_monthly_data <- df_long %>%
# Included tax_category so it doesn't get deleted
group_by(country, region, income_level, tax_category, year, month, fuel_type) %>%
summarize(
avg_retail_price = mean(retail_price_usd, na.rm = TRUE),
avg_crude_price = mean(brent_crude_usd, na.rm = TRUE),
.groups = "drop"
) %>%
# Create Historical Behavior (Lagged Variable)
arrange(country, fuel_type, year, month) %>% # Sort chronologically first
group_by(country, fuel_type) %>%
mutate(
# This grabs the crude price from 1 month ago
historical_crude_price = lag(avg_crude_price, n = 1)
) %>%
ungroup() %>%
drop_na(historical_crude_price) # Drops the very first month which has no history
print("--- Q1 Monthly Data Ready ---")
## [1] "--- Q1 Monthly Data Ready ---"
head(q1_monthly_data)
## # A tibble: 6 x 10
## country region income_level tax_category year month fuel_type avg_retail_price
## <chr> <chr> <chr> <fct> <dbl> <dbl> <chr> <dbl>
## 1 Algeria Africa Middle Medium Tax 2020 1 Diesel 0.045
## 2 Algeria Africa Middle Low Tax 2020 2 Diesel 0.0695
## 3 Algeria Africa Middle Medium Tax 2020 2 Diesel 0.058
## 4 Algeria Africa Middle Low Tax 2020 3 Diesel 0.054
## 5 Algeria Africa Middle Medium Tax 2020 3 Diesel 0.0727
## 6 Algeria Africa Middle Low Tax 2020 4 Diesel 0.074
## # i 2 more variables: avg_crude_price <dbl>, historical_crude_price <dbl>
This sequence maps records onto a wider quarterly footprint. It handles the mathematical computation of quarter-over-quarter growth rates for local retail structures and global crude oil benchmarks, using these rates to establish a hard boundary for our classification target impact_level.
q2_quarterly_data <- df_long %>%
group_by(country, region, income_level, tax_category, year, quarter, fuel_type) %>%
summarize(
avg_retail_price = mean(retail_price_usd, na.rm = TRUE),
avg_crude_price = mean(brent_crude_usd, na.rm = TRUE),
.groups = "drop"
) %>%
# Calculate Percentage Changes and "Impact Level" Class
arrange(country, fuel_type, year, quarter) %>% # Sort chronologically first
group_by(country, fuel_type) %>%
mutate(
# Calculate Quarter-over-Quarter % Change
retail_pct_change = (avg_retail_price - lag(avg_retail_price)) / lag(avg_retail_price) * 100,
crude_pct_change = (avg_crude_price - lag(avg_crude_price)) / lag(avg_crude_price) * 100
) %>%
ungroup() %>%
mutate(
# Create the Target Variable for Classification based on the price change
impact_level = case_when(
abs(retail_pct_change) > 5 ~ "High Impact",
abs(retail_pct_change) >= 2 ~ "Medium Impact",
abs(retail_pct_change) < 2 ~ "Low Impact",
TRUE ~ NA_character_
)
) %>%
drop_na(impact_level)
print("--- Q2 Quarterly Data Ready ---")
## [1] "--- Q2 Quarterly Data Ready ---"
head(q2_quarterly_data)
## # A tibble: 6 x 12
## country region income_level tax_category year quarter fuel_type avg_retail_price
## <chr> <chr> <chr> <fct> <dbl> <int> <chr> <dbl>
## 1 Algeria Africa Middle Medium Tax 2020 1 Diesel 0.0632
## 2 Algeria Africa Middle Low Tax 2020 2 Diesel 0.0623
## 3 Algeria Africa Middle Medium Tax 2020 2 Diesel 0.0675
## 4 Algeria Africa Middle Low Tax 2020 3 Diesel 0.0804
## 5 Algeria Africa Middle Medium Tax 2020 3 Diesel 0.0865
## 6 Algeria Africa Middle Low Tax 2020 4 Diesel 0.0989
## # i 4 more variables: avg_crude_price <dbl>, retail_pct_change <dbl>,
## # crude_pct_change <dbl>, impact_level <chr>
| country | region | income_level | tax_category | year | month | fuel_type | avg_retail_price | avg_crude_price | historical_crude_price |
|---|---|---|---|---|---|---|---|---|---|
| Algeria | Africa | Middle | Medium Tax | 2020 | 1 | Diesel | 0.0450000 | 66.510 | 66.69333 |
| Algeria | Africa | Middle | Low Tax | 2020 | 2 | Diesel | 0.0695000 | 71.035 | 66.51000 |
| Algeria | Africa | Middle | Medium Tax | 2020 | 2 | Diesel | 0.0580000 | 68.265 | 71.03500 |
| Algeria | Africa | Middle | Low Tax | 2020 | 3 | Diesel | 0.0540000 | 60.395 | 68.26500 |
| Algeria | Africa | Middle | Medium Tax | 2020 | 3 | Diesel | 0.0726667 | 64.640 | 60.39500 |
The dataset features nine independent variables: -
country - region - income_level -
tax_category - year - month -
fuel_type - avg_crude_price -
historical_crude_price
The target variable is avg_retail_price.
| country | region | income_level | tax_category | year | quarter | fuel_type | avg_retail_price | avg_crude_price | retail_pct_change | crude_pct_change | impact_level |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Algeria | Africa | Middle | Medium Tax | 2020 | 1 | Diesel | 0.0631667 | 66.16000 | -7.496513 | 0.0388819 | High Impact |
| Algeria | Africa | Middle | Low Tax | 2020 | 2 | Diesel | 0.0622727 | 53.55273 | -1.415207 | -19.0557327 | Low Impact |
| Algeria | Africa | Middle | Medium Tax | 2020 | 2 | Diesel | 0.0675000 | 53.95500 | 8.394161 | 0.7511713 | High Impact |
| Algeria | Africa | Middle | Low Tax | 2020 | 3 | Diesel | 0.0804286 | 67.91714 | 19.153439 | 25.8773846 | High Impact |
| Algeria | Africa | Middle | Medium Tax | 2020 | 3 | Diesel | 0.0865000 | 71.66667 | 7.548845 | 5.5207325 | High Impact |
The dataset features 11 independent variables: - country
- region - income_level -
tax_category - year - quarter -
fuel_type - avg_retail_price -
avg_crude_price - retail_pct_change -
crude_pct_change
The target variable is impact_level.
To ensure data continuity and freeze state baselines for independent downstream model validation, both structures are written out to dedicated local storage nodes.
# Extracting datasets
write_csv(q1_monthly_data, "data_processed/Q1_Monthly_Regression_Data.csv")
write_csv(q2_quarterly_data, "data_processed/Q2_Quarterly_Classification_Data.csv")
print("Success! Both CSV files have been extracted and saved to your folder.")
In this section of our analytics project, we developed an advanced machine learning pipeline to forecast global retail fuel prices. To capture complex, non-linear market behaviours such as regional policy variations and the lag effects of global oil shocks, we executed a comparative tournament between three different methodologies: Linear Regression (Elastic Net), Random Forest, and Extreme Gradient Boosting (XGBoost).
In this section, we load the required packages from the tidymodels framework alongside specialized libraries for model explanation (vip) and parallel computation processing engines.
library(readr)
library(tidyverse)
library(tidymodels)
library(vip) # For feature importance plots
# Load the regression dataset
df <- read_csv("data_processed/Q1_Monthly_Regression_Data.csv")
Because this dataset tracks multiple countries across time (panel/time-series structure), random slicing would result in data leakage. We establish an explicit chronological timeline sorting by dates, and encode ordinal classifications scale adjustments.
# Create a proper Date column and sort chronologically
df <- df %>%
mutate(date = as.Date(paste(year, month, "01", sep = "-"))) %>%
arrange(date)
# Define levels for ordinal categories
tax_levels <- c("Low Tax", "Medium Tax", "High Tax")
income_levels <- c("Low", "Middle", "High")
# Encode categorical data types
df$country <- as.factor(df$country)
df$region <- as.factor(df$region)
df$fuel_type <- as.factor(df$fuel_type)
# Convert ordered factors to numeric scales so ML algorithms can read their progression scale
df$tax_category <- as.numeric(factor(df$tax_category, levels = tax_levels, ordered = TRUE))
df$income_level <- as.numeric(factor(df$income_level, levels = income_levels, ordered = TRUE))
We split the data sequentially: the oldest 80% is used to build the predictive engine, while the newest 20% remains untouched to simulate real-world forward-testing. Cross-validation uses a rolling-origin time framework to map sliding windows.
set.seed(123)
# Chronological Split: Oldest 80% for training, newest 20% for testing
data_split <- initial_time_split(df, prop = 0.80)
train_data <- training(data_split)
test_data <- testing(data_split)
# Calculate how many rows equal exactly "1 Month" of data across all countries
rows_per_month <- train_data %>%
count(year, month) %>%
summarise(avg_rows = mean(n)) %>%
pull(avg_rows) %>%
round()
# Set up the rolling origin cross-validation based on rows
# e.g., 24 months lookback window, 6 months evaluation window, sliding 3 months forward
folds <- rolling_origin(
train_data,
initial = rows_per_month * 24,
assess = rows_per_month * 6,
skip = rows_per_month * 3,
cumulative = TRUE
)
We define a workflow preprocessing standard to handle new factor variations seamlessly via dummy variable steps. We set up an engine testing matrix across Elastic Net Linear Regressions, XGBoost, and Random Forest specifications.
base_recipe <- recipe(avg_retail_price ~ ., data = train_data) %>%
# Assign time/date trackers an "ID" role so they aren't used as direct linear predictors
update_role(year, month, date, new_role = "ID") %>%
step_novel(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_dummy(all_nominal_predictors())
# Linear Model Spec
linear_spec <- linear_reg(penalty = tune(), mixture = tune()) %>%
set_engine("glmnet") %>%
set_mode("regression")
# XGBoost Spec
xgb_spec <- boost_tree(trees = tune(), tree_depth = tune(), learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
# Random Forest Spec
rf_spec <- rand_forest(trees = tune(), min_n = tune()) %>% # Removed mtry = tune()
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
# Combine into a Workflow Set
model_workflows <- workflow_set(
preproc = list(base_rec = base_recipe),
models = list(linear = linear_spec, xgb = xgb_spec, rf = rf_spec),
cross = TRUE
)
We run grid search matrices across all 3 structural workflow designs.
set.seed(123)
library(doParallel)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
parallel_control <- control_grid(
save_pred = TRUE,
parallel_over = "everything",
verbose = FALSE,
)
tune_results <- model_workflows %>%
workflow_map(
fn = "tune_grid",
seed = 123,
resamples = folds,
grid = 10,
metrics = metric_set(rmse, mae, rsq, mape, huber_loss),
control = parallel_control
)
stopCluster(cl)
This chart details the cross-validation score summaries across each evaluated grid option.
autoplot(tune_results, metric = "rmse")
The system reads across the finalised cross-validation profiles, filters down to select the architecture minimising RMSE, and assigns its hyperparameter choices onto the holdout performance verification test set.
# Programmatically find the absolute best model ID based on the lowest validation RMSE
best_model_id <- tune_results %>%
rank_results(rank_metric = "rmse", select_best = TRUE) %>%
filter(rank == 1) %>%
slice(1) %>%
pull(wflow_id)
# Extract best tuning results and parameters for that winning model
winning_tune_res <- tune_results %>% extract_workflow_set_result(best_model_id)
best_parameters <- winning_tune_res %>% select_best(metric = "rmse")
# Finalize the workflow with the winner's best hyperparameters
final_wf <- tune_results %>%
extract_workflow(best_model_id) %>%
finalize_workflow(best_parameters)
# Fit to the complete Training data and evaluate on the unseen holdout Test set
final_fit <- last_fit(
final_wf,
data_split,
metrics = metric_set(rmse, mae, rsq, mape, huber_loss)
)
# Extract best tuning results and parameters for that winning model
linear_tune_res <- tune_results %>% extract_workflow_set_result("base_rec_linear")
linear_best_parameters <- linear_tune_res %>% select_best(metric = "rmse")
# Finalize the workflow with the winner's best hyperparameters
linear_final_wf <- tune_results %>%
extract_workflow("base_rec_linear") %>%
finalize_workflow(linear_best_parameters)
# Fit to the complete Training data and evaluate on the unseen holdout Test set
linear_final_fit <- last_fit(
linear_final_wf,
data_split,
metrics = metric_set(rmse, mae, rsq, mape, huber_loss)
)
collect_metrics(linear_final_fit)
## # A tibble: 5 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.271 pre0_mod0_post0
## 2 mae standard 0.199 pre0_mod0_post0
## 3 rsq standard 0.984 pre0_mod0_post0
## 4 mape standard 68.4 pre0_mod0_post0
## 5 huber_loss standard 0.0367 pre0_mod0_post0
# Extract best tuning results and parameters for that winning model
rf_tune_res <- tune_results %>% extract_workflow_set_result("base_rec_rf")
rf_best_parameters <- rf_tune_res %>% select_best(metric = "rmse")
# Finalize the workflow with the winner's best hyperparameters
rf_final_wf <- tune_results %>%
extract_workflow("base_rec_rf") %>%
finalize_workflow(rf_best_parameters)
# Fit to the complete Training data and evaluate on the unseen holdout Test set
rf_final_fit <- last_fit(
rf_final_wf,
data_split,
metrics = metric_set(rmse, mae, rsq, mape, huber_loss)
)
collect_metrics(rf_final_fit)
## # A tibble: 5 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.427 pre0_mod0_post0
## 2 mae standard 0.323 pre0_mod0_post0
## 3 rsq standard 0.951 pre0_mod0_post0
## 4 mape standard 75.3 pre0_mod0_post0
## 5 huber_loss standard 0.0890 pre0_mod0_post0
The algorithm has dynamically selected base_rec_xgb as the winning model structure. Below are its optimal tuned parameters:
knitr::kable(best_parameters, caption = "Winning Model Hyperparameter Values")
| trees | tree_depth | learn_rate | .config |
|---|---|---|---|
| 2000 | 10 | 0.0464159 | pre0_mod10_post0 |
The table below presents the model’s accuracy scores evaluated on the holdout test phase across all 5 requested evaluation metrics:
final_metrics <- collect_metrics(final_fit)
knitr::kable(final_metrics, caption = "Holdout Test Set Metric Diagnostics")
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| rmse | standard | 0.0676905 | pre0_mod0_post0 |
| mae | standard | 0.0298363 | pre0_mod0_post0 |
| rsq | standard | 0.9979924 | pre0_mod0_post0 |
| mape | standard | 5.3420236 | pre0_mod0_post0 |
| huber_loss | standard | 0.0022111 | pre0_mod0_post0 |
This chart verifies goodness-of-fit. Points centering tightly around the 45-degree diagonal path illustrate a robust baseline alignment.
predictions_df <- collect_predictions(final_fit)
scatterplot <- ggplot(predictions_df, aes(x = avg_retail_price, y = .pred)) +
geom_point(alpha = 0.4, color = "midnightblue") +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed", linewidth = 1) +
theme_minimal(base_size = 12) +
labs(
title = "Actual vs. Predicted Retail Prices",
subtitle = paste("Winning Model Architecture:", best_model_id),
x = "Actual Retail Price",
y = "Predicted Retail Price"
)
print(scatterplot)
Residual points scattered randomly around the center line indicate that the predictive errors remain unbiased across the scale.
predictions_df <- predictions_df %>% mutate(residual = avg_retail_price - .pred)
residuals_plot <- ggplot(predictions_df, aes(x = .pred, y = residual)) +
geom_point(alpha = 0.4, color = "darkred") +
geom_hline(yintercept = 0, color = "black", linetype = "dashed", linewidth = 1) +
theme_minimal(base_size = 12) +
labs(
title = "Model Residuals Plot",
subtitle = "Points should be randomly scattered around the 0 line",
x = "Predicted Price (Fitted Values)",
y = "Residual (Actual - Predicted)"
)
print(residuals_plot)
This chart highlights which specific macroeconomic variations, regional variables, or tax policies dominate the winning model’s internal prediction splits.
final_model_fit <- extract_fit_parsnip(final_fit)
vip_plot <- vip(final_model_fit, num_features = 10, geom = "point") +
theme_minimal(base_size = 12) +
labs(
title = "Top 10 Most Important Predictors",
subtitle = paste("Extracted dynamically from:", best_model_id),
x = "Predictors/Features",
y = "Relative Importance Weight"
)
print(vip_plot)
# Save the final fitted models and test data snapshot for later evaluation
saveRDS(linear_final_fit, "models/regression/final_linear_model.rds")
saveRDS(final_fit, "models/regression/final_xgboost_model.rds")
saveRDS(rf_final_fit, "models/regression/final_rf_model.rds")
saveRDS(test_data, "models/regression/test_data_snapshot.rds")
Load the required packages and create the output folders.
library(tidyverse)
library(tidymodels)
library(kableExtra)
library(yardstick)
# Strict pathing constraints to neutralize working directory mismatches
dir_model_in <- "models/regression/"
dir_eval_out <- "models/evaluation_regression/"
if (!dir.exists(dir_eval_out)) {
dir.create(dir_eval_out, recursive = TRUE)
}
Import the saved models and the exact test dataset snapshot.
# Path-safe validation check before deserializing objects
paths <- c(
paste0(dir_model_in, "final_linear_model.rds"),
paste0(dir_model_in, "final_xgboost_model.rds"),
paste0(dir_model_in, "final_rf_model.rds"),
paste0(dir_model_in, "test_data_snapshot.rds")
)
for (p in paths) {
if (!file.exists(p)) stop(paste("Critical file missing at destination path:", p))
}
final_fit_linear <- readRDS(paste0(dir_model_in, "final_linear_model.rds"))
final_fit_xgb <- readRDS(paste0(dir_model_in, "final_xgboost_model.rds"))
final_fit_rf <- readRDS(paste0(dir_model_in, "final_rf_model.rds"))
test_data_cleaned <- readRDS(paste0(dir_model_in, "test_data_snapshot.rds"))
Use augment() to calculate predictions and residuals for
all models.
scored_data_long <- bind_rows(
augment(extract_workflow(final_fit_linear), new_data = test_data_cleaned) %>% mutate(Model = "Elastic Net Linear"),
augment(extract_workflow(final_fit_xgb), new_data = test_data_cleaned) %>% mutate(Model = "XGBoost"),
augment(extract_workflow(final_fit_rf), new_data = test_data_cleaned) %>% mutate(Model = "Random Forest")
) %>%
mutate(Residual = avg_retail_price - .pred)
Calculate standard error metrics to find the best performing model.
# Instantiate metric set tracking continuous error distances
regression_metrics <- metric_set(rmse, mae, rsq, mape, huber_loss)
comparison_df <- scored_data_long %>%
group_by(Model) %>%
regression_metrics(truth = avg_retail_price, estimate = .pred) %>%
select(Model, .metric, .estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate) %>%
inner_join(
scored_data_long %>%
group_by(Model) %>%
summarise(Max_Error = max(abs(Residual), na.rm = TRUE), .groups = "drop"),
by = "Model"
)
# Output styled interactive summary table
comparison_df %>%
kbl(caption = "Predictive Accuracy Comparison Matrix", booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(which.min(comparison_df$rmse), bold = TRUE, background = "#e8f8f5")
| Model | rmse | mae | rsq | mape | huber_loss | Max_Error |
|---|---|---|---|---|---|---|
| Elastic Net Linear | 0.2708419 | 0.1990199 | 0.9841798 | 68.425086 | 0.0366700 | 1.207740 |
| Random Forest | 0.4273334 | 0.3226139 | 0.9510472 | 75.282361 | 0.0890235 | 2.350927 |
| XGBoost | 0.0676905 | 0.0298363 | 0.9979924 | 5.342024 | 0.0022111 | 1.800150 |
Measure how fast each model runs predictions on the test set.
speed_tracking <- map_df(
list("Elastic Net Linear" = final_fit_linear, "XGBoost" = final_fit_xgb, "Random Forest" = final_fit_rf),
function(wf) {
start_time <- Sys.time()
# Extract the underlying trained workflow before running predict()
invisible(predict(extract_workflow(wf), new_data = test_data_cleaned))
duration <- as.numeric(difftime(Sys.time(), start_time, units = "secs"))
return(data.frame(Execution_Time_Secs = round(duration, 5)))
},
.id = "Model"
)
speed_tracking %>%
kbl(caption = "Operational Latency & Speed Benchmark Profiles") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Model | Execution_Time_Secs |
|---|---|
| Elastic Net Linear | 0.11637 |
| XGBoost | 0.33783 |
| Random Forest | 0.50267 |
Break down error metrics by fuel type and country to check for specific model biases.
# Product Categorization Group Breakdown
fuel_diagnostic_matrix <- scored_data_long %>%
group_by(Model, fuel_type) %>%
regression_metrics(truth = avg_retail_price, estimate = .pred) %>%
filter(.metric %in% c("rmse", "mae")) %>%
select(Model, fuel_type, .metric, .estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate)
# Geographical Regional Breakdown
regional_diagnostic_matrix <- scored_data_long %>%
group_by(Model, country) %>%
regression_metrics(truth = avg_retail_price, estimate = .pred) %>%
filter(.metric %in% c("rmse", "mae")) %>%
select(Model, country, .metric, .estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate)
fuel_diagnostic_matrix %>%
kbl(caption = "Subgroup Error Distributions by Fuel Type Classification") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Model | fuel_type | rmse | mae |
|---|---|---|---|
| Elastic Net Linear | Diesel | 0.2872869 | 0.2373000 |
| Elastic Net Linear | Lpg | 0.0793382 | 0.0601667 |
| Elastic Net Linear | Petrol | 0.3622725 | 0.2996082 |
| Random Forest | Diesel | 0.4407890 | 0.3453756 |
| Random Forest | Lpg | 0.2878705 | 0.2124926 |
| Random Forest | Petrol | 0.5202707 | 0.4099826 |
| XGBoost | Diesel | 0.0710461 | 0.0323077 |
| XGBoost | Lpg | 0.0361329 | 0.0176621 |
| XGBoost | Petrol | 0.0859829 | 0.0395400 |
Predictive Error Profile Layouts explicitly quantify how effectively each algorithm minimizes price prediction errors across MAE, RMSE, and R-squared metrics. XGBoost decisively dominates the comparison, dropping prediction errors to a mere fraction of the other models with a remarkably low MAE and RMSE, drastically outperforming Random Forest’s higher error rates. This massive reduction in error is paired with a near-perfect R-squared value of 0.998, confirming that XGBoost captures nearly 100% of the price variance and establishes itself as the most accurate model by every metric.
The Goodness-of-Fit plots reveal that while Elastic Net and Random Forest systematically underpredict higher values and visibly bow below the 45-degree line once actual prices exceed 4, XGBoost demonstrates near-perfect calibration. Its predictions tightly hug the diagonal parity line across the entire pricing spectrum, confirming its ability to accurately capture fluctuations and peaks with minimal deviation.
The operational efficiency frontier evaluates model suitability for deployment by analyzing the multi-objective trade-off between predictive accuracy (RMSE) and computational latency. While the Elastic Net Linear model minimizes inference latency at the expense of precision, the Random Forest architecture proves suboptimal, exhibiting both elevated error rates and high computational overhead. Consequently, XGBoost establishes the optimal Pareto-efficient compromise, yielding the lowest overall error rate while maintaining a highly acceptable inference speed of approximately one second.
Save all generated data frames to the evaluation folder for reporting.
saveRDS(comparison_df, paste0(dir_eval_out, "metrics_comparison.rds"))
saveRDS(speed_tracking, paste0(dir_eval_out, "operational_speed.rds"))
saveRDS(scored_data_long, paste0(dir_eval_out, "error_profiles_snapshot.rds"))
saveRDS(fuel_diagnostic_matrix, paste0(dir_eval_out, "fuel_deep_dive_metrics.rds"))
saveRDS(regional_diagnostic_matrix, paste0(dir_eval_out, "regional_deep_dive_metrics.rds"))
The regression modeling phase has conclusively identified XGBoost as the superior predictive engine for forecasting retail fuel prices, achieving a remarkable balance of accuracy and operational efficiency. Its ability to capture complex non-linear relationships and interactions within the data has resulted in significantly lower error rates compared to both Elastic Net Linear Regression and Random Forest models. The comprehensive evaluation across multiple metrics and subgroups further validates XGBoost’s robustness and suitability for deployment in real-world forecasting scenarios.
Same as regression modeling, we will be building 3 classification models, and compare their performance to select the best one for our analysis. The 3 models are: 1. Multinomial Logistic Regression 2. Random Forest 3. Extreme Gradient Boosting (XGBoost)
Multinomial Logistic Regression acts as a strong linear baseline,
while Random Forest and XGBoost capture non-linear relationships and
interactions. We will evaluate model performance using
accuracy, precision, recall,
F1-score, and AUC-ROC metrics to determine the best
classification model for predicting impact_level.
The workflow will be similar to the regression modeling section. The
data will be encoded according to the features, and then split into
training and testing sets. As the data is time-series in nature,
time-based split is used to avoid data leakage. Cross-validation will
also be used with a rolling origin approach to evaluate model
performance during hyperparameter tuning by utilizing the
tidymodels framework.
Tidymodels is a collection of packages for modeling and
statistical analysis in R. It provides a consistent interface for model
specification, tuning, and evaluation across different algorithms,
making it easier to compare the performance of the multinomial logistic
regression, random forest, and XGBoost models in a structured manner. It
also has recipe which provides a consistent interface for
data preprocessing. However, the cross-validation uses the
timetk package to create time-based resampling splits,
which is more suitable for time-series data. This allows to evaluate the
models’ performance in a way that respects the temporal order of the
data, and ensuring that the model evaluation is realistic for
forecasting tasks.
This section will cover the data preparation steps for the
classification modeling. It will be using the quarterly dataset that we
prepared in the data preparation section, which contains the
impact_level variable as our target for classification.
This step includes splitting the data into training and testing sets,
encoding categorical variables, and setting up the cross-validation
strategy for time-series data.
library(tidymodels)
library(tidyverse)
library(rsample)
library(dplyr)
library(tsibble)
library(timetk)
Descriptions of Each Package: -
tidymodels: A collection of packages for modeling and
statistical analysis in R, providing a consistent interface for model
specification, tuning, and evaluation across different algorithms. -
tidyverse: A collection of R packages for data
manipulation, visualization, and analysis - rsample:
Provides functions for creating resamples of a dataset, such as
cross-validation splits, which is essential for model evaluation. -
dplyr: A package for data manipulation, providing functions
for filtering, summarizing, and transforming data. -
tsibble: A package for working with time-series data in a
tidy format, allowing for easy manipulation and analysis of time-indexed
data. - timetk: A package for working with time-series
data, providing functions for creating time-based resampling splits.
df <- read_csv("data_processed/Q2_Quarterly_Classification_Data.csv")
tax_levels <- c("Low Tax", "Medium Tax", "High Tax")
income_levels <- c("Low", "Middle", "High")
impact_levels <- c("Low Impact", "Medium Impact", "High Impact")
# encode data
# nominal data
df$country <- as.factor(df$country)
df$region <- as.factor(df$region)
# ordinal data
df$tax_category <- factor(
df$tax_category,
levels = tax_levels,
ordered = TRUE
)
df$income_level <- factor(
df$income_level,
levels = income_levels,
ordered = TRUE
)
df$impact_level <- factor(
df$impact_level,
levels = impact_levels,
ordered = TRUE
)
A new feature engineering step is added to create a date
column by combining the year and quarter
columns. This allows for a proper time index for our time-series data to
enable time-based splits and resampling strategies.
df |>
mutate(time_index = yearquarter(paste(year, "Q", quarter, sep = ""))) |>
group_by(country, fuel_type, time_index) |>
summarise(unique_impacts = n_distinct(impact_level), .groups = "drop") |>
filter(unique_impacts > 1)
## # A tibble: 4,462 x 4
## country fuel_type time_index unique_impacts
## <fct> <chr> <qtr> <int>
## 1 Algeria Diesel 2021 Q1 2
## 2 Algeria Diesel 2021 Q4 2
## 3 Algeria Diesel 2022 Q3 2
## 4 Algeria Diesel 2023 Q4 2
## 5 Algeria Diesel 2024 Q3 2
## 6 Algeria Diesel 2024 Q4 2
## 7 Algeria Diesel 2025 Q1 2
## 8 Algeria Diesel 2025 Q4 2
## 9 Algeria Diesel 2026 Q1 2
## 10 Algeria Lpg 2021 Q1 2
## # i 4,452 more rows
df_ts <- df |>
mutate(time_index = yearquarter(paste(year, "Q", quarter, sep = ""))) |>
# Adding tax_category to the keys separates the data properly
as_tsibble(index = time_index, key = c(country, fuel_type, tax_category))
glimpse(df_ts)
## Rows: 15,339
## Columns: 13
## Key: country, fuel_type, tax_category [696]
## $ country <fct> Algeria, Algeria, Algeria, Algeria, Algeria, Algeria, Algeria,~
## $ region <fct> Africa, Africa, Africa, Africa, Africa, Africa, Africa, Africa~
## $ income_level <ord> Middle, Middle, Middle, Middle, Middle, Middle, Middle, Middle~
## $ tax_category <ord> Low Tax, Low Tax, Low Tax, Low Tax, Low Tax, Low Tax, Low Tax,~
## $ year <dbl> 2020, 2020, 2020, 2021, 2021, 2021, 2021, 2022, 2022, 2022, 20~
## $ quarter <dbl> 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2,~
## $ fuel_type <chr> "Diesel", "Diesel", "Diesel", "Diesel", "Diesel", "Diesel", "D~
## $ avg_retail_price <dbl> 0.06227273, 0.08042857, 0.09885714, 0.11075000, 0.12280600, 0.~
## $ avg_crude_price <dbl> 53.55273, 67.91714, 89.52857, 100.03250, 103.88833, 109.70000,~
## $ retail_pct_change <dbl> -15.626473, 19.153439, 14.285714, 27.298851, 16.514234, 17.305~
## $ crude_pct_change <dbl> -19.05573266, 25.87738459, 24.92358804, 14.97107501, 3.2379343~
## $ impact_level <ord> High Impact, High Impact, High Impact, High Impact, High Impac~
## $ time_index <qtr> 2020 Q2, 2020 Q3, 2020 Q4, 2021 Q1, 2021 Q2, 2021 Q3, 2021 Q4,~
range(df_ts$time_index)
## <yearquarter[2]>
## [1] "2020 Q1" "2026 Q2"
## # Year starts on: January
# Train: 2020 Q1 to 2024 Q4
train_data_raw <- df_ts |>
filter(time_index <= yearquarter("2024 Q4"))
# Test: 2025 Q1 to 2026 Q2 (The rest of the data)
test_data_raw <- df_ts |>
filter(time_index >= yearquarter("2025 Q1"))
# Convert test data to standard tibble and turn yearquarter into a standard Date
test_data <- test_data_raw |>
as_tibble() |>
mutate(time_index = as.Date(time_index))
train_data <- train_data_raw |>
as_tibble() |>
mutate(time_index = as.Date(time_index)) # Converts "2020 Q1" to "2020-01-01"
folds <- time_series_cv(
data = train_data,
date_var = time_index,
initial = "36 months", # Equivalent to 12 quarters
assess = "6 months", # Equivalent to 2 quarters
skip = "3 months", # Equivalent to 1 quarter
cumulative = TRUE
)
A rolling-origin time-series cross-validation was used to evaluate the model without data leakage. The initial training window was set to 36 months (12 quarters) to provide a sufficient baseline for capturing stable seasonal patterns. To ensure the evaluation mirrors the actual forecasting objective, a 6-month (2-quarter) assessment window was selected. The training origin advances using a 3-month skip, which aligns with quarterly reporting cycles and maintains computational efficiency. Finally, a cumulative window ensures the model retains all historical data as it moves forward, accurately replicating a production environment where the model continuously learns from the entire past.
classification_recipe <- recipe(impact_level ~ ., data = train_data) |>
step_mutate(time_index = as.numeric(time_index)) |>
step_novel(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors(), -all_outcomes()) |>
step_zv(all_predictors())
A data preprocessing recipe was constructed to prepare the features
for all models. The steps include: 1. The date variable
time_index is converted into a numeric format, as the
algorithms cannot natively process date objects. 2. To handle potential
data drift or unseen categories in future deployment,
step_novel() is applied to assign a dedicated category to
new factor levels. 3. All nominal predictors are then converted into
dummy variables using one-hot encoding. 4. Finally, a zero-variance
filter step_zv() removes any constant predictors that could
destabilize the training process to ensure a clean and mathematically
stable dataset for modeling.
Q2 Classification: To build classification model that analyse impact level of global fuel price to countries fuel price changes by quarter.**
The objective of this classification model is to classify the impact level of global fuel price changes on countries' fuel price changes by quarter. The target variable used in this model is impact_level, which consists of three classes: Low Impact, Medium Impact, and High Impact.
Multinomial Logistic Regression was selected because the classification problem has more than two outcome categories. Unlike binary Logistic Regression, which is used for two-class classification problems, Multinomial Logistic Regression can handle multi-class classification problems.
In this model, the predictors used are quarter, country, fuel_type, and crude_pct_change. These variables are used to examine whether quarterly crude oil price movements and country-level fuel information can
library(dplyr)
library(caret)
library(nnet)
library(ggplot2)
library(pROC)
train_df <- read.csv("data_split/classification/train_df.csv")
test_df <- read.csv("data_split/classification/test_df.csv")
# Remove extra spaces from the target variable
train_df$impact_level <- trimws(train_df$impact_level)
test_df$impact_level <- trimws(test_df$impact_level)
# Convert target variable into factor with fixed class order
train_df$impact_level <- factor(
train_df$impact_level,
levels = c("Low Impact", "Medium Impact", "High Impact")
)
test_df$impact_level <- factor(
test_df$impact_level,
levels = c("Low Impact", "Medium Impact", "High Impact")
)
# Convert target variable into factor with fixed class order
train_df$impact_level <- factor(
train_df$impact_level,
levels = c("Low Impact", "Medium Impact", "High Impact")
)
test_df$impact_level <- factor(
test_df$impact_level,
levels = c("Low Impact", "Medium Impact", "High Impact")
)
# Check class distribution
cat("Training class distribution:n")
## Training class distribution:n
print(table(train_df$impact_level))
##
## Low Impact Medium Impact High Impact
## 4496 3263 4241
cat("Testing class distribution:n")
## Testing class distribution:n
print(table(test_df$impact_level))
##
## Low Impact Medium Impact High Impact
## 1770 995 574
# Convert categorical predictors into factors
categorical_cols <- c("quarter", "country", "fuel_type")
for (col in categorical_cols) {
train_df[[col]] <- factor(train_df[[col]])
test_df[[col]] <- factor(test_df[[col]],
levels =
levels(train_df[[col]])
)
}
# Scale numerical predictor using training data statistics
crude_mean <- mean(train_df$crude_pct_change, na.rm = TRUE)
crude_sd <- sd(train_df$crude_pct_change, na.rm = TRUE)
train_df$crude_pct_change_scaled <- (train_df$crude_pct_change - crude_mean) / crude_sd
test_df$crude_pct_change_scaled <- (test_df$crude_pct_change - crude_mean) / crude_sd
# Select variables used in the model
model_vars <- c(
"impact_level",
"quarter",
"country",
"fuel_type",
"crude_pct_change_scaled"
)
# Remove missing values from selected variables
train_model <- train_df %>%
select(all_of(model_vars)) %>%
na.omit()
test_model <- test_df %>%
select(all_of(model_vars)) %>%
na.omit()
# Check final number of observations used
cat("Training rows used:", nrow(train_model), "n")
## Training rows used: 12000 n
cat("Testing rows used:", nrow(test_model), "n")
## Testing rows used: 3339 n
The variable impact_level was used as the target variable. The predictors selected were quarter, country, fuel_type, and crude_pct_change_scaled. The numerical variable crude_pct_change was scaled to improve model stability.
set.seed(123)
multi_log_model <- multinom(
impact_level ~ quarter + country + fuel_type + crude_pct_change_scaled,
data = train_model,
maxit = 1000,
trace = FALSE
)
Summary of the Model:
## Call:
## multinom(formula = impact_level ~ quarter + country + fuel_type +
## crude_pct_change_scaled, data = train_model, maxit = 1000,
## trace = FALSE)
##
## Coefficients:
## (Intercept) quarter2 quarter3 quarter4 countryAngola countryArgentina countryAustralia countryAustria countryBangladesh countryBelgium countryBotswana countryBrazil
## Medium Impact 0.3565812 -0.1010369 0.1903039 0.1612509 -0.0932418 -0.3748889 -1.048654 -0.3819582 -0.7197958 -0.5515683 -0.6622423 -0.9637362
## High Impact 1.7747465 0.2996437 0.2145259 -0.1776538 -1.5913488 -1.3685734 -2.268015 -2.2139965 -1.9917275 -2.2020716 -2.2054239 -1.9322170
## countryCameroon countryCanada countryChile countryChina countryColombia countryCzech Republic countryDenmark countryEcuador countryEgypt countryEthiopia countryFinland
## Medium Impact -0.7374166 -0.5703508 -0.9686056 -0.7463005 -0.8014645 -0.7003902 -0.6400727 -0.4685899 0.2310229 -1.088267 -0.633162
## High Impact -1.7653029 -2.1944302 -2.5536785 -1.9521570 -1.9880609 -2.3232003 -1.8174039 -1.6420782 -0.3978379 -2.092936 -1.756164
## countryFrance countryGermany countryGhana countryGreece countryHong Kong countryHungary countryIndia countryIndonesia countryIran countryIraq countryIreland countryIsrael
## Medium Impact -0.9176191 -0.9317898 -0.7265697 -1.232572 -0.8387431 -0.9327127 -0.6194894 -0.9292943 0.5586567 -0.3026625 -1.146719 -0.707016
## High Impact -1.9360412 -2.0505596 -1.9632654 -2.095332 -2.0369950 -2.1762742 -2.1654932 -1.9790108 1.2906808 -0.3703818 -2.243899 -2.234684
## countryItaly countryIvory Coast countryJapan countryJordan countryKenya countryKuwait countryLebanon countryLibya countryMalaysia countryMexico countryMorocco
## Medium Impact -0.866620 -0.6741443 -0.8014405 -0.8111955 -0.7570373 -0.3719660 -1.053642 -0.8908658 -0.4045026 -0.7517276 -0.4916238
## High Impact -2.411633 -2.1740766 -1.8979428 -1.9219993 -2.0122292 -0.8566371 -2.125555 0.2536396 -1.6957435 -1.5136977 -1.7645405
## countryMozambique countryMyanmar countryNamibia countryNepal countryNetherlands countryNew Zealand countryNigeria countryNorway countryPakistan countryPeru
## Medium Impact -0.3844573 -0.2430471 -1.176882 -0.9197453 -1.668139 -0.9698057 -0.4051902 -0.8332426 -0.2658621 -0.8142522
## High Impact -2.0569119 -1.5176162 -2.600538 -2.0346126 -2.367621 -2.1167300 -0.6640091 -2.3592310 -1.4252986 -1.8502627
## countryPhilippines countryPoland countryPortugal countryQatar countryRomania countryRussia countryRwanda countrySaudi Arabia countrySenegal countrySingapore
## Medium Impact -0.7564096 -0.7861756 -0.6703536 -0.3576162 -1.16698 -0.4900587 -1.007674 -0.2057177 -0.7791182 -0.9934818
## High Impact -2.4723873 -2.1848362 -2.3884489 -0.5204218 -2.08973 -1.7630165 -2.352127 -1.0103801 -1.9220871 -2.2098708
## countrySouth Africa countrySouth Korea countrySpain countrySri Lanka countrySudan countrySweden countrySwitzerland countryTaiwan countryTanzania countryThailand
## Medium Impact -0.4840723 -1.283184 -0.6782123 -1.145807 -0.2243023 -0.9205812 -0.7917003 -0.6561474 -0.8258379 -0.5435062
## High Impact -1.9448186 -2.053937 -1.9169582 -2.114259 -1.6338488 -1.9120195 -2.2760286 -2.2678240 -2.2920051 -1.8696288
## countryTunisia countryTurkey countryUAE countryUganda countryUkraine countryUnited Kingdom countryUnited States countryVenezuela countryVietnam countryZambia
## Medium Impact -0.387162 -0.7275985 -1.062420 -0.3345963 -0.5119267 -0.8958869 -0.7235675 0.5192959 -0.8236359 -0.3574211
## High Impact -1.884182 -2.1315200 -1.967763 -1.8881102 -2.0618198 -2.2509312 -2.1995794 1.5301123 -2.1554180 -2.0536675
## countryZimbabwe fuel_typeLpg fuel_typePetrol crude_pct_change_scaled
## Medium Impact -0.591045 -0.01349872 0.003995098 0.0981385
## High Impact -2.100456 -0.11243719 -0.050586132 0.4651139
##
## Std. Errors:
## (Intercept) quarter2 quarter3 quarter4 countryAngola countryArgentina countryAustralia countryAustria countryBangladesh countryBelgium countryBotswana countryBrazil
## Medium Impact 0.3494690 0.06788649 0.06640441 0.06457246 0.4164665 0.4212320 0.3969290 0.3904511 0.4146942 0.388647 0.3918544 0.4216277
## High Impact 0.2969552 0.06297558 0.06442560 0.06581010 0.3782727 0.3677603 0.3470512 0.3584590 0.3686677 0.350116 0.3516089 0.3639583
## countryCameroon countryCanada countryChile countryChina countryColombia countryCzech Republic countryDenmark countryEcuador countryEgypt countryEthiopia countryFinland
## Medium Impact 0.4183040 0.3903823 0.3938262 0.4141093 0.4147224 0.3936657 0.3989548 0.4183968 0.4577139 0.4213548 0.3970048
## High Impact 0.3638043 0.3522225 0.3562539 0.3656723 0.3649029 0.3557998 0.3493550 0.3696436 0.4014611 0.3635626 0.3462187
## countryFrance countryGermany countryGhana countryGreece countryHong Kong countryHungary countryIndia countryIndonesia countryIran countryIraq countryIreland countryIsrael
## Medium Impact 0.3994028 0.3978298 0.3941783 0.4037384 0.3972333 0.4134422 0.4087065 0.4156488 0.6295772 0.4686872 0.4036189 0.3910143
## High Impact 0.3453583 0.3456705 0.3459940 0.3435984 0.3476631 0.3658079 0.3715702 0.3614906 0.5453321 0.3912622 0.3493171 0.3508787
## countryItaly countryIvory Coast countryJapan countryJordan countryKenya countryKuwait countryLebanon countryLibya countryMalaysia countryMexico countryMorocco
## Medium Impact 0.3926456 0.4090346 0.3968465 0.4183383 0.3965275 0.4383774 0.3957141 0.5668244 0.4148330 0.4275726 0.4140817
## High Impact 0.3538510 0.3709007 0.3449138 0.3669096 0.3487989 0.3729795 0.3418824 0.4177945 0.3698357 0.3647689 0.3686078
## countryMozambique countryMyanmar countryNamibia countryNepal countryNetherlands countryNew Zealand countryNigeria countryNorway countryPakistan countryPeru
## Medium Impact 0.3904531 0.4188051 0.3961274 0.4180834 0.4132246 0.3983279 0.4543325 0.3913142 0.4215953 0.4183718
## High Impact 0.3532103 0.3739287 0.3535479 0.3652493 0.3433860 0.3472324 0.3807082 0.3507848 0.3734614 0.3633515
## countryPhilippines countryPoland countryPortugal countryQatar countryRomania countryRussia countryRwanda countrySaudi Arabia countrySenegal countrySingapore
## Medium Impact 0.3893619 0.3912577 0.3878740 0.4610189 0.4001713 0.4157520 0.3976122 0.4322061 0.4152585 0.3978640
## High Impact 0.3537183 0.3465684 0.3516748 0.3856085 0.3419681 0.3704335 0.3521537 0.3762243 0.3645028 0.3480133
## countrySouth Africa countrySouth Korea countrySpain countrySri Lanka countrySudan countrySweden countrySwitzerland countryTaiwan countryTanzania countryThailand
## Medium Impact 0.3919504 0.4029120 0.3950954 0.4227167 0.4145176 0.4020051 0.3920587 0.3918601 0.3933717 0.4126510
## High Impact 0.3496374 0.3399677 0.3465887 0.3634147 0.3732310 0.3465796 0.3498820 0.3536774 0.3517940 0.3681444
## countryTunisia countryTurkey countryUAE countryUganda countryUkraine countryUnited Kingdom countryUnited States countryVenezuela countryVietnam countryZambia
## Medium Impact 0.4140306 0.3941728 0.4218051 0.3922982 0.4103194 0.3961451 0.3933475 0.6852325 0.4119977 0.3911666
## High Impact 0.3759438 0.3503308 0.3608396 0.3522326 0.3733099 0.3509987 0.3515262 0.5895201 0.3670720 0.3549281
## countryZimbabwe fuel_typeLpg fuel_typePetrol crude_pct_change_scaled
## Medium Impact 0.3959367 0.05682098 0.05701668 0.02639093
## High Impact 0.3552450 0.05521420 0.05511600 0.02474443
##
## Residual Deviance: 24643.28
## AIC: 25003.28
# Predict class labels on the test data
pred_class <- predict(
multi_log_model,
newdata = test_model,
type = "class"
)
pred_class <- factor(
pred_class,
levels = c("Low Impact", "Medium Impact", "High Impact")
)
# Predict class probabilities
pred_prob <- predict(
multi_log_model,
newdata = test_model,
type = "probs"
)
View first few predicted probabilities:
| Low Impact | Medium Impact | High Impact |
|---|---|---|
| 0.1227371 | 0.1742381 | 0.7030248 |
| 0.1170634 | 0.1439332 | 0.7390034 |
| 0.1198538 | 0.1965270 | 0.6836192 |
| 0.1291797 | 0.2178769 | 0.6529434 |
| 0.1049234 | 0.1556110 | 0.7394656 |
| 0.0783168 | 0.1069949 | 0.8146883 |
conf_matrix <- confusionMatrix(
pred_class,
test_model$impact_level
)
# Extract overall accuracy
accuracy <- conf_matrix$overall["Accuracy"]
# Extract per-class metrics
precision <- conf_matrix$byClass[, "Pos Pred Value"]
recall <- conf_matrix$byClass[, "Sensitivity"]
# Calculate F1-score
f1_score <- 2 * ((precision * recall) / (precision + recall))
# Create per-class performance table
metrics_df <- data.frame(
Class = rownames(conf_matrix$byClass),
Precision = round(precision, 4),
Recall = round(recall, 4),
F1_Score = round(f1_score, 4),
Balanced_Accuracy = round(conf_matrix$byClass[, "Balanced Accuracy"], 4)
)
metrics_df
## Class Precision Recall F1_Score Balanced_Accuracy
## Class: Low Impact Class: Low Impact 0.6209 0.8006 0.6994 0.6246
## Class: Medium Impact Class: Medium Impact 0.3063 0.0683 0.1118 0.5013
## Class: High Impact Class: High Impact 0.4527 0.6585 0.5366 0.7466
# Calculate macro-average metrics
macro_precision <- mean(precision, na.rm = TRUE)
macro_recall <- mean(recall, na.rm = TRUE)
macro_f1 <- mean(f1_score, na.rm = TRUE)
# Create summary table
macro_metrics <- data.frame(
Metric = c("Accuracy", "Macro Precision", "Macro Recall", "Macro
F1-score"),
Value = round(c(accuracy, macro_precision, macro_recall, macro_f1), 4)
)
macro_metrics
## Metric Value
## 1 Accuracy 0.5580
## 2 Macro Precision 0.4600
## 3 Macro Recall 0.5091
## 4 Macro\nF1-score 0.4492
The accuracy shows the overall proportion of correctly classified observations. Precision shows how reliable the model is when it predicts a specific impact class. Recall shows how well the model identifies actual cases from each class. The F1-score provides a balanced measure between precision and recall.
multi_auc <- multiclass.roc(
response = test_model$impact_level,
predictor = pred_prob
)
multi_auc_value <- auc(multi_auc)
print(multi_auc_value)
## Multi-class area under the curve: 0.6838
The multiclass AUC measures how well the model separates the three impact levels overall. A higher AUC value indicates stronger classification performance.
# One-vs-All ROC curve for Low Impact
roc_low <- roc(
response = ifelse(test_model$impact_level == "Low Impact", 1, 0),
predictor = pred_prob[, "Low Impact"]
)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# One-vs-All ROC curve for Medium Impact
roc_medium <- roc(
response = ifelse(test_model$impact_level == "Medium Impact", 1, 0),
predictor = pred_prob[, "Medium Impact"]
)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# One-vs-All ROC curve for High Impact
roc_high <- roc(
response = ifelse(test_model$impact_level == "High Impact", 1, 0),
predictor = pred_prob[, "High Impact"]
)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC curves
plot(
roc_low,
col = "blue",
lwd = 2,
main = "One-vs-All ROC Curves for Multinomial Logistic Regression"
)
plot(
roc_medium,
col = "orange",
lwd = 2,
add = TRUE
)
plot(
roc_high,
col = "darkgreen",
lwd = 2,
add = TRUE
)
abline(a = 0, b = 1, lty = 2, col = "red")
legend(
"bottomright",
legend = c(
paste("Low Impact AUC =", round(auc(roc_low), 4)),
paste("Medium Impact AUC =", round(auc(roc_medium), 4)),
paste("High Impact AUC =", round(auc(roc_high), 4))
),
col = c("blue", "orange", "darkgreen"),
lwd = 2
)
| Metric | Value | |
|---|---|---|
| Accuracy.Accuracy | Accuracy.Accuracy | 0.5580 |
| Macro Precision | Macro Precision | 0.4600 |
| Macro Recall | Macro Recall | 0.5091 |
| Macro F1-score | Macro F1-score | 0.4492 |
| Multiclass AUC | Multiclass AUC | 0.6838 |
The Multinomial Logistic Regression model was developed to classify the impact level of global fuel price changes on country-level fuel price changes by quarter. The target variable used was impact_level, which consists of three categories: Low Impact, Medium Impact, and High Impact. The predictors used in the model were quarter, country, fuel_type, and crude_pct_change.
Based on the test dataset, the model achieved an overall accuracy of 55.8%. This means that the model correctly classified 55.8% of the observations. The No Information Rate was 53.01%, which represents the accuracy that could be achieved by always predicting the majority class. Since the model accuracy is only slightly higher than the No Information Rate, this indicates that the model provides some improvement over a simple majority-class prediction, but the improvement is limited.
The confusion matrix shows that the model performed best in classifying Low Impact observations. Out of the actual Low Impact cases, the model correctly classified 1,417 observations, giving a recall of 80.06% for this class. This suggests that the model is reasonably good at detecting lower-impact fuel price changes.
For High Impact, the model showed moderate performance. It correctly classified 378 High Impact observations, with a recall of 65.85%. However, the precision for High Impact was 45.27%, meaning that not all observations predicted as High Impact were actually High Impact cases.
The weakest performance was observed for Medium Impact. The model correctly classified only 68 Medium Impact observations, while many Medium Impact cases were misclassified as Low Impact or High Impact. The recall for Medium Impact was only 6.83%, showing that the model had difficulty identifying this category. This may suggest that Medium Impact observations are less clearly separated from the other two classes using the selected predictors.
The macro-average results further support this interpretation. The model achieved a Macro Precision of 46.00%, Macro Recall of 50.91%, and Macro F1-score of 44.92%. Although the overall accuracy was 55.8%, the low Macro F1-score shows that the model performance was not balanced across all three impact levels. In particular, the poor performance for Medium Impact reduced the overall class-balanced performance.
The model also achieved a Multiclass AUC of 0.6838, indicating a moderate ability to distinguish between Low Impact, Medium Impact, and High Impact classes. However, this value also suggests that the model still has room for improvement.
Overall, the Multinomial Logistic Regression model provides a useful baseline for the classification problem. It is able to identify Low Impact cases relatively well and shows moderate ability in detecting High Impact cases. However, it struggles to classify Medium Impact observations accurately. This indicates that the relationship between global crude oil price changes and country-level fuel price impact may be more complex than what a linear classification model can fully capture. Therefore, more flexible machine learning models such as Random Forest or XGBoost may be better suited to improve classification performance.
The objective of this classification model is to predict the
impact level of global fuel price changes on countries’
fuel price changes by quarter.
The target variable impact_level has three categories:
Low Impact, Medium Impact, and High
Impact.
Random Forest was selected because it is well‑suited for
multi‑class classification problems and can handle
complex interactions between categorical and continuous
predictors.
It also provides variable importance measures, which help identify the
most influential factors driving classification outcomes.
In this project, the predictors include:
# Load libraries
library(randomForest)
library(caret)
library(ggplot2)
library(pROC)
# Read train/test datasets
train_df <- read.csv("data_split/classification/train_df.csv")
test_df <- read.csv("data_split/classification/test_df.csv")
# Ensure factor levels match
train_df$impact_level <- trimws(train_df$impact_level)
test_df$impact_level <- trimws(test_df$impact_level)
train_df$impact_level <- factor(train_df$impact_level,
levels = c("Low Impact", "Medium Impact", "High Impact")
)
test_df$impact_level <- factor(test_df$impact_level,
levels = c("Low Impact", "Medium Impact", "High Impact")
)
# Check class distribution
cat("Train class counts:\n")
## Train class counts:
print(table(train_df$impact_level))
##
## Low Impact Medium Impact High Impact
## 4496 3263 4241
cat("Test class counts:\n")
## Test class counts:
print(table(test_df$impact_level))
##
## Low Impact Medium Impact High Impact
## 1770 995 574
# Train Random Forest model
set.seed(123)
rf_model <- randomForest(
impact_level ~ quarter + country + fuel_type +
crude_pct_change,
data = train_df,
ntree = 300,
mtry = floor(sqrt(5)),
importance = TRUE
)
# Predict on test set
predictions <- predict(rf_model, newdata = test_df)
cm <- confusionMatrix(predictions, test_df$impact_level)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low Impact Medium Impact High Impact
## Low Impact 1577 466 122
## Medium Impact 164 442 119
## High Impact 29 87 333
##
## Overall Statistics
##
## Accuracy : 0.7044
## 95% CI : (0.6886, 0.7198)
## No Information Rate : 0.5301
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.48
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: Low Impact Class: Medium Impact Class: High Impact
## Sensitivity 0.8910 0.4442 0.58014
## Specificity 0.6252 0.8793 0.95805
## Pos Pred Value 0.7284 0.6097 0.74165
## Neg Pred Value 0.8356 0.7884 0.91661
## Prevalence 0.5301 0.2980 0.17191
## Detection Rate 0.4723 0.1324 0.09973
## Detection Prevalence 0.6484 0.2171 0.13447
## Balanced Accuracy 0.7581 0.6617 0.76909
The model achieves an overall accuracy of 70.4%. It performs very well at identifying Low Impact cases, correctly predicting 1,577 of them. However, it experiences noticeable friction with Medium Impact cases, frequently misclassifying them as Low Impact.
precision <- cm$byClass[, "Pos Pred Value"]
recall <- cm$byClass[, "Sensitivity"]
f1 <- 2 * ((precision * recall) / (precision + recall))
metrics_df <- data.frame(
Class = rownames(cm$byClass),
Precision = round(precision, 3),
Recall = round(recall, 3),
F1_Score = round(f1, 3),
Balanced_Accuracy = round(cm$byClass[, "Balanced Accuracy"], 3)
)
print(metrics_df)
## Class Precision Recall F1_Score Balanced_Accuracy
## Class: Low Impact Class: Low Impact 0.728 0.891 0.802 0.758
## Class: Medium Impact Class: Medium Impact 0.610 0.444 0.514 0.662
## Class: High Impact Class: High Impact 0.742 0.580 0.651 0.769
The per-class metrics confirm that the model is highly reliable for Low Impact predictions, yielding a strong F1-Score of 0.801 and a recall of 0.891. Conversely, Medium Impact is the weakest category with an F1-Score of 0.514, showing that the model requires sharper decision boundaries to separate minor baseline shifts from low-level activity.
imp <- importance(rf_model)
imp_df <- data.frame(
Variable = rownames(imp),
MeanDecreaseAccuracy = imp[, "MeanDecreaseAccuracy"]
)
ggplot(imp_df, aes(
x = reorder(Variable, MeanDecreaseAccuracy),
y = MeanDecreaseAccuracy
)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(
title = "Variable Importance (Random Forest)",
x = "Predictor",
y = "Mean Decrease in Accuracy"
) +
theme_minimal()
Changes in global crude oil prices are by far the most important factor
driving the model’s predictions, completely outperforming everything
else. On the other hand, the fuel type variable actually scores below
zero. This indicates that knowing whether a fuel is petrol or diesel
provides no real predictive value and might even confuse the model,
making it a prime candidate to be removed in future versions.
ggplot(train_df, aes(x = impact_level)) +
geom_bar(fill = "steelblue") +
labs(title = "Class Distribution (Train Set)", x = "Impact Level", y = "Count") +
theme_minimal()
ggplot(test_df, aes(x = impact_level)) +
geom_bar(fill = "darkred") +
labs(title = "Class Distribution (Test Set)", x = "Impact Level", y = "Count") +
theme_minimal()
The distribution plots reveal a critical shift between our datasets.
While the training set is relatively well-balanced across all three
tiers, the testing set is heavily skewed toward Low Impact observations
(1,770 cases vs. only 574 High Impact cases). This imbalance in the test
data heavily influences our final accuracy metrics.
cm_table <- as.data.frame(cm$table)
ggplot(cm_table, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), color = "white") +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
labs(title = "Confusion Matrix Heatmap") +
theme_minimal()
The heatmap visually isolates where the model succeeds and where it
loses track. The deep blue square at the bottom left highlights the
model’s core strength: correctly identifying Low Impact cases 1,577
times. However, the lighter tile right next to it clearly flags our
primary weak point, showing that 468 actual Medium Impact cases were
incorrectly swept into the Low Impact category.
The Random Forest model provides a solid baseline with an overall accuracy of 70.4%, proving highly effective at identifying low-impact market shifts. Because global crude oil trends completely dominate the model’s decision tree splits while product specifics like fuel type offer no actual value, future work should focus on simplifying the feature space. Dropping the fuel type variable and introducing specific country by crude interaction terms will be the most direct path to fixing the overlap between Low and Medium impact classes.
Extreme Gradient Boosting (XGBoost) is a powerful machine learning algorithm that has gained popularity for its performance and efficiency in classification tasks. It is an implementation of gradient boosting that uses decision trees as base learners. XGBoost is designed to optimize both speed and performance, making it a strong candidate for our classification problem.
xgboost_spec <- boost_tree(
trees = 1000,
tree_depth = tune(), # Tune this
learn_rate = tune(), # Tune this
min_n = tune() # Tune this
) |>
set_engine("xgboost") |>
set_mode("classification")
# Combine recipe and model into a workflow
xgboost_workflow <- workflow() |>
add_model(xgboost_spec) |>
add_recipe(classification_recipe)
The code above sets the specification of the XGBoost model. The number of trees is set to 1000 to allow the model to learn complex patterns, but not too much that it overfits; while tree_depth, learn_rate, and min_n are set to be tuned during the hyperparameter tuning process.
library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)
parallel_control <- control_grid(
save_pred = TRUE,
parallel_over = "everything",
verbose = FALSE,
)
# Tune the model using cross-validation
xgboost_tune <- xgboost_workflow |>
tune_grid(
resamples = folds,
grid = 20, # Number of random combinations to try
control = parallel_control
)
stopCluster(cl)
The training and cross-validation process for the XGBoost model is
computationally intensive, especially when tuning multiple
hyperparameters. To speed up this process, parallel processing is
utilized by creating a cluster that uses all available CPU cores except
one. The control_grid function is configured to save
predictions and allow parallel processing across all resamples. The
tune_grid function then performs the hyperparameter tuning
using the specified cross-validation folds.
The best parameters for XGBoost can be grabbed with the code below.
# Select the best hyperparameters based on accuracy
best_xgboost <- xgboost_tune |>
select_best(metric = "accuracy")
Parameters:
min_n: 24
tree_depth: 4
learn_rate: 0.001
# 2. Explicitly force R to use the tidymodels (yardstick) functions
my_metrics <- yardstick::metric_set(
yardstick::accuracy,
yardstick::precision,
yardstick::recall,
yardstick::f_meas,
yardstick::roc_auc
)
# 3. Generate predictions on the test data
xgboost_predictions <- predict(final_xgboost, test_data, type = "class") |>
bind_cols(predict(final_xgboost, test_data, type = "prob")) |>
bind_cols(test_data)
# 4. Calculate metrics safely
# 4. Calculate metrics safely
xgboost_metrics <- xgboost_predictions |>
my_metrics(
truth = impact_level,
estimate = .pred_class,
# Select all probability columns, but ignore the factor prediction column
starts_with(".pred_") & !.pred_class,
estimator = "macro"
)
| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | multiclass | 0.9949087 |
| precision | macro | 0.9949302 |
| recall | macro | 0.9930726 |
| f_meas | macro | 0.9939784 |
| roc_auc | macro | 0.9998956 |
The model demonstrates outstanding performance across all evaluated classification metrics, achieving near-perfect predictive accuracy and balance:
Accuracy (99.5%): The model correctly classifies 99.5% of all instances in the dataset.
Precision (Macro: 99.5%): Out of all instances the model predicted as belonging to a specific class, 99.5% were correctly identified, indicating an exceptionally low false-positive rate.
Recall (Macro: 99.3%): The model successfully captures 99.3% of all actual positive instances across classes, demonstrating an exceptionally low false-negative rate.
F1-Score (Macro: 99.4%): The harmonic mean of precision and recall confirms that the model maintains a highly stable and balanced performance without favoring one metric over the other.
ROC AUC (Macro: 0.999): The model’s ability to distinguish between classes is nearly perfect, with an AUC close to 1.0.
The model performs exceptionally well across all individual classes as the macro-averaged metrics are virtually identical to the overall accuracy.
## Truth
## Prediction Low Impact Medium Impact High Impact
## Low Impact 1765 4 0
## Medium Impact 5 991 8
## High Impact 0 0 566
The confusion matrix reveals how further robust is the model. Only a total of 17 values were misclassified across all classes. The majority being actual Medium Impact is understandable as this class is often the most difficult to distinguish due to its middle position. However, no extreme misclassifications occurred - such as a high impact predicted as low impact - which shows that the model is reliable in differentiating between the high and low impact cases.
The aim of this project is to evaluate three machine learning models — XGBoost, Multinomial Logistic Regression, and Random Forest — for predicting the impact level of global fuel price changes.
Our objectives are: - To compare predictive accuracy across
models.
- To measure computational efficiency (latency).
- To analyze error distributions and misclassifications.
- To identify the most reliable framework for deployment.
We begin by loading essential R packages for machine learning,
evaluation, and visualization. These include caret,
tidymodels, randomForest, and
pROC.
# Load libraries
library(dplyr)
library(ggplot2)
library(tidyr)
library(caret)
library(pROC)
library(yardstick)
library(tidymodels)
library(bundle)
library(randomForest)
Models were trained previously and saved as .rds files, so we are restoring them here:
xgb_bundle <- readRDS("models/classification/final_xgboost_model.rds")
model_xgb <- bundle::unbundle(xgb_bundle)
model_logistic <- readRDS("models/classification/final_logistic_model.rds")
rf_model <- readRDS("models/classification/final_rf_model.rds")
Each model requires slightly different preprocessing:
XGBoost: factor alignment and date handling. Logistic Regression: scaling of numeric predictors. Random Forest: factor level synchronization with training data.
Example: Logistic Regression preprocessing
crude_mean <- mean(train_df$crude_pct_change, na.rm = TRUE)
crude_sd <- sd(train_df$crude_pct_change, na.rm = TRUE)
test_data_logistic <- test_df %>%
mutate(
impact_level = factor(trimws(impact_level), levels = target_levels),
quarter = factor(quarter),
country = factor(country),
fuel_type = factor(fuel_type),
crude_pct_change_scaled = (crude_pct_change - crude_mean) / crude_sd
) %>%
select(impact_level, quarter, country, fuel_type, crude_pct_change_scaled) %>%
na.omit()
Each model produces both class predictions and probability matrices:
xgb_class <- predict(model_xgb, new_data = test_data_xgb, type = "class")$.pred_class
xgb_prob <- predict(model_xgb, new_data = test_data_xgb, type = "prob")
logistic_class <- predict(model_logistic, newdata = test_data_logistic, type = "class")
logistic_prob <- predict(model_logistic, newdata = test_data_logistic, type = "probs")
rf_class <- predict(rf_model, newdata = test_data_rf_clean, type = "response")
rf_prob <- predict(rf_model, newdata = test_data_rf_clean, type = "prob")
We compute Accuracy, Precision, Recall, F1‑Score, and AUC for each model.
print(comparison_df)
## Model Accuracy Precision Recall F1_Score AUC
## 1 XGBoost 0.9949 0.9949 0.9931 0.9940 0.9999
## 2 Multinomial Logistic 0.5580 0.4600 0.5091 0.4492 0.6838
## 3 Random Forest 0.7041 0.6929 0.6379 0.6550 0.8117
XGBoost consistently achieved the highest accuracy and F1‑score. Logistic Regression was moderate, with balanced but lower recall. Random Forest produced valid results after factor alignment, but lagged behind XGBoost in accuracy.
Execution time was measured to assess efficiency.
print(speed_df)
## Model Execution_Time_Secs
## 1 XGBoost 0.44549
## 2 Multinomial Logistic 1.65848
## 3 Random Forest 0.50988
Logistic Regression was fastest. XGBoost balanced speed with strong accuracy. Random Forest was slower, reflecting its ensemble complexity.
Confusion matrices highlight misclassifications.
We generated three plots:
Performance Profile: Accuracy, F1, and AUC side‑by‑side.
Confusion Matrix Heatmaps: Misclassification grids for each model.
Efficiency Frontier: Trade‑off between accuracy and execution time.
Balanced Accuracy per class and Log Loss were computed for deeper insights.
print(advanced_comparison)
## Model Overall_Accuracy Low_Impact_BalAcc Med_Impact_BalAcc
## 1 XGBoost 0.9949 0.9973 0.9952
## 2 Multinomial Logistic 0.5580 0.6246 0.5013
## 3 Random Forest 0.7041 0.7578 0.6617
## High_Impact_BalAcc Log_Loss
## 1 0.9930 0.3406
## 2 0.7466 1.0061
## 3 0.7682 0.8212
XGBoost maintained balanced accuracy across all classes and lower log loss, reinforcing its robustness.
This unified evaluation demonstrates that XGBoost is the most reliable framework for predicting the impact level of global fuel price changes. It achieved the best balance of accuracy, F1‑score, and computational efficiency.
Logistic Regression remains a lightweight option with acceptable performance.
Random Forest requires careful factor alignment and was less competitive in this dataset.
Final takeaway: For deployment in real‑world scenarios, XGBoost offers the strongest predictive performance and operational efficiency.
The dataset may not fully capture real-world fuel pricing mechanisms such as exchange rates, government price controls, import costs, and geopolitical shocks.
The model mainly uses Brent crude price and country-level categories. It does not include exchange rate, inflation, domestic tax policy changes, transport cost, or demand/supply indicators.
Some features may be too closely related to how impact_level was created. The model should be rerun after removing leakage variables.
The testing set has more Low Impact cases than High Impact cases, which can make accuracy look better than actual class-level performance. The report itself notes this imbalance in the classification section.
High performance on historical data does not guarantee performance on future fuel shocks or unseen countries.
This study aimed to predict global retail fuel prices and classify the impact level of fuel price changes using Brent crude oil prices and country-level characteristics. The analysis followed a complete machine learning workflow, including data understanding, data cleaning, exploratory data analysis, data preparation, regression modeling, classification modeling, and model evaluation.
For the regression task, the objective was to predict monthly average retail fuel prices. Three models were compared: Elastic Net Linear Regression, Random Forest Regression, and XGBoost Regression. Based on the evaluation results, XGBoost Regression produced the best performance across all key metrics. It achieved the lowest prediction errors and the highest explanatory power, indicating that it was able to capture complex non-linear relationships between global crude oil prices, historical crude prices, country characteristics, tax category, income level, region, and fuel type. Therefore, XGBoost Regression was selected as the final model for predicting retail fuel prices.
For the classification task, the objective was to predict quarterly fuel price impact levels, categorized as Low Impact, Medium Impact, and High Impact. Three models were also compared: Multinomial Logistic Regression, Random Forest, and XGBoost. The Multinomial Logistic Regression model served as a useful baseline but showed limited performance, especially in identifying Medium Impact cases. Random Forest improved the classification accuracy and performed well in detecting Low Impact observations, but it still struggled to separate Medium Impact from the other classes. XGBoost achieved the strongest overall classification performance, with very high accuracy, precision, recall, and F1-score. As a result, XGBoost was selected as the final classification model.
Overall, the findings show that XGBoost was the most suitable algorithm for both research questions. Its strong performance suggests that fuel price behaviour is influenced by complex and non-linear patterns rather than simple linear relationships alone. This is reasonable because local retail fuel prices are not only affected by Brent crude oil price movements, but also by country-specific factors such as tax structure, subsidy level, income level, region, and fuel type.
The results also highlight that global crude oil prices do not affect all countries equally. Some countries may experience stronger retail fuel price changes, while others may absorb price shocks more effectively due to policy intervention, subsidies, taxation systems, or economic structure. Therefore, the classification model provides useful insight by identifying which observations are more likely to fall into Low, Medium, or High Impact categories.
However, the results should be interpreted with caution. The very high performance of the XGBoost models indicates strong predictive ability within the dataset, but it does not guarantee perfect real-world forecasting performance. Future work should further validate the models using new unseen data, check for possible target leakage, and test whether the models remain stable under different time periods or external market conditions. Additional variables such as exchange rates, inflation, government policy changes, fuel import dependency, and geopolitical events could also be included to improve the real-world explanatory power of the models.
In conclusion, this project successfully demonstrated how machine learning can be used to analyze fuel price behaviour from both regression and classification perspectives. XGBoost was found to be the best-performing model for predicting retail fuel prices and classifying fuel price impact levels. The study provides a useful analytical framework for understanding how global crude oil price movements interact with country-level characteristics to influence local fuel markets.