Introduction

Climate change is one of the most pressing global challenges of the 21st century, with carbon dioxide (CO2) emissions identified as the primary driver of global warming. Understanding what factors drive national-level emissions is critical for designing effective climate policy. Countries differ substantially in their per-capita emissions, and these differences are shaped by economic scale, energy systems, and fossil fuel dependence.

This project analyses the Our World in Data (OWID) CO2 and Greenhouse Gas Emissions dataset to investigate the structural factors that explain variation in CO2 emissions per capita across countries from 1990 to 2024. We apply regression to quantify the relationship between emissions and their key drivers, and classification to group countries into low, medium, and high emission intensity categories.

Objectives

  • To model and quantify the relationship between national CO2 emissions per capita (co2_per_capita) and key drivers of economic activity, energy consumption, and fossil fuel usage across countries from 1990–2024.

  • To classify countries into emission intensity categories based on their economic structure, energy consumption patterns, and fossil fuel dependence using data from 1990–2024.

Research Questions

Research Question 1

What economic, energy consumption, and fossil fuel factors significantly explain variation in co2_per_capita across countries from 1990–2024?

Research Question 2

Can countries be accurately classified into low, medium, and high emission intensity groups based on their economic structure, energy consumption, and fossil fuel dependence from 1990–2024?

Methodology

Package Setup

Packages Used in This Study

|Package            |Purpose                                      |
|:------------------|:--------------------------------------------|
|tidyverse          |Data wrangling and visualisation             |
|caret              |Model training and evaluation framework      |
|randomForest       |Random Forest regression and classification  |
|rpart / rpart.plot |Decision Tree modelling and visualisation    |
|corrplot           |Correlation heatmap                          |
|naniar             |Missing value visualisation                  |
|knitr              |Table formatting                             |
|gridExtra          |Multi-panel plot layout                      |
|scales             |Axis formatting for ggplot2                  |
|Metrics            |RMSE and MAE calculation                     |
|performance        |VIF / multicollinearity check                |

Data Loading and Raw Overview

Dataset loaded successfully from GitHub
Source   : https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv
Rows     : 50411
Columns  : 79
Years    : 1750 to 2024
Entities : 254 unique
Total cells: 3,982,469

Dataset Overview


|Attribute    |Details                                                     |
|:------------|:-----------------------------------------------------------|
|Title        |CO2 and Greenhouse Gas Emissions Dataset                    |
|Source       |https://github.com/owid/co2-data                            |
|Publisher    |Our World in Data (OWID)                                    |
|Year Range   |1750 to 2024                                                |
|Last Updated |2024                                                        |
|Purpose      |Tracks and analyses global CO2 and GHG emissions by country |
|Rows         |50,411                                                      |
|Columns      |79                                                          |
|Total Cells  |3,982,469                                                   |
|License      |Creative Commons BY 4.0                                     |

Dataset Source

The dataset used in this analysis is the CO2 and Greenhouse Gas Emissions Dataset published by Our World in Data (OWID), sourced directly from their public GitHub repository. It is maintained under a Creative Commons BY 4.0 license, making it freely available for research and educational purposes.

Raw Dataset Dimensions

Upon loading, the dataset comprises 50,411 rows and 79 columns, covering 254 unique entities — including both sovereign nations and regional aggregates such as World, Asia, and High-income countries. The temporal range spans 1750 to 2024, representing 274 years of emissions history. However, meaningful and consistent data only becomes available from the mid-20th century onward, with the most complete coverage beginning around 1990.

Raw Data Snapshot

The table below shows the first 10 rows of key columns from the raw dataset, illustrating the missing data pattern typical of early historical records.

Raw Data Preview: Afghanistan, 1750–1759 (Key Columns)

|country     | year| population|gdp |co2 |co2_per_capita |coal_co2 |oil_co2 |gas_co2 |
|:-----------|----:|----------:|:---|:---|:--------------|:--------|:-------|:-------|
|Afghanistan | 1750|    2802560|NA  |NA  |NA             |NA       |NA      |NA      |
|Afghanistan | 1751|         NA|NA  |NA  |NA             |NA       |NA      |NA      |
|Afghanistan | 1752|         NA|NA  |NA  |NA             |NA       |NA      |NA      |
|Afghanistan | 1753|         NA|NA  |NA  |NA             |NA       |NA      |NA      |
|Afghanistan | 1754|         NA|NA  |NA  |NA             |NA       |NA      |NA      |
|Afghanistan | 1755|         NA|NA  |NA  |NA             |NA       |NA      |NA      |
|Afghanistan | 1756|         NA|NA  |NA  |NA             |NA       |NA      |NA      |
|Afghanistan | 1757|         NA|NA  |NA  |NA             |NA       |NA      |NA      |
|Afghanistan | 1758|         NA|NA  |NA  |NA             |NA       |NA      |NA      |
|Afghanistan | 1759|         NA|NA  |NA  |NA             |NA       |NA      |NA      |

Key Observations

Several structural characteristics of the raw dataset are worth noting before proceeding to cleaning:

  • Historical sparsity: Early records (pre-1900) are dominated by NA values across emissions variables. As the raw data snapshot above shows, the rows for Afghanistan from 1750 to 1759 contain no CO2 or energy values, and population is recorded only for 1750. This is consistent with limited historical measurement and reporting capacity during the pre-industrial period.

  • Mixed entity types: The dataset includes not only sovereign countries but also continental and income-group aggregates. These will be filtered out during the cleaning stage to ensure the analysis reflects country-level observations only.

  • Analytical scope: Given the data quality and completeness considerations above, this analysis focuses on the modern period from 1990 to 2024, where emissions reporting is substantially more reliable and comparable across countries.

Data Cleaning and Preprocessing

Overview

This section describes the data cleaning and preprocessing procedures applied to the CO2 and Greenhouse Gas Emissions dataset obtained from Our World in Data (OWID). The objective of this stage is to construct a high-quality analytical dataset suitable for regression and classification modelling.

The cleaning process involves variable selection, removal of aggregate observations, filtering to a consistent time period, handling missing values, treatment of outliers, and construction of the classification target variable.

Variable Selection

Only variables relevant to the research objectives are retained from the original 79-column dataset. Selected variables include the identifiers (country, year), the target variable (co2_per_capita), economic indicators (gdp, population), energy metrics (primary_energy_consumption, energy_per_capita, energy_per_gdp, co2_per_unit_energy), emission components (coal_co2, oil_co2, gas_co2, cement_co2, flaring_co2, land_use_change_co2), and broader greenhouse gas measures (methane, nitrous_oxide, total_ghg). Each variable is selected based on its theoretical relevance to explaining cross-country differences in emissions.

Selected Variables for CO2 Analysis
Selected_Variables
country
year
co2_per_capita
gdp
population
primary_energy_consumption
energy_per_capita
energy_per_gdp
co2_per_unit_energy
coal_co2
oil_co2
gas_co2
cement_co2
flaring_co2
land_use_change_co2
methane
nitrous_oxide
total_ghg

This step reduces dimensionality, removes irrelevant features, and improves model interpretability and computational efficiency. In total, 18 variables are retained for subsequent analysis.

Removal of Aggregate Countries

  • Countries after removing aggregates: 218

To ensure the analysis focused exclusively on individual countries, aggregate regions and non-country entities were removed from the dataset. A lookup table containing valid three-letter ISO country codes was created from the original dataset, and only matching countries were retained using a semi-join operation.

This step eliminated entries such as continents, income groups, and regional aggregates (e.g. World, Asia, High-income countries) that could distort country-level analysis. After filtering, the dataset contained 218 unique countries, which were used for subsequent exploratory analysis, predictive modelling, and classification tasks.
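
The filtering step described above can be sketched in base R. This is a minimal illustration on a toy table, not the report's actual code: the column name iso_code and the example code values are assumptions, and the semi-join is expressed with `%in%` rather than a dplyr call.

```r
# Toy stand-in for the raw table; the iso_code values are illustrative assumptions.
raw <- data.frame(
  country  = c("Afghanistan", "World", "Asia", "Qatar", "High-income countries"),
  iso_code = c("AFG", "OWID_WRL", NA, "QAT", NA),
  year     = c(1990L, 1990L, 1990L, 1990L, 1990L),
  stringsAsFactors = FALSE
)

# Lookup table: entities whose code is exactly three upper-case letters.
is_iso3 <- !is.na(raw$iso_code) & grepl("^[A-Z]{3}$", raw$iso_code)
valid_countries <- unique(raw[is_iso3, c("country", "iso_code")])

# Base-R equivalent of a semi-join: keep rows whose country appears in the
# lookup, without adding any columns from the lookup table.
countries_only <- raw[raw$country %in% valid_countries$country, ]
print(countries_only$country)
```

A semi-join is a natural fit here because it only filters rows; the lookup table contributes no new columns to the result.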

Time Period Filtering (1990-2024)

  • Rows after year filter (1990-2024): 7630

The dataset was filtered to include only observations from 1990 to 2024, representing the modern era of emissions reporting. The year 1990 is commonly used as a baseline in climate studies due to improved data availability and its role as the base year for emission targets under the Kyoto Protocol, which was adopted under the framework convention opened for signature at the 1992 Rio Earth Summit.

This restriction ensures consistency in reporting standards and improves comparability across countries. It also reduces bias arising from incomplete or unreliable historical records.

After filtering, only observations within the defined study period are retained for analysis.

Removal of Missing Target Values

  • Rows dropped (missing target): 188
  • Rows retained: 7442

Rows with a missing value in the target variable (co2_per_capita) were removed so that every observation used for modelling has a valid target. Such records cannot contribute to supervised learning models and may introduce bias into the analysis. As a result, 188 rows were removed, while 7,442 rows were retained for subsequent exploratory analysis, feature engineering, and model development.

Missing Predictor Value Imputation

  • Missing predictor values before imputation: 14037
  • Missing predictor values after imputation: 0

Missing values in the predictor variables were addressed using median imputation. All numeric predictor variables, excluding the target variable (co2_per_capita), were examined, and a total of 14,037 missing values were identified. To preserve the dataset size and avoid information loss, each missing value was replaced with the median of its respective variable. The median was preferred to the mean because it is less sensitive to extreme values and skewed distributions. After imputation, the number of missing predictor values fell from 14,037 to 0, giving a complete dataset for subsequent exploratory analysis and machine learning modelling.
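
As a rough sketch of the imputation step, the following base R snippet replaces each missing predictor value with its column median while leaving the target untouched. The data frame and its values are made up for illustration; only the column names are taken from the report.

```r
# Toy data with gaps; values are invented for illustration only.
df <- data.frame(
  co2_per_capita = c(1.2, 3.4, 0.5, 7.8),
  gdp            = c(10, NA, 30, 1000),
  coal_co2       = c(NA, 2, 4, NA)
)

# All numeric columns except the target are treated as predictors.
predictors <- setdiff(names(df)[sapply(df, is.numeric)], "co2_per_capita")

missing_before <- sum(is.na(df[predictors]))

# Replace each NA with the median of the observed values in that column.
for (col in predictors) {
  col_median <- median(df[[col]], na.rm = TRUE)
  df[[col]][is.na(df[[col]])] <- col_median
}

missing_after <- sum(is.na(df[predictors]))
cat("Missing before:", missing_before, "| after:", missing_after, "\n")
```

In the toy gdp column the median of the observed values (10, 30, 1000) is 30, whereas the mean would be pulled up to about 347 by the single large value, which illustrates why the median is the more robust fill value for skewed variables.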

Removal of Duplicate Observations

  • Duplicate rows removed: 0

Duplicate observations were assessed based on the combination of country and year to ensure that each record represented a unique country-year observation. No duplicate records were identified in the dataset (0 duplicate rows removed), indicating that the dataset was already free from redundant observations. Therefore, all records were retained for further analysis and modelling.

Fix Data Types

  • Data types: country=character, year=integer

To ensure consistency and compatibility during analysis, the country variable was stored as a character data type, while the year variable was converted to an integer data type. This preprocessing step helped maintain correct variable representations and supported subsequent data processing and modelling tasks.

Winsorisation of Extreme Values

  • Winsorisation applied to: co2_per_capita, gdp, population, primary_energy_consumption, total_ghg

Winsorisation was applied to key continuous variables using the IQR method (1.5 × IQR rule) to limit the influence of extreme values while retaining all observations. This improves model stability without removing valid data points.

Variables treated include:

  • co2_per_capita
  • gdp
  • population
  • primary_energy_consumption
  • total_ghg
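
The capping rule can be illustrated with a small helper. The function name winsorise_iqr is hypothetical and the input vector is a toy example; the sketch assumes R's default quantile definition, which may differ from the one used in the report.

```r
# Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at the nearest bound.
# The vector keeps its length: no observation is dropped.
winsorise_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  pmin(pmax(x, lower), upper)
}

x <- c(1, 2, 3, 4, 5, 100)
capped <- winsorise_iqr(x)
print(capped)
```

Here Q1 = 2.25 and Q3 = 4.75, so the upper bound is 8.5 and only the value 100 is pulled down to it; the other five values are unchanged.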

Creation of Classification Target

A classification target variable (emission_level) was created from co2_per_capita to support the classification modelling task. The variable was constructed using data-driven tertile thresholds (33rd and 66th percentiles), which divide the country-year observations into three balanced emission categories.

This approach ensures that class definitions are not arbitrarily assigned, but instead reflect the empirical distribution of global emissions within the dataset. It also improves model performance by maintaining relatively balanced class sizes across categories.

The resulting classification thresholds are:

  • Low : co2_per_capita <= 1.120 tCO2/cap
  • Medium : 1.120 < co2_per_capita <= 5.068 tCO2/cap
  • High : co2_per_capita > 5.068 tCO2/cap

This categorical variable is subsequently used as the target for classification analysis.
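
A minimal sketch of how the thresholds translate into class labels, using base R's cut(). The helper name classify_emission is hypothetical, and the two cut points are simply the rounded values reported above rather than recomputed percentiles.

```r
# Map co2_per_capita to Low / Medium / High using the reported cut points.
# right = TRUE makes each interval closed on the right, so a value exactly
# equal to a threshold falls in the lower class.
classify_emission <- function(x, low_cut = 1.120, high_cut = 5.068) {
  cut(x,
      breaks = c(-Inf, low_cut, high_cut, Inf),
      labels = c("Low", "Medium", "High"),
      right = TRUE)
}

x <- c(0.5, 1.120, 3.0, 5.068, 9.9)
emission_level <- classify_emission(x)
print(data.frame(co2_per_capita = x, emission_level))
```

On real data the cut points would be derived from the sample itself, for example with quantile(x, probs = c(0.33, 0.66)), before being passed to the helper.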

Exploratory Data Analysis (EDA)

EDA 1: Summary statistics

Summary Statistics of Key Variables

|Variable                   |  Min|    Q1| Median|   Mean|     Q3|    Max|
|:--------------------------|----:|-----:|------:|------:|------:|------:|
|CO2 per Capita             | 0.00|  0.71|   2.74|   4.45|   6.83|  16.01|
|GDP (Billion USD)          | 0.26| 27.84|  61.11| 117.88| 165.16| 371.14|
|Population (Million)       | 0.00|  0.76|   5.70|  13.72|  20.46|  50.01|
|Primary Energy Consumption | 0.00|  6.41|  40.97| 182.90| 285.18| 703.33|
|Total GHG                  | 0.00|  9.28|  43.92|  78.03| 106.20| 251.58|

The summary statistics provide an overview of the distribution of the key variables used in this study. The average CO2 emissions per capita across the selected countries and years is approximately 4.45 tonnes, with values ranging from 0 to 16.01 tonnes, indicating substantial variation in carbon emissions among countries. The median value of 2.74 tonnes is lower than the mean, suggesting a right-skewed distribution where a few countries exhibit particularly high emissions.

GDP shows considerable variation, ranging from approximately 0.26 billion USD to 371.14 billion USD, with an average of 117.88 billion USD. Similarly, population sizes vary widely across observations, from fewer than 2,000 people to about 50 million, reflecting the diverse economic and demographic characteristics of the countries included in the dataset. The upper ends of these ranges reflect the winsorisation caps rather than the raw maxima.

Primary energy consumption exhibits a broad range, with a mean of 182.90 TWh and values reaching as high as 703.33 TWh. This suggests significant differences in energy demand and usage among countries. Total greenhouse gas (GHG) emissions also vary substantially, with a mean of 78.03 million tonnes and a maximum value of 251.58 million tonnes, indicating that some countries contribute disproportionately to overall emissions.

Overall, the summary statistics reveal substantial variability in economic activity, population size, energy consumption, and emissions levels across countries. This variation supports the use of predictive and classification models to investigate the factors associated with national CO2 emissions.

EDA 2: Missing Value Summary Table

Missing Value % in Key Columns (before imputation)

|Variable                   | Missing_Pct|
|:--------------------------|-----------:|
|gas_co2                    |        42.8|
|coal_co2                   |        36.1|
|energy_per_gdp             |        28.0|
|gdp                        |        27.3|
|land_use_change_co2        |         8.9|
|flaring_co2                |         6.6|
|methane                    |         6.6|
|total_ghg                  |         6.6|
|co2_per_unit_energy        |         6.0|
|nitrous_oxide              |         5.6|
|primary_energy_consumption |         5.3|
|energy_per_capita          |         5.3|
|cement_co2                 |         3.4|
|oil_co2                    |         0.1|
|year                       |         0.0|
|co2_per_capita             |         0.0|
|population                 |         0.0|

Before imputation, several key variables in the dataset exhibit substantial missingness. The table above summarises the percentage of missing values for each variable.

gas_co2 (42.8%), coal_co2 (36.1%), energy_per_gdp (28.0%), and gdp (27.3%) show the highest proportions of missing data, largely reflecting gaps in historical reporting for smaller or developing nations. Moderate missingness (5–9%) is observed for land_use_change_co2, flaring_co2, methane, total_ghg, co2_per_unit_energy, nitrous_oxide, primary_energy_consumption, and energy_per_capita. cement_co2 and oil_co2 have minimal missingness (3.4% and 0.1% respectively), while year, co2_per_capita, and population are fully complete (0.0%).

These results justify the use of imputation strategies (e.g., median or group-wise imputation) rather than row-wise deletion, which would result in significant data loss.
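
A per-column missingness table of this kind can be produced in a few lines of base R. The sketch below uses a made-up three-column data frame; the output column names Variable and Missing_Pct follow the table above, but the code itself is an assumption about how such a table might be built.

```r
# Toy data frame; the NA pattern is invented for illustration.
df <- data.frame(
  gas_co2 = c(NA, 1.2, NA, 0.4),
  gdp     = c(5e9, NA, 7e9, 8e9),
  year    = c(1990L, 1991L, 1992L, 1993L)
)

# is.na() returns a logical matrix, so colMeans() gives the NA share per column.
missing_pct <- round(colMeans(is.na(df)) * 100, 1)

missing_tbl <- data.frame(
  Variable    = names(missing_pct),
  Missing_Pct = unname(missing_pct),
  stringsAsFactors = FALSE
)

# Sort from most to least missing, as in the report's table.
missing_tbl <- missing_tbl[order(-missing_tbl$Missing_Pct), ]
print(missing_tbl, row.names = FALSE)
```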

EDA 3: Missing Value Bar Chart

A horizontal bar chart visualises the missing value percentages for all variables containing at least one missing value, ordered from highest to lowest. This visualisation reinforces the findings from Table 5, clearly highlighting gas_co2, coal_co2, energy_per_gdp, and gdp as the variables of greatest concern, with oil_co2 showing negligible missingness. The chart provides an at-a-glance reference for prioritising data cleaning efforts.

EDA 4: Observation per Decade

To understand the temporal coverage of the dataset, observations were grouped into decades spanning 1990–2024. The 1990s, 2000s, and 2010s each contain approximately 2,117–2,130 country-year observations, indicating consistent reporting coverage across these three decades. The 2020s decade shows a lower count (1,065 observations), which is expected since this decade is incomplete (only up to 2024) at the time of analysis. Overall, the dataset provides a balanced longitudinal view of emissions across more than three decades.

EDA 5: Distribution of CO2_per_capita

Summary of co2_per_capita (1990-2024, real countries)

|Statistic |     Value|
|:---------|---------:|
|Min.      |  0.000000|
|1st Qu.   |  0.714250|
|Median    |  2.744500|
|Mean      |  4.452312|
|3rd Qu.   |  6.832500|
|Max.      | 16.009875|

The target variable co2_per_capita (1990–2024, real countries only) was examined for its distribution characteristics. Summary statistics show a minimum of 0.000, a 1st quartile of 0.714, a median of 2.745, a mean of 4.452, a 3rd quartile of 6.833, and a maximum of 16.010 tonnes per capita.

The histogram reveals a strongly right-skewed distribution, with the majority of country-year observations concentrated below 5 tonnes per capita, and a long tail extending toward high-emission outliers (notably oil/gas-rich nations). Dashed reference lines mark the tertile thresholds used to define the Low/Medium/High emission categories for the classification task in Section 2.6. A noticeable spike near the maximum value (~16) is a consequence of winsorisation: values above the upper IQR bound of 16.01 were capped there, which pools the consistently high-emission countries (e.g., Qatar, Kuwait, UAE) at a single value.

EDA 6: Descriptive statistics table

Descriptive Statistics: Key Variables

|Variable                   |         Mean|       Median|           SD|          Min|          Max| Skewness|
|:--------------------------|------------:|------------:|------------:|------------:|------------:|--------:|
|population                 | 1.372471e+07| 5.703130e+06| 1.698800e+07|  1.77600e+03| 5.001299e+07|    1.228|
|gdp                        | 1.178822e+11| 6.111433e+10| 1.283459e+11|  2.57172e+08| 3.711429e+11|    1.172|
|primary_energy_consumption | 1.829010e+02| 4.096600e+01| 2.535390e+02|  0.00000e+00| 7.033330e+02|    1.248|
|energy_per_capita          | 2.369295e+04| 1.142043e+04| 3.306387e+04|  0.00000e+00| 3.185597e+05|    2.967|
|energy_per_gdp             | 1.440000e+00| 1.172000e+00| 1.255000e+00|  7.80000e-02| 2.302100e+01|    5.834|
|co2_per_unit_energy        | 2.300590e+02| 2.141910e+02| 1.811110e+02|  3.52450e+01| 1.068890e+04|   29.902|
|coal_co2                   | 5.738600e+01| 2.290000e+00| 4.156140e+02|  0.00000e+00| 8.886021e+03|   14.790|
|oil_co2                    | 4.662100e+01| 4.110000e+00| 1.864720e+02|  4.00000e-03| 2.584130e+03|    9.456|
|gas_co2                    | 3.033900e+01| 7.575000e+00| 1.122600e+02|  0.00000e+00| 1.748138e+03|    9.215|
|cement_co2                 | 5.082000e+00| 3.370000e-01| 3.738300e+01|  0.00000e+00| 8.287100e+02|   16.923|
|flaring_co2                | 1.643000e+00| 0.000000e+00| 5.837000e+00|  0.00000e+00| 8.452000e+01|    6.535|
|land_use_change_co2        | 3.117400e+01| 2.028000e+00| 1.461710e+02| -3.19418e+02| 2.998516e+03|    9.770|
|methane                    | 3.889500e+01| 9.580000e+00| 1.196360e+02|  1.00000e-03| 1.590674e+03|    6.912|
|nitrous_oxide              | 1.225600e+01| 2.905000e+00| 3.929700e+01|  0.00000e+00| 4.755340e+02|    7.612|
|total_ghg                  | 7.803500e+01| 4.392200e+01| 8.759500e+01| -3.00000e-03| 2.515780e+02|    1.111|
|co2_per_capita             | 4.452000e+00| 2.745000e+00| 4.615000e+00|  0.00000e+00| 1.601000e+01|    1.150|

Table 7 presents descriptive statistics (mean, median, standard deviation, minimum, maximum, and skewness) for all key predictor variables and the target variable co2_per_capita.

Several variables exhibit extremely high positive skewness, most notably co2_per_unit_energy (29.9), cement_co2 (16.9), and coal_co2 (14.8), followed by land_use_change_co2 (9.8), oil_co2 (9.5), gas_co2 (9.2), nitrous_oxide (7.6), methane (6.9), flaring_co2 (6.5), and energy_per_gdp (5.8). This indicates the presence of significant outliers and heavy right tails, which is consistent with the global distribution of emissions being dominated by a small number of large economies and energy producers. In contrast, co2_per_capita itself (skewness = 1.150) and the other winsorised variables (population, gdp, primary_energy_consumption, and total_ghg, all between 1.1 and 1.3) show more moderate skew. These findings motivate the outlier treatment carried out in Section 2.4.9.

EDA 7: Histograms for All Predictors and Target Variable

A grid of histograms was produced for all numeric predictors and the target variable to visually assess their distributional shapes. Consistent with the skewness values reported in Table 7, most variables (gdp, population, primary_energy_consumption, coal_co2, oil_co2, gas_co2, cement_co2, flaring_co2, land_use_change_co2, methane, nitrous_oxide, and co2_per_unit_energy) display heavily right-skewed distributions, with the vast majority of observations clustered near zero and a small number of extreme high values. The target variable co2_per_capita shows a more moderate right skew, while the bulk of energy_per_gdp appears closer to a normal or slightly skewed shape, although its skewness of 5.83 in Table 7 indicates a long right tail. These visual patterns confirm that transformation or robust scaling may be beneficial for certain modelling approaches, and that outlier handling is necessary prior to regression.

EDA 8: Boxplots for All Predictors and Target Variables

Boxplots were generated for all predictor and target variables to visually identify the presence and extent of outliers. The plots confirm the patterns observed in the histograms: variables such as energy_per_gdp, co2_per_unit_energy, coal_co2, oil_co2, gas_co2, cement_co2, flaring_co2, land_use_change_co2, methane, and nitrous_oxide all display numerous extreme outliers above the upper whisker, often representing values many times the interquartile range. In contrast, co2_per_capita, population, gdp, and primary_energy_consumption show comparatively fewer extreme points relative to their scale. These results support the decision to apply outlier treatment (winsorisation) rather than outright removal, given that many of these “outliers” represent genuine real-world extremes (e.g., high-emission economies).

EDA 9: Outlier Summary Table (Before Winsorisation)

EDA 9 - Outlier Summary (IQR bounds, before winsorisation)

Note: Environmental data contains genuine extremes (e.g., Qatar, Kuwait)

IQR-based winsorisation (capping) was used instead of removal.

Outlier Summary - Variables Winsorised

|Variable                   |   Lower_Bound|  Upper_Bound| Outliers_Before|
|:--------------------------|-------------:|------------:|---------------:|
|co2_per_capita             | -8.463000e+00| 1.601000e+01|             436|
|gdp                        | -3.898365e+11| 7.000362e+11|             745|
|population                 | -2.879704e+07| 5.001299e+07|             870|
|primary_energy_consumption | -4.477340e+02| 7.607240e+02|             995|
|total_ghg                  | -1.630720e+02| 2.940290e+02|             991|

To formally quantify outliers, the Interquartile Range (IQR) method was applied to each key variable, with lower and upper bounds calculated as Q1 − 1.5×IQR and Q3 + 1.5×IQR respectively. Table 8 summarises the number of observations falling outside these bounds prior to treatment.

primary_energy_consumption (995 outliers) and total_ghg (991 outliers) show the highest counts, followed by population (870), gdp (745), and co2_per_capita (436).

It is important to note that environmental and economic data of this nature often contain genuine extremes — for example, small but extremely wealthy or energy-intensive nations such as Qatar and Kuwait naturally produce values far outside typical ranges. Removing these observations would distort the analysis and discard valid information. Therefore, IQR-based winsorisation (capping) was applied instead of outright removal, preserving the sample size while reducing the influence of extreme values on downstream modelling.
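
As a quick arithmetic check, the co2_per_capita bounds in the outlier summary can be reproduced from the quartiles printed in the EDA 5 summary (0.714250 and 6.832500). The snippet only re-applies the 1.5 × IQR rule to those two printed numbers; it does not touch the underlying data.

```r
# Quartiles of co2_per_capita as printed in the EDA 5 summary.
q1 <- 0.71425
q3 <- 6.83250

iqr <- q3 - q1                 # interquartile range
lower_bound <- q1 - 1.5 * iqr  # Q1 - 1.5 * IQR
upper_bound <- q3 + 1.5 * iqr  # Q3 + 1.5 * IQR

cat(sprintf("IQR = %.5f | lower = %.6f | upper = %.6f\n",
            iqr, lower_bound, upper_bound))
```

The upper bound works out to 16.009875, which matches the reported maximum of co2_per_capita after capping, and the lower bound of about -8.463 is never binding because emissions per capita in this sample start at zero.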

EDA 10: Global Average Emissions Trend (1990–2024)

Emission Level Class Distribution

|Emission_Level | Count| Proportion|
|:--------------|-----:|----------:|
|Low            |  2456|       0.33|
|Medium         |  2456|       0.33|
|High           |  2530|       0.34|

To support the classification task in Research Question 2, the continuous target variable co2_per_capita was discretised into three tertile-based categories — Low, Medium, and High — representing emission intensity levels.

The resulting class distribution is well balanced: Low (2,456 observations, 33%), Medium (2,456 observations, 33%), and High (2,530 observations, 34%). This near-equal split across classes is desirable for classification modelling, as it minimises class imbalance issues and ensures that performance metrics such as accuracy and macro-averaged F1 are meaningful and not biased toward a majority class.

EDA 11: Top 10 Highest Emission Countries

The top 10 countries by average CO2 per capita (1990–2024) were identified and visualised. The list is dominated by oil- and gas-rich Gulf states and high-income, energy-intensive nations: United Arab Emirates, Qatar, Kuwait, Bahrain, Brunei, Saudi Arabia, Australia, United States, Canada, and Trinidad and Tobago. All ten countries consistently sit within the “High” emission tier across the study period, reflecting strong ties between fossil fuel production/consumption, economic development, and per-capita carbon output. These findings align with the descriptive statistics and reinforce the rationale for using economic and energy-related predictors in both the regression and classification models.

EDA 12: Correlation heatmap

The strongest correlations with co2_per_capita are observed for energy_per_capita (r = 0.79), primary_energy_consumption (r = 0.42), and energy_per_gdp (r = 0.42), suggesting that energy consumption intensity (rather than total economic size) is the dominant driver of per-capita emissions. gdp shows a moderate positive correlation (r = 0.33), while population shows a near-zero correlation (r = −0.04), confirming that total population size is not a strong predictor of per-capita emissions (as expected, since the target is already population-normalised).

Among the fuel-source and aggregate greenhouse gas variables, gas_co2 (r = 0.29), coal_co2 (r = 0.26), and total_ghg (r = 0.25) show moderate positive associations. Notably, several predictors are highly correlated with one another, for example gdp and population (r = 0.69), primary_energy_consumption and gdp (r = 0.89), and oil_co2 and gas_co2 (r = 0.88), flagging potential multicollinearity issues that are formally assessed via VIF in Section 2.5.4.

EDA 13: Class Distribution Plot + Table


Class counts:

   Low Medium   High 
  2456   2456   2530 

Class proportions:

   Low Medium   High 
  0.33   0.33   0.34 

The final EDA step confirms the distribution of the classification target variable, emission_level, derived via tertile-based discretisation of co2_per_capita. The bar chart and accompanying table show:

|Emission Level | Count| Proportion|
|:--------------|-----:|----------:|
|Low            | 2,456|       0.33|
|Medium         | 2,456|       0.33|
|High           | 2,530|       0.34|

This balanced distribution across the three classes confirms that the dataset is well-suited for multi-class classification without requiring imbalance corrections such as SMOTE resampling or class weighting.

Regression Analysis

Objective 1

  • To model and quantify the relationship between national CO2 emissions per capita (co2_per_capita) and key drivers of economic activity, energy consumption, and fossil fuel usage across countries from 1990–2024.

Research Question 1

  • What economic, energy consumption, and fossil fuel factors significantly explain variation in co2_per_capita across countries from 1990–2024?

This study employs two complementary modelling approaches. Model R1 uses Multiple Linear Regression (OLS), providing interpretable coefficients and p-values to identify statistically significant predictors and directly address the research question. Model R2 applies Random Forest Regression to capture potential non-linear relationships, with feature importance scores used to evaluate and rank the relative contribution of each predictor, offering a complementary perspective to the linear model.

Regression Analysis (RQ1)

  • Regression dataset : 7442 rows, 15 predictors
  • Train rows: 5954 | Test rows: 1488

This section prepares the dataset for regression analysis by selecting 15 key predictor variables representing economic activity, energy consumption, and fossil fuel emissions, alongside CO2 emissions per capita as the target variable. After removing missing values, the final regression dataset consists of 7,442 observations. The data is then split into training and testing sets using an 80/20 ratio, resulting in 5,954 observations for model training and 1,488 observations for model evaluation. This split ensures that model performance can be assessed on unseen data, supporting a more robust and generalisable analysis.
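
An 80/20 split of this size can be sketched with base R's sample(). The report does not state which function or seed was used, so both the simple random sampling and the seed value below are assumptions; the row count of 7,442 is taken from the text above.

```r
set.seed(42)  # arbitrary seed for reproducibility; not taken from the report

n <- 7442     # rows in the regression dataset, as reported above

# Draw 80% of the row indices for training; the remainder forms the test set.
train_idx <- sample(seq_len(n), size = round(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

cat("Train rows:", length(train_idx), "| Test rows:", length(test_idx), "\n")
```

With these inputs the split sizes are 5,954 and 1,488, the same as the counts reported above. On a real data frame the two index vectors would be used as df[train_idx, ] and df[test_idx, ].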

Multiple Linear Regression (OLS)


OLS Model Summary (coefficients and significance):

Call:
lm(formula = lm_formula, data = train_reg)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.1689  -1.2653  -0.4099   0.8295  12.3915 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 8.003e-01  9.031e-02   8.862  < 2e-16 ***
gdp                         3.932e-12  5.865e-13   6.705 2.20e-11 ***
population                 -1.300e-07  4.340e-09 -29.963  < 2e-16 ***
primary_energy_consumption -6.105e-04  3.799e-04  -1.607 0.108094    
energy_per_capita           8.990e-05  1.339e-06  67.166  < 2e-16 ***
energy_per_gdp              1.691e-01  3.062e-02   5.521 3.51e-08 ***
co2_per_unit_energy         3.191e-03  2.501e-04  12.758  < 2e-16 ***
coal_co2                    2.709e-03  4.325e-04   6.265 4.00e-10 ***
oil_co2                     4.693e-04  5.264e-04   0.892 0.372663    
gas_co2                     3.026e-03  7.714e-04   3.923 8.85e-05 ***
cement_co2                 -2.029e-02  3.923e-03  -5.172 2.40e-07 ***
flaring_co2                 4.226e-02  1.035e-02   4.085 4.47e-05 ***
land_use_change_co2        -1.086e-03  2.922e-04  -3.718 0.000203 ***
methane                    -1.017e-02  1.604e-03  -6.339 2.49e-10 ***
nitrous_oxide               1.805e-02  4.916e-03   3.673 0.000242 ***
total_ghg                   2.506e-02  1.105e-03  22.670  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.39 on 5938 degrees of freedom
Multiple R-squared:  0.7323,    Adjusted R-squared:  0.7317 
F-statistic:  1083 on 15 and 5938 DF,  p-value: < 2.2e-16

A multiple linear regression (OLS) model was fitted to predict co2_per_capita using 15 predictors, including gdp, population, primary_energy_consumption, energy_per_capita, energy_per_gdp, co2_per_unit_energy, and fuel-specific/GHG variables (coal_co2, oil_co2, gas_co2, cement_co2, flaring_co2, land_use_change_co2, methane, nitrous_oxide, total_ghg).

The model achieved a Multiple R-squared of 0.7323 (Adjusted R² = 0.7317), indicating that approximately 73% of the variance in CO2 per capita is explained by the included predictors. The overall F-statistic (1083 on 15 and 5938 DF, p < 2.2e-16) confirms the model is highly statistically significant.
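
The adjusted R-squared and F-statistic follow directly from the reported R-squared and degrees of freedom. The short check below recomputes them from the rounded values in the model summary, so small differences in the last digit are expected.

```r
r2       <- 0.7323  # Multiple R-squared from the model summary
p        <- 15      # number of predictors
df_resid <- 5938    # residual degrees of freedom

n <- df_resid + p + 1  # training rows: predictors plus intercept

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic: explained variance per predictor over residual variance.
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

cat(sprintf("n = %d | adjusted R2 = %.4f | F = %.1f\n", n, adj_r2, f_stat))
```

The recovered sample size of 5,954 equals the number of training rows, and the recomputed values agree with the reported 0.7317 and 1083 up to rounding of the input R-squared.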

Most predictors are statistically significant (p < 0.05), with the exceptions of primary_energy_consumption (p = 0.108) and oil_co2 (p = 0.373). energy_per_capita has by far the largest t-statistic (t = 67.17, p < 2e-16), followed by population (t = −29.96), total_ghg (t = 22.67), and co2_per_unit_energy (t = 12.76), suggesting these are key drivers of the model’s explanatory power.

The residual standard error is 2.39, and residuals range from −17.17 to 12.39, suggesting some asymmetry and the presence of large residuals for certain observations (likely high-emission outlier countries).

Coefficient Table and Plot

A coefficient plot with 95% confidence intervals was produced to visualise the magnitude, direction, and significance of each predictor’s effect on co2_per_capita. Predictors are colour-coded by significance (p < 0.05 vs.  p ≥ 0.05).

energy_per_gdp shows by far the largest positive coefficient and widest confidence interval, followed by flaring_co2, total_ghg, and nitrous_oxide. cement_co2 and methane show clear negative coefficients. Because the predictors enter the model on their original scales, these magnitudes reflect each variable's units as well as the strength of its association. Two predictors, oil_co2 and primary_energy_consumption, have confidence intervals crossing zero, consistent with their non-significant p-values reported in Section 2.5.2.

VIF - Multicollinearity Check


Variance Inflation Factors (VIF):
# Check for Multicollinearity

Low Correlation

                Term  VIF     VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
   energy_per_capita 1.96 [ 1.89,  2.04]     1.40      0.51     [0.49, 0.53]
      energy_per_gdp 1.48 [ 1.43,  1.53]     1.22      0.68     [0.65, 0.70]
 co2_per_unit_energy 1.08 [ 1.05,  1.11]     1.04      0.93     [0.90, 0.95]
         flaring_co2 3.88 [ 3.72,  4.06]     1.97      0.26     [0.25, 0.27]
 land_use_change_co2 1.62 [ 1.56,  1.68]     1.27      0.62     [0.60, 0.64]

Moderate Correlation

                       Term  VIF     VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
                        gdp 5.92 [ 5.65,  6.20]     2.43      0.17     [0.16, 0.18]
                 population 5.65 [ 5.40,  5.92]     2.38      0.18     [0.17, 0.19]
 primary_energy_consumption 9.62 [ 9.17, 10.09]     3.10      0.10     [0.10, 0.11]
                    oil_co2 9.71 [ 9.26, 10.19]     3.12      0.10     [0.10, 0.11]
                    gas_co2 7.74 [ 7.38,  8.11]     2.78      0.13     [0.12, 0.14]
                  total_ghg 9.74 [ 9.28, 10.22]     3.12      0.10     [0.10, 0.11]

High Correlation

          Term   VIF     VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
      coal_co2 35.69 [33.95, 37.52]     5.97      0.03     [0.03, 0.03]
    cement_co2 24.15 [22.98, 25.38]     4.91      0.04     [0.04, 0.04]
       methane 37.67 [35.83, 39.60]     6.14      0.03     [0.03, 0.03]
 nitrous_oxide 37.44 [35.61, 39.36]     6.12      0.03     [0.03, 0.03]

Variance Inflation Factors (VIF) were computed for all predictors to assess multicollinearity, grouped into Low (VIF < 5), Moderate (5 ≤ VIF < 10), and High (VIF ≥ 10) correlation categories.

Low correlation (VIF < 5): energy_per_capita (1.96), energy_per_gdp (1.48), co2_per_unit_energy (1.08), flaring_co2 (3.88), and land_use_change_co2 (1.62) show acceptable VIF values, indicating minimal multicollinearity concerns.

Moderate correlation (5–10): gdp (5.92), population (5.65), primary_energy_consumption (9.62), oil_co2 (9.71), gas_co2 (7.74), and total_ghg (9.74) show elevated but generally tolerable VIF values.

High correlation (VIF ≥ 10): coal_co2 (35.69), cement_co2 (24.15), methane (37.67), and nitrous_oxide (37.44) display severe multicollinearity, with tolerance values as low as 0.03. This indicates these variables are highly linearly related to other predictors (likely total_ghg and each other, as components of the same aggregate). While this does not bias the coefficient estimates, it inflates standard errors and makes individual coefficient interpretation for these variables less reliable. Future iterations of the model could consider removing or combining these highly collinear GHG-component variables.
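To make the VIF figures more concrete: for predictor j, VIF = 1 / (1 − R²_j), where R²_j is the R² from regressing that predictor on all the others, and tolerance is the reciprocal of VIF. The Python sketch below is illustrative rather than the code used in this study; it converts the reported VIFs for the high-correlation group into tolerance, implied R²_j, and the standard-error inflation factor √VIF, which matches the adj. VIF column in the output above.

```python
import math

# VIF values copied from the "High Correlation" group in the output above.
vif = {"coal_co2": 35.69, "cement_co2": 24.15, "methane": 37.67, "nitrous_oxide": 37.44}

summary = {}
for term, v in vif.items():
    tolerance = 1 / v             # tolerance = 1 - R^2_j
    r2_j = 1 - tolerance          # share of this predictor explained by the others
    se_inflation = math.sqrt(v)   # factor by which the standard error is inflated
    summary[term] = (tolerance, r2_j, se_inflation)

for term, (tol, r2_j, se_f) in summary.items():
    print(f"{term:14s} tolerance = {tol:.2f}  R2_j = {r2_j:.3f}  sqrt(VIF) = {se_f:.2f}")
```

Read this way, a VIF of about 36 means that roughly 97% of the variation in coal_co2 is linearly reproducible from the other predictors, and its standard error is about six times larger than it would be without that overlap.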

Predictions and Metrics


  Linear Regression (OLS)
    RMSE : 2.7306
    MAE  : 1.6974
    R2   : 0.6670

The OLS model was evaluated on a held-out test set, yielding:

  • RMSE: 2.7306
  • MAE: 1.6974
  • R²: 0.6670

The drop in R² from the training value (0.7323) to the test value (0.6670) suggests some degree of overfitting or that the linear model struggles to generalise to extreme/outlier observations, consistent with the residual patterns examined in Section 2.5.6.
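For reference, the three test-set metrics can be computed as in the Python sketch below. This is a minimal illustration, not the code used in this study: the function name and the toy values are hypothetical, and R² is assumed to be defined as 1 − SS_res / SS_tot (some toolkits report the squared correlation between predictions and observations instead, which can give a different value).

```python
import math

def regression_metrics(actual, predicted):
    """Return (RMSE, MAE, R^2), with R^2 defined as 1 - SS_res / SS_tot."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return rmse, mae, 1 - ss_res / ss_tot

# Hypothetical toy values, not taken from the dataset.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
rmse, mae, r2 = regression_metrics(actual, predicted)
print(f"RMSE = {rmse:.4f}, MAE = {mae:.4f}, R2 = {r2:.4f}")
```

RMSE penalises large errors more heavily than MAE, so a wide gap between the two (as in the OLS results above) points to a small number of large misses.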

Residual Plots

Two diagnostic plots were produced to assess the assumptions of the OLS model:

  • Residuals vs Fitted: The plot shows a clear funnel/fan-shaped pattern, with residual spread increasing for higher fitted values, and a distinct diagonal band of points. This indicates heteroscedasticity (non-constant variance) and suggests the linear model does not fully capture the relationship for high-emission countries.

  • Normal Q-Q Plot: Residuals deviate noticeably from the theoretical normal line, particularly in the tails — both the lower-left and upper-right ends curve away from the diagonal, indicating heavier tails than a normal distribution (i.e., more extreme residuals than expected under normality).

Together, these diagnostics suggest that while the OLS model captures a substantial portion of the variance, it violates key linear regression assumptions (homoscedasticity and normality of residuals), motivating the use of a more flexible, non-linear model such as Random Forest.

Actual vs. Predicted

A scatter plot of actual vs. predicted co2_per_capita values was produced, with a dashed reference line representing perfect prediction (y = x).

For low-to-moderate actual values (roughly 0–8), predictions cluster reasonably close to the diagonal, though with noticeable scatter. However, for high actual values (particularly around 16, corresponding to high-emission countries like Qatar/UAE), the model substantially under-predicts, with predicted values clustering well below the actual values. This systematic under-prediction at the extremes is consistent with the heteroscedasticity observed in the residual plots and reflects the inherent limitation of a linear model in capturing the disproportionately large emissions of a small number of outlier economies.

Model R2: Random Forest Regression


  Random Forest Regression
    RMSE : 0.3837
    MAE  : 0.1685
    R2   : 0.9937

To address the limitations of the OLS model, a Random Forest Regression model was fitted using the same predictor set. The model achieved substantially improved performance:

  • RMSE: 0.3837
  • MAE: 0.1685
  • R²: 0.9937

These results represent a dramatic improvement over the OLS model (test R² rising from 0.6670 to 0.9937), suggesting that the relationship between the predictors and co2_per_capita is strongly non-linear and that Random Forest is far better suited to capturing these complex interactions, including the extreme values associated with high-emission countries.

Feature Importance

                                             Variable Importance
energy_per_capita                   energy_per_capita  47.348329
co2_per_unit_energy               co2_per_unit_energy  28.782999
population                                 population  19.237880
oil_co2                                       oil_co2  17.153103
coal_co2                                     coal_co2  15.403601
land_use_change_co2               land_use_change_co2  15.222930
nitrous_oxide                           nitrous_oxide  15.167226
total_ghg                                   total_ghg  14.600553
methane                                       methane  14.408944
energy_per_gdp                         energy_per_gdp  14.377791
primary_energy_consumption primary_energy_consumption  11.820692
cement_co2                                 cement_co2  11.527677
gas_co2                                       gas_co2  10.157521
flaring_co2                               flaring_co2   9.733628
gdp                                               gdp   8.672081

Feature importance scores (based on the increase in MSE when a variable is permuted) identify the predictors most critical to the Random Forest Regression model’s performance:

  1. energy_per_capita (47.35) — by far the most important predictor
  2. co2_per_unit_energy (28.78)
  3. population (19.24)
  4. oil_co2 (17.15)
  5. coal_co2 (15.40)
  6. land_use_change_co2 (15.22)
  7. nitrous_oxide (15.17)
  8. total_ghg (14.60)
  9. methane (14.41)
  10. energy_per_gdp (14.38)
  11. primary_energy_consumption (11.82)
  12. cement_co2 (11.53)
  13. gas_co2 (10.16)
  14. flaring_co2 (9.73)
  15. gdp (8.67)

energy_per_capita stands out as the most influential predictor, roughly 1.6 times as important as the second-ranked variable, co2_per_unit_energy (47.35 vs. 28.78). This aligns with the strong correlation observed in Section 2.4.12 (r = 0.79) and the highly significant OLS coefficient for this variable, and it suggests that how much energy a population consumes per person is the single strongest determinant of per-capita carbon emissions, more so than total economic output (gdp) or population size alone.

Model Comparison

                     Model   RMSE    MAE     R2
1  Linear Regression (OLS) 2.7306 1.6974 0.6670
2 Random Forest Regression 0.3837 0.1685 0.9937

Regression Model Comparison



Table: Regression Model Comparison - RMSE, MAE, R2

|Model                    |   RMSE|    MAE|     R2|
|:------------------------|------:|------:|------:|
|Linear Regression (OLS)  | 2.7306| 1.6974| 0.6670|
|Random Forest Regression | 0.3837| 0.1685| 0.9937|

The two regression models were compared directly on the test set:

|Model                    |   RMSE|    MAE|     R2|
|:------------------------|------:|------:|------:|
|Linear Regression (OLS)  | 2.7306| 1.6974| 0.6670|
|Random Forest Regression | 0.3837| 0.1685| 0.9937|

The bar chart comparison clearly shows the Random Forest model outperforming OLS across all three metrics: RMSE fell from 2.7306 to 0.3837 (a reduction of about 86%), MAE fell from 1.6974 to 0.1685 (a reduction of about 90%), and R² rose from 0.6670 to 0.9937. This indicates that non-linear, ensemble-based methods are substantially better suited to modelling CO2 per capita than a simple linear approach, most likely because they can capture interactions and non-linear relationships involving extreme or outlier-prone variables.
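The percentage reductions quoted above follow directly from the comparison table. The Python sketch below shows the arithmetic; it is illustrative only, and the helper name pct_reduction is hypothetical.

```python
# Test-set metrics copied from the regression model comparison table.
ols = {"RMSE": 2.7306, "MAE": 1.6974, "R2": 0.6670}
rf = {"RMSE": 0.3837, "MAE": 0.1685, "R2": 0.9937}

def pct_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100 * (before - after) / before

rmse_drop = pct_reduction(ols["RMSE"], rf["RMSE"])
mae_drop = pct_reduction(ols["MAE"], rf["MAE"])
r2_gain = rf["R2"] - ols["R2"]

print(f"RMSE reduction: {rmse_drop:.1f}%")
print(f"MAE reduction:  {mae_drop:.1f}%")
print(f"R2 gain:        {r2_gain:.4f}")
```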

Coefficient Interpretation (OLS)


Table: OLS Coefficient Interpretation (sorted by p-value)


Table: RQ1 — OLS Coefficient Interpretation

|                           |Variable                   |   Estimate| Std_Error|   p_value|Significant |Direction  |Interpretation                                                                                          |
|:--------------------------|:--------------------------|----------:|---------:|---------:|:-----------|:----------|:-------------------------------------------------------------------------------------------------------|
|energy_per_capita          |energy_per_capita          |  0.0000899| 0.0000013| 0.0000000|Yes         |Positive ↑ |A 1-unit increase in energy_per_capita is associated with a 9e-05 change in CO2 per capita              |
|population                 |population                 | -0.0000001| 0.0000000| 0.0000000|Yes         |Negative ↓ |A 1-unit increase in population is associated with a change of about -1e-07 in CO2 per capita           |
|total_ghg                  |total_ghg                  |  0.0250604| 0.0011054| 0.0000000|Yes         |Positive ↑ |A 1-unit increase in total_ghg is associated with a 0.02506 change in CO2 per capita                    |
|co2_per_unit_energy        |co2_per_unit_energy        |  0.0031905| 0.0002501| 0.0000000|Yes         |Positive ↑ |A 1-unit increase in co2_per_unit_energy is associated with a 0.003191 change in CO2 per capita         |
|gdp                        |gdp                        |  0.0000000| 0.0000000| 0.0000000|Yes         |Positive ↑ |A 1-unit increase in gdp is associated with a positive change that rounds to 0 at 7 decimal places      |
|methane                    |methane                    | -0.0101654| 0.0016037| 0.0000000|Yes         |Negative ↓ |A 1-unit increase in methane is associated with a -0.010165 change in CO2 per capita                    |
|coal_co2                   |coal_co2                   |  0.0027093| 0.0004325| 0.0000000|Yes         |Positive ↑ |A 1-unit increase in coal_co2 is associated with a 0.002709 change in CO2 per capita                    |
|energy_per_gdp             |energy_per_gdp             |  0.1690762| 0.0306231| 0.0000000|Yes         |Positive ↑ |A 1-unit increase in energy_per_gdp is associated with a 0.169076 change in CO2 per capita              |
|cement_co2                 |cement_co2                 | -0.0202868| 0.0039227| 0.0000002|Yes         |Negative ↓ |A 1-unit increase in cement_co2 is associated with a -0.020287 change in CO2 per capita                 |
|flaring_co2                |flaring_co2                |  0.0422615| 0.0103465| 0.0000447|Yes         |Positive ↑ |A 1-unit increase in flaring_co2 is associated with a 0.042262 change in CO2 per capita                 |
|gas_co2                    |gas_co2                    |  0.0030261| 0.0007714| 0.0000885|Yes         |Positive ↑ |A 1-unit increase in gas_co2 is associated with a 0.003026 change in CO2 per capita                     |
|land_use_change_co2        |land_use_change_co2        | -0.0010861| 0.0002922| 0.0002030|Yes         |Negative ↓ |A 1-unit increase in land_use_change_co2 is associated with a -0.001086 change in CO2 per capita        |
|nitrous_oxide              |nitrous_oxide              |  0.0180532| 0.0049156| 0.0002421|Yes         |Positive ↑ |A 1-unit increase in nitrous_oxide is associated with a 0.018053 change in CO2 per capita               |
|primary_energy_consumption |primary_energy_consumption | -0.0006105| 0.0003799| 0.1080945|No          |Negative ↓ |A 1-unit increase in primary_energy_consumption is associated with a -0.000611 change in CO2 per capita |
|oil_co2                    |oil_co2                    |  0.0004693| 0.0005264| 0.3726626|No          |Positive ↑ |A 1-unit increase in oil_co2 is associated with a 0.000469 change in CO2 per capita                     |

Significant POSITIVE drivers of CO2 per capita:
 + energy_per_capita
 + total_ghg
 + co2_per_unit_energy
 + gdp
 + coal_co2
 + energy_per_gdp
 + flaring_co2
 + gas_co2
 + nitrous_oxide 

Significant NEGATIVE drivers of CO2 per capita:
 - population
 - methane
 - cement_co2
 - land_use_change_co2 

Key findings:
  energy_per_capita   : +0.000090 per unit → strongest positive energy driver
  total_ghg           : +0.025060 per unit → broader GHG intensity lifts CO2
  population          : -0.000000 per unit → larger populations dilute per-capita emissions
  methane             : -0.010165 per unit → substitution effect with CO2 sources

RQ1 ANSWER: Significant predictors (p < 0.05 in OLS summary above)
explain variation in co2_per_capita. RF importance confirms ranking.

The OLS coefficients were ranked by p-value to identify the most reliable significant predictors of co2_per_capita.

Significant POSITIVE drivers of CO2 per capita:

  • energy_per_capita (β = 0.0000899): the most precisely estimated positive driver; each additional unit of energy consumed per capita is associated with ~0.00009 tonnes more CO2 per capita, holding the other predictors constant
  • total_ghg (β = 0.025060): higher total greenhouse gas emissions are associated with higher CO2 per capita
  • co2_per_unit_energy (β = 0.003191): a more carbon-intensive energy mix (more CO2 per unit of energy) is associated with higher per-capita emissions
  • gdp (β ≈ 0.0000000): statistically significant but practically negligible in magnitude per unit of GDP
  • coal_co2, energy_per_gdp, flaring_co2, gas_co2, nitrous_oxide: all positively and significantly associated with CO2 per capita

Significant NEGATIVE drivers of CO2 per capita:

  • population (β ≈ −0.0000001): larger populations are associated with slightly lower per-capita emissions, consistent with a “dilution effect” where total emissions are spread across more people
  • methane (β = −0.010165): may reflect a substitution effect, where countries with higher methane emissions (e.g., from agriculture) tend to have relatively lower fossil-fuel-driven CO2 per capita; given methane’s high VIF (37.67), the sign should be interpreted cautiously
  • cement_co2 (β = −0.020287): also subject to severe multicollinearity (VIF = 24.15), so the negative sign may not reflect a genuine effect
  • land_use_change_co2 (β = −0.001086)

Non-significant predictors: primary_energy_consumption (p = 0.108) and oil_co2 (p = 0.373) did not show statistically significant effects in this model.

RQ1 ANSWER: The OLS results identify energy_per_capita, total_ghg, co2_per_unit_energy, coal_co2, energy_per_gdp, gas_co2, nitrous_oxide, and flaring_co2 as significant positive predictors (gdp is also significant and positive, but its coefficient is negligible in magnitude), and population, methane, cement_co2, and land_use_change_co2 as significant negative predictors of CO2 per capita. The Random Forest feature importance rankings (Section 2.5.9) corroborate the central finding: energy_per_capita is the dominant driver under both modelling approaches. In answer to RQ1, energy consumption per person, rather than raw economic size, is the primary structural determinant of a country’s carbon footprint per person.

Classification Analysis

Objective 2

  • To classify countries into emission intensity categories (Low / Medium / High) based on economic structure, energy consumption patterns, and fossil fuel dependence (1990–2024)

Research Question 2

  • Can countries be accurately classified into low, medium, and high emission intensity groups based on their economic structure, energy consumption, and fossil fuel dependence from 1990–2024?

The target variable, emission_level, is created by dividing CO2 emissions per capita (co2_per_capita) into three tertile-based categories: Low, Medium, and High. This transformation enables the use of classification techniques to predict a country’s emission category based on a range of economic, energy, and environmental indicators.
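The tertile binning can be sketched as follows. This Python example is illustrative only and is not the code used in this study: the function name, the toy values, and the treatment of values that fall exactly on a cut point (assigned to the lower class here) are all assumptions.

```python
import statistics

def tertile_labels(values):
    """Assign Low / Medium / High using the two tertile cut points of `values`."""
    low_cut, high_cut = statistics.quantiles(values, n=3)
    labels = []
    for v in values:
        if v <= low_cut:
            labels.append("Low")
        elif v <= high_cut:
            labels.append("Medium")
        else:
            labels.append("High")
    return labels

# Hypothetical per-capita values (tonnes), for illustration only.
co2_per_capita = [0.3, 0.9, 1.4, 2.8, 4.1, 5.6, 7.9, 10.2, 15.7]
emission_level = tertile_labels(co2_per_capita)
counts = {lvl: emission_level.count(lvl) for lvl in ("Low", "Medium", "High")}
print(counts)
```

Because the cut points are quantiles of the data, the three classes come out approximately equal in size, which keeps the classification task balanced by construction.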

To address the research question, three classification models are employed. Model C1, Multinomial Logistic Regression, serves as an interpretable baseline model that estimates the probability of belonging to each emission category using linear decision boundaries. Model C2, a Decision Tree (CART) with a maximum depth of six, provides a rule-based and easily visualisable classification approach without requiring feature scaling. Model C3, a Random Forest Classifier consisting of 200 decision trees, is used as an ensemble learning method that typically offers higher predictive performance while also providing feature importance rankings to identify the most influential predictors.

All classification models are trained and evaluated using a stratified 80:20 train-test split to preserve the distribution of emission categories across both datasets. Model performance is assessed using Accuracy, Precision, Recall, and F1-score, allowing a comprehensive comparison of predictive effectiveness across the three approaches.
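A stratified split samples the same share of rows from each class, so that the class balance carries over to both partitions. The Python sketch below is a minimal illustration, not the code used in this study: the function name, the seed, and the choice to round each class’s training share up are assumptions, and the rows are placeholders built from the class counts reported in the classification dataset summary.

```python
import random

def stratified_split(labels, num=4, den=5, seed=42):
    """Return (train_idx, test_idx), sampling num/den of each class for training."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # Ceiling of (num/den) * class size, using integer arithmetic only.
        n_train = -(-len(idx) * num // den)
        train_idx += idx[:n_train]
        test_idx += idx[n_train:]
    return train_idx, test_idx

# Class sizes from the reported class distribution; the rows are placeholders.
labels = ["Low"] * 2456 + ["Medium"] * 2456 + ["High"] * 2530
train_idx, test_idx = stratified_split(labels)
test_counts = {c: sum(labels[i] == c for i in test_idx) for c in ("Low", "Medium", "High")}
print(len(train_idx), len(test_idx), test_counts)
```

Under these assumptions the sketch yields 5954 training rows and 1488 test rows, matching the split sizes reported in this analysis.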

Classification Analysis (RQ2)

Classification dataset : 7442 rows, 13 predictors, 3 classes
Class distribution:

   Low Medium   High 
  2456   2456   2530 
Class proportions:

   Low Medium   High 
  0.33   0.33   0.34 
Train rows: 5954 | Test rows: 1488

Model C1: Multinomial Logistic Regression

Predictors centred and scaled; 5-fold cross-validation.

  Logistic Regression
    Accuracy  : 0.8649
    Precision : 0.8676 (macro avg)
    Recall    : 0.8652 (macro avg)
    F1 Score  : 0.8656 (macro avg)

  Confusion Matrix:
          Reference
Prediction Low Medium High
    Low    456     40    0
    Medium  35    408   83
    High     0     43  423

Predictors were centred and scaled, and the model was validated using 5-fold cross-validation. Performance on the test set:

  • Accuracy: 0.8649
  • Precision (macro avg): 0.8676
  • Recall (macro avg): 0.8652
  • F1 Score (macro avg): 0.8656

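Confusion Matrix (rows = predicted class, columns = reference class):

|Prediction | Low| Medium| High|
|:----------|---:|------:|----:|
|Low        | 456|     40|    0|
|Medium     |  35|    408|   83|
|High       |   0|     43|  423|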

The model performs strongly overall (~86.5% accuracy), with clean separation between the Low and High classes (zero misclassifications between these two extremes). Most errors occur at the boundaries of the Medium class, particularly High observations predicted as Medium (83 cases), reflecting the difficulty a linear model has in capturing the transition zone between adjacent emission tiers.
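The reported accuracy and macro-averaged metrics can be reproduced from the confusion matrix alone. The Python sketch below is illustrative and not the code used in this study; it assumes, as in the console output, that rows are predictions and columns are reference classes.

```python
# Confusion matrix for the logistic regression model
# (rows = predicted class, columns = reference class).
classes = ["Low", "Medium", "High"]
cm = [
    [456, 40, 0],    # predicted Low
    [35, 408, 83],   # predicted Medium
    [0, 43, 423],    # predicted High
]

def macro(values):
    """Unweighted mean across classes."""
    return sum(values) / len(values)

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / total

precision, recall, f1 = [], [], []
for i in range(3):
    predicted_i = sum(cm[i])                     # row sum: all predicted as class i
    actual_i = sum(cm[r][i] for r in range(3))   # column sum: all truly class i
    p = cm[i][i] / predicted_i
    r = cm[i][i] / actual_i
    precision.append(p)
    recall.append(r)
    f1.append(2 * p * r / (p + r))

print(f"Accuracy  : {accuracy:.4f}")
print(f"Precision : {macro(precision):.4f} (macro avg)")
print(f"Recall    : {macro(recall):.4f} (macro avg)")
print(f"F1 Score  : {macro(f1):.4f} (macro avg)")
```

The per-class values also show where the model is weakest: precision and recall for the Medium class are noticeably lower than for Low and High.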

Model C2: Decision Tree (CART)

max_depth = 6; no scaling required for tree-based models.


  Decision Tree
    Accuracy  : 0.8763
    Precision : 0.8759 (macro avg)
    Recall    : 0.8759 (macro avg)
    F1 Score  : 0.8754 (macro avg)

  Confusion Matrix:
          Reference
Prediction Low Medium High
    Low    453     35    0
    Medium  38    385   40
    High     0     71  466



Table: Decision Tree — Variable Importance

|                           |Variable                   | Importance|
|:--------------------------|:--------------------------|----------:|
|energy_per_capita          |energy_per_capita          |    2685.29|
|energy_per_gdp             |energy_per_gdp             |    1103.84|
|primary_energy_consumption |primary_energy_consumption |     776.06|
|total_ghg                  |total_ghg                  |     446.58|
|gas_co2                    |gas_co2                    |     361.20|
|flaring_co2                |flaring_co2                |     345.40|
|gdp                        |gdp                        |     231.81|
|population                 |population                 |     101.68|
|oil_co2                    |oil_co2                    |      77.45|
|methane                    |methane                    |      62.77|
|nitrous_oxide              |nitrous_oxide              |      50.12|

A decision tree with max_depth = 6 was fitted; no feature scaling was required.

Decision rules (top levels): The tree’s primary split is on energy_per_capita < 5864, immediately separating a large portion of “Low” emission countries. Subsequent splits use energy_per_capita < 24,000, population, and total_ghg to further refine Medium vs. High classifications, with terminal nodes achieving purity levels as high as 91% for both the lowest (Low) and highest (High) emission groups.

Performance metrics:

  • Accuracy: 0.8763
  • Precision (macro avg): 0.8759
  • Recall (macro avg): 0.8759
  • F1 Score (macro avg): 0.8754

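Confusion Matrix (rows = predicted class, columns = reference class):

|Prediction | Low| Medium| High|
|:----------|---:|------:|----:|
|Low        | 453|     35|    0|
|Medium     |  38|    385|   40|
|High       |   0|     71|  466|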

The Decision Tree slightly outperforms the Logistic Regression model (87.6% vs. 86.5% accuracy), while offering the added benefit of an interpretable, rule-based structure. As with the logistic model, the primary source of error remains misclassification between Medium and High classes (71 Medium observations predicted as High).

Variable importance (based on total reduction in node impurity) confirms energy_per_capita (2685.29) as overwhelmingly the most influential predictor, followed by energy_per_gdp (1103.84) and primary_energy_consumption (776.06). Lower-ranked variables include total_ghg, gas_co2, flaring_co2, gdp, population, oil_co2, methane, and nitrous_oxide.

Model C3: Random Forest Classifier

200 trees; importance = TRUE; no scaling required.

  Random Forest Classifier
    Accuracy  : 0.9758
    Precision : 0.9758 (macro avg)
    Recall    : 0.9758 (macro avg)
    F1 Score  : 0.9758 (macro avg)

  Confusion Matrix:
          Reference
Prediction Low Medium High
    Low    485      6    0
    Medium   6    473   12
    High     0     12  494


Random Forest Classifier — Feature Importance (Mean Decrease Gini):


Table: RF Classifier — Feature Importance

|                           |Variable                   |      Gini|
|:--------------------------|:--------------------------|---------:|
|energy_per_capita          |energy_per_capita          | 1735.8907|
|population                 |population                 |  335.7887|
|energy_per_gdp             |energy_per_gdp             |  329.2803|
|primary_energy_consumption |primary_energy_consumption |  289.5220|
|oil_co2                    |oil_co2                    |  207.1252|
|coal_co2                   |coal_co2                   |  178.5244|
|total_ghg                  |total_ghg                  |  160.6960|
|methane                    |methane                    |  150.8876|
|nitrous_oxide              |nitrous_oxide              |  142.9739|
|gas_co2                    |gas_co2                    |  135.8080|
|cement_co2                 |cement_co2                 |   94.7816|
|gdp                        |gdp                        |   88.4845|
|flaring_co2                |flaring_co2                |   64.8478|

A Random Forest Classifier with 200 trees was fitted (importance = TRUE); no scaling was required.

Performance metrics:

  • Accuracy: 0.9758
  • Precision (macro avg): 0.9758
  • Recall (macro avg): 0.9758
  • F1 Score (macro avg): 0.9758

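Confusion Matrix (rows = predicted class, columns = reference class):

|Prediction | Low| Medium| High|
|:----------|---:|------:|----:|
|Low        | 485|      6|    0|
|Medium     |   6|    473|   12|
|High       |   0|     12|  494|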

The Random Forest Classifier dramatically outperforms both prior models, achieving 97.6% accuracy with near-perfect, balanced precision, recall, and F1 scores across all three classes. Misclassifications are minimal and confined entirely to adjacent classes (Medium↔High and Low↔Medium), with zero confusion between the extreme Low and High classes, indicating that the model separates the emission intensity tiers reliably on the test set.

Feature Importance (Mean Decrease Gini):

  1. energy_per_capita (1735.89)
  2. population (335.79)
  3. energy_per_gdp (329.28)
  4. primary_energy_consumption (289.52)
  5. oil_co2 (207.13)
  6. coal_co2 (178.52)
  7. total_ghg (160.70)
  8. methane (150.89)
  9. nitrous_oxide (142.97)
  10. gas_co2 (135.81)
  11. cement_co2 (94.78)
  12. gdp (88.48)
  13. flaring_co2 (64.85)

As with the regression task, energy_per_capita is by far the most discriminative variable for distinguishing between Low, Medium, and High emission countries — over 5 times more important than the next-ranked variable (population).

Classification Model Comparison



Table: Classification Model Comparison — All Metrics

|Model                    | Accuracy| Precision| Recall|     F1|
|:------------------------|--------:|---------:|------:|------:|
|Logistic Regression      |   0.8649|    0.8676| 0.8652| 0.8656|
|Decision Tree            |   0.8763|    0.8759| 0.8759| 0.8754|
|Random Forest Classifier |   0.9758|    0.9758| 0.9758| 0.9758|

Best model by F1 Score: Random Forest Classifier


RQ2 ANSWER: See F1 and accuracy scores above.
RF feature importance identifies which structural predictors
best discriminate Low / Medium / High emission intensity tiers.

The grouped bar chart comparison across Accuracy, F1, Precision, and Recall visually confirms the Random Forest’s clear superiority (~0.976 across all metrics), compared to ~0.865–0.876 for the Logistic Regression and Decision Tree models. The progression from a linear model (Logistic Regression) to a single non-linear rule-based model (Decision Tree) to an ensemble method (Random Forest) shows consistent, substantial gains in classification performance — mirroring the pattern observed in the regression task (Section 2.5.10).

RQ2 ANSWER: Yes — countries can be accurately classified into Low, Medium, and High emission intensity groups based on economic structure, energy consumption, and fossil fuel dependence. The Random Forest Classifier achieves 97.6% accuracy and macro-F1, substantially outperforming both the Logistic Regression and Decision Tree baselines. Feature importance rankings consistently identify energy_per_capita, population, energy_per_gdp, and primary_energy_consumption as the most discriminative structural predictors.

RQ2: Country Classification Using RF Model


Table: Country Classification by Emission Level (RF Model)


Table: RQ2 — Country Emission Level Classification

|country                          | Avg_CO2_Per_Capita|Actual_Class |Predicted_Class |Match   |
|:--------------------------------|------------------:|:------------|:---------------|:-------|
|Bahrain                          |             16.010|High         |High            |Correct |
|Kuwait                           |             16.010|High         |High            |Correct |
|Qatar                            |             16.010|High         |High            |Correct |
|United Arab Emirates             |             16.010|High         |High            |Correct |
|Brunei                           |             15.971|High         |High            |Correct |
|Saudi Arabia                     |             15.947|High         |High            |Correct |
|Australia                        |             15.840|High         |High            |Correct |
|United States                    |             15.749|High         |High            |Correct |
|Canada                           |             15.605|High         |High            |Correct |
|Trinidad and Tobago              |             15.399|High         |High            |Correct |
|Luxembourg                       |             15.331|High         |High            |Correct |
|Sint Maarten (Dutch part)        |             15.016|High         |High            |Correct |
|Curacao                          |             14.795|High         |High            |Correct |
|Faroe Islands                    |             13.749|High         |High            |Correct |
|Kazakhstan                       |             12.897|High         |High            |Correct |
|Oman                             |             12.838|High         |High            |Correct |
|New Caledonia                    |             12.392|High         |High            |Correct |
|Estonia                          |             12.246|High         |High            |Correct |
|Palau                            |             11.806|High         |High            |Correct |
|Russia                           |             11.462|High         |High            |Correct |
|Czechia                          |             11.407|High         |High            |Correct |
|Aruba                            |             11.214|High         |High            |Correct |
|Belgium                          |             10.587|High         |High            |Correct |
|Saint Pierre and Miquelon        |             10.514|High         |High            |Correct |
|Greenland                        |             10.464|High         |High            |Correct |
|Taiwan                           |             10.439|High         |High            |Correct |
|South Korea                      |             10.435|High         |High            |Correct |
|Singapore                        |             10.348|High         |High            |Correct |
|Germany                          |             10.255|High         |High            |Correct |
|Finland                          |             10.233|High         |High            |Correct |
|Iceland                          |             10.051|High         |High            |Correct |
|Netherlands                      |             10.007|High         |High            |Correct |
|Turkmenistan                     |              9.475|High         |High            |Correct |
|Japan                            |              9.454|High         |High            |Correct |
|Ireland                          |              9.409|High         |High            |Correct |
|Libya                            |              9.135|High         |High            |Correct |
|Denmark                          |              8.931|High         |High            |Correct |
|Bermuda                          |              8.850|High         |High            |Correct |
|Norway                           |              8.732|High         |High            |Correct |
|Poland                           |              8.658|High         |High            |Correct |
|Israel                           |              8.389|High         |High            |Correct |
|Greece                           |              8.115|High         |High            |Correct |
|Austria                          |              8.092|High         |High            |Correct |
|United Kingdom                   |              8.075|High         |High            |Correct |
|Anguilla                         |              8.074|High         |High            |Correct |
|South Africa                     |              8.070|High         |High            |Correct |
|New Zealand                      |              7.832|High         |High            |Correct |
|Slovenia                         |              7.473|High         |High            |Correct |
|Slovakia                         |              7.462|High         |High            |Correct |
|Turks and Caicos Islands         |              7.303|High         |High            |Correct |
|Italy                            |              7.099|High         |High            |Correct |
|Montserrat                       |              6.681|High         |High            |Correct |
|Andorra                          |              6.657|High         |High            |Correct |
|Cyprus                           |              6.628|High         |High            |Correct |
|Belarus                          |              6.591|High         |High            |Correct |
|Ukraine                          |              6.570|High         |High            |Correct |
|Iran                             |              6.484|High         |High            |Correct |
|Bulgaria                         |              6.460|High         |High            |Correct |
|Malaysia                         |              6.393|High         |High            |Correct |
|Spain                            |              6.323|High         |High            |Correct |
|Bahamas                          |              6.111|High         |High            |Correct |
|Serbia                           |              6.108|High         |High            |Correct |
|France                           |              5.977|High         |High            |Correct |
|British Virgin Islands           |              5.872|High         |High            |Correct |
|Antigua and Barbuda              |              5.688|High         |High            |Correct |
|Malta                            |              5.551|High         |High            |Correct |
|Liechtenstein                    |              5.548|High         |High            |Correct |
|Hong Kong                        |              5.542|High         |High            |Correct |
|Switzerland                      |              5.465|High         |High            |Correct |
|Hungary                          |              5.463|High         |High            |Correct |
|Sweden                           |              5.439|High         |High            |Correct |
|Bonaire Sint Eustatius and Saba  |              5.283|High         |High            |Correct |
|Venezuela                        |              5.264|High         |High            |Correct |
|China                            |              5.150|High         |High            |Correct |
|Bhutan                           |              1.151|Low          |Low             |Correct |
|Tonga                            |              1.076|Low          |Low             |Correct |
|Lesotho                          |              1.009|Low          |Low             |Correct |
|El Salvador                      |              0.989|Low          |Low             |Correct |
|Zimbabwe                         |              0.965|Low          |Low             |Correct |
|Laos                             |              0.954|Low          |Low             |Correct |
|Philippines                      |              0.948|Low          |Low             |Correct |
|Eswatini                         |              0.942|Low          |Low             |Correct |
|Honduras                         |              0.904|Low          |Low             |Correct |
|Samoa                            |              0.903|Low          |Low             |Correct |
|Paraguay                         |              0.879|Low          |Low             |Correct |
|Cape Verde                       |              0.858|Low          |Low             |Correct |
|Guatemala                        |              0.842|Low          |Low             |Correct |
|Tuvalu                           |              0.841|Low          |Low             |Correct |
|Angola                           |              0.820|Low          |Low             |Correct |
|Nicaragua                        |              0.735|Low          |Low             |Correct |
|Pakistan                         |              0.730|Low          |Low             |Correct |
|Yemen                            |              0.664|Low          |Low             |Correct |
|Papua New Guinea                 |              0.663|Low          |Low             |Correct |
|Tajikistan                       |              0.657|Low          |Low             |Correct |
|Nigeria                          |              0.628|Low          |Low             |Correct |
|Mauritania                       |              0.627|Low          |Low             |Correct |
|Sri Lanka                        |              0.616|Low          |Low             |Correct |
|Palestine                        |              0.554|Low          |Low             |Correct |
|Senegal                          |              0.519|Low          |Low             |Correct |
|Solomon Islands                  |              0.502|Low          |Low             |Correct |
|Sao Tome and Principe            |              0.497|Low          |Low             |Correct |
|Vanuatu                          |              0.475|Low          |Low             |Correct |
|Djibouti                         |              0.474|Low          |Low             |Correct |
|Kiribati                         |              0.451|Low          |Low             |Correct |
|Cambodia                         |              0.446|Low          |Low             |Correct |
|Ghana                            |              0.382|Low          |Low             |Correct |
|Cote d'Ivoire                    |              0.379|Low          |Low             |Correct |
|Cameroon                         |              0.363|Low          |Low             |Correct |
|East Timor                       |              0.340|Low          |Low             |Correct |
|Bangladesh                       |              0.336|Low          |Low             |Correct |
|Benin                            |              0.327|Low          |Low             |Correct |
|Sudan                            |              0.320|Low          |Low             |Correct |
|Myanmar                          |              0.305|Low          |Low             |Correct |
|Zambia                           |              0.303|Low          |Low             |Correct |
|Togo                             |              0.298|Low          |Low             |Correct |
|Kenya                            |              0.296|Low          |Low             |Correct |
|Comoros                          |              0.272|Low          |Low             |Correct |
|Eritrea                          |              0.229|Low          |Low             |Correct |
|Nepal                            |              0.227|Low          |Low             |Correct |
|Gambia                           |              0.217|Low          |Low             |Correct |
|Guinea                           |              0.212|Low          |Low             |Correct |
|Haiti                            |              0.203|Low          |Low             |Correct |
|Afghanistan                      |              0.175|Low          |Low             |Correct |
|Liberia                          |              0.167|Low          |Low             |Correct |
|Tanzania                         |              0.162|Low          |Low             |Correct |
|Mali                             |              0.146|Low          |Low             |Correct |
|Mozambique                       |              0.143|Low          |Low             |Correct |
|Burkina Faso                     |              0.140|Low          |Low             |Correct |
|South Sudan                      |              0.115|Low          |Low             |Correct |
|Sierra Leone                     |              0.109|Low          |Low             |Correct |
|Madagascar                       |              0.106|Low          |Low             |Correct |
|Chad                             |              0.101|Low          |Low             |Correct |
|Guinea-Bissau                    |              0.090|Low          |Low             |Correct |
|Uganda                           |              0.086|Low          |Low             |Correct |
|Ethiopia                         |              0.085|Low          |Low             |Correct |
|Niger                            |              0.079|Low          |Low             |Correct |
|Rwanda                           |              0.078|Low          |Low             |Correct |
|Malawi                           |              0.077|Low          |Low             |Correct |
|Somalia                          |              0.070|Low          |Low             |Correct |
|Democratic Republic of Congo     |              0.051|Low          |Low             |Correct |
|Central African Republic         |              0.050|Low          |Low             |Correct |
|Burundi                          |              0.040|Low          |Low             |Correct |
|Mongolia                         |              6.960|Medium       |Medium          |Correct |
|Nauru                            |              6.798|Medium       |Medium          |Correct |
|Portugal                         |              5.198|Medium       |Medium          |Correct |
|Barbados                         |              4.898|Medium       |Medium          |Correct |
|Lithuania                        |              4.712|Medium       |Medium          |Correct |
|Romania                          |              4.645|Medium       |Medium          |Correct |
|Croatia                          |              4.537|Medium       |Medium          |Correct |
|Bosnia and Herzegovina           |              4.462|Medium       |Medium          |Correct |
|Suriname                         |              4.339|Medium       |Medium          |Correct |
|Uzbekistan                       |              4.293|Medium       |Medium          |Correct |
|Saint Kitts and Nevis            |              4.256|Medium       |Medium          |Correct |
|Turkey                           |              4.104|Medium       |Medium          |Correct |
|Azerbaijan                       |              4.091|Medium       |Medium          |Correct |
|Seychelles                       |              4.041|Medium       |Medium          |Correct |
|North Macedonia                  |              4.008|Medium       |Medium          |Correct |
|Argentina                        |              3.988|Medium       |Medium          |Correct |
|Mexico                           |              3.941|Medium       |Medium          |Correct |
|Latvia                           |              3.902|Medium       |Medium          |Correct |
|Equatorial Guinea                |              3.882|Medium       |Medium          |Correct |
|Cook Islands                     |              3.818|Medium       |Medium          |Correct |
|Chile                            |              3.806|Medium       |Medium          |Correct |
|Iraq                             |              3.800|Medium       |Medium          |Correct |
|Gabon                            |              3.647|Medium       |Medium          |Correct |
|Lebanon                          |              3.516|Medium       |Medium          |Correct |
|Algeria                          |              3.431|Medium       |Medium          |Correct |
|Jamaica                          |              3.334|Medium       |Medium          |Correct |
|Montenegro                       |              3.211|Medium       |Medium          |Correct |
|Thailand                         |              3.182|Medium       |Medium          |Correct |
|Niue                             |              3.093|Medium       |Medium          |Correct |
|French Polynesia                 |              2.958|Medium       |Medium          |Correct |
|North Korea                      |              2.784|Medium       |Medium          |Correct |
|Jordan                           |              2.750|Medium       |Medium          |Correct |
|Guyana                           |              2.659|Medium       |Medium          |Correct |
|Mauritius                        |              2.613|Medium       |Medium          |Correct |
|Macao                            |              2.585|Medium       |Medium          |Correct |
|Marshall Islands                 |              2.556|Medium       |Medium          |Correct |
|Syria                            |              2.500|Medium       |Medium          |Correct |
|Botswana                         |              2.388|Medium       |Medium          |Correct |
|Saint Lucia                      |              2.388|Medium       |Medium          |Correct |
|Maldives                         |              2.346|Medium       |Medium          |Correct |
|Cuba                             |              2.277|Medium       |Medium          |Correct |
|Tunisia                          |              2.277|Medium       |Medium          |Correct |
|Ecuador                          |              2.126|Medium       |Medium          |Correct |
|Dominican Republic               |              2.115|Medium       |Medium          |Correct |
|Panama                           |              2.096|Medium       |Medium          |Correct |
|Brazil                           |              2.078|Medium       |Medium          |Correct |
|Grenada                          |              2.075|Medium       |Medium          |Correct |
|Moldova                          |              2.014|Medium       |Medium          |Correct |
|Egypt                            |              1.983|Medium       |Medium          |Correct |
|Dominica                         |              1.900|Medium       |Medium          |Correct |
|Georgia                          |              1.891|Medium       |Medium          |Correct |
|Wallis and Futuna                |              1.884|Medium       |Medium          |Correct |
|Saint Helena                     |              1.872|Medium       |Medium          |Correct |
|Uruguay                          |              1.869|Medium       |Medium          |Correct |
|Saint Vincent and the Grenadines |              1.790|Medium       |Medium          |Correct |
|Indonesia                        |              1.708|Medium       |Medium          |Correct |
|Belize                           |              1.704|Medium       |Medium          |Correct |
|Colombia                         |              1.687|Medium       |Medium          |Correct |
|Armenia                          |              1.640|Medium       |Medium          |Correct |
|Vietnam                          |              1.551|Medium       |Medium          |Correct |
|Bolivia                          |              1.518|Medium       |Medium          |Correct |
|Kyrgyzstan                       |              1.509|Medium       |Medium          |Correct |
|Costa Rica                       |              1.490|Medium       |Medium          |Correct |
|Morocco                          |              1.456|Medium       |Medium          |Correct |
|Albania                          |              1.342|Medium       |Medium          |Correct |
|Micronesia (country)             |              1.336|Medium       |Medium          |Correct |
|Peru                             |              1.323|Medium       |Medium          |Correct |
|India                            |              1.281|Medium       |Medium          |Correct |
|Namibia                          |              1.187|Medium       |Medium          |Correct |
|Fiji                             |              1.174|Medium       |Medium          |Correct |
|Congo                            |              1.122|Medium       |Medium          |Correct |

 LOW emission countries:
Bhutan, Tonga, Lesotho, El Salvador, Zimbabwe, Laos, Philippines, Eswatini, Honduras, Samoa, Paraguay, Cape Verde, Guatemala, Tuvalu, Angola, Nicaragua, Pakistan, Yemen, Papua New Guinea, Tajikistan, Nigeria, Mauritania, Sri Lanka, Palestine, Senegal, Solomon Islands, Sao Tome and Principe, Vanuatu, Djibouti, Kiribati, Cambodia, Ghana, Cote d'Ivoire, Cameroon, East Timor, Bangladesh, Benin, Sudan, Myanmar, Zambia, Togo, Kenya, Comoros, Eritrea, Nepal, Gambia, Guinea, Haiti, Afghanistan, Liberia, Tanzania, Mali, Mozambique, Burkina Faso, South Sudan, Sierra Leone, Madagascar, Chad, Guinea-Bissau, Uganda, Ethiopia, Niger, Rwanda, Malawi, Somalia, Democratic Republic of Congo, Central African Republic, Burundi 

 MEDIUM emission countries:
Mongolia, Nauru, Portugal, Barbados, Lithuania, Romania, Croatia, Bosnia and Herzegovina, Suriname, Uzbekistan, Saint Kitts and Nevis, Turkey, Azerbaijan, Seychelles, North Macedonia, Argentina, Mexico, Latvia, Equatorial Guinea, Cook Islands, Chile, Iraq, Gabon, Lebanon, Algeria, Jamaica, Montenegro, Thailand, Niue, French Polynesia, North Korea, Jordan, Guyana, Mauritius, Macao, Marshall Islands, Syria, Botswana, Saint Lucia, Maldives, Cuba, Tunisia, Ecuador, Dominican Republic, Panama, Brazil, Grenada, Moldova, Egypt, Dominica, Georgia, Wallis and Futuna, Saint Helena, Uruguay, Saint Vincent and the Grenadines, Indonesia, Belize, Colombia, Armenia, Vietnam, Bolivia, Kyrgyzstan, Costa Rica, Morocco, Albania, Micronesia (country), Peru, India, Namibia, Fiji, Congo 

 HIGH emission countries:
Bahrain, Kuwait, Qatar, United Arab Emirates, Brunei, Saudi Arabia, Australia, United States, Canada, Trinidad and Tobago, Luxembourg, Sint Maarten (Dutch part), Curacao, Faroe Islands, Kazakhstan, Oman, New Caledonia, Estonia, Palau, Russia, Czechia, Aruba, Belgium, Saint Pierre and Miquelon, Greenland, Taiwan, South Korea, Singapore, Germany, Finland, Iceland, Netherlands, Turkmenistan, Japan, Ireland, Libya, Denmark, Bermuda, Norway, Poland, Israel, Greece, Austria, United Kingdom, Anguilla, South Africa, New Zealand, Slovenia, Slovakia, Turks and Caicos Islands, Italy, Montserrat, Andorra, Cyprus, Belarus, Ukraine, Iran, Bulgaria, Malaysia, Spain, Bahamas, Serbia, France, British Virgin Islands, Antigua and Barbuda, Malta, Liechtenstein, Hong Kong, Switzerland, Hungary, Sweden, Bonaire Sint Eustatius and Saba, Venezuela, China 

Total countries correctly classified : 213 / 213
Misclassified countries              : 0

RQ2 ANSWER: The RF model classifies countries into Low / Medium / High
emission tiers with 97.6% accuracy. See country table above for
the full breakdown of which nations fall into each emission class.

The trained Random Forest model was applied to classify all 213 countries based on their average co2_per_capita (1990–2024) into Low, Medium, or High emission tiers, and compared against their actual tertile-based classification.

Results: 213 / 213 countries correctly classified (100% accuracy), with 0 misclassified countries.

  • High emission countries (average CO2 per capita ranging from ~5.15 to 16.01 tonnes) are dominated by oil/gas-exporting Gulf states (Qatar, Kuwait, UAE, Bahrain, Saudi Arabia, Brunei), high-income industrialised nations (Australia, United States, Canada, Luxembourg, Norway, Germany, Netherlands, Japan, South Korea), and several small island/territory economies (Curaçao, Sint Maarten, Faroe Islands, Bermuda).

  • Medium emission countries (ranging from ~1.12 to ~6.96 tonnes) include a broad mix of upper-middle-income and transition economies such as Mongolia, Portugal, Romania, Argentina, Mexico, Turkey, Brazil, Thailand, and Indonesia.

  • Low emission countries (ranging from ~0.04 to ~1.15 tonnes) are predominantly low-income, often agriculture-dependent nations, including many Sub-Saharan African countries (Ethiopia, Niger, Rwanda, Malawi, Somalia, DR Congo, Burundi, etc.), as well as Bhutan, Tonga, Lesotho, El Salvador, and several Pacific island and South/Southeast Asian nations.

This perfect country-level result indicates that the Random Forest model has captured the link between structural economic and energy indicators and emission intensity tiers. Because the comparison covers the full dataset, including rows used for training, it is not an independent test of generalisation; the 97.6% held-out test accuracy remains the appropriate measure of predictive performance.

ANALYSIS COMPLETE
  PART 4 (Regression)     — RQ1 answered
  PART 5 (Classification) — RQ2 answered

Summary: Analysis Complete

  • PART 4 (Regression) — RQ1 answered: Energy consumption per capita, total GHG intensity, and energy-related CO2 components are the dominant significant predictors of CO2 emissions per capita, with the Random Forest Regression model (R² = 0.9937) vastly outperforming OLS (R² = 0.6670).

  • PART 5 (Classification) — RQ2 answered: Countries can be classified into Low/Medium/High emission intensity tiers with high accuracy, with the Random Forest Classifier (97.6% accuracy/F1) outperforming Logistic Regression (86.5%) and Decision Tree (87.6%) models, and achieving perfect (213/213) country-level classification.

Discussion

Discussion — Regression Analysis (RQ1)

Overview

To address RQ1, two regression models were trained on a stratified 80/20 split (Train: 5,954 rows; Test: 1,488 rows) using 15 predictors spanning economic scale, energy structure, fossil fuel sources, and broader greenhouse gas factors. The models — Multiple Linear Regression (OLS) and Random Forest Regression — were evaluated on RMSE, MAE, and R².
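
As a point of reference for the three evaluation metrics, the following is a minimal base R sketch of how RMSE, MAE, and R² can be computed from observed and predicted values. The vectors are hypothetical and are not drawn from the report's data, and R² is computed here as 1 − SSE/SST, which is an assumption about the definition used in the analysis.

```r
# Hypothetical observed and predicted values (not taken from the report's data)
actual    <- c(0.5, 1.2, 4.8, 9.6, 15.0)
predicted <- c(0.7, 1.0, 5.1, 9.0, 14.2)

regression_metrics <- function(actual, predicted) {
  residuals <- actual - predicted
  rmse <- sqrt(mean(residuals^2))   # root mean squared error
  mae  <- mean(abs(residuals))      # mean absolute error
  # Assumed definition: 1 - (residual sum of squares / total sum of squares)
  r2   <- 1 - sum(residuals^2) / sum((actual - mean(actual))^2)
  c(RMSE = rmse, MAE = mae, R2 = r2)
}

metrics <- regression_metrics(actual, predicted)
print(round(metrics, 4))
```

RMSE penalises large errors more heavily than MAE, so a wide gap between the two for a given model suggests that a few observations are predicted poorly.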

OLS Findings — Coefficient Interpretation

The OLS model achieved an R² of 0.667 on the test set, explaining approximately 66.7% of the variance in CO2 per capita. 13 out of 15 predictors were statistically significant (p < 0.05), indicating that the selected variables collectively provide strong explanatory power.

Energy structure variables emerged as the dominant positive drivers:

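  • energy_per_capita had a positive coefficient (+0.0000899). The value is numerically small because it is expressed per single unit of a predictor that takes large numeric values, so it cannot be compared directly with the coefficients of predictors measured on other scales. It confirms that countries consuming more energy per person consistently emit more CO2 per capita. This aligns with expectations, as higher individual energy demand translates directly into greater combustion of fossil fuels.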
  • energy_per_gdp (+0.1691) suggests that more energy-intensive economies — those requiring greater energy input per unit of economic output — produce significantly higher per capita emissions, reflecting structural inefficiency in energy use.
  • co2_per_unit_energy (+0.0032) confirms that a dirtier energy mix (i.e., higher carbon content per unit of energy consumed) directly elevates per capita emissions, highlighting the importance of energy transition policies.
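
Raw OLS coefficients are expressed per unit of each predictor, so their magnitudes depend on the measurement scale. The sketch below, using simulated data rather than the report's dataset, shows one common way to put coefficients on a comparable footing by standardising them (raw slope multiplied by sd(x)/sd(y)). The value ranges and true slopes are hypothetical choices made only for illustration.

```r
set.seed(1)  # simulated data for illustration; not the report's dataset
n <- 500
energy_per_capita <- runif(n, 1000, 100000)  # hypothetical large-scale predictor
energy_per_gdp    <- runif(n, 0.5, 5)        # hypothetical small-scale predictor
co2_per_capita <- 0.0001 * energy_per_capita + 0.2 * energy_per_gdp + rnorm(n)

fit <- lm(co2_per_capita ~ energy_per_capita + energy_per_gdp)
raw <- coef(fit)[-1]  # drop the intercept

# Standardised coefficient: raw slope times sd(x) / sd(y)
sds <- c(sd(energy_per_capita), sd(energy_per_gdp))
standardised <- raw * sds / sd(co2_per_capita)

print(rbind(raw = raw, standardised = standardised))
```

In this simulation the predictor with the smaller raw coefficient has the larger standardised effect, which is why raw coefficient size alone is a weak guide to relative importance.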

Among fossil fuel sources, flaring_co2 had the largest coefficient (+0.0423), likely reflecting oil-rich nations where gas flaring is prevalent. coal_co2 (+0.0027) and gas_co2 (+0.0030) were also significant positive drivers, consistent with their well-documented role in national emissions profiles.

Economic scale, measured by gdp, showed a small but significant positive relationship (+3.93e-12), suggesting that larger economies tend to emit more per capita, though the marginal effect per unit of GDP is minimal.

Four variables showed significant negative relationships:

  • population (−0.0000001) reflects a dilution effect — larger populations share a given level of total emissions across more people, reducing per capita figures even when aggregate emissions are high.
  • methane (−0.0102) may reflect a substitution pattern, where countries with higher methane emissions (typically agriculture- or livestock-heavy economies) tend to rely less on CO2-intensive fossil fuel sources.
  • cement_co2 (−0.0203) and land_use_change_co2 (−0.0011) may similarly reflect sectoral substitution, where economies driven by construction or land use have lower direct fossil fuel CO2 intensity. Because methane and cement_co2 have high VIF values (see Multicollinearity), the signs of their coefficients should be read with caution.

Only primary_energy_consumption and oil_co2 were non-significant (p > 0.05), likely because their explanatory variance is absorbed by correlated variables such as energy_per_capita and other fossil fuel predictors.

Multicollinearity

VIF analysis revealed high collinearity among coal_co2 (VIF = 35.69), methane (VIF = 37.67), nitrous_oxide (VIF = 37.44), and cement_co2 (VIF = 24.15). This is expected in emissions datasets where GHG variables are structurally interrelated. High VIF inflates standard errors and reduces coefficient precision, so the signs and magnitudes of these four coefficients in particular should be interpreted with caution. The Random Forest model does not estimate coefficients and its predictions are far less sensitive to multicollinearity, so it serves as a complementary validation, although correlated predictors can share importance between them.
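
For readers unfamiliar with the statistic, the VIF of a predictor equals 1 / (1 − R²), where R² comes from regressing that predictor on all the others. The base R sketch below illustrates this on simulated data; the helper name vif_manual and the variables x1 to x3 are hypothetical, and this is not necessarily how the VIF values above were produced.

```r
set.seed(42)  # reproducible simulated data; not the report's dataset
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)  # deliberately collinear with x1
x3 <- rnorm(n)                 # independent of x1 and x2
predictors <- data.frame(x1 = x1, x2 = x2, x3 = x3)

# Hypothetical helper: VIF from an auxiliary regression of each predictor
# on all remaining predictors
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    aux <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- vif_manual(predictors)
print(round(vifs, 2))
```

The collinear pair x1 and x2 receive large VIF values while the independent x3 stays close to 1, mirroring the pattern reported for the interrelated GHG variables.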

Random Forest Findings — Feature Importance

The Random Forest model substantially outperformed OLS across all metrics — achieving an R² of 0.994, RMSE of 0.3729, and MAE of 0.1677 on the test set, compared to OLS’s RMSE of 2.7306 and MAE of 1.6974. This improvement reflects the Random Forest’s ability to capture non-linear interactions and complex dependencies among predictors that OLS cannot model.

Feature importance rankings (% increase in MSE) were broadly consistent with OLS findings. energy_per_capita ranked first (49.27), reinforcing its role as the single most influential predictor of CO₂ per capita. co2_per_unit_energy (30.19) and oil_co2 (22.25) followed, with population (19.40) confirming the dilution effect identified in OLS.

Model Comparison

|Model                   |   RMSE|    MAE|    R²|
|:-----------------------|------:|------:|-----:|
|Linear Regression (OLS) | 2.7306| 1.6974| 0.667|
|Random Forest           | 0.3729| 0.1677| 0.994|

The two models serve complementary roles. OLS provides statistical interpretability — coefficient estimates, significance testing, and directional inference — making it suitable for explaining why emissions vary. Random Forest provides predictive accuracy, capturing complex interactions missed by the linear model. Together, they offer a robust analytical framework: OLS identifies which factors matter and how, while Random Forest confirms their relative importance under a more flexible modelling approach.

Summary

In answer to RQ1, the analysis identifies energy consumption intensity, fossil fuel dependence, and economic scale as the primary positive drivers of CO2 per capita variation across countries from 1990–2024. Population size and methane emissions are negatively associated with CO2 per capita, which the analysis attributes to dilution and substitution effects respectively. The leading role of energy consumption is consistent across both the OLS and Random Forest models, lending confidence to the conclusions drawn.

Discussion — Classification Analysis (RQ2)

Overview

To address RQ2, three classification models were trained on a stratified 80/20 split (Train: 5,954 rows; Test: 1,488 rows) to predict a country’s emission intensity tier — Low, Medium, or High — based on 13 predictors spanning economic scale, energy consumption, and fossil fuel dependence. The target variable emission_level was derived from tertile thresholds of co2_per_capita, yielding near-balanced classes (Low: 33%, Medium: 33%, High: 34%). Models were evaluated on Accuracy, Precision, Recall, and F1 Score (macro-averaged across all three classes).
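
The tertile-based target can be illustrated with a short base R sketch. The values below are hypothetical, and the handling of observations that fall exactly on a threshold (assigned to the lower tier here) is an assumption rather than a documented detail of the analysis.

```r
# Hypothetical co2_per_capita values (tonnes per person); not the report's data
co2_per_capita <- c(0.05, 0.3, 0.9, 1.5, 2.4, 4.1, 5.6, 8.7, 16.0)

# Tertile cut points: the 1/3 and 2/3 quantiles of the target variable
cuts <- quantile(co2_per_capita, probs = c(1/3, 2/3))

# Assumption: values equal to a cut point fall into the lower tier (right = TRUE)
emission_level <- cut(
  co2_per_capita,
  breaks = c(-Inf, cuts, Inf),
  labels = c("Low", "Medium", "High")
)

print(table(emission_level))
```

Because the thresholds are quantiles of the data, the three classes are balanced by construction, which is why macro-averaged metrics are a natural choice for evaluation.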

Model C1 — Multinomial Logistic Regression

The logistic regression model served as the interpretable linear baseline, achieving an accuracy of 0.865 and a macro F1 of 0.866 after centring, scaling, and 5-fold cross-validation. Performance was reasonably strong, particularly for the Low class (456 correct out of 496), where the linear decision boundary was sufficient to separate the clearly distinct low-emission profile of Sub-Saharan African and South Asian nations.

The primary source of error was at the Medium↔︎High boundary, where 83 Medium observations were misclassified as High, and 43 High observations were misclassified as Medium. This is expected, as countries near the tertile thresholds share overlapping energy and economic profiles that a linear model cannot resolve. Notably, no Low observations were misclassified as High (or vice versa), confirming that the extreme classes are well-separated in feature space even under a linear model.

Model C2 — Decision Tree (CART, max depth = 6)

The Decision Tree marginally outperformed logistic regression with an accuracy of 0.876 and F1 of 0.875. By learning non-linear, rule-based splits, the tree was better able to capture threshold effects in the data — for instance, splitting on energy_per_capita at a specific cut-off to discriminate High from non-High emitters.

energy_per_capita dominated the variable importance ranking (importance = 2685.29), followed by energy_per_gdp (1103.84) and primary_energy_consumption (776.06). This confirms that energy structure variables drive the classification boundaries most strongly. Fossil fuel variables such as gas_co2 (361.20) and flaring_co2 (345.40) also contributed meaningfully, reflecting the role of fuel mix in distinguishing emission tiers.

The confusion matrix reveals that the Decision Tree’s improvement over logistic regression was concentrated at the Medium↔︎High boundary — reducing High→Medium misclassifications from 43 to 12 — though it introduced slightly more Medium→High confusion (71 vs. 40). This suggests the tree’s splits captured the High class more precisely at the cost of slight overprediction in some Medium cases.

Model C3 — Random Forest Classifier (200 trees)

The Random Forest substantially outperformed both baseline models, achieving 97.6% accuracy and a macro F1 of 0.976 — a gain of approximately 10 percentage points over the Decision Tree. Only 36 out of 1,488 test observations were misclassified, with errors concentrated at the Medium↔︎Low (6 cases) and Medium↔︎High (12 cases) boundaries. No Low observations were misclassified as High or vice versa, demonstrating near-perfect separation of the extreme emission classes.

The large performance gap between Random Forest and the single Decision Tree reflects the benefit of ensemble averaging: by aggregating 200 trees, each grown on a bootstrap sample with a random subset of features considered at each split, the model reduces variance and captures complex, non-linear interactions that a single depth-limited tree represents poorly.

Feature importance (Mean Decrease Gini) was strongly concentrated in energy_per_capita (Gini = 1735.89), with a large drop-off to the next most important variables: population (335.79), energy_per_gdp (329.28), and primary_energy_consumption (289.52). This hierarchy is consistent across the Decision Tree and regression analyses, reinforcing that individual energy consumption intensity is the single most discriminative feature for emission tier classification.

population emerged as the second-ranked feature in the Random Forest (compared to lower ranks in OLS and the Decision Tree), suggesting that while population size is not a strong linear predictor, it interacts non-linearly with other features — for instance, large-population, low-income countries cluster distinctively in the Low tier.

Model Comparison

|Model               | Accuracy| Precision| Recall| F1 Score|
|:-------------------|--------:|---------:|------:|--------:|
|Logistic Regression |    0.865|     0.868|  0.865|    0.866|
|Decision Tree       |    0.876|     0.876|  0.876|    0.875|
|Random Forest       |    0.976|     0.976|  0.976|    0.976|
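
To make the macro-averaged metrics concrete, the sketch below derives them from a confusion matrix in base R. The counts are hypothetical and are not the report's results, and macro F1 is taken here as the mean of the per-class F1 scores, which is an assumption about the averaging used.

```r
# Hypothetical 3x3 confusion matrix (rows = predicted, columns = actual);
# the counts are illustrative and are not the report's results
classes <- c("Low", "Medium", "High")
cm <- matrix(
  c(45,  4,  0,
     5, 40,  6,
     0,  6, 44),
  nrow = 3, byrow = TRUE,
  dimnames = list(predicted = classes, actual = classes)
)

precision <- diag(cm) / rowSums(cm)  # correct / all predicted as the class
recall    <- diag(cm) / colSums(cm)  # correct / all actually in the class
f1        <- 2 * precision * recall / (precision + recall)

# Macro averages: unweighted means of the per-class values
macro <- c(
  Accuracy  = sum(diag(cm)) / sum(cm),
  Precision = mean(precision),
  Recall    = mean(recall),
  F1        = mean(f1)
)
print(round(macro, 3))
```

When the classes are exactly balanced, as in this example, macro recall equals overall accuracy, which may explain why the Accuracy and Recall columns coincide for each model in the table.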

The three models serve complementary purposes. Logistic Regression provides a transparent probabilistic framework and confirms that linear separability is already strong for the extreme classes. The Decision Tree offers interpretable classification rules — readable splits on energy_per_capita and energy_per_gdp that could directly inform policy thresholds. Random Forest delivers the highest predictive accuracy and the most reliable feature importance ranking, making it the recommended model for operational classification.

Country-Level Classification Results

When the Random Forest was applied to the full dataset of 213 countries (aggregating each country’s dominant class across 1990–2024), all 213 countries were correctly assigned to their actual emission tier. It should be noted that this 213/213 figure reflects dominant-class matching at the country level — the correct measure of model performance remains the 97.6% row-level test accuracy.
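
The difference between row-level accuracy and dominant-class matching can be seen in a small base R sketch. The data frame and the helper dominant are hypothetical, and the aggregation shown (most frequent label per country) is an assumption about how the country-level comparison was made.

```r
# Hypothetical row-level predictions for three countries; illustrative only
preds <- data.frame(
  country   = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  actual    = c("Low", "Low", "Medium", "High", "High", "High",
                "Medium", "Medium", "Low"),
  predicted = c("Low", "Medium", "Low", "High", "High", "Medium",
                "Medium", "Medium", "Medium")
)

# Most frequent label in a vector (ties go to the first label in sort order)
dominant <- function(x) names(which.max(table(x)))

by_country <- data.frame(
  actual    = vapply(split(preds$actual, preds$country), dominant, character(1)),
  predicted = vapply(split(preds$predicted, preds$country), dominant, character(1))
)
by_country$match <- by_country$actual == by_country$predicted

row_accuracy     <- mean(preds$actual == preds$predicted)
country_accuracy <- mean(by_country$match)
print(by_country)
print(c(row_accuracy = row_accuracy, country_accuracy = country_accuracy))
```

In this toy example every country's dominant class matches even though only five of nine rows are predicted correctly, which shows why a perfect country-level match is a weaker statement than row-level test accuracy.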

The geographic and economic patterns across tiers are clear:

  • Low emission countries (68 nations) are predominantly Sub-Saharan African (e.g., Burundi, Niger, Malawi) and South/Southeast Asian nations (e.g., Bangladesh, Nepal, Cambodia), characterised by low industrialisation, minimal fossil fuel infrastructure, and low per capita energy demand.

  • Medium emission countries (71 nations) span Latin America (e.g., Brazil, Mexico, Colombia), Eastern Europe (e.g., Romania, Croatia, Turkey), and emerging economies in Asia and the Middle East (e.g., Vietnam, India, Egypt) — reflecting transitional energy systems with growing but not yet high fossil fuel dependency.

  • High emission countries (77 nations) are concentrated among Gulf states (Qatar, UAE, Kuwait, Saudi Arabia), Western industrialised economies (United States, Australia, Canada, Germany), and energy-intensive transitional economies (Russia, Kazakhstan, China) — all characterised by high per capita energy consumption and deep fossil fuel dependence.

Summary

In answer to RQ2, countries can be accurately classified into Low, Medium, and High emission intensity tiers using economic and energy structure variables. The Random Forest classifier achieves near-perfect performance (F1 = 0.976), with energy_per_capita identified as the dominant discriminating feature across all three models. The geographic distribution of emission tiers aligns with known global development patterns — further validating the classification framework. Together with the regression findings in RQ1, these results confirm that energy consumption intensity and fossil fuel dependence are the central axes along which national emission profiles diverge.

Conclusions

Project Summary

This project analysed the Our World in Data CO2 and Greenhouse Gas Emissions dataset, covering 213 countries from 1990 to 2024 (7,442 country-year observations), to investigate the determinants of national CO2 emissions per capita and to assess whether countries can be meaningfully grouped by emission intensity. Two research questions were addressed: a regression analysis (RQ1) and a classification analysis (RQ2).

RQ1: Determinants of CO2 Emissions Per Capita

Of the 15 predictors included in the Ordinary Least Squares (OLS) model, 13 were statistically significant at the 5% level. The findings can be summarised as follows:

| Driver Type | Key Variables | Effect |
|---|---|---|
| Energy intensity | energy_per_capita, co2_per_unit_energy, energy_per_gdp | Strongest positive drivers |
| Fossil fuel dependence | flaring_co2, coal_co2, gas_co2 | Significant positive effect |
| Economic scale | gdp | Positive but modest per-unit effect |
| Population size | population | Negative; larger populations dilute per capita emissions |
| Fuel substitution | methane, land_use_change_co2 | Negative; reflect alternative emission pathways |

Both the OLS model (test R-squared = 0.667) and the Random Forest regression (test R-squared = 0.994) consistently identified energy_per_capita as the most important predictor. This indicates that the volume of energy consumed per person, together with the carbon intensity of that energy, is the primary determinant of CO2 emissions per capita.
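For reference, a test R-squared can be computed on held-out rows as one minus the ratio of residual to total sum of squares. The sketch below uses simulated data with a deliberately non-linear response (not the OWID data) and an assumed 80/20 split; it shows one common definition, and the report does not state which definition produced the values quoted above.

```r
# Out-of-sample R-squared: 1 - SSE / SST on held-out data
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

set.seed(42)
n   <- 500
sim <- data.frame(energy_per_capita = runif(n, 0, 10))
# Simulated quadratic response plus noise (illustrative only)
sim$co2_per_capita <- sim$energy_per_capita^2 + rnorm(n, sd = 2)

# Assumed 80/20 train/test split
train_idx <- sample(n, size = 0.8 * n)
train     <- sim[train_idx, ]
test      <- sim[-train_idx, ]

# A straight-line fit cannot fully capture the curvature
ols     <- lm(co2_per_capita ~ energy_per_capita, data = train)
test_r2 <- r_squared(test$co2_per_capita, predict(ols, newdata = test))
```

A linear model fitted to a curved relationship leaves systematic residuals, which is one way a flexible learner such as a Random Forest can reach a higher test R-squared than OLS on the same predictors.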

RQ2: Classification of Countries by Emission Intensity

Countries could be classified into emission tiers with high accuracy across all models considered:

| Model | F1 Score |
|---|---|
| Logistic Regression | 0.866 |
| Decision Tree | 0.875 |
| Random Forest | 0.976 |

The Random Forest classifier achieved 97.6% row-level accuracy, with energy_per_capita again emerging as the most discriminative feature by a substantial margin: its Gini importance of 1735.89 is roughly five times the 335.79 of the next-ranked variable.

The resulting emission tiers correspond to recognisable real-world patterns:

  • Low emission tier: predominantly Sub-Saharan Africa and South/Southeast Asia, characterised by low industrialisation and minimal fossil fuel infrastructure.
  • Medium emission tier: predominantly Latin America, Eastern Europe, and emerging economies, characterised by mixed energy systems and growing industrialisation.
  • High emission tier: predominantly Gulf states, Western industrialised economies, and energy-intensive transitional economies (e.g., Russia, Kazakhstan, China), characterised by high energy consumption, fossil fuel dependence, or energy-intensive lifestyles.

Consistency Across Analyses

Both research questions converge on the same structural conclusion: energy consumption intensity is the dominant driver of national CO2 emissions per capita, more so than economic size, population, or any individual fossil fuel source considered in isolation. Countries that consume more energy per person, rely on carbon-intensive energy sources, and use more energy per unit of GDP consistently fall within the high emission tier, and the corresponding variables carry the strongest effects in the regression models. This conclusion is supported across five modelling approaches: OLS regression, Random Forest regression, Logistic Regression, Decision Tree classification, and Random Forest classification.

Policy Implications

These findings suggest that effective climate policy should prioritise the following:

  • Energy efficiency improvements, aimed at reducing the amount of energy consumed per unit of economic output.
  • Clean energy transitions, aimed at lowering the amount of CO2 emitted per unit of energy consumed.
  • Targeted support for medium-tier countries, many of which are on a trajectory toward higher emissions as they industrialise. Early intervention in these economies is likely to be more cost-effective than post-industrialisation decarbonisation.
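The first two priorities above correspond to separate factors in the Kaya identity, a standard decomposition of per capita emissions. It is not part of the original analysis and is shown here only to illustrate the reasoning; the mapping to dataset variables is approximate, and unit conversion factors are omitted.

```latex
% Kaya identity in per capita form:
%   P = population, E = primary energy consumption
\[
\frac{\mathrm{CO_2}}{P}
  \;=\;
  \underbrace{\frac{\mathrm{GDP}}{P}}_{\text{income per person}}
  \times
  \underbrace{\frac{E}{\mathrm{GDP}}}_{\text{energy\_per\_gdp}}
  \times
  \underbrace{\frac{\mathrm{CO_2}}{E}}_{\text{co2\_per\_unit\_energy}}
\]
% The first two factors multiply to E / P, i.e. energy_per_capita.
```

Energy efficiency policy acts on the second factor, while clean energy transitions act on the third; the product of the first two factors is energy consumed per person.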

Limitations

| Limitation | Detail |
|---|---|
| Median imputation | 14,037 missing values were imputed, particularly in gas_co2 (42.8%) and coal_co2 (36.1%); imputed values may not reflect true country conditions |
| Winsorisation | Capping at IQR bounds preserves rows but reduces variance in extreme cases (e.g. Qatar, Kuwait) |
| Panel data structure | Country-year observations are not independent; time-series correlation within countries is not modelled |
| OLS multicollinearity | Several predictors exhibit high VIF values (coal_co2 = 35.69, methane = 37.67); individual coefficients should be interpreted with caution |
| Dominant class aggregation | A 213/213 country-level classification match reflects dominant class over 35 years rather than row-level accuracy; true row-level performance is 97.6% |
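The preprocessing and diagnostic steps referred to in these limitations can be sketched in base R as follows. The helper names are illustrative, the 1.5 × IQR fence multiplier is an assumption (the report states only that values were capped at IQR bounds), and the data are simulated.

```r
# Replace missing values with the column median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Cap values at the Tukey fences Q1 - k*IQR and Q3 + k*IQR (k = 1.5 assumed)
winsorise_iqr <- function(x, k = 1.5) {
  q     <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  pmin(pmax(x, q[1] - fence), q[2] + fence)
}

# Variance inflation factor: regress each predictor on all the others
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Toy vector with one missing value and one extreme value
x         <- c(1, 2, NA, 4, 5, 100)
x_imputed <- impute_median(x)
x_capped  <- winsorise_iqr(x_imputed)

# Simulated predictors: b is almost a copy of a, c is independent
set.seed(1)
a    <- rnorm(200)
d    <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))
vifs <- vif_manual(d)
```

In the toy vector the extreme value is pulled down to the upper fence rather than removed, which preserves the row at the cost of understating how extreme it was. The two near-duplicate predictors receive large VIF values while the independent one stays close to 1.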

Future Work

This research could be extended in several directions:

  1. Panel data modelling. Fixed-effects or random-effects panel regression models could be used to explicitly account for temporal dependence within countries, yielding more reliable estimates of country-specific effects over time.

  2. Inclusion of renewable energy variables. The present analysis focuses primarily on energy consumption and fossil-fuel-related emission indicators. Future studies could incorporate renewable energy share, electricity generation mix, and energy transition indicators to better capture decarbonisation pathways.

  3. Advanced machine learning models. Gradient boosting methods, such as XGBoost and LightGBM, may improve predictive accuracy and offer alternative approaches to feature importance estimation.

  4. Scenario-based forecasting. The analysis could be extended from prediction to forecasting, estimating the potential impact of changes in consumption patterns or fossil fuel dependence on future CO2 emissions under various policy scenarios.

  5. Regional analysis. Countries could be categorised by region or income level to investigate whether the drivers of emissions differ across developed, developing, and emerging economies.

Collectively, these extensions would provide more detailed insight into the mechanisms driving emissions, supporting the development of more targeted climate policy recommendations.
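As an illustration of the first extension (panel data modelling), the sketch below fits a country fixed-effects regression in base R on simulated data. The variable names mirror the dataset, but all values, the true slope of 0.5, and the `demean` helper are assumptions made for the example; a dedicated panel package could be used instead.

```r
set.seed(7)
countries <- paste0("C", 1:20)
panel <- expand.grid(year = 1990:2024, country = countries,
                     stringsAsFactors = FALSE)

# Simulated country-specific intercepts that also shift energy use
alpha <- setNames(rnorm(20, sd = 3), countries)
a_i   <- unname(alpha[panel$country])

panel$energy_per_capita <- rnorm(nrow(panel), mean = 5) + a_i
panel$co2_per_capita    <- 0.5 * panel$energy_per_capita + 2 * a_i +
  rnorm(nrow(panel), sd = 0.3)

# Pooled OLS ignores the country effects and is biased in this simulation
pooled <- lm(co2_per_capita ~ energy_per_capita, data = panel)

# Fixed effects via country dummies
fe <- lm(co2_per_capita ~ energy_per_capita + factor(country), data = panel)

# Equivalent "within" estimator: subtract each country's mean first
demean     <- function(x, g) x - ave(x, g)
within_fit <- lm(demean(co2_per_capita, country) ~
                   0 + demean(energy_per_capita, country), data = panel)

slopes <- c(pooled = unname(coef(pooled)["energy_per_capita"]),
            fe     = unname(coef(fe)["energy_per_capita"]),
            within = unname(coef(within_fit)[1]))
```

In this simulation the pooled slope absorbs the between-country differences, whereas the dummy-variable and within estimators agree with each other and recover a slope close to the simulated value.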

Final Remarks

This study addressed both research questions through regression and classification approaches applied to the Our World in Data CO2 dataset spanning 1990 to 2024. The results consistently indicate that energy efficiency and energy consumption intensity are the most significant determinants of national CO2 emissions per capita. The superior performance of the Random Forest models in both the regression and classification tasks demonstrates the value of machine learning approaches in analysing complex environmental datasets. These findings suggest that meaningful reductions in carbon intensity will require not only economic transformation but also the implementation of more efficient and cleaner energy systems worldwide.