Climate change is one of the most pressing global challenges of the 21st century, with carbon dioxide (CO2) emissions identified as the primary driver of global warming. Understanding what factors drive national-level emissions is critical for designing effective climate policy. Countries differ substantially in their per-capita emissions, and these differences are shaped by economic scale, energy systems, and fossil fuel dependence.
This project analyses the Our World in Data (OWID) CO2 and Greenhouse Gas Emissions dataset to investigate the structural factors that explain variation in CO2 emissions per capita across countries from 1990 to 2024. We apply regression to quantify the relationship between emissions and their key drivers, and classification to group countries into low, medium, and high emission intensity categories.
Objective 1
To model and quantify the relationship between national CO2 emissions per capita (co2_per_capita) and key drivers of economic activity, energy consumption, and fossil fuel usage across countries from 1990–2024.
Objective 2
To classify countries into emission intensity categories based on their economic structure, energy consumption patterns, and fossil fuel dependence using data from 1990–2024.
Research Question 1
What economic, energy consumption, and fossil fuel factors
significantly explain variation in co2_per_capita across
countries from 1990–2024?
Research Question 2
Can countries be accurately classified into low, medium, and high emission intensity groups based on their economic structure, energy consumption, and fossil fuel dependence from 1990–2024?
| Package | Purpose |
|---|---|
| tidyverse | Data wrangling and visualisation |
| caret | Model training and evaluation framework |
| randomForest | Random Forest regression and classification |
| rpart / rpart.plot | Decision Tree modelling and visualisation |
| corrplot | Correlation heatmap |
| naniar | Missing value visualisation |
| knitr | Table formatting |
| gridExtra | Multi-panel plot layout |
| scales | Axis formatting for ggplot2 |
| Metrics | RMSE and MAE calculation |
| performance | VIF / multicollinearity check |
Dataset loaded successfully from GitHub
Source : https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv
Rows : 50411
Columns : 79
Years : 1750 to 2024
Entities : 254 unique
Total cells: 3,982,469
: Dataset Overview
|Attribute |Details |
|:------------|:-----------------------------------------------------------|
|Title |CO2 and Greenhouse Gas Emissions Dataset |
|Source |https://github.com/owid/co2-data |
|Publisher |Our World in Data (OWID) |
|Year Range |1750 to 2024 |
|Last Updated |2024 |
|Purpose |Tracks and analyses global CO2 and GHG emissions by country |
|Rows |50,411 |
|Columns |79 |
|Total Cells |3,982,469 |
|License |Creative Commons BY 4.0 |
The dataset used in this analysis is the CO2 and Greenhouse Gas Emissions Dataset published by Our World in Data (OWID), sourced directly from their public GitHub repository. It is maintained under a Creative Commons BY 4.0 license, making it freely available for research and educational purposes.
Upon loading, the dataset comprises 50,411 rows and 79 columns, covering 254 unique entities — including both sovereign nations and regional aggregates such as World, Asia, and High-income countries. The temporal range spans 1750 to 2024, representing 274 years of emissions history. However, meaningful and consistent data only becomes available from the mid-20th century onward, with the most complete coverage beginning around 1990.
The table below shows the first 10 rows of key columns from the raw dataset, illustrating the missing data pattern typical of early historical records.
| country | year | population | gdp | co2 | co2_per_capita | coal_co2 | oil_co2 | gas_co2 |
|---|---|---|---|---|---|---|---|---|
| Afghanistan | 1750 | 2802560 | NA | NA | NA | NA | NA | NA |
| Afghanistan | 1751 | NA | NA | NA | NA | NA | NA | NA |
| Afghanistan | 1752 | NA | NA | NA | NA | NA | NA | NA |
| Afghanistan | 1753 | NA | NA | NA | NA | NA | NA | NA |
| Afghanistan | 1754 | NA | NA | NA | NA | NA | NA | NA |
| Afghanistan | 1755 | NA | NA | NA | NA | NA | NA | NA |
| Afghanistan | 1756 | NA | NA | NA | NA | NA | NA | NA |
| Afghanistan | 1757 | NA | NA | NA | NA | NA | NA | NA |
| Afghanistan | 1758 | NA | NA | NA | NA | NA | NA | NA |
| Afghanistan | 1759 | NA | NA | NA | NA | NA | NA | NA |
Several structural characteristics of the raw dataset are worth noting before proceeding to cleaning:
Historical sparsity: Early records (pre-1900) are dominated by NA values across emissions variables. As illustrated by the raw data snapshot above, the rows for Afghanistan from 1750 to 1759 contain no GDP or CO2 values in any of the columns shown, and only the 1750 row has a population figure. This is consistent with limited historical measurement and reporting capacity during the pre-industrial period.
Mixed entity types: The dataset includes not only sovereign countries but also continental and income-group aggregates. These will be filtered out during the cleaning stage to ensure the analysis reflects country-level observations only.
Analytical scope: Given the data quality and completeness considerations above, this analysis focuses on the modern period from 1990 to 2024, where emissions reporting is substantially more reliable and comparable across countries.
This section describes the data cleaning and preprocessing procedures applied to the CO2 and Greenhouse Gas Emissions dataset obtained from Our World in Data (OWID). The objective of this stage is to construct a high-quality analytical dataset suitable for regression and classification modelling.
The cleaning process involves variable selection, removal of aggregate observations, filtering to a consistent time period, handling missing values, treatment of outliers, and construction of the classification target variable.
Only variables relevant to the research objectives are retained from the original 79-column dataset. Alongside the identifiers (country, year), the selected variables include the target variable (co2_per_capita), economic indicators (gdp, population), energy consumption and intensity metrics (primary_energy_consumption, energy_per_capita, energy_per_gdp, co2_per_unit_energy), CO2 emission components by source (coal_co2, oil_co2, gas_co2, cement_co2, flaring_co2, land_use_change_co2), and broader greenhouse gas factors (methane, nitrous_oxide, total_ghg). Each variable is selected based on its theoretical relevance to explaining cross-country differences in emissions.
| Selected_Variables |
|---|
| country |
| year |
| co2_per_capita |
| gdp |
| population |
| primary_energy_consumption |
| energy_per_capita |
| energy_per_gdp |
| co2_per_unit_energy |
| coal_co2 |
| oil_co2 |
| gas_co2 |
| cement_co2 |
| flaring_co2 |
| land_use_change_co2 |
| methane |
| nitrous_oxide |
| total_ghg |
This step reduces dimensionality, removes irrelevant features, and improves model interpretability and computational efficiency. In total, 18 variables are retained for subsequent analysis.
To ensure the analysis focused exclusively on individual countries, aggregate regions and non-country entities were removed from the dataset. A lookup table containing valid three-letter ISO country codes was created from the original dataset, and only matching countries were retained using a semi-join operation.
This step eliminated entries such as continents, income groups, and regional aggregates (e.g. World, Asia, High-income countries) that could distort country-level analysis. After filtering, the dataset contained 218 unique countries, which were used for subsequent exploratory analysis, predictive modelling, and classification tasks.
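The country filter described above could be sketched as follows. This is a minimal illustration on a toy data frame, not the project's actual code: the `iso_code` column name and the rule "exactly three upper-case letters" are assumptions about how the lookup table was built, and base R's `%in%` stands in for the semi-join.

```r
# Toy stand-in for the raw data; `iso_code` is an assumed column name.
raw <- data.frame(
  country  = c("Afghanistan", "World", "Asia", "Qatar", "High-income countries"),
  iso_code = c("AFG", "OWID_WRL", NA, "QAT", NA),
  year     = c(1990L, 1990L, 1990L, 1990L, 1990L),
  stringsAsFactors = FALSE
)

# Lookup of valid codes: exactly three upper-case letters (assumed rule).
is_valid    <- !is.na(raw$iso_code) & grepl("^[A-Z]{3}$", raw$iso_code)
valid_codes <- unique(raw$iso_code[is_valid])
lookup      <- data.frame(iso_code = valid_codes, stringsAsFactors = FALSE)

# Semi-join behaviour: keep rows with a match in the lookup, add no columns.
# dplyr::semi_join(raw, lookup, by = "iso_code") would keep the same rows.
countries <- raw[raw$iso_code %in% lookup$iso_code, ]

print(countries$country)
```

Rows with a missing or non-standard code (the aggregates in this toy example) are dropped, while each remaining row keeps its original columns.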
The dataset was filtered to include only observations from 1990 to 2024, representing the modern era of emissions reporting. The year 1990 is commonly used as a baseline in climate studies because data availability improves from that point and because it is the base year used in international climate agreements such as the UNFCCC (adopted at the 1992 Rio Earth Summit) and the Kyoto Protocol.
This restriction ensures consistency in reporting standards and improves comparability across countries. It also reduces bias arising from incomplete or unreliable historical records.
After filtering, only observations within the defined study period are retained for analysis.
Missing values in the target variable (co2_per_capita)
were removed to ensure that all observations used for modelling
contained valid target values. Records with missing target values cannot
contribute to supervised learning models and may introduce bias into the
analysis. As a result, 188 rows were removed, while 7,442 rows were
retained for subsequent exploratory analysis, feature engineering, and
model development.
Missing values in the predictor variables were addressed using median imputation. All numeric predictor variables, excluding the target variable (co2_per_capita), were examined, and a total of 14,037 missing values were identified. To preserve the dataset size, each missing value was replaced with the median of its respective variable. Median imputation was selected because it is less sensitive to extreme values and skewed distributions than mean imputation. Following this process, the number of missing predictor values fell from 14,037 to 0, resulting in a complete dataset suitable for subsequent exploratory analysis and machine learning modelling.
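The two missing-value steps (dropping rows without a target, then imputing predictors with the column median) can be sketched on a small toy data frame. The values below are invented for illustration, and the loop is one plausible base R implementation rather than the code used in the report.

```r
# Toy data: `co2_per_capita` is the target, the other columns are predictors.
df <- data.frame(
  co2_per_capita = c(1.2, NA, 3.4, 8.0, 0.5),
  gdp            = c(10, 20, NA, 40, 1000),
  coal_co2       = c(NA, 2, 4, NA, 6)
)

# Step 1: drop rows with a missing target value.
df <- df[!is.na(df$co2_per_capita), ]

# Step 2: replace each missing predictor value with that column's median.
predictors <- setdiff(names(df)[sapply(df, is.numeric)], "co2_per_capita")
n_missing_before <- sum(is.na(df[predictors]))
for (v in predictors) {
  med <- median(df[[v]], na.rm = TRUE)
  df[[v]][is.na(df[[v]])] <- med
}
n_missing_after <- sum(is.na(df[predictors]))

cat("Missing before:", n_missing_before, " after:", n_missing_after, "\n")
```

Note that the median is computed after the target-based row removal, so the imputed value reflects only the rows that remain in the modelling dataset.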
Duplicate observations were assessed based on the combination of
country and year to ensure that each record
represented a unique country-year observation. No duplicate
records were identified in the dataset (0 duplicate rows
removed), indicating that the dataset was already free from redundant
observations. Therefore, all records were retained for further analysis
and modelling.
To ensure consistency and compatibility during analysis, the
country variable was stored as a character
data type, while the year variable was converted to an
integer data type. This preprocessing step helped
maintain correct variable representations and supported subsequent data
processing and modelling tasks.
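As a rough sketch of the duplicate check and type conversion described above, the following base R snippet works on invented rows; the toy data deliberately contains one repeated country-year pair so the check has something to find.

```r
# Toy data with one duplicated country-year pair (Peru, 1990).
df <- data.frame(
  country        = c("Kenya", "Kenya", "Peru", "Peru"),
  year           = c(1990, 1991, 1990, 1990),
  co2_per_capita = c(0.2, 0.2, 1.0, 1.0),
  stringsAsFactors = FALSE
)

# Enforce the intended column types.
df$country <- as.character(df$country)
df$year    <- as.integer(df$year)

# Flag second and later occurrences of each (country, year) pair.
is_dup <- duplicated(df[, c("country", "year")])
n_dup  <- sum(is_dup)
df     <- df[!is_dup, ]

cat("Duplicate rows removed:", n_dup, "\n")
```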
Winsorisation was applied to key continuous variables using the IQR method (1.5 × IQR rule) to limit the influence of extreme values while retaining all observations. This improves model stability without removing valid data points.
Variables treated include:
- co2_per_capita
- gdp
- population
- primary_energy_consumption
- total_ghg

A classification target variable (emission_level) was created based on co2_per_capita to support the classification modelling task. The variable was constructed using data-driven tertile thresholds (33rd and 66th percentiles), which divide countries into three balanced emission categories.
This approach ensures that class definitions are not arbitrarily assigned, but instead reflect the empirical distribution of global emissions within the dataset. It also keeps class sizes roughly balanced, which simplifies classifier training and makes performance metrics easier to interpret.
The resulting classification thresholds are:
This categorical variable is subsequently used as the target for classification analysis.
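One way to build such a tertile-based target is with `quantile()` and `cut()`. The sketch below uses made-up emission values; the use of R's default quantile type and the choice to place a value that equals a threshold in the lower class are assumptions, since the report does not state them.

```r
# Invented per-capita emission values for nine country-year observations.
co2_per_capita <- c(0.1, 0.4, 0.9, 1.5, 2.7, 3.8, 5.2, 7.0, 12.0)

# 33rd and 66th percentiles (default quantile type 7 assumed).
thr <- quantile(co2_per_capita, probs = c(0.33, 0.66), names = FALSE)

# Discretise into three ordered classes.
emission_level <- cut(
  co2_per_capita,
  breaks = c(-Inf, thr, Inf),
  labels = c("Low", "Medium", "High"),
  right  = TRUE  # a value equal to a threshold falls in the lower class
)

print(thr)
print(table(emission_level))
```

Because the thresholds come from the data itself, the three classes are close to equal in size by construction.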
| Variable | Min | Q1 | Median | Mean | Q3 | Max |
|---|---|---|---|---|---|---|
| CO2 per Capita | 0.00 | 0.71 | 2.74 | 4.45 | 6.83 | 16.01 |
| GDP (Billion USD) | 0.26 | 27.84 | 61.11 | 117.88 | 165.16 | 371.14 |
| Population (Million) | 0.00 | 0.76 | 5.70 | 13.72 | 20.46 | 50.01 |
| Primary Energy Consumption | 0.00 | 6.41 | 40.97 | 182.90 | 285.18 | 703.33 |
| Total GHG | 0.00 | 9.28 | 43.92 | 78.03 | 106.20 | 251.58 |
The summary statistics provide an overview of the distribution of the key variables used in this study. The average CO2 emissions per capita across the selected countries and years is approximately 4.45 tonnes, with values ranging from 0 to 16.01 tonnes (the upper winsorisation cap), indicating substantial variation in carbon emissions among countries. The median value of 2.74 tonnes is lower than the mean, indicating a right-skewed distribution in which a minority of countries have particularly high emissions.
GDP shows considerable variation, ranging from approximately 0.26 billion USD to 371.14 billion USD, with an average of 117.88 billion USD. Similarly, population sizes vary widely across observations, from fewer than 2,000 people to a capped maximum of about 50 million, reflecting the diverse economic and demographic characteristics of the countries included in the dataset.
Primary energy consumption exhibits a broad range, with a mean of 182.90 TWh and values reaching as high as 703.33 TWh. This suggests significant differences in energy demand and usage among countries. Total greenhouse gas (GHG) emissions also vary substantially, with a mean of 78.03 million tonnes and a maximum value of 251.58 million tonnes, indicating that some countries contribute disproportionately to overall emissions.
Overall, the summary statistics reveal substantial variability in economic activity, population size, energy consumption, and emissions levels across countries. This variation supports the use of predictive and classification models to investigate the factors associated with national CO2 emissions.
EDA 2 — Missing Value % in Key Columns (before imputation):
| Variable | Missing_Pct |
|---|---|
| gas_co2 | 42.8 |
| coal_co2 | 36.1 |
| energy_per_gdp | 28.0 |
| gdp | 27.3 |
| land_use_change_co2 | 8.9 |
| flaring_co2 | 6.6 |
| methane | 6.6 |
| total_ghg | 6.6 |
| co2_per_unit_energy | 6.0 |
| nitrous_oxide | 5.6 |
| primary_energy_consumption | 5.3 |
| energy_per_capita | 5.3 |
| cement_co2 | 3.4 |
| oil_co2 | 0.1 |
| year | 0.0 |
| co2_per_capita | 0.0 |
| population | 0.0 |
Before imputation, several key variables in the dataset exhibit substantial missingness. The table above summarises the percentage of missing values for each variable.
gas_co2 (42.8%), coal_co2 (36.1%),
energy_per_gdp (28.0%), and gdp (27.3%) show
the highest proportions of missing data, largely reflecting gaps in
historical reporting for smaller or developing nations. Moderate
missingness (5–9%) is observed for land_use_change_co2,
flaring_co2, methane, total_ghg,
co2_per_unit_energy, nitrous_oxide,
primary_energy_consumption, and
energy_per_capita. cement_co2 and
oil_co2 have minimal missingness (3.4% and 0.1%
respectively), while year, co2_per_capita, and
population are fully complete (0.0%).
These results justify the use of imputation strategies (e.g., median or group-wise imputation) rather than row-wise deletion, which would result in significant data loss.
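A per-variable missingness table of this kind can be produced in a few lines of base R. The sketch below uses invented values and mirrors the `Variable` / `Missing_Pct` column names of the table above; it is an illustration, not the report's own code.

```r
# Toy data with different amounts of missingness per column.
df <- data.frame(
  gdp     = c(1, NA, 3, NA),
  oil_co2 = c(1, 2, 3, 4),
  gas_co2 = c(NA, NA, NA, 4)
)

# Share of NA values per column, as a percentage rounded to one decimal.
missing_pct <- round(colMeans(is.na(df)) * 100, 1)

missing_tbl <- data.frame(
  Variable    = names(missing_pct),
  Missing_Pct = as.numeric(missing_pct),
  stringsAsFactors = FALSE
)

# Order from most to least missing.
missing_tbl <- missing_tbl[order(-missing_tbl$Missing_Pct), ]
print(missing_tbl, row.names = FALSE)
```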
A horizontal bar chart visualises the missing value percentages for
all variables containing at least one missing value, ordered from
highest to lowest. This visualisation reinforces the findings from Table
5, clearly highlighting gas_co2, coal_co2,
energy_per_gdp, and gdp as the variables of
greatest concern, with oil_co2 showing negligible
missingness. The chart provides an at-a-glance reference for
prioritising data cleaning efforts.
To understand the temporal coverage of the dataset, observations were grouped into decades spanning 1990–2024. The 1990s, 2000s, and 2010s each contain approximately 2,117–2,130 country-year observations, indicating consistent reporting coverage across these three decades. The 2020s decade shows a lower count (1,065 observations), which is expected since this decade is incomplete (only up to 2024) at the time of analysis. Overall, the dataset provides a balanced longitudinal view of emissions across more than three decades.
Summary of co2_per_capita (1990-2024, real countries)
| Statistic | Value |
|---|---|
| Min. | 0.000000 |
| 1st Qu. | 0.714250 |
| Median | 2.744500 |
| Mean | 4.452312 |
| 3rd Qu. | 6.832500 |
| Max. | 16.009875 |
The target variable co2_per_capita (1990–2024, real
countries only) was examined for its distribution characteristics.
Summary statistics show a minimum of 0.000, a 1st quartile of 0.714, a
median of 2.745, a mean of 4.452, a 3rd quartile of 6.833, and a maximum
of 16.010 tonnes per capita.
The histogram reveals a strongly right-skewed distribution, with the majority of country-year observations concentrated below 5 tonnes per capita, and a long tail extending toward high-emission outliers (notably oil/gas-rich nations). Dashed reference lines mark the tertile thresholds used to define the Low/Medium/High emission categories for the classification task in Section 2.6. A noticeable spike near the maximum value (~16) is largely an artefact of winsorisation: the 436 observations above the upper IQR bound (16.01), which include consistently high-emission countries such as Qatar, Kuwait, and the UAE, were capped at that value.
| Variable | Mean | Median | SD | Min | Max | Skewness |
|---|---|---|---|---|---|---|
| population | 1.372471e+07 | 5.703130e+06 | 1.698800e+07 | 1.77600e+03 | 5.001299e+07 | 1.228 |
| gdp | 1.178822e+11 | 6.111433e+10 | 1.283459e+11 | 2.57172e+08 | 3.711429e+11 | 1.172 |
| primary_energy_consumption | 1.829010e+02 | 4.096600e+01 | 2.535390e+02 | 0.00000e+00 | 7.033330e+02 | 1.248 |
| energy_per_capita | 2.369295e+04 | 1.142043e+04 | 3.306387e+04 | 0.00000e+00 | 3.185597e+05 | 2.967 |
| energy_per_gdp | 1.440000e+00 | 1.172000e+00 | 1.255000e+00 | 7.80000e-02 | 2.302100e+01 | 5.834 |
| co2_per_unit_energy | 2.300590e+02 | 2.141910e+02 | 1.811110e+02 | 3.52450e+01 | 1.068890e+04 | 29.902 |
| coal_co2 | 5.738600e+01 | 2.290000e+00 | 4.156140e+02 | 0.00000e+00 | 8.886021e+03 | 14.790 |
| oil_co2 | 4.662100e+01 | 4.110000e+00 | 1.864720e+02 | 4.00000e-03 | 2.584130e+03 | 9.456 |
| gas_co2 | 3.033900e+01 | 7.575000e+00 | 1.122600e+02 | 0.00000e+00 | 1.748138e+03 | 9.215 |
| cement_co2 | 5.082000e+00 | 3.370000e-01 | 3.738300e+01 | 0.00000e+00 | 8.287100e+02 | 16.923 |
| flaring_co2 | 1.643000e+00 | 0.000000e+00 | 5.837000e+00 | 0.00000e+00 | 8.452000e+01 | 6.535 |
| land_use_change_co2 | 3.117400e+01 | 2.028000e+00 | 1.461710e+02 | -3.19418e+02 | 2.998516e+03 | 9.770 |
| methane | 3.889500e+01 | 9.580000e+00 | 1.196360e+02 | 1.00000e-03 | 1.590674e+03 | 6.912 |
| nitrous_oxide | 1.225600e+01 | 2.905000e+00 | 3.929700e+01 | 0.00000e+00 | 4.755340e+02 | 7.612 |
| total_ghg | 7.803500e+01 | 4.392200e+01 | 8.759500e+01 | -3.00000e-03 | 2.515780e+02 | 1.111 |
| co2_per_capita | 4.452000e+00 | 2.745000e+00 | 4.615000e+00 | 0.00000e+00 | 1.601000e+01 | 1.150 |
Table 7 presents descriptive statistics (mean, median, standard deviation, minimum, maximum, and skewness) for all key predictor variables and the target variable co2_per_capita.
Several variables exhibit extremely high positive skewness, most notably co2_per_unit_energy (29.9), cement_co2 (16.9), and coal_co2 (14.8), followed by land_use_change_co2 (9.8), oil_co2 (9.5), gas_co2 (9.2), nitrous_oxide (7.6), methane (6.9), flaring_co2 (6.5), and energy_per_gdp (5.8). This indicates the presence of significant outliers and heavy right tails, which is consistent with the global distribution of emissions being dominated by a small number of large economies and energy producers. In contrast, co2_per_capita itself (skewness = 1.150), population (1.228), gdp (1.172), primary_energy_consumption (1.248), and total_ghg (1.111) show more moderate skew, with energy_per_capita (2.967) in between. These findings motivate the outlier treatment carried out in Section 2.4.9.
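For reference, a moment-based skewness coefficient can be computed directly in base R. The report does not say which skewness definition or package produced the values in Table 7, so the formula below (third central moment divided by the 1.5 power of the second) is an assumption and may differ slightly from the reported figures.

```r
# Moment-based sample skewness: m3 / m2^(3/2).
skewness <- function(x) {
  x  <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^1.5
}

symmetric  <- c(1, 2, 3, 4, 5)    # symmetric around 3
right_tail <- c(1, 1, 1, 2, 50)   # one extreme high value

cat("Symmetric: ", skewness(symmetric), "\n")
cat("Right tail:", skewness(right_tail), "\n")
```

A symmetric sample gives a value of zero, while a single extreme high value pushes the coefficient well above one, which is the pattern seen for the fuel-specific emission variables.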
A grid of histograms was produced for all numeric predictors and the
target variable to visually assess their distributional shapes.
Consistent with the skewness values reported in Table 7, most variables
(gdp, population,
primary_energy_consumption, coal_co2,
oil_co2, gas_co2, cement_co2,
flaring_co2, land_use_change_co2,
methane, nitrous_oxide, and
co2_per_unit_energy) display heavily right-skewed
distributions, with the vast majority of observations clustered near
zero and a small number of extreme high values. The target variable co2_per_capita shows a more moderate right skew, while energy_per_gdp appears closer to normal or only slightly skewed in its histogram, even though a few extreme values give it a skewness of 5.8 in Table 7. These visual patterns confirm that transformation or robust scaling may be beneficial for certain modelling approaches, and that outlier handling is necessary prior to regression.
Boxplots were generated for all predictor and target variables to
visually identify the presence and extent of outliers. The plots confirm
the patterns observed in the histograms: variables such as
energy_per_gdp, co2_per_unit_energy,
coal_co2, oil_co2, gas_co2,
cement_co2, flaring_co2,
land_use_change_co2, methane, and
nitrous_oxide all display numerous extreme outliers above
the upper whisker, often representing values many times the
interquartile range. In contrast, co2_per_capita,
population, gdp, and
primary_energy_consumption show comparatively fewer extreme
points relative to their scale. These results support the decision to
apply outlier treatment (winsorisation) rather than outright removal,
given that many of these “outliers” represent genuine real-world
extremes (e.g., high-emission economies).
EDA 9 - Outlier Summary (IQR bounds, before winsorisation)
Note: Environmental data contains genuine extremes (e.g., Qatar, Kuwait)
IQR-based winsorisation (capping) was used instead of removal.
| Variable | Lower_Bound | Upper_Bound | Outliers_Before |
|---|---|---|---|
| co2_per_capita | -8.463000e+00 | 1.601000e+01 | 436 |
| gdp | -3.898365e+11 | 7.000362e+11 | 745 |
| population | -2.879704e+07 | 5.001299e+07 | 870 |
| primary_energy_consumption | -4.477340e+02 | 7.607240e+02 | 995 |
| total_ghg | -1.630720e+02 | 2.940290e+02 | 991 |
To formally quantify outliers, the Interquartile Range (IQR) method was applied to each key variable, with lower and upper bounds calculated as Q1 − 1.5×IQR and Q3 + 1.5×IQR respectively. Table 8 summarises the number of observations falling outside these bounds prior to treatment.
primary_energy_consumption (995 outliers) and
total_ghg (991 outliers) show the highest counts, followed
by population (870), gdp (745), and
co2_per_capita (436).
It is important to note that environmental and economic data of this nature often contain genuine extremes — for example, small but extremely wealthy or energy-intensive nations such as Qatar and Kuwait naturally produce values far outside typical ranges. Removing these observations would distort the analysis and discard valid information. Therefore, IQR-based winsorisation (capping) was applied instead of outright removal, preserving the sample size while reducing the influence of extreme values on downstream modelling.
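The IQR-based capping described above can be sketched as a small helper. The function name `winsorise_iqr` and the toy vector are hypothetical; the bounds follow the Q1 − 1.5×IQR and Q3 + 1.5×IQR rule stated in the text, using R's default quantile definition as an assumption.

```r
# Cap values outside [Q1 - k*IQR, Q3 + k*IQR] instead of removing them.
winsorise_iqr <- function(x, k = 1.5) {
  q     <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr   <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  list(
    values     = pmin(pmax(x, lower), upper),  # capped copy of x
    lower      = lower,
    upper      = upper,
    n_outliers = sum(x < lower | x > upper, na.rm = TRUE)
  )
}

x   <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)  # one extreme value
res <- winsorise_iqr(x)

cat("Bounds:", res$lower, "to", res$upper, "\n")
cat("Outliers capped:", res$n_outliers, "\n")
print(res$values)
```

Every observation is kept; only the extreme value is pulled back to the upper bound, which is why capping preserves sample size while limiting the leverage of extremes.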
| Emission_Level | Count | Proportion |
|---|---|---|
| Low | 2456 | 0.33 |
| Medium | 2456 | 0.33 |
| High | 2530 | 0.34 |
To support the classification task in Research Question 2, the
continuous target variable co2_per_capita was discretised
into three tertile-based categories — Low,
Medium, and High — representing
emission intensity levels.
The resulting class distribution is well balanced: Low (2,456 observations, 33%), Medium (2,456 observations, 33%), and High (2,530 observations, 34%). This near-equal split across classes is desirable for classification modelling, as it minimises class imbalance issues and ensures that performance metrics such as accuracy and macro-averaged F1 are meaningful and not biased toward a majority class.
The top 10 countries by average CO2 per capita (1990–2024) were identified and visualised. The list is dominated by oil- and gas-rich Gulf states and high-income, energy-intensive nations: United Arab Emirates, Qatar, Kuwait, Bahrain, Brunei, Saudi Arabia, Australia, United States, Canada, and Trinidad and Tobago. All ten countries consistently sit within the “High” emission tier across the study period, reflecting strong ties between fossil fuel production/consumption, economic development, and per-capita carbon output. These findings align with the descriptive statistics and reinforce the rationale for using economic and energy-related predictors in both the regression and classification models.
The strongest correlations with co2_per_capita are observed for energy_per_capita (r = 0.79), primary_energy_consumption (r = 0.42), and energy_per_gdp (r = 0.42), suggesting that energy consumption intensity, rather than total economic size, is the strongest correlate of per-capita emissions. gdp shows a moderate positive correlation (r = 0.33), while population shows a near-zero correlation (r = −0.04), confirming that total population size is not a strong predictor of per-capita emissions (as expected, since the target is already population-normalised).
Among the fuel-source variables, gas_co2 (r = 0.29),
coal_co2 (r = 0.26), and total_ghg (r = 0.25)
show moderate positive associations. Notably, several predictors are
highly correlated with one another — for example, gdp and
population (r = 0.69),
primary_energy_consumption and gdp (r = 0.89),
and oil_co2 and gas_co2 (r = 0.88) — flagging
potential multicollinearity issues that are formally assessed via VIF in
Section 2.5.4.
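A correlation screen of this kind can be sketched with base R's `cor()`. The data below are invented, and the 0.8 cut-off for flagging strongly related pairs is an illustrative assumption rather than a threshold stated in the report.

```r
# Toy data: the target plus two predictors.
df <- data.frame(
  co2_per_capita    = c(0.5, 1.0, 2.5, 4.0, 9.0, 15.0),
  energy_per_capita = c(2000, 4000, 9000, 15000, 40000, 60000),
  population        = c(50, 5, 30, 1, 20, 3)
)

# Pearson correlation matrix.
cors <- cor(df, method = "pearson")

# Correlation of each predictor with the target, strongest first.
target_cors <- sort(
  cors["co2_per_capita", colnames(cors) != "co2_per_capita"],
  decreasing = TRUE
)

# Flag variable pairs with |r| above the (assumed) 0.8 threshold,
# looking only at the upper triangle so each pair appears once.
high_pairs <- which(abs(cors) > 0.8 & upper.tri(cors), arr.ind = TRUE)

print(round(target_cors, 2))
print(high_pairs)
```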
Class counts:
Low Medium High
2456 2456 2530
Class proportions:
Low Medium High
0.33 0.33 0.34
The final EDA step confirms the distribution of the classification
target variable, emission_level, derived via tertile-based
discretisation of co2_per_capita. The bar chart and
accompanying table show:
| Emission Level | Count | Proportion |
|---|---|---|
| Low | 2,456 | 0.33 |
| Medium | 2,456 | 0.33 |
| High | 2,530 | 0.34 |
This balanced distribution across the three classes confirms that the dataset is well-suited for multi-class classification without requiring resampling techniques such as SMOTE or class weighting.
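Objective 1
To model and quantify the relationship between national CO2 emissions per capita (co2_per_capita) and key drivers of economic activity, energy consumption, and fossil fuel usage across countries from 1990–2024.
Research Question 1
What economic, energy consumption, and fossil fuel factors significantly explain variation in co2_per_capita across countries from 1990–2024?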
This study employs two complementary modelling approaches. Model R1 uses Multiple Linear Regression (OLS), providing interpretable coefficients and p-values to identify statistically significant predictors and directly address the research question. Model R2 applies Random Forest Regression to capture potential non-linear relationships, with feature importance scores used to evaluate and rank the relative contribution of each predictor, offering a complementary perspective to the linear model.
This section prepares the dataset for regression analysis by selecting 15 key predictor variables representing economic activity, energy consumption, and fossil fuel emissions, alongside CO2 emissions per capita as the target variable. After removing missing values, the final regression dataset consists of 7,442 observations. The data is then split into training and testing sets using an 80/20 ratio, resulting in 5,954 observations for model training and 1,488 observations for model evaluation. This split ensures that model performance can be assessed on unseen data, supporting a more robust and generalisable analysis.
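An 80/20 split can be sketched in base R as below. The seed value, the toy data, and the `test_reg` name are assumptions (only `train_reg` appears in the model output); the report may instead have used a helper such as `caret::createDataPartition`, whose exact call is not shown.

```r
set.seed(42)  # arbitrary seed, assumed here for reproducibility

# Toy dataset with 100 rows.
n  <- 100
df <- data.frame(id = seq_len(n), co2_per_capita = rnorm(n, mean = 4, sd = 2))

# Randomly assign 80% of the rows to training, the rest to testing.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train_reg <- df[train_idx, ]
test_reg  <- df[-train_idx, ]

cat("Train rows:", nrow(train_reg), " Test rows:", nrow(test_reg), "\n")
```

The two sets are disjoint, so any metric computed on the test rows reflects performance on data the model has not seen.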
OLS Model Summary (coefficients and significance):
Call:
lm(formula = lm_formula, data = train_reg)
Residuals:
Min 1Q Median 3Q Max
-17.1689 -1.2653 -0.4099 0.8295 12.3915
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.003e-01 9.031e-02 8.862 < 2e-16 ***
gdp 3.932e-12 5.865e-13 6.705 2.20e-11 ***
population -1.300e-07 4.340e-09 -29.963 < 2e-16 ***
primary_energy_consumption -6.105e-04 3.799e-04 -1.607 0.108094
energy_per_capita 8.990e-05 1.339e-06 67.166 < 2e-16 ***
energy_per_gdp 1.691e-01 3.062e-02 5.521 3.51e-08 ***
co2_per_unit_energy 3.191e-03 2.501e-04 12.758 < 2e-16 ***
coal_co2 2.709e-03 4.325e-04 6.265 4.00e-10 ***
oil_co2 4.693e-04 5.264e-04 0.892 0.372663
gas_co2 3.026e-03 7.714e-04 3.923 8.85e-05 ***
cement_co2 -2.029e-02 3.923e-03 -5.172 2.40e-07 ***
flaring_co2 4.226e-02 1.035e-02 4.085 4.47e-05 ***
land_use_change_co2 -1.086e-03 2.922e-04 -3.718 0.000203 ***
methane -1.017e-02 1.604e-03 -6.339 2.49e-10 ***
nitrous_oxide 1.805e-02 4.916e-03 3.673 0.000242 ***
total_ghg 2.506e-02 1.105e-03 22.670 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.39 on 5938 degrees of freedom
Multiple R-squared: 0.7323, Adjusted R-squared: 0.7317
F-statistic: 1083 on 15 and 5938 DF, p-value: < 2.2e-16
A multiple linear regression (OLS) model was fitted to predict
co2_per_capita using 15 predictors, including
gdp, population,
primary_energy_consumption, energy_per_capita,
energy_per_gdp, co2_per_unit_energy, and
fuel-specific/GHG variables (coal_co2,
oil_co2, gas_co2, cement_co2,
flaring_co2, land_use_change_co2,
methane, nitrous_oxide,
total_ghg).
The model achieved a Multiple R-squared of 0.7323 (Adjusted R² = 0.7317), indicating that approximately 73% of the variance in CO2 per capita is explained by the included predictors. The overall F-statistic (1083 on 15 and 5938 DF, p < 2.2e-16) confirms the model is highly statistically significant.
Most predictors are statistically significant (p < 0.05), with the exceptions of primary_energy_consumption (p = 0.108) and oil_co2 (p = 0.373). energy_per_capita has by far the largest t-statistic (t = 67.17, p < 2e-16), followed by population (t = −29.96), total_ghg (t = 22.67), and co2_per_unit_energy (t = 12.76), suggesting these predictors account for most of the model’s explanatory power.
The residual standard error is 2.39, and residuals range from −17.17 to 12.39, suggesting some asymmetry and the presence of large residuals for certain observations (likely high-emission outlier countries).
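The general shape of such a fit can be sketched with `lm()` on simulated data. Only two predictors are used here, and the simulated coefficients are invented values on a similar scale to the summary above; `lm_formula` and `train_reg` reuse the names visible in the model call, while everything else is an assumption.

```r
set.seed(1)  # arbitrary seed for the simulated data

# Simulated training data with a known linear relationship plus noise.
train_reg <- data.frame(
  energy_per_capita = runif(200, 0, 50000),
  population        = runif(200, 1e5, 5e7)
)
train_reg$co2_per_capita <- 0.8 +
  9e-5 * train_reg$energy_per_capita -
  1e-7 * train_reg$population +
  rnorm(200, sd = 1)

# Build the formula from a vector of predictor names, then fit by OLS.
predictors <- c("energy_per_capita", "population")
lm_formula <- reformulate(predictors, response = "co2_per_capita")
fit        <- lm(lm_formula, data = train_reg)

# Coefficient table (estimate, standard error, t value, p-value) and R-squared.
coefs <- summary(fit)$coefficients
r2    <- summary(fit)$r.squared

print(round(coefs, 6))
cat("R-squared:", round(r2, 4), "\n")
```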
A coefficient plot with 95% confidence intervals was produced to
visualise the magnitude, direction, and significance of each predictor’s
effect on co2_per_capita. Predictors are colour-coded by
significance (p < 0.05 vs. p ≥ 0.05).
On the raw (unstandardised) scale, energy_per_gdp shows by far the largest positive coefficient and widest confidence interval, followed by flaring_co2, total_ghg, and nitrous_oxide. cement_co2 and methane show clear negative coefficients. Because the predictors are measured in different units, these magnitudes are not directly comparable as measures of importance. Two predictors, oil_co2 and primary_energy_consumption, have confidence intervals crossing zero, consistent with their non-significant p-values reported in Section 2.5.2.
Variance Inflation Factors (VIF):
# Check for Multicollinearity
Low Correlation
Term VIF VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
energy_per_capita 1.96 [ 1.89, 2.04] 1.40 0.51 [0.49, 0.53]
energy_per_gdp 1.48 [ 1.43, 1.53] 1.22 0.68 [0.65, 0.70]
co2_per_unit_energy 1.08 [ 1.05, 1.11] 1.04 0.93 [0.90, 0.95]
flaring_co2 3.88 [ 3.72, 4.06] 1.97 0.26 [0.25, 0.27]
land_use_change_co2 1.62 [ 1.56, 1.68] 1.27 0.62 [0.60, 0.64]
Moderate Correlation
Term VIF VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
gdp 5.92 [ 5.65, 6.20] 2.43 0.17 [0.16, 0.18]
population 5.65 [ 5.40, 5.92] 2.38 0.18 [0.17, 0.19]
primary_energy_consumption 9.62 [ 9.17, 10.09] 3.10 0.10 [0.10, 0.11]
oil_co2 9.71 [ 9.26, 10.19] 3.12 0.10 [0.10, 0.11]
gas_co2 7.74 [ 7.38, 8.11] 2.78 0.13 [0.12, 0.14]
total_ghg 9.74 [ 9.28, 10.22] 3.12 0.10 [0.10, 0.11]
High Correlation
Term VIF VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
coal_co2 35.69 [33.95, 37.52] 5.97 0.03 [0.03, 0.03]
cement_co2 24.15 [22.98, 25.38] 4.91 0.04 [0.04, 0.04]
methane 37.67 [35.83, 39.60] 6.14 0.03 [0.03, 0.03]
nitrous_oxide 37.44 [35.61, 39.36] 6.12 0.03 [0.03, 0.03]
Variance Inflation Factors (VIF) were computed for all predictors to assess multicollinearity, grouped into Low (VIF < 5), Moderate (5 ≤ VIF < 10), and High (VIF ≥ 10) correlation categories.
Low correlation (VIF < 5):
energy_per_capita (1.96), energy_per_gdp
(1.48), co2_per_unit_energy (1.08),
flaring_co2 (3.88), and land_use_change_co2
(1.62) show acceptable VIF values, indicating minimal multicollinearity
concerns.
Moderate correlation (5–10): gdp
(5.92), population (5.65),
primary_energy_consumption (9.62), oil_co2
(9.71), gas_co2 (7.74), and total_ghg (9.74)
show elevated but generally tolerable VIF values.
High correlation (VIF ≥ 10): coal_co2
(35.69), cement_co2 (24.15), methane (37.67),
and nitrous_oxide (37.44) display severe multicollinearity,
with tolerance values as low as 0.03. This indicates these variables are
highly linearly related to other predictors (likely
total_ghg and each other, as components of the same
aggregate). While this does not bias the coefficient estimates, it
inflates standard errors and makes individual coefficient interpretation
for these variables less reliable. Future iterations of the model could
consider removing or combining these highly collinear GHG-component
variables.
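The link between VIF, tolerance, and standard-error inflation can be checked with a few lines of arithmetic. The sketch below is illustrative only: it copies the four reported VIF values by hand and relies on the standard identities tolerance = 1 / VIF and VIF = 1 / (1 − R²), where R² comes from regressing one predictor on all the others. The variable names `r2_aux` and `se_factor` are introduced here for illustration and are not part of the original analysis.

```r
# Reported VIFs for the high-collinearity group (copied from the output above)
vif <- c(coal_co2 = 35.69, cement_co2 = 24.15,
         methane = 37.67, nitrous_oxide = 37.44)

tolerance <- 1 / vif        # tolerance is the reciprocal of VIF
r2_aux    <- 1 - tolerance  # implied R^2 of each predictor regressed on the others
se_factor <- sqrt(vif)      # factor by which the coefficient's standard error is inflated

round(data.frame(vif, tolerance, r2_aux, se_factor), 3)
```

Under these identities, a VIF above 35 implies that more than 97% of the variance in that predictor is explained by the remaining predictors. The square roots also agree numerically with the adj. VIF column in the output.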
Linear Regression (OLS)
RMSE : 2.7306
MAE : 1.6974
R2 : 0.6670
The OLS model was evaluated on a held-out test set, yielding RMSE = 2.7306, MAE = 1.6974, and R² = 0.6670.
The drop in R² from the training value (0.7323) to the test value (0.6670) suggests some degree of overfitting or that the linear model struggles to generalise to extreme/outlier observations, consistent with the residual patterns examined in Section 2.5.6.
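For reference, the three test-set metrics can be computed from vectors of actual and predicted values as sketched below. `regression_metrics` is a hypothetical helper and the input values are toy numbers, not data from this analysis. R² is computed here as 1 − SSE/SST; some tools report the squared correlation between predictions and observations instead, so the exact definition behind the reported figures is an assumption.

```r
# Hypothetical helper (not from the original analysis): test-set regression metrics
regression_metrics <- function(actual, predicted) {
  residuals <- actual - predicted
  c(
    RMSE = sqrt(mean(residuals^2)),                 # root mean squared error
    MAE  = mean(abs(residuals)),                    # mean absolute error
    R2   = 1 - sum(residuals^2) /
               sum((actual - mean(actual))^2)       # 1 - SSE / SST
  )
}

# Toy values for illustration only
actual    <- c(1.0, 2.5, 4.0, 8.0, 16.0)
predicted <- c(1.2, 2.0, 4.5, 7.0, 12.0)

metrics <- regression_metrics(actual, predicted)
round(metrics, 4)
```

In the toy data, the single large miss on the highest value dominates RMSE far more than MAE, which is the same pattern that under-prediction of high emitters would produce.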
Two diagnostic plots were produced to assess the assumptions of the OLS model:
Residuals vs Fitted: The plot shows a clear funnel/fan-shaped pattern, with residual spread increasing for higher fitted values, and a distinct diagonal band of points. This indicates heteroscedasticity (non-constant variance) and suggests the linear model does not fully capture the relationship for high-emission countries.
Normal Q-Q Plot: Residuals deviate noticeably from the theoretical normal line, particularly in the tails — both the lower-left and upper-right ends curve away from the diagonal, indicating heavier tails than a normal distribution (i.e., more extreme residuals than expected under normality).
Together, these diagnostics suggest that while the OLS model captures a substantial portion of the variance, it violates key linear regression assumptions (homoscedasticity and normality of residuals), motivating the use of a more flexible, non-linear model such as Random Forest.
A scatter plot of actual vs. predicted co2_per_capita
values was produced, with a dashed reference line representing perfect
prediction (y = x).
For low-to-moderate actual values (roughly 0–8), predictions cluster reasonably close to the diagonal, though with noticeable scatter. However, for high actual values (particularly around 16, corresponding to high-emission countries like Qatar/UAE), the model substantially under-predicts, with predicted values clustering well below the actual values. This systematic under-prediction at the extremes is consistent with the heteroscedasticity observed in the residual plots and reflects the inherent limitation of a linear model in capturing the disproportionately large emissions of a small number of outlier economies.
Random Forest Regression
RMSE : 0.3837
MAE : 0.1685
R2 : 0.9937
To address the limitations of the OLS model, a Random Forest Regression model was fitted using the same predictor set. On the test set it achieved RMSE = 0.3837, MAE = 0.1685, and R² = 0.9937.
These results represent a dramatic improvement over the OLS model (R² = 0.6670 → 0.9937), suggesting that the relationship between the predictors and co2_per_capita is strongly non-linear and that Random Forest is far better suited to capturing these complex interactions, including the extreme values associated with high-emission countries.
Variable Importance
energy_per_capita energy_per_capita 47.348329
co2_per_unit_energy co2_per_unit_energy 28.782999
population population 19.237880
oil_co2 oil_co2 17.153103
coal_co2 coal_co2 15.403601
land_use_change_co2 land_use_change_co2 15.222930
nitrous_oxide nitrous_oxide 15.167226
total_ghg total_ghg 14.600553
methane methane 14.408944
energy_per_gdp energy_per_gdp 14.377791
primary_energy_consumption primary_energy_consumption 11.820692
cement_co2 cement_co2 11.527677
gas_co2 gas_co2 10.157521
flaring_co2 flaring_co2 9.733628
gdp gdp 8.672081
Feature importance scores (based on the increase in MSE when a variable is permuted) identify the predictors most critical to the Random Forest Regression model’s performance. In descending order: energy_per_capita (47.35), co2_per_unit_energy (28.78), population (19.24), oil_co2 (17.15), coal_co2 (15.40), land_use_change_co2 (15.22), nitrous_oxide (15.17), total_ghg (14.60), methane (14.41), energy_per_gdp (14.38), primary_energy_consumption (11.82), cement_co2 (11.53), gas_co2 (10.16), flaring_co2 (9.73), and gdp (8.67).
energy_per_capita stands out as the most influential predictor, with an importance score roughly 1.6 times that of the second-ranked variable (co2_per_unit_energy). This aligns with the strong correlation observed in Section 2.4.12 (r = 0.79) and the highly significant OLS coefficient for this variable, supporting the conclusion that how much energy a population consumes per person is the single strongest determinant of per-capita carbon emissions, more so than total economic output (gdp) or population size alone.
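The idea behind permutation-based importance can be illustrated with a simplified sketch. It uses synthetic data and a linear model as a stand-in, and it measures the MSE increase on the fitting data, whereas a Random Forest would typically use out-of-bag observations and may rescale the result. The names `dat`, `mse`, and `permutation_importance` are hypothetical and chosen for this example only.

```r
set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 * x1 + 0.5 * x2 + rnorm(n)   # x1 matters far more than x2 by construction
dat <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = dat)

# Mean squared error of a fitted model on a given data frame
mse <- function(model, data) mean((data$y - predict(model, newdata = data))^2)
baseline <- mse(fit, dat)

permutation_importance <- sapply(c("x1", "x2"), function(v) {
  shuffled <- dat
  shuffled[[v]] <- sample(shuffled[[v]])  # break the link between v and y
  mse(fit, shuffled) - baseline           # increase in MSE caused by shuffling v
})
permutation_importance
```

Shuffling a variable the model depends on heavily degrades predictions much more than shuffling a weak one, which is the logic behind ranking predictors by their increase in MSE.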
The two regression models were compared directly on the test set:
| Model | RMSE | MAE | R² |
|---|---|---|---|
| Linear Regression (OLS) | 2.7306 | 1.6974 | 0.6670 |
| Random Forest Regression | 0.3837 | 0.1685 | 0.9937 |
The bar chart comparison clearly shows the Random Forest model outperforming OLS across all three metrics: RMSE reduced from 2.731 to 0.384 (an ~86% reduction), MAE reduced from 1.697 to 0.168 (an ~90% reduction), and R² increased from 0.667 to 0.994. This confirms that non-linear, ensemble-based methods are substantially better suited to modelling CO2 per capita than a simple linear approach, primarily due to their ability to capture interactions and non-linear relationships involving extreme/outlier-prone variables.
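The quoted percentage reductions follow directly from the reported metrics. A short check, with the values copied by hand from the comparison table:

```r
# Reported test-set errors (copied from the comparison table)
ols <- c(RMSE = 2.7306, MAE = 1.6974)
rf  <- c(RMSE = 0.3837, MAE = 0.1685)

# Relative error reduction achieved by the Random Forest, in percent
pct_reduction <- 100 * (ols - rf) / ols
round(pct_reduction, 1)
```

Both figures round to the approximate reductions stated in the text (about 86% for RMSE and about 90% for MAE).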
Table: OLS Coefficient Interpretation (sorted by p-value)
Table: RQ1 — OLS Coefficient Interpretation
|Variable | Estimate| Std_Error| p_value|Significant |Direction |Interpretation |
|:---|---:|---:|---:|:---|:---|:---|
|energy_per_capita | 0.0000899| 0.0000013| 0.0000000|Yes |Positive ↑ |A 1-unit increase in energy_per_capita is associated with a 9e-05 change in CO2 per capita |
|population | -0.0000001| 0.0000000| 0.0000000|Yes |Negative ↓ |A 1-unit increase in population is associated with a -1e-07 change in CO2 per capita |
|total_ghg | 0.0250604| 0.0011054| 0.0000000|Yes |Positive ↑ |A 1-unit increase in total_ghg is associated with a 0.02506 change in CO2 per capita |
|co2_per_unit_energy | 0.0031905| 0.0002501| 0.0000000|Yes |Positive ↑ |A 1-unit increase in co2_per_unit_energy is associated with a 0.003191 change in CO2 per capita |
|gdp | 0.0000000| 0.0000000| 0.0000000|Yes |Positive ↑ |A 1-unit increase in gdp is associated with a positive change smaller than 5e-08 in CO2 per capita |
|methane | -0.0101654| 0.0016037| 0.0000000|Yes |Negative ↓ |A 1-unit increase in methane is associated with a -0.010165 change in CO2 per capita |
|coal_co2 | 0.0027093| 0.0004325| 0.0000000|Yes |Positive ↑ |A 1-unit increase in coal_co2 is associated with a 0.002709 change in CO2 per capita |
|energy_per_gdp | 0.1690762| 0.0306231| 0.0000000|Yes |Positive ↑ |A 1-unit increase in energy_per_gdp is associated with a 0.169076 change in CO2 per capita |
|cement_co2 | -0.0202868| 0.0039227| 0.0000002|Yes |Negative ↓ |A 1-unit increase in cement_co2 is associated with a -0.020287 change in CO2 per capita |
|flaring_co2 | 0.0422615| 0.0103465| 0.0000447|Yes |Positive ↑ |A 1-unit increase in flaring_co2 is associated with a 0.042262 change in CO2 per capita |
|gas_co2 | 0.0030261| 0.0007714| 0.0000885|Yes |Positive ↑ |A 1-unit increase in gas_co2 is associated with a 0.003026 change in CO2 per capita |
|land_use_change_co2 | -0.0010861| 0.0002922| 0.0002030|Yes |Negative ↓ |A 1-unit increase in land_use_change_co2 is associated with a -0.001086 change in CO2 per capita |
|nitrous_oxide | 0.0180532| 0.0049156| 0.0002421|Yes |Positive ↑ |A 1-unit increase in nitrous_oxide is associated with a 0.018053 change in CO2 per capita |
|primary_energy_consumption | -0.0006105| 0.0003799| 0.1080945|No |Negative ↓ |A 1-unit increase in primary_energy_consumption is associated with a -0.000611 change in CO2 per capita |
|oil_co2 | 0.0004693| 0.0005264| 0.3726626|No |Positive ↑ |A 1-unit increase in oil_co2 is associated with a 0.000469 change in CO2 per capita |
Significant POSITIVE drivers of CO2 per capita:
+ energy_per_capita
+ total_ghg
+ co2_per_unit_energy
+ gdp
+ coal_co2
+ energy_per_gdp
+ flaring_co2
+ gas_co2
+ nitrous_oxide
Significant NEGATIVE drivers of CO2 per capita:
- population
- methane
- cement_co2
- land_use_change_co2
Key findings:
energy_per_capita : +0.000090 per unit → strongest positive energy driver
total_ghg : +0.025060 per unit → broader GHG intensity lifts CO2
population : -0.000000 per unit → larger populations dilute per-capita emissions
methane : -0.010165 per unit → substitution effect with CO2 sources
RQ1 ANSWER: Significant predictors (p < 0.05 in OLS summary above)
explain variation in co2_per_capita. RF importance confirms ranking.
The OLS coefficients were ranked by p-value to identify the most reliable significant predictors of co2_per_capita.
Significant POSITIVE drivers of CO2 per capita:
- energy_per_capita (β = 0.0000899): the positive driver with the strongest statistical support; each additional unit of energy consumed per capita is associated with ~0.00009 tonnes more CO2 per capita
- total_ghg (β = 0.025060): broader greenhouse gas intensity is associated with higher CO2 per capita
- co2_per_unit_energy (β = 0.003191): more carbon-intensive energy production (more CO2 per unit of energy) is associated with higher per-capita emissions
- gdp (β ≈ 0.0000000): statistically significant but practically negligible in magnitude
- coal_co2, energy_per_gdp, flaring_co2, gas_co2, nitrous_oxide: all positively and significantly associated with CO2 per capita
Significant NEGATIVE drivers of CO2 per capita:
- population (β ≈ −0.0000001): larger populations are associated with slightly lower per-capita emissions, consistent with a “dilution effect” where total emissions are spread across more people
- methane (β = −0.010165): may reflect a substitution effect, where countries with higher methane emissions (e.g., from agriculture) tend to have relatively lower fossil-fuel-driven CO2 per capita; given this variable's VIF of 37.67, the sign should be interpreted with caution
- cement_co2 (β = −0.020287): also affected by severe multicollinearity (VIF = 24.15)
- land_use_change_co2 (β = −0.001086)
Non-significant predictors:
primary_energy_consumption (p = 0.108) and oil_co2 (p = 0.373) did not show statistically significant effects in this model.
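The significance calls above can be approximately reproduced from the reported estimates and standard errors. The sketch below copies three rows from the coefficient table by hand and uses a normal approximation to the t distribution, an assumption that is reasonable only because the sample is large. The data frame `coefs` is constructed for illustration.

```r
# Three rows copied from the coefficient table above
coefs <- data.frame(
  term      = c("energy_per_gdp", "primary_energy_consumption", "oil_co2"),
  estimate  = c(0.1690762, -0.0006105, 0.0004693),
  std_error = c(0.0306231,  0.0003799, 0.0005264)
)

coefs$t_stat      <- coefs$estimate / coefs$std_error
coefs$p_value     <- 2 * pnorm(-abs(coefs$t_stat))  # two-sided, normal approximation
coefs$significant <- coefs$p_value < 0.05
coefs
```

The approximated p-values for primary_energy_consumption and oil_co2 land close to the reported 0.108 and 0.373, confirming that neither clears the 0.05 threshold.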
RQ1 ANSWER: The OLS results identify energy_per_capita, total_ghg, co2_per_unit_energy, coal_co2, energy_per_gdp, gas_co2, nitrous_oxide, and flaring_co2 as significant positive predictors (gdp is also significant and positive, but negligible in magnitude), and population, methane, cement_co2, and land_use_change_co2 as significant negative predictors of CO2 per capita. The Random Forest feature importance rankings (Section 2.5.9) corroborate the central finding, with energy_per_capita emerging as the dominant driver across both modelling approaches. This answers RQ1 by indicating that energy consumption intensity, rather than raw economic size, is the primary structural determinant of a country’s carbon footprint per person.
Objective 2
Research Question 2
The target variable, emission_level, is created by dividing CO2 emissions per capita (co2_per_capita) into three tertile-based categories: Low, Medium, and High. This transformation enables the use of classification techniques to predict a country’s emission category based on a range of economic, energy, and environmental indicators.
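A tertile-based target of this kind can be built with `quantile()` and `cut()`. The sketch below uses synthetic, right-skewed values as a stand-in for co2_per_capita; the exact binning code used in the analysis is not shown in the report, so this is one plausible construction rather than the original.

```r
set.seed(2024)
co2_per_capita <- rexp(300, rate = 0.25)  # synthetic stand-in values

# Cut-points at the 1/3 and 2/3 quantiles, plus the minimum and maximum
breaks <- quantile(co2_per_capita, probs = c(0, 1/3, 2/3, 1))

emission_level <- cut(co2_per_capita,
                      breaks = breaks,
                      labels = c("Low", "Medium", "High"),
                      include.lowest = TRUE)  # keep the minimum in the "Low" bin
table(emission_level)
```

With distinct values the three groups are equal in size; when many observations share a value at a cut-point, the groups can come out slightly unequal.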
To address the research question, three classification models are employed. Model C1, Multinomial Logistic Regression, serves as an interpretable baseline model that estimates the probability of belonging to each emission category using linear decision boundaries. Model C2, a Decision Tree (CART) with a maximum depth of six, provides a rule-based and easily visualisable classification approach without requiring feature scaling. Model C3, a Random Forest Classifier consisting of 200 decision trees, is used as an ensemble learning method that typically offers higher predictive performance while also providing feature importance rankings to identify the most influential predictors.
All classification models are trained and evaluated using a stratified 80:20 train-test split to preserve the distribution of emission categories across both datasets. Model performance is assessed using Accuracy, Precision, Recall, and F1-score, allowing a comprehensive comparison of predictive effectiveness across the three approaches.
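A stratified split can be sketched in base R by sampling 80% of the rows within each class. The example rebuilds a label vector from the class counts reported in the output that follows (2456, 2456, 2530). Rounding each per-class draw up is an assumption, chosen because it reproduces the reported 5954 / 1488 split; the original code may have used a helper from a modelling package instead.

```r
set.seed(42)

# Label vector rebuilt from the reported class counts (illustrative only)
emission_level <- factor(
  rep(c("Low", "Medium", "High"), times = c(2456, 2456, 2530)),
  levels = c("Low", "Medium", "High")
)

# Draw 80% of the row indices within each class, rounding up (assumption)
rows_by_class <- split(seq_along(emission_level), emission_level)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = ceiling(length(idx) * 8 / 10))
}), use.names = FALSE)
test_idx <- setdiff(seq_along(emission_level), train_idx)

c(train = length(train_idx), test = length(test_idx))
round(prop.table(table(emission_level[test_idx])), 2)
```

Because sampling happens inside each class, the test set keeps the same class proportions as the full dataset.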
Classification dataset : 7442 rows, 13 predictors, 3 classes
Class distribution:
Low Medium High
2456 2456 2530
Class proportions:
Low Medium High
0.33 0.33 0.34
Train rows: 5954 | Test rows: 1488
Predictors centred and scaled; 5-fold cross-validation.
Logistic Regression
Accuracy : 0.8649
Precision : 0.8676 (macro avg)
Recall : 0.8652 (macro avg)
F1 Score : 0.8656 (macro avg)
Confusion Matrix:
Reference
Prediction Low Medium High
Low 456 40 0
Medium 35 408 83
High 0 43 423
Predictors were centred and scaled, and the model was validated using 5-fold cross-validation. Performance on the test set: Accuracy = 0.8649, Precision = 0.8676, Recall = 0.8652, and F1 = 0.8656 (precision, recall, and F1 are macro averages).
Confusion Matrix (rows: predicted class; columns: reference class):
| Prediction | Low | Medium | High |
|---|---|---|---|
| Low | 456 | 40 | 0 |
| Medium | 35 | 408 | 83 |
| High | 0 | 43 | 423 |
The model performs strongly overall (~86.5% accuracy), with complete separation between the Low and High classes (zero misclassifications between these two extremes). Most errors occur at the Medium boundary, particularly High observations misclassified as Medium (83 cases), reflecting the inherent difficulty of a linear model in capturing the transition zone between adjacent emission tiers.
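The reported accuracy and macro-averaged metrics can be recomputed from the Logistic Regression confusion matrix, reading rows as predictions and columns as reference classes as in the console output. The matrix is typed in by hand here; `cm` and `metrics` are names chosen for this sketch.

```r
# Logistic Regression confusion matrix (rows = prediction, columns = reference)
cm <- matrix(
  c(456,  40,   0,
     35, 408,  83,
      0,  43, 423),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = c("Low", "Medium", "High"),
                  Reference  = c("Low", "Medium", "High"))
)

tp        <- diag(cm)
precision <- tp / rowSums(cm)  # correct predictions / all predictions of that class
recall    <- tp / colSums(cm)  # correct predictions / all true members of that class
f1        <- 2 * precision * recall / (precision + recall)

# Macro averages weight the three classes equally
metrics <- round(c(Accuracy  = sum(tp) / sum(cm),
                   Precision = mean(precision),
                   Recall    = mean(recall),
                   F1        = mean(f1)), 4)
metrics
```

The same calculation applies to the Decision Tree and Random Forest matrices by swapping in their counts.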
max_depth = 6; no scaling required for tree-based models.
Decision Tree
Accuracy : 0.8763
Precision : 0.8759 (macro avg)
Recall : 0.8759 (macro avg)
F1 Score : 0.8754 (macro avg)
Confusion Matrix:
Reference
Prediction Low Medium High
Low 453 35 0
Medium 38 385 40
High 0 71 466
Table: Decision Tree — Variable Importance
|Variable | Importance|
|:---|---:|
|energy_per_capita | 2685.29|
|energy_per_gdp | 1103.84|
|primary_energy_consumption | 776.06|
|total_ghg | 446.58|
|gas_co2 | 361.20|
|flaring_co2 | 345.40|
|gdp | 231.81|
|population | 101.68|
|oil_co2 | 77.45|
|methane | 62.77|
|nitrous_oxide | 50.12|
A decision tree with max_depth = 6 was fitted; no
feature scaling was required.
Decision rules (top levels): The tree’s primary
split is on energy_per_capita < 5864, immediately
separating a large portion of “Low” emission countries. Subsequent
splits use energy_per_capita < 24,000,
population, and total_ghg to further refine
Medium vs. High classifications, with terminal nodes achieving purity
levels as high as 91% for both the lowest (Low) and highest (High)
emission groups.
Performance metrics: Accuracy = 0.8763, Precision = 0.8759, Recall = 0.8759, and F1 = 0.8754 (precision, recall, and F1 are macro averages).
Confusion Matrix (rows: predicted class; columns: reference class):
| Prediction | Low | Medium | High |
|---|---|---|---|
| Low | 453 | 35 | 0 |
| Medium | 38 | 385 | 40 |
| High | 0 | 71 | 466 |
The Decision Tree slightly outperforms the Logistic Regression model (87.6% vs. 86.5% accuracy), while offering the added benefit of an interpretable, rule-based structure. As with the logistic model, the primary source of error remains misclassification between Medium and High classes (71 Medium observations predicted as High).
Variable importance (based on total reduction in
node impurity) confirms energy_per_capita (2685.29) as
overwhelmingly the most influential predictor, followed by
energy_per_gdp (1103.84) and
primary_energy_consumption (776.06). Lower-ranked variables
include total_ghg, gas_co2,
flaring_co2, gdp, population,
oil_co2, methane, and
nitrous_oxide.
200 trees; importance = TRUE; no scaling required.
Random Forest Classifier
Accuracy : 0.9758
Precision : 0.9758 (macro avg)
Recall : 0.9758 (macro avg)
F1 Score : 0.9758 (macro avg)
Confusion Matrix:
Reference
Prediction Low Medium High
Low 485 6 0
Medium 6 473 12
High 0 12 494
Random Forest Classifier — Feature Importance (Mean Decrease Gini):
Table: RF Classifier — Feature Importance
|Variable | Gini|
|:---|---:|
|energy_per_capita | 1735.8907|
|population | 335.7887|
|energy_per_gdp | 329.2803|
|primary_energy_consumption | 289.5220|
|oil_co2 | 207.1252|
|coal_co2 | 178.5244|
|total_ghg | 160.6960|
|methane | 150.8876|
|nitrous_oxide | 142.9739|
|gas_co2 | 135.8080|
|cement_co2 | 94.7816|
|gdp | 88.4845|
|flaring_co2 | 64.8478|
A Random Forest Classifier with 200 trees was fitted
(importance = TRUE); no scaling was required.
Performance metrics: Accuracy = 0.9758, Precision = 0.9758, Recall = 0.9758, and F1 = 0.9758 (precision, recall, and F1 are macro averages).
Confusion Matrix (rows: predicted class; columns: reference class):
| Prediction | Low | Medium | High |
|---|---|---|---|
| Low | 485 | 6 | 0 |
| Medium | 6 | 473 | 12 |
| High | 0 | 12 | 494 |
The Random Forest Classifier dramatically outperforms both prior models, achieving 97.6% accuracy with near-perfect, balanced precision, recall, and F1 scores across all three classes. Misclassifications are minimal and confined entirely to adjacent classes (Medium↔High and Low↔Medium), with zero confusion between the extreme Low and High classes, indicating that the model separates the emission intensity tiers very reliably.
Feature Importance (Mean Decrease Gini), in descending order: energy_per_capita (1735.89), population (335.79), energy_per_gdp (329.28), primary_energy_consumption (289.52), oil_co2 (207.13), coal_co2 (178.52), total_ghg (160.70), methane (150.89), nitrous_oxide (142.97), gas_co2 (135.81), cement_co2 (94.78), gdp (88.48), and flaring_co2 (64.85).
As with the regression task, energy_per_capita is by far the most discriminative variable for distinguishing between Low, Medium, and High emission countries, with an importance score over 5 times that of the next-ranked variable (population).
Table: Classification Model Comparison — All Metrics
|Model | Accuracy| Precision| Recall| F1|
|:---|---:|---:|---:|---:|
|Logistic Regression | 0.8649| 0.8676| 0.8652| 0.8656|
|Decision Tree | 0.8763| 0.8759| 0.8759| 0.8754|
|Random Forest Classifier | 0.9758| 0.9758| 0.9758| 0.9758|
Best model by F1 Score: Random Forest Classifier
RQ2 ANSWER: See F1 and accuracy scores above.
RF feature importance identifies which structural predictors
best discriminate Low / Medium / High emission intensity tiers.
The grouped bar chart comparison across Accuracy, F1, Precision, and Recall visually confirms the Random Forest’s clear superiority (~0.976 across all metrics), compared to ~0.865–0.876 for the Logistic Regression and Decision Tree models. Moving from a linear model (Logistic Regression) to a single non-linear rule-based model (Decision Tree) yields only a modest gain of about one percentage point in accuracy, whereas moving to an ensemble method (Random Forest) adds roughly ten points, mirroring the pattern observed in the regression task (Section 2.5.10).
RQ2 ANSWER: Yes — countries can be accurately
classified into Low, Medium, and High emission intensity groups based on
economic structure, energy consumption, and fossil fuel dependence. The
Random Forest Classifier achieves 97.6% accuracy and macro-F1,
substantially outperforming both the Logistic Regression and Decision
Tree baselines. Feature importance rankings consistently identify
energy_per_capita, population,
energy_per_gdp, and primary_energy_consumption
as the most discriminative structural predictors.
Table: Country Classification by Emission Level (RF Model)
Table: RQ2 — Country Emission Level Classification
|country | Avg_CO2_Per_Capita|Actual_Class |Predicted_Class |Match |
|:--------------------------------|------------------:|:------------|:---------------|:-------|
|Bahrain | 16.010|High |High |Correct |
|Kuwait | 16.010|High |High |Correct |
|Qatar | 16.010|High |High |Correct |
|United Arab Emirates | 16.010|High |High |Correct |
|Brunei | 15.971|High |High |Correct |
|Saudi Arabia | 15.947|High |High |Correct |
|Australia | 15.840|High |High |Correct |
|United States | 15.749|High |High |Correct |
|Canada | 15.605|High |High |Correct |
|Trinidad and Tobago | 15.399|High |High |Correct |
|Luxembourg | 15.331|High |High |Correct |
|Sint Maarten (Dutch part) | 15.016|High |High |Correct |
|Curacao | 14.795|High |High |Correct |
|Faroe Islands | 13.749|High |High |Correct |
|Kazakhstan | 12.897|High |High |Correct |
|Oman | 12.838|High |High |Correct |
|New Caledonia | 12.392|High |High |Correct |
|Estonia | 12.246|High |High |Correct |
|Palau | 11.806|High |High |Correct |
|Russia | 11.462|High |High |Correct |
|Czechia | 11.407|High |High |Correct |
|Aruba | 11.214|High |High |Correct |
|Belgium | 10.587|High |High |Correct |
|Saint Pierre and Miquelon | 10.514|High |High |Correct |
|Greenland | 10.464|High |High |Correct |
|Taiwan | 10.439|High |High |Correct |
|South Korea | 10.435|High |High |Correct |
|Singapore | 10.348|High |High |Correct |
|Germany | 10.255|High |High |Correct |
|Finland | 10.233|High |High |Correct |
|Iceland | 10.051|High |High |Correct |
|Netherlands | 10.007|High |High |Correct |
|Turkmenistan | 9.475|High |High |Correct |
|Japan | 9.454|High |High |Correct |
|Ireland | 9.409|High |High |Correct |
|Libya | 9.135|High |High |Correct |
|Denmark | 8.931|High |High |Correct |
|Bermuda | 8.850|High |High |Correct |
|Norway | 8.732|High |High |Correct |
|Poland | 8.658|High |High |Correct |
|Israel | 8.389|High |High |Correct |
|Greece | 8.115|High |High |Correct |
|Austria | 8.092|High |High |Correct |
|United Kingdom | 8.075|High |High |Correct |
|Anguilla | 8.074|High |High |Correct |
|South Africa | 8.070|High |High |Correct |
|New Zealand | 7.832|High |High |Correct |
|Slovenia | 7.473|High |High |Correct |
|Slovakia | 7.462|High |High |Correct |
|Turks and Caicos Islands | 7.303|High |High |Correct |
|Italy | 7.099|High |High |Correct |
|Montserrat | 6.681|High |High |Correct |
|Andorra | 6.657|High |High |Correct |
|Cyprus | 6.628|High |High |Correct |
|Belarus | 6.591|High |High |Correct |
|Ukraine | 6.570|High |High |Correct |
|Iran | 6.484|High |High |Correct |
|Bulgaria | 6.460|High |High |Correct |
|Malaysia | 6.393|High |High |Correct |
|Spain | 6.323|High |High |Correct |
|Bahamas | 6.111|High |High |Correct |
|Serbia | 6.108|High |High |Correct |
|France | 5.977|High |High |Correct |
|British Virgin Islands | 5.872|High |High |Correct |
|Antigua and Barbuda | 5.688|High |High |Correct |
|Malta | 5.551|High |High |Correct |
|Liechtenstein | 5.548|High |High |Correct |
|Hong Kong | 5.542|High |High |Correct |
|Switzerland | 5.465|High |High |Correct |
|Hungary | 5.463|High |High |Correct |
|Sweden | 5.439|High |High |Correct |
|Bonaire Sint Eustatius and Saba | 5.283|High |High |Correct |
|Venezuela | 5.264|High |High |Correct |
|China | 5.150|High |High |Correct |
|Bhutan | 1.151|Low |Low |Correct |
|Tonga | 1.076|Low |Low |Correct |
|Lesotho | 1.009|Low |Low |Correct |
|El Salvador | 0.989|Low |Low |Correct |
|Zimbabwe | 0.965|Low |Low |Correct |
|Laos | 0.954|Low |Low |Correct |
|Philippines | 0.948|Low |Low |Correct |
|Eswatini | 0.942|Low |Low |Correct |
|Honduras | 0.904|Low |Low |Correct |
|Samoa | 0.903|Low |Low |Correct |
|Paraguay | 0.879|Low |Low |Correct |
|Cape Verde | 0.858|Low |Low |Correct |
|Guatemala | 0.842|Low |Low |Correct |
|Tuvalu | 0.841|Low |Low |Correct |
|Angola | 0.820|Low |Low |Correct |
|Nicaragua | 0.735|Low |Low |Correct |
|Pakistan | 0.730|Low |Low |Correct |
|Yemen | 0.664|Low |Low |Correct |
|Papua New Guinea | 0.663|Low |Low |Correct |
|Tajikistan | 0.657|Low |Low |Correct |
|Nigeria | 0.628|Low |Low |Correct |
|Mauritania | 0.627|Low |Low |Correct |
|Sri Lanka | 0.616|Low |Low |Correct |
|Palestine | 0.554|Low |Low |Correct |
|Senegal | 0.519|Low |Low |Correct |
|Solomon Islands | 0.502|Low |Low |Correct |
|Sao Tome and Principe | 0.497|Low |Low |Correct |
|Vanuatu | 0.475|Low |Low |Correct |
|Djibouti | 0.474|Low |Low |Correct |
|Kiribati | 0.451|Low |Low |Correct |
|Cambodia | 0.446|Low |Low |Correct |
|Ghana | 0.382|Low |Low |Correct |
|Cote d'Ivoire | 0.379|Low |Low |Correct |
|Cameroon | 0.363|Low |Low |Correct |
|East Timor | 0.340|Low |Low |Correct |
|Bangladesh | 0.336|Low |Low |Correct |
|Benin | 0.327|Low |Low |Correct |
|Sudan | 0.320|Low |Low |Correct |
|Myanmar | 0.305|Low |Low |Correct |
|Zambia | 0.303|Low |Low |Correct |
|Togo | 0.298|Low |Low |Correct |
|Kenya | 0.296|Low |Low |Correct |
|Comoros | 0.272|Low |Low |Correct |
|Eritrea | 0.229|Low |Low |Correct |
|Nepal | 0.227|Low |Low |Correct |
|Gambia | 0.217|Low |Low |Correct |
|Guinea | 0.212|Low |Low |Correct |
|Haiti | 0.203|Low |Low |Correct |
|Afghanistan | 0.175|Low |Low |Correct |
|Liberia | 0.167|Low |Low |Correct |
|Tanzania | 0.162|Low |Low |Correct |
|Mali | 0.146|Low |Low |Correct |
|Mozambique | 0.143|Low |Low |Correct |
|Burkina Faso | 0.140|Low |Low |Correct |
|South Sudan | 0.115|Low |Low |Correct |
|Sierra Leone | 0.109|Low |Low |Correct |
|Madagascar | 0.106|Low |Low |Correct |
|Chad | 0.101|Low |Low |Correct |
|Guinea-Bissau | 0.090|Low |Low |Correct |
|Uganda | 0.086|Low |Low |Correct |
|Ethiopia | 0.085|Low |Low |Correct |
|Niger | 0.079|Low |Low |Correct |
|Rwanda | 0.078|Low |Low |Correct |
|Malawi | 0.077|Low |Low |Correct |
|Somalia | 0.070|Low |Low |Correct |
|Democratic Republic of Congo | 0.051|Low |Low |Correct |
|Central African Republic | 0.050|Low |Low |Correct |
|Burundi | 0.040|Low |Low |Correct |
|Mongolia | 6.960|Medium |Medium |Correct |
|Nauru | 6.798|Medium |Medium |Correct |
|Portugal | 5.198|Medium |Medium |Correct |
|Barbados | 4.898|Medium |Medium |Correct |
|Lithuania | 4.712|Medium |Medium |Correct |
|Romania | 4.645|Medium |Medium |Correct |
|Croatia | 4.537|Medium |Medium |Correct |
|Bosnia and Herzegovina | 4.462|Medium |Medium |Correct |
|Suriname | 4.339|Medium |Medium |Correct |
|Uzbekistan | 4.293|Medium |Medium |Correct |
|Saint Kitts and Nevis | 4.256|Medium |Medium |Correct |
|Turkey | 4.104|Medium |Medium |Correct |
|Azerbaijan | 4.091|Medium |Medium |Correct |
|Seychelles | 4.041|Medium |Medium |Correct |
|North Macedonia | 4.008|Medium |Medium |Correct |
|Argentina | 3.988|Medium |Medium |Correct |
|Mexico | 3.941|Medium |Medium |Correct |
|Latvia | 3.902|Medium |Medium |Correct |
|Equatorial Guinea | 3.882|Medium |Medium |Correct |
|Cook Islands | 3.818|Medium |Medium |Correct |
|Chile | 3.806|Medium |Medium |Correct |
|Iraq | 3.800|Medium |Medium |Correct |
|Gabon | 3.647|Medium |Medium |Correct |
|Lebanon | 3.516|Medium |Medium |Correct |
|Algeria | 3.431|Medium |Medium |Correct |
|Jamaica | 3.334|Medium |Medium |Correct |
|Montenegro | 3.211|Medium |Medium |Correct |
|Thailand | 3.182|Medium |Medium |Correct |
|Niue | 3.093|Medium |Medium |Correct |
|French Polynesia | 2.958|Medium |Medium |Correct |
|North Korea | 2.784|Medium |Medium |Correct |
|Jordan | 2.750|Medium |Medium |Correct |
|Guyana | 2.659|Medium |Medium |Correct |
|Mauritius | 2.613|Medium |Medium |Correct |
|Macao | 2.585|Medium |Medium |Correct |
|Marshall Islands | 2.556|Medium |Medium |Correct |
|Syria | 2.500|Medium |Medium |Correct |
|Botswana | 2.388|Medium |Medium |Correct |
|Saint Lucia | 2.388|Medium |Medium |Correct |
|Maldives | 2.346|Medium |Medium |Correct |
|Cuba | 2.277|Medium |Medium |Correct |
|Tunisia | 2.277|Medium |Medium |Correct |
|Ecuador | 2.126|Medium |Medium |Correct |
|Dominican Republic | 2.115|Medium |Medium |Correct |
|Panama | 2.096|Medium |Medium |Correct |
|Brazil | 2.078|Medium |Medium |Correct |
|Grenada | 2.075|Medium |Medium |Correct |
|Moldova | 2.014|Medium |Medium |Correct |
|Egypt | 1.983|Medium |Medium |Correct |
|Dominica | 1.900|Medium |Medium |Correct |
|Georgia | 1.891|Medium |Medium |Correct |
|Wallis and Futuna | 1.884|Medium |Medium |Correct |
|Saint Helena | 1.872|Medium |Medium |Correct |
|Uruguay | 1.869|Medium |Medium |Correct |
|Saint Vincent and the Grenadines | 1.790|Medium |Medium |Correct |
|Indonesia | 1.708|Medium |Medium |Correct |
|Belize | 1.704|Medium |Medium |Correct |
|Colombia | 1.687|Medium |Medium |Correct |
|Armenia | 1.640|Medium |Medium |Correct |
|Vietnam | 1.551|Medium |Medium |Correct |
|Bolivia | 1.518|Medium |Medium |Correct |
|Kyrgyzstan | 1.509|Medium |Medium |Correct |
|Costa Rica | 1.490|Medium |Medium |Correct |
|Morocco | 1.456|Medium |Medium |Correct |
|Albania | 1.342|Medium |Medium |Correct |
|Micronesia (country) | 1.336|Medium |Medium |Correct |
|Peru | 1.323|Medium |Medium |Correct |
|India | 1.281|Medium |Medium |Correct |
|Namibia | 1.187|Medium |Medium |Correct |
|Fiji | 1.174|Medium |Medium |Correct |
|Congo | 1.122|Medium |Medium |Correct |
LOW emission countries:
Bhutan, Tonga, Lesotho, El Salvador, Zimbabwe, Laos, Philippines, Eswatini, Honduras, Samoa, Paraguay, Cape Verde, Guatemala, Tuvalu, Angola, Nicaragua, Pakistan, Yemen, Papua New Guinea, Tajikistan, Nigeria, Mauritania, Sri Lanka, Palestine, Senegal, Solomon Islands, Sao Tome and Principe, Vanuatu, Djibouti, Kiribati, Cambodia, Ghana, Cote d'Ivoire, Cameroon, East Timor, Bangladesh, Benin, Sudan, Myanmar, Zambia, Togo, Kenya, Comoros, Eritrea, Nepal, Gambia, Guinea, Haiti, Afghanistan, Liberia, Tanzania, Mali, Mozambique, Burkina Faso, South Sudan, Sierra Leone, Madagascar, Chad, Guinea-Bissau, Uganda, Ethiopia, Niger, Rwanda, Malawi, Somalia, Democratic Republic of Congo, Central African Republic, Burundi
MEDIUM emission countries:
Mongolia, Nauru, Portugal, Barbados, Lithuania, Romania, Croatia, Bosnia and Herzegovina, Suriname, Uzbekistan, Saint Kitts and Nevis, Turkey, Azerbaijan, Seychelles, North Macedonia, Argentina, Mexico, Latvia, Equatorial Guinea, Cook Islands, Chile, Iraq, Gabon, Lebanon, Algeria, Jamaica, Montenegro, Thailand, Niue, French Polynesia, North Korea, Jordan, Guyana, Mauritius, Macao, Marshall Islands, Syria, Botswana, Saint Lucia, Maldives, Cuba, Tunisia, Ecuador, Dominican Republic, Panama, Brazil, Grenada, Moldova, Egypt, Dominica, Georgia, Wallis and Futuna, Saint Helena, Uruguay, Saint Vincent and the Grenadines, Indonesia, Belize, Colombia, Armenia, Vietnam, Bolivia, Kyrgyzstan, Costa Rica, Morocco, Albania, Micronesia (country), Peru, India, Namibia, Fiji, Congo
HIGH emission countries:
Bahrain, Kuwait, Qatar, United Arab Emirates, Brunei, Saudi Arabia, Australia, United States, Canada, Trinidad and Tobago, Luxembourg, Sint Maarten (Dutch part), Curacao, Faroe Islands, Kazakhstan, Oman, New Caledonia, Estonia, Palau, Russia, Czechia, Aruba, Belgium, Saint Pierre and Miquelon, Greenland, Taiwan, South Korea, Singapore, Germany, Finland, Iceland, Netherlands, Turkmenistan, Japan, Ireland, Libya, Denmark, Bermuda, Norway, Poland, Israel, Greece, Austria, United Kingdom, Anguilla, South Africa, New Zealand, Slovenia, Slovakia, Turks and Caicos Islands, Italy, Montserrat, Andorra, Cyprus, Belarus, Ukraine, Iran, Bulgaria, Malaysia, Spain, Bahamas, Serbia, France, British Virgin Islands, Antigua and Barbuda, Malta, Liechtenstein, Hong Kong, Switzerland, Hungary, Sweden, Bonaire Sint Eustatius and Saba, Venezuela, China
Total countries correctly classified : 213 / 213
Misclassified countries : 0
RQ2 ANSWER: The RF model classifies countries into Low / Medium / High
emission tiers with 97.6% accuracy. See country table above for
the full breakdown of which nations fall into each emission class.
The trained Random Forest model was applied to classify all 213
countries based on their average co2_per_capita (1990–2024)
into Low, Medium, or High emission tiers, and compared against their
actual tertile-based classification.
Results: 213 / 213 countries correctly classified (100% accuracy), with 0 misclassified countries.
High emission countries (average CO2 per capita ranging from ~5.15 to 16.01 tonnes) are dominated by oil/gas-exporting Gulf states (Qatar, Kuwait, UAE, Bahrain, Saudi Arabia, Brunei), high-income industrialised nations (Australia, United States, Canada, Luxembourg, Norway, Germany, Netherlands, Japan, South Korea), and several small island/territory economies (Curaçao, Sint Maarten, Faroe Islands, Bermuda).
Medium emission countries (ranging from ~1.12 to ~6.96 tonnes) include a broad mix of upper-middle-income and transition economies such as Mongolia, Portugal, Romania, Argentina, Mexico, Turkey, Brazil, Thailand, and Indonesia.
Low emission countries (ranging from ~0.04 to ~1.15 tonnes) are predominantly low-income, often agriculture-dependent nations, including many Sub-Saharan African countries (Ethiopia, Niger, Rwanda, Malawi, Somalia, DR Congo, Burundi, etc.), as well as Bhutan, Tonga, Lesotho, El Salvador, and several Pacific island and South/Southeast Asian nations.
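This country-level result should be interpreted with care. The evaluation covers the full dataset, including rows used to train the model, and matches only each country’s dominant class, so it overstates out-of-sample performance; the 97.6% row-level test accuracy remains the appropriate measure of generalisation. Even so, the result indicates that the Random Forest has learned stable patterns linking structural economic and energy indicators to emission intensity tiers, which makes it a useful framework for categorising a country’s emission profile from these characteristics.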
ANALYSIS COMPLETE
PART 4 (Regression) — RQ1 answered
PART 5 (Classification) — RQ2 answered
PART 4 (Regression) — RQ1 answered: Energy consumption per capita, total GHG intensity, and energy-related CO2 components are the dominant significant predictors of CO2 emissions per capita, with the Random Forest Regression model (R² = 0.9937) vastly outperforming OLS (R² = 0.6670).
PART 5 (Classification) — RQ2 answered: Countries can be classified into Low/Medium/High emission intensity tiers with high accuracy, with the Random Forest Classifier (97.6% accuracy/F1) outperforming Logistic Regression (86.5%) and Decision Tree (87.6%) models, and achieving perfect (213/213) country-level classification.
To address RQ1, two regression models were trained on a stratified 80/20 split (Train: 5,954 rows; Test: 1,488 rows) using 15 predictors spanning economic scale, energy structure, fossil fuel sources, and broader greenhouse gas factors. The models — Multiple Linear Regression (OLS) and Random Forest Regression — were evaluated on RMSE, MAE, and R².
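The report does not show how the stratified split was drawn, so the following is only a minimal base R sketch of one way to do it. The helper name `stratified_split`, the five quantile bins, and the toy outcome are assumptions for illustration, not the report's actual code.

```r
# Hypothetical helper: stratified train/test split for a numeric outcome.
# The outcome is binned into quantile groups and each group is sampled
# separately, so train and test share a similar outcome distribution.
stratified_split <- function(y, p = 0.8, bins = 5, seed = 42) {
  set.seed(seed)
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = bins + 1),
                            na.rm = TRUE))
  grp <- cut(y, breaks = breaks, include.lowest = TRUE)
  idx_by_grp <- split(seq_along(y), grp)
  train_idx <- unlist(lapply(idx_by_grp, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  }), use.names = FALSE)
  sort(train_idx)
}

# Toy outcome standing in for co2_per_capita (simulated, not real data)
set.seed(1)
toy_y <- rexp(1000, rate = 0.2)
train_idx <- stratified_split(toy_y)
test_idx  <- setdiff(seq_along(toy_y), train_idx)
c(train = length(train_idx), test = length(test_idx))
```

Sampling within bins keeps the share of very high emitters roughly equal in both partitions, which matters when the outcome is strongly right-skewed.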
The OLS model achieved an R² of 0.667 on the test set, explaining approximately 66.7% of the variance in CO2 per capita. 13 out of 15 predictors were statistically significant (p < 0.05), indicating that the selected variables collectively provide strong explanatory power.
Energy structure variables emerged as the dominant positive drivers:
energy_per_capita carried a significant positive coefficient (+0.0000899), confirming that countries consuming more energy per person consistently emit more CO2 per capita. The raw value is small only because the predictor is measured in large units; unstandardised coefficients are not directly comparable across predictors on different scales. The result aligns with expectations, as higher individual energy demand translates directly into greater combustion of fossil fuels.
energy_per_gdp (+0.1691) suggests that more energy-intensive economies, those requiring greater energy input per unit of economic output, produce significantly higher per capita emissions, reflecting structural inefficiency in energy use.
co2_per_unit_energy (+0.0032) confirms that a dirtier energy mix (i.e., higher carbon content per unit of energy consumed) directly elevates per capita emissions, highlighting the importance of energy transition policies.
Among fossil fuel sources, flaring_co2 (+0.0423) had the largest coefficient, likely reflecting oil-rich nations where gas flaring is prevalent. coal_co2 (+0.0027) and gas_co2 (+0.0030) were also significant positive drivers, consistent with their well-documented role in national emissions profiles.
Economic scale, measured by gdp, showed a small but significant positive relationship (+3.93e-12), suggesting that larger economies tend to emit more per capita, though the marginal effect per unit of GDP is minimal.
Several variables showed significant negative relationships:
population (−0.0000001) reflects a dilution effect: larger populations share a given level of total emissions across more people, reducing per capita figures even when aggregate emissions are high.
methane (−0.0102) suggests a substitution pattern, where countries with higher methane emissions (typically agriculture- or livestock-heavy economies) tend to rely less on CO2-intensive fossil fuel sources.
cement_co2 (−0.0203) and land_use_change_co2 (−0.0011) may similarly reflect sectoral substitution: economies driven by construction or land use may have lower direct fossil fuel CO2 intensity. These negative signs should be read cautiously, since methane and cement_co2 are among the highly collinear predictors.
Only primary_energy_consumption and oil_co2 were non-significant (p > 0.05), likely because their explanatory variance is absorbed by correlated variables such as energy_per_capita and other fossil fuel predictors.
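Because the OLS coefficients above are on the original scale of each predictor, their raw magnitudes cannot be ranked against each other. One common remedy is to refit the model on standardised variables. The sketch below uses simulated data; the toy means, standard deviations, and true effects are assumptions chosen only to show the idea, and the column names are borrowed from the report for readability.

```r
set.seed(123)
n <- 500

# Simulated predictors on very different scales (not the OWID data)
toy <- data.frame(
  energy_per_capita = rnorm(n, mean = 25000, sd = 15000),
  energy_per_gdp    = rnorm(n, mean = 1.5, sd = 0.6)
)
toy$co2_per_capita <- 0.0001 * toy$energy_per_capita +
  0.2 * toy$energy_per_gdp + rnorm(n, sd = 0.5)

# Raw fit: coefficients are per original unit of each predictor
raw_fit <- lm(co2_per_capita ~ energy_per_capita + energy_per_gdp, data = toy)

# Standardised fit: every column is centred and scaled to unit variance,
# so each coefficient is the change in SDs of the outcome per SD of predictor
toy_z   <- as.data.frame(scale(toy))
std_fit <- lm(co2_per_capita ~ energy_per_capita + energy_per_gdp, data = toy_z)

raw_coef <- coef(raw_fit)[-1]
std_coef <- coef(std_fit)[-1]
print(round(cbind(raw = raw_coef, standardised = std_coef), 5))
```

In this toy example the predictor with the tiny raw coefficient has by far the larger standardised effect, which is the pattern the report describes for energy_per_capita.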
VIF analysis revealed high collinearity among coal_co2
(VIF = 35.69), methane (VIF = 37.67),
nitrous_oxide (VIF = 37.44), and cement_co2
(VIF = 24.15). This is expected in emissions datasets where GHG
variables are structurally interrelated. High VIF inflates standard
errors and can make individual coefficient signs unstable, so the
estimates for these variables should be interpreted with caution. The
predictive accuracy of the Random Forest model is much less sensitive to
multicollinearity, although its importance scores can be shared among
correlated predictors; it therefore serves as a complementary check.
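For reference, the VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. The report does not show which function produced its VIF values, so the base R helper below (`vif_manual`, a hypothetical name) is only a sketch of the definition, run on simulated data.

```r
# Hypothetical helper: VIF for each column of a numeric predictor data frame.
# Each predictor is regressed on the remaining ones; VIF = 1 / (1 - R^2).
vif_manual <- function(X) {
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

set.seed(7)
n  <- 300
x1 <- rnorm(n)
toy_X <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.2),  # nearly collinear with x1
  x3 = rnorm(n)                  # independent of the others
)

vifs <- vif_manual(toy_X)
print(round(vifs, 2))
```

The two nearly collinear columns receive large VIFs while the independent column stays close to 1, mirroring how structurally related GHG variables inflate each other's VIF.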
The Random Forest model substantially outperformed OLS across all metrics — achieving an R² of 0.994, RMSE of 0.3729, and MAE of 0.1677 on the test set, compared to OLS’s RMSE of 2.7306 and MAE of 1.6974. This improvement reflects the Random Forest’s ability to capture non-linear interactions and complex dependencies among predictors that OLS cannot model.
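Feature importance rankings (% increase in MSE) were broadly consistent with OLS findings. energy_per_capita ranked first (49.27), reinforcing its role as the single most influential predictor of CO2 per capita. co2_per_unit_energy (30.19) and oil_co2 (22.25) followed; oil_co2 was non-significant in OLS, which suggests its contribution is non-linear or masked by correlated predictors in the linear model. population (19.40) also ranked highly, although importance scores carry no sign and so cannot by themselves confirm the direction of the dilution effect identified in OLS.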
| Model | RMSE | MAE | R² |
|---|---|---|---|
| Linear Regression (OLS) | 2.7306 | 1.6974 | 0.667 |
| Random Forest | 0.3729 | 0.1677 | 0.994 |
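The three metrics in the table can be computed from held-out predictions as follows. This is a minimal sketch with a hypothetical helper name and a five-point toy vector; it uses the 1 − SSE/SST definition of R², whereas some packages report the squared correlation instead, and the report does not state which definition it used.

```r
# Hypothetical helper: RMSE, MAE and R-squared from actual vs predicted values
regression_metrics <- function(actual, predicted) {
  resid <- actual - predicted
  c(
    RMSE = sqrt(mean(resid^2)),                # root mean squared error
    MAE  = mean(abs(resid)),                   # mean absolute error
    R2   = 1 - sum(resid^2) / sum((actual - mean(actual))^2)
  )
}

# Toy values, not model output
actual    <- c(1, 2, 3, 4, 5)
predicted <- c(1.5, 2, 2.5, 4, 5)
m <- regression_metrics(actual, predicted)
print(round(m, 4))
```

RMSE penalises large errors more heavily than MAE, which is why the gap between the two is wider for OLS than for the Random Forest in the table above.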
The two models serve complementary roles. OLS provides statistical interpretability — coefficient estimates, significance testing, and directional inference — making it suitable for explaining why emissions vary. Random Forest provides predictive accuracy, capturing complex interactions missed by the linear model. Together, they offer a robust analytical framework: OLS identifies which factors matter and how, while Random Forest confirms their relative importance under a more flexible modelling approach.
In answer to RQ1, the analysis identifies energy consumption intensity, fossil fuel dependence, and economic scale as the primary positive drivers of CO2 per capita variation across countries from 1990–2024. Population size and methane emissions are negatively associated with CO2 per capita, reflecting dilution and substitution effects respectively. The Random Forest importance rankings highlight the same energy-related variables, lending confidence to the conclusions drawn.
To address RQ2, three classification models were trained on a
stratified 80/20 split (Train: 5,954 rows; Test: 1,488 rows) to predict
a country’s emission intensity tier — Low,
Medium, or High — based on 13
predictors spanning economic scale, energy consumption, and fossil fuel
dependence. The target variable emission_level was derived
from tertile thresholds of co2_per_capita, yielding
near-balanced classes (Low: 33%, Medium: 33%, High: 34%). Models were
evaluated on Accuracy, Precision, Recall, and F1 Score (macro-averaged
across all three classes).
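A tertile-based target of this kind can be derived with `quantile()` and `cut()`. The sketch below runs on simulated values standing in for co2_per_capita; the variable name `toy_co2` and the log-normal distribution are assumptions for illustration.

```r
set.seed(2024)

# Simulated, right-skewed stand-in for co2_per_capita (not real data)
toy_co2 <- rlnorm(900, meanlog = 0.5, sdlog = 1.2)

# Tertile thresholds: the 1/3 and 2/3 quantiles
cuts <- quantile(toy_co2, probs = c(1/3, 2/3))

# Assign each observation to an ordered tier
emission_level <- cut(
  toy_co2,
  breaks = c(-Inf, cuts, Inf),
  labels = c("Low", "Medium", "High")
)

print(round(prop.table(table(emission_level)), 3))
```

Because the thresholds are quantiles of the outcome itself, the three classes come out near-balanced by construction, which is why macro-averaged and overall metrics are so close in the results that follow.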
The logistic regression model served as the interpretable linear baseline, achieving an accuracy of 0.865 and a macro F1 of 0.866 after centring, scaling, and 5-fold cross-validation. Performance was reasonably strong, particularly for the Low class (456 correct out of 496), where the linear decision boundary was sufficient to separate the clearly distinct low-emission profile of Sub-Saharan African and South Asian nations.
The primary source of error was at the Medium↔︎High boundary, where 83 Medium observations were misclassified as High, and 43 High observations were misclassified as Medium. This is expected, as countries near the tertile thresholds share overlapping energy and economic profiles that a linear model cannot resolve. Notably, no Low observations were misclassified as High (or vice versa), confirming that the extreme classes are well-separated in feature space even under a linear model.
The Decision Tree marginally outperformed logistic regression with an
accuracy of 0.876 and F1 of 0.875. By
learning non-linear, rule-based splits, the tree was better able to
capture threshold effects in the data — for instance, splitting on
energy_per_capita at a specific cut-off to discriminate
High from non-High emitters.
energy_per_capita dominated the variable importance
ranking (importance = 2685.29), followed by energy_per_gdp
(1103.84) and primary_energy_consumption (776.06). This
confirms that energy structure variables drive the classification
boundaries most strongly. Fossil fuel variables such as
gas_co2 (361.20) and flaring_co2 (345.40) also
contributed meaningfully, reflecting the role of fuel mix in
distinguishing emission tiers.
The confusion matrix reveals that the Decision Tree’s improvement over logistic regression was concentrated at the Medium↔︎High boundary: High→Medium misclassifications fell from 43 to 12, although 71 Medium observations were still misclassified as High. This suggests the tree’s splits captured the High class more precisely while continuing to overpredict High for some Medium cases.
The Random Forest substantially outperformed both baseline models, achieving 97.6% accuracy and a macro F1 of 0.976 — a gain of approximately 10 percentage points over the Decision Tree. Only 36 out of 1,488 test observations were misclassified, with errors concentrated at the Medium↔︎Low (6 cases) and Medium↔︎High (12 cases) boundaries. No Low observations were misclassified as High or vice versa, demonstrating near-perfect separation of the extreme emission classes.
The large performance gap between Random Forest and the single Decision Tree reflects the benefit of ensemble averaging: by aggregating 200 trees, each grown on a bootstrap sample with a random subset of features considered at each split, the model reduces variance and captures complex, non-linear interactions that a single tree represents poorly.
Feature importance (Mean Decrease Gini) was strongly
concentrated in energy_per_capita (Gini = 1735.89), with a
large drop-off to the next most important variables:
population (335.79), energy_per_gdp (329.28),
and primary_energy_consumption (289.52). This hierarchy is
consistent across the Decision Tree and regression analyses, reinforcing
that individual energy consumption intensity is the
single most discriminative feature for emission tier classification.
population emerged as the second-ranked feature in the
Random Forest (compared to lower ranks in OLS and the Decision Tree),
suggesting that while population size is not a strong linear predictor,
it interacts non-linearly with other features — for instance,
large-population, low-income countries cluster distinctively in the Low
tier.
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Logistic Regression | 0.865 | 0.868 | 0.865 | 0.866 |
| Decision Tree | 0.876 | 0.876 | 0.876 | 0.875 |
| Random Forest | 0.976 | 0.976 | 0.976 | 0.976 |
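Macro-averaged metrics like those in the table are obtained by computing precision, recall, and F1 per class and then taking the unweighted mean. The helper below is a hypothetical base R sketch on a twelve-row toy example; it averages per-class F1 scores, which is one common convention, and the report does not state which variant it used.

```r
# Hypothetical helper: accuracy and macro-averaged precision, recall, F1.
# `actual` must be a factor; `predicted` is aligned to the same levels.
macro_metrics <- function(actual, predicted) {
  lv <- levels(actual)
  cm <- table(Predicted = factor(predicted, levels = lv), Actual = actual)
  tp <- diag(cm)
  precision <- tp / rowSums(cm)  # TP / everything predicted as the class
  recall    <- tp / colSums(cm)  # TP / everything actually in the class
  f1 <- 2 * precision * recall / (precision + recall)
  c(Accuracy = sum(tp) / sum(cm),
    Precision = mean(precision),
    Recall = mean(recall),
    F1 = mean(f1))
}

# Toy labels, not model output
lv <- c("Low", "Medium", "High")
actual <- factor(rep(lv, each = 4), levels = lv)
predicted <- factor(
  c("Low", "Low", "Low", "Medium",
    "Medium", "Medium", "Medium", "High",
    "High", "High", "High", "Medium"),
  levels = lv
)

cls <- macro_metrics(actual, predicted)
print(round(cls, 3))
```

With near-balanced classes, macro recall equals overall accuracy up to rounding, which matches the near-identical Accuracy and Recall columns in the table.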
The three models serve complementary purposes. Logistic Regression
provides a transparent probabilistic framework and confirms that linear
separability is already strong for the extreme classes. The Decision
Tree offers interpretable classification rules — readable splits on
energy_per_capita and energy_per_gdp that
could directly inform policy thresholds. Random Forest delivers the
highest predictive accuracy and the most reliable feature importance
ranking, making it the recommended model for operational
classification.
When the Random Forest was applied to the full dataset of 213 countries (aggregating each country’s dominant class across 1990–2024), all 213 countries were correctly assigned to their actual emission tier. It should be noted that this 213/213 figure reflects dominant-class matching at the country level — the correct measure of model performance remains the 97.6% row-level test accuracy.
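The difference between row-level accuracy and dominant-class matching can be seen on a small toy table. The helper name `dominant_class` and the fifteen toy rows are assumptions for illustration; the report's own aggregation code is not shown.

```r
# Hypothetical helper: most frequent class in a vector (ties go to the
# class that comes first in alphabetical order)
dominant_class <- function(x) names(which.max(table(x)))

# Toy country-year rows, not real predictions
toy_rows <- data.frame(
  country   = rep(c("A", "B", "C"), each = 5),
  actual    = c(rep("Low", 5),
                "Medium", "Medium", "Medium", "High", "High",
                rep("High", 5)),
  predicted = c("Low", "Low", "Low", "Low", "Medium",
                "Medium", "Medium", "High", "High", "Medium",
                rep("High", 5)),
  stringsAsFactors = FALSE
)

# Row level: every country-year counts separately
row_accuracy <- mean(toy_rows$actual == toy_rows$predicted)

# Country level: compare only the dominant class per country
actual_dom    <- tapply(toy_rows$actual, toy_rows$country, dominant_class)
predicted_dom <- tapply(toy_rows$predicted, toy_rows$country, dominant_class)
country_accuracy <- mean(actual_dom == predicted_dom)

c(row = row_accuracy, country = country_accuracy)
```

Here three of fifteen rows are wrong, yet every country's dominant class still matches, so the country-level figure is perfect while the row-level accuracy is 0.8. Majority voting absorbs scattered row-level errors in the same way in the full analysis.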
The geographic and economic patterns across tiers are clear:
Low emission countries (68 nations) are predominantly Sub-Saharan African (e.g., Burundi, Niger, Malawi) and South/Southeast Asian nations (e.g., Bangladesh, Nepal, Cambodia), characterised by low industrialisation, minimal fossil fuel infrastructure, and low per capita energy demand.
Medium emission countries (71 nations) span Latin America (e.g., Brazil, Mexico, Colombia), Eastern and South-Eastern Europe (e.g., Romania, Croatia, Turkey), and emerging economies in Asia, the Middle East, and North Africa (e.g., Vietnam, India, Egypt), reflecting transitional energy systems with growing but not yet high fossil fuel dependency.
High emission countries (74 nations, as listed in the country table above) are concentrated among Gulf states (Qatar, UAE, Kuwait, Saudi Arabia), Western industrialised economies (United States, Australia, Canada, Germany), and energy-intensive transitional economies (Russia, Kazakhstan, China), all characterised by high per capita energy consumption and deep fossil fuel dependence.
In answer to RQ2, countries can be accurately classified into
Low, Medium, and High emission intensity tiers using economic
and energy structure variables. The Random Forest classifier achieves
near-perfect performance (F1 = 0.976), with
energy_per_capita identified as the dominant discriminating
feature across all three models. The geographic distribution of emission
tiers aligns with known global development patterns — further validating
the classification framework. Together with the regression findings in
RQ1, these results confirm that energy consumption intensity and
fossil fuel dependence are the central axes along which national
emission profiles diverge.
This project analysed the Our World in Data CO2 and Greenhouse Gas Emissions dataset, covering 213 countries from 1990 to 2024 (7,442 country-year observations), to investigate the determinants of national CO2 emissions per capita and to assess whether countries can be meaningfully grouped by emission intensity. Two research questions were addressed: a regression analysis (RQ1) and a classification analysis (RQ2).
Of the 15 predictors included in the Ordinary Least Squares (OLS) model, 13 were statistically significant at the 5% level. The findings can be summarised as follows:
| Driver Type | Key Variables | Effect |
|---|---|---|
| Energy intensity | energy_per_capita, co2_per_unit_energy, energy_per_gdp | Strongest positive drivers |
| Fossil fuel dependence | flaring_co2, coal_co2, gas_co2 | Significant positive effect |
| Economic scale | gdp | Positive but modest per-unit effect |
| Population size | population | Negative; larger populations dilute per capita emissions |
| Sectoral substitution | methane, cement_co2, land_use_change_co2 | Negative; reflect alternative emission pathways |
The OLS model (test R-squared = 0.667) found energy_per_capita to be a significant positive predictor, and the Random Forest regression (test R-squared = 0.994) ranked it as the most important predictor. This indicates that the volume of energy consumed per person, together with the carbon intensity of that energy, is the primary determinant of CO2 emissions per capita.
Countries could be classified into emission tiers with high accuracy across all models considered:
| Model | F1 Score |
|---|---|
| Logistic Regression | 0.866 |
| Decision Tree | 0.875 |
| Random Forest | 0.976 |
The Random Forest classifier achieved 97.6% row-level accuracy, with energy_per_capita again emerging as the most discriminative feature by a substantial margin (Gini importance of 1735.89 compared with 335.79 for the next-ranked variable).
The resulting emission tiers correspond to recognisable real-world patterns:
Both research questions converge on the same structural conclusion: energy consumption intensity is the dominant driver of national CO2 emissions, more so than economic size, population, or any individual fossil fuel source considered in isolation. Countries that consume more energy per person, rely on carbon-intensive energy sources, and exhibit lower energy efficiency consistently fall within the high emission tier and drive the strongest regression coefficients. This conclusion is supported across four distinct modelling approaches: OLS regression, Random Forest regression, Decision Tree classification, and Random Forest classification.
These findings suggest that effective climate policy should prioritise the following:
| Limitation | Detail |
|---|---|
| Median imputation | 14,037 missing values were imputed, particularly in gas_co2 (42.8%) and coal_co2 (36.1%); imputed values may not reflect true country conditions |
| Winsorisation | Capping at IQR bounds preserves rows but reduces variance in extreme cases (e.g. Qatar, Kuwait) |
| Panel data structure | Country-year observations are not independent; time-series correlation within countries is not modelled |
| OLS multicollinearity | Several predictors exhibit high VIF values (coal_co2 = 35.69, methane = 37.67); individual coefficients should be interpreted with caution |
| Dominant class aggregation | A 213/213 country-level classification match reflects dominant class over 35 years rather than row-level accuracy; true row-level performance is 97.6% |
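The winsorisation limitation is easier to see with a concrete example. The sketch below caps values at the quartiles plus or minus a multiple of the IQR; the helper name `winsorise_iqr` and the 1.5 multiplier are assumptions, since the table states only that values were capped at IQR bounds.

```r
# Hypothetical helper: cap values outside [Q1 - k*IQR, Q3 + k*IQR].
# k = 1.5 is the conventional boxplot fence and is assumed here.
winsorise_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[[2]] - q[[1]])
  pmin(pmax(x, q[[1]] - fence), q[[2]] + fence)
}

# Toy vector with one extreme value, not real data
toy <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
capped <- winsorise_iqr(toy)
print(capped)
```

The extreme value is pulled down to the upper fence while the other nine values are unchanged. Every row is kept, but the spread among the highest emitters is compressed.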
This research could be extended in several directions:
Panel data modelling. Fixed-effects or random-effects panel regression models could be used to explicitly account for temporal dependence within countries, yielding more reliable estimates of country-specific effects over time.
Inclusion of renewable energy variables. The present analysis focuses primarily on energy consumption and fossil-fuel-related emission indicators. Future studies could incorporate renewable energy share, electricity generation mix, and energy transition indicators to better capture decarbonisation pathways.
Advanced machine learning models. Additional methods such as Gradient Boosting Machines, LightGBM, and XGBoost may improve predictive accuracy and offer alternative approaches to feature importance estimation.
Scenario-based forecasting. The analysis could be extended from prediction to forecasting, estimating the potential impact of changes in consumption patterns or fossil fuel dependence on future CO2 emissions under various policy scenarios.
Regional analysis. Countries could be categorised by region or income level to investigate whether the drivers of emissions differ across developed, developing, and emerging economies.
Collectively, these extensions would provide more detailed insight into the mechanisms driving emissions, supporting the development of more targeted climate policy recommendations.
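As an illustration of the panel data extension, a fixed-effects (within) estimator can be sketched in base R by demeaning each variable within country before fitting OLS. The data below are simulated, and the variable names, the true slope of 2, and the strength of the country effects are all assumptions for the example.

```r
set.seed(99)
n_country  <- 30
n_year     <- 20
country_id <- rep(seq_len(n_country), each = n_year)

# Unobserved country effect that is correlated with the predictor
alpha <- rnorm(n_country, sd = 3)
x <- 0.5 * alpha[country_id] + rnorm(n_country * n_year)
y <- alpha[country_id] + 2 * x + rnorm(n_country * n_year, sd = 0.5)

# Pooled OLS ignores the country effect and is biased here
pooled_coef <- coef(lm(y ~ x))[["x"]]

# Within transformation: subtract each country's mean, then fit without
# an intercept. This removes anything constant within a country.
x_w <- x - ave(x, country_id)
y_w <- y - ave(y, country_id)
fe_coef <- coef(lm(y_w ~ x_w - 1))[["x_w"]]

round(c(pooled = pooled_coef, fixed_effects = fe_coef), 3)
```

The within estimate recovers a slope close to the true value, while the pooled estimate is pulled upward by the omitted country effect. Standard errors from this shortcut are not adjusted for the lost degrees of freedom, so a dedicated panel package would be preferable in practice.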
This study addressed both research questions through regression and classification approaches applied to the Our World in Data CO2 dataset spanning 1990 to 2024. The results consistently indicate that energy efficiency and energy consumption intensity are the most significant determinants of national CO2 emissions per capita. The superior performance of the Random Forest models in both the regression and classification tasks demonstrates the value of machine learning approaches in analysing complex environmental datasets. These findings suggest that meaningful reductions in carbon intensity will require not only economic transformation but also the implementation of more efficient and cleaner energy systems worldwide.