Global Agriculture Greenhouse Gas Emissions Analysis

#Introduction

This is a dataset from Kaggle, which details greenhouse gas global emissions from agricultural land, sourced from Food and Agriculture Organization (FAO) of the United Nations. The dataset includes emission type, such as methane, carbon dioxide or nitrous oxide, by country from 1990 to 2021. https://www.kaggle.com/datasets/muhammadaammartufail/global-greenhouse-gas-emissions-from-agriculture

Nitrous oxide (N2O) and methane (CH4) are under intense scrutiny and of significant concern beyond the typically focused carbon dioxide. Both N2O has a warming power of ~300x more than CO2, and CH4 has a warming power of ~80x more than CO2, and N2O is the most significantly emitted GHG that damages the ozone layer. We aim to the see the trends and behavior associated with these GHGs to better understand their distribution across the years and from which countries.

The number of observations is 22,216, with 17 column variables. The response variable of interest will be Value, which is the GHG amount emitted in kilotons in a particular year for one of the three gases.

Number of rows: 22216

Number of columns: 17

#Pre-Processing

Reading in the dataset, and determining if the multiple columns seen contain unique observations necessary for analysis. Of the 17 variables, only 4 are of interest.

Only missing values in original dataset was an empty column called “Note” which there wasn’t any clarification from the Kaggle description what its purpose was. For this reason, this column was removed in the refined dataset “ghg.agri” created. Other than this, there were no missing values found in the columns of interest, like Area (country), Element (emission type), Year, and Value (in kilotons). These four columns were retained for analysis, and the rest of the redundant column variables were removed in the refined dataset ghg.agri.

Discovered that 1,061 observations had an emission value of 0 kilotons.

To deal with these 1,061 observations that will create an issue with EDA and modeling, a new variable will be created transforming these values and appending it to the dataset. Created a logVal variable that adds 1 to emission value Val and takes the logarithm. This helps to compensate the observations with a value of 0 emission, so that it can still be interpreted to 0 for further analyses: log(0+1)=log(1)=0. Set the Area and Element variables as factor so that the further analyses are done properly. Created an Emission class variable that groups emission values into either Low, Medium, or High categories based on the following percentiles: Low emissions were at the 33rd percentile or below, Medium were between 33rd and 66th percentiles, and High emissions were above 66th percentile.

Final refined dataset ghg.agri contains 6 variables.

#Exploratory Data Analysis

The Summary Statistics are as follows:

Checking for class imbalance in the Emission Class variable created. Based on output, there is essentially negligible class imbalance.

Var1	Freq
Low	7331
Medium	7331
High	7554

Summary statistics provided for the frequency counts for Area, Element, and Emission Class, and the continuous variables like Value and Year.

Element	Mean	Median	SD	Max
Emissions (CH4)	659.44461	82.37325	2136.1417	20608.351
Emissions (CO2)	24734.00627	1687.80580	109310.8790	1660372.882
Emissions (N2O)	35.72804	4.84735	122.3693	1229.867

Created a 50-bin histogram for the distribution of the Log Emissions. This shows an extreme right-skew, with a small number of high emission countries that stretch the distribution. The logVal usage helped to stabilize variance, and after transformation makes it suitable for regression later.

A scatterplot of the mean Emissions shows that the baseline for mean CO2 emissions in kilotons is magnitudes higher than both CH4 and N2O mean emissions who are much lower. To better visualize the compressed element mean emissions, separate scatterplots were created. Scatterplot for mean CO2 shows an overall inversely proportional trend; in 31 years, mean CO2 emissions have gone down overall from an initial ~30,000 kt down to ~21,250 kt. Scatterplot for mean CH4 shows an overall increasing trend; in 31 years, mean CH4 emissions significantly dipped from 700+ kt in the year ~1993, and by 2021 back up to ~680 kt. Scatterplot for mean N2O shows an overall increasing trend; in 31 years, mean N2O emissions significantly dipped from ~35 kt in the year ~1993, and by 2021 back up to ~38.75 kt.

#Unsupervised Learning: Correlation Plot, Eigenvalue Scree Plot, PCA Individuals Biplot, PCA Variable Cos2 plot, Silhouette Plot, and K-Means Cluster Plot.

Correlation plot shows high correlation coefficient between N2O and CH4 emissions of 0.9350107, while lower correlation coefficients when either is paired with CO2, for example coefficient of 0.6173995 for CH4 and CO2, and lowest coefficient of 0.5182584 for N2O and CO2.

PCA was used to reduce the dimensionality of the emission variables and identify dominant patterns among countries. PC1 explained ~79.9% of total variance, meaning most (~80%) of the variation in emissions is captured by the first component. PC2 explained ~18.2%. Together PC1 and PC2 explained ~98.1% of total variance, so the 2D PCA plot represents the data very well. This is confirmed by the Scree/Elbow plot visually shows PC2 (2 Dimensions) covers 98.1% of total variance in emissions.

Each point represents a country (Area). Arrows represent emission variables. Longer arrows indicate stronger contribution to variability. Countries farther in the direction of the variable arrow tend to have higher values of that emission type. CO2 points in a different direction from CH4/N2O, which suggests that CO2 has a different variation pattern. Overwhelmingly most of the countries cluster near the origin, while a few countries appear as strong outliers with very high emissions.

In our PCA variable Contribution Cos2 plot, we now see that N2O x CH4 emissions have a very strong positive correlation, because their angle is the smallest relative to the other variable pairing angles. Conversely, the (N2O or CH4) x CO2 emission angles have a moderate or weakly positive correlation, because their angles are <180 degrees but >90 degrees. Neither of the (N2O or CH4) x CO2 variable pair angle are smaller than the NO2 x CH4 variable pair angle. This is also confirmed by the earlier PCA Individuals biplot, as we saw the Emissions N2O and CH4 arrows pointing in nearly the same direction with a small angle between them, while CO2 points in a different direction away from them. Also confirmed with the correlation plot.

	Emissions (CH4)	Emissions (CO2)	Emissions (N2O)
Emissions (CH4)	1.0000000	0.6173995	0.9350107
Emissions (CO2)	0.6173995	1.0000000	0.5182584
Emissions (N2O)	0.9350107	0.5182584	1.0000000

For our PCA Biplot, each point represents a country (Area). Arrows represent emission variables. Longer arrows indicate stronger contribution to variability. Countries farther in the direction of the variable arrow tend to have higher values of that emission type. CO2 points in a different direction from CH4/N2O, which suggests that CO2 has a different variation pattern. Overwhelmingly most of the countries cluster near the origin, while a few countries appear as strong outliers with very high emissions.

Importance of components:
                          PC1    PC2     PC3
Standard deviation     1.5483 0.7389 0.23832
Proportion of Variance 0.7991 0.1820 0.01893
Cumulative Proportion  0.7991 0.9811 1.00000

Silhouette plot shows that the optimal number of clusters is 2, because k=2 shows the highest average silhouette width, close to ~1.00.

K-means clustering grouped countries based on similarity of their overall GHG emission profiles (CO2, CH4, and N2O). Most countries fell into Cluster 1, which indicates relatively similar and moderate emission structures. Cluster 2 contains countries with more distinct high emission patterns for CO2, CH4, and N2O. The clustering suggests that high-emission countries differ greatly from the majority in their GHG emission composition and magnitude. These countries were identified as Brazil, China, China Mainland, India, Indonesia, USA, and USSR. The separation along PCA dimensions indicates that emission variables meaningfully distinguish groups of countries.

Area	Emissions (CH4)	Emissions (CO2)	Emissions (N2O)	Cluster
Brazil	12375.756	1241579.0	482.4881	2
China	14313.258	229795.7	1091.3595	2
China, mainland	14198.080	225786.5	1084.6738	2
India	18881.286	125400.2	626.7253	2
Indonesia	5824.493	776182.9	163.4007	2
United States of America	7973.988	153089.7	579.6665	2
USSR	11888.653	460221.0	664.4908	2

#Supervised Learning 1-A: Linear OLS Model

With a linear OLS model, we see an RMSE (root-mean squared error) of 52298.9 kilotons of emission, and a coefficient of determination R-squared = 0.3280057. This means a raw linear regression model modelling Value~ Area+Element+Year explains only 32.8% of the emission variance. The high RMSE and low R-squared suggest poor prediction accuracy, which is likely maimed by the extreme right-skew we observed in the raw distribution of emission values.

Linear Regression 

22216 samples
    3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 19995, 19996, 19994, 19995, 19995, 19994, ... 
Resampling results:

  RMSE     Rsquared   MAE     
  52298.9  0.3280057  15601.45

Tuning parameter 'intercept' was held constant at a value of TRUE

#Supervised Learning 2: Log-Transformed Linear OLS Model

With a log-transformed emission values linear OLS model, the RMSE was 1.079516 on the log scale, and an R-squared of 0.8976524. This indicates that when log-transforming the raw data the linear regression model modelling log(Value+1)~ Area+Element+Year explains 89.76% of the transformed emission variance. The higher R-squared suggests improved prediction accuracy, because this stabilized the extreme right-skew, outliers, and emission values of 0.

Linear Regression 

22216 samples
    3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 19993, 19994, 19994, 19996, 19993, 19995, ... 
Resampling results:

  RMSE      Rsquared   MAE      
  1.080713  0.8975028  0.8332782

Tuning parameter 'intercept' was held constant at a value of TRUE

#Supervised Learning 3: Regression Tree Model

The final model produced an RMSE of 31994.05 and R-squared of 0.7304366.

So far, comparing the R-squared values: 0.8976524 (log-transformed linear OLS) > 0.7304366 (regression tree) > 0.3280057 (linear OLS)

Amongst R-squared values, the log-transformed linear OLS performed the best, while regression tree performed the second-best, and linear OLS performed the worst. Log-transformed linear OLS successfully explains 89.8% of transformed emission value variance, and the regression tree, not far behind, explains 73.0% of emission value variance. Linear OLS performed much worse than both of them explaining only 32.8% of variance.

However, since RMSE for log-transformed Value cannot be appropriately compared with the other two: 31994.05 (regression tree) < 52298.9 (linear OLS)

Therefore, Regression tree performs better than the linear OLS in both explaining emission value variance and reducing prediction errors (improving accuracy).

In addition, the regression tree identified Brazil, Indonesia, and CO2 emissions as major splitting variables. This means that the regression tree captures nonlinear structure and interactions better than raw linear regression.

CART 

22216 samples
    3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 19996, 19994, 19994, 19995, 19994, 19994, ... 
Resampling results across tuning parameters:

  cp          RMSE      Rsquared   MAE      
  0.04681952  31865.82  0.7363985   9812.115
  0.10360289  40000.38  0.5850222  10805.410
  0.26604973  53695.36  0.4387401  12700.809

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.04681952.

#Supervised Learning 4: Classification Tree Model by Emisson Class Created a classification tree with the created Emission Class variable, providing 3 classes of low, medium, or high in emissions. There are 3 terminal nodes. The root node confirms the class probabilities for low, medium, and high emissions of .33, .33, and .34 were based on the 33rd and 66th percentiles. The first split is based on whether the emission type is N2O, second split is based on whether emission type is CO2. 55% of NO2 values are in the Low emission class therefore N2O mainly leans Low. This is confirmed by the NO2 EDA, showed a mean value of 35.72804 kt. 66% of CO2 values are in the High emission class therefore CO2 strongly leans High. This is confirmed by the CO2 EDA, showed a mean value of 24734.00627 kt. CH4 emissions are approximately balanced between emission classes, compared to N2O and CO2. This is confirmed by the CH4 EDA, showed a mean value of 659.44461 kt, somewhere in between the other two mean values.

The final model’s accuracy statistic was 0.5723304, with a kappa statistic of 0.3585575. This means that about 57.2% of emission classes in the dataset were correctly predicted in their respective class based on Year, Area (Country), and Element (Emission Type). Kappa statistic is lower than test accuracy, because it adjusts for agreement expected by chance (E), while accuracy does not. Although the class distribution was nearly balanced based on the EDA (High = 7554, Low = 7331, Medium = 7331), some correct classifications would still occur by chance alone. A Kappa value of ~0.358 suggests the models provides fair classification performance beyond chance agreement.

CART 

22216 samples
    3 predictor
    3 classes: 'Low', 'Medium', 'High' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 19994, 19995, 19995, 19994, 19995, 19994, ... 
Resampling results across tuning parameters:

  cp           Accuracy   Kappa    
  0.002182513  0.5730545  0.3596451
  0.007161370  0.5170158  0.2744235
  0.266130132  0.4245437  0.1299452

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.002182513.

#Conclusion

Looking over the gross scope of results, we saw in EDA the high right-skew of distribution of raw emissions values, due to only a handful countries having incredibly high emissions, stretching the distribution to the right, which was investigated later. A grouped scatterplot for emission types alludes to the fact that mean CO2 emissions are much higher than mean CH4 and N2O emissions, respectively. We observed that in 31 years, mean CO2 emissions have been decreasing steadily, while mean CH4 and N2O have been increasing steadily. In unsupervised learning, using a Cos2 plot, correlation plot, and Individuals Biplot, we uncovered that CH4 emissions and N2O emissions were highly correlated with each other, much more than either were with CO2. In addition, unsupervised learning revealed that 2 dimensions of principal component analysis was enough to explain 98.1% of total variance, confirmed by an eigenvalue scree plot. Furthermore, a Silhouette plot showed an optimal number of clusters k=2, and from that a k-means cluster plot revealed a huge disparity: a majority of Areas were in cluster 1, while only 7 Areas resided in cluster 2. These countries/Areas were placed in cluster 2 for having the same pattern of very GHG emissions, which were Brazil, China, China Mainland, India, Indonesia, USA, and USSR. In supervised learning, four models were developed. The worst-performing regression model of them was the raw emission value OLS linear regression which modeled emission value as a function of area, year, and element. This linear OLS regression had the lowest of the regression R-squared values, and RMSE values. The best performing regression model was the log-transformed raw emission value OLS linear regression, with the highest of the regression R-squared values, though has an nonequivalent RMSE value. The second best performing regression model was the regression tree, having the second highest R-squared value and the lower RMSE value (relative to raw linear OLS). Also, the regression tree identified Brazil, Indonesia, and CO2 emissions as major splitting variables, which captured nonlinear structure and interactions better than raw linear regression. Finally, the classification tree used the Emission class based on percentiles provided an accuracy statistic of 0.5723304, with a kappa statistic of 0.3585575. This means the classification tree model correctly predicts the emission class 57.2% of the time overall.

Time permitting, and with more domain knowledge, potential future work would look into how the emission types are sourced agriculturally, providing a new variable column (Source). For example, the findings during unsupervised learning suggested that N2O and CH4 emissions are highly correlated, relative to CO2 separately. While CO2 can be emitted agriculturally, it tends to not be the same way N2O and CH4 are. N2O and CH4 both directly come from soil processing, manure processing, and livestock diet. Both livestock and soil go hand in hand. On the other hand, CO2 gets emitted agriculturally from deforestation or land transformation. It would be interesting to see how breaking down the agricultural emissions by source would yield new findings. The findings during unsupervised learning that Brazil had the highest CO2 emissions (>1 million kilotons) makes sense considering CO2 emissions mainly are from deforestation and land transformation, and huge areas of the Amazon rain forest and indigenous lands are under threat of worsening GHG emission. In the EDA we examined the scatterplots of mean emission gas over time. Therefore time-series forecasting models could additionally be applied to predict future GHG emissions by country and emission type. Potential future work would expand the dataset to possibly include mean global sea temperature or mean global sea pH (CO2 by equilibrium dissolves into seawater as carbonic acid, making the ocean more acidic), livestock population, and population growth.

Global Agriculture Greenhouse Gas Emissions Analysis

Armin Mahdavi

2026-05-13