Introduction:

This project analyzes global CO₂ emissions to understand how population growth and temperature changes influence emission patterns.

Goal:

To identify key relationships, classify emission levels, and group countries using regression, KNN, ANOVA, clustering, and association rules.

[1]: Inspecting the Structure and Getting an Overview

library(readr)

## Warning: package 'readr' was built under R version 4.5.2

# Loading my dataset
C02_data <- read_csv("C:/Users/bzimb/Desktop/DA Proj/C02_data.csv")

## Rows: 4369 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (5): year, population, temperature_change_from_co2, consumption_co2, co2
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(C02_data)


head(C02_data)

## # A tibble: 6 × 6
##    year population temperature_change_from_co2 consumption_co2   co2 country
##   <dbl>      <dbl>                       <dbl>           <dbl> <dbl> <chr>  
## 1  1990  643775320                       0.042            583.  658. Africa 
## 2  1991  661103806                       0.043            607.  688. Africa 
## 3  1992  678558076                       0.044            620.  667. Africa 
## 4  1993  696621727                       0.044            643.  706. Africa 
## 5  1994  714502658                       0.045            716.  790. Africa 
## 6  1995  732695476                       0.046            752.  845. Africa

summary(C02_data)

##       year        population        temperature_change_from_co2
##  Min.   :1990   Min.   :2.553e+05   Min.   :0.00100            
##  1st Qu.:1998   1st Qu.:5.398e+06   1st Qu.:0.00100            
##  Median :2006   Median :1.410e+07   Median :0.00300            
##  Mean   :2006   Mean   :2.145e+08   Mean   :0.03517            
##  3rd Qu.:2014   3rd Qu.:5.819e+07   3rd Qu.:0.01100            
##  Max.   :2023   Max.   :8.092e+09   Max.   :1.16100            
##                                     NA's   :843                
##  consumption_co2          co2              country         
##  Min.   :    0.084   Min.   :    0.455   Length:4369       
##  1st Qu.:    9.940   1st Qu.:    7.958   Class :character  
##  Median :   49.935   Median :   45.034   Mode  :character  
##  Mean   :  948.098   Mean   :  955.180                     
##  3rd Qu.:  254.129   3rd Qu.:  270.715                     
##  Max.   :37791.566   Max.   :37791.570                     
##

str(C02_data)

## spc_tbl_ [4,369 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ year                       : num [1:4369] 1990 1991 1992 1993 1994 ...
##  $ population                 : num [1:4369] 6.44e+08 6.61e+08 6.79e+08 6.97e+08 7.15e+08 ...
##  $ temperature_change_from_co2: num [1:4369] 0.042 0.043 0.044 0.044 0.045 0.046 0.047 0.048 0.049 0.05 ...
##  $ consumption_co2            : num [1:4369] 583 607 620 643 716 ...
##  $ co2                        : num [1:4369] 658 688 667 706 790 ...
##  $ country                    : chr [1:4369] "Africa" "Africa" "Africa" "Africa" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   year = col_double(),
##   ..   population = col_double(),
##   ..   temperature_change_from_co2 = col_double(),
##   ..   consumption_co2 = col_double(),
##   ..   co2 = col_double(),
##   ..   country = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

colSums(is.na(C02_data))

##                        year                  population 
##                           0                           0 
## temperature_change_from_co2             consumption_co2 
##                         843                           0 
##                         co2                     country 
##                           0                           0

[1]:

- The dataset contains 4,369 observations and 6 variables: five numeric variables (year, population, temperature_change_from_co2, consumption_co2, co2) and one categorical variable (country).

- Missing values are present only in temperature_change_from_co2 (843 NAs); all other variables are complete.

- Summary statistics show wide variation: population ranges from ~255,000 to ~8 billion, co2 ranges from 0.455 to 37,791, and temperature_change_from_co2 sits on a small numeric scale (median 0.003) with a long right tail (maximum 1.161).

- All numeric variables are of type double and country is character, making the dataset ready for preprocessing and further analysis.

[2] Handling the Missing Values

C02_data$temperature_change_from_co2[is.na(C02_data$temperature_change_from_co2)] <- median(C02_data$temperature_change_from_co2, na.rm = TRUE)

colSums(is.na(C02_data))

##                        year                  population 
##                           0                           0 
## temperature_change_from_co2             consumption_co2 
##                           0                           0 
##                         co2                     country 
##                           0                           0

[2]

- Missing values in ‘temperature_change_from_co2’ (843 NAs) were replaced with that column's median (0.003).

- After imputation, all variables have complete data with no missing values, making the dataset ready for further analysis.
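The imputation step can be illustrated on a small made-up vector. This is only a sketch: the values are invented, and `was_imputed` is a hypothetical helper flag that the report itself does not create. Keeping such a flag makes it possible to check later whether results change when the imputed rows are excluded.

```r
# Toy vector standing in for temperature_change_from_co2 (values are made up)
temp <- c(0.001, 0.003, NA, 0.011, NA, 0.042)

# Record which entries were missing before overwriting them
was_imputed <- is.na(temp)

# Replace the NAs with the median of the observed values
temp[was_imputed] <- median(temp, na.rm = TRUE)

print(temp)
print(was_imputed)
```

Because 843 of 4,369 rows receive the same value, a flag like this is a cheap way to keep the imputed rows identifiable in later steps.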

[3] Checking for outliers and Relationships

CO2 VS Population Relationship

library(ggplot2)

# Scatterplot CO2 vs Population
ggplot(C02_data, aes(x = population, y = co2)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "blue") +
  ggtitle("CO2 vs Population") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

[3 - i]

There is a clear positive linear relationship between population and CO2 emissions.

The upward slope of the blue regression line indicates that as population increases, CO2 emissions tend to increase as well.

Temp VS CO2 Relationship

# CO2 vs Temperature Change from CO2

ggplot(C02_data, aes(x = temperature_change_from_co2, y = co2)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") +
  ggtitle("CO2 vs Temperature Change from CO2") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

[3 - ii]

The plot reveals a very strong, positive, and predominantly linear relationship between temperature_change_from_co2 and co2.

This indicates that as total CO2 emissions increase, the temperature change attributed to CO2 increases in a highly predictable, linear fashion.

Correlation Plot

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.5.2

## corrplot 0.95 loaded

# Select only the CO2, population, and temperature change columns
subset_numeric <- C02_data[, c("co2", "population", "temperature_change_from_co2")]

# Getting the correlation matrix
cor_matrix_subset <- cor(subset_numeric, use = "complete.obs")

# Plotting correlation matrix
corrplot(cor_matrix_subset, method = "color", type = "upper", 
         addCoef.col = "black", tl.cex = 0.8, number.cex = 0.8,
         title = "Correlation Matrix: CO2, Population, Temperature Changes", mar=c(0,0,1,0))

[3 - iii]

- CO2 is strongly positively correlated with population (0.90) and temperature_change_from_co2 (0.97), indicating that CO2 emissions rise as population and temperature change increase.

- Population is also strongly positively correlated with temperature_change_from_co2 (0.85), showing that larger populations are associated with higher temperature impacts.

[4]: Regression Analysis

CO2 ~ Population

# Performing Linear regression: CO2 as a function of population

lm_co2_pop <- lm(co2 ~ population, data = C02_data)

 
summary(lm_co2_pop)

## 
## Call:
## lm(formula = co2 ~ population, data = C02_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8841.6  -121.6   -73.0   -29.3 11505.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.080e+01  2.379e+01   2.976  0.00294 ** 
## population  4.122e-06  2.944e-08 140.001  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1516 on 4367 degrees of freedom
## Multiple R-squared:  0.8178, Adjusted R-squared:  0.8178 
## F-statistic: 1.96e+04 on 1 and 4367 DF,  p-value: < 2.2e-16

[4 - i]

- R-squared (0.818): About 81.8% of the variance in CO2 emissions is explained by population alone.

- Significance: The population coefficient is highly significant (p-value < 2e-16), so the linear association is very unlikely to be due to chance.

CO2 ~ Temperature Change from CO2

# Performing Linear regression: CO2 as a function of temperature_change_from_co2
lm_co2_temp <- lm(co2 ~ temperature_change_from_co2, data = C02_data)

# View summary
summary(lm_co2_temp)

## 
## Call:
## lm(formula = co2 ~ temperature_change_from_co2, data = C02_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6781.3   -67.3    -4.4    28.9  9465.4 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -29.42      14.10  -2.087    0.037 *  
## temperature_change_from_co2 33995.57     134.04 253.622   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 895.6 on 4367 degrees of freedom
## Multiple R-squared:  0.9364, Adjusted R-squared:  0.9364 
## F-statistic: 6.432e+04 on 1 and 4367 DF,  p-value: < 2.2e-16

[4 - ii]

- R-squared (0.936): 93.6% of CO2 variation is explained by temperature change, a very strong relationship.

- Significance: The model is highly significant (p-value < 2.2e-16); together with the high R-squared, this points to a highly predictive model.

[4 - iii] Both population and temperature change significantly explain carbon emissions.
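With a single predictor, R-squared is the square of the Pearson correlation, which links these regression results to the correlations of 0.90 and 0.97 reported in [3 - iii]. The sketch below checks that identity on simulated data (the variables `x` and `y` are invented for illustration, not taken from C02_data) and then squares the rounded correlations.

```r
# Simulated data (not the project's dataset)
set.seed(1)
x <- runif(200, 0, 10)
y <- 3 + 2 * x + rnorm(200, sd = 4)

fit <- lm(y ~ x)
r <- cor(x, y)
r_squared <- summary(fit)$r.squared

# In simple linear regression, R-squared equals the squared Pearson correlation
print(c(r = r, r_sq_from_cor = r^2, r_sq_from_lm = r_squared))

# The same arithmetic on the rounded correlations from the correlation plot
print(round(c(population = 0.90^2, temperature = 0.97^2), 3))
```

The squared rounded correlations come out close to the reported R-squared values of 0.818 and 0.936; the small gaps are what one would expect from rounding the correlations to two decimals.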

KNN (K-Nearest Neighbors)

KNN (Predicting CO₂ Emissions)

library(class)

## Warning: package 'class' was built under R version 4.5.2

# 1. Creating categorical CO2 levels (Low, Medium, High)
C02_data$CO2_Level <- cut(
  C02_data$co2,
  breaks = quantile(C02_data$co2, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
  labels = c("Low", "Medium", "High"),
  include.lowest = TRUE
)

# Checking class distribution
table(C02_data$CO2_Level)

## 
##    Low Medium   High 
##   1442   1441   1486

# Remove rows with any missing values
C02_data <- na.omit(C02_data)

# 2. Selecting and scaling numeric features
features <- C02_data[, c("population", "temperature_change_from_co2")]
features_scaled <- scale(features)

# 3. Defining target variable
target <- C02_data$CO2_Level

# 4. Randomize and split data (3,500 rows for training, the rest for testing)
set.seed(123)
rand_index <- sample(1:nrow(C02_data))
train_index <- rand_index[1:3500]
test_index  <- rand_index[3501:nrow(C02_data)]

train_x <- features_scaled[train_index, ]
test_x  <- features_scaled[test_index, ]
train_y <- target[train_index]
test_y  <- target[test_index]

# 5. KNN with (k = 5)
pred_knn <- knn(train = train_x, test = test_x, cl = train_y, k = 5)

# 6. Actual vs predicted
comparison <- data.frame(Actual = test_y, Predicted = pred_knn)
print(head(comparison))

##   Actual Predicted
## 1    Low    Medium
## 2   High      High
## 3 Medium    Medium
## 4    Low       Low
## 5    Low       Low
## 6   High      High

# 7. Evaluating model accuracy
accuracy <- mean(pred_knn == test_y)
cat("KNN Model Accuracy:", round(accuracy * 100, 2), "%\n")

## KNN Model Accuracy: 78.14 %

The KNN algorithm was applied to classify CO₂ emission levels into three categories (Low, Medium, and High), with Population and Temperature Change from CO₂ as the features (predictors).

Before modeling, all features were scaled to ensure equal contribution during distance calculations.

The dataset was randomly split into training and testing subsets: 3,500 rows for training and the remaining 869 for testing (roughly 80/20).

The model output compared the actual and predicted CO₂ categories for several test points.

Most predictions matched the true classes, indicating that the model effectively captured the relationships between population, temperature change, and CO₂ emissions.

Model accuracy on the test set was 78.14%.
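A single accuracy figure does not show which classes are being confused. A confusion matrix built with base R's `table()` gives that breakdown. The sketch below uses made-up labels (not the model's real predictions) to show how overall accuracy and per-class recall fall out of the matrix.

```r
# Made-up actual and predicted labels (not the model's real output)
lv <- c("Low", "Medium", "High")
actual <- factor(
  c("Low", "Low", "Medium", "High", "High", "Medium", "Low", "High"),
  levels = lv
)
predicted <- factor(
  c("Low", "Medium", "Medium", "High", "High", "High", "Low", "High"),
  levels = lv
)

# Rows are actual classes, columns are predicted classes
conf_mat <- table(Actual = actual, Predicted = predicted)
print(conf_mat)

# Overall accuracy: share of cases on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Per-class recall: correct predictions divided by the actual count per class
recall <- diag(conf_mat) / rowSums(conf_mat)
print(accuracy)
print(recall)
```

With the objects from the KNN code above, the equivalent call would be `table(Actual = test_y, Predicted = pred_knn)`.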

ANOVA

One-Way ANOVA — CO₂ by Country

# One-Way ANOVA: CO2 by Country
anova_result <- aov(co2 ~ country, data = C02_data)

# Summary of the ANOVA model
summary(anova_result)

##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## country      132 5.223e+10 395683943   583.7 <2e-16 ***
## Residuals   4236 2.872e+09    677935                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I performed a one-way ANOVA to examine whether mean CO₂ emissions differ significantly across countries.

The ANOVA shows a statistically significant difference in mean CO₂ emissions across countries.

F(132, 4236) = 583.7, p < 2e-16

CO₂ emission levels vary widely between countries, suggesting strong country-level effects.
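The size of the country effect can be quantified with eta squared, the share of total variation attributable to the grouping factor. The sketch below recomputes it from the sums of squares printed in the ANOVA table; the variable names are illustrative only.

```r
# Sums of squares copied from the ANOVA table above
ss_country   <- 5.223e10
ss_residuals <- 2.872e9

# Eta squared: share of total variation in co2 attributable to country
eta_sq <- ss_country / (ss_country + ss_residuals)
print(round(eta_sq, 3))
```

On these figures, roughly 95% of the variation in co2 lies between countries rather than within them.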

Clustering (K-Means)

Determine Optimal Number of Clusters (Elbow Method)

# numeric variables
cluster_data <- C02_data[, c("co2", "population", "temperature_change_from_co2")]

# Scaling
cluster_scaled <- scale(cluster_data)

# Computing total within-cluster sum of squares for k = 1 to 10
set.seed(123)
wss <- numeric(10)
for (k in 1:10) {
  kmeans_model <- kmeans(cluster_scaled, centers = k, nstart = 25)
  wss[k] <- kmeans_model$tot.withinss
}

# Plot the WSS curve for the Elbow Method
plot(1:10, wss, type = "b",
     pch = 19, frame = FALSE,
     xlab = "Number of Clusters (k)",
     ylab = "Total Within-Cluster Sum of Squares (WSS)",
     main = "Elbow Method for Determining Optimal k")

The “elbow” of the curve is located at k = 3. This is the point after which adding more clusters yields diminishing returns. Therefore, the Elbow Method suggests that 3 is the optimal number of clusters for segmenting this dataset based on co2, population, and temperature_change_from_co2.
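Reading an elbow off a plot is somewhat subjective, so it can help to look at the percentage drop in WSS for each added cluster. The following is a minimal sketch on simulated data with three well-separated groups (not the project's dataset); `toy` and `pct_drop` are names introduced here for illustration.

```r
# Simulated data with three well-separated groups (not the project's dataset)
set.seed(123)
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

# Total within-cluster sum of squares for k = 1 to 6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Percentage reduction in WSS gained by each additional cluster
pct_drop <- -diff(wss) / wss[-length(wss)] * 100
print(round(pct_drop, 1))
```

The elbow is the value of k after which these percentages become small. Applied to the `wss` vector computed above, the same two lines would put numbers behind the choice of k = 3.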

set.seed(123)
C02_scaled <- scale(C02_data[, c("co2", "population", "temperature_change_from_co2")])
kmeans_result <- kmeans(C02_scaled, centers = 3, nstart = 25)

C02_data$Cluster <- as.factor(kmeans_result$cluster)


# Visualizing clusters
library(ggplot2)

ggplot(C02_data, aes(x = population, y = co2, color = Cluster)) +
  geom_point(alpha = 0.6, size = 2) +
  scale_color_manual(values = c("#1f77b4", "#ff7f0e", "#2ca02c")) +
  labs(title = "CO2 Clusters Based on Population and Temperature Change",
       x = "Population",
       y = "CO2 Emissions",
       color = "Cluster") +
  theme_minimal()

Cluster 1 (Blue – Low Emitters):

This cluster represents countries with low population and low CO₂ emissions, generally located at the bottom-left region of the plot.

It mostly includes smaller or less industrialized nations with minimal carbon contribution.

Cluster 2 (Green – Medium to High Emitters):

This cluster contains countries with moderate to high population sizes and CO₂ emissions.

It represents the largest and most diverse group, capturing most nations and regions with industrial activity and noticeable carbon output.

Cluster 3 (Orange – Highest Emitters / Global Totals):

This small, distinct group includes entities with extremely high population and CO₂ levels.

It likely corresponds to the aggregated ‘World’ or continental data, where emissions are at their global peak.

These points are clearly separated from the rest, indicating a unique emission pattern at the global scale.
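Cluster numbers from `kmeans()` are arbitrary labels, so descriptions like these are easiest to verify from cluster sizes and per-cluster means in the original units. The sketch below shows one way to tabulate both on simulated data (the `toy` data frame is invented; it is not C02_data).

```r
# Simulated stand-in for the clustering input (not the project's data)
set.seed(123)
toy <- data.frame(
  population = c(rnorm(40, 1e7, 2e6), rnorm(15, 2e8, 3e7), rnorm(5, 5e9, 5e8)),
  co2        = c(rnorm(40, 50, 10), rnorm(15, 900, 100), rnorm(5, 25000, 2000))
)

# Cluster on scaled values, as in the analysis above
km <- kmeans(scale(toy), centers = 3, nstart = 25)
toy$Cluster <- factor(km$cluster)

# Number of observations in each cluster
print(table(toy$Cluster))

# Mean of each variable per cluster, in the original (unscaled) units
print(aggregate(cbind(population, co2) ~ Cluster, data = toy, FUN = mean))
```

When `scale_color_manual()` is given unnamed `values`, ggplot2 assigns them in factor-level order, so a table like this is also a quick way to confirm which color belongs to which cluster number.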

Market Basket / Association Rule Mining

# Step 5: Association Rule Mining
library(arules); library(arulesViz)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

assoc_data <- C02_data[, c("co2", "population", "temperature_change_from_co2")]
assoc_data[] <- lapply(assoc_data, function(x) cut(x, 3, labels = c("Low", "Med", "High")))

rules <- apriori(as(assoc_data, "transactions"), parameter = list(supp=0.05, conf=0.8, minlen=2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5    0.05      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 218 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[9 item(s), 4369 transaction(s)] done [0.00s].
## sorting and recoding items ... [3 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [9 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

inspect(head(sort(rules, by="lift"), 10))

##     lhs                                  rhs                                 support confidence  coverage     lift count
## [1] {population=Low,                                                                                                    
##      temperature_change_from_co2=Low} => {co2=Low}                         0.9711604  0.9990582 0.9720760 1.021743  4243
## [2] {co2=Low}                         => {temperature_change_from_co2=Low} 0.9777981  1.0000000 0.9777981 1.015574  4272
## [3] {co2=Low,                                                                                                           
##      population=Low}                  => {temperature_change_from_co2=Low} 0.9711604  1.0000000 0.9711604 1.015574  4243
## [4] {temperature_change_from_co2=Low} => {co2=Low}                         0.9777981  0.9930265 0.9846647 1.015574  4272
## [5] {co2=Low}                         => {population=Low}                  0.9711604  0.9932116 0.9777981 1.013865  4243
## [6] {population=Low}                  => {co2=Low}                         0.9711604  0.9913551 0.9796292 1.013865  4243
## [7] {co2=Low,                                                                                                           
##      temperature_change_from_co2=Low} => {population=Low}                  0.9711604  0.9932116 0.9777981 1.013865  4243
## [8] {population=Low}                  => {temperature_change_from_co2=Low} 0.9720760  0.9922897 0.9796292 1.007744  4247
## [9] {temperature_change_from_co2=Low} => {population=Low}                  0.9720760  0.9872152 0.9846647 1.007744  4247

plot(head(sort(rules, by="lift"), 10), method="graph", engine="htmlwidget")

The Apriori algorithm identified strong relationships between the categorical versions of CO₂, population, and temperature change.

From the top rules:

Countries with Low Population and Low Temperature Change almost always have Low CO₂ emissions.

Conversely, Low CO₂ is strongly associated with both Low Population and Low Temperature Change.

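The confidence values (0.987 to 1.000) mean each rule holds for roughly 99% or more of the observations that match its left-hand side.

The lift values (~1.01–1.02) are only slightly above 1, so these items co-occur barely more often than expected under independence. The high support and confidence mainly reflect that about 97–98% of observations fall into the Low bin of every variable under the equal-width binning used here.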
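Lift is the rule's confidence divided by the support of its right-hand side, so it can be recomputed by hand from the printed table. The sketch below does this for rule [1], taking the support of {co2=Low} from the coverage column of rule [2]; the variable names are illustrative only.

```r
# Values copied from the rules table above
conf_rule1  <- 0.9990582  # confidence of rule [1], whose right-hand side is {co2=Low}
supp_co2low <- 0.9777981  # coverage of rule [2], i.e. the share of rows with co2=Low

# Lift = confidence / support of the right-hand side
lift_rule1 <- conf_rule1 / supp_co2low
print(round(lift_rule1, 4))
```

The result matches the lift of 1.021743 reported for rule [1]. A lift of exactly 1 would mean the left-hand side tells us nothing about the right-hand side beyond its base rate.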

Carbon Emissions Analysis Using Regression, KNN, ANOVA, Clustering, and Association Rules

Elias

2025-11-09
