Introduction:
This project analyzes global CO₂ emissions to understand how
population growth and temperature changes influence emission
patterns.
Goal:
To identify key relationships, classify emission levels, and group
countries using regression, KNN, ANOVA, clustering, and association
rules.
[1]: Inspecting the Structure and Getting an Overview
library(readr)
## Warning: package 'readr' was built under R version 4.5.2
# Loading my dataset
C02_data <- read_csv("C:/Users/bzimb/Desktop/DA Proj/C02_data.csv")
## Rows: 4369 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (5): year, population, temperature_change_from_co2, consumption_co2, co2
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(C02_data)
head(C02_data)
## # A tibble: 6 × 6
## year population temperature_change_from_co2 consumption_co2 co2 country
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1990 643775320 0.042 583. 658. Africa
## 2 1991 661103806 0.043 607. 688. Africa
## 3 1992 678558076 0.044 620. 667. Africa
## 4 1993 696621727 0.044 643. 706. Africa
## 5 1994 714502658 0.045 716. 790. Africa
## 6 1995 732695476 0.046 752. 845. Africa
summary(C02_data)
## year population temperature_change_from_co2
## Min. :1990 Min. :2.553e+05 Min. :0.00100
## 1st Qu.:1998 1st Qu.:5.398e+06 1st Qu.:0.00100
## Median :2006 Median :1.410e+07 Median :0.00300
## Mean :2006 Mean :2.145e+08 Mean :0.03517
## 3rd Qu.:2014 3rd Qu.:5.819e+07 3rd Qu.:0.01100
## Max. :2023 Max. :8.092e+09 Max. :1.16100
## NA's :843
## consumption_co2 co2 country
## Min. : 0.084 Min. : 0.455 Length:4369
## 1st Qu.: 9.940 1st Qu.: 7.958 Class :character
## Median : 49.935 Median : 45.034 Mode :character
## Mean : 948.098 Mean : 955.180
## 3rd Qu.: 254.129 3rd Qu.: 270.715
## Max. :37791.566 Max. :37791.570
##
str(C02_data)
## spc_tbl_ [4,369 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ year : num [1:4369] 1990 1991 1992 1993 1994 ...
## $ population : num [1:4369] 6.44e+08 6.61e+08 6.79e+08 6.97e+08 7.15e+08 ...
## $ temperature_change_from_co2: num [1:4369] 0.042 0.043 0.044 0.044 0.045 0.046 0.047 0.048 0.049 0.05 ...
## $ consumption_co2 : num [1:4369] 583 607 620 643 716 ...
## $ co2 : num [1:4369] 658 688 667 706 790 ...
## $ country : chr [1:4369] "Africa" "Africa" "Africa" "Africa" ...
## - attr(*, "spec")=
## .. cols(
## .. year = col_double(),
## .. population = col_double(),
## .. temperature_change_from_co2 = col_double(),
## .. consumption_co2 = col_double(),
## .. co2 = col_double(),
## .. country = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
colSums(is.na(C02_data))
## year population
## 0 0
## temperature_change_from_co2 consumption_co2
## 843 0
## co2 country
## 0 0
[1]:
- The dataset contains 4,369 observations and 6 variables: five numeric
variables (year, population, temperature_change_from_co2,
consumption_co2, co2) and one categorical variable (country).
- Missing values are present only in temperature_change_from_co2 (843
NAs); all other variables are complete.
- Summary statistics show wide variation: population ranges from
~255,000 to ~8.09 billion and co2 ranges from 0.455 to 37,791. For
population, consumption_co2, co2, and temperature_change_from_co2 the
mean is far above the median, so the distributions are strongly
right-skewed with a few very large values.
- All numeric variables are of type double and country is character,
making the dataset ready for preprocessing and further analysis.
[2] Handling the Missing Values
C02_data$temperature_change_from_co2[is.na(C02_data$temperature_change_from_co2)] <- median(C02_data$temperature_change_from_co2, na.rm = TRUE)
colSums(is.na(C02_data))
## year population
## 0 0
## temperature_change_from_co2 consumption_co2
## 0 0
## co2 country
## 0 0
[2]
- Missing values in ‘temperature_change_from_co2’ (843 NAs, about 19%
of the rows) were replaced with the median of that column (0.003).
- After imputation, all variables have complete data with no missing
values, making the dataset ready for further analysis.
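Because a single overall median is used for every row, entities with very different scales all receive the same filled-in value. The sketch below is only an illustration on a made-up data frame (`toy`, `temp`, `global_fill` and `group_fill` are hypothetical names, not part of this project); it contrasts the overall median with a per-country median computed with base R's `ave()`.

```r
# Hypothetical toy data: two entities on very different scales.
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  temp    = c(0.001, NA, 0.003, 0.40, 0.50, NA)
)

# Overall median imputation (the approach used in this report).
global_fill <- toy$temp
global_fill[is.na(global_fill)] <- median(toy$temp, na.rm = TRUE)

# Per-country median imputation: ave() applies the function within each group.
group_fill <- ave(toy$temp, toy$country, FUN = function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
})

print(data.frame(toy, global_fill, group_fill))
```

In this toy case the overall median fills both gaps with 0.2015, while the per-country version fills them with 0.002 and 0.45. If a country has no observed values at all, its per-country median is still NA, so a fallback would be needed.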
[3] Checking for outliers and Relationships
CO2 VS Population Relationship
library(ggplot2)
# Scatterplot CO2 vs Population
ggplot(C02_data, aes(x = population, y = co2)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "blue") +
ggtitle("CO2 vs Population") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

[3 - i]
There is a clear positive linear relationship between population and
CO2 emissions.
The upward slope of the blue regression line indicates that as
population increases, CO2 emissions tend to increase as well.
Temp VS CO2 Relationship
# CO2 vs Temperature Change from CO2
ggplot(C02_data, aes(x = temperature_change_from_co2, y = co2)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "red") +
ggtitle("CO2 vs Temperature Change from CO2") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

[3 - ii]
The plot reveals a very strong, positive, and predominantly linear
relationship between the two variables, temperature_change_from_co2 and
co2.
This indicates that as total CO2 emissions increase, the temperature
change attributed to CO2 increases in a highly predictable, linear
fashion.
Correlation Plot
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.5.2
## corrplot 0.95 loaded
# Selecting only the CO2, population, and temperature change columns
subset_numeric <- C02_data[, c("co2", "population", "temperature_change_from_co2")]
# Getting the correlation matrix
cor_matrix_subset <- cor(subset_numeric, use = "complete.obs")
# Plotting correlation matrix
corrplot(cor_matrix_subset, method = "color", type = "upper",
addCoef.col = "black", tl.cex = 0.8, number.cex = 0.8,
title = "Correlation Matrix: CO2, Population, Temperature Changes", mar=c(0,0,1,0))

[3 - iii]
- CO2 is strongly positively correlated with population (0.90) and
temperature_change_from_co2 (0.97): higher emissions go together with
larger populations and larger CO2-attributed temperature change.
- Population is also strongly positively correlated with
temperature_change_from_co2 (0.85), showing that larger populations
are associated with higher temperature impacts.
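These coefficients are Pearson correlations on the raw values, and the data mix individual countries with much larger aggregate rows (for example Africa, and a maximum population of about 8.09 billion). The sketch below uses made-up numbers, not project data, to show how one aggregate-sized row can dominate a Pearson correlation; a rank-based (Spearman) coefficient is less sensitive to it.

```r
# Hypothetical toy data: five small entities plus one aggregate-sized row.
toy <- data.frame(
  population = c(1e6, 2e6, 3e6, 4e6, 5e6, 8e9),
  co2        = c(50, 10, 40, 20, 30, 35000)
)

# Pearson correlation on all six rows: driven almost entirely by the last row.
pearson_all <- cor(toy$population, toy$co2)

# Pearson correlation on the five small rows only.
pearson_small <- cor(toy$population[1:5], toy$co2[1:5])

# Spearman correlation uses ranks, so the extreme row counts as one rank step.
spearman_all <- cor(toy$population, toy$co2, method = "spearman")

print(c(pearson_all = pearson_all,
        pearson_small = pearson_small,
        spearman_all = spearman_all))
```

In this toy case the full-data Pearson value is close to 1 even though the five small rows have a correlation of -0.3. Whether the same effect is present in C02_data would have to be checked, for instance by repeating `cor()` with `method = "spearman"` or on log-transformed columns.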
[4]: Regression Analysis
CO2 ~ Population
# Performing Linear regression: CO2 as a function of population
lm_co2_pop <- lm(co2 ~ population, data = C02_data)
summary(lm_co2_pop)
##
## Call:
## lm(formula = co2 ~ population, data = C02_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8841.6 -121.6 -73.0 -29.3 11505.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.080e+01 2.379e+01 2.976 0.00294 **
## population 4.122e-06 2.944e-08 140.001 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1516 on 4367 degrees of freedom
## Multiple R-squared: 0.8178, Adjusted R-squared: 0.8178
## F-statistic: 1.96e+04 on 1 and 4367 DF, p-value: < 2.2e-16
[4 - i]
- R-squared (0.818): About 81.8% of the variance in CO2 emissions is
explained by population alone.
- Significance: The population coefficient is highly significant
(p-value < 2e-16), so the positive slope is very unlikely to be due to
chance.
CO2 ~ Temperature Change from CO2
# Performing Linear regression: CO2 as a function of temperature_change_from_co2
lm_co2_temp <- lm(co2 ~ temperature_change_from_co2, data = C02_data)
# View summary
summary(lm_co2_temp)
##
## Call:
## lm(formula = co2 ~ temperature_change_from_co2, data = C02_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6781.3 -67.3 -4.4 28.9 9465.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29.42 14.10 -2.087 0.037 *
## temperature_change_from_co2 33995.57 134.04 253.622 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 895.6 on 4367 degrees of freedom
## Multiple R-squared: 0.9364, Adjusted R-squared: 0.9364
## F-statistic: 6.432e+04 on 1 and 4367 DF, p-value: < 2.2e-16
[4 - ii]
- R-squared (0.936): 93.6% of CO2 variation is explained by
temperature change, a very strong relationship.
- Significance: The model is highly significant (p-value < 2.2e-16),
and the high R-squared indicates a highly predictive model.
[4 - iii] Both population and temperature change significantly
explain CO2 emissions.
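To make the coefficients easier to read, the short sketch below rescales the population slope and checks the R-squared values against the correlation matrix. All numbers are copied from the two summary() outputs above; the 50 million population is only a hypothetical example value.

```r
# Values copied from the summary() outputs of lm_co2_pop and lm_co2_temp.
intercept_pop <- 7.080e+01
slope_pop     <- 4.122e-06
r2_pop        <- 0.8178
r2_temp       <- 0.9364

# Slope expressed per one million people instead of per person.
slope_per_million <- slope_pop * 1e6

# Fitted co2 for a hypothetical population of 50 million.
fitted_50m <- intercept_pop + slope_pop * 50e6

# In a one-predictor regression, |r| equals the square root of R-squared.
r_pop  <- sqrt(r2_pop)
r_temp <- sqrt(r2_temp)

print(round(c(slope_per_million = slope_per_million,
              fitted_50m = fitted_50m,
              r_pop = r_pop,
              r_temp = r_temp), 3))
```

Each additional million people is associated with about 4.12 more units of co2, and the fitted value for 50 million people is about 276.9. The square roots (about 0.90 and 0.97) agree with the correlations shown in the correlation plot.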
KNN (K-Nearest Neighbors)
KNN (Predicting CO₂ Emissions)
library(class)
## Warning: package 'class' was built under R version 4.5.2
# 1. Creating categorical CO2 levels (Low, Medium, High)
C02_data$CO2_Level <- cut(
C02_data$co2,
breaks = quantile(C02_data$co2, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
labels = c("Low", "Medium", "High"),
include.lowest = TRUE
)
# Checking class distribution
table(C02_data$CO2_Level)
##
## Low Medium High
## 1442 1441 1486
# Remove rows with any missing values
C02_data <- na.omit(C02_data)
# 2. Selecting and scaling numeric features
features <- C02_data[, c("population", "temperature_change_from_co2")]
features_scaled <- scale(features)
# 3. Defining target variable
target <- C02_data$CO2_Level
# 4. Randomizing and splitting the data (first 3,500 shuffled rows for training)
set.seed(123)
rand_index <- sample(1:nrow(C02_data))
train_index <- rand_index[1:3500]
test_index <- rand_index[3501:nrow(C02_data)]
train_x <- features_scaled[train_index, ]
test_x <- features_scaled[test_index, ]
train_y <- target[train_index]
test_y <- target[test_index]
# 5. KNN with (k = 5)
pred_knn <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
# 6. Actual vs predicted
comparison <- data.frame(Actual = test_y, Predicted = pred_knn)
print(head(comparison))
## Actual Predicted
## 1 Low Medium
## 2 High High
## 3 Medium Medium
## 4 Low Low
## 5 Low Low
## 6 High High
# 7. Evaluating model accuracy
accuracy <- mean(pred_knn == test_y)
cat("KNN Model Accuracy:", round(accuracy * 100, 2), "%\n")
## KNN Model Accuracy: 78.14 %
The KNN algorithm was applied to classify CO₂ emission levels into
three categories (Low, Medium, and High), using population and
temperature change from CO₂ as the predictors.
Before modeling, both features were scaled to ensure equal
contribution during distance calculations.
The dataset was randomly split into 3,500 training rows and 869 test
rows (roughly 80/20).
The model output compared the actual and predicted CO₂ categories
for the first six test points, five of which were classified correctly.
Overall, about 78% of the test predictions matched the true classes,
so population and temperature change carry substantial, though not
complete, information about the CO₂ emission level.
Model accuracy was 78.14% with k = 5.
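In the code above the features are scaled on all rows before the split, so the test rows contribute to the means and standard deviations. The sketch below is one possible alternative, shown on simulated data (all object names and the simulated predictors are hypothetical): it scales with training statistics only and reports a confusion matrix in addition to overall accuracy.

```r
library(class)

set.seed(123)
# Hypothetical toy data: two numeric predictors and a three-level class.
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
score <- x1 + x2
level <- cut(score,
             breaks = quantile(score, probs = c(0, 1/3, 2/3, 1)),
             labels = c("Low", "Medium", "High"),
             include.lowest = TRUE)
features <- cbind(x1 = x1, x2 = x2)

# Random 80/20 split.
idx <- sample(seq_len(n))
train_idx <- idx[1:240]
test_idx  <- idx[241:n]

# Scale with the training means and SDs, then reuse them for the test rows.
train_scaled <- scale(features[train_idx, ])
test_scaled  <- scale(features[test_idx, ],
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))

pred <- knn(train = train_scaled, test = test_scaled,
            cl = level[train_idx], k = 5)

# Confusion matrix: rows are actual classes, columns are predictions.
conf_mat <- table(Actual = level[test_idx], Predicted = pred)
print(conf_mat)

# Share of each actual class that was predicted correctly, and overall accuracy.
per_class_recall <- diag(conf_mat) / rowSums(conf_mat)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(round(per_class_recall, 3))
cat("Accuracy:", round(accuracy * 100, 2), "%\n")
```

The confusion matrix shows which classes are confused with each other, which a single accuracy figure cannot show.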
ANOVA
One-Way ANOVA — CO₂ by Country
# One-Way ANOVA: CO2 by Country
anova_result <- aov(co2 ~ country, data = C02_data)
# Summary of the ANOVA model
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## country 132 5.223e+10 395683943 583.7 <2e-16 ***
## Residuals 4236 2.872e+09 677935
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I performed a one-way ANOVA to examine whether mean CO₂ emissions
differ significantly across the 133 entries in the country column.
The ANOVA shows a statistically significant difference in mean CO₂
emissions across countries:
F(132, 4236) = 583.7, p < 2e-16.
CO₂ emission levels vary widely between countries, suggesting strong
country-level effects.
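With more than 4,000 observations, a small p-value alone does not say how large the country effect is. As a rough check, the sketch below computes eta-squared (the share of total variation in co2 that lies between countries) and recomputes F, using only the rounded sums of squares and degrees of freedom printed in the ANOVA table above.

```r
# Values copied from the ANOVA table above (rounded as printed).
ss_country <- 5.223e10
ss_resid   <- 2.872e9
df_country <- 132
df_resid   <- 4236

# Eta-squared: between-country sum of squares over total sum of squares.
eta_sq <- ss_country / (ss_country + ss_resid)

# F statistic recomputed as the ratio of the two mean squares.
f_value <- (ss_country / df_country) / (ss_resid / df_resid)

print(round(c(eta_sq = eta_sq, f_value = f_value), 3))
```

Eta-squared is about 0.948, so roughly 95% of the variation in co2 is between countries rather than within them over time. The recomputed F is about 583.6; the small gap from the reported 583.7 comes from the rounded inputs.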
Clustering (K-Means)
Determine Optimal Number of Clusters (Elbow Method)
# numeric variables
cluster_data <- C02_data[, c("co2", "population", "temperature_change_from_co2")]
# Scaling
cluster_scaled <- scale(cluster_data)
# Computing total within-cluster sum of squares for k = 1 to 10
set.seed(123)
wss <- numeric(10)
for (k in 1:10) {
kmeans_model <- kmeans(cluster_scaled, centers = k, nstart = 25)
wss[k] <- kmeans_model$tot.withinss
}
# Plotting WSS against k (Elbow Method)
plot(1:10, wss, type = "b",
pch = 19, frame = FALSE,
xlab = "Number of Clusters (k)",
ylab = "Total Within-Cluster Sum of Squares (WSS)",
main = "Elbow Method for Determining Optimal k")

The “elbow” of the curve is located at k = 3. This is the point
after which adding more clusters yields diminishing returns. Therefore,
the Elbow Method suggests that 3 is the optimal number of clusters for
segmenting this dataset based on co2, population, and
temperature_change_from_co2.
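Reading the elbow from a plot is somewhat subjective. One simple numerical aid, sketched here on simulated data rather than on C02_data, is the fractional drop in WSS each time a cluster is added. The names `toy` and `wss_drop` are hypothetical, and the three simulated groups are an assumption made only for illustration.

```r
set.seed(123)
# Hypothetical toy data: three well-separated groups in two dimensions.
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1 to 6.
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Fractional reduction in WSS when moving from k - 1 to k clusters.
wss_drop <- -diff(wss) / head(wss, -1)
names(wss_drop) <- paste0("k=", 2:6)
print(round(wss_drop, 3))
```

Large drops up to some k followed by much smaller ones correspond to the visual elbow. The same two lines that compute `wss_drop` could be applied to the `wss` vector from the code above.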
set.seed(123)
C02_scaled <- scale(C02_data[, c("co2", "population", "temperature_change_from_co2")])
kmeans_result <- kmeans(C02_scaled, centers = 3, nstart = 25)
C02_data$Cluster <- as.factor(kmeans_result$cluster)
# Visualizing clusters
library(ggplot2)
ggplot(C02_data, aes(x = population, y = co2, color = Cluster)) +
geom_point(alpha = 0.6, size = 2) +
scale_color_manual(values = c("#1f77b4", "#ff7f0e", "#2ca02c")) +
labs(title = "CO2 Clusters Based on Population and Temperature Change",
x = "Population",
y = "CO2 Emissions",
color = "Cluster") +
theme_minimal()

Cluster 1 (Blue – Low Emitters):
This cluster represents countries with low population and low CO₂
emissions, generally located at the bottom-left region of the plot.
It mostly includes smaller or less industrialized nations with
minimal carbon contribution.
Cluster 2 (Green – Medium to High Emitters):
This cluster contains countries with moderate to high population
sizes and CO₂ emissions.
It represents the largest and most diverse group, capturing most
nations and regions with industrial activity and noticeable carbon
output.
Cluster 3 (Orange – Highest Emitters / Global Totals):
This small, distinct group includes entities with extremely high
population and CO₂ levels.
It likely corresponds to the aggregated ‘World’ or continental data,
where emissions are at their global peak.
These points are clearly separated from the rest, indicating a
unique emission pattern at the global scale.
Market Basket / Association Rule Mining
# Step 5: Association Rule Mining
library(arules); library(arulesViz)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
assoc_data <- C02_data[, c("co2", "population", "temperature_change_from_co2")]
assoc_data[] <- lapply(assoc_data, function(x) cut(x, 3, labels = c("Low", "Med", "High")))
rules <- apriori(as(assoc_data, "transactions"), parameter = list(supp=0.05, conf=0.8, minlen=2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.05 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 218
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[9 item(s), 4369 transaction(s)] done [0.00s].
## sorting and recoding items ... [3 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [9 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(head(sort(rules, by="lift"), 10))
## lhs rhs support confidence coverage lift count
## [1] {population=Low,
## temperature_change_from_co2=Low} => {co2=Low} 0.9711604 0.9990582 0.9720760 1.021743 4243
## [2] {co2=Low} => {temperature_change_from_co2=Low} 0.9777981 1.0000000 0.9777981 1.015574 4272
## [3] {co2=Low,
## population=Low} => {temperature_change_from_co2=Low} 0.9711604 1.0000000 0.9711604 1.015574 4243
## [4] {temperature_change_from_co2=Low} => {co2=Low} 0.9777981 0.9930265 0.9846647 1.015574 4272
## [5] {co2=Low} => {population=Low} 0.9711604 0.9932116 0.9777981 1.013865 4243
## [6] {population=Low} => {co2=Low} 0.9711604 0.9913551 0.9796292 1.013865 4243
## [7] {co2=Low,
## temperature_change_from_co2=Low} => {population=Low} 0.9711604 0.9932116 0.9777981 1.013865 4243
## [8] {population=Low} => {temperature_change_from_co2=Low} 0.9720760 0.9922897 0.9796292 1.007744 4247
## [9] {temperature_change_from_co2=Low} => {population=Low} 0.9720760 0.9872152 0.9846647 1.007744 4247
plot(head(sort(rules, by="lift"), 10), method="graph", engine="htmlwidget")
The Apriori algorithm returned nine rules linking the binned versions
of CO₂, population, and temperature change.
From the top rules:
Observations with Low Population and Low Temperature Change almost
always have Low CO₂ emissions (confidence 0.999).
Conversely, Low CO₂ is strongly associated with both Low Population
and Low Temperature Change.
The confidence values (0.987 to 1.000) mean each rule holds for at
least 98.7% of the observations that match its left-hand side.
The lift values (~ 1.01–1.02) are barely above 1, so these items
co-occur only slightly more often than expected under independence.
The reason is that roughly 98% of observations fall in the Low bin of
every variable (for example, the support of co2=Low is 0.978), so the
rules mainly reflect how common the Low bins are rather than a strong
association.
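Lift is the rule's confidence divided by the support of its right-hand side, so it can be recomputed from the inspect() output. The sketch below does this for rule [1], then uses a made-up skewed vector (`x` is hypothetical, not project data) to show how the equal-width bins produced by `cut(x, 3)` can put nearly every value into the Low bin, whereas quantile-based breaks give balanced bins.

```r
# Values copied from the inspect() output above.
conf_rule1   <- 0.9990582   # confidence of rule [1]
supp_co2_low <- 0.9777981   # support of {co2=Low} (coverage of rule [2])

# Lift = confidence / support of the right-hand side.
lift_rule1 <- conf_rule1 / supp_co2_low
print(round(lift_rule1, 6))

# Hypothetical right-skewed vector: 98 small values and 2 very large ones.
x <- c(seq(1, 98), 5000, 10000)

# Equal-width bins, as produced by cut(x, 3).
equal_width <- table(cut(x, 3, labels = c("Low", "Med", "High")))

# Quantile-based bins: breaks at the 1/3 and 2/3 quantiles.
quantile_bins <- table(cut(x,
                           breaks = quantile(x, probs = c(0, 1/3, 2/3, 1)),
                           labels = c("Low", "Med", "High"),
                           include.lowest = TRUE))

print(equal_width)
print(quantile_bins)
```

The recomputed lift matches the reported 1.021743. In the toy vector, equal-width binning places 98 of 100 values in Low, which is the same pattern that makes the rules above nearly trivial; quantile breaks would be one way to get more informative items.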