Introduction

Pollution is one of the most critical global challenges affecting environmental sustainability and human health. This project analyzes global CO2 emission trends using R programming to understand historical patterns, identify key contributors, and evaluate potential future impacts.

Dataset

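The dataset is sourced from Our World in Data. It contains country-level information on CO2 emissions, population, and related environmental indicators across multiple years. The co2 column is reported in million tonnes. The file also contains rows that are not individual countries (aggregates such as regional or world totals); the country-level queries below exclude them by keeping only rows whose iso_code has exactly three characters.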

Analysis

Q1: Which countries emit the most CO2 globally?

library(dplyr)
data <- read.csv("data.csv")
top_countries <- data %>%
  filter(!is.na(co2)) %>%
  filter(nchar(iso_code) == 3) %>%
  group_by(country) %>%
  summarise(total_co2 = sum(co2, na.rm=TRUE)) %>%
  arrange(desc(total_co2)) %>%
  head(10)

print(top_countries)
## # A tibble: 10 × 2
##    country        total_co2
##    <chr>              <dbl>
##  1 United States    434867.
##  2 China            285087.
##  3 Russia           122808.
##  4 Germany           95136.
##  5 United Kingdom    80079.
##  6 Japan             69612.
##  7 India             66073.
##  8 France            40048.
##  9 Canada            35644.
## 10 Ukraine           31236.

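Inference: Summed over all available years, the United States has the largest cumulative emissions (about 434,867 million tonnes), followed by China (285,087) and Russia (122,808). These are historical totals, so countries that industrialized early rank high even where their current annual emissions are smaller.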

Q2: How has global CO2 emission changed over time?

library(ggplot2)
theme_set(theme_minimal())

global_trend <- data %>%
  group_by(year) %>%
  summarise(total_co2 = sum(co2, na.rm=TRUE))

ggplot(global_trend, aes(x=year, y=total_co2)) +
  geom_line(color="blue") +
  labs(title="Global CO2 Emissions Over Time",
       x="Year",
       y="Total CO2 Emissions")

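Inference: The plot shows global CO2 emissions rising over the long run. Because global_trend sums every row in data, including aggregate rows that already contain the country values, the absolute totals overstate actual global emissions; the direction of the trend is the meaningful result here.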
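One way to keep aggregate rows out of the yearly series is to restrict it to country rows before summing. The sketch below uses a small made-up data frame (toy) in place of the real file and base R only; it assumes, as the filter in Q1 implies, that aggregate rows do not carry a three-character iso_code.

```r
# Toy stand-in for the real dataset; all values are made up for illustration.
toy <- data.frame(
  country  = c("A", "B", "World", "A", "B", "World"),
  iso_code = c("AAA", "BBB", "", "AAA", "BBB", ""),
  year     = c(2000, 2000, 2000, 2001, 2001, 2001),
  co2      = c(10, 20, 30, 12, NA, 35)
)

# Keep country rows only (three-character iso_code), then sum per year.
# The formula interface of aggregate() drops rows with missing co2.
countries_only <- toy[nchar(toy$iso_code) == 3, ]
country_trend <- aggregate(co2 ~ year, data = countries_only, FUN = sum)
names(country_trend)[2] <- "total_co2"

country_trend
```

On the real data, the same filter could be added to the dplyr pipeline before group_by(year).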

Q3: Which year recorded the highest global CO2 emissions?

peak_year <- global_trend[which.max(global_trend$total_co2), ]

peak_year
## # A tibble: 1 × 2
##    year total_co2
##   <int>     <dbl>
## 1  2024   245672.

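Inference: 2024 has the highest total in global_trend (about 245,672). This value is computed from all rows, including aggregates, so it identifies the peak year but overstates the actual level of global emissions in that year.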

Q5: How do CO2 emissions per person vary across countries?

data$co2_per_capita <- data$co2 / data$population

top_per_capita <- data %>%
  filter(!is.na(co2_per_capita)) %>%
  filter(nchar(iso_code) == 3) %>%
  group_by(country) %>%
  summarise(avg_per_capita = mean(co2_per_capita, na.rm=TRUE)) %>%
  arrange(desc(avg_per_capita)) %>%
  head(10)

top_per_capita
## # A tibble: 10 × 2
##    country                   avg_per_capita
##    <chr>                              <dbl>
##  1 Sint Maarten (Dutch part)      0.000143 
##  2 Curacao                        0.0000491
##  3 Qatar                          0.0000461
##  4 United Arab Emirates           0.0000287
##  5 Kuwait                         0.0000281
##  6 Luxembourg                     0.0000252
##  7 Brunei                         0.0000243
##  8 Bahrain                        0.0000200
##  9 Saudi Arabia                   0.0000138
## 10 Trinidad and Tobago            0.0000134

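Inference: Averaged over all years, small territories and oil- and gas-producing states rank highest in emissions per person. Because co2 is in million tonnes, the values above are in million tonnes per person; multiplied by 1e6, Sint Maarten averages about 143 tonnes per person and Qatar about 46.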
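The per-person values are easier to read in tonnes. A minimal sketch of the conversion, assuming co2 is in million tonnes and population is a head count; the toy data frame and the column name co2_tonnes_per_capita are illustrative, not part of the dataset.

```r
# Toy data; values are made up for illustration.
toy <- data.frame(
  country    = c("A", "B"),
  co2        = c(100, 5),      # million tonnes (assumed unit)
  population = c(2e6, 5e7)     # number of people
)

# Convert million tonnes to tonnes (x 1e6) before dividing by population.
toy$co2_tonnes_per_capita <- toy$co2 * 1e6 / toy$population

toy
```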

Q6: What is the distribution of CO2 emissions across countries?

ggplot(data, aes(x=co2)) +
  geom_histogram(bins=30, fill="steelblue", color="black") +
  labs(
    title="Distribution of CO2 Emissions",
    x="CO2 Emissions",
    y="Frequency"
  )
## Warning: Removed 21027 rows containing non-finite outside the scale range
## (`stat_bin()`).

Inference: The distribution is strongly right-skewed. Each observation is one row per country (or aggregate) per year; most have low emissions, while a small number are very large. The 21,027 rows with missing co2 are excluded from the histogram.

Q7: Are there any extreme outliers in CO2 emissions?

ggplot(data, aes(y=co2)) +
  geom_boxplot(fill="orange", color="black") +
  labs(
    title="Boxplot of CO2 Emissions"
  )
## Warning: Removed 21027 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Inference: The boxplot shows many points far above the upper whisker. Because data is not filtered here, the most extreme values likely include aggregate rows such as world totals in addition to the largest national emitters.

Q8: Is global CO2 emission increasing over time?

trend_model <- lm(total_co2 ~ year, data = global_trend)

summary(trend_model)
## 
## Call:
## lm(formula = total_co2 ~ year, data = global_trend)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -57202 -37371  -5574  29702 104020 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.288e+06  5.786e+04  -22.25   <2e-16 ***
## year         7.062e+02  3.064e+01   23.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40330 on 273 degrees of freedom
## Multiple R-squared:  0.6606, Adjusted R-squared:  0.6594 
## F-statistic: 531.4 on 1 and 273 DF,  p-value: < 2.2e-16

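Inference: The fitted slope is positive and statistically significant (about 706 per year, p < 2e-16), so emissions in global_trend increase over time. The R-squared of 0.66 shows that a straight line captures only about two thirds of the variation, which is consistent with growth that is not linear across the full period.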

Q9: What could be the future trend of CO2 emissions if current patterns continue?

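# Illustrative scenario only: scale every observed value up by 10%.
# This is not a forecast; it shifts the historical series upward.
global_trend$future_co2 <- global_trend$total_co2 * 1.1

ggplot(global_trend, aes(x=year)) +
  geom_line(aes(y=total_co2), color="blue") +
  geom_line(aes(y=future_co2), color="red") +
  labs(
    title="Observed CO2 Emissions vs a 10% Higher Scenario",
    x="Year",
    y="CO2 Emissions"
  )

Inference: The red line is the observed series multiplied by 1.1 over the same years, so it shows what emissions would look like if every year had been 10% higher. It does not extend past the last observed year and should not be read as a prediction; a forecast would need a model fitted to the trend and evaluated at future years.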
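A projection in the sense of the question needs values for years beyond the data. The sketch below fits a linear model with lm() and evaluates it at future years with predict(). The series toy_trend is made up and exactly linear so the result is easy to check; a straight line is an assumption for illustration, not a claim about how real emissions behave.

```r
# Made-up, exactly linear yearly series for illustration.
toy_trend <- data.frame(
  year      = 2015:2024,
  total_co2 = 100 + 2 * (0:9)
)

# Fit a straight line to the observed years.
fit <- lm(total_co2 ~ year, data = toy_trend)

# Evaluate the fitted line at years beyond the observed range.
future <- data.frame(year = 2025:2030)
future$projected_co2 <- as.numeric(predict(fit, newdata = future))

future
```

With real data, predict(fit, newdata = future, interval = "prediction") also returns lower and upper bounds, which helps convey how uncertain the extrapolation is.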

Q10: Which countries should take immediate action based on high emissions?

high_risk <- data %>%
  filter(nchar(iso_code) == 3) %>%
  group_by(country) %>%
  summarise(total_co2 = sum(co2, na.rm=TRUE)) %>%
  arrange(desc(total_co2)) %>%
  head(5)

high_risk
## # A tibble: 5 × 2
##   country        total_co2
##   <chr>              <dbl>
## 1 United States    434867.
## 2 China            285087.
## 3 Russia           122808.
## 4 Germany           95136.
## 5 United Kingdom    80079.

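Inference: Ranked by cumulative emissions, the United States, China, Russia, Germany, and the United Kingdom have the largest historical totals. The ranking for the most recent year (Q14) differs, with India and Japan replacing Germany and the United Kingdom in the top five, so priorities for action depend on whether historical or current emissions are considered.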

Q11: How might increasing CO2 emissions impact future environmental conditions?

model <- lm(co2 ~ year, data = data)

summary(model)
## 
## Call:
## lm(formula = co2 ~ year, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -763   -568   -394   -106  37835 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8416.5716   375.6285  -22.41   <2e-16 ***
## year            4.5355     0.1927   23.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1954 on 29382 degrees of freedom
##   (21027 observations deleted due to missingness)
## Multiple R-squared:  0.0185, Adjusted R-squared:  0.01847 
## F-statistic:   554 on 1 and 29382 DF,  p-value: < 2.2e-16

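Inference: The coefficient on year is positive and significant (about 4.5 per year per row), but the R-squared of 0.0185 means year explains less than 2% of the variation in co2 across rows. The model pools every country and aggregate row, so it confirms an upward drift on average and is too weak to support statements about future environmental conditions.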

Q13: Which countries show the fastest growth in CO2 emissions?

growth_data <- data %>%
  filter(nchar(iso_code) == 3) %>%
  filter(!is.na(co2)) %>%   # drop missing values so countries with no data do not yield -Inf
  group_by(country) %>%
  summarise(growth = max(co2) - min(co2)) %>%
  arrange(desc(growth)) %>%
  head(10)
growth_data
## # A tibble: 10 × 2
##    country       growth
##    <chr>          <dbl>
##  1 China         12272.
##  2 United States  6127.
##  3 India          3193.
##  4 Russia         2536.
##  5 Japan          1312.
##  6 Germany        1117.
##  7 Indonesia       812.
##  8 Iran            793.
##  9 Ukraine         744.
## 10 Saudi Arabia    708.

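Inference: China (about 12,272), the United States (6,127), and India (3,193) show the largest difference between their highest and lowest annual emissions. This measure is a range, so it overstates net growth for countries whose emissions peaked and later declined, and it reflects absolute size more than the rate of growth.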
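The difference between range and net change can be seen by computing both side by side. A minimal base R sketch on made-up data; growth here is defined as the last non-missing value minus the first, which is one possible definition and not the one used in the pipeline above.

```r
# Made-up data: country A peaks in 2010 and then declines.
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  year    = c(2000, 2010, 2020, 2000, 2010, 2020),
  co2     = c(5, 9, 6, NA, 2, 8)
)

# Drop missing values and sort so first/last follow time order.
valid <- toy[!is.na(toy$co2), ]
valid <- valid[order(valid$country, valid$year), ]

growth_by_country <- do.call(rbind, lapply(split(valid, valid$country), function(d) {
  data.frame(
    country = d$country[1],
    growth  = d$co2[nrow(d)] - d$co2[1],   # last minus first (net change)
    range   = max(d$co2) - min(d$co2)      # what max - min measures
  )
}))

growth_by_country
```

For country A the range is 4 while the net change is only 1, because its emissions fell after the peak.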

Q14: Which countries are the top polluters in recent years?

latest_year <- max(data$year, na.rm=TRUE)

recent_data <- data %>%
  filter(year == latest_year) %>%
  filter(nchar(iso_code) == 3) %>%
  arrange(desc(co2)) %>%
  head(10)

recent_data[, c("country", "co2")]
##          country       co2
## 1          China 12289.037
## 2  United States  4904.120
## 3          India  3193.478
## 4         Russia  1780.524
## 5          Japan   961.867
## 6      Indonesia   812.220
## 7           Iran   792.631
## 8   Saudi Arabia   692.133
## 9    South Korea   583.679
## 10       Germany   572.319

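Inference: In the latest year, China is the largest emitter by a wide margin (about 12,289 million tonnes), more than the United States, India, and Russia combined (about 9,878). The list differs from the cumulative ranking in Q1: India, Indonesia, Iran, Saudi Arabia, and South Korea appear here, while the United Kingdom, France, Canada, and Ukraine do not.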

Q15: How do emissions compare among the top 5 countries?

top5 <- c("United States", "China", "India", "Russia", "Japan")

comparison_data <- data %>%
  filter(country %in% top5)

ggplot(comparison_data, aes(x=year, y=co2, color=country)) +
  geom_line(linewidth=1) +
  labs(
    title="Comparison of CO2 Emissions (Top 5 Countries)",
    x="Year",
    y="CO2 Emissions"
  )
## Warning: Removed 83 rows containing missing values or values outside the scale range
## (`geom_line()`).

Inference: The graph compares yearly emissions for the five largest emitters of the latest year (see Q14), showing how their trajectories differ over time.

Q16: What is the percentage contribution of top countries to total emissions?

total_global <- sum(data$co2, na.rm=TRUE)

top5_data <- data %>%
  filter(country %in% c("United States", "China", "India", "Russia", "Japan")) %>%
  group_by(country) %>%
  summarise(total = sum(co2, na.rm=TRUE))

top5_data$percentage <- (top5_data$total / total_global) * 100

top5_data
## # A tibble: 5 × 3
##   country         total percentage
##   <chr>           <dbl>      <dbl>
## 1 China         285087.      2.31 
## 2 India          66073.      0.535
## 3 Japan          69612.      0.564
## 4 Russia        122808.      0.995
## 5 United States 434867.      3.52

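Inference: With this denominator, the five countries together account for only about 7.9% of the total. The shares are this low because total_global sums every row in data, including aggregate rows that already contain the country values, so the same emissions are counted several times. A country-only denominator is needed before concluding how concentrated emissions are.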
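The effect of the denominator can be illustrated on a toy example. The sketch below computes shares twice, once against country rows only and once against all rows; the data are made up, and it assumes aggregate rows can be identified by an iso_code that is not three characters long.

```r
# Made-up data: three countries plus a "World" row equal to their sum.
toy <- data.frame(
  country  = c("A", "B", "C", "World"),
  iso_code = c("AAA", "BBB", "CCC", ""),
  co2      = c(50, 30, 20, 100)
)

countries_only <- toy[nchar(toy$iso_code) == 3, ]

# Share of each country in the country-only total.
countries_only$share_pct <-
  100 * countries_only$co2 / sum(countries_only$co2, na.rm = TRUE)

# Share against all rows: the aggregate row inflates the denominator.
countries_only$share_with_aggregate_pct <-
  100 * countries_only$co2 / sum(toy$co2, na.rm = TRUE)

countries_only
```

Including the single aggregate row halves every share in this example; with several regional and world aggregates the distortion is larger.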

Q17: Are CO2 emissions increasing at an accelerating rate?

global_trend$change <- c(NA, diff(global_trend$total_co2))

ggplot(global_trend, aes(x=year, y=change)) +
  geom_line(color="purple") +
  labs(
    title="Yearly Change in CO2 Emissions",
    x="Year",
    y="Change in Emissions"
  )
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

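Inference: The plot shows the year-over-year change in total emissions. Acceleration would appear as changes that grow larger over time on average; wider swings alone indicate more volatility, not necessarily faster growth.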

Q18: Which countries have reduced their CO2 emissions over time?

reduction <- data %>%
  filter(nchar(iso_code) == 3) %>%
  group_by(country) %>%
  summarise(change = last(co2) - first(co2)) %>%
  arrange(change)

head(reduction, 10)
## # A tibble: 10 × 2
##    country     change
##    <chr>        <dbl>
##  1 Moldova       5.33
##  2 Latvia        6.46
##  3 Armenia       7.43
##  4 Estonia       8.31
##  5 Tajikistan   10.7 
##  6 Kyrgyzstan   11.8 
##  7 Georgia      11.8 
##  8 Lithuania    12.5 
##  9 Denmark      28.2 
## 10 New Zealand  32.5

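Inference: None of the ten smallest changes is negative (the smallest is Moldova at +5.33), so this comparison does not identify any country whose last recorded value is below its first. The calculation also depends on row order, and any country whose first or last co2 value is missing gets NA and is sorted to the end, so it cannot be used on its own to assess emission reductions.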
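A comparison that is robust to missing values takes the first and last non-missing observation per country in time order. The base R sketch below does this on made-up data and then keeps only countries with a negative change; the object names are illustrative.

```r
# Made-up data with missing values at the start (A) and end (B).
toy <- data.frame(
  country = rep(c("A", "B"), each = 4),
  year    = rep(c(1990, 2000, 2010, 2020), times = 2),
  co2     = c(NA, 40, 55, 30, 10, 20, 30, NA)
)

# Remove missing values and sort by country and year.
valid <- toy[!is.na(toy$co2), ]
valid <- valid[order(valid$country, valid$year), ]

# First and last non-missing value per country.
first_val <- tapply(valid$co2, valid$country, function(x) x[1])
last_val  <- tapply(valid$co2, valid$country, function(x) x[length(x)])

change <- data.frame(
  country = names(first_val),
  change  = as.numeric(last_val - first_val)
)

# Countries whose latest value is below their earliest value.
reduced <- change[change$change < 0, ]
reduced
```

The choice of baseline matters: comparing against each country's peak year, or against a fixed reference year, would answer a different question about reductions.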

Q20: How does CO2 emission vary across different time periods?

data$period <- ifelse(data$year < 1980, "Before 1980",
                ifelse(data$year < 2000, "1980-2000", "After 2000"))

period_data <- data %>%
  group_by(period) %>%
  summarise(avg_co2 = mean(co2, na.rm=TRUE))

period_data
## # A tibble: 3 × 2
##   period      avg_co2
##   <chr>         <dbl>
## 1 1980-2000      622.
## 2 After 2000     867.
## 3 Before 1980    219.

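Inference: The average co2 value per row rises from about 219 before 1980 to 622 in the middle period and 867 in the latest period. Note that the "1980-2000" bin covers 1980 through 1999 and "After 2000" includes the year 2000. The averages are taken over all rows, including aggregates, so they indicate the direction of change more reliably than its size.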
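Period bins with unambiguous boundaries can be built with cut(). The sketch below shows the idea on a few sample years; the labels are suggestions, and the resulting factor keeps the periods in chronological order instead of alphabetical order.

```r
# Sample years chosen to sit on the bin boundaries.
years <- c(1975, 1979, 1980, 1999, 2000, 2024)

# Intervals are closed on the right: (-Inf, 1979], (1979, 1999], (1999, Inf).
period <- cut(
  years,
  breaks = c(-Inf, 1979, 1999, Inf),
  labels = c("Before 1980", "1980-1999", "2000 onward")
)

data.frame(year = years, period = period)
```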

Q21: How does population size influence total CO2 emissions?

ggplot(data, aes(x=population, y=co2)) +
  geom_point(alpha=0.5, color="blue") +
  labs(
    title="Population vs CO2 Emissions",
    x="Population",
    y="CO2 Emissions"
  )
## Warning: Removed 25048 rows containing missing values or values outside the scale range
## (`geom_point()`).

Inference: Countries with larger populations tend to have higher total emissions.

Q22: How do total emissions differ from per capita emissions?

ggplot(data, aes(x=co2_per_capita, y=co2)) +
  geom_point(alpha=0.5, color="red") +
  labs(
    title="Per Capita vs Total Emissions",
    x="CO2 per Capita",
    y="Total CO2"
  )
## Warning: Removed 25048 rows containing missing values or values outside the scale range
## (`geom_point()`).

Inference: Some countries have high total emissions but lower per capita values.

Q23: Which countries have the highest per capita emissions in recent years?

latest_year <- max(data$year, na.rm=TRUE)

recent_pc <- data %>%
  filter(year == latest_year) %>%
  filter(nchar(iso_code) == 3) %>%
  arrange(desc(co2_per_capita)) %>%
  head(10)

recent_pc[, c("country","co2_per_capita")]
##                      country co2_per_capita
## 1                      Qatar   4.127109e-05
## 2                     Kuwait   2.624760e-05
## 3                     Brunei   2.604520e-05
## 4                    Bahrain   2.426980e-05
## 5        Trinidad and Tobago   2.293176e-05
## 6               Saudi Arabia   2.037918e-05
## 7       United Arab Emirates   2.013107e-05
## 8              New Caledonia   1.806564e-05
## 9  Sint Maarten (Dutch part)   1.655446e-05
## 10                      Oman   1.565111e-05

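Inference: In the latest year, small oil- and gas-producing states have the highest emissions per person. Converted from million tonnes to tonnes, Qatar emits about 41 tonnes per person, followed by Kuwait and Brunei at about 26 tonnes each.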

Q24: What is the smoothed trend of global CO2 emissions?

ggplot(global_trend, aes(x=year, y=total_co2)) +
  geom_line(color="gray") +
  geom_smooth(method="loess", color="red") +
  labs(title="Smoothed CO2 Emission Trend")
## `geom_smooth()` using formula = 'y ~ x'

Inference: The smoothed curve highlights the long-term upward trend in emissions.

Q25: How are CO2 emissions distributed across countries?

ggplot(data, aes(x=co2)) +
  geom_density(fill="green", alpha=0.5)
## Warning: Removed 21027 rows containing non-finite outside the scale range
## (`stat_density()`).

Inference: The density confirms the pattern seen in the histogram in Q6: most country-year observations cluster at low emission levels, with a long right tail of very high values.

Q26: How have emissions changed in the last decade?

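# Sum country rows per year so the plot shows one line for the period.
recent_data <- data %>%
  filter(nchar(iso_code) == 3) %>%
  filter(year >= max(year, na.rm=TRUE) - 10) %>%
  group_by(year) %>%
  summarise(total_co2 = sum(co2, na.rm=TRUE))

ggplot(recent_data, aes(x=year, y=total_co2)) +
  geom_line(color="blue")

Inference: Aggregating country rows by year yields a single line covering the most recent years, from which the direction of change over the last decade can be read directly. Plotting co2 per row without grouping would connect every country's value within each year and obscure the trend.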

Q27: Which countries have stable emission patterns over time?

stability <- data %>%
  filter(nchar(iso_code) == 3) %>%
  group_by(country) %>%
  summarise(sd_co2 = sd(co2, na.rm=TRUE)) %>%
  arrange(sd_co2)

head(stability, 10)
## # A tibble: 10 × 2
##    country                    sd_co2
##    <chr>                       <dbl>
##  1 Niue                      0.00152
##  2 Tuvalu                    0.00274
##  3 Saint Helena              0.00296
##  4 Wallis and Futuna         0.00315
##  5 Antarctica                0.00493
##  6 Montserrat                0.0128 
##  7 Saint Pierre and Miquelon 0.0175 
##  8 Kiribati                  0.0181 
##  9 Cook Islands              0.0236 
## 10 Micronesia (country)      0.0241

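Inference: The lowest standard deviations belong to very small emitters such as Niue and Tuvalu. Standard deviation scales with the level of emissions, so this ranking mostly reflects size; a relative measure such as the coefficient of variation (standard deviation divided by mean) is a better indicator of stability.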
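The difference between absolute and relative variability is easy to see on two made-up series. In the sketch below, the small emitter has the lower standard deviation, but the large emitter is far more stable relative to its own level; the names and values are illustrative.

```r
# Made-up series: a small, erratic emitter and a large, steady one.
toy <- data.frame(
  country = rep(c("Small", "Large"), each = 3),
  co2     = c(1, 2, 3, 1000, 1010, 1020)
)

# Standard deviation and mean per country (groups come out alphabetically).
sd_co2   <- tapply(toy$co2, toy$country, sd)
mean_co2 <- tapply(toy$co2, toy$country, mean)

stability <- data.frame(
  country = names(sd_co2),
  sd_co2  = as.numeric(sd_co2),
  cv      = as.numeric(sd_co2 / mean_co2)   # coefficient of variation
)

# Sort by relative variability, most stable first.
stability[order(stability$cv), ]
```

The coefficient of variation is undefined when the mean is zero, so countries with all-zero emissions would need to be handled separately.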

Q28: Is there a correlation between population and CO2 emissions?

cor(data$population, data$co2, use="complete.obs")
## [1] 0.8481262

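Inference: Population and CO2 emissions are strongly positively correlated (r of about 0.85). The correlation is computed over all rows with both values present, including any aggregate rows, whose large populations and emissions can strengthen the association.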

Q29: How does log transformation help in understanding emissions?

ggplot(data, aes(x=log(co2))) +
  geom_histogram(bins=30, fill="purple")
## Warning: Removed 22381 rows containing non-finite outside the scale range
## (`stat_bin()`).

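Inference: Log transformation compresses the long right tail, so the spread of low and medium emitters becomes visible. Rows with co2 equal to zero become -Inf under log() and are dropped, which is why 22,381 rows are removed here compared with 21,027 in the untransformed histogram.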
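The behaviour of log() at zero explains the extra dropped rows. A small sketch with made-up values, showing the non-finite result and one way to handle it explicitly by keeping only positive values before transforming:

```r
# Made-up emissions, including a zero and a missing value.
co2 <- c(0, 0.5, 10, 1000, NA)

# log(0) is -Inf and log(NA) is NA; both are non-finite.
log_co2 <- log(co2)
n_finite <- sum(is.finite(log_co2))

# Keep strictly positive, non-missing values, then take log base 10.
positive <- co2[!is.na(co2) & co2 > 0]
log10_positive <- log10(positive)

log10_positive
```

In ggplot2, scale_x_log10() applied to the raw co2 values is an alternative that keeps the axis labelled in original units.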

Q30: How have top contributors changed over time?

top_countries_names <- top_countries$country

trend_top <- data %>%
  filter(country %in% top_countries_names)

ggplot(trend_top, aes(x=year, y=co2, color=country)) +
  geom_line()
## Warning: Removed 83 rows containing missing values or values outside the scale range
## (`geom_line()`).

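Inference: The plot traces yearly emissions for the ten largest cumulative emitters from Q1, showing when each country's emissions rose or fell and how their relative positions changed.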

Q31: What share do top countries contribute globally?

top_total <- sum(top_countries$total_co2)
global_total <- sum(data$co2, na.rm=TRUE)

top_total / global_total * 100
## [1] 10.2089

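Inference: By this calculation, the top ten countries account for about 10.2% of the summed co2 column. The figure is low because global_total adds up every row, including aggregate rows that already contain the country values, so the denominator counts the same emissions several times. Dividing by a country-only total would give a more meaningful share.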

Q32: Which countries have the lowest emissions?

low_emitters <- data %>%
  filter(nchar(iso_code) == 3) %>%
  arrange(co2) %>%
  head(10)

low_emitters[, c("country","co2")]
##       country co2
## 1  Antarctica   0
## 2  Antarctica   0
## 3  Antarctica   0
## 4  Antarctica   0
## 5  Antarctica   0
## 6  Antarctica   0
## 7  Antarctica   0
## 8  Antarctica   0
## 9  Antarctica   0
## 10 Antarctica   0

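Inference: All ten rows are Antarctica in different years, each with zero emissions, because the query sorts individual country-year rows instead of countries. Ranking the lowest emitters requires one value per country, for example emissions in the latest year.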
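To rank countries instead of rows, the data can first be reduced to one observation per country. The sketch below uses the latest year as that observation; the data frame is made up, and choosing the latest year (as opposed to a cumulative total or an average) is an assumption made for illustration.

```r
# Made-up data: two years per country, one missing value.
toy <- data.frame(
  country = c("A", "A", "B", "B", "C", "C"),
  year    = c(2023, 2024, 2023, 2024, 2023, 2024),
  co2     = c(0, 0, 3, 4, 1, NA)
)

# One row per country: the latest year, excluding missing values.
latest <- toy[toy$year == max(toy$year) & !is.na(toy$co2), ]

# Sort ascending so the lowest emitters come first.
lowest <- latest[order(latest$co2), c("country", "co2")]

head(lowest, 10)
```

Each country now appears at most once, so a single territory with many zero-emission years cannot fill the whole list.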

Q33: How variable are emissions globally?

var(data$co2, na.rm=TRUE)
## [1] 3889147

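Inference: The variance of about 3.89 million corresponds to a standard deviation of roughly 1,972 in the units of co2. It is computed across all rows and years, so it reflects differences between small and large emitters (and aggregates) as well as changes over time, and indicates a highly unequal distribution.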

Q34: How does emission growth indicate future risk?

ggplot(growth_data, aes(x=reorder(country, growth), y=growth)) +
  geom_bar(stat="identity", fill="red") +
  coord_flip()

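Inference: The bar chart ranks countries by the growth measure from Q13 (the difference between their highest and lowest annual emissions). Countries with the largest values have added the most emissions in absolute terms; whether they pose a future risk depends on their current trajectory, which this measure does not capture.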

Key Insights

Conclusion

This project analyzed global CO2 emission trends using data from Our World in Data. The results show a long-run increase in emissions, with a small number of countries accounting for the largest totals: the United States and China lead cumulatively, and China leads in the most recent year. The analysis does not include a fitted forecast, but the sustained upward trend suggests that, without intervention, emissions will continue to rise. Reducing them will require coordinated global action.