Image Source: larepublica
Methane (CH4) is a powerful greenhouse gas that contributes to global warming. Although it is not as abundant as CO2, methane is much more effective at trapping heat in the atmosphere: over a 100-year period it is about 28 times more potent than carbon dioxide. Methane is released through natural sources, like wetlands, and human activities, such as farming, energy production, and waste management. One big source is livestock, especially cows, which produce methane as part of their digestive process. Mining and fossil fuel extraction also release significant amounts of CH4 into the air.
In this project, I will use the dataset “Peru - Greenhouse Gas and Air Pollutant Emissions” from the HDX website. This dataset focuses on methane emissions in Peru, where agriculture, mining, and other industrial sectors contribute to pollution. While Lima is often highlighted as one of the most polluted cities in the world, this dataset also covers emissions from other parts of Peru, helping to show the bigger picture of pollution in the country.
The main research question that I will try to answer is: “What are the main sectors contributing to methane emissions in Peru?” To answer this question, I will use these variables:
* Year (quantitative): the year the emissions were recorded.
* Emissions Quantity (quantitative): the amount of methane emitted by each sector.
* Month (quantitative): the month in which the emissions were recorded (1 for January, 2 for February, etc.).
* Sector (categorical): the source of the emissions, such as agriculture, energy, etc.
Before analyzing the data, I’ll clean it up by handling any missing information, selecting only the columns I will use, and making sure all the data is in the right format.
# Loading libraries
library(readr)
## Warning: package 'readr' was built under R version 4.5.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.5.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Loading the dataset
emissions_df <- read_csv("per_ch4_city.csv")
## Rows: 66748 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): id, name, country, alternateNames, sector, gas
## dbl (3): year, month, emissionsQuantity
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(emissions_df)
## # A tibble: 6 × 9
## id name country alternateNames year month sector gas emissionsQuantity
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 ghs-f… Aban… PER ['Abancay'] 2024 4 agric… ch4 1.98
## 2 ghs-f… Aban… PER ['Abancay'] 2024 4 build… ch4 4.35
## 3 ghs-f… Aban… PER ['Abancay'] 2024 4 fluor… ch4 0
## 4 ghs-f… Aban… PER ['Abancay'] 2024 4 fores… ch4 0
## 5 ghs-f… Aban… PER ['Abancay'] 2024 4 fossi… ch4 0.276
## 6 ghs-f… Aban… PER ['Abancay'] 2024 4 manuf… ch4 0.112
# Structure of dataset
str(emissions_df)
## spc_tbl_ [66,748 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : chr [1:66748] "ghs-fua_3118" "ghs-fua_3118" "ghs-fua_3118" "ghs-fua_3118" ...
## $ name : chr [1:66748] "Abancay Urban Area, PER" "Abancay Urban Area, PER" "Abancay Urban Area, PER" "Abancay Urban Area, PER" ...
## $ country : chr [1:66748] "PER" "PER" "PER" "PER" ...
## $ alternateNames : chr [1:66748] "['Abancay']" "['Abancay']" "['Abancay']" "['Abancay']" ...
## $ year : num [1:66748] 2024 2024 2024 2024 2024 ...
## $ month : num [1:66748] 4 4 4 4 4 4 4 4 4 4 ...
## $ sector : chr [1:66748] "agriculture" "buildings" "fluorinated-gases" "forestry-and-land-use" ...
## $ gas : chr [1:66748] "ch4" "ch4" "ch4" "ch4" ...
## $ emissionsQuantity: num [1:66748] 1.978 4.347 0 0 0.276 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_character(),
## .. name = col_character(),
## .. country = col_character(),
## .. alternateNames = col_character(),
## .. year = col_double(),
## .. month = col_double(),
## .. sector = col_character(),
## .. gas = col_character(),
## .. emissionsQuantity = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
This output shows that each variable is correctly classified. The str() output also includes a “problems” attribute, which readr uses to record any parsing issues; no parsing warnings were reported when the file was loaded.
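The “problems” attribute can also be inspected directly. The following is a minimal sketch, assuming readr 2.0 or later; the tiny CSV and the names `toy_csv`, `toy_df`, and `parse_issues` are hypothetical and only illustrate what a parsing issue looks like. On the real data, the equivalent check would be `problems(emissions_df)`.

```r
library(readr)

# Hypothetical toy CSV: the second data row has text in a numeric column
toy_csv <- I("sector,emissionsQuantity\nwaste,12.5\npower,not_a_number\n")

# Declare the column types so the bad value is flagged instead of guessed around
toy_df <- suppressWarnings(
  read_csv(toy_csv,
           col_types = cols(sector = col_character(),
                            emissionsQuantity = col_double()))
)

# problems() returns one row per parsing issue; zero rows means a clean parse
parse_issues <- problems(toy_df)
n_issues <- nrow(parse_issues)
print(parse_issues)
```

A result with zero rows would support the statement that the file was read without parsing problems.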
emissions_1 <- emissions_df |>
select(name, year, month, sector, emissionsQuantity)
head(emissions_1)
## # A tibble: 6 × 5
## name year month sector emissionsQuantity
## <chr> <dbl> <dbl> <chr> <dbl>
## 1 Abancay Urban Area, PER 2024 4 agriculture 1.98
## 2 Abancay Urban Area, PER 2024 4 buildings 4.35
## 3 Abancay Urban Area, PER 2024 4 fluorinated-gases 0
## 4 Abancay Urban Area, PER 2024 4 forestry-and-land-use 0
## 5 Abancay Urban Area, PER 2024 4 fossil-fuel-operations 0.276
## 6 Abancay Urban Area, PER 2024 4 manufacturing 0.112
So, I kept only the five variables needed for this analysis.
sum(duplicated(emissions_1))
## [1] 31857
The number of duplicate rows (31,857) is nearly half of the dataset, which is concerning. It is important to remove them, but first I want to make sure I am not blindly removing data.
emissions_1 |>
  filter(duplicated(emissions_1)) |>
  head(10)
## # A tibble: 10 × 5
## name year month sector emissionsQuantity
## <chr> <dbl> <dbl> <chr> <dbl>
## 1 Abancay Urban Area, PER 2024 4 manufacturing 0
## 2 Abancay Urban Area, PER 2024 4 manufacturing 0
## 3 Abancay Urban Area, PER 2024 4 mineral-extraction 0
## 4 Abancay Urban Area, PER 2024 4 agriculture 0
## 5 Abancay Urban Area, PER 2024 4 fluorinated-gases 0
## 6 Abancay Urban Area, PER 2024 4 manufacturing 0
## 7 Abancay Urban Area, PER 2024 4 forestry-and-land-use 0
## 8 Abancay Urban Area, PER 2024 4 forestry-and-land-use 0
## 9 Abancay Urban Area, PER 2024 4 forestry-and-land-use 0
## 10 Abancay Urban Area, PER 2024 4 manufacturing 0
We can see that these duplicates are exact copies, and most of them correspond to zero emissions for the same location and sector. I can remove them safely.
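Looking at the first ten duplicates is only a spot check. A minimal sketch of a fuller check, using hypothetical toy data (the tibble `toy` and the names `dup_summary` and `zero_share` are assumptions for illustration), counts how many copies each repeated row has and what share of the extra copies are zeros:

```r
library(dplyr)

# Hypothetical toy data standing in for emissions_1
toy <- tibble(
  city = c("A", "A", "A", "B", "B"),
  sector = c("waste", "waste", "waste", "power", "power"),
  emissionsQuantity = c(0, 0, 0, 1.5, 2.0)
)

# Count how often each distinct row occurs and keep only the repeated ones
dup_summary <- toy |>
  count(city, sector, emissionsQuantity, name = "n_copies") |>
  filter(n_copies > 1)

# Share of the extra copies (what distinct() would drop) that are zero emissions
zero_share <- toy |>
  filter(duplicated(toy)) |>
  summarise(share_zero = mean(emissionsQuantity == 0)) |>
  pull(share_zero)

print(dup_summary)
print(zero_share)
```

Applied to the real data, a share close to 1 would back up the observation that the duplicates are mostly zero-emission rows.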
# Removing duplicates
emissions_2 <- emissions_1 |>
distinct()
colSums(is.na(emissions_2))
## name year month sector
## 0 0 0 0
## emissionsQuantity
## 0
There are no missing values, so nothing else needs to be removed.
To see the overall clean data, I will use the str() again to check the final results.
str(emissions_2)
## tibble [34,891 × 5] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:34891] "Abancay Urban Area, PER" "Abancay Urban Area, PER" "Abancay Urban Area, PER" "Abancay Urban Area, PER" ...
## $ year : num [1:34891] 2024 2024 2024 2024 2024 ...
## $ month : num [1:34891] 4 4 4 4 4 4 4 4 4 4 ...
## $ sector : chr [1:34891] "agriculture" "buildings" "fluorinated-gases" "forestry-and-land-use" ...
## $ emissionsQuantity: num [1:34891] 1.978 4.347 0 0 0.276 ...
As shown here, the cleaned data now contains 34,891 observations and 5 variables, which is still large enough for a good analysis.
Histogram of Emissions Quantity
A histogram helps show how emissions are distributed and whether they are concentrated around particular values.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.2
# Log-transformed histogram to handle outliers
ggplot(emissions_2, aes(x = emissionsQuantity)) +
geom_histogram(binwidth = 0.1, fill = "orange", color = "black") +
scale_x_log10() +
labs(title = "Distribution of Methane Emissions Quantity",
x = "Emissions Quantity (Tons)",
y = "Count (Observations)") +
theme_minimal()
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 8118 rows containing non-finite outside the scale range
## (`stat_bin()`).
The distribution of methane emissions is highly skewed. Most
observations are very small (close to zero), while a small number of
observations have very large emission values. Because the data span
several orders of magnitude, the x-axis is shown on a logarithmic
scale, which lets very small and very large values share a single axis.
The labels in scientific notation (e.g., 1e-03) represent very small
decimal values (0.001). The 8,118 rows removed in the warning above are
most likely observations with an emission value of exactly zero, because
log10(0) is not finite, so the histogram shows only positive values.
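The effect of a log scale on zero values can be seen in a small sketch. The vector `x` is hypothetical, and the `log1p()` alternative is an assumption offered for comparison, not the method used in this analysis:

```r
# Hypothetical emission values, including exact zeros
x <- c(0, 0, 0.001, 0.5, 12, 7000)

# log10(0) is -Inf, which cannot be placed on a log axis
log_of_zero <- log10(0)

# Number of rows a log10 scale would drop, and the values that remain
n_dropped <- sum(x == 0)
kept <- x[x > 0]

# Possible alternative: log1p(x) = log(1 + x) is defined at zero,
# so no rows are lost
x_log1p <- log1p(x)

print(n_dropped)
print(x_log1p)
```

On the real data, `sum(emissions_2$emissionsQuantity == 0)` would show whether the number of zeros matches the number of removed rows.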
max(emissions_2$emissionsQuantity, na.rm = TRUE)
## [1] 7112.063
The maximum methane emission value in the dataset is approximately 7112.06, meaning that at least one recorded source (a specific sector, location, and month combination) produces emissions that are much higher than most other observations. This shows that emissions are highly uneven, with a few extreme values dominating the dataset.
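To see which location, month, and sector the largest values belong to, the top rows can be pulled out directly. This is a minimal sketch on hypothetical toy data (`toy` and `top_rows` are illustrative names); on the real data, the same call would be applied to `emissions_2`.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the cleaned dataset
toy <- tibble(
  name = c("City A", "City A", "City B", "City B"),
  month = c(1, 2, 1, 2),
  sector = c("waste", "power", "waste", "agriculture"),
  emissionsQuantity = c(7000, 0.2, 35, 1.1)
)

# Keep the two rows with the largest emissions, sorted from highest to lowest
top_rows <- toy |>
  slice_max(emissionsQuantity, n = 2)

print(top_rows)
```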
Bar Chart for Sector Emissions
For this plot I will use the sector and emissions quantity variables. I will use fun = "sum" to summarize the data by adding up the emissions for each sector.
ggplot(emissions_2, aes(x = sector, y = emissionsQuantity, fill = sector)) +
geom_bar(stat = "summary", fun = "sum", color = "black") +
scale_y_continuous(labels = scales::label_comma()) +
labs(title = "Total Methane Emissions by Sector in Peru",
x = "Sector",
y = "Total Emissions Quantity (tons)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The bar chart shows the total methane emissions by sector in Peru. The
Waste sector has the highest total emissions, significantly higher than
all other sectors. Other sectors, like Agriculture, Buildings, and
Fossil-fuel operations, contribute much smaller amounts of methane
emissions. This indicates that waste management is the dominant source
of methane emissions in Peru, while most other sectors contribute much
less.
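The totals behind a bar chart like this can also be computed as a table, which makes each sector's share explicit. The sketch below uses hypothetical toy values (`toy` and `sector_totals` are illustrative names), so the numbers are not results from the dataset:

```r
library(dplyr)

# Hypothetical toy data: emissions by sector
toy <- tibble(
  sector = c("waste", "waste", "agriculture", "buildings", "buildings"),
  emissionsQuantity = c(80, 120, 10, 25, 15)
)

# Total per sector, each sector's share of the overall total, largest first
sector_totals <- toy |>
  group_by(sector) |>
  summarise(total = sum(emissionsQuantity), .groups = "drop") |>
  mutate(share = total / sum(total)) |>
  arrange(desc(total))

print(sector_totals)
```

Running the same pipeline on `emissions_2` would give the exact totals that the bar heights represent.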
In this section, a multiple linear regression model will be used to analyze how methane emissions vary across sectors in Peru, which addresses the research question. This helps identify which sectors contribute the most to overall emissions.
model <- lm(emissionsQuantity ~ sector + month, data = emissions_2)
summary(model)
##
## Call:
## lm(formula = emissionsQuantity ~ sector + month, data = emissions_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.3 -16.8 -2.9 -0.2 7030.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.02229 3.61551 0.836 0.403203
## sectorbuildings 13.26574 4.72614 2.807 0.005005 **
## sectorfluorinated-gases -3.54109 8.30081 -0.427 0.669676
## sectorforestry-and-land-use -3.52305 7.98186 -0.441 0.658939
## sectorfossil-fuel-operations 16.63967 4.89943 3.396 0.000684 ***
## sectormanufacturing -3.05457 4.11205 -0.743 0.457587
## sectormineral-extraction -3.53606 6.14058 -0.576 0.564720
## sectorpower -3.13749 6.26037 -0.501 0.616257
## sectortransportation -3.10808 4.88870 -0.636 0.524931
## sectorwaste 78.36434 4.11205 19.057 < 2e-16 ***
## month 0.07559 0.37468 0.202 0.840124
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 237.2 on 34880 degrees of freedom
## Multiple R-squared: 0.01421, Adjusted R-squared: 0.01393
## F-statistic: 50.29 on 10 and 34880 DF, p-value: < 2.2e-16
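EmissionsQuantity = β₀ + β₁(Sector₁) + β₂(Sector₂) + … + β₉(Sector₉) + β₁₀(Month), where each Sectorᵢ is a 0/1 indicator for one of the nine non-baseline sectors and agriculture is the baseline.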
The results show that some sectors have a statistically significant effect on methane emissions. In particular, the waste sector has the largest coefficient: its emissions are on average about 78.36 tons higher per observation than those of the baseline sector (agriculture), holding month constant, with a very small p-value (p < 0.001), indicating strong statistical significance.
The fossil-fuel operations and buildings sectors are also statistically significant (p < 0.05), meaning their average emissions per observation are higher than those of the baseline sector. The other sectors, as well as the month variable, have high p-values (greater than 0.05), which suggests that their emissions are not significantly different from the baseline.
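Reading sector coefficients as differences from a baseline can be checked on a small example. In this sketch the data frame `toy` is hypothetical and the model uses sector as the only predictor, so each coefficient equals the difference between that sector's mean and the baseline mean. With month also in the model, as in the analysis above, this holds only approximately.

```r
# Hypothetical toy data: two observations per sector
toy <- data.frame(
  sector = c("agriculture", "agriculture", "buildings", "buildings",
             "waste", "waste"),
  emissionsQuantity = c(2, 4, 15, 17, 80, 84)
)

# R takes the alphabetically first level (agriculture) as the baseline
fit <- lm(emissionsQuantity ~ sector, data = toy)

# Mean emissions per sector, for comparison with the coefficients
group_means <- tapply(toy$emissionsQuantity, toy$sector, mean)

# The intercept is the baseline mean; the others are differences from it
print(coef(fit))
print(group_means - group_means["agriculture"])
```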
The adjusted R² value is approximately 0.014, meaning that the model (sector and month together) explains about 1.4% of the variation in methane emissions. This indicates that while sector is useful for identifying which sources emit more methane, most of the variation between observations comes from other factors.
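For reference, R² and adjusted R² can be computed by hand from the residuals. The sketch below uses hypothetical toy data (`toy` and the variable names are illustrative) and compares the manual values with those reported by `summary()`. Plugging the reported values into the same formula (R² = 0.01421, n = 34,891, p = 10) gives an adjusted R² of about 0.01393, matching the output above.

```r
# Hypothetical toy data with a sector factor and a numeric month
toy <- data.frame(
  sector = rep(c("agriculture", "buildings", "waste"), each = 4),
  month = rep(1:4, times = 3),
  emissionsQuantity = c(2, 4, 3, 5, 15, 17, 14, 20, 80, 84, 70, 95)
)

fit <- lm(emissionsQuantity ~ sector + month, data = toy)

# R-squared: 1 minus residual sum of squares over total sum of squares
y <- toy$emissionsQuantity
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared penalizes the number of predictors p (intercept excluded)
n <- nrow(toy)
p <- length(coef(fit)) - 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r2 = r2, adj_r2 = adj_r2))
```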
Overall, the model is statistically significant (F-test p < 0.001), showing that average methane emissions differ across sectors in Peru.
In this section, I will create an interactive visualization to explore how methane emissions vary across sectors over time. Instead of plotting all sectors in the dataset, I focus on the main contributors identified in the regression analysis to improve clarity and interpretation. This allows for a clearer comparison of emission patterns and highlights the sectors that have the strongest impact on methane emissions in Peru. One important step before creating the visualization is to summarize the data with group_by(), summarise(), and filter(), which improves readability.
# Loading libraries for the visualization
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 4.5.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#Summarize data
emissions_plot <- emissions_2 |>
group_by(sector, month) |>
summarise(total_emissions = sum(emissionsQuantity)) |>
filter(sector %in% c("waste", "fossil-fuel-operations", "buildings", "agriculture"))
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by sector and month.
## ℹ Output is grouped by sector.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(sector, month))` for per-operation grouping
## (`?dplyr::dplyr_by`) instead.
Finally, I will create a multi-line time series plot, since it is the best fit for this data.
plot <- ggplot(emissions_plot, aes( x = month, y = total_emissions,
color = sector, group = sector)) +
geom_line(linewidth = 0.9, alpha = 0.7) +
geom_point(size = 1.3, alpha = 0.7) +
scale_color_manual(values = c(
"waste" = "purple", "fossil-fuel-operations" = "brown",
"buildings" = "darkgreen", "agriculture" = "orange")) +
labs( title = "Monthly Methane Emissions for Key Sectors in Peru",
x = "Month", y = "Total Emissions",
color = "Sector",
caption = "Source: HDX Peru Greenhouse Gas Dataset") +
theme_minimal()
ggplotly(plot)
This multi-line time series plot shows monthly methane emissions across four key sectors in Peru: waste, fossil-fuel operations, buildings, and agriculture. The waste sector clearly stands out as the largest contributor, with emissions consistently much higher than the other sectors throughout all months. Although there are some small fluctuations, its level remains dominant across the year.
Fossil-fuel operations and buildings show moderate emission levels that stay relatively stable over time, with slight increases in the middle and later months. In contrast, agriculture has the lowest emissions overall and gradually decreases toward the end of the year.
Overall, the pattern suggests that methane emissions are heavily driven by waste-related activities, while the other sectors contribute smaller and more stable amounts. These results are consistent with the regression analysis, where waste was identified as the strongest predictor of methane emissions.
Image Source: https://www.woimacorporation.com/drowning-in-waste-case-lima-peru/
This project investigated the main sources of methane emissions in Peru, guided by the research question: which sectors contribute most to methane emissions? To answer this, the dataset was first cleaned by removing duplicated observations and selecting relevant variables for analysis. This step was necessary because the original dataset contained many repeated entries, which is common in real-world environmental data and can affect the reliability of results if not properly addressed.
The analysis showed clear differences across sectors, with fossil-fuel operations, buildings, and especially waste showing higher methane emissions over time. These patterns were consistent in both the multi-line time series plot and the regression model, where sector differences were statistically significant. However, the model explains only a small portion of the overall variation in emissions, suggesting that other influencing factors are not captured in this dataset.
Overall, the results highlight that methane emissions in Peru are unevenly distributed across sectors, and demonstrate the importance of both data cleaning and exploratory analysis when working with real-world datasets. A key limitation of this study is the restricted set of variables, which limits the ability to fully explain emission patterns. Future work could improve this analysis by including additional socio-economic or geographic factors and by using more advanced modeling approaches to better understand the drivers of methane emissions.
Sources: Woima Corporation, https://www.woimacorporation.com/drowning-in-waste-case-lima-peru/; HDX, https://data.humdata.org/dataset/per-climate-trace