Project 2: Methane Gas Emissions in Peru

View of Pollution in Peru. Image source: larepublica

A. Introduction

Methane (CH4) is a potent greenhouse gas that contributes to global warming. Although it is less abundant than CO2, methane is much more effective at trapping heat in the atmosphere, making it about 28 times more potent than carbon dioxide over 100 years. Methane is released by natural sources, such as wetlands, and by human activities, such as farming, energy production, and waste management. One big source is livestock, especially cows, which produce methane as part of their digestive process. Mining and fossil fuel extraction also release significant amounts of CH4 into the air.

In this project, I will use the dataset "Peru - Greenhouse Gas and Air Pollutant Emissions" from the HDX website. This dataset focuses on methane emissions in Peru, where agriculture, mining, and other industrial sectors contribute to pollution. While Lima is often highlighted as one of the most polluted cities in the world, this dataset also covers emissions from other parts of Peru, which helps in understanding the bigger picture of pollution in the country.

The main research question that I will try to answer is: “What are the main sectors contributing to methane emissions in Peru?” To answer this question I will use these variables:

* Year (quantitative): the year the emissions were recorded.
* Emissions Quantity (quantitative): the amount of methane emitted in each sector.
* Month (quantitative): the month when the emissions were recorded (1 for January, 2 for February, etc.).
* Sector (categorical): the source of the emissions, such as agriculture, energy, etc.

Before analyzing the data, I’ll clean it up by handling any missing information, selecting only the columns I will use, and making sure all the data is in the right format.

# Loading libraries 
library(readr)

## Warning: package 'readr' was built under R version 4.5.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.5.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Reading the dataset from a local CSV file
emissions_df <- read_csv("per_ch4_city.csv")

## Rows: 66748 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): id, name, country, alternateNames, sector, gas
## dbl (3): year, month, emissionsQuantity
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(emissions_df)

## # A tibble: 6 × 9
##   id     name  country alternateNames  year month sector gas   emissionsQuantity
##   <chr>  <chr> <chr>   <chr>          <dbl> <dbl> <chr>  <chr>             <dbl>
## 1 ghs-f… Aban… PER     ['Abancay']     2024     4 agric… ch4               1.98 
## 2 ghs-f… Aban… PER     ['Abancay']     2024     4 build… ch4               4.35 
## 3 ghs-f… Aban… PER     ['Abancay']     2024     4 fluor… ch4               0    
## 4 ghs-f… Aban… PER     ['Abancay']     2024     4 fores… ch4               0    
## 5 ghs-f… Aban… PER     ['Abancay']     2024     4 fossi… ch4               0.276
## 6 ghs-f… Aban… PER     ['Abancay']     2024     4 manuf… ch4               0.112

B. Cleaning and Exploration of Data

To further understand my dataset, I will use the str() function and confirm that my variables are correctly classified.

# Structure of dataset
str(emissions_df)

## spc_tbl_ [66,748 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id               : chr [1:66748] "ghs-fua_3118" "ghs-fua_3118" "ghs-fua_3118" "ghs-fua_3118" ...
##  $ name             : chr [1:66748] "Abancay Urban Area, PER" "Abancay Urban Area, PER" "Abancay Urban Area, PER" "Abancay Urban Area, PER" ...
##  $ country          : chr [1:66748] "PER" "PER" "PER" "PER" ...
##  $ alternateNames   : chr [1:66748] "['Abancay']" "['Abancay']" "['Abancay']" "['Abancay']" ...
##  $ year             : num [1:66748] 2024 2024 2024 2024 2024 ...
##  $ month            : num [1:66748] 4 4 4 4 4 4 4 4 4 4 ...
##  $ sector           : chr [1:66748] "agriculture" "buildings" "fluorinated-gases" "forestry-and-land-use" ...
##  $ gas              : chr [1:66748] "ch4" "ch4" "ch4" "ch4" ...
##  $ emissionsQuantity: num [1:66748] 1.978 4.347 0 0 0.276 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_character(),
##   ..   name = col_character(),
##   ..   country = col_character(),
##   ..   alternateNames = col_character(),
##   ..   year = col_double(),
##   ..   month = col_double(),
##   ..   sector = col_character(),
##   ..   gas = col_character(),
##   ..   emissionsQuantity = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

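This output shows that each variable has the expected type: the identifiers and sector are character columns, while year, month, and emissionsQuantity are numeric. The str() output also lists a “problems” attribute. readr attaches it to the data it reads so that parsing issues can be looked up later; its presence alone does not mean anything went wrong, and read_csv() printed no parsing warning when the file was loaded.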
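The “problems” attribute mentioned above can be inspected with readr's problems() function. The following is a minimal sketch on a small hypothetical file (the temporary CSV and its contents are made up for illustration and are not part of the Peru dataset); one value is deliberately invalid so that there is something to report.

```r
library(readr)

# Hypothetical two-row file; the second emissions value is deliberately not a number.
tmp <- tempfile(fileext = ".csv")
writeLines(c("year,emissionsQuantity", "2024,1.98", "2024,abc"), tmp)

# Force both columns to be parsed as doubles so "abc" becomes a parsing problem.
toy <- suppressWarnings(
  read_csv(tmp,
           col_types = cols(year = col_double(),
                            emissionsQuantity = col_double()))
)

# problems() returns one row per value that could not be parsed;
# an empty result means the file was read cleanly.
parse_issues <- problems(toy)
print(parse_issues)
```

Running problems(emissions_df) on the real data in the same way would be expected to return an empty table if the file loaded cleanly.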

The next step is to select only the variables relevant to my analysis.

emissions_1 <- emissions_df |>
  select(name, year, month, sector, emissionsQuantity)
head(emissions_1)

## # A tibble: 6 × 5
##   name                     year month sector                 emissionsQuantity
##   <chr>                   <dbl> <dbl> <chr>                              <dbl>
## 1 Abancay Urban Area, PER  2024     4 agriculture                        1.98 
## 2 Abancay Urban Area, PER  2024     4 buildings                          4.35 
## 3 Abancay Urban Area, PER  2024     4 fluorinated-gases                  0    
## 4 Abancay Urban Area, PER  2024     4 forestry-and-land-use              0    
## 5 Abancay Urban Area, PER  2024     4 fossil-fuel-operations             0.276
## 6 Abancay Urban Area, PER  2024     4 manufacturing                      0.112

This leaves only the five variables needed for the analysis.

After that, I will check whether there are any duplicated rows in the data and remove them so that the results of the later model are more accurate.

sum(duplicated(emissions_1))

## [1] 31857

The number of duplicated rows is nearly half of the dataset (31,857 of 66,748 rows, about 48%), which is concerning. It is important to remove them, but first I want to make sure I am not blindly removing data.

emissions_1 |>
  filter(duplicated(emissions_1)) |>
  head(10)

## # A tibble: 10 × 5
##    name                     year month sector                emissionsQuantity
##    <chr>                   <dbl> <dbl> <chr>                             <dbl>
##  1 Abancay Urban Area, PER  2024     4 manufacturing                         0
##  2 Abancay Urban Area, PER  2024     4 manufacturing                         0
##  3 Abancay Urban Area, PER  2024     4 mineral-extraction                    0
##  4 Abancay Urban Area, PER  2024     4 agriculture                           0
##  5 Abancay Urban Area, PER  2024     4 fluorinated-gases                     0
##  6 Abancay Urban Area, PER  2024     4 manufacturing                         0
##  7 Abancay Urban Area, PER  2024     4 forestry-and-land-use                 0
##  8 Abancay Urban Area, PER  2024     4 forestry-and-land-use                 0
##  9 Abancay Urban Area, PER  2024     4 forestry-and-land-use                 0
## 10 Abancay Urban Area, PER  2024     4 manufacturing                         0

These duplicated rows are identical across all five selected columns, and the ones shown here all record zero emissions for the same location and sector. Based on this, I consider it safe to remove them.

# Removing duplicates
emissions_2 <- emissions_1 |>
  distinct()
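Looking at the first ten duplicated rows gives only a partial picture, so one option is to count how many of the duplicated rows are zeros before removing them. The sketch below uses base R on a small hypothetical data frame (the toy values and the City A name are made up); the same idea could be applied to emissions_1.

```r
# Hypothetical toy data: one repeated non-zero row and several repeated zero rows.
toy <- data.frame(
  name   = rep("City A", 6),
  sector = c("waste", "waste", "manufacturing",
             "manufacturing", "manufacturing", "buildings"),
  emissionsQuantity = c(5.2, 5.2, 0, 0, 0, 1.1)
)

# duplicated() flags every repeat of a row that has already appeared once.
is_dup   <- duplicated(toy)
dup_rows <- toy[is_dup, ]

n_dup      <- nrow(dup_rows)                         # number of repeated rows
n_zero_dup <- sum(dup_rows$emissionsQuantity == 0)   # repeats that are zeros
share_zero <- n_zero_dup / n_dup

# unique() keeps the first copy of each row, like dplyr::distinct().
deduped <- unique(toy)

cat("Duplicated rows:", n_dup, "\n")
cat("Share of duplicates with zero emissions:", round(share_zero, 2), "\n")
```

If a noticeable share of the duplicates were non-zero, it would be worth checking whether they are genuine repeats or separate records that happen to share the same values.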

Finally, I will check for missing values in each column and remove them if necessary.

colSums(is.na(emissions_2))

##              name              year             month            sector 
##                 0                 0                 0                 0 
## emissionsQuantity 
##                 0

There are no missing values, so nothing else needs to be removed.

To see the cleaned data as a whole, I will use str() again to check the final result.

str(emissions_2)

## tibble [34,891 × 5] (S3: tbl_df/tbl/data.frame)
##  $ name             : chr [1:34891] "Abancay Urban Area, PER" "Abancay Urban Area, PER" "Abancay Urban Area, PER" "Abancay Urban Area, PER" ...
##  $ year             : num [1:34891] 2024 2024 2024 2024 2024 ...
##  $ month            : num [1:34891] 4 4 4 4 4 4 4 4 4 4 ...
##  $ sector           : chr [1:34891] "agriculture" "buildings" "fluorinated-gases" "forestry-and-land-use" ...
##  $ emissionsQuantity: num [1:34891] 1.978 4.347 0 0 0.276 ...

As shown here, after cleaning, the data contains 34,891 observations and 5 variables, which is still large enough for a meaningful analysis.

C. Exploring Variables Through Plots

Quantitative Variable Exploration

Histogram of Emissions Quantity

A histogram will help show how emissions are distributed and whether they are concentrated around particular values.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.5.2

# Log-transformed histogram to handle outliers
ggplot(emissions_2, aes(x = emissionsQuantity)) +
  geom_histogram(binwidth = 0.1, fill = "orange", color = "black") +
  scale_x_log10() + 
  labs(title = "Distribution of Methane Emissions Quantity", 
       x = "Emissions Quantity (Tons)", 
       y = "Count (Observations)") +
  theme_minimal()

## Warning in scale_x_log10(): log-10 transformation introduced infinite values.

## Warning: Removed 8118 rows containing non-finite outside the scale range
## (`stat_bin()`).

The distribution of methane emissions is highly right-skewed. Most observations are very small (close to zero), while a small number of observations have very large emission values. Because the data span several orders of magnitude, the x-axis uses a logarithmic scale, which lets very small and very large values share one readable axis. Tick labels in scientific notation (e.g., 1e-03) represent small decimal values (0.001). A log scale cannot display zero: the warnings above show that 8,118 rows with non-finite log values (emissions of exactly zero) were dropped, so the histogram describes only the non-zero observations.
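The effect of zeros on a log scale can be checked with a few lines of base R. This is a minimal sketch on hypothetical values, not the real data; the log10(x + 1) alternative is shown only as one possible way to keep zeros on the axis, not as what the plot above does.

```r
# Hypothetical emission values, including two exact zeros.
x <- c(0, 0, 0.001, 0.5, 12, 7000)

# How many observations are exactly zero?
n_zero <- sum(x == 0)

# log10(0) is -Inf, which a log-scaled axis cannot place,
# so these rows are dropped from a log-scale histogram.
log_x     <- log10(x)
n_dropped <- sum(!is.finite(log_x))

# One possible alternative: shift by 1 before taking the log,
# so that zero maps to 0 and every value stays finite.
shifted_log_x <- log10(x + 1)

cat("Zeros:", n_zero, " Dropped on log scale:", n_dropped, "\n")
```

Counting the zeros separately, for example with sum(emissions_2$emissionsQuantity == 0), makes it explicit how much of the data the histogram leaves out.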

max(emissions_2$emissionsQuantity, na.rm = TRUE)

## [1] 7112.063

The maximum methane emission value in the dataset is approximately 7112.06, meaning that at least one recorded source (a specific sector, location, and month combination) produces emissions that are much higher than most other observations. This shows that emissions are highly uneven, with a few extreme values dominating the dataset.

Categorical Variable Exploration

Bar Chart for Sector Emissions

For this plot I will use the sector and emissions quantity variables. I will use fun = "sum" to summarize the data by adding up the emissions for each sector.

ggplot(emissions_2, aes(x = sector, y = emissionsQuantity, fill = sector)) +
  geom_bar(stat = "summary", fun = "sum", color = "black") +
  scale_y_continuous(labels = scales::label_comma()) +  
  labs(title = "Total Methane Emissions by Sector in Peru",
       x = "Sector",
       y = "Total Emissions Quantity (tons)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The bar chart shows the total methane emissions by sector in Peru. The Waste sector has the highest total emissions, significantly higher than all other sectors. Other sectors, like Agriculture, Buildings, and Fossil-fuel operations, contribute much smaller amounts of methane emissions. This indicates that waste management is the dominant source of methane emissions in Peru, while most other sectors contribute much less.
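The totals behind a bar chart like this can also be printed as a table, which makes it easier to quote exact values and shares. The sketch below uses base R on a small hypothetical data frame (the numbers are invented for illustration); with the real data, toy would be replaced by emissions_2.

```r
# Hypothetical toy data with the same two columns used in the bar chart.
toy <- data.frame(
  sector = c("waste", "waste", "agriculture", "buildings", "buildings"),
  emissionsQuantity = c(80, 120, 3, 10, 15)
)

# Sum emissions within each sector, then sort from largest to smallest.
totals <- sort(tapply(toy$emissionsQuantity, toy$sector, sum),
               decreasing = TRUE)

# Express each sector's total as a percentage of all emissions.
share_pct <- round(100 * totals / sum(totals), 1)

print(data.frame(total = totals, share_pct = share_pct))
```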

D. Multiple Linear Regression Model

In this section, a multiple linear regression model will be used to analyze how methane emissions vary across sectors in Peru, with month included as a second predictor. This addresses the research question by identifying which sectors have the highest average emissions.

model <- lm(emissionsQuantity ~ sector + month, data = emissions_2)
summary(model)

## 
## Call:
## lm(formula = emissionsQuantity ~ sector + month, data = emissions_2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -82.3  -16.8   -2.9   -0.2 7030.6 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   3.02229    3.61551   0.836 0.403203    
## sectorbuildings              13.26574    4.72614   2.807 0.005005 ** 
## sectorfluorinated-gases      -3.54109    8.30081  -0.427 0.669676    
## sectorforestry-and-land-use  -3.52305    7.98186  -0.441 0.658939    
## sectorfossil-fuel-operations 16.63967    4.89943   3.396 0.000684 ***
## sectormanufacturing          -3.05457    4.11205  -0.743 0.457587    
## sectormineral-extraction     -3.53606    6.14058  -0.576 0.564720    
## sectorpower                  -3.13749    6.26037  -0.501 0.616257    
## sectortransportation         -3.10808    4.88870  -0.636 0.524931    
## sectorwaste                  78.36434    4.11205  19.057  < 2e-16 ***
## month                         0.07559    0.37468   0.202 0.840124    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 237.2 on 34880 degrees of freedom
## Multiple R-squared:  0.01421,    Adjusted R-squared:  0.01393 
## F-statistic: 50.29 on 10 and 34880 DF,  p-value: < 2.2e-16

The regression model equation can be written as:

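EmissionsQuantity = β₀ + β₁(Sector₁) + β₂(Sector₂) + … + β₉(Sector₉) + β₁₀(Month) + ε

Here each Sectorₖ is a 0/1 indicator for one of the nine non-baseline sectors, and ε is the error term.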

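The model compares methane emissions across sectors by using one sector as a reference and measuring how much higher or lower the other sectors are. The baseline sector is agriculture. The intercept, about 3.02 tons, is the model's estimate for agriculture at month 0, and the month coefficient adds about 0.08 tons per month, although this month effect is not statistically significant (p = 0.84).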
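How R builds these comparisons can be seen on a small example. The sketch below fits a model with sector as the only predictor on hypothetical data (the values are invented); in that simpler case the intercept equals the baseline sector's mean and each coefficient is the difference between a sector's mean and the baseline's. With month added, as in the model above, the intercept is instead the baseline estimate at month 0.

```r
# Hypothetical toy data: three sectors with three observations each.
toy <- data.frame(
  sector = rep(c("agriculture", "buildings", "waste"), each = 3),
  emissionsQuantity = c(2, 3, 4, 14, 16, 18, 70, 80, 90)
)

# lm() turns the character column into a factor; the first level
# alphabetically ("agriculture") becomes the reference sector.
fit <- lm(emissionsQuantity ~ sector, data = toy)

# Group means for comparison with the fitted coefficients.
group_means <- tapply(toy$emissionsQuantity, toy$sector, mean)

print(coef(fit))      # intercept and differences from the baseline
print(group_means)    # raw sector means
```

A different reference sector could be chosen by converting sector to a factor and calling relevel() with the ref argument before fitting.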

The results show that some sectors differ significantly from the baseline. In particular, the waste sector has the largest coefficient: its estimated emissions are about 78.36 tons higher than those of the baseline sector, holding month constant, with a very small p-value (p < 0.001), indicating strong statistical significance.

The fossil-fuel operations (p < 0.001) and buildings (p ≈ 0.005) sectors are also statistically significant, with estimated emissions about 16.64 and 13.27 tons above the baseline, respectively. The remaining sectors have p-values greater than 0.05, which suggests that their emissions are not significantly different from the baseline.

The adjusted R² value is approximately 0.014, meaning that sector and month together explain about 1.4% of the variation in methane emissions. So although the model identifies which sectors have higher average emissions, most of the variation between individual observations comes from factors that are not included here.
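The R² and adjusted R² reported by summary() can be reproduced by hand, which helps show what "share of variation explained" means. This is a minimal sketch on hypothetical data (values invented), using the standard formulas R² = 1 − SS_res / SS_tot and adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1).

```r
# Hypothetical toy data with a lot of spread inside each sector.
toy <- data.frame(
  sector = rep(c("agriculture", "buildings", "waste"), each = 4),
  emissionsQuantity = c(0, 1, 2, 9, 5, 10, 15, 30, 0, 20, 100, 200)
)

fit <- lm(emissionsQuantity ~ sector, data = toy)

# Variation left over after the model vs. total variation around the mean.
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((toy$emissionsQuantity - mean(toy$emissionsQuantity))^2)
r2_manual <- 1 - ss_res / ss_tot

# Adjusted R-squared penalizes the number of predictors p.
n <- nrow(toy)
p <- length(coef(fit)) - 1
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

cat("R-squared:", round(r2_manual, 3),
    " Adjusted:", round(adj_r2_manual, 3), "\n")
```

A low value arises when the spread within each sector is large compared with the differences between sector averages, which matches the highly skewed distribution seen in the histogram.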

Overall, the model is statistically significant (F-test, p < 0.001), showing that average methane emissions differ across sectors in Peru, even though the model's explanatory power is low.

E. Plotly Interactive Plot

In this section, I will create an interactive visualization to explore how methane emissions vary across sectors over time. Instead of plotting all sectors in the dataset, I focus on the main contributors identified in the regression analysis, together with agriculture as the baseline, to improve clarity and interpretation. This allows for a clearer comparison of emission patterns and highlights the sectors with the highest methane emissions in Peru. One important step in creating the visualization is to summarize the data first, using group_by(), summarise(), and filter(), so that the plot is easier to read.

# Loading libraries for the visualization
library(ggplot2)
library(plotly)

## Warning: package 'plotly' was built under R version 4.5.3

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

# Summarize data
emissions_plot <- emissions_2 |>
  group_by(sector, month) |>
  summarise(total_emissions = sum(emissionsQuantity)) |>
  filter(sector %in% c("waste", "fossil-fuel-operations", "buildings", "agriculture"))

## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by sector and month.
## ℹ Output is grouped by sector.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(sector, month))` for per-operation grouping
##   (`?dplyr::dplyr_by`) instead.
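The regrouping message above is informational, and one way to avoid it is to state the grouping of the result explicitly. The sketch below assumes dplyr is installed, as elsewhere in this report, and uses a small hypothetical data frame (values and the key_sectors name are invented); it filters first and then returns an ungrouped summary with .groups = "drop".

```r
library(dplyr)

# Hypothetical toy data with the three columns used in the summary.
toy <- data.frame(
  sector = c("waste", "waste", "waste", "agriculture", "agriculture", "power"),
  month  = c(1, 1, 2, 1, 2, 1),
  emissionsQuantity = c(10, 20, 5, 1, 2, 99)
)

key_sectors <- c("waste", "agriculture")

monthly <- toy |>
  # Filtering before summarising avoids computing totals that are discarded.
  filter(sector %in% key_sectors) |>
  group_by(sector, month) |>
  # .groups = "drop" returns an ungrouped result and silences the message.
  summarise(total_emissions = sum(emissionsQuantity), .groups = "drop")

print(monthly)
```

The totals are the same as with the default behaviour; only the grouping of the returned table changes.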

Finally, I will create a multi-line time plot, which suits this kind of monthly data well.

plot <- ggplot(emissions_plot, aes( x = month, y = total_emissions,
                                   color = sector, group = sector)) +
  geom_line(linewidth = 0.9, alpha = 0.7) +
  geom_point(size = 1.3, alpha = 0.7) +

  scale_color_manual(values = c(
    "waste" = "purple",  "fossil-fuel-operations" = "brown",
    "buildings" = "darkgreen", "agriculture" = "orange")) +

  labs( title = "Monthly Methane Emissions for Key Sectors in Peru",
        x = "Month", y = "Total Emissions",
        color = "Sector", 
        caption = "Source: HDX Peru Greenhouse Gas Dataset") +

  theme_minimal() 
 
ggplotly(plot)

This multi-line time series plot shows monthly methane emissions across four key sectors in Peru: waste, fossil-fuel operations, buildings, and agriculture. The waste sector clearly stands out as the largest contributor, with emissions consistently much higher than the other sectors throughout all months. Although there are some small fluctuations, its level remains dominant across the year.

Fossil-fuel operations and buildings show moderate emission levels that stay relatively stable over time, with slight increases in the middle and later months. In contrast, agriculture has the lowest emissions overall and gradually decreases toward the end of the year.

Overall, the pattern suggests that methane emissions are heavily driven by waste-related activities, while the other sectors contribute smaller and more stable amounts. These results are consistent with the regression analysis, where waste was identified as the strongest predictor of methane emissions.

F. Conclusion

Waste in Peru. Image source: https://www.woimacorporation.com/drowning-in-waste-case-lima-peru/

This project investigated the main sources of methane emissions in Peru, guided by the research question: which sectors contribute most to methane emissions? To answer this, the dataset was first cleaned by removing duplicated observations and selecting relevant variables for analysis. This step was necessary because the original dataset contained many repeated entries, which is common in real-world environmental data and can affect the reliability of results if not properly addressed.

The analysis showed clear differences across sectors, with fossil-fuel operations, buildings, and especially waste showing higher methane emissions over time. These patterns were consistent in both the multi-line time series plot and the regression model, where sector differences were statistically significant. However, the model explains only a small portion of the overall variation in emissions, suggesting that other influencing factors are not captured in this dataset.

Overall, the results highlight that methane emissions in Peru are unevenly distributed across sectors, and demonstrate the importance of both data cleaning and exploratory analysis when working with real-world datasets. A key limitation of this study is the restricted set of variables, which limits the ability to fully explain emission patterns. Future work could improve this analysis by including additional socio-economic or geographic factors and by using more advanced modeling approaches to better understand the drivers of methane emissions.

Sources: WOIMA Corporation, https://www.woimacorporation.com/drowning-in-waste-case-lima-peru/; HDX, https://data.humdata.org/dataset/per-climate-trace