The analysis of Baltimore Trash Collection (2014-2023)

Author

Su Thet Hninn

Baltimore Trash Collection

Background Information

Baltimore, like many urban areas, faces significant challenges related to waste management and trash disposal. Littering and illegal dumping are persistent problems in various neighborhoods, affecting the city’s cleanliness and public health. The city provides regular waste collection services, including trash, recycling, and bulk item pickup. However, inefficiencies and delays in these services can contribute to trash accumulation. Trash problems are often more severe in low-income neighborhoods. These areas may experience less frequent waste collection, higher rates of illegal dumping, and fewer resources for cleanup efforts.

Using Baltimore’s trash collection data, I am interested in analyzing the distribution of trash collection from 2014 to 2023. Additionally, I analyze seasonal patterns across the different types of trash.

Among the 16 variables in the dataset, I am particularly interested in the year and month of trash collection, the total weight of trash in tons, the total volume of trash, and the variables representing different types of trash (plastic bottles, polystyrene, cigarette butts, glass bottles, plastic bags, wrappers, and sports balls). My plan is to explore the distribution of trash collection from 2014 to 2023 and to analyze the patterns for 2018, the year with the highest trash volume.

I got the Baltimore Trash Collection Dataset from the Open Baltimore Website.

Analyzing the data

Load the libraries:

# Load necessary libraries
library(tidyverse)
library(tinytex)
library(dplyr)
library(RColorBrewer)
library(ggplot2)

Load the dataset:

# Load the Baltimore dataset using the read_csv() command 
setwd("/Users/hlinethitzinwai/Documents/1 - College/DATA 110/Project 1/Trash_Dataset")
Baltimore_Trash <- read_csv("trash_collection_Baltimore_2014-23.csv")

New names:
• `` -> `...15`
• `` -> `...16`
Rows: 630 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Month, Date
dbl  (2): Dumpster, Year
num (10): Weight (tons), Volume (cubic yards), Plastic Bottles, Polystyrene,...
lgl  (2): ...15, ...16

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Explore the dataset:

# Display the first few rows of the dataset
head(Baltimore_Trash)

# A tibble: 6 × 16
  Dumpster Month  Year Date      `Weight (tons)` `Volume (cubic yards)`
     <dbl> <chr> <dbl> <chr>               <dbl>                  <dbl>
1        1 May    2014 5/16/2014            4.31                     18
2        2 May    2014 5/16/2014            2.74                     13
3        3 May    2014 5/16/2014            3.45                     15
4        4 May    2014 5/17/2014            3.1                      15
5        5 May    2014 5/17/2014            4.06                     18
6        6 May    2014 5/20/2014            2.71                     13
# ℹ 10 more variables: `Plastic Bottles` <dbl>, Polystyrene <dbl>,
#   `Cigarette Butts` <dbl>, `Glass Bottles` <dbl>, `Plastic Bags` <dbl>,
#   Wrappers <dbl>, `Sports Balls` <dbl>, `Homes Powered*` <dbl>, ...15 <lgl>,
#   ...16 <lgl>

# Display the names of the columns in the dataset
names(Baltimore_Trash)

 [1] "Dumpster"             "Month"                "Year"                
 [4] "Date"                 "Weight (tons)"        "Volume (cubic yards)"
 [7] "Plastic Bottles"      "Polystyrene"          "Cigarette Butts"     
[10] "Glass Bottles"        "Plastic Bags"         "Wrappers"            
[13] "Sports Balls"         "Homes Powered*"       "...15"               
[16] "...16"

Clean up the data:

Make all headers lowercase and remove spaces:

The dataset has uppercase letters and spaces in the variable names, which can lead to typing mistakes in commands. So I will make all headers lowercase and remove the spaces.

# make the headers lowercase and remove spaces
names(Baltimore_Trash) <- tolower(names(Baltimore_Trash))
names(Baltimore_Trash) <- gsub(" ","",names(Baltimore_Trash))

Rename the variables:

Some variables have long names, so I will rename them.

# rename the columns
Baltimore_Trash <- Baltimore_Trash |> 
  rename(weight = "weight(tons)", 
         volume = "volume(cubicyards)")

Select only certain variables:

After cleaning up the variable names, I want to look at the structure of the data. Since there are 630 observations in this dataset, I will use the “summary” function to check for missing values and to decide which variables to focus on. In the output of “summary”, I will look at the min/max values. A variable that has a max value of 1 cannot be used for analysis.

# summarize the dataset to see the summary of each variable
summary(Baltimore_Trash)

    dumpster      month                year          date          
 Min.   :  1   Length:630         Min.   :2014   Length:630        
 1st Qu.:158   Class :character   1st Qu.:2016   Class :character  
 Median :315   Mode  :character   Median :2019   Mode  :character  
 Mean   :315                      Mean   :2019                     
 3rd Qu.:472                      3rd Qu.:2021                     
 Max.   :629                      Max.   :2023                     
 NA's   :1                        NA's   :1                        
     weight             volume        plasticbottles     polystyrene    
 Min.   :   0.780   Min.   :   7.00   Min.   :     80   Min.   :    20  
 1st Qu.:   2.720   1st Qu.:  15.00   1st Qu.:   1025   1st Qu.:   440  
 Median :   3.205   Median :  15.00   Median :   1900   Median :  1040  
 Mean   :   6.411   Mean   :  30.44   Mean   :   3956   Mean   :  2921  
 3rd Qu.:   3.730   3rd Qu.:  15.00   3rd Qu.:   2780   3rd Qu.:  2258  
 Max.   :2019.540   Max.   :9589.00   Max.   :1246155   Max.   :920011  
                                                                        
 cigarettebutts      glassbottles       plasticbags        wrappers       
 Min.   :     500   Min.   :    0.00   Min.   :    24   Min.   :   180.0  
 1st Qu.:    3600   1st Qu.:   10.00   1st Qu.:   270   1st Qu.:   776.2  
 Median :    6000   Median :   18.00   Median :   551   Median :  1142.0  
 Mean   :   37254   Mean   :   42.86   Mean   :  1732   Mean   :  2851.2  
 3rd Qu.:   22000   3rd Qu.:   29.75   3rd Qu.:  1140   3rd Qu.:  1980.0  
 Max.   :11735100   Max.   :13502.00   Max.   :545554   Max.   :898129.0  
                                                                          
  sportsballs      homespowered*       ...15          ...16        
 Min.   :   0.00   Min.   :    0.00   Mode:logical   Mode:logical  
 1st Qu.:   6.00   1st Qu.:   41.00   NA's:630       NA's:630      
 Median :  12.00   Median :   52.00                                
 Mean   :  27.15   Mean   :   95.38                                
 3rd Qu.:  20.00   3rd Qu.:   60.75                                
 Max.   :8553.00   Max.   :30020.00

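I found missing values in the variables “dumpster” and “year”, and the two unnamed columns (“...15” and “...16”) are entirely empty. The very large maximum values (for example, a weight of 2019.54 tons) also suggest that the row with the missing dumpster and year is a totals row rather than an observation. First, I will remove the rows with NAs in dumpster and year.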
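Missing values can also be counted directly instead of being read off the summary() output. The following is a minimal sketch on a small made-up data frame (`toy` is hypothetical and only stands in for Baltimore_Trash); it counts the NAs per column and lists the columns that are entirely empty.

```r
# Hypothetical stand-in for Baltimore_Trash with a few missing values
toy <- data.frame(
  dumpster = c(1, 2, NA),
  year     = c(2014, 2014, NA),
  weight   = c(4.31, 2.74, 7.05),
  empty    = c(NA, NA, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Columns that are entirely missing and could be dropped
all_na_cols <- names(toy)[na_counts == nrow(toy)]
all_na_cols
```

Applied to the real data, the same two calls would flag the dumpster and year columns as having one NA each and the two unnamed columns as completely empty.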

Remove the missing data:

# create the new dataset removing the NAs in the year and dumpster variables
Baltimore_dropNA <- Baltimore_Trash %>% 
  drop_na(year, dumpster)

Recheck the dataset:

# summary of the dataset if it has removed NAs in year and dumpster variable
summary(Baltimore_dropNA)

    dumpster      month                year          date          
 Min.   :  1   Length:629         Min.   :2014   Length:629        
 1st Qu.:158   Class :character   1st Qu.:2016   Class :character  
 Median :315   Mode  :character   Median :2019   Mode  :character  
 Mean   :315                      Mean   :2019                     
 3rd Qu.:472                      3rd Qu.:2021                     
 Max.   :629                      Max.   :2023                     
     weight          volume      plasticbottles  polystyrene   cigarettebutts  
 Min.   :0.780   Min.   : 7.00   Min.   :  80   Min.   :  20   Min.   :   500  
 1st Qu.:2.720   1st Qu.:15.00   1st Qu.:1020   1st Qu.: 440   1st Qu.:  3600  
 Median :3.200   Median :15.00   Median :1900   Median :1040   Median :  6000  
 Mean   :3.211   Mean   :15.24   Mean   :1981   Mean   :1463   Mean   : 18657  
 3rd Qu.:3.730   3rd Qu.:15.00   3rd Qu.:2780   3rd Qu.:2250   3rd Qu.: 22000  
 Max.   :5.620   Max.   :20.00   Max.   :5960   Max.   :6540   Max.   :310000  
  glassbottles     plasticbags        wrappers     sportsballs   
 Min.   :  0.00   Min.   :  24.0   Min.   : 180   Min.   : 0.00  
 1st Qu.: 10.00   1st Qu.: 270.0   1st Qu.: 775   1st Qu.: 6.00  
 Median : 18.00   Median : 550.0   Median :1140   Median :12.00  
 Mean   : 21.47   Mean   : 867.3   Mean   :1428   Mean   :13.59  
 3rd Qu.: 29.00   3rd Qu.:1140.0   3rd Qu.:1980   3rd Qu.:20.00  
 Max.   :110.00   Max.   :3750.0   Max.   :5085   Max.   :56.00  
 homespowered*    ...15          ...16        
 Min.   : 0.00   Mode:logical   Mode:logical  
 1st Qu.:41.00   NA's:629       NA's:629      
 Median :52.00                                
 Mean   :47.81                                
 3rd Qu.:60.00                                
 Max.   :94.00

The dataset is now reduced to 629 observations because the row with NA values in the dumpster and year variables has been removed.

Convert the variables to be used:

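The variables “month” and “date” are stored as character strings, so I convert them to factors for efficient processing. The “year” variable is numeric and is left unchanged in this step; it is converted to a factor later for the boxplot.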

# create the clean dataset of converting into factor
Baltimore_Clean <- Baltimore_dropNA %>% mutate_if(is.character, as.factor)
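One thing to keep in mind is that as.factor() orders the levels alphabetically, so April would come before January in any month-based plot. The sketch below, which assumes the month column holds full English month names, uses R's built-in `month.name` constant to put the levels in calendar order; `months_raw` is a made-up example vector, not data from the file.

```r
# Made-up example values for illustration
months_raw <- c("May", "December", "January", "May")

# Default factor: levels are sorted alphabetically
levels(as.factor(months_raw))

# Calendar-ordered factor using the built-in month.name constant
months_ordered <- factor(months_raw, levels = month.name)
levels(months_ordered)

# Sorting now follows the calendar rather than the alphabet
sort(months_ordered)
```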

As I mentioned in the first part, I will look into the variables year, month, weight, volume, plasticbottles, polystyrene, cigarettebutts, glassbottles, plasticbags, wrappers, and sportsballs. I will drop the remaining variables: homespowered* and the two empty columns.

# create the new dataset with the selected variables
Baltimore_SelectedVariables <- Baltimore_Clean|>
  select(year, month, weight, volume, plasticbottles, polystyrene,cigarettebutts, glassbottles, plasticbags, wrappers, sportsballs) |>
  group_by(year, month)

Data Manipulation for Visualization

I group the data by month and year and summarize the total amount of trash collected for each type of trash, so that I can create a graph with x = year and y = total trash weight.

# group by month and year, and summarize the total amount of trash collected for each type of trash.
AllTrash_Group <- Baltimore_SelectedVariables |>
  select(year, month, weight, volume, polystyrene, plasticbags, cigarettebutts, plasticbottles, glassbottles, wrappers, sportsballs) |>
  filter(month %in% month.name) |>  # month.name holds the twelve full English month names
  filter(year %in% 2014:2023) |>
  group_by(month, year) |>
  summarise(
    total_weight = sum(weight), sum_volume = sum(volume),
    sum_polystyrene = sum(polystyrene), plasticbags = sum(plasticbags),
    cigarettebutts = sum(cigarettebutts), plasticbottles = sum(plasticbottles),
    glassbottles = sum(glassbottles), wrappers = sum(wrappers),
    sportsballs = sum(sportsballs)
  )

`summarise()` has grouped output by 'month'. You can override using the
`.groups` argument.
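The message above is informational: by default summarise() drops only the last grouping variable, so the result stays grouped by month. As a small sketch on a made-up tibble (`toy` and its values are hypothetical), passing `.groups = "drop"` returns an ungrouped result and silences the message.

```r
library(dplyr)

# Hypothetical miniature of the trash data
toy <- tibble(
  month  = c("May", "May", "June"),
  year   = c(2014, 2014, 2014),
  weight = c(4.31, 2.74, 3.45)
)

# Sum weight per month and year, then drop all grouping
totals <- toy |>
  group_by(month, year) |>
  summarise(total_weight = sum(weight), .groups = "drop")

# FALSE: the result is a plain tibble with no grouping left
is_grouped_df(totals)
```

Whether to drop the grouping is a design choice; leaving the data grouped only matters if later steps rely on the month groups.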

I am interested in the trash collected in December across the survey years, as December is full of holidays and events. I will look at the composition of the trash collected in December.

Trash Composition in the month of December (2014-2023)

library(dplyr)

# Filter data for December from 2014 to 2023
December_Trash <- AllTrash_Group %>%
  filter(month == "December", year >= 2014, year <= 2023)

# Summarize the data using aggregate function
# (the column order in cbind() must match the order of the names assigned below)
Total_December_Trash <- aggregate(
  cbind(plasticbottles, sum_polystyrene, cigarettebutts, glassbottles, plasticbags, wrappers, sportsballs) ~ 1,
  data = December_Trash,
  sum,
  na.rm = TRUE
)

# Rename the columns for clarity
colnames(Total_December_Trash) <- c("Total_Plasticbottles", "Total_Polystyrene", "Total_Cigarettebutts", "Total_Glassbottles", "Total_PlasticBags", "Total_Wrappers", "Total_SportsBalls")

# Convert the summarized data to a long format
Total_December_Trash_Long <- stack(Total_December_Trash)

# Rename the columns for clarity
colnames(Total_December_Trash_Long) <- c("Total_Amount", "Trash_Type")
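Renaming columns by position works only while the order of the new names matches the order of the columns returned by aggregate(). A sketch of a name-based alternative is shown here on made-up totals (`toy` and the `labels` lookup vector are hypothetical, and the numbers are not from the dataset): each source column is tied to its display label by name, so reordering the columns cannot swap the labels.

```r
# Hypothetical monthly totals standing in for December_Trash
toy <- data.frame(
  plasticbottles  = c(100, 200),
  plasticbags     = c(10, 20),
  sum_polystyrene = c(50, 60),
  cigarettebutts  = c(1000, 3000),
  glassbottles    = c(5, 7)
)

# Map each source column to its display label by name, not by position
labels <- c(
  plasticbottles  = "Total_Plasticbottles",
  sum_polystyrene = "Total_Polystyrene",
  cigarettebutts  = "Total_Cigarettebutts",
  glassbottles    = "Total_Glassbottles",
  plasticbags     = "Total_PlasticBags"
)

# Column sums, taken in the order given by the lookup vector
totals <- colSums(toy[names(labels)], na.rm = TRUE)

# Long format with one row per trash type
totals_long <- data.frame(
  Trash_Type   = unname(labels[names(totals)]),
  Total_Amount = unname(totals)
)
totals_long
```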

Creating the plot for December trash composition

# Reload the library
library(ggplot2)

To avoid scientific notation on the axis labels, I loaded the “scales” package and reran the chunk.

# avoid scientific notation
library(scales)


Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

# Create the bar plot avoiding scientific notation
ggplot(Total_December_Trash_Long, aes(x = Trash_Type, y = Total_Amount, fill = Trash_Type)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c(
    "Total_Plasticbottles" = "skyblue",
    "Total_Polystyrene" = "orange",
    "Total_Cigarettebutts" = "green",
    "Total_Glassbottles" = "red",
    "Total_PlasticBags" = "purple",
    "Total_Wrappers" = "yellow",
    "Total_SportsBalls" = "blue"
  )) +
  theme_minimal() +
  labs(
    title = "Total Amount of Each Type of Trash in December (2014-2023)",
    x = "Type of Trash",
    y = "Total Amount",
    caption = "Source: https://data.baltimorecity.gov"
  ) +
  scale_y_continuous(labels = scales::comma) +  # Ensure comma function is from scales package
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

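In this chart, the largest December total belongs to cigarette butts. The labels assigned with colnames() are applied by position, so they must follow the same column order as the cbind() call inside aggregate(); when the two orders differ, the cigarette butt total is displayed under the “Total_Glassbottles” label and glass bottles wrongly appear to dominate. Cigarette butts also far outnumber the other items across the whole dataset (a mean of about 18,657 per dumpster, against about 21 glass bottles), so this chart alone does not demonstrate an effect of the holidays and events observed in December.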

Trash Distribution of Baltimore (2014-2023)

# create the boxplot to see the trash distribution for each year.

AllTrash_Match <- AllTrash_Group |>
  mutate(year = factor(year, levels = c("2014", "2015", "2016", "2017", "2018", "2019", "2020", "2021", "2022", "2023")))

AllTrash_Match |>
  ggplot(aes(x = year , y = total_weight, color = year)) +
  geom_boxplot() +
  scale_color_manual(values = c("#25A5D8", "#20A458", "#E4BC20", "#BC85B9", "#FF5733","#FF33FF","#33FF4D", "#3347FF", "#33FFF4","#FF8633" )) + #Specified the colors for year
  labs(title = "Total Trash Collected in Baltimore (2014 - 2023)",
       x = "Year",
       y = "Total Trash Weight (in tons)",
       caption = "Source: https://data.baltimorecity.gov") +
  theme_minimal()

In the boxplot of Baltimore’s trash collection from 2014 to 2023, 2018 appears to have the highest trash weights. I will look in more detail at the trash composition for that year.

# create new dataset to see 2018 data
Trash_2018 <- AllTrash_Group |> 
  dplyr::filter(year == 2018)

library(dplyr)

# Summarize the data using aggregate function
TotalTrash_2018 <- aggregate(
  cbind(plasticbottles, sum_polystyrene, cigarettebutts, glassbottles, plasticbags, wrappers, sportsballs) ~ 1,
  data = Trash_2018,
  sum,
  na.rm = TRUE
)

# Rename the columns for clarity
colnames(TotalTrash_2018) <- c("Total_Plasticbottles", "Total_Polystyrene", "Total_Cigarettebutts", 
                              "Total_Glassbottles", "Total_PlasticBags", "Total_Wrappers", 
                              "Total_SportsBalls")

# Convert data to long format suitable for ggplot
TotalTrash_2018_long <- stack(TotalTrash_2018)

# Create the bar plot using ggplot2
ggplot(TotalTrash_2018_long, aes(x = ind, y = values)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme_minimal() +
  labs(
    title = "Total Amount of Each Type of Trash in 2018",
    x = "Type of Trash",
    y = "Total Amount",
    caption = "Source: https://data.baltimorecity.gov"
  ) +
  scale_y_continuous(labels = scales::comma) +  # Ensure comma function is from scales package
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The total for glass bottles was under 100, so its bar is hard to see next to the much larger categories. I therefore set the lower and upper limits of the y-axis explicitly.

# Determine the lower and upper limits for the y-axis
lower_limit <- 0  # Start from 0 or adjust as needed
upper_limit <- max(TotalTrash_2018_long$values) * 1.2  # Increase upper limit to accommodate data

# Plot using ggplot2 with more color variation
library(ggplot2)

ggplot(TotalTrash_2018_long, aes(x = ind, y = values, fill = ind)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(
    title = "Total Amount of Each Type of Trash in 2018",
    x = "Type of Trash",
    y = "Total Amount",
    caption = "Source: https://data.baltimorecity.gov"
  ) +
  scale_fill_manual(values = c(
    "Total_Plasticbottles" = "skyblue",
    "Total_Polystyrene" = "orange",
    "Total_Cigarettebutts" = "green",
    "Total_PlasticBags" = "red",
    "Total_Glassbottles" = "pink",
    "Total_Wrappers" = "yellow",
    "Total_SportsBalls" = "purple"
  )) +
  scale_y_continuous(labels = scales::comma, limits = c(lower_limit, upper_limit)) +  # Set limits for the y-axis
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
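A linear axis running from 0 to 1.2 times the maximum still leaves very small totals close to invisible. One option, shown here only as a sketch with made-up totals (`toy_totals` is hypothetical and does not contain values from the dataset), is a log10 y-axis. Points are used instead of bars because a bar's zero baseline has no position on a log scale.

```r
library(ggplot2)

# Made-up totals for illustration only; not values from the dataset
toy_totals <- data.frame(
  ind    = c("Total_Cigarettebutts", "Total_Plasticbottles", "Total_Glassbottles"),
  values = c(500000, 40000, 90)
)

# Each major gridline step on the y-axis is a factor of 10
p <- ggplot(toy_totals, aes(x = ind, y = values)) +
  geom_point(size = 3) +
  scale_y_log10(labels = scales::comma) +
  labs(x = "Type of Trash", y = "Total Amount (log scale)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p
```

The trade-off is that a log scale shows ratios rather than absolute differences, so the axis title should say so explicitly.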

Conclusion

I cleaned the dataset step by step, applying each command so that it improved data quality and reliability. I intend to analyze the total trash weight and the homes powered variable in the dataset, but I need additional time to find the appropriate commands. If time permits, I also plan to explore alternative visualizations for a more engaging presentation.

Based on the analysis of trash collection data from 2014 to 2023, with a closer look at December totals, several key insights emerge. Overall, 2018 stands out as the year with the highest trash collection, indicating potentially higher waste generation or improved reporting that year. However, some years exhibit outliers, suggesting fluctuations in environmental conditions, societal behaviors, or reporting accuracy. The December totals must be read with the column labels matched to the correct columns: the category that can appear as glass bottles is in fact cigarette butts when the labels are assigned in a different order from the aggregated columns. Holiday festivities may still influence December waste, but this analysis does not compare December with other months, so it cannot confirm a seasonal peak.

With the labels matched correctly, 2018 does not deviate from the December pattern: cigarette butts are the highest-count category in both the December totals and the 2018 totals. This is expected, because cigarette butts are counted individually and have by far the largest counts in the dataset, so the ranking by itself does not point to changes in smoking habits, regulatory shifts, or localized events.

In conclusion, 2018 emerges as a standout year for overall trash collection, and cigarette butts are the most numerous item both in that year and in December. Separating seasonal influences from annual trends would require comparing months and years on an equal footing, for example by weight or by the share of each item type. These results still highlight the need for targeted waste management strategies that account for both annual trends and exceptional fluctuations.