The analysis of Baltimore Trash Collection (2014-2023)
Baltimore Trash Collection
Background Information
Baltimore, like many urban areas, faces significant challenges related to waste management and trash disposal. The city struggles with littering and illegal dumping in various neighborhoods, a persistent issue that affects its cleanliness and public health. The city provides regular waste collection services, including trash, recycling, and bulk item pickup, but inefficiencies and delays in these services can contribute to trash accumulation. Trash problems are often more severe in low-income neighborhoods, which may experience less frequent waste collection, higher rates of illegal dumping, and fewer resources for cleanup efforts.
Using Baltimore’s trash collection data, I want to analyze how the amount of trash collected is distributed across the years 2014 to 2023. I also want to look at seasonal patterns in the different types of trash collected.
Among the 16 variables in the dataset, I am particularly interested in the year and month of trash collection, the total weight of trash in tons, the total volume of trash in cubic yards, and the variables representing different types of trash (plastic bottles, polystyrene, cigarette butts, glass bottles, plastic bags, wrappers, and sports balls). My plan is to explore the distribution of trash collection from 2014 to 2023 and then analyze the composition of trash in 2018, the year with the highest trash volume.
I got the Baltimore Trash Collection Dataset from the Open Baltimore website.
Analyzing the data
Load the library:
# Load necessary libraries
library(tidyverse)  # also attaches dplyr, ggplot2, readr, and tidyr
library(tinytex)
library(dplyr)
library(RColorBrewer)
library(ggplot2)
Load the dataset:
# Load the Baltimore dataset using the read_csv() command
setwd("/Users/hlinethitzinwai/Documents/1 - College/DATA 110/Project 1/Trash_Dataset")
Baltimore_Trash <- read_csv("trash_collection_Baltimore_2014-23.csv")
New names:
• `` -> `...15`
• `` -> `...16`
Rows: 630 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Month, Date
dbl  (2): Dumpster, Year
num (10): Weight (tons), Volume (cubic yards), Plastic Bottles, Polystyrene,...
lgl  (2): ...15, ...16
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
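The `...15` and `...16` columns come from trailing empty fields in the CSV header. As a minimal sketch, assuming a toy file with made-up values (not the Open Baltimore file itself), columns that are entirely empty could be dropped right after reading:

```r
library(readr)
library(dplyr)

# Toy CSV with two trailing unnamed, empty columns (hypothetical values)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Dumpster,Month,Year,,", "1,May,2014,,", "2,May,2014,,"), toy_path)

# read_csv() repairs the empty header names to ...4 and ...5;
# suppressMessages() hides the "New names" note in this sketch
toy_raw <- suppressMessages(read_csv(toy_path, show_col_types = FALSE))

# Keep only the columns that hold at least one non-missing value
toy_trimmed <- toy_raw |>
  select(where(\(col) !all(is.na(col))))

names(toy_trimmed)
```

The same `select(where(...))` step would apply to `Baltimore_Trash`, although the analysis here instead drops those columns later by selecting the variables of interest.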
Explore the dataset:
# Display the first few rows of the dataset
head(Baltimore_Trash)
# A tibble: 6 × 16
  Dumpster Month  Year Date      `Weight (tons)` `Volume (cubic yards)`
     <dbl> <chr> <dbl> <chr>               <dbl>                  <dbl>
1        1 May    2014 5/16/2014            4.31                     18
2        2 May    2014 5/16/2014            2.74                     13
3        3 May    2014 5/16/2014            3.45                     15
4        4 May    2014 5/17/2014            3.1                      15
5        5 May    2014 5/17/2014            4.06                     18
6        6 May    2014 5/20/2014            2.71                     13
# ℹ 10 more variables: `Plastic Bottles` <dbl>, Polystyrene <dbl>,
#   `Cigarette Butts` <dbl>, `Glass Bottles` <dbl>, `Plastic Bags` <dbl>,
#   Wrappers <dbl>, `Sports Balls` <dbl>, `Homes Powered*` <dbl>, ...15 <lgl>,
#   ...16 <lgl>
# Display the names of the columns in the dataset
names(Baltimore_Trash)
 [1] "Dumpster"             "Month"                "Year"
 [4] "Date"                 "Weight (tons)"        "Volume (cubic yards)"
 [7] "Plastic Bottles"      "Polystyrene"          "Cigarette Butts"
[10] "Glass Bottles"        "Plastic Bags"         "Wrappers"
[13] "Sports Balls"         "Homes Powered*"       "...15"
[16] "...16"
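The names printed above contain spaces, parentheses, and an asterisk, so some of them would still need quotes or backticks after lowercasing and removing spaces. As an alternative sketch (the helper name `tidy_names` is hypothetical and is not part of this analysis), every run of non-alphanumeric characters can be replaced with a single underscore:

```r
# Hypothetical helper: lowercase the names and collapse anything that is not
# a letter or digit into one underscore, then trim underscores at the ends
tidy_names <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x)
}

example_names <- c("Weight (tons)", "Volume (cubic yards)",
                   "Plastic Bottles", "Homes Powered*")
cleaned <- tidy_names(example_names)
print(cleaned)
```

With names like these, the later `rename()` step would not need quoted names such as `"weight(tons)"`. The rest of this report keeps the simpler lowercase-and-remove-spaces approach.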
Clean up the data:
Make all headers lowercase and remove spaces:
The variable names contain uppercase letters and spaces, which makes them easy to mistype in commands. I will make all headers lowercase and remove the spaces.
# make the headers lowercase and remove spaces
names(Baltimore_Trash) <- tolower(names(Baltimore_Trash))
names(Baltimore_Trash) <- gsub(" ", "", names(Baltimore_Trash))
Rename the variables:
Some variables have long names, so I will rename them.
# rename the columns
Baltimore_Trash <- Baltimore_Trash |>
  rename(weight = "weight(tons)",
         volume = "volume(cubicyards)")
Select only certain variables:
After cleaning up the variable names, I want to look at the structure of the data. Since the dataset has 630 observations of 16 variables, I will use the summary() function to check for missing values and to decide which variables to focus on. In the output of summary(), I will look at the min/max values. A variable whose maximum value is 1 cannot be used for the analysis.
# summarize the dataset to see the summary of each variable
summary(Baltimore_Trash)
dumpster month year date
Min. : 1 Length:630 Min. :2014 Length:630
1st Qu.:158 Class :character 1st Qu.:2016 Class :character
Median :315 Mode :character Median :2019 Mode :character
Mean :315 Mean :2019
3rd Qu.:472 3rd Qu.:2021
Max. :629 Max. :2023
NA's :1 NA's :1
weight volume plasticbottles polystyrene
Min. : 0.780 Min. : 7.00 Min. : 80 Min. : 20
1st Qu.: 2.720 1st Qu.: 15.00 1st Qu.: 1025 1st Qu.: 440
Median : 3.205 Median : 15.00 Median : 1900 Median : 1040
Mean : 6.411 Mean : 30.44 Mean : 3956 Mean : 2921
3rd Qu.: 3.730 3rd Qu.: 15.00 3rd Qu.: 2780 3rd Qu.: 2258
Max. :2019.540 Max. :9589.00 Max. :1246155 Max. :920011
cigarettebutts glassbottles plasticbags wrappers
Min. : 500 Min. : 0.00 Min. : 24 Min. : 180.0
1st Qu.: 3600 1st Qu.: 10.00 1st Qu.: 270 1st Qu.: 776.2
Median : 6000 Median : 18.00 Median : 551 Median : 1142.0
Mean : 37254 Mean : 42.86 Mean : 1732 Mean : 2851.2
3rd Qu.: 22000 3rd Qu.: 29.75 3rd Qu.: 1140 3rd Qu.: 1980.0
Max. :11735100 Max. :13502.00 Max. :545554 Max. :898129.0
sportsballs homespowered* ...15 ...16
Min. : 0.00 Min. : 0.00 Mode:logical Mode:logical
1st Qu.: 6.00 1st Qu.: 41.00 NA's:630 NA's:630
Median : 12.00 Median : 52.00
Mean : 27.15 Mean : 95.38
3rd Qu.: 20.00 3rd Qu.: 60.75
Max. :8553.00 Max. :30020.00
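The summary above reports missing values column by column. A more compact check, sketched here on a toy data frame with made-up values, counts the NAs per column directly:

```r
# Toy data frame (hypothetical values) with one incomplete row
# and one column that is entirely empty
toy <- data.frame(
  dumpster = c(1, 2, NA),
  year     = c(2014, 2014, NA),
  weight   = c(4.31, 2.74, 7.05),
  spare    = c(NA, NA, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Show only the columns that have at least one NA
print(na_counts[na_counts > 0])
```

Applied to `Baltimore_Trash`, the same call would list `dumpster`, `year`, and the two unnamed columns.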
I found missing values in the variables “dumpster” and “year”, and the two unnamed columns (...15 and ...16) are entirely empty. First I will remove the rows with NAs in dumpster and year.
Remove the rows with missing data:
# create the new dataset removing the NAs in the year and dumpster variables
Baltimore_dropNA <- Baltimore_Trash %>%
  drop_na(year, dumpster)
Recheck the dataset:
# summary of the dataset to confirm that the NAs in year and dumpster are gone
summary(Baltimore_dropNA)
dumpster month year date
Min. : 1 Length:629 Min. :2014 Length:629
1st Qu.:158 Class :character 1st Qu.:2016 Class :character
Median :315 Mode :character Median :2019 Mode :character
Mean :315 Mean :2019
3rd Qu.:472 3rd Qu.:2021
Max. :629 Max. :2023
weight volume plasticbottles polystyrene cigarettebutts
Min. :0.780 Min. : 7.00 Min. : 80 Min. : 20 Min. : 500
1st Qu.:2.720 1st Qu.:15.00 1st Qu.:1020 1st Qu.: 440 1st Qu.: 3600
Median :3.200 Median :15.00 Median :1900 Median :1040 Median : 6000
Mean :3.211 Mean :15.24 Mean :1981 Mean :1463 Mean : 18657
3rd Qu.:3.730 3rd Qu.:15.00 3rd Qu.:2780 3rd Qu.:2250 3rd Qu.: 22000
Max. :5.620 Max. :20.00 Max. :5960 Max. :6540 Max. :310000
glassbottles plasticbags wrappers sportsballs
Min. : 0.00 Min. : 24.0 Min. : 180 Min. : 0.00
1st Qu.: 10.00 1st Qu.: 270.0 1st Qu.: 775 1st Qu.: 6.00
Median : 18.00 Median : 550.0 Median :1140 Median :12.00
Mean : 21.47 Mean : 867.3 Mean :1428 Mean :13.59
3rd Qu.: 29.00 3rd Qu.:1140.0 3rd Qu.:1980 3rd Qu.:20.00
Max. :110.00 Max. :3750.0 Max. :5085 Max. :56.00
homespowered* ...15 ...16
Min. : 0.00 Mode:logical Mode:logical
1st Qu.:41.00 NA's:629 NA's:629
Median :52.00
Mean :47.81
3rd Qu.:60.00
Max. :94.00
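The dataset is now reduced to 629 observations because the row with NAs in the dumpster and year variables has been removed. That row appears to have been a totals row: after removing it, the maximum weight falls from 2019.54 to 5.62 tons and the mean weight from 6.411 to 3.211 tons.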
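One way to check whether a row with a missing dumpster number holds column totals rather than a real load is to compare it with the sum of the remaining rows. This is a sketch on toy values, not on the Baltimore data:

```r
# Toy data (hypothetical values): three loads plus a trailing row
# with no dumpster number
toy <- data.frame(
  dumpster = c(1, 2, 3, NA),
  weight   = c(4.31, 2.74, 3.45, 10.50)
)

loads    <- toy[!is.na(toy$dumpster), ]
tail_row <- toy[is.na(toy$dumpster), ]

# TRUE when the trailing row equals the sum of the real loads
is_totals_row <- isTRUE(all.equal(sum(loads$weight), tail_row$weight))
print(is_totals_row)

# Leaving such a row in would inflate the mean
print(c(with_totals = mean(toy$weight), without = mean(loads$weight)))
```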
Convert the variables to be used:
The character variables “month” and “date” are categorical, so they are converted to factors. The variable “year” is numeric and is converted to a factor later, when the boxplot is drawn.
# create the clean dataset by converting the character columns to factors
Baltimore_Clean <- Baltimore_dropNA %>% mutate_if(is.character, as.factor)
As mentioned in the first part, I will look into the variables year, month, weight, volume, plasticbottles, polystyrene, cigarettebutts, glassbottles, plasticbags, wrappers, and sportsballs. I will drop the remaining variables: homespowered* and the two empty columns.
# create the new dataset with the selected variables
Baltimore_SelectedVariables <- Baltimore_Clean |>
  select(year, month, weight, volume, plasticbottles, polystyrene, cigarettebutts, glassbottles, plasticbags, wrappers, sportsballs) |>
  group_by(year, month)
Data Manipulation for Visualization
I group the data by month and year and sum the weight, the volume, and the count of each type of trash, so that I can plot the year on the x-axis against the total trash weight on the y-axis.
# group by month and year, and summarize the total amount of trash collected for each type of trash
AllTrash_Group <- Baltimore_SelectedVariables |>
  select(year, month, weight, volume, polystyrene, plasticbags, cigarettebutts, plasticbottles, glassbottles, wrappers, sportsballs) |>
  filter(month %in% c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")) |>
  filter(year %in% c(2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023)) |>
  group_by(month, year) |>
  summarise(total_weight = sum(weight), sum_volume = sum(volume), sum_polystyrene = sum(polystyrene), plasticbags = sum(plasticbags), cigarettebutts = sum(cigarettebutts), plasticbottles = sum(plasticbottles), glassbottles = sum(glassbottles), wrappers = sum(wrappers), sportsballs = sum(sportsballs))
`summarise()` has grouped output by 'month'. You can override using the
`.groups` argument.
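The message above appears because `summarise()` keeps one level of grouping by default. As a sketch on toy values (the column set is shortened for illustration), `across()` sums several columns in one call and `.groups = "drop"` returns an ungrouped result without the message:

```r
library(dplyr)

# Toy data (hypothetical values) standing in for the per-dumpster rows
toy <- tibble(
  year     = c(2018, 2018, 2018, 2019),
  month    = c("May", "May", "June", "May"),
  weight   = c(3.1, 4.0, 2.5, 3.3),
  wrappers = c(100, 200, 150, 120)
)

# One row per year and month; every listed column is summed
monthly <- toy |>
  group_by(year, month) |>
  summarise(across(c(weight, wrappers), \(x) sum(x, na.rm = TRUE)),
            .groups = "drop")

print(monthly)
```

Note that `across()` keeps the original column names, so names such as `total_weight` or `sum_polystyrene` used later in this report would need an explicit `.names` argument or a `rename()` step.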
I am interested in the trash collected in December across the survey years, because December is full of holidays and events. I will look at the trash composition in December.
Trash Composition in the month of December (2014-2023)
# Filter data for December from 2014 to 2023
December_Trash <- AllTrash_Group %>%
  filter(month == "December", year >= 2014, year <= 2023)
# Summarize the data using aggregate function
Total_December_Trash <- aggregate(
  cbind(plasticbottles, plasticbags, sum_polystyrene, cigarettebutts, glassbottles, wrappers, sportsballs) ~ 1,
  data = December_Trash,
  sum,
  na.rm = TRUE
)
# Rename the columns for clarity; the names must follow the column order used in cbind() above
colnames(Total_December_Trash) <- c("Total_Plasticbottles", "Total_PlasticBags", "Total_Polystyrene", "Total_Cigarettebutts", "Total_Glassbottles", "Total_Wrappers", "Total_SportsBalls")
# Convert the summarized data to a long format
Total_December_Trash_Long <- stack(Total_December_Trash)
# Rename the columns for clarity (stack() returns the values first, then the labels)
colnames(Total_December_Trash_Long) <- c("Total_Amount", "Trash_Type")
Creating the plot for December trash composition
# Reload the library
library(ggplot2)
To avoid scientific notation on the axis labels, I loaded the “scales” package and reran the chunk.
# avoid scientific notation
library(scales)
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
# Create the bar plot avoiding scientific notation
ggplot(Total_December_Trash_Long, aes(x = Trash_Type, y = Total_Amount, fill = Trash_Type)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c(
"Total_Plasticbottles" = "skyblue",
"Total_Polystyrene" = "orange",
"Total_Cigarettebutts" = "green",
"Total_Glassbottles" = "red",
"Total_PlasticBags" = "purple",
"Total_Wrappers" = "yellow",
"Total_SportsBalls" = "blue"
)) +
theme_minimal() +
labs(
title = "Total Amount of Each Type of Trash in December (2014-2023)",
x = "Type of Trash",
y = "Total Amount",
caption = "Source: https://data.baltimorecity.gov"
) +
scale_y_continuous(labels = scales::comma) + # Ensure comma function is from scales package
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Cigarette butts are by far the most numerous item collected in December: in the cleaned data every dumpster load contains at least 500 of them, while no load contains more than 110 glass bottles. The holidays and events in December may add to the overall amount, but this plot alone cannot show that, because it does not compare December with the other months.
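Assigning labels by position only works when the label vector follows exactly the same order as the columns passed to `cbind()`. A sketch that avoids this dependence, shown on toy values rather than the December data, lets the totals carry their own column names into the long format:

```r
# Toy monthly totals (hypothetical values), one column per trash type
toy <- data.frame(
  plasticbottles = c(1200, 900),
  plasticbags    = c(400, 350),
  cigarettebutts = c(21000, 18000),
  glassbottles   = c(15, 22)
)

# colSums() returns a named vector, so each total keeps its own label
totals <- colSums(toy, na.rm = TRUE)

# Long format built from the names and values together
long <- data.frame(
  Trash_Type   = names(totals),
  Total_Amount = unname(totals)
)

print(long)
```

Because the label travels with the value, reordering the columns cannot attach a total to the wrong trash type.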
Trash Distribution of Baltimore (2014-2023)
# create the boxplot to see the trash distribution for each year
AllTrash_Match <- AllTrash_Group |>
  mutate(year = factor(year, levels = c("2014", "2015", "2016", "2017", "2018", "2019", "2020", "2021", "2022", "2023")))
AllTrash_Match |>
  ggplot(aes(x = year, y = total_weight, color = year)) +
  geom_boxplot() +
  scale_color_manual(values = c("#25A5D8", "#20A458", "#E4BC20", "#BC85B9", "#FF5733", "#FF33FF", "#33FF4D", "#3347FF", "#33FFF4", "#FF8633")) + # Specify the colors for each year
  labs(title = "Total Trash Collected in Baltimore (2014 - 2023)",
       x = "Year",
       y = "Total Trash Weight (in tons)",
       caption = "Source: https://data.baltimorecity.gov") +
  theme_minimal()
The boxplot of monthly total weights shows that 2018 had the highest trash collection of the years from 2014 to 2023. I will look in more detail at the trash composition for that year.
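A boxplot shows medians and spread, so "highest" can mean the highest median month or the highest annual total. A small sketch, using toy values rather than the Baltimore totals, computes both per year so the reading of the plot can be confirmed numerically:

```r
# Toy monthly totals (hypothetical values), in tons
toy <- data.frame(
  year         = c(2017, 2017, 2018, 2018, 2019),
  total_weight = c(20.1, 18.4, 30.2, 27.5, 22.0)
)

# Annual total and median monthly weight for each year
yearly_total  <- tapply(toy$total_weight, toy$year, sum)
yearly_median <- tapply(toy$total_weight, toy$year, median)

# Year with the largest value under each definition
top_by_total  <- names(which.max(yearly_total))
top_by_median <- names(which.max(yearly_median))
print(c(total = top_by_total, median = top_by_median))
```

Run on `AllTrash_Group`, the same two `tapply()` calls would show whether 2018 leads under both definitions.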
# create new dataset to see 2018 data
Trash_2018 <- AllTrash_Group |>
  dplyr::filter(year == 2018)
# Summarize the data using aggregate function
TotalTrash_2018 <- aggregate(
  cbind(plasticbottles, sum_polystyrene, cigarettebutts, glassbottles, plasticbags, wrappers, sportsballs) ~ 1,
  data = Trash_2018,
  sum,
  na.rm = TRUE
)
# Rename the columns for clarity
colnames(TotalTrash_2018) <- c("Total_Plasticbottles", "Total_Polystyrene", "Total_Cigarettebutts",
                               "Total_Glassbottles", "Total_PlasticBags", "Total_Wrappers",
                               "Total_SportsBalls")
# Convert data to long format suitable for ggplot
TotalTrash_2018_long <- stack(TotalTrash_2018)
# Create the bar plot using ggplot2
ggplot(TotalTrash_2018_long, aes(x = ind, y = values)) +
geom_bar(stat = "identity", fill = "skyblue") +
theme_minimal() +
labs(
title = "Total Amount of Each Type of Trash in 2018",
x = "Type of Trash",
y = "Total Amount",
caption = "Source: https://data.baltimorecity.gov"
) +
scale_y_continuous(labels = scales::comma) + # Ensure comma function is from scales package
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
The total for glass bottles was under 100, so its bar is barely visible next to the others. In the next plot I set explicit lower and upper limits for the y-axis and give each trash type its own color.
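Widening the axis limits adds headroom but does not make a very small bar taller. One alternative, sketched here on toy totals (the values are made up), is to print each total above its bar so that small categories remain readable:

```r
library(ggplot2)

# Toy totals (hypothetical values) in the same long layout that stack() returns
toy <- data.frame(
  ind    = c("Total_Cigarettebutts", "Total_Wrappers", "Total_Glassbottles"),
  values = c(250000, 18000, 90)
)

p <- ggplot(toy, aes(x = ind, y = values)) +
  geom_col(fill = "skyblue") +
  # Label every bar with its total, formatted with thousands separators
  geom_text(aes(label = scales::comma(values)), vjust = -0.4) +
  # Leave 10% of headroom above the tallest bar for its label
  scale_y_continuous(labels = scales::comma,
                     expand = expansion(mult = c(0, 0.1))) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p)
```

`geom_col()` is equivalent to `geom_bar(stat = "identity")`, which is what the plots in this report use.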
# Determine the lower and upper limits for the y-axis
lower_limit <- 0 # Start from 0 or adjust as needed
upper_limit <- max(TotalTrash_2018_long$values) * 1.2 # Increase upper limit to accommodate data
# Plot using ggplot2 with more color variation
library(ggplot2)
ggplot(TotalTrash_2018_long, aes(x = ind, y = values, fill = ind)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(
title = "Total Amount of Each Type of Trash in 2018",
x = "Type of Trash",
y = "Total Amount",
caption = "Source: https://data.baltimorecity.gov"
) +
scale_fill_manual(values = c(
"Total_Plasticbottles" = "skyblue",
"Total_Polystyrene" = "orange",
"Total_Cigarettebutts" = "green",
"Total_PlasticBags" = "red",
"Total_Glassbottles" = "pink",
"Total_Wrappers" = "yellow",
"Total_SportsBalls" = "purple"
)) +
scale_y_continuous(labels = scales::comma, limits = c(lower_limit, upper_limit)) + # Set limits for the y-axis
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Conclusion
I cleaned the dataset step by step, checking the result after each command. I also intended to analyze the total trash weight together with the homes-powered variable, but I need more time to find the appropriate commands. If time permits, I also plan to explore alternative visualizations for a more engaging presentation.
Based on the analysis of trash collection data from 2014 to 2023, with a closer look at December, several insights emerge. 2018 stands out as the year with the highest trash collection, which may reflect higher waste generation or more complete reporting that year. Some years also show outliers in the boxplot, suggesting fluctuations in environmental conditions, societal behaviors, or reporting accuracy.
In December, cigarette butts are the most numerous item by a wide margin. Holidays and events may add to the December total, but the December plot does not compare December with the other months, so it cannot by itself demonstrate a holiday effect.
The 2018 composition shows the same pattern: cigarette butts are the largest category by count, while the total for glass bottles is among the smallest.
In conclusion, 2018 is the standout year for overall trash collection, and cigarette butts dominate the item counts both in December and in 2018. These results highlight the need for targeted waste management strategies that address the most common items and account for both annual trends and exceptional fluctuations.