This dataset covers trash collection in Baltimore City from 2014 to 2023, detailing the types of waste collected at different dumpsters, with each record tagged by month and year.
It provides insights into Baltimore’s waste management trends. I plan to explore the relationship between variables such as glass bottles and plastic bottles, and to investigate seasonal trends in waste collection. Source: Baltimore City Open Data platform (https://data.baltimorecity.gov/), which offers publicly accessible information on trash collection and other municipal services.
Load the tidyverse library, which provides the packages required to work with the dataset.
library(tidyverse)
Warning: package 'readr' was built under R version 4.4.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(treemap)
Warning: package 'treemap' was built under R version 4.4.3
library(RColorBrewer)
Load the dataset from the working directory into the global environment, then convert the column headers to lowercase and replace spaces with underscores.
setwd("C:/Users/Owner/Downloads")
# Suppress all messages when reading the CSV file
trashcollection <- suppressMessages(read_csv("trash_collection_Baltimore_2014-23.csv", show_col_types = FALSE))
names(trashcollection) <- gsub(" ", "_", tolower(names(trashcollection)))
head(trashcollection)
# A tibble: 6 × 16
dumpster month year date `weight_(tons)` `volume_(cubic_yards)`
<dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 1 May 2014 5/16/2014 4.31 18
2 2 May 2014 5/16/2014 2.74 13
3 3 May 2014 5/16/2014 3.45 15
4 4 May 2014 5/17/2014 3.1 15
5 5 May 2014 5/17/2014 4.06 18
6 6 May 2014 5/20/2014 2.71 13
# ℹ 10 more variables: plastic_bottles <dbl>, polystyrene <dbl>,
# cigarette_butts <dbl>, glass_bottles <dbl>, plastic_bags <dbl>,
# wrappers <dbl>, sports_balls <dbl>, `homes_powered*` <dbl>, ...15 <lgl>,
# ...16 <lgl>
Filter out columns that contain only NA values, remove the final totals row, and convert the “dumpster” column from a numerical variable to a categorical variable.
# Filter out columns with all NA values
trashcollection_filter <- trashcollection |>
  select(where(~ !all(is.na(.))))
# Remove the last row (total) and NA
trashcollection_filter <- trashcollection_filter[-630, ]
# Convert 'dumpster' column to a factor
trashcollection_filter$dumpster <- as.factor(trashcollection_filter$dumpster)
# Display the first few rows of the filtered dataset
head(trashcollection_filter)
# A tibble: 6 × 14
dumpster month year date `weight_(tons)` `volume_(cubic_yards)`
<fct> <chr> <dbl> <chr> <dbl> <dbl>
1 1 May 2014 5/16/2014 4.31 18
2 2 May 2014 5/16/2014 2.74 13
3 3 May 2014 5/16/2014 3.45 15
4 4 May 2014 5/17/2014 3.1 15
5 5 May 2014 5/17/2014 4.06 18
6 6 May 2014 5/20/2014 2.71 13
# ℹ 8 more variables: plastic_bottles <dbl>, polystyrene <dbl>,
# cigarette_butts <dbl>, glass_bottles <dbl>, plastic_bags <dbl>,
# wrappers <dbl>, sports_balls <dbl>, `homes_powered*` <dbl>
Summarize trashcollection_filter to display summary statistics (including the five-number summary) for each column of the dataset.
summary(trashcollection_filter)
dumpster month year date
1 : 1 Length:629 Min. :2014 Length:629
2 : 1 Class :character 1st Qu.:2016 Class :character
3 : 1 Mode :character Median :2019 Mode :character
4 : 1 Mean :2019
5 : 1 3rd Qu.:2021
6 : 1 Max. :2023
(Other):623
weight_(tons) volume_(cubic_yards) plastic_bottles polystyrene
Min. :0.780 Min. : 7.00 Min. : 80 Min. : 20
1st Qu.:2.720 1st Qu.:15.00 1st Qu.:1020 1st Qu.: 440
Median :3.200 Median :15.00 Median :1900 Median :1040
Mean :3.211 Mean :15.24 Mean :1981 Mean :1463
3rd Qu.:3.730 3rd Qu.:15.00 3rd Qu.:2780 3rd Qu.:2250
Max. :5.620 Max. :20.00 Max. :5960 Max. :6540
cigarette_butts glass_bottles plastic_bags wrappers
Min. : 500 Min. : 0.00 Min. : 24.0 Min. : 180
1st Qu.: 3600 1st Qu.: 10.00 1st Qu.: 270.0 1st Qu.: 775
Median : 6000 Median : 18.00 Median : 550.0 Median :1140
Mean : 18657 Mean : 21.47 Mean : 867.3 Mean :1428
3rd Qu.: 22000 3rd Qu.: 29.00 3rd Qu.:1140.0 3rd Qu.:1980
Max. :310000 Max. :110.00 Max. :3750.0 Max. :5085
sports_balls homes_powered*
Min. : 0.00 Min. : 0.00
1st Qu.: 6.00 1st Qu.:41.00
Median :12.00 Median :52.00
Mean :13.59 Mean :47.81
3rd Qu.:20.00 3rd Qu.:60.00
Max. :56.00 Max. :94.00
Use the mutate function and the lubridate package to restructure the date component, then sum the counts of each trash type by month.
# A tibble: 570 × 4
# Groups: year [10]
year month trash_type values
<dbl> <ord> <chr> <dbl>
1 2014 May plastic_bottles 14300
2 2014 May polystyrene 17090
3 2014 May cigarette_butts 800000
4 2014 May glass_bottles 424
5 2014 May plastic_bags 6064
6 2014 Jun plastic_bottles 10920
7 2014 Jun polystyrene 13019
8 2014 Jun cigarette_butts 972000
9 2014 Jun glass_bottles 494
10 2014 Jun plastic_bags 4744
# ℹ 560 more rows
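The code for this step is not shown in the output above. As a minimal sketch of how such a monthly long-format table could be built, the example below parses the date with lubridate, sums the counts per year and month, and reshapes with pivot_longer. The toy data frame and the names `toy`, `monthly`, and `monthly_long` are hypothetical and used only for illustration; the author's actual code may differ.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical toy data with the same column layout as the cleaned dataset
toy <- tibble(
  date = c("5/16/2014", "5/17/2014", "6/3/2014"),
  plastic_bottles = c(1450, 1120, 2000),
  glass_bottles = c(72, 42, 50)
)

# Parse the date, derive year and month, and sum the counts per month
monthly <- toy |>
  mutate(
    date = mdy(date),
    year = year(date),
    month = month(date, label = TRUE)  # ordered factor of month names
  ) |>
  group_by(year, month) |>
  summarise(
    across(c(plastic_bottles, glass_bottles), \(x) sum(x, na.rm = TRUE)),
    .groups = "drop"
  )

# Reshape to one row per year, month, and trash type
monthly_long <- monthly |>
  pivot_longer(
    c(plastic_bottles, glass_bottles),
    names_to = "trash_type",
    values_to = "values"
  )

monthly_long
```

Keeping both the wide table (one column per trash type) and the long table is convenient: the wide form suits scatterplots and regression, while the long form suits grouped plots.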
Create another quantitative variable (percentage) to enable the plotting of a treemap.
# Group by year and month, then calculate percentages
monthly_trashcollection_long <- monthly_trashcollection_long |>
  group_by(year, month) |>
  mutate(percentage = values / sum(values, na.rm = TRUE) * 100)
monthly_trashcollection_long
# A tibble: 570 × 5
# Groups: year, month [114]
year month trash_type values percentage
<dbl> <ord> <chr> <dbl> <dbl>
1 2014 May plastic_bottles 14300 1.71
2 2014 May polystyrene 17090 2.04
3 2014 May cigarette_butts 800000 95.5
4 2014 May glass_bottles 424 0.0506
5 2014 May plastic_bags 6064 0.724
6 2014 Jun plastic_bottles 10920 1.09
7 2014 Jun polystyrene 13019 1.30
8 2014 Jun cigarette_butts 972000 97.1
9 2014 Jun glass_bottles 494 0.0493
10 2014 Jun plastic_bags 4744 0.474
# ℹ 560 more rows
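As a quick check of the percentage calculation, the sketch below recomputes the May 2014 shares by hand from the five values printed in the table. The vector name `may_2014` is hypothetical; the numbers are copied from the output above.

```r
# May 2014 counts copied from the table above
may_2014 <- c(
  plastic_bottles = 14300,
  polystyrene = 17090,
  cigarette_butts = 800000,
  glass_bottles = 424,
  plastic_bags = 6064
)

# Each trash type's share of the monthly total, in percent
pct <- may_2014 / sum(may_2014) * 100
signif(pct, 3)
```

The monthly total is 837,878 items, so cigarette butts account for about 95.5% of the count, which matches the percentage column.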
Plot a treemap to capture the percentage share of each trash_type collected from 2014 to 2023.
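The treemap code is not included in the output. A minimal sketch of one way to draw it with the treemap package is shown below, assuming a data frame with one row per trash type. The data frame `trash_totals` and the plot title are hypothetical, and the counts are only the May 2014 values from the table above, not the full 2014 to 2023 totals.

```r
library(treemap)

# Hypothetical toy input: one row per trash type (May 2014 counts only)
trash_totals <- data.frame(
  trash_type = c("plastic_bottles", "polystyrene", "cigarette_butts",
                 "glass_bottles", "plastic_bags"),
  values = c(14300, 17090, 800000, 424, 6064)
)
trash_totals$percentage <- trash_totals$values / sum(trash_totals$values) * 100

# Rectangle area is proportional to each trash type's percentage share
treemap(
  trash_totals,
  index = "trash_type",
  vSize = "percentage",
  type = "index",
  palette = "Set1",
  title = "Share of trash types (toy data)"
)
```

Because cigarette butts dominate the counts, the other categories appear as very small rectangles; that is a property of the data rather than of the plotting call.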
Use an initial scatterplot, together with a statistical rule such as the IQR method, to spot the outlier.
ggplot(monthly_trashcollection, aes(x = glass_bottles, y = plastic_bottles)) +
  geom_point() +
  labs(title = "Initial Scatterplot: Plastic Bottles vs Glass Bottles")
Find the row with the outlier
outlier <- which.max(monthly_trashcollection$plastic_bottles)
# Remove the outlier
monthly_trashcollection_filtered_data <- monthly_trashcollection[-outlier, ]
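The text mentions the IQR method, while the code above drops only the single largest plastic_bottles value. For comparison, here is a minimal sketch of the usual 1.5 × IQR rule. The helper `iqr_outliers` and the toy vector `x` are hypothetical and not part of the original analysis.

```r
# Return the positions of values outside [Q1 - k * IQR, Q3 + k * IQR]
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[[2]] - q[[1]])
  which(x < q[[1]] - fence | x > q[[2]] + fence)
}

# Toy example: the last value lies far above the rest
x <- c(10, 12, 11, 13, 12, 11, 95)
iqr_outliers(x)
```

Unlike which.max, this rule can flag zero, one, or several points, so it is worth inspecting the flagged rows before removing anything.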
Plot p3 with ggplot's geom_point to show the relationship between two quantitative variables, plastic bottles and glass bottles, and add a linear regression line.
p3 <- ggplot(monthly_trashcollection_filtered_data, aes(x = glass_bottles, y = plastic_bottles)) +
  geom_point() +
  labs(title = "Plastic_bottles VERSUS glass_bottles IN Baltimore",
       caption = "(https://data.baltimorecity.gov/)",
       x = "Glass bottles collected per month",
       y = "Plastic bottles collected per month") +
  theme_minimal(base_size = 12)
# Add the regression line
p3 + geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, linetype = "dotdash", size = 0.5)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Regression and Modeling
Cor stands for “correlation”. The correlation coefficient is a value between -1 and 1 (inclusive) that tells how strong or weak the linear relationship is. Values close to ±1 indicate a strong correlation (the sign matches the sign of the linear slope), values close to ±0.5 indicate a moderate correlation, and values close to zero indicate little or no linear correlation.
The correlation for the plot above is above 0.5 (about 0.62, the square root of the model's R-squared of 0.3814), suggesting a moderate positive correlation.
Summary statistics of fit1.
fit1 <- lm(plastic_bottles ~ glass_bottles, data = monthly_trashcollection_filtered_data)
summary(fit1)
Call:
lm(formula = plastic_bottles ~ glass_bottles, data = monthly_trashcollection_filtered_data)
Residuals:
Min 1Q Median 3Q Max
-16449.1 -3131.8 -501.4 2435.5 15798.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5729.151 760.349 7.535 1.41e-11 ***
glass_bottles 43.806 5.295 8.273 3.18e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4928 on 111 degrees of freedom
Multiple R-squared: 0.3814, Adjusted R-squared: 0.3759
F-statistic: 68.45 on 1 and 111 DF, p-value: 3.184e-13
Regression Equation
The model output includes estimated coefficients, allowing us to write the regression equation as:
\[ \hat{y} = 5729.151 + 43.806x \]
Here \(\hat{y}\) is the predicted number of plastic bottles collected and \(x\) is the number of glass bottles collected. The slope, 43.806, indicates that for every one additional glass bottle collected, approximately 43.81 more plastic bottles are collected.
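To make the equation concrete, the sketch below plugs a value into the fitted line using the coefficients reported in the summary. The helper `predict_plastic` and the input of 100 glass bottles are hypothetical choices for illustration.

```r
# Fitted line from the summary output: intercept 5729.151, slope 43.806
predict_plastic <- function(glass_bottles) {
  5729.151 + 43.806 * glass_bottles
}

# Predicted plastic bottles for a month with 100 glass bottles
predict_plastic(100)

# The difference between neighbouring inputs equals the slope
predict_plastic(101) - predict_plastic(100)
```

For 100 glass bottles the line predicts about 10,110 plastic bottles. With the fitted model object available, predict(fit1, newdata = data.frame(glass_bottles = 100)) would give the same result up to rounding of the coefficients.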
Residuals
Residuals are the differences between the observed and predicted values. The summary of residuals shows that they range from -16449.1 (minimum) to 15798.8 (maximum), indicating substantial variation in prediction errors. The median residual of -501.4 is close to zero relative to that range, suggesting no significant bias toward over- or under-prediction.
Model Fit Metrics
Adjusted \(R^2\): The adjusted \(R^2\) value of 0.3759 indicates that the model explains approximately 37.59% of the variability in the number of plastic bottles collected.
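As a sanity check on these fit metrics, the sketch below recomputes the adjusted R-squared and the correlation coefficient from the numbers printed in the model summary. It assumes n = 113 observations (111 residual degrees of freedom plus 2 estimated coefficients) and uses the rounded R-squared of 0.3814, so the results agree with the output only up to rounding.

```r
r2 <- 0.3814  # Multiple R-squared from the summary
n <- 113      # observations: 111 residual df + 2 coefficients
p <- 1        # number of predictors

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# In simple linear regression, |r| is the square root of R-squared;
# the sign follows the slope, which is positive here
r <- sqrt(r2)

round(c(adj_r2 = adj_r2, r = r), 4)
```

With a single predictor the adjustment is small, which is why the adjusted value (0.3759) sits so close to the multiple R-squared (0.3814).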
Summary
The first step in cleaning my data was to remove the columns containing only NA values using the dplyr function select, and to convert the column “dumpster” from a numerical variable to a categorical variable. Secondly, I created another quantitative variable (percentage) to enable the plotting of a treemap. Lastly, I prepared my data for plotting by using the pivot_longer function, then used the mutate function and the lubridate package to restructure the date component.
Project 1 has three plots. The first plot is a treemap that compares the overall scale and relative contributions of trash types to the total trash collected. Plot 2 is a log-scale density plot showing the statistical distribution (e.g., variability and trends) of collected amounts within each trash type. Plot 3 is a scatterplot (geom_point) with a linear regression line, used to examine the correlation between two quantitative variables.
Plot 2 presents a density visualization of the log-transformed trash collection amounts, grouped by trash type. The x-axis depicts the log-transformed collected amounts, while the y-axis represents density. For a clear representation, the RColorBrewer Set1 palette was used with a transparency of 0.35 to highlight overlapping curves.
Different colors distinguish the trash types. Cigarette butts (red) show a peak at lower values, reflecting a high density of small collected amounts. Glass bottles (blue) exhibit a moderate distribution with a smaller peak. Plastic bags (green) and plastic bottles (purple) display higher peaks, indicating they are collected in larger amounts. Polystyrene (orange) features a broad distribution, signifying a wide range of collection values. Overall, the plot highlights that plastic-related waste tends to be collected in larger amounts, whereas cigarette butts are characterized by high density at lower collection amounts. The overlapping density curves illustrate variability in collection sizes across the different trash types.
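The code for Plot 2 does not appear in the output. A minimal sketch of a density plot matching the description (log-transformed amounts, Set1 palette, transparency of 0.35) is given below. The simulated data frame `toy_long` and the axis labels are hypothetical, so the curve shapes will not match the real figure.

```r
library(ggplot2)

# Hypothetical simulated amounts for three trash types
set.seed(1)
toy_long <- data.frame(
  trash_type = rep(c("cigarette_butts", "glass_bottles", "plastic_bags"), each = 50),
  values = c(rlnorm(50, 11, 1), rlnorm(50, 5, 0.5), rlnorm(50, 8, 0.7))
)

# Overlapping densities of log-transformed amounts, one fill color per trash type
p2 <- ggplot(toy_long, aes(x = log(values), fill = trash_type)) +
  geom_density(alpha = 0.35) +
  scale_fill_brewer(palette = "Set1") +
  labs(x = "log(collected amount)", y = "Density", fill = "Trash type") +
  theme_minimal()
p2
```

Note that log(0) is -Inf, so any zero counts in the real data (glass bottles have a minimum of 0 in the per-dumpster summary) would need to be filtered or offset before the transformation.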