Project 1

Author

Paul D-O

Project 1: Trash Collection Baltimore 2014-2023.

The Project 1 dataset was sourced from Baltimore's open data platform: Trash Collection Baltimore 2014-2023.

Introduction

The dataset explores trash collection in Baltimore City from 2014 to 2023, detailing various types of waste collected monthly at different dumpsters.

This dataset provides insight into Baltimore’s waste management trends. I plan to explore the relationship between variables such as glass bottles and plastic bottles, and to investigate seasonal trends in waste collection. Source: Baltimore City Open Data platform (https://data.baltimorecity.gov/), which offers publicly accessible information on trash collection and other municipal services.

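Load the tidyverse library, which provides the packages needed to read, clean, and plot the dataset, along with the treemap and RColorBrewer packages for the treemap and its color palette.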

library(tidyverse)

Warning: package 'readr' was built under R version 4.4.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(treemap)

Warning: package 'treemap' was built under R version 4.4.3

library(RColorBrewer)

Load the dataset from the working directory into the global environment, then convert the column headers to lower case and replace spaces with underscores.

setwd("C:/Users/Owner/Downloads")
# Suppress all messages when reading the CSV file
trashcollection <- suppressMessages(read_csv("trash_collection_Baltimore_2014-23.csv", show_col_types = FALSE))
# Lower-case the headers and replace spaces with underscores
names(trashcollection) <- gsub(" ", "_", tolower(names(trashcollection)))
head(trashcollection)

# A tibble: 6 × 16
  dumpster month  year date      `weight_(tons)` `volume_(cubic_yards)`
     <dbl> <chr> <dbl> <chr>               <dbl>                  <dbl>
1        1 May    2014 5/16/2014            4.31                     18
2        2 May    2014 5/16/2014            2.74                     13
3        3 May    2014 5/16/2014            3.45                     15
4        4 May    2014 5/17/2014            3.1                      15
5        5 May    2014 5/17/2014            4.06                     18
6        6 May    2014 5/20/2014            2.71                     13
# ℹ 10 more variables: plastic_bottles <dbl>, polystyrene <dbl>,
#   cigarette_butts <dbl>, glass_bottles <dbl>, plastic_bags <dbl>,
#   wrappers <dbl>, sports_balls <dbl>, `homes_powered*` <dbl>, ...15 <lgl>,
#   ...16 <lgl>
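
Replacing only spaces leaves parentheses and an asterisk in some headers (for example weight_(tons) and homes_powered*), which is why R prints them in backticks. A minimal sketch of a stricter clean-up in base R follows. The raw header spellings in raw_names are assumed for illustration and are not read from the CSV file.

```r
# Hypothetical raw headers, assumed for illustration only
raw_names <- c("Dumpster", "Weight (tons)", "Volume (cubic yards)", "Homes Powered*")

# Lower-case, turn every run of non-alphanumeric characters into "_",
# then trim underscores left at the start or end
clean_names <- tolower(raw_names)
clean_names <- gsub("[^a-z0-9]+", "_", clean_names)
clean_names <- gsub("^_+|_+$", "", clean_names)

print(clean_names)
```

Names cleaned this way can be used in formulas and in dplyr verbs without backticks. The rest of this report keeps the original names.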

Drop the columns that contain only NA values, remove the final totals row, and convert the “dumpster” column from a numeric variable to a categorical variable (factor).

# Filter out columns with all NA values
trashcollection_filter <- trashcollection |>
  select(where(~ !all(is.na(.))))
# Remove the last row (row 630), which holds the totals rather than a dumpster
trashcollection_filter <- trashcollection_filter[-630, ]

# Convert 'dumpster' column to a factor 
trashcollection_filter$dumpster <-
  as.factor(trashcollection_filter$dumpster)

# Display the first few rows of the filtered dataset
head(trashcollection_filter)

# A tibble: 6 × 14
  dumpster month  year date      `weight_(tons)` `volume_(cubic_yards)`
  <fct>    <chr> <dbl> <chr>               <dbl>                  <dbl>
1 1        May    2014 5/16/2014            4.31                     18
2 2        May    2014 5/16/2014            2.74                     13
3 3        May    2014 5/16/2014            3.45                     15
4 4        May    2014 5/17/2014            3.1                      15
5 5        May    2014 5/17/2014            4.06                     18
6 6        May    2014 5/20/2014            2.71                     13
# ℹ 8 more variables: plastic_bottles <dbl>, polystyrene <dbl>,
#   cigarette_butts <dbl>, glass_bottles <dbl>, plastic_bags <dbl>,
#   wrappers <dbl>, sports_balls <dbl>, `homes_powered*` <dbl>
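
Removing the totals row by its position (630) works for this file but would break if rows were added. A minimal sketch of a position-independent alternative follows, on made-up data. It rests on the assumption that the totals row has NA in the dumpster column, which should be checked against the real file first.

```r
# Toy data: two dumpster rows plus a totals row (all values are made up)
toy <- data.frame(
  dumpster = c(1, 2, NA),
  weight_tons = c(4.31, 2.74, 7.05)
)

# Keep only rows that belong to a real dumpster.
# Assumption: the totals row has NA in the dumpster column.
toy_clean <- toy[!is.na(toy$dumpster), ]

# Convert the identifier to a factor, as in the step above
toy_clean$dumpster <- as.factor(toy_clean$dumpster)

print(toy_clean)
```

If the assumption holds, the dplyr equivalent is filter(!is.na(dumpster)).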

Summarize trashcollection_filter to display the five-number summary (plus the mean) of each numeric column.

summary(trashcollection_filter)

    dumpster      month                year          date          
 1      :  1   Length:629         Min.   :2014   Length:629        
 2      :  1   Class :character   1st Qu.:2016   Class :character  
 3      :  1   Mode  :character   Median :2019   Mode  :character  
 4      :  1                      Mean   :2019                     
 5      :  1                      3rd Qu.:2021                     
 6      :  1                      Max.   :2023                     
 (Other):623                                                       
 weight_(tons)   volume_(cubic_yards) plastic_bottles  polystyrene  
 Min.   :0.780   Min.   : 7.00        Min.   :  80    Min.   :  20  
 1st Qu.:2.720   1st Qu.:15.00        1st Qu.:1020    1st Qu.: 440  
 Median :3.200   Median :15.00        Median :1900    Median :1040  
 Mean   :3.211   Mean   :15.24        Mean   :1981    Mean   :1463  
 3rd Qu.:3.730   3rd Qu.:15.00        3rd Qu.:2780    3rd Qu.:2250  
 Max.   :5.620   Max.   :20.00        Max.   :5960    Max.   :6540  
                                                                    
 cigarette_butts  glass_bottles     plastic_bags       wrappers   
 Min.   :   500   Min.   :  0.00   Min.   :  24.0   Min.   : 180  
 1st Qu.:  3600   1st Qu.: 10.00   1st Qu.: 270.0   1st Qu.: 775  
 Median :  6000   Median : 18.00   Median : 550.0   Median :1140  
 Mean   : 18657   Mean   : 21.47   Mean   : 867.3   Mean   :1428  
 3rd Qu.: 22000   3rd Qu.: 29.00   3rd Qu.:1140.0   3rd Qu.:1980  
 Max.   :310000   Max.   :110.00   Max.   :3750.0   Max.   :5085  
                                                                  
  sports_balls   homes_powered* 
 Min.   : 0.00   Min.   : 0.00  
 1st Qu.: 6.00   1st Qu.:41.00  
 Median :12.00   Median :52.00  
 Mean   :13.59   Mean   :47.81  
 3rd Qu.:20.00   3rd Qu.:60.00  
 Max.   :56.00   Max.   :94.00

Use the mutate function and the lubridate package to parse the date column, extract the month and year, and then sum each trash type by month.

monthly_trashcollection <- trashcollection_filter |>
  mutate(
    date = as.Date(date, format = "%m/%d/%Y"), # Ensure correct Date format
    month = month(date, label = TRUE),         # Extract month (shortened, e.g., "Jan")
    year = year(date)                          # Extract year
  ) |>
  group_by(year, month) |>
  summarise(
    plastic_bottles = sum(plastic_bottles, na.rm = TRUE),
    polystyrene = sum(polystyrene, na.rm = TRUE),
    cigarette_butts = sum(cigarette_butts, na.rm = TRUE),
    glass_bottles = sum(glass_bottles, na.rm = TRUE),
    plastic_bags = sum(plastic_bags, na.rm = TRUE)
  )

`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
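
The date column is stored as text in month/day/year order, so the format string has to match that order. A minimal base R sketch of the parsing step follows, using dates copied from the first rows shown earlier. It uses format() instead of lubridate's month() and year() only to keep the example dependency-free.

```r
# Dates copied from the first rows of the dataset shown above
raw_dates <- c("5/16/2014", "5/17/2014", "5/20/2014")

# Parse month/day/year strings into Date objects
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Extract the month and year as integers
month_num <- as.integer(format(parsed, "%m"))
year_num <- as.integer(format(parsed, "%Y"))

print(data.frame(date = parsed, month = month_num, year = year_num))
```

A format string that does not match the text produces NA rather than an error, so it is worth checking for NA values after parsing.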

Reshape the data from wide to long format with the pivot_longer function, so that each row holds one trash type for one month.

monthly_trashcollection_long <- monthly_trashcollection |>
  pivot_longer(
    cols = c(plastic_bottles, polystyrene, cigarette_butts, glass_bottles, plastic_bags),
    names_to = "trash_type",
    values_to = "values"
  )

# View the reshaped data
print(monthly_trashcollection_long)

# A tibble: 570 × 4
# Groups:   year [10]
    year month trash_type      values
   <dbl> <ord> <chr>            <dbl>
 1  2014 May   plastic_bottles  14300
 2  2014 May   polystyrene      17090
 3  2014 May   cigarette_butts 800000
 4  2014 May   glass_bottles      424
 5  2014 May   plastic_bags      6064
 6  2014 Jun   plastic_bottles  10920
 7  2014 Jun   polystyrene      13019
 8  2014 Jun   cigarette_butts 972000
 9  2014 Jun   glass_bottles      494
10  2014 Jun   plastic_bags      4744
# ℹ 560 more rows

Create another quantitative variable (percentage), which is each trash type's share of the month's total item count, to enable the plotting of a treemap.

# Group by year and month, then calculate percentages
monthly_trashcollection_long <- monthly_trashcollection_long |>
  group_by(year, month) |>
  mutate(percentage = values / sum(values, na.rm = TRUE) * 100)
monthly_trashcollection_long

# A tibble: 570 × 5
# Groups:   year, month [114]
    year month trash_type      values percentage
   <dbl> <ord> <chr>            <dbl>      <dbl>
 1  2014 May   plastic_bottles  14300     1.71  
 2  2014 May   polystyrene      17090     2.04  
 3  2014 May   cigarette_butts 800000    95.5   
 4  2014 May   glass_bottles      424     0.0506
 5  2014 May   plastic_bags      6064     0.724 
 6  2014 Jun   plastic_bottles  10920     1.09  
 7  2014 Jun   polystyrene      13019     1.30  
 8  2014 Jun   cigarette_butts 972000    97.1   
 9  2014 Jun   glass_bottles      494     0.0493
10  2014 Jun   plastic_bags      4744     0.474 
# ℹ 560 more rows
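
As a check on the percentage column, the May 2014 figures can be recomputed by hand. The sketch below copies the five monthly totals from the output above and applies the same formula in base R.

```r
# May 2014 monthly totals, copied from the output above
may_2014 <- c(
  plastic_bottles = 14300,
  polystyrene = 17090,
  cigarette_butts = 800000,
  glass_bottles = 424,
  plastic_bags = 6064
)

# Each type's share of that month's total item count, in percent
pct <- may_2014 / sum(may_2014) * 100

print(signif(pct, 3))
```

The shares sum to 100 within each month, and cigarette butts account for roughly 95.5% of the items counted in May 2014.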

Plot a treemap to capture each trash type's share of the items collected between 2014 and 2023. Tile size reflects the item counts (values), not weight.

treemap(
  monthly_trashcollection_long,
  index = "trash_type",
  vSize = "values",
  vColor = "percentage",
  type = "manual",
  palette = "RdYlBu"
)

Plot density of log trash collection values

ggplot(data = monthly_trashcollection_long, aes(x = log(values), fill = trash_type)) +
  geom_density(alpha = 0.35) +
  scale_fill_brewer(palette = "Set1")+
  labs(
    title = "Density Plot of Log(Trash Collection Values)",
    x = "Log(Values) (Collected Amount)",
    y = "Density",
    fill = "Trash Type"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(),
    legend.position = "bottom"
  )

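Check for outliers before fitting a model. An initial scatterplot of plastic bottles against glass bottles gives a visual check; a statistical rule such as the IQR method can then confirm which points are extreme.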

ggplot(monthly_trashcollection, aes(x = glass_bottles, y = plastic_bottles)) +
  geom_point() +
  labs(title = "Initial Scatterplot: Plastic Bottles vs Glass Bottles")

Find the row with the outlier, taken here as the month with the largest plastic bottle total, and remove it.

outlier <- which.max(monthly_trashcollection$plastic_bottles)

# Remove the outlier
monthly_trashcollection_filtered_data <- monthly_trashcollection[-outlier, ]
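
The code above removes the single largest value. A minimal sketch of the 1.5 × IQR rule, a common statistical way to flag outliers, is shown below on made-up numbers. The helper name iqr_outliers is hypothetical and is not part of any package.

```r
# Hypothetical helper: return the positions of values outside the 1.5 * IQR fences
iqr_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  lower <- q[1] - 1.5 * spread
  upper <- q[2] + 1.5 * spread
  which(x < lower | x > upper)
}

# Toy monthly totals (made up); the last value is far from the rest
toy_totals <- c(10, 12, 11, 13, 12, 50)
flagged <- iqr_outliers(toy_totals)
print(flagged)
```

Applied to the real data, iqr_outliers(monthly_trashcollection$plastic_bottles) would list the candidate rows. Unlike which.max, the rule may flag several months or none.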

Build plot p3 with ggplot and geom_point, then add a fitted linear regression line to show the relationship between two quantitative variables, plastic bottles and glass bottles.

p3 <- ggplot(monthly_trashcollection_filtered_data, aes(x = glass_bottles, y = plastic_bottles)) +
  geom_point() +
  labs(
    title = "Plastic Bottles versus Glass Bottles in Baltimore",
    caption = "Source: https://data.baltimorecity.gov/",
    x = "Glass bottles collected per month",
    y = "Plastic bottles collected per month"
  ) +
  theme_minimal(base_size = 12)
# Add the fitted regression line; linewidth replaces the deprecated size argument for lines
p3 + geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linetype = "dotdash", linewidth = 0.5)

Regression and Modeling

Cor stands for “correlation”. The correlation coefficient is a value between -1 and 1 (inclusive) that tells how strong or weak the linear relationship is. Values close to +/- 1 indicate strong correlation (the sign matches the sign of the linear slope), values around +/- 0.5 indicate weak-to-moderate correlation, and values close to zero indicate no linear correlation.

cor(monthly_trashcollection_filtered_data$glass_bottles, monthly_trashcollection_filtered_data$plastic_bottles)

[1] 0.6176127

The correlation is about 0.62, which is above 0.5 and suggests a moderate positive correlation between glass bottles and plastic bottles.

Summary statistics of fit1.

fit1 <- lm(plastic_bottles ~ glass_bottles, data = monthly_trashcollection_filtered_data)
summary(fit1)


Call:
lm(formula = plastic_bottles ~ glass_bottles, data = monthly_trashcollection_filtered_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-16449.1  -3131.8   -501.4   2435.5  15798.8 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5729.151    760.349   7.535 1.41e-11 ***
glass_bottles   43.806      5.295   8.273 3.18e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4928 on 111 degrees of freedom
Multiple R-squared:  0.3814,    Adjusted R-squared:  0.3759 
F-statistic: 68.45 on 1 and 111 DF,  p-value: 3.184e-13

Regression Equation

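The model output includes estimated coefficients, allowing us to write the regression equation as:

\[ \hat{y} = 5729.151 + 43.806x \]

where \(\hat{y}\) is the predicted number of plastic bottles collected in a month and \(x\) is the number of glass bottles collected in that month. The slope (43.806) indicates that each additional glass bottle is associated, on average, with approximately 43.81 more plastic bottles. The intercept (5729.151) is the predicted plastic bottle count for a month in which no glass bottles are collected.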
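
To show how the equation is used, the sketch below plugs a few glass bottle counts into it. The coefficients are copied from the summary(fit1) output; the helper name predict_plastic and the input values are illustrative assumptions.

```r
# Coefficients copied from the summary(fit1) output above
intercept <- 5729.151
slope <- 43.806

# Hypothetical helper: predicted monthly plastic bottles for a given glass bottle count
predict_plastic <- function(glass_bottles) {
  intercept + slope * glass_bottles
}

# Illustrative inputs (not taken from the dataset)
print(predict_plastic(c(0, 100, 200)))
```

With the fitted model object available, predict(fit1, newdata = data.frame(glass_bottles = 100)) should give the same prediction without retyping the coefficients.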

Residuals

Residuals are the differences between the observed and predicted values. The summary of residuals indicates: Min and Max: the residuals range from \(-16449.1\) to \(15798.8\), showing substantial variation in prediction errors. Median residual: \(-501.4\), which is small relative to that range, suggesting no strong bias toward over- or under-prediction.

Model Fit Metrics

Adjusted \(R^2\): The adjusted \(R^2\) value of \(0.3759\) indicates that the model explains approximately 37.59% of the variability in the number of plastic bottles collected.
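
In a regression with one predictor, the multiple R-squared equals the squared correlation coefficient, and the adjusted value follows from the sample size. The sketch below checks this with the numbers reported above; the sample size of 113 is inferred from the 111 residual degrees of freedom plus the two estimated coefficients.

```r
# Values copied from the outputs above
r <- 0.6176127   # correlation from cor()
n <- 113         # observations: 111 residual degrees of freedom + 2 coefficients
p <- 1           # number of predictors

# In simple linear regression, R-squared is the squared correlation
r_squared <- r^2

# Adjusted R-squared penalizes for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 4))
```

Both values agree with the summary(fit1) output (0.3814 and 0.3759), which confirms that the correlation and the model describe the same relationship.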

Summary

The first step in cleaning my data was to remove the columns containing only NA values using the dplyr function select, and to convert the column “dumpster” from a numeric variable to a categorical variable. Secondly, I used the mutate function and the lubridate package to restructure the date component, and summed each trash type by month. Next, I reshaped the data into long format with the pivot_longer function. Lastly, I created another quantitative variable (percentage) to enable the plotting of a treemap.

Project 1 has three plots. The first plot is a treemap that compares the overall scale and relative contributions of the trash types to the total trash collected. Plot 2 is a density plot of the log-transformed values, showing the statistical distribution (e.g., variability and trends) of collected amounts within each trash type. Plot 3 is a scatterplot (geom_point) with a fitted linear regression line, used to examine the correlation between two quantitative variables.

Plot 2 presents a density visualization of the log-transformed values of trash collection amounts, grouped by trash type. The x-axis depicts the log-transformed collected amounts, while the y-axis represents density. For a clear representation, the RColorBrewer “Set1” palette was used with a transparency (alpha) of 0.35 to highlight overlapping curves.

Different colors distinguish the trash types: cigarette butts (red), glass bottles (blue), plastic bags (green), plastic bottles (purple), and polystyrene (orange). When reading the plot, a curve's position along the x-axis reflects how large the monthly totals are, while the height of its peak reflects how tightly those totals cluster around a typical value; a tall peak does not by itself mean larger amounts. Consistent with the summary statistics and monthly totals shown earlier, cigarette butts have by far the largest counts and glass bottles the smallest, with plastic bottles, polystyrene, and plastic bags in between. A broad curve, as for polystyrene, signifies a wide range of monthly collection values. The overlapping density curves illustrate variability in collection sizes across the different trash types.