Final Project

Author

Emilio Sanchez San Martin

knitr::opts_chunk$set(echo = TRUE) #Used ChatGPT to figure out how to upload the image

Source: https://www.mrtrashwheel.com/

Brief Introduction

Once again, I am introducing the Mr. Trash Wheel data set. Mr. Trash Wheel, also known as the Inner Harbor Water Wheel, is a semi-autonomous trash interceptor that has been removing trash from Baltimore's Jones Falls River since May 2014. The data set includes the date, weight, and volume of trash for each filled dumpster, along with estimates for the number of plastic bottles, cigarette butts, and other types of items extracted. The data set is available at https://www.mrtrashwheel.com/ & https://docs.google.com/spreadsheets/d/1b8Lbez3PNb3H8nSsSjrwK2B0ReAblL2/edit#gid=1143432795. The wheel is located at the end of an outfall on the river, waiting for trash to flow into the machine! It’s a great example of how technology can be used to help the environment. The technology it uses to take in trash and record the data takes lots of machinery and time.

Last time, I worked with the Cigarette Butts variable since it was the MOST common type of litter found in Baltimore. I also chose the years 2014, 2020, and 2023 because I wanted to see if Covid had anything to do with this.

This time, I recently discovered that there are MORE data sets to work with, which means I am going to COMBINE the Mr. Trash Wheel data with the data from the OTHER TRASH WHEELS and use ALL the data to build my visualizations!

Professor Trash Wheel, Captain Trash Wheel, and Gwynnda Trash Wheel are the other trash wheels I will be using for this project. Each of them (in that order) was installed in the years after Mr. Trash Wheel.

Mr Trash Wheel was installed in 2014. Professor Trash Wheel was installed in 2016. Captain Trash Wheel was installed in 2018. Gwynnda Trash Wheel was installed in 2021.

If you want more information about each trash wheel, you can visit the “Meet the Family” page on the website. Together, these trash wheels protect the four corners of the harbor in Baltimore. Using renewable energy, the Mr. Trash Wheel family “cleans the Harbor, educates the public, and provides data to inform anti-litter legislation.” (Source: https://www.mrtrashwheel.com/). I can’t wait to explore the data sets and the impact of the trash being collected.

I will be combining all the data sets to find a visualization that will show the amount of trash collected by each trash wheel, and see if there is a correlation between the trash wheels!

Here are the Variables for these datasets

Variable: Description
Month: Month
Year: Year
Date: Date
Weight: Weight in Tons
Volume: Volume in Cubic Yards
PlasticBottles: Number of Plastic Bottles
Polystyrene: Number of Polystyrene Items
CigaretteButts: Number of Cigarette Butts
GlassBottles: Number of Glass Bottles
PlasticBags: Number of Plastic Bags
Wrappers: Number of Wrappers
SportsBalls: Number of Sports Balls
HomesPowered: Each ton of trash = average 500 kilowatts of electricity. (Average household will use 30 kilowatts per day).

I want to explore new variables, see if there is a correlation between different variables, and perform tests to see how the variables are related to each other. I’m now going to put the data set into a data frame and start exploring the data.

First, as always, load the necessary libraries and the data set.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tibble)
library(GGally) #data analysis
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
library(RColorBrewer) #may or may not use for color graphs
library(plotly) #use for visualizations

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(highcharter) #I will use this for visualizations
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
library(patchwork) #I will use this for visualizations.
library(ggfortify)

Now, I will upload the data set and put it into a data frame.

setwd("~/Downloads/Data 101 and Data 110 class/Data 110")

MrTrash <- read_csv("Data Sets/mrtrash.csv", skip = 1)
New names:
Rows: 630 Columns: 16
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(2): Month, Date dbl (2): Dumpster, Year num (10): Weight (tons), Volume (cubic
yards), Plastic Bottles, Polystyrene,... lgl (2): ...15, ...16
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...15`
• `` -> `...16`
ProfessorTrash <- read_csv("Data Sets/professortrash.csv", skip = 1)
Rows: 116 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Month, Date
dbl (3): Dumpster, Year, Weight (tons)
num (8): Volume (cubic yards), Plastic Bottles, Polystyrene, Cigarette Butts...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
CaptainTrash <- read_csv("Data Sets/captaintrash.csv", skip = 1)
New names:
Rows: 32 Columns: 16
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(2): Month, Date dbl (5): Dumpster, Year, Weight (tons), Volume (cubic yards),
Homes Powered* num (5): Plastic Bottles, Polystyrene, Cigarette Butts, Plastic
Bags, Wrappers lgl (4): ...13, ...14, ...15, ...16
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...13`
• `` -> `...14`
• `` -> `...15`
• `` -> `...16`
GwynndaTrash <- read_csv("Data Sets/gwynndatrash.csv", skip = 1)
Rows: 221 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Month, Date
dbl (3): Dumpster, Year, Weight (tons)
num (7): Volume (cubic yards), Plastic Bottles, Polystyrene, Cigarette Butts...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Trash Wheels! (USED SKIP 1 to remove the first row of the data set that was useless)
#Can't believe I found more data sets to work with, this is gonna be good! The cleaning process might take a while though!

From looking at the variables in the Environment for each data set, I realize that CaptainTrash and GwynndaTrash do not have data on Glass Bottles and Sports Balls, and ProfessorTrash doesn’t have data on Sports Balls. Interesting!

Data Cleaning

For now, I am going to check for NAs, rename variables, and so on to get the perfect data set to work with! I’m doing this with every data set too! This might be the most boring part of my project haha

Let me check NA’s..

sum(is.na(MrTrash))
[1] 1264
sum(is.na(ProfessorTrash))
[1] 22
sum(is.na(CaptainTrash))
[1] 169
sum(is.na(GwynndaTrash))
[1] 121

I will remove the empty rows and columns found in the data set. (Went through every data set)

MrTrash <- head(MrTrash, -1)
MrTrash <- MrTrash[-c(15:16)]
ProfessorTrash <- head(ProfessorTrash, -2)
CaptainTrash <- head(CaptainTrash, -2)
CaptainTrash <- CaptainTrash[-c(13:16)]
GwynndaTrash <- head(GwynndaTrash, -1)
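
Removing rows and columns by position works for this download, but the positions could shift if the spreadsheet is updated. Here is a minimal sketch of a position-free alternative in base R. It assumes (this is an assumption, not something the files confirm) that the extra columns are entirely NA and that the non-data rows at the bottom have no Dumpster number. The helper name drop_empty and the toy values are made up for illustration.

```r
# Hypothetical helper: drop all-NA columns, then drop rows with no value
# in a key column (assumed here to be Dumpster).
drop_empty <- function(df, key = "Dumpster") {
  all_na_col <- vapply(df, function(col) all(is.na(col)), logical(1))
  df <- df[, !all_na_col, drop = FALSE]
  df[!is.na(df[[key]]), , drop = FALSE]
}

# Toy data standing in for one of the trash wheel files (values are made up).
toy <- data.frame(
  Dumpster = c(1, 2, NA),
  Weight   = c(4.3, 2.7, 7.0),  # last row mimics a totals row
  Extra    = c(NA, NA, NA)      # mimics an empty column such as ...15
)

cleaned <- drop_empty(toy)
cleaned
```

If the assumption holds, the same helper could be applied to each of the four data sets instead of the separate head() and column-index calls.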

Now for the renaming process!

# Rename columns so they don't have spaces

MrTrash <- MrTrash |> 
  rename(
    Weight = "Weight (tons)",
    Volume = "Volume (cubic yards)",
    PlasticBottles = "Plastic Bottles",
    CigaretteButts = "Cigarette Butts",
    GlassBottles = "Glass Bottles",
    PlasticBags = "Plastic Bags",
    SportsBalls = "Sports Balls",
    HomesPowered = "Homes Powered*"
  )

ProfessorTrash <- ProfessorTrash |> 
  rename(
    Weight = "Weight (tons)",
    Volume = "Volume (cubic yards)",
    PlasticBottles = "Plastic Bottles",
    CigaretteButts = "Cigarette Butts",
    GlassBottles = "Glass Bottles",
    PlasticBags = "Plastic Bags",
    HomesPowered = "Homes Powered*"
  )

CaptainTrash <- CaptainTrash |> 
  rename(
    Weight = "Weight (tons)",
    Volume = "Volume (cubic yards)",
    PlasticBottles = "Plastic Bottles",
    CigaretteButts = "Cigarette Butts",
    PlasticBags = "Plastic Bags",
    HomesPowered = "Homes Powered*"
  )

GwynndaTrash <- GwynndaTrash |> 
  rename(
    Weight = "Weight (tons)",
    Volume = "Volume (cubic yards)",
    PlasticBottles = "Plastic Bottles",
    CigaretteButts = "Cigarette Butts",
    PlasticBags = "Plastic Bags",
    HomesPowered = "Homes Powered*"
  )
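
The four rename() calls repeat the same name pairs, with a few left out for the wheels that lack those columns. One way to avoid the repetition, sketched here in base R, is a single lookup that is applied only to the columns a data set actually has. The helper name rename_present and the toy data frame are hypothetical; the name pairs are copied from the code above.

```r
# Lookup of new name = old name, taken from the rename() calls above.
lookup <- c(
  Weight         = "Weight (tons)",
  Volume         = "Volume (cubic yards)",
  PlasticBottles = "Plastic Bottles",
  CigaretteButts = "Cigarette Butts",
  GlassBottles   = "Glass Bottles",
  PlasticBags    = "Plastic Bags",
  SportsBalls    = "Sports Balls",
  HomesPowered   = "Homes Powered*"
)

# Hypothetical helper: rename only the columns that are present in df.
rename_present <- function(df, lookup) {
  hit <- match(names(df), lookup)  # position in lookup, NA if not listed
  names(df)[!is.na(hit)] <- names(lookup)[hit[!is.na(hit)]]
  df
}

# Toy data frame with a subset of the original column names (values made up).
toy <- data.frame(
  `Weight (tons)` = 1.5, `Plastic Bags` = 300, Wrappers = 120,
  check.names = FALSE
)

renamed <- rename_present(toy, lookup)
names(renamed)
```

Columns that are not in the lookup (such as Wrappers) keep their names, and lookup entries with no matching column are simply ignored.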

colnames(MrTrash)
 [1] "Dumpster"       "Month"          "Year"           "Date"          
 [5] "Weight"         "Volume"         "PlasticBottles" "Polystyrene"   
 [9] "CigaretteButts" "GlassBottles"   "PlasticBags"    "Wrappers"      
[13] "SportsBalls"    "HomesPowered"  
colnames(ProfessorTrash)
 [1] "Dumpster"       "Month"          "Year"           "Date"          
 [5] "Weight"         "Volume"         "PlasticBottles" "Polystyrene"   
 [9] "CigaretteButts" "GlassBottles"   "PlasticBags"    "Wrappers"      
[13] "HomesPowered"  
colnames(CaptainTrash)
 [1] "Dumpster"       "Month"          "Year"           "Date"          
 [5] "Weight"         "Volume"         "PlasticBottles" "Polystyrene"   
 [9] "CigaretteButts" "PlasticBags"    "Wrappers"       "HomesPowered"  
colnames(GwynndaTrash)
 [1] "Dumpster"       "Month"          "Year"           "Date"          
 [5] "Weight"         "Volume"         "PlasticBottles" "Polystyrene"   
 [9] "CigaretteButts" "PlasticBags"    "Wrappers"       "HomesPowered"  

I want to add a new variable to each data set that will show the names of the trash wheels. This will help me when I combine all the data sets together and identify which trash wheel the data is coming from.

MrTrash$TrashWheel <- "MrTrash"
ProfessorTrash$TrashWheel <- "ProfessorTrash"
CaptainTrash$TrashWheel <- "CaptainTrash"
GwynndaTrash$TrashWheel <- "GwynndaTrash"

We are ready to combine the data sets!

AllTrash <- bind_rows(MrTrash, ProfessorTrash, CaptainTrash, GwynndaTrash)

Data Manipulation for FIRST VISUALIZATION

For my next part, I want to make a data visualization with highcharter that will make a graph showing x = Year and y = total sum of trash, grouped by each type of trash there is; HOWEVER, I can’t do that unless I gather all the trash type columns into a single variable. I will do that now! (I used ChatAI to help do this, and RE-learned the function pivot_longer)

AllTrashWithLabel <- AllTrash |> 
  pivot_longer(
    cols = c(PlasticBottles, Polystyrene, CigaretteButts, GlassBottles, PlasticBags, Wrappers, SportsBalls),
    names_to = "TrashType",
    values_to = "TrashAmount"
  )
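
One thing to keep in mind with the long format: columns that are not pivoted (such as Weight and Volume) are repeated once for every trash type. The toy example below is a small sketch of that effect; the data frame and its values are made up, and only two trash columns are used to keep it short.

```r
library(tidyr)

# Toy data: two dumpsters, one volume each, two trash type columns.
toy <- data.frame(
  Dumpster       = 1:2,
  Volume         = c(15, 18),
  PlasticBottles = c(100, 200),
  Wrappers       = c(10, 20)
)

long <- pivot_longer(
  toy,
  cols = c(PlasticBottles, Wrappers),
  names_to = "TrashType",
  values_to = "TrashAmount"
)

nrow(long)        # one row per dumpster per trash type
sum(toy$Volume)   # volume counted once per dumpster
sum(long$Volume)  # volume counted once per trash type, so it is doubled here
```

So sums of TrashAmount are safe on the long data, but sums of Weight or Volume are better taken from the wide data (or after keeping one row per dumpster).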

Now that the trash types are in a single variable, I am going to group by TrashType and Year, and summarize the total amount of trash collected for each type of trash.

AllTrashForVis <- AllTrashWithLabel |>
  group_by(TrashType, Year) |>
  summarize(TotalTrash = sum(TrashAmount, na.rm = TRUE))
`summarise()` has grouped output by 'TrashType'. You can override using the
`.groups` argument.

Data Manipulation for SECOND VISUALIZATION

I will base this visualization on my data analysis. I want to explore a variable I haven’t explored yet: the “HomesPowered” variable! I want to analyze and visualize the relationship between the amount of trash collected (e.g., weight or volume) and the estimated number of homes powered. This could provide insights into the impact of waste management efforts on energy production and sustainability.

First, I want to see the class of the Date variable in the AllTrash data set.

class(AllTrash$Date)
[1] "character"

This is a character value. I am going to convert this to a date value so I can group by the dates.

AllTrash$Date <- as.Date(AllTrash$Date, format = "%m/%d/%Y")
head(AllTrash$Date)
[1] "2014-05-16" "2014-05-16" "2014-05-16" "2014-05-17" "2014-05-17"
[6] "2014-05-20"
class(AllTrash$Date)
[1] "Date"

Some observations of the Date variable are shown with the year as 0014 or 0021. This is not correct; they should be 2014 and 2021, and I will fix them after summarizing.

Now that I have converted the Date variable to a date value, I am going to group by the Date variable and summarize the total amount of trash in Weight and Volume collected for each date. I will also summarize the total number of homes powered for each date. I will ALSO add a sum for EVERY trash variable so I can put it into my visualization later with Plotly.

AllTrashDataAna <- AllTrash |> 
  group_by(Date, Year) |> 
  summarize(TotalTrashWeight = sum(Weight, na.rm = TRUE),
            TotalTrashVolume = sum(Volume, na.rm = TRUE),
            TotalHomesPowered = sum(HomesPowered, na.rm = TRUE),
            PlasticBottles = sum(PlasticBottles, na.rm = TRUE),
            Polystyrene = sum(Polystyrene, na.rm = TRUE),
            CigaretteButts = sum(CigaretteButts, na.rm = TRUE),
            GlassBottles = sum(GlassBottles, na.rm = TRUE),
            PlasticBags = sum(PlasticBags, na.rm = TRUE),
            Wrappers = sum(Wrappers, na.rm = TRUE),
            SportsBalls = sum(SportsBalls, na.rm = TRUE))
`summarise()` has grouped output by 'Date'. You can override using the
`.groups` argument.

The first 30 observations of the Date variable are shown with the year as 0014 or 0021. This is not correct. I will change observations 1:4 to 2014 and observations 5:30 to 2021, with their correct dates.

# First four observations

AllTrashDataAna$Date[1:1] <- as.Date("2014-11-06")
AllTrashDataAna$Date[2:2] <- as.Date("2014-11-17")
AllTrashDataAna$Date[3:3] <- as.Date("2014-11-20")
AllTrashDataAna$Date[4:4] <- as.Date("2014-11-28")

# Next 26 observations

AllTrashDataAna$Date[5:5] <- as.Date("2021-02-17")
AllTrashDataAna$Date[6:6] <- as.Date("2021-03-03")
AllTrashDataAna$Date[7:7] <- as.Date("2021-03-23")
AllTrashDataAna$Date[8:8] <- as.Date("2021-03-25")
AllTrashDataAna$Date[9:9] <- as.Date("2021-03-27")
AllTrashDataAna$Date[10:10] <- as.Date("2021-03-31")
AllTrashDataAna$Date[11:11] <- as.Date("2021-04-13")
AllTrashDataAna$Date[12:12] <- as.Date("2021-04-15")
AllTrashDataAna$Date[13:13] <- as.Date("2021-05-04")
AllTrashDataAna$Date[14:14] <- as.Date("2021-05-06")
AllTrashDataAna$Date[15:15] <- as.Date("2021-05-08")
AllTrashDataAna$Date[16:16] <- as.Date("2021-05-28")
AllTrashDataAna$Date[17:17] <- as.Date("2021-06-01")
AllTrashDataAna$Date[18:18] <- as.Date("2021-06-04")
AllTrashDataAna$Date[19:19] <- as.Date("2021-06-11")
AllTrashDataAna$Date[20:20] <- as.Date("2021-06-12")
AllTrashDataAna$Date[21:21] <- as.Date("2021-06-15")
AllTrashDataAna$Date[22:22] <- as.Date("2021-07-02")
AllTrashDataAna$Date[23:23] <- as.Date("2021-07-14")
AllTrashDataAna$Date[24:24] <- as.Date("2021-07-21")
AllTrashDataAna$Date[25:25] <- as.Date("2021-08-03")
AllTrashDataAna$Date[26:26] <- as.Date("2021-08-12")
AllTrashDataAna$Date[27:27] <- as.Date("2021-08-16")
AllTrashDataAna$Date[28:28] <- as.Date("2021-08-18")
AllTrashDataAna$Date[29:29] <- as.Date("2021-08-20")
AllTrashDataAna$Date[30:30] <- as.Date("2021-11-30")
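
Editing thirty rows by position works here, but it depends on the exact row order. A possible alternative, sketched below, is to fix the problem when the text is first parsed. It assumes (I have not confirmed this in the raw files) that the bad years come from dates typed with a two-digit year, such as 11/6/14, which the %Y format reads as the year 14. The helper name parse_mixed_dates and the sample strings are made up.

```r
# Hypothetical helper: parse m/d/Y strings, and re-parse the ones that end
# in a 1-2 digit year with %y (which maps 00-68 to 2000-2068).
parse_mixed_dates <- function(x) {
  out <- as.Date(x, format = "%m/%d/%Y")
  short_year <- grepl("/[0-9]{1,2}$", x)  # e.g. "11/6/14" but not "5/16/2014"
  out[short_year] <- as.Date(x[short_year], format = "%m/%d/%y")
  out
}

raw <- c("5/16/2014", "11/6/14", "2/17/21")  # made-up examples of both styles
parsed <- parse_mixed_dates(raw)
parsed
```

This would have to run on the original character column, in place of the as.Date() call, before any grouping.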

Finally, I will convert the Year variable to a factor variable so I can use it in my visualization.

AllTrashDataAna$Year <- as.factor(AllTrashDataAna$Year)
class(AllTrashDataAna$Year)
[1] "factor"

Data Analysis

I am going to FIRST see the correlation between Total Trash Weight and Homes Powered to see if there is a relationship between the two variables.

cor(AllTrashDataAna$TotalTrashWeight, AllTrashDataAna$TotalHomesPowered, use = "complete.obs")
[1] 0.9349281

As expected, there is a really high correlation between the Total Trash Weight and Homes Powered variables. This means that the more trash collected, the more homes powered. This is a good sign that the trash wheels are doing their job!

# With ggpairs, use ignore warning to ignore the warning message
ggpairs(AllTrashDataAna, columns = c(2:4), progress = FALSE) #ignore progress
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interesting! The Total Trash Weight and Total Homes Powered variables have a positive correlation. HOWEVER, there are outliers that the overall trend does not explain well! If some observations show low total homes powered while the weight is high, we need to explore why that is! As my next objective, I am going to make a multiple regression model to see if there is a relationship between the Homes Powered variable and the total trash collected variables.

Multiple regression model.

I am going to use the lm() function to create a multiple regression model with the Homes Powered variable as the response variable and the total weight and volume as the predictors. I am going to use the AllTrashDataAna data set to create the model and, ultimately, find the adjusted R-squared value to see how well the model fits the data.

(Notes below)

  1. Look at the p-value for each variable - if it is relatively small ( < 0.10), then it is likely contributing to the model.

  2. Check out the residual plots. A good model will have a relatively straight horizontal blue line across the scatter plot between residuals plotted with fitted values. The more curved the blue line, the more likely that a better model exists.

  3. Look at the output for the Adjusted R-Squared value at the bottom of the output. The interpretation is:

X% (where X comes from the adjusted R-squared value) of the variation in the observations may be explained by this model. The higher the adjusted R-squared value, the better the model. We use the adjusted R-squared value because it compensates for more predictors mathematically increasing the normal R-squared value.

What do these diagnostic plots mean?

  1. Residual plot essentially indicates whether a linear model is appropriate - you can see this by the blue line showing relatively horizontal. If it is not relatively horizontal, a linear plot may not be appropriate.

  2. QQPlot indicates whether the distribution is relatively normal. Observations that might be outliers are indicated by their row number.

  3. Scale-Location indicates homogeneous variance. Influential observations that are skewing the variance distribution are indicated.

  4. Cook’s Distance indicates which outliers have high leverage, meaning that some outliers may not cause the model to violate basic assumptions required for the regression analysis (see #1-3). If outliers have high leverage, then they may be causing problems for your model. You can try to remove those observations, especially if they appear in any of the other 3 plots above.
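
To make the notes on adjusted R-squared and Cook's distance concrete, here is a small sketch in base R. It uses the built-in mtcars data as a stand-in (not the trash wheel data), and the variable names are only for illustration.

```r
# Built-in mtcars data stands in for the project data here.
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

n <- nrow(mtcars)  # number of observations
p <- 2             # number of predictors (wt and hp)

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
all.equal(adj_manual, s$adj.r.squared)

# Observations with the largest Cook's distance are the most influential.
cooks <- cooks.distance(fit)
head(sort(cooks, decreasing = TRUE), 3)
```

The same formula can be checked by hand against any lm() summary: with a multiple R-squared of 0.8795, 2 predictors, and 521 residual degrees of freedom (so n = 524), it gives about 0.879.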


fit1 <- lm(TotalHomesPowered ~ TotalTrashWeight + TotalTrashVolume, data = AllTrashDataAna)
summary(fit1)

Call:
lm(formula = TotalHomesPowered ~ TotalTrashWeight + TotalTrashVolume, 
    data = AllTrashDataAna)

Residuals:
     Min       1Q   Median       3Q      Max 
-294.330    1.356    5.705    9.462   32.996 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -7.6052     2.1134  -3.599  0.00035 ***
TotalTrashWeight  11.2826     0.9949  11.340  < 2e-16 ***
TotalTrashVolume   1.0991     0.2271   4.840 1.71e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.12 on 521 degrees of freedom
Multiple R-squared:  0.8795,    Adjusted R-squared:  0.879 
F-statistic:  1901 on 2 and 521 DF,  p-value: < 2.2e-16
autoplot(fit1, 1:4, nrow=2, ncol=2)

As you can see, the blue line is NOT curved. The adjusted R-squared value is 0.879, which means that about 87.9% of the variation in the Homes Powered variable can be explained by the weight and volume variables. (Of course it can, but I am not too focused on that. I am more focused on the dates and the trash variables that led to outliers in the data set.) As you can see, observations like 78, 31, and 46 stand out as outliers when comparing the HomesPowered variable with the TotalTrashWeight and TotalTrashVolume variables. It doesn’t make sense: if the weight is extremely high, why wouldn’t the Homes Powered variable be high too? This is something I need to look at.

AllTrashDataAna[31,]
# A tibble: 1 × 12
# Groups:   Date [1]
  Date       Year  TotalTrashWeight TotalTrashVolume TotalHomesPowered
  <date>     <fct>            <dbl>            <dbl>             <dbl>
1 2014-05-16 2014              10.5               46                 0
# ℹ 7 more variables: PlasticBottles <dbl>, Polystyrene <dbl>,
#   CigaretteButts <dbl>, GlassBottles <dbl>, PlasticBags <dbl>,
#   Wrappers <dbl>, SportsBalls <dbl>
AllTrashDataAna[46,]
# A tibble: 1 × 12
# Groups:   Date [1]
  Date       Year  TotalTrashWeight TotalTrashVolume TotalHomesPowered
  <date>     <fct>            <dbl>            <dbl>             <dbl>
1 2014-07-15 2014              7.19               30                 0
# ℹ 7 more variables: PlasticBottles <dbl>, Polystyrene <dbl>,
#   CigaretteButts <dbl>, GlassBottles <dbl>, PlasticBags <dbl>,
#   Wrappers <dbl>, SportsBalls <dbl>
AllTrashDataAna[78,]
# A tibble: 1 × 12
# Groups:   Date [1]
  Date       Year  TotalTrashWeight TotalTrashVolume TotalHomesPowered
  <date>     <fct>            <dbl>            <dbl>             <dbl>
1 2015-06-10 2015              24.5               96                80
# ℹ 7 more variables: PlasticBottles <dbl>, Polystyrene <dbl>,
#   CigaretteButts <dbl>, GlassBottles <dbl>, PlasticBags <dbl>,
#   Wrappers <dbl>, SportsBalls <dbl>

As you can see, the first two outliers are from the same year, about two months apart (2014-05-16 and 2014-07-15), and the third outlier (2015-06-10) falls about a year after them. This is interesting because the weight is high, but the homes powered variable is low (zero for both 2014 observations). This is something I need to look at. When I plot my visualization, maybe I can include the dates to see how they relate to the homes powered variable, and whether a specific period of time affects how homes powered responds to the total trash.

We of course don’t need to worry about the p-value being too high, as it is < 2.2e-16. This is just more proof that Weight and Volume are contributing to the model. (Like I said before, I am not too focused on this; I am more focused on the outliers.)

Exploration

This may not be too important, but since my next visualization will focus on weight and homes powered, I want to see the distribution of weight across the trash wheels. I am going to use the AllTrash data set to see if there is a relationship between the trash wheels and their weight.

# I am going to put in order of trashwheel to see the weight distribution for each trash wheel.
AllTrash <- AllTrash |>
  mutate(TrashWheel = factor(TrashWheel, levels = c("MrTrash", "ProfessorTrash", "CaptainTrash", "GwynndaTrash")))

AllTrash |>
  ggplot(aes(x = TrashWheel, y = Weight, color = TrashWheel)) +
  geom_boxplot() +
  scale_color_manual(values = c("#25A5D8", "#20A458", "#E4BC20", "#BC85B9")) + #Specified the colors for trash wheel
  labs(title = "Trash Wheel Weight Distribution",
       x = "Trash Wheel",
       y = "Weight (Tons)",
       color = "Trash Wheel") +
  theme_minimal()

# Color each trash wheel differently to see the weight distribution for each trash wheel. Use red yellow blue and green

As you can see, the weight distribution for each trash wheel is different. Mr. Trash Wheel has the highest weights overall, while Professor Trash Wheel has the lowest.

This is actually REALLY interesting to see! Gwynnda Trash Wheel has high weights compared to Professor Trash Wheel and Captain Trash Wheel, even though it was only installed in 2021. That suggests that the amount of trash found in Gwynns Falls, Baltimore, MD has increased over the years. This is a good sign that the trash wheels are doing their job, and it also means that this certain place has a LOT of trash!

What this ALSO means is that Mr. Trash Wheel and Gwynnda Trash Wheel are powering the most homes, as they collect the most weight, especially with Gwynnda being the most recently installed trash wheel. This could mean that these two are the most effective trash wheels for keeping electricity in homes. (I think, however, that Gwynnda would be more effective, as it is the most recent trash wheel installed and it is already collecting SOO much trash!)

Let’s plot the average weight for each trash wheel to see the difference between the median and the average.

AllTrash |>
  group_by(TrashWheel) |>
  summarize(AvgWeight = mean(Weight, na.rm = TRUE)) |>
  ggplot(aes(x = TrashWheel, y = AvgWeight, fill = TrashWheel)) + #Add color to the bars "blue", "green", "orange", "purple"
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("#25A5D8", "#20A458", "#E4BC20", "#BC85B9")) + #Specified the colors for trash wheel
  labs(title = "Average Weight Distribution for Each Trash Wheel",
       x = "Trash Wheel",
       y = "Average Weight (Tons)",
       fill = "Trash Wheel",
       caption = "Source: https://www.mrtrashwheel.com/") +
  theme_minimal()

Really interesting, it’s almost the same exact distribution as the box plot! Mr. Trash Wheel has the highest average weight distribution, while Professor Trash Wheel has the lowest average weight distribution.

For my next step, I want to make a visualization that will show the outliers in the AllTrashDataAna data set. With this, I can see possible YEARS that have outliers where the weight of the trash is not reflected in the homes powered variable, and we can see in which year or on which date the trash wheel company did not use trash for electricity. This is NOT one of my final visualizations, but I want to see the outliers and compare each year so I can make a clear visualization. I will also make this interactive with Plotly and add the date to the tooltip so I can see which dates have the most outliers.

Help_For_Final_Vis2 <- AllTrashDataAna |>
  ggplot(aes(x = TotalTrashWeight, y = TotalHomesPowered, color = Year, text = paste(
    "Date:", Date))) +
  geom_point() +
  labs(title = "Total Trash Weight vs. Total Homes Powered: 2014 - 2023",
       x = "Total Trash Weight (Tons)",
       y = "Total Homes Powered",
       color = "Year") +
  theme_minimal()

Help_For_Final_Vis2 <- ggplotly(plot = Help_For_Final_Vis2)
Help_For_Final_Vis2 <- layout(Help_For_Final_Vis2, annotations = list(
  text = "Source: Mr. Trash Wheel",
  x = 27, y = 10,  #Use ChatAI to find how to add a source
  showarrow = FALSE
))

Help_For_Final_Vis2

No way! We can see that for every year except 2014 and 2015, the weight of the trash collected is reflected in the homes powered variable. This raises another question: why did the trash collected in 2014 and 2015 not show up in the homes powered variable? What did the Trash Wheel Company do with the trash from those years? Did they use it for something other than electricity? I would like to know. You can even see that on 2015-06-10, the observation with the highest weight in tons clearly did NOT contribute as much to the total homes powered.

For my visualization, I will show the years 2014, 2015, and 2021 to compare Total Trash Weight and Total Homes Powered for each year. I chose 2014 and 2015 because they have outliers, and I am specifically choosing 2021 because every year except 2014 and 2015 has a perfect linear relationship between the weight of the trash and the homes powered variable. Since 2021 has the highest observation for Total Trash Weight and Homes Powered, it will be the best line to show how the linear regression looks in comparison with the other two years (2014 and 2015).

For my second visualization, I also want to see how many homes could have been powered with the trash collected in 2014 and 2015. For this, I will calculate the total weight collected in those years, using the AllTrashDataAna data set, then calculate how many homes COULD HAVE been powered, and compare that with the actual number of homes powered in those years (of course 2014 was 0, but 2015 has some observations). I also want to make a tibble to show this more clearly.

TotalTrashWeight2014 <- sum(AllTrashDataAna$TotalTrashWeight[AllTrashDataAna$Year == 2014], na.rm = TRUE)
TotalTrashWeight2015 <- sum(AllTrashDataAna$TotalTrashWeight[AllTrashDataAna$Year == 2015], na.rm = TRUE)

TotalHomesPowered2014 <- sum(AllTrashDataAna$TotalHomesPowered[AllTrashDataAna$Year == 2014], na.rm = TRUE)
TotalHomesPowered2015 <- sum(AllTrashDataAna$TotalHomesPowered[AllTrashDataAna$Year == 2015], na.rm = TRUE)

TotalHomesPowered2014
[1] 0
TotalHomesPowered2015
[1] 2714
# I will make a tibble for this to see the data clearly
PoweredHomes <- tibble(Year = c(2014, 2015),
                       TotalTrashWeight = c(TotalTrashWeight2014, TotalTrashWeight2015),
                       TotalHomesPowered = c(TotalHomesPowered2014, TotalHomesPowered2015))
PoweredHomes
# A tibble: 2 × 3
   Year TotalTrashWeight TotalHomesPowered
  <dbl>            <dbl>             <dbl>
1  2014             141.                 0
2  2015             239.              2714

As I wrote earlier, each ton of trash = average 500 kilowatts of electricity, and the average household will use 30 kilowatts per day. That means that 1 ton of trash = about 16.67 homes powered (500/30 ≈ 16.67, which I round up to 17). I am going to calculate how many homes could have been powered with the trash collected in 2014 and 2015.

PoweredHomes <- PoweredHomes |>
  mutate(HomesCouldHavePowered = TotalTrashWeight / 1 * 17)

PoweredHomes
# A tibble: 2 × 4
   Year TotalTrashWeight TotalHomesPowered HomesCouldHavePowered
  <dbl>            <dbl>             <dbl>                 <dbl>
1  2014             141.                 0                 2403.
2  2015             239.              2714                 4060.
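
Rounding 16.67 up to 17 makes the estimate about 2% higher than the unrounded ratio. As a rough check, the sketch below repeats the calculation without rounding. The function name estimate_homes is made up, and the weights are the rounded totals shown in the tibble above (141 and 239 tons), so the results are approximate.

```r
kw_per_ton          <- 500  # from the data dictionary: electricity per ton of trash
kw_per_home_per_day <- 30   # from the data dictionary: average household use per day

# Homes that one ton of trash can power for a day, without rounding.
homes_per_ton <- kw_per_ton / kw_per_home_per_day

# Hypothetical helper: potential homes powered for a given weight in tons.
estimate_homes <- function(weight_tons) weight_tons * homes_per_ton

estimate_homes(c(141, 239))  # roughly 2350 and 3983 homes
```

Either way the conclusion is the same: the potential number of homes is well above the recorded number for both years.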

No way! We see that in 2014, around 2403 homes could have been powered with the trash collected, and in 2015, around 4060 homes could have been powered. This is interesting because the actual number of homes powered in 2015 was 2714. This means that a sizable share of the trash collected in 2015 (roughly a third) was not counted toward electricity, and maybe went to waste. What an amazing finding! I’ll also show this in my visualization.

Last Data Manipulation for Third Visualization

For my final visualization, I want to see which month collected the MOST trash from years 2021 to 2023. The reason why I choose these specific years is because 2021 is when the Gwynnda Trash Wheel was installed, so combining all the amount of trash collected from 2021 to 2023 will show how much trash (in volume) was collected in total (to see how much trash is taking up the space in Baltimore’s river), and which month collected the most trash (it’s also the most recent years). I want add the trash types because we ALL know that Cigarette butts is the main reason why the trash gets so high, and the with visualization I did earlier, we don’t really need to identify which trash type influenced the amount of trash found (since that’s cigarette butts). I will use the AllTrashWithLabel data set I created for the first visualization, and group by the Month and Year variables, and summarize the total amount of trash collected for each month.

# What I first noticed about the AllTrashWithLabel data set is that the months are in lowercase. I am going to change them to uppercase so they look better in the visualization.

AllTrashWithLabel$Month <- toupper(AllTrashWithLabel$Month)

# Now, I am going to group by the Month and Year variables, and summarize the total amount of trash collected for each month.

AllTrashMonth <- AllTrashWithLabel |>
  filter(Year %in% c(2021, 2022, 2023)) |>
  group_by(Month, Year) |>
  summarize(TotalTrashVolume = sum(Volume, na.rm = TRUE))
`summarise()` has grouped output by 'Month'. You can override using the
`.groups` argument.
# Change the months class to character
AllTrashMonth$Month <- as.character(AllTrashMonth$Month)

# Finally, I want the months to be in order, so I am going to convert the Month variable to a factor variable and specify the levels in order.

AllTrashMonth <- AllTrashMonth |>
  mutate(Month = factor(Month, levels = c("JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE", "JULY", "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER"), ordered = TRUE))

# Reorder observations by month
AllTrashMonth <- AllTrashMonth |>
  arrange(Month)

# (I want to include this note in my project because it is IMPORTANT to me: I struggled so much trying to order the data correctly by month, and finally realized that arrange() was the right function to sort the data by month for the visualization. I am so happy that I figured this out, I was going insane trying to solve it, hahaha!)
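
The month-ordering steps above can be condensed into one pipeline. This is only a minimal sketch on a made-up `toy` table (the values are hypothetical, not the real trash wheel data); it assumes the same column names as AllTrashWithLabel and uses base R's built-in `month.name` constant for the level order.

```r
library(dplyr)

# Hypothetical rows, deliberately out of calendar order and in mixed case
toy <- tibble(
  Month = c("march", "January", "FEBRUARY", "january"),
  Year = c(2021, 2021, 2021, 2022),
  Volume = c(10, 20, 30, 40)
)

monthly <- toy |>
  # Uppercase the names, then make an ordered factor in calendar order
  mutate(Month = factor(toupper(Month),
                        levels = toupper(month.name),
                        ordered = TRUE)) |>
  group_by(Month, Year) |>
  # .groups = "drop" returns an ungrouped result and silences the grouping message
  summarize(TotalTrashVolume = sum(Volume, na.rm = TRUE), .groups = "drop") |>
  # arrange() sorts by factor level, so JANUARY comes before FEBRUARY
  arrange(Month, Year)

monthly
```

Because `Month` is a factor with explicit levels, `arrange()` sorts by calendar position rather than alphabetically, which is the behavior the chart needs.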

Visualization

1. - Total Trash Collected by Type: Year 2014 - 2023

highchart() |>
  hc_add_series(data = AllTrashForVis, # selects dataframe
                type = "streamgraph", # sets the graph type, streamgraph
                hcaes(x = Year, y = TotalTrash, group = TrashType)) |>
  hc_title(text = "Total Trash Collected by Type: Year 2014 - 2023") |>
  hc_subtitle(text = "Baltimore's Mr. Trash Wheel Family") |>
  hc_xAxis(categories = unique(AllTrashForVis$Year)) |>
  hc_yAxis(title = list(text = "Total Trash Volume")) |>
  hc_legend(title = list(text = "Types of Trash"),
            floating = TRUE, #Used ChatAI to put legend in the middle
            x = 100, y = -60, 
            enabled = TRUE) |>
  hc_colors(c("#93b7be", "#2d3047", "#c8553d", "#f28f3b", "#ffd5c2", "#6a994e", "#f0ead2")) |>
  hc_plotOptions(streamgraph = list(label = list(enabled = TRUE))) |> # Used ChatAI to figure out how to add labels
  hc_chart(backgroundColor = "rgba(255, 255, 255, 0.7)", # Used ChatAI for background transparency
           divBackgroundImage = "https://wordpress.wbur.org/wp-content/uploads/2019/04/0416_trashwheel01.jpg",
           style = list(fontFamily = "Verdana", fontWeight = "bold")) |>
  hc_tooltip(shadow = TRUE,
             crosshairs = TRUE,
             borderWidth = 3) |> # Used Alexandra Veremeychik Project 2 Assignment to learn crosshairs argument.
  hc_labels(items = list(
    list(html = "Mr. Trash Wheel (2014)", style = list(left = "500%", top = "20%")),
    list(html = "Professor Trash Wheel (2016)", style = list(left = "500%", top = "40%")),
    list(html = "Captain Trash Wheel (2018)", style = list(left = "500%", top = "60%")),
    list(html = "Gwynnda Trash Wheel (2021)", style = list(left = "500%", top = "80%"))
  )) |> # Used ChatAI to figure out how to add labels
  hc_caption(text = "Data Source: Mr. Trash Wheel (All Trash Wheels)") |>
  hc_exporting(enabled = TRUE)

Going back to the AllTrashForVis data set (the first visualization data set I use for this project), I made a visualization using Highcharter that shows the total amount of trash collected for each type of trash, grouped by year. My goal is to show the total amount of trash collected by the trash wheels, and how that amount changed over the years as each trash wheel was installed. You can see each trash wheel labeled with its installation year and how that has affected the amount of trash collected throughout the years in Baltimore's harbor. The x-axis is the year, the y-axis is the volume of total trash collected, and the group is the type of trash collected (shown in the legend).

2. - Total Trash Collected by Weight and Volume: 2014 - 2023 (Data Analysis)

AllTrashDataAna$Year <- as.factor(AllTrashDataAna$Year) # First, I am going to convert Year to a factor for the legend.

smooth_data <- AllTrashDataAna |> #Filtered the data for geom_smooth (Year 2021 only)
  filter(Year %in% c(2021))

# Rename 2021 to "Every Year After" (Used ChatAI for recoding 2021 to be "Every Year After", and learned fct_recode)
smooth_data$Year <- fct_recode(smooth_data$Year,
                                   "Every Year After" = "2021")


# Filter the data for geom_point (Years 2014 and 2015 only)
point_data <- AllTrashDataAna |>
  filter(Year %in% c(2014, 2015))

# Create the ggplot object for geom_point
point_plot <- point_data |>
  ggplot(aes(x = TotalTrashWeight, y = TotalHomesPowered, color = Year)) +
  geom_point() +
  geom_smooth(method = "auto", se = TRUE, linewidth = 0.5) + # linewidth replaces the deprecated size argument for lines
  labs(title = "Total Trash Collected by Weight and Volume: 2014 - 2023",
       x = "Total Trash Weight (Tons)",
       y = "Total Homes Powered",
       color = "Year") +
  theme_minimal() +
  scale_color_manual(values = c("#ADFC92", "#9BF3F0")) +
  facet_wrap(~Year) +
  scale_y_continuous(limits = c(0, 455), breaks = seq(0, 455, by = 100)) +
  scale_x_continuous(limits = c(0, 24.50), breaks = seq(0, 24.50, by = 10))
# Create the ggplot object for geom_smooth
smooth_plot <- smooth_data |>
  ggplot(aes(x = TotalTrashWeight, y = TotalHomesPowered, color = Year)) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Total Trash Collected by Weight and Volume: 2014 - 2023",
       x = "Total Trash Weight (Tons)",
       y = "Total Homes Powered",
       color = "Year") +
  theme_minimal() +
  facet_wrap(~Year) + # I did this to show the title "Every Year After"
  scale_color_manual(values = "#473198")

# Combine the plots
Final_Vis2 <- subplot(point_plot, smooth_plot, nrows = 1) #Used ChatAI to figure out how to combine plots!
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
Final_Vis2 <- layout(Final_Vis2, annotations = list(
  text = "Source: Mr. Trash Wheel <br> (All Trash Wheels)",
  x = 10, y = 24,
  showarrow = FALSE
))
Final_Vis2 <- layout(Final_Vis2, legend = list(x = 0.01, y = 0.99, orientation = "h"))
# Used Google Docs to find a good font family
Final_Vis2 <- layout(Final_Vis2, font = list(family = "Copperplate, Papyrus, fantasy", size = 12, color = "black"))
Final_Vis2 <- layout(Final_Vis2, xaxis = list(title = "Total Trash Weight (Tons)"), yaxis = list(title = "Total Homes Powered"))
Final_Vis2 <- layout(Final_Vis2, annotations = list(
  text = "Total Weight = 141.35 <br> Total Homes That Could <br> Have Powered = 2403",
  x = 13, y = 100,
  showarrow = FALSE
))
Final_Vis2 <- layout(Final_Vis2, annotations = list(
  text = "Total Weight = 238.80  <br> Total Homes That Powered = 2714 <br> Total Homes That Could <br> Have Powered = 4060",
  x = 25, y = 255,
  showarrow = FALSE
))

Final_Vis2

For this second visualization, I used the AllTrashDataAna data set from my Data Analysis to make a visualization using Plotly that shows the total amount of trash collected by weight and the total number of homes powered for each date. (I know I also worked with volume, but since weight has a higher correlation with homes powered than volume does, I chose weight, which is also why I used it for my exploration above.) My goal is to show the relationship between the amount of trash collected and the number of homes powered, and how the trash collected affects the number of homes powered.
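
As a rough illustration of how weight and volume can be compared as predictors, here is a base R sketch using `cor()`. The numbers are hypothetical dumpster-level values I made up for this example, and `homes` is assumed to follow the 500 / 30 conversion described earlier, so the weight correlation is perfect by construction; with the real data the values would differ.

```r
# Hypothetical dumpster-level values, for illustration only
weight <- c(3.1, 2.5, 4.0, 3.3, 2.8)  # tons
volume <- c(15, 15, 18, 15, 13)       # volume, in the data set's own units

# Assumption: homes powered follows the conversion used in this report
homes <- weight * 500 / 30

# Pearson correlation of each candidate predictor with homes powered
cor_weight <- cor(weight, homes)
cor_volume <- cor(volume, homes)

c(weight = cor_weight, volume = cor_volume)
```

Whichever variable has the correlation closer to 1 is the stronger linear predictor, which is the reasoning behind choosing weight for this plot.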

From this graph, I also got to visualize the first two years, in which the collected trash did not follow the same pattern. From the above exploration, I found that 2014 and 2015 had outliers that did not reflect the homes-powered variable, while the other years did. That means either the trash was not used for electricity, or it COULD have been used for something else (or burned). That way, we can see the weight of trash that could have helped power more homes.

From my conclusion and calculations above, I estimated that in total 6463 homes could have been powered with the trash collected in 2014 and 2015 (2014 = 2403, 2015 = 4060). This is a lot of homes that could have been powered with the trash collected in those years, and it’s a shame that none of the trash collected in 2014 and only about two thirds of the trash collected in 2015 was used for electricity.

3. - Total Trash Collected by Month: 2021 - 2023

Here, I will use the AllTrashMonth data set I created for the third visualization to make a visualization using Highcharter that will show the total amount of trash collected for each month from years 2021 to 2023. My goal is to show which month collected the most trash per year, and see a PATTERN of how trash is collected.

highchart() |>
  hc_add_series(data = AllTrashMonth, # selects dataframe
                type = "column", # sets the graph type, column
                hcaes(x = Month, y = TotalTrashVolume, group = Year)) |>
  hc_title(text = "Total Trash Collected by Month: 2021 - 2023") |>
  hc_subtitle(text = "Baltimore's Mr. Trash Wheel Family") |>
  hc_xAxis(title = list(text = "List of Months Each Year"), categories = levels(AllTrashMonth$Month)) |>
  hc_yAxis(title = list(text = "Total Trash Volume")) |>
  hc_legend(title = list(text = "Year"),
            floating = TRUE, #Used ChatAI to put legend in the middle
            x = 263, y = -10, 
            enabled = TRUE) |>
  hc_colors(c("#540d6e", "#ee4266", "#ffd23f")) |>
  hc_chart(backgroundColor = "rgba(255, 255, 255, 0.7)", # Used ChatAI for background transparency
           divBackgroundImage = "https://images.squarespace-cdn.com/content/v1/53dec77ce4b0f52a5ebb8563/1451578465518-5PN7SDC4I5EQARFV1GX5/mrtrashwheel8x10.png?format=1500w",
           style = list(fontFamily = "Verdana", fontWeight = "bold")) |>
  hc_tooltip(shadow = TRUE,
             crosshairs = TRUE,
             borderWidth = 3) |> # Used Alexandra Veremeychik Project 2 Assignment to learn crosshairs argument.
  hc_caption(text = "Data Source: Mr. Trash Wheel (All Trash Wheels)") |>
  hc_exporting(enabled = TRUE)

This visualization shows the total amount of trash collected in each month from 2021 to 2023. The x-axis is the month, the y-axis is the total volume of trash collected, and the group is the year. This is a good way to see which months collected the most trash, and how the amount collected changes throughout the year in Baltimore's harbor. In the graph, January, February, March, April, May, and October never reached a volume of more than 2,000. However, in June, July, August, September, November, and December, the volume of trash collected was more than 2,000. This is interesting because it shows that the summer months and winter months are when the most trash is collected. This is a good sign that the trash wheels are doing their job, and that the trash collected is being used for electricity too. Much of this could come from summer break and winter break, when people are outside more often and more events are going on in Baltimore. Baltimore is a huge city known for hosting many events around the harbor, so these months could be when more trash ends up in the rivers. We can also (kind of) see that in 2023 a lot more trash was found than in 2021, which means the problem is getting somewhat worse. The only months where 2021 had more volume of trash than 2023 were March, May, October, and November (a big difference).
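
The 2,000 cutoff can also be checked in code instead of by eye. This is a minimal sketch on a hypothetical stand-in for AllTrashMonth (the `toy_month` values are made up, not the real totals); it assumes the same three columns and flags any month where at least one year went above the threshold.

```r
library(dplyr)

# Hypothetical stand-in for AllTrashMonth (made-up values)
toy_month <- tibble(
  Month = factor(c("JUNE", "JUNE", "JUNE", "MAY", "MAY", "MAY"),
                 levels = toupper(month.name), ordered = TRUE),
  Year = c(2021, 2022, 2023, 2021, 2022, 2023),
  TotalTrashVolume = c(1500, 2100, 2400, 1200, 900, 1100)
)

threshold <- 2000

# Keep only the months whose largest yearly total exceeds the threshold
months_over <- toy_month |>
  group_by(Month) |>
  summarize(MaxVolume = max(TotalTrashVolume), .groups = "drop") |>
  filter(MaxVolume > threshold)

months_over
```

Swapping `toy_month` for the real AllTrashMonth table would list the months that passed 2,000 in at least one year.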

Essay

The topic of my data is the total amount of trash collected by the Mr. Trash Wheel family in Baltimore, Maryland. The variables included in the data set are Trash Wheel, Date, Year, Month, Volume, Weight, and Trash Type. The Trash Wheel, Date, Year, Month, and Trash Type variables are categorical, while the Volume and Weight variables are continuous. The data came from the Mr. Trash Wheel website, and I cleaned it up by converting the Date variable to a date and the Year variable to a factor. I made many changes with data manipulation and dplyr to build data sets that tell a story through visualization. I chose this topic and data set because I am interested in environmental science, especially being close to the Baltimore area, and I want to learn more about the trash that’s close to the DMV area. There was a lot of data on how much trash is collected in Baltimore’s rivers and how that trash is used to power homes. The people working on the trash wheel machines individually count “every single trash” item they see, write it down, and use their own statistical equations to estimate how much trash is found in Baltimore's harbor (MrTrashWheel). This data set has meaning to me because it shows how much trash is collected in Baltimore’s rivers and how that trash is used to power homes. It also shows how effective the trash wheels are at collecting trash in all four corners of the harbor.

The Mr. Trash Wheel family is a group of trash wheels that are used to collect trash in Baltimore’s rivers. The first trash wheel, Mr. Trash Wheel, was installed in 2014. The second, Professor Trash Wheel, was installed in 2016. The third, Captain Trash Wheel, was installed in 2018. Lastly, the fourth, Gwynnda Trash Wheel, was installed in 2021. The trash wheels are powered by solar panels and hydroelectric power, and use a conveyor belt to collect trash from the water (MrTrashWheel). The trash wheels are able to collect over 50,000 pounds of trash per day and have helped to clean up Baltimore’s rivers and prevent trash from flowing into the Chesapeake Bay. The trash wheels are a sustainable and environmentally friendly way to clean up the rivers and have helped to reduce the amount of trash in the water (MrTrashWheel).

The first visualization shows the total amount of trash collected by the trash wheels, by type and by year. The visualization shows that the amount of trash collected has decreased over the years, and that the trash wheels are doing their job in collecting trash. Most of the trash collected throughout all years was cigarette butts, especially starting from 2014 when the first trash wheel was installed. The second visualization shows the relationship between the amount of trash collected and the number of homes powered. The visualization shows that there is a positive correlation between the amount of trash collected and the number of homes powered, and that the trash wheels are effective in powering homes. The years 2014 and 2015, however, stand out: in 2014, no homes were powered regardless of the weight of trash, and in 2015, there wasn’t a good correlation between the weight of trash and homes powered. The third visualization shows the total amount of trash collected for each month from 2021 to 2023. The visualization shows that the summer months and winter months are when the most trash is collected in Baltimore's harbor, and that the trash wheels are doing their job in collecting trash. The visualizations show that the trash wheels are effective in collecting trash and that the trash collected is being used for electricity. The visualizations also show that the trash wheels are a sustainable and environmentally friendly way to clean up the rivers, and have helped to reduce the amount of trash in the water!

One interesting pattern from my visualizations is that there is a positive correlation between the amount of trash collected and the number of homes powered, and that the trash wheels are effective in powering homes. One surprise within the visualizations is that the amount of trash collected in 2014 and 2015 did not reflect the homes-powered variable: none of the trash collected in 2014 and only about two thirds of the trash collected in 2015 was used for electricity. Another surprise is that the summer months and winter months are when the most trash is collected in Baltimore's harbor, and that the trash wheels are doing their job in collecting trash. One thing I wish I could have added is a background image in my 2nd visualization, but it was really hard to figure out since I combined plots together. I also wish I could have included a better way to show the total amount of trash collected for each month from 2021 to 2023, but I think the visualization I made was good enough to show the pattern of how trash is collected.

(Source: https://www.mrtrashwheel.com/)