Mr. Trash Wheel Baltimore Data Investigation with Cigarette Butts!

Author

Emilio Sanchez San Martin

knitr::opts_chunk$set(echo = TRUE)
#Used ChatGPT to figure out how to upload the image

Brief Introduction

Mr. Trash Wheel, also known as the Inner Harbor Water Wheel, is a semi-autonomous trash interceptor (meaning it mostly works by itself) that removes trash from Baltimore's Jones Falls River. Since May 2014, its operators have documented the date, weight, and volume of trash for each filled dumpster, along with estimates for the number of plastic bottles, cigarette butts, and other types of items extracted. It sits at the end of an outfall on the river, waiting for trash to flow into it! It's a great example of how technology can be used to help the environment.

Just from looking at the data set, I can already see that cigarette butts are the MOST common type of litter found in Baltimore, which is crazy since they're really toxic and bad for both people and the environment! With this, I want to find out whether there was a difference between 2014, 2020, and 2023 in how many cigarette butts were collected around the time the COVID-19 pandemic began, and I also want to see the difference across 2020, 2021, 2022, and 2023. Widespread quarantine started around the end of March 2020 (I say this because that's when schools and businesses in the US started to close or shut down), so I want to start from April and go through December, and plot the monthly totals of cigarette butts littered in Baltimore. I will be using the dataset from the Healthy Harbor Initiative, which is a program that aims to make the Baltimore Harbor swimmable and fishable. This dataset is available at https://www.mrtrashwheel.com/ & https://docs.google.com/spreadsheets/d/1b8Lbe-z3PNb3H8nSsSjrwK2B0ReAblL2/edit#gid=1143432795


library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RColorBrewer)
library(readr)
library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
library(ggfortify)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
setwd("~/Downloads/Data 101 and Data 110 class/Data 110")
TWCD_MrTrashWheel <- read_csv("Data Sets/Trash Wheel Collection Data - Mr. Trash Wheel.csv")
New names:
• `` -> `...15`
• `` -> `...16`
Rows: 630 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Month, Date
dbl  (2): Dumpster, Year
num (10): Weight (tons), Volume (cubic yards), Plastic Bottles, Polystyrene,...
lgl  (2): ...15, ...16
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#TWCD stands for Trash Wheel Collection Data

#For now, I am going to remove all NAs, rename variables, etc., to get a clean data set to work with!

I will remove columns 15 and 16 (they were empty, useless columns).

TWCD_MrTrashWheel <- TWCD_MrTrashWheel |>
  select(-c(15, 16))

I will also remove the NAs in the data set (I will use na.omit() since only one row is affected, and it's completely useless for this analysis).

TWCD_MrTrashWheel <- na.omit(TWCD_MrTrashWheel)
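As an aside, tidyr (loaded with the tidyverse) offers drop_na() as a pipe-friendly equivalent to na.omit(); a minimal sketch of the same step:

#Equivalent tidyverse version (same result as na.omit above, not run here):
#TWCD_MrTrashWheel <- TWCD_MrTrashWheel |>
#  drop_na()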

I want to rename the columns to make it easier to read and understand.

TWCD_MrTrashWheel <- TWCD_MrTrashWheel |>
  rename("dumpster" = "Dumpster",
         "month" = "Month",
         "year" = "Year",
         "day" = "Date",
         "weight_tons" = "Weight (tons)", #weight of all trash
         "volume_cubic_yards" = "Volume (cubic yards)", #volume of all trash
         "plastic_bottles" = "Plastic Bottles",
         "polystyrene" = "Polystyrene",
         "cigarette_butts" = "Cigarette Butts",
         "glass_bottles" = "Glass Bottles",
         "plastic_bags" = "Plastic Bags",
         "wrappers" = "Wrappers",
         "sports_balls" = "Sports Balls",
         "homes_powered" = "Homes Powered*")
#this took me a while lol, I had to look back at each column name and copy it exactly haha
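For what it's worth, the janitor package could have automated most of this renaming (an assumption: janitor isn't loaded anywhere above, so this is just a sketch):

#Hypothetical alternative, assuming janitor is installed: clean_names()
#lower-cases names and swaps spaces/punctuation for underscores, e.g.
#"Weight (tons)" -> "weight_tons". (It would give "date" rather than "day",
#so that one column would still need a manual rename.)
#TWCD_MrTrashWheel <- janitor::clean_names(TWCD_MrTrashWheel)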

I am going to make my first clean data set!

TWCD_1 <- TWCD_MrTrashWheel |>
  select(month, year, weight_tons, volume_cubic_yards, cigarette_butts, plastic_bottles, glass_bottles, wrappers, sports_balls)

I want to change the class of my Year column to a factor, so I can use it in my ggplot function.

I’m also going to make another graph comparing the amount of cigarette butts collected in 2020, 2021, 2022, and 2023. THIS WILL BE SAVED FOR MY FUTURE DATA VISUALIZATION!

TWCD_FUTURE <- TWCD_MrTrashWheel |>
  select(year, month, weight_tons,volume_cubic_yards, cigarette_butts, plastic_bottles, glass_bottles, wrappers, sports_balls) |>
  filter(month %in% c("April", "May", "June", "July", "August", "September", "October", "November", "December")) |>
  filter(year %in% c(2020, 2021, 2022, 2023))|>
  group_by(month, year) |>
  summarise(total_weight_tons = sum(weight_tons), sum_volume_cubic_yards = sum(volume_cubic_yards), cigarette_butts = sum(cigarette_butts), plastic_bottles = sum(plastic_bottles), glass_bottles = sum(glass_bottles), wrappers = sum(wrappers), sports_balls = sum(sports_balls))
`summarise()` has grouped output by 'month'. You can override using the
`.groups` argument.
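That message is harmless, but it can be silenced; a minimal sketch, assuming we just want an ungrouped result:

#Either pass .groups = "drop" inside summarise(), or ungroup afterwards:
TWCD_FUTURE <- TWCD_FUTURE |>
  ungroup() #drops the leftover month grouping so later steps start clean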
#I notice that the class of the Year column is numeric. I want to convert it to a factor.
TWCD_1$year <- factor(TWCD_1$year)
class(TWCD_1$year)
[1] "factor"
str(TWCD_1)
tibble [629 × 9] (S3: tbl_df/tbl/data.frame)
 $ month             : chr [1:629] "May" "May" "May" "May" ...
 $ year              : Factor w/ 10 levels "2014","2015",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ weight_tons       : num [1:629] 4.31 2.74 3.45 3.1 4.06 2.71 1.91 3.7 2.52 3.76 ...
 $ volume_cubic_yards: num [1:629] 18 13 15 15 18 13 8 16 14 18 ...
 $ cigarette_butts   : num [1:629] 126000 91000 105000 100000 120000 90000 56000 112000 98000 130000 ...
 $ plastic_bottles   : num [1:629] 1450 1120 2450 2380 980 1430 910 3580 2400 1340 ...
 $ glass_bottles     : num [1:629] 72 42 50 52 72 46 32 58 49 75 ...
 $ wrappers          : num [1:629] 1162 874 2032 1971 753 ...
 $ sports_balls      : num [1:629] 7 5 6 6 7 5 3 6 6 7 ...
 - attr(*, "na.action")= 'omit' Named int 630
  ..- attr(*, "names")= chr "630"
#I'd also like to do the same for TWCD_FUTURE
TWCD_FUTURE$year <- factor(TWCD_FUTURE$year)

#Now we have a clean data set! My next step is to manipulate the data into a data set that shows the years 2014, 2020, and 2023, and the amount of trash collected in each year.

TWCD_1 <- TWCD_MrTrashWheel |>
  select(year, month, weight_tons,volume_cubic_yards, cigarette_butts, plastic_bottles, glass_bottles, wrappers, sports_balls) |>
  filter(month %in% c("April", "May", "June", "July", "August", "September", "October", "November", "December")) |>
  filter(year %in% c(2014, 2020, 2023))|>
  group_by(month, year) |>
  summarise(total_weight_tons = sum(weight_tons), sum_volume_cubic_yards = sum(volume_cubic_yards), cigarette_butts = sum(cigarette_butts), plastic_bottles = sum(plastic_bottles), glass_bottles = sum(glass_bottles), wrappers = sum(wrappers), sports_balls = sum(sports_balls))
`summarise()` has grouped output by 'month'. You can override using the
`.groups` argument.

#I’m going to make a Linear Regression Model. With the data set I created above, I want to see if there is a correlation between the months and the sum of cigarette butts collected.

To make the linear regression model, I am going to convert the months to numerical values so they are easier to use in the lm() and cor() functions, starting with April, which will be 4, and ending with December, which will be 12.

#Gonna make it a new data set just in case! Used ChatGPT to find the "case_when" argument and learn how to map the months to numerical values from 4 to 12. Instead of replacing the column, I will make a new column called "month_numeric" because I want to keep the original column in case I want to work with it while graphing.
TWCD_2 <- TWCD_1 %>%
  mutate(month_numeric = case_when(
    month == "April" ~ 4,
    month == "May" ~ 5,
    month == "June" ~ 6,
    month == "July" ~ 7,
    month == "August" ~ 8,
    month == "September" ~ 9,
    month == "October" ~ 10,
    month == "November" ~ 11,
    month == "December" ~ 12
  ))
#I'm going to make sure that the new column is a numerical value.
class(TWCD_2$month_numeric)
[1] "numeric"
#I'm also going to repeat the same code above, but for the TWCD_FUTURE data set.
TWCD_FUTURE <- TWCD_FUTURE %>%
  mutate(month_numeric = case_when(
    month == "April" ~ 4,
    month == "May" ~ 5,
    month == "June" ~ 6,
    month == "July" ~ 7,
    month == "August" ~ 8,
    month == "September" ~ 9,
    month == "October" ~ 10,
    month == "November" ~ 11,
    month == "December" ~ 12
  ))
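As an aside, base R's built-in month.name vector ("January" through "December", in order) could replace both case_when() blocks above; a minimal equivalent sketch:

#match() returns each month's position in month.name, so "April" -> 4, etc.
#This one line reproduces the case_when() mapping above (not run here):
#TWCD_2 <- TWCD_2 %>% mutate(month_numeric = match(month, month.name))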

Now, I am going to see if there is a correlation between the months and the sum of cigarette butts collected.

cor(TWCD_2$month_numeric, TWCD_2$cigarette_butts)
[1] -0.20037
cor(TWCD_FUTURE$month_numeric, TWCD_FUTURE$cigarette_butts)
[1] 0.01304704

Since the correlation is SO low for the TWCD_FUTURE data set, I am not even going to bother with it. I will just use the TWCD_2 data set for the linear regression model (the 2020 and 2023 correlation already shows how weak it is, so I do not need to check 2021 and 2022 separately).

This is a value between -1 and 1 (inclusive). The correlation coefficient tells you how strong or weak the correlation is: values close to +/-1 indicate a strong correlation (the sign is determined by the slope of the linear relationship), values around +/-0.5 indicate a moderate correlation, and values close to zero indicate little to no correlation.
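For intuition, the value cor() returns is just the covariance of the two variables scaled by both standard deviations; a quick sketch that should reproduce the -0.20037 above:

#Pearson's r by hand: cov(x, y) / (sd(x) * sd(y))
cov(TWCD_2$month_numeric, TWCD_2$cigarette_butts) /
  (sd(TWCD_2$month_numeric) * sd(TWCD_2$cigarette_butts))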

I am going to use the lm function to make a model that shows the amount of Cigarette Butts collected in the following months above.

#lm (y ~ x, data = ) #Just to remember the format of the lm function
TWCD_lm_summary <- lm(cigarette_butts ~ month_numeric, data = TWCD_2)
summary(TWCD_lm_summary)

Call:
lm(formula = cigarette_butts ~ month_numeric, data = TWCD_2)

Residuals:
    Min      1Q  Median      3Q     Max 
-250354 -170245 -122231   31066  747521 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)     359102     191032   1.880   0.0723 .
month_numeric   -22437      22394  -1.002   0.3264  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 286200 on 24 degrees of freedom
Multiple R-squared:  0.04015,   Adjusted R-squared:  0.0001543 
F-statistic: 1.004 on 1 and 24 DF,  p-value: 0.3264
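A quick consistency check on this output: for a single-predictor regression, Multiple R-squared is just the correlation coefficient squared.

#(-0.20037)^2 should match the Multiple R-squared of 0.04015 above:
cor(TWCD_2$month_numeric, TWCD_2$cigarette_butts)^2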

This correlation coefficient of -0.20037 indicates a weak negative correlation (sadly, I was hoping for a stronger one, but since the model pools all the months from years with extremely different values, the correlation comes out poor :c. You can also see the weak relationship per year just by looking at the graphs). Since -0.20037 is relatively close to 0, it shows a weak negative correlation, meaning that as the month number increases, the number of cigarette butts collected tends to decrease slightly, but the relationship is not very strong.

The fitted regression equation comes from the lm coefficients: y = -22437x + 359102.

For reference, for every one-month increase, the predicted number of cigarette butts collected decreases by about 22,437. The p-value for the slope is 0.3264, which means the relationship is not statistically significant.

(cigarette_butts) = -22437(month) + 359102
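As a quick sanity check (using the rounded coefficients above, so the numbers are approximate): plugging in July gives 359102 - 22437 * 7 ≈ 202,043 predicted cigarette butts.

#predict() evaluates the fitted model at month_numeric = 7 (July):
predict(TWCD_lm_summary, newdata = data.frame(month_numeric = 7))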

We also need to look at the Adjusted R-squared value. It states that only about 0.015% of the variation in the observations may be explained by the model.

In other words, the model is not a good fit for the data.

Change year to a factor to use in the ggplot function (I want to map the years to color in the plot).

TWCD_2$year <- factor(TWCD_2$year)
class(TWCD_2$year)
[1] "factor"

Make a linear regression plot of the number of cigarette butts collected in April, May, June, July, August, September, October, November, and December, faceted by year and colored for 2014, 2020, and 2023. The reason I use facet_wrap() is to show a clear comparison between the years and the months.

TWCD_4 <- TWCD_2 |>
  ggplot(aes(x = month_numeric, y = cigarette_butts, color = year)) +
  geom_point() + xlim(4,12) + ylim(0,972000) +
  geom_smooth(method = "lm", formula=y~x, se = FALSE, linewidth = 0.6, linetype= "dotdash") + #linewidth replaces the size aesthetic deprecated in ggplot2 3.4.0
  scale_color_manual(values = c("#5C3AC5", "#3D7FC2", "#3AC5A1")) +
  facet_wrap(~year) +
  labs(title = "Cigarette Butts Collected in 2014, 2020 and 2023",
       caption = "Source: Mr. Trash Wheel",
       color = "Year",
       x = "Months (April - December | 4 - 12)",
       y = "Sum of Cigarette Butts") +
  theme_minimal(base_size = 11)
TWCD_4

As you can see, the number of cigarette butts was INSANELY large in May 2014, where the wheel recorded the most cigarette butts in the data set. The amounts collected in 2020 and 2023 are relatively similar, but not quite the same. We can't really see the difference between 2020 and 2023 at this scale, but the number of cigarette butts collected in 2020 is slightly higher than in 2023. I will make a new linear regression graph to show this, but first I will filter the data down to just the cigarette butts collected in 2020 and 2023.

TWCD_5 <- TWCD_2 |>
  filter(year %in% c(2020, 2023)) |>
  ggplot(aes(x = month_numeric, y = cigarette_butts, color = year)) +
  geom_point() +
  geom_smooth(method = "lm", formula=y~x, se = FALSE, linewidth = 1, linetype= "dotdash") +
  scale_color_manual(values = c("#3D7FC2", "#3AC5A1")) +
  labs(title = "Cigarette Butts Collected in 2020 and 2023",
       caption = "Source: Mr. Trash Wheel",
       color = "Year",
       x = "Months (April - December | 4 - 12)",
       y = "Sum of Cigarette Butts") +
  scale_x_continuous(breaks = unique(TWCD_2$month_numeric), limits = c(4, 12)) + #Used ChatAI for code; limits = here replaces xlim() so ggplot doesn't complain about duplicate scales
  scale_y_continuous(breaks = seq(0, 55750, 5575), limits = c(0, 55750)) + #Used ChatAI for code
  theme_minimal(base_size = 11)
TWCD_5

For the next part of this project, I am going to see if I can compare more variables using a scatterplot matrix!

ggpairs(TWCD_2, columns = c(3, 4, 5, 6, 7, 8, 9), title = "Scatterplot Matrix of Trash Collected in 2014, 2020, and 2023")

Here, we can see that the highest correlation is between the weight of all trash and the volume of all trash. This makes sense, as the more trash there is, the more volume it will take up.

Let’s see the correlation for the weight of all trash and volume of all trash.

cor(TWCD_2$total_weight_tons, TWCD_2$sum_volume_cubic_yards)
[1] 0.9674263

Let’s find the lm summary

TWCD_lm_summary_2 <- lm(sum_volume_cubic_yards ~ total_weight_tons, data = TWCD_2)
summary(TWCD_lm_summary_2)

Call:
lm(formula = sum_volume_cubic_yards ~ total_weight_tons, data = TWCD_2)

Residuals:
    Min      1Q  Median      3Q     Max 
-31.979  -5.334  -0.852   7.155  37.106 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         7.0798     5.1389   1.378    0.181    
total_weight_tons   4.2714     0.2282  18.722 7.98e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.76 on 24 degrees of freedom
Multiple R-squared:  0.9359,    Adjusted R-squared:  0.9332 
F-statistic: 350.5 on 1 and 24 DF,  p-value: 7.975e-16

The fitted equation is: y = 4.2714x + 7.0798 (using the slope from the lm output). This means that for every additional ton of trash, the predicted volume increases by about 4.27 cubic yards. The p-value is 7.975e-16, which means the relationship is statistically significant. The Adjusted R-squared value is 0.9332, which means that 93.3% of the variation in the observations may be explained by the model.
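As a worked example (again using the rounded coefficients, so approximate): a month with 50 tons of trash would be predicted to hold 7.0798 + 4.2714 * 50 ≈ 220.65 cubic yards.

#predict() evaluates the fitted model at total_weight_tons = 50:
predict(TWCD_lm_summary_2, newdata = data.frame(total_weight_tons = 50))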

This is so cool! Let's graph it!

TWCD_6 <- TWCD_2 |>
  ggplot(aes(x = total_weight_tons, y = sum_volume_cubic_yards, color = year)) +
  geom_point() +
  geom_smooth(method = "lm", formula=y~x, se = FALSE, linewidth = 1, linetype= "dotdash") +
  scale_color_manual(values = c("#5C3AC5", "#3D7FC2", "#3AC5A1")) +
  labs(title = "Weight of Trash vs. Volume of Trash in 2014, 2020, and 2023",
       caption = "Source: Mr. Trash Wheel",
       color = "Year",
       x = "Weight of Trash (tons)",
       y = "Volume of Trash (cubic yards)") +
  theme_minimal(base_size = 11)
TWCD_6

So we found in our data that the weight of all trash and the volume of all trash have a strong correlation of 0.9674263. In other words, heavier loads of trash reliably take up more space.

Let's take this a step further and make a multiple regression. I want to combine several input variables into one model. I will start with all possible predictor variables against my response variable; in this case: cigarette_butts ~ weight of all trash, volume of all trash, plastic_bottles, glass_bottles, wrappers, and sports_balls. I will use the lm function to make the model.

fit_1 <- lm(cigarette_butts ~ sum_volume_cubic_yards + total_weight_tons + plastic_bottles + glass_bottles + wrappers + sports_balls, data = TWCD_2)

summary(fit_1)

Call:
lm(formula = cigarette_butts ~ sum_volume_cubic_yards + total_weight_tons + 
    plastic_bottles + glass_bottles + wrappers + sports_balls, 
    data = TWCD_2)

Residuals:
   Min     1Q Median     3Q    Max 
-85311 -35663   8101  25226 112159 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             6349.478  26213.102   0.242   0.8112    
sum_volume_cubic_yards  -334.699   1039.175  -0.322   0.7509    
total_weight_tons      -4317.094   4531.073  -0.953   0.3527    
plastic_bottles           10.761      4.701   2.289   0.0337 *  
glass_bottles           2207.187    139.194  15.857 2.06e-12 ***
wrappers                   1.412      4.544   0.311   0.7593    
sports_balls           -2426.761    321.613  -7.546 3.95e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 55940 on 19 degrees of freedom
Multiple R-squared:  0.971, Adjusted R-squared:  0.9618 
F-statistic:   106 on 6 and 19 DF,  p-value: 1.423e-13

Now I am going to autoplot the model to see if it's a good fit for the data.

autoplot(fit_1, 1:4, nrow=2, ncol=2)

#Note for self:

    1. The Residuals vs. Fitted plot essentially indicates whether a linear model is appropriate; you can see this by the blue line staying relatively horizontal. If it is not relatively horizontal, a linear model may not be appropriate.
    2. The Q-Q plot indicates whether the residual distribution is relatively normal. Observations that might be outliers are labeled with their row numbers.
    3. The Scale-Location plot indicates homogeneous variance (homoscedasticity). Influential observations that are skewing the variance distribution are labeled.
    4. Cook's Distance indicates which outliers have high leverage. Some outliers may not cause the model to violate the basic assumptions required for the regression analysis (see 1-3), but if outliers have high leverage, they may be causing problems for your model. You can try removing those observations, especially if they appear in any of the other three plots.

Look at the p-value for each variable: if it is relatively small (< 0.10), then the variable is likely contributing to the model; if it is relatively large (> 0.10), then it is likely not contributing.
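Going by that rule, sum_volume_cubic_yards, total_weight_tons, and wrappers all have large p-values in the output above, so a natural next step would be to drop them and refit; a sketch (fit_2 is a hypothetical name, and I did not run this for the report):

#Refit keeping only the predictors that were significant in fit_1:
#fit_2 <- lm(cigarette_butts ~ plastic_bottles + glass_bottles + sports_balls,
#            data = TWCD_2)
#summary(fit_2) #compare its Adjusted R-squared against fit_1's 0.9618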

The overall p-value of the model is 1.423e-13, which is statistically significant.

I look at the Adjusted R-squared value at the bottom of the output. The interpretation: 96.18%. That's amazing!!!

#Note to self -> __% (from the adjusted R-squared value) of the variation in the observations may be explained by this model. The higher the adjusted R-squared value, the better the model. We use the adjusted R-squared value because it compensates for additional predictors mathematically inflating the plain R-squared value.

The adjusted R-squared value is 96.18%, which means that 96.18% of the variation in the observations may be explained by the model. This is a good model.

Back to visualizing: I am going to make two graphs (because I want to compare the 2014, 2020, and 2023 data first, and then 2020, 2021, 2022, and 2023).

Final_Visualization <- TWCD_2 |>
  filter(year %in% c(2014, 2020, 2023)) |>
  ggplot(aes(x = year, y = cigarette_butts, fill = year)) +
  facet_wrap(~month) +
  geom_col() + #geom_col() draws bars at the y values as given; the method/formula/se arguments I tried before belong to geom_smooth() and were being ignored
  scale_fill_manual(values = c("#5C3AC5", "#3D7FC2", "#3AC5A1")) +
  labs(title = "Cigarette Butts Collected in 2014, 2020 and 2023",
       caption = "Source: Mr. Trash Wheel",
       fill = "Year",
       x = "Years (2014, 2020, 2023)",
       y = "Sum of Cigarette Butts") +
  theme_minimal(base_size = 11)
Final_Visualization

Obviously, 2014 dominates the scale and makes the later years hard to read. Let's make a new graph to compare 2020 through 2023.

Final_Visualization_2 <- TWCD_FUTURE |>
  ggplot(aes(x = month, y = cigarette_butts, text = paste("weight_tons:", total_weight_tons))) +
  geom_col(position = "dodge", aes(fill = year)) + #Originally tried method = "lm" and formula = y~x here; those are geom_smooth() arguments and were ignored, so geom_col() does the same job cleanly
  scale_fill_manual(name = "Year", values = c("#C33CA0", "#C3A23C", "#3CC35F", "#3C5DC3")) +
  labs(title = "Cigarette Butts Collected in 2020 through 2023",
       caption = "Source: Mr. Trash Wheel",
       x = "Months (April - December)",
       y = "Sum of Cigarette Butts") +
  scale_y_continuous(breaks = seq(0, max(TWCD_FUTURE$cigarette_butts), by = 5000)) +  #Used ChatAI for this to break the y scale into steps of 5000
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 17, hjust = 1))
Final_Visualization_2 <- ggplotly(plot = Final_Visualization_2) #Used ChatAI to learn how to make it interactive
Final_Visualization_2 <- layout(Final_Visualization_2, annotations = list(
  text = "Source: Mr. Trash Wheel",
  x = 7, y = 58530,  #Used ChatAI to find how to add a source annotation
  showarrow = FALSE
))
#I tried to move the legend, but every time I did, it removed the title and caption I used. I will keep it as is. :(
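#One thing that might work (an untested sketch; the placement values are
#assumptions) is setting the legend explicitly through plotly's layout():
#Final_Visualization_2 <- layout(Final_Visualization_2,
#  legend = list(orientation = "h", x = 0, y = -0.25))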

Final_Visualization_2

What a masterpiece!

#Conclusion

I cleaned up the dataset by removing any rows with missing data using the na.omit function. I also got rid of columns 15 and 16 since they didn't provide useful information. After that, I renamed the columns to make them easier to understand. I converted the 'Year' column into a factor so it could be used in my plots. I then created separate datasets showing the amount of cigarette butts collected in years from 2014 to 2023. Using facet_wrap and color, I plotted linear regressions to see how the number of cigarette butts collected varied month by month in 2014, 2020, and 2023. To compare multiple variables more comprehensively, I made a scatterplot matrix. I also made a linear regression model to see how the weight of all trash related to its volume. Using a variety of inputs, I built a multiple regression model. Finally, I created two graphs to compare data from 2014, 2020, and 2023, and then data from 2020, 2021, 2022, and 2023.

The visualization represents the number of cigarette butts collected in 2014, 2020, and 2023. The amount collected in 2014 was INSANELY high, and the amounts collected in 2020, 2021, 2022, and 2023 are relatively similar, but not quite the same. We can't easily see the difference between 2020 and 2023, but the number of cigarette butts collected in 2020 is slightly higher than in 2023; you can see this from the linear regression and the graph. Surprisingly, 2023 has the LEAST amount of cigarette butts found in Baltimore's harbor waters! Also, from the graph we can see that in December 2020 there were a lot of cigarette butts found in the river, and in May 2023 there were fewer. I like how scattered the graph is: the bars are all over the place, which suggests the month itself has little to do with how many cigarette butts end up in the river. The weight of all trash and the volume of all trash have a strong correlation of 0.9674263, which I would expect, since volume and weight should go up together. The adjusted R-squared value of the multiple regression is 96.18%, which means that 96.18% of the variation in the observations may be explained by that model.

I wish I could have moved the legend in the graph comparing 2020 and 2023, but every time I did, it removed the title and caption I used; HOWEVER, I am really proud of my data visualization graphs and wish to learn more. With ChatAI, I was able to learn and push myself to the limit in visualizing my data. It helped a lot and allowed me to make a masterpiece out of the Mr. Trash Wheel data set. I could keep thinking about all the graphs I could make if I had more time, haha!