Final Project

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)

setwd("~/Desktop/Final Project DATA101")

dr <- read_csv("Spill_Incidents_ny.csv")

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 564321 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Spill Number, Program Facility Name, Street 1, Street 2, Locality,...
## dbl  (4): ZIP Code, DEC Region, Quantity, Recovered
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Introduction

Is the mean spill quantity of contaminants (e.g. petroleum) significantly different across five counties in New York: Kings, New York, Richmond, Queens, and Bronx?

This dataset contains 564,321 records of spills of petroleum and other contaminants in the state of New York. Under New York State law, spills that could potentially pollute the land or waters of the state must be reported. The data contains 20 variables, including details on the date, location, source, cause, material, and quantity of each spill. In this report, the variables county and quantity (gallons) are used to show total spill quantities and to compare the mean spill quantity among different New York counties.

Spill incident reports are crucial for risk assessment and risk management. This data is essential for protecting both the environment and public health. Recording spill information helps establish better safety protocols, and documenting detailed reports of hazardous spills enforces accountability. Overall, analyzing spill incidents not only creates opportunities to improve the data, but also holds operators responsible for proper prevention and remediation.

Spill Incidents in New York from Data.gov https://catalog.data.gov/dataset/spill-incidents

head(dr)

## # A tibble: 6 × 20
##   `Spill Number` `Program Facility Name` `Street 1`   `Street 2` Locality County
##   <chr>          <chr>                   <chr>        <chr>      <chr>    <chr> 
## 1 0107132        MH 864                  RT 119/MILL… <NA>       ELMSFORD Westc…
## 2 0405586        BOWRY BAY               WATER POLL … <NA>       QUEENS   Queens
## 3 0405586        BOWRY BAY               WATER POLL … <NA>       QUEENS   Queens
## 4 0204667        POLE 16091              GRACE AVE/B… <NA>       BRONX    Bronx 
## 5 0210559        POLE ON                 FERDALE LOM… <NA>       LIBERTY  Sulli…
## 6 0311484        PRIVATE RESIDENCE       6568 GLEN H… <NA>       SCOTT    Cortl…
## # ℹ 14 more variables: `ZIP Code` <dbl>, `SWIS Code` <chr>, `DEC Region` <dbl>,
## #   `Spill Date` <chr>, `Received Date` <chr>, `Contributing Factor` <chr>,
## #   Waterbody <chr>, Source <chr>, `Close Date` <chr>, `Material Name` <chr>,
## #   `Material Family` <chr>, Quantity <dbl>, Units <chr>, Recovered <dbl>

Data Analysis

This data analysis prepares the dataset for ANOVA testing by cleaning it. The process consists of checking for missing values and renaming the columns for clarity and consistency. The analysis also includes summary statistics (the sum and mean of spill quantity) for each county of interest. This report focuses on only 5 counties in the state of New York: New York, Kings, Queens, Bronx, and Richmond.

Check for NA’s

colSums(is.na(dr))

##          Spill Number Program Facility Name              Street 1 
##                     0                     0                   119 
##              Street 2              Locality                County 
##                520428                  1134                     0 
##              ZIP Code             SWIS Code            DEC Region 
##                509342                     0                     0 
##            Spill Date         Received Date   Contributing Factor 
##                   151                   477                     0 
##             Waterbody                Source            Close Date 
##                516402                     0                 11499 
##         Material Name       Material Family              Quantity 
##                     0                     0                     0 
##                 Units             Recovered 
##                115947                     0

Rename column titles

# Clean variable names
names(dr) <- gsub("[(). \\-]", "_", names(dr)) #Replace ., (), space, and - with underscore
names(dr) <- gsub("_$", "", names(dr))  #Remove trailing underscore
names(dr) <- tolower(names(dr))         #Lowercase

head(dr)

## # A tibble: 6 × 20
##   spill_number program_facility_name street_1  street_2 locality county zip_code
##   <chr>        <chr>                 <chr>     <chr>    <chr>    <chr>     <dbl>
## 1 0107132      MH 864                RT 119/M… <NA>     ELMSFORD Westc…       NA
## 2 0405586      BOWRY BAY             WATER PO… <NA>     QUEENS   Queens       NA
## 3 0405586      BOWRY BAY             WATER PO… <NA>     QUEENS   Queens       NA
## 4 0204667      POLE 16091            GRACE AV… <NA>     BRONX    Bronx        NA
## 5 0210559      POLE ON               FERDALE … <NA>     LIBERTY  Sulli…       NA
## 6 0311484      PRIVATE RESIDENCE     6568 GLE… <NA>     SCOTT    Cortl…       NA
## # ℹ 13 more variables: swis_code <chr>, dec_region <dbl>, spill_date <chr>,
## #   received_date <chr>, contributing_factor <chr>, waterbody <chr>,
## #   source <chr>, close_date <chr>, material_name <chr>, material_family <chr>,
## #   quantity <dbl>, units <chr>, recovered <dbl>
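The three renaming steps above can be traced on a handful of names. This is a minimal sketch: the first three names come from the dataset, while `Quantity (gal)` is a hypothetical name added only to show how the trailing underscore is removed.

```r
# Sample column names; "Quantity (gal)" is hypothetical and not in the dataset
sample_names <- c("Spill Number", "ZIP Code", "Street 1", "Quantity (gal)")

# Same three steps as in the cleaning code above
cleaned <- gsub("[(). \\-]", "_", sample_names)  # punctuation and spaces -> "_"
cleaned <- gsub("_$", "", cleaned)               # drop one trailing underscore
cleaned <- tolower(cleaned)                      # lowercase

cleaned
```

Note that adjacent punctuation is not collapsed, so the hypothetical name becomes `quantity__gal` with a double underscore.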

Pick 5 counties to compare means: New York (Manhattan), Kings (Brooklyn), Queens, Bronx, Richmond (Staten Island)

unique(dr$county)

##  [1] "Westchester"                            
##  [2] "Queens"                                 
##  [3] "Bronx"                                  
##  [4] "Sullivan"                               
##  [5] "Cortland"                               
##  [6] "New York"                               
##  [7] "Ulster"                                 
##  [8] "Kings"                                  
##  [9] "Orange"                                 
## [10] "Dutchess"                               
## [11] "Onondaga"                               
## [12] "Saratoga"                               
## [13] "Cayuga"                                 
## [14] "Oswego"                                 
## [15] "Warren"                                 
## [16] "Niagara"                                
## [17] "Rockland"                               
## [18] "Nassau"                                 
## [19] "Jefferson"                              
## [20] "Schenectady"                            
## [21] "Albany"                                 
## [22] "Monroe"                                 
## [23] "Schuyler"                               
## [24] "St Lawrence"                            
## [25] "Richmond"                               
## [26] "Clinton"                                
## [27] "Lewis"                                  
## [28] "Essex"                                  
## [29] "Chenango"                               
## [30] "Erie"                                   
## [31] "Livingston"                             
## [32] "Oneida"                                 
## [33] "Wayne"                                  
## [34] "Suffolk"                                
## [35] "Orleans"                                
## [36] "Ontario"                                
## [37] "Genesee"                                
## [38] "Otsego"                                 
## [39] "Tompkins"                               
## [40] "Madison"                                
## [41] "Chemung"                                
## [42] "Seneca"                                 
## [43] "Broome"                                 
## [44] "Hamilton"                               
## [45] "Washington"                             
## [46] "Steuben"                                
## [47] "Rensselaer"                             
## [48] "Franklin"                               
## [49] "Columbia"                               
## [50] "Fulton"                                 
## [51] "Herkimer"                               
## [52] "Schoharie"                              
## [53] "Montgomery"                             
## [54] "Putnam"                                 
## [55] "Delaware"                               
## [56] "New Jersey - Region 2"                  
## [57] "Tioga"                                  
## [58] "Chautauqua"                             
## [59] "Cattaraugus"                            
## [60] "Wyoming"                                
## [61] "Yates"                                  
## [62] "Greene"                                 
## [63] "Pennsylvania - Region 9"                
## [64] "Allegany"                               
## [65] "New Jersey - Region 3 (N)"              
## [66] "Cattaraugus Indian Reservation"         
## [67] "New Jersey - Region 3 (T)"              
## [68] "Canada - Region 6"                      
## [69] "Canada - Region 9"                      
## [70] "Pennsylvania - Region 8"                
## [71] "Vermont - Region 5 (R)"                 
## [72] "Vermont - Region 4"                     
## [73] "Connecticut - Region 3 (N)"             
## [74] "Pennsylvania - Region 3"                
## [75] "Tuscarora Indian Reservation"           
## [76] "Connecticut - Region 4"                 
## [77] "Connecticut - Region 3 (T)"             
## [78] "Massachusetts - Region 4"               
## [79] "Connecticut - Region 1"                 
## [80] "Canada - Region 8"                      
## [81] "Oil Springs Indian Reservation"         
## [82] "Canada - Region 5"                      
## [83] "Poospatuck Indian Reservation"          
## [84] "Onondaga Indian Reservation"            
## [85] "Shinnecock Indian Reservation"          
## [86] "St. Regis Indian Reservation - Region 5"
## [87] "Pennsylvania - Region 7"

Take sum of the quantity of spills in Bronx county and find the mean

bronx <- dr |>
  filter(county == "Bronx") 

bronx

## # A tibble: 15,017 × 20
##    spill_number program_facility_name street_1 street_2 locality county zip_code
##    <chr>        <chr>                 <chr>    <chr>    <chr>    <chr>     <dbl>
##  1 0204667      POLE 16091            GRACE A… <NA>     BRONX    Bronx        NA
##  2 0208773      MH 17516              LOCUST … <NA>     BRONX    Bronx        NA
##  3 1305446      #2 FUEL OIL SPILL FR… 41 ELLI… <NA>     BRONX    Bronx        NA
##  4 0407112      #32535 HESS GAS STAT… 1201 WE… <NA>     BRONX    Bronx        NA
##  5 1306665      #4 FUEL SPILL FROM V… 731 WHI… <NA>     BRONX    Bronx        NA
##  6 8602591      #6  IN MNHLE-173 ST … 173 ST … <NA>     NEW YOR… Bronx        NA
##  7 1401602      #6 FUEL OIL OVERFILL  653 EAS… <NA>     BRONX    Bronx        NA
##  8 1401578      #6 FUEL OIL SPILL TO… 1500 GR… <NA>     BRONX    Bronx        NA
##  9 0204815      #6 LEAK - TO METRO T… 2400 JO… <NA>     RIVERDA… Bronx        NA
## 10 1201066      @ RESIDENCE           311 EAS… <NA>     BRONX    Bronx        NA
## # ℹ 15,007 more rows
## # ℹ 13 more variables: swis_code <chr>, dec_region <dbl>, spill_date <chr>,
## #   received_date <chr>, contributing_factor <chr>, waterbody <chr>,
## #   source <chr>, close_date <chr>, material_name <chr>, material_family <chr>,
## #   quantity <dbl>, units <chr>, recovered <dbl>

# Sum of quantity
sum(bronx$quantity)

## [1] 155607391

# Mean of quantity
mean(bronx$quantity)

## [1] 10362.08

Take sum of the quantity of spills in Kings county and find the mean

brooklyn <- dr |>
  filter(county == "Kings") 

brooklyn

## # A tibble: 24,197 × 20
##    spill_number program_facility_name street_1 street_2 locality county zip_code
##    <chr>        <chr>                 <chr>    <chr>    <chr>    <chr>     <dbl>
##  1 0312848      BROOKLYN TECH HIGHSC… 29 FORT… <NA>     BROOKLYN Kings        NA
##  2 9814599      MANHOLE 2981          CLASSON… <NA>     BROOKLYN Kings        NA
##  3 0401659      MANHOLE 4467          GRAND A… <NA>     BROOKLYN Kings        NA
##  4 0011768      MANHOLE 829           WILLIAM… <NA>     NEW YOR… Kings        NA
##  5 0203407      MR. PRASAUD           317 WOO… <NA>     BROOKLYN Kings        NA
##  6 9903892      SOUTH OF BAYVIEW AVE  WEST SI… <NA>     BROOKLYN Kings        NA
##  7 0300033      #2 FUEL OIL SPILL TO… 149 CLI… <NA>     BROOKLYN Kings        NA
##  8 1214667      #2 FUEL SPILL AND VA… 119 BRO… <NA>     BROOKLYN Kings        NA
##  9 8601792      #2 SHEEN UPPER BAY    UPPER B… <NA>     NYC BRO… Kings        NA
## 10 9911046      #3 BORING HOLE        500 KEN… <NA>     BROOKLYN Kings        NA
## # ℹ 24,187 more rows
## # ℹ 13 more variables: swis_code <chr>, dec_region <dbl>, spill_date <chr>,
## #   received_date <chr>, contributing_factor <chr>, waterbody <chr>,
## #   source <chr>, close_date <chr>, material_name <chr>, material_family <chr>,
## #   quantity <dbl>, units <chr>, recovered <dbl>

# Sum of quantity
sum(brooklyn$quantity)

## [1] 35250797

# Mean of quantity
mean(brooklyn$quantity)

## [1] 1456.825

Take sum of the quantity of spills in New York county and find the mean

manhattan <- dr |>
  filter(county == "New York") 

manhattan

## # A tibble: 22,848 × 20
##    spill_number program_facility_name street_1 street_2 locality county zip_code
##    <chr>        <chr>                 <chr>    <chr>    <chr>    <chr>     <dbl>
##  1 9606869      AMSTERDAM AVE         WEST 79… <NA>     NYC      New Y…       NA
##  2 0405793      MANHOLE #58430        W 59TH … <NA>     MANHATT… New Y…       NA
##  3 0405793      MANHOLE #58430        W 59TH … <NA>     MANHATT… New Y…       NA
##  4 0313032      MANHOLE #6014         156-170… <NA>     MANHATT… New Y…       NA
##  5 0101621      VAULT 8754            135 WES… <NA>     MANHATT… New Y…       NA
##  6 0211713      VS 4673               5TH AVE… <NA>     MANHATT… New Y…       NA
##  7 0312312      WEST 42 SUBSTATION    521 WES… <NA>     MANHATT… New Y…       NA
##  8 1010077      # 7 SUBWAY LINE       STEINWA… 43+00-4… MANHATT… New Y…       NA
##  9 9903576      #2 EAST 29TH ST SUBS… 108 E 2… <NA>     MANHATT… New Y…       NA
## 10 0508699      #2 FO IN EXCAVATION   470 WES… <NA>     MANHATT… New Y…       NA
## # ℹ 22,838 more rows
## # ℹ 13 more variables: swis_code <chr>, dec_region <dbl>, spill_date <chr>,
## #   received_date <chr>, contributing_factor <chr>, waterbody <chr>,
## #   source <chr>, close_date <chr>, material_name <chr>, material_family <chr>,
## #   quantity <dbl>, units <chr>, recovered <dbl>

# Sum of quantity
sum(manhattan$quantity)

# Mean of quantity (about 671.04, as implied by the New York-Kings difference
# of -785.7823 in the Tukey HSD output: 1456.825 - 785.7823)
mean(manhattan$quantity)

Take sum of the quantity of spills in Richmond county and find the mean

staten_island <- dr |>
  filter(county == "Richmond") 

staten_island

## # A tibble: 7,468 × 20
##    spill_number program_facility_name street_1 street_2 locality county zip_code
##    <chr>        <chr>                 <chr>    <chr>    <chr>    <chr>     <dbl>
##  1 1607796      #2-322687             60 BAY … <NA>     STATEN … Richm…       NA
##  2 0311283      #3743 MANHOLE         ARDEN A… <NA>     STATEN … Richm…       NA
##  3 0907511      #6 OIL PUMP PAD       4101 AR… <NA>     STATEN … Richm…       NA
##  4 2301651      (DRILL) TUGBOAT       2015 RI… <NA>     STATEN … Richm…       NA
##  5 2203242      *DRILL*               LOWER B… <NA>     NYC      Richm…       NA
##  6 0104199      +                     RICHMON… <NA>     STATEN … Richm…       NA
##  7 9412352      020 AMSTRONG AVENUE   202 ARM… <NA>     STATEN … Richm…       NA
##  8 9608213      1 DAVIS AVE           1 DAVIS… <NA>     STATEN … Richm…       NA
##  9 9703129      1 DELMONT TERR        1 DELMO… <NA>     STATEN … Richm…       NA
## 10 0303737      1 EDGEWATER PLAZA     1 EDGEW… <NA>     STATEN … Richm…       NA
## # ℹ 7,458 more rows
## # ℹ 13 more variables: swis_code <chr>, dec_region <dbl>, spill_date <chr>,
## #   received_date <chr>, contributing_factor <chr>, waterbody <chr>,
## #   source <chr>, close_date <chr>, material_name <chr>, material_family <chr>,
## #   quantity <dbl>, units <chr>, recovered <dbl>

# Sum of quantity
sum(staten_island$quantity)

## [1] 4013206

# Mean of quantity
mean(staten_island$quantity)

## [1] 537.387

Take sum of the quantity of spills in Queens county and find the mean

queens <- dr |>
  filter(county == "Queens") 

queens

## # A tibble: 30,389 × 20
##    spill_number program_facility_name street_1 street_2 locality county zip_code
##    <chr>        <chr>                 <chr>    <chr>    <chr>    <chr>     <dbl>
##  1 0405586      BOWRY BAY             WATER P… <NA>     QUEENS   Queens       NA
##  2 0405586      BOWRY BAY             WATER P… <NA>     QUEENS   Queens       NA
##  3 0104307      149TH RD              183RD S… <NA>     QUEENS   Queens       NA
##  4 0109039      APT BUILDING    '     94-06  … <NA>     QUEENS   Queens       NA
##  5 0400255      PRIVATE RESIDENCE     133 -45… <NA>     QUEENS   Queens       NA
##  6 9411488      # 5 ROCKAWAY INLET    # 5 ROC… <NA>     QUEENS   Queens       NA
##  7 9813853      #0284 VAULT           164TH S… <NA>     QUEENS   Queens       NA
##  8 9906292      #1 GAS TURBINE        20TH AV… <NA>     NEW YORK Queens       NA
##  9 8605574      #2 FUEL OIL/ 130-19,… 130-19 … <NA>     NEW YOR… Queens       NA
## 10 1112016      #6 OIL SEEPAGE INTO … 147-25 … FAIRMON… QUEENS   Queens       NA
## # ℹ 30,379 more rows
## # ℹ 13 more variables: swis_code <chr>, dec_region <dbl>, spill_date <chr>,
## #   received_date <chr>, contributing_factor <chr>, waterbody <chr>,
## #   source <chr>, close_date <chr>, material_name <chr>, material_family <chr>,
## #   quantity <dbl>, units <chr>, recovered <dbl>

# Sum of quantity
sum(queens$quantity)

## [1] 179502701

# Mean of quantity
mean(queens$quantity)

## [1] 5906.831

Filter dataset to only Manhattan (New York), Brooklyn (Kings), Queens, Bronx, and Staten Island (Richmond)

counties_df <- dr |>
  filter(county %in% c("Kings", "New York", "Queens", "Bronx", "Richmond"))

counties_df

## # A tibble: 99,919 × 20
##    spill_number program_facility_name street_1 street_2 locality county zip_code
##    <chr>        <chr>                 <chr>    <chr>    <chr>    <chr>     <dbl>
##  1 0405586      BOWRY BAY             WATER P… <NA>     QUEENS   Queens       NA
##  2 0405586      BOWRY BAY             WATER P… <NA>     QUEENS   Queens       NA
##  3 0204667      POLE 16091            GRACE A… <NA>     BRONX    Bronx        NA
##  4 0104307      149TH RD              183RD S… <NA>     QUEENS   Queens       NA
##  5 9606869      AMSTERDAM AVE         WEST 79… <NA>     NYC      New Y…       NA
##  6 0109039      APT BUILDING    '     94-06  … <NA>     QUEENS   Queens       NA
##  7 0312848      BROOKLYN TECH HIGHSC… 29 FORT… <NA>     BROOKLYN Kings        NA
##  8 0405793      MANHOLE #58430        W 59TH … <NA>     MANHATT… New Y…       NA
##  9 0405793      MANHOLE #58430        W 59TH … <NA>     MANHATT… New Y…       NA
## 10 0313032      MANHOLE #6014         156-170… <NA>     MANHATT… New Y…       NA
## # ℹ 99,909 more rows
## # ℹ 13 more variables: swis_code <chr>, dec_region <dbl>, spill_date <chr>,
## #   received_date <chr>, contributing_factor <chr>, waterbody <chr>,
## #   source <chr>, close_date <chr>, material_name <chr>, material_family <chr>,
## #   quantity <dbl>, units <chr>, recovered <dbl>
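The per-county sums and means computed one county at a time above could also be produced in a single grouped summary. The sketch below uses a small invented data frame (`toy_df`) as a stand-in for `counties_df`, so the numbers it prints are not results from the spill data.

```r
library(dplyr)

# Toy stand-in for counties_df; these values are invented for illustration
toy_df <- data.frame(
  county   = c("Bronx", "Bronx", "Kings", "Kings", "Queens"),
  quantity = c(10, 30, 5, 15, 100)
)

# One row per county: number of spills, total quantity, and mean quantity
county_summary <- toy_df |>
  group_by(county) |>
  summarise(
    n_spills       = n(),
    total_quantity = sum(quantity),
    mean_quantity  = mean(quantity)
  )

county_summary
```

Replacing `toy_df` with `counties_df` should give the same sums and means as the separate county chunks, in one table.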

Statistical Analysis

I will perform a one-way ANOVA test, which relates a categorical variable to a quantitative variable. Here, county is the categorical variable and spill quantity is the quantitative variable. ANOVA is an appropriate method for the research question because it tests whether the mean spill quantity differs significantly among the counties. The null hypothesis states that the means for all counties are equal, while the alternative hypothesis states that the mean of at least one county differs from the others. After performing the ANOVA test and extracting the summary, the results show a p-value of 0.517, which is not statistically significant.

Hypothesis

Null Hypothesis: The mean spill quantity is equal across all five counties, where μ represents the mean spill quantity in a county.

\(H_0\): \(\mu_{\text{Bronx}} = \mu_{\text{Kings}} = \mu_{\text{New York}} = \mu_{\text{Queens}} = \mu_{\text{Richmond}}\)

Alternative Hypothesis: The mean spill quantity of at least one county differs from the others.

\(H_a\): not all \(\mu_i\) are equal

# Perform ANOVA
anova_result <- aov(quantity ~ county, data = counties_df)

anova_result

## Call:
##    aov(formula = quantity ~ county, data = counties_df)
## 
## Terms:
##                       county    Residuals
## Sum of Squares  1.216569e+12 3.742964e+16
## Deg. of Freedom            4        99914
## 
## Residual standard error: 612061
## Estimated effects may be unbalanced

p-value

#pass anova result to summary to get p value
summary(anova_result)

##                Df    Sum Sq   Mean Sq F value Pr(>F)
## county          4 1.217e+12 3.041e+11   0.812  0.517
## Residuals   99914 3.743e+16 3.746e+11

The p-value of 0.517 is greater than the 0.05 significance level, so we fail to reject the null hypothesis. The data do not show a statistically significant difference in mean spill quantity among the five counties.
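As a check on the summary table, the F value and p-value can be reproduced from the printed mean squares and degrees of freedom. This is a sketch that uses the rounded values shown in the output, so the results only match to about three decimal places.

```r
# Values copied from the summary(anova_result) output (rounded as printed)
ms_county <- 3.041e11   # mean square for county
ms_resid  <- 3.746e11   # mean square for residuals
df_county <- 4
df_resid  <- 99914

# F is the ratio of the between-county to the within-county mean square
f_value <- ms_county / ms_resid

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_county, df_resid, lower.tail = FALSE)

round(f_value, 3)  # approximately 0.812
round(p_value, 3)  # approximately 0.517
```

An F value below 1 means the variation between county means is smaller than the variation within counties, which is why the p-value is so large.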

Tukey’s Honestly Significant Difference (HSD) test on the ANOVA model

# TukeyHSD() is part of the stats package that loads with R, so no extra library is needed

TukeyHSD(anova_result)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = quantity ~ county, data = counties_df)
## 
## $county
##                         diff        lwr       upr     p adj
## Kings-Bronx       -8905.2573 -26249.340  8438.826 0.6272928
## New York-Bronx    -9691.0396 -27230.104  7848.025 0.5577342
## Queens-Bronx      -4455.2509 -21108.921 12198.419 0.9496444
## Richmond-Bronx    -9824.6954 -33465.150 13815.759 0.7887089
## New York-Kings     -785.7823 -16186.998 14615.433 0.9999158
## Queens-Kings       4450.0063  -9934.826 18834.838 0.9168824
## Richmond-Kings     -919.4381 -23020.338 21181.462 0.9999627
## Queens-New York    5235.7887  -9383.546 19855.124 0.8657303
## Richmond-New York  -133.6558 -22387.899 22120.587 1.0000000
## Richmond-Queens   -5369.4445 -26932.776 16193.887 0.9609744

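The largest difference in means is between Richmond and Bronx (-9824.6954), and the smallest is between Richmond and New York (-133.6558). Every adjusted p-value is well above 0.05 (the smallest is 0.558, for New York and Bronx) and every 95% confidence interval contains zero, so none of the pairwise differences is statistically significant.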
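The reading of the Tukey table can be verified programmatically. The sketch below retypes the printed table into a data frame (`tukey_tbl` is a name chosen here for illustration) and checks which intervals contain zero, which difference is largest, and which adjusted p-value is smallest.

```r
# Values copied from the TukeyHSD(anova_result) output above
tukey_tbl <- data.frame(
  comparison = c("Kings-Bronx", "New York-Bronx", "Queens-Bronx",
                 "Richmond-Bronx", "New York-Kings", "Queens-Kings",
                 "Richmond-Kings", "Queens-New York", "Richmond-New York",
                 "Richmond-Queens"),
  diff  = c(-8905.2573, -9691.0396, -4455.2509, -9824.6954, -785.7823,
            4450.0063, -919.4381, 5235.7887, -133.6558, -5369.4445),
  lwr   = c(-26249.340, -27230.104, -21108.921, -33465.150, -16186.998,
            -9934.826, -23020.338, -9383.546, -22387.899, -26932.776),
  upr   = c(8438.826, 7848.025, 12198.419, 13815.759, 14615.433,
            18834.838, 21181.462, 19855.124, 22120.587, 16193.887),
  p_adj = c(0.6272928, 0.5577342, 0.9496444, 0.7887089, 0.9999158,
            0.9168824, 0.9999627, 0.8657303, 1.0000000, 0.9609744)
)

# An interval that contains zero means the difference is not significant
tukey_tbl$contains_zero <- tukey_tbl$lwr < 0 & tukey_tbl$upr > 0

all(tukey_tbl$contains_zero)                          # TRUE
tukey_tbl$comparison[which.max(abs(tukey_tbl$diff))]  # "Richmond-Bronx"
tukey_tbl$comparison[which.min(tukey_tbl$p_adj)]      # "New York-Bronx"
```

The largest difference and the smallest adjusted p-value belong to different pairs because the p-value also depends on group sizes: Richmond has the fewest records, so its comparisons have the widest intervals.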

Box-Plot

boxplot(quantity ~ county, data = counties_df, 
         ylab = "Spill Quantity (Gallons)", xlab = "Counties",
        main = "Spill Quantity of Each county")

The box plots look very similar for all counties apart from the outliers. Each county's data appear to have a similar spread, which is consistent with the ANOVA finding of no significant difference in mean spill quantity.
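Because a few very large spills dominate the axis (the ANOVA output reports a residual standard error of 612061), the boxes can be hard to read on the raw scale. One option, sketched here on simulated right-skewed data rather than the spill records, is to plot a log-transformed quantity. The `+ 1` offset is an assumption made so that any zero quantities remain defined.

```r
set.seed(101)  # fixed seed so the simulated example is reproducible

# Simulated, heavily right-skewed quantities; not the real spill data
toy_df <- data.frame(
  county   = rep(c("Bronx", "Kings", "Queens"), each = 200),
  quantity = rlnorm(600, meanlog = 3, sdlog = 2.5)
)

# log10(quantity + 1) keeps zero quantities at 0 instead of -Inf
toy_df$log_quantity <- log10(toy_df$quantity + 1)

boxplot(log_quantity ~ county, data = toy_df,
        ylab = "log10(Spill Quantity + 1)", xlab = "Counties",
        main = "Spill Quantity of Each County (log scale, simulated data)")
```

On this scale the medians and quartiles are visible, which makes it easier to compare typical spill sizes between counties.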

Conclusion and Future Directions

The statistical analysis of the 99,919 spill records from the five selected counties, drawn from 564,321 records across New York State, found no statistically significant difference in the mean quantity of contaminants spilled among the counties. Since the p-value (0.517) exceeds the significance level of 0.05, the null hypothesis that the mean spill quantities for all counties are equal could not be rejected. This result suggests that, despite differences in the total amount of contaminants spilled per county, the average spill amount is statistically similar across the counties (New York, Kings, Queens, Bronx, and Richmond). Although the largest pairwise difference in means was between Richmond and Bronx, Tukey's HSD test shows that it is not statistically significant (adjusted p-value = 0.789).

While the analysis suggests that the mean spill quantity does not vary significantly by county, future research should shift focus to other key variables within the dataset, such as spill source, contributing factor, and material type, and test whether mean spill quantity differs significantly across their categories. For instance, testing whether the mean quantity of a spill caused by a tank test failure differs significantly from that of a spill caused by human error could provide insights for developing prevention strategies and safety training programs.