library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
setwd("~/Desktop/Final Project DATA101")
dr <- read_csv("Spill_Incidents_ny.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 564321 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Spill Number, Program Facility Name, Street 1, Street 2, Locality,...
## dbl (4): ZIP Code, DEC Region, Quantity, Recovered
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This dataset contains 564,321 records of spills of petroleum and
other contaminants in New York State. Under New York State law,
spills that could pollute the land or waters of the state must be
reported. The data contains 20 variables, including
details on the date, location, source, cause, material, and quantity of
each spill. This report uses the variables county and
quantity (gallons) to total the spilled quantity in each
county and to compare the mean spill quantity among different
New York counties.
Spill incident reports are crucial for risk assessment and management. Recording where, when, and how much material was spilled helps protect the environment and public health and supports better safety protocols. Documenting hazardous spills in detail also enforces accountability. Overall, analyzing spill incidents not only creates opportunities to improve the data but also holds operators responsible for proper prevention and remediation.
Data source: Spill Incidents in New York, from Data.gov: https://catalog.data.gov/dataset/spill-incidents
head(dr)
## # A tibble: 6 × 20
## `Spill Number` `Program Facility Name` `Street 1` `Street 2` Locality County
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 0107132 MH 864 RT 119/MILL… <NA> ELMSFORD Westc…
## 2 0405586 BOWRY BAY WATER POLL … <NA> QUEENS Queens
## 3 0405586 BOWRY BAY WATER POLL … <NA> QUEENS Queens
## 4 0204667 POLE 16091 GRACE AVE/B… <NA> BRONX Bronx
## 5 0210559 POLE ON FERDALE LOM… <NA> LIBERTY Sulli…
## 6 0311484 PRIVATE RESIDENCE 6568 GLEN H… <NA> SCOTT Cortl…
## # ℹ 14 more variables: `ZIP Code` <dbl>, `SWIS Code` <chr>, `DEC Region` <dbl>,
## # `Spill Date` <chr>, `Received Date` <chr>, `Contributing Factor` <chr>,
## # Waterbody <chr>, Source <chr>, `Close Date` <chr>, `Material Name` <chr>,
## # `Material Family` <chr>, Quantity <dbl>, Units <chr>, Recovered <dbl>
This analysis prepares the data for an ANOVA test by cleaning the
dataset: checking for missing values and renaming columns for
clarity and consistency. It also includes summary statistics (the
total and mean spill quantity) for each county of interest.
This report focuses on only 5 counties in the state of New York,
the five boroughs of New York City (New York County, Kings County,
Queens County, Bronx County, and Richmond County).
colSums(is.na(dr))
## Spill Number Program Facility Name Street 1
## 0 0 119
## Street 2 Locality County
## 520428 1134 0
## ZIP Code SWIS Code DEC Region
## 509342 0 0
## Spill Date Received Date Contributing Factor
## 151 477 0
## Waterbody Source Close Date
## 516402 0 11499
## Material Name Material Family Quantity
## 0 0 0
## Units Recovered
## 115947 0
# Clean variable names
names(dr) <- gsub("[(). \\-]", "_", names(dr)) # Replace ".", "(", ")", spaces, and "-" with underscores
names(dr) <- gsub("_$", "", names(dr)) # Remove a trailing underscore
names(dr) <- tolower(names(dr)) # Convert to lowercase
head(dr)
## # A tibble: 6 × 20
## spill_number program_facility_name street_1 street_2 locality county zip_code
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 0107132 MH 864 RT 119/M… <NA> ELMSFORD Westc… NA
## 2 0405586 BOWRY BAY WATER PO… <NA> QUEENS Queens NA
## 3 0405586 BOWRY BAY WATER PO… <NA> QUEENS Queens NA
## 4 0204667 POLE 16091 GRACE AV… <NA> BRONX Bronx NA
## 5 0210559 POLE ON FERDALE … <NA> LIBERTY Sulli… NA
## 6 0311484 PRIVATE RESIDENCE 6568 GLE… <NA> SCOTT Cortl… NA
## # ℹ 13 more variables: swis_code <chr>, dec_region <dbl>, spill_date <chr>,
## # received_date <chr>, contributing_factor <chr>, waterbody <chr>,
## # source <chr>, close_date <chr>, material_name <chr>, material_family <chr>,
## # quantity <dbl>, units <chr>, recovered <dbl>
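As a minimal sketch of the same clean-up in one reusable step, the helper below (clean_names_basic, a hypothetical name that is not part of the original analysis) replaces every run of non-alphanumeric characters with a single underscore instead of listing the characters one by one. The sample column names are taken from the dataset above.

```r
# Hypothetical helper: a stricter variant of the renaming shown above.
clean_names_basic <- function(x) {
  x <- gsub("[^A-Za-z0-9]+", "_", x)  # any run of other characters -> "_"
  x <- gsub("^_+|_+$", "", x)          # drop leading/trailing underscores
  tolower(x)                           # lowercase everything
}

sample_names <- c("Spill Number", "ZIP Code", "Street 1", "Program Facility Name")
cleaned <- clean_names_basic(sample_names)
print(cleaned)
```

Collapsing runs of characters means a name such as "Street  1" with two spaces would still yield a single underscore, which the character-by-character version would not.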
unique(dr$county)
## [1] "Westchester"
## [2] "Queens"
## [3] "Bronx"
## [4] "Sullivan"
## [5] "Cortland"
## [6] "New York"
## [7] "Ulster"
## [8] "Kings"
## [9] "Orange"
## [10] "Dutchess"
## [11] "Onondaga"
## [12] "Saratoga"
## [13] "Cayuga"
## [14] "Oswego"
## [15] "Warren"
## [16] "Niagara"
## [17] "Rockland"
## [18] "Nassau"
## [19] "Jefferson"
## [20] "Schenectady"
## [21] "Albany"
## [22] "Monroe"
## [23] "Schuyler"
## [24] "St Lawrence"
## [25] "Richmond"
## [26] "Clinton"
## [27] "Lewis"
## [28] "Essex"
## [29] "Chenango"
## [30] "Erie"
## [31] "Livingston"
## [32] "Oneida"
## [33] "Wayne"
## [34] "Suffolk"
## [35] "Orleans"
## [36] "Ontario"
## [37] "Genesee"
## [38] "Otsego"
## [39] "Tompkins"
## [40] "Madison"
## [41] "Chemung"
## [42] "Seneca"
## [43] "Broome"
## [44] "Hamilton"
## [45] "Washington"
## [46] "Steuben"
## [47] "Rensselaer"
## [48] "Franklin"
## [49] "Columbia"
## [50] "Fulton"
## [51] "Herkimer"
## [52] "Schoharie"
## [53] "Montgomery"
## [54] "Putnam"
## [55] "Delaware"
## [56] "New Jersey - Region 2"
## [57] "Tioga"
## [58] "Chautauqua"
## [59] "Cattaraugus"
## [60] "Wyoming"
## [61] "Yates"
## [62] "Greene"
## [63] "Pennsylvania - Region 9"
## [64] "Allegany"
## [65] "New Jersey - Region 3 (N)"
## [66] "Cattaraugus Indian Reservation"
## [67] "New Jersey - Region 3 (T)"
## [68] "Canada - Region 6"
## [69] "Canada - Region 9"
## [70] "Pennsylvania - Region 8"
## [71] "Vermont - Region 5 (R)"
## [72] "Vermont - Region 4"
## [73] "Connecticut - Region 3 (N)"
## [74] "Pennsylvania - Region 3"
## [75] "Tuscarora Indian Reservation"
## [76] "Connecticut - Region 4"
## [77] "Connecticut - Region 3 (T)"
## [78] "Massachusetts - Region 4"
## [79] "Connecticut - Region 1"
## [80] "Canada - Region 8"
## [81] "Oil Springs Indian Reservation"
## [82] "Canada - Region 5"
## [83] "Poospatuck Indian Reservation"
## [84] "Onondaga Indian Reservation"
## [85] "Shinnecock Indian Reservation"
## [86] "St. Regis Indian Reservation - Region 5"
## [87] "Pennsylvania - Region 7"
bronx <- dr |>
filter(county == "Bronx")
bronx
## # A tibble: 15,017 × 20
## spill_number program_facility_name street_1 street_2 locality county zip_code
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 0204667 POLE 16091 GRACE A… <NA> BRONX Bronx NA
## 2 0208773 MH 17516 LOCUST … <NA> BRONX Bronx NA
## 3 1305446 #2 FUEL OIL SPILL FR… 41 ELLI… <NA> BRONX Bronx NA
## 4 0407112 #32535 HESS GAS STAT… 1201 WE… <NA> BRONX Bronx NA
## 5 1306665 #4 FUEL SPILL FROM V… 731 WHI… <NA> BRONX Bronx NA
## 6 8602591 #6 IN MNHLE-173 ST … 173 ST … <NA> NEW YOR… Bronx NA
## 7 1401602 #6 FUEL OIL OVERFILL 653 EAS… <NA> BRONX Bronx NA
## 8 1401578 #6 FUEL OIL SPILL TO… 1500 GR… <NA> BRONX Bronx NA
## 9 0204815 #6 LEAK - TO METRO T… 2400 JO… <NA> RIVERDA… Bronx NA
## 10 1201066 @ RESIDENCE 311 EAS… <NA> BRONX Bronx NA
## # ℹ 15,007 more rows
## # ℹ 13 more variables: swis_code <chr>, dec_region <dbl>, spill_date <chr>,
## # received_date <chr>, contributing_factor <chr>, waterbody <chr>,
## # source <chr>, close_date <chr>, material_name <chr>, material_family <chr>,
## # quantity <dbl>, units <chr>, recovered <dbl>
# Sum of quantity
sum(bronx$quantity)
## [1] 155607391
# Mean of quantity
mean(bronx$quantity)
## [1] 10362.08
brooklyn <- dr |>
filter(county == "Kings")
brooklyn
## # A tibble: 24,197 × 20
## spill_number program_facility_name street_1 street_2 locality county zip_code
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 0312848 BROOKLYN TECH HIGHSC… 29 FORT… <NA> BROOKLYN Kings NA
## 2 9814599 MANHOLE 2981 CLASSON… <NA> BROOKLYN Kings NA
## 3 0401659 MANHOLE 4467 GRAND A… <NA> BROOKLYN Kings NA
## 4 0011768 MANHOLE 829 WILLIAM… <NA> NEW YOR… Kings NA
## 5 0203407 MR. PRASAUD 317 WOO… <NA> BROOKLYN Kings NA
## 6 9903892 SOUTH OF BAYVIEW AVE WEST SI… <NA> BROOKLYN Kings NA
## 7 0300033 #2 FUEL OIL SPILL TO… 149 CLI… <NA> BROOKLYN Kings NA
## 8 1214667 #2 FUEL SPILL AND VA… 119 BRO… <NA> BROOKLYN Kings NA
## 9 8601792 #2 SHEEN UPPER BAY UPPER B… <NA> NYC BRO… Kings NA
## 10 9911046 #3 BORING HOLE 500 KEN… <NA> BROOKLYN Kings NA
## # ℹ 24,187 more rows
## # ℹ 13 more variables: swis_code <chr>, dec_region <dbl>, spill_date <chr>,
## # received_date <chr>, contributing_factor <chr>, waterbody <chr>,
## # source <chr>, close_date <chr>, material_name <chr>, material_family <chr>,
## # quantity <dbl>, units <chr>, recovered <dbl>
# Sum of quantity
sum(brooklyn$quantity)
## [1] 35250797
# Mean of quantity
mean(brooklyn$quantity)
## [1] 1456.825
manhattan <- dr |>
filter(county == "New York")
manhattan
## # A tibble: 22,848 × 20
## spill_number program_facility_name street_1 street_2 locality county zip_code
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 9606869 AMSTERDAM AVE WEST 79… <NA> NYC New Y… NA
## 2 0405793 MANHOLE #58430 W 59TH … <NA> MANHATT… New Y… NA
## 3 0405793 MANHOLE #58430 W 59TH … <NA> MANHATT… New Y… NA
## 4 0313032 MANHOLE #6014 156-170… <NA> MANHATT… New Y… NA
## 5 0101621 VAULT 8754 135 WES… <NA> MANHATT… New Y… NA
## 6 0211713 VS 4673 5TH AVE… <NA> MANHATT… New Y… NA
## 7 0312312 WEST 42 SUBSTATION 521 WES… <NA> MANHATT… New Y… NA
## 8 1010077 # 7 SUBWAY LINE STEINWA… 43+00-4… MANHATT… New Y… NA
## 9 9903576 #2 EAST 29TH ST SUBS… 108 E 2… <NA> MANHATT… New Y… NA
## 10 0508699 #2 FO IN EXCAVATION 470 WES… <NA> MANHATT… New Y… NA
## # ℹ 22,838 more rows
## # ℹ 13 more variables: swis_code <chr>, dec_region <dbl>, spill_date <chr>,
## # received_date <chr>, contributing_factor <chr>, waterbody <chr>,
## # source <chr>, close_date <chr>, material_name <chr>, material_family <chr>,
## # quantity <dbl>, units <chr>, recovered <dbl>
# Sum of quantity
sum(manhattan$quantity)
# Mean of quantity (about 671.04, as implied by the New York-Bronx difference in the Tukey output)
mean(manhattan$quantity)
staten_island <- dr |>
filter(county == "Richmond")
staten_island
## # A tibble: 7,468 × 20
## spill_number program_facility_name street_1 street_2 locality county zip_code
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 1607796 #2-322687 60 BAY … <NA> STATEN … Richm… NA
## 2 0311283 #3743 MANHOLE ARDEN A… <NA> STATEN … Richm… NA
## 3 0907511 #6 OIL PUMP PAD 4101 AR… <NA> STATEN … Richm… NA
## 4 2301651 (DRILL) TUGBOAT 2015 RI… <NA> STATEN … Richm… NA
## 5 2203242 *DRILL* LOWER B… <NA> NYC Richm… NA
## 6 0104199 + RICHMON… <NA> STATEN … Richm… NA
## 7 9412352 020 AMSTRONG AVENUE 202 ARM… <NA> STATEN … Richm… NA
## 8 9608213 1 DAVIS AVE 1 DAVIS… <NA> STATEN … Richm… NA
## 9 9703129 1 DELMONT TERR 1 DELMO… <NA> STATEN … Richm… NA
## 10 0303737 1 EDGEWATER PLAZA 1 EDGEW… <NA> STATEN … Richm… NA
## # ℹ 7,458 more rows
## # ℹ 13 more variables: swis_code <chr>, dec_region <dbl>, spill_date <chr>,
## # received_date <chr>, contributing_factor <chr>, waterbody <chr>,
## # source <chr>, close_date <chr>, material_name <chr>, material_family <chr>,
## # quantity <dbl>, units <chr>, recovered <dbl>
# Sum of quantity
sum(staten_island$quantity)
## [1] 4013206
# Mean of quantity
mean(staten_island$quantity)
## [1] 537.387
queens <- dr |>
filter(county == "Queens")
queens
## # A tibble: 30,389 × 20
## spill_number program_facility_name street_1 street_2 locality county zip_code
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 0405586 BOWRY BAY WATER P… <NA> QUEENS Queens NA
## 2 0405586 BOWRY BAY WATER P… <NA> QUEENS Queens NA
## 3 0104307 149TH RD 183RD S… <NA> QUEENS Queens NA
## 4 0109039 APT BUILDING ' 94-06 … <NA> QUEENS Queens NA
## 5 0400255 PRIVATE RESIDENCE 133 -45… <NA> QUEENS Queens NA
## 6 9411488 # 5 ROCKAWAY INLET # 5 ROC… <NA> QUEENS Queens NA
## 7 9813853 #0284 VAULT 164TH S… <NA> QUEENS Queens NA
## 8 9906292 #1 GAS TURBINE 20TH AV… <NA> NEW YORK Queens NA
## 9 8605574 #2 FUEL OIL/ 130-19,… 130-19 … <NA> NEW YOR… Queens NA
## 10 1112016 #6 OIL SEEPAGE INTO … 147-25 … FAIRMON… QUEENS Queens NA
## # ℹ 30,379 more rows
## # ℹ 13 more variables: swis_code <chr>, dec_region <dbl>, spill_date <chr>,
## # received_date <chr>, contributing_factor <chr>, waterbody <chr>,
## # source <chr>, close_date <chr>, material_name <chr>, material_family <chr>,
## # quantity <dbl>, units <chr>, recovered <dbl>
# Sum of quantity
sum(queens$quantity)
## [1] 179502701
# Mean of quantity
mean(queens$quantity)
## [1] 5906.831
counties_df <- dr |>
filter(county %in% c("Kings", "New York", "Queens", "Bronx", "Richmond"))
counties_df
## # A tibble: 99,919 × 20
## spill_number program_facility_name street_1 street_2 locality county zip_code
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 0405586 BOWRY BAY WATER P… <NA> QUEENS Queens NA
## 2 0405586 BOWRY BAY WATER P… <NA> QUEENS Queens NA
## 3 0204667 POLE 16091 GRACE A… <NA> BRONX Bronx NA
## 4 0104307 149TH RD 183RD S… <NA> QUEENS Queens NA
## 5 9606869 AMSTERDAM AVE WEST 79… <NA> NYC New Y… NA
## 6 0109039 APT BUILDING ' 94-06 … <NA> QUEENS Queens NA
## 7 0312848 BROOKLYN TECH HIGHSC… 29 FORT… <NA> BROOKLYN Kings NA
## 8 0405793 MANHOLE #58430 W 59TH … <NA> MANHATT… New Y… NA
## 9 0405793 MANHOLE #58430 W 59TH … <NA> MANHATT… New Y… NA
## 10 0313032 MANHOLE #6014 156-170… <NA> MANHATT… New Y… NA
## # ℹ 99,909 more rows
## # ℹ 13 more variables: swis_code <chr>, dec_region <dbl>, spill_date <chr>,
## # received_date <chr>, contributing_factor <chr>, waterbody <chr>,
## # source <chr>, close_date <chr>, material_name <chr>, material_family <chr>,
## # quantity <dbl>, units <chr>, recovered <dbl>
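The per-county filter-and-summarise blocks above can also be written as one grouped summary, which avoids carrying a data frame name from one block to the next. The sketch below assumes dplyr is installed and uses a small made-up data frame (toy_spills, hypothetical values) in place of dr, so the printed numbers are illustrative only.

```r
library(dplyr)

# Made-up values for illustration; not taken from the spill dataset.
toy_spills <- data.frame(
  county   = c("Bronx", "Bronx", "Kings", "Kings", "Kings", "Albany"),
  quantity = c(10, 30, 5, 5, 20, 100)
)

county_summary <- toy_spills |>
  # keep only the five counties of interest (Albany is dropped here)
  filter(county %in% c("Kings", "New York", "Queens", "Bronx", "Richmond")) |>
  group_by(county) |>
  summarise(
    n_spills       = n(),
    total_quantity = sum(quantity),
    mean_quantity  = mean(quantity),
    .groups = "drop"
  )

print(county_summary)
```

With the real data, replacing toy_spills with dr should give one row per county containing the same sums and means as the separate blocks.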
I will perform a one-way ANOVA test to examine the relationship between a categorical variable and a quantitative variable. In this analysis, county is the categorical variable (with five levels) and spill quantity is the quantitative variable. This method is appropriate for the research question because it tests whether the mean spill quantity differs significantly among the counties. The null hypothesis states that the means for all counties are equal, while the alternative hypothesis states that the mean spill quantity of at least one county differs from the others. The ANOVA summary gives a p-value of 0.517, which is not statistically significant.
Hypothesis
Null Hypothesis: The mean spill quantity is the same in all five counties, where μ denotes the mean spill quantity of a county.
\(H_0\): \(\mu_{\text{Bronx}} = \mu_{\text{Kings}} = \mu_{\text{New York}} = \mu_{\text{Queens}} = \mu_{\text{Richmond}}\)
Alternative Hypothesis: The mean spill quantity of at least one county differs from the others.
\(H_a\): not all \(\mu_i\) are equal
# Perform ANOVA
anova_result <- aov(quantity ~ county, data = counties_df)
anova_result
## Call:
## aov(formula = quantity ~ county, data = counties_df)
##
## Terms:
## county Residuals
## Sum of Squares 1.216569e+12 3.742964e+16
## Deg. of Freedom 4 99914
##
## Residual standard error: 612061
## Estimated effects may be unbalanced
p-value
#pass anova result to summary to get p value
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## county 4 1.217e+12 3.041e+11 0.812 0.517
## Residuals 99914 3.743e+16 3.746e+11
The p-value of 0.517 is greater than the significance level of 0.05, so we fail to reject the null hypothesis. The data do not show a statistically significant difference in mean spill quantity among the five counties.
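To make the link between the two ANOVA printouts explicit: the F statistic is the ratio of the county mean square to the residual mean square, and the p-value is the upper tail of the F distribution with 4 and 99914 degrees of freedom. The sketch below recomputes both from the sums of squares printed by aov() above; the variable names are illustrative.

```r
# Sums of squares and degrees of freedom copied from the aov() output above
ss_county   <- 1.216569e12
ss_residual <- 3.742964e16
df_county   <- 4
df_residual <- 99914

# Mean squares: sum of squares divided by degrees of freedom
ms_county   <- ss_county / df_county      # about 3.041e11
ms_residual <- ss_residual / df_residual  # about 3.746e11

# F statistic and its upper-tail probability
f_stat  <- ms_county / ms_residual
p_value <- pf(f_stat, df_county, df_residual, lower.tail = FALSE)

print(round(c(F = f_stat, p = p_value), 3))
```

Both values should agree with the F value and Pr(>F) columns of summary(anova_result) after rounding.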
Tukey’s Honestly Significant Difference (HSD) test on the ANOVA model
library(tidyverse)
TukeyHSD(anova_result)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = quantity ~ county, data = counties_df)
##
## $county
## diff lwr upr p adj
## Kings-Bronx -8905.2573 -26249.340 8438.826 0.6272928
## New York-Bronx -9691.0396 -27230.104 7848.025 0.5577342
## Queens-Bronx -4455.2509 -21108.921 12198.419 0.9496444
## Richmond-Bronx -9824.6954 -33465.150 13815.759 0.7887089
## New York-Kings -785.7823 -16186.998 14615.433 0.9999158
## Queens-Kings 4450.0063 -9934.826 18834.838 0.9168824
## Richmond-Kings -919.4381 -23020.338 21181.462 0.9999627
## Queens-New York 5235.7887 -9383.546 19855.124 0.8657303
## Richmond-New York -133.6558 -22387.899 22120.587 1.0000000
## Richmond-Queens -5369.4445 -26932.776 16193.887 0.9609744
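The largest difference in means is between Richmond and Bronx (-9824.6954), and the smallest is between Richmond and New York (-133.6558). However, every adjusted p-value is well above 0.05 (the smallest is 0.5577, for New York-Bronx) and every 95% confidence interval contains 0, so no pair of counties differs significantly.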
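Each diff value in the Tukey table is the difference between two county means, so the table can be cross-checked against the means computed earlier. The sketch below adds the four "county minus Bronx" differences to the Bronx mean of 10362.08. The results should agree with the reported county means up to rounding, and the New York entry gives the mean for New York County implied by the table.

```r
bronx_mean <- 10362.08  # mean(bronx$quantity) reported earlier

# "county - Bronx" differences copied from the TukeyHSD() output
diff_vs_bronx <- c(
  "Kings"    = -8905.2573,
  "New York" = -9691.0396,
  "Queens"   = -4455.2509,
  "Richmond" = -9824.6954
)

# Implied mean of each county = Bronx mean + (county - Bronx)
implied_means <- bronx_mean + diff_vs_bronx
print(round(implied_means, 2))
```

Because the Bronx mean is only printed to two decimal places, agreement to within about 0.01 is the most that can be expected from this check.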
Box-Plot
boxplot(quantity ~ county, data = counties_df,
ylab = "Spill Quantity (Gallons)", xlab = "Counties",
main = "Spill Quantity of Each County")
The box plots look very similar for all groups apart from the outliers. The counties appear to have a similar spread, which is consistent with the ANOVA finding of no significant difference between the mean spill quantities.
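When a few very large values stretch the axis, the boxes themselves can be hard to compare on the raw scale. One option, not used in the original analysis, is to plot log10(quantity + 1) instead. The + 1 is an assumption made here so that any zero-quantity records remain defined. The sketch uses simulated, hypothetical values rather than the spill data.

```r
set.seed(1)

# Simulated right-skewed quantities; hypothetical, not the real data.
toy <- data.frame(
  county   = rep(c("Bronx", "Kings", "Queens"), each = 200),
  quantity = c(rlnorm(200, 3, 2), rlnorm(200, 3, 2), rlnorm(200, 3, 2))
)

# Same boxplot call as above, but on a log10 scale
bp <- boxplot(log10(quantity + 1) ~ county, data = toy,
              ylab = "log10(Spill Quantity + 1)", xlab = "Counties",
              main = "Spill Quantity of Each County (log scale)")
```

The object returned by boxplot() also holds the five-number summaries in bp$stats, which can be inspected numerically if the plot alone is inconclusive.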
Of the 564,321 spill records from New York State, this analysis used the 99,919 records from five counties (New York, Kings, Queens, Bronx, and Richmond). The ANOVA found no statistically significant difference in the mean quantity of contaminants spilled among these counties. Since the p-value (0.517) exceeds the significance level of 0.05, the null hypothesis that the mean spill quantities of all counties are equal could not be rejected. This result suggests that, despite differences in the total amount of contaminants spilled per county, the mean spill quantities cannot be distinguished statistically. Although the largest difference in sample means was between Richmond and Bronx, Tukey's HSD test found no significant difference for that pair or any other.
While the analysis suggests that the mean spill quantity does not vary significantly by county, future research should shift its focus to other key variables in the dataset, such as spill source, contributing factor, and material type. These variables should be explored to determine whether mean spill quantity differs significantly across their categories. For instance, testing whether the mean quantity of spills caused by tank test failure differs significantly from that of spills caused by human error could provide insights for developing prevention strategies and safety training programs.