library(tidyverse)
setwd("~/Nassor/MC/Data101")
HWQ <- read_csv("Harbor_Water_Quality_02_25_NYC.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 98288 Columns: 100
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (26): Sampling Location, Duplicate Sample, Sample Date, Weather Conditi...
## dbl (64): Top Sample Temperature (ºC), Bottom Sample Temperature (ºC), Site...
## num (1): Bottom Fecal Coliform Bacteria (Cells/100mL)
## lgl (8): Bottom Dissolved Organic Carbon (mg/L), top_dissolved_organic_car...
## time (1): Sample Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This project seeks to answer the research question - Is the salinity different in each direction? The Harbor Water Quality Dataset I found contains 100 total columns and over 98,000 rows. This dataset can be found at Harbor Water Quality | NYC Open Data. Each case is a specific New York City water sample on a particular day from January 1909 to December 2025 and each column represents a measurement of a specific water quality factor. Since this dataset has a total of 100 variables, I chose to narrow my analysis down to one numerical variable and a character column, that I will convert to a categorical variable. As I am interested in differences in salinity, a numerical factor, I will compare the mean salinity across cardinal and ordinal direction categories (N, S, E, W, NE, NW, SE, SW).
“collect samples of harbor water at 85 sampling stations throughout New York Harbor” ()
First Let’s Take a Look at the data
dim(HWQ)
## [1] 98288 100
head(HWQ)
## # A tibble: 6 × 100
## `Sampling Location` `Duplicate Sample` `Sample Date` `Sample Time`
## <chr> <chr> <chr> <time>
## 1 BB2 <NA> 8/9/2006 13:53
## 2 BB2 <NA> 9/12/2006 13:10
## 3 BB4 <NA> 8/9/2006 14:02
## 4 BB4 <NA> 9/12/2006 13:17
## 5 E10 <NA> 10/23/2006 10:46
## 6 E10 <NA> 7/31/2006 10:00
## # ℹ 96 more variables: `Weather Condition (Dry or Wet)` <chr>,
## # `Top Sample Temperature (ºC)` <dbl>,
## # `Bottom Sample Temperature (ºC)` <dbl>, `Site Actual Depth (ft)` <dbl>,
## # `Top Sample Depth(ft)` <dbl>, `Bottom Sample Depth (ft)` <dbl>,
## # `Top Salinity (psu)` <dbl>, `Bottom Salinity (psu)` <dbl>,
## # `Top Conductivity (S/m)` <dbl>, `Bottom Conductivity (S/m)` <dbl>,
## # `CTD (conductivity, temperature, depth profiler) Top Dissolved Oxygen (mg/L)` <dbl>, …
Simplifying column names for use
names(HWQ) <- tolower(names(HWQ)) #lower case variable names
names(HWQ) <- gsub(" ", "_", names(HWQ)) # remove spaces in variable names
names(HWQ) <- gsub( "[(,)]", "", names(HWQ)) #remove the parenthesis in variable names
hwq <- HWQ |>
rename(current_direction = current_direction_current_direction) |>
rename(salinity = top_salinity__psu)
head(hwq)
## # A tibble: 6 × 100
## sampling_location duplicate_sample sample_date sample_time
## <chr> <chr> <chr> <time>
## 1 BB2 <NA> 8/9/2006 13:53
## 2 BB2 <NA> 9/12/2006 13:10
## 3 BB4 <NA> 8/9/2006 14:02
## 4 BB4 <NA> 9/12/2006 13:17
## 5 E10 <NA> 10/23/2006 10:46
## 6 E10 <NA> 7/31/2006 10:00
## # ℹ 96 more variables: weather_condition_dry_or_wet <chr>,
## # top_sample_temperature_ºc <dbl>, bottom_sample_temperature_ºc <dbl>,
## # site_actual_depth_ft <dbl>, top_sample_depthft <dbl>,
## # bottom_sample_depth_ft <dbl>, salinity <dbl>, bottom_salinity__psu <dbl>,
## # `top_conductivity_s/m` <dbl>, `bottom_conductivity_s/m` <dbl>,
## # `ctd_conductivity_temperature_depth_profiler_top_dissolved_oxygen_mg/l` <dbl>,
## # `ctd_conductivity_temperature_depth_profiler_bottom_dissolved_oxygen_mg/l` <dbl>, …
Selecting Key Variables
“sample_date” “top_salinity__psu” “current_direction_current_direction”
Exploring the class of key variables
unique(hwq$current_direction)
## [1] "NE" "SW" "N" "S"
## [5] "SE" "E" "W" "NW"
## [9] "SSW" "SSE" NA "*"
## [13] "LW SLK" "SLK" "WNW" "ND"
## [17] "ENE" "0" "WEAK" "N/E"
## [21] "49" "NNE" "0.2" "0.4"
## [25] "O" "0.5" "0.1" "ESE"
## [29] "ELY" "86" "EBB" "117"
## [33] "Sly" "NNW" "Ebb" "SLACK"
## [37] "H.W 0" "Slack" "LW 0" "FLOOD"
## [41] "FLD" "10" "CALM" "NLY"
## [45] "F" "WEST" "EB" "FL"
## [49] "." "?" "SL" "W 1.6"
## [53] "1.1" "1" "SE-1.1" "E-1.0"
## [57] "W-0.8" "0.8" "0.6" "2"
## [61] "W-1.0" "E-0.1" "240" "E-0.4"
## [65] "NE-0.6" "slack" "EBBSW" "46"
## [69] "W-0.7" "W-.4" "0.7" "W-0.3"
## [73] "1.5" "NE-0.8" "SW-0.4" "W-0.6"
## [77] "1.3" "0.3" "26" "180"
## [81] "185" "FLDNE" "202" "Nly"
## [85] "WSW" "3" "E 0.5" "181"
## [89] "0E-0.7" "N-1.0" "S-0.4" "S-0.2"
## [93] "N-0.3" "N/W" "S-0.5" "SW-1.3"
## [97] "S-0.3" "S-1.1" "199" "N-0.4"
## [101] "N-.3" "N-0.8" "W-0.9" "103"
## [105] "294" "EBBNE" "130" "W 1.3"
## [109] "1.7" "194" "S-1.3" "E-0.3"
## [113] "E-0.6" "W-2.1" "NS" "171"
## [117] "SLW" "E-0.9" "N-0.6" "357"
## [121] "E-.7" "W-1.8" "W-1.2" "1.2"
## [125] "1.4" "N\\E" "ME" "E 3.8"
## [129] "Flood" "248" "0.9" "W-4.4"
## [133] "E-3.6" "2.2" "E-1.3" "E-1.2"
## [137] "5" "303N" "E-5.0" "W-3.2"
## [141] "239" "W-3.5" "WW" "W-3.7"
## [145] "230" "E-4.7" "W-4.8" "E-1.8"
## [149] "W-3.0" "1.9" "E-1.5" "W-0.2"
## [153] "2.1" "SW-2.4" "4" "E 0.7"
## [157] "287" "SE-0.6" "S-1.0" "S-1.2"
## [161] "WEAk" "S-0.6" "SE-0.5" "S-0.9"
## [165] "N-0.7" "125" "N-1.6" "W-.9"
## [169] "NN-0.7" "S-0.7" "1.6" "W-0.5"
## [173] "SE 2.0" "256" "SE-1.9" "1.8"
## [177] "E-0.2" "E-2.3" "W-1.9" "311"
## [181] "SE-0.1" "314" "E-0.7" "none"
## [185] "W-.8" "85" "E 2.8" "280"
## [189] "S-2.4" "W-0.4" "146" "W-.6"
## [193] "E-1.1" "NW-0.5" "HW" "LW"
## [197] "147" "N-0.2" "N-0.1" "NE-0.9"
## [201] "186" "292" "HW 0" "94"
## [205] "207" "251" "--" "277"
## [209] "188" "FLOW" "70" "NE-0.3"
## [213] "NE-0.5" "SW-0.7" "12" "120"
## [217] "S/E" "WN" "S/W" "S 1.2"
## [221] "223" "SW-0.8" "56" "N-2"
## [225] "172" "L.W.0" "N-2.5" "S-2.2"
## [229] "N-2.2" "S-1.5" "260" "286"
## [233] "W-1.6" "2.5" "237" "W-1.5"
## [237] "SWSW" "SLY" "328" "177"
## [241] "N-0.5" "352" "NE-0.2" "NNE-0.4"
## [245] "E-0.5" "210" "NW-0.2" "111"
## [249] "N-1.4" "N/A" "NW-0.7" "E-2.5"
## [253] "N-1.1" "189" "51" "S-0.8"
## [257] "E-1.4" "138" "98" "69"
## [261] "28" "W-1.1" "270" "E-0.8"
## [265] "N-1.5" "238" "NW-0.6" "SWLY"
## [269] "NE-1.2" "259" "NE-0.7" "67"
## [273] "44" "NW-0.8" "N-1.2" "324"
## [277] "WLY" "65" "29" "NE-2.0"
## [281] "W-1.7" "E-1.9" "52" "41"
## [285] "305" "27" "NE-0.4" "Flooding"
## [289] "E-01.0" "E-3.0" "99" "S-2.0"
## [293] "E-2.0" "245" "149" "211"
## [297] "N-1.9" "SE-0.3" "S-1.6" "NNE-0.5"
## [301] "190" "SW-1.9" "flood" "SS"
## [305] "WWN" "213" "N-1.3" "SW-0.9"
## [309] "S-1.8" "SE-0.7" "170" "SE-0.8"
## [313] "15" "SW-1.0" "8" "72"
## [317] "183" "176" "S 0.8" "3.2"
## [321] "NE-1.1" "S-3.1" "N-2.4" "N."
## [325] "31" "160" "S-1.4" "S-1.7"
## [329] "N-0.9" "S-2.1" "SE-1.5" "SE-2.0"
## [333] "322" "SE-1.2" "SE-1.7" "123"
## [337] "NW-01.0" "SHW" "M" "3.1"
## [341] "197" "S-3.2" "S-3" "N-1.8"
## [345] "35" "90" "S-2" "S-2.3"
## [349] "NE-1.6" "S 0.7" "2.7" ".S"
## [353] "225" "S-3.8" "S-2.7" "S-2.5"
## [357] "2.3" "S-3.0" "S 1.9" "217"
## [361] "End of Ebb" "150" "115" "SW-1.6"
## [365] "OO" "2.8" "S-1.9" "SW-2.1"
## [369] "0.05" "246" "N-1.7" "SE-0.9"
## [373] "End of Flood" "SE-1.3" "95" "SW-0.6"
## [377] "319" "E-8" "E-1.6" "NN"
## [381] "SW-0.3" "304" "92" "220"
## [385] "23" "17" "WNW-0.3" "179"
## [389] "0.0 CALM" "SE-0.4" "S0" "355"
## [393] "---" "196" "135" "131"
## [397] "348" "75" "VAR" "59"
## [401] "88" "38" "Beginning Ebb" "1st Flood"
## [405] "n/a" "lwslk" "ebb"
class(hwq$current_direction)
## [1] "character"
summary(hwq$salinity)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 20.40 24.20 22.72 26.60 36.30 22641
class(hwq$salinity)
## [1] "numeric"
class(hwq$sample_date)
## [1] "character"
head(hwq$sample_date)
## [1] "8/9/2006" "9/12/2006" "8/9/2006" "9/12/2006" "10/23/2006"
## [6] "7/31/2006"
Manipulating the Dataset
Dataframe hwq became hwq1 through the following
actions-
The variable class for the column “sample_date” was initially
character string. I will use the “mdy” function to create a new readable
“date class” column in R called date. Next, I will apply a
filter for only dates after January 1, 2000 and for only current
directions within groups (“N”, “S”, “E”, “W”, “NE”, “NW”, “SE”, “SW”).
Additionally, I will prepare my data by removing NA’s from the
salinity variable using “!is.na”. Lastly, I will pass hwq1
through the “group_by” function in order to observe the distribution
counts as a new frame called hwq_direction.
hwq1 <- hwq |>
mutate(date = mdy(sample_date)) |>
filter(date >= 01-01-2000) |>
filter(current_direction %in% c("N", "S", "E", "W", "NE", "NW", "SE", "SW")) |>
filter(!is.na(salinity))
hwq_direction <- hwq1 |>
group_by(current_direction)|>
count()
hwq_direction
## # A tibble: 8 × 2
## # Groups: current_direction [8]
## current_direction n
## <chr> <int>
## 1 E 10937
## 2 N 11843
## 3 NE 1647
## 4 NW 1047
## 5 S 14839
## 6 SE 1013
## 7 SW 1522
## 8 W 10322
Counts are sufficiently large that outliers should not affect my ability to use ANOVA.
Creating a vizualization
ggplot(hwq1, aes(x = current_direction, y = salinity, fill = current_direction)) +
geom_boxplot() +
labs(title = "Salinity by Current Direction", x = "Current Direction", y = "Salinity") +
theme_minimal() +
theme(legend.position = "none")
Side-by-side ox plot was selected to visualize the distributions between the current direction groups. Observations include; extremely low outliers (Up to 0psu), East (E) and West (W) have higher median values and smaller quartiles, while North (N) and South (S) have lower medians and greater range in their distribution.
H₀: mean salinity is equivalent in all current directions H₁: mean salinity is not equivalent in all current directions
summary(hwq1$salinity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.10 20.58 24.10 22.64 26.32 36.30
anova_result <- aov(salinity ~ current_direction, data = hwq1)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## current_direction 7 156936 22419 780.9 <2e-16 ***
## Residuals 53162 1526284 29
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F statistic: 780.9 p-value: 2e-16 The p value is close to zero and much less than the alpha (0.05). As such, I found evicence to reject the null hypothesis. In simpler terms, the salinity is not equivalent in all current directions.
tukey_result <- TukeyHSD(anova_result)
tukey_result
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = salinity ~ current_direction, data = hwq1)
##
## $current_direction
## diff lwr upr p adj
## N-E -3.23194843 -3.4473171 -3.01657980 0.0000000
## NE-E -1.12905353 -1.5582924 -0.69981469 0.0000000
## NW-E -0.94918107 -1.4745496 -0.42381259 0.0000012
## S-E -3.83644463 -4.0411088 -3.63178042 0.0000000
## SE-E -1.58807057 -2.1214247 -1.05471642 0.0000000
## SW-E -1.98653721 -2.4308315 -1.54224293 0.0000000
## W-E 0.03864819 -0.1842086 0.26150501 0.9995328
## NE-N 2.10289490 1.6758103 2.52997947 0.0000000
## NW-N 2.28276736 1.7591575 2.80637722 0.0000000
## S-N -0.60449620 -0.8046029 -0.40438953 0.0000000
## SE-N 1.64387786 1.1122559 2.17549981 0.0000000
## SW-N 1.24541122 0.8031979 1.68762457 0.0000000
## W-N 3.27059662 3.0519178 3.48927539 0.0000000
## NW-NE 0.17987246 -0.4620229 0.82176779 0.9901830
## S-NE -2.70739110 -3.1291790 -2.28560323 0.0000000
## SE-NE -0.45901705 -1.1074646 0.18943051 0.3854846
## SW-NE -0.85748368 -1.4349047 -0.28006271 0.0001822
## W-NE 1.16770171 0.7367925 1.59861091 0.0000000
## S-NW -2.88726356 -3.4065622 -2.36796492 0.0000000
## SE-NW -0.63888950 -1.3546061 0.07682713 0.1207612
## SW-NW -1.03735614 -1.6894152 -0.38529706 0.0000389
## W-NW 0.98782926 0.4610952 1.51456335 0.0000004
## SE-S 2.24837406 1.7209978 2.77575029 0.0000000
## SW-S 1.84990742 1.4128074 2.28700745 0.0000000
## W-S 3.87509282 3.6669482 4.08323747 0.0000000
## SW-SE -0.39846664 -1.0569768 0.26004354 0.5966725
## W-SE 1.62671876 1.0920194 2.16141813 0.0000000
## W-SW 2.02518540 1.5792771 2.47109365 0.0000000
Tukey Observations: 23 out of 28 current direction group pairs were significant for difference of salinity. These 5 were the only pairs with an adjusted pvalue greater than 0.05. A majority of the comparisons showed a difference in mean salinity. This demonstrates evidence why I am rejecting the null hypothesis. W-E 0.03864819 -0.1842086 0.26150501 0.9995328 NW-NE 0.17987246 -0.4620229 0.82176779 0.9901830 SE-NE -0.45901705 -1.1074646 0.18943051 0.3854846 SE-NW -0.63888950 -1.3546061 0.07682713 0.1207612 SW-SE -0.39846664 -1.0569768 0.26004354 0.5966725
The New York City Department of Environmental Protection claims that the Harbor is cleaner than it has been for the last 100 years, due to their efforts to improve wastewater handling and treatment. They report on 45 individual waterbodies, monitoring sources of pollution and releasing advisory warnings for the planning of recreational activities.
One conclusion that stood out to me is that the outliers were extremely low (as low as 0 psu). I wonder if this could have to do with weather conditions and increased precipitation. If I was to explore more, I would want to use a multiple regression model to incorporate weather conditions as a factor in the salinity of the water in each direction.
Anthropic. (2026, June 27). Claude (Sonnet 4.6). Response to prompt: “Remove word entries and blank rows from numerical variable in tidyverse”.
New York City Department of Environmental Protection. (June 15, 2026). Harbor Water Quality [dataset]. NYC Open Data. https://data.cityofnewyork.us/Environment/Harbor-Water-Quality/5uug-f49n/about_data
New York City Department of Environmental Protection. (n.d.) Harbor Water Quality - DEP. NYC.gov. https://www.nyc.gov/site/dep/water/harbor-water-quality.page