Data101_Midterm Project

load the library

library(tidyverse)

load the data

setwd("~/Nassor/MC/Data101") 
HWQ <- read_csv("Harbor_Water_Quality_02_25_NYC.csv")

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 98288 Columns: 100
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (26): Sampling Location, Duplicate Sample, Sample Date, Weather Conditi...
## dbl  (64): Top Sample Temperature (ºC), Bottom Sample Temperature (ºC), Site...
## num   (1): Bottom Fecal Coliform Bacteria (Cells/100mL)
## lgl   (8): Bottom Dissolved Organic Carbon (mg/L), top_dissolved_organic_car...
## time  (1): Sample Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Part A: Introduction

This project seeks to answer the research question - Is the salinity different in each direction? The Harbor Water Quality Dataset I found contains 100 total columns and over 98,000 rows. This dataset can be found at Harbor Water Quality | NYC Open Data. Each case is a specific New York City water sample on a particular day from January 1909 to December 2025 and each column represents a measurement of a specific water quality factor. Since this dataset has a total of 100 variables, I chose to narrow my analysis down to one numerical variable and a character column, that I will convert to a categorical variable. As I am interested in differences in salinity, a numerical factor, I will compare the mean salinity across cardinal and ordinal direction categories (N, S, E, W, NE, NW, SE, SW).

“collect samples of harbor water at 85 sampling stations throughout New York Harbor” ()

First Let’s Take a Look at the data

dim(HWQ)

## [1] 98288   100

head(HWQ)

## # A tibble: 6 × 100
##   `Sampling Location` `Duplicate Sample` `Sample Date` `Sample Time`
##   <chr>               <chr>              <chr>         <time>       
## 1 BB2                 <NA>               8/9/2006      13:53        
## 2 BB2                 <NA>               9/12/2006     13:10        
## 3 BB4                 <NA>               8/9/2006      14:02        
## 4 BB4                 <NA>               9/12/2006     13:17        
## 5 E10                 <NA>               10/23/2006    10:46        
## 6 E10                 <NA>               7/31/2006     10:00        
## # ℹ 96 more variables: `Weather Condition (Dry or Wet)` <chr>,
## #   `Top Sample Temperature (ºC)` <dbl>,
## #   `Bottom Sample Temperature (ºC)` <dbl>, `Site Actual Depth (ft)` <dbl>,
## #   `Top Sample Depth(ft)` <dbl>, `Bottom Sample Depth (ft)` <dbl>,
## #   `Top Salinity  (psu)` <dbl>, `Bottom Salinity  (psu)` <dbl>,
## #   `Top Conductivity (S/m)` <dbl>, `Bottom Conductivity (S/m)` <dbl>,
## #   `CTD (conductivity, temperature, depth profiler) Top Dissolved Oxygen (mg/L)` <dbl>, …

Part B: Clean the Dataset and Conduct an Exploratory Data Analysis (EDA)

Simplify column names for use
Select key variables
Explore the class of key variables
Manipulate the dataset
Create a visualization

Simplifying column names for use

names(HWQ) <- tolower(names(HWQ)) #lower case variable names
names(HWQ) <- gsub(" ", "_", names(HWQ)) # remove spaces in variable names
names(HWQ) <- gsub( "[(,)]", "", names(HWQ)) #remove the parenthesis in variable names

hwq <- HWQ |> 
  rename(current_direction = current_direction_current_direction) |>
  rename(salinity = top_salinity__psu)

head(hwq)

## # A tibble: 6 × 100
##   sampling_location duplicate_sample sample_date sample_time
##   <chr>             <chr>            <chr>       <time>     
## 1 BB2               <NA>             8/9/2006    13:53      
## 2 BB2               <NA>             9/12/2006   13:10      
## 3 BB4               <NA>             8/9/2006    14:02      
## 4 BB4               <NA>             9/12/2006   13:17      
## 5 E10               <NA>             10/23/2006  10:46      
## 6 E10               <NA>             7/31/2006   10:00      
## # ℹ 96 more variables: weather_condition_dry_or_wet <chr>,
## #   top_sample_temperature_ºc <dbl>, bottom_sample_temperature_ºc <dbl>,
## #   site_actual_depth_ft <dbl>, top_sample_depthft <dbl>,
## #   bottom_sample_depth_ft <dbl>, salinity <dbl>, bottom_salinity__psu <dbl>,
## #   `top_conductivity_s/m` <dbl>, `bottom_conductivity_s/m` <dbl>,
## #   `ctd_conductivity_temperature_depth_profiler_top_dissolved_oxygen_mg/l` <dbl>,
## #   `ctd_conductivity_temperature_depth_profiler_bottom_dissolved_oxygen_mg/l` <dbl>, …

Selecting Key Variables

“sample_date” “top_salinity__psu” “current_direction_current_direction”

Exploring the class of key variables

unique(hwq$current_direction)

##   [1] "NE"            "SW"            "N"             "S"            
##   [5] "SE"            "E"             "W"             "NW"           
##   [9] "SSW"           "SSE"           NA              "*"            
##  [13] "LW SLK"        "SLK"           "WNW"           "ND"           
##  [17] "ENE"           "0"             "WEAK"          "N/E"          
##  [21] "49"            "NNE"           "0.2"           "0.4"          
##  [25] "O"             "0.5"           "0.1"           "ESE"          
##  [29] "ELY"           "86"            "EBB"           "117"          
##  [33] "Sly"           "NNW"           "Ebb"           "SLACK"        
##  [37] "H.W 0"         "Slack"         "LW 0"          "FLOOD"        
##  [41] "FLD"           "10"            "CALM"          "NLY"          
##  [45] "F"             "WEST"          "EB"            "FL"           
##  [49] "."             "?"             "SL"            "W 1.6"        
##  [53] "1.1"           "1"             "SE-1.1"        "E-1.0"        
##  [57] "W-0.8"         "0.8"           "0.6"           "2"            
##  [61] "W-1.0"         "E-0.1"         "240"           "E-0.4"        
##  [65] "NE-0.6"        "slack"         "EBBSW"         "46"           
##  [69] "W-0.7"         "W-.4"          "0.7"           "W-0.3"        
##  [73] "1.5"           "NE-0.8"        "SW-0.4"        "W-0.6"        
##  [77] "1.3"           "0.3"           "26"            "180"          
##  [81] "185"           "FLDNE"         "202"           "Nly"          
##  [85] "WSW"           "3"             "E 0.5"         "181"          
##  [89] "0E-0.7"        "N-1.0"         "S-0.4"         "S-0.2"        
##  [93] "N-0.3"         "N/W"           "S-0.5"         "SW-1.3"       
##  [97] "S-0.3"         "S-1.1"         "199"           "N-0.4"        
## [101] "N-.3"          "N-0.8"         "W-0.9"         "103"          
## [105] "294"           "EBBNE"         "130"           "W 1.3"        
## [109] "1.7"           "194"           "S-1.3"         "E-0.3"        
## [113] "E-0.6"         "W-2.1"         "NS"            "171"          
## [117] "SLW"           "E-0.9"         "N-0.6"         "357"          
## [121] "E-.7"          "W-1.8"         "W-1.2"         "1.2"          
## [125] "1.4"           "N\\E"          "ME"            "E 3.8"        
## [129] "Flood"         "248"           "0.9"           "W-4.4"        
## [133] "E-3.6"         "2.2"           "E-1.3"         "E-1.2"        
## [137] "5"             "303N"          "E-5.0"         "W-3.2"        
## [141] "239"           "W-3.5"         "WW"            "W-3.7"        
## [145] "230"           "E-4.7"         "W-4.8"         "E-1.8"        
## [149] "W-3.0"         "1.9"           "E-1.5"         "W-0.2"        
## [153] "2.1"           "SW-2.4"        "4"             "E 0.7"        
## [157] "287"           "SE-0.6"        "S-1.0"         "S-1.2"        
## [161] "WEAk"          "S-0.6"         "SE-0.5"        "S-0.9"        
## [165] "N-0.7"         "125"           "N-1.6"         "W-.9"         
## [169] "NN-0.7"        "S-0.7"         "1.6"           "W-0.5"        
## [173] "SE 2.0"        "256"           "SE-1.9"        "1.8"          
## [177] "E-0.2"         "E-2.3"         "W-1.9"         "311"          
## [181] "SE-0.1"        "314"           "E-0.7"         "none"         
## [185] "W-.8"          "85"            "E 2.8"         "280"          
## [189] "S-2.4"         "W-0.4"         "146"           "W-.6"         
## [193] "E-1.1"         "NW-0.5"        "HW"            "LW"           
## [197] "147"           "N-0.2"         "N-0.1"         "NE-0.9"       
## [201] "186"           "292"           "HW 0"          "94"           
## [205] "207"           "251"           "--"            "277"          
## [209] "188"           "FLOW"          "70"            "NE-0.3"       
## [213] "NE-0.5"        "SW-0.7"        "12"            "120"          
## [217] "S/E"           "WN"            "S/W"           "S 1.2"        
## [221] "223"           "SW-0.8"        "56"            "N-2"          
## [225] "172"           "L.W.0"         "N-2.5"         "S-2.2"        
## [229] "N-2.2"         "S-1.5"         "260"           "286"          
## [233] "W-1.6"         "2.5"           "237"           "W-1.5"        
## [237] "SWSW"          "SLY"           "328"           "177"          
## [241] "N-0.5"         "352"           "NE-0.2"        "NNE-0.4"      
## [245] "E-0.5"         "210"           "NW-0.2"        "111"          
## [249] "N-1.4"         "N/A"           "NW-0.7"        "E-2.5"        
## [253] "N-1.1"         "189"           "51"            "S-0.8"        
## [257] "E-1.4"         "138"           "98"            "69"           
## [261] "28"            "W-1.1"         "270"           "E-0.8"        
## [265] "N-1.5"         "238"           "NW-0.6"        "SWLY"         
## [269] "NE-1.2"        "259"           "NE-0.7"        "67"           
## [273] "44"            "NW-0.8"        "N-1.2"         "324"          
## [277] "WLY"           "65"            "29"            "NE-2.0"       
## [281] "W-1.7"         "E-1.9"         "52"            "41"           
## [285] "305"           "27"            "NE-0.4"        "Flooding"     
## [289] "E-01.0"        "E-3.0"         "99"            "S-2.0"        
## [293] "E-2.0"         "245"           "149"           "211"          
## [297] "N-1.9"         "SE-0.3"        "S-1.6"         "NNE-0.5"      
## [301] "190"           "SW-1.9"        "flood"         "SS"           
## [305] "WWN"           "213"           "N-1.3"         "SW-0.9"       
## [309] "S-1.8"         "SE-0.7"        "170"           "SE-0.8"       
## [313] "15"            "SW-1.0"        "8"             "72"           
## [317] "183"           "176"           "S 0.8"         "3.2"          
## [321] "NE-1.1"        "S-3.1"         "N-2.4"         "N."           
## [325] "31"            "160"           "S-1.4"         "S-1.7"        
## [329] "N-0.9"         "S-2.1"         "SE-1.5"        "SE-2.0"       
## [333] "322"           "SE-1.2"        "SE-1.7"        "123"          
## [337] "NW-01.0"       "SHW"           "M"             "3.1"          
## [341] "197"           "S-3.2"         "S-3"           "N-1.8"        
## [345] "35"            "90"            "S-2"           "S-2.3"        
## [349] "NE-1.6"        "S 0.7"         "2.7"           ".S"           
## [353] "225"           "S-3.8"         "S-2.7"         "S-2.5"        
## [357] "2.3"           "S-3.0"         "S 1.9"         "217"          
## [361] "End of Ebb"    "150"           "115"           "SW-1.6"       
## [365] "OO"            "2.8"           "S-1.9"         "SW-2.1"       
## [369] "0.05"          "246"           "N-1.7"         "SE-0.9"       
## [373] "End of Flood"  "SE-1.3"        "95"            "SW-0.6"       
## [377] "319"           "E-8"           "E-1.6"         "NN"           
## [381] "SW-0.3"        "304"           "92"            "220"          
## [385] "23"            "17"            "WNW-0.3"       "179"          
## [389] "0.0 CALM"      "SE-0.4"        "S0"            "355"          
## [393] "---"           "196"           "135"           "131"          
## [397] "348"           "75"            "VAR"           "59"           
## [401] "88"            "38"            "Beginning Ebb" "1st Flood"    
## [405] "n/a"           "lwslk"         "ebb"

class(hwq$current_direction)

## [1] "character"

summary(hwq$salinity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   20.40   24.20   22.72   26.60   36.30   22641

class(hwq$salinity)

## [1] "numeric"

class(hwq$sample_date)

## [1] "character"

head(hwq$sample_date)

## [1] "8/9/2006"   "9/12/2006"  "8/9/2006"   "9/12/2006"  "10/23/2006"
## [6] "7/31/2006"

Manipulating the Dataset

Dataframe hwq became hwq1 through the following actions-

The variable class for the column “sample_date” was initially character string. I will use the “mdy” function to create a new readable “date class” column in R called date. Next, I will apply a filter for only dates after January 1, 2000 and for only current directions within groups (“N”, “S”, “E”, “W”, “NE”, “NW”, “SE”, “SW”). Additionally, I will prepare my data by removing NA’s from the salinity variable using “!is.na”. Lastly, I will pass hwq1 through the “group_by” function in order to observe the distribution counts as a new frame called hwq_direction.

hwq1 <- hwq |>
  mutate(date = mdy(sample_date)) |>
  filter(date >= 01-01-2000) |>
  filter(current_direction %in% c("N", "S", "E", "W", "NE", "NW", "SE", "SW")) |>
  filter(!is.na(salinity))

hwq_direction <- hwq1 |>
  group_by(current_direction)|>
  count()
hwq_direction

## # A tibble: 8 × 2
## # Groups:   current_direction [8]
##   current_direction     n
##   <chr>             <int>
## 1 E                 10937
## 2 N                 11843
## 3 NE                 1647
## 4 NW                 1047
## 5 S                 14839
## 6 SE                 1013
## 7 SW                 1522
## 8 W                 10322

Counts are sufficiently large that outliers should not affect my ability to use ANOVA.

Creating a vizualization

ggplot(hwq1, aes(x = current_direction, y = salinity, fill = current_direction)) +
  geom_boxplot() +
  labs(title = "Salinity by Current Direction", x = "Current Direction", y = "Salinity") +
  theme_minimal() +
  theme(legend.position = "none")

Side-by-side ox plot was selected to visualize the distributions between the current direction groups. Observations include; extremely low outliers (Up to 0psu), East (E) and West (W) have higher median values and smaller quartiles, while North (N) and South (S) have lower medians and greater range in their distribution.

Statistical Analysis: One-way ANOVA

H₀: mean salinity is equivalent in all current directions H₁: mean salinity is not equivalent in all current directions

summary(hwq1$salinity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.10   20.58   24.10   22.64   26.32   36.30

anova_result <- aov(salinity ~ current_direction, data = hwq1)
summary(anova_result)

##                      Df  Sum Sq Mean Sq F value Pr(>F)    
## current_direction     7  156936   22419   780.9 <2e-16 ***
## Residuals         53162 1526284      29                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F statistic: 780.9 p-value: 2e-16 The p value is close to zero and much less than the alpha (0.05). As such, I found evicence to reject the null hypothesis. In simpler terms, the salinity is not equivalent in all current directions.

tukey_result <- TukeyHSD(anova_result)
tukey_result

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = salinity ~ current_direction, data = hwq1)
## 
## $current_direction
##              diff        lwr         upr     p adj
## N-E   -3.23194843 -3.4473171 -3.01657980 0.0000000
## NE-E  -1.12905353 -1.5582924 -0.69981469 0.0000000
## NW-E  -0.94918107 -1.4745496 -0.42381259 0.0000012
## S-E   -3.83644463 -4.0411088 -3.63178042 0.0000000
## SE-E  -1.58807057 -2.1214247 -1.05471642 0.0000000
## SW-E  -1.98653721 -2.4308315 -1.54224293 0.0000000
## W-E    0.03864819 -0.1842086  0.26150501 0.9995328
## NE-N   2.10289490  1.6758103  2.52997947 0.0000000
## NW-N   2.28276736  1.7591575  2.80637722 0.0000000
## S-N   -0.60449620 -0.8046029 -0.40438953 0.0000000
## SE-N   1.64387786  1.1122559  2.17549981 0.0000000
## SW-N   1.24541122  0.8031979  1.68762457 0.0000000
## W-N    3.27059662  3.0519178  3.48927539 0.0000000
## NW-NE  0.17987246 -0.4620229  0.82176779 0.9901830
## S-NE  -2.70739110 -3.1291790 -2.28560323 0.0000000
## SE-NE -0.45901705 -1.1074646  0.18943051 0.3854846
## SW-NE -0.85748368 -1.4349047 -0.28006271 0.0001822
## W-NE   1.16770171  0.7367925  1.59861091 0.0000000
## S-NW  -2.88726356 -3.4065622 -2.36796492 0.0000000
## SE-NW -0.63888950 -1.3546061  0.07682713 0.1207612
## SW-NW -1.03735614 -1.6894152 -0.38529706 0.0000389
## W-NW   0.98782926  0.4610952  1.51456335 0.0000004
## SE-S   2.24837406  1.7209978  2.77575029 0.0000000
## SW-S   1.84990742  1.4128074  2.28700745 0.0000000
## W-S    3.87509282  3.6669482  4.08323747 0.0000000
## SW-SE -0.39846664 -1.0569768  0.26004354 0.5966725
## W-SE   1.62671876  1.0920194  2.16141813 0.0000000
## W-SW   2.02518540  1.5792771  2.47109365 0.0000000

Tukey Observations: 23 out of 28 current direction group pairs were significant for difference of salinity. These 5 were the only pairs with an adjusted pvalue greater than 0.05. A majority of the comparisons showed a difference in mean salinity. This demonstrates evidence why I am rejecting the null hypothesis. W-E 0.03864819 -0.1842086 0.26150501 0.9995328 NW-NE 0.17987246 -0.4620229 0.82176779 0.9901830 SE-NE -0.45901705 -1.1074646 0.18943051 0.3854846 SE-NW -0.63888950 -1.3546061 0.07682713 0.1207612 SW-SE -0.39846664 -1.0569768 0.26004354 0.5966725

Part D: Conclusion and Future Directions

Mean and median values for salinity were significantly different across the 8 different current directions.
Outliers were low (as low as 0 psu), but sample sizes were large, allowing the ANOVA test to be used.
The salinity is not equivalent in all current directions – Therefore, reject the null hypothesis
A majority of the Tukey pairwise comparisons showed a difference in mean salinity.

The New York City Department of Environmental Protection claims that the Harbor is cleaner than it has been for the last 100 years, due to their efforts to improve wastewater handling and treatment. They report on 45 individual waterbodies, monitoring sources of pollution and releasing advisory warnings for the planning of recreational activities.

One conclusion that stood out to me is that the outliers were extremely low (as low as 0 psu). I wonder if this could have to do with weather conditions and increased precipitation. If I was to explore more, I would want to use a multiple regression model to incorporate weather conditions as a factor in the salinity of the water in each direction.

Part E: References

Anthropic. (2026, June 27). Claude (Sonnet 4.6). Response to prompt: “Remove word entries and blank rows from numerical variable in tidyverse”.

New York City Department of Environmental Protection. (June 15, 2026). Harbor Water Quality [dataset]. NYC Open Data. https://data.cityofnewyork.us/Environment/Harbor-Water-Quality/5uug-f49n/about_data

New York City Department of Environmental Protection. (n.d.) Harbor Water Quality - DEP. NYC.gov. https://www.nyc.gov/site/dep/water/harbor-water-quality.page