Introduction

Water is one of the most essential resources for life. From drinking to sanitation, it fulfills needs across many domains. However, poor water quality poses serious risks. This study therefore analyzes water quality with respect to water clarity, using a publicly available dataset provided by the U.S. Department of the Interior and accessible at https://catalog.data.gov/dataset/water-quality-data. The dataset contains 2,371 observations of 17 variables. Key variables include the sampling site (Site_Id), salinity (ppt), dissolved oxygen (mg/L), pH, Secchi depth (m), water depth (m), and water temperature.

The study is guided by the following question: "Does a lower Secchi depth imply poorer water quality?"

Secchi depth is a measure of water clarity. It refers to the depth at which a Secchi disk is no longer visible when lowered into the water.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(corrplot)
## corrplot 0.95 loaded
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   4.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.1.0     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data<- read_csv("water quality.csv")
## Rows: 2371 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): Site_Id, Unit_Id, Read_Date, Time (24:00), Field_Tech, DateVerifie...
## dbl (10): Salinity (ppt), Dissolved Oxygen (mg/L), pH (standard units), Secc...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 17
##   Site_Id Unit_Id Read_Date `Salinity (ppt)` `Dissolved Oxygen (mg/L)`
##   <chr>   <chr>   <chr>                <dbl>                     <dbl>
## 1 Bay     <NA>    1/3/1994               1.3                      11.7
## 2 Bay     <NA>    1/31/1994              1.5                      12  
## 3 Bay     <NA>    2/7/1994               1                        10.5
## 4 Bay     <NA>    2/23/1994              1                        10.1
## 5 Bay     <NA>    2/28/1994              1                        12.6
## 6 Bay     <NA>    3/7/1994               1                         9.9
## # ℹ 12 more variables: `pH (standard units)` <dbl>, `Secchi Depth (m)` <dbl>,
## #   `Water Depth (m)` <dbl>, `Water Temp (?C)` <dbl>, `Air Temp-Celsius` <dbl>,
## #   `Air Temp (?F)` <dbl>, `Time (24:00)` <chr>, Field_Tech <chr>,
## #   DateVerified <chr>, WhoVerified <chr>, `AirTemp (C)` <dbl>, Year <dbl>
str(data)
## spc_tbl_ [2,371 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Site_Id                : chr [1:2371] "Bay" "Bay" "Bay" "Bay" ...
##  $ Unit_Id                : chr [1:2371] NA NA NA NA ...
##  $ Read_Date              : chr [1:2371] "1/3/1994" "1/31/1994" "2/7/1994" "2/23/1994" ...
##  $ Salinity (ppt)         : num [1:2371] 1.3 1.5 1 1 1 1 0.5 1 1 1 ...
##  $ Dissolved Oxygen (mg/L): num [1:2371] 11.7 12 10.5 10.1 12.6 9.9 10.4 9.2 9.2 8.6 ...
##  $ pH (standard units)    : num [1:2371] 7.3 7.4 7.2 7.4 7.2 7.1 7.2 7.1 7.2 7.3 ...
##  $ Secchi Depth (m)       : num [1:2371] 0.4 0.2 0.25 0.35 0.2 0.2 0.25 0.15 0.25 0.2 ...
##  $ Water Depth (m)        : num [1:2371] 0.4 0.35 0.6 0.5 0.4 0.9 0.75 0.95 0.75 0.75 ...
##  $ Water Temp (?C)        : num [1:2371] 5.9 3 5.9 10 1.6 9.7 9.8 16.1 15 15.7 ...
##  $ Air Temp-Celsius       : num [1:2371] 8 2.6 7.6 2.7 0 15.2 10.1 22.1 13.5 13 ...
##  $ Air Temp (?F)          : num [1:2371] 46.4 36.7 45.7 36.9 32 ...
##  $ Time (24:00)           : chr [1:2371] "11:00" "11:30" "9:45" "N/A" ...
##  $ Field_Tech             : chr [1:2371] NA NA NA NA ...
##  $ DateVerified           : chr [1:2371] NA NA NA NA ...
##  $ WhoVerified            : chr [1:2371] NA NA NA NA ...
##  $ AirTemp (C)            : num [1:2371] 8 2.6 7.6 2.7 0 15.2 10.1 22.1 13.5 13 ...
##  $ Year                   : num [1:2371] 1994 1994 1994 1994 1994 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Site_Id = col_character(),
##   ..   Unit_Id = col_character(),
##   ..   Read_Date = col_character(),
##   ..   `Salinity (ppt)` = col_double(),
##   ..   `Dissolved Oxygen (mg/L)` = col_double(),
##   ..   `pH (standard units)` = col_double(),
##   ..   `Secchi Depth (m)` = col_double(),
##   ..   `Water Depth (m)` = col_double(),
##   ..   `Water Temp (?C)` = col_double(),
##   ..   `Air Temp-Celsius` = col_double(),
##   ..   `Air Temp (?F)` = col_double(),
##   ..   `Time (24:00)` = col_character(),
##   ..   Field_Tech = col_character(),
##   ..   DateVerified = col_character(),
##   ..   WhoVerified = col_character(),
##   ..   `AirTemp (C)` = col_double(),
##   ..   Year = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Data transformation, cleaning, and exploration

Since many of the variables are not needed, a subset of the data is created to simplify the analysis. Variables such as DateVerified and WhoVerified, among others, are dropped; only the site, the sampling date and year, and the water measurements are kept.

#Transforming the data by taking only the relevant variables

df<- data |>
  select("Site_Id", "Salinity (ppt)", "Dissolved Oxygen (mg/L)", "pH (standard units)", "Water Depth (m)", "Water Temp (?C)", "Year", "Read_Date", "Secchi Depth (m)")
df
## # A tibble: 2,371 × 9
##    Site_Id `Salinity (ppt)` `Dissolved Oxygen (mg/L)` `pH (standard units)`
##    <chr>              <dbl>                     <dbl>                 <dbl>
##  1 Bay                  1.3                      11.7                   7.3
##  2 Bay                  1.5                      12                     7.4
##  3 Bay                  1                        10.5                   7.2
##  4 Bay                  1                        10.1                   7.4
##  5 Bay                  1                        12.6                   7.2
##  6 Bay                  1                         9.9                   7.1
##  7 Bay                  0.5                      10.4                   7.2
##  8 Bay                  1                         9.2                   7.1
##  9 Bay                  1                         9.2                   7.2
## 10 Bay                  1                         8.6                   7.3
## # ℹ 2,361 more rows
## # ℹ 5 more variables: `Water Depth (m)` <dbl>, `Water Temp (?C)` <dbl>,
## #   Year <dbl>, Read_Date <chr>, `Secchi Depth (m)` <dbl>

To make the variable names easier to work with, spaces in the column names are replaced with underscores. The site label "d" is also corrected to "D", and rows with a missing site name are dropped, since they cannot be assigned to a site.

#removing spaces in variable names for better readability
names(df)<- gsub(" ", "_", names(df))
#checking the different sites
unique(df$Site_Id)
## [1] "Bay" "A"   "B"   "C"   "D"   "d"   NA
#adjusting incorrect site names (lowercase "d" is a typo for site "D")
df$Site_Id<- gsub("^d$", "D", df$Site_Id)

#removing rows with missing site name
df<-df |>
  filter(!is.na(Site_Id))

#checking for missing values across other variables
colSums(is.na(df))
##                 Site_Id          Salinity_(ppt) Dissolved_Oxygen_(mg/L) 
##                       0                     129                     850 
##     pH_(standard_units)         Water_Depth_(m)         Water_Temp_(?C) 
##                      94                      70                     120 
##                    Year               Read_Date        Secchi_Depth_(m) 
##                       0                       4                      72
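To put these counts in perspective, it can help to express them as a share of the rows. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the study data); the same call could be applied to `df`.

```r
# Hypothetical toy data standing in for df; values are illustrative only
toy <- data.frame(
  site   = c("A", "A", "B", "B"),
  oxygen = c(8.1, NA, NA, 7.4),
  ph     = c(7.2, 7.3, NA, 7.1)
)

# is.na() returns a logical matrix; colMeans() then gives the proportion
# of TRUE values, i.e. the share of missing entries in each column
missing_share <- round(100 * colMeans(is.na(toy)), 1)
missing_share
```
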

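Although the observations without a site name were dropped, many missing values remain in other variables; dissolved oxygen alone is missing in 850 of the 2,370 remaining rows. Dropping these rows would discard a large part of the data, so the missing values are instead replaced with the median of each variable, which is less sensitive to outliers than the mean. Note that imputing this many values with a single constant reduces the variability of the affected variables and may weaken their observed correlations with other variables.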

#median of each measurement, ignoring missing values
md_DO <- median(df$`Dissolved_Oxygen_(mg/L)`, na.rm = TRUE)
md_PH <- median(df$`pH_(standard_units)`, na.rm = TRUE)
md_WT <- median(df$`Water_Temp_(?C)`, na.rm = TRUE)
md_sal <- median(df$`Salinity_(ppt)`, na.rm = TRUE)
md_WD <- median(df$`Water_Depth_(m)`, na.rm = TRUE)
md_SD <- median(df$`Secchi_Depth_(m)`, na.rm = TRUE)

#replacing missing values with the corresponding median
df<- df |>
  mutate(
    `Dissolved_Oxygen_(mg/L)` = if_else(is.na(`Dissolved_Oxygen_(mg/L)`), md_DO, `Dissolved_Oxygen_(mg/L)`),
    `pH_(standard_units)` = if_else(is.na(`pH_(standard_units)`), md_PH, `pH_(standard_units)`),
    `Salinity_(ppt)` = if_else(is.na(`Salinity_(ppt)`), md_sal, `Salinity_(ppt)`),
    `Water_Depth_(m)` = if_else(is.na(`Water_Depth_(m)`), md_WD, `Water_Depth_(m)`),
    `Water_Temp_(?C)` = if_else(is.na(`Water_Temp_(?C)`), md_WT, `Water_Temp_(?C)`),
    `Secchi_Depth_(m)` = if_else(is.na(`Secchi_Depth_(m)`), md_SD, `Secchi_Depth_(m)`)
  )
colSums(is.na(df))
##                 Site_Id          Salinity_(ppt) Dissolved_Oxygen_(mg/L) 
##                       0                       0                       0 
##     pH_(standard_units)         Water_Depth_(m)         Water_Temp_(?C) 
##                       0                       0                       0 
##                    Year               Read_Date        Secchi_Depth_(m) 
##                       0                       4                       0
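The same median imputation can be written more compactly, and it may be useful to record which values were filled in. The following is only a sketch on made-up data: `toy`, `toy_imputed`, and the `oxygen_imputed` flag are hypothetical names, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data; values are illustrative only
toy <- tibble(
  site   = c("A", "A", "B", "B", "B"),
  oxygen = c(8, NA, 6, 10, NA),
  ph     = c(7.0, 7.4, NA, 7.2, 7.6)
)

toy_imputed <- toy |>
  mutate(
    # Record which oxygen values were originally missing before filling them
    oxygen_imputed = is.na(oxygen),
    # Replace NA in every numeric column with that column's median
    across(where(is.numeric),
           \(x) if_else(is.na(x), median(x, na.rm = TRUE), x))
  )

toy_imputed
```

Keeping a flag column makes it possible to rerun later steps with and without the imputed rows and to check whether the conclusions change.
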

A boxplot is used to compare the distribution of Secchi depth across the sites.

ggplot(df, aes(x=Site_Id, y=`Secchi_Depth_(m)`))+
  geom_boxplot(fill="#F8F4EC", color="#E67E22", width = 0.3)+
  labs(x="Sites", y="secchi depth", title="Distribution of secchi depth across sites")+
  theme_classic()

As the graph above shows, the distributions differ little between sites: all sites have a median value close to two, which suggests that water clarity is similar everywhere. However, many outliers are detected.
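The visual impression from the boxplot can be backed by numbers. The sketch below computes the median, the interquartile range, and the number of outliers per site on made-up data (`toy` and `site_summary` are hypothetical names). It relies on `boxplot.stats()`, which follows the 1.5 × IQR rule of base R boxplots; ggplot2 computes the hinges slightly differently, so counts may differ marginally from the plot.

```r
library(dplyr)

# Hypothetical toy data; values are illustrative only
toy <- tibble(
  site   = rep(c("A", "B"), each = 6),
  secchi = c(0.2, 0.3, 0.3, 0.4, 0.4, 2.0,
             0.5, 0.5, 0.6, 0.6, 0.7, 0.7)
)

site_summary <- toy |>
  group_by(site) |>
  summarise(
    median_secchi = median(secchi),
    iqr_secchi    = IQR(secchi),
    # Points beyond 1.5 * IQR from the hinges are reported in $out
    n_outliers    = length(boxplot.stats(secchi)$out)
  )

site_summary
```
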

Although this study approaches water quality through Secchi depth, several other factors shape it.

First, pH, which measures acidity or alkalinity, is a major indicator of water quality. Second, salinity reflects the concentration of dissolved salts; high salinity can stress freshwater organisms and reduce oxygen availability. Finally, dissolved oxygen is essential for aquatic respiration, and low levels can signal pollution. Together, these parameters help assess the balance of water bodies.

ggplot(df, aes(x=`pH_(standard_units)`,y=`Secchi_Depth_(m)`, color=Site_Id))+
  geom_point()+
  labs(x="pH", y="secchi depth", title="pH vs secchi depth", color="sites")+
  theme_classic()

In the scatterplot above, higher pH values tend to coincide with lower Secchi depth.

ggplot(df, aes(x=`Salinity_(ppt)`,y=`Secchi_Depth_(m)`, color=Site_Id))+
  geom_point()+
  labs(x="Salinity", y="secchi depth", title="Salinity vs secchi depth", color="sites")+
  theme_classic()

The scatterplot above suggests that lower salinity tends to coincide with lower Secchi depth.

ggplot(df, aes(x=`Dissolved_Oxygen_(mg/L)`,y=`Secchi_Depth_(m)`, color=Site_Id))+
  geom_point()+
  labs(x="Dissolved oxygen", y="secchi depth", title="Dissolved oxygen vs secchi depth", color="sites")+
  theme_classic()

No clear relationship is visible between dissolved oxygen and Secchi depth.

#named cor_matrix to avoid confusion with base::matrix()
cor_matrix<- cor(df |>
               select( `pH_(standard_units)`, `Secchi_Depth_(m)`, `Salinity_(ppt)`, `Dissolved_Oxygen_(mg/L)`), use = "complete.obs"
             )

corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 90, addCoef.col = "black",
         title = "Correlation matrix: secchi depth and purity indicators",
         mar = c(0, 0, 2, 0)) #top margin so the title is not clipped
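A correlation matrix reports coefficients without any measure of uncertainty. As a possible complement, the sketch below tests a single pair with Spearman's rank correlation, which only assumes a monotonic relationship and is less sensitive to outliers than Pearson's coefficient. The data are simulated and the names `ph`, `secchi`, and `res` are hypothetical; this is an illustration, not a result from the study.

```r
# Hypothetical simulated data; values are illustrative only
set.seed(1)
ph     <- runif(50, min = 6.5, max = 8.5)
secchi <- 1 - 0.3 * (ph - 6.5) + rnorm(50, sd = 0.1)

# Rank-based correlation test between the two variables
res <- cor.test(ph, secchi, method = "spearman")

res$estimate  # Spearman's rho
res$p.value   # p-value for the null hypothesis rho = 0
```
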

Hypothesis testing

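A one-way ANOVA is used to test whether the mean pH differs across the five sites (A, B, Bay, C, and D), where \(\mu_i\) denotes the mean pH at site \(i\).

\(H_0\): \(\mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5\)

\(H_a\): not all \(\mu_i\) are equal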

data_test<- data.frame(site= df$Site_Id,  PH=df$`pH_(standard_units)`)
anov <- aov(PH~site, data=data_test)
anov
## Call:
##    aov(formula = PH ~ site, data = data_test)
## 
## Terms:
##                      site Residuals
## Sum of Squares   149.7514 1267.1904
## Deg. of Freedom         4      2365
## 
## Residual standard error: 0.7319904
## Estimated effects may be unbalanced
summary(anov)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## site           4  149.8   37.44   69.87 <2e-16 ***
## Residuals   2365 1267.2    0.54                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value (p < 2e-16) is close to zero, so we reject \(H_0\) and conclude that the average pH differs significantly across at least some of the sites.
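The F statistic in the table above is the ratio of the between-site mean square to the residual mean square. As a quick check, it can be reproduced from the sums of squares and degrees of freedom printed by `aov()`; the variable names below are introduced here for illustration only.

```r
# Values copied from the aov() output above
ss_site  <- 149.7514
ss_resid <- 1267.1904
df_site  <- 4
df_resid <- 2365

ms_site  <- ss_site / df_site     # mean square between sites
ms_resid <- ss_resid / df_resid   # mean square within sites (residual)

f_value <- ms_site / ms_resid
p_value <- pf(f_value, df_site, df_resid, lower.tail = FALSE)

round(f_value, 2)   # F statistic
sqrt(ms_resid)      # residual standard error
p_value             # upper-tail probability of the F distribution
```
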

TukeyHSD(anov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = PH ~ site, data = data_test)
## 
## $site
##              diff          lwr         upr     p adj
## B-A    0.35911008  0.223693261  0.49452690 0.0000000
## Bay-A  0.49229562  0.373008596  0.61158264 0.0000000
## C-A    0.52324745  0.367281426  0.67921348 0.0000000
## D-A   -0.09416648 -0.229276949  0.04094398 0.3161131
## Bay-B  0.13318553  0.014163551  0.25220752 0.0193081
## C-B    0.16413737  0.008373958  0.31990078 0.0330267
## D-B   -0.45327657 -0.588153091 -0.31840004 0.0000000
## C-Bay  0.03095184 -0.111012659  0.17291633 0.9758354
## D-Bay -0.58646210 -0.705135409 -0.46778879 0.0000000
## D-C   -0.61741394 -0.772911081 -0.46191679 0.0000000

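These results suggest that Sites B, Bay, and C have significantly higher mean pH than both Site A and Site D (adjusted p-values close to zero, with confidence intervals that exclude zero). Sites A and D do not differ significantly from each other (p adj = 0.316), nor do Sites C and Bay (p adj = 0.976). Bay and C also show slightly higher mean pH than B (p adj = 0.019 and 0.033).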
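With ten pairwise comparisons, reading the table by eye is error-prone. One way to list only the significant pairs is to filter the matrix returned by `TukeyHSD()` on its `p adj` column. The sketch below does this on simulated data; `toy`, `fit`, `tukey`, and `significant` are hypothetical names, and the 0.05 threshold is an assumption matching the 95% family-wise level used above.

```r
# Hypothetical simulated data; values are illustrative only
set.seed(42)
toy <- data.frame(
  site = factor(rep(c("A", "B", "C"), each = 30)),
  ph   = c(rnorm(30, mean = 7.0, sd = 0.2),
           rnorm(30, mean = 7.0, sd = 0.2),
           rnorm(30, mean = 8.0, sd = 0.2))
)

fit   <- aov(ph ~ site, data = toy)
tukey <- TukeyHSD(fit)$site   # matrix with columns diff, lwr, upr, p adj

# Keep only the pairs whose adjusted p-value is below 0.05
significant <- tukey[tukey[, "p adj"] < 0.05, , drop = FALSE]
significant
```
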

Conclusion

In conclusion, this analysis used a public dataset to examine the link between water quality and Secchi depth, together with other factors that determine whether water is of good quality. The analysis indicates that clear water is not necessarily good water: several factors need to be considered when judging water quality. For example, water can be clear but have a low pH, which indicates high acidity and can make the water harmful to use. No single factor can determine water quality on its own. As found in the analysis, the correlation between Secchi depth (the measure of water clarity) and each individual variable is low. Water can be clear and have a moderate pH, yet be highly saline or low in dissolved oxygen.

A hypothesis test (one-way ANOVA) was conducted to assess whether average pH levels differ across sites. The resulting p-value was close to zero, providing enough evidence to reject the null hypothesis that mean pH is equal across sites. As a result, site-specific factors need to be monitored.

Therefore, more samples should be collected to capture gradual environmental changes over a broader range of conditions. In addition, more biological indicators should be included in future data collection.