Loading data

## Rows: 5 Columns: 9
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Sample_ID
## dbl (8): pH, Turbidity (NTU), Dissolved_Oxygen (mg/L), BOD (mg/L), Nitrate(m...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
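| Sample_ID | pH | Turbidity (NTU) | Dissolved_Oxygen (mg/L) | BOD (mg/L) | Nitrate(mg/L) | Phosphate (mg/L) | Total_Coliform (CFU/100mL) | Temperature ( C ) |
|---|---|---|---|---|---|---|---|---|
| S1 | 7.2 | 2.1 | 8.5 | 2.3 | 4.5 | 0.15 | 80 | 22 |
| S2 | 6.8 | 3.4 | 7.2 | 3.8 | 6.1 | 0.25 | 120 | 24 |
| S3 | 7.5 | 1.0 | 9.0 | 1.5 | 2.8 | 0.10 | 30 | 20 |
| S4 | 6.2 | 5.8 | 5.6 | 5.2 | 9.4 | 0.40 | 400 | 26 |
| S5 | 7.0 | 2.5 | 8.0 | 2.0 | 3.2 | 0.18 | 60 | 23 |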

Introduction

The dataset contains water quality measurements from five samples. Each sample was tested for key physical, chemical, and biological parameters that influence water safety and ecological health: pH, turbidity, dissolved oxygen (DO), biochemical oxygen demand (BOD), nitrate, phosphate, total coliform, and temperature.

This project aims to analyze the quality of water by evaluating these indicators against standard water quality guidelines.

By studying this dataset, we can identify whether the water is safe for drinking, agricultural use, or aquatic life, and highlight potential environmental or health risks.

Studying correlations

# with() evaluates the expression inside Waterdata without attaching it to the search path
with(Waterdata, cor(`Turbidity (NTU)`, pH))
## [1] -0.9963169

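pH and turbidity are strongly negatively correlated (r ≈ -0.996): samples with higher turbidity have lower pH. With only five samples, this estimate should be treated as descriptive rather than conclusive.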
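To gauge how much weight a correlation from five samples can carry, a confidence interval and p-value can accompany the point estimate. The following is a minimal sketch using base R's `cor.test()`; the values are copied from the sample table, and the short variable names (`pH`, `turbidity`) are illustrative rather than taken from the original analysis.

```r
# Values copied from the sample table; variable names are illustrative.
pH        <- c(7.2, 6.8, 7.5, 6.2, 7.0)
turbidity <- c(2.1, 3.4, 1.0, 5.8, 2.5)   # NTU

# Pearson correlation test: point estimate, p-value and 95% confidence interval
ct <- cor.test(turbidity, pH, method = "pearson")

r_estimate <- unname(ct$estimate)  # same value as cor(turbidity, pH)
ci         <- ct$conf.int          # interval based on Fisher's z transform
p_value    <- ct$p.value

print(round(r_estimate, 4))
print(round(as.numeric(ci), 3))
print(signif(p_value, 3))
```

With so few observations the interval is the more informative quantity: it shows the range of correlations that remain compatible with the data.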

Heatmap of overall correlations

numeric_cols <- sapply(Waterdata,is.numeric)
data_numeric <- Waterdata[,numeric_cols]
correlation_matrix <- cor(data_numeric,use="pairwise.complete.obs")
corrplot(correlation_matrix,method="color",type="upper",tl.col="black",tl.srt=60)

The matrix shows strong pairwise correlations among the indicators studied. Because it is computed from only five samples, it is best read as a descriptive summary.
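A heatmap makes the overall pattern visible, but a ranked list of pairs is easier to quote. The sketch below, in base R, lists each pair of indicators once and sorts the pairs by absolute correlation. The data frame `wq` and its shortened column names are assumptions made for illustration; the values are copied from the sample table.

```r
# Values copied from the sample table; column names shortened for illustration.
wq <- data.frame(
  pH          = c(7.2, 6.8, 7.5, 6.2, 7.0),
  turbidity   = c(2.1, 3.4, 1.0, 5.8, 2.5),
  DO          = c(8.5, 7.2, 9.0, 5.6, 8.0),
  BOD         = c(2.3, 3.8, 1.5, 5.2, 2.0),
  nitrate     = c(4.5, 6.1, 2.8, 9.4, 3.2),
  phosphate   = c(0.15, 0.25, 0.10, 0.40, 0.18),
  coliform    = c(80, 120, 30, 400, 60),
  temperature = c(22, 24, 20, 26, 23)
)

cm <- cor(wq)

# Use the upper triangle only so that each pair appears once
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Sort by absolute correlation, strongest first
pairs <- pairs[order(-abs(pairs$r)), ]
print(head(pairs, 5), row.names = FALSE)
```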

Visualizations

Waterdata %>% ggplot(aes(x=`Temperature ( C )`,y=`Dissolved_Oxygen (mg/L)`))+
  geom_point()+
  geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'

Dissolved oxygen shows a roughly linear negative relationship with water temperature: warmer samples contain less dissolved oxygen, which in turn affects aquatic life.
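The trend line drawn by `geom_smooth(method = "lm")` is an ordinary least-squares fit, so its slope can be read off directly with `lm()`. The sketch below uses base R with values copied from the sample table; the variable names `temperature` and `DO` are illustrative.

```r
# Values copied from the sample table; variable names are illustrative.
temperature <- c(22, 24, 20, 26, 23)        # degrees C
DO          <- c(8.5, 7.2, 9.0, 5.6, 8.0)   # mg/L

# Ordinary least-squares fit of dissolved oxygen on temperature
fit <- lm(DO ~ temperature)

slope     <- unname(coef(fit)["temperature"])   # about -0.575 mg/L per degree C
intercept <- unname(coef(fit)["(Intercept)"])
r_squared <- summary(fit)$r.squared

print(round(c(slope = slope, intercept = intercept, r_squared = r_squared), 3))
```

For these five samples the fitted slope is about -0.575, meaning each additional degree is associated with roughly 0.6 mg/L less dissolved oxygen within the observed range of 20 to 26 degrees.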

The goal of this project is therefore to check whether each measured parameter (pH, turbidity, DO, BOD, nitrate, phosphate, total coliform, temperature) in each sample falls within or outside the acceptable standard range.

Data manipulation

WaterEval <- Waterdata %>%
  mutate(evaluation = case_when(
    pH >= 6.5 & pH <= 8.5 &
      `Turbidity (NTU)` <= 5 &
      `BOD (mg/L)` <= 3 &
      `Dissolved_Oxygen (mg/L)` >= 6 &
      `Total_Coliform (CFU/100mL)` == 0 &
      `Nitrate(mg/L)` <= 10 &
      `Phosphate (mg/L)` <= 0.03 ~ "Clean",  # every threshold is met
    TRUE ~ "Polluted"                         # at least one threshold fails (or a value is missing)
  ))

Classifying Water as Clean or Polluted

gt(WaterEval)%>% opt_stylize(style=4,color="red")
| Sample_ID | pH | Turbidity (NTU) | Dissolved_Oxygen (mg/L) | BOD (mg/L) | Nitrate(mg/L) | Phosphate (mg/L) | Total_Coliform (CFU/100mL) | Temperature ( C ) | evaluation |
|---|---|---|---|---|---|---|---|---|---|
| S1 | 7.2 | 2.1 | 8.5 | 2.3 | 4.5 | 0.15 | 80 | 22 | Polluted |
| S2 | 6.8 | 3.4 | 7.2 | 3.8 | 6.1 | 0.25 | 120 | 24 | Polluted |
| S3 | 7.5 | 1.0 | 9.0 | 1.5 | 2.8 | 0.10 | 30 | 20 | Polluted |
| S4 | 6.2 | 5.8 | 5.6 | 5.2 | 9.4 | 0.40 | 400 | 26 | Polluted |
| S5 | 7.0 | 2.5 | 8.0 | 2.0 | 3.2 | 0.18 | 60 | 23 | Polluted |

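According to this threshold-based evaluation, all five samples are classified as polluted. Every sample exceeds the phosphate limit (0.03 mg/L) and has a nonzero total coliform count, so none meets all of the criteria at once.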
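Because the all-or-nothing rule labels every sample the same way, it helps to see which criteria are responsible. The sketch below, in base R, counts how many samples fail each threshold. The thresholds are the ones used in the evaluation code; the data frame `wq` and its shortened column names are assumptions made for illustration.

```r
# Values copied from the sample table; column names shortened for illustration.
wq <- data.frame(
  sample    = c("S1", "S2", "S3", "S4", "S5"),
  pH        = c(7.2, 6.8, 7.5, 6.2, 7.0),
  turbidity = c(2.1, 3.4, 1.0, 5.8, 2.5),
  DO        = c(8.5, 7.2, 9.0, 5.6, 8.0),
  BOD       = c(2.3, 3.8, 1.5, 5.2, 2.0),
  nitrate   = c(4.5, 6.1, 2.8, 9.4, 3.2),
  phosphate = c(0.15, 0.25, 0.10, 0.40, 0.18),
  coliform  = c(80, 120, 30, 400, 60)
)

# One logical column per criterion: TRUE means the sample fails that threshold
fails <- with(wq, cbind(
  pH        = pH < 6.5 | pH > 8.5,
  turbidity = turbidity > 5,
  DO        = DO < 6,
  BOD       = BOD > 3,
  nitrate   = nitrate > 10,
  phosphate = phosphate > 0.03,
  coliform  = coliform != 0
))
rownames(fails) <- wq$sample

# Number of samples failing each criterion
fail_counts <- colSums(fails)
print(fail_counts)
```

For this dataset the phosphate and coliform criteria fail in all five samples, BOD fails in two, nitrate in none, and the remaining criteria fail only in S4.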

Deep Evaluation

water_data <- Waterdata %>%
  rowwise() %>% # Evaluate conditions for each row (sample)
  mutate(
    count_out_of_range = sum(
      # pH outside range
      pH < 6.5 | pH > 8.5,
      
      # Turbidity outside range
      `Turbidity (NTU)` > 5,
      
      # Dissolved oxygen below limit
      `Dissolved_Oxygen (mg/L)` < 6,
      
      # BOD above limit
      `BOD (mg/L)` > 3,
      
      # Nitrate above limit
      `Nitrate(mg/L)` > 10,
      
      # Phosphate above limit
      `Phosphate (mg/L)` > 0.03,
      
      # Total coliform not equal to 0 (should be zero)
      `Total_Coliform (CFU/100mL)` != 0,
      
      # Temperature (optional) - you can set a reference, e.g. >25°C
      `Temperature ( C )` > 25,
      
      na.rm = TRUE # ignore missing values
    )
  ) %>%
  ungroup()
water_data <- water_data %>%
  mutate(Evaluation = case_when(
    count_out_of_range == 0 ~ "Clean",
    count_out_of_range <= 2 ~ "Slightly Polluted",
    count_out_of_range <= 4 ~ "Moderately Polluted",
    TRUE ~ "Highly Polluted"
  ))
gt(water_data)%>% opt_stylize(style=4,color="blue")
| Sample_ID | pH | Turbidity (NTU) | Dissolved_Oxygen (mg/L) | BOD (mg/L) | Nitrate(mg/L) | Phosphate (mg/L) | Total_Coliform (CFU/100mL) | Temperature ( C ) | count_out_of_range | Evaluation |
|---|---|---|---|---|---|---|---|---|---|---|
| S1 | 7.2 | 2.1 | 8.5 | 2.3 | 4.5 | 0.15 | 80 | 22 | 2 | Slightly Polluted |
| S2 | 6.8 | 3.4 | 7.2 | 3.8 | 6.1 | 0.25 | 120 | 24 | 3 | Moderately Polluted |
| S3 | 7.5 | 1.0 | 9.0 | 1.5 | 2.8 | 0.10 | 30 | 20 | 2 | Slightly Polluted |
| S4 | 6.2 | 5.8 | 5.6 | 5.2 | 9.4 | 0.40 | 400 | 26 | 7 | Highly Polluted |
| S5 | 7.0 | 2.5 | 8.0 | 2.0 | 3.2 | 0.18 | 60 | 23 | 2 | Slightly Polluted |
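The same counts and categories can also be obtained without `rowwise()`. The sketch below is a base R alternative, not the method used above: it builds a logical matrix of out-of-range flags, sums it by row, and bins the counts with `cut()`. The data frame `wq`, its shortened column names, and the choice of an ordered factor are assumptions made for illustration; the thresholds and category boundaries are the ones used in the evaluation code.

```r
# Values copied from the sample table; column names shortened for illustration.
wq <- data.frame(
  sample      = c("S1", "S2", "S3", "S4", "S5"),
  pH          = c(7.2, 6.8, 7.5, 6.2, 7.0),
  turbidity   = c(2.1, 3.4, 1.0, 5.8, 2.5),
  DO          = c(8.5, 7.2, 9.0, 5.6, 8.0),
  BOD         = c(2.3, 3.8, 1.5, 5.2, 2.0),
  nitrate     = c(4.5, 6.1, 2.8, 9.4, 3.2),
  phosphate   = c(0.15, 0.25, 0.10, 0.40, 0.18),
  coliform    = c(80, 120, 30, 400, 60),
  temperature = c(22, 24, 20, 26, 23)
)

# Logical matrix: one row per sample, one column per out-of-range check
flags <- with(wq, cbind(
  pH < 6.5 | pH > 8.5, turbidity > 5, DO < 6, BOD > 3,
  nitrate > 10, phosphate > 0.03, coliform != 0, temperature > 25
))

# Vectorised row totals; na.rm = TRUE ignores missing measurements
count_out_of_range <- rowSums(flags, na.rm = TRUE)

# Bins: 0 -> Clean, 1-2 -> Slightly, 3-4 -> Moderately, 5+ -> Highly
evaluation <- cut(
  count_out_of_range,
  breaks = c(-Inf, 0, 2, 4, Inf),
  labels = c("Clean", "Slightly Polluted", "Moderately Polluted", "Highly Polluted"),
  ordered_result = TRUE
)

print(data.frame(sample = wq$sample, count_out_of_range, evaluation))
```

Storing the category as an ordered factor has a side benefit for plotting: the levels keep their severity order instead of being sorted alphabetically.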

Bar Graph of the Number of Samples per Evaluation Category

evaluation_summary <- water_data %>%
  group_by(Evaluation) %>%
  summarise(Sample_Count = n())
ggplot(evaluation_summary, aes(x = Evaluation, y = Sample_Count, fill = Evaluation)) +
  geom_col(width = 0.6, color = "black") +
  scale_fill_manual(values = c(
    "Clean" = "#74c476",
    "Slightly Polluted" = "#fdae6b",
    "Moderately Polluted" = "#fd8d3c",
    "Highly Polluted" = "#e6550d"
  )) +
  labs(
    title = "Water Quality Evaluation by Category",
    x = "Evaluation Category",
    y = "Number of Samples",
    fill = "Water Quality"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    legend.position = "none"
  )

Heatmap of Out-of-Range Indicators

water_heatmap <- water_data %>%
  mutate(
    pH_out = pH < 6.5 | pH > 8.5,
    Turbidity_out = `Turbidity (NTU)` > 5,
    DO_out = `Dissolved_Oxygen (mg/L)` < 6,
    BOD_out = `BOD (mg/L)` > 3,
    Nitrate_out = `Nitrate(mg/L)` > 10,
    Phosphate_out = `Phosphate (mg/L)` > 0.03,
    Coliform_out = `Total_Coliform (CFU/100mL)` != 0,
    Temperature_out = `Temperature ( C )` > 25
  ) %>%
  select(Sample_ID, ends_with("_out")) %>%
  pivot_longer(
    cols = -Sample_ID,
    names_to = "Parameter",
    values_to = "Out_of_Range"
  )
ggplot(water_heatmap, aes(x = Parameter, y = Sample_ID, fill = Out_of_Range)) +
  geom_tile(color = "white") +
  scale_fill_manual(
    values = c("FALSE" = "#74c476", "TRUE" = "#e6550d"),
    labels = c("Within Range", "Out of Range")
  ) +
  labs(
    title = "Water Quality Parameters Out of Range by Sample",
    x = "Parameter",
    y = "Sample",
    fill = "Condition"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold", hjust = 0.5)
  )