Water Quality Analysis

ENVX1002 | Project 1: Describing data

Published

March 19, 2024

Introduction

I chose to analyse the water quality data set, which was measured at numerous monitoring stations around NSW. I found this data set the most interesting because water is a large part of everyday life, literally keeping us alive, and it is important to recognise the variables that determine whether water quality is sufficient. The three variables I analysed are all continuous numerical variables: pH, turbidity in Nephelometric Turbidity Units (NTU), and dissolved oxygen (mg/L). These are some of the most important variables to consider when measuring water quality.

Summary statistics, standard deviation, range and mode have been used to analyse the frequency, centre and spread of the data. Histograms, boxplots and skewness have been used to analyse the shape of the distribution of data.

Exploratory Data Analysis

library(readxl)
water <- read_excel("ENVX1002_Water_Quality_Data.xlsx", sheet = "Water_Quality")

pH

pH is an important measure of water quality that indicates how acidic or basic a solution is. The national aesthetic limit is a pH between 6.5 and 8.5.

Typical values of data

summary(water$pH, na.rm=T)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  6.100   7.230   7.460   7.485   7.720   9.150       5 
sd(water$pH, na.rm=T)
[1] 0.4062292
max(water$pH, na.rm=T) - min(water$pH, na.rm=T)
[1] 3.05

The standard deviation (0.41), IQR (0.49) and range (3.05) are all low, indicating a small spread and limited variability in the values.
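The IQR is not printed by `summary()`, but it follows directly from the quartiles. The lines below are a minimal sketch that uses the two quartiles copied from the output above; the commented call assumes the `water` data frame loaded earlier.

```r
# Quartiles copied from the summary() output for pH
q1 <- 7.23
q3 <- 7.72

# The IQR is the width of the middle 50% of the data
iqr_ph <- q3 - q1
print(iqr_ph)

# With the data loaded, base R gives the same quantity directly:
# IQR(water$pH, na.rm = TRUE)
```

The result is about 0.49 pH units, which is small compared with the 2-unit width of the 6.5 to 8.5 limit.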

frequency_table <- table(water$pH)
mode <- names(frequency_table)[frequency_table == max(frequency_table)]
print(mode)
[1] "7.4"

This data set is approximately symmetrical, as the mean (7.49), median (7.46) and mode (7.4) are very close together. This also suggests that there are no outliers strongly distorting the mean.
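Because the same mode calculation is repeated for each variable, it could be wrapped in a small function. This is only a sketch: `get_mode` is a hypothetical helper name, and the example values are made up for illustration rather than taken from the water data.

```r
# Hypothetical helper: return the most frequent value(s) of a numeric vector
get_mode <- function(x) {
  x <- x[!is.na(x)]                 # drop missing values
  counts <- table(x)                # frequency of each distinct value
  modes <- names(counts)[counts == max(counts)]
  as.numeric(modes)                 # table() names are character strings
}

# Illustrative values only, not taken from the water data set
example_ph <- c(7.4, 7.2, 7.4, NA, 7.6, 7.2, 7.4)
get_mode(example_ph)

# With the data loaded, the call would be:
# get_mode(water$pH)
```

If two or more values are tied for the highest count, the function returns all of them, which is worth checking before quoting a single mode.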

Distribution of data

library(ggplot2)

ggplot(water, aes(pH)) +
  geom_histogram(bins = 5, fill = "blue", na.rm=TRUE) +
  xlab("pH")

boxplot(water$pH, horizontal = TRUE)

#install.packages("moments")
library(moments)
skewness(water$pH, na.rm=TRUE)
[1] 0.3542711

Most of the data sit close to the mean, and the distribution is approximately normal and roughly symmetrical. The skewness of 0.35 indicates a slight right skew.
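To show what the skewness value measures, the sketch below computes a moment-based skewness in base R: the third central moment divided by the second central moment raised to the power 3/2. To my understanding this is the same definition used by `moments::skewness()`, but that is an assumption worth checking against the package documentation. `skew_manual` is a hypothetical name and the example values are illustrative only.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2
skew_manual <- function(x) {
  x <- x[!is.na(x)]                 # drop missing values
  m2 <- mean((x - mean(x))^2)       # second central moment
  m3 <- mean((x - mean(x))^3)       # third central moment
  m3 / m2^(3/2)
}

# Illustrative values only: a long right tail gives a positive result
right_tailed <- c(1, 2, 2, 3, 3, 3, 4, 20)
skew_manual(right_tailed)

# A perfectly symmetric sample gives zero
skew_manual(c(1, 2, 3, 4, 5))
```

A positive value means the long tail is on the right, a negative value means it is on the left, and values near zero indicate a roughly symmetrical distribution.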

Turbidity

Turbidity is the measurement of the clarity of water. The turbidity of drinking water should be below 5 NTU, though should ideally be below 1 NTU.

Typical values of data

summary(water$Turbidity.NTU)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.100   1.600   3.200   5.464   5.830 106.950       5 
sd(water$Turbidity.NTU, na.rm=T)
[1] 8.656119
max(water$Turbidity.NTU, na.rm=T) - min(water$Turbidity.NTU, na.rm=T)
[1] 106.85
frequency_table <- table(water$Turbidity.NTU)
mode <- names(frequency_table)[frequency_table == max(frequency_table)]
print(mode)
[1] "1"

The mean (5.46), median (3.20) and mode (1) are all within 5 NTU of each other, but there is high variance in the data set due to the presence of outliers: the range is 106.85 NTU while the most common value is 1 NTU. The centre of the data lies between about 1 and 5.5 NTU, but the spread is large.

Distribution of data

library(ggplot2)

ggplot(water, aes(Turbidity.NTU)) +
  geom_histogram(bins = 20, fill = "blue", na.rm=TRUE) +
  xlab("Turbidity (NTU)")

boxplot(water$Turbidity.NTU, horizontal = TRUE)

library(moments)
skewness(water$Turbidity.NTU, na.rm=TRUE)
[1] 5.402291

As displayed in the histogram, the majority of the data lies under 10 NTU. The boxplot shows the numerous high outliers that skew the data set. The skewness of 5.40 indicates a strong right skew.
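The points drawn as outliers on a boxplot are those lying more than 1.5 times the IQR beyond the quartiles. The sketch below works out those fences from the turbidity quartiles reported in the summary output. It is an approximation, since `boxplot()` uses hinges that can differ slightly from the quartiles printed by `summary()`.

```r
# Quartiles copied from the summary() output for turbidity
q1 <- 1.60
q3 <- 5.83
iqr_turb <- q3 - q1

# Boxplot whiskers extend at most 1.5 * IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr_turb
upper_fence <- q3 + 1.5 * iqr_turb
c(lower = lower_fence, upper = upper_fence)

# With the data loaded, the flagged points can be listed directly:
# boxplot.stats(water$Turbidity.NTU)$out
```

The upper fence comes out at roughly 12.2 NTU, so any reading above that is drawn as an outlier. The lower fence is negative, which turbidity cannot be, so all of the outliers are on the high side.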

Dissolved Oxygen

Dissolved oxygen is beneficial for water quality, but high levels can lead to corrosion of pipes. The recommended level of dissolved oxygen is between 6.5 and 8 mg/L.

Typical values of data

summary(water$Dissolved.Oxygen.mg_L)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  5.300   8.320   9.420   9.408  10.523  14.000 
sd(water$Dissolved.Oxygen.mg_L, na.rm=T)
[1] 1.543495
max(water$Dissolved.Oxygen.mg_L, na.rm=T) - min(water$Dissolved.Oxygen.mg_L, na.rm=T)
[1] 8.7
frequency_table <- table(water$Dissolved.Oxygen.mg_L)
mode <- names(frequency_table)[frequency_table == max(frequency_table)]
print(mode)
[1] "10.5"

The majority of the values are centred around the mean. As the median (9.42) and mean (9.41) are very similar, this data set is approximately symmetrical. The standard deviation (1.54) and range (8.7) are small, representing a small spread and low variability in the data.

Distribution of data

library(ggplot2)

ggplot(water, aes(Dissolved.Oxygen.mg_L)) +
  geom_histogram(bins = 10, fill = "blue") +
  xlab("Dissolved oxygen (mg/L)")

boxplot(water$Dissolved.Oxygen.mg_L, horizontal = TRUE)

library(moments)
skewness(water$Dissolved.Oxygen.mg_L)
[1] -0.1659755

As shown in the histogram and boxplot, this data set is approximately normally distributed, with most values centred around the mean in a bell shape. The skewness is negative (-0.17), indicating a slight skew to the left.

Discussion and conclusions

The most interesting finding was the number of outliers in the turbidity data: there were many very high turbidity readings, which skewed the results. It would be interesting to research further why this occurred and at which monitoring stations these readings were recorded, to see if there is any pattern.
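One way this follow-up could start is by counting the high readings at each station. The sketch below is purely illustrative: the `Station` column name, the threshold of 12 NTU and every value in the example data frame are assumptions, not taken from the water data set.

```r
# Illustrative data frame; the Station column and all values are hypothetical
example <- data.frame(
  Station = c("A", "A", "B", "B", "C", "C"),
  Turbidity.NTU = c(1.2, 3.4, 45.0, 2.1, 0.8, 60.5)
)

# Threshold assumed for illustration only
threshold <- 12

# Flag high readings, then count them per station
high <- example$Turbidity.NTU > threshold
high_by_station <- tapply(high, example$Station, sum)
print(high_by_station)
```

If most of the high readings came from only a few stations, that would suggest a local cause rather than a problem across the whole network.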

The variables were mostly within their recommended limits. pH values were consistently around 7.4 and turbidity was mostly under 5 NTU, though dissolved oxygen was higher than recommended, with a mean of 9.4 mg/L.
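The statement that values were "mostly" within limits could be backed by the proportion of readings inside each range. The sketch below shows one way to calculate this. `prop_within` is a hypothetical helper, the example values are made up, and the limits in the commented calls are the ones quoted earlier in this report, treated as inclusive.

```r
# Hypothetical helper: share of non-missing values inside [lower, upper]
prop_within <- function(x, lower = -Inf, upper = Inf) {
  x <- x[!is.na(x)]                 # drop missing values
  mean(x >= lower & x <= upper)     # proportion of TRUE values
}

# Illustrative values only, not taken from the water data set
example_ph <- c(6.1, 7.2, 7.4, 7.5, NA, 9.15)
prop_within(example_ph, lower = 6.5, upper = 8.5)

# With the data loaded (column names as used earlier in this report):
# prop_within(water$pH, 6.5, 8.5)
# prop_within(water$Turbidity.NTU, upper = 5)
# prop_within(water$Dissolved.Oxygen.mg_L, 6.5, 8)
```

Multiplying the result by 100 gives the percentage of readings that fall inside the limit.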

Overall, this data set indicates water of sufficient quality.

References & Journal

References

Cirino, E 2019, What pH Should My Drinking Water Be? Healthline, viewed 18 March 2024, https://www.healthline.com/health/ph-of-drinking-water

Department of Environment, Science and Innovation 2022, Ecosystem health indicators, viewed 19 March 2024, https://environment.des.qld.gov.au/management/water/health-indicators#:~:text=They%20include%20dissolved%20oxygen%2C%20pH,is%20impacting%20on%20the%20system.

‘Dissolved Oxygen in Drinking Water’, Atlas Scientific Environmental Robotics, web log post, 21 March 2022, viewed 18 March 2024, https://atlas-scientific.com/blog/dissolved-oxygen-in-drinking-water/#:~:text=High%20dissolved%20oxygen%20levels%20are,human%20consumption%20to%20be%20efficient.

‘How Turbidity is Measured’, Atlas Scientific Environmental Robotics, web log post, 13 March 2022, https://atlas-scientific.com/blog/how-turbidity-is-measured/

OpenAI 2023, ChatGPT 3.5, viewed 19 March 2024, https://chat.openai.com/chat

Journal

Google:

  • I used Google to search for information about each of my variables, why they are used for water quality testing and what values they should be within.
  • The websites I used were government, science and water company websites that provided accurate information on units of measurement.

ChatGPT:

  • I initially had issues finding the mode. I asked ChatGPT “What does it mean by ‘numeric’ when finding mode in RStudio” and then “How do you find the mode in R studio?”. This enabled me to find the mode in a different way from my earlier attempt.

  • I used ChatGPT ethically, asking it questions as I would a tutor. I trusted its answers because I tested the methods first, and they worked.

Tutorial and Lab work:

  • I used the lab and tutorial notes to help me write code and to understand what different functions do.