Project 1 Water Quality (NSW)

Author

Sophia Tweed

Introduction

In this project I used the Water Quality data, a subset of water quality variables recorded at a number of monitoring stations in NSW. It contains both biological and chemical variables that are commonly used to assess water quality. The data was collected by Water NSW, and the variables analysed here are Total Nitrogen, Total Phosphorus, Dissolved Oxygen and pH. The distribution of each variable is analysed using several visual representations as well as probability distributions. Relationships within the data are then analysed before conclusions are drawn in the discussion.

What data sets am I using, and why?

Total nitrogen –> Total nitrogen is important for water quality as it acts as a vital nutrient, but excessive levels can lead to eutrophication, disrupting aquatic ecosystems and degrading water quality.

Total Phosphorus –> Total phosphorus is important for water quality as it serves as a key indicator of nutrient pollution, influencing algae growth and overall ecosystem health.

Dissolved Oxygen –> Dissolved oxygen is crucial for water quality as it sustains aquatic life by supporting respiration processes and maintaining ecosystem health.

pH –> pH is critical for water quality as it dictates the balance of acidity and alkalinity, with an ideal range typically between 6.5 and 8.5, ensuring the health of aquatic ecosystems and suitability for human use.

Describe types of data

All of the variables in my data set are numerical and continuous.

Code

library(readxl)
df<-read_excel("data/ENVX1002_Water_Quality_Data.xlsx")

Code

totalnirtrogen <- df$Nitrogen.Total.mg_L
totalphos <- df$Phosphorus.Total.mg_L
disoxy <- df$Dissolved.Oxygen.mg_L
ph <- df$pH
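
Before plotting, it can help to confirm the column types and check for missing values. The following is a minimal sketch only: `demo` is a small made-up data frame (hypothetical values) that reuses the column names from the real file.

```r
# Hypothetical stand-in for the real spreadsheet (all values are made up)
demo <- data.frame(
  Nitrogen.Total.mg_L   = c(0.25, 0.39, 0.54, 1.10),
  Phosphorus.Total.mg_L = c(0.007, 0.012, 0.018, 0.030),
  Dissolved.Oxygen.mg_L = c(8.3, 9.4, 10.5, 7.9),
  pH                    = c(7.2, NA, 7.7, 7.5)
)

col_types <- sapply(demo, class)    # data type of each column
na_counts <- colSums(is.na(demo))   # number of missing values per column

print(col_types)
print(na_counts)
```

Running the same two calls on `df` would show which columns need `na.rm = TRUE` later on.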

Distribution of data

Total nitrogen

Code

library(ggplot2)

ggplot(df, aes(totalnirtrogen)) +
  geom_histogram(bins = 25, fill = "#158cba") +
  ggtitle("Histogram for Total Nitrogen")+
  xlab("Total Nitrogen in mg/L")+
  ylab("Count")

Figure 1: Histogram for Total Nitrogen in mg/L. This histogram shows that the data is right skewed, with the majority of the data between 0 and 1 mg/L.

Code

library(ggplot2)

ggplot(df, aes(totalnirtrogen)) +
  geom_boxplot(fill = "#158cba") +
  ggtitle("Boxplot for Total Nitrogen")+
  xlab("Total Nitrogen in mg/L")+
  ylab(NULL) # a single boxplot has no count axis

Figure 2: Boxplot for Total Nitrogen in mg/L. This boxplot shows a positive (right) skew in the data. It also gives a visual representation of the upper-tail outliers. The boxplot shows that the median of this data is around 0.4 mg/L.
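
The points drawn beyond the whiskers of a boxplot are values more than 1.5 × IQR away from the box. As a rough sketch on made-up readings (the vector `x` is hypothetical, not the project data), the fences can be computed directly:

```r
x <- c(0.12, 0.25, 0.31, 0.39, 0.44, 0.54, 0.60, 2.58)  # hypothetical readings in mg/L

q1 <- quantile(x, 0.25, names = FALSE)   # first quartile
q3 <- quantile(x, 0.75, names = FALSE)   # third quartile
iqr <- q3 - q1

upper_fence <- q3 + 1.5 * iqr            # values above this are flagged as outliers
lower_fence <- q1 - 1.5 * iqr            # values below this are flagged as outliers

outliers <- x[x > upper_fence | x < lower_fence]
print(outliers)
```

Using the quartiles reported for Total Nitrogen in the Summaries section (3rd quartile 0.54, IQR 0.285), the upper fence works out to 0.54 + 1.5 × 0.285 = 0.9675 mg/L.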

Practical example of data

For drinking water in NSW it is recommended that water should have less than 1mg/L of Nitrogen.

The output below shows that on 24 occasions a test at one of the stations returned a reading higher than or equal to 1mg/L.

On 865 occasions the reading was below or equal to 1mg/L, which is a positive result.

Code

s=sort(df$Nitrogen.Total.mg_L) # Sorts the data

Code

length(s[s>=1]) # Counts how many are more than or equal to 1

[1] 24

Code

length(s[s<=1]) # Counts how many are less than or equal to 1

[1] 865
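
Because both counts use "or equal to", any reading of exactly 1 mg/L is included in both of them. The sketch below uses made-up values (hypothetical vector `x`) to show how `sum()` on a logical condition counts readings, and how strict and non-strict comparisons differ:

```r
x <- c(0.2, 0.5, 1.0, 1.3, 0.8)   # hypothetical readings in mg/L

n_at_or_above <- sum(x >= 1)      # readings of 1 mg/L or more
n_at_or_below <- sum(x <= 1)      # includes the reading of exactly 1.0
n_below       <- sum(x < 1)       # strictly below 1 mg/L
prop_below    <- mean(x < 1)      # proportion strictly below 1 mg/L

print(c(n_at_or_above, n_at_or_below, n_below))
print(prop_below)
```

Only the strict count and the "at or above" count add up to the total number of readings.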

Code

library(ggplot2)
ggplot(data.frame(x = c(0.442-4*0.265, 0.442+4*0.265)), aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 0.442, sd=0.265)) +
  xlab("x") +
  ylab(expression(N(0.442,0.265^2)~pdf))

Figure 3: Normal distribution curve with μ = 0.442 and σ = 0.265.

Code

ggplot(data.frame(x = c(0.442-4*0.265, 0.442+4*0.265)), aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 0.442, sd=0.265)
                ,geom = "area", fill = "white") +
  stat_function(fun = dnorm, args = list(mean = 0.442, sd=0.265)
    , xlim = c(0.442-4*0.265, 1), geom = "area", fill = "#158cba") +
  xlab("x") +
  ylab(expression(N(0.442,0.265^2)~pdf))

Figure 4: Normal distribution curve showing the probability that one test has a Total Nitrogen reading less than 1mg/L.
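
The shaded area can also be expressed as a number with `pnorm()`. This is a sketch using the mean and standard deviation from the plots above; the normal model is only an approximation here, since the histogram is right skewed.

```r
mu    <- 0.442   # mean of Total Nitrogen (mg/L), as used in the plots
sigma <- 0.265   # standard deviation (mg/L), as used in the plots

p_below_1 <- pnorm(1, mean = mu, sd = sigma)  # P(X < 1) under the normal model
z         <- (1 - mu) / sigma                 # the same threshold as a z-score

print(p_below_1)
print(z)
```

The observed share of readings below 1 mg/L could be compared against this with `mean(s < 1)`.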

Total Phosphorus

Code

library(ggplot2)

ggplot(df, aes(totalphos)) +
  geom_histogram(bins = 25, fill = "#28b62c") +
  ggtitle("Histogram for Total Phosphorus")+
  xlab("Total Phosphorus mg_L")+
  ylab("Count")

Figure 5: Histogram for Total Phosphorus in mg/L. This histogram shows that the data is slightly right skewed.

Code

library(ggplot2)

ggplot(df, aes(totalphos)) +
  geom_boxplot(fill = "#28b62c") +
  ggtitle("Boxplot for Total Phosphorus")+
  xlab("Total Phosphorus mg_L")+
  ylab(NULL) # a single boxplot has no count axis

Figure 6: Boxplot for Total Phosphorus in mg/L. This boxplot shows a positive (right) skew in the data. It also gives a visual representation of the upper-tail outliers.

Practical example of data

The acceptable range for phosphorus in drinking water in NSW is often set below detectable limits or at very low concentrations, typically less than 0.01mg/L.

The output below shows that on 548 occasions a test at one of the stations returned a reading higher than or equal to 0.01mg/L. This number is considerably higher than it should be, so the phosphorus level in the water should be managed and monitored to avoid high algae and bacteria growth, which leads to taste issues and compromised water quality.

On 390 occasions the reading was below or equal to 0.01mg/L, which is positive as it avoids the risk of algae issues.

Code

p=sort(df$Phosphorus.Total.mg_L) # Sorts the data

Code

length(p[p>=0.01]) # Counts how many are more than or equal to 0.01

[1] 548

Code

length(p[p<=0.01]) # Counts how many are less than or equal to 0.01

[1] 390

Code

library(ggplot2)
ggplot(data.frame(x = c(0.015-4*0.0143, 0.015+4*0.0143)), aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 0.015, sd=0.0143)) +
  xlab("x") +
  ylab(expression(N(0.015,0.0143^2)~pdf))

Figure 7: Normal distribution curve with μ = 0.015 and σ = 0.0143.

Code

ggplot(data.frame(x = c(0.015-4*0.0143, 0.015+4*0.0143)), aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 0.015, sd=0.0143)
                ,geom = "area", fill = "white") +
  stat_function(fun = dnorm, args = list(mean = 0.015, sd=0.0143)
    , xlim = c(0.015-4*0.0143, 0.01), geom = "area", fill = "#28b62c") +
  xlab("x") +
  ylab(expression(N(0.015,0.0143^2)~pdf))

Figure 8: Normal distribution curve showing the probability that one test has a Total Phosphorus reading less than 0.01mg/L.

Dissolved Oxygen

Code

library(ggplot2)

ggplot(df, aes(disoxy)) +
  geom_histogram(bins = 25, fill = "#e83e8c") +
  ggtitle("Histogram for Dissolved Oxygen")+
  xlab("Dissolved Oxygen mg_L")+
  ylab("Count")

Figure 9: Histogram of dissolved oxygen in mg/L. The data set is approximately symmetrical.

Code

library(ggplot2)

ggplot(df, aes(disoxy)) +
  geom_boxplot(fill = "#e83e8c") +
  ggtitle("Boxplot for Dissolved Oxygen")+
  xlab("Dissolved Oxygen mg/L")+
  ylab(NULL) # a single boxplot has no count axis

Figure 10: Boxplot of Dissolved Oxygen in mg/L, showing a symmetrical distribution with one upper-tail outlier. The median line is in the center of the box, which is consistent with a symmetrical data set.

Practical example of data

Within NSW the acceptable range for dissolved oxygen (DO) levels in water can vary due to many factors, such as water type (fresh, lake, drinking etc.) and whether marine life is living in it. On average an acceptable range is between 4-8mg/L, however the optimum range is between 8-14mg/L, therefore I will base my numbers on 10mg/L.

The output below shows that on 337 occasions a test at one of the stations returned a reading higher than or equal to 10mg/L. This is positive for water quality as it helps the survival of marine and aquatic life.

On 558 occasions the reading was below or equal to 10mg/L, which is not necessarily negative, depending on how low the readings go. Out of interest I also counted the readings below 4mg/L, as below that limit low oxygen can cause stress and can be lethal to species. This returned a value of 0, which is positive: based on DO alone, all marine life would survive.

Code

DO=sort(df$Dissolved.Oxygen.mg_L) # Sorts the data

Code

length(DO[DO>=10]) # Counts how many are more than or equal to 10

[1] 337

Code

length(DO[DO<=10]) # Counts how many are less than or equal to 10

[1] 558

Code

length(DO[DO<4]) # Counts how many are less than 4

[1] 0
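
The ranges mentioned above can also be tabulated in one step with `cut()`. This is a sketch on made-up values; the vector `do_mgL` and the band labels are illustrative assumptions, not part of the project data.

```r
do_mgL <- c(5.3, 7.9, 8.3, 9.4, 10.0, 10.5, 14.0)   # hypothetical readings in mg/L

bands <- cut(do_mgL,
             breaks = c(-Inf, 4, 8, Inf),
             labels = c("below 4", "4 to <8", "8 or more"),
             right  = FALSE)       # intervals are closed on the left: [lower, upper)

band_counts <- table(bands)        # number of readings in each band
print(band_counts)
```

Setting `right = FALSE` means a reading of exactly 8 falls in the "8 or more" band rather than the one below it.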

Code

library(ggplot2)
ggplot(data.frame(x = c(9.40-4*1.54, 9.40+4*1.54)), aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 9.40, sd=1.54)) +
  xlab("x") +
  ylab(expression(N(9.40,1.54^2)~pdf))

Figure 11: Normal distribution curve with μ = 9.40 and σ = 1.54.

Code

ggplot(data.frame(x = c(9.40-4*1.54, 9.40+4*1.54)), aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 9.40, sd=1.54)
                ,geom = "area", fill = "white") +
  stat_function(fun = dnorm, args = list(mean = 9.40, sd=1.54)
    , xlim = c(9.40-4*1.54, 10), geom = "area", fill = "#e83e8c") +
  xlab("x") +
  ylab(expression(N(9.40,1.54^2)~pdf))

Figure 12: Normal distribution curve showing the probability that one test has a Dissolved Oxygen reading less than 10mg/L.

pH

Code

library(ggplot2)

ggplot(df, aes(ph)) +
  geom_histogram(bins = 25, fill = "#6f42c1") +
  ggtitle("Histogram for pH")+
  xlab("pH")+
  ylab("Count")

Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_bin()`).

Figure 13: Histogram of pH levels in water.

Code

library(ggplot2)

ggplot(df, aes(ph)) +
  geom_boxplot(fill = "#6f42c1") +
  ggtitle("Boxplot for pH")+
  xlab("pH")+
  ylab(NULL) # a single boxplot has no count axis

Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Figure 14: This boxplot represents symmetrical data, with a number of upper and lower tail outliers.

Practical example of data

Keeping pH at the right level ensures that water is safe for human consumption. The acceptable range for drinking water in NSW is between 6.5 and 8.5.

A pH above 10 or below 4 can have significant health impacts as well as impacts on taste. Pleasingly, none of the readings in this data set fall below 6 or go above 10, which is positive.

To get counts I used 6 as the lower limit and 8 as the upper limit. 88 tests were at or above a pH of 8 and no tests returned a reading below a pH of 6. The pH column contains 5 missing values, so the counts are taken from the sorted vector, which has the missing values removed; indexing with df$pH directly would add the 5 missing values to every count. For the probability plot I will use the neutral pH of 7. 812 times a result was more than or equal to 7, and 101 times a result was less than or equal to 7.

Code

pH=sort(df$pH) # Sorts the data; sort() also drops the 5 missing values

Code

length(pH[pH>=8]) # Counts how many are more than or equal to 8

[1] 88

Code

length(pH[pH<6]) # Counts how many are less than 6

[1] 0

Code

length(pH[pH>=7]) # Counts how many are more than or equal to 7

[1] 812

Code

length(pH[pH<=7]) # Counts how many are less than or equal to 7

[1] 101
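
One caution with this counting pattern: a logical index that contains `NA` returns an `NA` element for each missing value, so `length(x[condition])` counts missing values as if they met the condition. A minimal sketch on made-up values (hypothetical vector `x`):

```r
x <- c(6.8, 7.2, NA, 8.1, NA)          # hypothetical pH readings, 2 of them missing

n_inflated <- length(x[x >= 8])        # one real match plus the 2 missing values
n_correct  <- sum(x >= 8, na.rm = TRUE) # only the real match

print(c(n_inflated, n_correct))
```

This matters for pH because the summary output reports 5 missing values in that column.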

Code

mean(pH)

[1] 7.485461

Code

sd(pH)

[1] 0.4062292

Code

library(ggplot2)
ggplot(data.frame(x = c(7.485-4*0.406, 7.485+4*0.406)), aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 7.485, sd=0.406)) +
  xlab("x") +
  ylab(expression(N(7.485,0.406^2)~pdf))

Figure 15: Normal distribution curve with μ = 7.485 and σ = 0.406.

Code

ggplot(data.frame(x = c(7.485-4*0.406, 7.485+4*0.406)), aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 7.485, sd=0.406)
                ,geom = "area", fill = "white") +
  stat_function(fun = dnorm, args = list(mean = 7.485, sd=0.406)
    , xlim = c(7.485-4*0.406, 7), geom = "area", fill = "#6f42c1") +
  xlab("x") +
  ylab(expression(N(7.485,0.406^2)~pdf))

Figure 16: Normal distribution curve showing the probability that one test has a pH reading less than 7.

Relationships within data

Summaries

Total Nitrogen

Below are the summary statistics for Total Nitrogen in the water. As discussed earlier, the recommended level for Total Nitrogen is below 1mg/L. In this data set 75% of the readings are at or below 0.54mg/L (the 3rd quartile), which is positive.

Code

# Measures of central tendency
summary(totalnirtrogen)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0300  0.2550  0.3900  0.4425  0.5400  2.5800

Code

mode(totalnirtrogen) # mode() returns the storage type, not the most frequent value

[1] "numeric"
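
Base R's `mode()` reports how an object is stored, which is why the output is "numeric". R has no built-in function for the most frequent value, so one possible sketch uses `table()`. The helper name `stat_mode` and the example vector are hypothetical.

```r
# Returns the most frequent value(s) of a numeric vector (hypothetical helper)
stat_mode <- function(x) {
  x <- x[!is.na(x)]                                  # drop missing values
  counts <- table(x)                                 # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])   # all values tied for most frequent
}

x <- c(0.39, 0.25, 0.39, 0.54, 0.25, 0.39)           # made-up readings
print(stat_mode(x))
```

For continuous measurements the most frequent value is often of limited use, because many readings occur only once.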

Code

# Measures of spread
var(totalnirtrogen)

[1] 0.07032665

Code

sd(totalnirtrogen)

[1] 0.2651917

Code

range(totalnirtrogen)

[1] 0.03 2.58

Code

IQR(totalnirtrogen)

[1] 0.285

Total Phosphorus

Below are the summary statistics for Total Phosphorus in water sampled across different stations. The median for this data is 0.012mg/L, which is higher than the recommended limit of 0.01mg/L, so this can lead to issues with algae and other water quality issues. Variance is a measure of variability and expresses how much the values deviate from the mean. The variance for this data is very small in absolute terms (0.0002060229) because the concentrations themselves are small. However, the standard deviation (0.0144) is almost as large as the mean (0.0152), so relative to the mean the readings are quite spread out.

Code

# Measures of central tendency
summary(totalphos)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00500 0.00700 0.01200 0.01515 0.01800 0.24800

Code

mode(totalphos) # mode() returns the storage type, not the most frequent value

[1] "numeric"

Code

# Measures of spread
var(totalphos)

[1] 0.0002060229

Code

sd(totalphos)

[1] 0.0143535

Code

range(totalphos)

[1] 0.005 0.248

Code

IQR(totalphos)

[1] 0.011
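
Because the variance depends on the units and size of the values, a small variance does not by itself mean the data is tightly clustered. One way to check is the coefficient of variation (standard deviation divided by mean), sketched here with the values printed above:

```r
mean_p <- 0.01515     # mean of Total Phosphorus (mg/L), from summary()
sd_p   <- 0.0143535   # standard deviation (mg/L), from sd()

cv_p <- sd_p / mean_p # coefficient of variation (unitless)
print(round(cv_p, 2))
```

A value close to 1 means the standard deviation is about as large as the mean.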

Dissolved Oxygen

The output below shows the amount of dissolved oxygen in the water at the NSW sites tested. This data set shows very good dissolved oxygen levels in the samples. The acceptable range is 4mg/L and above, and the summary and range show that the lowest reading (min) is 5.300mg/L. Every reading is therefore above the minimum, which is positive for aquatic and marine life.

Code

# Measures of central tendency
summary(disoxy)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  5.300   8.320   9.420   9.408  10.523  14.000

Code

mode(disoxy) # mode() returns the storage type, not the most frequent value

[1] "numeric"

Code

# Measures of spread
var(disoxy)

[1] 2.382377

Code

sd(disoxy)

[1] 1.543495

Code

range(disoxy)

[1]  5.3 14.0

Code

IQR(disoxy)

[1] 2.2025

pH

pH measures the acidity or alkalinity of the water. Ideally it is close to neutral, but an acceptable range is between 6.5 and 8.5. In this data set 75% of the readings are at or below 7.720, which is within the acceptable range. Looking at the boxplot together with the range, there are a number of outliers in the upper tail that affect the skew of the data, however the middle 50% of the data lies between 7.23 and 7.72, which is quite a small range.

Code

summary(ph)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  6.100   7.230   7.460   7.485   7.720   9.150       5

Code

sd(ph,na.rm = TRUE)

[1] 0.4062292

Code

range(ph, na.rm = TRUE)

[1] 6.10 9.15

Code

IQR(ph, na.rm = TRUE)

[1] 0.49

XY Scatterplot (Heat Maps)

This heat map explores the connection between Total Nitrogen (mg/L) and Total Phosphorus (mg/L) in the water samples. It shows that most of the data is located between 0.0 and 0.5 for nitrogen and between 0.00 and 0.05 for phosphorus. A number of outliers can also be seen.

Code

smoothScatter(df$Nitrogen.Total.mg_L, df$Phosphorus.Total.mg_L, transformation = function(x) x ^ 0.4,
              colramp = colorRampPalette(c("#000099", "#00FEFF", "#45FE4F",
                                           "#FCFF00", "#FF9400", "#FF3100")))
# install.packages("MASS")
library(MASS)
kern <- kde2d(df$Nitrogen.Total.mg_L, df$Phosphorus.Total.mg_L)

contour(kern, drawlabels = FALSE, nlevels = 6,
        col = rev(heat.colors(6)), add = TRUE, lwd = 3)

Figure 17: Heat map scatter plot of total nitrogen (mg/L) and total phosphorus (mg/L) for the water quality data.

This XY heat map scatter plot compares dissolved oxygen and pH. Both of these variables were quite symmetrical, so on the heat map the majority of the data (darker red and orange) is located around the center of the map, with limited outliers.

Code

smoothScatter(df$pH, df$Dissolved.Oxygen.mg_L, transformation = function(x) x ^ 0.4,
              colramp = colorRampPalette(c("#000099", "#00FEFF", "#45FE4F",
                                           "#FCFF00", "#FF9400", "#FF3100")))

Figure 18: Heat map scatter plot of pH and dissolved oxygen (mg/L) for the water quality data.
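
The strength of each relationship could also be summarised numerically with `cor()`. This is a sketch on a made-up data frame (`demo`, hypothetical values); `use = "complete.obs"` drops rows with missing values, which the `pH` column would need.

```r
# Hypothetical stand-in for two columns of the real data (values are made up)
demo <- data.frame(
  pH                    = c(7.2, 7.5, NA, 7.7, 8.0),
  Dissolved.Oxygen.mg_L = c(8.3, 9.4, 9.0, 10.5, 11.2)
)

# Without handling the missing value the result is NA
r_naive <- cor(demo$pH, demo$Dissolved.Oxygen.mg_L)

# Pearson correlation using only rows where both values are present
r <- cor(demo$pH, demo$Dissolved.Oxygen.mg_L, use = "complete.obs")

print(c(r_naive, r))
```

A value near 0 would indicate little linear relationship, while values near 1 or -1 indicate a strong one.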

Discussion

Overall this data has shown fairly consistent results.

Total Nitrogen = Overall this data was right skewed, with a standard deviation of 0.265. In the context of water quality the result is positive, because 865 readings were below or equal to 1mg/L and only 24 were at or above it. For aquatic ecosystem protection and management, nitrogen levels are regulated to prevent eutrophication, which can lead to algal blooms, oxygen depletion, and harm to aquatic life. For agricultural or irrigation water, nitrogen levels may be regulated to minimize nutrient runoff and pollution, which can affect water quality downstream and lead to environmental damage. Therefore it is important that nitrogen levels are monitored and maintained.

Total Phosphorus = The phosphorus data deviated the most from what it should be, with 548 individual tests at or above 0.01mg/L. The phosphorus data had a large number of right-tail outliers, which affected the skew of the data. In a water context, high levels of phosphorus can contribute to the growth of algae and bacteria in water distribution systems, leading to taste and odor issues and potentially compromising water quality.

Dissolved Oxygen = Dissolved oxygen is crucial for the survival of aquatic organisms and plays several roles in aquatic ecosystems. This data shows very good, high levels of dissolved oxygen, with the lowest reading being 5.3mg/L, which is still in the acceptable range. 100% of the entries were above the acceptable minimum of 4mg/L, which means aquatic and marine life is able to thrive in these waters.

pH = The pH data proved to be a fairly symmetrical data set with a number of outliers. Whilst water with a pH somewhat above or below 7 doesn’t pose health risks, maintaining the level at or close to 7 ensures the palatability of the water and avoids corrosion of pipes.

Journal

Research into what variables to choose (pH of Water - Environmental Measurement Systems 2019). I used the Government website, which is highly credible, and also the Water NSW site to look at what they usually use as variables.
Lab 2: this is where we went through the ggplot function, and I used this to create more detailed boxplots and histograms.
Tutorial 3, Binomial and Poisson data: this is where I regained an understanding of what these were, as I was confused about when to use each one.
Lab 4: this is where I got the majority of help for the distribution of data. I used Example 2 with the milk to first work out the ggplot histogram, then sort my data in ascending order, and count using the length(s[...]) pattern to work out <, > and = for my chosen data set. Then in part 2 I was able to use this to plot a normal distribution curve and work out probabilities using the standard deviation and the mean of the individual data sets.
Stack Exchange, to work out how to bold text (StackExchange 2019): this website was highly credible as it was a constant location that I would always go to for information on small intricate details that I could not work out normally. I knew that it was highly trusted due to the amount that it is discussed in our labs and lectures, and because it is a community where they all want to help each other.
Bootswatch: this is where I learnt how to change the theme and use different themes. I went with the theme ‘lumen’, and here I learnt how to use different parts of the theme as well as look at different colours that I could use from the package.
ChatGPT: this was my go-to website during this project, because it allowed me to have quick and accurate answers (ChatGPT 2024). I used ChatGPT to understand how to:
1. create a scatterplot
2. create a heat map –> which I didn’t end up using because it wouldn’t work for me
3. understand what “kde2d” is for my heat map
4. “clean” data to remove the unknown variables
5. use the dnorm function.

References

Box Plot Explained: Interpretation, Examples, & Comparison 2023, Simply Psychology, viewed 22 March 2024, https://www.simplypsychology.org/boxplots.html#How-to-compare-box-plots.
Scatter plot in R 2020, RCODER, viewed 22 March 2024, https://r-coder.com/scatter-plot-r/.
GfG 2020, Data Visualization in R, GeeksforGeeks, GeeksforGeeks, viewed 22 March 2024, https://www.geeksforgeeks.org/data-visualization-in-r/.
ggplot2 box plot : Quick start guide - R software and data visualization - Easy Guides - Wiki - STHDA 2020, Sthda.com, viewed 22 March 2024, http://www.sthda.com/english/wiki/ggplot2-box-plot-quick-start-guide-r-software-and-data-visualization.
pH of Water - Environmental Measurement Systems 2019, Environmental Measurement Systems, viewed 22 March 2024, https://www.fondriest.com/environmental-measurements/parameters/water-quality/ph/.
GfG 2021, How to import an Excel File into R ?, GeeksforGeeks, GeeksforGeeks, viewed 22 March 2024, https://www.geeksforgeeks.org/how-to-import-an-excel-file-into-r/.
Quarto - HTML Theming 2024, Quarto, viewed 22 March 2024, https://quarto.org/docs/output-formats/html-themes.html.
Bootswatch 2015, Lumen v5, Jsfiddle.net, viewed 22 March 2024, https://jsfiddle.net/bootswatch/3nw5ocub/.
Bootswatch: Lumen 2024, Bootswatch.com, viewed 22 March 2024, https://bootswatch.com/lumen/.
Donovan, K 2019, 6 Working with Tables in R | Data Analysis and Processing with R based on IBIS data, Bookdown.org, viewed 22 March 2024, https://bookdown.org/kdonovan125/ibis_data_analysis_r4/working-with-tables-in-r.html.
R - Normal Distribution 2024, Tutorialspoint.com, viewed 22 March 2024, https://www.tutorialspoint.com/r/r_normal_distribution.htm.
Probability distributions 2024, Thomasleeper.com, viewed 22 March 2024, https://thomasleeper.com/Rcourse/Tutorials/distributions.html#:~:text=For%20example%2C%20the%20dnorm%20function,distribution%20at%20a%20specific%20quantile..
The Normal Distribution in R 2024, Michaelminn.net, viewed 22 March 2024, https://michaelminn.net/tutorials/r-normal-rank-order/index.html.
Total Nitrogen - DCCEEW 2022, Dcceew.gov.au, viewed 22 March 2024, https://www.dcceew.gov.au/environment/protection/npi/substances/fact-sheets/total-nitrogen.
Normal distribution in R 2020, RCODER, viewed 22 March 2024, https://r-coder.com/normal-distribution-r/.
Xie, Y 2024, 2.10 HTML widgets | bookdown: Authoring Books and Technical Documents with R Markdown, Bookdown.org, viewed 22 March 2024, https://bookdown.org/yihui/bookdown/html-widgets.html.
StackExchange 2019, How to type italics or bold inside code in a comment, Meta Stack Exchange, viewed 22 March 2024, https://meta.stackexchange.com/questions/332439/how-to-type-italics-or-bold-inside-code-in-a-comment.
OpenAI 2022, ChatGPT (Version 3.5), AI model, https://openai.com/chatgpt.