library(readxl)
df <- read_excel("data/ENVX1002_Water_Quality_Data.xlsx")
In this project I used the Water Quality data, a subset of water quality variables recorded at a number of monitoring stations in NSW. It includes both biological and chemical variables that are commonly used to assess water quality. The data was collected by Water NSW, and the variables analysed here are Total Nitrogen, Total Phosphorus, Dissolved Oxygen and pH. I analyse the distribution of each variable using several visual representations as well as probability distributions. Relationships within the data are also analysed before conclusions are drawn in the discussion.
Total nitrogen –> Total nitrogen is important for water quality as it acts as a vital nutrient, but excessive levels can lead to eutrophication, disrupting aquatic ecosystems and degrading water quality.
Total Phosphorus –> Total phosphorus is important for water quality as it serves as a key indicator of nutrient pollution, influencing algae growth and overall ecosystem health.
Dissolved Oxygen –> Dissolved oxygen is crucial for water quality as it sustains aquatic life by supporting respiration processes and maintaining ecosystem health.
pH –> pH is critical for water quality as it dictates the balance of acidity and alkalinity, with an ideal range typically between 6.5 and 8.5, ensuring the health of aquatic ecosystems and suitability for human use.
All of the variables analysed in my data set are numerical and continuous.
library(readxl)
df <- read_excel("data/ENVX1002_Water_Quality_Data.xlsx")
totalnirtrogen <- df$Nitrogen.Total.mg_L
totalphos <- df$Phosphorus.Total.mg_L
disoxy <- df$Dissolved.Oxygen.mg_L
ph <- df$pH
library(ggplot2)
ggplot(df, aes(totalnirtrogen)) +
geom_histogram(bins = 25, fill = "#158cba") +
ggtitle("Histogram for Total Nitrogen")+
xlab("Total Nitrogen in mg/L")+
ylab("Count")
Figure 1: Histogram for Total Nitrogen in mg/L. This histogram shows that the data is right skewed, with the majority of the readings between 0 and 1 mg/L.
library(ggplot2)
ggplot(df, aes(totalnirtrogen)) +
geom_boxplot(fill = "#158cba") +
ggtitle("Boxplot for Total Nitrogen")+
xlab("Total Nitrogen in mg/L")+
ylab("Count")
Figure 2: Boxplot for Total Nitrogen in mg/L. This boxplot shows a positive (right) skew, consistent with the histogram, and gives a visual representation of the upper tail outliers. The boxplot shows that the median of this data is around 0.4 mg/L.
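The direction of skew can also be checked numerically rather than read off a plot. The sketch below is only an illustration: the vector `x` holds made-up values (not the Water NSW readings) and `skew_moment()` is a hypothetical helper written here, not a base R function. A mean above the median and a positive moment skewness both point to a right skew.

```r
# Hypothetical readings in mg/L (illustrative values only, not the real data)
x <- c(0.03, 0.25, 0.30, 0.39, 0.45, 0.54, 0.80, 2.58)

# Moment-based skewness: third central moment over the cubed standard deviation
skew_moment <- function(x) {
  x <- x[!is.na(x)]                  # drop missing values
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

mean(x) > median(x)   # TRUE suggests a right (positive) skew
skew_moment(x) > 0    # TRUE also indicates a right skew
```

For Total Nitrogen the reported mean (0.4425) is above the reported median (0.39), which agrees with the right skew seen in the histogram.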
For drinking water in NSW it is recommended that water should contain less than 1mg/L of Nitrogen.
The output below shows that on 24 occasions a water test at one of the stations returned a reading higher than or equal to 1mg/L.
On 865 occasions the reading was below or equal to 1mg/L, which is a positive result.
s = sort(df$Nitrogen.Total.mg_L) # Sorts the data
length(s[s>=1]) # Counts how many are more than or equal to 1
[1] 24
length(s[s<=1]) # Counts how many are less than or equal to 1
[1] 865
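The same kind of count can be made without sorting, because a logical comparison can be summed directly. This is a minimal sketch on a made-up vector called `readings` (hypothetical values, not the real data):

```r
# Hypothetical nitrogen readings in mg/L (illustrative values only)
readings <- c(0.25, 0.39, 1.00, 0.54, 2.58, 0.03)

n_at_or_above <- sum(readings >= 1)   # TRUE counts as 1, FALSE as 0
n_below       <- sum(readings < 1)    # strict inequality, so nothing is counted twice
prop_below    <- mean(readings < 1)   # proportion of readings under the guideline

n_at_or_above
n_below
prop_below
```

Pairing `>=` with `<` makes the two counts add up to the number of readings. Pairing `>=` with `<=` counts any reading of exactly 1mg/L in both groups.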
library(ggplot2)
ggplot(data.frame(x = c(0.442-4*0.265, 0.442+4*0.265)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = 0.442, sd=0.265)) +
xlab("x") +
ylab(expression(N(0.442,0.265^2)~pdf))
Figure 3: Normal distribution curve with μ = 0.442 and σ = 0.265.
ggplot(data.frame(x = c(0.442-4*0.265, 0.442+4*0.265)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = 0.442, sd=0.265),
geom = "area", fill = "white") +
stat_function(fun = dnorm, args = list(mean = 0.442, sd=0.265),
xlim = c(0.442-4*0.265, 1), geom = "area", fill = "#158cba") +
xlab("x") +
ylab(expression(N(0.442,0.265^2)~pdf))
Figure 4: Normal distribution curve showing the probability that a single test has a Total Nitrogen reading less than 1mg/L.
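The shaded area can be turned into a number with `pnorm()`. The sketch below assumes the normal model N(0.442, 0.265^2) used for the curve, and the variable names are only illustrative:

```r
mu    <- 0.442   # mean of Total Nitrogen used for the curve
sigma <- 0.265   # standard deviation used for the curve

# P(X < 1) under the normal model: the shaded area under the curve
p_model <- pnorm(1, mean = mu, sd = sigma)

# The same probability via the z-score and the standard normal
z        <- (1 - mu) / sigma
p_from_z <- pnorm(z)

round(p_model, 2)
```

Under this model the probability is about 0.98, which is close to the observed share of readings below 1mg/L (864 of 888, about 0.97).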
library(ggplot2)
ggplot(df, aes(totalphos)) +
geom_histogram(bins = 25, fill = "#28b62c") +
ggtitle("Histogram for Total Phosphorus")+
xlab("Total Phosphorus mg_L")+
ylab("Count")
Figure 5: Histogram for Total Phosphorus in mg/L. This histogram shows that the data is slightly right skewed.
library(ggplot2)
ggplot(df, aes(totalphos)) +
geom_boxplot(fill = "#28b62c") +
ggtitle("Boxplot for Total Phosphorus")+
xlab("Total Phosphorus mg_L")+
ylab("Count")
Figure 6: Boxplot for Total Phosphorus in mg/L. This boxplot shows a positive (right) skew and gives a visual representation of the upper tail outliers.
The acceptable level of phosphorus in drinking water in NSW is often set below detectable limits or at very low concentrations, typically less than 0.01mg/L.
The output below shows that on 548 occasions a water test at one of the stations returned a reading higher than or equal to 0.01mg/L. This number is significantly higher than it should be and reflects the right skew of the data. The phosphorus level in the water should be monitored and reduced to avoid high algae and bacteria growth, which leads to taste issues and compromised water quality.
On 390 occasions the reading was below or equal to 0.01mg/L, which is a positive result for avoiding the risk of algae issues.
p = sort(df$Phosphorus.Total.mg_L) # Sorts the data
length(p[p>=0.01]) # Counts how many are more than or equal to 0.01
[1] 548
length(p[p<=0.01]) # Counts how many are less than or equal to 0.01
[1] 390
library(ggplot2)
ggplot(data.frame(x = c(0.015-4*0.0143, 0.015+4*0.0143)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = 0.015, sd=0.0143)) +
xlab("x") +
ylab(expression(N(0.015,0.0143^2)~pdf))
Figure 7: Normal distribution curve with μ = 0.015 and σ = 0.0143.
ggplot(data.frame(x = c(0.015-4*0.0143, 0.015+4*0.0143)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = 0.015, sd=0.0143),
geom = "area", fill = "white") +
stat_function(fun = dnorm, args = list(mean = 0.015, sd=0.0143),
xlim = c(0.015-4*0.0143, 0.01), geom = "area", fill = "#28b62c") +
xlab("x") +
ylab(expression(N(0.015,0.0143^2)~pdf))
Figure 8: Normal distribution curve showing the probability that a single test has a Total Phosphorus reading less than 0.01mg/L.
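It is worth checking how well a normal curve suits a right-skewed concentration. The sketch below takes the plotted model N(0.015, 0.0143^2) at face value and asks how much probability it places below the 0.01mg/L limit and below zero. The variable names are illustrative only.

```r
mu    <- 0.015    # mean of Total Phosphorus used for the curve
sigma <- 0.0143   # standard deviation used for the curve

# Shaded area: P(X < 0.01) under the normal model
p_below_limit <- pnorm(0.01, mean = mu, sd = sigma)

# Probability the model assigns to impossible negative concentrations
p_negative <- pnorm(0, mean = mu, sd = sigma)

round(c(below_limit = p_below_limit, negative = p_negative), 2)
```

The model places roughly 15% of its probability below 0mg/L, which cannot occur for a concentration, so the normal curve should be treated as a rough approximation for this variable.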
library(ggplot2)
ggplot(df, aes(disoxy)) +
geom_histogram(bins = 25, fill = "#e83e8c") +
ggtitle("Histogram for Dissolved Oxygen")+
xlab("Dissolved Oxygen mg_L")+
ylab("Count")
Figure 9: Histogram of Dissolved Oxygen in mg/L. The distribution is roughly symmetrical.
library(ggplot2)
ggplot(df, aes(disoxy)) +
geom_boxplot(fill = "#e83e8c") +
ggtitle("Boxplot for Dissolved Oxygen")+
xlab("Dissolved Oxygen mg/L")+
ylab("Count")
Figure 10: Boxplot of Dissolved Oxygen in mg/L, showing a roughly symmetrical distribution with one upper tail outlier. The median line is near the center of the box, which is consistent with a symmetrical data set.
Within NSW the acceptable range for dissolved oxygen (DO) levels in water varies with factors such as water type (fresh, lake, drinking etc.) and whether marine life is living in it. On average an acceptable range is between 4-8mg/L, while the optimum range is between 8-14mg/L, so I will base my numbers on 10mg/L.
The output below shows that on 337 occasions a water test at one of the stations returned a reading higher than or equal to 10mg/L. This is a positive sign for water quality as it supports the survival of marine and aquatic life.
On 558 occasions the reading was below or equal to 10mg/L, which is not necessarily negative, as it depends on how low the readings go. Out of interest I also counted the readings below 4mg/L, because below that limit low DO can cause stress and be lethal to species. This returned a value of 0, which is positive: based on DO alone, all marine life would survive.
DO = sort(df$Dissolved.Oxygen.mg_L) # Sorts the data
length(DO[DO>=10]) # Counts how many are more than or equal to 10
[1] 337
length(DO[DO<=10]) # Counts how many are less than or equal to 10
[1] 558
length(DO[DO<4]) # Counts how many are less than 4
[1] 0
library(ggplot2)
ggplot(data.frame(x = c(9.40-4*1.54, 9.40+4*1.54)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = 9.40, sd=1.54)) +
xlab("x") +
ylab(expression(N(9.40,1.54^2)~pdf))
Figure 11: Normal distribution curve with μ = 9.40 and σ = 1.54.
ggplot(data.frame(x = c(9.40-4*1.54, 9.40+4*1.54)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = 9.40, sd=1.54),
geom = "area", fill = "white") +
stat_function(fun = dnorm, args = list(mean = 9.40, sd=1.54),
xlim = c(9.40-4*1.54, 10), geom = "area", fill = "#e83e8c") +
xlab("x") +
ylab(expression(N(9.40,1.54^2)~pdf))
Figure 12: Normal distribution curve showing the probability that a single test has a Dissolved Oxygen reading less than 10mg/L.
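The same normal model can be used for the other DO thresholds mentioned earlier. The sketch below assumes N(9.40, 1.54^2) and estimates the probability of a reading below the 4mg/L stress limit and the probability of a reading inside the 8-14mg/L optimum range. An interval probability is the difference of two `pnorm()` values.

```r
mu    <- 9.40   # mean of Dissolved Oxygen used for the curve
sigma <- 1.54   # standard deviation used for the curve

# P(X < 4): probability of a reading below the stress limit
p_below_4 <- pnorm(4, mean = mu, sd = sigma)

# P(8 < X < 14): probability of a reading in the optimum range
p_optimum <- pnorm(14, mean = mu, sd = sigma) - pnorm(8, mean = mu, sd = sigma)

signif(p_below_4, 2)
round(p_optimum, 2)
```

Under this model a reading below 4mg/L is very unlikely (well under 0.1%), which is consistent with no such readings being observed, and roughly 82% of readings are expected to fall in the optimum range.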
library(ggplot2)
ggplot(df, aes(ph)) +
geom_histogram(bins = 25, fill = "#6f42c1") +
ggtitle("Histogram for pH")+
xlab("pH")+
ylab("Count")
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_bin()`).
Figure 13: Histogram of pH levels in water.
library(ggplot2)
ggplot(df, aes(ph)) +
geom_boxplot(fill = "#6f42c1") +
ggtitle("Boxplot for pH")+
xlab("pH")+
ylab("Count")
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Figure 14: This boxplot represents symmetrical data, with a number of upper and lower tail outliers.
Keeping pH at the right level ensures that water is safe for human consumption. The acceptable range for drinking water in NSW is between 6.5 and 8.5.
A pH above 10 or below 4 can have significant health impacts as well as impacts on taste. Pleasingly, none of the readings in this data set fall below 6 or go above 10, which is positive.
To get values I first used 6 as the lower limit and 8 as the upper limit: 88 tests were at or above a pH of 8 and no tests returned a reading below a pH of 6 (the minimum is 6.10). The 5 missing pH values are excluded from every count. I then used a neutral pH of 7 as the cut-off: 812 results were more than or equal to 7, and 101 results were less than or equal to 7.
pH = sort(df$pH) # Sorts the data; sort() also drops the 5 missing values
length(pH[pH>=8]) # Counts how many are more than or equal to 8
[1] 88
length(pH[pH<6]) # Counts how many are less than 6
[1] 0
length(pH[pH>=7]) # Counts how many are more than or equal to 7
[1] 812
length(pH[pH<=7]) # Counts how many are less than or equal to 7
[1] 101
mean(pH)
[1] 7.485461
sd(pH)
[1] 0.4062292
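Because the pH column contains missing values, counting by subsetting can silently include them: indexing a vector with a logical `NA` returns an `NA` element, and `length()` counts it. A minimal sketch on a made-up vector (`ph_demo` is hypothetical, not the real data):

```r
# Hypothetical pH readings with one missing value (illustrative only)
ph_demo <- c(6.5, 7.0, NA, 8.2)

# Subsetting keeps an NA for the missing reading, so the count is inflated
n_subset <- length(ph_demo[ph_demo >= 7])

# Summing the comparison with na.rm = TRUE counts only real readings
n_sum <- sum(ph_demo >= 7, na.rm = TRUE)

n_subset   # 3: the readings 7.0 and 8.2 plus one NA
n_sum      # 2: the readings 7.0 and 8.2
```

Either removing the missing values first or using `sum(..., na.rm = TRUE)` avoids the inflated count.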
library(ggplot2)
# Mean and sd taken from mean(pH) = 7.485 and sd(pH) = 0.406
ggplot(data.frame(x = c(7.485-4*0.406, 7.485+4*0.406)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = 7.485, sd=0.406)) +
xlab("x") +
ylab(expression(N(7.485,0.406^2)~pdf))
Figure 15: Normal distribution curve with μ = 7.485 and σ = 0.406.
ggplot(data.frame(x = c(7.485-4*0.406, 7.485+4*0.406)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = 7.485, sd=0.406),
geom = "area", fill = "white") +
stat_function(fun = dnorm, args = list(mean = 7.485, sd=0.406),
xlim = c(7.485-4*0.406, 7), geom = "area", fill = "#e83e8c") +
xlab("x") +
ylab(expression(N(7.485,0.406^2)~pdf))
Below is the data for Total Nitrogen in the water. As discussed earlier, the recommended level for Total Nitrogen is below 1mg/L. In this data set 75% of the readings are at or below 0.54mg/L (the 3rd quartile), which is a positive result.
# Measures of central tendency
summary(totalnirtrogen)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0300  0.2550  0.3900  0.4425  0.5400  2.5800
mode(totalnirtrogen) # Note: mode() reports the storage mode of the object, not the most frequent value
[1] "numeric"
# Measures of spread
var(totalnirtrogen)
[1] 0.07032665
sd(totalnirtrogen)
[1] 0.2651917
range(totalnirtrogen)
[1] 0.03 2.58
IQR(totalnirtrogen)
[1] 0.285
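The `mode()` calls in these summaries return `"numeric"` because base R's `mode()` describes how an object is stored and does not compute the statistical mode. If the most frequent value is wanted, a small helper can be written. The function `stat_mode()` below is a hypothetical helper defined here for illustration, not part of base R.

```r
# Statistical mode: the most frequent value(s) in a numeric vector
stat_mode <- function(x) {
  x <- x[!is.na(x)]                    # drop missing values
  counts <- table(x)                   # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])  # all values tied for most frequent
}

stat_mode(c(0.2, 0.3, 0.3, 0.5, NA))   # the value 0.3 appears most often
stat_mode(c(1, 1, 2, 2))               # ties return every tied value
```

For continuous measurements the mode is often not very informative, since most values occur only once or twice, so the median and mean are usually the more useful measures of central tendency here.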
Below is the data for Total Phosphorus in water sampled across different stations. The median for this data is 0.012mg/L, which is higher than the recommended limit of 0.01mg/L, so this can lead to issues with algae and other water quality problems. Variance is a measure of variability and expresses how much the values deviate from the mean. The variance for this data is very small in absolute terms (0.0002060229), but the readings themselves are small: the standard deviation (0.0144) is almost as large as the mean (0.0152), so the spread is large relative to a typical reading.
# Measures of central tendency
summary(totalphos)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00500 0.00700 0.01200 0.01515 0.01800 0.24800
mode(totalphos) # Note: mode() reports the storage mode of the object, not the most frequent value
[1] "numeric"
# Measures of spread
var(totalphos)
[1] 0.0002060229
sd(totalphos)
[1] 0.0143535
range(totalphos)
[1] 0.005 0.248
IQR(totalphos)
[1] 0.011
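Because variance depends on the units and size of the measurements, a relative measure helps when comparing spread between variables. The sketch below computes the coefficient of variation (standard deviation divided by mean) from the summary values reported for phosphorus and nitrogen. The variable names are illustrative only.

```r
# Reported summary values for Total Phosphorus (mg/L)
mean_p <- 0.01515
sd_p   <- 0.0143535

# Reported summary values for Total Nitrogen (mg/L)
mean_n <- 0.4425
sd_n   <- 0.2651917

# Coefficient of variation: spread relative to the mean (unitless)
cv_p <- sd_p / mean_p
cv_n <- sd_n / mean_n

round(c(phosphorus = cv_p, nitrogen = cv_n), 2)
```

On this relative scale phosphorus (about 0.95) is more variable than nitrogen (about 0.60), even though its variance is much smaller in absolute terms.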
The data below shows the amount of dissolved oxygen in the water at the tested NSW sites. The samples show very good dissolved oxygen levels. The acceptable range is 4mg/L and above, and the summary and range show that the lowest reading (min) is 5.300mg/L. This is above the minimum required and is therefore a positive result for aquatic and marine life.
# Measures of central tendency
summary(disoxy)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  5.300   8.320   9.420   9.408  10.523  14.000
mode(disoxy) # Note: mode() reports the storage mode of the object, not the most frequent value
[1] "numeric"
# Measures of spread
var(disoxy)
[1] 2.382377
sd(disoxy)
[1] 1.543495
range(disoxy)
[1] 5.3 14.0
IQR(disoxy)
[1] 2.2025
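The outliers drawn on a boxplot can be checked against the usual 1.5 × IQR rule. The sketch below uses the reported DO quartiles, with the third quartile taken as Q1 + IQR = 10.5225 since the summary rounds it to 10.523. It is only an approximation, because the hinges used by a boxplot can differ slightly from the quartiles reported by `summary()`.

```r
q1  <- 8.320     # reported 1st quartile of Dissolved Oxygen (mg/L)
iqr <- 2.2025    # reported interquartile range
q3  <- q1 + iqr  # 3rd quartile implied by Q1 and the IQR

# Readings beyond these fences are drawn as outlier points
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)

5.3 < lower_fence    # is the reported minimum a lower outlier?
14.0 > upper_fence   # is the reported maximum an upper outlier?
```

The minimum (5.3) sits inside the lower fence and the maximum (14.0) sits just beyond the upper fence, which matches a boxplot with no lower outliers and an outlier in the upper tail.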
pH measures the acidity or alkalinity of the water. Ideally it should be close to neutral, and an acceptable range is between 6.5 and 8.5. In this data set 75% of the readings are at or below 7.720, which is within the acceptable range. Assessing the boxplot alongside the range shows that there are a number of outliers, particularly in the upper tail, that affect the skew of the data. However, the middle 50% of the data lies between 7.23 and 7.72, which is quite a small range.
summary(ph)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  6.100   7.230   7.460   7.485   7.720   9.150       5
sd(ph,na.rm = TRUE)
[1] 0.4062292
range(ph, na.rm = TRUE)
[1] 6.10 9.15
IQR(ph, na.rm = TRUE)
[1] 0.49
This heat map explores the connection between Total Nitrogen (mg/L) and Total Phosphorus (mg/L) in the water samples. It shows that most of the data is concentrated around 0.0-0.5 for Nitrogen and 0.00-0.05 for Phosphorus. A number of outliers can also be seen.
smoothScatter(df$Nitrogen.Total.mg_L, df$Phosphorus.Total.mg_L, transformation = function(x) x ^ 0.4,
colramp = colorRampPalette(c("#000099", "#00FEFF", "#45FE4F",
"#FCFF00", "#FF9400", "#FF3100")))
# install.packages("MASS")
library(MASS)
kern <- kde2d(df$Nitrogen.Total.mg_L, df$Phosphorus.Total.mg_L)
contour(kern, drawlabels = FALSE, nlevels = 6,
col = rev(heat.colors(6)), add = TRUE, lwd = 3)
Figure 16: Heat map scatter plot of Total Nitrogen (mg/L) and Total Phosphorus (mg/L) for the water quality data.
This XY heat map scatter plot compares Dissolved Oxygen and pH. Both variables were quite symmetrical, so the majority of the data (darker red and orange) is located around the center of the map, with limited outliers.
smoothScatter(df$pH, df$Dissolved.Oxygen.mg_L, transformation = function(x) x ^ 0.4,
colramp = colorRampPalette(c("#000099", "#00FEFF", "#45FE4F",
"#FCFF00", "#FF9400", "#FF3100")))
Figure 17: Heat map scatter plot of Dissolved Oxygen (mg/L) and pH for the water quality data.
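A heat map shows where the points cluster, and a correlation coefficient can put a number on the strength of the relationship. The sketch below uses two made-up vectors (`ph_vals` and `do_vals` are hypothetical, not the real data) and `use = "complete.obs"` so that a missing pH value does not turn the result into `NA`.

```r
# Hypothetical paired readings (illustrative values only)
ph_vals <- c(7.2, 7.5, NA, 7.8, 8.1, 6.9)
do_vals <- c(8.5, 9.4, 10.1, 10.5, 11.2, 8.0)

# Pearson correlation on the complete pairs only
r <- cor(ph_vals, do_vals, use = "complete.obs")

# Spearman rank correlation is less sensitive to outliers and skew
rho <- cor(ph_vals, do_vals, use = "complete.obs", method = "spearman")

round(c(pearson = r, spearman = rho), 2)
```

For the skewed nitrogen and phosphorus data the Spearman version may be the safer choice, since a few extreme readings can dominate the Pearson coefficient.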
Overall this data has shown fairly consistent results.
Total Nitrogen = Overall this data was right skewed with a standard deviation of 0.265. In the context of water quality the result is positive, because 864 of the 888 data entries were below 1mg/L. For aquatic ecosystem protection and management, nitrogen levels are regulated to prevent eutrophication, which can lead to algal blooms, oxygen depletion, and harm to aquatic life. For agricultural or irrigation water, nitrogen levels may be regulated to minimize nutrient runoff and pollution, which can affect water quality downstream and lead to environmental damage. Therefore it is important that Nitrogen levels are monitored and maintained.
Total Phosphorus = The phosphorus data deviated the most from what it should be, with 548 individual tests at or above 0.01mg/L. The phosphorus data had a significant number of right tail outliers, which affected the skew of the data. In a water context, high levels of phosphorus can contribute to the growth of algae and bacteria in water distribution systems, leading to taste and odor issues and potentially compromising water quality.
Dissolved Oxygen = Dissolved oxygen is crucial to the survival of aquatic organisms and plays several important roles in aquatic ecosystems. The data shows very good levels of dissolved oxygen, with the lowest reading being 5.3mg/L, which is still in the acceptable range. 100% of the entries were above the acceptable minimum of 4mg/L, so aquatic and marine life has the ability to thrive in these waters.
pH = The pH data proved to be a fairly symmetrical data set with a number of outliers. Although a pH slightly higher or lower than neutral doesn't pose any health risks, maintaining the level at or close to 7 ensures palatability of the water and avoids corrosion of pipes.
Research into what variables to choose (pH of Water - Environmental Measurement Systems 2019). I used the Government website, which is highly credible, and also the Water NSW site to look at which variables they usually use.
Lab 2: this is where we went through the ggplot function, and I used this to create more detailed boxplots and histograms.
Tutorial 3, Binomial and Poisson data: this is where I could regain an understanding of what these were, as I was confused about when to use each one.
Lab 4: this is where I got the majority of the help for the distribution of data. I used Example 2 with the milk to first work out the ggplot histogram, then sort my data in ascending order and count using the length(s[...]) pattern to work out the <, > and = comparisons for my chosen data set. Then in part 2 I was able to plot this on a normal distribution curve and work out probabilities using the standard deviation and the mean of the individual data sets.
Stack Exchange, to work out how to bold text (StackExchange 2019): this website was highly credible, as it was a constant location that I would always go to for information on small, infrequent details that I could not normally work out. I knew that it was highly trusted due to how often it is discussed in our labs and lectures, and because it is a community where everyone wants to help each other.
Bootswatch: this is where I learnt how to change the theme and use different themes. I went with the theme ‘lumen’, and here I learnt how to use different parts of the theme as well as look at different colours that I could use from the package.
ChatGPT: this was my go-to website during this project, because it allowed me to have quick and accurate answers (ChatGPT 2024). I used ChatGPT to understand how to