This report uses two different sets of shark data to answer three questions:
Are air and water correlated?
Can blotching time be predicted?
Do multiple captures have an effect on blotching time?
First, the required packages are loaded into RStudio so that the graphs and summary statistics needed to answer the three questions can be produced.
The first data set is then loaded into RStudio.
# Load readxl, which provides read_excel().
library(readxl)
# This reads the data.
sharks <- read_excel("C:/Users/Natas/OneDrive/Documents/workbook/sharks.xlsx")
# Check that the data set has loaded into RStudio.
head(sharks)
The first step is to find the mean, median and range of shark length in the first data set. Comparing the mean with the median gives a first indication of whether outliers are present, and the range shows how spread out the data are.
# Load the necessary libraries for this data set.
library(tidyverse)
# Calculate the mean, median and range of the length variable.
mean_length <- mean(sharks$length, na.rm = TRUE)      # mean shark length
median_length <- median(sharks$length, na.rm = TRUE)  # median shark length
range_length <- range(sharks$length, na.rm = TRUE)    # minimum and maximum length
# Display the results for the length of sharks of both sexes.
mean_length
[1] 211.0442
median_length
[1] 211.0681
range_length
[1] 128.2534 290.9527
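The mean and median alone only hint at outliers. A more direct check is the 1.5 x IQR rule, sketched below on a small made-up vector. The values in `shark_lengths` and the helper `flag_outliers()` are hypothetical and are not taken from the shark data.

```r
# Hypothetical lengths for illustration only; not from the sharks data set.
shark_lengths <- c(128, 180, 195, 205, 211, 215, 224, 240, 291, 420)

# Flag values that lie more than 1.5 * IQR below the first quartile
# or above the third quartile.
flag_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Show the flagged values.
shark_lengths[flag_outliers(shark_lengths)]
```

The same function could be applied to `sharks$length` once the data set is loaded.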
A one-sample t-test is then used to obtain a confidence interval for the mean blotch time in the first data set.
# This calculates the 95% confidence interval for the mean of blotch times.
t.test(sharks$blotch)$conf.int
# This gives a range within which the true mean blotch time is likely to lie.
# Because only one variable is tested, the interval does not compare two means; an interval that excludes zero only shows that the mean blotch time differs from zero.
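To make clear what the interval from `t.test()` represents, the sketch below rebuilds it by hand as mean plus or minus the critical t value times the standard error. The vector `blotch` is simulated here and is only a stand-in for `sharks$blotch`.

```r
set.seed(1)
# Simulated blotch times; hypothetical values, not the shark data.
blotch <- rnorm(50, mean = 35, sd = 1.5)

n <- length(blotch)
se <- sd(blotch) / sqrt(n)         # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # critical value for a 95% interval
manual_ci <- mean(blotch) + c(-1, 1) * t_crit * se

# The two intervals should agree.
manual_ci
t.test(blotch)$conf.int
```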
The mean of the blotch times is calculated as a reference point for spotting unusual values in this column.
# Calculate the mean of the blotch times in the data set.
mean(sharks$blotch, na.rm = TRUE)
[1] 35.12541
AIR AND WATER
The correlation coefficient between air and water is calculated before any graphs are drawn.
cor(sharks$air, sharks$water, use = "complete.obs")
[1] -0.05524051
# This calculates the correlation coefficient between the two variables air and water in the shark data set.
# A value of -0.055 is very close to zero, which indicates almost no linear relationship.
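`cor()` returns only the coefficient. If a p-value and confidence interval are also wanted, `cor.test()` from base R provides them. The sketch below uses simulated `air` and `water` vectors as stand-ins, so the numbers it prints are not results for the shark data.

```r
set.seed(42)
# Simulated, independent stand-ins for sharks$air and sharks$water.
air <- rnorm(100, mean = 35, sd = 1.5)
water <- rnorm(100, mean = 23, sd = 1.7)

# Pearson correlation test: estimate, p-value and 95% confidence interval.
result <- cor.test(air, water, method = "pearson")
result$estimate
result$p.value
result$conf.int
```

A confidence interval that contains zero would be consistent with no linear correlation.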
Further tests are then run on the air and water data.
# Check for normality of the differences
shapiro.test(sharks$air - sharks$water)
Shapiro-Wilk normality test
data: sharks$air - sharks$water
W = 0.98944, p-value = 0.001154
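# The Shapiro-Wilk p-value (0.001154) is below 0.05, so the differences are not normally distributed.
# The paired t-test is therefore followed by a Wilcoxon signed rank test as a check.
# Perform the paired t-test
t_test_result <- t.test(sharks$air, sharks$water, paired = TRUE)
# Display the results
print(t_test_result)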
Paired t-test
data: sharks$air and sharks$water
t = 123.98, df = 499, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
12.31642 12.71305
sample estimates:
mean difference
12.51474
# Perform the Wilcoxon signed rank test
wilcox_test_result <- wilcox.test(sharks$air, sharks$water, paired = TRUE)
# Display the result
print(wilcox_test_result)
Wilcoxon signed rank test with continuity correction
data: sharks$air and sharks$water
V = 125250, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
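The normality check was run on `air - water` because a paired t-test is the same as a one-sample t-test on the paired differences. The sketch below shows this on simulated data; `first` and `second` are hypothetical paired measurements, not columns of the shark data.

```r
set.seed(7)
# Hypothetical paired measurements for illustration.
first <- rnorm(30, mean = 35, sd = 1)
second <- first + rnorm(30, mean = 1, sd = 0.5)

# Paired t-test on the two vectors.
paired <- t.test(first, second, paired = TRUE)
# One-sample t-test on the differences.
one_sample <- t.test(first - second, mu = 0)

# Both approaches give the same test statistic and p-value.
c(paired = unname(paired$statistic), one_sample = unname(one_sample$statistic))
c(paired = paired$p.value, one_sample = one_sample$p.value)
```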
Graphs are now used to visualize whether there is a correlation between air and water.
# Create a scatter plot with a regression line
ggplot(sharks, aes(x = air, y = water)) +
  geom_point(color = "black", size = 2) +                 # add scatter points
  geom_smooth(method = "lm", color = "red", se = TRUE) +  # add regression line
  labs(title = "Scatter Plot of Air vs Water", x = "Air", y = "Water") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
# The graph indicates that there is no strong correlation between air and water, as the points appear randomly distributed.
Further graphs are used to compare the air and water data and to show any differences between them.
library(tidyr)
library(ggplot2)
# Reshape the data using pivot_longer
sharks_melted <- sharks %>%
  pivot_longer(cols = c("air", "water"), names_to = "variable", values_to = "value")
# Create the boxplot
ggplot(sharks_melted, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot() +
  labs(title = "Boxplots of Air and Water", x = "Variable", y = "Value") +
  theme_minimal() +
  scale_fill_manual(values = c("skyblue", "salmon"))
# This boxplot shows that air has a higher median and a narrower range, which indicates more consistent values.
ggplot(sharks_melted, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot() +
  labs(title = "Boxplots of Air and Water", x = "Variable", y = "Value") +
  facet_wrap(~variable, scales = "free") +              # separate plots with their own axes
  theme_minimal() +                                     # use a minimal theme with light grid lines
  scale_fill_manual(values = c("skyblue", "salmon"))    # set the fill colours
# This visualization is helpful for comparing the distributions of the two variables ("Air" and "Water") in a structured and visually distinct way. Using faceting and separate y-axis scales provides clarity when the ranges of the variables differ.
# Simulated example (not taken from the shark data): size measurements in air and water, drawn at random.
shark_size_data <- data.frame(
  Environment = rep(c("Air", "Water"), each = 50),
  Size = c(rnorm(50, mean = 30, sd = 5), rnorm(50, mean = 45, sd = 10))  # random data for size
)
ggplot(shark_size_data, aes(x = Environment, y = Size, fill = Environment)) +
  geom_boxplot() +
  labs(title = "Shark Size in Air and Water", x = "Environment", y = "Shark Size") +
  theme_minimal()
Together with the correlation coefficient, the graphs of the first data set indicate that air and water are not correlated.
The second data set is then loaded into RStudio to take the research further.
# Perform the paired t-test on the two blotch times
t.test(sharksub$blotch1, sharksub$blotch2, paired = TRUE)
Paired t-test
data: sharksub$blotch1 and sharksub$blotch2
t = -17.39, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-1.037176 -0.822301
sample estimates:
mean difference
-0.9297384
# This compares the means of the two variables.
# The t statistic measures the mean difference in units of the standard error of the difference.
# The test shows that blotch1 is smaller than blotch2 on average.
# The p-value is extremely small, so the null hypothesis of no mean difference is rejected.
# Perform the Wilcoxon signed rank test on the two blotch times
wilcox.test(sharksub$blotch1, sharksub$blotch2, paired = TRUE)
Wilcoxon signed rank test with continuity correction
data: sharksub$blotch1 and sharksub$blotch2
V = 12, p-value = 1.606e-09
alternative hypothesis: true location shift is not equal to 0
# Summarise the differences between the two blotch times
summary(sharksub$blotch1 - sharksub$blotch2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-1.1121 -1.0711 -1.0453 -0.9297 -1.0216  0.4488
# This provides an insight into the magnitude of the differences (blotch1 - blotch2): most are close to -1, while the maximum of 0.4488 lies far from the rest.
These tests show whether there are any outliers before the graphs are made, and give an insight into the shark sub data.
Graphs are now used to visualize whether multiple captures have an effect on blotching time and to see whether the blotching time can be predicted.
# Create a boxplot to compare blotch1 and blotch2
boxplot(sharksub$blotch1, sharksub$blotch2,
        names = c("blotch time 1", "blotch time 2"),
        main = "Comparison of Blotch Times", ylab = "Blotch Times")
# Density plot for blotch1 and blotch2
plot(density(sharksub$blotch1), main = "Density Plot of Blotch1 and Blotch2",
     col = "blue", lwd = 2, xlab = "Blotch Values")
lines(density(sharksub$blotch2), col = "red", lwd = 2)
legend("topright", legend = c("Blotch1", "Blotch2"), col = c("blue", "red"), lty = 1, lwd = 2)
# This places the two blotch distributions side by side for comparison, with axis labels and titles.
# Blotch 1 also appears to have values concentrated more tightly around its mean than blotch 2.
# Create a violin plot for blotch1 and blotch2
ggplot(sharksub, aes(x = factor(1), y = blotch1)) +
  geom_violin(fill = "blue", alpha = 0.5) +
  geom_violin(aes(y = blotch2), fill = "red", alpha = 0.5) +
  labs(title = "Violin Plot of Blotch1 and Blotch2", x = "Blotch Times", y = "Values")  # add the labels
# This represents the density distribution of the data. The wider sections show where the data points are denser, and the narrower sections show where they are sparser.
# Create a matplot to make it easier to see whether there are any outliers in the data.
# It compares the two measurements to show how similar or different they are.
matplot(t(data.frame(sharksub$blotch1, sharksub$blotch2)), type = "b", pch = 16, col = "blue",
        xlab = "Blotch Measurements", ylab = "Values",
        main = "Paired Comparison of Blotch1 and Blotch2")
# Create a Bland-Altman plot to show the mean values and any differences.
mean_values <- (sharksub$blotch1 + sharksub$blotch2) / 2
differences <- sharksub$blotch1 - sharksub$blotch2
plot(mean_values, differences,
     xlab = "Mean of Blotch1 and Blotch2", ylab = "Difference (Blotch1 - Blotch2)",
     main = "Bland-Altman Plot", pch = 19, col = "darkgreen")
# Points close to a difference of zero are sharks whose blotch 1 and blotch 2 times are similar.
# Points far from zero on the y-axis are sharks whose two blotch times differ substantially.
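A Bland-Altman plot is usually read against the mean difference (the bias) and the 95% limits of agreement, taken as the bias plus or minus 1.96 standard deviations of the differences. The sketch below adds these reference lines on simulated data; `b1` and `b2` are hypothetical stand-ins for `sharksub$blotch1` and `sharksub$blotch2`.

```r
set.seed(3)
# Simulated paired blotch times; hypothetical values, not the shark data.
b1 <- rnorm(50, mean = 35, sd = 1)
b2 <- b1 + rnorm(50, mean = 1, sd = 0.3)

mean_values <- (b1 + b2) / 2
differences <- b1 - b2

# Bias and 95% limits of agreement.
bias <- mean(differences)
loa <- bias + c(-1.96, 1.96) * sd(differences)

plot(mean_values, differences,
     xlab = "Mean of the two measurements", ylab = "Difference (first - second)",
     main = "Bland-Altman Plot with Limits of Agreement", pch = 19)
abline(h = bias, lty = 1)  # solid line: mean difference
abline(h = loa, lty = 2)   # dashed lines: limits of agreement
```

Points outside the dashed lines are the ones that would merit a closer look as possible outliers.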
plot(1:nrow(sharksub), sharksub$blotch1, type = "b", col = "red", pch = 19,
     xlab = "Observation", ylab = "Blotch Value",
     main = "Line Plot of Blotch1 and Blotch2")
lines(1:nrow(sharksub), sharksub$blotch2, type = "b", col = "blue", pch = 19)
legend("topright", legend = c("Blotch1", "Blotch2"), col = c("red", "blue"), pch = 19)
# This is used to see whether there is any correlation between the two blotch sets.
Overall, these graphs and tests show no correlation between blotch 1 and blotch 2. Possible reasons are that the animals get used to being caught, so their stress levels decrease, or that environmental factors affect both the stress levels and the time the blotching takes to heal.
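Whether a blotching time can be predicted is usually examined with a regression model rather than by eye. The sketch below fits a simple linear model with `lm()` on simulated data. The variables `blotch1` and `blotch2` here are hypothetical vectors, and the choice of the first blotch time as the only predictor is an assumption made for illustration.

```r
set.seed(11)
# Simulated blotch times; hypothetical values, not the shark data.
blotch1 <- rnorm(50, mean = 35, sd = 1)
blotch2 <- 1 + blotch1 + rnorm(50, mean = 0, sd = 0.3)

# Fit a simple linear regression of the second blotch time on the first.
fit <- lm(blotch2 ~ blotch1)

coef(fit)                # intercept and slope
summary(fit)$r.squared   # share of variance explained by the model

# Predicted second blotch time for a first blotch time of 35.
predict(fit, newdata = data.frame(blotch1 = 35))
```

An R-squared close to 1 would suggest that the outcome is well predicted by the chosen variable, while a value near 0 would suggest it is not.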