sharks

This analysis uses two shark data sets to answer three questions:

Is there a correlation between air and water?

Can blotching time be predicted?

Do multiple captures have an effect on blotching time?

First, the required packages are loaded into RStudio so that the graphs and tests needed to answer the three questions can be produced.

The first data set is then loaded into RStudio.

# read_excel() comes from the readxl package, which library(tidyverse) does not attach.
library(readxl)

# This reads the data.
sharks <- read_excel("C:/Users/Natas/OneDrive/Documents/workbook/sharks.xlsx")

# Check that the data set has loaded into RStudio

head(sharks)
# A tibble: 6 × 10
  ID    sex    blotch   BPM weight length   air water  meta depth
  <chr> <chr>   <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 SH001 Female   37.2   148   74.7   187.  37.7  23.4  64.1  53.2
2 SH002 Female   34.5   158   73.4   189.  35.7  21.4  73.7  49.6
3 SH003 Female   36.3   125   71.8   284.  34.8  20.1  54.4  49.4
4 SH004 Male     35.3   161  105.    171.  36.2  21.6  86.3  50.3
5 SH005 Female   37.4   138   67.1   264.  33.6  21.8 108.   49.0
6 SH006 Male     33.5   126  110.    270.  36.4  20.9 109.   46.8

The first step is to find the mean, median and range of the first data set. These show where the centre of the data lies and how spread out it is; a mean far from the median, or a very wide range, can hint at outliers.
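The mean, median and range describe centre and spread, but they do not by themselves pick out individual outliers. One common convention is the 1.5 × IQR rule that boxplots use. The following is a minimal sketch only: it assumes simulated values in a hypothetical vector `shark_length`, not the real `sharks$length` column, and the extreme value of 400 is planted on purpose.

```r
# Simulated lengths standing in for sharks$length (NOT values from the data set).
set.seed(42)
shark_length <- c(rnorm(100, mean = 211, sd = 30), 400)  # 400 is a planted extreme value

# Quartiles and the 1.5 * IQR fences used by boxplots.
q <- unname(quantile(shark_length, probs = c(0.25, 0.75), na.rm = TRUE))
iqr <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers <- shark_length[shark_length < lower_fence | shark_length > upper_fence]
outliers
```

To try this on the real data, `shark_length` would be replaced by `sharks$length`.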

#Loading the necessary libraries for this data set.
library(tidyverse)



# Calculating the mean, median and range for the length variable

mean_length <- mean(sharks$length, na.rm = TRUE)     # mean of the "length" of sharks in the data set
median_length <- median(sharks$length, na.rm = TRUE) # median of length
range_length <- range(sharks$length, na.rm = TRUE)   # range (minimum and maximum) of length

# Display results of the length of sharks in both sexes.
mean_length
[1] 211.0442
median_length
[1] 211.0681
range_length
[1] 128.2534 290.9527

A further check uses a one-sample t-test on the blotch times in the first data set.

# This calculates the 95% confidence interval for the mean of the blotch times.

t.test(sharks$blotch)$conf.int
[1] 34.99998 35.25084
attr(,"conf.level")
[1] 0.95
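# This gives the range in which the mean blotch time is likely to lie (95% confidence).

# Only one variable is tested here, so the interval does not compare two means. It shows that the mean blotch time is estimated at between about 35.00 and 35.25. The interval excluding zero only means that the mean differs from zero, which is the default value t.test() tests against.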
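To make clear what `t.test(x)$conf.int` reports, the interval can be rebuilt by hand as mean ± t × standard error. This is a sketch under stated assumptions: the vector `blotch` below is simulated, and its mean and standard deviation are made-up values, not taken from `sharks$blotch`.

```r
# Simulated blotch times (NOT the real data; parameters are assumptions).
set.seed(1)
blotch <- rnorm(500, mean = 35, sd = 1.4)

# Manual 95% confidence interval for the mean.
n <- length(blotch)
se <- sd(blotch) / sqrt(n)             # standard error of the mean
t_crit <- qt(0.975, df = n - 1)        # two-sided 95% critical value
ci_manual <- mean(blotch) + c(-1, 1) * t_crit * se

# The same interval as reported by t.test().
ci_ttest <- as.numeric(t.test(blotch)$conf.int)

ci_manual
ci_ttest
```

If the aim is to compare the mean with a meaningful reference value rather than zero, `t.test()` accepts that value through its `mu` argument.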

Next, the mean of the blotch times is calculated as a single summary of this column.

# Calculate the mean of the blotch times in the data set.

mean(sharks$blotch, na.rm = TRUE)
[1] 35.12541
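The mean on its own does not say whether blotching time can be predicted. One possible approach is a linear regression of blotch time on other measured variables. The sketch below is illustrative only: the data frame `sim` is simulated, the column names `depth` and `BPM` are borrowed from the data set's header, and the relationship between them and `blotch` is invented for the example, so it says nothing about the real sharks.

```r
# Simulated data (NOT the real sharks data).
set.seed(7)
n <- 200
sim <- data.frame(
  depth = rnorm(n, mean = 50, sd = 2),
  BPM   = rnorm(n, mean = 140, sd = 15)
)
# Invented relationship for illustration: blotch depends on depth but not on BPM.
sim$blotch <- 10 + 0.5 * sim$depth + rnorm(n, sd = 1)

# Fit a linear model with both candidate predictors.
fit <- lm(blotch ~ depth + BPM, data = sim)
summary(fit)$coefficients   # estimates, standard errors, t values, p-values
summary(fit)$r.squared      # share of variance in blotch explained by the model

# Predict blotch time for one new (hypothetical) shark, with a 95% prediction interval.
new_shark <- data.frame(depth = 52, BPM = 150)
pred <- predict(fit, newdata = new_shark, interval = "prediction")
pred
```

On the real data the same call would use `data = sharks`; which predictors matter there can only be judged from that model's own output.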

AIR AND WATER

The correlation coefficient between air and water is calculated before drawing any graphs.

cor(sharks$air, sharks$water, use = "complete.obs")
[1] -0.05524051
# This calculates the correlation coefficient between the two variables air and water in the shark data set. A value of about -0.06 is very close to zero, which indicates almost no linear relationship.
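`cor()` returns only the coefficient. To judge whether a coefficient differs from zero, `cor.test()` also gives a p-value and a confidence interval. A minimal sketch, assuming simulated vectors `air` and `water` that are generated independently of each other (the means and standard deviations are made up, not taken from the data set):

```r
# Simulated, independent values (NOT the real sharks data).
set.seed(3)
air   <- rnorm(500, mean = 35.5, sd = 1.4)
water <- rnorm(500, mean = 23,   sd = 1.7)   # generated independently of air

# Pearson correlation test: coefficient, 95% confidence interval and p-value.
res <- cor.test(air, water, method = "pearson")
res$estimate   # correlation coefficient, same value as cor(air, water)
res$conf.int   # 95% confidence interval for the coefficient
res$p.value    # p-value for the null hypothesis of zero correlation
```

A confidence interval that contains zero is consistent with no linear correlation.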

Further tests compare the air and water values. Note that these test for a difference between the two means, not for a correlation.

# Check for normality of differences
shapiro.test(sharks$air - sharks$water)

    Shapiro-Wilk normality test

data:  sharks$air - sharks$water
W = 0.98944, p-value = 0.001154
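# The Shapiro-Wilk p-value (0.001154) is below 0.05, so the differences deviate from a normal distribution.
# Perform the paired t-test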
t_test_result <- t.test(sharks$air, sharks$water, paired = TRUE)

# Display the results
print(t_test_result)

    Paired t-test

data:  sharks$air and sharks$water
t = 123.98, df = 499, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 12.31642 12.71305
sample estimates:
mean difference 
       12.51474 
# Perform the Wilcoxon signed-rank test, which does not assume normally distributed differences
wilcox_test_result <- wilcox.test(sharks$air, sharks$water, paired = TRUE)

# Display the result
print(wilcox_test_result)

    Wilcoxon signed rank test with continuity correction

data:  sharks$air and sharks$water
V = 125250, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
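A paired test and a correlation answer different questions: two variables can differ strongly in level while being uncorrelated. The sketch below shows this and also one way to choose between the paired t-test and the Wilcoxon test from the Shapiro-Wilk result. It assumes simulated vectors `x` and `y` (made-up parameters, not the real air and water columns), and the 0.05 cut-off is a common convention rather than a rule.

```r
# Simulated, independent values with clearly different means (NOT the real data).
set.seed(11)
x <- rnorm(100, mean = 35, sd = 1.5)
y <- rnorm(100, mean = 23, sd = 1.5)

# Correlation: close to zero because x and y were generated independently.
r <- cor(x, y)

# Choose the paired test according to the normality of the differences.
d <- x - y
normal_enough <- shapiro.test(d)$p.value > 0.05
result <- if (normal_enough) {
  t.test(x, y, paired = TRUE)
} else {
  wilcox.test(x, y, paired = TRUE)
}

r
result$method
result$p.value
```

Here the paired test reports a very small p-value because the means differ, even though the correlation is near zero.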

Graphs are now used to visualize whether there is a correlation between air and water.

# Create a scatter plot with a regression line
ggplot(sharks, aes(x = air, y = water)) +
  geom_point(color = "black", size = 2) +          # Add scatter points
  geom_smooth(method = "lm", color = "red", se = TRUE) + # Add regression line
  labs(
    title = "Scatter Plot of Air vs Water",
    x = "Air",
    y = "Water"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

# The graph indicates there is no strong correlation between air and water, as the points appear randomly scattered.

Boxplots are used next to compare the distributions of the air and water data and to show any differences between them.

library(tidyr)
library(ggplot2)

# Reshape data using pivot_longer
sharks_melted <- sharks %>%
  pivot_longer(cols = c("air", "water"), names_to = "variable", values_to = "value")

# Create boxplot
ggplot(sharks_melted, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot() +
  labs(title = "Boxplots of Air and Water", x = "Variable", y = "Value") +
  theme_minimal() +
  scale_fill_manual(values = c("skyblue", "salmon"))

# This boxplot shows that air has a higher median and a narrower range, which indicates more consistent values.
# The same boxplots again, faceted so that each variable gets its own panel and y-axis scale.
ggplot(sharks_melted, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot() +
  labs(title = "Boxplots of Air and Water", x = "Variable", y = "Value") +
  facet_wrap(~variable, scales = "free") +  # Separate plots
  theme_minimal() + # applies a minimal theme with light grid lines
  scale_fill_manual(values = c("skyblue", "salmon")) # sets the fill colours

# This visualization is helpful for comparing the distributions of the two variables ("Air" and "Water") in a structured and visually distinct way. Using faceting and separate y-axis scales provides clarity when the ranges of the variables differ.
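# Simulated (randomly generated) size values for illustration only; these are not measurements from sharks.xlsx
set.seed(123)  # makes the random values reproducible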
shark_size_data <- data.frame(
  Environment = rep(c("Air", "Water"), each = 50),
  Size = c(rnorm(50, mean = 30, sd = 5), rnorm(50, mean = 45, sd = 10))  # random data for size
)

ggplot(shark_size_data, aes(x = Environment, y = Size, fill = Environment)) +
  geom_boxplot() +
  labs(title = "Shark Size in Air and Water", x = "Environment", y = "Shark Size") +
  theme_minimal()

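Taken together, the correlation coefficient (about -0.06) and the scatter plot show that air and water are not correlated in the first data set. The boxplots show that the two variables differ in level, which is a separate point from correlation.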

The second data set is now loaded into RStudio to take the research further.

sharksub <- read_excel("C:/Users/Natas/OneDrive/Documents/workbook/sharksub.xlsx")

head(sharksub)
# A tibble: 6 × 4
  ID    sex    blotch1 blotch2
  <chr> <chr>    <dbl>   <dbl>
1 SH269 Female    36.1    37.2
2 SH163 Female    33.4    34.4
3 SH008 Female    36.3    36.5
4 SH239 Female    35.0    36.0
5 SH332 Female    35.7    36.8
6 SH328 Female    34.9    35.9

The second data set is tested to compare the first and second blotch times for each shark.

t.test(sharksub$blotch1, sharksub$blotch2, paired = TRUE)

    Paired t-test

data:  sharksub$blotch1 and sharksub$blotch2
t = -17.39, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -1.037176 -0.822301
sample estimates:
mean difference 
     -0.9297384 
# Comparing the means of the two variables.
# The t statistic expresses the mean difference in units of its standard error.
# The test shows that blotch1 is smaller than blotch2 on average (mean difference of about -0.93).
# The p-value is extremely small, so we reject the null hypothesis of no mean difference.
# The Wilcoxon signed-rank test below is a non-parametric check of the same comparison.
wilcox.test(sharksub$blotch1, sharksub$blotch2, paired = TRUE)

    Wilcoxon signed rank test with continuity correction

data:  sharksub$blotch1 and sharksub$blotch2
V = 12, p-value = 1.606e-09
alternative hypothesis: true location shift is not equal to 0
# Pearson correlation between the first and second blotch times
cor.test(sharksub$blotch1, sharksub$blotch2, method = "pearson")

    Pearson's product-moment correlation

data:  sharksub$blotch1 and sharksub$blotch2
t = 20.154, df = 48, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9057584 0.9689707
sample estimates:
      cor 
0.9456842 
# Difference between the two blotch times for each shark
sharksub$blotch_diff <- sharksub$blotch1 - sharksub$blotch2
summary(sharksub$blotch_diff)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.1121 -1.0711 -1.0453 -0.9297 -1.0216  0.4488 
# This provides an insight into the magnitude of differences.
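A high correlation and a significant paired difference can occur together: the correlation describes how closely the second measurement tracks the first, while the paired test asks whether the second is shifted up or down on average. A minimal sketch with simulated data, where the shift of 0.9 and the noise level are assumptions chosen for illustration and are not estimates from `sharksub`:

```r
# Simulated paired measurements (NOT the real sharksub data).
set.seed(5)
blotch1 <- rnorm(50, mean = 35, sd = 1.1)
blotch2 <- blotch1 + 0.9 + rnorm(50, sd = 0.3)  # same sharks, shifted upwards plus noise

# Strong correlation: blotch2 follows blotch1 closely.
r <- cor(blotch1, blotch2)

# Paired t-test: the average shift between the two measurements.
tt <- t.test(blotch1, blotch2, paired = TRUE)

r
tt$estimate   # mean of blotch1 - blotch2
tt$p.value
```

Both results hold at once, so a strong correlation does not mean that the two measurements are the same.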

These tests check the data before the graphs are made. In the summary of differences, the maximum (0.4488) lies far from the quartiles (about -1.07 to -1.02), which suggests a few outlying pairs in the shark sub data.

Graphs are now used to visualize whether multiple captures have an effect on blotching time and whether the blotching time can be predicted.

# Creating a boxplot to compare blotch1 and blotch2
boxplot(sharksub$blotch1, sharksub$blotch2,
        names = c("blotch time 1", "blotch time 2"),
        main = "Comparison of Blotch Times", 
        ylab = "Blotch Times")

# Density plot for blotch1 and blotch2
plot(density(sharksub$blotch1), main = "Density Plot of Blotch1 and Blotch2", col = "blue", lwd = 2, xlab = "Blotch Values")
lines(density(sharksub$blotch2), col = "red", lwd = 2)
legend("topright", legend = c("Blotch1", "Blotch2"), col = c("blue", "red"), lty = 1, lwd = 2)

# The boxplot places both blotch times side by side for comparison, with axis labels and a title. The density plot shows that blotch 1 has values more tightly concentrated around its mean than blotch 2.
# Create a violin plot for blotch1 and blotch2
ggplot(sharksub, aes(x = factor(1), y = blotch1)) + 
  geom_violin(fill = "blue", alpha = 0.5) + 
  geom_violin(aes(y = blotch2), fill = "red", alpha = 0.5) + 
  labs(title = "Violin Plot of Blotch1 and Blotch2", x = "Blotch Times", y = "Values") # adds the labels

# This represents the density distribution of the data. The wider sections indicate where the data points are denser. The narrower sections indicate a lower density of points.
# Creating a matplot to make it clearer whether there are any outliers in the data; it compares the two measurements to show how similar or different they are.
matplot(t(data.frame(sharksub$blotch1, sharksub$blotch2)), type = "b", pch = 16, col = "blue",
        xlab = "Blotch Measurements", ylab = "Values", main = "Paired Comparison of Blotch1 and Blotch2")

# This creates a Bland-Altman plot to show the mean values and any differences.
mean_values <- (sharksub$blotch1 + sharksub$blotch2) / 2
differences <- sharksub$blotch1 - sharksub$blotch2

plot(mean_values, differences, xlab = "Mean of Blotch1 and Blotch2", ylab = "Difference (Blotch1 - Blotch2)", 
     main = "Bland-Altman Plot", pch = 19, col = "darkgreen")

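# Each point is one shark: the x-axis is the mean of its two blotch times and the y-axis is the difference between them. Points close together on the y-axis are sharks whose two measurements differ by a similar amount; points far from the rest are pairs that differ unusually.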
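A Bland-Altman plot is usually read with two additions: a line at the mean difference (the bias) and lines at the limits of agreement, conventionally the bias ± 1.96 standard deviations of the differences. The sketch below adds these with base R. It uses simulated data with made-up parameters, not the `sharksub` values.

```r
# Simulated paired measurements (NOT the real sharksub data).
set.seed(9)
blotch1 <- rnorm(50, mean = 35, sd = 1.1)
blotch2 <- blotch1 + 0.9 + rnorm(50, sd = 0.3)

mean_values <- (blotch1 + blotch2) / 2
differences <- blotch1 - blotch2

# Bias and 95% limits of agreement.
bias <- mean(differences)
loa <- bias + c(-1.96, 1.96) * sd(differences)

plot(mean_values, differences,
     xlab = "Mean of Blotch1 and Blotch2", ylab = "Difference (Blotch1 - Blotch2)",
     main = "Bland-Altman Plot with Limits of Agreement", pch = 19, col = "darkgreen")
abline(h = bias, lty = 1)   # solid line: mean difference
abline(h = loa, lty = 2)    # dashed lines: limits of agreement
```

Points outside the dashed lines are the pairs whose difference is unusual compared with the rest.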
plot(1:nrow(sharksub), sharksub$blotch1, type = "b", col = "red", pch = 19, xlab = "Observation", ylab = "Blotch Value",
     main = "Line Plot of Blotch1 and Blotch2")
lines(1:nrow(sharksub), sharksub$blotch2, type = "b", col = "blue", pch = 19) 
legend("topright", legend = c("Blotch1", "Blotch2"), col = c("red", "blue"), pch = 19)

# Used to see whether there is any correlation between the two blotch sets.

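Overall, these graphs and tests show that blotch 1 and blotch 2 are strongly correlated (Pearson r = 0.95) but not equal: the second blotching time is on average about 0.93 higher than the first (paired t-test, p < 2.2e-16), so multiple captures do appear to affect blotching time. Possible reasons include a change in the animals' stress levels when they are caught again, or environmental factors that affect stress levels and the time the blotching takes to heal.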