RMDA Statistical Analysis

Author

Katie Prange (N1331382)

Load packages

It is important to load all of the necessary packages at the start using the library() function.

library(tidyverse)
library(magrittr)
library(vtable)
library(gt)
library(car)
library(gridExtra)
library(tidyr)
library(dplyr)
library(ggpubr)
library(GGally)
library(reshape2)
library(corrplot)
library(caret)
library(Hmisc)

Load datasets

sharks <- read.csv("/Users/katieprange/Documents/NTU/Research Methods/Summative/sharks.csv")
sharksub <- read.csv("/Users/katieprange/Documents/NTU/Research Methods/Summative/sharksub.csv")

Data Wrangling

Step 1: Data cleaning

1.1 Type of variables

Before analysis, it is useful to check the structure of the two datasets. The sharks dataframe comprises 500 observations of 10 variables, and the sharksub dataframe comprises 50 observations of 4 variables. The variable types include:

  • Character: text or string data

  • Numeric: numbers including both integers and decimals

  • Integer: a subset of numeric that specifically represents whole numbers

str(sharks) #Check the structure of the dataset
'data.frame':   500 obs. of  10 variables:
 $ ID    : chr  "SH001" "SH002" "SH003" "SH004" ...
 $ sex   : chr  "Female" "Female" "Female" "Male" ...
 $ blotch: num  37.2 34.5 36.3 35.3 37.4 ...
 $ BPM   : int  148 158 125 161 138 126 166 135 132 127 ...
 $ weight: num  74.7 73.4 71.8 104.6 67.1 ...
 $ length: num  187 189 284 171 264 ...
 $ air   : num  37.7 35.7 34.8 36.2 33.6 ...
 $ water : num  23.4 21.4 20.1 21.6 21.8 ...
 $ meta  : num  64.1 73.7 54.4 86.3 108 ...
 $ depth : num  53.2 49.6 49.4 50.3 49 ...
str(sharksub) #Check the structure of the dataset
'data.frame':   50 obs. of  4 variables:
 $ ID     : chr  "SH269" "SH163" "SH008" "SH239" ...
 $ sex    : chr  "Female" "Female" "Female" "Female" ...
 $ blotch1: num  36.1 33.4 36.3 35 35.7 ...
 $ blotch2: num  37.2 34.4 36.5 36 36.8 ...

1.2 Summary

Next, it is useful to explore the raw data by producing a summary of each dataset. This is important for identifying whether there are any missing values, which could cause problems during analysis. In this case, neither dataframe has missing data.

summary(sharks) # Produce a summary table
      ID                sex                blotch           BPM       
 Length:500         Length:500         Min.   :30.78   Min.   :119.0  
 Class :character   Class :character   1st Qu.:34.16   1st Qu.:129.0  
 Mode  :character   Mode  :character   Median :35.05   Median :142.0  
                                       Mean   :35.13   Mean   :141.8  
                                       3rd Qu.:36.05   3rd Qu.:153.2  
                                       Max.   :40.08   Max.   :166.0  
     weight           length           air            water      
 Min.   : 65.10   Min.   :128.3   Min.   :33.00   Min.   :20.01  
 1st Qu.: 75.68   1st Qu.:172.0   1st Qu.:34.42   1st Qu.:21.55  
 Median : 87.82   Median :211.1   Median :35.43   Median :23.11  
 Mean   : 87.94   Mean   :211.0   Mean   :35.54   Mean   :23.02  
 3rd Qu.:100.40   3rd Qu.:251.8   3rd Qu.:36.71   3rd Qu.:24.37  
 Max.   :110.94   Max.   :291.0   Max.   :38.00   Max.   :25.99  
      meta            depth      
 Min.   : 50.03   Min.   :44.64  
 1st Qu.: 67.39   1st Qu.:48.90  
 Median : 82.45   Median :50.14  
 Mean   : 82.04   Mean   :50.14  
 3rd Qu.: 95.97   3rd Qu.:51.35  
 Max.   :112.45   Max.   :56.83  
summary(sharksub) # Produce a summary table
      ID                sex               blotch1         blotch2     
 Length:50          Length:50          Min.   :32.49   Min.   :33.47  
 Class :character   Class :character   1st Qu.:34.38   1st Qu.:35.31  
 Mode  :character   Mode  :character   Median :34.94   Median :35.94  
                                       Mean   :35.03   Mean   :35.96  
                                       3rd Qu.:35.90   3rd Qu.:36.78  
                                       Max.   :37.07   Max.   :38.18  
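
Missing values can also be counted explicitly rather than read off the summary. A minimal sketch, assuming the two data frames loaded above:

# Count missing values per column (all zeros are expected here)
colSums(is.na(sharks))
colSums(is.na(sharksub))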

1.3 Outliers

It is necessary to identify any outliers in the dataset. First, boxplots are created to visualise the spread of the data. It is evident that both blotch and depth have outliers.

# Visualise outliers using boxplots for sharks dataset
par(mfrow=c(2,5)) # Create a grid of 2 rows and 5 columns for the boxplots

# Loop through each numeric variable to generate a boxplot
numeric_vars <- sharks[, sapply(sharks, is.numeric)]  # Select numeric columns
for (var in colnames(numeric_vars)) {
  boxplot(numeric_vars[[var]], main = var, col = "lightblue") # Specifying the colour of the boxplots
}

The specific rows containing outliers can be identified using the interquartile range (IQR) method. All rows containing outliers are then removed.

# Identify outliers using the IQR method
find_outliers <- function(x) {
  Q1 <- quantile(x, 0.25) # Lower quartile
  Q3 <- quantile(x, 0.75) # Upper quartile
  IQR_val <- Q3 - Q1 # Calculate the IQR value 
  lower_bound <- Q1 - 1.5 * IQR_val # Calculate lower bound
  upper_bound <- Q3 + 1.5 * IQR_val # Calculate upper bound
  return(x < lower_bound | x > upper_bound)
}

# This function is applied to all numeric variables in the dataset
numeric_vars <- sharks[, sapply(sharks, is.numeric)]  # Select numeric columns
outliers <- sapply(numeric_vars, find_outliers) # Outliers are identified

# Create a logical vector to remove rows with outliers
rows_to_remove <- apply(outliers, 1, any)  # Identify rows that have outliers

# Remove rows with outliers
sharks_cleaned <- sharks[!rows_to_remove, ]

# Check the cleaned dataset
summary(sharks_cleaned)
      ID                sex                blotch           BPM       
 Length:490         Length:490         Min.   :31.36   Min.   :119.0  
 Class :character   Class :character   1st Qu.:34.17   1st Qu.:129.0  
 Mode  :character   Mode  :character   Median :35.04   Median :142.5  
                                       Mean   :35.10   Mean   :141.8  
                                       3rd Qu.:36.02   3rd Qu.:154.0  
                                       Max.   :38.66   Max.   :166.0  
     weight           length           air            water      
 Min.   : 65.10   Min.   :128.3   Min.   :33.00   Min.   :20.01  
 1st Qu.: 75.58   1st Qu.:171.6   1st Qu.:34.43   1st Qu.:21.53  
 Median : 87.65   Median :211.3   Median :35.46   Median :23.11  
 Mean   : 87.82   Mean   :211.1   Mean   :35.55   Mean   :23.02  
 3rd Qu.:100.34   3rd Qu.:251.8   3rd Qu.:36.76   3rd Qu.:24.39  
 Max.   :110.94   Max.   :291.0   Max.   :38.00   Max.   :25.99  
      meta            depth      
 Min.   : 50.03   Min.   :45.43  
 1st Qu.: 67.76   1st Qu.:48.91  
 Median : 82.43   Median :50.13  
 Mean   : 81.96   Mean   :50.12  
 3rd Qu.: 95.86   3rd Qu.:51.31  
 Max.   :112.45   Max.   :54.79  
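
As a quick sanity check, the number of removed rows can be confirmed directly. A minimal sketch using the objects created above:

# Confirm how many rows were removed (500 - 490 = 10 expected)
sum(rows_to_remove)
nrow(sharks) - nrow(sharks_cleaned)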

The process is repeated for the sharksub dataframe to visualise outliers using boxplots. These show that there are no outliers to remove.

# Visualise outliers using boxplots for sharksub dataset
par(mfrow = c(1, 2))  # Create a grid with 1 row and 2 columns for plots

# Select numeric variables from sharksub
numeric_vars_sub <- sharksub[, sapply(sharksub, is.numeric)]  # Select numeric columns

# Loop through each numeric variable in sharksub to generate boxplots
for (var in colnames(numeric_vars_sub)) {
  boxplot(numeric_vars_sub[[var]], 
          main = paste("Boxplot of", var), 
          col = "lightblue", 
          ylab = var)
}

Step 2: Checking key assumptions

2.1 Testing for normality

Next, it is important to check normality before applying statistical tests that assume it. First, a visual inspection is conducted for the sharks dataframe using histograms and Q-Q plots.

# Histograms for visual inspection of normality for all numeric variables
par(mfrow=c(2,5))  # Arrange plots in a grid
for (var in colnames(numeric_vars)) {
  hist(numeric_vars[[var]], main = paste("Histogram of", var), col = "lightgreen", breaks = 10) # Assigns the number of breaks visualised by the histogram
}

# Q-Q plots for normality for all numeric variables
par(mfrow=c(2,5))  # Arrange plots in a grid
for (var in colnames(numeric_vars)) {
  qqnorm(numeric_vars[[var]], main = paste("Q-Q Plot of", var))
  qqline(numeric_vars[[var]], col = "red")
}
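
The visual inspection can be complemented with a formal test. A minimal sketch applying the Shapiro-Wilk test to every numeric variable in the sharks dataset (p-values below 0.05 suggest non-normality):

# Shapiro-Wilk p-value for each numeric variable
sapply(numeric_vars, function(x) shapiro.test(x)$p.value)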

A visual inspection is also conducted for the sharksub dataframe using histograms and Q-Q plots.

# Histograms for visual inspection of normality for blotch1 and blotch2 (numeric variables)
par(mfrow = c(1, 2))  # Set up two plots in one row
hist(sharksub$blotch1, main = "Histogram of Blotch1", col = "lightgreen", breaks = 10)
hist(sharksub$blotch2, main = "Histogram of Blotch2", col = "lightblue", breaks = 10)

# Q-Q plots for normality check for blotch1 and blotch2 (numeric variables)
par(mfrow = c(1, 2))  # Set up two Q-Q plots in one row
qqnorm(sharksub$blotch1, main = "Q-Q Plot of Blotch1")
qqline(sharksub$blotch1, col = "red")

qqnorm(sharksub$blotch2, main = "Q-Q Plot of Blotch2")
qqline(sharksub$blotch2, col = "red")

2.2 Multicollinearity testing

High correlations among independent variables can distort results, particularly in regressions, so it is useful to test for them. The Variance Inflation Factor (VIF) is used to detect multicollinearity; values above roughly 5-10 are usually taken to indicate a problem. All of the VIF values here are close to 1, which suggests that there is no significant multicollinearity among the independent variables in the model.

# Fit a linear model with blotch as the dependent variable and other independent variables
model_sharks <- lm(blotch ~ sex + BPM + weight + length + air + water + meta + depth, data = sharks)

# Calculate VIF to check for multicollinearity
vif(model_sharks)
     sex      BPM   weight   length      air    water     meta    depth 
1.005973 1.010600 1.010919 1.018264 1.029337 1.019376 1.018138 1.011213 

Step 3: Visualise relationships

Before conducting hypothesis testing it is useful to visualise simple relationships between dependent and independent variables. This helps to understand potential correlations and trends.

3.1 Scatterplots (numerical variables)

The scatterplots for the sharks dataset show little evidence for correlations except between blotch and depth.

# Pairwise scatterplots with smaller points
pairs(numeric_vars, 
      main = "Scatterplot Matrix", 
      cex = 0.2)  # Reduce point size for readability
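
Since GGally is already loaded, ggpairs() offers an alternative matrix that also displays correlation coefficients. A minimal sketch, assuming the numeric_vars object above:

# Alternative scatterplot matrix with correlation coefficients (GGally)
ggpairs(numeric_vars,
        lower = list(continuous = wrap("points", size = 0.2)))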

The scatterplot for the sharksub dataset shows evidence for a strong correlation between blotch1 and blotch2.

# Scatterplot for the relationship between blotch1 and blotch2
plot(sharksub$blotch1, sharksub$blotch2,
     main = "Scatterplot of Blotch1 vs. Blotch2",
     xlab = "Blotch1", ylab = "Blotch2",
     pch = 19, col = "blue")

Results of data exploration

1) Outliers

  • Sharks: outliers identified for blotch and depth, which were subsequently removed

  • Sharksub: no outliers

2) Normality

  • Sharks: all variables were non-normally distributed

  • Sharksub: variables normally distributed

3) Multicollinearity

  • Sharks: no evidence for multicollinearity among independent variables

  • Sharksub: only two numerical variables so not an issue

4) Potential relationships

  • Sharks: potential correlation between blotch and depth

  • Sharksub: potential correlation between blotch1 and blotch2

Hypothesis Testing

Question 1: Is there a correlation between the variables air and water?

This question involves two continuous variables from the sharks dataset - air and water.

1.1 Testing for normality

Normality was assessed during the data wrangling, but it is helpful to check again for air and water individually. First, here are the histograms.

# Histogram for visual inspection of normality (air)
ggplot(sharks_cleaned, aes(x = air)) + 
  geom_histogram(bins = 30, fill = "lightblue", color = "black") +
  ggtitle("Histogram of Air Temperature")

# Histogram for visual inspection of normality (water)
ggplot(sharks_cleaned, aes(x = water)) + 
  geom_histogram(bins = 30, fill = "lightblue", color = "black") +
  ggtitle("Histogram of Water Temperature")

The Shapiro-Wilk test for normality is also applied. Both variables have a p-value below 0.05, so the null hypothesis of normality is rejected. The data are therefore not normally distributed, and a non-parametric test such as Spearman’s correlation is needed.

shapiro.test(sharks_cleaned$air)  # Test normality for air

    Shapiro-Wilk normality test

data:  sharks_cleaned$air
W = 0.95867, p-value = 1.72e-10
shapiro.test(sharks_cleaned$water)  # Test normality for water

    Shapiro-Wilk normality test

data:  sharks_cleaned$water
W = 0.95943, p-value = 2.284e-10

1.2 Spearman’s Rank Correlation Test

Spearman’s rank correlation is a non-parametric test, so it does not assume normality in the data. It is applied to determine whether there is a correlation between air and water.

# Perform Spearman's rank correlation test
cor.test(sharks_cleaned$air, sharks_cleaned$water, method = "spearman")

    Spearman's rank correlation rho

data:  sharks_cleaned$air and sharks_cleaned$water
S = 20675862, p-value = 0.2288
alternative hypothesis: true rho is not equal to 0
sample estimates:
        rho 
-0.05445596 

1.3 Interpretation of the results

Correlation coefficient (ρ/rho):

  • ρ = -0.054

  • Indicates a very weak negative relationship between the two variables

  • The value is so close to zero that the correlation is almost negligible

  • This indicates that there is no meaningful relationship

p-value

  • p-value = 0.229

  • Greater than 0.05, so the null hypothesis of no correlation is not rejected

  • There is no statistically significant correlation between the variables

1.4 Scatter plots

While the Spearman’s test indicates that there is no significant correlation, it is useful to use a scatterplot to visually inspect the relationship between the two variables. A linear trendline is added, with its confidence interval shown as a shaded region. Various formatting aspects of the graph were adjusted to improve the visualisation.

ggplot(sharks_cleaned, aes(x = air, y = water)) +
  geom_point(color = "#1E3A44", size = 2, alpha = 0.7) +  # Set the points to a darker blue-grey
  geom_smooth(method = "lm", color = "blue", se = TRUE) +  # Add linear regression line with confidence intervals
  labs(
    x = "Air Temperature (°C)", # Label for x axis
    y = "Water Temperature (°C)" # Label for y axis
  ) +
  theme_minimal() +  # Apply minimal theme
  theme(
    panel.grid = element_blank(),  # Remove gridlines 
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1),  # Add border around the plot area (linewidth replaces the deprecated size argument)
    legend.position = "none",  # Remove the automatic legend as it wasn't needed
    axis.text = element_text(size = 12),  # Adjust axis text size
    axis.title = element_text(size = 14, face = "bold")  # Bold axis titles
  )
`geom_smooth()` using formula = 'y ~ x'

Question 2: Does multiple capture have an effect on blotching time?

This question involves the sharksub dataset, in which the continuous variable blotch was measured at two capture times (paired observations).

2.1 Testing for normality

It is important to check the normality of the differences between blotch1 and blotch2 because the paired t-test assumes that the differences between paired observations are normally distributed. The differences are therefore calculated and a Shapiro-Wilk test applied. The p-value is below 0.05, which suggests that the differences are not normally distributed.

# Calculate the differences between Blotch1 and Blotch2
sharksub$blotch_diff <- sharksub$blotch2 - sharksub$blotch1

# Shapiro-Wilk Test for Normality on the differences
shapiro.test(sharksub$blotch_diff)

    Shapiro-Wilk normality test

data:  sharksub$blotch_diff
W = 0.43157, p-value = 1.212e-12

2.2 Wilcoxon Signed-Rank Test

As the differences are non-normally distributed, the Wilcoxon signed-rank test is applied. This is the non-parametric equivalent of the paired t-test.

# Wilcoxon signed-rank test for non-normally distributed differences
wilcox.test(sharksub$blotch1, sharksub$blotch2, paired = TRUE)

    Wilcoxon signed rank test with continuity correction

data:  sharksub$blotch1 and sharksub$blotch2
V = 12, p-value = 1.606e-09
alternative hypothesis: true location shift is not equal to 0
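
Before interpreting the test result, a quick descriptive summary of the paired differences gives useful context. A minimal sketch using the blotch_diff column created above:

# Descriptive summary of the paired differences (blotch2 - blotch1)
summary(sharksub$blotch_diff)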

2.3 Interpretation of the results

p-value

  • p-value < 0.05

  • The null hypothesis is rejected

  • Indicates that there is a statistically significant difference between blotch1 and blotch2

  • The difference between the two capture times is unlikely to have occurred by chance

V value

  • V = 12

  • Represents the sum of the ranks of the positive differences between the two paired groups

2.4 Paired boxplot

To visualise the relationship between blotch1 and blotch2, a boxplot is produced. This visually compares the distributions of the two related variables to show how they differ. The first plot shows a basic paired boxplot.

# Reshape the sharksub dataset to long format
sharksub_long <- sharksub %>%
  pivot_longer(cols = c(blotch1, blotch2), 
               names_to = "blotch_type", 
               values_to = "blotch_value")

# Create the paired boxplot using ggpaired
ggpaired(sharksub_long, 
         x = "blotch_type", 
         y = "blotch_value",
         color = "blotch_type", 
         line.color = "gray", 
         line.size = 0.4,
         palette = "jco") +
  stat_compare_means(paired = TRUE)

# Regardless of adjustments I can't get this boxplot to have paired lines

Despite various adjustments, the code above could not be adapted to draw joining lines between paired observations. As the only alternative that worked, the 50 before and after observations were entered as explicit columns and the paired boxplot built from those (a non-hardcoded equivalent is noted in the comments below).

# The original data from sharksub split into before and after
before <- c(36.07201, 33.38396, 36.29497, 34.98931, 35.70572, 34.90283, 33.11113, 32.49322, 34.27203, 35.54855, 35.99021, 34.88511, 36.83684, 34.79932, 34.41429, 34.742, 34.36868, 34.27528, 36.11878, 33.98649, 34.45879, 34.9396, 35.8195, 34.47858, 33.26947, 32.85958, 34.65632, 36.48702, 34.93594, 36.42572, 36.50001, 36.37596, 35.17775, 34.57425, 36.68703, 35.28866, 35.92057, 34.73333, 33.6012, 35.70079, 35.02983, 35.63334, 34.24796, 35.34228, 36.4289, 33.53, 37.07165, 34.29493, 35.33881, 34.52245)
after <- c(37.1541703, 34.3854788, 36.461024, 36.0389893, 36.7768916, 35.9499149, 34.1044639, 33.4680166, 35.3001909, 36.6150065, 37.0699163, 35.9316633, 37.9419452, 35.8432996, 35.4467187, 34.39458, 35.3997404, 35.3035384, 37.2023434, 35.0060847, 35.4925537, 35.987788, 36.894085, 35.5129374, 34.2675541, 33.8453674, 35.6960096, 37.5816306, 35.9840182, 37.5184916, 37.5950103, 37.4672388, 36.2330825, 35.6114775, 37.7876409, 36.3473198, 35.52544373, 35.7753299, 34.609236, 36.7718137, 36.0807249, 36.7023402, 35.2753988, 36.4025484, 37.521767, 34.5359, 38.1837995, 35.3237779, 35.609952, 34.07365815)

d <- data.frame(before = before, after = after)
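
# Note: the same data frame could be built directly from sharksub, avoiding
# the hardcoded vectors above; a minimal equivalent sketch:
# d <- data.frame(before = sharksub$blotch1, after = sharksub$blotch2)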

# Create a paired boxplot
ggpaired(d, 
         cond1 = "before", cond2 = "after",
         fill = "condition") +
  # Add axis labels
  labs(
    x = "Sample",  # Rename x-axis title to sample
    y = "Blotch Time (seconds)"  # Rename y-axis title
  ) +
  theme(
    panel.grid = element_blank(),  # Remove gridlines
    axis.text = element_text(size = 12, color = "black"),  # Change axis text
    axis.title = element_text(size = 14, face = "bold"),  # Bold axis titles
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),  # Change title
    axis.title.x = element_text(size = 14, face = "bold"),  # Make x-axis title bold
    axis.title.y = element_text(size = 14, face = "bold")  # Make y-axis title bold
  ) +
  # Add custom colors for before/after
  scale_fill_manual(values = c("#FF9E2A", "#00A4A9")) +
  scale_color_manual(values = c("#FF9E2A", "#00A4A9")) +
  # Rename before/after conditions
  scale_x_discrete(labels = c("First Capture", "Second Capture"))

Question 3: Is it possible to predict blotching time?

This question involves the sharks dataset. The response is the continuous variable blotch. The potential predictors are the continuous variables (BPM, weight, length, meta, depth, air, water) and the categorical variable sex.

Understanding relationships with variables

It is helpful to understand the relationships between blotch and other variables to guide the decision about which to include in the model. Earlier it was evident that depth was the only variable that seemed to be correlated with blotch.

3.1 Correlation matrix

A heatmap stating the correlation values can be created using the corrplot function. Similar to the scatterplots, the correlation matrix for the sharks dataset only shows a strong correlation between blotch and depth.

# Select all of the relevant numeric variables
numeric_vars <- sharks_cleaned[, c("blotch", "BPM", "meta", "weight", "length", "depth", "air", "water")]

# Calculate the Pearson correlation matrix
cor_matrix <- cor(numeric_vars, method = "pearson")  # Set the test as Pearson

# Visualise the correlation matrix
corrplot(cor_matrix, 
         method = "ellipse",       # Use ellipses to represent the strength of correlations
         type = "upper",           # Display only the upper half of the matrix
         tl.col = "black",         # Colour of the text
         addCoef.col = "black",    # Add correlation coefficients
         number.cex = 0.8,         # Change font size 
         diag = FALSE,             # Diagonal is hidden
         order = "hclust",         # Order the variables based on hierarchical clustering
         tl.srt = 45               # Rotation of text labels
)

It is also helpful to create a correlation matrix table containing all of the coefficient values and the associated p-values of the correlation tests.

# Select numerical variables
variables <- sharks_cleaned[, c("blotch", "BPM", "meta", "weight", "length", "depth", "air", "water")]

# Compute the correlation matrix (Pearson) and p-values 
cor_matrix <- rcorr(as.matrix(variables), type = "pearson") # Use rcorr function from Hmisc package

The correlation coefficients and p-values can then be combined into a single table.

# Combine the correlation coefficients and p-values
cor_matrix_combined <- round(cor_matrix$r, 2)
pvalue_matrix <- round(cor_matrix$P, 3)

# Create a combined matrix
cor_matrix_display <- cor_matrix_combined
for (i in 1:nrow(cor_matrix_combined)) {
  for (j in 1:ncol(cor_matrix_combined)) {
    cor_matrix_display[i, j] <- paste(cor_matrix_combined[i, j], 
                                      " (p = ", pvalue_matrix[i, j], ")", sep = "")
  }
}

# Show the final matrix
print(cor_matrix_display)
       blotch              BPM                 meta               
blotch "1 (p = NA)"        "-0.04 (p = 0.427)" "-0.03 (p = 0.443)"
BPM    "-0.04 (p = 0.427)" "1 (p = NA)"        "-0.01 (p = 0.808)"
meta   "-0.03 (p = 0.443)" "-0.01 (p = 0.808)" "1 (p = NA)"       
weight "0 (p = 0.966)"     "0.01 (p = 0.784)"  "0.02 (p = 0.591)" 
length "-0.01 (p = 0.893)" "-0.08 (p = 0.079)" "0 (p = 0.923)"    
depth  "0.69 (p = 0)"      "-0.01 (p = 0.76)"  "-0.01 (p = 0.842)"
air    "-0.03 (p = 0.451)" "-0.07 (p = 0.11)"  "0.13 (p = 0.004)" 
water  "-0.04 (p = 0.335)" "0.03 (p = 0.518)"  "0.03 (p = 0.53)"  
       weight              length              depth              
blotch "0 (p = 0.966)"     "-0.01 (p = 0.893)" "0.69 (p = 0)"     
BPM    "0.01 (p = 0.784)"  "-0.08 (p = 0.079)" "-0.01 (p = 0.76)" 
meta   "0.02 (p = 0.591)"  "0 (p = 0.923)"     "-0.01 (p = 0.842)"
weight "1 (p = NA)"        "-0.03 (p = 0.554)" "-0.01 (p = 0.849)"
length "-0.03 (p = 0.554)" "1 (p = NA)"        "-0.07 (p = 0.137)"
depth  "-0.01 (p = 0.849)" "-0.07 (p = 0.137)" "1 (p = NA)"       
air    "-0.05 (p = 0.273)" "-0.03 (p = 0.5)"   "-0.02 (p = 0.687)"
water  "0.09 (p = 0.052)"  "-0.06 (p = 0.191)" "-0.03 (p = 0.485)"
       air                 water              
blotch "-0.03 (p = 0.451)" "-0.04 (p = 0.335)"
BPM    "-0.07 (p = 0.11)"  "0.03 (p = 0.518)" 
meta   "0.13 (p = 0.004)"  "0.03 (p = 0.53)"  
weight "-0.05 (p = 0.273)" "0.09 (p = 0.052)" 
length "-0.03 (p = 0.5)"   "-0.06 (p = 0.191)"
depth  "-0.02 (p = 0.687)" "-0.03 (p = 0.485)"
air    "1 (p = NA)"        "-0.05 (p = 0.238)"
water  "-0.05 (p = 0.238)" "1 (p = NA)"       

3.2 Scatterplot for depth

From the correlation matrix it was evident that depth was the strongest predictor variable. Therefore, a scatterplot is created to visualise the correlation between depth and blotch. The clear linear trend supports fitting a linear regression model.

# Scatter plot of depth against blotch
ggplot(sharks_cleaned, aes(x = depth, y = blotch)) +
  geom_point(color = "#1E3A44", size = 2, alpha = 0.7) +  # Darker blue-grey points
  geom_smooth(method = "lm", color = "blue", se = TRUE) +  # Add linear regression line with confidence interval as shaded region
  labs(
    x = "Depth (m)", # Add x axis label
    y = "Blotch Time (seconds)" # Add y axis label
  ) +
  theme_minimal() +  # Minimal theme applied
  theme(
    panel.grid = element_blank(),  # Remove gridlines
    panel.border = element_rect(color = "black", fill = NA, size = 1),  # Border around the plot area
    legend.position = "none",  # Remove legend as not needed
    axis.text = element_text(size = 12),  # Change axis text size
    axis.title = element_text(size = 14, face = "bold")  # Bold axis titles
  )
`geom_smooth()` using formula = 'y ~ x'

3.3 Pearson’s Correlation Test

A Pearson’s correlation was performed to determine the strength of the correlation between blotch and depth.

# Pearson correlation test between depth and blotch
cor_test <- cor.test(sharks_cleaned$depth, sharks_cleaned$blotch, method = "pearson")

# Display the results
cor_test

    Pearson's product-moment correlation

data:  sharks_cleaned$depth and sharks_cleaned$blotch
t = 21.203, df = 488, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6433494 0.7359091
sample estimates:
     cor 
0.692468 

3.4 Interpretation of the results

Correlation coefficient (r):

  • r = 0.69

  • Moderately strong positive relationship between the two variables

p-value

  • p-value < 0.05

  • There is a statistically significant correlation between the variables

3.5 Testing for normality

An assumption of linear regression is that the residuals are normally distributed, so it is important to test this. First, a simple linear regression model is fitted to the data.

# Fit a linear regression model
lm_depth <- lm(blotch ~ depth, data = sharks_cleaned)

# Get a summary of the model
summary(lm_depth)

Call:
lm(formula = blotch ~ depth, data = sharks_cleaned)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.80724 -0.63964 -0.00208  0.59800  2.82832 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.49049    1.16171    9.03   <2e-16 ***
depth        0.49109    0.02316   21.20   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9784 on 488 degrees of freedom
Multiple R-squared:  0.4795,    Adjusted R-squared:  0.4784 
F-statistic: 449.6 on 1 and 488 DF,  p-value: < 2.2e-16

Then a Q-Q plot is made to visually inspect whether the residuals follow a normal distribution. This provides a visual check alongside the statistical test.

# Q-Q plot of residuals
qqnorm(residuals(lm_depth))
qqline(residuals(lm_depth), col = "red")

A Shapiro-Wilk test was then applied to provide a statistical assessment of normality. The p-value is above 0.05, which suggests that the residuals are normally distributed; the null hypothesis of normality is not rejected.

# Shapiro-Wilk normality test
shapiro.test(residuals(lm_depth))

    Shapiro-Wilk normality test

data:  residuals(lm_depth)
W = 0.99743, p-value = 0.6523

3.6 Testing for homogeneity of variance

An assumption of linear regression is that the variance of the residuals is constant across all levels of the independent variable. Therefore a plot of residuals against fitted values was created to test this assumption. The residuals appear to be evenly spread around zero with no obvious patterns so the assumption is met.

# Plot of residuals against fitted values
plot(fitted(lm_depth), residuals(lm_depth), 
     xlab = "Fitted Values", ylab = "Residuals", 
     main = "Residuals vs Fitted Values")
abline(h = 0, col = "red")
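
A formal check can complement the visual one. Since the car package is already loaded, ncvTest() provides a score test for non-constant error variance; a minimal sketch (a p-value above 0.05 would support the assumption of constant variance):

# Score test for non-constant error variance (car package)
ncvTest(lm_depth)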

3.7 Fitting the regression model

As the tests suggest that the assumptions of normality and homoscedasticity are met, a regression model can be confidently fitted to the data.

# Fit linear regression model
lm_depth <- lm(blotch ~ depth, data = sharks_cleaned)

# Get a summary of the model
summary(lm_depth)

Call:
lm(formula = blotch ~ depth, data = sharks_cleaned)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.80724 -0.63964 -0.00208  0.59800  2.82832 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.49049    1.16171    9.03   <2e-16 ***
depth        0.49109    0.02316   21.20   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9784 on 488 degrees of freedom
Multiple R-squared:  0.4795,    Adjusted R-squared:  0.4784 
F-statistic: 449.6 on 1 and 488 DF,  p-value: < 2.2e-16

A comparison was made with a linear regression model in which all the variables were included. This did not provide much more explanatory power for the response variable, so depth alone appears to be the best predictor.

# Fit linear regression model
lm_all <- lm(blotch ~ depth + length + weight + meta + air + water + BPM, data = sharks_cleaned)

# Get a summary of the model
summary(lm_all)

Call:
lm(formula = blotch ~ depth + length + weight + meta + air + 
    water + BPM, data = sharks_cleaned)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.76355 -0.64919 -0.00181  0.62864  2.77396 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.6904631  1.9116624   6.115 1.99e-09 ***
depth        0.4918029  0.0232996  21.108  < 2e-16 ***
length       0.0010741  0.0009558   1.124    0.262    
weight       0.0006775  0.0033089   0.205    0.838    
meta        -0.0020259  0.0025811  -0.785    0.433    
air         -0.0184273  0.0314997  -0.585    0.559    
water       -0.0158975  0.0266314  -0.597    0.551    
BPM         -0.0023599  0.0031564  -0.748    0.455    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9809 on 482 degrees of freedom
Multiple R-squared:  0.4833,    Adjusted R-squared:  0.4758 
F-statistic:  64.4 on 7 and 482 DF,  p-value: < 2.2e-16
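
The two nested models can also be compared formally with an F-test. A minimal sketch (a non-significant p-value would support retaining the simpler depth-only model):

# Compare the nested models with an F-test
anova(lm_depth, lm_all)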

3.8 Interpretation of the results

Coefficients

  • Intercept = 10.49

  • Slope of the model = 0.49

  • The slope represents the expected change in blotch for a one-unit change in depth

R-squared

  • R-squared = 0.48

  • Represents the proportion of variance in blotch explained by the model

p-value

  • p-value < 0.05

  • Suggests that depth significantly predicts blotch

3.9 Predicting values

The fitted model can now be used to generate predictions, which are compared with the actual values.

# Predict values based on the linear regression model
predictions <- predict(lm_depth, newdata = sharks_cleaned)

# Compare predictions with actual values
comparison <- data.frame(Actual = sharks_cleaned$blotch, Predicted = predictions)
head(comparison)
    Actual Predicted
1 37.17081  36.62957
2 34.54973  34.86786
3 36.32861  34.77040
4 35.33881  35.19104
5 37.39799  34.56967
6 33.54668  33.49400

Predictive modelling

Predictive modelling can be used to make predictions about unknown outcomes based on historic data. In this case, the aim is to predict blotch using the predictor variables.

4.1 Split the dataset into training and testing sets

The first step in building a predictive model is to split the data into training and testing sets. This allows the model to be trained on one subset of the data and tested on a separate subset to evaluate its performance.

# Set a seed for reproducibility
set.seed(40) # The seed was set at 40

# Split the data 
trainIndex <- createDataPartition(sharks_cleaned$blotch, p = 0.8, list = FALSE) # 80% training and 20% testing

# Create training and testing datasets
trainData <- sharks_cleaned[trainIndex, ] # Train the model
testData <- sharks_cleaned[-trainIndex, ] # Evaluate the performance of the model
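
A quick sanity check on the split sizes can be run at this point. A minimal sketch (with 490 rows and p = 0.8, roughly 390-395 training observations are expected; createDataPartition does not always split exactly 80/20):

# Confirm the split proportions
nrow(trainData)  # approximately 80% of 490
nrow(testData)   # approximately 20% of 490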

4.2 Building the Linear Regression model

A linear regression model is fit where blotch is the response variable and depth is the predictor.

# Build the linear regression model using the training data
lm_model <- lm(blotch ~ depth, data = trainData)

# View the model summary
summary(lm_model)

Call:
lm(formula = blotch ~ depth, data = trainData)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8091 -0.6433 -0.0039  0.5958  2.8226 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.67583    1.29544   8.241 2.59e-15 ***
depth        0.48743    0.02585  18.857  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9687 on 392 degrees of freedom
Multiple R-squared:  0.4757,    Adjusted R-squared:  0.4743 
F-statistic: 355.6 on 1 and 392 DF,  p-value: < 2.2e-16

4.3 Interpretation of the results

Coefficients

  • Intercept = 10.68

  • Slope of the model = 0.49

  • The slope represents the expected change in blotch for a one-unit change in depth

R-squared

  • R-squared = 0.48

  • Represents the proportion of variance in blotch explained by the model

p-value

  • p-value < 0.05

  • Suggests that depth significantly predicts blotch

4.4 Evaluate the model - make predictions

It is important to assess how well the model performs. This involves making predictions on the test set.

# Make predictions on the test subset of data
predictions <- predict(lm_model, newdata = testData)

# View the first few predictions from the model
head(predictions)
      11       13       16       26       50       52 
34.35365 35.20272 35.37245 36.89665 35.63074 33.86592 

4.5 Evaluate the model - calculate performance metrics

Another important evaluation is to calculate:

Mean squared error (MSE)

  • Measures the average squared difference between predicted and actual values

R-squared (R²)

  • Indicates how much of the variance in blotch is explained by depth

# Calculate Mean Squared Error
mse <- mean((predictions - testData$blotch)^2)
cat("Mean Squared Error (MSE):", mse, "\n")
Mean Squared Error (MSE): 1.03445 
# Calculate R-squared
rsq <- 1 - sum((testData$blotch - predictions)^2) / sum((testData$blotch - mean(testData$blotch))^2)
cat("R-squared (R²):", rsq, "\n")
R-squared (R²): 0.4915227 
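
Since caret is already loaded, its helper functions provide equivalent metrics. A minimal sketch (RMSE is simply the square root of the MSE, and note that caret's R2() defaults to the squared correlation, which can differ slightly from the formula used above):

# Equivalent metrics using caret helpers
RMSE(predictions, testData$blotch)
R2(predictions, testData$blotch)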

4.6 Visualising the results

Then a scatterplot is created to visualise how well the model’s predictions match the actual values.

# Scatter plot of Actual vs Predicted Blotch
ggplot(testData, aes(x = blotch, y = predictions)) +
  geom_point(color = "#1E3A44", size = 2, alpha = 0.7) +  # Darker blue-grey points
  geom_smooth(method = "lm", color = "red", se = TRUE) +  # Add linear regression line with confidence interval
  labs(
    x = "Actual Blotch Time (seconds)", 
    y = "Predicted Blotch Time (seconds)"
  ) +
  theme_minimal() +  # Minimal theme
  theme(
    panel.grid = element_blank(),  # Remove gridlines
    panel.border = element_rect(color = "black", fill = NA, size = 1),  # Border around the plot area
    legend.position = "none",  # Remove legend
    axis.text = element_text(size = 12),  # Adjust axis text size
    axis.title = element_text(size = 14, face = "bold")  # Bold axis titles
  )
`geom_smooth()` using formula = 'y ~ x'

4.7 Checking residuals

It is important to inspect the residuals to check if the model is valid.

  • Residuals vs Fitted values plot: this plot should show no patterns

  • Residuals histogram: an approximately normal, symmetric distribution centred on zero is ideal

# Residuals
residuals <- testData$blotch - predictions

# Plot the residuals against the fitted values
plot(predictions, residuals, main = "Residuals vs Fitted Values", xlab = "Predicted Blotch (secs)", ylab = "Residuals")
abline(h = 0, col = "red")

# Histogram of residuals
hist(residuals, main = "Residuals Histogram", xlab = "Residuals")

4.8 Make predictions on new data

The model can be used to make predictions of blotch for new depth values. Note that a depth of 70 m lies well outside the observed range (roughly 45-55 m), so that prediction is an extrapolation and should be treated with caution.

# Example of new data (depth values)
new_data <- data.frame(depth = c(55, 70))

# Make predictions on the new data
new_predictions <- predict(lm_model, newdata = new_data)
print(new_predictions)
       1        2 
37.48442 44.79585
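
Prediction uncertainty can also be quantified by requesting intervals from predict(). A minimal sketch (interval = "prediction" returns 95% prediction intervals by default):

# Point predictions with 95% prediction intervals
predict(lm_model, newdata = new_data, interval = "prediction")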