RMDA Statistical Analysis
Load packages
It is important to load all of the necessary packages at the start using the library() function.
library(tidyverse)
library(magrittr)
library(vtable)
library(gt)
library(car)
library(gridExtra)
library(tidyr)
library(dplyr)
library(ggpubr)
library(GGally)
library(reshape2)
library(corrplot)
library(caret)
library(Hmisc)
Load datasets
sharks <- read.csv("/Users/katieprange/Documents/NTU/Research Methods/Summative/sharks.csv")
sharksub <- read.csv("/Users/katieprange/Documents/NTU/Research Methods/Summative/sharksub.csv")
Data Wrangling
Step 1: Data cleaning
1.1 Type of variables
Before analysis, it is useful to check the structure of the two datasets. The sharks dataframe comprises 500 observations of 10 variables and the sharksub dataframe comprises 50 observations of 4 variables. The variable types include:
Character: text or string data
Numeric: numbers including both integers and decimals
Integer: a subset of numeric that specifically represents whole numbers
str(sharks) #Check the structure of the dataset
'data.frame': 500 obs. of 10 variables:
$ ID : chr "SH001" "SH002" "SH003" "SH004" ...
$ sex : chr "Female" "Female" "Female" "Male" ...
$ blotch: num 37.2 34.5 36.3 35.3 37.4 ...
$ BPM : int 148 158 125 161 138 126 166 135 132 127 ...
$ weight: num 74.7 73.4 71.8 104.6 67.1 ...
$ length: num 187 189 284 171 264 ...
$ air : num 37.7 35.7 34.8 36.2 33.6 ...
$ water : num 23.4 21.4 20.1 21.6 21.8 ...
$ meta : num 64.1 73.7 54.4 86.3 108 ...
$ depth : num 53.2 49.6 49.4 50.3 49 ...
str(sharksub) #Check the structure of the dataset
'data.frame': 50 obs. of 4 variables:
$ ID : chr "SH269" "SH163" "SH008" "SH239" ...
$ sex : chr "Female" "Female" "Female" "Female" ...
$ blotch1: num 36.1 33.4 36.3 35 35.7 ...
$ blotch2: num 37.2 34.4 36.5 36 36.8 ...
1.2 Summary
Next it is useful to explore the raw data by producing a summary of each dataset. This is important for identifying whether there are any missing values, which could cause problems during data analysis. In this case neither dataframe has missing data (a direct count is shown after the summaries).
summary(sharks) # Produce a summary table
ID sex blotch BPM
Length:500 Length:500 Min. :30.78 Min. :119.0
Class :character Class :character 1st Qu.:34.16 1st Qu.:129.0
Mode :character Mode :character Median :35.05 Median :142.0
Mean :35.13 Mean :141.8
3rd Qu.:36.05 3rd Qu.:153.2
Max. :40.08 Max. :166.0
weight length air water
Min. : 65.10 Min. :128.3 Min. :33.00 Min. :20.01
1st Qu.: 75.68 1st Qu.:172.0 1st Qu.:34.42 1st Qu.:21.55
Median : 87.82 Median :211.1 Median :35.43 Median :23.11
Mean : 87.94 Mean :211.0 Mean :35.54 Mean :23.02
3rd Qu.:100.40 3rd Qu.:251.8 3rd Qu.:36.71 3rd Qu.:24.37
Max. :110.94 Max. :291.0 Max. :38.00 Max. :25.99
meta depth
Min. : 50.03 Min. :44.64
1st Qu.: 67.39 1st Qu.:48.90
Median : 82.45 Median :50.14
Mean : 82.04 Mean :50.14
3rd Qu.: 95.97 3rd Qu.:51.35
Max. :112.45 Max. :56.83
summary(sharksub) # Produce a summary table
ID sex blotch1 blotch2
Length:50 Length:50 Min. :32.49 Min. :33.47
Class :character Class :character 1st Qu.:34.38 1st Qu.:35.31
Mode :character Mode :character Median :34.94 Median :35.94
Mean :35.03 Mean :35.96
3rd Qu.:35.90 3rd Qu.:36.78
Max. :37.07 Max. :38.18
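As a direct check (a minimal sketch), the number of missing values in each column can be counted explicitly; both results should be all zeros here:
# Count missing values per column in both dataframes
colSums(is.na(sharks))
colSums(is.na(sharksub))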
1.3 Outliers
It is necessary to identify any outliers in the dataset. First, boxplots are created to visualise the spread of the data. It is evident that both blotch and depth have outliers.
# Visualise outliers using boxplots for sharks dataset
par(mfrow=c(2,5)) # Create a grid of 2 rows and 5 columns for the boxplots
# Loop through each numeric variable to generate a boxplot
numeric_vars <- sharks[, sapply(sharks, is.numeric)] # Select numeric columns
for (var in colnames(numeric_vars)) {
  boxplot(numeric_vars[[var]], main = var, col = "lightblue") # Specify the colour of the boxplots
}
The specific rows containing outliers can be identified using the interquartile range (IQR) method. All of the rows containing outliers are then removed.
# Identify outliers using the IQR method
find_outliers <- function(x) {
  Q1 <- quantile(x, 0.25) # Lower quartile
  Q3 <- quantile(x, 0.75) # Upper quartile
  IQR_val <- Q3 - Q1 # Calculate the IQR value
  lower_bound <- Q1 - 1.5 * IQR_val # Calculate lower bound
  upper_bound <- Q3 + 1.5 * IQR_val # Calculate upper bound
  return(x < lower_bound | x > upper_bound)
}
# This function is applied to all numeric variables in the dataset
numeric_vars <- sharks[, sapply(sharks, is.numeric)] # Select numeric columns
outliers <- sapply(numeric_vars, find_outliers) # Identify outliers

# Create a logical vector to remove rows with outliers
rows_to_remove <- apply(outliers, 1, any) # Identify rows that have outliers

# Remove rows with outliers
sharks_cleaned <- sharks[!rows_to_remove, ]
# Check the cleaned dataset
summary(sharks_cleaned)
ID sex blotch BPM
Length:490 Length:490 Min. :31.36 Min. :119.0
Class :character Class :character 1st Qu.:34.17 1st Qu.:129.0
Mode :character Mode :character Median :35.04 Median :142.5
Mean :35.10 Mean :141.8
3rd Qu.:36.02 3rd Qu.:154.0
Max. :38.66 Max. :166.0
weight length air water
Min. : 65.10 Min. :128.3 Min. :33.00 Min. :20.01
1st Qu.: 75.58 1st Qu.:171.6 1st Qu.:34.43 1st Qu.:21.53
Median : 87.65 Median :211.3 Median :35.46 Median :23.11
Mean : 87.82 Mean :211.1 Mean :35.55 Mean :23.02
3rd Qu.:100.34 3rd Qu.:251.8 3rd Qu.:36.76 3rd Qu.:24.39
Max. :110.94 Max. :291.0 Max. :38.00 Max. :25.99
meta depth
Min. : 50.03 Min. :45.43
1st Qu.: 67.76 1st Qu.:48.91
Median : 82.43 Median :50.13
Mean : 81.96 Mean :50.12
3rd Qu.: 95.86 3rd Qu.:51.31
Max. :112.45 Max. :54.79
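As a quick sanity check, the number of rows dropped can be confirmed directly; using the rows_to_remove vector from above, both lines below should report the 10 removed rows (500 - 490):
# Confirm how many rows were flagged and removed as outliers
sum(rows_to_remove)
nrow(sharks) - nrow(sharks_cleaned)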
The process is repeated for the sharksub dataframe to visualise outliers using boxplots. It is evident that there are no outliers to remove.
# Visualise outliers using boxplots for sharksub dataset
par(mfrow = c(1, 2)) # Create a grid with 1 row and 2 columns for plots
# Select numeric variables from sharksub
numeric_vars_sub <- sharksub[, sapply(sharksub, is.numeric)] # Select numeric columns
# Loop through each numeric variable in sharksub to generate boxplots
for (var in colnames(numeric_vars_sub)) {
boxplot(numeric_vars_sub[[var]],
main = paste("Boxplot of", var),
col = "lightblue",
ylab = var)
}
Step 2: Checking key assumptions
2.1 Testing for normality
Next it is important to test the assumption of normality before applying statistical tests that rely on it. First, a visual inspection is conducted for the sharks dataframe using histograms and Q-Q plots.
# Histograms for visual inspection of normality for all numeric variables
par(mfrow=c(2,5)) # Arrange plots in a grid
for (var in colnames(numeric_vars)) {
hist(numeric_vars[[var]], main = paste("Histogram of", var), col = "lightgreen", breaks = 10) # Assigns the number of breaks visualised by the histogram
}
# Q-Q plots for normality for all numeric variables
par(mfrow=c(2,5)) # Arrange plots in a grid
for (var in colnames(numeric_vars)) {
qqnorm(numeric_vars[[var]], main = paste("Q-Q Plot of", var))
qqline(numeric_vars[[var]], col = "red")
}
A visual inspection is also conducted for the sharksub dataframe using histograms and Q-Q plots.
# Histograms for visual inspection of normality for blotch1 and blotch2 (numeric variables)
par(mfrow = c(1, 2)) # Set up two plots in one row
hist(sharksub$blotch1, main = "Histogram of Blotch1", col = "lightgreen", breaks = 10)
hist(sharksub$blotch2, main = "Histogram of Blotch2", col = "lightblue", breaks = 10)
# Q-Q plots for normality check for blotch1 and blotch2 (numeric variables)
par(mfrow = c(1, 2)) # Set up two Q-Q plots in one row
qqnorm(sharksub$blotch1, main = "Q-Q Plot of Blotch1")
qqline(sharksub$blotch1, col = "red")
qqnorm(sharksub$blotch2, main = "Q-Q Plot of Blotch2")
qqline(sharksub$blotch2, col = "red")
2.2 Multicollinearity testing
It is also useful to test whether there are high correlations among the independent variables, as multicollinearity can distort results, particularly in regressions. The Variance Inflation Factor (VIF) is used to detect multicollinearity; values close to 1 indicate none, while values above about 5-10 are commonly taken as problematic. All of the VIF values here are close to 1, which suggests that there is no significant multicollinearity among the independent variables in the model.
# Fit a linear model with blotch as the dependent variable and other independent variables
model_sharks <- lm(blotch ~ sex + BPM + weight + length + air + water + meta + depth, data = sharks)
# Calculate VIF to check for multicollinearity
vif(model_sharks)
sex BPM weight length air water meta depth
1.005973 1.010600 1.010919 1.018264 1.029337 1.019376 1.018138 1.011213
Step 3: Visualise relationships
Before conducting hypothesis testing it is useful to visualise simple relationships between dependent and independent variables. This helps to understand potential correlations and trends.
3.1 Scatterplots (numerical variables)
The scatterplots for the sharks dataset show little evidence for correlations except between blotch and depth.
# Pairwise scatterplots with smaller points
pairs(numeric_vars,
main = "Scatterplot Matrix",
cex = 0.2) # Reduce the point size for readability
The scatterplot for the sharksub dataset shows evidence for a strong correlation between blotch1 and blotch2.
# Scatterplot for the relationship between blotch1 and blotch2
plot(sharksub$blotch1, sharksub$blotch2,
main = "Scatterplot of Blotch1 vs. Blotch2",
xlab = "Blotch1", ylab = "Blotch2",
pch = 19, col = "blue")
Results of data exploration
1) Outliers
Sharks: outliers identified for blotch and depth, which were subsequently removed
Sharksub: no outliers
2) Normality
Sharks: all variables were non-normally distributed
Sharksub: variables normally distributed
3) Multicollinearity
Sharks: no evidence for multicollinearity among independent variables
Sharksub: only two numerical variables so not an issue
4) Potential relationships
Sharks: potential correlation between blotch and depth
Sharksub: potential correlation between blotch1 and blotch2
Hypothesis Testing
Question 1: Is there a correlation between the variables air and water?
This question involves two continuous variables from the sharks dataset - air and water.
1.1 Testing for normality
Normality was assessed during data wrangling, but it is helpful to check air and water individually here. First, histograms are produced.
# Histogram for visual inspection of normality (air)
ggplot(sharks_cleaned, aes(x = air)) +
geom_histogram(bins = 30, fill = "lightblue", color = "black") +
ggtitle("Histogram of Air Temperature")
# Histogram for visual inspection of normality (water)
ggplot(sharks_cleaned, aes(x = water)) +
geom_histogram(bins = 30, fill = "lightblue", color = "black") +
ggtitle("Histogram of Water Temperature")
The Shapiro-Wilk test for normality is also applied. Both variables have a p-value of less than 0.05, so the null hypothesis of normality is rejected. The data are therefore not normally distributed and a non-parametric test such as Spearman’s correlation is required.
shapiro.test(sharks_cleaned$air) # Test normality for air
Shapiro-Wilk normality test
data: sharks_cleaned$air
W = 0.95867, p-value = 1.72e-10
shapiro.test(sharks_cleaned$water) # Test normality for water
Shapiro-Wilk normality test
data: sharks_cleaned$water
W = 0.95943, p-value = 2.284e-10
1.2 Spearman’s Rank Correlation Test
Spearman’s Rank is a non-parametric test so it doesn’t assume normality in the data. This is applied to determine whether there is a correlation between air and water.
# Perform Spearman's rank correlation test
cor.test(sharks_cleaned$air, sharks_cleaned$water, method = "spearman")
Spearman's rank correlation rho
data: sharks_cleaned$air and sharks_cleaned$water
S = 20675862, p-value = 0.2288
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.05445596
1.3 Interpretation of the results
Correlation coefficient (ρ/rho):
ρ = -0.054
Indicates a very weak negative relationship between the two variables
The value is so close to zero that the correlation is almost negligible
This indicates that there is no meaningful relationship
p-value
p-value = 0.228
There is no statistically significant correlation between the variables (p > 0.05)
1.4 Scatter plots
While the Spearman’s test indicates that there is no significant correlation, it is useful to visually inspect the relationship between the two variables with a scatterplot. A shaded region around the trendline represents the confidence interval of the linear regression. Various formatting aspects of the graph were adjusted to improve visualisation.
ggplot(sharks_cleaned, aes(x = air, y = water)) +
geom_point(color = "#1E3A44", size = 2, alpha = 0.7) + # Set the points to a darker blue-grey
geom_smooth(method = "lm", color = "blue", se = TRUE) + # Add linear regression line with confidence intervals
  labs(
    x = "Air Temperature (°C)", # Label for x axis
    y = "Water Temperature (°C)" # Label for y axis
  ) +
  theme_minimal() + # Apply minimal theme
theme(
panel.grid = element_blank(), # Remove gridlines
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1), # Add border around the plot area
    legend.position = "none", # Remove the automatic legend as it wasn't needed
axis.text = element_text(size = 12), # Adjust axis text size
axis.title = element_text(size = 14, face = "bold") # Bold axis titles
)
`geom_smooth()` using formula = 'y ~ x'
Question 2: Does multiple capture have an effect on blotching time?
This question involves the sharksub dataset, in which the continuous variable blotch was measured at two capture times (categories).
2.1 Testing for normality
It is important to check the normality of the differences between blotch1 and blotch2, because the paired t-test assumes that the differences between paired observations are normally distributed. The differences are therefore calculated and a Shapiro-Wilk test is applied. The p-value is less than 0.05, which suggests that the differences are not normally distributed.
# Calculate the differences between Blotch1 and Blotch2
sharksub$blotch_diff <- sharksub$blotch2 - sharksub$blotch1
# Shapiro-Wilk Test for Normality on the differences
shapiro.test(sharksub$blotch_diff)
Shapiro-Wilk normality test
data: sharksub$blotch_diff
W = 0.43157, p-value = 1.212e-12
2.2 Wilcoxon Signed-Rank Test
As the differences are non-normally distributed, the Wilcoxon test is applied. This is the non-parametric version of the paired t-test.
# Wilcoxon signed-rank test for non-normally distributed differences
wilcox.test(sharksub$blotch1, sharksub$blotch2, paired = TRUE)
Wilcoxon signed rank test with continuity correction
data: sharksub$blotch1 and sharksub$blotch2
V = 12, p-value = 1.606e-09
alternative hypothesis: true location shift is not equal to 0
2.3 Interpretation of the results
p-value
p-value < 0.05
The null hypothesis is rejected
Indicates that there is a statistically significant difference between blotch1 and blotch2
The difference between the two capture times is unlikely to have occurred by chance
V value
V = 12
Represents the sum of the ranks of the positive differences between the two paired groups
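As an optional extra (a hedged sketch, not part of the original output), an approximate effect size r = |Z| / sqrt(N) can be recovered from the two-sided p-value:
# Approximate effect size for the Wilcoxon signed-rank test
wt <- wilcox.test(sharksub$blotch1, sharksub$blotch2, paired = TRUE)
z <- qnorm(wt$p.value / 2) # Z recovered from the two-sided p-value
r_effect <- abs(z) / sqrt(nrow(sharksub)) # N = 50 pairs
r_effect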
2.4 Paired boxplot
To visualise the relationship between blotch1 and blotch2, a boxplot is produced. This visually compares the distributions of the two related variables to show how they differ. The first plot shows the basic version of a paired boxplot using ggpaired.
# Reshape the sharksub dataset to long format
sharksub_long <- sharksub %>%
  pivot_longer(cols = c(blotch1, blotch2),
               names_to = "blotch_type",
               values_to = "blotch_value")
# Create the paired boxplot using ggpaired
ggpaired(sharksub_long,
x = "blotch_type",
y = "blotch_value",
color = "blotch_type",
line.color = "gray",
line.size = 0.4,
palette = "jco") +
stat_compare_means(paired = TRUE)
# Regardless of adjustments I can't get this boxplot to have paired lines
No matter what changes I make, I can’t adapt the above code to draw joining lines between the paired observations (a possible alternative is sketched below). Therefore, as the only alternative I could get to work, I entered the 50 before and after observations as explicit vectors and built the paired boxplot from those.
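For reference, one possible alternative (untested here) is to draw the pairing lines manually with geom_line, assuming sharksub_long from above still carries the ID column:
# Paired boxplot with lines joining each shark's two measurements
ggplot(sharksub_long, aes(x = blotch_type, y = blotch_value)) +
  geom_boxplot(aes(fill = blotch_type)) +
  geom_line(aes(group = ID), color = "gray") + # One line per shark
  geom_point(size = 1) +
  theme_minimal()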
# The original data from sharksub split into before and after
before <- c(36.07201, 33.38396, 36.29497, 34.98931, 35.70572, 34.90283, 33.11113, 32.49322, 34.27203, 35.54855, 35.99021, 34.88511, 36.83684, 34.79932, 34.41429, 34.742, 34.36868, 34.27528, 36.11878, 33.98649, 34.45879, 34.9396, 35.8195, 34.47858, 33.26947, 32.85958, 34.65632, 36.48702, 34.93594, 36.42572, 36.50001, 36.37596, 35.17775, 34.57425, 36.68703, 35.28866, 35.92057, 34.73333, 33.6012, 35.70079, 35.02983, 35.63334, 34.24796, 35.34228, 36.4289, 33.53, 37.07165, 34.29493, 35.33881, 34.52245)
after <- c(37.1541703, 34.3854788, 36.461024, 36.0389893, 36.7768916, 35.9499149, 34.1044639, 33.4680166, 35.3001909, 36.6150065, 37.0699163, 35.9316633, 37.9419452, 35.8432996, 35.4467187, 34.39458, 35.3997404, 35.3035384, 37.2023434, 35.0060847, 35.4925537, 35.987788, 36.894085, 35.5129374, 34.2675541, 33.8453674, 35.6960096, 37.5816306, 35.9840182, 37.5184916, 37.5950103, 37.4672388, 36.2330825, 35.6114775, 37.7876409, 36.3473198, 35.52544373, 35.7753299, 34.609236, 36.7718137, 36.0807249, 36.7023402, 35.2753988, 36.4025484, 37.521767, 34.5359, 38.1837995, 35.3237779, 35.609952, 34.07365815)

d <- data.frame(before = before, after = after)
# Create a paired boxplot
ggpaired(d,
cond1 = "before", cond2 = "after",
fill = "condition") +
# Add axis labels
  labs(
    x = "Sample", # Rename x-axis title
    y = "Blotch Time (seconds)" # Rename y-axis title
  ) +
  theme(
    panel.grid = element_blank(), # Remove gridlines
    axis.text = element_text(size = 12, color = "black"), # Change axis text
    axis.title = element_text(size = 14, face = "bold"), # Bold axis titles
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5), # Change title
    axis.title.x = element_text(size = 14, face = "bold"), # Make x-axis title bold
    axis.title.y = element_text(size = 14, face = "bold") # Make y-axis title bold
  ) +
  # Add custom colors for before/after
scale_fill_manual(values = c("#FF9E2A", "#00A4A9")) +
scale_color_manual(values = c("#FF9E2A", "#00A4A9")) +
# Rename before/after conditions
scale_x_discrete(labels = c("First Capture", "Second Capture"))
Question 3: Is it possible to predict blotching time?
This question potentially involves the sharks dataset. The response is the continuous variable of blotch. The potential predictors are the continuous (BPM, weight, length, meta, depth, air, water) and categorical (sex) variables.
Understanding relationships with variables
It is helpful to understand the relationships between blotch and other variables to guide the decision about which to include in the model. Earlier it was evident that depth was the only variable that seemed to be correlated with blotch.
3.1 Correlation matrix
A heatmap annotated with the correlation values can be created using the corrplot function. Consistent with the scatterplots, the correlation matrix for the sharks dataset shows a strong correlation only between blotch and depth.
# Select all of the relevant numeric variables
numeric_vars <- sharks_cleaned[, c("blotch", "BPM", "meta", "weight", "length", "depth", "air", "water")]

# Calculate the Pearson correlation matrix
cor_matrix <- cor(numeric_vars, method = "pearson") # Set the method as Pearson
# Visualise the correlation matrix
corrplot(cor_matrix,
method = "ellipse", # Use ellipses to represent the strength of correlations
type = "upper", # Display only the upper half of the matrix
tl.col = "black", # Colour of the text
addCoef.col = "black", # Add correlation coefficients
number.cex = 0.8, # Change font size
diag = FALSE, # Diagonal is hidden
order = "hclust", # Order the variables based on hierarchical clustering
tl.srt = 45 # Rotation of text labels
)
It is also helpful to create a correlation table that reports all of the coefficient values and the associated p-values of the correlation tests.
# Select numerical variables
variables <- sharks_cleaned[, c("blotch", "BPM", "meta", "weight", "length", "depth", "air", "water")]

# Compute the correlation matrix (Pearson) and p-values
cor_matrix <- rcorr(as.matrix(variables), type = "pearson") # Use the rcorr function from the Hmisc package
The correlation coefficients and p-values can then be combined into a single table.
# Combine the correlation coefficients and p-values
cor_matrix_combined <- round(cor_matrix$r, 2)
pvalue_matrix <- round(cor_matrix$P, 3)

# Create a combined matrix
cor_matrix_display <- cor_matrix_combined
for (i in 1:nrow(cor_matrix_combined)) {
  for (j in 1:ncol(cor_matrix_combined)) {
    cor_matrix_display[i, j] <- paste(cor_matrix_combined[i, j],
                                      " (p = ", pvalue_matrix[i, j], ")", sep = "")
  }
}
# Show the final matrix
print(cor_matrix_display)
blotch BPM meta
blotch "1 (p = NA)" "-0.04 (p = 0.427)" "-0.03 (p = 0.443)"
BPM "-0.04 (p = 0.427)" "1 (p = NA)" "-0.01 (p = 0.808)"
meta "-0.03 (p = 0.443)" "-0.01 (p = 0.808)" "1 (p = NA)"
weight "0 (p = 0.966)" "0.01 (p = 0.784)" "0.02 (p = 0.591)"
length "-0.01 (p = 0.893)" "-0.08 (p = 0.079)" "0 (p = 0.923)"
depth "0.69 (p = 0)" "-0.01 (p = 0.76)" "-0.01 (p = 0.842)"
air "-0.03 (p = 0.451)" "-0.07 (p = 0.11)" "0.13 (p = 0.004)"
water "-0.04 (p = 0.335)" "0.03 (p = 0.518)" "0.03 (p = 0.53)"
weight length depth
blotch "0 (p = 0.966)" "-0.01 (p = 0.893)" "0.69 (p = 0)"
BPM "0.01 (p = 0.784)" "-0.08 (p = 0.079)" "-0.01 (p = 0.76)"
meta "0.02 (p = 0.591)" "0 (p = 0.923)" "-0.01 (p = 0.842)"
weight "1 (p = NA)" "-0.03 (p = 0.554)" "-0.01 (p = 0.849)"
length "-0.03 (p = 0.554)" "1 (p = NA)" "-0.07 (p = 0.137)"
depth "-0.01 (p = 0.849)" "-0.07 (p = 0.137)" "1 (p = NA)"
air "-0.05 (p = 0.273)" "-0.03 (p = 0.5)" "-0.02 (p = 0.687)"
water "0.09 (p = 0.052)" "-0.06 (p = 0.191)" "-0.03 (p = 0.485)"
air water
blotch "-0.03 (p = 0.451)" "-0.04 (p = 0.335)"
BPM "-0.07 (p = 0.11)" "0.03 (p = 0.518)"
meta "0.13 (p = 0.004)" "0.03 (p = 0.53)"
weight "-0.05 (p = 0.273)" "0.09 (p = 0.052)"
length "-0.03 (p = 0.5)" "-0.06 (p = 0.191)"
depth "-0.02 (p = 0.687)" "-0.03 (p = 0.485)"
air "1 (p = NA)" "-0.05 (p = 0.238)"
water "-0.05 (p = 0.238)" "1 (p = NA)"
3.2 Scatterplot for depth
From the correlation matrix it was evident that depth was the strongest predictor variable. Therefore a scatterplot is created to visualise the correlation between depth and blotch. This confirms that a linear regression model is appropriate.
# Scatter plot of depth against blotch
ggplot(sharks_cleaned, aes(x = depth, y = blotch)) +
geom_point(color = "#1E3A44", size = 2, alpha = 0.7) + # Darker blue-grey points
geom_smooth(method = "lm", color = "blue", se = TRUE) + # Add linear regression line with confidence interval as shaded region
  labs(
    x = "Depth (m)", # Add x axis label
    y = "Blotch Time (seconds)" # Add y axis label
  ) +
  theme_minimal() + # Apply minimal theme
theme(
panel.grid = element_blank(), # Remove gridlines
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1), # Border around the plot area
legend.position = "none", # Remove legend as not needed
axis.text = element_text(size = 12), # Change axis text size
axis.title = element_text(size = 14, face = "bold") # Bold axis titles
)
`geom_smooth()` using formula = 'y ~ x'
3.3 Pearson’s Correlation Test
A Pearson’s correlation was performed to determine the strength of the correlation between blotch and depth.
# Pearson correlation test between depth and blotch
cor_test <- cor.test(sharks_cleaned$depth, sharks_cleaned$blotch, method = "pearson")
# Display the results
cor_test
Pearson's product-moment correlation
data: sharks_cleaned$depth and sharks_cleaned$blotch
t = 21.203, df = 488, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6433494 0.7359091
sample estimates:
cor
0.692468
3.4 Interpretation of the results
Correlation coefficient (r):
r = 0.69
Indicates a strong positive relationship between the two variables
p-value
p-value < 0.05
There is a statistically significant correlation between the variables
3.5 Testing for normality
An assumption of linear regression is that the residuals are normally distributed, so it is important to test this. First, a simple linear regression model is fitted to the data.
# Fit a linear regression model
lm_depth <- lm(blotch ~ depth, data = sharks_cleaned)
# Get a summary of the model
summary(lm_depth)
Call:
lm(formula = blotch ~ depth, data = sharks_cleaned)
Residuals:
Min 1Q Median 3Q Max
-2.80724 -0.63964 -0.00208 0.59800 2.82832
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.49049 1.16171 9.03 <2e-16 ***
depth 0.49109 0.02316 21.20 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9784 on 488 degrees of freedom
Multiple R-squared: 0.4795, Adjusted R-squared: 0.4784
F-statistic: 449.6 on 1 and 488 DF, p-value: < 2.2e-16
Then a Q-Q plot is made to visually inspect if the residuals follow a normal distribution. This is important as a visual check alongside statistical values.
# Q-Q plot of residuals
qqnorm(residuals(lm_depth))
qqline(residuals(lm_depth), col = "red")
A Shapiro-Wilk test was then applied to provide a statistical assessment of normality. The p-value is above 0.05, so the null hypothesis of normality is not rejected, which suggests that the residuals are normally distributed.
# Shapiro-Wilk normality test
shapiro.test(residuals(lm_depth))
Shapiro-Wilk normality test
data: residuals(lm_depth)
W = 0.99743, p-value = 0.6523
3.6 Testing for homogeneity of variance
An assumption of linear regression is that the variance of the residuals is constant across all levels of the independent variable. Therefore a plot of residuals against fitted values was created to test this assumption. The residuals appear to be evenly spread around zero with no obvious patterns so the assumption is met.
# Plot of residuals against fitted values
plot(fitted(lm_depth), residuals(lm_depth),
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs Fitted Values")
abline(h = 0, col = "red")
3.7 Fitting the regression model
As the tests suggest that the assumptions of normality and homoscedasticity are met, a regression model can be confidently fitted to the data.
# Fit linear regression model
lm_depth <- lm(blotch ~ depth, data = sharks_cleaned)
# Get a summary of the model
summary(lm_depth)
Call:
lm(formula = blotch ~ depth, data = sharks_cleaned)
Residuals:
Min 1Q Median 3Q Max
-2.80724 -0.63964 -0.00208 0.59800 2.82832
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.49049 1.16171 9.03 <2e-16 ***
depth 0.49109 0.02316 21.20 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9784 on 488 degrees of freedom
Multiple R-squared: 0.4795, Adjusted R-squared: 0.4784
F-statistic: 449.6 on 1 and 488 DF, p-value: < 2.2e-16
A comparison was made with a linear regression model in which all the variables were included. This provided little additional explanatory power (R-squared of 0.4833 versus 0.4795), with no other predictor significant. Therefore depth alone appears to be the best predictor.
# Fit linear regression model
lm_all <- lm(blotch ~ depth + length + weight + meta + air + water + BPM, data = sharks_cleaned)
# Get a summary of the model
summary(lm_all)
Call:
lm(formula = blotch ~ depth + length + weight + meta + air +
water + BPM, data = sharks_cleaned)
Residuals:
Min 1Q Median 3Q Max
-2.76355 -0.64919 -0.00181 0.62864 2.77396
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.6904631 1.9116624 6.115 1.99e-09 ***
depth 0.4918029 0.0232996 21.108 < 2e-16 ***
length 0.0010741 0.0009558 1.124 0.262
weight 0.0006775 0.0033089 0.205 0.838
meta -0.0020259 0.0025811 -0.785 0.433
air -0.0184273 0.0314997 -0.585 0.559
water -0.0158975 0.0266314 -0.597 0.551
BPM -0.0023599 0.0031564 -0.748 0.455
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9809 on 482 degrees of freedom
Multiple R-squared: 0.4833, Adjusted R-squared: 0.4758
F-statistic: 64.4 on 7 and 482 DF, p-value: < 2.2e-16
3.8 Interpretation of the results
Coefficients
Intercept = 10.49
Slope of the model = 0.49
The slope represents the expected change in blotch (seconds) for a one-metre increase in depth
R-squared
R-squared = 0.48
Represents the proportion of variance in blotch explained by the model
p-value
p-value < 0.05
Suggests that depth significantly predicts blotch
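As a worked example using the coefficients above, at a depth of 50 m the predicted blotch time is 10.49 + 0.491 × 50 ≈ 35.0 seconds. The same value can be computed directly from the fitted model:
# Manual prediction from the fitted coefficients at depth = 50 m
coef(lm_depth)["(Intercept)"] + coef(lm_depth)["depth"] * 50 # ~35.0 seconds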
3.9 Predicting values
Predictions can now be generated from the fitted model.
# Predict values based on the linear regression model
predictions <- predict(lm_depth, newdata = sharks_cleaned)

# Compare predictions with actual values
comparison <- data.frame(Actual = sharks_cleaned$blotch, Predicted = predictions)
head(comparison)
Actual Predicted
1 37.17081 36.62957
2 34.54973 34.86786
3 36.32861 34.77040
4 35.33881 35.19104
5 37.39799 34.56967
6 33.54668 33.49400
Predictive modelling
Predictive modelling can be used to make predictions about unknown outcomes based on historic data. In this case, the aim is to predict blotch using the predictor variables.
4.1 Split the dataset into training and testing sets
The first step in building a predictive model is to split the data into training and testing sets. This allows the model to be trained on one subset of the data and tested on a separate subset to evaluate its performance.
# Set a seed for reproducibility
set.seed(40) # The seed was set at 40
# Split the data
trainIndex <- createDataPartition(sharks_cleaned$blotch, p = 0.8, list = FALSE) # 80% training and 20% testing

# Create training and testing datasets
trainData <- sharks_cleaned[trainIndex, ] # Used to train the model
testData <- sharks_cleaned[-trainIndex, ] # Used to evaluate the performance of the model
4.2 Building the Linear Regression model
A linear regression model is fitted in which blotch is the response variable and depth is the predictor.
# Build the linear regression model using the training data
lm_model <- lm(blotch ~ depth, data = trainData)
# View the model summary
summary(lm_model)
Call:
lm(formula = blotch ~ depth, data = trainData)
Residuals:
Min 1Q Median 3Q Max
-2.8091 -0.6433 -0.0039 0.5958 2.8226
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.67583 1.29544 8.241 2.59e-15 ***
depth 0.48743 0.02585 18.857 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9687 on 392 degrees of freedom
Multiple R-squared: 0.4757, Adjusted R-squared: 0.4743
F-statistic: 355.6 on 1 and 392 DF, p-value: < 2.2e-16
4.3 Interpretation of the results
Coefficients
Intercept = 10.68
Slope of the model = 0.49
The slope represents the expected change in blotch (seconds) for a one-metre increase in depth
R-squared
R-squared = 0.48
Represents the proportion of variance in blotch explained by the model
p-value
p-value < 0.05
Suggests that depth significantly predicts blotch
4.4 Evaluate the model - make predictions
It is important to assess how well the model performs. This involves making predictions on the test set.
# Make predictions on the test subset of data
predictions <- predict(lm_model, newdata = testData)
# View the first few predictions from the model
head(predictions)
11 13 16 26 50 52
34.35365 35.20272 35.37245 36.89665 35.63074 33.86592
4.5 Evaluate the model - calculate performance metrics
Another important evaluation is to calculate:
Mean squared error (MSE)
- Measures the average squared difference between predicted and actual values
R-squared (R²)
- Indicates how much of the variance in blotch is explained by depth
# Calculate Mean Squared Error
mse <- mean((predictions - testData$blotch)^2)
cat("Mean Squared Error (MSE):", mse, "\n")
Mean Squared Error (MSE): 1.03445
# Calculate R-squared
rsq <- 1 - sum((testData$blotch - predictions)^2) / sum((testData$blotch - mean(testData$blotch))^2)
cat("R-squared (R²):", rsq, "\n")
R-squared (R²): 0.4915227
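As a small optional addition, the root mean squared error (RMSE) is simply the square root of the MSE and is expressed in the original units (seconds), which makes it easier to interpret:
# Calculate Root Mean Squared Error (same units as blotch)
rmse <- sqrt(mse)
cat("Root Mean Squared Error (RMSE):", rmse, "\n")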
4.6 Visualising the results
Then a scatterplot is created to visualise how well the model’s predictions match the actual values.
# Scatter plot of Actual vs Predicted Blotch
ggplot(testData, aes(x = blotch, y = predictions)) +
geom_point(color = "#1E3A44", size = 2, alpha = 0.7) + # Darker blue-grey points
geom_smooth(method = "lm", color = "red", se = TRUE) + # Add linear regression line with confidence interval
  labs(
    x = "Actual Blotch Time (seconds)",
    y = "Predicted Blotch Time (seconds)"
  ) +
  theme_minimal() + # Apply minimal theme
theme(
panel.grid = element_blank(), # Remove gridlines
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1), # Border around the plot area
legend.position = "none", # Remove legend
axis.text = element_text(size = 12), # Adjust axis text size
axis.title = element_text(size = 14, face = "bold") # Bold axis titles
)
`geom_smooth()` using formula = 'y ~ x'
4.7 Checking residuals
It is important to inspect the residuals to check if the model is valid.
Residuals vs Fitted values plot: this plot should show no patterns
Residuals histogram: a roughly symmetric distribution centred on zero is ideal
# Residuals
residuals <- testData$blotch - predictions
# Plot the residuals against the fitted values
plot(predictions, residuals, main = "Residuals vs Fitted Values", xlab = "Predicted Blotch (secs)", ylab = "Residuals")
abline(h = 0, col = "red")
# Histogram of residuals
hist(residuals, main = "Residuals Histogram", xlab = "Residuals")
4.8 Make predictions on new data
The model can be used to make predictions for blotch values based on new depth values.
# Example of new data (depth values)
new_data <- data.frame(depth = c(55, 70))

# Make predictions on the new data
new_predictions <- predict(lm_model, newdata = new_data)
print(new_predictions)
1 2
37.48442 44.79585
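Note that a depth of 70 m lies well outside the range observed in the cleaned data (roughly 45-55 m), so the second prediction is an extrapolation and should be treated with caution. As a minimal sketch, prediction intervals can be requested to quantify the uncertainty around each prediction:
# 95% prediction intervals for the new depth values
predict(lm_model, newdata = new_data, interval = "prediction", level = 0.95)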