# The following code chunk identifies the packages used in the data analysis and identifies how to load the packages into R Studio.
library(tidyverse) # The tidyverse package is crucial as it contains the packages tidyr, ggplot2, dplyr, and more, used throughout this code.
library(RColorBrewer) # Optional package: used to apply colour changes to graphs.
library(PairedData)
library(ggpubr)
library(car)
library(patchwork)
Loading Packages
This section is used to load the R Studio packages used in this data analysis project:
Importing Data and Data Wrangling
This section is used to load in the data and perform data wrangling.
# Loading in the data.
<- read.csv("C:\\Users\\jacob\\OneDrive\\Uni\\Masters\\Sharks_Data\\sharks.csv")
sharks
<- read.csv("C:\\Users\\jacob\\OneDrive\\Uni\\Masters\\Sharks_Data\\sharksub.csv") sharksub
Looking at the dara:
glimpse(sharks) # Provides an overview of the sharks data
Rows: 500
Columns: 10
$ ID <chr> "SH001", "SH002", "SH003", "SH004", "SH005", "SH006", "SH007", …
$ sex <chr> "Female", "Female", "Female", "Male", "Female", "Male", "Male",…
$ blotch <dbl> 37.17081, 34.54973, 36.32861, 35.33881, 37.39799, 33.54668, 36.…
$ BPM <int> 148, 158, 125, 161, 138, 126, 166, 135, 132, 127, 126, 131, 133…
$ weight <dbl> 74.69050, 73.41627, 71.80837, 104.62985, 67.13098, 110.49396, 1…
$ length <dbl> 186.6703, 189.3189, 283.6332, 171.0986, 264.3160, 269.9829, 194…
$ air <dbl> 37.73957, 35.68413, 34.79854, 36.15973, 33.61477, 36.38343, 33.…
$ water <dbl> 23.37377, 21.42088, 20.05114, 21.64319, 21.76143, 20.85200, 21.…
$ meta <dbl> 64.11183, 73.68676, 54.43466, 86.32615, 107.97796, 108.86475, 9…
$ depth <dbl> 53.22635, 49.63903, 49.44057, 50.29711, 49.03183, 46.84148, 49.…
glimpse(sharksub) # Provides an overview of the sharksub data
Rows: 50
Columns: 4
$ ID <chr> "SH269", "SH163", "SH008", "SH239", "SH332", "SH328", "SH150",…
$ sex <chr> "Female", "Female", "Female", "Female", "Female", "Female", "F…
$ blotch1 <dbl> 36.07201, 33.38396, 36.29497, 34.98931, 35.70572, 34.90283, 33…
$ blotch2 <dbl> 37.15417, 34.38548, 36.46102, 36.03899, 36.77689, 35.94991, 34…
str(sharks) # provides a different overview of the data
'data.frame': 500 obs. of 10 variables:
$ ID : chr "SH001" "SH002" "SH003" "SH004" ...
$ sex : chr "Female" "Female" "Female" "Male" ...
$ blotch: num 37.2 34.5 36.3 35.3 37.4 ...
$ BPM : int 148 158 125 161 138 126 166 135 132 127 ...
$ weight: num 74.7 73.4 71.8 104.6 67.1 ...
$ length: num 187 189 284 171 264 ...
$ air : num 37.7 35.7 34.8 36.2 33.6 ...
$ water : num 23.4 21.4 20.1 21.6 21.8 ...
$ meta : num 64.1 73.7 54.4 86.3 108 ...
$ depth : num 53.2 49.6 49.4 50.3 49 ...
str(sharksub) # provides a different overview of the data
'data.frame': 50 obs. of 4 variables:
$ ID : chr "SH269" "SH163" "SH008" "SH239" ...
$ sex : chr "Female" "Female" "Female" "Female" ...
$ blotch1: num 36.1 33.4 36.3 35 35.7 ...
$ blotch2: num 37.2 34.4 36.5 36 36.8 ...
Checking for Null Values
Checking the data for null values:
colSums(is.na(sharks)) # Function to check data for na or null values
ID sex blotch BPM weight length air water meta depth
0 0 0 0 0 0 0 0 0 0
colSums(is.na(sharksub)) # Function to check data for na or null values
ID sex blotch1 blotch2
0 0 0 0
Preparing the Data
# A new data set is created from a wide format to a long format to be able to create a boxplot. This new data is saved as Blotching and contains the data for blotch 1 and 2 and their associated value.
<- sharksub %>% # Identifying data to use, pipe operator (%>%) is used to tell R to use that data or line in the upcoming function.
Blotching pivot_longer( # pivot_longer is used to change the data from a wide to a long format.
cols = c(blotch1, blotch2), # Creates a new column with the selected variables.
names_to = "Observation", # This is naming the new column.
values_to = "BlotchingTime" # This is naming the column with the associated values from the column.
)
Data Analysis
Summary of Data Sets
The data can now be explored. First, the summary function will be used to create an analysis of the data and the variables. This analysis is helpful in determining the minimum, maximum, mean, median, and quartiles of the data.
summary(sharks) # Creates a summary of the data.
ID sex blotch BPM
Length:500 Length:500 Min. :30.78 Min. :119.0
Class :character Class :character 1st Qu.:34.16 1st Qu.:129.0
Mode :character Mode :character Median :35.05 Median :142.0
Mean :35.13 Mean :141.8
3rd Qu.:36.05 3rd Qu.:153.2
Max. :40.08 Max. :166.0
weight length air water
Min. : 65.10 Min. :128.3 Min. :33.00 Min. :20.01
1st Qu.: 75.68 1st Qu.:172.0 1st Qu.:34.42 1st Qu.:21.55
Median : 87.82 Median :211.1 Median :35.43 Median :23.11
Mean : 87.94 Mean :211.0 Mean :35.54 Mean :23.02
3rd Qu.:100.40 3rd Qu.:251.8 3rd Qu.:36.71 3rd Qu.:24.37
Max. :110.94 Max. :291.0 Max. :38.00 Max. :25.99
meta depth
Min. : 50.03 Min. :44.64
1st Qu.: 67.39 1st Qu.:48.90
Median : 82.45 Median :50.14
Mean : 82.04 Mean :50.14
3rd Qu.: 95.97 3rd Qu.:51.35
Max. :112.45 Max. :56.83
summary(sharksub) # Creates a summary of the data.
ID sex blotch1 blotch2
Length:50 Length:50 Min. :32.49 Min. :33.47
Class :character Class :character 1st Qu.:34.38 1st Qu.:35.31
Mode :character Mode :character Median :34.94 Median :35.94
Mean :35.03 Mean :35.96
3rd Qu.:35.90 3rd Qu.:36.78
Max. :37.07 Max. :38.18
Identifying the number of males and females:
table(sharks$sex) # Table of number of males and females
Female Male
236 264
Looking at the data by sex:
<- sharks %>% # Creation of new dataset using the sharks data
sharks.by.sex group_by(sex) %>% # grouping the data together by sex
summarize(
mean_blotch = mean(blotch),
sd_blotch = sd(blotch),
mean_weight = mean(weight),
mean_length = mean(length),
mean_BPM = mean(BPM),
mean_meta = mean(meta),
mean_depth = mean(depth) # Summarising the variables by the sex and the means
%>%
) ungroup() # Ungrouping the data, crucial to ensure future code is not grouped together.
# Renaming the column headers:
<- sharks.by.sex %>%
sharks.by.sex rename(
`Sex` = sex,
`Mean Blotch` = mean_blotch,
`SD Blotch` = sd_blotch,
`Mean Weight` = mean_weight,
`Mean Length` = mean_length,
`Mean BPM` = mean_BPM,
`Mean Cortisol` = mean_meta,
`Mean Depth` = mean_depth
)print(sharks.by.sex) # Viewing the data
# A tibble: 2 × 8
Sex `Mean Blotch` `SD Blotch` `Mean Weight` `Mean Length` `Mean BPM`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Female 34.9 1.39 88.1 212. 142.
2 Male 35.3 1.44 87.8 210. 142.
# ℹ 2 more variables: `Mean Cortisol` <dbl>, `Mean Depth` <dbl>
<- sharksub %>% # Creation of new dataset using the sharksub data
sharksub.by.sex group_by(sex) %>% # grouping the data together by sex
summarize(
mean_blotch1 = mean(blotch1),
sd_blotch1 = sd(blotch1),
mean_blotch2 = mean(blotch2),
sd_blotch2 = sd(blotch2)# Summarising the variables by the sex, with the means and sd of blotch 1 and 2 being created.
%>%
) ungroup() # Ungrouping the data
# Renaming the column headers:
<- sharksub.by.sex %>%
sharksub.by.sex rename(
`Mean Blotch 1` = mean_blotch1,
`SD Blotch 1` = sd_blotch1,
`Mean Blotch 2` = mean_blotch2,
`SD Blotch 2` = sd_blotch2
)print(sharksub.by.sex) # Viewing the data
# A tibble: 2 × 5
sex `Mean Blotch 1` `SD Blotch 1` `Mean Blotch 2` `SD Blotch 2`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Female 34.8 1.08 35.8 1.11
2 Male 35.3 1.09 36.2 1.20
Statistical Analysis
1. Correlation Between Air and Water Temperature
Visualising the Data
As both air and water are both quantitative data, a scatter plot is the best graph to use here:
%>% # Calling in data
sharks ggplot(aes(
x = air, # Defining the x-axis variable
y = water, # Defining the y-axis variable
+
)) geom_point(alpha = 0.7, size = 1) + # Creating a scatter plot
geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) + # Applying a line of best fit using the linear model for a smooth line.
theme_minimal(base_size = 13) + # Applying the minimal theme to the graph
labs (
x = "Air Temperature (°C)", # Labelling of the x-axis
y = "Water Temperature (°C)" # Labelling of the y-axis
)
`geom_smooth()` using formula = 'y ~ x'
Checking for normality
Before a correlation test can be performed, the data needs to be checked for normality to determine whether Pearson’s or Spearman’s Correlation is used.
shapiro.test(sharks$air) # Shapiro-Wilk test for normality on air vairbale
Shapiro-Wilk normality test
data: sharks$air
W = 0.95885, p-value = 1.338e-10
shapiro.test(sharks$water) # Shapiro-Wilk test for normality on water vairbale
Shapiro-Wilk normality test
data: sharks$water
W = 0.96035, p-value = 2.371e-10
Data is not normally distributed, Spearman’s is then used.
Spearman’s Correlation Test
cor.test(sharks$air, sharks$water, method = "spearman") # Performing the correlation test using Spearman's method.
Spearman's rank correlation rho
data: sharks$air and sharks$water
S = 22007692, p-value = 0.2082
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.05637344
As the R-value is close to 0, we can assume that there is very weak to no correlation between the two factors.
2. Effect of Multiple Captures on Blotching Time
Research Questions
To check for the impact of multiple captures on the blotching time, a Paired T-test will be used if the data is normally distributed, or a Wilcoxon test if the data is not normal.
First, however, the research question needs to be established.
I want to test whether the mean blotching time on the second capture is significantly different from the first blotch time.
The null hypothesis is if there is no significant difference between the two blotch times.
The alternative hypothesis is that the two blotch times are significantly different.
Preparing and Visualising the Data
Summarising the Data:
%>%
Blotching group_by(Observation) %>% # Grouping this data by blotch1 and blotch 2
summarise(
count = n(), # Calculating the numnber in blotch1 and blotch 2
mean = mean(BlotchingTime), # Calculating the mean time of blotching
sd = sd(BlotchingTime) # Calculating the standard deviation of blotching
)
# A tibble: 2 × 4
Observation count mean sd
<chr> <int> <dbl> <dbl>
1 blotch1 50 35.0 1.10
2 blotch2 50 36.0 1.16
# Output is a summary table.
Visualising the Data
<- Blotching %>% # Calling in the data and saving as new dataset for presentation later
box.plot # ggplot here is defining the axis, grouping each ID point, and colouring the points by observation.
ggplot(aes(x = Observation, y = BlotchingTime, fill = Observation)) +
geom_boxplot() + # Creating a boxplot graph with the ggplot variables
labs(
x = "Capture Event",
y = "Blotching Time (seconds)" # Labelling of axis
+
) theme_minimal(base_size = 13) + # Changes the graph aesthetics
theme(legend.position = "none") + # Removing the legend
scale_fill_brewer(palette = "Dark2") # Settin the colour scale of the graph.
box.plot
This can be put into a paired data plot:
<- Blotching %>% # Calling in the data and saving as new dataset for presentation later
paired.plot # ggplot here is defining the axis, grouping each ID point together, and colouring the points by observation
ggplot(aes(x = Observation, y = BlotchingTime, group = ID, color = Observation)) +
geom_line(linewidth = 1, color = "darkgrey", alpha = 0.7) + # Adding of the lines to the scatter plot
geom_point(size = 1.8) + # Creating a scatter plot
theme_minimal(base_size = 13) + # Setting the theme of the graph
theme(legend.position = "none") + # Removing of the legend
labs(x = "Capture Event",
y = "Blotching Time (seconds)") + # Labelling of the axis
scale_color_brewer(palette = "Dark2") # Setting the colour palette of the points
paired.plot
# Using patchwork package, the 2 graphs can be combined into a single graph:
<- box.plot + paired.plot + plot_layout(ncol = 2)
combined combined
Preliminary test to check paired t-test assumptions:
- Are the 2 samples paired? Yes, both are a time for blotching.
- Is this a large sample? No, I will test for normality on data.
Visualising the normality of the data as a histogram:
<- sharksub %>% # Creating a new data set with new column in the sharksub data with the difference between blotch 1 and 2.
sharksub2 mutate(Difference = blotch2 - blotch1)
%>%
sharksub2 ggplot(aes(x = Difference))+ # Defining the x-axis
geom_histogram(binwidth = 0.1, fill = "lightblue", color = "black") + # Creating a histogram
theme_minimal(base_size = 13) + # Setting the graph theme
labs(x = "Difference",
y = "Count") # Labelling of axis
The histogram hints that the data may not be normally distributed and will be confirmed with the Shapiro-Wilk test.
# This code creates a new data set with the difference between blotch1 and blotch2, which is then tested for normality in the Shapiro-Wilk test.
<- sharksub$blotch1 - sharksub$blotch2
difference shapiro.test(difference)
Shapiro-Wilk normality test
data: difference
W = 0.43157, p-value = 1.212e-12
As the p-value is below our significance level of 0.05, we assume that the data deviates significantly from normality.
Therefore, the Wilcoxon signed-rank test is used.
Wilcoxon Signed-rank Test
# Running of the Wilcoxon Signed-rank test
wilcox.test(sharksub$blotch1, sharksub$blotch2, paired = TRUE)
Wilcoxon signed rank test with continuity correction
data: sharksub$blotch1 and sharksub$blotch2
V = 12, p-value = 1.606e-09
alternative hypothesis: true location shift is not equal to 0
Checking if sex has an impact on multiple capture events
We have seen that there is a significant difference between the first and second capture events; this will now look to see if this is the same between shark genders.
Checking the data
table(sharksub$sex) # obtaining the number of males and females in the data
Female Male
25 25
Visualising the data
%>%
Blotching ggplot(aes(x = sex, y = BlotchingTime, fill = Observation)) + # Defining the axis
geom_boxplot() + # Creating the boxplot
theme_minimal(base_size = 13)+ # Setting the graph theme
labs(x = "Sex",
y = "Blotching Time (s)") + # Labelling of the axis
scale_fill_brewer(palette = "Dark2") # applying a colour scheme
Preparing data and testing normality
# Creating a new dataset with male blotching time
<- sharksub %>%
males filter(sex == "Male")
# Creating a new dataset with female blotching time
<- sharksub %>%
females filter(sex == "Female")
# Finding the difference between blotch 1 and 2 for each gender
<- males$blotch1 - males$blotch2
male_difference <-females$blotch1 - females$blotch2
female_difference
# Testing the normality for the difference between blotch 1 and 2 for males
shapiro.test(male_difference)
Shapiro-Wilk normality test
data: male_difference
W = 0.46081, p-value = 1.569e-08
# Testing the normality for the difference between blotch 1 and 2 for females
shapiro.test(female_difference)
Shapiro-Wilk normality test
data: female_difference
W = 0.40466, p-value = 4.846e-09
All the data here is not normally distributed, therefore, wilcoxon is used.
Performing Wilcoxon Test
# Performing Wilcoxon test on males blotch time
wilcox.test(males$blotch1, males$blotch2, paired = TRUE)
Wilcoxon signed rank exact test
data: males$blotch1 and males$blotch2
V = 5, p-value = 5.96e-07
alternative hypothesis: true location shift is not equal to 0
# Performing Wilcoxon test on females blotch time
wilcox.test(females$blotch1, females$blotch2, paired = TRUE)
Wilcoxon signed rank exact test
data: females$blotch1 and females$blotch2
V = 2, p-value = 1.788e-07
alternative hypothesis: true location shift is not equal to 0
3. Predicting Blotching Time
Setting Up the Model
This question will be used for a linear regression model, so the response and predictors need to be identified.
For this test, the blotch variable from the sharks’ data is the response variable, as it is what we want to predict.
While the predictor variables are the others in the data: sex, weight, length, air, water, meta and depth.
Visualising the Data and Trends
This section was used to visualise if any variables had a relationship with blotching.
Continuous Data
Weight
%>%
sharks ggplot(aes(x = weight, y = blotch)) + # Defining the variables on the axis
geom_point(alpha = 0.6) + # Creating a scatter plot
geom_smooth(method = "lm", se = FALSE) + # Adding a line of best fit into the model
theme_minimal(base_size = 13) + # Setting the theme of the graph
labs(x = "Weight (Kg)",
y = "Blotch Time (seconds)") # Labelling of the axis
`geom_smooth()` using formula = 'y ~ x'
Length
%>% #
sharks ggplot(aes(x = length, y = blotch)) + # Defining the variables on the axis
geom_point(alpha = 0.6) + # Creating a scatter plot
geom_smooth(method = "lm", se = FALSE) + # Adding a line of best fit into the model
theme_minimal(base_size = 13) + # Setting the theme of the graph
labs(x = "Length (cm)",
y = "Blotch Time (seconds)") # Labelling of the axis
`geom_smooth()` using formula = 'y ~ x'
Air
%>%
sharks ggplot(aes(x = air, y = blotch)) + # Defining the variables on the axis
geom_point(alpha = 0.6) + # Creating a scatter plot
geom_smooth(method = "lm", se = FALSE) + # Adding a line of best fit into the model
theme_minimal(base_size = 13) + # Setting the theme of the graph
labs(x = "Air Temperature (°C)",
y = "Blotch Time (seconds)") # Labelling of the axis
`geom_smooth()` using formula = 'y ~ x'
Water
%>%
sharks ggplot(aes(x = water, y = blotch)) + # Defining the variables on the axis
geom_point(alpha = 0.6) + # Creating a scatter plot
geom_smooth(method = "lm", se = FALSE) + # Adding a line of best fit into the model
theme_minimal(base_size = 13) + # Setting the theme of the graph
labs(x = "Water Temperature (°C)",
y = "Blotch Time (seconds)") # Labelling of the axis
`geom_smooth()` using formula = 'y ~ x'
Meta
%>%
sharks ggplot(aes(x = meta, y = blotch)) + # Defining the variables on the axis
geom_point(alpha = 0.6) + # Creating a scatter plot
geom_smooth(method = "lm", se = FALSE) + # Adding a line of best fit into the model
theme_minimal(base_size = 13) + # Setting the theme of the graph
labs(x = "Cortisol levels (mcg/dl)",
y = "Blotch Time (seconds)") # Labelling of the axis
`geom_smooth()` using formula = 'y ~ x'
Depth
%>%
sharks ggplot(aes(x = depth, y = blotch)) + # Defining the variables on the axis
geom_point(alpha = 0.6) + # Creating a scatter plot
geom_smooth(method = "lm", se = FALSE) + # Adding a line of best fit into the model
theme_minimal(base_size = 13) + # Setting the theme of the graph
labs(x = "Depth (m)",
y = "Blotch Time (seconds)") # Labelling of the axis
`geom_smooth()` using formula = 'y ~ x'
The graph shows a potentially good indicator for blotch time on sharks due to the steep slope line of best fit; a correlation test needs to be run to confirm. It identifies that sharks at greater depths took longer to blotch than those at shallower depths.
shapiro.test(sharks$blotch) # Running normality test on Blotch data
Shapiro-Wilk normality test
data: sharks$blotch
W = 0.99695, p-value = 0.4769
shapiro.test(sharks$depth) # Running normality on depth data.
Shapiro-Wilk normality test
data: sharks$depth
W = 0.99746, p-value = 0.6485
cor.test(sharks$depth, sharks$blotch, method = "pearson") # Testing for normality of normally distributed data
Pearson's product-moment correlation
data: sharks$depth and sharks$blotch
t = 22.772, df = 498, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6683963 0.7546509
sample estimates:
cor
0.7142247
Categorical data
Sex
Visualising categorical data:
%>%
sharks ggplot(aes(x = sex, y = blotch, fill = sex)) + # Defining the variable for each axis
geom_boxplot() + # Creating a boxplot
labs(x = "Sex",
y = "Blotch Time (seconds)") + # Labelling of the x- and y-axis
theme_minimal(base_size = 13) + # Setting the theme of the boxplot
scale_fill_brewer(palette = "Dark2") # applying a colour scheme
To test to see if sex is a reliable predictor for blotching time, it needs to be tested to see if there is a significant difference between the blotching times; this can be done with a t-test if normally distributed or Wilcoxon Rank-sum test if not:
shapiro.test(sharks$blotch[sharks$sex == "Male"])
Shapiro-Wilk normality test
data: sharks$blotch[sharks$sex == "Male"]
W = 0.99209, p-value = 0.1701
shapiro.test(sharks$blotch[sharks$sex == "Female"])
Shapiro-Wilk normality test
data: sharks$blotch[sharks$sex == "Female"]
W = 0.99527, p-value = 0.682
As male and female are normally distributed a t-test is used:
t.test(blotch ~ sex, data = sharks) # Performing a T-Test
Welch Two Sample t-test
data: blotch by sex
t = -3.0282, df = 494.67, p-value = 0.002589
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
-0.6322714 -0.1346620
sample estimates:
mean in group Female mean in group Male
34.92294 35.30641
Running LM Model
To get an idea of the predictors for blotching time, a simple LM will be run first:
<- lm(blotch ~ sex + BPM + weight + length + air + water + meta + depth, data = sharks) # Creating a linear model with blotch time as the response variable, with all the other variables as the predictor variables.
blotch.model summary(blotch.model) # Viewing the results of the model.
Call:
lm(formula = blotch ~ sex + BPM + weight + length + air + water +
meta + depth, data = sharks)
Residuals:
Min 1Q Median 3Q Max
-2.97715 -0.66193 -0.00841 0.64123 2.90395
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.1179728 1.8749828 5.930 5.73e-09 ***
sexMale 0.3088617 0.0890602 3.468 0.00057 ***
BPM -0.0020791 0.0031540 -0.659 0.51009
weight 0.0017281 0.0033143 0.521 0.60231
length 0.0013042 0.0009606 1.358 0.17517
air -0.0310068 0.0315302 -0.983 0.32590
water -0.0143878 0.0268112 -0.537 0.59176
meta -0.0011610 0.0025671 -0.452 0.65127
depth 0.5034077 0.0220870 22.792 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9912 on 491 degrees of freedom
Multiple R-squared: 0.5256, Adjusted R-squared: 0.5178
F-statistic: 67.99 on 8 and 491 DF, p-value: < 2.2e-16
To reduce overfitting of the results, a model with reduced predictors will be used:
<- lm(blotch ~ depth + sex, data = sharks) # performing linear model with only the significant predictors
blotch.clean summary(blotch.clean) # Viewing the results
Call:
lm(formula = blotch ~ depth + sex, data = sharks)
Residuals:
Min 1Q Median 3Q Max
-2.96224 -0.64833 -0.01317 0.58882 2.98829
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.80938 1.10029 8.915 < 2e-16 ***
depth 0.50172 0.02194 22.864 < 2e-16 ***
sexMale 0.30379 0.08871 3.424 0.000667 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9895 on 497 degrees of freedom
Multiple R-squared: 0.5214, Adjusted R-squared: 0.5195
F-statistic: 270.7 on 2 and 497 DF, p-value: < 2.2e-16
Checking the assumptions and performance of the LM
Histogram:
# Creating a histogram for the residuals of the model.
hist(blotch.clean$residuals,
breaks = 20, # Setting number of bins
col = "skyblue", # Fill colour
border = "black", # Boarder colour
xlab = "Residual Values",
ylab = "Frequency", # Labelling of axis
main = "") # Removing preset title
Shapiro-Wilk:
Testing for data normality in the residuals
shapiro.test(blotch.clean$residuals) # Running code to test for normality
Shapiro-Wilk normality test
data: blotch.clean$residuals
W = 0.9978, p-value = 0.7661
The residual plots aid in checking the model’s validity, with the normality and showing the residuals follow a linear model.
plot(blotch.clean) # Plotting the residuals
The residuals vs fitted plot shows a reasonably straight, which indicates that the residuals likely follow a linear model, and this is the appropriate test to use. The Q-Q plot or Quantile-Quantile plot identifies that the data follows normality by not straying from the line; this is supported by the Shapiro-Wilk test.
ANOVA
Testing for the overall significance of the model
<- anova(blotch.clean) # Running the ANOVA test on the linear model
blotch.anova # Displaying the results of the ANOVA test blotch.anova
Analysis of Variance Table
Response: blotch
Df Sum Sq Mean Sq F value Pr(>F)
depth 1 518.70 518.70 529.739 < 2.2e-16 ***
sex 1 11.48 11.48 11.727 0.0006671 ***
Residuals 497 486.64 0.98
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This ANOVA test identifies that depth has the most notable impact on the blotching of sharks, producing a far more significant F value than sex. However, sex is still somewhat of a predictor for the effect of blotching on the sharks.