Summative Code for Research Methods and Data Analysis

Loading Packages

This section is used to load the R Studio packages used in this data analysis project:

# The following code chunk identifies the packages used in the data analysis and identifies how to load the packages into R Studio.

library(tidyverse) # The tidyverse package is crucial as it contains the packages tidyr, ggplot2, dplyr, and more, used throughout this code.
library(RColorBrewer) # Optional package: used to apply colour changes to graphs.
library(PairedData)
library(ggpubr)
library(car)
library(patchwork)

Importing Data and Data Wrangling

This section is used to load in the data and perform data wrangling.

# Loading in the data.
sharks <- read.csv("C:\\Users\\jacob\\OneDrive\\Uni\\Masters\\Sharks_Data\\sharks.csv")

sharksub <- read.csv("C:\\Users\\jacob\\OneDrive\\Uni\\Masters\\Sharks_Data\\sharksub.csv")

Looking at the dara:

glimpse(sharks) # Provides an overview of the sharks data

Rows: 500
Columns: 10
$ ID     <chr> "SH001", "SH002", "SH003", "SH004", "SH005", "SH006", "SH007", …
$ sex    <chr> "Female", "Female", "Female", "Male", "Female", "Male", "Male",…
$ blotch <dbl> 37.17081, 34.54973, 36.32861, 35.33881, 37.39799, 33.54668, 36.…
$ BPM    <int> 148, 158, 125, 161, 138, 126, 166, 135, 132, 127, 126, 131, 133…
$ weight <dbl> 74.69050, 73.41627, 71.80837, 104.62985, 67.13098, 110.49396, 1…
$ length <dbl> 186.6703, 189.3189, 283.6332, 171.0986, 264.3160, 269.9829, 194…
$ air    <dbl> 37.73957, 35.68413, 34.79854, 36.15973, 33.61477, 36.38343, 33.…
$ water  <dbl> 23.37377, 21.42088, 20.05114, 21.64319, 21.76143, 20.85200, 21.…
$ meta   <dbl> 64.11183, 73.68676, 54.43466, 86.32615, 107.97796, 108.86475, 9…
$ depth  <dbl> 53.22635, 49.63903, 49.44057, 50.29711, 49.03183, 46.84148, 49.…

glimpse(sharksub) # Provides an overview of the sharksub data

Rows: 50
Columns: 4
$ ID      <chr> "SH269", "SH163", "SH008", "SH239", "SH332", "SH328", "SH150",…
$ sex     <chr> "Female", "Female", "Female", "Female", "Female", "Female", "F…
$ blotch1 <dbl> 36.07201, 33.38396, 36.29497, 34.98931, 35.70572, 34.90283, 33…
$ blotch2 <dbl> 37.15417, 34.38548, 36.46102, 36.03899, 36.77689, 35.94991, 34…

str(sharks) # provides a different overview of the data

'data.frame':   500 obs. of  10 variables:
 $ ID    : chr  "SH001" "SH002" "SH003" "SH004" ...
 $ sex   : chr  "Female" "Female" "Female" "Male" ...
 $ blotch: num  37.2 34.5 36.3 35.3 37.4 ...
 $ BPM   : int  148 158 125 161 138 126 166 135 132 127 ...
 $ weight: num  74.7 73.4 71.8 104.6 67.1 ...
 $ length: num  187 189 284 171 264 ...
 $ air   : num  37.7 35.7 34.8 36.2 33.6 ...
 $ water : num  23.4 21.4 20.1 21.6 21.8 ...
 $ meta  : num  64.1 73.7 54.4 86.3 108 ...
 $ depth : num  53.2 49.6 49.4 50.3 49 ...

str(sharksub) # provides a different overview of the data

'data.frame':   50 obs. of  4 variables:
 $ ID     : chr  "SH269" "SH163" "SH008" "SH239" ...
 $ sex    : chr  "Female" "Female" "Female" "Female" ...
 $ blotch1: num  36.1 33.4 36.3 35 35.7 ...
 $ blotch2: num  37.2 34.4 36.5 36 36.8 ...

Checking for Null Values

Checking the data for null values:

colSums(is.na(sharks)) # Function to check data for na or null values

    ID    sex blotch    BPM weight length    air  water   meta  depth 
     0      0      0      0      0      0      0      0      0      0

colSums(is.na(sharksub)) # Function to check data for na or null values

     ID     sex blotch1 blotch2 
      0       0       0       0

Preparing the Data

# A new data set is created from a wide format to a long format to be able to create a boxplot. This new data is saved as Blotching and contains the data for blotch 1 and 2 and their associated value.

Blotching <- sharksub %>% # Identifying data to use, pipe operator (%>%) is used to tell R to use that data or line in the upcoming function.
  pivot_longer(    # pivot_longer is used to change the data from a wide to a long format. 
    cols = c(blotch1, blotch2), # Creates a new column with the selected    variables.
    names_to = "Observation",   # This is naming the new column.
    values_to = "BlotchingTime" # This is naming the column with the        associated values from the column.
  )

Data Analysis

Summary of Data Sets

The data can now be explored. First, the summary function will be used to create an analysis of the data and the variables. This analysis is helpful in determining the minimum, maximum, mean, median, and quartiles of the data.

summary(sharks) # Creates a summary of the data.

      ID                sex                blotch           BPM       
 Length:500         Length:500         Min.   :30.78   Min.   :119.0  
 Class :character   Class :character   1st Qu.:34.16   1st Qu.:129.0  
 Mode  :character   Mode  :character   Median :35.05   Median :142.0  
                                       Mean   :35.13   Mean   :141.8  
                                       3rd Qu.:36.05   3rd Qu.:153.2  
                                       Max.   :40.08   Max.   :166.0  
     weight           length           air            water      
 Min.   : 65.10   Min.   :128.3   Min.   :33.00   Min.   :20.01  
 1st Qu.: 75.68   1st Qu.:172.0   1st Qu.:34.42   1st Qu.:21.55  
 Median : 87.82   Median :211.1   Median :35.43   Median :23.11  
 Mean   : 87.94   Mean   :211.0   Mean   :35.54   Mean   :23.02  
 3rd Qu.:100.40   3rd Qu.:251.8   3rd Qu.:36.71   3rd Qu.:24.37  
 Max.   :110.94   Max.   :291.0   Max.   :38.00   Max.   :25.99  
      meta            depth      
 Min.   : 50.03   Min.   :44.64  
 1st Qu.: 67.39   1st Qu.:48.90  
 Median : 82.45   Median :50.14  
 Mean   : 82.04   Mean   :50.14  
 3rd Qu.: 95.97   3rd Qu.:51.35  
 Max.   :112.45   Max.   :56.83

summary(sharksub) # Creates a summary of the data.

      ID                sex               blotch1         blotch2     
 Length:50          Length:50          Min.   :32.49   Min.   :33.47  
 Class :character   Class :character   1st Qu.:34.38   1st Qu.:35.31  
 Mode  :character   Mode  :character   Median :34.94   Median :35.94  
                                       Mean   :35.03   Mean   :35.96  
                                       3rd Qu.:35.90   3rd Qu.:36.78  
                                       Max.   :37.07   Max.   :38.18

Identifying the number of males and females:

table(sharks$sex) # Table of number of males and females


Female   Male 
   236    264

Looking at the data by sex:

sharks.by.sex <- sharks %>% # Creation of new dataset using the sharks data
  group_by(sex) %>%    # grouping the data together by sex
  summarize(  
    mean_blotch = mean(blotch),
    sd_blotch = sd(blotch),
    mean_weight = mean(weight),
    mean_length = mean(length),
    mean_BPM = mean(BPM),
    mean_meta = mean(meta),
    mean_depth = mean(depth) # Summarising the variables by the sex and the means
  ) %>%
  ungroup() # Ungrouping the data, crucial to ensure future code is not grouped together.

# Renaming the column headers:
sharks.by.sex <- sharks.by.sex %>%
rename(
  `Sex` = sex,
  `Mean Blotch` = mean_blotch,
  `SD Blotch` = sd_blotch,
  `Mean Weight` = mean_weight,
  `Mean Length` = mean_length,
  `Mean BPM` = mean_BPM,
  `Mean Cortisol` = mean_meta,
  `Mean Depth` = mean_depth
)
print(sharks.by.sex) # Viewing the data

# A tibble: 2 × 8
  Sex    `Mean Blotch` `SD Blotch` `Mean Weight` `Mean Length` `Mean BPM`
  <chr>          <dbl>       <dbl>         <dbl>         <dbl>      <dbl>
1 Female          34.9        1.39          88.1          212.       142.
2 Male            35.3        1.44          87.8          210.       142.
# ℹ 2 more variables: `Mean Cortisol` <dbl>, `Mean Depth` <dbl>

sharksub.by.sex <- sharksub %>% # Creation of new dataset using the sharksub data
  group_by(sex) %>%    # grouping the data together by sex
  summarize(
    mean_blotch1 = mean(blotch1),
    sd_blotch1 = sd(blotch1),
    mean_blotch2 = mean(blotch2),
    sd_blotch2 = sd(blotch2)# Summarising the variables by the sex, with the means and sd of blotch 1 and 2 being created.
  ) %>%
  ungroup() # Ungrouping the data

# Renaming the column headers:
sharksub.by.sex <- sharksub.by.sex %>%
rename(
  `Mean Blotch 1` = mean_blotch1,
  `SD Blotch 1` = sd_blotch1,
  `Mean Blotch 2` = mean_blotch2,
  `SD Blotch 2` = sd_blotch2
)
print(sharksub.by.sex) # Viewing the data

# A tibble: 2 × 5
  sex    `Mean Blotch 1` `SD Blotch 1` `Mean Blotch 2` `SD Blotch 2`
  <chr>            <dbl>         <dbl>           <dbl>         <dbl>
1 Female            34.8          1.08            35.8          1.11
2 Male              35.3          1.09            36.2          1.20

Statistical Analysis

1. Correlation Between Air and Water Temperature

Visualising the Data

As both air and water are both quantitative data, a scatter plot is the best graph to use here:

sharks %>%   # Calling in data
  ggplot(aes(
    x = air,   # Defining the x-axis variable
    y = water, # Defining the y-axis variable
  )) +
  geom_point(alpha = 0.7, size = 1) + # Creating a scatter plot
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) + # Applying a line of best fit using the linear model for a smooth line.
  theme_minimal(base_size = 13) + # Applying the minimal theme to the graph
  labs ( 
    x = "Air Temperature (°C)",  # Labelling of the x-axis
    y = "Water Temperature (°C)" # Labelling of the y-axis
  )

`geom_smooth()` using formula = 'y ~ x'

Checking for normality

Before a correlation test can be performed, the data needs to be checked for normality to determine whether Pearson’s or Spearman’s Correlation is used.

shapiro.test(sharks$air) # Shapiro-Wilk test for normality on air vairbale


    Shapiro-Wilk normality test

data:  sharks$air
W = 0.95885, p-value = 1.338e-10

shapiro.test(sharks$water) # Shapiro-Wilk test for normality on water vairbale


    Shapiro-Wilk normality test

data:  sharks$water
W = 0.96035, p-value = 2.371e-10

Data is not normally distributed, Spearman’s is then used.

Spearman’s Correlation Test

cor.test(sharks$air, sharks$water, method = "spearman") # Performing the correlation test using Spearman's method.


    Spearman's rank correlation rho

data:  sharks$air and sharks$water
S = 22007692, p-value = 0.2082
alternative hypothesis: true rho is not equal to 0
sample estimates:
        rho 
-0.05637344

As the R-value is close to 0, we can assume that there is very weak to no correlation between the two factors.

2. Effect of Multiple Captures on Blotching Time

Research Questions

To check for the impact of multiple captures on the blotching time, a Paired T-test will be used if the data is normally distributed, or a Wilcoxon test if the data is not normal.

First, however, the research question needs to be established.

I want to test whether the mean blotching time on the second capture is significantly different from the first blotch time.

The null hypothesis is if there is no significant difference between the two blotch times.

The alternative hypothesis is that the two blotch times are significantly different.

Preparing and Visualising the Data

Summarising the Data:

Blotching %>%  
  group_by(Observation) %>%  # Grouping this data by blotch1 and blotch 2
  summarise(
    count = n(),  # Calculating the numnber in blotch1 and blotch 2
    mean = mean(BlotchingTime),  # Calculating the mean time of blotching
    sd = sd(BlotchingTime)       # Calculating the standard deviation of blotching 
  )

# A tibble: 2 × 4
  Observation count  mean    sd
  <chr>       <int> <dbl> <dbl>
1 blotch1        50  35.0  1.10
2 blotch2        50  36.0  1.16

# Output is a summary table.

Visualising the Data

box.plot <- Blotching %>%  # Calling in the data and saving as new dataset for presentation later
  # ggplot here is defining the axis, grouping each ID point, and colouring the points by observation.
  ggplot(aes(x = Observation, y = BlotchingTime, fill = Observation)) +
  geom_boxplot() + # Creating a boxplot graph with the ggplot variables
  labs(
    x = "Capture Event",  
    y = "Blotching Time (seconds)"  # Labelling of axis
  ) +
  theme_minimal(base_size = 13) +  # Changes the graph aesthetics
  theme(legend.position = "none") +  # Removing the legend
  scale_fill_brewer(palette = "Dark2")  # Settin the colour scale of the  graph.
box.plot

This can be put into a paired data plot:

paired.plot <- Blotching %>% # Calling in the data and saving as new dataset for presentation later
  # ggplot here is defining the axis, grouping each ID point together, and colouring the points by observation
ggplot(aes(x = Observation, y = BlotchingTime, group = ID, color = Observation)) +
  geom_line(linewidth = 1, color = "darkgrey", alpha = 0.7) + # Adding of the lines to the scatter plot
  geom_point(size = 1.8) +  # Creating a scatter plot
  theme_minimal(base_size = 13) +   # Setting the theme of the graph
  theme(legend.position = "none") +  # Removing of the legend
  labs(x = "Capture Event",
       y = "Blotching Time (seconds)") + # Labelling of the axis
  scale_color_brewer(palette = "Dark2")  # Setting the colour palette of the points
 paired.plot

# Using patchwork package, the 2 graphs can be combined into a single graph:
combined <- box.plot + paired.plot + plot_layout(ncol = 2)
combined

Preliminary test to check paired t-test assumptions:

Are the 2 samples paired? Yes, both are a time for blotching.
Is this a large sample? No, I will test for normality on data.

Visualising the normality of the data as a histogram:

 sharksub2 <- sharksub %>%   # Creating a new data set with new column in the sharksub data with the difference between blotch 1 and 2.
   mutate(Difference = blotch2 - blotch1)
 sharksub2 %>% 
   ggplot(aes(x = Difference))+ # Defining the x-axis
   geom_histogram(binwidth = 0.1, fill = "lightblue", color = "black") + # Creating a histogram
   theme_minimal(base_size = 13) + # Setting the graph theme
   labs(x = "Difference",
        y = "Count") # Labelling of axis

The histogram hints that the data may not be normally distributed and will be confirmed with the Shapiro-Wilk test.

# This code creates a new data set with the difference between blotch1 and blotch2, which is then tested for normality in the Shapiro-Wilk test. 
difference <- sharksub$blotch1 - sharksub$blotch2
shapiro.test(difference)


    Shapiro-Wilk normality test

data:  difference
W = 0.43157, p-value = 1.212e-12

As the p-value is below our significance level of 0.05, we assume that the data deviates significantly from normality.

Therefore, the Wilcoxon signed-rank test is used.

Wilcoxon Signed-rank Test

# Running of the Wilcoxon Signed-rank test
wilcox.test(sharksub$blotch1, sharksub$blotch2, paired = TRUE)


    Wilcoxon signed rank test with continuity correction

data:  sharksub$blotch1 and sharksub$blotch2
V = 12, p-value = 1.606e-09
alternative hypothesis: true location shift is not equal to 0

Checking if sex has an impact on multiple capture events

We have seen that there is a significant difference between the first and second capture events; this will now look to see if this is the same between shark genders.

Checking the data

table(sharksub$sex) # obtaining the number of males and females in the data


Female   Male 
    25     25

Visualising the data

Blotching %>% 
  ggplot(aes(x = sex, y = BlotchingTime, fill = Observation)) + # Defining the axis
  geom_boxplot() + # Creating the boxplot
  theme_minimal(base_size = 13)+ # Setting the graph theme
  labs(x = "Sex",
    y = "Blotching Time (s)") + # Labelling of the axis
  scale_fill_brewer(palette = "Dark2") # applying a colour scheme

Preparing data and testing normality

# Creating a new dataset with male blotching time
males <- sharksub %>%
  filter(sex == "Male") 
# Creating a new dataset with female blotching time
females <- sharksub %>%
  filter(sex == "Female")

# Finding the difference between blotch 1 and 2 for each gender
male_difference <- males$blotch1 - males$blotch2
female_difference <-females$blotch1 - females$blotch2

# Testing the normality for the difference between blotch 1 and 2 for males
shapiro.test(male_difference)


    Shapiro-Wilk normality test

data:  male_difference
W = 0.46081, p-value = 1.569e-08

# Testing the normality for the difference between blotch 1 and 2 for females
shapiro.test(female_difference)


    Shapiro-Wilk normality test

data:  female_difference
W = 0.40466, p-value = 4.846e-09

All the data here is not normally distributed, therefore, wilcoxon is used.

Performing Wilcoxon Test

# Performing Wilcoxon test on males blotch time
wilcox.test(males$blotch1, males$blotch2, paired = TRUE)


    Wilcoxon signed rank exact test

data:  males$blotch1 and males$blotch2
V = 5, p-value = 5.96e-07
alternative hypothesis: true location shift is not equal to 0

# Performing Wilcoxon test on females blotch time
wilcox.test(females$blotch1, females$blotch2, paired = TRUE)


    Wilcoxon signed rank exact test

data:  females$blotch1 and females$blotch2
V = 2, p-value = 1.788e-07
alternative hypothesis: true location shift is not equal to 0

3. Predicting Blotching Time

Setting Up the Model

This question will be used for a linear regression model, so the response and predictors need to be identified.

For this test, the blotch variable from the sharks’ data is the response variable, as it is what we want to predict.

While the predictor variables are the others in the data: sex, weight, length, air, water, meta and depth.

Visualising the Data and Trends

This section was used to visualise if any variables had a relationship with blotching.

Continuous Data

Weight

sharks %>%  
  ggplot(aes(x = weight, y = blotch)) +  # Defining the variables on the axis
  geom_point(alpha = 0.6) +   # Creating a scatter plot
  geom_smooth(method = "lm", se = FALSE) +  # Adding a line of best fit into the model
  theme_minimal(base_size = 13) +  # Setting the theme of the graph
  labs(x = "Weight (Kg)",
       y = "Blotch Time (seconds)")  # Labelling of the axis

`geom_smooth()` using formula = 'y ~ x'

Length

sharks %>%  #
  ggplot(aes(x = length, y = blotch)) +  # Defining the variables on the axis
  geom_point(alpha = 0.6) +   # Creating a scatter plot
  geom_smooth(method = "lm", se = FALSE) +  # Adding a line of best fit into the model
  theme_minimal(base_size = 13) +  # Setting the theme of the graph
  labs(x = "Length (cm)",
       y = "Blotch Time (seconds)")  # Labelling of the axis

`geom_smooth()` using formula = 'y ~ x'

Air

sharks %>% 
  ggplot(aes(x = air, y = blotch)) +  # Defining the variables on the axis
  geom_point(alpha = 0.6) +   # Creating a scatter plot
  geom_smooth(method = "lm", se = FALSE) +  # Adding a line of best fit into the model
  theme_minimal(base_size = 13) +  # Setting the theme of the graph
  labs(x = "Air Temperature (°C)",
       y = "Blotch Time (seconds)")  # Labelling of the axis

`geom_smooth()` using formula = 'y ~ x'

Water

sharks %>% 
  ggplot(aes(x = water, y = blotch)) +  # Defining the variables on the axis
  geom_point(alpha = 0.6) +   # Creating a scatter plot
  geom_smooth(method = "lm", se = FALSE) +  # Adding a line of best fit into the model
  theme_minimal(base_size = 13) +  # Setting the theme of the graph
  labs(x = "Water Temperature (°C)",
       y = "Blotch Time (seconds)")  # Labelling of the axis

`geom_smooth()` using formula = 'y ~ x'

Categorical data

Sex

Visualising categorical data:

sharks %>%  
  ggplot(aes(x = sex, y = blotch, fill = sex)) +  # Defining the variable for each axis
  geom_boxplot() +  # Creating a boxplot
  labs(x = "Sex",
       y = "Blotch Time (seconds)") + # Labelling of the x- and y-axis
  theme_minimal(base_size = 13) + # Setting the theme of the boxplot
scale_fill_brewer(palette = "Dark2") # applying a colour scheme

To test to see if sex is a reliable predictor for blotching time, it needs to be tested to see if there is a significant difference between the blotching times; this can be done with a t-test if normally distributed or Wilcoxon Rank-sum test if not:

shapiro.test(sharks$blotch[sharks$sex == "Male"])


    Shapiro-Wilk normality test

data:  sharks$blotch[sharks$sex == "Male"]
W = 0.99209, p-value = 0.1701

shapiro.test(sharks$blotch[sharks$sex == "Female"])


    Shapiro-Wilk normality test

data:  sharks$blotch[sharks$sex == "Female"]
W = 0.99527, p-value = 0.682

As male and female are normally distributed a t-test is used:

t.test(blotch ~ sex, data = sharks)  # Performing a T-Test


    Welch Two Sample t-test

data:  blotch by sex
t = -3.0282, df = 494.67, p-value = 0.002589
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -0.6322714 -0.1346620
sample estimates:
mean in group Female   mean in group Male 
            34.92294             35.30641

Running LM Model

To get an idea of the predictors for blotching time, a simple LM will be run first:

blotch.model <- lm(blotch ~ sex + BPM + weight + length + air + water + meta + depth, data = sharks) # Creating a linear model with blotch time as the response variable, with all the other variables as the predictor variables.
summary(blotch.model) # Viewing the results of the model.


Call:
lm(formula = blotch ~ sex + BPM + weight + length + air + water + 
    meta + depth, data = sharks)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.97715 -0.66193 -0.00841  0.64123  2.90395 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.1179728  1.8749828   5.930 5.73e-09 ***
sexMale      0.3088617  0.0890602   3.468  0.00057 ***
BPM         -0.0020791  0.0031540  -0.659  0.51009    
weight       0.0017281  0.0033143   0.521  0.60231    
length       0.0013042  0.0009606   1.358  0.17517    
air         -0.0310068  0.0315302  -0.983  0.32590    
water       -0.0143878  0.0268112  -0.537  0.59176    
meta        -0.0011610  0.0025671  -0.452  0.65127    
depth        0.5034077  0.0220870  22.792  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9912 on 491 degrees of freedom
Multiple R-squared:  0.5256,    Adjusted R-squared:  0.5178 
F-statistic: 67.99 on 8 and 491 DF,  p-value: < 2.2e-16

To reduce overfitting of the results, a model with reduced predictors will be used:

blotch.clean <- lm(blotch ~ depth + sex, data = sharks) # performing linear model with only the significant predictors
summary(blotch.clean) # Viewing the results


Call:
lm(formula = blotch ~ depth + sex, data = sharks)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.96224 -0.64833 -0.01317  0.58882  2.98829 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.80938    1.10029   8.915  < 2e-16 ***
depth        0.50172    0.02194  22.864  < 2e-16 ***
sexMale      0.30379    0.08871   3.424 0.000667 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9895 on 497 degrees of freedom
Multiple R-squared:  0.5214,    Adjusted R-squared:  0.5195 
F-statistic: 270.7 on 2 and 497 DF,  p-value: < 2.2e-16

Checking the assumptions and performance of the LM

Histogram:

# Creating a histogram for the residuals of the model.
hist(blotch.clean$residuals,
     breaks = 20,  # Setting number of bins
     col = "skyblue",  # Fill colour
     border = "black", # Boarder colour
     xlab = "Residual Values",
     ylab = "Frequency",  # Labelling of axis
     main = "") # Removing preset title

Shapiro-Wilk:

Testing for data normality in the residuals

shapiro.test(blotch.clean$residuals) # Running code to test for normality


    Shapiro-Wilk normality test

data:  blotch.clean$residuals
W = 0.9978, p-value = 0.7661

The residual plots aid in checking the model’s validity, with the normality and showing the residuals follow a linear model.

plot(blotch.clean) # Plotting the residuals

The residuals vs fitted plot shows a reasonably straight, which indicates that the residuals likely follow a linear model, and this is the appropriate test to use. The Q-Q plot or Quantile-Quantile plot identifies that the data follows normality by not straying from the line; this is supported by the Shapiro-Wilk test.

ANOVA

Testing for the overall significance of the model

blotch.anova <- anova(blotch.clean) # Running the ANOVA test on the linear model
blotch.anova  # Displaying the results of the ANOVA test

Analysis of Variance Table

Response: blotch
           Df Sum Sq Mean Sq F value    Pr(>F)    
depth       1 518.70  518.70 529.739 < 2.2e-16 ***
sex         1  11.48   11.48  11.727 0.0006671 ***
Residuals 497 486.64    0.98                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This ANOVA test identifies that depth has the most notable impact on the blotching of sharks, producing a far more significant F value than sex. However, sex is still somewhat of a predictor for the effect of blotching on the sharks.