Summative Code for Research Methods and Data Analysis

Loading Packages

This section is used to load the R Studio packages used in this data analysis project:

# The following code chunk identifies the packages used in the data analysis and identifies how to load the packages into R Studio.

library(tidyverse) # The tidyverse package is crucial as it contains the packages tidyr, ggplot2, dplyr, and more, used throughout this code.
library(RColorBrewer) # Optional package: used to apply colour changes to graphs.
library(PairedData)
library(ggpubr)
library(car)
library(patchwork)

Importing Data and Data Wrangling

This section is used to load in the data and perform data wrangling.

# Loading in the data.
sharks <- read.csv("C:\\Users\\jacob\\OneDrive\\Uni\\Masters\\Sharks_Data\\sharks.csv")

sharksub <- read.csv("C:\\Users\\jacob\\OneDrive\\Uni\\Masters\\Sharks_Data\\sharksub.csv")

Looking at the data:

str(sharks) # provides a different overview of the data

'data.frame':   500 obs. of  10 variables:
 $ ID    : chr  "SH001" "SH002" "SH003" "SH004" ...
 $ sex   : chr  "Female" "Female" "Female" "Male" ...
 $ blotch: num  37.2 34.5 36.3 35.3 37.4 ...
 $ BPM   : int  148 158 125 161 138 126 166 135 132 127 ...
 $ weight: num  74.7 73.4 71.8 104.6 67.1 ...
 $ length: num  187 189 284 171 264 ...
 $ air   : num  37.7 35.7 34.8 36.2 33.6 ...
 $ water : num  23.4 21.4 20.1 21.6 21.8 ...
 $ meta  : num  64.1 73.7 54.4 86.3 108 ...
 $ depth : num  53.2 49.6 49.4 50.3 49 ...

str(sharksub) # provides a different overview of the data

'data.frame':   50 obs. of  4 variables:
 $ ID     : chr  "SH269" "SH163" "SH008" "SH239" ...
 $ sex    : chr  "Female" "Female" "Female" "Female" ...
 $ blotch1: num  36.1 33.4 36.3 35 35.7 ...
 $ blotch2: num  37.2 34.4 36.5 36 36.8 ...

Checking for Null Values:

colSums(is.na(sharks)) # Function to check data for na or null values

    ID    sex blotch    BPM weight length    air  water   meta  depth 
     0      0      0      0      0      0      0      0      0      0

colSums(is.na(sharksub)) # Function to check data for na or null values

     ID     sex blotch1 blotch2 
      0       0       0       0

Preparing the Data

# A new data set is created from a wide format to a long format to be able to create a boxplot. This new data is saved as Blotching and contains the data for blotch 1 and 2 and their associated value.

Blotching <- sharksub %>% # Identifying data to use, pipe operator (%>%) is used to tell R to use that data or line in the upcoming function.
  pivot_longer(    # pivot_longer is used to change the data from a wide to a long format. 
    cols = c(blotch1, blotch2), # Creates a new column with the selected    variables.
    names_to = "Observation",   # This is naming the new column.
    values_to = "BlotchingTime" # This is naming the column with the        associated values from the column.
  )

Data Analysis

Summary of Data Sets

The data can now be explored. First, the summary function will be used to create an analysis of the data and the variables. This analysis is helpful in determining the minimum, maximum, mean, median, and quartiles of the data.

summary(sharks) # Creates a summary of the data.

      ID                sex                blotch           BPM       
 Length:500         Length:500         Min.   :30.78   Min.   :119.0  
 Class :character   Class :character   1st Qu.:34.16   1st Qu.:129.0  
 Mode  :character   Mode  :character   Median :35.05   Median :142.0  
                                       Mean   :35.13   Mean   :141.8  
                                       3rd Qu.:36.05   3rd Qu.:153.2  
                                       Max.   :40.08   Max.   :166.0  
     weight           length           air            water      
 Min.   : 65.10   Min.   :128.3   Min.   :33.00   Min.   :20.01  
 1st Qu.: 75.68   1st Qu.:172.0   1st Qu.:34.42   1st Qu.:21.55  
 Median : 87.82   Median :211.1   Median :35.43   Median :23.11  
 Mean   : 87.94   Mean   :211.0   Mean   :35.54   Mean   :23.02  
 3rd Qu.:100.40   3rd Qu.:251.8   3rd Qu.:36.71   3rd Qu.:24.37  
 Max.   :110.94   Max.   :291.0   Max.   :38.00   Max.   :25.99  
      meta            depth      
 Min.   : 50.03   Min.   :44.64  
 1st Qu.: 67.39   1st Qu.:48.90  
 Median : 82.45   Median :50.14  
 Mean   : 82.04   Mean   :50.14  
 3rd Qu.: 95.97   3rd Qu.:51.35  
 Max.   :112.45   Max.   :56.83

summary(sharksub) # Creates a summary of the data.

      ID                sex               blotch1         blotch2     
 Length:50          Length:50          Min.   :32.49   Min.   :33.47  
 Class :character   Class :character   1st Qu.:34.38   1st Qu.:35.31  
 Mode  :character   Mode  :character   Median :34.94   Median :35.94  
                                       Mean   :35.03   Mean   :35.96  
                                       3rd Qu.:35.90   3rd Qu.:36.78  
                                       Max.   :37.07   Max.   :38.18

Identifying the number of males and females:

table(sharks$sex) # Table of number of males and females from first capture event


Female   Male 
   236    264

table(sharksub$sex) # obtaining the number of males and females of those recorded twice


Female   Male 
    25     25

Looking at the data by sex:

# Creating a summary table by sex for the first capture event.

sharks.by.sex <- sharks %>% # Creation of new dataset using the sharks data
  group_by(sex) %>%    # grouping the data together by sex
  summarize(  
    mean_blotch = mean(blotch),
    sd_blotch = sd(blotch),
    mean_weight = mean(weight),
    mean_length = mean(length),
    mean_BPM = mean(BPM),
    mean_meta = mean(meta),
    mean_depth = mean(depth) # Summarising the variables by the sex and the means
  ) %>%
  ungroup() # Ungrouping the data, crucial to ensure future code is not grouped together.

# Renaming the column headers:
sharks.by.sex <- sharks.by.sex %>%
rename(
  `Sex` = sex,
  `Mean Blotch` = mean_blotch,
  `SD Blotch` = sd_blotch,
  `Mean Weight` = mean_weight,
  `Mean Length` = mean_length,
  `Mean BPM` = mean_BPM,
  `Mean Cortisol` = mean_meta,
  `Mean Depth` = mean_depth
)
print(sharks.by.sex) # Viewing the data

# A tibble: 2 × 8
  Sex    `Mean Blotch` `SD Blotch` `Mean Weight` `Mean Length` `Mean BPM`
  <chr>          <dbl>       <dbl>         <dbl>         <dbl>      <dbl>
1 Female          34.9        1.39          88.1          212.       142.
2 Male            35.3        1.44          87.8          210.       142.
# ℹ 2 more variables: `Mean Cortisol` <dbl>, `Mean Depth` <dbl>

# Creating a table of the second capture event

sharksub.by.sex <- sharksub %>% # Creation of new dataset using the sharksub data
  group_by(sex) %>%    # grouping the data together by sex
  summarize(
    mean_blotch1 = mean(blotch1),
    sd_blotch1 = sd(blotch1),
    mean_blotch2 = mean(blotch2),
    sd_blotch2 = sd(blotch2)# Summarising the variables by the sex, with the means and sd of blotch 1 and 2 being created.
  ) %>%
  ungroup() # Ungrouping the data

# Renaming the column headers:
sharksub.by.sex <- sharksub.by.sex %>%
rename(
  `Mean Blotch 1` = mean_blotch1,
  `SD Blotch 1` = sd_blotch1,
  `Mean Blotch 2` = mean_blotch2,
  `SD Blotch 2` = sd_blotch2
)
print(sharksub.by.sex) # Viewing the data

# A tibble: 2 × 5
  sex    `Mean Blotch 1` `SD Blotch 1` `Mean Blotch 2` `SD Blotch 2`
  <chr>            <dbl>         <dbl>           <dbl>         <dbl>
1 Female            34.8          1.08            35.8          1.11
2 Male              35.3          1.09            36.2          1.20

Statistical Analysis

1. Correlation Between Air and Water Temperature

Visualising the Data

As blotch, air and water are quantitative data, a scatter plot is the best graph to use here:

# Creating a scatterplot for Air temp against Blotch time

# Saving graph to air.plot data set to combine with the water.plot below

air.plot <- sharks %>%   # Calling in data
  ggplot(aes(
    x = air,   # Defining the x-axis variable
    y = blotch, # Defining the y-axis variable
  )) +
  geom_point(alpha = 1, size = 1) + # Creating a scatter plot
  geom_smooth(method = "lm", se = FALSE, color = "darkgrey", linewidth = 1) + # Applying a line of best fit using the linear model for a smooth line.
  theme_minimal(base_size = 13) + # Applying the minimal theme to the graph
  labs ( 
    x = "Air Temperature (°C)",  # Labelling of the x-axis
    y = "Blotch Time (s)" # Labelling of the y-axis
  )

# Creating a scatterplot for blotch time against water temp.

# Saving graph to water.plot data set to combine with the air.plot below

water.plot <- sharks %>%   
  ggplot(aes(
    x = water,  
    y = blotch, 
  )) +
  geom_point(alpha = 1, size = 1) + 
  geom_smooth(method = "lm", se = FALSE, color = "darkgrey", linewidth = 1) + 
  theme_minimal(base_size = 13) + 
  labs ( 
    x = "Water Temperature (°C)",  
    y = "Blotch Time (s)" 
  )

# Using patchwork package, the 2 graphs can be combined into a single graph:
air.water <- water.plot + air.plot + plot_layout(ncol = 1)
air.water

`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

Checking for normality

Before a correlation test can be performed, the data needs to be checked for normality to determine whether Pearson’s or Spearman’s Correlation is used.

# Performing Shapiro-Wilk on variables for normality
shapiro.test(sharks$air)


    Shapiro-Wilk normality test

data:  sharks$air
W = 0.95885, p-value = 1.338e-10

shapiro.test(sharks$water)


    Shapiro-Wilk normality test

data:  sharks$water
W = 0.96035, p-value = 2.371e-10

shapiro.test(sharks$blotch)


    Shapiro-Wilk normality test

data:  sharks$blotch
W = 0.99695, p-value = 0.4769

As water and air is not normally distributed, Spearmans is used

Spearman’s Correlation Test

cor.test(sharks$air, sharks$blotch, method = "spearman") # Performing the correlation test using Spearman's method.


    Spearman's rank correlation rho

data:  sharks$air and sharks$blotch
S = 21469324, p-value = 0.4956
alternative hypothesis: true rho is not equal to 0
sample estimates:
        rho 
-0.03053167

cor.test(sharks$water, sharks$blotch, method = "spearman") # Performing the correlation test using Spearman's method.


    Spearman's rank correlation rho

data:  sharks$water and sharks$blotch
S = 21758692, p-value = 0.3214
alternative hypothesis: true rho is not equal to 0
sample estimates:
        rho 
-0.04442139

Both show slight negative correlation but close to 0, so the relationship is very limited.

2. Effect of Multiple Captures on Blotching Time

Research Questions

To check for the impact of multiple captures on the blotching time, a Paired T-test will be used if the data is normally distributed, or a Wilcoxon test if the data is not normal.

First, however, the research question needs to be established.

I want to test whether the mean blotching time on the second capture is significantly different from the first blotch time.

The null hypothesis is if there is no significant difference between the two blotch times.

The alternative hypothesis is that the two blotch times are significantly different.

Preparing and Visualising the Data

Summarising the Data:

Blotching %>%  
  group_by(Observation) %>%  # Grouping this data by blotch1 and blotch 2
  summarise(
    count = n(),  # Calculating the numnber in blotch1 and blotch 2
    mean = mean(BlotchingTime),  # Calculating the mean time of blotching
    sd = sd(BlotchingTime)       # Calculating the standard deviation of blotching 
  )

# A tibble: 2 × 4
  Observation count  mean    sd
  <chr>       <int> <dbl> <dbl>
1 blotch1        50  35.0  1.10
2 blotch2        50  36.0  1.16

# Output is a summary table.

Visualising the Data

# Creating a boxplot for the blotching of capture 1 and 2.

box.plot <- Blotching %>%  # saving as new dataset for presentation later
  ggplot(aes(x = Observation, y = BlotchingTime, fill = Observation)) +
  geom_boxplot() + # Creating a boxplot graph with the ggplot variables
  labs(
    x = "Capture Event",  
    y = "Blotching Time (seconds)"  
  ) +
  theme_minimal(base_size = 13) +  # Changes the graph aesthetics
  theme(legend.position = "none") +  # Removing the legend
  scale_fill_brewer(palette = "Greys")  # setting colour to journal style

This can be put into a paired data plot:

paired.plot <- Blotching %>% # saving as new dataset for presentation later
ggplot(aes(x = Observation, y = BlotchingTime, group = ID, color = Observation)) +
  geom_line(linewidth = 1, color = "darkgrey", alpha = 0.7) + # Adding of the lines to the scatter plot
  geom_point(size = 1.8) +
  theme_minimal(base_size = 13) +   
  theme(legend.position = "none") +  
  labs(x = "Capture Event",
       y = "Blotching Time (seconds)") + 
  scale_color_grey(start = 0.6, end = 0.3)

# Using patchwork package, the 2 graphs can be combined into a single graph:
combined <- box.plot + paired.plot + plot_layout(ncol = 2)
combined

Preliminary test to check paired t-test assumptions:

Are the 2 samples paired? Yes, both are a time for blotching.
Is this a large sample? No, I will test for normality on data.

Visualising the normality of the data as a histogram:

 sharksub2 <- sharksub %>%   # Creating a new data set with new column in the sharksub data with the difference between blotch 1 and 2.
   mutate(Difference = blotch2 - blotch1)
 sharksub2 %>% 
   ggplot(aes(x = Difference))+
   geom_histogram(binwidth = 0.1, fill = "lightgrey", color = "black") + # Creating a histogram
   theme_minimal(base_size = 13) + 
   labs(x = "Difference",
        y = "Count")

The histogram hints that the data may not be normally distributed and will be confirmed with the Shapiro-Wilk test.

# This code creates a new data set with the difference between blotch1 and blotch2, which is then tested for normality in the Shapiro-Wilk test. 
difference <- sharksub$blotch1 - sharksub$blotch2
shapiro.test(difference)


    Shapiro-Wilk normality test

data:  difference
W = 0.43157, p-value = 1.212e-12

As the p-value is below our significance level of 0.05, we assume that the data deviates significantly from normality.

Therefore, the Wilcoxon signed-rank test is used.

Wilcoxon Signed-rank Test

# Running of the Wilcoxon Signed-rank test
wilcox.test(sharksub$blotch1, sharksub$blotch2, paired = TRUE)


    Wilcoxon signed rank test with continuity correction

data:  sharksub$blotch1 and sharksub$blotch2
V = 12, p-value = 1.606e-09
alternative hypothesis: true location shift is not equal to 0

Blotching %>% 
  ggplot(aes(x = sex, y = BlotchingTime, fill = Observation)) + # Defining the axis
  geom_boxplot() + # Creating the boxplot
  theme_minimal(base_size = 13)+ # Setting the graph theme
  labs(x = "Sex",
    y = "Blotching Time (s)") + # Labelling of the axis
  scale_fill_brewer(palette = "Greys") # applying a colour scheme

3. Predicting Blotching Time

Setting Up the Model

This question will be used for a linear regression model, so the response and predictors need to be identified.

For this test, the blotch variable from the sharks’ data is the response variable, as it is what we want to predict.

While the predictor variables are the others in the data: sex, weight, length, air, water, meta and depth.

Visualising the Data and Trends

This section was used to visualise if any variables had a relationship with blotching.

Continuous Data

Weight

sharks %>%  
  ggplot(aes(x = weight, y = blotch)) +  # Defining the variables on the axis
  geom_point(alpha = 0.6) +   # Creating a scatter plot
  geom_smooth(method = "lm", se = FALSE) +  # Adding a line of best fit into the model
  theme_minimal(base_size = 13) +  # Setting the theme of the graph
  labs(x = "Weight (Kg)",
       y = "Blotch Time (seconds)")  # Labelling of the axis

`geom_smooth()` using formula = 'y ~ x'

Length

sharks %>%  #
  ggplot(aes(x = length, y = blotch)) +  # Defining the variables on the axis
  geom_point(alpha = 0.6) +   # Creating a scatter plot
  geom_smooth(method = "lm", se = FALSE) +  # Adding a line of best fit into the model
  theme_minimal(base_size = 13) +  # Setting the theme of the graph
  labs(x = "Length (cm)",
       y = "Blotch Time (seconds)")  # Labelling of the axis

`geom_smooth()` using formula = 'y ~ x'

Air and Water Temperature

These variables were plotted against blotch time in question one and were found to have no relationship, therefore, another scatter is not made for them.

Categorical data

Sex

Visualising categorical data:

sharks %>%  
  ggplot(aes(x = sex, y = blotch, fill = sex)) +  # Defining the variable for each axis
  geom_boxplot() +  # Creating a boxplot
  labs(x = "Sex",
       y = "Blotch Time (seconds)") + # Labelling of the x- and y-axis
  theme_minimal(base_size = 13) + # Setting the theme of the boxplot
scale_fill_brewer(palette = "Greys") # applying a colour scheme

To test to see if sex is a reliable predictor for blotching time, it needs to be tested to see if there is a significant difference between the blotching times; this can be done with a t-test if normally distributed or Wilcoxon Rank-sum test if not:

# Performin Shapiro-Wilk on Male and Female data for normality
shapiro.test(sharks$blotch[sharks$sex == "Male"])


    Shapiro-Wilk normality test

data:  sharks$blotch[sharks$sex == "Male"]
W = 0.99209, p-value = 0.1701

shapiro.test(sharks$blotch[sharks$sex == "Female"])


    Shapiro-Wilk normality test

data:  sharks$blotch[sharks$sex == "Female"]
W = 0.99527, p-value = 0.682

As male and female are normally distributed a t-test is used:

t.test(blotch ~ sex, data = sharks)  # Performing a T-Test


    Welch Two Sample t-test

data:  blotch by sex
t = -3.0282, df = 494.67, p-value = 0.002589
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -0.6322714 -0.1346620
sample estimates:
mean in group Female   mean in group Male 
            34.92294             35.30641

Running LM Model

To get an idea of the predictors for blotching time, a simple LM will be run first:

blotch.model <- lm(blotch ~ sex + BPM + weight + length + air + water + meta + depth, data = sharks) # Creating a linear model with blotch time as the response variable, with all the other variables as the predictor variables.
summary(blotch.model) # Viewing the results of the model.


Call:
lm(formula = blotch ~ sex + BPM + weight + length + air + water + 
    meta + depth, data = sharks)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.97715 -0.66193 -0.00841  0.64123  2.90395 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.1179728  1.8749828   5.930 5.73e-09 ***
sexMale      0.3088617  0.0890602   3.468  0.00057 ***
BPM         -0.0020791  0.0031540  -0.659  0.51009    
weight       0.0017281  0.0033143   0.521  0.60231    
length       0.0013042  0.0009606   1.358  0.17517    
air         -0.0310068  0.0315302  -0.983  0.32590    
water       -0.0143878  0.0268112  -0.537  0.59176    
meta        -0.0011610  0.0025671  -0.452  0.65127    
depth        0.5034077  0.0220870  22.792  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9912 on 491 degrees of freedom
Multiple R-squared:  0.5256,    Adjusted R-squared:  0.5178 
F-statistic: 67.99 on 8 and 491 DF,  p-value: < 2.2e-16

To reduce overfitting of the results, a model with reduced predictors will be used:

blotch.clean <- lm(blotch ~ depth + sex, data = sharks) # performing linear model with only the significant predictors
summary(blotch.clean)


Call:
lm(formula = blotch ~ depth + sex, data = sharks)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.96224 -0.64833 -0.01317  0.58882  2.98829 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.80938    1.10029   8.915  < 2e-16 ***
depth        0.50172    0.02194  22.864  < 2e-16 ***
sexMale      0.30379    0.08871   3.424 0.000667 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9895 on 497 degrees of freedom
Multiple R-squared:  0.5214,    Adjusted R-squared:  0.5195 
F-statistic: 270.7 on 2 and 497 DF,  p-value: < 2.2e-16

# Performing lm model for just sex, to see this variables R-Squared value
sexlm <- lm(blotch ~ sex, data = sharks)
summary(sexlm)


Call:
lm(formula = blotch ~ sex, data = sharks)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1471 -0.9653 -0.0222  0.9889  4.7772 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.92294    0.09217 378.885  < 2e-16 ***
sexMale      0.38347    0.12685   3.023  0.00263 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.416 on 498 degrees of freedom
Multiple R-squared:  0.01802,   Adjusted R-squared:  0.01605 
F-statistic: 9.139 on 1 and 498 DF,  p-value: 0.002632

# Performing lm on just the depth variable to see the influence it has on blotch time.
depthlm <- lm(blotch ~ depth, data = sharks)
summary(depthlm)


Call:
lm(formula = blotch ~ depth, data = sharks)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81869 -0.65427 -0.01035  0.58825  2.83116 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.82178    1.11207   8.832   <2e-16 ***
depth        0.50467    0.02216  22.772   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 498 degrees of freedom
Multiple R-squared:  0.5101,    Adjusted R-squared:  0.5091 
F-statistic: 518.6 on 1 and 498 DF,  p-value: < 2.2e-16

Checking the assumptions and performance of the LM

Histogram:

# Creating a histogram for the residuals of the model.
hist(blotch.clean$residuals,
     breaks = 20,  # Setting number of bins
     col = "lightgrey",  # Fill colour
     border = "black", # Boarder colour
     xlab = "Residual Values",
     ylab = "Frequency",  # Labelling of axis
     main = "") # Removing preset title

Shapiro-Wilk:

Testing for data normality in the residuals

shapiro.test(blotch.clean$residuals) # Running code to test for normality


    Shapiro-Wilk normality test

data:  blotch.clean$residuals
W = 0.9978, p-value = 0.7661

The residual plots aid in checking the model’s validity, with the normality and showing the residuals follow a linear model.

plot(blotch.clean) # Plotting the residuals

The residuals vs fitted plot shows a reasonably straight, which indicates that the residuals likely follow a linear model, and this is the appropriate test to use. The Q-Q plot or Quantile-Quantile plot identifies that the data follows normality by not straying from the line; this is supported by the Shapiro-Wilk test.

ANOVA

Testing for the overall significance of the model

blotch.anova <- anova(blotch.clean) # Running the ANOVA test on the linear model
blotch.anova  # Displaying the results of the ANOVA test

Analysis of Variance Table

Response: blotch
           Df Sum Sq Mean Sq F value    Pr(>F)    
depth       1 518.70  518.70 529.739 < 2.2e-16 ***
sex         1  11.48   11.48  11.727 0.0006671 ***
Residuals 497 486.64    0.98                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This ANOVA test identifies that depth has the most notable impact on the blotching of sharks, producing a far more significant F value than sex. However, sex is still somewhat of a predictor for the effect of blotching on the sharks.