RM&DA Summative Assignment

Author

FD N1349335

Summative Assessment

Setting up R- general housekeeping

Installing the function packages needed

options(repos = c(CRAN = "https://cloud.r-project.org")) 
# sets a CRAN mirror to allow package installation

echo = FALSE
install.packages("dplyr")

package 'dplyr' successfully unpacked and MD5 sums checked


The downloaded binary packages are in
    C:\Users\finch\AppData\Local\Temp\RtmpYnI2u6\downloaded_packages

install.packages("corrplot")

package 'corrplot' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\finch\AppData\Local\Temp\RtmpYnI2u6\downloaded_packages

install.packages("broom")

package 'broom' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\finch\AppData\Local\Temp\RtmpYnI2u6\downloaded_packages

#installs the specified R packages so functions can be used

library(tidyverse)  #for various functions 
library(psych)      #for various fuctions including "describe"
library(readxl)     #allows excel documents to be read
library(dplyr)      #for various functions
library(corrplot)   #for correlation visualisation
library(broom)      #for modeling arguments 
library(lmtest)     #for testing modelling assumptions

Importing the data

sharks <- read_excel("C:/Users/finch/OneDrive/Documents/MSc Biodiversity Conservation/Research Methods & Data Analysis/Assignment/sharks.xlsx")

sharksub <- read_excel("C:/Users/finch/OneDrive/Documents/MSc Biodiversity Conservation/Research Methods & Data Analysis/Assignment/sharksub.xlsx")

# using "read_excel" fuction to read and import the excel files containing the data; specififyig file pathway to find said files

General understanding

using view to load visual tables into R to see all the data
descriptive stats for both data sets for basic understanding

sharks data

view(sharks)

describe(sharks)

       vars   n   mean     sd median trimmed    mad    min    max  range  skew
ID*       1 500 250.50 144.48 250.50  250.50 185.32   1.00 500.00 499.00  0.00
sex*      2 500   1.53   0.50   2.00    1.53   0.00   1.00   2.00   1.00 -0.11
blotch    3 500  35.13   1.43  35.05   35.11   1.38  30.78  40.08   9.31  0.15
BPM       4 500 141.76  14.14 142.00  141.54  19.27 119.00 166.00  47.00  0.08
weight    5 500  87.94  13.46  87.82   88.03  18.36  65.10 110.94  45.84 -0.02
length    6 500 211.04  46.61 211.07  211.24  59.51 128.25 290.95 162.70 -0.02
air       7 500  35.54   1.43  35.43   35.53   1.71  33.00  38.00   5.00  0.04
water     8 500  23.02   1.67  23.11   23.02   2.08  20.01  25.99   5.98 -0.01
meta      9 500  82.04  17.44  82.45   82.14  20.70  50.03 112.45  62.42 -0.05
depth    10 500  50.14   2.02  50.14   50.13   1.82  44.64  56.83  12.18  0.03
       kurtosis   se
ID*       -1.21 6.46
sex*      -1.99 0.02
blotch     0.20 0.06
BPM       -1.24 0.63
weight    -1.28 0.60
length    -1.19 2.08
air       -1.14 0.06
water     -1.15 0.07
meta      -1.10 0.78
depth     -0.06 0.09

Sharksub data

view(sharksub)

describe(sharksub)

        vars  n  mean    sd median trimmed   mad   min   max range  skew
ID*        1 50 25.50 14.58  25.50   25.50 18.53  1.00 50.00 49.00  0.00
sex*       2 50  1.50  0.51   1.50    1.50  0.74  1.00  2.00  1.00  0.00
blotch1    3 50 35.03  1.10  34.94   35.07  1.03 32.49 37.07  4.58 -0.20
blotch2    4 50 35.96  1.16  35.94   35.98  1.06 33.47 38.18  4.72 -0.08
        kurtosis   se
ID*        -1.27 2.06
sex*       -2.04 0.07
blotch1    -0.65 0.15
blotch2    -0.78 0.16

General data exploration

creating very basic plots to explore the relationships, trends etc of variables that would be assumed to show some form of correlation

ggplot(sharks, aes(x = length, y = weight)) + 
  geom_line() + 
  labs(title = "Relationship between length and weight of caught sharks", x = "Length of shark (cm)", y = "Weight of shark (kg)")

#creates a line graph plotting length and weight to see if longer sharks are heavier

comments:

There is no relationship between the weight and length of the shark

I’ll also see if there is a relationship between blotch time and BPM

ggplot(sharks, aes(x = blotch, y = BPM)) + 
  geom_point() + 
  labs(x = "Blotch time (seconds)", y = "BPM")

# creates a scatter plot of blotch time and BPM correlation

comments:

There is no trend showing a higher heart rate results in faster or slower blotching time.
No correlation seen

1- Correlation between air and water variables

Visuals

Scatter plot

to view any correlation between air and water temperatures

#|label: scatter plot- air and water

ggplot(sharks, aes(x = air, y = water)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Relationship between air and water temperatures", x = "Air temperature (°c)", y = "Water temperature (°c)")

# creates a scatter plot of air and water temperatues to visualise relationship, with a trend line added

comments:

the scatter plot shows there is no correlation or trend in the two temperature variables. They do not increase or decrease with any relationship to one another

Boxplot

shows comparison of range and values to aid data understanding

airwater <- sharks %>%
  select(air, water) %>% #selects the relevent columns
  pivot_longer(cols = everything(), #Shapes columns into a long format
               names_to = "Variable", #Creats column to store variable names
               values_to = "Temperature") 

# this chunk reshapes the separate air and water columns into a singular variable column using a long format to combine them
# doing this makes it eaiser to plot the two variables onto a singular boxplot visual

ggplot(airwater, aes(x = Variable, y = Temperature, fill = Variable)) +
  geom_boxplot() +
  labs(title = "Comparitive Air and Water Temperatures",
       x = "Variable", y = "Temperature(°c)") +
  theme_minimal() + # for cleaner display
  scale_fill_manual(values = c("#FF9900", "#6699CC")) # sets colours for both variables

# creates a boxlpot of the two variabes to allow them to be viewed side by side

comments:

This makes it easier to understand the temperatures by visualizing them side by side. There is no overlap in range, no outliers to note and the means can be easily assessed.

Next to test the normality of the two variables so appropriate statistical analysis into the relationship can be done

Normality testing- Shapiro-wilk

shapiro.test(sharks$air)


    Shapiro-Wilk normality test

data:  sharks$air
W = 0.95885, p-value = 1.338e-10

# p < 0.05

shapiro.test(sharks$water)


    Shapiro-Wilk normality test

data:  sharks$water
W = 0.96035, p-value = 2.371e-10

# p < 0.05

comments:

in both cases, p < 0.05 so data is not normally distributed

Since data is not normally distributed, non parametric tests can be used in statistical analysis

Correlation test-Spearman rank

non parametric test for correlations

cor.test(sharks$air, sharks$water)


    Pearson's product-moment correlation

data:  sharks$air and sharks$water
t = -1.2346, df = 498, p-value = 0.2176
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.14224207  0.03260803
sample estimates:
        cor 
-0.05524051

# p > 0.05

comments:

p > 0.05 suggests there is no statistical significance between air temperate and water temperate
cor = -0.06 (2dp) suggests no correlation as value is very close to 0
For the context of this research, there is no link between the two temperature variables.

Blotch exploration (general)

Exploring the blotch data from “sharks” to get a better understanding of it

describe(sharks$blotch)

   vars   n  mean   sd median trimmed  mad   min   max range skew kurtosis   se
X1    1 500 35.13 1.43  35.05   35.11 1.38 30.78 40.08  9.31 0.15      0.2 0.06

#descriptive stats table

comments:

makes basic descriptive stats easy to view

Using a boxplot to view the data

ggplot(sharks, aes(y = blotch)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Blotch time", y = "seconds") +
  theme_minimal()

# creates a boxplot showing the blotch time of the sharks data

comments:

4 outlier values
useful for a general understanding and visualization
X axis scale is a little confusing

To make the data both easier to understand and add more depth to what it shows, I’ll separate the blotching times by sex to show a better representation of the sampled population

ggplot(sharks, aes(x = factor(sex), y = blotch, fill = factor(sex))) +
  geom_boxplot() +
  labs(title = "Boxplot of Blotch Values by Sex",
       x = "Sex",
       y = "Blotch Value") +
  scale_fill_manual(values = c("pink", "lightblue"), name = "Sex") +  
  theme_minimal()

# creates a box plot of the blotching time with female and male sexes viewed side by side, with added colour to support viewing

comments:

there is a difference in mean with males having a higher mean than females.

using a t-test to see if the mean between the two groups is statistically significant

t.test(blotch ~ sex, data = sharks)


    Welch Two Sample t-test

data:  blotch by sex
t = -3.0282, df = 494.67, p-value = 0.002589
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -0.6322714 -0.1346620
sample estimates:
mean in group Female   mean in group Male 
            34.92294             35.30641

comments:

p < 0.05 so means are statistically significantly different
t = -3.03 (2dp) so mean of males is less than mean of females

2- Multiple captures

descriptive stats to get a basic understanding of the two variables

describe(sharksub)

        vars  n  mean    sd median trimmed   mad   min   max range  skew
ID*        1 50 25.50 14.58  25.50   25.50 18.53  1.00 50.00 49.00  0.00
sex*       2 50  1.50  0.51   1.50    1.50  0.74  1.00  2.00  1.00  0.00
blotch1    3 50 35.03  1.10  34.94   35.07  1.03 32.49 37.07  4.58 -0.20
blotch2    4 50 35.96  1.16  35.94   35.98  1.06 33.47 38.18  4.72 -0.08
        kurtosis   se
ID*        -1.27 2.06
sex*       -2.04 0.07
blotch1    -0.65 0.15
blotch2    -0.78 0.16

Visualisations

sscombine <- sharksub %>% 
  pivot_longer(cols = c(blotch1, blotch2), # reshapes data into long format fro ease of use 
               names_to = "Capture", # combines blotch1 and blotch2 into singular column 
               values_to = "Blotch_Time") # reshapes blotching time values into singular column 

# this chunk combines blotch 1 and 2 into a singular column to they can be easily displayed side by side on a boxplot

ggplot(sscombine, aes(x = Capture, y = Blotch_Time, fill = Capture)) +
  geom_boxplot() +
  scale_fill_manual(values = c("lightgreen", "lightblue")) +
  labs(title = "Blotching Times for First and Second Capture events",
       x = "Capture Event", 
       y = "Blotch Time (seconds)") +
  theme_minimal()

# this chunk codes the boxplot to show blotch times between 1st and 2nd capture side by side with color added to aid visualising

comments:

first capture- mean = 35.03; min = 32.49; max = 37.07; range = 4.58
second capture- mean = 35.96; min = 33.47; max = 38.18; range = 4.72
second capture had overall longer blotch times compared to first capture which suggests multiple captures does affect blotching times- it takes longer to blotch upon a second capture

Next to use statistical analysis to evaluate if the means difference in significant

stats analysis

starting with distribution normality testing

shapiro.test(sharksub$blotch1)


    Shapiro-Wilk normality test

data:  sharksub$blotch1
W = 0.97958, p-value = 0.5345

# p > 0.05 = normally distributed

shapiro.test(sharksub$blotch2)


    Shapiro-Wilk normality test

data:  sharksub$blotch2
W = 0.97936, p-value = 0.5255

# p > 0.05 = normally distributed

comments:

both variables have normal distributions with p > 0.05

Now to test for statistical significance in the means

assumption: two samples are dependent

t.test(sharksub$blotch1, sharksub$blotch2, paired = TRUE)


    Paired t-test

data:  sharksub$blotch1 and sharksub$blotch2
t = -17.39, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -1.037176 -0.822301
sample estimates:
mean difference 
     -0.9297384

comments:

p > 0.05 - this means the null hypothesis is rejected, suggesting there is statistical significance between the two means
t = -4.1143 - the negative value shows the mean of the first group is smaller than that of the second

3- Predicting blotching times

in order to build a predictive model, I need to establish which variables, if any, have correlations or relationships. Any that do can act as predictors for the model.
rather than examine them one by one, i can use a correlation matrix to assess all the variable relationships in one go

Correlation matrix

numeric_var <- sharks[, sapply(sharks, is.numeric)] 

# creates a new usable data set containing only numeric variables as categorical variables cannot be analised in this way

cormatrix <- cor(numeric_var, use = "complete.obs")  # ensures removal of any n/a values
corround <- round(cormatrix, 3)                      # rounds all results to 3 d.p
print(corround)                                      # prints the results below

       blotch    BPM weight length    air  water   meta  depth
blotch  1.000 -0.029  0.009 -0.016 -0.038 -0.052 -0.010  0.714
BPM    -0.029  1.000  0.017 -0.069 -0.068  0.025 -0.006 -0.012
weight  0.009  0.017  1.000 -0.020 -0.053  0.086  0.020 -0.006
length -0.016 -0.069 -0.020  1.000 -0.030 -0.059  0.003 -0.083
air    -0.038 -0.068 -0.053 -0.030  1.000 -0.055  0.125 -0.012
water  -0.052  0.025  0.086 -0.059 -0.055  1.000  0.022 -0.041
meta   -0.010 -0.006  0.020  0.003  0.125  0.022  1.000  0.008
depth   0.714 -0.012 -0.006 -0.083 -0.012 -0.041  0.008  1.000

# this chunk codes for a correlation matrix which prints as a table form

corrplot(cormatrix, 
         method = "color",#uses the colour version 
         type = "upper", #only shows upper triangle 
         addCoef.col = "black",  #sets colour for correlation coefficient values 
         tl.col = "black", #sets text colour 
         tl.srt = 45, #rotation angle of axis lables
         number.cex = 0.7, #adjusts numeric text size
         title = "Correlation matrix",  
         mar = c(0,0,1,0)) #sets plot margins

#this chunk produces a correlation visual form of the correlaiton matrix with a color gradient used to simplify viewing and understanding

comments:

visual matrix shows the correlation coefficients between all the numerical values
the limits for coefficiant values are +1 or -1
1 is a perfect positive correlation. the closer to 1 the value is, the stronger the positive correlation is- as one variable increases, the other also increases
-1 is a perfect negative correlation. the closer to -1 the value is, the stronger the negative correlation is- as one variable increases, the other decreases
0 means no correlation. the closer to 0 the value is, the weaker and more absent the correlation is- the variables have no correlation
there is a strong correlation between the depth of capture and blotching time. sharks hooked from a deeper depth have longer blotching times // sharks captured at shallower depths had quicker blotching times

To confirm the correlation between hook depth and blotching time, I’ll create an independent plot to isolate the focus of these variables.

ggplot(sharks, aes(x = depth, y = blotch)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Blotch time and hooked depth correlation",
       x = "Hooked depth (Metres)",
       y = "Blotch time (seconds)")

# creates a scatter plot for capture depth and blotching time with a linear trend line

comments:

positive correlation seen
sharks hooked at deeper depths have longer blotching times

supporting this correlation with statistical analysis

Statistical analysis

testing if the relationship is linear or non linear

cor.test(sharks$blotch, sharks$depth, method = "pearson")


    Pearson's product-moment correlation

data:  sharks$blotch and sharks$depth
t = 22.772, df = 498, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6683963 0.7546509
sample estimates:
      cor 
0.7142247

# testing if relationship is linear

comments:

linear relationship- p < 0.05
pearson correlation max is +1, closer to this the stronger the positive correlation is
cor = 0.71, these is a strong positive correlation between blotch and depth

comparing the pearson correlation (which assumes linearity) to spearmans correlation (which assumes non-linearity and monotonic)

cor.test(sharks$blotch, sharks$depth, method = "spearman")


    Spearman's rank correlation rho

data:  sharks$blotch and sharks$depth
S = 6797520, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.6737177

# testing if relationship is nonlinear and monotonic

comments:

p < 0.05, rho = 0.67

The pearson correlation has a higher correlation value than spearman (0.71 vs 0.67) so I can assume the relationship is linear over nonlinear.

Predictive model

blotching time is the target for analysis with depth of capture being the predictor variable

blotchmodel <- lm(blotch ~ depth, data = sharks) #creates the model base and states the target (bloth) and predictor (depth)
summary(blotchmodel) #prints the output for viewing


Call:
lm(formula = blotch ~ depth, data = sharks)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81869 -0.65427 -0.01035  0.58825  2.83116 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.82178    1.11207   8.832   <2e-16 ***
depth        0.50467    0.02216  22.772   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 498 degrees of freedom
Multiple R-squared:  0.5101,    Adjusted R-squared:  0.5091 
F-statistic: 518.6 on 1 and 498 DF,  p-value: < 2.2e-16

# codes for a linear model with the predictor of depth acting on the target of blotch time

par(mfrow = c(2,2)) # splits display into 2x2 grid for better viewing
plot(blotchmodel) # generates the visual plots

# dispays visual plots for the model created in the chunk above, allowing interpretation

comments:

residuals: checks for non linear variance and homoscedasicity
Q-Q residuals: compares the residual distribution to the theoretical normal distribution
Scale- location: shows relationship between the residuals and fitted values
Residuals vs Leverage: shows outliers and points of key interest

Checking assumptions for linear model

Residual plot

plot(fitted(blotchmodel), resid(blotchmodel),
     main = "Residual Plot",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red", lwd = 2)

# recreates the residual plot from the linear model output for better viewing

comments:

no curve or pattern visible which suggests linear relationship

supporting residual plot for heteroscedasiticity testing

bptest(blotchmodel)


    studentized Breusch-Pagan test

data:  blotchmodel
BP = 1.4255, df = 1, p-value = 0.2325

# conducts statistical test for heteroscedasiticty

comments:

p > 0.05 which suggests no hetersoscedasticity (variance)
Constant variance can be assumed (homoscedasticity)

testing the residual normality

shapiro.test(resid(blotchmodel))


    Shapiro-Wilk normality test

data:  resid(blotchmodel)
W = 0.99748, p-value = 0.653

# tests the normaility of residuals from the linear model output

comments:

p > 0.05 so residuals are normality distributed

QQ Plot

sharks$stand_residuals <- rstandard(blotchmodel) #adds standardised residuals to sharks data

ggplot(sharks, aes(sample = stand_residuals)) +
  stat_qq() + # adds q-q plots 
  stat_qq_line(color = "red") + # adds line showing perfect normal distribution 
  labs(title = "Normality Q-Q plot", x = "Therotetical values", y = "standard values") +
  theme_minimal()

# recreated the QQ plot from the linear mdoel output for better viewing

comments:

points lie close to line of idea normal distribution suggesting normal distributions

Evaluating model fit

summary(blotchmodel)


Call:
lm(formula = blotch ~ depth, data = sharks)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81869 -0.65427 -0.01035  0.58825  2.83116 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.82178    1.11207   8.832   <2e-16 ***
depth        0.50467    0.02216  22.772   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 498 degrees of freedom
Multiple R-squared:  0.5101,    Adjusted R-squared:  0.5091 
F-statistic: 518.6 on 1 and 498 DF,  p-value: < 2.2e-16

comments:

R² measures variance proportions of blotch in regard to the predictor of depth. Value of 0.51 translates to 51% of blotch time variance being explained by capture depth
shows depth is a reasonably strong predictor, with 49% of variance being caused by other variables
R² adjusted gives a more realistic measure of models power.
Both R values are incredibly similar as model only uses one predictor