options(repos = c(CRAN = "https://cloud.r-project.org"))
# sets a CRAN mirror to allow package installation RM&DA Summative Assignment
Summative Assessment
Setting up R- general housekeeping
Installing the function packages needed
echo = FALSE
install.packages("dplyr")package 'dplyr' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\finch\AppData\Local\Temp\RtmpYnI2u6\downloaded_packages
install.packages("corrplot")package 'corrplot' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\finch\AppData\Local\Temp\RtmpYnI2u6\downloaded_packages
install.packages("broom")package 'broom' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\finch\AppData\Local\Temp\RtmpYnI2u6\downloaded_packages
#installs the specified R packages so functions can be usedlibrary(tidyverse) #for various functions
library(psych) #for various fuctions including "describe"
library(readxl) #allows excel documents to be read
library(dplyr) #for various functions
library(corrplot) #for correlation visualisation
library(broom) #for modeling arguments
library(lmtest) #for testing modelling assumptions Importing the data
sharks <- read_excel("C:/Users/finch/OneDrive/Documents/MSc Biodiversity Conservation/Research Methods & Data Analysis/Assignment/sharks.xlsx")
sharksub <- read_excel("C:/Users/finch/OneDrive/Documents/MSc Biodiversity Conservation/Research Methods & Data Analysis/Assignment/sharksub.xlsx")
# using "read_excel" fuction to read and import the excel files containing the data; specififyig file pathway to find said filesGeneral understanding
using view to load visual tables into R to see all the data
descriptive stats for both data sets for basic understanding
General data exploration
- creating very basic plots to explore the relationships, trends etc of variables that would be assumed to show some form of correlation
ggplot(sharks, aes(x = length, y = weight)) +
geom_line() +
labs(title = "Relationship between length and weight of caught sharks", x = "Length of shark (cm)", y = "Weight of shark (kg)")#creates a line graph plotting length and weight to see if longer sharks are heaviercomments:
- There is no relationship between the weight and length of the shark
I’ll also see if there is a relationship between blotch time and BPM
ggplot(sharks, aes(x = blotch, y = BPM)) +
geom_point() +
labs(x = "Blotch time (seconds)", y = "BPM")# creates a scatter plot of blotch time and BPM correlationcomments:
There is no trend showing a higher heart rate results in faster or slower blotching time.
No correlation seen
1- Correlation between air and water variables
Visuals
Scatter plot
- to view any correlation between air and water temperatures
#|label: scatter plot- air and water
ggplot(sharks, aes(x = air, y = water)) +
geom_point() +
geom_smooth(method = "lm", color = "red") +
labs(title = "Relationship between air and water temperatures", x = "Air temperature (°c)", y = "Water temperature (°c)")# creates a scatter plot of air and water temperatues to visualise relationship, with a trend line added comments:
- the scatter plot shows there is no correlation or trend in the two temperature variables. They do not increase or decrease with any relationship to one another
Boxplot
- shows comparison of range and values to aid data understanding
airwater <- sharks %>%
select(air, water) %>% #selects the relevent columns
pivot_longer(cols = everything(), #Shapes columns into a long format
names_to = "Variable", #Creats column to store variable names
values_to = "Temperature")
# this chunk reshapes the separate air and water columns into a singular variable column using a long format to combine them
# doing this makes it eaiser to plot the two variables onto a singular boxplot visual ggplot(airwater, aes(x = Variable, y = Temperature, fill = Variable)) +
geom_boxplot() +
labs(title = "Comparitive Air and Water Temperatures",
x = "Variable", y = "Temperature(°c)") +
theme_minimal() + # for cleaner display
scale_fill_manual(values = c("#FF9900", "#6699CC")) # sets colours for both variables# creates a boxlpot of the two variabes to allow them to be viewed side by sidecomments:
- This makes it easier to understand the temperatures by visualizing them side by side. There is no overlap in range, no outliers to note and the means can be easily assessed.
Next to test the normality of the two variables so appropriate statistical analysis into the relationship can be done
Normality testing- Shapiro-wilk
shapiro.test(sharks$air)
Shapiro-Wilk normality test
data: sharks$air
W = 0.95885, p-value = 1.338e-10
# p < 0.05shapiro.test(sharks$water)
Shapiro-Wilk normality test
data: sharks$water
W = 0.96035, p-value = 2.371e-10
# p < 0.05comments:
- in both cases, p < 0.05 so data is not normally distributed
Since data is not normally distributed, non parametric tests can be used in statistical analysis
Correlation test-Spearman rank
- non parametric test for correlations
cor.test(sharks$air, sharks$water)
Pearson's product-moment correlation
data: sharks$air and sharks$water
t = -1.2346, df = 498, p-value = 0.2176
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.14224207 0.03260803
sample estimates:
cor
-0.05524051
# p > 0.05comments:
p > 0.05 suggests there is no statistical significance between air temperate and water temperate
cor = -0.06 (2dp) suggests no correlation as value is very close to 0
For the context of this research, there is no link between the two temperature variables.
Blotch exploration (general)
- Exploring the blotch data from “sharks” to get a better understanding of it
describe(sharks$blotch) vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 500 35.13 1.43 35.05 35.11 1.38 30.78 40.08 9.31 0.15 0.2 0.06
#descriptive stats table comments:
- makes basic descriptive stats easy to view
Using a boxplot to view the data
ggplot(sharks, aes(y = blotch)) +
geom_boxplot(fill = "skyblue") +
labs(title = "Blotch time", y = "seconds") +
theme_minimal()# creates a boxplot showing the blotch time of the sharks datacomments:
- 4 outlier values
- useful for a general understanding and visualization
- X axis scale is a little confusing
To make the data both easier to understand and add more depth to what it shows, I’ll separate the blotching times by sex to show a better representation of the sampled population
ggplot(sharks, aes(x = factor(sex), y = blotch, fill = factor(sex))) +
geom_boxplot() +
labs(title = "Boxplot of Blotch Values by Sex",
x = "Sex",
y = "Blotch Value") +
scale_fill_manual(values = c("pink", "lightblue"), name = "Sex") +
theme_minimal()# creates a box plot of the blotching time with female and male sexes viewed side by side, with added colour to support viewing comments:
- there is a difference in mean with males having a higher mean than females.
using a t-test to see if the mean between the two groups is statistically significant
t.test(blotch ~ sex, data = sharks)
Welch Two Sample t-test
data: blotch by sex
t = -3.0282, df = 494.67, p-value = 0.002589
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
-0.6322714 -0.1346620
sample estimates:
mean in group Female mean in group Male
34.92294 35.30641
comments:
p < 0.05 so means are statistically significantly different
t = -3.03 (2dp) so mean of males is less than mean of females
2- Multiple captures
- descriptive stats to get a basic understanding of the two variables
describe(sharksub) vars n mean sd median trimmed mad min max range skew
ID* 1 50 25.50 14.58 25.50 25.50 18.53 1.00 50.00 49.00 0.00
sex* 2 50 1.50 0.51 1.50 1.50 0.74 1.00 2.00 1.00 0.00
blotch1 3 50 35.03 1.10 34.94 35.07 1.03 32.49 37.07 4.58 -0.20
blotch2 4 50 35.96 1.16 35.94 35.98 1.06 33.47 38.18 4.72 -0.08
kurtosis se
ID* -1.27 2.06
sex* -2.04 0.07
blotch1 -0.65 0.15
blotch2 -0.78 0.16
Visualisations
sscombine <- sharksub %>%
pivot_longer(cols = c(blotch1, blotch2), # reshapes data into long format fro ease of use
names_to = "Capture", # combines blotch1 and blotch2 into singular column
values_to = "Blotch_Time") # reshapes blotching time values into singular column
# this chunk combines blotch 1 and 2 into a singular column to they can be easily displayed side by side on a boxplotggplot(sscombine, aes(x = Capture, y = Blotch_Time, fill = Capture)) +
geom_boxplot() +
scale_fill_manual(values = c("lightgreen", "lightblue")) +
labs(title = "Blotching Times for First and Second Capture events",
x = "Capture Event",
y = "Blotch Time (seconds)") +
theme_minimal()# this chunk codes the boxplot to show blotch times between 1st and 2nd capture side by side with color added to aid visualising comments:
first capture- mean = 35.03; min = 32.49; max = 37.07; range = 4.58
second capture- mean = 35.96; min = 33.47; max = 38.18; range = 4.72
second capture had overall longer blotch times compared to first capture which suggests multiple captures does affect blotching times- it takes longer to blotch upon a second capture
Next to use statistical analysis to evaluate if the means difference in significant
stats analysis
- starting with distribution normality testing
shapiro.test(sharksub$blotch1)
Shapiro-Wilk normality test
data: sharksub$blotch1
W = 0.97958, p-value = 0.5345
# p > 0.05 = normally distributed shapiro.test(sharksub$blotch2)
Shapiro-Wilk normality test
data: sharksub$blotch2
W = 0.97936, p-value = 0.5255
# p > 0.05 = normally distributed comments:
- both variables have normal distributions with p > 0.05
Now to test for statistical significance in the means
assumption: two samples are dependent
t.test(sharksub$blotch1, sharksub$blotch2, paired = TRUE)
Paired t-test
data: sharksub$blotch1 and sharksub$blotch2
t = -17.39, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-1.037176 -0.822301
sample estimates:
mean difference
-0.9297384
comments:
p > 0.05 - this means the null hypothesis is rejected, suggesting there is statistical significance between the two means
t = -4.1143 - the negative value shows the mean of the first group is smaller than that of the second
3- Predicting blotching times
in order to build a predictive model, I need to establish which variables, if any, have correlations or relationships. Any that do can act as predictors for the model.
rather than examine them one by one, i can use a correlation matrix to assess all the variable relationships in one go
Correlation matrix
numeric_var <- sharks[, sapply(sharks, is.numeric)]
# creates a new usable data set containing only numeric variables as categorical variables cannot be analised in this waycormatrix <- cor(numeric_var, use = "complete.obs") # ensures removal of any n/a values
corround <- round(cormatrix, 3) # rounds all results to 3 d.p
print(corround) # prints the results below blotch BPM weight length air water meta depth
blotch 1.000 -0.029 0.009 -0.016 -0.038 -0.052 -0.010 0.714
BPM -0.029 1.000 0.017 -0.069 -0.068 0.025 -0.006 -0.012
weight 0.009 0.017 1.000 -0.020 -0.053 0.086 0.020 -0.006
length -0.016 -0.069 -0.020 1.000 -0.030 -0.059 0.003 -0.083
air -0.038 -0.068 -0.053 -0.030 1.000 -0.055 0.125 -0.012
water -0.052 0.025 0.086 -0.059 -0.055 1.000 0.022 -0.041
meta -0.010 -0.006 0.020 0.003 0.125 0.022 1.000 0.008
depth 0.714 -0.012 -0.006 -0.083 -0.012 -0.041 0.008 1.000
# this chunk codes for a correlation matrix which prints as a table formcorrplot(cormatrix,
method = "color",#uses the colour version
type = "upper", #only shows upper triangle
addCoef.col = "black", #sets colour for correlation coefficient values
tl.col = "black", #sets text colour
tl.srt = 45, #rotation angle of axis lables
number.cex = 0.7, #adjusts numeric text size
title = "Correlation matrix",
mar = c(0,0,1,0)) #sets plot margins#this chunk produces a correlation visual form of the correlaiton matrix with a color gradient used to simplify viewing and understandingcomments:
visual matrix shows the correlation coefficients between all the numerical values
the limits for coefficiant values are +1 or -1
1 is a perfect positive correlation. the closer to 1 the value is, the stronger the positive correlation is- as one variable increases, the other also increases
-1 is a perfect negative correlation. the closer to -1 the value is, the stronger the negative correlation is- as one variable increases, the other decreases
0 means no correlation. the closer to 0 the value is, the weaker and more absent the correlation is- the variables have no correlation
there is a strong correlation between the depth of capture and blotching time. sharks hooked from a deeper depth have longer blotching times // sharks captured at shallower depths had quicker blotching times
To confirm the correlation between hook depth and blotching time, I’ll create an independent plot to isolate the focus of these variables.
ggplot(sharks, aes(x = depth, y = blotch)) +
geom_point() +
geom_smooth(method = "lm", color = "red") +
labs(title = "Blotch time and hooked depth correlation",
x = "Hooked depth (Metres)",
y = "Blotch time (seconds)")# creates a scatter plot for capture depth and blotching time with a linear trend linecomments:
positive correlation seen
sharks hooked at deeper depths have longer blotching times
supporting this correlation with statistical analysis
Statistical analysis
testing if the relationship is linear or non linear
cor.test(sharks$blotch, sharks$depth, method = "pearson")
Pearson's product-moment correlation
data: sharks$blotch and sharks$depth
t = 22.772, df = 498, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6683963 0.7546509
sample estimates:
cor
0.7142247
# testing if relationship is linear comments:
linear relationship- p < 0.05
pearson correlation max is +1, closer to this the stronger the positive correlation is
cor = 0.71, these is a strong positive correlation between blotch and depth
comparing the pearson correlation (which assumes linearity) to spearmans correlation (which assumes non-linearity and monotonic)
cor.test(sharks$blotch, sharks$depth, method = "spearman")
Spearman's rank correlation rho
data: sharks$blotch and sharks$depth
S = 6797520, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.6737177
# testing if relationship is nonlinear and monotoniccomments:
- p < 0.05, rho = 0.67
The pearson correlation has a higher correlation value than spearman (0.71 vs 0.67) so I can assume the relationship is linear over nonlinear.
Predictive model
- blotching time is the target for analysis with depth of capture being the predictor variable
blotchmodel <- lm(blotch ~ depth, data = sharks) #creates the model base and states the target (bloth) and predictor (depth)
summary(blotchmodel) #prints the output for viewing
Call:
lm(formula = blotch ~ depth, data = sharks)
Residuals:
Min 1Q Median 3Q Max
-2.81869 -0.65427 -0.01035 0.58825 2.83116
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.82178 1.11207 8.832 <2e-16 ***
depth 0.50467 0.02216 22.772 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1 on 498 degrees of freedom
Multiple R-squared: 0.5101, Adjusted R-squared: 0.5091
F-statistic: 518.6 on 1 and 498 DF, p-value: < 2.2e-16
# codes for a linear model with the predictor of depth acting on the target of blotch timepar(mfrow = c(2,2)) # splits display into 2x2 grid for better viewing
plot(blotchmodel) # generates the visual plots# dispays visual plots for the model created in the chunk above, allowing interpretation comments:
residuals: checks for non linear variance and homoscedasicity
Q-Q residuals: compares the residual distribution to the theoretical normal distribution
Scale- location: shows relationship between the residuals and fitted values
Residuals vs Leverage: shows outliers and points of key interest
Checking assumptions for linear model
Residual plot
plot(fitted(blotchmodel), resid(blotchmodel),
main = "Residual Plot",
xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red", lwd = 2)# recreates the residual plot from the linear model output for better viewing comments:
- no curve or pattern visible which suggests linear relationship
supporting residual plot for heteroscedasiticity testing
bptest(blotchmodel)
studentized Breusch-Pagan test
data: blotchmodel
BP = 1.4255, df = 1, p-value = 0.2325
# conducts statistical test for heteroscedasiticty comments:
p > 0.05 which suggests no hetersoscedasticity (variance)
Constant variance can be assumed (homoscedasticity)
testing the residual normality
shapiro.test(resid(blotchmodel))
Shapiro-Wilk normality test
data: resid(blotchmodel)
W = 0.99748, p-value = 0.653
# tests the normaility of residuals from the linear model outputcomments:
- p > 0.05 so residuals are normality distributed
QQ Plot
sharks$stand_residuals <- rstandard(blotchmodel) #adds standardised residuals to sharks data
ggplot(sharks, aes(sample = stand_residuals)) +
stat_qq() + # adds q-q plots
stat_qq_line(color = "red") + # adds line showing perfect normal distribution
labs(title = "Normality Q-Q plot", x = "Therotetical values", y = "standard values") +
theme_minimal()# recreated the QQ plot from the linear mdoel output for better viewingcomments:
- points lie close to line of idea normal distribution suggesting normal distributions
Evaluating model fit
summary(blotchmodel)
Call:
lm(formula = blotch ~ depth, data = sharks)
Residuals:
Min 1Q Median 3Q Max
-2.81869 -0.65427 -0.01035 0.58825 2.83116
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.82178 1.11207 8.832 <2e-16 ***
depth 0.50467 0.02216 22.772 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1 on 498 degrees of freedom
Multiple R-squared: 0.5101, Adjusted R-squared: 0.5091
F-statistic: 518.6 on 1 and 498 DF, p-value: < 2.2e-16
comments:
R² measures variance proportions of blotch in regard to the predictor of depth. Value of 0.51 translates to 51% of blotch time variance being explained by capture depth
shows depth is a reasonably strong predictor, with 49% of variance being caused by other variables
R² adjusted gives a more realistic measure of models power.
Both R values are incredibly similar as model only uses one predictor