Sharks <- read.csv("sharks.csv", # imports shark data set
header = TRUE) # ensures first row of data is imported as variable namesSummative Assignment R Code
Importing Data Sets
A quarto project has been set up in R, containing the sharks and sharksub data, saved as csv files.
Sharksub <- read.csv("sharksub.csv", # imports sharksub data set
header = TRUE) # ensures first row of data is imported as variable namesInstalling necessary packages
The following packages were first installed via the console using the install.packages(““) function, and then called using the library function below.
library(tidyverse) # contains packages such as ggplot2 and dyplr for data visualisation and manipulation
library(corrplot) # for correlation analyses in question 3
library(rstatix) # for identifying outliers in question 3Question 1: Is there a correlation between the variables air and water?
- The 2 variables, air and water, are both quantitative continuous
str(Sharks) # check that the variables air and water are in numerical format'data.frame': 500 obs. of 10 variables:
$ ID : chr "SH001" "SH002" "SH003" "SH004" ...
$ sex : chr "Female" "Female" "Female" "Male" ...
$ blotch: num 37.2 34.5 36.3 35.3 37.4 ...
$ BPM : int 148 158 125 161 138 126 166 135 132 127 ...
$ weight: num 74.7 73.4 71.8 104.6 67.1 ...
$ length: num 187 189 284 171 264 ...
$ air : num 37.7 35.7 34.8 36.2 33.6 ...
$ water : num 23.4 21.4 20.1 21.6 21.8 ...
$ meta : num 64.1 73.7 54.4 86.3 108 ...
$ depth : num 53.2 49.6 49.4 50.3 49 ...
sum(is.na(Sharks)) # checks whether there are any NA values in the data[1] 0
Sharks %>%
ggplot(aes(x = air)) +
geom_histogram() # produce a histogram to see whether the data for air is normally distributedThe histogram suggests that the data is not normally distributed.
shapiro.test(Sharks$air) # check whether the variable 'air' is normally distributed
Shapiro-Wilk normality test
data: Sharks$air
W = 0.95885, p-value = 1.338e-10
The very small P-value (<0.05) means we reject the null hypothesis, and accept that the data for variable ‘air’ is not normally distributed.
Sharks %>%
ggplot(aes(x = water)) +
geom_histogram() # produce a histogram to see whether the data for water is normally distributedThe histogram suggests that the data is not normally distributed.
shapiro.test(Sharks$water) # check whether the variable 'water' is normally distributed
Shapiro-Wilk normality test
data: Sharks$water
W = 0.96035, p-value = 2.371e-10
The very small P-value (<0.05) means we reject the null hypothesis, and accept that the data for variable ‘water’ is not normally distributed.
As both variables are not normally distributed, the test used is the Spearman’s rank correlation test.
Sharks %>%
ggplot(aes(x = air,
y = water)) +
geom_point() + # plots a scatter plot using the variables 'air' and 'water'
geom_smooth(method = "lm") + # adds a line of best fit with standard error, using linear regression
labs(x = "Ambient air temperature (°C)",
y = "Surface water temperature (°C)") +
theme_minimal() +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 10)) # changes font size of axis text and titlesThe scatter plot indicates that there is a very weak negative correlation between the variables.
correlation <- cor.test(x=Sharks$air, y=Sharks$water, method = 'spearman') # calculates Spearman's rank correlation coefficient and saves the output as an objectcorrelation
Spearman's rank correlation rho
data: Sharks$air and Sharks$water
S = 22007692, p-value = 0.2082
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.05637344
A correlation coefficient of -0.056 confirms a very weak negative relationship.
The P-value is greater than 0.05 which means we accept the null hypothesis ie. there is not enough evidence to suggest that the true correlation coefficient is not equal to zero.
Conclusion: there is no statistically significant correlation between the variables air and water.
Question 2: Does multiple capture have an effect on blotching time?
- Using sharksub dataset
- Capture ID is a categorical variable and blotching time is continuous
str(Sharksub) # check what format the variables are in'data.frame': 50 obs. of 4 variables:
$ ID : chr "SH269" "SH163" "SH008" "SH239" ...
$ sex : chr "Female" "Female" "Female" "Female" ...
$ blotch1: num 36.1 33.4 36.3 35 35.7 ...
$ blotch2: num 37.2 34.4 36.5 36 36.8 ...
Sharksub %>%
ggplot(aes(x = blotch1)) +
geom_histogram() # produce a histogram to see whether data for blotch1 is normally distributedThe histogram suggest a normal distsribution, but it is not perfectly clear.
shapiro.test(Sharksub$blotch1) # check whether blotch1 is normally distributed
Shapiro-Wilk normality test
data: Sharksub$blotch1
W = 0.97958, p-value = 0.5345
As the p-value is greater than 0.05, we accept the null hypothesis that the data is normally distributed.
Sharksub %>%
ggplot(aes(x = blotch2)) +
geom_histogram() # produce a histogram to see whether data for blotch2 is normally distributedThe histogram suggest a normal distribution, but it is not perfectly clear.
shapiro.test(Sharksub$blotch2) # check whether blotch2 is normally distributed
Shapiro-Wilk normality test
data: Sharksub$blotch2
W = 0.97936, p-value = 0.5255
As the p-value is greater than 0.05, we accept the null hypothesis that the data is normally distributed.
As both variables are normally distributed, we can use a paired T-test to do a comparison of means for blotch1 and blotch 2.
multiple.capture <- t.test(Sharksub$blotch1, Sharksub$blotch2,
paired = TRUE) # paired sample T-test to compare time taken to blotch on two separate captures per shark
multiple.capture
Paired t-test
data: Sharksub$blotch1 and Sharksub$blotch2
t = -17.39, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-1.037176 -0.822301
sample estimates:
mean difference
-0.9297384
Mean difference of -0.9 suggests that blotching time is slower on second capture.
This difference is statistically significant as P < 0.05, so we reject the null hypothesis, and accept that there is a difference in blotching time with multiple capture.
sharksub_adapted <- Sharksub %>%
pivot_longer(cols = starts_with("blotch"),
names_to = "Capture",
values_to = "Blotching_Time") %>%
mutate(Capture = recode(Capture,
blotch1 = "Capture 1",
blotch2 = "Capture 2")) # alters the data to aid production of a boxplot. Changing blotch names to capture, as well as naming each data entry as a blotching time
sharksub_adapted %>%
ggplot(aes(x = Capture,
y = Blotching_Time)) +
geom_boxplot() + # produces boxplot of time taken to blotch in first and second capture
theme_minimal() +
labs(x = "Capture number",
y = "Time taken to blotch (seconds)") +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 10)) # changes font size of axis text and titlesThe data also provides information on the sex of each shark. It would be interesting to know if sex has any influence on blotching time with multiple capture.
Sharksub %>%
group_by(sex) %>% # groups sharksub data by sex
summarise(x = n()) %>% # summarises data by number of individuals per sex
ungroup() # ungroups data to ensure the original is not altered# A tibble: 2 × 2
sex x
<chr> <int>
1 Female 25
2 Male 25
Here we can see that equal numbers of female and male sharks were tested, and therefore the data is not likely to be skewed by sex.
sharksub_adapted %>% # uses same data set that was used to create the previous boxplot
ggplot(aes(x = Capture,
y = Blotching_Time,
color = sex)) + # separates the data by sex
geom_boxplot() + # produces boxplot of time taken to blotch in first and second capture
theme_minimal() +
scale_color_brewer(palette = "Dark2") + # uses a colour palette that is color-blind friendly
labs(x = "Capture number",
y = "Time taken to blotch (seconds)") +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 10)) # changes font size of axis text and titlesThe boxplot shows that median blotching time is greater in males in both captures. There is also greater variability of the interquartile ranges in males than in females. The time taken to blotch increases in both males and females with multiple capture. There is large overlap of both the boxes and whiskers of the plot for each sex, indicating that differences may not be significant.
To determine whether differences between blotching times by sex are significant, an independent sample t-test can be used, to compare the means.
One assumption of the t-test is that the two samples have similar variance. The boxplot shows that this is true, but this can also be confirmed in R;
Sharksub %>%
group_by(sex) %>% # groups sharksub data by sex
summarise(variance_blotch1 = var(blotch1), # summarises variance of blotch 1 data
variance_blotch2 = var(blotch2)) %>% # summarises variance of blotch 2 data
ungroup() # final ungrouping of data# A tibble: 2 × 3
sex variance_blotch1 variance_blotch2
<chr> <dbl> <dbl>
1 Female 1.16 1.23
2 Male 1.18 1.44
blotch1.t.test <- Sharksub %>%
t.test(blotch1 ~ sex, data = .) # independent sample t-test for sharksub data, of difference in mean time taken to blotch in males and females (for first capture)
blotch1.t.test
Welch Two Sample t-test
data: blotch1 by sex
t = -1.4629, df = 47.997, p-value = 0.15
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
-1.0644425 0.1678337
sample estimates:
mean in group Female mean in group Male
34.80627 35.25457
As P > 0.05, we accept the null hypothesis, and accept that the difference is not statistically significant.
blotch2.t.test <- Sharksub %>%
t.test(blotch2 ~ sex, data = .) # independent sample t-test for sharksub data, of difference in mean time taken to blotch in males and females (for second capture)
blotch2.t.test
Welch Two Sample t-test
data: blotch2 by sex
t = -1.2362, df = 47.702, p-value = 0.2224
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
-1.0622680 0.2534615
sample estimates:
mean in group Female mean in group Male
35.75796 36.16236
As P > 0.05, we accept the null hypothesis, and accept that the difference is not statistically significant.
Conclusion: time taken to blotch increases with multiple capture, but there is no significant effect of sex on blotching time.
Question 3: Is it possible to predict blotching time?
A balloon plot can be produced, to evaluate correlations between the variables in the ‘Sharks’ dataset, as a helpful starting point.
The corrplot function can only analyse numeric values. Shark ID does not need to be included, and sex will be looked at separately.
corrplot(corr = cor(Sharks[, 3:10]), # produces a correlation matrix of the sharks data, excluding columns for ID and sex
method = "ellipse", # presents correlations as ellipses, with direction and spread
type = "lower", # changes shape of balloon plot
addCoef.col = "black", # adds numerical correlation coefficient to each balloon
number.cex = 0.5, # reduces text size of coefficients
diag = FALSE, # removes instances where variables are correlated against themselves
bg = "grey", # changes background colour
outline = "black") # changes outline colour of balloons The balloon plot shows a strong positive correlation between blotching time and the depth at which the animal was hooked.
As both variables are quantitative, linear regression can be used to analyse their relationship.
Sharks %>%
ggplot(aes(x = depth,
y = blotch)) +
geom_point() + # produces scatter plot of depth and blotching time
geom_smooth(method = "lm") + # adds a line of best fit with standard error, using linear regression
labs(x = "Depth at capture (m)",
y = "Time taken to blotch (s)") +
theme_minimal() +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 10)) # changes font size of axis text and titleslinear.model <- lm(blotch ~ depth, data = Sharks) # linear regression model of blotching time as a function of capture depth
summary(linear.model)
Call:
lm(formula = blotch ~ depth, data = Sharks)
Residuals:
Min 1Q Median 3Q Max
-2.81869 -0.65427 -0.01035 0.58825 2.83116
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.82178 1.11207 8.832 <2e-16 ***
depth 0.50467 0.02216 22.772 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1 on 498 degrees of freedom
Multiple R-squared: 0.5101, Adjusted R-squared: 0.5091
F-statistic: 518.6 on 1 and 498 DF, p-value: < 2.2e-16
hist(linear.model$residuals) # plots a histogram of the residuals of the linear modelThe histogram shows that the residuals are evenly distributed around zero ie. the data is not skewed. This means that data can be equally-well predicted at both ends of the range.
shapiro.test(linear.model$residuals) # checks for normality of residuals
Shapiro-Wilk normality test
data: linear.model$residuals
W = 0.99748, p-value = 0.653
The coefficient output produces an equation that can be used to predict blotching time based on depth at capture:
y = 0.50467x + 9.82178
The P-value for the coefficient is very small (close to zero) which shows that the variable is significant to the linear model. This means that we can be confident in using depth to predict blotching.
A R-squared value of 0.5101 suggests that 51% of the variance in blotching time can be explained by depth at capture.
It might be useful to see whether sex has any effect on blotching time in the Sharks data set, as it is much larger.
Sharks %>%
group_by(sex) %>% # groups sharks data by sex
summarise(x = n()) %>% # summarises data by number of individuals per sex
ungroup() # final ungrouping of data# A tibble: 2 × 2
sex x
<chr> <int>
1 Female 236
2 Male 264
More males were captured than females, but there is a large enough sample of each sex for the sample to be representative.
To check that the data have similar variances, and therefore meet the assumptions of the t-test:
Sharks %>%
group_by(sex) %>% # groups sharks data by sex
summarise(variance = var(blotch)) %>% # summarises variance of blotch time for each sex
ungroup() # final ungrouping of data# A tibble: 2 × 2
sex variance
<chr> <dbl>
1 Female 1.94
2 Male 2.06
Also check whether the data is normally distributed:
Sharks %>%
ggplot(aes(x = blotch,
colour = sex)) +
geom_histogram() + # plots histogram of blotching times, separated by sex
scale_color_brewer(palette = "Dark2") # uses a colour palette that is colour-blind friendlyThe histogram suggests that the data is normally distributed.
sharks.males <- Sharks %>%
subset(sex == "Male") # subsets sharks data into males only
sharks.females <- Sharks %>%
subset(sex == "Female") # subsets sharks data into females onlyshapiro.test(sharks.males$blotch) # tests for normality of blotch data for males
Shapiro-Wilk normality test
data: sharks.males$blotch
W = 0.99209, p-value = 0.1701
As P > 0.05, we accept the null hypothesis that the data is normally distributed.
shapiro.test(sharks.females$blotch) # tests for normality of blotch data for females
Shapiro-Wilk normality test
data: sharks.females$blotch
W = 0.99527, p-value = 0.682
As P > 0.05, we accept the null hypothesis that the data is normally distributed.
sharks.t.test <- Sharks %>%
t.test(blotch ~ sex, data = .) # independent sample t-test for sharks data, of difference in mean time taken to blotch in males and females
sharks.t.test
Welch Two Sample t-test
data: blotch by sex
t = -3.0282, df = 494.67, p-value = 0.002589
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
-0.6322714 -0.1346620
sample estimates:
mean in group Female mean in group Male
34.92294 35.30641
As with the Sharksub data, time taken to blotch is higher in males than females. However in this instance, P < 0.05 which means we reject the null hypothesis and accept that the difference in means between males and females is statistically significant. This suggests that the Sharksub data did not contain enough individuals to see a true trend.
Sharks %>%
ggplot(aes(x = sex,
y = blotch)) +
geom_boxplot() + # box plot of time taken to blotch in males vs females
theme_minimal() +
labs(x = "Sex",
y = "Time taken to blotch (seconds)") +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 10)) # changes font size of axis text and titles The boxplot shows that there are some outliers in the data.
Sharks %>%
group_by(sex) %>% # groups sharks data by sex
identify_outliers(blotch) %>% # identifies any outliers in the blotching data
print(,width = Inf) # ensures all columns of table are presented in quarto# A tibble: 4 × 12
sex ID blotch BPM weight length air water meta depth is.outlier
<chr> <chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
1 Female SH161 30.8 149 109. 254. 34.5 24.0 64.2 45.3 TRUE
2 Female SH334 31.4 137 74.2 224. 34.7 22.1 86.1 45.8 TRUE
3 Male SH014 39.8 121 90.9 193. 34.4 23.4 99.8 55.3 TRUE
4 Male SH392 40.1 141 111. 204. 34.3 24.5 60.2 54.5 TRUE
is.extreme
<lgl>
1 FALSE
2 FALSE
3 FALSE
4 FALSE
This function identifies the 4 outliers that are present in the boxplot. R identifies that these outliers are not extreme, and so the assumptions of the independent sample t-test carried out above, are still met.
Space for further analyses
Determine the total number of recordings for each data set:
Sharks %>%
summarise(x = n_distinct(ID)) # total number of unique shark IDs in sharks data x
1 500
Sharksub %>%
summarise(x = n_distinct(ID)) # total number of unique shark IDs in sharksub data x
1 50
Check whether there is any NA data in the sharksub dataset (already done for sharks in question 1):
sum(is.na(Sharksub))[1] 0
Determine the maximum length of longline used:
Sharks %>%
summarise(x = max(depth)) # identifies maximum value for depth in sharks dataset x
1 56.82916
Temperature range/ average for duration of study:
Sharks %>%
summarise(r = range(air), # calculates range of air temperature measurements
a = mean(air)) # calculates mean of air temperature measurements r a
1 33.00454 35.53526
2 37.99978 35.53526
Sharks %>%
summarise(r = range(water), # calculates range of water temperature measurements
a = mean(water)) # calculates mean of water temperature measurements r a
1 20.00503 23.02052
2 25.98523 23.02052
To work out how many sharks were adult vs juvenile:
Sharks %>%
filter(sex == "Female") %>% # filters sharks data by females only
summarise(juv = sum(length < 200), # total number of sharks whose length is less than 200cm
adult = sum(length >= 200)) # total number of sharks whose length is 200cm or more juv adult
1 102 134
Sharks %>%
filter(sex == "Male") %>% # filters sharks data by males only
summarise(juv = sum(length < 150), # total number of sharks whose length is less than 150cm
adult = sum(length >= 150)) # total number of sharks whose length is 150cm or more juv adult
1 29 235