Summative Assignment R Code

Author

Erica Bass, N1343960

Importing Data Sets

A quarto project has been set up in R, containing the sharks and sharksub data, saved as csv files.

Sharks <- read.csv("sharks.csv",  # imports shark data set
                   header = TRUE) # ensures first row of data is imported as variable names
Sharksub <- read.csv("sharksub.csv", # imports sharksub data set
                     header = TRUE) # ensures first row of data is imported as variable names

Installing necessary packages

The following packages were first installed via the console using the install.packages(““) function, and then called using the library function below.

library(tidyverse) # contains packages such as ggplot2 and dyplr for data visualisation and manipulation
library(corrplot) # for correlation analyses in question 3
library(rstatix) # for identifying outliers in question 3

Question 1: Is there a correlation between the variables air and water?

  • The 2 variables, air and water, are both quantitative continuous
str(Sharks) # check that the variables air and water are in numerical format
'data.frame':   500 obs. of  10 variables:
 $ ID    : chr  "SH001" "SH002" "SH003" "SH004" ...
 $ sex   : chr  "Female" "Female" "Female" "Male" ...
 $ blotch: num  37.2 34.5 36.3 35.3 37.4 ...
 $ BPM   : int  148 158 125 161 138 126 166 135 132 127 ...
 $ weight: num  74.7 73.4 71.8 104.6 67.1 ...
 $ length: num  187 189 284 171 264 ...
 $ air   : num  37.7 35.7 34.8 36.2 33.6 ...
 $ water : num  23.4 21.4 20.1 21.6 21.8 ...
 $ meta  : num  64.1 73.7 54.4 86.3 108 ...
 $ depth : num  53.2 49.6 49.4 50.3 49 ...
sum(is.na(Sharks)) # checks whether there are any NA values in the data
[1] 0
Sharks %>%
  ggplot(aes(x = air)) +
  geom_histogram() # produce a histogram to see whether the data for air is normally distributed

The histogram suggests that the data is not normally distributed.

shapiro.test(Sharks$air) # check whether the variable 'air' is normally distributed

    Shapiro-Wilk normality test

data:  Sharks$air
W = 0.95885, p-value = 1.338e-10

The very small P-value (<0.05) means we reject the null hypothesis, and accept that the data for variable ‘air’ is not normally distributed.

Sharks %>%
  ggplot(aes(x = water)) +
  geom_histogram() # produce a histogram to see whether the data for water is normally distributed

The histogram suggests that the data is not normally distributed.

shapiro.test(Sharks$water) # check whether the variable 'water' is normally distributed

    Shapiro-Wilk normality test

data:  Sharks$water
W = 0.96035, p-value = 2.371e-10

The very small P-value (<0.05) means we reject the null hypothesis, and accept that the data for variable ‘water’ is not normally distributed.


As both variables are not normally distributed, the test used is the Spearman’s rank correlation test.

Sharks %>%
  ggplot(aes(x = air, 
             y = water)) +
  geom_point() + # plots a scatter plot using the variables 'air' and 'water'
  geom_smooth(method = "lm") + # adds a line of best fit with standard error, using linear regression
  labs(x = "Ambient air temperature (°C)",
       y = "Surface water temperature (°C)") +
  theme_minimal() +
  theme(axis.title = element_text(size = 12),
        axis.text = element_text(size = 10)) # changes font size of axis text and titles

The scatter plot indicates that there is a very weak negative correlation between the variables.

correlation <- cor.test(x=Sharks$air, y=Sharks$water, method = 'spearman') # calculates Spearman's rank correlation coefficient and saves the output as an object
correlation

    Spearman's rank correlation rho

data:  Sharks$air and Sharks$water
S = 22007692, p-value = 0.2082
alternative hypothesis: true rho is not equal to 0
sample estimates:
        rho 
-0.05637344 

A correlation coefficient of -0.056 confirms a very weak negative relationship.

The P-value is greater than 0.05 which means we accept the null hypothesis ie. there is not enough evidence to suggest that the true correlation coefficient is not equal to zero.

Conclusion: there is no statistically significant correlation between the variables air and water.

Question 2: Does multiple capture have an effect on blotching time?

  • Using sharksub dataset
  • Capture ID is a categorical variable and blotching time is continuous
str(Sharksub) # check what format the variables are in
'data.frame':   50 obs. of  4 variables:
 $ ID     : chr  "SH269" "SH163" "SH008" "SH239" ...
 $ sex    : chr  "Female" "Female" "Female" "Female" ...
 $ blotch1: num  36.1 33.4 36.3 35 35.7 ...
 $ blotch2: num  37.2 34.4 36.5 36 36.8 ...
Sharksub %>%
ggplot(aes(x = blotch1)) +
  geom_histogram() # produce a histogram to see whether data for blotch1 is normally distributed

The histogram suggest a normal distsribution, but it is not perfectly clear.

shapiro.test(Sharksub$blotch1) # check whether blotch1 is normally distributed

    Shapiro-Wilk normality test

data:  Sharksub$blotch1
W = 0.97958, p-value = 0.5345

As the p-value is greater than 0.05, we accept the null hypothesis that the data is normally distributed.

Sharksub %>%
  ggplot(aes(x = blotch2)) +
  geom_histogram() # produce a histogram to see whether data for blotch2 is normally distributed

The histogram suggest a normal distribution, but it is not perfectly clear.

shapiro.test(Sharksub$blotch2) # check whether blotch2 is normally distributed

    Shapiro-Wilk normality test

data:  Sharksub$blotch2
W = 0.97936, p-value = 0.5255

As the p-value is greater than 0.05, we accept the null hypothesis that the data is normally distributed.


As both variables are normally distributed, we can use a paired T-test to do a comparison of means for blotch1 and blotch 2.

multiple.capture <- t.test(Sharksub$blotch1, Sharksub$blotch2,
                           paired = TRUE) # paired sample T-test to compare time taken to blotch on two separate captures per shark
multiple.capture

    Paired t-test

data:  Sharksub$blotch1 and Sharksub$blotch2
t = -17.39, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -1.037176 -0.822301
sample estimates:
mean difference 
     -0.9297384 

Mean difference of -0.9 suggests that blotching time is slower on second capture.

This difference is statistically significant as P < 0.05, so we reject the null hypothesis, and accept that there is a difference in blotching time with multiple capture.

sharksub_adapted <- Sharksub %>%
  pivot_longer(cols = starts_with("blotch"), 
               names_to = "Capture", 
               values_to = "Blotching_Time") %>%
  mutate(Capture = recode(Capture, 
                              blotch1 = "Capture 1", 
                              blotch2 = "Capture 2")) # alters the data to aid production of a boxplot. Changing blotch names to capture, as well as naming each data entry as a blotching time

sharksub_adapted %>%
  ggplot(aes(x = Capture,
             y = Blotching_Time)) +
  geom_boxplot() + # produces boxplot of time taken to blotch in first and second capture
  theme_minimal() +
  labs(x = "Capture number",
       y = "Time taken to blotch (seconds)") +
  theme(axis.title = element_text(size = 12),
        axis.text = element_text(size = 10)) # changes font size of axis text and titles

The data also provides information on the sex of each shark. It would be interesting to know if sex has any influence on blotching time with multiple capture.

Sharksub %>% 
  group_by(sex) %>% # groups sharksub data by sex
  summarise(x = n()) %>% # summarises data by number of individuals per sex
  ungroup() # ungroups data to ensure the original is not altered
# A tibble: 2 × 2
  sex        x
  <chr>  <int>
1 Female    25
2 Male      25

Here we can see that equal numbers of female and male sharks were tested, and therefore the data is not likely to be skewed by sex.

sharksub_adapted %>% # uses same data set that was used to create the previous boxplot
  ggplot(aes(x = Capture,
             y = Blotching_Time,
             color = sex)) + # separates the data by sex
  geom_boxplot() + # produces boxplot of time taken to blotch in first and second capture
  theme_minimal() +
  scale_color_brewer(palette = "Dark2") + # uses a colour palette that is color-blind friendly
  labs(x = "Capture number",
       y = "Time taken to blotch (seconds)") +
  theme(axis.title = element_text(size = 12),
        axis.text = element_text(size = 10)) # changes font size of axis text and titles

The boxplot shows that median blotching time is greater in males in both captures. There is also greater variability of the interquartile ranges in males than in females. The time taken to blotch increases in both males and females with multiple capture. There is large overlap of both the boxes and whiskers of the plot for each sex, indicating that differences may not be significant.

To determine whether differences between blotching times by sex are significant, an independent sample t-test can be used, to compare the means.

One assumption of the t-test is that the two samples have similar variance. The boxplot shows that this is true, but this can also be confirmed in R;

Sharksub %>%
  group_by(sex) %>% # groups sharksub data by sex
  summarise(variance_blotch1 = var(blotch1), # summarises variance of blotch 1 data
            variance_blotch2 = var(blotch2)) %>% # summarises variance of blotch 2 data
  ungroup() # final ungrouping of data
# A tibble: 2 × 3
  sex    variance_blotch1 variance_blotch2
  <chr>             <dbl>            <dbl>
1 Female             1.16             1.23
2 Male               1.18             1.44
blotch1.t.test <- Sharksub %>%
  t.test(blotch1 ~ sex, data = .) # independent sample t-test for sharksub data, of difference in mean time taken to blotch in males and females (for first capture)

blotch1.t.test

    Welch Two Sample t-test

data:  blotch1 by sex
t = -1.4629, df = 47.997, p-value = 0.15
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -1.0644425  0.1678337
sample estimates:
mean in group Female   mean in group Male 
            34.80627             35.25457 

As P > 0.05, we accept the null hypothesis, and accept that the difference is not statistically significant.

blotch2.t.test <- Sharksub %>%
  t.test(blotch2 ~ sex, data = .) # independent sample t-test for sharksub data, of difference in mean time taken to blotch in males and females (for second capture)

blotch2.t.test

    Welch Two Sample t-test

data:  blotch2 by sex
t = -1.2362, df = 47.702, p-value = 0.2224
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -1.0622680  0.2534615
sample estimates:
mean in group Female   mean in group Male 
            35.75796             36.16236 

As P > 0.05, we accept the null hypothesis, and accept that the difference is not statistically significant.

Conclusion: time taken to blotch increases with multiple capture, but there is no significant effect of sex on blotching time.

Question 3: Is it possible to predict blotching time?

A balloon plot can be produced, to evaluate correlations between the variables in the ‘Sharks’ dataset, as a helpful starting point.

The corrplot function can only analyse numeric values. Shark ID does not need to be included, and sex will be looked at separately.

corrplot(corr = cor(Sharks[, 3:10]), # produces a correlation matrix of the sharks data, excluding columns for ID and sex
         method = "ellipse", # presents correlations as ellipses, with direction and spread
         type = "lower", # changes shape of balloon plot
         addCoef.col = "black", # adds numerical correlation coefficient to each balloon
         number.cex = 0.5, # reduces text size of coefficients
         diag = FALSE,  # removes instances where variables are correlated against themselves
         bg = "grey", # changes background colour
         outline = "black") # changes outline colour of balloons 

The balloon plot shows a strong positive correlation between blotching time and the depth at which the animal was hooked.

As both variables are quantitative, linear regression can be used to analyse their relationship.

Sharks %>%
  ggplot(aes(x = depth,
             y = blotch)) +
  geom_point() + # produces scatter plot of depth and blotching time
 geom_smooth(method = "lm") + # adds a line of best fit with standard error, using linear regression
  labs(x = "Depth at capture (m)",
       y = "Time taken to blotch (s)") +
  theme_minimal() +
  theme(axis.title = element_text(size = 12),
        axis.text = element_text(size = 10)) # changes font size of axis text and titles

linear.model <- lm(blotch ~ depth, data = Sharks) # linear regression model of blotching time as a function of capture depth

summary(linear.model)

Call:
lm(formula = blotch ~ depth, data = Sharks)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81869 -0.65427 -0.01035  0.58825  2.83116 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.82178    1.11207   8.832   <2e-16 ***
depth        0.50467    0.02216  22.772   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 498 degrees of freedom
Multiple R-squared:  0.5101,    Adjusted R-squared:  0.5091 
F-statistic: 518.6 on 1 and 498 DF,  p-value: < 2.2e-16
hist(linear.model$residuals) # plots a histogram of the residuals of the linear model

The histogram shows that the residuals are evenly distributed around zero ie. the data is not skewed. This means that data can be equally-well predicted at both ends of the range.

shapiro.test(linear.model$residuals) # checks for normality of residuals

    Shapiro-Wilk normality test

data:  linear.model$residuals
W = 0.99748, p-value = 0.653

The coefficient output produces an equation that can be used to predict blotching time based on depth at capture:

y = 0.50467x + 9.82178

The P-value for the coefficient is very small (close to zero) which shows that the variable is significant to the linear model. This means that we can be confident in using depth to predict blotching.

A R-squared value of 0.5101 suggests that 51% of the variance in blotching time can be explained by depth at capture.


It might be useful to see whether sex has any effect on blotching time in the Sharks data set, as it is much larger.

Sharks %>%
  group_by(sex) %>% # groups sharks data by sex
  summarise(x = n()) %>% # summarises data by number of individuals per sex
  ungroup() # final ungrouping of data
# A tibble: 2 × 2
  sex        x
  <chr>  <int>
1 Female   236
2 Male     264

More males were captured than females, but there is a large enough sample of each sex for the sample to be representative.

To check that the data have similar variances, and therefore meet the assumptions of the t-test:

Sharks %>%
  group_by(sex) %>% # groups sharks data by sex
  summarise(variance = var(blotch)) %>% # summarises variance of blotch time for each sex
  ungroup() # final ungrouping of data
# A tibble: 2 × 2
  sex    variance
  <chr>     <dbl>
1 Female     1.94
2 Male       2.06

Also check whether the data is normally distributed:

Sharks %>%
  ggplot(aes(x = blotch,
             colour = sex)) +
  geom_histogram() + # plots histogram of blotching times, separated by sex
  scale_color_brewer(palette = "Dark2") # uses a colour palette that is colour-blind friendly

The histogram suggests that the data is normally distributed.

sharks.males <- Sharks %>%
  subset(sex == "Male") # subsets sharks data into males only

sharks.females <- Sharks %>%
  subset(sex == "Female") # subsets sharks data into females only
shapiro.test(sharks.males$blotch) # tests for normality of blotch data for males

    Shapiro-Wilk normality test

data:  sharks.males$blotch
W = 0.99209, p-value = 0.1701

As P > 0.05, we accept the null hypothesis that the data is normally distributed.

shapiro.test(sharks.females$blotch) # tests for normality of blotch data for females

    Shapiro-Wilk normality test

data:  sharks.females$blotch
W = 0.99527, p-value = 0.682

As P > 0.05, we accept the null hypothesis that the data is normally distributed.

sharks.t.test <- Sharks %>%
  t.test(blotch ~ sex, data = .) # independent sample t-test for sharks data, of difference in mean time taken to blotch in males and females

sharks.t.test

    Welch Two Sample t-test

data:  blotch by sex
t = -3.0282, df = 494.67, p-value = 0.002589
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -0.6322714 -0.1346620
sample estimates:
mean in group Female   mean in group Male 
            34.92294             35.30641 

As with the Sharksub data, time taken to blotch is higher in males than females. However in this instance, P < 0.05 which means we reject the null hypothesis and accept that the difference in means between males and females is statistically significant. This suggests that the Sharksub data did not contain enough individuals to see a true trend.

Sharks %>%
  ggplot(aes(x = sex,
             y = blotch)) +
  geom_boxplot() + # box plot of time taken to blotch in males vs females
  theme_minimal() +
  labs(x = "Sex",
       y = "Time taken to blotch (seconds)") +
  theme(axis.title = element_text(size = 12),
        axis.text = element_text(size = 10)) # changes font size of axis text and titles 

The boxplot shows that there are some outliers in the data.

Sharks %>%
  group_by(sex) %>% # groups sharks data by sex
  identify_outliers(blotch) %>% # identifies any outliers in the blotching data
  print(,width = Inf) # ensures all columns of table are presented in quarto
# A tibble: 4 × 12
  sex    ID    blotch   BPM weight length   air water  meta depth is.outlier
  <chr>  <chr>  <dbl> <int>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>     
1 Female SH161   30.8   149  109.    254.  34.5  24.0  64.2  45.3 TRUE      
2 Female SH334   31.4   137   74.2   224.  34.7  22.1  86.1  45.8 TRUE      
3 Male   SH014   39.8   121   90.9   193.  34.4  23.4  99.8  55.3 TRUE      
4 Male   SH392   40.1   141  111.    204.  34.3  24.5  60.2  54.5 TRUE      
  is.extreme
  <lgl>     
1 FALSE     
2 FALSE     
3 FALSE     
4 FALSE     

This function identifies the 4 outliers that are present in the boxplot. R identifies that these outliers are not extreme, and so the assumptions of the independent sample t-test carried out above, are still met.

Space for further analyses

Determine the total number of recordings for each data set:

Sharks %>%
  summarise(x = n_distinct(ID)) # total number of unique shark IDs in sharks data
    x
1 500
Sharksub %>%
  summarise(x = n_distinct(ID)) # total number of unique shark IDs in sharksub data
   x
1 50

Check whether there is any NA data in the sharksub dataset (already done for sharks in question 1):

sum(is.na(Sharksub))
[1] 0

Determine the maximum length of longline used:

Sharks %>%
  summarise(x = max(depth)) # identifies maximum value for depth in sharks dataset
         x
1 56.82916

Temperature range/ average for duration of study:

Sharks %>%
  summarise(r = range(air), # calculates range of air temperature measurements
            a = mean(air)) # calculates mean of air temperature measurements
         r        a
1 33.00454 35.53526
2 37.99978 35.53526
Sharks %>%
  summarise(r = range(water), # calculates range of water temperature measurements
            a = mean(water)) # calculates mean of water temperature measurements
         r        a
1 20.00503 23.02052
2 25.98523 23.02052

To work out how many sharks were adult vs juvenile:

Sharks %>%
  filter(sex == "Female") %>% # filters sharks data by females only
  summarise(juv = sum(length < 200), # total number of sharks whose length is less than 200cm
            adult = sum(length >= 200)) # total number of sharks whose length is 200cm or more
  juv adult
1 102   134
Sharks %>%
  filter(sex == "Male") %>% # filters sharks data by males only
  summarise(juv = sum(length < 150), # total number of sharks whose length is less than 150cm
            adult = sum(length >= 150)) # total number of sharks whose length is 150cm or more
  juv adult
1  29   235