RMDA_summative

Author

H. Cyril

Research Methods and Data Analysis - Summative Assessment

#For the analysis, the data files “sharks.xlsx” and “sharksub.xlsx” were converted and saved as .csv files. Subsequently, the “sharks.csv” and “sharksub.csv” files were read into RStudio using the following code.

library(readr)
sharks <- read_csv("E:/RStudio/sharks.csv")
Rows: 500 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): ID, sex
dbl (8): blotch, BPM, weight, length, air, water, meta, depth

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sharksub <- read.csv("E:/RStudio/sharksub.csv")
View(sharks)

Loading tidyverse package and viewing the sharks dataset structure

library(tidyverse) #loading tidyverse package
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ purrr     1.0.2
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(magrittr) #loading magrittr package

Attaching package: 'magrittr'

The following object is masked from 'package:purrr':

    set_names

The following object is masked from 'package:tidyr':

    extract
str(sharks) #viewing the sharks.csv file structure
spc_tbl_ [500 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID    : chr [1:500] "SH001" "SH002" "SH003" "SH004" ...
 $ sex   : chr [1:500] "Female" "Female" "Female" "Male" ...
 $ blotch: num [1:500] 37.2 34.5 36.3 35.3 37.4 ...
 $ BPM   : num [1:500] 148 158 125 161 138 126 166 135 132 127 ...
 $ weight: num [1:500] 74.7 73.4 71.8 104.6 67.1 ...
 $ length: num [1:500] 187 189 284 171 264 ...
 $ air   : num [1:500] 37.7 35.7 34.8 36.2 33.6 ...
 $ water : num [1:500] 23.4 21.4 20.1 21.6 21.8 ...
 $ meta  : num [1:500] 64.1 73.7 54.4 86.3 108 ...
 $ depth : num [1:500] 53.2 49.6 49.4 50.3 49 ...
 - attr(*, "spec")=
  .. cols(
  ..   ID = col_character(),
  ..   sex = col_character(),
  ..   blotch = col_double(),
  ..   BPM = col_double(),
  ..   weight = col_double(),
  ..   length = col_double(),
  ..   air = col_double(),
  ..   water = col_double(),
  ..   meta = col_double(),
  ..   depth = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

#The following code gives the number of females and males whose measurements are recorded in the sharks dataset.

sharks %>% 
  group_by(sex) %>%  #grouping sharks data by sex
  summarise(n = n()) %>% #calculating number of obervations by sex
  ungroup()
# A tibble: 2 × 2
  sex        n
  <chr>  <int>
1 Female   236
2 Male     264

#The following code gives the number of females and males whose measurements are recorded in the sharksub dataset.

sharksub %>% 
  group_by(sex) %>%  #grouping sharks data by sex
  summarise(n = n()) %>% #calculating number of observations by sex
  ungroup()
# A tibble: 2 × 2
  sex        n
  <chr>  <int>
1 Female    25
2 Male      25

Q1. Is there a correlation between the variables air and water?

#To answer this question, we consider the following points

  • A1. question is to check for a relationship between air and water
  • A2. scale of data: both variables are continuous and numeric
  • A3. experimental design: unpaired
  • A4. no. of groups: 1
  • A5. assumptions of normality analysed using histogram, QQ plot and Shapiro-Wilk Test to decide whether to use parametric or non-parametric tests

#A5. Checking normality of distribution:

#A5.1. Creating histograms for air and water

#A5.1.1. Histogram for air temperature

air_hist <- ggplot(data = sharks, aes(x = air)) +
  geom_histogram(bins = 30, col = "blue") + #creating histogram for air using ggplot 
  theme_bw() + #removing grey background
  scale_color_brewer() + #choosing colour for visualisation
    labs(x = "Air (Celsius)", #labeling x-axis
       y = "Frequency",  #labeling y-axis
       title = "Distribution of ambient air temperature data") #assigning suitable title for histogram
air_hist  #calling histogram for air

#A5.1.2. Histogram for water temperature

water_hist <- ggplot(data = sharks, aes(x = water)) +
  geom_histogram(bins = 30, col = "dodgerblue") + #creating histogram for water using ggplot package
  theme_bw() + #removing grey background
  scale_color_brewer() + #selecting colours for visualisation
  labs(x = "Water temperature at surface at time of processing (Celsius)",  #labeling x-axis
       y = "Frequency",  #labeling y-axis
       title = "Distribution of surface water temperature data") #assigning suitable title for histogram
water_hist #calling histogram for water

#Histograms reveal non-normal distribution for both air and water temperatures

#A5.2. Creating QQ plots for air and water

#A5.2.1. Q-Q plot for air temperature

qqnorm(sharks$air, pch = 1, frame = FALSE) #creating normal Q-Q plot for air
qqline(sharks$air, col = "mediumslateblue", lwd = 2) #adding line to normal Q-Q plot

#A5.2.2. Q-Q plot for water temperature

qqnorm(sharks$water, pch = 1, frame = FALSE) #creating normal Q-Q plot for water
qqline(sharks$water, col = "lightslateblue", lwd = 2) #adding line to normal Q-Q plot

#Some data points do not align with the line, and therefore, data is not normally distributed.

##A5.3. Shapiro-Wilk Test for normality

#A5.3.1. Shapiro -Wilk test for air temperature

shapiro.test(sharks$air) #analysing air using Shapiro-Wilk test

    Shapiro-Wilk normality test

data:  sharks$air
W = 0.95885, p-value = 1.338e-10

#A5.3.2. Shapiro-Wilk Test for water temperature

shapiro.test(sharks$water) #analysing water using Shapiro-Wilk Test

    Shapiro-Wilk normality test

data:  sharks$water
W = 0.96035, p-value = 2.371e-10

#p-values for both air and water temperature variables are considerably smaller than the threshold of 0.05; therefore, data does not follow normal distribution

#Considering the abovementioned points, the most appropriate test to check whether there exists a correlation between ambient air temperature and surface water temperature is the non-parametric test, namely the Spearman’s correlation test, with significance level alpha = 0.05.

##Q1. Spearman’s correlation for air and water

#null hypothesis H0: There is no significant correlation between air and water temperatures. #alternative hypothesis H1: There is significant correlation between air and water temperatures.

cor.test(x=sharks$air, y=sharks$water, method = 'spearman') #testing for correlation between air and water using Spearman's correlation test

    Spearman's rank correlation rho

data:  sharks$air and sharks$water
S = 22007692, p-value = 0.2082
alternative hypothesis: true rho is not equal to 0
sample estimates:
        rho 
-0.05637344 

#Value of Spearman’s correlation coefficient, rho = -0.056, p = 0.21 (p-value > 0.05) This possibly indicates a very weak non-significant negative correlation between ambient air temperature and surface water temperature. Therefore, we fail to reject the null hypothesis, and conclude that there is no significant correlation between ambient air temperature and surface water temperature.

#Q1. However, since sample size, n = 500 (n>30), data distribution could be considered normal. Therefore, the appropriate parametric test would be Pearson’s correlation test, with significance level alpha = 0.05.

#null hypothesis H0: There is no significant correlation between air and water temperatures. #alternative hypothesis H1: There is a significant correlation between air and water temperatures.

 cor.test(x = sharks$air, y = sharks$water, method = 'pearson') #testing correlation between air and water using Pearson's correlation test

    Pearson's product-moment correlation

data:  sharks$air and sharks$water
t = -1.2346, df = 498, p-value = 0.2176
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.14224207  0.03260803
sample estimates:
        cor 
-0.05524051 

#Value of Pearson’s correlation coefficient, rho = -0.055, p = 0.22 (p-value > 0.05) This possibly indicates a very weak non-significant negative correlation between ambient air temperature and surface water temperature. Therefore, we fail to reject the null hypothesis, and conclude that there is no significant correlation between ambient air temperature and surface water temperature.

Q2. Does multiple capture have an effect on blotching time?

view(sharksub) #viewing sharksub dataset
str(sharksub) #viewing structure of sharksub dataset
'data.frame':   50 obs. of  4 variables:
 $ ID     : chr  "SH269" "SH163" "SH008" "SH239" ...
 $ sex    : chr  "Female" "Female" "Female" "Female" ...
 $ blotch1: num  36.1 33.4 36.3 35 35.7 ...
 $ blotch2: num  37.2 34.4 36.5 36 36.8 ...
summary(sharksub) #producing a summary of sharksub dataset
      ID                sex               blotch1         blotch2     
 Length:50          Length:50          Min.   :32.49   Min.   :33.47  
 Class :character   Class :character   1st Qu.:34.38   1st Qu.:35.31  
 Mode  :character   Mode  :character   Median :34.94   Median :35.94  
                                       Mean   :35.03   Mean   :35.96  
                                       3rd Qu.:35.90   3rd Qu.:36.78  
                                       Max.   :37.07   Max.   :38.18  

#The following code produces a summary of blotch1 statistics.

summary(sharksub$blotch1) #producing summary statistics of blotch1
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  32.49   34.38   34.94   35.03   35.90   37.07 

#The following code produces a summary of blotch2 statistics.

summary(sharksub$blotch2) #producing summary statistics of blotch2
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  33.47   35.31   35.94   35.96   36.78   38.18 

#The following chunk of code creates a combined/stacked histogram with blotch1 and blotch2 in the same figure for easy visualisation of the blotching time taken after the first capture and after the second capture.

library(dplyr) #loading dplyr package 
library(tidyr) #loading tidy package


sharksub2 <- sharksub %>%
  pivot_longer(cols = c(blotch1, blotch2), #converting data from wide format to long format
               names_to = "SharkCapture", 
               values_to = "TimeforBlotching")


combo_hist <- ggplot(data = sharksub2, aes(x = TimeforBlotching, fill = SharkCapture)) + #creating histogram of blotching time by capture
  geom_histogram(position = "stack", bins = 30, color = "navyblue") +
  theme_bw() +  #removing grey background
  scale_fill_manual(values = c("blotch1" = "slateblue3", "blotch2" = "steelblue3")) +  #adding colours of choice
  labs(x = "Time taken for blotching to cover 30% of ventral surface (s)", #assigning label to x-axis
       y = "Frequency", #assigning label to y-axis
       title = "Combined distribution of blotching time after two captures") + #assigning suitable title
  theme(legend.title = element_blank()) #removing legend title

combo_hist #calling combined histogram for viewing

#To answer the question as to whether multiple capture has an effect on blotching time, we consider the following points:

  • B1. question is whether there is a difference in blotching time for the same individuals after first capture and second capture
  • B2. scale of data: both variables are continuous and numeric
  • B3. experimental design: paired, as 2 captures were performed for the same individuals
  • B4. no. of groups: 2 (for the same reason as above)
  • B5. assumptions of normality: histogram, QQ plot, Shapiro-Wilk Test

#B5. Checking normality of distribution

#B5.1. Creating histograms for blotch1 and blotch2

#B5.1.1. Histogram for blotch1

blotch1_hist <- ggplot(data = sharksub, aes(x = blotch1)) +
  geom_histogram(bins = 30, col = "royalblue2") + #creating histogram for blotch1
  theme_bw() + #removing grey background
  scale_color_brewer() + #selecting colour for visualisation
  labs(x = "Time taken for blotching to cover 30% of ventral surface (s)", #labeling x-axis
       y = "Frequency", #labeling y-axis
       title = "Distribution of blotching time after first capture") #assigning title for histogram
blotch1_hist #calling histogram for blotch1

#B.5.1.2. Histogram for blotch2

blotch2_hist <- ggplot(data = sharksub, aes(x = blotch2)) +
  geom_histogram(bins = 30, col = "deepskyblue") + #creating histogram for blotch2
  theme_bw() + #removing grey background
  scale_color_brewer() + #selecting colour for visualisation 
  labs(x = "Time taken for blotching to cover 30% of ventral surface (s)", #labeling x-axis
       y = "Frequency", #labeling y-axis
       title = "Distribution of blotching time after second capture") #assigning title to histogram
blotch2_hist #calling histogram for blotch2

#Histograms reveal roughly normal distribution for both blotch1 and blotch2

#B5.2. Creating Q-Q plots for blotch1 and blotch2

#B5.2.1. Q-Q plot for blotch1

qqnorm(sharksub$blotch1, pch = 1, frame = FALSE) #creating normal Q-Q plot for blotch1
qqline(sharksub$blotch1, col = "slateblue", lwd = 2) #adding line to normal Q-Q plot

#B5.2.2. Q-Q plot for blotch2

qqnorm(sharksub$blotch2, pch = 1, frame = FALSE) #creating normal Q-Q plot for blotch2
qqline(sharksub$blotch2, col = "mediumblue", lwd = 2) #adding line to normal Q-Q plot

#Most data points align with the line, and therefore, data is normally distributed.

##B.5.3. Shapiro-Wilk Test for normality

#B5.3.1. Shapiro-Wilk test for blotch1

shapiro.test(sharksub$blotch1) #testing normality of blotch1 data using Shapiro-Wilk test

    Shapiro-Wilk normality test

data:  sharksub$blotch1
W = 0.97958, p-value = 0.5345

#B5.3.2. Shapiro-Wilk Test for blotch2

shapiro.test(sharksub$blotch2) #testing normality of blotch2 data using Shapiro-Wilk test

    Shapiro-Wilk normality test

data:  sharksub$blotch2
W = 0.97936, p-value = 0.5255

#For both blotch1 and blotch2, the Shapiro-Wilk test produces p-values > 0.05. Therefore, it can be concluded that the distribution of blotch1 and blotch2 data does not deviate significantly from normality.

#Q2. Considering the abovementioned points, the most appropriate parametric test to answer Q2 is the paired t-test. This test is used to compare the means of two related groups of samples. The question “Does multiple capture have an effect on blotching time?” can be interpreted as whether there is a difference in blotching time after first capture and second capture of the same individuals.

#null hypothesis H0: There is no significant difference between blotch1 and blotch2. #alternative hypothesis H1: There is a significant difference between blotch1 and blotch2.

capture1 <- sharksub$blotch1 # assigning blotch1 to object 'capture1'
capture2 <- sharksub$blotch2 # assigning blotch2 to object 'capture2'

#Q2. Paired t-test for blotch1 and blotch2

diff <- t.test(capture1, capture2, paired = TRUE) #testing difference between blotch1 and blotch2 using paired t-test, assigning it to object 'diff'
diff #calling object 'diff' to show results of paired t-test

    Paired t-test

data:  capture1 and capture2
t = -17.39, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -1.037176 -0.822301
sample estimates:
mean difference 
     -0.9297384 

#The output of the paired t-test gives a p-value < 2.2e-16, which is considerably less than the significance level alpha = 0.05. Therefore, the null hypothesis can be rejected, and it can be concluded that there is a significant difference between blotch1 and blotch2. Furthermore, the 95 percent confidence interval does not include 0, which implies that the difference is statistically significant. In other words, multiple capture does have an effect on blotching time.

Q3. Is it possible to predict blotching time?

As the question does not specify the predictor variables, multiple regression analysis will be used to test whether it is possible to predict blotching time using all the variables, namely, BPM, weight, length, air, water, meta, and depth, using the sharks dataset.

#Q3. Multiple regression for blotching time prediction

mult_reg_model <- lm(blotch ~ BPM + weight + length + air
                     + water + meta + depth, data = sharks) #creating regression model with all variables, assigning it to object 'mult_reg_model'
summary(mult_reg_model) #showing summary of regression model output

Call:
lm(formula = blotch ~ BPM + weight + length + air + water + meta + 
    depth, data = sharks)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.83745 -0.66117 -0.00702  0.60110  2.74108 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.1405851  1.8958668   5.876 7.74e-09 ***
BPM         -0.0019723  0.0031890  -0.618    0.537    
weight       0.0016283  0.0033511   0.486    0.627    
length       0.0012295  0.0009710   1.266    0.206    
air         -0.0281474  0.0318707  -0.883    0.378    
water       -0.0188934  0.0270782  -0.698    0.486    
meta        -0.0009712  0.0025951  -0.374    0.708    
depth        0.5061285  0.0223191  22.677  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.002 on 492 degrees of freedom
Multiple R-squared:  0.514, Adjusted R-squared:  0.507 
F-statistic: 74.32 on 7 and 492 DF,  p-value: < 2.2e-16

#For this model involving all 7 variables, the p-value of the F-statistic is less than 2.2e-16. This implies that at least one of the 7 predictor variables can be used to significantly predict blotching time.

#Looking at the coefficients will help determine the predictor variable with a significant association with blotching time

summary(mult_reg_model)$coefficient #showing only the coefficients
                 Estimate   Std. Error    t value     Pr(>|t|)
(Intercept) 11.1405850738 1.8958667617  5.8762490 7.742498e-09
BPM         -0.0019723462 0.0031890181 -0.6184807 5.365447e-01
weight       0.0016283130 0.0033510746  0.4859077 6.272489e-01
length       0.0012295257 0.0009710195  1.2662214 2.060330e-01
air         -0.0281474117 0.0318706960 -0.8831753 3.775729e-01
water       -0.0188934313 0.0270781617 -0.6977369 4.856714e-01
meta        -0.0009712356 0.0025951073 -0.3742564 7.083748e-01
depth        0.5061284856 0.0223190976 22.6769242 1.816298e-78

#From the above table, it can be understood that none of the variables except depth are significantly associated with blotching time, ie., the beta coefficient values of all the predictor variables except depth are not significantly different from 0. This implies that depth is a significant predictor of blotching time. Therefore, the regression model can be modified as a linear regression model, whereby the outcome variable is predicted using a single predictor variable.

lin_reg_model <- lm(blotch ~ depth, data = sharks) #creating regression model with depth as a predictor of blotch, assigning it to object 'lin_reg_model'
summary(lin_reg_model) #showing a summary of regression model output

Call:
lm(formula = blotch ~ depth, data = sharks)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81869 -0.65427 -0.01035  0.58825  2.83116 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.82178    1.11207   8.832   <2e-16 ***
depth        0.50467    0.02216  22.772   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 498 degrees of freedom
Multiple R-squared:  0.5101,    Adjusted R-squared:  0.5091 
F-statistic: 518.6 on 1 and 498 DF,  p-value: < 2.2e-16
confint(lin_reg_model) #generating 95% confidence interval
                2.5 %     97.5 %
(Intercept) 7.6368581 12.0067017
depth       0.4611309  0.5482156

This shows that the relation between depth and blotch as lying between the values 0.461 and 0.548 can be accepted with 95% confidence. Therefore, the linear regression model for predicting blotching time is: blotch = 9.82178 + 0.50467 X depth. The adjusted R-squared value is 0.5091, which implies that 50.91% of variance in blotching time can be explained by depth.

#To ensure that multiple regression is appropriate for predicting blotching time using depth, the distribution of residuals is assessed for normality by visualising on histogram and QQ plot.

#Q3.1. Histogram of residuals

mult_model <- mult_reg_model$residuals #residuals of multiple regression model assigned to a new object
hist(mult_model)  #histogram of residuals created

#The histogram above shows that the residuals are normally distributed.

#Q.3.2. Q-Q Plot of residuals

qqnorm(mult_model) #creating normal Q-Q plot for residuals
qqline(mult_model, col = "cornflowerblue", lwd = 2) #adding line to normal Q-Q plot

#Most of the data points align with the reference line, and therefore, the residuals are normally distributed.

#Creating a scatter plot between blotch and depth

#To understand the relationship between blotch and depth, as modeled using regression analysis, the following code is used to generate a scatterplot between blotch and depth.

library(ggplot2) # loading ggplot2 package

ggplot(sharks, aes(x = depth, y = blotch)) +
  geom_point(color = "slateblue2", size = 2) + #creating scatter plot
  geom_smooth(method = "lm", color = "dodgerblue4", se = TRUE) + # adding regression line with standard error ribbon
  labs(title = "Relationship between Depth and Blotch", #assigning title
       x = "Depth (m)", #assigning label to x-axis
       y = "Blotch (s)") + #assigning label to y-axis
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'