Research Methods and Data Analysis - Summative Assessment
#For the analysis, the data files “sharks.xlsx” and “sharksub.xlsx” were converted and saved as .csv files. Subsequently, the “sharks.csv” and “sharksub.csv” files were read into RStudio using the following code.
Rows: 500 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): ID, sex
dbl (8): blotch, BPM, weight, length, air, water, meta, depth
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Loading tidyverse package and viewing the sharks dataset structure
library(tidyverse) #loading tidyverse package
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ purrr 1.0.2
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(magrittr) #loading magrittr package
Attaching package: 'magrittr'
The following object is masked from 'package:purrr':
set_names
The following object is masked from 'package:tidyr':
extract
str(sharks) #viewing the sharks.csv file structure
spc_tbl_ [500 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ID : chr [1:500] "SH001" "SH002" "SH003" "SH004" ...
$ sex : chr [1:500] "Female" "Female" "Female" "Male" ...
$ blotch: num [1:500] 37.2 34.5 36.3 35.3 37.4 ...
$ BPM : num [1:500] 148 158 125 161 138 126 166 135 132 127 ...
$ weight: num [1:500] 74.7 73.4 71.8 104.6 67.1 ...
$ length: num [1:500] 187 189 284 171 264 ...
$ air : num [1:500] 37.7 35.7 34.8 36.2 33.6 ...
$ water : num [1:500] 23.4 21.4 20.1 21.6 21.8 ...
$ meta : num [1:500] 64.1 73.7 54.4 86.3 108 ...
$ depth : num [1:500] 53.2 49.6 49.4 50.3 49 ...
- attr(*, "spec")=
.. cols(
.. ID = col_character(),
.. sex = col_character(),
.. blotch = col_double(),
.. BPM = col_double(),
.. weight = col_double(),
.. length = col_double(),
.. air = col_double(),
.. water = col_double(),
.. meta = col_double(),
.. depth = col_double()
.. )
- attr(*, "problems")=<externalptr>
#The following code gives the number of females and males whose measurements are recorded in the sharks dataset.
sharks %>%group_by(sex) %>%#grouping sharks data by sexsummarise(n =n()) %>%#calculating number of obervations by sexungroup()
# A tibble: 2 × 2
sex n
<chr> <int>
1 Female 236
2 Male 264
#The following code gives the number of females and males whose measurements are recorded in the sharksub dataset.
sharksub %>%group_by(sex) %>%#grouping sharks data by sexsummarise(n =n()) %>%#calculating number of observations by sexungroup()
# A tibble: 2 × 2
sex n
<chr> <int>
1 Female 25
2 Male 25
Q1. Is there a correlation between the variables air and water?
#To answer this question, we consider the following points
A1. question is to check for a relationship between air and water
A2. scale of data: both variables are continuous and numeric
A3. experimental design: unpaired
A4. no. of groups: 1
A5. assumptions of normality analysed using histogram, QQ plot and Shapiro-Wilk Test to decide whether to use parametric or non-parametric tests
#A5. Checking normality of distribution:
#A5.1. Creating histograms for air and water
#A5.1.1. Histogram for air temperature
air_hist <-ggplot(data = sharks, aes(x = air)) +geom_histogram(bins =30, col ="blue") +#creating histogram for air using ggplot theme_bw() +#removing grey backgroundscale_color_brewer() +#choosing colour for visualisationlabs(x ="Air (Celsius)", #labeling x-axisy ="Frequency", #labeling y-axistitle ="Distribution of ambient air temperature data") #assigning suitable title for histogramair_hist #calling histogram for air
#A5.1.2. Histogram for water temperature
water_hist <-ggplot(data = sharks, aes(x = water)) +geom_histogram(bins =30, col ="dodgerblue") +#creating histogram for water using ggplot packagetheme_bw() +#removing grey backgroundscale_color_brewer() +#selecting colours for visualisationlabs(x ="Water temperature at surface at time of processing (Celsius)", #labeling x-axisy ="Frequency", #labeling y-axistitle ="Distribution of surface water temperature data") #assigning suitable title for histogramwater_hist #calling histogram for water
#Histograms reveal non-normal distribution for both air and water temperatures
#A5.2. Creating QQ plots for air and water
#A5.2.1. Q-Q plot for air temperature
qqnorm(sharks$air, pch =1, frame =FALSE) #creating normal Q-Q plot for airqqline(sharks$air, col ="mediumslateblue", lwd =2) #adding line to normal Q-Q plot
#A5.2.2. Q-Q plot for water temperature
qqnorm(sharks$water, pch =1, frame =FALSE) #creating normal Q-Q plot for waterqqline(sharks$water, col ="lightslateblue", lwd =2) #adding line to normal Q-Q plot
#Some data points do not align with the line, and therefore, data is not normally distributed.
##A5.3. Shapiro-Wilk Test for normality
#A5.3.1. Shapiro -Wilk test for air temperature
shapiro.test(sharks$air) #analysing air using Shapiro-Wilk test
Shapiro-Wilk normality test
data: sharks$air
W = 0.95885, p-value = 1.338e-10
#A5.3.2. Shapiro-Wilk Test for water temperature
shapiro.test(sharks$water) #analysing water using Shapiro-Wilk Test
Shapiro-Wilk normality test
data: sharks$water
W = 0.96035, p-value = 2.371e-10
#p-values for both air and water temperature variables are considerably smaller than the threshold of 0.05; therefore, data does not follow normal distribution
#Considering the abovementioned points, the most appropriate test to check whether there exists a correlation between ambient air temperature and surface water temperature is the non-parametric test, namely the Spearman’s correlation test, with significance level alpha = 0.05.
##Q1. Spearman’s correlation for air and water
#null hypothesis H0: There is no significant correlation between air and water temperatures. #alternative hypothesis H1: There is significant correlation between air and water temperatures.
cor.test(x=sharks$air, y=sharks$water, method ='spearman') #testing for correlation between air and water using Spearman's correlation test
Spearman's rank correlation rho
data: sharks$air and sharks$water
S = 22007692, p-value = 0.2082
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.05637344
#Value of Spearman’s correlation coefficient, rho = -0.056, p = 0.21 (p-value > 0.05) This possibly indicates a very weak non-significant negative correlation between ambient air temperature and surface water temperature. Therefore, we fail to reject the null hypothesis, and conclude that there is no significant correlation between ambient air temperature and surface water temperature.
#Q1. However, since sample size, n = 500 (n>30), data distribution could be considered normal. Therefore, the appropriate parametric test would be Pearson’s correlation test, with significance level alpha = 0.05.
#null hypothesis H0: There is no significant correlation between air and water temperatures. #alternative hypothesis H1: There is a significant correlation between air and water temperatures.
cor.test(x = sharks$air, y = sharks$water, method ='pearson') #testing correlation between air and water using Pearson's correlation test
Pearson's product-moment correlation
data: sharks$air and sharks$water
t = -1.2346, df = 498, p-value = 0.2176
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.14224207 0.03260803
sample estimates:
cor
-0.05524051
#Value of Pearson’s correlation coefficient, rho = -0.055, p = 0.22 (p-value > 0.05) This possibly indicates a very weak non-significant negative correlation between ambient air temperature and surface water temperature. Therefore, we fail to reject the null hypothesis, and conclude that there is no significant correlation between ambient air temperature and surface water temperature.
Q2. Does multiple capture have an effect on blotching time?
view(sharksub) #viewing sharksub datasetstr(sharksub) #viewing structure of sharksub dataset
'data.frame': 50 obs. of 4 variables:
$ ID : chr "SH269" "SH163" "SH008" "SH239" ...
$ sex : chr "Female" "Female" "Female" "Female" ...
$ blotch1: num 36.1 33.4 36.3 35 35.7 ...
$ blotch2: num 37.2 34.4 36.5 36 36.8 ...
summary(sharksub) #producing a summary of sharksub dataset
ID sex blotch1 blotch2
Length:50 Length:50 Min. :32.49 Min. :33.47
Class :character Class :character 1st Qu.:34.38 1st Qu.:35.31
Mode :character Mode :character Median :34.94 Median :35.94
Mean :35.03 Mean :35.96
3rd Qu.:35.90 3rd Qu.:36.78
Max. :37.07 Max. :38.18
#The following code produces a summary of blotch1 statistics.
summary(sharksub$blotch1) #producing summary statistics of blotch1
Min. 1st Qu. Median Mean 3rd Qu. Max.
32.49 34.38 34.94 35.03 35.90 37.07
#The following code produces a summary of blotch2 statistics.
summary(sharksub$blotch2) #producing summary statistics of blotch2
Min. 1st Qu. Median Mean 3rd Qu. Max.
33.47 35.31 35.94 35.96 36.78 38.18
#The following chunk of code creates a combined/stacked histogram with blotch1 and blotch2 in the same figure for easy visualisation of the blotching time taken after the first capture and after the second capture.
library(dplyr) #loading dplyr package library(tidyr) #loading tidy packagesharksub2 <- sharksub %>%pivot_longer(cols =c(blotch1, blotch2), #converting data from wide format to long formatnames_to ="SharkCapture", values_to ="TimeforBlotching")combo_hist <-ggplot(data = sharksub2, aes(x = TimeforBlotching, fill = SharkCapture)) +#creating histogram of blotching time by capturegeom_histogram(position ="stack", bins =30, color ="navyblue") +theme_bw() +#removing grey backgroundscale_fill_manual(values =c("blotch1"="slateblue3", "blotch2"="steelblue3")) +#adding colours of choicelabs(x ="Time taken for blotching to cover 30% of ventral surface (s)", #assigning label to x-axisy ="Frequency", #assigning label to y-axistitle ="Combined distribution of blotching time after two captures") +#assigning suitable titletheme(legend.title =element_blank()) #removing legend titlecombo_hist #calling combined histogram for viewing
#To answer the question as to whether multiple capture has an effect on blotching time, we consider the following points:
B1. question is whether there is a difference in blotching time for the same individuals after first capture and second capture
B2. scale of data: both variables are continuous and numeric
B3. experimental design: paired, as 2 captures were performed for the same individuals
B4. no. of groups: 2 (for the same reason as above)
B5. assumptions of normality: histogram, QQ plot, Shapiro-Wilk Test
#B5. Checking normality of distribution
#B5.1. Creating histograms for blotch1 and blotch2
#B5.1.1. Histogram for blotch1
blotch1_hist <-ggplot(data = sharksub, aes(x = blotch1)) +geom_histogram(bins =30, col ="royalblue2") +#creating histogram for blotch1theme_bw() +#removing grey backgroundscale_color_brewer() +#selecting colour for visualisationlabs(x ="Time taken for blotching to cover 30% of ventral surface (s)", #labeling x-axisy ="Frequency", #labeling y-axistitle ="Distribution of blotching time after first capture") #assigning title for histogramblotch1_hist #calling histogram for blotch1
#B.5.1.2. Histogram for blotch2
blotch2_hist <-ggplot(data = sharksub, aes(x = blotch2)) +geom_histogram(bins =30, col ="deepskyblue") +#creating histogram for blotch2theme_bw() +#removing grey backgroundscale_color_brewer() +#selecting colour for visualisation labs(x ="Time taken for blotching to cover 30% of ventral surface (s)", #labeling x-axisy ="Frequency", #labeling y-axistitle ="Distribution of blotching time after second capture") #assigning title to histogramblotch2_hist #calling histogram for blotch2
#Histograms reveal roughly normal distribution for both blotch1 and blotch2
#B5.2. Creating Q-Q plots for blotch1 and blotch2
#B5.2.1. Q-Q plot for blotch1
qqnorm(sharksub$blotch1, pch =1, frame =FALSE) #creating normal Q-Q plot for blotch1qqline(sharksub$blotch1, col ="slateblue", lwd =2) #adding line to normal Q-Q plot
#B5.2.2. Q-Q plot for blotch2
qqnorm(sharksub$blotch2, pch =1, frame =FALSE) #creating normal Q-Q plot for blotch2qqline(sharksub$blotch2, col ="mediumblue", lwd =2) #adding line to normal Q-Q plot
#Most data points align with the line, and therefore, data is normally distributed.
##B.5.3. Shapiro-Wilk Test for normality
#B5.3.1. Shapiro-Wilk test for blotch1
shapiro.test(sharksub$blotch1) #testing normality of blotch1 data using Shapiro-Wilk test
Shapiro-Wilk normality test
data: sharksub$blotch1
W = 0.97958, p-value = 0.5345
#B5.3.2. Shapiro-Wilk Test for blotch2
shapiro.test(sharksub$blotch2) #testing normality of blotch2 data using Shapiro-Wilk test
Shapiro-Wilk normality test
data: sharksub$blotch2
W = 0.97936, p-value = 0.5255
#For both blotch1 and blotch2, the Shapiro-Wilk test produces p-values > 0.05. Therefore, it can be concluded that the distribution of blotch1 and blotch2 data does not deviate significantly from normality.
#Q2. Considering the abovementioned points, the most appropriate parametric test to answer Q2 is the paired t-test. This test is used to compare the means of two related groups of samples. The question “Does multiple capture have an effect on blotching time?” can be interpreted as whether there is a difference in blotching time after first capture and second capture of the same individuals.
#null hypothesis H0: There is no significant difference between blotch1 and blotch2. #alternative hypothesis H1: There is a significant difference between blotch1 and blotch2.
capture1 <- sharksub$blotch1 # assigning blotch1 to object 'capture1'capture2 <- sharksub$blotch2 # assigning blotch2 to object 'capture2'
#Q2. Paired t-test for blotch1 and blotch2
diff <-t.test(capture1, capture2, paired =TRUE) #testing difference between blotch1 and blotch2 using paired t-test, assigning it to object 'diff'diff #calling object 'diff' to show results of paired t-test
Paired t-test
data: capture1 and capture2
t = -17.39, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-1.037176 -0.822301
sample estimates:
mean difference
-0.9297384
#The output of the paired t-test gives a p-value < 2.2e-16, which is considerably less than the significance level alpha = 0.05. Therefore, the null hypothesis can be rejected, and it can be concluded that there is a significant difference between blotch1 and blotch2. Furthermore, the 95 percent confidence interval does not include 0, which implies that the difference is statistically significant. In other words, multiple capture does have an effect on blotching time.
Q3. Is it possible to predict blotching time?
As the question does not specify the predictor variables, multiple regression analysis will be used to test whether it is possible to predict blotching time using all the variables, namely, BPM, weight, length, air, water, meta, and depth, using the sharks dataset.
#Q3. Multiple regression for blotching time prediction
mult_reg_model <-lm(blotch ~ BPM + weight + length + air+ water + meta + depth, data = sharks) #creating regression model with all variables, assigning it to object 'mult_reg_model'summary(mult_reg_model) #showing summary of regression model output
Call:
lm(formula = blotch ~ BPM + weight + length + air + water + meta +
depth, data = sharks)
Residuals:
Min 1Q Median 3Q Max
-2.83745 -0.66117 -0.00702 0.60110 2.74108
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.1405851 1.8958668 5.876 7.74e-09 ***
BPM -0.0019723 0.0031890 -0.618 0.537
weight 0.0016283 0.0033511 0.486 0.627
length 0.0012295 0.0009710 1.266 0.206
air -0.0281474 0.0318707 -0.883 0.378
water -0.0188934 0.0270782 -0.698 0.486
meta -0.0009712 0.0025951 -0.374 0.708
depth 0.5061285 0.0223191 22.677 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.002 on 492 degrees of freedom
Multiple R-squared: 0.514, Adjusted R-squared: 0.507
F-statistic: 74.32 on 7 and 492 DF, p-value: < 2.2e-16
#For this model involving all 7 variables, the p-value of the F-statistic is less than 2.2e-16. This implies that at least one of the 7 predictor variables can be used to significantly predict blotching time.
#Looking at the coefficients will help determine the predictor variable with a significant association with blotching time
summary(mult_reg_model)$coefficient #showing only the coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.1405850738 1.8958667617 5.8762490 7.742498e-09
BPM -0.0019723462 0.0031890181 -0.6184807 5.365447e-01
weight 0.0016283130 0.0033510746 0.4859077 6.272489e-01
length 0.0012295257 0.0009710195 1.2662214 2.060330e-01
air -0.0281474117 0.0318706960 -0.8831753 3.775729e-01
water -0.0188934313 0.0270781617 -0.6977369 4.856714e-01
meta -0.0009712356 0.0025951073 -0.3742564 7.083748e-01
depth 0.5061284856 0.0223190976 22.6769242 1.816298e-78
#From the above table, it can be understood that none of the variables except depth are significantly associated with blotching time, ie., the beta coefficient values of all the predictor variables except depth are not significantly different from 0. This implies that depth is a significant predictor of blotching time. Therefore, the regression model can be modified as a linear regression model, whereby the outcome variable is predicted using a single predictor variable.
lin_reg_model <-lm(blotch ~ depth, data = sharks) #creating regression model with depth as a predictor of blotch, assigning it to object 'lin_reg_model'summary(lin_reg_model) #showing a summary of regression model output
Call:
lm(formula = blotch ~ depth, data = sharks)
Residuals:
Min 1Q Median 3Q Max
-2.81869 -0.65427 -0.01035 0.58825 2.83116
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.82178 1.11207 8.832 <2e-16 ***
depth 0.50467 0.02216 22.772 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1 on 498 degrees of freedom
Multiple R-squared: 0.5101, Adjusted R-squared: 0.5091
F-statistic: 518.6 on 1 and 498 DF, p-value: < 2.2e-16
This shows that the relation between depth and blotch as lying between the values 0.461 and 0.548 can be accepted with 95% confidence. Therefore, the linear regression model for predicting blotching time is: blotch = 9.82178 + 0.50467 X depth. The adjusted R-squared value is 0.5091, which implies that 50.91% of variance in blotching time can be explained by depth.
#To ensure that multiple regression is appropriate for predicting blotching time using depth, the distribution of residuals is assessed for normality by visualising on histogram and QQ plot.
#Q3.1. Histogram of residuals
mult_model <- mult_reg_model$residuals #residuals of multiple regression model assigned to a new objecthist(mult_model) #histogram of residuals created
#The histogram above shows that the residuals are normally distributed.
#Q.3.2. Q-Q Plot of residuals
qqnorm(mult_model) #creating normal Q-Q plot for residualsqqline(mult_model, col ="cornflowerblue", lwd =2) #adding line to normal Q-Q plot
#Most of the data points align with the reference line, and therefore, the residuals are normally distributed.
#Creating a scatter plot between blotch and depth
#To understand the relationship between blotch and depth, as modeled using regression analysis, the following code is used to generate a scatterplot between blotch and depth.
library(ggplot2) # loading ggplot2 packageggplot(sharks, aes(x = depth, y = blotch)) +geom_point(color ="slateblue2", size =2) +#creating scatter plotgeom_smooth(method ="lm", color ="dodgerblue4", se =TRUE) +# adding regression line with standard error ribbonlabs(title ="Relationship between Depth and Blotch", #assigning titlex ="Depth (m)", #assigning label to x-axisy ="Blotch (s)") +#assigning label to y-axistheme_minimal()