Loading required package: MASS
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
Loading required package: gld
Loading required package: mvtnorm
Loading required package: lattice
Attaching package: 'PairedData'
The following object is masked from 'package:base':
summary
ID sex blotch BPM
Length:500 Length:500 Min. :30.78 Min. :119.0
Class :character Class :character 1st Qu.:34.16 1st Qu.:129.0
Mode :character Mode :character Median :35.05 Median :142.0
Mean :35.13 Mean :141.8
3rd Qu.:36.05 3rd Qu.:153.2
Max. :40.08 Max. :166.0
weight length air water
Min. : 65.10 Min. :128.3 Min. :33.00 Min. :20.01
1st Qu.: 75.68 1st Qu.:172.0 1st Qu.:34.42 1st Qu.:21.55
Median : 87.82 Median :211.1 Median :35.43 Median :23.11
Mean : 87.94 Mean :211.0 Mean :35.54 Mean :23.02
3rd Qu.:100.40 3rd Qu.:251.8 3rd Qu.:36.71 3rd Qu.:24.37
Max. :110.94 Max. :291.0 Max. :38.00 Max. :25.99
meta depth
Min. : 50.03 Min. :44.64
1st Qu.: 67.39 1st Qu.:48.90
Median : 82.45 Median :50.14
Mean : 82.04 Mean :50.14
3rd Qu.: 95.97 3rd Qu.:51.35
Max. :112.45 Max. :56.83
# A tibble: 6 × 4
ID sex blotch1 blotch2
<chr> <chr> <dbl> <dbl>
1 SH260 Male 36.4 37.5
2 SH357 Male 33.5 34.5
3 SH199 Male 37.1 38.2
4 SH413 Male 34.3 35.3
5 SH004 Male 35.3 35.6
6 SH307 Male 34.5 34.1
dim(sharksub)
[1] 50 4
nrow(sharksub)
[1] 50
ncol(sharksub)
[1] 4
colnames(sharksub)
[1] "ID" "sex" "blotch1" "blotch2"
summary(sharksub)
ID sex blotch1 blotch2
Length:50 Length:50 Min. :32.49 Min. :33.47
Class :character Class :character 1st Qu.:34.38 1st Qu.:35.31
Mode :character Mode :character Median :34.94 Median :35.94
Mean :35.03 Mean :35.96
3rd Qu.:35.90 3rd Qu.:36.78
Max. :37.07 Max. :38.18
Check for missing data
any(is.na(sharks))
[1] FALSE
any(is.na(sharksub))
[1] FALSE
Both are false so no need to handle missing values
I will rename the column name in the first dataset to match the second: turn blotch into blotch1
Q1. Is there a correlation between the variables air and water?
Visualize the correlation between air and water using a scatterplot
ggplot(sharks, aes(x = air,y = water)) +geom_point(color ="navy") +geom_smooth(method="lm",se=TRUE, aes(group =1), colour="maroon") +labs(x ="Air Temperature (°C)",y ="Water Temperature (°C)", title ="Relationship between Air and Water Temperature") +theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
There is no visually recognizable correlation between the two variables
There is no correlation between the variables air and water
Question 2
Q2. Does multiple capture have an effect on blotching time?
View merged data: this dataset only contains sharks that had multiple capture events and include both botching times
Convert wide data to a long data format so that using ggplot2 is easier: instead of a column each for blotch1 and blotch2, there is a column for capture and another for blotch time
# A tibble: 6 × 11
ID sex BPM weight length air water meta depth capture blotch_time
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 SH004 Male 161 105. 171. 36.2 21.6 86.3 50.3 blotch1 35.3
2 SH004 Male 161 105. 171. 36.2 21.6 86.3 50.3 blotch2 35.6
3 SH008 Female 135 101. 128. 36.8 21.3 96.3 50.6 blotch1 36.3
4 SH008 Female 135 101. 128. 36.8 21.3 96.3 50.6 blotch2 36.5
5 SH029 Male 122 69.0 219. 37.6 23.3 105. 50.6 blotch1 35.2
6 SH029 Male 122 69.0 219. 37.6 23.3 105. 50.6 blotch2 36.2
View as a boxlplot
ggboxplot(my_data_long, x ="capture", y ="blotch_time", color ="capture", palette =c("#DC143C", "#87CEEB"),ylab ="Blotch Time (Seconds)") +labs(color ="Capture Event") +scale_color_manual(values =c("#DC143C", "#87CEEB"), labels =c("First Capture", "Second Capture")) +theme(axis.title.x =element_blank(), axis.text.x =element_blank())
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
As I have already created a data set with sharks that have all been captured twice, there is no need to subset my data
Now I plot the paired data
ggplot(my_data_long, aes(x =factor(ID), y = blotch_time, color = capture)) +geom_line(aes(group = ID), color ="gray") +# Lines to show pairinggeom_point() +labs(x ="Shark ID", y ="Blotch Time (Seconds)") +theme_bw()
Now, I reshape the data back to wide format to calculate the difference between blotch1 and blotch2
ggboxplot(my_data_no_outliers, y ="difference", ylab ="Difference in Blotch Time (Seconds)", xlab ="Paired Data")
No outliers are seen now
Check the paired t-test assumptions
Assumption 1:Are the two samples paired? Yes
Assumption 2:The observations must be independent of each other: Yes
Assumption 3:The differences between the paired values should be approximately normally distributed: therefore we need to check whether the differences of the pairs follow a normal distribution
Check for normal distribution among the dependent variable (Visualization)
qqnorm(my_data_no_outliers$difference)qqline(my_data_no_outliers$difference, col ="red")
Paired t-test
data: my_data_no_outliers$blotch1 and my_data_no_outliers$blotch2
t = -208.14, df = 44, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-1.059966 -1.039637
sample estimates:
mean difference
-1.049801
The p value is <0.5 (2.2e - 16) therefore, we reject the null hypothesis.
We accept the alternative hypothesis: There is a significant difference in blotching time between the first and second capture
Therefore, multiple capture has an effect on blotching time.
library(ggplot2)ggplot(sharks, aes(y = blotch)) +geom_boxplot(fill ="lightblue", color ="black") +labs(title ="Boxplot of Blotch Time", y ="Blotch Time (Seconds)") +theme_minimal()
I will have another try as that boxplot was stretched
boxplot(sharks$blotch,main ="Boxplot of Blotch Time",ylab ="Blotch Time (Seconds)",col ="#E7B800",border ="black")
Outliers are seen in the boxplot, therefore identify them using IQR & Define the lower and upper bounds for outliers
ggplot(sharks_no_outliers, aes(x = BPM, y = blotch)) +geom_point() +geom_smooth(method ="lm", color ="blue") +labs(x ="Heart Rate (BPM)", y ="Blotch Time (Seconds)", title ="Heart Rate vs. Blotch Time")
`geom_smooth()` using formula = 'y ~ x'
heart_rate_blotch_lm <-lm(blotch ~ BPM, data = sharks_no_outliers)summary(heart_rate_blotch_lm)
Call:
lm(formula = blotch ~ BPM, data = sharks_no_outliers)
Residuals:
Min 1Q Median 3Q Max
-3.7593 -0.9311 -0.0689 0.8915 3.5609
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.467281 0.623104 56.920 <2e-16 ***
BPM -0.002545 0.004374 -0.582 0.561
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.375 on 494 degrees of freedom
Multiple R-squared: 0.0006846, Adjusted R-squared: -0.001338
F-statistic: 0.3384 on 1 and 494 DF, p-value: 0.561
The p value is >0.05 (0.561)therefore no correlation
Weight and blotch
ggplot(sharks_no_outliers, aes(x = weight, y = blotch)) +geom_point() +geom_smooth(method ="lm", color ="blue") +labs(x ="Weight (kg)", y ="Blotch Time (Seconds)", title ="Weight vs. Blotch Time")
`geom_smooth()` using formula = 'y ~ x'
weight_blotch_lm <-lm(blotch ~ weight, data = sharks_no_outliers)summary(weight_blotch_lm)
Call:
lm(formula = blotch ~ weight, data = sharks_no_outliers)
Residuals:
Min 1Q Median 3Q Max
-3.7474 -0.9430 -0.0662 0.9314 3.5558
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.511e+01 4.091e-01 85.827 <2e-16 ***
weight -9.491e-06 4.604e-03 -0.002 0.998
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.376 on 494 degrees of freedom
Multiple R-squared: 8.6e-09, Adjusted R-squared: -0.002024
F-statistic: 4.249e-06 on 1 and 494 DF, p-value: 0.9984
The p value is >0.05 (0.9984) therefore no correlation
Length and blotch
ggplot(sharks_no_outliers, aes(x = length, y = blotch)) +geom_point() +geom_smooth(method ="lm", color ="blue") +labs(x ="Length (cm)", y ="Blotch Time (Seconds)", title ="Length vs. Blotch Time")
`geom_smooth()` using formula = 'y ~ x'
length_blotch_lm <-lm(blotch ~ length, data = sharks_no_outliers)summary(length_blotch_lm)
Call:
lm(formula = blotch ~ length, data = sharks_no_outliers)
Residuals:
Min 1Q Median 3Q Max
-3.7425 -0.9497 -0.0604 0.9262 3.5421
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.1854644 0.2858843 123.076 <2e-16 ***
length -0.0003739 0.0013234 -0.283 0.778
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.376 on 494 degrees of freedom
Multiple R-squared: 0.0001616, Adjusted R-squared: -0.001862
F-statistic: 0.07984 on 1 and 494 DF, p-value: 0.7776
The p value is >0.05 (0.7776) therefore no correlation
Air and blotch
ggplot(sharks_no_outliers, aes(x = air, y = blotch)) +geom_point() +geom_smooth(method ="lm", color ="blue") +labs(x ="Air Temperature (°C)", y ="Blotch Time (Seconds)", title ="Air Temperature vs. Blotch Time")
`geom_smooth()` using formula = 'y ~ x'
air_blotch_lm <-lm(blotch ~ air, data = sharks_no_outliers)summary(air_blotch_lm)
Call:
lm(formula = blotch ~ air, data = sharks_no_outliers)
Residuals:
Min 1Q Median 3Q Max
-3.7737 -0.9320 -0.0353 0.9392 3.5534
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.24234 1.53621 23.59 <2e-16 ***
air -0.03196 0.04319 -0.74 0.46
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.375 on 494 degrees of freedom
Multiple R-squared: 0.001107, Adjusted R-squared: -0.000915
F-statistic: 0.5475 on 1 and 494 DF, p-value: 0.4597
The p value is >0.05 (0.4597) therefore no correlation
Water and blotch
ggplot(sharks_no_outliers, aes(x = water, y = blotch)) +geom_point() +geom_smooth(method ="lm", color ="blue") +labs(x ="Water Temperature (°C)", y ="Blotch Time (Seconds)", title ="Water Temperature vs. Blotch Time")
`geom_smooth()` using formula = 'y ~ x'
water_blotch_lm <-lm(blotch ~ water, data = sharks_no_outliers)summary(water_blotch_lm)
Call:
lm(formula = blotch ~ water, data = sharks_no_outliers)
Residuals:
Min 1Q Median 3Q Max
-3.7887 -0.9439 -0.0413 0.9546 3.5189
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.12179 0.85072 42.460 <2e-16 ***
water -0.04411 0.03686 -1.196 0.232
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.374 on 494 degrees of freedom
Multiple R-squared: 0.00289, Adjusted R-squared: 0.0008712
F-statistic: 1.432 on 1 and 494 DF, p-value: 0.2321
The p value is >0.05 (0.2321) therefore no correlation
Visualization of meta and blotch
ggplot(sharks_no_outliers, aes(x = meta, y = blotch)) +geom_point() +geom_smooth(method ="lm", color ="blue") +labs(x ="Meta", y ="Blotch Time (Seconds)", title ="Meta vs. Blotch Time")
`geom_smooth()` using formula = 'y ~ x'
meta_blotch_lm <-lm(blotch ~ meta, data = sharks_no_outliers)summary(meta_blotch_lm)
Call:
lm(formula = blotch ~ meta, data = sharks_no_outliers)
Residuals:
Min 1Q Median 3Q Max
-3.7404 -0.9355 -0.0638 0.9421 3.6073
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.246148 0.297592 118.438 <2e-16 ***
meta -0.001701 0.003548 -0.479 0.632
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.375 on 494 degrees of freedom
Multiple R-squared: 0.000465, Adjusted R-squared: -0.001558
F-statistic: 0.2298 on 1 and 494 DF, p-value: 0.6319
The p value is >0.05 (0.6319) therefore no correlation
Depth and blotch
ggplot(sharks_no_outliers, aes(x = depth, y = blotch)) +geom_point() +geom_smooth(method ="lm", color ="blue") +labs(x ="Depth (m)", y ="Blotch Time (Seconds)", title ="Depth vs. Blotch Time")
`geom_smooth()` using formula = 'y ~ x'
depth_blotch_lm <-lm(blotch ~ depth, data = sharks_no_outliers)summary(depth_blotch_lm)
Call:
lm(formula = blotch ~ depth, data = sharks_no_outliers)
Residuals:
Min 1Q Median 3Q Max
-2.80824 -0.64783 -0.00811 0.60695 2.82044
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.81902 1.11064 9.741 <2e-16 ***
depth 0.48455 0.02214 21.885 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9802 on 494 degrees of freedom
Multiple R-squared: 0.4923, Adjusted R-squared: 0.4912
F-statistic: 479 on 1 and 494 DF, p-value: < 2.2e-16
The p value is <0.05 (2.2e - 16) therefore there is a correlation
Sex and blotch
ggplot(sharks_no_outliers, aes(x = sex, y = blotch, fill = sex)) +geom_boxplot() +labs(x ="Sex", y ="Blotch Time (Seconds)", title ="Sex vs. Blotch Time")
sex_blotch_lm <-lm(blotch ~ sex, data = sharks_no_outliers)summary(sex_blotch_lm)
Call:
lm(formula = blotch ~ sex, data = sharks_no_outliers)
Residuals:
Min 1Q Median 3Q Max
-3.5812 -0.9630 -0.0208 0.9739 3.7218
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.94059 0.08914 391.956 <2e-16 ***
sexMale 0.31547 0.12289 2.567 0.0105 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.367 on 494 degrees of freedom
Multiple R-squared: 0.01316, Adjusted R-squared: 0.01117
F-statistic: 6.59 on 1 and 494 DF, p-value: 0.01055
The p value is <0.05 (0.01055) therefore there is a correlation
Now I will take the significant linear regression models (blotch vs sex, and blotch vs depth) and fit a multiple linear regression model
mlr_model <-lm(blotch ~ sex + depth, data = sharks_no_outliers)summary(mlr_model)
Call:
lm(formula = blotch ~ sex + depth, data = sharks_no_outliers)
Residuals:
Min 1Q Median 3Q Max
-2.93661 -0.63242 -0.00156 0.58855 2.96121
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.75572 1.10129 9.766 < 2e-16 ***
sexMale 0.27080 0.08741 3.098 0.00206 **
depth 0.48297 0.02196 21.997 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9718 on 493 degrees of freedom
Multiple R-squared: 0.502, Adjusted R-squared: 0.4999
F-statistic: 248.4 on 2 and 493 DF, p-value: < 2.2e-16
I would like to see if a model with both independent variables fits the data better and answers the question better
I will test a full model with both significant variables and a model with just depth and sex
full_model <-lm(blotch ~ sex + depth, data = sharks_no_outliers)summary(full_model)
Call:
lm(formula = blotch ~ sex + depth, data = sharks_no_outliers)
Residuals:
Min 1Q Median 3Q Max
-2.93661 -0.63242 -0.00156 0.58855 2.96121
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.75572 1.10129 9.766 < 2e-16 ***
sexMale 0.27080 0.08741 3.098 0.00206 **
depth 0.48297 0.02196 21.997 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9718 on 493 degrees of freedom
Multiple R-squared: 0.502, Adjusted R-squared: 0.4999
F-statistic: 248.4 on 2 and 493 DF, p-value: < 2.2e-16
Reduced model removing ‘sex’
reduced_model <-lm(blotch ~ depth, data = sharks_no_outliers)summary(reduced_model)
Call:
lm(formula = blotch ~ depth, data = sharks_no_outliers)
Residuals:
Min 1Q Median 3Q Max
-2.80824 -0.64783 -0.00811 0.60695 2.82044
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.81902 1.11064 9.741 <2e-16 ***
depth 0.48455 0.02214 21.885 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9802 on 494 degrees of freedom
Multiple R-squared: 0.4923, Adjusted R-squared: 0.4912
F-statistic: 479 on 1 and 494 DF, p-value: < 2.2e-16
Reduced model removing ‘depth’
reduced_model2 <-lm(blotch ~ sex, data = sharks_no_outliers)summary(reduced_model2)
Call:
lm(formula = blotch ~ sex, data = sharks_no_outliers)
Residuals:
Min 1Q Median 3Q Max
-3.5812 -0.9630 -0.0208 0.9739 3.7218
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.94059 0.08914 391.956 <2e-16 ***
sexMale 0.31547 0.12289 2.567 0.0105 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.367 on 494 degrees of freedom
Multiple R-squared: 0.01316, Adjusted R-squared: 0.01117
F-statistic: 6.59 on 1 and 494 DF, p-value: 0.01055
Then i will conduct an ANOVA comparing the results of the models
First an ANOVA between full model and model without sex
anova(reduced_model, full_model)
Analysis of Variance Table
Model 1: blotch ~ depth
Model 2: blotch ~ sex + depth
Res.Df RSS Df Sum of Sq F Pr(>F)
1 494 474.64
2 493 465.58 1 9.0636 9.5974 0.00206 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value <0.05 (0.00206), therefore removing sex significantly worsens the model
Second an ANOVA between full model and model without depth
anova(reduced_model2, full_model)
Analysis of Variance Table
Model 1: blotch ~ sex
Model 2: blotch ~ sex + depth
Res.Df RSS Df Sum of Sq F Pr(>F)
1 494 922.53
2 493 465.58 1 456.95 483.86 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value <0.05 (2.2e-16), therefore removing depth significantly worsens the model Therefore I will include both sex and depth in the model (full model)
Earlier, I had turned sex into a factor, I will now check the levels of the factor to see if Male comes first or if Female comes first
levels(sharks_no_outliers$sex)
[1] "Female" "Male"
Female came first which means Female is coded as 0 and Male is coded as 1
I will now validate the model by conducting train-testing splitting
First I will create a training set and a test set
library(caret)
Warning: package 'caret' was built under R version 4.4.2
Attaching package: 'caret'
The following object is masked from 'package:purrr':
lift