Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E,
Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi
K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the
tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
print(citation("ggplot2"), style ="text")
Wickham H (2016). _ggplot2: Elegant Graphics for Data Analysis_.
Springer-Verlag New York. ISBN 978-3-319-24277-4,
<https://ggplot2.tidyverse.org>.
print(citation("dplyr"), style ="text")
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A
Grammar of Data Manipulation_. R package version 1.1.4,
<https://CRAN.R-project.org/package=dplyr>.
print(citation("viridis"), style ="text")
Garnier, Simon, Ross, Noam, Rudis, Robert, Camargo, Pedro A, Sciaini,
Marco, Scherer, Cédric (2024). _viridis(Lite) - Colorblind-Friendly
Color Maps for R_. doi:10.5281/zenodo.4679423
<https://doi.org/10.5281/zenodo.4679423>, viridis package version
0.6.5, <https://sjmgarnier.github.io/viridis/>.
# Load data sets in csv form & text()# Load data sets in csv form & save as data frame variables in global environment:eqsub <-read.csv("~/Library/Mobile Documents/com~apple~CloudDocs/R Studio/RMDA Project/eqsub.csv")equine <-read.csv("~/Library/Mobile Documents/com~apple~CloudDocs/R Studio/RMDA Project/equine.csv")
equine data: Compliance time (time taken for a horse to completed a particular task ) has long been considered an easy estimate of a horses “stress levels”. Here you have been supplied with a data set in the form of an Excel file (equine.xlsx) that contains compliance times for a group of horses who have been tasked with negotiating a novel obstacle (a streamer curtain - see video recording).
The data file equine.xlsx also contains a number of other variables used to quantify stress in addition to some environmental and biometric measurements.
eqsub data: A second set of experiments was conducted on the original study group where individuals were set the same task again after the area was sprayed with a substance designed to calm horses.
Themes and questions:
In your write up please consider the pros and cons of invasive and non-invasive measurements in animal research.
Cons: Stress induced by the sampling process might influence measurements.
Non-Invasive Measurements Examples: Compliance time, heart rate (BPM), thermographic eye temperature (IRT).
Pros: No direct harm or stress to the animal.
Cons: Can be less accurate or influenced by external factors.
In your analysis please answer these three questions:
1. Is there a correlation between IRT (Thermographic Eye Temperature) and Cortisol?
Hypothesis: Stress may cause changes in both IRT and cortisol, suggesting a possible correlation.
H0: There is a correlation between IRT and Cortisol
H1: There is no correlation between IRT and Coritsol
Approach:
Test assumptions of normality
Calculate the Pearson/Spearmans/ Kendall Tau correlation coefficient (cor() in R) to measure the strength of the linear relationship.
Visualize the relationship with a scatterplot and a trend line.
Test for statistical significance using cor.test().
2. Does the calming spray affect compliance times?
Hypothesis: The calming spray reduces stress, leading to shorter compliance times.
H0: Calming spray does have an effect on compliance time
H1: Calming spray does not have an affect on compliance time
Approach:
Compare compliance times before and after the spray using a paired t-test (if the same horses are measured) or an unpaired t-test (if different groups are used).
Visualize the results using boxplots to show before-and-after differences.
3. Can compliance time be predicted?
Hypothesis: Compliance time is influenced by a combination of factors such as IRT, cortisol, air temperature, weight, and heart rate.
H0: Compliance time can be predicted by correlation to multiple factors
H1: Compliance time cannot be predicted by correlation to multiple factors
Approach:
Build a multiple linear regression model using lm() in R.
Check the model’s assumptions (normality, homoscedasticity, multicollinearity).
Evaluate the model’s performance using metrics such as R2 and residual plots.
Data Preparation
# Inspect the structure of the datastr(equine)
'data.frame': 498 obs. of 8 variables:
$ ID : chr "Eq001" "Eq002" "Eq003" "Eq004" ...
$ sex : chr "Female" "Female" "Female" "Male" ...
$ comp : num 37.2 34.5 36.3 35.3 37.4 ...
$ weight : num 74.7 73.4 71.8 104.6 67.1 ...
$ irt : num 37.7 35.7 34.8 36.2 33.6 ...
$ air : num 23.4 21.4 20.1 21.6 21.8 ...
$ cortisol: num 64.1 73.7 54.4 86.3 108 ...
$ BPM : num 153 150 149 150 149 ...
str(eqsub)
'data.frame': 498 obs. of 4 variables:
$ ID : chr "EQ001" "EQ002" "EQ003" "EQ004" ...
$ sex : chr "Female" "Female" "Female" "Male" ...
$ comp : num 37.2 34.5 36.3 35.3 37.4 ...
$ comp2: num 31.5 29.6 30.8 30.1 31.7 ...
# Get summaries of datasummary(equine)
ID sex comp weight
Length:498 Length:498 Min. :30.78 Min. : 65.10
Class :character Class :character 1st Qu.:34.16 1st Qu.: 75.67
Mode :character Mode :character Median :35.04 Median : 87.82
Mean :35.12 Mean : 87.93
3rd Qu.:36.05 3rd Qu.:100.34
Max. :40.08 Max. :110.94
irt air cortisol BPM
Min. :33.00 Min. :20.01 Min. : 50.03 Min. :144.6
1st Qu.:34.42 1st Qu.:21.54 1st Qu.: 67.27 1st Qu.:148.9
Median :35.43 Median :23.11 Median : 82.43 Median :150.1
Mean :35.54 Mean :23.02 Mean : 82.00 Mean :150.1
3rd Qu.:36.71 3rd Qu.:24.35 3rd Qu.: 95.96 3rd Qu.:151.3
Max. :38.00 Max. :25.99 Max. :112.45 Max. :156.8
summary(eqsub)
ID sex comp comp2
Length:498 Length:498 Min. :30.78 Min. :27.37
Class :character Class :character 1st Qu.:34.16 1st Qu.:29.23
Mode :character Mode :character Median :35.04 Median :29.99
Mean :35.12 Mean :29.94
3rd Qu.:36.05 3rd Qu.:30.65
Max. :40.08 Max. :32.81
# Check for missing valuessum(is.na(equine))
[1] 0
sum(is.na(eqsub))
[1] 0
1. Correlation Analysis: IRT vs. Cortisol
Stress may cause changes in both IRT and cortisol, suggesting a possible correlation.
H0: There is a correlation between IRT and Cortisol
H1: There is no correlation between IRT and Coritsol
Data exploration & Normality:
# See structure of equine datastr(equine)
'data.frame': 498 obs. of 8 variables:
$ ID : chr "Eq001" "Eq002" "Eq003" "Eq004" ...
$ sex : chr "Female" "Female" "Female" "Male" ...
$ comp : num 37.2 34.5 36.3 35.3 37.4 ...
$ weight : num 74.7 73.4 71.8 104.6 67.1 ...
$ irt : num 37.7 35.7 34.8 36.2 33.6 ...
$ air : num 23.4 21.4 20.1 21.6 21.8 ...
$ cortisol: num 64.1 73.7 54.4 86.3 108 ...
$ BPM : num 153 150 149 150 149 ...
# An excess of zerossum(equine$irt ==0) *100/nrow(equine)
[1] 0
sum(equine$cortisol ==0) *100/nrow(equine)
[1] 0
# Shapiro-Wilk normality test for IRTshapiro_irt <-shapiro.test(equine$irt)print("Shapiro-Wilk Test for IRT")
[1] "Shapiro-Wilk Test for IRT"
print(shapiro_irt)
Shapiro-Wilk normality test
data: equine$irt
W = 0.95889, p-value = 1.447e-10
# Shapiro-Wilk normality test for cortisolshapiro_cortisol <-shapiro.test(equine$cortisol)print("Shapiro-Wilk Test for Cortisol")
[1] "Shapiro-Wilk Test for Cortisol"
print(shapiro_cortisol)
Shapiro-Wilk normality test
data: equine$cortisol
W = 0.96362, p-value = 9.245e-10
Both data sets are non-normally distributed (P <<< 0.05), but statistically significant. Thus, we cannot use a normal linear correlation model. We cannot then use the Pearson’s correlation but can rather use the Spearman’s Rank Correlation or the Kendall Tau Correlation.
Spearman's rank correlation rho
data: equine$irt and equine$cortisol
S = 17878766, p-value = 0.00332
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.1314346
Spearman’s Rank Correlation Coefficient (𝜌= 0.131): This is a weak positive monotonic relationship between IRT (Infrared Thermography) and Cortisol levels. A weak relationship suggests that as one variable increases, the other tends to increase slightly, but this trend is not strong.
P-value (p=0.00332): The p-value is less than 0.05, indicating that the correlation is statistically significant. This means it is unlikely that the observed relationship is due to random chance, and there is evidence of a weak monotonic association between IRT and Cortisol.
Since the p-value is significant, we reject the null hypothesis (𝐻0:𝜌=0) and accept the alternative hypothesis that there is a non-zero correlation between IRT and Cortisol.
Kendal Tau’s Correlation Interpretation:
A more robust alternative to the Spearman’s correlation, which accommodates ties in the ranked data and small sample size, is Kendall’s Tau correlation.
Kendall's rank correlation tau
data: equine$irt and equine$cortisol
z = 2.8644, p-value = 0.004178
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.08587267
Kendall’s tau (τ): This is a measure of the ordinal association between two measured quantities. It ranges from -1 to 1.
Tau value: The tau value for your data is approximately 0.0859, suggesting a very weak positive relationship between Thermographic Eye Temperature (IRT) and Cortisol levels.
P-value: The p-value is 0.004178. This indicates the probability that the observed correlation is due to random chance. Since this p-value is less than 0.05, it implies that the correlation is statistically significant.
Alternative hypothesis: The result supports the alternative hypothesis that the true tau is not equal to 0, meaning there’s evidence of a relationship between IRT and Cortisol levels in this sample.
In summary, despite the tau value indicating a weak positive correlation, the association between IRT and Cortisol levels is statistically significant. This suggests that even though the relationship is weak, it is likely not due to random chance.
Interpretation of Scatterplot (non-parametric trend):
# Load necessary librarylibrary(ggplot2)library(viridis)# Updated plotggplot(equine, aes(x = irt, y = cortisol)) +geom_point(alpha =0.6, color =viridis(1)) +# Colorblind-safe point colorgeom_smooth(method ="loess", color =viridis(3)[3], se =TRUE) +# Safe smooth line colorlabs(title ="Correlation Between IRT and Cortisol",x ="Thermographic Eye Temperature (Celsius)",y ="Cortisol (mcg/dL)") +theme_minimal() +# Clean themetheme(plot.title =element_text(hjust =0.5, face ="bold"))
`geom_smooth()` using formula = 'y ~ x'
This graph depicts the relationship between Thermographic Eye Temperature (IRT) and Cortisol levels (measured in mcg/dL). On the x-axis, we have the IRT ranging from 33 to 38 degrees Celsius, and on the y-axis, the Cortisol levels range from 60 to 120 mcg/dL. Each blue dot represents an individual data point.
The red line running through the data points is a non-parametric regression fit, meaning it doesn’t assume a specific mathematical form for the relationship but rather adapts to the data. The shaded gray area around the red line indicates the confidence interval, suggesting the range within which we can be reasonably certain the true relationship lies.
From the red line’s U-shaped pattern, it appears that Cortisol levels initially decrease as IRT increases up to around 34 degrees Celsius. Beyond that point, as IRT continues to rise, Cortisol levels begin to increase again.
This pattern could potentially be important for understanding physiological responses or stress indicators. The non-parametric fit gives a more flexible and accurate representation of the underlying relationship without being constrained by a particular model.
Interpretation of Scatterplot (linear model):
# Scatterplot with trendlineggplot(equine, aes(x = irt, y = cortisol)) +geom_point(color =viridis(1), alpha =0.8, size =2) +# Colorblind-safe pointsgeom_smooth(method ="lm", se =TRUE, color =viridis(3)[2], fill =viridis(3)[1]) +# Colorblind-safe trendline and filllabs(title ="Correlation Between IRT and Cortisol",x ="Thermographic Eye Temperature (Celsius)",y ="Cortisol (mcg/dL)") +theme_minimal() +# Clean and modern themetheme(plot.title =element_text(hjust =0.5, size =14, face ="bold"),axis.title =element_text(size =12),axis.text =element_text(size =10))
`geom_smooth()` using formula = 'y ~ x'
Positive Correlation: The trend line indicates a positive correlation, meaning as the Thermographic Eye Temperature increases, the Cortisol levels tend to increase.
Scatter of Data Points: The data points are widely scattered around the trend line, indicating that while there is a positive correlation, it is relatively weak.
Based on the correlation test results:
Correlation Coefficient (0.1286011): Indicates a positive but weak correlation between IRT and Cortisol levels.
P-Value (0.004045604): The p-value is less than 0.05, indicating that the correlation is statistically significant.
The scatter plot confirms the above statement by visually representing the positive correlation, though it is weak, as indicated by the scatter of the data points. The trend line and confidence interval also support the statistical significance of the correlation found in the test.
Overall, the graph aligns with the correlation test results, confirming a significant but weak positive correlation between Thermographic Eye Temperature and Cortisol levels.
2. Effect of Calming Spray on Compliance Time
Hypothesis: The calming spray reduces stress, leading to shorter compliance times.
H0: Calming spray does have an effect on compliance time
H1: Calming spray does not have an affect on compliance time
Data Exploration and Normality:
# Explore the structure of the eqsub datastr(eqsub)
'data.frame': 498 obs. of 4 variables:
$ ID : chr "EQ001" "EQ002" "EQ003" "EQ004" ...
$ sex : chr "Female" "Female" "Female" "Male" ...
$ comp : num 37.2 34.5 36.3 35.3 37.4 ...
$ comp2: num 31.5 29.6 30.8 30.1 31.7 ...
# Get the summary of eqsub datasummary(eqsub)
ID sex comp comp2
Length:498 Length:498 Min. :30.78 Min. :27.37
Class :character Class :character 1st Qu.:34.16 1st Qu.:29.23
Mode :character Mode :character Median :35.04 Median :29.99
Mean :35.12 Mean :29.94
3rd Qu.:36.05 3rd Qu.:30.65
Max. :40.08 Max. :32.81
# Test normality for separate variablesshapiro_comp <-shapiro.test(eqsub$comp)print("Shapiro-Wilk Test for Comp")
[1] "Shapiro-Wilk Test for Comp"
print(shapiro_comp)
Shapiro-Wilk normality test
data: eqsub$comp
W = 0.99691, p-value = 0.4666
shapiro_comp2 <-shapiro.test(eqsub$comp2)print("Shapiro-Wilk Test for Comp")
[1] "Shapiro-Wilk Test for Comp"
print(shapiro_comp2)
Shapiro-Wilk normality test
data: eqsub$comp2
W = 0.9943, p-value = 0.05974
# Additional variables of interest, not included in the original data set, are the change in compliance time following the calming spray:eqsub$diff <- eqsub$comp2 - eqsub$comp# We can now see the new variable createdstr(eqsub)
'data.frame': 498 obs. of 5 variables:
$ ID : chr "EQ001" "EQ002" "EQ003" "EQ004" ...
$ sex : chr "Female" "Female" "Female" "Male" ...
$ comp : num 37.2 34.5 36.3 35.3 37.4 ...
$ comp2: num 31.5 29.6 30.8 30.1 31.7 ...
$ diff : num -5.67 -4.92 -5.53 -5.2 -5.74 -4.98 -5.55 -5.5 -5.2 -5.4 ...
# Test normality for difference in complianceshapiro_compdiff <-shapiro.test(eqsub$diff)print("Shapiro-Wilk Test for Comp Diff")
[1] "Shapiro-Wilk Test for Comp Diff"
print(shapiro_compdiff)
Shapiro-Wilk normality test
data: eqsub$diff
W = 0.84738, p-value < 2.2e-16
For the separate variables, p > 0.05 for both which suggests there is no significant departures from normality. We can use this to compare the effect of calming spray on on compliance between sex’s. Using t-test and ANOVA.
However, when looking at the difference in the compliance variable the p value < 0.05 which suggests there is a significant departure from normality. We can use this to compare the effect of the calming spray before and after on compliance time.
# Histogram for compggplot(eqsub, aes(x = comp)) +geom_histogram(aes(y = ..density..), binwidth =1, fill ="blue", color ="black", alpha =0.7) +geom_density(color ="darkblue", size =1.5, adjust =1.5) +labs(title ="Histogram of compliance tiime before calming spray (comp)", x ="comp (seconds)", y ="Density") +theme_minimal() +theme(plot.title =element_text(hjust =0.5))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.
# Histogram for comp2ggplot(eqsub, aes(x = comp2)) +geom_histogram(aes(y = ..density..), binwidth =1, fill ="orange", color ="black", alpha =0.7) +geom_density(color ="darkorange", size =1.5, adjust =1.5) +labs(title ="Histogram of compliance time after calming spray (comp2)", x ="comp2 (seconds)", y ="Density") +theme_minimal() +theme(plot.title =element_text(hjust =0.5))
# Histogram for difference in compsggplot(eqsub, aes(x = diff)) +geom_histogram(aes(y = ..density..), binwidth =1, fill ="red", color ="black", alpha =0.7) +geom_density(color ="pink", size =1.5, adjust =1.5) +labs(title ="Histogram of difference in compliance time's", x ="difference (seconds)", y ="Density") +theme_minimal() +theme(plot.title =element_text(hjust =0.5))
# Perform Bartlett's Testbartlett.test(comp ~ sex, data = eqsub)
Bartlett test of homogeneity of variances
data: comp by sex
Bartlett's K-squared = 0.44588, df = 1, p-value = 0.5043
bartlett.test(comp2 ~ sex, data = eqsub)
Bartlett test of homogeneity of variances
data: comp2 by sex
Bartlett's K-squared = 0.22253, df = 1, p-value = 0.6371
Given that we now know that the variables ‘comp’,‘comp2’, and’compdiff’ do not deviate from normality, we can test for homogenity using Bartlett’s and Levenes Test. The results show that the compliance time data do not deviate significantly from homogeneity (both P > 0.05). This suggests that the variances equal.
# Zero's in the response variablessum(eqsub$diff ==0) *100/nrow(eqsub)
[1] 0
No zeros in any.
# Boxplot to compare compliance time between sexesggplot(eqsub, aes(x = sex, y = diff, fill = sex)) +geom_boxplot(outlier.colour ="red", outlier.shape =16, outlier.size =2) +# Customize outliersscale_fill_viridis_d(option ="C", begin =0.3, end =0.7) +# Use a colorblind-safe palettelabs(title ="Change in Compliance Time Following Calming Application",x ="Sex",y ="Change in Compliance Time (seconds)" ) +theme_minimal() +# Clean themetheme(plot.title =element_text(hjust =0.5, size =14, face ="bold"),axis.title =element_text(size =12),axis.text =element_text(size =10) )
Two Sample t-test
data: diff by sex
t = 3.2612, df = 496, p-value = 0.001186
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
0.04961562 0.20000304
sample estimates:
mean in group Female mean in group Male
-5.119675 -5.244484
# Mann-Whitney t-test for non-normal unpaired data for comparison of calming spray on different complaince times of female and male horseswilcox.test(diff ~ sex,data = eqsub,conf.int=TRUE)
Wilcoxon rank sum test with continuity correction
data: diff by sex
W = 35896, p-value = 0.002277
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
0.0199440 0.1299789
sample estimates:
difference in location
0.06996072
aggregate(diff ~ sex,data = eqsub, median)
sex diff
1 Female -5.045
2 Male -5.190
The results of the Mann-Whitney test show we should reject the null hypothesis that the changes in compliance time are equal between the sexes after calming spray (W = 35896, P = 0.002). We can expect a median increase in compliance of about 0.07 seconds for males in comparison with females following calming spray. We are also 95% certain that if we were to collect additional data from the same equine population that the median difference between females and males would be between 0.02 and 0.13 seconds.
Paired t-test of normal data
# Comparing 2 groups of normal paired data: paired t-test - the appropriate test to compare 2 groups of paired and normally distributed data with equal variancest.test(eqsub$comp, eqsub$comp2,paired =TRUE,var.equal =TRUE)
Paired t-test
data: eqsub$comp and eqsub$comp2
t = 268.28, df = 497, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
5.144875 5.220788
sample estimates:
mean difference
5.182831
The results of the paired t-test show an effect of the calming spray on equine compliance time (t268.28, P<0.0000..2). The mean (95% CI) change in compliance time following calming spray was 5.183 seconds.
ANOVA
# Load necessary librarieslibrary(dplyr)# Convert 'sex' to a factor for ANOVAeqsub$sex <-as.factor(eqsub$sex)# Perform one-way ANOVA for 'comp'anova_comp <-aov(comp ~ sex, data = eqsub)summary(anova_comp)
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 17.7 17.689 8.782 0.00319 **
Residuals 496 999.1 2.014
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Perform one-way ANOVA for 'comp2'anova_comp2 <-aov(comp2 ~ sex, data = eqsub)summary(anova_comp2)
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 7.9 7.915 7.317 0.00707 **
Residuals 496 536.5 1.082
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_model <-aov(comp2 ~ sex + comp, data = eqsub)# Summary of the ANOVA modelsummary(anova_model)
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 7.9 7.9 281.6 <2e-16 ***
comp 1 522.6 522.6 18596.2 <2e-16 ***
Residuals 495 13.9 0.0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Check assumptions: Homogeneity of regression slopesinteraction_model <-aov(comp2 ~ sex * comp, data = eqsub)summary(interaction_model)
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 7.9 7.9 281.086 <2e-16 ***
comp 1 522.6 522.6 18560.069 <2e-16 ***
sex:comp 1 0.0 0.0 0.038 0.846
Residuals 494 13.9 0.0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Plot the relationship between pre-spray and post-spray compliance time by sexggplot(eqsub, aes(x = comp, y = comp2, color = sex)) +geom_point(alpha =0.6) +geom_smooth(method ="lm", se =FALSE) +labs(title ="ANCOVA: Post-Spray Compliance Time by Sex Controlling for Pre-Spray Compliance Time",x ="Pre-Spray Compliance Time (comp)",y ="Post-Spray Compliance Time (comp2)") +theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
ANOVA Results: The p-value’s are less than 0.05, indicating that the mean value of the dependent variable differs significantly between the sex groups.
2 Way ANOVA Results: The p-value is less than the standard significance level (0.05), meaning there is a statistically significant difference in the combined dependent variables (comp and comp2) between the groups defined by sex.
Box-plot of compliance time differences
boxplot_data <- eqsub %>%pivot_longer(cols =starts_with("comp"), names_to ="Condition", values_to ="ComplianceTime")# Create the boxplotggplot(boxplot_data, aes(x = Condition, y = ComplianceTime, fill = Condition)) +geom_boxplot(outlier.color ="red", outlier.shape =16, outlier.size =2) +# Customize outliersscale_fill_viridis_d(option ="C", begin =0.3, end =0.8) +# Colorblind-safe fill colorslabs(title ="Effect of Calming Spray Before and After Application on Compliance Time",x ="Condition (Before/After Spray)",y ="Compliance Time (seconds)" ) +theme_minimal() +# Clean and modern themetheme(plot.title =element_text(hjust =0.5, size =14, face ="bold"),axis.title =element_text(size =12),axis.text =element_text(size =10),legend.position ="none") # Remove legend if unnecessary
Analysis:
Median Compliance Time:
comp: The median compliance time is around 35 seconds.
comp2: The median compliance time is around 32 seconds.
Interquartile Range (IQR):
comp: The IQR spans from approximately 33 to 37 seconds.
comp2: The IQR spans from approximately 31 to 33 seconds.
Outliers:
comp: There are a few outliers above 40 seconds and below 30 seconds.
comp2: There is one outlier below 30 seconds.
Interpretation: Reduction in Median Compliance Time: The median compliance time decreases from about 35 seconds to 32 seconds after the application of the calming spray, suggesting that the spray effectively reduces compliance time.
Narrower IQR: The IQR is narrower after the application of the calming spray (comp2), indicating less variability in compliance times, which could suggest more consistent behavior post-application.
Outliers: There are fewer outliers in the post-spray condition, which could indicate that the calming spray helps to standardize the behaviour more effectively.
Conclusion: The boxplot suggests that the application of the calming spray results in a reduced and more consistent compliance time, indicating the spray’s potential effectiveness in improving compliance behavior.
3. Predicting Compliance Time
Hypothesis: Compliance time is influenced by a combination of factors such as IRT, cortisol, air temperature, weight, and heart rate.
H0: Compliance time can be predicted by correlation to multiple factors
H1: Compliance time cannot be predicted by correlation to multiple factors
# Check correlations to understand relationships between predictors and compliance timecor(equine[, c("comp", "irt", "cortisol", "BPM", "weight", "air")], use ="complete.obs")
# Fit a multiple linear regression modelmodel <-lm(comp ~ irt + cortisol + BPM + weight + air, data = equine)# Summary of the modelsummary(model)
Call:
lm(formula = comp ~ irt + cortisol + BPM + weight + air, data = equine)
Residuals:
Min 1Q Median 3Q Max
-2.88796 -0.65339 0.00537 0.58312 2.79258
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.910e+01 3.641e+00 -10.739 <2e-16 ***
irt -3.078e-02 3.191e-02 -0.964 0.335
cortisol -8.361e-04 2.602e-03 -0.321 0.748
BPM 5.048e-01 2.228e-02 22.661 <2e-16 ***
weight 1.325e-03 3.361e-03 0.394 0.694
air -2.256e-02 2.708e-02 -0.833 0.405
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.003 on 492 degrees of freedom
Multiple R-squared: 0.5129, Adjusted R-squared: 0.5079
F-statistic: 103.6 on 5 and 492 DF, p-value: < 2.2e-16
# Visualize residualspar(mfrow =c(2, 2)) # Arrange diagnostic plots in a gridplot(model)
Key Interpretation Points Coefficients and Significance:
The intercept (-39.10) is statistically significant (p-value < 2e-16), but this is expected as the model includes multiple predictors.
BPM (Heart rate) has a strong and significant relationship with compliance time (p-value < 2e-16), with a positive coefficient (0.5048), suggesting that higher heart rates are associated with increased compliance time.
IRT (Thermographic eye temperature), cortisol, weight, and air temperature are not statistically significant, with p-values well above 0.05 (e.g., irt = 0.335, cortisol = 0.748, weight = 0.694, and air = 0.405). This suggests that these variables do not have a strong influence on the compliance time, at least in the linear context of this model. Model Fit:
R-squared: 0.513. This means that about 51.3% of the variance in compliance time (comp) is explained by the model. While this is not very high, it indicates that the model does capture some of the variability.
Adjusted R-squared: 0.507. This value adjusts for the number of predictors and gives a slightly more conservative estimate of model performance. Since it’s close to the R-squared value, it shows that the predictors are not adding too much noise.
F-statistic: The F-statistic of 103.6 and the p-value < 2.2e-16 suggest that the model overall is statistically significant, meaning at least one predictor contributes to explaining the compliance time.
Residuals: The residuals show a good spread around zero, with a range from -2.89 to 2.79. The median of the residuals is close to zero, which suggests no systematic bias.
Residual standard error: The standard error of the residuals is 1.003, which can be interpreted as the average difference between the observed and predicted compliance time.
What This Means for Predicting Compliance Time
Heart rate (BPM) is the most important predictor, with a strong and statistically significant effect on compliance time. Other variables (like irt, cortisol, weight, and air) do not contribute significantly to the prediction of compliance time in this linear model. The R-squared value of 0.513 indicates a moderate fit, meaning that the model explains about half of the variability in compliance time. This suggests that while there are some predictive capabilities, there are likely other factors not captured by the current set of predictors.
How to Improve the Model or Visualize Predictions:
Further Exploration of Non-significant Variables: Since some predictors like irt, cortisol, and air were not significant, you might want to check if there are interactions between variables or if some predictors need to be transformed (e.g., log transformation or polynomial terms) to better model their effects. Advanced Models:
Generalized Additive Models (GAMs): A GAM might help if the relationship between predictors and compliance time is nonlinear.
Better Visualization: You can use predicted vs actual plots to assess how well your model is predicting compliance time.
# Get predicted values from the modelequine$predicted <-predict(model)# Plot actual vs predicted compliance timeggplot(equine, aes(x = predicted, y = comp)) +geom_point(alpha =0.6) +geom_abline(slope =1, intercept =0, color ="red", linetype ="dashed") +labs(title ="Actual vs Predicted Compliance Time",x ="Predicted Compliance Time (seconds)",y ="Actual Compliance Time (seconds)") +theme_minimal() +theme(plot.title =element_text(hjust =0.5))
Data Exploration
str(equine)
'data.frame': 498 obs. of 9 variables:
$ ID : chr "Eq001" "Eq002" "Eq003" "Eq004" ...
$ sex : chr "Female" "Female" "Female" "Male" ...
$ comp : num 37.2 34.5 36.3 35.3 37.4 ...
$ weight : num 74.7 73.4 71.8 104.6 67.1 ...
$ irt : num 37.7 35.7 34.8 36.2 33.6 ...
$ air : num 23.4 21.4 20.1 21.6 21.8 ...
$ cortisol : num 64.1 73.7 54.4 86.3 108 ...
$ BPM : num 153 150 149 150 149 ...
$ predicted: num 36.6 34.9 34.9 35.2 34.6 ...
colSums(is.na(equine))
ID sex comp weight irt air cortisol BPM
0 0 0 0 0 0 0 0
predicted
0
No missing data are associated with any of the variables in the dataframe.
Normality & Homogeneity of dependent variables
A gamma GLM does not assume is that the response variable is normally distributed. We can visualise the distribution of the response variable by dividing the x-axis into ‘bins’ and counting the number of observations in each bin as a frequency polygon using the geom_freqpoly() function from the ggplot2 package:
p <-ggplot() p <- p +ylab("Frequency") p <- p +xlab("Compliance Time (seconds")p <- p +theme(panel.background =element_blank()) p <- p +theme(panel.border =element_rect(fill =NA, colour ="black", size =1))
Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.
p <- p +theme(strip.background =element_rect(fill ="white", color ="white", size =1)) p <- p +theme(text =element_text(size=15)) p <- p +theme(legend.position='none') p <- p +geom_freqpoly(data = equine, aes(comp), bins =10) p
Nomrality
shapiro.test(equine$comp)
Shapiro-Wilk normality test
data: equine$comp
W = 0.99692, p-value = 0.4695
The test shows no significant departure from normality (P > 0.05).
VIF
#install.packages("car")library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
vif(lm(comp~ sex + weight + irt + air + cortisol + BPM, data = equine))
sex weight irt air cortisol BPM
1.006023 1.010740 1.024534 1.013868 1.018760 1.004401
MANOVA
# Perform Multiple ANOVA to evaluate predictors of 'comp'anova_model_multi <-aov(comp ~ BPM + sex + irt + weight + cortisol + air, data = equine)# Summary of the Multiple ANOVA modelsummary(anova_model_multi)
Pearson's product-moment correlation
data: equine$comp and equine$BPM
t = 22.767, df = 496, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6689964 0.7552703
sample estimates:
cor
0.7148428
## There is a statistically significant and strong positive correlation (r=0.715) between comp and BPM.
Simple Generelized Linear Model
# Perform Linear Regression: Predict 'comp' using 'BPM'linear_model <-lm(comp ~ BPM, data = equine)# Summary of the linear regression modelsummary(linear_model)
Call:
lm(formula = comp ~ BPM, data = equine)
Residuals:
Min 1Q Median 3Q Max
-2.81905 -0.65308 -0.01023 0.59085 2.83181
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40.79081 3.33482 -12.23 <2e-16 ***
BPM 0.50564 0.02221 22.77 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.001 on 496 degrees of freedom
Multiple R-squared: 0.511, Adjusted R-squared: 0.51
F-statistic: 518.3 on 1 and 496 DF, p-value: < 2.2e-16
# Plot regression line between 'comp' and 'BPM'ggplot(equine, aes(x = BPM, y = comp)) +geom_point(color ="blue", alpha =0.4) +geom_smooth(method ="lm", se =TRUE, color ="red") +labs(x ="Heart Rate (BPM)",y ="Compliance Time (seconds)") +theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
correlation coefficient r = 0.7148428 This is the sample correlation, indicating a strong positive linear relationship between comp and BPM. A value close to 1 suggests a strong positive correlation.
PCA - Principle Component Analysis
#install.packages("devtools")library(devtools)
Loading required package: usethis
library(ggplot2)library(grid)str(equine)
'data.frame': 498 obs. of 9 variables:
$ ID : chr "Eq001" "Eq002" "Eq003" "Eq004" ...
$ sex : chr "Female" "Female" "Female" "Male" ...
$ comp : num 37.2 34.5 36.3 35.3 37.4 ...
$ weight : num 74.7 73.4 71.8 104.6 67.1 ...
$ irt : num 37.7 35.7 34.8 36.2 33.6 ...
$ air : num 23.4 21.4 20.1 21.6 21.8 ...
$ cortisol : num 64.1 73.7 54.4 86.3 108 ...
$ BPM : num 153 150 149 150 149 ...
$ predicted: num 36.6 34.9 34.9 35.2 34.6 ...
#Run PCA#Begin by deriving PCsequine.pca <-prcomp(equine[,c(3:8)], center =TRUE, scale. =TRUE)#Examine PCA objectsummary(equine.pca)
These results show that PC1 explains 29% of the total variance in the dataset of 6 variables. PC2 explain 19% of variance, which means the combination of PC1 and PC2 explain ~50% of variance in the dataset. The third Principal Component (PC3) explains an additional 18% of variance etc. This information can be visualised in a scree plot:
#Apply Kaisers' criterion## An additional approach is to use Kaiser’s criterion. Here we retain Principal Components for which the variance is above 1 when PCA is applied to standardised data.options(digits=1)(equine.pca$sdev)^2
[1] 1.7 1.2 1.1 0.9 0.8 0.3
The scree plot further illustrates that PC1 and PC2 (& PC3) capture the greatest proportion of variance in the data.
The first rule-of-thumb is to aim for a specific threshold of explained variance. A figure of 80% is often used, which in this case means PC1, PC2, and PC3 are needed to achieve the threshold.
Based on Kaiser’s criterion only PC1 (1.7), PC2 (1.2), and PC3 (1.1) should be retained for analysis. Thus, based on an 80% threshold, scree plot and Kaisers’ criterion just PC1, PC2, and PC3 are sufficient to obtain an accurate picture of the equine data.
#Loading coefficients for PC1options(digits=2)equine.pca$rotation[,1]
# Performing a gaussian (normally distrib.) generalized linear model on key componentsglm_model <-glm(comp ~ sex + irt + BPM, data = equine, family ="gaussian")summary(glm_model)
Call:
glm(formula = comp ~ sex + irt + BPM, family = "gaussian", data = equine)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -39.1723 3.5052 -11.18 <2e-16 ***
sexMale 0.2561 0.0893 2.87 0.0043 **
irt -0.0341 0.0313 -1.09 0.2760
BPM 0.5021 0.0221 22.74 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 1)
Null deviance: 1016.77 on 497 degrees of freedom
Residual deviance: 488.07 on 494 degrees of freedom
AIC: 1413
Number of Fisher Scoring iterations: 2
Sex has a small but significant effect on the response variable. BPM is the strongest predictor in this model, as indicated by the large and highly significant coefficient. IRT does not significantly contribute to explaining the response.
Sex (0.2561): Being male is associated with an increase of 0.2561 units in the response variable compared to females, holding other variables constant. This effect is statistically significant (p=0.0043<0.01).
irt (-0.0341): A unit increase in irt is associated with a decrease of 0.0341 units in the response variable, but this effect is not statistically significant (p=0.276).
BPM (0.5021): A unit increase in BPM is associated with a significant increase of 0.5021 units in the response variable (p<0.001).
Visualizing the Gaussian Relationship
# Load ggplot2 for visualizationlibrary(ggplot2)# Create a new data frame for predictionsnew_equine <-data.frame(sex =c("Male", "Female"),irt =seq(min(equine$irt), max(equine$irt), length.out =100),BPM =seq(min(equine$BPM), max(equine$BPM), length.out =100))# Predict values based on the glm modelnew_equine$predicted_comp <-predict(glm_model, newdata = new_equine)# Plot the regression lineggplot(equine, aes(x = BPM, y = comp, color = sex)) +geom_point(alpha =0.5) +# Scatter plot of the actual datageom_line(data = new_equine, aes(x = BPM, y = predicted_comp, color = sex), size =1, se =TRUE) +labs(title ="Regression Line for Compliance Time",x ="Heart Rate (BPM)",y ="Compliance Time (seconds)",color ="Sex" ) +theme_minimal() +scale_color_brewer(palette ="Set1")
Warning in geom_line(data = new_equine, aes(x = BPM, y = predicted_comp, :
Ignoring unknown parameters: `se`