RMDA Project

Author

Ciarra Mooney

Research Methods & Data Analysis Paper

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)
# install.packages("viridis")
library(viridis)

Loading required package: viridisLite

print(citation("tidyverse"), style = "text")

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E,
Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi
K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the
tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.

print(citation("ggplot2"), style = "text")

Wickham H (2016). _ggplot2: Elegant Graphics for Data Analysis_.
Springer-Verlag New York. ISBN 978-3-319-24277-4,
<https://ggplot2.tidyverse.org>.

print(citation("dplyr"), style = "text")

Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A
Grammar of Data Manipulation_. R package version 1.1.4,
<https://CRAN.R-project.org/package=dplyr>.

print(citation("viridis"), style = "text")

Garnier, Simon, Ross, Noam, Rudis, Robert, Camargo, Pedro A, Sciaini,
Marco, Scherer, Cédric (2024). _viridis(Lite) - Colorblind-Friendly
Color Maps for R_. doi:10.5281/zenodo.4679423
<https://doi.org/10.5281/zenodo.4679423>, viridis package version
0.6.5, <https://sjmgarnier.github.io/viridis/>.

# Load data sets in csv form & text()# Load data sets in csv form & save as data frame variables in global environment:

eqsub <- read.csv("~/Library/Mobile Documents/com~apple~CloudDocs/R Studio/RMDA Project/eqsub.csv")

equine <- read.csv("~/Library/Mobile Documents/com~apple~CloudDocs/R Studio/RMDA Project/equine.csv")

equine data: Compliance time (time taken for a horse to completed a particular task ) has long been considered an easy estimate of a horses “stress levels”. Here you have been supplied with a data set in the form of an Excel file (equine.xlsx) that contains compliance times for a group of horses who have been tasked with negotiating a novel obstacle (a streamer curtain - see video recording).

The data file equine.xlsx also contains a number of other variables used to quantify stress in addition to some environmental and biometric measurements.

eqsub data: A second set of experiments was conducted on the original study group where individuals were set the same task again after the area was sprayed with a substance designed to calm horses.

Themes and questions:

In your write up please consider the pros and cons of invasive and non-invasive measurements in animal research.

Invasive vs. Non-Invasive Measurements

Invasive Measurements Examples: Blood cortisol levels (require blood sampling).

Pros: Highly accurate and specific.
Cons: Stress induced by the sampling process might influence measurements.

Non-Invasive Measurements Examples: Compliance time, heart rate (BPM), thermographic eye temperature (IRT).

Pros: No direct harm or stress to the animal.
Cons: Can be less accurate or influenced by external factors.

In your analysis please answer these three questions:

1. Is there a correlation between IRT (Thermographic Eye Temperature) and Cortisol?

Hypothesis: Stress may cause changes in both IRT and cortisol, suggesting a possible correlation.
H0: There is a correlation between IRT and Cortisol
H1: There is no correlation between IRT and Coritsol
Approach:
- Test assumptions of normality
- Calculate the Pearson/Spearmans/ Kendall Tau correlation coefficient (cor() in R) to measure the strength of the linear relationship.
- Visualize the relationship with a scatterplot and a trend line.
- Test for statistical significance using cor.test().

2. Does the calming spray affect compliance times?

Hypothesis: The calming spray reduces stress, leading to shorter compliance times.
H0: Calming spray does have an effect on compliance time
H1: Calming spray does not have an affect on compliance time
Approach:
- Compare compliance times before and after the spray using a paired t-test (if the same horses are measured) or an unpaired t-test (if different groups are used).
- Visualize the results using boxplots to show before-and-after differences.

3. Can compliance time be predicted?

Hypothesis: Compliance time is influenced by a combination of factors such as IRT, cortisol, air temperature, weight, and heart rate.
H0: Compliance time can be predicted by correlation to multiple factors
H1: Compliance time cannot be predicted by correlation to multiple factors
Approach:
- Build a multiple linear regression model using lm() in R.
- Check the model’s assumptions (normality, homoscedasticity, multicollinearity).
- Evaluate the model’s performance using metrics such as R2 and residual plots.

Data Preparation

# Inspect the structure of the data
str(equine)

'data.frame':   498 obs. of  8 variables:
 $ ID      : chr  "Eq001" "Eq002" "Eq003" "Eq004" ...
 $ sex     : chr  "Female" "Female" "Female" "Male" ...
 $ comp    : num  37.2 34.5 36.3 35.3 37.4 ...
 $ weight  : num  74.7 73.4 71.8 104.6 67.1 ...
 $ irt     : num  37.7 35.7 34.8 36.2 33.6 ...
 $ air     : num  23.4 21.4 20.1 21.6 21.8 ...
 $ cortisol: num  64.1 73.7 54.4 86.3 108 ...
 $ BPM     : num  153 150 149 150 149 ...

str(eqsub)

'data.frame':   498 obs. of  4 variables:
 $ ID   : chr  "EQ001" "EQ002" "EQ003" "EQ004" ...
 $ sex  : chr  "Female" "Female" "Female" "Male" ...
 $ comp : num  37.2 34.5 36.3 35.3 37.4 ...
 $ comp2: num  31.5 29.6 30.8 30.1 31.7 ...

# Get summaries of data
summary(equine)

      ID                sex                 comp           weight      
 Length:498         Length:498         Min.   :30.78   Min.   : 65.10  
 Class :character   Class :character   1st Qu.:34.16   1st Qu.: 75.67  
 Mode  :character   Mode  :character   Median :35.04   Median : 87.82  
                                       Mean   :35.12   Mean   : 87.93  
                                       3rd Qu.:36.05   3rd Qu.:100.34  
                                       Max.   :40.08   Max.   :110.94  
      irt             air           cortisol           BPM       
 Min.   :33.00   Min.   :20.01   Min.   : 50.03   Min.   :144.6  
 1st Qu.:34.42   1st Qu.:21.54   1st Qu.: 67.27   1st Qu.:148.9  
 Median :35.43   Median :23.11   Median : 82.43   Median :150.1  
 Mean   :35.54   Mean   :23.02   Mean   : 82.00   Mean   :150.1  
 3rd Qu.:36.71   3rd Qu.:24.35   3rd Qu.: 95.96   3rd Qu.:151.3  
 Max.   :38.00   Max.   :25.99   Max.   :112.45   Max.   :156.8

summary(eqsub)

      ID                sex                 comp           comp2      
 Length:498         Length:498         Min.   :30.78   Min.   :27.37  
 Class :character   Class :character   1st Qu.:34.16   1st Qu.:29.23  
 Mode  :character   Mode  :character   Median :35.04   Median :29.99  
                                       Mean   :35.12   Mean   :29.94  
                                       3rd Qu.:36.05   3rd Qu.:30.65  
                                       Max.   :40.08   Max.   :32.81

# Check for missing values
sum(is.na(equine))

[1] 0

sum(is.na(eqsub))

[1] 0

1. Correlation Analysis: IRT vs. Cortisol

Stress may cause changes in both IRT and cortisol, suggesting a possible correlation.

H0: There is a correlation between IRT and Cortisol
H1: There is no correlation between IRT and Coritsol

Data exploration & Normality:

# See structure of equine data
str(equine)

'data.frame':   498 obs. of  8 variables:
 $ ID      : chr  "Eq001" "Eq002" "Eq003" "Eq004" ...
 $ sex     : chr  "Female" "Female" "Female" "Male" ...
 $ comp    : num  37.2 34.5 36.3 35.3 37.4 ...
 $ weight  : num  74.7 73.4 71.8 104.6 67.1 ...
 $ irt     : num  37.7 35.7 34.8 36.2 33.6 ...
 $ air     : num  23.4 21.4 20.1 21.6 21.8 ...
 $ cortisol: num  64.1 73.7 54.4 86.3 108 ...
 $ BPM     : num  153 150 149 150 149 ...

# An excess of zeros

sum(equine$irt == 0) * 100 / nrow(equine)

[1] 0

sum(equine$cortisol == 0) * 100 / nrow(equine)

[1] 0

# Shapiro-Wilk normality test for IRT
shapiro_irt <- shapiro.test(equine$irt)
print("Shapiro-Wilk Test for IRT")

[1] "Shapiro-Wilk Test for IRT"

print(shapiro_irt)


    Shapiro-Wilk normality test

data:  equine$irt
W = 0.95889, p-value = 1.447e-10

# Shapiro-Wilk normality test for cortisol
shapiro_cortisol <- shapiro.test(equine$cortisol)
print("Shapiro-Wilk Test for Cortisol")

[1] "Shapiro-Wilk Test for Cortisol"

print(shapiro_cortisol)


    Shapiro-Wilk normality test

data:  equine$cortisol
W = 0.96362, p-value = 9.245e-10

Both data sets are non-normally distributed (P <<< 0.05), but statistically significant. Thus, we cannot use a normal linear correlation model. We cannot then use the Pearson’s correlation but can rather use the Spearman’s Rank Correlation or the Kendall Tau Correlation.

Spearmen’s Rank Correlation Interpretation (non-normal data):

# Spearman's correlation
spearman_corr <- cor(equine$irt, equine$cortisol, method = "spearman", use = "complete.obs")
print("Spearman's Rank Correlation Coefficient:")

[1] "Spearman's Rank Correlation Coefficient:"

print(spearman_corr)

[1] 0.1314346

# Spearman's correlation significance test
spearman_test <- cor.test(equine$irt, equine$cortisol, method = "spearman")
print("Spearman's Correlation Test")

[1] "Spearman's Correlation Test"

print(spearman_test)


    Spearman's rank correlation rho

data:  equine$irt and equine$cortisol
S = 17878766, p-value = 0.00332
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.1314346

Spearman’s Rank Correlation Coefficient (𝜌= 0.131): This is a weak positive monotonic relationship between IRT (Infrared Thermography) and Cortisol levels. A weak relationship suggests that as one variable increases, the other tends to increase slightly, but this trend is not strong.

P-value (p=0.00332): The p-value is less than 0.05, indicating that the correlation is statistically significant. This means it is unlikely that the observed relationship is due to random chance, and there is evidence of a weak monotonic association between IRT and Cortisol.

Since the p-value is significant, we reject the null hypothesis (𝐻0:𝜌=0) and accept the alternative hypothesis that there is a non-zero correlation between IRT and Cortisol.

Kendal Tau’s Correlation Interpretation:

A more robust alternative to the Spearman’s correlation, which accommodates ties in the ranked data and small sample size, is Kendall’s Tau correlation.

kendall_tau <- cor.test(equine$irt,            
         equine$cortisol,            
         method = c("kendall"))
print("Kendall Tau's Correlation Test")

[1] "Kendall Tau's Correlation Test"

print(kendall_tau)


    Kendall's rank correlation tau

data:  equine$irt and equine$cortisol
z = 2.8644, p-value = 0.004178
alternative hypothesis: true tau is not equal to 0
sample estimates:
       tau 
0.08587267

Kendall’s tau (τ): This is a measure of the ordinal association between two measured quantities. It ranges from -1 to 1.

Tau value: The tau value for your data is approximately 0.0859, suggesting a very weak positive relationship between Thermographic Eye Temperature (IRT) and Cortisol levels.

P-value: The p-value is 0.004178. This indicates the probability that the observed correlation is due to random chance. Since this p-value is less than 0.05, it implies that the correlation is statistically significant.

Alternative hypothesis: The result supports the alternative hypothesis that the true tau is not equal to 0, meaning there’s evidence of a relationship between IRT and Cortisol levels in this sample.

In summary, despite the tau value indicating a weak positive correlation, the association between IRT and Cortisol levels is statistically significant. This suggests that even though the relationship is weak, it is likely not due to random chance.

Interpretation of Scatterplot (non-parametric trend):

# Load necessary library
library(ggplot2)
library(viridis)

# Updated plot
ggplot(equine, aes(x = irt, y = cortisol)) +
  geom_point(alpha = 0.6, color = viridis(1)) +  # Colorblind-safe point color
  geom_smooth(method = "loess", color = viridis(3)[3], se = TRUE) + # Safe smooth line color
  labs(title = "Correlation Between IRT and Cortisol",
       x = "Thermographic Eye Temperature (Celsius)",
       y = "Cortisol (mcg/dL)") +
  theme_minimal() +  # Clean theme
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

`geom_smooth()` using formula = 'y ~ x'

This graph depicts the relationship between Thermographic Eye Temperature (IRT) and Cortisol levels (measured in mcg/dL). On the x-axis, we have the IRT ranging from 33 to 38 degrees Celsius, and on the y-axis, the Cortisol levels range from 60 to 120 mcg/dL. Each blue dot represents an individual data point.

The red line running through the data points is a non-parametric regression fit, meaning it doesn’t assume a specific mathematical form for the relationship but rather adapts to the data. The shaded gray area around the red line indicates the confidence interval, suggesting the range within which we can be reasonably certain the true relationship lies.

From the red line’s U-shaped pattern, it appears that Cortisol levels initially decrease as IRT increases up to around 34 degrees Celsius. Beyond that point, as IRT continues to rise, Cortisol levels begin to increase again.

This pattern could potentially be important for understanding physiological responses or stress indicators. The non-parametric fit gives a more flexible and accurate representation of the underlying relationship without being constrained by a particular model.

Interpretation of Scatterplot (linear model):

# Scatterplot with trendline
ggplot(equine, aes(x = irt, y = cortisol)) +
  geom_point(color = viridis(1), alpha = 0.8, size = 2) +  # Colorblind-safe points
  geom_smooth(method = "lm", se = TRUE, color = viridis(3)[2], fill = viridis(3)[1]) + # Colorblind-safe trendline and fill
  labs(title = "Correlation Between IRT and Cortisol",
       x = "Thermographic Eye Temperature (Celsius)",
       y = "Cortisol (mcg/dL)") +
  theme_minimal() +  # Clean and modern theme
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10))

`geom_smooth()` using formula = 'y ~ x'

Positive Correlation: The trend line indicates a positive correlation, meaning as the Thermographic Eye Temperature increases, the Cortisol levels tend to increase.

Scatter of Data Points: The data points are widely scattered around the trend line, indicating that while there is a positive correlation, it is relatively weak.

Based on the correlation test results:

Correlation Coefficient (0.1286011): Indicates a positive but weak correlation between IRT and Cortisol levels.

P-Value (0.004045604): The p-value is less than 0.05, indicating that the correlation is statistically significant.

The scatter plot confirms the above statement by visually representing the positive correlation, though it is weak, as indicated by the scatter of the data points. The trend line and confidence interval also support the statistical significance of the correlation found in the test.

Overall, the graph aligns with the correlation test results, confirming a significant but weak positive correlation between Thermographic Eye Temperature and Cortisol levels.

2. Effect of Calming Spray on Compliance Time

Hypothesis: The calming spray reduces stress, leading to shorter compliance times.

H0: Calming spray does have an effect on compliance time
H1: Calming spray does not have an affect on compliance time

Data Exploration and Normality:

# Explore the structure of the eqsub data
str(eqsub)

'data.frame':   498 obs. of  4 variables:
 $ ID   : chr  "EQ001" "EQ002" "EQ003" "EQ004" ...
 $ sex  : chr  "Female" "Female" "Female" "Male" ...
 $ comp : num  37.2 34.5 36.3 35.3 37.4 ...
 $ comp2: num  31.5 29.6 30.8 30.1 31.7 ...

# Get the summary of eqsub data
summary(eqsub)

      ID                sex                 comp           comp2      
 Length:498         Length:498         Min.   :30.78   Min.   :27.37  
 Class :character   Class :character   1st Qu.:34.16   1st Qu.:29.23  
 Mode  :character   Mode  :character   Median :35.04   Median :29.99  
                                       Mean   :35.12   Mean   :29.94  
                                       3rd Qu.:36.05   3rd Qu.:30.65  
                                       Max.   :40.08   Max.   :32.81

# Test normality for separate variables
shapiro_comp <- shapiro.test(eqsub$comp)
print("Shapiro-Wilk Test for Comp")

[1] "Shapiro-Wilk Test for Comp"

print(shapiro_comp)


    Shapiro-Wilk normality test

data:  eqsub$comp
W = 0.99691, p-value = 0.4666

shapiro_comp2 <- shapiro.test(eqsub$comp2)
print("Shapiro-Wilk Test for Comp")

[1] "Shapiro-Wilk Test for Comp"

print(shapiro_comp2)


    Shapiro-Wilk normality test

data:  eqsub$comp2
W = 0.9943, p-value = 0.05974

# Additional variables of interest, not included in the original data set, are the change in compliance time following the calming spray:
eqsub$diff <- eqsub$comp2 - eqsub$comp

# We can now see the new variable created
str(eqsub)

'data.frame':   498 obs. of  5 variables:
 $ ID   : chr  "EQ001" "EQ002" "EQ003" "EQ004" ...
 $ sex  : chr  "Female" "Female" "Female" "Male" ...
 $ comp : num  37.2 34.5 36.3 35.3 37.4 ...
 $ comp2: num  31.5 29.6 30.8 30.1 31.7 ...
 $ diff : num  -5.67 -4.92 -5.53 -5.2 -5.74 -4.98 -5.55 -5.5 -5.2 -5.4 ...

# Test normality for difference in compliance
shapiro_compdiff <- shapiro.test(eqsub$diff)
print("Shapiro-Wilk Test for Comp Diff")

[1] "Shapiro-Wilk Test for Comp Diff"

print(shapiro_compdiff)


    Shapiro-Wilk normality test

data:  eqsub$diff
W = 0.84738, p-value < 2.2e-16

For the separate variables, p > 0.05 for both which suggests there is no significant departures from normality. We can use this to compare the effect of calming spray on on compliance between sex’s. Using t-test and ANOVA.

However, when looking at the difference in the compliance variable the p value < 0.05 which suggests there is a significant departure from normality. We can use this to compare the effect of the calming spray before and after on compliance time.

# Histogram for comp
ggplot(eqsub, aes(x = comp)) +
  geom_histogram(aes(y = ..density..), binwidth = 1, fill = "blue", color = "black", alpha = 0.7) +
  geom_density(color = "darkblue", size = 1.5, adjust = 1.5) +
  labs(title = "Histogram of compliance tiime before calming spray (comp)", x = "comp (seconds)", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.

# Histogram for comp2
ggplot(eqsub, aes(x = comp2)) +
  geom_histogram(aes(y = ..density..), binwidth = 1, fill = "orange", color = "black", alpha = 0.7) +
  geom_density(color = "darkorange", size = 1.5, adjust = 1.5) +
  labs(title = "Histogram of compliance time after calming spray (comp2)", x = "comp2 (seconds)", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Histogram for difference in comps
ggplot(eqsub, aes(x = diff)) +
  geom_histogram(aes(y = ..density..), binwidth = 1, fill = "red", color = "black", alpha = 0.7) +
  geom_density(color = "pink", size = 1.5, adjust = 1.5) +
  labs(title = "Histogram of difference in compliance time's", x = "difference (seconds)", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Perform Bartlett's Test
bartlett.test(comp ~ sex,                                
              data = eqsub)


    Bartlett test of homogeneity of variances

data:  comp by sex
Bartlett's K-squared = 0.44588, df = 1, p-value = 0.5043

bartlett.test(comp2 ~ sex,                                
              data = eqsub)


    Bartlett test of homogeneity of variances

data:  comp2 by sex
Bartlett's K-squared = 0.22253, df = 1, p-value = 0.6371

Given that we now know that the variables ‘comp’,‘comp2’, and’compdiff’ do not deviate from normality, we can test for homogenity using Bartlett’s and Levenes Test. The results show that the compliance time data do not deviate significantly from homogeneity (both P > 0.05). This suggests that the variances equal.

# Zero's in the response variables
sum(eqsub$diff == 0) * 100 / nrow(eqsub)

[1] 0

No zeros in any.

# Boxplot to compare compliance time between sexes
ggplot(eqsub, aes(x = sex, y = diff, fill = sex)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 2) +  # Customize outliers
  scale_fill_viridis_d(option = "C", begin = 0.3, end = 0.7) +  # Use a colorblind-safe palette
  labs(
    title = "Change in Compliance Time Following Calming Application",
    x = "Sex",
    y = "Change in Compliance Time (seconds)"
  ) +
  theme_minimal() +  # Clean theme
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10)
  )

Mann-Whitney T-test for Non-normal Unpaired Data

# t-testfor unpaired data
t.test(diff ~ sex,          
       data = eqsub,          
       var.equal = TRUE)


    Two Sample t-test

data:  diff by sex
t = 3.2612, df = 496, p-value = 0.001186
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 0.04961562 0.20000304
sample estimates:
mean in group Female   mean in group Male 
           -5.119675            -5.244484

# Mann-Whitney t-test for non-normal unpaired data for comparison of calming spray on different complaince times of female and male horses
 wilcox.test(diff ~ sex,
        data = eqsub,
        conf.int=TRUE)


    Wilcoxon rank sum test with continuity correction

data:  diff by sex
W = 35896, p-value = 0.002277
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 0.0199440 0.1299789
sample estimates:
difference in location 
            0.06996072

 aggregate(diff ~ sex,
           data = eqsub,
           median)

     sex   diff
1 Female -5.045
2   Male -5.190

The results of the Mann-Whitney test show we should reject the null hypothesis that the changes in compliance time are equal between the sexes after calming spray (W = 35896, P = 0.002). We can expect a median increase in compliance of about 0.07 seconds for males in comparison with females following calming spray. We are also 95% certain that if we were to collect additional data from the same equine population that the median difference between females and males would be between 0.02 and 0.13 seconds.

Paired t-test of normal data

# Comparing 2 groups of normal paired data: paired t-test - the appropriate test to compare 2 groups of paired and normally distributed data with equal variances
t.test(eqsub$comp,
       eqsub$comp2,
       paired = TRUE,
       var.equal = TRUE)


    Paired t-test

data:  eqsub$comp and eqsub$comp2
t = 268.28, df = 497, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 5.144875 5.220788
sample estimates:
mean difference 
       5.182831

The results of the paired t-test show an effect of the calming spray on equine compliance time (t268.28, P<0.0000..2). The mean (95% CI) change in compliance time following calming spray was 5.183 seconds.

ANOVA

# Load necessary libraries
library(dplyr)

# Convert 'sex' to a factor for ANOVA
eqsub$sex <- as.factor(eqsub$sex)

# Perform one-way ANOVA for 'comp'
anova_comp <- aov(comp ~ sex, data = eqsub)
summary(anova_comp)

             Df Sum Sq Mean Sq F value  Pr(>F)   
sex           1   17.7  17.689   8.782 0.00319 **
Residuals   496  999.1   2.014                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Perform one-way ANOVA for 'comp2'
anova_comp2 <- aov(comp2 ~ sex, data = eqsub)
summary(anova_comp2)

             Df Sum Sq Mean Sq F value  Pr(>F)   
sex           1    7.9   7.915   7.317 0.00707 **
Residuals   496  536.5   1.082                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova_model <- aov(comp2 ~ sex + comp, data = eqsub)

# Summary of the ANOVA model
summary(anova_model)

             Df Sum Sq Mean Sq F value Pr(>F)    
sex           1    7.9     7.9   281.6 <2e-16 ***
comp          1  522.6   522.6 18596.2 <2e-16 ***
Residuals   495   13.9     0.0                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Check assumptions: Homogeneity of regression slopes
interaction_model <- aov(comp2 ~ sex * comp, data = eqsub)
summary(interaction_model)

             Df Sum Sq Mean Sq   F value Pr(>F)    
sex           1    7.9     7.9   281.086 <2e-16 ***
comp          1  522.6   522.6 18560.069 <2e-16 ***
sex:comp      1    0.0     0.0     0.038  0.846    
Residuals   494   13.9     0.0                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Plot the relationship between pre-spray and post-spray compliance time by sex
ggplot(eqsub, aes(x = comp, y = comp2, color = sex)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "ANCOVA: Post-Spray Compliance Time by Sex Controlling for Pre-Spray Compliance Time",
       x = "Pre-Spray Compliance Time (comp)",
       y = "Post-Spray Compliance Time (comp2)") +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

ANOVA Results: The p-value’s are less than 0.05, indicating that the mean value of the dependent variable differs significantly between the sex groups.

2 Way ANOVA Results: The p-value is less than the standard significance level (0.05), meaning there is a statistically significant difference in the combined dependent variables (comp and comp2) between the groups defined by sex.

Box-plot of compliance time differences

boxplot_data <- eqsub %>%
  pivot_longer(cols = starts_with("comp"), 
               names_to = "Condition", 
               values_to = "ComplianceTime")

# Create the boxplot
ggplot(boxplot_data, aes(x = Condition, y = ComplianceTime, fill = Condition)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16, outlier.size = 2) +  # Customize outliers
  scale_fill_viridis_d(option = "C", begin = 0.3, end = 0.8) +  # Colorblind-safe fill colors
  labs(
    title = "Effect of Calming Spray Before and After Application on Compliance Time",
    x = "Condition (Before/After Spray)",
    y = "Compliance Time (seconds)"
  ) +
  theme_minimal() +  # Clean and modern theme
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10),
    legend.position = "none")  # Remove legend if unnecessary

Analysis:

Median Compliance Time:

comp: The median compliance time is around 35 seconds.
comp2: The median compliance time is around 32 seconds.

Interquartile Range (IQR):

comp: The IQR spans from approximately 33 to 37 seconds.
comp2: The IQR spans from approximately 31 to 33 seconds.

Outliers:

comp: There are a few outliers above 40 seconds and below 30 seconds.
comp2: There is one outlier below 30 seconds.

Interpretation: Reduction in Median Compliance Time: The median compliance time decreases from about 35 seconds to 32 seconds after the application of the calming spray, suggesting that the spray effectively reduces compliance time.

Narrower IQR: The IQR is narrower after the application of the calming spray (comp2), indicating less variability in compliance times, which could suggest more consistent behavior post-application.

Outliers: There are fewer outliers in the post-spray condition, which could indicate that the calming spray helps to standardize the behaviour more effectively.

Conclusion: The boxplot suggests that the application of the calming spray results in a reduced and more consistent compliance time, indicating the spray’s potential effectiveness in improving compliance behavior.

3. Predicting Compliance Time

Hypothesis: Compliance time is influenced by a combination of factors such as IRT, cortisol, air temperature, weight, and heart rate.

H0: Compliance time can be predicted by correlation to multiple factors
H1: Compliance time cannot be predicted by correlation to multiple factors

# Check correlations to understand relationships between predictors and compliance time
cor(equine[, c("comp", "irt", "cortisol", "BPM", "weight", "air")], use = "complete.obs")

                 comp          irt     cortisol          BPM       weight
comp      1.000000000 -0.037368697 -0.009849812  0.714842827  0.009205350
irt      -0.037368697  1.000000000  0.128601080 -0.008652045 -0.056144259
cortisol -0.009849812  0.128601080  1.000000000  0.006528990  0.020371404
BPM       0.714842827 -0.008652045  0.006528990  1.000000000 -0.003608681
weight    0.009205350 -0.056144259  0.020371404 -0.003608681  1.000000000
air      -0.051747939 -0.057830148  0.023078942 -0.039173303  0.083481211
                 air
comp     -0.05174794
irt      -0.05783015
cortisol  0.02307894
BPM      -0.03917330
weight    0.08348121
air       1.00000000

# Fit a multiple linear regression model
model <- lm(comp ~ irt + cortisol + BPM + weight + air, data = equine)

# Summary of the model
summary(model)


Call:
lm(formula = comp ~ irt + cortisol + BPM + weight + air, data = equine)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.88796 -0.65339  0.00537  0.58312  2.79258 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.910e+01  3.641e+00 -10.739   <2e-16 ***
irt         -3.078e-02  3.191e-02  -0.964    0.335    
cortisol    -8.361e-04  2.602e-03  -0.321    0.748    
BPM          5.048e-01  2.228e-02  22.661   <2e-16 ***
weight       1.325e-03  3.361e-03   0.394    0.694    
air         -2.256e-02  2.708e-02  -0.833    0.405    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.003 on 492 degrees of freedom
Multiple R-squared:  0.5129,    Adjusted R-squared:  0.5079 
F-statistic: 103.6 on 5 and 492 DF,  p-value: < 2.2e-16

# Visualize residuals
par(mfrow = c(2, 2))  # Arrange diagnostic plots in a grid
plot(model)

Key Interpretation Points Coefficients and Significance:

The intercept (-39.10) is statistically significant (p-value < 2e-16), but this is expected as the model includes multiple predictors.

BPM (Heart rate) has a strong and significant relationship with compliance time (p-value < 2e-16), with a positive coefficient (0.5048), suggesting that higher heart rates are associated with increased compliance time.

IRT (Thermographic eye temperature), cortisol, weight, and air temperature are not statistically significant, with p-values well above 0.05 (e.g., irt = 0.335, cortisol = 0.748, weight = 0.694, and air = 0.405). This suggests that these variables do not have a strong influence on the compliance time, at least in the linear context of this model. Model Fit:

R-squared: 0.513. This means that about 51.3% of the variance in compliance time (comp) is explained by the model. While this is not very high, it indicates that the model does capture some of the variability.

Adjusted R-squared: 0.507. This value adjusts for the number of predictors and gives a slightly more conservative estimate of model performance. Since it’s close to the R-squared value, it shows that the predictors are not adding too much noise.

F-statistic: The F-statistic of 103.6 and the p-value < 2.2e-16 suggest that the model overall is statistically significant, meaning at least one predictor contributes to explaining the compliance time.

Residuals: The residuals show a good spread around zero, with a range from -2.89 to 2.79. The median of the residuals is close to zero, which suggests no systematic bias.

Residual standard error: The standard error of the residuals is 1.003, which can be interpreted as the average difference between the observed and predicted compliance time.

What This Means for Predicting Compliance Time

Heart rate (BPM) is the most important predictor, with a strong and statistically significant effect on compliance time. Other variables (like irt, cortisol, weight, and air) do not contribute significantly to the prediction of compliance time in this linear model. The R-squared value of 0.513 indicates a moderate fit, meaning that the model explains about half of the variability in compliance time. This suggests that while there are some predictive capabilities, there are likely other factors not captured by the current set of predictors.

How to Improve the Model or Visualize Predictions:

Further Exploration of Non-significant Variables: Since some predictors like irt, cortisol, and air were not significant, you might want to check if there are interactions between variables or if some predictors need to be transformed (e.g., log transformation or polynomial terms) to better model their effects. Advanced Models:

Generalized Additive Models (GAMs): A GAM might help if the relationship between predictors and compliance time is nonlinear.

Better Visualization: You can use predicted vs actual plots to assess how well your model is predicting compliance time.

# Get predicted values from the model
equine$predicted <- predict(model)

# Plot actual vs predicted compliance time
ggplot(equine, aes(x = predicted, y = comp)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Actual vs Predicted Compliance Time",
       x = "Predicted Compliance Time (seconds)",
       y = "Actual Compliance Time (seconds)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Data Exploration

str(equine)

'data.frame':   498 obs. of  9 variables:
 $ ID       : chr  "Eq001" "Eq002" "Eq003" "Eq004" ...
 $ sex      : chr  "Female" "Female" "Female" "Male" ...
 $ comp     : num  37.2 34.5 36.3 35.3 37.4 ...
 $ weight   : num  74.7 73.4 71.8 104.6 67.1 ...
 $ irt      : num  37.7 35.7 34.8 36.2 33.6 ...
 $ air      : num  23.4 21.4 20.1 21.6 21.8 ...
 $ cortisol : num  64.1 73.7 54.4 86.3 108 ...
 $ BPM      : num  153 150 149 150 149 ...
 $ predicted: num  36.6 34.9 34.9 35.2 34.6 ...

colSums(is.na(equine))

       ID       sex      comp    weight       irt       air  cortisol       BPM 
        0         0         0         0         0         0         0         0 
predicted 
        0

No missing data are associated with any of the variables in the dataframe.

Normality & Homogeneity of dependent variables

A gamma GLM does not assume is that the response variable is normally distributed. We can visualise the distribution of the response variable by dividing the x-axis into ‘bins’ and counting the number of observations in each bin as a frequency polygon using the geom_freqpoly() function from the ggplot2 package:

p <- ggplot() 
p <- p + ylab("Frequency") 
p <- p + xlab("Compliance Time (seconds")
p <- p + theme(panel.background = element_blank()) 
p <- p + theme(panel.border = element_rect(fill = NA,                
                                           colour = "black", size = 1))

Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.

p <- p + theme(strip.background = element_rect(fill = "white", 
                                               color = "white", 
                                               size = 1)) 
p <- p + theme(text = element_text(size=15)) 
p <- p + theme(legend.position='none') 
p <- p + geom_freqpoly(data = equine, aes(comp),            
                       bins = 10) 
p

Nomrality

shapiro.test(equine$comp)


    Shapiro-Wilk normality test

data:  equine$comp
W = 0.99692, p-value = 0.4695

The test shows no significant departure from normality (P > 0.05).

VIF

#install.packages("car")
library(car)

Loading required package: carData


Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

vif(lm(comp~ sex + weight + irt + air + cortisol + BPM, data = equine))

     sex   weight      irt      air cortisol      BPM 
1.006023 1.010740 1.024534 1.013868 1.018760 1.004401

MANOVA

# Perform Multiple ANOVA to evaluate predictors of 'comp'
anova_model_multi <- aov(comp ~ BPM + sex + irt + weight + cortisol + air, data = equine)

# Summary of the Multiple ANOVA model
summary(anova_model_multi)

             Df Sum Sq Mean Sq F value  Pr(>F)    
BPM           1  519.6   519.6 523.565 < 2e-16 ***
sex           1    7.9     7.9   8.010 0.00484 ** 
irt           1    1.2     1.2   1.184 0.27702    
weight        1    0.1     0.1   0.148 0.70035    
cortisol      1    0.1     0.1   0.151 0.69787    
air           1    0.5     0.5   0.529 0.46717    
Residuals   491  487.3     1.0                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

### This shows that BPM and sex is the only statistically significant data when trying to predict compliance time

Pearson’s Correlation (r)

pearson_correlation <- cor.test(equine$comp, equine$BPM, method = "pearson")
print(pearson_correlation)


    Pearson's product-moment correlation

data:  equine$comp and equine$BPM
t = 22.767, df = 496, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6689964 0.7552703
sample estimates:
      cor 
0.7148428

## There is a statistically significant and strong positive correlation (r=0.715) between comp and BPM.

Simple Generelized Linear Model

# Perform Linear Regression: Predict 'comp' using 'BPM'
linear_model <- lm(comp ~ BPM, data = equine)

# Summary of the linear regression model
summary(linear_model)


Call:
lm(formula = comp ~ BPM, data = equine)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81905 -0.65308 -0.01023  0.59085  2.83181 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40.79081    3.33482  -12.23   <2e-16 ***
BPM           0.50564    0.02221   22.77   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.001 on 496 degrees of freedom
Multiple R-squared:  0.511, Adjusted R-squared:   0.51 
F-statistic: 518.3 on 1 and 496 DF,  p-value: < 2.2e-16

# Plot regression line between 'comp' and 'BPM'
ggplot(equine, aes(x = BPM, y = comp)) +
  geom_point(color = "blue", alpha = 0.4) +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(x = "Heart Rate (BPM)",
       y = "Compliance Time (seconds)") +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

correlation coefficient r = 0.7148428 This is the sample correlation, indicating a strong positive linear relationship between comp and BPM. A value close to 1 suggests a strong positive correlation.

PCA - Principle Component Analysis

#install.packages("devtools")
library(devtools)

Loading required package: usethis

library(ggplot2)
library(grid)

str(equine)

'data.frame':   498 obs. of  9 variables:
 $ ID       : chr  "Eq001" "Eq002" "Eq003" "Eq004" ...
 $ sex      : chr  "Female" "Female" "Female" "Male" ...
 $ comp     : num  37.2 34.5 36.3 35.3 37.4 ...
 $ weight   : num  74.7 73.4 71.8 104.6 67.1 ...
 $ irt      : num  37.7 35.7 34.8 36.2 33.6 ...
 $ air      : num  23.4 21.4 20.1 21.6 21.8 ...
 $ cortisol : num  64.1 73.7 54.4 86.3 108 ...
 $ BPM      : num  153 150 149 150 149 ...
 $ predicted: num  36.6 34.9 34.9 35.2 34.6 ...

#Run PCA

#Begin by deriving PCs
equine.pca <- prcomp(equine[,c(3:8)], center = TRUE, scale. = TRUE)


#Examine PCA object
summary(equine.pca)

Importance of components:
                         PC1    PC2    PC3    PC4    PC5     PC6
Standard deviation     1.312 1.0745 1.0402 0.9561 0.9184 0.53313
Proportion of Variance 0.287 0.1924 0.1803 0.1524 0.1406 0.04737
Cumulative Proportion  0.287 0.4794 0.6597 0.8121 0.9526 1.00000

6 principle components labeled PC1 - 6

These results show that PC1 explains 29% of the total variance in the dataset of 6 variables. PC2 explain 19% of variance, which means the combination of PC1 and PC2 explain ~50% of variance in the dataset. The third Principal Component (PC3) explains an additional 18% of variance etc. This information can be visualised in a scree plot:

Scree Plot & Kaisers’ Criterion

#Scree plot
# Fig. 13.2
screeplot(equine.pca, type="lines")

#Apply Kaisers' criterion

## An  additional  approach  is  to  use  Kaiser’s  criterion.  Here  we  retain  Principal Components  for  which  the  variance  is  above  1  when  PCA  is  applied  to standardised data.

options(digits=1)
(equine.pca$sdev)^2

[1] 1.7 1.2 1.1 0.9 0.8 0.3

The scree plot further illustrates that PC1 and PC2 (& PC3) capture the greatest proportion of variance in the data.

The first rule-of-thumb is to aim for a specific threshold of explained variance. A figure of 80% is often used, which in this case means PC1, PC2, and PC3 are needed to achieve the threshold.

Based on Kaiser’s criterion only PC1 (1.7), PC2 (1.2), and PC3 (1.1) should be retained for analysis. Thus, based on an 80% threshold, scree plot and Kaisers’ criterion just PC1, PC2, and PC3 are sufficient to obtain an accurate picture of the equine data.

#Loading coefficients for PC1
options(digits=2)
equine.pca$rotation[,1]

    comp   weight      irt      air cortisol      BPM 
 -0.7047   0.0017   0.0402   0.0861   0.0132  -0.7030

#Loading coefficients for PC2
options(digits=1)
equine.pca$rotation[,2]

    comp   weight      irt      air cortisol      BPM 
   -0.02    -0.41     0.68    -0.40     0.45     0.01

#Loading coefficients for PC3
options(digits=1)
equine.pca$rotation[,3]

    comp   weight      irt      air cortisol      BPM 
    0.03     0.54     0.22     0.52     0.62     0.05

Gaussian Generalized Linear Model

# Performing a gaussian (normally distrib.) generalized linear model on key components
glm_model <- glm(comp ~ sex + irt + BPM, 
                 data = equine, 
                 family = "gaussian")

summary(glm_model)


Call:
glm(formula = comp ~ sex + irt + BPM, family = "gaussian", data = equine)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -39.1723     3.5052  -11.18   <2e-16 ***
sexMale       0.2561     0.0893    2.87   0.0043 ** 
irt          -0.0341     0.0313   -1.09   0.2760    
BPM           0.5021     0.0221   22.74   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 1)

    Null deviance: 1016.77  on 497  degrees of freedom
Residual deviance:  488.07  on 494  degrees of freedom
AIC: 1413

Number of Fisher Scoring iterations: 2

Sex has a small but significant effect on the response variable. BPM is the strongest predictor in this model, as indicated by the large and highly significant coefficient. IRT does not significantly contribute to explaining the response.

Sex (0.2561): Being male is associated with an increase of 0.2561 units in the response variable compared to females, holding other variables constant. This effect is statistically significant (p=0.0043<0.01).

irt (-0.0341): A unit increase in irt is associated with a decrease of 0.0341 units in the response variable, but this effect is not statistically significant (p=0.276).

BPM (0.5021): A unit increase in BPM is associated with a significant increase of 0.5021 units in the response variable (p<0.001).

Visualizing the Gaussian Relationship

# Load ggplot2 for visualization
library(ggplot2)

# Create a new data frame for predictions
new_equine <- data.frame(
  sex = c("Male", "Female"),
  irt = seq(min(equine$irt), max(equine$irt), length.out = 100),
  BPM = seq(min(equine$BPM), max(equine$BPM), length.out = 100))

# Predict values based on the glm model
new_equine$predicted_comp <- predict(glm_model, newdata = new_equine)

# Plot the regression line
ggplot(equine, aes(x = BPM, y = comp, color = sex)) +
  geom_point(alpha = 0.5) +  # Scatter plot of the actual data
  geom_line(data = new_equine, aes(x = BPM, y = predicted_comp, color = sex), size = 1, se = TRUE) +
  labs(
    title = "Regression Line for Compliance Time",
    x = "Heart Rate (BPM)",
    y = "Compliance Time (seconds)",
    color = "Sex"
  ) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")

Warning in geom_line(data = new_equine, aes(x = BPM, y = predicted_comp, :
Ignoring unknown parameters: `se`