eqsub <- read.table('C:/eqsub.txt', header = TRUE, sep = "\t")
equine <- read.table('C:/Equine.txt', header = TRUE, sep = "\t")
Equine Summative Final
Equine Summative - N0990061
Welcome to N0990061’s Summative for Research Methods and Analysis 2025.
Hopefully, by reading my annotations and code, you will be able to understand my results and thinking regarding the equine and eqsub data sets. Please enjoy.
Step 1: Tell your computer where to find the data sets required for analysis. This is done by creating a name for each file, i.e. 'eqsub' and 'equine'. read.table tells R to read the file at the stated location, and <- assigns the imported data to that name.
The generic code is: read.table('Drive:/File name', header = TRUE, sep = "\t").
It is also handy to install all required packages at the beginning; this can be done with install.packages(). N.B. some set-ups require a mirror server to download the packages, which is what repos = "https://cran.r-project.org" tells R to use.
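For example, a single package could be installed like this (a minimal sketch; the output below comes from installing ggplot2, dplyr, magrittr, car, multcomp and tidyverse with the same pattern):
# install one package from the stated CRAN mirror (repeat for each required package)
install.packages("ggplot2", repos = "https://cran.r-project.org")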
package 'ggplot2' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Tom\AppData\Local\Temp\Rtmp6DKuBg\downloaded_packages
package 'dplyr' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Tom\AppData\Local\Temp\Rtmp6DKuBg\downloaded_packages
package 'magrittr' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Tom\AppData\Local\Temp\Rtmp6DKuBg\downloaded_packages
package 'car' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Tom\AppData\Local\Temp\Rtmp6DKuBg\downloaded_packages
package 'multcomp' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Tom\AppData\Local\Temp\Rtmp6DKuBg\downloaded_packages
package 'tidyverse' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Tom\AppData\Local\Temp\Rtmp6DKuBg\downloaded_packages
Step 2: Now it is time to start testing the data. Before doing this it is crucial to establish the hypotheses we wish to investigate, as these indicate which statistical tests to use.
For the following data the hypotheses include:
1 - There will be a strong positive correlation between IRT (infrared thermography eye temperature) and cortisol levels.
2 - The application of the calming spray will reduce compliance times.
3 - Compliance time can be predicted by assessing the stress levels in horses.
Step 3: Hypothesis 1
The first step of data analysis requires a determination of whether the variables are categorical or numerical. This information allows us to determine whether we need to test normality.
As a rule of thumb we do not need to test the normality of categorical data as it is never normal.
For numerical data we have to follow a decision flow to determine the requirements for normality testing. Primarily, if the research question is looking for a correlation you need to test each variable individually for normality.
If the research question is investigating a difference you must determine whether your groups are paired/dependent or independent.
If they are independent you must test the response variable for normality by group.
For two paired/dependent groups you may test the residuals (the paired differences) for normality; if there are more than two groups you must test each variable individually for normality. A short illustration of the paired branch is sketched below.
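As an illustration of the paired branch only (a sketch; the full tests for each hypothesis are run in the steps that follow), the paired differences between the two compliance measurements could be tested in the same style as the other Kolmogorov-Smirnov tests in this document:
# sketch: normality of the paired differences between the two compliance columns
comp_diff <- eqsub$comp - eqsub$comp2
ks.test(comp_diff, "pnorm", mean = mean(comp_diff), sd = sd(comp_diff))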
As hypothesis 1 is searching for a correlation, we must test the IRT and cortisol variables separately.
The most accurate way of doing this is via the Kolmogorov-Smirnov test (as the sample is greater than n = 50), which is simply coded via ks.test.
#call required packages from library
library(ggplot2)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
# create a dataframe
data <- data.frame(
irt = (equine$irt),
cortisol = (equine$cortisol)
)
# Code for making Histograms and Q-Q Plots
par(mfrow = c(2, 2)) # Sets the following charts on the same 2x2 grid for easy side-by-side comparison
hist(equine$irt, main = "Histogram of irt", xlab = "irt", col = "lightblue") #creates a histogram of irt values
qqnorm(equine$irt); qqline(equine$irt, col = "blue") #creates a qqplot of irt normality
hist(equine$cortisol, main = "Histogram of Cortisol", xlab = "Cortisol", col = "lightgreen") #creates a histogram of cortisol values
qqnorm(equine$cortisol); qqline(equine$cortisol, col = "green") #creates a qqplot of cortisol values
# Kolmogorov-Smirnov Test
ks_irt <- ks.test(equine$irt, "pnorm", mean = mean(equine$irt), sd = sd(equine$irt)) #Kolmogorov-Smirnov test of irt from the equine dataframe; pnorm is the cumulative distribution function, mean calculates the mean, sd the standard deviation
ks_cortisol <- ks.test(equine$cortisol, "pnorm", mean = mean(equine$cortisol), sd = sd(equine$cortisol))
print(ks_irt)
Asymptotic one-sample Kolmogorov-Smirnov test
data: equine$irt
D = 0.062317, p-value = 0.0418
alternative hypothesis: two-sided
print(ks_cortisol)
Asymptotic one-sample Kolmogorov-Smirnov test
data: equine$cortisol
D = 0.061555, p-value = 0.04593
alternative hypothesis: two-sided
Interpreting the Kolmogorov-Smirnov test: if the P value is below the level of significance (P < 0.05) then you reject the hypothesis that the data come from a normal distribution. If the P value is greater than the accepted significance level (P > 0.05) then it is likely the data are normally distributed.
Note that both P values are significant, so the data are not normally distributed. Therefore we must use the non-parametric test for correlation, which is Spearman's rank correlation.
#This step is here to help add context to the further analysis and help formulate the results section. The code below calculates the means and standard deviations of the measured variables.
# Calculate mean and standard deviation for IRT
mean_irt <- mean(equine$irt, na.rm = TRUE)
sd_irt <- sd(equine$irt, na.rm = TRUE)
# Calculate mean and standard deviation for BPM
mean_bpm <- mean(equine$BPM, na.rm = TRUE)
sd_bpm <- sd(equine$BPM, na.rm = TRUE)
# Calculate mean and standard deviation for Cortisol
mean_cortisol <- mean(equine$cortisol, na.rm = TRUE)
sd_cortisol <- sd(equine$cortisol, na.rm = TRUE)
# Calculate mean and standard deviation for Comp
mean_comp <- mean(equine$comp, na.rm = TRUE)
sd_comp <- sd(equine$comp, na.rm = TRUE)
# Calculate mean and standard deviation for Comp2
mean_comp2 <- mean(eqsub$comp2, na.rm = TRUE)
sd_comp2 <- sd(eqsub$comp2, na.rm = TRUE)
# Print the results
cat("Mean IRT:", mean_irt, "\nStandard Deviation IRT:", sd_irt, "\n")Mean IRT: 35.53856
Standard Deviation IRT: 1.42704
cat("Mean BPM:", mean_bpm, "\nStandard Deviation BPM:", sd_bpm, "\n")Mean BPM: 150.1365
Standard Deviation BPM: 2.022079
cat("Mean Cortisol:", mean_cortisol, "\nStandard Deviation Cortisol:", sd_cortisol, "\n")Mean Cortisol: 81.99811
Standard Deviation Cortisol: 17.45772
cat("Mean comp:", mean_comp, "\nStandard Deviation comp:", sd_comp, "\n")Mean comp: 35.12474
Standard Deviation comp: 1.430316
cat("Mean comp2:", mean_comp2, "\nStandard Deviation comp2:", sd_comp2, "\n")Mean comp2: 29.94207
Standard Deviation comp2: 1.046642
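As an aside, the same descriptive statistics could be produced more compactly with dplyr (a sketch only, using the column names from the code above; comp2 lives in eqsub so it would be summarised separately):
# sketch: mean and SD of the equine variables in one summarise call
library(dplyr)
equine %>%
  summarise(across(c(irt, BPM, cortisol, comp),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd   = ~ sd(.x, na.rm = TRUE))))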
# Code for Spearman's rank correlation test
cor.test(x=equine$irt, y=equine$cortisol, method = 'spearman') #correlation between 'x' = irt and 'y' = cortisol (both from the equine dataset), tested with Spearman's rank correlation.
Spearman's rank correlation rho
data: equine$irt and equine$cortisol
S = 17878766, p-value = 0.00332
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.1314346
To interpret Spearman's rank correlation, the rho value indicates whether there is a correlation and whether it is positive or negative: 0 represents no correlation, +1 a perfect positive correlation and -1 a perfect negative correlation. The P value of 0.00332 is below the level of significance, therefore we reject the null hypothesis of no correlation. The rho of 0.1314346 demonstrates that there is a weak positive correlation between equine eye temperature and cortisol levels.
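For writing up, the statistics can also be pulled straight out of the test object rather than read off the console (a small sketch using the same call as above):
# sketch: store the test result and extract rho and the p-value for reporting
spearman_res <- cor.test(equine$irt, equine$cortisol, method = "spearman")
spearman_res$estimate  # rho
spearman_res$p.value   # p-value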
Step 4: Visualising findings
Now we have determined that there is a small positive relationship between eye temperature and cortisol levels, it is important to demonstrate this visually in a graph. To create graphs we use the 'ggplot2' package installed earlier, which allows us to generate graphics.
# Load necessary libraries (ggplot2)
library(ggplot2)
correlation <- data.frame(
irt = c(equine$irt), # irt = the irt column of equine
cortisol = c(equine$cortisol) # cortisol = the cortisol column of equine
)
# Create a scatter plot and set the colours and axis names
ggplot(equine, aes(x = irt, y = cortisol)) +
geom_point(color = "red") + # Points color
labs(title = "Equine IRT vs. Cortisol Levels",
x = "IRT (°C)",
y = "Cortisol (mcg/dl)") +
theme_minimal()
As you can see, the points appear sporadic, which reflects how weak the relationship is; therefore it would be inappropriate to add a line of best fit.
Step 5: Hypothesis 2
Now we follow the same process with the second hypothesis: the application of the calming spray will reduce compliance times. As the compliance times are the dependent variable, with the independent variable being the addition of the spray, and the measurements come from the same horses, we need to test for normality before choosing the test.
#Same as testing the normality for hypothesis 1
# call the required packages with library
library(ggplot2)
library(dplyr)
# Sample data
data4 <- data.frame(
comp = (eqsub$comp),
comp2 = (eqsub$comp2)
)
# Histograms and Q-Q Plots as before
par(mfrow = c(2, 2))
hist(eqsub$comp, main = "Histogram of Comp", xlab = "Comp", col = "lightblue")
qqnorm(eqsub$comp); qqline(eqsub$comp, col = "blue")
hist(eqsub$comp2, main = "Histogram of Comp2", xlab = "Comp2", col = "lightgreen")
qqnorm(eqsub$comp2); qqline(eqsub$comp2, col = "green")
# Kolmogorov-Smirnov Test as before
ks_comp <- ks.test(eqsub$comp, "pnorm", mean = mean(eqsub$comp), sd = sd(eqsub$comp))
Warning in ks.test.default(eqsub$comp, "pnorm", mean = mean(eqsub$comp), : ties should not be present for the one-sample Kolmogorov-Smirnov test
ks_comp2 <- ks.test(eqsub$comp2, "pnorm", mean = mean(eqsub$comp2), sd = sd(eqsub$comp2))
Warning in ks.test.default(eqsub$comp2, "pnorm", mean = mean(eqsub$comp2), : ties should not be present for the one-sample Kolmogorov-Smirnov test
print(ks_comp)
Asymptotic one-sample Kolmogorov-Smirnov test
data: eqsub$comp
D = 0.0325, p-value = 0.6689
alternative hypothesis: two-sided
print(ks_comp2)
Asymptotic one-sample Kolmogorov-Smirnov test
data: eqsub$comp2
D = 0.029237, p-value = 0.7882
alternative hypothesis: two-sided
With both P values for comp and comp2 being above 0.05 and the D values being close to zero, we can assume a normal distribution, which is reinforced by the bell curves in the histograms and the points lying close to the reference lines in the Q-Q plots. As the samples were taken from the same horses, the data come in matched pairs of normal, continuous data - this meets the criteria for a paired t-test.
# Perform the t-test (note: without paired = TRUE this call runs a Welch two-sample t-test, as shown in the output below)
t_test_result <- t.test(eqsub$comp, eqsub$comp2)
# View the results
print(t_test_result)
Welch Two Sample t-test
data: eqsub$comp and eqsub$comp2
t = 65.257, df = 910.64, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
5.026960 5.338703
sample estimates:
mean of x mean of y
35.12490 29.94207
The t-test's P value being < 0.05 demonstrates a strong difference between comp and comp2. The 95 % confidence interval of 5.026960 to 5.338703 indicates that the true difference in means is likely to lie between those two values; the midpoint of the interval (about 5.18) matches the difference between the two sample means (35.12 - 29.94).
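Because the measurements are described above as matched pairs, a strictly paired version of the same comparison could also be run by adding paired = TRUE (a sketch only; the Welch output above comes from the unpaired call):
# sketch: paired t-test on the same two compliance columns
t.test(eqsub$comp, eqsub$comp2, paired = TRUE)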
Step 6: Visualising Findings
# Calculate means as before
comp_mean <- mean(eqsub$comp)
comp2_mean <- mean(eqsub$comp2)
# Calculate standard errors: the standard deviation of each eqsub column divided by the square root of the sample size (sqrt = square root)
comp_se <- sd(eqsub$comp) / sqrt(length(eqsub$comp))
comp2_se <- sd(eqsub$comp2) / sqrt(length(eqsub$comp2))
# Create a boxplot, make it colourful with green and blue
boxplot(eqsub$comp, eqsub$comp2,
names = c("comp", "comp2"),
main = "Comparison of comp and comp2",
ylab = "Compliance Time (s)",
col = c("blue", "green"))Step 7: Hypothesis 3
Compliance time can be predicted by assessing the stress levels in horses.
First we need to assess the normality of all independent variables. As we are only looking at stress indicators, we only need to consider IRT, cortisol and BPM.
ks_irt <- ks.test(equine$irt, "pnorm", mean = mean(equine$irt), sd = sd(equine$irt)) #Kolmogorov-Smirnov test of irt from the equine dataframe; pnorm is the cumulative distribution function, mean calculates the mean, sd the standard deviation
ks_cortisol <- ks.test(equine$cortisol, "pnorm", mean = mean(equine$cortisol), sd = sd(equine$cortisol))
ks_BPM <- ks.test(equine$BPM, "pnorm", mean = mean(equine$BPM), sd = sd(equine$BPM))
print(ks_irt)
Asymptotic one-sample Kolmogorov-Smirnov test
data: equine$irt
D = 0.062317, p-value = 0.0418
alternative hypothesis: two-sided
print(ks_cortisol)
Asymptotic one-sample Kolmogorov-Smirnov test
data: equine$cortisol
D = 0.061555, p-value = 0.04593
alternative hypothesis: two-sided
print(ks_BPM)
Asymptotic one-sample Kolmogorov-Smirnov test
data: equine$BPM
D = 0.031659, p-value = 0.7004
alternative hypothesis: two-sided
As this hypothesis evaluates compliance (comp and comp2) against multiple factors (irt, cortisol and BPM), we do not need to assess normality further: with such a large sample size an ANOVA is considered robust enough to counteract deviations from normality.
# Load necessary libraries
library(dplyr)
library(car) # For ANOVA with Type II sum of squares
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
# Read the data files
equine <- read.table('C:/Equine.txt', header = TRUE, sep = "\t")
eqsub <- read.table('C:/Eqsub.txt', header = TRUE, sep = "\t")
# Ensure data columns have equal lengths
number_of_samples <- min(nrow(equine), nrow(eqsub))
# Trim both dataframes to the same number of rows (498 in this case)
equine <- equine[1:number_of_samples, ]
eqsub <- eqsub[1:number_of_samples, ]
# Create a combined dataframe (this is because we are using data from both equine and eqsub)
combined_data <- data.frame(
irt = equine$irt,
cortisol = equine$cortisol,
BPM = equine$BPM,
comp = eqsub$comp,
comp2 = eqsub$comp2
)
# code for the ANOVA tests
anova_result <- aov(lm(comp ~ irt + cortisol + BPM, data = combined_data))
# Display results
summary(anova_result)
             Df Sum Sq Mean Sq F value Pr(>F)    
irt           1    1.4     1.4   1.412  0.235    
cortisol      1    0.0     0.0   0.025  0.876    
BPM           1  519.1   519.1 516.842 <2e-16 ***
Residuals   494  496.2     1.0                   
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Printing the Anova results
print(anova_result)
Call:
aov(formula = lm(comp ~ irt + cortisol + BPM, data = combined_data))
Terms:
irt cortisol BPM Residuals
Sum of Squares 1.4183 0.0247 519.1460 496.2021
Deg. of Freedom 1 1 1 494
Residual standard error: 1.002226
Estimated effects may be unbalanced
The only result which was significant was heart rate, at P < 2e-16; this is highly statistically significant. Being the only significant indicator means we can use it to predict compliance time through the following linear model.
We begin by making a new data frame containing our three variables: BPM, comp and comp2.
# Load necessary library
library(dplyr)
# Assuming equine and eqsub dataframes are already loaded
# Create a new dataframe by selecting and combining the desired columns
comparison_df <- data.frame(
BPM = equine$BPM,
comp = eqsub$comp,
comp2 = eqsub$comp2
)
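As an aside, if the aim were to genuinely pool both compliance observations into a single predictor column, a long-format frame could be built as sketched below (an illustration only; the model actually fitted in this document uses comp on its own):
# sketch: stack comp and comp2 into one column, repeating BPM for each measurement
pooled_df <- data.frame(
  BPM  = rep(equine$BPM, 2),
  comp = c(eqsub$comp, eqsub$comp2)
)
lm_pooled <- lm(BPM ~ comp, data = pooled_df)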
# The comparison_df keeps both compliance observations available alongside BPM, although the model below is fitted on comp only.
# Load necessary libraries
library(ggplot2)
# Create the linear model: BPM as a function of compliance, using the comparison data frame made above
lm1 <- lm(BPM ~ comp, data = comparison_df)
# Create the plot
ggplot(comparison_df, aes(x = comp, y = BPM)) +
geom_point(aes(color = "comp")) +
geom_smooth(method = "lm", se = FALSE, aes(color = "comp"), formula = y ~ x) +
labs(title = "Linear Model of Heart rate vs. compliance",
x = "compliance (s)",
y = "Heart rate (BPM)") +
scale_color_manual(values = c("comp" = "blue")) +
theme_minimal() # make it BLUEEEE
# Load necessary libraries
library(ggplot2)
# Create the linear models
lm1 <- lm(BPM ~ comp, data = equine)
# Extend the range for predictions
extended_comp <- seq(min(equine$comp), max(equine$comp), length.out = 200)
# Create a new dataframe for prediction
new_data1 <- data.frame(comp = extended_comp)
# Get the predictions and confidence intervals
pred1 <- predict(lm1, newdata = new_data1, interval = "confidence")
# Combine predictions with new_data
new_data1$BPM <- pred1[, "fit"]
new_data1$lower <- pred1[, "lwr"]
new_data1$upper <- pred1[, "upr"]
# Create the plot; most of the following code is just different ways of making the plot more visually appealing and easier to differentiate, as the journal guidance requires it to be legible in black and white.
ggplot(equine, aes(x = comp, y = BPM)) +
geom_point(aes(color = "comp")) +
geom_line(data = new_data1, aes(x = comp, y = BPM, color = "comp")) +
geom_ribbon(data = new_data1, aes(x = comp, ymin = lower, ymax = upper, fill = "comp"), alpha = 0.2) +
labs(title = "Extended Linear Model with Predictions and Confidence Intervals",
x = "Compliance Time (s)",
y = "Heart rate (BPM)") +
scale_color_manual(values = c("comp" = "blue")) +
scale_fill_manual(values = c("comp" = "blue")) +
theme_minimal()
# Plot diagnostic plots for the linear model
par(mfrow = c(2, 2)) # Arrange plots in a 2x2 grid
plot(lm1)
# Perform the Breusch-Pagan test for heteroscedasticity
library(lmtest)
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
bptest(lm1)
studentized Breusch-Pagan test
data: lm1
BP = 0.041629, df = 1, p-value = 0.8383
# Perform the Shapiro-Wilk test for normality of residuals
shapiro.test(residuals(lm1))
Shapiro-Wilk normality test
data: residuals(lm1)
W = 0.99886, p-value = 0.9895
Both diagnostic tests are non-significant (Breusch-Pagan P = 0.8383, Shapiro-Wilk P = 0.9895), indicating no evidence of heteroscedasticity or non-normal residuals, so the assumptions of the linear model are reasonable.
# Determine the intercept and slope of the linear model which I want to test
# Fit the linear model
lm1 <- lm(BPM ~ comp, data = equine)
# Extract the coefficients
coefficients <- coef(lm1)
# Print the coefficients
print(coefficients)
(Intercept)        comp 
 114.639643    1.010594 
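With the fitted intercept and slope available, predictions for new values can also be generated directly from the model with predict() (a sketch; the compliance value of 35 s is just an illustrative input, not taken from the data):
# sketch: predicted heart rate, with a prediction interval, for a hypothetical compliance time of 35 s
predict(lm1, newdata = data.frame(comp = 35), interval = "prediction")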
# Set the seed for reproducibility
set.seed(123)
# Generate random data for the predictor variable 'comp'
n <- 100 # Number of observations
comp <- rnorm(n, mean = 100, sd = 5) # Random normal data with a mean of 100 and a standard deviation of 5
# Define the 'true' relationship parameters used to simulate the data
intercept <- 10.929352
slope <- 1.816533
# Generate the response variable 'BPM' based on the linear model with some random noise/inputs
BPM <- intercept + slope * comp + rnorm(n, mean = 0, sd = 5) # add random noise around the linear relationship
# Create a data frame
equine2 <- data.frame(comp = comp, BPM = BPM)
# Fit the linear model (refitted on the original equine data)
lm1 <- lm(BPM ~ comp, data = equine)
# Display the summary of the linear model
summary(lm1)
Call:
lm(formula = BPM ~ comp, data = equine)
Residuals:
Min 1Q Median 3Q Max
-4.3556 -0.9344 -0.0108 0.9818 4.8016
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 114.63964 1.56046 73.47 <2e-16 ***
comp 1.01059 0.04439 22.77 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.415 on 496 degrees of freedom
Multiple R-squared: 0.511, Adjusted R-squared: 0.51
F-statistic: 518.3 on 1 and 496 DF, p-value: < 2.2e-16
# Plot the data and the fitted model
plot(equine2$comp, equine2$BPM, pch = 16, col = "blue", main = "Linear Model: BPM ~ comp", xlab = "Comp", ylab = "BPM")
abline(lm1, col = "red")#I feel that it is worth noting that I do realise that the graphs and the linear model have Compliance time predicting heart rate, but as I have not discovered a causality only a relationship the findings from these are interchangeable.# Assuming you have the equine dataframe
# Count the occurrences of each category in the 'sex' column
sex_count <- table(equine$sex)
# Access the count for 'male'
male_count <- sex_count['Male']
# Print the result
print(male_count)
Male 
 249 
# To meet the brief: the Animal Welfare journal uses tables to present the important identifier information, so this is my code for working out the mean weight of each group.
# Load the necessary libraries
library(dplyr)
# load equine dataframe back in as a precaution at this point
equine <-read.table('C:/Equine.txt', header = TRUE, sep = "\t")
# Calculate the required statistics
summary_table <- equine %>%
group_by(sex) %>%
summarize(
Count = n(),
Mean_Weight = mean(weight, na.rm = TRUE),
SD_Weight = sd(weight, na.rm = TRUE)
)
# Display the summary table
print(summary_table)
# A tibble: 2 × 4
  sex    Count Mean_Weight SD_Weight
  <chr>  <int>       <dbl>     <dbl>
1 Female   249        88.2      13.6
2 Male     249        87.6      13.3
Table 1: Sample identifiers
| Sex | Count | Mean Weight (kg) | Mean Age (years) | Mean Height (cm) | Breed |
|-----|-------|------------------|------------------|------------------|-------|
| Male | 249 | 87.60 ± 13.34 | 12.9 | 92.1 ± 3.2 | Miniature Horse |
| Female | 249 | 88.25 ± 13.60 | 12.3 | 91.8 ± 3.7 | Miniature Horse |
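The age and height columns in Table 1 could be generated with the same grouped summary, assuming the equine data frame holds columns named age and height (these column names are an assumption for illustration, not confirmed by the code above):
# sketch: extend the grouped summary (the age and height column names are assumed)
equine %>%
  group_by(sex) %>%
  summarize(
    Count       = n(),
    Mean_Weight = mean(weight, na.rm = TRUE),
    Mean_Age    = mean(age,    na.rm = TRUE),  # assumed column
    Mean_Height = mean(height, na.rm = TRUE)   # assumed column
  )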