setwd("/Users/matthewwright/Library/CloudStorage/OneDrive-TeessideUniversity/Work/Teaching/Human Movement/2023/R")Ciaran_dis
Linear mixed model for Ciaran
Your aim is to see whether the longer players spend in the World Cup environment, the more their physical outputs (e.g., total distance) increase.
Set up
Make sure your data is saved in a folder and set your working directory. You'll use the same function as I have (setwd), but pointing at your own directory. NOTE: you can also click on "More" in the Files pane and set it manually there:
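For example, if your data folder were called "Ciaran_project" and sat inside your Documents folder (this path is made up - swap in your own):
setwd("~/Documents/Ciaran_project") # hypothetical path - point this at the folder that holds your data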
You are also going to need several packages - the ones below should suffice to run your analysis. Remember, you may need to install them first using install.packages("package_name_here").
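If you have never installed them, something like the line below (run once, then it can be deleted or commented out) should cover everything used here:
install.packages(c("dplyr", "readxl", "ggplot2", "janitor", "lme4", "performance", "emmeans")) # one-off install of the packages loaded below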
library(dplyr) #This does lots of things you'll need!
library(readxl) #This reads in your data
library(ggplot2) #This plots your data
library(janitor) #This is used to clean your header names up so you can use them
library(lme4) #This runs your mixed linear model to analyse your data
library(performance) #This checks your model meets the assumptions and fits well
library(emmeans) #This gives you some key results: differences between levels of your fixed factors (e.g., differences in distance run by round of match or kick-off time)
Reading your data
Now you need to read in your data - make sure it is saved in your folder and under the correct name. If you have set your working directory correctly, this will read in.
data <- read_excel("Ciaran_data.xlsx")
data<- clean_names(data)
head(data)
# A tibble: 6 × 16
team_name player…¹ playe…² round date ko_time oppos…³ total…⁴ zone_…⁵ zone_…⁶
<chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Qatar 1 Saad A… 1 20,1… 7pm Ecuador 0 3342. 1245.
2 Qatar 2 Pedro … 1 20,1… 7pm Ecuador 9075 3906. 3144.
3 Qatar 3 Abdelk… 1 20,1… 7pm Ecuador 8870. 3360. 3826.
4 Qatar 6 Abdula… 1 20,1… 7pm Ecuador 10710. 3427. 4998.
5 Qatar 10 Hassan… 1 20,1… 7pm Ecuador 8570. 2210. 4142.
6 Qatar 11 Akram … 1 20,1… 7pm Ecuador 9244. 3667 3540.
# … with 6 more variables: zone_3_15_20km_h_m <dbl>, zone_4_20_25km_h_m <dbl>,
# zone_5_25_km_h <dbl>, high_speed_runs <dbl>, sprints <dbl>,
# top_speed <dbl>, and abbreviated variable names ¹player_number,
# ²player_name, ³opposition, ⁴total_distance_m, ⁵zone_1_0_7km_h_m,
# ⁶zone_2_7_15km_h_m
You need to tell R that some of your data columns are "factors" so they can be used as grouping variables in your model; the following code will do this:
data$player_number<-as.factor(data$player_number)
data$player_name<-as.factor(data$player_name)
data$round<-as.factor(data$round)
data$ko_time<-as.factor(data$ko_time)
data$team_name<-as.factor(data$team_name)
data$opposition<-as.factor(data$opposition)
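If you prefer, the same conversions can be done in one step with dplyr - this is just an equivalent alternative to the lines above:
data <- data %>%
  mutate(across(c(team_name, player_number, player_name, round, ko_time, opposition), as.factor)) # convert all the grouping columns to factors at once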
head(data)
# A tibble: 6 × 16
team_name player…¹ playe…² round date ko_time oppos…³ total…⁴ zone_…⁵ zone_…⁶
<fct> <fct> <fct> <fct> <chr> <fct> <fct> <dbl> <dbl> <dbl>
1 Qatar 1 Saad A… 1 20,1… 7pm Ecuador 0 3342. 1245.
2 Qatar 2 Pedro … 1 20,1… 7pm Ecuador 9075 3906. 3144.
3 Qatar 3 Abdelk… 1 20,1… 7pm Ecuador 8870. 3360. 3826.
4 Qatar 6 Abdula… 1 20,1… 7pm Ecuador 10710. 3427. 4998.
5 Qatar 10 Hassan… 1 20,1… 7pm Ecuador 8570. 2210. 4142.
6 Qatar 11 Akram … 1 20,1… 7pm Ecuador 9244. 3667 3540.
# … with 6 more variables: zone_3_15_20km_h_m <dbl>, zone_4_20_25km_h_m <dbl>,
# zone_5_25_km_h <dbl>, high_speed_runs <dbl>, sprints <dbl>,
# top_speed <dbl>, and abbreviated variable names ¹player_number,
# ²player_name, ³opposition, ⁴total_distance_m, ⁵zone_1_0_7km_h_m,
# ⁶zone_2_7_15km_h_m
Cleaning your data
You might also want to clean your data and get rid of outliers - at the moment your data set includes substitutes and goalkeepers, so it is messy and our analysis does not meet the assumptions it should. You might be better removing anyone who has not played at least 60 minutes, anyone who did not start, and the goalkeepers manually, then reading the file back in (or see the sketch just below for how you might do the same filtering in R).
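A minimal sketch of that filtering in R is below - note that minutes_played and position are hypothetical column names that may not exist (or may be named differently) in your sheet:
# Hypothetical sketch - assumes your sheet has minutes_played and position columns
data_starters <- data %>%
  filter(minutes_played >= 60, position != "Goalkeeper")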
The code below removes anyone who is outside of 3 SD from the mean for Total Distance.
# Calculate the mean and standard deviation of the total distance variable
mean_TD <- mean(data$total_distance_m)
sd_TD <- sd(data$total_distance_m)
# Define a threshold for outliers as 3 standard deviations from the mean
threshold <- 3 * sd_TD
# Remove rows that are outliers on the total distance variable
data_clean <- data[abs(data$total_distance_m - mean_TD) < threshold, ]
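It is worth checking how many rows the 3 SD rule actually removed before you carry on:
nrow(data) - nrow(data_clean) # number of rows dropped as outliers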
Plotting your data
This plots your data using ggplot. You might want to make it look nicer by adding a colour fill using fill = or by changing the colour palette, e.g., scale_fill_viridis_d(option = "viridis", direction = 1).
box<-ggplot(data_clean, aes(round, total_distance_m, fill = round))+
geom_boxplot()+
geom_jitter(alpha = 0.2)+
labs(title= "Title here if you want", x = "x axis title here", y= "Total distance (m)") +
theme_classic() +
theme(legend.position = "none")
box
ggsave("box_2.png", dpi = 320)
Saving 7 x 5 in image
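If you want to try the viridis palette mentioned above, you can simply add it to the saved plot object as an extra layer (everything else stays the same):
box_viridis <- box + scale_fill_viridis_d(option = "viridis", direction = 1) # recolour the boxes with the viridis palette
box_viridis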
Modeling your data
You will need to run a linear mixed model; this basically fits lots of individual lines, one for each player (your random factor). It is "mixed" because you also have fixed factors - in this case round of match or kick-off time - alongside the random factor, which is your different players. The code below models and checks your data. Hopefully, when you clean your data the model will fit slightly better. That said, even if it doesn't, I think we can afford to run with it for now!
Notice I have run two models here: the first (m1) uses round as the fixed factor and the second (m2) uses kick-off time.
m1<-lmer(total_distance_m ~ 1 + round + (1| player_number), data)
m2<-lmer(total_distance_m ~ 1 + ko_time + (1| player_number), data)
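If you want a quick side-by-side look at how the two models fit, the performance package you have already loaded can do this - treat it as a rough guide only, because the models use different fixed factors:
compare_performance(m1, m2) # AIC, R2 and other fit statistics for both models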
check_model(m1)
check_normality(m1) # shapiro.test; however, visual inspection (e.g., Q-Q plots) is preferable
Warning: Non-normality of residuals detected (p < .001).
check_heteroscedasticity(m1)
Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
Getting your results
First, summarise your results and take a look at the fixed effects table. This shows you that round 1 players covered 6897 m, and then the difference from this value for all the other rounds. You might have a slight issue here in that rounds 5, 6 & 7 had games that went to extra time! You can also see that the correlations for distance covered in the later rounds are much lower than for the earlier ones.
You just need to think about this in terms of which player data you include in your final analysis.
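The fixed effects table and the correlations mentioned above come from summarising the model:
summary(m1) # fixed effects estimates, random effect variance and correlations of fixed effects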
The next line of code gives you your mean differences between rounds, p-values and 95% confidence intervals. It also gives you the mean distance for each round in metres. It's worth noting again how close the three group games look in terms of total distance.
To look at kick-off time instead, run: emm <- emmeans(m2, pairwise ~ ko_time)
emm <- emmeans(m1, pairwise ~ round)
emm$emmeans
round emmean SE df lower.CL upper.CL
1 7236 255 45.3 6722 7750
2 7084 255 45.2 6570 7598
3 7102 254 44.6 6589 7614
4 7341 292 77.5 6761 7922
5 8108 362 176.8 7393 8823
6 7102 470 449.5 6178 8026
7 7760 456 405.2 6863 8657
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
round1 - round2 152.155 210 1967 0.726 0.9910
round1 - round3 134.253 209 1968 0.642 0.9954
round1 - round4 -105.026 253 1968 -0.414 0.9996
round1 - round5 -871.707 332 1968 -2.629 0.1179
round1 - round6 134.113 448 1967 0.300 0.9999
round1 - round7 -523.988 433 1969 -1.211 0.8902
round2 - round3 -17.903 209 1967 -0.086 1.0000
round2 - round4 -257.181 253 1968 -1.015 0.9506
round2 - round5 -1023.862 332 1969 -3.087 0.0335
round2 - round6 -18.042 447 1967 -0.040 1.0000
round2 - round7 -676.144 433 1969 -1.562 0.7066
round3 - round4 -239.278 253 1969 -0.946 0.9649
round3 - round5 -1005.959 331 1969 -3.035 0.0392
round3 - round6 -0.139 447 1967 0.000 1.0000
round3 - round7 -658.241 432 1969 -1.522 0.7315
round4 - round5 -766.681 361 1968 -2.125 0.3379
round4 - round6 239.139 469 1967 0.510 0.9987
round4 - round7 -418.963 455 1969 -0.920 0.9695
round5 - round6 1005.820 516 1967 1.950 0.4471
round5 - round7 347.718 503 1968 0.691 0.9931
round6 - round7 -658.102 585 1967 -1.124 0.9208
Degrees-of-freedom method: kenward-roger
P value adjustment: tukey method for comparing a family of 7 estimates
confint(emm, level = 0.95)
$emmeans
round emmean SE df lower.CL upper.CL
1 7236 255 45.3 6722 7750
2 7084 255 45.2 6570 7598
3 7102 254 44.6 6589 7614
4 7341 292 77.5 6761 7922
5 8108 362 176.8 7393 8823
6 7102 470 449.5 6178 8026
7 7760 456 405.2 6863 8657
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
$contrasts
contrast estimate SE df lower.CL upper.CL
round1 - round2 152.155 210 1967 -467 771.0
round1 - round3 134.253 209 1968 -483 751.8
round1 - round4 -105.026 253 1968 -853 642.9
round1 - round5 -871.707 332 1968 -1850 106.9
round1 - round6 134.113 448 1967 -1187 1454.9
round1 - round7 -523.988 433 1969 -1801 753.3
round2 - round3 -17.903 209 1967 -635 598.9
round2 - round4 -257.181 253 1968 -1005 490.4
round2 - round5 -1023.862 332 1969 -2003 -44.9
round2 - round6 -18.042 447 1967 -1339 1302.6
round2 - round7 -676.144 433 1969 -1954 601.3
round3 - round4 -239.278 253 1969 -986 507.1
round3 - round5 -1005.959 331 1969 -1984 -27.7
round3 - round6 -0.139 447 1967 -1320 1319.7
round3 - round7 -658.241 432 1969 -1934 618.0
round4 - round5 -766.681 361 1968 -1831 297.9
round4 - round6 239.139 469 1967 -1146 1624.2
round4 - round7 -418.963 455 1969 -1763 925.4
round5 - round6 1005.820 516 1967 -516 2528.0
round5 - round7 347.718 503 1968 -1137 1832.0
round6 - round7 -658.102 585 1967 -2386 1069.7
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
Conf-level adjustment: tukey method for comparing a family of 7 estimates
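If you would like a quick visual of the estimated marginal means and their confidence intervals (often easier to read than the table), emmeans has a built-in plot method that returns a ggplot, so you can add labs() and themes just as above:
plot(emm$emmeans) # dot-and-interval plot of the estimated mean total distance per round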