setwd("/Users/matthewwright/Library/CloudStorage/OneDrive-TeessideUniversity/Work/Teaching/Human Movement/2023/R")Ciaran_dis
Linear mixed model for Ciaran
Your aim is to see whether the longer players spend in the World Cup environment, the more their physical outputs (e.g., total distance) increase.
Set up
Make sure your data is saved in a folder and set your working directory. You'll use the same function as I have (setwd), but pointing at your own directory. NOTE: you can also click on "More" in the Files pane and set it manually there:
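For example, if your data folder were called "Ciaran_project" and sat inside your Documents folder (this path is made up - swap in your own):
setwd("~/Documents/Ciaran_project") # hypothetical path - point this at the folder that holds your data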
You are also going to need several packages - the ones below should suffice to run your analysis. Remember, you may need to install them first using install.packages("package_name_here").
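If you have never installed them, something like the line below (run once, then it can be deleted or commented out) should cover everything used here:
install.packages(c("dplyr", "readxl", "ggplot2", "janitor", "lme4", "performance", "emmeans")) # one-off install of the packages loaded below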
library(dplyr) #This does lots of things you'll need!
library(readxl) #This reads in your data
library(ggplot2) #This plots your data
library(janitor) #This is used to clean your header names up so you can use them
library(lme4) #This runs your mixed linear model to analyse your data
library(performance) #This checks your model meets the assumptions and fits well
library(emmeans) #This gives you some key results: differences between levels of your fixed factors (e.g., differences in distance run by round of match or kick-off time)
Reading your data
Now you need to read in your data - make sure it is saved in your folder and under the correct name. If you have set your working directory correctly, this will read in.
data <- read_excel("Ciaran_data.xlsx")
data<- clean_names(data)
head(data)
# A tibble: 6 × 16
team_name player…¹ playe…² round date ko_time oppos…³ total…⁴ zone_…⁵ zone_…⁶
<chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Qatar 1 Saad A… 1 20,1… 7pm Ecuador 0 3342. 1245.
2 Qatar 2 Pedro … 1 20,1… 7pm Ecuador 9075 3906. 3144.
3 Qatar 3 Abdelk… 1 20,1… 7pm Ecuador 8870. 3360. 3826.
4 Qatar 6 Abdula… 1 20,1… 7pm Ecuador 10710. 3427. 4998.
5 Qatar 10 Hassan… 1 20,1… 7pm Ecuador 8570. 2210. 4142.
6 Qatar 11 Akram … 1 20,1… 7pm Ecuador 9244. 3667 3540.
# … with 6 more variables: zone_3_15_20km_h_m <dbl>, zone_4_20_25km_h_m <dbl>,
# zone_5_25_km_h <dbl>, high_speed_runs <dbl>, sprints <dbl>,
# top_speed <dbl>, and abbreviated variable names ¹player_number,
# ²player_name, ³opposition, ⁴total_distance_m, ⁵zone_1_0_7km_h_m,
# ⁶zone_2_7_15km_h_m
You need to tell R that some of your data columns are "factors" so they can be used as grouping variables in your model; the following code will do this:
data$player_number<-as.factor(data$player_number)
data$player_name<-as.factor(data$player_name)
data$round<-as.factor(data$round)
data$ko_time<-as.factor(data$ko_time)
data$team_name<-as.factor(data$team_name)
data$opposition<-as.factor(data$opposition)
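If you prefer, the same conversions can be done in one step with dplyr - this is just an equivalent alternative to the lines above:
data <- data %>%
  mutate(across(c(team_name, player_number, player_name, round, ko_time, opposition), as.factor)) # convert all the grouping columns to factors at once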
head(data)
# A tibble: 6 × 16
team_name player…¹ playe…² round date ko_time oppos…³ total…⁴ zone_…⁵ zone_…⁶
<fct> <fct> <fct> <fct> <chr> <fct> <fct> <dbl> <dbl> <dbl>
1 Qatar 1 Saad A… 1 20,1… 7pm Ecuador 0 3342. 1245.
2 Qatar 2 Pedro … 1 20,1… 7pm Ecuador 9075 3906. 3144.
3 Qatar 3 Abdelk… 1 20,1… 7pm Ecuador 8870. 3360. 3826.
4 Qatar 6 Abdula… 1 20,1… 7pm Ecuador 10710. 3427. 4998.
5 Qatar 10 Hassan… 1 20,1… 7pm Ecuador 8570. 2210. 4142.
6 Qatar 11 Akram … 1 20,1… 7pm Ecuador 9244. 3667 3540.
# … with 6 more variables: zone_3_15_20km_h_m <dbl>, zone_4_20_25km_h_m <dbl>,
# zone_5_25_km_h <dbl>, high_speed_runs <dbl>, sprints <dbl>,
# top_speed <dbl>, and abbreviated variable names ¹player_number,
# ²player_name, ³opposition, ⁴total_distance_m, ⁵zone_1_0_7km_h_m,
# ⁶zone_2_7_15km_h_m
Cleaning your data
You might also want to clean your data and get rid of outliers - at the moment your data set includes substitutes and goalkeepers, so it is messy and our analysis does not meet the assumptions it should. You might be better removing anyone who has not played at least 60 minutes, anyone who did not start, and the goalkeepers manually, then reading the file back in (or see the sketch just below for how you might do the same filtering in R).
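A minimal sketch of that filtering in R is below - note that minutes_played and position are hypothetical column names that may not exist (or may be named differently) in your sheet:
# Hypothetical sketch - assumes your sheet has minutes_played and position columns
data_starters <- data %>%
  filter(minutes_played >= 60, position != "Goalkeeper")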
The code below removes anyone who is outside of 3 SD from the mean for Total Distance.
# Calculate the mean and standard deviation of the total distance variable
mean_TD <- mean(data$total_distance_m)
sd_TD <- sd(data$total_distance_m)
# Define a threshold for outliers as 3 standard deviations from the mean
threshold <- 3 * sd_TD
# Remove rows that are outliers on the total distance variable
data_clean <- data[abs(data$total_distance_m - mean_TD) < threshold, ]
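It is worth checking how many rows the 3 SD rule actually removed before you carry on:
nrow(data) - nrow(data_clean) # number of rows dropped as outliers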
Plotting your data
This plots your data using ggplot. You might want to make it look nicer by adding a colour fill using fill = or by changing the colour palette, e.g., scale_fill_viridis_d(option = "viridis", direction = 1).
box<-ggplot(data_clean, aes(round, total_distance_m, fill = round))+
geom_boxplot()+
geom_jitter(alpha = 0.2)+
labs(title= "Title here if you want", x = "x axis title here", y= "Total distance (m)") +
theme_classic() +
theme(legend.position = "none")
box
ggsave("box_2.png", dpi = 320)
Saving 7 x 5 in image
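If you want to try the viridis palette mentioned above, you can simply add it to the saved plot object as an extra layer (everything else stays the same):
box_viridis <- box + scale_fill_viridis_d(option = "viridis", direction = 1) # recolour the boxes with the viridis palette
box_viridis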
Modeling your data
You will need to run a linear mixed model; this basically fits lots of individual lines, one for each player (your random factor). It is "mixed" because you also have fixed factors - in this case round of match or kick-off time - alongside the random factor, which is your different players. The code below models and checks your data. Hopefully, when you clean your data the model will fit slightly better. That said, even if it doesn't, I think we can afford to run with it for now!
Notice I have run two models here: the first (m1) uses round as the fixed factor and the second (m2) uses kick-off time.
m1<-lmer(total_distance_m ~ 1 + round + (1| player_number), data)
m2<-lmer(total_distance_m ~ 1 + ko_time + (1| player_number), data)
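If you want a quick side-by-side look at how the two models fit, the performance package you have already loaded can do this - treat it as a rough guide only, because the models use different fixed factors:
compare_performance(m1, m2) # AIC, R2 and other fit statistics for both models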
check_model(m1)
check_normality(m1) # shapiro.test; however, visual inspection (e.g., Q-Q plots) is preferable
Warning: Non-normality of residuals detected (p < .001).
check_heteroscedasticity(m1)
Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
Getting your results
First, summarise your results and take a look at the fixed effects table. This shows you that round 1 players covered 6897 m, and then the difference from this value for all the other rounds. You might have a slight issue here in that rounds 5, 6 & 7 had games that went to extra time! You can also see that the correlations for distance covered in the later rounds are much lower than for the earlier ones.
You just need to think about this in terms of which player data you include in your final analysis.
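The fixed effects table and the correlations mentioned above come from summarising the model:
summary(m1) # fixed effects estimates, random effect variance and correlations of fixed effects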
The next line of code gives you your mean differences between rounds, p-values and 95% confidence intervals. It also gives you the mean distance for each round in metres. It's worth noting again how close the three group games look in terms of total distance.
To look at kick-off time instead, run: emm <- emmeans(m2, pairwise ~ ko_time)
emm <- emmeans(m1, pairwise ~ round)
emm$emmeans
round emmean SE df lower.CL upper.CL
1 7236 255 45.3 6722 7750
2 7084 255 45.2 6570 7598
3 7102 254 44.6 6589 7614
4 7341 292 77.5 6761 7922
5 8108 362 176.8 7393 8823
6 7102 470 449.5 6178 8026
7 7760 456 405.2 6863 8657
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
round1 - round2 152.155 210 1967 0.726 0.9910
round1 - round3 134.253 209 1968 0.642 0.9954
round1 - round4 -105.026 253 1968 -0.414 0.9996
round1 - round5 -871.707 332 1968 -2.629 0.1179
round1 - round6 134.113 448 1967 0.300 0.9999
round1 - round7 -523.988 433 1969 -1.211 0.8902
round2 - round3 -17.903 209 1967 -0.086 1.0000
round2 - round4 -257.181 253 1968 -1.015 0.9506
round2 - round5 -1023.862 332 1969 -3.087 0.0335
round2 - round6 -18.042 447 1967 -0.040 1.0000
round2 - round7 -676.144 433 1969 -1.562 0.7066
round3 - round4 -239.278 253 1969 -0.946 0.9649
round3 - round5 -1005.959 331 1969 -3.035 0.0392
round3 - round6 -0.139 447 1967 0.000 1.0000
round3 - round7 -658.241 432 1969 -1.522 0.7315
round4 - round5 -766.681 361 1968 -2.125 0.3379
round4 - round6 239.139 469 1967 0.510 0.9987
round4 - round7 -418.963 455 1969 -0.920 0.9695
round5 - round6 1005.820 516 1967 1.950 0.4471
round5 - round7 347.718 503 1968 0.691 0.9931
round6 - round7 -658.102 585 1967 -1.124 0.9208
Degrees-of-freedom method: kenward-roger
P value adjustment: tukey method for comparing a family of 7 estimates
confint(emm, level = 0.95)
$emmeans
round emmean SE df lower.CL upper.CL
1 7236 255 45.3 6722 7750
2 7084 255 45.2 6570 7598
3 7102 254 44.6 6589 7614
4 7341 292 77.5 6761 7922
5 8108 362 176.8 7393 8823
6 7102 470 449.5 6178 8026
7 7760 456 405.2 6863 8657
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
$contrasts
contrast estimate SE df lower.CL upper.CL
round1 - round2 152.155 210 1967 -467 771.0
round1 - round3 134.253 209 1968 -483 751.8
round1 - round4 -105.026 253 1968 -853 642.9
round1 - round5 -871.707 332 1968 -1850 106.9
round1 - round6 134.113 448 1967 -1187 1454.9
round1 - round7 -523.988 433 1969 -1801 753.3
round2 - round3 -17.903 209 1967 -635 598.9
round2 - round4 -257.181 253 1968 -1005 490.4
round2 - round5 -1023.862 332 1969 -2003 -44.9
round2 - round6 -18.042 447 1967 -1339 1302.6
round2 - round7 -676.144 433 1969 -1954 601.3
round3 - round4 -239.278 253 1969 -986 507.1
round3 - round5 -1005.959 331 1969 -1984 -27.7
round3 - round6 -0.139 447 1967 -1320 1319.7
round3 - round7 -658.241 432 1969 -1934 618.0
round4 - round5 -766.681 361 1968 -1831 297.9
round4 - round6 239.139 469 1967 -1146 1624.2
round4 - round7 -418.963 455 1969 -1763 925.4
round5 - round6 1005.820 516 1967 -516 2528.0
round5 - round7 347.718 503 1968 -1137 1832.0
round6 - round7 -658.102 585 1967 -2386 1069.7
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
Conf-level adjustment: tukey method for comparing a family of 7 estimates
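If you would like a quick visual of the estimated marginal means and their confidence intervals (often easier to read than the table), emmeans has a built-in plot method that returns a ggplot, so you can add labs() and themes just as above:
plot(emm$emmeans) # dot-and-interval plot of the estimated mean total distance per round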