Athlete Reported Outcome Measures
Introduction
We are going to use some Athlete Reported Outcome Measures (AROM) of “readiness” to train to explore joining multiple data sheets that share a common column structure. Here the Athlete Readiness to Train Questionnaire (ART-Q) was completed weekly by a group of girls’ football players across a season. The data set also contains RPE data for the training session performed.
Data were collected using several different tablet applications, each creating a separate .csv file.
Your initial challenge is to read the data in from these files and combine them.
First you’ll need to set up a folder for your files (just the .csv files you want to combine) and set the working directory to that folder. In Files select settings and “*set as working directory*”; you can then copy the resulting command into your script or Markdown / Quarto file, e.g.,
setwd("~/Library/CloudStorage/OneDrive-TeessideUniversity/Work/Teaching/MSc/Advanced Testing, Monitoring and Data Analysis for Strength and Conditioning/2024/Data")
Note: you might not want this to be your working directory; instead you might want to read all your data in from a separate folder and keep your working directory separate.
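One way to follow the note above is to keep the folder path in a variable and pass it to list.files(), leaving the working directory alone. This is only a sketch: the folder and file names are made up, and a temporary folder stands in for your own data folder.

```r
# Hypothetical data folder: a temporary directory stands in for your own path
data_dir <- file.path(tempdir(), "art_q_demo")
dir.create(data_dir, showWarnings = FALSE)

# Write two small example files so the sketch runs on its own
write.csv(data.frame(a = 1:2), file.path(data_dir, "tablet_1.csv"), row.names = FALSE)
write.csv(data.frame(a = 3:4), file.path(data_dir, "tablet_2.csv"), row.names = FALSE)

# List the files by full path; the working directory is never changed
wd_before <- getwd()
demo_files <- list.files(path = data_dir, pattern = "\\.csv$", full.names = TRUE)
basename(demo_files)
```

Because full.names = TRUE returns complete paths, the files can be read from anywhere without calling setwd().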
Reading in multiple data files
Create a list of all your files
The code below creates a list of all .csv files in a particular folder (including its subfolders, because recursive = TRUE); here we have called this “list_of_files”.
list_of_files <- list.files(path = "~/Library/CloudStorage/OneDrive-TeessideUniversity/Work/Teaching/MSc/Advanced Testing, Monitoring and Data Analysis for Strength and Conditioning/2024/Data",
                            recursive = TRUE,
                            pattern = "\\.csv$",
                            full.names = TRUE)
Combining these files using “rbind”
Here we ask R to read all the files in “list_of_files” using the lapply function. As our .csv files have no column headers, we specify header = FALSE.
The code below then uses rbind to bind the files together (adding the rows of each file underneath the previous one). NOTE: we will use cbind later in the module to combine files with the same rows but different columns.
data <- do.call(rbind, lapply(list_of_files, read.csv, as.is = TRUE, header = FALSE))
head(data)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15
1 51 31/10/2022 17:01 NA Resistance NA 7 6 7 7 5 6 7 7 2
2 28 31/10/2022 17:01 NA Resistance NA 7 7 7 7 7 7 7 7 0
3 12 31/10/2022 17:22 NA Resistance NA 4 4 4 4 4 4 4 4 0
4 30 31/10/2022 17:56 NA Resistance NA 5 6 4 5 6 5 5 6 3
5 3 31/10/2022 17:58 NA Resistance NA 4 4 6 6 5 7 3 7 0
6 14 31/10/2022 18:01 NA Resistance NA 6 3 4 3 5 7 1 6 0
V16 V17 V18 V19 V20 V21 V22 V23
1 jess vasey 80 0 0 0 0 0
2 80 0 0 0 0 0
3 80 0 0 0 0 0
4 Maddy Hillyer 80 0 0 0 0 0
5 80 0 0 0 0 0
6 80 0 0 0 0 0
V24 V25 V26 V27
1 Football Female U16s Tactical/Technical
2 Football Female U16s Tactical/Technical
3 Football Female U16s Tactical/Technical
4 Football Female U16s Tactical/Technical
5 Football Female U16s Tactical/Technical
6 Football Female U16s Tactical/Technical
Hey presto, we have multiple files combined; now we just need to tidy the data up and give them some context.
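Once the files are stacked, it is no longer obvious which file each row came from. If that matters, one option is to add the file name as a column while reading. The sketch below assumes made-up file names and contents, and read_with_source is a hypothetical helper, not part of the original code.

```r
# Hypothetical folder with two header-less example files
demo_dir <- file.path(tempdir(), "rbind_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("51,31/10/2022,7", "28,31/10/2022,6"), file.path(demo_dir, "app_a.csv"))
writeLines("12,07/11/2022,4", file.path(demo_dir, "app_b.csv"))

demo_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read one file and record which file its rows came from
read_with_source <- function(path) {
  df <- read.csv(path, header = FALSE, as.is = TRUE)
  df$source_file <- basename(path)
  df
}

# Same do.call(rbind, lapply(...)) pattern as before, with the helper swapped in
demo_data <- do.call(rbind, lapply(demo_files, read_with_source))
table(demo_data$source_file)
```

The extra column makes it easier to trace a formatting problem (for example, an odd date format) back to the application that produced it.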
Data tidying
Packages
For this code you will need the following packages:
library(lubridate) #this helps ensure the date in the .csv is converted into the correct format
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.0 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.1 ✔ tibble 3.1.8
✔ purrr 1.0.1 ✔ tidyr 1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
First of all, we don’t need columns V4, V5 or V6, or anything from column 24 onwards (these are default settings and incorrect). So we are going to use *subset* with its *select* argument to leave us with all the columns that are not 4-6 or 24-27. Again I’ve used head(data) so you can see what you are doing.
data <- subset(data, select = -c(4:6, 24:27))
head(data)
V1 V2 V3 V7 V8 V9 V10 V11 V12 V13 V14 V15
1 51 31/10/2022 17:01 7 6 7 7 5 6 7 7 2
2 28 31/10/2022 17:01 7 7 7 7 7 7 7 7 0
3 12 31/10/2022 17:22 4 4 4 4 4 4 4 4 0
4 30 31/10/2022 17:56 5 6 4 5 6 5 5 6 3
5 3 31/10/2022 17:58 4 4 6 6 5 7 3 7 0
6 14 31/10/2022 18:01 6 3 4 3 5 7 1 6 0
V16 V17 V18 V19 V20 V21 V22 V23
1 jess vasey 80 0 0 0 0 0
2 80 0 0 0 0 0
3 80 0 0 0 0 0
4 Maddy Hillyer 80 0 0 0 0 0
5 80 0 0 0 0 0
6 80 0 0 0 0 0
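Dropping columns by position works, but it can be easier to read if the columns are dropped by name. A small sketch on a made-up data frame (demo and trimmed are illustrative names only), relying on the fact that subset() accepts column names and name ranges in select:

```r
# Made-up data frame with default-style column names V1..V8
demo <- as.data.frame(matrix(1:16, nrow = 2,
                             dimnames = list(NULL, paste0("V", 1:8))))

# Drop V4 to V6 and V8 by name instead of by position
trimmed <- subset(demo, select = -c(V4:V6, V8))
names(trimmed)
```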
So we are left with the data we are interested in, but what do the columns represent? Luckily we (or I) know this. I have written them in the code below in the correct order. You could try changing these and see what happens.
The command here is *colnames(data)*, which applies these names to the dataframe “data”.
colnames(data) <- c("ID", "Date", "Time", "Mood", "Health", "Tiredness", "Sleep", "Soreness", "Food", "School", "Hydration", "Other", "Comments",
                    "Duration", "sRPE", "sRPE_B", "sRPE_L", "sRPE_U", "sRPE_B", "Session Comments")
head(data)
ID Date Time Mood Health Tiredness Sleep Soreness Food School
1 51 31/10/2022 17:01 7 6 7 7 5 6 7
2 28 31/10/2022 17:01 7 7 7 7 7 7 7
3 12 31/10/2022 17:22 4 4 4 4 4 4 4
4 30 31/10/2022 17:56 5 6 4 5 6 5 5
5 3 31/10/2022 17:58 4 4 6 6 5 7 3
6 14 31/10/2022 18:01 6 3 4 3 5 7 1
Hydration Other Comments Duration sRPE sRPE_B sRPE_L sRPE_U
1 7 2 jess vasey 80 0 0 0 0
2 7 0 80 0 0 0 0
3 4 0 80 0 0 0 0
4 6 3 Maddy Hillyer 80 0 0 0 0
5 7 0 80 0 0 0 0
6 6 0 80 0 0 0 0
sRPE_B Session Comments
1 0
2 0
3 0
4 0
5 0
6 0
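The list of names above contains "sRPE_B" twice. Base R allows repeated column names, but some functions expect them to be unique, so it can be worth checking. The sketch below only shows how to spot and mechanically de-duplicate repeats; it makes no assumption about what the second "sRPE_B" column was meant to be called.

```r
# The RPE-related names as written in the colnames() call
demo_names <- c("Duration", "sRPE", "sRPE_B", "sRPE_L", "sRPE_U", "sRPE_B")

# Which names occur more than once?
repeated <- demo_names[duplicated(demo_names)]
repeated

# make.unique() appends a suffix to later repeats so every name is distinct
unique_names <- make.unique(demo_names, sep = "_")
unique_names
```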
So, this looks a bit better. However, we need to tell R that “Date” refers to a date and that “ID” refers to a factor. For this we would normally use *as.Date()* and *as.factor()* (or *as.numeric()* if we wanted a number). Notice how these change:
NOTE: as.Date does not work well here because the files have different date formats, so I had to use *parse_date_time* & *guess_formats* from the lubridate package.
library(lubridate) #this helps ensure the date in the .csv is converted into the correct format
data$Date <- parse_date_time(data$Date, guess_formats(data$Date, c("dmy", "ymd")))
data$ID <- as.factor(data$ID)
head(data)
ID Date Time Mood Health Tiredness Sleep Soreness Food School
1 51 2022-10-31 17:01 7 6 7 7 5 6 7
2 28 2022-10-31 17:01 7 7 7 7 7 7 7
3 12 2022-10-31 17:22 4 4 4 4 4 4 4
4 30 2022-10-31 17:56 5 6 4 5 6 5 5
5 3 2022-10-31 17:58 4 4 6 6 5 7 3
6 14 2022-10-31 18:01 6 3 4 3 5 7 1
Hydration Other Comments Duration sRPE sRPE_B sRPE_L sRPE_U
1 7 2 jess vasey 80 0 0 0 0
2 7 0 80 0 0 0 0
3 4 0 80 0 0 0 0
4 6 3 Maddy Hillyer 80 0 0 0 0
5 7 0 80 0 0 0 0
6 6 0 80 0 0 0 0
sRPE_B Session Comments
1 0
2 0
3 0
4 0
5 0
6 0
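To see why a single as.Date() call struggles with mixed formats, it can help to write the logic out by hand in base R. This sketch is not the method used above: it assumes that only two formats occur (day/month/year with slashes and year-month-day with hyphens), and parse_mixed_dates is a hypothetical helper.

```r
# Try one format, then fill any gaps using the other format
parse_mixed_dates <- function(x) {
  dmy <- as.Date(x, format = "%d/%m/%Y")   # e.g. 31/10/2022
  ymd <- as.Date(x, format = "%Y-%m-%d")   # e.g. 2022-11-07
  out <- dmy
  out[is.na(out)] <- ymd[is.na(out)]
  out
}

# Made-up example values, including one that matches neither format
parsed <- parse_mixed_dates(c("31/10/2022", "2022-11-07", "not a date"))
parsed
```

Values that match neither format come back as NA, which makes them easy to find and inspect.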
We have “readiness to train” data with several AROM and some RPE data.
We might be interested in both or just in one set of data, so we can subset these if we want. For today let’s focus just on the AROM. In this case the first 11 columns are what we need, so we will use subset and select again, but rather than -c() we’ll use c(). Notice how the description here is of a dataframe 6x11 now rather than 6x20:
ART <- subset(data, select = c(1:11))
head(ART)
ID Date Time Mood Health Tiredness Sleep Soreness Food School
1 51 2022-10-31 17:01 7 6 7 7 5 6 7
2 28 2022-10-31 17:01 7 7 7 7 7 7 7
3 12 2022-10-31 17:22 4 4 4 4 4 4 4
4 30 2022-10-31 17:56 5 6 4 5 6 5 5
5 3 2022-10-31 17:58 4 4 6 6 5 7 3
6 14 2022-10-31 18:01 6 3 4 3 5 7 1
Hydration
1 7
2 7
3 4
4 6
5 7
6 6
Do we need “Time” (column 3)?
If not we can delete it the same way:
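The code for this step does not appear here. Presumably it mirrors the earlier subset call with a negative index; a minimal sketch on a made-up data frame (ART_demo and its values are illustrative only):

```r
# Made-up stand-in for ART with "Time" as the third column
ART_demo <- data.frame(
  ID   = factor(c(51, 28)),
  Date = as.Date(c("2022-10-31", "2022-10-31")),
  Time = c("17:01", "17:01"),
  Mood = c(7, 7)
)

# Drop column 3 ("Time"); on the real data this would be
# ART <- subset(ART, select = -c(3))
ART_no_time <- subset(ART_demo, select = -c(3))
head(ART_no_time)
```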
ID Date Mood Health Tiredness Sleep Soreness Food School Hydration
1 51 2022-10-31 7 6 7 7 5 6 7 7
2 28 2022-10-31 7 7 7 7 7 7 7 7
3 12 2022-10-31 4 4 4 4 4 4 4 4
4 30 2022-10-31 5 6 4 5 6 5 5 6
5 3 2022-10-31 4 4 6 6 5 7 3 7
6 14 2022-10-31 6 3 4 3 5 7 1 6
Great, but why have we got “0” in these data? This is because when the players rate their RPE on the app and not the ART items, the ART items are populated with a “0”, so we need to delete every row with a zero. We also need to drop rows containing missing values (NA).
ART <- ART[ART$Mood != 0, ] # deletes all rows where Mood has been rated 0
ART <- ART %>% drop_na() # deletes any rows containing NA
Now we can do “stuff” with the data!
Remember rule 1: what is your question?
On average, how do the players rate their mood, and how is this distributed across the team? We can do this a few ways in R, but here is a way of doing it in base R, so there is no need to install a package.
Below we are creating a data set with player ID and average mood using the mean function:
mood <- aggregate(Mood ~ ID, data = ART, FUN = mean)
head(mood)
ID Mood
1 0 5.000000
2 1 4.250000
3 10 3.933333
4 11 7.000000
5 12 4.181818
6 13 4.000000
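The same aggregate() pattern extends to several items at once, and to counting how many ratings sit behind each mean. A sketch on made-up values (ART_demo, item_means and n_ratings are illustrative names):

```r
# Made-up ratings for two players
ART_demo <- data.frame(
  ID    = factor(c("1", "1", "2", "2", "2")),
  Mood  = c(4, 6, 7, 7, 4),
  Sleep = c(5, 3, 6, 6, 6)
)

# cbind() on the left of the formula averages several items per player
item_means <- aggregate(cbind(Mood, Sleep) ~ ID, data = ART_demo, FUN = mean)

# Swapping mean for length counts the ratings behind each mean
n_ratings <- aggregate(Mood ~ ID, data = ART_demo, FUN = length)
names(n_ratings)[2] <- "n"

merge(item_means, n_ratings, by = "ID")
```

Knowing n is useful when interpreting the means: a mean of 7 from a single rating tells us much less than a mean of 7 from fifteen.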
Really simply we can boxplot this now:
boxplot(mood$Mood, main = "Box plot of the mean Mood score", ylab = "Mean mood score out of 7")
We could also run a basic histogram to see the distribution of these data in a bit more detail:
hist(mood$Mood, main = "Histogram of players mean Mood score", xlab = "Likert score from 1 to 7")
We can convert the counts into densities (so that the bar areas sum to 1) and also draw a density plot, which is sometimes a nice way to view the distribution:
plot(hist(mood$Mood, plot = FALSE), # plot = FALSE stops hist() drawing its own extra plot first
     main = "Histogram of players mean Mood score",
     xlab = "Likert score from 1 to 7",
     col = 'light grey',
     freq = FALSE) # freq = FALSE shows densities rather than counts
plot(density(mood$Mood), # freq is not an argument for density plots, so it is dropped here
     main = "Density plot of players mean Mood score",
     xlab = "Likert score from 1 to 7",
     col = 'blue')
Once you write plot( this starts a new canvas, but we can change the code to combine the plots on the same canvas like so:
plot(hist(mood$Mood, plot = FALSE),
     main = "Histogram of players mean Mood score",
     xlab = "Likert score from 1 to 7",
     col = 'light grey',
     freq = FALSE)
lines(density(mood$Mood), col = 'blue') # lines() adds to the existing plot, so it needs no title or axis label
What does this mean?
Most players rate their mood high on average, and some players rate it very high most of the time. What do we think about this?
Could it be useful to look at the distribution for each player?
boxplot(ART$Mood ~ ART$ID, main = "Box plot of the players mood score",
        xlab = "Mood score out of 7", # these are individual ratings, not means
        ylab = "ID",
        horizontal = TRUE)
The number of different players makes it more difficult to view these data.
So we might need to cut the data down and look at groups of players; age group would be a useful contextual factor, but we don’t have data for it here.
Or perhaps we can use ggplot to graph the data and help us visualize it better?
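As a starting point, a ggplot2 version of the per-player box plot might look like the sketch below. The data are made up (ART_demo is not the real data set), and ordering the players by their median rating is one design choice among many.

```r
library(ggplot2)

# Made-up ratings for three players
ART_demo <- data.frame(
  ID   = factor(rep(c("3", "12", "51"), each = 4)),
  Mood = c(4, 4, 5, 6,  3, 4, 4, 5,  6, 7, 7, 7)
)

# Horizontal box plots, with players ordered by their median Mood
p <- ggplot(ART_demo, aes(x = Mood, y = reorder(ID, Mood, FUN = median))) +
  geom_boxplot() +
  labs(title = "Box plot of the players mood score",
       x = "Mood score out of 7",
       y = "Player ID")
p
```

Ordering by median puts similar players next to each other, which can make a long list of IDs easier to scan than the default ordering.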